Trajectory clustering is an important operation of knowledge discovery from mobility data. Especially nowadays, the need for performing advanced analytic operations over massively produced data, such as mobility traces, in efficient and scalable ways is imperative. However, discovering clusters of complete trajectories can overlook significant patterns that exist only for a small portion of their lifespan. DSC, addresses the problem of Distributed Subtrajectory Clustering in an efficient and highly scalable way.
The problem is challenging because the subtrajectories to be clustered are not known in advance, but they need to be discovered dynamically based on adjacent subtrajectories in space and time. Efforts that try to deal with this problem in a centralized way do exist. Unfortunately, applying centralized algorithms for subtrajectory clustering over massive data in a scalable way is far from straightforward. This calls for parallel and distributed algorithms that address the scalability requirements.
- Subtrajectory Join: The first step is to retrieve for each trajectory all the moving objects, with their respective portion of movement, that moved close enough in space and time for at least some time duration. Actually, this first step is a well-defined problem in the literature of mobility data management, known as subtrajectory join
- Segmentation: The second step takes as input the result of the first step, which is actually a trajectory and its neighboring trajectories and aims at segmenting each into a set of subtrajectories. A trajectory will be segmented every time its neighbourhood changes significantly.
- Clustering: The goal of the third step is to create clusters (whose cardinality is unknown) of similar subtrajectories and at the same time identify subtrajectories that are significantly dissimilar from the others (outliers).
The main feature of this approach is that it can process immense volumes of trajectory data and identify clusters of subtrajectories in an efficient and scalable way.
DSC is essential to mobility data analysts that try to extract valuable knowledge out of mobility data. Interesting application scenarios of subtrajectory clustering include:
- Network discovery, where given a set of trajectories (e.g. from urban traffic or from the maritime or the aviation domain) we want to identify the underlying network of movement by grouping subtrajectories that move ``close'' to each other and use cluster representatives/medoids as network edges.
- Predictive analytics over mobility data, where the goal is the extraction of valuable knowledge from data and its utilization in order predict future behavioural patterns (i.e. movement). The idea is first to identify popular mobility patterns, either global (for the whole dataset) or local (for each moving object separately), by employing some subtrajectory clustering technique that also provides the cluster representatives. Then, when some new position of a moving object is reported, the goal is to try to ``match'' the new portion of movement with the most similar historical patterns and employ this pattern in order to predict its future location.
Interactive mobility data exploration and analysis, which can aid/facilitate mobility analysts, in e.g. urban planning and traffic analysis applications.
The proliferation of GPS-enabled devices, which leads to the creation of enormous amounts of mobility-related data, has created new opportunities in several markets, such as transportations, fleet management and insurance, in terms of utilization of this data flow for better decision making
The ability to process larger volumes of data in a scalable way which will lead to better decisions in the respective domain of application
There are several challenges that needed to be addressed during the design and implementation of such a Big Data solution, such as:
- How to partition the data in such a way so that each node can perform its computation independently, thus minimizing the communication cost between nodes, which is a cost that can turn out to be a serious bottleneck.
- How to achieve load balancing, in order to balance the load fairly between the different nodes.
- How to minimize the iterations of data processing, which are typically required in clustering algorithms.
All of the above were tackled in the context of the popular MapReduce programming paradigm and its open source implementation, Hadoop.