key: cord-0466066-8uuou7n2 authors: Cats, Oded; Ferranti, Francesco title: Unravelling the Spatial Properties of Individual Mobility Patterns using Longitudinal Travel Data date: 2021-06-30 journal: nan DOI: nan sha: 4513c08705083b424ef111b85d1a89ecfb41f3da doc_id: 466066 cord_uid: 8uuou7n2

The analysis of longitudinal travel data enables investigating how mobility patterns vary across the population and identifying the spatial properties thereof. The objective of this study is to identify the extent to which users explore different parts of the network as well as to identify distinctive user groups in terms of the spatial extent of their mobility patterns. To this end, we propose two means for representing spatial mobility profiles and clustering travellers accordingly. We represent users' patterns in terms of zonal visiting frequency profiles and grid-cell spatial extent heatmaps. We apply the proposed analysis to a large-scale multi-modal mobility data set from the public transport system in Stockholm, Sweden. We unravel three clusters - locals, commuters and explorers - that best describe the zonal visiting frequency and show that their composition varies considerably across users' place of residence and related demographics. We also identify 18 clusters of visiting spatial extent which form four groups that follow similar shapes of travel extent yet are oriented in different directions. The approach proposed and applied in this study can be applied to any longitudinal individual travel demand data.

Human mobility has been known to exhibit some common features that extend beyond time and space. There is evidence to suggest that individuals' daily travel time budget has been limited to approximately 70 minutes a day throughout the history of human civilisation and across cultures and geographies, whereas travel distances have increased as a result of technological advancements (Ahmed & Stopher, 2014).
Notwithstanding, the geographical area and the diversity of destinations visited, as well as the frequency of these visits, vary considerably across the population. Schneider et al. (2013) identify recurrent and distinctive motifs that represent individual daily mobility using travel survey and mobile phone data. Their analysis neglects the geographical aspects of individual mobility by limiting the representation of destinations to nodes in motif graphs. Hasan et al. (2013) develop a model for estimating macroscopic travel demand properties such as visiting frequency and origin-destination flows using smart card data. Schläpfer et al. (2021) estimate the frequency of visits to different parts of the city as a function of travel distances and demonstrate that related variables follow universal laws. Tu et al. (2018) analyse spatial variations in passenger ridership in public transport, and Zhu et al. (2020) compare the spatio-temporal heterogeneity in the usage of different shared mobility systems. There is thus a growing understanding of the macroscopic properties of human mobility based on disaggregate mobility traces. Notwithstanding, the same macroscopic properties may emerge from different compositions of mobility patterns at the individual level.

The objective of this study is to unravel the commonalities and differences in the spatial properties of individual mobility patterns. In particular, our aim is to identify the extent to which users explore different parts of the network as well as to identify distinctive user groups in terms of the spatial extent of their mobility patterns. We propose two representations of user profiles that are then subject to clustering. Study outputs can be used to devise targeted mobility products, information and subscriptions. We apply the proposed analysis to a large-scale mobility data set from the public transport system in Stockholm, Sweden.

Multi-modal public transport networks are playing a critical role in urban mobility.
Public transport systems are increasingly equipped with automated fare collection (AFC) systems which passively collect data concerning individual transactions. In the last decade, smart card data has been increasingly used to examine daily and weekly temporal variations (Ma et al., 2013; Goulet-Langlois et al., 2016; Ghaemi et al., 2017; Deschaintres et al., 2019; He et al., 2020; Egu & Bonnel, 2020). In contrast, little is known about market segments with regard to their geographical characteristics. The analysis of longitudinal travel data enables the investigation of how mobility patterns vary across the passenger population and the identification of the spatial properties thereof. The approach proposed and applied in this study can be applied to any longitudinal individual travel demand data.

In the following section we detail the input mobility data and the underlying network partitioning used in this study, followed by two approaches for user segmentation pertaining to zonal visiting profiles and the spatial extent of locations visited. In Section 3 we provide a brief description of our application for Stockholm County's public transport system. Next, we present and discuss the results of the proposed user segmentation clustering analysis for our application. We conclude with a discussion of study implications and suggestions for further research.

In this section we describe the sequence of steps performed in order to unravel the spatial properties of individual mobility patterns. Two key ingredients are mobility data and a spatial partitioning of the geographical area under consideration. In the first two subsections we describe the features of the disaggregate longitudinal travel data required to enable the proposed analysis and the data-driven approach which we have adopted for generating travel demand zones. Next, we detail the two segmentation methods proposed in this study.
First, a method based on a visiting profile which is agnostic to the specific locations visited is presented, followed by a method based on the spatial extent of the areas visited by users.

Our analysis requires mobility records with longitudinal data on time and location stamps for the origin and destination of each journey for each traveller i ∈ N, where N is the set of all travellers considered in the analysis performed. Such data may be available, for example, from travel surveys, travel app tracking data, synthetic populations used in activity-based, agent-based models, car plate recognition records, and ride-hailing or shared-fleet usage records. In the context of public transport journeys, observed origin and destination locations correspond to stops, where the set of stops is denoted by S. A travel diary consisting of all journeys per traveller throughout the analysis period can be inferred from smart card data records by applying alighting location (Trépanier et al., 2007; Munizaga & Palma, 2012) and transfer inference algorithms (Gordon et al., 2013; Yap et al., 2017), depending on the fare validation scheme. Note that the analysis requires that the data contain a card holder identifier which is consistent throughout the analysis period.

For certain investigations it might be relevant to examine the socio-demographic characteristics of the obtained user classes. In some cases certain socio-demographic variables might be directly available from card holder information or based on the subscription/concession program (e.g. pensioners, students). However, some variables of interest may not be directly available at the individual passenger level. We therefore adopt the approach proposed by Amaya et al. (2018) and Sari Aslam et al. (2019) and adapted by Kholodov et al. (2021) to identify the most likely home zone per traveller based on the frequency of each zone serving as an origin for each card holder.
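As a minimal sketch of the home-zone inference step, assuming that journeys are available as (card identifier, origin stop, destination stop) tuples and that a stop-to-zone lookup exists, the most frequent origin zone per card holder could be computed as follows. The function name `infer_home_zones`, the `min_journeys` threshold and the tie-breaking rule are illustrative assumptions and are not taken from the cited methods.

```python
from collections import Counter, defaultdict


def infer_home_zones(journeys, stop_to_zone, min_journeys=1):
    """Return {card_id: zone} using the most frequent origin zone per card.

    journeys: iterable of (card_id, origin_stop, destination_stop) tuples.
    stop_to_zone: dict mapping each stop to its zone (hypothetical lookup).
    min_journeys: assumed threshold below which no home zone is inferred.
    """
    origin_counts = defaultdict(Counter)
    for card_id, origin_stop, _destination_stop in journeys:
        zone = stop_to_zone.get(origin_stop)
        if zone is not None:
            origin_counts[card_id][zone] += 1

    home_zones = {}
    for card_id, counts in origin_counts.items():
        if sum(counts.values()) < min_journeys:
            continue  # too few journeys to infer a home zone
        # Highest count wins; ties are broken by zone identifier (assumption).
        zone, _count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        home_zones[card_id] = zone
    return home_zones
```

Cards with fewer origin journeys than `min_journeys` are simply left out of the result, which mirrors the idea that a home zone cannot always be reasonably inferred.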
The zones used in this procedure can correspond to any zonal aggregation for which detailed socio-demographic (e.g. census) data is available for the relevant study area.

Our analysis of users' mobility patterns involves the identification of their visiting patterns of different parts of the geographical area under analysis. It is therefore essential to determine which set of zones will be used for describing passengers' visiting patterns. Zones can be based, for example, on those readily available from the official central bureau of statistics, or be obtained by overlaying a grid and defining equal-size grid cells. We hereby adopt a data-driven technique for generating travel demand zones that has been proposed and applied by Luo et al. (2017). The aim of the adopted travel demand cluster generation technique is to cluster groups of stops based on information concerning both passenger flows and spatial distances. Hence, in addition to the passenger origin-destination matrix, the geographical coordinates of each stop are provided as input. The clustering of stops thus results in geographically compact zones (i.e. sets of stops) which exhibit similar travel demand patterns in terms of the distribution of travel destinations for journeys originating from stops included in the same zone. In the following we omit the time indexes because the analysis can be performed for any analysis period of choice without loss of generality.

The method used to generate the demand zones follows the four-step K-means-based station aggregation method proposed in Luo et al. (2017). The four steps consist of:
1. Use K-means to obtain clusters of stops. The clusters $\{C_k\}_{k=1,\dots,K}$ are a partition of $S$ with centers $\{\mu_k\}_{k=1,\dots,K}$. Consider the geodesic distance $d(\cdot,\cdot)$ between each stop $s \in S$ and the cluster centers when performing the algorithm. Store the clustering results for a range of possible values of $k$.
2. Compute the distance-based metric.
For each $k$, calculate the intra-cluster squared distance $D_{intra}$ and its inter-cluster counterpart $D_{inter}$. With the aim of minimizing $D_{intra}$ and maximizing $D_{inter}$, we consider the problem of minimizing their ratio $\tau = D_{intra} / D_{inter}$.
3. Compute the flow-based metric. For each $k = 1, \dots, K$, calculate the intra-cluster flow $F_{intra}$ and its inter-cluster counterpart $F_{inter}$, where $f(s_1, s_2)$ is the observed passenger flow from stop $s_1$ to stop $s_2$ during the time period under consideration. With the aim of maximizing intra-zonal flows and minimizing inter-zonal flows, we maximize the quantity $\delta = F_{intra} / F_{inter}$.
4. Determine the number of clusters. Normalize the metrics to a $[0, 1]$ range ($\tau \to \tau'$, $\delta \to \delta'$) and look for high values of the integrated metric $m(k)$, which combines $\delta'$ and $\tau'$. The optimal number of clusters within our range is then $k^* = \arg\max_k m(k)$.

The resulting partitioning is such that all stops s ∈ S are grouped into $k^*$ zones denoted as z ∈ Z, where the set of zones Z is a collectively exhaustive and mutually exclusive clustering of all the stops in S. Once the number of clusters is selected, we name and refer to each cluster of stops by the stop with the highest passenger volume included in each set.

Our goal is to identify groups of users that exhibit a similar behavior in terms of the frequency and diversity of zones visited. Note that in this user segmentation we are not interested in which specific zones are visited by individual users but rather in the general properties of the zonal visiting profile. For example, if two individuals visit one zone 50% of the time, i.e. have journeys for which this zone serves as the destination, a second most-frequently visited zone 30% of the time and a third most-frequently visited zone 20% of the time, then they are considered to have identical travel patterns for the sake of this analysis, regardless of the identity and locations of these zones. Hence, the key information needed for this analysis is the number of journeys destined to each zone, calculated for each individual user.
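The selection of the number of zones in the four-step procedure described above can be illustrated with a short sketch. Several details are assumptions made for illustration only: planar Euclidean distance stands in for the geodesic distance, $D_{inter}$ is taken as the sum of squared distances between distinct pairs of cluster centers, $F_{inter}$ is taken as the flow between stops in different clusters, and the integrated metric is assumed to be the difference $\delta' - \tau'$, since its exact form is not recoverable from the text. The cluster labels are assumed to come from any K-means implementation.

```python
import math


def distance_ratio(coords, labels):
    """tau = D_intra / D_inter for one clustering (needs >= 2 clusters)."""
    clusters = {}
    for stop, label in labels.items():
        clusters.setdefault(label, []).append(coords[stop])
    centres = {
        label: (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
        for label, pts in clusters.items()
    }
    # Squared distance of every stop to its own cluster centre.
    d_intra = sum(math.dist(coords[s], centres[lab]) ** 2 for s, lab in labels.items())
    # Assumed definition: squared distances between distinct pairs of centres.
    labs = sorted(centres)
    d_inter = sum(
        math.dist(centres[a], centres[b]) ** 2
        for i, a in enumerate(labs)
        for b in labs[i + 1:]
    )
    return d_intra / d_inter


def flow_ratio(flows, labels):
    """delta = F_intra / F_inter, with flows given as {(s1, s2): passengers}."""
    f_intra = sum(f for (s1, s2), f in flows.items() if labels[s1] == labels[s2])
    f_inter = sum(f for (s1, s2), f in flows.items() if labels[s1] != labels[s2])
    return f_intra / f_inter


def min_max(values):
    """Rescale a list of values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_k(tau_by_k, delta_by_k):
    """Return (k*, m) where m(k) = delta' - tau' (assumed combination)."""
    ks = sorted(tau_by_k)
    tau_n = min_max([tau_by_k[k] for k in ks])
    delta_n = min_max([delta_by_k[k] for k in ks])
    m = {k: d - t for k, t, d in zip(ks, tau_n, delta_n)}
    return max(m, key=m.get), m
```

Under these assumptions, a clustering is favoured when its stops lie close to their own zone center and most passenger flow stays within zones.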
The zones used here are those generated in the data-driven travel demand zone generation described in the previous section. For this analysis we consider the count of journeys ending in each zone for each user: $W_{i,z} = \sum_{s_j \in z} \sum_{s_i \in S} f_i(s_i, s_j)$, where $f_i(s_i, s_j)$ is the total flow, i.e. number of journeys, performed by user i ∈ N with stop $s_i$ as an origin and stop $s_j$ as a destination. Each user i ∈ N is characterized by a vector $W_i$ with each entry denoting the number of visits to one of the collectively exhaustive and mutually exclusive zones. These vectors are calculated and normalized for each user, so as to consider the share of the journeys destined to each zone rather than the absolute numbers. We sort $W_i$ by descending values, i.e. from the most visited zone to the least visited. Consequently, we obtain an ordered and standardized vector per user which contains information about the visiting profile, where the first entry represents the percentage of journeys destined to the most visited zone, the second entry represents the percentage of journeys destined to the second most visited zone and so forth. This procedure results in zonal visiting frequency user profiles.

Next, we cluster these profiles using the K-Means algorithm. The resulting centers from the K-Means algorithm are then interpreted based on the number of zones visited as well as the share of journeys attracted to each of the explored zones. According to the distribution of journey shares one can progressively define the clusters from the most to the least local. The analysis can be further enriched by analyzing how the visiting profile clusters manifest themselves spatially, i.e. how the share of users belonging to each visiting class varies geographically across home-zones as well as in relation to key socio-economic variables.
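The construction of a zonal visiting frequency profile can be sketched as follows, assuming that each user's journeys are given as a list of destination stops and that a stop-to-zone lookup is available. The function name `visiting_profile` and its arguments are illustrative assumptions. The resulting vectors could then be passed to any K-means implementation.

```python
from collections import Counter


def visiting_profile(destinations, stop_to_zone, n_zones):
    """Return the sorted, normalised zonal visiting profile of one user.

    destinations: list of destination stops, one per journey.
    stop_to_zone: dict mapping each stop to its zone (hypothetical lookup).
    n_zones: total number of zones, used to pad the profile with zeros.
    """
    counts = Counter(stop_to_zone[stop] for stop in destinations)
    total = sum(counts.values())
    # Shares of journeys per zone, ordered from most to least visited;
    # the identity of the zones is deliberately discarded.
    shares = sorted((c / total for c in counts.values()), reverse=True)
    return shares + [0.0] * (n_zones - len(shares))


# Two users visiting different zones with the same 50/30/20 split
# obtain identical profiles.
stop_to_zone = {"s1": "A", "s2": "B", "s3": "C", "s4": "D"}
user_1 = ["s1"] * 5 + ["s2"] * 3 + ["s3"] * 2
user_2 = ["s4"] * 5 + ["s3"] * 3 + ["s1"] * 2
profile_1 = visiting_profile(user_1, stop_to_zone, n_zones=4)
profile_2 = visiting_profile(user_2, stop_to_zone, n_zones=4)
```

Padding with zeros gives every user a vector of the same length, which is what a distance-based clustering of the profiles requires.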
While the analysis described in the previous section allows identifying clusters of users in terms of visiting patterns, it does not contain information on the spatial extent of users' mobility patterns and the shape thereof. In this subsequent step, we are interested in characterising users in terms of the spatial extent of the locations they visit during the course of the analysis period based on individual travel diary data. Unlike the visiting profile, the identity of the areas visited is of importance in this analysis. For each user's journey, we consider the origin and destination stops as two pinned locations. For this analysis, we choose to aggregate stops into grid cell zones by overlaying a grid over the study area, but this can be substituted by any other aggregation of choice. Using grid cells avoids the problem of large variations in geographical size among zones. Each stop s is linked to a grid cell/zone z. We then count for each zone how many journeys originated from this zone or were destined to this zone. The user visit count per zone is thus $V_{i,s_x} = \sum_{y \in S_x} f_i(s_x, s_y) + \sum_{y \in S_x} f_i(s_y, s_x)$. Each user visiting pattern is then represented using an array with each entry corresponding to the probability that a user visits this zone in relation to the overall visiting volume of the respective zone. To obtain this probability representation, the data are normalized: $\tilde{V}_{i,s_x} = V_{i,s_x} / \sum_{j=1}^{N} V_{j,s_x}$. This procedure results in spatial visiting extent user profiles.

We then cluster users according to their normalised visit probability array. The clustering approach adopted is based on a Gaussian Mixture Model, implemented using the Expectation Maximization algorithm. The choice of this method is mainly motivated by its capability of performing a soft classification, i.e. it considers the probability that a data point (user) belongs to each cluster and assigns it to the one to which it is most likely to belong (i.e.
has the least distance from the centroid of the respective cluster).

Stockholm County comprises 26 municipalities extending over 6,519.3 km² built over an archipelago and is home to 2.37 million inhabitants. Stockholm Region is the public transport authority overseeing all public transport services in the county. The multi-modal public transport system consists of bus, tram, metro, commuter train and ferry services. The network consists of more than 5,700 stops served by a rail network of 469 km and more than 9,000 km of bus service network. More than 1.2 million trips are performed on an average weekday by about 600,000 travellers.

The fare scheme in Stockholm County involves tapping in only. As part of a previous study, the virtual tap-out location was inferred for each trip and a transfer inference algorithm was applied. The details of these inference algorithms and a discussion of their validity and limitations are available in Kholodov et al. (2021). Consequently, given a disaggregate tap-in transaction database, a travel diary can be constructed for each card holder where each entry is a journey performed during the course of the analysis period, containing the origin stop and (inferred) destination stop along with their respective time stamps. The output of this process constitutes the main input for our study.

Our analysis is based on data from January 1, 2019 to December 31, 2019, with the exception of the month of July during which the data warehouse was under maintenance works. In total, 468,596,472 journeys were performed by 7,191,376 card holders during the 11-month analysis period. While no personal information is available at the card-holder level for this study, we use zonal-level socio-demographic and socio-economic data made available by the Swedish central bureau of statistics per census zone (there are 1,251 such zones in the case study area). As described in Section 2.1, we do so based on the inferred home zone per card holder.
Given our interest in analyzing longitudinal data to study user behavioural patterns, we remove card holders that use single tickets. This resulted in removing 38% of the cards, associated with 20% of all journeys. Consequently, we are left with a total of 4,423,783 cards, which account for 371,285,809 journey records. An additional filter is applied solely for the analysis described in Section 2.3. Here users for which a home zone cannot be reasonably inferred are excluded. The total remaining users and journey records for this specific analysis are 3,782,954 and 368,710,217, respectively. Note that while this reflects a loss of 14% of the users, only 0.68% of the journeys are discarded, since frequent passengers are not affected by this filtering.

The clustering procedure detailed in Section 2.2 has been applied to our case study. The values of the integrated metric, m(k), for different numbers of clusters are shown in Figure 1. Even though 18 clusters yield the optimal metric value among all values tested (up to 150), we have opted for k = 29 since it allows for a more nuanced analysis of geographical variations for our case study area and is the second-best local optimum. In Figure 2 we display geographically the stop clustering results along with the respective number of card holders residing in each of the obtained zones based on the results of the home zone inference procedure described in Section 2.1.

We now turn to the clustering of users into distinctive exploration profiles. For each of the 3,782,954 cards included in our analysis we construct a vector consisting of 29 entries, the number of zones into which stops have been clustered as described in the previous sub-section. Each of the entries indicates the number of journeys for which the respective passenger has visited a given zone, i.e. this zone has served as a travel destination, during the course of the analysis period.
For illustration purposes, we show in Figure 3 an example of a zonal visiting frequency user profile for a selected traveller. For each of the possible numbers of clusters considered, we have run the algorithm 30 times using different starting conditions in order to find the best performing partitioning. The average silhouette index values obtained for up to 8 clusters are shown in Figure 4. We opt for three clusters in this case since, following the elbow rule-of-thumb, this constitutes a turning point with a local optimum. In addition, we found the results of three clusters to capture the most interesting patterns from an interpretative point of view. The resulting cluster centers are presented in Figure 5. As described in Section 2.3, the profiles are characterised in terms of the distribution of trips made over zones. We identify three user profiles with regard to their zonal visiting frequency profile:
• Local: users who mostly (more than 90% of the trips) travel to a single zone in the system. These users are mostly travelling within their home-zone area for a variety of trip purposes.
• Commuter: users who visit primarily two zones and for whom the two most frequented zones are visited almost equally frequently (50% and 40% of the trips). These users are arguably likely to be commuters travelling back and forth between the home-zone and work-zone.
• Explorer: users for whom there is a predominant destination zone (the destination of about 70% of their trips), but who still visit several other regions of the city to various extents.

A plurality of the card holders belongs to the 'Commuter' cluster with 39.1% of all card holders, followed by 'Local' with 32.3%, and only 28.6% of the card holders are categorised as 'Explorer'. We then turn to analysing the composition of user segments per home-zone. The share of each segment amongst card holders residing in each of the zones is shown in Figure 6.
The reader may refer to Figure 2 in order to geographically identify the zone using the displayed zone number and name. As can be expected, areas with high job intensity such as the central parts of Stockholm city (Z-4 T-Centralen, service and retail), Södertälje (Z-3, manufacturing) and Nynäshamn (Z-18, harbor) have a higher share of 'Local' users. The 'Commuter' type is predominant for those areas that are close to an urban center, e.g. Z-7, Vällingby and Z-29, Jordbro. The share of users belonging to the 'Explorer' type is overall evenly distributed across zones, with a peak for users residing in the archipelago areas (i.e. Z-11, Dragedet).

Figure 6: Share of user visiting pattern segment per home-zone area

Next, we examine key socio-demographic attributes of users assigned to each of the three visiting profile clusters. Note that this information is obtained from the underlying census zones. Summary statistics are provided in Table 1. As can be seen, users following a 'Commuter' visiting pattern reside in areas characterised by a lower social index, which reflects an array of social variables, a lower than average income level and a higher than average share of residents with a foreign background. The opposite holds for 'Local' users. 'Explorer' users reside in areas which are on par with the overall mean values across the case study area.

We follow the method described in Section 2.4 in order to analyze the spatial extent of the area visited by each user. For each of the 4,423,783 cards included in this analysis we generate a spatial visiting extent user profile with the share of journeys destined to each spatial unit considered. We illustrate this using a heat map representing the share of visits per grid cell, as shown in Figure 7 for an example user. For the analysis of spatial extent, we choose to overlay a grid over the case study area.
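To make the grid-based representation of Section 2.4 concrete, the following minimal sketch assigns stops to square grid cells, counts each user's visits per cell (origins and destinations alike) and normalises each count by the overall visiting volume of the cell. It assumes that stop coordinates are given in kilometres in a planar projection. The function names and the default 2 km cell size are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def cell_of(x_km, y_km, cell_km=2.0):
    """Return the (column, row) index of the grid cell containing a point."""
    return (math.floor(x_km / cell_km), math.floor(y_km / cell_km))


def visit_counts(journeys, stop_xy, cell_km=2.0):
    """Count, per user and grid cell, journeys starting or ending in the cell.

    journeys: iterable of (user, origin_stop, destination_stop) tuples.
    stop_xy: dict mapping each stop to planar (x_km, y_km) coordinates.
    """
    counts = defaultdict(Counter)
    for user, origin, destination in journeys:
        for stop in (origin, destination):
            counts[user][cell_of(*stop_xy[stop], cell_km)] += 1
    return counts


def normalise_by_cell(counts):
    """Divide each user's count in a cell by the cell's total over all users."""
    totals = Counter()
    for user_counts in counts.values():
        totals.update(user_counts)
    return {
        user: {cell: v / totals[cell] for cell, v in user_counts.items()}
        for user, user_counts in counts.items()
    }
```

The normalised dictionaries can be flattened into fixed-length arrays over the active cells before being passed to a mixture model.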
On one hand, a more finely meshed grid implies a larger number of cells, posing the risk of over-fitting and scarcity of stops per cell in peripheral parts of the network. On the other hand, a coarser grid would instead bundle many stops within the same cell and may compromise geographical nuances. We have experimented with various grid sizes ranging from 0.5 km × 0.5 km to 2.5 km × 2.5 km with 0.5 km increments. In addition, the type of covariance matrix to adopt for the Gaussian Mixture Model has to be specified. To evaluate the different configurations we calculate for each of them the BIC and the AIC on 10 randomly sampled 10% subsets of the data, based on which we have selected grid cells of 2 km × 2 km. The configuration using a mesh with 2 × 2 km cells and a diagonal type of covariance yielded a total of 15 clusters. With this size, the grid covering the entire case study area consists of 54 × 86 cells (see Figure 8). The number of active cells in the grid, i.e. the number of cells where at least one stop of the network is situated, is 1,154, whereas the total number of cells is 4,644 (largely due to the large bodies of water in our case study area).

The means of the Gaussians determining the centers of the 18 clusters are shown in Figure 9, along with the share of users assigned to each cluster. The city center of Stockholm is clearly dominant in all clusters, in line with the highly monocentric structure of the Stockholm metropolitan area despite the recent emergence of secondary activity centers (Cats et al., 2015). In the following we discuss key observations from the clusters obtained. The largest cluster, gathering almost 29% of the users (more than one out of four cards), is cluster 5. This cluster represents users whose travel patterns are almost entirely limited in spatial extent to central areas of Stockholm city: 66% of their journeys are confined to 7 neighbouring zones, i.e. 28 km².
We observe that the remaining clusters can be loosely described as falling into one of three types. Several clusters exhibit similar shapes of travel extent yet are oriented in different directions. For example, cluster 2 corresponds to users that travel between central parts of Stockholm and the city of Södertälje (south-west end of Stockholm County) as well as visit areas situated along the corridors connecting these cities. A similar pattern emerges for cluster 8, but instead of Södertälje the spatial extent profile is oriented towards Gustavsberg-Värmdö. Other clusters following this pattern include cluster 7 (Stockholm-Norrtälje direction), cluster 9 (Stockholm-Jakobsberg-Bålsta), and cluster 13 (Stockholm-Märsta).

Another group of clusters includes those exhibiting sparser visiting patterns, forming a cloud shape with a variety of locations visited more uniformly, yet mostly confined to a limited part of the case study area. The following clusters fall in this category: cluster 1 (North-West area of Stockholm city - Solna), cluster 3 (South-East area of Stockholm city - Tyresö, Handen), cluster 4 (North of Stockholm - Danderyd, Täby), cluster 10 (South-West area of Stockholm - Skärholmen, Huddinge, Farsta), and cluster 15 (West - Ekerö, Bromma).

The last group of clusters includes those that exhibit a more dispersed pattern, characterized by scattered visits to non-neighboring parts of the case study area: cluster 6, cluster 11, cluster 12 and cluster 14.

We proposed two methods for clustering travellers based on the spatial properties of their mobility patterns using longitudinal travel data. These methods were applied to smart card data from the multi-modal public transport system of Stockholm County. After partitioning the network based on a data-driven approach, we represent users' patterns in terms of zonal visiting frequency profiles and grid-cell spatial extent heatmaps.
We identify three clusters, denominated locals, commuters and explorers, that best describe the zonal visiting frequency and show that their composition varies considerably across users' place of residence and related demographics. We also unravel 18 clusters of visiting spatial extent which form four groups that follow the same overall trend in terms of intensity and concentration of the zones visited, yet are prevalent in different locations across the network.

The user segmentation approach proposed in this study can be used to devise products and fare schemes that cater for specific mobility patterns in terms of frequency and extent of locations visited across the network. Moreover, the analysis can be used to identify gaps in the network that may benefit from increased accessibility. Future research may investigate how the identified segments have evolved over a long period of time, for example in relation to policies such as stimulating a more polycentric regional development. This can be especially insightful in understanding changes in mobility patterns and user segmentation as a consequence of major network developments or significant disruptions such as the COVID-19 pandemic.

Seventy minutes plus or minus 10 - a review of travel time budget studies
Estimating the residence zone of frequent public transport users to make travel pattern and time use analysis
Identification and classification of public transport activity centres in Stockholm using passenger flows data
Analyzing transit user behavior with 51 weeks of smart card data
Investigating day-to-day variability of transit usage on a multimonth scale with smart card data. A case study in Lyon
A visual segmentation method for temporal smart card data
Automated inference of linked transit journeys in London using fare-transaction and vehicle location data
Inferring patterns in the multi-week activity sequences of public transport users
Spatiotemporal patterns of urban human mobility
A classification of public transit users with smart card data based on time series distance metrics and a hierarchical clustering method
Public transport fare elasticities from smartcard data: Evidence from a natural experiment
Constructing transit origin-destination matrices with spatial clustering
Mining smart card data for transit riders' travel patterns
Estimation of a disaggregate multimodal public transport origin-destination matrix from passive smartcard data from Santiago, Chile
A high-precision heuristic model to detect home and work locations from smart card data
The universal visitation law of human mobility
Unravelling daily human mobility motifs
Individual trip destination estimation in a transit smart card automated fare collection system
Spatial variations in urban public ridership derived from GPS trajectories and smart card data
A robust transfer inference algorithm for public transport journeys during disruptions
Understanding spatiotemporal heterogeneity of bike-sharing and scooter-sharing mobility

This study is funded by Region Stockholm, project "Unravelling travel demand patterns using Access card data" RS 2019-0499. We also thank Region Stockholm for providing the smart card data that made this study possible. The authors also thank Isak Rubensson, Matej Cebecauer and Erik Jenelius for their support in the process.