key: cord-0481822-kkmwww6b authors: Shalit, Nadav; Fire, Michael; Ben-Elia, Eran title: A Supervised Machine Learning Model For Imputing Missing Boarding Stops In Smart Card Data date: 2020-03-10 journal: nan DOI: nan sha: 8cf906ffd6ec61565df8ffc2981fcabaaca52bd9 doc_id: 481822 cord_uid: kkmwww6b Public transport has become an essential part of urban existence with increased population densities and environmental awareness. Large quantities of data are currently generated, allowing for more robust methods to understand travel behavior by harvesting smart card usage. However, public transport datasets suffer from data integrity problems; boarding stop information may be missing due to imperfect acquirement processes or inadequate reporting. We developed a supervised machine learning method to impute missing boarding stops based on ordinal classification using GTFS timetable, smart card, and geospatial datasets. A new metric, Pareto Accuracy, is suggested to evaluate algorithms where classes have an ordinal nature. Results are based on a case study in the city of Beer Sheva, Israel, consisting of one month of smart card data. We show that our proposed method is robust to irregular travelers and significantly outperforms well-known imputation methods without the need to mine any additional datasets. Validation of data from another Israeli city using transfer learning shows the presented model is general and context-free. The implications for transportation planning and travel behavior research are further discussed. Public transport (PT) is an integral part of everyday life in many cities. The gradual shift of the global population over the past century to urban areas is markedly increasing people's dependence on PT for their daily mobility needs (Petrović et al., 2016) . PT is a complex system that is based on physical elements of stops, vehicles, routes, and other temporal and spatial elements (Ceder, 2016 ). 
The PT system consists of regularly scheduled vehicle trips open to all paying passengers, with the capacity to carry multiple passengers whose trips may have different origins, destinations, and purposes (Walker, 2012). PT is ideal when passengers regard its service as punctual and regular (Walker, 2012). With the growth in the number of cars on urban roads, PT improvements have become an essential part of traffic congestion mitigation strategies and vital in promoting sustainable transportation (Al Mamun et al., 2011). Although understanding the patterns of PT use is crucial to its planning, this task remains a significant challenge in practice and research. Numerous studies in recent years have examined the behavior of PT travelers (Li et al., 2018) in efforts to address this challenge. Habitual travel behavior is of great interest to transportation planners, and its analysis can help improve demand predictions and justify necessary upgrades to PT supply (Briand et al., 2017). This analysis can also contribute to improvements in PT service, planning, and upgrades with respect to the management of COVID transmission, by providing better information on crowded areas, such as bus stops, which is critically important to the global issue of public health. To this end, transportation planners typically use travel behavior surveys (Stopher and Greaves, 2007). While these surveys statistically reflect travel behavior correctly, they are also expensive, time-consuming, and often unable to generate sufficient amounts of data relative to the size of the population, and they would need significant changes in scope to cover recent COVID concerns. Conversely, data harvested from smart cards can generate millions of records, compared to a typical survey sample ranging from 2,500 to 10,000 households (Maeda et al., 2019). 
Smart cards, also known as automatic fare collection (AFC), provide an efficient and cost-saving alternative to the manual fare collection method (Jang, 2010; Chen and Fan, 2018). In addition to fulfilling fare collection needs, as a by-product, smart card transactions also generate geocoded timestamps that record every passenger's boardings, line transfers, and sometimes alightings for a wide range of PT vehicles (bus, tram, train, or metro) (Faroqi et al., 2018; Pelletier et al., 2011). These records are generated for almost the entire passenger population (Faroqi et al., 2018; Pelletier et al., 2011). Such information is a treasure trove for travel behavior analyses, especially for extracting passengers' spatiotemporal travel patterns (e.g., Origin-Destination matrices or path choice (Wang et al., 2011)). Nevertheless, common statistical inference methods applied in surveys are of little practical use for understanding the travel patterns of an entire population. Therefore, different methods are required. Kandt and Batty (2020) proclaimed a new era of urban research defined by advances in big data analytics, with smoother decision-making and a deeper understanding of urban systems. The massive increase in the volumes, velocities, and varieties of big data has also been paralleled by recent developments in the data science field. New data mining tools and robust cloud computing capabilities (Li et al., 2018, 2015) create new opportunities to analyze travel behavior patterns at the individual level, over extended periods, and in large urban areas (Ma et al., 2017). The availability of big data has vast potential to improve the quality of transportation planning and research, and by applying big data analytics and data mining methods, this task has become much more feasible (Ma et al., 2013). However, as in other domains, the veracity of such datasets remains questionable (Ben-Elia et al., 2018). 
Smart card datasets, in particular, may suffer from integrity problems, such as incorrect or missing values, e.g., when operators only record partial data. For example, in Yan et al. (2019), boarding stop information was completely missing and only time stamps remained intact in the dataset. A common solution for such problems is to replace the missing or erroneous data by utilizing alternative publicly accessible data. One possible solution is to use official PT timetables to impute missing boarding stop information. One popular source for such data is the General Transit Feed Specification (GTFS), first created in 2006 by Google (Google, 2016) and defined as a standard file format for storing PT schedules and associated geographic information (Ma et al., 2012). GTFS contains the complete schedules and routes of every PT line planned for each day of the month in tabular formats, together with corresponding geographic shapefiles, and is now widely used in over 750 urban regions across the world (Hadas, 2013; Antrim et al., 2013). Nonetheless, PT running times and arrival times at stops are never perfectly aligned with their official timetables, as PT is not always punctual, even in developed countries. For example, Cats and Loutos (2016) found that only 10% of all arrivals were within an interval of 15 seconds. This issue becomes more acute when PT vehicles - mainly buses - also share the same road space with private and commercial vehicles (i.e., mixed traffic). While this issue is less severe in major urban areas in developed countries, where rail-based infrastructure and PT bus preemption are widespread and right-of-way strongly enforced, this is not the reality everywhere. For example, in Israel (an official OECD member), buses accounted nationally for 85% of PT trips in 2019, with more than 2M passengers served daily. 
The country suffers from a shortage of adequate PT infrastructure (namely, too few priority lanes - 14 meters per capita, compared to 300 meters in the EU), resulting in poor PT service punctuality (Ceder, 2004). As shown later on, this makes schedule-based imputation a poor substitute for boarding stop prediction. A second solution is to discard such data by simply removing missing records or those that do not align with a prescribed hypothesis (Tao et al., 2014). Nonetheless, discarding data can be regarded as a reasonable solution only when the share of missing data is small. However, when the missing portion is substantial, the whole dataset could be compromised and discarded. This scenario can render certain urban areas effectively blind vis-a-vis smart card data. A third option is to complement the missing data by combining different datasets. In this respect, either automatic vehicle location (AVL), which uses installed GPS transponders to locate PT vehicles and estimate real-time arrival times at designated stops, or automatic passenger counters (APC), which use infrared or laser technologies to estimate boarding and alighting passenger numbers, have been used in combination with smart card data (Mazloumi et al., 2010; Shalaby and Farhan, 2004; Khiari et al., 2016). Yet, such data is neither always available (Yan et al., 2019; Chen and Fan, 2018) nor efficient, as many more errors may well be introduced in the process (Luo et al., 2018). These two facts likely reduce their suitability for data imputation. Moreover, even when such data sources exist, matching between them is somewhat challenging. For example, Luo et al. (2018) had no vehicle trip identification (ID), making it impossible to match with AVL records. A further difficulty is that missing data can vary by city or between different operators (Laña et al., 2018). 
In some cities, data integrity is regarded as very strong, and consequently, boarding and alighting imputation tasks achieve very good results (Munizaga et al., 2014). In contrast, in other cities where data sources are lacking, data integrity can be flawed, and heuristic methods, such as ML, are therefore the only viable way to perform imputation tasks (Yan et al., 2019). To this end, our aim is a general and context-free boarding stop imputation method. Specifically, we address use cases where data quality is too poor to impute by cross-inference, without the need to harvest any data beyond what is strictly necessary. While still providing valuable insights for transportation planners, we consider this of particular relevance to developing countries, where the traveler population is mostly PT-dependent. We established a general boarding stop imputation method to improve the quality and integrity of PT datasets by predicting missing or corrupted travelers' records in smart card data. Namely, to the best of our knowledge, we developed the first machine learning (ML) algorithm for predicting passengers' boarding stops (see Figure 1). Our algorithm is based on features extracted by harvesting three big data sources: the planned GTFS schedule data, smart card (AFC) data, and geospatial (GIS) data. We applied a machine learning model to these features to predict boarding stops based on the notion of embedding (see Section 3). To train and evaluate our algorithm's performance, we utilized a real-world smart card dataset from the city of Beer Sheva in Israel that consists of over a million trips taken by more than 85,000 passengers. Since the boarding stops are embedded, they also become ordered, and therefore, the problem we addressed is one of ordinal classification. 
Accordingly, we also propose a new evaluation method, which we define as Pareto Accuracy, that reports the percentage of predictions within each error dimension; it is more interpretable and allows for better comparison between imputation models. We show that our model performed significantly better than a naïve prediction model based on harvesting GTFS data alone (aka schedule-based) and other imputation methods. In this study, we succeeded in generating a model that is both wholly generic and achieves considerably higher accuracy and recall values than other tested imputation methods (see Section 4). Additionally, we demonstrated that we obtain similar prediction results in an entirely different city using our method. Moreover, we show that other imputation methods are not always applicable, while our methodology can be applied with broader scope. Our study's overall focus is to improve the integrity of public transport data. Specifically, our study provides the following two main contributions: 1. We present a novel prediction model for imputing missing boarding stops using supervised learning. Moreover, our proposed model is generic and transferable, i.e., it can be trained on one city's data and then impute missing data in another municipality. 2. We developed a new metric - Pareto Accuracy - for evaluating public transport imputation models that is more interpretable and allows broader comparisons between them. The rest of the paper is organized as follows: In Section 2, we review related work on smart card usage, missing data imputation, and ML applications in transportation research. Section 3 describes the use case, experimental framework, and methods used to develop the ML model and the extraction of its features. In Section 4, we present the results of the ML model and compare its performance to other known solutions. In Section 5, we discuss the implications of the findings and the study's limitations and present our conclusions and future research directions. 
We provide an overview of relevant studies by first presenting smart card research in general, followed by studies that have utilized smart card data with machine learning to perform predictive analytics. We then give an overview of the field of missing data imputation. Lastly, we present studies in the field of ordinal classification. The smart card system was introduced as a smart and efficient AFC system in the early 2000s (Chien et al., 2002) and has since become an increasingly popular payment method (Bagchi and White, 2005; Trépanier et al., 2007). In particular, smart cards have also become an increasingly popular source of big data for research and policy making (Jang, 2010; Agard et al., 2007). For example, smart card data is used for exploring travel behavior, determining travel patterns, measuring the performance of PT services, locating critical transfer points, and analyzing crowdedness effects on route choice (Bryan and Blythe, 2007; Alguero, 2013; Zhao et al., 2017; Li et al., 2018; Jang, 2010; Yap et al., 2020). Recently, smart card datasets were used to study changes in travel behavior as a result of the Covid-19 pandemic (Zhang et al., 2021; Almlöf et al., 2020; Orro et al., 2020). Comprehensive literature reviews of smart card usage were provided by Pelletier et al. (2011), Faroqi et al. (2018), and Schmöcker et al. (2017). Initially, smart card research applied rather classic statistical methods and descriptive analytics. Devillaine et al. (2012) inferred the location, time, duration, and designation of PT users' activities using rules derived from smart card data and work and study schedules. The main research challenge evident in the literature was to estimate origin-destination (OD) matrices, which describe the spatial distribution of travel demand between locations during different periods of the day (Wang et al., 2011; Gordon et al., 2013; Munizaga and Palma, 2012; Chu and Chapleau, 2008). 
OD matrices are also crucial inputs for the three stages of PT network design, namely: route design, frequency (headway) setting, and timetabling (Guihaire and Hao, 2008). Before the advent of smart cards, these matrices were only derived and validated based on some representative sample of travelers. However, as noted, surveys often lack sufficient spatial and temporal coverage. Various studies have demonstrated the advances in OD estimation with smart card data (Wang et al., 2011; Gordon et al., 2013; Munizaga and Palma, 2012; Chu and Chapleau, 2008). Nevertheless, with the introduction of smart cards, new problems in OD estimation appeared. Namely, many PT agencies adopted a TAP (Transit Access Protocol) IN system, where only boarding stop information is recorded. In contrast, in "TAP IN + TAP OUT" systems, the availability of alighting stop information allows the OD matrix to be derived using more straightforward approaches. Alighting stop information is necessary for many tasks, such as route loading profiles, market research, and improvements in service planning. However, under TAP IN, the destination must be somehow predicted (Faroqi et al., 2018; Trépanier et al., 2007). In addition, combining smart card data with a smaller-scale travel behavior survey for validation purposes is a useful approach to better understand passengers' daily travel patterns (Wang et al., 2011). Nonetheless, OD analyses inherently assume that PT passengers travel routinely back and forth from/to the same locations. Recent findings suggest this assumption does not necessarily hold: some share of PT passengers are quite flexible (Huang et al., 2018) or use PT infrequently (Benenson et al., 2019). Therefore, a simple OD estimation will possibly result in PT planning that is mismatched with actual demand patterns. Traditional analysis methods do not take advantage of the full potential of the added value of big data. 
At the same time, rapid growth in power and cost reduction in computational technologies provide new opportunities, both in terms of the availability of the massive amount of data collected and the development of more novel algorithms (Welch and Widita, 2019). Agard et al. (2007) obtained travel behavior indicators that identify daily travel patterns and clustering of major user groups. Bhaskar et al. (2014) applied a density-based spatial clustering application with noise (DBSCAN) algorithm to cluster passengers and identify classes of passengers for strategic planning improvements. Ma et al. (2013) used smart card data to cluster the travel patterns of PT riders to characterize commuter profiles. In this respect, the literature shows a shift toward harvesting the prognostic nature of machine learning (ML) to yield better predictive analytics, highlighting the growing emphasis on using smart card data for analytical purposes. This shift underscores the change from the more straightforward analyses conducted in the past to the more comprehensive analysis done today. Hagenauer and Helbich (2017) compared several ML classifiers and showed both their predictive power and ability to uncover travelers' mode choices via feature importance analysis. For example, trip distance was the most important predicting factor, while temperature was only a key feature for predicting bicycle use (Hagenauer and Helbich, 2017). In 2018, Palacio (2018) showed that ML predictions are much more accurate than traditional linear models, which were sub-optimal both in terms of R-square and MSE. In the following year, Traut and Steinfeld (2019) combined smart card data with crime records to assist agencies in identifying insecure and dangerous PT stops. Another study, which inferred mode and route choices, stresses the need for cross-disciplinary collaborations between data scientists and transportation planners to exploit the information contained in the data. 
Further evidence of the prominence of big data analytics in PT research can be found in several review papers (Fonzone et al., 2016; Anda et al., 2017; Milne and Watling, 2019; Li et al., 2018; Namiot and Sneps-Sneppe, 2017). Deep learning algorithms have also been utilized to address PT issues using smart card data. Deep learning is a sub-field of ML that automates feature engineering, and its methods are state-of-the-art in many domains. Examples of such implementations include inference of passenger employment status, forecasting passenger destinations using standard deep networks and long short-term memory networks (Jung and Sohn, 2017; Toqué et al., 2016), inference of demographics using convolutional neural networks, improving passenger segmentation, and predicting multimodal passenger flows (Toqué et al., 2017). Incomplete data is a universal problem, and the application of different imputation methods will often yield different results; therefore, to preserve reproducibility, missing data must be adequately addressed (Saunders et al., 2006). This problem is notably relevant for transportation planning, e.g., in the case of road traffic analysis (Qu et al., 2009). Incomplete data is a well-known problem in the data mining literature, where a significant amount of data can be missing or incorrect. Lakshminarayan et al. (1996) elucidated the severity of this issue and recommended applying ML techniques toward its solution rather than classical statistical methods. Batista and Monard (2003) assert that missing data imputation must be carefully handled to prevent bias from being introduced. Moreover, they show that the most common methods, such as mean or mode imputation, are not always optimal. One example we found in the PT literature is that of Kusakabe and Asakura (2014). They used a Naïve Bayesian model for data imputation and analysis of PT to understand continuous long-term changes in trip attributes. 
They showed both the power of smart card data and the usefulness of missing data imputation in this field. Their method of imputation, however, is not reported in sufficient detail to be understood or replicated. The development of several techniques to optimize missing data imputation shows the importance attributed to this area of research (Bertsimas et al., 2017). Moreover, even state-of-the-art deep learning methods have been applied to this problem (Camino et al., 2019; Garg et al., 2018; Costa et al., 2018). These implementations were performed on a variety of datasets and problems, such as classification of continuous attributes (breast cancer and default credit card classification), images (Camino et al., 2019; Garg et al., 2018), and regression (Camino et al., 2019). Insofar as this field of study has not been operationalized for PT data, further examination is warranted, particularly when considering the issue of completing missing data to provide better information on crowded PT areas as it pertains to the spread of COVID. In many imputation tasks, including PT, ML methods significantly outperform standard methods as the missing portion increases (Yan et al., 2019; Saunders et al., 2006; Laña et al., 2018; Echaniz et al., 2019). Additionally, standard imputation methods are too sensitive to the ratio of missing data and to infrequent or 'irregular' users of the PT network (Van Lint et al., 2005). Conversely, ML-based imputation showed stable results regardless of the missing ratio (Laña et al., 2018). As noted previously, one solution is to impute the missing boarding stops using complementary datasets such as AVL or APC. However, AVL data are not always available (Chen and Fan, 2018), whereas combining several datasets (e.g., AVL, AFC, APC, GTFS) can introduce more errors and make it much harder to match them perfectly (Luo et al., 2018). Classification is a form of supervised ML that aims to generalize a hypothesis from a given set of records. 
It learns a mapping h(x_i) → y_i, where y takes a finite number of classes (Kotsiantis et al., 2007). The basic metrics for classification are sensitivity, specificity, and accuracy (Jiao and Du, 2016). Accuracy is the percentage of observations classified correctly, specificity is the percentage of true negatives classified correctly, and sensitivity is the percentage of true positives classified correctly. A classification task becomes ordered when the classes have some inherent order between them. However, the aforementioned metrics (e.g., sensitivity, specificity, and accuracy) are unsuitable for evaluating this problem in our case, which also requires a high level of interpretability of the results. Ordinal classification is a form of multi-class classification where the classes exhibit some natural ordering (such as cold, warm, and hot), but not necessarily numerical traits for each class. Rather than being chosen based on the traditional metrics discussed above, a classifier may be chosen based on the severity of its errors (Gaudette and Japkowicz, 2009). Additionally, classic modeling techniques will sometimes perform suboptimally, since standard ML models assume there is no order between classes. In such tasks, e.g., the well-known Boston housing and breast cancer datasets, different models that take advantage of ordinal information are preferred (Frank and Hall, 2001). For such tasks, additional metrics have been proposed, such as the regression metrics Mean Absolute Error (MAE) and Mean Squared Error (MSE), and even a dedicated metric, the Ordinal Classification Index (Cardoso and Sousa, 2011). Notwithstanding, as noted below, these approaches fit neither our data nor our needs. Therefore, we developed a different and novel performance metric (see Section 3.3). The main goal of our study is to use ML algorithms to improve the integrity of PT data. 
Specifically, we develop a supervised learning-based model to impute missing boarding stops in any given smart card dataset. Moreover, our goal is to construct a generic model that will be fully transferable to other datasets to impute missing data in different contexts without further adjustments. To maintain these generic objectives, we had to contend with two significant challenges. First, we could only incorporate generic properties in our model. For instance, our model cannot include the actual line number of a bus route specific to a particular city. Moreover, since supervised ML algorithms can only predict classes they were initially trained upon, the classification classes must remain the same across datasets; e.g., bus stop #14 in a specific city is an irrelevant feature for other cities. Therefore, once more, a different numerical representation is applied by embedding (see Section 3.2). Second, in order to develop a genuinely generic model that can also be applied to other geographical contexts on which it was not initially trained, the model must also undergo a process of transfer learning (Torrey and Shavlik, 2010), which entails the transfer of relevant knowledge by fine-tuning a model on a "novel" dataset. In our case, our model underwent transfer learning using a dataset on which it had not been trained before. The missing boarding stop values were imputed using the following methodology (see Figure 1): First, we preprocessed and cleaned the smart card dataset that we utilized in this study (see Section 3.1). Next, we extracted various features from two other datasets: (a) the GTFS timetable data; and (b) open municipal geospatial data. In addition, we converted boarding stops from their original identifiers to embedded numerical representations based on GTFS data (see Section 3.2). Afterward, we applied ML algorithms to estimate a model that can predict the missing boarding stops. 
We used SHAP values to determine feature importance (Lundberg and Lee, 2017), i.e., which features make the most substantial contribution to the predictive power of the model (see Figure 5). We also evaluated the performance of our model using a novel metric called Pareto Accuracy. Then, based on common metrics, we evaluated our model relative to a schedule-based model estimated only on GTFS timetable data. Finally, we compared our model to several other comparative models (e.g., passenger history, temporal proximity, or semi-random guessing) that were previously used in the literature. Below, we describe each step of our approach in more detail. As noted above, we used three datasets: 1. The smart card dataset - "Rav-Kav" is the Israeli AFC system applying the TAP protocol, allowing PT passengers to pay for their trip using their smart cards anywhere in the country. Rav-Kav operates a nationwide TAP IN for buses and rail that codes information on unique passenger identifiers, traveler types (such as student or senior travelers), boarding stops, boarding timestamps, fares, discount attributes, and unique trip identifiers of the line at that time. For rail trips only, TAP OUT also records alighting stops and times. During the period 2018/9, circa 2M boardings were recorded per day in the entire country. 2. GTFS - a GTFS feed, as described above, consists of rail/bus schedules and timetables, stops, and routes of every PT trip planned for every day of the month. In Israel, the GTFS feed has been published daily online by the Ministry of Transport since 2012, providing the schedules of 36 bus and rail operators, encompassing 7,800 route-direction-pattern alternatives served by 28,000 bus and rail stations. The GTFS feed aligns with the smart card dataset as described below. This study utilized the GTFS dataset to enrich the feature space and convert boarding stop records into an embedded numerical value. 3. 
Geospatial information - we derived a variety of geospatial attributes from municipal GIS databases. To obtain a dataset suitable for constructing the prediction model, we were required to remove any record that lacked a boarding stop or a trip ID (a unique identifier of a trip provided by a specific and unique PT operator) from the smart card dataset. Next, we joined the smart card dataset with the GTFS dataset by matching the trip ID attributes. Lastly, we joined the geospatial dataset with the smart card dataset using the GTFS dataset, which contains all the geographic coordinates of each PT route. ML performance is highly correlated with the quality of the feature space, and therefore, including more informative features generally results in better model performance (Gudivada et al., 2017). While the smart card data contains the PT line and boarding time of each passenger, it lacked several essential attributes, such as the time that had elapsed since the line left the origin depot, the time remaining until arrival at the final destination, the total number of stops, and other relevant trip attributes. Moreover, the smart card data is missing physical geospatial characteristics, such as the number of traffic lights on the PT route, which likely increases traffic congestion and consequent delays and could well strengthen model performance. Overall, three features were extracted from the smart card dataset, five from the GTFS dataset, three from the geospatial dataset, and four from the combined GTFS and smart card datasets. From the 41 features we tested in total, we selected 15 to include in our model (see Table 1, which lists, for example, the day of the week on which the boarding occurred and Is_weekend, an indicator of whether the boarding took place on a weekend). To construct the prediction model, we used the GTFS dataset to create a schedule-based prediction. This naive prediction reflects the transit vehicle's position along a line according to the GTFS schedule. 
Namely, let S_i be the sequence number of the boarding stop based on the GTFS schedule, and let A_i be the actual boarding stop sequence number. Then, we define D_i as D_i = A_i − S_i. Our prediction model's goal was to predict D_i by utilizing the variety of features presented in the previous section. For instance, consider a passenger who boarded a line at the third stop, i.e., A_i = 3, but the transit vehicle was scheduled to arrive at the second stop at the designated time. The schedule-based prediction would be 2, i.e., S_i = 2, the stop where the vehicle was supposed to be at that time. Then, the difference is D_i = A_i − S_i = 3 − 2 = 1, and this is the class the algorithm will predict. Subsequently, we performed the following steps to construct the prediction model: First, we selected several well-known classification algorithms, namely Random Forest (Singh et al., 2016), Logistic Regression (Singh et al., 2016), and XGBoost (Chen and Guestrin, 2016). Second, we split our dataset into a train dataset, which consisted of the first three weeks of data, and a test dataset, which consisted of the last week of the same dataset. Figure 2 shows the distributions of the embedded boarding stops by the computed difference between the actual and schedule-based sequences (D_i) for the train and test subsets. No apparent differences between the two distributions are evident. Third, for both the train and test datasets, we extracted all 15 features mentioned above. Fourth, we constructed the prediction models using each one of the selected algorithms. Lastly, we compared the generated models and selected the one with the best performance based on the Pareto Accuracy metric (see Section 3.3). We evaluated each model and compared it to the schedule-based method on the test dataset using common metrics - accuracy, recall, precision, and F1 (see Appendix A for definitions) - and the new metric we developed, Pareto Accuracy. 
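The join, target construction, and temporal split described above can be sketched as follows. This is a minimal illustration with hypothetical column names and toy values (the real Rav-Kav and GTFS schemas differ), assuming a simple day-of-month cutoff between the first three weeks and the last week:

```python
import pandas as pd

# Hypothetical smart card records: each boarding carries a trip ID, the day
# it occurred, and the actual boarding stop sequence number (A_i).
smart_card = pd.DataFrame({
    "trip_id": ["T1", "T2", "T3"],
    "day_of_month": [3, 12, 25],
    "actual_stop_seq": [3, 7, 5],
})

# Hypothetical GTFS-derived schedule positions: the stop the vehicle should
# have reached at the boarding time according to the timetable (S_i).
gtfs = pd.DataFrame({
    "trip_id": ["T1", "T2", "T3"],
    "scheduled_stop_seq": [2, 7, 4],
})

# Join on trip ID, then build the class label D_i = A_i - S_i.
df = smart_card.merge(gtfs, on="trip_id", how="inner")
df["D"] = df["actual_stop_seq"] - df["scheduled_stop_seq"]

# Temporal split: first three weeks for training, last week for testing.
train = df[df["day_of_month"] <= 21]
test = df[df["day_of_month"] > 21]
```

The first row reproduces the worked example from the text: a passenger boarding at the third stop while the schedule places the vehicle at the second stop yields the class label D_i = 1.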
We used the following variables for our novel Pareto Accuracy metric: Let p_i be the predicted sequence of stop i, a_i the actual sequence, and d_i = |p_i − a_i| the absolute difference between them. Let l be the limit of acceptable difference for imputation; e.g., if an error of one stop is tolerated, such as for neighborhood segmentation, then l = 1. Let X_i be an indicator defined as X_i = 1 if d_i ≤ l and X_i = 0 otherwise. We defined Pareto Accuracy over n observations as PA_l = (1/n) Σ_i X_i. The PA metric is a generalization of the accuracy metric; namely, PA_0 is the well-known accuracy metric. Unlike other ordinal classification methods, the primary advantage of the PA metric is that it evaluates the actual magnitude of error while being extremely robust to outliers (by setting the parameter l). Moreover, this metric is highly informative since its outcome value can be interpreted easily; for example, 0.6 means that 60% of predictions differed by at most l from the true labels. For example, let us consider a set of eight observations of embedded boarding stops {-2, 0, 3, 20, -3, 4, 3, 2}, where each observation is a simulated boarding by a passenger and each number (D_i) represents the difference between the expected (S_i) and actual (A_i) boarding stops. With a value of 20, the fourth observation is an outlier, which might occur due to some fault in the decoder device of the public transport operator. We do not want to predict it, as it is inherently unpredictable. We seek a metric that is both resilient to such outliers and still accounts for the true magnitude of the errors (see Section 2.3). Let us compare two classifiers, A and B. Classifier A predicted the values {-2, 0, 4, 3, -2, 3, 2, 2}, while Classifier B predicted {3, 0, 3, 7, 1, 1, 3, 2}. Classifier A is the more useful classifier since, in general, its predicted values are closer to the actual values; i.e., its error variance is small, which makes it more reliable.
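A minimal implementation of the metric (the function name is ours), reproducing the eight-observation two-classifier example above:

```python
import numpy as np

def pareto_accuracy(actual, predicted, l=1):
    """Share of predictions whose absolute error is at most l.
    PA_0 reduces to the classical accuracy metric."""
    d = np.abs(np.asarray(actual) - np.asarray(predicted))
    return (d <= l).mean()

actual = np.array([-2, 0, 3, 20, -3, 4, 3, 2])   # 20 is the outlier
clf_a  = np.array([-2, 0, 4, 3, -2, 3, 2, 2])
clf_b  = np.array([3, 0, 3, 7, 1, 1, 3, 2])

print(pareto_accuracy(actual, clf_a, l=0))  # 0.375 -> classical accuracy of A
print(pareto_accuracy(actual, clf_b, l=0))  # 0.5   -> classical accuracy of B
print(pareto_accuracy(actual, clf_a, l=1))  # 0.875 -> PA_1 of A
print(pareto_accuracy(actual, clf_b, l=1))  # 0.5   -> PA_1 of B
```

The single outlier caps both classifiers' PA at 7/8, but otherwise PA_1 rewards Classifier A's consistently small errors.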
However, when using the classical accuracy and RMSE metrics, Classifier B appears preferable to Classifier A, with a higher accuracy (50% vs. 37.5%) and a lower RMSE (5.2 vs. 6). By using Pareto Accuracy (PA_1), we obtain a more accurate picture in which Classifier A clearly outperforms Classifier B (87.5% vs. 50%). Here we see a case where the metrics used for both classical classification (accuracy) and ordinal classification (RMSE) do not reflect the actual performance of each classifier. In addition to these metrics, to evaluate the performance of our model and compare it to the schedule-based model, we also performed a spatial analysis by plotting heatmaps and a temporal analysis using hours and days of the week (see Section 4.3). The analysis entailed comparing boarding stops that were predicted well, i.e., at accuracies of 50% or above. Lastly, to enrich our understanding of the nature and patterns of PT, we produced and analyzed feature importance using the SHAP values method, a unified framework for interpreting predictions based on game theory (Lundberg and Lee, 2017); the values are the average of the marginal contributions across all permutations. We evaluated the above methodology by applying it to the smart card data of the city of Beer Sheva, Israel. With about 200,000 inhabitants, Beer Sheva is the largest city in southern Israel. It presents an interesting use case given its relatively remote location, making it relatively isolated from a traffic perspective. Additionally, it has a sparse PT network that is easier to model. Furthermore, it has complete passenger boarding stop information, and road traffic in the city is not prone to heavy congestion. We utilized a smart card dataset consisting of over 1M records (after preprocessing, about 92% of the smart card records remained) from over 85,000 distinct travelers over one month during November and December 2018.
Next, we used a GTFS feed containing over 27,000 stops and over 200,000 PT trips in Israel for the same period as the smart card data, covering all the operators (or agencies, in GTFS terms) in the country. The dataset also included a detailed timetable for every PT trip. Lines and stops for the city of Beer Sheva were sorted by operator and geographic coordinates; all selected routes were bus lines. We also used a geospatial dataset from the municipal open GIS portal that contained a variety of geographical attributes of the city of Beer Sheva, such as traffic light locations, built-area densities, and more. We then extracted the 15 features from the above datasets, converted the boarding stops from their Beer Sheva identifiers to numerical values (i.e., embedding), and lastly estimated an ML algorithm to classify the boarding stops, evaluating the classifier's performance as described earlier. As mentioned, one of our primary goals was to develop a generic model that can be applied in any city. To that end, we validated our model on the data of a neighboring city, Kiryat Gat, situated 43 km north of Beer Sheva. We applied the method of transfer learning (Torrey and Shavlik, 2010), which entails transferring relevant knowledge by fine-tuning a model on a "novel" dataset, i.e., a set of data on which it did not train. Rather than letting the model train on the full dataset, we used only a 10-day interval of the data for the transfer learning task and the remaining 20 days for the evaluation. While this grouping of the data could cause sub-par model performance, it demonstrated that transfer learning can be accomplished with little data, most of which the model can impute. Spatial and temporal analyses were also included for this use case. The main advantage of our modeling approach is that no ground truth is necessary to apply the model.
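The paper fine-tunes its gradient-boosted model on the small target-city slice; as a minimal stand-in sketch of the same idea, scikit-learn's SGDClassifier continues training from its current weights via partial_fit (all data below is synthetic):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
classes = np.array([-2, -1, 0, 1, 2])

def make_city(n, shift):
    """Synthetic city data; `shift` mimics a systematic delay difference."""
    X = rng.normal(size=(n, 5))
    y = np.clip(np.round(X[:, 0] + shift + 0.2 * rng.normal(size=n)), -2, 2)
    return X, y.astype(int)

# 1. Train on the source city (Beer Sheva in the paper).
X_src, y_src = make_city(3000, shift=0.0)
model = SGDClassifier(random_state=0)
model.partial_fit(X_src, y_src, classes=classes)
for _ in range(4):                # a few extra passes over the source data
    model.partial_fit(X_src, y_src)

# 2. Fine-tune on a small slice of the target city (10 days in the paper).
X_small, y_small = make_city(300, shift=0.5)
model.partial_fit(X_small, y_small)

# 3. Evaluate on the remaining target-city data (20 days in the paper).
X_eval, y_eval = make_city(1000, shift=0.5)
acc = (model.predict(X_eval) == y_eval).mean()
print("target-city accuracy:", round(acc, 2))
```

The fine-tuning pass nudges the source-city model toward the target city's shifted delay distribution without retraining from scratch.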
This advantage, namely that no ground truth is required, stems from the fact that training is enabled without using domain-specific labels, i.e., even when data integrity is poor and no complementary data is available. We test this assertion by comparing the ML model to other possible imputation models. Such methods, specifically passenger history and temporal closeness, can in some cases provide very accurate predictions, mainly when data integrity is high. However, they have some essential limitations. The passenger history method requires that passengers have multiple observations in the dataset, which is not always the case, e.g., when dealing with irregular travelers or when the research data must be split, leaving fewer usable records. Additionally, the temporal closeness method is susceptible to poor data integrity and sparse rides. The passenger history and temporal closeness baselines were each implemented as a simple lookup algorithm; in addition, we evaluated a semi-random classifier as a lower-bound imputation method. Model robustness was validated by examining model performance on irregular passengers in comparison to the alternative imputation methods, given that simple imputation methods are ineffective when considering irregular travelers (Van Lint et al., 2005). Therefore, we examined model performance in predicting the boarding stop of one-time travelers in Beer Sheva, i.e., passengers who boarded once and did not return by PT on the same day. These observations are usually discarded because they do not contribute to OD estimation. The results are presented in the following order: First, we describe some properties of the data we used, showing its suitability for the developed methodology. Second, we describe the estimated ML model and its performance in comparison to the schedule-based model. Third, we analyze the performance of the two models both temporally and spatially.
Fourth, we show the validation of the ML model on the use case of the city of Kiryat Gat, using transfer learning. Fifth, we compare our model to the alternative imputation model specifications mentioned earlier. Lastly, we examine prediction robustness. We began the analyses by exploring the processed data. First, we examined the degree of lateness in the smart card data compared to the timetable data in the GTFS feed for the city of Beer Sheva. For every PT trip, the time difference between planned and actual arrival times was computed for every stop on each line (see Figure 3). As can be observed in Figure 3, the density function shows both early arrivals and lateness, ranging from about 500 seconds (roughly 8 minutes) early to 1,000 seconds (roughly 17 minutes) late. This result suggests that the data is very suitable for applying our method; moreover, it can be expected that a schedule-based model using only GTFS timetable data will be less accurate. Second, we investigated the distribution of the missing boarding stop information in the smart card data. Figure 4 presents the mean proportion of missing boarding stops per trip for the top three PT operators in Israel. This distribution is not random: if boarding stops were missing at random, the mean would be expected to be around 0 with a long tail. As the density function is far from that shape, we can deduce that boarding stops are indeed not missing at random. We trained several classifiers and evaluated their performance. Among the trained classifiers, the XGBoost classifier presented the best performance (see Table 2). We compared the classifiers using the common metrics described before. Additionally, we evaluated our Pareto Accuracy metric for error sizes of 1 and 2, i.e., PA_1 and PA_2; any larger gap would typically be deemed unacceptable in terms of level of service, and these error sizes are highly correlated with PA_i for i > 2.
One significant advantage of embedding is the calculation speed, which averaged 15.9 ± 0.023 seconds on about 300K observations. The SHAP values evaluating the effect of each feature are presented in Figure 5. Here we can note: (a) by far the most important feature for the prediction is the schedule-based predicted sequence, which shows it is highly correlated with actual patterns and is very useful for classification; (b) other than the first two SHAP features, the following four are temporal, which is commonsensical given that different periods have varied impacts on traffic (such as the morning peak) and that, as a bus progresses along its route, stochastic events accumulate and the variance increases; (c) although the geospatial features are not of the highest importance, they are not trivial, and thus we conclude that certain physical attributes can influence the nature of our problem, e.g., denser areas can engender more congestion; and (d) the two least significant features pertain to the day of the week, from which we can assert that daily PT routines remained quite stable in our case study. In Figure 6, we present the Pareto Accuracy of the ML model and the schedule-based one. It shows that the results are stable even for values of l higher than 1. Therefore, we can conclude that the proposed model outperforms the schedule-based model. Figure 6: Pareto Accuracy comparison between ML and schedule-based models (test) In addition to the aggregated results, we analyzed the model performance both temporally (see Figure 7) and spatially (see Figure 8). The temporal analysis shows that, in terms of accuracy, our proposed model outperformed the schedule-based method on both a daily and an hourly basis. Moreover, the spatial analysis showed similar results; the stops where the predictions were ranked 'good', i.e., over 50% accuracy, were plotted.
Two major insights can be derived from these analyses: First, the ML model predicts many more stops than the schedule-based model. Second, the schedule-based model renders good predictions mainly for the central stops (train stations, main roads, or industrial zones); when applied to non-central locations, it is suboptimal, in stark contrast to the ML model, which makes good predictions across all locations. As noted, we performed the model validation for the nearby city of Kiryat Gat. Evidently, the ML model performed remarkably better than the schedule-based model (see Table 3). Figure 9 shows the Pareto Accuracy for different error sizes, indicating that the ML model is consistently better. Figure 10 presents the performance comparison in Kiryat Gat through the temporal analysis (accuracy by day of the week and on an hourly basis for weekdays); it shows properties similar to those of the trained model, with the ML model demonstrating higher accuracy than the schedule-based model. Figure 11 presents the spatial analysis, revealing once more that the ML model predicts more stops with higher accuracy. Table 4 shows the results of the comparisons to the alternative imputation methods. While the predicted accuracy of the two alternative methods is similar, their disadvantages are evident in the lower share of the population that can be predicted compared to the ML model. The semi-random classifier naturally demonstrates that it is far from trustworthy in the case of hierarchical PT networks. It is important to note that while the accuracy of our proposed method is lower, it is far more robust, both in terms of the percentage of the population predicted and on irregular travelers, whom the other suggested methods are incapable of predicting (see Section 4.6). For example, when predicting using historical records, we cannot predict for a new passenger or a new route.
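A minimal sketch of such baselines (our reconstruction, not the paper's exact algorithms) makes these failure modes concrete:

```python
import random
from collections import Counter, defaultdict

def passenger_history_impute(records, passengers):
    """Impute each passenger's boarding stop as their most frequent historical
    stop; returns None for passengers with no history (irregular travelers)."""
    history = defaultdict(Counter)
    for passenger, stop in records:
        history[passenger][stop] += 1
    return {p: (history[p].most_common(1)[0][0] if p in history else None)
            for p in passengers}

def temporal_closeness_impute(known_boardings, t_missing):
    """Impute as the stop of the temporally closest known boarding on the same
    line; degrades on sparse lines with few nearby observations."""
    if not known_boardings:
        return None
    _, stop = min(known_boardings, key=lambda ts: abs(ts[0] - t_missing))
    return stop

def semi_random_impute(stop_usage, rng=random):
    """Lower-bound baseline: draw a stop at random, weighted by overall usage."""
    stops = list(stop_usage)
    return rng.choices(stops, weights=[stop_usage[s] for s in stops], k=1)[0]

records = [("p1", 5), ("p1", 5), ("p1", 7), ("p2", 3)]
print(passenger_history_impute(records, ["p1", "p3"]))
# {'p1': 5, 'p3': None}: p3 is a new/irregular traveler and cannot be imputed
print(temporal_closeness_impute([(480, 5), (495, 6)], t_missing=492))
# 6: the boarding at minute 495 is temporally closest to minute 492
```

The None result for the unseen passenger is exactly the coverage gap that the ML model closes.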
When using temporal closeness, the prediction is extremely sensitive to sparse routes. While the personal history method can indeed be relevant, as evident in Table 4, model robustness was, as noted above (see Section 3.5), evaluated by examining performance in predicting the boarding stop of one-time travelers. As shown in the first row of Table 5, the results clearly show that the ML model is robust and capable of predicting missing stops even for irregular or new passengers who have no historical pattern. Additionally, as noted earlier, the suggested baseline methods are very limited; the evaluation of the passengers they cannot predict is shown in the second and third rows of Table 5. In this study, we showed that by mining smart card data and extracting timetable data, we could construct a passenger boarding stop prediction model that surpasses the traditional schedule-based method. Our research revealed that applying machine learning techniques improves the integrity of PT data, which can significantly benefit the field of transportation planning and operations. From the results, we can deduce the following conclusions: First, our methodology for feature extraction and machine learning model construction demonstrates several noteworthy advantages: (a) the ML algorithm generates a generic model that can be used with other smart card datasets, since the labels (i.e., numeric representations) are always aligned across datasets; (b) by embedding the boarding stops, our method ensures that the number of distinct labels is relatively small, and a significant reduction in computation time can be accomplished; (c) boarding stop use is inherently imbalanced, as some stops are used frequently while others are used rarely.
Our proposed methodology is able to accurately classify many classes despite these inherent imbalances, thus reducing unpredictability; (d) the method is data-lean and requires mining only a smart card dataset and a GTFS feed (or any compatible timetable dataset), without the need to process any other datasets; (e) the ML model is entirely complementary to other imputation methods, including the schedule-based method as well as passenger history or temporal closeness; and (f) the method provides a robust model capable of dealing even with irregular or unpredictable passengers. Second, our model (applying the XGBoost algorithm) produced the highest performance, with 41% accuracy and 71% PA_1, whereas the schedule-based method achieved only 21% accuracy and 47% PA_1. Even for larger error sizes, the ML model outperformed the schedule-based one. Moreover, the schedule-based method was able to render good predictions only for a few main stops, in contrast to the ML model, which predicted well across all stops. This dependency on centrality was clearly visible in the spatial analysis of the well-predicted stops. This result confirms our conjecture that schedule-based imputation approaches can be significantly improved by using ML methods. Furthermore, we also found that complex methods, such as ensembles, resulted in much better model performance than simple algorithms, such as logistic regression. In future research, we intend to test the performance of additional prediction algorithms, such as Deep Neural Networks (Jung and Sohn, 2017; Liu and Chen, 2017). Third, from the SHAP values (Figure 5), the following can be noted: the temporal features (created from the timetable in the GTFS feeds) are indeed crucial for the operation of the ML model, whereas the geospatial features were less important. Accordingly, we estimated a model trained without the geospatial features (see Table B.1 in Appendix B).
In comparison to the richer model, its performance is somewhat worse. Therefore, we assert that such information is useful: firstly, for understanding patterns in a given city, for instance, which spatial attribute is more closely correlated with lateness or earliness; secondly, for aiding the transfer learning process in a new city, i.e., if the model was trained on city A and will be used to predict city B, using the spatial features will produce a model more robust to the differences between those cities. Fourth, we showed that the ML model is transferable (see Section 3.2) and able to provide strong and consistent results when validated on another city, while outperforming the schedule-based imputation method. Nonetheless, our method, given its generic nature, is not entirely comparable with methods of a dissimilar nature, such as those presented in Table 4, which cannot be straightforwardly transferred to another context. Since, to the best of our knowledge, no other imputation method shows such transferability, robustness, and generic nature besides schedule-based imputation, the latter should be regarded as the comparative benchmark until another such imputation method is developed. Fifth, we recommend using our model when the lack of data does not allow other, more accurate methods to be used, such as passenger history or temporal closeness. Nonetheless, our model can complement these methods, especially for those records that are otherwise overlooked, as shown in Table 4, and thus can utilize more of the scarce data at hand. As noted, our method does not require mining or accessing any additional datasets (such as AVL or APC), which are not always available and can increase the extent of errors in the prediction. This observation makes our method extremely suitable for planning purposes in non-auto-dependent and less technologically-oriented societies in developing countries and the Global South (Sohail et al., 2006).
Lastly, we introduced a new generalized accuracy metric, which we named Pareto Accuracy, that allows for better comparison of classifiers in ordinal classification problems. This metric is more robust to outliers, easier to interpret, and accounts for the true magnitude of errors; in addition, it is easy to implement. In the future, we hope to understand how Pareto Accuracy can improve additional ordinal classification use cases. There are a few limitations to the study worth noting. One is that our method requires several data elements to succeed, such as timestamps, trip IDs, and existing trip timetables. These requirements potentially reduce the number of relevant datasets and the number of observations that can be imputed. However, the same requirements also preclude the use of the schedule-based method; hence, in practice, our method does not further limit the ability to impute missing data. In addition, the generality of our method can increase bias, as it ignores features that cannot be transferred between datasets. Such features, for example treating each PT line as a categorical feature, can reduce bias when imputing a specific dataset. Possible extensions include predicting alighting stops (when the operator does not record tap-out events) and imputing other attributes of interest, such as trip ID or time of day. In the future, we would like to test our model in other cities to verify its generality. In addition, we also suggest testing the influence of transfer learning on new datasets. To summarize, missing data imputation is a difficult and complex task. On the one hand, one wants as much data as possible for analyses; on the other hand, data integrity is of critical importance and demands the availability of imputation methods that work well. We assert that the commonly used schedule-based method suffers from subpar performance in terms of accuracy and other key metrics and is highly dependent on the centrality of boarding stops.
In contrast, we showed that our model outperformed the schedule-based method in all metrics over different temporal periods. It was more robust to the centrality of the imputed stops and to the irregularity of recorded trips. This makes it a much more suitable method for imputation, as it improves data integrity. In addition, our method is based on generic classification and thus can be used in a wide variety of use cases.

References

• Mining public transport user behaviour from smart card data
• A composite index of public transit accessibility
• Using smart card technologies to measure public transport performance: data capture and analysis
• Who is still travelling by public transport during COVID-19? Socioeconomic factors explaining travel behaviour in Stockholm based on smart card data
• Transport modelling in the age of big data
• The many uses of GTFS data: opening the door to transit and multimodal applications
• The potential of public transport smart card data
• An analysis of four missing data treatment methods for supervised learning
• Epilogue: the new frontiers of behavioral research on the interrelationships between ICT, activities, time use and mobility
• From predictive methods to missing data imputation: an optimization approach
• Passenger segmentation using smart card data
• Analyzing year-to-year changes in public transport passenger behaviour using smart card data
• Understanding behaviour through smartcard data analysis
• Improving missing data imputation with deep generative models
• Measuring the performance of ordinal classification
• Real-time bus arrival information system: an empirical evaluation
• New urban public transportation systems: initiatives, effectiveness, and challenges
• Public transit planning and operation: modeling, practice and behavior
• The promises of big data and small data for travel behavior (aka human mobility) analysis (Transportation Research Part C: Emerging Technologies)
• Traveler segmentation using smart card data with deep learning on noisy labels
• XGBoost: a scalable tree boosting system
• Extracting bus transit boarding stop information using smart card transaction data
• An efficient and practical solution to remote authentication: smart card
• Enriching archived smart card transaction data for transit demand modeling
• Missing data imputation via denoising autoencoders: the untold story
• Detection of activities of public transport users by analyzing smart card data
• Modelling user satisfaction in public transport systems considering missing information
• Applications of transit smart cards beyond a fare collection tool: a literature review
• New services, new travelers, old models? Directions to pioneer public transport models in the era of big data
• A simple approach to ordinal classification
• DL-GSA: a deep learning metaheuristic approach to missing data imputation
• Evaluation methods for ordinal classification
• Automated inference of linked transit journeys in London using fare-transaction and vehicle location data
• Data quality considerations for big data and machine learning: going beyond data cleaning and transformations
• Transit network design and scheduling: a global review
• Assessing public transport systems connectivity based on Google transit data
• A comparative study of machine learning classifiers for modeling travel mode choice
• Tracking job and housing dynamics with smartcard data
• Travel time and transfer analysis using transit smart card data
• Performance measures in evaluating machine learning based bioinformatics predictors for classifications
• Deep-learning architecture to forecast destinations of bus passengers from entry-only smart-card data
• Smart cities, big data and urban policy: towards urban analytics for the long run
• Automated setting of bus schedule coverage using unsupervised machine learning
• Supervised machine learning: a review of classification techniques
• Behavioural data mining of transit smart card data: a data fusion approach
• Imputation of missing data using machine learning techniques
• On the imputation of missing data for road traffic forecasting
• Towards smart card based mutual authentication schemes in cloud computing
• Smart card data mining of public transport destination: a literature review
• A novel passenger flow prediction model using deep learning methods
• A unified approach to interpreting model predictions
• Constructing spatiotemporal load profiles of transit vehicles with multiple data sources
• Understanding commuting patterns using transit smart card data
• Mining smart card data for transit riders' travel patterns
• Transit smart card data mining for passenger origin information extraction
• Detecting and understanding urban changes through decomposing the numbers of visitors' arrivals using human mobility data
• Using GPS data to gain insight into public transport travel time variability
• Big data and understanding change in the context of planning transport systems
• Validating travel behavior estimated from smartcard data
• Estimation of a disaggregate multimodal public transport origin-destination matrix from passive smartcard data from Santiago, Chile
• A survey of smart cards data mining
• Impact on city bus transit services of the COVID-19 lockdown and return to the new normal: the case of A Coruña (Spain)
• Machine learning forecasts of public transport demand: a comparative analysis of supervised algorithms using smart card data
• Smart card data use in public transit: a literature review
• Appraisal of urbanization and traffic on environmental quality
• PPCA-based missing data imputation for traffic flow volume: a systematical approach
• Imputing missing data: a comparison of methods for social work researchers
• An overview on opportunities and challenges of smart card data analysis
• Public Transport Planning with Smart Card Data
• Prediction model of bus arrival and departure times using AVL and APC data
• A review of supervised machine learning algorithms
• Effective regulation for sustainable public transport in developing countries
• Household travel surveys: where are we going?
• Examining the spatial-temporal dynamics of bus passenger travel behaviour using smart card data and the flow-comap
• Forecasting dynamic public transport origin-destination matrices with long-short term memory recurrent neural networks
• Short & long term forecasting of multimodal transport passenger flows with machine learning methods
• Transfer learning
• Identifying commonly used and potentially unsafe transit transfers with crowdsourcing (Transportation Research Part A: Policy and Practice)
• Individual trip destination estimation in a transit smart card automated fare collection system
• Accurate freeway travel time prediction with state-space neural networks under missing data
• Human transit: how clearer thinking about public transit can enrich our communities and our lives
• Bus passenger origin-destination estimation and related analyses using automated data collection systems
• Big data in public transportation: a review of sources and methods
• Alighting stop determination using two-step algorithms in bus transit systems
• Crowding valuation in urban tram and bus transportation based on smart card data
• Changes in local travel behaviour before and during the COVID-19 pandemic in Hong Kong
• A deep learning approach to infer employment status of passengers by using smart card data
• Deep learning for demographic prediction based on smart card data and household survey
• Spatio-temporal analysis of passenger travel patterns in massive smart card data

Acknowledgments

This research was supported by the Ministry of Science & Technology, Israel, and the Ministry of Science & Technology of the People's Republic of China (Grant No. 3-15741). We want to thank Prof.
Itzhak Benenson (Tel Aviv University) for preliminary discussions of the research question. We also want to thank Valfredo Macedo Veiga Junior (Valf) for designing the infographic illustration and Sandra Falkenstein for editing and proofreading this paper. Special thanks to Data Scientist Raz Vais, advisor to the Israeli National Public Transport Authority, for help in obtaining and processing the data.

Appendix A - Metrics presented in this paper:

• Accuracy - Percent of observations that were correctly classified.
• Recall - The number of observations of each class that were correctly classified, divided by the total number of observations of that class. The final Recall is the weighted average of the above over all classes.
• Precision - The number of observations of each class that were correctly classified, divided by the total number of observations predicted as that class. The final Precision is the weighted average of the above over all classes.
• F1 - 2 * (Precision * Recall) / (Precision + Recall).
• AUC - Area under the ROC curve. For each class, the ROC curve plots the true positive rate as a function of the false positive rate; a weighted average of the areas under the curves of all classes is reported as the AUC metric.
• RMSE - Root mean square error, a metric for ordinal classification and regression: the square root of the mean squared difference between predictions and actual labels.
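The weighted-average definitions above correspond to scikit-learn's average='weighted' option; a small sketch on toy labels:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy multi-class labels: three classes with different support sizes.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

acc = accuracy_score(y_true, y_pred)
# Weighted averaging: per-class scores are weighted by each class's support,
# matching the Appendix A definitions of final Recall and Precision.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

On this toy input, weighted recall equals the accuracy (5/7), while weighted precision (16/21) differs because the classes are predicted with different reliability.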