key: cord-0681892-j9skwrqz
authors: Zayet, Tasnim M. A.; Ismail, Maizatul Akmar; Varathan, Kasturi Dewi; Noor, Rafidah M. D.; Chua, Hui Na; Lee, Angela; Low, Yeh Ching; Singh, Sheena Kaur Jaswant
title: Investigating transportation research based on social media analysis: a systematic mapping review
date: 2021-06-24
journal: Scientometrics
DOI: 10.1007/s11192-021-04046-2
sha: 7c013fd23d576f85d28691c2f682e0e952e515ee
doc_id: 681892
cord_uid: j9skwrqz

Social media is a pool of users’ thoughts, opinions, and reports about their surroundings and situations. This pool can serve as a source of real-time data and feedback for many domains, including transportation. It can be used to get instant feedback from commuters, such as their opinions of and complaints about the transportation network, as well as information on traffic situations, road conditions, events and more. The problem lies in how to utilize social media data to achieve one or more of these targets. A systematic review was conducted in the field of transportation-related research based on social media analysis (TRR-SMA) covering the years 2008 to 2018; 74 papers were identified from an initial set of 703 papers extracted from 4 digital libraries. This review structures the field and gives an overview based on the following grounds: activity, keywords, approaches, social media data and platforms, and research focus. It shows the trend in research subjects by country, in addition to activity trends, platform usage trends and others. Further analysis of the most employed approach (lexicons) and data type (text) is also presented. Finally, challenges are identified and future work is proposed. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s11192-021-04046-2.

Social media platforms have rapidly grown their user base since their emergence in the late 1990s. Since then, people have published their thoughts, opinions, experiences, ratings and even details of their daily lives through social media applications such as Facebook, Twitter, Weibo and many others. Consequently, social media data has become a great source of real-time data that can be useful for many sectors. In politics, social media data is used to predict election results (Caetano et al., 2018; Jaidka et al., 2019); in e-commerce, it is used to create recommendations; in economy, it is used to make inferences about stock market movements (Bollen et al., 2011; X. Zhang, Zhang, et al., 2018; Zhang, Chen, et al., 2018); in tourism, it is used to recommend the best hotels, experiences and places (A. S. H. and in education, it is used for teaching and learning resources (Pereira et al., 2018).

The real-time property of social media data has been examined in numerous studies. These studies found that social media is a source of real-time or near-real-time data, especially in the case of accidents and crises, because people post on their accounts immediately after an incident occurs (Daly et al., 2013; Osborne et al., 2014; Y. Gu et al., 2016; S. Zhao et al., 2017). So, in time-sensitive domains such as transportation, real-time social media data can be very useful; timing has the potential to influence users' everyday activities and even save lives. In the transportation network, any delay could result in severe problems, for example, people being late for work and school, ambulance delays, or overcrowded stations. Usually, transportation companies use physical sensors (Guerrero-Ibáñez et al., 2018) to collect information about traffic situations and delays. These sensors are costly and need maintenance. In addition, any defect or attack on the sensors network can cause problems (X. in the delivery of information, which can lead to problems in the traffic system and organization.

In case of defects or faults in the sensors' network, other real-time data sources are needed for troubleshooting. One of the most suitable sources, in terms of time and budget, is social media data. Social media data can provide real-time information about various transportation statuses from the commuters' perspective. It can help in identifying road hazards (Kumar et al., 2014), monitoring traffic situations (D'Andrea et al., 2015; D. Wang et al., 2017), detecting events or accidents (Candelieri & Archetti, 2015) and others.

Furthermore, people in charge of the transportation network can draw inferences from commuters' opinions and complaints (Abalı et al., 2018) toward transportation networks (Adeborna & Siau, 2014; Gal-Tzur et al., 2014). They need these inferences to decide on the best paths to take in order to maximise network utilisation. Commonly, decision-makers use traditional data collection approaches such as questionnaires and surveys to gain an understanding of public opinion (Pournarakis et al., 2017). Compared with social media data, those traditional methods are time-consuming and costly.

Due to the reasons above, employing social media data in transportation-related studies has become a trend. Social media data can be obtained by performing a simple crawl of the network. Therefore, this paper presents a systematic review of transportation studies that have adopted social media analysis (TRR-SMA) and were published between 2008 and 2018. It shows the structure of the field by providing information on targeted topics, types of data used, data collection methods and data analysis methods; thus, this paper provides new researchers with comprehensive information on how social media data has been utilized in transportation research in the past and for which goals. In addition, it provides in-depth information on the challenges, issues and research gaps that future researchers should address; hence, it can serve as a starting point for new TRR-SMA researchers.

The rest of the paper is organized as follows. Firstly, an overview of the existing survey papers is presented. Secondly, the research methodology of this study is described. Thirdly, our systematic mapping review analysis is reported. Fourthly, the main findings and possible future works are discussed. Finally, the conclusion of the paper is presented. In this study, the terms papers, literature, works and studies are used interchangeably. In addition, applications and platforms are also considered synonyms.

There are several review papers in the field. In producing review papers, three key factors affect the reported papers: the query, the publication period and the digital libraries used in the search process. Changing at least one of those factors can generate a different review paper. Each existing review paper differs from ours in several aspects: the query used in the search process, the period of the published literature, the reported data and its type, and the methodology of the analysis. In the following, we first present each review paper and then our systematic mapping review analysis.

Nikolaidou et al. presented a review of the social media data used, methods and challenges in transport planning and public transit quality. They reviewed around 50 papers published between 2010 and 2016. Lv et al. (2017) presented a literature review of social media-based transportation research using the social network analysis method on 67 papers from the period between 2011 and 2015. They showed the research collaboration in the field based on authors, institutions and countries. Rashidi et al. (2017) presented a review of social media data used for modelling travel behaviour, with its advantages and challenges. They ended up with 800+ papers from Scopus, published between 2007 and 2015, for analysis, but mentioned only around 70 papers in their review. Moreover, as they aimed at modelling travel behaviour, they focused on location data. Chaniotakis et al. (2016) produced a mapping review of studies that used social media data and platforms. In addition, the authors addressed the research challenges and opportunities in using social media for transportation studies. Chaniotakis et al. review was based on 22 literature published between 2008 proposed a review using around 70 papers published between 2008 and 2014. The review covered the social media data that could be used in transportation-related research. They performed a more in-depth study and discussed whether social media data could be used along with transportation data. In addition, they presented text mining methods and the challenges in the transportation field.

Comparatively, the analysis in our literature review was conducted in a more systematic way. The selection process for our primary studies was designed to ensure reliability and reproducibility. This implies that our analysis results can be regenerated by following the steps presented in the Methodology section. Most of the above-mentioned review papers did not report the query used, so a comparison based on the query was not possible. Rashidi et al. (2017) were the only ones to report their query, which differs from the one we used. We shaped the query using general terms since we are looking at the field from a broad perspective to extract the social media platforms, data, and data analysis approaches used, as well as the research targets. In the broadest sense, it is important to provide large-scale coverage of the data being analysed. Limiting any of the targeted data to predefined types and categories would reduce the number of outcomes, resulting in the omission of some data forms. So, we did not use the name of any social media platform in forming our query.

In addition, our paper presents a review of the literature published in the years between 2008 and 2018 from four digital libraries, suggesting that the review has a broader scope than previous studies. From the aspect of the analysed data classification, we present a more fine-grained examination by analysing various components while others focused on one of these components. The components that have been included in the analysis are the distribution of the publication over countries, first authors, publishers and years, keywords analysis, focusses of transportation-related research and types of the used social media data and platforms. To the best of our knowledge, this paper is the first to analyse the research subjects by countries and years in order to show the trends in the relevant research field, as well as the first to present the trend analysis of social media platforms used for TRR-SMA. Furthermore, it is the first to analyse the text data attributes according to the aims of using them.

We adopted a systematic mapping review methodology to provide an overview and research structure of the field of TRR-SMA (Petersen et al., 2008). This paper differs from systematic literature reviews (SLRs), as the latter aim to identify, analyse, evaluate and report the existing research in a field using well pre-defined and repeatable steps. These steps generate the primary study set (Kitchenham et al., 2009). The primary set contains the papers resulting from the search and selection process that will be reported in the paper. Meanwhile, a systematic mapping study or scoping study aims to structure and give an overview of the field of interest by classifying the papers in the primary study set and analysing them according to their numbers in the categories (Petersen et al., 2008).

In conducting our study, we adapted the guidelines provided by (Kitchenham et al., 2009) and (Petersen et al., 2008) with slight modifications to suit the context of our research objectives. Kitchenham and Charters proposed the guidelines for generating an SLR in the field of software engineering, while Petersen et al. presented guidelines of a systematic mapping method in the same field. A systematic mapping study for the software engineering field, based on a similar methodology, was found in another study (Zakari et al., 2018) .

The main steps suggested by Petersen et al. (2008) and adopted in this study are shown in Fig. 1. The first step is defining the research questions (RQs) that will be answered by the study. The second step is constructing the search protocol and search, while the third step is screening the search results using the inclusion and exclusion criteria. The fourth step is constructing the classification scheme and defining the categories, and finally, extracting the desired data from the primary set and mapping it to the categories. The following sub-sections describe each step of our research process in detail.

The ultimate aim of this study is to investigate research in the field of social media-based transportation. By investigating this research, we identify the elements that have to be taken into consideration when starting any research in the field; hence, the study gives direction on how to utilize social media in transportation-related research.

Three main RQs have been identified. Table 1 shows these RQs with their respective motivations. The first and second RQs lead to multiple sub-RQs (s-RQs) as they involve different data to be analysed; hence, these s-RQs form the subsections that present the results in a more organized way. The s-RQs are shown in Table 2 and Table 3. As for RQ3, it covers the challenges, findings and future works; since these three are inextricably linked and lead to one another, they will be addressed for each analysed data type rather than being divided into s-RQs.

Table 1 Research questions and their motivations
• RQ1: How social media is used in transportation research based on social media analysis? Motivation: To identify the trends in the field, the used keywords, the used social media data, the used social media platforms and the used approaches
• RQ2: What are the aims of transportation researches based on social media analysis? Motivation: To identify the targets of the researches and their trends in the world and by countries
• RQ3: What are the challenges, principal findings and possible future works in the field? Motivation: To identify the challenges in the field and main findings from the analysis and draw the needs of the field

Table 2 Sub-research questions of RQ1 and their motivations
• s1-RQ1. Motivation: To identify the trend in the publication in the field in terms of years, countries, publishers and first authorship
• s2-RQ1: What are the used keywords in the field? Motivation: To identify the most used keywords by authors in the researches
• s3-RQ1: What are the social media data/attributes used by the researchers? Motivation: To identify the used social media attributes, the most used ones and the aim of using them
• s4-RQ1: What are the roles of text data/text mining in the TRR-SMA field? Motivation: To identify the usages of the text data
• s5-RQ1: What are the social media platforms used by the literature? Motivation: To identify the most used social media platforms and their usage trend
• s6-RQ1: What are the datasets used by researchers? Motivation: To identify the datasets and the methods of collecting them
• s7-RQ1: What are the approaches used to analyse social media data in transportation researches based on social media analysis? Motivation: To identify the most used methods for analysing social media data

Table 3 Sub-research questions of RQ2 and their motivations
• s1-RQ2. Motivation: To identify the targets of the researches
• s2-RQ2: What are the trended subjects in the world and by countries? Motivation: To identify the trends of research subjects around the world and in the target countries
• s3-RQ2: What are the social media attributes used for achieving the targets? Motivation: To identify the role of social media data in the research field to achieve each research target

The search protocol contains three steps: defining the digital sources, constructing the search query and, lastly, applying the search process. Figure 2 shows an overview of the search and selection process.

Four digital libraries (DLs) were chosen to refine the selection of papers: IEEE Xplore, ACM, Web of Science and Scopus. Google Scholar was not considered as it contains too large a proportion of irrelevant literature to this study, in addition to grey literature.

Based on the research questions defined in Section "Defining the research questions", terms and keywords were identified. These terms and keywords were revised according to the pre-scanned literature. Furthermore, we revised them by including synonyms. The terms and keywords are illustrated in Table 4. By combining the mentioned terms, the following query was constructed: (transport* AND ("sentiment analysis" OR "opinion mining" OR "text mining" OR "social media analysis" OR "social network analysis")). General terms were used in our query. Instead of using the names of social media platforms, which we hoped to learn from the studies, we used the term "social media analysis" and its synonyms to describe the process of analysing social media data.
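As an illustration of how the query combines one domain term with a group of analysis synonyms, the following minimal Python sketch rebuilds the query string from its parts. The function name `build_query` and the variable names are hypothetical; each digital library may also require its own field codes or syntax adjustments, which are not modelled here.

```python
def build_query(domain_term, analysis_terms):
    """Join the analysis synonyms with OR, then AND them with the domain term."""
    quoted = " OR ".join(f'"{term}"' for term in analysis_terms)
    return f"({domain_term} AND ({quoted}))"


# Terms taken from the query reported in the text.
ANALYSIS_TERMS = [
    "sentiment analysis",
    "opinion mining",
    "text mining",
    "social media analysis",
    "social network analysis",
]

# The wildcard transport* matches transport, transportation, transports, etc.
QUERY = build_query("transport*", ANALYSIS_TERMS)
print(QUERY)
```

Keeping the synonyms in a list makes it easy to see that no platform name (such as Twitter or Facebook) is part of the query.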

The search was performed on the 16th of January 2019 using the previously mentioned DLs and by searching through titles, abstracts and keywords. The total number of search results was 703, as illustrated in Fig. 2.

Any paper to be included in the primary study set has to fulfil the inclusion and exclusion criteria to assure its relevance to the field and its ability to answer the RQs. In other words, the paper has to fulfil all the inclusion terms and none of the exclusion terms. The defined Inclusion (IC) and Exclusion (EC) terms are shown in Table 5. In the case of IC4 and EC4, the extended version was included because it usually provides more information about the procedures, the experiment, the findings, and the assessment, as opposed to reporting the process without results or evaluation, or reporting results based on only a portion of the data. Moreover, it is worth mentioning that citations do not play any role in our inclusion and exclusion criteria. The search results were refined in two screening stages using the IC/EC terms: the first stage according to the title and abstract, and the second stage according to the full text. Table 6 shows examples of excluded studies and the corresponding exclusion term.
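The rule that a paper must satisfy every inclusion term and no exclusion term can be sketched as a simple filter. The sketch below is only illustrative: the `Paper` record and its boolean fields are hypothetical stand-ins for judgements that were made manually by reading titles, abstracts and full texts, and only the criteria whose labels appear in Table 5 are referenced.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical fields; in the study these were judged by human screening.
    title: str
    language: str
    peer_reviewed: bool            # journal or conference/peer-reviewed paper
    mentions_dataset: bool         # IC3 / EC3
    superseded_by_extension: bool  # EC4: an extended version exists
    about_transport: bool          # EC5 (one side)
    uses_social_media_analysis: bool  # EC5 (other side)


def exclusion_reasons(paper):
    """Return the labels of every exclusion term the paper triggers."""
    reasons = []
    if paper.language != "English":
        reasons.append("EC2")
    if not paper.mentions_dataset:
        reasons.append("EC3")
    if paper.superseded_by_extension:
        reasons.append("EC4")
    if not (paper.about_transport and paper.uses_social_media_analysis):
        reasons.append("EC5")
    return reasons


def is_included(paper):
    """A paper is kept only if it triggers no exclusion term."""
    return paper.peer_reviewed and not exclusion_reasons(paper)


kept = Paper("Example A", "English", True, True, False, True, True)
dropped = Paper("Example B", "Indonesian", True, False, False, False, True)
print(is_included(kept), exclusion_reasons(dropped))
```

Recording the triggered labels, rather than a bare yes/no, mirrors how Table 6 pairs each excluded study with its exclusion term.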

The construction of the classification scheme depends on prior knowledge of the field and the first screening phase of the literature. The prior knowledge was gained by scanning existing reviews and literature in the field. The first screening was applied using the abstract and the title of the papers. During this process, the most frequent topics were extracted to form the categories. The topics that appeared most frequently in the literature were the approaches used, social media platforms, social media data types and countries. These categories formed the base of the mapping. To add more value to the results, we added the timing attribute. By analysing categories according to time and trends, we discover the patterns in the research. Figure 3 shows the classification scheme followed in analysing the research.

In this stage, the primary study set (Online Resource 1) was screened in full for data extraction. The primary set (the 74 papers) was identified through the first screening phase, based on the title and abstract, and the second phase, based on the full text. The data extraction process was done using Excel. The Excel sheets can be provided if needed. The data were extracted and mapped to the specified categories in the classification scheme. The results of the mapping and the analysis were visualized using Tableau, a visualization tool widely used in the business intelligence industry.

Table 5 Inclusion (IC) and exclusion (EC) criteria
• The work is a journal paper or a paper in a conference proceeding/peer-reviewed paper
• EC2: Papers written in languages other than English
• IC3: Clearly mentioned the dataset used
• EC3: The dataset is not clear/mentioned
• IC4: Workshop/journal papers that have been extended from conference papers
• EC4: Conference papers that have been extended to journal/workshop papers
• EC5: Works related to social media analysis but not for transportation-related subject or vice versa

This section presents the answers to the research questions. Tableau was used for visualizing the results.

Five sub-questions were generated from RQ1 as listed in Table 2. By combining their answers and drawing a conclusion, RQ1 will be answered.

To perform an activity analysis, we have broken down the number of publications according to years, countries of publication, first authorship and publishers (journals and conferences).

• Activity according to years: In the years between 2008 and 2018, 74 papers were selected according to our search methodology ("RQ1: How social media is used in transportation research based on social media analysis?" Section). Our analysis points to growing attention toward the TRR-SMA field. In the period between 2008 and 2012, there was no research activity according to our primary set. This is believed to be attributable to the social media evolution that began in 2011-2012 (O'Regan, 2018). In 2012, Facebook users reached more than 1 billion and in 2011, the number of tweets per day exceeded 140 million. In 2013, modest activity started in the field, followed by a prominent rise in the number of publications in the following years. Figure 4 illustrates the trend in the publication activity over the past decade.
• Activity according to countries: Our analysis seeks to distinguish the countries most interested in the TRR-SMA field. Figure 5 shows the top active countries in the field. The USA produced the largest share of the publications. It published around 19% of the total publications (74 papers), followed by Indonesia, China and the UK, which produced approximately 11%, 9.5% and 8% of the publications, respectively. Cumulatively, these four countries produced around half of the publications in the field. Noticeably, the USA is dedicating high attention to the field. This high attention can be attributed to the need for improvements in transportation infrastructure as stated by the Council on Foreign Relations (cfr). It also stated that the USA transportation lost to South . TIGER focuses on environment, energy and surface transport. In 2012, the USA president approved progressing the plan. This may explain the reasons for the United States' activity, which began in 2014, and the increase in the ranking of the United States' transportation infrastructure, which was ranked sixth in the world in 2017-2018.
• Activity according to first authorship: This type of activity analysis aims to discover the most active authors in the field. The activity of an author was calculated by counting the publications of which he/she was the first author. The top five authors in the past ten years were Gal-Tzur, Candelieri, Ali, Salas and Serna, with 2 publications each.
• Activity according to publishers: Here, the term "publishers" refers to journals and conferences that have published work in the field of TRR-SMA. Table 7 shows the top publishers and their corresponding number of publications.
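The country shares reported above can be checked with a few lines of arithmetic. The sketch below converts the rounded percentages into approximate paper counts and sums the four shares; the counts are inferred from the rounded percentages, so they are an assumption rather than figures reported in the study, and all variable names are illustrative.

```python
TOTAL_PAPERS = 74

# Approximate shares of the primary set, as reported in the text (percent).
reported_shares = {"USA": 19.0, "Indonesia": 11.0, "China": 9.5, "UK": 8.0}

# Approximate number of papers behind each rounded percentage (inferred).
approx_counts = {
    country: round(TOTAL_PAPERS * share / 100)
    for country, share in reported_shares.items()
}

# Combined share of the four most active countries.
combined_share = sum(reported_shares.values())

print(approx_counts)
print(combined_share)
```

The combined share of 47.5% is consistent with the statement that these four countries produced around half of the publications.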

Authors usually use keywords to point out the research subjects, fields, tools, approaches and techniques they use in their literature. Hence, keywords are the best identification of literature. Other researchers in the field use keywords to retrieve literature related to their field from digital databases (Sharma & Mediratta, 2002). Sharma and Mediratta describe keywords as "the 'keys' to unlock the desired scientific paper abstracts/full articles from a vast collection of related publications". For these reasons, keyword analysis is important, on the one hand, to authors, who should choose proper keywords to identify their work and make it easy to reach. On the other hand, keywords are important to researchers in the field, who need to choose suitable ones during the search process to maximize the relevance of the retrieved publications. Figure 6 shows the most used keywords with their frequency. As expected, the most used keywords are related to the social media analysis and transportation fields, as these are the two main topics of this paper and were the base of the keywords used in the search process. The most used keyword, "Social Networking (Online)", indicates the social media networks, while "Data mining", "Text Mining", "Big Data", "Natural Language Processing Systems" and "sentiment analysis" form the sub-fields of social media analysis.

Other keywords are related to the approaches, data attributes, social media platforms and subjects used in the field; "Twitter" is the most used social media platform in the field (see s5-RQ1 answer); "Air Transportation", "Traffic Congestion", "Transportation Services", "Customer Satisfactions" and "Accidents" fall under the most targeted research subjects in the field (refer to RQ2 answer).
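A keyword frequency analysis of this kind can be sketched with a counter over the author keywords of each paper. The example below is a minimal illustration: the three keyword lists are invented placeholders built from keywords named in the text, not the study's data, and the normalisation (trimming and lower-casing) is an assumption about how variants such as "Text Mining" and "text mining" might be merged.

```python
from collections import Counter


def keyword_frequencies(papers_keywords):
    """Count in how many papers each normalised keyword appears."""
    counts = Counter()
    for keywords in papers_keywords:
        # A set ensures a keyword is counted once per paper.
        normalised = {keyword.strip().lower() for keyword in keywords}
        counts.update(normalised)
    return counts


# Hypothetical keyword lists for three papers (placeholders only).
example_papers = [
    ["Twitter", "Sentiment Analysis", "Traffic Congestion"],
    ["twitter", "Text Mining", "Air Transportation"],
    ["Data mining", "sentiment analysis ", "Accidents"],
]

frequencies = keyword_frequencies(example_papers)
print(frequencies.most_common(3))
```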

One of the pillars for structuring the TRR-SMA area is the data used in the studies. Users' posts on social media sites provide data beyond the text, picture, or video they contain. In addition to the posting time, posts are tagged with the location if users activate the geotagging feature. Online Resource 1 shows the social media data used by the researchers. Text, location and time data were the frequently used social media attributes.

Other data such as followers, number of tweets, friends and retweets were used to gauge the users' influence. The number of tweets was also used as a gauge of traffic congestion. Ratings were used as a gauge of public opinion.

Text data was utilized by all papers in our primary set. The role of text data and the text attributes used were investigated further and are presented in Table 8. Location and time attributes are important in the transportation domain, especially in identifying road conditions and issues; they are fundamental in defining the exact location and time of incidents and traffic. Therefore, social media is considered a real-time source of information, as people post about incidents that have happened. Many papers did not just employ the geotagged location; they also used text data to detect locations. This is attributable to the low amount of geotagged data on social media (Sloan and Morgan 2015).
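The fallback from geotags to text-based location detection can be illustrated with a simple gazetteer lookup. This is only a sketch under stated assumptions: the gazetteer entries, coordinates and post structure are hypothetical, and the reviewed studies may have used more elaborate techniques (for example, named entity recognition) that are not shown here.

```python
import re

# Hypothetical gazetteer mapping place names to (latitude, longitude).
GAZETTEER = {
    "main street": (40.0, -75.0),
    "central station": (40.1, -75.1),
}


def detect_location(post):
    """Prefer the geotag; otherwise look for a known place name in the text."""
    if post.get("geotag") is not None:
        return post["geotag"], "geotag"
    text = post.get("text", "").lower()
    for name, coordinates in GAZETTEER.items():
        # Word boundaries avoid matching a place name inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            return coordinates, "text"
    return None, "unknown"


tagged = {"text": "Accident ahead", "geotag": (41.0, -74.0)}
untagged = {"text": "Huge jam near Central Station right now", "geotag": None}
unknown = {"text": "Stuck in traffic again", "geotag": None}

for post in (tagged, untagged, unknown):
    print(detect_location(post))
```

Returning the source of the location alongside the coordinates makes it possible to report how many posts relied on text rather than on a geotag.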

Social media platforms have become a major part of human life, and with their growing importance, many platforms have been launched. Choosing the proper platforms for research is essential. Online Resource 1 shows the social media platforms used by researchers in the TRR-SMA field and Fig. 7 shows the usage trend of the platforms over the past decade. Surprisingly, the platforms' usage trends are contrasting; on the one hand, Twitter is the most popular platform in the field, despite not being the most popular worldwide. On the other hand, while Facebook is the most popular social media platform worldwide, its use in the field is limited. In a ten-year period, it was used in just 9 out of 74 papers. Twitter is the data source for approximately 72% of the papers. It is a microblog; it limits the number of characters per tweet and tags the tweets with the time and location (if users allow it). Facebook offers users the same features with one main difference: there is no limit on the number of characters per post. Due to this, it is thought that processing tweets is easier than processing Facebook posts. As the number of characters per tweet is limited, people point directly to their subject without further explanation or description. However, the main reason behind the limited usage of Facebook data is retrievability. Facebook has announced limitations on its APIs (application programming interfaces). Facebook APIs are used to crawl Facebook data, and restricting API access means limiting Facebook data access. In addition, Twitter has announced that precise location tagging would be removed from its platform. According to the company, the precise location tagging will be available for images taken with Twitter's camera. This creates a challenge for fields that depend on precise location, such as transportation, and necessitates the search for alternative ways to detect locations.

The contents of social media platforms are either open or closed; some platforms allow content retrieval through APIs, while others do not. The datasets used by researchers were collected as follows:

• Twitter datasets: Twitter datasets were collected using the APIs. Twitter APIs allow the developer to access and retrieve Twitter contents including users' data and timeline, retweets, hashtags data and others.
• Facebook datasets: Researchers mainly used the Graph API to collect Facebook data. The problem with the Graph API and other Facebook APIs is that Facebook limits access to users' data. The ProSuite tool was used to collect Facebook data by Baj-Rogowska (2017), while Ali et al. (2017) collected the data manually.
• Weibo datasets: There are two approaches to retrieving Weibo contents: requesting APIs and crawling. APIs are usually paid and limit the number of queries (Y. Chen et al., 2018). Crawling is performed using HTTP requests (Y. Chen et al., 2018) or crawlers such as Crawlzilla and Selenium (S. Chen et al., 2016).
• TripAdvisor datasets: The TripAdvisor content APIs are available, but they are only for use on travel websites. Since TripAdvisor's APIs are not available for academic research or data analytics, scrapers and crawlers are used to obtain the data.
• Others: Other datasets from review platforms such as Google reviews (K. Lee & Yu, 2018) and Yelp (Gao et al., 2016) are used. Google reviews can be retrieved using Google APIs, which have a retrieval limit of 5 reviews per location. As for Yelp, it offers an open dataset for academic purposes besides its APIs for business.

The second step, after defining the data that will be used, is to understand how to use it to obtain the desired useful information. The approaches were classified into three groups: machine learning-based (ML) approaches, natural language processing-based (NLP) approaches, and statistical-based (SL) approaches. These categories, however, can overlap since several methods fall into more than one of the three categories. Machine learning algorithms are commonly used for classification, regression, and grouping related instances; in the literature, they were mostly used to identify the commuters' perspective toward the transportation network or to identify posts related to transportation or events. Machine learning approaches can be divided into two categories: supervised and unsupervised.

Supervised machine learning methods need pre-labelled data as training samples to perform classification. Support Vector Machine (SVM) and Naïve Bayes (NB) are among the most popular supervised machine learning approaches. They are well known for their performance in text classification; hence, they have been commonly used in the literature. The SVM algorithm finds the hyperplane that separates the groups of training data with the maximum margin. It then decides which side of the hyperplane a new instance belongs to by using the features of the new instance and the information obtained from the training data. SVM was employed for sentiment analysis to identify public opinion regarding the transportation network or commuters' complaints (Candelieri et al., 2015; Pournarakis et al., 2017; Sinha et al., 2017; Windasari et al., 2017; Yang et al., 2016) and for distinguishing transport-related posts from unrelated ones (Y. Chen et al., 2018; Gal-Tzur et al., 2014; Salas et al., 2017; Salas et al., 2018). NB utilises Bayes' theorem from statistics. Using the training data and the features of a new instance, NB calculates the likelihood that the instance belongs to each class. NB was employed to classify posts related to transportation or events (Abalı et al., 2018), to analyse commuters' sentiment toward transportation (Alamsyah et al., 2018; Dutta Das et al., 2017; Fiarni et al., 2018; Kumar et al., 2014; Liyang et al., 2016; Sternberg et al., 2018) and to predict vehicle recalls (X. Zhang et al., 2015). Multiple studies compared the two, and some compared them with other approaches, including decision trees (DT), for the same aims: (Alamsyah et al., 2018; Anastasia & Budi, 2016; Giancristofaro et al., 2016; Gupta et al., 2018; Rane & Kumar, 2018; Z. Zhang, Zhang, et al., 2018; Zhang, Chen, et al., 2018) used multiple classifiers to compare their results in analysing commuters' sentiment toward transportation-related topics, while (D'Andrea et al., 2015; Gal-Tzur et al., 2018; Hoang et al., 2016; Kuflik et al., 2017; Tse et al., 2016) used different machine learning techniques to identify posts related to a transportation topic. Other classification approaches, such as Maximum Entropy (ME) (Dutta Das et al., 2017; Samonte et al., 2018) and Logistic Regression (LR) (Rane & Kumar, 2018; Zhang, Chen, et al., 2018), were also used by some researchers. ME selects, among the distributions consistent with the training data, the one with the highest entropy, whereas LR models the probability of a class as a logistic function of the input features.
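To make the NB description above concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. The toy posts, the labels and the function names `train_nb` and `predict_nb` are illustrative assumptions; they are not data or code from any of the reviewed studies, which typically relied on existing library implementations.

```python
import math
from collections import Counter, defaultdict


def train_nb(posts, labels):
    """Collect per-class word counts and class frequencies from labelled posts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(posts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_nb(text, model):
    """Return the class with the highest posterior log-probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Log likelihood of the word with add-one smoothing.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set (not from the reviewed studies).
posts = [
    "bus was late again terrible service",
    "train delayed and crowded",
    "driver was friendly and the ride was smooth",
    "great service clean train on time",
]
labels = ["negative", "negative", "positive", "positive"]
model = train_nb(posts, labels)
print(predict_nb("the bus was late and crowded", model))
```

The same pre-labelled posts could be fed to an SVM instead; only the decision rule changes, not the need for labelled training data.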

Unsupervised machine learning methods do not need pre-labelled data; instead, they rely on data features to find the similarity between instances. K-Nearest Neighbour (KNN) was used to group instances according to their similarity to their nearest neighbours (D'Andrea et al., 2015; Kumar et al., 2014; Rane & Kumar, 2018; Saldana-Perez et al., 2017; X. Zhang et al., 2015; Z. Zhang, Zhang, et al., 2018; Zhang, Chen, et al., 2018), although it is strictly a supervised method because those neighbours must be labelled. A popular clustering algorithm, used within the sentiment analysis process to discover the top topics, is k-means together with its variant, spherical k-means (Liau & Tan, 2014); both produced similar topics.
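As an illustration of clustering without labels, the sketch below implements a bare-bones spherical k-means, which differs from standard k-means in that vectors are normalised to unit length and compared by cosine similarity. The term-count vectors, the three-word vocabulary and the deterministic initialisation with the first k points are assumptions made for the example only.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def spherical_kmeans(vectors, k, iterations=10):
    """Cluster vectors by cosine similarity; returns one cluster index per vector."""
    points = [normalize(v) for v in vectors]
    # Assumption: deterministic initialisation with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        for i, p in enumerate(points):
            sims = [sum(a * b for a, b in zip(p, c)) for c in centroids]
            assignments[i] = sims.index(max(sims))
        # Update step: mean of the members, projected back onto the unit sphere.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return assignments


# Hypothetical term-count vectors over the vocabulary [delay, crowded, price].
docs = [[3, 1, 0], [0, 0, 2], [2, 2, 0], [0, 1, 4]]
clusters = spherical_kmeans(docs, k=2)
print(clusters)
```

In this toy run the posts about delays and crowding fall into one cluster and the posts about price into the other, without any label being supplied.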

Another ML approach is deep learning (DL). DL is known for its high accuracy and its ability to learn from large amounts of data. DL includes many architectures; one of the basic ones is the multilayer perceptron (MLP), which was used by two studies (Ali et al., 2018) for sentiment analysis and, in other work, for identifying traffic-related information. Chen et al. (2018) used a combination of other DL architectures, namely convolutional neural networks (CNNs) and long short-term memory (LSTM), and compared them with other machine learning methods.

As text is the most commonly used data in the research, natural language processing-based approaches have a dominant role in the information extraction and data analysis processes. Before performing any data analysis, classification or knowledge extraction task, the bag of words (BOW) approach is typically used to represent text data. BOW converts text data into a numerical form that ML and other algorithms can process (Gal-Tzur et al., 2014; Giancristofaro et al., 2016; Liau & Tan, 2014; Musaev et al., 2018; Pournarakis et al., 2017; Rane & Kumar, 2018). An n-gram is a concept from computational linguistics that refers to a contiguous sequence of n items extracted from text or speech; generating n-grams was part of the text pre-processing stage in several studies (Ali et al., 2018; Daly et al., 2013; Windasari et al., 2017). Another text pre-processing stage is parsing, which denotes the syntactic analysis of text data and is usually used to extract the grammatical rules of the language (Luckner et al., 2017; Zhang, Kotkov, et al., 2016; Zhang, Sun, et al., 2016). The dominant natural language processing approach in the studies is lexicons. This is likely due to their simplicity. There are two types of lexicons: the sentimental lexicon (SL) and the dictionary (Dic). A dictionary contains the words related to a domain (domain lexicon) or a language (general lexicon), and a sentimental lexicon is a dictionary that associates each word with its sentiment polarity. The generation of lexicons may involve machine learning and/or statistical-based approaches, as shown in Table 9, which lists the lexicons used by researchers and the generated ones. The Bing Liu lexicon (Hu & Liu, 2004) is the most used general lexicon.
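The three text-processing ideas described above (bag of words, n-grams and lexicon-based polarity) can be sketched in a few lines. The six-word `SENTIMENT_LEXICON`, the simple tokenizer and the sum-of-polarities scoring rule are hypothetical simplifications chosen for illustration; real lexicons such as Bing Liu's are far larger, and the reviewed studies used a variety of scoring schemes.

```python
from collections import Counter

# Hypothetical mini sentiment lexicon (word -> polarity); for illustration only.
SENTIMENT_LEXICON = {
    "late": -1, "crowded": -1, "dirty": -1,
    "clean": 1, "friendly": 1, "fast": 1,
}


def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    stripped = (t.strip(".,!?") for t in text.lower().split())
    return [t for t in stripped if t]


def bag_of_words(tokens):
    """Map each token to its number of occurrences."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Return the list of contiguous n-token sequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexicon_polarity(tokens, lexicon=SENTIMENT_LEXICON):
    """Sum the word polarities; the sign of the total gives the post polarity."""
    score = sum(lexicon.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tokens = tokenize("The bus was late, crowded and dirty!")
print(bag_of_words(tokens))
print(ngrams(tokens, 2))
print(lexicon_polarity(tokens))
```

Words missing from the lexicon contribute nothing to the score, which is exactly the coverage limitation that affects small general lexicons.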

Another approach that can be used in the pre-processing stage of text data is Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is a statistical information retrieval metric and one of the most common traditional term weighting approaches; it assigns a weight to each token based on how often it occurs in a document and how rare it is across the collection. TF-IDF was mostly used by researchers to weight the words in the topic modelling process (Candelieri et al., 2014; Fu et al., 2015; Itoh et al., 2016; Sinha et al., 2017; C. Wang et al., 2018; Windasari et al., 2017; Yang et al., 2016; Z. Zhang, Zhang, et al., 2018; Zhang, Chen, et al., 2018). Topic modelling approaches usually use statistical models to represent the data and draw inferences. Latent Dirichlet Allocation (LDA) is the most popular topic modelling approach. It uses the Dirichlet distribution to represent the text data and find the most important topics. It was used by 13.5% of the studies (Alamsyah et al., 2018; Buch et al., 2018; Kovács-Győri et al., 2018; Kulkarni et al., 2018; Pournarakis et al., 2017; D. Wang et al., 2017; Wayasti et al., 2018). Researchers used LDA to identify sub-topics that could signify aspects of the transportation network that users discussed or issues that users encountered. LDA was also used before classification algorithms to find categories or topics, since its output can be interpreted as categories.
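The TF-IDF weighting described above can be sketched as follows. This version assumes tf is the raw count divided by the document length and idf is log(N / df); other variants (smoothed idf, log-scaled tf) are equally common, and the reviewed studies do not necessarily use this exact formula. The three posts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumes tf = count / document length and idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy posts.
docs = ["traffic jam on highway", "traffic accident on bridge", "cheap flight deal"]
weights = tf_idf(docs)
# "jam" appears in one post only, so it outweighs "traffic", which appears in two.
print(weights[0])
```

Terms that occur in every document receive a weight of zero under this formula, which is why TF-IDF is useful for filtering out words that do not discriminate between topics.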

To identify these subjects, the sub-research questions of RQ2 are answered as follows:

s1-RQ2: Which subjects were targeted by researchers in the TRR-SMA field?

Transportation is a broad subject, which leads researchers to target either transportation in general or one or more of its sub-subjects. The works in our primary set are classified according to their target, as illustrated in Fig. 8. Online Resource 1 lists the literature and its targets, grouped by targeted country to show the trends and the targeted subjects.

To demonstrate the trend of the research subjects in each country, the table in Online Resource 1 groups the literature according to the targeted countries. The target country is the source location of the data, which is defined when retrieving the data through the APIs. Where the location of the data was undefined, a Worldwide location was assumed. The Air Transport subject is an exception: the location was assumed to be Worldwide if the work targeted multiple airlines from different countries; otherwise, the source location is the airline's home country. Figure 9 illustrates the distribution of the subjects around the world over the past decade.
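The location-assignment rule above can be written as a small decision function. This is only one reading of the rule, under the assumption that the Air Transport exception applies when the data location is undefined; the function name `target_location` and its parameters are hypothetical.

```python
def target_location(subject, data_location=None, airline_home_countries=()):
    """Assign the target country of a study (one reading of the rule above)."""
    if data_location:
        # The location was defined when the data were retrieved.
        return data_location
    if subject == "Air Transport":
        countries = set(airline_home_countries)
        if len(countries) == 1:
            # All targeted airlines share one home country.
            return next(iter(countries))
    # Undefined location, or airlines from several countries.
    return "Worldwide"


print(target_location("Issues"))
print(target_location("Air Transport", airline_home_countries=["Turkey"]))
print(target_location("Air Transport", airline_home_countries=["USA", "China"]))
```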

From Online Resource 1 and Fig. 10, we can extract the dominant subjects and their trends in each country. In the USA, the dominant subject was "General", which was trending from 2016 to 2018, while "Issues" was trending in 2015; "Road Transport" attracted little interest. In the UK, the "Issues" and "General" subjects received equal attention; "General" was trending in 2018 and "Issues" in 2017-2018. No publications targeted Air Transport in the UK, which suggests little interest in that subject there. In China, attention was directed to the "Issues" subject, which was targeted by researchers in 2014, 2016 and 2018. In Indonesia, the dominant target was "OT", the target of 6 papers out of 7; this may indicate a weakness of the transportation network that leads commuters to depend on OT services. In other countries, the trend is hard to tell, as they were targeted by only a small number of studies.

However, other information can be extracted, such as the region of services. Air Transport is a global service, which is logical as airlines are responsible for moving people between countries. Another interesting fact concerns OT services: Uber was the target when the location was worldwide, which can indicate that Uber is a global company, whereas Grab and Gojek were the targets when the studies targeted Indonesia, which indicates that Grab and Gojek are local or at most regional companies. This can be verified by looking at the companies' websites.

Used and generated lexicons analysis

To answer this question, the subjects were associated with the attributes used (see Online Resource 1). Noticeably, text, location and time were mostly used together for exploring the traffic situation and its causes, roads and, in general, network conditions. Text data and ratings were utilized for measuring commuters' attitudes.

From the answers to the previous RQs, we can recognize challenges and future work, which are drawn from the findings as follows:

Fig. 9 The trended subjects around the world in the past decade

Fig. 10 The trended subjects in the top locations

• Activity: Through activity analysis (see s1-RQ1 answer), the evolution trend of the TRR-SMA field was shown. Activity in the field is increasing over time. The country with the dominant role in production is the USA, with 14 publications, and transport-related publishers are the biggest source of publications. A related challenge is to explain the inactivity or low activity of countries such as Japan, Turkey, Singapore, Canada and others. This could be attributed to the economic situation, research productivity, the availability of data, the interest in transportation infrastructure, the priority given to transportation planning and enhancement, and many other factors. An in-depth review of the grey and white literature (organisations' reports and technical documents), in addition to official news about the development of the transportation network, may identify these reasons.
• Social Media Data: The attached Online Resource 1 shows each study of the primary set with its corresponding subject, the data used by the researchers to achieve their aims and the platform used. Table 8 shows the purposes of using text data and the text attributes used for each purpose. By combining Online Resource 1 and Table 8, the subject of the research, the corresponding data used and the purpose of using text data can be extracted for each study. Furthermore, it can be concluded that text was used by all papers for different goals, because text is the main content of the posts shared on social media platforms such as Twitter, Facebook and Weibo. Further capabilities of text data need to be explored, such as providing precise data about weather conditions and location. As less than 1% of social media data is geotagged, location extraction becomes one of the most important objectives of text data. However, several questions surround a location extracted from text: Is it the event or incident location? Is it the commuter's location? Is the extracted name really a location name? Is the extracted place fake? Further research is needed to answer these questions. As for weather, it can seriously affect the transportation network, as it can cause road closures and other disruptions. One study in our primary set extracted weather from text using keywords (Daly et al., 2013) and another used other resources to obtain weather data (Rybarczyk et al., 2018). Given the impact of weather on transportation, further consideration by researchers is needed, besides exploring how well text data can provide precise weather information; in other words, how likely people are to share weather information through social media, and to what degree the extracted data is true and precise. Another goal of text is to extract transportation modes (e.g. metro, bus, taxi); however, more research is needed because modes may have different names in different regions or even countries.
• Another essential type of data in transportation studies is time. Time is easy to retrieve, as it is attached to posts when people share them, but to what extent has it been explored? In general, time is used by researchers to identify the time of incidents and traffic. However, other uses of time need to be explored, such as delay detection and the validation of locations extracted from text: several locations extracted from different posts can be compared according to time to see whether it is possible to move between them within the recorded time.
• Social Media Platforms: Multiple applications were used as sources of data, and many studies used more than one application as a source (refer to Online Resource 1). Facebook data was not used frequently. This is perhaps related to the limitations of the Facebook APIs and to Facebook's allowance of long posts. One concern that researchers may face in the future is Twitter ending its exact-location geotagging service and limiting it to images taken by its camera. This will create a need for further research that uses images and text to detect locations, or that explores the efficiency of other social media applications such as Instagram. Instagram can reflect the real situation of the transportation network since its main content is images, so the efficiency of Instagram and, in general, of images in delivering transport-related data needs further investigation.
• Social Media datasets: In general, self-extracted datasets were used; each group of researchers collected its own dataset from the platforms using either APIs or crawlers/scrapers. This makes it difficult to compare the results of publications on the same subject. Two key factors that affect the transportation domain are real-time data and location; hence, by using different query attributes such as location and time, different datasets can be retrieved. However, a standard open dataset is required so that the performance of different studies can be compared.
• Approaches: Lexicons, SVM and NB are the most employed approaches in the TRR-SMA field, and the usage of lexicons is the highest. This high usage can be related to the simplicity and effectiveness of lexicons in topic and sentiment classification. The problems with lexicons are domain dependency, coverage and outdating. Domain dictionaries used for topic classification have a problem in detecting the domain words. Usually LDA, TF or TF-IDF is used in the detection process; however, these methods also return a set of words unrelated to the domain. As a result, researchers filter the domain-related words manually, which takes time and effort. This raises the need for automatic filtration of the words. Other problems concern sentimental lexicons. The coverage and outdating of words are issues for many lexicons. Take, for example, the general lexicon most used in the literature, Bing Liu (Hu & Liu, 2004). It comprises approximately 7000 words, while the Oxford dictionary contains around 170 K words. This large difference in the number of words creates the coverage issue. Another issue is the updating of lexicons and the change in word usage over time, where a new generation may stop using some words and start using other words to indicate other meanings (Schulz et al., 2010). To overcome these issues, frequent updates of lexicons are needed, yet how frequently are the general lexicons being updated? Moreover, the sentiment of a word may change depending on the domain or context; hence, recognizing changeable sentiment is a dilemma for general lexicons. To the best of the authors' knowledge, there is no transport-related lexicon. The creation of such a lexicon is believed to improve the performance of systems that use transportation-related data.
• Subjects: The focuses or targets of the studies were identified and grouped by country.

The aim of grouping was to indicate the trends and targeted subjects by country. One future direction is to investigate why researchers in some publishing countries perform research on other countries. This can be due to major incidents that occur in the targeted country, the requirements of funding agencies, shortages in the research field in the targeted country, the availability of data, the authors' countries of origin and other factors. Other possible directions are presented as follows:

Transportation modes: Air transport and road transport were the most explored modes in the studies; this is likely due to the availability of data and the availability of these transportation modes in most countries. On the one hand, most air transportation research has focused on customer opinion mining to investigate service quality by gathering posts and feedback on the airlines' or airports' social media pages. On the other hand, research on road transportation focuses on finding out what commuters think about rail, OT, and buses, as well as issues of transportation infrastructure. However, other research directions can be explored for both air transport and road transport, such as creating an automated alerting system for delays or accidents, or an automated reply system for commuters' inquiries and complaints. Other modes of transport include water transport modes such as ferries. Even though water transport is a significant mode of transportation in many countries, no research has focused on it. Water transport was part of the general subject in a few studies (Gao et al., 2016; Rybarczyk et al., 2018).

Routes: In routes, the origin and destination locations were extracted from text. Usually, the combination of "from" and "to" is used for this purpose. In certain cases, users simply mention the destination; in these cases, a method to find the origin is required, as well as a method to verify that the extracted location is actually a destination.
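A minimal sketch of the "from ... to" heuristic described above is given below, including the fallback case in which only a destination is mentioned. The regular expressions, the assumption that place names are capitalised words, and the function name `extract_route` are illustrative choices, not the method of any particular study; real posts are much noisier and would need a gazetteer or named-entity recognition.

```python
import re

# Assumption: a place name is one or more capitalised words.
PLACE = r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*"
ROUTE_PATTERN = re.compile(
    rf"\b[Ff]rom\s+(?P<origin>{PLACE})\s+to\s+(?P<destination>{PLACE})"
)
DEST_ONLY_PATTERN = re.compile(rf"\bto\s+(?P<destination>{PLACE})")


def extract_route(post):
    """Return (origin, destination); origin is None when only 'to X' is found."""
    match = ROUTE_PATTERN.search(post)
    if match:
        return match.group("origin"), match.group("destination")
    match = DEST_ONLY_PATTERN.search(post)
    if match:
        return None, match.group("destination")
    return None, None


print(extract_route("Stuck on the train from Kuala Lumpur to Petaling Jaya again"))
print(extract_route("Heading to Bangsar now"))
```

When the origin comes back as `None`, the missing origin has to be recovered from another source, which is the open problem the paragraph above points to.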

Issues: When an issue arises in the transportation network, such as a road hazard, accident or road closure, social media may serve as an early warning device. However, social media posts are human-created knowledge that is not always accurate and can be influenced by people's moods and psychology. Validation methods for the extracted issues are therefore needed, particularly in the event of an emergency or sudden problem. One such method is to compare the information extracted from social media with official news and other potential resources, taking into account the time and location of the extracted information. However, further research into text analysis and summarisation techniques is needed for this comparison. In addition, further research is required on automated issue discovery from text, rather than relying on human effort or pre-defined lexicons.
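The validation idea above, comparing an extracted issue with official reports while taking time and location into account, can be sketched as a simple proximity check. The distance and time thresholds, the dictionary layout of the records and the function names are assumptions for illustration; no specific values are prescribed in the reviewed literature.

```python
import math
from datetime import datetime, timedelta


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def is_corroborated(post, official_reports, max_km=2.0, max_gap=timedelta(minutes=60)):
    """True if any official report is close to the post in both space and time.

    The thresholds are illustrative assumptions.
    """
    for report in official_reports:
        close_in_time = abs(post["time"] - report["time"]) <= max_gap
        distance = haversine_km(post["lat"], post["lon"], report["lat"], report["lon"])
        if close_in_time and distance <= max_km:
            return True
    return False


# Hypothetical records: one official report and two social media posts.
official = [{"lat": 3.1400, "lon": 101.6900, "time": datetime(2018, 5, 1, 8, 30)}]
nearby_post = {"lat": 3.1390, "lon": 101.6869, "time": datetime(2018, 5, 1, 8, 10)}
distant_post = {"lat": 3.2000, "lon": 101.7500, "time": datetime(2018, 5, 1, 8, 10)}
print(is_corroborated(nearby_post, official), is_corroborated(distant_post, official))
```

A check of this kind only confirms that the two reports are compatible in place and time; deciding whether they describe the same issue still requires the text analysis and summarisation techniques mentioned above.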

Recommendation systems: The studies did not cover transportation recommendation systems. Personalisation is an attribute that social media may provide to recommendation systems. It can be used in OT services to suggest personalised rides, especially in ridesharing services where multiple passengers ride together. A ride can be recommended based on the profiles of the passengers to ensure friendlier and more personalised rides. In other cases, social media may be used to suggest routes, especially in the event of unexpected closures or extreme traffic congestion. Furthermore, by analysing users' activity patterns, personalised trips or routes can be suggested. Moreover, social media can exploit the public trend in a particular location (e.g. a city) to recommend transportation modes that would not be recommended by transportation applications, such as scooters and cabriolets.

COVID-19: During the COVID-19 pandemic, several countries imposed movement restrictions, resulting in a significant drop in traffic and, in some cases, empty roads. Furthermore, at times, all modes of transportation were shut down, and even when they were running, commuters' complaints differed from those made prior to COVID-19. Commuters' main concerns in the COVID-19 period would be travel restriction laws, possible COVID-19 transmission methods in transportation modes, and the precautions required during rides, in addition to trip cancellation and compensation. This opened the door to new research directions, such as using social media data as a warning in the event of a transportation emergency (for example, passengers fainting on board or new COVID-19 cases being discovered on transport), or using it as a source of information about people who do not follow COVID-19 precautionary rules during rides, such as social distancing and mask wearing, among others.

In this work, we structured the TRR-SMA field by performing a systematic mapping review. We identified the foundations of the field through prior reading and constructed the classification scheme based on them. In addition, the query terms were defined. These terms were then used to retrieve the studies from 4 DLs: IEEEXplore, ACM, Web of Science and Scopus. The search results were refined using the ECs and ICs. In the end, 74 papers were included in the primary set.

The foundations of the classification scheme were activity, keywords, social media data, social media platforms and targets. Through the analysis, the trends were drawn and discussed.

Activity analysis was done in terms of country, year, publisher and first author. The TRR-SMA field is getting increasing attention. Throughout the years, the most productive country was the USA and most productive publishers were the transportation-related publishers. Moreover, publications were analysed in terms of the data, platforms and approaches used. Text data was the most utilized data by the papers. Hence, further analysis of text data was performed and presented in terms of the aims and the corresponding used text attributes. In addition, an analysis of used lexicons was presented as lexicons are the most used approach. In the end, an analysis in terms of research subjects was presented. In the analysis of the subjects, papers were grouped by countries to show the trends and the covered subjects in each country.

By accumulating and analysing the results, possible challenges and future works were drawn and discussed. These challenges and future works can guide new researchers and create new research opportunities. The most crucial future works are creating a transport-specific lexicon, creating personalised transport-related recommendation systems using social media data, conducting research on water transport and exploring the effect of COVID-19 on the TRR-SMA field.

The online version contains supplementary material available at https://doi.org/10.1007/s11192-021-04046-2.

Detecting citizen problems and their locations using twitter data

An approach to sentiment analysis -The case of airline quality rating

Dynamic large scale data on Twitter using sentiment analysis and topic modeling case study: Uber

Mapping online transportation service quality and multiclass classification problem solving priorities

Feature-based Transportation Sentiment Analysis Using Fuzzy Ontology and SentiWordNet

Fuzzy ontology-based sentiment analysis of transportation and city feature reviews for safe traveling

Consumers' trust and popularity of negative posts in social media: A case study on the integration between B2C and C2C business models

Twitter sentiment analysis of online transportation service providers

Sentiment analysis of Facebook posts: The Uber case

Twitter mood predicts the stock market

Big Data Analytics: A Case Study of Public Opinion Towards the Adoption of Driverless Cars

Using sentiment analysis to define twitter political users' classes and their homophily during the 2016 American presidential election

Analyzing tweets to enable sustainable, multi-modal and personalized urban mobility: Approaches and results from the Italian project TAM-TAM

Detecting events and sentiment on twitter for improving urban mobility

Web-based traffic sentiment analysis: Methods and applications

Is the grass greener? Mining electric vehicle opinions

Tweeting about public transit -Gleaning public perceptions from a social media microblog

Mapping social media for transportation studies

Big Data Analytics on Aviation Social Media: The Case of China Southern Airlines on Sina Weibo

Detecting traffic information from social media texts with deep learning approaches

Real-time detection of traffic from twitter stream analysis

Westland row why so slow? Fusing social media and linked data sources for understanding real-time traffic conditions

Sentimental Analysis for Airline Twitter data

Implementing rule-based and naive bayes algorithm on incremental sentiment analysis system for Indonesian online transportation services review

Steds: Social Media Based Transportation Event Detection with Text Summarization

The potential of social media in delivering transport policy goals

An improved methodology for extracting information required for transport-related decisions from Q&A forums: A case study of TripAdvisor

Public transit customer satisfaction dimensions discovery from online reviews

Mining complaints for traffic-jam estimation: A social sensor application

Predicting Sentiment toward Transportation in Social Media using Visual and Textual Features

Enhancing transport data collection through social media sources: Methods, challenges and opportunities for textual data

From twitter to detector: Real-time traffic incident detection using social media data

Sensor technologies for intelligent transportation systems

Twitter usage across industry: A spatiotemporal analysis

Using twitter data for transit performance assessment: A framework for evaluating transit riders' opinions about quality of service

Crowdsensing and analyzing micro-event tweets for public transportation insights

Supporting sustainable system adoption: Sociosemantic analysis of transit rider debates on social media

Mining and summarizing customer reviews

Visual exploration of changes in passenger flows and tweets on mega-city metro network

Predicting elections from social media: A three-country, three-method comparative study

Improving sentiment scoring mechanism: A case study on airline services. Industrial Management and Data Systems

What makes tourists feel negatively about tourism destinations? Application of hybrid text mining methodology to smart destination management. Technological Forecasting and Social Change

Systematic literature reviews in software engineering-a systematic literature review

#London2012: Towards citizen-contributed urban planning through sentiment analysis of twitter data

Automating a framework to extract and analyse transport related social media content: The potential and the challenges

Unsupervised classification of online community input to advance transportation services

Where not to go? Detecting road hazards using Twitter

High enough? Explaining and predicting traveler satisfaction using airline reviews

Know your hotels well! An online review analysis using text analytics

Assessment of airport service quality: A complementary approach to measure perceived service quality based on Google reviews

Gaining customer knowledge in low cost airlines through text mining. Industrial Management and Data Systems

Comparison of tourist thematic sentiment analysis methods based on weibo data

Understanding public sentiment toward I-710 Corridor Project from social media based on Natural Language processing

Data-enabled public preferences inform integration of autonomous vehicles with transit-oriented development in Atlanta

Public transport stops state detection and propagation Warsaw use case

Social media based transportation research: The state of the work and the networking

An emotional polarity analysis of consumers' airline service tweets

Detection of damage and failure events of road infrastructure using social media

Utilizing social media in transport planning and public transit quality: Survey of literature

The smartphone and social media

Real-time detection, tracking, and monitoring of automatically discovered events in social media

Managing traffic flow based on predictive data analysis

BROAD-RSI-educational recommender system using social networks interactions and linked data

Systematic mapping studies in software engineering

A computational model for mining consumer perceptions in social media

Mining open and crowdsourced data to improve situational awareness for railway

Sentiment Classification System of Twitter Data for US Airline Service Analysis

Exploring the capacity of social media data for modelling travel behaviour: Opportunities and challenges

Travel and us: the impact of mode share on sentiment using geo-social media and GIS

Traffic event detection framework using social media

Incident detection using data from social media

Classification of traffic related short texts to analyse road problems in urban areas

Sentiment analysis of customer engagement on social media in transport online

Language change across generations for robots using cognitive maps

Use of social media for assessing sustainable urban mobility indicators

Road condition monitoring application based on social media with text mining system: Case Study: East Java

Transport analysis approach based on big data and text mining analysis from social media

Sustainability analysis on Urban Mobility based on Social Media content

Importance of keywords for retrieval of relevant articles in medline search

Web and social media analytics towards enhancing urban transportations: A case for Bangalore

Who tweets with their location? Understanding the relationship between demographic characteristics and the use of geoservices and geotagging on Twitter

Analysing Customer Engagement of Turkish Airlines Using Big Social Data

Enabling Next Generation Logistics and Planning for Smarter Societies

Tensistrength: Stress and relaxation magnitude detection for social media texts

Sensing pollution on online social networks: A transportation perspective

Mining social media for open innovation in transportation systems

Social networks and railway passenger capacity: An empirical study based on text mining and deep learning

Real-time traffic event detection from social media

Real time road traffic monitoring alert based on incremental learning from tweets

Mining customer opinion for topic modeling purpose: Case study of ride-hailing service provider

Sentiment analysis on Twitter posts: An analysis of positive or negative opinion on GoJek

Social Media Analysis on Evaluating Organisational Performance: A Railway Service Management Context

Software fault localisation: A systematic mapping study

Online stakeholder interaction of some airlines in the light of situational crisis communication theory

A framework for evaluating customer satisfaction

Fault activity aware service delivery in wireless sensor networks for smart cities

Predicting vehicle recalls with user-generated contents: A text mining approach

Improving stock market prediction via heterogeneous information fusion. Knowledge-Based Systems

A combinational classification for the customers of airline platform based on text mining

Real-time multimedia social event detection in microblog