Influences of Temporal Factors on GPS-based Human Mobility Lifestyle
Thao, Tran Phuong
2020-09-22

Analysis of human mobility from GPS trajectories has become crucial in many respects, such as policy planning for urban citizens, location-based service recommendation/prediction, and especially mitigating the spread of biological and mobile viruses. In this paper, we propose a method to find the temporal factors affecting human mobility lifestyle. We collected GPS data from 100 smartphone users in Japan and designed a model that consists of 13 temporal patterns. We then applied a multiple linear regression and found that people tend to keep their mobility habits on Thursday and on the days in the second week of a month, but tend to lose their habits on Friday. We also explain some reasons behind these findings.

Understanding individual human mobility plays an important role, especially now that the geographic spread of the infectious virus that causes COVID-19 has taken the world into uncharted territory. It is also a critical factor in policy planning [1], [2], travel demand forecasting [3], [4], location-based recommendation/service advertising [6], and location-based personal authentication [5]. M. Gonzalez et al. [25] showed that human mobility follows a high degree of regularity. Therefore, several sophisticated models have been proposed to determine the factors influencing the probability that people keep or lose their mobility lifestyle. The factors can be classified into spatial, temporal, and social ones, among which the temporal factors have been shown to be the most important. However, the temporal factors found in existing work are still coarse-grained (e.g., weekend versus weekday, without clarifying which specific day of the week or which week of the month is responsible).

In this paper, we investigate the recurrence and temporal periodicity inherent to human mobility inferred from mobile phone data using more fine-grained factors. We collected GPS data from 100 randomly recruited smartphone users in Japan. We designed a model consisting of 13 temporal factors from 3 pattern categories (i.e., days of the week, quarters of the month, and holidays including weekends and national public holidays) as independent variables. We also propose an algorithm to compute the probability (i.e., a similarity score) that users revisit the locations they visited before, which serves as the target outcome. We then applied a multiple linear regression and performed a t-test. We found that people tend to keep their mobility habits on Thursday and on the days in the second week of a month, but tend to lose their habits on Friday. We also discuss some reasons behind these findings and their applications.

The rest of this paper is organized as follows. Section II introduces related work. Section III presents our proposed methodology. Section IV gives the experiment and our findings. Section V discusses applications and limitations of our method. Section VI describes the conclusion.

In this section, we introduce related work about the factors affecting location habits. The work can be classified into three research directions. S. Zhao et al. [10] observed that 80% of successive checked-in POIs (Points-of-Interest) happen within 32 kilometers.
They explained that people often act around their home or office, so even if a check-in is independent of the last one, the successive check-in can still happen in the same activity area. S. Yali et al. [11] analyzed the two location-based social networks Foursquare and Gowalla. They found that the probabilities for distances within 5 km are greater than 40%, which decrease to 17% and 8% within 10 km on the two datasets, respectively. Most users checked in within 20 km. T. Thao et al. [13], [14] leveraged the idea that locations recorded at nearby clock times are more strongly correlated in physical distance than locations recorded at distant clock times, since a human needs a period of time to move gradually from one location to another. The experimental results showed that the extracted distance coherence features, along with the longitudes and latitudes, could improve the authentication accuracy. While the papers [10], [11], [14] focused on the fact that closer locations have a higher probability of being visited by users, Y. Hongzhi et al. [12] raised a more challenging problem: people travelling to a new city where they have no activity history. They showed that people tend to travel a limited distance when visiting venues and attending events. Furthermore, the activity records in their non-home cities amount to only 0.47% of the activity records in their home cities. To solve the problem, the authors analyzed two factors: user interest (e.g., kids pay more attention to playgrounds while young ladies may be more interested in cosmetics stores) and local preference (e.g., people are more likely to visit local sightseeing attractions and attend popular events when they travel to an unfamiliar city). They found that these factors also affect the decision to visit an unfamiliar location.

G. Huiji et al. [7] extracted the correlations between the check-in time and the corresponding check-in preferences of a user. They found that weekly patterns (7 days of the week) and weekday/weekend patterns can capture the temporal check-in preferences of a user. However, the results do not clearly indicate which day of the week, weekday, or weekend is the affecting factor, only the general patterns. S. Zhao et al. [23] examined the day-of-week check-in pattern at different hours: users make more check-ins in the late afternoon and evening, from 4:00 p.m. to 3:00 a.m., on weekends than on weekdays. Saturday and Sunday show a similar pattern, while the days from Monday to Friday show a similar pattern that is different from the weekends. This suggests that weekday and weekend have two different types of effects on the check-in behavior of a user. J. Bao et al. [24] split a week into two parts, weekdays and weekends. For each part, they split a day into hourly time bins, so a total of 24 × 2 time bins are used to express the temporal patterns. M. Gonzalez et al. [25] measured the return probability of each individual, i.e., the probability that a user returns to the position where he or she was first observed after t hours. For a two-dimensional random walk, this probability should follow 1/(t ln^2(t)); the measured return probability is instead characterized by several peaks at 24 h, 48 h, and 72 h, which indicates a strong tendency of humans to return to locations they visited before. M. Xie et al. [8] explored the importance of spatial, temporal, and social factors and found that they can be ranked as follows: temporal effect > content effect > spatial effect.
This indicates that the temporal factors may provide the most information, although combining all of them is of course the best solution.

H. Wang et al. [15] studied the social link as an important factor affecting people's choices when deciding which new place to visit. The authors analyzed the Gowalla dataset and found that a friend or a friend-of-a-friend has visited more than 30% of the new places visited by a user in the past. With the same observation that social friends tend to have similar check-in behavior, several papers [16]-[19] also extracted a similarity score between users derived from their social friendships; the experimental results showed that it could enhance the accuracy. Besides the links of friends and friends-of-friends, H. Bagci et al. [20] showed that local experts are also a factor affecting the place to visit. J. Bao et al. [21] pointed out that users who visit many high-quality locations tend to have high knowledge about the vicinity; in a similar manner, if a particular location is visited by many high-quality users (i.e., experts), it is more likely to be a quality location. L. Kai et al. [22] aimed at service locations only, such as restaurants, fitness centers, etc. They found that factors including demographics, preferences, and service levels (e.g., price range, whether there is a discount, advertisements) can increase the probability of mobile users visiting the service locations.

In this section, we present our proposed methodology, including the data collection and the model design. A navigation application named MITHRA (Multi-factor Identification/auTHentication ReseArch) was created to collect GPS information from Android smartphone users. One hundred users were randomly recruited and thus live and work in random areas. The data consists of timestamps and GPS information (longitude and latitude). The application collects the data every 5 minutes. The users have different data collection periods because the period depends on the time at which each user starts running the application. The entire collected data from all the users spans January to April 2017. The timestamps have a precision of seconds. The precision of the longitudes and latitudes is six decimal places (e.g., 36.xxxxxx), which corresponds to about 0.1 meters. Regarding data privacy, a privacy consent is shown to the users during the installation process; the application can only be successfully installed if the users accept the terms and conditions agreement. We do not collect any personal information such as name, age, date of birth, or gender, except the address, which is used for user identity. Our project was reviewed by the Ethics Review Committee of the Graduate School of Information Science and Technology, the University of Tokyo.

First, we briefly describe how a linear regression works. Linear regression is a statistical method used for measuring whether a set of factors affects (or can be used to predict) a certain outcome. It can model the relationship between one or more independent variables (features) and one dependent (output) variable, and the value of the target function is expected to be a linear combination of the features. Formally, let f* denote the predicted value:

    f* = c_0 + c_1 x_1 + c_2 x_2 + · · · + c_n x_n,

where X = {x_1, x_2, · · · , x_n} denotes the set of features, n denotes the number of features, C = {c_1, c_2, · · · , c_n} denotes the set of coefficients, and c_0 denotes the intercept. c_0 is a constant representing the expected mean value of f* when x_i = 0 for all i ∈ {1, · · · , n}.
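As an illustration of this formulation, the following minimal Python sketch fits such a linear combination with scikit-learn's LinearRegression and reads off the intercept c_0 and the coefficients c_1, · · · , c_n. The synthetic feature matrix, the "true" coefficients, and the toy sample size are illustrative assumptions only, not values from our dataset.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples, n_features = 100, 13                     # mirrors 100 users and 13 temporal variables
X = rng.integers(0, 5, size=(n_samples, n_features)).astype(float)

# Hypothetical "true" intercept and coefficients, used only to generate toy targets.
true_c0, true_C = 0.5, rng.normal(size=n_features)
y = true_c0 + X @ true_C + rng.normal(scale=0.1, size=n_samples)

model = LinearRegression().fit(X, y)                # ordinary least squares under the hood
print("estimated intercept c_0:", model.intercept_)
print("estimated coefficients c_1..c_n:", model.coef_)
print("predicted f* for the first sample:", model.predict(X[:1])[0])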
There are several methods to solve the regression (e.g., Ridge Regression, Lasso, etc.), but we use the most common method, Ordinary Least Squares (OLS), which minimizes the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation:

    min over c_0, · · · , c_n of Σ_i (y_i − f*_i)^2,

where y_i denotes the observed target of the i-th sample and f*_i the corresponding predicted value. When x_1, x_2, · · · , x_n are correlated and the columns of the design matrix X are approximately linearly dependent, X becomes close to singular.

Table I. Example of matching the hourly representatives of D_test against the learned weights (one row per date δ; a representative that does not appear in U_α^learn scores 0):
    · · · : · · · , (lon_31, lat_31) → weight_31
    2017/04/02: (lon_10, lat_10) → 0, (lon_5, lat_5) → weight_5, · · · , (lon_31, lat_31) → weight_31
    2017/04/03: (lon_10, lat_10) → 0, (lon_3, lat_3) → weight_3, · · · , (lon_32, lat_32) → 0

We are now ready to define our model for the regression. For each user U, the model is defined as:

    score = c_0 + C_wdays · wdays + C_mquar · mquar + C_hdays · hdays,

where score represents the target function; wdays, mquar, and hdays represent the groups of variables related to the days of the week, quarters of the month, and holidays, respectively; and C_wdays, C_mquar, and C_hdays are the corresponding coefficient vectors.

In this part, we explain the algorithm used to calculate the similarity score, which measures the probability of a user re-visiting a location that he/she visited before. The scores also represent the mobility lifestyle pattern of a user. For each user U, the data is split into two parts based on the data collection time period: the data from the first half of the period is denoted by D_learn and the data from the later half is denoted by D_test. The similarity score between D_learn and D_test is used for the target function. The procedure to calculate the similarity score is described as follows.

a) Measuring the Template from D_learn: First, the longitude and latitude in each data record d_i ∈ D_learn are rounded from the original 6 decimal places to 2 decimal places, since the location accuracy of people's movement is often within a 1 km square. Let dat_i, tim_i, lon_i, and lat_i denote the date (year, month, day), the time (hour, minute, second), and the longitude and latitude after rounding, of d_i, respectively. Let H = {00:00-00:59, 01:00-01:59, · · · , 23:00-23:59} be the 24 hourly time periods, each denoted by h_α ∈ H where α ∈ [0, 23]. The records in D_learn are grouped into 24 subsets according to h_α. For each α, the following sets are constructed:
• T_α^learn = {(lon_i, lat_i)}: the set containing the longitude and latitude of all the records d_i such that tim_i ∈ h_α, regardless of dat_i.
• U_α^learn = {(lon_j^uniq, lat_j^uniq)} ⊂ T_α^learn: the set containing only the unique pairs of longitude and latitude, i.e., for all j, j′ ∈ [0, |U_α^learn|] with j ≠ j′, (lon_j^uniq ≠ lon_{j′}^uniq) ∨ (lat_j^uniq ≠ lat_{j′}^uniq) (note that this is an OR, not an AND, operation).
• W_α^learn = {weight_j}: the set containing the corresponding weight of each pair (lon_j^uniq, lat_j^uniq) ∈ U_α^learn. U_α^learn and W_α^learn have the same length. The weight is the percentage of time that the user U stays at the coordinate (lon_j^uniq, lat_j^uniq), that is, the ratio between the number of occurrences of the pair (lon_j^uniq, lat_j^uniq) and the length of T_α^learn:

    weight_j = |{(lon_i, lat_i) ∈ T_α^learn : (lon_i, lat_i) = (lon_j^uniq, lat_j^uniq)}| / |T_α^learn|.
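To make step (a) concrete, the following minimal sketch builds U_α^learn and W_α^learn from records of the form (date, time, longitude, latitude). This is our own illustration under the assumption that coordinates are already rounded to two decimal places; the helper name and the toy records are hypothetical, not an excerpt of the MITHRA implementation.

from collections import Counter, defaultdict

def build_template(d_learn):
    # d_learn: iterable of (date, time, lon, lat) records, lon/lat already rounded
    # to 2 decimal places. Returns, for each hourly period alpha, a dict mapping
    # each unique (lon, lat) pair (U_alpha^learn) to its weight (W_alpha^learn).
    t_learn = defaultdict(list)                      # alpha -> T_alpha^learn
    for date, time, lon, lat in d_learn:
        alpha = int(time.split(":")[0])              # hourly period h_alpha, 0..23
        t_learn[alpha].append((lon, lat))

    template = {}
    for alpha, coords in t_learn.items():
        counts = Counter(coords)                     # occurrences of each unique pair
        template[alpha] = {pair: cnt / len(coords) for pair, cnt in counts.items()}
    return template

# Toy usage with hypothetical records:
records = [("2017/01/05", "08:10:00", 139.70, 35.66),
           ("2017/01/06", "08:15:00", 139.70, 35.66),
           ("2017/01/06", "08:40:00", 139.75, 35.69)]
print(build_template(records))                       # hour 8: two unique pairs with weights 2/3 and 1/3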
b) Extracting Representatives from D_test: In D_learn, we grouped the data into 24 hourly periods regardless of the date. For D_test, we consider each distinct date before grouping the data of that date into 24 hourly periods. For each unique date δ in D_test and for each α ∈ [0, 23], we construct T_α^test = {(lon_i, lat_i)} in the same way as T_α^learn but with dat_i = δ. We then determine the representative r_δα^test of T_α^test by extracting the element (lon_i, lat_i) ∈ T_α^test at which the user U stays for the longest period of time on the date δ. Hence, for the entire D_test, we obtain up to 24 representatives per date δ.

c) Matching to Calculate Similarity Scores: For each date δ in D_test and for each α ∈ [0, 23], if the representative r_δα^test exists in U_α^learn, the similarity score s_δα is set to the corresponding weight from W_α^learn; otherwise, s_δα is set to zero. An example is given in Table I. After the scores for the 24 hours of each date δ are computed, all the scores in D_test for the user U are summed up and used as the final value of score; thus, each user U has one corresponding similarity score. For the example in Table I, the final score for U is weight_1 + weight_5 + weight_3 + · · · + 2·weight_31.

2) Variables: For each user U and each date δ mentioned above, the following binary variables were extracted. The first group consists of 7 binary variables corresponding to the 7 days of the week (i.e., whether δ is a Monday, · · · , whether δ is a Sunday), denoted by {mon, tue, · · · , sun}. The second group consists of 4 variables corresponding to the 4 weeks of the month (i.e., whether δ falls in the first week, · · · , whether δ falls in the fourth week), denoted by {wk1, wk2, wk3, wk4}. The third group consists of 2 variables related to holidays (i.e., whether δ is a national holiday and whether δ is a weekend day), denoted by {natl, wknd}. These 13 binary variables are summed over all the dates δ of each user U; wdays, mquar, and hdays represent the summed variables of the first, second, and third group, respectively. Let D_P denote the final data used for the regression, which consists of 100 samples with 13 variables.

The program is written in Python 3.7.4 on a MacBook Pro (2.8 GHz Intel Core i7, 16 GB RAM). The multiple linear regression model is executed using the scikit-learn package version 0.21. The t-test is computed using the statsmodels package version 0.11. The distribution of the 13 variables and the target score is given in Table II. While the independent variables (wdays, mquar, hdays) and the dependent variable (score) do not need to be normally distributed, normality is required for the residuals. The entire preprocessed data (D_P as mentioned in Section III-B2) has 100 samples corresponding to 100 users with 13 variables. We performed a Jarque-Bera test, and the result is shown in the second column of Table III. The p-value is less than 0.05, which indicates that the residuals are not normally distributed. Therefore, we conducted an analysis of the data outliers, described next.

First, we measured the z-scores of the 13 variables over the 100 samples. According to the empirical rule (the so-called 68-95-99.7 rule or three-sigma rule) [9], any z-score greater than 3 or less than -3 is considered an outlier: almost all of the data (99.7%) should lie within three standard deviations of the mean, i.e., 99.7% of the z-scores should fall within the range (-3, +3). We therefore scanned all the z-scores and found six samples for which at least one of the 13 variables has a z-score greater than 3 or less than -3. These 6 outliers are the 1st, 4th, 8th, 22nd, 30th, and 82nd samples in D_P, denoted by outlier_(-3,+3) = {s_1, s_4, s_8, s_22, s_30, s_82}.
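The outlier screening and the residual normality check can be sketched as follows. This is an illustrative example on synthetic data: the random matrix merely stands in for D_P, and jarque_bera from statsmodels is one possible implementation of the test.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

def outlier_samples(X, threshold=3.0):
    # Indices of samples for which at least one variable has |z-score| > threshold.
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.where((np.abs(z) > threshold).any(axis=1))[0]

def residual_jb_pvalue(X, y):
    # Fit OLS and return the Jarque-Bera p-value of its residuals.
    residuals = sm.OLS(y, sm.add_constant(X)).fit().resid
    jb_stat, jb_pvalue, skewness, kurtosis = jarque_bera(residuals)
    return jb_pvalue

rng = np.random.default_rng(1)
X = rng.poisson(3.0, size=(100, 13)).astype(float)   # stands in for the 100 x 13 matrix D_P
y = rng.normal(size=100)                             # stands in for the similarity scores
print("outlier sample indices:", outlier_samples(X))
print("Jarque-Bera p-value of the residuals:", residual_jb_pvalue(X, y))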
Our aim is to remove the smallest number of outliers such that the p-value of the residuals increases to 0.05 or more. We therefore run an algorithm that performs the Jarque-Bera test after removing each k-combination of the elements of outlier_(-3,+3), where k is chosen in ascending order from 1 to n = |outlier_(-3,+3)| = 6. Note that we do not need to check all Σ_{k=1}^{n} C(n, k) combinations: if a p-value equal to or greater than 0.05 is found at a certain k = k_p, it is unnecessary to check the combinations with k > k_p. Unfortunately, we could not find any outlier combination whose removal lets the residuals pass the Jarque-Bera test. We therefore reduced the outlier range from (-3, +3) to (-2.9, +2.9) and extracted 8 samples, say outlier_(-2.9,+2.9) = {s_1, s_4, s_8, s_21, s_22, s_30, s_82, s_94}. We performed the Jarque-Bera test in the same way and, fortunately, found two combinations at k = 3 whose removal boosts the p-value: C_1 = {s_1, s_21, s_94} and C_2 = {s_1, s_21, s_30}. The z-scores of all 100 samples are plotted in Fig. 1, where the 13 colors of the data points represent the 13 variables; all the data belonging to the 4 outlier samples s_1, s_21, s_30, and s_94 lie along the 4 red lines. The results of the tests are summarized in the last two columns of Table III. Let D_C1 = D_P \ C_1 and D_C2 = D_P \ C_2 denote the data after removing the outliers in C_1 and C_2, respectively. The Quantile-Quantile (QQ) plots of D_P, D_C1, and D_C2 are given in Figures 2, 3, and 4, respectively. It can be observed that the data points of D_C1 and D_C2 are closer to the straight 45-degree reference lines than those of D_P.

One may ask why we do not simply remove all the data outliers. First, removing all the outliers does not necessarily increase the p-value of the residuals: when we removed the 6 samples of outlier_(-3,+3) and the 8 samples of outlier_(-2.9,+2.9), the p-values became even worse (0.03 → 0.021 and 0.03 → 0.019, respectively). Second, keeping as many samples as possible preserves the nature of human behavior. That is why we balance this trade-off by searching for the combinations of outliers as described above.

We now apply the multiple linear regression on D_C1 and D_C2. The affecting factors are determined based on the p-values at 3 significance levels: significant affecting factors (p ≤ 0.001), nearly-significant affecting factors, and normal affecting factors. The results are summarized in Table IV. For D_C1, we found two normal factors, thu and fri, with positive and negative coefficients, respectively. This indicates that people tend to keep their movement lifestyle on Thursday but tend to lose it on Friday. For D_C2, we found one nearly-significant factor, fri, with a negative coefficient as in D_C1, and one normal factor, wk2, with a positive coefficient. This indicates that people tend to lose their movement lifestyle on Friday and tend to keep it on the days in the second week of the month. The result can be heuristically explained by the fact that Thursday and the second week of the month are the middle of the week and of the month, respectively: human behavior (even human mood and social interaction) is more stable then than in the first days of the week (the first week of the month) or in the days near the weekend (the weeks near the end of the month). In contrast, the result for Friday may be caused by "nomikai", a drinking party (often held on Friday and with co-workers) particular to Japanese culture. Even so, a deeper analysis and formal proof of these results should be investigated in future work.
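For reference, the regression and the accompanying t-tests can be sketched with statsmodels' OLS, which reports a coefficient, a t-statistic, and a p-value for each variable. The sketch below runs on synthetic data: the significance threshold, the helper name, and the toy relationship between thu, fri, and the score are our illustrative assumptions, not results from D_C1 or D_C2.

import numpy as np
import pandas as pd
import statsmodels.api as sm

FEATURES = ["mon", "tue", "wed", "thu", "fri", "sat", "sun",
            "wk1", "wk2", "wk3", "wk4", "natl", "wknd"]

def affecting_factors(X, y, alpha=0.05):
    # Fit score ~ 13 temporal variables and return the sign of every coefficient
    # whose t-test p-value is below alpha.
    design = sm.add_constant(pd.DataFrame(X, columns=FEATURES))
    results = sm.OLS(y, design).fit()
    print(results.summary())                         # coefficients, t-statistics, p-values
    pvalues = results.pvalues.drop("const")          # exclude the intercept
    selected = pvalues[pvalues < alpha]
    return {name: float(np.sign(results.params[name])) for name in selected.index}

rng = np.random.default_rng(2)
X = rng.poisson(3.0, size=(97, 13)).astype(float)    # 97 samples, as in D_C1 after outlier removal
y = 0.2 * X[:, 3] - 0.3 * X[:, 4] + rng.normal(scale=0.5, size=97)   # toy effect: thu up, fri down
print(affecting_factors(X, y))                       # thu (+1.0) and fri (-1.0) should dominate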
Our findings can help in understanding the psychology and behavioural science of human mobility, which is important for urban planning, traffic forecasting, and modeling the spread of biological and mobile viruses. They can also help enhance the effectiveness of location-based recommendation and location-based prediction, and enable advertisers to design and present their location services to targeted customers. For example, if a restaurant that a customer visited before knows that he tends to lose the habit of going to the restaurant on Friday, it can promote more discounts on that day rather than on the other days of the week.

In this paper, weekly temporal patterns were analyzed. Future work can examine daily temporal patterns, i.e., different time frames during a day such as {00:01-06:00, 06:01-12:00, 12:01-18:00, 18:01-24:00} (6-hour intervals), {00:01-03:00, 03:01-06:00, · · · , 21:01-24:00} (3-hour intervals), etc. Combining weekly and daily temporal patterns is also a promising approach to figure out in which time bins people tend to visit (or tend to lose the habit of visiting) their usual locations.

In this paper, we aimed to find which temporal factors affect the human mobility lifestyle. We collected GPS data, including longitude, latitude, and timestamp, from 100 randomly recruited participants in Japan using a smartphone application. We designed a regression model that uses 13 weekly temporal factors as independent variables, categorized into 3 pattern types: days of the week, quarters of the month, and holidays. We proposed an algorithm to compute the similarity score between the location history and the most recent location log. We applied a multiple linear regression with a t-test and found that people tend to keep their mobility habits on Thursday and on the days in the second week of the month, but tend to lose the habit on Friday.
References:
[1] Embedding economies of scale concepts for hub network design.
[2] Evaluating the effectiveness of urban growth boundaries using human mobility and activity records.
[3] Microsimulation of daily activity-travel patterns for travel demand forecasting.
[4] Understanding the tourist mobility using GPS: Where is the next place?
[5] Self-enhancing GPS-Based Authentication Using Corresponding Address.
[6] Discovering regions of different functions in a city using human mobility and POIs.
[7] Exploring temporal effects for location recommendation on location-based social networks.
[8] Learning Graph-based POI Embedding for Location-based Recommendation.
[9] Linear and Nonlinear Models: Fixed Effects, Random Effects, and Mixed Models.
[10] STELLAR: Spatial-temporal latent ranking for successive point-of-interest recommendation.
[11] An adaptive point-of-interest recommendation method for location-based social networks based on user activity and spatial features.
[12] LCARS: A Location-Content-Aware Recommender System.
[13] Location-based Behavioral Authentication Using GPS Distance Coherence.
[14] GPS-based Behavioral Authentication Utilizing Distance Coherence.
[15] Location Recommendation in Location-based Social Networks using User Check-in Data.
[16] iGSLR: Personalized Geo-Social Location Recommendation - A Kernel Density Estimation Approach.
[17] An Experimental Evaluation of Point-of-Interest Recommendation in Location-based Social Networks.
[18] Multi-Layered Friendship Modeling for Location-Based Mobile Social Networks.
[19] Inferring friendship from check-in data of location-based social networks.
[20] Context-Aware Friend Recommendation for Location-Based Social Networks using Random Walk.
[21] Location-based and preference-aware recommendation using sparse geo-social networking data.
[22] Building a targeted mobile advertising system for location-based services.
[23] Geo-Teaser: Geo-Temporal Sequential Embedding Rank for Point-of-Interest Recommendation.
[24] Geo-social Media Data Analytic for User Modeling and Location-based Services.
[25] Understanding Individual Human Mobility Patterns.