key: cord-0503102-dvrlkd5l
authors: Sinha, Subhrajit; Chakraborty, Meghna
title: Causal Analysis and Prediction of Human Mobility in the U.S. during the COVID-19 Pandemic
date: 2021-11-24
journal: nan
DOI: nan
sha: f16486cf82b24f0eb6d143acaa5f9d2cd2e4c575
doc_id: 503102
cord_uid: dvrlkd5l

Since the spread of COVID-19 in the U.S., which had the highest number of confirmed cases and deaths in the world as of September 2020, most states in the country have enforced travel restrictions, resulting in sharp reductions in mobility. However, the overall impact and long-term implications of this crisis for travel and mobility remain uncertain. To this end, this study develops an analytical framework that determines and analyzes the most dominant factors impacting human mobility and travel in the U.S. during this pandemic. In particular, the study uses Granger causality to determine the important predictors influencing daily vehicle miles traveled (VMT) and utilizes linear regularization algorithms, including Ridge and LASSO techniques, to model and predict mobility. State-level time-series data were obtained from various open-access sources for the period from March 1, 2020 through June 13, 2020, and the entire data set was divided into two parts for training and testing purposes. The variables selected by Granger causality were used to train three different reduced order models by ordinary least squares regression, Ridge regression, and LASSO regression algorithms. Finally, the prediction accuracy of the developed models was examined on the test data. The results indicate that the number of new COVID cases, social distancing index, population staying at home, percent of out-of-county trips, trips to different destinations, socioeconomic status, percent of people working from home, and statewide closure, among others, were the most important factors influencing daily VMT. Also, among all the modeling techniques, Ridge regression provided the best performance with the least error, while LASSO regression also performed better than the ordinary least squares model.

The novel coronavirus (COVID-19) pandemic is the defining global health crisis of our time and has had a significant impact on the way we understand our everyday lives. Since its emergence in Asia in late 2019, all continents in the world except Antarctica have been fighting the virus in earnest. The World Health Organization (WHO) [1] declared COVID-19 a global pandemic on March 11, 2020, and the United States declared a national emergency on March 13, 2020 [2]. As of September 15, 2020, almost 30 million cases of COVID-19 were confirmed in 215 countries around the globe [3], and among all countries, the U.S. had the highest number of confirmed cases and fatalities due to COVID-19 [3]. Several countries have closed their borders and imposed lockdowns, curfews, stay-at-home orders, and social distancing protocols, resulting in sharp reductions in mobility and travel demand at local, regional, national, and even international levels. By March 24, 2020, more than 20 percent of the world's population had been ordered to remain at home as governments and health and administrative organizations took extreme measures to protect their communities from the spread of the virus [4].

In the U.S., more than 40 states have already enforced stay-at-home orders, the earliest being in California, effective from March 19, 2020 [5]. In the U.S. alone, the mobility restrictions during the COVID-19 pandemic, in the form of travel bans, stay-at-home mandates, and lockdown policies, have impacted millions of people. Overall, human mobility has been severely impacted by the travel restrictions and by individual concerns about public gatherings, resulting in tremendous economic impacts in transportation sectors. However, the overall influence and the long-term implications of this pandemic for mobility and transportation systems remain unknown at this point. Against the background of this unprecedented global crisis, questions remain as to how the different factors during the pandemic affect human mobility and travel.

With the increasing availability of high-quality data related to COVID-19, analyzing transportation and mobility during and after this crisis is imperative. Although there are still many unknowns, statistical models and analytical tools would help produce evidence-based research and policy interventions after COVID-19 outbreak. At the national level, Zhang et al. (2020) developed a COVID-19 impact analysis platform [6] that can inform users about the spread of COVID-19 in the U.S., and the effects of the virus spread and government orders on mobility and social distancing in the country, using privacy-protected smartphone device location data, coupled with the census information. The platform gets updated daily and goes back to January 1, 2020, for benchmarking, and the results are scaled and aggregated to the entire population for both state and county levels [7] . Gao et al. (2020) reported on the interactive web-based mapping platform [8] developed by the GeoDS Lab at the University of Wisconsin, Madison, with the support of the National Science Foundation RAPID program [9] . This platform provides information on how people in different counties and states in the U.S. responded to the social distancing directives and guidelines. The platform integrates geographic information systems (GIS) and the daily updated human mobility patterns obtained from large-scale anonymous and aggregated smartphone data at the county-level [10] .

In recent times, research in several domains is heavily focusing on large-scale data analysis utilizing sophisticated computing capabilities and machine learning [11, 12, 13, 14, 15, 16, 17] . However, causal analysis of data for prediction purposes has received limited attention in the literature. Causality has been one of the oldest and under-examined questions in all of science. However, conceptually, its importance has been widely acknowledged in prior studies and used in various scientific disciplines including social media research [18] , neuroscience [19] , biological networks [20, 21] , and economics and finance [22, 23, 24] , among others. Causal analysis and influence characterization were initially geared towards time-series analysis and among the different measures, Granger causality [22, 24] , directed information [25, 26] , and Schreiber's transfer entropy [27] have gained the most attention. In the realm of dynamical systems, a new definition and measure of causal characterization, called information transfer, has been proposed recently [28, 29, 30] , where the authors show that the existing definitions of causality, namely, Granger causality, directed information, and transfer entropy fail to capture the correct causal structure in a dynamical system. Additionally, some recent studies [31, 32] provided a data-driven approach to infer the causal structure of a dynamical system from time-series data.

Although there are different measures of causality, for time-series data analysis, Granger causality is one of the simplest and most widely used methods. Broadly, in econometric studies, the Granger causality test has been popular in analyzing time-series data for identifying influential factors and for prediction purposes. Konstantakis et al. (2017) employed Granger causality, along with other quantitative techniques, to investigate the factors that affect automobile sales in Greece [33]. More recently, Homolka et al. (2020) determined macro and socio-economic indicators that may significantly predict car registrations across European countries with the help of the Granger causality test and Vector Autoregressive models [34]. Also, in the domain of transportation research, particularly in predictive analysis with big data, Granger causality has been employed widely. Beyzatlar et al. (2014) analyzed panel data from fifteen European countries (EU-15) using Granger causality to investigate the relationship between income and transportation [35]. In order to accurately build traffic flow prediction models, Li et al. (2015) utilized Granger causality to determine the potential dependence among the pool of predictor variables in the time-series big data collected by different sensors [36]. In a relevant study, McMullen and Eckstein (2012) tested Granger causality between vehicle miles traveled (VMT) and various measures of national economic activity over time [37]. The authors rightly argued that it is imperative to properly understand the relationship between VMT and other pertinent transportation and economic factors, as the VMT trend is one of the key components in transportation policies in general.

Moreover, regression analysis is commonly used in data analysis and machine learning [38], and is applied extensively in various domains, including transportation [14, 39, 40, 41, 42]. Among the various regression algorithms, linear regression is computationally one of the most efficient techniques and is often used as a starting point for many different problems, although in many cases the linear regression model may be suboptimal. In the analysis of time-series data coming from a dynamical system, linear regression arises naturally in data-driven analysis of dynamical systems using the transfer operators, namely the Perron-Frobenius and Koopman operators [43, 44, 45, 46]. More recently, studies [47, 48] used robust optimization techniques to compute the aforementioned operators from noisy data sets, and the resulting optimization problem was a variation of ordinary least squares (OLS), namely, least squares with regularization. Regularization is a standard technique used in data analysis to overcome some of the limitations of OLS, including overfitting and susceptibility to noise in the data [38]. The literature on transportation analysis utilizing regularization is scant. Recently, Polson and Sokolov (2017) developed a deep learning model utilizing a linear model fitted with $\ell_1$-regularization to predict traffic flows. The study showed that the deep learning architecture was capable of capturing non-linear spatio-temporal effects in traffic and providing short-term predictions of traffic flow [49]. Tan et al. (2011) proposed a semi-supervised Elastic Net regression method for pedestrian counting by utilizing sequential information between unlabelled samples and their temporally neighboring samples as a regularization term. The developed model was able to attain superior prediction performance and select representative features from the original set of features without losing their interpretability [50]. Hasan et al. (2017) proposed statistical techniques to identify spatial relationships among road links in an urban road network to select predictors for a short-term traffic prediction model for a given road link. The study used a time-lagged multiple linear regression method and utilized two analytical methods, the Granger causality test and Elastic Net regularization, using one year of traffic flow and speed data from the selected road network in Brisbane, Australia. For a given target link, the relevant predictors obtained by Granger causality and Elastic Net were used separately to build the respective traffic prediction models. The results showed that the Granger causality-based traffic prediction model provided better prediction accuracy than the one using Elastic Net regression [51]. More recently, Battifarano and Qian (2019) explored the spatio-temporal correlations between the urban environment, traffic flow characteristics, and surge multipliers and proposed a general framework for predicting the short-term evolution of surge multipliers in real time using a log-linear model with $\ell_1$-regularization, integrated with pattern clustering. The modeling algorithm was validated using Uber and Lyft data from Pittsburgh [52].

While there is a plethora of information available related to mobility and travel during this pandemic, it is critical to develop a robust methodological framework to accurately identify and analyze the key factors influencing human mobility and subsequently predict mobility using the selected factors in the time of such health crisis. To this end, this study develops an analytical framework that helps determine the most significant factors affecting mobility by utilizing Granger causality followed by predicting mobility using linear regularization algorithms including the Ridge, and Least Absolute Shrinkage and Selection Operator (LASSO) modeling techniques.

The paper is organised as follows. Section 2 describes the theoretical concepts of the Granger causality test, while Section 3 explains the analytical methodology of the ordinary least squares method and the Ridge and LASSO linear regularization algorithms. Section 4 describes the time-series data analyzed in this study along with its descriptive statistics. Sections 5 and 6 discuss the results of the causal analysis and the predictions of the regression models developed. Finally, Section 7 provides a summary and conclusion of this study along with its limitations and future research scope.

In this section, we briefly discuss the concept of Granger causality with its two different methodological approaches. Granger causality [22, 24] is a quantitative measure for inferring causal relationships between variables of a time-series data. It is based on the following two principles: (1) the cause happens prior to its effect, and (2) the cause has unique information about the future values of its effect. The intuition behind the definition of Granger causality is as follows. Suppose the goal is to predict the future of a variable $Y$. If it happens that the prediction of $Y$ improves by considering the past of variables $Y$ and $X$ as opposed to considering the past of only $Y$, then it is said that $X$ "Granger causes" $Y$.

Under assumption 1, the hypothesis for testing Granger causality of $X$ on $Y$ is

$$P[Y(t+1) \in \Omega \mid I(t)] \neq P[Y(t+1) \in \Omega \mid I_{-X}(t)], \tag{1}$$

where $P[Y(t+1) \in \Omega \mid I(t)]$ is the probability of $Y(t+1)$ belonging to the set $\Omega$ when the entire information till time $t$ is considered (denoted by $I(t)$) and $P[Y(t+1) \in \Omega \mid I_{-X}(t)]$ is the probability of $Y(t+1)$ belonging to the set $\Omega$ when $X$ is removed from the information set (denoted by $I_{-X}(t)$). When the above hypothesis (1) is satisfied, then we say $X$ Granger causes $Y$.

Let $X_t$ and $Y_t$ be two time series, each of which can individually be represented by the following autoregressive models:

When considered jointly, the bivariate autoregressive models are:

where the residuals $\epsilon_{xt}$ and $\epsilon_{yt}$ are uncorrelated over time, such that their covariance matrix $\Sigma_{xy}$ is

With this, the Granger causality from each variable to the other is given by

Furthermore, the interdependence between the variables $X_t$ and $Y_t$ is given by

where $|\cdot|$ denotes the determinant of a matrix. Note that when $X_t$ and $Y_t$ are independent, $|\Sigma_{xy}| = \sigma_{xx}\sigma_{yy}$ and hence $G_{X,Y}$ is zero. When the interdependence is non-zero, it was shown in [53] that the total interdependence can be decomposed as

where $G_{X \cdot Y} = \ln \dfrac{\sigma_{xx}\sigma_{yy}}{|\Sigma_{xy}|}$ is the instantaneous causality between $X_t$ and $Y_t$.
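The bivariate measure can be illustrated numerically. The sketch below is not the authors' code: it assumes the widely used log-variance-ratio form, in which the causality from a source to a target is the logarithm of the residual variance of the target's own autoregression divided by the residual variance once lags of the source are added. The helper names (`lagged`, `residual_variance`, `granger`), the lag order, and the synthetic series are all hypothetical choices made for illustration.

```python
import numpy as np

def lagged(series, p):
    """Return an (n - p, p) matrix whose columns are lags 1..p of `series`."""
    n = len(series)
    return np.column_stack([series[p - k : n - k] for k in range(1, p + 1)])

def residual_variance(target, predictors, p):
    """Residual variance of a least-squares fit of target[t] on p lags of each predictor."""
    design = np.column_stack(
        [np.ones(len(target) - p)] + [lagged(s, p) for s in predictors]
    )
    y = target[p:]
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.var(y - design @ coef)

def granger(source, target, p):
    """ln(restricted residual variance / unrestricted residual variance)."""
    restricted = residual_variance(target, [target], p)
    unrestricted = residual_variance(target, [target, source], p)
    return np.log(restricted / unrestricted)

# Synthetic example (assumption): x drives y with a one-step delay, not vice versa.
rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.3 * rng.standard_normal()

g_x_to_y = granger(x, y, p=2)
g_y_to_x = granger(y, x, p=2)
print(f"G(x -> y) = {g_x_to_y:.3f}, G(y -> x) = {g_y_to_x:.3f}")
```

In this toy setting the value for the true driving direction should be clearly positive, while the reverse direction should be close to zero.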

Equations (5) give the Granger causality values for a bivariate time-series data. However, in most real-life applications, the obtained time-series data has more than two variables. In such cases, computing pairwise dependence using bivariate Granger causality may lead to ambiguous results [54, 55] because there may be direct or indirect causal links and in such circumstances, conditional Granger causality [54] is more appropriate to infer causality. For simplicity, the case with three variables is discussed and the general case with more variables follows directly.

Consider three time series $X_t$, $Y_t$ and $Z_t$, and suppose $Y_t$ has a pairwise causal influence on $X_t$. We now consider the causal influence of $Y_t$ on $X_t$ that is mediated through $Z_t$. Let the joint autoregressive model of $X_t$ and $Z_t$ be

Again, the joint autoregressive model for all the variables $X_t$, $Y_t$ and $Z_t$ is

with the residual covariance being

With this, the Granger causality of $Y_t$ on $X_t$ conditioned on $Z_t$ is defined as

With this, if $G_{Y \to X|Z} > 0$ and bivariate Granger analysis shows that $G_{Y \to X} = 0$, then the inclusion of $Y$ results in an improved prediction of $X$ and one can conclude that $Y$ influences $X$ directly. If $G_{Y \to X|Z} = 0$ and bivariate Granger causality analysis results in

In the cases where there are more than three variables, conditional Granger causality can be defined similarly with the scalar σ ij and σ ij replaced by corresponding elements of the residual covariance matrices. However, it should be noted that although Granger causality identifies the most influential variables, it does not provide the direction of the association between the explanatory and the dependent variables.
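The difference between pairwise and conditional causality can be seen on a small synthetic chain. The following is a minimal sketch, assuming the common log-variance-ratio form of conditional Granger causality (residual variance of the target given its own past and the conditioning set, divided by the residual variance once the source is added). The function names and the chain $Y \to Z \to X$ are hypothetical and chosen only to show an indirect link.

```python
import numpy as np

def lag_matrix(columns, p):
    """Design matrix with an intercept and lags 1..p of every series in `columns`."""
    n = len(columns[0])
    lags = [c[p - k : n - k] for c in columns for k in range(1, p + 1)]
    return np.column_stack([np.ones(n - p)] + lags)

def resid_var(target, columns, p):
    """Residual variance of target[t] regressed on the lagged `columns`."""
    design = lag_matrix(columns, p)
    y = target[p:]
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.var(y - design @ coef)

def conditional_gc(source, target, conditioning, p):
    """ln(variance without source / variance with source), conditioning set always included."""
    restricted = resid_var(target, [target] + conditioning, p)
    full = resid_var(target, [target, source] + conditioning, p)
    return np.log(restricted / full)

# Synthetic chain (assumption): y influences x only through z.
rng = np.random.default_rng(1)
n = 3000
y = rng.standard_normal(n)
z = np.zeros(n)
x = np.zeros(n)
for t in range(1, n):
    z[t] = 0.8 * y[t - 1] + 0.3 * rng.standard_normal()
    x[t] = 0.8 * z[t - 1] + 0.3 * rng.standard_normal()

g_pairwise = conditional_gc(y, x, [], p=2)      # empty conditioning set: bivariate case
g_conditional = conditional_gc(y, x, [z], p=2)  # conditioned on the mediator z
print(f"pairwise = {g_pairwise:.3f}, conditional on z = {g_conditional:.3f}")
```

Here the pairwise value should be clearly positive even though the link is indirect, while conditioning on the mediator should drive the value close to zero.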

In this section, we discuss the simple linear regression (ordinary least square or OLS) model and two of its variants, namely Ridge and LASSO regression models.

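where $y_i$ is the $i$-th observation of the dependent variable $y$ and $x_j$, $j = 1, 2, \cdots, N$, are the $N$ independent variables. In the case of linear regression, the model tries to fit a straight line by minimizing the residuals. In particular, it assumes that the dependent variable can be expressed as a linear combination of the independent variables, as given by the following,

$$y_i = \alpha_0 + \sum_{j=1}^{N} \alpha_j x_{ij} + \epsilon_i, \tag{12}$$

where $x_{ij}$ is the $i$-th observation of $x_j$ and $\epsilon_i$ is the residual. In matrix form, the equation (12) can be written as

$$Y = X\alpha + \epsilon, \tag{13}$$

where

$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nN} \end{bmatrix}, \quad \alpha = \begin{bmatrix} \alpha_0 \\ \vdots \\ \alpha_N \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix}.$$

The linear regression selects the parameters $\alpha_j$ ($j = 0, \cdots, N$) such that the norm of the residual over all $y_i$ ($i = 1, \cdots, n$) is minimized. Hence the optimal $\alpha$ is obtained as a solution of the following optimization problem,

$$\min_{\alpha} \; \|Y - X\alpha\|_2, \tag{14}$$

where $\|\cdot\|_2$ is the 2-norm of a vector. The optimization problem (14) is convex and can be solved efficiently either using convex optimization techniques or analytically, such that the optimal $\alpha$ is given by

$$\alpha^{*} = X^{\dagger} Y, \tag{15}$$

where $X^{\dagger}$ is the Moore-Penrose inverse of $X$.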
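As a quick illustration of the analytical solution, the sketch below computes the least-squares coefficients with the Moore-Penrose inverse and checks them against a generic least-squares solver. The synthetic data, the coefficient vector `alpha_true`, and the noise level are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 200, 3                                  # n observations, N independent variables
features = rng.standard_normal((n, N))
alpha_true = np.array([1.0, 2.0, -1.0, 0.5])   # hypothetical intercept and slopes

# Prepend a column of ones so that the first coefficient acts as the intercept.
X = np.column_stack([np.ones(n), features])
Y = X @ alpha_true + 0.05 * rng.standard_normal(n)

# Analytical solution through the Moore-Penrose inverse.
alpha_pinv = np.linalg.pinv(X) @ Y

# The same problem solved with NumPy's least-squares routine, for comparison.
alpha_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(alpha_pinv)
```

Both routes should agree to numerical precision; the pseudo-inverse form is convenient for exposition, whereas a dedicated solver is usually preferable on large or ill-conditioned problems.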

A major drawback of linear regression is that this algorithm has low bias and high variance [38]. This means that the linear regression may perform well on the training data, but it may not generalize well to the test data set, thereby making the model performance unsatisfactory. In machine learning literature, this phenomenon is known as the bias-variance trade-off [38]. The intuition of the bias-variance trade-off is explained in Figure 1(a). Usually, with a highly complex model, it is possible to fit the training data as closely as possible. In this case, the training error is zero and the model is said to have low bias. However, the highly complex model may not generalize well to the test data, thus making the test error large. This is due to the overfitting of the training data. The complex model, which overfits the training data and produces high test error, is said to have high variance. This situation is often reversed if the model considered is fairly simple. Ridge regression, which puts a 2-norm ($\ell_2$-norm) constraint on the set of coefficients, is able to overcome this challenge efficiently.
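The trade-off can be reproduced on a toy problem. The sketch below is an illustration only and does not use the study's data: it fits a simple and a deliberately over-flexible polynomial to the same noisy linear data and reports training and test errors. The polynomial degrees, sample sizes, and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def truth(x):
    """Hypothetical ground-truth relationship (linear)."""
    return 1.0 + 2.0 * x

x_train = np.linspace(-1.0, 1.0, 15)
x_test = np.linspace(-0.95, 0.95, 50)
y_train = truth(x_train) + 0.3 * rng.standard_normal(x_train.size)
y_test = truth(x_test) + 0.3 * rng.standard_normal(x_test.size)

def errors(degree):
    """Mean squared error on training and test data for a polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

train_simple, test_simple = errors(1)
train_complex, test_complex = errors(10)
print(f"degree 1:  train={train_simple:.3f}, test={test_simple:.3f}")
print(f"degree 10: train={train_complex:.3f}, test={test_complex:.3f}")
```

The more flexible model can never have a larger training error than the simpler one, because the simpler model is a special case of it; whether its test error is worse depends on the noise, which is the point of the trade-off.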

Another drawback of OLS is that the obtained model is highly influenced by the outliers in the training data set. For example, as in Figure 1 (b), the outlier data point y k results in the linear model fit as represented by the red line. However, it is obvious by looking at the overall data that the model fit depicted by the black line is the more appropriate linear fit to the data.
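A small numerical example makes the sensitivity to outliers concrete. The values below are hypothetical and unrelated to Figure 1(b): ten points lie exactly on a line of slope 2, and a single corrupted observation is enough to pull the fitted slope far away from it.

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 2.0 * x                 # points exactly on the line y = 2x

y_outlier = y_clean.copy()
y_outlier[-1] = 100.0             # one corrupted observation (should have been 18)

# Degree-1 polynomial fit is ordinary least squares for a straight line.
slope_clean = np.polyfit(x, y_clean, 1)[0]
slope_outlier = np.polyfit(x, y_outlier, 1)[0]

print(f"slope without outlier: {slope_clean:.3f}")
print(f"slope with outlier:    {slope_outlier:.3f}")
```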

Additionally, on many occasions, the real-life data is noisy or uncertain. When the OLS attempts to fit that noise in the data, it eventually results in overfitting, consequently degrading its performance for model prediction. As stated earlier, the data for this study were obtained mostly from smartphone devices, and the chances of acquiring this data may also be subject to the individual user's discretion, so it is reasonable to assume that the data utilized in this study may contain some noise or uncertainty. To account for the noise in the data, it is assumed that there is some uncertainty, $\Delta Y$ and $\Delta X$, in the dependent and independent variables, respectively. It is assumed that the uncertainties in both $Y$ and $X$ are bounded, i.e. there exists some positive real number $\rho > 0$ such that $\|\Delta X\|_2 \le \rho$ and $\|\Delta Y\|_2 \le \rho$. With this, the optimization problem (14) is modified to a min-max optimization problem [47, 48] given by,

Min-max optimization problems are generally hard to solve, but in this case, the optimization problem (16) can be equivalently expressed as a convex optimization problem as follows,

is equivalent to the following, min

where $\lambda_2$ is a positive real number, depending on the uncertainty bound $\rho$.

Proof. For proof, see [47, 48].

The optimization problem (18), known as Ridge regression, is a convex problem and can be solved efficiently using any of the available convex optimization problem solvers. The parameter $\lambda_2$ is called the regularization parameter and it acts as a trade-off between the OLS cost and the cost on the coefficients $\alpha$.
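For intuition about the effect of the regularization parameter, the sketch below uses the common textbook form of Ridge regression with a squared residual and a squared coefficient penalty, which admits a closed-form solution. This squared form is an assumption adopted for illustration and may differ in detail from the problem derived in [47, 48]; the function name `ridge`, the data, and the penalty values are hypothetical.

```python
import numpy as np

def ridge(X, Y, lam):
    """Closed-form minimiser of ||Y - X a||_2^2 + lam * ||a||_2^2."""
    n_coef = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_coef), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + 0.5 * rng.standard_normal(100)

alpha_ols = np.linalg.pinv(X) @ Y     # unregularized reference solution
alpha_small = ridge(X, Y, 1e-10)      # negligible penalty: essentially OLS
alpha_ridge = ridge(X, Y, 50.0)       # strong penalty: coefficients shrink

print("OLS norm:  ", np.linalg.norm(alpha_ols))
print("Ridge norm:", np.linalg.norm(alpha_ridge))
```

No intercept is included here, so every coefficient is penalized; in practice the intercept is often left out of the penalty.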

Apart from the 2-norm regularization, another popular regularization is the $\ell_1$-norm regularization, which is also known as Least Absolute Shrinkage and Selection Operator (LASSO) regression. In particular, instead of putting a 2-norm cost on the coefficients of the linear model, LASSO employs a 1-norm cost on the coefficients. Hence, the LASSO regression model is obtained by solving the following optimization problem:

where $\|\cdot\|_1$ is the 1-norm of a vector and the bound $t$ is the tuning parameter. If $t$ is large, it has no effect on the regression coefficients $\alpha_i$, and the solution to the optimization problem (19) approaches the solution of the ordinary linear regression problem (14). However, when the bound $t$ is small, the parameters $\alpha_i$ are constrained and are therefore shrunken versions of the original least squares estimates. The 1-norm constraint shrinks coefficients towards zero and thus leads to a sparse solution for the linear model.
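The sparsity effect can be demonstrated with a few lines of coordinate descent. The sketch below solves the penalized (Lagrangian) form of LASSO, which for a suitable penalty weight corresponds to the constrained form with bound $t$; the solver, the penalty value `lam`, and the synthetic data are illustrative assumptions rather than the procedure used in the study.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink `value` towards zero by `threshold`, returning 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_cd(X, Y, lam, sweeps=200):
    """Coordinate descent for (1 / (2n)) * ||Y - X a||_2^2 + lam * ||a||_1."""
    n, k = X.shape
    alpha = np.zeros(k)
    for _ in range(sweeps):
        for j in range(k):
            # Residual with the contribution of coefficient j added back in.
            partial_residual = Y - X @ alpha + X[:, j] * alpha[j]
            rho = X[:, j] @ partial_residual / n
            alpha[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
alpha_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # only two relevant predictors
Y = X @ alpha_true + 0.1 * rng.standard_normal(200)

alpha_lasso = lasso_cd(X, Y, lam=0.5)
print(alpha_lasso)
```

With this penalty the coefficients of the irrelevant predictors are set exactly to zero, while the relevant ones are retained in shrunken form.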

Data for this study were collected and combined from multiple web-based open-access sources. The majority of data used in this analysis were requested and obtained from the COVID-19 Impact Analysis Platform developed at the Maryland Transportation Institute of the University of Maryland (UMD) [6] . This platform provides both state and county-based information for 50 states in the U.S. and the District of Columbia. To match with the data available from other sources, for the purpose of this study, statewise daily time-series data were requested from this platform. The relevant statewide data obtained and analyzed from this source include the daily number of new COVID-19 confirmed cases per 1,000 people, social distancing index, percent of out of county and state trips, transit mode share, population, percent of people older than 60 years, percent of African American or Hispanic Americans, median income, percent of male population, number of hot spots per 1,000 people, unemployment rate, percent of people working from home, among others. Social distancing index in the data set indicates the increasing space between individuals and decreasing frequency of contact and is represented as an integer from 0 to 100, where 0 indicates no social distancing in the state and 100 indicates all residents are staying at home. Additional information was collected and appended with the data obtained from the UMD platform. The data for daily vehicle miles traveled (VMT) starting from March 1, 2020 through June 19, 2020 for 48 states (excluding Alaska and Hawaii), and the District of Columbia were requested and obtained from the Streetlight Data [56] . The movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential, reported by Google, were further joined with the study data set [57] . 
Additionally, the percent of the population at different education levels and gross domestic product (GDP) information were collected from the U.S. Census Bureau [58]. Information on the percent of population staying at home during this time was obtained from the Bureau of Transportation Statistics [59]. Furthermore, several states have exercised travel restrictions in the form of stay-at-home orders, limitations on gatherings, domestic travel limitations, or school closures. This information was obtained from the COVID-19 State and Territory Action Tracker. Following the joining of the data from the aforementioned sources, a thorough screening and quality check of the data was performed for any missing values. The final data set includes daily data from March 1, 2020 through June 13, 2020 for 48 U.S. states and the District of Columbia, and consists of a total of 5,145 observations for further analysis. Alaska and Hawaii were excluded from the final data set because VMT data were not available for them. The preliminary analysis examined several predictor variables of interest that were initially included in the analysis. The distribution of the dependent variable, i.e., the daily VMT during the analysis period (from March 1, 2020 through June 13, 2020), is shown in Figure 2. As can be clearly seen from Figure 2, the average VMT dropped substantially from the beginning of March 2020, around the time when the U.S. started experiencing rapid community outbreaks of the virus and the country declared a national emergency, and it remained low until around the end of April 2020. However, it is interesting to see that the daily VMT gradually increased from around early May 2020, although the number of daily COVID-19 cases continued to grow considerably over time.

As stated earlier, the compiled data set includes a large number of variables of interest that may impact the daily VMT. Table 1 presents the descriptive statistics, including the minimums, maximums, means, and standard deviations of thirty-one variables that were initially considered as the set of potential determinants. For the purpose of scaling the data, VMT, population, median income, and GDP were included in their natural log forms.

In any multivariate time-series data, there is causal influence between the variables involved. Correlations between the variables quantify the extent of their interdependence, but they do not specifically identify the cause and the effect. In particular, the correlations are symmetric in the variables and lack a directional sense. However, to better understand the relationships among the variables in the data, identification of the causal structure and influential predictor variables is critical. For example, for a large time-series data with a massive number of predictor variables, it is advisable to consider a reduced order model, which can be done by identifying the most influential variables. In this study, the goal of determining the most important predictor variables from the large set of time-series data was accomplished by employing the Granger causality test. As discussed earlier in Section 2, Granger causality between two variables can be analyzed either pairwise, by developing a bivariate VAR model, or by conditional Granger causality, by considering a multivariate VAR model. For multivariate data, bivariate Granger causality analysis often provides ambiguous results [54, 55]. For validation purposes, the study initially carries out a bivariate Granger causality test whose results are shown in Figure 3. In this figure, the Granger causality test values of all predictor variables on the dependent variable (daily VMT) using bivariate models are displayed. The lags in this VAR model were chosen by the Akaike Information Criterion (AIC) and the optimal lag was found to be 4 for the bivariate VAR models. From Figure 3, it can be seen that the model was not efficient in determining causality for most of the variables and several apparently important predictors were not correctly captured in explaining the causal relationships between the explanatory and dependent variables. 
For example, the most influential predictor variable was identified as the change in the percent of trips to grocery and pharmacy from the average baseline, whereas factors like social distancing index, or population staying at home or percent of people working from home etc. had very little to no causal influence on the dependent variable. This observation is counter-intuitive and supports the previous theoretical argument that bivariate Granger causality tests provide incoherent results for multivariate time-series data. For conditional Granger causality test, a multivariate VAR model was developed and similar to the bivariate model, the optimal lag for the VAR model was determined by the Akaike Information Criterion and this optimal lag was obtained as 10. While determining the conditional Granger causality (11) of the i th predictor variable on the dependent variable, the conditional set was chosen as all the predictor variables except the i th predictor variable. Figure 4 presents the results of the conditional Granger causality test by showing the causality values for each of the predictor variables. As can be seen from Figure 4 , the conditional Granger causality efficiently identified the important predictors from a pool of a large number of explanatory variables. For example, unlike the bivariate causality test, social distancing index or population staying at home were identified as the important variables in explaining the daily vehicle miles traveled.
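Lag selection by AIC can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it uses one common form of the criterion for a VAR with $k$ variables and $p$ lags, $\ln|\hat{\Sigma}| + 2pk^2/T$, fits each model by least squares on a common sample, and uses a synthetic two-variable process. The function name `var_aic` and the maximum lag are hypothetical.

```python
import numpy as np

def var_aic(data, p, max_p):
    """AIC of a VAR(p) fitted by least squares on the sample shared by all p <= max_p."""
    n, k = data.shape
    target = data[max_p:]
    lags = [data[max_p - j : n - j] for j in range(1, p + 1)]
    design = np.column_stack([np.ones(n - max_p)] + lags)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    sigma = resid.T @ resid / len(target)          # residual covariance estimate
    return np.log(np.linalg.det(sigma)) + 2.0 * p * k * k / len(target)

# Synthetic two-variable process (assumption) with dependence up to lag 2.
rng = np.random.default_rng(0)
n = 2000
data = np.zeros((n, 2))
for t in range(2, n):
    data[t, 0] = 0.5 * data[t - 2, 0] + rng.standard_normal()
    data[t, 1] = 0.4 * data[t - 1, 0] + 0.3 * data[t - 2, 1] + rng.standard_normal()

max_p = 6
aic = {p: var_aic(data, p, max_p) for p in range(1, max_p + 1)}
best_p = min(aic, key=aic.get)
print("selected lag:", best_p)
```

Evaluating every candidate lag on the same rows keeps the criterion values comparable; AIC is known to favour somewhat larger lags than stricter criteria.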

In the previous section, the influence of the predictor variables on the VMT was computed and was rank-ordered according to the influence each predictor variable has on the VMT. In this section, the most influential predictor variables identified by Granger causality test were used to develop three different reduced order linear models, namely, ordinary least squares, LASSO, and Ridge models. In particular, the largest value of the causal influence was around 0.03 and the cut-off was selected to be 0.003. This cut-off resulted in the selection of seventeen predictor variables and with these selected variables, reduced order models for predicting VMT were developed.
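The thresholding step amounts to a simple filter and sort. In the sketch below, the variable names and causality values are hypothetical placeholders (not the values reported in Figure 4), and treating the cut-off as inclusive is an assumption.

```python
cutoff = 0.003

# Hypothetical conditional Granger causality values, for illustration only.
causality = {
    "social_distancing_index": 0.030,
    "pct_working_from_home": 0.012,
    "transit_mode_share": 0.001,
    "pct_male": 0.0004,
}

# Keep the predictors at or above the cut-off, ordered from most to least influential.
selected = sorted(
    (name for name, value in causality.items() if value >= cutoff),
    key=causality.get,
    reverse=True,
)
print(selected)
```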

The final data set used in this study included information from March 1, 2020 through June 13, 2020. After splitting the data set into two parts for training and testing, the train and test data sets ranged from March 1, 2020 through May 18, 2020 and from May 19, 2020 through June 13, 2020, respectively. The first (train) part was utilized for training the regression models, while the latter (test) part was used to test the prediction efficiency of the obtained regression models. As mentioned earlier, the seventeen most influential variables were considered for the reduced order models. It is important to note that, unlike the OLS model, the Ridge and LASSO models (optimization problems (18) and (19)) each involve one regularization parameter, and in all the optimization problems the $\lambda_i$'s were chosen such that $0 < \lambda_i \le 1$. Table 2 compares the optimal coefficients of the final set of explanatory variables between all three regression modeling techniques. As can be seen from Table 2, the coefficients of the predictors were fairly comparable across all modeling techniques. As expected, when the number of new COVID cases per 1,000 people increased, the daily vehicle miles traveled decreased. Similarly, with the increase in the social distancing index, percent of people working from home, statewide closure, and tests done per 100 people, the daily VMT decreased. Additionally, the vehicle miles traveled per day increased with the increase in population, unemployment rate, percent of out-of-county trips, socio-economic status, and the increase in percent of trips to transit stations, retail and recreational places, grocery and pharmacy, workplaces, and residences. The predictor for population staying at home was rightly captured by the regularization methods, namely LASSO and Ridge regressions, and showed a negative association with daily VMT. 
Lastly, although the negative association between the increase in the percent of trips to parks and daily VMT seemed counterintuitive, this could be partially due to the reason that people might also rely on non-motorized transport to go to parks and such trips would not be captured by the daily VMT.
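As a quick check of the training and test periods stated earlier in this section, the chronological split can be reproduced with a few lines of Python. The variable and function names are illustrative assumptions; only the three dates come from the text.

```python
from datetime import date, timedelta

# Dates taken from the text; everything else is an illustrative assumption.
START = date(2020, 3, 1)
TRAIN_END = date(2020, 5, 18)
END = date(2020, 6, 13)


def daterange(first, last):
    """Return every calendar day from `first` to `last`, both inclusive."""
    n_days = (last - first).days + 1
    return [first + timedelta(days=k) for k in range(n_days)]


all_days = daterange(START, END)
# Chronological split: no shuffling, so the test days all follow the training days.
train_days = [d for d in all_days if d <= TRAIN_END]
test_days = [d for d in all_days if d > TRAIN_END]

train_fraction = len(train_days) / len(all_days)
print(len(all_days), len(train_days), len(test_days), round(train_fraction, 3))
```

Splitting by date rather than at random keeps the evaluation a genuine forecast: the models never see any observation from the test period during fitting.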

Furthermore, the log-lambda plots shown in Figures 5 and 6 depict how the coefficients of the predictor variables entering the LASSO and Ridge models varied as the regularization parameter changed. When the regularization parameter λ is small, the contribution of the regularization term to the cost functions in the optimization problems (18) and (19) is small, and both problems effectively reduce to simple linear regression (OLS). However, as λ increases, the weight on the regularization term increases and, as a result, the coefficients of the independent variables become smaller and approach zero. It is clearly seen from Figures 5 and 6 that the coefficients did not all approach zero at the same time. In particular, the important variables remained non-zero for larger values of λ than the relatively unimportant variables. The order of importance of the explanatory variables differed slightly between the LASSO and Ridge regularization techniques. This difference could be due to the fact that Ridge regularization reduces the norm of the coefficients more uniformly, while the LASSO model attempts to set as many coefficients to zero as possible.
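The shrinkage behavior described above can be illustrated with a small NumPy sketch on synthetic data. The objective scalings used here (squared error plus λ times the squared 2-norm for Ridge, half the squared error plus λ times the 1-norm for LASSO) are assumptions and may differ from the exact form of problems (18) and (19); the data, the range of λ, and the coordinate-descent solver are likewise illustrative and are not the authors' implementation.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - X b||^2 + lam * ||b||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def soft_threshold(z, t):
    """Shrink the scalar z toward zero by t; return exactly 0 if |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_fit(X, y, lam, n_iter=500):
    """Cyclic coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n_features = X.shape[1]
    b = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with the contribution of feature j added back in.
            r_j = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return b


# Synthetic data: two strong predictors, one weak, one irrelevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
true_b = np.array([3.0, -2.0, 0.3, 0.0])
y = X @ true_b + 0.1 * rng.standard_normal(200)

lambdas = [0.01, 1.0, 10.0, 100.0, 1000.0]
ridge_path = [ridge_fit(X, y, lam) for lam in lambdas]
lasso_path = [lasso_fit(X, y, lam) for lam in lambdas]

for lam, rb, lb in zip(lambdas, ridge_path, lasso_path):
    print(f"lambda={lam:8.2f}  ridge={np.round(rb, 3)}  lasso={np.round(lb, 3)}")
```

In this toy setting the Ridge coefficients shrink smoothly but stay non-zero, whereas the LASSO coefficients hit exactly zero one after another, with the weak predictors dropping out first. This is the qualitative pattern the log-lambda plots are meant to convey.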

Following the development of the models using the training data, the prediction of the dependent variable (daily VMT) was tested and compared across the three regression techniques. Specifically, the performance of the three models on the test data set was evaluated by comparing the predicted values with the observed values. The root mean square errors (RMSEs) of the models served as the measure of their performance. The RMSEs of the three models on both the training and test data sets are presented in Table 3. The comparison of the RMSEs clearly shows that the Ridge regression performed the best on both the training and test data, having the least RMSE among all models. This is expected, as the Ridge regression provides superior prediction by overcoming the issue of overfitting with low variance and better generalization to the test data compared to other regularization methods (refer to Figure 1). Additionally, based on the RMSEs, the LASSO model also provided better prediction performance than the OLS model. However, the higher error on the test data in all models could partially be due to the much smaller sample size and the consistent gradual increase in VMT in the test period, as opposed to the change in VMT from the average baseline in both directions during the training period.
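For reference, the RMSE comparison underlying a table such as Table 3 can be sketched as below. The observed and predicted values are hypothetical numbers chosen for illustration and are not results from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values for illustration only; not data from the study.
observed = [10.0, 12.0, 14.0, 16.0]
predictions = {
    "OLS": [9.0, 13.5, 12.0, 18.0],
    "LASSO": [9.5, 12.5, 13.0, 17.0],
    "Ridge": [10.0, 12.5, 13.5, 16.5],
}

scores = {name: rmse(observed, pred) for name, pred in predictions.items()}
best = min(scores, key=scores.get)  # model with the smallest RMSE
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Computing the same quantity separately on the training and test periods gives the two columns that such a comparison needs.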

To show how the different modeling techniques perform at the individual state level, a comparative graphical representation of the RMSEs in the prediction of daily VMT for the three regression techniques is presented separately for the states of Florida, New Jersey, New York, and Texas as examples in Figure 7. These figures clearly show that, even at the individual state level, the ordinary linear model performed the worst, with the largest errors, while the Ridge and LASSO models showed comparable prediction performance. Among the three models, Ridge performed the best, followed by LASSO regression with a slightly higher prediction error.

Additionally, the date-wise RMSEs in the prediction, utilizing the three developed models over the analysis period, are presented in Figure 8. Let $y_{di}$ be the observed value of daily VMT on the $d$-th day for state $i$. For the test data set, $d = 1, 2, \cdots, 26$ and $i = 1, 2, \cdots, 49$. Let $\hat{y}^{p}_{di}$ be the predicted daily VMT on the $d$-th day for the $i$-th state when using the $p$-th modeling technique. Here $p$ is one of the linear, LASSO, or Ridge regression techniques. Therefore, the error in the prediction using the $p$-th model on the $d$-th day for the $i$-th state is

$$e^{p}_{di} = y_{di} - \hat{y}^{p}_{di}.$$

The RMSE in the prediction of the $p$-th model for the $d$-th day over all the states is

$$r^{p}_{d} = \sqrt{\frac{1}{n_s} \sum_{i=1}^{n_s} \left( e^{p}_{di} \right)^{2}},$$

where $n_s = 49$ is the number of states, including D.C. In Figure 8(a), $r^{p}_{d}$ is plotted for all modeling techniques over the test period, from May 19, 2020 through June 13, 2020. From Figure 8(a), it can once again be seen that the Ridge regression performed the best for most of the test period, as its RMSE was usually lower than that of the other methods. The average of $r^{p}_{d}$ across the test period, which can be expressed as

$$\bar{r}^{p} = \frac{1}{n_d} \sum_{d=1}^{n_d} r^{p}_{d},$$

where $n_d$ is the number of days (equal to 26 in this study), is plotted in Figure 8(b). This plot also confirms that the Ridge regression model has the least error in prediction.
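The per-day RMSE over states and its average over the test period, as defined above, can be computed with NumPy as in the following sketch. The array layout (rows are days, columns are states), the function names, and the sign convention of the error (observed minus predicted, which does not affect the RMSE) are assumptions; the toy numbers are purely illustrative.

```python
import numpy as np


def daily_rmse(observed, predicted):
    """Per-day RMSE over states.

    Both inputs have shape (n_d, n_s): one row per day, one column per state.
    Returns an array of length n_d holding r_d for each day.
    """
    errors = observed - predicted  # e_di, one entry per (day, state)
    return np.sqrt(np.mean(errors ** 2, axis=1))  # average over states


def average_daily_rmse(observed, predicted):
    """Mean of the per-day RMSE values over all days."""
    return float(np.mean(daily_rmse(observed, predicted)))


# Toy example with 2 days and 3 states; values are illustrative only.
observed = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])
predicted = np.array([[1.0, 2.0, 6.0],
                      [3.0, 6.0, 6.0]])

r_d = daily_rmse(observed, predicted)
r_bar = average_daily_rmse(observed, predicted)
print(r_d, r_bar)
```

For the setting in the text, the same functions would be applied to arrays of shape (26, 49), once per modeling technique.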

Since the emergence and rapid growth of the novel coronavirus in December 2019, countries worldwide have been taking extreme measures to prevent the spread of the virus. The U.S. has been hit hard by the pandemic and currently (September 2020) has the highest number of confirmed COVID cases and deaths in the world. Since the national emergency declared by the White House on March 13, 2020, most states in the U.S. have implemented travel restrictions and social distancing protocols to combat the crisis, causing drastic reductions in mobility and travel demand at all levels. However, the overall impact and the long-term implications of this crisis for mobility and travel remain unknown. To understand these implications better, statistical models and analytical tools that utilize the increasingly available open-access data are needed. To that end, this study developed an analytical framework that helped determine and analyze the most important factors impacting and predicting human mobility and travel in the U.S. during the pandemic by employing Granger causality and linear regularization algorithms, including the Ridge and LASSO modeling techniques.

Data for this study were obtained for 48 states (excluding Alaska and Hawaii) and the District of Columbia from various databases created and maintained to analyze the impacts of this pandemic. The data obtained and analyzed in this study included information on mobility and movement trends, travel restrictions and social distancing, and health and demographics of the population. The compiled data set included daily time-series data starting from March 1, 2020 through June 13, 2020.

Evaluating a large-scale data set requires advanced analytical techniques to identify the most important factors in explaining the response variable. To analyze such rich data, this study employed Granger causality and linear regularization techniques, including the Ridge and LASSO models, along with ordinary least squares regression. The entire data set was split into two parts, where approximately 75 percent of the data was used for training the models, and the remaining 25 percent was used to test the prediction performance. The set of most important predictors impacting the daily VMT was determined from the pool of potential determinants using Granger causality. The seventeen selected variables were then used to develop reduced order models, employing the linear, Ridge, and LASSO regression techniques on the training data. Finally, the prediction performance was tested by feeding the test data into the models for all regression techniques.

The results of this study revealed that the coefficients of the predictors were comparable across all modeling techniques. When factors including the number of new COVID cases, the social distancing index, percent of people working from home, statewide closure, and tests done per 100 people increased, the daily VMT decreased. Conversely, the vehicle miles traveled per day increased with increases in population, unemployment rate, percent of out-of-county trips, socio-economic status, and the percent of trips to transit stations, retail and recreational places, grocery and pharmacy, workplaces, and residences. The population staying at home was rightly captured by the regularization methods, namely LASSO and Ridge regressions, and showed a negative association with the daily VMT.

Furthermore, the developed models were used to predict the daily VMT for all the states for a period of 26 days (from May 19, 2020 through June 13, 2020). Although all the developed models compare favorably, the Ridge regression model performed the best, having the least root mean square error (RMSE) in prediction among all models. This result makes sense because the Ridge regression is robust against overfitting and thus generalizes better to the test data set, resulting in lower prediction error. Also, the LASSO regularization technique outperformed the ordinary least squares regression.

This study is only a starting point for understanding the associations between different factors and human mobility during the COVID-19 pandemic. The authors intend to expand this study with county-level data to understand these associations at a more granular level. Moreover, it would be insightful to include additional variables in the analysis as potential predictors. From the modeling perspective, it is reasonable to argue that the available data are subject to some uncertainties, and future research should explicitly take these uncertainties into account to arrive at more precise models. Furthermore, as the number of confirmed cases and deaths in the U.S. continues to reach higher peaks over time, subsequent analysis is warranted with data from the following months (post June 13, 2020).

None. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. 
