key: cord-0730339-tuc8x2zz authors: Mustavee, Shakib; Agarwal, Shaurya; Enyioha, Chinwendu; Das, Suddhasattwa title: A linear dynamical perspective on epidemiology: interplay between early COVID-19 outbreak and human mobility date: 2022-05-05 journal: Nonlinear Dyn DOI: 10.1007/s11071-022-07469-5 sha: d5658d10b78efc55d83e5ec172719e6636a3d3d1 doc_id: 730339 cord_uid: tuc8x2zz This paper investigates the impact of human activity and mobility (HAM) in the spreading dynamics of an epidemic. Specifically, it explores the interconnections between HAM and its effect on the early spread of the COVID-19 virus. During the early stages of the pandemic, effective reproduction numbers exhibited a high correlation with human mobility patterns, leading to a hypothesis that the HAM system can be studied as a coupled system with disease spread dynamics. This study applies the generalized Koopman framework with control inputs to determine the nonlinear disease spread dynamics and the input–output characteristics as a locally linear controlled dynamical system. The approach solely relies on the snapshots of spatiotemporal data and does not require any knowledge of the system’s underlying physical laws. We exploit the Koopman operator framework by utilizing the Hankel dynamic mode decomposition with Control (HDMDc) algorithm to obtain a linear disease spread model incorporating human mobility as a control input. The study demonstrated that the proposed methodology could capture the impact of local mobility on the early dynamics of the ongoing global pandemic. The obtained locally linear model can accurately forecast the number of new infections for various prediction windows ranging from two to four weeks. The study corroborates a leader-follower relationship between mobility and disease spread dynamics. In addition, the effect of delay embedding in the HDMDc algorithm is also investigated and reported. A case study was performed using COVID infection data from Florida, US, and HAM data extracted from Google community mobility data report. (HDMDc) algorithm to obtain a linear disease spread model incorporating human mobility as a control input. The study demonstrated that the proposed methodology could capture the impact of local mobility on the early dynamics of the ongoing global pandemic. The obtained locally linear model can accurately forecast the number of new infections for various prediction windows ranging from two to four weeks. The study corroborates a leader-follower relationship between mobility and disease spread dynamics. In addition, the effect of delay embedding in the HDMDc algorithm is also investigated and reported. A case study was performed using COVID infection data from Florida, Given how increasingly connected the world is, epidemics are becoming a commonplace. As we know and have come to see through the ongoing COVID-19 pandemic, the significant loss of lives, as well as the short and long-term economic impact, can be very devastating. Besides the loss of lives, the pandemic has also crippled global transportation, food supply, and challenged healthcare systems in ways not seen before. Understanding and forecasting the spread dynamics is a challenging task, in part because these are high dimensional, nonlinear, and time-varying systems. In addition, the spreading process exhibits a multi-scale spatiotemporal phenomenon [3, 11, 46] . It depends on many exogenous variables, including human activity and mobility (HAM) and mitigation measures such as vaccination and face coverings adopted by people. HAM is considered a critical factor in the disease spread, given the fact that the effective reproduction number of the pandemic is highly correlated to mobility. During the onset of the ongoing global pandemic, mitigation strategies revolved around imposing various restriction measures on human activity and mobility. Since the first 'stay-at-home' order was issued in the USA on March 15, 2020, in Puerto Rico, similar executive orders issued by state and municipal authorities notably curbed travel demand and thus potentially limiting the community spread of COVID-19. Similarly, governmental agencies worldwide have imposed lockdown and introduced various social isolation strategies for controlling the spread of coronavirus. The underlying rationale for these restriction strategies, such as lockdown and social isolation, is to reduce the scope of direct interpersonal contacts, which on the other hand, adversely impact human activity and mobility, in turn, slows down the disease transmission rate. Thus, dynamics of COVID-19 spread and the dynamics of HAM are intertwined via an intricate relationship. Despite the close connections between the spread of a pandemic and mobility, obtaining a quantifiable relationship between them is challenging because the spread dynamics of a pandemic such as COVID-19 depend on various other factors such as social distancing mask-wearing, mutation of the virus, etc. Moreover, mobility is a multi-modal service, which means each mode of mobility has a different mechanism to impact the disease spread dynamics. Thus, Linka et al. [38] characterized mobility as a 'global barometer' of Although the literature has, for many decades, proposed ways of modeling epidemic processes in human populations, several factors that affect the rate of spread or control are not often accounted for. For example, human mobility and activity levels in an outbreak can affect how fast the outbreak is contained or how quickly it spreads. While this knowledge, in part, informs the preliminary soft measures typically taken by health policymakers such as social distanc-ing and lockdown, which prevents mobility or travel of individuals between infected and uninfected areas, the degree to which such external factors affect the pace of spread are not often wholly known. A significant research gap that arises here is establishing the interplay between the pandemic dynamics and the HAM changes due to adopted countermeasures (lockdown, social distancing, etc.). This gap in knowledge makes the existing control strategies deviate from arguably optimal approaches based on the actual nonlinear dynamics. For a highly infectious disease like COVID-19, a paradigm shift in characterizing the spreading dynamics is necessary. To contain a resurgence of the outbreak using scarce or limited resources (such as vaccines and ventilators), a reliable approach of integrating HAM into the spread models aided by novel data-driven tools within a rigorous mathematical framework is necessary. The output of the proposed tightly integrated, the equationfree approach can provide more robust analysis and serve as a helpful tool for policymakers. In this paper, we study the interconnections between the simultaneous evolution of two systems-HAM and disease spread-by treating them as a coupled dynamical system (see Fig. 1 ). This research builds a system discovery framework through the Koopman operator framework exploiting Hankel dynamic mode decomposition with control (HDMDc) algorithm for understanding the dynamics of the COVID-19 disease and its interplay with HAM. Figure 2 illustrates the arrangement of the two systems in a feedback loop set. HAM and epidemics spread system are coupled in a cascaded fashion, where the current state of epidemics (e.g., number of daily infections) triggers a feedback con- trol action (e.g., lockdowns, face-covering, social distancing), which affects the HAM system. This feedback through the HAM system feeds back into the epidemics system, whereas other inputs (e.g., vaccinations) may affect the epidemics system directly. As a case study, this paper investigates the early dynamics of COVID-19 in the state of Florida, US, and develops a locally linear model using large-scale snapshots of spatiotemporal infection data and HAM data. During the second wave of the pandemic in the USA, the state of Florida played a critical role. From July 01, 2020, to July 21, 2020, almost 15% of total reported cases in the US came from Florida, although the state only accommodates 6.6% of the total US population. Mobility data was collected from Google COVID-19 Community Mobility Report [22] and the COVID case counts from the USA Facts repository [54] . Integrating HAM data into the framework will highlight the undercurrents of the disease evolution as the economy reopened. Contributions: Contributions of this paper are summarized as follows: -We formulate and study human activity, mobility, and epidemics spread as cascaded coupled dynamical systems. -The proposed method incorporates the impact of HAM into infectious disease dynamics. -The developed method for system identification yields a locally linear equation-free model. -A novel technique is proposed to quantify mobility as a representation of human activity, which is then used as a control input. -We also identify the effect of delay embedding on the model accuracy and prediction results. Outline: Rest of the paper is organized as follows: Section 2 discusses the background of the problem and the current state of the art, Sect. 3 presents the adopted mathematical framework, and Sect. 4 discusses the characteristics of the data and its preparation details. Finally, in Sect. 5, we will illustrate and explain the results of our analysis. This section briefly reviews some relevant literature and approaches on mathematical models in epidemiology (2.1), explores the connections between mobility and epidemic spreading (2.2), and highlights the application of dynamic mode decomposition (DMD) and other DMD-type algorithms such as DMDc, HDMD, HDMDc in various random dynamical system identification including epidemiology and transportation (2.3). Daniel Bernoulli is one of the pioneers of mathematical models in epidemiology. He used differential equations to estimate the impact of smallpox on life expectancy in the mid of 18th century [5] . However, at the beginning of the 20th century, public health physicians first introduced deterministic epidemiology models for disease spread prediction [25] . These models served as foundations for compartmental models developed between 1900 and 1935 [7] . In recent years, the application of mathematics in epidemiology has significantly increased. Since the onset of the global pandemic in December 2019, various mathematical models and techniques have been applied to understand, explain and predict its spreading dynamics. Rahimi et al. [48] have presented an review on COVID-19 prediction models. The proliferation of models at the same time has raised a lot of criticisms and doubts, as they often speculate contradictory prediction results. In [31] authors discussed key limitations of mathematical models used in interpreting epidemiological data and public decision making. This research presented several validation approaches to corroborate the speculations of these analyses. Mathematical methods used to model COVID-19 generally fall into two categories: (i) statistical models, such as regression (ii) mechanistic models, such as SIR (susceptible infected recovered), SEIR (susceptible exposed infected recovered) and other compartmental models [27] . A comparative study of the two classes is shown in [34] . The authors classify machine learning as statistical models stating these black-box approaches are data-hungry and lacks interpretability. However, they report some advantages of statistical approaches, including high accuracy in short-term predictions. In contrast, compartmental models can interpret interacting disease mechanisms and contain disease-specific information. Hence, the application of compartmental models in analyzing and understanding the COVID-19 outbreak has been widespread. It is important to note that the application of compartmental models are very diverse and often observed beyond the epidemiology domain. For example, these models are widely used to model information spread [40] [41] [42] . In [51] , authors develop an SEIRS (susceptible exposed infected recovered susceptible) model to analyze the impact of an individual's behavioral response on disease spread. The model considers an individual to be informed of the disease and its precautionary measures. The paper reports that awareness and protective actions can control infection spread significantly. In [50] , authors present another compartmental model called SAIRS (Susceptible asymptomatic infected symptomatic Infected recovered susceptible). The paper shows numerical evidence for the claim that preventing more individuals from being asymptotically exposed can decrease disease transmission considerably. In [14] , authors investigates COVID-19 transmission patterns in West Bengal, India, using SAVICR (susceptible asymptomatic vaccinated comorbidity infectious recovered). Few more examples of compartmental model can be found in [10, 23, 44] . While compartmental models are parametric approach, there are several non parametric approaches which are also helpful to understand and analyze the impact of various factors in disease spread. For example in [19] , authors present a comparative study of different regression approaches to analyze COVID-19 survival data considering the presence of competing risks. The study demonstrates the impact of age and gender on surviving patients. Besides, the study in [20] adopts a nonparametric approach based on Kaplan-Meier (KM) technique for analyzing survival data of COVID-19 patients. The study reveals age, gender, number of comorbidities, and healthcare accessibility as critical contributing factors to COVID-related mortality in Spain. The relationship between mobility and the spread of infectious disease has been well-established [2, 8, 16, 24, 32] . This was a subject of paramount importance even in the pre-COVID-19 era. Mobility restriction strategies like cordon sanitaire were implemented for controlling various epidemics such as bubonic plague (1666) [47] , yellow fever (1793, 1821, 1882 [33] , and cholera (1830, 1884) [9] . Some of the early efforts that estimated the impact of mobility through mathematical analysis on disease outbreak include [4, 49] , which are followed by few other studies. The relationship between mobility and disease spread are reported in [2, 8, 24, 32] . In [16] , authors illustrated the relationship between mobility restriction and epidemic at various scales and levels. One of the earliest attempts to correlate mobility trends with COVID-19 transmission was reported in [56] . This research showed that there lies a positive relationship between mobility inflow and the number of COVID-19 cases. A prediction model of COVID-19 using community mobility reports was proposed in [55] . It used a partial differential equation model and integrated Google community mobility data with it. The model captures the impact of human actives on transboundary spread. The authors demonstrated a case study in a clusters counties of Arizona. In [57] , the authors applied the Poisson time series model to explore the connection between population mobility and COVID-19 daily cases. The model was simultaneously applied at the county level and state level in South Carolina. On the other hand, the effect of human mobility trend without any pharmaceutical interventions was investigated in [28] , where a generalized additive mixed model (GAMM) was proposed. In [30] , it was shown that the spread dynamics of COVID-19 during the early stage of the outbreak is highly correlated to human mobility. In this research, mobility data was obtained from mobile data. The study in [36] aims to explore the relationship between public mobility trends and non-pharmaceutical mobility interventions imposed by the government, such as stay-at-home orders and lockdowns. It captures both statewide and nationwide mobility and teleworking trends during the early stage of the COVID-19 pandemic in the US. The study also analyzes the impact of various sociodemographic factors on mobility, such as population density group and income. By using the susceptible-exposed-infected-recovered (SEIR) model, the impact of air traffic and car mobility on COVID-19 dynamics in Europe was evaluated by Linka et al. [38] . They used a standard SEIR compartment model with a network structure for capturing local dynamics. To integrate the effect of mobility in the model, they applied a hyperbolic tangenttype ansatz. The authors argued that local mobility has a high correlation with the reproduction number. Also, they showed that mobility and reproduction are correlated during the early stages of the outbreak but become uncorrelated during later stages. Hence, they advocated that local mobility can be used as a quantitative metric for prediction and identification stages. Another insightful study was conducted in [39] , where the authors presented their arguments with quantitative analysis on travel restriction in controlling an outbreak. They also applied the SEIR model and integrated mobility into the model. Linka et al. have found a leadlag relationship between mobility and the spread of COVID-19. Similar results were obtained in [34] . In [58] , authors constructed a SIRD model considering human mobility. The model makes a long-term prediction on epidemic trends in several states of the USA from July 27, 2020, to January 22, 2021. This study forecasts long-term epidemic trends using equationbased parametric methods incorporating mobility. The application of dynamic mode decomposition (DMD)-type models in epidemiology was first demonstrated in [45] . In this research, the DMDc was applied to historical data to model the spread of infectious diseases like flu, measles, and polio and explore its connection with the vaccination. This study demonstrated that exogenous inputs of dynamics such as vaccination in case of epidemic spread could be incorporated with the original dynamics by using DMDc. The application of Hankel matrix structure in system identification is demonstrated in [17] . Like disease spread, traffic flow is also a dynamical system that is highly nonlinear and prone to randomness. In [1] , the authors extracted spatiotemporal coherent structures of highway traffic dynamics by using the concept of so-called Koopman modes. They applied Hankel DMD for computing Koopman modes and used the estimated modes for traffic dynamics forecast. Besides, this technique can be applied to capture various traffic dynamics of signalized intersections. In [37] , HDMDc was used to identify the underlying dynamical system of traffic queue. Furthermore, they made predictions on traffic queue length and applied Koopman spectral analysis to determine queue length instability. In [53] , HDMDc was also applied for modeling intersection queue length dynamics. Another interesting application of DMD in random dynamical system identification is demonstrated in [6] where it was applied in infrastructure energy assessment. In this work, the authors used HDMD to extract insights into the thermal dynamics of an airport. Figure 3 illustrates the relationship between the major modeling paradigms used in epidemiology and their connection with the proposed approach. The advantages of DMD-type algorithms are summarized as follows: -DMD-type algorithms yield a locally linear model. -DMD-type algorithms are purely data-driven and require no physical knowledge about the system. -They do not require parameter estimation. -They provide an equation-free approach. -These approaches can be modified to incorporate input-output characteristics of the system. This section will present the mathematical framework of the proposed methodology, problem formulation, discussion on DMD type techniques such as HDMD and HDMDc, and the issue of delay coordinate embedding. To interpret and analyze the data that we are dealing with, we make the assumption that the data is generated by an underlying dynamical system. We assume that there is a phase space Ω and a map F : Ω → Ω. The transformation F generates a dynamics on the phase space Ω. Given any initial point ω 0 ∈ Ω, there is a trajectory This study aims to construct a data-driven low-dimensional representation of the original high-dimensional system. We assume the existence of (Ω, F), where Ω is the domain of the actual high dimensional dynamics, and F is a transformation that connects the current states with the future states. However, a complex system like epidemic spread has numerous influencing factors. Only a few of them are genuinely observable. Since the original system is partially observable, the explicit form of the transformation F is unknown. For this reason, instead of assuming any explicit form of F, we aim to approximate F from the data. The information about Ω will obtained through a measurement X : Ω → R n and it generates the sequence of data points in R k These assumptions make up the data-driven framework, it enables a parameter-free reconstruction of the dynamical systems. Note that the measurement X is not necessarily one-to-one, thus the data-points x i may not correspond to a unique underlying state in Ω. In particular, it may not be possible to connect the x i with a dynamical rule of the form x i+1 =F(x i ). x i should be interpreted as a partial observation of the true state in Ω, in the i-th time frame. The method of delay coordinate incorporates a number h of time shifted versions of the map X to obtain a higher-dimensional map X (F 1 ω), . . . , X (F h ω) . Thus, the delay coordinated version of each point x n is , x n+1 , . . . , x n+h ) . The main purpose of delay coordinates is that they produce an embedding of the dynamics, as shown in the lemma below. Lemma 1 (Delay-coordinate embedding [52] ) Given a diffeomorphism F : Ω → Ω, for almost every function X : Ω → R n , there is an integer h 0 such that for every h > h 0 , the function X (h) : Ω → R n(h+1) obtained by incorporating h delay coordinates into X , is an embedding / one-to-one map of Ω into R n(h+1) . This statement above is for 'almost every' measurement functions X , meaning that it holds for typical X [29] . For such X , the statement says that there is some choice of h by which every value of X (h) is the unique representative of an underlying state of Ω. The image of X (h) is thus an embedding or one-to-one image of Ω in the Euclidean space R n(h+1) . The size of h would depend on X and Ω. Dynamic mode decomposition (DMD) is a tool for data-driven discovery of dynamical systems. One way to look at it is as a union of spatial dimensionality reduction and Fourier transformation in time. Let X (h) be the n(1+h)×m matrix formed by collecting m consecutive snapshots x Similarly, let X be n(1+h)×m time shifted version of X . The method of DMD takes a locally linear approximation of the dynamics (1) by assuming a linear relation of the form where the matrix A represents the discrete-time locally linear dynamics. It is a best fitting operator, which minimizes Frobenius norm of equation (2). Suppose X has the SVD (singular value decomposition) X = U V * . Then, the Moore Penrose inverse of A can be computed as [35] DMD is thus a regression algorithm [15] and it produces locally linear approximation of the dynamics. However, in a few instances DMD fails to capture the dynamical nature of the system, e.g., in case of a standing wave. Hankel DMD (HDMD) is a variant of DMD applied on a time-delayed Hankel matrix which increases the order of the underlying dynamical system, thereby aiding in the estimation of hidden oscillatory modes. DMD with control (DMDc) is a modified version of DMD which takes into account both the system measurements and the exogenous control input to uncover the input-output characteristics and the underlying dynamics. Hankel DMD with control (HDMDc) is defined in a similar fashion and is the application of DMDc algorithm on the time delayed coordinates. If there is an external control input u k ∈ R q acting on the system, then the autonomous system (1) becomes If the u k are collected into an s × m matrix ϒ, we get the system of linear equations Although the original dynamics is by no means linear, (4) represents a finite-dimensional approximation of a linear formulation of the dynamics, called the Koopman formulation. The Koopman operator is essentially a time-shift operator. It operates on functions instead of points on the phase space Ω. Given any function f : Ω → R, f is another function defined as where F is the underlying dynamical system (1). The Koopman operator converts any nonlinear dynamical system into a linear map. Thus, all the tools from operator theory / functional analysis can be brought into the study of dynamics [12, 13, 21] . However, the original dynamics which is usually on a finite dimensional phase-space is converted into dynamics on some infinite dimensional vector space. The properties of depends on the choice of vector/function space. Some common choices of function spaces are C(Ω) the space of continuous functions, or C r (Ω), the space of r -times differentiable functions on Ω. We shall use L 2 (μ), the space of square-integrable functions with respect to the invariant measure μ of the dynamics. Now note that the columns of matrix X are repeated iterations of : for some unknown initial state ω 0 ∈ Ω. Similarly for some unknown control-measurement u : Ω → R q . Thus, both A, B record the action of U on finite dimensional Krylov subspaces. This justifies (4) as a finite rank approximation of the infinite-dimensional operator and thus of F. If we set Ω = X ϒ and G = A B , the equation becomes To solve this, we reuse notation to take the SVD Ω = U V * and then set In practice, if the size of the data is limited to N time samples, then if we choose h delay coordinates, then m can be at most N − h. This section discusses data sources and data preparation methods used in this research. It is divided into two: Sect. 4.1 provides a timeline for state government response and case counts, Sect. 4.2 provides the details of the data used in the case study, while Sect. 4.3 introduces the proposed technique of generating control inputs from the mobility data. After the first reported cases on March 1, 2020, the state of Florida entered into a public health emergency followed by a series of restrictions and shutdowns on bars and recreation facilities from March 17, 2020, to March 31, 2020. On April 1, 2020, the Governor of Florida issued a stay-at-home order. In the following, two weeks number of reported cases gradually decreased, and subsequently, beach saloons, bars, and educational institutes gradually received permission to reopen from April 17, 2020, to June 10, 2020. However, as the number of cases rapidly increased, the state government shut down all bars on June 26, 2020. The datasets generated during and/or analyzed during the current study are available in the following sources: (i) Google community mobility data report, available here [22] . (ii) Data of US COVID-19 cases and deaths by state, available here [54] . This publicly available dataset reports daily percentile changes of mobility measuring the number trips in five types of destinations. In this study, we treat them as five different modes of mobility. They are (i) Retail and recreation (m (1) ), (ii) Grocery and pharmacy (m (2) ), (iii) Transit stations (m (3) ), (iv) Parks (m (4) ), (v) Work stations (m (5) ). The proposed model consider all of them as control inputs. Now, we discuss how to quantify the mobility trends as control inputs for the proposed model. where n j is the number of modes of mobility. This dataset records daily mobility data of each county regarding percentile changes concerning a normalized pre-Covid baseline to adjust weekend factors. The baseline was a daily mobility average from the five weeks starting from January 3, 2020, to February 6, 2020. This dataset records daily mobility data of each county regarding percentile changes concerning a normalized pre-covid baseline to adjust weekend factors. The baseline was a daily mobility average from the five weeks of January 3, 2020, to February 6, 2020. A sevenday moving average of percentile change illustrates the general mobility trend shown in Fig. 4 . It also summarizes all governmental decisions. All forms of mobility exhibited strong increasing trends after the reopening decisions came in mid-April and continued until the end of June. However, partial shutdown orders on various facilities reemerged as the second wave approached in the summer. The figure shows that in Florida, during the second wave of the pandemic highest number of daily cases were reported around the third week of July 2020. However, from the fourth week of July, the trend went downward (Fig. 5) . On the other hand, the data of county-wise COVID-19 confirmed cases in Florida was collected from USAFact public website [54] . It presents daily counts of cumulative COVID cases at the county level, which is shown in the color map presented in Fig. 6 . In this research, we considered the twenty most populated counties in terms of total infection in Florida. 86.3% of all reported COVID-19 cases in Florida up to 01 July 2020 came from these twenty counties. The actual daily case counts are discrete phenomena, and thus it shows the random fluctuations in the raw data, which primarily arises due to the collective impact of weekends and reporting system of the testing process. Rolling-average is a widely preferred solution to encounter the problem. For each time i, u i := u k,i n k=1 is the control vector that we create. The term w above is a scaling factor. If w = 1, then the components of u i lie in [0, 1], which could be a mismatch with the magnitude of the signals x (h) k . We choose w by trial and error for the best forecasting results (Fig. 7) . In this section, we will discuss the obtained results obtained. This section is divided into two subsections. Section 5.1 describes the physical significance, characteristics of the proposed linear model. Besides, we will also discuss different parameters of the linear model, i.e., effect of delay embeddings and corroboration of Fig. 7 Control inputs as calculated using (10), with scaling factor w = 100. The inputs to (10) were obtained from Google Covid-19 Community Mobility Report. They serve as a continuous measurement of the effect of external control parameters such as lockdowns and re-openings lead-lag relationship. In Sect. 5.2, we will focus on the prediction results obtained from the model. The proposed HDMDc-based linear model of disease spread describes the evolution of reported cumulative case counts and conjoins mobility inputs with it. The states of the dynamics consist of cumulative case counts of COVID-19 of twenty counties of Florida. The model was trained with historical cumulative case count and mobility input, while the training window spanned from April 19, 2020 to June 24, 2020. In this study, various delay coordinates were used to formulate Hankel matrix G from the original data matrix X and control input sequence in order to raise the dimension of observable space. It means that the observable space of the dynamics consists of previous two weeks' data. In HDMDc, the dimension of identified matrices of a nonlinear system depends on two parameters: (1) Number of delay coordinates h (2) Output dimension n of the measurement map X . We have used 56 delay embeddings for the system identification. As a result A, B ∈ R 1120×1120 . As an example, Fig. 8 shows A and B matrices of a system consisting of four counties of Florida-Marion, Lake, Osceola, Collier. We do not show the obtained system In this plot, we took moving average across four consecutive h to understand the trend matrices for all the twenty counties due to their large sizes which inhibits any visual interpretation. Choice of the number of delay embedding: The choice of the number of delay embedding coordinates in system identification is a complex question as it depends on the nature of a system, and it is still an open research question. The choice of delay embedding in linear system identification of a nonlinear system is elaborately discussed in [43] . The relationship between delay embedding and error observed in this research is shown in Fig. 9 . In this figure, the average prediction MAPE across all twenty counties for delay embedding ranging between h = 10 and h = 56 is shown. The figure shows that with the increase in delay embedding, the accuracy of prediction increases. The predictions were performed for a four-week long window. Figure 11 presents the impact of delay embedding coordinates in system identification accuracy. All three case figures in Fig. 11 show that with the increase in the number of delay coordinate h, the imaginary parts of the system eigenvalues get closer to zero. Also, we observed that with the increase in h, the number of nonzero eigenvalues decreases. Figure 10 shows the exact relationship between h and the number of nonzero eigenvalues. Figure 11a shows how eigenvalues of A matrix delay embedding change with the increase in delay embedding when there is no control input in the system. Similarly, Fig. 11b , c shows the evolution of eigenvalue with respect to delay embedding, when weighting factor on mobility input w = 300 and 600, respectively. It shows that with the increase in the weighting factor, the amplitude of eigenvalues decreases. Also note that with the increase in delay embedding and weighting factor, the prediction accuracy increases. With model introduced in Sect. 5.1, prediction on case count was made. The prediction window is stretched from June 25, 2020 to July 17, 2020. We computed mean absolute percentage error (MAPE) to quantify the accuracy of the forecast. Let x k,i be the original Fig. 11 Evolution of eigenvalue spectrum of system matrix A for different weighting factors. Average prediction error was calculated for a 28-day prediction window across twenty counties Fig. 12 Case count prediction errors in three different scenarios are presented in the figure as a case study. For each day of the prediction window, we calculated the average of prediction errors across all twenty counties selected in the study cumulative COVID cases in the kth county up to any ith day, andx k,i be the forecast. Then, define the absolute percentage error (APE) E k,i and its time-average (MAPE) E k as follows: We generated control input sequences by using the strategy proposed in Sect. 4.3. Generated control inputs are shown in Fig. 7 . Mobility and the spread of infection obey a leaderfollower relationship. Multiple studies such as [18, 30, 34, 38] reported that mobility trends maintain a leaderfollower relationship of approximately two to three weeks with COVID-19 infection trends. The proposed model can also capture that relationship. To demonstrate that we have reorganized the data matrix X with daily cases instead of cumulative cases. Furthermore, to reflect that lead-lag relationship into the model, control inputs are shifted back 14 days, which means the current states are controlled by mobility inputs generated two weeks before. In Fig. 12 , prediction errors in three different scenarios are plotted such as -case-1: considering the leader-follower relationship -case-2: ignoring the leader-follower relationship -case-3: ignoring all the mobility inputs Figure 12 shows that after leader-follower adjustment in the model, prediction results were more accu-rate. The figure also exhibits that the prediction results were worst when mobility was not considered as the control input. This figure supports the validity of the proposed model. First of all, it shows that mobility as an exogenous input has improved the HDMD algorithm's performance, which supports the idea of including mobility as an input into the model. Furthermore, the figure testifies that the proposed model can reflect the lag-lead relationship between COVID-19 and mobility dynamics. Although prediction is not a primary application of DMD-type algorithms, we made forecasts to demonstrate the model's validity. Figures 13, 14, and 15 , show prediction results, respectively, for two-week and threeweek and four-week window for all the twenty counties we selected for this study. Table 1 illustrates average prediction errors. We calculated MAPE across the prediction window, and we define it as the prediction error for a specific county for a specific prediction window. We also included the number of delay embedding h and weight w on mobility used in each prediction window. For verifying the prediction results, we also estimate the numerical deviance of the prediction from actual number of cases. Let x daily k,i be the number of reported COVID cases in the kth county on any ith day, and x d k,i be the forecast of the same. Then, we take their numerical difference, i.e., x d k,i −x d k,i . However, data of daily reported cases exhibits fluctuating nature, as we mentioned in Sect. 4.1. Hence, we take 7-day moving average of the difference and define it as the daily prediction error. Figure 16 shows daily prediction error from Jun 28 to July 19, 2020, for all counties studied in this paper. The figure shows that for the biggest four counties, daily prediction errors were between -200/day and 300/day, while for the smallest four counties, the differences were between -20/day and 20/day. For rest of the counties, the errors were between the two above mentioned extreme cases. In the two-week window, overall MAPE prediction error across twenty counties was less than 5%. Prediction performance in Lee county was found to be best as the model could predict total cases with an estimated error of only 0.1%. On the other hand, the model performed worst for Manatee county, where the prediction error was 15.13%. In the three-week window, overall MAPE prediction error across twenty counties was less than 10%. In this window, the model's prediction performance for Volu- sia county was the best as it could predict total cases with an estimated error of only 0.18%, while the performance was the worst for Manatee county. For the latter, error was 35.45%. The table also illustrates that overall prediction performance was best at the two-week prediction window. It makes sense because longer prediction windows usually result in higher prediction errors. The proposed data-driven model provides a locally linear representation of a very high-dimensional complex system. However, the driving factors of the dynamics, such as the level of urbanization, transportation modes, demographic distributions, number of tourist attractions, vary from county to county. This is one of the primary reasons for relatively high prediction error in some counties. In the four-week window, we considered two different delay embedding h = 42 and 56. For h = 42 overall prediction error of the model was worse than that of h = 56. However, we observed some interesting patterns here. Setting h = 42, it was found that for five counties, prediction errors were more than 25%. For Manatee county, the model performed the worst with an error of 57.57%. On the other hand, when we chose h = 56, only two counties exhibited prediction error more than 25%, and the worst prediction result was found in Polk county with an error of 32.41%. However, prediction performance drastically deteriorated for the two most populated and COVID-infested counties of Florida, namely Miami-Dade and Broward. It implies that although increasing embedding improves the overall prediction performance, it can deteriorate prediction accuracy for some counties. Hence, careful selection of delay embedding is necessary, and it should be selected as per the model's objective. Although we have chosen some specific number of delay embedding in this We studied the role of delay embedding h and scaling factor w in forecasting performance. We observed that increasing the number of embedding and scaling factor improves forecasting performances for longer prediction windows. When the number of embedding increases, the model can train itself with a richer set of historical values, which might improve its accuracy for a longer prediction window, whereas for a shorter window, the model can adjust itself with a lesser number of delay embedding. The results also show that forecasting performance improves when we assign heavier scaling factors on control inputs for longer prediction windows. In order to validate the proposed model, further reproduction number of COVID-19 was estimated from the infection case counts forecast. By definition, reproduction number is the average fractional increase of the number of new cases with respect to the existing cases during the infectious period. For example, if the infectious period of COVID-19 is considered 34 days and the total number of cases are increased 2.5 times during those 34 days, the reproduction number is R 0 is 2.5. However, disease spread is a dynamic incident; hence, R 0 keeps evolving with time. We estimated effective reproduction number R t by using the predicted case counts obtained from the model. For the computation, we used a python module called epyestim [26] . The county-wise prediction errors are shown in the last column of the Table 1 , while county-wise prediction results are shown in Fig. 17 . We observe that in Fig. 17 in most of the cases, estimated results diverged from the original R after the first week of the window. However, the overall prediction error was 10.63%, and the difference between actual and predicted R(t) was around 0.2 even in the worst cases. It implies that the proposed model is also useful for forecasting the dynamic reproduction number, which is an important parameter for any epidemiological modeling. This paper developed a novel framework for data assimilation and uncovering interconnections between human activity and mobility (HAM) and disease spread. The study exploits recent advances in Koopman operator theory to understand the relationship between epidemic dynamics and HAM. The proposed framework results in a locally linear model (unlike SEIR and other traditional mechanistic models) for disease spread where HAM acts as an external influence. It is a purely data-driven model which does not need any parameter estimation. Although the approach is data-driven, the framework should not be confused with black-box machine learning models. A case study was performed for COVID-19 spread in Florida. The obtained linear model successfully predicted new cases and R 0 over the next few weeks window. There are a few shortcomings in the study which can be improved in future studies. Firstly, the study considers local HAM data (i.e., within each county) and does not consider inter-county HAM data. In other words, it does not consider the impact of global mobility. The HAM is a complex phenomenon, and it is challenging to consider every aspect of it in the modeling. For example, HAMs such as private parties, spring break traveling, usage of face masks, and long-term travel. Secondly, the study was focused on the early phase of COVID-19 spread. It will be interesting to study how are HAM and disease spread were related during the later stages of the pandemic. Funding The authors have no funding sources to declare for this work. Data availability Mobility data were collected from Google COVID-19 Community Mobility Report [22] and the COVID case counts from the USA Facts repository [54] . The computations were performed with MATLAB, and code is available upon request to the authors. The authors declare that they have no conflict of interest. Data-driven analysis and forecasting of highway traffic dynamics Human mobility networks, travel restrictions, and the global spread of 2009 h1n1 pandemic Multiscale mobility networks and the spatial spreading of infectious diseases Computer modelling of influenza epidemics for the whole country (ussr) Essai d'une nouvelle analyse de la mortalité causée par la petite vérole, et des avantages de l'inoculation pour la prévenir. Histoire de l'Acad Koopman mode analysis on thermal data for building energy assessment Mathematical models in epidemiology Role of social networks in shaping disease transmission during a community outbreak of 2009 h1n1 pandemic influenza Public health and ethical considerations in planning for quarantine Modeling infectious disease dynamics The role of the airline transportation network in the prediction and predictability of global epidemics Delay-coordinate maps and the spectra of Koopman operators Koopman spectra in reproducing kernel hilbert spaces A fractional ordered covid-19 model incorporating comorbidity and vaccination Compressed dynamic mode decomposition for background modeling Mobility restrictions for the control of epidemics: When do they work? Hankel matrix rank minimization with applications to system identification and realization Early indicators of human activity during covid-19 period using digital trace data of population activities. Front Comparison of regression approaches for analyzing survival data in the presence of competing risks Application of nonparametric models for analyzing survival data of covid-19 patients Reproducing kernel Hilbert space compactification of unitary evolution groups Google covid-19 community mobility reports Seir modeling of the covid-19 and its dynamics Multiple outbreaks for the same pandemic: Local transportation and social distancing explain the different "waves" of a-h1n1pdm cases observed in méxico during The mathematics of infectious diseases Epyestim. python package to estimate the time-varying effective reproduction number of an epidemic from reported case numbers Wrong but useful-what covid-19 epidemiologic models can and cannot tell us A big-data driven approach to analyzing and modeling human mobility trend under non-pharmaceutical interventions during covid-19 pandemic Prevalence: a translationinvariant "almost every" on infinite-dimensional spaces Human mobility and covid-19 initial dynamics The use and misuse of mathematical modeling for infectious disease policymaking: lessons for the covid-19 pandemic Spread of a novel influenza a (h1n1) virus via global airline transportation Encyclopedia of plague and pestilence: from ancient times to the present Data-driven modeling of covid-19-lessons learned Dynamic mode decomposition: data-driven modeling of complex systems Human mobility trends during the early stage of the covid-19 pandemic in the united states Koopman operator approach for instability detection and mitigation in signalized traffic Global and local mobility as a barometer for covid-19 dynamics Outbreak dynamics of covid-19 in Europe and the effect of travel restrictions Information spread in a social media age: modeling and control Event triggered social media chatter: a new modeling framework Modeling social contagion and information diffusion in complex sociotechnical systems On the structure of time-delay embedding in linear models of non-linear dynamical systems Outbreak dynamics of covid-19 in china and the united states Dynamic mode decomposition with control Understanding covid-19 nonlinear multi-scale dynamic spreading in Italy Some further consideration of the plague in Eyam, 1665/6 A review on covid-19 forecasting models A mathematical model for the global spread of influenza Modelling the role of optimal social distancing on disease prevalence of covid-19 epidemic Epidemic model of covid-19 outbreak by inducing behavioural response in population Exploring dmd-type algorithms for modeling signalised intersections Us covid-19 cases and deaths by state Using a partial differential equation with google mobility data to predict covid-19 in Arizona Mobile device data reveal the dynamics in a positive relationship between human mobility and covid-19 infections Spatial-temporal relationship between population mobility and covid-19 outbreaks in South Carolina: a time series forecasting analysis. medRxiv: the preprint server for health sciences pp Exploring the influence of human mobility factors and spread prediction on early covid-19 in the USA