The Limits to Learning an SIR Process: Granular Forecasting for Covid-19
Jackie Baek, Vivek F. Farias, Andreea Georgescu, Retsef Levi, Tianyi Peng, Deeksha Sinha, Joshua Wilde, Andrew Zheng
June 11, 2020

A multitude of forecasting efforts have arisen to support management of the ongoing COVID-19 epidemic. These efforts typically rely on a variant of the SIR process and have illustrated that building effective forecasts for an epidemic in its early stages is challenging. This is perhaps surprising since these models rely on a small number of parameters and typically provide an excellent retrospective fit to the evolution of a disease. So motivated, we provide an analysis of the limits to estimating an SIR process. We show that no unbiased estimator can hope to learn this process until observing enough of the epidemic so that one is approximately two-thirds of the way to reaching the peak for new infections. Our analysis provides insight into a regularization strategy that permits effective learning across simultaneously and asynchronously evolving epidemics. This strategy has been used to produce accurate, granular predictions for the COVID-19 epidemic that have found large-scale practical application in a large US state. The so-called Susceptible-Infected-Recovered (SIR) model, proposed nearly a century ago [2], remains a cornerstone for the forecasting of epidemics. In numerous retrospective studies the model has been found to fit the trajectory of epidemics well, while simultaneously providing a meaningful level of interpretability. As such, the plurality of models used for forecasting efforts related to the COVID-19 epidemic are either SIR models or close cousins thereof. Surprisingly, the experience with these forecasts has illustrated that predicting the cumulative number of cases (or peak number of cases) in an epidemic, early in its course, is a challenging task. Ultimately, this paper is motivated by producing high-quality forecasts for the progression of epidemics such as COVID-19. However, we begin with a fundamental question that, despite the SIR model's prevalence, has apparently not been asked: what are the limits of learning in epidemic models such as the SIR model? Surprisingly, we find that it is fundamentally difficult to predict the cumulative number of infections (or the 'peak' of an infection) until quite late in an epidemic. In particular, we show that no estimator can hope to learn these quantities until observing enough of the epidemic so that one is approximately two-thirds of the way to reaching the peak for new infections. As far as we can tell, this result is the first of its kind for epidemic modeling. We find that this hardness is driven by uncertainty in the initial prevalence of the infection in the population, which is not observable with incomplete testing and/or asymptomatic patients. On the other hand, we show that certain other important parameters in the SIR model, including the so-called reproduction number and infectious period, are actually easy to learn. Our analysis of the limits to learning suggests a specific regularization approach that dramatically impacts forecast performance when learning across regions. In greater detail, we consider a differentiable model, where the true infection curve is latent and follows SIR dynamics.
The observations are the results of limited testing, which censors the true infection curve, and deaths. Our learning rate analysis suggests that by suitably 'matching' regions early in the epidemic to regions that are further along, we can build useful estimators of initial disease prevalence in the former regions. This matching is effectively enabled through a regularization strategy in our approach where certain region-specific random effect terms are regularized to zero. We demonstrate that our approach allows us to learn granular epidemic models (essentially at the level of individual counties) that are substantially more accurate than a highly referenced benchmark [3]. As such, our model has found practical application in a large US state. Among other uses, the forecasts inform decisions related to providing surge capacities in hospitals across the state; such decisions necessitate a granular forecast. Related Literature: Extant epidemiological models are typically compartmental models, of which the SIR model [2] is perhaps the best known. The plurality of COVID-19 modeling efforts are founded on SIR-type models, e.g., [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. Some of these efforts consider generalizations to the SIR model that add additional states or 'compartments'. For instance, the (retrospectively fit) model studied in [6] considers an eight-state model (as opposed to three for the vanilla SIR). Not surprisingly, learning gets harder as the number of states increases [15]. Whereas the identifiability of the SIR model (in a deterministic sense) is well understood [16], this is not the case for the natural stochastic variant of this model. Specifically, calibrating a vanilla SIR model to data requires learning the so-called infectious period and basic reproduction rate. Both of these parameters are relatively easy to calibrate with limited data; this is supported by the present paper and is also commonly observed empirically; see for instance [15]. In addition to these parameters, however, one needs to measure both the initial number of infected individuals and the size of the susceptible population. Estimating the number of infected individuals poses a challenge in the presence of limited testing and asymptomatic carriers. Indeed, epidemiological models for COVID-19 typically assume that measured infections are some fraction of true infections; e.g., [4, 6]. This challenge is closely related to that of measuring the true fraction of cases that lead to fatalities (the so-called Infection Fatality Rate) [17]. Our main theorem shows that having to learn the true initial prevalence of the infection presents a fundamental difficulty to learning SIR models with limited data. Our theoretical results are complemented by empirical findings in the literature; see [18, 19]. Our result is the first to characterize the limits to learning an SIR process [1] and relies on a non-trivial application of the Cramér-Rao bound. Finally, on the empirical front, we compare the quality of our predictions to publicly available predictions from the IHME model [3]: this highly publicized benchmark model has a large number of forecast vintages concurrently available, which makes possible an analysis of relative prediction accuracy as a function of available data. We begin by describing the deterministic SIR model. Let s(t), i(t) and r(t) be the size of the susceptible, infected and recovered populations respectively, as observed at time t.
The SIR model is defined by the following system of ODEs, specified by the tuple of parameters (α, β, γ):

ds/dt = −(β/(αN)) s(t) i(t),  di/dt = (β/(αN)) s(t) i(t) − γ i(t),  dr/dt = γ i(t).   (1)

The rate of recovery is specified by γ > 0; 1/γ is frequently referred to as the infectious period. β > 0 quantifies the rate of transmission; β/γ, denoted R_0, is also referred to as the basic reproduction number. N is the total population (assumed known). The typical exposition of this model sets α = 1, corresponding to the setting where the infection is entirely observed. On the other hand, the quantity α ∈ (0, 1] corresponds to the fraction of true infections we actually observe. The role of α is made clear by the following Proposition: Proposition 2.1. Let {(s'(t), i'(t), r'(t)) : t ≥ 0} be a solution to (1) for parameters α = 1, β = β', γ = γ' and initial conditions i(0) = i'(0), s(0) = s'(0). Then, for any α' > 0, {(α' s'(t), α' i'(t), α' r'(t)) : t ≥ 0} is a solution to (1) for parameters α = α', β = β', γ = γ' and i(0) = α' i'(0), s(0) = α' s'(0). The SIR model described above is identifiable if i(t) is observable and N is known. Specifically: Proposition 2.2. Fix N, and let i(t) be observed over some open set in R_+. Then the parameters (α, β, γ) are identifiable. A self-contained proof follows immediately from the identity theorem; a more involved argument based on identification results for non-linear systems can be found in [16]. Stochastic SIR Process: Of course, any real-world model must incorporate noise, and we next describe a natural continuous-time Markov chain variant of the deterministic model described above, proposed by [1]. Specifically, the stochastic SIR process, {(S(t), I(t), R(t)) : t ≥ 0}, is a multivariate counting process, with RCLL paths, determined by the parameters (α, β, γ). The jumps in this process occur at rate λ(t) (to be defined) and correspond either to a new observed infection (where I(t) increments by one, and S(t) decrements by one) or to a new observed recovery (where I(t) decrements by one, and R(t) increments by one). Let C(t) = I(t) + R(t) denote the cumulative number of infections observed up to time t. Denote by t_k the time of the kth jump, and let T_k be the time between the (k−1)st and kth jumps. Finally, let I_k ≜ I(t_k), and similarly define R_k, S_k and C_k. The SIR process is then completely specified by: conditioned on the history through the (k−1)st jump, the inter-arrival time T_k is exponentially distributed with rate I_{k−1} (β S_{k−1}/(αN) + γ)   (2), and, independently, the kth jump is a new infection with probability β S_{k−1}/(β S_{k−1} + γ αN) and a recovery otherwise   (3). It is well known that solutions to the deterministic SIR model (1) provide a good approximation to sample paths of the SIR process (described by (2), (3)) in the so-called fluid regime; see [20]. The next section analyzes the rate at which one may hope to learn the unknown parameters (α, β, γ) as a function of k; our key result will illustrate that in large systems, α is substantially harder to learn than β or γ. In fact, an approximation suggests that the time taken to learn this parameter accurately will be approximately two-thirds of the time required to hit the peak for infections. In Section 4 we will consider a differentiable model inspired by the SIR process and show that a regularization strategy motivated by our learning analysis yields material performance gains. This section characterizes the rate at which one may hope to learn the parameters of a stochastic SIR process, simply from observing the process. Observations: Define the stopping time τ = inf{k : I_k = 0}; clearly τ is bounded. For clarity, when k > τ, we define C_k = C_{k−1}, I_k = 0, and T_k = ∞. We define the k-th information set O_k = (I_0, R_0, T_1, C_1, . . . , T_k, C_k) for all k ≥ 1. Note that I_k and R_k are deterministic given C_k, I_0, and R_0.
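To make the observation model concrete, the following is a minimal Gillespie-style simulation sketch of the jump process just described, using the rates implied by (2)-(3). All function and variable names are illustrative, not part of the paper's implementation.

```python
import numpy as np

def simulate_sir_jumps(alpha, beta, gamma, N, I0, R0, max_jumps, seed=0):
    """Simulate the observed SIR jump process and return the information set
    O_k = (I0, R0, (T_1, C_1), ..., (T_k, C_k)).

    Assumption (consistent with (2)-(3)): a jump is a new observed infection
    at rate beta*S*I/(alpha*N) and an observed recovery at rate gamma*I.
    """
    rng = np.random.default_rng(seed)
    P = alpha * N                        # effective (observed) population scale
    S, I, R = P - I0 - R0, I0, R0
    obs = []
    for _ in range(max_jumps):
        if I == 0:                       # epidemic has died out (tau reached)
            break
        rate_inf = beta * S * I / P      # rate of a new observed infection
        rate_rec = gamma * I             # rate of a new observed recovery
        lam = rate_inf + rate_rec
        T = rng.exponential(1.0 / lam)          # inter-arrival time T_k
        if rng.random() < rate_inf / lam:       # the jump is an infection
            S, I = S - 1, I + 1
        else:                                   # the jump is a recovery
            I, R = I - 1, R + 1
        obs.append((T, I + R))                  # record (T_k, C_k)
    return I0, R0, obs

# Toy usage: alpha*N = 10^5 and R_0 = beta/gamma = 2.5
I0, R0, obs = simulate_sir_jumps(alpha=0.1, beta=0.5, gamma=0.2,
                                 N=1_000_000, I0=10, R0=0, max_jumps=5000)
```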
We next characterize a sequence of systems of increasing size. Specifically, consider a sequence of SIR processes, indexed by n, such that α n N n → ∞ while β n = β, γ n = γ. Moreover, let m n = o(α n N n ) and β > γ. Finally, assume that I n,0 , R n,0 ≤ m n . Our main theorem characterizes the Fisher information of O mn relative to α n as n grows large. Interpretation of Corollary 3.2: Now since we know (from the fluid model (1)) that cumulative and peak infections scale with α n N n , it is reasonable to require that var(α n ) scale like o(α 2 n ). For this, Corollary 3.2 shows that we require m n = ω((α n N n ) 2/3 ) observations. How long might this take relative to the point at which, say, the epidemic itself is past its peak? One way of characterizing such a benchmark is comparing the time it takes to get to (α n N n ) 2/3 observations (call this time T 1,n ) with the time taken for the instantaneous reproductive number for the epidemic R t β γ S n (t)/(α n N n ) to fall below 1 (the expected increment in the infection process is negative at times when R t < 1); call this time T 2,n . More precisely, T 1,n = inf t : C n (t) ≥ (α n N n ) 2/3 and T 2,n = inf {t : βS n (t)/(α n N n ) < γ}. Unfortunately, characterizing either of these (random) times exactly in the stochastic SIR process appears to be a difficult task and so we consider analyzing these times in the deterministic model, where we denote T d 1,n = inf t : c n (t) ≥ (α n N n ) 2/3 for the process defined by (1) and similarly T d 2,n = inf {t : βs n (t)/(α n N n ) < γ}. We have: In summary, this suggests that the sampling requirements made precise by Corollary 3.2 can only be met at such time where we are close to reaching the peak infection rate. We next turn our attention to learning β and γ: where B 1 , B 2 > 0 depend only on β and γ and M 1 , M 2 > 0 are absolute constants. In stark contrast with Corollary 3.2, Theorem 3.4 shows that the variance in estimating β and γ grows small as m n and I n,0 grow, but is independent of α n and N n . Consequently, to achieve any desired level of accuracy, we simply need the number of events m n and the initial number of infections I n,0 to exceed a constant that is independent of the size of the system N n . Define p 1 2 ( β β+γ + 1 2 ) > 1 2 , and let P n = α n N n . We fix n large enough so that m n + C 0 ≤ Pn 2 and β(Pn−mn−C0) β(Pn−mn−C0)+Pnγ > p (this is possible since β β+γ > p and m n = o(P n )). We remove the subscript n henceforth. Define the indicator variable E k = 1{τ > k} on the event which the SIR process has not terminated after k jumps. The following lemma states that both E k and I k can be determined from C k , I 0 , and R 0 , which will allow us to decouple variables in O m in the analysis of the Fisher information. The result follows from the definitions of τ , E k , and C k ; the details can be found in the Appendix. for all k ≥ 0. For all k, E k = 1{C k > r k }. Moreover, when The next lemma writes an exact expression for J Om (α). Proof. We first define conditional Fisher information and state some known properties. Definition 3.7. Suppose X, Y are random variables defined on the same probability space whose distributions depend on a parameter θ. Let g X|Y (x, y, θ) = ∂ ∂θ log f X|Y ;θ (x|y) 2 be the square of the score of the conditional distribution of X given Y = y with parameter θ evaluated at x. Then, the conditional Fisher information is defined as J X|Y (θ) = E X,Y g X|Y (X, Y, θ) . Property 3.10. If X is deterministic given Y = y, g X|Y (X, y, θ) = 0. 
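For reference, the standard facts about Fisher information used in the proof below (the chain rule, conditioning, and reparameterization), which are presumably the content of Properties 3.8, 3.9 and 3.11, can be summarized as follows; the statements are the textbook versions, not the authors' exact phrasing.

```latex
% Standard Fisher information facts (under the usual regularity conditions):
\begin{align}
&\text{(chain rule)} & J_{X,Y}(\theta) &= J_{Y}(\theta) + J_{X\mid Y}(\theta),\\
&\text{(conditioning)} & J_{X\mid Y,Z}(\theta) &= J_{X\mid Y}(\theta)
   \ \ \text{if the law of } X \text{ given } (Y,Z) \text{ depends only on } Y,\\
&\text{(reparameterization)} & J_{X}(\alpha) &=
   \Big(\tfrac{\partial \eta}{\partial \alpha}\Big)^{2} J_{X}(\eta)
   \ \ \text{if the law of } X \text{ depends on } \alpha \text{ only through } \eta(\alpha).
\end{align}
```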
Since I 0 and R 0 are known and not random, the Fisher information of O m is equal to the Fisher information of (T 1 , C 1 , T 2 , C 2 , . . . , T m , C m ). Then, Property 3.8 implies Note that for any k, C k and T k only depend on and I k−1 can be determined from C k−1 (Lemma 3.5), the distributions of C k and T k are determined by C k−1 . Therefore, we use Property 3.9 to simplify the above equation to where we used J T1 (α) = J T1|C0 (α), J C1 (α) = J C1|C0 (α). Moreover, when E k−1 = 0, C k and T k are deterministic conditioned on C k−1 , which implies the score in this case is 0 (Property 3.10). Therefore, we can condition on E k−1 = 1 to write The last step is to evaluate g C k |C k−1 (C k , C k−1 , α) and g T k |C k−1 (T k , C k−1 , α). When E k−1 = 1, the distributions of C k and T k conditioned on C k−1 have a simple form provided in (2)-(3). Property 3.11 allows for straight-forward calculations, resulting in (8) . See Appendix for details of this last step. Proof of Theorem 3.1. We show both upper and lower bounds for J Om (α) starting from Equation (8) . For the upper bound, we have that for a constant H 1 . For the lower bound, we first prove a lower bound for Pr(E m = 1). Lemma 3.12. There exists a constant D that only depends on β and γ such that This result relies on an interesting stochastic dominance argument and can be found in the Appendix. Then, similarly to the upper bound, J Om (α) ≥ H 2 m 3 α 4 N 2 follows from using Pr(E m = 1) ≥ 1 2 and the fact that C k ≥ k+I0+2R0 2 when E k = 1 (Lemma 3.5). Combining the upper and lower bounds finish the proof. This section describes a practical SIR model that can incorporate a number of real-world features and datasets. We then consider an approximation to the MLE estimate for this model. Finally we propose an approach to overcome the difficulty in learning α. We present comprehensive experimental results at a 'regional' granularity (essentially close to county level) that illustrate strong relative merits. Discrete-time SIR Recall the stochastic SIR process, {(S(t), I(t), R(t)) : t ≥ 0}, a multi-variate counting process determined by parameters (α, β, γ). We now allow β to be time-varying, yielding a counting process with jumps : t ∈ N} for regions i ∈ I by considering the Euler-approximation to this counting process (e.g. [22] ). Specifically, let and ∆R[t] analogously. The discrete-time approximation to the SIR process is then given by: For each region i, the observability parameter α i and reproduction factor {β i [t] : t ∈ N} must be learned from data, but we assume the total population N i and recovery rate γ are known (a typical assumption; for COVID-19, γ ∼ 1/4). Demographic and mobility factors influence the reproduction rate of the disease. To model these effects, we estimate as a mixed effects model incorporating these covariates: We estimate the parameters i , representing stationary per-region random effects, and θ ∈ R d1+d2 , a vector of fixed effects shared across regions. In the real world, the SIR model is a latent process -we never directly observe any of the state . Note that other processes (deaths, hospitalizations, etc.) that depend on the latent state may also be observable, and can easily be incorporated into this learning scheme. The MLE problem for parameters (θ, α, ) is This is a difficult non-linear filtering problem (and an interesting direction for research). 
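A minimal code sketch of this discrete-time recursion in its deterministic (fluid) form may help fix ideas before the approximation is introduced. The Euler step below assumes the standard SIR update with the observed state on scale αN; all names are illustrative, and γ = 1/4 follows the value quoted above for COVID-19.

```python
import numpy as np

def sir_forward(alpha, beta_t, gamma, N, I0, R0, horizon):
    """Deterministic (fluid) forward pass of the discrete-time SIR recursion,
    obtained by dropping the martingale difference terms. A sketch, assuming
    the standard Euler step with a time-varying transmission rate beta_t[t]."""
    P = alpha * N
    S, I, R = P - I0 - R0, float(I0), float(R0)
    cum = [I + R]
    for t in range(horizon):
        new_inf = beta_t[t] * S * I / P   # Delta C[t]: new observed infections
        new_rec = gamma * I               # Delta R[t]: new observed recoveries
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        cum.append(I + R)                 # cumulative observed infections
    return np.array(cum)

# Toy usage: beta drops after a hypothetical intervention at t = 30
beta_t = np.where(np.arange(120) < 30, 0.5, 0.2)
curve = sir_forward(alpha=0.1, beta_t=beta_t, gamma=0.25,
                    N=1_000_000, I0=20, R0=0, horizon=119)
```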
We therefore consider an approximation: : t ∈ N} the deterministic process obtained by ignoring the martingale difference terms in the definition of the discrete time SIR process. We consider the approximation where ω i [t] is log-normally distributed with mean 1 and variance exp(σ 2 ) − 1. Under this approximation, the MLE problem is now transformed to which constitutes a differentiable model. We solve (9), (or a weighted version) using Adam [23] . 1 Working around the limits to learning Theorem 3.4 asserts that β is easy to learn via MLE, suggesting that the parameters θ, { i } underlying our model of β can be estimated as well. However, Corollary 3.2 illustrates that we cannot estimate {α i } in reasonable time or accuracy from infections alone. This necessitates augmenting the estimation problem (9) with auxiliary information. We propose to take advantage of heterogeneity across regions: infection curves start at different times in each region, and we typically have access to some set P [t] ⊆ I of regions that have already experienced enough infections to reliably estimate α i for i ∈ P [t] via MLE. At a high level, our strategy will be to identify the set P [t], estimate α i for i ∈ P [t], then extrapolate these estimates to obtain α i for i / ∈ P [t]. We define the set where γ 1 is a hyperparameter. In the fluid model this identifies the first time t where d 2 s(t)/dt 2 < 0. We expect α i to vary across regions due to different demographics (that impact asymptomatic rates) and testing policies. Given a set of parameters {α i : i ∈ P [t]} estimated via MLE, a natural approach to estimate α i for regions i / ∈ P [t] is by matching to regions in P [t] with similar covariates. It is also worth noting that since in large systems in the fluid regime, we may show that Overall Learning Algorithm ('Two-Stage') So motivated, given data up to time T , we now define our overall learning algorithm 'Two-Stage', as follows: where γ 2 is a hyper-parameter and δ i are region specific random effect terms. We then estimate the model parameters (θ, φ, δ, ) by minimizing the following specialization of (9): The hyper-parameters γ 1 , γ 2 and the learning rates are tuned via cross-validation. We use our estimates to forecast cumulative infections in the US at the level of sub-state 'regions', corresponding broadly to public health service areas. The median state has seven regions. The static covariates Z i for each region i comprise of 60 demographic features from multiple data providers, ranging from socio-economic to health indicators. The time varying covariates M i (t) correspond to mobility data inferred from aggregated cell phone data. All datasets are widely available and are in active use in other models. The Appendix provides a detailed description of these covariates. This section focuses on two goals: First, we perform an ablation study focused on the impact of assumed structure on the parameter α i . This allows us to understand whether and why optimizing (11) provides improved performance. Second, and perhaps more importantly, our overall goal was simply to produce accurate forecasts; here we compare the performance of our approach to published forecast performance for the most broadly cited forecaster for COVID-19, [3] . 
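Before turning to results, the following is a schematic sketch of the second stage of the 'Two-Stage' procedure evaluated below. A plain ridge regression of log α on the static covariates stands in for the regularized mixed-effects fit in (11), with the ridge penalty playing the role of shrinking the region-specific random effects δ_i to zero. The function names and functional form are illustrative assumptions, not the deployed implementation.

```python
import numpy as np

def two_stage_alpha(alpha_mle, peaked, Z, lam=1.0):
    """Stage 2: extrapolate alpha to non-peaked regions (schematic).

    alpha_mle : per-region MLE estimates of alpha (trusted only where peaked[i])
    peaked    : boolean mask for the set P[t] of regions past criterion (10)
    Z         : static covariate matrix (regions x features)
    lam       : ridge penalty, playing the role of the hyper-parameter gamma_2
    """
    Zp = Z[peaked]
    y = np.log(alpha_mle[peaked])            # fit log(alpha) on covariates
    # ridge solution: phi = (Zp'Zp + lam I)^{-1} Zp' y
    phi = np.linalg.solve(Zp.T @ Zp + lam * np.eye(Z.shape[1]), Zp.T @ y)
    alpha_hat = np.exp(Z @ phi)              # extrapolate to all regions
    alpha_hat[peaked] = alpha_mle[peaked]    # keep MLE values where reliable
    return np.clip(alpha_hat, 1e-4, 1.0)
```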
The impact of learning α: We compare four approaches to learning α: Default: Ignoring the modeling of α altogether by setting α i = 1 ∀i ∈ I (the 'default' SIR choice); MLE: Learn α i for each region via the MLE approximation (9); Two-Stage: The two-stage approach specified by (11); and Idealized: Using a value of α computed in-sample (i.e. by looking into the future) via (9 Turning to our proposed approach, we see that Two-Stage significantly outperforms MLE far away from the test date. Close to the test date the two approaches are comparable. For maximum daily infections -perhaps the most critical metric for capacity planning -MLE drastically underperforms Two-Stage far from the test date. Our approach to learning from peaked regions significantly mitigates the difficulty of learning α. As further evidence, Figure 2 shows how well α from the April 21 vintage predicts α in the final May 21 vintage, for regions that have peaked by May 21. As Proposition 3.3 predicts, the April 21 α are close to the May 21 α values for peaked regions, in both MLE and Two-Stage models. For non-peaked regions, the April 21 MLE α estimates vary much more than the Two-Stage estimates, and tend to underestimate the true α value. Overall model performance Finally, we compare our analyzed models to the widely used IHME model [3] . Figure 3 In contrast with the plurality of available forecast models (which make predictions at the level of a state), the models we construct here are at a much finer level of granularity (essentially, the county) and provide accurate predictions early in an epidemic. As a result our models can guide the deployment of resources in states with heterogeneity in resource availability and prevalence. The model was deployed in such an operational fashion in a large US state, where it was used to proactively place hospital resources (mobile surge capacity) in areas where we anticipated large peaks in infections. Many of these predictions were proven accurate and timely in hindsight. This would not be possible with a state-level forecast (let alone a forecast with limited predictive power). While not every state was able to make such data driven decisions in resource deployment, we believe that in the event of a second outbreak, the approach we develop here can serve this need broadly. transmission and control of covid-19: a mathematical modelling study. The lancet infectious diseases, 2020. [9] Joseph T Wu, Kathy Leung, and Gabriel M Leung. Nowcasting and forecasting the potential domestic and international spread of the 2019-ncov outbreak originating in wuhan, china: a modelling study. The Lancet, 395(10225):689-697, 2020. [10] Cleo Anastassopoulou, Lucia Russo, Athanasios Tsakris, and Constantinos Siettos. Data-based analysis, modelling and forecasting of the covid-19 outbreak. PloS one, 15 (3):e0230405, 2020. [11] Kathakali Biswas, Abdul Khaleque, and Parongama Sen. Covid-19 spread: Reproduction of data and prediction using a sir model on euclidean network. arXiv preprint arXiv:2003.07063, 2020. [12] Maria Chikina and Wesley Pegden. Modeling strict age-targeted mitigation strategies for covid-19. arXiv preprint arXiv:2004.04144, 2020. [13] Clément Massonnaud, Jonathan Roux, and Pascal Crépey. Covid-19: Forecasting short term hospital needs in france. medRxiv, 2020. [14] Rahul Goel and Rajesh Sharma. Mobility based sir model for pandemics-with case study of covid-19. arXiv preprint arXiv:2004.13015, 2020. [15] Kimberlyn Roosa and Gerardo Chowell. 
Assessing parameter identifiability in compartmental dynamic models using a computational approach: application to infectious disease transmission models. Appendices A and B contain the proofs of Theorems 3.1 and 3.4 respectively. Appendix C contains the proofs of Propositions 2.1, 2.2, and 3.3, each in their own subsections. Appendix D contains a result justfying the use of (10) as a sufficient condition in reliably estimating α. Appendix E contains a description of the covariates used in the practical SIR model. We finish the sections of the proof that were not included in the main paper. This includes the proof of Lemma 3.5, Lemma 3.12, and calcuations for Lemma 3.6. We define λ(α, k − 1, is the mean of the k-th inter-arrival time and η(α, C k−1 ) is the probability that the arrival in the k-th instance is a new infection rather than a recovery. Proof. Suppose k < τ i.e E k = 1. Then, k is equal to total number of jumps that have occurred so far (the number of movements from S to I and from I to R). The number of individuals that have moved from S to I is C k − I 0 − R 0 , and the number of movements from I to R is Suppose k ≥ τ i.e E k = 0. Then, k is greater than or equal to the total number of jumps, which is still equal to 2C k − I 0 − I k − 2R 0 . Hence C k ≤ r k in this case. Proof. Let X k iid ∼ Bern(p) for k = 1, 2, . . . . Let {A k : k ≥ 0} be a stochastic process defined by: Let τ A = min{k : A k ≤ r k } be the "stopping time" of this process. The proof of this claim involves showing the process {A k } is stochastically less than {C k }; the proof can be found in Section A.2.1. We now upper bound Pr(τ A ≤ m). τ A ≤ m if and only if A k ≤ r k for some k ≤ m. Before this happens, A k = C 0 + X 1 + · · · + X k . Therefore, if τ A ≤ m, it must be that C 0 + X 1 + · · · + X k ≤ k+I0+2R0 2 for some k ≤ m. Since E[X 1 + · · · + X k ] = pk, using the Chernoff bound (multiplicative form: is a constant since it is a geometric series with a ratio smaller than 1, since p > 1/2.) Let D be the solution to A.2.1 Proof of Claim B.5. Definition A.2. For scalar random variables X, Y , we say that X is stochastically less than Y For random vectors X, Y ∈ R n we say that X ≤ st Y if for all increasing functions φ : R n → R, We make use of the following known result for establishing stochastic order for stochastic processes. Theorem A.3 (Veinott 1965). Suppose X 1 , . . . , X n , Y 1 , . . . , Y n are random variables such that X 1 ≤ st Y 1 and for any x ≤ y, Proof of Claim B.5. Because of the condition β(P −m−C0) β(P −m−C0)+P γ > p, for k ≤ m and k ≤ τ , We condition on A k−1 = x and C k−1 = y for x ≤ y, and we must show A k ≤ st C k . (We do not need to condition on all past variables since the both processes are Markov.) If x ≤ r k−1 , then A k = A k−1 = x ≤ y = C k−1 ≤ C k . Otherwise, the process A k has not stopped, and neither has C k since y ≥ x. Then, A k ∼ x + Bern(p) and C k ∼ y + Bern(q) for some q ≥ p. Clearly, A k ≤ st C k in this case. We apply Theorem A. Derivation of We reparameterize to write the Fisher information as: (26) . We reparameterize to write Substituting, Using the expressions derived above for Thus, Fix an instance in which the assumptions of the theorem statement hold. We remove the subscript n, and let P = αN . Let p 1 2 ( β β+γ + 1 2 ) > 1 2 . Let = Cm−C0 m be an estimator for β β+γ ,B =S m m be an estimator for 1 β+γ forS m = min(m,τ ) k=1 This first lemma follows from (19) of the proof of Lemma 3.12. where B 1 , B 2 > 0 are constant that depend only on β and γ. 
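For reference, the multiplicative Chernoff bound invoked in the argument above, and again in the lemmas that follow, for a sum Z of i.i.d. Bernoulli random variables with mean μ = E[Z], is:

```latex
\Pr\!\left(Z \le (1-\delta)\mu\right) \le \exp\!\left(-\tfrac{\delta^{2}\mu}{2}\right)
\quad (0<\delta<1),
\qquad
\Pr\!\left(Z \ge (1+\delta)\mu\right) \le \exp\!\left(-\tfrac{\delta^{2}\mu}{2+\delta}\right)
\quad (\delta>0).
```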
The next two lemmas give a high probability confidence bound for estimators andB. (1 + δ) , τ ≥ m ≤ 2 exp(−mδ 2 /(4 + 2δ)). (36) The next proposition combines the two estimators from the above lemmas and into estimatorsβ and γ. where B 1 , B 2 > 0 are constants that depend on β and γ. We first show Theorem 3.4 using these results. We then prove Lemma B.2, Lemma B.3, and Proposition B.4 in Appendix B.2. Proof. Let δ = 5 log m m . First, we claim that the probability in Proposition B.4 is greater than Using δ ≤ 1 4 again, Hence, the bound in B.4 holds with probability greater than 1 − 8 m − 2B 1 e −B2I0 . Since we assume m(m + C 0 ) ≤ P and z = 1 − m+C0 Let R be the event that the confidence bounds (38)-(39) hold. Note that 1+δ 1−δ ≤ 1 + 3δ and for an absolute constant M 3 > 0. The second last step uses (42) and 1 + z ≤ 2. Similarly, Using the fact that (1 − δ)(1 + δ) ≥ 1 2 , for an absolute constant M 2 , since β > γ. Therefore, for an absolute constant M 1 > 0. Similarly, Proof. Fix m, let z := P −m−C0 P , p = β β+γ z. Then p > 1 2 . Define three stochastic processes (62) (63) Note thatC k is a modified version of C k whereC k still evolves after the stopping time. Claim B.5. A m is stochastically less thanC m (A m ≤ stCm );C m is stochastically less than B m (C m ≤ st B m ); that is, for any ∈ R, This claim follows from Theorem A.3, using a similar argument to Claim B.5. (68) Using the Chernoff bound gives, Therefore, Similarly, for the upper tail bound, we have due to the multiplicative Chernoff bound where Z is the sum of i.i.d Bernoulli random variables. Combine upper and lower tail bounds and note that p/z = β β+γ . Then, we can conclude, for any δ > 0, Proof. Conditioned on (I 0 , C 0 , I 1 , C 1 , . . . , I m−1 , C m−1 ) with τ ≥ m, we have are independent exponential random variables. Theorem 5.1 in [24] gives us a tail bound for the sum of independent exponential random variables: let X = n i=1 X i with X i ∼ Exp(a i ) independent, then for δ > 0, Pr(X ≥ (1 + δ)µ) ≤ 1 1 + δ e −a * µ(δ−ln(1+δ)) ≤ e −a * µ(δ−ln(1+δ)) (79) where µ = E[X], a * = min 1≤i≤n a i . LetS m| C, I beS m conditioned on (I 0 , C 0 , I 1 , C 1 , . . . , It is easy to verify the following facts Combining these with Eqs. (79) and (80), we have Therefore, Then, for any sets U 1 , U 2 , Proof. As in [25, 26] , the solution {(s (t), i (t), r (t)) : t ≥ 0} with α = 1 can be written: Making the appropriate substitutions yields the following equivalent system: Therefore, it remains to show that for α > 0, {(s(t), i(t), r(t)) : t ≥ 0} {(α s (t), α i (t), α r (t)) : t ≥ 0} is a solution for (97) and (98). Starting with (97), where Noting that ξ (t) = ξ(t) and substituting i(t) = α i (t) yields the equations below, clearly showing that {(s(t), i(t), r(t)) : t ≥ 0} satisfy (97) and (98): Proof. Consider initial conditions (s(0), i(0), 0), as in [25, 26] , the analytical solution is given by Consider two SIR models with parameters (α, β, γ) and (α , β , γ ), and initial conditions (s 0 , i 0 , 0) and (s 0 , i 0 , 0) respectively. We claim that infection trajectories i(t) and i (t) being identical on an open set [0, T ) implies the parameters and initial conditions are identical as well. Assume i(t) = i (t) for all t ∈ [0, T ); then, given the exact solution above it follows that As functions of x, both the RHS and LHS in the equality above are holomorphic, and hence, using the identity theorem, we conclude that αN = α N , s 0 = s 0 , − β αN = − β α N , γ = γ , which completes the proof. Proof. 
The crux of the problem is summarised in two smaller results, bounding T d 1,n and T d 2,n respectively. To ease our exposition, we will let P n = α n N n and drop the index n in these helper results. Proposition C.1. There exists a constant ν 1 that only depends on γ, β such that Proposition C.2. Let ρ 1 = 1 − 1 log log P , there exists a constant ν 2 that only depends on γ, β and a constant C = O(1), such that The argument follows directly by taking the limit of the bounds we provide in Propositions C.1-C.2. Specifically, using that the constants ν 1,n , ν 2,n do not change with n, we arrive at Proof of Proposition C.1. Defineĩ(t) such thatĩ(0) = i(0) and dĩ dt = (β − γ)ĩ, implying Since dĩ dt ≥ di dt for all t,ĩ(t) ≥ i(t) for all t. Then, for all t, ds dt = −β s P i ≥ −βi ≥ −βĩ. Hence we can write Since s(0) − s(T d 1 ) = P 2/3 − c(0), setting t = T d 1 and solving for T d 1 in the inequality above results in for ν 1 = β−γ β 3/2 as desired. For ρ ∈ [0, γ β ], let t ρ be the time t when s(t) P = ρ. ρ will represent the fraction of the total population that is susceptible. Since ρ ≤ γ β , i is increasing for the time period of interest. Let β > γ, P be fixed. Let ρ 1 = 1 − 1 log log P and ρ 2 = γ β . We assume P is large enough that Proof of Lemma C.3. Fix ρ. At time t ρ , the total number of people infected is c(t ρ ) = i(t ρ )+r(t ρ ) = P (1 − ρ), by definition. At any time t ≤ t ρ , the rate of increase in i is . Using the fact that i(0) ≥ c(0) 2 and rearranging terms gives the desired result. Proof of Lemma C.4. The difference in s between t ρ1 and t ρ2 is s(t ρ1 ) − s(t ρ2 ) = P (ρ 1 − ρ 2 ). As a consequence of the mean value theorem, Using these two expressions, The desired expression follows from rearranging terms. Lemma C.5. For any ρ ≤ min{ γ β , 1/2}, t ρ ≤ 1 βρ−γ log ν2 i(0) P , for ν 2 = 2(β−γ) β . The proof of this lemma follows the exact same procedure as the proof of Proposition C.1. Proof of Lemma C.5. We proceed in the same way as the proof of Proposition C.1 except in this case we will lower bound s(0) − s(t). We achieve this by lettingĩ be defined to grow slower than i, so it is used as a lower bound. Defineĩ(t) such thatĩ(0) = i(0) and dĩ dt = (βρ − γ)ĩ, implying i(t) = i(0) exp{(βρ − γ)t}. Since dĩ dt ≤ di dt when ,ĩ(t) ≤ i(t) for all t < t ρ2 . In addition, when t < t ρ2 , s P ≥ γ β ≥ ρ. Then, for t < t ρ2 , Hence we can write s(t) ≤ s(0) + Since s(t ρ ) = ρP , ρP ≤ s(0) − βρi(0) βρ − γ (exp{(βρ − γ)t ρ } − 1). Solving for t ρ results in where ν 2 = 2(β−γ) β , using the fact that ρ ≤ 1/2. Proof of Proposition C.2. Using the results from Lemmas C.3-C.5, t ρ2 = t ρ1 + (t ρ2 − t ρ1 ) . Note that, as required in the statement, C = O(1). Indeed, and so, as P grows large, C tends to (1 − ρ 2 )/ρ 2 (β − γ) (recall that c(0) = O(log log P )). In the practical model, P [t] represents the regions that have enough observations to reliably estimate α at time t. In this section, we justify why the definition (10) is a sufficient condition. Let T d 3,n = inf{t : d 2 s dt 2 > 0} be the time at which the rate of new infections is highest in the deterministic SIR model. Proposition D.1. As n → ∞, T d 1,n ≤ T d 3,n . Proof. From (127), we see that d 2 s dt 2 > 0 if and only if s < γ β α n N n + i. At t ≤ T d 1,n , c(t) ≤ (α n N n ) 2/3 which implies i(t) ≤ (α n N n ) 2/3 and s(t) ≥ α n N n − (α n N n ) 2/3 . We assume n is large enough so that 2(α n N n ) −1/3 < 1 − γ β . 
Then, (128) cannot hold: 2(α n N n ) −1/3 < 1 − γ β (129) γ β + (α n N n ) −1/3 < 1 − (α n N n ) −1/3 (130) γ β α n N n + i(t) ≤ γ β α n N n + (α n N n ) 2/3 < α n N n − (α n N n ) 2/3 ≤ s(t). Therefore, T d 1,n ≤ T d 3,n . For observed COVID-19 cases, we use publicly available case data from the ongoing COVID-19 epidemic provided by [27] . X i [t] consists of static demographic covariates and time-varying mobility features that affect the disease transmission rate. The dynamic covariates proxy mobility by estimating the daily fraction of people staying at home relative to a region-specific benchmark of activity in early March before social distancing measures were put in place. We also include a regional binary indicator of the days when the fraction of people staying home exceeds the benchmark by 0.2 or more. These data are provided by Safegraph, a data company that aggregates anonymized location data from numerous applications in order to provide insights about physical places. To enhance privacy, SafeGraph excludes census block group information if fewer than five devices visited an establishment in a month from a given census block group. Documentation can be found at [28] . The static covariates capture standard demographic features of a region that influence variation in infection rates. These features fall into several categories: • Fraction of individuals that live in close proximity or provide personal care to relatives in other generations. These covariates are reported by age group by state from survey responses conducted by [29] . • Family size from U.S. Census data, aggregated and cleaned by [30] . • Fraction of the population living in group quarters, including colleges, group homes, military quarters, and nursing homes (U.S. Census via [30] ). Some evolutionary stochastic processes Containing papers of a mathematical and physical character A modified sir model for the covid-19 contagion in italy A simple sir model with a large set of asymptomatic infectives Modelling the covid-19 epidemic and implementation of population-wide interventions in italy Coronatracker: world-wide covid-19 outbreak data analysis and prediction CDC Interactive Atlas of Heart Disease and Stroke The authors express their gratitude to Danial Mirza, Suzana Iacob, El Ghali Zerhouni, Neil Sanjay Pendse, Shen Chen, Celia Escribe and Jonathan Amar for their excellent support in data collection and organization.