title: Markovian And Non-Markovian Processes with Active Decision Making Strategies For Addressing The COVID-19 Pandemic
authors: Eftekhari, Hamid; Mukherjee, Debarghya; Banerjee, Moulinath; Ritov, Ya'acov
date: 2020-08-02

Abstract: We study and predict the evolution of Covid-19 in six US states over the period May 1 through August 31 using a discrete compartment-based model and prescribe active intervention policies, like lockdowns, on the basis of minimizing a loss function, within the broad framework of partially observed Markov decision processes. For each state, Covid-19 data for 40 days (starting from May 1 for two northern states and June 1 for four southern states) are analyzed to estimate the transition probabilities between compartments and other parameters associated with the evolution of the epidemic. These quantities are then used to predict the course of the epidemic in the given state for the next 50 days (test period) under various policy allocations, leading to different values of the loss function over the training horizon. The optimal policy allocation is the one corresponding to the smallest loss. Our analysis shows that none of the six states need lockdowns over the test period, though the no lockdown prescription is to be interpreted with caution: responsible mask use and social distancing of course need to be continued. The caveats involved in modeling epidemic propagation of this sort are discussed at length. A sketch of a non-Markovian formulation of Covid-19 propagation (and more general epidemic propagation) is presented as an attractive avenue for future research in this area.

The Coronavirus COVID-19 pandemic is one of the gravest global public health crises that the world has seen in the past 100 years, with echoes of the Spanish Flu of the early 20th century.
A highly contagious respiratory virus, it triggers problematic immune responses in a substantial fraction of individuals, especially among the elderly and those with certain co-morbidities, while also ravaging the lungs and other vital organs. Governments have had to impose extensive lockdowns all over the world in a bid to control its rampant spread, engendering in the process a massive economic downturn whose long-term effects remain largely unpredictable to date. This combination of a medical and fiscal crisis has brought forth an unprecedented response from several sections of society, almost on a war-like footing, with medical professionals at the helm, along with epidemiologists, economists, and, last but not least, data scientists bringing their combined expertise together to grapple with this threat; see, e.g., [35], [9], [24], [25], [33] (and references therein), to name a few. This paper aims to address the temporal evolution of epidemic diseases in a population, paying attention to realistic features of the COVID epidemic, in conjunction with an active [i.e., real-time] decision making strategy depending on the observables of the epidemic process. We model the flow of the epidemic using an SEIRD (Susceptible-Exposed-Infected-Recovered-Deceased) model, which extends the traditional SIR model with two additional compartments, Exposed and Deceased; the Infected compartment is further split into two sub-compartments, mildly infected and severely infected. Instead of analyzing a continuous time evolution via differential equations, we confine ourselves to a discrete time Markovian process, taking the point of view that days are a natural unit of measurement. We note that Markov processes have a long history in the study of infectious diseases, starting with the work of Kermack and McKendrick [19, 20, 21] and continuing in more recent works [4, 5, 40].
Furthermore, dynamic policies for the control of Markov decision processes have also been studied in some detail [39], [31], [32], [15], [11], [22]. We say that a compartment B is accessible from another compartment A (write A → B) if it is possible to move from A to B directly without hitting other compartments. For example, if we denote by S the compartment of susceptible people and by E the compartment of latent/exposed people, then it is immediate that S → E, but S ↛ R, where R is the compartment of recovered people, as a person cannot go to the recovery compartment directly from the susceptible one without hitting E; see Figure 1 for a detailed diagram of the flow. We assume that at each time point, an individual has two options: either stay in the same compartment as at the previous time, or move to another reachable compartment. The probability of transition to the next state is universal in the sense that it does not depend on the individual but may depend on the previous state. This mechanism generates a Markov process on the number of individuals in each compartment. Some core features of our proposed Markovian model, to be elaborated on in Section 2, are: (a) Allowing a distinction between mild and severe infections and the subsequent evolution of such individuals in time. (b) Explicit incorporation of an observed process generated from a partially observed parent process via a simple testing based mechanism. (c) Active decision-making strategies based on the history of the epidemic. (d) Optimizing decision making strategies in terms of an expected loss/reward function that measures the trade-off between the economic costs of lockdowns and the cost of lost lives or costs due to treatment denial. Clarification and detailed analysis of each of these aspects are provided in Section 2.
We also present some ideas about a non-Markovian process that models each individual separately, allowing their personal characteristics to determine their evolution, as a potential recipe for future studies. The non-Markovian model captures spatial heterogeneity by allowing the transition probability from susceptible to latent to depend on the distance between individuals. Section 3 deals with such details. Section 4 concludes with a discussion of various possible directions for future research. Our first, and more parsimonious, approach is related to the paradigm of Partially Observed Markov Decision Processes (henceforth POMDP), which have been used extensively in monitoring processes evolving over a state space with active intervention and decision making strategies (see [31], [32], [15], [11], [22] and references therein). Consider the following model for an epidemic: S denotes the class of susceptibles, L denotes latent individuals who are infected but cannot yet infect others, I_m comprises the (mildly) infected, I_s the severely ill, R those that have recovered, and D the deceased from the disease. The transition chain is from S via L to I_m, from I_m either to R or to I_s, and from I_s either to R or to D. A susceptible person only gets infected upon contact with an infected person, not through a latent individual. Define N as the total number of people in a fixed location, which is assumed to stay constant over the time horizon. The state space for this Markov process is X = N^6, where a generic element of X should be thought of as a vector of the numbers of individuals in each compartment. The true numbers in the different compartments are driven by an underlying (hidden, henceforth to be referred to as population) Markov process which is not observed directly. We only observe (random or deterministic) fractions of these numbers, which gives rise to the observed Markov process.
Write the population Markov process as {X_t ∈ N^6}, t = 1, ..., T, over a finite time horizon, where X_{t,j} denotes the number of individuals in compartment j ∈ {S, L, I_m, I_s, R, D} at time t.

[Figure 1: Flow of the Markovian process between the compartments S (Susceptible), L (Latent/Exposed), I_m, I_s, R, and D.]

We denote the observed Markov process by X^O_t, described next: On each day, we observe a cohort of mildly infected people and a cohort of severely ill individuals via testing. The former are typically asked to quarantine at home and self-treat, while the latter require immediate medical assistance and are hospitalized subject to capacity constraints. Furthermore, we record a fraction of recovered and deceased people, who are documented. The individuals that showed up for testing and tested negative [and can be taken to belong to the S ∪ L group] are not tracked. We assume therefore that we have a partially observed process. We take our action space to be A = {0, 1, 2}, where 0 indicates no lockdown, 1 indicates partial lockdown, i.e. some essential businesses and organizations stay open, and 2 denotes a full lockdown, where only the most essential activities are sustained. Before proceeding further, let us articulate our convention about compartment numbers: any compartment number at a given time t is taken to be the total count recorded by the end of day t; thus the number in a compartment at (the end of) day t + 1 is given by the number at day t plus movement into and out of the compartment on day t + 1, where the parameters for such movements are either time invariant or depend upon the numbers recorded at (the end of) day t.
The action a_t (which is 0, 1 or 2) is proposed at the end of day t and affects the infection dynamics from day t + 1 onwards: a_t = 0 corresponds to 'no lockdown', which we interpret broadly to mean that most activities are in place though people are by and large practicing generally responsible behavior, a_t = 1 means 'mild lockdown', where a substantial number of activities are closed, and a_t = 2 means 'severe lockdown', where most activities have been restricted and human contact is minimal. Policy (decision) making, i.e. which action to take at what time, is based on the numbers for the observed process via an optimization scheme based on a reward. The transition probability from the Susceptible to the Latent state on day t + 1 is given by

P_{S→L}(t+1) = R_{a_t} (P_{I_m→I_s} + P_{I_m→R}) X_{t,I_m}/N,  where  R_{a_t} = R_0 + r_1 1{a_t = 1} + r_2 1{a_t = 0},

with R_0 denoting the basic reproductive number under severe lockdown and r_1 < r_2. The basic reproduction number during any fixed lockdown phase is the average number of persons that an infected person infects in that phase over the course of their illness when the proportion of susceptibles is near 1, i.e. the epidemic is in a controlled phase. Even though early estimates of the basic reproductive number (under no lockdown) were between 2.43 and 3.10 [13], we assume that over our period of analysis [May 2020 onwards, by which time the epidemic had spread considerably, at least in the northern US states] the value of R_{a_t} remains smaller than 2 due to increased awareness in society. Specifically, we assume that after reopening, the value of R_{a_t} is 1.8 under no lockdown, while it assumes the values 1.3 and 0.8 respectively (drops of 0.5) under the first and second levels of lockdown.¹ The probabilities P_{I_m→I_s} and P_{I_m→R} are the chances of transiting from the I_m state to states I_s and R respectively on a given day and are not taken to depend on t. The mathematical derivation of this formula is given in Section 5.
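As an illustration, the action-dependent reproduction number and the S → L probability above can be sketched in a few lines. The constants R_0, r_1, r_2 follow the text; the function names and the placeholder values for P_{I_m→I_s} and P_{I_m→R} are ours, and the per-day form is our reading of the formula derived in Section 5:

```python
R0, R1_INC, R2_INC = 0.8, 0.5, 1.0   # severe-lockdown R and the increments r1 < r2

def reproduction_number(action):
    """R_a for action a: 0 = no lockdown, 1 = partial, 2 = full."""
    return R0 + (R2_INC if action == 0 else R1_INC if action == 1 else 0.0)

def p_S_to_L(action, X_Im, N, P_Im_Is=0.05, P_Im_R=0.10):
    """Per-day S -> L probability: R_a times the daily exit rate from I_m,
    scaled by the infected fraction X_Im / N."""
    return reproduction_number(action) * (P_Im_Is + P_Im_R) * X_Im / N
```

With the values in the text, `reproduction_number` returns 1.8, 1.3 and 0.8 for actions 0, 1 and 2 respectively.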
For the other transition probabilities, we assume that on any given day, an individual moves from L to I_m with probability P_{L→I_m}. Similarly, an individual in I_m will either move to I_s (w.p. P_{I_m→I_s}), or to R (w.p. P_{I_m→R}), or stay in I_m (w.p. 1 − P_{I_m→I_s} − P_{I_m→R}). An individual in I_s can go to R (w.p. P_{I_s→R}), or to D (w.p. P^t_{I_s→D}), or stay in I_s (w.p. 1 − P_{I_s→R} − P^t_{I_s→D}). The transition probability from I_s to D depends on time t through the number of severely ill people in that location: if this exceeds the medical capacity, a severely ill person cannot be accommodated, and the subsequent lack of medical care will affect the prognosis of the illness, leading to a higher probability of death. Specifically, we use a simple dichotomous form:

P^t_{I_s→D} = P_{I_s→D,1} if X_{t,I_s} ≤ cap, and P^t_{I_s→D} = P_{I_s→D,2} otherwise,

where cap is the hospitalization capacity at the location, and P_{I_s→D,2} is generally taken to be a multiple of P_{I_s→D,1} (a factor of 3 in our presented analysis). [Footnote 1: Estimating the effect of lockdowns on the basic reproductive number (r_1 and r_2) is difficult given the available data, so we set r_1 = 0.5 and r_2 = 1 for our analysis. In principle, one can also estimate these parameters if more data is available.] The other two compartments, R and D, are terminal compartments, i.e. if a person is in one of these compartments, their status changes no further. The transition process from X_t to X_{t+1} under action a_t can be summarized via the following mechanism: 1. Generate the random variables Y_1, Y_2, Y_3, Y_4, Y_5, Y_6 according to the transition probabilities above (henceforth Bin and Mult are abbreviations of Binomial and Multinomial respectively). 2. Update the state. The observed process, denoted by X^O_t, is driven by two mechanisms: first, the observed compartments themselves follow the same transitions as the population process, and second, at each point in time there is a flow from the population process to the observed one: a number of individuals are tested and added to the observed compartments.
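The one-day update mechanism just described can be sketched as follows. This is a minimal reading of the Bin/Mult scheme with illustrative probability values; the function and dictionary-key names are ours, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(X, action, N, cap, P):
    """One-day update of the population process X (keys S, L, Im, Is, R, D)."""
    # S -> L: driven by the current mildly infected and the action in force
    p_SL = min(1.0, P["R_a"][action] * (P["Im_Is"] + P["Im_R"]) * X["Im"] / N)
    new_L = rng.binomial(X["S"], p_SL)
    # L -> Im
    new_Im = rng.binomial(X["L"], P["L_Im"])
    # Im -> {Is, R, stay}
    to_Is, to_Rm, _ = rng.multinomial(
        X["Im"], [P["Im_Is"], P["Im_R"], 1.0 - P["Im_Is"] - P["Im_R"]])
    # Is -> {R, D, stay}; the death probability jumps when capacity is exceeded
    p_ID = P["Is_D1"] if X["Is"] <= cap else P["Is_D2"]
    to_Rs, to_D, _ = rng.multinomial(
        X["Is"], [P["Is_R"], p_ID, 1.0 - P["Is_R"] - p_ID])
    return {"S": X["S"] - new_L,
            "L": X["L"] + new_L - new_Im,
            "Im": X["Im"] + new_Im - to_Is - to_Rm,
            "Is": X["Is"] + to_Is - to_Rs - to_D,
            "R": X["R"] + to_Rm + to_Rs,
            "D": X["D"] + to_D}
```

By construction the update conserves the total population N, as required by the model.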
Note that an individual in X_{t,I_m} but not in X^O_{t,I_m} (respectively in X_{t,I_s} but not in X^O_{t,I_s}) will enter X^O_{t+1,I_m} (respectively X^O_{t+1,I_s}) if they test positive. We assume that the number of individuals that test positive on day t + 1 and are mildly infected (i.e., do not require immediate medical attention) follows a binomial distribution with parameters X_{t,I_m} − X^O_{t,I_m} and P_{t+1}(presents for testing | mildly infected), henceforth denoted by P^{test,mild}_{t+1}. Similarly, the number of individuals that test positive on day t + 1 and are severely infected (i.e., need immediate medical attention) follows a binomial distribution with parameters X_{t,I_s} − X^O_{t,I_s} and P_{t+1}(presents for testing | severely infected), henceforth denoted by P^{test,severe}_{t+1}. The process evolution can be summarized as follows: 1. Generate the required random variables. 2. Update the observed process accordingly. In order to simulate such a process over a time horizon t = 1, ..., T, we need four sets of quantities: 1. The numbers in the compartments of the population process at day 1. 2. The numbers in the compartments of the observed process at day 1. 3. The vector of transition probabilities between the compartments. 4. The testing probabilities, i.e. P^{test,mild}_t and P^{test,severe}_t. We will discuss points 1 and 2 later; briefly, we set them based on real data and an inflation factor (viewing the population process as an inflated version of the observed process). So, assume for the moment that we have X_1 and X^O_1. Coming to points 3 and 4, we estimate the parameters (both transition probabilities and testing probabilities) jointly, based on the real data. To sketch the idea, we essentially use a coordinate descent approach, i.e. we update the transition probability vector and the testing probabilities iteratively.
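The testing flow from the hidden process into the observed one amounts to two binomial draws per day; a minimal sketch (function name and dictionary layout are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def observe(X, XO, p_test_mild, p_test_severe):
    """Move newly tested individuals from the hidden pools into X^O."""
    # Newly observed cases are binomial draws from the as-yet unobserved infected.
    T_m = rng.binomial(max(X["Im"] - XO["Im"], 0), p_test_mild)
    T_s = rng.binomial(max(X["Is"] - XO["Is"], 0), p_test_severe)
    out = dict(XO)
    out["Im"] += T_m
    out["Is"] += T_s
    return out
```

The observed counts can never exceed the corresponding population counts, since the draws are bounded by the sizes of the unobserved pools.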
Given an initial vector of transition probabilities, P_0, one can generate the population process over the training period as described in Section 2.2. Now, as far as the observed process is concerned, note that we cannot generate T_m, T_s as we don't know the testing probabilities yet. An intuitive and effective way is to replace them by the real data on daily new cases, appropriately divided into mild and severe compartments: say at day t, we set T̃_m = new mild cases and T̃_s = new severe cases. Starting from Day 1, and using {(T̃_m(t), T̃_s(t))}, we can generate a version of the observed process, say X̃^O_t, over the training period via steps (1) and (2) displayed above in this section [obviously ignoring the last two lines of (1)]. Next, the testing probabilities are estimated via the maximum likelihood principle as the ratios of observed mild cases and observed severe cases to the numbers of unobserved mild and severe cases respectively, since the denominators can be recovered from the infected numbers for the hidden and observed processes. Given these testing probabilities, one can simulate the observed process X^O_t for any transition probability vector P via steps (1) and (2) above, and determine the one which emulates the real process best, i.e. minimizes a loss function measuring the discrepancy between X^O_t and the real data process. We now set the value of this minimizer to P_0 and repeat the two updating steps. We keep updating both sets of parameters over a fixed number of iterations and choose our final estimate as the one corresponding to the iteration with the smallest minimized loss over all iterations. The following algorithm (Algorithm 1) summarizes this discussion:

Input: initial estimate of transition probabilities P̂_init and the number of iterations K;
for k = 1, ..., K: using P^{test,mild(k)}_t and P^{test,severe(k)}_t, obtain the updated transition probabilities P̂^(k) by minimizing the loss in Equation (2.1), and let loss^(k) be the value of this minimized loss;
end
Set k_min = argmin_k loss^(k). Return P̂ ≡ P̂^(k_min), P^{test,mild(k_min)}_t and P^{test,severe(k_min)}_t.

We next describe each part of the iterative update procedure in detail. We start with the method of obtaining estimates of P^{test,mild}_t and P^{test,severe}_t given a set of values P for the transition probabilities. Given T(t) new cases on day t based on real data [14], we set T̃_m(t) = p_m · T(t) and T̃_s(t) = p_s · T(t), where p_s is the proportion of severe cases and p_m ≡ 1 − p_s the proportion of mild cases. Exact values for these proportions are presented in Section 2.6, where we do a state by state analysis. The generation of the process X̃^O_t incorporating real data is described next: the processes {X_t} and {X̃^O_t} are simulated for T days, where T is the length of the training period [taken to be 40 days], and the testing probabilities are calculated as the proportions of daily new cases to unobserved cases, as in the following algorithm (Algorithm 2).

Input: start with the data on daily new cases for T consecutive days and estimates of the transition probabilities between compartments;
Generate the processes X_t and X̃^O_t for days 2 through T, with initial values for X_1 and X̃^O_1.

The initialization of the two processes, as well as the elicitation of the initial transition probabilities, will be discussed later. Observe that the above algorithm gives us testing probabilities for days 2 through T. Testing probabilities on day 1 are not required, as we are not generating data on day 1; we fix it based on real data. We next estimate the parameters P = (P_{L→I_m}, P_{I_m→R}, P_{I_m→I_s}, P_{I_s→R}, P_{I_s→D,1}) (the inter-compartment transition probabilities)³ of the epidemic propagation.² This is based on a discrepancy calculation: for days 1 through T, we have real data available on the number of infected as well as the number of epidemic related deaths in the location of interest. Note that the first is not a cumulative quantity, while the second is.
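To make the discrepancy-based grid search concrete, here is a toy version of the inner update: a single transition parameter is fitted by minimizing a ratio-based discrepancy against a synthetic "real" death series. The simulator `mean_deaths` is a deliberately crude stand-in of our own devising, not the paper's full process:

```python
import numpy as np

def mean_deaths(p_id, days=10, I0=100.0):
    """Cumulative deaths in a toy model: a fraction p_id of the (decaying)
    severe compartment dies each day."""
    D, I, out = 0.0, I0, []
    for _ in range(days):
        D += p_id * I
        I *= 0.95          # placeholder decay of the severe compartment
        out.append(D)
    return np.array(out)

real = mean_deaths(0.02)                       # synthetic "real" series
grid = np.linspace(0.005, 0.05, 46)            # candidate transition probabilities
losses = [np.mean(np.abs(mean_deaths(p) / real - 1.0)) for p in grid]
p_hat = float(grid[int(np.argmin(losses))])
```

The ratio-based loss mirrors the paper's choice of relative rather than absolute discrepancies; here the grid contains the true value 0.02, so the search recovers it exactly.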
This provides our real observed data process: (X^real_{t,active}, X^real_{t,D}) for 1 ≤ t ≤ T. On the other hand, for any parameter vector P, we can generate the population process X_t from 1 through T, and also the observed process X^O_t using the testing probabilities just determined. The process X^O should be differentiated from the process X̃^O of the previous section, as no real data is directly involved in the former's construction. The goal of generating this new process is to obtain the infected and deceased numbers at each time t and then compare them to the corresponding numbers for the real observed process; the parameter vector that minimizes the average discrepancy between the two is our estimate. [Footnote 2: Note that the probability of transition from S to L at any given time is known once these probabilities and the action status a are known.] [Footnote 3: We deliberately leave the recovery compartment out of the parameter estimation procedure due to the lack of real data on recovery for some of the states that we analyzed. With more data, one can incorporate it in the procedure to obtain a better fit.]

More specifically, we first obtain the smoothed daily numbers of deaths as follows. Since X^O_{t,D} is the cumulative number of deaths up to time t, the 7-day moving average of daily deaths is obtained via

D^O_smoothed(t) = (X^O_{t,D} − X^O_{t−7,D}) / 7.

Similarly, the smoothed daily number of real deaths is obtained via

D^real_smoothed(t) = (X^real_{t,D} − X^real_{t−7,D}) / 7.

The following loss is now optimized over a grid of parameter values:

loss(P) = E[ (1/T) Σ_t ( |X^O_{t,active} / X^real_{t,active} − 1| + |D^O_smoothed(t) / D^real_smoothed(t) − 1| ) ],   (2.1)

where P = (P_{L→I_m}, P_{I_m→R}, P_{I_m→I_s}, P_{I_s→R}, P_{I_s→D,1}) denotes the vector of transition probabilities and ϕ (taken to be 10 [38]) is an inflation factor that is used to generate the initial unobserved states from the initial observed real data. That is, for the observed process, we set the initial states by

X̃^O_{1,I_s} = p_S · (active cases on day 1),  X̃^O_{1,I_m} = (1 − p_S) · (active cases on day 1),

where p_S is the proportion of severe infections among the active cases. For the population process, the initial states are set to ϕ times the corresponding initial observed states. Finally, we note that the reason we used ratios instead of differences in the loss is that the numbers in different compartments have different orders of magnitude; for example, the number of infected cases is much larger than the number of deaths. The expectation in the above loss function is computed empirically by taking the average of the loss over several runs of the processes.

Initial Estimates. Referring back to Algorithm 1, note that we need P̂_init, initial estimates of the inter-compartmental probabilities, to start the algorithm. These estimates were obtained by matching certain features of the process with the available data. For example, since the time to transition from state L to state I_m is geometric with probability P_{L→I_m}, its expected value is 1/P_{L→I_m}. On the other hand, from studies on the incubation period of the disease [23], we know that the average time from latent to mildly infected is 5 days. Thus, we obtain the initial estimate P̂^init_{L→I_m} = 1/5.

Extrapolating beyond day T (prediction): Given the parameter estimates obtained in the last section, we can now project into the future and optimize the lockdown strategy based on these projections. This is of critical importance, as it can serve as a principled guide for real-world policy makers. Consider running the population process for T̃ days from day 1 (where T̃ > T) for forecasting purposes. We can extrapolate the testing probability for the additional T̃ − T days from the testing probabilities obtained for the first T days via the following rule: if there is no trend in the testing rates over the first T days, one can use their average, whereas if there is a monotone trend, one can use the value at time T. Note that for the training period 1 ≤ t ≤ T, the actions are known and fixed, as they are dictated by real-world measures taken during this period, and therefore the policy optimization only takes place during the test period T + 1 ≤ t ≤ T̃.

Policies and actions.
POMDP models are difficult to optimize in full generality, and as our model has several compartments and will be run with reasonably large populations, the problem of finding the globally optimal policy via maximizing a value function is essentially computationally intractable. Hence, we confine ourselves to the class of so-called bang-bang policies (see [6], [3], [17], [34] and references therein) that impose or lift lockdowns. These are relatively easy to optimize over and allow us to avoid formal belief update methods. In our implementation, these policies are parametrized by three thresholds, henceforth denoted by l, u_1 and u_2, and a parameter θ ∈ [0, 1] that produces a linear combination w = (θ X_{t,I_m} + (1 − θ) X_{t,I_s})/N that is compared to the three thresholds. Specifically, at each time t where t is divisible by 14,
• if w ≥ u_2, the highest level of lockdown is enforced for the next 14 days (a_{t'} = 2 for t ≤ t' ≤ t + 13),
• if u_1 ≤ w < u_2, a lower level of lockdown is enforced (a_{t'} = 1 for t ≤ t' ≤ t + 13),
• if w ≤ l, then all lockdown is released (a_{t'} = 0 for t ≤ t' ≤ t + 13),
• if none of the above holds, i.e. l < w < u_1, then the status quo is maintained for the next fourteen days (a_{t'} = a_{t−1} for t ≤ t' ≤ t + 13).
We denote this class of policies by Π_B = {(l, u_1, u_2, θ) : 0 < l < u_1 < u_2 < 1 and 0 ≤ θ ≤ 1}. Given the policy class, we next explain how to quantify the costs associated with a particular policy. Immediate reward function and policies: In order to quantify the various costs related to the epidemic and the actions taken, we propose the following immediate (daily) reward function given a fixed policy (that is, given the thresholds l, u_1, u_2, θ).
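Before turning to the reward, the four-parameter threshold rule just described can be written as a small function (names are ours; the rule itself follows the bullets above):

```python
def next_action(prev_action, X_Im, X_Is, N, l, u1, u2, theta):
    """Bang-bang rule, evaluated every 14 days: compare w to l < u1 < u2."""
    w = (theta * X_Im + (1.0 - theta) * X_Is) / N
    if w >= u2:
        return 2              # full lockdown for the next 14 days
    if w >= u1:
        return 1              # partial lockdown
    if w <= l:
        return 0              # release all lockdown
    return prev_action        # l < w < u1: keep the status quo
```

Checking the branches from the top down reproduces the case analysis in the bullets, including the status-quo case when w falls strictly between l and u_1.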
Given that at the end of day t we are at state X_t and propose action a_t [which depends on the threshold parameters as explained above], the immediate reward function [observed at the end of day t + 1 under policy a_t] is given by

r_{X_t, X_{t+1}, a_t} = −(a_t/2) C_E − C_L (X_{t+1,D} − X_{t,D}) − ρ C_L (X_{t+1,I_s} − cap)_+,

where C_E is the daily economic cost of lockdown for each particular state, obtained by scaling the cost for the US according to GDPs:

C_E(state) = (GDP of state / GDP of United States) · (daily cost of lockdown for the US in $million)

(see Table 1 and Section 2.6 for the numeric values of these quantities); C_L is the cost of life, the loss incurred due to the death of an individual (see Section 2.6); and the factor ρ < 1 accounts for the loss due to treatment denial once hospital capacity (taken to be 40% of hospital beds in the state [7]) is exceeded: we take this to be a quarter of the life cost. We then optimize over l, u_1, u_2, θ [via grid search] by maximizing the expected sum of the immediate reward functions over the time horizon of interest:

max_{(l, u_1, u_2, θ)} E[ Σ_{t=T+1}^{T̃} r_{X_t, X_{t+1}, a_t} ],

with the expectation computed by averaging over several runs of the process. We reiterate that in the definition of r_{X_t,X_{t+1},a_t}, the action a_t depends on the policy parameters l, u_1, u_2, θ, and X_t itself depends on the actions taken in the history of the process. Note that since the action a_t assumes a value of 2 under full lockdown, in the reward function we divide the action by 2 so that the economic cost under full lockdown is C_E and under partial lockdown C_E/2. For concreteness, we present the policy optimization procedure in Algorithm 3. In practice we replace the class of policies Π_B by a discretized version Π_d.

Input: fitted parameters P̂, the class of policies Π_d, number of replicates J;
for (l, u_1, u_2, θ) ∈ Π_d do
  for j = 1, ..., J do
    Run the population process X_t for days 1 through T̃ with estimated parameters P̂ and policy (l, u_1, u_2, θ) to compute R^(j)(l, u_1, u_2, θ) = Σ_{t=T+1}^{T̃} r_{X_t, X_{t+1}, a_t}(l, u_1, u_2, θ);
  end
  Approximate the expected reward by R̄(l, u_1, u_2, θ) = J^{−1} Σ_{j=1}^{J} R^(j)(l, u_1, u_2, θ);
end
Find the optimal policy (l*, u_1*, u_2*, θ*) = argmax_{(l, u_1, u_2, θ) ∈ Π_d} R̄(l, u_1, u_2, θ);
Output: return (l*, u_1*, u_2*, θ*).

Once the optimal policy for a given location has been determined up to time T̃, one can plot the evolution of the mean curve of the population process from days T + 1 through T̃ under the optimal policy for any compartment of interest (e.g. deaths or infected), with compartment probabilities given by the least squares estimates P̂. We next discuss quantifying the sensitivity around such a mean curve. There are two sources of uncertainty: (1) variability in the process given a fixed set of parameters for its evolution, and (2) error in estimating the parameters. We combine these two sources of variation to generate a prediction band. Towards that end, we perturb the estimated parameters by adding noise to the log-odds of these probabilities and transforming back to the original scale. However, we do not change the optimal policy: we run the population process using this noisy version of the optimal transition probabilities over a certain time horizon (T̃ − T = 50 days) multiple times, starting from the end of the training period, and obtain the mean curve, which takes care of the first source of variation. We then repeat this experiment multiple times with different sets of noisy parameters, which accounts for the second source of variation. Finally, we calculate the upper and lower 5th percentiles of the mean curves thus produced (for each time t) to generate the sensitivity band.
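The log-odds perturbation used for the sensitivity bands can be sketched as follows (the noise scale below is illustrative, not the one used in the analysis):

```python
import numpy as np

rng = np.random.default_rng(2)

def perturb(p, sd):
    """Add Gaussian noise on the log-odds scale; output stays in (0, 1)."""
    logit = np.log(p / (1.0 - p)) + rng.normal(0.0, sd, size=np.shape(p))
    return 1.0 / (1.0 + np.exp(-logit))
```

Working on the logit scale is what avoids the boundary problem: even for probabilities close to 0, the perturbed values are guaranteed to remain valid probabilities.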
This procedure is summarized as Algorithm 4. The perturbations are on the log-odds scale to avoid boundary issues involving the inter-compartment probabilities, as many of them are small numbers. The choice of a standard deviation for the noise of the same order as the mean, but smaller than it, allows exploration of parameters that are on the same scale as P̂ but not too far from it. In this section we provide a detailed analysis of six states: Michigan (MI), New Jersey (NJ), Florida (FL), Arizona (AZ), Texas (TX) and California (CA). For each state, optimal values of the inter-compartmental transition probabilities were obtained by analyzing data from its training period. Given these estimates from the respective training periods, policy optimization was accomplished as described in Section 2.4. Recall from Section 2.4 that we need both the cost of lockdown C_E and the cost of life C_L to obtain the optimal policy. These were calculated in the following way:
1. For C_E, we started with a suggested $7 trillion per year for the United States [ ], that is, roughly $20 billion per day for all the states combined. We then scaled this number by the fraction of the GDP of each state to the US GDP to obtain the economic cost (under lockdown) to the GDP for the state of interest. These numbers are displayed in the second column of Table 1.
2. For the cost of life C_L we use [12], where $4.7 million is given as the value of a statistical life year (VSLY) and a lower value of $1.5 million is obtained as the equal-value life year gained (see [12, Figure 2] for a comparison of these estimates). We ran our simulation under both these costs.
The six states were divided into two broad categories: northern states (MI, NJ) and southern states (TX, AZ, CA and FL). For each of the states, we trained using real data from 40 days, then predicted the process for the next 50 days (thus, a total of 90 days starting from the start of the training period).
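The GDP-based scaling of C_E is a one-line computation; the GDP inputs below are placeholders rather than actual figures for any state:

```python
US_DAILY_COST_MILLION = 20_000.0   # ~$20 billion/day, i.e. ~$7 trillion/year

def state_lockdown_cost(state_gdp, us_gdp):
    """Daily lockdown cost C_E for a state, in $million.
    GDP inputs can be in any common unit (they only enter as a ratio)."""
    return state_gdp / us_gdp * US_DAILY_COST_MILLION
```

For example, a state producing 5% of US GDP would be assigned a daily lockdown cost of $1 billion under this scaling.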
For the northern states our training data starts from May 1 and for the southern states from June 1, the reason for this discrepancy being the recent surge of Covid cases in the southern states, which renders the data from May less informative for prediction purposes at this time. We present MI, TX and FL in the main body of the paper and the remaining three states in the supplement. In the figures corresponding to the infected and death compartments, we display 3 curves: the blue curve corresponds to the evolution of the process using P̂ (the least squares estimate from Algorithm 1), labeled as the Estimated Curve; the green curve, labeled Real, shows the actual observed numbers for that state; and the Mean Curve in orange is the mean of the trajectories of the process using the perturbed parameters. More precisely, as described in Algorithm 4, we obtain a mean process for each set of perturbed parameters, and the Mean Curve is just the mean of those mean curves. The shaded region represents the 90% band based on the mean curves generated by the perturbed parameters, as described previously. The dashed vertical line separates the training period from the test period. For p_S we use (for all states but NJ) an estimated value of 0.07, which is approximately the proportion of daily active cases that are hospitalized; this quantity generally ranges between 5% and 8% during the training period for all the states we consider other than NJ [1, 14, 18]. For NJ we use an estimated value of 0.15 [2]. We performed our analysis twice for each state, once for the high life cost and once for the low life cost, and in each case we obtained the same optimal policy: lift the lockdown (i.e. a(t) = 0 after the training period). This may seem surprising at first sight, given that the numbers are rising fast, especially in the southern states, but a closer inspection of the available data provides some justification for the prescribed optimal policy.
Consider the projected numbers of severely infected and the figures on the number of hospital beds allocated for Covid patients in Table 2. It is projected that MI will have in the ballpark of 700 severe cases by the end of July.⁴ On the other hand, MI has around 11000 hospital beds for Covid patients, which is clearly more than sufficient. While the projected number of deaths in Michigan, shown by the blue curve in Figure (c), over-estimates the real death curve [orange] over the test period [till July 25th], which is quite flat, the divergence is still modest, and the a(t) = 0 policy over the training period is quite compatible with the slow growth rate of both the projected and real death curves. As far as Texas and Florida are concerned, unlike Michigan (and New Jersey, shown in the supplement), the number of active severe cases is projected to increase through the end of August, in keeping with the current growing phase of the epidemic, but the growth of severe cases is not particularly fast: for FL and AZ the projected curves are concave, whereas for TX and CA they have a roughly linear trend; at any rate, there is no 'shooting up' that might cause grave alarm. As for MI, the projected number of severe cases stays well below hospital capacity, i.e. the 'curve stays flat'. In summary, for each state under consideration, given the available number of (Covid) hospital beds, the number of people who currently are (and are predicted to be) severely ill, and the current and predicted numbers of deaths, the policy finds the economic cost incurred due to lockdown to be too high from the optimization point of view. It is also evident from our figures, especially from the test region, that our model makes overall decent predictions, and the projected numbers lie inside the band in most cases. Two further points need to be noted.
First, the presented bands are not based on a strict inferential principle and must be interpreted as a rough exploration of variability (sensitivity to perturbations of the least squares estimates) more than anything else. As far as the numbers of infected go, the bands generally tend to be overly conservative on the lower end, whereas for the number of deaths one notices the opposite trend: there it is the upper bounds that are ultra-conservative. More insight into the performance of these bands is needed. Second, the policy a(t) = 0 over the test period should be interpreted with caution. Recall that for us a(t) = 0 still corresponds to an R_0 value that is appreciably smaller than the values around 2.5 or more that were estimated in the pre-epidemic phase: in other words, no lockdown is not equated with a return to normal pre-pandemic life, but with public-health-informed behavior, with adequate distancing whenever possible and use of masks while generally going about one's business. Most activities and occupations can continue under such circumstances. See also the caveats we discuss in Section 4 in connection with the interpretation of prescribed policies. The vast majority of the existing literature on pandemic modeling focuses on Markovian processes, not least owing to a broader understanding of decision making strategies and optimization techniques in such frameworks, as well as the existence of ready software. While non-Markovian processes have been investigated in epidemic propagation contexts, mostly in network-based analyses (see [26], [29], [41], [37] and references therein), the literature in this arena is significantly sparser, and very few works in the recent slew of papers on the pandemic have considered this angle. However, we believe that the exploration of non-Markovian models is quite important, since the standard Markovian compartment models used in practice do not describe epidemic transmission entirely accurately. 
While they provide good working approximations, they are not good at capturing natural delays, nor at accounting for the fact that the probability of an individual transitioning from one compartment (say I_m) to another (say R) typically depends on their history in the I_m compartment and is actually time-dependent. In the case of Covid, for example, the delay between acquiring the infection (L) and being able to transmit the virus (I_m) has a mode at 5-7 days [23], while the delay between entering the I_m state and death has a mode of around 3 weeks [see here]. The parsimonious Markovian model is restricted to geometric delays with a mode at 0: under a geometric assumption, the memoryless property guarantees the Markovian property with a time-homogeneous transition probability (for i.i.d. individual trajectories). It is, of course, possible to extend the model by including compartments for each day between infection and disease: e.g., from the compartment of 3 days after acquiring the infection, a subject moves either to the compartment of the first day of disease or to the compartment of the fourth day of infection. But this becomes exceedingly cumbersome. Another issue to keep in mind is that, in a decision-based framework, imposing a lockdown typically influences infection rates between households but not within them, and therefore a uniform reduction in transmission propensity over all individuals may not constitute a good modeling strategy, especially in locations with typically large households. Also, the effect of imposing a lockdown is not immediate; there are natural delays in its coming into effect, so to speak. It is considerably easier to accommodate these diverse issues in a non-Markovian framework. Towards that end, we next present some ideas for a more flexible non-Markovian model which, instead of modeling the compartments, models each person separately. 
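To make the point about geometric delays concrete, the following short sketch checks that a geometric sojourn time always has its mode at day 0, whatever its parameter, whereas a unimodal delay peaked at 5-7 days cannot be produced by a single time-homogeneous Markovian compartment. The "empirical" pmf below is purely illustrative and is not fitted to any data.

```python
# Illustration: a geometric sojourn time (the Markovian assumption) always
# has its mode at day 0, while the observed L -> I_m delay peaks around
# days 5-7.  The "empirical" pmf below is illustrative, NOT fitted to data.

def geometric_pmf(p, kmax=20):
    # P(exactly k full days before transitioning) = p * (1 - p)^k, k = 0, 1, ...
    return [p * (1 - p) ** k for k in range(kmax + 1)]

def mode(pmf):
    # Index of the largest probability mass
    return max(range(len(pmf)), key=lambda k: pmf[k])

for p in (0.05, 0.2, 0.5):
    assert mode(geometric_pmf(p)) == 0   # geometric mode is at 0, whatever p is

# A hypothetical unimodal delay distribution with mode near day 6:
empirical = [0.00, 0.01, 0.03, 0.08, 0.14, 0.18, 0.20, 0.15, 0.10, 0.06, 0.03, 0.02]
print(mode(empirical))  # -> 6
```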
The development at this point is preliminary; however, as these ideas bear promise and point to new paradigms for studying future epidemics, we deem them worth a discussion. Consider a population of N members (for our current simulation N = 5 × 10^6, from which some results are displayed below) who are located in K (= 9 in the current simulation) clusters, each further divided into communities. For intuition, one may think of clusters as states and communities as counties [or, on a shorter scale, clusters as counties and communities as various municipalities within the county]. The population is divided into two sub-groups based on the risk of severe illness upon infection: each member i has an independent risk factor ρ_i, primarily governed by his/her medical profile and age, with ρ_i > r considered high-risk and low-risk otherwise. An infected member i can potentially infect I_i ∼ Poisson(λ) members, with λ = R_0. The natural history of a typical subject after being infected is described in Figure 5. After the incubation period (L), a subject either becomes "asymptomatic" or "mildly" ill (defined as a carrier that is not hospitalized). During these two periods the subject may infect others. From the asymptomatic state, the subject recovers after a few days, while from the mild illness state, the subject may transition to recovery, die without hospitalization, or become "severely ill" (i.e., is hospitalized and recorded as being severely ill). By the end of the severe-illness interval, she may die, recover or become "critically ill" (i.e., transferred to the ICU). Finally, after the ICU period, she may recover or die. The probability of a susceptible picking up the infection depends on the social-distancing policy enforced at the time of the potential infection, since that influences R_0. Once infected, the transition probabilities between the different stages depend on the risk factors, e.g. 
a person in the high risk group is more likely to land in the severely ill compartment, whereas a person in the low risk group will, with high probability, recover from either the asymptomatic stage or the mild infection stage. The period of time during which a person is capable of infecting others is taken to be i.i.d. across individuals (for our simulations the common distribution was taken to be a Beta(3, 6) distribution scaled to the maximum duration of the mild/asymptomatic intervals). Denote by Inf(j) the set of individuals infected by a person j. We pick every member of Inf(j) from a mixture distribution such that with high probability they are in close proximity to j [i.e. in the same community] and with small probability at a substantial distance [other communities and clusters]. Denote by H_kt (respectively L_kt) the number of active cases at time t in cluster k who are at high risk (respectively low risk). An individual j, at the end of their latency-plus-disease period, dies on day t with probability q(ρ_j, Σ_k H_kt) for some function q (or recovers otherwise). The function q is chosen so that it models the chance of dying in terms of the person's individual risk as well as the available medical facilities (which would typically be overburdened for high values of the sum in the second argument). The process starts with N_0 newly infected members. The current implementation enables importing a small number of cases daily from outside (outside of the clusters considered). The following social-distancing policy is simulated: if, for some function h, h(L_kt, H_kt) < c_1, social-distancing restrictions are relaxed. If h(L_kt, H_kt) > c_2, social distancing is enforced on high-risk members of cluster k, and if additionally L_kt/H_kt > c_3, it is applied to all members of the cluster. In our simulations we take h to be a convex combination of the proportions of low risk and high risk people. 
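A minimal sketch of the ingredients just described — the Beta(3, 6)-scaled infectious window, the Poisson offspring draw, the community-biased mixture for contacts, and the threshold rule on h — might look as follows. Every constant and name here (R0, W_MAX, P_LOCAL, the thresholds c1-c3, the weight w) is an illustrative stand-in, not a calibrated value from our simulations.

```python
import math
import random

# Toy sketch of the per-individual mechanics of the non-Markovian model.
# All constants are illustrative stand-ins, NOT the paper's calibrated values.

random.seed(0)
R0, W_MAX, P_LOCAL = 1.2, 14.0, 0.95   # offspring mean, max infectious days, local-mixing prob
K, COMMS, N_COMM = 3, 4, 100           # clusters, communities per cluster, members per community

def infectious_window():
    # i.i.d. infectious period: Beta(3, 6) scaled to the maximum duration W_MAX
    return random.betavariate(3, 6) * W_MAX

def n_offspring(lam=R0):
    # Number of people an infected member may infect: Poisson(lam), Knuth's method
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def draw_contact(j_cluster, j_comm):
    # Mixture: with prob P_LOCAL the contact is in j's own community,
    # otherwise anywhere in the population (other communities/clusters).
    if random.random() < P_LOCAL:
        return (j_cluster, j_comm, random.randrange(N_COMM))
    return (random.randrange(K), random.randrange(COMMS), random.randrange(N_COMM))

def policy(L_kt, H_kt, c1=0.02, c2=0.05, c3=2.0, w=0.5):
    # h = convex combination of the low- and high-risk active-case proportions
    h = w * L_kt + (1 - w) * H_kt
    if h < c1:
        return "relaxed"
    if h > c2:
        return "full lockdown" if L_kt / max(H_kt, 1e-12) > c3 else "high-risk lockdown"
    return "unchanged"

print(policy(L_kt=0.08, H_kt=0.06))  # -> high-risk lockdown
```

In a full simulation the three thresholds would then be tuned against the loss function, as described next.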
The constants c_1, c_2, c_3 are optimized to minimize a loss function that depends on the number of days each of the social-distancing policies is applied (economic cost), the total number of deaths, and the number of hospitalization days of severely ill cases, similar to the loss function we used in the Markovian framework. Figures 6a-6d present one such simulation result. In plots (c) and (d) the Y-axis represents clusters and the X-axis the number of days. The shaded areas denote the time durations during which a lockdown was enforced. The optimal lockdown policies conform to the intuition that frequent and more extensive lockdowns on high risk people are essential. The different shades denoting the lockdowns simply correspond to different clusters and carry no other meaning. In this paper we have presented a study of Covid-19 evolution via an SEIRD model within the broad framework of optimal decision making (in terms of lockdown policies) and provided what appear to be reasonable policies for the states under consideration. We have also sketched some ideas for the future development of non-Markovian epidemic propagation models that show considerable promise. In this concluding section, we point out possible extensions and future work in this direction, along with certain caveats of the current approach. Our Markovian model was developed in the framework of optimal bang-bang control policies. One important extension is to allow more complex policies into our set-up, but this requires more involved computation (see [8], [28], [16], [36] and [27] and references therein). Typically, lockdowns are imposed in a phased fashion, with essential goods and services phased out last and made operational first. Such nuances would require introducing more fine-tuned lockdown statuses than the ones we have used in this paper. 
Furthermore, it may also be of importance to develop more fine-tuned loss functions using input from economists to calibrate the losses due to different phases of lockdown, i.e. to differentiate more carefully between the lockdown costs of essential goods and services and their non-essential counterparts. As of now, we operate under the assumption that the economic impact of a partial lockdown is half that of a full lockdown, but this factor itself needs more investigation. An important feature of our non-Markovian model is the incorporation of a risk score for each individual in terms of demographic covariates, which allows the enforcement of movement or social distancing restrictions on people depending on their age, health status and medical conditions. This is, in fact, one of the key issues with Covid, as there is ample evidence that the virus can be extremely harsh on older and/or unwell people. We note that a similar distinction between high and low risk individuals can also be incorporated in the Markovian formulation by subdividing each compartment into high risk and low risk compartments, and indeed, our initial strategy for data analysis was to take this into account. Unfortunately, the available data sources do not provide enough information to subdivide these compartments reliably, and we chose to adopt the more parsimonious approach. However, future extensions of this type, should more data become available, would be useful. Finally, the Markovian model we used is purely temporal in the sense that we do not consider spatial effects on the transmission of infection, which can sometimes be quite informative. 
This is not particularly difficult to take into account by sub-dividing a state into sub-regions and using a connectivity matrix to quantify the chance of a random person in a particular sub-region (say, a county) traveling to another sub-region (or staying put), and then using this matrix to calculate the chance of a susceptible person picking up the infection via an extension of the formula at the beginning of Section 2.2. As far as the non-Markovian approach is concerned, we note that the model advocated above is flexible enough to incorporate different types of testing (both PCR and serum), more flexible social distancing and social behavior (e.g. networks of members), and different levels of identifiability (e.g., the effect of asymptomatic patients), all of which could be considered carefully in future work. Furthermore, modeling individual level variability, as opposed to a compartment as a whole, admits a considerable amount of heterogeneity into the framework. The non-Markovian model, however, has the drawback of being computationally more demanding. In order to make a case for broader adoption of this approach, innovative optimization techniques are needed to make such models scalable with respect to the population size, which presents another direction of research. Structured inference for process parameters is yet another direction where more work needs to be done. Currently, the bands generated around the mean prediction curves (for the infected and deceased compartments) use some 'common-sense' perturbations of the population parameters estimated from training data, but clearly more objectivity is needed. As discussed previously, the bands need to be taken with a grain of salt. Appropriate calibration of the variability of the obtained estimates remains an interesting and open question. 
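As a rough illustration of the spatial extension via a connectivity matrix mentioned above, suppose C[i][j] is the chance that a resident of county i spends the day in county j; the one-day infection probability for a susceptible in county i can then be obtained by averaging the infected density over the counties they may visit. All numbers below are hypothetical.

```python
import math

# Hypothetical sketch of the spatial extension: C[i][j] is the chance that a
# resident of county i spends the day in county j (rows sum to 1), and the
# S -> L formula is extended by averaging the infected density over visited
# counties.  All numbers are made up for illustration.

C = [[0.90, 0.08, 0.02],    # county 0 residents mostly stay put
     [0.05, 0.85, 0.10],
     [0.05, 0.15, 0.80]]
I_m  = [120.0, 40.0, 5.0]   # active mildly infected per county
Npop = [1e5, 5e4, 2e4]      # county populations
beta_mu = 0.3               # mu_a(t) * beta under the current lockdown status

def p_infection(i):
    # Exposure of a susceptible resident of county i: in each county j they
    # may visit, they face that county's infected density I_m[j] / Npop[j].
    exposure = sum(C[i][j] * I_m[j] / Npop[j] for j in range(len(C)))
    return 1.0 - math.exp(-beta_mu * exposure)

probs = [p_infection(i) for i in range(len(C))]
```

Residents of the county with the highest infected density face the highest daily infection probability, attenuated by how much time they spend elsewhere.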
We end our discussion with some caveats that are critical to keep in mind while interpreting the results of our research and similar works on Covid predictions and analyses. While the model presented in this paper generally exhibits a good fit to the training data and reasonable projections for the test period (to the extent that such data are available: at the time of writing, in July, data from August, which constitute one month of test data for the southern states, are unavailable), some of the simplifying assumptions made in the interests of tractability should be mentioned. 
1. Proper modeling of active case counts depends heavily on who receives the tests and how the number of tests changes over time [30]. The testing mechanism in this paper is somewhat idealized: an individual is only tested once, and it is further assumed that the compartment status of an individual is determined at the end of the day on which they are tested. The latter assumption ignores delays in getting test results, which were somewhat of an issue especially in the earlier phases of the epidemic. We have also assumed that tests give precise results: the false positive and false negative rates of tests are not taken into account in our model. 
2. In reality, patients and hospitals are not distributed uniformly across each state. Rather, there are usually centers with high concentrations of infected cases, especially in crowded urban areas. Similarly, economic lockdowns do not always apply to whole states; rather, strict lockdowns are enforced only on centers of spread at the county level. This, in turn, affects the cost and duration of lockdowns. In this work, we consider a state as one single unit, and therefore no differentiation is made at more granular levels (e.g. counties). However, as pointed out in the previous subsection, a spatio-temporal model for a state, where one treats counties as units of location, can alleviate some of these issues. 
3. In our analysis the economic cost of lockdowns is assumed to increase linearly in the duration of the lockdown. This assumption can be violated in practice, as longer lockdowns tend to have longer lasting and more devastating effects. For example, many businesses may be able to endure one week of lockdown, say with a cost c to the economy, whereas a lockdown of 4 weeks could shutter a business permanently, leading to a cost much higher than 4c. Furthermore, the loss function that was optimized to obtain recommended policies, while designed to trade off the natural losses against each other, can certainly be embellished with more input from economists and government policy makers. Our loss function is meant to serve as a broad template; depending on the specific applications and goals, other fine-tuned loss functions can certainly be used. 
4. As evidenced by the sensitivity bands in our plots, the projections of the model are sensitive to parameter specifications. Consequently, the optimal lockdown policy can change sharply for slight changes in parameter estimates. As an example, we repeated our analysis for MI (for the high value of the cost of life, 4.7) with p_S = 0.15 and the other parameters set or optimized as before. In this perturbation scenario, we obtain full lockdowns (status = 2) for the whole 90-day period under consideration (see Figure 7), whereas previously (with p_S = 0.07) the optimal policy enforced no lockdown beyond the training period for both high and low cost of life. Thus, any policy specification should be examined and considered carefully by policy-makers, possibly with some sensitivity analysis, and also keeping in mind the situation on the ground, to which the modeler has no access when making predictions. 
Building upon the previous point, an important question arises as to how far down the road the predictions of such models should be relied upon. 
While we have predicted up to a period of 50 days, in practice we recommend updating predictions well before such a 7-week horizon elapses, since prediction errors can grow quickly over longer time horizons and the consequence of mismatched predictions (generally unavoidable) is often measured in human lives. One may want to update predictions once every 2 weeks (especially if the spread is unchecked over a certain region), or every 4 weeks (if the spread is generally well controlled). The dynamics of contagious disease propagation involve a slew of intangibles and imponderables that no mathematical model can really capture, since human behavior is quite non-robust to shocks. For example, turmoil in the US over both racial issues and political partisanship has manifested itself in massive demonstrations and public rallies as well as multiple civilian-police confrontations in recent months. Such mass events have the potential to cause spikes in infection and to change the subsequent dynamics of the disease. Obviously, no degree of quantitative modeling can account for the exact nature of such disturbances. This work was done under the auspices of a grant awarded by the MIDAS Propelling Original Data Science (PODS) pilot funding program at the University of Michigan. We derive below the form of the transition probability from compartment S to L. At the outset, let us define a few quantities and articulate the underlying assumptions. By R_a(t) we denote the basic reproduction number under lockdown status a(t): it is, on average, the number of people to whom an infected person transmits the virus over the number of days that he/she is capable of infecting [which for our model is the average number of days D that they stay in state I_m], under the assumption that the proportion of susceptibles is close to 1. We assume that the number of people that a random person comes into contact with on a daily basis follows a Poisson(µ_a(t)) distribution, where µ_a(t) varies with the lockdown status. 
A strict lockdown corresponds to a small value of µ_a(t). If β is the probability of transmission of infection per contact between an individual in state S and one in state I_m, then R_a(t) = µ_a(t) β D. Conditioning on the Poisson number of daily contacts, the probability that a susceptible individual picks up the infection on day t is
P(S → L) = 1 − Σ_{n≥0} [e^{−µ_a(t)} µ_a(t)^n / n!] (1 − β X_t(I_m)/N)^n = 1 − e^{−µ_a(t) β X_t(I_m)/N}.
The above formula admits a ready approximation in terms of the quantity R_a(t), which is described next: consider the situation where the status a(t) remains unchanged over the period of time during which an infected person is in state I_m. On average, the person meets µ_a(t) people every day, of whom a proportion X_t(S)/N are susceptible, and each such contact spreads the infection with probability β. Thus, the average number of people infected on day t is µ_a(t) (X_t(S)/N) β. The average number infected over the entire period can therefore be approximated by summing this quantity over the average number of days in the I_m compartment, which is D. If we assume that the proportion X_t(S)/N does not change dramatically over this period, we have µ_a(t) × (X_t(S)/N) × β × D as the average number infected over the entire period. On the other hand, the average number infected over the entire period is also approximately R_a(t) × X_t(S)/N: this follows from the fact that we are working under the assumption that a(t) is fixed over the duration of the infecting phase and from the rough constancy of the proportion of susceptibles over this phase, leading us to:
µ_a(t) × (X_t(S)/N) × β × D ≈ R_a(t) × X_t(S)/N.
Hence we have µ_a(t) β = R_a(t)/D. Putting this back in the main equation we get:
P(S → L) = 1 − e^{−R_a(t) (P_{Im→Is} + P_{Im→R}) X_t(I_m)/N},
upon noting that the number of days D spent in state I_m is a geometric random variable with mean 1/(P_{Im→Is} + P_{Im→R}). In this subsection we include the plots for FL, AZ and NJ. 
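The derivation above can be checked numerically; the following sketch evaluates the resulting one-day transition probability P(S → L) under illustrative (made-up) parameter values.

```python
import math

# Numerical sketch of the transition probability derived above:
#   P(S -> L) = 1 - exp(-mu_a * beta * X_t(I_m) / N),
# with mu_a * beta recovered from R_a = mu_a * beta * D and
# D = 1 / (P_{Im->Is} + P_{Im->R}).  Parameter values are illustrative only.

def p_S_to_L(R_a, p_Im_Is, p_Im_R, X_Im, N):
    D = 1.0 / (p_Im_Is + p_Im_R)   # mean number of days spent in I_m (geometric)
    mu_beta = R_a / D              # mu_a(t) * beta, recovered from R_a(t)
    return 1.0 - math.exp(-mu_beta * X_Im / N)

# A tighter lockdown (smaller R_a) lowers the daily infection probability:
p_open = p_S_to_L(R_a=1.5, p_Im_Is=0.05, p_Im_R=0.10, X_Im=2000, N=1e6)
p_lock = p_S_to_L(R_a=0.8, p_Im_Is=0.05, p_Im_R=0.10, X_Im=2000, N=1e6)
assert p_lock < p_open
```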
References 
Florida's covid-19 data and surveillance dashboard 
Live updates: Tracking the coronavirus in new jersey, 2020 
Existence of optimal control without convexity and a bang-bang theorem for linear volterra equations 
Spatiotemporal infectious disease modeling: a bme-sir approach 
Modeling of space-time infectious disease spread under conditions of uncertainty 
Discrete and continuous bang-bang and facial spaces or: look for the extreme points 
Optimal control of markov processes with incomplete state information 
What will be the economic impact of covid-19 in the us? rough estimates of disease scenarios 
Some basic economics of covid-19 policy 
Statistical reinforcement learning 
How economists calculate the costs and benefits of covid-19 lockdowns 
Assessment of the sars-cov-2 basic reproduction number, r0, based on the early phase of covid-19 outbreak in italy 
Covid-19 data repository by the center for systems science and engineering (csse) at johns hopkins university 
Learning belief representations for imitation learning in pomdps 
Solving pomdps by searching in policy space 
Functional analysis and time optimal control 
Tracking michigan covid-19 hospitalization data trends 
Containing papers of a mathematical and physical character 
Contributions to the mathematical theory of epidemics. ii.-the problem of endemicity 
Contributions to the mathematical theory of epidemics. iii.-further studies of the problem of endemicity 
Partially observed Markov decision processes 
The incubation period of coronavirus disease 2019 (covid-19) from publicly reported confirmed cases: estimation and application 
The neuroinvasive potential of sars-cov2 may play a role in the respiratory failure of covid-19 patients 
The global macroeconomic impacts of covid-19: Seven scenarios 
Functional limit theorems for non-markovian epidemic models 
Point-based value iteration: An anytime algorithm for pomdps 
Finding approximate pomdp solutions through belief compression 
Mean-field models for non-markovian epidemics on networks 
Coronavirus case counts are meaningless 
The optimal control of partially observable markov processes over a finite horizon 
The optimal control of partially observable markov processes over the infinite horizon: Discounted costs 
An epidemiological forecast model and software assessing interventions on covid-19 epidemic in china. medRxiv 
The bang-bang principle for linear control systems 
Understanding of covid-19 based on current evidence 
Maa*: A heuristic search algorithm for solving decentralized pomdps 
Pairwise models for non-Markovian epidemics on networks 
Estimation of excess deaths associated with the covid-19 pandemic in the united states 
Dynamic health policies for controlling the spread of emerging infections: influenza as an example 
Generalized markov models of infectious disease spread: A novel framework for developing dynamic health policies 
Empirical study of a non-markovian epidemic model