key: cord-279539-s2zv7hr4 authors: Narayanan, C. S. title: Modeling the COVID-19 outbreak in the United States date: 2020-05-05 journal: nan DOI: 10.1101/2020.04.30.20086884 sha: doc_id: 279539 cord_uid: s2zv7hr4

The COVID-19 contagion has developed at an alarming rate in the US, and as of April 24, 2020, tens of thousands of people have already died from the disease. In an outbreak such as this, forecasting the extent of the mortality that will occur is crucial to aid the implementation of effective interventions. Mortality depends on two factors: the case fatality rate and the case incidence. We combine a cohort-based model that determines case fatality rates with a modified logistic model that evaluates the case incidence to determine the number of deaths in all the US states over time; the model is also able to include the impact of interventions. Both models yield exceptional goodness-of-fit. The model predicted a range of death outcomes (79k to 246k), all of which are considerably greater than the figures presented in mainstream media. This model can be used more effectively than current models to estimate the number of deaths during an outbreak, allowing for better planning.

Coronavirus disease 2019, or COVID-19, a respiratory infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first identified in Wuhan, China in late 2019 [1]. Subsequently, the outbreak has spread to 212 countries [2], including the United States, where the first case of COVID-19 was detected in Washington state on January 20, 2020 [3]. As of April 24, 2020, the U.S. has reported 830,053 cases and 42,311 deaths [4].

As the pandemic progresses, determining its prognosis is essential to inform the adoption of adequate mitigation efforts. Many have attempted to forecast the trajectory of the epidemic in the United States, and at the forefront is the White House [...] Dr.
Fauci suggested that the US would likely face 100,000 to 200,000 deaths, with millions of cases [7]. However, on April 9 he said the estimate had been revised down to 60,000 [8]. Moreover, a model by the University of Washington, closely followed by the White House, projects 67,000 deaths as of April 24, 2020 [9], in line with Dr. Fauci's statement [8].

It is well known that the sizes of outbreaks, and the resulting incidence of deaths, vary widely between states. The primary reason for the variance is the uncertainty in the prediction of infections and the fatality rate of infected individuals. [...] improved models to determine both the cumulative case incidence and the case fatality rate. We calculated the case fatality rate (CFR) for each state using a cohort-based approach, which has demonstrated greater accuracy than traditional methods [10]. Additionally, the number of cumulative cases in each state is predicted using a modified logistic model. Combining the two, we are able to forecast the number of cumulative deaths by state. We also analyze the drivers of deaths and discuss implications for policy formation.

Data sources

The primary data for this study are publicly available and were obtained from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [11]. We obtained the data pertaining to daily new cases/deaths and cumulative cases/deaths for the period of January 22, 2020, to April 24, 2020. In addition, we obtained 2010 US state population data from the United States Census Bureau [12].

The number of deaths is the product of the case fatality rate (CFR) and the population confirmed to have been infected [13]. Therefore, in order to determine cumulative fatalities due to COVID-19, it is necessary to first predict the CFR and the number of cases.
CFR, case incidence, and deaths were evaluated for all fifty states as well as the District of Columbia.

Calculating CFR

There are three principal measures of disease lethality: the case fatality rate (CFR), the infection fatality rate (IFR), and the mortality rate (MR). The mortality rate is the proportion of cumulative deaths to the total at-risk population; it indicates the probability of death for any individual in the total population. The CFR uses the same numerator (cumulative deaths) but divides it by the number of cumulative confirmed cases. The case fatality rate [14] is the proportion of individuals who die from a disease among all individuals diagnosed with the disease within a specified timeframe [15]. That is, it gives the percentage of individuals who die among all individuals who test positive for the disease. The IFR is similar to the CFR, except that it represents the ratio of deaths to the total number of people who are infected; it accounts for all infected individuals regardless of whether their disease is reported. In an ideal scenario, where surveillance was faultless and no individual with COVID-19 went unnoticed, the IFR and CFR would be equivalent. In reality this does not hold: testing is limited, asymptomatic infections are commonly not surveilled, and not all instances of the disease are accounted for. As a result, the CFR that we calculate will be much higher than the IFR and the mortality rate.

In this paper, we use a logistic function [10] to describe the exponential growth and subsequent flattening of the COVID-19 CFR. The CFR depends on three parameters: the final CFR ($L$), the CFR growth rate ($k$), and the onset-to-death interval ($t_0$), and is expressed as: [...]

Using this model, we calculate the number of deaths each day for each cohort, or group of individuals infected on the same day.
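To make the cohort idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the textbook three-parameter logistic form for the CFR, with the onset-to-death interval acting as the curve's midpoint, and it assumes that a cohort's cumulative deaths equal its size multiplied by the CFR evaluated at the cohort's age in days. The function names and the sample numbers are hypothetical.

```python
import math


def logistic_cfr(days_since_onset, final_cfr, growth_rate, onset_to_death):
    """Textbook logistic curve (an assumed form): rises toward final_cfr,
    reaching half of it when days_since_onset equals onset_to_death."""
    exponent = -growth_rate * (days_since_onset - onset_to_death)
    return final_cfr / (1.0 + math.exp(exponent))


def cumulative_deaths(new_cases, final_cfr, growth_rate, onset_to_death):
    """Cumulative deaths on each day, summed over daily cohorts.

    new_cases[d] is the size of the cohort first reported on day d.
    On day t that cohort is (t - d) days old, and the assumed fraction of
    it that has died is logistic_cfr(t - d, ...).
    """
    totals = []
    for t in range(len(new_cases)):
        deaths = sum(
            new_cases[d] * logistic_cfr(t - d, final_cfr, growth_rate, onset_to_death)
            for d in range(t + 1)
        )
        totals.append(deaths)
    return totals


# Made-up data: one cohort of 1,000 cases on day 0, followed for 120 days.
new_cases = [1000] + [0] * 120
deaths = cumulative_deaths(new_cases, final_cfr=0.06, growth_rate=0.3, onset_to_death=18)
```

With these made-up inputs, the cohort has lost half of its eventual deaths at day 18 (the assumed onset-to-death interval) and approaches 6% of its size by the end of the window.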
Next, we build an objective function that minimizes the root mean square error (RMSE) between the actual and predicted values of cumulative deaths. We ran 125,000 simulations, using numerous values of the onset-to-death interval, the CFR, and the CFR growth rate. The CFR was kept in the range of 0.5% to 20%, the slope was kept in the range of 0.005 to 0.7, and the onset-to-death interval was bounded between 0 and 60 days. We assigned these bounds because, after in-depth exploration of the model, we were convinced the solutions would fall within them. We then identify the model parameters that best fit the data (the top 1% of simulations by RMSE). With a kernel density distribution of case fatality rates, we determined the low CFR (the lowest value regardless of its frequency), the mode CFR (the most probable CFR), and the high CFR (the highest value regardless of its frequency).

[...] methods is the incorporation of the growth rate. In the SIR model, this is $R_0$, the basic reproduction number, which describes transmission given that the population lacks immunity and there are no deliberate interventions to impede disease transmission. The number of infections will continuously rise in a population if $R_0 > 1$, will remain steady if $R_0 = 1$, and will decrease if $R_0 < 1$. To explicitly account for the impact of mitigation efforts, models must support gradual changes in the shape of the case growth rate.

The logistic model forecasts the slow initial rise, exponential growth, and eventual flattening of cumulative cases, but cannot account for the changes that result from [...] parameters: the terminal number of cumulative cases ($C$), the growth rate ($r$), and the days to the inflection point ($t_i$). The inflection point indicates the day at which the number of daily cases reaches its maximum.
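As an illustration of the three logistic parameters and of the RMSE-minimizing simulations described above, here is a hedged sketch. It assumes the textbook logistic form for cumulative cases, which may differ from the paper's exact equation, and it replaces the paper's simulation procedure with a plain uniform random search. All function names, the synthetic data, and the search bounds are hypothetical.

```python
import math
import random


def cumulative_cases(t, terminal_cases, growth_rate, inflection_day):
    """Textbook logistic curve for cumulative cases (an assumed form)."""
    return terminal_cases / (1.0 + math.exp(-growth_rate * (t - inflection_day)))


def daily_cases(t, terminal_cases, growth_rate, inflection_day):
    """Time derivative of cumulative_cases; largest at t == inflection_day."""
    z = math.exp(-growth_rate * (t - inflection_day))
    return terminal_cases * growth_rate * z / (1.0 + z) ** 2


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def random_search(observed, bounds, n_trials, seed=0):
    """Draw parameter sets uniformly within bounds; keep the lowest-RMSE one."""
    rng = random.Random(seed)
    days = range(len(observed))
    best_params, best_error = None, float("inf")
    for _ in range(n_trials):
        params = tuple(rng.uniform(low, high) for low, high in bounds)
        error = rmse(observed, [cumulative_cases(t, *params) for t in days])
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


# Synthetic "observations" generated from known, made-up parameters.
true_params = (50_000.0, 0.2, 30.0)  # terminal cases, growth rate, inflection day
observed = [cumulative_cases(t, *true_params) for t in range(61)]

# Hypothetical search bounds, in the same parameter order.
bounds = [(10_000.0, 100_000.0), (0.01, 0.3), (10.0, 50.0)]
best_params, best_error = random_search(observed, bounds, n_trials=2000)

# The day with the most new cases coincides with the inflection day.
peak_day = max(range(61), key=lambda t: daily_cases(t, *true_params))
```

A random search is used here only because it is easy to read; keeping every sampled parameter set, rather than just the best one, would allow the top 1% by RMSE to be retained as the text describes.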
This function, which describes the change in case incidence $i(t)$ over time, can be expressed as: [...]

The modified logistic model has five parameters; however, the terminal number of cumulative cases ($C$) and the inflection point ($t_i$) remain unchanged. The set of equations that describe the incidence using the modified logistic model are: [...]

This preprint is made available under a CC-BY-ND 4.0 International license. The copyright holder for this preprint (which was not certified by peer review), this version posted May 5, 2020, is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The parameter $r_m$ is the modified growth function, which changes over time, and $f$ is a smoothing function that determines how quickly the rate diverges around the point of inflection, as well as the magnitude of the transformation. Using this model, we calculated the number of cases for each state.

Next, we built an objective function that minimizes the root mean square error between the actual and predicted values of cumulative cases, and ran numerous simulations by varying four of the parameters; the fifth, $P$, was held constant at 0.1 in all our simulations. We ran 375,000 simulations, using numerous values of the days to inflection ($t_i$), the terminal number of cases ($C$), the growth rate ($r$), and the growth rate multiplier ($K$). The terminal number of cases was kept in the range of 0.01% to 3% of each state's population, the growth rate was kept in the range of 0.01 to 0.3, and the number of days to inflection was bounded between 10 and 50 days. The growth rate multiplier was kept in the range of 0 to 40%.

The cumulative mortality is the product of the case fatality rate and the cumulative case incidence. With the low, mode, and high values of both CFR and cumulative case incidence, we evaluate the nine possible death tolls for each state by finding the product of each CFR value and each value of the cumulative case incidence.
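The nine-scenario calculation described above, together with a cell-wise sum across states, can be sketched as follows. The function names and every number below are made up for illustration; none of them are values from the paper.

```python
from itertools import product

LEVELS = ("low", "mode", "high")


def death_scenarios(cfr, cases):
    """Return {(cfr_level, cases_level): deaths} for all 3 x 3 combinations."""
    return {
        (cfr_level, cases_level): cfr[cfr_level] * cases[cases_level]
        for cfr_level, cases_level in product(LEVELS, LEVELS)
    }


def national_scenarios(state_grids):
    """Add matching cells across the per-state grids."""
    return {
        cell: sum(grid[cell] for grid in state_grids)
        for cell in product(LEVELS, LEVELS)
    }


# Hypothetical inputs for two states.
state_a = death_scenarios(
    cfr={"low": 0.05, "mode": 0.07, "high": 0.09},
    cases={"low": 10_000, "mode": 20_000, "high": 30_000},
)
state_b = death_scenarios(
    cfr={"low": 0.02, "mode": 0.03, "high": 0.04},
    cases={"low": 1_000, "mode": 2_000, "high": 3_000},
)
national = national_scenarios([state_a, state_b])
```

Keying each cell by a pair of levels keeps the low, mode, and high CFR values separate from the low, mode, and high case counts, so the same cell can be matched across states.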
To determine cumulative mortality on the national scale, we add up the respective cells for all states.

Predicting case incidence

We calculated the case incidence for each jurisdiction. Fig 1 shows the goodness-of-fit between the forecasted cases and the true number of cases for New York. The two sets of figures show the cumulative case incidence and the new case incidence, and demonstrate an excellent fit for both. We calculated the $R^2$ for all the states, and the fit was excellent (greater than 98% in every state), indicating that the modified logistic function models the transition after the intervention well. More information regarding the model's goodness-of-fit for the states is provided in the supplemental information (see S1 Table). [...] (S2 Table). However, New York is a substantial outlier: the model predicts 500,000 cases for the mode case. All the numbers we quote in the rest of the document are for the mode case (unless otherwise specified), as it has the highest likelihood. Even if the best-case scenario transpires, New York's case incidence will probably exceed the incidence of any other state by over 200,000.

Forecasted case incidence for the top US states. The range of forecasted total case incidence for the top states that contribute 90% of all the deaths. The low, mode, and high cases are displayed. The low and high cases are determined as the low and high ends of the 95% confidence interval.

Next, we evaluated the forecasted case incidence for the entire United States (Fig 3). The total number of cases predicted is above 1.2M, and the number of new daily cases
peaks at more than 35,000. As New York and New Jersey contribute significantly toward the overall case incidence, the peak in daily cases for the United States is strongly dependent on the peaks of these two states.

In order to understand whether the different states have a case incidence proportionate to their populations, we calculated the discrepancy between the projected cases per capita for individual states and the U.S. average projected cumulative cases per capita (Fig 4). New York, New Jersey, Massachusetts, Connecticut, and Louisiana have much higher case incidence per capita, meaning that the disease affected these areas disproportionately. In contrast, California, Texas, Florida, and Ohio did much better in controlling the spread of the infection.

Fig 4. Difference in forecasted case incidence between states. Difference in forecasted case incidence between the top states that contributed 90% of all the deaths. This was calculated as the difference between the projected cases per capita for individual states and the U.S. average projected cumulative cases per capita.

We previously calculated the case fatality rates for Hubei province and showed that the goodness-of-fit was excellent [10]. We used the same methodology to calculate the CFR for all the states. The model fits the data extremely well, which supports the soundness of both the model and the methodology. We provide the $R^2$ value for all the states in the supplemental information (see S1 Table).

We calculated the range of final case fatality rates for each state (Fig 5). The CFR for most states is between 5% and 10%. Compared to the case incidence, case fatality rates have far less variability. Massachusetts, Connecticut, New York, and Maryland have relatively higher CFRs. In contrast, Texas, California, and Georgia have much lower CFRs. We also provide supplemental information for all states (see S3 Table).
In order to measure how well states are faring relative to each other, we calculate the difference between the projected CFR for each state and the average CFR for the US (Fig 6). Positive values indicate that a state's CFR is worse than the average; negative values indicate that it is better. The figure clearly shows a wide disparity between states' case fatality rates. Furthermore, the difference in CFR closely corresponds with projected cumulative deaths. This shows even more dramatically how much greater New York's and Connecticut's outbreaks are compared to other states.

Fig 6. Difference in forecasted case fatality rates between states. Difference in forecasted case fatality rates between the top states that contributed 90% of all the deaths. This was calculated as the difference between the projected CFR for individual states and the U.S. average CFR.

Considering that CFR and case incidence are the factors of death, it is fitting now to discuss the projected incidence of deaths (Fig 7). We calculated the number of deaths for each of the top 15 states that contribute 90% of the deaths. This reveals the tremendous disparity between the sizes of outbreaks in different states (see also S4 Table). The majority of states will experience fewer than 10,000 deaths. New York is once again a significant outlier; the model returns a minimum of 40,000 deaths and a mode of 65,000 deaths, 45% of the U.S. total and more than the next five jurisdictions combined. This can be traced back to the state's relatively high projected case fatality rate and [...]
indicative of differences in CFR and case incidence. Sources of variation include the chronology and success of mitigation efforts, the prevalence of testing, and the distribution of age [19] and comorbidities [20] within populations at risk of infection.

Fig 7. Forecasted deaths for the top US states. The range of forecasted deaths for the top states that contribute 90% of all the deaths. The low, mode, and high cases are displayed. The low and high cases are determined as the low and high ends of the 95% confidence interval.

Table 1 summarizes the various possible death tolls under each of the 9 conditions. There is a 95% likelihood that any of these results are possible. The lowest cumulative death toll the U.S. could experience is 78,000, considerably higher than both Fauci's [...]

Table 1. Predicted death tolls for the nine different scenarios. These scenarios are constructed using the low, mode, and high cases for CFR and cases. The low and high cases are determined as the low and high ends of the 95% confidence interval.

For the best-fit case in all the states, we calculated the number of deaths per day and the cumulative number of deaths (Fig 8). This shows that the number of deaths will reach an asymptote at the end of June and the number of daily deaths will peak in early May.

We used a cohort analysis approach to estimate CFR and a modified logistic model (which explicitly accounts for the impact of mitigation efforts) to forecast case incidence at the state level, and afterwards calculated mortality at the state and national levels. Our model showed a wide range of mortality, with 79,000 deaths on the low end and a maximum of 245,000 deaths. Every possibility predicted by the model exceeds the projections produced by both the White House and the University of Washington model. Our model also revealed the deep disparity in deaths among different states, which is attributable to differences in case fatality rate and case incidence.
We postulate reasons for these variations.

Many states in the US Northeast, including New York, New Jersey, Massachusetts, and Connecticut, are disproportionately represented in the cumulative death toll. This disparity arises primarily because these states have much worse case incidence and case fatality rates. New York is forecasted to experience the largest outbreak, the greatest CFR, and the highest mortality of any state by far. One explanation for its high case fatality rate is the strain the epidemic has placed on its healthcare system. Its high case incidence will require more hospitalizations, overwhelming the medical care system. This could diminish the quality of medical care, resulting in a high case fatality rate. Additionally, the disease has disproportionately impacted low-income, more vulnerable areas [21].

The reason for the high case incidence itself is more perplexing. Numerous factors are likely at play, such as the popularity of public transportation [22] and the high population density of the New York City metropolitan area [23] (where the vast majority of cases have been reported [24]). However, it is difficult to find conclusive evidence that any of these factors are directly accountable for the outbreak. It is very likely that luck played a large role in determining where clusters appeared. There is ample evidence that super-spreading events, or SSEs, can cause sizable outbreaks [25]. For instance, officials in New York stated that as many as fifty infections could be traced back to a single man in Westchester County [26].

We propose two principal explanations for the discrepancy between the death tolls forecasted by the University of Washington model and by our model: differences in the procedure for calculating CFR and in the procedure for calculating mortality.
Our cohort-based method for determining CFRs predicts case fatality rates more accurately at every stage of the outbreak than other models because it explicitly accounts for the onset-to-death interval. Further, we forecast cumulative mortality by independently evaluating CFR and case incidence. In contrast, the University of Washington model directly predicts deaths; this method is prone to greater errors.

While both the CFR and case incidence models fit the data extremely well, there are several challenges in estimating the number of deaths accurately. Our model assumes that the scale and methods of surveillance do not significantly change between today and the future. Any changes to the testing process will affect the number of confirmed cases. Breakthroughs in leveraging telemedicine, for instance, would result in increased detection of infected individuals. In this case, the model's current forecasts for case incidence would be underestimates [27]. Additionally, if shelter-in-place orders are withdrawn too early, there will likely be an increase in both the case incidence and the mortality. The model itself has limitations; if our assumptions do not hold true, then our analysis will not hold true either.

I would like to thank Chandra Narayanan for all his guidance on building the best case incidence model.

First known person-to-person transmission of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in the USA
WHO Coronavirus Disease
First case of 2019 novel coronavirus in the United States
WHO.
Coronavirus disease 2019 (COVID-19) Situation Report - 95
NPR. Fauci Estimates That 100,000 to 200,000 Americans Could Die From The Coronavirus
NPR. The Coronavirus Crisis - Fauci Says U.S. Coronavirus Deaths May Be 'More Like 60
Institute for Health Metrics and Evaluation. COVID-19 Projections
A novel cohort analysis approach to determining the case fatality rate of COVID-19 and other infectious diseases
JHU. 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE
United States Census Bureau. State Population Totals
Methods for estimating the case fatality ratio for a novel, emerging infectious disease
How testing completely skews Coronavirus case fatality rates
Case Fatality Rate
The mathematics of infectious diseases
Estimating epidemic exponential growth rate and basic reproduction number
Estimation of the final size of the COVID-19 epidemic
Clinical characteristics of coronavirus disease 2019 in China
Comorbidity and its impact on 1590 patients with Covid-19 in China: A Nationwide Analysis
Data Suggests Many New York City Neighborhoods Hardest Hit by COVID-19 Are Also Low Income Areas
Land Area, and Population Density by County
Identifying and Interrupting Superspreading Events-Implications for Control of Severe Acute Respiratory Syndrome Coronavirus 2. Emerg Infect Dis
New York Officials Traced More Than 50 Coronavirus Cases back To One Attorney
Virtually perfect? Telemedicine for COVID-19