key: cord-0807231-3nh7zgfq
authors: Barman, P.; Deb, N.; Mukherjee, S.
title: Mathematical framework to model Covid-19 daily deaths
date: 2020-05-22
journal: nan
DOI: 10.1101/2020.05.18.20106104
sha: 39be925cc3a1985f82641f24134b4d34e2d9caa9
doc_id: 807231
cord_uid: 3nh7zgfq

The novel coronavirus (2019-nCoV) pandemic has caused widespread socio-economic disruption and, as of 04/07/2020, resulted in more than 72,614 confirmed deaths worldwide. Robust prediction of the trajectory of the death incidence curve is helpful for policy decisions during this ongoing crisis. We propose a non-parametric model to fit the number of daily deaths in a region, which hypothesizes that the death incidence curve will have a convex shape in the beginning, a concave shape near the peak, and a convex shape in the final stage of the death incidence curve after the peak. Using this, we performed robust short-term predictions on phases in five countries worldwide and five US states. Our analysis shows while the five states are all at peaks or past their peaks, US as a country is possibly not at peak yet. Our model can be easily fitted on daily death data from any region.

The novel coronavirus (2019-nCoV) first originated in December 2019 in Wuhan, Hubei province of China, caused by SARS-CoV-2 virus. As of April 7 th , 2020, World Health Organization has reported 72,614 deaths [1] and about 1.2 million SARS-CoV-2 positive cases worldwide. SARS-CoV-2 virus has mainly three modes of transmission, respiratory transmission, aerosol transmission, and contact transmission [2] . Worldwide travel due to globalization has also resulted in an increased risk of transmission [3] . Starting from the Hubei province, the virus has gradually spread over 185 countries [4] and is still spreading. Between December 2019 and March , 2020, China reported a total of 82,545 [5] positive cases, whereas in Italy the number of cases started rising from February 2020 [6] . The trends of confirmed cases [4] show the timeline and peak of infection and deaths is variable for each country.

In the absence of a vaccine or a cure, slowing the rates of infection and preparing the healthcare system has been the primary focus for the government across the world. Mathematical modelling has been at the forefront of policy and decision-making to determine timelines for social distancing, stay at home orders, healthcare policies, and city wide lock-downs [7, 8] . A notable amount of research is currently focusing on prediction of cumulative incidence (CI) [9, 10] , case fatality rate (CFR) [11] [12] [13] and transmission of the disease [14, 15] . Epidemiological models such as Susceptible, Infected and Recovered (SIR), Susceptible, Exposed, Infectious, and Removed (SEIR) have been used to study the transmission based on population mixing [16] , travel, reproduction rate and recovery rates [15, [17] [18] [19] [20] ].

An inherent bias in most of these metrics is their dependence on true positive cases, which is hard to estimate during an epidemic [14, 21, 22] . For example, regions with higher testing rates are more likely to observe higher incidence. Thus a meaningful study of the spread of the disease in a region, or comparison across regions, is non-trivial. In contrast, the number of observed deaths due to infections is more robust to these considerations.

Limited research has been published studying the shape of the death incidence curve (DIC) of 2019-nCoV. A recent article by the Institute of Health Metrics and Evaluation (IHME) reported USA's number of positive cases to peak during first two weeks of April [23] estimating a death toll of 80,000 in the next four months. The same group released a news article projecting that European countries will see their daily death peak during the third week of April [24] . However, both of these death projections have a very wide range of variability (95% CI 38,242-162,106 deaths in US) based on the major assumptions of social distancing in the model. A group of epidemiologists from Imperial College in UK reported a predicted death toll of 550,000 in UK and 2.2million in USA in absence of any social distancing . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint measures [25] . Shortly after that report, they revised their estimate to 5,000 deaths in UK incorporating a perfect social distancing model. The disparity of these predictions highlights the sensitivity of these models on the underlying assumptions. The current global CFR for 2019-nCoV cases is 6.3%, however, Italy's CFR has been 12.82% whereas in South Korea it has been 2.1% [26] . This degree of variation in the observed CFR across countries is due to the difference in the age distribution, as well as possible overloading of the healthcare system [27] . It remains unclear how situation will progress over time because of dependence on many factors, which is why long term prediction on positive cases or transmission rates remain unreliable. A more tractable question is whether we can understand the short-term trajectory of 2019-nCoV in a robust way to aid in decision making by studying the DIC per region.

We analyzed two different datasets: a) time-series data of daily 2019-nCoV related deaths for 185 countries and b) Time series data of daily 2019-nCoV related deaths for each USA states. Country-wise time-series data were obtained from John Hopkin's data platform [4] and state-wise time-series data for daily deaths in USA were obtained from USAfacts.org [28] . For the purpose of testing we have included data through April 12 th , 2020 for countries and through April 14 th ,2020 for US state level data.

To understand the bias in estimation of true positive cases, we have plotted the number of deaths per 1000 infected patients, for five countries.

Motivated by shape of the death incidence curve in China, we consider the 2019-nCoV DIC to be comprised of three functional form segments, a) monotonically increasing convex segment up to an inflection point [29, 30] b) beyond the inflection point a concave function where the curve increases initially to a peak and then decreases to a second inflection point and c) the third segment where the curve is convex and decreases monotonically. We also hypothesize that the peak will not be a single day;

instead the peak will be attained across multiple days and thus will look like a plateau (flattened curve).

If we assume that this general pattern will hold for data coming out of a region, we can use this to predict which portion of the DIC a region is currently in. Our proposed model is non-parametric as it only requires a shape restriction on the function in different segments. Thus the model is flexible and gives a good fit to data from different regions. One limitation of our non-parametric model is it's low . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint predictive power, in that, it will only give reliable short term trend predictions (7days). A parametric model on the other hand can make long term predictions but may fail to adjust to unknown future events such as adherence to the strict lockdown measures, weather, viral mutation and other unknown factors. Thus our model is robust to capture short term trends in a reliable manner, and adapts to sudden changes.

Let { 1 , 2 , . . , } denote the observed time-series for number of new deaths indexed by days numbered 1, 2, … . . For each time point ∈ 1, … , , we consider the following three models: 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint respectively.

Note our models are nested, and so ( ) ≥ ( ) ≥ ( ). In the first segment of the curve (i.e. is a timepoint before the peak) we expect all models ℳ 1 , ℳ 2 , ℳ 3 to give the data a good fit, and so we should have ( ) ≈ ( ) ≈ ( ). In the second segment of the curve ( is a time point at or near the peak), we expect models ℳ 2 and ℳ 3 to give a good fit, and ℳ 1 to fail i.e. we should observe ( ) ≫ ( ) ≈ ( ). In the final segment of the curve ( is a time point after the peak), we expect only ℳ 3 to give a good fit to the data, and so we should have ( ) ≫ ( ). Thus if the ratio

is small, we can say that the peak happens before time point i, and we are in the third segment of the curve. If the ratio

is small, we can say time point is at or near the peak, i.e. we are in the second segment of the curve. And if both the ratios are big, we can say that time point is before the peak, i.e.

we haven't reached the peak and we are in the first segment of the curve. Thus by studying the two ratios ( ) ( ) and ( ) ( ) together at a particular i, we can have an understanding of where we lie on the curve. To combine the information across different time points, we compute the above ratios for every time point between 1 and , and study its change with respect to the time variable .

Model ℳ 1 has no inflection points, and can be fitted directly to get the fit, and hence the MSE ( ). For fitting model ℳ 2 (which has one inflection point ), we search over all choices of , and fit a convex function to { 1 , 2 , . . , }, and a concave function to { +1 , +2 , . . , }. For every choice of a, we thus get a fitted vector, and a corresponding MSE. We then take a minimum of all the MSE over all possible values of to get the value ( ). Similarly, for fitting model ℳ 3 , we search over all possible pairs of inflection points( , ), and fit a convex function to the data { 1 , 2 , . . , }, a concave function to the data � +1 , +2 , . . , �, and a convex function to the data � +1 , +2 , . . , �. Having obtained the fit for every value of ( , ), we again compute the MSE for each such fit, and take a minimum over all possible choices of ( , ) to get the value ( ). Before plotting the ratio curves, we smooth out the curves using a cubic spline (penalty 0.6). All calculations were performed using R 3.6.2.

One advantage of our approach is the model can plug in new data each day without needing the recompute previous days' estimates. We compute the ratios starting from 15 days after the first death in every region because the ratios can be unreliable when computed on initial few days. To see why, note . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint that if there are only four data points, then we can always fit the model ℳ 3 perfectly to give a MSE of 0, as there is always a convex-concave-convex function passing through any four points in the plane. Thus on the 4th day, H(i) = 0 and G(i) can be positive (and small), but the ratio

will equal 0, suggesting that the peak is over at day 4, irrespective of the nature of the DIC, which is clearly not a valid conclusion. Similar phenomenon tends to persist for small number of data points, and so the conclusions are reliable only when the ratio ( ) ( ) reaches close to 1, after starting from 1 initially and dropping a bit. This suggests that the model ℳ 1 is a good fit, and we have entered the timeline where the data has started showing convexity to a reasonable degree, and not just giving a fit because the number of points is too low.

As mentioned above, both the ratios

are always less than 1, and we need to come up with reasonable cutoffs to answer the question "how large is too large" in a systematic way, for each of the two curve plots. Specifically, we need to penalize models with increasing complexity, to avoid overfitting. Overfitting in our case would cause curves fitting to our most complex model M3. To account for this, we use a cut off of 0.5 for the curve of

, and 0.66 for the curve of

. A heuristic behind the choice of the cut offs is that model ℳ 1 has one choice of a function, and model ℳ 2 has two choices of a function, so in some sense the ratio of degrees of freedom of the two models should be ½=0.5. Similarly, the heuristic ratio of the degrees of freedom of models ℳ 2 vs ℳ 3 is 2/3=0. 66. These cutoffs will help us distinguish a real dip from a false dip in the two curves

when the case fatality is highly variable. We would consider a dip in In the event of a second wave of the DIC, our model can be generalized easily to allow multiple peaks.

Assuming that a second peak has the same functional shape, we can have two more models ℳ 4 and ℳ 5 , where the models under ℳ 4 and ℳ 5 have four segments (convex-concave-convex-concave) and five segments (convex-concave-convex-concave-convex) respectively. Comparing the models ℳ 3 vs ℳ 4 and ℳ 4 vs ℳ 5 , we can predict whether the second peak is approaching, or whether we have crossed the second peak, respectively. Thus, this general schema gives a flexible way to accommodate multiple . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint peaks in the DIC, and can even be trained to fit data with an unknown number of peaks by adaptively estimating the number of peaks.

For the purpose of the manuscript, we will describe results from five countries in the world: China, Italy, Germany, South Korea and United States of America (USA). that are plotted in Figure 2C and 2D respectively from China. To see when . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint the peak of the DIC begins, we study Figure 2C , the ratio of MSEs between ℳ 2 and ℳ 1 , which increases to a peak, and then starts falling sharply around February 17 th , suggesting that the peak of the DIC began around February 17 th . Since the curve reached below 0.5, we can also confirm that it is the start of a real peak. Table 2 shows a similar pattern, that around February 17 th the ratio ( ) ( ) starts to drop sharply, and corresponding ratio of ( ) ( ) increases sharply. In Figure 2D , we see that the curve ( ) ( ) increases to a peak first, and then starts decreasing after March 6 th , suggesting that the fit for ℳ 3 is increasingly better than ℳ 2 i.e. peak of the DIC is over around March 5 th . The lowest point in Fig 2D is below 0.66 suggesting that the true peak is over. Thus, China is past the first peak in its DIC.

Using the same principles outlined above, we also fitted death data from France, Germany, Italy, South respectively. Studying the curves, we have listed an approximate peak start and peak end dates in Table   2 . Germany has started its peak from April 2 nd , and is still at peak. Italy has started its peak around March 24 th and escaped peak on April 10 th (Fig 3B) . Based on the slope of the trend of ( ) ( ) and ( ) ( ) values (Fig 3B) , we can also get a sense of how soon the peak will be over. South Korea (Fig 3C) had a first peak from March 5 th -March 18 th , and is going through a second peak from April 6 th . USA has reached a mini peak starting April 1 st (Fig 3D) , but there is no sharp fall, so it is unclear whether this is 

Within USA, we report results from five states: New York (NY), Washington (WA), California (CA), Michigan (MI) and New Jersey (NJ). Figure 4 shows the plots of the ratios of Figure 4A shows the series of three plots for NY; the number of daily new deaths (Fig 4A-a) , ratio of ( ) ( ) over time (Fig 4A-b) and ( ) ( ) over time (Fig 4A-c) . The red curve (Fig 4 A-b) has started a dip around April 8 th suggesting NY has entered its peak, and the blue curve (Fig 4 A-c) shows that NY is still in peak. Figure   4B shows the series of three plots for state of WA; the number of daily new deaths (Fig 4B-a) , ratio of ( ) ( ) over time (Fig 4B-b) and ( ) ( ) over time (Fig 4B-c) . Figure 4B (Table 3 ). Michigan has completed its first peak from April 2 nd to April 6 th and has started a second peak on April 11 th and still in peak. Table 3 contains the list of peak start and end dates for all five US states.

This study was predicated with the aim of providing a generalized mathematical framework to model DIC of a region, irrespective of the extent of 2019-nCoV spread and population density of the region.

Currently, the worldwide pandemic of 2019-nCoV is primarily managed by various state and nationally issued policies involving social distancing and lockdowns. The goal of the study was to robustly and accurately model the current trend in the DIC and identify peaks i.e. when the rate of deaths starts to flatten. This is essential to guide control efforts and to plan health care system requirements.

We used non-parametric shape-restricted regression [33, 34] (convex-concave-convex) to fit the number of new deaths across time. Our model is able to predict at what stage a region is in its 2019-nCoV timeline, from about a week ahead. Our non-parametric approach gives a robust fit despite of considerable variability in the number of cases on each day. In the study we have identified start dates of peaks in five countries and five US states. Our model can be easily be fitted on data from other regions. USA has reached a peak, but most likely this is a mini peak. Our model suggests that there is a . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint trend of a second peak in Asian countries such as South Korea and China, which is also validated by a recent study [35] . Italy has crossed its peak, i.e. case fatality in Italy has started declining. Within USA, some of the states have already reached their peaks. California as a state has been able to better contain the virus, as it had an early start and has already reached its peak, but has a lower total number of deaths per million compared to some of the other states (NY and NJ). Overall, since USA is at a mini peak, it seems that lockdown is working to slow down the rates of 2019-nCoV spread. However, since Asian countries are approaching a second one [35] , it is reasonable to think, for USA, we anticipate an increase in the number of deaths in the coming months.

As indicated before, our model, while providing robust guidelines to what may happen in the near future, fails to predict long term behavior of the DIC, as well as cumulative deaths. In the absence of widespread serological tests, currently there is debate on the percentage of asymptomatic cases in the population. Studies have reported vastly different asymptomatic cases, ranging from 51.4% [36] in Diamond princess ship and 87.9% [37] among pregnant women in New York. Variability in percentage of asymptomatic cases, long incubation period of the virus coupled with current testing on only symptomatic cases, incorporate bias in estimation of either the number of positive cases. On the other hand, there are studies [18] claiming that CFR underestimates the true mortality rate in the population due to the infection. Our analysis re-emphasizes that CFR is affected by multiple unknown factors such as disease prevalence [38] , testing rates, right censoring [18] , human behavior, that are challenging to account for. Thus, we chose number of daily new deaths as our statistic, and relied on a non-parametric model to predict short term behavior. Currently the existing predictions based on parametric modelling assumptions [25] produce estimates with a wide range of uncertainty, and the estimates change considerably over the span of few days, which can be problematic for policy decisions.

In conclusion, we believe our model will be beneficial to identify changing trends. Being able to predict the start and end of peak in DIC in a region is important for policy makers worldwide. This will enable them to mobilize resources (such as, equipment, testing kits, health care professionals) as needed. Since our model can be used to assess peaks in any region, it can be adapted easily to model country, state or county level data. A prolonged national lockdown can have a negative impact both for small and large businesses, leading to a financial crisis. In the efforts of reopening economy, our model can help to identify low-risk regions which are already past their peak.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted May 22, 2020. . https://doi.org/10.1101/2020.05.18.20106104 doi: medRxiv preprint

Novel Coronavirus (COVID-19) Situation Report -78

The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak

Infectious disease movement in a borderless world: workshop summary

An interactive web-based dashboard to track COVID-19 in real time. The Lancet infectious diseases

Coronavirus disease 2019 (COVID-19) Situation Report -71

Novel Coronavirus (COVID-19) Situation Report -19

The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study. The Lancet Public Health

Modeling the COVID-19 outbreaks and the effectiveness of the containment measures adopted across countries. medRxiv

Covid-19 epidemic in Italy: evolution, projections and impact of government measures

On Estimating the Number of Deaths Related to Covid-19. Mathematics, 2020

Case-fatality rate and characteristics of patients dying in relation to COVID-19 in Italy

The many estimates of the COVID-19 case fatality rate. The Lancet Infectious Diseases

Coronavirus: covid-19 has killed more people than SARS and MERS combined, despite lower case fatality rate

-nCoV): estimating the case fatality rate-a word of caution. Swiss medical weekly

Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The lancet infectious diseases

What do epidemiologists mean by 'population mixing'? Pediatric blood & cancer

A modified SEIR model to predict the COVID-19 outbreak in Spain and Italy: simulating control scenarios and multi-scale epidemics

Estimation of SARS-CoV-2 mortality during the early stages of an epidemic: a modelling study in Hubei, China and northern Italy. medRxiv

Phase-adjusted estimation of the number of Coronavirus Disease 2019 cases in Wuhan

Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The Lancet

Potential biases in estimating absolute and relative case-fatality risks during outbreaks. PLoS neglected tropical diseases

A novel coronavirus outbreak of global health concern. The Lancet

Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months. medRxiv

New COVID-19 forecasts for Europe: Italy & Spain have passed the peak of their epidemics; UK, early in its epidemic, faces a fast-mounting death toll

Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand

Coronavirus disease 2019 (COVID-19) Situation Report -85

Demographic science aids in understanding the spread and fatality rates of COVID-19. medRxiv

Coronavirus Locations: COVID-19 Map by County and State

On problems in which a change in a parameter occurs at an unknown point

Testing and estimating change-point in time series

Coronavirus (COVID-19) Testing

Critical care utilization for the COVID-19 outbreak in Lombardy, Italy: early experience and forecast during an emergency response

Special Issue on" Nonparametric Inference Under Shape Constraints"

Nonparametric estimation under shape constraints

The second worldwide wave of interest in coronavirus since the covid-19 outbreaks in south korea, italy and iran: A google trends study

Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship

Universal Screening for SARS-CoV-2 in Women Admitted for Delivery

COVID-19 Antibody Seroprevalence