key: cord-0591789-prmp6ega
authors: Ahmetolan, Semra; Bilge, Ayse Humeyra; Demirci, Ali; Peker-Dobie, Ayse; Ergonul, Onder
title: What Can We Estimate from Fatality and Infectious Case Data? A case Study of Covid-19 Pandemic
date: 2020-04-27
journal: nan
DOI: nan
sha: 1b0ffeb0094fb5a60b103d0a8945e4311ea4041d
doc_id: 591789
cord_uid: prmp6ega

Daily case reports and daily fatalities for China, South Korea, France, Germany, Italy, Spain, Iran, Turkey, the United Kingdom and the United States over the period January 22, 2020 - April 20, 2020 are analysed using the Susceptible-Infected-Removed (SIR) model. For each country, the SIR models fitting cumulative infective case data within 5% error are analysed. It is shown that the quantities that can be most robustly estimated from the normalized data are the timing of the maximum and the timings of the inflection points of the proportion of infected individuals.

countries, there have been no substantial delays in the arrival of the pandemic in non-affected areas. This fact was also observed for the H1N1 epidemic in 2009 [2] . Increased globalization makes infectious diseases everyone's problem, as experienced in this pandemic: the massively increased demand for intensive care treatment of Covid-19 patients has caused severe disruption of health care systems throughout the world.

In most countries the epidemic is still in its early phase. A great deal of effort has been invested in the estimation of epidemic parameters of Covid-19 in the early stage for China and some other countries [3] , [4] , [5] , [6] , [7] , [8] , [9] , [10] . In [3] the authors analysed the temporal dynamics of the disease in China, Italy and France in the period between the 22nd of January and the 15th of March 2020. In [4] , the potential for sustained human-to-human transmission to occur in locations outside Wuhan is assessed based on estimates of how transmission in Wuhan varied between December, 2019, and February, 2020. The difficulty of making accurate predictions of the pandemic is discussed in [5] . In [6] , the authors used phenomenological models validated during previous outbreaks to generate and assess short-term forecasts of the cumulative number of confirmed reported cases in Hubei province and for the overall trajectory in China [7] . Epidemic analysis of the disease in Italy is presented in [8] by means of dynamical modelling [9] . Forecasting of Covid-19 is investigated in [10] . In addition, the change in the epidemic behaviour of various countries can be traced by the use of data-driven systems [11] .

A common feature of these works is the variation in their parameter estimates. In the present work, we show that the number of reported cases provides the most accurate representation of the number of removed individuals and that the quantities that can be most robustly estimated from normalized data are the timing of the maximum and the timings of the inflection points of the proportion of infected individuals. These values correspond to the peak of the epidemic and to the highest rates of increase and the highest rates of decrease in the number of infected individuals. The stability of the estimations is discussed by comparing predictions based on data with different time spans.

Publicly accessible data released by the state offices of each country are used for the analysis. The data set of each country is collected from published official reports and is available at the website http://www.worldometers.info/coronavirus/ (last access: 27 April 2020). Updated data are also available at the website http://epikhas.khas.edu.tr/. The last data in this work were collected on the 18th of April, 2020. The data cover the period January 22-April 18, 2020 and, in the following, "Day 1" corresponds to January 22, 2020. The Susceptible-Infected-Removed (SIR) model [12] is used for the analysis. It should be noted that Chinese state-provided data up to April 16, 2020 are used in the analysis; these data were provided prior to the revisions made after this date.

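The Susceptible-Infected-Removed (SIR) model is a system of ordinary differential equations modelling the spread of epidemics in a closed population, under the assumption of permanent immunity and homogeneous mixing [12] . With S, I and R denoting the proportions of susceptible, infected and removed individuals, these equations are

$$S' = -\beta S I, \qquad I' = \beta S I - \eta I, \qquad R' = \eta I. \tag{1}$$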

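If the disease has an incubation period, then the Susceptible-Exposed-Infected-Removed (SEIR) model governing its spread is

$$S' = -\beta S I, \qquad E' = \beta S I - \varepsilon E, \qquad I' = \varepsilon E - \eta I, \qquad R' = \eta I, \tag{2}$$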

where β is the contact rate, T = 1/η is the mean infectious period and 1/ε is the incubation period.

Since the Covid-19 infection has an incubation period, the appropriate model is the SEIR system. However, previous work [14] showed that the parameters of the SEIR model cannot be determined from the time evolution of the normalized curve of removed individuals. Thus the SEIR model should not be used in the absence of additional information that might be obtained from clinical studies. Nevertheless, the SIR model can be used with some modifications.

The ratio β/η, called the Basic Reproduction Number and denoted as R 0 , is the key parameter in both the SIR and SEIR models. This number is related to the growth rate of the number of infected individuals in a fully susceptible population and determines the final value of R, denoted R f , that is, the proportion of individuals that will be affected by the disease. This proportion includes individuals who gain immunity without showing symptoms, those who are treated, as well as disease-related fatalities. The reciprocal of the parameter η, T = 1/η, is taken as representative of the mean infectious period.

In [14] , it was shown that the normalized R(t) for the SEIR model with parameters (β, ε, η) and the normalized R I (t) for the SIR model with parameters (β I , η I ) have the same final value (R f ), the same value R m at the peak of the number of infected individuals, and the same slope (R' m ) at that point (after a time shift of these curves), provided that 1/η I = 1/ε + 1/η. It follows that if one has to work with the normalized data, the SIR model can be used to model diseases with an incubation period, provided that the sum of the incubation and the infection periods of the SEIR model is used as an effective infectious period for the SIR model, unless there are reasonable estimates for R 0 and/or for the duration of incubation and/or infection periods. For example, an incubation period of 5 days and an infectious period of 5 days correspond to an effective SIR infectious period of 10 days.

It was also observed in [14] that the normalized curves of removed individuals are practically indistinguishable for moderate values of R 0 , such as in ordinary flu. As the Basic Reproduction Number for Covid-19 is high, one cannot expect the shifted normalized curves for the SIR and SEIR models to coincide, but as clinical information on Covid-19 parameters is still unclear, the SIR system is adopted as the basic model in the present work.

The relation between R 0 and R f is determined as follows. Note that R(t) is monotone increasing, and hence it can be used as an independent variable, instead of t. The derivative of S with respect to R is given by

$$\frac{dS}{dR} = -\frac{\beta}{\eta}\, S = -R_0\, S. \tag{3}$$

Assuming initial conditions S → 1 and R → 0 as t approaches negative infinity, one can obtain the following by integrating (3)

$$S = e^{-R_0 R}.$$

As t approaches positive infinity, I → 0, and S(t) and R(t) approach their final values S f and R f , respectively. Then S+I+R=1 yields

$$1 - R_f = S_f = e^{-R_0 R_f}.$$

Consequently, R 0 is derived as

$$R_0 = -\frac{\ln(1 - R_f)}{R_f}.$$

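At the peak of I(t), the condition I' = 0 gives S m = 1/R 0 ; the relation S = exp(-R 0 R) then gives R m , and I m follows from S+I+R=1. Therefore, the values S m =S(t m ), I m =I(t m ) and R m =R(t m ), as obtained in [14] are

$$S_m = \frac{1}{R_0}, \qquad R_m = \frac{\ln R_0}{R_0}, \qquad I_m = 1 - \frac{1 + \ln R_0}{R_0}.$$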

Here, t m refers to the time at which the number of infectious cases reaches its maximum.
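The final-size relation 1 - R f = exp(-R 0 R f ) has no closed-form solution for R f , but it is easy to solve numerically. The following is a minimal Python sketch, not the authors' code: it finds R f by bisection and evaluates the peak values S m = 1/R 0 , R m = ln(R 0 )/R 0 and I m = 1 - S m - R m . The function names `final_size` and `peak_values` are hypothetical and introduced here only for illustration.

```python
import math


def final_size(r0, tol=1e-12):
    """Nonzero root R_f of 1 - R_f = exp(-r0 * R_f), found by bisection."""
    if r0 <= 1.0:
        return 0.0  # below threshold only the trivial root R_f = 0 exists

    def g(x):
        return 1.0 - x - math.exp(-r0 * x)

    # g is positive just right of 0 and negative at 1 when r0 is
    # not too close to 1, so the root is bracketed by (lo, hi).
    lo, hi = 1e-9, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def peak_values(r0):
    """Return (S_m, I_m, R_m) at the peak of I(t) for a given R0 > 1."""
    s_m = 1.0 / r0
    r_m = math.log(r0) / r0
    i_m = 1.0 - s_m - r_m
    return s_m, i_m, r_m


for r0 in (1.5, 2.5, 3.0, 5.0):
    s_m, i_m, r_m = peak_values(r0)
    print(f"R0={r0}: R_f={final_size(r0):.4f}, S_m={s_m:.4f}, "
          f"I_m={i_m:.4f}, R_m={r_m:.4f}")
```

For R 0 = 3 this gives R f close to 0.94 and I m close to 0.30, which illustrates how slowly R f grows once R 0 exceeds 3.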

The values S m , I m and R m are crucial in determining the proportion of individuals that need to be vaccinated in order to reduce the proportion of susceptible individuals below the threshold value S m = 1/R 0 .

The graph of R f versus R 0 is shown on Figure 1 , together with the ranges of R 0 for well-known diseases. It can be seen that for R 0 >2.5, R f is approximately 90% or greater. The figure also shows that the increase in R f with respect to R 0 is very slow for R 0 >3. It is generally accepted that the R 0 for Covid-19 is greater than 3 despite all containment measures [13] , [15] , [16] . Thus, unless vaccination is applied, one would expect that roughly 94% or more of the population would be affected by the disease. In addition, knowledge of the precise value of R 0 would have little effect on the planning of healthcare measures. It should also be kept in mind that containment measures provide only temporary control of the spread of the epidemic, just to the point of reducing the burden of the epidemic to a manageable size.

According to the Centers for Disease Control and Prevention (CDC), it is still unknown when viral shedding begins or how long it lasts, nor is the period of infectiousness of COVID-19 known. As with infections with MERS-CoV and SARS-CoV, SARS-CoV-2 RNA may be detectable in the upper or lower respiratory tract for weeks after illness onset, though the presence of viral RNA is no guarantee of the presence of infectious virus. Infections without symptoms (asymptomatic infections) or before symptoms develop (pre-symptomatic infections) have been reported for SARS-CoV-2, though the role they may play in transmission remains unknown. According to prior studies, the incubation period of SARS-CoV-2, like that of other coronaviruses, may last for 2-14 days [17].

To illustrate the SIR model, R 0 , T and R(t 0 ) are chosen as 3, 10 and 10^-3 , respectively, and the resulting graphs are given on Figure 2 .
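An illustrative run of this kind can be reproduced with a few lines of code. The sketch below is a minimal example under stated assumptions, not the authors' implementation: it integrates the SIR system with a classical fourth-order Runge-Kutta scheme for R 0 = 3, T = 10 and R(t 0 ) = 10^-3. The initial values of S and I are not given in the text; here they are assumed to lie on the curve S = exp(-R 0 R) with S+I+R=1. The function names and the step size are hypothetical choices.

```python
import math


def sir_rhs(s, i, beta, eta):
    """Right-hand side of the SIR system for the proportions S and I."""
    return -beta * s * i, beta * s * i - eta * i


def simulate_sir(r0, T, r_init, days, dt=0.1):
    """Integrate the SIR system with a classical RK4 scheme.

    Returns a list of (t, S, I, R) tuples sampled every dt days.
    """
    eta = 1.0 / T
    beta = r0 * eta
    # Assumption: the starting point lies on the curve S = exp(-R0 * R).
    s = math.exp(-r0 * r_init)
    i = 1.0 - s - r_init
    out = [(0.0, s, i, r_init)]
    for k in range(1, int(round(days / dt)) + 1):
        k1 = sir_rhs(s, i, beta, eta)
        k2 = sir_rhs(s + 0.5 * dt * k1[0], i + 0.5 * dt * k1[1], beta, eta)
        k3 = sir_rhs(s + 0.5 * dt * k2[0], i + 0.5 * dt * k2[1], beta, eta)
        k4 = sir_rhs(s + dt * k3[0], i + dt * k3[1], beta, eta)
        s += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        i += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
        out.append((k * dt, s, i, 1.0 - s - i))  # R from S + I + R = 1
    return out


trajectory = simulate_sir(r0=3.0, T=10.0, r_init=1e-3, days=300)
t_peak, _, i_peak, _ = max(trajectory, key=lambda row: row[2])
print(f"peak of I(t): {i_peak:.4f} at t = {t_peak:.1f} days")
print(f"final R: {trajectory[-1][3]:.4f}")
```

With these initial conditions the peak of I(t) and the final value of R agree with the closed-form values I m = 1 - (1 + ln R 0 )/R 0 and R f for R 0 = 3, up to integration error.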

It is generally accepted that the number of fatalities represents the number of removed individuals and the number of confirmed cases represents the number of infected individuals. The proportionality constants are unknown, but as long as they are constant, one can work with the normalized case reports and normalized fatalities and seek to determine the epidemic parameters from the shape of these normalized curves. In Section 4, it will be shown that for the Covid-19 data, total cases are a better representative of the number of removed individuals.

According to the SIR and the SEIR models, given by the equations (1) and (2), the rate of change of the number of removed individuals is proportional to the number of infectious cases. In terms of observations, this corresponds to the fact that the ratio of, for example, daily fatalities to daily infectious cases should be constant. In the literature on the analysis of historical epidemics, fatality reports are usually the only available data, hence models are necessarily based on the assumption that cumulative fatalities represent the cumulative number of removed individuals. For the Covid-19 pandemic, as daily fatality and infectious case reports are available, further evaluation of the representation of R(t) in terms of fatality data is presented. Daily infections and total fatalities are displayed on Figure 3 , for all countries. From these graphs, it is difficult to see whether the relation R' = ηI is satisfied or not. For this, first, daily fatalities as representatives of the derivatives of R(t) will be compared with daily infectious cases as representatives of I(t) on Figures 4-6. Based on these comparisons, it will be concluded that total infectious cases are a better representative of the number of removed individuals, R(t).

Normalized daily infectious cases and total fatalities are shown on Figure 3 . From Figure 3 , it can be seen that the epidemic cycle has been completed in China over the course of about 70 days. The jump in total fatalities is due to a change in the reporting scheme. As our analysis is based on total infectious cases, this change has no effect on the models. For South Korea, the epidemic is in a state of slow decrease at the end of about 60 days, but the rate of infections is still high. This qualitative behaviour suggests that R 0 for South Korea is much higher than that for China. For France, Germany and Iran, the epidemic is in the decline phase. For the rest of the countries, further analysis is needed in order to assess the epidemic phase.

Daily infectious cases and daily fatalities for China and South Korea are given on Figure 4 . Note that in China, fatalities occur earlier than hospitalizations in the initial phase. This early phase is followed by a "stationary" period, over which the number of hospitalizations is around its peak and the number of daily fatalities oscillates around a mean. This intermediate period ends with a sharp decrease in fatalities (the reasons should be investigated). During the third phase, the decrease in the fatalities is faster than the decrease in the hospitalizations. Thus the data for China have three phases, while the data for South Korea look much like the stationary phase of the data for China. It should also be pointed out that the number of fatalities in South Korea is very low compared with China, hence it would be expected that the number of infections is a better representation of the number of removed individuals.

Daily infectious cases and daily fatalities for France, Germany, Italy and Spain, and for Iran, Turkey, the United Kingdom and the United States are shown on Figures 5 and 6 , respectively.

From these graphs, one can see that in Germany, the number of daily infectious cases leads the number of daily fatalities, but for other countries either they coincide or the situation is reversed. The underlying reasons for this behaviour should be analysed in more detail, using country specific information on strategies for testing infectious cases and treatment procedures applied in the course of the epidemic.

As noted above, the knowledge of R 0 determines the total proportion of individuals that would be affected, R f . Furthermore, the peak of I(t) occurs at the time t m , at which the proportion of susceptible individuals falls to the value 1/R 0 . This information is useful for the determination of the proportion of people that have to be vaccinated in order to bring the proportion of susceptible individuals below this threshold. The Basic Reproduction Number is "defined" as the expected number of new infections produced by one infectious individual in a fully susceptible population. Thus, it is a quantity that might be measured by direct on-site observations. On the other hand, the knowledge of R 0 by itself does not give any information on the timing of the progress of the epidemic. In the present work, the determination of the following parameters is discussed:

1) The Basic Reproduction Number R 0 ,
2) The mean duration of the infectious period T,
3) The time t m (days) at which the number of infectious cases reaches its maximum, i.e., the first derivative of I(t) is zero,
4) The time t a (days) at which the rate of increase in the number of infectious cases reaches its maximum, i.e., the time at which the second derivative of I(t) is zero and the first derivative is positive,
5) The time t b (days) at which the rate of decrease in the number of infectious cases reaches its maximum, i.e., the time at which the second derivative of I(t) is zero and the first derivative is negative.
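The timings t m , t a and t b defined in items 3)-5) can be located on any sampled single-peak curve I(t). The following minimal Python sketch is one possible way to do this, not the procedure used in this work: it takes t m as the sample where I is largest, and t a and t b as the samples where a central-difference estimate of the first derivative is largest and smallest. The function name `key_timings` and the bell-shaped test curve are hypothetical; noisy reported data would need smoothing before differencing.

```python
import math


def key_timings(times, infected):
    """Return (t_a, t_m, t_b) from a sampled single-peak curve I(t).

    t_m maximises I, t_a maximises the central-difference estimate of
    dI/dt (steepest rise) and t_b minimises it (steepest fall).
    """
    n = len(times)
    if n < 3 or n != len(infected):
        raise ValueError("need at least three matching samples")
    # Central differences at interior points; slope[j] belongs to sample j + 1.
    slope = [
        (infected[j + 1] - infected[j - 1]) / (times[j + 1] - times[j - 1])
        for j in range(1, n - 1)
    ]
    j_m = max(range(n), key=lambda j: infected[j])
    j_a = max(range(len(slope)), key=lambda j: slope[j]) + 1
    j_b = min(range(len(slope)), key=lambda j: slope[j]) + 1
    return times[j_a], times[j_m], times[j_b]


# Synthetic bell-shaped curve (not epidemic data): peak at day 50,
# inflection points at days 40 and 60.
days = list(range(101))
curve = [math.exp(-((t - 50) ** 2) / 200.0) for t in days]
t_a, t_m, t_b = key_timings(days, curve)
print(t_a, t_m, t_b)
```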

It will be seen that R 0 and T can be estimated only for China where the spread of the epidemic is over. For other countries, R 0 and T cannot be estimated from the normalized data, but the timings of the key events, t m , t a and t b can be determined quite reliably.

These parameters are determined by a "brute force" approach: the models are run for a broad range of parameters. The difference between the data and each model is then measured in various norms. Finally, the models that match the data within 5% are selected. If the scatter plot of the errors versus the parameter to be estimated has a sharp minimum, it is concluded that the corresponding parameter can be determined from the shape of the normalized data.

The parameter ranges for the SIR model are

and the initial values are chosen as

where 1 < k < 10. For South Korea, these parameter ranges are extended appropriately.
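The brute-force procedure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the parameter grids, the initial value `r_init` and the normalization by the last data point are hypothetical choices, since the actual ranges and initial values are not reproduced here. The error is taken as a relative L 2 distance between the normalized model R(t) and the normalized data, and the "data" in the example are synthetic, generated from a known SIR model.

```python
import math


def sir_removed(r0, T, r_init, n_days, steps_per_day=4):
    """Daily values of R(t) from an RK4 integration of the SIR system."""
    eta = 1.0 / T
    beta = r0 * eta
    s = math.exp(-r0 * r_init)  # assumed start on the curve S = exp(-R0 R)
    i = 1.0 - s - r_init
    dt = 1.0 / steps_per_day

    def rhs(s, i):
        return -beta * s * i, beta * s * i - eta * i

    daily = [r_init]
    for _ in range(n_days - 1):
        for _ in range(steps_per_day):
            k1 = rhs(s, i)
            k2 = rhs(s + 0.5 * dt * k1[0], i + 0.5 * dt * k1[1])
            k3 = rhs(s + 0.5 * dt * k2[0], i + 0.5 * dt * k2[1])
            k4 = rhs(s + dt * k3[0], i + dt * k3[1])
            s += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
            i += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
        daily.append(1.0 - s - i)
    return daily


def relative_l2_error(model, data):
    """Relative L2 distance after normalizing each curve by its last value."""
    m = [x / model[-1] for x in model]
    d = [x / data[-1] for x in data]
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(m, d)))
    den = math.sqrt(sum(b * b for b in d))
    return num / den


def brute_force_fit(data, r0_values, T_values, r_init, tol=0.05):
    """Run every (R0, T) pair and keep the models within tol of the data."""
    accepted = []
    for r0 in r0_values:
        for T in T_values:
            model = sir_removed(r0, T, r_init, len(data))
            err = relative_l2_error(model, data)
            if err <= tol:
                accepted.append((err, r0, T))
    return sorted(accepted)  # smallest error first


# Hypothetical grids and synthetic "cumulative case" data over 88 days.
r0_grid = [1.5 + 0.25 * k for k in range(19)]  # 1.5, 1.75, ..., 6.0
T_grid = [float(t) for t in range(2, 21)]      # 2, 3, ..., 20 days
synthetic = [1000.0 * r for r in sir_removed(3.0, 9.0, 1e-3, 88)]
fits = brute_force_fit(synthetic, r0_grid, T_grid, r_init=1e-3)
best_err, best_r0, best_T = fits[0]
print(f"{len(fits)} models within 5%; best: R0={best_r0}, T={best_T}")
```

Plotting the accepted errors against R 0 , T or t m would give scatter plots of the kind used to judge whether a parameter has a sharp minimum.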

In the SIR model, since R' = ηI, that is, the rate of change in the number of removed individuals is proportional to the number of infected individuals, it is expected that the cumulative cases are proportional to cumulative fatalities. Thus, the SIR model predicts the simultaneity of the daily fatalities and daily infections. The verification of this fact requires the availability of data both for infections and for fatalities. To the best of our knowledge, historical data studied in the literature include fatalities only, and the data for the 2009 H1N1 epidemic collected at certain major hospitals [18] are unique in reflecting information on both infections and fatalities. The peculiarity of these data is a shift of about 8 days between total infections and total fatalities, the peak of infections occurring 8 days prior to the peak of fatalities. This time shift was explained by a multi-stage SIR model [19] .

Cumulative cases and cumulative fatalities for Covid-19 do not show such a clear time shift. On the contrary, in China and Korea, fatalities increase faster than infections. In Germany, there is a slight lead for infections, while for other countries the two curves more or less coincide. The lead of fatalities over infections that is observed in China and in Korea is unexpected, and is possibly due to irregularities in the statistics, in medical treatment practices, etc. We should also note that the progression of the Covid-19 epidemic is unique in the sense that new treatment methods were applied during the initial phase in China and these methods were then applied in other countries.

For China, several programs were run, first by fitting the predicted R(t) to the total fatality data, then to the cumulative infectious case data. In the first case, about 700 models fitting cumulative fatalities within 5% error were found; in the second, about 3000 models fitting cumulative infections within 5% error were found. Furthermore, in the latter case, the minima for the quantities to be determined were much sharper. For South Korea, as will be explained later, the model matching was not successful. For other countries, as the difference between total infections and total fatalities was negligible, total infections are used as a representative of R(t) of the SIR model.

Our main result is that it is not possible to determine the Basic Reproduction Number and the mean duration of the infectious period from the shape of the normalized data (unless there are reasonable estimates for either of these parameters). In order to make a stable determination of the parameters R 0 and T by using early-stage data, a certain period of time has to pass. This period is approximately 70 days for the 2009 A(H1N1) epidemic [20] . However, this period for Covid-19 is still uncertain. This is possibly the reason why the parameters for countries other than China and South Korea cannot be established. On the other hand, the timings of the peak of the infectious cases, and of the peaks of the rate of increase and the rate of decrease of the infectious cases, can be determined more precisely from the shape of the normalized data.

The 'best' estimations of the parameters R 0 and T lie on a curve that is nearly linear when a SIR model is used to fit the data of an epidemic. This fact has been observed in previous work [20] , in the study of the H1N1 epidemic and it was explained by the fact that the duration of the epidemic pulse (appropriately defined in terms of a fraction of the peak of infections) was nearly invariant for values of R 0 and T, with R 0 /T constant.

In order to visualize this situation, the solutions of the SIR system (1) for parameter ranges 1.5<C<10, R 0 =2*C, T=10*C, and for R 0 /T = 1/5 (constant) are obtained. The graphs of normalized solutions (after an appropriate time shift) are given in Figure 7 .

The scatter plots of the mean infectious period T versus R 0 , and the scatter plots of the modelling error versus the parameters, are presented in Figures 9-18 , where I t and I tt represent the values of the first and the second derivatives of I(t) on the last day of the data, April 18, 2020. The error is the relative error between the normalized R(t) of the model and the normalized total infectious cases, in the L 2 norm.

Figure 7 . (a) Normalized values of I(t) for 1.5<C<10, R 0 =2*C, T=10*C, and for R 0 /T = 1/5, together with the inflection points (t a , t b ), the peak point (t m ) and the initial (t 1 ) and final (t 2 ) points at which I(t) equals 5% of its maximum value. (b) Dependency of t 1 , t a , t m , t b , t 2 on R 0 .

In Figures 9-18 , the first graph, in the upper left of the panel, is the scatter plot of the mean duration of the infectious period, T, with respect to the basic reproduction number R 0 , for models that fit the data within 5% error in the norm described above. For all countries, the "best" parameters lie on a curve, instead of being clustered around a mean. This indicates that although the SIR model fitting normalized data is unique, the parameters R 0 and T cannot be determined precisely from normalized data. The colors blue, red and yellow in Figures 9-18 correspond to the last day of the analysis, t f , being 78, 83 and 88, respectively.

In Figures 9-18 , the second (first row, right panel) and the third (second row, left panel) graphs display the scatter plots of the modelling error with respect to R 0 and T, respectively. For China, there are well-defined minima in the modelling errors at nearly R 0 =3 and T=9. For South Korea, the minimum of the error in R 0 seems to be located beyond R 0 =8, and the minimal error in T corresponds to T=25 approximately. These parameter values are not in the ranges reported in the literature. Furthermore, these values are not stable under variation of the last day of the analysis. It is therefore concluded that the data for South Korea show a completely different characteristic, which might be explained by the fact that approximately 27.4 percent of the confirmed coronavirus patients in South Korea were in their 20s. For all of the remaining countries, the ranges of R 0 and T corresponding to minimal modelling errors are too large to attempt any reasonable estimation of these parameters.

The fourth (second row, right panel) graph in Figures 9-18 shows the scatter plot of the modelling error versus t m , the timing of the peak of the number of infections. For all of the countries analysed, this parameter can be estimated quite sharply. In order to study the reliability of this estimation, the model matching process is repeated for t f =78, 83 and 88.

The proportion of infected individuals I(t) has two inflection points. The first inflection point (t a ) is located to the left of the maximum (t m ) whereas the second one (t b ) is located to the right of t m . The times t a and t b correspond to the highest rates of increase and decrease in I(t), respectively. In Figures 9-18 , the panels of the third row display scatter plots of the error with respect to these quantities. Their variation with respect to t f is also investigated.

The values of the first and second derivatives at t f are shown on the fourth row, left and right panels, respectively. If the first derivative is positive (negative), I(t) is in the rising (falling) phase, while if the second derivative is positive (negative) the curve is concave up (down).

The epidemic phases which are categorized by the sign of the first and the second derivatives of I(t) are given in Figure 8 . Estimation of parameters for each country and for t f =88 is summarized in Table 1 . 
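The sign-based categorization can be written as a small helper. The sketch below is illustrative only: the function name `epidemic_phase` and the phase labels are hypothetical and are not the labels of Figure 8. It relies on the fact that, for a single-peak curve, I(t) is rising and concave up before t a , rising and concave down between t a and t m , falling and concave down between t m and t b , and falling and concave up after t b .

```python
def epidemic_phase(first_derivative, second_derivative):
    """Classify the phase of I(t) from the signs of I' and I'' at one time.

    Values exactly equal to zero (the points t_a, t_m, t_b themselves)
    are assigned to the following phase.
    """
    if first_derivative > 0:
        if second_derivative > 0:
            return "rising, accelerating (before t_a)"
        return "rising, decelerating (between t_a and t_m)"
    if second_derivative < 0:
        return "falling, accelerating (between t_m and t_b)"
    return "falling, decelerating (after t_b)"


# Example with made-up derivative values at the last day of data.
print(epidemic_phase(0.02, -0.001))
```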

In Section 5, it was seen that although R 0 and T cannot be determined, it is possible to estimate t m , t a and t b quite sharply from the data. In this section, the reliability of these estimates is discussed by comparing predictions based on data with different time spans.

The best SIR models fitting data for 78, 83 and 88 days are obtained, and the data and graphs of the 10 best models for each time span are plotted in Figures 19-23 . For China and South Korea, for which the epidemic cycle is more or less complete, estimations based on time spans varying by 5 days give the same result, as can be observed in Figure 19 .

Figure 19 . China and South Korea: Graphs of I(t) and R(t) for the best 10 SIR models for each time span (diamond: real data).

On the other hand, for those countries that are still before or around the peak of the epidemic, the situation may be different, as can be observed in Figures 20-23 .

The epidemic parameters of Covid-19 for ten selected countries are estimated by using the data released by the state offices. These parameters are the basic reproduction number, the mean duration of the infectious period, the time at which the number of infectious cases reaches its maximum, the time at which the rate of increase in the number of infectious cases reaches its maximum, and the time at which the rate of decrease in the number of infectious cases reaches its maximum. For each country, the best Susceptible-Infected-Removed (SIR) models fitting cumulative case data are obtained. A wide variety of intervals with different scales of the parameters, basic reproduction number R 0 and infectious period T, are observed. More specifically, the basic reproduction number and mean duration of the infectious period are estimated only for China, since the spread of the disease there is over. These parameters are found to be 3 and 5, respectively. The fact that the median incubation and infection periods are approximately 5 days supports the observations for R 0 and T. However, the basic reproduction number and infectious period for other countries cannot be predicted from the normalized data, but the timing of key events can be estimated quite reliably. To summarize, we show that the quantity that can be most robustly estimated from the normalized data is the timing of the highest rate of increase in the number of infections, i.e., the inflection point of the number of infected individuals. However, it should be pointed out that the analysis performed with the SIR model for South Korea provides dissimilar results, which can be explained by the unusual age distribution of the confirmed cases.

Coronavirus disease 2019 (COVID-19): situation report

Real-time numerical forecast of global epidemic spreading: case study of 2009 A/H1N1pdm

Analysis and forecast of COVID-19 spreading in China, Italy and France

Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet Infectious Diseases

Why is it difficult to accurately predict the COVID-19 epidemic

Real-time forecasts of the COVID-19 epidemic in China from

Propagation analysis and prediction of the COVID-19

Epidemic analysis of Covid-19 in Italy by dynamical modelling

Extended SIR prediction of the epidemics trend of COVID-19 in Italy and compared with Hunan, China. Frontiers in Medicine

Forecasting COVID-19

Estimation of the final size of the COVID-19 epidemic. medRxiv

Qualitative analyses of communicable disease models

Modelling the epidemic trend of the 2019 novel coronavirus outbreak in China

On the uniqueness of epidemic models fitting a normalized curve of removed individuals

Estimation of the transmission risk of the 2019-nCoV and its implication for public health interventions

Novel coronavirus 2019-nCoV: early estimation of epidemiological parameters and epidemic predictions

A Susceptible-Exposed-Infected-Removed (SEIR) model for the 2009-2010 A/H1N1 epidemic in Istanbul

On the time shift phenomena in epidemic models

Determination of epidemic parameters from early phase fatality data: A case study of the 2009 A (H1N1) pandemic in Europe