key: cord-1001015-12ncl2cj
authors: Anand, A.; Kumar, S.; Ghosh, P.
title: Dynamic Data-Driven Algorithm to Predict the Cumulative COVID-19 Infected Cases Using Susceptible-Infected-Susceptible Model
date: 2021-03-26
journal: nan
DOI: 10.1101/2021.03.24.21253599
sha: 5dfffd9b583c6efd3d8c3ea9ac246eca40eae540
doc_id: 1001015
cord_uid: 12ncl2cj

In recent times, researchers have used the Susceptible-Infected-Susceptible (SIS) model to understand the spread of the COVID-19 pandemic. The SIS model has two compartments, susceptible and infected, and its main output is the number of infected people at a given time point. However, it is also essential to know the cumulative number of infected people at a given time point, which is not directly available from the standard structure of the SIS model. In this work, we propose a modified structure of the SIS model that yields the cumulative number of infected people at a given time point. We develop a dynamic data-driven algorithm to estimate the model parameters based on an optimally chosen training phase. We demonstrate the proposed algorithm's prediction performance using COVID-19 data from Delhi, India's capital city.

The use of epidemiological models to control the spread of disease and predict the course of an outbreak has a long history. In 1760, Daniel Bernoulli proposed a mathematical model for smallpox [1]. At the beginning of the 20th century, William Hamer and Ronald Ross studied epidemic behavior using the law of mass action [2], [3]. Today, epidemiological models are indispensable for the effective management of an infectious disease.

Various epidemiological models have been used to combat the recent outbreak of Coronavirus disease 2019 (COVID-19). COVID-19 was first reported in the city of Wuhan, China, but soon spread to other parts of the world [4]. Many authors have used some version of the Susceptible-Infected-Recovered (SIR) model to predict the COVID-19 outbreak in different countries or regions [5]-[7]. The basic SIR model assumes that infected individuals either recover from the disease (and become immune) or die [8]. It also assumes that the number of deaths from the disease is negligible compared to the total population. However, the WHO stated that "there is currently no evidence that people who have recovered from COVID-19 and have antibodies are protected from a second infection" [9]. For example, health authorities in South Korea reported that 163 patients tested positive for COVID-19 again after a full recovery [10]. Several studies have found that individuals infected with COVID-19 may build only short-term immunity against the disease, with no guaranteed long-lasting protection [11]-[13]. In this context, when infection confers no long-term protection, the Susceptible-Infected-Susceptible (SIS) model is appropriate. In an SIS model, people who recover from the disease return to the susceptible compartment because they can be infected again. In this work, we use the SIS model to predict the COVID-19 outbreak.

In an SIS model, the main focus is to determine the number of infected people at a given time point. However, for planning purposes it is also essential to know the cumulative number of infected people at a given time point, and this quantity cannot be obtained directly from the standard structure of the SIS model. Our main contribution is an SIS model structure that gives the cumulative number of infected people directly. We add a compartment for deaths due to the disease to the SIS model so that the model parameters can be estimated accurately. We develop a dynamic data-driven algorithm that estimates the model parameters efficiently and predicts the cumulative infected cases. As part of this process, we show how to select the optimal training phase for building the model. Finally, we apply the developed algorithm to COVID-19 data from Delhi, India's capital city. We also provide an R package so that users can easily apply the developed model to their own data.

In an SIS model [14], there are only two compartments, Susceptible and Infected. An SIS model assumes that an individual does not develop long-term immunity against the disease after infection and thus remains at risk of re-infection; hence, the individual is added back to the susceptible population. In other words, as shown in Figure 1, an individual becomes susceptible again after recovering from an infection. Examples of such infections are the common cold and influenza. An SIS model is described by the following equations [8],

$$\frac{dS}{dt} = -\frac{\beta S I}{N} + \gamma I, \tag{1}$$

$$\frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I. \tag{2}$$

Here t denotes time. In this work, a day is the smallest unit of time t. However, one can choose other suitable units as necessary. S and I are the numbers of susceptible and infected people in the population, respectively. The total population size is N, which is the sum of the susceptible (S) and infected (I) populations. The parameter β, the transmission rate, is the product of the contact rate and the transmission probability [8], [15]. In other words, β is the average number of individuals infected per unit time (a day) by one infected person. Here, by assumption, the I infected individuals contact other individuals at random, and a fraction S/N of those contacted are susceptible. The parameter γ is the recovery rate. It is assumed to be 1/T, where T is the average duration for which infection lasts in an infected person [16], [17]. Equation (1) gives the rate of change in the susceptible (S) population, whereas equation (2) gives the rate of change in the infected (I) population. The term βSI/N denotes the number of susceptible people infected daily; it is removed from the susceptible compartment and added to the infected compartment.

The copyright holder for this preprint (this version posted March 26, 2021) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. medRxiv preprint doi: https://doi.org/10.1101/2021.03.24.21253599

The term γI denotes the number of people who recover daily; it is removed from the infected compartment and added back to the susceptible compartment. Figure 2 shows a simulated SIS model with S + I = N at all time points. Notice that one cannot obtain the cumulative infected cases directly from the above SIS model. In addition, the model does not account for deaths due to the disease, which may reduce the efficiency of parameter estimation when the number of such deaths is not negligible (as observed for COVID-19). In the next section, we consider a modified SIS model that addresses these two issues.
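As an illustration of equations (1) and (2), the following is a minimal sketch that advances the SIS model one day at a time with a forward-Euler step. The function name `simulate_sis`, the choice of integrator, and the parameter values in the usage lines are assumptions made for this example; they are not taken from the source or from its R package.

```python
def simulate_sis(beta, gamma, n_total, i0, days):
    """Integrate the SIS equations (1)-(2) with one forward-Euler step per day.

    Returns a list of (S, I) pairs for day 0 through day `days`.
    """
    s, i = float(n_total - i0), float(i0)
    path = [(s, i)]
    for _ in range(days):
        new_infections = beta * s * i / n_total  # leaves S, enters I
        recoveries = gamma * i                   # leaves I, returns to S
        s += recoveries - new_infections
        i += new_infections - recoveries
        path.append((s, i))
    return path


if __name__ == "__main__":
    # Illustrative values only: beta = 0.2 per day, gamma = 1/14 per day.
    path = simulate_sis(beta=0.2, gamma=1 / 14, n_total=1_000_000, i0=100, days=200)
    s_end, i_end = path[-1]
    print(f"day 200: S = {s_end:.0f}, I = {i_end:.0f}")
```

Because every person who leaves one compartment enters the other, S + I stays equal to N on every day, and for β > γ the infected count settles near N(1 − γ/β).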

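The proposed model is described by the following equations,

$$\frac{dS}{dt} = -\frac{\beta S I}{N} + \gamma I, \tag{3}$$

$$\frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I - \mu I, \tag{4}$$

$$\frac{dC}{dt} = \frac{\beta S I}{N}, \tag{5}$$

$$\frac{dD}{dt} = \mu I, \tag{6}$$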

with,

where S, I, N, β and γ are the same as defined earlier. C is the cumulative number of infected cases since the beginning; it includes every person who is or was infected. D is the population deceased due to the disease. Note that D does not include deaths from other causes. We assume that the death rate from causes other than the disease under study equals the birth rate. Equation (3) is the same as (1). Equation (4) represents the net change in the infected compartment. As explained earlier, βSI/N susceptible individuals become infected daily and are added to this compartment. The parameter µ is the mortality rate of the infection. Thus, γI infected individuals recover daily and µI infected individuals die daily, and both groups are removed from this compartment. Equation (5) represents the rate of change in the cumulative infected cases, which equals the number of daily new infections (βSI/N). Equation (6) represents the rate of change in the deceased compartment, which equals µI.
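The four rates described for equations (3) to (6) can be sketched as a daily update of the state (S, I, C, D). This is a hedged example rather than the authors' implementation: the name `simulate_sisd`, the forward-Euler step, and the parameter values are hypothetical, and N is treated as a constant equal to the initial population, which is an assumption because the constraint accompanying the equations did not survive in the text.

```python
def simulate_sisd(beta, gamma, mu, n_total, i0, days, c0=None, d0=0.0):
    """Integrate equations (3)-(6) with one forward-Euler step per day.

    Returns a list of (S, I, C, D) tuples for day 0 through day `days`.
    Assumption: n_total is held constant throughout the simulation.
    """
    i, d = float(i0), float(d0)
    c = float(i0 if c0 is None else c0)  # cumulative count starts at the initial cases
    s = float(n_total) - i - d
    path = [(s, i, c, d)]
    for _ in range(days):
        new_infections = beta * s * i / n_total  # S -> I, also counted in C
        recoveries = gamma * i                   # I -> S
        deaths = mu * i                          # I -> D
        s += recoveries - new_infections
        i += new_infections - recoveries - deaths
        c += new_infections
        d += deaths
        path.append((s, i, c, d))
    return path


if __name__ == "__main__":
    # Illustrative values only, not estimates from the source.
    path = simulate_sisd(beta=0.1, gamma=1 / 14, mu=0.01,
                         n_total=20_000_000, i0=1_000, days=60)
    s, i, c, d = path[-1]
    print(f"day 60: I = {i:.0f}, C = {c:.0f}, D = {d:.0f}")
```

In this sketch C never decreases, since only new infections feed it, while I can fall whenever recoveries and deaths outpace new infections.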

In general, the SIS model parameters are held constant over the entire study period. When the disease under consideration has been present in the community for a long time, parameters estimated from the entire study period may not reflect the current situation. For example, the spread of COVID-19 is highly unpredictable in the long term because contact rates and transmission probabilities change over time, for reasons such as the control measures implemented by governments. Therefore, it may be appropriate to train an SIS model on a shorter training phase and make short-term predictions. Here, 'dynamic data-driven algorithm' means that the training phase used to estimate the model parameters is dynamic (not fixed) and is chosen optimally from the appropriate historical data. The study period consists of two phases, the training phase and the prediction phase. Figure 3 shows how the study period is divided into different parts for estimation purposes. We define the four time variables as follows:

• $T_{\text{Current}}$: Denotes the date when the training phase ends. After this date, the prediction phase starts.

We use the minimum training phase as a fixed in-sample validation period for comparing models fitted on different training phases. Here, 'in-sample' means that the validation period is a subset of the training phase under consideration. The criterion for choosing the appropriate training phase is the root mean squared error. Finally, we obtain the optimal training phase to be used for future prediction as [$T_{\text{Current}} - T_{\text{Start}} + 1 - t_{\text{opt}}$, $T_{\text{Current}}$].
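One way to read the selection rule above is sketched below: try training phases of different lengths that all end on the last observed day, fit each one, and keep the length whose fit has the smallest root mean squared error on the final validation days. Everything here is an assumption made for illustration, including the function names, the `max_len` bound, the grid-search fit on cumulative cases only, the constant N, and the synthetic data with a hypothetical change in β; the source's exact criterion is not recoverable from the text.

```python
import math

GAMMA = 1 / 14  # recovery rate for T = 14 days


def step(state, beta, mu, n_total):
    """One daily Euler step of the modified SIS model on (S, I, C, D)."""
    s, i, c, d = state
    new_inf, rec, dead = beta * s * i / n_total, GAMMA * i, mu * i
    return (s + rec - new_inf, i + new_inf - rec - dead, c + new_inf, d + dead)


def simulate(state, beta, mu, n_total, days):
    """Return the states for the `days` days that follow `state`."""
    out = []
    for _ in range(days):
        state = step(state, beta, mu, n_total)
        out.append(state)
    return out


def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def fit(window, n_total, betas, mus):
    """Grid-search (beta, mu) so simulated C matches window[1:], starting at window[0]."""
    target = [st[2] for st in window[1:]]
    best = (math.inf, None, None)
    for beta in betas:
        for mu in mus:
            sim = [st[2] for st in simulate(window[0], beta, mu, n_total, len(target))]
            loss = rmse(target, sim)
            if loss < best[0]:
                best = (loss, beta, mu)
    return best[1], best[2]


def choose_training_phase(history, n_total, betas, mus, val_len, max_len):
    """Compare training phases of val_len..max_len days ending at the last observation
    and return (score, length, beta, mu) for the smallest validation RMSE."""
    best = (math.inf, None, None, None)
    for t in range(val_len, max_len + 1):
        window = history[-(t + 1):]  # initial state plus t fitted days
        beta, mu = fit(window, n_total, betas, mus)
        sim = [st[2] for st in simulate(window[0], beta, mu, n_total, t)]
        obs = [st[2] for st in window[1:]]
        score = rmse(obs[-val_len:], sim[-val_len:])  # in-sample validation days
        if score < best[0]:
            best = (score, t, beta, mu)
    return best


if __name__ == "__main__":
    n = 20_000_000
    start = (n - 2_000.0, 2_000.0, 2_000.0, 0.0)
    early = simulate(start, 0.25, 0.01, n, 40)     # synthetic fast-spreading regime
    late = simulate(early[-1], 0.10, 0.01, n, 30)  # synthetic slower regime
    history = [start] + early + late
    betas = [k / 100 for k in range(5, 31)]
    mus = [k / 1000 for k in range(0, 51, 5)]
    print(choose_training_phase(history, n, betas, mus, val_len=15, max_len=45))
```

On this synthetic series, windows that reach back past the change in β cannot be fitted by a single parameter pair, so the search settles on a window that lies inside the most recent regime. Ties are resolved in favour of the shortest window, which is a design choice of this sketch.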

There are three parameters in the modified SIS model, namely β, γ, and µ. As argued earlier, γ = 1/T, where T is the average duration for which infection lasts in an infected person. For COVID-19, T is taken as 14 days [16], [17]. Given a training phase, Algorithm 1 is used to find the estimates of β and µ by minimizing L(t, β, µ).

[Algorithm listings; only the loop headers are recoverable:
for $\beta_{\min} \le \beta \le \beta_{\max}$ and $\mu_{\min} \le \mu \le \mu_{\max}$ do
for $1 \le j \le T_{\text{Pred}}$ do]
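The search over β and µ can be sketched as an exhaustive grid search with γ fixed at 1/14. This is a sketch under stated assumptions, not the source's Algorithm 1: the loss used here is the root mean squared error between observed and simulated cumulative cases, the initial values of I, C and D are assumed to be known from the data on the first training day, N is held constant, and the names `simulate_cumulative` and `grid_search` are hypothetical. The "observations" are synthetic and serve only to show that the search recovers the parameters that generated them.

```python
import math

GAMMA = 1 / 14  # recovery rate, from T = 14 days as in the text


def simulate_cumulative(beta, mu, n_total, i0, c0, d0, days, gamma=GAMMA):
    """Daily Euler steps of the modified SIS model; returns C for days 1..days."""
    i, c, d = float(i0), float(c0), float(d0)
    s = n_total - i - d
    cumulative = []
    for _ in range(days):
        new_infections = beta * s * i / n_total
        recoveries, deaths = gamma * i, mu * i
        s += recoveries - new_infections
        i += new_infections - recoveries - deaths
        c += new_infections
        d += deaths
        cumulative.append(c)
    return cumulative


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def grid_search(observed_c, n_total, i0, c0, d0, betas, mus):
    """Exhaustive search over the (beta, mu) grid; returns (loss, beta, mu)."""
    best = (math.inf, None, None)
    for beta in betas:
        for mu in mus:
            fitted = simulate_cumulative(beta, mu, n_total, i0, c0, d0, len(observed_c))
            loss = rmse(observed_c, fitted)
            if loss < best[0]:
                best = (loss, beta, mu)
    return best


if __name__ == "__main__":
    betas = [k / 100 for k in range(5, 31)]      # 0.05 to 0.30 in steps of 0.01
    mus = [k / 1000 for k in range(0, 101, 5)]   # 0.000 to 0.100 in steps of 0.005
    # Synthetic observations generated from known parameters, for illustration only.
    observed = simulate_cumulative(0.12, 0.01, 20_000_000, 5_000, 15_000, 300, days=20)
    loss, beta_hat, mu_hat = grid_search(observed, 20_000_000, 5_000, 15_000, 300, betas, mus)
    print(beta_hat, mu_hat, loss)
```

A grid search is simple and has no convergence issues, at the cost of evaluating every parameter pair; the grid bounds and step sizes above are arbitrary choices for the example.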

Here $\hat{\beta}_{\text{opt}}$ and $\hat{\mu}_{\text{opt}}$ are the optimum values of β and µ, respectively, obtained from Algorithm 1. Algorithm 2 then gives the predicted cumulative infected cases (C) for every day from $T_{\text{Current}} + 1$ to $T_{\text{Current}} + T_{\text{Pred}}$. The length of the prediction period is the user-supplied value $T_{\text{Pred}}$. The root mean squared error for prediction is computed over the days $T_{\text{Current}} + i$, for $1 \le i \le T_{\text{Pred}}$.
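For reference, the standard root mean squared error over the prediction period can be written as follows. The notation is an assumption introduced for this illustration: $C_{\text{obs}}(t)$ denotes the observed and $\hat{C}(t)$ the predicted cumulative infected cases on day t, and the source's own symbols may differ.

```latex
\mathrm{RMSE}_{\text{pred}}
  = \sqrt{\frac{1}{T_{\text{Pred}}}
      \sum_{i=1}^{T_{\text{Pred}}}
      \left( C_{\text{obs}}(T_{\text{Current}} + i) - \hat{C}(T_{\text{Current}} + i) \right)^{2}}
```

The error is expressed in the same unit as C, that is, in number of cases, which is why the values reported for Delhi are in the thousands.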

An R package has been developed to help users apply the developed methodology to their own data. The package is available from https://github.com/abh2k/sisd, with detailed instructions for its use. It accepts user-supplied values such as $T_{\text{Current}}$, $T_{\text{Start}}$, $T_{\text{Limit}}$ and $T_{\text{Pred}}$. Given the appropriate data and the other required input values, the package returns $\hat{\beta}_{\text{opt}}$ and $\hat{\mu}_{\text{opt}}$, the root mean squared error (based on 'in-sample' validation), and the predicted cumulative infected cases.

We use COVID-19 data from Delhi, India's capital city with a population of around 20 million, to demonstrate the proposed algorithm's prediction performance. Delhi had recorded more than 600 thousand cumulative COVID-19 infected cases by the end of 2020. The data are publicly available from https://www.covid19india.org/.

In Figure 4, we consider four different values of $T_{\text{Current}}$: 29 May 2020 (panel (A)), 24 July 2020 (panel (B)), 29 December 2020 (panel (C)) and 15 January 2021 (panel (D)). This set-up checks the prediction performance of the proposed algorithm with the modified SIS model over different time periods. Table I shows $\hat{\beta}_{\text{opt}}$, $\hat{\mu}_{\text{opt}}$, the root mean squared error for prediction and other related information. All four graphs in Figure 4 show that the proposed algorithm predicts the cumulative infected cases well for two different prediction periods, 30 days (for (A) and (B)) and 40 days (for (C) and (D)). Table I also shows that the lengths of the chosen optimal training periods can differ, with different values of $\hat{\beta}_{\text{opt}}$ and $\hat{\mu}_{\text{opt}}$. Notice that $\hat{\beta}_{\text{opt}}$ decreases over time from 0.12 to 0.09 for Delhi, whereas $\hat{\mu}_{\text{opt}}$ increases over the first three scenarios (from 0.007 to 0.091) and then drops slightly to 0.070. These observations support the idea of estimating the parameters of the modified SIS model on the optimal training phase instead of using the entire history as the training phase.

Figure 5 shows what can happen if the entire history is used as the training phase to estimate the model parameters. The 30-day prediction curve based on the entire history (125 days) lies far above the observed curve of cumulative infected cases (root mean squared error = 175884.1), and the gap between the two curves widens toward the end of the prediction period. In contrast, the prediction curve based on the optimal training phase (23 days in total, including 15 days of validation phase) stays closer to the observed curve (root mean squared error = 3090.25). The estimated parameters are $\hat{\beta}$ = 0.24 and $\hat{\mu}$ = 0.09 based on the entire history, compared with 0.1 and 0.06, respectively, using the optimal training phase. The estimate $\hat{\beta}$ based on the entire history is considerably higher than $\hat{\beta}_{\text{opt}}$ based on the optimal training period. This suggests that β should be estimated on an optimal training period, which captures the most recent trend, rather than on the entire history, which captures only the overall trend.

Incorporating the deceased compartment into the modified SIS model is crucial because deaths due to the disease may not be negligible. For example, for COVID-19 the ratio of deaths to infections is significant in many countries. Figure 6 shows the importance of the deceased compartment in the modified SIS model, in terms of µ, for Delhi. The prediction curve with $\hat{\beta}_{\text{opt}}$ = 0.08 and pre-fixed µ = 0 (no deaths due to the disease) lies away from the observed cumulative infected cases, and the difference between the two curves keeps increasing over time, with a root mean squared error of 12990.46. The prediction curve with $\hat{\beta}_{\text{opt}}$ = 0.11 and $\hat{\mu}_{\text{opt}}$ = 0.071 is much closer to the observed cumulative infected cases curve, with a root mean squared error of 5187.47. In both scenarios, the prediction phase and the validation phase are 40 and 15 days long, respectively.

This work has provided a modified SIS model that accounts for deaths due to disease and predicts cumulative infected cases based on an optimally chosen training phase. The estimation process described in this work is beneficial when the disease under study changes its spreading pattern over time. We have developed the modified SIS model considering COVID-19 as the disease under focus. However, the model and algorithms can be applied to predict the cumulative cases of other infectious diseases.

Even though the developed model can produce predictions for any period length, we recommend restricting predictions to the short term. Any prediction beyond 30 days may be unreliable because the characteristics of the COVID-19 virus and human behavior (e.g., how closely social distancing norms are followed at different times) change continuously.

The developed open-access R-package (https://github.com/abh2k/sisd) can be helpful to implement the modified SIS model without dealing with mathematical details of the model. One only needs to prepare the input data set as described in the R-package documentation.

[1] The mathematics of infectious diseases

[2] Epidemiology old and new

[3] The prevention of malaria

[4] COVID-19 and the kidney: From epidemiology to clinical practice

[5] Predictions, role of interventions and effects of a historic national lockdown in India's response to the COVID-19 pandemic: Data science call to arms

[6] Self-management strategies to consider to combat endometriosis symptoms during the COVID-19 pandemic

[7] Extended SIR prediction of the epidemics trend of COVID-19 in Italy and compared with Hunan, China

[8] Introduction to simple epidemic models

[9] "Immunity passports" in the context of COVID-19

[10] South Korea, a growing number of COVID-19 patients test positive after recovery

[11] Seasonal coronavirus protective immunity is short-lasting

[12] Prevalence of IgG antibodies to SARS-CoV-2 in Wuhan: Implications for the ability to produce long-lasting protective antibodies against SARS-CoV-2

[13] Genomic evidence for reinfection with SARS-CoV-2: A case study

[14] Three basic epidemiological models

[15] Population dynamics of pathogens. Handbook of Infectious Disease Data Analysis

[16] Impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand

[17] COVID-19 in India: Statewise analysis and prediction