key: cord-1053843-3nqnlcy3
authors: Liu, Yuansan; Srivastava, Saransh; Huang, Zuo; V'azquez-Abad, Felisa J.
title: Pandemic model with data-driven phase detection, a study using COVID-19 data
date: 2021-10-24
journal: Journal of the Operational Research Society
DOI: 10.1080/01605682.2021.1982652
sha: 5a2da596c9162bf3297ce4a4d98d4d3d8bf00777
doc_id: 1053843
cord_uid: 3nqnlcy3

The recent COVID-19 pandemic has promoted vigorous scientific activity in an effort to understand, advice and control the pandemic. Data is now freely available at a staggering rate worldwide. Unfortunately, this unprecedented level of information contains a variety of data sources and formats, and the models do not always conform to the description of the data. Health officials have recognized the need for more accurate models that can adjust to sudden changes, such as produced by changes in behavior or social restrictions. In this work we formulate a model that fits a ``SIR''-type model concurrently with a statistical change detection test on the data. The result is a piece wise autonomous ordinary differential equation, whose parameters change at various points in time (automatically learned from the data). The main contributions of our model are: (a) providing interpretation of the parameters, (b) determining which parameters of the model are more important to produce changes in the spread of the disease, and (c) using data-driven discovery of sudden changes in the evolution of the pandemic. Together, these characteristics provide a new model that better describes the situation and thus, provides better quality of information for decision making.

The novel coronavirus (SARS-CoV-2) associated disease, COVID-19, was declared a pandemic infection within 3 months of the first case reported due to its fast spread world-wide that brought nations into standstill and forced people to stay indoors and observe distancing (Yella Hewings-Martin, 2020) . Almost all nations at some point have imposed some kind of restrictions to avoid the spread of the virus. Since the first cases described in Wuhan in the Hubai province of The People's Republic of China in December 2019, the global scientific community have actively researched and has given us some answers. Scientific developments abound and today the mechanisms by which viruses infect individuals are well known, although research is under way to fully understand the immune responses and defense mechanisms. At the time of submission, vaccines were still being developed (Vazquez-Abad, 2020) .

In this paper we address the problem as an epidemiology study, rather than a biochemical one. We use mathematical models based on the SIR model of (Abbey, 1952; Kermack & McKendrick, 1932) to describe the evolution of the pandemic. These are based on ODEs (ordinary differential equations) that describe an "aggregate" behavior of the spread of the pandemic.

Central to our study is the use of data. We live in unprecedented times where data is available freely via easily accessible internet repositories. . The sheer amount of data has triggered many publications examining and interpreting the evolution of the pandemic. Being humanly impossible to process all this data, analyses based mostly on epidemiology models, statistics and machine learning are now also growing and providing policy recommendations for NPI policies (non-pharmaceutical intervention). However many of these analyses yield contradicting results and it may become even more confusing for the general public. On May 2020, both Dr Anthony Fauci and Dr Deborah Birx (White House Coronavirus Task Force) recognized that the models are no longer useful for meaningful predictions. One of the main reasons is bad quality of data being used and how it is interpreted in the models. As well, Professor Rossi points out that "the models suffered from the same thing: they weren't robust enough to predict within the episodic, fat-tail-risk events" (Rossi, 2020) . In this paper we formulate a mathematical model that concurrently fits parameters and discovers such episodic changes.

Why are the state of the art epidemiology models not suitable for this pandemic? We believe that unprecedented data availability and the subsequent frequent regulatory changes make current models inappropriate. In this paper we address this research gap with two specific foci of attention.

The first focus of this paper is to be critical about the use and interpretation of data. We will deal with the facts that actual numbers are often under-reported, that data formats differ for different countries, and that the reported data may in some cases contain bias either intentional or unintentional. The second focus is to deal with the presence of sudden (or episodic) changes automatically by creating a data-driven method for knowledge discovery. SIR-based models are sensitive to changes in social behavior and mutations of the virus. The behavior of the individuals in the society can be controlled by policy, regulations, restrictions and education. Establishing which of the model parameters is most affected by policy can provide insight for policy makers and health care providers in order to monitor the pandemic. Because these policies are applied at various times in different countries, the parameters of the model should be re-evaluated. In this paper we introduce a novel methodology that uses a statistical learning method of change detection in conjunction with the usual parameter fitting for the ODE model in order to produce a mathematical model that will discover random and sudden changes and accordingly adjust the parameters of the (otherwise deterministic) model.

Our model can provide information from the actual evolution of pandemics and thus provide new knowledge related to social behavior and infection dynamics. Can our model explain the impact of regulations? Can it explain the onset of particular mutations of the virus that have a significant impact? We will be answering these questions in the sequel.

Section 2 presents an overview of current literature on the topic. Section 3 presents our SIR-type ODE model. Section 4 presents our first results that carry out statistical fitting using data. Section 5 introduces a change point detection for discovering the actual times when regimes changed for various geopolitical locations. Section 6 finalizes studying the methodology for making predictions based on our techniques that use both ODE models and statistical fitting of available data.

As recently as 2020, there has been a surge of research papers trying to predict the trend COVID-19 is going to take. Most of the epidemiology studies are based on the ODE methodology of the SIR model type. Although based on the evolution of ODEs, it is often implemented discretizing time, typically in the one-day scale. In (Hamzah et al., 2020) the authors use a simple SEIR model to forecast the outbreak trajectory at the early stage of the pandemic. (Sarkar, Khajanchi, & Nieto, 2020 ) developed a modified SEIR model, where they divided the ODE into more compartments to analyze the pandemic in India, they did not consider death number due to the short epidemic time. (Ndaïrou, Area, Nieto, & Torres, 2020) studied the pandemic in Wuhan using a modified SEIR model which added super spread, hospitalized, and fatal class. (Glass, 2020) used a two-stage SEIR model, by setting different parameter values for pre and post lockdown, to study the effectiveness of lockdown in several countries, in their model, the death number was calculated from confirmed cases with a fixed mortality rate. In the CRISP model (Herbrich, Rastogi, & Vollgraf, 2020) , which is based on the SEIR model, the authors use a probabilistic graphical model to predict infection spread. They go on to develop a Monte Carlo EM algorithm to infer contact-channel infection spread. The EpiLM model of (Vineetha Warriyar K, Almutiry, & Deardon, 2020), also based on SIR compartmental framework, the authors provide tools for simulation and inference for discrete-time individual-level models, and (Carcione, Santos, Bagaini, & Ba, 2020) simulated the COVID-19 epidemic based on SEIR model. All publications listed above use the "SIR" type model, well accepted as mathematical models of infectious diseases, sometimes referred to as "compartmental models". This kind of models were first introduced by John Graunt as early as the 17 th century (Graunt, 1939) . The 1920's saw the rise of compartmental models for a closed population having fixed number of people. The Reed-Frost (Abbey, 1952) and Kermack-McKendrick epidemic model (Kermack & McKendrick, 1932) (Kermack & McKendrick, 1991) (Kermack & McKendrick, 1933 ) gained significant importance, both describing the relationship among susceptible, infectious and recovered people, this kind of models are able to show how different public interventions may affect the outcome of pandemic.Many hybrid models were then created based on this basic SIR model. In (Aron & Schwartz, 1984) , the authors propose an epidemic model where the total population is divided into four types, S (susceptible, those able to contract the disease), E (exposed, those who have been infected but are not yet infectious), I (infected, those capable of transmitting the disease), R (removed, those who have become immune, dead or recovered), it later has been named as SEIR model.

Others (Lorch et al., 2020) design a more fine-grained spatiotemporal predictions of the course of the disease in the population. They make use of the data obtained through the contact tracing technologies and use real data for prediction.

The work in (Greenhalgh, Doyle, & Lewis, 2001 ) derives a two-group-SEIR model to study the effects of condom usage on preventing the HIV spread, where the author split the whole population into two different groups, assuming only people in one of the group will take the precaution(use condom). This approach provides inspiration for our study.

The ODE models aim at describing the evolution of the number of individuals in each class, approximated by a continuous variable. This is believed to accurately describe "large" populations that change by small amounts. We thank the reviewers for pointing out the existence of discrete models for the spread of the virus. The research in (He, Tang, & Rong, 2020) presents a discrete stochastic model using SEIR model to study the efficiency of different social policies in COVID-19 pandemic. They model the number of individuals at each time step by generating random variables with discrete distributions (Poisson, and Binomial) , where the mean values are chosen to fit the values that the ODE would predict. In (A.Nikolaev & Vázquez-Abad, 2015) the ODE approach is used in a different population model that describes the number of molecules of various proteins in a given cell. In that paper we show the relationship between a stochastic model for discrete space (where the transitions follow a multidimensional birth and death process) and the corresponding (continuous space) ODE model. We provide a mathematical proof that the ODE is not necessarily accurate when the number in the population is "large", but rather that under some assumptions the ODE values correspond to the expected number of individuals in the population. Exploring the impact of this relationship for pandemic evolution is not within the scope of the current paper. Simulation models also consider discrete space (number of individuals), few are agent-based simulation models and some of them are based on Markov chains. For example in (A. Washburn & Vázquez-Abad, 2021) we use a continuous time Markov chain to describe the transmission of the virus in short time scales (seconds and minutes). These approaches are different from the one we pursue in this paper. Specifically we aim to fill the gap that existing methods are not addressing directly, namely the fact that parameters change frequently due to changes in policy and decision making, which affects societal behavior and leads to sudden changes in the dynamics of the pandemic.

The situation simulated in (Greenhalgh et al., 2001) is similar to the current COVID-19 situation as there are two groups of people during the pandemic, one of them follows the policies and take precautions, while another group behaves as usual. For the COVID-19 situation, our model also considers two groups: those who take measures to prevent contagion (PPE, hygiene, social distancing) in group 1, and those who don't in group 2, and we extracted death number from the removed population in order to study the trajectory of the number of deaths. Then the whole population N (t) is divided into five types: S 1,2 (susceptible in two groups), E 1,2 (people in two groups with the SARS-CoV2 virus, who can transmit the virus but are asymptomatic), I (people who have COVID-19 symptoms), R (recovered people who have antibodies), D (dead). Consequently we use the acronym SEIDR. Here,. we extracted death number from the removed population in order to study the trajectory of the number of deaths.

Given a contact between a susceptible individual in Group 1 and another one that has the virus, the susceptible individual may or may not contract the virus. The parameter α ∈ [0, 1] models the lack of efficacy of preventive measures, as follows: (a) when following preventive measures, a susceptible individual (S) will be infected when in contact with the SARS-CoV2 with probability α, and (b) an individual with SARS-CoV2 (E or I) will transmit the virus upon contact with another individual with probability α, the case α = 1 means that above preventive actions have no effect to prevent the virus. Thus, modeling random contacts in a population, the fraction of contacts between individuals in S 1 and E 2 that lead to infection is proportional to (αS 1 )E 2 , whereas the fraction of contacts that lead to infection between individuals in S 1 and E 1 is proportional to α 2 S 1 E 1 . Figure 1 shows the structure of the model for a better understanding the dynamics. In this model, the parameters c, M ij , N ij are all related to the government policy.

Specifically, lockdown restrictions can decrease the parameters c, while forcing citizens to take precautions (mandatory use of masks, enforcing hygiene in public places) can increase M 21 , N 21 which means more people getting into group 1. By changing these parameters, we can study the way that policy can affect the spread of the virus. At the moment when we developed the model the general belief was that recovered individuals had antibodies and were immune to contracting the COVID-19 again. This is the reason why we used a feed-forward network model as shown in Figure 1 . The data that we used covers the initial period of the pandemic, where there was no evidence that recovered patients could contract the disease again, so this model is accurate for the data that we used. However, it is straightforward to adapt the model and introduce a loop back to the S state after R.

In the model E state always transitions into I. While it is true that individuals with the SARS-CoV2 virus may never acquire COVID-19 (the disease), the model reflects an aggregate flow, rather than describing each individual's path. Thus, when fitting our parameters for the network model, it will correctly reflect the data: a fraction of the transitions E → I → R happen "instantaneously".

We focused our initial efforts to reduce the dimension of the problem (fewer parameters) with the idea to simplify the fitting process, while retaining the essence of the pandemic spread. Based on BBC news, the number of people wearing masks increased during the pandemic (Cheung, 2020) . Thus, we assume people only migrate from the unsafe group to the safe one, but not in reverse direction. Second, we consider the pandemic dynamics during lockdown periods, when the population change due to migration is negligible. Hence, there are no changes in total population, and M 12 = N 12 = 0. The model for the spread of the virus is as follows:

Where: β : Contact rate between individuals of the population. c 1 : Average contacts with exposed individuals per person per day. c 2 : Average contacts with infected individuals per person per day. σ : Incubation rate, at which infected people develop symptoms. M ij : Per capita rate of migration of susceptible people from group i to j. N ij : Per capita rate of migration of infected ones from group i to j. d : Death rate per day. r : Recovery rate per day. α : Probability of contagion or transmission given that restrictions are followed.

This model has 9 parameters

The reproduction rate R 0 of the virus is widely used in the literature. For the standard SIR models, this parameter is defined as R 0 = β/r (Ridenhour, 2014) . Because our model also considers the rate of death from the COVID-19, we adapt the formula and use:

In this section we focus on the experimental design that we carried out in order to validate our SEIDR model using available data. We use data from the COVID-19 dataset in (Actions-user et al., 2020) , which source is maintained by Johns Hopkins University Center for Systems Science and Engineering (CSSE). Fitting a model (in this case an ODE epidemiology model) using data presents important challenges. While it is true that an enormous amount of data is now publicly available all over the world, it is usually not in the same format, and it does not necessarily represent statistical samples from the variables in our model. For example, the model describes the number of people at time t that have been exposed to the virus. But there is no data for that. What we do have is the number reported to be infected every day. Clearly these are different variables, but dependent. Most of the COVID-19 available data records daily (a) reported infected people, (b) reported dead people, and (c) reported recovered patients.

If we consider the data reporting process as a sampling from an unknown data stream, there are two factors that can affect the gap between the sampled data and the "ground-truth" data (model variables); one is the sampling method, another is the volume of the sampling. In this study, we identify the following measures: (a) the number of tests per thousand people, which measures the capacity and method of the sampling process, and (b) the reported infection numbers, which reflects the volume of the sample. Validation of our model is done using data from a period of time to fit the parameters (see Section 4.2 below) and then using the model to simulate the evolution of the pandemic for some days after. These predictions are then compared with the actual data from the period after the one used in the fitting. It would be ideal at this stage to use datasets from countries that report the number of infections, recoveries and deaths correctly, but there is no way to verify this, so we use the measures above in our selection criteria. Specifically, we chose countries where the number of tests per thousand people and the number of reported infections are the highest (as of June 09, 2020). For the period when we carried out this research. USA and UK were the top among such countries as shown in 

In order to fit the data, we used the cumulative number of deaths per day, because of the three most commonly reported numbers this is the one that better reflects the model's variable. Thus, as a working hypothesis we postulate that the available data is a sample that corresponds to the variable in the model. It is true that there may be some deaths of patients that were positive for COVID-19 but who also had pulmonary, cardiac and other severe conditions, so it is difficult to assess if the deaths were caused by the virus. On the other hand there might be deaths that were never reported as being caused by the virus. However these potential discrepancies are likely to be very small, so as a working hypothesis we assume that the daily number of deaths reported is an accurate number for the true daily number of deaths due to the virus. Fitting the data using the number of deaths leads to solving an optimization problem for the parameter θ in (9). The mean square error (MSE) is given by:

where T is the number of data points used for training (in days), D t (θ) is the cumulative number of deaths predicted at time t (in days) by our SEIDR model for (6), D t is the corresponding available data for the cumulative number of deaths. The goal is to find the minimum of J(θ). Experiments show that our model has multiple local minima (due to non-convexity of the MSE function, sub-optimality and non uniqueness of the parameters). For comparison purposes this is not desirable, so we use constraints on the parameters to contain the solutions. Normally different constraints can lead to convergence of the model to different "optimal" parameters using the same dataset. However, the model can always produce consistent solutions if we set the constrains in reasonable ranges and the dataset is in good consistency like U.K. dataset. Table 2 gives examples of the constrains of parameters that we used for fitting, but the readers can try other constrains to find a different fitting. Choosing appropriate constraints that lead to comparable models was done via experimentation and we will not include in this study the details of our procedure. 

Our experimental design considers the data available for the U.K., from 1 April to 9 June 2020 (70 days) to fit our model parameter θ. After fitting the parameter θ, the data from 10 June to 9 July 2020 (30 days) is used to validate the model. Specifically, once the model parameter has been established, we used this value of θ to run the ODE and predict the dynamics beyond the 10 June 2020 date. The resulting projections for the number of deaths are compared to the real data available for those days. Figure 2 shows the result of the validation, including the predictions (with our model) and the reported as well as the residuals. Same design was also applied to U.S. data from 11 April to 20 June 2020, we use Considering our selection criteria we chose datasets of particularly good quality, so these results strongly support the validity of our model for the COVID-19 pandemic. As mentioned before, it was not until later in the year that some recidivous cases were discovered. To describe this situation, our model can be modified for future studies, adding a transition from R to S to reflect the degradation of antibodies in some of the recovered patients, who lose immunity to the virus. Figure 4 shows the plots of the number of infected people predicted by the models against the available data of the number of reported infections. Clearly the reported number is much smaller than the number predicted by our model. We believe this gap between the prediction and the data arises because (a) not everyone reports the infection, (b) not everyone is symptomatic and (c) not everyone is tested. Some people are not even aware of the infection and pass it in the community (Day, 2020) (Hu et al., 2020) . In addition, the reported infection number is also affected by the testing capacity of a country. (Hasell, 2020) argues the importance of testing data, because more test cases will give us a better view of the spread. From this rationale we can argue that if more people were tested, the reported number would increase to reflect more accurately the real situation, and it will also be closer to the predicted infection number from our model.

In this work, we propose a statistical correction for the model as follows. We use the vertical adjustment parameter v (we will discuss in the following sections) to estimate the gaps between projected numbers and reported numbers. We use regression for the factor v, using reported data {r t , t = 1, . . . , T } and predicted data {p t , t = 1, . . . T } for the periods considered in the fitting and validation steps. The estimate is the result of minimizing over v ∈ R + the mean squared error (MSE): Figure 5 shows the vertical adjustment of the model predictions, dividing the predicted infection numbers by the constant v. It is apparent that the adjusted number is very close to the reported number, which justifies our belief that there is significant under reporting of infections, under the assumption that our ODE model is correct. Table 3 shows the results of the model after fitting the parameters. The results of this analysis for various datasets can provide information to the specialists in charge of medical and social strategic management. Salient features that we can point out are the similarity between various parameters of the model (that we also observed for other data). This is an indication that such parameters may depend highly on the biochemistry of the interaction between the virus SARS-CoV2 and the human respiratory system, or they may be correlated to the demographics and climate of the regions (USA and UK share many of these characteristics). Notably the parameters β, c 1 , M 21 , r and v are similar in both countries.

It is interesting to notice that the parameter R 0 is about 3 times larger than the range of 0.7 − 1.0 published by the Government Office for Science and Scientific Advisory Group for Emergencies of U.K. government (GOV.UK, 2020).

In the table we have highlighted the parameters that have significant differences between countries. Although we do not have the expertise in social and behavioral science, the differences do tell a coherent story. The death rate is larger in U.K., even when the average number of contacts is smaller. However the data shows a higher efficacy of preventive measures in USA (low α) and a higher migration from infected individuals in Group 2 to Group 1 (thus the fraction of people following the preventive measures is higher in USA). Table 3 .: Model parameters of U.K. and U.S.

We first present the results of the model versus the data for USA extending the horizon 30 more days than used for the validation period. To create Figure 6 we use the parameter θ that was obtained as the optimal value for the fitting problem that minimizes (10), over the training period from 11 April to 30 May 2020. Using this value of θ, we calculated the model predictions beyond 30 May 2020, and compare with the available data. While the model showed a very good fit for 30 day predictions (see Figure 5 ), extending the analysis to 60 days clearly shows an anomaly. There is a significant increase in the amount of reported infections from day 150 (19-June) onwards, as shown in the plot to the right of Figure 6 , and this leads to a huge divergence between the adjusted model prediction and the reported numbers. Looking at the residuals, where the fitting period was over the first 50 days, it is also apparent that after 06-July the parameters of the ODE model no longer describe accurately the evolution of the pandemic.

This pattern agrees with the fact that the epidemiology models of the SIR type contain parameters that depend on the social behavioral patterns, and on individual hygiene and prevention habits, in addition to the biological factors that govern the interaction between the human respiratory system and the SARS-CoV2 virus. In the 21st century the main patterns for the spread of pandemics are well understood and soon after the cases were first reported in the Chinese province of Wuhan, governments took measures all over the world (at different times) in order to contain the spread, or as it was commonly referred to, "flatten the curve".

In the particular case of the US data, to the naked eye infection rate changes occurred at the end of May, coinciding with Memorial Day, when many of the public areas usually open for festivities, including beaches, swimming pools, and outdoor cafes (Post, 2020b) . The study by Center for Disease Control and Prevention (Salvatore, 2020) points out that increase in diagnostic tests cannot be the sole reason for observed increased cases. The reopening of universities and colleges for face-to-face classes possibly lead to transmission of virus among young individuals and their close contact. Also, the surge was more prominent in states where strict social restrictions were not imposed (like Texas) (Guardian, 2020) .

But while the USA shows this surge in the Summer, other countries do not. Even within countries, data shows that imposition of social distancing and other measures have greater effect for some states or counties than for others. It is our view that instead of defining the phases according to published dates when policies have been put to effect, we will let the data talk in order to identify the changes in the phases of the pandemic.

The rest of this section introduces known statistical methods for change detection, and then we apply it to our model in order to automatically detect the phases of the ODE.

In this paper, we call a phase (for the ODE) a time interval where the model follows an ODE with constant parameters. It is apparent that sudden changes are imposed on the dynamics of the pandemic due to sudden changes in people's behavior. Notably restrictions including lockdowns, border closing, social distancing, self-isolation, obligatory use of masks and closing of certain businesses are placed in effect on specific dates in particular locations (countries or socio-political regions). Following the model, interpretable parameters such as c 1 , c 2 , M 21 , N 21 should have different values on different phases. However, policy changes are not the only causes for shocks that can cause sudden changes in the process. Social behavior can also produce notable changes in the parameters, for example large numbers of people seeking unnecessary risk by voluntarily breaking the restrictions has been reported to cause abrupt surges in the number of people requiring medical care (Post, 2020a) .

Rather than defining the dates of the segments in advance and perform the statistical fitting of the SEIDR model within phases, we propose to "let the data itself define the phases". That is, we first apply change detection to the data in order to discover statistically significant points of changes in regimes (or phases). One of the potential benefits is assessing the effect of known shocks to the ODE, and discovering possible causes of changes that would not be easy to predict. In order to illustrate the method we now describe how we apply change detection to the data.

As mentioned before, the statistical fitting is done directly using the data on the number of deaths per day in order to find the parameters of the SEIDR model. It is on the residuals of the predicted cumulative number of dead people (by fitted ODE) that we will apply the change detection. Under the null hypothesis that the ODE has constant parameters throughout the whole sample points (X 1 , . . . , X n ) the residuals (difference between the predicted data and the observed data) should have mean zero. Under this assumption we wish to define an appropriate CUSUM process for change point detection. The rationale is that if the true process has different phases, then the residuals would have mean different from zero on different phases (ill fitting). However we do not know in advance what the new means would be, even if the overall mean residual is zero (by construction, if the optimization procedure achieves a zero error in the approximation).

We consider the method originally proposed by Page (Page, 1954) and used for control charts (Montgomery, 2020) . The algorithm defines in parallel two CUSUM processes: one to detect a positive mean and the other, a negative mean. The processes evolve as follows

where the starting values are C P (0) = C m (0) = 0. The parameter µ 0 is the original process mean and K is called the reference value (or the slack value). Again the rejection region for the hypothesis is defined by a threshold h, and the stopping time

is declared as the estimated change point. When the process that reaches the threshold is C p the positive change is declared, otherwise the negative change is declared. In (Montgomery, 2020) the general control chart practice suggested is to choose K = µ0 2 and h = 5σ.

We included a constant factor c using h = c(5 σ) to suit our case. Assuming normality of residuals, we fit a normal distribution on the training data residuals and estimate sample mean and sample standard deviation. Using these CUSUM parameter estimates we draw control charts. Figure 7 shows the result of the algorithm for the whole period in our dataset for the US, where three change points are detected. Recall that by construction, the points detected are almost surely larger than the "real" change in phase, because there is a detection delay, and also because we use the data for the deaths: in all likelihood a change in phase will entail some delay before the death counts reflect a significant change in trend. Figure 8 provides a visualization of the data corresponding to the number of deaths reported for the US, three change points are marked in the plot. To interpret the results we have added a "close-up" around each change point showing what the previous phase would predict against the actual data. Looking at the plots, phase 2 presented a marked spike in the pandemic, compared to phase 1. In contrast, phase 3 has less growth than predicted under the regime from phase 2, but then phase 3 spikes again. The 11 July spike in number of deaths corresponds to a spike in the number of infected people occurring around Memorial Day. It is well known that this period was characteristic of public demonstrations, opening of restaurants and large public gatherings where people were not forced to wear masks and observe social distancing. Shortly afterwards the alarm was raised and measures were taken to impose stricter restrictions on social behavior, which explains our second change point around 05 August, where the spread of the virus seemed to be better controlled. The next change point shows a marked spike and occurs around 02 Nov, right after the Halloween celebrations, and we can To visualize the dynamics of the pandemic for the UK data, Figure 10 shows the number of infected people, where we have marked the detected change points. It is worth noticing that both changes describe a worsening of the pandemic. In July, the UK government allowed no quarantine requirements for returning passengers from 59 countries. There were no restrictions on outdoor gatherings and people were not made to wear masks mandatorily. All these led to the occurrence of 2nd phase of infection which our model corroborates (Wikipedia:Timeline of the COVID-19 pandemic in England , 2020). Table 4 shows the parameters of the model for US and UK on each of the phases detected.

Our results indicate that change points corresponding to increases in the infection rates are generally characterized by an increase value of the parameters c 1 , c 2 . This is an observation that we verified also with other datasets including Texas and New York. Of all the parameters, these seem to be significant. We can observe from Figure  8 that the second change point in US, which is between phase 2 and 3, is different from the other two, where the model over-estimated the number of deaths. This supports the idea that effective regulations were carried out after the explosion of the Summer wave of pandemic, and the estimated parameters shown in Table 4 confirm this, as c 1 decreased significantly from phase 2 to 3. We noticed that the change between phase 2 and 3 in UK does not follow this pattern. Rather, the parameter that is significantly different is α. The parameter β being smaller may well be related to people following isolation practices thus diminishing the contact rate in the overall population, in spite of which we see a spike. The parameter α is related to the infectivity rate of the virus. Random mutations of the virus occur constantly. In particular, around 900 mutations have already been described for the SARS-CoV2, most of them punctual, many of them either self-repair or result in a less potent version of the virus. However, in early December, mutation D614G was described in UK as having higher infectivity. More than 1000 people were then studied that had been infected with this mutation, but as far as we are aware no-one knows the exact time when this mutation was introduced in the world. Our data seems to indicate that it happened in September. The receptor-binding site of the spike molecule in this mutation has higher affinity, resulting in higher infectivity rate (our α). According to our estimates, the infectivity parameter α is about seven times higher in UK phase 3 as it is in US and UK for previous phases.

We finalize with an important observation. Because of the accuracy in the data, our change point detection is carried out on the residuals for the death counts (see Section 4 for details on dataset selection). However, what matters is an estimate of the actual change in phase in the spread of the virus. Figure 11 shows the number of infections reported in UK. The two change points detected are marked in the plot.

To the naked eye, it is apparent that another change appears towards the end of the period. This coincides with the winter spike and it is perhaps related to the usual increase in activity of the corona viruses. The family of corona viruses is typically reactivated in winter, as is common-for example-with the flu and the common cold, but this new variant of the virus seems to cause infections that are less fatal and with milder symptoms. It is possible that our algorithms did not detect the Winter spike because it works on the death count and up to today's date the sample size is not large enough. Figure 11 .: UK number of infected reported.

As mentioned earlier, by construction a change detection algorithm that builds the CUSUM processes C p,m (t) forward in time has a positive detection delay almost surely. In our case the detection helps to define the intervals of each of the phases in a pandemic. The goal of this research is to propose a methodology that can be used by social scientists, medical advisers and decision makers in order to better understand the spread of the virus and to have a model to analyze the impact of events or policies in the various phases of the pandemic under study.

With this in mind, it may be very important to obtain more accurate estimates of the true change points, which may provide invaluable information about the various factors that may have triggered the transitions between phases.

Let T be the point detected by the algorithm, going forward in time as explained in the previous section. Two-way detection follows a simple principle. Look at the process in reversed time, starting from a point τ > T and going backwards in time. Given the parameter θ that fits the model on the new phase (from T to τ ), we now need to solve the equations in order to find the number of deaths going backwards, assuming θ governs the dynamics of the process. Once this is calculated, the residuals can be evaluated using the data, and change detection can be carried out in exactly the same way as before, but going backwards in time.

Our SEIDR model is a system of autonomous ordinary differential equations in dimension 9. Call x(t) the vector-valued process, and denote the system of equations by

Define now the process y(t) = x(τ − t). Our hypothesis will now test for detection on the process y(t) to discover when the (new) parameters θ no longer describe the (past) observed data. Simple change of variables imply that dy(t) dt = −f (θ; x(τ − t)) = −f (θ; y(t)), y 0 = x(τ ).

Thus, using the reversed time ODE, we can perform detection, providing now a new estimate T of the time when the new phase no longer describes the past.

The final estimate for the change point is declared to be (T + T )/2: the mid point between the two detected points. We should remark that some packages have a built in ODE solver, however the most robust method if we wish to use the two-way detection is the leap frog method for solving the ODEs, because this method is time reversible, and thus more consistent when considering the evolution of the model forwards and backwards in time.

In Section 4 we explained how validation and parameter estimation are carried out for the SEIDR model using past data. In Section 5 we explained how data can be used to detect changes in the phases of the model. This section presents our novel methodology that allows for concurrent estimation and phase detection for data streaming. While the previous sections used data in offline mode, the algorithms here can be implemented online to better inform the experts about the evolution of the pandemic in their socio-political regions of interest. As the spread of the virus evolves, our algorithm estimates both the parameters of the SEIDR model as well as phase change detection.

Our model works with a sliding window of ∆ days. From our extensive testing on the COVID-19 data, we propose ∆ = 30. Initially, the data from ∆ days is used to fit and validate the model, as done in Section 4. Call τ l = 0, τ r = ∆, and the first phase is defined as [τ l , τ r ]. The time τ r is the time when dynamic estimation starts, as follows.

Dynamic change detection: Using the parameter θ calculated for the current phase using data before time τ r compute X t , C p (t), C m (t) at time t and compare to threshold h. This process is iterative and updates these quantities online, streaming data for t > τ r . There are two possible scenarios:

• No change is detected for ∆ more days (that is, for days in the interval τ r < t ≤ τ r + ∆). If this is the case, fit the model parameters θ again using the data from [τ l , τ r + ∆]. Set τ r := τ r + ∆ and continue with the dynamic change detection. • Change is detected at time T < τ r + ∆. Complete gathering data for the period [T, T + ∆]. Set τ l = T, τ r = τ l + ∆ and declare the new phase on [τ l , τ r ]. Fit model parameters for the new phase with this data. From τ r onwards perform dynamic change detection as the data becomes available daily.

We conclude this section showing the results of robust predictions, where our model is used to predict the near future evolution of the pandemic using different values of the parameter c 1 . According to our analysis, it is apparent that this is the parameter that has a greater influence in the sudden change when the infection rate blows up. Figure 12 shows such scenarios. Using the current estimate c 1 the plots show the predictions if this parameter were to increase k times, for k = 1, . . . , 6. The larger value 6c 1 leads to a staggering increase of deaths in a short period of time, reaching over 280,00 in 100 days, while the value 4c 1 yields 70,000 in the same period of time. The projected number using c 1 , however, is stable. This kind of analysis can be put to practice in order to evaluate the possible consequences of relaxing social restrictions in a population. How to actually achieve a particular value for c 1 is outside the scope of the present paper, but we plan to pursue research in that direction in the near future. 

We have shown that our proposed 2-group SEIDR model with phase change detection simulates the spread of COVID-19 disease well. Acknowledging, the spatiotemporal disease dynamics, we have proposed a methodology that uses the reported data to draw its conclusions and does not rely on policy or decision-makers. It is rather a tool to assists such institutions for better analysis and disease phase length estimations. Another possible use of our model is to compare different geo-political regions based on their infection reporting accuracy and data quality by using our vertical adjustment. Each parameter in our modeling relates to either public or government or disease behavior. Exploiting these parameters, policymakers will be able to better understand their impact and predict future trends. This will help the region make more informed decisions on lockdown strategies.

Our proposed combination of statistical fitting with phase change detection can be used by other researchers to study past pandemics and its evolution. This will not only help us understand the behavior of people but also educate us for future potential pandemics. Our forward and backward CUSUM methodology will be particularly very useful in studying past events to analyze if the policies used were congruent with the reported data.

An examination of the reed-frost theory of epidemics

Stochastic approximation for regulating circadian cycles, a precision medicine viewpoint

Seasonality and period-doubling bifurcations in an epidemic model

Simulating sars-cov-2 transmission in the new york subway. In 2021 winter simulation conference, s. kim, b. feng, k. smith, s. masoud, z. zheng

A simulation of a covid-19 epidemic based on a deterministic seir model. Frontiers in Public Health, 8 , 230

Why attitudes to masks have changed around the world

Covid-19: four fifths of cases are asymptomatic, china figures indicate

European and us lockdowns and second waves during the covid-19 pandemic

The r number and growth rate in the uk

Natural and political observations made upon the bills of mortality

A mathematical model of aids and condom use

First thing: surge in new covid-19 infections in republican heartlands

Coronatracker: Worldwide covid-19 outbreak data analysis and prediction

What can data on testing tell us about the pandemic?

Crisp: A probabilistic model for individual-level covid-19 infection risk estimation based on contact data

Clinical characteristics of 24 asymptomatic infections with covid-19 screened among close contacts in nanjing, china

Contributions to the mathematical theory of epidemics. ii.-the problem of endemicity

Contributions to the mathematical theory of epidemics. iii.-further studies of the problem of endemicity

Contributions to the mathematical theory of epidemics-i

A spatiotemporal epidemic model to quantify the effects of contact tracing, testing, and containment

Coronavirus pandemic (covid-19). Our World in Data

Introduction to statistical quality control

Mathematical modeling of covid-19 transmission dynamics with a case study of wuhan

Continuous inspection schemes

A high-risk florida teen who died from covid-19 attended a huge church party, then was given hydroxychloroquine by her parents

Memorial day weekend draws big crowds

Unraveling r 0 : Considerations for public health applications

The coronavirus is a problem data models cannot comprehend

Recent increase in covid-19 cases reported among adults aged 18-22 years-united states

Modeling and forecasting the covid-19 pandemic in india

Hope: Developing a covid-19 vaccine

Wikipedia:timeline of the covid-19 pandemic in england

How do sars and mers compare with covid-19?