key: cord-0807567-1l3bku9v
authors: Rizzo, S. G.; Vantini, G.; Saad, M.; Chawla, S.
title: AutoSEIR: Accurate Forecasting from Real-time Epidemic Data Using Machine Learning
date: 2020-07-28
journal: nan
DOI: 10.1101/2020.07.25.20159715
sha: 58f0d2f01b9cf258816f78fd46fae940b3c8489b
doc_id: 807567
cord_uid: 1l3bku9v

Since the SARS-CoV-2 virus outbreak has been recognized as a pandemic on March 11, 2020, several models have been proposed to forecast its evolution following the governments' interventions. In particular, the need for fine-grained predictions, based on real-time and fluctuating data, has highlighted the limitations of traditional SEIR models and parameter fitting, encouraging the study of new models for greater accuracy. In this paper we propose a novel approach to epidemiological parameter fitting and epidemic forecasting, based on an extended version of the SEIR compartmental model and on an auto-differentiation technique for partially observable ODEs (Ordinary Differential Equations). The results on publicly available data show that the proposed model is able to fit the daily cases curve with greater accuracy, obtaining also a lower forecast error. Furthermore, the forecast accuracy allows to predict the peak with an error margin of less than one week, up to 50 days before the peak happens.

The SARS-CoV-2 virus is a new strain of the Severe Acute Respiratory Syndrome (SARS) species that causes COVID-19. Its long incubation period and the fact that infected individuals are often asymptomatic or exhibit mild symptoms are distinctive features that make spreading the virus relatively easy. The COVID-19 fatality and hospitalization rates are reported to be substantially higher than the regular flu [1] and thus can overwhelm the health services of even advanced developed countries. Thus, to control the spread of the virus, forecasting models for COVID-19 are required for planning and decision making. Despite the online release of daily data from most of the affected countries it has become clear that predicting the future trajectory of the pandemic is a very challenging task [2] . In particular, traditional epidemiological models are less effective when applied to very fine-grain, partial observations, than when applied to historical smooth data.

Many epidemiological parameters are used to model the progress of an epidemic and forecast its future evolution. One important parameter is the basic reproduction number R 0 , which defines the average number of new infections resulting from one individual carrying the virus in a completely susceptible population. It has been shown that the measured R 0 as well as other epidemiological parameters vary widely across different countries and studies. As a result, current estimation of R 0 for COVID-19 epidemic ranges from 1.4 to 6.49 [3] . Moreover, the fact that the countries have responded in different manners to prevent its spread makes it difficult to infer the effective reproduction number and other transmission parameters either before or after intervention. One of the limits of traditional compartmental models (e.g., SEIR) is the assumption that the epidemic parameters do not change over time. In practice, most countries have imposed lockdowns and restrictions to mobility, together with physical distancing guidelines, which affected the average number of contacts and the transmissibility of the virus. Models that do not consider the change in transmissibility resulting from lockdowns have not produced accurate predictions [2, 4] .

Once the transmission parameters are accurately estimated, SEIR models can be applied to simulate the evolution of the epidemic. One key challenge in this process is simulating the SEIR compartments while the only available data are the newly reported cases and standard compartments are not observable. Since the newly reported cases are only a fraction of the real infected individuals, in some studies the population has been artificially reduced to a much lower value in order to obtain reasonable results from an SIR model [5] . Moreover, confirmed cases are usually reported a long time after the infection occurs, which explains why the effect of intervention (reduction of new infections) is only observable after at least two weeks. Unless we choose an incorrect incubation period of around two weeks [6] , which is much longer than the estimated median of five days [7] , it is not possible to fit the observed data with a standard SEIR model. A time gap can be manually fixed [5, 8] , for example shifting the prediction or the original data, although an ideal solution would consider this gap to be country-dependent and it should learn the gap length automatically from the data. In general, traditional models need to be extended to cope with all the above challenges.

In this paper, we propose a new extended SEIR model with machine learning parameter fitting, we call AutoSEIR. The main contributions of AutoSEIR are: (i) we propose a new compartment-based model to capture the unique characteristic of COVID-19. In particular we extend the classical SEIR model by introducing two additional compartments, Tested and Untested. The Untested compartment is not observable as the proportion of the population that is being tested continues to be relatively small; (ii) we introduce a testing delay parameter and a confirmed compartment to capture the reporting delay, which usually occurs between the infection and the reporting stage of each case; (iii) we consider two different transmissibility values, before and after intervention, while fitting the whole time series, conversely to other methods that fit each period separately, obtaining a discontinuous result [9] ; and finally (iv) we extend a recently introduced proximal optimization framework for parameter inference for ODEs, using automatic differentiation, to account for interventions and state variables that are not observable. In this study, we evaluate the method for accuracy in parameter fitting, forecast capabilities, peak prediction, and robustness to outliers in the data. The results and daily-updated forecasts are publicly available on our website (https://covid-research.qcri.org/seir/).

Our proposed extended SEIR model contains seven compartments: S (Susceptible), E (Exposed), I (Infected), T (Tested), U (Untested), C (Confirmed) and R (Removed) ( Figure 1 ). Our model accounts for the delay between symptom onset, getting an RT-PCR test, and also the outcome of the RT-PCR test. The dynamics of the compartments are defined by the following ordinary 3 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. . https://doi.org/10.1101/2020.07.25.20159715 doi: medRxiv preprint differential equations (ODEs):

The model contains several parameters, fully described in the following section. In the above equations and in Figure 1 we can separate our parameters in three categories: (1) the transmissibility β that controls the rate of susceptible individuals getting infected after getting in contact with an infected individual, (2) the time rate parameters {α, γ, D T , D R , D U } that control the timing of moving from a compartment to the next, and (3) the ratios p T and p U to split between tested and untested compartments. Since we have two different rates before and after the social distancing is enforced, the β parameter is the only parameter that varies with time.

A distinct feature of our approach is that we can estimate the parameters even when several ODE equations are un-observable. In practice, we will assume that only the compartment of confirmed cases is observable, which happens after individuals have been tested.

Given the defined ODE system, the complete state over the seven compartments at time t is

The state depends on two sets of parameters. One set has fixed values, and one set must be estimated. The vector θ is the vector of free parameters that we infer through optimization and it includes the following:

1. I 0 : the initial infected, i.e., x I (0) = I 0 .

2. R 0 : basic reproduction number, assuming a completely susceptible population.

3. R t : basic reproduction number resulting from social distancing, also assuming a completely susceptible population.

4 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. 5. p T : probability of an infected case to be tested. This is equal to 1 − p U , where p U is the probability of unreported cases.

The fixed parameters are:

1. D U : recovery duration for the unreported cases, from the end of the infectious period to the removed/recovered compartment.

2. D R : recovery duration for the reported cases, from the reporting of the case to the removed/recovered compartment.

3. 1/α: incubation period.

4. 1/γ: infectious period.

The rationale of fixing the above parameters is that the time of social distancing is a known datum, while the timing of recovery compartment was not crucial to the scope of this work and can be more easily estimated from reported data. In a scenario where the estimation of recovery time is a key factor, such as in policy-making for hospital capacity management, x R should be also observable and D U , D R can be included in the parameter vector θ.

The initial timestep t 0 is variable, given by t obs − (α + γ + D T ), thus it depends on the current parameter values. 5 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. .

In order to obtain the number of cases, given the defined system of ODEs and the current parameters vector θ, we solve numerically the ODEs using the midpoint method. The solver takes as input the initial state X(0), the system of ODEs f θ (X t , t), and iterates over t, approximating each X(t + 1) using the midpoint rule:

The initial

where N is the total population of the considered country or area, and I 0 ∈ θ, the initial number of infectious subjects, is part of the parameters vector that will be fitted.

A loss function is defined to measure the error of the simulated results with respect to the real acquired data. In the real data, not all the compartments of X are observable, instead only the ground truth for the new reported cases is always available, which we call y(t). Note that the corresponding variable in our system is not directly available, as we simulate the state of currently infected, but we do not keep track of the newly confirmed cases at time t. This variable can be defined as:

that is the new cases are given by the current number of confirmed cases x C (t), removing the previous step confirmed cases x C (t − 1), and adding the cases that left the confirmed department,

The loss function over all the states is

where y(t) is the observed confirmed cases at discretized timestep t and T is the entire dataset period. In this function we are considering the point-wise error of prediction as well as the cumulative error so that we optimize for both the daily cases and the total cumulative cases. The loss surface as a function of the parameters pair {R t , p T } on the Italian dataset is shown in Figure 3 . In this dataset the number of new cases started to decrease when 0.1% of the population was confirmed as infected, as a result of the imposed restrictions. This decreasing curve can only be fitted either by having an R t smaller than 1, or by considering that in reality a high percentage of the population has been infected but not tested, that is having a very small value for p T . This is reflected in the surface plot where the fitting loss increase exponentially for R t > 1 and high p T . The lowest loss is instead obtained when R t is around 0.8 and the probability of testing is around 0.2. 6 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. 

The optimization problem is defined by the constrained minimization objective

where c l,i and c u,i are lower and upper bounds enforced on the parameter θ i . To compute the gradient of the objective function with respect to the model parameters, we use the automatic differentiation python package autograd.

First I0 infected at model time t0 A unique feature of the proposed method is that, in the ODE solver, the simulation starts a period of time before the first observed cases. This is required in order to match the confirmed compartment's count to the confirmed cases 7 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. . https://doi.org/10.1101/2020.07.25.20159715 doi: medRxiv preprint in the observed data, given that the defined ODE implies a sequence of movements that starts from the susceptible compartment and reaches the confirmed compartment after a period of time. The average duration of this time period is given by the sum of the average periods between the infection compartment and the confirmed compartment, given by the latent/incubation period, the infectious period, and the test delay, which is 1 α + 1 γ + D T . Note that, contrary to other approach where a time gap is manually fixed [8] , in our approach the duration is automatically set based on the fitting of D T . The time discrepancy between the ODE and the observed data is illustrated in Figure 4 . In this chart, which represents the Hubei province initial cases, the first observed data point in the dataset is on January 22. Nevertheless, to fit such data, the infection simulation of the ODE solver starts on January 10 (i.e., 12 days before). This time shift accommodates the period of time needed to go from the first, yet unobserved cases, to the 400+ cases first observed after almost two weeks.

The data used in our experiments are collected from the three sources shown in Table 1 with the corresponding areas of interest. Lockdown intervention dates have been extracted from newspaper sources of each country, summarized on the related Wikipedia page of COVID-19 pandemic lockdowns (https://en.wikipedia.org/wiki/COVID-19 pandemic lockdowns). For all the experiments, we setup the same upper and lower bounds constraints and initial values ( Table 2 ). The constraints are enforced in order to avoid negative values as well as unrealistic high values. Initial values are reasonable values to start the numerical optimization. Both initial values and constraint ranges are shown in Table 2 . All values are constant except for the initial number of infected (i.e., the observed cases at the first time step y(0)).

8 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. Figure 5 : Results of ablation experiments on fitting for UK, Italy, New York and U.S. All the features combined (full model) produce the best fitting while using more reasonable parameters.

We perform ablation experiments to investigate the impact of each proposed extension of the SEIR model that makes up our complete approach. Specifically, we evaluate the accuracy of the model under the following five configurations: 9 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. . 1. Full model. This is the full model presented in Section 2.

No p T . In this model we remove the probability of testing, assuming that all the infected cases are tested and observable.

No D T . This model has no test delay, thus all the infected cases are reported as soon as they are tested.

No R t , D T . In this model the intervention does not have any effect, thus all parameters learned for data points before and after intervention are the same. Additionally, the model has no test delay feature, meaning that all the infected cases are reported as soon as they are tested.

No p T , D T . In this model both the probability of testing and the test delay are removed. All infected cases are tested, and reported without any delay.

We evaluate the described models in two different experiments:

• Fitting. In this experiment we evaluate the ability of the models of fitting the observed data. The complete data up to a recent date is provided for training, and the fitting of the training data is evaluated. The purpose of this experiment is to show that without any of the missing feature the model would not be capable of fitting real data, with the exception of fitting parameters with non-realistic values.

• Forecast. We design this experiment to evaluate the predicting capability of the models. We train the models with data spanning from the start of the epidemic until one to six weeks after the intervention date. For testing, we consider the week following the end of the training data, and we measure the prediction accuracy of each model on this unseen 7-points time series.

The above experiments are carried out on the datasets of U.K., Italy, U.S., and New York. Results of the ablation and forecast experiments are shown in Figure 5 and 6, respectively. Fitting results of the full model are shown in the first row. For each fitting, the root mean square error (RMSE) is given for the whole time series. In all the experiments except one, the result of the full model has the lowest error among the models without one or more extensions. Only the 'No D T ' model for the Italy dataset has a slightly lower error. However, the inferred rate of testing parameter was unrealistic (0.7%), for a country where the testing ratio has been estimated to be in the range of 10%-33% [10, 11] . The same 'No D T ' model yields a good fitting for the rest of the datasets as well, with the downside of estimating an R 0 parameter to be higher than the known value. The rest of the models were not capable of reasonably fitting all the datasets. In particular, the UK dataset could not be fitted without using the testing ratio p T or the after-intervention R t parameter. For the U.S. dataset, only the model 'No R t , D T ' failed to fit the observed data by a large margin.

10 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. . https://doi.org/10.1101/2020.07.25.20159715 doi: medRxiv preprint Figure 6 : Error on next week forecast for Italy, Germany, and UK datasets.

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. . https://doi.org/10.1101/2020.07.25.20159715 doi: medRxiv preprint For the forecast experiment, we show the RMSE of the five models for each period used as training set (week from 1 to 6). The full model result, represented with a solid blue line, exhibits the lowest error in most of the configurations. The performance of the remaining models vary widely across different countries datasets and periods of time without any particular pattern. Logically, our results show that the full model has a decreasing trend in error with respect to the increase of the training data, showing that the more historical data is available to the model, the better it predicts future points. This suggests that the full model is good at generalizing the epidemic dynamic, while the remaining models are overfitting the observed data in an unpredictable manner, lacking a robust underlying model. The availability of almost real-time data for the ongoing COVID-19 epidemic has resulted in the occurrence of several fluctuations, outliers and anomalies in the public datasets, which represents an issue for parameter fitting among other challenges. One example of these anomalies is the 600% surge in cases observed on February 12 for the contagion data in Hubei province, where 13,332 cases 12 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. . were observed in one day. The anomaly was caused by a policy change on the counting of cases, when the co-presence of symptoms and chest infection from CT scan was reported as confirmed case. To test the robustness of our model to these contexts, we train on the Hubei province data until February 18, thus including the spike, and we generate the forecast of 6 weeks to validate both fitting and prediction accuracy. In Figure 7 , the results are shown together with the overlapping observed data represented with a red solid line. The figure shows that the fitting of the observed data is not misled by the outlier, moreover the resulting forecast has been reasonably accurate.

Peak prediction error (relative deviation of forecast from real peak) Daily cases on the day when prediction is performed Figure 8 : Peak prediction on Qatar cases. Starting from 60 days before the peak, we forecast the peak date every day. Relative error on forecast is provided as a bar for each day. Solid line represents the daily new cases. The prediction has high accuracy except when close to a deceitful peak, which happened around -30 days with respect to the real peak.

In this experiment we evaluate the ability to predict the date when the peak occurs. Given the predicted new cases for each time step, x N (t), we obtain the predicted peak timestep as

We select the COVID-19 new cases dataset of Qatar as a case study. We run the parameter fitting using as training data starting from 2 months before the peak up to the day before the peak, and we measure the difference with the actual peak date. This difference is defined as 13 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 28, 2020. 

In Figure 8 we show the result for each day. For example, 35 days before the real peak happened (-35 on the figure) the method was able to predict with exact accuracy the peak date. The accuracy is high on average, with a mean absolute error of 10 days. The model has been fooled by a false peak around 30 days before the real peak, however it managed to well re-tune its prediction after more days of data has become available.

We have proposed a novel approach that combines an extended compartmental model, to address the unique features of COVID-19 pandemic and of the available data, with a novel machine learning technique for parameter fitting on ODE systems with partial observations. This approach allowed us to obtain a state-ofthe-art forecast accuracy and peak prediction. The resulting forecasts have been used in government settings and international media outlets as a trusted source for the outbreak projections. Machine learning has been successfully applied to epidemic curve prediction in the past, with methods that treat the problem as a standard supervised time series forecasting task [12] and a wide literature on estimating the epidemic parameters for predicting the future of an ongoing outbreak, especially for influenza [13] . The COVID-19 emergence, together with the readiness and availability of related data, has attracted great attention on epidemic modeling research, in the pursuit of an accurate method to predict the future outcome of the pandemic. Several approaches have been proposed in the recent literature to fit the confirmed cases observed in public datasets. In general, it has been shown that epidemic forecasting and peak prediction is a very challenging task that leads to very inaccurate results with traditional SEIR optimization approaches, especially when interventions are involved [2, 4] .

In [14] , an architecture composed of a neural network and a time integrator is trained to to fit infected and deaths data in a traditional SIRD model, however this is not used to produce forecasts. Also a full library has been presented, named PyRoss [15] , to model the epidemic and fitting the data using Bayesian parameter inference and model selection. To cope with the artificial factors that influence the daily data, the SUQC model (Susceptible, Un-quarantined, infected, Confirmed infected) has been proposed instead of SEIR [16] , with a loss function optimized using the interior-point method. In [9] , an extension of SEIR is proposed, to additionally model unreported infections and hospitalized subjects. The observed time series is split into several period, to account for the change given by restrictions, and the parameters of each subset are fitted in isolation. This results in a discontinuous fitting and inaccurate long-term forecast than using a unified contiguous approach as we propose. More complex extensions to SEIR for COVID-19 include the SEIQRDC model [17] , where the exposed population is able to infect during the latent period, confirmed cases are quarantined and susceptible are partially confined, and a compartmental model that represents mild, severe and critical infections separately, as well as severe and critical disease [18] .

Other than SEIR models, more simple models combining power law and exponential law have been used for fitting, which obtain more accurate results than non-extended compartmental models for forecasting [19] . Monte Carlo simulations have been also applied to model the cumulative cases and predict the reduction of fatalities [20] .

The problem of the unconfirmed cases and the time lag in case confirmation has been addressed in other works by reducing the initial population to a small fraction and shifting the date of the first cases [5, 8] . With the proposed approach, both the real and confirmed cases are simulated, while only the confirmed cases are used to compute the fitting error.

Automatic differentiation of ODE solvers has been used for parameter fittings [21] , however, conversely to our method, it is assumed that the values of all the variables in the ODEs, for all the time steps, are available. In the problem we address, only one of the variable can be observed, this required a novel solution.

As a future direction, more compartments can be included in the SEIR model (e.g., Hospitalized, Vaccinated). Other important demographic factors can be also integrated in the model such as age, sex, mobility data flows, between areas in a city and regions in a country, to model the effect of urban mobility and national transportation lockdowns, with the final goal of supporting the decision making surrounding the enforcement and the lifting of restrictions.

Covid-19: risk factors for severe disease and death

Why is it difficult to accurately predict the covid-19 epidemic?

The reproductive number of COVID-19 is higher compared to SARS coronavirus

Prediction of the epidemic peak of coronavirus disease in japan

Analysis and forecast of covid-19 spreading in china, italy and france

On a knife's edge of a covid-19 pandemic: is containment still possible?

Epidemiological characteristics of novel coronavirus infection: A statistical analysis of publicly available case data

Modified seir and ai prediction of the epidemics trend of covid-19 in china under public health interventions

Evolving epidemiology and impact of non-pharmaceutical interventions on the outbreak of coronavirus disease 2019 in wuhan, china

Quantifying undetected covid-19 cases and effects of containment measures in italy

Italian coronavirus cases likely "10 times higher than reported

Prediction of an epidemic curve: A supervised classification approach

Influenza forecasting in human populations: a scoping review

First-principles machine learning modelling of covid-19

Inference, prediction and optimization of non-pharmaceutical interventions using compartment models: the pyross library

Modeling the epidemic dynamics and control of covid-19 outbreak in china

A modified seir model to predict the covid-19 outbreak in spain and italy: simulating control scenarios and multi-scale epidemics

Characterizing key attributes of the epidemiology of covid-19 in china: Model-based estimations

Predicting turning point, duration and attack rate of covid-19 outbreaks in major western countries

Mathematical prediction of the time evolution of the covid-19 pandemic in italy by a gauss error function and monte carlo simulations

A block coordinate descent proximal method for simultaneous filtering and parameter estimation