key: cord-0516052-hba1rpv9 authors: Nadler, Philip; Wang, Shuo; Arcucci, Rossella; Yang, Xian; Guo, Yike title: An Epidemiological Modelling Approach for Covid19 via Data Assimilation date: 2020-04-25 journal: nan DOI: nan sha: 44ef80a3db1ac03771ce5511da5371db09ff1f3e doc_id: 516052 cord_uid: hba1rpv9 The global pandemic of the 2019-nCov requires the evaluation of policy interventions to mitigate future social and economic costs of quarantine measures worldwide. We propose an epidemiological model for forecasting and policy evaluation which incorporates new data in real-time through variational data assimilation. We analyze and discuss infection rates in China, the US and Italy. In particular, we develop a custom compartmental SIR model fit to variables related to the epidemic in Chinese cities, named SITR model. We compare and discuss model results which conducts updates as new observations become available. A hybrid data assimilation approach is applied to make results robust to initial conditions. We use the model to do inference on infection numbers as well as parameters such as the disease transmissibility rate or the rate of recovery. The parameterisation of the model is parsimonious and extendable, allowing for the incorporation of additional data and parameters of interest. This allows for scalability and the extension of the model to other locations or the adaption of novel data sources. The global outbreak of n-Cov2019 and the possibility of severe social and economic costs worldwide requires immediate action on suppresion measures. In order to evaluate the efficacy of past and future policy measures to fight and contain the spread of n-Cov2019, a robust and quantifiable analysis system is required. We propose a methodology for forecasting the spread of the virus and show how to estimate latent infection rates, accounting for high uncertainty in observation and model specification. This is done by combining real-time Bayesian updating with epidemiological models. We develop a custom compartmental SIR model which is fit to data related to the spread of the coronavirus in China which we name SITR. This is embedded in a data assimilation framework, a form of recursive Bayesian estimation [1] , which conducts model updates when new observations become available. A hybrid data assimilation approach is applied to make results robust to initial conditions. We use the model to infer the amount of infected people and both, the disease transmissibility rate, as well as the rate of recovery. The time-varying parameter structure of the model allows for the incorporation and analysis of policy action, such as if the shutdown of transportation or closure of schools affect transmissibility. In line with other researchers, our model estimates indicate that the number of infected people is a number of magnitudes higher than the actual reported number of hospitalized people in China. We also find that compared to static models, updating the parameters in a dynamic fashion leads to an upward correction of the true number of infected people as well as reducing forecasting errors. We estimate both short term and long term dynamics in Wuhan and find a peak of infections in the beginning of March. We furthermore apply the model to data to a selection of developed and developing countries and find that infections peak in the middle of April. The rest of the paper is structured as follows: Section two discusses related work. Section three and four introduce the dynamic model as well as the SITR compartmental model. Section five discusses the empirical results and section six concludes. The spread of the novel type of respiratory virus as well as the dramatic economic consequences trying to contain it has led to a rapid engagement of the scientific community, with many different areas of research being explored. Using compartmental models in epidemiology, authors such as [2] , [3] and [4] have done the first studies on the size of the outbreak in China. They applied standard SIR models with static parameters to estimate the basic reproductive number and analyze the exponential growth of the virus in Wuhan. The work of [4] in particular combines standard SEIR models with travel data obtained from Tencent and found that epidemic dynamics show exponential patterns in multiple major cities with a lag behind the Wuhan outbreak of about one to two weeks. First studies using data assimilation for epidemiological modelling have been conducted by other authors such as [5] and [6] , which studied the techniques on different cases such as influenza using standard SIR models, although none has considered the issue of the robust covariance estimates as discussed in [7] or [8] . Further studies such as [9] and [10] study time varying parameters in more detail, although only for standard SIR models with no relationship to the current corona virus outbreak. We are the first to conduct a study of the current spread of 2019-nCoV using data assimilation. We furthermore contribute by providing a novel framework which enables the prior computation of covariance matrices, adding robustness to epidemiological assimilation models. We introduce an adaptive epidemiological modelling framework which combines a SIR model whose model parameters are time-varying with data assimilation techniques. We start our analysis with a standard SIR model [11] , which is a system of three interrelated non-linear ordinary differential equations without an explicit analytical solution. The dynamics of the model are given by : Where S denotes the susceptible population size, I the infected people who are not isolated from the population and R the recovered population. The total population is given by N . The parameters β and γ denote the transmission and recover rate of the virus infection. Note that for the outbreak in cities such as Wuhan, the susceptible number S is observable, which we label as the population of Wuhan city. The recovered population R denotes the population not infectious anymore and being removed from the population, which for the Wuhan outbreak is the number of confirmed cases since confirmed cases are hospitalized and isolated and not infecting the general population anymore. Data Assimilation (DA) is a technique to incorporate observations into a theoretical model where uncertainty is quantified [1] . It allows for problems with uneven spatial and temporal data distribution and redundancy to be addressed such that models can ingest information. DA is a vital step in numerical modeling and has become a main component in the development and validation of mathematical models in meteorology, climatology, geophysics, geology and hydrology [12] . Recently, DA is also applied to numerical simulations of geophysical applications, medicine, biological science and finance. Data assimilation can be applied to a variety of problems where an uncertainty quantification has to be included [13] or where latent parameters need to be computed taking into account new observations. The Adaptive DA-SIR model is a model which incorporates Data Assimilation with a compartmental SIR model. We use DA as an adaptive modelling approach which integrates new observations into our compartmental model to enhance the accuracy of forecasts as well as computing model parameters of interest, in our case β and γ in the SIR model. The SIR model in equations (1)-(3) can be discretised with respect to the time variable, giving the following equations: For a given time step t and assuming to have observations of the variable R t we denote here with R obs t , the DA problem consists in computing the minumum of the cost function and where H : I → R is a linear transformation function usually called observation function [1] which is here represented by the SIR model, and where Q and P denote the the background and the observation covariance matrices, representing an estimation of the errors into the data. Data assimilation is very sensitive to initial conditions and the choices of the covariance matrices. Their calibration needs to be properly chosen. The data we use representing S t , I t and R t is given by the official government numbers for Wuhan and is available at [14] . The solution of the DA problem in (7) leads to a modified extended Kalman filtering algorithm where an SIR model is used to compute the forward steps, e.g. in the time window [t, t + M ]. Where I DA t are the values of I t computed after the assimilation of R obs t as in Eq. 8. To illustrate and put results into perspective, we compare results of our adaptive DA-SIR model with the common SIR model for Wuhan. Both models use the same initial conditions given by the observed data. In Fig. 1 we compare model performance of the standard SIR model and show how assimilation of new observations generates updated model dynamics in the DA-SIR model that do differ from standard SIR model predictions by a wide margin, as is illustrated in the figure. Selected values of the graph are available in Table 1 where we compare estimated cases and infection rates for both the static SIR, and dynamic DA-SIR model. This illustrates how without updating the parameters the number of infected people is underestimated and the assimilation of new observations helps to adjust the trajectory of likely infections in the future. Having shown the large difference between static and dynamic SIR models we next introduce a further refined extension of the dynamic assimilation model. Having illustrated the benefits of embedding the SIR model in a DA framework, we aim to further exploit the available data to do more fine tuned inference. In the previous case of the simple SIR model, both recovered and isolated patients were categorized as R. We revise the SIR model by introducing an intermediate compartment T . Here, T represents the number of people being treated, given by the difference between accumulated confirmed cases and recovered or deceased patients R. Instead of just observing one variable, the number of confirmed cases, we are now observing two variables: the currently confirmed cases being treated T and removed infectious population due to recovery or being deceased R. The model is given by The parameter β e t = β St Nt is the real transmission rate over time, taking into account the total population size N as in the SIR model. Assuming all the parameters β e , α, γ time dependent, the SITR model in equations (9)-(12) can be discretised with respect to the time variable, giving the following equations: which is a linearized approximation of the original SIR model with the additional compartment T . The other variables are the same as in the SIR model, where S denotes the susceptible population, I the infected people who are not isolated from the population. The parameters γ and α denote the recovery and transition rate given by total of incubation and admission days. To extend the model and incorporate information not just of the last timestep, we introduce a model extension which bases model predictions on a sliding window of length τ , similar to a 4D-VAR approach [1] . For a given time window [t, t + τ ] and assuming to have observations of the variable T t we denote here with T obs t , the resulting assimilation scheme is given by which in a first step infers the number of infected people I and where H : I → T is a linear transformation function usually called observation function [1] . To estimate the infection rate, in a second step we use minimize which updates β conditioned on assimilated values of I. The resulting algorithm implements a 4D-VAR assimilation scheme in cost function (7), where forecasts and parameter estimates are based on a sliding window over time. Without preconditioning, the algorithm updates the model parameter values with the noise and observation matrices Q and P being fixed hyperparameters. In order to present results which may have major policy implications, correct and robust estimation of initial conditions and hyperparameters is of high importance, we therefore introduce a formalization and preconditioning of the covariance matrices Q and P before applying the assimilation scheme, which is named hybrid data assimilation. We estimate values for both the state and observation covariance matrices Q and P by using an ensemble approach [7] . The values for P are based on an estimate of the residual covariance matrix of the stationary observed time series. Following the cost function give by Eq. 7, with x b representing an individual background state vector, x b = [S, I, T, R]. The full ensemble of state vectors is given by If the ensemble mean is defined as x b , then V ens , the background state perturbations are computed via In this case, V ens and X b are a n x N matrix called the ensemble background perturbation matrix. The rank-deficient version of the background error covariance matrix is defined as Q * with The ensemble is static, meaning that it does not evolve dynamically with time, but it still incorporates flow-dependent information at the start time which is still beneficial for an extended Kalman filter or 4D analysis. The way the ensembles are chosen and computed determines the accuracy of ensemble DA. The ensemble needs to be computed in such a way that the time dependent variability of the background error covariance matrix, as well as the correlation of variables is captured by the sampling procedure. The method we devise is to divide the collection of background states, x b based on the size of the ensemble into N equally sized groups with each group being denoted by x b (i) meaning that ensemble members belong to the ith group. The mean and standard deviation of each group is then estimated and used to sample the ensemble members from. Algorithm 1 describes in detail how V ens is computed and ensembles are formed. The full background state matrix, x b is split into N groups each of size n × n N . Both, the means as well as the standard deviations of the n rows are estimated and used to generate draws from a multivariate Gaussian distribution to form the ensemble. In order to form V ens , for each ensemble member the corresponding mean is estimated and then subtracted, computing the standard deviation. To put results into perspective we discuss the difference between standard assimilation and hybrid approaches by conducting a sensitivity analysis next. As we mentioned in Section 4.1, the choice of the covariance matrices strongly affect the efficiency and the accuracy of the assimilation approach. As the available data is not accurate enough, in order to justify our estimations, we run a sensitivity analysis to study the impact of our estimated parameters and covariance matrices into the model predictions. To illustrate the hyperparameter sensitivity we compare the number of estimated infected people and we apply a mean root squared forecasting error (MRSFE) metric: whereŷ t,n represents the model prediction, y r t,n the real observation with forecast horizons defined by h = 1, and τ 0 = 1 the starting period of the forecast for n variables. The results are given in Table 3 : Selected data points for predicted number of infected and treated patients, as well as the MRSFE. Results are obtained using a naive unit covariance matrix. We present our results for the final hybrid assimilation scheme, and report the short-and long run predicted dynamics of the virus spread and show additional parameter estimates. We describe the results of the Hybrid ensemble covariance method, comparing it first with a naive unit covariance setup: The results in the top left column of Fig. 3 show a peak of infections around early and mid February for Wuhan, with a strong dip in the January period. This is possibly due to an overweight of observation over model dynamics, showing that the initial unit covariance does not account for model and observation uncertainty. The introduction of the hybrid data assimilation method as given in the top right column of Fig. 3 makes the assimilation scheme more robust to initial values and introduces a more realistic and smooth development of the number of infected cases in Wuhan, without an unlikely strong decrease of infections end of January. The updated model assimilates new observations of infected patients and people recovered from the virus. The long run dynamics predict a recent spike in the number of infected people in Wuhan. The total number of people being treated in hospitals follows with a small lag and will be reached in early march. Comparing the left and right bottom figures of Fig. 3 , the transmissibility rate β shows less variation over time, which differs from the model without ensembles where strong variation is visible. High variability implies that the transmissiblity is affected more easily by external factors which change the dynamics of new infections. Thus the robust model estimates imply that, within the sample period, the transmission rate is stable and unaffected by external policy factors within Wuhan city. Also since the number of treated patients is observable, we use generated forecasts of treated persons as a forecasting metric to evaluate the model fit. The tables 3 and 4 give excerpts from the forecasts values of infected and treated patients as well as the corresponding MRSFE for treated patients in hospitals. For the unit covariance table 3 it is observable how the unrealistic dip in forecasted infections which is visible in the top left of Fig. 3 causes a large spike in forecasting errors for treated people end of January. Comparing it with a hybrid assimilation approach in table 4 reveals an overall lower number of forecasting error and better fit. To compare the predicted dynamics of our model in regions hit less sever by the outbreak we extend the analysis beyond Wuhan. Following the same framework, we investigate the COVID-19 spread in Beijing. As given in Fig. 4 , the infection rate β reached the high level 0.27 from Jan 26th to Feb 2nd, followed by a gradual decrease period with an Figure 3 : The left figures presents short-and longrun dynamics of the infection rates with a simple unit covariance matrix, the right figures present results using the hybrid assimilation approach infection rate bottoming out at a value of 0.05. The predicted infected people in the community touched the peak on Feb 3rd with a daily incidence around 440. When comparing parameter estimates in Fig. 6 to the estimates of Wuhan, the β in Beijing is smaller and the incidence peak is 3 days earlier, which indicated the effects of intervention in early epidemic stages. The long range forecasts in Fig. 5 indicate a gradual decrease of the number of treated patients as well as the number of infections, resembling the dynamics for Wuhan on a lower level. Compared to Wuhan, the peak of the infection occurs five days before Wuhan on the 2nd of February, showing that low early infection levels as well as quarantine measures introduced by the government led to a rapid decline of infection cases. To put results into perspective, the next section applies the methodology to international data, giving estimates of peaks of covid19 globally. To illustrate the flexibility of our approach we add a brief international comparison with additional results for multiple countries. Since country level data is more readily available than city specific data but lacks more granular data such as patients under treatment we conduct inference using the DA-SIR model. We focus solely on the number of infected Table 4 : Selected data points for predicted number of infected and treated patients, as well as the MRSFE. Results are obtained using a hybrid assimilation approach for the covariance matrix. explaining the difference in predicted and observed values. The forecasts are given in table 5. Forecasing errors on a nationwide level are larger compared to city bases estimates. In the data assimilation correction cycle forecasting errors are feed back into the model and lead to a correction of parameters which leads to model adjustments and decreases in infection estimates.According to the model estimates, the infection rate at the end of the sample is around 10 percent of the population, which is forecasted to drop to 6 percent in a 5 day ahead forecast. Comparing results to the United States in Fig. 8 of infections are increasing within sample, as is the forecasted number of infections. At the end of the sample, the infection rate is 0.6 percent of the population, which is forecasted to increase to around one percent within five days. The predicted confirmed cases are closer to observed values due to percent wise relatively low infection numbers compared to Italy, although the model predicts an increasing trend given current model dynamics. The trend is upward sloping, but the overall level as percentage of total population is lower than in Italy. The different levels of infections are likely due to different inception dates of the pandemic, as well as a high amount of uncertainty given very limited testing capabilities in Italy and especially the United States. The results indicate that the pandemic has reached a peak in Italy recently, the dynamics for the United States indicate that no plateau has been reached yet and that the number of infections is likely to increase. The results show how the assimilation framework can be extended to multiple countries and provide robust results given the large uncertainty in infection estimates. We introduced a novel epidemiological assimilation scheme to forecast and evaluate the current corona pandemic worldwide with a specific focus on Wuhan, China. We combined compartmental models in epidemiology with data assimilation schemes showing the advantage of real-time forecasting and parameter estimation in the current crisis. We discussed the benefits and differences in infection numbers when models are updated on a daily basis compared to static modelling. We then introduce a model extension allowing us to observe patients being treated, and patients being removed from the infectious population, which we labelled SITR. Since models are sensitive to estimates of the covariance matrices, we add a hybrid ensemble approach which allows for robust covariance matrix estimates. We find that in Wuhan the peak of infections is reached end of February, with the number of patients being treated peaking early march. The generalisability of our model allows the model to be implemented in other cases and countries such as Italy and the United States where the model indicates further growth in the United States and a decline of total infection cases in Italy. Since this work focused mainly on the methodology of providing a robust recursive Bayesian estimation for the current nCov-2019 outbreak, we propose a further in depth-study of the parameter estimates and a comparative study across countries focusing on the epidemiological implications. Future work can add further complexities to the model, such as taking into account different mortality rates due to population age, cultural norms or quality of the healthcare system, providing applicability and robustness of the model for different datasets and scenarios. We encourage both researchers and policymakers to run similar test results with data from other countries or on a more local level to estimate potential infection rates of outbreaks and the rate of transmission to implement the correct policy measures to contain and mitigate adverse effects of the pandemic. Data assimilation: methods, algorithms, and applications Estimating the potential total number of novel coronavirus cases in wuhan city, china Early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia Nowcasting and forecasting the potential domestic and international spread of the 2019-ncov outbreak originating in wuhan, china: a modelling study Variational data assimilation with epidemic models Towards real time epidemiology: data assimilation, modeling and anomaly detection of health surveillance data streams Gsi 3dvar-based ensemble-variational hybrid data assimilation for ncep global forecast system: Single-resolution experiments The evolution of the ecmwf hybrid data assimilation system Real time bayesian estimation of the epidemic potential of emerging infectious diseases Bayesian tracking of emerging epidemics using ensemble optimal statistical interpolation Discussion: the kermack-mckendrick epidemic threshold theorem Numerical effects of the gaussian recursive filters in solving linear systems in the 3dvar case study On the variational data assimilation problem solving and sensitivity analysis National health commission of the people's republic of china We are grateful for helpful discussions and feedback from Joseph Wu, Neil Ferguson and other participants at the Royal Society conference "Scientists against CoViD-19 and beyond"