key: cord-1037371-b0dtcjn3 authors: Li, Xin; Zhao, Zebin; Liu, Feng title: Big data assimilation to improve the predictability of COVID-19 date: 2020-12-02 journal: nan DOI: 10.1016/j.geosus.2020.11.005 sha: d61bc7ba3ac3d24672eb89be4296b64fccc2946d doc_id: 1037371 cord_uid: b0dtcjn3
The global outbreak of COVID-19 requires us to accurately predict the spread of the disease and to decide how to adopt corresponding strategies that ensure sustainable development. Most existing infectious disease forecasting methods are based on the classical Susceptible-Infectious-Removed (SIR) model. However, because of their strong nonlinearity, nonstationarity, and sensitivity to initial values and parameters, SIR-type models can produce large deviations in the forecast results. Here, we propose a framework that uses the Markov Chain Monte Carlo method to estimate the model parameters and then applies data assimilation based on the Ensemble Kalman Filter to update the model trajectory by incorporating real-time confirmed cases, so as to improve the predictability of the pandemic. Based on this framework, we have developed a global COVID-19 real-time forecasting system. Moreover, we suggest that big data associated with spatiotemporally heterogeneous pathological characteristics and the social environment in different countries should be assimilated to further improve COVID-19 predictability. It is hoped that accurate prediction of COVID-19 will contribute to the adjustment of prevention and control strategies to contain the pandemic and help achieve the SDG of "Good Health and Well-Being".
Keywords: COVID-19; Data assimilation; Big data; Prediction; Sustainable development; SDG
The global outbreak of the coronavirus disease 2019 (COVID-19) requires us to fight it collaboratively and to ensure the achievement of the United Nations Sustainable Development Goals (SDGs) (United Nations, 2020). Since December 2019, the global outbreak of COVID-19 has affected the health of tens of millions of people around the world and is likely to affect hundreds of millions in the near future. It will profoundly change the global socioeconomic structure and have political impacts (Jiao et al., 2020; Zhao et al., 2020a). COVID-19 has become a truly global, systemic, and cascading emergency that poses a real threat to global sustainable development. In retrospect, the global response to this challenge has not received the attention it deserves. From a long-term perspective, it is necessary to formulate a new special target based on COVID-19 within the framework of the United Nations SDGs in order to respond to large-scale health problems. Promoting healthy societies can expand the scope of SDG 3 (Good Health and Well-being) (Guo, 2020).
Responding to the current pandemic requires that we adopt global cooperation and solidarity, scientific and technological innovation, and decision-making based on scientific knowledge, that we strengthen and combine our efforts (United Nations, 2020), and that we pursue "coordination, classification, and collaboration" to tackle this challenge (Fu et al., 2020; Zhao et al., 2020a). Reasonable epidemic prediction can provide estimates of the total number of infections, the lifecycle of the epidemic, and the arrival time of the epidemic peak, as well as an assessment of the severity of the epidemic. In addition, model forecasting provides the scientific foundation for decision-making and for adjusting intervention strategies. Since the outbreak of COVID-19, various models have been proposed to simulate, analyze, and predict the pandemic (Chinazzi et al., 2020). As the spread of COVID-19 has become increasingly severe, infectious disease models have played key roles in predicting pandemic trends, in scientific prevention and control, and in outbreak assessment. Studies of infectious disease models have a long history that can be traced back to the 1760s, and the famous Susceptible-Infectious-Removed (SIR) model was established in the 1920s (Kermack and McKendrick, 1927). The SIR model divides the population under natural conditions into three categories: susceptible individuals, infected cases, and removed cases. Most existing infectious disease forecasting methods are based on the classical SIR model. Because it has a clear pathological dynamic mechanism, the SIR-type model can accurately predict uncontrolled disease spread at the early stage. However, for medium- and long-term epidemic spread (such as 60 days or more), SIR-type models have poor predictability, where predictability can be defined as the qualitative or quantitative correctness of the system predictions. One reason is that epidemic characteristics and transmission patterns vary by region and season (Zhan et al., 2019).
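To make the three-compartment structure and its parameter sensitivity concrete, the following is a minimal sketch of the classical SIR equations integrated with a forward Euler scheme. The function name `simulate_sir`, the step size, and all parameter values (transmission rate `beta`, removal rate `gamma`, population of 10,000) are assumptions chosen only for illustration; they are not taken from the forecasting system described in this article.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, steps_per_day=10):
    """Integrate the classical SIR equations with forward Euler.

    dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I
    Returns a list of (S, I, R) tuples, one per day, including day 0.
    """
    n = s0 + i0 + r0
    dt = 1.0 / steps_per_day
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_removals = gamma * i * dt
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        trajectory.append((s, i, r))
    return trajectory


def peak_day(trajectory):
    """Index of the day on which the infected compartment is largest."""
    return max(range(len(trajectory)), key=lambda d: trajectory[d][1])


# Hypothetical parameters: a 10% change in beta is enough to move the peak.
baseline = simulate_sir(beta=0.30, gamma=0.10, s0=9990, i0=10, r0=0, days=120)
perturbed = simulate_sir(beta=0.33, gamma=0.10, s0=9990, i0=10, r0=0, days=120)
peak_day_baseline = peak_day(baseline)
peak_day_perturbed = peak_day(perturbed)
```

Comparing `peak_day_baseline` with `peak_day_perturbed` shows how a modest error in a single parameter shifts the timing of the peak, which is one way to see why fixed-parameter SIR-type forecasts degrade over longer horizons.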
The parameters of each SIR-type model vary in space and time across different stages of an epidemic; that is, they are affected by changes in pathological characteristics, the social environment, human interventions such as prevention and control measures (Tian et al., 2020), medical conditions, and self-protection awareness (Scarpino and Petri, 2019). In addition, the stochasticity and nonstationarity of the models mean that the predictions of SIR-type models are sensitive to the initial values of the model and to the total population in the region. Therefore, because of the nonlinearity, heterogeneity, and randomness of epidemic dynamics, and the significant impact of prevention and control policies, a better epidemic forecasting method that improves predictability is required, both for volatile and normal epidemic spreads. We propose that data assimilation combined with parameter estimation can improve the predictability of COVID-19: the parameters are optimally estimated from real-time epidemic data before simulation; then, the trajectory of the infectious disease model is adjusted dynamically by assimilating real-time observations such as confirmed infected cases, thereby reducing the accumulated errors of the prediction system. This approach is similar to real-time numerical weather prediction. Data assimilation is rooted in, and was developed from, estimation theory, cybernetics, and chaos theory. It fuses model simulations with real observations and thereby enhances the predictability and observability of the system (Li et al., 2020b). Data assimilation has been used to predict the spread of infectious diseases and to provide a theoretical reference for science-based prevention. A main advantage of the data assimilation technique is that it combines dynamics with real-time observation data to recursively optimize both the state variables and the parameters of the epidemic model, which improves the predictability of disease spread (Shaman et al., 2013; Zhan et al., 2019).
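The trajectory adjustment described above can be illustrated with a single analysis step of a stochastic Ensemble Kalman Filter (the perturbed-observation variant). This is a sketch under stated assumptions, not the implementation used in the forecasting system: the state is taken to be an (S, I, R) tuple, only the infected count is observed, and the function name `enkf_update`, the ensemble size, the observation error, and every number in the example are hypothetical.

```python
import random


def enkf_update(ensemble, observed_infected, obs_std, rng):
    """One stochastic EnKF analysis step for an ensemble of (S, I, R) states.

    Only the infected compartment is observed, so the observation operator
    simply picks index 1 of each state tuple.
    """
    m = len(ensemble)
    predicted_obs = [member[1] for member in ensemble]
    obs_mean = sum(predicted_obs) / m
    state_means = [sum(member[k] for member in ensemble) / m for k in range(3)]

    # Sample variance of the predicted observation, and its sample
    # covariance with each state component.
    obs_var = sum((y - obs_mean) ** 2 for y in predicted_obs) / (m - 1)
    cross_cov = [
        sum((member[k] - state_means[k]) * (member[1] - obs_mean)
            for member in ensemble) / (m - 1)
        for k in range(3)
    ]
    # Kalman gain for a scalar observation with error variance obs_std**2.
    gain = [c / (obs_var + obs_std ** 2) for c in cross_cov]

    updated = []
    for member in ensemble:
        # Each member assimilates its own perturbed copy of the observation.
        perturbed_obs = observed_infected + rng.gauss(0.0, obs_std)
        innovation = perturbed_obs - member[1]
        updated.append(tuple(member[k] + gain[k] * innovation for k in range(3)))
    return updated


# Hypothetical forecast ensemble: infected counts scattered around 500
# in a population of 10,000 with 300 already removed.
rng = random.Random(0)
forecast = []
for _ in range(200):
    infected = rng.gauss(500.0, 80.0)
    forecast.append((9700.0 - infected, infected, 300.0))

analysis = enkf_update(forecast, observed_infected=650.0, obs_std=20.0, rng=rng)
forecast_mean = sum(member[1] for member in forecast) / len(forecast)
analysis_mean = sum(member[1] for member in analysis) / len(analysis)
```

In this toy setting the analysis mean is pulled from the forecast mean toward the observed value, weighted by the relative size of the ensemble spread and the observation error. In a sequential system the updated ensemble would then be propagated forward by the epidemic model until the next observation arrives.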
A successful application of data assimilation to improve the prediction of COVID-19 is found in a recent work that combined the Susceptible-Exposed-Infectious-Recovered (SEIR) model and the Ensemble Adjustment Kalman Filter within a Bayesian inference framework to explain the impact of undiagnosed cases in Wuhan on the rapid spread of COVID-19 (Li et al., 2020a). Additionally, there are significant uncertainties in our knowledge of the parameters of dynamic infectious disease models, which are affected by the pathological characteristics of the virus and by human activities such as nonpharmaceutical interventions; therefore, reasonable parameter estimation is crucial for improving model accuracy (King et al., 2008). Parameter estimation for infectious disease models is usually conducted in a Bayesian framework or by heuristic optimization. For example, Bayesian inference and the Markov Chain Monte Carlo (MCMC) algorithm have been successfully introduced in SIR and SEIR models to predict COVID-19 spread in Wuhan (Roda et al., 2020). Furthermore, the particle swarm optimization method has been used to estimate SEIR model parameters (He et al., 2020), and the iterative filtering method based on the maximum likelihood function has been applied to the prediction of COVID-19 spread (Fox et al., 2020). Overall, compared with previous pandemic case studies (Tian et al., 2020; Yang et al., 2020), we aim to integrate the advantages of both data assimilation and parameter estimation to improve the predictability of COVID-19, reduce forecast uncertainty, and further enable real-time COVID-19 prediction. We integrated data assimilation, parameter estimation, and infectious disease models to predict COVID-19 spread.
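As an illustration of Bayesian parameter estimation with MCMC, the sketch below uses a random-walk Metropolis-Hastings sampler to recover the transmission rate of a simple SIR model from synthetic infected counts. Everything here is an assumption made for the example: the plain SIR model stands in for the SEIR and contact network models mentioned in this article, the data are simulated rather than real case counts, and the function names, the Gaussian likelihood, the uniform prior on (0, 1), and the proposal step size are hypothetical choices.

```python
import math
import random


def infected_curve(beta, gamma=0.1, n=10000.0, i0=10.0, days=30):
    """Daily infected counts from a forward-Euler SIR run (one step per day)."""
    s, i = n - i0, i0
    curve = []
    for _ in range(days):
        new_infections = beta * s * i / n
        s, i = s - new_infections, i + new_infections - gamma * i
        curve.append(i)
    return curve


def log_likelihood(beta, observed, noise_std):
    """Gaussian log-likelihood of the observations (up to a constant)."""
    simulated = infected_curve(beta, days=len(observed))
    return -0.5 * sum(((o - s) / noise_std) ** 2
                      for o, s in zip(observed, simulated))


def metropolis_hastings(observed, noise_std, n_samples, rng,
                        start=0.2, step=0.005):
    """Random-walk MH sampler for beta with a uniform prior on (0, 1)."""
    beta = start
    current_ll = log_likelihood(beta, observed, noise_std)
    samples = []
    for _ in range(n_samples):
        proposal = beta + rng.gauss(0.0, step)
        if 0.0 < proposal < 1.0:  # the prior density is zero outside (0, 1)
            proposal_ll = log_likelihood(proposal, observed, noise_std)
            # The proposal is symmetric, so the acceptance probability
            # reduces to min(1, likelihood ratio).
            accept_prob = math.exp(min(0.0, proposal_ll - current_ll))
            if rng.random() < accept_prob:
                beta, current_ll = proposal, proposal_ll
        samples.append(beta)
    return samples


# Synthetic "observations": a noisy run with a known transmission rate.
rng = random.Random(1)
true_beta, noise_std = 0.3, 20.0
observed = [i + rng.gauss(0.0, noise_std) for i in infected_curve(true_beta)]

samples = metropolis_hastings(observed, noise_std, n_samples=5000, rng=rng)
posterior = samples[1000:]  # discard the first 1000 draws as burn-in
beta_estimate = sum(posterior) / len(posterior)
```

The posterior mean `beta_estimate` can be compared with `true_beta` to check that the sampler recovers the parameter. With real data, the spread of `posterior` would also give an uncertainty range for the estimate, which is the main reason to sample rather than optimize.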
We retrospectively forecasted the COVID-19 outbreak in Wuhan by using the Common software for Data Assimilation (ComDA) (Liu et al., 2020b), which not only integrates classic algorithms for data assimilation and parameter estimation, i.e., Metropolis-Hastings (MH) (Zhu et al., 2014) and the Ensemble Kalman Filter (EnKF), but also employs the contact network model and the SEIR model as model operators. In addition, we successfully reproduced the "Diamond Princess" epidemic by employing Bayesian inference and MH parameter estimation combined with an infectious disease model. The MH algorithm repeatedly samples and iterates in the multidimensional parameter space, learning from the daily confirmed cases, to obtain the optimal estimate of the parameters (Liu et al., 2020a). The MH algorithm has also been used with an improved SEIR model to predict the pandemic in six African countries under different intervention scenarios, and the validation results are consistent with the real pandemic within 20 days of the initial stage (Zhao et al., 2020b). These results showed that our strategies can improve the predictability of pandemics and reduce forecast uncertainty. We then developed a COVID-19 prediction system based on real-time operational data assimilation and parameter estimation. We used MCMC sampling to estimate the model parameters and EnKF to continuously update the model trajectory by assimilating the latest pandemic data into the dynamic framework for seven-day nowcasting and two-month forecasting (Fig. 1). The forecasting was further supported by scenario analysis with three assumed non-pharmaceutical interventions. Moreover, a pandemic forecast information platform (http://covda.tpdc.ac.cn/#all) was successfully launched and is capable of predicting the spread of COVID-19 in 202 countries/regions all over the world.
The epidemiological data (i.e., the population densities and the daily and cumulative confirmed cases) were tracked by country/region both in plots and in a geographic information dashboard. The forecasts have achieved satisfactory results, with an average prediction accuracy of approximately 77% for the following 7 days. Therefore, this forecasting system can provide a reference for adjusting relevant prevention and control measures. The system differs from existing real-time visualization systems, such as the Johns Hopkins University visualization system, in that it focuses on forecasting the spread of COVID-19 over a future period, with a simple and clear interface that makes it easy to view the trend of COVID-19 by country.
Figure 1. Conceptual flowchart of the COVID-19 prediction system using data assimilation, parameter estimation, and big data (scenarios A, B, and C represent three intervention scenarios: suppression, mitigation, and mildness, respectively).
Big data innovation is particularly important for combating COVID-19 and ensuring the sustainable development of human society (Fu, 2020; Murdoch and Detsky, 2013). Big data from social media, mobile data, and other sources contain more sophisticated information than daily confirmed cases, particularly on the spatiotemporal variation of social distancing, population mobility, and pathological characteristics in different countries. These data provide timely and useful information for a more reasonable parameterization of the COVID-19 prediction model (Zhou et al., 2020), constrain data assimilation and parameter estimation, and enable the design of more realistic scenarios for long-term prediction. In summary, data assimilation and parameter estimation can help tackle COVID-19 by enhancing the predictability of the pandemic.
Meanwhile, adjusting the epidemic trajectory by introducing real-time big data on pathological characteristics and the social environment offers a promising path toward big data assimilation. We hope that reliable prediction of COVID-19 will contribute to the adjustment of prevention and control strategies to combat the pandemic, ensure human health, and achieve the goal of "Good Health and Well-Being" established by the United Nations SDGs.
1. Accurate prediction of COVID-19 is essential for achieving the United Nations SDGs.
2. Predictability of COVID-19 is improved by data assimilation and parameter estimation.
3. A data assimilation based COVID-19 forecasting system is developed.
The inevitable application of big data to health care
Why is it difficult to accurately predict the COVID-19 epidemic?
On the predictability of infectious disease outbreaks
An investigation of transmission control measures during the first 50 days of the COVID-19 epidemic in China
Real-time influenza forecasts during the 2012-2013 season
Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study
Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions
Real-Time Forecasting of Hand-Foot-and-Mouth Disease Outbreaks using the Integrating Compartment Model and Assimilation Filtering
A systematic approach is needed to contain COVID-19 globally
Prediction of the COVID-19 spread in African countries and implications for prevention and control: A case study in South Africa
COVID-19: Challenges to GIS with big data
Simultaneous parameterization of the two-source evapotranspiration model by Bayesian approach: application to spring maize in an arid region of northwest China
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No.
XDA20100104), the Science-based Advisory Program of the Alliance of International Science Organizations (Grant No. ANSO-SBA-2020-07), the National Natural Science Foundation of China (Grant No. 41801270), and the Foundation for Excellent Youth Scholars of NIEER, CAS. The authors declare that they have no conflict of interest.