key: cord-0760081-axio34pi
authors: Ceylan, Zeynep
title: Estimation of COVID-19 prevalence in Italy, Spain, and France
date: 2020-04-22
journal: Sci Total Environ
DOI: 10.1016/j.scitotenv.2020.138817
sha: 4a0f55f37503f093813b1cc00bfae123dc415b1b
doc_id: 760081
cord_uid: axio34pi

Abstract
At the end of December 2019, coronavirus disease 2019 (COVID-19) appeared in Wuhan city, China. As of April 15, 2020, >1.9 million COVID-19 cases were confirmed worldwide, including >120,000 deaths. There is an urgent need to monitor and predict COVID-19 prevalence to control this spread more effectively. Time series models are important for predicting the impact of the COVID-19 outbreak and for taking the necessary measures to respond to this crisis. In this study, Auto-Regressive Integrated Moving Average (ARIMA) models were developed to predict the epidemiological trend of COVID-19 prevalence in Italy, Spain, and France, the most affected countries of Europe. The daily prevalence data of COVID-19 from 21 February 2020 to 15 April 2020 were collected from the WHO website. Several ARIMA models were formulated with different ARIMA parameters. ARIMA (0,2,1), ARIMA (1,2,0), and ARIMA (0,2,1) models with the lowest MAPE values (4.7520, 5.8486, and 5.6335) were selected as the best models for Italy, Spain, and France, respectively. This study shows that ARIMA models are suitable for predicting the prevalence of COVID-19 in the future. The results of the analysis can shed light on the trends of the outbreak and give an idea of the epidemiological stage of these regions. In addition, the predicted COVID-19 prevalence trends of Italy, Spain, and France can help other countries take precautions and formulate policy for this epidemic.

COVID-19 is a disease caused by a new type of coronavirus that spreads rapidly from person to person and has become a major epidemic causing great tragedy.
COVID-19 is caused by a virus from a family of zoonotic coronaviruses that includes the severe acute respiratory syndrome coronavirus (SARS-CoV) and the Middle East Respiratory Syndrome Coronavirus (MERS-CoV) seen in the past two decades. The starting point of the virus is considered to be the city of Wuhan, China, and the first fatal cases were reported in late 2019. The virus can be fatal, especially for the elderly and those with chronic diseases. The disease has a very dynamic structure and spreads rapidly. Unfortunately, as of April 15, 2020, 123,010 deaths and approximately 2 million cases have been confirmed worldwide. The number of confirmed cases varies due to differences in epidemiological surveillance and detection capacities between countries. However, it can be said that the disease has spread all over the world as of today. Since no treatment has yet been established for this virus, health infrastructure and services must be planned effectively and the rate of disease spread must be controlled. For this reason, modelling daily confirmed cases and estimating possible new cases in the future is vital for managing and directing the demand on the health system. Mathematical and statistical modelling tools can be used to make short- and long-term case estimates in order to plan the additional materials and resources needed to deal with the outbreak. Estimating the expected burden of disease is essential for public health officials to manage medical care and other resources effectively and in a timely manner. Also, such estimates can guide the intensity and type of interventions needed to alleviate the outbreak (Zhang et al., 2020). In the past, different statistical methods have been applied with high accuracy for different prediction purposes.
Recently, different statistical methods such as time series models (Kurbalija et al., 2014), multivariate linear regression (Thomson et al., 2006), gray forecasting model (Y. wen Zhang et al., 2017), backpropagation neural network (Ren et al., 2013; Zhang et al., 2013), and simulation models (Nsoesie et al., 2013; Orbann et al., 2017) have been used to predict epidemic cases. Epidemics are affected and limited by many different factors. For this reason, the general spread of an outbreak is characterized by both tendencies and randomness. Therefore, the mentioned statistical tools are insufficient to analyze the epidemic randomness, and the models are difficult to generalize. The Auto-Regressive Integrated Moving Average (ARIMA) model has been successfully applied in health as well as in other fields because of its simple structure, fast applicability, and ability to explain the data set (Cao et al., 2020). As seen in Table 1, ARIMA models have been successfully applied in the past to estimate the incidence and prevalence of influenza mortality, malaria incidence, hepatitis, and other infectious diseases. Besides, ARIMA models are widely used for time series prediction of epidemic diseases such as hemorrhagic fever with renal syndrome, dengue fever, and tuberculosis. ARIMA models are instrumental in modelling the temporal dependency structure of a time series, given the changing trends, periodic changes, and random distortions in the time series. ARIMA models are also relatively easy to explain to the end-user because they do not require extensive mathematics or statistics. In this way, the end-user can have an idea of how the prediction model has been developed and can rely more on the model during the decision-making process. Many studies have used different models to predict COVID-19 incidence, prevalence, and mortality rate in China.
Li and Feng (2020) developed a function to predict the ongoing trend with data-driven analysis and estimate the outbreak size of COVID-19 in China (Li and Feng, 2020). Roosa et al. (2020) used phenomenological models validated during previous outbreaks to create and evaluate short-term forecasts of the cumulative number of confirmed reported cases in Hubei, China (Roosa et al., 2020). Fanelli and Piazza (2020) analyzed the temporal dynamics of the COVID-19 pandemic in mainland China, Italy, and France (Fanelli and Piazza, 2020). Roda et al. (2020) compared standard SIR and SEIR frameworks to model COVID-19 in Wuhan, China (Roda et al., 2020). Wu et al. predicted Algorithm (PIBA) for estimating the death rate of COVID-19 in real-time using publicly available data. In summary, there are many studies in the literature that predict the spread of COVID-19 in China. However, Europe has become the epicenter of the outbreak, which has hit the continent harder than China. If confirmed cases account for between 10 and 20% of infected individuals, the apparent mortality rate of COVID-19 is 4% in China, 13% in Italy, 11% in Spain, and 15% in France. Therefore, it is important to analyze the situation of the COVID-19 epidemic and predict the prevalence trend, especially in Italy and in France and Spain, the two other most affected countries. The goal of this work is to estimate the prevalence trend for Italy, Spain, and France, where COVID-19 spreads fastest and causes tragic results. The data analyzed in this study correspond to the period between 21 February 2020 and 15 April 2020. The data set was used to build and analyze a case estimation model by applying different ARIMA models.
Thus, in addition to shedding light on the characteristics of the spread of the epidemic, the aim was to provide authorities with realistic estimates of the peak time and intensity of the epidemic using simple quantitative models. These models can help predict the health infrastructure and materials that patients in these countries will require in the near future. The daily prevalence data of COVID-19 were taken from the WHO website (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/), and MS Excel was used to build a time-series database. Descriptive statistics of the COVID-19 data of the mentioned countries between 21/02/2020 and 15/04/2020 are given in Table 2. To create a stable and effective ARIMA model, at least 30 observations are required (Box et al., 2015). Therefore, in this study, a time series containing at least 45 observations was used to predict the COVID-19 prevalence of Italy, Spain, and France over the next ten days with 95% relative confidence intervals. As seen from Fig. 1 A time series is simply expressed as a set of data points ordered in time (Fanoodi et al., 2019). Time series analysis aims to reveal reliable and meaningful statistics and use this knowledge to predict future values of the series (Liu et al., 2011; Elevli et al., 2016; He and Tao, 2018; Benvenuto et al., 2020). The ARIMA model was introduced by Box and Jenkins in the 1970s (Box et al., 2015). ARIMA is one of the most widely used time series models because it takes into account changing trends, periodic changes, and random disturbances in the time series. ARIMA is suitable for all kinds of data, including trend, seasonality, and cyclicity. It is also flexible and useful in modelling the temporal dependency structure of a time series. An ARIMA model is generally written as ARIMA (p,d,q), where p is the order of autoregression, d is the degree of differencing, and q is the order of the moving average. Fig. 1 and Fig.
2 confirm that the overall prevalence of COVID-19 used in this study does not show seasonal patterns. However, the ACF plots in Fig. 2 show that the prevalence of COVID-19 is not stationary because the autocorrelations decay very slowly. Therefore, the first-order difference was taken to stabilize the mean of the COVID-19 prevalence. However, even after the first difference, the trends of all series were not eliminated, so the second-order differences were taken. All series became stationary after the second difference, and the parameters of the ARIMA models were then determined according to the ACF and PACF plots (see Appendix). In addition to the developed ARIMA models, different models were also created, and their performances were compared using various statistical tools. All statistical procedures were performed on the transformed COVID-19 data. The ARIMA models with the minimum MAPE values were selected as the best models. Among the tested models, the ARIMA (0,2,1), ARIMA (1,2,0), and ARIMA (0,2,1) models were chosen as the best models for Italy, Spain, and France, respectively. The models fitted the COVID-19 data reasonably well (Fig. 3, Table 3), with minimum MAPE values of 4.752, 5.849, and 5.634, respectively. Table 4 shows the parameter estimates for the best models. The p-values associated with the parameters are less than 0.05, so the terms are significantly different from zero at the 95% confidence level. The fitted and predicted values are presented in Fig. 4. As seen in Table 5, the next 10-day estimate of confirmed cases may be between 196,520-229,147 in Italy, 204,755-257,497 in Spain, and 140,320-159,619 in France. Effective strategies are needed to prevent and control the spread of epidemics. Estimating the epidemiological trend of the prevalence of outbreaks is crucial for the allocation of medical resources, the regulation of production activities, and even the national economic development of countries.
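As an illustration of the second-order differencing and the MAPE-based model comparison described in the results above, the following is a minimal Python sketch. It is not the software used in this study; the function names (`difference`, `mape`, `select_by_mape`), the toy series, and the candidate fitted values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def difference(series: Sequence[float], order: int = 1) -> List[float]:
    """Apply first differencing `order` times (the d in ARIMA(p,d,q))."""
    values = list(series)
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def mape(actual: Sequence[float], fitted: Sequence[float]) -> float:
    """Mean absolute percentage error, expressed in percent."""
    if len(actual) != len(fitted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, fitted))
    return 100.0 * total / len(actual)


def select_by_mape(actual: Sequence[float],
                   candidates: Dict[str, Sequence[float]]) -> str:
    """Return the name of the candidate with the lowest MAPE."""
    return min(candidates, key=lambda name: mape(actual, candidates[name]))


# Toy cumulative series with a quadratic trend (made-up numbers, not WHO data).
toy = [float(t * t + 10) for t in range(1, 9)]
first = difference(toy, order=1)   # still increasing: trend not removed
second = difference(toy, order=2)  # constant: trend removed

# Hypothetical fitted values standing in for two candidate models.
candidates = {
    "model_a": [v * 1.05 for v in toy],  # 5% too high at every point
    "model_b": [v * 0.90 for v in toy],  # 10% too low at every point
}
best = select_by_mape(toy, candidates)
```

On this toy series the first difference still grows while the second difference is constant, which mirrors the reasoning behind d = 2; the candidate with the smaller MAPE is the one that would be retained.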
Thus, it is essential to create a reliable and suitable forecasting model that governments can use as a reference when deciding on emergency macroeconomic strategies and medical resource allocation. Time series analysis is instrumental in developing hypotheses to understand the prevalence trend of various diseases, in forecasting the dynamics of observed phenomena, and then in constructing a quality control system. The ARIMA model is one of the most commonly used time series forecasting methods because of its simplicity, systematic structure, and acceptable forecasting performance. In this study, the current situation of the COVID-19 pandemic in Italy, Spain, and France was presented, and the ongoing trend and extent of the outbreak were estimated by the ARIMA model. To the best of our knowledge, this study is the first to implement ARIMA models to predict the prevalence of COVID-19 in Italy, Spain, and France. There is great concern about whether the health system capacity of European countries can respond effectively to the needs of infected patients who require intensive care during the COVID-19 pandemic. Especially in Italy, the number of patients infected since February 21 closely follows an exponential trend. Although the total number of confirmed cases in Italy is still increasing, the incidence of new confirmed cases is declining, and the government plans to return to normal life gradually. The daily confirmed cases decreased to 2000-4500 over the last ten days. Meanwhile, Spain, Europe's second-worst-hit country with 18,056 deaths, has seen a drop in daily coronavirus deaths in the past five days. However, its total number of confirmed cases has overtaken that of Italy. On the other hand, there is no downward trend in new confirmed cases in France, and it seems that more days are needed to reach the plateau. This pattern will cause intensive care units to be at their maximum capacity. If the virus does not develop new mutations, the number of cases is expected to reach the plateau.
Otherwise, clinical and social problems will become unmanageable and are expected to result in disaster. Forecasting the prevalence of the disease is important for health departments to strengthen surveillance systems and reallocate resources. Time series models play an important role in outbreak analysis and disease prediction. In this study, ARIMA time series models were applied to the overall prevalence of COVID-19 of three European countries most affected by
Table 3. Comparison of tested ARIMA models
Table 4. Parameters of ARIMA models
Table 5. Prediction of total confirmed cases of COVID-19 for the next ten days according to ARIMA models with 95% confidence interval
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Optimization Method for Forecasting Confirmed Cases of COVID-19 in China Data-based analysis, modelling and forecasting of the COVID-19 outbreak Data in brief Application of the ARIMA model on the COVID-2019 epidemic dataset Time Series Analysis: Forecasting and Control Relationship of meteorological factors and human brucellosis in Hebei province Epidemiological features and time-series analysis of influenza incidence in urban and rural areas of Shenyang Using autoregressive integrated moving average (ARIMA) models to predict and monitor the number of beds occupied during a SARS outbreak in a tertiary hospital in Singapore Drinking water quality control: control charts for turbidity and pH Analysis and forecast of COVID-19 spreading in China, Italy and France Forecasting incidence of infectious diarrhea using random forest in Jiangsu Province Reducing demand uncertainty in the platelet supply chain through artificial neural networks and ARIMA models Modelling malaria incidence with environmental
dependency in a locality of Sudanese savannah area, Mali Forecasting model for the incidence of hepatitis A based on artificial neural network International Journal of Infectious Diseases Epidemiology and ARIMA model of positive-rate of influenza viruses among children in Wuhan, China: A nine-year retrospective study Time-series analysis in the medical domain: A study of Tacrolimus administration and influence on kidney graft function Trend and forecasting of the COVID-19 outbreak in China A comparative time series analysis and modeling of aerosols in the contiguous United States and China Forecasting the seasonality and trend of pulmonary tuberculosis in Jiangsu Province of China using advanced statistical time-series analyses Forecasting incidence of hemorrhagic fever with renal syndrome in China using ARIMA model A Simulation Optimization Approach to Epidemic Forecasting Defining epidemics in computer simulation models: How do definitions influence conclusions? The time series seasonal patterns of dengue fever and associated weather variables in Bangkok The development of a combined mathematical model to forecast the incidence of hepatitis E in Shanghai Why is it difficult to accurately predict the COVID-19 epidemic?
Real-time forecasts of the COVID-19 epidemic in China from Forecast of severe fever with thrombocytopenia syndrome incidence with meteorological factors Comparison of ARIMA and GM(1,1) models for prediction of hepatitis B in China Time series modeling of pertussis incidence in China from 2004 to 2018 with a novel wavelet based SARIMA-NAR hybrid model Autoregressive Integrated Moving Average (ARIMA) and Generalized Regression Neural Network (GRNN) in Forecasting Hepatitis Incidence in Heng County Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study Time series analysis of human brucellosis in mainland China by using Elman and Jordan recurrent neural networks Comparison of two hybrid models for forecasting the incidence of hemorrhagic fever with renal syndrome in Jiangsu Province Time series analysis of temporal trends in the pertussis incidence in Mainland China from Time prediction models for echinococcosis based on gray system theory and epidemic dynamics International Journal of Infectious Diseases Estimation of the reproductive number of novel coronavirus (COVID-19) and the probable outbreak size on the Diamond Princess cruise ship : A data-driven analysis Comparative Study of Four Time Series Methods in Forecasting Typhoid Fever Incidence in China Forecast model analysis for the morbidity of tuberculosis in Xinjiang, China SARS: Severe Acute Respiratory Syndrome, HFRS: Hemorrhagic Fever with Renal Syndrome, HPS: Hantavirus Pulmonary Syndrome, SFTS: Severe Fever with Thrombocytopenia Syndrome, ANNs: Artificial Neural Networks, GM (1,1): Grey Model, SARIMA: Seasonal Autoregressive Integrated Moving Average, ETS: Exponential Smoothing, BPNN: Back Propagation Neural Networks, NARNN: Nonlinear Autoregressive Neural Network, RBFNN: Radial Basis Function Neural Networks, GRNN: Generalized Regression Neural Network, ERNN: Elman Recurrent Neural Networks, NBM: Negative 
Binomial Regression Model, GAM: Generalized Additive Model, NAR: Nonlinear Autoregressive Network, JNN: Jordan Neural Networks, RF: Random Forest, MPR: Multivariate Poisson Regression
None. No funding to declare.
Zeynep CEYLAN: Writing - original draft, Writing - review & editing.
Supplementary data related to this article can be found online.

Reference               Disease                 Method(s)
(Guan et al., 2004)     HAV                     ARIMA, ANNs
(Earnest et al., 2005)  SARS                    ARIMA
(Gaudart et al., 2009)  Malaria                 ARIMA
(Liu et al., 2011)      HFRS                    ARIMA
(Zhang et al., 2013)    Typhoid Fever           SARIMA, BPNN, RBFNN, and ERNN
(Ren et al., 2013)      HEV                     ARIMA, BPNN
(Nsoesie et al., 2013)  HPS                     ARIMA
(Zheng et al., 2015)    Tuberculosis            ARIMA
(Wu et al., 2015)       HFRS                    ARIMA, GRNN, and NARNN
(Zeng et al., 2016)     Pertussis               ARIMA, ETS
(Wei et al., 2016)      Hepatitis               ARIMA, GRNN
(Sun et al., 2018)      SFTS                    ARIMA, NBM, and GAM
(Y. wen                 HBV                     ARIMA, GM (1,1)
(Y.                     Pertussis               SARIMA, NAR
(He and Tao, 2018)      Influenza               ARIMA
(Wu et al., 2019)       Human Brucellosis       ARIMA, ERNN, and JNN
                        Pulmonary Tuberculosis  ARIMA, BPNN
                        Influenza               SARIMA
(Fang et al., 2020)     Infectious Diarrhea     ARIMA/X models, RF
(Polwiang, 2020)        Dengue Fever            ARIMA, ANN, and MPR
(Cao et al., 2020)      Brucellosis             ARIMA