key: cord-0751459-h2sg9y46 authors: Ilie, Ovidiu-Dumitru; Cojocariu, Roxana-Oana; Ciobica, Alin; Timofte, Sergiu-Ioan; Mavroudis, Ioannis; Doroftei, Bogdan title: Forecasting the Spreading of COVID-19 across Nine Countries from Europe, Asia, and the American Continents Using the ARIMA Models date: 2020-07-30 journal: Microorganisms DOI: 10.3390/microorganisms8081158 sha: fa81347ad0c665395a488f4a5e8c40487427c374 doc_id: 751459 cord_uid: h2sg9y46
Since mid-November 2019, when the first SARS-CoV-2-infected patient was officially reported, the new coronavirus has infected over 10 million people, of whom half a million have died during this short period. There is an urgent need to monitor, predict, and restrict COVID-19 more efficiently. To this end, Auto-Regressive Integrated Moving Average (ARIMA) models were developed and used to predict the epidemiological trend of COVID-19 in Ukraine, Romania, the Republic of Moldova, Serbia, Bulgaria, Hungary, USA, Brazil, and India, the last three being the most affected countries at present. To increase accuracy, the daily prevalence data of COVID-19 from 10 March 2020 to 10 July 2020 were collected from the official website of the Romanian Government GOV.RO and from the World Health Organization (WHO) and European Centre for Disease Prevention and Control (ECDC) websites. Several ARIMA models were formulated with different ARIMA parameters. ARIMA (1, 1, 0), ARIMA (3, 2, 2), ARIMA (3, 2, 2), ARIMA (3, 1, 1), ARIMA (1, 0, 3), ARIMA (1, 2, 0), ARIMA (1, 1, 0), ARIMA (0, 2, 1), and ARIMA (0, 2, 0) models were chosen as the best models, based on their lowest Mean Absolute Percentage Error (MAPE) values, for Ukraine, Romania, the Republic of Moldova, Serbia, Bulgaria, Hungary, USA, Brazil, and India (4.70244, 1.40016, 2.76751, 2.16733, 2.98154, 2.11239, 3.21569, 4.10596, and 2.78051, respectively).
This study demonstrates that ARIMA models are suitable for making predictions during the current crisis and offers an idea of the epidemiological stage of these regions.
The outbreak of the new coronavirus disease (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has led to a 'global pandemic' due to its unprecedented speed of spreading worldwide. Since patient zero was reported in mid-November, over ten million people from two hundred and sixteen territories have been identified as SARS-CoV-2-infected patients [1].
As seen in Figure 1, the COVID-19 outbreak hit Ukraine harder than the other five neighboring countries during the study period. The first case in Ukraine was reported on 3 March 2020. By comparison, the COVID-19 pandemic started earlier in Romania (26 February) and later in the other four countries (4 March in Hungary, 6 March in Serbia, 7 March in the Republic of Moldova, and 8 March in Bulgaria). In Ukraine, the total number of confirmed cases of COVID-19 reported during the period was 52,043, and the highest number of new cases in a single day was 1366, registered on 6 July. Romania, the second hardest-hit country, had an overall prevalence of 31,381 cases, followed by the Republic of Moldova with 18,666, Serbia with 17,342, Bulgaria with 6672, and Hungary with 4220 cases. Similarly, among the remaining five countries, the highest daily incidence was in Romania, with 614 new cases on 9 July, followed by the Republic of Moldova with 478 on 18 June, Serbia with 445 on 17 April, Bulgaria with 330 on 10 July, and Hungary with 210 on 10 April.
By contrast, the first case in the USA was reported on 20 January, about five weeks earlier than in Romania. Brazil, the second hardest-hit of these three countries, reported its first case on 26 February, and India on 30 January. The overall prevalence for these three countries is as follows: USA with 3,038,325, Brazil with 1,713,160, and India with 793,892 cases.
Concerning the incidence, the highest was, as expected, in the USA with 64,630 new cases on 10 July, followed by Brazil with 54,771 on 21 June, and India with 26,506 on 10 July.
A time series is simply a series of time-dependent data points [27], analyzed to extract reliable and meaningful statistics for the subsequent prediction of the values of the series [28]. Since its introduction by Box and Jenkins approximately half a century ago, ARIMA has come to be used on a much larger scale [26]. In most cases, ARIMA is used because it takes into account trends, periodic changes, and even random disturbances. Thus, ARIMA is suitable for a large spectrum of data, from seasonal to cyclic, and temporal dependency can be modeled in a flexible manner. Non-seasonal ARIMA models are defined by three parameters (p, d, q), where p is the order of autoregression, d is the degree of differencing, and q is the order of the moving average [29]. ARIMA can also be reduced to simpler AR, I, or MA models. AR (p) explains the present value $Y_t$ in terms of its previous values $Y_{t-1}, Y_{t-2}, \dots, Y_{t-p}$ and the current residual $\varepsilon_t$. MA (q) expresses the current value of the time series $Y_t$ in terms of its current and previous residuals $\varepsilon_t, \varepsilon_{t-1}, \varepsilon_{t-2}, \dots, \varepsilon_{t-q}$. The general formulas of AR (p) and MA (q) can be expressed as in Equations (1) and (2), written here with the additive sign convention for the moving average terms:
$$Y_t = \Phi_1 Y_{t-1} + \Phi_2 Y_{t-2} + \dots + \Phi_p Y_{t-p} + \varepsilon_t \quad (1)$$
$$Y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q} \quad (2)$$
where: $\Phi$ and $\theta$ are the parameters of the autoregression and of the moving average, respectively; $t$ is time; $Y_t$ is the observed value at time $t$; $\varepsilon_t$ is the value of the random shock at time $t$; and $p$ and $q$ are the numbers of past values and past shocks included.
In other words, the ARMA (p, q) model expresses the current value linearly in terms of its previous values and residuals. The corresponding formula is given in the equation below:
$$Y_t = \alpha + \Phi_1 Y_{t-1} + \dots + \Phi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$$
where: $\alpha$ is a constant and $\varepsilon_{t-1}$ is the value of the previous random shock.
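To make the roles of p, d, and q concrete, the following minimal Python sketch applies d rounds of differencing (the "I" part) and computes a one-step ARMA (p, q) prediction from past values and past residuals. It assumes the additive sign convention for the moving average terms. The function names `difference` and `arma_one_step` and all numbers are illustrative assumptions, not part of the STATGRAPHICS analysis used in this study.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing (the 'I' part of ARIMA)."""
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def arma_one_step(values, residuals, phi, theta, alpha=0.0):
    """One-step ARMA(p, q) prediction (expected value, so the new shock is 0).

    values    : past observations, most recent last
    residuals : past random shocks, most recent last
    phi       : AR coefficients [phi_1, ..., phi_p]
    theta     : MA coefficients [theta_1, ..., theta_q] (additive convention)
    alpha     : constant term
    """
    ar_part = sum(c * values[-i] for i, c in enumerate(phi, start=1))
    ma_part = sum(c * residuals[-i] for i, c in enumerate(theta, start=1))
    return alpha + ar_part + ma_part


# Hypothetical cumulative case counts, for illustration only.
cumulative = [10, 12, 15, 19, 24, 30]
daily_new = difference(cumulative, d=1)      # first differences
acceleration = difference(cumulative, d=2)   # second differences
```

Differencing a cumulative (prevalence) series once yields the daily new cases, which is why models with d = 1 or d = 2 are natural candidates for steadily growing case counts.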
In the present study, three performance criteria, namely Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), were applied to test the predictive accuracy of the ARIMA models. Mathematically, these three criteria are defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t^2}$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |e_t|$$
$$\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{e_t}{y_t} \right|$$
where: $y_t$ is the value observed at time $t$; $e_t$ is the difference between the observed and the predicted value at time $t$; and $n$ is the number of time points. The lower the RMSE, MAE, and MAPE values, the better the model fits the data. All analyses were performed using STATGRAPHICS Centurion (v.18.1.13) software with a statistical significance level of p < 0.005.
ARIMA modeling is composed of four iterative steps: assessment of the model, estimation of parameters, diagnostic checking, and prediction. The first step is to check whether the time series is stationary, that is, whether its mean, variance, and autocorrelation are constant over time, and whether it is seasonal, since this improves accuracy [30]. In this context, Time Series plots and Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) graphs (Figure 2) were constructed to verify seasonality and stationarity. The ACF determines whether previous values of the series are related to the following ones, while the PACF highlights the degree of correlation between a variable and a lag of that variable [31]. Estimated autocorrelations for the time series of the selected countries are shown in Figure 3. Straight lines represent two-standard-deviation limits, while bars that extend beyond the lines indicate statistically significant autocorrelations. Additionally, a series of ARIMA models were created, and their performances were compared using various statistical tools. All statistical procedures were performed on the transformed COVID-19 data. The ARIMA models with the minimum MAPE values were selected as the best models.
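The two quantitative steps described above, reading significant lags off the autocorrelation function and ranking candidate fits by their error, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the significance limit uses the simple large-sample value of 2/sqrt(n), which may differ from the limits drawn by STATGRAPHICS, and the helper names and the candidate labels "model A" and "model B" are hypothetical.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((y - mean) ** 2 for y in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def significant_lags(series, max_lag):
    """Lags whose autocorrelation falls outside +/- 2/sqrt(n) (assumed limit)."""
    limit = 2.0 / math.sqrt(len(series))
    return [k for k in range(1, max_lag + 1)
            if abs(autocorrelation(series, k)) > limit]


def rmse(observed, fitted):
    """Root Mean Square Error."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(observed, fitted)) / len(observed))


def mae(observed, fitted):
    """Mean Absolute Error."""
    return sum(abs(y - f) for y, f in zip(observed, fitted)) / len(observed)


def mape(observed, fitted):
    """Mean Absolute Percentage Error, in percent (observed values must be non-zero)."""
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(observed, fitted)) / len(observed)


def best_by_mape(observed, candidates):
    """Return the label of the candidate fit with the lowest MAPE."""
    return min(candidates, key=lambda label: mape(observed, candidates[label]))


# Hypothetical observed values and fitted values from two candidate models.
observed = [100, 200, 400]
candidates = {"model A": [110, 190, 400], "model B": [100, 220, 440]}
chosen = best_by_mape(observed, candidates)
```

Because MAPE divides each error by the observed value, it is comparable across countries whose case counts differ by orders of magnitude, which RMSE and MAE are not.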
Among the tested models, the ARIMA (1, 1, 0), ARIMA (3, 2, 2), ARIMA (3, 2, 2), ARIMA (3, 1, 1), ARIMA (1, 0, 3), ARIMA (1, 2, 0), ARIMA (1, 1, 0), ARIMA (0, 2, 1), and ARIMA (0, 2, 0) models were chosen as the best models for Ukraine, Romania, the Republic of Moldova, Serbia, Bulgaria, Hungary, USA, Brazil, and India, respectively. The models fitted to the COVID-19 data are presented in Figure 2 and Tables 2 and 3. Table 3 shows the parameter estimates for the best models. The p-values associated with the parameters are less than 0.005, so the terms are significantly different from zero at the 95.0% confidence level. The fitted and predicted values are presented in Figure 3. As seen in Table 4, the estimated number of confirmed cases over the next 14 days may be 52,816-59,679 in Ukraine, 31,838-38,650 in Romania, 18,836-21,601 in the Republic of Moldova, 17,639-21,313 in Serbia, 6931-10,000 in Bulgaria, 4225-4319 in Hungary, 3.10259 × 10^6 to 3.90611 × 10^6 in the USA, 1.75087 × 10^6 to 2.24113 × 10^6 in Brazil, and 8.20308 × 10^5 to 1.16489 × 10^6 in India.
In the present study, an ARIMA model was selected for each country, in which the forecast for future data is given by a parametric model relating the most recent data value to previous data values and previous noise, or residuals in this context. The output summarizes the statistical significance of the terms in the forecasting model. Terms with p-values less than 0.05 are statistically significantly different from zero at the 95.0% confidence level. The p-value for the AR(x) term is less than 0.05, so it is significantly different from 0. The p-value for the MA(x) term is less than 0.05, so it is significantly different from 0. When the trend is increasing, the model also chooses q in order to obtain linearity or a central trend. The estimated standard deviation of the input white noise depends on the best model selected during the simulations performed.
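Two ingredients of the results above can be illustrated with a short Python sketch: checking whether a parameter estimate differs significantly from zero, and rolling an ARIMA (1, 1, 0) model forward to obtain point forecasts. Both functions are hedged simplifications. The p-value uses a normal approximation rather than the t-distribution a statistics package would typically use, the forecast omits any constant term, and the coefficient and case counts are hypothetical values, not the estimates reported in Table 3 or Table 4.

```python
import math


def two_sided_p_value(estimate, std_error):
    """Normal-approximation p-value for the null hypothesis 'parameter = 0'."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


def forecast_arima_110(series, phi, steps):
    """Point forecasts from ARIMA(1, 1, 0) without a constant term.

    The first difference w_t = y_t - y_(t-1) follows an AR(1) process,
    so each forecast difference is phi times the previous one, and the
    forecast level is the running sum of those differences.
    """
    level = series[-1]
    diff = series[-1] - series[-2]
    forecasts = []
    for _ in range(steps):
        diff = phi * diff
        level = level + diff
        forecasts.append(level)
    return forecasts


# Hypothetical inputs, for illustration only.
p_value = two_sided_p_value(estimate=0.98, std_error=0.5)
next_three = forecast_arima_110([100.0, 110.0], phi=0.5, steps=3)
```

With |phi| < 1 the forecast increments shrink geometrically, so an ARIMA (1, 1, 0) forecast without a constant flattens out, whereas models with d = 2 extrapolate the most recent slope.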
According to the current literature, this is the first study of this kind. The idea of analyzing a cluster of neighboring nations, and the rate of the spread between them, is therefore novel. In addition, this is the first study to address the situation of the most affected nations globally. In the present study, the current situation of the COVID-19 pandemic in Ukraine, Romania, the Republic of Moldova, Serbia, Bulgaria, Hungary, USA, Brazil, and India was presented, and the ongoing trend and extent of the outbreak were estimated by ARIMA models. To the best of our knowledge, this study is the first to implement ARIMA models to predict the prevalence of COVID-19 in such a manner.
The current literature contains limited data regarding the use of ARIMA for predicting the course of COVID-19. Most reports evaluated the situation in western and southern Asia. Reports regarding the status of Europe are scarce for an unknown reason, even though Europe has gradually become the second continent in this respect (Table 5). It should also be mentioned that papers that have been subjected to the peer-review process were excluded.
Effective strategies are now all the more imperative to control the spreading of COVID-19. Thus, estimating epidemiological trends is crucial for the allocation of medical resources and the planning of production activities. Quarantine is among the measures that have proved most effective. Chintalapudi et al. [34] discussed the beneficial impact the lockdown had on transmissibility within the Italian population. A data-driven model analysis demonstrated a decrease of up to 35% in total registered cases, together with an increase of up to 66% in recovered cases, after lockdown and self-isolation. The accuracy of these two parameters was 93.75% and 84.4%, respectively. This downward tendency was confirmed by the results obtained by another group of authors.
The accuracy of six performance metric models has been tested. Long short-term memory (LSTM) was found to be the most accurate during the study, and prospective predictions for the following two weeks were made. A slight decrease in the total number of cumulative cases is thus expected [35]. These observations are strengthened by the results of Papastefanopoulos et al. [40]. Six different time series approaches were also used to test the accuracy of forecasts of the COVID-19 outbreak for the ten most affected countries. Machine learning time series methods were efficiently used to estimate the percentage of the population that will be affected. By using a stochastic modified SEIR (susceptible-exposed-infectious-recovered) model, and given the lack of effective pharmaceutical interventions against SARS-CoV-2, López et al. [41] concluded that social confinement should remain in place for the next two months. Behavior, awareness, and immunity decay are attributed to 99% of the current wave. The gradual incorporation of up to 50% of the daily working proportion should also be considered.
It has recently been shown that Black and South Asian people are more prone to infection, and subsequently death, than the rest of the population. Among the risk factors are age, being male, deprivation, diabetes, asthma, and numerous other medical conditions, according to the analysis of a cohort consisting of 17,278,392 UK individuals [42]. If all these restrictions are not respected, humanity will face a second wave of infections much more severe than the previous one [37], according to the latest statistics reported by WHO. Most certainly, governments' internal politics and capability in managing the current situation will be decisive during this temporary crisis [33,36-38]. Assuming that 20% of the population of each county in the US will be infected, age-specific mortality patterns have shown that some counties will probably be heavily affected.
These findings point to the need for an adequate per capita allocation of medical care resources to outlying communities in order to restrain the spread [43]. Chakraborty et al. [32] revealed that more attention should be paid to people over the age of 65, which is why intensive care and isolation are recommended for them. In addition, they suggest that the lockdown period must be extended, in parallel with preparing medical centers by increasing the number of beds. Furthermore, Demongeot et al. [39] brought a new perspective regarding the important role temperature has in the spreading of COVID-19, reflected by the total number of active cases. Based on time series methods, high temperature appears to directly reduce contagion rates, although this does not rule out a later reappearance driven by seasonal temperature.
Forecasting the prevalence of a disease is crucial for health departments to create optimum conditions and an optimum environment for patients. As presented throughout this manuscript, time series models play an important role in disease prediction. In this study, ARIMA time series models were successfully applied to estimate the overall prevalence of COVID-19 in nine countries, six of them being neighbors, while the other three are the most affected today.
[1] The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
[2] A Novel Coronavirus from Patients with Pneumonia in China
[3] Clinical Characteristics of Coronavirus Disease 2019 in China
[4] Clinical features of patients infected with 2019 novel coronavirus in
[5] Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: A descriptive study
[6] Epidemiological, clinical and virological characteristics of 74 cases of coronavirus-infected disease 2019 (COVID-19) with gastrointestinal symptoms
[7] Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: A retrospective cohort study
[8] Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in
[9] Gastrointestinal symptoms of 95 cases with SARS-CoV-2 infection
[10] Clinical findings in a group of patients infected with the 2019 novel coronavirus (SARS-Cov-2) outside of Wuhan, China: Retrospective case series
[11] Digestive system is a potential route of COVID-19: An analysis of single-cell coexpression pattern of key proteins in viral entry process
[12] COVID-19 in gastroenterology: A clinical perspective
[13] SARS-CoV-2 induced diarrhoea as onset symptom in patient with COVID-19
[14] Diarrhoea may be underestimated: A missing link in 2019 novel coronavirus
[15] Real-time estimation and prediction of mortality caused by COVID-19 with patient information based algorithm
[16] Time-series analysis in the medical domain: A study of Tacrolimus administration and influence on kidney graft function
[17] A Simulation Optimization Approach to Epidemic Forecasting
[18] Defining epidemics in computer simulation models: How do definitions influence conclusions? Epidemics
[19] Potential of environmental models to predict meningitis epidemics in Africa
[20] Forecasting the seasonality and trend of pulmonary tuberculosis in Jiangsu Province of China using advanced statistical time-series analyses
[21] The development of a combined mathematical model to forecast the incidence of hepatitis E in
[22] Comparative Study of Four Time Series Methods in Forecasting Typhoid Fever Incidence in China
[23] Comparison of ARIMA and GM(1,1) models for prediction of hepatitis B in China
[24] Time Prediction Models for Echinococcosis Based on Gray System Theory and Epidemic Dynamics
[25] Relationship of meteorological factors and human brucellosis in Hebei province
[26] Time Series Analysis: Forecasting and Control
[27] Reducing demand uncertainty in the platelet supply chain through artificial neural networks and ARIMA models
[28] Application of the ARIMA model on the COVID-2019 epidemic dataset
[29] A comparative time series analysis and modeling of aerosols in the contiguous United States and China
[30] Drinking water quality control: Control charts for turbidity and pH
[31] Epidemiology and ARIMA model of positive-rate of influenza viruses among children in Wuhan, China: A nine-year retrospective study
[32] Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis
[33] SutteARIMA: Short-term forecasting method, a case: Covid-19 and stock market in Spain
[34] COVID-19 virus outbreak forecasting of registered and recovered cases after sixty day lockdown in Italy: A data driven model approach
[35] Comparative analysis and forecasting of COVID-19 cases in various European countries with ARIMA, NARNN and LSTM approaches
[36] Estimation of COVID-19 prevalence in Italy, Spain, and France
[37] Prediction of the COVID-19 Pandemic for the Top 15 Affected Countries: Advanced Autoregressive Integrated Moving Average (ARIMA) Model. JMIR Public Health
[38] Modeling and Forecasting for the number of cases of the COVID-19 pandemic with the Curve Estimation Models, the Box-Jenkins and Exponential Smoothing Methods
[39] Temperature Decreases Spread Parameters of the New Covid-19 Case Dynamics
[40] COVID-19: A Comparison of Time Series Methods to Forecast Percentage of Active Cases per
[41] The end of social confinement and COVID-19 re-emergence risk
[42] OpenSAFELY: Factors associated with COVID-19 death in 17 million patients
[43] Disease and healthcare burden of COVID-19 in the United States
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Acknowledgments: Not applicable, with the exception of the research grant mentioned above.
The authors declare that they have no conflict of interest.