key: cord-1020113-tydoy64l authors: Hernandez-Matamoros, Andres; Fujita, Hamido; Hayashi, Toshitaka; Perez-Meana, Hector title: Forecasting of COVID19 per regions using ARIMA models and polynomial functions date: 2020-08-06 journal: Appl Soft Comput DOI: 10.1016/j.asoc.2020.106610 sha: fde6c3050c80994d6946709d78e9fbc2f17db435 doc_id: 1020113 cord_uid: tydoy64l COVID-19 is a global threat; for this reason, research around the world has focused on detecting, preventing, curing, and predicting it. Different analyses propose models to predict the evolution of this epidemic, either for specific geographical areas, for specific countries, or as a global model. Such models make it possible to predict the behavior of the virus and could be used to plan future responses. This work presents an analysis of the spread of COVID-19 that views the whole world from a different angle, through 6 geographic regions (continents). We propose to relate countries that lie in the same geographical area in order to predict the advance of the virus. Countries in the same geographic region have variables with similar values (quantifiable and non-quantifiable) that affect the spread of the virus. We propose an algorithm to fit and evaluate the ARIMA model for 145 countries, which are distributed into 6 regions. Then, we construct a model for these regions using the ARIMA parameters, the population per 1M people, the number of cases, and polynomial functions. The proposal is able to predict the COVID-19 cases with an average RMSE of 144.81. The main outcome of this paper is a relation between COVID-19 behavior and population in a region; these results open the opportunity to create more models that predict COVID-19 behavior using variables such as humidity, climate, and culture, among others.
In December 2019, the COVID-19 pandemic, commonly known as Coronavirus, started in Wuhan, China, and it has caused havoc around the world. The World Health Organization reported on June 7 [18] that the virus is present in 216 countries, with 6 750 521 active cases and 395 779 deaths. For this reason, scientists around the world have focused on topics such as detecting it [2], preventing it [13], curing it [3], and predicting it [4, 5, 7, 8, 9, 12, 15, 16, 19]. Different schemes have been applied to predict the coronavirus. For example, [15] proposes an approach based on Composite Monte Carlo enhanced by deep learning and fuzzy rule induction to predict COVID-19, and [20] details models for forecasting the course of the pandemic that demonstrate the utility of parsimonious models for early-time data. Using official data, [21] studied the spread of COVID-19 and carried out forward prediction and backward
Figure 1. Example of the available information on "Our World in Data".
The proposed approach consists of two stages, "Building the model" and "Evaluating the model". These stages are applied 6 times, once per region. We use the time series "Total confirmed COVID-19 cases". The first stage, "Building the model", requires the time series of each country, which starts on the day when the country reported its first case of COVID-19 and ends on April 25. The second stage, "Evaluating the model", requires the COVID-19 information on May 28. Then the forecast between May 12 and May 28 is calculated and compared with the real values. In the following subsections, the proposed approach is explained both in general terms and through an example. The example calculates the ARIMA "p, D, q" values for Canada and builds the North America model. We use the time series "Total confirmed COVID-19 cases" per country.
Then, we have a time series presented in the following equations: In equation 1, y is the total number of confirmed cases per day in a country. In equation 2, T_1 is the day on which the country detected its first COVID-19 patient, and D_{1+n} represents the number of days elapsed until April 25. For example, Canada detected its first patient on January 26, so equation 2 can be rewritten as shown in equation 3. Figure 3 shows the time series of Canada.
To compute the best ARIMA parameters, the time series is separated into a training time series and a testing time series. The training time series is created using 90% of the data from the original time series, and the test time series is created using the remaining 10%. Figure 4 shows the example using the time series of Canada. In equation 4, the value k is calculated; this value is the threshold that separates the data into Train and Test. Equations 5 and 7 present the Train and Test time series, respectively.
Then, Algorithm 1 is applied to calculate the best parameters of an autoregressive integrated moving average (ARIMA) model. ARIMA is a statistical analysis model that uses time series data. It predicts future values by examining the differences between values in the time series. An ARIMA model consists of 3 components: autoregression (AR), integrated (I), and moving average (MA). Each component is a parameter. To represent these parameters, ARIMA models use the standard notation p, D, and q, which indicates the type of ARIMA model used: p is the number of lag observations, D is the degree of differencing, and q is the order of the moving average; for further details refer to [10, 20]. In Algorithm 1, the Root Mean Square Error (RMSE) [11] measures the discrepancy between the original data and the forecast data; the RMSE is calculated using equation 9. This process is applied to each country.
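The split, the RMSE, and the exhaustive search over candidate (p, D, q) orders described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the function names, the rounding used for the threshold k, and the candidate grids are hypothetical, and the model fit is delegated to a caller-supplied `fit_forecast` function. To keep the sketch runnable with the standard library only, it ships a stand-in forecaster that is explicitly not ARIMA; a real ARIMA fitter with the same signature would take its place.

```python
from itertools import product
from math import sqrt


def split_series(y, train_fraction=0.9):
    """Split a series into a leading 90% (Train) and a trailing 10% (Test)."""
    k = int(len(y) * train_fraction)  # assumed form of the threshold k
    return y[:k], y[k:]


def rmse(actual, forecast):
    """Root mean square error between two equally long sequences."""
    squared = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    return sqrt(squared / len(actual))


def grid_search(y, fit_forecast, p_values, d_values, q_values):
    """Return (best_order, best_rmse) over every candidate (p, D, q)."""
    train, test = split_series(y)
    best_order, best_error = None, float("inf")
    for order in product(p_values, d_values, q_values):
        try:
            forecast = fit_forecast(train, order, len(test))
        except Exception:
            continue  # skip orders that the fitter cannot handle
        error = rmse(test, forecast)
        if error < best_error:
            best_order, best_error = order, error
    return best_order, best_error


def stand_in_forecast(train, order, steps):
    """NOT ARIMA (placeholder): difference the series D times, forecast the
    differenced series by the mean of its last p values, integrate back.
    The q component is ignored."""
    p, d, _q = order
    levels = [list(train)]
    for _ in range(d):
        prev = levels[-1]
        levels.append([b - a for a, b in zip(prev, prev[1:])])
    window = levels[-1][-max(p, 1):]
    forecast = [sum(window) / len(window)] * steps
    for level in reversed(levels[:-1]):  # undo each differencing step
        last, integrated = level[-1], []
        for increment in forecast:
            last += increment
            integrated.append(last)
        forecast = integrated
    return forecast


# Synthetic, strictly linear series (not real case counts), used only as a demo.
series = [10 + 3 * day for day in range(50)]
best_order, best_error = grid_search(
    series, stand_in_forecast, [1, 2], [0, 1, 2], [0, 1]
)
print(best_order, best_error)
```

Passing the fitter as an argument keeps the search loop independent of any particular ARIMA implementation, so the same loop can be repeated per country as the text describes.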
Annex "A" presents the p, D, q values of the ARIMA model and the RMSE for each analyzed country.
The ARIMA stage calculated the p, D, q values for each country. In this stage, we use these values, the number of confirmed COVID-19 cases per million people on April 25, and the population per million people (ppMP). Table 1 shows an example of the values obtained in the previous stage, which are used in this stage.
Ck (14)
In our example k = 1, 2, 3, …, 13. Then we need to apply Algorithm 2 on
To solve equation 15, we present the problem as shown in equation 16. Then, t is used to form the Vandermonde matrix [21] V with n+1 columns and m rows, where m is the length of d. After solving equation 16, we find the values of P_1, P_2, …, P_{n+1}. To calculate the best values of P_n, we propose Algorithm 3. The Figure 7 shows
We apply Algorithm 4 using the values of Canada on May 11 and the functions P_Nap(t), P_Nad(t), and P_Naq(t) (equations 19, 20, and 21). Canada belongs to North America, so we use the functions of North America to calculate the ARIMA parameters. To make the calculation for another country, we must use the functions that belong to that country's region.
This section presents the results for each analyzed region. Table 2 shows the average RMSE per region. In the table, the RMSE is calculated between the forecast and the real values. Figure 8
Annex "A" presents the results per country before the geographic models are created. These results belong to each country in the different regions. As mentioned in section 3.1.1, the time series are separated into modeling (90% of the signal) and testing (10% of the signal). Below, we discuss each region in particular. The North America region has 13 countries; this region presents an average RMSE of 640.61, the largest among the regions. Table 3 shows a comparison between [12] and this work before the geographic models are created.
As Table 3 shows, this approach has a better RMSE when forecasting the virus in Italy, whereas [12] has a better RMSE when predicting the virus in Turkey and Spain. At first, it seems that their proposal is better than ours, but when the RMSE averages are compared, we can see that our proposal has a lower RMSE than theirs; besides, we analyze 145 countries while they analyze only 3.
Table 3. Comparison between [12] and this work.
When the geographic models are created, they are used to predict new cases in a country. The results are shown in Table 2. The forecast is made 17 days after the models are calculated; we made this decision to have a real difference between the cases on April 25 and May 11, as shown in the tables in Annex "A". As expected, the RMSE grew because the prediction is made 17 days after the models were created and we calculate 15 days of predicted cases. In this time interval, actions taken by governments, such as quarantine control, stay-at-home campaigns, and social distancing, significantly affect the prediction. If the reader wants current predictions, the information needs to be updated and the "Building the model" stage repeated.
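Repeating the "Building the model" stage means, among other things, re-solving the least-squares problem in which t forms a Vandermonde matrix V with n+1 columns and m rows and the coefficients P_1, …, P_{n+1} are fitted to d. The sketch below illustrates that fit for a fixed polynomial degree. It is an assumption-laden illustration rather than the paper's Algorithm 2 or 3: the function names are hypothetical, the coefficient ordering (highest power first) is assumed, the system is solved through the normal equations for brevity, and the data are synthetic.

```python
def vandermonde(t, degree):
    """Rows [t_i**degree, ..., t_i, 1]: m rows and degree + 1 columns."""
    return [[ti ** (degree - j) for j in range(degree + 1)] for ti in t]


def solve_linear(a, b):
    """Solve a square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_polynomial(t, d, degree):
    """Least-squares coefficients, highest power first (assumed ordering),
    obtained from the normal equations (V^T V) P = V^T d."""
    v = vandermonde(t, degree)
    cols = range(degree + 1)
    vtv = [[sum(row[i] * row[j] for row in v) for j in cols] for i in cols]
    vtd = [sum(row[i] * di for row, di in zip(v, d)) for i in cols]
    return solve_linear(vtv, vtd)


def evaluate(coefficients, t):
    """Evaluate the fitted polynomial at t with Horner's rule."""
    result = 0.0
    for c in coefficients:
        result = result * t + c
    return result


# Synthetic data generated from 2*t**2 - 3*t + 1 (not values from the paper).
t_values = [0, 1, 2, 3, 4]
d_values = [2 * x * x - 3 * x + 1 for x in t_values]
coefficients = fit_polynomial(t_values, d_values, degree=2)
print(coefficients, evaluate(coefficients, 5))
```

The normal equations keep the example short but are less numerically robust than an orthogonal factorization; for high degrees or poorly scaled t, a library least-squares routine would be the safer choice.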
We can conclude that the algorithm to model and evaluate the ARIMA models is able to
Coronavirus Pandemic (COVID-19)
Automatic Detection of Coronavirus Disease (COVID-19) Using X-ray Images and Deep Convolutional Neural Networks
Quotient Sciences and CytoAgents Accelerate Potential Treatment for COVID-19 Cytokine Storm
Application of the ARIMA model on the COVID-2019 epidemic dataset
Finding an Accurate Early Forecasting Model from Small Dataset: A Case of 2019-nCoV Novel Coronavirus Outbreak
An ARIMA model to forecast the spread and the final size of COVID-2019 epidemic in Italy
Jiao Fan Brief Analysis of the ARIMA model on the COVID-19 in Italy
Coronavirus (COVID-19): ARIMA based time-series analysis to forecast near future
Forecasting of demand using ARIMA model
An empirical evaluation of similarity measures for time series classification
Forecasting of COVID-19 Cases and Deaths Using ARIMA Models
Asymptomatic novel coronavirus pneumonia patient outside Wuhan: The value of CT images in the course of the disease
The use of ARIMA models for reliability forecasting and analysis
Composite Monte Carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction
Prediction of the COVID-19 Pandemic for the Top 15 Affected Countries: Advanced Autoregressive Integrated Moving Average (ARIMA) Model. JMIR Public Health and Surveillance
United Nations, Department of Economic and Social Affairs
Coronavirus disease (COVID-19) outbreak situation
ARIMA modelling and forecasting of irregularly patterned COVID-19 outbreaks using Japanese and South Korean data
The challenges of forecasting the spread of COVID-19
Propagation analysis and prediction of the COVID-19
Transmission potential of the novel coronavirus (COVID-19) onboard the Diamond Princess Cruises Ship
#JP20K11955.