key: cord-0733288-um5vnnlg authors: Toğa, Gülhan; Atalay, Berrin; Toksari, M. Duran title: COVID-19 Prevalence Forecasting using Autoregressive Integrated Moving Average (ARIMA) and Artificial Neural Networks (ANN): Case of Turkey date: 2021-05-05 journal: J Infect Public Health DOI: 10.1016/j.jiph.2021.04.015 sha: 1a28681c80c2a07b2b1e5dd41b52341d7e2538b2 doc_id: 733288 cord_uid: um5vnnlg
A local outbreak of unknown pneumonia was detected in Wuhan (Hubei, China) in December 2019. It was determined to be caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and the disease was called COVID-19 by scientists. The outbreak has since spread all over the world, with a total of 120,815,512 cases and 2,673,308 deaths as of 16 March 2021. Health systems collapsed in many countries due to the pandemic, and social life was negatively affected in many countries. In such situations, it is very important to predict the load that will occur in the health system of a country. In this study, the COVID-19 prevalence of Turkey is inspected. The infected cases, the number of deaths, and the recovered cases in Turkey are predicted with Autoregressive Integrated Moving Average (ARIMA) and Artificial Neural Networks (ANN). The techniques are compared in terms of correlation coefficient and mean square error (MSE). The results showed that the techniques used are very successful in the estimation of prevalence in Turkey.
The number of people infected with the coronavirus was announced as 2,911,642, while 29,623 people had died, as of March 16, 2021, in Turkey. 2,734,862 patients had recovered and gained immunity as of 16 March. As the COVID-19 pandemic spreads rapidly, the literature is expanding very quickly with the studies of scientists, and COVID-19 studies have received much attention. There are many studies about the medical aspect of COVID-19 (Bai et al., 2020; Mehta et al., 2020; Rothan & Byrareddy, 2020).
On the other hand, sociological (Lancker & Parolin, 2020), economic (Fernandes, 2020), and statistical (Roser et al., 2020) inspections of the pandemic have been made by researchers in many studies. Statistical studies generally focus on country-based forecasting of the pandemic. Besides many other countries (Al-qaness et al., 2020; Ceylan, 2020; Moftakhar et al., 2020; Perone, 2020), Turkey has also been investigated in some studies (Arslan et al., 2020; Aslan et al., 2020; Özdinç et al., 2020). Statistical and forecasting studies have used different techniques such as time-series analyses, data mining techniques, growth models, nonlinear regression analysis, epidemiological models, and artificial intelligence (AI) techniques. One of the most effective time-series-based methods in COVID-19 forecasting studies is ARIMA. Many country-based applications in the COVID-19 literature used ARIMA (Al-qaness et al., 2020; Benvenuto et al., 2020; Ceylan, 2020; Chakraborty & Ghosh, 2020; Dehesh et al., 2020; Ding et al., 2020; Gupta & Pal, 2020; Perone, 2020). Furthermore, ANN has also been used for predicting the prevalence of COVID-19 in many studies and has been reported as a successful tool for prevalence prediction (Distante et al., 2020; Ghazaly et al., 2020; Hasan, 2020; Moftakhar et al., 2020; Tamang et al., 2020). In this paper, we inspect the dynamics of COVID-19 prevalence in Turkey. Infected cases, number of deaths, and recovered cases are handled, and prediction models are built by using ARIMA and ANN. This paper is organized as follows: The first section gives a brief review of COVID-19 and of the literature on forecasting the prevalence of the COVID-19 pandemic. The second section examines the materials and methods used for the prediction of the prevalence of the pandemic. Results and discussions are given in the third section. Our conclusions are drawn in the final section.
This section presents two approaches, ARIMA and ANN, for COVID-19 prevalence forecasting of Turkey.
assumptions of the Box-Jenkins method which is discrete and stationary (Box et al., 2015). It is also a very effective tool in estimating time-series data. The ARIMA method combines AR (autoregressive) and MA (moving average) components to analyze data. ARIMA models are used for stationary time series. Stabilization of the data is carried out by taking differences in the I (Integrated, $d$) process. If the degree of the autoregression parameter is $p$, the degree of the difference parameter is $d$, and the degree of the moving average parameter is $q$, the model is called an autoregressive integrated moving average model of order $(p, d, q)$ and is written as ARIMA $(p, d, q)$ (Box et al., 2015). The general expression of an ARIMA $(p, d, q)$ model, in standard Box-Jenkins notation, is as follows:
$$w_t = \phi_1 w_{t-1} + \dots + \phi_p w_{t-p} + a_t - \theta_1 a_{t-1} - \dots - \theta_q a_{t-q}$$
where $w_t = (1 - B)^d y_t$ is the $d$-times differenced series, $y_t$ is the observed series, $B$ is the backshift operator ($B y_t = y_{t-1}$), $a_t$ is the white-noise error term, and $\phi_i$ and $\theta_j$ are the autoregressive and moving average coefficients. If the first differences ($d = 1$) make the series stationary, the difference operator will be as follows:
$$w_t = (1 - B) y_t = y_t - y_{t-1}$$
The number of parameters to be calculated in the general ARIMA $(p, d, q)$ model, used in the future estimation of series that do not show seasonal fluctuations, is the same as in ARMA $(p, q)$. In the ARIMA $(p, d, q)$ model, $p$ or $q$ can be zero. In this case, the model is reduced to either the AR $(p, d)$ or MA $(d, q)$ model types.
ANN is one of the highly effective and successful data mining techniques in the literature. ANN is an information processing method inspired by the human brain. A brain learns from experience, and ANN mimics the brain while processing the data. It is classified as supervised or unsupervised learning according to whether the output variable values are known. Generally, ANN consists of some basic elements: input, hidden, and output layers. The input layer is the information provider of the network. The hidden layer constructs the nonlinear relations between input(s) and output(s) by adjusting weights, and this step is called learning. Layers consist of different numbers of neurons, and these neurons process the data via activation functions. Finally, the output layer gives the forecasting information.
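The layered structure described above (input layer, hidden layer with activation functions, output layer) can be illustrated with a minimal forward-pass sketch. The layer sizes, the random weights, the choice of hyperbolic tangent and logistic activations, and the function name `mlp_forward` are assumptions made here for illustration only; the sketch shows how data flow through such a network, not a trained model.

```python
import numpy as np

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer perceptron (illustrative sketch)."""
    # Hidden layer: weighted sum of inputs followed by a tanh activation.
    hidden = np.tanh(x @ w_hidden + b_hidden)
    # Output layer: weighted sum of hidden activations followed by a logistic activation.
    z = hidden @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes and random weights, chosen only for the example.
rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 5, 10, 3
w_hidden = rng.normal(scale=0.5, size=(n_inputs, n_hidden))
b_hidden = np.zeros(n_hidden)
w_out = rng.normal(scale=0.5, size=(n_hidden, n_outputs))
b_out = np.zeros(n_outputs)

x = rng.normal(size=(4, n_inputs))  # four synthetic input rows
y_hat = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
print(y_hat.shape)
```

Because the logistic function maps to the interval (0, 1), targets would normally be scaled to that range before training a network of this shape.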
It is proper to use ANN if there is no theoretical information about the functional form of the model or the nonlinear structure of the model. This means that it is not a model-based technique but a data-based technique. The general architecture of ANN is given in Figure 1. Data are generally split into categories for training, testing, and validation purposes. In the training step, a network learns from the data. Stopping the training process is achieved by the validation step, while the prediction ability of a trained ANN is judged in the testing step (Yaghini et al., 2013). A special type of feedforward neural network called multilayer perceptron (MLP) is used in this study. The MLP is the most widely used ANN model and generally contains one input layer, one output layer, and one or more hidden layers (Basheer and Hajmeer, 2000). Different training algorithms are used for MLP neural networks. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is one of the training algorithms commonly used for nonlinear least squares, and a modified backpropagation algorithm combined with the BFGS algorithm has been presented (Nawi et al., 2006). BFGS is preferred in this study.
The daily confirmed COVID-19 case data from March 31, 2020 to March 16, 2021 are retrieved from the website of the Turkish Ministry of Health. Other data are gathered from the website of the Turkish Statistical Institute. Days for which the government did not announce daily data are neglected. Only the patients confirmed as positive by laboratory tests are considered as infected cases by the Turkish government, and we use these data for the analyses. In this study, no primary data collection is undertaken, and no patient or member of the public was involved. Therefore, no formal ethical assessment or informed consent is required. All anonymized data are collected from the official websites.
In our study, the ARIMA method is used for the prediction of the daily number of infected cases, the daily number of deaths, and the number of recovered cases. A single ARIMA model cannot be generated for multiple outcomes. Therefore, we structured four different ARIMA models by using Minitab 17.3.1 software. As the first step of the ARIMA, the stationarity of the time series is checked, and it is seen in Figure 2 that our data are not stationary. Deciding on the values of $p$, $d$, and $q$ is the crucial point in the ARIMA model, and these parameters affect the performance of the ARIMA model. In this study, ARIMA models are created with different combinations of $p$, $d$, and $q$ values, and their performances are compared. The autocorrelation function (ACF) and Partial Autocorrelation Coefficient Function (PACF) graphs are plotted in Figure 5 and Figure 6 to choose the best performing ARIMA model. According to the ACF and PACF graphs, there is a high autocorrelation in the data. The ARIMA (1, 1, 0) model gives the best forecasting results for the daily number of infected cases. The p-value is obtained as 0.000 for the daily number of infected cases at the 5% significance level. The same procedures are applied to forecasting the number of daily deaths and the number of daily recovered cases. Table 1 shows the Pearson Correlation (R) value, Sum of Square Error (SSE), MSE, and p-values obtained for all estimation parameters. As can be seen from Table 1, the highest correlation value was obtained for the daily number of recovered cases. According to the SSE and MSE values, the minimum errors are obtained in estimating the daily number of deaths.
We have a sample size of 285 days. The sample size is directly related to the generalization ability of ANN. ANN generally converges at local minima with small sample sizes and yields poor generalization (Mao et al., 2006 for the days applied and 0 for the other days.
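The comparison criteria named in this section (Pearson correlation R, SSE, and MSE) can be computed from paired actual and predicted series, as in the following minimal sketch. The function name `forecast_metrics` and the sample numbers are hypothetical, and MSE is taken here as SSE divided by the number of observations; statistical packages may instead divide by the residual degrees of freedom, so values need not match software output exactly.

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """Return Pearson R, SSE, and MSE for paired actual/predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted
    sse = float(np.sum(residuals ** 2))              # sum of squared errors
    mse = sse / actual.size                          # assumption: divide by n
    r = float(np.corrcoef(actual, predicted)[0, 1])  # Pearson correlation
    return {"R": r, "SSE": sse, "MSE": mse}

# Made-up numbers for illustration; not data from the study.
actual = [10.0, 12.0, 15.0, 19.0, 24.0]
predicted = [11.0, 12.0, 14.0, 20.0, 23.0]
metrics = forecast_metrics(actual, predicted)
print(metrics)
```

A high R with a large SSE is possible when predictions follow the shape of the series but are offset in level, which is why the correlation and the error measures are reported together.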
One of the most important issues for forecasting accuracy is that cases are confirmed and reported only after the laboratory tests give positive results. Therefore, this is a critical parameter for our model. On the other hand, the absence of multicollinearity among independent variables is an important assumption of regression-based approaches. To check this assumption, we analyze the independent data to detect multicollinearity. The most common approach to detect multicollinearity is the variance inflation factor (VIF). According to the rule-of-thumb thresholds of 4 or 10, multicollinearity among independent variables can be a possible or a serious problem, respectively (O'Brien, 2007). VIF values for our independent variables range between 1 and 2; therefore, multicollinearity among the independent variables is not discussed further.
ANN analyses are carried out using the data mining module of the STATISTICA 10.0 software package. The best network architecture is given in Table 2, and the SSE values of the training, testing, and validation steps are given in Table 3. Pearson correlation coefficients are given in Table 4. As seen in Table 4, the daily number of deaths, the daily number of recovered cases, and the daily number of infected cases have high correlation coefficient values, which indicates that the developed model has an acceptable generalization capability and accuracy to predict the prevalence of the COVID-19 pandemic in Turkey. As depicted in Table 2, the best network is a multilayer perceptron network consisting of five input neurons (curfews are considered as 2 different neurons because of the categorical structure of the curfew data), a hidden layer with ten neurons, and three output neurons. The training algorithm is selected as BFGS. The activation functions are selected as hyperbolic tangent and logistic functions for the hidden layer and output layer, respectively.
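The VIF screening mentioned in this section can be sketched as follows: each independent variable is regressed on the remaining ones, and VIF = 1 / (1 - R²) of that regression. The function name `variance_inflation_factors` and the synthetic data are assumptions for illustration; the study's own inputs are not reproduced here.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column of X, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(np.sum((y - y.mean()) ** 2))
        vifs.append(ss_tot / ss_res)  # equals 1 / (1 - R^2)
    return np.array(vifs)

# Synthetic example: three independent columns, then a near-copy of the first.
rng = np.random.default_rng(0)
independent = rng.normal(size=(200, 3))
vif_independent = variance_inflation_factors(independent)

near_copy = independent[:, 0] + 0.1 * rng.normal(size=200)
collinear = np.column_stack([independent, near_copy])
vif_collinear = variance_inflation_factors(collinear)

print(vif_independent)
print(vif_collinear)
```

In this synthetic case the independent columns give VIF values close to 1, while the near-duplicate column pushes the VIF of both involved columns far above the thresholds of 4 and 10.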
Furthermore, the selected network accurately predicts the daily number of infected cases, the daily number of deaths, and the daily number of recovered cases, as seen in Table 3 and Table 4. High correlation coefficients may raise the suspicion of a linear relation between the data or of poor generalization ability of the developed network. However, as can be seen from Figures 7, 8, and 9, the developed model is very successful in nonlinear estimation because the predicted and actual output curves overlap in the graphs. Figures 7-9 give the time series predictions for the three outputs.
The effect of the COVID-19 outbreak is growing steadily in the whole world. It becomes very important to forecast the prevalence of the pandemic for the health systems of countries. Accurate forecasting will provide insight into strengthening health systems and reallocating resources. In this manner, reliable prediction of the COVID-19 pandemic enables rapid responses, event-based political decisions, and prediction of the future of the pandemic. Thereby, deaths and health-system-caused failures can be minimized. For more precise estimation, data should be updated in real time, and new parameters that will affect the prevalence of the pandemic should be taken into account. Vaccination started on 14 January 2021 in Turkey, and it is considered that the prevalence of the pandemic will be affected by vaccination. Therefore, including the vaccination data announced by the authorities will be very effective for predicting the prevalence of the pandemic in Turkey in future works.
The daily number of infected cases is checked on the ACF graph in Figure 3. In the ACF graph, it is seen that the daily number of infected cases has a serial trend. Therefore, the data are preprocessed by differencing. Stationarity is achieved by the differenced time series, as seen in Figure 4.
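The effect of differencing on the ACF described above can be illustrated with a short sketch. The series below is synthetic (a random walk with drift, generated only for the example), and the helper `acf` is a plain sample-autocorrelation routine written here for illustration; it is not the Minitab procedure used in the study.

```python
import numpy as np

def acf(series, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    values = [1.0]
    for k in range(1, max_lag + 1):
        values.append(np.sum(x[k:] * x[:-k]) / denom)
    return np.array(values)

# Synthetic trending series: cumulative sum of drift plus noise (assumed data).
rng = np.random.default_rng(1)
steps = 5.0 + rng.normal(scale=20.0, size=200)
series = np.cumsum(steps)

# First difference (d = 1): w_t = y_t - y_{t-1}.
diffed = np.diff(series)

acf_raw = acf(series, 10)
acf_diff = acf(diffed, 10)
print(np.round(acf_raw[1:4], 2))
print(np.round(acf_diff[1:4], 2))
```

For a trending series the autocorrelations decay slowly from values near 1, whereas after first differencing they fall close to zero, which is the pattern that signals the differenced series can be treated as stationary.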
Optimization Method for Forecasting Confirmed Cases of COVID-19 in China
Nowcasting and Forecasting the Spread of COVID-19 and Healthcare Demand In Turkey, A Modelling Study
Modeling COVID-19: Forecasting and analyzing the dynamics of the outbreak in Hubei and Turkey
Presumed Asymptomatic Carrier Transmission of COVID-19
Application of the ARIMA model on the COVID-2019 epidemic dataset
Time Series Analysis: Forecasting and Control
Estimation of COVID-19 prevalence in Italy
Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis
Forecasting of COVID-19 Confirmed Cases in Different Countries with ARIMA Models
Brief Analysis of the ARIMA model on the COVID-19 in Italy
Forecasting Covid-19 Outbreak Progression in Italian Regions: A model based on neural network training from Chinese data
COVID-19, school closures, and child poverty: A social crisis in the making
A New Method to Assist Small Data Set Neural Network Learning
COVID-19: Consider cytokine storm syndromes and immunosuppression
Exponentially Increasing Trend of Infected Patients with COVID-19 in Iran: A Comparison of Neural Network and ARIMA Forecasting Models
An Improved Learning Algorithm Based on The Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method For Back Propagation Neural Networks
A Caution Regarding Rules of Thumb for Variance Inflation Factors
Predicting the Progress of COVID-19: The Case for Turkey
An ARIMA model to forecast the spread and the final size of COVID-2019 epidemic in Italy
Coronavirus Disease (COVID-19) - Statistics and Research
The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
Forecasting of Covid-19 cases based on prediction using artificial neural network curve fitting technique
Vitro Diagnostic Assays for COVID-19: Recent Advances and Emerging Trends. Diagnostics
A hybrid algorithm for artificial neural network training
The authors declare that there is no conflict of interest.