key: cord-0773182-taqszi7y
authors: Ballı, Serkan
title: Data Analysis of Covid-19 Pandemic and Short-Term Cumulative Case Forecasting Using Machine Learning Time Series Models
date: 2020-11-28
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2020.110512
sha: 8fe8f51dd75dad665370dc9d701eb2af8cd536e6
doc_id: 773182
cord_uid: taqszi7y

The Covid-19 pandemic is the most important health disaster to have affected the world over the past eight months. There is no clear date yet for when it will end. By now, more than 31 million people have been infected worldwide, and predicting the Covid-19 trend has become a challenging issue. In this study, COVID-19 data between 20/01/2020 and 18/09/2020 for USA, Germany and Global were obtained from the World Health Organization. The dataset consists of weekly confirmed cases and weekly cumulative confirmed cases for 35 weeks. The distribution of the most up-to-date weekly case data was examined and the parameters of the fitted statistical distributions were obtained. Furthermore, a time series prediction model using machine learning was proposed to obtain the curve of the disease and forecast the epidemic tendency. Linear regression, multi-layer perceptron, random forest and support vector machine (SVM) methods were used. The performances of the methods were compared according to the RMSE, APE and MAPE metrics, and SVM achieved the best trend. According to the estimates, the global pandemic will peak at the end of January 2021 and approximately 80 million people will be cumulatively infected.

The COVID-19 disease, which appeared after December 2019, spread all over the world after February 2020. The virus passed from animals to humans and is transmitted from person to person via airborne droplets [8]. In a short time, it became the biggest epidemic the world has seen in the last century.
According to data from the World Health Organization (WHO), the number of cases worldwide is increasing rapidly. Despite the measures taken, the virus has not yet been stopped because it is highly infectious.

Since COVID-19 first emerged, various trend analysis studies have been conducted. Fanelli and Piazza [6] analyzed the temporal dynamics of the coronavirus disease 2019 outbreak in China, Italy and France. Yesilkanat [20] estimated the near-future case numbers for 190 countries using the random forest algorithm. Sahin and Sahin [12] estimated the cumulative cases of COVID-19 using a fractional nonlinear grey Bernoulli model. Yadav et al. [18] analyzed the spread of COVID-19 using machine learning methods. Kaxiras et al. [9] used a Susceptible-Infectious-Removed (SIR) population model to describe the COVID-19 pandemic. Wang et al. [15] studied the prediction of Covid-19 with a logistic model and machine learning techniques. Wieczorek et al. [17] presented a neural network powered COVID-19 spread forecasting model. Das [4] estimated incidences of COVID-19 using the Box-Jenkins method for the period July 12-September 11, 2020. Shastri et al. [13] performed time series forecasting of Covid-19 using deep learning models for India and USA. Feroze [7] forecasted the patterns of COVID-19 using Bayesian structural time series models.

In this study, unlike previous works, the distribution of the weekly Covid-19 case increase was examined: the largest extreme value distribution was found for Global and Germany, and the smallest extreme value distribution was found for USA. Afterwards, weekly cumulative cases for Global, Germany and USA were predicted with linear regression, multi-layer perceptron, random forest and SVM machine learning time series methods. The performances of the methods were compared according to the RMSE, APE and MAPE metrics, and SVM was found to be the best-fitting method for forecasting Covid-19 data.
Then short-term cumulative case forecasting was applied using all methods for Global, Germany and USA.

The paper is organized as follows. Section two introduces the machine learning time series models. Section three gives a detailed analysis of the dataset. Section four presents the evaluation metrics, results and discussion. Finally, section five summarizes the conclusions of the study.

There are numerous models in the literature for modeling time series, such as Auto-Regressive Integrated Moving Average (ARIMA) and Fourier transforms. These are univariate because of the nature of time series data. Using a single variable can be ineffective for understanding the time series, so it may be necessary to convert the data to a multivariate form. Machine learning can be used for this purpose [1]. Machine learning time series models take the time parameter into account and evaluate other inputs based on time. The time feature is divided into sub-components such as daily, weekly, monthly, quarterly, days of the week, weekend, weekdays, N-period lagged date, minimum, maximum, average, powers of time, products of time and lagged variables. Hidden patterns in the time series can be captured with these components. As in general machine learning methods, the time series data is divided into training and test data. The behavior of the data is learned from the training data to build a general model, and this model is then evaluated on the test data. Machine learning time series models can yield successful results on nonlinear data. The machine learning methods used in this study are discussed in the following subsections.

Random Forest (RF) is a popular supervised learning technique employed for regression and classification [3]. It is an ensemble learning technique in which each base learner is a decision tree [19]. With this method, N decision trees produce N outputs.
These outputs are combined into the final estimate by voting (for classification) or averaging (for regression). RF is a simple method and is easy to run in parallel [2].

Linear regression is the most basic and simple approach for finding the relationship between numerical variables. In this method, the trend of the data is found and estimation is performed accordingly. However, all independent variables must be defined [5].

Artificial neural networks (ANN) work by imitating the learning ability of the human brain. They give better results for longer-term predictions than statistical methods and can also model nonlinear data. However, because of their black-box nature, it is not known what they do internally or how [5]. A multilayer perceptron (MLP) is a feed-forward ANN model. An MLP consists of three layers: input, hidden and output. It is trained with backpropagation, a supervised learning technique, and can distinguish data that are not linearly separable [11]. The MLP computation is stated as follows:

$$y = f_o\left(b_o + \sum_{j} w_j \, f_H\left(b_j + \sum_{i} h_{ij} x_i\right)\right)$$

where y is the output, X is the input vector with components x_i, h_ij is the weight matrix, b_j is the bias vector and f_H is the activation function of the hidden layer; w_j, b_o and f_o are the weight vector, the bias scalar and the activation function of the output layer [10].

The support vector machine (SVM) is a machine learning technique employed for classification and regression. Instead of using a nonlinear function for regression, it tries to fit the regression with a linear function in a high-dimensional space [14]. In SVM, the prediction is calculated by the following formula:

$$f(x) = w^{T} x + b$$

where x is the input vector, b is the bias and w is the weight vector [1].

In this study, COVID-19 data between 20/01/2020 and 18/09/2020 for USA, Germany and Global were obtained from the World Health Organization website [16]. The dataset consists of weekly confirmed cases and weekly cumulative confirmed cases for 35 weeks. Descriptive statistics of the weekly confirmed cases are given in Table 1.
Over the 8-month period, a global average of 881504 new cases was seen per week. The global standard deviation is close to the mean, and the same holds for USA and Germany. The positive skewness of the Germany data means that its distribution has a longer tail on the right. Germany has positive kurtosis, while Global and USA have negative kurtosis. As shown in Figure 1, the distribution for Germany, with its large kurtosis, has tails that exceed those of the normal distribution.

In addition, a distribution analysis was performed on the weekly case data. The results of the goodness of fit test for weekly global cases are given in Table 2. The fit of the data to the Lognormal, Normal, Exponential, 2-Parameter Exponential, Weibull, 3-Parameter Weibull, Largest Extreme Value, Smallest Extreme Value, Logistic and Gamma distributions was investigated. In Table 2, the AD value is the Anderson-Darling test statistic, a measure of the deviation between the data points and the fitted line of the distribution. The p-value is the probability of obtaining a deviation at least this large if the data really follow the distribution, so a small p-value is evidence against the fit. The best distribution is therefore expected to have a low AD value and a high p-value. The probability plots of the four distributions with the lowest AD values are given in Figure 2. As seen in Figure 2, Largest Extreme Value is the distribution that best fits the global weekly data. The estimated distribution parameters for the global weekly data are given in Table 3. Using these parameters, similar data following these distributions can be generated or used for estimation.

Similarly, the goodness of fit test was performed for the weekly case data of Germany and USA, and the probability plots are given in Figure 3 and Figure 4. Accordingly, the best-fitting distribution was found to be largest extreme value for Germany and smallest extreme value for USA.
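As a rough illustration of the comparison described above, the following Python sketch computes the Anderson-Darling statistic for the largest and the smallest extreme value distributions. It is not the procedure behind Tables 2 and 3: the parameters are estimated here by the method of moments (a simplification chosen for brevity), all function names are hypothetical, and the sample is synthetic rather than WHO case data.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_largest_ev(data):
    """Method-of-moments (location, scale) for the largest extreme value law."""
    scale = statistics.stdev(data) * math.sqrt(6.0) / math.pi
    location = statistics.mean(data) - EULER_GAMMA * scale
    return location, scale


def ad_largest_ev(data):
    """Anderson-Darling statistic of `data` against the fitted largest-EV law."""
    location, scale = fit_largest_ev(data)
    xs = sorted(data)
    n = len(xs)
    log_cdf, log_sf = [], []
    for x in xs:
        t = math.exp(-(x - location) / scale)
        log_cdf.append(-t)                        # ln F(x), since F(x) = exp(-t)
        log_sf.append(math.log(-math.expm1(-t)))  # ln(1 - F(x)), computed stably
    total = sum((2 * i + 1) * (log_cdf[i] + log_sf[n - 1 - i]) for i in range(n))
    return -n - total / n


def ad_smallest_ev(data):
    """Same statistic for the smallest-EV law, using the mirror relation:
    X follows a smallest-EV law exactly when -X follows a largest-EV law."""
    return ad_largest_ev([-x for x in data])


def synthetic_right_skewed(n, seed=0, location=10.0, scale=3.0):
    """Placeholder sample drawn from a largest-EV law (NOT real case data)."""
    rng = random.Random(seed)
    return [location - scale * math.log(-math.log(rng.random())) for _ in range(n)]


if __name__ == "__main__":
    sample = synthetic_right_skewed(35)  # 35 values, mirroring the 35 weeks
    for name, stat in (("largest EV", ad_largest_ev(sample)),
                       ("smallest EV", ad_smallest_ev(sample))):
        print(f"{name}: AD = {stat:.3f}")
```

The candidate with the lower AD value would be preferred, in line with the selection rule stated above. A p-value is not computed in this sketch because it depends on distribution-specific critical values.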
The smallest extreme value distribution is skewed to the left and the largest extreme value distribution is skewed to the right. This skewness can also be seen in Figure 1 for the Germany, USA and Global data.

In this study, COVID-19 data between 20/01/2020 and 18/09/2020 for USA, Germany and Global were obtained from the World Health Organization website [16]. Furthermore, a time series prediction model using machine learning is proposed to obtain the disease curve and forecast the epidemic trend. Linear regression, multi-layer perceptron, random forest and support vector machine methods were used. The evaluation metrics described in the subsection below are used to compare these methods.

To compare the estimation methods used in this study, the root mean square error (RMSE), mean absolute percentage error (MAPE) and absolute percentage error (APE) metrics were used. APE measures the consistency between the original value and the predicted value. Lower values of these metrics indicate a better method. The APE, MAPE and RMSE calculations are expressed by the following equations:

$$\mathrm{APE}_i = \frac{|y_i - \hat{y}_i|}{y_i} \times 100$$

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i} \times 100$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$

where n is the number of observations, y_i is the i-th observed value and ŷ_i is the i-th estimated value.

Machine learning time series models take the time parameter into account and evaluate other inputs based on time. In this study, the time feature is divided into the following sub-components: the time index, weekly cases, 17 lagged variables of weekly cases, the square of the time index, the cube of the time index and the products of the 17 lagged variables with the time index. Thus, a total of 38 different variables were extracted. The dataset consists of weekly cumulative confirmed cases for 35 weeks. In machine learning methods, the time series data is divided into training and test data. In this study, 18 weeks were used for training and 17 weeks as test data.
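The feature construction, the 18/17 split and the metrics described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, the time index is assumed to start at 1, lags that reach before the first week are filled with 0.0, APE is taken per observation, and the weekly series is a placeholder rather than WHO data.

```python
import math
from itertools import accumulate

N_LAGS = 17  # number of lagged weekly-case variables stated in the text


def build_features(weekly_cases, n_lags=N_LAGS, fill=0.0):
    """One row per week: time index, weekly cases, lags, square and cube
    of the time index, and the products of each lag with the time index."""
    rows = []
    for idx, cases in enumerate(weekly_cases):
        t = idx + 1  # assumption: the time index starts at 1
        # assumption: lags reaching before the first week are set to `fill`
        lags = [weekly_cases[idx - k] if idx >= k else fill
                for k in range(1, n_lags + 1)]
        products = [t * lag for lag in lags]
        rows.append([t, cases, *lags, t ** 2, t ** 3, *products])
    return rows


def ape(y, y_hat):
    """Absolute percentage error of a single prediction."""
    return 100.0 * abs(y - y_hat) / abs(y)


def mape(ys, y_hats):
    """Mean absolute percentage error over all observations."""
    return sum(ape(y, p) for y, p in zip(ys, y_hats)) / len(ys)


def rmse(ys, y_hats):
    """Root mean square error over all observations."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, y_hats)) / len(ys))


if __name__ == "__main__":
    weekly = [100.0 * (w + 1) for w in range(35)]  # placeholder, NOT real data
    cumulative = list(accumulate(weekly))          # forecasting target
    features = build_features(weekly)
    train_X, test_X = features[:18], features[18:]      # 18 training weeks
    train_y, test_y = cumulative[:18], cumulative[18:]  # 17 test weeks
    print(len(features[0]), len(train_X), len(test_X))
```

Each row holds 1 + 1 + 17 + 1 + 1 + 17 = 38 values, matching the variable count given above. Any regression model could then be fitted on `train_X` and `train_y` and scored on the held-out weeks with `mape` and `rmse`.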
After training and testing, the APE, MAPE and RMSE values were computed; they are given in Table 4 for the linear regression, multi-layer perceptron, random forest and SVM machine learning methods. Table 4 shows that the SVM method provides the best performance for the Global, Germany and USA data: it has the lowest RMSE, MAPE and APE values, so SVM achieved the best trend for all data. When Table 4 is examined in detail, the SVM and linear regression methods have very close MAPE and RMSE values. Next comes the MLP method. The method with the worst performance is Random Forest, which also failed to predict the future. Accordingly, estimations were made with the three best methods for Global, Germany and USA for the 17 weeks after 18/09/2020. These estimates are shown in Figures 5-7.

Figure 5 shows the future trend for the Global cumulative data. According to the estimates in Figure 5, the global pandemic will peak at the end of January 2021 and approximately 80 million people will be cumulatively infected according to the SVM method. Approximately 98 million people will be infected according to the linear regression method, and approximately 39 million according to the MLP method. The prediction of SVM, which is the best method according to the performance metrics, seems more robust and realistic.

Figure 6 shows the future trend of the cumulative case data for Germany. According to the estimates in Figure 6, Germany will peak at the end of January 2021 and approximately 580,000 people will be cumulatively infected according to the SVM method. Approximately 1 million people will be infected according to the linear regression method, and 330,000 according to the MLP method. The performance metrics show that the estimation of SVM is more accurate.
According to the estimates in Figure 7, USA will peak at the end of January 2021 and approximately 11 million people will be cumulatively infected according to the SVM method. With the linear regression method, the forecast enters a downward trend and approaches zero. This is not a realistic estimate, since a cumulative case count cannot decrease. According to the MLP method, 6 million people will be infected. Once again, the prediction of SVM seems more realistic.

In this study, COVID-19 data between 20/01/2020 and 18/09/2020 for USA, Germany and Global were analyzed. The distribution of the data was found to be largest extreme value for Global and Germany and smallest extreme value for USA. A time series prediction model was then proposed to obtain the disease curve and forecast the epidemic trend using machine learning methods. Linear regression, multi-layer perceptron, random forest and support vector machine (SVM) methods were used for this purpose. The performances of the methods were compared according to the RMSE, APE and MAPE criteria. The results showed that the SVM model outperformed the linear regression, multi-layer perceptron and random forest models in modeling the Covid-19 data, and that it could be successfully used to characterize the behavior of cumulative Covid-19 data over time. With the practical application of such machine learning time series models, further research is expected to provide the most appropriate method for healthcare professionals to control and prevent future epidemics.

The author declares that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Serkan Ballı: The author contributed to each part of the paper in the best possible way and also read and approved the paper for submission.
[1] An empirical comparison of machine learning models for time series forecasting
[2] Random forests
[3] Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches
[4] Forecasting incidences of COVID-19 using Box-Jenkins method for the period July 12-September 11, 2020: A study on highly affected countries
[5] Estimation and prediction of construction cost index using neural networks, time series, and regression
[6] Analysis and forecast of COVID-19 spreading in China, Italy and France
[7] Forecasting the patterns of COVID-19 and causal impacts of lockdown in top ten affected countries using Bayesian structural time series models
[8] Clinical characteristics of coronavirus disease 2019 in China
[9] The first 100 days: modeling the evolution of the COVID-19 pandemic
[10] Ensemble approach based on bagging, boosting and stacking for short-term prediction in agribusiness time series
[11] Principles of neurodynamics. Perceptrons and the theory of brain mechanisms
[12] Forecasting the cumulative number of confirmed cases of COVID-19 in Italy, UK and USA using fractional nonlinear grey Bernoulli model
[13] Time series forecasting of COVID-19 using deep learning models: India-USA comparative case study
[14] A comparison of three data mining time series models in prediction of monthly brucellosis surveillance data
[15] Prediction of epidemic trends in COVID-19 with logistic model and machine learning technics
[16] World Health Organization COVID cumulative dataset
[17] Neural network powered COVID-19 spread forecasting model
[18] Analysis on novel coronavirus (COVID-19) using machine learning methods
[19] A time-series water level forecasting model based on imputation and variable selection method
[20] Spatio-temporal estimation of the daily cases of COVID-19 in worldwide using random forest machine learning algorithm