key: cord-0202700-28oh194s
authors: Naeem, Muhammad; Yu, Jian; Aamir, Muhammad; Khan, Sajjad Ahmad; Adeleye, Olayinka; Khan, Zardad
title: Comparative Analysis of Machine Learning Approaches to Analyze and Predict the Covid-19 Outbreak
date: 2021-02-11
journal: nan
DOI: nan
sha: 38295ce24ad4faccf80ec994917bdb249d2e8ddb
doc_id: 202700
cord_uid: 28oh194s

Background. Forecasting the timing of a forthcoming pandemic reduces the impact of disease by enabling precautionary steps such as public health messaging and raising awareness among doctors. With the continuous and rapid increase in the cumulative incidence of COVID-19, statistical and outbreak prediction models, including various machine learning (ML) models, are being used by the research community to track and predict the trend of the epidemic and to develop appropriate strategies to combat and manage its spread. Methods. In this paper, we present a comparative analysis of various ML approaches, including Support Vector Machine, Random Forest, K-Nearest Neighbor and Artificial Neural Network, in predicting the COVID-19 outbreak in the epidemiological domain. We first apply the autoregressive distributed lag (ARDL) method to identify and model the short- and long-run relationships of the time-series COVID-19 datasets. That is, we determine the lags between a response variable and its respective explanatory time series variables as independent variables. Then, the significant variables and their lags are used in the regression model selected by the ARDL procedure to predict and forecast the trend of the epidemic. Results. Statistical measures, i.e., Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), are used to assess model accuracy. The values of MAPE for the best selected models for confirmed, recovered and death cases are 0.407, 0.094 and 0.124, respectively, which fall into the category of highly accurate forecasts.
In addition, we computed fifteen-day-ahead forecasts for the daily deaths, recovered and confirmed cases, and the cases fluctuated across time in all three series. The results also reveal the advantages of ML algorithms for supporting decision making on evolving short-term policies.

The outbreak of the novel coronavirus disease in 2019 has emerged as one of the most devastating respiratory diseases since the 1918 H1N1 influenza pandemic, infecting millions of people globally (Tuli et al. 2020). The cumulative incidence of the virus is continually and rapidly increasing globally. At the early stage of the outbreak, it is important to have a clear understanding of the disease transmission and its dynamic progression, so that relevant agencies and organizations can make informed decisions and enforce appropriate control measures. Generally, capturing the transmission dynamics of a disease over time can provide insights into its progression and show whether the outbreak control measures are effective and able to reduce the impact of the disease on a community (Kucharski et al. 2020). Access to real-time data and effective application of outbreak prediction or forecasting models are central to obtaining insightful information regarding the transmission dynamics of the disease and its consequences. Moreover, every outbreak has unique transmission characteristics that differ from those of other outbreaks, which raises the question of how standard prediction models would perform in delivering accurate results. In addition, various factors, including the number of known and unknown variables, differences in population and behavioural complexity across geopolitical areas, and variations in containment strategies, increase the uncertainty of prediction models (Ardabili et al. 2020). As a result, it is challenging for standard epidemiological models such as Susceptible-Infected-Recovered (SIR) to provide reliable results for long-term predictions.
Therefore, it is important not only to study the relationship between the components of the outbreak datasets but also to evaluate the effectiveness of the common disease prediction models. In recent months, there have been a handful of works that try to understand the spread of COVID-19, especially using statistical approaches. For instance, Kucharski et al. combined a stochastic transmission model with four datasets, which captured the daily number of new cases, the daily number of new internationally exported cases, the proportion of infected passengers on evacuation flights and the number of new confirmed cases, to estimate the transmission dynamics of the disease over time (Kucharski et al. 2020). In another study, a machine learning-based model is applied to analyse and predict the growth of COVID-19 (Tuli et al. 2020). The authors demonstrated the effectiveness of using iterative weighting for fitting the Generalized Inverse Weibull distribution when developing a prediction solution. Lin et al. presented a conceptual model designed for the COVID-19 epidemic that considers individual behavioural responses and government actions, including holiday extensions, travel restrictions, quarantine and hospitalization (Lin et al. 2020). This work combined zoonotic transmission with the emigration pattern and then estimated the future trends and the reporting proportion. The model gives promising insight into the trend of the COVID-19 outbreak, especially the impact of individual and government responses to the epidemic. Anastassopoulou et al. (2020) estimated the average values of the key epidemiological parameters, including the per-day case mortality and recovery ratios and the basic reproduction number R0, which represents the average number of secondary cases that result from the introduction of a single infectious case into an entirely susceptible population during the active period of the pandemic.
The authors fit the dataset to the Susceptible-Infectious-Recovered-Dead (SIDR) model and attempted a three-week prediction of the dynamics of the outbreak at the epicentre. The estimated mean value of R0, calculated over the period from 11th to 18th January, was found to be around 2.6 based on the officially confirmed cases. Hu et al. (2020) proposed a machine learning approach to predict the magnitude, duration and end time of the epidemic. The authors proposed an improved auto-encoder to model the transmission dynamics of the epidemic and then predict the confirmed cases. In the model, latent variables are used to cluster the cities in order to investigate the transmission structure. In multiple-step forecasting, the errors of the 6-, 7-, 8-, 9- and 10-step forecasts were 1.64%, 2.27%, 2.14%, 2.08% and 0.73%, respectively.

Autoregressive Distributed Lag (ARDL) is a flexible method for including independent series in dynamic regression models. ARDL models contain lagged values of both the response and the explanatory series. They have been widely used in various domains, including marketing, energy, epidemiology, agronomy and ecological studies (Huffaker & Fearne 2019). Over the years, many packages have been developed for ARDL. Distributed lag models have a wide range of applications, for example cointegration analysis, in which the short- and long-run relationships between time series are studied. The ARDL bounds testing approach of Pesaran et al. (2001) is a common cointegration technique founded on the distributed lag model, and further research on it is in progress. Demirhan (2020) developed the dLagM package for fitting Distributed Lag Models (DLMs) in R. The package nardl focuses on the application of the nonlinear cointegrating ARDL model developed by Shin et al. (2014).
The recent package dynlm fits dynamic linear models while preserving time-series attributes (Zeileis & Zeileis 2019). In the current study, we use R and the dLagM package, which implements the ARDL bounds testing method (Pesaran et al. 2001). dLagM takes the lag orders, the data and a general model formula, and constructs the lags and differences required for a given model. One benefit of this approach is that users are not required to specify the transformations for the fitted models, which brings efficiency and value to researchers in various areas.

In this work, we present a comparative analysis of various machine learning approaches, including Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN) and Artificial Neural Network (ANN), in predicting the COVID-19 outbreak in the epidemiological domain. We aim to determine how well each of these approaches predicts the confirmed and death cases and then compare their performances with each other. In particular, we first apply the ARDL method to identify and model the short- and long-run relationships of the time-series COVID-19 datasets (confirmed, recovered and death cases). That is, we determine the lags between a response variable and its respective explanatory time series variables as independent variables. Then, the significant variables and their lags are used in the regression model selected by the ARDL procedure to predict and forecast the trend and dynamics of COVID-19. We evaluated the models using relevant accuracy and error metrics, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE).

We conducted our study on the publicly accessible data of daily deaths, recovered and confirmed cases (17549, 332062 and 694123, respectively) reported worldwide from 22 January 2020 to 18 January 2021 (Fig. 1).
The data are available in an online GitHub repository (https://github.com/CSSEGISandData/COVID-19). We perform data processing, including converting the data from a cumulative to a daily basis. The repository supports the COVID-19 visual dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), which aggregates data from sources such as WHO, Worldometers, BNO News, the Washington State Department of Health and many more. The data contain the global numbers of confirmed, recovered and death cases. On these data, we attempted to forecast the key epidemiological parameters, i.e., the numbers of upcoming daily new confirmed cases, deaths and recoveries. The numbers of deaths, recoveries and confirmed cases are expected to grow over time. Therefore, we model the relationship between these variables and their past records (lags) using the ARDL model.

In regression analysis, ARDL models relate a response series to k regressor series. In this study, we used the tseries, timeseries, zoo and window packages for the data and the dLagM package in R for the ARDL model. The orders of the ARDL model are denoted by ARDL($p$, $q$), where $p$ is the number of lags of the independent series and $q$ is the number of lags of the dependent series. Even if there is only one independent series, the lags of the dependent series make the model autoregressive. The lag order of the $j$th independent series is denoted by $p_j$, $j = 1, \dots, k$ (here the daily recovered and confirmed cases, so $k = 2$), and the lag order of the dependent series by $q$. The ARDL model can be expressed as:

$$ D_t = \alpha_0 + \alpha_1 D_{t-1} + \alpha_2 D_{t-2} + \cdots + \alpha_q D_{t-q} + \beta_1 R_t + \beta_2 R_{t-1} + \cdots + \beta_{p_1+1} R_{t-p_1} + \gamma_1 C_t + \gamma_2 C_{t-1} + \cdots + \gamma_{p_2+1} C_{t-p_2} + \varepsilon_t \tag{1} $$

where $D_t$ denotes the number of daily deaths at time $t$ and $\alpha_0$ represents the intercept term. The terms $\alpha_1 D_{t-1} + \alpha_2 D_{t-2} + \cdots + \alpha_q D_{t-q}$ form the autoregressive part of order $q$ of the dependent variable. The two independent variables, recovered cases and confirmed cases, are denoted by $R_t$ and $C_t$, respectively, and the terms $\beta_1 R_t + \beta_2 R_{t-1} + \cdots + \beta_{p_1+1} R_{t-p_1}$ and $\gamma_1 C_t + \gamma_2 C_{t-1} + \cdots + \gamma_{p_2+1} C_{t-p_2}$ represent the lags of $R_t$ and $C_t$, respectively. The parameters $\alpha$, $\beta$ and $\gamma$ denote the coefficients of death, recovered and confirmed cases, respectively, while $\varepsilon_t$ denotes the error term. Eq. 1 can be further simplified and presented as (Eq. 2):

$$ D_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i D_{t-i} + \sum_{j=0}^{p_1} \beta_{j+1} R_{t-j} + \sum_{j=0}^{p_2} \gamma_{j+1} C_{t-j} + \varepsilon_t \tag{2} $$

The numbers of death, confirmed and recovered cases are likely to grow with time. Therefore, the ARDL model for recovered cases and confirmed cases is shown in (Eq. 3). Similarly, the ARDL model for confirmed and recovered cases is shown in (Eq. 4).

Different criteria are used to select the optimal lag length. Chandio et al. (2020) use the Akaike Information Criterion (AIC), and Gayawan & Ipinyomi (2009) compare AIC, SIC and adjusted R-squared to select the optimal lag length. In this study, we use adjusted R-squared and model parsimony to select the optimal lag length. Calling the fitting function is straightforward when all series share the same lag order; when the lag orders differ between the dependent series and the independent series, we use the argument remove, which drops the lags that do not contribute to the model.

Once the ARDL model has identified the significant lags of the dependent and independent variables, these are used in the RF, SVM, KNN and ANN models, whose accuracy and error rates are then assessed. We utilized RF. To guard against overfitting, we use 80% of the data for training and 20% for testing. Random forest is one of the best learning algorithms and requires little parameter tuning. In the current study, the randomForest, forecast, caret, tidyverse, tsibble and purrr packages are used for RF. In the function call, ntree is 500, mtry is p/3, where p is the number of features, sampsize is 70% and type is "regression". The other parameters are kept at their defaults.

Generally, in time series analysis, Support Vector Regression (SVR) is used.
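The data preparation and ARDL specification described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the dLagM implementation used in the study: the function names (`cumulative_to_daily`, `ardl_design`, `ols`, `adjusted_r2`), the synthetic series and the coefficient values are all hypothetical, and ordinary least squares via the normal equations stands in for the package's fitting routine.

```python
import random

def cumulative_to_daily(cumulative):
    """Turn cumulative counts into daily increments (first day keeps its value)."""
    previous = [0] + list(cumulative[:-1])
    return [c - prev for c, prev in zip(cumulative, previous)]

def ardl_design(y, xs, p, q):
    """Build the response and design rows of an ARDL model.

    y: dependent series; xs: list of independent series;
    p: one lag order per independent series; q: lag order of y.
    Row layout: intercept, y[t-1..t-q], then x[t], x[t-1], ..., x[t-p_j].
    """
    start = max([q] + list(p))  # first index at which every lag exists
    target, rows = [], []
    for t in range(start, len(y)):
        row = [1.0] + [y[t - i] for i in range(1, q + 1)]
        for x, p_j in zip(xs, p):
            row += [x[t - j] for j in range(p_j + 1)]
        rows.append(row)
        target.append(y[t])
    return target, rows

def ols(rows, target):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, target))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def adjusted_r2(rows, target, coef):
    """Adjusted R-squared, the lag-selection criterion named in the text."""
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in rows]
    mean = sum(target) / len(target)
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean) ** 2 for t in target)
    n, k = len(rows), len(coef)  # k counts the intercept
    return 1.0 - (ss_res / (n - k)) / (ss_tot / (n - 1))

# Synthetic, noise-free example (hypothetical coefficients, not study data).
rng = random.Random(0)
n = 200
recovered = [rng.uniform(5, 10) for _ in range(n)]
confirmed = [rng.uniform(10, 20) for _ in range(n)]
deaths = [0.0] * n
for t in range(1, n):
    deaths[t] = (2.0 + 0.5 * deaths[t - 1] + 0.1 * recovered[t]
                 + 0.05 * confirmed[t] + 0.02 * confirmed[t - 1])

target, rows = ardl_design(deaths, [recovered, confirmed], p=[0, 1], q=1)
coef = ols(rows, target)  # intercept, deaths lag 1, recovered, confirmed lags 0 and 1
print([round(c, 4) for c in coef], adjusted_r2(rows, target, coef))
```

Because the synthetic series is generated without noise, the fit recovers the generating coefficients up to rounding; with real counts, candidate lag orders could be compared by their adjusted R-squared in the same way.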
In this study, we use the e1071 library with the parameters cost = $10^2$, gamma ($\gamma$) = 0.1 and insensitivity ($\varepsilon$) = 0.3. The function called is svm, which automatically performs regression (SVR) or classification depending on whether the response is continuous or categorical. In SVM, kernel functions are used to map the input space into a higher-dimensional feature space; the Gaussian radial basis function, sigmoid and polynomial kernels are common examples. For SVM, we use the Radial Basis Function (RBF) kernel $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$. With RBF kernels, it is necessary to tune the model parameters to find their optimal values and to reduce overfitting. We therefore use grid search with tenfold cross-validation on the training and testing parts, and the results are averaged.

k-nearest neighbor (k-NN) predicts the response variable based on the nearest training points. In this study, we use the caret package for k-NN regression in R. Instead of learning a discriminative function from the training data, k-NN stores the training dataset itself. k-NN is used for both classification and regression problems. Various techniques are used to select an optimal value of k and so improve model accuracy, such as the maximum percentage accuracy graph, the elbow method and for loops. Generally, the square root of n is used, and we used $k = \sqrt{n}$.

ANN is a mathematical tool that has been widely used for classification and forecasting problems. It consists of a predictor (input) layer, a response (output) layer and one or more hidden layers. Different combinations of hidden layers are tried in order to choose a better multilayer perceptron (MLP) architecture. The hidden layers play an important role in many successful applications of neural networks. In the current study, we use the neuralnet package for ANN. The parameters algorithm, threshold and linear.output are set to 'backprop', 0.01 and TRUE, respectively, and the other parameters are kept at their defaults.
The ANN model is widely used in economic and financial studies (Huang et al. 2007; Qi 1996). The number of hidden layers depends on the nature of the problem. Zhang et al. (1998) used two hidden layers and found better prediction accuracy. In the same way, Xu et al. (2020) used $(2p + 1)$, where $p$ is the number of predictors (inputs). For an optimal ANN, a trial-and-error method is usually used to determine the number of hidden nodes, i.e., searching for the architecture with the smallest MAPE among the candidate models (Güler & Übeyli 2005). Using this trial-and-error procedure and 10,000 iterations, we use 4 hidden layers with 8 neurons for daily death cases. In the same way, we use 2 hidden layers with 4 neurons for daily recovered cases.

In this study, as the response variable is continuous, the forecasting capacity of the different machine learning approaches is evaluated using five criteria: mean error (ME), RMSE, Mean Absolute Error (MAE), Mean Percentage Error (MPE) and MAPE, presented in (Tab. 1), where $n$ is the number of predictions on the training or testing part, and $y_t$ and $\hat{y}_t$ are the observed and predicted values, respectively.

A total of three COVID-19 datasets (confirmed, recovered and death cases) are used to evaluate the performance of the different ML approaches and to suggest the best model for forecasting the COVID-19 outbreak. All datasets consist of worldwide daily confirmed, recovered and death cases. Every time series is divided chronologically into a training set and a testing set: the first 80% of the observations in each series are used for training and the remaining 20% for testing. To guard against overfitting, we use 10-fold cross-validation for each of the models, and the results are averaged.
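The five error measures and the chronological 80/20 split described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names `chronological_split` and `forecast_errors` are hypothetical, errors are taken as observed minus predicted (an assumed sign convention), and the percentage measures assume that no observed value is zero.

```python
import math

def chronological_split(series, train_fraction=0.8):
    """Split a time series in order: the first part trains, the rest tests."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def forecast_errors(observed, predicted):
    """Return ME, RMSE, MAE, MPE and MAPE for paired observations and forecasts."""
    errors = [o - p for o, p in zip(observed, predicted)]  # assumed sign convention
    n = len(errors)
    # Percentage errors are undefined when an observed value is zero.
    pct = [100.0 * e / o for e, o in zip(errors, observed)]
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": sum(pct) / n,
        "MAPE": sum(abs(v) for v in pct) / n,
    }

# Toy numbers for illustration only (not study data).
train, test = chronological_split(list(range(10)))
metrics = forecast_errors([100, 200, 400], [110, 190, 400])
print(len(train), len(test), metrics)
```

ME and MPE let positive and negative errors cancel, so they indicate bias, whereas RMSE, MAE and MAPE measure the size of the errors; this is why a model can have an ME near zero and still forecast poorly.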
In addition, we also report the prediction accuracy on the training parts. Each time series contains a total of 366 observations spanning 22 January 2020 to 18 Jan 2021; the first 252 observations, spanning 22 January 2020 to 07 Nov 2020, belong to the training series, and the remaining 74 observations, spanning 08 Nov 2020 to 18 Jan 2021, form the testing series.

We use death, recovered and confirmed cases from the COVID-19 dataset. The dataset is loaded into the R environment, and we then fit the ARDL model to the daily deaths series with recovered and confirmed cases as explanatory series. Using adjusted R-squared and model parsimony, we choose lag orders $p_1 = 3$ and $p_2 = 1$ for the independent series and $q = 1$ for the dependent series. The insignificant variables are removed and the ARDL model is refitted. The results obtained from the ARDL model are presented in (Tab. 2). The coefficients of confirmed cases and its first lag are significant at the 1% and 5% levels, respectively. Similarly, the first lag of the response variable (daily deaths) is significant at the 1% level. In addition, the coefficients of recovered cases (first and second lags) are significant at the 1% and 5% levels, respectively. Overall, the model is highly significant at the 1% level, with a p-value smaller than 2.2e-16 and an adjusted R-squared of 85.2%. The fitted model can be written as:

Tab. 3 shows the summary of the ARDL model, which includes the confirmed cases recorded on the current day and on the third and fourth days before. The daily recovered cases of the current day and of one, two and three days before have a significant impact on the number of daily recovered COVID-19 cases on that particular day. The model is significant at the 1% level ($p$ < 0.0000012), and the adjusted R-squared value is 91.55%. The fitted model can be written as:

Tab. 4 shows the summary of the ARDL model, which includes the confirmed cases at the current day and at the first, second and third lags.
The first and second lags of the daily confirmed cases have a significant impact on the number of daily confirmed COVID-19 cases. The model is significant at the 1% level ($p$ < 0.0000031), and the adjusted R-squared value is 85.39%. The fitted model can be written as:

We evaluate the RF, SVM, KNN and ANN models and compare their performance using the accuracy metrics ME, RMSE, MAE, MPE and MAPE. These metrics provide different perspectives for assessing predictive models. The first three are absolute performance measures, while the fourth and fifth are relative performance measures. The training sample is used to estimate the parameters of a specific model architecture. The testing set is then used to select the best model among all models considered.

Tab. 5 summarizes the RF, SVM, KNN and ANN forecasting accuracy measures for the training set of the COVID-19 daily deaths data. (Gao et al. 2019). We highlighted the results for the ANN model, which indicate the smallest values among all models. The ANN method performs markedly better than the other methods on the 20% testing part in most cases. Fig. 2 shows the plot of the forecasting accuracy measures for the models. (Aamir et al. 2018; Xu et al. 2020) It is clear from the plot that, on average, ANN is the best model for forecasting the daily deaths of the COVID-19 outbreak.

Tab. 7 summarizes the RF, SVM, KNN and ANN forecasting accuracy measures for the COVID-19 confirmed cases on the training dataset. RF and KNN models are 11.55, 21.89, 13.58 and 0.0942, respectively. Thus, the value of the MAPE for ANN is in the range of 1 to 10, which shows that the selected model falls into the category of very good models. On average, the ANN method performed significantly better than the other methods on the 20% testing part. This indicates that the ANN results are more consistent than those of RF, SVM and KNN. Fig. 3 shows the plot of the forecasting accuracy measures for the different models.

RF and KNN models are 9.88, 150.19, 10.46 and 0.0029, respectively. Thus, the value of the MAPE for ANN is in the range of 1 to 10, which shows that the selected model falls into the category of very good models. On average, the ANN method performed significantly better than the other methods on the 20% testing part. This indicates that the ANN results are more consistent than those of RF, SVM and KNN. Fig. 4 shows the plot of the forecasting accuracy measures for the different models.

The performance of the neural network model can be assessed once the network has been trained, by using it for prediction. All the methods are capable of capturing the pattern of the data effectively. Moreover, ANN performed well and captured almost the whole pattern of the testing part of the data when compared to the RF, SVM and KNN methods. Fig. 3 shows the prediction accuracy of the RF, SVM, KNN and ANN methods for the number of daily COVID-19 recovered cases. The original worldwide daily deaths testing data and the forecasts of the RF, SVM, KNN and ANN models are plotted in (Fig. 5). Fig. 5 clearly shows that ANN captured the pattern of the test set better than the RF, SVM and KNN methods. Also, (Fig. 5) displays the prediction accuracy of the RF, SVM, KNN and ANN models for the daily recovered cases of COVID-19. As with the death cases, all the models effectively captured the pattern of the daily recovered cases. In the same way, in Fig. 6 and Fig. 7, the ANN captured the pattern on the test part of the data, whereas the other methods follow the pattern to some extent at first and then become insensitive to the original data.

In (Fig. 8), the original COVID-19 daily deaths data and the resulting ANN forecast are plotted for the next fifteen days (19 Jan 2021 to 03 Feb 2021). As shown in the figure, the ANN forecast captures and follows the pattern of the original death cases of COVID-19. The forecast line for the subsequent fifteen days fluctuated near 10,000. In addition, the forecasted number of deaths tends to decline gradually over time. This is an indication that the number of daily deaths decreases over time. In (Fig. 9), the original COVID-19 recovered cases data and the ANN forecast are shown for the next fifteen days (04 Dec 2020 to 18 Dec 2020). The ANN forecast captured the pattern of the original recovered cases data. In addition, the forecast for the next fifteen days drifts downward. This reveals that the number of daily recoveries is decreasing over time. In (Fig. 10), the original COVID-19 confirmed cases data and the ANN forecast are shown for the next fifteen days (19 Jan 2021 to 03 Feb 2021). The ANN forecast captured the pattern of the original confirmed cases data. In addition, the forecast for the next fifteen days drifts downward. This reveals that the number of daily confirmed cases is decreasing over time.

This paper compared four predictive models for the COVID-19 outbreak. The methods are compared with respect to five performance metrics: ME, RMSE, MAE, MPE and MAPE. The results for the daily death cases are based on 80% training and 20% testing parts. Among the four methods, the ANN achieved better results on these performance metrics in every respect. In the same way, for the daily recovered cases, using 80% training and 20% testing parts, ANN attained better results than the other methods. Moreover, for the daily confirmed cases, using the same training and testing parts, ANN performed better than the other methods in most cases.
Therefore, the major findings of this study reveal that ANNs outperform the rest of the methods for both models. In addition, ANN shows more consistent prediction performance than the RF, SVM and KNN models and is hence preferable as a robust forecasting model. The accuracy of the AI-based methods in predicting the trajectory of COVID-19 is high. For this specific application in predicting the disease, the authors consider the results reliable. In this study, ANN shows the fastest convergence and good forecasting ability in most cases. The results show the advantages of machine learning algorithms for supporting strategy and decision makers in developing short-term policies regarding disease prevalence. The forecast models will help governments and health staff to prepare for forthcoming circumstances and improve the readiness of healthcare systems. The forecasted figures were calculated for the next fifteen days (i.e., 19 Jan 2021 to 03 Feb 2021) for the COVID-19 data.

It is worth noting that forecasting is a complex matter, and some tailored models might not generalize, owing to the complex societal and economic circumstances of different nations. The models and predictions proposed in this article do not reflect local demography, and the real statistics may vary owing to numerous governmental actions such as the intensity of lockdowns, isolation strategies and health facilities. Thus, readers should be careful when interpreting these forecasts.

No conflicts of interest, financial or otherwise, are declared by the authors.
Improving forecasting accuracy of crude oil price using decomposition ensemble model with reconstruction of IMFs based on ARIMA model
Data-based analysis, modelling and forecasting of the COVID-19 outbreak
Covid-19 outbreak prediction with machine learning
A random forest guided tour
Using the ARDL-ECM approach to investigate the nexus between support price and wheat production: An empirical evidence from Pakistan
dLagM: An R package for distributed lag models and ARDL bounds testing
Forecasting Crude Oil Price Using Kalman Filter Based on the Reconstruction of Modes of Decomposition Ensemble Model
A comparison of Akaike, Schwarz and R square criteria for model selection using some fertility models
An expert system for detection of electrocardiographic changes in patients with partial epilepsy using wavelet-based neural networks
Prediction of influenza-like illness based on the improved artificial tree algorithm and artificial neural network
Artificial intelligence forecasting of covid-19 in china
Neural networks in finance and economics forecasting
Reconstructing systematic persistent impacts of promotional marketing with empirical nonlinear dynamics
Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet Infectious Diseases
Forecasting influenza epidemics by integrating internet search queries and traditional surveillance data with the support vector machine regression model in Liaoning
A conceptual model for the outbreak of Coronavirus disease 2019 (COVID-19) in Wuhan, China with individual reaction and governmental action
Time Series Forecasting with KNN in R: the tsfknn Package
Bounds testing approaches to the analysis of level relationships
18 Financial applications of Artificial Neural Networks
Modelling asymmetric cointegration and dynamic multipliers in a nonlinear ARDL framework
Predicting the Growth and Trend of COVID-19 Pandemic using Machine Learning and Cloud Computing
A New Approach for Reconstruction of IMFs of Decomposition and Ensemble Model for Forecasting Crude Oil Prices
Forecasting with artificial neural networks: The state of the art