key: cord-0765845-4suc9ubu
authors: Abbasimehr, Hossein; Paki, Reza; Bahrini, Aram
title: Improving the performance of deep learning models using statistical features: The case study of COVID-19 forecasting
date: 2021-05-22
journal: Math Methods Appl Sci
DOI: 10.1002/mma.7500
sha: 6110d9fc72f8e4434d4be566e5f44e30f6ddd4f9
doc_id: 765845
cord_uid: 4suc9ubu

The COVID-19 pandemic has affected all aspects of people's lives and disrupted the economy. Forecasting the number of cases infected with this virus can help authorities make accurate decisions on the interventions that must be implemented to control the pandemic. Investigation of the studies on COVID-19 forecasting indicates that various techniques such as statistical, mathematical, and machine and deep learning have been utilized. Although deep learning models have shown promising results in this context, their performance can be improved using auxiliary features. Therefore, in this study, we propose two hybrid deep learning methods that utilize statistical features as auxiliary inputs and associate them with their main input. Specifically, we design a hybrid method of the multihead attention mechanism and the statistical features (ATT_FE) and a combined method of convolutional neural network and the statistical features (CNN_FE) and apply them to COVID-19 data of the 10 countries with the highest number of confirmed cases. The results of the experiments indicate that the hybrid models outperform their conventional counterparts in terms of performance measures. The experiments also demonstrate the superiority of the hybrid ATT_FE method over the long short-term memory model.

The COVID-19 pandemic 1 has been causing unprecedented challenges to aspects such as people's health, economy, education, and employment.
[2] [3] [4] To control the rapid rate of spread of COVID-19 and to decrease its devastating impacts, many countries in the world have implemented intensive interventions and restrictive measures, [5] [6] [7] such as social distancing, border closure, school closure, lockdown, travel restrictions, and public events banning. 8 The study of Flaxman et al 8 on the usefulness of interventions across 11 European countries suggested that the adopted interventions were influential in reducing the transmission rate of the coronavirus. Monitoring and recording the infected cases data are vital for evaluating the success of controlling the COVID-19 pandemic. 9 Analyzing the recorded data gives helpful knowledge about the pandemic trend and helps countries take accurate measures. Johns Hopkins University's Coronavirus Resource Center 10 has collected and published the data about the COVID-19 confirmed cases used by scholars to model the disease's spread and perform data analysis. Forecasting the number of cases infected with this virus can help authorities make accurate decisions on the interventions that must be implemented to control the epidemic. 11 In past studies, researchers have adopted time series forecasting approaches to predict the number of cases of COVID-19 because the available data are collected daily and constitute a time series of data. 5, [12] [13] [14] [15] Various techniques, including mathematical and computational intelligence models, have been employed in past studies on COVID-19 time series forecasting. Al-Qaness et al 16 studied the adaptive neurofuzzy inference system (ANFIS) to predict the number of infected cases in China. To model and forecast the number of cases in Mexico, Torrealba-Rodriguez et al 5 used mathematical and computational models such as logistic, Gompertz, and artificial neural network (ANN). 
Castillo and Melin 14 proposed a novel hybrid approach based on fuzzy fractal and fuzzy logic to forecast the number of confirmed cases of COVID-19 in 10 countries. Melin et al 17 used multiple ensemble neural network models with fuzzy response aggregation to predict the COVID-19 time series of Mexico.

Due to the successful results of deep learning models in many application domains such as natural language processing (NLP), 18 time series forecasting, 19,20 stock market prediction, 21 and customer behavior forecasting, 22 these models have also been adopted for COVID-19 time series forecasting. In particular, long short-term memory (LSTM) and bidirectional LSTM (BiLSTM) 23 have been applied in previous studies, 9, 12, 13, 15 where they have shown good performance in COVID-19 series forecasting. Also, in Abbasimehr and Paki, 24 techniques based on the attention mechanism 25 and convolutional neural networks (CNNs) 26 have been studied for COVID-19 forecasting. Although deep learning methods can achieve suitable performance in COVID-19 forecasting applications, their predictive power mainly depends on the amount of data. To deal with this problem and to improve time series forecasting with deep learning models, in this study, we propose to exploit statistical features as auxiliary inputs 27 and associate them with the main input of the deep learning methods. Incorporating statistical features, which can capture the majority of the dynamics that exist in the input data, 28 improves the deep learning model. We argue that relying only on the representation obtained by a deep learning model may not be enough to capture the patterns of a time series. Based on this idea, in this study, two hybrid methods using the attention mechanism (ATT_FE) and CNN (CNN_FE) are designed and applied to COVID-19 data. Three deep learning models based on the attention mechanism, CNN, and LSTM are implemented as the benchmark models. These models utilize only the main input data and do not use any side information.
In designing both the hybrid models and the benchmark models, a multiple-output modeling approach is adopted, which allows the models to forecast the number of cases for the next few days. Multi-output forecasting is an effective choice for long-horizon forecasting compared to single step-ahead forecasting. 29 Hyperparameter tuning affects the performance of machine and deep learning models. 30 The models used throughout this study have several important hyperparameters that must be specified before training. The previous studies on COVID-19 forecasting 9,12,15 have trained their models with manually tuned hyperparameters. To find the best hyperparameters and consequently to increase the forecasting accuracy, in this study, a hyperparameter selection algorithm based on the Bayesian optimization (BO) method 31 is developed and applied before training any model. The proposed hybrid models are applied to COVID-19 data of the 10 countries with the highest number of confirmed cases. The results of the experiments indicate that the hybrid models outperform their conventional counterparts. Also, the experiments demonstrate the superiority of the hybrid multihead attention method over the LSTM model. To further investigate the proposed models' usefulness, the forecasting results are visualized, which can help governments plan their long-term decisions to control the pandemic. Overall, the main advantages of the proposed methods are as follows: (1) Combining statistical features with deep learning models to improve the prediction of the number of daily infected cases with COVID-19. As time series data are often small, statistical features capture most patterns existing in the time series and improve the deep learning models' performance. (2) Designing a multiple-input and multiple-output deep neural network architecture to implement our idea. (3) Utilizing BO for the selection of the best parameters.
(4) Adopting a multiple-output forecasting approach. 32 The proposed models are designed so that they can predict the next few days. (5) Conducting an extensive experiment using data from 10 countries to demonstrate the effectiveness of the proposed models.

The rest of this paper is organized as follows. The next section gives a literature review on models and methods proposed for COVID-19 time series forecasting. In Section 3, we describe the main idea and the architectures of the proposed hybrid models. Section 4 describes the data, provides the experimental results, and compares the proposed models to the benchmark models. Section 5 concludes the paper and outlines future work.

In this section, we outline the past studies on COVID-19 time series forecasting. Various techniques, including mathematical, statistical, and machine and deep learning, have been applied for COVID-19 time series forecasting. Gompertz and logistic models are widely used mathematical models (see previous studies 5, [33] [34] [35] [36] ). Also, the autoregressive integrated moving average (ARIMA) 9, 11, 13 and exponential smoothing 37 are popular statistical methods that have been utilized in COVID-19 forecasting. Besides, researchers have shown growing interest in applying machine and deep learning techniques such as ANN and LSTM to COVID-19 time series forecasting (e.g., previous studies 9, 12, 15, 38, 39 ). The results of these studies demonstrate the usefulness of NN-based methods, and especially of deep learning methods such as LSTM, which are inherently suitable for processing sequence data, for COVID-19 forecasting. Some methods based on fuzzy logic have also been proposed in the literature (see Castillo and Melin 14 and Al-Qaness et al 16 ). Despite the successful application of deep learning models in the context of COVID-19 forecasting, in this study, we argue that their performance can be further improved by feeding them with some informative features.
The main challenge of time series forecasting is that the length of the time series is often short, and thereby, the number of instances created from a time series is small. In deep learning applications, the resulting model's power is dependent on the number of training samples. However, this issue is unavoidable in time series forecasting, as the length of the series is short. To tackle this problem and improve the performance of deep learning methods, we propose to exploit some informative features in addition to the features learned by the deep learning methods. The provided features, which are extracted from each input vector, allow the deep learning model to better discriminate the patterns of the time series and subsequently perform accurate forecasting.

In this study, we propose two hybrid deep learning models to forecast the number of COVID-19 cases. The methodology used for the COVID-19 time series forecasting is as follows:

Step 1: Creating instances from the time series. Using an input window with size L, which is also called Lag in the literature, and an output window with size O, the instances in input-output format are created.

Step 2: Computing features from the input vector. In this step, from each input vector, the features provided in Table 1 are computed.

Step 3: Model training using BO. BO is used to select the best features and hyperparameters concurrently.

Step 4: Evaluation of models.

This study's main contribution is to improve the deep learning models' performance by incorporating some informative statistical features computed from the input data. The list of features that are extracted from the input data is shown in Table 1. These features have been introduced in Hyndman et al. 40 Although plenty of features can be extracted from each input vector, the features proposed by Hyndman et al 40 are designed to represent the main characteristics of a time series. Note that in this study, these features are computed from input vectors.
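Steps 1 and 2 above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`make_instances`, `acf1`, `crossing_points`) and the toy series are assumptions, and the two features shown (first-order autocorrelation and number of crossing points) are simplified stand-ins for the corresponding entries of the feature list, which the study computes with an R package.

```python
from typing import List, Tuple


def make_instances(series: List[float], lag: int, horizon: int) -> List[Tuple[List[float], List[float]]]:
    """Slide a window of length lag + horizon over the series.

    The first `lag` points of each window form the input vector and the
    last `horizon` points form the output vector.
    """
    instances = []
    for start in range(len(series) - lag - horizon + 1):
        x = series[start:start + lag]
        y = series[start + lag:start + lag + horizon]
        instances.append((x, y))
    return instances


def acf1(x: List[float]) -> float:
    """First-order autocorrelation of one input vector (0.0 if constant)."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return 0.0
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    return num / denom


def crossing_points(x: List[float]) -> int:
    """Number of times the vector crosses its median."""
    ordered = sorted(x)
    n = len(ordered)
    median = ordered[n // 2] if n % 2 else (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    above = [v > median for v in x]
    return sum(1 for a, b in zip(above, above[1:]) if a != b)


# Toy series t_1..t_10 with an input window of 5 and an output window of 2.
series = [float(v) for v in range(1, 11)]
instances = make_instances(series, lag=5, horizon=2)

# One auxiliary feature vector per input vector (the second model input).
features = [[acf1(x), float(crossing_points(x))] for x, _ in instances]
```

Each element of `instances` pairs an input vector with its multi-step target, and `features` holds the matching auxiliary vector, so the two lists stay aligned sample by sample.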
Table 1. Features extracted from the input data
F3 The first order of autocorrelation
F4 Strength of trend
F5 Strength of linearity
F6 Strength of curvature
F7 Strength of seasonality
F8 Spectral entropy
F9 Changing variance in the remainder
F10 Flat spots using discretization
F11 The number of crossing points

Recently, the performance of attention mechanisms has been demonstrated in NLP applications. 21, 41 The study of Vaswani et al 25 demonstrated the effectiveness of the attention mechanism for processing sequence data. In this study, we propose a hybrid architecture (ATT_FE) based on a multihead attention-based model 25 and the statistical features calculated from the input data for COVID-19 forecasting (Figure 1). An attention function maps a query Q and a set of keys and values to an output O. This procedure is often called scaled dot-product attention. Multihead attention is a set of multiple heads that jointly learn different representations at every position in the sequence. 42 ATT_FE, as illustrated in Figure 1, consists of two multihead attention layers, the flatten layer, the concatenation layer, and the fully connected layer. The proposed method takes two inputs: the created instances and the statistical features. After preprocessing the input data and creating the instances (input_1), the first multihead attention layer computes a new representation of input_1. Also, after calculating the statistical features corresponding to each sample (input_2) and normalizing them, the second multihead attention layer takes input_2 and produces a new representation, which gives more importance to the informative features. The outputs of the multihead attention layers are reshaped using the flatten layer, and after concatenation, the merged representation is fed into the fully connected layer, which produces the outputs.

CNNs have produced successful results in many application domains and especially in machine vision.
26 This study proposes a hybrid model based on CNN for COVID-19 time series forecasting (as seen in Figure 2). In the proposed method, the convolutional layers take the input data and extract new features by applying the convolution operation to the data using convolution kernels. Each convolutional layer has a convolution kernel (i.e., a small window) that slides over the input data and performs convolution operations to derive new features. 43 The features derived by the convolution operation are usually more discriminative than the raw input data and can therefore improve the forecasting. The architecture of the proposed hybrid CNN-based model (CNN_FE) is illustrated in Figure 2. CNN_FE contains (1) two CNN layers, (2) the flatten layer, (3) the concatenation layer, and (4) the fully connected layer. Similar to the previous attention-based method, CNN_FE takes two inputs: the instances generated from the original time series in the input-output format (input_1) and the statistical features (input_2). The first CNN layer extracts useful features from input_1. Similarly, the second CNN layer produces new features from input_2. The outputs of the CNN layers are reshaped using the flatten layers, and after concatenation, the merged feature set is fed into the fully connected layer, which predicts the outputs.

Three benchmarking models, which are state-of-the-art deep learning models for time series forecasting, are employed to explore the proposed hybrid models' performance and to investigate the effectiveness of incorporating statistical features corresponding to the input data. The list of benchmarking models is provided in Table 2. LSTM is the most widely used deep learning technique for time series forecasting tasks. 21, [44] [45] [46] ATT is a method based on multihead attention that takes one input, which is the original time series, to perform forecasting. The main part of the CNN model is the 1D convolutional layer, which takes only one input.
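To make the sliding-kernel idea and the two-branch merge concrete, the following is a minimal pure-Python sketch. It is not the Keras model used in the study: the function names, the hand-picked kernels, and the toy inputs are illustrative assumptions, and in a real network the kernel weights are learned during training.

```python
from typing import List


def conv1d_valid(x: List[float], kernel: List[float], bias: float = 0.0) -> List[float]:
    """Slide `kernel` over `x` with stride 1 and no padding, then apply ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(x) - k + 1):
        window = x[start:start + k]
        value = sum(w * v for w, v in zip(kernel, window)) + bias
        out.append(max(0.0, value))
    return out


def merge_branches(main_features: List[float], stat_features: List[float]) -> List[float]:
    """Concatenate the flattened outputs of the two branches."""
    return list(main_features) + list(stat_features)


# Branch 1: one input window of the series (input_1), hypothetical values.
input_1 = [1.0, 2.0, 4.0, 7.0, 11.0]
branch_1 = conv1d_valid(input_1, kernel=[-1.0, 1.0])   # first differences

# Branch 2: the statistical feature vector of the same window (input_2).
input_2 = [0.4, 1.0]
branch_2 = conv1d_valid(input_2, kernel=[0.5, 0.5])    # local average

# The merged vector is what a fully connected layer would receive.
merged = merge_branches(branch_1, branch_2)
```

With a single input branch and no merge step, the same `conv1d_valid` call corresponds to the one-input CNN benchmark.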
We need to specify several hyperparameters to implement the proposed methods. Also, as the importance of the auxiliary features may vary, feature selection and hyperparameter selection must be performed simultaneously. The performance of the proposed methods and of all benchmarking models depends mainly on choosing the hyperparameters that yield the best results. The grid search method is a common method to select optimal hyperparameters. However, it requires substantial computational resources, and investigating the entire hyperparameter space may not be possible, especially in deep learning applications. In this study, the hyperparameter tuning and feature selection tasks are performed via the BO algorithm 47, 48 for the proposed methods to overcome the grid search method's shortcoming. It should be noted that hyperparameter selection is also employed for all utilized benchmarking models. The process of hyperparameter tuning is illustrated in Figure 3. In the BO algorithm, the suitability of each model obtained using each hyperparameter set is measured using the error on the validation set. In other words, the algorithm finds the hyperparameters and feature sets that minimize the validation loss. We use the root mean square error (RMSE) as the loss function for training all models.

In this study, the proposed and the benchmarking methods are implemented with Keras 2.2.4, 49 a Python deep learning library, on GPU. The hyperparameters of all models used throughout the paper have been optimized using the BO algorithm. To prevent the models from overfitting and to improve their generalization to new data, we use early stopping. 50 With early stopping in place, we set the epoch limit to 500.

To perform the experiments, we used the COVID-19 data obtained from the Humanitarian Data Exchange (HDX). 51 The description of the utilized data is provided in Table 3.
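The early-stopping rule mentioned in the training setup above can be sketched as follows. This is a generic patience-based variant under stated assumptions: only the epoch limit of 500 comes from the text, while the `patience` value, the function name, and the toy validation-loss curve are hypothetical.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    val_loss_at: Callable[[int], float],
    max_epochs: int = 500,   # epoch limit stated in the text
    patience: int = 10,      # hypothetical value, not given in the text
) -> Tuple[int, float, int]:
    """Return (best_epoch, best_loss, epochs_run).

    `val_loss_at(epoch)` stands in for training one epoch and then
    measuring the loss on the validation set.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch in range(max_epochs):
        epochs_run = epoch + 1
        loss = val_loss_at(epoch)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_loss, epochs_run


# Toy validation curve: improves until epoch 30, then degrades.
result = train_with_early_stopping(lambda e: (e - 30) ** 2 / 100 + 1.0)
```

Because training halts once the validation loss stops improving, the epoch limit acts only as an upper bound on the number of epochs.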
The data are the time series of the confirmed cases for the 10 countries with the highest number of cases. The holdout method is adopted, where every time series is split into a training set (the first 80%) and a test set (the last 20% of the time series) to perform model building. Also, the last 20% of the training set is used as the validation set to evaluate models in the training phase.

We use two primary measures to evaluate the performance of the proposed COVID-19 time series forecasting methods: (1) symmetric mean absolute percentage error (SMAPE), which is a popular measure in time series forecasting tasks, and (2) RMSE. The following equations provide the definitions of SMAPE and RMSE, where ŷ_t and y_t are the predicted and actual values at time point t.

As mentioned before, the proposed methods have various hyperparameters. Also, the feature selection process should be conducted to choose the informative features corresponding to each model. We tune the hyperparameters and select the features using the process described in Section 3.4. The BO algorithm uses Bayesian inference and a Gaussian process to select hyperparameters. To perform feature selection, each feature's domain is set to 0 or 1, where 1 indicates that the corresponding feature is selected and 0 indicates that it is not selected. Also, one important hyperparameter that significantly impacts time series forecasting accuracy is the input window size (Lag). The range of Lag is set to (10, 11, 12, 13, 14, 15) for all models. For each model, the parameter ranges that are utilized throughout the experiments are shown in Table 4. As the fully connected and output layers have been incorporated after the main layer of the proposed methods, we set the range of hyperparameters corresponding to these layers to be identical for all models. To limit the BO algorithm's search space, for these layers, we include their activation functions in the hyperparameter selection process.
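The holdout protocol and the two evaluation measures described above can be sketched in Python as follows. This is an illustration under assumptions: the function names are hypothetical, and the SMAPE shown is one common form (expressed in percent, with the mean of the absolute actual and predicted values in the denominator), which may differ in scaling from the exact definition used in the study.

```python
import math
from typing import List, Sequence, Tuple


def holdout_split(series: Sequence[float], test_frac: float = 0.2,
                  val_frac: float = 0.2) -> Tuple[Sequence[float], Sequence[float], Sequence[float]]:
    """Last 20% of the series is the test set; last 20% of the rest is validation."""
    n_test = int(round(len(series) * test_frac))
    train_full = series[:len(series) - n_test]
    test = series[len(series) - n_test:]
    n_val = int(round(len(train_full) * val_frac))
    train = train_full[:len(train_full) - n_val]
    val = train_full[len(train_full) - n_val:]
    return train, val, test


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def smape(actual: List[float], predicted: List[float]) -> float:
    """Symmetric MAPE in percent (assumed form; see the note above)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        if denom > 0:              # a 0/0 term contributes zero error
            total += abs(p - a) / denom
    return 100.0 * total / len(actual)


train, val, test = holdout_split(list(range(100)))
```

The split is chronological rather than random, so the validation and test sets always lie after the data used for fitting.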
For both layers, "ReLU" (rectified linear unit) and "Linear" activation functions 26 are considered.

The data preprocessing is divided into two main steps:

Step 1: Creating instances from the time series. In this step, the instances in input-output format are created. This is a mandatory step, as supervised learning algorithms require the input data to be in input-output format. Therefore, the COVID-19 time series corresponding to each country must be transformed. We take L as the size of the input window and O as the size of the output window (the forecast horizon), and subsequences of size L + O are extracted from the series. The first L points of a subsequence are considered the input, and the last O points are considered the output values. For a time series T: t_1, t_2, …, t_n with L = 5 and O = 2, the created instances are shown in Table 5. It should be noted that the instances created from the time series are the main input of the proposed methods.

Step 2: Computing statistical features from the input vector. In this step, from each input vector, the features listed in Table 1 are computed. To extract the features displayed in Table 1, we use the tsfeatures* package that has been implemented in R by Hyndman et al. 40

In this section, we compare the performance of the proposed methods, ATT_FE and CNN_FE, with that of their benchmark models. In this way, we can explore the superiority of each combined method over its benchmark method. The results of the hybrid model, CNN_FE, and the CNN model are provided in Table 7. In terms of SMAPE, CNN_FE outperforms CNN in eight countries: the United States, India, Russia, South Africa, Mexico, Peru, Colombia, and Iran. The results indicate that combining the statistical features with the CNN model significantly improves the forecasting performance. Similarly, in terms of RMSE, CNN_FE performs better than CNN.
Except for Brazil and Chile, CNN_FE outperforms CNN in all countries, which validates the effectiveness of our proposal.

In the previous subsection, the forecasting power of both hybrid models was compared with that of their basic models. In the following, both hybrid models are compared with LSTM, which is a popular time series forecasting model. Table 8 presents the results of the hybrid models and the LSTM model. As the results indicate, in terms of SMAPE, ATT_FE performs better than the other models in five countries. CNN_FE yields the best performance in three countries, and LSTM achieves the best SMAPE in only two countries (including Mexico). In terms of RMSE, ATT_FE obtains the best performance in five countries. CNN_FE and LSTM reach the best performance in three and two countries, respectively. The pairwise comparison of each hybrid method with LSTM indicates that the ATT_FE model outperforms LSTM in eight of the 10 countries in terms of SMAPE. Also, LSTM performs better than CNN_FE in terms of SMAPE. In terms of RMSE, ATT_FE outperforms LSTM in six cases. Furthermore, LSTM performs better than CNN_FE in terms of RMSE.

* https://github.com/robjhyndman/tsfeatures

In this section, the forecasted number of cases is plotted against the actual number for every country in Figures 4-13. In all of the following figures, the red line indicates the real values, and the green line corresponds to the cases forecasted using ATT_FE. The plot for the United States (Figure 4) shows that the cases forecasted with the ATT_FE model are very close to the real values. Also, at some points, the predicted and actual values overlap. Figure 5 illustrates Brazil's forecasted values, where the values forecasted using the ATT_FE model are close to the real values. For India, as illustrated in Figure 6, the error is insignificant, and there are overlaps at the majority of the points.
Similarly, the ATT_FE model forecasts the number of cases for Russia accurately, as the error is small and the overlaps at most of the points are apparent. The plot for South Africa (Figure 8) indicates that there are overlaps at the first time points; however, the error increases toward the end of the series. This can happen because our proposed models are trained using the historical data of COVID-19 confirmed cases and predict the future based on past data. As displayed in Figure 9, Mexico's forecasts are close to the actual values, and ATT_FE shows suitable performance. Also, the results for Peru and Chile are shown in Figures 10 and 11, respectively, which indicate that the error is relatively high at some time points. For Colombia and Iran, as displayed in Figures 12 and 13, respectively, the errors are insignificant, and there are overlaps between the forecasted and real values at the majority of time points.

In this study, two hybrid methods based on combining deep learning models, multihead attention (ATT_FE) and CNN (CNN_FE), with some auxiliary statistical features were developed for COVID-19 time series forecasting. The core advantage of the proposed methods is their capability to exploit both deep learning representations and statistical features in the forecasting task. Another contribution of the proposed models is that their design is based on the multiple-output forecasting strategy, enabling one-shot forecasting of multiple next days. The predictive power of the proposed techniques was explored on the COVID-19 time series data of 10 countries. The experiments indicated that in most countries, the ATT_FE model outperformed its nonhybrid counterpart (ATT). Similarly, the CNN_FE model achieved better performance than CNN in most countries. Besides, the comparison of performance with LSTM showed that ATT_FE obtained the best results among all methods.
The effectiveness of the proposed approach of utilizing the statistical features as auxiliary inputs to improve the performance of deep learning models is demonstrated. As future work, we intend to apply the proposed methods in other application domains, such as demand forecasting. Furthermore, we can perform transfer learning by employing the data of previous pandemics such as severe acute respiratory syndrome (SARS) to train the model and then use the learned representation to initialize the model's weights or fine-tune the model using the COVID-19 data.

1. An analysis of a nonlinear susceptible-exposed-infected-quarantine-recovered pandemic model of a novel coronavirus with delay effect
2. The species severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
3. An emotion care model using multimodal textual analysis on COVID-19
4. Identification of dominant risk factor involved in spread of COVID-19 using hesitant fuzzy MCDM methodology
5. Modeling and prediction of COVID-19 in Mexico applying mathematical and computational models
6. Optimal surveillance mitigation of COVID'19 disease outbreak: fractional order optimal control of compartment model
7. Threshold conditions for global stability of disease free state of COVID-19
8. Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe
9. Comparative analysis and forecasting of COVID-19 cases in various european countries with ARIMA, NARNN and LSTM approaches
10. Critical trends: tracking critical data
11. Exponentially increasing trend of infected patients with COVID-19 in Iran: a comparison of neural network and ARIMA forecasting models
12. Time series prediction for the epidemic trends of COVID-19 using the improved LSTM deep learning method: case studies in Russia, Peru and Iran
13. Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM
14. Forecasting of COVID-19 time series for countries in the world based on a hybrid approach combining the fractal dimension and fuzzy logic
15. Prediction and analysis of COVID-19 positive cases using deep learning models: a descriptive case study of india
16. Optimization method for forecasting confirmed cases of COVID-19 in China
17. Multiple ensemble neural network models with fuzzy response aggregation for predicting COVID-19 time series: the case of Mexico
18. Recent trends in deep learning based natural language processing
19. An optimized model using LSTM network for demand forecasting
20. Improving time series forecasting using LSTM and attention models
21. Feature engineering for mid-price prediction with deep learning
22. A new framework for predicting customer behavior in terms of RFM by considering the temporal aspect based on time series techniques
23. Understanding LSTM networks
24. Prediction of COVID-19 confirmed cases combining deep learning methods and Bayesian optimization
25. Attention is all you need
26. Deep Learning
27. Incorporating side information by adaptive convolution
28. Forecasting across time series databases using recurrent neural networks on groups of similar series: a clustering approach
29. Multiple-output modeling for multi-step-ahead time series forecasting
30. LSTM: a search space odyssey
31. Practical Bayesian support vector regression for financial time series prediction and market condition change detection
32. Multiple-output modeling for multi-step-ahead time series forecasting
33. Prediction and analysis of coronavirus disease
34. Data analysis on coronavirus spreading by macroscopic growth laws
35. Real-time forecasts of the COVID-19 epidemic in China from
36. Modeling and forecasting trend of COVID-19 epidemic in iran until May 13, 2020
37. Forecasting the novel coronavirus COVID-19
38. A methodological approach for predicting COVID-19 epidemic using EEMD-ANN hybrid model
39. Prediction modelling of COVID using machine learning methods from B-cell dataset
40. Large-scale unusual time series detection
41. Sentiment analysis of student feedback using multi-head attention fusion model of word and context embedding for LSTM
42. Multi-head attention with disagreement regularization
43. Deep convolutional neural networks for image classification: a comprehensive review
44. Improving demand forecasting with LSTM by taking into account the seasonality of data
45. Time series forecasting of petroleum production using deep LSTM recurrent networks
46. Deep learning with long short-term memory networks for financial market predictions
47. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning
48. Bayesian optimization for learning gaits under uncertainty
50. Early stopping-but when? Neural Networks: Tricks of the Trade

The authors received no specific funding for this study.

The data are publicly available at the Humanitarian Data Exchange (HDX): https://data.humdata.org/dataset/novelcoronavirus-2019-ncov-cases.

https://orcid.org/0000-0001-8615-5553
Reza Paki https://orcid.org/0000-0002-2692-7547
Aram Bahrini https://orcid.org/0000-0003-1552-8708