Random vector functional link neural network based ensemble deep learning for short-term load forecasting

Authors: Gao, Ruobin; Du, Liang; Suganthan, P. N.; Zhou, Qin; Yuen, Kum Fai
Date: 2021-07-30
Corresponding author: Ruobin Gao (email: gaor0009@e.ntu.edu.sg)

Abstract: Electricity load forecasting is crucial for power systems' planning and maintenance. However, its non-stationary and non-linear characteristics impose significant difficulties in anticipating future demand. This paper proposes a novel ensemble deep Random Vector Functional Link (edRVFL) network for electricity load forecasting. The weights of the hidden layers are randomly initialized and kept fixed during the training process. The hidden layers are stacked to enforce deep representation learning. The model then generates forecasts by ensembling the outputs of all layers. Moreover, we propose to augment the random enhancement features by empirical wavelet transformation (EWT). The raw load data are decomposed by EWT in a walk-forward fashion, so that no future data leak into the decomposition process. Finally, all the sub-series generated by the EWT, together with the raw data, are fed into the edRVFL for forecasting. The proposed model is evaluated on twenty publicly available time series from the Australian Energy Market Operator for the year 2020. The simulation results demonstrate the proposed model's superior performance over eleven forecasting methods on three error metrics and statistical tests for electricity load forecasting tasks.

Forecasting electricity load accurately benefits electric power system planning for maintenance and construction. After collecting raw electricity demand, a reliable forecasting model established on historical data can approximate how much electricity is expected in the future. Therefore, accurate forecasts help the supplier decrease energy generation and expenses and plan resources efficiently [1]. Furthermore, short-term load forecasting models assist electricity organizations in making opportune decisions in a data-driven fashion. As a result, developing novel and accurate forecasting models for short-term load is beneficial.

Electricity load forecasting is a time series forecasting task. Anticipating the future with intelligent forecasting models is a well-developed field, where models established from historical data are used to extrapolate future values [2]. There are plentiful forecasting models, such as the auto-regressive integrated moving average (ARIMA) [3], fuzzy time series [4], support vector regression (SVR) [5], randomized neural networks [6], hybrid models [7-10], ensemble learning [11, 12] and deep learning models [13]. Accurate and reliable forecasting of electricity load is a challenging and significant problem for the electric power domain.

In the load forecasting domain, the methods can be classified into three categories: (i) statistical models, (ii) computational intelligence models and (iii) hybrid models. The statistical models, such as ARIMA [3] and exponential smoothing [14], are computationally efficient and theoretically solid, but their performance is not outstanding.
The second major branch is the computational intelligence models, including fuzzy systems [7, 15], SVR [5], shallow artificial neural networks (ANN) [6] and deep learning [13, 16-20]. In [16], a pooling deep recurrent neural network (RNN) is proposed to overcome the over-fitting problem caused by deep structures. A deep factored conditional restricted Boltzmann machine (FCRBM), whose parameters are optimized via a genetic wind-driven optimization (GWDO), is proposed for load forecasting in [17]. In [18], online tuning is utilized to update the deep RNN when its performance degrades. Several deep RNNs are evaluated for load forecasting in [19], where the input is selected from various weather- and schedule-related variables. The last category, hybrid models, combines feature extraction blocks with one or more forecasting models to form a single model. For example, empirical mode decomposition (EMD) is utilized to extract modes from the load, and a deep belief network (DBN) then forecasts each mode in [9]. Empirical wavelet transformation (EWT) is applied to decompose the load data into sub-series in a walk-forward fashion, and the concatenation of raw data and sub-series is then fed into a random vector functional link (RVFL) network for forecasting in [8].

Neural networks are popular models for load forecasting due to their high accuracy and strong ability to handle non-linearity. The deep learning models [13, 16-20] succeed in forecasting short-term load accurately because of their hierarchical structures, which learn a meaningful representation of the input data. However, most fully trained deep learning models suffer from heavy computational burdens. Therefore, this paper proposes a fast ensemble deep learning algorithm for short-term load forecasting. The proposed model inherits the advantages of ensemble learning and deep learning while imposing little additional computational burden. This paper investigates the forecasting ability of a special kind of randomized deep neural network, the deep RVFL network, whose training is fast. Ensemble learning techniques are combined with the deep RVFL network to further improve forecasting accuracy.

The EWT is an automatic signal decomposition algorithm with solid theoretical foundations and remarkable effectiveness in decomposing non-stationary time series data [24]. Unlike the discrete wavelet transform (DWT) and EMD [25], EWT precisely investigates the time series in the Fourier domain after a fast Fourier transform (FFT). It realizes spectrum separation using band-pass filtering with data-driven filter banks. Figure 1 shows the regular procedure of the EWT. In the EWT, limited freedom is provided for selecting wavelets. The algorithm employs Littlewood-Paley and Meyer's wavelets because of the analytic accessibility of the Fourier domain's closed-form expression [26]. In [24], these band-pass filters are formulated with a transitional bandwidth parameter $\gamma$ satisfying $\gamma < \min_n \left( \frac{\omega_{n+1} - \omega_n}{\omega_{n+1} + \omega_n} \right)$, where $\omega_n$ denotes the $n$-th boundary of the segmented Fourier spectrum. This condition empowers the formulated empirical scaling and wavelet functions $\{\phi_1(\omega), \{\psi_n(\omega)\}_{n=1}^{N}\}$ to form a tight frame of $L^2(\mathbb{R})$ [27]. It can be observed that $\{\phi_1(\omega), \{\psi_n(\omega)\}_{n=1}^{N}\}$ act as band-pass filters centered at assorted center frequencies. Plentiful works utilize signal decomposition techniques as a feature engineering block for forecasting algorithms [7-10, 28-30]; however, most do not implement the decomposition in a proper way [8, 30], as discussed next.
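For concreteness, the following is a minimal sketch of decomposing a single window of load data with the EWT. It assumes the third-party ewtpy package (an assumption on our part; the paper does not name its implementation), and the window length and mode count are illustrative. The walk-forward scheme discussed below repeats exactly this window-wise decomposition at every forecast origin.

```python
import numpy as np
import ewtpy  # assumed third-party EWT implementation (pip install ewtpy)

# A synthetic half-hourly "load" window: a daily cycle (48 points/day) plus noise.
rng = np.random.default_rng(0)
window = (1000
          + 200 * np.sin(2 * np.pi * np.arange(480) / 48)
          + 20 * rng.standard_normal(480))

# Decompose the window into N band-limited sub-series; the paper sets the level to 2.
modes, mfb, boundaries = ewtpy.EWT1D(window, N=2)
print(modes.shape)  # (480, 2): one column per extracted sub-series
```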
As mentioned in [7, 8, 30], direct application of a signal decomposition algorithm to the whole time series causes a data leakage problem for forecasting. The decomposed data are actually the output of convolution operations, and future data are inevitably involved in the convolution. Therefore, decomposing the whole time series at once is incorrect and improper, especially when establishing forecasting models. Several solutions have been proposed to avoid the future data leakage problem in decomposition-based forecasting models, such as data-driven padding [7], the moving window strategy [30] and walk-forward decomposition [8]. The data-driven padding approach trains a simple learning algorithm that pads its own forecast to the end of the time series [7]. The moving window strategy decomposes only the data located in the window (order), and the decomposed series are then fed into forecasting models [30]. Different from the moving window strategy, only part of the decomposed sub-series is used as input in the walk-forward decomposition. The moving window strategy is a special case of the walk-forward decomposition: when the order is equal to the window length, the two are the same. This paper adopts the walk-forward decomposition for the EWT. The walk-forward EWT decomposes the data in a rolling window of length w, consisting of the most recent w observations. Then only the last order data points of the decomposed sub-series are used as input for the forecasting model. Therefore, only historical observations are involved both in the decomposition process and in the model's training.

Inspired by deep representation learning, the deep RVFL [23] extends the RVFL, which has a shallow structure. The deep RVFL is established by stacking multiple enhancement layers to achieve deep representation learning. The clean data are fed into each enhancement layer to guide the random features' generation. In this fashion, the enhancement features of the hidden layers are generated based on the information from the clean data and the features from the previous layer. A diverse set of features is generated with the help of the hierarchical structure. Ensemble learning is introduced into the deep RVFL architecture to formulate the ensemble deep RVFL (edRVFL). Different from popular deep learning models with a single output layer, the edRVFL trains multiple output layers based on all the hidden features. Finally, the forecasts from all output layers are combined to produce the final forecast. For simplicity of presentation, we describe an edRVFL with L enhancement layers and N enhancement nodes in each layer. Figure 2 shows the architecture of the edRVFL network.

Suppose the input data are $X \in \mathbb{R}^{n \times d}$, where $n$ and $d$ represent the number of samples and the feature dimension, respectively; $d$ is the time lag (order) of the time series forecasting model. The features generated by the first enhancement layer are defined as
$$H_1 = g(X W_1),$$
where $W_1 \in \mathbb{R}^{d \times N}$ represents the weight matrix of the first enhancement layer, $H_1 \in \mathbb{R}^{n \times N}$ denotes the enhancement features and $g(\cdot)$ is a non-linear activation function. The readers can refer to [31] for a comprehensive evaluation of different activation functions. Then, for a deeper enhancement layer $l$, the enhancement features are computed as
$$H_l = g([H_{l-1}, X] W_l),$$
where $W_l \in \mathbb{R}^{(d+N) \times N}$ and $H_l \in \mathbb{R}^{n \times N}$. The enhancement weight matrices $W_1$ and $W_l$ are randomly initialized and remain fixed during training. The edRVFL computes the output weights by splitting the training into L smaller tasks, as sketched below.
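To make the training procedure concrete, below is a minimal NumPy sketch of an edRVFL forecaster under stated assumptions: a sigmoid activation, uniform random weights in [-1, 1], and one shared regularization parameter instead of the layer-wise tuning described later. The function name and default hyper-parameters are illustrative, not taken from the authors' implementation. The per-layer closed-form ridge solution it uses is derived in the following paragraphs.

```python
import numpy as np

def edrvfl_fit_predict(X, y, X_test, L=5, N=64, lam=0.1, seed=0):
    """Sketch of edRVFL: random fixed hidden weights, one ridge-regression
    output layer per enhancement layer, mean-ensembled forecasts."""
    rng = np.random.default_rng(seed)
    g = lambda z: 1.0 / (1.0 + np.exp(-z))      # sigmoid activation (assumption)
    F, Ft = X, X_test                           # features entering the next layer
    preds = []
    for _ in range(L):
        W = rng.uniform(-1, 1, size=(F.shape[1], N))        # random, kept fixed
        H, Ht = g(F @ W), g(Ft @ W)             # enhancement features of this layer
        D, Dt = np.hstack([H, X]), np.hstack([Ht, X_test])  # direct link to raw input
        # Closed-form ridge solution for this layer's output weights (derived below)
        beta = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)
        preds.append(Dt @ beta)
        F, Ft = D, Dt                           # clean data re-enter the next layer
    return np.mean(preds, axis=0)               # Mea-edRVFL ensemble

y_hat = None  # e.g., y_hat = edrvfl_fit_predict(X_train, y_train, X_test)
```

Replacing np.mean with np.median in the return statement yields the median-ensemble variant discussed below.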
The output weights are calculated separately for each layer. There are several differences between using only the last layer's features and using all layers' features for decisions. Most deep learning models use only the last layer's features for decisions; as a result, the information in the intermediate features is lost. Using all layers' features at once requires computation on a feature matrix of huge dimension. Moreover, both of the above architectures train only one network, whereas our method benefits from the ensemble approach, which reduces the uncertainty of a single model. The loss function of the $l$-th enhancement layer is defined as
$$\text{Loss}_l = \| D \beta_l - Y \|_2^2 + \lambda \| \beta_l \|_2^2,$$
where $\beta_l$ denotes the output weight vector of the $l$-th layer, $Y$ is the target vector and $\lambda$ is the regularization parameter. The minimization of $\text{Loss}_l$ admits a closed-form solution based on ridge regression [32]:
$$\beta_l = (D^{T} D + \lambda I)^{-1} D^{T} Y,$$
where $D = [H_l, X]$. After computing all $\beta_l$, the deep network outputs L forecasts. The final forecast is an ensemble of all outputs. Any forecast combination approach can be applied in this procedure [33]. According to the suggestions in [33], the mean or median operation is generally likely to improve the combination's performance. Therefore, we use the mean and the median as combination operators. Correspondingly, two different edRVFLs are proposed, the Mea-edRVFL and the Med-edRVFL.

The model EWT-edRVFL consists of two blocks, the walk-forward EWT decomposition and the edRVFL. The walk-forward EWT is first applied to the load data to extract features in a causal fashion. Then the raw data concatenated with the sub-series are fed into the edRVFL with L enhancement layers for learning purposes. The output weights $\beta_l$ of the $l$-th enhancement layer are computed according to the closed-form ridge solution above. Finally, we ensemble the L forecasts with the mean or median operation to obtain the output $\hat{y}$. Correspondingly, two different EWT-edRVFLs are proposed, the EWTMea-edRVFL and the EWTMed-edRVFL.

Since a higher enhancement layer's performance depends on the lower ones', the hyper-parameters of the whole model are tuned in a layer-wise fashion. Once a shallower layer's hyper-parameters are determined, they are fixed, and the cross-validation approach is applied to the next layer. Layer-wise cross-validation offers a different set of hyper-parameters for each layer. Therefore, each enhancement layer has its own regularization parameter, which helps the overall edRVFL learn a diverse set of output layers.

This section presents the empirical study on twenty load time series collected from the Australian Energy Market Operator (AEMO). First, we briefly introduce the data's characteristics and the pre-processing steps. Then, the benchmark models and the hyper-parameter optimization are described. Finally, the simulation results are shown and discussed. Table I summarizes the descriptive statistics of the twenty load time series. These load data are collected from the states of South Australia (SA), Queensland (QLD), New South Wales (NSW), Victoria (VIC), and Tasmania (TAS) for the year 2020, which was significantly affected by Covid-19. Four months, January, April, July, and October, are selected to reflect the four seasons' characteristics as in [8, 9, 34]. The data are recorded every half hour, so there are 48 data points per day. A suitable and correct data pre-processing approach helps the machine learning model generate accurate outputs. We utilize min-max normalization to pre-process the raw data; a short sketch follows, and the next paragraph gives the exact formula.
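As a small illustrative sketch (the helper name is ours), the normalization statistics are computed from the training split only, so that no test-set information leaks into the pre-processing:

```python
import numpy as np

def minmax_normalize(train, series):
    """Map values into [0, 1] using the TRAINING set's minimum and maximum only."""
    x_min, x_max = float(np.min(train)), float(np.max(train))
    return (series - x_min) / (x_max - x_min)
```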
Let the maximum and minimum of the training set be $x_{\max}$ and $x_{\min}$, respectively. The data are transformed into the range [0, 1] using the following equation:
$$x_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
where $x_{\text{normalized}}$ and $x$ represent the normalized and original time series, respectively. All datasets are split into three sets, the training, validation and test sets, to adopt cross-validation [35]. The validation and test sets account for 10% and 20% of the dataset, respectively. The remaining data are used as the training set.

Three forecasting error metrics are employed to appraise the accuracy of the models. The first is the regular root mean square error (RMSE), defined as
$$\text{RMSE} = \sqrt{\frac{1}{L_{\text{test}}} \sum_{j=1}^{L_{\text{test}}} (x_j - \hat{x}_j)^2},$$
where $L_{\text{test}}$ is the size of the test set and $x_j$ and $\hat{x}_j$ are the raw data and the predictions, respectively. The second error metric implemented in this paper is the mean absolute scaled error (MASE) [36], defined as
$$\text{MASE} = \frac{\frac{1}{L_{\text{test}}} \sum_{j=1}^{L_{\text{test}}} |x_j - \hat{x}_j|}{\frac{1}{L_{\text{train}} - 1} \sum_{i=2}^{L_{\text{train}}} |x_i - x_{i-1}|},$$
where $L_{\text{train}}$ represents the size of the training set. The denominator of the MASE is the mean absolute error of the in-sample naive forecast. The third error metric is the mean absolute percentage error (MAPE), defined as
$$\text{MAPE} = \frac{100\%}{L_{\text{test}}} \sum_{j=1}^{L_{\text{test}}} \left| \frac{x_j - \hat{x}_j}{x_j} \right|.$$

We compare the proposed model with many classical and state-of-the-art models: the Persistence model [2], ARIMA [3], SVR [5], MLP [13], LSTM [37], the temporal CNN (TCN) [38], the hybrid EWT fuzzy cognitive map (FCM) learned with SVR (EWTFCMSVR) [7], the wavelet high-order FCM (WHFCM) [29], the Laplacian ESN (LapESN) [39], EWTRVFL [8] and RVFL [6]. The previous day's 48 data points are used as input for all the models, as in [8]. To achieve a fair comparison, all models' hyper-parameters are optimized by cross-validation. The hyper-parameter search space is presented in Table II. The decomposition level for the walk-forward EWT is set to 2 according to the conclusions and suggestions in [8]. Some parameters are excluded from the optimization process and set to the same values for all relevant models: a batch size of 32, a learning rate of 0.001 and 200 epochs.

Tables III, IV and V summarize the performance on the test sets. The numbers in bold indicate the best performance on the corresponding time series. Figures 3, 4, 5, 6 and 7 compare the raw data with the forecasts generated by the proposed model; the proposed model anticipates future trends, cycles, and fluctuations accurately. Statistical tests are implemented to investigate the differences among the models further. We first implement the Friedman test; the p-value is smaller than 0.05, which indicates that these forecasting models are significantly different on the twenty datasets. Therefore, a post-hoc Nemenyi test is utilized to distinguish them [40]. The critical distance of the Nemenyi test is calculated by
$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6 N_d}},$$
where $q_\alpha$ is the critical value of the studentized range statistic divided by $\sqrt{2}$, $k$ represents the number of models and $N_d$ is the number of datasets [40] (a short code sketch of the metrics and of this critical distance follows). Figure 8 presents the Nemenyi test results; the models that achieve excellent performance are at the top, whereas the model with the worst performance is at the bottom. Some consistent conclusions can be drawn from the Nemenyi test results across the three error metrics. The Persistence method ranks last because it learns nothing about the patterns. ARIMA ranks second to last because of its simple linear structure.
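For reference, here is a compact sketch of the three error metrics defined above and of the Nemenyi critical distance. The helper names are ours, and q_alpha must be supplied from a studentized-range table (already divided by the square root of 2), since the paper does not list the value it used.

```python
import numpy as np

def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))

def mase(actual, pred, train):
    # Denominator: in-sample MAE of the one-step naive forecast on the training set.
    naive_mae = np.mean(np.abs(np.diff(train)))
    return np.mean(np.abs(actual - pred)) / naive_mae

def mape(actual, pred):
    return 100.0 * np.mean(np.abs((actual - pred) / actual))

def nemenyi_cd(q_alpha, k, n_datasets):
    # q_alpha: studentized-range critical value divided by sqrt(2), from a table.
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))
```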
The LSTM model outperforms many of the benchmark models, except the EWTRVFL and the models proposed in this paper. Figure 8 demonstrates the superiority of the proposed models, as they are always at the top. Another finding is that the edRVFL with the mean ensemble operator is better than the one with the median operator.

Table X records the optimization and training times. It is worth noting that the optimization time is the time of the grid-search cross-validation, while the training time is the time to train the model with the hyper-parameters selected by the cross-validation. The time for RVFL-related models is the sum over twenty runs. Several observations can be drawn from Table X. The most time-consuming model is the LSTM because of its recurrent structure, which processes the data sequentially. The hybrid RVFL models with EWT are more time-consuming than their plain RVFL counterparts: the EWT-RVFL and EWT-edRVFL take longer than the RVFL and edRVFL, respectively. Therefore, the main computational cost lies in the walk-forward EWT decomposition block, because it runs at every step.

This paper proposes a novel ensemble deep RVFL network combined with walk-forward decomposition for short-term load forecasting. The enhancement layers' weights are randomly initialized and kept fixed, as in the shallow RVFL network. Only the output weights of each layer are computed, in closed form. Since the enhancement features are unsupervised and randomly initialized, the walk-forward EWT is implemented to augment the feature extraction. The walk-forward EWT differs from most of the literature, where the whole time series is decomposed at once. Therefore, there is no data leakage problem during the decomposition process. There are several reasons for the superiority of the proposed model:
1. The edRVFL's structure benefits from ensemble learning. The edRVFL treats each enhancement layer as a single forecaster, so ensembling multiple forecasters reduces the uncertainty of any single forecaster.
2. The clean raw data are fed into all enhancement layers to calibrate the random features' generation.
3. The output layer learns both the linear patterns from the direct link and the non-linear patterns from the enhancement features.
4. The walk-forward EWT is used as a feature engineering block to boost the accuracy further.
Although our model shows its superiority on these twenty datasets, some limitations remain. For the walk-forward EWT process, whether to discard the highest-frequency component is an open problem; it is challenging to determine how much valuable information the highest-frequency component carries. Moreover, other learning techniques, such as incremental learning and semi-supervised learning, could be considered to boost the performance further.

The authors thank the anonymous reviewers for providing valuable comments to improve this paper.
TABLE X: Optimization and training times.

Model | Optimization time | Training time
ARIMA [3] | 42.595 | 3.692
SVR [5] | 4.058 | 0.109
MLP [41] | 65.260 | 4.386
LSTM [37] | 1561.631 | 150.642
TCN [38] | 171.563 | 50.919
EWTFCMSVR [7] | 40.528 | 26.531
WHFCM [29] | 21.755 | 0.130
LapESN [39] | 29.182 | 6.078
RVFL [6] | 1.689 | 0.140
EWTRVFL [8] | 42.518 | 2.060
edRVFL | 31.859 | 7.307
EWT-edRVFL | 75.620 | 14.067

REFERENCES
[1] Short-term electricity price and load forecasting in isolated power grids based on composite neural network and gravitational search optimization algorithm.
[2] Forecasting methods and applications.
[3] ARIMA models to predict next-day electricity prices.
[4] Parsimonious fuzzy time series modelling.
[5] Load forecasting using support vector machines: A study on EUNITE competition 2001.
[6] Random vector functional link network for short-term electricity load demand forecasting.
[7] Robust empirical wavelet fuzzy cognitive map for time series forecasting.
[8] Walk-forward empirical wavelet random vector functional link for time series forecasting.
[9] Empirical mode decomposition based ensemble deep learning for load demand time series forecasting.
[10] A comparative study of empirical mode decomposition-based short-term wind speed forecasting methods.
[11] Oblique random forest ensemble via least square estimation for time series forecasting.
[12] Ensemble incremental learning random vector functional link network for short-term electric load forecasting.
[13] A review of deep learning methods applied on load forecasting.
[14] Short-term load forecasting with exponentially weighted methods.
[15] Load forecasting through estimated parametrized based fuzzy inference system in smart grids.
[16] Deep learning for household load forecasting: a novel pooling deep RNN.
[17] Electric load forecasting based on deep learning and optimized by heuristic algorithm in smart grid.
[18] Deep learning for load forecasting with smart meter data: Online adaptive recurrent neural network.
[19] Robust short-term electrical load forecasting framework for commercial buildings using deep recurrent neural networks.
[20] Ensemble deep learning for regression and time series forecasting.
[21] Random vector functional link networks for function approximation on manifolds.
[22] A novel empirical mode decomposition with support vector regression for wind speed forecasting.
[23] Random vector functional link neural network based ensemble deep learning.
[24] Empirical wavelet transform.
[25] Empirical mode decomposition as a filter bank.
[26] Ten lectures on the probabilistic method.
[27] The art of frame theory.
[28] Energy load forecasting using empirical mode decomposition and support vector regression.
[29] Time-series forecasting based on high-order fuzzy cognitive maps and wavelet transform.
[30] A new crude oil price forecasting model based on variational mode decomposition.
[31] A comprehensive evaluation of random vector functional link networks.
[32] Ridge regression learning algorithm in dual variables.
[33] Handbook of economic forecasting.
[34] A novel evolutionary-based deep convolutional neural network model for intelligent load forecasting.
[35] On the use of cross-validation for time series predictor evaluation.
[36] Another look at measures of forecast accuracy.
[37] Short-term residential load forecasting based on LSTM recurrent neural network.
[38] An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.
[39] Laplacian echo state network for multivariate time series prediction.
[40] Statistical comparisons of classifiers over multiple data sets.
[41] An efficient approach for short term load forecasting using artificial neural networks.