title: Performance of Deep Learning models with transfer learning for multiple-step-ahead forecasts in monthly time series
authors: Solís, Martín; Calvo-Valverde, Luis-Alexander
date: 2022-03-18

Deep learning and transfer learning models are being used to generate time series forecasts; however, there is scarce evidence about their predictive performance, and this scarcity is even more evident for monthly time series. The purpose of this paper is to compare deep learning models, with and without transfer learning, against other traditional methods used for monthly forecasts, in order to answer three questions about the suitability of deep learning and transfer learning for time series prediction. Time series from the M4 and M3 competitions were used for the experiments. The results suggest that deep learning models based on TCN, LSTM, and CNN with transfer learning tend to surpass the predictive performance of the other, more traditional methods. On the other hand, TCN and LSTM trained directly on the target time series achieved similar or better performance than traditional methods for some forecast horizons.

Transfer learning is a technique used successfully in computer vision and NLP to extract knowledge from a source domain and use it when learning a model on a target domain [7]. Although transfer learning has not been widely used for deep learning models of time series [8], its application has emerged recently. For example, some authors have used transfer learning to deal with the scarcity of labeled time series in classification problems [e.g. 9, 10]. Qi-Qiao et al. [8] used transfer learning to make predictions with financial time series extracted from stock markets. Otovic et al. [11] and Poghosyan et al. [12] showed that a model pre-trained on a global time-series database can be used with transfer learning to obtain good predictions for other time series, even ones that come from a different domain.

Due to the relevance that deep learning and transfer learning models have acquired for time series forecasting, there is a need to benchmark them against more traditional methods [2]. Performance comparisons have been made between statistical forecasting methods and classical machine learning methods [13, 14, 15, 16]. However, comparisons and analyses of the predictive performance of those methods against deep learning models are scarce. This research aims to contribute to that knowledge gap by answering the following questions:

1. Are deep learning models for multiple-step-ahead forecasts in monthly time series more effective, in terms of predictive performance, than traditional methods?
2. How does the performance of deep learning models change with the forecast horizon, compared to the traditional models?
3. Are there groups of time series for which deep learning methods and the application of transfer learning are more effective?

We hope that this work provides valuable information about the suitability of deep learning models and the use of pre-trained models to make predictions with transfer learning on new target time series. The main contributions of this paper are as follows:

1. We answer three questions about the effectiveness of deep learning models for monthly forecasts that have not been addressed before.
2. Although there are benchmarks of time series forecasting methods, our work is, as far as we know, the first to compare and analyze the performance of deep learning models, with and without transfer learning, against more traditional machine learning algorithms and statistical methods.
3. The performance of deep learning models with transfer learning for monthly forecasts is analyzed across relevant characteristics of the time series.

Therefore, this work can be used as a reference point on the suitability of applying deep learning models and transfer learning to monthly forecasting tasks.

Laptev, Yu, and Rajagopal [17] transferred time-series features across diverse domains and noted that there was no prior work on the application of transfer learning to time series. Since 2018, studies have emerged that apply transfer learning to deep learning models to generate forecasts for a target time series. Those models have been applied to different tasks, such as time series classification [e.g. 9, 10, 18], time series regression [11, 12], and anomaly detection in time series [19, 20].

In the case of regression tasks (the scope of this research), transfer learning has been used in several areas. In finance, Qi-Qiao et al. [8] evaluated the effectiveness of transfer learning for stock price prediction using a two-layer neural network and a two-layer LSTM. Xu and Meng [21] developed a novel hybrid transfer learning model for energy consumption forecasting based on time series decomposition. Le et al. [22] were concerned with the computational time needed to train separate models to predict the energy consumption of the apartments in the same building; they therefore developed a framework for multiple electric energy consumption forecasting in smart buildings based on transfer learning. Motivated by the problem of developing predictive models with limited data for energy load and power generation, [23] proposed a transfer learning strategy using a Convolutional Neural Network. Karb et al. [24] were also concerned with making predictions for limited time series, but in the domain of food sales of new products; they proposed a network-based transfer learning approach for deep neural networks to create effective predictive models. For crude oil price forecasting, Xiao et al. [25] generated a hybrid transfer learning-based analog complexing model (HTLM). Otovic et al. [11] analyzed whether knowledge transfer between related domains is more beneficial than knowledge transfer between unrelated domains in classification and regression time-series tasks, using datasets from diverse areas such as seismic data, acoustic signals, medical data, and stock-market prices. Xin and Peng [26] combined autoencoders, convolutional neural networks (AE-CNN), and transfer learning to capture the intrinsic certainty of chaotic time series.

The authors of all the previous studies showed that their models were better than the baselines. In some cases the comparison was against other deep learning models, and in other cases against statistical models such as ARIMA. The conclusions often highlight the promise of transfer learning for time series.

Some comparisons have recently been made between the performance of various algorithms and methods for time series forecasting. One of the most cited is the work of [16]. Those authors compared the performance of seven statistical models and ten machine learning models using the M3 competition.
Due to the high computational time, they applied deep learning models such as the LSTM and RNN, but without transfer learning and with very simple architectures. They conclude that traditional statistical methods are more accurate than machine learning ones. Papacharalampous et al. [15] also extensively compared several stochastic and ML techniques for forecasting hydrological processes. The machine learning models used were:
✓ Three simple neural networks (single hidden layer and multilayer perceptron (MLP)).
✓ Three random forests.
✓ Three support vector machines.
In contrast to [16], they conclude that the stochastic and ML methods can have quite similar performance when forecasting hydrological time series of small length, but that in linear situations the ML methods are more likely to be inferior, while in non-linear situations the ML methods are more likely to outperform. Parmezan et al. [14] provide a comparison between popular statistical methods and machine learning models (Support Vector Machines, kNN-TSPI, Multilayer Perceptron (MLP), and LSTM), using synthetic and real time series from different domains. They conclude that SARIMA is the best for deterministic series, kNN-TSPI and SARIMA are the best for stochastic series, and SVM is the most stable method for chaotic series. Catal et al. [13] also compared machine learning models (Linear Regression, Bayesian Regression, Neural Network Regression, Decision Forest Regression, and Boosted Decision Tree Regression) with statistical models using the Walmart sales dataset. They found that the Boosted Decision Tree Regression algorithm was the best predictor. On the other hand, Lara-Benitez et al. [27] carried out an experimental study comparing the performance of the most popular deep learning architectures for time series prediction using different datasets. In that study they did not compare against other machine learning algorithms or statistical methods, nor did they analyze the effect of transfer learning on predictive performance. LSTM, CNN, and TCN were among the networks with the best results.

The time series used come from the M3 and M4 competitions. Two sets called A and B, each of 1000 time series, were selected randomly from the monthly M4 set, which contains 48,000 time series, and a third set named B_M3 was composed of all the monthly time series of the M3 competition. Every time series taken from the M4 competition includes a train and a test subset. The M4 testing subset of each time series in set A was used to calculate the models' performance metrics. This subset is made up of 18 months. As the research questions require analyzing multiple-step-ahead forecast performance for different numbers of steps, we pre-processed the 18 months to make predictions one, three, six, and twelve steps ahead. For example, for three steps ahead, the 18 test months were transformed into sixteen prediction instances (Table 1); a sketch of this windowing is given after the list of input sizes below.

The machine learning and deep learning models were trained using different input sizes for each forecast horizon (time steps ahead). For 3, 6, and 12 steps ahead, we applied an input window size of:
✓ the same size as the forecast horizon. This input size was chosen because Shynkevich et al. [28] found that the highest prediction performance is achieved when the input window length is approximately equal to the forecast horizon.
✓ 1.25 times the forecast horizon. This input size was chosen because Lara-Benitez et al. [27] found better performance with an input size of 1.25 times the forecast horizon than with larger ones.
✓ 12 months. This input size was chosen to incorporate seasonal information into the models.
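To make the windowing concrete, the sketch below builds multi-step prediction instances whose target windows slide over the test months while the inputs are taken from the months immediately preceding each target; with an 18-month test segment and a 3-step horizon it yields the sixteen instances mentioned above. The function name, toy series, and split index are illustrative assumptions, not the authors' code.

```python
import numpy as np

def make_test_instances(series, test_start, horizon, input_size):
    """Build (input, target) pairs whose targets slide over the test months.

    `series` is the full series (train + test) and `test_start` the index of the
    first test month. Each input window covers the `input_size` months that
    immediately precede its target window, so it may overlap the training part.
    """
    X, y = [], []
    last_start = len(series) - horizon            # last position where a full target fits
    for t in range(test_start, last_start + 1):
        X.append(series[t - input_size:t])        # input window ending right before the target
        y.append(series[t:t + horizon])           # multi-step-ahead target
    return np.array(X), np.array(y)

# Toy example: 42 "training" months followed by the 18 test months.
# 18 test months and a 3-step horizon give 18 - 3 + 1 = 16 prediction instances.
full_series = np.arange(60, dtype=float)
X, y = make_test_instances(full_series, test_start=42, horizon=3, input_size=3)
print(X.shape, y.shape)  # (16, 3) (16, 3)
```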
The procedure for training and testing each of the methods used is explained below.

Figure 1 describes the process. The source set was used to develop the models, and set A was used to apply transfer learning and obtain the predictions for the test sample. There were two kinds of source datasets: the B set (1000 time series selected randomly from the M4 competition) and the B_M3 set (the monthly time series of the M3 competition). The process was as follows. First, the input window size and output horizon size were selected for the preprocessing of each time series in every set. Then the data were divided into training, validation, and testing samples. The training and validation samples of all the time series in the source set were concatenated to train the models and to select the hyperparameters and architectures using Bayesian optimization. The search grid for the Bayesian optimization is given in Table 2. Each model was trained using the Adam optimizer and a stopping criterion that halts training after two epochs without improvement in the validation loss. The loss function was the mean absolute percentage error, and the batch size was equal to one.

After a model has been generated, transfer learning is applied for every time series of set A, using its training and validation samples. The model is then used to predict the A test subset, and finally the performance metric is computed. Only the last layer of the model was unfrozen, and the weights were updated with a learning rate of 0.000005, which is lower than the one used in the training phase. We generated two versions of the deep learning models applied with transfer learning: in the first version we trained the CNN, LSTM, and TCN using the B set, and in the second version we trained these networks using the B_M3 set. The purpose of the two versions was to assess how transfer learning works when the models have been trained on different datasets.
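The fine-tuning step just described (only the last layer unfrozen, Adam with a learning rate of 0.000005, MAPE loss, batch size of one, and stopping after two epochs without validation improvement) could be written roughly as in the following Keras sketch. The function, the cloning of the pre-trained network, and the epoch cap are our own illustrative assumptions, not the authors' published code.

```python
import tensorflow as tf

def fine_tune_last_layer(pretrained, X_train, y_train, X_val, y_val):
    """Copy a pre-trained network, freeze everything except its last layer,
    and retrain it on one target series with a very small learning rate."""
    model = tf.keras.models.clone_model(pretrained)
    model.set_weights(pretrained.get_weights())

    for layer in model.layers[:-1]:   # keep all weights fixed except the last layer
        layer.trainable = False

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),  # lower than in source training
        loss="mean_absolute_percentage_error",
    )
    stopper = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=2, restore_best_weights=True
    )
    model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        batch_size=1, epochs=100,      # epoch cap is arbitrary; early stopping ends training
        callbacks=[stopper], verbose=0,
    )
    return model
```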
The statistical methods used were auto-ARIMA, ETS, and Theta.

Figure 3 describes the training and testing process for the models trained directly on the target series. The training and validation samples of each time series in set A were put together to train the models, and the test sample was used to calculate the performance metrics. The first step is to choose the number of outputs to predict; the models were then trained for each time series, and finally the test sample was used to obtain the predictions and the performance metrics.

To evaluate the performance of the models, two metrics were estimated on the test sample: MAPE and sMAPE. We chose these metrics because they allow performance to be compared across time series independently of scale. The time series were normalized only for the deep learning models, and the metrics were estimated on the original scale after converting the predictions back to that scale.

Table 4 shows the performance of each model and forecast horizon based on MAPE and sMAPE. The machine learning and deep learning models were trained with different input window sizes for each output horizon; we report the models with the input size that yielded the best performance. According to the results, the statistical models have more stable performance across the forecast horizons, whereas the machine learning models show an increase in error as the forecast horizon grows. The deep learning models with transfer learning tend to show the best MAPE and sMAPE at each forecast horizon, but to determine whether there are statistically significant differences between models, critical difference (CD) diagrams were generated following the procedure of the Python package autorank [29]. In a CD diagram, the vertical lines show the average rank of each model compared to the others; this average is calculated from the performance metric on the test sample. The horizontal line represents the critical difference for the comparison between models. When the distance between the average ranks of two models is greater than the critical difference, there is a statistically significant difference in performance at a significance level of 0.05. Therefore, when a horizontal line connects the vertical lines of the average ranks, there is no significant difference between those models.

Figure 4 shows the sMAPE CD diagrams for each forecast horizon, and Figure 5 the MAPE CD diagrams. The results are similar in both figures. Based on the hypothesis tests, the best models when the forecast horizons are 1 and 3 months were the TCN with transfer learning and the LSTM with transfer learning; which comes first and which second depends on whether we look at the MAPE or the sMAPE figure. Another finding is that the statistical models tend to be in the last positions, so they are not a good option for a short forecast horizon. When the forecast horizon is larger, the LSTM with transfer learning was first, and the TCN or LSTM trained with the M3 dataset was second. The CNN and TCN without transfer learning tend to be the worst options for a larger horizon. Interestingly, the deep learning models with transfer learning but trained on a different dataset show good results, as they were located in the second subset of the best models at each forecast horizon. The monthly M3 dataset used to train these models is not very different from the monthly M4 dataset, but it does show some differences; for example, the M3 time series are less forecastable, trended, and linear [30]. This finding suggests that it is possible to get good results when the training dataset differs from the test dataset, as long as these differences are not large. Some deep learning models without transfer learning were in the third group of best performance when the forecast horizon is less than 12, despite their exposure to overfitting, given that monthly time series have few data points and these models have many parameters. Finally, the machine learning models (excluding the deep learning models) were neither in the third-best group nor in the worst group.
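A minimal sketch of how such CD diagrams can be produced with the autorank package [29] follows; the error table is a random stand-in for the per-series sMAPE values of set A, the model names are illustrative, and we assume that lower errors are ranked as better.

```python
import numpy as np
import pandas as pd
from autorank import autorank, plot_stats

# Toy stand-in for the real results: one sMAPE value per (series, model).
# In the study the rows would be the test series of set A and the columns the models.
rng = np.random.default_rng(0)
errors = pd.DataFrame({
    "TCN_transfer": rng.gamma(2.0, 5.0, size=100),
    "LSTM_transfer": rng.gamma(2.0, 5.5, size=100),
    "ETS": rng.gamma(2.0, 7.0, size=100),
    "Theta": rng.gamma(2.0, 7.5, size=100),
})

# alpha = 0.05 matches the significance level used in the text; lower error is better.
result = autorank(errors, alpha=0.05, order="ascending", verbose=False)
plot_stats(result)  # draws the critical-difference (CD) diagram
```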
In order to analyze the performance of the models according to the behavior of the time series, we estimated eight features for each time series and then applied PAM clustering to classify the series by their behavior. The features calculated were:
1. Forecastability, measured with entropy [31]: larger values occur when a time series is difficult to forecast.
2. Seasonality [31]: larger values mean more seasonal strength.
3. Linearity/nonlinearity [31]: takes large values when the series is non-linear and values around 0 when the series is linear.
4. Skewness [31]: negative skew indicates a negatively asymmetric distribution, positive skew indicates a positively asymmetric distribution, and values close to zero indicate a symmetric distribution.
5. Kurtosis [31]: larger values mean more concentration around the mean.
6. White noise, measured with the Box test [32]: higher values of the chi-square statistic mean the series is not white noise.
7. Outliers: proportion of outliers, computed using the approach of [33].
8. Stationarity, measured with the ADF test [34]: higher p-values indicate that the series is not stationary.

Before running PAM, the features were standardized. The silhouette score and the Calinski-Harabasz score were higher with two clusters; however, we decided to generate four clusters to have more cluster diversity. Based on the mean features of each cluster, we determined that the first and second clusters are composed of more unpredictable time series (see the entropy and white-noise metrics), but in the second group the series tend to be more non-linear, more asymmetric, and with a greater concentration of data points around the mean. Clusters 3 and 4 are more predictable, but in cluster 3 the series tend to have more seasonal strength, while in cluster 4 they tend to be more stationary. Another difference is that in cluster 4 the series tend to be more non-linear and to have more outliers. At the bottom of Table 2, we include brief names for each cluster based on the previous analysis.
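The sketch below illustrates the kind of feature extraction and PAM clustering described above, assuming scikit-learn-extra for the k-medoids (PAM) implementation and statsmodels for the Ljung-Box and ADF tests. It computes only a subset of the eight features (the seasonal-strength, linearity, and outlier measures of [31] and [33] are omitted for brevity), and the toy series stand in for the series of set A.

```python
import numpy as np
import pandas as pd
from scipy import signal, stats
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import adfuller
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn_extra.cluster import KMedoids  # PAM (k-medoids) implementation

def spectral_entropy(x):
    """Forecastability proxy: Shannon entropy of the normalized periodogram."""
    _, psd = signal.periodogram(x)
    psd = psd[psd > 0]
    psd = psd / psd.sum()
    return float(-(psd * np.log(psd)).sum() / np.log(len(psd)))

def series_features(x):
    """A subset of the eight features used above (illustrative names)."""
    return {
        "entropy": spectral_entropy(x),
        "skewness": stats.skew(x),
        "kurtosis": stats.kurtosis(x),
        # Ljung-Box chi-square statistic at lag 12 (statsmodels >= 0.13 returns a DataFrame)
        "ljung_box": acorr_ljungbox(x, lags=[12])["lb_stat"].iloc[0],
        "adf_pvalue": adfuller(x)[1],   # higher p-value -> less evidence of stationarity
    }

# Toy stand-in for the monthly series of set A (each 10 years long).
rng = np.random.default_rng(0)
series_list = [np.cumsum(rng.normal(size=120)) for _ in range(50)]

features = pd.DataFrame([series_features(x) for x in series_list])
Z = StandardScaler().fit_transform(features)        # standardize before clustering

pam = KMedoids(n_clusters=4, method="pam", random_state=0).fit(Z)
labels = pam.labels_
print(silhouette_score(Z, labels), calinski_harabasz_score(Z, labels))
```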
Figures 6 and 7 present the CD diagrams by series cluster, using sMAPE and MAPE, respectively. The main finding in both figures is that the TCN and LSTM with transfer learning were among the two best models; in some clusters the LSTM achieved the best performance, in others the TCN did, and the difference is statistically significant. The rankings in these figures also show that the deep learning models trained on a different dataset were frequently between the third and fifth positions. These results confirm that the performance of the deep learning models with transfer learning was among the best across the different groups of series, which range from the more predictable to the more unpredictable and noisy time series. The CNN without transfer learning was in the last position, and the TCN without transfer learning was never among the best seven models, regardless of the cluster. The LSTM was the best of the models trained without transfer learning, but in some clusters, such as unpredictable-nonlinear with concentration around the mean and predictable with stationary and nonlinear tendency, it was surpassed by Theta or ETS.

Three research questions were posited in the introduction; they are answered in this section. The first question was: are deep learning models for multiple-step-ahead forecasts in monthly time series more effective, in terms of predictive performance, than traditional models? Based on the MAPE and sMAPE metrics, the deep learning models TCN, LSTM, and CNN with transfer learning tend to be more effective than more traditional models such as Theta, ETS, ARIMA, random forest, XGBoost, and SVM. Those deep learning models were trained on a concatenated dataset of time series, and transfer learning was later applied to update the weights of the last layer for each new time series. The models trained on the M3 dataset also showed better performance than the traditional methods. The monthly M3 dataset has some similarities with the monthly M4 dataset but is not identical, because its time series tend to be less forecastable, trended, and linear [30]. This finding suggests that deep learning algorithms are a good option, and perhaps the best option, for building forecast models if we can assemble a training dataset of time series drawn from a population that is the same as, or at least only slightly different from, the population our target time series comes from. On the other hand, if we train the TCN, LSTM, and CNN directly on the target time series, the results suggest that the LSTM and TCN can reach similar or better performance than traditional methods, depending on the forecast horizon, while the CNN is not a good option. It is reasonable that the deep learning models perform worse when trained directly on the monthly target series, because this kind of series does not have many data points and deep learning models need to estimate many weights, which can cause overfitting.

The second question was: how does the performance of the deep learning models change with the forecast horizon, compared to the traditional models? The answer is that the statistical models are more stable in performance as the forecast horizon increases; however, the deep learning models with transfer learning show a better average ranking position and lower MAPE and sMAPE regardless of whether the forecast horizon is 1, 3, 6, or 12. The best option depends on the forecast horizon: the results suggest that if the goal is to predict 1 or 3 months ahead, the best option is the TCN, but for larger horizons the best is the LSTM. The CNN with transfer learning tends to have lower performance at any horizon.

The third question was: are there groups of time series where deep learning methods and the application of transfer learning are more effective? We classified the time series into four groups: unpredictable-symmetric, unpredictable-nonlinear with concentration around the mean, predictable-seasonal, and predictable with stationary and nonlinear tendency. Our results suggest that the deep learning models with transfer learning tend to be the best regardless of the group; however, this does not mean that they are always the best option. The best deep learning model using transfer learning was either the TCN or the LSTM in all four groups. Regarding the deep learning models trained directly on the target time series, the results suggest that the LSTM tends to perform better than the other traditional methods when the time series is unpredictable-symmetric or predictable-seasonal; in the other groups, ETS or Theta performed better.

In this study, the performance of deep learning models with transfer learning was evaluated on time series that belong to a population similar to, or only slightly different from, that of the time series used to train the models. However, it is relevant to know how the transfer learning models would behave as the distance between the features of the training and testing time series grows. Clusters of time series were generated to compare the performance of the algorithms by group; however, the final goal should be to develop a meta-learning model that determines which algorithm would be best according to the features of each time series. Some efforts have been made along this research line [e.g. 35, 36, 37, 38], but without incorporating deep learning models.
Lastly, our study did not include a comparison of the deep learning models with hybrid models that combine different algorithms, even though several studies have shown that this kind of model performs well in various domains.

References
[1] Financial time series forecasting with deep learning: A systematic literature review.
[2] Deep Learning Approaches to Time Series Forecasting. Recent Advances in Time Series Forecasting.
[3] Deep learning for time-series analysis.
[4] Deep learning methods for forecasting COVID-19 time-series data: A comparative study.
[5] A Data-Driven Forecasting Strategy to Predict Continuous Hourly Energy Demand in Smart Buildings.
[6] Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station.
[7] A survey on transfer learning.
[8] Transfer learning for financial time series forecasting.
[9] Deep Transfer Learning for Time Series Data Based on Sensor Modality Classification.
[10] Transfer Learning for Clinical Time Series Analysis Using Deep Neural Networks.
[11] Intradomain and cross-domain transfer learning for time series data: How transferable are the features? Knowledge-Based Systems.
[12] An Enterprise Time Series Forecasting System for Cloud Applications Using Transfer Learning.
[13] Benchmarking of Regression Algorithms and Time Series Analysis Techniques for Sales Forecasting.
[14] Evaluation of statistical and machine learning models for time series prediction: Identifying the state-of-the-art and the best conditions for the use of each model.
[15] Comparison of stochastic and machine learning methods for multi-step ahead forecasting of hydrological processes.
[16] Statistical and Machine Learning forecasting methods: Concerns and ways forward.
[17] Reconstruction and regression loss for time-series transfer learning.
[18] A Novel Approach to Short-Term Stock Price Movement Prediction using
[19] Time series anomaly detection using convolutional neural networks and transfer learning.
[20] Application of Transfer Learning in Continuous Time Series for Anomaly Detection in Commercial Aircraft Flight Data.
[21] A hybrid transfer learning model for short-term electric load forecasting.
[22] Multiple Electric Energy Consumption Forecasting Using a Cluster-Based Strategy for Transfer Learning in Smart Building.
[23] Energy Predictive Models with Limited Data using Transfer Learning.
[24] A network-based transfer learning approach to improve sales forecasting of new products.
[25] A hybrid transfer learning model for crude oil price forecasting.
[26] Prediction for Chaotic Time Series-Based AE-CNN and Transfer Learning.
[27] An Experimental Review on Deep Learning Architectures for Time Series Forecasting.
[28] Forecasting price movements using technical indicators: Investigating the impact of varying input window length.
[29] Autorank: A Python package for automated ranking of classifiers.
[30] Are forecasting competitions data representative of the reality?
[31] Characteristic-Based Clustering for Time Series Data.
[32] Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models.
[33] Forecasting time series with outliers.
[34] Testing for unit roots in autoregressive-moving average models of unknown order.
[35] Meta-Learning for Time Series Forecasting Ensemble.
[36] Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2020). Meta-learning framework with applications to zero-shot time-series forecasting. arXiv preprint arXiv:2002.02887.
[37] Li, Y., Zhang, S., Hu, R.