key: cord-0496757-vkzya6ng authors: Borchert, Oliver; Salinas, David; Flunkert, Valentin; Januschowski, Tim; Gunnemann, Stephan title: Multi-Objective Model Selection for Time Series Forecasting date: 2022-02-17 journal: nan DOI: nan sha: 02674f6f2798ed6ab012397bbfc4d490f0661eff doc_id: 496757 cord_uid: vkzya6ng

Research on time series forecasting has predominantly focused on developing methods that improve accuracy. However, other criteria such as training time or latency are critical in many real-world applications. We therefore address the question of how to choose an appropriate forecasting model for a given dataset among the plethora of available forecasting methods when accuracy is only one of many criteria. For this, our contributions are two-fold. First, we present a comprehensive benchmark, evaluating 7 classical and 6 deep learning forecasting methods on 44 heterogeneous, publicly available datasets. The benchmark code is open-sourced along with evaluations and forecasts for all methods. These evaluations enable us to answer open questions such as the amount of data required for deep learning models to outperform classical ones. Second, we leverage the benchmark evaluations to learn good defaults that consider multiple objectives such as accuracy and latency. By learning a mapping from forecasting models to performance metrics, we show that our method PARETOSELECT is able to accurately select models from the Pareto front -- alleviating the need to train or evaluate many forecasting models for model selection. To the best of our knowledge, PARETOSELECT constitutes the first method to learn default models in a multi-objective setting.

For decades, businesses have been using time series forecasting to drive strategic decision-making (Simchi-Levi et al., 2013; Hyndman & Athanasopoulos, 2018; Petropoulos et al., 2020). Analysts leverage forecasts to gauge resource requirements, retailers forecast future product demand to optimize their supply chains, cloud providers predict future web traffic to scale server fleets, and in the energy sector, forecasting plays a crucial role, e.g. to predict load and energy prices. In these and many other domains, more precise predictions directly translate into an increase in profit. Thus, it is no surprise that research on forecasting methods has historically focused on improving accuracy. In addition to more classical local forecasting methods which fit a model per time series, global forecasting models such as deep learning and tree-based models have demonstrated state-of-the-art forecasting accuracy (Wen et al., 2017; Oreshkin et al., 2019; Salinas et al., 2020a; Smyl, 2020) when sufficient training data is available (Makridakis et al., 2018). This research has led to a large variety of different forecasting models and hyperparameter choices, where different models exhibit vastly different characteristics, including accuracy, training time, model size, and inference latency. While this variety of forecasting models is a great resource, it introduces challenging questions. On the one hand, researchers would like to understand patterns in the performance of different models and to benchmark new models against existing methods. On the other hand, practitioners are interested in understanding which model performs best for a particular dataset or application. In this work, we address both of these problems: we release one of the most comprehensive publicly available evaluations of forecasting models across 44 datasets.
Using this benchmark dataset, we develop a novel method for learning good defaults for forecasting models on previously unseen datasets. Importantly, we adopt a multi-objective perspective that allows us to select forecasting models that are simultaneously accurate and satisfy constraints such as inference latency, training time, or model size. The main contributions of this work can be summarized as follows:

• We release the evaluations of 13 forecasting methods on 44 public datasets with respect to multiple performance criteria (different forecast accuracy metrics, inference latency, training time, and model size). This constitutes, by far, the most comprehensive publicly available evaluation of forecasting methods. Those evaluations can be leveraged, for example, to assess the relative performance of future forecasting methods with little effort.

• As an example application of this benchmark, we use the data to perform a statistical analysis which shows that only a few thousand observations are required for deep learning methods to outperform classical methods. In addition, we investigate the benefit of ensembling forecasting models.

• We propose a novel method that can leverage offline evaluations to learn good default models for unseen datasets. Default models are selected to optimize for multiple objectives.

• We introduce a technique for ensembling models in a multi-objective setting where the resulting ensemble is not only highly accurate but also optimized for other objectives, such as a low inference latency.

Time Series Forecasting. Time series forecasting has seen a recent surge in attention by the academic community that has started to reflect its relevance in business applications. Traditionally, univariate, so-called local models that consider time series individually have dominated (Hyndman & Athanasopoulos, 2018). However, in modern applications, methods that learn globally across a set of time series can be more accurate (Oreshkin et al., 2019; Salinas et al., 2020a; Lim et al., 2021) - in particular, methods that rely on deep learning. With a considerable choice of models available, it is unclear which methods should perform best on which forecasting dataset, unlike in other machine learning domains where dominant approaches exist. In domains such as computer vision or neural architecture search (NAS), offline computations have been harvested and leveraged successfully to perform extensive model comparisons or compare different NAS strategies (Ying et al., 2019; Dong & Yang, 2020; Pfisterer et al., 2021). However, to the best of our knowledge, no comprehensive set of model evaluations has been released in the realm of forecasting. Consequently, some questions remain open: how much data is required for global deep learning methods to outperform classical local methods such as ARIMA or ETS (Hyndman & Athanasopoulos, 2018)? What is the impact on accuracy when ensembling different forecasting models? While some recent methods incorporate ensembling (Oreshkin et al., 2019; Jeon & Seong, 2021), comparisons are often made against individual models, which may cloud the benefit of the method proposed versus the sole benefit of ensembling.

Learning Default Models. Finding the best model or set of hyperparameters is often performed via Bayesian optimization given its theoretical regret guarantees (Srinivas et al., 2012).
However, even with early-stopping techniques (Golovin et al., 2017; Li et al., 2017), practitioners often restrict the search to a single model due to the large cost of training many models. One technique to drastically speed up the model/hyperparameter search is to reuse offline evaluations of related datasets via transfer learning. For instance, Wistuba et al. (2015), Winkelmolen et al. (2020), and Pfisterer et al. (2021) leverage offline evaluations to alleviate the need for training many models by learning a small list of n good defaults that provide a small joint error when evaluated on all datasets. This approach bears a lot of similarity with ours. Key differences are that we consider multiple objectives and model ensembles. In the time series domain, Shah et al. (2021) also consider the task of automatically choosing, from a large pool of forecasting models, the one performing best on a particular dataset, but they also only optimize for a single objective and rely on potentially expensive model training for model selection.

In this section, we formally introduce the problem of time series forecasting and subsequently provide an overview of the benchmark evaluations that we release. By evaluating 13 forecasting methods along with different hyperparameter choices on all 44 benchmark datasets and for two random seeds, the benchmark comprises 4,708 training runs and amasses over 1 TiB of forecast data. At the end of this section, we demonstrate how this benchmark data can be used to perform an extensive comparison of contemporary forecasting methods.

The goal of time series forecasting is to predict the future values of one or more time series based on historical observations. Formally, we consider a set Z = {z^(i)_{1:T_i}}_{i=1}^{N} of N univariate time series, where z^(i)_{a:b} = (z^(i)_a, ..., z^(i)_b) denotes the vector of observations for the i-th time series in the time interval a ≤ t ≤ b. In probabilistic time series forecasting, a model then estimates the probability distribution across the forecast horizon τ, i.e. the distribution over the τ future values, from the historical observations:

p(z^(i)_{T_i+1 : T_i+τ} | z^(i)_{1:T_i}).   (1)

Models typically approximate the joint distribution of Eq. (1) via Monte Carlo sampling (Salinas et al., 2020a) or learn to directly predict a set of distribution quantiles for multiple time steps using quantile regression (Wen et al., 2017; Lim et al., 2021).

We consider 13 models in total that can be distinguished into 7 local methods (estimating parameters individually from each time series) and 6 global methods (learning from all available time series jointly). Details about the models can be found in Appendix A.

Local Methods. We use Seasonal Naïve (Hyndman & Athanasopoulos, 2018) and NPTS (Rangapuram et al., 2021) as simple, non-parametric baselines. Further, we consider ARIMA and ETS (Hyndman & Athanasopoulos, 2018) as well-known statistical methods and STL-AR (Talagala, 2021) and Theta (Assimakopoulos & Nikolopoulos, 2000) as inexpensive alternatives. Lastly, we include Prophet (Taylor & Letham, 2018), an interpretable model that has received plenty of attention.

Global Methods. All global methods that we use are deep learning models, namely: Simple Feedforward (Alexandrov et al., 2020), MQ-CNN and MQ-RNN (Wen et al., 2017), DeepAR (Salinas et al., 2020a), N-BEATS (Oreshkin et al., 2019), and TFT (Lim et al., 2021). For each model (except MQ-RNN), we consider three hyperparameter settings: this includes the default set provided by their implementation as well as hyperparameter sets that roughly halve and double the default model capacity (see Table 2 in the appendix). Additionally, we consider three different context lengths that govern the length of the time series history that predictions are conditioned on.
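As an illustration of this quantile-based representation (a minimal NumPy sketch of our own, not code from the released benchmark; array shapes and function names are assumptions), both sampling-based and point forecasts can be reduced to the nine 10-quantiles that are stored for every method:

```python
import numpy as np

QUANTILE_LEVELS = np.linspace(0.1, 0.9, 9)  # the 10-quantiles 0.1, ..., 0.9

def quantiles_from_samples(samples):
    """Convert Monte Carlo forecast samples of shape (num_samples, horizon)
    into empirical quantile forecasts of shape (9, horizon)."""
    return np.quantile(samples, QUANTILE_LEVELS, axis=0)

def quantiles_from_point_forecast(point_forecast):
    """Interpret a point forecast of shape (horizon,) as a Dirac distribution:
    every quantile equals the forecasted value."""
    return np.tile(point_forecast, (len(QUANTILE_LEVELS), 1))
```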
Model Training. Model training is only required for deep learning models since parametric local methods estimate their parameters at prediction time. Deep learning models are trained for a fixed duration depending on the size of the dataset. Further details can be found in Appendix C.

Our benchmark provides 44 heterogeneous public datasets in total. The datasets greatly differ in the number of time series (from 5 to ≈170,000), their mean length (from 19 to ≈500,000), their frequency (minutely, hourly, daily, weekly, monthly, quarterly, yearly), and the forecast horizon (from 4 to 60). Thus, we expect them to cover a wide range of datasets encountered in practice. Datasets are taken from various forecasting competitions, the UCI Machine Learning Repository (Dua & Graff, 2017), and the Monash Time Series Forecasting Repository (Godahewa et al., 2021). Dataset sources, descriptions, basic statistics, and an explanation of the data preparation procedure can be found in Appendix B.

To measure the accuracy of forecasting models, we employ the normalized continuous ranked probability score (nCRPS) (Matheson & Winkler, 1976; Gneiting & Raftery, 2007), whose definition is given in Appendix A. The benchmark dataset that we release contains four additional forecast accuracy metrics (MAPE, sMAPE, NRMSE, ND). Besides forecast accuracy metrics, we store the inference latency, model size (i.e. number of parameters), and training time for deep learning models. Latency is measured by dividing the time taken to generate predictions for all test time series in the dataset by their number.

Having outlined the benchmark, we now want to demonstrate the usefulness of the evaluations for research. For this, we analyze how local (classical) and global (deep learning) forecasting methods compare against each other. In fact, this comparison has been the object of heated discussions in the forecasting community (Makridakis et al., 2018).

Method Comparison. In order to compare methods, we consider their latency and accuracy (in terms of nCRPS) across all datasets. Table 1 shows each method's relative latency compared to Seasonal Naïve as well as the methods' nCRPS rank. As far as latency is concerned, global methods, once trained, allow forecasts to be generated considerably faster than local methods (except for Seasonal Naïve). The reason is simple: once trained, they only need to run a forward pass. In contrast, the implementations for the local methods chosen here do not differentiate between training and inference, but rather estimate their parameters at prediction time. Across datasets, the relative latency of all considered deep learning models improves upon the latency of the fastest local method (excluding Seasonal Naïve) by between 15% and 95%. As far as accuracy is concerned, complex deep learning models - namely DeepAR and TFT - generally perform best across the benchmark datasets. Except for MQ-RNN, global methods compare favorably against local methods. As a measure of relative model stability, Table 1 additionally shows the standard deviation of the nCRPS rank across all benchmark datasets. Not only is TFT the method which performs best on average, its rank is also comparatively consistent compared to other methods. In particular, it is more stable than DeepAR, which seems to generate relatively inaccurate forecasts on some datasets.
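To make the evaluation protocol concrete, the sketch below shows how a single benchmark entry (an nCRPS approximation and a per-series latency) could be produced. It is our own illustration, not the released benchmark code, and it assumes the MXNet-based GluonTS 0.x API, whose module paths may differ between versions:

```python
import time

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

dataset = get_dataset("electricity")
estimator = DeepAREstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    trainer=Trainer(epochs=10),
)
predictor = estimator.train(dataset.train)

# Inference latency: total prediction time divided by the number of test series.
start = time.perf_counter()
forecast_it, ts_it = make_evaluation_predictions(dataset.test, predictor=predictor, num_samples=100)
forecasts, targets = list(forecast_it), list(ts_it)
latency = (time.perf_counter() - start) / len(forecasts)

# Accuracy: the mean weighted quantile loss over the 10-quantiles approximates the nCRPS.
evaluator = Evaluator(quantiles=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
agg_metrics, _ = evaluator(iter(targets), iter(forecasts), num_series=len(forecasts))
print(latency, agg_metrics["mean_wQuantileLoss"])
```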
While ensembling generally incurs a significant cost in latency, its benefit in accuracy is substantial. For instance, the Simple Feedforward hyper-ensemble yields a competitive model that outperforms DeepAR in terms of both accuracy and latency. Lastly, the DeepAR and TFT hyper-ensembles result in the most competitive ensembles.

Comparison of Classical and Deep Learning Methods. We further conduct a statistical analysis to compare the classical time series models with the deep learning models listed in Section 3.2. For this, we construct the null hypothesis that the accuracy of classical models is equal to or better than the accuracy of deep learning methods. This is a claim put forward, for example, by Makridakis et al. (2018). More formally, we write H_0: Q_class ≤ Q_deep, where Q_class and Q_deep describe the distribution of nCRPS values of classical and deep learning models, respectively. Samples of the distributions are derived from the respective single-model evaluations in our benchmark. Q_class and Q_deep are continuous and belong to unknown families of distributions with unequal variances. Hence, we choose the nonparametric two-sample Kolmogorov-Smirnov test (Gibbons & Chakraborti, 2003) to compute, on each individual dataset, the p-value, i.e. the probability of observing a difference at least as extreme if H_0 holds. We evaluate the null hypothesis H_0 on all 44 benchmark datasets and plot the p-value for the different dataset sizes in Figure 1. At a significance level of α = 5%, H_0 can be rejected for 30 out of 44 datasets. In fact, deep learning models exhibit competitive performance for almost all datasets and regularly outperform classical methods. Further, they tend to perform comparatively better with increasing dataset size - even if classical models are indistinguishable on some large datasets - and, nonetheless, show competitive performance for small datasets. The latter is of particular interest since it contradicts the tribal knowledge that large datasets are needed to outperform local/classical methods and disputes the claim presented by Makridakis et al. (2018), since only a few thousand observations seem sufficient to outperform the classical models considered. These statements are further supported by Figure 2, where we compare the best deep learning model x_deep against the best classical model x_class on each dataset. We compute the relative improvement given by deep learning models as nCRPS(x_deep) / nCRPS(x_class) − 1. Except for a few outliers, the best deep learning model outperforms all classical models on 40 out of 44 datasets (see Appendix F for a further analysis). Fitting a linear regression model on the relative improvements shows that the improvement from using deep learning models is only slightly amplified as the datasets grow in size. We note that we conflate deep learning and global models in the discussion above. Prior work has shown the general superiority of global models over local models both theoretically and empirically, and our results further confirm these findings.
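The per-dataset test can be reproduced with SciPy as sketched below (our own illustration; in particular, the mapping of the paper's one-sided H_0 onto SciPy's `alternative` argument is our reading and worth double-checking):

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_on_dataset(classical_ncrps, deep_ncrps, alpha=0.05):
    """classical_ncrps, deep_ncrps: 1-D arrays of per-configuration nCRPS values
    of classical and deep learning models on a single dataset."""
    # One-sided two-sample KS test. With alternative="less", SciPy's null hypothesis is
    # that the first sample is stochastically smaller or equal, i.e. classical models are
    # at least as accurate (assumed to correspond to the paper's H0).
    result = ks_2samp(classical_ncrps, deep_ncrps, alternative="less")
    reject_h0 = result.pvalue < alpha

    # Relative improvement of the best deep model over the best classical model
    # (negative values mean the deep model is more accurate).
    rel_improvement = np.min(deep_ncrps) / np.min(classical_ncrps) - 1.0
    return result.pvalue, reject_h0, rel_improvement
```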
The benchmark introduced in the previous section outlined how to acquire offline evaluations of forecasting methods on various datasets. However, in the presence of multiple conflicting objectives such as accuracy and latency, it is not clear how one should choose the best models. In this section, we first formalize the problem of learning good defaults from these evaluations for a single objective. Afterwards, we formally introduce multi-objective optimization and show how learning defaults can be extended to account for multiple objectives. Finally, we report experimental results obtained with our method.

In our setting, we consider an objective function f: X → R^m for a given task (in our case, a task is a dataset). The objective function maps any time series model x ∈ X to m objectives (such as nCRPS, latency, etc.) that ought to be minimized. In addition, we assume that model evaluations on T different but related tasks with objective functions f^(1), ..., f^(T) are available. The set of offline model evaluations is then given as

D = {(x^(j)_i, y^(j)_i) | j = 1, ..., T; i = 1, ..., N_j},   (2)

where N_j evaluations are available for task j and y^(j)_i = f^(j)(x^(j)_i) denotes the evaluation of x^(j)_i on task j.

Learning Defaults. To learn a set of n good default models for unseen datasets in the single-objective setting, Pfisterer et al. (2021) propose to pick the models that minimize the joint error obtained when evaluating them on all datasets. This amounts to picking the set of models {x_1, ..., x_n} ⊂ X which minimizes the following objective:

Σ_{j=1}^{T} min_{i ∈ {1, ..., n}} f^(j)_k(x_i),   (3)

where k denotes a single fixed objective (for instance, the classification error). Since the problem is NP-complete, a greedy heuristic with a provable approximation guarantee is used: it iteratively adds the model minimizing Eq. 3 to the current selection. However, this approach only works for a single objective of choice f_k rather than taking all objectives f (such as accuracy and latency) into account. In addition, it does not easily support the selection of ensembles since objectives can interact differently when models are ensembled: while accuracy will likely increase by a small amount, the latencies of ensemble members need to be added up. Taking into account all possible ensembles of a given size n would also blow up the combinatorial search when optimizing Eq. 3.

Multi-Objective Optimization. In multi-objective optimization, there generally exists no singular solution. Thus, when considering the objective function f, there exists no single optimal model, but rather a set of optimal models (Emmerich & Deutz, 2018). The set of all non-dominated solutions (i.e. models) is denoted as the Pareto front P_f(X), whose members are commonly referred to as Pareto-optimal solutions:

P_f(X) = {x ∈ X | there is no x' ∈ X with f_k(x') ≤ f_k(x) for all k and f_k(x') < f_k(x) for at least one k}.   (4)

To quantify the quality of a set S = {x_1, ..., x_n} ⊂ X of selected models, we use the hypervolume error (Zitzler & Thiele, 1998; Li & Yao, 2019). Given a set of points Y ⊂ R^m and a reference point r ∈ R^m, we first define the hypervolume as the Lebesgue measure Λ of the dominated space:

HV_r(Y) = Λ({u ∈ R^m | ∃ y ∈ Y: y ≤ u ≤ r}),   (5)

where ≤ denotes component-wise comparison. In turn, the hypervolume error ε of the set S is defined as the difference in hypervolume compared to the true Pareto front P_f(X):

ε(S) = HV_r(f(P_f(X))) − HV_r(f(S)),   (6)

where f(S) = {f(x) | x ∈ S}. Thus, the hypervolume error ε(S) reaches its minimum when S ⊇ P_f(X). To account for different scales in the optimization objectives, we further standardize all objectives by quantile normalization (Bolstad et al., 2003) such that they follow a uniform distribution U(0, 1) - this is feasible as X is finite. While the normalization makes our comparisons robust to monotonic transformations of the objectives (Binois et al., 2020), it also allows us to choose the reference point for Eq. 5 as r = 1_m, the m-dimensional vector of ones.

Learning Multi-Objective Defaults. Using the hypervolume error, we can extend the minimization problem of Eq. 3 to learn defaults in the multi-objective setting. For this, we propose the following minimization problem:

min_{ {x_1, ..., x_n} ⊂ X } Σ_{j=1}^{T} ε^(j)({x_1, ..., x_n}),   (7)

where ε^(j) denotes the hypervolume error with respect to f^(j). In words, we seek to find a set of complementary model configurations {x_1, ..., x_n} ⊂ X that provide a good approximation of the Pareto front P_f^(j)(X) across tasks j ∈ {1, ..., T}.
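For the two-objective case used later (quantile-normalized nCRPS and latency), the Pareto front and the hypervolume error of Eqs. (4)-(6) reduce to a few lines of NumPy. The following is an illustrative sketch, not the authors' implementation, and assumes objectives already normalized to [0, 1] with reference point r = (1, 1):

```python
import numpy as np

def pareto_front(points):
    """Return the non-dominated subset of `points` (rows = models, columns = objectives, minimized)."""
    points = np.asarray(points, dtype=float)
    keep = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        dominated = np.all(points <= p, axis=1) & np.any(points < p, axis=1)
        if dominated.any():
            keep[i] = False
    return points[keep]

def hypervolume_2d(points, ref=(1.0, 1.0)):
    """Hypervolume dominated by `points` w.r.t. reference point `ref` (two minimized objectives)."""
    front = pareto_front(points)
    front = front[np.argsort(front[:, 0])]   # sort by first objective; second is then non-increasing
    hv, prev = 0.0, ref[1]
    for f1, f2 in front:
        hv += max(ref[0] - f1, 0.0) * max(prev - f2, 0.0)
        prev = min(prev, f2)
    return hv

def hypervolume_error(selected, all_points, ref=(1.0, 1.0)):
    """Eq. (6): difference between the hypervolume of the true Pareto front and of the selection."""
    return hypervolume_2d(all_points, ref) - hypervolume_2d(selected, ref)
```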
We now introduce PARETOSELECT, which tackles the minimization of Eq. 7. On a high level, PARETOSELECT works as follows: first, it fits a parametric surrogate model f̂_θ that predicts the performances of model configurations by leveraging the evaluations in D. Then, it uses the surrogate f̂_θ to estimate the objectives for all models in X and applies the non-dominated sorting algorithm to the estimated objectives to select a list of default models.

The surrogate model f̂_θ is trained by attempting to correctly rank model configurations within each task in the available offline evaluations D. For each model x ∈ X, f̂_θ outputs a vector ỹ ∈ R^m for the m optimization objectives. Models can then be ranked for each objective independently. In order to fit f̂_θ on the data D, we use listwise ranking (Xia et al., 2008) and apply linear discounting to focus on correctly identifying the top configurations, similar to Zügner et al. (2020). The corresponding loss function can be found in Appendix D. For the parametric surrogate model f̂_θ, we choose a simple MLP similar to Salinas et al. (2020b). We use two hidden layers of size 32 with LeakyReLU activations after each hidden layer and apply a weight decay of 0.01 as regularization. Importantly, we do not tune these hyperparameters. Predictive performances for every model can eventually be computed as Ỹ = {f̂_θ(x) | x ∈ X}. We note that for minimizing Eq. 7, either regression or ranking objectives can be used: since we apply quantile normalization to compute the hypervolume, the magnitude of the values predicted by f̂_θ is irrelevant. However, we find the ranking setting to be superior to regression for finding good default models (see Appendix D for more details).

Multi-Objective Sorting. The "best" configurations are then determined by applying the non-dominated sorting algorithm (NDSA) - an extension of sorting to the multi-dimensional setting (Srinivas & Deb, 1994; Emmerich & Deutz, 2018) - to the predictive performances Ỹ. Figure 3 provides an illustration of the sorting procedure. NDSA can be described as a two-level procedure: first, it partitions the configurations to be sorted into multiple layers by iteratively computing (and removing) Pareto fronts; then, it applies a sorting function sort to rank the configurations within each layer. We choose sort to compute an ε-net (Clarkson, 2006), for which there exists a simple greedy algorithm. When sorting a set X, the ε-net sorting operation yields an order sort(X) = [x_1, ..., x_|X|]. The first element x_1 is chosen randomly from X and x_{i+1} is defined iteratively as the item which is farthest from the current selection in terms of Euclidean distance, formally:

x_{i+1} = argmax_{x ∈ X \ {x_1, ..., x_i}} min_{j ∈ {1, ..., i}} ‖x − x_j‖_2.   (9)

The final ordering is eventually obtained by concatenating the sorted layers. While previous work followed different approaches for the sort operation in the NDSA algorithm (Srinivas & Deb, 1994; Deb et al., 2002), we leverage the ε-net since prior work highlighted its theoretical guarantees and Schmucker et al. (2021) provided empirical evidence for its good performance. The pseudo-code of PARETOSELECT is given in Algorithm 1 in the appendix.

In the case where the predictions of the surrogate are error-free, the recommendations of the algorithm can be guaranteed to be perfect with zero hypervolume error.

Proposition 1. Assume that n ≥ |P_f(X)| and that f̂_θ(x) = f(x) for all x ∈ X. Then, ε({x_1, ..., x_n}) = 0, where x_1, ..., x_n are the first n models selected by our method.

Proof. By Eq. 9, the first k recommendations of PARETOSELECT are such that {x_1, ..., x_k} ⊆ P_f(X) whenever k ≤ |P_f(X)|. It follows that for k = |P_f(X)| ≤ n, P_f(X) = {x_1, ..., x_k} ⊆ {x_1, ..., x_n} and consequently ε({x_1, ..., x_n}) = 0, since the hypervolume error of any set of points containing the Pareto front is zero.
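The two-level sorting procedure can be sketched as follows (a self-contained NumPy illustration of the non-dominated layering and of the greedy ε-net ordering of Eq. (9); the surrogate fitting that precedes it in Algorithm 1 is omitted, and all names are our own):

```python
import numpy as np

def non_dominated_layers(points):
    """Partition points (rows = configurations, columns = predicted objectives, minimized)
    into Pareto layers by iteratively peeling off the Pareto front."""
    points = np.asarray(points, dtype=float)
    remaining = np.arange(len(points))
    layers = []
    while len(remaining) > 0:
        sub = points[remaining]
        on_front = np.array([
            not np.any(np.all(sub <= p, axis=1) & np.any(sub < p, axis=1)) for p in sub
        ])
        layers.append(remaining[on_front])
        remaining = remaining[~on_front]
    return layers

def epsilon_net_order(points, indices, rng):
    """Greedy farthest-point ('epsilon-net') ordering of one layer, cf. Eq. (9)."""
    order = [rng.choice(indices)]                        # first element chosen at random
    candidates = [i for i in indices if i != order[0]]
    while candidates:
        dists = [min(np.linalg.norm(points[i] - points[j]) for j in order) for i in candidates]
        nxt = candidates[int(np.argmax(dists))]          # farthest from the current selection
        order.append(nxt)
        candidates.remove(nxt)
    return order

def pareto_select(predicted_objectives, n, seed=0):
    """Return the indices of the first n default configurations."""
    points = np.asarray(predicted_objectives, dtype=float)
    rng = np.random.default_rng(seed)
    ranking = []
    for layer in non_dominated_layers(points):
        ranking.extend(epsilon_net_order(points, list(layer), rng))
    return ranking[:n]
```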
In order to evaluate our approach, we leverage the data collected with the benchmark presented in Section 3. For evaluation, we perform leave-one-out cross-validation where we use each dataset one after the other as the test dataset and estimate the parameters θ of our parametric surrogate model f̂_θ on the remaining 43 datasets. The multi-objective setting in the following experiments aims to minimize latency as well as nCRPS. In Figure 4, we report the hypervolume error obtained for PARETOSELECT and several baselines, averaged over all 44 test datasets. PARETOSELECT picks the top elements of the non-dominated sort on the surrogate's predictive performances. As baselines, we consider a variant of PARETOSELECT which considers the nCRPS as the single objective (Single-Objective), the approach of Pfisterer et al. (2021) which greedily picks models to minimize the joint nCRPS on the training data (Greedy Model-Free), and random search over the entire model space of size |X| = 247 (Random). The hypervolume error obtained by PARETOSELECT with 10 default models is small (≈ 0.06) compared to the best baseline, which reaches ≈ 0.20. Further, random search requires a total of 27 models to be on par with the hypervolume error of only 10 models selected via our method. Figure 5 depicts the top n = 10 recommendations produced by PARETOSELECT when predicting default models on a single dataset. The default models cover the true Pareto front well: our method is able to pick models with low latency, models with low nCRPS, and models with a good trade-off between these two objectives.

As shown in Section 3.5, ensembling models yields significant gains in accuracy but comes at the cost of considerable latency. Since a massively large ensemble model is of no practical use, it is critical to be able to select an ensemble with a latency that is acceptable for an application. Thus, we consider ensembles under different latency constraints. For a latency constraint c, we build an ensemble with n ≤ 10 members x_1, ..., x_n by iteratively picking the model with the best predicted nCRPS such that the ensemble latency constraint c is still met. To generate ensemble predictions, we combine the forecasts of the base models x_1, ..., x_n by taking a simple average of each of the 10-quantiles {0.1, 0.2, ..., 0.9}. This aligns with the current literature showing that simple averaging is often most effective (Petropoulos et al., 2020). Figure 6 shows the performance of different latency-constrained ensembles, relative to the performance of an unconstrained ensemble that picks the n = 10 models with the lowest predicted nCRPS. As expected, higher latency constraints c allow the ensemble to perform better. For c ≥ 100 ms, the constrained ensemble recovers the Pareto front and outperforms all individual models x ∈ X as well as ensembles. The small gap to the Pareto front for constraints c ≤ 50 ms can be attributed to multiple effects. First, the nCRPS values predicted by the surrogate f̂_θ are not optimal. Second, models may not satisfy c on every dataset although they have a lower latency on average (e.g. the default TFT configuration violates c = 2.5 ms on 17/44 datasets). Table 1 additionally reports performance metrics of latency-constrained ensembles: for instance, the ensemble for c = 10 ms outperforms all individual base models in terms of average nCRPS rank, while the ensemble for c = 100 ms outperforms all hyper-ensembles as well.
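A minimal sketch of the latency-constrained ensemble construction described above (our own illustration; it assumes that the ensemble latency is the sum of the members' latencies, as stated above, and that each member contributes a (9, horizon) array of quantile forecasts):

```python
import numpy as np

def select_ensemble(predicted_ncrps, latencies, constraint_ms, max_members=10):
    """Greedily pick up to `max_members` models with the best predicted nCRPS such that
    the summed latency of the ensemble stays below `constraint_ms`."""
    order = np.argsort(predicted_ncrps)                  # best predicted accuracy first
    members, total_latency = [], 0.0
    for i in order:
        if len(members) == max_members:
            break
        if total_latency + latencies[i] <= constraint_ms:
            members.append(int(i))
            total_latency += latencies[i]
    return members

def ensemble_forecast(member_quantile_forecasts):
    """Average the members' quantile forecasts level by level; each input has shape (9, horizon)."""
    return np.mean(np.stack(member_quantile_forecasts, axis=0), axis=0)
```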
In this work, we have presented (i) a new benchmark for evaluating forecasting methods in terms of multiple metrics and (ii) PARETOSELECT, a new algorithm that can learn default model configurations in a multi-objective setting. In future work, we will consider applying these techniques to other domains such as computer vision, where offline evaluations on multiple datasets were recently made available by Dong & Yang (2020). Additionally, future research may extend the presented predictive surrogate model to the case of ensembles: this would enable PARETOSELECT to be used directly to select members for ensembles of any latency.

Public datasets and method implementations were used to ensure the reproducibility of our experiments. In Appendix A, we provide all hyperparameters used for training the benchmark forecasting models via GluonTS. The source and processing of each dataset as well as the procedure for generating data splits is detailed in Appendix B. We also provide details on the surrogate model used and a comparison with other choices in Appendix D. The pseudocode of PARETOSELECT is given in Appendix E. As a source of truth, we share the entire code used to run the benchmark and its evaluation with the submission. This code as well as the generated evaluations will be released with the paper.

Table 2 provides an overview of all benchmark models along with the hyperparameters considered. In addition to three hyperparameter settings per deep learning model (except for MQ-RNN), three different context lengths l are considered. These represent multiples of the forecast horizon τ and are thus dataset-dependent. Eventually, this results in 9 model configurations per deep learning model (and 3 for MQ-RNN). The implementations of all models are taken from GluonTS (Alexandrov et al., 2020). All deep learning models are implemented in MXNet (Chen et al., 2015). ARIMA, ETS, STL-AR, and Theta are available in GluonTS as well but forward computations to the R forecast package.

Model Evaluation. To measure the accuracy of forecasting models, we employ the continuous ranked probability score (CRPS) (Matheson & Winkler, 1976; Gneiting & Raftery, 2007). Given the quantile function (F^(i)_t)^{-1} of the forecast probability distribution in Eq. (1) for a time series z^(i) at time step t, the CRPS is defined as

CRPS((F^(i)_t)^{-1}, z^(i)_t) = ∫_0^1 2 Λ_α((F^(i)_t)^{-1}(α), z^(i)_t) dα,

where Λ_α is the quantile loss (or pinball loss) defined for a quantile level 0 < α < 1:

Λ_α(q, z) = (α − I[z < q]) (z − q).

Here, I is the indicator function. To compute an aggregated score for a model forecasting multiple time series and time steps, we compute a normalized CRPS (nCRPS) by dividing the summed CRPS values by the summed absolute values of the observations:

nCRPS = Σ_{i,t} CRPS((F^(i)_t)^{-1}, z^(i)_t) / Σ_{i,t} |z^(i)_t|.

We approximate the nCRPS using the 10-quantiles α ∈ {0.1, 0.2, 0.3, ..., 0.9}. While we focus on the nCRPS for the evaluations in this paper, the benchmark dataset that we release contains four additional forecast accuracy metrics (MAPE, sMAPE, NRMSE, ND).
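The 10-quantile approximation of the nCRPS can be computed as sketched below. This is our own illustration and follows one common discretization (averaging the weighted quantile losses over the nine levels, as in GluonTS's mean weighted quantile loss); the paper's exact aggregation may differ in minor details:

```python
import numpy as np

QUANTILE_LEVELS = np.linspace(0.1, 0.9, 9)

def quantile_loss(q_forecast, z, alpha):
    """Pinball loss Lambda_alpha(q, z) = (alpha - 1[z < q]) * (z - q)."""
    return (alpha - (z < q_forecast)) * (z - q_forecast)

def ncrps(quantile_forecasts, targets):
    """Approximate the nCRPS from the stored 10-quantiles.

    quantile_forecasts: array of shape (9, num_series, horizon)
    targets:            array of shape (num_series, horizon)
    """
    total = 0.0
    for level, q in zip(QUANTILE_LEVELS, quantile_forecasts):
        total += 2.0 * np.sum(quantile_loss(q, targets, level))
    total /= len(QUANTILE_LEVELS)               # average over the nine quantile levels
    return total / np.sum(np.abs(targets))      # normalize by the summed absolute target values
```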
Our benchmark provides 44 datasets which were obtained from multiple sources. 16 datasets were obtained through GluonTS (Alexandrov et al., 2020), 24 datasets are taken from the Monash Time Series Forecasting Repository (Godahewa et al., 2021), and the remaining 4 datasets were originally published as part of Kaggle forecasting competitions. Table 3 provides basic statistics about all datasets. In the following, we provide brief descriptions of all datasets along with their sources. We link the original source (if possible) along with any work which pre-processed the data (if applicable). The datasets "Corporación Favorita", "Restaurant", "Rossmann", and "Walmart" - which were obtained from Kaggle - are processed by ourselves.

Bitcoin (Godahewa et al., 2021) provides over a dozen potential influencers of the price of Bitcoin. Among others, these influencers are the daily hash rate, block size, or mining difficulty. We exclude one time series from the original dataset as its absolute values are ≥ 10^18. Data is provided from January 2009 to July 2021.

CIF 2016 (Štěpnička & Burda, 2017; Godahewa et al., 2021) is the dataset from the "Computational Intelligence in Forecasting" competition in 2016. One third of the time series originate from the banking sector while the remaining two thirds are generated artificially. In the competition, the time series have two different prediction horizons; however, GluonTS forces us to adopt only a single horizon (for which we choose the most common one).

COVID Deaths (Godahewa et al., 2021) provides the daily COVID-19 death counts of various countries between January 22 and August 21 in the year 2020.

Car Parts (Hyndman, 2015; Godahewa et al., 2021) provides intermittent time series of car parts sold monthly by a US car company between January 1998 and April 2002. Periods in which no parts are sold have a value of zero.

Corporación Favorita (Favorita, 2017) contains the daily unit sales of over 4,000 items at 54 different stores of the supermarket company Corporación Favorita in Ecuador. Unit sales were manually set to 0 on days where no data was available. Data ranges from January 2013 to August 2017. All covariates that are provided in the Kaggle competition are discarded.

Dominick (James M. Kilts Center, 2020; Godahewa et al., 2021) contains time series with the weekly profits of numerous stock keeping units (SKUs). The data is obtained from the grocery store company Dominick's over a period of seven years.

Electricity (Dua & Graff, 2017; Alexandrov et al., 2020) is comprised of the hourly electricity consumption (in kWh) of hundreds of households between January 2012 and June 2014.

Exchange Rate (Lai et al., 2018; Alexandrov et al., 2020) provides the daily exchange rates (on weekdays) between the US dollar and the currencies of eight countries (Australia, Great Britain, Canada, Switzerland, China, Japan, New Zealand, Singapore) in the period from 1990 to 2013.

Hospital (Hyndman, 2015; Godahewa et al., 2021) provides the monthly patient counts for various products that are related to medical problems between 2000 and 2007.

KDD 2018 (Bekkerman et al., 2018; Godahewa et al., 2021).

M1 (Makridakis et al., 1982; Godahewa et al., 2021) datasets are taken from the M1 competition in 1982. Time series have varying start and end dates and cover a wide semantic spectrum of domains (demographics, micro, macro, industry).

M3 (Makridakis & Hibon, 2000; Alexandrov et al., 2020) datasets are obtained from the M3 competition. Across frequencies, time series were collected from six different domains: demographics, micro, macro, industry, finance, and other. For the time series with no specified frequency, we assume a quarterly frequency (as this is the default in GluonTS).

M4 (Makridakis et al., 2020b; Alexandrov et al., 2020) datasets are taken from the prestigious M4 competition.
The time series are collected from the same domains as in the M3 competition.

M5 (Makridakis et al., 2020a; Alexandrov et al., 2020) is the dataset from the M5 competition and contains the daily unit sales of 3,049 products across 10 Walmart stores located in California, Texas, and Wisconsin. While covariates are available (e.g. product prices or special events), none of the forecasting methods we consider makes use of them. Sales data is provided from January 2011 to April 2016.

Restaurant (Holdings, 2017) provides the number of daily visitors of hundreds of restaurants in Japan between January 2016 and April 2017, as tracked by the AirREGI system. As dates on which restaurants are closed do not provide visitor numbers, we impute zeros for these missing values. The Kaggle competition from which this dataset is obtained also provides data about reservations, but this data remains unused for our purposes.

Rideshare (Godahewa et al., 2021) is comprised of hourly time series representing various attributes related to services offered by Uber and Lyft in New York. Time series include e.g. the minimum, maximum, and average price, or the maximum distance traveled - grouped by starting location and provider. Data is available for the period of a month, starting in late November 2018.

Rossmann (Rossmann, 2015) provides daily sales counts at hundreds of different stores of the Rossmann drug store chain. Just like for the M5 dataset, covariates are available (e.g. distance to competition, special events, or discounts) but not used by any method. Data is available from January 2013 until August 2015.

San Francisco Traffic (Caltrans, 2020; Godahewa et al., 2021) contains hourly occupancy rates of freeways in the San Francisco Bay Area. Data is available for two full years, starting from January 1, 2015.

Solar (Zhang, 2006; Alexandrov et al., 2020) is comprised of the hourly power production of dozens of photovoltaic power stations in the state of Alabama in the year 2006.

Taxi (NYC Taxi and Limousine Commission, 2015; Alexandrov et al., 2020) contains the number of taxi rides in hundreds of locations around New York in 30-minute windows. Training data contains data from January 2015 while test data contains data from January of the subsequent year.

Temperature Rain (Godahewa et al., 2021) provides the daily temperature observations and rain forecasts gathered from 422 weather stations across Australia. Data ranges from May 2015 through April 2017.

Tourism (Athanasopoulos et al., 2011; Godahewa et al., 2021).

Weather (Sparks, 2021; Godahewa et al., 2021) is comprised of daily data collected from hundreds of weather stations in Australia. Collected data is the amount of rain, the minimum and maximum temperature, and the amount of solar radiation.

Wiki (Alexandrov et al., 2020) similarly provides daily page views for several thousand Wikipedia pages in the period from January 2012 to July 2014.

Wind Farms (AEMO, 2020; Godahewa et al., 2021) contains time series with the minutely power production of hundreds of wind farms in Australia. Data is available for a period of one year, starting in August 2019.

For all of the datasets in Table 3 except for "Taxi", we perform similar preprocessing. Let Z = {z^(i)_{1:T_i}}_{i=1}^{K} be the set of K univariate time series of lengths T_i. Then, we run the following preprocessing steps:

1. Remove all time series which are constant, to always be able to compute the MASE metric.

2. Remove all time series which are too short.
We exclude all time series of length T ≤ (p + 1)τ, where τ is the prediction horizon and p is a dataset-specific integer which governs over how many prediction lengths we want to perform testing. We use p = 1 for all datasets but "Electricity" (p = 7), "Exchange Rate" (p = 3), "Solar" (p = 4), "Traffic" (p = 7), and "Wiki" (p = 3).

3. Split the time series into training, validation, and test data. For each time series, we split off the prediction horizon p − 1 times to obtain the time series for the test data. Then, we split off the prediction horizon another time to obtain a validation time series per training time series. Eventually, we construct the following sets:

Z_test = {z^(i)_{1:T_i − jτ} | i = 1, ..., K; j = 0, ..., p − 1},
Z_val = {z^(i)_{1:T_i − pτ} | i = 1, ..., K},
Z_train = {z^(i)_{1:T_i − (p+1)τ} | i = 1, ..., K}.

For the "Taxi" dataset, we need to slightly augment the preprocessing to allow for discontinuities in the available time series. Essentially, each time series z ∈ Z consists of an old part z_old and a new part z_new (data throughout January of two successive years). We then obtain the training data from z_old and construct the validation data by cutting the prediction horizon from z_old. The test dataset is constructed from z_new as in step 3 with p = 56.

This section describes in detail how the deep learning models from Table 2 are trained on the datasets outlined in the previous section. As outlined in Section 3.2, parametric local methods directly estimate their parameters at prediction time. While deep learning models are trained on Z_train such that models at different checkpoints can be compared using Z_val, the parameters of parametric local methods are estimated using Z_val.

For all deep learning models, we run training for a predefined duration depending on the dataset size. For n total observations in the training data, we run training for d hours, where d = 2^η with η = min(max(log_10(n / 10,000), 0), 3). Training therefore always runs between one and eight hours (refer to Table 3 for training times on individual datasets). Even on small datasets, we train for an hour to ensure that the trained model receives sufficiently many gradient updates.

Forecasts of all methods are required to be probabilistic so as to store the 10-quantiles q_0.1, q_0.2, ..., q_0.9. While some methods forecast the quantiles directly (e.g. MQ-CNN, TFT), other methods provide samples of the output distribution (e.g. DeepAR) or point forecasts (e.g. N-BEATS). For sampling-based models, we simply compute the empirical 10-quantiles. For point forecasts, we interpret the forecast as a Dirac distribution centered around the forecasted value y. Thus, each quantile can be set to the value y.

All deep learning models are trained using the Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of ρ = 10^−3. During training, we decrease the learning rate three times by a factor of λ = 1/2 using a linear schedule. When training for a duration d, we decrease the learning rate after the durations D = {d/4, d/2, 3d/4}. This results in a final learning rate of λ^3 ρ = 1.25 · 10^−4. The decay is always applied after the first batch exceeding one of the durations in D.
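The training-duration rule and learning-rate schedule described above can be written compactly as follows (our own sketch, reconstructed from the text; any rounding of η is an assumption):

```python
import math

def training_duration_hours(num_train_observations):
    """d = 2**eta with eta = min(max(log10(n / 10000), 0), 3), i.e. between 1 and 8 hours."""
    eta = min(max(math.log10(num_train_observations / 10_000), 0.0), 3.0)
    return 2.0 ** eta

def learning_rate(elapsed_hours, total_hours, initial_lr=1e-3, decay=0.5):
    """Halve the learning rate after d/4, d/2, and 3d/4 of the training duration d."""
    num_decays = sum(elapsed_hours >= frac * total_hours for frac in (0.25, 0.5, 0.75))
    return initial_lr * decay ** num_decays

# After the final decay, the learning rate is 1e-3 * 0.5**3 = 1.25e-4, as stated above.
```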
During training, we save the model and compute its loss L as well as the nCRPS Q on the validation data a total of 11 times. For a model being trained for a duration t, we perform these computations after a fixed set D̃ of 11 durations spread over the training run: for each d ∈ D̃, we run validation on Z_val after the first batch where the total training time exceeds d and compute the losses L_d and Q_d. Note that we do not add the time taken to run validation to the training time when evaluating whether it exceeds d. We argue that validation can be run swiftly. If necessary, it could also be run on a subset of all available time series for further speedup. For testing, we choose 5 of the 11 models that we saved during training. For each d ∈ I, where I is a subset of five of these durations, we choose the model with the lowest nCRPS that was encountered up to d. Note that this may result in choosing the same model multiple times if continued training did not yield any improvements on the validation data.

All training jobs were scheduled on AWS SageMaker using CPU-only ml.c5.2xlarge instances (8 CPUs, 16 GiB memory). In a few cases, we ran jobs on other instance types: instances with 32 GiB memory to counteract out-of-memory issues (used for N-BEATS on "Corporación Favorita" and "Weather"), instances with 64 GiB memory to allow for even more memory-intensive models (also used for N-BEATS), and instances with 144 GiB memory to speed up predictions by parallelization (used for ARIMA). Including validation and testing, this amounted to roughly 684 days of total training time (∼3 hours and 30 minutes per training job). Parallelized over 200 instances, the actual training time for our benchmark was roughly 4 days.

In the following, we discuss the choice of the surrogate model. First, we outline why a parametric surrogate model is desirable. Then, we describe how model configurations are vectorized to be used by parametric surrogate models. Afterwards, we list the ranking loss with linear discounting that we mention in Section 4.2 and eventually compare the ranking performance of different surrogate models.

Intuitively, a nonparametric surrogate model might appear to be optimal for ranking models. However, a parametric surrogate has several advantages. First, it can interpolate between neighboring configurations, which is essential when the configurations differ between datasets or when the objective is noisy. Second, a parametric model allows for suggesting new configurations that e.g. maximize capacity under a given constraint. Third, a parametric surrogate model f̂_θ may also be conditioned on the dataset by taking dataset features (such as the number of time series or observations in the task at hand) as input, concatenating these features to the input vector of f̂_θ - although we do not consider this possibility here.

When using XGBoost or an MLP as a surrogate model for predicting model performance, models must be represented as feature vectors. For this, we proceed as follows to encode model configurations:

• The model type (i.e. ARIMA, DeepAR, ...) is encoded via a one-hot encoding, yielding a 13-dimensional vector.

• Every model hyperparameter (across models) defines a new real-valued feature. Hyperparameters that are shared among deep learning models (context length multiple, training time, learning rate) are reused across models. This results in 15 additional features which are all standardized to have zero mean and unit variance.

Inputs to the MLP are, thus, 28-dimensional. Features that are missing (e.g. all classical models do not provide a context length multiple) are imputed by the features' means (resulting in zeros due to the standardization).

Section 4.2 outlines that the MLP surrogate model f̂_θ is trained via listwise ranking with linear discounting. For training on the offline evaluations D, f̂_θ minimizes the loss L via SGD. As ranking is performed for each objective k ∈ {1, ..., m} and task j ∈ {1, ..., T} independently, L can be written as follows:

L(θ) = Σ_{j=1}^{T} Σ_{k=1}^{m} L_kj(f̂_θ;k),

where f̂_θ;k yields the surrogate model's outputs for objective k and L_kj(f̂_θ;k) is a listwise ranking loss with linearly discounted terms, defined as follows (where we ignore the indices k and j in the formula for notational convenience):

L_kj(f̂_θ;k) = − Σ_{i=1}^{N} w_i log( exp(−f̂_θ(x_{r(i)})) / Σ_{i'=i}^{N} exp(−f̂_θ(x_{r(i')})) ), with linearly decreasing weights w_i.

Here, N provides the number of available evaluations for task j and r(i) denotes the index of the configuration that is ranked at position i with respect to objective k among all configurations for task j. The ranks of the configurations x_1, ..., x_N with respect to an objective k can be computed from the corresponding evaluations y_1;k, ..., y_N;k. Note that rank i = 1 provides the index of the "best" configuration, and f̂_θ outputs a smaller value for configurations that have a smaller rank (i.e. are "better").
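The feature encoding described above can be sketched with pandas as follows (our own illustration; the column names are assumptions):

```python
import pandas as pd

def encode_configurations(configs: pd.DataFrame) -> pd.DataFrame:
    """Encode model configurations as feature vectors for the surrogate.

    `configs` is assumed to have a categorical 'model' column (13 model types) and one
    numeric column per hyperparameter, with NaN where a hyperparameter does not apply.
    """
    one_hot = pd.get_dummies(configs["model"], prefix="model")     # 13 one-hot features
    hparams = configs.drop(columns=["model"]).astype(float)
    std = hparams.std(ddof=0).replace(0.0, 1.0)                    # guard against constant columns
    hparams = (hparams - hparams.mean()) / std                     # zero mean, unit variance
    hparams = hparams.fillna(0.0)                                  # mean imputation = 0 after standardization
    return pd.concat([one_hot, hparams], axis=1)                   # 28 features in the paper's setup
```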
The following compares the MLP surrogate trained via listwise ranking, as introduced in Section 4, to other possible choices for the surrogate model. Much like the approach described in Section 4.3, we evaluate surrogate models by performing LOOCV: surrogates are evaluated on each of the 44 benchmark datasets one after the other, being trained on the remaining 43 datasets. Surrogates are evaluated with respect to different ranking metrics that give an overview of the models' ability to rank the nCRPS values of model configurations on different datasets. As metrics for measuring the ranking performance on a single dataset, we consider the following, which all yield values between 0 and 1:

• Mean Reciprocal Rank (MRR) (Liu, 2009): computed as one divided by the predicted rank of the model with the lowest true nCRPS.

• Precision@k (Liu, 2009): the fraction of the k models with the lowest true nCRPS captured in the top k predictions.

• Normalized Discounted Cumulative Gain (NDCG) (Liu, 2009): a measure for the overall ranking performance. It first computes the discounted cumulative gain (DCG) by summing the discounted relevance scores 1, 0.9, ..., 0.1 of the 10 models with the lowest true nCRPS scores - more specifically, a relevance score π(i) is discounted by log_2(k + 1), where k is the predicted rank of the model with true rank i. Then, it normalizes the DCG by the ideal DCG (iDCG) to obtain the NDCG.

As a comparison to the ranking MLP surrogate with linear discount (MLP Ranking + Discounting), we first consider a random surrogate which predicts the nCRPS by sampling from U(0, 1) as a baseline (Random). Then, we use a nonparametric model which predicts the nCRPS as the average among all training datasets (Nonparametric Regression) and one that predicts the average rank of the nCRPS across training datasets (Nonparametric Ranking). Additionally, we consider XGBoost surrogates which are trained via regression (XGBoost Regression) and pairwise ranking (XGBoost Ranking), respectively. Lastly, we consider an MLP with the same architecture but trained via regression instead of listwise ranking (MLP Regression) and one without any discounting (MLP Ranking).

In this section, we try to answer the following question: are there simple dataset characteristics that can help to determine which model is going to perform best? In particular, we analyze in detail the results of Section 3.5, where we showed that, on some datasets, (1) classical methods outperformed deep learning methods and that (2) the performance of classical and deep learning methods could not be distinguished. We start by analyzing in more depth the four out of 44 datasets where deep learning models do not outperform classical methods - Figure 2 showed that the best deep learning method outperforms the best classical method on all but four of our benchmark datasets. On these four datasets ("M4 Hourly", "Temperature Rain", "Tourism Monthly", "Vehicle Trips"), the summary statistics differ wildly, as can be seen from Table 3, both in terms of the number of observations, the average length, and other characteristics. Further, the classical method that outperforms the best deep learning method is inconsistent across these datasets ("M4 Hourly": STL-AR, "Temperature Rain": NPTS, "Tourism Monthly": ETS, "Vehicle Trips": NPTS).
Notably, this does not include the local method that performs best on average (i.e. ARIMA, see Table 1). Figure 1 showed that for 14 out of the 44 benchmark datasets, the performance of classical methods is indistinguishable from the performance of deep learning methods. To further study these datasets, we plot in Figure 7 the correlation matrix of all 13 benchmark methods' nCRPS ranks across the datasets (using only the default configuration for the deep learning models). Notably, for twelve datasets ("Exchange Rate" through "M1 Yearly" on the y-axis), we observe a very low correlation with a large number of datasets. On these datasets, all models perform similarly due to strong stochasticity (e.g. "Exchange Rate", "Bitcoin", "M1 Quarterly", ...) or seasonality (e.g. "Tourism", "Rideshare", ...), and predicting their ranking is therefore hard. Notably, 11 of these datasets overlap with the 14 datasets where classical methods are indistinguishable from deep learning methods. In line with these observations, we found that adding simple dataset grouping features such as domain type (electricity, retail, finance) as an input for a surrogate model predicting the performance of forecasting methods does not have much effect. Our conclusion is that the ranks of the best-performing models seem to be relatively hard to identify from simple dataset grouping rules alone. Nonetheless, we hope that by releasing the benchmark data, we help future research to identify dominant methods from dataset characteristics.

Algorithm 1: Overview of PARETOSELECT. Data: time series models X, offline evaluations D, number of default models n. Result: a set {x_1, ..., x_n} ⊂ X of model defaults.