Time Series Regression

Chang Wei Tan, Christoph Bergmeir, Francois Petitjean, Geoffrey I. Webb

2020-06-23

Abstract: This paper introduces Time Series Regression (TSR): a little-studied task whose aim is to learn the relationship between a time series and a continuous target variable. In contrast to time series classification (TSC), which predicts a categorical class label, TSR predicts a numerical value. This task generalizes forecasting by relaxing the requirement that the predicted value be a future value of the input series, or that it depend primarily on recent values. In this paper, we motivate and introduce the task, and benchmark candidate solutions on a novel archive of 19 TSR datasets which we have assembled. Our results show that the state-of-the-art TSC model Rocket, when adapted for regression, performs the best overall compared with other TSC models and state-of-the-art machine learning (ML) models such as XGBoost, Random Forest and Support Vector Regression. More importantly, we show that much research is needed in this field to improve the accuracy of ML models.

In the past decade, there has been increasing interest in time series analysis research, in particular time series classification (TSC) (Bagnall et al., 2017; Dau et al., 2019; Bagnall et al., 2015; Fawaz et al., 2019a; Dempster et al., 2019; Tan et al., 2020b) and time series forecasting (TSF) (Hyndman, 2018; Makridakis et al., 1982; Makridakis and Hibon, 2000; Makridakis et al., 2018, 2020). TSC is the task of predicting a discrete label that assigns a time series to one of a finite set of categories (Bagnall et al., 2017; Dau et al., 2019). TSF, on the other hand, aims to predict future values of a series based on recent or seasonal values; it typically assumes that future values will resemble recent values more closely than those in the distant past. Despite the thousands of papers published in both of these fields each year, there has been little investigation of Time Series Regression (TSR), i.e. of how to predict numerical values that depend on the whole series, rather than depending more on recent values than on earlier ones.

The term regression has different meanings in different contexts. In the broader machine learning (ML) context, regression means predicting a continuous numerical value from a set of features (Segal, 2004; Sammut and Webb, 2011). With respect to TSF, regression usually means fitting the historical time series data with a regression model such as ARIMA (Box and Jenkins, 1970) or Exponential Smoothing (Gardner Jr, 1985; Hyndman et al., 2008; Chatfield, 1978) to forecast future values of the series. These TSF regression models typically rely heavily on recent or seasonal values, or on sliding input windows of some form. In this work, we use the term Time Series Regression to refer to a more general methodology for predicting a single continuous value from a time series. The target can be a continuation of the input time series or entirely unrelated to it; it need not be a future value, nor depend on recent values. Where predicting a future value of a series is of interest, the problem becomes a TSF problem; where predicting one of a finite set of discrete values is of interest, it becomes a TSC problem.
We are interested in a more general task that lies between these two, and which cannot be solved directly using models from either. For instance, we are interested in predicting the heart rate of a person from accelerometer data (Reiss et al., 2019; Zhang et al., 2014), or predicting crop yield or fuel load from a series of satellite images describing the evolution of the 'colours' of the vegetation over the years; none of these targets are discrete or future values. Figure 1 shows the example of predicting live fuel moisture content (LFMC) across the United States using a series of satellite images, where LFMC is a continuous value in the range from 0 to 200%. The input is the series of spectral values (i.e. a time series of colour values) representing the state of a surface (or 'pixel') over the last 12 months; the target is the amount of moisture in the vegetation, i.e. the ratio between the weight of water in vegetation and the weight of the dry part of vegetation (information obtained by sampling vegetation in the field, weighing it, drying it and weighing it again). This is a very important variable because the risk of fire increases very rapidly as soon as the LFMC goes below 80% (Yebra et al., 2018), making it invaluable for bushfire early warning systems. A very similar application is predicting crop yield from these same series of spectral values, a task of great importance for food safety and agricultural planning.

Clearly, we need models that are able to learn the relationship between time series data and a continuous target variable. There has been some research in this area, but the models and features are designed for their specific tasks (Reiss et al., 2019; Zhang et al., 2014; Zhang, 2015; De Vito et al., 2008). Unfortunately, these models do not generalise well to other problems. For instance, features created from photoplethysmogram (PPG) measurements for heart rate estimation (Zhang et al., 2014; Reiss et al., 2019) cannot be used to predict crop yields, and vice versa. Therefore, in this paper, we aim to motivate research into developing more general TSR algorithms. We start by introducing the first TSR benchmarking archive, which we have assembled; it contains 19 datasets from various domains (Tan et al., 2020a). These datasets have varying numbers of dimensions, dimensions of unequal length, and missing values. They are used to benchmark some existing models adapted from classical regression and TSC. Our results show that simple variants of some state-of-the-art TSC models outperform standard regression techniques (i.e. ones developed for tabular data) that do not take into account the underlying series nature of the data. More importantly, we show that most methods obtain similar accuracies, and the top method, Rocket, is not far ahead of XGBoost and Random Forest (Breiman, 2001) in accuracy, which motivates the development of this subfield of research.

The rest of this paper is organised as follows. In Section 2, we introduce the problem that we aim to address and discuss related work. We then describe some applications of TSR with respect to the benchmark datasets we created in Section 2.2. Section 3 describes how classical regression and TSC models can be adapted for TSR. We then evaluate these models on the first TSR benchmark datasets in Section 4.
Finally, in Section 5, we summarise our contributions and give some directions for future work.

The term Time Series Regression (TSR) has different meanings in different contexts. In this section, we give a formal definition of TSR as we employ it, clear up misunderstandings that readers might have, and introduce the task that we aim to address. We first define a time series in Definition 1.

Definition 1 A time series $S$ is an ordered collection of $L$ pairs of measurements and timestamps, $S = \{(s_1, t_1), (s_2, t_2), \ldots, (s_L, t_L)\}$, where $s_i \in \mathbb{R}^D$ and $t_1$ to $t_L$ are the timestamps for measurements $s_1$ to $s_L$.

Note that the $D$-dimensional measurement $s_i$ measures the same phenomenon with different instruments at the same time. Time series data differ from static data in that the temporal ordering of the attributes is critical to finding the best discriminating features.

Classification and Regression are both supervised learning tasks that learn the relationship between a target variable and a set of features (Sammut and Webb, 2011). The main difference between them is that Classification predicts a categorical value that places a data instance into one of a finite set of categories, while Regression predicts a continuous value. Regression tasks become Classification tasks when the predicted values are discretised into finite labels. In this work, we focus only on Regression. A linear regression, for example, assumes a linear relationship between a set of predictors (features) and a target variable, and fits a linear function of the predictors to generate a prediction for the target variable.

Traditionally in ML, the features used for regression are static and have no relation to time. For instance, we could predict house prices using features such as the number of bedrooms, crime rate, nitric oxides concentration (pollution level), accessibility to radial highways and weighted distances to employment centers. These features (predictors) do not depend on time and are unlikely to change much over time. They are then used to train an ML model such as a Random Forest (Breiman, 2001), XGBoost or even linear regression to predict the house price, the target variable that we are interested in. Different from the traditional regression problem, the TSR problem that we tackle in this work considers time series data as the features. With respect to the house price prediction example, instead of using a single value for the number of rooms, crime rate or pollution level, we use the time series of these features, for example the daily crime rate or daily pollution level over the last month, to predict house prices. A more concrete example of TSR in our context is the prediction of heart rate, which can only be achieved using time series data such as PPG and accelerometer data (Reiss et al., 2019; Zhang, 2015; Zhang et al., 2014) that measure the pulse and movement of the subject over a certain period of time.

A very large branch of time series analysis deals with TSF (Hyndman, 2018; Hyndman et al., 2008; Makridakis et al., 2018), where Regression carries a slightly different meaning. In TSF, Regression is used to fit autoregressive models to the historical time series, modelling its recent and/or seasonal values. Figure 2 shows an example of a linear autoregressive model of order 7, AR(7): the model uses the minimum daily temperatures of the past 7 days to forecast the minimum daily temperature for the next day. These models are then extrapolated to predict future values of the same time series.
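To make the contrast with TSR concrete, below is a minimal sketch of fitting a linear AR(7) model by ordinary least squares and producing a one-step-ahead forecast. The synthetic temperature-like series and the helper names are illustrative assumptions, not artefacts from the paper.

```python
import numpy as np

def fit_ar(series, order=7):
    """Fit a linear autoregressive model of the given order by least squares."""
    # Lag matrix: row j holds series[j], ..., series[j+order-1], predicting series[j+order].
    X = np.column_stack([series[i:len(series) - order + i] for i in range(order)])
    X = np.hstack([X, np.ones((len(X), 1))])  # intercept term
    y = series[order:]                         # next-step values to predict
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def forecast_next(series, coef, order=7):
    """One-step-ahead forecast from the last `order` observations."""
    return np.dot(np.append(series[-order:], 1.0), coef)

# Toy minimum-daily-temperature series (an assumption for illustration).
rng = np.random.default_rng(0)
temps = 10 + 5 * np.sin(np.arange(400) * 2 * np.pi / 365) + rng.normal(0, 1, 400)
coef = fit_ar(temps, order=7)
print("next-day forecast:", forecast_next(temps, coef))
```

Note how the prediction depends only on the 7 most recent values; the TSR models discussed in this paper, by contrast, may draw on the entire series.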
Going back to the example of predicting house prices, autoregressive models can be used to fit past house price data and produce good forecasts of future house prices, since the price very likely depends on the prices in the previous months. In our TSR context, we can also build models to predict future house prices using past house prices. However, we aim at developing more general models that do not make the assumptions that frequently underlie forecasting models, such as that the most recent values are the most indicative of future values. In other words, forecasting models will not be useful in our TSR example of predicting heart rate, as heart rate is not a future value of the PPG and accelerometer data, and does not depend more on the final values of these series than on the initial ones. Formally, we define the task of Time Series Regression in Definition 2.

Definition 2 A time series regression model is a function $\mathcal{T} \to \mathbb{R}$, where $\mathcal{T}$ is a class of time series. Time series regression seeks to learn a time series regression model from a dataset $D = \{(t_1, r_1), \ldots, (t_n, r_n)\}$, where $t_i$ is a time series and $r_i$ is a numeric value.

While we have not been able to identify any prior work specifically addressing the more general class of learning task that we call time series regression, there are a number of specialised techniques addressing specific cases. In addition to forecasting, one case that has received considerable attention is heart rate (HR) estimation using photoplethysmogram (PPG) sensors (Reiss et al., 2019; Zhang et al., 2014). Existing methods rely on spectral analysis (Zhang et al., 2014; Zhang, 2015; Salehizadeh et al., 2016; Schäck et al., 2017), but they are not very accurate (Reiss et al., 2019). A convolutional neural network that takes the signal in the frequency domain as input has been proposed to improve prediction accuracy (Reiss et al., 2019), and was shown to be significantly more accurate than the existing spectral methods.

Similar to heart rate estimation, respiratory rate (RR) estimation can also be achieved using PPG sensors (Pimentel et al., 2016; Meredith et al., 2012; Pimentel et al., 2015). Estimating RR is an important task because an abnormal RR is often the earliest sign of critical illness (Meredith et al., 2012). Existing methods fail to distinguish between periods of high- and low-quality data and do not generalise well to other datasets (Pimentel et al., 2016). Typically, RR is estimated from PPG by applying a moving window to the time series, producing one RR estimate per window (Pimentel et al., 2016), and the process consists of four key components: (a) extracting respiratory signals; (b) estimating respiratory rates; (c) fusing the estimates; and (d) assessing signal quality (Pimentel et al., 2015, 2016). A probabilistic approach was proposed (Pimentel et al., 2015) using the Gaussian process regression framework to extract RR from the different sources of modulation in the PPG signal. The authors later proposed another method (Pimentel et al., 2016) that fits multiple autoregressive models to the extracted respiratory signals.
Their method was evaluated on two datasets, Capnobase (Karlen et al., 2010) and BIDMC (Pimentel et al., 2016), both available at http://peterhcharlton.github.io/RRest/datasets.html. Although the results showed that their method achieved the best mean absolute error (MAE) on both datasets compared with other existing RR estimation methods, it was only significantly different from one of those methods on the Capnobase dataset, and there was no significant difference on the BIDMC dataset. Beyond health monitoring, similar work has been done for pollution monitoring, where the goal is to predict pollutant concentrations using on-field sensors (De Vito et al., 2008). De Vito et al. (2008) proposed a simple feed-forward network with 5 hidden layers, taking 7 sensor inputs, to estimate the benzene concentration in an Italian city. The method, although simple, achieved a very low MAE of 0.13 µg/m³, but is not generalisable.

To support research into TSR, we created the first TSR benchmarking archive, available online at http://timeseriesregression.org/. In this section, we describe possible applications of TSR and our first TSR archive. The current TSR archive contains 19 time series datasets from 5 application areas: Health Monitoring, Energy Monitoring, Environment Monitoring, Sentiment Analysis and Forecasting. The archive contains 8 datasets assembled from the UCI machine learning repository (Dua and Graff, 2017), 1 from a signal processing competition (Zhang et al., 2014), 1 from the Covid-19 database of the World Health Organisation, 1 from the Australian Bureau of Meteorology (BOM); the rest are donations. These datasets are unnormalised, with varying numbers of dimensions, dimensions of unequal length, and missing values. We briefly describe these datasets below and refer readers to Tan et al. (2020a) for a more detailed description. Table 1 outlines the properties of the datasets in the current TSR archive.

Table 1: Time series datasets in the current TSR archive. Datasets marked with an asterisk (*) have different lengths from one dimension to another (but the length is the same for all instances in any single dimension).

With advances in Smart City and Internet of Things applications, the task of monitoring energy and power consumption has become more important than ever. The ability to predict energy and power consumption accurately can save millions of dollars for a big company. Energy monitoring is typically done by collecting data such as temperature, humidity, rain, voltage and current readings from sensors attached all over a building. These data are collected in the form of time series and are mapped to the power consumption of the building; for example, higher power consumption will be observed during winter months, as more energy is required to heat the building. AppliancesEnergy, HouseholdPowerConsumption1 and HouseholdPowerConsumption2 are the three datasets in this archive targeting this application.

In the context of climate change, environment monitoring has also become increasingly important. Environment monitoring is the task of predicting quantities related to our environment, such as pollution levels, rainfall, crop yield and flood water level. The three datasets BenzeneConcentration, BeijingPM10Quality and BeijingPM25Quality focus on predicting pollution levels in metropolitan cities.
LiveFuelMoistureContent is a dataset for predicting live fuel moisture content (the moisture content in vegetation) from series of satellite images, as described in the introduction. Predicting the moisture content is critical for bushfire prevention, potentially averting the loss of thousands of lives and millions to billions of dollars in damage. The three FloodModeling datasets address the prediction of water height in different riverbeds given a series of rainfall events. Here again, the ability to predict the rise of water is critical to mitigating flood risk. The relationship between rainfall and water height in different locations is non-linear, as it depends on topography, transpiration and rainfall dynamics. Here we assume that topography and land cover (which drives transpiration) are not known, and propose to model water height directly from the rainfall time series. Finally, the AustraliaRainfall dataset contains the hourly temperature at various locations in Australia, and the goal is to predict the total daily rainfall at those locations based on the hourly temperature. This is useful because temperature sensors are much cheaper and easier to maintain than rain gauges.

Health monitoring is the task of monitoring the health or vital signs of an individual. The data typically come from a wearable device attached to the subject, such as a photoplethysmogram (PPG), electrocardiogram (ECG), electroencephalogram (EEG) or accelerometer. In this work, we focus on three tasks: estimating heart rate, respiratory rate and blood oxygen saturation level. PPGDalia, IEEEPPG and BIDMCHR are datasets focusing on heart rate estimation. BIDMCRR and BIDMCSpO2 are datasets for predicting respiratory rate and blood oxygen saturation level, respectively.

Sentiment analysis is the interpretation and classification of emotions (positive, negative or neutral) within text, using text analysis techniques. This is typically done by analysing text comments or posts on websites and social media platforms to predict a sentiment score (Moniz and Torgo, 2018). Moniz and Torgo (2018) released a dataset containing 100,000 news items on four topics (economy, microsoft, obama and palestine), together with the respective social feedback on 3 social media platforms: Facebook, Google+ and LinkedIn. Here we attempt a different approach, predicting the sentiment score by analysing the number of reactions a piece of news receives on the respective social media platforms. We included the NewsHeadlineSentiment and NewsTitleSentiment datasets, which aim to predict the sentiment score of a news headline and a news title, respectively, using the number of reactions over time on social media platforms.

As described in the introduction and Section 2, TSF is the task of predicting future values based on some recent and/or seasonal values. This is usually done by fitting a model to the historical data and extrapolating it into the future. Our TSR problem can be seen as a general case of forecasting, where we still predict a continuous value, but one that is not necessarily a future value and need not depend more heavily on recent values. Thus, we included in this archive a dataset that could easily be solved with forecasting models, to show that forecasting tasks can also be tackled using TSR models. The Covid3Month dataset contains the daily number of confirmed COVID-19 cases in most countries of the world from January to March 2020, and the goal is to predict the death rate at the start of April 2020.
In this section, we describe how some standard regression and TSC models can be adapted for TSR problems. Most methods previously developed for TSR cases are highly specific to one problem and are not generalisable, as discussed in Section 2.1. Definition 2 highlights the similarity of TSR to TSC (Bagnall et al., 2017): the only difference between the two tasks is that the target variable is continuous for TSR and discrete for TSC. Hence, in principle, most methods developed for TSC can be adapted for TSR problems.

Classical regression models are designed for tabular data. These models learn a mapping function from input features to the target variable, and the features typically do not take into account the temporal dimension that is important for time series data. Hence, these models need to be adapted for TSR. A simple way to adapt them is to flatten the time series into a single long feature vector of length $D \times L$, where $D$ is the number of dimensions and $L$ is the length of the time series. For instance, a time series with 3 dimensions and 100 data points becomes a feature vector with 300 features, which can then be passed as input to any standard regression model (a code sketch of this adaptation is given at the end of this subsection).

The k-Nearest Neighbour (k-NN) model is one of the simplest and most intuitive ML models, and is non-parametric (Sammut and Webb, 2011). A k-NN model requires two parameters: (1) the number of nearest neighbours $k$ and (2) a distance measure (Sammut and Webb, 2011). Many distance metrics, such as the Euclidean, Manhattan, Minkowski or Mahalanobis distances, can be used with a k-NN model. Using one of these metrics, the model finds the $k$ instances in the training dataset nearest to a query instance in the feature space (Sammut and Webb, 2011). For regression, the target values of the $k$ nearest neighbours are averaged and assigned as the prediction for the query instance. A weighted average based on the distances to the query can also be used to put more emphasis on nearer neighbours. In Section 3.2.1, we discuss a similar model for time series that takes into account the temporal dimension of the data.

The Support Vector Machine (SVM, Cortes and Vapnik, 1995) is a popular classification model; its regression counterpart is commonly known as Support Vector Regression (SVR, Drucker et al., 1997). Although SVR is designed for regression, it differs slightly from the traditional regression task: the objective in traditional regression is to minimise the error directly, while SVR tries to fit the error within a threshold ε. SVR works by mapping the data into a higher-dimensional space, so that it is linearly separable, using a kernel function such as the linear, polynomial or Gaussian Radial Basis Function (RBF, Cortes and Vapnik, 1995) kernel. It then fits a hyperplane through the data, bounded by two boundary lines at distance ε from the hyperplane. The boundary lines are formed by support vectors, the data points closest to the boundary.

Another popular ML algorithm is the Random Forest (RF, Breiman, 2001), which has proven very robust on many tasks (Segal, 2004). It is a bootstrap aggregation (also known as bagging) ensemble learning method that combines the predictions of multiple decision trees to improve prediction accuracy (Breiman, 2001). Bagging randomly samples the data with replacement to build multiple models and aggregates the outputs of all models; it aims to reduce the variance of high-variance models such as decision trees. RF builds a multitude of decision trees at training time and, for regression tasks, outputs the average value of the appropriate leaves (Breiman, 2001). There are 2 main hyper-parameters to tune for each problem: the number of trees $N_{tree}$ and the number of features randomly selected at each node $m$ (Breiman, 2001). One major disadvantage of RF is that it is prone to overfitting on noisy classification or regression tasks.
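The sketch below illustrates the flattening adaptation described above on toy data, assuming the series are stored as an (n, D, L) array; the dataset and variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Toy data: n multivariate series with D dimensions of length L, and a scalar target.
n, D, L = 200, 3, 100
rng = np.random.default_rng(0)
X_series = rng.normal(size=(n, D, L))
y = X_series.mean(axis=(1, 2)) + rng.normal(0, 0.1, n)

# Flatten each series into a single D*L feature vector (here 300 features).
X_flat = X_series.reshape(n, D * L)

# Any standard regressor can now be trained on the flattened representation.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_flat[:150], y[:150])
knn = KNeighborsRegressor(n_neighbors=5).fit(X_flat[:150], y[:150])  # averages the 5 nearest targets
print(rf.predict(X_flat[150:155]), knn.predict(X_flat[150:155]))
```

Note that flattening discards the temporal ordering that Definition 1 identifies as critical, which is exactly the weakness the time series models below aim to address.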
Extreme Gradient Boosting (XGBoost, Chen and Guestrin, 2016) is another accurate and popular machine learning algorithm. Like RF, XGBoost is a decision-tree-based ensemble learning algorithm that aims to reduce variance and bias. Unlike RF, which uses bagging, XGBoost uses gradient boosting with regularisation to avoid overfitting, a known problem in RF. XGBoost reduces bias by building models sequentially while minimising the errors of the previous models, using the gradient descent algorithm; this progressively "boosts" the model's performance.

A time series classification model maps time series to a finite set of discrete labels that categorise the time series (Bagnall et al., 2017). In this section, we describe how some state-of-the-art TSC models can be modified to predict a continuous value.

Time series nearest neighbours (NN; Lines and Bagnall, 2015; Tan et al., 2020b) is similar to the classical k-NN model described in Section 3.1.1. Instead of the nearest feature vector, the goal is to find the training time series nearest to a query time series under a distance measure. In this case, the whole multivariate time series is used for the search and is not flattened into a feature vector; consequently, the distance measures (Tan et al., 2020b) also differ slightly from those of classic k-NN models. The simplest is the Euclidean distance (ED), analogous to the ED used in classic k-NN models. Equation 1 gives the ED between two time series $P$ and $Q$, where $D$ is the number of dimensions and $L$ is the length of the time series:

$$ED(P, Q) = \sqrt{\sum_{d=1}^{D} \sum_{i=1}^{L} \left(p_i^d - q_i^d\right)^2} \qquad (1)$$

Note that, instead of flattening the series into a feature vector, the distance is computed between the two time series and summed over all dimensions.

One of the most popular distance measures is the Dynamic Time Warping (DTW) distance. It computes the minimum distance between two time series by finding their optimal alignment, taking into account the temporal order of the data (Tan et al., 2018, 2020b). Time series NN with the DTW distance has been a state-of-the-art TSC model for more than a decade (Bagnall et al., 2017; Dau et al., 2019; Lines and Bagnall, 2015; Tan et al., 2020b). Figures 3a and 3b show the difference between the ED and the DTW distance. For multivariate time series, DTW can be computed dependently or independently of the dimensions of the time series (Shokoohi-Yekta et al., 2017); these variants are commonly known as $DTW_D$ and $DTW_I$. The modification of these models for regression tasks is the same as for the classic k-NN model: the average target of the nearest neighbours is assigned to the query. In this work, we focus on the two most popular TSC NN algorithms, NN with ED (NN-ED) and with DTW distance (NN-DTW); a simplified DTW sketch follows below.
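The following is a minimal, illustrative sketch of a dependent multivariate DTW (in the spirit of $DTW_D$, with the per-timestep cost summed over dimensions) and a 1-NN regressor built on it. It omits the warping windows and lower bounds used in practical implementations, and the toy data are assumptions for illustration.

```python
import numpy as np

def dtw_distance(P, Q):
    """Dependent DTW between two series of shape (D, L): squared Euclidean
    cost over all dimensions at each pair of time steps, then the usual
    dynamic-programming alignment. No warping window, for clarity."""
    L1, L2 = P.shape[1], Q.shape[1]
    cost = np.full((L1 + 1, L2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, L1 + 1):
        for j in range(1, L2 + 1):
            d = np.sum((P[:, i - 1] - Q[:, j - 1]) ** 2)
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return np.sqrt(cost[L1, L2])

def nn_dtw_predict(query, train_series, train_targets):
    """1-NN regression: assign the target of the closest training series."""
    dists = [dtw_distance(query, s) for s in train_series]
    return train_targets[int(np.argmin(dists))]

# Toy usage: 20 training series of shape (2, 50) with scalar targets.
rng = np.random.default_rng(0)
train_series = [rng.normal(size=(2, 50)) for _ in range(20)]
train_targets = np.array([s.mean() for s in train_series])
query = rng.normal(size=(2, 50))
print(nn_dtw_predict(query, train_series, train_targets))
```

For k > 1, the prediction would instead average the targets of the k closest series, exactly as in the classical k-NN regressor.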
Deep learning models are capable of predicting both discrete labels (classification) and continuous values (regression); fundamentally, the output of a neural network is a continuous value. Typically for classification tasks, a softmax activation at the output layer produces class probabilities, and classification is done by taking the class with the highest probability. For regression tasks, the softmax activation is replaced with a linear activation. Apart from the activation function, the loss function has to change as well: the categorical cross-entropy loss commonly used for classification is replaced by either the mean squared error or the mean absolute error loss; in this work, the mean squared error is chosen. A sketch of this adaptation is given at the end of this subsection.

Recently, several deep learning models have been developed and benchmarked for TSC (Wang et al., 2017; Fawaz et al., 2018, 2019b). In this work, we adapt three TSC deep learning models: Residual Networks (ResNet), Fully Convolutional Neural Networks (FCN) and the Inception network. ResNet and FCN were first proposed in Wang et al. (2017). In a recent survey on deep learning for TSC (Fawaz et al., 2019a), ResNet was ranked the most accurate univariate TSC model, benchmarked on 85 univariate time series datasets (Dau et al., 2019). ResNet consists of 3 residual blocks with 3 convolutional layers in each block, followed by a global average pooling layer and an output layer. Different from typical convolutional networks, ResNet has shortcut residual connections between the convolutional layers, which make training easier by reducing the vanishing gradient effect.

FCN is the most accurate deep learning model for multivariate TSC on 12 multivariate time series datasets (Baydogan and Runger, 2015) and the second most accurate for univariate TSC. It is composed of three convolutional blocks with batch normalisation and a ReLU activation function. Global average pooling is then applied to the last convolutional block and connected to a softmax classifier. For regression, the softmax activation function is replaced with a linear activation function.

Fawaz et al. (2019b) recently proposed the Inception network, which significantly improved on existing deep learning models and achieved performance competitive with the state-of-the-art TSC model HIVE-COTE (Lines et al., 2016). The Inception network consists of two residual blocks, each connecting its input to the next block's input to mitigate the vanishing gradient problem. Each residual block comprises three Inception modules, and each module has two major components. The first is a bottleneck layer that reduces the dimensionality of the time series using $m$ filters, which also allows the Inception network to have filters ten times longer than ResNet's. The second slides multiple filters of different lengths over the output of the first component. A MaxPooling operation is also applied to the time series in parallel with these two components. The outputs of each convolution and of the MaxPooling operation are then concatenated to form the output of the Inception module. Finally, global average pooling is applied to the final residual block and passed to a fully connected layer for classification. In our work, we use the same architectures as the original papers (Fawaz et al., 2019a,b), with minor modifications to the activation and loss functions as mentioned above; we refer interested readers to the respective papers for the details of these architectures.
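As an illustration of this adaptation, the sketch below builds an FCN-style backbone and swaps the classification head for a single linear-activation unit trained with mean squared error. The layer configuration follows the common FCN design, but it should be read as a simplified sketch rather than the exact architecture used in the cited papers.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fcn_regressor(n_timesteps, n_dims):
    """FCN-style backbone with the classification head swapped for regression:
    a single linear-activation output unit trained with mean squared error."""
    inputs = tf.keras.Input(shape=(n_timesteps, n_dims))
    x = inputs
    # Three convolutional blocks with batch normalisation and ReLU, as in FCN.
    for n_filters, kernel in [(128, 8), (256, 5), (128, 3)]:
        x = layers.Conv1D(n_filters, kernel, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(1, activation="linear")(x)  # replaces the softmax classifier
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")  # replaces categorical cross-entropy
    return model

model = build_fcn_regressor(n_timesteps=100, n_dims=3)
model.summary()
```

The same two-line change (linear output unit, MSE loss) applies equally to ResNet and the Inception network.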
Recently, Dempster et al. (2019) proposed the Rocket classifier, which achieves state-of-the-art accuracy in TSC with a fraction of the computational expense of existing methods. Rocket transforms the time series using a large number of random convolutional kernels and trains a ridge regression classifier on the result. The kernels have random length, weights, bias, dilation and padding, and when applied to a time series each produces a feature map. The maximum value and the proportion of positive values are then computed from each feature map, producing two real-valued features per kernel; with the default 10,000 kernels, Rocket produces 20,000 features. Rocket was found to be the most accurate TSC classifier compared with other state-of-the-art models such as HIVE-COTE (Lines et al., 2016) and InceptionTime when benchmarked on the 85 TSC datasets (Dau et al., 2019). In this work, we adapted Rocket by replacing the ridge regression classifier with a ridge regression model; a simplified sketch of this adaptation follows.
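Below is a heavily simplified sketch of this adaptation: random kernels, the max and proportion-of-positive-values features, and a ridge regression fit. Unlike the real Rocket, it omits dilation and padding and uses far fewer kernels; the toy data and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def random_kernels(n_kernels, rng):
    """Simplified Rocket-style kernels: random length, centred random weights,
    random bias. The real Rocket also samples dilation and padding."""
    kernels = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])
        weights = rng.normal(size=length)
        weights -= weights.mean()
        bias = rng.uniform(-1, 1)
        kernels.append((weights, bias))
    return kernels

def transform(series, kernels):
    """Two features per kernel: the max of the feature map and the proportion
    of positive values (ppv), as in the Rocket transform."""
    feats = []
    for weights, bias in kernels:
        conv = np.convolve(series, weights, mode="valid") + bias
        feats.extend([conv.max(), (conv > 0).mean()])
    return feats

# Toy univariate series with continuous targets (illustrative assumptions).
rng = np.random.default_rng(42)
train_series = [rng.normal(size=150) for _ in range(100)]
train_targets = np.array([s.mean() for s in train_series])

kernels = random_kernels(1000, rng)
X_train = np.array([transform(s, kernels) for s in train_series])
# The only change from Rocket-for-TSC: a ridge regressor replaces the ridge classifier.
reg = RidgeCV(alphas=np.logspace(-3, 3, 10)).fit(X_train, train_targets)
print("in-sample RMSE:", np.sqrt(np.mean((reg.predict(X_train) - train_targets) ** 2)))
```

Because the transform is fixed and only a linear model is fitted, training remains cheap even with many thousands of kernels, which is the source of Rocket's speed advantage.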
In this section, we evaluate the regression models described in Section 3 and set a baseline using the datasets from our TSR archive (Tan et al., 2020a) described in Section 2.2. Missing values in the time series are linearly interpolated. When using a traditional (non-temporal) regression model, the time series are flattened into a single long feature vector. We used the standard Scikit-Learn Python library (Pedregosa et al., 2011) to implement the SVR and RF models. The default parameters are used for the SVR model, with ε = 0.1 and C = 1. XGBoost was implemented using the Python XGBoost library. Apart from the number of trees, we use the default parameters for both RF and XGBoost from the Python libraries. The deep learning models were adapted from the code released by the original authors, and the time series NN algorithms were all implemented in Java. Our source code has been made available online at https://github.com/ChangWeiTan/TSRegression.

Since some of the models are non-deterministic, we evaluate all models over 5 runs and report the average root mean squared error (RMSE), one of the most widely used metrics for regression tasks. Equation 2 gives the formal definition of RMSE, where $n$ is the number of instances and $y_i$ and $\hat{y}_i$ are the actual and predicted targets, respectively:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2)$$

We compare the models statistically over the current datasets, following the recommendations of Demšar (2006). First, we rank each model by RMSE on every dataset: rank 1 is assigned to the model with the lowest RMSE and the largest rank to the model with the highest, with fractional ranks assigned in case of ties. We then compute the average rank of each model, and the Friedman test (Friedman, 1940; Demšar, 2006) is applied to the average ranks. If the null hypothesis is rejected, the post-hoc two-tailed Nemenyi test is used to compare the models to each other (Demšar, 2006). Under this test, the performance of two models is significantly different if their average ranks differ by at least the critical difference given in Equation 3, where $q_\alpha = 3.219$ is the critical value for $\alpha = 0.05$, $k = 11$ is the number of models and $N = 19$ is the number of datasets, giving $CD = 3.4638$:

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} \qquad (3)$$

Finally, a critical difference diagram is used to visualise the comparison, where a thick horizontal line connecting a group of models indicates that the models in the group are not significantly different from one another (Demšar, 2006). Figure 4 shows the critical difference diagram comparing the models used to benchmark the existing archive, with the average ranks indicated next to the models.

Figure 4 shows that Rocket is the most accurate model, with an average rank of 3.2632, and that it is significantly different from SVR, NN-ED and 1-NN-$DTW_D$. The figure also shows that there is no significant difference between the state-of-the-art time series models and the classical regression models, suggesting that there is room for better TSR models to be developed.

Table 2 shows the performance of these models on all the datasets in the archive. The results show that Rocket performs the best overall, with the lowest average RMSE rank, followed by the other state-of-the-art TSC models. RF and XGBoost are both very competitive with the time series models. This is expected, as XGBoost and RF are among the state of the art in ML, especially in popular data science and ML competitions (Nielsen, 2016). On the energy monitoring and health monitoring tasks, time series models clearly perform better than classical regression models, with the top 3 models all being time series models; for instance, the Inception network performs best on the heart rate prediction tasks, while Rocket is the most accurate on the energy prediction tasks. There is no clear winner on the environment monitoring tasks: classical regression models perform better at predicting pollution levels, while time series models perform better on the remaining datasets. The reason is that the pollution metrics in these datasets can be estimated fairly easily by applying a threshold to the gas sensor measurements, a task at which classical regression models such as RF and XGBoost excel. Nonetheless, we expect that a TSR model using feature extraction techniques, such as the TSC counterparts Shapelet Transform (Lines et al., 2012), Time Series Forest (Deng et al., 2013) and BOSS (Schäfer, 2015), would perform better than classical regression models. Although there is also no clear winner on the new sentiment analysis task that we propose in this work, the results show that predicting sentiment scores using time series data is feasible, with very low RMSE scores. Classical regression and time series models perform similarly on the forecasting tasks. This is expected, as neither type of model is designed for forecasting; we expect that a forecasting model adapted for TSR, perhaps incorporating a recurrent neural network, would perform better. Moreover, the small Covid3Month dataset, with 140 time series of length 84, may not provide enough data for the models to train on. Overall, the results indicate that there is a need to design better TSR models that generalise well across most datasets.

In this paper, we introduced and motivated the Time Series Regression problem, in which the goal is to predict a continuous value from time series data. We showed examples of real-life applications where TSR may be useful and discussed some existing methods for this task. We benchmarked these methods on the first TSR benchmarking archive and showed that Rocket, one of the state-of-the-art TSC models, performs the best overall. Despite the superior performance of Rocket and other state-of-the-art models, machine learning models such as XGBoost and Random Forest are equally competitive. This suggests that much research is needed to develop better algorithms and improve accuracy on TSR problems.
References

Bagnall et al. (2015). Time-series classification with COTE: the collective of transformation-based ensembles.
Bagnall et al. (2017). The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances.
Baydogan and Runger (2015). Learning a symbolic representation for multivariate time series classification.
Box and Jenkins (1970). Time series analysis: forecasting and control.
Chatfield (1978). The Holt-Winters forecasting procedure.
Chen and Guestrin (2016). XGBoost: a scalable tree boosting system.
Cortes and Vapnik (1995). Support-vector networks.
Dau et al. (2019). The UCR time series archive.
De Vito et al. (2008). On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario.
Dempster et al. (2019). Rocket: exceptionally fast and accurate time series classification using random convolutional kernels.
Demšar (2006). Statistical comparisons of classifiers over multiple data sets.
Deng et al. (2013). A time series forest for classification and feature extraction.
Drucker et al. (1997). Support vector regression machines.
Dua and Graff (2017). UCI machine learning repository.
Fawaz et al. (2018). Transfer learning for time series classification.
Fawaz et al. (2019a). Deep learning for time series classification: a review.
Fawaz et al. (2019b). InceptionTime: finding AlexNet for time series classification.
Friedman (1940). A comparison of alternative tests of significance for the problem of m rankings.
Gardner Jr (1985). Exponential smoothing: the state of the art.
Karlen et al. (2010). CapnoBase: signal database and tools to collect, share and annotate respiratory signals.
Lines and Bagnall (2015). Time series classification with ensembles of elastic distance measures.
Lines et al. (2012). A shapelet transform for time series classification.
Lines et al. (2016). HIVE-COTE: the hierarchical vote collective of transformation-based ensembles for time series classification.
Makridakis and Hibon (2000). The M3-competition: results, conclusions and implications.
Makridakis et al. (1982). The accuracy of extrapolation (time series) methods: results of a forecasting competition.
Makridakis et al. (2018). The M4 competition: results, findings, conclusion and way forward.
Makridakis et al. (2020). The M4 competition: 100,000 time series and 61 forecasting methods.
Meredith et al. (2012). Photoplethysmographic derivation of respiratory rate: a review of relevant physiology.
Moniz and Torgo (2018). Multi-source social feedback of online news feeds.
Nielsen (2016). Tree boosting with XGBoost: why does XGBoost win "every" machine learning competition?
Pedregosa et al. (2011). Scikit-learn: machine learning in Python.
Pimentel et al. (2015). Probabilistic estimation of respiratory rate from wearable sensors.
Pimentel et al. (2016). Toward a robust estimation of respiratory rate from pulse oximeters.
Reiss et al. (2019). Deep PPG: large-scale heart rate estimation with convolutional neural networks.
Salehizadeh et al. (2016). A novel time-varying spectral filtering algorithm for reconstruction of motion artifact corrupted heart rate signals during intense physical activities using a wearable photoplethysmogram sensor.
Sammut and Webb (2011). Encyclopedia of machine learning.
Schäck et al. (2017). Computationally efficient heart rate estimation during physical exercise using photoplethysmographic signals.
Schäfer (2015). The BOSS is concerned with time series classification in the presence of noise.
Shokoohi-Yekta et al. (2017). Generalizing DTW to the multi-dimensional case requires an adaptive approach.
Tan et al. (2018). Efficient search of the best warping window for dynamic time warping.
Tan et al. (2020a). Monash University, UEA, UCR time series regression archive.
Tan et al. (2020b). FastEE: fast ensembles of elastic distances for time series classification.
Wang et al. (2017). Time series classification from scratch with deep neural networks: a strong baseline.
Yebra et al. (2018). A fuel moisture content and flammability monitoring methodology for continental Australia based on optical remote sensing.
Zhang (2015). Photoplethysmography-based heart rate monitoring in physical activities via joint sparse spectrum reconstruction.