key: cord-0219479-zqdl683z authors: Seth, Taniya; Muhuri, Pranab K. title: Optimizing Hyperparameters in CNNs using Bilevel Programming in Time Series Data date: 2021-01-19 journal: nan DOI: nan sha: 201e9fd888f98d3745522fdb95689073deffe55b doc_id: 219479 cord_uid: zqdl683z arXiv: 2101.07492v1 [cs.LG]

Hyperparameter optimization has remained a central topic within the machine learning community due to its ability to produce state-of-the-art results. With the recent growth of interest in using CNNs for time series prediction, we propose optimizing hyperparameters in CNNs for this purpose. In this position paper, we put forward the idea of modeling the concerned hyperparameter optimization problem using bilevel programming.

Training a machine to perform human-like tasks, such as image recognition and data prediction, involves preparing a good enough model that learns from the given data. This model predominantly involves a training algorithm that is responsible for this learning task. The job of this training algorithm is to develop a function that, in essence, minimizes a loss on the data samples (a subset of the ground truth data) introduced to it. The trained model is then applied to the test data (an out-of-sample subset of the ground truth data), on which the model is evaluated based on another loss. Who evaluates the performance of this model? Who decides exactly how much is good enough? The machine learning literature answers such questions while also producing good models that make machines perform various tasks.

The answer is that the training algorithm in the model is evaluated using a training loss, which measures the difference between the current state of the model's learning and the training data it is given to learn from. The function mentioned earlier in this section minimizes this difference between the learnt data and the training data.
This minimization is with respect to some parameters of the model, say θ. The training algorithm gradually learns these parameters, for example the model weights, during training. However, a model also includes hyperparameters, λ, which [Bergstra and Bengio, 2012] refer to as the "bells and whistles" of a training algorithm. In practice, hyperparameters are chosen first, and the training algorithm is developed afterwards. Because of their importance and influence on training, choosing these hyperparameters requires expert intervention. When optimized hyperparameter values are supplied to the training algorithm, it learns well from the training data while also performing well on the out-of-sample test data. The performance of the model on the test data is evaluated based on a validation loss, which must be minimized for the model to generalize well. This discussion establishes the necessity of hyperparameter optimization (HO) within machine learning models.

This problem has been studied for a long time. [Bergstra et al., 2011] utilized approaches such as the sequential model-based approach, the Gaussian process approach, and the tree-structured Parzen estimator approach for optimizing the expected improvement criterion. Random search was subsequently studied for HO in [Bergstra and Bengio, 2012]. Later, [Thornton et al., 2013] introduced Auto-WEKA for combined selection and HO of classification algorithms. [Eggensperger et al., 2013] put forward an empirical study of Bayesian optimization for hyperparameters. Most importantly, gradient-based HO was discussed in [Maclaurin et al., 2015], wherein exact gradients with respect to the hyperparameters were computed by chaining derivatives backwards through the training procedure using reversible learning. Other works on HO include [Feurer et al., 2015] and [Li et al., 2017].

Having discussed the problem of HO above, one can notice the dual structure that the problem encompasses.
In other words, the performance of a machine learning model is optimized based on the training and validation losses. This optimization is subject to the values chosen for the hyperparameters, λ, of the model. This can be stated as follows: the validation loss of a model is minimized subject to the training loss being minimized, for the model parameterized by the hyperparameters. Such a dual structure appears in many real-life situations, which can be modeled using the bilevel programming strategy [Bard, 2013]. Solving these problems follows a leader-follower approach inspired by game theory [Von Stackelberg and Von, 1952]. Within these problems, the solution space of the objective function (OF) of the leader is constrained by that of the follower problem. Hence, a solution is sought that satisfies both the leader's and the follower's solution spaces while optimizing their individual objectives.

Recently, the idea of HO using bilevel programming was proposed in [Franceschi et al., 2018]. Franceschi and coauthors developed a bilevel optimization framework for HO. Upon formulating the bilevel problem for HO, they observed that it is difficult to obtain a solution to the bilevel model, especially when λ is a real-valued vector of hyperparameters. To overcome this, the exact bilevel problem was approximated, and the approximation was later proven to guarantee solutions.

In the literature so far, the problem of HO has been dealt with mostly for image data. In today's world, time series data is available in abundance. From stock markets to daily average temperatures, human activity data and, now most importantly, COVID-19 data, everything is available as a time series. Leveraging such series for either classification or prediction is crucial. Convolutional neural networks (CNNs) have been utilized for both classification and prediction on time series data.
[Zheng et al., 2014] studied time series data using multi-channel deep CNNs, with special attention to the exploration of feature learning techniques. In [Yang et al., 2015], human activity recognition (HAR) data is classified using deep CNNs, whereas in [Cui et al., 2016], time series classification is done using multiscale CNNs. Other recent works on time series forecasting and classification using CNNs include [Borovykh et al., 2017] and [Yazdanbakhsh and Dick, 2019].

Keeping in mind the relevance of time series prediction in today's world, one can observe that the literature lacks works in which a machine learning model has been optimized for performance on time series data. Hence, in this position paper, we propose the idea of utilizing bilevel programming for HO within CNNs for time series prediction. We first introduce the bilevel framework to model the overall performance of the machine learning model in terms of the training and validation losses. This is done in Section 2. Subsequently, in the same section, we revisit the approximation strategy for the bilevel framework of HO, along with the gradient-based approach to solve the problem. In Section 3, we introduce our proposed framework of utilizing bilevel programming for HO in CNNs for time series prediction. We conclude the position paper in Section 4.

In this section, we revisit the structure of the bilevel programming framework for a machine learning model. As specified in [Franceschi et al., 2018], bilevel programming problems of the following form are considered:

$$\min \{ f(\lambda) : \lambda \in \Lambda \} \tag{1}$$

$$f(\lambda) = \inf \{ E(w_\lambda, \lambda) : w_\lambda \in \arg\min_{u \in \mathbb{R}^d} L_\lambda(u) \} \tag{2}$$

In the above equations, f : Λ → R is defined at λ ∈ Λ, and E : R^d × Λ → R is the leader objective. Also, ∀λ ∈ Λ, L_λ : R^d → R is the follower objective, where {L_λ : λ ∈ Λ} is the class of OFs parameterized by λ. As mentioned earlier, the validation error is sought to be minimized for a machine learning model.
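To make the leader-follower structure concrete, the following is a minimal sketch on a toy problem, not the framework of [Franceschi et al., 2018] itself. It assumes a one-parameter linear model, a squared loss, an L2 penalty weighted by a single hyperparameter `lam`, and a small grid of candidate values; the data and all function names are made up for illustration. The follower minimizes the penalized training loss for a fixed `lam`, and the leader picks the `lam` whose follower solution gives the lowest validation loss.

```python
# Toy bilevel problem: a leader picks lam, a follower fits w for that lam.
# All data, names and the candidate grid are illustrative assumptions.

def follower_solution(lam, train):
    """Closed-form minimizer over w of sum((w*x - y)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)


def leader_objective(w, val):
    """Validation loss E(w): squared error on held-out pairs."""
    return sum((w * x - y) ** 2 for x, y in val)


def solve_bilevel(train, val, candidates):
    """Evaluate f(lam) = E(w_lam) on a grid and return the best candidate."""
    scored = [
        (leader_objective(follower_solution(lam, train), val), lam)
        for lam in candidates
    ]
    best_loss, best_lam = min(scored)
    return best_lam, follower_solution(best_lam, train), best_loss


train = [(1.0, 2.9), (2.0, 6.2), (3.0, 9.4)]  # made-up training pairs
val = [(1.5, 4.4), (2.5, 7.3)]                # made-up validation pairs
candidates = [0.0, 0.1, 0.5, 1.0, 5.0]

best_lam, w_star, val_loss = solve_bilevel(train, val, candidates)
print(best_lam, w_star, val_loss)
```

Because the follower in this toy problem has a closed-form minimizer, the leader can evaluate f exactly at each candidate. For a CNN no such closed form exists, which is what motivates the approximation revisited later in this section.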
Let the model be denoted as g_w : X → Y, parameterized by the vector w, for a given vector of hyperparameters λ. For a predefined loss function l, the leader and follower objectives can be given as follows:

$$E(w, \lambda) = \sum_{(x,y) \in D_{validation}} l(g_w(x), y) \tag{3}$$

$$L_\lambda(w) = \sum_{(x,y) \in D_{train}} l(g_w(x), y) + \text{penalty} \tag{4}$$

Here, D_validation is the validation data presented to g_w for evaluation after it has been trained on D_train. The penalty term can be implemented as a regularizer for the network model to improve performance.

[Franceschi et al., 2018] specified an approximation of the bilevel problem given in (1) and (2). It is given as follows:

$$\min_{\lambda} f_T(\lambda) = E(w_{T,\lambda}, \lambda) \tag{5}$$

$$w_{0,\lambda} = \varphi_0(\lambda), \qquad w_{t,\lambda} = \varphi_t(w_{t-1,\lambda}, \lambda), \quad t \in [T] \tag{6}$$

In the above equations, T is a predefined positive integer, [T] = {1, . . . , T}, φ_0 : R^m → R^d is a smooth initialization dynamic, and ∀t ∈ [T], φ_t : R^d × R^m → R^d is a smooth mapping that represents the operation of an optimization algorithm at the t-th step. The optimization dynamic φ is implemented using the gradient descent optimization algorithm. In [Franceschi et al., 2018], certain assumptions are made in order to reduce the bilevel framework given in (1)-(2), to prove the existence of solutions of the reduced problem, and to prove the convergence of the approximate problems to the reduced problem. They are omitted from this position paper for simplicity.

We discuss our proposed idea in this section. We first define our CNN model for classification purposes. For our time series data, we utilize 1D convolutional layers, which are suited to time series information. For explanation, we utilize a time series dataset with 128 time steps and 9 features. Our deep CNN model for this time series data begins with an input layer, followed by two 1D convolutional layers, each with 64 filters and a filter size of 3. Both layers have the ReLU activation applied. These are followed by a dropout layer with a 50% dropout rate, followed by a max pooling layer.
The output from the max pooling layer is then flattened and forwarded to a dense layer with 100 connections and the ReLU activation, followed by a final dense layer with 9 output units and the softplus activation. The model structure is depicted in Fig. 1. For this model and data, we consider the example weights (w) and the learning rate (lr) of the neurons as the hyperparameters to be optimized. The metric to be minimized by the model is the Mean Squared Error (MSE), while the optimizer utilized is the Adam optimizer. With this scenario defined, the bilevel programming framework for our CNN model for the time series data is described below. For the following, λ = {w, lr} and T = 200.

The follower-level optimizer is the gradient descent optimizer given in [Franceschi et al., 2018]. This follower-level optimizer is defined for the hyperparameter lr, while w is the minimizer for the problems in (7)-(8). We believe that solving this bilevel problem to obtain the optimized value of lr with respect to the minimizer w shall produce state-of-the-art results in terms of MSE. We plan to implement this scenario on a machine with the following specifications: an Intel Core i3-6100 CPU with 12 GB of RAM and the Windows 10 OS. The GPU employed is the NVIDIA GeForce GTX 1660 Super.

In this position paper, we have introduced the idea of using bilevel programming for HO within CNNs for time series data. Since the literature on HO for time series prediction or classification tasks is scarce, we believe that the idea presented here will mark a good start for research in this direction. We utilized a deep CNN architecture to define the model for the purpose of time series prediction. Based on this, we defined a framework for the bilevel programming problem that must be solved to obtain better results than most of the existing models. Our subsequent plans are to implement the scenario introduced within this position paper.
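As a toy illustration of the kind of scenario described above, the sketch below tunes a learning rate by differentiating a validation loss through T steps of gradient descent. It is a minimal sketch under stated assumptions, not the proposed CNN implementation: the follower and leader losses are one-dimensional quadratics, the constants `A`, `C_TRAIN` and `C_VAL` are invented, T is set to 5 rather than 200, and plain gradient descent is used at both levels. The derivative of the iterate with respect to lr is propagated forward alongside the iterate, which is one way to obtain a hypergradient for an unrolled problem of the form (5)-(6).

```python
# Toy unrolled hyperparameter optimization of a learning rate.
# The quadratic losses and every constant below are illustrative assumptions.

A, C_TRAIN, C_VAL = 2.0, 1.0, 0.8  # curvature and loss minimizers (made up)


def unrolled(lr, T, w0=0.0):
    """Run T follower steps on L(w) = 0.5 * A * (w - C_TRAIN)^2.

    Returns the validation loss E(w_T) = 0.5 * (w_T - C_VAL)^2 and its
    derivative with respect to lr. The variable z tracks dw_t/dlr.
    """
    w, z = w0, 0.0
    for _ in range(T):
        grad = A * (w - C_TRAIN)
        z = z * (1.0 - lr * A) - grad  # differentiate the update w.r.t. lr
        w = w - lr * grad              # follower gradient-descent step
    return 0.5 * (w - C_VAL) ** 2, (w - C_VAL) * z


def tune_lr(lr=0.01, T=5, outer_steps=200, outer_lr=0.01):
    """Leader loop: gradient descent on lr using the hypergradient."""
    for _ in range(outer_steps):
        _, hypergrad = unrolled(lr, T)
        lr -= outer_lr * hypergrad
    return lr


lr_star = tune_lr()
loss_before, _ = unrolled(0.01, 5)
loss_after, _ = unrolled(lr_star, 5)
print(lr_star, loss_before, loss_after)
```

Calling `tune_lr` with different values of `T` shows how the tuned learning rate depends on the unrolling length in this toy setting.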
In that implementation, we shall perform a sensitivity analysis over different values of T to observe how the results vary. We also plan to compare the impact of HO using bilevel programming on prediction and classification tasks for time series data. We plan to perform our experiments on human activity recognition (HAR) data.

[Bard, 2013] Practical bilevel optimization: algorithms and applications.
[Bergstra et al., 2011] Algorithms for hyperparameter optimization.
[Borovykh et al., 2017] Conditional time series forecasting with convolutional neural networks.
[Eggensperger et al., 2013] Holger Hoos, and Kevin Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters.
[Thornton et al., 2013] Holger H Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms.
[Yang et al., 2015] Deep convolutional neural networks on multichannel time series for human activity recognition.