key: cord-0554798-1ov0o5e3 authors: Charotia, Himanshi; Garg, Abhishek; Dhama, Gaurav; Maheshwari, Naman title: Dynamic Temporal Reconciliation by Reinforcement learning date: 2022-01-28 journal: nan DOI: nan sha: 683bb2ead8311e83f22591996f39214014e7f8ad doc_id: 554798 cord_uid: 1ov0o5e3

Planning based on long and short term time series forecasts is a common practice across many industries. In this context, temporal aggregation and reconciliation techniques have been useful in improving forecasts, reducing model uncertainty, and providing a coherent forecast across different time horizons. However, an underlying assumption spanning all these techniques is the complete availability of data across all levels of the temporal hierarchy. While this offers mathematical convenience, in practice the low frequency data is often only partially complete and hence not available at forecasting time. On the other hand, high frequency data can change significantly in a scenario like the COVID pandemic, and this change can be used to improve forecasts that would otherwise diverge significantly from the long term actuals. We propose a dynamic reconciliation method whereby we formulate the problem of informing low frequency forecasts based on high frequency actuals as a Markov Decision Process (MDP), allowing for the fact that we do not have complete information about the dynamics of the process. This allows us to have the best long term estimates based on the most recent data available even if the low frequency cycles have only been partially completed. The MDP is solved using a Time Differenced Reinforcement Learning (TDRL) approach with customizable actions and improves the long term forecasts dramatically compared to relying solely on historical low frequency data. The result also underscores the fact that while low frequency forecasts can improve high frequency forecasts, as noted in the temporal reconciliation literature (based on the assumption that low frequency forecasts have a lower noise-to-signal ratio), high frequency forecasts can also be used to inform low frequency forecasts.

Many real-life decision problems involve multiple time horizons. For example, in energy planning, expected demand has to be forecasted for the upcoming hours, the next day, the next week, and even longer horizons [2, 3]. Another example is inventory management, where short term (monthly) forecasts are needed for next month's stocking, whereas long term forecasts are often beneficial for decisions such as establishing contracts to purchase a product [4, 5]. These decisions are made one to two years in advance using long term forecasts, to make sure that decisions made now have a positive impact on future possibilities. In this context, temporal aggregation and reconciliation techniques have been found useful in improving forecasts across multiple horizons [6, 7, 8]. Different temporal aggregations can reveal important information about the underlying data-generating process. When temporal aggregation is applied to a time series, it can strengthen or attenuate different features [9]. In a temporal hierarchy, forecasts for different horizons can be produced by generating forecasts independently for the time series at each level using different (simple) methods. These forecasts, produced by different approaches and based on different information, are most likely incoherent. Reconciliation is necessary because optimal decision-making requires coherent forecasts.
Thus, the main focus while forecasting hierarchical time series is to utilize the information available across all levels of a given hierarchy to produce coherent forecasts. Existing work in hierarchical forecasting follows a methodology in which base forecasts are first generated independently for each time series in the hierarchy, and a combination/revision step is then applied as post-processing to ensure coherence. However, all the methods used in industry or in the literature rely on the availability of complete data across all levels of the temporal hierarchy. While this offers mathematical convenience, complete data is often not available to the decision maker using the forecasts. Due to the dynamic and stochastic nature of time series, it is very difficult to precisely estimate future changes (significant trend or mean shifts) and to adjust forecasts at different horizons based on these changes. An example is a significant trend or mean shift in high frequency patterns at a daily or weekly level, such as those encountered during the COVID pandemic; in such cases the long term actuals are likely to diverge significantly from the forecasts. All the existing methods lead to static temporal reconciliation and will thus perform poorly under dynamic scenarios.

Reinforcement learning, especially deep reinforcement learning (DRL), has been successfully applied in various fields including AlphaGo [10], ATARI games [11] and robotics [12]. Reinforcement learning is a task-independent learning scheme. It is suitable for problems where there is no supervised information but only feedback from an external environment. The problem of informing forecasts across the hierarchy for reconciliation without prior or complete knowledge of the data distribution falls into this category. Furthermore, reinforcement learning is a data-driven approach that is able to capture complex changing dynamics in the data and is well equipped to overcome the deficiencies of the existing methods. We propose a dynamic reconciliation method whereby we formulate the problem of informing low frequency forecasts based on high frequency actuals as a Markov Decision Process (MDP), allowing for the fact that we do not have complete information about the dynamics of the process. This allows us to have the best long term estimates based on the most recent data available even if the low frequency cycles have only been partially completed. The MDP is solved using a Time Differenced Reinforcement Learning (TDRL) approach and improves the long term forecasts dramatically compared to relying solely on historical low frequency data. In temporal difference learning, an agent learns from the environment through episodes with no prior knowledge of the environment's dynamics. We design our own action and reward functions. Based on the factors described above, the contributions of this paper can be summarized as follows:
• To the best of our knowledge, our work presents the first RL approach to the hierarchical reconciliation problem.
• A dynamic reconciliation framework that also works on partially observed data.
• A tunable action design function that incorporates forecast changes within a tolerance while reconciling the forecast, ensuring that the reconciled forecasts do not deviate from expected values.
The rest of this paper is organized as follows. Section 2 introduces the reinforcement learning basics.
In Section 3, we describe the detailed methodology of the proposed TDRL approach to hierarchical time series reconciliation and give a detailed analysis of how to design the actions and the ϵ-greedy policy. We report and discuss experimental results in Section 4. We discuss future directions and conclude in Section 5.

Temporal hierarchies for forecasting are constructed for any time series by means of non-overlapping temporal aggregation. Such aggregation typically leads to a tree structure, but need not necessarily. For example, grouped [13], temporal [14], and cross-temporal aggregations [15] are alternative aggregation paths. Consider a multi-level hierarchy Y_t ∈ R^n at time t, with t = 1, ..., T. Here y_{t,i} ∈ R is the value of the i-th (out of n) univariate time series. The index i denotes the level of the hierarchy: level 0 (i = 0) denotes the completely aggregated series, level 1 the first level of disaggregation, down to level K containing the most disaggregated time series. We refer to the time series at the leaf nodes of the hierarchy as bottom-level series, and the rest of the series are termed aggregated series. We also call the forecasts for all time series in the hierarchy generated without any reconciliation approach base forecasts, denoted by Ŷ_t (not to be confused with bottom-level). We can split the vector of all series Y_t into m bottom entries and r aggregated entries, where n = r + m. Existing approaches for generating coherent forecasts for a hierarchical time series follow a two-step procedure: (i) generate h-step-ahead forecasts for each time series independently to obtain base forecasts Ŷ_{T+h}, and (ii) produce revised h-step-ahead forecasts Ỹ_{T+h} through reconciliation given by the equation below:

Ỹ_{T+h} = S P Ŷ_{T+h},

for some appropriately chosen matrix P of order m × n, where S is the n × m summing matrix that aggregates the bottom-level series up the hierarchy. The P matrix is used to extract and combine the relevant elements of the base forecasts Ŷ_{T+h}.

Reinforcement learning (RL) is a machine learning approach inspired by behaviorist psychology. In RL, an agent interacts with its environment by sequentially taking actions, observing consequences, and altering its behavior to maximize a cumulative reward. RL is usually modeled as an MDP, which consists of a state space S = {s}, an action space A = {a}, state transition dynamics T : S × A → P(S) where P(S) is the set of probability measures on S, an immediate reward function r : S × A → R, and a discount factor γ ∈ [0, 1]. A policy, denoted by π : S → P(A) where P(A) is the set of probability measures on A, fully defines the behavior of an agent. The agent uses its policy to interact with the environment, producing a trajectory of states, actions, and rewards s_1, a_1, r_1, ..., s_T, a_T, r_T over S × A × R (T = ∞ indicates an infinite horizon MDP, otherwise an episodic one). The cumulative discounted reward constitutes the return R = Σ_{t=1}^{T} γ^{t−1} r_t. The agent's goal is to learn an optimal policy π* that maximizes the expected return from the start state, π* = argmax_π E[R | π]. A common method for learning an optimal policy or optimal state-value function is temporal difference (TD) learning, which estimates the value of a state by bootstrapping from the value estimates of successor states using Bellman-style equations. TD methods work by updating the state-value estimates to reduce the TD error, which is the difference between the current estimate of the state value and a new sample obtained from interacting with the environment.
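To make the reconciliation equation concrete, the following minimal numpy sketch (not from the paper; the array values and the bottom-up choice of P are illustrative assumptions) builds a tiny two-level hierarchy, forms the summing matrix S, and applies Ỹ = S P Ŷ with the bottom-up P that simply selects the bottom-level base forecasts.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): a two-level hierarchy
# with one aggregate series above m bottom series, so n = m + 1.
m = 4                                   # number of bottom-level series (kept tiny for clarity)
S = np.vstack([np.ones((1, m)),         # row 0: aggregate = sum of the bottom series
               np.eye(m)])              # rows 1..m: each bottom series maps to itself
# S has shape (n, m) with n = m + 1.

# Base forecasts Y_hat for all n series, ordered [aggregate, bottom_1, ..., bottom_m].
Y_hat = np.array([100.0, 22.0, 26.0, 24.0, 30.0])

# One concrete choice of the m x n matrix P: bottom-up reconciliation,
# which simply picks out the bottom-level base forecasts.
P_bu = np.hstack([np.zeros((m, 1)), np.eye(m)])

# Revised, coherent forecasts: Y_tilde = S @ P @ Y_hat.
Y_tilde = S @ P_bu @ Y_hat
print(Y_tilde)   # the aggregate entry now equals the sum of the bottom entries (102.0)
```

Other choices of P, such as the OLS, WLS, or MinT estimators discussed below, combine information from all levels, but the revised forecasts are coherent by construction for any valid P.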
Top-down and bottom-up approaches have traditionally been used to produce coherent forecasts for a hierarchy. In the top-down approach, forecasts are generated at the top level of the time series and then disaggregated all the way down to the bottom level, whereas in the bottom-up approach, forecasts are generated at the most granular level and then aggregated up [16, 17]. In both of these methods, the generation of forecasts for the entire hierarchy is dominated by a single level of aggregation where the forecasts are produced, ignoring information at all other levels. To optimally combine forecasts from all the series of the hierarchy, [18] proposed the use of an ordinary least-squares (OLS) estimator after formulating the forecast reconciliation problem for a structural hierarchy as a linear regression model. In the regression, the independent base forecasts are modeled as the sum of the expected values of the future series and a coherency error. [13] suggested using weighted least squares (WLS), taking account of the variances on the diagonal of the covariance matrix of the aggregation error but ignoring the off-diagonal covariances. Later, [19] considered a generalized least-squares (GLS) estimator for P (MinT) and found that incorporating correlation information into the reconciliation procedure is beneficial for forecast accuracy when combined with a simple shrinkage estimator. The advantages of the MinT approach are that its revised forecasts are coherent by construction and that it uses information from all levels of the hierarchy simultaneously. A disadvantage is the strong assumption that the base forecasts are unbiased. [14] showed that it is possible to use the reconciliation framework proposed by [18] to produce coherent forecasts by representing temporally aggregated series as hierarchical time series. The issue of obtaining coherent forecasts along both cross-sectional and temporal dimensions (i.e., cross-temporal coherency) has been dealt with by [20, 15, 21] and [22].

For this paper, we use only a 2-level hierarchy. Level 0 denotes the monthly aggregated data and Level 1 denotes its disaggregation into daily data. The framework takes as input the aggregated (monthly) forecasts and the bottom-level (daily) forecasts Ŷ, which are the inputs provided to any reconciliation approach. In addition, the framework has access to the ongoing high frequency actuals. The high frequency actuals help the agent learn the daily shares and model the ongoing dynamics in the data. We consider an episodic MDP with a discount factor γ = 1, where an episode (typically one month) starts with a low frequency forecast. From the external model, the agent also gets the share of each day in a month. The agent is given the high frequency forecast ŷ_t at time step t and takes an action of increasing or decreasing the forecast for that day to capture the daily variation/fluctuation. The main goal of the agent is to update the monthly forecast based on the lower level forecasts and actuals; this is done using the reward function R. The goal of the agent is to generate reconciled low frequency forecasts based on the high frequency actual share seen per day. More specifically, the core elements of the MDP are explained as follows:
• S: The states in the environment. The state here represents the days of a month, which helps the agent keep track of how many days have passed in the month and the corresponding daily share.
The state s_t is represented by: 1) t, the current day of the month, and 2) the state value, which is the remaining monthly total to be adjusted.
• A: The set of actions available to the agent in a state. We design our own set of actions, which is discussed in a later section.
• T: The transition function of the MDP. We use an ϵ-greedy policy, which mixes two behaviors, switching between a random policy and the Q-based policy using the probability hyperparameter ϵ (note that the exploration probability ϵ is distinct from the tolerance level ε introduced below).
• R: A reward r_t is a scalar which measures the goodness of the action a_t taken by the agent in the state s_t. The reward r_t at each time step t is the high frequency actual encountered on that day.
• γ: We set the reward discount factor γ = 1.

Actions help the agent interact with the environment and get feedback. As the main goal is the adjustment of the forecast, we define the actions accordingly. The agent has to perform an action of increasing or decreasing the forecast for each day to capture the daily variation/fluctuation. However, the amount by which to increase or decrease the daily level forecast can take infinitely many values. To make this action space discrete, we introduce the tolerance level parameter ε, which buckets the action space. In any state s_t, the agent can increase or decrease the high frequency forecast based on the tolerance level ε. Taking the example of 3 actions, at any time step t the agent can set the updated daily level forecast to ŷ_t − ε, ŷ_t, or ŷ_t + ε. The tolerance level ε can be tuned according to the problem at hand and decides how much fluctuation from the daily level forecast is permitted to the agent. To illustrate the concept, consider ε = $5 and a daily level forecast value at time t of $30. The agent can then take one of the 3 actions of decreasing, keeping, or increasing the forecast, giving updated daily forecasts of $25, $30, or $35 respectively.

Epsilon-greedy is a method for selecting actions that balances exploration and exploitation by choosing randomly between the two, where ϵ refers to the probability of choosing to explore. A policy is ϵ-greedy with respect to an action-value function estimate Q if, for every state,
• with probability 1 − ϵ, the agent selects the greedy action, and
• with probability ϵ, the agent selects an action uniformly at random from the set of available (non-greedy and greedy) actions.
The probability of selecting non-greedy actions increases with larger values of ϵ. To construct a fixed policy π that is ϵ-greedy with respect to the current action-value function estimate Q, we set π(a|s) = 1 − ϵ + ϵ/|A(s)| if action a maximizes Q(s, a), and π(a|s) = ϵ/|A(s)| otherwise, for each s ∈ S and a ∈ A(s).

For finding the optimal state-value function, we use the SARSA variant of the Temporal Difference TD(0) algorithm. TD(0) improves upon the drawbacks of the Monte Carlo (MC) method, in which the agent has to wait until the end of an episode to obtain the actual (experienced) return before it can update and improve the value function estimate. TD(0) updates the state value function at every time step t without waiting for the end of the episode. For a finite-state MDP under a policy π, the update rule is given by:

V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)],

where V(S_t) denotes the state value function for state S_t, α is the step-size parameter, γ is the discount factor, and R_{t+1} is the immediate reward received at time step t+1. Putting these together, we present our Dynamic Temporal Reconciliation (DTR) framework as given in Fig. 1.
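As a concrete illustration of the pieces above, here is a minimal Python sketch of the tolerance-bucketed action set, the ϵ-greedy action selection, and a SARSA-style TD(0) update. It is a sketch under simplifying assumptions (a tabular Q indexed by the day of the month, and illustrative names such as eps_tol and alpha), not the authors' implementation.

```python
import random
from collections import defaultdict

# Illustrative sketch only: names, the tabular state (day index), and the exact
# reward handling are assumptions, not the paper's released implementation.

ACTIONS = (-1, 0, +1)  # decrease by eps_tol, keep unchanged, increase by eps_tol

def adjusted_forecast(daily_forecast, action, eps_tol):
    """Apply one of the three tolerance-bucketed actions to the daily base forecast."""
    return daily_forecast + action * eps_tol

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon explore uniformly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """SARSA-style TD(0) update: move Q(s, a) toward r + gamma * Q(s', a')."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# One step of the example from the text: eps_tol = 5 and a daily forecast of 30,
# so the adjusted forecast is one of 25, 30, or 35.
Q = defaultdict(float)   # tabular action values, keyed by (day, action)
day = 13                 # the paper's full state also tracks the remaining monthly total
action = epsilon_greedy(Q, day, epsilon=0.05)
print(adjusted_forecast(30.0, action, eps_tol=5.0))
```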
The framework is built using TD(0) and takes as input the high frequency forecasts. The state values are first initialised as the difference between the forecasted low frequency total and the cumulative high frequency forecasts seen up to time step t, for every t ∈ {1, ..., n}. The action value function is initialised using the three actions defined in Section 3.2, based on the forecast for every t ∈ {1, ..., n}, for the construction of the corresponding ϵ-greedy policy. See Fig. 2 for an illustration. Based on the ϵ-greedy policy, the agent takes an action a_t ∈ A in state s_t ∈ S. Here the reward r_t is not directly beneficial to the agent at each step, but combined with the value update function it helps achieve the overall goal of the agent.

In this section, we present the empirical study of DTR along with the experimental setup and implementation details. We perform experimental validation of our approach using the NIFTY 50 data set [1]. Fig. 3 shows the high frequency data aggregated at the weekly level. To evaluate model performance, we use the mean absolute percentage error (MAPE). This metric is calculated as

MAPE = (100 / n) Σ_{t=1}^{n} |Y_t − Ŷ_t| / |Y_t|,

where n is the number of observations in the time series, and Y_t and Ŷ_t are the observed and forecasted values at time step t. We report the performance of different settings of the proposed algorithm in Table 1. For this experimentation, we use a fixed ϵ-greedy policy along with the 3 actions defined in Section 3.2. The results in all experiments are given as "Reconciled monthly forecast / MAPE of the reconciled forecast with respect to the monthly actual (MAPE_rec) / Percentage improvement over the monthly forecast without reconciliation (%_f)". As can be seen, the proposed DTR clearly outperforms both the aggregated high frequency forecast and the low frequency forecast without reconciliation. The framework starts with the aggregated high frequency forecast and, based on the actuals seen per day, changes its estimate of the low frequency forecast. The results reveal how the DTR framework adapts to the high frequency actuals under different parameter initialisations. If there is an unexpected increase or decrease in daily values between days, the framework increases or decreases its estimated low frequency forecast, as shown in Table 1. On Day 13, there is a large decrease in the high frequency actual (from 10040 to 9108), which makes the framework adjust the estimated low frequency forecast from 365334 to 298910, with ϵ set to 5% and the tolerance ε to 20%. The reconciled forecasts are highly dependent on the tolerance level. The tolerance level parameter is problem specific and should be adjusted based on the business needs. We have tested our framework with values of ε at 10%, 20%, and 30% of the average high frequency forecast value, i.e., 36736, 73473, and 110209 respectively.

The proposed framework has two hyper-parameters, ϵ and ε. ϵ is used to balance exploration and exploitation, which helps in selecting the action with the highest estimated reward most of the time. ε is a parameter that adjusts the strictness of the agent. Table 3 contains the effects of ϵ and ε on the agent's performance. This provides the user the flexibility to change the agent's behavior according to their requirements. A smaller ϵ relies on the greedy action that the agent believes has the best long-term effect, while a higher ϵ helps the agent explore more. Similarly, a smaller ε leads to adjustments only near the mean value of the daily forecast, whereas a higher ε helps the agent model high-variance changes as well.

In this paper we present a Dynamic Temporal Reconciliation (DTR) framework that incorporates a dynamic adjustment layer into traditional reconciliation algorithms. The integration allows us to better model complex changes in the high frequency forecast and adjust the long term estimates accordingly. DTR therefore provides a powerful tool to handle incomplete low frequency data in reconciliation tasks based on high frequency actuals and is well equipped to overcome the deficiencies of existing reconciliation methods. For reconciliation, we propose a reinforcement learning framework formulated as an MDP. This is solved using a Time Differenced Reinforcement Learning (TDRL) approach with a custom action design and reward function. The proposed framework is data-driven and can be customised based on business needs. We examine the framework on the NIFTY 50 dataset [1], and the DTR framework demonstrates superior performance over baseline forecasts. Although we introduce DTR in an environment with only 3 actions and a fixed ϵ-greedy policy, the framework can be customised by changing the number of actions designed and by learning a dynamic policy function.
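To make the evaluation metric in Section 4 concrete, the short sketch below computes MAPE and a percentage-improvement figure, under the assumption that the improvement is measured relative to the unreconciled forecast's MAPE; the function names and the toy numbers are illustrative, not the paper's NIFTY 50 results.

```python
def mape(actual, forecast):
    """Mean absolute percentage error (in percent) over paired observations."""
    assert len(actual) == len(forecast) and len(actual) > 0
    return 100.0 / len(actual) * sum(abs(y - f) / abs(y) for y, f in zip(actual, forecast))

def pct_improvement(mape_base, mape_reconciled):
    """Relative improvement of the reconciled forecast's MAPE over the base forecast's MAPE."""
    return 100.0 * (mape_base - mape_reconciled) / mape_base

# Toy numbers only (not the paper's results): one monthly actual against the
# unreconciled base forecast and a reconciled forecast.
actual, base, reconciled = [100.0], [120.0], [105.0]
m_base, m_rec = mape(actual, base), mape(actual, reconciled)
print(m_base, m_rec, pct_improvement(m_base, m_rec))   # 20.0 5.0 75.0
```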
[1] NSE - National Stock Exchange of India Ltd.
[2] Probabilistic forecast reconciliation with applications to wind power and electric load.
[3] Hierarchical probabilistic forecasting of electricity demand with smart meter data.
[4] Bayesian intermittent demand forecasting for large inventories.
[5] Approximate Bayesian inference in linear state space models for intermittent demand forecasting at scale.
[6] The effect of aggregation on prediction and estimation in the autoregressive model.
[7] Asymptotic behaviour of temporal aggregates of time series.
[8] Temporal aggregation of univariate and multivariate time series models: A survey.
[9] A classification of business forecasting problems.
[10] Mastering the game of Go with deep neural networks and tree search.
[11] The Arcade Learning Environment: An evaluation platform for general agents.
[12] Reinforcement learning in robotics: A survey.
[13] Fast computation of reconciled forecasts for hierarchical and grouped time series.
[14] Forecasting with temporal hierarchies.
[15] Cross-temporal aggregation: Improving the forecast accuracy of hierarchical electricity consumption.
[16] Hierarchical forecasts for Australian domestic tourism.
[17] Disaggregation methods to expedite product line forecasting.
[18] Optimal combination forecasts for hierarchical time series.
[19] Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization.
[20] Reconciling solar forecasts: Sequential reconciliation.
[21] Cross-temporal coherent forecasts for Australian tourism.
[22] Cross-temporal forecast reconciliation: Optimal combination method and heuristic alternatives.