authors: Nath, Somjit; Baranwal, Mayank; Khadilkar, Harshad
title: Revisiting State Augmentation methods for Reinforcement Learning with Stochastic Delays
date: 2021-08-17
DOI: 10.1145/3459637.3482386

Several real-world scenarios, such as remote control and sensing, involve action and observation delays. The presence of delays degrades the performance of reinforcement learning (RL) algorithms, often to such an extent that the algorithms fail to learn anything substantial. This paper formally describes the notion of Markov Decision Processes (MDPs) with stochastic delays and shows that delayed MDPs can be transformed into equivalent standard MDPs (without delays) with a significantly simplified cost structure. We employ this equivalence to derive a model-free Delay-Resolved RL framework and show that even a simple RL algorithm built upon this framework achieves near-optimal rewards in environments with stochastic delays in actions and observations. The delay-resolved deep Q-network (DRDQN) algorithm is benchmarked on a variety of environments comprising multi-step and stochastic delays, and achieves better performance, both in terms of near-optimal rewards and minimal computational overhead, in comparison with the currently established algorithms.

Two of the most common simplifying assumptions that pose challenges in practical scenarios are the absence of (a) observation delay, i.e., the system's current state is always assumed to be available to the agent, and (b) action delay, i.e., an agent's action is assumed to have an immediate effect on the system's trajectory. It is well known that the presence of delays in dynamical systems degrades the performance of the agents, resulting in undesirable behaviors of the underlying closed-loop systems and leading to instability [22, 25]. Many real-world problems, including but not limited to congestion control [2], control of robotic systems [15, 16], distributed computing [13], management of transportation networks [11], and financial reporting [36], exhibit the undesirable effects of action and observation delays on the performance of the corresponding agents. The medical domain is another significant, and probably the most pertinent, domain in today's world where decision-making is based on delayed state information. For instance, the decision to self-isolate and quarantine in case of suspected COVID-19 infection is based on the outcome of a swab-based test, the result of which is significantly delayed (often by a day or two), since labs tend to run large batches of swab tests together, not to mention the time taken by the labs to collect samples [20, 27]. Similarly, in biological systems, visual and proprioceptive information for motor control is significantly (≈500 ms) delayed during transmission through neural pathways [38]. In recent years, deep reinforcement learning (DRL) has demonstrated enormous capability in learning complex dynamics and interactions in highly constrained environments [12, 14, 23, 30]. Almost all the problems that DRL addresses are modeled using MDPs. By ignoring the delays experienced by agents, one can equivalently model the underlying problem as a partially observable MDP (POMDP). POMDPs are generalizations of MDPs; however, solving POMDPs without estimating hidden action states leads to arbitrarily sub-optimal policies [31].
Therefore, it becomes imperative to augment the delayed observation with the last few actions in order to ensure the Markov property [3, 6, 17, 34] in delayed settings. The reformulation allows an MDP with delays to be viewed as an equivalent MDP without delays that has the same state-transition structure as the original MDP, but at the expense of an increased (augmented) state-space and a relatively complex cost structure involving a conditional expectation. While the state augmentation is essential to ensure the Markov property, the resulting Bellman equation with a conditional expectation in the equivalent cost structure requires vast computational resources for estimation. In particular, if the effect of an action or state observation is delayed by τ units at any stage, then the time-complexity of computing the expected cost scales as O(τ|S|²) [3, 17], where S denotes the finite set of states. Moreover, the size of the "augmented" state grows exponentially with τ, i.e., I_τ = S × A^τ, and thus tabular learning approaches become computationally prohibitive even for moderately large delays. Here I_τ and A represent the finite sets of information states and actions, respectively. In view of these challenges, most of the existing work on designing reinforcement learning algorithms for delayed environments has assumed constant delays and has tried to handle them either by adopting model-based methods (which are again computationally expensive and susceptible to sub-optimality) or by simulating future observations.

In this paper, we further extend the applicability of reinforcement learning algorithms to delayed environments on the following fronts: (a) We formally define and introduce the constant delay MDP (CDMDP) and the stochastic delay MDP (SDMDP), and show that MDPs with deterministic or stochastic delays can be converted into equivalent MDPs without delays. This is in contrast with semi-Markov Decision Processes (SMDPs), where the agent can wait for the current state to become observable or wait for the most recent action to get applied before taking subsequent actions; (b) The equivalence presents itself as a general framework for modeling MDPs with delays as MDPs without delays, within which the augmented states (information states) carry the necessary information for optimal action selection and the conditional expectation in the cost structure is significantly simplified; (c) A Delay-Resolved Deep Q-Network (DRDQN) reinforcement learning algorithm is developed, which is shown to achieve near-optimal performance with minimal computational overhead. In particular, by using deep neural networks as nonlinear function approximators and disregarding a tabular framework, the computational complexity scales as O(|S| + τ|A|), i.e., linearly with the delay τ, and thus becomes amenable to work with.

A word on notation: a no-delay Markov Decision Process (MDP) is denoted by the tuple ⟨S, A, P_A, r, γ⟩, where S and A represent the finite sets of states and actions, respectively. For each triple (s, s′, a) ∈ S² × A, the probability that the system state transitions from s to s′ under the effect of the action a is denoted by p_a(s′|s) ∈ P_A. The immediate reward associated with each state-action pair (s, a) is denoted by r(s, a). The cumulative reward is discounted by a factor 0 < γ < 1 and is given by the infinite sum Σ_{t≥0} γ^t r(s_t, a_t). Let τ represent the instantaneous delay in an agent's action or observation. We use I_τ = S × A^τ to denote the set of augmented states.
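As a rough sense of scale (the numbers below are purely illustrative and not taken from the paper's experiments), the following short Python snippet contrasts the exponential growth of the tabular augmented state space with the linear growth of the function-approximation input dimension used later by DRDQN.

```python
# Tabular augmented state space vs. linear network input, for illustrative sizes.
S_SIZE, A_SIZE = 100, 4                  # |S| = 100 states, |A| = 4 actions (assumed)

for tau in (1, 5, 10):
    tabular = S_SIZE * A_SIZE ** tau     # |I_tau| = |S| * |A|^tau  (exponential in tau)
    net_input = S_SIZE + tau * A_SIZE    # one-hot state + tau one-hot actions (linear in tau)
    print(f"tau={tau:2d}  tabular augmented states={tabular:>12,}  network input dim={net_input}")
```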
Note that in the case of stochastic delay, τ is not constant and thus the definition of I_τ changes accordingly; however, with a slight abuse of notation, I is used to represent the set of augmented states independent of the total delay, with the understanding that its meaning is clear from the context. Finally, we use δ_o and δ_a to differentiate between the delays in observation and action, respectively, though it is shown later that the two delays have similar effects on optimal decision making in delayed environments.

Fig. 1 is an illustration of an RL agent involved in sequential decision making in a delayed environment.

Figure 1: Graphical illustration of the problem with observation and action delays. An action is computed in every time step, using the latest known state and a vector of all actions taken since that state was observed.

At the beginning of each stage t, the agent has access to the most recent observed state s_{t−δ_o} with δ_o ≤ t, and the most recent sequence of actions (a_{t′}, a_{t′+1}, …, a_{t−2}, a_{t−1}) with t′ ≜ min(t−δ_o, t−δ_a), where a_{t−δ_a} is the most recent delayed action that got applied in an otherwise undelayed environment. It is easy to notice that, in order to ensure the Markov property in a stochastic delay setting, it is important to keep track not only of the most recently observed state, but also of the time-instant at which it was first observed. Instead of naively waiting for the current state to become observable and the current action to get applied, the agent must make an informed decision at each step in order to minimize the total time taken to complete a given task.

One of the motivating examples of a real-world delayed setting concerns medical decision making for a suspected COVID-19 infection, where the mean time between symptom onset and a case being confirmed is estimated to be 5.6 days (standard deviation 4.2 days), see Fig. 2. Therefore, based on a mean incubation period of 5.3 days [21], it may take up to 10.9 days on average for a case to be confirmed post-infection.

Figure 2: Histogram of the time between onset of COVID-19 symptom(s) and case confirmation for 139 patients from Basel-Landschaft, adapted from [29].

Assuming that a subject starts developing some early flu-like symptoms, basing the decision to self-isolate only on the availability of complete information (after 10.9 days) may result in a further increase in infections around the subject. To overcome this concern, the agent can instead resort to decision making in an MDP setting that takes into account the most recent observation (early symptoms, the subject's proximity to potential confirmed cases, etc.) and the action history (whether the subject avoided close contact with others during the early symptomatic phase, adherence to wearing a face mask, etc.). Thus, the ability to perform decision-making in the presence of delays, without having to wait for the effect of the current action to become observable, is critical, particularly when the environment responds poorly to such 'waiting' actions. Hence, it is extremely important to design algorithms with the ability to perform real-time decision-making. The main objective of this paper is to learn to perform optimal decision-making in delayed environments without having to wait for the observations to arrive or for actions to get applied. It must be emphasized that the equivalent MDP formulation proposed in this paper, with simplified reward structure and appropriately chosen
information state, need not have the same optimal reward as the original MDP; however, the two MDPs share the same optimal policy, and we leverage this equivalence for optimal decision-making in both deterministic and stochastic delayed environments.

MDPs in delayed environments have been applied extensively in a variety of domains, particularly in optimal control of systems with delayed feedback [3] and design of communication networks in the presence of transmission delays [2]. Much of the prior work on delayed MDPs [3, 4, 8, 18, 19, 35] and POMDPs [32] considered only constant observation delays, which was later extended to analyze constant delays in action and cost collection [2, 5]. The notion of a "Constantly Delayed Markov Decision Process" was formally introduced in [34]. MDPs with constant delays have largely been addressed using augmented states in the literature [5, 17]. However, as the size of the augmented state grows exponentially with the extent of the delay, augmented approaches have found little application due to intractability. A naive alternative is to use memory-less methods that exploit prior information on the extent of the constant delay [28] without the need for state augmentation. Subsequently, model-based approaches were introduced to predict the current state in a delayed environment; these involve estimating the transition probabilities in a multi-step delayed framework [9, 33]. However, model-based approaches are computationally expensive when compared with model-free approaches [24], let alone in the presence of delayed observations and actions. Fortunately, recent advances in deep learning have facilitated practical implementation of augmented approaches since, unlike with tabular methods, the complexity is only polynomial in the size of A. [26] recently formalized Real-Time Reinforcement Learning (RTRL), a deep-learning based framework that incorporates the effect of single-step action delay. For small delays, [37] analyzes the influence of incorporating the action-selection time towards correcting the effect of constant delay in an MDP. [6] introduced partial trajectory sampling built on top of Soft Actor-Critic with augmented states in order to account for random delays in environments, with significantly better performance. Most recently, [10] also introduced the use of forward models to improve performance under constant delays.

The stochastic delayed environment in this work is fundamentally different from the stochastic setting described in [17]. [17] makes impractical assumptions on asynchronous cost collection, where observations arrive only in order and the cost is calculated only after the observation. Moreover, the amount of delay between any consecutive steps is non-negative, so the overall episodic delay can quickly grow unbounded. A more recent work on decision-making in stochastic delayed environments [6] proposes a different solution for solving random delay problems by using augmented states along with the delay values for both action and observation. However, the authors in [6] leverage the values of the action and observation delays at each step to convert off-policy sampled trajectories into on-policy ones. Consequently, their approach is not suited for optimal decision-making in the delay-agnostic environments discussed in this work. A detailed analysis of the relevant related work, along with its merits and demerits, is presented in Section 5.
We now formalize the general setting of an MDP with constant and stochastic delays in terms of the set of information states I, action space A, state-transition probabilities P_A, discount factor γ, one-step reward r(·), delay in observation δ_o, and delay in application of action δ_a. It must be noted that, from the point of view of the agent, action delay is functionally equivalent to observation delay (as shown later in Lemma 1). The effect of action (or observation) delay can be better understood through the following simple two-state MDP. The example below is adapted from [10]; however, [10] considers a simple Markov chain where state-transitions are independent of the agent's actions. We suitably modify it to consider controlled Markov chains, i.e., MDPs.

Proposition 1. For any p ∈ (0.5, 1], the optimal reward for the undelayed MDP shown in Fig. 3 is greater than the optimal reward for the corresponding delayed version. Furthermore, if p → 1, i.e., in the limiting case of a deterministic MDP, the effect of delay becomes negligible. Thus, the example shown in Fig. 3 highlights the fundamental trade-off between stochasticity and delays.

Proof. The proof uses the fact that the problem is symmetric in its states, and thus the optimal rewards from each state, as well as the corresponding optimal policies, are also symmetric in the system states. We begin by making two key observations: 1. The problem is symmetric in the system states, i.e., the optimal reward V*(x_0) = V*(x_1). Without loss of generality, we only consider the computation of V*(x_0) in our analysis. 2. From symmetry, it can further be concluded that the optimal stationary policy π* is symmetric in the two states; it is later shown that the optimal policy necessitates a_t = 0. Let us first consider the case with no delay; by definition, the optimal reward V*_t(x_0) satisfies the standard Bellman optimality recursion. Furthermore, the transition probabilities are independent of the time t, and thus V*_{t+1} = V*_{t+k} ≜ V* for all k ∈ ℤ. Similarly, let us consider the optimal reward Ṽ*_{t+2}(x_0) under one-step delay, given by (3). Again, since the transition probabilities are independent of t and the optimal policies are stationary, Ṽ*_{t+2}(x_0) = Ṽ*(x_0) for all t. In order to evaluate the optimal reward in (3), the corresponding two-step transition probabilities need to be computed. For brevity, the analysis below shows the computation of only the first term on the right-hand side of (3) (the two-step transition probability under action 0). Fig. 4 shows the Markov chain for the aforementioned two-step transition along with the corresponding one-step transition probabilities: under action 0, the system either transitions from x_0 to x_1 or remains in x_0, with the complementary probabilities p and 1 − p specified in Fig. 3. Likewise, the one-step transition probabilities are evaluated to reach the goal-state x_1. From Fig. 4, the two-step transition probability in (4) follows. The transition probability from x_0 to itself under action 1 can similarly be obtained, and it turns out to be the same as in (4) for the given problem parameters. Thus, from (3) and (4), the optimal return Ṽ*(x_0) is obtained; the return is maximized for a suitable choice of optimal policy π*, leading to (5). Thus, from (5), it can be concluded that for any p ∈ [0.5, 1], the delayed reward Ṽ*(x_0) is always smaller than or equal to the undelayed reward V*(x_0). If p → 1, the delayed reward approaches the undelayed reward, with equality at p = 1, indicating the fundamental trade-off between stochasticity and delays. The above analysis can easily be extended to any amount of delay (and not necessarily a unit-step delay). □
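To make the gap in Proposition 1 concrete, and to preview the state-augmentation construction formalized next, the sketch below compares value iteration on a small undelayed MDP against value iteration on its one-step action-delayed counterpart, built over information states of the form (state, pending action). The transition matrix, reward table, and parameter values are illustrative assumptions and not the exact specification of Fig. 3.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=2000):
    """P[a, s, s'] are transition probabilities, R[s, a] rewards; returns optimal values V[s]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R.T + gamma * (P @ V)        # Q[a, s] = r(s, a) + gamma * E[V(s') | s, a]
        V = Q.max(axis=0)
    return V

def augment_one_step_action_delay(P, R):
    """Equivalent undelayed MDP over information states i = (s, pending action)."""
    n_actions, n_states, _ = P.shape
    n_info = n_states * n_actions        # index i = s * n_actions + pending
    P_aug = np.zeros((n_actions, n_info, n_info))
    R_aug = np.zeros((n_info, n_actions))
    for s in range(n_states):
        for pending in range(n_actions):             # action chosen one step ago, applied now
            i = s * n_actions + pending
            R_aug[i, :] = R[s, pending]              # reward of the action actually applied
            for a_new in range(n_actions):           # action chosen now, applied next step
                for s_next in range(n_states):
                    j = s_next * n_actions + a_new
                    P_aug[a_new, i, j] += P[pending, s, s_next]
    return P_aug, R_aug

if __name__ == "__main__":
    p, gamma = 0.8, 0.9                              # illustrative stochasticity and discount
    # Two symmetric states; action i drives the system toward state x_i with probability p.
    P = np.array([[[p, 1 - p], [p, 1 - p]],
                  [[1 - p, p], [1 - p, p]]])
    R = np.array([[1.0, 0.0],                        # illustrative reward: +1 when the action
                  [0.0, 1.0]])                       # matches the current state
    V_undelayed = value_iteration(P, R, gamma)
    V_delayed = value_iteration(*augment_one_step_action_delay(P, R), gamma)
    print("undelayed V*:", V_undelayed)
    print("one-step-delayed V* over information states:", V_delayed)
```

Under these assumptions, the delayed values lie below the undelayed value for p < 1, and the matched information state recovers it only in the deterministic limit p = 1, mirroring the trade-off stated in Proposition 1.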
Definition 1. A Constant Delay Markov Decision Process (CDMDP), denoted by the tuple ⟨S, A, P_A, r, γ, δ_o, δ_a⟩, augments an MDP with the state-space S × A^{δ_o+δ_a}. Note that a policy is defined as a mapping π: S × A^{δ_o+δ_a} → A.

A CDMDP can inherit delays from both the observation and the action. Intuitively, the two delays are functionally equivalent from the point of view of the agent. For instance, consider a scenario with no action delay, but where the observations are delayed by δ_o units of time. For an agent trying to select an optimal action at any time t, the corresponding reward is available only at time t + δ_o, i.e., the action at time t gets applied effectively at time t + δ_o, exactly as happens in the action-delay case. We formalize this intuitive argument in the form of Lemma 1.

It is shown in [3] that the CDMDP ⟨S, A, P_A, r, γ, τ, 0⟩ can be reduced to an equivalent MDP ⟨I, A, P_A, r̃, γ⟩ with I ≜ S × A^τ and a suitably modified reward structure at any time t, given by r̃(I_t, a_t) ≜ E[r(s_t, a_t) | I_t] for I_t ∈ I, s_t ∈ S, a_t ∈ A. Thus, from [3], the information necessary for optimal action selection a_t at any instant t is contained in I_t ≜ (s_{t−τ}, a_{t−τ}, …, a_{t−2}, a_{t−1}). The augmented state will henceforth be referred to as the information state. If an action a_t is selected at time t, the information state transitions to I_{t+1} = (s_{t−τ+1}, a_{t−τ+1}, …, a_{t−1}, a_t) with probability p_{a_{t−τ}}(s_{t−τ+1} | s_{t−τ}). According to [3], the action a_t results in a reward E[r(s_t, a_t) | I_t]. Note that this reward structure depends on s_t and not on s_{t−τ}, and thus evaluation of the expected reward entails computation of the conditional probability distribution P(s_t | I_t). The complexity of this computation grows as O(|S|²) per delayed step, which is why model-based methods [9] are computationally prohibitive to work with if the dimension of the state-space is large. Alternatively, one can reformulate the CDMDP into another equivalent MDP with a simplified reward structure such that, while the two MDPs differ in their achievable optimal rewards, the underlying optimal policies are still the same [17]. Thus, one can simultaneously achieve both the accuracy of model-free methods and the sample-efficiency characteristic of model-based methods. It must be remarked that, regardless of the choice of reward structure, augmentation is required to impart the Markov property. The flavor of the simplified reward structure can be seen in the proof of Lemma 1; a detailed analysis is presented for the generalized setting of stochastic delay in Theorem 1.

Lemma 1 (Equivalence of action and observation delays). The CDMDPs ⟨S, A, P_A, r, γ, τ, 0⟩ and ⟨S, A, P_A, r, γ, 0, τ⟩ are functionally equivalent to the MDP ⟨I, A, P_A, r̃, γ⟩ with an appropriately chosen information state I.

Proof. The proof is based on the fact that, conditioned on the appropriately chosen information state, optimal decision making for the two CDMDPs has a similar structure. Let us first consider the case of constant observation delay. We denote this observation-delayed MDP by the CDMDP ⟨S, A, P_A, r, γ, τ, 0⟩. The information necessary for optimal action selection at any stage t is given by I_t = (s_{t−τ}, a_{t−τ}, …, a_{t−2}, a_{t−1}). According to [3], for an arbitrary but fixed policy π: S × A^τ → A, the total expected reward is given by V_obs(π) = E[ Σ_{t≥0} γ^t r(s_t, a_t) ]. We consider another MDP with a modified cost structure, whose total expected reward is Ṽ_obs(π) = E[ Σ_{t≥0} γ^t r(s_{t−τ}, a_{t−τ}) ], which can be further expanded as Ṽ_obs(π) = E[ Σ_{t=0}^{τ−1} γ^t r(s_{t−τ}, a_{t−τ}) ] + γ^τ E[ Σ_{t≥0} γ^t r(s_t, a_t) ]. The first term on the right-hand side is independent of the policy π when conditioned on the information state I_0.
Therefore, it can be concluded that arg max_π Ṽ_obs(π) = arg max_π V_obs(π), i.e., the optimal policies for the two MDPs (with and without the modified reward structure) are equivalent. Next, we consider the CDMDP ⟨S, A, P_A, r, γ, 0, τ⟩ with action delay. According to [3], this CDMDP can be reduced into an equivalent MDP with information state I_t = (s_t, a_{t−τ}, …, a_{t−2}, a_{t−1}) and the total expected reward given by V_act(π) = E[ r(s_t, a_{t−τ}) + γ r(s_{t+1}, a_{t−τ+1}) + … ]. Following our analysis for the observation-delayed MDP with modified cost structure, we can obtain a similar relationship between the expected rewards under the modified and unmodified reward structures. As before, the policy that maximizes Ṽ_act(π) also maximizes V_act(π). From the point of view of an agent taking an action, the agent just needs to work with the suitably modified reward structure and the corresponding information state, independent of the nature of the delay. Finally, if we consider the CDMDP ⟨S, A, P_A, r, γ, δ_o, δ_a⟩, then for a decision maker it suffices to consider an equivalent MDP with the information state I_t = (s_{t−δ_o}, a_{t−δ_o−δ_a}, a_{t−δ_o−δ_a+1}, …, a_{t−1}). □

Remark 1. Lemma 1 suggests that it suffices to consider only one form of delay. Consequently, the rest of the paper considers scenarios only with delayed observations, since action delays can be analyzed similarly. Nonetheless, we later present some experiments with action delays for the sake of completeness.

We now discuss MDPs with stochastic delays. Unlike in a CDMDP, in a stochastic delay MDP (SDMDP) the number of time steps between two successive observations is random and not necessarily equal to unity. Consequently, it is possible to observe a later state s_{t+1} before observing s_t. This is in contrast to the SDMDP setting described in [17], where it is assumed that s_{t+1} can only be observed after the state s_t has been observed. For brevity, we only consider the scenario with random observation delays; the analysis for SDMDPs with action delays follows easily from Lemma 1. An SDMDP is defined analogously to a CDMDP, with the constant delays replaced by random delay processes. Unlike with CDMDPs, in an SDMDP the information state must also include the time-instant at which the last observed system state s_{t−δ} was first observed. This is required in order to ensure that the reward corresponding to (s_{t−δ}, a_{t−δ}) is collected only once. We also assume that if the delay for a later state s_{t+1} is smaller than the corresponding delay for s_t by more than unity, the agent gets to observe the most recent state s_{t+1} and the reward associated with it, along with the rewards associated with s_t and other previously unobserved states.

Another subtlety with SDMDPs is the size of the information state I_t. Let s_{t−δ} be the most recent observation, first observed at instant t_o. Then, for an agent trying to make an optimal decision, the information necessary at time t is I_t = (s_{t−δ}, t_o, a_{t−δ}, a_{t−δ+1}, …, a_{t−1}). If the state s_{t−δ+1} becomes observable at the next instant, the system transitions to I_{t+1} = (s_{t−δ+1}, t+1, a_{t−δ+1}, a_{t−δ+2}, …, a_t) with |I_{t+1}| = |I_t|. However, if the agent does not receive any new observation at time t+1, the system transitions to I_{t+1} = (s_{t−δ}, t_o, a_{t−δ}, a_{t−δ+1}, …, a_t) with |I_{t+1}| = |I_t| + 1. Thus, the size of the information state may vary during the evolution of an SDMDP. We account for this inconsistency by adding a 'no action' to the action space and keeping the length of the information state at each stage fixed at K + 1, with the understanding that the maximum allowable delay is K − 1.
In scenarios where the delay amount exceeds K − 1, it is assumed that the MDP freezes from the perspective of the agent, i.e., the agent does not take any new actions till the most recent state becomes observable. Thus, at each instant, the information state is given by the (K + 1)-tuple (s_{t−δ}, t_o, a_{t−δ}, …, a_{t−1}, ∅, …, ∅), where ∅ denotes the 'no action'. In practice, the 'no action' can be regarded as the equivalent of zero-padding when implementing deep-RL algorithms. Following arguments similar to those in [3, 5], it can be shown that the SDMDP ⟨S, A, P_A, r, γ, δ_o, 0, [t_o], K⟩ can be reduced to an undelayed MDP ⟨I, A, P_A, r′, γ⟩ with r′(s_t, a_t) = E[r(s_t, a_t) | I_t]. The total expected reward for any arbitrary but fixed policy π is thus given by (11). As with CDMDPs, the reward structure of the MDP in (11) necessitates O(|S|²) computations, which can be computationally prohibitive. The theorem below simplifies this cost structure by considering another MDP whose optimal policy coincides with the optimal policy of the MDP in (11).

Theorem 1 (Equivalence of SDMDP and undelayed MDP). The SDMDP ⟨S, A, P_A, r, γ, δ_o, 0, [t_o], K⟩ can be reduced into an equivalent undelayed MDP ⟨I, A, P_A, r̃, γ⟩ with a simplified cost structure.

Proof. Note that the set of instants [t_o] at which the states are first observed does not explicitly enter the accompanying technical arguments; however, it is needed to ensure that reward collection occurs only once per state. The proof of Theorem 1 is similar to the derivation in Lemma 1. We consider an equivalent MDP where reward collection occurs as prescribed in Section 4.2, i.e., if the delay at instant t is δ_t, then the corresponding reward r(s_t, a_t) gets discounted by γ^{t+δ_t}. Let δ be the observation-delay random variable, i.e., δ takes values in the set {0, 1, 2, …, K − 1}. Finally, let s_{t−δ} be the last observable state, first observed at stage t_o ≤ t. The information state is given by I_t = (s_{t−δ}, t_o, a_{t−δ}, a_{t−δ+1}, …, a_{t−1}, ∅, …, ∅). The total expected reward for the MDP with modified reward structure, Ṽ_obs(π) = E[ Σ_{t≥0} γ^{t+δ_t} r(s_t, a_t) ], is given in (12). The second term on the right-hand side of (12), conditioned on I_0, can be re-written as in (13). Moreover, the first term on the right-hand side of (12) is independent of π when conditioned on I_0. Since the policy is fixed and independent of the delay process, it can be concluded that arg max_π V_obs(π) = arg max_π Ṽ_obs(π), which completes the proof. □

The basic essence of the Delay-Resolved algorithms is the conversion of non-Markov problems with both constant and stochastic delays (in actions and observations) into Markov problems, CDMDPs and SDMDPs respectively. Once the MDPs have been formulated, we can apply any suitable reinforcement learning algorithm for both policy evaluation and improvement. The primary difference between Delay-Resolved Q-learning and vanilla DQN [23] is the state augmentation: instead of the state being just the current observation, there is an added history of pending actions, stored in an action buffer, which together make up an augmented state called the information state. For constant delays, the dimension |I_t| of the information state at time t is constant. However, for stochastic delays, |I_t| depends on the value of the delay. As explained in Section 4.2, we assign a maximum allowable dimension for |I_t| and fill up the action buffer with 'no-action'. Intuitively, it makes sense to choose the length of the action buffer as the maximum possible delay value if it is known a priori; however, compute requirements may impose additional restrictions on its length.
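As a concrete illustration of the fixed-length information state just described, the following minimal sketch maintains the last observed state, its time-stamp, and an action buffer padded with a 'no action' token up to a fixed length K. The class and method names are our own illustrative choices (not taken from the authors' released code), and the padding token doubles as the zero-padding used in the deep-RL implementation.

```python
NO_ACTION = 0  # token reserved for 'no action'; real actions are stored shifted by +1

class InformationState:
    """Fixed-length information state: (last observed state, its time-step, action buffer)."""

    def __init__(self, max_len):
        self.K = max_len          # buffer length; the maximum allowable delay is K - 1
        self.last_obs = None      # most recently observed (possibly delayed) state
        self.obs_step = None      # time step that the observed state corresponds to
        self.pending = []         # (time step, action) pairs taken since that state

    def record_action(self, action, step):
        self.pending.append((step, action + 1))      # shift so index 0 stays 'no action'

    def record_observation(self, state, state_step):
        """A delayed observation of the state at time `state_step` has arrived."""
        if self.obs_step is None or state_step > self.obs_step:
            self.last_obs, self.obs_step = state, state_step
            # keep only the actions taken at or after the newly observed state
            self.pending = [(t, a) for (t, a) in self.pending if t >= state_step]

    def as_tuple(self):
        """Pad the action buffer with 'no action' to the fixed length K."""
        actions = [a for (_, a) in self.pending][-self.K:]
        return (self.last_obs, self.obs_step,
                *(actions + [NO_ACTION] * (self.K - len(actions))))
```

In the constant-delay case the buffer is always full, while under stochastic delays the trailing 'no action' entries absorb the variable number of pending actions; if the delay ever exceeds K − 1, the environment is treated as frozen, as described above.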
Though work on delayed RL environments is not very extensive, some approaches have been proposed to address the problem of decision making in the presence of delays. Model-based RL approaches [9, 33], which involve learning the underlying undelayed transition probabilities, can potentially be very useful if the dynamics are learned accurately. However, the learned models are often inaccurate, particularly in the presence of large or stochastic delays. Additionally, learning the state-transition probabilities is computationally expensive (O(|S|²)), which is precisely why model-based methods are not very popular in practice. Another recent approach [1] used an expectation-maximization algorithm in which the agent takes decisions based on the expected next state; however, it also requires estimating transition probabilities. Consequently, [28] proposes a model-free approach that does not use augmented information states and instead builds an action history. Their scheme uses the effective action from the action history, typically the last action of the action buffer, during the Q-learning update. Here, the effective action is the action that gets applied to the currently observed state, which makes the update more accurate than just incorporating the current pending action. However, their approach works only with action delays. Moreover, their framework is not delay-agnostic, i.e., the delay value must be known a priori. Most recently, [10] addresses the CDMDP by using forward models to predict the current undelayed state and selecting actions based on that predicted state. Though their algorithm shows promising results, it is still computationally more expensive than standard model-free methods, because additional compute is required for learning the forward models. Learning good forward models is an essential aspect of this algorithm, which is generally not an easy task across environments. Particularly with large delays, the forward model needs to be applied multiple times, which can lead to compounding errors. With stochastic delays, the task becomes even more difficult. Additionally, for best results, the delay value is ideally needed both during inference and during training. This method also requires additional storage in the form of a sample buffer, which is essentially the trajectory of the pending actions along with their states and rewards. While the delay-resolved algorithm proposed in this paper requires additional memory for the action buffer, it is much easier to store lower-dimensional actions than the potentially higher-dimensional trajectories. [6] proposes a different solution for solving random delay problems by using augmented states along with the delay values for both action and observation. Their algorithm leverages the values of the action and observation delays at each step to convert off-policy sampled trajectories into on-policy ones. Their algorithm, known as Delay Correcting Actor Critic (DCAC), performs reasonably well, however, at the expense of requiring complete information about the delay values at each step for re-sampling. On the contrary, our DRDQN algorithm is designed to work in environments that do not provide this additional information on delay values at each step. Consequently, DCAC is not suited for optimal decision-making in the delay-agnostic environments discussed in this work.
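Before turning to the experiments, the following minimal sketch illustrates why approaches that must reason about the unobserved current state pay an O(|S|²)-per-delayed-step price: propagating a belief over s_t from the delayed observation through the stored actions requires one |S| × |S| matrix-vector product per delayed step. The function below is an illustrative sketch (names and signature are our own), not code from any of the cited methods; DRDQN's simplified reward structure avoids this computation entirely.

```python
import numpy as np

def expected_reward(P, R, s_delayed, action_buffer, a_t):
    """E[r(s_t, a_t) | I_t] for I_t = (s_{t-tau}, a_{t-tau}, ..., a_{t-1}).

    P[a, s, s'] are one-step transition probabilities and R[s, a] one-step rewards.
    Each propagation step below is a vector-matrix product, i.e. O(|S|^2) work per
    delayed step, giving O(tau * |S|^2) overall for a delay of tau.
    """
    n_states = P.shape[1]
    belief = np.zeros(n_states)
    belief[s_delayed] = 1.0              # the delayed observation is known exactly
    for a in action_buffer:              # push the belief forward through the stored actions
        belief = belief @ P[a]
    return float(belief @ R[:, a_t])     # expectation of r(s_t, a_t) under the belief
```

For example, expected_reward(P, R, s_delayed=0, action_buffer=[1, 0, 1], a_t=0) evaluates the expected reward under a three-step delay for some given P and R.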
We tested our algorithm on delayed versions of four standard environments, W-Maze, Acrobot, MountainCar and CartPole (details in Section 6.1), with both action and observation delays. For all our experiments, we used Deep Q-Networks [23] as the base RL algorithm (code available at: https://github.com/baranwa2/DelayResolvedRL). Since the primary modification is in how the state is structured, and not in how the samples are collected and trained with, Delay-Resolved algorithms can be used with any RL algorithm. We first describe the various environment dynamics along with the corresponding reward structures below.

W-Maze: W-Maze is a 7x11 grid world as shown in Fig. 6. The agent starts randomly at the bottom row and the objective is to reach the GOAL states through the available actions (UP, DOWN, LEFT, RIGHT). The agent receives a reward of +10 if it reaches the GOAL state and -1 for every other step not concluding at the GOAL state.

CartPole-v0: In CartPole-v0, the agent aims to balance a pole on a cart by moving the cart either left or right. The agent receives a reward of +1 for every step and 0 when the episode ends, which happens when the pole either falls down or has stayed balanced for 200 time-steps. Thus the optimal reward for this environment is ~200. Note that in the reshaped version of CartPole-v0, originally used in [10], the agent gets a reward based on the current value of its position and velocity, and this reward grows as the pole becomes upright.

Acrobot-v1: Acrobot-v1 is a more complex version of CartPole and comprises two links and two independent actuators. In essence, Acrobot represents a double pendulum with the aim of swinging the end of the lower link up to a desired height. The actions comprise torques generated by the actuator. The agent receives a step-reward of -1 and 0 once the episode ends. The optimal reward is ~-100.

MountainCar-v0: In MountainCar-v0, the agent has to drive a vehicle uphill; however, the steepness of the slope is just enough to prevent it from driving straight up. Thus, the agent has to learn to drive backwards and build up momentum to reach the hilltop. The agent receives a step-reward of -1 and 0 on episode termination, which corresponds to the agent reaching the hilltop or 200 time-steps having elapsed.

One of the biggest disadvantages of the augmented-state approach, as mentioned in [10], is that the information state space scales exponentially with the delay value, I_τ = S × A^τ. In tabular settings this becomes extremely problematic and infeasible to work with, specifically for large delays. However, if we use neural networks as function approximators, it is possible to include the action buffer as an input concatenated with the state space, and thus the size of the input space becomes |S| + τ|A|, which is easily manageable. From Fig. 5a, it is evident that delay-resolved tabular Q-learning is able to achieve the optimal reward (~0) for smaller delays. However, for large delays, it is not only computationally more expensive but also suffers from poor exploration. This is primarily due to the exponential increase in the state-space, resulting in poorer performance. The performance of the delay agent [28] is also depicted for the sake of completeness; this agent performs relatively well for smaller delays. However, as shown in Fig. 5b, a neural network function approximator can handle the action buffer as merely an input, which makes the information state space much smaller and thus leads to optimal performance across larger delays as well.
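The following sketch shows one way the information state can be fed to the Q-network: the raw observation (dimension |S|) is concatenated with a one-hot encoding of each action-buffer slot (|A| + 1 entries per slot, the extra index covering 'no action'), so the input dimension grows linearly with the buffer length rather than exponentially. The layer sizes and the use of PyTorch are our own illustrative assumptions and not the exact architecture of the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DelayResolvedQNet(nn.Module):
    """Q-network over information states: observation concatenated with the action buffer."""

    def __init__(self, obs_dim, n_actions, buffer_len, hidden=256):
        super().__init__()
        self.n_actions = n_actions
        in_dim = obs_dim + buffer_len * (n_actions + 1)   # +1 index encodes 'no action'
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),                 # one Q-value per action
        )

    def forward(self, obs, action_buffer):
        # obs: (batch, obs_dim) floats; action_buffer: (batch, buffer_len) ints in [0, n_actions]
        one_hot = F.one_hot(action_buffer, num_classes=self.n_actions + 1).float()
        x = torch.cat([obs, one_hot.flatten(start_dim=1)], dim=-1)
        return self.net(x)
```

Everything else (experience replay, target network, exploration schedule) can remain as in vanilla DQN, which is consistent with the small computational overhead reported below.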
Interestingly, the delay-DQN algorithm proposed in [28] performs reasonably well here, since a much more powerful function approximation tool is now available for predicting the value function. Although this is a useful trick, there are no theoretical guarantees on convergence to the optimal policy. A big advantage of Delay-Resolved algorithms is that they can be used both for stochastic and constant delays, without requiring the delay value as an input at each step, as is needed for [6].

Fig. 7 highlights the performance of the proposed Delay-Resolved DQN across three Gym environments. Each point is an average over the entire 1 million steps of training across 10 runs (along with standard error bars), for constant as well as stochastic delays. The last column represents a stochastic delay uniformly selected between 0 and 10, with the maximum allowable dimension of the information state being 10. Across all the environments, for both stochastic and constant delays, the DRDQN algorithm is able to converge to the optimal policy, as reflected in its near-optimal rewards. In Fig. 8, comparisons with the baseline algorithms and Delayed DQN [10] are exhibited alongside. As mentioned earlier, the delayed RL algorithm of [10] is heavily influenced by the architecture of the forward model. For consistency, we used the same architecture as was used in the original paper [10] for the primary DQN agent. However, as seen in Figs. 8b and 8c, it does not always work out well, and failing to learn an accurate forward model leads to worse performance. It is thus extremely critical to choose the correct forward-model architecture, and this choice will vary based on the environment, which can be very difficult to gauge ahead of training. Delay-Resolved algorithms, on the other hand, do not involve learning any models and hence do not require any additional hyperparameter search; they do relatively well across all the domains for all the delay values. Additionally, in Fig. 8a, all the above algorithms perform relatively well, getting to optimal rewards even for large delays; however, the time taken is much longer for Delayed DQN [10] because of the additional training of the forward model. Fig. 9a depicts the time taken by all the algorithms across all 10 runs. It can be clearly seen that Delay-Resolved DQN is computationally very light and takes almost the same time as vanilla DQN, whereas Delayed DQN takes substantially longer per run for convergence.

Note: the points on each of the plots are an average over one million frames across 10 runs. Thus, the drop in performance for larger delays in Figs. 7 and 8 is due to the fact that the agent takes a longer time to reach its optimal performance, which is reflected in the lower average score.

Lemma 1 establishes the equivalence of action and observation delays. For the sake of completeness, we plot the rewards obtained by DRDQN over the last 1000 episodes on Acrobot for both action and observation delays in Fig. 10. It is evident that the DRDQN algorithm converges to similar values for action and observation delays across different values of delay. The only modification in Delay-Resolved algorithms is the addition of the action buffer to the state space. Although this addition can lead to an exponential increase of compute in tabular settings, when added as an input to a neural network value-function approximator, the additional computational overhead is minimal, as can be seen in Figs. 9a and 9b.
With the increase in delay, the run-time increases dramatically for the Delayed-DQN method [10], as the agent has to first train the forward model and then apply it multiple times in order to predict the current state. The environment used for Figs. 9b and 9c is the same as CartPole-v0, except with the reshaped rewards used in [10] (details in Section 6.1). We see that forward-model-based methods also reach optimality if an appropriate architecture is chosen, but the time required is larger (Fig. 9b).

In this work, we have addressed a crucial assumption in reinforcement learning: the synchronization between the environment and the agent. In real-world scenarios we cannot expect this assumption to hold true, and thus we need to modify our RL algorithms accordingly. This paper revisits the idea of using augmented states as a solution for delayed RL problems. We formally define both constant and stochastic delays and provide theoretical analysis on why the reformulation still converges to the same optimal policy. Additionally, we adapt state-augmentation methods to constant and stochastic delay problems to formulate a class of Delay-Resolved RL algorithms, which perform well on both constant and random delay tasks. For future work, we would like to extend this algorithm to real-world problems, including accounting for actuator and motor delays in robotics and shipping delays in supply chains. Delay-Resolved algorithms could prove to be a building block for practical real-time RL algorithms. For example, without requiring knowledge of the delay value at every time-step, Delay-Resolved DQN can still be improved upon by using better sampling strategies from the experience replay buffer.

REFERENCES
[1] Blind Decision Making: Reinforcement Learning with Delayed Observations
[2] Congestion control as a stochastic control problem with action delays
[3] Closed-loop control with delayed information
[4] Markov decision processes with noise-corrupted and delayed state observations
[5] Dynamic programming and optimal control
[6] Reinforcement Learning with Random Delays
[7] OpenAI Gym (…, Jie Tang, and Wojciech Zaremba, 2016)
[8] Markov decision processes with state-information lag
[9] Delay-Aware Model-Based Reinforcement Learning for Continuous Control
[10] Acting in Delayed Environments with Non-Stationary Markov Policies
[11] Delay propagation and delay management in transportation networks
[12] Google AI algorithm masters ancient game of Go
[13] On unbounded delays in asynchronous parallel fixed-point algorithms
[14] Deep reinforcement learning that matters
[15] Ground-space bilateral teleoperation of ETS-VII robot arm by direct bilateral coupling under 7-s time delay condition
[16] Robust compliant motion control of robot with nonlinear friction using time-delay estimation
[17] Markov decision processes with delays and asynchronous cost collection
[18] State information lag Markov decision process with control limit rule
[19] A partially observable Markov decision process with lagged information
[20] Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance
[21] Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia
[22] Destabilizing effects of small time delays on feedback-controlled descriptor systems
[23] Human-level control through deep reinforcement learning
[24] The curse of planning: dissecting multiple reinforcement-learning systems by taxing the central executive
[25] Theoretical analysis of destabilization resonances in time-delayed stochastic second-order dynamical systems and some implications for human motor control
[26] Real-time reinforcement learning
[27] Effect of delay in diagnosis on transmission of COVID-19
[28] Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach
[29] Reproductive number of the COVID-19 epidemic in Switzerland with a focus on the Cantons of …
[30] Mastering the game of Go with deep neural networks and tree search
[31] Learning without state-estimation in partially observable Markovian decision processes
[32] The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs
[33] Planning and Learning in Environments with Delayed Feedback
[34] Learning and planning in environments with delayed feedback
[35] Note on "A partially observable Markov decision process with lagged information"
[36] Timeliness of financial reporting and financial distress
[37] Thinking While Moving: Deep Reinforcement Learning with Concurrent Control
[38] Does time-optimal control of a stochastic system with sensory delay produce movement units