key: cord-0043173-7sjdzz9x authors: Kang, Chun-Yao; Chen, Ming-Syan title: Balancing Exploration and Exploitation in Self-imitation Learning date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47436-2_21 sha: f918ec5a1c1762a5d9ad1becae88b1eeb872adac doc_id: 43173 cord_uid: 7sjdzz9x

Sparse reward tasks are always challenging in reinforcement learning. Learning such tasks requires both efficient exploitation and efficient exploration to reduce the sample complexity. One recently proposed line of research, self-imitation learning, encourages the agent to exploit more by imitating past good trajectories. Exploration bonuses, on the other hand, are another line of research that enhances exploration by producing an intrinsic reward when the agent visits novel states. In this paper, we introduce a novel framework, Explore-then-Exploit (EE), which interleaves self-imitation learning with an exploration bonus to strengthen the effect of these two algorithms. In the exploration stage, with the aid of the intrinsic reward, the agent tends to explore unseen states and occasionally collects high-rewarding experiences, while in the self-imitation stage, the agent learns to consistently reproduce such experiences and thus provides a better starting point for subsequent stages. Our results show that EE achieves superior or comparable performance on variants of MuJoCo environments with episodic reward settings.

Reinforcement learning (RL) learns an optimal policy by maximizing the expected return. Such methods work well in environments with dense rewards but suffer from degenerate performance when the rewards are sparse, that is, when rewards are almost always zero during an episode. In such cases, the agent requires both an efficient exploration method that guides it toward potentially useful information, and an efficient exploitation technique that makes better use of these experiences.

An exploration bonus is a simple way to perform directed exploration: it generates an intrinsic reward at every step, even when the external reward is unavailable. The bonus is designed to be higher in novel states than in frequently visited ones, and thus encourages the agent to explore new behavior. However, such methods still require a tremendous amount of training time, and the intrinsic rewards may vanish once the policy converges to a local optimum.

Self-imitation learning is a recently introduced algorithm that enhances exploitation by storing and reusing useful experiences. Once the agent occasionally generates high-rewarding trajectories, they are collected and stored in a replay buffer. The policy is then trained to imitate the trajectories in the buffer, so the resulting agent consistently reproduces past good behavior. This method has been shown to have high sample efficiency, though it hurts exploration and may get stuck at local optima. Besides, self-imitation learning relies on other techniques to bootstrap the learning process before the first trajectory is added to the buffer.

In this paper, we propose a framework, Explore-then-Exploit (EE), which combines random network distillation (RND) [5], a kind of exploration bonus, with generative adversarial self-imitation learning (GASIL) [7]. By integrating these two methods, the RND bonus solves the initial bootstrap problem of self-imitation and potentially prevents the policy from getting stuck at local optima.
On the other hand, GASIL speeds up the convergence of the policy and provides good starting points for later exploration. Nevertheless, a direct composition is problematic, because the agent prefers unpredictable actions under an exploration bonus while it tends to reproduce past behaviors under self-imitation. Mixing these two objectives confuses the agent and leads to undesired results. Instead, we suggest combining them in an interleaving manner. By doing so, the agent only learns one concept at each stage and switches to the other when certain criteria are reached. We formulate our framework in the form of reward interpolation and provide heuristic methods to determine the weight between exploration and self-imitation. Finally, we evaluate our model on several MuJoCo tasks [19] with episodic rewards and empirically show that EE improves over GASIL or RND in most environments.

Exploration. Many researchers have worked on exploration in RL. Count-based exploration gives a reward to rarely visited states [2,13]. Prediction-based exploration, also known as the curiosity-driven method, predicts the agent's dynamics and treats the prediction error as an intrinsic reward [1,4,14,18]. Random network distillation (RND) [5] improves on this idea by utilizing two neural networks: a randomly initialized target network, and a predictor network trained to minimize the difference between the outputs of the two networks. The difference is then used as the exploration bonus.

Self-imitation. Self-imitation learning (SIL) [12] was recently proposed to exploit past good behaviors. This algorithm stores transitions in a replay memory and uses them to update the policy when the stored return is higher than the current state-value estimate. Generative adversarial self-imitation learning (GASIL) [7] is a generative adversarial extension of SIL, which instead stores the top-k experiences based on episode return and formulates self-imitation as a divergence minimization problem. Combined with actor-critic methods, GASIL learns a shaped, dense reward function that can be used as an extra reward signal. Another method [6] also utilizes the generative adversarial structure, but instead trains an ensemble of agents and uses a special optimization technique to guarantee diversity among these agents. That work focuses on the interaction between multiple agents, whereas ours considers only a single agent; the technique can be used together with our framework without problems.

Amplifying the Imitation Effect. In very recent work, AIE [10] was proposed to combine RND and SIL, which is similar in spirit to our method. That work combines the two algorithms directly and introduces several techniques to enhance the effect of RND, whereas ours integrates GASIL in an interleaving fashion and is evaluated on common RL benchmarks.

Consider a state space S and an action space A. The purpose of RL is to find a parameterized policy π_θ(a|s), with s ∈ S and a ∈ A, which maximizes the expected discounted sum of rewards

\eta(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right],

where γ ∈ (0, 1] is the discount factor and r_t is the reward at time t. The gradient of this objective cannot be computed directly, so the policy gradient technique is used to estimate it. A commonly used gradient estimator of the objective η(π_θ) is

\nabla_\theta \eta(\pi_\theta) \approx \mathbb{E}_{\pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right],   (1)

where \hat{A}_t is the advantage estimate at time t. The estimator \nabla_\theta \eta(\pi_\theta) can be obtained by differentiating the surrogate objective

L^{PG}(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t} \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right].   (2)
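For concreteness, below is a minimal PyTorch-style sketch of the surrogate loss in Eq. (2); the tensor names, the entropy term, and the coefficient value are illustrative assumptions rather than details taken from the paper.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor,
                         advantages: torch.Tensor,
                         entropy: torch.Tensor,
                         entropy_coef: float = 0.01) -> torch.Tensor:
    """Surrogate policy-gradient loss, cf. Eq. (2), with an optional entropy bonus.

    log_probs:  log pi_theta(a_t | s_t) for the sampled actions, shape [T]
    advantages: advantage estimates A_hat_t, shape [T] (treated as constants)
    entropy:    per-step policy entropy, shape [T]
    """
    # Negate because optimizers minimize; detaching the advantages ensures the
    # gradient flows only through log pi_theta, matching the estimator in Eq. (1).
    pg_loss = -(log_probs * advantages.detach()).mean()
    return pg_loss - entropy_coef * entropy.mean()
```

PPO replaces this plain surrogate with a clipped probability-ratio objective, but the advantage-weighted log-probability structure is the same.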
GASIL involves a good-trajectory buffer B = {τ_i} and a discriminator D_φ(s, a). The buffer B stores the top-k good trajectories according to the total trajectory reward R = \sum_{t=0}^{\infty} r_t, where each trajectory τ_i consists of a sequence of states and actions {s_0, a_0, s_1, a_1, . . .}. The algorithm treats the trajectories stored in the buffer B as expert demonstrations and utilizes the GAIL framework [8] to obtain a similar generative adversarial objective

\min_\theta \max_\phi \; \mathbb{E}_{\tau_\pi}\left[ \log D_\phi(s, a) \right] + \mathbb{E}_{\tau_E}\left[ \log\left(1 - D_\phi(s, a)\right) \right] - \lambda H(\pi_\theta),   (3)

where τ_π and τ_E are trajectories sampled from the policy π_θ and the buffer B respectively, and λH(π_θ) is the entropy regularization term. The discriminator D_φ is updated via

\nabla_\phi \left( \mathbb{E}_{\tau_\pi}\left[ \log D_\phi(s, a) \right] + \mathbb{E}_{\tau_E}\left[ \log\left(1 - D_\phi(s, a)\right) \right] \right),   (4)

and the policy π_θ is updated via the approximate gradient

\mathbb{E}_{\tau_\pi}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t' \ge t} \gamma^{t'-t} \left( -\log D_\phi(s_{t'}, a_{t'}) \right) \right].   (5)

Equation (5) has a similar form to the policy gradient (1), and the two can therefore be combined as

\mathbb{E}_{\tau_\pi}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}^{\alpha}_t \right],   (6)

where \hat{A}^{\alpha}_t is the advantage estimate obtained by replacing r(s, a) with the modified reward r^{α}(s, a) = r(s, a) - α log D_φ(s, a). The term -log D_φ(s, a) can be seen as an intrinsic reward signal generated by the discriminator that encourages the agent to imitate past behaviors.

RND introduces a fixed and randomly initialized target network f : S → R^k, which takes a state as input and outputs a k-dimensional embedding, together with a predictor network f̂ : S → R^k that is trained to match the output of the target network on visited states. The exploration bonus is defined as the prediction error ||f̂(s) - f(s)||^2, which is expected to be lower for states similar to the frequently visited ones.

Our EE framework incorporates the exploration bonus component into the GASIL structure. We choose RND as the exploration algorithm because of its simplicity of implementation. Both GASIL and RND generate extra reward signals, which allows us to formulate our framework as an interpolation of the rewards from three different sources: (a) the external environment reward r^{ext}; (b) the imitation bonus r^{im} = -log D_φ(s, a), derived from the discriminator; and (c) the exploration bonus r^{exp} = ||f̂(s) - f(s)||^2, given by the predictor network. It does not make sense to directly sum up these rewards, as the imitation bonus and the exploration bonus guide the agent in different directions. Instead, we use a dynamically adaptive reward

r = r^{ext} + ν r^{exp} + (1 - ν) r^{im},   (7)

where ν controls the ratio between exploration and self-imitation. The parameter ν is explicitly set to 0 or 1 to prevent the two terms from interfering with each other. In the exploration stage, we set ν = 1 to completely eliminate the effect of self-imitation, which allows the agent to freely explore the environment, while in the self-imitation stage, we set ν = 0 to make the agent rely purely on the imitation bonus. As such, the agent quickly converges to a local optimum and begins to explore again once it switches back to the exploration stage.

It is crucial to determine when to set the ratio ν to 0, which corresponds to the self-imitation stage, and when to set it to 1, which corresponds to the exploration stage. In this work, we introduce a heuristic method that works well on the underlying benchmarks: we assign the ratio ν as a square wave with period T and duty cycle d. A particular setting of T = 10^6 and d = 25% is used throughout the paper, as shown in Fig. 1. Another method is to keep the agent in the exploration stage for a certain number of steps at the beginning and then switch to the self-imitation stage for the remaining time, as shown in Fig. 2. We found that this often leads to superior performance.

Fig. 2. Heuristic method which sets ν = 1 for the first 3 × 10^5 steps and ν = 0 for the remaining time.

Our framework only involves the reward interpolation, and thus can be plugged into any actor-critic based algorithm such as A2C [11] or PPO [16].
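As a minimal sketch of the stage schedule and the reward interpolation in Eq. (7), the following snippet shows how ν and the combined reward could be computed per step; the function names are illustrative, and the interpolation form follows the description above rather than any released code.

```python
def nu_square_wave(step: int, period: int = 1_000_000, duty: float = 0.25) -> float:
    """Square-wave schedule (cf. Fig. 1): nu = 1 (exploration) for the first
    `duty` fraction of every period, nu = 0 (self-imitation) otherwise."""
    return 1.0 if (step % period) < duty * period else 0.0


def nu_first_explore(step: int, explore_steps: int = 300_000) -> float:
    """'Explore first' schedule (cf. Fig. 2): nu = 1 for the first explore_steps
    environment steps, then nu = 0 for the remaining time."""
    return 1.0 if step < explore_steps else 0.0


def ee_reward(r_ext: float, r_im: float, r_exp: float, nu: float) -> float:
    """Adaptive reward of Eq. (7): with nu = 1 only the exploration bonus is
    added, with nu = 0 only the imitation bonus is added."""
    return r_ext + nu * r_exp + (1.0 - nu) * r_im
```

Here r_im would be the discriminator bonus -log D_φ(s, a) and r_exp the RND prediction error; because ν is always 0 or 1, one bonus is dropped entirely rather than merely down-weighted, which is the point of the interleaving design.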
We demonstrate the combination of our method with PPO in Algorithm 1. Note that the rewards used to determine the ranking of the trajectories stored in the replay buffer do not include the exploration bonuses; more specifically, the total trajectory reward is defined as R = \sum_{t=0}^{\infty} r^{ext}_t.

The experiments are designed to answer the following questions:
1. Is EE better than running RND or GASIL alone?
2. Is the RND exploration bonus itself necessary, or does random exploration work just as well?
3. Is the interleaving fashion critical?

We evaluated our method on several OpenAI Gym [3] MuJoCo continuous control tasks. The specifications of the environments used are listed in Table 1. All of the benchmarks were modified into episodic-reward environments: rather than providing the per-timestep reward r_t, we provide the whole episode reward R = \sum_{t=0}^{\infty} r_t at the last step of an episode and zero reward at every other step. We implemented the following agents based on this PPO implementation [17]:
- PPO: The proximal policy optimization [16] baseline.
- PPO + RND: PPO combined with the RND bonus [5].
- PPO + GASIL: PPO combined with GASIL [7].
- EE interval: Our method where ν is the square wave with period 10^6 and duty cycle 25%.
- EE first exp: Our method where ν is 1 for the first 3 × 10^5 steps and 0 for the remaining steps.

The hyperparameters used in our experiments are shown in Table 2. Every feed-forward network, including the actor-critic network, the discriminator, and the predictor, has 2 hidden layers with 64 neurons. Note that the parameter α is only used in PPO + GASIL, not in EE.

We first evaluated the 5 types of agents on 6 MuJoCo tasks. The results in Fig. 3 show that EE performs better than all of the baselines on Walker2d, Hopper, and HalfCheetah, and performs comparably with GASIL on Swimmer. This is because the Swimmer task is simple enough that exploration is not even necessary. However, on more complicated tasks such as Walker2d and Hopper, the benefit of integrating exploration and self-imitation is significant. On Ant and Humanoid, all 5 agents fail to learn a meaningful policy, mainly due to the high dimensionality of the observation space, which makes the networks difficult to train.

In Fig. 3, we also see that EE interval has an obvious performance drop when the agent is in the exploration stage and begins to climb again when it switches back to the self-imitation stage. This is the expected behavior, since the agent tends to select unpredictable actions, which can cause early termination of an episode. Unfortunately, EE interval performs slightly worse than EE first exp, which seems to indicate that the interleaving variant has no advantage over the non-interleaving one. One possible reason is that MuJoCo environments do not have a sequentially dependent structure: reliably producing certain rewards does not make it easier to obtain subsequent rewards. In this case, it is not beneficial to first converge to a good policy and then begin to explore from that point.

It should also be noticed in Fig. 3 that adding the RND bonus alone does not have any notable effect, which raises the question of how effective the RND bonus actually is.
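For reference, the RND bonus compared in these experiments can be computed roughly as follows; this is a generic minimal sketch with assumed network sizes (2 hidden layers of 64 units, matching the hyperparameters above) and learning rate, not the exact implementation, and it omits the observation and reward normalization that RND [5] typically relies on.

```python
import torch
import torch.nn as nn

def make_embed(obs_dim: int, embed_dim: int = 64) -> nn.Module:
    # Two hidden layers of 64 units, as in the other feed-forward networks.
    return nn.Sequential(
        nn.Linear(obs_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, embed_dim),
    )

class RNDBonus:
    """Exploration bonus r_exp = ||f_hat(s) - f(s)||^2 with a fixed random target f."""

    def __init__(self, obs_dim: int, lr: float = 1e-4):
        self.target = make_embed(obs_dim)
        for p in self.target.parameters():      # the target network stays fixed
            p.requires_grad_(False)
        self.predictor = make_embed(obs_dim)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def bonus(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            target = self.target(obs)
        error = (self.predictor(obs) - target).pow(2).sum(dim=-1)
        # Train the predictor to match the target on visited states, so the
        # bonus shrinks for states that have been seen frequently.
        self.opt.zero_grad()
        error.mean().backward()
        self.opt.step()
        return error.detach()
```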
We carried out another experiment to investigate this phenomenon. We modified the behavior of the EE first exp agent in the exploration stage as follows: no exp indicates that the RND bonus is removed from the reward interpolation, so that the agent relies only on the external reward r^{ext}; random exp indicates that the agent always takes a random action; EE first exp remains unchanged. The results in Fig. 4 show that integrating GASIL with RND indeed amplifies the effect of exploration compared to purely random exploration.

To justify the claim that directly mixing the imitation bonus and the exploration bonus results in poor performance, we modified the reward interpolation to be r = α r^{im} + (1 - α) r^{ext} + β r^{exp}, where α was fixed at 0.8 throughout the experiment and the coefficient β was set to 1, 2, 4, 8, and 16 respectively. This agent is referred to as direct. Figure 5 shows that the direct method with β = 1 performs much the same as GASIL, which points out that the imitation bonus is likely to dominate the resulting policy when the two bonuses are given similar weights. Furthermore, performance drops when higher weights are placed on the exploration bonus. This result again demonstrates that mixing different behaviors brings about inferior performance.

In this paper, we proposed Explore-then-Exploit (EE), a novel framework that combines GASIL and RND in the form of reward interpolation, and provided heuristic ways to interleave between the exploration and self-imitation stages. We demonstrated that EE significantly improves over existing single-agent methods on several continuous control tasks with episodic rewards. We also empirically justified our hypothesis that separating the objectives of exploration and imitation is better than mixing them together. Developing appropriate ways to automatically adjust the ratio between exploration and imitation is an important direction for future work. Further, we plan to apply our framework to more complicated environments such as Atari.

References
[1] Surprise-based intrinsic motivation for deep reinforcement learning.
[2] Unifying count-based exploration and intrinsic motivation.
[3] OpenAI Gym.
[4] Large-scale study of curiosity-driven learning.
[5] Exploration by random network distillation.
[6] Learning self-imitating diverse policies.
[7] Generative adversarial self-imitation learning.
[8] Generative adversarial imitation learning.
[9] Adam: a method for stochastic optimization.
[10] Amplifying the imitation effect for reinforcement learning of UCAV's mission execution.
[11] Asynchronous methods for deep reinforcement learning.
[12] Self-imitation learning.
[13] Count-based exploration with neural density models.
[14] Curiosity-driven exploration by self-supervised prediction.
[15] High-dimensional continuous control using generalized advantage estimation.
[16] Proximal policy optimization algorithms.
[17] Modularized implementation of deep RL algorithms in PyTorch.
[18] Incentivizing exploration in reinforcement learning with deep predictive models.
[19] MuJoCo: a physics engine for model-based control.

Acknowledgements. The authors would like to thank Dr. Kuan-Ting Lai for his helpful comments, which improved the presentation of this paper.