title: MELD: Meta-Reinforcement Learning from Images via Latent State Models
authors: Zhao, Tony Z.; Nagabandi, Anusha; Rakelly, Kate; Finn, Chelsea; Levine, Sergey
date: 2020-10-26

Abstract: Meta-reinforcement learning algorithms can enable autonomous agents, such as robots, to quickly acquire new behaviors by leveraging prior experience in a set of related training tasks. However, the onerous data requirements of meta-training, compounded with the challenge of learning from sensory inputs such as images, have made meta-RL challenging to apply to real robotic systems. Latent state models, which learn compact state representations from a sequence of observations, can accelerate representation learning from visual inputs. In this paper, we leverage the perspective of meta-learning as task inference to show that latent state models can also perform meta-learning given an appropriately defined observation space. Building on this insight, we develop meta-RL with latent dynamics (MELD), an algorithm for meta-RL from images that performs inference in a latent state model to quickly acquire new skills given observations and rewards. MELD outperforms prior meta-RL methods on several simulated image-based robotic control problems, and enables a real WidowX robotic arm to insert an Ethernet cable into new locations given a sparse task completion signal after only 8 hours of real-world meta-training. To our knowledge, MELD is the first meta-RL algorithm trained in a real-world robotic control setting from images.

General-purpose autonomous robots must be able to perform a wide variety of tasks and quickly acquire new skills. For example, consider a robot tasked with assembling electronics in a data center. This robot must be able to insert cables of varying shapes, sizes, colors, and weights into the correct ports with the appropriate amounts of force. While standard RL algorithms require hundreds or thousands of trials to learn a policy for each new setting, meta-RL methods hold the promise of drastically reducing the number of trials required. Given a task distribution, such as the variety of ways to insert cables described above, meta-RL algorithms leverage a set of training tasks to meta-learn a mechanism that can quickly learn unseen tasks from the same distribution. Despite promising results in simulation demonstrating that agents can learn new tasks in a handful of trials [1, 2, 3, 4], these algorithms remain largely unproven on real-world robotic systems.

Applying these algorithms to real-world robotic systems requires handling the raw sensory observations collected by a robot's on-board sensors. In principle, deep reinforcement learning (RL) algorithms can directly map sensory inputs to actions. However, this automation comes at a steep cost in sample efficiency, since the agent must learn to interpret observations from reward supervision alone. Fortunately, unsupervised learning of general-purpose latent state (or dynamics) models can serve as an additional training signal to help solve the representation learning problem [5, 6, 7, 8]. In this work, we seek to leverage the benefits of latent state models for representation learning to design a meta-RL algorithm that can acquire new skills quickly in the real world.
While meta-learning algorithms are often viewed as algorithms that learn to learn [9, 10, 3], an alternative viewpoint frames meta-learning as task inference [11, 12, 4]. From this perspective, the task is a hidden variable that can be inferred from experience consisting of observations and rewards. Our key insight is that the same latent dynamics models that greatly improve efficiency in end-to-end single-task RL can also, with minimal modification, be used for meta-RL by treating the unknown task information as a hidden variable to be estimated from experience.

Figure 1: At test time, our algorithm MELD enables a 5-DoF WidowX robot to insert an Ethernet cable into a novel insertion location and orientation within two episodes of experience, operating from image observations and a sparse task completion signal when the cable is correctly inserted. MELD achieves this result by meta-training a latent dynamics model to capture task and state information, as well as a policy that conditions on this information to explore and identify the correct insertion point.

We formalize the connection between latent state inference and meta-RL, and leverage this insight in our proposed algorithm MELD, Meta-RL with Latent Dynamics. To derive MELD, we cast meta-RL and latent state inference into a single partially observed Markov decision process (POMDP) in which task and state variables are aspects of a more general per-timestep hidden variable. Concretely, we represent the agent's belief over the hidden variable as the variational posterior in a sequential VAE latent state model that takes observations and rewards as input, and we condition the agent's policy on this belief. During meta-training, the latent state model and policy are trained across a fixed set of training tasks sampled from the task distribution. The trained system can then quickly learn a new task from the distribution by inferring the posterior belief over the hidden variable and executing the conditional meta-learned policy.

We find in simulation that MELD substantially outperforms prior work on several challenging locomotion and manipulation problems, such as running at varying velocities, inserting a peg into varying targets, and putting away mugs of unknown weight at varying locations on a shelf. We then analyze MELD's capability to meta-learn temporally-extended exploration strategies when only a sparse task completion signal is available. Finally, using a real WidowX robotic arm, we find that after eight hours of meta-training, MELD successfully performs Ethernet cable insertion into ports at novel locations and orientations (Figure 1). This real-world experiment with the WidowX is, to our knowledge, the first demonstration of a meta-RL algorithm trained from images on a real robotic platform. Our open-source implementation of MELD can be found at https://github.com/tonyzhaozh/meld.

Our approach can be viewed methodologically as a bridge between meta-RL methods for fast skill acquisition and latent state models for state estimation. In this section, we discuss these areas of work, as well as RL on real-world robotic platforms.

RL for Robotics. While prior work has obtained good results with geometric and force control approaches for a wide range of manipulation tasks [13, 14, 15], including insertion tasks [16, 17] such as those in our evaluation, such approaches typically require considerable manual design effort for each task.
RL algorithms offer an automated alternative that has been demonstrated on a variety of robotic tasks [18] including insertion [19, 20, 21, 22, 23] . Although these policies learn impressive skills, they typically do not transfer to other tasks and must be re-trained from scratch for each task. Meta-RL. Meta-RL algorithms learn to acquire new skills quickly by using experience from a set of meta-training tasks. Current meta-RL methods differ in how this acquisition procedure is represented, ranging from directly representing the learned learning process with a recurrent [1, 2] or recursive [24] deep network, to learning initial parameters for gradient-based optimization [3, 25, 26, 27] , and learning tasks via variational inference [4, 28, 29] . Some of these works have formalized meta-RL as a special kind of POMDP in which the hidden state is constant throughout a task [4, 28, 30, 29] . Taking a broader view, we show that meta-RL can be tackled with a general POMDP algorithm that estimates a time-varying hidden state, rendering the same algorithm applicable to problems with both stationary and non-stationary sources of uncertainty. Within this area, meta-learning approaches that enable few-shot adaptation have been studied with real systems for imitation learning [31, 32, 33, 34] and goal inference [35] , but direct meta-RL in the real world has received comparatively little attention. Adapting to different environment parameters has been explored in the sim2real setting for table-top hockey [36] and legged locomotion [37] , and in the model-based RL setting for millirobot locomotion [38] . In Section 5.3, we demonstrate that our algorithm MELD can perform meta-RL trained from images in the real world. Latent State Inference in RL. A significant challenge in real-world robotic learning is contending with the complex, high-dimensional, and noisy observations from the robot's sensors. To handle general partial observability, recurrent policies can persist information over longer time horizons [39, 40, 41] , while explicit state estimation approaches maintain a probabilistic belief over the current state of the agent and update it given experience [42, 43, 44, 45, 46, 47, 48] . In our experiments we focus on learning from image observations, which presents a state representation learning challenge that has been studied in detail. End-to-end deep RL algorithms can learn state representations implicitly, but currently suffer from poor sample efficiency due to the added burden of representation learning [49, 20, 50] . Pre-trained state estimation systems can predict potentially useful features such as object locations and pose [51, 52] ; however, these approaches require ground truth supervision. On the other hand, unsupervised learning techniques can improve sample efficiency without access to additional supervision [53, 5, 54, 6, 55, 56, 57] . Latent dynamics models capture the time-dependence of observations and provide a learned latent space in which RL can be tractably performed [58, 59, 8, 60, 61, 7] . In our work, we generalize the learned latent variable to encode not only the state but also the task at hand, enabling efficient meta-RL from images. In this work, we leverage tools from latent state modeling to design an efficient meta-RL method that can operate in the real world from image observations. In this section, we review latent state models and meta-RL, developing a formalism that will allow us to derive our algorithm in Section 4. We first define the RL problem. 
A Markov decision process (MDP) consists of a set of states S, a set of actions A, an initial state distribution p(s_1), a state transition distribution p(s_{t+1} | s_t, a_t), a discount factor γ, and a reward function r(s_t, a_t). We assume the transition and reward functions are unknown, but can be sampled by taking actions in the environment. The goal of RL is to learn a policy π(a_t | s_t) that selects actions that maximize the sum of discounted rewards. However, robots operating in the real world do not have access to the underlying state s_t, and must instead select actions using high-dimensional and often incomplete observations from cameras and other sensors. Such a system can be described as a partially observed Markov decision process (POMDP), where observations x_t are a noisy or incomplete function of the unknown underlying state s_t, and the policy is conditioned on a history of observations as π(a_t | x_{1:t}).

While end-to-end RL methods can acquire representations of observations from reward supervision alone [20, 21], the added burden of end-to-end representation learning limits sample efficiency and can make optimization more difficult. Methods that explicitly address this representation learning problem are more sample-efficient and scale better to harder domains [47, 61, 7]. These approaches train latent state models to learn meaningful representations of the incoming observations by explicitly representing the unknown Markovian state as a hidden variable z_t in a graphical model, as shown in Figure 2(a). The parameters of these graphical models can be trained by approximately maximizing the log-likelihood of the observations, log p(x_{1:T} | a_{1:T-1}). Given a history of observations and actions seen so far, the posterior distribution over the hidden variable captures the agent's belief over the current underlying state, and can be written as b_t = p(z_t | x_{1:t}, a_{1:t-1}). Then, rather than conditioning the policy on raw observations, these methods learn a policy π(a_t | b_t) as a function of this belief state.

In this work, we would like a robot to learn to acquire new skills quickly. We formalize this problem statement as meta-RL, where each task T from a distribution of tasks p(T) is a POMDP as described above, with initial state distribution p_T(s_1), dynamics function p_T(s_{t+1} | s_t, a_t), observation function p_T(x_t | s_t), and reward function r_T(s_t, a_t), as shown in Figure 2(b). For example, a task distribution that varies both dynamics and rewards across tasks may consist of placing mugs of varying weights (dynamics) in different locations (rewards) on a kitchen shelf. The meta-training process leverages a set of training tasks sampled from p(T) to learn an adaptation procedure that can adapt to a new task from p(T) using a small amount of experience. Similar to the framework in probabilistic inference-based and recurrence-based meta-RL approaches [28, 4, 2, 1], we formalize the adaptation procedure f_φ as a function of experience (x_{1:t}, r_{1:t}, a_{1:t-1}) that summarizes task-relevant information into the variable c_t. The policy is conditioned on this updated variable as π_θ(a_t | x_t, c_t) to adapt to the task. By training the adaptation mechanism f_φ and the policy π_θ end-to-end to maximize the returns of the adapted policy, meta-RL algorithms can learn policies that effectively modulate and adapt their behavior with small amounts of experience in new tasks.
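As a concrete illustration of this generic interface (not the specific architecture used by MELD, which is developed in Section 4), the sketch below implements the adaptation procedure f_φ as a recurrent network that summarizes (x_{1:t}, r_{1:t}, a_{1:t-1}) into a context c_t, and conditions a feedforward policy on (x_t, c_t). All module choices and sizes are illustrative placeholders:

import torch
import torch.nn as nn

class RecurrentAdaptation(nn.Module):
    # f_phi: folds each (x_t, r_t, a_{t-1}) transition into a context summary c_t.
    def __init__(self, obs_dim, act_dim, context_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + 1 + act_dim, context_dim, batch_first=True)

    def forward(self, obs, rew, prev_act):
        # obs: (B, T, obs_dim), rew: (B, T, 1), prev_act: (B, T, act_dim)
        context, _ = self.gru(torch.cat([obs, rew, prev_act], dim=-1))
        return context  # c_1, ..., c_T

class ContextConditionedPolicy(nn.Module):
    # pi_theta(a_t | x_t, c_t): a feedforward policy conditioned on the context.
    def __init__(self, obs_dim, act_dim, context_dim=64, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + context_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim), nn.Tanh())

    def forward(self, obs_t, context_t):
        return self.net(torch.cat([obs_t, context_t], dim=-1))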
We formalize this meta-RL objective as:

max_{θ,φ}  E_{T ~ p(T)}  E_{τ ~ π_θ} [ Σ_t γ^t r_t ],   where c_t = f_φ(x_{1:t}, r_{1:t}, a_{1:t-1}) and a_t ~ π_θ(a_t | x_t, c_t).    (1)

Meta-RL methods may differ in how the adaptation procedure f_φ is represented (e.g., as probabilistic inference [4, 28], as a recurrent update [2, 1], or as a gradient step [3]), how often the adaptation procedure occurs (e.g., at every timestep [2, 28] or once per episode [4, 30]), and in how the optimization is performed (e.g., on-policy [2] or off-policy [4]). Differences aside, these methods all typically optimize this objective end-to-end, creating a representation learning bottleneck when learning from the image inputs that are ubiquitous in real-world robotics. In the following section, we show how the latent state models discussed in Section 3.1 can be re-purposed for joint representation and task learning, and how this insight leads to a practical algorithm for image-based meta-RL.

In this section, we present MELD: an efficient algorithm for meta-RL from images. We first develop the algorithm in Section 4.1 and then describe its implementation in Section 4.2.

To see how task inference in meta-RL can be cast as latent state inference, consider the graphical models depicted in Figure 2. Panel (a) illustrates a standard POMDP with underlying latent state z_t and observations x_t, and panel (b) depicts standard meta-RL, where the hidden task variable T is assumed constant throughout the episode. In the meta-RL setting, the policy must then be conditioned on both the observation and the task variable in order to adapt to a new task (see Equation 1). Casting this task variable as part of the latent state, panel (c) illustrates our graphical model, where the states z_t now contain both state and task information. In effect, we cast the task distribution over POMDPs as a POMDP itself, where the state variables now additionally capture task information. This melding allows us to draw on the rich literature of latent state models discussed in Section 3.1, and use them here to tackle the problem of meta-RL from sensory observations. Note that the task need not be handled explicitly, since it is simply another hidden state variable, providing a seamless integration of meta-RL with learning from sensory observations.

Algorithm 1 MELD Meta-training
Require: training tasks {T_i}_{i=1...J} from p(T), learning rates η_1, η_2, η_3
  Initialize model p_φ, q_φ, actor π_θ, critic Q_ζ
  Initialize replay buffers B_i for each training task
  while not done do
    for each task T_i do                                ▷ collect data
      Step with a_1 ~ π_θ(a | b_1), get x_2, r_2
      for t = 2, ..., T-1 do
        Step with a_t ~ π_θ(a | b_t), get x_{t+1}, r_{t+1}
      end for
    end for
    Update φ by optimizing L_model (learning rate η_1)  ▷ train model
    Update θ, ζ with SAC (learning rates η_2, η_3)      ▷ train actor-critic
  end while

Algorithm 2 MELD Meta-testing
  Step with a_1 ~ π_θ(a | b_1), get x_2, r_2
  for t = 2, ..., T-1 do
    Step with a_t ~ π_θ(a | b_t), get x_{t+1}, r_{t+1}
  end for

Concretely, we learn a latent state model over hidden variables by optimizing the log-likelihood of the evidence (observations and rewards) in the graphical model in Figure 2c:

log p_φ(x_{1:T}, r_{1:T} | a_{1:T-1}).    (2)

Note that the only change from the latent state model of Section 3.1 is the inclusion of rewards as part of the observed evidence. While this change appears simple, it enables meta-learning by allowing the hidden state to capture task information. Posterior inference in this model then gives the agent's belief b_t = p(z_t | x_{1:t}, r_{1:t}, a_{1:t-1}) over the latent state and task variables z_t. Conditioned on this belief, the policy π_θ(a_t | b_t) can learn to adapt its behavior to the task.
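The following Python skeleton paraphrases the meta-training loop of Algorithm 1 above. It is a sketch rather than the released implementation: envs, model, policy, and sac_update are placeholder handles for the environment interface, the latent state model, the belief-conditioned actor, and the SAC update described in Section 4.2.

def meld_meta_train(train_tasks, envs, model, policy, sac_update, num_iters, horizon):
    buffers = {task: [] for task in train_tasks}          # per-task replay buffers B_i
    for _ in range(num_iters):
        # --- collect data with the current belief-conditioned policy ---
        for task in train_tasks:
            env = envs[task]
            x, r = env.reset()                            # x_1 and r_1
            belief = model.infer_initial(x, r)            # b_1 from q_phi(z_1 | x_1, r_1)
            trajectory = []
            for _ in range(horizon - 1):
                a = policy.sample(belief)                 # a_t ~ pi_theta(a | b_t)
                x, r = env.step(a)                        # observe x_{t+1}, r_{t+1}
                trajectory.append((x, r, a))
                belief = model.infer(belief, x, r, a)     # next belief b_{t+1}
            buffers[task].append(trajectory)
        # --- train the model and actor-critic on replayed trajectories ---
        batch = [traj for trajs in buffers.values() for traj in trajs]
        model.update(batch)                               # gradient steps on L_model (eta_1)
        sac_update(policy, batch)                         # actor / critic updates (eta_2, eta_3)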
Prescribing the adaptation procedure f_φ from Equation 1 to be posterior inference in our latent state model, the meta-training objective in MELD is:

max_{θ,φ}  E_{T ~ p(T)}  E_{τ ~ π_θ} [ Σ_t γ^t r_t ],   where b_t = q_φ(z_t | x_{1:t}, r_{1:t}, a_{1:t-1}) and a_t ~ π_θ(a_t | b_t).    (3)

By melding state and task inference, MELD inherits the same representation learning mechanism as the latent state models discussed in Section 3.1, enabling efficient meta-RL with images.

Exactly computing the posterior distribution over the latent state variable is intractable, so we take a variational inference approach to maximize a lower bound on the log-likelihood objective [62] in Equation 2. We factorize the variational posterior as

q(z_{1:T} | x_{1:T}, r_{1:T}, a_{1:T-1}) = q_φ(z_1 | x_1, r_1) ∏_{t=2}^{T} q_φ(z_t | x_t, r_t, z_{t-1}, a_{t-1}).    (4)

With this factorization, we implement each component as a deep neural network and optimize the evidence lower bound of the joint objective, log p(x_{1:T}, r_{1:T} | a_{1:T-1}) ≥ L_model, with L_model defined as:

L_model = E_{z_{1:T} ~ q_φ} [ Σ_{t=1}^{T} ( log p_φ(x_t | z_t) + log p_φ(r_t | z_t) - D_KL( q_φ(z_t | x_t, r_t, z_{t-1}, a_{t-1}) || p_φ(z_t | z_{t-1}, a_{t-1}) ) ) ].    (5)

The first two terms encourage a rich latent representation z_t by requiring reconstruction of the observations and rewards, while the last term keeps the inference network consistent with the latent dynamics. The first-timestep posterior q_φ(z_1 | x_1, r_1) is modeled separately from the remaining steps, and p(z_1) is chosen to be a fixed unit Gaussian N(0, I). The learned inference networks q_φ(z_1 | x_1, r_1) and q_φ(z_t | x_t, r_t, z_{t-1}, a_{t-1}), decoder networks p_φ(x_t | z_t) and p_φ(r_t | z_t), and dynamics p_φ(z_t | z_{t-1}, a_{t-1}) are all fully connected networks that output the parameters of Gaussian distributions. We follow the architecture of the latent variable model from SLAC [7] and provide the remaining implementation details in Appendix A.

Figure 3: MELD meta-training alternates between collecting data with π_θ and training the latent state model p_φ, inference networks q_φ, actor π_θ, and critic Q_ζ.

We use the soft actor-critic (SAC) [63] RL algorithm in this work due to its high sample efficiency and performance. The actor π_θ(a_t | b_t) and the critic Q_ζ(b_t, a_t) are conditioned on the posterior belief b_t, modeled as fully connected neural networks, and trained as prescribed by the SAC algorithm. During meta-training, MELD alternates between collecting data with the current policy, training the model by optimizing L_model, and training the policy with the current model. Meta-training and meta-testing are described in Algorithms 1 and 2, respectively.
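To make the training signal concrete, the following sketch computes a single-latent-layer version of L_model (Equation 5) using PyTorch distributions. It is illustrative rather than the released implementation: observations are assumed to be flattened to vectors, the two-layer latent from SLAC is omitted, and the model.* methods are hypothetical handles for the inference, dynamics, and decoder networks listed above.

import torch
from torch.distributions import Normal, kl_divergence

def meld_model_loss(model, obs, rew, act):
    # obs: (B, T, obs_dim) flattened observations, rew: (B, T), act: (B, T-1, act_dim)
    T = rew.shape[1]
    post = model.q_init(obs[:, 0], rew[:, 0])             # q_phi(z_1 | x_1, r_1)
    prior = Normal(torch.zeros_like(post.mean), torch.ones_like(post.stddev))  # p(z_1) = N(0, I)
    elbo = 0.0
    for t in range(T):
        if t > 0:
            prior = model.dynamics(z, act[:, t - 1])      # p_phi(z_t | z_{t-1}, a_{t-1})
            post = model.q_step(z, obs[:, t], rew[:, t], act[:, t - 1])
        z = post.rsample()                                # reparameterized sample
        elbo = elbo + model.decode_obs(z).log_prob(obs[:, t]).sum(-1)   # log p_phi(x_t | z_t)
        elbo = elbo + model.decode_rew(z).log_prob(rew[:, t])           # log p_phi(r_t | z_t)
        elbo = elbo - kl_divergence(post, prior).sum(-1)                # KL(q || p) term
    return -elbo.mean()                                   # minimize the negative ELBO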
In our experiments, we aim to answer the following: (1) How does MELD compare to prior meta-RL methods in enabling fast acquisition of new skills at test time in challenging simulated control problems? (2) Can MELD meta-learn effective exploration when only sparse task completion rewards are available at meta-test time? (3) Can MELD enable real robots to quickly acquire skills via meta-RL from images?

In this section, we evaluate MELD on the four simulated image-based continuous control problems in Figure 4. In (a) Cheetah-vel, each task is a different target running velocity for the 6-DoF legged robot, and the reward penalizes the difference between the robot's velocity and the target. The remaining problems use a 7-DoF Sawyer robotic arm. In (b) Reacher, each task is a different goal position for the end-effector. In (c) Peg-insertion, the robot must insert the peg into the correct box, where each task varies the goal box as well as the locations of all four boxes. In (d) Shelf-placing, each task varies the weight (dynamics change) and target location (reward change) of a mug that the robot must move to the shelf. For the Sawyer environments, the reward function is the negative distance between the robot end-effector and the desired end location. For all environments, we train with 30 meta-training tasks and evaluate on 10 meta-test tasks from the same distribution that are not seen during training.

We compare MELD to two representative state-of-the-art meta-RL algorithms, PEARL [4] and RL^2 [2]. PEARL models a belief over a probabilistic latent task variable as a function of unordered batches of transitions, and conditions the policy on both the current observation and this inferred task belief. Unlike MELD, this algorithm assumes an exploration phase of several trajectories in the new task to gather information before adapting, so to report its best performance, we evaluate only after this exploration phase. RL^2 models the policy as a recurrent network that directly maps observations, actions, and rewards to actions. To apply PEARL and RL^2 with image observations, we augment them with the same convolutional encoder architecture used by MELD. Finally, to verify the need for task inference to solve new tasks, we compare to SLAC [7], which infers state information from a sequence of observations but does not perform meta-learning.

In Figure 5 we plot average performance on meta-test tasks over the course of meta-training, across 3 random seeds. See Appendix B for the definitions of the metrics used for each task. MELD achieves the highest performance in each environment and is the only method to fully solve Cheetah-vel, Peg-insertion, and Shelf-placing. The SLAC baseline fails in this meta-RL setting, as expected, with the qualitative behavior of always executing a single "average" motion, such as reaching toward a mean goal location or running at a medium speed. PEARL aggregates task information over time in its latent task variable, but relies on the current observation alone for state information. Its poor performance on Cheetah-vel, Reacher, and Shelf-placing reflects the need for state estimation from a sequence of observations to perform control in these environments. While RL^2 is capable of propagating both state and task information over time, we observe that it overfits heavily to the training tasks and struggles on the evaluation tasks.

Figure 6: Button press with reward given only upon pressing the correct button. The robot explores each button until it finds the correct one (top, episode 1) and returns to that button immediately in the next episode (bottom, episode 2). See text for discussion.

The previous section assumed a shaped reward function that is the negative distance between the current robot position and the desired one at every timestep. In the real world, this type of reward function is typically not available to the agent, since it requires information that may be difficult or impossible to obtain. For example, in the Ethernet cable insertion problem, the location of the insertion point is unknown, but the agent might receive a sparse task completion reward upon making the correct electrical connection. To quickly succeed at a new task given only this sparse signal at meta-test time, it is critical for MELD to reason over multiple episodes and acquire temporally-extended strategies during meta-training. Because RL with sparse rewards is very inefficient, we follow prior work [65, 4] and assume access to a shaped reward function during meta-training to help learn these strategies.
We detail our particular approach to making use of shaped rewards during meta-training in Appendix C; at meta-test time we assume access to only the sparse reward signal. To evaluate MELD in this setting, we design a simulated button-pressing environment with the Sawyer robot, where the button to push and the location of the panel change with each task. Sparse reward is given only when the correct button is pushed, while the shaped reward (used only in meta-training) is the negative distance from the robot's end-effector to the goal. In Figure 6, we analyze the qualitative behavior of MELD when learning a new task at test time. Though the shaped reward is not used at test time, we plot the shaped reward reconstruction mean and variance to gain insight into the contents of the learned latent state. In the first episode, the predicted reward error and variance are high until the robot presses the correct button. MELD's latent state model persists the task information to the second episode (predicted reward error and variance are very low), and the robot navigates immediately to the correct button. In Figure 7, we compare MELD to the same baselines introduced in the previous section and find that it is the only method able to press the correct button within two trajectories of experience. In the next section, we test MELD's capability to perform such exploration and exploitation in the real world.

We now evaluate MELD on a real-world 5-DoF WidowX arm performing Ethernet cable insertion. The task distribution consists of different ports in a router whose location and orientation also vary (see Figure 8). To instrument these tasks in the real world, we build an automatic reset mechanism that moves and rotates the router, as detailed in Appendix E. At meta-test time the reward is a sparse signal given when the robot inserts the cable into the correct port. As in the previous section, during meta-training we make use of a shaped reward function that is the sum of the L2-norms of the translational and rotational distances between the pose of the object in the end-effector and a goal pose. The agent's observations are concatenated images from two webcams (Figure 9): one fixed view and one first-person view from a wrist-mounted camera. The policy sends joint velocity commands over a ROS interface to a low-level PID controller that moves the robot's joints.

We compare MELD to SLAC as described in Section 5.1, as well as to a random policy, and plot the results in Figure 10. After training across 20 meta-training tasks using a total of 8 hours of data (80,000 samples at 3.3 Hz), MELD achieves a success rate of 90% over three rounds of evaluation on each of the 10 randomly sampled evaluation tasks that were not seen during training. To our knowledge, this experiment is the first demonstration of meta-RL trained entirely in the real world from image observations. We also conducted experiments with the Sawyer robot, finding that MELD enables the Sawyer to insert a peg into the correct hole given a per-timestep reward of distance to the hole (see Appendix F). Due to lab access restrictions as a result of COVID-19, we could not evaluate adaptation to new tasks on this platform. Videos of all experiments can be found on our project website.
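A hedged sketch of the per-step control loop described above is given below: two webcam frames are resized and concatenated into a single 64x128 observation, the belief is updated from the observation and sparse reward, and a joint-velocity command is issued at roughly 3.3 Hz. Only the OpenCV and NumPy calls are real APIs; model, policy, send_joint_velocities, and sparse_reward_fn are placeholders for the trained MELD components, the ROS velocity interface, and the task completion signal.

import time
import cv2
import numpy as np

def run_episode(model, policy, send_joint_velocities, sparse_reward_fn,
                cam_ids=(0, 1), horizon=40, hz=3.3):
    caps = [cv2.VideoCapture(i) for i in cam_ids]      # fixed view + wrist-mounted view

    def get_obs():
        frames = []
        for cap in caps:
            ok, frame = cap.read()
            frames.append(cv2.resize(frame, (64, 64)))
        return np.concatenate(frames, axis=1)          # 64 x 128 x 3 image

    x, r = get_obs(), 0.0
    belief = model.infer_initial(x, r)
    for _ in range(horizon):
        start = time.time()
        a = policy.sample(belief)
        send_joint_velocities(a)                       # low-level PID tracks the command
        x, r = get_obs(), sparse_reward_fn()           # sparse completion signal
        belief = model.infer(belief, x, r, a)
        time.sleep(max(0.0, 1.0 / hz - (time.time() - start)))
    for cap in caps:
        cap.release()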
In this paper, we leverage the insight that meta-RL can be cast into the framework of latent state inference. This allows us to combine the fast skill-acquisition capabilities of meta-learning with the efficiency of unsupervised latent state models when learning from raw image observations. Based on this principle, we design MELD, a practical algorithm for meta-RL with image observations. MELD outperforms prior methods on simulated locomotion and manipulation tasks, and is efficient enough to perform meta-RL directly from images in the real world. However, neither our approach nor prior meta-learning works have shown convincing generalization to wider task distributions of qualitatively distinct manipulation tasks. Additionally, due to the difficulty of defining and instrumenting shaped reward functions for such tasks, it is important that meta-RL algorithms be able to learn from less structured signals and other forms of supervision, such as demonstrations and natural language. We view these directions as exciting avenues for future work that will broaden the applicability of robotic learning to less structured environments.

Here, we expand on Section 4.2 to provide further implementation details of our algorithm MELD. Please also refer to our open-source code: https://github.com/tonyzhaozh/meld. As discussed in Section 4.2, the latent state model consists of latent dynamics distributions, posterior inference distributions, and generative distributions over observations and rewards. While most timesteps are processed by the same time-invariant dynamics model p(z_t | z_{t-1}, a_{t-1}) and posterior inference network q_φ(z_t | z_{t-1}, x_t, r_t, a_{t-1}), we use separate distributions to model the first time step of a trial: p(z_1 | z_0) and q_φ(z_1 | z_0, x_1, r_1). Note that in Sections 5.2 and 5.3 we assume trials contain two episodes, while in Section 5.1 trials contain one episode. Following SLAC, we implement two layers of latent variables (please refer to the SLAC paper [7] for more details). The network q_φ(z_t | z_{t-1}, x_t, r_t, a_{t-1}) encodes image observations and the network p(x_t | z_t) decodes them for the reconstruction loss. Both of these networks include the same convolutional architecture (the decoder is simply the transpose of the encoder), which consists of five convolutional layers. The layers have 32, 64, 128, 256, and 256 filters, and the corresponding filter sizes are 5, 3, 3, 3, and 4. For environments in which the robot observes two images (such as a scene camera and a wrist camera), we concatenate the images and apply rectangular filters. All other model networks are fully connected and consist of 2 hidden layers of 32 units each. We use ReLU activations after each layer.

As discussed, we train the actor and critic using the SAC RL algorithm [63] with the belief state as input. The SAC algorithm maximizes discounted returns as well as policy entropy via policy iteration. The critic (Q-function) is trained to minimize the soft Bellman error, which takes the entropy of the policy into account in the backup. We instantiate the actor and critic as fully connected networks with 2 hidden layers of 256 units each. We follow the implementation of SAC, including the use of 2 Q-networks and the tanh actor output activation (please see the SAC paper [63] for more details).
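The convolutional encoder described above (five layers with 32, 64, 128, 256, and 256 filters and filter sizes 5, 3, 3, 3, 4) can be sketched in PyTorch as follows. Strides and padding are not specified in the text, so the stride-2 choice here is an assumption borrowed from SLAC-style encoders, and the exact output resolution should be treated as illustrative:

import torch.nn as nn

def make_image_encoder(in_channels=3):
    # Five conv layers with 32, 64, 128, 256, 256 filters and kernel sizes 5, 3, 3, 3, 4.
    filters = [32, 64, 128, 256, 256]
    kernels = [5, 3, 3, 3, 4]
    layers, c_in = [], in_channels
    for c_out, k in zip(filters, kernels):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=k, stride=2), nn.ReLU()]
        c_in = c_out
    return nn.Sequential(*layers)

# The decoder mirrors this stack with ConvTranspose2d layers applied in reverse order.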
Meta-training alternates between collecting data with the current model and policy, and training the model and actor-critic. Gradients from the actor-critic optimization do not flow into the latent state model. Before beginning this alternating scheme, we first train the latent state model on 60 trajectories of data collected with a random policy. Relevant hyper-parameters for meta-training can be found in Table 1.

In this section we provide further details on the simulated experiments in Section 5.1. We also perform an ablation study of several design decisions in the MELD algorithm. We define a success metric for each environment that correlates with qualitatively solving the task: Cheetah-vel: within 0.2 m/s of the target velocity; Reacher: within 10 cm of the goal; Peg-insertion: complete insertion, with 5 cm of variation possible inside the site; Shelf-placing: mug within 5 cm of the goal. We use these success metrics because task reward is often misleading when averaged across a distribution of tasks; in Peg-insertion, for example, the numerical difference between always inserting the peg correctly and never inserting it can be as low as 0.1, since the distance between the center of the goal distribution and each goal is quite small and accuracy is required. In Figure 5, we plot this success threshold as a dashed black line.

In the Cheetah-vel environment, we control the robot by commanding the torques on the robot's 6 joints. The reward function penalizes the difference between the target velocity v_target and the current velocity v_x of the cheetah's center of mass, and includes a small control cost on the torques sent to the joints. The episode length is 50 time steps, and the observation consists of a single 64x64-pixel image from a tracking camera (as shown in Fig. 11a), which sees a view of the full cheetah. In all three Sawyer environments, we control the robot by commanding joint delta-positions for all 7 joints. The reward function penalizes the difference between the current end-effector pose x_ee and a goal pose x_goal, and is shaped to encourage precision near the goal, which is particularly important for the peg-insertion task. We impose a maximum episode length of 40 time steps for these environments. The observations for all three of these environments consist of two images concatenated to form a 64x128 image: one from a fixed scene camera, and one from a wrist-mounted first-person view camera. These image observations for each environment are shown in Figs. 11b-d. The simulation time step and control frequency for each of these simulated environments are listed in Table 2.

In this section we ablate several hyper-parameters of the algorithm to understand which components affect its performance. First, we examine the effect of the number of meta-training tasks on MELD's performance on test tasks. In Figure 12 we plot the performance on test tasks for different numbers of training tasks on the Reacher problem. We find that while generalization fails completely with only a single meta-training task, there are diminishing returns to increasing the number of training tasks beyond 20. This result demonstrates promising generalization, since a relatively small number of training tasks is needed in order to solve the held-out test tasks. This result also has a practical benefit in that it precludes the need to instrument a large task distribution for meta-training in the real world, which can be cumbersome. Although the reference implementation of SAC calls for each data collection step to be interleaved with a training step, this process of stopping after each environment step in order to perform training is not possible in the real world.
Instead, in MELD, we collect batches of data that consist of full episodes, and we interleave these collection phases with training phases that consist of many gradient steps. Here, we examine the effect of this choice on the performance of the algorithm in simulation. From the results shown in Figure 13, we see that MELD is not overly sensitive to this shift from the original SAC-style interleaved training toward batched off-policy training, which is essential for training in the real world. In Figure 14, we examine the effect of the dimension of the latent variable z_t. These experiments show that a larger latent dimension performs better; we use dimension 256 in all our experiments.

As described in Section 5.2, when reward is given only upon completion of the task, efficient exploration is required to identify a new task within a few trials. The agent can acquire these exploration strategies during meta-training by learning strategies tailored to the task distribution. For example, in the button-pressing problem presented in Section 5.2, a learned exploration strategy might try pushing each button in succession, but would not try to, e.g., pick up the control panel. While in principle these behaviors can be acquired by MELD as described in Section 4.2, in practice performing RL with sparse rewards at meta-training time presents a significant exploration challenge. In effect, to learn useful exploration strategies for meta-test time, the agent must first explore effectively during meta-training. Because RL with sparse rewards is very inefficient, we follow prior work [65, 4] and assume access to a shaped reward function during meta-training to help learn these strategies. This setup corresponds to a setting where meta-training is performed in a laboratory with access to instrumentation to calculate the shaped reward, while meta-testing occurs outside the lab where only sparse rewards are available.

To make use of the shaped reward during meta-training, we follow a two-stage procedure. First, we perform meta-training using the shaped reward as prescribed in Section 4.2. We then add data collected by this agent to the replay buffer of a second agent, which is trained with a small modification to the model training loss from Equation 5. Here, the latent state model takes the sparse reward signal as input (to match our desired meta-testing setup), but we still use the shaped reward as the reconstruction target. Denoting the shaped reward as r̃ and the sparse reward as r, the difference from Equation 5 is that the inference network conditions on the sparse reward while the reward decoder reconstructs the shaped reward:

L_model(x_{1:T}, r_{1:T}, r̃_{1:T}, a_{1:T-1}) = E_{z_{1:T} ~ q_φ} [ Σ_{t=1}^{T} ( log p_φ(x_t | z_t) + log p_φ(r̃_t | z_t) - D_KL( q_φ(z_t | x_t, r_t, z_{t-1}, a_{t-1}) || p_φ(z_t | z_{t-1}, a_{t-1}) ) ) ].

Additionally, we use the shaped reward to train the critic, since the critic is also not required at meta-test time. Note that MELD does not simply learn to copy the trajectories from the shaped-reward training that were used to warm-start the sparse-reward training. The former ("expert") trajectories move from the starting position directly to the correct button, since the shaped reward contains enough information to almost immediately identify the correct task. The MELD trajectories that result from receiving only sparse rewards as input, however, demonstrate systematic exploration, visiting different buttons in order to determine the correct one (see Figure 6). Finally, note that we use this same approach for the real-world WidowX experiments in Section 5.3.
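For concreteness, the sketch below mirrors this modified loss as a single-latent-layer variant of the L_model sketch given after Section 4.2: the inference network is fed the sparse reward r_t, while the reward decoder's reconstruction target is the shaped reward r̃_t. As before, the model.* methods are hypothetical handles for the networks described in Appendix A, and observations are treated as flat vectors.

import torch
from torch.distributions import Normal, kl_divergence

def sparse_input_model_loss(model, obs, sparse_rew, shaped_rew, act):
    # obs: (B, T, obs_dim), sparse_rew / shaped_rew: (B, T), act: (B, T-1, act_dim)
    T = sparse_rew.shape[1]
    post = model.q_init(obs[:, 0], sparse_rew[:, 0])          # posterior conditions on the sparse reward
    prior = Normal(torch.zeros_like(post.mean), torch.ones_like(post.stddev))
    elbo = 0.0
    for t in range(T):
        if t > 0:
            prior = model.dynamics(z, act[:, t - 1])
            post = model.q_step(z, obs[:, t], sparse_rew[:, t], act[:, t - 1])
        z = post.rsample()
        elbo = elbo + model.decode_obs(z).log_prob(obs[:, t]).sum(-1)
        elbo = elbo + model.decode_rew(z).log_prob(shaped_rew[:, t])   # reconstruct the *shaped* reward
        elbo = elbo - kl_divergence(post, prior).sum(-1)
    return -elbo.mean()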
In this section, we discuss the benefit of the time-varying nature of MELD's latent belief. We emphasize that this design decision is useful in the standard meta-RL setting, as well as in more realistic settings where the underlying task itself changes within an episode. We first present a didactic image-based 2D navigation problem to illustrate how MELD can learn extended exploration strategies to adapt to a new task, similar to the results shown in Figure 6. Here, the task distribution consists of goals located along a semicircle around the start state. The agent receives inputs in the form of 64x64 image observations and rewards that are non-zero only upon reaching the correct goal. We use the same approach of using shaped rewards for meta-training as described in Appendix C. As shown in Figure 15a, the agent learns an efficient exploration strategy of traversing the semicircular goal region until the goal is found. By updating the posterior belief at each step, MELD is able to find the goal within 10-20 steps, instead of the multiple episodes required by methods that explore via posterior sampling [4, 65] and hold the task variable constant across each episode. Furthermore, note that once the goal is found, MELD can navigate directly to it in the next episode (Figure 15b). This experiment demonstrates that even in the standard meta-RL setting, where the underlying task remains constant throughout an episode, updating task information at each timestep can enable faster adaptation. We argue that this behavior has safety benefits in the real world, since the agent need not complete full episodes of potentially hazardous exploration before incorporating task information.

The experiments in the main paper, as well as the section above, consider the standard meta-RL paradigm, where the agent adapts to one test task at a time. However, many realistic scenarios consist of a sequence of tasks. For example, consider a robot moving a mug filled with liquid; if some of the liquid spills, the robot must adapt to the new dynamics of the lighter mug to finish the job. Because MELD updates the belief over the hidden variables at each time step, it can be directly applied to this setting without modification. We evaluate MELD in the Cheetah-vel environment on a sequence of 3 different target velocities within a single episode and observe in Figure 16 that MELD adapts to track each velocity within a few time steps.

Since meta-training requires training across a distribution of tasks, we build an automatic task reset mechanism for the real-world experiments with the WidowX robot performing Ethernet cable insertion. This mechanism controls the translational and rotational displacement of the network switch. The network switch (A) is mounted to a 3D-printed housing (B) with a gear attached. We control the rotation of the housing through motor 1. This assembly is then mounted on top of a linear rail (C), and motor 2 controls its translational displacement through a timing pulley. In our experiments, the training task distribution consisted of 20 different tasks, where each task was randomly assigned from a rotational range of 16 degrees and a translational range of 2 cm.

Figure 17: Automatic task reset mechanism. The network switch is rotated and translated by a series of motors in order to generate different tasks for meta-learning. This allows our meta-learning process to be entirely automated, without needing human intervention to reset either the robot or the task at the beginning of each rollout.
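For reference, a hypothetical sampler for this task distribution is sketched below: each task is a (rotation, translation) displacement of the network switch drawn uniformly from the stated 16-degree and 2 cm ranges. Centering the ranges at zero and sampling uniformly are assumptions for illustration, not details given in the text.

import random

def sample_reset_tasks(num_tasks=20, rot_range_deg=16.0, trans_range_cm=2.0, seed=0):
    # Each task is a (rotation, translation) command sent to motors 1 and 2 of the
    # reset mechanism; ranges are assumed to be centered at zero.
    rng = random.Random(seed)
    return [(rng.uniform(-rot_range_deg / 2, rot_range_deg / 2),
             rng.uniform(-trans_range_cm / 2, trans_range_cm / 2))
            for _ in range(num_tasks)]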
In this experiment, we demonstrate that MELD can reason jointly about state and task information to perform real-world peg insertion with a 7-DoF Sawyer robot (Figure 18, left). On the Sawyer robot, we learn precise peg insertion where the task distribution consists of three tasks, each corresponding to a different target box. Note that the goal is not provided to the agent, but must be inferred from its history of observations and rewards. The reward function is the sum of the L2-norms of the translational and rotational distances between the pose of the object in the end-effector and a goal pose. The agent's observations are concatenated images from two webcams (Figure 18, right): one fixed view and one first-person view from a wrist-mounted camera. The robot succeeds on all three tasks after training on 4 hours of data (60,000 samples at 4 Hz), as shown in Figure 19. Videos of the experiment may be found on the project website.

Figure 19: Rewards on training tasks during meta-training for Sawyer peg insertion.

References (titles only):
Learning to reinforcement learn
RL2: Fast reinforcement learning via slow reinforcement learning
Model-agnostic meta-learning for fast adaptation of deep networks
Efficient off-policy meta-reinforcement learning via probabilistic context variables
Deep spatial autoencoders for visuomotor learning
Deep predictive policy training using reinforcement learning
Stochastic latent actor-critic
SOLAR: Deep structured representations for model-based reinforcement learning
Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook (PhD thesis)
A Bayesian framework for concept learning
A Bayesian approach to unsupervised one-shot learning of object categories
Robotic grasping and contact: A review
Decentralized algorithms for multi-robot manipulation via caging
Robot manipulation of deformable objects
Task transfer via collaborative manipulation for insertion assembly
Interpretation of force and moment signals for compliant peg-in-hole assembly
Reinforcement learning in robotics: A survey
Acquiring robot skills via reinforcement learning
End-to-end training of deep visuomotor policies
Learning synergies between pushing and grasping with self-supervised deep RL
Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks
Deep reinforcement learning for industrial insertion tasks with visual inputs and natural reward signals
Meta-learning with temporal convolutions
ProMP: Proximal meta-policy search
Evolved policy gradients
Guided meta-policy search
VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning
Generalized hidden parameter MDPs: Transferable model-based RL in a handful of trials
Meta reinforcement learning as task inference
One-shot visual imitation learning via meta-learning
One-shot imitation from observing humans via domain-adaptive meta-learning
Task-embedded control networks for few-shot imitation learning
Learning one-shot imitation from humans without humans
Few-shot goal inference for visuomotor learning and planning
Meta reinforcement learning for sim-to-real domain adaptation
Rapidly adaptable legged robots via evolutionary meta-learning
Learning to adapt in dynamic, real-world environments through meta-RL
Neural network based state estimation of dynamical systems
Memory-based control with recurrent neural networks
Deep recurrent Q-learning for partially observable MDPs
Planning and acting in partially observable stochastic domains
Point-based value iteration: An anytime algorithm for POMDPs
A Bayesian approach for learning and planning in partially observable Markov decision processes
Solving nonlinear continuous state-action-observation POMDPs for mechanical systems with Gaussian noise
QMDP-net: Deep learning for planning under partial observability
Deep variational reinforcement learning for POMDPs
Temporal difference variational auto-encoder
Playing Atari with deep reinforcement learning
End-to-end robotic reinforcement learning without reward engineering
Deep object pose estimation for semantic robotic grasping of household objects
Contextual reinforcement learning of visuo-tactile multi-fingered grasping policies
Autonomous reinforcement learning on raw visual input data in a real world application
Self-supervised visual descriptor learning for dense correspondence
Self-supervised correspondence in visuomotor policy learning
Improving sample efficiency in model-free reinforcement learning from images
Mid-level visual representations improve generalization and sample efficiency for learning visuomotor policies
Embed to control: A locally linear latent dynamics model for control from raw images
Deep variational Bayes filters: Unsupervised learning of state space models from raw data
Learning latent dynamics for planning from pixels
DeepMDP: Learning continuous latent space models for representation learning
Graphical models, exponential families, and variational inference
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor