title: Zero-Shot Terrain Generalization for Visual Locomotion Policies
authors: Escontrela, Alejandro; Yu, George; Xu, Peng; Iscen, Atil; Tan, Jie
date: 2020-11-11
affiliations: Google Brain Robotics ({georgeyu,pengxu,atil,jietan}@google.com); Georgia Institute of Technology (aescontrela@gatech.edu). Work performed while Alejandro was an intern at Google Brain.

Legged robots have unparalleled mobility on unstructured terrains. However, it remains an open challenge to design locomotion controllers that can operate in a large variety of environments. In this paper, we address the challenge of automatically learning locomotion controllers that can generalize to a diverse collection of terrains often encountered in the real world. We frame this challenge as a multi-task reinforcement learning problem and define each task as a type of terrain that the robot needs to traverse. We propose an end-to-end learning approach that makes direct use of the raw exteroceptive inputs gathered from a simulated 3D LiDAR sensor, thus circumventing the need for ground-truth heightmaps or preprocessing of perception information. As a result, the learned controller demonstrates excellent zero-shot generalization capabilities and can navigate 13 different environments, including stairs, rugged land, cluttered offices, and indoor spaces with humans.

The ability to traverse unstructured terrains makes legged robots an appealing solution to a wide variety of tasks, including disaster relief, last-mile delivery, industrial inspection, and planetary exploration [1], [2]. To deploy robots in these settings successfully, we must design controllers that work well across many different terrains. Due to the diversity of environments that a legged robot can operate in, hand-engineering such a controller presents unique challenges.

Deep Reinforcement Learning (DRL) has proven itself capable of automatically acquiring control policies for a large variety of challenging locomotion tasks. However, many of these approaches learn control policies that succeed only in a single type of terrain with limited variations. This limits the robot's ability to generalize to new or unseen environments, which is a crucial feature of a useful locomotion controller.

In this paper, we develop an end-to-end reinforcement learning system that enables legged robots to traverse a large variety of terrains. To facilitate learning generalizable policies, we make two purposeful design decisions for our learning system. First, we formulate the problem as a Multi-Task Partially Observable Markov Decision Process and show that the robot learns a robust policy that works well across a wide variety of tasks (terrains). To this end, we develop a novel procedural terrain generation method, which can efficiently generate a large variety of terrains for training. Second, we design an end-to-end neural network architecture that handles both perception and locomotion. We call this parameterization a visual-locomotion policy. While many prior works in the legged robot literature focus on blind walking, which does not involve exteroceptive sensors (e.g., camera, LiDAR), we find that exteroceptive perception is essential for robots to navigate in diverse environments. Our end-to-end visual-locomotion policy takes both exteroceptive (a LiDAR scan) and proprioceptive information of the robot and outputs low-level motor commands.
We embed the Policies Modulating Trajectory Generators (PMTG) [3] framework into our policy architecture to generate cyclic and smooth actuation patterns and to facilitate the learning of robust locomotion policies. We evaluate our learning system using a high-fidelity physics simulator [4] and visually-realistic indoor scans [5] (Figure 1). We test the learned policy in thirteen different and realistic simulation environments (five for training and eight for testing). Our system learns highly generalizable locomotion policies, which demonstrate zero-shot generalization to unseen testing environments. We also show that our visual-locomotion policy's parameterization is key to generalization and yields far better performance than commonly-used reactive policies. This paper's main contributions include an end-to-end visual-locomotion policy parameterization and a complete multi-task learning system, with which a quadruped robot learns a single locomotion policy that can traverse a diverse set of terrains.

Locomotion controllers can be developed using trajectory optimization [6], whole-body control [7], model predictive control [8], and state machines [9]. While the controllers developed by these techniques can generalize to a certain degree, expertise and manual tuning are often needed to adapt them to different terrains. In contrast, Deep Reinforcement Learning [10] can automatically learn agile and robust locomotion skills [11], [12], [13], [14]. Prior work in RL has learned policies that are specific to a single environment [15] or that generalize to variations of a single type of terrain [16], [17], [18]. Recently, Lee et al. [14] combined various techniques, such as ActuatorNet [13], PMTG [3], curriculum learning, and "learning by cheating" [19], and successfully performed zero-shot transfer from simulation to many challenging terrains in the real world. While our paper's high-level goal is similar to this prior work, our approach incorporates exteroceptive sensors that enable the robot to navigate in cluttered indoor environments where blind walking may have difficulties.

Multi-task reinforcement learning (MTRL) [20] is a promising approach to train generalizable policies that can accomplish a wide variety of tasks. Hessel et al. [21] learned a single policy that achieves state-of-the-art performance on 57 Atari games. Yu et al. [22] evaluated the performance of various RL algorithms on a grasping and manipulation benchmark and demonstrated that a single control policy is capable of completing a variety of complex robotic manipulation tasks. In this paper, we apply MTRL to develop a learning system for locomotion that enables legged robots to navigate in a large variety of environments.

In this work, we frame legged locomotion as a multi-task reinforcement learning (MTRL) problem and define each task as a type of terrain that the legged robot (agent) must traverse. To learn generalizable locomotion policies, our learning system consists of a procedural terrain generator that can efficiently generate diverse training environments, and an end-to-end visual-locomotion policy architecture that directly maps the robot's exteroceptive and proprioceptive observations to motor commands.

Given a distribution of tasks M, each task M_i ∈ M is a Partially Observable Markov Decision Process (POMDP). A POMDP is a tuple (S, A, O, P, R), where S, A, and O are the state, action, and observation spaces, P(s' | s, a) is the transition probability function, and R : S × A → R is the reward function. During training, the agent is presented with randomly sampled tasks M_i ∈ M (Section III-B).
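To make the per-episode task sampling concrete, below is a minimal sketch of the multi-task training loop, assuming a gym-style environment interface. The helper names (sample_task, make_env, agent) and the terrain parameter bounds are illustrative assumptions, not values from the paper.

```python
import random

# Illustrative terrain-type parameter bounds (the phi_i vectors); the numbers
# here are placeholders, not the bounds used by the authors.
TERRAIN_TYPES = {
    "flat":   {"max_height": (0.0, 0.0)},
    "rugged": {"max_height": (0.02, 0.08)},
    "stairs": {"step_height": (0.05, 0.15), "step_length": (0.25, 0.45)},
}

def sample_task():
    """Sample a task M_i: a terrain type plus random parameters within its bounds."""
    name = random.choice(list(TERRAIN_TYPES))
    params = {key: random.uniform(*bounds) for key, bounds in TERRAIN_TYPES[name].items()}
    return name, params

def train(agent, make_env, num_episodes):
    """Multi-task training loop: a fresh random task is generated for every episode."""
    for _ in range(num_episodes):
        task_name, task_params = sample_task()
        env = make_env(task_name, task_params)   # builds the terrain for this task
        obs, done = env.reset(), False
        while not done:
            action = agent.act(obs)
            obs, reward, done, _ = env.step(action)
            agent.observe(obs, reward, done)
```

Because a new task is drawn at every episode, the policy never conditions on a task ID; it must infer the terrain from its own observations.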
The solution of the multi-task POMDP is a stochastic policy π : O × A → R+ that maximizes the expected accumulated reward over the episode length T.

Our problem is partially observable because of the limited sensors onboard the robot. The robot is equipped with a LiDAR sensor to perceive the distances d to the surrounding environment. Proprioceptive information comes from a simulated IMU sensor, which measures the roll φ, pitch θ, and angular velocity of the torso β_ω = (φ̇, θ̇, ψ̇), and from motor encoders that measure the robot's 12 joint angles q. The complete observation at timestep t is o_t = [s_t, g_d, g_h, a_{t−1}, s_TG], where s_t = [d, φ, θ, β_ω, q] are the sensor observations, g_d and g_h are the distance and relative heading to the target, a_{t−1} is the action at the previous timestep, and s_TG are the parameters of the trajectory generator (Section III-C). Unlike some prior work in MTRL, where the task ID is part of the observation [22], [23], we purposefully choose not to leverage such information, because identifying tasks automatically in the real world is challenging. Instead, we train a policy that relies on its own perception input and demonstrates zero-shot generalization to new tasks, without knowing the task ID explicitly. In Section IV, we demonstrate that this perception input is crucial for learning policies that generalize well to new tasks. The output action a_t of the policy specifies the desired joint angles, which are tracked by PD controllers on the simulated robot.

We employ a simple reward function that encourages the agent to navigate to a target location g = (x_g, y_g, z_g) (the red ball in Figure 1): r_t = (g_{d,t−1} − g_{d,t}) / Δt, where g_{d,t} is the Euclidean distance from the robot to the target location at timestep t and Δt is the timestep duration. This reward can be interpreted as the speed at which the robot moves towards the target location. Once the robot's center of mass is within a threshold distance of the target location, the task is complete.

We develop a procedural terrain generator to produce diverse and challenging terrains that provide the robot with a large quantity of rich training data. The environment is composed of m × n pillars, each pillar having cross-sectional dimensions l × w and height h. We denote H = {h_{i,j}} ∈ R^{m×n} as the height field of all the pillars. During training, we select a task M_i and adjust each pillar's height to reflect the chosen task. Each task is a set of randomly generated terrains that belong to the same type (e.g., flat, stairs). Each type of terrain is described by a parameter vector φ_i, which provides the lower and upper bounds for the random sampling. The terrain generator constructs the heightfield H from the given parameter vector φ. For example, the parameter vector φ for the rugged terrain task (Fig. 3b) includes the minimum and maximum values of the heightfield; for the stairs task, it defines the height and length of each step; for the obstacles task, it specifies the number of obstacles n and the obstacle height h. Table I summarizes the parameters and terrain generators for selected terrain types. With this simple parameterization, we can generate over ten different types of terrains that a robot may encounter in the real world. Our procedural terrain generation algorithm provides a rich set of training data essential for generalizable policies to emerge.
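As a concrete illustration of the procedural terrain generator described above, the sketch below constructs an m × n heightfield H for a few terrain types. The function name, parameter keys, and sampling ranges are assumptions for illustration and do not reproduce the paper's Table I values.

```python
import numpy as np

def generate_heightfield(terrain_type, phi, m=50, n=50, rng=None):
    """Build an m x n heightfield H = {h_ij} of pillar heights for one sampled task.

    `phi` holds the bounds or dimensions for the chosen terrain type; the
    specific keys used below are illustrative.
    """
    rng = rng or np.random.default_rng()
    H = np.zeros((m, n))

    if terrain_type == "flat":
        pass                                             # all pillar heights stay at zero
    elif terrain_type == "rugged":
        H = rng.uniform(phi["min_height"], phi["max_height"], size=(m, n))
    elif terrain_type == "stairs":
        pillars_per_step = max(1, int(round(phi["step_length"] / phi["pillar_length"])))
        for i in range(m):
            H[i, :] = (i // pillars_per_step) * phi["step_height"]   # rising steps along one axis
    elif terrain_type == "obstacles":
        for _ in range(int(phi["num_obstacles"])):
            i, j = rng.integers(m), rng.integers(n)
            H[i, j] = phi["obstacle_height"]             # isolated tall pillars act as obstacles
    else:
        raise ValueError(f"unknown terrain type: {terrain_type}")
    return H

# Example: one rugged-terrain heightfield sampled with placeholder bounds.
H = generate_heightfield("rugged", {"min_height": 0.0, "max_height": 0.08})
```

Each call corresponds to one task M_i; regenerating H every episode is what supplies the terrain diversity that the multi-task formulation relies on.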
Exteroceptive perception plays a crucial role when legged robots need to navigate different terrains and environments with obstacles and humans [24], [25]. As such, we aim to incorporate perception into our policy architecture so that information from the robot's surroundings can modulate locomotion. Additionally, the policy's low-level actuation commands need to be smooth and realizable on the physical robot. To this end, we restrict the search space of possible gaits to be cyclic and smooth, while keeping it expressive enough that perception can modulate locomotion sufficiently to work on different terrains.

In our visual-locomotion policy architecture (Fig. 2), we use two separate neural network encoders to process the proprioceptive and exteroceptive inputs. The upper branch of Fig. 2a processes the LiDAR input, while the lower branch handles the proprioceptive information. The learned lower-dimensional features are concatenated with the target information before being passed to the policy's locomotion component. We choose Policies Modulating Trajectory Generators (PMTG) [3] as the architecture of the locomotion component (Fig. 2b). PMTG encourages the policy to learn smooth and cyclic locomotion behaviors: it outputs a desired trajectory for the legs that is modulated by a learned policy π_θ(·). The policy observes the state of the trajectory generator (TG), s_tg, and the robot's observation s_t, and outputs the parameters of the TG, p_tg, including gait frequency, swing height, and stride length, together with a residual action term µ_fb. The final output action of our visual-locomotion policy is the combination of the trajectory generator output and the residual action: a_t = µ_tg + µ_fb. Please refer to the original paper [3] for more details. As detailed in [16], our visual-locomotion policy architecture achieves a separation of concerns between basic locomotion skills and terrain perception, which enables the robot to adapt its smooth locomotion behaviors to its surrounding environments.

Fig. 2. (a) Visual-locomotion policy architecture. (b) The locomotion component, which uses PMTG [3] to produce smooth and cyclic actuation patterns.
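The sketch below shows how these pieces fit together at one control step, assuming generic callables for the policy and the two encoders. The TrajectoryGenerator is a simplified stand-in for the PMTG trajectory generator rather than the exact formulation of [3], and all names are illustrative.

```python
import numpy as np

class TrajectoryGenerator:
    """Simplified cyclic trajectory generator: a stand-in for the PMTG TG whose
    frequency, swing height, and stride length are modulated by the policy."""

    def __init__(self, num_joints=12):
        self.phase = 0.0
        self.num_joints = num_joints

    def step(self, frequency, swing_height, stride_length, dt=0.01):
        # Advance the gait phase and emit a cyclic joint-angle pattern (placeholder shape).
        self.phase = (self.phase + 2.0 * np.pi * frequency * dt) % (2.0 * np.pi)
        offsets = np.linspace(0.0, 2.0 * np.pi, self.num_joints, endpoint=False)
        return swing_height * np.sin(self.phase + offsets) + stride_length * np.cos(self.phase)

    def state(self):
        return np.array([np.sin(self.phase), np.cos(self.phase)])  # s_tg fed back to the policy

def visual_locomotion_step(policy, lidar_encoder, proprio_encoder, tg,
                           lidar_scan, proprio, goal, prev_action):
    """One control step: encode inputs, query the policy, combine TG output with the residual.

    Assumes the policy outputs 3 TG parameters followed by num_joints residual joint targets.
    """
    features = np.concatenate([
        lidar_encoder(lidar_scan),     # exteroceptive features
        proprio_encoder(proprio),      # proprioceptive features
        np.asarray(goal),              # (g_d, g_h) target information
        np.asarray(prev_action),       # a_{t-1}
        tg.state(),                    # s_tg
    ])
    outputs = np.asarray(policy(features))
    frequency, swing_height, stride_length = outputs[:3]   # p_tg
    mu_fb = outputs[3:]                                    # residual joint-angle term
    mu_tg = tg.step(frequency, swing_height, stride_length)
    return mu_tg + mu_fb                                   # a_t = mu_tg + mu_fb
```

Most of a_t comes from the cyclic TG output; the learned residual and TG parameters only modulate it, which is what keeps the resulting gaits smooth.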
We design experiments to validate the proposed system's ability to learn a visual-locomotion policy that generalizes well to terrains not encountered during training. In particular, we would like to answer the following two questions: • Can our system learn visual-locomotion policies that demonstrate zero-shot generalization to new terrains? • Can our policy architecture effectively use the LiDAR input and the PMTG parameterization to improve generalization performance on unseen terrains?

To answer these questions, we evaluate our system using a simulated Unitree Laikago quadruped robot [26], which weighs approximately 22 kg and is actuated by 12 motors. We simulate the onboard Velodyne VLP-16 (Puck) LiDAR sensor, which provides perception of the surrounding environment (see Figure 2b). The LiDAR measures the distance from the surrounding obstacles and terrain to the robot. This sensor supports 16 channels, a 360° horizontal field of view, and a 30° vertical field of view. We add Gaussian noise to the ground-truth distance readings in simulation to mimic the real-world noise model. The 3D LiDAR scan matrix D is normalized to the range [0, 1] and flattened into a vector d. Our policy computes joint target positions (a_t), which are converted to target joint torques by a PD controller running at 1 kHz. Rigid body dynamics and contacts are also simulated at 1 kHz. In other words, the position and velocity (provided by PyBullet [4]) and the desired torque (provided by the PD controller) are sent to the actuator model every 1 ms. The actuator model then computes 10 internal 100 µs steps and provides the effective output torque of the actuator, which PyBullet uses to compute joint accelerations. The simulation environment is configured with an action repeat of 10 steps, which means that our policy computes a new action a_t and receives a new state s_t every 10 ms (100 Hz).

We train the visual-locomotion policy using the MTRL formulation, with simulated environments randomly generated by our procedural task generation method (Section III-B). We use a distributed version of Proximal Policy Optimization (PPO) [27] in TF-Agents [28] for training. We use a 2-layer fully-connected neural network with dimensions (512, 256) to parameterize the value function and another network with dimensions (256, 128) to parameterize the policy. The policy outputs the parameters of a multivariate Gaussian distribution, from which we sample actions during training. During evaluation, we use a greedy policy that executes the mean of the multivariate Gaussian distribution produced by the policy network. The exteroceptive and proprioceptive input encoders both have dimensions (32, 16, 4). We use the ReLU activation function for all layers in both networks [29]. The advantages are estimated using Generalized Advantage Estimation [30].

We then evaluate the trained policies on a suite of testing environments not encountered during training. Figure 1 illustrates a subset of these testing environments. These high-fidelity simulated environments are created in the PyBullet physics engine [4] with Gibson scenes [5]. A policy's ability to successfully navigate a given terrain is measured by the task completion rate, tcr = 1 − g_{d,T} / g_{d,0}, which measures how close the agent gets to the target relative to its starting position. Here g_{d,T} is the final Euclidean distance between the robot and the target when the robot falls or completes the task, and g_{d,0} is the distance at the beginning of the episode. A task completion rate of 1 indicates successful navigation to the target, whereas a tcr close to zero means that the robot cannot navigate across the terrain.
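Both the per-step reward defined earlier and the task completion rate above reduce to one-line calculations; the sketch below implements the reconstructed formulas r_t = (g_{d,t−1} − g_{d,t}) / Δt and tcr = 1 − g_{d,T} / g_{d,0}, with illustrative function names.

```python
import numpy as np

def distance_to_target(robot_xyz, target_xyz):
    """Euclidean distance g_d between the robot's center of mass and the target g."""
    return float(np.linalg.norm(np.asarray(target_xyz) - np.asarray(robot_xyz)))

def step_reward(g_d_prev, g_d_curr, dt=0.01):
    """r_t = (g_{d,t-1} - g_{d,t}) / dt: the speed at which the robot approaches the target."""
    return (g_d_prev - g_d_curr) / dt

def task_completion_rate(g_d_final, g_d_initial):
    """tcr = 1 - g_{d,T} / g_{d,0}: 1 when the target is reached, near 0 when no progress is made."""
    return 1.0 - g_d_final / g_d_initial

# Example with placeholder numbers: the robot starts 5 m from the target and ends 1.5 m away.
print(task_completion_rate(1.5, 5.0))  # 0.7
```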
Table II shows the generalization performance of our visual-locomotion policy trained on different types of terrains (rows) and tested in unseen environments (columns), including a maze (Maze), a steep and rugged mountain (Mountain), two indoor scenarios (Office 1 and Office 2), an office space with moving humans (Dynamic Env), a forest scene with rugged terrain and obstacles (Forest), a winding path with a cliff on both sides (Cliff), and a randomly-generated continuous mesh (Continuous). Policies trained on a single type of terrain achieve a low task completion rate in the testing environments due to a lack of diverse training data. In contrast, our approach achieves much higher generalization performance. For instance, our method achieves an average task completion rate of 67% on the mountain task, while policies trained on a single type of terrain achieve at most 28% (see Figure 4 for a snapshot of our policy navigating up the rugged mountain trail). These results indicate that our MTRL formulation with procedural task generation, combined with the visual-locomotion policy architecture, yields superior generalization performance.

We perform three ablation studies to understand the importance of each design decision in our system. Table III summarizes their impact on the resulting generalization performance of the policy.

a) PMTG: We replace the locomotion component of the visual-locomotion policy with a reactive policy that does not have a trajectory generator. Our PMTG-parameterized visual-locomotion policy performs 28%-218% better than the purely reactive locomotion component. We find that PMTG produces smoother actions and leads to improved zero-shot generalization to new terrains.

b) Exteroceptive input: We remove the LiDAR input from the visual-locomotion policy. As Table III shows, the exteroceptive information plays a critical role in learning generalizable locomotion policies that can adapt to a wide variety of terrains. This finding agrees with results from experimental psychology, which establish the importance of exteroceptive observations in guiding foot placement when navigating over complex terrain [25], [24]. Figure 5 visualizes the trajectory produced by our visual-locomotion policy on a terrain with obstacles. When walking over flat terrain, the robot's foot height is constant and cyclic, varying only when the robot turns to avoid obstacles. In contrast, on rugged terrain (Figure 6), the robot carefully places its feet to adapt to the geometry of the terrain and maintain balance. This careful foot placement is essential on challenging terrains and requires a visual feedback loop, which our learning system provides.

Fig. 6. Visualization of the trajectory generated by our method on rugged terrain. Foot Z positions for the left hind, right hind, left forward, and right forward feet are shown. The rugged terrain requires that the robot carefully place its feet to maintain balance.

c) MTRL training scheme: Our system generates a new random locomotion task at each episode for all the distributed workers, which ensures a steady stream of rich training data for the agent. In this study, we reduce the variety of tasks supplied during training by providing tasks sequentially; that is, the agent learns one task for a fixed number of training steps before switching to the next task. The policy trained in this sequential fashion performs poorly due to catastrophic forgetting [31].

These ablation studies confirm the importance of each component of our system, including the exteroceptive input and the PMTG parameterization used in the visual-locomotion policy architecture, as well as our multi-task POMDP training formulation. By combining these components, our system can learn locomotion policies that work on various terrains and demonstrate zero-shot generalization to new environments.

We introduce a learning system that enables legged robots to traverse a variety of environments and demonstrates zero-shot generalization to new terrains. Our system consists of a novel multi-task reinforcement learning formulation of the locomotion problem, a visual-locomotion policy architecture that encourages smooth actions and incorporates perception to modulate locomotion, and a novel procedural terrain generation algorithm that provides the agent with rich training data from a variety of simulated terrains. Our results on a suite of simulated environments show that treating legged locomotion as a multi-task POMDP leads to increased generalization performance. Additionally, we show that providing the policy with a strong prior over the space of gaits further enhances its ability to generalize to unseen terrains.
In future work, we plan to evaluate our work on a real-world robot.

REFERENCES
[1] Real-time motion planning in unknown environments for legged robotic planetary exploration
[2] Advances in real-world applications for legged robots
[3] Policies modulating trajectory generators
[4] PyBullet, a Python module for physics simulation for games, robotics and machine learning
[5] Gibson Env: Real-world perception for embodied agents
[6] Gait and trajectory optimization for legged systems through phase-based end-effector parameterization
[7] Stabilizing series-elastic point-foot bipeds using whole-body operational space control
[8] Feedback MPC for torque-controlled legged robots
[9] MIT Cheetah 3: Design and control of a robust, dynamic quadruped robot
[10] Reinforcement learning: An introduction
[11] Sim-to-real: Learning agile locomotion for quadruped robots
[12] Learning to walk via deep reinforcement learning
[13] Learning agile and dynamic motor skills for legged robots
[14] Learning quadrupedal locomotion over challenging terrain
[15] Learning to walk in the real world with minimal human effort
[16] Emergence of locomotion behaviours in rich environments
[17] DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning
[18] DeepGait: Planning and control of quadrupedal gaits using deep reinforcement learning
[19] Learning by cheating
[20] Multitask learning: A knowledge-based source of inductive bias
[21] Multi-task deep reinforcement learning with PopArt
[22] Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning
[23] Multi-task reinforcement learning without interference
[24] Gaze and the control of foot placement when walking in natural terrain
[25] Visual control of foot placement when walking over complex terrain
[26] Laikago: Let's challenge new possibilities
[27] Proximal policy optimization algorithms
[28] TF-Agents: A library for reinforcement learning in TensorFlow
[29] What matters in on-policy reinforcement learning? A large-scale empirical study
[30] High-dimensional continuous control using generalized advantage estimation
[31] Catastrophic interference in connectionist networks: The sequential learning problem