title: From Pixels to Legs: Hierarchical Learning of Quadruped Locomotion
authors: Jain, Deepali; Iscen, Atil; Caluwaerts, Ken
date: 2020-11-23

Legged robots navigating crowded scenes and complex terrains in the real world are required to execute dynamic leg movements while processing visual input for obstacle avoidance and path planning. We show that a quadruped robot can acquire both of these skills by means of hierarchical reinforcement learning (HRL). By virtue of their hierarchical structure, our policies learn to implicitly break down this joint problem by concurrently learning High Level (HL) and Low Level (LL) neural network policies. The two levels are connected by a low-dimensional hidden layer, which we call the latent command. HL receives a first-person camera view, whereas LL receives the latent command from HL and the robot's on-board sensors to control its actuators. We train policies to walk in two different environments: a curved cliff and a maze. We show that hierarchical policies can concurrently learn to locomote and navigate in these environments, and that they are more efficient than non-hierarchical neural network policies. This architecture also allows for knowledge reuse across tasks: LL networks trained on one task can be transferred to a new task in a new environment. Finally, HL, which processes camera images, can be evaluated at much lower and varying frequencies compared to LL, thus reducing computation times and bandwidth requirements.

Legged robots have the potential to traverse many types of terrains while demonstrating a diverse set of agile skills. However, control of legged robots is challenging due to the dynamic nature of the problem. When visual inputs are incorporated in the control loop, the task becomes more difficult still, as it requires perceiving the environment while simultaneously handling the fast, contact-rich dynamics of the robot's legs. One solution is to split the problem into independent modules for vision and dynamics. However, this approach is typically limited by the high-level features and low-level behaviors that are independently designed or learned: it may be impossible to find a single feature space that is optimal for all given tasks, or the low-level behaviors needed may differ from task to task.

We tackle the two problems of vision processing and fast dynamics by designing a hierarchical architecture with a high level (HL) and low level (LL) subsystem, which are trained concurrently. This framework does not require design decisions beyond a standard RL setup. HL handles vision at a variable frequency and outputs a latent command, which is passed on to LL. LL runs at a higher frequency and controls the legs. We build on the hierarchical architecture presented by Jain et al. [1] and incorporate vision processing to learn to navigate environments while concurrently discovering legged locomotion skills. The architecture is trained using Evolutionary Strategies (ES) [2]. The main contributions of our research are as follows:

• Our HRL solution implicitly learns a complete pipeline from pixels to motor commands for quadruped legged locomotion without the need to design or learn low-level behaviors. We show that a hierarchical policy with more than 10^5 parameters can be successfully trained using evolutionary strategies.
• Separation of observations into hierarchical levels allows knowledge reuse, because behaviors learned by LL are largely task-agnostic and transferable to tasks of a similar nature. We show that data efficiency can be further improved by transferring LL from previously solved tasks, even if they were trained in a different environment.
• The high level runs at a variable frequency computed by its own neural network. As a result, visual inputs are processed at a much lower frequency (1.5 Hz - 10 Hz) than the low level (500 Hz). This leads to more efficient exploration and reduced training times, as illustrated by our experiments.
• We illustrate how locomotion primitives emerge and can be selected through a low-dimensional latent command. This creates an information bottleneck in which HL learns to extract only the useful visual information and LL learns only the primitives relevant to the environment and the task at hand. We provide a detailed analysis of these learned behaviors.

To test our method, we use a highly realistic simulation model of the Laikago robot, a quadruped with 12 degrees of freedom. The model is created in PyBullet [3] and carefully tuned based on the physical robot. Although we are temporarily unable to validate the results on hardware, we are confident in the transferability of our learned policies: in prior work, we have demonstrated successful deployment of policies learned in simulation on the robot, and we have performed validation experiments for the HL by processing real-world depth images and verifying the computed latent commands. More details are provided in Appendix A.

We test our framework on 3 visual navigation tasks and compare our policies with a non-hierarchical baseline. Our method outperforms the baseline and achieves higher sample efficiency and a lower wall-clock training time. Furthermore, we show that by running HL at low frequency, the inference time and computation cost of the learned policy are reduced with minimal effect on task performance.

Hierarchical Reinforcement Learning. HRL decomposes complex decision making into subproblems. Well-known HRL frameworks in the literature are based on Options [4], MAXQ value decomposition [5], and Hierarchical Abstract Machines [6]. In these frameworks, an HL policy typically outputs temporally extended actions which are executed by LL for a specified amount of time. Designing or training good LL policies quickly becomes challenging for complicated tasks such as those encountered in robotics. Some papers learn LL by imitating reference data [7, 8, 9]; this approach relies heavily on high-quality datasets, which are often hard to obtain. Recognizing the challenges of pre-training LL, we focus on a framework for learning both levels concurrently from scratch; many approaches have been proposed for this. In one class of methods, a sub-goal conditioned LL is trained to reach a point in observation space specified by HL [10, 11, 12]. However, this interface is not suitable for high-dimensional observation spaces such as camera images [13]. Some methods design auxiliary rewards to promote diversity in LL skills [14, 15, 16]; however, by doing so, the RL agent may be forced to learn many skills irrelevant to the given task, leading to inefficiency in learning. Bacon et al. [17] use an intrinsic reward function to learn LL.
In contrast, by training our policy with gradient-free policy search, we are able to train the whole hierarchical architecture from the main task reward alone. We thus avoid imposing any external priors on LL behavior through intrinsic or auxiliary rewards. Some methods learn a finite set of options [17, 18, 19]. In our solution, we adopt the hierarchical policy structure proposed in [1], which uses a vector space for modulating LL, allowing it to learn a continuum of skills. This is required to solve agile locomotion tasks.

Legged Locomotion. Reinforcement learning (RL) has been successfully applied to the problem of learning basic locomotion skills [20, 21, 22]. Complex locomotion tasks have also been addressed using RL [9, 23, 7, 24, 25]. Often, as complexity grows, only a part of the pipeline uses RL, and domain knowledge is used to constrain the problem, usually through a hand-crafted hierarchical solution [9, 10]. This is especially true for locomotion on real robots [22]. Peng et al. [9] use HRL to learn locomotion in physics-based character animation; the two levels are learned separately by means of RL. Li et al. [10] learn locomotion on the hexapod robot Daisy; in their solution, HL performs model-based planning to select one of a few pre-trained LL skills. In this work, we solve vision-based legged locomotion without designing or pre-training LL.

Perception in Robotics. Solving robotics tasks from vision input is an important and well-researched topic [26, 27, 28, 29]. Our work focuses on learning legged locomotion and the necessary navigation skills from vision. Prior work has considered HL kinematic or wheeled navigation from vision [29, 30, 31]. Our method learns LL legged locomotion directly from vision input, which involves processing vision to obtain navigation directives and learning dynamic legged locomotion skills to execute those directives. Some solutions for vision-based legged locomotion have been proposed that use domain-specific, hand-designed pipelines [32, 33].

Evolutionary Strategies in Robotics. Many RL algorithms for continuous control are available for training policies to solve robotics tasks; policy gradient and actor-critic methods are especially popular. However, given the architecture of our hierarchical policy, training with derivative-free approaches is the reasonable choice. Recently, evolutionary methods [2] have been successfully applied in robotics [21, 34, 35, 36]. We use an evolutionary algorithm called Augmented Random Search (ARS) [37] to optimize our neural network policies.

Figure 1: Hierarchical policy. The high level (HL) is a CNN with parameters θ_HL. The HL receives depth camera observations o_HL and outputs a latent command vector l and a duration d. Optionally, task-specific inputs can be fed into the HL's fully connected output layer. The low level (LL) is a linear network with parameters θ_LL. It computes motor actuation commands a_res and trajectory generator parameters p_TG based on l, the trajectory generator state s_TG, and low-level observations o_LL (IMU sensor values and motor angles). a_res is added to the motor commands a_TG from the trajectory generator and applied to the robot motors. The HL is only evaluated every d steps. In our experiments, the low level runs at the environment simulator's frequency of 500 Hz, while the high-level policy runs at 1.5 Hz - 10 Hz. The HL and LL networks are trained concurrently by an evolutionary algorithm.
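To make the data flow of Fig. 1 concrete, the sketch below runs one episode under a hierarchical policy. It is a minimal illustration rather than our implementation: the env and tg interfaces, the observation dimensions, and the stand-in linear maps used in place of the HL CNN and LL network are assumptions.

```python
import numpy as np

# Assumed dimensions for illustration: 16x16 depth image (flattened), 2D latent
# command, 12 motors, IMU (4) + motor angles (12) as LL observations.
HL_OBS_DIM, LATENT_DIM, LL_OBS_DIM, TG_STATE_DIM, NUM_MOTORS = 256, 2, 16, 2, 12

def hl_policy(o_hl, theta_hl):
    """Stand-in for the HL CNN: maps a depth observation to a latent command l
    and the number of LL steps d until the HL is evaluated again."""
    out = np.tanh(theta_hl @ o_hl)                      # outputs in [-1, 1]
    l = out[:LATENT_DIM]
    d = int(50 + (out[LATENT_DIM] + 1.0) / 2.0 * 250)   # scaled to 50..300 steps
    return l, d

def ll_policy(l, o_ll, s_tg, theta_ll):
    """Stand-in for the linear LL: maps (l, proprioception, TG state) to motor
    residuals a_res and trajectory generator parameters p_tg."""
    x = np.concatenate([l, o_ll, s_tg])
    out = theta_ll @ x
    return out[:NUM_MOTORS], out[NUM_MOTORS:]

def run_episode(env, tg, theta_hl, theta_ll, max_steps=6000):
    """env and tg are assumed interfaces: env.camera_image(), env.proprioception(),
    env.step(action) -> (reward, done); tg.state(), tg.step(p_tg) -> a_tg."""
    total_reward, steps_until_hl = 0.0, 0
    l = np.zeros(LATENT_DIM)
    for _ in range(max_steps):
        if steps_until_hl == 0:                  # HL runs at its own, variable rate
            l, steps_until_hl = hl_policy(env.camera_image(), theta_hl)
        a_res, p_tg = ll_policy(l, env.proprioception(), tg.state(), theta_ll)
        a_tg = tg.step(p_tg)                     # cyclic leg targets from the TG
        reward, done = env.step(a_res + a_tg)    # LL residuals added to TG output
        total_reward += reward
        steps_until_hl -= 1
        if done:
            break
    return total_reward
```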
The hierarchical policy structure introduced in [1], which we use for our solution, is illustrated in Fig. 1. The HL is a convolutional neural network (CNN), while the LL is a linear fully-connected network. The policy interacts with the robot, which is controlled by combining the output of a trajectory generator (TG) with values computed by the LL's fully connected layer. A TG is a parameterized function that computes cyclic leg positions. The LL network continuously modulates the TG's phase and amplitude and adjusts the leg trajectories with residuals as needed. More details about the use of TGs for learning locomotion can be found in [21].

Algorithm 1 shows how an episode is executed using a hierarchical policy in which the HL network (f_θ_HL) and LL network (f_θ_LL) have weights θ_HL and θ_LL respectively. The HL receives task-specific exteroceptive observations (o_HL), such as the vision input in our tasks, and issues commands to LL as a latent vector (l). HL also decides the duration (d) until its next execution. Note that the HL can optionally receive additional task-specific inputs (e.g. relative position to the goal). The LL receives proprioceptive observations (o_LL) that include IMU readings (roll, pitch, roll rate and pitch rate) and motor angles. LL also processes the current latent command (l) and the current TG state s_TG. The LL outputs TG parameters p_TG and residual motor actuation commands a_res, which are added to the TG output a_TG and executed on the hardware (r ← ExecuteAction(a_res + a_TG)). The environment returns the task reward (r) for the robot's action. The HL is invoked again after duration d and the process repeats, accumulating the episode return R.

A reinforcement learning problem can be modeled as a Markov Decision Process (MDP) with state space S, action space A, a state transition function P(s_{t+1} | s_t, a_t) and a reward function r(s_t, a_t). A policy π_Θ(s), parameterized by a weight vector Θ, maps states s to actions a. For a hierarchical policy, Θ is the collection of parameters from all levels (Θ = {θ_HL, θ_LL}). The policy interacts with the MDP for an episode of T timesteps at a time. To jointly learn the parameters θ_HL and θ_LL of the two levels, we maximize the expected total reward (return) at the end of an episode. We use an evolutionary algorithm called Augmented Random Search (ARS) [37] to maximize the return. The algorithm iteratively estimates the gradient of the return with respect to the policy parameters and performs gradient ascent.

During training, LL skills automatically emerge and are invoked by HL through latent commands (l) to solve a task. A trained LL can also be transferred to new tasks in unseen environments. This allows sharing of primitive skills across problems and is faster than learning from scratch on each task. LLs are transferred by keeping θ_LL fixed after training on the original task and reinitializing θ_HL; during the new training, only θ_HL is updated by ARS.
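For reference, a single ARS update over the flattened parameter vector Θ can be sketched as below. This follows the standard basic ARS recipe rather than our exact implementation; the function names and default hyper-parameter values are placeholders (in our experiments these quantities are tuned automatically, as described in the implementation details).

```python
import numpy as np

def ars_step(theta, rollout_return, n_dirs=32, top_b=16, step_size=0.02, noise=0.03, rng=None):
    """One Augmented Random Search update: evaluate antithetic perturbations of the
    flat parameter vector theta and step along the best-performing directions.
    rollout_return(params) is assumed to run episodes and return their average return."""
    if rng is None:
        rng = np.random.default_rng()
    deltas = rng.standard_normal((n_dirs, theta.size))
    r_plus = np.array([rollout_return(theta + noise * d) for d in deltas])
    r_minus = np.array([rollout_return(theta - noise * d) for d in deltas])
    # Keep the top_b directions ranked by their best-case return (ARS "top directions" rule).
    order = np.argsort(np.maximum(r_plus, r_minus))[::-1][:top_b]
    r_p, r_m, d_top = r_plus[order], r_minus[order], deltas[order]
    sigma_r = np.concatenate([r_p, r_m]).std() + 1e-8     # reward standard deviation
    grad_estimate = (r_p - r_m) @ d_top / top_b           # finite-difference direction
    return theta + step_size / sigma_r * grad_estimate
```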
We use the Laikago quadruped robot from Unitree. This robot is 60 cm tall, has 12 degrees of freedom (3 per leg) and weighs about 22 kg. The swing and extension of each leg are controlled by a PD position controller provided with the robot. We train our policies in simulation using PyBullet [3, 38]. Our tasks are set up in two environments: a curved cliff and a maze.

Curved Cliff Environment. In this environment, the robot starts from the origin with a curved cliff ahead of it. The robot observes the environment through a first-person depth-camera view, angled down slightly to see the cliff. Fig. 2a shows a still from this environment and a sample camera input. The robot's task is to progress forward as fast as possible. To accomplish this, it needs to learn to steer so that it follows the curves of the cliff and avoids falling off the edge. The shape of the cliff curve is randomized for each episode. The reward at each control step is the robot's velocity along the x direction, capped at v_cap.

Maze Environment. For the maze environment, the robot is placed in the middle of a walled 13 × 13 m^2 arena uniformly filled with pillars. An episode starts with a randomly oriented robot observing the world through a depth camera looking straight ahead. This environment and a sample camera image are shown in Fig. 2b. We set up two tasks in this environment: maze traversal and goal finding.

For the maze traversal task, the robot needs to keep moving further away from its starting point (the origin). The optimal behavior consists of stable forward walking and steering to avoid colliding with pillars and boundary walls. For this task, the robot also observes its position and orientation relative to the origin, x. The reward for this task is based on the robot's capped (v_cap) progress away from the origin.

In the goal finding task, the robot needs to reach a goal randomly placed in one of the 4 corners of the maze for each episode (see Fig. 2c). Along with the camera input, it observes its position and orientation relative to the goal. To find the goal it needs to learn to align itself with the goal direction in addition to all the skills required for maze traversal. The reward is a weighted average, based on ω, of the maze traversal reward term r_mt and a term for progress towards the goal position g, r_gf. The variable ω corresponds to the fraction of the distance travelled relative to the total distance from the goal to the origin. The reward is therefore dominated by r_mt when the robot is close to the origin and becomes increasingly determined by r_gf as the robot approaches the goal. This reward function encourages the robot to learn locomotion skills in the early stages of training; without r_mt, the robot would not experience any positive reinforcement for stable walking unless it happened to walk in the goal direction.

In all tasks, the episode terminates if the robot loses its balance, falls off the cliff, collides with a pillar or boundary wall, or reaches 6000 LL control time steps (12 s). Additionally, in the goal finding task, the episode terminates when the robot comes within 0.5 m of the goal position. We set v_cap to 0.002 m per control time step in all experiments, which corresponds to 1 m/s at the 500 Hz control rate.
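One concrete way to write the rewards described above, with x_t the robot's forward (x) coordinate at control step t, \mathbf{x}_t its planar base position, and \mathbf{g} the goal position, is given below. These expressions are our own reading of the descriptions, not necessarily the exact definitions used in the experiments.

```latex
\begin{aligned}
r^{\mathrm{cliff}}_t &= \min\!\big(x_t - x_{t-1},\, v_{\mathrm{cap}}\big) \\
r^{\mathrm{mt}}_t    &= \min\!\big(\lVert \mathbf{x}_t \rVert - \lVert \mathbf{x}_{t-1} \rVert,\, v_{\mathrm{cap}}\big) \\
r^{\mathrm{gf}}_t    &= (1 - \omega)\, r^{\mathrm{mt}}_t
  + \omega\, \min\!\big(\lVert \mathbf{x}_{t-1} - \mathbf{g} \rVert - \lVert \mathbf{x}_t - \mathbf{g} \rVert,\, v_{\mathrm{cap}}\big),
\qquad \omega = 1 - \frac{\lVert \mathbf{x}_t - \mathbf{g} \rVert}{\lVert \mathbf{g} \rVert}
\end{aligned}
```

Under this reading, each term is a per-step displacement capped at v_cap = 0.002 m, which is why v_cap is quoted in meters per control step.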
Solution Implementation details. The high level contains a CNN that receives a 16 × 16 × 1 depth camera input. It has 3 convolutional layers of 3 × 3 filters with output channels 4, 8, and 8, followed by a pooling layer with a 2 × 2 filter applied with a stride of 2. The output of the pooling layer is flattened and transformed into a 10D feature vector by a fully-connected layer with tanh activation. If present, the task-specific HL inputs (relative position in the maze environment) are concatenated with this feature vector. The result is fed into a final fully-connected layer to produce an output clipped between −1 and 1. For most of our experiments we use a 3D output, with the first dimension corresponding to the HL duration (d) and the rest to the latent command (l). The duration is computed by linearly scaling the output to a value between 50 and 300 time steps (≈ 1.5 Hz - 10 Hz). The latent command, concatenated with the IMU readings, motor angles, and trajectory generator (TG) state, is fed to the LL linear fully-connected network, which outputs the residual motor commands and the PMTG parameters. The HL network has around 3000 parameters and the LL around 300. For comparison, we also train a non-hierarchical CNN with the same convolutional and pooling layers as above; its feature vector is concatenated with all other sensor observations and fed through 2 fully connected layers (hidden layer size 10) before producing the actions. The LL builds on the Policies Modulating Trajectory Generators (PMTG) architecture, which has shown success at learning diverse primitive behaviors for quadruped robots [21]. As mentioned above, LL observes the PMTG state (s_TG), which specifies the position along a periodic leg trajectory, and updates the PMTG parameters at every time step.

We train the policies using a distributed ARS implementation. For each optimization iteration, we evaluate policy perturbations on 64 parallel workers. Since all our tasks are randomized, we average the return over 3 environment episodes to evaluate each perturbation. The number of perturbations evaluated, the gradient step size, the number of top perturbations used for gradient estimation, and the standard deviation for generating new perturbations are all determined by hyper-parameter tuning using a Gaussian process bandits approach [39].

Our hierarchical policies are able to complete all 3 visual navigation tasks described in Sec. 4 by learning locomotion directly from vision input. Fig. 3 shows the trajectories of the trained robot in simulation for the 3 tasks. Dot markers along the trajectories show the points at which HL becomes active and computes the next latent command (l) and duration (d). Notice that when solving the curved cliff task (Fig. 3a), the HL takes decisions more frequently (small d) where sharp turns are needed to avoid falling off the cliff; in straighter regions, HL executions are sparser (large d). For the goal finding task, HL takes sparser decisions when the goal is close (Fig. 3c). The robot efficiently turns in place to face the goal using dynamic leg movements that are difficult to hand-design and tune.

We compare the learning curves of our hierarchical policies described in Sec. 3 with non-hierarchical CNN policies on these 3 tasks (Fig. 4). The plot lines show the average return across ≈ 450 environment episodes; the shaded region denotes the standard deviation. The return is plotted against the total number of training episodes. In all 3 cases our method largely outperforms the baseline, completing each task by the end of training. Although most of our experiments use policies with approximately 10^3 weights (see Sec. 4 for details), we show that larger CNNs with over 10^5 weights can also be trained using ARS. Fig. 5 shows the corresponding learning curves for the maze traversal and goal finding tasks; here the image resolution is 32 × 32 and we use 2 additional 32D fully connected layers at the end of the CNN.
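As a concrete (assumed) instantiation of the HL network described in the implementation details above, the sketch below builds the base 16 × 16 variant; the larger 32 × 32 variant just mentioned adds two extra 32-unit fully connected layers. Padding, the activations between convolutional layers, and the exact layer ordering are assumptions; with valid (unpadded) convolutions, the parameter count comes out near the roughly 3000 quoted above.

```python
import torch
import torch.nn as nn

class HighLevelNet(nn.Module):
    """Sketch of the HL CNN: depth image -> 10D feature -> (latent command l, duration d)."""
    def __init__(self, task_input_dim=0, latent_dim=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=3),          # 16x16 -> 14x14 (assuming no padding)
            nn.ReLU(),
            nn.Conv2d(4, 8, kernel_size=3),          # 14x14 -> 12x12
            nn.ReLU(),
            nn.Conv2d(8, 8, kernel_size=3),          # 12x12 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 10x10 -> 5x5
            nn.Flatten(),                            # 5 * 5 * 8 = 200
            nn.Linear(200, 10),
            nn.Tanh(),
        )
        # One output for the duration d plus the latent command l, all clipped to [-1, 1].
        self.head = nn.Linear(10 + task_input_dim, 1 + latent_dim)

    def forward(self, depth_image, task_input=None):
        z = self.features(depth_image)               # expects shape (batch, 1, 16, 16)
        if task_input is not None:                   # optional task-specific HL inputs
            z = torch.cat([z, task_input], dim=-1)
        out = torch.clamp(self.head(z), -1.0, 1.0)
        duration = 50 + (out[..., 0] + 1.0) / 2.0 * 250   # scaled to 50..300 LL steps
        return out[..., 1:], duration                # latent command l, duration d

net = HighLevelNet()
print(sum(p.numel() for p in net.parameters()))      # ~3e3 parameters with these assumptions
```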
Transferring Low Level Policies between Tasks. LLs from learned policies can be reused for new tasks and environments, as shown in Fig. 6. We first trained a policy with 2D latent commands on the goal finding task. The LL only has access to proprioceptive sensors, which forces it to learn generic steering and turning-in-place primitives. We then reuse these skills to train new policies for the maze traversal and curved cliff tasks on top of the goal finding LL: the LL weights are initialized from the pre-trained policy and frozen while the new HL policy is trained. In both cases we observe improved learning efficiency. As can be expected, LL policies trained on the simpler and more restricted cliff task did not yield good performance when transferred to the maze traversal or goal finding tasks. In future work, we plan to explore fine-tuning transferred LL weights so that they can adapt to new tasks when the previously learned skills do not suffice.

Analysing the 2D Latent Command Space. To better understand the behavior of the learned policy, we visualize a 2D latent command space for the goal finding task (Fig. 7). For a learned hierarchical policy, we evaluated the low level with artificial latent command inputs taken from a uniform grid over the whole command space. The resulting behaviors are shown on the right of the figure as robot trajectories of 1 s in the XY plane. On the left we show, in corresponding colors, the latent command space points from which these behaviors emerged. For each point, we generate a vector summarizing the LL trajectory: the vector's direction shows the overall movement direction of the trajectory and its length is proportional to the distance covered. This visualization shows how the latent space is used to smoothly transition between automatically emerged steering behaviors of varying velocities.

Influence of Latent Command Dimension. Fig. 8 compares hierarchical policies with 1-, 2-, 4-, and 8-dimensional latent command spaces (LCS), trained on the curved cliff task. The 1D LCS is clearly too restrictive for this task and the policy cannot reach optimal performance, while the 2, 4, and 8D LCS perform similarly. It is promising that a 2D LCS reaches optimal performance, since a low-dimensional LCS has many benefits: it can be easily visualized and interpreted, it is easy to control and hence amenable to transfer, and it reduces the network size, which makes training easier.

To study the effect of temporal abstraction, we compare policies with different HL execution durations, trained on the goal finding task (Fig. 9). On the left, we show the learning curves for policies with the HL running once every 1, 50, 150, 300, and d time steps, where d is the variable interval output by the HL. All variants are able to learn the task with minor differences in performance; however, the variant with the HL running at every time step makes comparatively slow progress. This difference in training speed is captured in Column 4 of the table on the right. Running the HL at every step is clearly inefficient, given that the temporally abstracted policies achieve the optimal return much faster. The exact inference times are recorded in Column 2: inference with temporally abstracted policies is ≈ 100 times faster, which will also facilitate deployment on hardware. Column 3 reports the effective size of the policies over time, accounting for the variation in HL frequency.
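The latent-space visualization described above can be reproduced with a sweep of the kind sketched below. The rollout_ll helper, which runs the trained LL with a fixed, artificially chosen latent command and returns the base XY trajectory, and the grid resolution are assumptions for illustration; the summary arrow here uses net displacement, whereas the figure's arrow length may instead be based on path length.

```python
import numpy as np

def summarize_latent_space(rollout_ll, grid_size=9, duration_s=1.0):
    """Evaluate the trained LL on a uniform grid of 2D latent commands and
    summarize each resulting trajectory by its displacement vector."""
    summaries = []
    for l0 in np.linspace(-1.0, 1.0, grid_size):
        for l1 in np.linspace(-1.0, 1.0, grid_size):
            xy = rollout_ll(np.array([l0, l1]), duration_s)   # (T, 2) base trajectory
            displacement = xy[-1] - xy[0]                     # direction and magnitude of motion
            summaries.append(((l0, l1), displacement))        # one arrow per grid point
    return summaries
```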
We presented an HRL technique that solves visual navigation for a quadruped, from pixels to leg motions. Our method outperformed non-hierarchical baselines on 3 navigation tasks and achieved higher data efficiency and a lower wall-clock training time. However, the advantages of our approach extend beyond performance improvements. First, by decoupling the high level from the low level, we were able to run them at different frequencies; indeed, the high level learns when it needs to process a new image. This has practical value, since processing vision input synchronously with the low-level control loop is often impractical. Second, we analyzed low-level policies and demonstrated that they can be transferred between tasks. This is important because it is non-trivial to define skills, encoded by the latent command space, that are both robust and exploit the full range of the robot's capabilities. Transfer of low-level policies makes it possible to use the learned skills as a continuous low-level action space for other learning algorithms, a research direction we intend to pursue in future work.

Note on Hardware Evaluation. Due to COVID-19 restrictions, we were unable to include hardware results. However, we plan to deploy our hierarchical policies on a real Laikago robot once we regain access to our lab spaces. We have previously validated learned low-level policies similar to those found by our algorithm on multiple real robots, and we are therefore confident in the transfer of our HRL policies to hardware. We have also already tested our vision stack in combination with a predefined set of low-level skills on a legged robot. Finally, we found that the high level network computes similar latent commands when presented with real depth camera images and with simulated ones in an environment with obstacles. More details can be found in Appendix A.

A Feasibility of Deploying Our Learned Policies on the Robot

Figure 10: Real depth camera images processed by the high level.

We trained a hierarchical policy on the goal finding task in simulation and evaluated the learned HL on images from a real depth camera (Intel RealSense L515). We compared the downsampled (16 × 16) real-world camera images with similar-looking simulated camera images from our experiments. Both types of images result in similar latent commands, supporting the compatibility of the HL with real depth camera images (Fig. 10).

Figure 11: frames at 0.5 s, 1.5 s, 3.5 s, and 4.5 s.

We have previously deployed a forward walking policy (trained in simulation) that tracks a desired velocity on the real Laikago robot (Fig. 11). The hierarchical policies presented in our experiments are trained with a similar infrastructure; hence, we expect that our policies will transfer to hardware. Finally, we trained our policies in simulated 3D spaces with realistic visuals from the Gibson dataset [40]. After training, our policies were able to transfer to a new space (Fig. 12).
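The appendix check above can be sketched as follows: a real depth frame is reduced to the 16 × 16 HL input resolution and the resulting latent command is compared with the one obtained from a similar-looking simulated frame. The simple crop-and-block-average downsampling, the hl_forward helper, and the distance metric are illustrative assumptions, not our actual preprocessing pipeline.

```python
import numpy as np

def downsample_to_16x16(depth):
    """Crop to a multiple of 16 and block-average down to the HL input resolution."""
    h, w = depth.shape
    s = min(h, w) // 16
    crop = depth[: 16 * s, : 16 * s]
    return crop.reshape(16, s, 16, s).mean(axis=(1, 3))

def latent_command_distance(hl_forward, real_depth, sim_depth):
    """hl_forward(image) is assumed to return (latent_command, duration)."""
    l_real, _ = hl_forward(downsample_to_16x16(real_depth))
    l_sim, _ = hl_forward(downsample_to_16x16(sim_depth))
    return np.linalg.norm(l_real - l_sim)   # small distance -> similar latent commands
```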
References

[1] Hierarchical reinforcement learning for quadruped locomotion. IROS.
[2] Evolution strategies as a scalable alternative to reinforcement learning.
[4] Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.
[5] Hierarchical reinforcement learning with the MAXQ value function decomposition.
[6] Reinforcement learning with hierarchies of machines.
[7] Hierarchical visuomotor control of humanoids.
[8] MCP: Learning composable hierarchical control with multiplicative compositional policies. arXiv:1905.09808.
[9] DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning.
[10] Learning generalizable locomotion skills with hierarchical reinforcement learning.
[11] Hierarchical reinforcement learning with hindsight.
[12] Data-efficient hierarchical reinforcement learning.
[13] Near-optimal representation learning for hierarchical reinforcement learning.
[14] Learning an embedding space for transferable robot skills.
[15] Stochastic neural networks for hierarchical reinforcement learning.
[16] Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings.
[17] The option-critic architecture.
[18] Multi-level discovery of deep options.
[19] Learning sequential motor tasks.
[20] Proximal policy optimization algorithms.
[21] Policies modulating trajectory generators.
[22] Learning agile and dynamic motor skills for legged robots.
[23] Terrain-adaptive locomotion skills using deep reinforcement learning.
[24] Learning and transfer of modulated locomotor controllers.
[25] Iterative reinforcement learning based design of dynamic locomotion skills for Cassie.
[26] QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation.
[27] Collective robot reinforcement learning with distributed asynchronous guided policy search.
[28] End-to-end training of deep visuomotor policies.
[29] Zero-shot imitation learning from demonstrations for legged robot visual navigation.
[30] HRL4IN: Hierarchical reinforcement learning for interactive navigation with mobile manipulators.
[31] Indoor navigation of a wheeled mobile robot along visual routes.
[32] Vision enhanced reactive locomotion control for trotting on rough terrain.
[33] Fast and continuous foothold adaptation for dynamic locomotion through CNNs.
[34] Robotic table tennis with model-free reinforcement learning.
[35] Provably robust blackbox optimization for reinforcement learning.
[36] Reinforcement learning with chromatic networks.
[37] Simple random search of static linear policies is competitive for reinforcement learning.
[38] Sim-to-real: Learning agile locomotion for quadruped robots.
[39] Google Vizier: A service for black-box optimization.
[40] Gibson Env: Real-world perception for embodied agents.