Learning a Contact-Adaptive Controller for Robust, Efficient Legged Locomotion
Xingye Da, Zhaoming Xie, David Hoeller, Byron Boots, Animashree Anandkumar, Yuke Zhu, Buck Babich, Animesh Garg
2020-09-21

We present a hierarchical framework that combines model-based control and reinforcement learning (RL) to synthesize robust controllers for a quadruped (the Unitree Laikago). The system consists of a high-level controller that learns to choose from a set of primitives in response to changes in the environment and a low-level controller that utilizes an established control method to robustly execute the primitives. Our framework learns a controller that can adapt to challenging environmental changes on the fly, including novel scenarios not seen during training. The learned controller is up to 85 percent more energy efficient and more robust than baseline methods. We also deploy the controller on a physical robot without any randomization or adaptation scheme.

Quadruped locomotion is often characterized in terms of gaits (walking, trotting, galloping, bounding, etc.) that have been well studied in animals [1] and reproduced on robots [2, 3]. A gait is a periodic contact sequence that defines a specific contact timing for each foot. Controllers designed for these gaits have demonstrated robust behaviors on flat ground and rough terrain. However, it is far rarer to find controllers that can change gaits or contact sequences to adapt to environmental changes. An adaptive gait can reduce energy usage by removing unnecessary movement, as suggested in horse studies [1]. It is also required for more challenging scenarios such as riding a skateboard or recovering from a leg slip, as shown in Figure 1 (a, b).

In most model-based and learning-based control designs, the contact sequence is fixed or predefined [2, 4, 5, 6, 3, 7, 8]. Dynamic adaptation of the contact sequence is challenging because of the hybrid nature of legged locomotion dynamics as well as the inherent instability of such systems. While it is possible to generate adaptive contact schemes via trajectory optimization [9, 10, 11], such approaches are generally too compute-intensive for real-time use.

Here we present a hierarchical control framework for quadrupedal locomotion that learns to adaptively change contact sequences in real time. A high-level controller is trained with reinforcement learning (RL) to specify the contact configuration of the feet, which is then taken as input by a low-level controller that generates ground reaction forces via quadratic programming (QP). At inference time, the high-level controller need only evaluate a small multi-layer neural network, avoiding the expensive model predictive control (MPC) strategy that might otherwise be required to optimize for long-term performance. The low-level controller provides high-bandwidth feedback to track base and foot positions and also helps ensure that learning is sample-efficient. The framework produces a controller that is up to 85 percent more energy efficient and also more robust than baseline approaches.

We train our controller with a simulated Unitree Laikago [12] on a split-belt treadmill, as shown in Figure 1 (c). The two belts can adjust speed independently, and we change the robot orientation to increase variation.
In addition to comparing energy use and robustness to the baselines, we also demonstrate zero-shot transferability by testing the controller in novel situations, such as one where a foot encounters a slippery surface (e.g., with zero friction), which we call the "banana peel" test.

Figure 1: (a) Riding a skateboard requires a contact sequence that only moves the feet on the ground while keeping the feet on the board still. (b) "Banana peel" test: we put a frictionless mat under a foot to test robustness. (c) We train and test the robot on a split-belt treadmill where the speeds of the two belts are changed separately, with the robot facing different directions.

Furthermore, we show that the controller learns to generate novel contact sequences that have not previously been shown in either model-based or learning-based approaches. Finally, we deploy the controller on a physical robot to demonstrate sim-to-real transfer, which succeeds without any randomization or adaptation scheme due to the robustness of the low-level controller.

Our main contributions are as follows:
1. We introduce a hierarchical control structure that combines model-based control design and model-free reinforcement learning for legged locomotion.
2. We demonstrate that our framework allows sample-efficient learning, zero-shot adaptation to novel scenarios, and direct sim-to-real transfer without randomization or adaptation schemes.
3. Our framework learns adaptive contact sequences that have not been demonstrated by either model-based or learning-based methods for real-time control. This is evidenced by the natural-looking behaviors that minimize unnecessary movement and energy usage in our split-belt treadmill scenarios.

Model-based Legged Locomotion Control Model-based control designs [2, 4, 5, 6] use trajectory optimization and model predictive control methods that optimize performance over a finite horizon, where the input includes a predefined contact sequence. Although one can change the contact sequence externally to demonstrate various gaits, such controllers cannot adapt it to changes in the environment. Contact-implicit optimization [9, 10, 11] requires solving non-convex, stiff problems and is not amenable to real-time use. The work in [13] introduced the Feasible Impulse Set, which allows online gait adaptation for planar models, but it has not been demonstrated in 3D. Recently there has been significant work investigating the use of reinforcement learning to obtain locomotion policies for legged robots [7, 3, 14, 8]. However, the resulting policies are often either less robust than controllers obtained from model-based methods (e.g., [7, 3]) or require the design of complicated reward functions and a large number of training samples, e.g., [8]. A learned policy is often brittle and can fail under mild environmental changes. Approaches such as meta-learning [15] or Bayesian optimization in behavior space [16] have been proposed to adaptively update the policies, but they usually require the policy to interact with the target environment to collect additional data. In contrast, our method can adapt to a changing environment without any online data collection.

Hierarchical Control Hierarchical frameworks can greatly improve learning efficiency, as shown in much prior work, e.g., [17]. In robotics tasks, it is beneficial to decompose a controller into modules and obtain controllers in a hierarchical manner.
A low-level controller can be model-based [18, 19, 20] or learned [21, 22] such that it can achieve subgoals specified by a learned high-level controller. Specifically for locomotion tasks, it is natural to decompose a controller into direction-following and navigation modules [22, 21, 23, 24, 25]. We also adopt a hierarchical structure; however, our work is distinguished from previous work in that the goal of the high-level controller is to choose a low-level primitive in order to adapt to environmental changes rather than to specify subgoals. This does not preclude other high-level policies or behaviors; e.g., one could easily add a navigation module to our framework.

Figure 2: System overview. A high-level RL policy selects among primitives; a model-based low-level controller (PD control and a QP solver) executes them.

We propose a hierarchical framework for locomotion that combines model-based control with reinforcement learning. The system is visualized in Figure 2. The state of the robot $s = (q, \dot q, p_{\mathrm{foot}})$ consists of the base pose $q \in \mathbb{R}^3 \times S^3$, containing the position $p_{\mathrm{body}}$ and orientation $\Theta$ of the robot's body, the velocity vector $\dot q \in \mathbb{R}^6$, and the four Cartesian foot positions $p_{\mathrm{foot}} = (p_1, p_2, p_3, p_4) \in \mathbb{R}^{12}$ relative to the base. A primitive $P \in \{0\colon \text{Stance},\, 1\colon \text{Swing}\}^4$ is a Boolean array that specifies the stance/swing state of the four feet. The high-level controller chooses the appropriate primitive; the low-level controller uses the stance feet to generate ground reaction forces by solving a QP for base pose control and moves the swing feet with a foot-placement algorithm. The low-level controller runs at 500 Hz for high-speed feedback, and the high-level controller runs at 2.5 Hz to match the primitive execution time.

The high-level controller can be designed manually. If we set all foot states to Stance, the result is a standing gait. If we instead synchronize diagonally opposed feet and switch their state at every timestep, the result is a trotting gait. As we demonstrate, however, a more adaptive high-level controller is needed to reduce energy consumption, reject disturbances, or react to friction changes. In this section, we first define the primitives used in the controller. We then give details on the low-level controller that implements these primitives and on the high-level controller that learns to select from these primitives to complete multiple tasks.

A primitive P represents a contact configuration for the four feet. Each foot is in either a Stance or Swing state, so there are 2^4 = 16 possible primitives in total. We use only the 9 primitives that are commonly used in quadruped locomotion: Stand (all four feet in Stance); Trot1 and Trot2 (one diagonal pair of feet in Swing); Pace1 and Pace2 (one lateral pair of feet in Swing); and Step1, Step2, Step3, and Step4 (a single foot in Swing).

We implement each primitive with a low-level torque controller. We find that a simple model-based method is sufficient to complete most of our tasks and is straightforward to transfer to the real robot.

Base Pose Control The low-level controller receives a primitive $P_t$ from the high-level controller. It also receives the target base pose $q_d$ and velocity $\dot q_d$ from the user command. The controller computes foot forces by solving a QP so that the base pose tracks the target pose while respecting contact constraints. Similar to [2], we approximate the quadruped dynamics with linearized centroidal dynamics,
$$\ddot q = M f + \tilde g, \qquad (1)$$
where $f = (f_1, f_2, f_3, f_4) \in \mathbb{R}^{12}$ is the column vector of Cartesian forces for each foot and $\tilde g = (g, 0_3) \in \mathbb{R}^6$ is the augmented gravity vector. The detailed derivation, including the matrix $M$, is given in Appendix A.
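As an illustration of this linearized model, the numpy sketch below assembles $M$ and the augmented gravity vector $\tilde g$ from the mass, the world-frame inertia matrix, and the four foot positions. It is a minimal sketch under our reading of [2] and Appendix A; the function and variable names (skew, centroidal_matrices, inertia_world, foot_positions) are ours, not the authors', and the sign and frame conventions are assumptions.

```python
import numpy as np

def skew(p):
    """Skew-symmetric matrix [p]_x such that skew(p) @ f == np.cross(p, f)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def centroidal_matrices(mass, inertia_world, foot_positions, g=9.81):
    """Assemble M (6x12) and g_tilde (6,) for the linearized dynamics
    qdd = M @ f + g_tilde, with f the stacked 3D contact forces of the four feet."""
    I_inv = np.linalg.inv(inertia_world)
    linear_blocks = [np.eye(3) / mass for _ in range(4)]            # (1/m) * sum_i f_i
    angular_blocks = [I_inv @ skew(p_i) for p_i in foot_positions]  # I^-1 * (p_i x f_i)
    M = np.block([linear_blocks, angular_blocks])                   # shape (6, 12)
    g_tilde = np.array([0.0, 0.0, -g, 0.0, 0.0, 0.0])               # gravity acts on the linear part only
    return M, g_tilde
```

With the stance feet selected by the active primitive, these two quantities would form the acceleration-tracking term of the QP described next.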
Given a target base pose $q_d$ and velocity $\dot q_d$, we use PD control to compute the target acceleration
$$\ddot q_d = K_p (q_d - q) + K_d (\dot q_d - \dot q), \qquad (2)$$
where $K_p$ and $K_d$ are diagonal gain matrices. This is then used to construct a QP that finds foot forces minimizing the acceleration error while respecting the contact configuration and friction constraints:
$$f^{*} = \arg\min_{f} \; \| M f + \tilde g - \ddot q_d \|_{Q}^{2} + \| f \|_{R}^{2} \quad \text{s.t.} \quad f_i = 0 \ \text{for swing feet}, \quad f_i \in \mathcal{C}(\mu) \ \text{for stance feet}, \qquad (3)$$
where $Q$ and $R$ are diagonal matrices that adjust the weights in the cost function and $\mathcal{C}(\mu)$ is the friction cone with friction coefficient $\mu$.

Swing Foot Control The desired foot position $p_{d,i}$ for foot $i$ is computed by a linear foot-placement heuristic that adjusts the position from the default state $p_{0,i}$. A position controller is then used to compute the swing foot force.

Torque Control The foot forces computed via pose control and swing foot control are converted to motor torques by $\tau = J^{\top} f$, where $J \in \mathbb{R}^{12 \times 12}$ is the Jacobian of the foot positions with respect to the motor states. This is updated at 500 Hz.

Our high-level controller selects primitives based on the current robot state and is queried at 2.5 Hz. Here we describe in detail how we use RL to learn the high-level controller.

State Space and Action Space We model the environment as a partially observable Markov decision process (POMDP). Specifically, the high-level controller takes as input the body pose $q$, excluding the $x, y$ linear positions, and the relative foot positions $p_{\mathrm{foot}}$. To endow the controller with the capability to learn common gaits such as pacing and trotting that alternate between primitives, we also include the previously used primitive as an input. The output of the controller is a 9-dimensional one-hot vector that indicates which primitive will be selected for the low-level controller. Assuming the environment is deterministic, the input provided is enough to determine the next robot state. However, environments are often parameterized by unobserved random variables, causing the transition dynamics to be stochastic with high variance. The goal of the high-level controller is then to choose primitives that adapt to this high-variance environment while optimizing simple objectives such as energy efficiency and stability.

Policy Representation Since the action space is discrete, instead of learning a policy directly, we choose to learn a Q-function that takes the state and action as input and outputs the expected sum of discounted future rewards. At test time, the action that yields the maximum Q value is selected. We use a simple reward function of the form
$$r = 1 - \frac{1}{T} \sum_{t=1}^{T} \left( w_{\tau} \|\tau_t\|^{2} + w_{v} \|\dot p_{\mathrm{body},t} - \dot p_{d,\mathrm{body}}\|^{2} \right),$$
where the constant 1 ensures that the reward is positive; $\tau$, $\dot p_{\mathrm{body}}$, and $\dot p_{d,\mathrm{body}}$ are the control torques, body linear velocity, and desired body linear velocity, respectively; $w_{\tau}$ and $w_{v}$ are weights; and $T$ is the number of simulation timesteps within a primitive cycle.

Training We adopt a DQN-like [26] training procedure and implement a double Q-function [27] and delayed target network updates [28]. In addition, instead of using an epsilon-greedy strategy during training, the probability of applying an action is based on Q value estimates normalized by the softmax operator. More specifically, let $\{Q_1, Q_2, \ldots, Q_9\}$ be the Q value estimates of the different actions at a particular state and let $Q_{\max} = \max(Q_1, Q_2, \ldots, Q_9)$; the probability of an action $i$ being sampled is proportional to $\exp(-\nu\, Q_i / Q_{\max})$, where $\nu$ is a hyperparameter controlling how sensitive the sampling distribution is to the Q values. The pseudocode of the algorithm is given in Appendix B. The Q-function for the high-level policy is implemented as a feedforward neural network with two hidden layers of 64 neurons each and ReLU activations.
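To make the policy representation and the exploration rule concrete, below is a minimal PyTorch sketch of such a Q-network (two hidden layers of 64 ReLU units, one output per primitive) and of the sampling rule exactly as written above. This is an illustrative sketch, not the authors' code: the class and helper names (QNetwork, sample_primitive) and the observation dimension obs_dim, which the caller would form by concatenating the pose, foot positions, and previous-primitive one-hot, are our own assumptions.

```python
import torch
import torch.nn as nn

NUM_PRIMITIVES = 9

class QNetwork(nn.Module):
    """Two-layer MLP: observation -> Q value for each of the 9 primitives."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, NUM_PRIMITIVES),
        )

    def forward(self, obs):
        return self.net(obs)

def sample_primitive(q_values, nu=5.0):
    """Sample a primitive with probability proportional to exp(-nu * Q_i / Q_max),
    following the expression given in the text."""
    q_max = torch.max(q_values)
    probs = torch.softmax(-nu * q_values / q_max, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

At test time one would skip the sampling and simply take the argmax over the 9 outputs, as described above.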
We set the temperature ν = 5 and update the Q-function every 100 samples, with 50 stochastic gradient descent steps and a mini-batch size of 512. The learned policy already performs well with $10^5$ samples, and we collect at most $5 \times 10^5$ samples. Note that this is orders of magnitude fewer samples than used in previous work that employed RL on similar-scale quadrupeds such as the ANYmal or Laikago [8, 3, 29].

We use a GPU-accelerated simulator [30] to train the Laikago robot and to compare different controllers. This simulator has been used for various robotic manipulation sim-to-real tasks [31, 32] and is validated to simulate rigid-body physics with reasonable accuracy. In this section, we describe the experiments we use to validate our framework and to show improvement over baseline methods.

As baselines, we have created five manually designed high-level controllers. They are comparable to those used in typical model-based control approaches to quadrupedal locomotion [4, 2]. Here we briefly describe how they work.

Standing Only the Stand primitive is used in this controller. It is the most energy-efficient controller in the absence of perturbations but also the least robust one when perturbations are present.

Trotting The trotting controller alternates between Trot1 and Trot2 and displays a trotting gait. It is commonly used for quadruped locomotion due to its stability.

Pacing The pacing controller alternates between Pace1 and Pace2. It is another commonly used gait in quadrupedal locomotion but is usually less stable and less energy efficient than trotting.

Walking The walking controller lifts one foot at a time by switching between the four Step primitives in a fixed order.

Heuristic In this controller, we create a heuristic Q-function $\hat Q_i = \hat Q(P_i)$ for each primitive, and the controller executes the primitive with the largest $\hat Q$. We define $\hat Q(P_i)$ as the negative weighted sum of the QP cost $J_{QP,i}$ in Equation 3 for primitive $P_i$ and the corresponding foot position error. This represents a trade-off between the pose control and the swing feet control. When more feet are in the Stance state, $J_{QP,i}$ will be smaller because of the contact constraints. When more feet are in the Swing state, the foot position error will be smaller since only the swing feet can move to the target position. The heuristic-based controller is able to adapt contact behaviors in a few scenarios but is less robust overall.

End-to-end Learned Controller We can also directly learn a control policy end-to-end instead of using the hierarchical framework proposed here. However, a purely learned policy on similar-scale quadrupeds like the ANYmal and Laikago requires a training sample count on the order of $10^8$ to $10^9$, with careful reward shaping [29, 3, 8], while with the hierarchical framework we use a simple reward specification and get good performance at around $10^5$ samples. We train a controller similar to [3], where the robot tracks a reference trotting motion. The resulting controller is not as robust as the manually designed trotting controller and fails in most of our testing scenarios. This is consistent with [3], where it is also found that a purely learned controller is not robust and a sophisticated adaptation scheme is needed to deal with environmental changes. Given these drawbacks, we do not compare further with the end-to-end learned controller.

We use a split-belt treadmill to train the policy so that it learns to choose different primitives to adapt to changing dynamics. For each new rollout, the speed of the treadmill is sampled from [−0.3, 0.3] m/s; see Figure 3 (a).
We also randomly pause one side of the treadmill and command the robot to face different orientations; this is shown in Figure 3 (b) and (c), where the plywood represents the side of the treadmill that is not moving. This provides a rich set of changing dynamics that the policy must learn to adapt to. Note that the policy has no knowledge of the underlying treadmill parameters. The low-level controller is commanded to stay at the origin with a target velocity of 0.

Energy We compare the energy use across the different high-level controllers, shown in Figure 4. The energy is computed as the average of the sum of squared motor torques over ten seconds. First, we compare the energy consumption in the scenario shown in Figure 3 (a), where the treadmill moves parallel to the robot at speeds in the range [0, 0.3] m/s. The standing gait is the most energy efficient, using 76.7%, 83.3%, and 85.7% less energy than the walking, trotting, and pacing gaits, respectively. The downside is that it can only be used at zero speed. The heuristic and learned (RL) controllers start at the same lowest energy level and gradually increase their energy use as the speed increases. As the treadmill speed reaches 0.2 m/s, the heuristic controller quickly fails, while the learned controller's energy usage is comparable to that of the walking controller. The walking controller fails when the treadmill speed exceeds 0.2 m/s, while the learned controller adjusts its primitives such that its energy use matches the trotting controller. The trotting and pacing controllers are able to cover the full speed range but consume more energy due to unnecessary leg movements. We then compare the energy consumption in scenarios similar to Figure 3 (b) and (c).

Contact Sequence The energy efficiency of the learned controller is mostly due to adaptive contact planning. Figure 5 shows the contact patterns of the baseline and learned controllers. Panels (a), (b), and (c) show the fixed patterns of the standing, walking, and trotting controllers; note that they use the same contact sequence for all scenarios. In contrast, the learned controller adapts its contact sequence to different scenarios. When the speed increases, the learned controller transitions from a standing gait to a trotting gait, as shown in (d), (e), and (f). We highlight the contact pattern in (e), where the learned controller uses a combination of Stand, Trot, and Step primitives at a speed of 0.15 m/s. In (g), only one belt is moving and we replace the non-moving belt with plywood for clarity; the learned controller moves only the two right feet and is thus more energy efficient than trotting or pacing. Panels (h) and (i) are two scenarios not seen in training, and the learned controller demonstrates novel contact sequences.

We test the learned controller in novel scenarios that are not present during training to show that the policy can generalize. A purely learned controller usually overfits to the training dynamics and requires collecting additional data in the target environment to adapt [3]. With our hierarchical framework, the learned controller is able to adapt to these scenarios directly.

Bridge Test We test the learned policy in a scenario where the treadmill is placed parallel to the robot while the front legs of the robot are placed on a fixed bridge; see Figure 3 (d). This scenario is not present during training, yet the learned policy adapts, choosing not to move the front legs while adjusting the rear legs based on the movement of the treadmill.
Banana peel stability test We test the robot in another scenario where the treadmill is not moving while a frictionless mat is placed under a foot, represented by a banana peel in Figure 3 (e). The only baseline controller that can recover is the heuristic controller, which adjusts the slipping foot. The learned controller performs similarly to the heuristic controller even though it never sees this situation during training. One can also pass this test with a freeze-joint standing controller, but we emphasize that the high-level contact adjustment can improve robustness without changing the low-level controller.

We validate the learned controller on the physical robot; snapshots of the experiments are shown in Figure 6. Due to the robustness of the low-level controller, we observe that the sim-to-real gap is small compared to other approaches [3], and the controller performs well without tuning.

Walking Forward To emulate the treadmill, we send the low-level controller command speeds so that the body moves forward and the high-level controller needs to choose primitives to stay balanced. At low speed, the high-level controller first adopts the Stand primitive with the body leaning forward; as the robot comes close to falling over, other primitives are used to move the legs forward and regain balance. At high speed, the robot mostly uses the Trot primitives.

We also perturb the legs of the robot by manually pulling them in different directions. The learned controller is able to adopt the corresponding Step primitive to move the perturbed leg back to its nominal position while keeping the unperturbed legs still.

We have presented a hierarchical framework that combines model-based control and reinforcement learning. By leveraging the advantages of both paradigms, we obtain a contact-adaptive controller that is more robust and energy efficient than those employing a fixed contact sequence. The learned controller generates novel contact sequences that are generally not produced by either approach alone, at least in the context of real-time control. We demonstrate our framework with a Laikago quadruped in various challenging scenarios such as walking on a split-belt treadmill with only one side moving or stepping onto a "banana peel." We also validate the controller on the physical robot, finding that sim-to-real transfer is relatively straightforward. We believe this is a promising step toward combining the best features of model-based control and reinforcement learning.

Appendix A

The dynamics are similar to [2] with a few modifications. The general centroidal dynamics are
$$m \ddot p_{\mathrm{body}} = \sum_{i=1}^{4} f_i + m g, \qquad I \dot\omega + \omega \times (I \omega) = \sum_{i=1}^{4} p_i \times f_i,$$
where $\ddot p_{\mathrm{body}}$ is the base linear acceleration, $f_i$ is the ground reaction force on each foot, and $m$ and $g$ are the mass and gravity vector, respectively. $I$ is the mass inertia matrix, $\dot\omega$ is the derivative of the angular velocity, and $p_i$ is the foot position with respect to the base. All variables are represented in the world frame. We ignore the Coriolis term $\omega \times (I\omega)$ since it does not contribute significantly to the dynamics of the robot.

We use the small-angle assumption to linearize the dynamics. The robot's orientation is expressed as a vector of Z-Y-X Euler angles $\Theta = [\phi\ \theta\ \psi]^{\top}$, where $\phi$ is roll, $\theta$ is pitch, and $\psi$ is yaw. For small values of roll and pitch ($\phi, \theta$), the angular velocity is approximated by
$$\omega \approx R_z(\psi)\, \dot\Theta, \qquad R_z(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix},$$
where $R_z(\psi)$ is the rotation matrix of the yaw angle. The inertia matrix in the world frame can be approximated by
$$\hat I \approx R_z(\psi)\, {}^{B}\!I\, R_z(\psi)^{\top},$$
where ${}^{B}\!I$ is the inertia matrix in the body frame.
The linearized dynamics are
$$\ddot q = M f + \tilde g, \qquad M = \begin{bmatrix} \mathbf{1}_3/m & \mathbf{1}_3/m & \mathbf{1}_3/m & \mathbf{1}_3/m \\ \hat I^{-1} [p_1]_{\times} & \hat I^{-1} [p_2]_{\times} & \hat I^{-1} [p_3]_{\times} & \hat I^{-1} [p_4]_{\times} \end{bmatrix},$$
where $\mathbf{1}_3$ is the $3 \times 3$ identity matrix and $[p_i]_{\times}$ is the skew-symmetric matrix satisfying $[p_i]_{\times} f_i = p_i \times f_i$.

References
[1] Gait and the energetics of locomotion in horses.
[2] Dynamic locomotion in the MIT Cheetah 3 through convex model-predictive control.
[3] Learning agile robotic locomotion skills by imitating animals.
[4] Real-time motion planning of legged robots: A model predictive control approach.
[5] Gait and trajectory optimization for legged systems through phase-based end-effector parameterization.
[6] Crocoddyl: An efficient and versatile framework for multi-contact optimal control.
[7] Sim-to-real: Learning agile locomotion for quadruped robots.
[8] Learning agile and dynamic motor skills for legged robots.
[9] A direct method for trajectory optimization of rigid bodies through contact.
[10] Variational contact-implicit trajectory optimization.
[11] Discovery of complex behaviors through contact-invariant optimization.
[13] Online gait transitions and disturbance recovery for legged robots via the feasible impulse set.
[14] Learning locomotion skills for Cassie: Iterative design and sim-to-real.
[15] Learning fast adaptation with meta strategy optimization.
[16] Robots that can adapt like animals.
[17] Temporal abstraction in reinforcement learning.
[18] Learning hierarchical control for robust in-hand manipulation.
[19] Learning to switch between sensorimotor primitives using multimodal haptic signals.
[20] Terrain-adaptive locomotion skills using deep reinforcement learning.
[21] Multi-agent manipulation via locomotion using hierarchical sim2real.
[22] DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning.
[23] Data-efficient hierarchical reinforcement learning.
[24] Hierarchical reinforcement learning for quadruped locomotion.
[25] DeepGait: Planning and control of quadrupedal gaits using deep reinforcement learning.
[26] Human-level control through deep reinforcement learning.
[27] Deep reinforcement learning with double Q-learning.
[28] Addressing function approximation error in actor-critic methods.
[29] Guided constrained policy optimization for dynamic quadrupedal robot locomotion.
[30] GPU-accelerated robotic simulation for distributed reinforcement learning.
[31] In-hand object pose tracking via contact feedback and GPU-accelerated robotic simulation.
[32] Closing the sim-to-real loop: Adapting simulation randomization with real world experience.

Appendix B

We use a DQN-like algorithm to train our high-level policy. Details are shown in Algorithm 1.

Algorithm 1: Q Learning
  Initialize Q-function parameters θ1, θ2 for Q_θ1, Q_θ2 and an empty replay buffer D;
  Set target network parameters θ_targ,1, θ_targ,2 ← θ1, θ2 for Q_θtarg,1, Q_θtarg,2;
  while not done do
      observe current state s;
      sample action a based on the Q-function (softmax over Q values with temperature ν);
      observe next state s', reward r, and done signal d;
      store (s, a, r, s', d) in D;
      every fixed number of samples do
          for a fixed number of gradient steps do
              sample a mini-batch from D;
              compute targets using the double Q-functions and the target networks;
              update θ1, θ2 by stochastic gradient descent;
          periodically update θ_targ,1, θ_targ,2 ← θ1, θ2;
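As a companion to Algorithm 1, the sketch below shows one way the training loop could be written in Python/PyTorch, reusing the QNetwork and sample_primitive helpers sketched earlier. It is a sketch under stated assumptions, not the released implementation: the clipped double-Q target (minimum over the two target networks) is one common variant of the double Q-function cited above, and the gym-style env interface and hyperparameter names are ours.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def train_high_level(env, q1, q2, q1_targ, q2_targ, gamma=0.99, lr=1e-3,
                     update_every=100, grad_steps=50, batch_size=512,
                     target_update_every=2, max_samples=500_000):
    """DQN-style training with double Q-functions and delayed target updates
    (a sketch of Algorithm 1; env is assumed to follow the gym reset/step API)."""
    opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=lr)
    buffer, num_updates = [], 0
    obs = env.reset()
    for step in range(max_samples):
        # Softmax-style exploration over Q values (see sample_primitive above).
        with torch.no_grad():
            a = sample_primitive(q1(torch.as_tensor(obs, dtype=torch.float32)))
        next_obs, r, done, _ = env.step(a)
        buffer.append((obs, a, r, next_obs, float(done)))
        obs = env.reset() if done else next_obs

        if (step + 1) % update_every == 0 and len(buffer) >= batch_size:
            for _ in range(grad_steps):
                batch = random.sample(buffer, batch_size)
                s, a_b, r_b, s2, d = (torch.as_tensor(np.asarray(x), dtype=torch.float32)
                                      for x in zip(*batch))
                a_idx = a_b.long().unsqueeze(1)
                with torch.no_grad():
                    # Clipped double-Q target: minimum of the two target networks.
                    q_next = torch.min(q1_targ(s2), q2_targ(s2)).max(dim=1).values
                    target = r_b + gamma * (1.0 - d) * q_next
                loss = (F.mse_loss(q1(s).gather(1, a_idx).squeeze(1), target) +
                        F.mse_loss(q2(s).gather(1, a_idx).squeeze(1), target))
                opt.zero_grad()
                loss.backward()
                opt.step()
            num_updates += 1
            if num_updates % target_update_every == 0:  # delayed target network update
                q1_targ.load_state_dict(q1.state_dict())
                q2_targ.load_state_dict(q2.state_dict())
```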