Bi-directional Domain Adaptation for Sim2Real Transfer of Embodied Navigation Agents
Joanne Truong, Sonia Chernova, Dhruv Batra
Georgia Institute of Technology ({truong.j, chernova, dbatra}@gatech.edu); Facebook AI Research
November 24, 2020

Abstract: Deep reinforcement learning models are notoriously data hungry, yet real-world data is expensive and time-consuming to obtain. The solution that many have turned to is to use simulation for training before deploying the robot in a real environment. Simulation offers the ability to train large numbers of robots in parallel, and offers an abundance of data. However, no simulation is perfect, and robots trained solely in simulation fail to generalize to the real world, resulting in a "sim-vs-real gap". How can we overcome the trade-off between the abundance of less accurate, artificial data from simulators and the scarcity of reliable, real-world data? In this paper, we propose Bi-directional Domain Adaptation (BDA), a novel approach to bridge the sim-vs-real gap in both directions: real2sim to bridge the visual domain gap, and sim2real to bridge the dynamics domain gap. We demonstrate the benefits of BDA on the task of PointGoal Navigation. BDA with only 5k real-world (state, action, next-state) samples matches the performance of a policy fine-tuned with ~600k samples, resulting in a speed-up of ~120x.

Deep reinforcement learning (RL) methods have made tremendous progress in many high-dimensional tasks, such as navigation [23], manipulation [4], and locomotion [9]. Since RL algorithms are data hungry, and training robots in the real world is slow, expensive, and difficult to reproduce, these methods are typically trained in simulation (where gathering experience is scalable, safe, cheap, and reproducible) before being deployed in the real world. However, no simulator perfectly replicates reality. Simulators fail to model many aspects of the robot and the environment (noisy dynamics, sensor noise, wear-and-tear, battery drainage, etc.). In addition, RL algorithms are prone to overfitting: they learn to achieve strong performance in the environments they were trained in, but fail to generalize to novel environments. On the other hand, humans are able to quickly adapt to small changes in their environment. The ability to quickly adapt and transfer skills is a key aspect of intelligence that we hope to reproduce in artificial agents.

This raises a fundamental question: how can we leverage imperfect but useful simulators to train robots while ensuring that the learned skills generalize to reality? This question is studied under the umbrella of 'sim2real transfer' and has been a topic of much interest in the community [5], [8], [11], [16], [22], [25], [26].

In this work, we first reframe the sim2real transfer problem into the following question: given a cheap, abundant, low-fidelity data generator (a simulator) and an expensive, scarce, high-fidelity data source (reality), how should we best leverage the two to maximize the performance of an agent in the expensive domain (reality)? The status quo in machine learning is to pre-train a policy in simulation using large amounts of simulation data (potentially with domain randomization [22]) and then fine-tune this policy on the robot using the small amount of real data. Can we do better?
We contend that the small amount of expensive, high-fidelity data from reality is better utilized to adapt the simulator (and reduce the sim-vs-real gap) than to directly adapt the policy. Concretely, we propose Bi-directional Domain Adaptation (BDA) between simulation and reality to answer this question. BDA reduces the sim-vs-real gap in two different directions (shown in Fig. 1). First, for sensory observations (e.g. an RGB-D camera image I) we train a real2sim observation adaptation module OA : I_real → I_sim. This can be thought of as 'goggles' [26], [24] that the agent puts on at deployment time to make real observations 'look' like the ones seen during training in simulation. At first glance, this choice may appear counterintuitive (or the 'wrong' direction). We choose real2sim observation adaptation instead of sim2real because this decouples sensing and acting. If the sensor characteristics in reality change but the dynamics remain the same (e.g. same robot, different camera), the policy does not need to be retrained, but only equipped with a re-trained observation adaptor. In contrast, changing a sim2real observation adaptor results in the generated observations being out of distribution for the policy, requiring expensive re-training of the policy. Our real2sim observation adaptor is based on CycleGAN [27], and thus does not require any sort of alignment or pairing between sim and real observations, which can be prohibitively expensive to collect.

Second, for transition dynamics T(s_t, a_t, s_{t+1}) = Pr(s_{t+1} | s_t, a_t) (the probability of transitioning from state s_t to s_{t+1} upon taking action a_t), we train a sim2real dynamics adaptation module DA : T_sim → T_real. This can be thought of as a neural-augmented simulator [8] or a specific kind of boosted ensembling method [19], where a simulator first makes predictions about state transitions and then a learned neural network predicts the residual between the simulator predictions and the state transitions observed in reality. At each time t during training in simulation, DA resets the simulator state from s^sim_{t+1} (where the simulator believes the agent should reach at time t+1) to ŝ^real_{t+1} (where DA predicts the agent will reach in reality), thus exposing the policy to trajectories expected in reality. We choose sim2real dynamics adaptation instead of real2sim because this nicely exploits the fundamental asymmetry between the two domains: simulators can (typically) be reset to arbitrary states, reality (typically) cannot. Once an agent acts in the real world, it does not matter what corresponding state it would have reached in the simulator; reality cannot be reset to it.

Fig. 1: (a) Left: We learn a sim2real dynamics adaptation module to predict residual errors between state transitions in simulation and reality. Right: We learn a real2sim observation adaptation module to translate images the robot sees in the real world at test time to images that more closely align with what the robot has seen in simulation during training. (b) Using BDA, we achieve the same SPL as a policy fine-tuned directly in reality while using 117× less real-world data.

Once the two modules are trained, BDA trains a policy in a simulator augmented with the dynamics adaptor (DA) and deploys the policy augmented with the observation adaptor (OA) to reality. This process is illustrated in Fig. 1a, with the left side showing policy training in simulation and the right side showing its deployment in reality.
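To make the simulator-reset mechanism of DA concrete, the following is a minimal sketch, not the paper's implementation: it assumes a simulator that exposes get/set-state accessors, states represented as (x, y, θ) numpy arrays, and a `da_model.predict` interface. All of these names are hypothetical placeholders, not Habitat's API.

```python
# Minimal sketch of training-time dynamics adaptation. All API names are placeholders.

def step_with_dynamics_adaptation(sim, da_model, action):
    """Step the simulator, then overwrite its state with where DA predicts the
    real robot would end up, so the policy is trained on real-like transitions."""
    s_t = sim.get_agent_state()            # e.g. numpy array (x, y, theta) before acting
    sim.step(action)                       # simulator's own transition to s^sim_{t+1}

    # DA estimates the real-world change in pose for this (state, action) pair,
    # i.e. an estimate of delta_s^real = s^real_{t+1} - s^real_t.
    delta_real = da_model.predict(s_t, action)
    s_real_next = s_t + delta_real         # predicted real-world next state, s-hat^real_{t+1}

    # Reset the simulator to the predicted real-world state -- possible in simulation,
    # not in reality -- so the next observation is rendered from that state.
    sim.set_agent_state(s_real_next)
    return sim.get_observations()
```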
We instantiate and demonstrate the benefits of BDA on the task of PointGoal Navigation (PointNav) [3], which involves an agent navigating in a previously unseen environment from a randomized initial starting location to a goal location specified in relative coordinates. For controlled experimentation, and due to COVID-19 restrictions, we use Sim2Sim transfer of PointNav policies as a stand-in for Sim2Real transfer. We conduct experiments in photo-realistic 3D simulation environments using Habitat-Sim [18], which prior work [13] has found to have high sim2real predictivity, meaning that inferences drawn in simulation experiments have a high likelihood of holding in reality on the LoCoBot mobile robot [2].

In our experiments, we find that BDA is significantly more sample-efficient than the baseline of fine-tuning a policy. Specifically, BDA trained on as few as 5,000 samples (state, action, next-state) from reality (the equivalent of 7 hours of data collection in reality) is able to match the performance of the baseline trained on 585,000 samples from reality (the equivalent of 836 hours of data collection in reality, or 3.5 months at 8 working hours per day), a speed-up of 117× (Fig. 1b). While our experiments are conducted on the PointNav task, we believe our findings, and the core idea of Bi-directional Domain Adaptation, are broadly applicable to a number of problems in robotics and reinforcement learning.

We now describe the two key components of Bi-directional Domain Adaptation (BDA) in detail: (1) a real2sim observation adaptation module OA to close the visual domain gap, and (2) a sim2real dynamics adaptation module DA to close the dynamics domain gap.

Preliminaries and Notation. We formulate our problem by representing both the source and target domain as a Markov Decision Process (MDP). An MDP is defined by the tuple (S, A, T, R, γ), where s ∈ S denotes states, a ∈ A denotes actions, T(s, a, s') = Pr(s' | s, a) is the transition probability, R : S × A → R is the reward function, and γ is a discount factor. In RL, the goal is to learn a policy π : S → A to maximize expected reward.

Algorithm 1: Bi-directional Domain Adaptation
1: Train behavior policy π_sim in Sim
2: for t = 0, ..., N steps do
3:   Collect I^sim_t ~ Sim rollout(π_sim)
4:   Collect I^real_t, s^real_t, a^real_t ~ Real rollout(π_sim)
5: OA ← Train observation adaptor on (I^real, I^sim)
6: DA ← Train dynamics adaptor on (s^real, a^real)
7: Sim_DA ← Augment Source with DA
8: for j = 0, ..., K steps do
9:   π_{Sim_DA} ← Finetune π_sim in Sim_DA
10: π_{Sim_OA+DA} ← Apply OA at test time
11: Test π_{Sim_OA+DA} in Real

Observation Adaptation. We consider a real2sim domain adaptation approach to deal with the visual domain gap. We leverage CycleGAN [27], a pixel-level image-to-image translation technique that uses a cycle-consistency loss function with unpaired images. We start by using a behavior policy π_sim trained in simulation to sample rollouts in simulation and reality to collect RGB-D images I_sim and I_real. OA learns a mapping G_sim : I_real → I_sim, an inverse mapping G_real : I_sim → I_real, and adversarial discriminators D_real, D_sim. Although our method focuses on adaptation from real2sim, learning both mappings encourages the generative models to remain cycle-consistent, i.e., forward cycle: I_real → G_sim(I_real) → G_real(G_sim(I_real)) ≈ I_real, and backward cycle: I_sim → G_real(I_sim) → G_sim(G_real(I_sim)) ≈ I_sim. The ability to learn mappings from unpaired images from both domains is important because it is difficult to accurately collect paired images between simulation and reality.
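As a concrete illustration of the cycle-consistency constraint just described, here is a minimal PyTorch-style sketch, not the paper's training code: G_sim and G_real are arbitrary generator modules passed in, and the loss weight of 10 is a common CycleGAN default rather than a value reported in the paper. The full CycleGAN objective additionally includes adversarial terms involving D_real and D_sim, omitted here.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G_sim, G_real, I_real, I_sim, lam=10.0):
    # Forward cycle: a real image mapped to sim and back should reconstruct itself.
    forward = l1(G_real(G_sim(I_real)), I_real)
    # Backward cycle: a sim image mapped to real and back should reconstruct itself.
    backward = l1(G_sim(G_real(I_sim)), I_sim)
    return lam * (forward + backward)
```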
A real2sim approach for adapting the visual domain offers many advantages over a sim2real approach because it disentangles the sensor adaptation module from our policy training. This removes an additional bottleneck from the RL policy training process: we can train OA in parallel with the RL policy, thus reducing the overall training time needed. In addition, if the sensor observation noise in the environment changes, the base policy can be kept frozen, and only OA has to be retrained.

Dynamics Adaptation. To close the dynamics domain gap, we follow a sim2real approach. Starting with the behavior policy π_sim, we collect state-action pairs (s^real_t, a^real_t) in the real world (line 4). The state-action pairs are used to train DA, a 3-layer multilayer perceptron regression network that learns the residual error between the state transitions in simulation and reality, T_sim → T_real (line 6). Specifically, DA learns to estimate the change in position and orientation Δs^real = (s^real_{t+1} − s^real_t), where the state s_t is represented by the position and orientation of the robot at timestep t, and is trained with a weighted MSE loss. We place twice as much weight on the prediction terms for the robot's position as on its orientation, because getting the position correct is more important for our performance metric (a sketch of such a regressor appears below). Once trained, DA is used to augment the source environment (line 7). We finetune π_sim in the augmented simulator, Sim_DA (lines 8-9). Our hypothesis (which we validate in our experiments) is that using real-world data to adapt the simulator via our DA model pays off, because we can then cheaply train RL policies in this DA-augmented simulator for large amounts of experience. We apply OA at test time (line 10). Finally, we test our policy trained with BDA in the real world (line 11).

To recap, BDA has a number of advantages over the status quo (of directly using real data to fine-tune a simulation-trained policy) that we demonstrate in our experiments: (1) it decouples sensing and acting; (2) it does not require paired training data; (3) the data to train both modules can be collected jointly (by gathering experience from a behavior policy in reality), but the two modules can be trained in parallel, independently of each other; (4) similar to model-based RL [21], reducing the sim-vs-real gap in this way is significantly more sample-efficient than directly fine-tuning the policy.
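A minimal sketch of the dynamics adaptor described above is given here, assuming the state is a pose (x, y, θ) and the action is one-hot encoded over the four PointNav actions; the hidden size, input encoding, and exact residual target are assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class DynamicsAdaptor(nn.Module):
    """3-layer MLP regressing the change in pose (dx, dy, dtheta) for a (state, action) pair."""
    def __init__(self, state_dim=3, action_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),   # predicted delta_s = (dx, dy, dtheta)
        )

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1))

def weighted_mse(pred, target, pos_weight=2.0, rot_weight=1.0):
    # Position errors (dx, dy) are weighted twice as heavily as orientation (dtheta),
    # mirroring the choice that position matters more for the SPL metric.
    w = torch.tensor([pos_weight, pos_weight, rot_weight], device=pred.device)
    return torch.mean(w * (pred - target) ** 2)
```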
Our goal in this work is to enable sample-efficient Sim2Real transfer for the task of PointGoal Navigation (PointNav) [3]. However, for controlled experiments and due to COVID-19 restrictions, we study Sim2Sim transfer as a stand-in for Sim2Real. Specifically, we train policies in a "source" simulator (which serves as 'Sim' in 'Sim2Real') and transfer them to a "target" simulator (which serves as 'Real' in 'Sim2Real'). We add observation and dynamics noise to the target simulator to mimic the noise observed in reality. Note that these noise models are purely for the purpose of conducting controlled experiments and are not available to the agent (which must adapt and learn from samples of states and observations). Since no noise model is perfect (just like no simulator is perfect), we experiment with a range of noise models and report results with multiple target simulators. Our results show consistent improvements regardless of the noise model used, thus providing increased confidence in our experimental setup.

For clarity, in the text below we present our approach from the perspective of "transfer from a source to a target domain," with the assumption that obtaining data in the target domain is always expensive, regardless of whether it is a simulated or real-world environment. All of our experiments are conducted in Habitat [18].

In PointNav, a robot is initialized in an unseen environment and asked to navigate to a goal location specified in relative coordinates purely from egocentric RGB-D observations, without a map, within a limited time budget. An episode is considered successful if the robot issues the STOP command within 0.2m of the goal location. In order to increase confidence that our simulation settings will translate to the real world, we limit episodes to 200 steps, limit the number of collisions allowed (before deeming the episode a failure) to 40, and turn sliding off; these specifications were found by [13] to have high sim2real predictivity (how well evaluation in simulation predicts real-world performance). Sliding is a behavior enabled by default in many physics simulators that allows agents to slide along obstacles when the agent takes an action that would result in a collision. Turning sliding off ensures that the agent cannot cheat in simulation by sliding along obstacles. We use success rate (SUCC) and Success weighted by (normalized inverse) Path Length (SPL) [3] as metrics for evaluation.

Body. The robot has a circular base with a radius of 0.175m and a height of 0.61m. These dimensions correspond to the base width and camera height of the LoCoBot robot [2].

Sensors. The robot has access to an egocentric RGB and Depth sensor, and accurate localization and heading through a GPS+Compass sensor. The real-world robot experiments in [13] used Hector SLAM [14] with a Hokuyo UTM-30LX LIDAR sensor and found that localization errors were approximately 7cm (much lower than the 20cm PointNav success criterion). This gives us confidence that our results will generalize to reality, despite the lack of precise localization. We match the specifications of the Intel D435 camera on the LoCoBot, and set the camera field of view to 70°. To match the maximum range of the depth camera, we clip the simulated depth readings to 10m.

Sensor Noise. To simulate noisy sensor observations of the real world, we add RGB and Depth sensor noise models to the target simulator [7]. Fig. 2 shows a comparison between noise-free RGB-D images and RGB-D images with the different noise models and multipliers we use.

Actions. The action space for the robot is turn-left 30°, turn-right 30°, forward 0.25m, and STOP. In the source simulator, these actions are executed deterministically and accurately. However, actions in the real world are never deterministic: identical actions can lead to vastly different final locations due to the actuation noise (wheel slippage, battery power drainage, etc.) typically found on a real robot. To simulate the noisy actuation that occurs in the real world, we leverage the real-world translational and rotational actuation noise models characterized by [15]. A Vicon motion capture system was used to measure the difference between the commanded state and the achieved state on the LoCoBot for 3 different positional controllers: a Proportional Controller, the Dynamic Window Approach Controller from Movebase, and a Linear Quadratic Regulator (ILQR). These are controllers typically used on mobile robots. From a state (x, y, θ) and given a particular action, we add translational noise sampled from a truncated 2D Gaussian, and rotational noise sampled from a 1D Gaussian, to calculate the next state, as sketched below.
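The next-state computation under actuation noise could look roughly as follows. This is an illustrative stand-in, not the benchmarked LoCoBot noise model of [15]: the standard deviations, the truncation bound, and the use of clipping as a simple stand-in for proper truncation are all assumptions.

```python
import numpy as np

def noisy_next_state(x, y, theta, dx, dy, dtheta, rng,
                     trans_sigma=(0.01, 0.01), rot_sigma=0.02, trunc=2.0):
    """Apply an action's nominal body-frame displacement (dx, dy, dtheta)
    plus sampled actuation noise, and return the next (x, y, theta)."""
    sigma = np.array(trans_sigma)
    # Translational noise from a 2D Gaussian, clipped at `trunc` standard deviations
    # as a simple stand-in for truncation; rotational noise from a 1D Gaussian.
    noise_xy = np.clip(rng.normal(0.0, sigma), -trunc * sigma, trunc * sigma)
    noise_theta = rng.normal(0.0, rot_sigma)

    # Rotate the noisy body-frame displacement into the world frame.
    c, s = np.cos(theta), np.sin(theta)
    wx = c * (dx + noise_xy[0]) - s * (dy + noise_xy[1])
    wy = s * (dx + noise_xy[0]) + c * (dy + noise_xy[1])
    return x + wx, y + wy, theta + dtheta + noise_theta

# e.g. a noisy forward 0.25m command from the origin:
# rng = np.random.default_rng(0)
# noisy_next_state(0.0, 0.0, 0.0, dx=0.25, dy=0.0, dtheta=0.0, rng=rng)
```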
We virtualize a 6.5m by 10m real lab environment (LAB) to use as our testing environment, using a Matterport Pro2 3D camera. To model the space, we placed the Matterport camera at various locations in the room and collected 360° scans of the environment. We used the scans to create 3D meshes of the environment, and directly imported the 3D meshes into Habitat to create a photorealistic replica of LAB (Fig. 3b). We vary the number of obstacles in LAB to create 3 room configurations with varying levels of difficulty. Fig. 3 shows one of our room configurations with 5 obstacles. We perform testing over the 3 different room configurations, each with 5 start and end waypoints for navigation episodes, and 10 independent trials, for a total of 150 runs. We report the average success rate and SPL over the 150 runs.

Our models were trained entirely on the Gibson dataset [24], and have never seen LAB during training. The Gibson dataset contains 3D models of 572 cluttered indoor environments (homes, hospitals, offices, museums, etc.). In this work, we used the 72 Gibson environments that were rated 4+ in quality in [18].

Recall that our objective is to improve the ability of RL agents to generalize to new environments using little real-world data. To do this, we define our source environment as Gibson without any sensor or actuation noise (Gibson_{no-noise}). We create 10 target environments with the noise settings described in Table I. We use the notation O to represent an environment afflicted with only RGB-D observation noise (rows 2, 5, or 8), D to represent an environment afflicted with only dynamics noise (rows 3, 6, or 9), and O+D to represent an environment afflicted with both RGB-D observation noise and dynamics noise (rows 4, 7, or 10).

We train learning-based navigation policies, π, for PointGoal in Habitat using environments from the Gibson dataset. Policies were trained from scratch with reinforcement learning using DD-PPO [23], a decentralized, distributed variant of the proximal policy optimization (PPO) algorithm that allows for large-scale training in GPU-intensive simulation environments. We follow the navigation policy described in [23], which is composed of a ResNet50 visual encoder and a 2-layer LSTM. Each policy is trained using 64 Tesla V100s. Base policies are trained for 100 million steps (π^{100M}) to ensure convergence.

Our experiments aim to answer the following: (1) How large is the sim2real gap? (2) Does our method improve generalization to target domains? (3) How does our method compare to directly training (or fine-tuning) in the target environment? (4) How much real-world data do we need?

How large is the sim2real gap? First, we show that RL policies fail to generalize to new environments. We train a policy without any noise (π^{100M}_{Gibson no-noise}) and a policy with observation and dynamics noise (π^{100M}_{Gibson O+D}). We test these policies in LAB with 4 different noise settings, LAB_{no-noise}, LAB_{O}, LAB_{D}, and LAB_{O+D}, and average across the noise settings. For each noise setting, we conduct 3 sets of runs, each containing 150 episodes in the target environments. We see that π^{100M}_{Gibson no-noise} tested in LAB_{no-noise} exhibits good transfer across environments: 0.84 SPL (in contrast, the Habitat 2019 challenge winner was at 0.95 SPL [1]).
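The SPL numbers reported here and in the remaining experiments follow the standard definition from [3]; a minimal sketch of the computation (variable names are ours) is:

```python
# Success weighted by (normalized inverse) Path Length, averaged over episodes [3].
# successes[i] is 1 if episode i succeeded else 0, shortest[i] is the geodesic
# start-to-goal distance, and taken[i] is the length of the path the agent traveled.

def spl(successes, shortest, taken):
    terms = [s * l / max(p, l) for s, l, p in zip(successes, shortest, taken)]
    return sum(terms) / len(terms)

# Example: averaging over the 150 LAB evaluation episodes would call
# spl(successes, shortest, taken) with three length-150 lists.
```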
[23] showed that near-perfect performance is possible when the policy is trained for significantly longer (2.5B frames), but for the sake of running multiple experiments, we limit our analysis to 100M frames of training and compare all models at the same number of frames. From Fig. 4, we see that when dynamics noise is introduced (π^{100M}_{Gibson no-noise} tested in LAB_{D}), SPL drops from 0.84 to 0.56 (relative drop of 28%). More significantly, when sensor noise is introduced (π^{100M}_{Gibson no-noise} tested in LAB_{O}), SPL drops to 0.04 (relative drop of 81%), and when both sensor and dynamics noise are present (π^{100M}_{Gibson no-noise} tested in LAB_{O+D}), SPL drops to 0.06 (relative drop of 78%). Thus, in the absence of noise, generalization across scenes (Gibson to LAB) is good, but in the presence of noise, generalization suffers. We also notice that the converse is true: policies trained from scratch in Gibson_{O+D} environments fail to generalize to LAB_{no-noise} and LAB_{D} environments.

Fig. 4: π^{100M}_{Gibson no-noise} and π^{100M}_{Gibson O+D} tested in LAB with different combinations of observation and dynamics noise. SPL drops when a policy is tested in an environment with noise different from what it was trained in.

These results show us that RL agents are highly sensitive to what might be considered perceptually minor changes to visual inputs. To the best of our knowledge, no prior work in embodied navigation appears to have considered this question of sensitivity to noise; we hope our results will encourage others to consider it as well.

How well does OA do? Following Alg. 1 described in Sec. II-A, we train OA from scratch for 200 epochs. In Fig. 5, we see that the model learns to remove the Gaussian noise placed on the RGB image, and learns to smooth out textures in the depth image. In addition, we have RGB-D images of LAB collected from a real robot pre-COVID, and we report results of applying our real2sim OA module to them. While no GAN metric is perfect (user studies are typically conducted for evaluation, as done in [27]), we calculated the Fréchet Inception Distance (FID) [10] score (lower is better) to provide quantitative results. We find that the FID comparing I_real and I_sim images is 100.74, and the FID comparing OA(I_real) and I_sim images is 83.05. We also calculated the FID comparing simulation images afflicted with Gaussian noise, I_Gaussian, to noise-free simulation images, I_{no-noise}, to be 98.73, and the FID between OA(I_Gaussian) and I_{no-noise} images to be 88.44. To put things in context, the FID score comparing images from CIFAR10 to our simulation images is 317.61. This shows that, perceptually, the distribution of our adapted images more closely resembles images taken directly from simulation, and that real2sim OA is not far off from our sim2sim OA experiments. While our architecture has changed since this initial data collection (the initial images are 256 × 256, compared to our current architecture which uses 640 × 360 images), these results serve as a good indication that our approach will generalize to reality.

How does our method compare to fine-tuning? Next, we evaluate our policy fine-tuned using BDA with 5,000 data samples collected in the target environment (π^{BDA-5k}_{Gibson OA+DA}). We compare this to directly fine-tuning in the target environment (π^{1M}_{Gibson O+D}), which serves as an oracle baseline. Both π^{BDA-5k}_{Gibson OA+DA} and π^{1M}_{Gibson O+D} are initialized with π^{100M}_{Gibson no-noise}, and both are re-trained for each target O+D setting.
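The FID scores quoted above measure the distance between two image sets by fitting a Gaussian to the Inception-v3 activations of each set and computing the Fréchet distance between the two Gaussians [10]. A minimal sketch of that final computation follows; feature extraction is omitted, and `feats_a`/`feats_b` are assumed to be (num_images × feature_dim) activation matrices.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fit to two sets of feature vectors."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):                          # drop tiny imaginary parts
        covmean = covmean.real

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```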
Table II: π^{100M}_{Gibson no-noise} is a policy trained solely in simulation. π^{BDA-5k}_{Gibson OA+DA} and π^{1M}_{Gibson O+D} are initialized with π^{100M}_{Gibson no-noise}. π^{BDA-5k}_{Gibson OA+DA} is fine-tuned with BDA using 5k samples from the target environment. π^{1M}_{Gibson O+D} is fine-tuned directly in the target environment for 1M steps of experience and serves as an oracle baseline. Both achieve strong performance across environments with varying noise (rows 4, 8, 12).

Our results in Table II show the benefit of fine-tuning using data from the target environments. π^{1M}_{Gibson O+D} demonstrates robustness in all combinations of sensor and actuation noise. We also observe that using BDA to learn the observation and dynamics noise models with 5,000 samples from the target environment nearly matches the performance of π^{1M}_{Gibson O+D}. In fact, we see, on average, only a 5% difference between π^{1M}_{Gibson O+D} and π^{BDA-5k}_{Gibson OA+DA} (rows 4, 8, 12), even though the former is directly trained in the target environment, which is not possible in reality, as it requires 1M samples from the target environment. From these results, we notice that in certain environments our method performs worse than the oracle baseline if no noise or only observation noise is present (rows 1, 5, 6, 10), but performs on the level of the oracle baseline when dynamics noise is added (rows 3, 4, 7, 8, 11, 12). We believe this is due to 'sliding', a default behavior in 3D simulators allowing agents to slide along obstacles rather than stopping immediately on contact. Following the findings and recommendations of [13], we disabled sliding to make our simulation results more predictive of real-world experiments. We find that one common failure mode in the absence of sliding is that agents get stuck on obstacles. In the presence of dynamics noise, the slight amount of actuation noise allows the agent to free itself from obstacles, similar to how it would in reality. Without dynamics noise, the agents remain stuck.

Sample Efficiency. We repeat our experiments, varying the amount of data collected from the target environment. We re-train OA and DA using 100, 250, 500, 1,000, and 5,000 steps of experience in the target environment, and re-evaluate performance. We compare this to directly fine-tuning in the target environment for varying amounts of data. In Fig. 7, the x-axis represents the number of samples collected in the target environment. From previous experiments, we estimate 1 episode in the real world to last on average 6 minutes, during which the robot takes approximately 70 steps to reach the goal. We use this as a conversion factor, and add an additional x-axis to show the number of hours needed to collect the required samples from the target environment. The y-axis shows the SPL in the target environment. We see that the majority of the improvement comes from the first 1,000 samples from the target environment, and after 5,000 samples, π^{BDA-5k}_{Gibson OA+DA} is able to match the performance of π^{1M}_{Gibson O+D}. Collecting the 5,000 samples needed to train our method would have taken 7 hours in the target environment. In comparison, Fig. 7b shows that we would have to fine-tune the base policy for approximately 585,000 steps in the target environment (836 hours of data collection) to reach the same SPL. Comparing the amount of data needed to reach the same SPL, we see that BDA reduces the amount of data needed from the target environment by 36× in Fig. 7a, 117× in Fig. 7c, for an average speed-up of 61×.
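For reference, the episode-to-hours conversion used for the second x-axis in Fig. 7 can be reproduced with a couple of lines; the 70 steps and 6 minutes per episode come from the estimate above.

```python
# Back-of-the-envelope conversion from target-environment samples to collection hours.
STEPS_PER_EPISODE = 70
MINUTES_PER_EPISODE = 6

def collection_hours(num_samples):
    return num_samples / STEPS_PER_EPISODE * MINUTES_PER_EPISODE / 60

print(round(collection_hours(5_000)))    # ~7 hours for BDA's 5k samples
print(round(collection_hours(585_000)))  # ~836 hours to fine-tune directly
```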
These results give us confidence in the importance of our approach, as we wish to limit the amount of data needed from the target environment (i.e., the real world).

Bi-directional Domain Adaptation is related to literature on domain and dynamics randomization, domain adaptation, and residual policy learning.

Domain and Dynamics Randomization. Borrowing ideas from data augmentation commonly used in computer vision, domain randomization is a technique to train robust policies by exposing the agent to a wide variety of simulation environments with randomized visual properties such as lighting, texture, camera position, etc. Similarly, dynamics randomization is a process that randomizes physical properties in the simulator such as friction, mass, damping, etc. [17] applied randomization to textures to learn real indoor flight by training solely in simulation. [6] used real-world rollouts to learn a distribution of simulation dynamics parameters to randomize over. [4] randomized both physical and visual parameters to train a robotic hand to perform in-hand manipulation. However, finding the right distribution to randomize parameters over is difficult, and may require expert knowledge. If the distribution chosen to randomize parameters over is too large, the task becomes much harder for the policy to learn. On the other hand, if the distribution is too small, then the reality gap remains large, and the policy will fail to generalize.

Domain Adaptation. To bridge the simulation-to-reality gap, many works have used domain adaptation, a technique in which data from a source domain is adapted to more closely resemble data from a target domain. Prior works have used domain adaptation techniques for adapting vision-based models to translate images from sim-to-real during training for manipulation tasks [5], [11], and from real-to-sim during testing for navigation tasks [26]. Other works have focused on adapting policies to changes in dynamics [8], [25]. In our work, we use domain adaptation to close the gap in both the visual and the dynamics domain.

Residual Policy Learning. An alternative to typical transfer learning techniques is to directly improve the underlying policy itself. Instead of re-training an agent from scratch when a policy performs sub-optimally, the sub-optimal policy can be used as prior knowledge in RL to speed up training. This is the main idea behind residual policy learning, in which a residual policy is used to augment an initial policy to correct for changes in the environment. [12], [20] demonstrated that combining residual policy learning with conventional robotic control improves the robot's ability to adapt to variations in the environment for manipulation tasks. Our method builds on this line of research by augmenting the simulator using a neural network that learns the residual error between simulation and reality.

We introduce Bi-directional Domain Adaptation (BDA), a method that utilizes the differences between simulation and reality to accelerate learning and improve the generalization of RL policies. We use domain adaptation techniques to transfer images from real2sim to close the visual domain gap, and learn the residual error in dynamics from sim2real to close the dynamics domain gap. We find that our method consistently improves performance of the initial policy π while remaining sample efficient.

References
[1] Habitat Challenge 2019 @ Habitat Embodied Agents Workshop. CVPR 2019.
[2] LoCoBot: An Open Source Low Cost Robot.
[3] On Evaluation of Embodied Navigation Agents.
[4] Learning Dexterous In-Hand Manipulation.
[5] Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping.
[6] Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience.
[7] Robust Reconstruction of Indoor Scenes.
[8] Sim-to-Real Transfer with Neural-Augmented Robot Simulation.
[9] Learning to Walk via Deep Reinforcement Learning.
[10] GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.
[11] Sim-to-Real via Sim-to-Sim: Data-Efficient Robotic Grasping via Randomized-to-Canonical Adaptation Networks.
[12] Residual Reinforcement Learning for Robot Control.
[13] Sim2Real Predictivity: Does Evaluation in Simulation Predict Real-World Performance?
[14] A Flexible and Scalable SLAM System with Full 3D Motion Estimation.
[15] PyRobot: An Open-Source Robotics Framework for Research and Benchmarking.
[16] Sim-to-Real Transfer of Robotic Control with Dynamics Randomization.
[17] CAD2RL: Real Single-Image Flight Without a Single Real Image.
[18] Habitat: A Platform for Embodied AI Research.
[19] A Brief Introduction to Boosting.
[21] Reinforcement Learning: An Introduction.
[22] Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[23] DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames.
[24] Gibson Env: Real-World Perception for Embodied Agents.
[25] Learning Fast Adaptation with Meta Strategy Optimization.
[26] VR-Goggles for Robots: Real-to-Sim Domain Adaptation for Visual Control.
[27] Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks.