Learning Spatiotemporal Occupancy Grid Maps for Lifelong Navigation in Dynamic Scenes
Hugues Thomas, Matthieu Gallet de Saint Aurin, Jian Zhang, Timothy D. Barfoot
{hugues.thomas, tim.barfoot}@utoronto.ca, maaurin@student.ethz.ch, jianz@apple.com
date: 2021-08-24

We present a novel method for generating, predicting, and using Spatiotemporal Occupancy Grid Maps (SOGM), which embed future information of dynamic scenes. Our automated generation process creates groundtruth SOGMs from previous navigation data. We build on prior work to annotate lidar points based on their dynamic properties, which are then projected on time-stamped 2D grids: SOGMs. We design a 3D-2D feedforward architecture, trained to predict the future time steps of SOGMs, given 3D lidar frames as input. Our pipeline is entirely self-supervised, thus enabling lifelong learning for robots. The network is composed of a 3D back-end that extracts rich features and enables the semantic segmentation of the lidar frames, and a 2D front-end that predicts the future information embedded in the SOGMs, which is then used for planning. We also design a navigation pipeline that uses these predicted SOGMs. We provide both quantitative and qualitative insights into the predictions and validate our choices of network design with a comparison to the state of the art and ablation studies.

Robot navigation, and particularly local planning, is complex in the presence of dynamic obstacles. Anticipating future motions has begun to be possible with recent advances in machine learning and robotics. In our work, we aim to provide an accurate prediction of the temporal evolution of the environment, enabling a robot to proactively plan efficient and safe trajectories. As a lifelong learning approach, Reinforcement Learning (RL) often replaces the whole navigation pipeline; instead, we advocate the use of Self-Supervised Learning (SSL) to solve specific tasks that help robot navigation. We propose a method to predict the future motions of dynamic obstacles, which can easily be used in the navigation pipeline, along with proven localization and planning algorithms. Building on our previous work [1], we use a similar automated annotation process with mapping and ray-tracing algorithms, providing four semantic labels to individual lidar points: ground, permanent (structures such as walls), movable (still-but-movable obstacles such as chairs), and dynamic (such as people). In addition, we propose a Spatiotemporal Occupancy Grid Map (SOGM) generation algorithm, converting the annotated lidar point clouds to SOGMs with three channels, one for each obstacle label. We then train a novel 3D-2D feedforward architecture to predict the future time steps of SOGMs, given past 3D lidar frames as input. As shown in Figure 1, the predicted SOGMs can be processed and used by an existing trajectory planning algorithm.

Fig. 1. The robot faces dynamic obstacles (a). Our 3D-2D network forecasts their future as Spatiotemporal Occupancy Grid Maps (b), which are processed into Spatiotemporal Risk Maps (c). A planned trajectory is optimized in the 3D (x, y, t) space (d).

We define a SOGM as a 3D grid map, where each grid cell, independent of the others, contains an occupancy probability for a given position and time, in world coordinates. It can be interpreted as a set of 2D Occupancy Grid Maps (OGM), one for each future time step.
Note that this is different from the Dynamic Occupancy Grid Maps used by [2], which are OGMs with additional velocity features in each cell. The first crucial characteristic of our method is therefore to be point-centric: as opposed to object-centric methods, we remain closer to the raw sensor data and inherently allow multi-modal predictions. A second characteristic is to include three types of obstacles as channels of the SOGM. Even if we are only interested in the prediction for the dynamic obstacles, this enables a key capability: our network can take interactions into account when predicting future motions. For example, dynamic obstacles will bounce off walls, avoid objects, or pass through open doors.

Our novel 3D-2D feedforward network architecture, shown in Figure 3, is able to predict SOGMs from 3D lidar frames. The network input consists of consecutive frames, aligned on a global map and merged, to provide both spatial and temporal information. Our network starts with a 3D back-end, processing the lidar frames with KPConv layers [3] to extract rich features that can be projected on a 2D grid or used for point cloud semantic segmentation. The 2D grid features are processed by a 2D front-end network outputting a SOGM. The defining characteristic of this 2D front-end is that it is feedforward, without any recurrent connections. The occupancy predictions at consecutive time steps are computed by successive convolutions with independent weights. This enables our network to learn and predict a wider range of spatiotemporal patterns at different future horizons.

The main contributions of this paper are:
• The automated SOGM generation method.
• The novel 3D-2D feedforward neural network predicting three-channel SOGMs.

Additionally, we integrated SOGMs into an overall navigation stack. First, they are converted into Spatiotemporal Risk Maps (SRM) to reflect the risk of collision, linearly decreasing with the distance to occupied space. The SRMs are then used by a modified Timed Elastic Band (TEB) [4], [5] local planner. TEB normally optimizes a trajectory in a three-dimensional spatiotemporal space (x, y, t), maximizing the distance to discrete obstacles, but was adapted to minimize the risk value from the SRMs instead. We also keep the triaging idea from [1], with some improvements. We reduce the delay in localization by using the raw lidar frames to locate against the map, and keep the advantage of triaging by updating the map only with classified frames. While this paper focuses on the generation and prediction of SOGMs, we show early results of how we plan to use them in this overall navigation pipeline.

Navigation around dynamic obstacles has been studied for a long time in robotics. In terms of predicting obstacles' future motions, [6], [7] learned a distribution of possible pedestrian trajectories using Inverse Optimal Control to recover a set of agent preferences consistent with prior demonstrations. Following these preliminary works, several directions have been explored for dynamic obstacle forecasting. Object tracking and trajectory prediction is a very common approach to enable motion planning in environments with dynamic obstacles. Following the success of recurrent neural networks (RNNs), and in particular long short-term memory networks (LSTMs), for trajectory prediction [8], [9], the idea of isolating each obstacle as a distinct object has received a lot of attention.
[10] uses an LSTM-based network to predict pedestrian trajectories and plan around them, while [11] exploits a Hidden Markov Model to predict future states from a history of observations. Similarly, [12] detects individual obstacles and predicts their speeds to avoid "freezing zones". Such object-centric methods are also used in the context of autonomous driving [13], [14]. However, they all rely on detection and tracking results and do not easily incorporate multi-modal predictions, problems we do not face with our point-centric approach. Closer to our work, [15] predicts future human motions as 2D heat maps, implicitly handling multi-modality as we do, but it still relies on object-level predictions and is limited to 2D inputs, where our method leverages 3D features.

Reinforcement Learning has also been used extensively in recent years to replace standard motion planning algorithms [16]-[21]. However, standard local planners have proven to be very reliable methods, especially when it comes to producing dynamically feasible trajectories, which most RL methods fail to do. Even when feasibility is ensured [22], the whole planning algorithm is embedded in a black-box end-to-end neural network, which is difficult to tune, debug, and interpret. We chose to keep a standard local planner, with its guarantees and interpretability, and use a self-supervised deep learning method to predict the future motion of the dynamic obstacles.

OGM prediction approaches are the closest to our work, and can be separated into two groups, using either handcrafted or learned features. Handcrafted approaches usually rely on a model for human motion prediction. [23] predicts a Dynamic Risk Density based on the occupancy density and velocity field of the environment. [24] extends this idea with a Gaussian-Process-regulated risk map of tracked pedestrians and cars. Other recent works focus on adapting the uncertainty of the predictive model [25]-[27], using real-time Bayesian frameworks in closed loop on real data. These methods either rely on object detection and tracking or are based on instantaneous velocities, and are not able to predict future locations of obstacles accurately. Learned methods, usually based on video frame prediction architectures [28]-[30], are better at predicting complex futures. [31] introduces a difference-learning-based recurrent architecture to extract and incorporate motion information for OGM prediction. [32] presents an LSTM-based encoder-decoder framework to predict the future environment represented as Dynamic Occupancy Grid Maps. They introduce recurrent skip connections into the network to reduce prediction blurriness. To account for the spatiotemporal evolution of the occupancy state of both static and dynamic objects, [33] proposes a double-prong network architecture. However, two major differences remain with our approach. First, these methods all take previous OGMs as input, effectively losing valuable information in the shape patterns that common 3D sensors can capture. To our knowledge, we are the first to fill this gap in the literature by incorporating 3D features in the forecasting of OGMs with the 3D backbone of our network. Second, we are also the first, to our knowledge, to predict a sequence of OGMs without recurrent layers. We argue that feedforward architectures are easier to train and understand.
Finally, our network is able to make a distinction between different semantic classes, leveraging interactions between them when predicting future occupancy.

This section describes our approach, from the automated SOGM generation process to the SOGM prediction with our novel 3D-2D feedforward architecture, including training and inference details. Finally, we present our risk-aware navigation system, with a modified TEB planner using SRMs. We encourage reproducibility and application to new datasets with our open-source implementation (https://github.com/utiasASRL/Deep-Collison-Checker).

Fig. 2. During pre-processing, every frame is semantically filtered and projected in 2D. During training, the 2D frames are stacked in 3D according to their timestamps and projected to a 3D grid to create the SOGMs.

Generating the groundtruth SOGMs automatically allows us to train the network without human intervention, in a self-supervised manner. We follow the same lifelong learning principles as [1], allowing our network to learn from new situations encountered throughout the robot's life. We first annotate lidar frames, using the method from [1]. A combination of a point cloud SLAM algorithm and a point cloud ray-tracing algorithm is used to create and then annotate a point cloud map for each previous navigation session. The four semantic labels, ground, permanent, movable, and dynamic, are identified on the map and projected back to the lidar frames of each session, which will be used to train the 3D semantic segmentation part of our network.

We can then proceed to the generation of SOGMs with the three obstacle classes: permanent, movable, and dynamic. To avoid the problem of rotating grids, we precompute intermediate 2D point clouds, which can easily be rotated and then transformed into SOGMs during training (Figure 2). In addition, such a sparse structure is lighter to save than full grids. First, each annotated lidar frame is filtered to eliminate the ground points. We remove every point that is not annotated as an obstacle class, or that is closer than 20 cm to the ground plane. Then we project all the points on the ground plane and subsample the obtained 2D point cloud with a grid size of 3 cm. At training time, the 2D point clouds we need are loaded and stacked along a third dimension according to their timestamps. After being rotated for data augmentation, this spatiotemporal point cloud is projected to a SOGM structure of spatial resolution dl_2D = 12 cm and temporal resolution dt = 0.1 s. The permanent and movable occupancy from all time steps of the SOGM are merged because they are not moving. Therefore, in addition to the future locations of dynamic obstacles, our network also learns to complete partially seen static objects.
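To make this last projection step concrete, the following is a minimal numpy sketch of how annotated 2D points could be turned into a groundtruth SOGM, assuming the points are already ground-filtered, subsampled, and expressed in world coordinates; the function and argument names are illustrative and the rotation augmentation is omitted.

```python
import numpy as np

def build_sogm(points_2d, labels, timestamps, center, dl_2d=0.12, dt=0.1, T=3.0, grid_size=94):
    """Project annotated 2D points into a spatiotemporal occupancy grid.

    points_2d  : (N, 2) xy positions in world coordinates
    labels     : (N,) obstacle class in {0: permanent, 1: movable, 2: dynamic}
    timestamps : (N,) time of each point relative to the current frame, in [0, T]
    center     : (2,) xy position of the grid center in world coordinates
    """
    n_t = int(T / dt) + 1                                   # 31 time steps for T = 3.0 s
    sogm = np.zeros((n_t, grid_size, grid_size, 3), dtype=np.float32)

    # World coordinates -> integer grid indices
    ij = np.floor((points_2d - center) / dl_2d + grid_size / 2).astype(int)
    tk = np.rint(timestamps / dt).astype(int)
    valid = np.all((ij >= 0) & (ij < grid_size), axis=1) & (tk >= 0) & (tk < n_t)
    ij, tk, lbl = ij[valid], tk[valid], labels[valid]

    # Mark occupied cells in the corresponding class channel and time step
    sogm[tk, ij[:, 1], ij[:, 0], lbl] = 1.0

    # Permanent and movable obstacles do not move: merge them across all time steps
    sogm[..., 0] = sogm[..., 0].max(axis=0, keepdims=True)
    sogm[..., 1] = sogm[..., 1].max(axis=0, keepdims=True)
    return sogm
```

The same routine applied without the final merge would give a per-time-step grid for the dynamic class only, which is what the network is ultimately asked to forecast.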
Our network architecture (Figure 3) is composed of two parts: a 3D back-end and a 2D front-end. The 3D back-end is a KPConv network [3] predicting a semantic label for each input point, which we use in our improved navigation system (see Section III-D). Predicting 3D labels helps the network training by providing an additional supervisory signal and ensures that rich features are passed to the 2D front-end. We keep the KP-FCNN architecture and parameters of the original paper: a U-Net with five levels, each composed of two ResNet layers. The network input is a point cloud made from n_f = 3 lidar frames aligned in the map coordinates and merged. We only keep the points inside a R_in = 8 m radius, as we are interested in the local vicinity of the robot. Each point is assigned a one-hot n_f-dimensional feature vector, encoding the lidar frame to which it belongs. Following the KPConv paper, we control the input point cloud density with a grid subsampling (dl_3D = 6 cm).

The 3D point features are passed to the 2D front-end with a grid projection using the same spatial resolution dl_2D as the SOGM. The size of the grid is determined as the inscribed square in the R_in-radius circle: h_grid = w_grid = 94. Features from points located in the same cell are averaged to get the new cell features. The features of empty cells are set to zero. The obtained 2D feature map is then processed by an image U-Net architecture with three levels, each composed of two ResNet layers, to diffuse the information contained in sparse locations to the whole grid. This dense feature map is used to predict the initial time step of the SOGM. Then, it is processed by successive propagation blocks, each composed of two ResNet layers. The output of each propagation block is used to predict the corresponding time step of the SOGM. We define the final prediction time T = 3.0 s, meaning that our SOGMs have n_T = T/dt + 1 = 31 time steps in total. Note that the permanent and movable predictions are redundant, but we keep them to force the network to retain the knowledge of their location after the propagation blocks. This helps to learn interactions between the classes further into the future.

Fig. 3. Illustration of our 3D-2D feedforward architecture. The 3D back-end is a 5-layer KPConv architecture, producing features at the point level. The 2D front-end is composed of a 3-layer 2D U-Net architecture, followed by consecutive convolutional propagation blocks.

We define a particular training loss for our network:

L = λ1 L_3D + λ2 Σ_k L_2D^k,

where λ1 = 1.0, λ2 = 10.0, L_3D is the standard cross-entropy loss for semantic segmentation with a KPConv network, and L_2D^k is the loss applied to layer k of our SOGM predictions:

L_2D^k = (1 / |M_k|) Σ_{i ∈ M_k} BCE(x_{k,i}, y_{k,i}),

where x_{k,i} is the network logit at pixel i of the time-step layer k in the SOGM, y_{k,i} is its corresponding groundtruth, and BCE is a Binary Cross-Entropy loss. Note that, for clarity, we use a simple index i for 2D pixels. The SOGM loss is thus a masked Binary Cross-Entropy, where the mask M_k is there to ignore the over-represented empty pixels and focus on the positive examples. We first tried a mask covering the positive groundtruth values in addition to some random pixels (GT-Mask), but then improved it to cover the union of positive groundtruth and positive prediction pixels (Active-Mask) to help reduce the false positives (see Figure 4, bottom).

During training, we only use rotation augmentation around the vertical axis. The rest of the training parameters are kept identical to the original KPConv paper [3]. In this setup, our input point clouds contain on average 20,000 points. We use a variable batch size targeting B = 6, for an average of 85,000 points per batch. More details can be found in our open-source code.
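As an illustration, here is a small PyTorch sketch of this combined loss, under the assumptions that the SOGM logits are stored as a (B, n_T, H, W, 3) tensor and that each time-step loss is averaged over its Active-Mask pixels; the tensor layout and names are ours, not those of the released implementation.

```python
import torch
import torch.nn.functional as F

def training_loss(logits_3d, labels_3d, logits_2d, gt_sogm, lambda_1=1.0, lambda_2=10.0):
    """Combined loss sketch.
    logits_3d: (N, C) per-point semantic logits, labels_3d: (N,) integer labels,
    logits_2d: (B, n_T, H, W, 3) SOGM logits, gt_sogm: same shape, float 0/1."""
    # 3D back-end: standard cross-entropy for semantic segmentation
    loss_3d = F.cross_entropy(logits_3d, labels_3d)

    # Active-Mask: union of positive groundtruth and positive prediction pixels
    with torch.no_grad():
        mask = (gt_sogm > 0.5) | (torch.sigmoid(logits_2d) > 0.5)

    # Masked binary cross-entropy, averaged per time-step layer k, then summed over k
    bce = F.binary_cross_entropy_with_logits(logits_2d, gt_sogm, reduction='none')
    dims = (0, 2, 3, 4)  # every dimension except the time axis
    loss_2d = (bce * mask).sum(dims) / mask.sum(dims).clamp(min=1)

    return lambda_1 * loss_3d + lambda_2 * loss_2d.sum()
```

Because the mask depends on the current predictions, it is recomputed at every step but kept outside the gradient computation, so the network is only penalized on groundtruth-positive pixels and on its own false positives.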
Although our navigation system is not fully evaluated in this paper, we find it important to describe it here, to place some context around our other major contributions. During navigation, we transform the predicted SOGMs into SRMs to encode the risk of collision instead of the occupancy probability. TEB originally minimizes a linearly decreasing cost function C_obst = max(0, 1 − d/d_0), where d is the distance to the closest obstacle and d_0 a predefined influence distance. We define a linearly decreasing risk value similarly:

r_i = ( Σ_j o_j max(0, 1 − d_ij/d_0)^p )^(1/p),

where o_j is the predicted occupancy of pixel j, d_ij is the distance from pixel i to pixel j in the grid space, and d_0 is set to 2 m. This risk value behaves like a p-norm (see top of Figure 4): the higher p is, the closer it is to the maximum value of the linear influence of each surrounding pixel. We use p = 3 in the following, compute a SRM for each SOGM channel, and keep the pixel-wise maximum. TEB optimizes a trajectory in continuous space, so we use a trilinear interpolation of the SRM to get continuous risk values. If a pose time is too far in the future, it is ignored. TEB also allows the optimization of multiple trajectories for different homotopy classes. We keep this feature by creating estimated point obstacles at local maxima in the SOGM, ignored by the trajectory optimizer, but used for homotopy class computation.

In addition, our network predicts labels for the frame points, which we use for global planning and localization. We exclude the dynamic points to update the global planner 2D costmap, and the movable and dynamic points for the localization 3D point map. Because of the network delay, we only perform update operations with these labeled frames, and use raw lidar frames instead to perform localization. On an Nvidia RTX 3090, our network forward pass takes less than 50 ms with a CPU preprocessing of around 200 ms. The SRM conversion takes about 30 ms; if we add other smaller delays (communication for example), we end up with the first 3 or 4 layers of the SRM (out of 31) being obsolete, which is acceptable.
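The sketch below shows one way the SRM conversion described above could look for a single SOGM time step and class channel, under the assumption that the risk is the p-norm of the occupancy-weighted linear influences and is clipped to [0, 1]; the brute-force loop stands in for whatever optimized kernel operation a real implementation would use.

```python
import numpy as np

def srm_layer(occ, dl_2d=0.12, d0=2.0, p=3):
    """Convert one SOGM layer (H, W) of occupancy values in [0, 1] into a risk map."""
    h, w = occ.shape
    r = int(np.ceil(d0 / dl_2d))                      # influence radius in pixels (17 here)

    # Precompute the linear influence of every offset, raised to the power p
    yy, xx = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing='ij')
    kernel = np.maximum(0.0, 1.0 - np.hypot(xx, yy) * dl_2d / d0) ** p

    # Accumulate the occupancy-weighted influences of every surrounding pixel
    padded = np.pad(occ, r)
    risk_p = np.zeros_like(occ)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            risk_p += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]

    # p-norm of the influences; the clip to [0, 1] is our assumption
    return np.clip(risk_p ** (1.0 / p), 0.0, 1.0)
```

A full SRM would apply this to every time step and every channel, then keep the pixel-wise maximum over channels, as described above.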
We evaluate our method in a simulated space in Gazebo, and we define three actor behaviors. Bouncers are more like marbles than humans, walking in straight lines and bouncing off every obstacle. Wanderers walk with a constant speed but randomly change direction. They also bounce off obstacles and are repelled by the robot. Finally, Flow Followers are driven to a goal by a precomputed force field. They randomly choose their goal in a predefined set and only try to avoid the robot when it is very close to them. For the training of our network, we create three different datasets, one for each behavior, and use randomized session setups. For each training set, we perform 16 sessions, including 3 for validation, with a random population, object count, and robot tour. At test time, we can choose the robot objective and the population depending on the needs of the experiment.

We focus this section on our major contributions and evaluate the SOGM predictions. We provide an anecdotal example of our navigation system using SRMs in the conclusion and in the supplementary video. We start with a quantitative analysis of the network results, and choose to only evaluate the dynamic class performances, which are the most significant. For consistency with previous works [30], [33], we could use Mean Squared Error (MSE) as a metric, but given the overwhelming majority of negative pixels, it is not the best choice. The SOGM predictions are 3D grids with values in [0, 1], and groundtruth SOGMs are binary, so we can compute precision-recall curves and use Average Precision (AP) as our main metric.

First, Figure 5 shows a comparison of the predicted SOGMs for Bouncers, Wanderers, and Flow Followers. By plotting AP for each time step in the SOGM, we see that it is very difficult to forecast obstacle motion far into the future. Bouncers are the easiest as they walk straight, and Flow Followers are the most difficult as their movements are more complex. We choose the Flow Followers, closer to real actors, for the following evaluations, shown in Table I. We use the overall AP and MSE of the 3D SOGM, in addition to the AP at 1.0 s and 2.0 s.

We first study different architectures for the 2D front-end network, defined by (n_1, n_2, n_3): respectively the number of initial levels, the number of ResNet layers per level, and the number of ResNet layers per propagation block. We note that a deeper architecture has better performance, but is also slower than our (3, 2, 2) architecture, chosen for fast inference. Future works could explore different front-end network designs, as there seems to be an opportunity for improvement. We then show that the Active-Mask in our loss function is relevant, by comparing with the GT-Mask and no mask, both giving worse performance. We also show the importance of the 3D back-end to create rich features, by removing the 3D part of the loss (λ1 = 0). These two experiments validate two crucial choices in our design. We also see that sharing the weights of the propagation blocks (making them recurrent) does not work. A direction for future work could be to explore recurrent convolutional layers, instead of simply sharing the weights of our propagation blocks.

Finally, we compare with a state-of-the-art video prediction method, Memory-in-Memory (MIM) [30], which offers a solid benchmark. We also tried the open-source implementation of [33], with very poor results. Their method is not applicable in our case because it predicts obstacles in the robot coordinates instead of the world coordinates. MIM normally predicts 64-pixel frames (MIM-64), but we modified their code to predict 94-pixel frames (MIM-94) for comparable performances. We see that our 3D-2D feedforward network outperforms MIM by a large gap, showing that our performance is better than the current best comparable approaches. This validates that our major contribution, incorporating a 3D back-end into 2D OGM prediction, leads to improved performance. We share the data and parameters used in this experiment to encourage future works to explore this idea further.
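For reference, the per-time-step AP curves used above could be computed along the following lines, treating each time layer of the dynamic channel as an independent binary detection problem; the tensor layout and function names are illustrative, not those of our released code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def ap_per_timestep(pred_sogms, gt_sogms, channel=2):
    """Average Precision of the dynamic class for each future time step.
    pred_sogms: (B, n_T, H, W, 3) predictions in [0, 1],
    gt_sogms  : same shape, binary groundtruth."""
    n_t = pred_sogms.shape[1]
    aps = []
    for k in range(n_t):
        y_true = gt_sogms[:, k, :, :, channel].reshape(-1) > 0.5
        y_score = pred_sogms[:, k, :, :, channel].reshape(-1)
        # AP is undefined when a layer contains no positive pixel
        aps.append(average_precision_score(y_true, y_score) if y_true.any() else np.nan)
    return np.array(aps)
```

Averaging these values over the 31 layers gives the overall AP reported in Table I, while individual entries such as the AP at 1.0 s and 2.0 s correspond to single layers of the curve.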
Using only numerical results, it is hard to judge how well our approach performs and to set a bar for 'good' performance, especially because the multi-modal nature of the actor trajectories lowers these values. Consider a simple example: if a person faces three doors and chooses randomly where to go, we expect the network to predict that each trajectory is possible with a 33% probability, which automatically means at least 66% of false positives. In our case, the situations are more complex but the principle remains, which is why we need a thorough investigation of the qualitative results to assess the performance.

SOGMs are best seen when animated in the supplementary video. Here, we present them in image format (Figure 6). First, we merged all the time steps of the SOGMs to highlight the movement that is predicted (a). We superimpose the groundtruth for the dynamic class as a trajectory, computed with a local maximum filter and coloured according to time, from cyan (t = 0.0 s) to green (t = 3.0 s). We also show one layer of prediction 2.5 seconds into the future (b), with the groundtruth for the dynamic class superimposed as a contour. We can see several exciting results. First, the network handles Bouncers quite well, with accurate predictions even when they bounce on walls (1). It does not manage to predict bounces between actors, which is not surprising as they are less frequent and more random than wall bounces. Then we notice that the predictions for Wanderers (2b) generally have a shape similar to the banana distribution [34]. The network is also able to predict that the Wanderers are repelled by the robot (2a). Flow Followers exhibit more complex trajectories (e.g. going through doors) that the network is able to predict (3). We observe an interesting phenomenon: the network often predicts a latent risk at some busy doorways (4), because during training it often sees a person appearing there out of nowhere. Last but not least, our network sometimes predicts two possible trajectories, here one going through the door and one going straight (5), illustrating our ability to make multi-modal predictions.

We presented a self-supervised approach forecasting the motion of dynamic obstacles, without the need to introduce an object-level description. We showed how to generate SOGMs automatically and train a novel 3D-2D feedforward neural network to predict them. We validated our SOGM predictions both quantitatively and qualitatively. We also presented a navigation system using the predicted SOGMs to proactively avoid dynamic obstacles. We reserve a thorough evaluation of this system for future work, especially when real-world experiments will be possible again (hopefully very soon). For now, we show it running on a simple example in the supplementary video and in Figure 7.

Fig. 7. Proposed risk-aware navigation system. The local planner anticipates the actors' future motion and plans to go behind them.

Following other works, we showed how self-supervision and lifelong learning are valuable for robotic navigation and hope that more work follows in this direction. There are many prospects for improvement: with more complex simulated human behaviors, and eventually real behaviors; by conditioning the SOGM predictions on the possible future robot actions; by combining recurrent and feedforward architectures; or by including the notion of visible space in the generation and prediction of SOGMs.
[1] Self-supervised learning of lidar segmentation for autonomous indoor navigation
[2] Long-term occupancy grid prediction using recurrent neural networks
[3] KPConv: Flexible and deformable convolution for point clouds
[4] Planning of multiple robot trajectories in distinctive topologies
[5] Integrated online trajectory planning and optimization in distinctive topologies
[6] Planning-based prediction for pedestrians
[7] Activity forecasting
[8] Social LSTM: Human trajectory prediction in crowded spaces
[9] Social GAN: Socially acceptable trajectories with generative adversarial networks
[10] Intent-aware pedestrian prediction for adaptive crowd navigation
[11] A data-driven framework for proactive intention-aware motion planning of a robot in a human environment
[12] Frozone: Freezing-free, pedestrian-friendly navigation in human crowds
[13] Fast and Furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net
[14] IntentNet: Learning to predict intention from raw sensor data
[15] Discrete residual flow for probabilistic pedestrian behavior prediction
[16] Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning
[17] CrowdSteer: Realtime smooth and collision-free robot navigation in densely crowded scenarios trained using high-fidelity simulation
[18] DenseCAvoid: Real-time navigation in dense crowds using anticipatory behaviors
[19] Collision avoidance in pedestrian-rich environments with deep reinforcement learning
[20] Learning obstacle representations for neural motion planning
[21] Robot navigation in crowded environments using deep reinforcement learning
[22] DWA-RL: Dynamically feasible deep reinforcement learning policy for robot navigation among mobile obstacles
[23] Dynamic risk density for autonomous navigation in cluttered environments without object detection
[24] Safe path planning with multi-model risk level sets, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
[25] Probabilistically safe robot planning with confidence-based human predictions
[26] A scalable framework for real-time multi-robot, multi-human collision avoidance
[27] A Hamilton-Jacobi reachability-based framework for predicting and analyzing human motion for safe planning
[28] Deep predictive coding networks for video prediction and unsupervised learning
[29] PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning
[30] Memory in Memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics
[31] Multi-step prediction of occupancy grid maps with recurrent neural networks
[32] Motion estimation in occupancy grid maps in stationary settings using recurrent neural networks
[33] Double-prong ConvLSTM for spatiotemporal occupancy prediction in dynamic environments
[34] The banana distribution is gaussian: A localization study with exponential coordinates