PandaSet: Advanced Sensor Suite Dataset for Autonomous Driving
Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, Yunlong Wang, Diange Yang
2021-12-23

The accelerating development of autonomous driving technology has placed greater demands on obtaining large amounts of high-quality data. Representative, labeled, real-world data serves as the fuel for training deep learning networks, critical for improving self-driving perception algorithms. In this paper, we introduce PandaSet, the first dataset produced by a complete, high-precision autonomous vehicle sensor kit with a no-cost commercial license. The dataset was collected using one 360° mechanical spinning LiDAR, one forward-facing, long-range LiDAR, and 6 cameras. The dataset contains more than 100 scenes, each of which is 8 seconds long, and provides 28 types of labels for object classification and 37 types of labels for semantic segmentation. We provide baselines for LiDAR-only 3D object detection, LiDAR-camera fusion 3D object detection, and LiDAR point cloud segmentation. For more details about PandaSet and the development kit, see https://scale.com/open-datasets/pandaset.

Autonomous driving has attracted widespread attention in recent years with its potential to fundamentally disrupt the transportation and mobility landscape. A key component of the autonomous driving technology stack is 3D perception technology. Given the current state of machine learning, 3D perception relies on large amounts of high-quality, real-world annotated data [1]. The data needs to satisfy two requirements. First, the sensors used for data collection must provide sufficiently high precision. If sensor-collected data fails to achieve high precision (due to imprecise LiDAR range measurements, pixelated camera images, etc.), the performance of the corresponding back-end algorithm developed using this data will be limited [2]. Second, the ground truth labels need to be sufficiently accurate and complete. For current data-driven machine learning methods, incorrect or incomplete labeling can even degrade the performance of the model; in object detection, for example, a large number of polluted labels has been shown to hurt accuracy [3].

The demand for high-quality data also encompasses requirements on the diversity and complexity of the captured scenes. Autonomous driving is currently concentrated in limited, geofenced areas. But complex environments, whether due to different lighting conditions, changing traffic flow, hazardous road conditions, complex vegetation, unexpected human movements or positions, or unfamiliar objects, all pose potential problems in real-world driving scenarios [4]. Datasets that capture richer and more diverse scenes, or provide different levels of annotation, can help improve the robustness of autonomous vehicles [5]. However, due to the impact of COVID-19, a large number of autonomous driving companies had to suspend their road testing in 2020, which led to a significant reduction in road test data. To help fill this gap, we launched PandaSet: an open-source dataset for training autonomous driving machine learning models. We hope that PandaSet will serve as a valuable resource to promote and advance research and development in autonomous driving and machine learning.
The main contributions of this paper are as follows:
• We present a multimodal dataset named PandaSet, which provides a complete kit of high-precision sensors covering a 360° field of view. It is the world's first open-source dataset to feature both mechanical spinning and forward-facing LiDARs and to be licensed for free without major restrictions on its research or commercial use.
• PandaSet features 28 different annotation classes for each scene as well as 37 semantic segmentation labels for most scenes. All of the annotations are labeled under multi-sensor fusion to ensure that each ground truth label is sufficiently accurate and precise.
• PandaSet includes data from complex metropolitan driving environments: traffic and pedestrians, construction zones, hills, and varied lighting conditions throughout the day and at night. It covers challenging driving conditions for full Level 4 and Level 5 driving autonomy. There is a high density of useful information, with many more objects in each frame than in other datasets.
• Based on PandaSet, we provide baselines for LiDAR-only 3D object detection, LiDAR-camera fusion 3D object detection, and LiDAR point cloud segmentation, together with a corresponding devkit for researchers to use the dataset directly.

Over the past ten years, data-driven approaches to machine learning have become increasingly popular, leading to significant progress in the development of 3D perception. We compare representative autonomous driving datasets in Table I, considering only those that include both camera-collected and LiDAR-collected data, as well as 3D annotations. The pioneering KITTI dataset [15], launched in 2012, is considered the first benchmark dataset collected for an autonomous driving platform. It features two stereo camera systems, a mechanical spinning LiDAR, and a GNSS/IMU device. However, in the KITTI dataset, an object is only annotated when it appears in the field of view (FOV) of the vehicle's front-view camera, and only daytime data is provided. The 2018 ApolloScape dataset [16] employs a sensor configuration similar to that of KITTI, but its LiDAR is mounted at a slant at the rear of the car. Such placement is primarily geared toward map data collection and provides static depth data rather than complete point cloud information. The nuScenes [1], Argoverse [17], Lyft L5 [18], Waymo Open [19], and A*3D [5] datasets launched in 2019 expanded the availability and quality of open-source datasets. Among them, nuScenes, Argoverse, and Lyft L5 added map data, and nuScenes additionally provides radar data. However, Argoverse only provides point cloud semantic segmentation for one category. Lyft L5 does not include nighttime data. A*3D only provides front-facing camera data and does not provide annotations for point cloud semantic segmentation. nuScenes only provides point clouds within 70 meters. Waymo's data collection vehicle was equipped with a 64-channel spinning LiDAR, but the point cloud provided is only within 75 meters. The Cirrus dataset [20], launched in 2020, was collected with a pair of long-range bi-pattern LiDARs with a 250-meter effective range in the front-facing direction. However, Cirrus did not employ any 360° FOV sensor, limiting its perception range to the front-facing view only.

Here we introduce our methods for data collection, sensor calibration, and data annotation, then provide a brief analysis of our dataset. We use a Chrysler Pacifica minivan mounted with a sensor suite of six cameras, two LiDARs, and one GNSS/IMU device to collect data in Silicon Valley.
Five cameras cover a 360° area, while a mechanical spinning LiDAR (Pandar64, 200m range at 10% reflectivity) and a forward-facing LiDAR (PandarGT, 300m range at 10% reflectivity) enable a much longer 3D object detection range to better support high-speed autonomous driving scenarios. See Figure 1 for the sensor layout, Table II for detailed sensor specifications, and Figure 2 for point cloud samples.

A frame-based data structure is used to encapsulate point cloud and image data. One image frame refers to a single picture taken after the camera is exposed to light. One point cloud frame refers to the set of points obtained after the LiDAR completes a scan cycle. The mechanical spinning LiDAR sweeps a 360° circle at 10Hz, while the forward-facing LiDAR uses MEMS mirror-based scanning technology, also at 10Hz. To achieve better data alignment between the LiDARs and cameras, we use a trigger board to expose each camera only when the mechanical spinning LiDAR scans across the center of that camera's FOV, ensuring the camera and the LiDAR capture the same objects at the same time. The timestamp of each image is the exposure time, calculated by adding the exposure duration to the exposure trigger time. The exposure trigger time is estimated from the timestamp at which the mechanical spinning LiDAR sweeps across the center of the camera's FOV. The exposure duration is estimated from test statistics; since all cameras use automatic exposure control, we use two different exposure duration parameters for daytime and nighttime to provide more accurate estimates. The timestamp of the point cloud frame is the time at which the LiDAR completes the scan cycle, and each point's individual timestamp is provided in the frame's per-point information. We use PTP for time synchronization of the two LiDARs, and the time source for the entire suite is the GPS clock.

To ensure high quality in a multi-sensor dataset, it is important to calibrate the extrinsics and intrinsics of each sensor. The sensor calibration in PandaSet includes intrinsic calibration of the cameras, extrinsic calibration of Pandar64-to-camera, extrinsic calibration of PandarGT-to-Pandar64, and extrinsic calibration of Pandar64-to-GNSS/IMU. Since all sensors are mounted rigidly on our vehicle and the entire collection process was completed within two days, we assume the intrinsic and extrinsic parameters of the sensors remain unchanged; in other words, PandaSet has only one set of sensor intrinsic and extrinsic parameters. Moreover, to implement motion compensation, we estimate the vehicle's ego motion at each point cloud timestamp by linear interpolation of the vehicle's GNSS/IMU data, which helps better align LiDAR scans with images as well as consecutive LiDAR scans. Results are shown in Figure 3. Note that all point cloud data in PandaSet is expressed in a global coordinate system rather than an ego coordinate system. Each sequence has its own definition of the global coordinate system, with the origin at the vehicle's start position.

All scenes are carefully selected to cover different driving conditions including complex urban environments (e.g. dense traffic, pedestrians, construction), uncommon object classes (e.g. construction vehicles, motorized scooters), a diversity of roads and terrain (e.g. sharp turns, hills), and different lighting conditions throughout the day and at night.
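To make the motion compensation step concrete, the sketch below shows one way a per-point ego pose could be obtained by linear interpolation of translations and spherical interpolation of rotations from timestamped GNSS/IMU poses, and then used to map points into the sequence's global frame. The pose layout and function names are assumptions for illustration, not the PandaSet devkit API; the released point clouds are already provided in the global frame, and the LiDAR-to-vehicle extrinsic is assumed to have been applied beforehand.

```python
# Minimal sketch (not the devkit API): interpolate the ego pose at each
# LiDAR point's timestamp from GNSS/IMU samples, then map the point into
# the sequence's global frame.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose(t, pose_times, positions, quats_xyzw):
    """Linearly interpolate translation and slerp rotation at time t."""
    i = np.clip(np.searchsorted(pose_times, t) - 1, 0, len(pose_times) - 2)
    t0, t1 = pose_times[i], pose_times[i + 1]
    alpha = (t - t0) / (t1 - t0)
    trans = (1 - alpha) * positions[i] + alpha * positions[i + 1]
    slerp = Slerp([t0, t1], Rotation.from_quat(quats_xyzw[i:i + 2]))
    return trans, slerp([t]).as_matrix()[0]

def compensate_to_global(points_xyz, point_times, pose_times, positions, quats_xyzw):
    """Transform vehicle-frame points into the global frame using each point's own timestamp."""
    out = np.empty_like(points_xyz)
    for k, (p, t) in enumerate(zip(points_xyz, point_times)):
        trans, R = interpolate_pose(t, pose_times, positions, quats_xyzw)
        out[k] = R @ p + trans  # pose is assumed to map vehicle frame -> global frame
    return out
```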
The diversity and complexity of the scenes help capture the complex, varied scenarios of real-world driving. Raw data is collected from two routes in Silicon Valley: (1) San Francisco, and (2) El Camino Real from Palo Alto to San Mateo.

PandaSet provides high-quality ground truth annotations of its sensor data, including 3D bounding box labels for all 103 scenes and point cloud semantic segmentation annotations for 76 scenes, covering both the mechanical spinning and forward-facing LiDARs. The annotation frequency is 10Hz. For 3D object detection, we annotate 3D bounding boxes for 28 object classes (e.g. cars, buses, motorcycles, traffic cones) with a rich set of class attributes related to activity, visibility, location, and pose. All cuboids contain at least 5 LiDAR points, except those for which we can accurately predict the size and location of occluded or distant objects (this exception only applies if there is at least one frame in which the same object has at least 5 LiDAR points). For the task of point cloud semantic segmentation, we annotate points with 37 different semantic labels (e.g. car exhaust, lane markings, drivable surfaces). Based on pixel-level sensor fusion technology, we combine multiple LiDAR and camera inputs into one point cloud, enabling the highest precision and quality annotations. By combining the strengths of a complete, high-precision sensor kit, particularly the long-range mechanical spinning and forward-facing LiDARs, we can annotate objects up to 300 meters away, significantly further than most other datasets. See Figure 6 for the average distance distribution of annotated objects and the per-frame object density. See Figure 5 (per-class counts of the main traffic participant classes Car, Pedestrian, Cyclist, Van, Truck, Tram, and Misc for PandaSet, PandaSet-Front, and KITTI) and Table III for a comparison of main traffic participant annotations with the KITTI and Waymo Open datasets. Furthermore, PandaSet has the largest number of label categories and the most elaborate label taxonomy among current open-source datasets, as shown in Table I. Rare object classes such as motorized scooters, rolling containers, animals (e.g. birds), smoke, and car exhaust can provide a useful resource for researchers to address the long tail in real-world driving scenarios, which is a major challenge for the safe deployment of autonomous vehicles. See Figure 7 for the statistics of the annotated categories in both 3D object detection and point cloud semantic segmentation.

We establish baselines on our dataset with methods for LiDAR-only 3D object detection, LiDAR-camera fusion 3D object detection, and LiDAR point cloud segmentation. We select the first 50 frames of each sequence as training data and the remaining 30 frames as test data to trade off robustness and uniformity of the evaluation. Since we have 103 sequences for 3D object detection and 76 sequences for LiDAR point cloud segmentation, there are 5150 training samples and 3090 test samples for 3D object detection, and 3800 training samples and 2280 test samples for LiDAR point cloud segmentation.
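The sample counts above follow directly from this per-sequence split; the short sketch below reproduces them. The sequence naming and frame indexing are placeholders for illustration, not the devkit's actual identifiers.

```python
# Minimal sketch of the per-sequence train/test split: each 8-second
# sequence has 80 frames at 10Hz; the first 50 go to training and the
# remaining 30 to testing.
FRAMES_PER_SEQUENCE = 80
TRAIN_FRAMES = 50

def split_sequence(sequence_id):
    """Return (train, test) lists of (sequence_id, frame_index) pairs."""
    frames = [(sequence_id, i) for i in range(FRAMES_PER_SEQUENCE)]
    return frames[:TRAIN_FRAMES], frames[TRAIN_FRAMES:]

detection_sequences = [f"{i:03d}" for i in range(103)]  # placeholder IDs for the 103 cuboid-labeled sequences
train, test = [], []
for seq in detection_sequences:
    tr, te = split_sequence(seq)
    train += tr
    test += te
print(len(train), len(test))  # 5150 training and 3090 test samples
```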
To establish the baseline for LiDAR-only 3D object detection, we retrained PV-RCNN [14], the top-performing network on both the KITTI and Waymo Open datasets. We use the publicly released code for PV-RCNN and restrict it to only three object classes (cars, pedestrians, and cyclists) to align with the KITTI 3D object detection evaluation benchmark. Since the coverage of the mechanical spinning LiDAR and the forward-facing LiDAR differs, we retrained two models with slight differences in network configuration. The detection range along the x-axis is set to [−70.4m, 70.4m] for the mechanical spinning LiDAR and [0m, 211.2m] for the forward-facing LiDAR. The other configuration parameters are shared by both models: the detection range is set to [−51.2m, 51.2m] along the y-axis and [−2m, 4m] along the z-axis, and the voxel size is set to (0.1m, 0.1m, 0.15m). For these baselines, both LiDARs' point cloud frames are expressed in the ego vehicle frame, whose x-axis points forward, y-axis points left, and z-axis points up.

The results on the test set are evaluated by average precision (AP), the commonly used evaluation benchmark with 11 recall positions [15]. We use a 0.7 IoU threshold for cars and a 0.5 IoU threshold for pedestrians and cyclists in the 3D evaluation. See Table V for detailed results. The distances in the table denote ranges (50m is 0-50m, 70m is 50-70m, and so on). The decline in detection performance for pedestrians and cyclists with the forward-facing LiDAR's model is likely due to an increased level of detection difficulty, as shown in Table IV. The difficulty ratings, LEVEL 1 and LEVEL 2, follow the Waymo Open dataset's difficulty definitions for the single-frame 3D object detection task [19]; examples with at most 5 LiDAR points are designated as the more challenging LEVEL 2.

To establish the baseline for LiDAR-camera fusion 3D object detection, we re-implement PointPainting [21], an effective sequential architecture that fuses point clouds with semantic information from images. We use DeepLabv3+ [22] to output per-pixel class scores for the image and choose PointRCNN [12] as the 3D object detection network because of the strong performance shown in the original paper. Most configuration parameters remain unchanged from the paper, with the exception that we train a DeepLabv3+ with 19 output classes from the Cityscapes [23] pretrained model provided with the publicly released code. After image semantic segmentation, we merge the 19 output classes into 4 object classes (cars, pedestrians, cyclists, and background). Only the object examples in the field of view of the forward-facing, long-focus camera are involved in training and inference. We use the mechanical spinning LiDAR's point cloud clipped to a 50m range as the LiDAR input to align with the KITTI dataset. The same AP evaluation benchmark and IoU thresholds are used for LiDAR-camera fusion 3D object detection as described in Section IV-A. See Table VI for detailed results.

Since PandaSet also provides ground truth labels for LiDAR point cloud segmentation, we establish its baseline using RangeNet53, using only the network without the post-processing step [24], based on the publicly released code. With the network itself unchanged, we merge the 37 classes in the original output into 14 primary classes for autonomous driving. At inference time, the class with the highest score is taken as the output class of each point. The commonly applied IoU metric [25] is used for evaluation. See Table VII for the detailed results.
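As a concrete illustration of the segmentation evaluation just described, the sketch below picks the highest-scoring class for each point and computes the per-class IoU metric [25] against ground truth labels merged into the primary classes. The 37-to-14 class mapping is left as a placeholder and the function names are assumptions for illustration, not PandaSet's actual taxonomy or tooling.

```python
# Minimal sketch: per-point argmax over class scores, then per-class IoU.
import numpy as np

def predict_labels(class_scores):
    """class_scores: (num_points, num_classes) network output -> per-point class ids."""
    return np.argmax(class_scores, axis=1)

def per_class_iou(pred, gt, num_classes):
    """Intersection-over-union for each class over all evaluated points."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return np.array(ious)

# Example usage: map fine labels to primary classes with a lookup table, then evaluate.
merge_map = np.zeros(37, dtype=np.int64)  # placeholder: fine label id -> primary class id
# merge_map[...] = ...                     # fill from the dataset's taxonomy
# coarse_gt = merge_map[fine_gt]
# ious = per_class_iou(predict_labels(scores), coarse_gt, num_classes=14)
# miou = np.nanmean(ious)
```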
In this paper, we introduce PandaSet, the world's first open-source dataset to include both mechanical spinning and forward-facing LiDARs and to be released free of charge for both research and commercial use. We present details on data collection and annotation. At a time when the barriers to data collection are still high, PandaSet was released in the hope of helping the broader research and developer community accelerate the safe deployment of autonomous vehicles. In the future, we plan to design evaluation metrics and build a public leaderboard to track research progress in 3D detection and point cloud segmentation, and to add map information to the dataset.

The dataset analysis and baseline experiments presented in this paper were supported by the National Key Research and Development Program of China (2018YFB0105000). From Hesai, we thank Ziwei Pi and Congbo Shi for hardware design and implementation. The PandaSet dataset was annotated by Scale AI, supported by Dave Morse, Kathleen Cui, and Shivaal Roy.

References
[1] nuScenes: A multimodal dataset for autonomous driving
[2] Robust camera LiDAR sensor fusion via deep gated information fusion network
[3] Labels are not perfect: Inferring spatial uncertainty in object detection
[4] Is it safe to drive? An overview of factors, metrics, and datasets for driveability assessment in autonomous driving
[5] A*3D dataset: Towards autonomous driving in challenging environments
[6] Multi-view 3D object detection network for autonomous driving
[7] SECOND: Sparsely embedded convolutional detection
[8] VoxelNet: End-to-end learning for point cloud based 3D object detection
[9] PointPillars: Fast encoders for object detection from point clouds
[10] Fast Point R-CNN
[11] STD: Sparse-to-dense 3D object detector for point cloud
[12] PointRCNN: 3D object proposal generation and detection from point cloud
[13] 3DSSD: Point-based 3D single stage object detector
[14] PV-RCNN: Point-voxel feature set abstraction for 3D object detection
[15] Are we ready for autonomous driving? The KITTI vision benchmark suite
[16] The ApolloScape dataset for autonomous driving
[17] Argoverse: 3D tracking and forecasting with rich maps
[18] Lyft Level 5 perception dataset 2020
[19] Scalability in perception for autonomous driving: Waymo Open Dataset
[20] Cirrus: A long-range bi-pattern LiDAR dataset
[21] PointPainting: Sequential fusion for 3D object detection
[22] Encoder-decoder with atrous separable convolution for semantic image segmentation
[23] The Cityscapes dataset for semantic urban scene understanding
[24] RangeNet++: Fast and accurate LiDAR semantic segmentation
[25] The PASCAL Visual Object Classes challenge: A retrospective