key: cord-0057757-x0jvljhe authors: Liu, Jheng-Lun; Tsai, Augustine; Fuh, Chiou-Shann; Huang, Fay title: MamboNet: Adversarial Semantic Segmentation for Autonomous Driving date: 2021-03-18 journal: Geometry and Vision DOI: 10.1007/978-3-030-72073-5_27 sha: 3bbe09341a8f3645efa39b3f2c860b1779c253f7 doc_id: 57757 cord_uid: x0jvljhe

Environment semantic maps provide essential information for autonomous vehicles to navigate in complex road scenarios. In this paper, an adversarial network that complements a conventional encoder-decoder semantic segmentation network is introduced. A newly proposed adversarial discriminator is piggybacked onto the segmentation network and is used to improve spatial continuity and label consistency in a scene without explicitly specifying the contextual relationships. The segmentation network itself serves as a generator that produces an initial segmentation map (pixel-wise labels). The discriminator then takes the labels and compares them with the ground truth data to further update the generator, enhancing the accuracy of the labeling result. Quantitative evaluations were conducted and show a significant improvement in spatial continuity.

Understanding the surroundings is critical to the safety of autonomous vehicles. The ability to recognize drivable areas and dynamic objects on the road enables safe navigation. Conventionally, camera frames are used to detect pedestrians, cars, motorcycles, roads, and sidewalks at the pixel level. The goal of this task is to produce semantic segmentations by assigning each input data point, namely a pixel, a unique class label. With the advancement of LiDAR sensor technology in recent years, many commercial products can detect points beyond 200 m. In this paper, we tackle the semantic segmentation task using a rotating LiDAR scanner. Compared to using camera frames alone, 3D point clouds obtained by LiDAR provide richer spatial and geometric information. However, the unstructured and sparse nature of the 3D data presents additional challenges.

The major contribution of this paper is a novel method that efficiently improves 3D LiDAR point cloud segmentation. We complement an end-to-end encoder-decoder segmentation pipeline with an adversarial network derived from the Generative Adversarial Network (GAN) [1]. The network improves spatial continuity and label consistency without explicitly specifying the contextual information. The adversarial network is applied only during model training and is removed during the online inference stage, so the complexity of the overall architecture is kept to a minimum.

Semantic segmentation is one of the most important deep learning applications. In 2D image segmentation, U-Net [2] pioneered the adoption of the encoder-decoder CNN architecture: the entire feature maps are transferred from the encoders to the corresponding decoders and concatenated with the up-sampled (via deconvolution) decoder feature maps. In order to reduce memory requirements, Kendall [3] proposed storing the max-pooling indices instead of the full feature maps, requiring fewer parameters for decoder reconstruction. Nowadays, the 360° revolving LiDAR is the most common laser scanner for autonomous driving. In order to address 3D point cloud segmentation with the aforementioned 2D segmentation paradigm, a common approach is to spherically project the 3D point cloud onto a 2D range image plane.
Targeting online frame-rate processing for practical applications, Wu [4] proposed a lightweight model derived from SqueezeNet to process the data in the 2D image plane. SqueezeSegV2 [5] extended V1 with a Context Aggregation Module (CAM) [6] to mitigate LiDAR sensor dropout issues. Synthetic point cloud generation using the GTA-V game engine with intensity rendering was also proposed to augment the training data. Due to the non-homogeneous spatial distribution of point clouds, SqueezeSegV3 [7] proposed Spatially-Adaptive Convolution (SAC), which adapts the convolution weights to the input data location. Milioto [8] extended Wu [4] from 3 label classes to 19, and replaced the 2D CRF with a GPU-based 3D nearest-neighbor search acting directly on the full, unordered point cloud. This last step helps retrieve labels for all points in the cloud, even if they are occluded in the range image. Cortinhal [9] gave the deep network a Bayesian treatment by introducing uncertainty measures for epistemic and aleatoric noise. Luc [10] introduced an adversarial network that discriminates whether a segmentation map comes from the ground truth or from the segmentation network, in order to mitigate higher-order label inconsistencies. Souly [11] introduced semi-supervised segmentation using weakly labelled data for the generator. The proposed MamboNet is inspired by many of these approaches, most notably Luc's adversarial network.

The projection method of [4, 5, 7-9] is applied for data preprocessing. Each raw 3D point in the 360° surroundings is spherically projected onto a 2D grid point of a range image, as illustrated in Fig. 1. A 3D point (x, y, z), given in a coordinate system with its origin at the sphere center, is projected to the image coordinates (θ_loc, ϕ_loc), calculated as

θ_loc = ⌊ arcsin(z / √(x² + y² + z²)) / Δθ ⌋,  ϕ_loc = ⌊ arctan2(y, x) / Δϕ ⌋,

where Δθ and Δϕ are the quantization steps. Each grid point stores a five-dimensional feature vector: three components for its associated 3D location (x, y, z), one for the intensity value, and one for the range value.

The main objective of applying an adversarial network is to enforce spatial continuity and label consistency. A conventional encoder-decoder network [3] creates a segmentation map (pixel-wise labeling) and then applies a conditional random field (CRF) to impose pixel-grouping constraints. We replace the CRF with a discriminator that is used only during training and can be dropped at inference to keep the network complexity minimal, similar to the bag-of-freebies concept in [12]. Our adversarial network (shown in Fig. 2) is similar to [10]: the discriminator takes two inputs, namely the predicted and the ground truth maps, each concatenated with the same 2D input data. The predicted map is generated by the encoder-decoder semantic segmentation network. A detailed version of the generator is shown in Fig. 3; each yellow block of the encoder is an Inception-like [13] module with a group of mixed kernel sizes and dilation rates. Each block has three parallel convolution layers whose outputs are concatenated and then summed with a fourth convolution layer. Between the encoder and the decoder, an Atrous Spatial Pyramid Pooling (ASPP) module is inserted to exploit multi-scale features and enlarge the receptive field. ASPP is employed to capture small street objects, such as pedestrians and cyclists.
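To make the encoder block structure concrete, the following is a minimal PyTorch sketch of one such Inception-like block. The kernel sizes, dilation rates, channel widths, and the exact wiring of the fourth (residual-style) convolution are illustrative assumptions, not the paper's exact Fig. 3 configuration; only the general pattern (three parallel convolutions, concatenation, then a sum with a fourth convolution) follows the description above.

```python
# Hypothetical sketch of one Inception-like encoder block; layer shapes are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class InceptionLikeBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Three parallel convolutions with mixed kernel sizes / dilation rates.
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU())
        self.branch2 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2), nn.LeakyReLU())
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 5, padding=2), nn.LeakyReLU())
        # 1x1 convolution that fuses the concatenated branch outputs.
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)
        # Fourth convolution acting as a parallel path that is summed in.
        self.branch4 = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.fuse(y) + self.branch4(x)  # concatenate, fuse, then sum with the fourth conv

# Example: one block applied to a 5-channel 64x2048 range image
feats = InceptionLikeBlock(5, 64)(torch.randn(1, 5, 64, 2048))
```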
In the decoder, the conventional transposed convolution layer is replaced with a low-computation pixel-shuffle layer, similar to super-resolution [14]. It generates up-sampled feature maps from a low-resolution feature map by converting information from the channel dimension to the spatial dimension: a feature map of size H × W × Cr² is converted to rH × rW × C, where H, W, C, and r are the height, width, number of channels, and up-sampling factor, respectively.

The discriminator is a VGG-based convolutional network shown in Fig. 4. The input size is 2048 × 64 × 6; the first two dimensions are the image width and height, and the third dimension comprises x, y, z, intensity, range, and class label. Each layer uses a 3 × 3 convolution kernel and is followed by 2 × 2 max pooling, except for the first layer. The sizes of the last three fully connected layers are 2048, 512, and 512, respectively.

The training, shown in Fig. 5, is based on the conditional GAN (cGAN) [15] architecture. The discriminator, D, learns to classify fake (predicted semantic map) versus real (ground truth map) inputs. Both the generator and the discriminator observe the same 2D range-image input. There are three loss terms. The first term is the general cross-entropy term that drives the segmentation network (generator), S(·), to predict an independent class label at each location (pixel-wise) of the output map. It is a weighted cross-entropy loss, expressed as

L_wce = −Σ_i Σ_c f_c · Y_i(c) · log S_i(c),

where Y and S are the one-hot vector maps for the ground truth and predicted labels, respectively. Due to the imbalanced nature of street-scene data, pedestrians and cyclists are seen far less often than cars; to mitigate the network's bias toward classes with a higher frequency of occurrence, a weighting factor f is added.

The second term is the Lovász-Softmax loss [16], which is used to improve the intersection-over-union (IoU), or Jaccard index. The convex Lovász extension of submodular losses relaxes the IoU hypercube constraint, where each vertex is a plausible combination of the class labels; therefore, the IoU score can be defined anywhere inside the hypercube. This term is expressed as

L_ls = (1/|C|) Σ_{c∈C} Δ_{J_c}(m(c)),  with m_i(c) = 1 − x_i(c) if c is the ground truth class of pixel i, and x_i(c) otherwise,

where x_i(c) ∈ [0, 1] is the pixel-wise predicted probability, y_i(c) is the ground truth label, and Δ_{J_c} denotes the Lovász extension of the Jaccard loss for class c. The loss penalizes wrong predictions.

The third term is the adversarial loss, which can be expressed as

L_adv = E_{y∼P_gt}[ log D(x, y) ] + E_z[ log(1 − D(x, G(x, z))) ],

where D is the discriminator, which produces Real and Fake binary outputs, G generates the predicted label map, x is the 2D range image, z is an optional random noise input, P_gt is the distribution of the ground truth labels y, and P_p is the distribution of the predicted labels. D tries to maximize the Jensen-Shannon divergence [1] between P_gt and P_p. On the contrary, G tries to minimize the same distribution divergence in order to make P_p indistinguishable from P_gt. The final objective is a min-max optimization of the sum of the cross-entropy, Lovász-Softmax, and adversarial terms, as shown in Eq. (5):

min_G max_D ( L_wce + L_ls + L_adv ).  (5)

The SemanticKITTI dataset [17] was used for algorithm evaluation. The dataset contains 28 classes, covering both non-moving and moving objects. Sequences 00-10, except 08, were used for training, and sequence 08 was used for validation. Sequences 11-21 were used for testing; however, the annotations for the test sequences are not available to the general public. In order to evaluate the performance, the labeled predictions were submitted to the official SemanticKITTI server for test results.
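To illustrate how the loss terms interact during training, here is a hedged PyTorch sketch of one adversarial training step in this cGAN-style setup. The `generator`, `discriminator`, class weights, and optimizers are placeholders (assumptions, not the paper's exact configuration); the discriminator is assumed to output raw logits, and the Lovász-Softmax term [16] is omitted for brevity, though it would be added to the generator loss in the same way as the other terms.

```python
# Hedged sketch of one adversarial training step: weighted cross-entropy plus
# an adversarial term for the generator, real/fake classification for the
# discriminator. All module and parameter names are illustrative placeholders.
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, opt_g, opt_d, x, y, class_weights):
    """x: (B, 5, H, W) range image tensor; y: (B, H, W) integer label map."""
    logits = generator(x)                            # (B, C, H, W) predicted class scores
    pred = torch.softmax(logits, dim=1)              # soft label map fed to the discriminator
    onehot = F.one_hot(y, logits.shape[1]).permute(0, 3, 1, 2).float()

    # Discriminator step: real = ground truth map, fake = predicted map,
    # both concatenated with the same range-image input.
    d_real = discriminator(torch.cat([x, onehot], dim=1))
    d_fake = discriminator(torch.cat([x, pred.detach()], dim=1))
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: weighted cross-entropy plus adversarial term.
    # (The Lovász-Softmax term would be added here analogously.)
    loss_wce = F.cross_entropy(logits, y, weight=class_weights)
    d_fake = discriminator(torch.cat([x, pred], dim=1))
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_g = loss_wce + loss_adv
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```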
The evaluation metric is the Jaccard index, or mean Intersection-over-Union (mIoU), as shown in Eq. (6):

mIoU = (1/C) Σ_{c=1..C} TP_c / (TP_c + FP_c + FN_c),  (6)

where TP_c, FP_c, and FN_c correspond to the numbers of true positive, false positive, and false negative predictions for class c, and C is the number of classes. In Table 1, our method not only outperforms most of the 3D point-wise methods [18-21], but is also superior to other projection-based methods, especially for small-object segmentation such as the person, bicyclist, and motorcyclist categories.

We compare our method with two other networks: the SalsaNext baseline [9], and SalsaNext augmented with a discriminator (a VGG-based convolutional network). We first trained the SalsaNext baseline using its open-source GitHub repository, and the resulting test mIoU is 57.2, slightly lower than the published result (59.5) [9]. The discrepancy may be due to the limited batch size [15] of our single-board training configuration. In Table 1, SalsaNext with a discriminator outperforms the baseline in 15 out of 19 categories, with a slightly improved mIoU of 57.9. Our method, MamboNet, achieves an mIoU of 58.5, an improvement of over one percentage point.

In Fig. 6, four blocks of segmented map results are shown for visual examination. Each block contains three maps: the top is the SalsaNext baseline, the middle is our method with the adversarial discriminator, and the bottom is the ground truth for comparison. In the top strip of the first example, there is a small mis-classified pink circle inside the dark purple region (road). In the middle strip of the same example, the circle disappears thanks to the discriminator's ability to enforce regional consistency. The same rectification can be observed in the second and third examples: the middle strips correctly identify the fence region (brown), while the top strips mis-classify part of the fence as building regions (yellow). Finally, in the fourth example, the top strip mis-classifies a portion of light green (terrain) as dark green (vegetation), whereas the middle strip correctly identifies the terrain area.

We augmented an encoder-decoder segmentation network with an adversarial network to improve semantic segmentation performance. The adversarial network implicitly enforces regional contextual continuity. Unlike conventional CRF and KNN post-processing techniques, the adversarial network is learnt only during offline training and is not active at test time. Therefore, the online computation is greatly reduced while comparable results remain attainable.
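As a concrete illustration of the evaluation metric in Eq. (6), the following is a minimal NumPy sketch that computes per-class IoU and mIoU from a confusion matrix. The function name and the guard against empty classes are illustrative choices, not taken from the paper or the benchmark code.

```python
# Minimal sketch of the mIoU metric of Eq. (6) from a confusion matrix;
# names are illustrative, not from the SemanticKITTI evaluation code.
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer label arrays of the same shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)   # rows: ground truth, cols: prediction
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp                       # predicted as c, actually another class
    fn = conf.sum(axis=1) - tp                       # actually c, predicted as another class
    iou = tp / np.maximum(tp + fp + fn, 1)           # per-class IoU, guarding empty classes
    return iou.mean()

# Example: mIoU on a toy 2-class label map
print(mean_iou(np.array([[0, 1], [1, 1]]), np.array([[0, 1], [0, 1]]), 2))
```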
References
[1] Generative adversarial networks. In: NIPS
[2] U-Net: convolutional networks for biomedical image segmentation
[3] Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding
[4] SqueezeSeg: convolutional neural nets with recurrent CRF for real-time road-object segmentation from a 3D LiDAR point cloud
[5] SqueezeSegV2: improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud
[6] Multi-scale context aggregation by dilated convolutions
[7] SqueezeSegV3: spatially-adaptive convolution for efficient point-cloud segmentation
[8] RangeNet++: fast and accurate LiDAR semantic segmentation
[9] SalsaNext: fast, uncertainty-aware semantic segmentation of LiDAR point clouds for autonomous driving
[10] Semantic segmentation using adversarial networks
[11] Semi-supervised semantic segmentation using generative adversarial network
[12] YOLOv4: optimal speed and accuracy of object detection
[13] Going deeper with convolutions
[14] Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network
[15] Image-to-image translation with conditional adversarial networks
[16] The Lovász-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks
[17] SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences
[18] RandLA-Net: efficient semantic segmentation of large-scale point clouds
[19] PointNet: deep learning on point sets for 3D classification and segmentation
[20] PointNet++: deep hierarchical feature learning on point sets in a metric space
[21] LatticeNet: fast point cloud segmentation using permutohedral lattices