key: cord-0044145-w7t1xtgq
authors: Zhou, Zhuoli; Chen, Shitao; Huang, Rongyao; Zheng, Nanning
title: Robust 3D Detection in Traffic Scenario with Tracking-Based Coupling System
date: 2020-05-06
journal: Artificial Intelligence Applications and Innovations
DOI: 10.1007/978-3-030-49161-1_28
sha: 9a2f92963e3b064af82232dd08b6b94a988cc09c
doc_id: 44145
cord_uid: w7t1xtgq

Autonomous driving takes place in complex scenarios, which requires detecting 3D objects in real time and accurately tracking them to obtain information such as location, size, trajectory, and velocity. Multi-object tracking (MOT) performance depends heavily on object detection: whenever the detector produces false alarms or missing alarms, tracking is directly affected. In this paper, we propose a coupling system that combines 3D object detection and multi-object tracking in one framework. We use the tracked objects as a reference in 3D object detection in order to locate objects, reduce false and missing alarms in a single frame, and weaken the impact of such alarms on tracking quality. Our method is evaluated on the KITTI dataset and proves effective.

In recent years, autonomous driving has attracted increasing attention and entered a period of rapid development. Object detection and multi-object tracking are key components of autonomous driving technology, with which autonomous vehicles understand the surrounding environment and make decisions. Autonomous driving systems usually integrate multiple sensors, and the rich information fused from these sensors enhances robustness. For example, a camera can capture the RGB texture of an object but cannot accurately measure its depth and 3D position, whereas a LiDAR sensor can measure the object's position in 3D space but not its texture. By combining camera and LiDAR information, the object's RGB texture and its position in 3D space can be obtained at the same time, and detection accuracy can be improved (Fig. 1).

As mentioned above, object detection algorithms have improved greatly, but false detections and missed detections remain pressing problems. Current mainstream object detectors consider only a single frame, ignoring the connection between consecutive frames. In practice, detection in autonomous driving runs continuously over time, so information from adjacent frames is beneficial: it can reduce false alarms and missing alarms in a single frame, and it can help locate objects in the current frame from their historical positions. Multi-object tracking, in turn, depends largely on the detection results, so better object detection improves multi-object tracking. Combining the two is therefore an interesting research direction. In this work, we propose a 3D detection and tracking coupling system that performs 3D object detection and multi-object tracking jointly. We take advantage of mature 2D object detectors and project the 2D boxes into 3D space to obtain the frustum range of the point cloud. Then we use the predicted 3D boxes of the tracked objects to locate and segment the object points within the frustum point clouds.
We then associate the detections in the current frame with the tracked objects, determine from the tracked objects whether a false alarm or a missing alarm has occurred, and handle it accordingly.

In this section, we briefly review object detection and multi-object tracking. In recent years, the emergence and development of region-of-interest (RoI)-based CNNs [6, 16] has produced high-confidence detection candidates and greatly improved 2D object detection performance. However, 2D object detection remains insufficient for more complex scenarios such as autonomous driving and robotics. Since collecting 3D data has become easier with LiDAR and other sensors, 3D object detection has drawn increasing attention, even though it is more challenging and complicated than its 2D counterpart. 3D object detection can be divided into two main categories: detection based on raw point clouds, and detection based on data conversion or combination.

Methods Working on Raw Point Clouds. VoxelNet-style methods [10, 21, 22] try to resolve the instance segmentation and T-Net alignment steps before prediction, but they suffer from object unawareness in 3D point clouds. PointNet [13] and PointNet++ [14] propose a novel network architecture that predicts and segments instances directly from raw 3D point clouds. PointPillars [10] explores a pillar shape instead of the mainstream voxel design to aggregate features. PointRCNN [18] generates 3D proposals directly from the point cloud in a bottom-up manner, achieving a higher recall than previous methods.

Methods Working on Data Conversion or Combination. In MV3D [3], the LiDAR point cloud is projected to a bird's-eye view (BEV) and then processed by a Faster R-CNN [16]. To generate more reliable 3D object proposals than MV3D, AVOD [9] fuses multi-modal features. Some existing methods also use RGB-D or RGB data to improve performance. Complexer-YOLO [19] is the first method to introduce semantic segmentation into 3D object detection, generating a voxelized semantic point cloud that is then used for 3D prediction. F-PointNet [12] segments the point cloud based on the 2D image detection result.

3D MOT systems aim to detect and associate the same objects across different frames. Most MOT systems follow the tracking-by-detection paradigm [2, 5, 17], which has two steps: 3D object detection and data association. The latter problem can be tackled from various perspectives, such as min-cost flow [5, 11], Markov decision processes (MDP) [20], and particle filtering [2]. However, most of these methods are not trained end to end, so many parameters (e.g., the weights of the costs) are heuristic, which makes them susceptible to local optima. DSM [5] proposes an end-to-end tracking and matching method by exactly solving a linear program. Optimizing the detection part through the tracking part is rarely considered. [8] boosts bottom-up object detection by integrating top-down knowledge from tracking, and the method is experimentally validated in inner-city traffic scenes. Inspired by that, we consider object detection and association as a whole: we use object detection to support association and improve detection with the help of tracking.

Our input data is processed in three steps. We remove the ground from the raw point cloud, detect 2D boxes on the image, and generate the frustum point cloud of each 2D box (sketched in code below). We then use the estimated 3D boxes of the tracked objects as a reference to segment the object points. After segmentation, we estimate the 3D boxes and push them, together with the 2D box information, into the tracking management. In the MOT part, detected objects are associated with tracked objects; we then handle the unmatched detections and tracked objects and predict the matched objects' boxes.
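As a rough illustration of the frustum generation step, the sketch below projects each ground-removed LiDAR point into the image (assuming the points have already been transformed into the rectified camera frame with a KITTI-style 3x4 projection matrix) and keeps the points that fall inside a 2D detection box. The paper describes the equivalent operation of projecting the 2D boxes into the LiDAR frame; the function name, projection convention, and box format here are our own assumptions, not the authors' code.

```python
import numpy as np

def points_in_frustum(points_cam, P, box2d):
    """Keep the ground-removed points whose image projection falls inside a 2D box.

    points_cam : (N, 3) LiDAR points already transformed into the rectified camera frame.
    P          : (3, 4) camera projection matrix.
    box2d      : (xmin, ymin, xmax, ymax) 2D detection box in pixels.
    """
    xmin, ymin, xmax, ymax = box2d
    pts_h = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])  # homogeneous coordinates
    uvw = pts_h @ P.T                                                   # project onto the image plane
    u, v, depth = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2], uvw[:, 2]
    in_front = depth > 0                                                # discard points behind the camera
    in_box = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    return points_cam[in_front & in_box]
```

In the full pipeline, the resulting frustum points would additionally be transformed into the world frame using the ego pose, as described in the next subsection.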
Frustum Point Cloud Generation. The framework of our system is shown in Fig. 2. Like the Frustum PointNets network, we first obtain the amodal 2D boxes and categories of objects on the RGB image through a 2D object detector. Based on the known camera projection matrix R_cam and the camera-to-LiDAR transformation matrix, we project the 2D bounding boxes into the LiDAR coordinate frame and obtain the frustum of each box. Before generating the frustum point clouds, we preprocess the point cloud to remove the ground [7]. The purpose of ground removal is to prevent ground points from being mistakenly assigned to the objects located by the tracked boxes. We then collect the non-ground points that fall in each frustum generated by the 2D boxes. Let c_i^t ∈ C^t denote the 2D detection results in frame t and f_i^t ∈ F^t denote the corresponding frustum point clouds. Using the vehicle position and heading and the transformation from the IMU coordinate system to the LiDAR coordinate system, the frustum point clouds are transformed into the world coordinate system. Meanwhile, we predict the state of each stably tracked object x_j ∈ X and obtain its position, orientation, and size in the world coordinate system. If a predicted object bounding box x̂_j^t ∈ X̂^t intersects a frustum point cloud f_i^t, the tracked object x_j is associated with the 2D detection c_i^t. Since the points of different objects are naturally separated in 3D space, we can segment an object's points by its predicted bounding box. To account for the prediction error, we expand the bounding box appropriately: a point P is judged to belong to an object if it lies inside the predicted box after the box's extent is enlarged according to the prediction uncertainty (a code sketch of this containment test is given below). Here P_x, P_y are the coordinates of the point; B_x, B_y, l, w, θ are the center, length, width, and heading angle of the box; cov_x, cov_y are the prediction errors of the box center, obtained from the covariance calculated by the extended Kalman filter when estimating the box state; and λ is a parameter controlling the influence of the uncertainty of the box center. To obtain the 3D bounding box from the segmented object points, we translate the points into a local coordinate system by subtracting the x, y means of their positions. Following Frustum PointNets [12], a preprocessing transformer network and a box regression PointNet [13] are used to estimate the object's amodal 3D bounding box. Since no tracking information is available for the first frame or for newly observed objects, we apply Frustum PointNets to create their 3D bounding boxes in those cases. The frustum point clouds f_i^t and the predicted 3D boxes x̂_j^t are not always in one-to-one correspondence. Therefore, the 2D detected objects c_i^t and the tracked 3D objects x_j need to be associated explicitly. For stably tracked objects, we apply an extended Kalman filter (EKF) to estimate the position and orientation of their boxes in the current frame and use these as input for point segmentation. In addition, we distinguish the disappearance and appearance of objects from false alarms and missing alarms, so that the latter two cases can be handled.
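To make the containment test concrete, here is a minimal sketch of one plausible version of it, assuming that cov_x and cov_y inflate the half-length and half-width of the box, respectively; the function name and the exact form of the expansion are our own choices rather than the paper's formula.

```python
import numpy as np

def point_in_predicted_box(point, box, lam=1.0):
    """Decide whether a point belongs to the object described by a predicted, expanded box.

    point : (px, py) point coordinates in the world frame.
    box   : dict with center (bx, by), size (l, w), heading theta, and the EKF
            center uncertainties (cov_x, cov_y).
    lam   : lambda, weight controlling how much the uncertainty inflates the box.
    """
    dx, dy = point[0] - box["bx"], point[1] - box["by"]
    c, s = np.cos(box["theta"]), np.sin(box["theta"])
    # Rotate the offset into the box frame (x along the length, y along the width).
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    # Expand the half extents by the predicted-center uncertainty before testing containment.
    return (abs(local_x) <= box["l"] / 2 + lam * box["cov_x"]
            and abs(local_y) <= box["w"] / 2 + lam * box["cov_y"])
```

The frustum points that pass this test form the object's segment; the rest are treated as background or as belonging to other objects.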
Objects Association. For reasons such as occlusion, two or more frustums projected from 2D boxes may overlap, or two predicted boxes may intersect the same frustum point cloud. We therefore need to match the 2D detection boxes c_i^t ∈ C^t with the 3D objects x_j ∈ X when they are not in one-to-one correspondence. Because object motion in video is continuous, the 2D bounding box of the same object in two adjacent frames has a similar position and size; moreover, objects that are occluded or farther away have small 2D bounding boxes. We therefore calculate the IoU between a 2D box c_i^t in the current frame and the 2D box c_j^{t-1} that was associated with the 3D object x_j in the previous frame, and match the 2D box c_i^t with the largest IoU to the tracked object x_j. For each associated pair of 2D object c_i^t and tracked 3D object x_j, we save the 2D bounding box, category, 3D bounding box, and current frame ID in the queue of the tracked object x_j in the tracking management. We apply an extended Kalman filter to predict the state of each stably tracked object in the current frame and update the full state of each object from its corresponding 3D box. States are predicted in the world coordinate system so that the ego vehicle's motion has no effect. To predict object states more accurately, we use a constant-acceleration, constant-angular-velocity model and formulate the state of a 3D object as a 10-dimensional vector (x, y, z, θ, vx, vy, vz, ax, ay, w), where vx, vy, vz and ax, ay are the velocities and accelerations, w is the angular velocity of the object, and Δt is the time interval between two frames. The motion-model prediction equations are

x_estimate = x + Δt·vx + Δt²·ax/2,  vx_estimate = vx + Δt·ax
y_estimate = y + Δt·vy + Δt²·ay/2,  vy_estimate = vy + Δt·ay
z_estimate = z + Δt·vz,  θ_estimate = θ + Δt·w

As a result, the elements x, y, z, θ of the predicted state, together with the 3D bounding box size estimated in the last frame, are used as input for point segmentation. Since existing objects may leave the field of view and new objects may enter the detection area, we assign each object a status to manage the tracked objects. Objects tracked for more than five frames are considered stably tracked; we track them continuously, record their trajectories, and predict their positions and headings. If a stably tracked object is lost for fewer than three frames, we keep predicting its position and give it a hypothetical trajectory. If the object is associated again, we consider that a missing alarm has occurred in detection. A missing alarm may be caused by a false-negative 2D detection, or by there being no points in the region where the predicted object box intersects the frustum. In this case, we first apply the RTS (Rauch-Tung-Striebel) algorithm to smooth the object's trajectory and obtain a more accurate location in the lost frame. Assume an object X_ST is tracked over frames (0, N) but lost in frame j ∈ (0, N), and [Z_1, ..., Z_{j-1}, Z_{j+1}, ..., Z_N] are the object detection inputs in the other frames; we need to compute the optimal state estimate X̂_j in frame j. The RTS smoother is essentially a backward filtering pass, so we store the state vector and covariance matrix estimates and predictions of the extended Kalman filter for every frame. The smoother is initialized as

X_S(N | N) = X_F(N | N),  P_S(N | N) = P_F(N | N),

where subscript S denotes optimal smoothing and subscript F denotes the Kalman filter. In frame j, the smoothing gain of the RTS algorithm is

A_j = P_F(j | j) · Φ_{j+1,j}^T · P_F(j+1 | j)^{-1},

where Φ_{j+1,j} is the Jacobian matrix of the extended Kalman filter. The smoothed state vector and covariance matrix in frame j are then updated as

X_S(j | N) = X_F(j | j) + A_j [X_S(j+1 | N) − X_F(j+1 | j)],
P_S(j | N) = P_F(j | j) + A_j [P_S(j+1 | N) − P_F(j+1 | j)] A_j^T.

A compact code sketch of this prediction and backward smoothing pass is given below.
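The following is a minimal sketch of the prediction and smoothing steps, assuming the state ordering (x, y, z, θ, vx, vy, vz, ax, ay, w); because this motion model is linear, the EKF Jacobian Φ coincides with the transition matrix. The function and variable names are ours, and the snippet is a generic Kalman-prediction/RTS-smoothing illustration rather than the authors' implementation.

```python
import numpy as np

# Assumed state ordering: [x, y, z, theta, vx, vy, vz, ax, ay, w]
def transition_matrix(dt):
    """Constant-acceleration (x, y), constant-velocity (z), constant-angular-velocity (theta) model."""
    F = np.eye(10)
    F[0, 4] = dt; F[0, 7] = 0.5 * dt ** 2   # x     <- vx, ax
    F[1, 5] = dt; F[1, 8] = 0.5 * dt ** 2   # y     <- vy, ay
    F[2, 6] = dt                            # z     <- vz
    F[3, 9] = dt                            # theta <- angular velocity w
    F[4, 7] = dt                            # vx    <- ax
    F[5, 8] = dt                            # vy    <- ay
    return F

def ekf_predict(x, P, dt, Q):
    """One prediction step; the model is linear, so the Jacobian equals the transition matrix."""
    F = transition_matrix(dt)
    return F @ x, F @ P @ F.T + Q

def rts_smooth(x_filt, P_filt, x_pred, P_pred, dt):
    """Standard RTS backward pass over stored filter outputs.

    x_filt[k], P_filt[k] : filtered estimate X_F(k | k) and its covariance.
    x_pred[k], P_pred[k] : one-step prediction X_F(k | k-1) and its covariance.
    Returns the smoothed estimates X_S(k | N) and covariances P_S(k | N).
    """
    N = len(x_filt)
    x_s, P_s = [None] * N, [None] * N
    x_s[-1], P_s[-1] = x_filt[-1], P_filt[-1]            # initialise with the last filtered estimate
    F = transition_matrix(dt)                            # plays the role of Phi_{k+1,k}
    for k in range(N - 2, -1, -1):
        A = P_filt[k] @ F.T @ np.linalg.inv(P_pred[k + 1])          # smoothing gain A_k
        x_s[k] = x_filt[k] + A @ (x_s[k + 1] - x_pred[k + 1])
        P_s[k] = P_filt[k] + A @ (P_s[k + 1] - P_pred[k + 1]) @ A.T
    return x_s, P_s
```

In the framework above, x_pred and P_pred are the prediction values already stored by the EKF for every frame, so no new measurement is needed in the lost frame to smooth its state.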
From the smoothed state estimate X_S(j | N), we obtain a more accurate position and heading of the object in the lost frame and use it as the prediction input to segment the object's point cloud and generate its 3D box. If no points are segmented, we use the predicted value as the result. In addition, if a stably tracked object is missing for more than five frames, we assume the object has left the detection range and stop tracking it. In this way we distinguish object disappearance from missed detection and supply the missing detections. Newly detected and unassociated objects are given the status trackable. If an object disappears after being detected in only one frame, we consider it a false alarm and discard it. Objects that are continuously detected have their status updated to tracked. In this section, we present the experiments we performed and analyze the results. We evaluate our method, compare it with other multi-object tracking methods, and then show examples demonstrating that it works. Our method is tested on the challenging KITTI benchmark [1]. We choose Recurrent Rolling Convolution [15] as the 2D detection input and train the F-PointNets network and the box estimation network of our framework on the KITTI 3D detection dataset. Since our method needs information from consecutive frames, we evaluate the proposed detection and multi-object tracking framework on the tracking dataset. To evaluate performance, we adopt the standard MOT metrics and compare our approach with three MOT methods that also use LiDAR. Table 1 shows the multi-object tracking evaluation results of our method and other methods on the test set. Our method performs close to FANTrack on MOTA, MT, PT, and ML, indicating similar tracking accuracy and a similar number of lost targets. Our method has a lower FRG value, which shows that it is effective at distinguishing and supplying missing alarms. The relatively low MOTP value may be because the points of tracked objects segmented only from predicted boxes are sometimes not precise enough. In addition, to reduce the impact of the detector on performance, we compare our method with AB3DMOT using F-PointNets as the 3D detector on the KITTI training set. Table 2 shows the results of AB3DMOT with PointRCNN as input, AB3DMOT with F-PointNets as input, and our method. The detector clearly has a great impact on the tracking results, and our method outperforms AB3DMOT on MOTA and MOTP when a similar detector is used. Fig. 3. An instance in which a missing alarm and a false alarm are detected and correctly handled. The green 2D boxes are true positive results; the red 2D boxes in frames t, t+1, and t+3 are discarded because no object points lie in their frustums. The green 3D boxes are true positive 3D detections obtained from the 2D detections and the predicted 3D boxes. The purple 3D boxes in frames t+3 and t+4 are supplied objects whose 2D detections are missing. The blue 3D boxes in frames t+3 and t+4 are the ground truth. (Color figure online) Figure 3 shows an example of how missing alarms and false alarms are handled in our framework. The object with ID 3 is tracked stably before frame t+3 but loses its 2D detection in frames t+3 and t+4. We keep tracking the object and predicting its position and trajectory. It is associated again in frame t+5 and the frames after, so we conclude that a missing alarm occurred for the object with ID 3.
We smooth the predicted locations in frames t+3 and t+4 using the object's locations after frame t+5 and supply its 3D detections in frames t+3 and t+4. The purple 3D boxes are our results and the blue 3D boxes are the ground truth. In frames t, t+1, and t+3, the red 2D boxes are discarded as false alarms because they lack LiDAR points.

In this paper, we presented a 3D detection and tracking coupling framework that achieves 3D detection and multi-object tracking from 2D detection results and effectively reduces missing alarms and false alarms in a single frame. Our method still has room for improvement. Although our framework is independent of the 2D detector, a more precise 2D detector can bring superior performance, and the point segmentation method and 3D box estimation network in our framework can also be further improved. We aim to improve object detection performance and to focus more on the connection between tracking and object detection, because both are important to autonomous driving. Our future work will include improving the 2D detector and box estimation network, making better use of the tracking information, and more fully fusing camera and LiDAR data.

References
[1] Are we ready for autonomous driving? The KITTI vision benchmark suite
[2] Online multiperson tracking-by-detection from a single, uncalibrated camera
[3] Multi-view 3D object detection network for autonomous driving
[4] FANTrack: 3D multi-object tracking with feature association network
[5] End-to-end learning of multi-sensor 3D tracking by detection
[6] Fast R-CNN
[7] Fast segmentation of 3D point clouds for ground vehicles
[8] Tracking and classification of arbitrary objects with bottom-up/top-down detection
[9] Joint 3D proposal generation and object detection from view aggregation
[10] PointPillars: fast encoders for object detection from point clouds
[11] FollowMe: efficient online min-cost flow tracking with bounded memory and computation
[12] Frustum PointNets for 3D object detection from RGB-D data
[13] PointNet: deep learning on point sets for 3D classification and segmentation
[14] PointNet++: deep hierarchical feature learning on point sets in a metric space
[15] Accurate single stage detector using recurrent rolling convolution
[16] Faster R-CNN: towards real-time object detection with region proposal networks
[17] Beyond pixels: leveraging geometry and shape cues for online multi-object tracking
[18] PointRCNN: 3D object proposal generation and detection from point cloud
[19] Complexer-YOLO: real-time 3D object detection and tracking on semantic point clouds
[20] Learning to track: online multi-object tracking by decision making
[21] SECOND: sparsely embedded convolutional detection
[22] VoxelNet: end-to-end learning for point cloud based 3D object detection