title: Accurate Bounding-box Regression with Distance-IoU Loss for Visual Tracking
authors: Yuan, Di; Shu, Xiu; Fan, Nana; Chang, Xiaojun; Liu, Qiao; He, Zhenyu
date: 2020-07-03
DOI: 10.1016/j.jvcir.2021.103428

Most existing trackers rely on a classifier combined with multi-scale estimation to estimate the target state. Consequently, and as expected, trackers have become more stable while tracking accuracy has stagnated. Although trackers adopt a maximum-overlap strategy based on an intersection-over-union (IoU) loss to mitigate this problem, the IoU loss has an inherent defect: when one bounding box is completely contained within (or completely outside) another, the objective function can no longer be optimized, which makes it very challenging to estimate the target state accurately. Accordingly, in this paper we address this problem by proposing a novel tracking method based on a distance-IoU (DIoU) loss, in which the proposed tracker consists of a target estimation component and a target classification component. The target estimation component is trained to predict the DIoU score between the target ground-truth bounding box and the estimated bounding box. The DIoU loss retains the advantages of the IoU loss while additionally minimizing the distance between the center points of the two bounding boxes, thereby making target estimation more accurate. Moreover, we introduce a classification component that is trained online and optimized with a Conjugate-Gradient-based strategy to guarantee real-time tracking speed. Comprehensive experimental results demonstrate that the proposed method achieves tracking accuracy that is competitive with state-of-the-art trackers while running at real-time speed.

Visual target tracking is a popular and challenging task. A tracker must learn a target appearance model from the information given in the initial frame, and the learned model needs strong generalization ability with respect to changes in the target's appearance. The tracking task can be divided into two parts: target classification and target estimation. For target estimation, we train the estimation component to predict the Distance-IoU (DIoU) score between the target ground-truth and an estimated bounding box. In each frame, the final target bounding box is determined by maximizing the predicted DIoU score between a set of proposals and the target bounding box in a reference frame. It should be noted that our DIoU score differs from the IoU score in some cases (shown in Figure 2). Specifically, the loss of our DIoU-based network is higher than that of the IoU-based network when the centers of the two bounding boxes do not coincide, which pushes the two boxes to quickly reach a state in which their centers overlap. In other words, it is easier for DIoU-based trackers to obtain accurate tracking results. Moreover, for the online tracking phase we choose a simple but effective two-layer fully convolutional network as the target classification component, since it provides high robustness in complex tracking scenarios. To ensure real-time tracking speed, we follow the ATOM [14] tracker, which addresses the problem of efficient online optimization by employing a Conjugate-Gradient-based method.
The process of our online tracking phase is simple: following model initialization, the target classification, target estimation, and model updating processes execute alternately until the entire tracking task is complete. The main contributions are summarized as follows:

• We formulate a novel DIoU network-based bounding-box regression model for target tracking. While preserving the advantages offered by the IoU network in tracking tasks, the DIoU network can be deployed to directly minimize the distance between the ground-truth bounding box and the predicted bounding box, an approach that allows the tracker to obtain more accurate tracking results.

• We adopt a Conjugate-Gradient-based strategy to ensure that the optimization problem in the target classification component can be addressed efficiently online.

• Extensive experiments verify that our tracking method is competitive with other state-of-the-art trackers on seven challenging datasets: OTB100 [16], UAV123 [17], TrackingNet [18], LaSOT [19], GOT10k [20], VOT2018 [21] and VOT2019 [22].

At present, most target trackers fall either under the detection-based framework or under the template-matching-based framework. Trackers based on the detection framework treat tracking as a classification problem and distinguish the target from the background by modeling the target appearance. In contrast, trackers based on the template-matching framework typically use a Siamese network to determine the target location through spatial cross-correlation, which can be used to find the candidates most relevant to the target. Many tracking approaches combine tracking and detection in some respect [2, 23, 24, 25, 26, 27, 28, 29]. In [24], the TLD framework divides the tracking task into tracking, learning, and detection sub-tasks; these three parts complement each other to complete the tracking task. In [23], Wang et al. demonstrate that tracking different objects can be formulated as a network-flow mixed-integer program. Lan et al. [28] propose to track the target in a frame-by-frame manner by exploring the temporal, spatial, and multi-camera relationships of detection hypotheses in nearby frames. Other trackers integrate a detector within a particle filter tracking framework [10, 30]. Among these detection-based tracking methods, DCF-based methods have achieved promising performance [2, 31, 32, 33]. These DCF-based methods learn a correlation filter from the target ground-truth provided in the initial frame to discriminate between target and background. In [31], Henriques et al. derive a kernelized correlation filter with exactly the same complexity as its linear counterpart, while also proposing a fast multi-channel correlation filter; this allows the KCF tracker to achieve promising accuracy and fast tracking speed compared to other trackers of the same period. However, DCF-based trackers cannot model the background well. To resolve this issue, Kiani et al. [2] propose a background-aware correlation filter-based tracker that models both target and background. By introducing a temporal regularizer into the DCF framework, competitive tracking results have also been achieved [11, 32].
To improve tracking accuracy, a group feature selection strategy has been proposed under the DCF-based tracking framework; it selects group features across channels and spatial dimensions to determine the structural correlation between feature channels and the filter system [1]. The DCF-based trackers mentioned above can only determine the target center location; most of them use a multi-scale search strategy to predict the target state, which usually yields relatively inaccurate results [34, 35]. The recently proposed ATOM [14] tracker incorporates IoU modulation and IoU prediction to improve tracking performance. However, the IoU loss has an inherent defect: when one bounding box is completely inside the other, the IoU loss no longer changes, even though the centers of the two boxes do not necessarily overlap [36]. Accurate target bounding-box positioning is very important for tracking, so further improvement of IoU-based trackers is required.

Template matching-based tracking frameworks typically use a Siamese network as the similarity measurement network [37, 38, 39, 13, 40, 41]. As the first Siamese network-based tracker, SINT [37] simply matches the initial target with proposals selected in the current frame and takes the most similar proposal as the tracking target. Despite its simple network structure, the SINT tracker achieves effective tracking performance, but suffers from a very slow tracking speed. In [38], the SiamFC tracker was proposed with the aim of achieving both high tracking accuracy and fast tracking speed. Following this work, many trackers extend the SiamFC architecture for the tracking task [13, 42, 43, 44, 45]. The SiamRPN [13] tracker incorporates a region proposal network (RPN) into the Siamese tracking framework; as a result of the region proposal refinement, the whole tracking process is simplified without affecting tracking performance. Both the DaSiamRPN [6] tracker and the SiamRPN++ [45] tracker, as improved versions of the SiamRPN [13] tracker, improve the tracking performance in different ways. Although Siamese-based trackers provide an acceptable balance between tracking speed and accuracy, most of them struggle to classify targets effectively due to the lack of online model updating. Unlike these trackers, our proposed tracker not only trains the model offline but also updates the model during the online tracking phase, which allows the target state to be estimated accurately even when the target appearance changes dramatically.

In target tracking tasks, a rectangular bounding box is usually used to indicate the target location. Accurate bounding-box estimation is a complex task that depends strongly on the target location and scale: the target location determines the bounding-box center, while the target scale determines whether the bounding box can accurately fit the target state. As a result, many trackers rely on extensive offline training to obtain sufficient priors [13, 6]. Notably, the DaSiamRPN [6] tracker obtains sufficient prior knowledge from offline training and therefore achieves promising bounding-box regression results. However, these trackers still suffer when they encounter the target classification problem.
Different from the Siamese-based tracking methods, the ATOM [14] tracker and some of its variants [46, 47] train a target estimation strategy to calculate the IoU overlap scores between proposals and the reference target. By maximizing the IoU overlap score, the ATOM [14] tracker can predict a compact bounding box for the tracking target. The GIoU [48] loss has also been proposed to tackle the gradient-vanishing issue, but it suffers from slow convergence and inaccurate regression. In comparison, the DIoU [15] loss offers faster convergence and better bounding-box regression accuracy. Accordingly, we utilize the DIoU loss to improve the IoU-based tracker and achieve competitive tracking results.

We follow the process used in ATOM [14] and divide the tracker into two components: an offline-learned target estimation component and an online-learned target classification component. In other words, we separate the tracking problem into two sub-problems (classification and estimation). The whole tracking architecture is shown in Figure 3.

Figure 3: Architecture of the proposed method for the target tracking task. The DIoU predictor is pre-trained on large training sets to predict the DIoU scores of the target candidates. The target classifier is trained online to output the corresponding confidence map.

As described in the ATOM [14] tracker, target state estimation aims to accurately predict the target bounding box starting from a rough initial estimate. The ATOM tracker uses an improved IoUNet [49] for target estimation; that is, given an image $x$ and a bounding-box estimate $B$ of the target, the IoUNet calculates the IoU score between the estimated bounding box $B$ and the target ground-truth $B_{gt}$. The prediction network pools the region of the image $x$ specified by the estimated bounding box, producing a fixed-size feature map. The ROI pooling operation is differentiable and can therefore be used to refine the predicted bounding box by maximizing the IoU score. However, IoU-based bounding-box regression for tracking has an obvious drawback: when one bounding box is located entirely within another, the objective function based on the IoU loss can no longer be optimized (see the right sub-figure of Figure 4), yet the predicted bounding box may not be optimal; in other words, the tracking results are not accurate. We therefore propose an improved IoU loss-based bounding-box regression method to ensure tracking accuracy. We take inspiration from the DIoU [15] loss, recently proposed for the object detection task, as it results in much faster training convergence than the IoU loss. The IoU-based loss function can be written in the following general form:

$$\mathcal{L} = 1 - IoU(B, B_{gt}) + P(B, B_{gt}), \quad (1)$$

where $P(B, B_{gt})$ is a penalty term. When the penalty term $P(B, B_{gt}) = 0$, the loss function degenerates into the IoU loss. The DIoU score can be calculated as follows:

$$DIoU = IoU - \lambda \frac{\rho^2(b, b_{gt})}{c^2}, \quad (2)$$

where $b$ and $b_{gt}$ are the central points of $B$ and $B_{gt}$, $c$ is the diagonal length of the smallest enclosing bounding box $C$ that covers $B$ and $B_{gt}$ (see Figure 4), and $\lambda$ is a parameter that balances the IoU score and the penalty term. In general, the DIoU score is always lower than the IoU score, and they are equal if and only if the centers of the two bounding boxes overlap. This also means that the predicted bounding box obtained from the DIoU score lies closer to the center of the reference bounding box. The DIoU loss can be defined as follows:

$$\mathcal{L}_{DIoU} = 1 - IoU + \lambda \frac{\rho^2(b, b_{gt})}{c^2}, \quad (3)$$

where $\rho(\cdot)$ is the Euclidean distance.
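To make the loss concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(3) for axis-aligned boxes. The function name, the $(x_1, y_1, x_2, y_2)$ box format, and the small $\epsilon$ stabilizer are our own illustrative choices and not the authors' released code.

```python
import torch

def diou_loss(pred, target, lam=1.0, eps=1e-7):
    """DIoU loss (Eq. (3)) for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4).
    Returns one loss value per box pair. Illustrative sketch, not the paper's code."""
    # Intersection area
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    # Union area and IoU
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centers: rho^2(b, b_gt)
    cx_p = (pred[:, 0] + pred[:, 2]) / 2
    cy_p = (pred[:, 1] + pred[:, 3]) / 2
    cx_t = (target[:, 0] + target[:, 2]) / 2
    cy_t = (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2

    # Squared diagonal length of the smallest enclosing box C: c^2
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Eq. (3): L = 1 - IoU + lambda * rho^2 / c^2
    return 1.0 - iou + lam * rho2 / c2
```

Note that when the predicted box lies entirely inside the ground-truth box, the IoU term is constant under small translations, but the $\rho^2/c^2$ term still provides a gradient that pulls the predicted center toward the ground-truth center, which is exactly the behaviour contrasted with the plain IoU loss in Figure 4.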
The DIoU score directly reflects the degree of overlap between $B$ and $B_{gt}$, as well as whether the centers of the two bounding boxes coincide. The penalty term $\lambda \frac{\rho^2(b, b_{gt})}{c^2}$ directly minimizes the distance between the central points of the two bounding boxes. When $\lambda = 0$, the DIoU loss degenerates to the IoU loss. In addition, the value of $\lambda$ only affects the training speed of the model and has no obvious influence on the tracking performance of the trained model; we therefore set $\lambda = 1$ in this paper. The DIoU-based network is trained by minimizing the DIoU loss between candidate samples and reference targets. The target bounding box is then predicted by maximizing the predicted DIoU overlap score.

Although the target estimation component can provide an accurate bounding box for the tracking task, it cannot robustly distinguish between the target and the background. In this section, we introduce a robust target classifier that can accurately separate target and background, regardless of whether the tracking scene is disturbed. Different from target estimation, the target classification component is trained directly online and used to predict the target confidence score. Following [14], the target classification component we use can be defined as

$$f(z; w) = \phi_2\big(w_2 * \phi_1(w_1 * z)\big), \quad (5)$$

where $z$ denotes the feature map of the target, $w = \{w_1, w_2\}$ are the network parameters, $*$ denotes convolution, and $\phi_1, \phi_2$ are the activation functions in the network. In order to achieve a fast tracking speed, we follow the DCF-based trackers [2, 32] and build an $\ell_2$ error-based model:

$$L(w) = \sum_{j} \big\| f(z_j; w) - y_j \big\|^2, \quad (6)$$

where $z_j$ is the feature map of the $j$-th training sample and $y_j$ is the corresponding Gaussian-shaped label. Generally, Eq. (6) is optimized by stochastic gradient descent, which makes the tracking speed slow. Similar to [14], the objective function (6) can be formulated as the squared $L_2$ norm of a residual vector, $L(w) = \|r(w)\|^2$. Using a first-order Taylor expansion, $r(w + \Delta w) \approx r(w) + \frac{\partial r}{\partial w}\Delta w$, and applying the quadratic Gauss-Newton approximation, we obtain

$$\tilde{L}(\Delta w) = \Delta w^{\mathrm{T}} J^{\mathrm{T}} J \Delta w + 2\, \Delta w^{\mathrm{T}} J^{\mathrm{T}} r + r^{\mathrm{T}} r, \quad (7)$$

where $J = \frac{\partial r}{\partial w}$ and $\Delta w$ is the increment in the parameters $w$. The Gauss-Newton problem (7) is a positive definite quadratic problem that can be solved efficiently with the Conjugate-Gradient method.

The proposed DIoU prediction network is pre-trained offline using labeled training images as in Eq. (4). Similar to [14], we use the LaSOT [19], TrackingNet [18], and COCO [50] datasets as training data. Each training image pair contains one template image and one test image. For the template image, an image patch centered on the target is cropped as the template sample; its size is 5 times the target size in both width and height. For the test image, we crop a similar image patch and add perturbations to simulate a real tracking scene. The cropped image patches are resized to the same size to train the network. We fix all weights of our backbone network during training and use an L2 loss to train the DIoU-based predictor. The predictor is trained for 60 epochs with a batch size of 64, using the ADAM optimizer with an initial learning rate of $10^{-3}$ and a decay factor of 0.2 every 15 epochs. Our experiments are implemented in Python using PyTorch, and the tracking speed is over 50/40 fps with the ResNet18/ResNet50 backbone on an NVIDIA GTX 2080Ti GPU.
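As an illustration of how a single Gauss-Newton step of Eq. (7) can be solved with the Conjugate-Gradient method, the sketch below applies CG to the normal equations $J^{\mathrm{T}}J\,\Delta w = -J^{\mathrm{T}}r$ using autograd Jacobian-vector products. The function names, the iteration counts, and the use of `torch.autograd.functional.jvp`/`vjp` are our own illustrative choices, not the authors' implementation.

```python
import torch
from torch.autograd.functional import jvp, vjp

def gauss_newton_cg_step(residual_fn, w, cg_iters=10):
    """One Gauss-Newton step for L(w) = ||r(w)||^2 (Eq. (7)), solved with Conjugate
    Gradient. `residual_fn` maps a flattened 1-D parameter vector w to the residual
    vector r(w); J denotes dr/dw. Illustrative sketch only."""
    def jtj_matvec(v):
        # (J^T J) v: forward-mode product J v, then reverse-mode product J^T (J v).
        _, jv = jvp(residual_fn, w, v)
        _, jtjv = vjp(residual_fn, w, jv)
        return jtjv

    # Right-hand side b = -J^T r(w) of the normal equations (J^T J) dw = b.
    r = residual_fn(w)
    _, jtr = vjp(residual_fn, w, r)
    b = -jtr

    # Standard Conjugate-Gradient iterations, starting from dw = 0.
    dw = torch.zeros_like(w)
    res = b.clone()
    p = res.clone()
    rs_old = torch.dot(res, res)
    for _ in range(cg_iters):
        Ap = jtj_matvec(p)
        alpha = rs_old / (torch.dot(p, Ap) + 1e-12)
        dw = dw + alpha * p
        res = res - alpha * Ap
        rs_new = torch.dot(res, res)
        if rs_new.sqrt() < 1e-8:
            break
        p = res + (rs_new / rs_old) * p
        rs_old = rs_new
    return w + dw
```

The appeal of Conjugate Gradient here is that $J^{\mathrm{T}}J$ is positive semi-definite and never needs to be formed explicitly; only matrix-vector products are required, which keeps each online update cheap.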
To evaluate the performance of the proposed tracking method, we compare our tracker with several state-of-the-art trackers on 7 challenging datasets: OTB100 [16], UAV123 [17], TrackingNet [18], LaSOT [19], GOT10k [20], VOT2018 [21] and VOT2019 [22]. Once the DIoU estimation network has been trained offline, the online tracking process of the proposed method can be subdivided into four steps: model initialization, target classification, target estimation, and model update.

Model Initialization. We use ResNet as our backbone network to extract features. Given the initial target state, an image patch 5 times the size of the target is cropped, resized to 288 × 288, and used for feature extraction.

Target Classification. Following the ATOM [14] tracker, the target classification network in our tracker consists of a 2-layer CNN. The first layer is a 1 × 1 convolutional layer ($w_1$), while the second layer adopts a 4 × 4 kernel ($w_2$) with a single output channel. Here $\phi_1(t) = t$ is an identity transformation, and $\phi_2(t) = t$ for $t \ge 0$ and $\phi_2(t) = \alpha(e^{t/\alpha} - 1)$ for $t < 0$ ($\alpha = 0.05$ in this paper); $\phi_2$ is continuously differentiable and is thus well suited to optimization. In the first frame, we generate 30 training samples through data augmentation and optimize the parameters of the $w_1$ layer with 6 rounds of Gauss-Newton iterations and 10 rounds of Conjugate-Gradient iterations. We then optimize only the $w_2$ layer with 1 round of Gauss-Newton iterations and 5 rounds of Conjugate-Gradient iterations every 10th frame.

Target Estimation. In the current ($t$-th) frame, the position with the highest confidence score is found using the classification model (Eq. (5)). We then take this position as the target center point and randomly generate 10 bounding boxes. The DIoU score of each bounding box is maximized by the offline-trained target estimation network, and the final target state in the current frame is determined by averaging the bounding boxes with the top-k DIoU scores.

Model Update. In the target classification phase, we adopt the $\ell_2$ classification error of the DCF-based tracking framework to distinguish the target from the background, and we adopt a linear update strategy $w = (1 - \delta) w_{t-1} + \delta w_t$ to update $w$, where $\delta$ is a learning rate. A sketch of the estimation and update steps appears after the ablation study below.

We first conduct an ablation study on the LaSOT [19] and OTB100 [16] datasets to verify the effectiveness of each component of the proposed tracker; the backbone network used in this part is ResNet18. We mainly analyze the impact of the two main components (the DIoU loss and the Conjugate-Gradient strategy) on tracking performance. The experimental results are shown in Table 1. To avoid confusion, trackers without the DIoU loss adopt the IoU loss to train their models, and trackers without the Conjugate-Gradient strategy adopt only the Gauss-Newton strategy for model optimization. From this table we can see that the tracker with the DIoU loss performs significantly better than the tracker without it, especially in the success score, where it achieves about an 11% improvement on the LaSOT [19] dataset. In addition, Table 1 also clearly shows that the tracker with the Conjugate-Gradient strategy performs better than the tracker without it.
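For concreteness, here is a rough sketch of the target estimation and linear model update steps described above. The proposal jitter magnitudes, the number of refinement steps, the `diou_net(features, boxes)` interface, and the value of `delta` are illustrative assumptions and are not taken from the paper or its released code.

```python
import torch

def estimate_target(diou_net, features, center, prev_size,
                    num_proposals=10, top_k=3, refine_steps=5):
    """Sample boxes around the classifier's peak position, refine each box by
    gradient ascent on the predicted DIoU score, and average the top-k boxes.
    `diou_net(features, boxes)` is assumed to return one DIoU score per box and
    to be differentiable with respect to the box coordinates."""
    cx, cy = center
    w, h = prev_size
    # Randomly jitter position (in pixels) and scale (in log-space) around the peak.
    noise = torch.randn(num_proposals, 4) * torch.tensor([0.1 * w, 0.1 * h, 0.2, 0.2])
    boxes = torch.stack([cx + noise[:, 0],
                         cy + noise[:, 1],
                         w * torch.exp(noise[:, 2]),
                         h * torch.exp(noise[:, 3])], dim=1)  # (N, 4) as (cx, cy, w, h)

    boxes = boxes.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([boxes], lr=1.0)
    for _ in range(refine_steps):
        optimizer.zero_grad()
        # Maximize the predicted DIoU score of every proposal.
        (-diou_net(features, boxes).sum()).backward()
        optimizer.step()

    with torch.no_grad():
        scores = diou_net(features, boxes)
        top = scores.topk(min(top_k, num_proposals)).indices
        return boxes[top].mean(dim=0)  # final (cx, cy, w, h) estimate


def update_classifier(w_prev, w_curr, delta=0.01):
    """Linear model update w = (1 - delta) * w_{t-1} + delta * w_t.
    The value of delta is an assumed example, not one reported in the paper."""
    return [(1.0 - delta) * a + delta * b for a, b in zip(w_prev, w_curr)]
```

Summing the scores works as independent per-proposal gradient ascent because each predicted DIoU score depends only on its own box coordinates.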
We present quantitative comparisons of our DIoUTrack with a number of state-of-the-art trackers on 7 challenging single-target tracking datasets. Since we use two backbone networks (ResNet18 / ResNet50), we report the tracking results of the corresponding trackers (DIoUTrack18 / DIoUTrack50).

Experiment on the OTB100 [16] dataset: The OTB100 dataset includes 100 testing sequences, and the tracking accuracy of each tracker is evaluated by precision (a frame is considered successfully tracked if the center distance between the predicted and ground-truth boxes is no greater than a fixed threshold, e.g., 20 pixels) and success (the overlap between the predicted and ground-truth boxes, summarized by the area under the curve (AUC) of the success plot, with an overlap ≥ 0.5 commonly regarded as a successful track). We report experimental comparisons of the proposed DIoUTrack with several state-of-the-art trackers (namely ATOM [14], GradNet [51], GCT [52], ARCF [3], UDT [53], MetaCREST [12], SiamRPN [13], SiamRPN++ [45], PTAV [54], DiMP18 and DiMP50 [46]) on this dataset. Table 2 presents the results of these comparisons over all 100 testing videos. From this table, we can see that the proposed DIoUTrack50 achieves the best accuracy in both the precision and success metrics. The SiamRPN [13] tracker employs a bounding-box regression strategy, while the ATOM [14] tracker adopts an improved bounding-box regression model based on the IoUNet to estimate the target state. Compared to other trackers, the ATOM [14] tracker achieves acceptable success and precision scores (66.1% / 86.2%), and the DiMP18 [46] tracker achieves good accuracy (66.0% / 87.8%); however, our DIoUTrack18 with the same backbone network (ResNet18), thanks to its DIoU network-based bounding-box regression model for target estimation, significantly outperforms both the ATOM tracker and the DiMP18 tracker, achieving a success score of 68.1% and a precision score of 89.0%.

Experiment on the UAV123 [17] dataset: The UAV123 dataset consists of 123 aerial test video sequences, and performance is evaluated in the same way as on the OTB100 dataset. To evaluate the tracking performance of the proposed DIoUTrack, we report experimental comparisons of our tracker with several other state-of-the-art trackers (namely ATOM [14], GFSDCF [1], LDES [55], UDT [53], STRCF [32], ARCF [3], GCT [52], SiamRPN++ [45], SiamRPN [13], DaSiamRPN [6], DiMP18 and DiMP50 [46]) on this dataset. Table 3 presents the precision and success scores on the 123 video sequences. DaSiamRPN [6], SiamRPN++ [45] and their predecessor SiamRPN [13] adopt a bounding-box regression-based target estimation component. Compared to other tracking methods, DiMP50 [46] achieves superior performance in terms of the AUC (65.4%) and precision (85.8%) metrics, while SiamRPN++ [45] achieves good performance in terms of AUC (61.3%) and precision (80.7%). However, the proposed DIoUTrack50 with the same backbone network (ResNet50), which employs a distance-IoU network-based bounding-box regression model for target estimation, outperforms both the DiMP50 [46] tracker and the SiamRPN++ [45] tracker, achieving an AUC of 65.5% and a precision of 86.6%. Compared to ARCF [3], a tracker specifically designed for drone scenarios, our DIoUTrack achieves an improvement of more than 15% in each metric.

Experiment on the TrackingNet [18] dataset: TrackingNet contains a test set of 511 video sequences.
To verify the tracking results of our DIoUTrack, we compare its performance on the TrackingNet test set with several state-of-the-art trackers, namely ATOM [14], SPM [56], GFSDCF [1], C-RPN [57], UpdateNet [58], DiMP18 [46], DiMP50 [46], UPDT [59], ECO [60], SiamRPN++ [45] and DaSiamRPN [6]. Table 4 presents the comparison results in terms of precision, normalized precision, and success. From this table, it is evident that our DIoUTrack50 achieves the best scores on all three metrics. In terms of precision, our DIoUTrack50 outperforms the second-best tracker, DiMP50 [46], by 1.3%; moreover, compared to the Siamese framework-based DaSiamRPN [6] tracker, the proposed DIoUTrack50 achieves a greater than 18% improvement in success and an improvement of over 28% in precision. Finally, compared with the IoUNet-based ATOM [14] tracker, our DIoUTrack18 with the same backbone network (ResNet18) achieves an improvement of more than 2% on each metric. All of these comparative results show that the adopted Distance-IoU loss can effectively improve the target bounding-box regression model for accurate target location and estimation.

Experiment on the LaSOT [19] dataset: The LaSOT dataset consists of 1,400 video sequences with more than 3.5M image frames, of which 280 videos form the testing set. To validate tracking accuracy, we conduct several experimental comparisons on the LaSOT testing set in order to assess our proposed DIoUTrack alongside state-of-the-art tracking methods, namely MDNet [5], ECO [60], CFNet [61], PTAV [54], BACF [2], DSiam [62], StructSiam [63], VITAL [64], STRCF [32], TRACA [65], SiamRPN++ [45], ASRCF [11], GCT [52], ATOM [14], DiMP18 [46], DiMP50 [46], UpdateNet [58], ROAM [66], SiamBAN [67], SiamCAR [68], LTMU [69], CLNet [70] and Ocean [71]. Table 5 presents the results of this comparison. Among these compared trackers, DiMP50 [46] obtains the second-best precision, normalized precision, and success scores. In contrast, our DIoUTrack50 outperforms the DiMP50 [46] tracker on every performance metric, which fully demonstrates the effectiveness of our tracker.

Experiment on the GOT10k [20] dataset: The GOT10k test set includes 180 video sequences for evaluating tracking performance. We conduct experimental comparisons on the GOT10k test set to evaluate our DIoUTrack relative to other state-of-the-art trackers, namely MDNet [5], ECO [60], DSiam [62], DAT [72], DeepSTRCF [32], STRCF [32], SASiamP [44], SASiamR [44], MemDTC [73], MetaSDNet [12], RT-MDNet [74], LDES [55], SiamDW [43], SPM [56], ATOM [14], DiMP18 [46], DiMP50 [46], SiamCAR [68], ROAM [66], and Ocean [71]. The comparison results are presented in Table 6. The ATOM tracker obtains an average overlap (AO) score of 55.6%; our DIoUTrack18 with the same backbone network (ResNet18) achieves a 3.9% improvement over the ATOM tracker, as well as a faster tracking speed. Meanwhile, our DIoUTrack50 achieves the best AO and SR$_{0.50}$ scores. Although our tracking speed is slightly lower than that of the SPM [56] tracker, our tracker clearly exceeds SPM in every accuracy metric.

Experiment on the VOT2018 [21] dataset: The VOT2018 dataset contains 60 test video sequences, on which trackers are measured by expected average overlap (EAO), robustness, and accuracy.
We compare our tracker with several state-of-the-art trackers, namely ATOM [14], DiMP18 [46], DiMP50 [46], PrDiMP18 [47], PrDiMP50 [47], DaSiamRPN [6], SiamRPN++ [45], UPDT [59] and Ocean [71], on this test set. The comparison results are shown in Table 7. Our DIoUTrack18 has the best accuracy score among the compared trackers. DIoUTrack18 adopts the same backbone network as the ATOM [14], DiMP18 [46] and PrDiMP18 [47] trackers, and its EAO and accuracy scores are higher than those of all three, which clearly indicates that our proposed method produces more accurate tracking results.

Experiment on the VOT2019 [22] dataset: VOT2019 has the same dataset size as VOT2018, and trackers are also evaluated by expected average overlap (EAO), robustness, and accuracy. We compare our tracker with several state-of-the-art trackers, namely ATOM [14], DiMP18 [46], DiMP50 [46], PrDiMP18 [47], PrDiMP50 [47], TADT [7], SiamRPN++ [45], MemDTC [73] and Ocean [71], on this test set. The comparison results are shown in Table 8. Our DIoUTrack50 has the best EAO score among the compared trackers. DIoUTrack50 adopts the same backbone network as the DiMP50 [46], PrDiMP50 [47] and SiamRPN++ [45] trackers, and its EAO and accuracy scores are higher than those of these trackers, which also indicates that our proposed method produces more accurate tracking results.

To visually demonstrate the tracking results, we present a qualitative comparison of our DIoUTrack50 with some state-of-the-art tracking methods, namely SiamRPN++ [45], PrDiMP50 [47] and DiMP50 [46]; all of these trackers use the same backbone network (ResNet50). Figure 5 presents these visual comparisons on several of the most challenging sequences selected from the OTB100 [16] dataset. The DiMP50 [46] tracker is easily disturbed in scenes with occlusion, fast motion, background clutter, and deformation (e.g., bird1 and soccer). One explanation for this drawback is that it adopts a bounding-box regression model improved by an IoUNet, so it cannot locate the target accurately in some complex tracking scenes. By contrast, the proposed DIoUTrack50 adopts a distance-IoU network to improve the bounding-box regression model; this means that when the IoU score is constant, our model selects the candidate with the more accurate center position as the target. Meanwhile, the PrDiMP50 [47] tracker cannot achieve ideal tracking results under illumination variation, deformation, scale variation, and other challenging conditions (e.g., bird1, matrix and soccer), and the SiamRPN++ [45] tracker is easily disturbed in scenes with fast motion, scale variation, and deformation (e.g., motorrolling, skating1 and soccer); by contrast, our DIoU-based DIoUTrack50 obtains accurate tracking results on these test video sequences. In summary, compared with these state-of-the-art trackers, our proposed tracker produces more accurate bounding boxes and tracking results.

In this work, we propose an accurate bounding-box regression tracking method based on the distance-intersection-over-union (DIoU) loss. The proposed tracker comprises two components: an estimation component and a classification component. The former is trained offline to predict the DIoU overlap score between the target ground-truth and the predicted bounding box.
Compared with the IoU loss, the adopted DIoU loss drives the prediction closer to the real target during training, which allows the target bounding box to be predicted more accurately during tracking. The classification component is trained online using a Conjugate-Gradient-based method, resulting in a fast tracking speed. Extensive experimental results on seven challenging benchmarks show that our proposed method obtains competitive tracking results compared with state-of-the-art trackers. Our future work will focus on how to better use large-scale unlabeled data to train the CNN model of the tracker; we hope to achieve this goal by applying an unsupervised domain adaptation method to the tracker.

References
[1] Joint group feature selection and discriminative filter learning for robust visual object tracking
[2] Learning background-aware correlation filters for visual tracking
[3] Learning aberrance repressed correlation filters for real-time UAV tracking
[4] Learning adaptive spatial-temporal context-aware correlation filters for UAV tracking
[5] Learning multi-domain convolutional neural networks for visual tracking
[6] Distractor-aware Siamese networks for visual object tracking
[7] Target-aware deep tracking
[8] Tracking-by-counting: Using network flows on crowd density maps for tracking multiple targets
[9] Learning deep multi-level similarity for thermal infrared object tracking
[10] Multi-task correlation particle filter for robust object tracking
[11] Visual tracking via adaptive spatially-regularized correlation filters
[12] Meta-tracker: Fast and robust online adaptation for visual object trackers
[13] High performance visual tracking with Siamese region proposal network
[14] ATOM: Accurate tracking by overlap maximization
[15] Distance-IoU loss: Faster and better learning for bounding box regression
[16] Object tracking benchmark
[17] A benchmark and simulator for UAV tracking
[18] TrackingNet: A large-scale dataset and benchmark for object tracking in the wild
[19] LaSOT: A high-quality benchmark for large-scale single object tracking
[20] GOT-10k: A large high-diversity benchmark for generic object tracking in the wild
[21] The sixth visual object tracking VOT2018 challenge results
[22] The seventh visual object tracking VOT2019 challenge results
[23] Tracking interacting objects using intertwined flows
[24] Tracking-learning-detection
[25] Greedy batch-based minimum-cost flows for tracking multiple objects
[26] Active learning for deep visual tracking
[27] A neighbor level set framework minimized with the split Bregman method for medical image segmentation
[28] Semi-online multi-people tracking by re-identification
[29] Adaptive segmentation model for liver CT images based on neural network and level set method
[30] Particle filter re-detection for visual tracking via correlation filters
[31] High-speed tracking with kernelized correlation filters
[32] Learning spatial-temporal regularized correlation filters for visual tracking
[33] TRBACF: Learning temporal regularized correlation filters for high performance online visual object tracking
[34] Learning target-focusing convolutional regression model for visual object tracking
[35] Visual object tracking with adaptive structural convolutional network
[36] Learning multiple instance deep representation for objects tracking
[37] Siamese instance search for tracking
[38] Fully-convolutional Siamese networks for object tracking
[39] Real-time long-term tracker with tracking-verification-detection-refinement
[40] Self-supervised deep correlation tracking
[41] Correlation filter via random-projection based CNNs features combination for visual tracking
[42] Siamese target estimation network with AIoU loss for real-time visual tracking
[43] Deeper and wider Siamese networks for real-time visual tracking
[44] A twofold Siamese network for real-time object tracking
[45] SiamRPN++: Evolution of Siamese visual tracking with very deep networks
[46] Learning discriminative model prediction for tracking
[47] Probabilistic regression for visual tracking
[48] Generalized intersection over union: A metric and a loss for bounding box regression
[49] Acquisition of localization confidence for accurate object detection
[50] Microsoft COCO: Common objects in context
[51] GradNet: Gradient-guided network for visual object tracking
[52] Graph convolutional tracking
[53] Unsupervised deep tracking
[54] Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking
[55] Robust estimation of similarity transformation for visual object tracking
[56] SPM-Tracker: Series-parallel matching for real-time visual object tracking
[57] Siamese cascaded region proposal networks for real-time visual tracking
[58] Learning the model update for Siamese trackers
[59] Unveiling the power of deep tracking
[60] ECO: Efficient convolution operators for tracking
[61] End-to-end representation learning for correlation filter based tracking
[62] Learning dynamic Siamese network for visual object tracking
[63] Structured Siamese network for real-time visual tracking
[64] VITAL: Visual tracking via adversarial learning
[65] Context-aware deep feature compression for high-speed visual tracking
[66] ROAM: Recurrently optimizing tracking model
[67] Siamese box adaptive network for visual tracking
[68] SiamCAR: Siamese fully convolutional classification and regression for visual tracking
[69] High-performance long-term tracking with meta-updater
[70] CLNet: A compact latent network for fast adjusting Siamese trackers
[71] Ocean: Object-aware anchor-free tracking
[72] Deep attentive tracking via reciprocative learning
[73] Learning dynamic memory networks for object tracking
[74] Real-time MDNet