key: cord-0571658-7ky519kl
authors: Shao, Zhenfeng; Wang, Jiaming; Deng, Lianbing; Huang, Xiao; Lu, Tao; Luo, Fang; Zhang, Ruiqian; Lv, Xianwei; Dang, Chaoya; Ding, Qing; Wang, Zhiqiang
title: GLSD: The Global Large-Scale Ship Database and Baseline Evaluations
date: 2021-06-05
journal: nan
DOI: nan
sha: b3e90781c1eb96f77d07f6c5a663921c57336864
doc_id: 571658
cord_uid: 7ky519kl

In this paper, we introduce a challenging global large-scale ship database (called GLSD), designed specifically for ship detection tasks. The designed GLSD database includes a total of 212,357 annotated instances from 152,576 images. Based on the collected images, we propose 13 ship categories that widely exist in international routes. These categories include Sailing boat, Fishing boat, Passenger ship, Warship, General cargo ship, Container ship, Bulk cargo carrier, Barge, Ore carrier, Speed boat, Canoe, Oil carrier, and Tug. The motivations for developing GLSD include the following: 1) providing a refined and extensive ship detection database that benefits the object detection community, 2) establishing a database with exhaustive labels (bounding boxes and ship class categories) in a uniform classification scheme, and 3) providing a large-scale ship database with geographic information (covering more than 3000 ports and 33 routes) that benefits multi-modal analysis. In addition, we discuss the evaluation protocols corresponding to image characteristics in GLSD and analyze the performance of selected state-of-the-art object detection algorithms on GLSD, aiming to establish baselines for future studies. More information regarding the designed GLSD can be found at https://github.com/jiaming-wang/GLSD.

Object detection has been an important computer vision task for over 20 years [1].
In recent years, with the growing demand for public security, the detection of ships has become an important task in both military and civilian fields [2], including sea control, illegal smuggling monitoring, and automatic driving. The rapid development of artificial intelligence has also pushed autonomous ship detection into the spotlight. Ship detection is of great importance, as sea routes are the lifeblood of the global economy, given that the international shipping industry is responsible for the carriage of around 90% of world trade. Nevertheless, manual inspection for identifying abnormal behaviors is a time-consuming and laborious process. With the development of automatic ship navigation systems, the demand for very large amounts of data for data-driven models is rising. In addition, although automatic object detection methods have achieved strong performance, they are still far from mature, as challenges remain when these algorithms are applied in real-world ship detection scenarios.

Inspired by the immense success of machine learning approaches in computer vision tasks [3], [4], [5], [6], deep-learning-based methods have become the mainstream in addressing object detection problems [1]. However, the performance of deep-learning-based algorithms, given their big-data-driven nature [7], largely depends on the number of high-quality training samples. The first large-scale dataset, i.e., ImageNet [8], has been widely adopted in object detection studies and even other vision tasks [9], [10]. Following ImageNet, Lin et al. [11] presented the Microsoft Common Objects in Context (MS COCO) dataset with instance-level segmentation masks. In real-world scenarios, ships of different categories play considerably different roles in sea transportation.
In these publicly available datasets, however, ships are commonly generalized as "ship" or "boat" (for example, in VOC2007 [12], CIFAR-10 [13], Caltech-256 [14], and COCO [11]). Although ImageNet [8] includes six types of ships, i.e., "fireboat", "lifeboat", "speedboat", "submarine", "pirate", and "container ship", most of them are seagoing vessels that are rarely seen in certain situations. Thus, we argue that object models learned from the aforementioned coarse-grained datasets are not suitable for ship identification and the corresponding applications in real-world scenarios. Developing a new large-scale ship database is therefore necessary.

Recently, efforts have been made to construct ship datasets. For instance, Shao et al. [2] developed a ship dataset, i.e., SeaShips, that consists of 31,455 high-quality images (1,920 × 1,080 pixels) and covers six common ship types (i.e., "passenger ship", "fishing boat", "container ship", "general cargo ship", "bulk cargo carrier", and "ore carrier"). Despite the fact that the SeaShips dataset considers the following factors: [...]

[Table I: Comparison of ships among GLSD and object detection datasets. Improving from the first ship dataset SeaShips, we add the following categories: "Sailing boat", "Warship", "Barge", "Speed boat", "Canoe", "Oil carrier", and "Tug". GLSD have a total ...]

In this study, we present a novel ship dataset, called the Global Large-Scale Ship Database (GLSD), that consists of 152,576 images and covers 13 ship types. Considering that the routes of ships are well established, we collect internet images and monitoring data according to the routes with port and country information. The developed GLSD covers more than 3,000 ports around the world (more details in Section III-A). Improving from SeaShips, we add the following categories: "Sailing boat", "Warship", "Barge", "Speed boat", "Canoe", "Oil carrier", and "Tug".
Labels and bounding boxes of GLSD are manually and carefully constructed using an image annotation tool [16]. We name the route-based version of GLSD "GLSD_port" (GLSD with geographic information; more details about "GLSD_port" at https://github.com/jiaming-wang/GLSD/blob/master/Ports_list.md). We believe that GLSD_port provides training models with multi-modal information that potentially benefits certain ship detection applications.

Detailed comparisons of GLSD with existing databases are shown in Table I. Early databases, like CIFAR-10 [13], have a very low image resolution (32 × 32 pixels), which is not suitable for object detection tasks. The maximum number of boxes per image in PASCAL VOC2007 [12], Caltech-256 [14], and ImageNet [8] is rather small. Although COCO [11] includes a great quantity of "boat" images, the "boat" category has not been sub-categorized. Compared with SeaShips, the proposed GLSD contains more categories (13 vs. 6). In terms of image quantity, GLSD contains three times as many images as SeaShips. In addition, GLSD has a larger number of boxes per image than SeaShips (1.39 vs. 1.37). Table I indicates that GLSD is a more challenging ship database that potentially benefits the training of robust ship detection models.

The main contributions of this work are summarized as follows:
1) To the best of our knowledge, the developed GLSD is a very challenging global ship dataset with the largest number of annotations, reaching above 150,000 images, potentially facilitating the development and evaluation of existing object detection algorithms.
2) GLSD is built on global routes, providing multi-modal information (port and country of image acquisition) that better serves certain ship detection tasks. We plan to update GLSD regularly as new images become available.
3) We evaluate state-of-the-art object detection algorithms on the proposed GLSD, setting up a baseline for future algorithm development.

This paper is organized as follows. Section II reviews the related work. Sections III and IV illustrate the collection and design of the GLSD database. Section V details the experiments and analysis. Section VI concludes this study.

In this section, we outline the development of object detection datasets and methods, providing references for future studies.

In the early days, some small-scale, well-labeled datasets (i.e., Caltech101/256 [17], [14], MSRC [18], PASCAL [19], and CIFAR-10 [13]) were widely used as benchmarks in computer vision tasks. These datasets offer a limited number of categories with low-resolution images (such as 32 × 32 and 300 × 200 pixels) [8]. It is widely acknowledged that the development of deep learning is inseparable from the support of big data. In general, high-quality training data lead to better performance of deep-learning algorithms. Deng et al. [8] were the first to build a dataset of worldwide targets organized in a tree structure, pushing the object classification and detection fields towards more complex problems. The dataset proposed by Deng et al. [8] now contains 14 million images that cover about 22,000 categories. Later on, backbone networks [20], [21] pre-trained on ImageNet images gradually became the standard in computer vision tasks. From earlier datasets, like COCO [11], to recent benchmarks, like Objects365 [22], large-scale datasets have always been the preferred choice of deep learning algorithm developers, as they play an essential role in evaluating the performance of object classification and detection tasks.
Besides the above general object detection datasets, many datasets have been developed for specific scenarios, e.g., masked face recognition for novel coronavirus disease 2019 (COVID-19) pneumonia (RMFD [23]), music information retrieval (GTZAN [24] and MSD [25]), automated detection and classification of fish (Labeled Fishes [26]), autonomous driving (JAAD [27] and LISA [28]), and ship action recognition [6]. These domain-specific datasets have greatly facilitated the development of the corresponding tasks and applications. In a recent effort, Shao et al. [2] constructed the first large-scale dataset for ship detection, i.e., SeaShips. However, because of the fixed viewpoints of the video monitoring system deployed in the Zhuhai Hengqin New Area, the background information in SeaShips lacks diversity. Another notable effort is by Zheng et al. [15], who presented a new multi-category ship dataset, i.e., McShips. However, the numbers of ship targets in McShips are poorly balanced across ship categories.

An object detection model generally consists of two main components: a backbone pre-trained on a large image dataset (e.g., ImageNet [8]) for feature extraction, and a head used to predict the label. Common backbone networks include VGG [20], ResNet [21], DenseNet [29], and ResNeXt [30]. Existing head components can be divided into two categories, i.e., traditional methods and deep-learning methods [1].

In the early stages, most object detection methods adopted handcrafted image features to achieve real-time object detection [31], [32]. The histogram of oriented gradients (HOG) detector [33] played a very important role in this task. Felzenszwalb et al. [33] proposed a deformable part-based model (DPM) that can be viewed as an extension of the HOG detector. DPM [33] gradually became a mainstay of pedestrian detection as a preprocessing step [34].
According to the network structure, deep-learning-based object detection methods can be grouped into two genres: two-stage and one-stage detection [1], where two-stage detectors are the dominant paradigm in object detection tasks. Girshick et al. [35] proposed regions with convolutional neural network (CNN) features for object detection, opening a new avenue for the development of two-stage detection algorithms. SPP-Net [36] largely reduced the computing cost through a spatial pyramid pooling layer. Inspired by SPP-Net, Girshick et al. [37] further utilized a more efficient region of interest (ROI) pooling to reduce unnecessary computational overhead. In 2015, Ren et al. [38] first proposed a framework that introduces a region proposal network to obtain bounding boxes with low complexity. Lin et al. [39] proposed the feature pyramid network (FPN), which fuses multi-scale features to enhance the expression of semantic information.

Different from the above deep-learning-based algorithms, YOLO [40] transforms detection and classification into an end-to-end regression model, sacrificing localization accuracy for a boost in detection speed. The subsequent versions in the YOLO series [41], [42], [43] inherit its advantages while gradually improving detection accuracy. Liu et al. [44] proposed a multi-reference and multi-resolution framework that can significantly enhance detection accuracy. Further, Lin et al. [45] introduced the focal loss, which prevents the accuracy drop resulting from the imbalance between foreground and background classes in one-stage detection methods.

In this section, we present the details on the collection, main categories, and characteristics of the GLSD.
Referring to the United Nations Code for Trade and Transport Locations (UN/LOCODE), global ports are divided into 33 routes, i.e., "east of South America", "Pacific island", "West Mediterranean", "Middle East", "Caribbean", "West Africa", "Australia", "India-Pakistan", "European basic port", "European inland port", "East Mediterranean", "Black Sea", "Southeast Asia", "Canada", "west of South America", "China", "Taiwan-China", "East Africa", "North Africa", "Red Sea", "partial port of Japan", "Adriatic Sea", "Kansai", "Kanto", "Korea", "Mexico", "South Africa", "New Zealand", "west of American", "Russia Far East", "American inland port", "east of American", and "Central Asia". To ensure diverse image sources, we try to collect images from as many ports as possible (the ports involved in the dataset can be found on our website). A certain number of images in GLSD are captured from a video monitoring system deployed in the Zhuhai Hengqin New Area, China, and the rest are collected via search engines at multiple resolutions. As images for certain routes are unavailable, GLSD mainly covers ship images captured in China, America, and Europe.

Images in the GLSD can be roughly divided into two categories: iconic images [46] and non-iconic images [11]. Iconic images provide high-quality object instances that clearly depict the objects' categories (see examples in Fig. 2(a)). Iconic images are widely used in object classification and retrieval tasks, and they can be directly retrieved via image search engines. Most images in SeaShips are iconic images. Non-iconic images, which provide contextual information and non-canonical viewpoints, also play an important role in object detection tasks (see examples in Fig. 2(b)). In the proposed GLSD, we keep both iconic and non-iconic images, aiming to provide diverse image categories that benefit object detection model training.
Compared with SeaShips, which contains mostly iconic images, GLSD is considerably more challenging and closer to real-world scenes.

From the collected images, we construct 13 categories that widely exist in international routes. These categories include "Sailing boat", "Fishing boat", "Passenger ship", "Warship", "General cargo ship", "Container ship", "Bulk cargo carrier", "Barge", "Ore carrier", "Speed boat", "Canoe", "Oil carrier", and "Tug". We refer to Wikipedia and the Cambridge International Dictionary of English to define the main categories involved in GLSD, as shown in Table II. In addition, Table III lists the number of images corresponding to each ship category in GLSD. The GLSD consists of a large number of ships that are capable of oceangoing voyages (with great economic benefits), e.g., "General cargo ship", "Container ship", "Fishing boat", and "Passenger ship". Other ship types, e.g., "Tug", "Canoe", "Sailing boat", "Speed boat", and "Barge", are usually not capable of long-distance travel, leading to their limited sample sizes in our dataset. Therefore, although GLSD involves additional ship categories and a significantly larger number of samples than other ship datasets, the class imbalance issue still exists in the proposed GLSD.

Fig. 4 shows the distribution of the image resolution in the GLSD. Different from SeaShips, which mainly contains images retrieved from a monitoring system, GLSD also includes high-resolution images from unmanned aerial vehicles and satellite/airborne platforms. Considering that the performance of existing deep-learning-based algorithms is usually limited in detecting small targets, we include a large number of images with small targets (less than 32 × 32 pixels) and medium targets (between 32 × 32 and 96 × 96 pixels). The definition of small and medium targets follows [11]. The image sizes vary greatly in GLSD, with the smallest image of 90 × 90 pixels and the largest of 6,509 × 6,509 pixels. From the above description, it can be seen that GLSD contains more diverse images with various resolutions and target sizes than SeaShips.

[Fig. 3. Samples of annotated images in the GLSD dataset. Based on the collected images, we propose 13 categories that widely exist in international routes. They are: "Sailing boat", "Fishing boat", "Passenger ship", "Warship", "General cargo ship", "Container ship", "Bulk cargo carrier", "Barge", "Ore carrier", "Speed boat", "Canoe", "Oil carrier", and "Tug".]

In this section, we describe how we annotate images in GLSD. Different from regular objects, ships can contain other features besides their main body, such as masts, elevating equipment, and oars. During the annotation process, all object instances are labeled with object names and with bounding boxes that cover the entire ship, including these additional ship features.

IV. DESIGN OF THE GLSD DATASET

Different from the SeaShips dataset, which contains images from an on-site monitoring system, images in GLSD collected from the Internet and search engines are generally more complex. Eight variations, i.e., viewpoint, state, noise, background, scale, mosaic, style, and weather variations, are considered in constructing the GLSD. Selected examples corresponding to these variations are shown in Fig. 5.

Images from different viewpoints have varying characteristics. Multi-viewpoint images have been shown to help deep-learning-based models cope with the complex changes in real-world scenarios. Compared to SeaShips, which is based on surveillance cameras with limited viewpoints, the designed GLSD contains considerably more viewpoints, as shown in Fig. 5(a), potentially leading to increased model robustness.

SeaShips only focuses on underway ships while ignoring ship states under abnormal events, such as shipping disasters (e.g., a ship on fire), towing by a tug, and interaction between barges and large vessels.
Datasets with images under different states are beneficial in monitoring abnormal events during shipping. Fig. 5(b) shows ship images in the designed GLSD under a unique on-fire state: two fishing boats with only skeletons left after burning.

Numerous studies have shown that the accuracy of target detection is often higher in cleaner images. However, noise is unavoidable in many real-world cases. Thus, including noisy images helps further improve the robustness of algorithms. Certain images in GLSD collected from search engines contain watermarks (serving as noise), as shown in Fig. 5(a), (b), and (c).

Theoretically, backgrounds in a non-iconic image can provide rich contextual information. Therefore, a dataset with diverse backgrounds is preferred in many object recognition tasks. In real-world applications, backgrounds of the captured images tend to vary. Thus, it is necessary to collect images from different ports. Fig. 5(d) presents images with two distinctly different backgrounds: one with a tropical character featuring canoes, the other a modern port featuring large-scale ships.

Deadweight tonnage (DWT) is an indicator of a ship's size and transport capacity. The DWT of ships varies greatly. For example, the maximum payload of an oil carrier can reach 500,000 DWT, while some old cargo ships have only 5,000 DWT. Even for ships of the same class, the scale can vary considerably. An oil carrier can occupy ten times as many pixels as a tug, as illustrated in Fig. 5(e). Our designed GLSD contains ships with rich variations in scale, potentially increasing an algorithm's capability to detect both large and small objects.

YOLOv4 [43] introduces a new method of data augmentation, named mosaicking, which mixes four different images into one image. Fig. 5(f) shows some examples of images that consist of three iconic images with different placement strategies after mosaicking.
Our GLSD contains a variety of mosaicked images, greatly enriching the background information of ships to be detected.

In addition to multi-viewpoint images, our designed GLSD includes images of various types: aerial images, remote sensing images, and portraits. Numerous efforts have been made towards style transfer as a data augmentation approach (e.g., domain adaptation between GTA5 and image style transfer on the COCO database). Our GLSD contains abundant image styles, including images captured via cameras and realistic paintings, as shown in Fig. 5(g).

It is widely acknowledged that port operations are susceptible to extreme weather conditions, such as high winds, fog, heavy haze, snowstorms, thunderstorms, and typhoons. Such extreme weather conditions greatly affect the arrival and departure of ships and the unloading of cargo in the port. At sea, the weather tends to change significantly in a relatively short time. Our GLSD includes a variety of weather conditions, which is expected to benefit models in ship recognition under different weather scenarios, as illustrated in Fig. 5(h).

In summary, the aforementioned variations make GLSD a rather challenging dataset for ship detection and recognition. The rich variations, as well as the effectively widened within-class gap in our GLSD, are expected to help models reach higher robustness.

In this section, we conduct a comprehensive comparison of the following state-of-the-art object detection algorithms on GLSD: Faster R-CNN [35], RetinaNet [45], GHM [49], FP16 [48], Libra R-CNN [50], PAA [51], ATSS [52], GFL [53], and Fovea [54]. These experiments are run on a desktop with mmdetection-2.12.0 [55] (a popular open-source object detection toolbox developed by OpenMMLab), three NVIDIA GTX TITAN GPUs, a 3.60 GHz Intel Core i7-7820X CPU, and 32 GB of memory. We implement these methods using the PyTorch 1.7.0 [56] library with Python 3.7.9 under Ubuntu 18.04, CUDA 10.2, and cuDNN 7.6.
For evaluation, we employ average precision (AP, AP_50, AP_75, AP_S, AP_M, and AP_L) and average recall (AR_1, AR_10, AR_100, AR_S, AR_M, and AR_L), as in [57]. Among them, AP, AP_S, AP_M, AP_L, and all average recall values are averaged over intersection over union (IoU) thresholds of [0.50 : 0.05 : 0.95]. For AP_50 and AP_75, the corresponding IoU thresholds are 0.5 and 0.75, respectively. Moreover, scale = {S, M, L} denotes the average over targets of different scales (small scale: targets smaller than 32 × 32 pixels; medium scale: targets between 32 × 32 and 96 × 96 pixels; large scale: targets larger than 96 × 96 pixels [11]), and num = {1, 10, 100} denotes the average recall with different numbers of detections per image. For a fair comparison, all selected algorithms are trained and tested on images with 1,333 × 800 pixels.

In Table IV, we report the performance of all selected models on GLSD. The precision-recall curves are shown in Fig. 6. In scenes that contain small targets, the APs of the selected algorithms without the focal loss function are lower than 5%. Even in scenes with medium targets, the APs of all selected algorithms with the 1× schedule are about 16%, indicating that small-target recognition is still one of the major challenges in our designed GLSD. Thus, we believe our GLSD creates a valuable testbed on which future object detectors can compete.

With the introduction of the focal loss, an effective approach to mitigating the issues of long-tailed distributions, the performance of RetinaNet [45] and GFL [53] shows significant improvement compared to the two-stage algorithms. We notice that ARs are larger than APs for these algorithms, indicating the existence of false detections caused by the small inter-class gap. However, with an increasing number of iterations, RetinaNet [45] is able to achieve better performance (up to 1.0% gains) in large-object detection.
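To make the evaluation protocol described above concrete, the following minimal sketch computes the IoU between two boxes, enumerates the ten IoU thresholds from 0.50 to 0.95, and assigns a target to the small, medium, or large bucket by its pixel area. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for illustration; they are not taken from the GLSD code base or from mmdetection, and the handling of areas that fall exactly on a bucket boundary is likewise an assumption.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds() -> List[float]:
    """The ten thresholds 0.50, 0.55, ..., 0.95 over which AP and AR are averaged."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def size_bucket(box: Box) -> str:
    """Small / medium / large bucket by box area (boundary handling is assumed)."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"


def is_true_positive(pred: Box, truth: Box, threshold: float) -> bool:
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return iou(pred, truth) >= threshold
```

Under this sketch, a prediction that overlaps half of a ground-truth box of equal size has an IoU of 1/3 and therefore counts as a miss at every threshold in the list, including the most lenient one used for AP_50.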
Our experiments suggest that solutions that address the long-tailed distribution are the key for models to reach satisfactory performance on GLSD.

The precision-recall curves of PAA [51] with different training schedules are shown in Fig. 5. We observe notable increases in AP (about 1.0% gains) and stable AR. As the IoU threshold increases, the impact of the number of iterations on performance becomes notable, especially when IoU ≥ 0.8. This implies that more iterations improve the capability of the model to identify non-ship objects.

To further investigate the performance on different classes of the GLSD, Table V shows the per-class APs of the above state-of-the-art object detection algorithms. As shown in Table V, the APs of "Barge" are only about 5%, presumably for two reasons: 1) the small number of "Barge" images; 2) the similarity between "Barge" and "Ore carrier" (especially in shape). The same phenomenon is observed in other categories with a small number of images ("Oil carrier", "Tug", "Canoe", and "Speed boat"). However, the tested methods all achieve good recognition performance on "Warship", given its unique appearance.

1) Validation of multi-scale training: In our detection framework, all images are resized to 1,333 × 800 pixels for training. Considering the poor performance of small-target detection on GLSD, we train the state-of-the-art method (PAA [51]) with the default multi-scale configuration (1,333 × 800 and 1,333 × 640 pixels) to verify the impact of multi-scale training on GLSD. As shown in Table VI, different configurations of the training size lead to only trivial performance fluctuations. This can be explained by the huge differences in image size within GLSD (some natural images with 96 × 96 pixels and satellite images with 6,509 × 6,509 pixels), while most images in regular natural image datasets (e.g., PASCAL VOC2007 [12] and COCO [11]) are no more than 1,000 × 1,000 pixels.
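The effect of the image-size spread on a fixed training size can be illustrated with a small sketch. It assumes an aspect-ratio-preserving resize in which the longer side is capped at 1,333 pixels and the shorter side at 800 pixels; whether the experiments preserve the aspect ratio is not stated in the text, so this behavior, like the function name `rescale_keep_ratio`, is an assumption for illustration only.

```python
from typing import Tuple


def rescale_keep_ratio(
    width: int, height: int, max_long: int = 1333, max_short: int = 800
) -> Tuple[int, int, float]:
    """Largest aspect-preserving size that fits in max_long x max_short.

    Returns the new width, new height, and the scale factor applied.
    Assumed behavior for this sketch, not confirmed by the paper.
    """
    scale = min(max_long / max(width, height), max_short / min(width, height))
    new_w = int(width * scale + 0.5)
    new_h = int(height * scale + 0.5)
    return new_w, new_h, scale


if __name__ == "__main__":
    # Image sizes quoted in the text: a small natural image, a SeaShips-style
    # frame, and the largest satellite image.
    for w, h in [(96, 96), (1920, 1080), (6509, 6509)]:
        new_w, new_h, s = rescale_keep_ratio(w, h)
        # Side length, after resizing, of a target that was 32 px wide originally.
        print(f"{w}x{h} -> {new_w}x{new_h}, scale {s:.3f}, 32 px target -> {32 * s:.1f} px")
```

Under these assumptions, a 96 × 96 image is enlarged by a factor of roughly eight, while a 6,509 × 6,509 image is reduced to about one eighth of its side length. The scale factors therefore differ by a factor of about 68 across the dataset, which is far larger than the change introduced by switching the short side between 800 and 640 pixels.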
2) Validation of normalization strategies: Normalization in detection tasks can effectively accelerate convergence and alleviate the vanishing gradient problem. To reveal the impact of normalization on GLSD, we test different normalization methods (Batch Normalization (BN) [58], Group Normalization (GN) [59], and Synchronized Batch Normalization (SyncBN) [60]) on PAA [51]. The GN layer was proposed to eliminate the influence of batch size on normalization, while the SyncBN layer is a distributed version of the BN layer. As shown in Table VII, the AP of PAA with SyncBN is 0.3% and 3.6% higher than that with BN and GN, respectively.

In this paper, we introduce a global large-scale ship database, i.e., GLSD, which is designed for ship detection tasks. To the best of our knowledge, the designed GLSD is considerably larger and more challenging than any existing database. The main characteristics of the GLSD lie in three aspects: 1) the GLSD contains a total of 152,576 images with a widening inter-class gap from 13 categories, i.e., "Sailing boat", "Fishing boat", "Passenger ship", "Warship", "General cargo ship", "Container ship", "Bulk cargo carrier", "Barge", "Ore carrier", "Speed boat", "Canoe", "Oil carrier", and "Tug"; 2) the GLSD includes a diversity of variations, including viewpoint, state, noise, background, scale, mosaic, style, and weather variations, which benefit model robustness; 3) the route-based version of GLSD, i.e., GLSD_port, contains geographic information, providing rich multi-modal information that benefits various ship detection and recognition tasks. We also propose evaluation protocols and provide evaluation results on GLSD using numerous state-of-the-art object detection algorithms. As ship images of certain categories are difficult to collect, the current version of GLSD has a notable long-tail issue.
We will continue to extend GLSD with more ship images, especially for the ship categories "Tug", "Canoe", and "Speed boat".

We thank Lan Ye, Sihang Zhang, Linze Bai, Gui Cheng, and all the others who were involved in the annotation of GLSD. In addition, we thank the support of the Post-Doctoral Research Center of Zhuhai Da Hengqin Science and Technology Development Co., Ltd, Guangdong Hengqin New Area.

[1] Object detection in 20 years: A survey
[2] SeaShips: A large-scale precisely annotated dataset for ship detection
[3] Global-local fusion network for face super-resolution
[4] SCSCN: A separated channel-spatial convolution net with attention for single-view reconstruction
[5] FusionGAN: A generative adversarial network for infrared and visible image fusion
[6] Spatial-temporal pooling for action recognition in videos
[7] DOTA: A large-scale dataset for object detection in aerial images
[8] ImageNet: A large-scale hierarchical image database
[9] Photo-realistic single image super-resolution using a generative adversarial network
[10] A dual-path fusion network for pan-sharpening
[11] Microsoft COCO: Common objects in context
[12] The PASCAL visual object classes (VOC) challenge
[13] Learning multiple layers of features from tiny images
[14] Caltech-256 object category dataset
[15] McShips: A large-scale ship dataset for detection and fine-grained categorization in the wild
[16] LabelMe: A database and web-based tool for image annotation
[17] One-shot learning of object categories
[18] TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation
[19] The PASCAL visual object classes challenge: A retrospective
[20] Very deep convolutional networks for large-scale image recognition
[21] Deep residual learning for image recognition
[22] Objects365: A large-scale, high-quality dataset for object detection
[23] Masked face recognition dataset and application
[24] Musical genre classification of audio signals
[25] The million song dataset
[26] Automated detection of rockfish in unconstrained underwater videos using Haar cascades and a new image dataset: Labeled fishes in the wild
[27] Are they going to cross? A benchmark dataset and baseline for pedestrian crosswalk behavior
[28] A general active-learning framework for on-road vehicle recognition and tracking
[29] Densely connected convolutional networks
[30] Aggregated residual transformations for deep neural networks
[31] Rapid object detection using a boosted cascade of simple features
[32] Robust real-time face detection
[33] A discriminatively trained, multiscale, deformable part model
[34] Scalable person re-identification: A benchmark
[35] Region-based convolutional networks for accurate object detection and segmentation
[36] Spatial pyramid pooling in deep convolutional networks for visual recognition
[37] Fast R-CNN
[38] Faster R-CNN: Towards real-time object detection with region proposal networks
[39] Feature pyramid networks for object detection
[40] You only look once: Unified, real-time object detection
[41] YOLO9000: Better, faster, stronger
[42] YOLOv3: An incremental improvement
[43] YOLOv4: Optimal speed and accuracy of object detection
[44] SSD: Single shot multibox detector
[45] Focal loss for dense object detection
[46] Finding iconic images
[47] labelme: Image Polygonal Annotation with Python
[48] Mixed precision training
[49] Gradient harmonized single-stage detector
[50] Libra R-CNN: Towards balanced learning for object detection
[51] Probabilistic anchor assignment with IoU prediction for object detection
[52] Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection
[53] Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection
[54] FoveaBox: Beyound anchor-based object detection
[55] MMDetection: Open MMLab detection toolbox and benchmark
[56] PyTorch: An imperative style, high-performance deep learning library
[57] Object detection in UAV images via global density fused convolutional network
[58] Batch normalization: Accelerating deep network training by reducing internal covariate shift
[59] Group normalization
[60] Context encoding for semantic segmentation