key: cord-0767382-89z6gbdw
title: Camouflaged Instance Segmentation In-The-Wild: Dataset, Method, and Benchmark Suite
authors: Le, Trung-Nghia; Cao, Yubo; Nguyen, Tan-Cong; Le, Minh-Quan; Nguyen, Khanh-Duy; Do, Thanh-Toan; Tran, Minh-Triet; Nguyen, Tam V.
date: 2021-03-31
journal: IEEE Transactions on Image Processing
DOI: 10.1109/tip.2021.3130490
sha: 9f5199f9e12e053ccb8476f823a1bd853cb1d29a
doc_id: 767382
cord_uid: 89z6gbdw

Figure 1: Examples from our Camouflaged Object Plus Plus (CAMO++) dataset with corresponding pixel-level annotations. "Plus Plus" (++) indicates the increments in dataset size and task scope compared with the preliminary CAMO dataset [1]. The first and third rows contain original images; the second and last rows contain the corresponding pixel-wise ground-truth images. Each camouflaged instance is shown in a distinct color for visualization purposes only. The aim of camouflaged instances is to conceal their texture in the background. Best viewed in color and with zoom.

Abstract-This paper pushes the envelope on decomposing camouflaged regions in an image into meaningful components, namely, camouflaged instances. To promote the new task of camouflaged instance segmentation of in-the-wild images, we introduce a dataset, dubbed CAMO++, that extends our preliminary CAMO dataset (camouflaged object segmentation) in terms of quantity and diversity. The new dataset substantially increases the number of images with hierarchical pixel-wise ground truths. We also provide a benchmark suite for the task of camouflaged instance segmentation. In particular, we present an extensive evaluation of state-of-the-art instance segmentation methods on our newly constructed CAMO++ dataset in various scenarios. We also present a camouflage fusion learning (CFL) framework for camouflaged instance segmentation to further improve the performance of state-of-the-art methods. The dataset, model, evaluation suite, and benchmark will be made publicly available on our project page: https://sites.google.com/view/ltnghia/research/camo_plus_plus

Index Terms-Camouflaged instance segmentation, in-the-wild image, camouflage dataset, benchmark suite, multimodal learning

The term "camouflage" was originally used to describe the means that organisms use to disguise their appearance to blend in with their surroundings in order to hunt or avoid being hunted [2]. This natural phenomenon [1] was adopted by humans, initially for use on the battlefield.
For example, soldiers and military equipment were camouflaged by dressing the soldiers and painting the equipment to blend in with the surroundings. This resulted in artificially camouflaged objects [1]. Autonomously identifying camouflaged objects is helpful in various fields: computer vision (search-and-rescue work [1]; wild species discovery and preservation [1]), medical diagnosis (polyp detection and segmentation [3]; COVID-19 infection identification from lung X-rays [4]), and media forensics (manipulated image/video detection and segmentation [5], [6]). Although image segmentation has been studied for a long time, general detectors [7]-[10] cannot deal with camouflaged objects. The detectors initially developed for camouflage detection [11]-[18], which use handcrafted low-level features, are effective only for images with a simple and uniform background. More recently developed deep learning-based detectors [1], [19] for camouflaged object segmentation operate only at the region level by mapping each pixel to a camouflage/non-camouflage label, so they cannot indicate the number of camouflaged objects in a scene. None of them targets instance-level segmentation.

In the work reported here, we push the envelope on decomposing camouflaged regions into meaningful components, namely, camouflaged instances. A camouflaged object is defined as the set of all camouflaged pixels in an image (a camouflaged pixel belongs to a foreground object but is easily classified as belonging to the background), without any detailed information such as the number of objects or the semantic meaning [1]. In contrast, a camouflaged instance consists of only the meaningful pixels that cover a single object instance. Camouflaged instance segmentation is more challenging than conventional camouflaged object segmentation in the sense that it not only maps each pixel to a label but also assigns an instance identity to each pixel. To the best of our knowledge, this is the first work to address camouflaged instance segmentation.

Existing camouflaged object segmentation methods were developed under the assumption that camouflaged objects are always present in an image [19]-[21]. However, this assumption is not always satisfied in practice. In contrast, our work focuses on the segmentation of in-the-wild camouflage images without any such assumption. To simulate the real world, we aim to segment camouflaged instances in unrestricted images, meaning that camouflaged instances are not always present. To this end, we introduce a dataset designed explicitly for the task of camouflaged instance segmentation. The dataset contains 5,500 images of people and more than 90 animal species, with hierarchical pixel-wise annotations for all images. There are both camouflage and non-camouflage images at a ratio of approximately 50:50. It can thus serve as a benchmark not only for the camouflaged instance segmentation task but also for the conventional camouflaged object segmentation task.

We also provide a benchmark suite to facilitate the evaluation and advance the task of camouflaged instance segmentation. In particular, we evaluate and analyze state-of-the-art instance segmentation methods in various scenarios. Note that previous work [19]-[22] used only camouflage images to train and evaluate methods, under the assumption that camouflaged objects are always present in an image. In addition, existing camouflage datasets [1], [19] do not have ground truth for non-camouflage images.
This does not correspond to the real world, since whether an animal or person is considered camouflaged depends on the surrounding context. Both camouflage and non-camouflage images in our dataset are annotated, enabling both image types to be used to simulate the real world. In addition, we present a fusion method that further improves the performance of the state-of-the-art methods.

We summarize the contributions of this paper as follows:

• We address the new task of camouflaged instance segmentation in-the-wild and analyze in depth the challenges of performing this task. To the best of our knowledge, this is the first work defining and exploring camouflaged instance segmentation. Finding camouflaged instances in a scene is a useful task, and it should be an interesting problem for the computer vision community. A few methods have been reported that can perform camouflaged object segmentation, but none can perform instance-level segmentation. In addition, they are based on the assumption that camouflaged objects are always present in an image; in contrast, our proposed method performs segmentation on unrestricted images without any such assumption.

• We present a new image dataset to promote the task of camouflaged instance segmentation. Our newly constructed Camouflaged Object Plus Plus (CAMO++) dataset, extended from our preliminary CAMO dataset [1], consists of 5,500 images of people and more than 90 animal species, with 2,700 camouflage images and 2,800 non-camouflage images. All images are hierarchically annotated with meta-category labels, fine-category labels, bounding boxes, and instance-level masks. Pixel-wise ground truths were manually annotated for all instances in each image.

• We provide a benchmark suite for the camouflaged instance segmentation task. In particular, we present the results of an extensive evaluation of state-of-the-art instance segmentation methods in various scenarios. We further provide an in-depth analysis of the experiments. The CAMO++ dataset, evaluation suite, and benchmark will be made available on our website along with the paper publication.

• We present a camouflage fusion learning (CFL) framework for camouflaged instance segmentation that leverages the advantages of state-of-the-art methods.

The remainder of this paper is organized as follows. Section II summarizes related work. Section III introduces the newly constructed CAMO++ dataset. Section IV presents our proposed CFL framework for camouflaged instance segmentation. Section V presents the benchmark suite and the results of our evaluation of baselines on the newly constructed dataset. Finally, Section VII summarizes the key points and mentions future work.

When a large-enough area in a foreground object can be easily classified as background, the pixels in that area can be considered camouflaged. As mentioned above, a camouflaged object is defined as the set of all camouflaged pixels in an image without any further detailed information such as the number of objects or the semantic meaning [1]. Although camouflaged object recognition has a wide range of applications, this research field has not been well explored in the literature. Early work related to camouflage detection focused on the foreground region even when some of its texture was similar to the background [11]-[13]. The foreground was distinguished from the background on the basis of simple features, such as color, intensity, shape, orientation, and edge.
A few methods based on handcrafted low-level features have been presented for tackling the problem of camouflage detection [14]-[18]. However, they are effective only for images with a simple and uniform background, so their performance on camouflaged object segmentation is unsatisfactory due to the substantial similarity between the foreground and the background. Recently, Le et al. [1] proposed an end-to-end network, dubbed ANet, for camouflaged object segmentation that integrates classification information into segmentation. The idea of utilizing classification for segmentation can be helpful when appropriately applied to multiple-region segmentation. Following the same direction, Fan et al. [19] subsequently developed SINet, which includes two main modules, namely a search module and an identification module. This network is based on simulated hunting, in which a predator first judges whether a potential prey exists, i.e., it searches for prey; once a target animal is identified, it can be caught. Yan et al. [25] recently introduced MirrorNet, a dual-stream network comprising a main stream and a mirror stream. This bio-inspired network effectively captures different layouts of the scene and thereby boosts segmentation accuracy. Jinchao et al. [21] presented TINet, which interactively refines multi-level texture and segmentation features and thereby gradually enhances the segmentation of camouflaged objects. To the best of our knowledge, there has been no previous work on camouflaged instance segmentation.

Table I summarizes the main characteristics of the datasets used in camouflage research. CamouflagedAnimals [23] and CHAMELEON [24] were the first two camouflage datasets with mask annotations. However, they do not contain enough images to train deep learning methods. Le et al. [1] created the CAMO dataset, the first camouflage dataset with more than 1,000 annotated images. It contains 1,250 annotated images, which is still a limited number of samples for training and evaluating deep learning methods. Fan et al. [19] subsequently created the COD dataset, which comprises 10,000 images (both camouflage and non-camouflage). However, they annotated only the 5,000 camouflage images. Lamdouar et al. [20] recently developed the MoCA dataset for the camouflaged object detection task; it contains only bounding box ground truths. Although CAMO [1] and COD [19] were constructed with multiple levels of ground truth, they have been used only for the camouflaged object-level segmentation problem. To the best of our knowledge, our newly constructed CAMO++ dataset is the first dataset fully supporting camouflaged instance-level segmentation.

Instance segmentation is the task of unifying object detection and semantic segmentation. It has been intensively studied in recent years using either a segmentation-based approach or a proposal-based approach. With the segmentation-based approach [26]-[30], two-stage processing is generally used: segmentation first and then instance clustering. With the proposal-based approach [31]-[33], on the other hand, bounding boxes are first predicted and then parsed to obtain mask regions [31], or an object detection model (e.g., Faster RCNN [34] or R-FCN [35]) is used to classify mask regions [32], [33]. Proposal-based methods, which achieve state-of-the-art performance, have gained popularity due to their superiority over segmentation-based methods. Hence, this paper focuses solely on proposal-based methods.
Proposal-based methods can be categorized into single-stage and two-stage approaches. The two-stage methods detect and then segment: they first perform object detection to extract a bounding box around each object instance and then perform binary segmentation inside each bounding box to separate the foreground (object) from the background. Two-stage methods (e.g., Mask RCNN [32] and its variants) are quite slow and thus are not practical for many real-time applications. Mask RCNN [32], the first end-to-end model for instance segmentation, is an extension of Faster RCNN [34]: a branch was added for predicting an object mask in parallel with the existing branch for bounding box detection. Mask Scoring RCNN (MS RCNN) [36], Cascade Mask RCNN [37], and PANet [38] are extensions of Mask RCNN that improve the quality of segmented instances. MS RCNN [36] contains a network block on top of Mask RCNN that learns the quality of the predicted instance masks. Cascade Mask RCNN [37], a multistage architecture, consists of a sequence of detectors trained with increasing intersection over union (IoU) thresholds that are sequentially more selective against close false positives. PANet [38] is aimed at boosting information flow in the feature extractor through bottom-up path augmentation.

The single-stage methods were inspired by anchor-free object detection methods (such as CenterNet [39] and FCOS [40]). Generally, these methods are faster than two-stage methods; some can even run in real time. YOLACT [41] is one of the first such methods attempting real-time instance segmentation. YOLACT breaks instance segmentation into two parallel subtasks (generating a set of prototype masks and predicting per-instance mask coefficients) and then linearly combines the prototypes with the mask coefficients. BlendMask [42] and CenterMask [43], which were extended from YOLACT, are aimed at blending cropped prototype masks with a finer-grained mask within each bounding box. CenterMask [43] adds a spatial attention-guided mask branch to an anchor-free single-stage object detector (FCOS [40]). BlendMask [42] first predicts dense per-pixel position-sensitive instance features with very few channels and then merges the attention map for each instance through a blender module. PolarMask [44] performs instance segmentation by predicting the contour of each instance via instance center classification and dense distance regression in polar coordinates. EmbedMask [45] generates embeddings for pixels and proposals and assigns pixels to the mask of a proposal if their embeddings are similar. TensorMask [46] performs dense sliding-window instance segmentation using structured 4D tensors to represent masks over a spatial domain. RetinaMask [47] adds a novel instance mask prediction head to the single-shot RetinaNet [48] detector. FCIS [33] and CondInst [49] use fully convolutional networks to produce masks. SOLO [50] reformulates instance segmentation as category prediction and mask generation and directly outputs masks without computing bounding boxes.

To the best of our knowledge, this is the first work to address camouflaged instance segmentation. Given the lack of a large-scale dataset for training and testing purposes, we created a benchmark for the task of camouflaged instance segmentation by training instance segmentation methods on our newly constructed CAMO++ dataset. We further conducted an extensive evaluation of state-of-the-art instance segmentation methods in various scenarios.
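To make the proposal-based, two-stage recipe concrete, the sketch below fine-tunes torchvision's off-the-shelf Mask R-CNN [32] implementation for a binary camouflaged-instance task. This is an illustrative baseline under our assumptions, not the authors' released training code; `CamoDataset` is a hypothetical loader, so the data-loading lines are left as comments.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 2  # background + "camouflaged instance"

# Start from a COCO pre-trained Mask R-CNN (newer torchvision uses weights=...).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Replace the box and mask heads so they predict our two classes.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, NUM_CLASSES)
in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, NUM_CLASSES)

# Hypothetical dataset yielding (image, target) pairs in torchvision format:
# target = {"boxes": FloatTensor[N, 4], "labels": Int64Tensor[N], "masks": UInt8Tensor[N, H, W]}
# loader = torch.utils.data.DataLoader(CamoDataset("train"), batch_size=2,
#                                      collate_fn=lambda b: tuple(zip(*b)))

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()
# for images, targets in loader:
#     losses = model(list(images), list(targets))  # dict of detection/mask losses
#     loss = sum(losses.values())
#     optimizer.zero_grad(); loss.backward(); optimizer.step()

# Inference: each prediction has "boxes", "labels", "scores", and soft "masks".
model.eval()
with torch.no_grad():
    preds = model([torch.rand(3, 480, 640)])
print({k: v.shape for k, v in preds[0].items()})
```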
The core contribution of this paper is our CAMO++ dataset, which extends our preliminary CAMO dataset [1]. This new large-scale dataset enables us to train and evaluate state-of-the-art methods for the task of camouflaged instance segmentation. The "++" indicates the increments in terms of dataset size and task compared with the preliminary CAMO dataset [1]. Example images are shown in Figure 1 along with the corresponding ground-truth annotations. As shown in Table I, the CAMO++ dataset has the most object categories and object instances. Note that the COD dataset [19] provides ground truth only for camouflaged instances and therefore does not satisfy the in-the-wild setting.

A. Dataset Construction

1) Camouflage Image Collection: We initially collected 4,000 images containing at least one camouflaged object. We collected them from the Internet using various search terms combining an adjective ("camouflaged," "concealed," "hidden") with an animal name (e.g., cat, dog, seahorse), a person-related term (e.g., soldier, body-painting), and/or an environment (e.g., marine, underwater, mountain, desert, forest). We manually discarded images with low resolution. We mixed the remaining images with the 1,250 camouflage images in our preliminary CAMO dataset [1] and manually discarded the duplicates. We ended up with 2,700 camouflage images.

We asked ten annotators to identify the camouflaged instances in each image and annotate them using a custom-designed interactive segmentation tool. It took each annotator 5-20 minutes to annotate an image, depending on its complexity. The annotation stage thus spanned a few months. The outcome of this process was a binary mask for each image and a hierarchical category for each instance (see Figure 1).

2) Non-Camouflage Image Collection: We manually selected 2,800 images from the large vocabulary instance segmentation (LVIS) dataset [51] that contained at least one human or animal instance. We manually verified that the selected images did not contain any camouflaged instances, which could otherwise result in false-positive segmentation. Examples of the non-camouflage images are shown in the accompanying figure.

3) Dataset Splits: We randomly split the 2,700 camouflage images into separate training and testing sets: the training set consists of 1,700 images, and the testing set consists of 1,000 images. We also randomly split the 2,800 non-camouflage images into training and testing sets of 1,800 and 1,000 images, respectively.

Category Diversity. The CAMO++ dataset covers various kinds of camouflaged species, as shown by the biology-inspired hierarchical categorization illustrated in Figure 4a. The dataset consists of 13 biological meta-categories (i.e., amphibian, arachnid, asteroidea, bird, cephalopod, crustacean, fish, gastropod, insect, mammal, person, reptile, and worm) and 93 categories. Each category contains 352 instances on average. Figure 5a shows the ratios of the 13 biological meta-categories in the CAMO++ dataset; each biological meta-category contains 2,520 instances on average. The distributions of camouflaged instances by category in three datasets are represented by the word clouds in Figure 2.
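The release format of the annotations is not specified in this excerpt. Assuming a COCO-style layout (a natural choice given the COCO-style evaluation used later), a single camouflaged instance with its hierarchical labels might be recorded roughly as follows; all field names and values here are illustrative, not taken from the actual files.

```python
# Hypothetical COCO-style record for one camouflaged instance (illustrative values).
annotation = {
    "image_id": 1042,
    "category_id": 37,                 # fine category, e.g., "seahorse"
    "meta_category": "fish",           # one of the 13 biological meta-categories
    "vision_meta_category": "marine",  # one of the 8 vision-based meta-categories
    "bbox": [204.0, 118.5, 96.0, 64.0],  # [x, y, width, height] in pixels
    "segmentation": [[204.0, 118.5, 300.0, 118.5, 300.0, 182.5, 204.0, 182.5]],
    "area": 6144.0,                    # mask area in pixels
    "iscrowd": 0,
    "is_camouflaged": True,            # non-camouflage LVIS images would use False
}
```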
Since biological categorization can be difficult for machines and computer vision experts to understand, we also re-organized the categories on the basis of vision features (i.e., appearance) combined with biological features (i.e., behavior, living environment). We created a new set of eight meta-categories (i.e., amphibian, bird, insect, jointed legs, mammal, marine, person, reptile). Figure 5b shows the ratios of the eight vision meta-categories in the CAMO++ dataset; each vision meta-category contains 4,095 instances on average.

Our CAMO++ dataset contains more object categories and meta-categories than previous datasets: 93 categories and 13 meta-categories, in comparison with the 69 categories and 5 meta-categories of the COD dataset [19] and the 67 categories of the MoCA dataset [20] (see Table I). This diversity should benefit both the biology and computer vision communities.

Image Dimension. As shown in Figure 7, the CAMO++ dataset has the greatest diversity in image dimensions. Moreover, it has more high-resolution images than the COD dataset.

Instance Density. In the CAMO++ dataset, each image has from 1 to more than 100 instances, with 6.0 instances per image on average, whereas the COD dataset has only 1.2 instances per image on average and a maximum of 8. As illustrated in Figure 8, the CAMO++ dataset has a large number of images with multiple instances, including separate single instances and spatially connected/overlapping instances. These account for 51% of the images: 38% have from 2 to 10 instances, 10% have from 11 to 30 instances, and the remaining 3% have more than 30 instances. In contrast, only 10% of the images in the COD dataset have multiple instances. This makes our CAMO++ dataset more challenging for the camouflaged instance segmentation task.

Instance Mask Size. The mask size of an instance is defined as the ratio of the number of pixels in the mask to the number of pixels in the image. As shown in Figure 11, the CAMO++ dataset contains a large number of small and medium instances. Small instances (relative size smaller than 0.1) comprise 69.6% of the total number, and medium instances (relative size from 0.1 to 0.3) comprise 23.8%. In the COD dataset, small instances comprise 14.3% of the total number, and medium instances comprise 64.5%. Moreover, the CAMO++ dataset contains a fair number of tiny instances, which makes our dataset even more challenging for the camouflaged instance segmentation task. In addition, the instance bounding box distributions in Figure 9 show that the CAMO++ dataset has a broader range of bounding box sizes than previous datasets.

Center Bias. Figure 10 depicts the distributions of object centers in normalized image coordinates over all images in the camouflage datasets. Camouflaged instances are biased toward the center of the images in all datasets. This can be explained by the observation that camouflage images are usually cropped so that the camouflaged instances are near the center, making it easier for people to identify the concealed instances; this holds even for videos (e.g., the MoCA dataset [20]). Unlike in previous datasets, instances in the CAMO++ dataset are localized over the entire image.

The newly constructed CAMO++ dataset inherits challenging attributes from our preliminary CAMO dataset [1], such as object appearance, background clutter, shape complexity, object occlusion, and distraction. A more detailed description is available elsewhere [1].

A. Proposed Framework

1) Overview: General instance segmentation methods [32], [36] can be applied to camouflaged instance segmentation by fine-tuning models on camouflage datasets.
However, they are imperfect in the sense that each method may have advantages in specific contexts but disadvantages in others. To utilize the strengths of the individual instance segmentation methods, we propose a simple yet efficient CFL framework that fuses various models by learning image contexts. Figure 12 depicts our proposed CFL framework for camouflaged instance segmentation. We first train instance segmentation methods on our CAMO++ dataset independently (see Section V-A). We next generate results for all instance segmentation models and then search for the best model for each image (Algorithm 1 is our search algorithm). We then train a model predictor to predict the best instance segmentation model for each image.

2) Model Search: Algorithm 1 follows a greedy strategy and iterates over the images to update a "waiting list." At each iteration, the segmentation results of the models for an image are added to the waiting list, an evaluation is performed to choose the best model, and the corresponding segmentation results are used as pseudo labels to train the model predictor. In particular, when working with the i-th image, a "prediction list" contains the segmentation masks for the 1st to the (i−1)-th images. The waiting list is the union of the prediction list and the temporary segmentation masks for the i-th image. The average precision (AP) [52] between the ground truths of the whole training set and the waiting list is then evaluated. If the k-th segmentation model provides the best AP value, the image gets pseudo label k for training the model predictor afterward. Additionally, the segmentation masks of the k-th model for the i-th image are appended to the prediction list. A minimal sketch of this search is given at the end of this section.

3) Objective Function: Given an image x with corresponding instance ground truth y, model predictor f, and M instance segmentation models g, our loss function comprises two parts:

L = L_segm + L_pred,

where L_segm is the instance segmentation loss and L_pred is the model prediction loss. The segmentation losses of the instance segmentation models g are calculated in accordance with the authors' released source code; readers can refer to the respective works for details of these segmentation loss functions. Let c be a vector in which the best model i selected using Algorithm 1 is indicated by c_i = 1 and the other models are indicated by 0. The cross-entropy loss for multinomial logistic regression [53]-[55] is used as the model prediction loss:

L_pred = −Σ_{i=1}^{M} c_i log f_i(x).

We employed a Vision Transformer (ViT-Base16) [55] as the model predictor. Five well-known instance segmentation methods (Mask RCNN [32], Cascade Mask RCNN [37], MS RCNN [36], RetinaMask [47], and CenterMask [43]) were empirically chosen for our CFL framework. The framework was developed in PyTorch, while the individual models came from the publicly available source code provided by the respective authors. Our CFL framework was trained in two stages. In the first stage, the instance segmentation models were trained independently with their corresponding segmentation losses (see Section V-A). In the second stage, the instance segmentation models were frozen, the search algorithm was run to select the best models, and the model predictor was trained. The predictor was trained by fine-tuning the ImageNet pre-trained model on our CAMO++ dataset. We adopted five-fold stratified sampling for training: we randomly split the training data into a training set (4 folds) and a validation set (1 fold) to avoid overfitting.
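The following is our reading of Algorithm 1 as a greedy search, given only the prose description above. `evaluate_ap` stands in for a COCO-style AP computation against the ground truths of the whole training set (e.g., via pycocotools) and is an assumed helper, not part of the released code.

```python
from typing import Callable, List, Sequence

def greedy_model_search(
    images: Sequence,                      # training images
    models: Sequence[Callable],            # M frozen instance segmentation models
    evaluate_ap: Callable[[List], float],  # AP of a prediction list vs. training GT (assumed)
) -> List[int]:
    """Assign each image the index of the model whose masks improve overall AP most."""
    prediction_list: List = []     # accepted masks for images processed so far
    pseudo_labels: List[int] = []  # pseudo label k per image, for the model predictor
    for image in images:
        best_ap, best_k, best_masks = -1.0, 0, None
        for k, model in enumerate(models):
            masks = model(image)                # temporary masks for this image
            waiting_list = prediction_list + [masks]
            ap = evaluate_ap(waiting_list)      # AP over all images seen so far
            if ap > best_ap:
                best_ap, best_k, best_masks = ap, k, masks
        pseudo_labels.append(best_k)            # label k trains the predictor
        prediction_list.append(best_masks)      # keep the winning masks
    return pseudo_labels
```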
We also applied simple augmentations: resizing, cropping, translation, rotation, and flipping. The ViT model predictor was trained for 100 epochs with a batch size of 16, a base learning rate of 0.0008, a warmup of 1,000 steps, and cosine learning rate decay. We used AdamW optimization with a weight decay of 0.001 and momentum parameters β1 = 0.9 and β2 = 0.999. We trained the models on PCs with 64 GB of RAM and Tesla P100 GPUs. The training code and trained model will be published upon acceptance of this paper.

In addition to constructing our large-scale CAMO++ dataset, we conducted an intensive benchmark for camouflaged instance segmentation. To this end, we trained and tested eight state-of-the-art instance segmentation methods (Mask RCNN [32], Cascade Mask RCNN [37], MS RCNN [36], RetinaMask [47], YOLACT [41], CenterMask [43], SOLO [50], and BlendMask [42]) on the subsets described in Section III-A3. Figure 13 shows the development timeline of the benchmarked methods. We used three different backbones (ResNet50-FPN [56], ResNet101-FPN [56], and ResNeXt101-FPN [57]) for each instance segmentation method. For YOLACT and BlendMask, we used only the ResNet50-FPN and ResNet101-FPN backbones because these methods have not yet been implemented on the ResNeXt101-FPN backbone [41], [42]. We trained the models on PCs with 64 GB of RAM and Tesla P100 GPUs. The methods were fine-tuned from the MS-COCO pre-trained models using the default public configurations provided by the respective authors.

The results given here follow the standard COCO-style average precision (AP) metrics: AP (averaged over IoU thresholds from 50% to 95%), AP50 (AP at an IoU threshold of 50%), and AP75 (AP at an IoU threshold of 75%) [52]. We also evaluated the results using AP at different scales (APS, APM, and APL), where S, M, and L represent small (area of less than 32 × 32 pixels), medium (area of 32 × 32 to 96 × 96 pixels), and large (area of more than 96 × 96 pixels) objects, respectively. The results also include average recall (AR) metrics: AR1, AR10, and AR100 (AR for the given number of results per image) [52]. As with the AP evaluation, we evaluated the AR results at different scales (ARS, ARM, and ARL).

We evaluated the instance segmentation methods in two different settings:

• Setting 1 simulates the real world (i.e., in-the-wild or unrestricted images), where camouflaged instances are not always present. We trained and tested the models on all images.

• Setting 2 assumes that camouflaged instances are present in every image. We trained and tested the models on only the images containing camouflaged instances.

1) Setting 1 (Camouflaged Instances Are Not Always Present): Table III details the performance of the tested methods in the first experimental setting: segmenting multiple camouflaged instances in unrestricted images. As can be seen, better backbones tended to produce better results within the same method. The ResNeXt101-based implementations had the best performance. In terms of AP, the AP50 metric had the highest scores, whereas the AP75 metric had the lowest scores. The APL, AR100, and ARL metrics had the highest scores in terms of AP across scales, AR, and AR across scales, respectively. In general, the benchmarked methods performed consistently across the performance metrics; in other words, a method performing well on one metric tended to perform well on the others. There was no dominant method across all metrics.
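For reference, the COCO-style numbers reported here can be reproduced with the standard pycocotools API. A minimal sketch follows, assuming the ground truths and one model's results are stored as COCO-format JSON files; the file names are placeholders, since the actual release format may differ.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names, not the actual CAMO++ release files.
coco_gt = COCO("camo_plus_plus_test.json")            # ground-truth annotations
coco_dt = coco_gt.loadRes("model_segm_results.json")  # one model's predictions

coco_eval = COCOeval(coco_gt, coco_dt, iouType="segm")  # mask (not box) evaluation
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP, AP50, AP75, APS/APM/APL, AR1/AR10/AR100, ARS/ARM/ARL
```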
Among the individual methods, for example, RetinaMask surpassed BlendMask in terms of APS and ARS, whereas MS RCNN surpassed RetinaMask in terms of AP75, APM, and APL. Our fusion method is aimed at leveraging the advantages of the different methods in order to produce the best results. Our proposed scene-driven framework selects the best model adaptively for each image by learning its deep visual features. It can thus take advantage of all component models, resulting in superior performance. In particular, our CFL framework achieved state-of-the-art performance across all metrics. It significantly outperformed the other instance segmentation methods, with APs of 19.2, 21.9, and 25.1 on the ResNet50-FPN, ResNet101-FPN, and ResNeXt101-FPN backbones, respectively. It was also consistently better than the others in terms of AR.

2) Setting 2 (Camouflaged Instances Are Always Present): Table IV details the performance of the tested methods in the second experimental setting, in which camouflaged objects are assumed to be present in every image. Our CFL framework again achieved the best performance on the ResNet50-FPN, ResNet101-FPN, and ResNeXt101-FPN backbones.

Figure 14: Qualitative results of the tested methods, including MS RCNN [36], RetinaMask [47], YOLACT [41], CenterMask [43], SOLO [50], and BlendMask [42]. Camouflaged instances are shown in blue, and non-camouflaged instances are shown in red. Best viewed in color with zoom.

Qualitative Comparison: Figure 14 shows a visual comparison of the tested methods on the ResNet50 backbone, as it is the most commonly used backbone for segmentation. The CFL framework achieved the best results, and its results are close to the ground truth. The CFL framework was able to segment camouflaged instances with fine details, demonstrating its robustness. It effectively handled a variety of challenging cases, including camouflaged instances with colors and textures similar to those of the background and images with complex shapes and multiple objects. Failure cases are shown in Figure 15. The framework had trouble localizing and segmenting camouflaged instances due to tiny instance size (top row) and extreme resemblance to the background (second row). These cases are immensely challenging even for humans to detect. There were also failures on occluded or overlapping camouflaged instances (bottom row), resulting in incorrect segmentation (bottom left example) or misclassification between camouflaged instances and non-camouflaged instances (bottom right example).

From the visualizations and the observations above, along with the experimental results, we conclude that even the leading instance segmentation methods of the deep learning era remain limited. They cannot yet effectively segment multiple camouflaged instances in unrestricted images without any assumption (top-1 AP ≤ 25, as in Table III). Hence, accurate camouflaged instance segmentation of in-the-wild images is still far from being achieved, leaving much room for improvement. The results also indicate the challenges posed by our CAMO++ dataset.

The lack of training data may affect the training of camouflage localization and segmentation systems. Previous camouflage research has used extra images along with augmented data to improve segmentation performance. Fan et al. [19] combined images from multiple camouflage datasets to train their network. Li et al. [58] recently used the relationship between saliency and camouflage to train a network to jointly segment salient and camouflaged objects. In this work, we further utilized the information in the non-camouflage images, which contain only general objects, to segment camouflaged instances.
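A minimal sketch of one way to realize such combined training follows, assuming both subsets are exposed as PyTorch datasets; the stub class stands in for a real CAMO++ loader and is purely illustrative.

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class _Stub(Dataset):
    """Stand-in for a CAMO++ subset; a real dataset would return (image, target)."""
    def __init__(self, n: int):
        self.n = n
    def __len__(self) -> int:
        return self.n
    def __getitem__(self, i: int):
        return i  # placeholder sample

camo_train = _Stub(1700)      # camouflage training images
non_camo_train = _Stub(1800)  # non-camouflage training images

# Fine-tune on the union of camouflage and non-camouflage images.
combined = ConcatDataset([camo_train, non_camo_train])
loader = DataLoader(combined, batch_size=2, shuffle=True)
print(len(combined))  # 3500 training images
```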
In particular, we trained the instance segmentation methods on a combination of camouflage and non-camouflage images in our CAMO++ dataset. Table VI compares the performance of the different methods. With the addition of the non-camouflaged instances, the fine-tuned models achieved better performance in segmenting the camouflaged instances, showing that training with additional non-camouflaged instance data helps boost performance. Furthermore, our proposed CFL framework was once again the top performer. In particular, the performance of the CFL framework with the ResNeXt101-FPN backbone benefited from the combined data more than the others, achieving the best performance (AP of 42.8).

In addition, we investigated dataset bias through a cross-dataset generalization evaluation over the CAMO++ and COD datasets. We used only the camouflage images in both datasets because the COD dataset does not have ground truths for the non-camouflage images. For a fair comparison, we randomly selected 1,700 images from the training set and 1,000 images from the test set of the COD dataset to obtain the same numbers of images as in the CAMO++ dataset. We trained Cascade Mask RCNN [37] with the ResNeXt101-FPN backbone on each dataset using the training configuration described in Section V-A. Table V shows the AP obtained by cross-dataset generalization. Each column gives the results of a model trained on one dataset and tested on all datasets, indicating the generalizability of the training dataset. Each row shows the performance of the models trained on the different datasets when tested on a specific dataset, indicating the difficulty of the testing dataset. The results show that our newly constructed CAMO++ dataset is more challenging than the COD dataset. In particular, our training images are unbiased (e.g., a mean value of 30.8 in the bottom row), which should help boost performance on both the CAMO++ and COD datasets. The results also show that our testing images are the most difficult (e.g., a mean value of 28.4 in the right column), as they include many challenging cases such as tiny instances, extreme background resemblance, distraction, and occluded and overlapping camouflaged instances (see Section V-C4 and Figure 15).

VII. CONCLUSION AND OUTLOOK

In our study of the interesting yet challenging problem of camouflaged instance segmentation, we created a large-scale dataset dubbed Camouflaged Object Plus Plus (CAMO++). We also performed an in-depth analysis of CAMO++ to demonstrate its diversity and complexity and developed a camouflage fusion learning framework to further improve the performance of camouflaged instance segmentation. We conducted an extensive benchmark for the camouflaged instance segmentation task and evaluated state-of-the-art instance segmentation methods in various experimental settings. We found that using non-camouflaged instances for training boosts the performance of state-of-the-art methods. However, there is still room for improvement, as shown by the failure cases. The CAMO++ dataset should serve as a valuable benchmark not only for the camouflaged instance segmentation task but also for related tasks such as semantic camouflage segmentation and video camouflaged instance segmentation. We expect that our CAMO++ dataset will greatly support research activities in this field.
We intend to explore the effect of various factors on the given problem. For example, the use of contextual information may be helpful in detecting and segmenting camouflaged instances. We also plan to extend our work to dynamic scenes such as those in videos. In particular, we intend to investigate the use of motion information in segmenting camouflaged instances in videos.

We are grateful to the Software Engineering Lab (University of Science, VNU-HCM) for their support in annotating the CAMO++ dataset. We gratefully acknowledge NVIDIA for their support of the GPUs.

REFERENCES

[1] Anabranch network for camouflaged object segmentation.
[2] Survey of object detection methods in camouflaged image.
[3] Polyp detection and segmentation using Mask R-CNN: Does a deeper feature extractor CNN always perform better? (ISMICT).
[4] COVIDiagnosis-Net: Deep Bayes-SqueezeNet based diagnosis of the coronavirus disease 2019 (COVID-19) from X-ray images.
[5] Multi-task learning for detecting and segmenting manipulated facial images and videos.
[6] OpenForensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild.
[7] A Markov random field model-based approach to unsupervised texture segmentation using local and global spatial statistics.
[8] Graph cuts and efficient N-D image segmentation.
[9] Superpixel-based object class segmentation using conditional random fields.
[10] Superpixel-enhanced pairwise conditional random field for semantic segmentation.
[11] Texture segmentation by multiscale aggregation of filter responses and shape elements.
[12] A new camouflage texture evaluation method based on WSSIM and nature image features.
[13] Camouflage performance analysis and evaluation framework based on features fusion.
[14] Study on the camouflaged target detection method based on 3D convexity.
[15] Foreground object detection using top-down information based on EM framework.
[16] Performance of decamouflaging through exploratory image analysis.
[17] Detection of the mobile object with camouflage color under dynamic background based on optical flow.
[18] Foreground object segmentation for moving camera sequences based on foreground-background probabilistic models and prior probability maps.
[19] Camouflaged object detection.
[20] Betrayed by motion: Camouflaged object discovery via motion segmentation.
[21] Inferring camouflaged objects by texture-aware interactive guidance network.
[22] CamouFinder: Finding camouflaged instances in images.
[23] It's moving! A probabilistic model for causal motion segmentation in moving camera videos.
[24] Animal camouflage analysis: CHAMELEON database.
[25] MirrorNet: Bio-inspired camouflaged object segmentation.
[26] InstanceCut: From edges to instances with MultiCut.
[27] Joint graph decomposition & node labeling: Problem, algorithms, applications.
[28] Proposal-free network for instance-level semantic object segmentation.
[29] Instance-level segmentation for autonomous driving with deep densely connected MRFs.
[30] AdaptIS: Adaptive instance selection network.
[31] Instance-aware semantic segmentation via multi-task network cascades.
[32] Mask R-CNN.
[33] Fully convolutional instance-aware semantic segmentation.
[34] Faster R-CNN: Towards real-time object detection with region proposal networks.
[35] R-FCN: Object detection via region-based fully convolutional networks.
[36] Mask Scoring R-CNN.
[37] Cascade R-CNN: Delving into high quality object detection.
[38] Path aggregation network for instance segmentation.
[39] Objects as points.
[40] FCOS: Fully convolutional one-stage object detection.
[41] YOLACT: Real-time instance segmentation.
[42] BlendMask: Top-down meets bottom-up for instance segmentation.
[43] CenterMask: Real-time anchor-free instance segmentation.
[44] PolarMask: Single shot instance segmentation with polar representation.
[45] EmbedMask: Embedding coupling for one-stage instance segmentation.
[46] TensorMask: A foundation for dense object segmentation.
[47] RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free.
[48] Focal loss for dense object detection.
[49] Conditional convolutions for instance segmentation.
[50] SOLO: Segmenting objects by locations.
[51] LVIS: A dataset for large vocabulary instance segmentation.
[52] Microsoft COCO: Common objects in context.
[53] On loss functions for deep neural networks in classification.
[54] Logistic regression.
[55] An image is worth 16x16 words: Transformers for image recognition at scale.
[56] Deep residual learning for image recognition.
[57] Aggregated residual transformations for deep neural networks.
[58] Uncertainty-aware joint salient object and camouflaged object detection.