authors: Yan, Xinghe; Chen, Zhenxue; Wu, Q. M. Jonathan; Lu, Mengxu; Sun, Luna
title: 3MNet: Multi-task, multi-level and multi-channel feature aggregation network for salient object detection
date: 2021-02-18
journal: Mach Vis Appl
DOI: 10.1007/s00138-021-01172-y

Abstract: Salient object detection is a hot topic in current computer vision. The emergence of the convolutional neural network (CNN) has greatly improved existing detection methods. In this paper, we present 3MNet, a CNN-based network that makes the most of the various features of an image: it uses a salient object contour detection task as an auxiliary task and explicitly models multi-task, multi-level and multi-channel features, fusing them to obtain the final saliency map. Specifically, we first use the contour detection task for auxiliary detection, then use a multi-layer network structure to extract multi-scale image information, and finally introduce a dedicated module into the network to model the channel information of the image. Our network produces good results on five widely used datasets. In addition, we conduct a series of ablation experiments to verify the effectiveness of the components of the network.

Salient object detection refers to separating the objects that most attract human visual attention from the background image [1]. Recently, due to the rapid increase in the quantity and quality of image files, salient object detection has become increasingly important as a precondition for various image processing approaches. In the early stages, salient object detection was applied to image content editing [2], object recognition [3], image classification [43] and semantic segmentation [4]. In recent years, it has also played an important role in intelligent photography [5] and image retrieval [6]. It is worth noting that saliency detection has found an interesting application in emerging Internet video technology: video site users, especially young users, like to post comments while watching a video, and these comments are displayed directly on the screen, which is called a "bullet screen." In addition, salient object detection is also applied to virtual background technology, which can protect the privacy of users in video conferences, especially during the COVID-19 epidemic. As shown in Fig. 1, our saliency detection technology can highlight the important people or objects in the scene so that they are not obscured by the bullet screen, and it allows the real background in a video conference to be replaced by a virtual one.

Early saliency detection techniques were mainly based on the extraction of hand-crafted features. Limited by prior knowledge, these methods often fail to achieve good results in natural scenes. We focus on making full use of deep features at different levels and on explicitly modeling the multi-level information in the image. Convolutional neural networks can effectively extract image features. The low-level layers usually have smaller receptive fields and focus on local details of the image, such as edge information. However, unlike traditional edge detection, we mainly focus on salient objects and ignore the cluttered lines in the background; as such, we use salient foreground contours as an auxiliary task for our salient object detection.
Most existing methods simply merge multi-channel feature maps, ignoring the variety of effects that different feature channels may have on the final saliency map. We model the feature channels explicitly, introduce a global pooling method with a large visual receptive field into the modeling of the feature channels and reweight each feature channel.

In general, our proposed 3MNet uses a U-shaped structure as the main structure, with a contour detection branch as an auxiliary task, and introduces channel reweighting modules into the network, so as to explicitly model and combine the multi-task, multi-level and multi-channel features of the image. Specifically, the contour detection task refines the edge details of salient objects. The multi-level network structure better aggregates the local and global feature information of the image. Many multi-channel feature maps are generated in the deep network; modeling the channel features helps to mine the deep channel information in the image and enhance the weight of high-contribution channels. Our experiments also show that combining multiple image features effectively improves detection accuracy.

The main contributions of this paper are as follows:
(1) The proposed 3MNet makes full use of the deep salient information in the image and combines multi-task, multi-level and multi-channel features to explicitly model the saliency detection task. We achieve good results on the salient object detection task, supplemented by object contour detection.
(2) Compared with traditional models and other deep detection models, our model has higher accuracy, and it leads the other methods on multiple evaluation metrics over the five most commonly used datasets. In addition, we conduct a series of ablation experiments to verify the effectiveness of our network structure.
(3) Our training process requires salient object contour information. Therefore, we provide salient object contour ground-truth maps for multiple training sets as a supplement, so that researchers can adopt more optional auxiliary methods for saliency detection.

The rest of the paper is organized as follows: Section 2 introduces related work on salient object detection. The specific structure of our proposed approach is described in Sect. 3. Section 4 presents and analyzes our experimental results. Section 5 concludes the paper.

Early salient object detection used data-driven, bottom-up approaches. In 1998, Itti et al. [7] proposed the classic saliency visual attention model. For a long time, hand-crafted features such as contrast, color and background priors dominated salient object detection. Achanta et al. [8] introduced a frequency-tuned model to extract the global features of the image. Jiang et al. [9] used an absorbing Markov chain to calculate the absorption time; they considered solving the problem mathematically rather than imitating human vision. The method in [42] introduced a bootstrap learning algorithm into the salient object detection task. Researchers also proposed preprocessing and post-processing methods such as super-pixel detection [10] and conditional random fields [11]. Recently, salient object detection models based on deep learning have been widely studied and applied.
Inspired by various network optimization methods, especially the emergence of convolutional neural network structures [24], more and more models designed for saliency detection tasks have appeared and have achieved unprecedented detection results on various evaluation criteria. Since the introduction of VGG [12] and residual networks [13], saliency detection models with these networks as the base structure have developed considerably. Researchers have achieved better results by appropriately increasing the depth and expanding the width of the network. Amulet [14] combines features of different levels in the deep network to predict salient regions. DHSNet [15] aggregates the characteristics of many different receptive fields to obtain performance gains. Ronneberger et al. [16] propose a U-shaped network structure for image segmentation; Liu et al. propose PoolNet [17] for saliency detection based on a similar structure and obtain accurate and fast detection performance. Hou et al. [18] ingeniously build short connections between multi-level feature maps to make full use of high-level features to guide detection. Li et al. [41] explore channel characteristics with reference to the structure of SENet [23].

Apart from innovations in the depth and breadth of the network structure, some researchers have also attempted multi-task-assisted saliency detection. Li et al. [19] combine the saliency detection task with the image semantic segmentation task; through the collaborative feature learning of these two related tasks, the shared convolutional layers produce effective object perception features. Zhuge et al. [20] focus on using the boundary features of the objects in the image, utilizing edge ground-truth labels to supervise and refine the details of the detection feature map. The method in [44] makes full use of multi-temporal features and shows the effectiveness of multiple features in improving detection performance. The work in [21] applies saliency detection to dynamic video processing, greatly expanding the application space of saliency detection.

Our model captures the features of the image to be detected from the following aspects. First, we set up two parallel network branches to perform salient object detection and salient object contour detection. Second, we use a U-shaped network structure [16] as the main structure of each branch to aggregate the salient features extracted from different levels. Finally, for the basic unit of each convolution module, we make full use of the channel characteristics: we use global pooling to obtain the corresponding global receptive field of each channel and learn how much each channel contributes to the salient features. According to the learning results, we then recalibrate the weights of the feature channels. The specific framework of the model is shown in Fig. 2.

Fig. 2 Overall structure of our proposed network framework. The RWC module is the RWConv module. The upper part explicitly models the contour information and uses this information to help detect salient targets. The lower left part uses multi-level image features to fuse salient feature maps.

For a common RGB three-channel image, the salient stimulation that each channel produces on the human eye may be different [22]. This suggests that different feature channels of salient feature maps may also contribute differently to saliency detection.
We refer to the structure of SENet [23] and propose a similar multi-channel reweighted convolution module, RWConv, and a multi-channel reweighted fusion module, RWFusion. These two structures are shown in Fig. 3 (a code sketch of the reweighting idea is given below). For each basic convolution unit RWConv, we use ResNet's convolution layers [13] as the main structure. On this basis, we introduce a second branch between the residual output and its summation with the input x, which serves as the weight storage area. For an input with c channels, width w and height h, we first use global pooling to convert the input to an output of size 1×1×c. To some extent, these c real numbers describe the global characteristics of the input. The calculation is shown in Eq. 1:

z_k = (1 / (w × h)) Σ_{i=1}^{w} Σ_{j=1}^{h} P_k(i, j),   (1)

where P_k(i, j) is the feature value at coordinate (i, j) in the kth channel of the given feature map. In order to fully represent the relationships between channels, so that our model can focus on the channels that contribute more, we add two fully connected layers after the global pooling. The number of nodes in each fully connected layer is the same as the number of channels in the preceding layer, and a ReLU layer is added to ensure the nonlinearity of the model. After obtaining the final channel weights W_c, we rescale the input channels by the corresponding c weight parameters and accumulate the result to obtain the final output; this operation corresponds to the scale module in the network.

The basic structure of the RWFusion part is roughly the same as that of the RWConv part, except that one of the addends x is replaced by the same-size feature map from the other side of the U-shaped network, and the input to the main branch is obtained by an upsampling operation. The basic module of the contour detection part is the same as the above-mentioned RWConv and RWFusion. This fusion method takes the multi-level and multi-channel characteristics into account, makes full use of the detailed information in the image and enhances the expressive ability of the network.

Explicitly modeling contour features is undoubtedly helpful for refining the details of salient objects. However, high-level feature maps often have large receptive fields and cannot attend to the details of the target, whereas low-level feature maps can help us refine the contour details of objects [25]. As such, we take low-level features into consideration. We use a two-layer RWConv structure to extract the contour features of the object in the main part of the network; then, after obtaining the salient contour feature map E_j, we use the same fusion method, where up(*, θ) denotes upsampling the feature map, θ is the upsampling factor and RWF is the multi-channel feature reweighted fusion operation. We fuse the two saliency contour feature maps according to a combined strategy in which Con denotes channel-wise concatenation of feature maps and Conv denotes the convolution operation; the parameters ω_i are trained through the convolutional layer. In order to effectively obtain the contours of salient targets, we imitate the prior knowledge used in traditional methods [26] and increase the contour weights of salient regions. Here, we use the high-level feature map S_4 as a prior map to emphasize the importance of the salient region and obtain the final fused contour saliency map E_f.

For the main part of the model framework, we adopt a design that is similar to a U-shaped network structure [16].
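To make the channel-reweighting idea concrete, the following is a minimal PyTorch sketch, not the authors' released code: it assumes a ResNet-style two-convolution residual body, a sigmoid gate after the two fully connected layers (as in SENet [23]), and illustrative class names (ChannelReweight, RWConvSketch).

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """Channel-reweighting branch: global average pooling followed by two
    fully connected layers (each with c nodes) and a ReLU, producing one
    weight per channel (sigmoid gate is an assumption borrowed from SENet)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))          # global average pooling, as in Eq. 1
        w = self.fc(w).view(b, c, 1, 1)
        return x * w                    # "scale": reweight each feature channel

class RWConvSketch(nn.Module):
    """Residual convolution unit with the reweighting branch inserted between
    the residual mapping and the identity addition."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.reweight = ChannelReweight(channels)

    def forward(self, x):
        return torch.relu(x + self.reweight(self.body(x)))
```

An RWFusion-style unit would follow the same pattern, with the identity input x replaced by the same-size encoder feature map and the main input obtained by upsampling, as described above.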
The basic unit of the convolution layers is the multi-channel feature response reweighting module (RWConv) introduced in detail in Sect. 3.1. First, the input image passes through four consecutive levels of RWConv layers to form four corresponding feature maps, one per level. The feature fusion module at each level is RWFusion, which was also introduced in Sect. 3.1. We denote the feature map obtained at each level of the salient object detection branch as F_i and fuse them according to Eq. 5, whose operations are the same as those in Eqs. 3 and 4; this multi-feature fusion yields the final result R_f.

In the training phase, we use the MSRA10K dataset [27] as our training set. The dataset contains 10,000 high-quality images with salient objects and is labeled at the pixel level. In addition, we randomly selected 5000 images from the DUTS-TR [40] dataset to expand our training set. We do not use a validation set during training. Since our training uses salient object contour supervision in addition to the original ground-truth maps, we need to extend the dataset: we apply the Laplacian operator from the OpenCV toolbox to perform edge detection on the targets in the ground-truth maps. In this way, we obtain a set of 10K images with pixel-level object contour annotations.

Our implementation is based on the PyTorch deep learning framework. The training and testing processes are performed on a desktop with an NVIDIA GeForce RTX 2080Ti (11 GB of memory), on which our model achieves a relatively fast speed of 16 fps. The initial values of the main parameters of the first half of the U-shaped network are consistent with ResNet [13], and the other parameters are initialized randomly. We use the cross-entropy loss between the predicted feature map and the ground-truth map, computed with the softmax function; here α_i represents the ith value of the predicted C-dimensional vector and y_i represents the corresponding label value in the ground truth. We take C = 2 to distinguish background from foreground, and ω is a weighting parameter. The proposed model is end to end and does not contain any preprocessing or post-processing operations. We train the network for 30 epochs. During training, stochastic gradient descent is used with momentum 0.9 and weight decay 0.0005. The base learning rate is set to 1e-6 and is reduced by 50% every 10 epochs.

We qualitatively and quantitatively compare different methods on five commonly used benchmark datasets. The ECSSD [28] dataset contains 1,000 complex images containing salient objects of different sizes. The SOD [29] dataset is built on the basis of BSD [30], with pixel-level annotations made by Jiang et al. [31]; it contains 300 high-quality and challenging images.

We use five common evaluation metrics to assess model performance: the precision-recall curve [1], the F-measure [36], the receiver operating characteristic (ROC) curve [36], the area under the ROC curve (AUC) [36] and the mean absolute error (MAE) [36, 37]. We binarize the predicted saliency map at a certain threshold and compare the resulting binary map with the ground truth to obtain the precision and recall, with the F-measure as the weighted harmonic mean of the two:

Precision = |M ∩ G| / |M|,  Recall = |M ∩ G| / |G|,
F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall),

where M is the binarized saliency map, G is the ground truth and β² is generally set to 0.3 in order to emphasize the importance of the precision value [1].
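As a concrete illustration of these metrics, here is a minimal Python sketch (not the authors' evaluation code) that computes precision, recall and F_β for a single binarization threshold, assuming the prediction and ground truth are 2-D NumPy arrays with values in [0, 1]; the function name and the loader in the usage comment are hypothetical.

```python
import numpy as np

def precision_recall_fbeta(saliency, gt, threshold, beta2=0.3, eps=1e-8):
    """Binarize the predicted saliency map at `threshold` and compare it
    with the binary ground truth `gt` (both 2-D arrays in [0, 1])."""
    m = saliency >= threshold           # binarized prediction M
    g = gt >= 0.5                       # binary ground truth G
    tp = np.logical_and(m, g).sum()     # |M ∩ G|
    precision = tp / (m.sum() + eps)
    recall = tp / (g.sum() + eps)
    # weighted harmonic mean with beta^2 = 0.3 to emphasize precision [1]
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return precision, recall, f_beta

# Example (hypothetical loader): sweep thresholds to trace the P-R curve
# and take the maximum F-measure.
# saliency, gt = load_prediction_and_gt(...)
# scores = [precision_recall_fbeta(saliency, gt, t) for t in np.linspace(0, 1, 256)]
# max_f = max(f for _, _, f in scores)
```

Sweeping the threshold in this way, as described next, traces the precision-recall curve and yields the maximum F-measure.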
For each fixed binarization threshold, a different precision-recall pair and F-measure value are obtained. We plot these as curves and report the maximum F-measure over all thresholds. Additionally, we can obtain paired false positive rates (FPR) and true positive rates (TPR), from which we draw the ROC curve and calculate the AUC value:

TPR = |M ∩ G| / |G|,  FPR = |M ∩ Ḡ| / |Ḡ|,

where M is the binary salient feature map, G is the truth map and Ḡ is the result of negating G. MAE is the mean absolute error between the normalized saliency map S and the ground truth G:

MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |S(x, y) − G(x, y)|,

where W and H are the width and height of the image, respectively.

Our experiments quantitatively compare our model with eight other saliency detection algorithms (Amulet [14], DSMT [19], DHSNet [15], MDF [33], NLDF [38], RFCN [39], DRFI [31] and RC [27]). The P-R curves for some of the datasets are shown in Fig. 4, and the ROC curves are shown in Fig. 5. We compare the five performance indicators on the five datasets mentioned above.

Quantitative Comparison: On the five commonly used datasets mentioned above, we quantitatively compare the P-R curve, the ROC curve and the MAE value; the corresponding experimental results are shown below. For the P-R curve, the quantitative result of interest is the F-measure, and for the ROC curve the AUC can be compared quantitatively, as shown in Table 1. It can be seen from the table that, on the two quantitative indicators F-measure and AUC, our model performs better than the other methods on most of the five popular datasets. The bold entries in the table indicate the best-performing method on each dataset. In particular, compared with the second-best method, the F-measure increases by 3.2%, 3.7%, 5.4%, 4.1% and 2.9% on the HKU-IS, ECSSD, DUT-OMRON, PASCAL-S and SOD datasets, respectively. Although DSMT scores higher on the AUC indicator on the PASCAL-S and SOD datasets, it is not as good as our method at refining the target contour and uniformly highlighting the salient target, as can be seen in the qualitative visual comparison below. The DRFI and RC methods are outstanding among the traditional methods; by comparison, we can see that models based on deep networks perform much better than traditional methods, which is explained in [24]. Figure 6 shows the MAE values of the nine methods mentioned above on four datasets; the histograms show that our model has the best performance on these datasets.

Fig. 6 MAE histograms of the above detection methods (panels include HKU-IS and SOD). From left to right in each histogram: our method, Amulet [14], DHSNet [15], DSMT [19], MDF [33], NLDF [38], RFCN [39], DRFI [31] and RC [27].

Qualitative Comparison: Figure 7 compares the performance of our model with the other detection methods in different scenarios; the images are selected from the aforementioned datasets. Through intuitive comparison, we find that, due to the explicit modeling of the contour of the salient object, our method can better refine the contour of the target to be detected; it also achieves good performance in the overall consistency of the salient target.

Our ablation experiments focus on the impact of the contour-aided detection and the multi-channel reweighting module on detection performance. Our baseline model is a network structure without these two components.
We take the ECSSD [28] dataset as an example and successively add the contour-assisted detection branch and the channel reweighting modules. The evaluation indicators F_β and MAE are shown in Table 2. After successively introducing the contour features and the channel features, the F-measure improves by 2.1% and 1.5%, while the MAE is reduced by 0.012 and 0.002, respectively. From this, we can see that the contour features improve the detection performance more significantly. The salient feature maps before and after adding the multi-feature cues are shown in Figs. 8 and 9. Qualitative observation shows that the saliency maps produced with the contour-assist module have clearer boundaries, and adding the multi-channel reweighting module makes full use of the information in the feature channels to help highlight the target area uniformly.

Fig. 8 Comparison of images before and after adding multi-channel features. (a) Input image; (b) ground truth; (c) feature map before adding multi-channel features; (d) feature map with multi-channel features.

Fig. 9 Visual effect of adding the contour-assisted detection module. (a) Input images; (b) original detection feature maps; (c) contour auxiliary feature maps; (d) feature maps with contour information. After adding contour information, the detailed information of the object is more refined; for instance, the wings of the bird in the picture become clearer.

This paper explores methods to make full use of multiple aspects of image information and proposes a saliency detection network that combines multi-level, multi-task and multi-channel features. The network explicitly models these three kinds of features: multi-level features are modeled with the U-shaped network, multi-task features are modeled with the contour-assisted branch, and multi-channel features are modeled with the reweighting modules. The model is end to end, without any preprocessing or post-processing. The multi-task and multi-channel modeling is relatively flexible and can be used to improve most existing models. Experiments show that our method is comparable to state-of-the-art deep learning methods on various datasets.

This work was supported in part by the Key Laboratory of System Control and Information Processing (Scip201801), in part by the Foundation of the Key Laboratory of Intelligent Computing & Information Processing of the Ministry of Education (2018ICIP03), and in part by the Foundation of the State Key Laboratory of Integrated Services Networks (ISN20-06). Xinghe Yan and Zhenxue Chen contributed equally to this work and should be considered co-first authors.
References
[1] What is a salient object? A dataset and a baseline model for salient object detection
[2] A shape-preserving approach to image resizing
[3] Enhanced hierarchical model of object recognition based on a novel patch selection method in salient regions
[4] A novel image segmentation algorithm based on visual saliency detection and integrated feature extraction
[5] Sketch2Photo: Internet image montage
[6] Robust pre-processing technique based on saliency detection for content based image retrieval systems
[7] A model of saliency-based visual attention for rapid scene analysis
[8] Frequency-tuned salient region detection
[9] Saliency detection via absorbing Markov chain
[10] Superpixel-based spatiotemporal saliency detection
[11] Deep contrast learning for salient object detection
[12] Very deep convolutional networks for large-scale image recognition
[13] Deep residual learning for image recognition
[14] Amulet: Aggregating multi-level convolutional features for salient object detection
[15] DHSNet: Deep hierarchical saliency network for salient object detection
[16] U-Net: Convolutional networks for biomedical image segmentation
[17] A simple pooling-based design for real-time salient object detection
[18] Deeply supervised salient object detection with short connections
[19] DeepSaliency: Multi-task deep neural network model for salient object detection
[20] Boundary-guided feature aggregation network for salient object detection
[21] Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion
[22] Saliency detection based on non-uniform quantification for RGB channels and weights for Lab channels
[23] Squeeze-and-excitation networks
[24] Visualizing and understanding convolutional networks
[25] Richer convolutional features for edge detection
[26] Graph-regularized saliency detection with convex-hull-based center prior
[27] Global contrast based salient region detection
[28] Hierarchical saliency detection
[29] Design and perceptual validation of performance measures for salient object segmentation
[30] A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics
[31] Salient object detection: A discriminative regional feature integration approach
[32] Saliency detection via graph-based manifold ranking
[33] Visual saliency based on multiscale deep features
[34] The secrets of salient object segmentation
[35] The PASCAL visual object classes (VOC) challenge
[36] Salient object detection: A benchmark
[37] Saliency optimization from robust background detection
[38] Non-local deep features for salient object detection
[39] Saliency detection with recurrent fully convolutional networks
[40] Learning to detect salient objects with image-level supervision
[41] Deep saliency with channel-wise hierarchical feature responses for traffic sign detection
[42] Salient object detection via bootstrap learning
[43] SDL: Saliency-based dictionary learning framework for image similarity
[44] Recovering quantitative remote sensing products contaminated by thick clouds and shadows using multitemporal dictionary learning