title: Multi-stage Fusion for One-Click Segmentation
authors: Majumder, Soumajit; Khurana, Ansh; Rai, Abhinav; Yao, Angela
date: 2021-03-17
journal: Pattern Recognition
DOI: 10.1007/978-3-030-71278-5_13

Segmenting objects of interest in an image is an essential building block of applications such as photo-editing and image analysis. Under interactive settings, one should achieve good segmentations while minimizing user input. Current deep learning-based interactive segmentation approaches use early fusion and incorporate user cues at the image input layer. Since segmentation CNNs have many layers, early fusion may weaken the influence of user interactions on the final prediction results. As such, we propose a new multi-stage guidance framework for interactive segmentation. By incorporating user cues at different stages of the network, we allow user interactions to impact the final segmentation output in a more direct way. Our proposed framework has a negligible increase in parameter count compared to early-fusion frameworks. We perform extensive experimentation on the standard interactive instance segmentation and one-click segmentation benchmarks and report state-of-the-art performance.

The widespread availability of smartphones has made taking photos easier than ever. In a typical image capturing scenario, the user taps the device touchscreen to focus on the object of interest. This tap directly locates the object in the scene and can be leveraged for segmentation. Generated segmentations are implicit, but are applicable for downstream photo applications, such as simulated 'bokeh' or other special-effects filters such as background blur (see Fig. 1). In this work, we tackle "tap-and-shoot segmentation" [4], a special case of interactive instance segmentation. Interactive segmentation leverages inputs such as clicks, scribbles, or bounding boxes to help segment objects from the background down to the pixel level.

Two key differences distinguish tap-and-shoot segmentation from standard interactive segmentation. First, tap-and-shoot uses only "positive" clicks marking foreground, as we assume that the user clicks (only) on the object of interest during the capture process. Standard interactive segmentation uses both positive and negative clicks [18, 19, 28] to respectively indicate the object of interest versus background or non-relevant foreground objects. Secondly, tap-and-shoot has a strong focus on maximizing the mean intersection over union (mIoU) with a single click because the target application is casual photography. In contrast, standard interactive segmentation tries to achieve some threshold mIoU (e.g. 85%) while minimizing the total number of clicks.

This second distinction is subtle but critical for designing and learning tap-and-shoot segmentation frameworks. Our finding is that existing approaches fare poorly with only one or two clicks; they are simply not trained to maximize performance under such settings. To make the most of the first (few) click(s), we hypothesize that the guidance from user cues should be fused into the network at multiple locations rather than only via early fusion. Just as gradients vanish towards the initial layers during back-propagation, input signals also diminish as they make a forward pass through the network. The many layers of deep CNNs further exacerbate this effect [14, 22].
A late fusion would allow the user interaction to have a direct and more pronounced effect on the final segmentation mask. To this end, we propose an interactive segmentation framework with multi-stage fusion and demonstrate its advantages over the common early-fusion frameworks and other alternatives. Specifically, we propose a light-weight fusion block that encodes the user click transformation and allows a shorter connection from user inputs to the final segmentation layer.

Most similar in spirit to our framework are [14] and [23]. These two works also propose alternatives to early fusion but are extremely parameter heavy. For example, [14] uses two dedicated VGG [26] networks to extract features from the image and the user interactions separately before fusing into a final instance segmentation mask (see Fig. 2(c)). [23] uses a single stream but applies a simple late fusion of element-wise multiplication on the feature maps (see Fig. 2(b)). It therefore has separate 'positive' and 'negative' feature maps, and the number of weights for the following layer increases by a factor of 2. For VGG, this doubles the parameters of the ensuing 'fc6' layer from roughly 100 to 200 million (a rough calculation is sketched below). Compared to [23], our last-stage fusion approach is light-weight and uses less than 1.5% more trainable parameters.

Our contributions are summarized as follows:
- We propose a novel one-click interactive segmentation framework that fuses user guidance at different network stages.
- We demonstrate that multi-stage fusion is highly beneficial for propagating guidance and increasing the mIoU, since it allows user interaction to have a more direct impact on the final segmentation.
- Comprehensive experiments on six benchmarks show that our approach significantly outperforms the existing state-of-the-art for both tap-and-shoot and standard interactive instance segmentation.

(Fig. 1 caption: We consider the popular special-effect filter used in mobile photography, background blur. Here the user intends to blur the rest of the image barring the dog. In most existing interactive segmentation approaches [18, 19, 28], the user click (here placed on the dog) is leveraged only at the input layer and its influence diminishes through the layers. This can result in unsatisfactory image effects, e.g. portions of the dog's elbow and ear are wrongly classified as background and are mistakenly blurred (shown in enlarged red boxes). Our proposed multi-stage fusion allows the user click to have a more direct effect, leading to improved segmentation quality (shown in enlarged green boxes).)

As an essential building block of image/video editing applications, interactive segmentation dates back decades [21]. The latest methods [14, 18, 19, 23, 28] integrate deep architectures such as FCN-8s [17] or DeepLab [5, 6]. Most of these approaches integrate user cues at the input stage: the clicks are transformed into 'guidance' maps and appended to the three-channel colour image input before being passed through a CNN [18, 19, 28].

Early interactive instance segmentation methods used graph-cuts [3, 24], geodesics, or a combination of both [10]. These methods' performance is limited as they separate the foreground and background based on low-level colour and texture features. Consequently, for scenes where foreground and background are similar in appearance, or where lighting and contrast are low, more labelling effort is required from the users to achieve good segmentations [28].
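As a rough illustration of the parameter count mentioned above, consider the standard VGG-16 'fc6' layer, which maps a flattened 7 × 7 × 512 feature map to 4096 units. The figures below are a back-of-the-envelope estimate (weights only, biases ignored), not exact counts from [23].

```python
# Rough weight count for VGG-16's 'fc6' layer.
in_features = 7 * 7 * 512            # 25,088 inputs after flattening pool5
out_features = 4096

fc6_params = in_features * out_features               # ~102.8 million weights
print(f"fc6 (single stream): {fc6_params / 1e6:.1f}M")

# Keeping separate 'positive' and 'negative' feature maps doubles the number
# of input channels (512 -> 1024), which doubles the fc6 weight matrix.
fc6_params_doubled = (7 * 7 * 1024) * out_features    # ~205.5 million weights
print(f"fc6 (doubled channels): {fc6_params_doubled / 1e6:.1f}M")
```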
Recently, deep convolutional neural networks [6, 17] have been incorporated into interactive segmentation frameworks. Initially, [28] used Euclidean distance-based guidance maps to represent user-provided clicks, which are passed along with the input RGB image through a fully convolutional network. Subsequent works made extensions with newer CNN architectures [18], iterative training procedures [18] and structure-aware guidance maps [19]. These works share a structural similarity: the guidance maps are concatenated with the RGB image as additional channels at the first (input) layer. We refer to this form of structure as early fusion (see Fig. 2(a)). Architecture-wise, early fusion is simple and easy to train; however, the influence of the user inputs gets diminished through the layers.

Tap-and-Shoot Segmentation was introduced by [4], and refers to the one-click interactive setting. One assumes that during image capture, the user taps the touchscreen (once) on the foreground object of interest, from which one can directly segment the object of interest. [4] uses early fusion; it transforms the user tap into a guidance map via two shortest-path minimizations and then concatenates the map to the input image. The authors validate only on simple datasets such as ECSSD [25] and MSRA10K [7], where the images contain a single dominant foreground object. As we show later in our benchmarks (see Table 1), these datasets are so simplistic that properly trained networks with no user input can also generate high-quality segmentation masks which are comparable to or even surpass the results reported by [4].

(Fig. 2 caption, partial: The work of [14] uses two dedicated VGG [26] networks for extracting features from the image and the user interactions separately. (c) The work of [23] performs late fusion via element-wise multiplication on the feature maps, which requires an additional 100 million parameters. (d) We leverage user guidance at the input (early fusion) and via late fusion. Our multi-stage fusion reduces the layers of abstraction and allows user interactions to have a more direct impact on the final output.)

Feature Fusion in Deep Architectures is an efficient way to leverage complementary information, either from different modalities [27] or different levels of abstraction [29]. Element-wise multiplication [23] and addition [14, 16] are two common operations applied for fusing multiple channels. Other strategies include 'skip' connections [17], where features from earlier layers are concatenated with the features extracted from the deeper layers. Recently, a few interactive instance segmentation works have begun exploring outside of the early-fusion paradigm to integrate user guidance [14, 23]. However, these approaches are heavy in their computational footprint, as they increase the number of parameters to be learned by on the order of hundreds of millions [23]. Dilution of input information is commonplace in deep CNNs as the input is processed through several blocks of convolution [22]. Feature fusion helps preserve input information by reducing the layers of abstraction between the user interaction and the segmentation output.

We follow the conventional paradigm of [18, 19, 28] in which 'positive' and 'negative' user clicks are transformed into 'guidance' maps of the same size as the input image. Unlike [18, 19, 28], we work within the one-click setting. The user provides a single 'positive' click on the object of interest; this click is then encoded into a single channel guidance map G (see Sect. 3.3).
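A minimal sketch of this early-fusion input is given below; the guidance map G is concatenated with the RGB image into a 4-channel tensor before the first convolution. Shapes and module names here are illustrative, not our full implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionInput(nn.Module):
    """Illustrative first stage of an early-fusion network: RGB + guidance map."""
    def __init__(self, out_channels=64):
        super().__init__()
        # First convolution accepts 3 colour channels + 1 guidance channel.
        self.conv1 = nn.Conv2d(4, out_channels, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, image, guidance):
        # image: (B, 3, H, W); guidance: (B, 1, H, W)
        x = torch.cat([image, guidance], dim=1)   # (B, 4, H, W)
        return self.relu(self.bn1(self.conv1(x)))

x = EarlyFusionInput()(torch.rand(1, 3, 512, 512), torch.rand(1, 1, 512, 512))
print(x.shape)  # torch.Size([1, 64, 256, 256])
```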
We then feed the 3-channel RGB image input and the guidance map as an additional channel into a fully convolutional network. Figure 3(a) shows an overview of our pipeline. Typically these FCNs are fine-tuned versions of semantic segmentation networks such as FCN-8s [17] or DeepLab [5]. For our base segmentation network, we use DeepLab-v2 [5]; it consists of a ResNet-101 [12] feature extraction backbone and a Pyramid Scene Parsing (PSP) module [30] acting as the prediction head. Upon receiving the input of size h × w × 4, the ResNet-101 backbone generates feature maps of dimension h/8 × w/8 × 2048 (Fig. 3(a)).

The fusion module consists of 3 Squeeze-and-Excitation residual blocks (SE-ResNet) [13] (see the sketch below). Proposed in [13], SE-ResNet blocks have been shown to be effective for a variety of vision tasks such as image classification on ImageNet [8] and object detection on MS COCO [15]. SE-ResNet blocks incur minimal additional computational overhead as they consist of two 3 × 3 convolutional layers, two inexpensive fully connected layers and a channel-wise scaling operation. Each SE-ResNet block consists of a residual block, a squeeze operation which produces a channel descriptor by aggregating feature maps across their spatial dimensions, a dimensionality reduction layer (by reduction ratio r) and an excitation operation which captures the channel interdependencies. The individual components of the SE-ResNet block are shown in Fig. 3(b). The residual block consists of two 3 × 3 convolutions, batch normalization, and a ReLU non-linearity (Fig. 3(c)). We fix the number of filter banks to be 256 for each of the 3 × 3 convolutions. The reduction ratio r is kept at 16 [13].

The input to the fusion block is an h/4 × w/4 × 256 feature map, which is obtained by processing the h × w × 4 input with a 7 × 7 convolution operation with stride 2, batch normalization, ReLU non-linearity and a 2 × 2 max-pooling operation with stride 2 (Init block, Fig. 3(a)). The fusion block outputs an h/8 × w/8 × 256 feature map. This is concatenated with the h/8 × w/8 × 2048 feature map obtained from the feature extraction backbone to obtain an h/8 × w/8 × 2304 feature map. On top of these feature maps, PSP performs pooling operations at different grid scales to gather a global contextual prior, leading to feature maps of dimension h/8 × w/8 × 512. The multi-scale feature pooling of PSP [30] enables the network to capture objects occurring at different image scales. Pixel-wise foreground-background classification is performed on these down-sampled feature maps. The network outputs a probability map representing whether a pixel belongs to the object of interest or not. Bi-linear interpolation is performed to up-sample the predicted probability map to the same dimensions as the original input image.

In interactive approaches, pixel values of the guidance map are defined as a function of each pixel's distance on the image grid to the points of user interaction (Eq. 1). This includes Euclidean [14, 28] and Gaussian guidance maps [18]. For each pixel position p on the image grid, the pair of distance-based guidance maps for positive (+) and negative (−) clicks can be computed as

G^t(p) = min_{q ∈ S^t} d(p, q),  t ∈ {+, −},  (1)

where S^t denotes the set of clicks of type t. For Euclidean guidance maps [28], the function d(·, ·) is the Euclidean distance. For Gaussian guidance maps, d(·, ·) is a Gaussian of the distance and the 'min' is replaced by a 'max' operator. A more recent approach advocated taking image structures such as super-pixels and region-based object proposals into consideration to generate guidance maps [19].
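The following is a minimal sketch of one SE-ResNet block of the fusion module as described above (two 3 × 3 convolutions with 256 filters, batch normalization, ReLU, and squeeze-and-excitation with reduction ratio r = 16). It is a simplified illustration: the exact strides and the downsampling from h/4 × w/4 to h/8 × w/8 are omitted here.

```python
import torch
import torch.nn as nn

class SEResNetBlock(nn.Module):
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        # Residual branch: two 3x3 conv -> BN stages with a ReLU in between.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # Squeeze: global average pool; excitation: two FC layers + sigmoid.
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.residual(x)
        b, c, _, _ = out.shape
        scale = self.excite(self.squeeze(out).view(b, c)).view(b, c, 1, 1)
        return self.relu(x + out * scale)   # channel-wise re-scaling + skip connection

# The fusion module stacks three such blocks on the h/4 x w/4 x 256 map
# produced by the Init block (downsampling to h/8 x w/8 not shown here).
fusion = nn.Sequential(*[SEResNetBlock() for _ in range(3)])
print(fusion(torch.rand(1, 256, 128, 128)).shape)  # torch.Size([1, 256, 128, 128])
```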
To generate the guidance maps, we use Gaussian transformations [18], as this encoding offers a favourable trade-off between simplicity and performance. We initialize an image-sized all-zero channel and place a Gaussian with a standard deviation of 10 pixels at the user click location. Note that we do not use 'negative' clicks in our framework.

Network Optimization. We train the network to minimize the class-balanced binary cross-entropy loss

L = (1/N) Σ_j w_{y_j} BCE(y_j, ŷ_j),

where N is the number of pixels in the image and BCE(·) is the standard cross-entropy loss between the label y_j and the prediction ŷ_j at pixel location j, given by

BCE(y_j, ŷ_j) = −[ y_j log ŷ_j + (1 − y_j) log(1 − ŷ_j) ].

w_{y_j} is the inverse normalized frequency of labels y_j ∈ {0, 1} within the mini-batch (a sketch of the loss is given below). We optimize using mini-batch SGD with Nesterov momentum (with the default value of 0.9) and a batch size of 5. The learning rate is fixed at 10^−8 across all epochs and the weight decay is 0.0005. For the ResNet-101 backbone, we initialize the network weights from a model pre-trained on ImageNet [8]. During training, we first update the early-fusion skeleton for 30-35 epochs. Next we freeze the weights of the early-fusion model and train the late-fusion weights for 5-10 epochs. Finally, we train the joint network for another 5 epochs.

Simulating User Clicks. Manually collecting user interactions is an expensive and arduous process [2]. In a similar vein as [4] and other interactive segmentation frameworks [18, 19, 28], we simulate user interactions to train and evaluate our method. During training, we use the ground truth masks of the object instances from the MSRA10K dataset. To initialize, we take the center of mass of the ground truth mask as our user click location; we then jitter the click location by U(−50, 50) pixels randomly. The clicked pixel location is constrained to the confines of the object's ground truth mask. The random perturbation introduces variation in the training data and also allows a better approximation of true user interactions.

We evaluate on six publicly available datasets commonly used to benchmark interactive image segmentation [4, 18, 19, 28]: MSRA10K [7], ECSSD [25], GrabCut [24], Berkeley [20], PASCAL VOC 2012 [9] and MS COCO [15]. We use the mean intersection over union (mIoU) of the foreground w.r.t. the ground truth object mask across all instances to evaluate segmentation accuracy, as per existing works [4, 17-19, 28]. MSRA10K has 10,000 natural images; the images are characterized by variety in the foreground objects whilst the background is relatively homogeneous. The extended complex scene saliency dataset (ECSSD) is a dataset of 1000 natural images with structurally complex backgrounds. GrabCut is a dataset consisting of 49 images with typically a distinct foreground object; it is a popular dataset for benchmarking interactive instance segmentation algorithms.

Following [4], we use MSRA10K [7] for training and partition the dataset into three non-overlapping subsets of 8000, 1000 and 1000 images as our training, validation and test sets. We report the mIoU after training for 16K iterations and again after network convergence (at 43K iterations for us, vs. 260K iterations in [4]) in Table 1. During training, we resize the images to 512 × 512 pixels. This choice of resolution is driven primarily by matching the resolution of the training images for the ResNet-101 backbone [12]. The -baseline models are trained using only the 3-channel RGB image and the instance ground truth mask without any user click transformations.
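A minimal sketch of the class-balanced loss described under Network Optimization is given below. It weights each pixel's BCE term by the frequency of the opposite label within the mini-batch, which is one standard form of inverse-frequency class balancing; the exact normalization may differ slightly from this reading.

```python
import torch

def class_balanced_bce(pred, target, eps=1e-6):
    """pred: predicted foreground probabilities; target: binary labels (same shape)."""
    n = target.numel()
    n_fg = target.sum()
    n_bg = n - n_fg
    # Rarer class gets the larger weight (one common normalization; assumed here).
    w_fg = n_bg / (n + eps)   # weight applied where y_j = 1
    w_bg = n_fg / (n + eps)   # weight applied where y_j = 0
    weights = torch.where(target > 0.5, w_fg, w_bg)
    bce = -(target * torch.log(pred + eps) + (1 - target) * torch.log(1 - pred + eps))
    return (weights * bce).mean()

# Usage with a sigmoid output of the network:
# loss = class_balanced_bce(torch.sigmoid(logits), gt_mask.float())
```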
The -early models use Gaussian guidance maps [18]; the network input is the 3-channel RGB image and a Gaussian encoding of the user's tap on the object of interest (Fig. 2(a)). The -multi models refer to the multi-stage fusion models with Gaussian encoding of user clicks. Note that we do not train a late-fusion-only model; standalone late-fusion models show inferior performance compared to their early-fusion counterparts [23].

From Table 1, we observe that our trained network converges mostly within 16K iterations. For simplistic datasets such as MSRA10K and ECSSD, the vgg-baseline without user click transformation compares favourably with the approach of [4] at the same training resolution of 256 × 256. The resnet-baseline models trained with 512 × 512 images significantly outperform [4], reporting absolute mIoU gains of up to 7% across the datasets. Based on this result alone, we conclude that one-click (and standard) interactive segmentation approaches should be benchmarked on more challenging datasets. Examples include PASCAL VOC 2012 and MS COCO, which feature cluttered scenes, multiple objects, occlusions and challenging lighting conditions (see Table 3). Furthermore, with only the Gaussian transformation and a ResNet-101 backbone trained on 512 × 512 images, we are able to achieve mIoU increases in the range of 5-11% across datasets at convergence w.r.t. [4]. Having the multi-stage fusion offers absolute mIoU gains of 1-4% w.r.t. the early-fusion variant (resnet-early vs. resnet-multi when trained with 512 × 512 images). Additionally, our resnet models require significantly less memory: 195.8 MB (stored as 32-bit/4-byte floating point numbers) instead of the 652.45 MB required for the segmentation network of [4].

Approaches in the literature [14, 18, 19, 28] are typically evaluated by (1) the average number of clicks needed to reach the desired level of segmentation (@85% mIoU for PASCAL VOC 2012 and MS COCO, @90% mIoU for the less challenging GrabCut and Berkeley) and (2) the average mIoU vs. the number of clicks. The first criterion is primarily geared towards annotation tasks [18, 19] where high-quality segments are desired for each instance in the scene; the fewer the number of clicks, the lower the annotation effort. In this work, we are concerned primarily with achieving high-quality segments for the object of interest given only a single click. Accordingly, given a single user click, we report the average mIoU across all instances for the GrabCut, Berkeley and PASCAL VOC 2012 val datasets. For MS COCO object instances, following [28], we split the dataset into the 20 PASCAL VOC 2012 categories and the 60 additional categories, and randomly sample 10 images per category for evaluation. We also report the average mIoU across the sampled 800 MS COCO instances [14].

For training, following [14, 19, 28], we use the ground truth masks of object instances from the PASCAL VOC 2012 [9] train set with additional masks from the Semantic Boundaries Dataset (SBD) [11], resulting in 10582 images. Note that unlike [18], we do not use the training instances from MS COCO.

Ablation Study. We perform extensive ablation studies to thoroughly analyze the effectiveness of the individual components of our one-click segmentation framework. First, to validate our choice of guidance maps, we consider the user click transformations commonly used in existing interactive segmentation algorithms: Euclidean distance maps [14, 28], Gaussian distance maps [18] and disks [2]. Figure 4 shows examples of such guidance maps.
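A minimal sketch of these three encodings is given below. The Euclidean truncation value and the disk radius are illustrative choices; the Gaussian uses the 10-pixel standard deviation from Sect. 3.3.

```python
import numpy as np

def _dist2(shape, clicks):
    """Squared distance from every pixel to its nearest click. clicks: list of (y, x)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    d2 = np.full((h, w), np.inf)
    for cy, cx in clicks:
        d2 = np.minimum(d2, (yy - cy) ** 2 + (xx - cx) ** 2)
    return d2

def euclidean_map(shape, clicks, cap=255.0):
    # Minimum Euclidean distance to the click set, truncated at an illustrative cap.
    return np.minimum(np.sqrt(_dist2(shape, clicks)), cap)

def gaussian_map(shape, clicks, sigma=10.0):
    # Equivalent to a 'max' over per-click Gaussians, since exp is decreasing in d^2.
    return np.exp(-_dist2(shape, clicks) / (2.0 * sigma ** 2))

def disk_map(shape, clicks, radius=5):
    # Binary disk of an illustrative radius around each click.
    return (_dist2(shape, clicks) <= radius ** 2).astype(np.float32)

G = gaussian_map((512, 512), clicks=[(240, 260)])   # a single positive click
```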
For each kind of guidance map, we train separate networks to understand the impact of different user click transformations. For evaluation, we report the average mIoU over all instances in the dataset, given a single click (see Table 2). Next, we study the impact of our proposed late-fusion module (denoted by -multi in Table 2); we observe an average mIoU improvement of around 1.8% across different datasets.

One-Click Segmentation. We compare the segmentation performance of our method with existing interactive instance segmentation approaches (see Table 3). The approaches are grouped into 3 different categories: pre-deep learning approaches, deep learning-based interactive instance segmentation approaches, and tap-and-shoot segmentation approaches.

(Table 3 caption: Average mIoU given a single click. The approaches are grouped separately into 3 different categories: pre-deep learning approaches, deep learning-based interactive instance segmentation approaches and tap-and-shoot segmentation approaches, respectively. For GC [3], GM [1], GD [10] and iFCN [28] we make use of the values provided by the authors of iFCN [28]. The mIoU improvement (in %) over existing state-of-the-art approaches is indicated using ↑.)

From Table 3, we observe that our approach outperforms the classical interactive segmentation works by a significant margin, reporting a 40% absolute improvement in average mIoU. We also outperform existing state-of-the-art interactive instance segmentation approaches [18, 19] by a considerable margin (>3%). Additionally, we report absolute mIoU improvements of 7.2% and 17% on GrabCut and Berkeley over the tap-and-shoot segmentation framework of [4]. We show qualitative results to demonstrate the effectiveness of our proposed algorithm (see Fig. 5). The resulting segmentations demonstrate that our approach is highly effective for the one-click segmentation paradigm.

Across existing state-of-the-art interactive frameworks [18, 19, 28], user clicks are simulated following the protocols established in [18, 28]. For our user study, we consult 5 participants uninitiated to the task of interactive segmentation. We prepare a toy dataset with 50 object instances from the MSRA10K [7] dataset. We presented each image with the segmentation mask for the target object overlaid on it and asked the users to provide their click. Recall that during training, we applied random perturbations of U(−50, 50) pixels to the center of mass of the object instance to obtain the final user click. Our user study found that participants placed clicks at a mean distance of 24 pixels from the center of the mask, with a standard deviation of 27 pixels. This result validates our assumption that users are more likely to click in the vicinity of the object's center of mass. On average, we observed that users took 2.3 s, with a standard deviation of 0.8 s, to position their click.

In this work, we propose a one-click segmentation framework that produces high-quality segmentation masks. We validated our design choices through detailed ablation studies; we observed that having a multi-stage fusion module improves the segmentation framework and gives the network an edge over its early-fusion variants. Via experiments, we observed that for the single-click scenario, our proposed approach significantly outperforms existing state-of-the-art approaches, including the more complicated interactive instance segmentation models built on state-of-the-art segmentation models [6].
However, we observe that existing tap-and-shoot segmentation frameworks [4], including our proposed framework, are limited by their inability to learn from negative clicks [18, 19, 28]. One major drawback of such a training scenario is that the network does not have a notion of corrective clicking; if the generated segmentation mask extends beyond the object boundaries, it cannot rectify this mistake. Clicking on locations outside the object can mitigate this effect, though this then deviates from tap-and-shoot interaction.

References
[1] Geodesic matting: a framework for fast interactive image and video segmentation and matting
[2] Large-scale interactive object segmentation with human annotators
[3] Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images
[4] Tap and shoot segmentation
[5] DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs
[6] Encoder-decoder with atrous separable convolution for semantic image segmentation
[7] Global contrast based salient region detection
[8] ImageNet: a large-scale hierarchical image database
[9] The PASCAL Visual Object Classes (VOC) challenge
[10] Geodesic star convexity for interactive image segmentation
[11] Semantic contours from inverse detectors
[12] Deep residual learning for image recognition
[13] Squeeze-and-excitation networks
[14] A fully convolutional two-stream fusion network for interactive image segmentation
[15] Microsoft COCO: common objects in context
[16] Nuclei segmentation via a deep panoptic model with semantic feature fusion
[17] Fully convolutional networks for semantic segmentation
[18] Iteratively trained interactive segmentation
[19] Content-aware multi-level guidance for interactive instance segmentation
[20] A comparative evaluation of interactive segmentation algorithms
[21] Intelligent scissors for image composition
[22] Semantic image synthesis with spatially-adaptive normalization
[23] Few-shot segmentation propagation with guided networks
[24] GrabCut: interactive foreground extraction using iterated graph cuts
[25] Hierarchical image saliency detection on extended CSSD
[26] Very deep convolutional networks for large-scale image recognition
[27] Temporal multimodal fusion for video emotion classification in the wild
[28] Deep interactive object selection
[29] A late fusion CNN for digital matting
[30] Pyramid scene parsing network