key: cord-0058289-yjh38ims
authors: Bonde, Ujwal; Alcantarilla, Pablo F.; Leutenegger, Stefan
title: Towards Bounding-Box Free Panoptic Segmentation
date: 2021-03-17
journal: Pattern Recognition
DOI: 10.1007/978-3-030-71278-5_23
sha: 0ce3fcc60e4c1eb7b5694f7caf55189a6d7649ad
doc_id: 58289
cord_uid: yjh38ims

In this work we introduce a new Bounding-Box Free Network (BBFNet) for panoptic segmentation. Panoptic segmentation is an ideal problem for proposal-free methods as it already requires per-pixel semantic class labels. We use this observation to exploit class boundaries from off-the-shelf semantic segmentation networks and refine them to predict instance labels. Towards this goal BBFNet predicts coarse watershed levels and uses them to detect large instance candidates where boundaries are well defined. For smaller instances, whose boundaries are less reliable, BBFNet also predicts instance centers by means of Hough voting followed by mean-shift to reliably detect small objects. A novel triplet loss network helps merge fragmented instances while refining boundary pixels. Our approach is distinct from previous works in panoptic segmentation that rely on a combination of a semantic segmentation network with a computationally costly instance segmentation network based on bounding-box proposals, such as Mask R-CNN, to guide the prediction of instance labels using a Mixture-of-Experts (MoE) approach. We benchmark our proposal-free method on the Cityscapes and Microsoft COCO datasets and show competitive performance with other MoE based approaches while outperforming existing non-proposal based methods on the COCO dataset. We show the flexibility of our method using different semantic segmentation backbones and provide video results on challenging scenes in the wild in the supplementary material.

ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this chapter (10.1007/978-3-030-71278-5_23) contains supplementary material, which is available to authorized users.

Panoptic segmentation is the joint task of predicting semantic scene segmentation together with individual instances of objects present in the scene. Historically this has been explored under the different umbrella terms of scene understanding [37] and scene parsing [32]. In [17], Kirillov et al. coined the term and gave a more concrete definition by including the suggestion from Forsyth et al. [10] of splitting the object categories into things (countable objects like persons, cars, etc.) and stuff (uncountable classes like sky, road, etc.). While stuff classes require only semantic label prediction, things need both semantic and instance labels. Along with this definition, the Panoptic Quality (PQ) measure was proposed to benchmark different methods. Since then, there has been a more focused effort towards panoptic segmentation, with multiple datasets [7, 23, 24] supporting it.

Existing methods for panoptic segmentation can be broadly classified into two groups. The first group uses a proposal based approach for predicting things. Traditionally these methods use completely separate instance and scene segmentation networks. Using a MoE approach, the outputs are combined either heuristically or through another sub-network.
Although more recent works propose sharing a common feature backbone for both networks [16, 27], this split of tasks restricts the backbone network to the most complex branch. Usually this restriction is imposed by the instance segmentation branch [13].

The second group of works uses a proposal-free approach for instance segmentation, allowing for a more efficient design. An additional benefit of these methods is that they do not need bounding-box predictions. While bounding-box detection based approaches have been popular and successful, they require predicting auxiliary quantities like scale, width and height which do not directly contribute to instance segmentation. Furthermore, the choice of bounding-boxes for object detection has been questioned in the past [28]. We believe panoptic segmentation to be an ideal problem for a bounding-box free approach since it already contains structured information from semantic segmentation. In this work, we exploit this using a flexible panoptic segmentation head that can be added to any off-the-shelf semantic segmentation network. We call this the Bounding-Box Free Network (BBFNet), a proposal-free network that predicts things by gradually refining the class boundaries predicted by the base network. To achieve this we build on previous non-proposal based methods for instance segmentation [2, 4, 26]. Based on the output of a semantic segmentation network, BBFNet first detects noisy and fragmented large instance candidates using a watershed-level prediction head (see Fig. 1). These candidate regions are clustered and their boundaries improved with a triplet loss based head. The remaining smaller instances, with unreliable boundaries, are detected using a Hough voting [3] head that predicts the offsets to the center of each instance. Without using MoE our method produces results comparable to proposal based approaches while outperforming proposal-free methods on the COCO dataset.

Most current works in panoptic segmentation fall under the proposal based approach for detecting things. In [17], Kirillov et al. use separate networks for semantic segmentation (stuff) and instance segmentation (things) with a heuristic MoE fusion of the two results for the final prediction. Realising the duplication of feature extractors in the two related tasks, [16, 18, 21, 27, 35] propose using a single backbone feature extractor network. This is followed by separate branches for the two sub-tasks with a heuristic or learnable MoE head to combine the results. While Panoptic Feature Pyramid Networks (FPN) [16] uses Mask R-CNN [13] for the things classes and fills in the stuff classes using a separate FPN branch, UPSNet [35] combines the resized logits of the two branches to predict the final output. In AUNet [21], attention masks predicted from the Region Proposal Network (RPN) and the instance segmentation head help fuse the results of the two tasks. Instead of relying only on the instance segmentation branch, TASCNet [18] predicts a coherent mask for the things and stuff classes using both branches. All these methods rely on Mask R-CNN [13] for predicting things. Mask R-CNN is a two-stage instance segmentation network which uses an RPN to predict initial instance candidates. The two-stage serial approach makes Mask R-CNN accurate, albeit computationally expensive and inflexible, thus slowing progress towards real-time panoptic segmentation.
In FPSNet [12], the authors replace Mask R-CNN with a computationally less expensive detection network and use its output as a soft attention mask to guide the prediction of things classes. This trade-off comes at the cost of a considerable reduction in accuracy, while continuing to use a computationally expensive backbone (ResNet50 [14]). In [20] the authors make up for the reduced accuracy by using an affinity network, but this comes at the cost of computational complexity. Both these methods still use bounding-boxes for predicting things. In [31], the detection network is replaced with an object proposal network which predicts instance candidates. In contrast, we propose a flexible panoptic segmentation head that relies only on a semantic segmentation network which, when replaced with faster networks [29, 30], allows for a more efficient solution.

A parallel direction gaining increased popularity is the use of a proposal-free approach for predicting things. In [33], the authors predict the direction to the center and replace bounding-box detection with template matching using these predicted directions as a feature. Instead of template matching, [1, 19] use a dynamically initiated conditional random field graph from the output of an object detector to segment instances. In the more recent work of Gao et al. [11], cascaded graph partitioning is performed on the predictions of a semantic segmentation network and an affinity pyramid computed within a fixed window for each pixel. Cheng et al. [5] simplify this process by adopting a parallelizable grouping algorithm for thing pixels. In comparison, our flexible panoptic segmentation head predicts things by refining the segmentation boundaries obtained from any backbone semantic segmentation network. Furthermore, our post-processing steps are computationally more efficient than those of other proposal-free approaches while outperforming them on multiple datasets.

Fig. 2. BBFNet gradually refines the class boundaries of the backbone semantic segmentation network to predict panoptic segmentation. The watershed head predicts quantized watershed levels (shown in different colours), which are used to detect large instance candidates. For smaller instances we use Hough voting with fixed bandwidth. The output shows offsets (X_off, Y_off) colour-coded to represent the direction of the predicted vector. The triplet head refines and merges the detections to obtain the final instance labels. We show the class probability (colour-map hot) for different instances, with their center pixels used as anchors f_a. Table 1 lists the components of the individual heads while Sect. 3 explains them in detail.

In this section we introduce our non-bounding-box approach to panoptic segmentation. Figure 2 shows the various blocks of our network and Table 1 details the main components of BBFNet. The backbone semantic segmentation network consists of a ResNet50 followed by an FPN [22]. In the FPN, we only use the P2, P3, P4 and P5 feature maps, which contain 256 channels each and are 1/4, 1/8, 1/16 and 1/32 of the original scale respectively. Each feature map then passes through the same series of eight Deformable Convolutions (DC) [8]. Intermediate features after every couple of DCs are used to predict semantic segmentation (Sect. 3.1), Hough votes (Sect. 3.2), watershed energies (Sect. 3.3) and features for the triplet loss [34] network. We first explain each of these components and their corresponding training loss.
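To make this head layout concrete, the sketch below shows how the shared trunk and the four prediction heads could be wired up for a single FPN level in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: plain 3 × 3 convolutions stand in for the deformable convolutions, the triplet-feature width (32 channels) is assumed, and only the details stated in the text (eight convolutions on 256-channel inputs, heads tapping the trunk after every pair) are taken from the paper.

```python
import torch
import torch.nn as nn


class BBFNetHead(nn.Module):
    """Sketch of the BBFNet panoptic head applied to one FPN level.

    Plain 3x3 convolutions stand in for the deformable convolutions [8]
    used in the paper; the heads tap the shared trunk after every pair of
    convolutions, mirroring the description in Sect. 3.
    """

    def __init__(self, num_stuff=11, num_things=8, num_bins=4, channels=256):
        super().__init__()
        self.num_things = num_things
        # Shared trunk of eight convolutions, grouped in pairs so that
        # intermediate features can feed the successive heads.
        self.trunk = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(4)
        ])
        self.semantic_head = nn.Conv2d(channels, num_stuff + num_things, 1)
        self.hough_head = nn.Conv2d(channels, 4 * num_things, 1)   # offsets + uncertainties
        self.watershed_head = nn.Conv2d(channels, num_bins, 1)     # K quantised levels
        self.triplet_head = nn.Conv2d(channels, 32, 1)             # F_T embedding (width assumed)

    def forward(self, p):
        # p: one FPN feature map (P2..P5), shape (B, 256, H, W).
        f_ss = self.trunk[0](p)
        f_hgh = self.trunk[1](f_ss)
        f_wtr = self.trunk[2](f_hgh)
        f_trp = self.trunk[3](f_wtr)
        hough = self.hough_head(f_hgh)
        offsets, sigmas = hough.split(2 * self.num_things, dim=1)
        return {
            "semantic": self.semantic_head(f_ss),       # logits over stuff + things
            "hough_offsets": torch.tanh(offsets),       # normalised offsets per thing class
            "hough_sigmas": sigmas,                     # per-direction uncertainties
            "watershed": self.watershed_head(f_wtr),
            "triplet": self.triplet_head(f_trp),
        }
```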
The first head in BBFNet is used to predict semantic segmentation. This allows BBFNet to quickly predict things (C_things) and stuff (C_stuff) labels while the remainder of BBFNet improves things boundaries using the semantic segmentation features F_seg. We use a per-pixel cross-entropy loss to train this head, L_ss = -∑_c y_c log(p_c^ss), where y_c and p_c^ss are respectively the one-hot ground-truth label and predicted softmax probability for class c.

The Hough voting head is similar to the semantic segmentation head and is used to refine F_ss into Hough features F_hgh. These are then used to predict, for each thing pixel, the offsets to its instance center. We use a tanh non-linearity to squash the predictions and obtain normalised offsets (X̂_off and Ŷ_off). Along with the centers we also predict the uncertainty in the two directions (σ_x and σ_y), making the number of predictions from the Hough voting head equal to 4 × C_things. The predicted center for each pixel is then obtained by adding the predicted offsets for the predicted class C to the image-normalised pixel location (x̂, ŷ). Hough voting is inherently noisy [3] and requires clustering or mode-seeking methods like mean-shift [6] to predict the final object centers. As instances can have very different scales, tuning the clustering hyper-parameters is difficult. For this reason we use Hough voting primarily to detect small objects and to filter predictions from other heads. We also observe that the dense loss from the Hough voting head helps the convergence of the deeper heads in our network. The loss for this head is computed only over thing pixels, using the ground-truth offsets X_off and Y_off and a per-pixel weight w. To avoid bias towards large objects, we inversely weigh the instances based on their number of pixels. This allows the head to accurately predict the centers for objects of all sizes. Note that we only predict the centers for the visible regions of an instance and do not consider its occluded regions.

Our watershed head is inspired by DWT [2]. Similar to that work, we quantise the watershed levels into a fixed number of bins (K = 4). The lowest bin (k = 0) corresponds to background and to regions within 2 pixels inside the instance boundary. Similarly, k = 1 and k = 2 are for regions that are within 5 and 15 pixels of the instance boundary, respectively, while k = 3 is for the remaining region inside the instance. In DWT, the bin corresponding to k = 1 is used to detect large instance boundaries. While this does reasonably well for large objects, it fails for smaller objects, producing erroneous boundaries. Furthermore, occluded instances that are fragmented cannot be detected as a single object. For this reason we use this head only for predicting large object candidates, which are filtered and refined using predictions from the other heads. Due to the fine quantisation of the watershed levels, rather than directly predicting at the upsampled resolution, we gradually refine the lower-resolution feature maps.

Table 1. Architecture of BBFNet. dc, conv, ups and cat stand for deformable convolution [8], 1 × 1 convolution, upsampling and concatenation respectively. The two numbers that follow dc and conv are the input and output channels of the blocks. * indicates that more processing is done on these blocks, as detailed in Sect. 3.

We use a weighted cross-entropy loss to train this head, L_wtr = -∑_k w_k W_k log(p_k^wtr), where W_k is the one-hot ground truth for the k-th watershed level, p_k^wtr its predicted probability and w_k its weight.
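To illustrate the quantisation scheme just described, the following sketch derives the four watershed-level training targets from an instance label map with a distance transform. The 2, 5 and 15 pixel thresholds come from the text; the helper itself and its exact boundary handling are assumptions rather than the authors' code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def quantised_watershed_targets(instance_map: np.ndarray) -> np.ndarray:
    """Map an instance label image (0 = background) to K = 4 watershed bins.

    k = 0: background and pixels within 2 px inside an instance boundary,
    k = 1: pixels within 5 px of the boundary,
    k = 2: pixels within 15 px of the boundary,
    k = 3: the remaining instance interior.
    """
    targets = np.zeros_like(instance_map, dtype=np.int64)
    for inst_id in np.unique(instance_map):
        if inst_id == 0:
            continue
        mask = instance_map == inst_id
        # Distance (in pixels) from each instance pixel to the nearest
        # non-instance pixel, i.e. to the instance boundary.
        dist = distance_transform_edt(mask)
        levels = np.zeros_like(targets)
        levels[mask & (dist > 2)] = 1
        levels[mask & (dist > 5)] = 2
        levels[mask & (dist > 15)] = 3
        targets[mask] = levels[mask]
    return targets
```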
The triplet loss network is used to refine and merge the detected candidate instances, in addition to detecting new instances. Towards this goal, a popular choice is to formulate it as an embedding problem using a triplet loss [4]. This loss forces features of pixels belonging to the same instance to group together while pushing apart features of pixels from different instances. A margin-separation loss is usually employed for better instance separation: it encourages the anchor-positive distance ||f_a - f_p|| to be smaller than the anchor-negative distance ||f_a - f_n|| by at least a margin α, where f_a, f_p and f_n are the anchor, positive and negative pixel features respectively. Choosing α is not easy and depends on the complexity of the feature space [25]. Instead, we opt for a fully-connected network that classifies pairs of pixel features, formulating it as a binary classification problem. We use the cross-entropy loss to train this head, where T_c is the ground-truth one-hot label for the indicator function and p^trp the predicted probability. The pixel feature used for this network is a concatenation of F_T (see Table 1), its normalised position in the image (x, y) and the outputs of the different heads (p_seg, p_wtr, X̂_off, Ŷ_off, σ_x and σ_y).

We train the whole network along with its heads using a weighted sum of the individual head losses with weights α_1, α_2, α_3 and α_4. For the triplet loss network, training with all pixels is prohibitively expensive. Instead we randomly choose a fixed number of anchor pixels N_a for each instance. Hard positive examples are obtained by sampling from the pixels farthest from the object center, which correspond to watershed level k = 0. For hard negative examples, neighbouring instances' pixels that are closest to the anchor and belong to the same class are given higher weight. Only half of the anchors use hard example mining while the rest use random sampling.

We observe that large objects are easily detected by the watershed head, while Hough voting based center prediction does well when objects are of the same scale. To exploit this observation, we detect large object candidates (I_L) using connected components on the watershed predictions corresponding to the k ≥ 1 bins. We then filter out candidates whose predicted Hough center (I_L^center) does not fall within their bounding boxes (BB_L). These filtered-out candidates are fragmented regions of occluded objects or false detections. Using the center pixel of the remaining candidates (I_L) as anchor points, the triplet loss network refines them over the remaining pixels, allowing us to detect fragmented regions while also improving their boundary predictions. After the initial watershed step, the unassigned thing pixels correspond to k = 0 and primarily belong to small instances. We use mean-shift clustering with a fixed bandwidth (B) to predict candidate object centers, I_S^center. We then back-trace the pixels voting for these centers to obtain the Hough predictions I_S. Finally, from the remaining unassigned pixels we randomly pick an anchor point and test it against the other remaining pixels. These serve as candidate regions that are filtered (I_R) based on their Hough center predictions, similar to the watershed candidates. The final detections are the union of these predictions. We summarise these steps in the algorithm provided in the supplementary material.
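A compact sketch of this merging procedure is given below: connected components over the watershed bins k ≥ 1 yield large-object candidates, candidates whose voted centre falls outside their bounding box are discarded, and the remaining thing pixels are grouped by mean-shift on their voted centres. It is a simplified reconstruction under assumed array layouts; the triplet-network refinement and the final random-anchor pass are omitted for brevity.

```python
import numpy as np
from scipy.ndimage import label
from sklearn.cluster import MeanShift


def merge_instances(watershed_bins, center_x, center_y, thing_mask, bandwidth=10):
    """Group thing pixels into instances from the head outputs.

    watershed_bins: (H, W) predicted bin per pixel (0..3).
    center_x, center_y: (H, W) Hough-voted centre (pixel coordinates) per pixel.
    thing_mask: (H, W) bool, pixels predicted as a thing class.
    Returns an (H, W) instance id map (0 = unassigned / stuff).
    """
    instance_map = np.zeros(watershed_bins.shape, dtype=np.int32)
    next_id = 1

    # 1. Large-object candidates: connected components over bins k >= 1.
    components, n = label(thing_mask & (watershed_bins >= 1))
    for comp_id in range(1, n + 1):
        mask = components == comp_id
        ys, xs = np.nonzero(mask)
        cx, cy = center_x[mask].mean(), center_y[mask].mean()
        # Keep the candidate only if its voted centre lies inside its bounding box.
        if xs.min() <= cx <= xs.max() and ys.min() <= cy <= ys.max():
            instance_map[mask] = next_id
            next_id += 1

    # 2. Remaining thing pixels (mostly small objects, bin k = 0):
    #    cluster their voted centres with a fixed-bandwidth mean shift.
    remaining = thing_mask & (instance_map == 0)
    if remaining.any():
        votes = np.stack([center_x[remaining], center_y[remaining]], axis=1)
        labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(votes)
        ids = np.zeros(labels.shape, dtype=np.int32)
        for cluster in np.unique(labels):
            ids[labels == cluster] = next_id
            next_id += 1
        instance_map[remaining] = ids

    return instance_map
```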
In this section we evaluate the performance of BBFNet and present the results we obtain. We first describe the datasets and the evaluation metrics used. In Sect. 4.1 we describe the implementation details of our network. Section 4.2 then discusses the performance of the individual heads and how their combination helps improve the overall accuracy. We then present both qualitative and quantitative results.

Table 2. (a) Performance of different heads (W-Watershed, H-Hough Voting and T-Triplet Loss Network) on the Cityscapes validation set. BBFNet exploits the complementary performance of the watershed head (large objects, >10k pixels) and the Hough voting head (small objects, <1k pixels), resulting in higher accuracy. PQ_s, PQ_m and PQ_l are the PQ scores for small, medium and large objects respectively. Bold is for best results. (b) Performance of the Hough voting head (H) with varying B for different-sized objects: s - small (<1k pixels), l - large (>10k pixels) and m - medium-sized instances. For reference we also plot the performance of the Watershed+Triplet loss (W+T) head (see Table 2).

The Cityscapes dataset [7] contains 2975 densely annotated images of driving scenes for training and a further 500 validation images. For the panoptic challenge, a total of 19 classes are split into 8 things and 11 stuff classes. Microsoft COCO [23] is a large-scale object detection and segmentation dataset with over 118k training (2017 edition) and 5k validation images. The labels consist of 133 classes split into 80 things and 53 stuff. We benchmark using the Panoptic Quality (PQ) measure, which was proposed in [17]. Where available we also provide the IoU score.

We use pretrained ImageNet [9] models for ResNet50 and the FPN and train the BBFNet head from scratch. We keep the backbone fixed for the initial epochs before training the whole network jointly. In the training loss, we set the α_1, α_2, α_3 and α_4 parameters to 1.0, 0.1, 1.0 and 0.5 respectively, since we found this to be a good balance between the different losses. The mean-shift bandwidth is set to a relatively small B = 10 pixels to help the Hough voting head detect smaller instances. In the watershed head, the number of training pixels decreases with k and needs to be offset by a higher w_k. We found the weights 0.2, 0.1, 0.05 and 0.01 to work best for our experiments. Moreover, these weights help the network focus on detecting pixels corresponding to the lower bins, on which the connected-component analysis is performed. To train the triplet-loss network head we set the number of anchor pixels per object to N_a = 1000. To improve robustness we augment the training data by randomly cropping the images and adding alpha noise, flipping and affine transformations. No additional augmentation was used during testing. All experiments were performed on an NVIDIA Titan 1080Ti.

Table 3. Panoptic segmentation results on the Cityscapes and COCO datasets. All methods use the same pretraining (ImageNet) and backbone (ResNet50+FPN), except those marked with * (ResNet101) and ± (Xception-71). Bold marks the overall best results and underline marks the best result among non-BB based methods.

A common practice during inference is to remove predictions with low detection probability to avoid being penalised twice (FP and FN) [35]. In BBFNet, these correspond to regions with poor segmentation. We remove regions with low mean segmentation probability (<0.65). Furthermore, we also observe that boundaries shared between multiple objects are frequently predicted as separate instances. We filter these with a threshold (0.1) on the IoU between the segmented prediction and its corresponding bounding box.
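These two inference-time filters amount to a few lines of post-processing. The sketch below assumes an instance id map and per-pixel probabilities of the predicted class; the 0.65 and 0.1 thresholds come from the text, while everything else is illustrative.

```python
import numpy as np


def filter_instances(instance_map, class_prob, prob_thresh=0.65, iou_thresh=0.1):
    """Suppress low-quality instance predictions.

    instance_map: (H, W) instance ids (0 = none).
    class_prob: (H, W) probability of the predicted class at each pixel.
    Drops instances whose mean segmentation probability is below prob_thresh,
    or whose mask-to-bounding-box IoU is below iou_thresh (degenerate,
    boundary-like detections shared between neighbouring objects).
    """
    for inst_id in np.unique(instance_map):
        if inst_id == 0:
            continue
        mask = instance_map == inst_id
        ys, xs = np.nonzero(mask)
        box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
        mask_to_box_iou = mask.sum() / box_area  # mask lies inside its own box
        if class_prob[mask].mean() < prob_thresh or mask_to_box_iou < iou_thresh:
            instance_map[mask] = 0
    return instance_map
```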
We conduct ablation studies to show the advantage of each individual head and how BBFNet exploits them. Table 2(a) shows the results of our experiments on Cityscapes. We use the validation sets for all our experiments. We observe that the watershed or Hough voting heads alone do not perform well. In the case of the watershed head this is because performing connected-component analysis on the k = 1 level (as proposed in [2]) leads to poor SQ. Note that performing the watershed cut at k = 0 is also not optimal, as this leads to multiple instances that share boundaries being grouped into a single detection. By combining the watershed head with a refining step from the triplet loss network we observe an improvement of over 10 points. On the other hand, the performance of the Hough voting head depends on the bandwidth B that is used. Table 2(b) plots its performance with varying B. As B increases from 5 to 20 pixels we observe an initial increase in overall PQ before it saturates. This is because while the performance increases on large objects (>10k pixels), it reduces on small (<1k pixels) and medium-sized objects. However, we observe that at lower B it outperforms the watershed+triplet loss head on smaller objects. We exploit this in BBFNet (see Sect. 3.5) by using the watershed+triplet loss head for larger objects while using the Hough voting head primarily for smaller objects.

Table 3 benchmarks the performance of BBFNet against existing methods on the Cityscapes and COCO datasets. As all state-of-the-art methods report results with ResNet50+FPN networks while using the same pre-training dataset (ImageNet), we also follow this convention and report our results with this setup, except where highlighted. Multi-scale testing along with horizontal flipping were used in some works, but we omit those results here as these techniques can be applied to any existing work, including BBFNet, to improve performance. From the results we observe that BBFNet, without using an MoE or BB, has comparable performance to other MoE+BB based methods while outperforming non-BB based methods on the more complicated COCO dataset. Figure 3 shows some qualitative results on the Cityscapes and COCO datasets.

To highlight BBFNet's ability to work with different segmentation backbones, we compare its generalisation with different segmentation networks. As expected, we observe an increase in performance with more complex backbones and with DCs, but at the cost of reduced efficiency (see Table 4). For reference we also show the performance of a baseline proposal-based approach (UPSNet) and a proposal-free approach (SSAP). We used the author-provided code of UPSNet for computing the efficiency figures. Note that since UPSNet uses Mask R-CNN, its backbone cannot be replaced and it is not as flexible as BBFNet. As BBFNet does not use a separate instance segmentation head, it is computationally more efficient, using only ≈29.5M parameters compared to UPSNet's 44.5M. We find a similar pattern when we compare the number of FLOPs on a 1024 × 2048 image, with BBFNet taking 0.38 TFLOPs compared to 0.425 TFLOPs for UPSNet. The authors of SSAP [11] do not provide details about their number of parameters, FLOPs or inference time. However, they provide timing information for their post-processing step, a cascaded graph partitioning approach that uses the predictions of a semantic segmentation network and an affinity pyramid network. This cascaded graph partition module solves a multicut optimisation problem [15] and takes between 0.26 and 1.26 seconds depending on the initial resolution. We believe that the BBFNet post-processing step is simpler and presumably faster than that of SSAP.
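For reference, the PQ numbers compared throughout this section follow the standard definition from [17]. A minimal sketch of that computation for a single class, assuming predicted and ground-truth segments are given as boolean masks, is shown below.

```python
def panoptic_quality(pred_segments, gt_segments):
    """Compute PQ for one class.

    pred_segments, gt_segments: lists of boolean numpy masks.
    A prediction matches a ground-truth segment when their IoU exceeds 0.5
    (such a match is unique by construction). PQ = sum of matched IoUs
    divided by (|TP| + 0.5 * |FP| + 0.5 * |FN|).
    """
    matched_iou, matched_gt = [], set()
    for p in pred_segments:
        for j, g in enumerate(gt_segments):
            if j in matched_gt:
                continue
            inter = (p & g).sum()
            union = (p | g).sum()
            if union > 0 and inter / union > 0.5:
                matched_iou.append(inter / union)
                matched_gt.add(j)
                break
    tp = len(matched_iou)
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_iou) / denom if denom > 0 else 0.0
```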
We now discuss the reasons for the performance difference between our bounding-box free method and those that use bounding-box proposals. UPSNet [35] is used as a benchmark as it shares common features with other methods. Table 5 reports the number of predictions made for different-sized objects in the Cityscapes validation dataset. We report the True Positive (TP), False Positive (FP) and False Negative (FN) values. One of the areas where BBFNet performs poorly is the number of small object detections: BBFNet detects 2/3 of the smaller objects compared to UPSNet. Poor segmentation (wrong class label or inaccurate boundary prediction) also leads to a relatively higher FP for medium and large sized objects. Figure 4 shows some examples. The multi-head MoE approach helps address these issues, but at the cost of additional complexity and computation time (Sect. 4.3). For applications where time or memory are more critical than detecting smaller objects, BBFNet would be a better-suited solution.

We presented an efficient bounding-box free panoptic segmentation method called BBFNet. Unlike previous methods, BBFNet does not use any instance segmentation network to predict things. It instead refines the boundaries of the semantic segmentation output obtained from any off-the-shelf segmentation network. This allows us to be flexible while outperforming proposal-free methods on the more complicated COCO benchmark. In future work we plan to make the network end-to-end trainable and to improve efficiency by removing the use of DCNs while maintaining similar accuracy.

References
[1] Pixelwise instance segmentation with a dynamically instantiated network
[2] Deep watershed transform for instance segmentation
[3] Generalizing the Hough transform to detect arbitrary shapes
[4] Semantic instance segmentation with a discriminative loss function
[5] Panoptic-DeepLab: a simple, strong, and fast baseline for bottom-up panoptic segmentation
[6] Mean shift, mode seeking, and clustering
[7] The Cityscapes dataset for semantic urban scene understanding
[8] Deformable convolutional networks
[9] ImageNet: a large-scale hierarchical image database
[10] Finding pictures of objects in large collections of images
[11] SSAP: single-shot instance segmentation with affinity pyramid
[12] Fast panoptic segmentation network
[13] Mask R-CNN
[14] Deep residual learning for image recognition
[15] Efficient decomposition of image and mesh graphs by lifted multicuts
[16] Panoptic feature pyramid networks
[17] Panoptic segmentation
[18] Learning to fuse things and stuff
[19] Weakly- and semi-supervised panoptic segmentation
[20] Unifying training and inference for panoptic segmentation
[21] Attention-guided unified network for panoptic segmentation
[22] Feature pyramid networks for object detection
[23] Microsoft COCO: common objects in context
[24] The Mapillary Vistas dataset for semantic understanding of street scenes
[25] Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth
[26] Fast scene understanding for autonomous driving
[27] Seamless scene segmentation
[28] YOLOv3: an incremental improvement
[29] ErfNet: efficient residual factorized ConvNet for real-time semantic segmentation
[30] MobileNetV2: inverted residuals and linear bottlenecks
[31] AdaptIS: adaptive instance selection network
[32] Scene parsing with object instances and occlusion ordering
[33] Pixel-level encoding and depth layering for instance-level semantic labeling
[34] Distance metric learning for large margin nearest neighbor classification
[35] UPSNet: a unified panoptic segmentation network
[36] DeeperLab: single-shot image parser
[37] Describing the scene as a whole: joint object detection

Acknowledgements. We would like to thank Prof. Andrew Davison and Dr. Alexandre Morgand for their critical feedback during the course of this work.