key: cord-0951843-h62t4qns
authors: Sehar, Uroosa; Naseem, Muhammad Luqman
title: How deep learning is empowering semantic segmentation: Traditional and deep learning techniques for semantic segmentation: A comparison
date: 2022-04-06
journal: Multimed Tools Appl
DOI: 10.1007/s11042-022-12821-3
sha: 5443c4d63af5dd4bc73757615c83a9722aed4898
doc_id: 951843
cord_uid: h62t4qns

Semantic segmentation involves extracting meaningful information from images or from the frames of a video or recording. It performs this extraction pixel by pixel using a classification approach, which gives us the accurate, fine-grained detail we need for further evaluation. Formerly, we had only a few techniques, based on unsupervised learning or on conventional image processing. Over time, techniques have improved, and we now have more efficient methods for segmentation. Image segmentation is slightly simpler than semantic segmentation from a technical perspective, because semantic segmentation is pixel based: the part detected for each label is masked, and the masked objects are referred to by the classes we have defined, each with a relevant class name and a designated color. In this paper, we have reviewed almost all the supervised and unsupervised learning algorithms that have been used for semantic segmentation, from the basic ones to the more advanced and efficient ones. As far as deep learning is concerned, many techniques have already been developed. We have studied around 120 papers in this research area and have concluded how deep learning helps solve the critical issues of semantic segmentation and gives more efficient results. We have also reviewed and comprehensively studied different surveys on semantic segmentation, specifically those using deep learning.
Some algorithms perform well on the specific dataset they are given as input but do not provide the same results on other datasets [102]. The primary reason is that different datasets do not go through the same operations before the training and testing phases. The second reason is that, during the machine learning process, we assume that the whole dataset is free of ambiguity and will therefore yield the most efficient and accurate result [122]. Sometimes the dataset has few samples, perhaps fewer than 1000 images in total. In that case, the model may not fit such problems well [90, 106, 107]. Overfitting and underfitting often occur when the number of samples is poorly suited to the model [16]. Since an algorithm does not perform equally on all dataset types, in this paper we will see what can be changed to improve efficiency [37]. For example, suppose you have a satellite image dataset [116], i.e., satellite images of some region on which you need to do predictive analysis, such as evaluating the climatic change in a specific area [121]. For that, you gather the training data from pictures already captured by any satellite and examine these images as a map. Here, you can use any map, such as Apple Maps, Baidu Maps, Google Maps, and so on [76]. After collecting the pictures, you build the image dataset for that specific area and apply deep learning algorithms to obtain the semantic segments. When classifying the objects that belong to the forest region [91], the trees are your required masked area. You can then easily highlight the regions where there are trees. The next step is to attach the label to that particular area [70, 71]. In this way, you can easily see which parts of the country have more forest and which have less. Forest cover is linked to some of the climatic changes we are facing.
Hence, using a semantic segmentation technique, you can quickly obtain a prediction for a specific period, such as when an area grows more trees and has good weather. Semantic segmentation is applicable in many other places, such as the medical field for automatic diagnosis of schizophrenia, where CNN-LSTM models and EEG signals can be used [22, 69, 75]. Object detection [10] is another essential part of semantic segmentation: after classification, you extract edges and then detect the object [42]. Another helpful example is detecting diseases from a medical training dataset [72]. The most recent research in this area is the detection of COVID-19 [21, 73]: given samples of lungs, you can train on all the images of infected lungs and then evaluate the differences between images of infected lungs and healthy lungs [105]. The most popular modality is the chest X-ray scan. In this example, you take the whole dataset, evaluate which images show a particular disease, and then spot and mask the infected area. After that, the difference between the input lung image and the healthy lung image is calculated for evaluation based on critical analysis [62]. Therefore, a good and accurate algorithm can give you more efficient results. In weather prediction, you can likewise detect the climatic change in a specific area. The problem you may encounter here is obtaining the primary dataset together with all the behavioral changes over time; the images need to be analyzed before you build your dataset. So, in this field, collecting the data is itself a critical step in applying deep learning algorithms [40]. Our purpose is to deliver a core understanding of all these concepts.
We first used supervised learning algorithms [37]; after seeing that they do not give the most efficient results, we moved on to more efficient algorithms. The algorithms we tested at the beginning use simple or lightweight images. We used already available libraries to get these results and then tried the more complicated algorithms. This research area is becoming more popular day by day. We tried the algorithms on the CamVid [113], Pascal VOC [76], and COCO [100] datasets. The dataset we used throughout is CamVid, which works for most techniques [57]; for masking, we used the COCO dataset.

CamVid, the Cambridge-driving Labeled Video Dataset, consists of 367 training pairs, 101 validation pairs, and 233 test pairs with 32 semantic classes. It is considered the most advanced dataset for real-time semantic segmentation [63]. The other dataset we experimented on is COCO. COCO, the Microsoft Common Objects in Context dataset, is a large image dataset explicitly designed for image processing tasks such as object detection, person keypoint detection, caption generation, segmentation, and many other currently prevalent problems. It has around 80 classes and contains more detailed objects inside the images. Moreover, we also used some small datasets such as the balloons dataset, shapes [19] (to detect rectangles, circles, triangles, etc.), and nucleus [74, 105] (for the medical field). The purpose of using these datasets is to check whether a network architecture can work on other datasets with the same accuracy [123, 124] and loss.

There are many other semantic segmentation datasets. Mapillary Vistas [61] contains around 25000 high-resolution images with 66 defined semantic classes. Moreover, it also has instance-specific labels for 37 classes.
Mainly, the classes belong to road scenes, the same as in the Cityscapes dataset, but it has more annotations than Cityscapes. Moreover, SemanticKITTI is also regarded as an outstanding dataset for understanding the semantics of scenes [5]. It is based on the KITTI Vision benchmark and includes all the sequences or frames, covering all viewing angles (Table 1). The BDD dataset [33, 85, 91] supports a complete set of tasks, including object detection, multi-object tracking, semantic segmentation, instance segmentation, segmentation tracking, and lane detection. For semantic segmentation, BDD has 19 classes, and its samples are not so practical for semantic segmentation of urban scenes. WildDash 2 is also a primary dataset for semantic segmentation, but it has limited material, i.e., too few training and testing samples to fulfill an algorithm's requirements. So, it is advisable to prefer the other highly organized and well-managed datasets [110]. WildPASS is a panoramic semantic segmentation dataset captured in cities. It comes in two versions, WildPASS and WildPASS2K. The first contains 500 panoramas from 25 cities, and WildPASS2K contains 2000 labeled panoramas taken from 40 cities. Although the variation is acceptable, this dataset is not recommended for the complex scenarios found in urban scenes [104].

GrabCut is an image-based segmentation method [67] that is important for extracting an object from a defined area or region. Here, we extract the image information by dividing the image into two regions, background and foreground; after that, we can form a segment. As the region-based algorithm, we used GrabCut [62], implemented in Python with OpenCV. The steps are as follows:
1. Take an input image.
2. Take the foreground and background separately so that they can be passed to the GrabCut function.
3. After that, the defined rectangle is mapped onto the foreground image, showing the segment separately.
The result of this experiment depends on the size of the rectangle used in Step 3. So, this algorithm cannot be called efficient, because if you change the value of the area argument, it will not give the same result (Fig. 3).

The next algorithm, a graph-based segmentation, is mainly designed to get the required segment of an image using a K map. It obtains the necessary mask based on the neighborhood, and after that, we can get the required section of the image based on RGB values. The images we use as input are not high-resolution images. This algorithm works quickly only on low-resolution images and easily gets all the segments based on colors and neighborhoods (Table 2). The main steps of this algorithm are explained below:
- First, take an input image; you can take any image you want.
- Apply a Gaussian filter to smooth the image.
- Then, create a graph for building segments.
- Segment the graph by merging neighboring pixels; in this case we used a neighborhood of four or eight, i.e., merging based on neighborhood or similarity.
- After the segments have been created, generate an output image with all the details mapped on the plot.
- Then, return a graph with the edges of the input image.
- To segment the graph, we used a threshold, which gives us the output as a segmented image.
This algorithm is efficient and can get all the segments of the image based on colors. The only drawback we found is that on a high-resolution image the algorithm becomes slower and causes a delay. There is also an ambiguity of classes. As no classes are defined for this algorithm, the road, which should be colored with one color, is segmented into many parts. Furthermore, the person in the original image should be masked as one object, as should the car beside the person.
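The graph-based steps listed above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: it assumes a small grayscale grid instead of an RGB image, skips the Gaussian smoothing step, uses a 4-pixel neighborhood, and applies the merging threshold in the style of Felzenszwalb and Huttenlocher. The function name `segment_graph` and the scale parameter `k` are hypothetical names chosen for this illustration.

```python
def segment_graph(image, k=10.0):
    """Felzenszwalb-style graph segmentation of a 2D grayscale grid (sketch).

    image: list of rows of numeric intensities.
    k: scale parameter (assumed name); larger values favour larger segments.
    Returns a grid of integer segment labels, numbered from 0 in raster order.
    """
    height, width = len(image), len(image[0])
    n = height * width
    parent = list(range(n))   # union-find forest over pixels
    size = [1] * n            # number of pixels in each component
    internal = [0.0] * n      # largest edge weight merged inside each component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Build the graph: 4-neighbourhood edges weighted by intensity difference.
    edges = []
    for y in range(height):
        for x in range(width):
            p = y * width + x
            if x + 1 < width:
                edges.append((abs(image[y][x] - image[y][x + 1]), p, p + 1))
            if y + 1 < height:
                edges.append((abs(image[y][x] - image[y + 1][x]), p, p + width))
    edges.sort()

    # Merge components joined by an edge no heavier than either side's threshold.
    for weight, a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        if weight <= min(internal[ra] + k / size[ra], internal[rb] + k / size[rb]):
            parent[rb] = ra
            size[ra] += size[rb]
            internal[ra] = max(internal[ra], internal[rb], weight)

    # Relabel component roots as consecutive integers.
    labels = {}
    return [[labels.setdefault(find(y * width + x), len(labels))
             for x in range(width)] for y in range(height)]


# Usage: a dark block on the left and a bright block on the right.
image = [[10, 10, 200, 200],
         [10, 12, 200, 198],
         [11, 10, 199, 200]]
segments = segment_graph(image, k=10.0)
```

Because no classes are involved, the labels carry no meaning beyond "same segment", which is the class ambiguity described above; lowering `k` also splits smooth regions such as a road into more parts.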
Keeping all these things in mind, we turn to the deep learning approach, in which we will see how classes can be categorized and read by the machine (Fig. 4). The algorithm we tried is based on the down-up arrangement of convolutional blocks and produces the semantic segments as a color map.

Currently, state-of-the-art methods include many approaches to semantic segmentation problems. Encoder-decoder-based models, including fully transformer-based network models, are very popular, and they also give promising results [80]. Modern research has applied fully transformer-based architectures, while some work has adopted CNN-based semantic segmentation models. Moreover, hybrid network models are also a practical approach to these problems [95].

For semi-supervised segmentation, consistency regularization is a popular topic. This method enforces consistent predictions and intermediate features. Input perturbation techniques randomly augment the input images and apply a consistency constraint between the predictions for the augmented images, so that the decision boundary lies in a low-density region. Feature perturbation techniques use multiple decoders and ensure that the outputs of the decoders are consistent. Furthermore, the GCT technique performs network perturbation by employing two segmentation networks with the same structure but different initialization, and enforces consistency between the perturbed networks [104].

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has made significant progress in developing pseudo masks and training a segmentation network [43]. Recently, WSSS approaches have relied on class activation maps (CAMs) to locate objects by identifying the image pixels that are helpful in classifying them. This does not mean that CAMs cannot create helpful pseudo masks, but they only accentuate the most discriminative portions of an object.
A great deal of work has been invested in finding a solution to this problem. Some approaches employ tactics such as region erasing, region monitoring, and region growing to complete the CAMs. Other approaches use an iterative process to develop the CAM. PSA and IRN, for example, propose propagating local responses to the surrounding semantic entities using random walks. As previously stated, this problem stems from a gap between classification and segmentation. Many researchers have noticed this and are investigating ways to narrow the gap using extra supervision, such as CAM consistency, cumulative feature maps, cross-image semantics, sub-categories, saliency maps, and multi-level feature map requirements. These strategies are straightforward, yet they provide positive results [20].

When we consider the context-based mechanism, OCNet (Object Context Network for Scene Parsing) is a good option to select as the baseline. Logically, it consists of a ResNet-FCN and an Object Context Module. After the classifier, it upsamples the image again to parse the scene and provide the mask [109]. It has several variations, such as Base-OC, Pyramid-OC, and ASP-OC (Atrous Spatial Pyramid). Besides OCNet, there are significantly matured network models like RFNET or ACNET that use asymmetric convolution blocks to strengthen the kernel structure. Such a network also helps save extra computation time. Moreover, SETR (Segmentation Transformer) is a recent transformer-based network architecture that achieves an excellent mIoU of 50.28% on the ADE20K dataset and 55.83% on Pascal Context, and also gives promising results on the Cityscapes dataset [36, 77].
There are other recent transformer-based semantic segmentation models, i.e., Trans4Trans (Transformer for Transparent Object Segmentation) and SegFormer (Semantic Segmentation with Transformers), which are significantly less computationally demanding architectures that provide multi-scale features [99, 114]. SegFormer minimizes the effect of using complex decoders. Technically, it aggregates the learned features from all layers into a maximized and enriched representation. The authors of [99] also scale up the basic approach and report robust results of up to 84.0% on the Cityscapes dataset.

An omni-supervised learning framework has also been designed for efficient CNNs; it combines different data sources and in this way improves reliability in unseen areas. So, a traditional CNN can use this framework to take advantage of both labeled and unlabeled panoramas [103]. Researchers now plan to take a panoramic panoptic segmentation approach for better scene understanding.

Traditional semantic segmentation is based on RGB images, which is not a reliable way to deal with complex outdoor scenarios. Polarization sensing can be adopted as an efficient approach to these issues. By getting the information from the optical sensors, we can obtain exact information regardless of which materials are involved [47, 98].

Another challenging aspect of semantic segmentation is 3D segmentation in computer vision, which can be applied in autonomous driving, medical image analysis, and robotics. Traditionally, it relied on handcrafted features while addressing the shortcomings of 2D-based segmentation [31]. Several datasets are well known in 3D semantic segmentation, including ShapeNet, PSB, COSeg, ScanNet, etc. [60].

Another fascinating aspect of semantic segmentation is mapping the associated pixels in night scenes, where illumination is minimal.
In such scenes, the real objects are not fully visible, and network models need to look into the finer details. A re-weighting methodology has previously been adopted to cope with false predictions [94]. Publicly available datasets can be used for this subject. These scenarios are comparable to medical images, where the images are almost gray and the features are not clearly shown.

The dataset we used for this is the CamVid dataset, whose images are all around 960x720 in size. The network architecture we follow is based on the core UNET. We have 32 semantic classes; each class has a color (RGB) defined by CamVid. The classes are categorized by the characteristics of the objects in the input. Mainly, Moving Objects comprises animals, pedestrians, children, rolling carts, bicyclists, motorcycles, cars, trucks, and trains. Moreover, we have Road (shoulder, lane markings drivable and non-drivable), Ceiling (sky, tunnel, archway), and Fixed Objects (building, wall, tree, fence, sidewalk, parking block, pole, traffic cone, bridge, sign, text, traffic lights). The exact labeled image is shown in Figs. 5 and 6.

The UNET model was developed by Olaf Ronneberger et al. [66] for image segmentation on biomedical data. We tried the UNET model on the CamVid dataset. The input images are of size 960x720, and all of them are in PNG format. It is recommended to use images in PNG format if you are preprocessing data or building your own dataset under particular environmental conditions and camera quality, because images in other formats are less suitable for all the types of operations performed in deep neural networks. The model used for this type of segmentation works by upsampling the image matrix using a convolutional block [116]. In the max-pooling layers, the image changes from high resolution to low resolution.
At this stage we obtain the extracted part of the image, i.e., the features that need to be extracted. Moreover, the image size decreases, but the depth, context, or receptive field grows [30]. The price is that we lose the information about where each value was originally located in the matrix. To solve this problem, we have another step that decodes the information that was downsampled before and passes it to a transposed convolutional network to upsample it. During upsampling, the parameters of the transposed convolution are set such that the image's height and breadth are doubled, but the number of channels is halved. We then get the required dimensions with the exact information, which in return increases the accuracy. Fig. 9 shows results for the tested algorithm on the CamVid dataset.

Table caption: Comparison of all state-of-the-art network architectures, by class, on the Cityscapes dataset.

UNet has also been applied to the Aerial or Drone dataset [59], with VGG [48] as the network backbone. The drone dataset is freely available and has 23 semantic classes. Figure 7 shows the segmentation results after applying UNET to this dataset. UNet performs well in most situations. Table 3 shows the accuracy and loss values for both datasets over 5 epochs (Figs. 8, 9, 10, 11 and 12).

Another famous and more useful algorithm for datasets like VOC2012 [49], CamVid, and COCO is Mask RCNN, an extension of Faster RCNN, which is mainly for object detection. The experiment was done using ResNet-101 as the backbone. ResNet-101 [39] is a CNN that is 101 layers deep. You load a network pre-trained on more than a million images from the ImageNet dataset, which has 1.5 million images in 1000 image classes. All input images are of size 224*224. The ResNet pre-trained on the ImageNet dataset can be further evaluated using the Fast RCNN (Figs. 13, 14, 15 and 16).
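Returning to the UNET encoder-decoder described above, the following minimal sketch traces how the feature-map shape changes: each pooling step halves height and width while the channel count doubles, and each transposed convolution doubles height and width while halving the channels. It only does shape arithmetic and builds no network. The function name `unet_shape_trace`, the base width of 64 channels, and the depth of 4 are assumptions made for illustration, and size-preserving convolutions are assumed.

```python
def unet_shape_trace(height, width, base_channels=64, depth=4):
    """Return (stage, height, width, channels) tuples for a UNET-style network.

    Assumes size-preserving convolutions, 2x2 max pooling in the encoder,
    and stride-2 transposed convolutions in the decoder.
    """
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    h, w, c = height, width, base_channels
    trace = [("enc0", h, w, c)]

    # Encoder: pooling halves the resolution, the next block doubles the channels.
    for level in range(1, depth + 1):
        h, w, c = h // 2, w // 2, c * 2
        trace.append((f"enc{level}", h, w, c))

    # Decoder: transposed convolution doubles the resolution, halves the channels.
    for level in range(depth - 1, -1, -1):
        h, w, c = h * 2, w * 2, c // 2
        trace.append((f"dec{level}", h, w, c))

    return trace


# Usage with the 960x720 CamVid image size (height 720, width 960).
for stage, h, w, c in unet_shape_trace(720, 960):
    print(f"{stage}: {h} x {w} x {c}")
```

The decoder ends at the input resolution, which is why a per-pixel class map can be produced; the skip connections that restore the lost positional detail are not modeled here.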
The pre-trained network can then be validated on any dataset whose object classes are similar to those it was trained on. For example, a weight file pre-trained on the COCO dataset can be validated on the COCO dataset. Moreover, when we tested it with VOC, it showed a good efficiency of around 98-99% in every object class. After detecting the objects that belong to different classes, it masks each instance. Eventually, it proved fruitful for class-wise object detection on every image when tested on the CamVid dataset.

The experiments were performed on good hardware. The conventional algorithms and the Mask-RCNN experiment were run on a 2.2GHz dual-core Intel Core i7 (Turbo Boost up to 3.2GHz) with 4MB shared L3 cache. The UNET experiment was accelerated with NVIDIA Tesla P100 GPUs. Selecting the system or hardware for the customization and performance analysis of semantic segmentation algorithms is also a key aspect [113].

Efficiency affects the segmentation results. The reason is that semantic segmentation involves non-local information. Sometimes, we focus on the feature map level rather than the image level. However, we have already lost spatial information by the time we reach the last feature map. We must keep the residuals in order to manage the dependencies between pixels that are far apart within the image.

Performance evaluation can vary from problem to problem when building a deep neural network. Mainly, traditional methods like KNN [81], decision trees with boosting [85], SVM [53], conditional random fields, or any statistical approach use accuracy or precision as performance evaluation metrics. As far as deep learning is concerned, we have more performance metrics for classification, object detection, and semantic segmentation [89].
- Normally, we evaluate the final performance based on the accuracy (mIoU, mean Intersection over Union).
  mIoU is obtained by comparing the ground truth values with the output map produced after passing our image through the derived model.
- The second measure is usually the time it takes to process the image in the CNN network. It is reported in FPS or, equivalently, as latency.
- The third is the number of network parameters learned by the derived network.
- The storage space needed to save all the network parameters. This is also known as network size.
- The computational cost on the GPU that we are using. Sometimes, it is expressed as execution time relative to the GPU frequency. The higher the value, the more efficiently the GPU works.
- Utilization of the hardware resources, mainly CPU, GPU, and memory.
- The amount of power that the system consumes.
- The memory bandwidth that we are using. It refers to the number of bytes transferred from memory per unit of time (the memory is sometimes shared and sometimes not).
- The training setup also matters: which environment, IDE, or deep learning libraries we are using.

We have run experiments on the traditional algorithms and on advanced deep learning algorithms and concluded that traditional algorithms do not work well on real-time images, because they do not assign a meaning to each pixel. A model like UNET, SegNet, or Deeplabv3+ is the right choice now [51]. The reason is that these models can give exact pixel information without losing the pixel values. In an encoder-decoder structure such as the UNET or SegNet model, we use skip connections on the decoder side to regain the information that was lost in the max-pooling operations on the encoder side. Losing information here refers to the pixel values lost during feature extraction. Deeplab V3 has three more updated versions with improvements. UNet also has improved versions, namely Residual U-Net and Fully Dense UNet. UNet helps detect low-level features. It was initially proposed for medical image segmentation.
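The mIoU measure listed first among the metrics above can be made concrete with a small sketch. It compares a ground-truth label map with a predicted label map, computes intersection over union per class, and averages over the classes that occur. The function name `mean_iou` and the choice to skip classes absent from both maps are assumptions of this sketch; benchmark toolkits may handle absent classes differently.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean Intersection over Union for two 2D label maps of equal shape.

    y_true, y_pred: lists of rows holding integer class ids in [0, num_classes).
    Classes that appear in neither map are left out of the average.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes

    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == p:
                # Pixel counted once in both intersection and union of class t.
                intersection[t] += 1
                union[t] += 1
            else:
                # Mismatch enlarges the union of both classes involved.
                union[t] += 1
                union[p] += 1

    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Usage: one of four pixels is misclassified.
truth = [[0, 0],
         [1, 1]]
prediction = [[0, 1],
              [1, 1]]
score = mean_iou(truth, prediction, num_classes=2)
# Class 0 IoU is 1/2 and class 1 IoU is 2/3, so the mean is 7/12.
```

Averaging per class rather than per pixel means a small class such as a traffic sign weighs as much as a large class such as road, which is why mIoU is sensitive to small-object performance.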
Residual UNet was introduced to improve the performance of the UNet architecture. In it, a series of residual blocks are stacked together, which mitigates the degradation problem with the help of skip connections, the same as in UNet, and helps propagate the low-level features. Moreover, Dense UNet was introduced by modifying the UNet architecture with dense blocks. It helps reduce artifacts while allowing each layer to learn features at various spatial scales.

Table 4 shows the comparative data of JPANet, built on three different lightweight backbone networks, and other models on the CamVid test set. JPANet not only achieves 67.45% mIoU but also reaches 294 FPS when the input is 360 × 480 low-resolution images. The data in Table 4 once again proves the effectiveness of the JPANet model. Figure 5 shows the visual comparison effect of JPANet on the CamVid test set (Figs. 17 and 18).

Eventually, UNET is easily applied in every field, especially in biomedical problems (medical image datasets) and Industry 4.0 related problems, such as detecting defects in hot-rolled steel strips, surfaces, or roads [79]. Mask-RCNN has also advanced in recent years, for example with Mask Scoring RCNN. For real-time scene understanding, Mask RCNN would be a better choice. Performance also matters. In some dense networks like Yolo V5 or Fully Dense UNET, the network parameters are abundant. When selecting a network model, you must consider lightweight architectures that can be applied to real-time applications and are fast in computation.

We can see from Table 5 that JPANet achieved the best scores in 18 of the 19 classification categories. This is because JPANet emphasizes the importance of shallow spatial information. The improvement of JPANet on small objects is the most obvious. For instance, the JPANet accuracy on the traffic signal and traffic sign classes is 24.6% and 19.8% above ESPNet, respectively.
Besides, JPANet also pays attention to extracting multi-scale semantic information. Thus JPANet also improves the segmentation results for large targets to a certain extent. For instance, the accuracy of JPANet on the sidewalk and car classes is 1.7% and 1.2% above the state-of-the-art ERFNet, respectively (Table 5).

In this paper, a comprehensive overview of semantic segmentation algorithms and their grouping has been discussed. Some classical and some machine learning algorithms are compared and examined. Furthermore, by adding some parameters, we have calculated the efficiency of the model. Semantic segmentation is becoming a hot topic in every field, whether in industrial projects or in the medical field. Most of the time, it helps us examine the critical details of a particular application after valuable data preprocessing. The future aspects of this research area can be studied as below:
- Annotation is a big challenge for datasets that have very few samples for training, as in segmenting surface defect datasets, in some medical diagnostics, or in soft robotics, where the field is quite new and publicly available datasets are scarce.
- Few-shot segmentation can be used to solve the problem of sparsely annotated datasets. The same problem can also be tackled using data augmentation techniques.
- Small object detection and segmentation is also an essential aspect of semantic segmentation that most researchers want to solve these days. Small does not mean tiny, but such objects are not as straightforward to handle as nearer objects. Typically, aerial images and long-distance scenes are the main subjects for examining the classification or segmentation of small objects.
- Weather conditions could be a significant point of discussion in urban scene datasets for semantic segmentation.
- Moreover, lighting effects and their control are crucial aspects that need to be addressed.
-Computational loaded segmentation approaches are also becoming quite negligible because of the robust and less-computational network architectures introduced daily like Trans4Trans or SegFormer. They have already taken the place of traditional encoder-decoder-based network architectures. So, to cover the whole understanding in every specific field and to understand the fundamental challenges, we must have a clear understanding of how we extract features, whether it is for detecting big objects or for detecting the smaller object in images due to the variation in distance, or lightening conditions. We have no conflict of interest. A perceptual prediction framework for self supervised event segmentation Interactive full image segmentation by considering all regions jointly Segnet: A deep convolutional encoder-decoder architecture for image segmen-tation Improved road connectivity by joint learning of orientation and segmentation SemanticKITTI: A dataset for semantic scene understanding of liDAR sequences Large-scale interactive object segmentation with human annotators A real-time semantic segmentation algorithm for aerial imagery Triply supervised decoder networks for joint detection and segmentation End-to-end learned random walker for seeded image segmentation D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation All about structure: Adapting structural information across domains for boosting semantic segmentation Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation Residual pyramid learning for single-shot semantic segmentation Learning active contour models for medical image segmentation Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision Hybrid task cascade for instance segmentation Darnet: Deep active ray network for building segmentation Object counting and instance segmentation with image-level supervision Semantic correlation 
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.