authors: Kaur, Ranpreet; GholamHosseini, Hamid; Sinha, Roopak
title: Deep Learning in Medical Applications: Lesion Segmentation in Skin Cancer Images Using Modified and Improved Encoder-Decoder Architecture
date: 2021-03-18
journal: Geometry and Vision
DOI: 10.1007/978-3-030-72073-5_4

The rise of deep learning techniques, such as convolutional neural networks (CNNs), in solving medical image problems has produced fascinating results that motivated researchers to design automatic diagnostic systems. Image segmentation is one of the crucial and challenging steps in the design of a computer-aided diagnosis system, owing to the low contrast between skin lesion and background, noise artifacts, color variations, and irregular lesion boundaries. In this paper, we propose a modified and improved encoder-decoder architecture with a smaller network depth and a smaller number of kernels to enhance the segmentation process. The network segments skin cancer images to obtain information about the infected area. The proposed model utilizes the power of the VGG19 network's weight layers to compute rich features. The deconvolutional layers were designed to regain the spatial information of the image. In addition, optimized training parameters were adopted to further improve the network's performance. The designed network was evaluated on two publicly available benchmark datasets of dermoscopic skin cancer images, ISIC and PH2. The experimental observations show that the proposed network achieved higher average values of segmentation accuracy (95.67%), IoU (96.70%), and BF score (89.20%) on ISIC 2017, and accuracy (98.50%), IoU (93.25%), and BF score (84.08%) on PH2, compared to other state-of-the-art algorithms on the same datasets.

Medical image analysis is the process of analyzing image information generated in clinical settings. The primary goal of image analysis is to obtain relevant information for diagnosis. Recent developments in computer vision techniques have made medical image analysis one of the most exciting research areas. Deep learning is a popular choice among researchers for solving medical problems, such as lung cancer [1], breast tissue detection [2], glaucoma diagnosis [3], and blood pressure monitoring [4]. Our research focuses on the extraction of lesion areas from skin cancer images, which can then be further analyzed for cancer type, such as benign or malignant. Skin cancer is among the most common health problems around the world and occurs due to overexposure to ultraviolet sun rays. Its incidence has been growing rapidly, and it is the 19th most common cancer worldwide [5], making it an issue of great concern in the healthcare community. Skin cancer is the excessive growth of abnormal cells in the outermost skin layer, known as the epidermis. The cancerous cells grow uncontrollably into the body and damage surrounding tissues, forming a malignant tumor. The main types of skin cancer are squamous cell carcinoma, basal cell carcinoma, and melanoma [8], as shown in Fig. 1. Melanoma is rare but the deadliest of the skin cancer types because the cancerous cells multiply unexpectedly. Risk factors that contribute to the formation of melanoma include fair complexion, overexposure to sun rays, sunburn, genetic history, and a weak immune system [6, 7].
Researchers are working to develop efficient and cost-effective software tools to detect cancer at an early stage, which can significantly reduce the mortality rate. Deep learning approaches are widely applied in image analysis applications. For the accurate diagnosis of medical problems, it is pertinent to understand image patterns, which is possible through the image segmentation process. Segmentation is one of the crucial steps of image analysis, differentiating object descriptions from their background [9]. Manual segmentation of skin images to identify each patient's skin patterns is a tedious and time-consuming task in a clinical environment. Thus, there is a tremendous need for an automatic segmentation approach, which is the primary aim of the current research. To design an automated classification system, a precise and efficient lesion segmentation approach is one of the key requirements.

This paper makes three contributions. First, four blocks of down-sampling layers, each consisting of convolutional, batch normalization, leaky ReLU, and pooling layers, were designed as the encoder to extract spatial information from the image and to store max-pooling indices. Corresponding to each down-sampling block, a decoder section was designed to perform up-sampling; the indices computed by the down-sampling layers are reused by the up-sampling layers in the decoder section to regain spatial resolution. Second, a leaky ReLU layer was used instead of the ReLU layer for more balanced and faster learning. Finally, the proposed framework was effectively evaluated on a large database to make it more general for other applications.

Many attempts have been made to extract lesions using traditional image processing techniques, machine learning algorithms, and the latest deep learning approaches. There is no standard method of segmenting images, because the area of interest and the object description vary from application to application; hence, the choice of segmentation technique depends on the type of input images. Segmentation of lesions is a challenging task due to large variations in color, size, shape, location, and texture. The approaches that have been applied to lesion segmentation fall into five groups: edge-based methods [10, 11], thresholding-based methods [12-16], clustering methods [17, 18], active contour methods [19-21], and supervised algorithms [22-24]. These older image processing methods have proven inadequate for very complex and challenging image segmentation problems.

In 2013, an initial attempt was made to employ a deep learning approach for the segmentation of X-ray images [25]. The authors applied patch-wise classification on raw images to identify bone tissues and selected only rib regions to reduce training time. A similar approach was applied to fundus images for the segmentation of blood vessels by Melinscak et al. in 2015 [26]. They used a deep convolutional neural network that employed max-pooling layers instead of subsampling or downsampling layers and achieved an average accuracy of 94%. In 2015, Long et al. proposed the first end-to-end pixel-wise segmentation technique, the fully convolutional network (FCN) [27]. They adapted the AlexNet, VGGNet, and GoogleNet classification models for semantic segmentation by replacing the fully connected layers with convolutional layers to generate segmentation maps. In 2015, Hong et al.
proposed a semi-supervised learning approach, the deconvolution network (DeconvNet), as an extension of FCN [28]. This architecture decoupled the classification and segmentation tasks by adding bridging layers to construct class-oriented feature maps, which reduces training and inference time. The main problem with FCN was its pooling layers, which increased the field of view but discarded precise location information. In 2017, the idea of an encoder-decoder architecture was presented by Badrinarayanan et al. [29], with a network layout similar to VGG16. In this network, the encoder section reduces spatial dimensionality with pooling layers, and the decoder gradually regains the object information and spatial details. U-Net is another famous network inspired by the encoder-decoder architecture and has been successfully applied to the segmentation of medical images [30]. More details of these networks are given in the review papers [31, 32]. Some research has studied the impact of changing the network architecture using a transfer learning approach, in terms of layer structure [33-35] and training parameter optimization [36-39], aiming to reach an optimal solution that reduces computation time and power and increases network accuracy.

Motivated by the recent developments in deep learning algorithms, we propose a modified encoder-decoder framework to perform segmentation. Instead of applying any pre-processing or post-processing to improve segmentation accuracy, we focused on designing an end-to-end learning network that can process raw images contaminated with different noise levels and accurately extract the region of interest. In this paper, we build on the VGG19 network [40], which was originally designed for the ImageNet classification task. The network applies convolutional and pooling operations over an entire image to extract features. A cross-entropy loss function was used to predict the pixel-wise labels. Because the lesion occupies only a small portion of the whole image, a plain loss function tends to be biased towards the background information; thus, a weighted cross-entropy loss function was used to improve segmentation results. The designed framework was evaluated on two publicly available datasets, ISIC and PH2.

The proposed model was inspired by the VGG19 [40] deep neural network, whose layers were modified and improved to design a deep encoder-decoder architecture for lesion segmentation. The CNN architecture designed for skin cancer segmentation is shown in Fig. 2. The idea of an encoder-decoder network was originally developed for semantic segmentation in road scene applications. The proposed network was designed by changing the network depth and layer configuration and by optimizing hyperparameters. The first half of the architecture, called the encoder section, has four blocks, each consisting of convolutional, batch normalization, leaky ReLU, and max-pooling layers. The second half is the decoder section, with deconvolutional, batch normalization, leaky ReLU, and max-unpooling layers corresponding to each layer in the encoder section, followed by a segmentation layer based on a cross-entropy loss function. In this network, the pooling indices computed by the max-pooling layers of the encoder section are used to perform up-sampling in the decoder section. The fully connected layers of the earlier networks were removed, which made the encoder section smaller and reduced the number of learnable parameters.
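To make this layout concrete, the following PyTorch sketch wires one encoder block to its mirrored decoder block and passes the max-pooling indices across for unpooling. It is an illustrative reconstruction from the description above, not the authors' code (which the paper does not provide); the leaky-ReLU slope of 0.01 is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncDecBlockPair(nn.Module):
    """One encoder block and its mirrored decoder block, as described above:
    conv -> batch norm -> leaky ReLU -> max pool (indices kept), then
    max unpool (reusing those indices) -> deconv -> batch norm -> leaky ReLU."""

    def __init__(self, in_ch=3, ch=64, slope=0.01):  # slope value is an assumption
        super().__init__()
        self.enc_conv = nn.Conv2d(in_ch, ch, kernel_size=3, stride=1, padding=1)
        self.enc_bn = nn.BatchNorm2d(ch)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
        self.dec_conv = nn.ConvTranspose2d(ch, in_ch, kernel_size=3, stride=1, padding=1)
        self.dec_bn = nn.BatchNorm2d(in_ch)
        self.slope = slope

    def forward(self, x):
        # Encoder: extract features, halve the spatial size, and remember
        # where each pooled maximum came from.
        feat = F.leaky_relu(self.enc_bn(self.enc_conv(x)), self.slope)
        pooled, indices = self.pool(feat)
        # Decoder: place the pooled values back at the remembered locations
        # to regain spatial resolution, then refine with a deconvolution.
        up = self.unpool(pooled, indices, output_size=feat.shape)
        return F.leaky_relu(self.dec_bn(self.dec_conv(up)), self.slope)

x = torch.randn(1, 3, 300, 300)    # the paper's 300 x 300 RGB input size
print(EncDecBlockPair()(x).shape)  # torch.Size([1, 3, 300, 300])
```

Stacking four such encoder blocks, and their mirrored decoder blocks in reverse order, gives the overall shape of the architecture described above.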
In the encoder section, a convolutional operation is performed with a set of kernels to generate feature maps. These are then batch normalized, and the negative values are scaled by a small fixed factor by applying an element-wise leaky rectified linear unit (leaky ReLU). Following this, a max-pooling operation is applied with a 2 × 2 window size, a stride of 2, and padding 0. Max pooling is used to reduce the size of the feature maps while preserving their significant characteristics. It is often placed between two convolutional layers and is responsible for decreasing the number of parameters and calculations in the network. The pooling operation can be written as:

$FM_{x,y} = f_p(I_{x,y})$

where $FM_{x,y}$ is the output feature map, $f_p$ is the pooling function, and $I_{x,y}$ is the input feature map from the previous layer. Max pooling takes the maximum value from the region overlapped by the kernel. The repetitive use of max pooling and striding at consecutive layers reduces the spatial resolution of the output feature maps. Some architectures, such as fully convolutional networks, use a 'transposed convolutional layer' instead, but this costs more time and memory. In the original VGG19 network, several convolutional layers were used in succession to achieve high translation invariance for more robust classification. However, this causes a loss of spatial resolution, which is not suitable for segmentation. Therefore, it is important to preserve boundary information across the sub-sampling layers. To achieve this, the max-pooling indices were stored, i.e. the locations of the maximum features were memorized and used in the decoder section to perform up-sampling that regains spatial resolution. The high-dimensional feature maps are fed to a multi-class softmax pixel-wise classifier that generates K-channel probabilities for each pixel, where K is the number of classes.

We note that DeconvNet, SegNet, and U-Net share a similar kind of architecture but differ in layer organization, hyperparameters, and training. DeconvNet is hard to train end-to-end due to the presence of fully connected layers, SegNet is very time consuming, and U-Net does not reuse pooling indices.

The network consists of 60 layers designed from scratch, broadly divided into encoder and decoder sections, as shown in Fig. 2. The layers of the encoder block follow the VGG19 layout with some additional layers, whereas the corresponding decoder block is our contribution. The first section has 29 layers that extract object details by down-sampling the feature maps, and the second section has the respective deconvolutional layers to up-sample the image information. The network has a total of 60 layers and 134 layer connections and generates output feature maps of different sizes after each layer; the network details are given in Table 1.
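Before detailing the individual layers, a small numeric example may help clarify the index mechanism: max pooling keeps the location of each maximum, and unpooling writes the values back at exactly those locations. This is a minimal PyTorch illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

# A 1x1x4x4 feature map with distinct values, so the max locations are obvious.
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# 2x2 max pooling with stride 2 keeps the maximum of each window together
# with its flattened position (the "pooling index").
pooled, idx = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
print(pooled)  # tensor([[[[ 5.,  7.],
               #           [13., 15.]]]])

# Unpooling writes each kept value back at its remembered position and fills
# the rest with zeros, restoring the original 4x4 resolution.
restored = F.max_unpool2d(pooled, idx, kernel_size=2, stride=2)
print(restored)
# tensor([[[[ 0.,  0.,  0.,  0.],
#           [ 0.,  5.,  0.,  7.],
#           [ 0.,  0.,  0.,  0.],
#           [ 0., 13.,  0., 15.]]]])
```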
Encoder Section. This is the first half of the network, responsible for down-sampling the feature maps and extracting features of the input images using the following layers.

Convolutional Layer: This layer generates a feature map by sliding a kernel over the entire image and computing the product between the kernel and the underlying region of the image:

$FM_{x,y} = \sum_{i=1}^{l} \sum_{j=1}^{l} I_{x+i-1,\, y+j-1} \cdot K_{i,j}$

where $FM_{x,y}$ is the output feature map, $x, y$ are the pixel positions in the spatial domain, $l$ is the kernel size, $I$ is the input image, and $K$ is the kernel or template. The other parameters of this layer are the kernel size, number of kernels, stride, and padding. The kernel size chosen for the current network is 3 × 3, as this size is suitable for these images and covers all pixel positions; moreover, a larger kernel size leads to more time and memory consumption. Each encoder performs a convolutional operation with multiple kernels to produce a feature map. A higher number of kernels helps extract very fine details of the image, but it also increases execution time. We chose 64 kernels, whereas in the original VGG19 network the number of kernels at each convolutional layer varies from 64 to 512. The stride value, which defines the step size of the kernel when sliding over the image, is set to 1. The last parameter, padding, is set to 1, which adds a one-pixel border around the image so that the output is not cropped.

Batch Normalization Layer: The output feature maps of the convolutional layer are fed to batch normalization. This layer does not reduce the size of the feature map; it allows each layer to learn on a more stable distribution of inputs by normalizing them, which accelerates the training of the network.

Leaky ReLU Layer: This layer applies a threshold function to each element of the input: a positive element is passed to the next layer unchanged, while a negative element is multiplied by a fixed scalar, using the activation function:

$f(x) = \begin{cases} x, & x \ge 0 \\ a \cdot x, & x < 0 \end{cases}$

where $a$ is the fixed scalar.

Max-pooling Layer: This layer reduces the spatial size of the network, decreasing the number of parameters and computations; this is also called down-sampling. The max-pooling layer in the current network uses a window size of 2 × 2, padding [0 0 0 0], and a stride of [2 2]. The stride value in the max-pooling layer denotes the number of shifts over the input matrix, and padding inserts extra zeros so that the filter exactly fits the input image.

Decoder Section. The decoder in the second half of the network performs the reverse operation of pooling and reconstructs the original image resolution. The deconvolutional layers use the same number of kernels, with the same size, stride, and padding values, to obtain an activation map through up-sampling layers that swap the forward and backward passes of a convolution. As a result, a dense feature map is produced that represents the high image resolution. Batch normalization and leaky ReLU layers are then placed in the same pattern as in the encoder section. After that, the max-pooling indices calculated in the encoder section are applied to recover the original image resolution. In this way, the network learns from both global information and fine details to perform segmentation.

The output of the decoder blocks is given to the pixel classification layer, which generates each pixel's posterior probability using the weighted cross-entropy loss [41]. This loss function measures the error between the prediction scores $P$ and the targets $T$:

$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} w_k \, T_{nk} \log P_{nk}$

where $N$ is the number of observations, $K$ is the number of classes, and $w$ is a vector of weights determined by the network for each class. The cross-entropy loss is widely used to evaluate the performance of medical image segmentation against the ground truth images.
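A minimal sketch of such a weighted pixel-wise loss in PyTorch is shown below. The class weights (0.3 for background, 0.7 for lesion) are illustrative assumptions, since the paper does not report the values of $w$; the point is only that the minority lesion class receives the larger weight.

```python
import torch
import torch.nn as nn

# Pixel-wise weighted cross-entropy for K = 2 classes (0 = background,
# 1 = lesion). nn.CrossEntropyLoss applies the softmax internally, matching
# the softmax pixel-wise classifier described above. The weight values here
# are assumptions for illustration only.
weights = torch.tensor([0.3, 0.7])
loss_fn = nn.CrossEntropyLoss(weight=weights)

scores = torch.randn(4, 2, 300, 300)          # N x K x H x W prediction scores P
targets = torch.randint(0, 2, (4, 300, 300))  # N x H x W ground-truth labels T
print(loss_fn(scores, targets))               # scalar weighted loss value
```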
Network Training Algorithm. Training a deep neural network on 5694 images with different lesion shapes was a challenging task. As explained in the previous section, the network consists of 60 layers, where 8 convolutional and 4 pooling layers perform the encoding function and the corresponding 8 deconvolutional and 4 unpooling layers regain the original information. Finally, the classification layer classifies each pixel into 2 classes, i.e. lesion or background, and generates a posterior probability map for each pixel. A few parameters were fine-tuned for the proposed network to achieve effective training.

Adaptive Moment Estimation (ADAM) Optimizer: This optimization algorithm combines gradient descent with momentum to update the network training parameters, weights, and biases, aiming to minimize the loss value. It computes an adaptive learning rate for each parameter of the network, unlike stochastic gradient descent, which uses a single learning rate for all weight updates. The original gradient descent update, which moves the network parameters in the direction of the negative gradient of the loss function, is given as:

$\theta_{i+1} = \theta_i - \alpha \nabla L(\theta_i)$

where $i$ denotes the iteration, $\alpha > 0$ is the learning rate taken as 0.01, $\theta$ is the parameter vector, and $L(\theta)$ is the loss function. In the standard stochastic gradient method, the gradient of the loss function, $\nabla L(\theta)$, is evaluated for the entire training dataset. In contrast, the ADAM optimizer adds 'momentum' terms that keep element-wise moving averages of the parameter gradients and of their squared values:

$m_i = \beta_1 m_{i-1} + (1 - \beta_1) \nabla L(\theta_i)$

$v_i = \beta_2 v_{i-1} + (1 - \beta_2) \left[ \nabla L(\theta_i) \right]^2$

and uses these moving averages to update the network as:

$\theta_{i+1} = \theta_i - \frac{\alpha \, m_i}{\sqrt{v_i} + \epsilon}$

where $\epsilon$ is a small constant that prevents division by zero, $\beta_1$ and $\beta_2$ are the gradient decay factor, set to 0.95, and the squared gradient decay factor, set to 0.99, respectively, and the learning rate $\alpha$ was specified as 0.03.

Datasets. Nearly 5694 images of BCC, melanoma, squamous cell carcinoma, and nevus skin cancer types were collected from the ISIC 2018 archive repository [42]. Approximately 3100 images from the ISIC 2017 challenge were used for testing purposes. Another dataset used for testing was PH2 [43], which consists of about 200 images, including ground truth images. Ground truth images are necessary for evaluating segmentation results, to verify whether the approach correctly classified all the pixels into foreground and background. The original images had dimensions of 1022 × 767 × 3, which is too large to process and can slow down the system; thus the images were downsized to 300 × 300 × 3. Different image sizes were considered, such as 224 × 224, 227 × 227, and 256 × 297, but 300 × 300 gave the optimal performance. To prepare the datasets for training and validation, the whole dataset was split in the ratio of 80% to 20%, respectively (Fig. 3).

Evaluation Metrics. The performance evaluation metrics used in our experiments are accuracy, intersection over union (IoU), and BF score. The values of these parameters are expected to be high for efficient segmentation, and they are calculated on the testing dataset. The parameters TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives. Accuracy is the number of correctly identified pixels over the total number of pixels, $(TP + TN) / (TP + TN + FP + FN)$. Intersection over union (IoU) is the ratio of correctly classified pixels to the total number of pixels assigned to that class by either the ground truth or the predictor, $TP / (TP + FP + FN)$. The BF score computes the boundary F1 contour matching score between the predicted segmentation and the true segmentation in the ground truth:

$\text{BF Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$    (10)
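The accuracy and IoU computations can be sketched directly from the TP/TN/FP/FN counts, as below. This is an illustrative implementation assuming binary masks with 1 = lesion; the BF score additionally requires boundary extraction with a distance tolerance and is omitted here.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Accuracy and IoU from binary masks (1 = lesion, 0 = background),
    following the definitions above."""
    tp = np.sum((pred == 1) & (gt == 1))  # lesion pixels correctly found
    tn = np.sum((pred == 0) & (gt == 0))  # background pixels correctly found
    fp = np.sum((pred == 1) & (gt == 0))  # background labelled as lesion
    fn = np.sum((pred == 0) & (gt == 1))  # lesion pixels missed
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)             # Jaccard index for the lesion class
    return accuracy, iou

pred = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0]])
gt   = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]])
print(segmentation_metrics(pred, gt))  # (0.778, 0.5): 7 of 9 pixels correct,
                                       # lesion overlap 2 over union 4
```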
To evaluate the effectiveness of the network, it was compared with other state-of-the-art methods from the last four years (2016-2019), and the designed network was observed to outperform the others. The qualitative segmentation results produced by the proposed network are displayed for the input images (see Fig. 4). They show that the lesion region was extracted accurately, although some irrelevant background object information remains for a few highly noisy images. The proposed network was trained on 5694 RGB images and tested on 3300 images distinct from the training set to measure the performance parameters. A comparative analysis with other similar research studies based on these metrics on the ISIC dataset is given in Table 2. It clearly shows the higher performance of the proposed approach in terms of accuracy and IoU (also called the Jaccard index), which should be high for good segmentation results. The literature studies chosen for the comparison are based on the same datasets, i.e., ISIC 2017 and PH2. Our network achieved a higher value of this parameter, with a margin of 0.43% over the other mentioned studies. Moreover, it achieved higher accuracy and BF score than all the compared studies except [39], while outperforming the other studies with the highest Jaccard score of 96.7% and accuracy of 95.6%. Table 3 presents the parameters on the PH2 dataset and shows the higher performance of the proposed network over other approaches from 2016-2019 in terms of accuracy and IoU. Our model achieves higher accuracy and overlap index (IoU) than other state-of-the-art methods, which shows that the network is efficient in segmenting lesions from dermoscopic images on the PH2 dataset. The impact of choosing different depths for the encoder-decoder architecture is illustrated in Table 4. It shows that an architectural depth below 4 gives lower segmentation accuracy, whereas a depth higher than 4 gives a small increase but leads to a longer execution time. Thus, 4 was selected as the optimal network depth for the proposed model.

In this paper, we presented a modified deep learning framework to meet the challenges of automatic lesion segmentation in dermoscopic images. The new network delivers increased segmentation accuracy despite highly noisy data and varying lesion shapes. The design of the new network is inspired by the VGG19 classification network and encoder-decoder models. The layers of the encoder section are organized using the VGG19 framework as a base, with some additional layers such as a leaky ReLU in each block. To boost the network's performance, a leaky ReLU layer, fine-tuned training parameters (kernel size, number of filters, learning rate, momentum, and batch size), and a weighted cross-entropy loss function were used. The network was trained on the ISIC 2018 dataset, and its performance was evaluated on two datasets, ISIC 2017 and PH2. It was concluded that the proposed network yields better performance than the other state-of-the-art methods. As illustrated in Tables 2 and 3, our proposed model outperformed previous studies by achieving higher values of accuracy and IoU. The proposed network can be efficiently applied in clinical settings to understand lesion patterns for the early diagnosis of skin cancer type.
In some instances, the region is not properly segmented due to the presence of dense hairlines, asymmetrical shapes, and color variations. Therefore, there is room for further improvement of the segmentation results by applying pre-processing techniques and improving the network configuration. Moreover, in the future, our research will focus on the design of a more optimal segmentation network with low execution time and high accuracy.

References
- Lung CT image segmentation using deep neural networks
- Automatic breast and fibroglandular tissue segmentation in breast MRI using deep learning by a fully-convolutional residual neural network (U-net)
- Robust optic disc and cup segmentation with deep learning for glaucoma detection
- Continuous cuff-less blood pressure estimation based on combined information using deep learning approach
- Where's the lesion?: Variability in human and automated segmentation of dermoscopy images of melanocytic skin lesions
- A survey on monochrome image segmentation methods
- An automatic based nonlinear diffusion equations scheme for skin lesion segmentation
- Skan: Skin scanner system for skin cancer detection using adaptive techniques
- Three-phase general border detection method for dermoscopy images using non-uniform illumination correction
- Computer-aided pattern classification system for dermoscopy images
- Border detection in dermoscopy images using hybrid thresholding on optimized color channels
- A novel approach to segment skin lesions in dermoscopic images based on a deformable model
- A coarse-to-fine approach for segmenting melanocytic skin lesions in standard camera images
- Image segmentation method based on K-mean algorithm
- Performance analysis of fuzzy C-means clustering methods for MRI image segmentation
- Classification of malignant melanoma and benign skin lesions: implementation of automatic ABCD rule
- Automatic skin lesions segmentation based on a new morphological approach via geodesic active contour
- Local edge-enhanced active contour for accurate skin lesion border detection
- MRI white matter lesion segmentation using an ensemble of neural networks and overcomplete patch-based voting
- Evolving ensemble models for image segmentation using enhanced particle swarm optimization
- An efficient multiple sclerosis segmentation and detection system using neural networks
- Segmentation of bone structure in X-ray images using convolutional neural network
- Retinal vessel segmentation using deep neural networks
- Fully convolutional networks for semantic segmentation
- Decoupled deep neural network for semi-supervised semantic segmentation
- SegNet: a deep convolutional encoder-decoder architecture for image segmentation
- U-Net: convolutional networks for biomedical image segmentation
- A survey on automated melanoma detection
- Deep learning techniques for medical image segmentation: achievements and challenges
- A novel multi-task deep learning model for skin lesion segmentation and classification
- Deep features to classify skin lesions
- Melanoma lesion detection and segmentation using deep region based convolutional neural network and fuzzy C-means clustering
- Dense fully convolutional network for skin lesion segmentation (arXiv preprint)
- SkinNet: a deep learning framework for skin lesion segmentation (arXiv preprint)
- Melanoma segmentation based on deep learning
- Automated melanoma recognition in dermoscopy images via very deep residual networks
- Very deep convolutional networks for large-scale image recognition (arXiv preprint)
- Pattern Recognition and Machine Learning
- Skin lesion segmentation based on modification of SegNet neural networks
- Supervised saliency map driven segmentation of lesions in dermoscopic images
- Automatic skin lesion segmentation using deep fully convolutional networks with Jaccard distance
- Step-wise integration of deep class-specific learning for dermoscopic image segmentation
- Dermoscopic image segmentation via multi-stage fully convolutional networks