key: cord-0885824-odtbicd1
authors: Joseph Raj, Alex Noel; Zhu, Haipeng; Khan, Asiya; Zhuang, Zhemin; Yang, Zengbiao; Mahesh, Vijayalakshmi G. V.; Karthik, Ganesan
title: ADID-UNET—a segmentation model for COVID-19 infection from lung CT scans
date: 2021-01-26
journal: PeerJ Comput Sci
DOI: 10.7717/peerj-cs.349
sha: 2892a51132f7f60c4c77f355ae01b71b56a0cd0f
doc_id: 885824
cord_uid: odtbicd1

Currently, the new coronavirus disease (COVID-19) is one of the biggest health crises threatening the world. Automatic detection from computed tomography (CT) scans is a classic method for detecting lung infection, but it faces problems such as high variation in intensity, indistinct edges near the infected lung regions, and noise introduced during data acquisition. This article therefore proposes a new deep network for COVID-19 pulmonary infection segmentation, referred to as the Attention Gate-Dense Network-Improved Dilation Convolution-UNET (ADID-UNET). The dense network replaces the convolution and max pooling functions to enhance feature propagation and to address the vanishing gradient problem. An improved dilation convolution increases the receptive field of the encoder output to obtain more edge features from small infected regions. Integrating the attention gate into the model suppresses the background and improves prediction accuracy. The experimental results show that the ADID-UNET model can accurately segment COVID-19 infected lung areas, with performance measures greater than 80% for metrics such as Accuracy, Specificity and Dice Coefficient (DC). Furthermore, compared with other state-of-the-art architectures, the proposed model showed excellent segmentation results, with a high DC and F1 score of 0.8031 and 0.82, respectively.

COVID-19 has caused a worldwide health crisis. The World Health Organization (WHO) declared COVID-19 a pandemic on March 11, 2020. The clinical manifestations of COVID-19 range from influenza-like symptoms to respiratory failure (i.e., diffuse alveolar injury), and its treatment requires advanced respiratory assistance and artificial ventilation. According to the global case statistics from the Center for Systems Science and Engineering (CSSE) of Johns Hopkins University (JHU) (Wang et al., 2020a)

RELATED WORK

The ADID-UNET model proposed in this paper is based on the UNET (Ronneberger, Fischer & Brox, 2015) architecture. We therefore discuss the literature related to our work, which covers deep learning and medical image segmentation, improvements to medical image segmentation algorithms, CT scan segmentation, and the application of deep learning to the segmentation of COVID-19 lesions from lung CT scans.

In recent years, deep learning algorithms have matured, leading to the development of various artificial intelligence (AI) systems based on them. Semantic segmentation using deep learning algorithms (Oktay et al., 2018) has also developed rapidly, with applications in both natural and medical images. Long, Shelhamer & Darrell (2015) pioneered the use of a fully convolutional network (FCN) to produce coarse segmentation outputs at the input resolution through fractionally strided convolution, also referred to as upsampling or deconvolution. The model was tested on the PASCAL VOC, NYUDv2, and SIFT Flow datasets and achieved a mean Intersection over Union (M-IOU) of 62.7%, 34%, and 39.5%, respectively. They also reported that in-network upsampling was fast, accurate, and provided dense segmentation predictions. Later, through a series of improvements and extensions to FCN (Ronneberger, Fischer & Brox, 2015; Badrinarayanan, Kendall & Cipolla, 2017; Xu et al., 2018), a symmetrical structure composed of encoder and decoder pipelines, called UNET (Ronneberger, Fischer & Brox, 2015), was proposed for biomedical image segmentation. The encoder predicted the segmentation area, and the decoder then recovered the resolution and achieved accurate spatial localization. UNET also used crop and copy operations for precise segmentation of the lesions. The model achieved good segmentation performance at the International Symposium on Biomedical Imaging (ISBI) challenge (Cardona et al., 2010), with an M-IOU of 0.9203. An improved network referred to as SegNet was proposed by Badrinarayanan, Kendall & Cipolla (2017). The model used the first 13 convolution layers of the VGG16 network (Karen & Andrew, 2014) as an encoder to extract features and predict segmentation regions. A decoder combining convolution layers, unpooling, and a softmax activation function then produced segmentation outputs at the input resolution. When tested on the CamVid dataset (Brostow, Fauqueur & Cipolla, 2009), the M-IOU of SegNet was nearly 10% higher than that of FCN (Long, Shelhamer & Darrell, 2015). Xu et al. (2018) treated segmentation as a classification problem in which each pixel is associated with a class label, and designed a CNN composed of three convolution and pooling layers, a fully connected (FC) layer, and a softmax function. The model successfully segmented three-dimensional breast ultrasound (BUS) images into four parts (skin, fibroglandular tissue, mass, and fatty tissue) and achieved a recall of 88.9%, an accuracy of 90.1%, a precision of 80.3%, and an F1 score of 0.844. According to the aforementioned literature, FCN (Long, Shelhamer & Darrell, 2015) and its improved variants produce accurate segmentation results for both natural and medical images. Consequently, UNET and its variants (Almajalid et al., 2019; Negi et al., 2020) are widely used in medical image segmentation because of their fast training and high segmentation accuracy.

Medical images such as ultrasound images are generally prone to speckle noise, uneven intensity distribution, and low contrast between lesions and background, all of which affect the segmentation ability of the traditional UNET (Ronneberger, Fischer & Brox, 2015) structure. Considerable effort has therefore been invested in improving the architecture. Xia & Kulis (2017) proposed a fully unsupervised deep learning network called W-Net, which connects two UNETs to predict and reconstruct the segmentation results. Schlemper et al. (2019) proposed an attention UNET, which integrated attention modules into the UNET (Ronneberger, Fischer & Brox, 2015) model to achieve spatial localization and subsequent segmentation. The model achieved a segmentation accuracy 15% higher than that of the traditional UNET architecture. Zhuang et al. (2019a) combined the strengths of the attention gate system and the dilation convolution module and proposed a hybrid architecture referred to as RDA-UNET. By introducing a residual network (He et al., 2016) in place of traditional convolution layers, they reported a segmentation accuracy of 97.91% for the extraction of lesions in breast ultrasound images. The GRA-UNET (Zhuang et al., 2019b) model additionally included a group convolution module between the encoder and decoder pipelines to improve the segmentation of the nipple region in breast ultrasound images. From the literature, it can therefore be inferred that introducing additional modules such as the attention gate instead of traditional cropping and copying, including dilation convolution to increase the receptive field, and using residual networks can improve the accuracy of the segmentation model. However, these successful segmentation models (Schlemper et al., 2019; Zhuang et al., 2019a; Xia & Kulis, 2017) were rarely tested on CT scans; the next section therefore concentrates on the segmentation of CT scans.

CT imaging is a commonly used technology in the diagnosis of lung diseases, since lesions can be segmented more intuitively from chest CT scans. The segmented lesions aid the specialist in the diagnosis and quantification of lung diseases (Gordaliza et al., 2018). In recent years, most classifier models and algorithms based on feature extraction have achieved good segmentation results on chest CT scans. For example, a shape-based Computer-Aided Detection (CAD) method has been proposed in which a 3D adaptive fuzzy threshold segmentation method combined with chain code was used to estimate infected regions in lung CT scans. In feature-based techniques, however, the low contrast between nodules and background makes boundary discrimination unclear, leading to inaccurate segmentation results. Many segmentation techniques based on deep learning algorithms have therefore been proposed. Wang et al. (2017) developed a central focusing convolutional neural network for segmenting pulmonary nodules from heterogeneous CT scans. Jue et al. (2018) designed two deep networks (an incremental and a dense multiple resolution residually connected network) to segment lung tumors from CT scans by adding multiple residual flows with different resolutions. Guofeng et al. (2018) proposed a UNET model to segment pulmonary nodules in CT scans, which improved the overall segmentation output by avoiding overfitting. Compared with other segmentation algorithms such as graph-cut, their model produced better segmentation results, with a Dice coefficient of 0.73. Recently, Peng et al. (2020) proposed an automatic CT lung boundary segmentation method called the Pixel-based Two-Scan Connected Component Labeling-Convex Hull-Closed Principal Curve method (PSCCL-CH-CPC). The method included (a) an image preprocessing step to extract the coarse lung contour and (b) a coarse-to-fine segmentation algorithm based on the improved principal curve and a machine learning model. The model produced good segmentation results, with a Dice coefficient as high as 96.9%. Agarwal et al. (2020) proposed a weakly supervised lesion segmentation method for CT scans based on an attention-based co-segmentation model (Mukherjee, Lall & Lattupally, 2018). The encoder was composed of a variety of CNN architectures, including VGG-16 (Karen & Andrew, 2014) and Res-Net101 (He et al., 2016), with an attention gate module between the encoder and decoder pipelines, while the decoder was composed of upsampling operations. The proposed method first generated the initial lesion areas from the Response Evaluation Criteria in Solid Tumors (RECIST) measurements and then used co-segmentation to learn more discriminative features and refine the initial areas. The paper reported a Dice coefficient of 89.8%. The above literature suggests that deep learning techniques are effective in segmenting lesions in lung CT scans, and many researchers have proposed deep learning architectures to deal with COVID-19 CT scans. The next section therefore reviews this related work.

In recent months, COVID-19 has become a topic of concern all over the world, and CT imaging is considered a convincing method for detecting COVID-19. However, because of the limited data and the time and labor involved in annotation, segmentation datasets of COVID-19 CT scans are not readily available. Nevertheless, many researchers have proposed advanced methods for COVID-19 diagnosis, including segmentation techniques (Fan et al., 2020; Wang et al., 2020b; Yan et al., 2020; Zhou, Canu & Ruan, 2020; Elharrouss et al., 2020; Chen, Yao & Zhang, 2020). Given the shortage of datasets with segmentation labels, the Inf-Net network proposed by Fan et al. (2020) combined a semi-supervised learning model and the FCN8s network (Long, Shelhamer & Darrell, 2015) with implicit reverse attention and explicit edge attention mechanisms to improve the recognition rate of infected areas. The model successfully segmented COVID-19 infected areas from CT scans and reported a sensitivity and accuracy of 72.5% and 96.0%, respectively. Elharrouss et al. (2020) proposed an encoder-decoder CNN for COVID-19 lung infection segmentation based on multi-task deep learning, which overcame the shortage of labeled datasets and segmented infected lung regions with a high sensitivity of 71.1%. Wang et al. (2020b) proposed a noise-robust segmentation network for COVID-19 pneumonia lesions, which included a noise-robust Dice loss function along with convolution functions, a residual network, and an Atrous Spatial Pyramid Pooling (ASPP) module. The model, referred to as Cople-Net, automatically segmented COVID-19 pneumonia lesions from CT scans. The study showed that the proposed loss function was better than existing noise-robust loss functions such as Mean Absolute Error (MAE) loss (Ghosh, Kumar & Sastry, 2017) and Generalized Cross-Entropy (GCE) loss (Zhang & Sabuncu, 2018), and the model achieved a Dice coefficient and Relative Volume Error (RVE) of 80.72% and 15.96%, respectively. Yan et al. (2020) employed an encoder-decoder deep CNN composed of convolution functions, a Feature Variation (FV) module (mainly containing convolution, pooling, and sigmoid functions), a Progressive Atrous Spatial Pyramid Pool (PASPP) module (including convolution, dilation convolution, and addition operations), and a softmax function. The convolution functions extracted features, the FV block enhanced the feature representation ability, and the PASPP module, placed between the encoder and decoder pipelines, compensated for the various morphologies of the infected regions. The model achieved good segmentation performance, with a Dice coefficient of 0.726 and a sensitivity of 0.751, when tested on COVID-19 lung CT scan datasets. Zhou, Canu & Ruan (2020) proposed an encoder-decoder UNET model for the segmentation of COVID-19 lung CT scans. The encoder, which comprised convolution functions and a Res-dil block (combining a residual block (He et al., 2016) and a dilation convolution module), was used to extract features and predict coarse lesion areas. The decoder pipeline restored the resolution of the segmented regions through upsampling, and an attention mechanism between the encoder and decoder captured rich contextual relationships for better feature learning. The proposed method achieved accurate and rapid segmentation of COVID-19 lung CT scans, with a Dice coefficient, sensitivity, and specificity of 69.1%, 81.1%, and 97.2%, respectively. Further, Chen, Yao & Zhang (2020) proposed a residual attention UNET for automated multi-class segmentation of COVID-19 lung CT scans, which used residual blocks to replace traditional convolutions and upsampling functions to learn robust features. A soft attention mechanism was also applied to improve the feature learning capability of the model for segmenting COVID-19 infected regions. The proposed model demonstrated good performance, with a segmentation accuracy of 0.89 for lesions in COVID-19 lung CT scans. Deep learning algorithms are therefore helpful in segmenting the infected regions from COVID-19 lung CT scans, which aids clinicians in evaluating the severity of infection, large-scale screening of COVID-19 cases, and quantification of the lung infection (Ye et al., 2020). Table 1 summarizes the deep learning-based segmentation techniques available for COVID-19 lung infections.

In this section, we first introduce the proposed ADID-UNET network and discuss its core components in detail: the dense network, the improved dilation convolution, and the attention gate system. To allow realistic comparisons, experimental results are presented in each subsection to illustrate the performance of the model after each core component is added. Further, in "Experimental Results" we summarize the percentage improvements achieved over the traditional UNET architecture.

ADID-UNET is based on the UNET (Ronneberger, Fischer & Brox, 2015) architecture with the following improvements: (a) the dense network proposed by Huang et al. (2017) is used in addition to the convolution modules of the encoder and decoder structures, (b) an improved dilation convolution (IDC) is introduced between the two frameworks, and (c) the attention gate (AG) system is used instead of the simple cropping and copying operations. The structure of ADID-UNET is shown in Fig. 2. Here $f_{en}$, $f_{upn}$, and $f_{idc}$ describe the features at the n-th layer of the encoder, decoder, and IDC modules, respectively.

When COVID-19 CT scans are presented to the encoder, the first four layers (each with convolution, rectification, and max pooling functions) extract features ($f_1$ to $f_4$) that are passed to the dense networks. Here, dense networks are used instead of convolution and max pooling layers to further enhance the features ($f_5$ and $f_6$); in "Dense Network" we elaborate on the need for the dense network and present experimental results that demonstrate its significance. Next, an improved dilation convolution module, referred to as the IDC module, is used between the encoder and decoder to increase the receptive field and gather detailed edge information that assists feature extraction. The module accepts the feature $f_6$ from the dense networks and, after refinement, presents the result ($f_{idc}$) as input to the decoder. To keep the architecture consistent and avoid losing information, the decoder mirrors the encoder, with two dense networks replacing the first two upsampling operations. To make better use of the context information between the encoder and decoder pipelines, the AG model is used instead of cropping and copying operations; it aggregates the layer-wise encoder features with the corresponding decoder features and presents them to the subsequent upsampling layers. The decoder thus produces the upsampled features $f_{up1}$ to $f_{up6}$, and the final feature map ($f_{up6}$) is passed to a sigmoid activation function to predict and segment the COVID-19 infected lung regions. The following sections explain the components of ADID-UNET in detail.

Figure 2 The structure of the ADID-UNET network. The blue, purple, black and orange arrows represent the transfer function, the convolution function with a 1 × 1 kernel and sigmoid function, the upsampling function, and the convolution function with a 3 × 3 kernel and RELU function, respectively. The triangle represents the attention gate system. The orange, purple, blue and black dotted boxes represent the dense network, concatenate function, dilation convolution and improved dilation convolution, respectively. The curved black arrows within the orange rectangle indicate the dense block. The C within the circle represents the concatenate function. The gray, orange, green, black, purple and blue squares represent the convolution function, maximum pooling function, improved dilation convolution layer, attention gate layer, upsampling layer and transition layer, respectively, and $f_{en}$, $f_{upn}$ and $f_{idc}$ describe the features at the n-th layer of the encoder, decoder and IDC module, respectively.

It was presumed that the learning ability of a network gradually improves as more layers are added. During the training of deep networks, however, the gradient information that is helpful for generalization may vanish or grow excessively; in the literature, this is referred to as the vanishing or exploding gradient problem. As the network begins to converge, the vanishing gradient causes the network to saturate, resulting in a sharp decline in performance. Therefore, Zhuang et al. (2019a) introduced the residual units proposed by He et al. (2016) into the UNET structure to avoid performance degradation during training. The residual learning correction scheme is described in Eq. (1):

$$ y = G(x, \{F_i\}) + x \tag{1} $$

Here $x$ and $y$ are the input and output vectors of the residual block, and $F_i$ is the weight of the corresponding layer. The function $G(x, \{F_i\})$ is a residual that, when added to $x$, avoids the vanishing gradient problem and enables efficient learning.

From Eq. (1), the summation of $G(x, \{F_i\})$ and $x$ in Res-Net (He et al., 2016) avoids the vanishing gradient problem, but forwarding the gradient information alone to the succeeding layers may hinder the information flow in the network, and recent work by Huang et al. (2016) illustrated that Res-Nets discard features randomly during training. Moreover, Res-Nets include a large number of parameters, which increases the training time. To solve this problem, Huang et al. (2017) proposed the dense network (shown in Fig. 3), which directly connects all layers and thus obtains all features of the previous layers without convolution.

The dense network is mainly composed of convolution layers, a pooling function, multiple dense blocks, and transition layers. Consider a network with $L$ layers in which each layer implements a nonlinear transformation $H_i$. Let $x_0$ represent the input image, $i$ the layer index, and $x_{i-1}$ the output of layer $i-1$. $H_i$ can be a composite operation, such as batch normalization (BN), rectified linear function (RELU), pooling, or convolution functions. Generally, the output of a traditional network at layer $i$ is as follows:

$$ x_i = H_i(x_{i-1}) \tag{2} $$

For the residual network, only the identity function from the previous layer is added:

$$ x_i = H_i(x_{i-1}) + x_{i-1} \tag{3} $$

For a dense network, the feature maps $x_0, x_1, \ldots, x_{i-1}$ of all layers before layer $i$ are directly connected, which is represented by Eq. (4):

where $[x_0, x_1, \ldots, x_{i-1}]$ denotes the concatenation of the feature maps and × represents the multiplication operation. Figure 4 shows the forward connection mechanism of the dense network, in which each layer is connected directly to all previous layers. Generally, a dense network is composed of several dense blocks and transition layers. Here we use only two dense blocks and transition layers to form simple dense networks. The dense block is expressed by Eq. (5):

where $[x_0, x_1, \ldots, x_{i-1}]$ denotes the concatenation of the feature maps and $\beta_i$ is the weight of the corresponding layer. In the ADID-UNET model proposed in this paper, the feature $f_4$ (refer to Fig. 2) is fed to the transition layer, which is mainly composed of BN, RELU, and an average pooling operation. The feature is then batch normalized and rectified before being convolved with a 1 × 1 kernel. The filtered outputs go through the same operations again and are convolved with a 3 × 3 kernel before being concatenated with the input feature $f_4$. The detailed structure of the two dense blocks and transition layers used in the encoder is shown in Fig. 5A. Here $w$ and $h$ correspond to the width and height of the input, respectively, and $b$ represents the number of channels. In addition, $s$ represents the stride of the pooling operation and $n$ the number of filtering operations performed by each layer. In our model, $n$ takes the values 32, 64, 128, 256, and 512. Note that the output of the first dense layer is the aggregated result of four convolution operations (4 × n), which emphasizes feature learning by reducing the loss of features. In the decoder, to restore the resolution of the predicted segmentation, a traditional upsampling layer of the UNET (Ronneberger, Fischer & Brox, 2015) is used instead of the transition layer. The detailed structure is shown in Fig. 5B. For the proposed network, we use only two dense networks, mainly because (a) this reduces the computation cost and (b) experiments with different numbers of dense networks suggest that two are sufficient, since the segmentation results were accurate and comparable to the ground truth. Figure 6 and Table 2 present qualitative and quantitative comparisons for different numbers of dense networks in the encoder-decoder framework. The results in Fig. 6 and Table 2 show that using two dense networks has a clear effect and yields accurate segmentation of the infected areas, as can be inferred directly from the qualitative and quantitative metrics.

Moreover, given its high accuracy and good Dice coefficient, using two dense networks is the best choice for the encoder-decoder pipeline. Using two dense networks in place of traditional convolutions or residual networks also enables global feature propagation, encourages feature reuse, and solves the vanishing gradient problem associated with deep networks, thereby significantly improving the segmentation outcomes.
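The difference between plain, residual, and dense connectivity described above can be illustrated with a small sketch. It is not the authors' implementation: `toy_layer` is a hypothetical stand-in for the composite operation $H_i$, each "channel" is reduced to a single number, and the dense forward pass follows the concatenation scheme of Huang et al. (2017), in which every layer receives the outputs of all earlier layers.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def toy_layer(inputs, growth, weight):
    """Hypothetical stand-in for H_i (e.g. BN -> RELU -> convolution).

    Maps a list of input channels to `growth` new channels. Each channel is
    a single float so that the channel bookkeeping stays visible.
    """
    mean = sum(inputs) / len(inputs)
    return [relu(weight * mean + 0.1 * k) for k in range(growth)]


def plain_forward(x0, num_layers, growth, weight=0.5):
    # Traditional network: each layer sees only the previous output.
    x = x0
    for _ in range(num_layers):
        x = toy_layer(x, growth, weight)
    return x


def residual_forward(x0, num_layers, weight=0.5):
    # Residual network: the identity of the previous output is added,
    # so the channel count must stay the same from layer to layer.
    x = x0
    for _ in range(num_layers):
        h = toy_layer(x, len(x), weight)
        x = [a + b for a, b in zip(h, x)]
    return x


def dense_block_forward(x0, num_layers, growth, weight=0.5):
    # Dense block: each layer sees the concatenation of the block input
    # and the outputs of all earlier layers.
    features = [x0]
    for _ in range(num_layers):
        concatenated = [c for f in features for c in f]
        features.append(toy_layer(concatenated, growth, weight))
    return [c for f in features for c in f]


if __name__ == "__main__":
    x0 = [1.0, 2.0]
    print(len(plain_forward(x0, 4, 3)))        # channels after 4 plain layers
    print(len(residual_forward(x0, 4)))        # channels after 4 residual layers
    print(len(dense_block_forward(x0, 4, 3)))  # input channels + 4 * growth
```

In the dense variant the input channels survive unchanged at the front of the output, which is the feature reuse the text refers to; the price is that the channel count grows linearly with the number of layers, which is one reason a transition layer usually follows a dense block.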

Since the encoder pipeline of the UNET structure is analogous to a traditional CNN architecture, the pooling operations at each layer propagate only the maximum or the average characteristics of the extracted features; connecting the encoder outputs directly to the decoder therefore limits the segmentation accuracy of the network. The RDA-UNET proposed by Zhuang et al. (2019a) used a dilation convolution (DC) module between the encoder and decoder pipelines to increase the receptive field and learn the boundary information more accurately. The DC module is also used in many UNET variants (Chen et al., 2019; Yu & Koltun, 2015) to enlarge the receptive field; hence, we adopt the DC module and introduce additional improvements to it.

Equation (6) describes the DC operation between the input image $f(x, y)$ and the kernel $g(i, j)$.

where $\alpha$ is the RELU function, $k$ is a bias unit, $(i, j)$ and $(x, y)$ denote the coordinates of the kernel and of the input image, respectively, and $r$ is the dilation rate that controls the size of the receptive field. The size of the receptive field obtained can be expressed as follows:

where $k_{fsize}$ is the convolution kernel size, $r$ is the dilation rate, and $N$ is the size of the receptive field, as shown in Fig. 7.

Based on our experimental analysis, the DC module has a pronounced effect in extracting information for larger objects or lesions. Considering that most early ground-glass opacity (GGO) or late lung consolidation lesions cover small areas, we present an improved dilation convolution (IDC) module between the encoder and decoder to segment smaller regions accurately. Figure 8 illustrates the IDC module, which consists of several convolution functions with different dilation rates and rectified linear functions (RELU). Our improvements are as follows: (a) we combine single-strided convolution operations with dilated convolutions with dilation rates of 2, 4, 8, and 16. This combination helps extract features from both smaller and larger receptive fields, thus assisting in the isolation of the small infected COVID-19 regions seen in lung CT scans; and (b) following the idea of the dense network (Huang et al., 2017), we concatenate the input of the IDC module to its output and use the information in the input features to further enhance feature learning. The input of the IDC module is the coarse segmentation region obtained by the encoder. Combining the original segmentation region features with the accurate features extracted by the IDC module not only avoids the loss of useful information but also provides accurate input for the decoding pipeline, which helps improve the segmentation accuracy of the model. As the inputs advance (left to right in Fig. 8), they are convolved with the 3 × 3 kernels of the convolution layers, and the dilation rates of the IDC are 2, 4, 8, and 16, respectively. From comparative experiments with the traditional DC model (with the same dilation rates for both models), we find that the computational cost and computation time required by the IDC module are lower than those of the DC module, as shown in Table 3. Figure 9 and Table 4 show that using layers with convolution and smaller dilation rates at the end, along with the others, ensures the cumulative extraction of features from both smaller and larger receptive fields, thus assisting in the isolation of the small infected COVID-19 regions seen in lung CT scans. Also, the performance scores, specifically the Dice coefficient, are higher (by about 3%) for DID-UNET than for DD-UNET. In summary, the IDC model connected between the encoder and decoder reduces the loss of the original features and additionally expands the field of the segmented areas, thereby improving the overall segmentation effect.
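To make the role of the dilation rate concrete, the following sketch implements a single-channel dilated convolution followed by RELU, together with two receptive-field helpers. It is an illustration under assumptions rather than the paper's module: the function names are hypothetical, `effective_kernel_size` uses the standard relation k + (k - 1)(r - 1) for a single dilated kernel, and `stacked_receptive_field` assumes stride-1 layers applied one after another.

```python
def dilated_conv2d(image, kernel, rate, bias=0.0):
    """'Valid' 2D dilated convolution followed by RELU.

    image and kernel are lists of rows; rate is the dilation rate r.
    As in most deep learning libraries, the kernel is not flipped
    (the operation is technically a cross-correlation).
    """
    k = len(kernel)
    span = (k - 1) * rate + 1            # extent covered by the dilated kernel
    out_h = len(image) - span + 1
    out_w = len(image[0]) - span + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = bias
            for i in range(k):
                for j in range(k):
                    acc += image[y + i * rate][x + j * rate] * kernel[i][j]
            row.append(max(acc, 0.0))    # RELU
        out.append(row)
    return out


def effective_kernel_size(kernel_size, rate):
    """Extent of one dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def stacked_receptive_field(kernel_size, rates):
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    field = 1
    for r in rates:
        field += (kernel_size - 1) * r
    return field


if __name__ == "__main__":
    ones = [[1.0] * 5 for _ in range(5)]
    k3 = [[1.0] * 3 for _ in range(3)]
    print(dilated_conv2d(ones, k3, rate=2))
    print(effective_kernel_size(3, 2))
    print(stacked_receptive_field(3, [2, 4, 8, 16]))
```

With a 3 × 3 kernel, a dilation rate of 2 already covers a 5 × 5 window using only nine weights, which is why dilation enlarges the receptive field without adding parameters. Whether the IDC layers are strictly sequential is an assumption based on the left-to-right description of Fig. 8.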

Although the improved dilation convolution improves the feature learning ability of the network, the loss of spatial information in the feature maps at the end of the encoder makes it difficult for the network to reduce false predictions for (a) small COVID-19 infected regions and (b) areas with blurry edges and poor contrast between the lesion and the background. To solve this problem, we introduce the attention gate (AG) mechanism shown in Fig. 10 into our model instead of simple cropping and copying. The AG model computes the attention coefficient $\sigma \in [0, 1]$ based on Eq. (8):

where $n$ and $m$ represent the feature maps input to the AG module from the decoder and encoder pipelines, respectively; $p_m$, $p_n$, $p_i$, and $p_k$ are convolution kernels of size 1 × 1; and $b_{m,n}$, $b_{int}$, and $b_k$ are the bias units. $\varepsilon_1$ and $\varepsilon_2$ denote the RELU and sigmoid activation functions, respectively; $\varepsilon_2$ limits the output to the range between 0 and 1. Finally, the attention coefficient $\sigma$ is multiplied by the input feature map $f_i$ to produce the output $g_o$, as shown in Eq. (10):

From Fig. 11 and Table 5, the inclusion of the AG module improved the performance of the network (ADID-UNET), with a segmentation accuracy of almost 97%. Therefore, by introducing the AG model, the network makes full use of the output feature information of the encoder and decoder, which greatly reduces the probability of
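As a rough illustration of how an attention gate weights encoder features, the sketch below follows the additive attention gate of Schlemper et al. (2019) for a single pixel: two 1 × 1 convolutions (which act as per-pixel linear maps), a RELU, a further 1 × 1 projection to a scalar, and a sigmoid. The parameter names (`P_m`, `P_n`, `b_mn`, `p_int`, `b_int`) are hypothetical and only loosely mirror the symbols in the text; the exact form used in ADID-UNET may differ.

```python
import math


def matvec(weights, vector):
    """Apply a 1 x 1 convolution at one pixel, i.e. a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def attention_coefficient(m, n, P_m, P_n, b_mn, p_int, b_int):
    """Additive attention for one pixel.

    m: encoder feature vector, n: decoder (gating) feature vector.
    Returns a scalar in (0, 1).
    """
    hidden = [max(a + b + c, 0.0)                      # RELU
              for a, b, c in zip(matvec(P_m, m), matvec(P_n, n), b_mn)]
    score = sum(w * h for w, h in zip(p_int, hidden)) + b_int
    return 1.0 / (1.0 + math.exp(-score))              # sigmoid


def attention_gate(enc_pixels, dec_pixels, params):
    """Scale each encoder pixel by its attention coefficient."""
    gated = []
    for m, n in zip(enc_pixels, dec_pixels):
        coeff = attention_coefficient(m, n, *params)
        gated.append([coeff * v for v in m])
    return gated


if __name__ == "__main__":
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    params = (zeros, zeros, [0.0, 0.0], [1.0, 1.0], 0.0)
    # With all-zero weights the score is 0, so every pixel is scaled by 0.5.
    print(attention_gate([[2.0, 4.0]], [[1.0, 1.0]], params))
```

Pixels whose coefficient is close to 0 are effectively removed from the skip connection, which is how the gate suppresses background regions before the encoder features are merged into the decoder.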

Organizing a COVID-19 segmentation dataset is time-consuming, and hence few CT scan segmentation datasets are available. At the time of writing, there was only one standard dataset, namely the COVID-19 segmentation dataset (MedSeg, 2020), which was composed of 100 axial CT scans from different COVID-19 patients. All CT scans were segmented by radiologists associated with the Italian Association of medicine and interventional radiology. Since the database was updated regularly, another segmented CT scan dataset with segmentation labels from Radiopaedia was added on April 13, 2020. The whole dataset, which contained both positive and negative slices (373 of the 829 slices were evaluated by a radiologist as positive and segmented), was selected for training and testing the proposed model. The dataset, consisting of 1,838 images with annotated ground truth, was randomly divided into 1,318 training samples, 320 validation samples, and 200 test samples. Since the number of training images is small, we expand the training dataset: we first merge each COVID-19 lung CT scan with its ground truth and then perform six affine transformations, as described in Krizhevsky, Sutskever & Hinton (2012). The transformed image is then separated from the transformed ground truth and added to the training dataset as an additional training image. In this way, the 1,318 images of the training dataset are expanded to 9,226 training images. Figure 12 illustrates the data expansion process.
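The expansion step can be sketched as follows. The text does not list which six affine transformations were used, so the flips, rotations, and transpose below are assumptions chosen for illustration; the essential point is that the same transformation is applied to a scan and to its ground truth mask so that the two stay aligned. Keeping each original pair and adding six transformed copies gives seven images per training sample, which matches 1,318 × 7 = 9,226.

```python
def hflip(a):
    return [row[::-1] for row in a]


def vflip(a):
    return [row[:] for row in a[::-1]]


def transpose(a):
    return [list(col) for col in zip(*a)]


def rot90(a):
    # Rotate 90 degrees clockwise.
    return transpose(vflip(a))


def rot180(a):
    return hflip(vflip(a))


def rot270(a):
    # Rotate 90 degrees counterclockwise.
    return transpose(hflip(a))


# Hypothetical choice of six transformations (not specified in the text).
TRANSFORMS = [hflip, vflip, rot90, rot180, rot270, transpose]


def augment(pairs):
    """Return each (scan, mask) pair plus six jointly transformed copies."""
    out = []
    for scan, mask in pairs:
        out.append((scan, mask))
        for transform in TRANSFORMS:
            out.append((transform(scan), transform(mask)))
    return out


if __name__ == "__main__":
    scan = [[1, 2], [3, 4]]
    mask = [[0, 1], [0, 0]]          # lesion at the pixel with intensity 2
    for s, m in augment([(scan, mask)]):
        print(s, m)
    print(1318 * (1 + len(TRANSFORMS)))
```

Applying one deterministic transformation to the scan and the mask separately is equivalent to merging them, transforming the merged image, and splitting it again, as described in the text.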

The commonly used evaluation indicators for segmentation, namely accuracy (ACC), precision ($P_c$), Dice coefficient (DC), area under the curve (AUC), sensitivity ($S_{en}$), specificity ($S_p$) and F1 score (F1), were used to evaluate the performance of the model. These performance indicators are calculated as follows:

For computing accuracy, precision, sensitivity, specificity, and F1 score, we generate the confusion matrix, where the definitions of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are shown in Table 6.

(1) Accuracy (ACC): The ratio of the number of correctly predicted pixels to the total number of pixels in the image.

$$ ACC = \frac{TP + TN}{TP + TN + FP + FN} $$

(2) Precision ($P_c$): The ratio of the number of correctly predicted lesion pixels to the total number of predicted lesion pixels.

$$ P_c = \frac{TP}{TP + FP} $$

(3) Sensitivity ($S_{en}$): The ratio of the number of correctly predicted lesion pixels to the total number of actual lesion pixels.

$$ S_{en} = \frac{TP}{TP + FN} $$

(4) F1 score (F1): A measure of balanced accuracy obtained from a combination of precision and sensitivity.

$$ F1 = \frac{2 \, P_c \, S_{en}}{P_c + S_{en}} $$

(5) Specificity ($S_p$): The ratio of the number of correctly predicted non-lesion pixels to the total number of actual non-lesion pixels.

$$ S_p = \frac{TN}{TN + FP} $$

Table 5 The quantitative results of the comparison with or without the AG model experiment. AG-UNET: the addition of the AG module to UNET. DA-UNET: adding two dense networks and the AG module to the network without including the IDC module. IDA-UNET: adding the IDC and AG modules to the UNET without adding dense networks. ADID-UNET: the dense network, IDC and AG modules are all added to the network.

(6) Dice coefficient (DC): Represents the similarity between the model's segmentation output (Y) and the ground truth (X). The higher the similarity between the predicted lesion and the ground truth, the larger the Dice coefficient and the better the segmentation. The Dice coefficient is calculated as follows:

$$DC = \frac{2|X \cap Y|}{|X| + |Y|}$$

We also use the Dice coefficient (Dice, 1945) loss (dice_loss) as the training loss of the model, calculated as follows:

$$dice\_loss = 1 - \frac{2|X \cap Y|}{|X| + |Y|}$$

(7) The area under the curve (AUC): AUC is the area under the receiver operating characteristic (ROC) curve. It measures separability, that is, how well the model distinguishes between the classes. The higher the AUC, the better the segmentation output and hence the model.
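The indicators (1) to (7) can be computed directly from a pair of masks. The sketch below is one possible NumPy formulation, not the authors' evaluation code: the function names are hypothetical, returning 0.0 for an empty denominator is a convention chosen here, the `eps` smoothing term in `dice_loss` is an assumption, and `auc` uses the pairwise-ranking definition of the area under the ROC curve rather than integrating the curve.

```python
import numpy as np


def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for binary masks (1 = lesion, 0 = non-lesion)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, tn, fp, fn


def _ratio(num, den):
    # Convention chosen for this sketch: an empty denominator gives 0.0.
    return num / den if den else 0.0


def segmentation_metrics(pred, truth):
    """Accuracy, precision, sensitivity, specificity, F1 and Dice."""
    tp, tn, fp, fn = confusion_counts(pred, truth)
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        # 2|X n Y| / (|X| + |Y|) with |X| = TP + FN and |Y| = TP + FP.
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def dice_loss(prob, truth, eps=1e-7):
    """1 - Dice on (soft) probability maps; eps is an assumed smoothing term."""
    prob = np.asarray(prob, dtype=float)
    truth = np.asarray(truth, dtype=float)
    intersection = np.sum(prob * truth)
    dice = (2.0 * intersection + eps) / (np.sum(prob) + np.sum(truth) + eps)
    return float(1.0 - dice)


def auc(scores, truth):
    """Probability that a lesion pixel scores above a non-lesion pixel."""
    scores = np.asarray(scores, dtype=float).ravel()
    truth = np.asarray(truth, dtype=bool).ravel()
    pos, neg = scores[truth], scores[~truth]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one pixel of each class")
    # Pairwise comparison; quadratic in memory, so only for small inputs.
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (pos.size * neg.size))
```

The pairwise `auc` is meant for illustration on small arrays; for full-resolution scans a rank-based or library implementation would be the practical choice.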

In addition to the above widely used indicators, we also use the structural metric ($S_m$) (Fan et al., 2017), the enhanced alignment metric ($E_\alpha$) (Fan et al., 2018) and the mean absolute error (MAE) (Fan et al., 2020; Elharrouss et al., 2020) to measure the similarity of the segmentation to the ground truth.

(8) Structural metric ($S_m$): Measures the structural similarity between the prediction map and the ground truth segmentation mask; it is more in line with the human visual system than the Dice coefficient.

$$S_m = (1 - \beta) \times S_{os}(S_{op}, S_{gt}) + \beta \times S_{or}(S_{op}, S_{gt})$$

where $S_{os}$ stands for target perception similarity, $S_{or}$ stands for regional perceptual similarity, and $\beta = 0.5$ is a balance factor between $S_{os}$ and $S_{or}$. $S_{op}$ stands for the final prediction result and $S_{gt}$ represents the ground truth.

(9) Enhanced alignment metric ($E_\alpha$): Evaluates the local and global similarity between two binary maps, computed based on Eq. (19):

where $w$ and $h$ are the width and height of the ground truth $S_{gt}$, and $(i, j)$ denotes the coordinates of each pixel in $S_{gt}$. $\alpha$ represents the enhanced alignment matrix:

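(10) Mean Absolute Error (MAE): Measures the pixel-wise difference between $S_{op}$ and $S_{gt}$, defined as:

$$MAE = \frac{1}{w \times h} \sum_{i=1}^{w} \sum_{j=1}^{h} \left| S_{op}(i, j) - S_{gt}(i, j) \right|$$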
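Of the three additional measures, MAE is the simplest to reproduce, since it is the mean of the absolute pixel-wise differences between the prediction map and the ground truth. The following is a minimal NumPy sketch under that reading; the function name is hypothetical and both inputs are assumed to be scaled to [0, 1].

```python
import numpy as np


def mean_absolute_error(pred_map, truth_mask):
    """Mean pixel-wise |prediction - ground truth| over the whole image."""
    pred_map = np.asarray(pred_map, dtype=float)
    truth_mask = np.asarray(truth_mask, dtype=float)
    if pred_map.shape != truth_mask.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    # np.mean divides by the number of pixels, i.e. width times height.
    return float(np.mean(np.abs(pred_map - truth_mask)))


if __name__ == "__main__":
    prediction = [[0.9, 0.1], [0.4, 0.0]]
    ground_truth = [[1, 0], [1, 0]]
    print(mean_absolute_error(prediction, ground_truth))
```

Unlike the overlap-based indicators, lower is better here: a perfect prediction gives an MAE of 0.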

The ADID-UNET proposed in this paper is implemented in the Keras framework and is trained and tested on a workstation with an NVIDIA P5000 GPU. During training, we set the learning rate to $lr = 1 \times 10^{-3}$ and selected Adam as the optimizer. The 9,226 training samples, 320 validation samples, and 200 test samples were resized to 128 × 128, and the model was trained with a batch size of 32 for 300 epochs. Figures 13 and 14 show the performance curves obtained for the proposed ADID-UNET during training, validation, and testing.
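The training settings above imply a schedule that is easy to sanity-check. The sketch below is plain arithmetic on the stated values; it assumes (the text does not say) that the final, smaller batch of each epoch is kept rather than dropped, and the constant names are illustrative.

```python
import math

# Values stated in the text.
TRAIN_SAMPLES = 9226
BATCH_SIZE = 32
EPOCHS = 300
LEARNING_RATE = 1e-3     # Adam learning rate
INPUT_SIZE = (128, 128)  # every image is resized to this shape

# ASSUMPTION: the last, partial batch of each epoch is kept, not dropped.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
last_batch_size = TRAIN_SAMPLES - (steps_per_epoch - 1) * BATCH_SIZE
total_updates = steps_per_epoch * EPOCHS

print(f"steps per epoch: {steps_per_epoch}")
print(f"size of last batch: {last_batch_size}")
print(f"optimizer updates over {EPOCHS} epochs: {total_updates}")
```

Under that assumption each epoch has 289 steps, the last of which holds 10 images, for 86,700 parameter updates over the full run.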

To evaluate the ADID-UNET model, we used 200 pairs of COVID-19 lung infection CT scans as test data; the segmentation results are shown in Fig. 15. Figure 15 shows that the ADID-UNET model accurately segments the COVID-19 lung infection areas in the CT scans, especially the smaller infected areas, and that the segmentation results are very close to the ground truth. ADID-UNET also segments complicated infection areas accurately, covering both single infection areas and more complex, unevenly distributed ones. In short, the ADID-UNET model effectively and accurately segments COVID-19 lung infection areas of different sizes and uneven distribution, and the segmentation is visually very close to the gold standard.

We also compare the proposed model with other state-of-the-art segmentation models. From the results (Figs. A1 and A2 and Table 7), we infer that the ADID-UNET model produces segmentation outputs closer to the ground truth. In contrast, the FCN8s network (Long, Shelhamer & Darrell, 2015) produces more under- and over-segmented regions, while RAD-UNET (Zhuang et al., 2019a) produces comparable segmentation results but is less effective for smaller segments. The visual results in Figs. A1 and A2 show that ADID-UNET segments the COVID-19 lung infection regions more accurately than the other state-of-the-art models, with results close to the ground truth, which supports the efficacy of the proposed model.

Table 7 presents the performance scores for the indicators described in "Experiment Results". For ADID-UNET, the Dice coefficient, precision, F1 score, specificity and AUC are 80.31%, 84.76%, 82.00%, 99.66% and 95.51%, respectively. Most of the performance indexes are above 0.8, with the highest segmentation accuracy of 97.01%. These results indicate that the proposed model produces segmentation outputs close to the ground truth annotations.

The proposed model is an improved version of the UNET model, obtained by adding the dense network, IDC and attention gate modules to the existing UNET (Ronneberger, Fischer & Brox, 2015) structure. The effectiveness of these additions was experimentally verified in "Methods". To summarize the effect of adding each module to the UNET architecture, Table 8 tabulates the improvement at each stage. Table 8 shows that adding these components to the UNET (Ronneberger, Fischer & Brox, 2015) structure improves the overall segmentation accuracy of the network. For example, with the inclusion of the dense networks (D-UNET), the Dice coefficient (DC) and AUC reached 79.98% and 93.47%, respectively. The inclusion of the IDC improved the scores further (DID-UNET). Finally, the proposed model with the dense network, IDC and AG modules (ADID-UNET) achieved the best performance scores, with improvements of 0.05%, 0.33%, 2.29%, 2.04% and 1.09% in accuracy, DC, precision, AUC and structural metric, respectively, over the traditional UNET architecture. Furthermore, Figs. A1 and A2 show that ADID-UNET visually outperforms other well-known segmentation models. Specifically, ADID-UNET can segment relatively small infected regions, which is of great significance for accurate clinical diagnosis of the location of COVID-19 infection. The use of (a) dense

This paper proposes a new variant of the UNET (Ronneberger, Fischer & Brox, 2015) architecture to accurately segment COVID-19 lung infections in CT scans. The model, ADID-UNET, combines dense networks, improved dilation convolution, and attention gates, which together give it strong feature extraction and segmentation capabilities. The experimental results show that ADID-UNET is effective in segmenting small infection regions, with accuracy, precision and F1 score of 97.01%, 84.76%, and 82.00%, respectively. The segmentation results of the ADID-UNET network can aid clinicians in faster screening and in quantifying lesion areas, and can improve the overall diagnosis of COVID-19 lung infection.

The abbreviations used in this paper are listed in Table A1.

Table 8 The quantitative results showing percentage improvements of the model after adding additional components to the UNET (Ronneberger, Fischer & Brox, 2015) structure. D-UNET denotes dense networks added to the UNET structure, DID-UNET denotes dense networks and improved dilation convolution added to the UNET structure, and ADID-UNET refers to the proposed model, with dense networks, improved dilation convolution and attention gate modules added to the UNET structure. ↑ indicates that the performance index is higher than that of the UNET structure; ↓ indicates that it is lower.

Figure A2 For each row, (D) denotes the ground truth, and (E-K) illustrate the segmentation results from FCN8s (Long, Shelhamer & Darrell, 2015), UNET (Ronneberger, Fischer & Brox, 2015), Segnet (Badrinarayanan, Kendall & Cipolla, 2017), Squeeze UNET (Iandola et al., 2016), Residual UNET (Alom et al., 2018), RAD UNET (Zhuang et al., 2019a) and ADID-UNET, respectively. DOI: 10.7717/peerj-cs.349/fig-A2

Weakly-supervised lesion segmentation on CT scans using co-segmentation

Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases

Development of a deep-learning-based method for breast ultrasound image segmentation

Recurrent residual convolutional neural network based on UNET (R2U-Net) for medical image segmentation

Lung pattern classification for interstitial lung diseases using a deep convolutional neural network

SegNet: a deep convolutional encoder-decoder architecture for image segmentation

Semantic object classes in video: a high-definition ground truth database

An integrated micro- and macroarchitectural analysis of the Drosophila brain by computer-assisted serial section electron microscopy

Quantification of tomographic patterns associated with COVID-19 from chest CT

Residual attention UNET for automated multi-class segmentation of COVID-19 chest CT images

Environmental sound classification with dilated convolutions

CT imaging features of 2019 novel coronavirus (2019-nCoV)

Automated classification of usual interstitial pneumonia using regional volumetric texture analysis in high-resolution computed tomography

Measures of the amount of ecologic association between species

An encoder-decoder-based method for COVID-19 lung infection segmentation

Structure-measure: a new way to evaluate foreground maps

Enhanced-alignment measure for binary foreground map evaluation

Inf-Net: Automatic COVID-19 lung infection segmentation from CT scans

Sensitivity of chest CT for COVID-19: comparison to RT-PCR

Robust loss functions under label noise for deep neural networks

Unsupervised CT lung image segmentation of a Mycobacterium tuberculosis infection model

Improved UNET network for pulmonary nodules segmentation

Deep residual learning for image recognition

Densely connected convolutional networks

Deep networks with stochastic depth

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size

Multiple resolution residually connected feature streams for automatic lung tumor segmentation from CT images

Very deep convolutional networks for large-scale image recognition

Identifying medical diagnoses and treatable diseases by image-based deep learning

Learning tree-structured representation for 3D coronary artery segmentation

ImageNet classification with deep convolutional neural networks

Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy

Fully convolutional networks for semantic segmentation

COVID-19 CT segmentation dataset

Segmentation dataset, COVID-19 CT

Object co-segmentation using deep siamese network

RDA-UNET-WGAN: an accurate breast ultrasound lesion segmentation using Wasserstein generative adversarial networks

Attention UNET: Learning where to look for the pancreas

Hybrid automatic lung segmentation on chest CT scans

Visualization and interpretation of convolutional neural network predictions in detecting pneumonia in pediatric chest radiographs

UNET: convolutional networks for biomedical image segmentation

Attention gated networks: learning to leverage salient regions in medical images

Lung infection quantification of COVID-19 in CT images with deep learning

Large-scale screening of COVID-19 from community acquired pneumonia using infection size-aware classification

A deep learning algorithm using CT images to screen for corona virus disease (COVID-19)

Severity assessment of coronavirus disease 2019 (COVID-19) using quantitative features from chest CT images

A novel coronavirus outbreak of global health concern

A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images

COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-Ray images

Central focused convolutional neural networks: developing a data-driven model for lung nodule segmentation

W-Net: A deep model for fully unsupervised image segmentation

Deep learning system to screen coronavirus disease 2019 pneumonia

Medical breast ultrasound image segmentation by machine learning

COVID-19 chest CT image segmentation-a deep convolutional neural network solution

Precise diagnosis of intracranial hemorrhage and subtypes using a three-dimensional joint convolutional and recurrent neural network

Graph cut-based automatic segmentation of lung nodules using shape, intensity, and spatial features

Shape-based computer-aided detection of lung nodules in thoracic CT images

Chest CT manifestations of new coronavirus disease 2019 (COVID-19): a pictorial review

Multi-scale context aggregation by dilated convolutions

Generalized cross entropy loss for training deep neural networks with noisy labels

An automatic COVID-19 CT segmentation based on UNET with attention mechanism

An RDAU-NET model for lesion segmentation in breast ultrasound images

Nipple segmentation and localization using modified UNET on breast ultrasound images

We are very grateful to the Italian Society of Medical and Interventional Radiology, Radiopaedia, and Ma et al. (2020) for providing the COVID-19 CT scan segmentation database.

The authors declare that they have no competing interests.

Alex Noel Joseph Raj conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Haipeng Zhu conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

The following information was supplied regarding data availability: Code and data are available in the Supplemental Files.

Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.349#supplemental-information.