D2A U-Net: Automatic Segmentation of COVID-19 Lesions from CT Slices with Dilated Convolution and Dual Attention Mechanism
Xiangyu Zhao, Peng Zhang, Fan Song, Guangda Fan, Yangyang Sun, Yujia Wang, Zheyuan Tian, Luqi Zhang, Guanglei Zhang
Date: 2021-02-10

Coronavirus Disease 2019 (COVID-19) has caused great casualties and has become one of the most urgent public health events worldwide. Computed tomography (CT) is a significant screening tool for COVID-19 infection, and automated segmentation of lung infection in COVID-19 CT images will greatly assist the diagnosis and health care of patients. However, accurate and automatic segmentation of COVID-19 lung infections remains challenging. In this paper we propose a dilated dual attention U-Net (D2A U-Net) for COVID-19 lesion segmentation in CT slices, based on dilated convolution and a novel dual attention mechanism, to address the issues above. We introduce a dilated convolution module into the model decoder to achieve a large receptive field, which refines the decoding process and contributes to segmentation accuracy. We also present a dual attention mechanism composed of two attention modules, inserted into the skip connections and the model decoder respectively. The dual attention mechanism is utilized to refine feature maps and reduce the semantic gap between different levels of the model. The proposed method has been evaluated on an open-source dataset and outperforms cutting-edge methods in semantic segmentation. Our proposed D2A U-Net with a pretrained encoder achieves a Dice score of 0.7298 and a recall score of 0.7071. We also build a simplified D2A U-Net without a pretrained encoder to provide a fair comparison with other models trained from scratch; it still outperforms popular U-Net family models with a Dice score of 0.7047 and a recall score of 0.6626. Our experimental results show that by introducing dilated convolution and the dual attention mechanism, the number of false positives is significantly reduced, which improves sensitivity to COVID-19 lesions and subsequently brings a significant increase in Dice score.

The COVID-19 pandemic caused by SARS-CoV-2 continues to spread all over the world [24], and most countries have been affected by this unprecedented public health event. By August 2020, more than 23.75 million cases of COVID-19 had been reported and more than 810,000 people had died [2] of COVID-19 infection. Due to the strong infectivity of SARS-CoV-2, identification of people infected with COVID-19 is critical to cutting off transmission and slowing the spread of the virus. Reverse transcriptase-polymerase chain reaction (RT-PCR) is considered the gold standard of diagnosis [29] for its high specificity, but it is time-consuming and laborious. Moreover, the capacity for RT-PCR testing can be rather insufficient in less-developed regions, especially during the pandemic. Computed tomography (CT) imaging is one of the most commonly used screening methods to detect lung infection and has proved efficient in the diagnosis and follow-up prognosis of COVID-19. Compared with chest X-ray images, CT imaging is more sensitive, especially in the early stage of infection.
Ground glass pattern is the most common finding in COVID-19 infections, usually in the early stage, while pulmonary consolidation typically appears in later stages.

Figure 1: Example of COVID-19 CT slices, where the red, green and blue masks denote ground glass, consolidation and pleural effusion respectively. The images are collected from [1].

Thus, chest CT imaging is regarded as a convenient, fast and accurate approach to diagnosing COVID-19. Evaluating the localization and geometric features of infected areas can provide adequate information on disease progression and help doctors make better treatment decisions [10] [16] [19]. However, manual annotation of infection regions is time-consuming and laborious work. Moreover, annotations made by radiologists can be subjective and biased due to individual experience and personal judgement. Recently, a number of deep learning systems using convolutional neural networks (CNNs) have been proposed to detect COVID-19 infection. For instance, Wang and Wong [27] developed COVID-Net to perform ternary classification between healthy people, COVID-19 patients and people infected with other pneumonia in chest X-ray images, achieving an overall accuracy of 93.3%. In terms of deep learning systems for CT imaging, Zhou and Canu [39] proposed an automatic network facilitated with an attention mechanism to segment infected areas in CT slices. Fan et al. [6] developed Inf-Net and a corresponding semi-supervision algorithm to perform CT segmentation. Zheng et al. [37] proposed a weakly-supervised deep learning method to detect COVID-19 in CT volumes. Xi et al. [18] presented a dual-sampling attention network for diagnosis of COVID-19 from community-acquired pneumonia. However, detecting lung infection caused by COVID-19 in CT images remains challenging. Infection regions vary in shape, position and texture, and their boundaries with normal tissues can be rather blurred, which adds to the difficulty of COVID-19 detection and limits model performance, especially in terms of recall score. To address the issues above, we propose a dilated dual attention U-Net (D2A U-Net) framework to automatically segment lung infection in COVID-19 CT slices. Since infected tissues can be hardly distinguishable from normal tissues, we introduce a dual attention mechanism consisting of a gate attention module (GAM) and a decoder attention module (DAM) to refine feature maps and produce more informative feature representations. The proposed GAM fuses features with semantic-rich gate signals to refine skip connections. The proposed DAM is introduced into the decoder of the network to improve decoding quality and better segment blurred infected tissues. As COVID-19 infections vary in position and size, we utilize dilated convolution with different dilation rates in the model decoder to obtain larger receptive fields and balance segmentation of both large and tiny objects. Such refinement improves segmentation recall and thus provides better segmentation results. The paper is organized as follows: Section 2 offers a review of related works on CT segmentation. Section 3 gives an overview of this work and details our model. Section 4 presents the details of our experiments and provides both quantitative and qualitative segmentation results. Section 5 discusses the proposed method and concludes our work.
In this section, we review four types of closely related work: chest CT segmentation, attention mechanisms, dilated convolution and AI-based COVID-19 segmentation systems. Chest CT imaging is one of the most popular screening methods for lung disease diagnosis [9]. Segmentation of organs and lesions provides crucial information for disease diagnosis and prognosis. However, manual segmentation remains time-consuming and laborious, and subjective error is inevitable; thus automatic CT segmentation has gained much popularity in the research field. Recent research on automatic segmentation mainly focuses on machine learning techniques. Most related works feature a pixel-wise classifier that infers predictions from extracted features. For example, Mansoor et al. [13] proposed a texture-based feature classifier for pathological lung segmentation in CT images. Yao et al. [34] utilized texture analysis and support vector machines to segment infections in lung tissues. These algorithms realize automatic segmentation in chest CT images, but several issues remain unsolved, including subjective bias in feature extraction and difficulties in segmenting nodule regions. Deep learning algorithms feature powerful fitting capacity and require no laborious preprocessing, and most cutting-edge segmentation algorithms are based on deep learning approaches. For example, Shaziya et al. [23] used U-Net to segment lung tissues in chest CT scans. Zhao et al. [36] proposed a fully convolutional neural network with multi-instance and conditional adversary loss for pathological lung segmentation. Attention plays an important role in human perception and visual cognition [5]. One significant property of human perception is that humans hardly process visual information as a whole; instead, they process it recurrently, with top-down information guiding the bottom-up feedforward process [15]. Inspired by this principle, attention mechanisms have been widely used in computer vision, especially in image classification [7] [31] [25]. Related algorithms typically refine feature maps in the spatial dimension, the channel dimension or both. For example, Hu et al. [7] introduced a Squeeze-and-Excitation module, where global average pooling is performed on input features to produce channel-wise attention. Woo et al. [31] proposed a convolutional block attention module (CBAM) that produces a fused attention consisting of channel attention and spatial attention. Wang et al. [25] presented a residual attention network, which contains an attention module featuring an encoder-decoder architecture. Attention mechanisms have also been utilized in semantic segmentation tasks to make more accurate dense predictions. For instance, Li et al. [11] proposed a Pyramid Attention Network to exploit global contextual information in semantic segmentation. These algorithms resemble each other in some respects: certain operations, such as global pooling, convolution and the combination of downsampling and upsampling, are utilized to enhance informative regions in the feature maps and suppress unrelated information, which makes the network learn more generalized visual structures and improves robustness to noisy inputs.
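As a concrete illustration of the channel-wise attention idea described above, the following is a minimal PyTorch sketch of a Squeeze-and-Excitation block in the spirit of [7]; it is not code from this paper, and the class name and reduction ratio are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation sketch: global average pooling squeezes
    each channel to a scalar, and a small MLP produces per-channel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B x C x 1 x 1
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # excitation weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.mlp(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight informative channels, suppress the rest
```

The same squeeze-then-excite pattern recurs, with variations, in CBAM [31] and in the attention modules proposed later in this paper.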
Traditional deep convolutional networks often involve strided convolution or pooling operations to enlarge receptive fields, and input images are downsampled in the process. However, these operations often lead to the loss of global information in dense prediction tasks such as semantic segmentation and object detection. Yu and Koltun [35] introduced dilated convolution to deep networks, which has proved useful in dense prediction. The basic idea of dilated convolution is to insert "holes" (zeros) into convolution kernels to obtain large receptive fields without downsampling. Dilated convolution avoids information loss during downsampling and has been widely used in semantic segmentation tasks [30] [14] [20]. However, it has been observed that simply stacking dilated convolutions in CNNs may cause grid effects and introduce irrelevant long-range information [35], leading to performance deterioration. Wang et al. [28] proposed a hybrid dilated convolution (HDC) framework to avoid grid effects and improve segmentation performance on both large and tiny objects. Artificial intelligence has been widely utilized in the fight against COVID-19; here we mainly focus on AI-based semantic segmentation systems for CT scans. Many works focus on learning robust and noise-insensitive representations from limited or noisy inputs. For example, Xie et al. [33] proposed RTSU-Net for segmenting pulmonary lobes in CT scans, introducing a non-local neural network module that learns both visual and geometric relationships among feature maps to produce self-attention. Wang et al. [26] presented a noise-robust framework for COVID-19 lesion segmentation, utilizing a noise-robust Dice loss and an adaptive self-ensembling strategy to learn from noisy labels. Chen et al. [4] proposed a residual attention U-Net that introduces aggregated residual transformations and a soft attention mechanism to learn robust feature representations. Researchers have also looked into segmentation solutions that achieve both high speed and high accuracy. For example, Zhou et al. [38] developed a rapid, accurate and machine-agnostic segmentation and quantification method for COVID-19 lesions; the innovation of their work lies in the first CT scan simulator for COVID-19 and a novel network architecture that solves the large-scene-small-object problem. Qiu et al. [21] developed a parameter-efficient framework to achieve fast segmentation of COVID-19 lung infection at relatively low computational cost. In this section we go through the details of the proposed D2A U-Net architecture. We first offer an overview of the proposed network, then provide details about the dual attention mechanism and the proposed attention modules, and finally introduce our proposed decoder blocks. Our network is based on the U-Net [22] architecture, which is very popular in medical image segmentation. Compared with the original U-Net, dilated convolution and a novel combination of attention mechanisms are integrated into our framework to obtain better feature representations. As the COVID-19 pandemic broke out rapidly, open-access CT image data with gold-standard annotations are hard to acquire, so using a pretrained encoder in the segmentation model offers better parameter initialization and improves generalization ability. Therefore, in this work we utilize a ResNeXt-50 (32×4d) [32] pretrained on ImageNet as the encoder of our model. Furthermore, we integrate a dual attention mechanism into the model decoder. A gated attention mechanism is inserted inside the skip connections to utilize both high- and low-level feature representations and reduce the semantic gap between encoder and decoder.
We also introduce another fused attention mechanism in the model decoder to refine feature maps after upsampling, and we utilize a hybrid dilated convolution module [28] as the basic block of the decoder to enlarge the receptive field and produce better dense predictions. The network scheme is shown in Fig. 2. We introduce a dual attention mechanism composed of a gate attention module (GAM) and a decoder attention module (DAM) to our network. GAM refines features extracted by the model encoder and reduces the semantic gap by fusing high- and low-level feature maps. DAM is inserted in the model decoder to refine feature representations after upsampling. Feature concatenation from encoder to decoder is the typical topological structure of U-Net, where the combination of high-resolution features in the encoder and upsampled features in the decoder enables better localization of segmentation targets [22]. However, not all visual representations in encoder feature maps contribute to precise segmentation, and the semantic gap between encoder and decoder can limit model performance as well. Thus, we introduce a gate attention module before concatenation to refine features coming from the model encoder and reduce the semantic gap. Oktay et al. [17] proposed an attention gate to refine encoder features with an attention mechanism, but their attention gate implements only spatial attention. We believe that introducing channel attention and spatial attention simultaneously improves the efficiency of the attention mechanism. Thus, inspired by the global attention upsample module proposed in the pyramid attention network [11] and by CBAM [31], we provide a novel design of a gate attention module that enables both channel attention and spatial attention. A detailed scheme of the proposed GAM is shown in Fig. 3. Two feature maps are fed into the attention module: the guiding signal, which is the feature map coming from the model decoder (or the last convolution block of the model encoder), and the feature, which comes from the model encoder to be concatenated with upsampled feature maps. We use $G \in \mathbb{R}^{C' \times H' \times W'}$ to denote the guiding signal and $F \in \mathbb{R}^{C \times H \times W}$ to denote the feature. In a U-Net shaped architecture, $G$ contains deeper and semantically richer information than $F$, encoded in its channel dimension. We utilize global average pooling and a multilayer perceptron (MLP) to create a channel attention map $M_c(G) \in \mathbb{R}^{C \times 1 \times 1}$. The hidden size of the MLP is smaller than its input size, which suppresses irrelevant feature representations in the channel dimension and implements a channel-wise attention mechanism. In short, we compute channel attention as:

$$M_c(G) = \sigma\big(W_1(W_0(\mathrm{GAP}(G)))\big),$$

where $\sigma$ denotes sigmoid activation, $\mathrm{GAP}$ denotes global average pooling, $W_0 \in \mathbb{R}^{C/r \times C'}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the MLP weights, and $r$ denotes the reduction ratio, set to 16 in our experiments. Spatial attention is guided by both the guiding signal and the input feature itself. We use a convolution with a single filter to squeeze the channel dimensions of $G$ and $F$; the reduced feature map from $G$ is then upsampled to match the size of $F$. A combination of convolutions with different kernel sizes is utilized to produce the spatial attention map $M_s(G, F) \in \mathbb{R}^{1 \times H \times W}$. In short, we compute spatial attention as:

$$M_s(G, F) = \sigma\big(f^{3 \times 3}(X) + f^{5 \times 5}(X) + f^{7 \times 7}(X)\big), \quad X = \mathrm{Up}\big(f^{1 \times 1}(G)\big) + f^{1 \times 1}(F),$$

where $\sigma$ denotes sigmoid activation, $f^{3 \times 3}$, $f^{5 \times 5}$ and $f^{7 \times 7}$ denote convolutions with the corresponding kernel sizes, and $f^{1 \times 1}$ is used to squeeze the channel dimension. We then use element-wise multiplication to combine spatial and channel attention into the fused attention $M(G, F)$:

$$M(G, F) = M_c(G) \otimes M_s(G, F).$$
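The following is a minimal PyTorch sketch of the GAM under our reading of the equations above; it is not the authors' released code. In particular, mapping the MLP output to the feature's channel count and combining the two squeezed maps by addition are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GateAttentionModule(nn.Module):
    """Sketch of the GAM: channel attention M_c from the guiding signal g,
    spatial attention M_s from both g and the encoder feature f."""
    def __init__(self, g_channels: int, f_channels: int, reduction: int = 16):
        super().__init__()
        # MLP for channel attention; output size f_channels is an assumption
        self.mlp = nn.Sequential(
            nn.Linear(g_channels, g_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(g_channels // reduction, f_channels),
        )
        # 1x1 convolutions squeeze the channel dimension to 1
        self.squeeze_g = nn.Conv2d(g_channels, 1, kernel_size=1)
        self.squeeze_f = nn.Conv2d(f_channels, 1, kernel_size=1)
        # multi-kernel convolutions for spatial attention
        self.conv3 = nn.Conv2d(1, 1, 3, padding=1)
        self.conv5 = nn.Conv2d(1, 1, 5, padding=2)
        self.conv7 = nn.Conv2d(1, 1, 7, padding=3)

    def forward(self, g: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        b = g.shape[0]
        # channel attention M_c(g): B x C_f x 1 x 1
        mc = torch.sigmoid(self.mlp(g.mean(dim=(2, 3)))).view(b, -1, 1, 1)
        # spatial attention M_s(g, f): B x 1 x H_f x W_f
        sg = F.interpolate(self.squeeze_g(g), size=f.shape[2:],
                           mode="bilinear", align_corners=False)
        x = sg + self.squeeze_f(f)  # fusion by addition is an assumption
        ms = torch.sigmoid(self.conv3(x) + self.conv5(x) + self.conv7(x))
        return f * mc * ms  # fused attention refines the encoder feature
```

The refined feature `f * mc * ms` would then be concatenated with the upsampled decoder feature, as in the standard U-Net skip connection.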
In semantic segmentation, high-resolution visual representations in the encoder need to be upsampled to make dense predictions. Transposed convolution and interpolation are both popular upsampling solutions, but each has drawbacks. Compared with interpolation, transposed convolution is trainable and adds nonlinearity to deep networks, which improves fitting capacity; however, grid effects are hard to avoid if hyperparameters are not configured properly, and this drawback becomes more troublesome when stacking more than one transposed convolution layer. Thus we upsample feature maps with bilinear interpolation followed by convolution. However, as interpolation is not trainable, it inevitably introduces irrelevant information or noise during upsampling. We introduce a decoder attention module to address this issue: a fused attention mechanism refines post-upsampling feature maps in both the channel and spatial dimensions. The scheme is shown in Fig. 4. Compared with GAM, DAM is simpler and takes only one input, but the implementation of channel and spatial attention is quite similar. We use $M_c(F) \in \mathbb{R}^{C \times 1 \times 1}$ to denote channel attention, $M_s(F) \in \mathbb{R}^{1 \times H \times W}$ to denote spatial attention and $M(F)$ to denote fused attention. In short, DAM is computed as follows:

$$M_c(F) = \sigma\big(W_1(W_0(\mathrm{GAP}(F)))\big),$$
$$M_s(F) = \sigma\big(f^{3 \times 3}(f^{1 \times 1}(F)) + f^{5 \times 5}(f^{1 \times 1}(F)) + f^{7 \times 7}(f^{1 \times 1}(F))\big),$$
$$M(F) = M_c(F) \otimes M_s(F),$$

where $\sigma$ denotes sigmoid activation, $\mathrm{GAP}$ denotes global average pooling, $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the MLP weights, $r$ denotes the reduction ratio (set to 16 in our experiments), $f^{3 \times 3}$, $f^{5 \times 5}$ and $f^{7 \times 7}$ denote convolutions with the corresponding kernel sizes, and $f^{1 \times 1}$ is used to squeeze the channel dimension. Standard convolution hardly reaches a large receptive field due to its kernel size, and this drawback of traditional U-Net style decoders can limit segmentation performance. Inspired by the design of hybrid dilated convolution [28], we propose a residual attention block (RAB) as the basic module of our decoder. Unlike similar works that use dilated convolution in the encoder, we use it in the decoder to capture multiscale patterns of upsampled feature maps. Hybrid dilated convolution is utilized in our RAB to acquire large receptive fields while avoiding grid effects. The stem of the RAB is a stack of dilated convolutions with kernel size 3 and dilation rates [1, 2, 5], followed by a decoder attention module. The scheme is shown in Fig. 4. Assuming an initial receptive field of 1 × 1, the equivalent kernel size of a dilated convolution is computed as:

$$k_e = k + (k - 1)(r - 1),$$

where $k_e$ denotes the equivalent kernel size, $k$ the actual kernel size and $r$ the dilation rate. Thus, the equivalent kernel sizes of dilated convolutions with kernel size 3 and dilation rates [1, 2, 5] are 3, 5 and 11, respectively. According to the definition of receptive field, this stack of dilated convolutions reaches a receptive field of 17 × 17, which enables the capture of global information. Moreover, dilated convolutions with different dilation rates capture multiscale information in feature maps, which contributes to accurate segmentation of both large and small objects.
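The receptive-field arithmetic above can be verified with a few lines of Python; this is a worked check of the stated formulas, not code from the paper.

```python
def equivalent_kernel(k: int, r: int) -> int:
    """Equivalent kernel size of a dilated convolution: k_e = k + (k-1)(r-1)."""
    return k + (k - 1) * (r - 1)

def stacked_receptive_field(kernels) -> int:
    """Receptive field of stacked stride-1 convolutions: 1 + sum(k_i - 1)."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf

ks = [equivalent_kernel(3, r) for r in (1, 2, 5)]
print(ks)                           # [3, 5, 11]
print(stacked_receptive_field(ks))  # 17
```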
As we use ResNeXt-50 (32 × 4d) as the model encoder, we utilize residual connections in the decoder as well to avoid vanishing gradients. The dilated convolutions are followed by a DAM to refine upsampled features and produce fused attention maps. In short, the output of our RAB is computed as:

$$y = x + \mathrm{DAM}\big(\mathrm{HDC}(x)\big),$$

where $\mathrm{DAM}$ denotes the decoder attention module and $\mathrm{HDC}$ denotes hybrid dilated convolution. The CT slices used in our experiments come from three datasets [1] [8]; details are shown in Table 1. Dataset 1 contains 100 axial CT slices from more than 40 patients, which have been rescaled to 512 × 512 pixels and grayscaled. All slices were segmented by a radiologist using three labels: ground-glass opacity, consolidation and pleural effusion. Dataset 2 contains 9 axial CT volumes, in which 373 of the total 829 slices were evaluated by a radiologist as positive and segmented using two labels, ground-glass opacity and consolidation. Dataset 3 contains 20 axial CT volumes, in which the left lung, right lung and infections were labeled by two radiologists and verified by an experienced radiologist; 1,844 of the total 3,520 slices contain infection regions. Datasets 2 and 3 contain 29 CT volumes in total, but not all slices contain infection regions; we discard all slices containing no COVID-19 infection and use only annotated slices. As the annotations in Dataset 3 do not distinguish ground-glass opacity from consolidation, we treat both classes in Dataset 2 as COVID-19 lesions without distinguishing them, thus creating a binary segmentation dataset. Intensity normalization was applied to both datasets and all slices were rescaled to 512 × 512 pixels to match Dataset 1. We likewise treat all ground-glass, consolidation and pleural effusion annotations in Dataset 1 as COVID-19 lesions, just as for Dataset 2. We chose not to combine the processed Datasets 1 to 3 and split them randomly, because slices from a single subject could then appear in both the training and test sets, which would constitute data leakage and inflate model performance. Instead, we obtain 1,645 processed slices from Datasets 2 and 3 in total and use them as our training set, and use the 100 axial slices from Dataset 1 as our test set. This split best evaluates the generalization capacity of the models. The model encoder is a ResNeXt-50 (32 × 4d) pretrained on ImageNet-1K, with the global average pooling and fully connected layers removed. The numbers of output channels are 64, 256, 512, 1024 and 2048, respectively, the same as in the original ResNeXt paper. Convolutions in the model decoder are padded and unstrided unless otherwise specified. Bilinear interpolation with a scale factor of 2 is used to upsample feature maps. Dice loss is widely utilized in semantic segmentation, but its differential is sometimes numerically unstable and may cause oscillation during training; combining Dice loss with cross-entropy avoids this issue. Thus we combine Dice loss $\mathcal{L}_{Dice}$ and binary cross-entropy loss $\mathcal{L}_{BCE}$ as our final loss function:

$$\mathcal{L} = \mathcal{L}_{Dice} + \lambda \mathcal{L}_{BCE},$$

where $\lambda = 1$ in our experiments.
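A minimal PyTorch sketch of such a combined loss is given below; the class name and the smoothing constant `eps` (used to stabilize the Dice ratio) are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class DiceBCELoss(nn.Module):
    """Sketch of L = L_Dice + lambda * L_BCE (lambda = 1).
    Expects raw logits and binary float targets of shape B x 1 x H x W."""
    def __init__(self, weight: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.weight = weight
        self.eps = eps  # smoothing term, an assumption for numerical stability
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))          # per-sample overlap
        denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = (2 * inter + self.eps) / (denom + self.eps)  # soft Dice score
        return (1 - dice).mean() + self.weight * self.bce(logits, target)
```

The BCE term contributes stable per-pixel gradients early in training, while the Dice term directly optimizes the overlap metric reported in our experiments.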
Our model is implemented in PyTorch on an Ubuntu 16.04 server, and training is accelerated with an NVIDIA RTX 2080 Ti GPU. Data augmentation is utilized during training to reduce overfitting and improve generalization: all input images are first rescaled to 560 × 560, followed by random flips, random rotations, and random gamma and log transforms; finally, images are randomly cropped to 448 × 448 and fed into the network. The model is optimized by an Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. L2 regularization is utilized to reduce overfitting as well, with the weight decay set to 1e-4. The initial learning rate is set to 1e-4 and reduced on plateau, with a reduction factor of 0.1 and a patience of 10. The batch size is set to 6 and we evaluate on the test set after 30 epochs. The training process takes approximately 140 minutes. We use the Dice similarity coefficient and pixel error as the main metrics to evaluate the segmentation performance of our D2A U-Net. Dice gauges the similarity of two samples and is widely used to evaluate semantic segmentation. Pixel error measures the number of falsely predicted pixels in the image, reflecting the global segmentation accuracy of the proposed models. Both metrics measure segmentation performance globally. In addition, we calculate the recall score of infection regions, since recall measures a model's sensitivity to lung infection, which is particularly significant for COVID-19. Using $TP$, $FP$, $TN$ and $FN$ to denote the true positives, false positives, true negatives and false negatives between the dense predictions and the ground truth, these metrics are calculated as follows:

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Pixel\ Error} = \frac{FP + FN}{TP + TN + FP + FN}.$$
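These metrics are straightforward to compute from binary masks; the snippet below is an illustrative NumPy sketch (the function name and the small epsilon guarding division by zero are our assumptions).

```python
import numpy as np

def binary_metrics(pred: np.ndarray, gt: np.ndarray):
    """Dice, recall and pixel error for binary masks with values in {0, 1}."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    eps = 1e-8  # guards against empty masks
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    recall = tp / (tp + fn + eps)
    pixel_error = (fp + fn) / pred.size  # fraction of falsely predicted pixels
    return dice, recall, pixel_error
```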
We compared the performance of the proposed network with U-Net [22], Attention U-Net [17] and U-Net++ [40]. The VGG-style backbone refers to the encoder design proposed in the original U-Net paper [22]. We also compared our model with two cutting-edge models widely used in natural image segmentation, FCN8s [12] and DeepLab v3 (output stride = 8) [3], both of which contain a pretrained backbone. Apart from model performance, model parameters and computational costs (FLOPs) are also compared in our experiments. As our model differs from other U-Net family models in its encoder, we also build a simplified D2A U-Net with a VGG-style backbone to best evaluate our decoder design and attention mechanism. We believe the simplified version offers a fairer comparison with other U-Net based models, while the standard D2A U-Net with a ResNeXt-50 (32 × 4d) backbone provides the best segmentation results. All metrics reported in Table 2 are averaged over 5 repeated experiments to report fair and reliable results. A detailed comparison among the models in our experiments is shown in Table 2. As shown, without a pretrained backbone, our proposed network outperforms U-Net, Attention U-Net and U-Net++ in Dice, pixel error and recall. As these models have identical encoders, it is clear that the proposed dual attention mechanism and RAB contribute substantially to infection segmentation. The attention mechanism helps the model detect infected tissues more accurately, which reduces false positives and improves recall; the RAB in the decoder captures both large and tiny visual structures, which helps segment lesions of different sizes. It should also be noted that the proposed D2A U-Net with a VGG-style backbone outperforms U-Net++ with comparably fewer parameters and lower computational cost, demonstrating the balance of efficiency and performance in our models.

Table 2: Quantitative analysis of infection regions on our dataset. Backbone VGG-style refers to the encoder proposed in [22]; backbones ResNet-101 and ResNeXt-50 (32 × 4d) are pretrained on ImageNet-1K.

Utilizing a pretrained backbone further improves model performance. As can be seen, our D2A U-Net with a pretrained ResNeXt-50 (32 × 4d) backbone outperforms the other networks, including those with similar pretrained backbones, in Dice, pixel error and recall by a large margin, yielding the best results on our dataset, while requiring fewer computational resources than FCN8s and DeepLab v3 (output stride = 8). As Table 2 shows, a pretrained encoder offers better initialization of model parameters and reduces overfitting, especially when the amount of data is insufficient. Overall, the proposed architecture performs better than existing cutting-edge models. We visualize segmentation results in Fig. 5. The visualization shows that our proposed model clearly outperforms the others. U-Net and Attention U-Net are the least sensitive to COVID-19 lesions, with much stronger activation on background pixels than the other models. U-Net++ produces more accurate segmentation results, but they are still not satisfactory, as some tiny lesions or lesions with blurred edges are segmented poorly. D2A U-Net with a VGG-style backbone produces the most accurate segmentation masks among the U-Net based models mentioned above, and when the backbone is switched to ResNeXt-50 (32 × 4d), D2A U-Net produces the best segmentation results, being comparably more sensitive to blurred or tiny lesions than the other models. Several ablation experiments were conducted to evaluate the components of our model, as shown in Table 3. Effectiveness of combining GAM, RAB and pretrained backbone (PB): as can be seen from Table 3, in experiment No. 4, introducing GAM and RAB together (the proposed D2A U-Net) yields the best results in our experiments, and the performance boost exceeds the simple sum of each module's individual boost. These results indicate that GAM and RAB promote performance mutually. Also, in No. 5, a pretrained backbone as better parameter initialization further improves model performance. In this paper we proposed a novel segmentation network, D2A U-Net, for COVID-19 CT segmentation. Inspired by global attention upsample and CBAM, we propose a novel gated attention mechanism, the gate attention module, to produce a fused attention map over features extracted by the encoder. We also introduce a decoder attention module, which helps refine upsampled feature maps. Furthermore, inspired by hybrid dilated convolution, we present a residual attention block containing hybrid dilated convolution and a decoder attention module, and use it as the basic block of the model decoder. The attention mechanism increases model sensitivity to positive pixels and improves recall, while the residual attention block refines upsampled feature maps and enlarges the receptive field simultaneously. Experimental results indicate that our network is capable of segmenting COVID-19 lesions from CT slices automatically and achieves the best results among the popular cutting-edge models evaluated in our experiments. However, our work is still limited to some degree, as only binary segmentation is performed, which can limit the model's potential use in diagnosis and health care.
We expect to gather more CT scans and perform multi-class segmentation in the future. Also, despite the significantly better performance of our D2A U-Net with a ResNeXt-50 (32 × 4d) backbone, the model has many more parameters than other architectures with similar backbones (FCN8s and DeepLab v3). We believe that because ResNet family models have large channel counts (e.g. 1024 and 2048 in the last two layers), the decoder parameters become extremely large. This problem might be addressed by introducing the so-called bottleneck design of ResNets into the decoder of D2A U-Net to reduce channel counts and thus model parameters.

References

[1] Covid-19 ct segmentation dataset
[2] Covid-19 global cases by johns hopkins university
[3] Rethinking atrous convolution for semantic image segmentation
[4] Residual attention u-net for automated multi-class segmentation of covid-19 chest ct images
[5] Control of goal-directed and stimulus-driven attention in the brain
[6] Inf-net: Automatic covid-19 lung infection segmentation from ct images
[7] Squeeze-and-excitation networks
[8] Covid-19 ct lung and infection segmentation dataset
[9] A review on lung and nodule segmentation techniques
[10] Ct imaging of the 2019 novel coronavirus (2019-ncov) pneumonia
[11] Pyramid attention network for semantic segmentation
[12] Fully convolutional networks for semantic segmentation
[13] A generic approach to pathological lung segmentation
[14] Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation
[15] Recurrent models of visual attention
[16] Imaging profile of the covid-19 infection: radiologic findings and literature review
[17] Attention u-net: Learning where to look for the pancreas
[18] Dual-sampling attention network for diagnosis of covid-19 from community acquired pneumonia
[19] Time course of lung changes on chest ct during recovery from coronavirus disease 2019 (covid-19)
[20] Concentrated-comprehensive convolutions for lightweight semantic segmentation
[21] Miniseg: An extremely minimum network for efficient covid-19 segmentation
[22] U-net: Convolutional networks for biomedical image segmentation
[23] Automatic lung segmentation on thoracic ct scans using u-net convolutional network
[24] A novel coronavirus outbreak of global health concern
[25] Residual attention network for image classification
[26] A noise-robust framework for automatic segmentation of covid-19 pneumonia lesions from ct images
[27] Covid-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images
[28] Understanding convolution for semantic segmentation
[29] Detection of sars-cov-2 in different types of clinical specimens
[30] Smoothed dilated convolutions for improved dense prediction
[31] Cbam: Convolutional block attention module
[32] Aggregated residual transformations for deep neural networks
[33] Relational modeling for robust and efficient pulmonary lobe segmentation in ct scans
[34] Computer-aided diagnosis of pulmonary infections using texture analysis and support vector machine classification
[35] Multi-scale context aggregation by dilated convolutions
[36] Lung segmentation in ct images using a fully convolutional neural network with multi-instance and conditional adversary loss
[37] Deep learning-based detection for covid-19 from chest ct using weak label
[38] A rapid, accurate and machine-agnostic segmentation and quantification method for ct-based covid-19 diagnosis
[39] An automatic covid-19 ct segmentation network using spatial and channel attention mechanism
[40] Unet++: A nested u-net architecture for medical image segmentation, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support

Acknowledgements

This work was partially supported by the Fundamental Research Funds for Central Universities, the National
Natural Science Foundation of China (No. 61601019, 61871022), the Beijing Natural Science Foundation (7202102), and the 111 Project (No. B13003).