key: cord-0879346-p5f1sq1g
authors: Zhao, Xiangyu; Zhang, Peng; Song, Fan; Fan, Guangda; Sun, Yangyang; Wang, Yujia; Tian, Zheyuan; Zhang, Luqi; Zhang, Guanglei
title: D2A U-Net: Automatic Segmentation of COVID-19 CT Slices Based on Dual Attention and Hybrid Dilated Convolution
date: 2021-06-02
journal: Comput Biol Med
DOI: 10.1016/j.compbiomed.2021.104526
sha: 8979f4a8c2a6de1470a330e03949ffc9627e6af5
doc_id: 879346
cord_uid: p5f1sq1g

Coronavirus Disease 2019 (COVID-19) has become one of the most urgent public health events worldwide due to its high infectivity and mortality. Computed tomography (CT) is an important screening tool for COVID-19 infection, and automatic segmentation of lung infection in COVID-19 CT images can assist the diagnosis and care of patients. However, accurate and automatic segmentation of COVID-19 lung infections faces several challenges, including blurred infection edges and relatively low sensitivity. To address these issues, a novel dilated dual attention U-Net based on a dual attention strategy and hybrid dilated convolutions, namely D2A U-Net, is proposed for COVID-19 lesion segmentation in CT slices. In our D2A U-Net, the dual attention strategy, composed of two attention modules, is utilized to refine feature maps and reduce the semantic gap between different levels of feature maps. Moreover, hybrid dilated convolutions are introduced to the model decoder to achieve larger receptive fields, which refines the decoding process. The proposed method is evaluated on an open-source dataset and achieves a Dice score of 0.7298 and a recall score of 0.7071, which outperforms popular cutting-edge methods in semantic segmentation. The proposed network is expected to be a potential AI-based approach for the diagnosis and prognosis of COVID-19 patients.

The COVID-19 pandemic caused by SARS-CoV-2 continues to spread all over the world [1], and most countries have been affected by this unprecedented public health event. By March 2021, more than 116 million cases of COVID-19 had been reported and more than 2,580,000 people had died [2] of COVID-19 infection. Due to the strong infectivity of SARS-CoV-2, identifying infected people is essential to cut off transmission and slow the spread of the virus. Reverse transcriptase-polymerase chain reaction (RT-PCR) is considered the gold standard of diagnosis [3] for its high specificity, but it is time-consuming and laborious. Also, the capacity for RT-PCR testing can be insufficient in less-developed regions, especially during the pandemic. Computed tomography (CT) imaging is one of the most commonly used screening methods to detect lung infection and has proved to be efficient in the diagnosis and follow-up prognosis of COVID-19.

Compared with chest X-ray images, CT imaging is more sensitive, especially in the early stage of infection. A ground-glass pattern is the most common finding in COVID-19 infections, usually in the early stage, while pulmonary consolidation can be observed in the later stage. Pleural effusion can also be observed in pathological CT slices. These typical features of COVID-19 lung infection are shown in Figure 1.

Thus, chest CT imaging is regarded as a convenient, fast and accurate approach to diagnose COVID-19. The evaluation of the localization and geometric features of the infection area could provide adequate information on disease progression and help physicians make better treatment decisions [5] [6] [7]. However, manual annotation of the infection regions is time-consuming and laborious. Also, the annotations made by radiologists may be subjective and biased due to personal judgement.

Figure 1: Example of COVID-19 CT slices, where the red, green and blue masks denote the ground glass, consolidation and pleural effusion respectively. The images are collected from [4].

Recently, numerous deep learning algorithms using convolutional neural networks (CNNs) have been proposed to detect COVID-19 infection. For instance, Wang and Wong [8] developed COVID-Net to perform ternary classification among healthy people, COVID-19 patients and people infected with other pneumonia in chest X-ray images, achieving an overall accuracy of 93.3%. In terms of deep learning algorithms for CT imaging, Zhou and Canu [9] proposed an automatic network equipped with an attention mechanism to segment the infection area from CT slices. Fan et al. [10] developed Inf-Net and a corresponding semi-supervised algorithm to perform CT segmentation. Zheng et al. [11] proposed a weakly-supervised deep learning method to detect COVID-19 infection in CT volumes. Xi et al. [12] presented a dual-sampling attention network to distinguish COVID-19 from community-acquired pneumonia. However, the detection of lung infections caused by COVID-19 in CT images remains challenging, because infection regions vary in shape, position and texture, and the boundaries between lesions and normal tissues can be rather blurred. These features increase the difficulty of COVID-19 detection and limit model performance, especially in terms of sensitivity.

To address the issues above, we propose a dilated dual attention U-Net (D2A U-Net) framework to automatically segment the lung infection in COVID-19 CT slices. Since the infected tissues can be hard to distinguish from the normal tissues, we introduce a dual attention strategy consisting of a gate attention module (GAM) and a decoder attention module (DAM) to refine feature maps and produce more informative feature representations. The proposed GAM fuses features with semantic-rich gate signals to refine the skip connections in the network. The proposed DAM is introduced to the model decoder to improve the decoding quality, especially when segmenting blurred lesions. As COVID-19 infection varies in position and size, we utilize hybrid dilated convolutions with different dilation rates in the model decoder to obtain larger receptive fields and balance the segmentation performance on both large and tiny objects, which provides better segmentation results. The sensitivity of infection segmentation is improved significantly by these refinements, which leads to better segmentation performance.

The paper is organized as follows: Section 2 offers a review of related works on CT segmentation. Section 3 describes the overview of this work and details our proposed model. Section 4 presents the details of our experiments and provides both quantitative and qualitative segmentation results. Section 5 discusses the proposed method and concludes our work.

In this section, we review four types of closely related work: chest CT segmentation, attention mechanisms, dilated convolution and AI-based COVID-19 segmentation systems.

Chest CT imaging is one of the most popular screening methods for lung disease diagnosis [13]. Segmentation of organs and lesions provides crucial information for the diagnosis and prognosis of many diseases. However, since manual segmentation remains time-consuming, laborious and subjective, automatic CT segmentation has gained much popularity in research. Recent research on automatic CT segmentation mainly focuses on machine learning techniques. Most related works feature a pixel-wise classifier that infers from extracted features and makes dense predictions. For example, Mansoor et al. [14] proposed a texture-based feature classifier for pathological lung segmentation in CT images. Yao et al. [15] utilized texture analysis and a support vector machine to segment infections in the lung tissues. These algorithms have realized automatic segmentation in chest CT images, but several issues remain unsolved, including subjective bias in feature extraction and difficulties in segmenting nodule regions. Deep learning algorithms feature powerful fitting capacity and require no laborious preprocessing. Most cutting-edge segmentation algorithms are based on deep learning approaches. For example, Shaziya et al. [16] used U-Net to segment lung tissues in chest CT scans. Zhao et al. [17] proposed a fully convolutional neural network with multi-instance and conditional adversary loss for pathological lung segmentation.

Attention plays an important role in human perception and visual cognition [18]. One significant property of human perception is that humans hardly ever process visual information as a whole. Instead, humans usually process visual information recurrently, where top-down information is utilized to guide the bottom-up feedforward process [19]. Inspired by this principle, attention mechanisms have been widely used in computer vision, especially in image classification [20] [21] [22]. Related algorithms typically refine feature maps in the spatial dimension, the channel dimension or both. For example, Hu et al. [20] introduced a Squeeze-and-Excitation module, where global average pooling is performed on the input features to produce channel-wise attention. Woo et al. [21] proposed a convolutional block attention module (CBAM) to introduce a fused attention consisting of channel attention and spatial attention. Wang et al. [22] presented a residual attention network, which contains an attention module featuring an encoder-decoder architecture. Attention mechanisms have also been utilized in semantic segmentation tasks to make more accurate dense predictions. For instance, Li et al. [23] proposed a Pyramid Attention Network to exploit the impact of global contextual information in semantic segmentation.

These typical algorithms resemble one another in some aspects. Certain operations, such as global pooling, convolution, and the combination of downsampling and upsampling, are utilized to enhance the informative regions in the feature maps and suppress irrelevant information, which allows the network to learn more generalized visual structures and improves its robustness against noisy inputs.

Traditional deep convolutional networks usually involve strided convolution or pooling operations to improve the receptive fields, in which the input images are downsampled. However, these operations often lead to the loss of global information in dense predictions, such as semantic segmentation and object detection. Yu and Koltun [24] introduced dilated convolution to deep networks, which has proved to be useful in dense predictions. The basic idea of dilated convolution is to insert "holes" (zeros) in the convolution kernels to obtain larger receptive fields without downsampling. Dilated convolution avoids information loss during downsampling and has been widely used in the semantic segmentation tasks [25] [26] [27] . However, it has been observed that simply stacking dilated convolution in CNNs may cause grid effects [24] , which could lead to severe performance deterioration. Wang et al. [28] proposed a hybrid dilated convolution (HDC) framework to avoid grid effects, which improves the segmentation performance on both large and tiny objects.
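The idea of inserting zeros into a kernel can be illustrated with a small one-dimensional sketch. The helper names `dilate_kernel` and `conv1d_valid` are hypothetical and are not taken from any of the cited works; the snippet only shows that a 3-tap kernel with dilation rate 2 covers 5 input positions without any downsampling.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring taps of a 1-D kernel."""
    if rate < 1:
        raise ValueError("dilation rate must be >= 1")
    dilated = []
    for i, weight in enumerate(kernel):
        if i > 0:
            dilated.extend([0.0] * (rate - 1))
        dilated.append(float(weight))
    return dilated


def conv1d_valid(signal, kernel):
    """Plain 'valid' cross-correlation of a 1-D signal with a 1-D kernel."""
    width = len(kernel)
    steps = len(signal) - width + 1
    return [sum(s * k for s, k in zip(signal[i:i + width], kernel))
            for i in range(steps)]


kernel = [1.0, 2.0, 3.0]
dilated = dilate_kernel(kernel, rate=2)        # 3 taps now span 5 positions
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
out = conv1d_valid(signal, dilated)            # stride 1, no downsampling
print(dilated)  # [1.0, 0.0, 2.0, 0.0, 3.0]
print(out)      # [22.0, 28.0, 34.0]
```

Each output value only reads every second input sample. This sparse sampling pattern is what causes grid effects when layers with the same dilation rate are stacked.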

Artificial intelligence (AI) has been widely utilized in the fight against COVID-19. We mainly focus on AI-based semantic segmentation systems for CT scans. Many works focus on learning robust and noise-insensitive representations from limited or noisy inputs. For example, Xie et al. [29] proposed an RTSU-Net for segmenting pulmonary lobes in CT scans. A non-local neural network module was introduced to learn both visual and geometric relationships among the feature maps to produce self-attention. Wang et al. [30] presented a noise-robust framework for COVID-19 lesion segmentation. They utilized a noise-robust Dice loss and an adaptive self-ensembling strategy to learn from noisy labels. Chen et al. [31] proposed a residual attention U-Net which introduced aggregated residual transformations and a soft attention mechanism to learn robust feature representations. Researchers have also investigated segmentation schemes that achieve both high speed and accuracy. For example, Zhou et al. [32] developed a rapid, accurate and machine-agnostic segmentation and quantification method for automatic segmentation of COVID-19 lesions. The innovation of their work lies in the first CT scan simulator for COVID-19 and a novel network architecture which solves the large-scene-small-object problem. Qiu et al. [33] developed a parameter-efficient framework to achieve fast segmentation of COVID-19 lung infection with relatively low computational cost.

In this section, we will go through the details of the proposed D2A U-Net architecture. In the first part, we will offer an overview of the proposed network. We then provide details about the proposed attention modules. Finally we introduce our proposed model decoders with hybrid dilated convolutions.

Our proposed network is based on the U-Net [34] architecture, which is popular in medical image segmentation. Compared with the original U-Net, dilated convolutions and a novel combination of attention mechanisms are integrated in our framework to obtain better feature representations. We integrate the dual attention strategy in the model decoder. A gate attention module is inserted inside the skip connections to utilize feature representations from different levels and reduce the semantic gap between the encoder and the decoder. We also introduce another fused attention mechanism in the model decoder to refine feature maps after upsampling. Specifically, a hybrid dilated convolution module [28] is utilized as the basic block of the model decoder to enlarge receptive fields and produce better dense predictions. For the model encoder, both the VGG-style encoder proposed in the original U-Net [34] and a ResNeXt-50 (32×4d) [35] pretrained on ImageNet are utilized. The network scheme is shown in Figure 2.

We introduce a dual attention strategy composed of a gate attention module (GAM) and a decoder attention module (DAM) to our network. The motivation behind utilizing dual attention strategy instead of single attention module is to further highlight the infection area and suppress false positives. GAM is utilized to refine the features extracted by the model encoder and to reduce the semantic gap by fusing high and low level feature maps, which highlights potential infection regions and improves the sensitivity to COVID-19 infection. DAM is inserted in the model decoder to refine the feature representations after upsampling, which is used to suppress the noise that may be introduced during upsampling and inhibit false positives.

Feature concatenation from the encoder to the decoder is the typical topological structure in U-Net, where the combination of high-resolution features in the encoder and upsampled features in the decoder enables better localization of segmentation targets [34] . However, not all visual representations in the encoder feature maps contribute to precise segmentation. In addition, the semantic gap between the encoder and the decoder can limit the performance of the model. Therefore, we introduce a gate attention module prior to concatenation to refine the features from the encoder and reduce the semantic gap.

Oktay et al. [36] proposed an attention gate to refine the encoder features with an attention mechanism. However, their attention gate implements only spatial attention, whereas introducing both channel attention and spatial attention can make the attention mechanism more effective. Thus, inspired by the global attention upsample module proposed in the pyramid attention network [23] and by CBAM [21], we propose a novel design of a gate attention module that enables both channel attention and spatial attention. The detailed scheme of the proposed GAM is shown in Figure 3. Two feature maps are fed into the attention module. The guiding signal refers to the feature map from the model decoder (or the last convolution block in the model encoder), and the feature refers to the feature map fed to the skip connections.

$G \in \mathbb{R}^{C_g \times H_g \times W_g}$ denotes the guiding signal and $F \in \mathbb{R}^{C_f \times H_f \times W_f}$ denotes the feature.

In the U-shaped structure, $G$ contains more deep semantic information, which is encoded in the channel dimension, compared with $F$. We utilize a global average pooling operation followed by a multilayer perceptron (MLP) to create the channel attention map $M_c(G) \in \mathbb{R}^{C_f \times 1 \times 1}$. The output size of the MLP is smaller than the input size, which enables the suppression of irrelevant feature representations in the channel dimension. In short, we compute the channel attention as follows:

$$M_c(G) = \sigma(W_1(W_0(\mathrm{GAP}(G))))$$

where $\sigma$ denotes sigmoid activation, $\mathrm{GAP}$ denotes global average pooling, $W_0 \in \mathbb{R}^{C_g/r \times C_g}$ and $W_1 \in \mathbb{R}^{C_f \times C_g/r}$; $r$ denotes the reduction ratio, which is set to 16 in our experiments.

Spatial attention is guided by both the guiding signal and the input feature itself. We use a convolution with one filter to squeeze the channel dimension of $G$ and $F$. Then the reduced feature map from $G$ is upsampled to match the size of $F$. A combination of convolutions with different kernel sizes is utilized to produce the spatial attention $M_s(G, F) \in \mathbb{R}^{1 \times H_f \times W_f}$. In short, the spatial attention is computed by passing the squeezed maps through these convolutions and a sigmoid activation $\sigma$, where $f^{3\times3}$, $f^{5\times5}$ and $f^{7\times7}$ denote convolutions with the corresponding kernel sizes and $f^{1\times1}$ is used to squeeze the channel dimension.

Then we use element-wise multiplication to combine spatial and channel attention to produce the fused attention $M(G, F)$:

$$M(G, F) = M_c(G) \bullet M_s(G, F)$$

where $\bullet$ denotes element-wise multiplication.
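As a rough illustration of the channel branch described above (global average pooling, a two-layer MLP whose output is narrower than its input, and a sigmoid), the following pure-Python sketch operates on nested lists. The function names and the toy weights `w0` and `w1` are hypothetical, and no hidden-layer activation is applied because the text does not mention one; this is a sketch under those assumptions, not the authors' implementation.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def global_average_pool(feature):
    """feature: C x H x W nested lists -> list of C per-channel means."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature]


def linear(weights, vector):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_attention(guide, w0, w1):
    """sigmoid(W1 * W0 * GAP(guide)) with hypothetical weight matrices."""
    pooled = global_average_pool(guide)
    hidden = linear(w0, pooled)          # bottleneck of size C / r
    return [sigmoid(v) for v in linear(w1, hidden)]


# Toy guiding signal with 4 channels of size 2 x 2 (values are illustrative).
guide = [[[1.0, 1.0], [1.0, 1.0]],
         [[2.0, 2.0], [2.0, 2.0]],
         [[0.0, 0.0], [0.0, 0.0]],
         [[4.0, 4.0], [4.0, 4.0]]]
w0 = [[0.5, 0.0, 0.0, 0.0],              # 2 x 4: reduces 4 channels to 2
      [0.0, 0.0, 0.0, 0.25]]
w1 = [[2.0, -1.0],                       # 2 x 2: output narrower than input
      [0.0, 1.0]]
attention = channel_attention(guide, w0, w1)
print(attention)
```

Every entry of `attention` lies strictly between 0 and 1, so multiplying a feature channel by its entry can only attenuate it, which matches the stated goal of suppressing irrelevant channels.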

In semantic segmentation, high-resolution visual representations in the encoder need to be upsampled to make dense predictions. Transposed convolution and interpolation are both popular solutions to image upsampling, but both have their drawbacks. Compared with interpolation, transposed convolution is trainable and offers more nonlinearity to deep networks, which improves the model capacity. However, grid effects are hard to avoid if hyperparameters are not properly configured, and this drawback can be more troublesome when stacking more than one transposed convolution layer. Thus we propose bilinear interpolation followed by convolution to upsample the feature maps. However, as interpolation is not trainable, it inevitably introduces irrelevant information or noise into the upsampling process. Thus, we introduce a decoder attention module to solve this issue. A fused attention mechanism is utilized to refine the post-upsampling feature maps in both the channel and spatial dimensions. The scheme is shown in Figure 4. Compared with the proposed GAM, DAM is simpler and takes only one input $F \in \mathbb{R}^{C \times H \times W}$, but the implementation of both channel and spatial attention is quite similar. We use $M_c(F) \in \mathbb{R}^{C \times 1 \times 1}$ to denote channel attention, $M_s(F) \in \mathbb{R}^{1 \times H \times W}$ to denote spatial attention and $M(F)$ to denote fused attention. In short, DAM is computed as follows:

$$M_c(F) = \sigma(W_1(W_0(\mathrm{GAP}(F))))$$

where $\sigma$ denotes sigmoid activation, $\mathrm{GAP}$ denotes global average pooling, $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$; $r$ denotes the reduction ratio, which is set to 16 in our experiments.

$$M_s(F) = \sigma(f^{3\times3}(f^{1\times1}(F)) + f^{5\times5}(f^{1\times1}(F)) + f^{7\times7}(f^{1\times1}(F)))$$

where $\sigma$ denotes sigmoid activation, and $f^{3\times3}$, $f^{5\times5}$ and $f^{7\times7}$ denote convolutions with the corresponding kernel sizes; $f^{1\times1}$ is used to squeeze the channel dimension.

$$M(F) = M_c(F) \bullet M_s(F)$$

where $\bullet$ denotes element-wise multiplication.
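The fusion step can be read as a broadcast: a channel attention map of shape C×1×1 is multiplied element-wise with a spatial attention map of shape 1×H×W, giving one weight per channel and pixel. The sketch below shows that broadcast on nested lists. The function names are hypothetical, and applying the fused map to the feature by multiplication is an assumption made for illustration, since the text only states that the fused attention refines the feature maps.

```python
def fuse_attention(channel_att, spatial_att):
    """Broadcast C channel weights against an H x W spatial map -> C x H x W."""
    return [[[c * s for s in row] for row in spatial_att] for c in channel_att]


def apply_attention(feature, fused):
    """Element-wise product of a C x H x W feature and a C x H x W attention map.

    Assumption: the refined feature is feature * attention; the source text
    does not spell out this final step.
    """
    return [[[f * a for f, a in zip(f_row, a_row)]
             for f_row, a_row in zip(f_channel, a_channel)]
            for f_channel, a_channel in zip(feature, fused)]


channel_att = [1.0, 0.5]                      # C = 2 (illustrative values)
spatial_att = [[1.0, 0.0],                    # H = W = 2
               [0.5, 1.0]]
fused = fuse_attention(channel_att, spatial_att)

feature = [[[2.0, 2.0], [2.0, 2.0]],
           [[4.0, 4.0], [4.0, 4.0]]]
refined = apply_attention(feature, fused)
print(fused)    # [[[1.0, 0.0], [0.5, 1.0]], [[0.5, 0.0], [0.25, 0.5]]]
print(refined)  # [[[2.0, 0.0], [1.0, 2.0]], [[2.0, 0.0], [1.0, 2.0]]]
```

A pixel is kept only when both its channel weight and its spatial weight are high, which is one way to see how a fused map can suppress noise introduced by non-trainable interpolation.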

Standard convolution hardly reaches a large receptive field with a fixed kernel size. This drawback of traditional U-Net based networks may limit the segmentation performance. Inspired by the design of hybrid dilated convolution [28], we propose a residual attention block (RAB) as the basic module in the model decoder. We use dilated convolutions in the decoder to capture multiscale patterns of the upsampled feature maps. The stem of the RAB is a stack of dilated convolutions with a kernel size of 3 and dilation rates of [1, 2, 5]. These dilation rates yield larger receptive fields and also avoid the grid effects of vanilla dilated convolutions [28]. The dilated convolutions are then followed by a decoder attention module. The scheme is shown in Figure 4.

We assume the initial receptive field is 1 × 1. The equivalent kernel size of a dilated convolution is computed as follows:

$$k' = k + (k - 1)(d - 1)$$

where $k'$ denotes the equivalent kernel size, $k$ denotes the actual kernel size, and $d$ denotes the dilation rate.

Thus, the equivalent kernel sizes of dilated convolutions with kernel size 3 and dilation rate [1, 2, 5] are 3, 5, 11, respectively. According to the definition of receptive field, such design of stacked dilated convolution obtains a receptive field of 17 × 17, which enables the capture of global information. Also, dilated convolution with different dilation rate can capture multiscale information in the feature maps, which can contribute to the accurate segmentation on both large and small objects.
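The arithmetic above can be checked with a few lines of Python. The sketch uses the standard relation between kernel size, dilation rate and equivalent kernel size, k + (k − 1)(d − 1), and assumes stride-1 convolutions stacked in sequence, so each layer adds its equivalent kernel size minus one to the receptive field. The function names are illustrative only.

```python
def equivalent_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size, dilations, initial=1):
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    receptive_field = initial
    for dilation in dilations:
        receptive_field += equivalent_kernel_size(kernel_size, dilation) - 1
    return receptive_field


sizes = [equivalent_kernel_size(3, d) for d in (1, 2, 5)]
rf = stacked_receptive_field(3, (1, 2, 5))
print(sizes, rf)  # [3, 5, 11] 17
```

For comparison, three stacked standard 3 × 3 convolutions (dilation rates [1, 1, 1]) give `stacked_receptive_field(3, (1, 1, 1)) == 7`, so the hybrid setting more than doubles the receptive field at the same parameter count.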

In addition, we utilize residual connections in the RAB to avoid vanishing gradients. The hybrid dilated convolutions are followed by a DAM to refine upsampled features and produce fused attention maps. In short, the RAB maps the input feature maps to the output feature maps by applying the hybrid dilated convolutions followed by the proposed decoder attention module, together with a residual connection.

The axial CT slices used in our experiments come from 3 independent datasets [4] [37]. The details of these datasets are shown in Table 1. Dataset 1 contains 100 axial CT slices from more than 40 patients, which have been rescaled to 512 × 512 pixels and converted to grayscale. All slices are segmented by a radiologist using three labels: ground-glass opacity, consolidation and pleural effusion. Dataset 2 contains 9 axial CT volumes, where 373 out of the total 829 slices have been evaluated by a radiologist as positive and segmented using 2 labels: ground-glass opacity and consolidation. Dataset 3 contains 20 axial CT volumes, which have been segmented by two radiologists and verified by an experienced radiologist. Dataset 2 and Dataset 3 contain 29 CT volumes in total, but not all slices contain infection regions. We discard all slices containing no COVID-19 infection and use only slices with annotations. As the annotations in Dataset 3 do not distinguish ground-glass opacity from consolidation, we treat both ground-glass opacity and consolidation in Dataset 2 as COVID-19 lesions without distinguishing them, thus creating a binary segmentation dataset. Intensity normalization has been applied to both datasets and all slices have been rescaled to 512 × 512 pixels to match Dataset 1. Likewise, we treat all ground-glass opacity, consolidation and pleural effusion in Dataset 1 as COVID-19 lesions, as we do for Dataset 2.

We do not combine the processed Datasets 1 to 3 and then split them randomly, because slices of one subject could then appear in both the training and test datasets, which would constitute data leakage and lead to spuriously high model performance. Since Dataset 1 contains the largest number of subjects (more than 40), it is best suited to be the independent test set. We therefore obtain 1645 processed slices from Dataset 2 and Dataset 3 and use them as our final training dataset, and we use the 100 axial slices from Dataset 1 as our final test dataset. Such a data split best evaluates the model's generalization capacity.

The model encoder is a ResNeXt-50 (32 × 4d) pretrained on ImageNet-1K. We remove the global average pooling and fully connected layers from the original network. The numbers of output channels of its stages are 64, 256, 512, 1024 and 2048, respectively, the same as in the original ResNeXt paper. Convolution operations in the model decoder are padded and have no stride unless otherwise specified. Bilinear interpolation is utilized to upsample feature maps, with the scale factor set to 2. Dice loss is widely utilized in semantic segmentation, but the gradient of Dice loss is sometimes numerically unstable and may lead to oscillation in the training process. Combining Dice loss with cross-entropy can avoid this issue. Thus we combine the Dice loss $\mathcal{L}_{Dice}$ and the binary cross-entropy loss $\mathcal{L}_{BCE}$ in a weighted sum as our final loss function, where the weighting coefficient $\lambda = 1$ in our experiments.
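A combined Dice and binary cross-entropy loss can be sketched on flattened probability and label lists as follows. This is a minimal sketch, not the training code: the soft Dice formulation, the smoothing constant `eps`, and the choice to attach the weight `lam` to the cross-entropy term are all assumptions. With a weight of 1, as used in the experiments, the placement of the weight does not change the result.

```python
import math


def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss: 1 - 2 * sum(p * t) / (sum(p) + sum(t))."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def combined_loss(probs, targets, lam=1.0):
    """Dice loss plus lam times BCE (weight placement is an assumption)."""
    return dice_loss(probs, targets) + lam * bce_loss(probs, targets)


targets = [1.0, 0.0, 1.0, 0.0]
good = combined_loss([1.0, 0.0, 1.0, 0.0], targets)      # near-perfect mask
unsure = combined_loss([0.5, 0.5, 0.5, 0.5], targets)    # uninformative mask
bad = combined_loss([0.0, 1.0, 0.0, 1.0], targets)       # inverted mask
print(good, unsure, bad)
```

The cross-entropy term supplies a well-behaved per-pixel gradient even when the overlap is close to zero, which is the regime where a Dice-only objective tends to be unstable.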

Our model is implemented using PyTorch on an Ubuntu 16.04 server. We use an NVIDIA RTX 2080 Ti GPU to accelerate the training process. Data augmentation is utilized during training to reduce overfitting and improve the generalization capacity. First, all input images are rescaled to 560 × 560, followed by random flip, random rotation, random gamma and log transform. Finally, images are randomly cropped to 448 × 448 and fed into the network. The model is optimized by an Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 1e-8$. $L_2$ regularization is utilized to reduce overfitting as well, with the weight decay set to 1e-8. Monte Carlo cross-validation is utilized to find the optimal hyper-parameters (i.e., the initial learning rate and number of epochs) during the training phase. The initial learning rate is set to 1e-4 and is reduced when the training reaches a plateau, with a reduction factor of 0.1 and a patience of 10. The batch size is set to 6 and we perform evaluation on the test set after 30 epochs. The training process takes approximately 140 minutes.
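The plateau-based learning rate schedule can be pictured with the simplified stand-in below. The class `PlateauScheduler` is hypothetical and is not the scheduler used in the experiments; it assumes that the monitored quantity is a validation loss and that the rate is reduced once the loss has failed to improve for more than `patience` consecutive epochs. Neither detail is stated in the text.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau rule (illustrative, not the original code)."""

    def __init__(self, lr=1e-4, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, monitored_loss):
        """Update the learning rate after one epoch and return it."""
        if monitored_loss < self.best:
            self.best = monitored_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor      # reduce by the configured factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler(lr=1e-4, factor=0.1, patience=10)
# A loss that never improves after the first epoch (synthetic values).
lrs = [scheduler.step(1.0) for _ in range(12)]
print(lrs[10], lrs[11])
```

In this toy run the rate stays at 1e-4 for the first eleven epochs and drops to roughly 1e-5 at the twelfth, after eleven epochs without improvement.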

We use the Dice similarity coefficient and pixel error as the main metrics to evaluate the segmentation performance of our D2A U-Net. Dice is a statistic used to gauge the similarity of two samples, and has been widely used to evaluate performance in semantic segmentation. Pixel error measures the proportion of pixels predicted falsely in the image, which shows the global segmentation accuracy of the proposed models. Compared with the Dice score or recall score, pixel error is easier to interpret and more intuitive. Both metrics measure segmentation performance in a global way. In addition, we calculate the recall score of infection regions, as recall measures the model's sensitivity to lung infection, which is particularly important for COVID-19 infection. We use $G$ to denote the ground truth, $P$ to denote the dense predictions, $TP$ to denote true positives, $FP$ to denote false positives, $TN$ to denote true negatives and $FN$ to denote false negatives. These metrics are calculated as follows:

$$\mathrm{Dice} = \frac{2\,|G \cap P|}{|G| + |P|}$$

$$\mathrm{Pixel\ error} = \frac{FP + FN}{TP + FP + TN + FN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
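The three metrics can be computed from the confusion counts of a binary mask, as in the sketch below. It uses the standard definitions (Dice as 2·TP / (2·TP + FP + FN), recall as TP / (TP + FN), pixel error as the fraction of misclassified pixels) on flattened 0/1 lists. The function names and the convention of returning 1.0 when there are no positive pixels are assumptions made for illustration.

```python
def confusion_counts(pred, truth):
    """Return (TP, FP, TN, FN) for two flattened binary masks."""
    tp = sum(1 for p, g in zip(pred, truth) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, truth) if p == 1 and g == 0)
    tn = sum(1 for p, g in zip(pred, truth) if p == 0 and g == 0)
    fn = sum(1 for p, g in zip(pred, truth) if p == 0 and g == 1)
    return tp, fp, tn, fn


def dice_score(pred, truth):
    tp, fp, _, fn = confusion_counts(pred, truth)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 1.0  # assumed convention


def recall_score(pred, truth):
    tp, _, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 1.0          # assumed convention


def pixel_error(pred, truth):
    _, fp, _, fn = confusion_counts(pred, truth)
    return (fp + fn) / len(truth)


truth = [1, 1, 1, 0, 0, 0, 0, 0]   # 3 infected pixels (toy example)
pred = [1, 1, 0, 1, 0, 0, 0, 0]    # 2 hits, 1 miss, 1 false alarm
print(confusion_counts(pred, truth))  # (2, 1, 4, 1)
print(pixel_error(pred, truth))       # 0.25
```

In this toy case Dice and recall both equal 2/3, while pixel error is only 0.25. Because pixel error also counts the many true-negative background pixels, it can look favourable even when a sizeable part of the lesion is missed, which is why recall is reported separately.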

In this section, the proposed D2A U-Net is compared with other cutting-edge methods to evaluate the effectiveness of the proposed model. Two groups of model comparison have been conducted in the experiments to provide a fair comparative observation of the model performance from different angles of view.

First, the proposed D2A U-Net is compared with popular U-Net family models including U-Net [34], Attention U-Net [36] and U-Net++ [38]. The models listed above are all trained from scratch and share the same backbone structure, i.e. the VGG-style backbone, which refers to the encoder design proposed in the original U-Net paper [34]. Such experimental settings provide the fairest comparison of these U-Net based models, as they share the same model backbone and training strategies.

In addition, using a backbone pretrained on ImageNet to accelerate convergence and improve segmentation results has become popular in computer vision tasks on natural images. Thus, we also introduce a pretrained D2A U-Net with a ResNeXt-50 (32 × 4d) backbone to further improve the segmentation performance. The pretrained version is compared with 2 cutting-edge models widely used for natural image segmentation, FCN8s [39] and DeepLab v3 (output stride = 8) [40], both of which contain a pretrained ResNet-101 backbone.

Apart from model performance comparison, model parameters and computational costs (FLOPs) are also compared in our experiments.

To better evaluate the performance, all the metrics listed in Table 2 and Table 3 are averaged over 5 repeated experiments to report a fair and reliable result.

A detailed comparison among the different models in our experiments is shown in Table 2 and Table 3. As shown in Table 2, our proposed network outperforms U-Net, Attention U-Net and U-Net++ in terms of Dice, pixel error and recall. As these models share an identical encoder, it is clear that the proposed dual attention strategy and RAB contribute significantly to infection segmentation. The attention mechanism helps the model detect infected tissues more accurately, which reduces the number of false positives and improves the recall score. Also, the RAB in the decoder captures both large and tiny visual structures, which helps segment infection lesions of different sizes. In addition, it should be noted that the proposed D2A U-Net with the VGG-style backbone outperforms U-Net++ with comparatively fewer model parameters and lower computational cost, which demonstrates the balance of efficiency and performance in our models.

Utilizing a pretrained backbone could also improve model performance. As can be seen, our D2A U-Net with the pretrained ResNeXt-50 (32 × 4d) backbone outperforms other networks in terms of Dice, pixel error and recall by a large margin and yields the best results on our dataset. Also, our D2A U-Net with the pretrained ResNeXt-50 (32 × 4d) backbone takes fewer computational resources than FCN8s and DeepLab v3 (output stride = 8). As can be seen from Table 3, a pretrained encoder could offer a better initialization of the parameters and reduce overfitting, especially when the amount of data is insufficient. Overall, the proposed architecture performs better than the existing cutting-edge models.

Table 2: Quantitative analysis of U-Net based models on our dataset, including U-Net, Attention U-Net, U-Net++ and the proposed D2A U-Net. Metrics include Dice score, pixel error and recall score.


We visualized segmentation results for the U-Net based models mentioned above. When the backbone is switched to ResNeXt-50 (32 × 4d), D2A U-Net achieves the best segmentation results and is comparatively more sensitive to blurred or tiny lesions than other models.

Apart from common models in the field of computer vision, we also compared our method with recent studies, as shown in Table 4. Our proposed D2A U-Net yields performance comparable to or better than the latest advances in the field of COVID-19 CT segmentation. We attribute this performance to the proposed dual attention strategy and the utilization of hybrid dilated convolution blocks.

Table 4: Comparison with the latest research in the field of COVID-19 CT segmentation.

Study              Method         Dice
Wang et al. [41]   3D U-Net       0.704
Yan et al. [42]    COVID-SegNet   0.7026
Ma et al. [43]     3D U-Net       0.673
Fan et al. [10]    Semi-Inf-Net   0.739
Ours               D2A U-Net      0.7298

Several ablation experiments are conducted to evaluate the performance of the components presented in our model, as shown in Table 5 and Figure 6. In addition, we have visualized feature maps to further demonstrate the effectiveness of the proposed network components.

To evaluate the validity of the proposed GAM in our experiments, we design two baselines shown in Table 5: No.1 (U-Net only) and No.2 (U-Net + GAM). Feature maps are shown in Figure 7 to provide an intuitive demonstration of the effectiveness of the proposed GAM. Experimental results show that introducing GAM to the U-Net model can highlight the potential infection region and thus boost the performance, which leads to a better Dice score and recall.

Effectiveness of Proposed RAB

We conducted similar experiments (No.1 and No.3) to explore the effectiveness of the proposed RAB, which includes a hybrid dilated convolution block and a decoder attention module. Figure 8 indicates that introducing the hybrid dilated convolution block into the decoder improves the recall score of segmentation, and that the following decoder attention module further highlights the infection regions and also suppresses false positives. By introducing RAB to our model, the proposed network yields better results than the vanilla version.

Effectiveness of Combining GAM, RAB and PB As can be seen from Table 5, introducing GAM and RAB together in No.4 (the proposed D2A U-Net) yields the best results among the configurations without a pretrained backbone, and the performance gain exceeds the sum of the gains obtained from each module alone. These results indicate that GAM and RAB reinforce each other. In addition, in No.5, the pretrained backbone (PB) offers better parameter initialization and therefore improves the performance further.

In this paper, we proposed a novel segmentation network, D2A U-Net, for COVID-19 CT segmentation. To refine the feature maps and improve segmentation performance, especially in terms of recall, we present a dual attention strategy consisting of a gate attention module and a decoder attention module. The gate attention module produces a fused attention map on the features extracted by the encoder. The decoder attention module is introduced to the model decoder, where it helps refine the upsampled feature maps after convolution operations. In addition, hybrid dilated convolution combined with the decoder attention module, referred to as the residual attention block, serves as the basic block of the model decoder. Hybrid dilated convolution is used in the decoder to increase the receptive field and improve the quality of feature representation. Experimental results indicate that the proposed network is capable of segmenting COVID-19 lesions from CT slices automatically, and achieves the best results among the popular cutting-edge models evaluated in our experiments.

Our work still has limitations. Only binary segmentation is performed in our experiments, which can limit the model's potential use in both diagnosis and health care. Multi-class segmentation is expected in the future to further evaluate the performance of the proposed model. Also, despite the significantly better performance of our D2A U-Net with the ResNeXt-50 (32 × 4d) backbone, the model has many more parameters than other architectures with similar backbones (FCN8s and DeepLab v3). We believe this is because ResNet family models have a large number of channels (e.g., 1024 and 2048 in the last two layers), so the number of decoder parameters becomes extremely large. This problem might be addressed by introducing the bottleneck design of ResNets to the decoder of D2A U-Net to reduce the number of channels and thus the number of model parameters.
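The potential saving from a bottleneck in the decoder can be estimated with simple arithmetic. The sketch below is an assumption-laden illustration: it takes the 2048 and 1024 channel counts mentioned above, assumes a single 3 × 3 decoder convolution mapping 2048 to 1024 channels, ignores biases and normalization layers, and uses a hypothetical channel reduction factor of 4. It does not describe the actual decoder layout of D2A U-Net.

```python
def conv_params(c_in, c_out, kernel_size):
    """Weight count of a 2D convolution without bias: k * k * C_in * C_out."""
    return kernel_size * kernel_size * c_in * c_out


def bottleneck_params(c_in, c_out, reduction=4):
    """Weight count of a 1x1 -> 3x3 -> 1x1 bottleneck (ResNet style).
    The 3x3 convolution runs on c_out // reduction channels (assumed factor)."""
    c_mid = c_out // reduction
    return (conv_params(c_in, c_mid, 1)      # 1x1 channel reduction
            + conv_params(c_mid, c_mid, 3)   # 3x3 on the reduced channels
            + conv_params(c_mid, c_out, 1))  # 1x1 channel expansion


plain = conv_params(2048, 1024, 3)
bottleneck = bottleneck_params(2048, 1024)

print(plain)       # 18874368
print(bottleneck)  # 1376256
print(round(plain / bottleneck, 1))  # 13.7
```

Under these assumptions a single plain 3 × 3 convolution at the deepest decoder stage holds about 18.9 million weights, while the bottleneck variant needs about 1.4 million. Whether segmentation accuracy is preserved with such a reduction would have to be verified experimentally.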

[1] A novel coronavirus outbreak of global health concern
[2] COVID-19 global cases by Johns Hopkins University
[3] Detection of SARS-CoV-2 in different types of clinical specimens
[4] COVID-19 CT segmentation dataset
[5] CT imaging of the 2019 novel coronavirus (2019-nCoV) pneumonia
[6] Imaging profile of the COVID-19 infection: radiologic findings and literature review
[7] Time course of lung changes on chest CT during recovery from 2019 novel coronavirus (COVID-19) pneumonia
[8] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
[9] An automatic COVID-19 CT segmentation network using spatial and channel attention mechanism
[10] Inf-Net: Automatic COVID-19 lung infection segmentation from CT images
[11] Deep learning-based detection for COVID-19 from chest CT using weak label (medRxiv)
[12] Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia
[13] A review on lung and nodule segmentation techniques
[14] A generic approach to pathological lung segmentation
[15] Computer-aided diagnosis of pulmonary infections using texture analysis and support vector machine classification
[16] Automatic lung segmentation on thoracic CT scans using U-Net convolutional network
[17] Lung segmentation in CT images using a fully convolutional neural network with multi-instance and conditional adversary loss
[18] Control of goal-directed and stimulus-driven attention in the brain
[19] Recurrent models of visual attention
[20] Squeeze-and-excitation networks
[21] CBAM: Convolutional block attention module
[22] Residual attention network for image classification
[23] Pyramid attention network for semantic segmentation
[24] Multi-scale context aggregation by dilated convolutions
[25] Smoothed dilated convolutions for improved dense prediction
[26] ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation
[27] Concentrated-comprehensive convolutions for lightweight semantic segmentation
[28] Understanding convolution for semantic segmentation
[29] Relational modeling for robust and efficient pulmonary lobe segmentation in CT scans
[30] A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images
[31] Residual attention U-Net for automated multi-class segmentation of COVID-19 chest CT images
[32] A rapid, accurate and machine-agnostic segmentation and quantification method for CT-based COVID-19 diagnosis
[33] MiniSeg: An extremely minimum network for efficient COVID-19 segmentation
[34] U-Net: Convolutional networks for biomedical image segmentation
[35] Aggregated residual transformations for deep neural networks
[36] Learning where to look for the pancreas
[37] COVID-19 CT lung and infection segmentation dataset
[38] UNet++: A nested U-Net architecture for medical image segmentation
[39] Fully convolutional networks for semantic segmentation
[40] Rethinking atrous convolution for semantic image segmentation
[41] Does non-COVID-19 lung lesion help? Investigating transferability in COVID-19 CT image segmentation
[42] COVID-19 chest CT image segmentation: a deep convolutional neural network solution
[43] Toward data-efficient learning: A benchmark for COVID-19 CT lung and infection segmentation

This work was partially supported by the Fundamental Research Funds for Central Universities, the National Natural Science Foundation of China (No. 61871022, 61601019), the Beijing Natural Science Foundation (7202102), and the 111 Project (No. B13003).

Highlights:
• A novel network to perform accurate segmentation of COVID-19 lesions in CT images.
• Proposed dual attention mechanism consisting of GAM and DAM is utilized to refine feature maps and reduce semantic gap.
• A novel residual attention block with dilated convolution is introduced to model decoder to obtain large receptive fields and refine segmentation.
• The proposed network has potential in AI-aided diagnosis and prognosis of COVID-19 patients.

We confirm that neither the manuscript nor any parts of its content are currently under consideration or published in another journal. All authors listed have contributed to this manuscript and agreed to submit to your journal. The authors declare that there is no conflict of interest regarding the publication of this paper.
