MiniSeg: An Extremely Minimum Network for Efficient COVID-19 Segmentation

Yu Qiu, Yun Liu, and Jing Xu

April 21, 2020

Y. Qiu is with the College of Artificial Intelligence, Nankai University, Tianjin 300350, China (e-mail: yqiu@mail.nankai.edu.cn). Y. Liu is with the College of Computer Science, Nankai University, Tianjin 300350, China (e-mail: nk12csly@mail.nankai.edu.cn). J. Xu is with the College of Artificial Intelligence, Nankai University, Tianjin 300350, China (e-mail: xujing@nankai.edu.cn). Joint first authors: Y. Qiu and Y. Liu; joint corresponding authors: Y. Liu and J. Xu.

Abstract—The rapid spread of the new pandemic, coronavirus disease 2019 (COVID-19), has seriously threatened global health. The gold standard for COVID-19 diagnosis is the tried-and-true polymerase chain reaction (PCR), but PCR is a laborious, time-consuming, and complicated manual process that is in short supply. Deep learning based computer-aided screening, e.g., infection segmentation, is thus viewed as an alternative due to its great successes in medical imaging. However, the publicly available COVID-19 training data are limited, which easily causes overfitting of traditional deep learning methods that are usually data-hungry with millions of parameters. On the other hand, fast training/testing and low computational cost are also important for the quick deployment and development of computer-aided COVID-19 screening systems, but traditional deep learning methods, especially for image segmentation, are usually computationally intensive. To address these problems, we propose MiniSeg, a lightweight deep learning model for efficient COVID-19 segmentation. Compared with traditional segmentation methods, MiniSeg has several significant strengths: i) it only has 472K parameters and is thus not prone to overfitting; ii) it has high computational efficiency and is thus convenient for practical deployment; iii) it can be quickly retrained by other users on their private COVID-19 data for further improving performance. In addition, we build a comprehensive COVID-19 segmentation benchmark for comparing MiniSeg with traditional methods. Code and models will be released to promote the research and practical deployment of computer-aided COVID-19 screening.

As one of the most serious pandemics in human history, coronavirus disease 2019 (COVID-19) continues to threaten global health, with thousands of newly infected patients every day. Since the end of 2019, COVID-19 has infected more than 2.0M people and caused 130K deaths. COVID-19 is caused by infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which can be spread by breathing, coughing, sneezing, or other means of excreting infectious viruses. Effective screening of infected patients is of high importance to the fight against COVID-19 because i) early diagnosis of COVID-19 can help mitigate the spread of the virus by isolating infected patients early; ii) early treatment greatly improves the chances of survival; and iii) confirmed patients should also be repeatedly diagnosed in order to confirm whether the virus has been cleared.
The gold standard for COVID-19 diagnosis is the tried-and-true polymerase chain reaction (PCR) [1], which detects nucleic acid (e.g., RNA) of SARS-CoV-2 from respiratory specimens (e.g., nasopharyngeal swabs, oropharyngeal swabs, nasal mid-turbinate swabs, and anterior nares swabs) in laboratories of Biosafety Level 2 (BSL-2). As a laborious, time-consuming, and complicated manual process, PCR testing is in short supply, although many countries are racing to increase supplies. Another COVID-19 screening method is radiography examination, i.e., analyzing chest radiography images such as X-ray and computed tomography (CT). As found in recent studies [2], [3], characteristic abnormalities exist in the chest radiography images of infected patients. Some researchers suggest that chest CT should be considered as a primary tool for COVID-19 screening in epidemic areas [4]. Since chest radiography imaging can be easily conducted in modern hospitals, radiography examination is faster and cheaper than PCR testing and, in some cases, even shows higher sensitivity [5]. However, the bottleneck is that radiography examination requires expert radiologists to interpret the images, while human eyes are not sensitive enough to subtle visual indicators, especially in an exhausted state of overwork [6], [7]. Therefore, computer-aided systems are expected for automatic and accurate radiography interpretation: they can learn to capture subtle details and, unlike human beings, never tire.

When it comes to computer-aided COVID-19 screening, deep learning based technology is a natural choice due to its numerous success stories in computer vision and medical imaging. In some cases, deep learning can even outperform human beings [6], [8]-[10]. However, directly applying traditional deep learning models to COVID-19 screening is suboptimal. On one hand, these models usually have millions of parameters and thus require a large amount of labeled data for training, but the publicly available COVID-19 data are limited and thus easily cause overfitting of traditional data-hungry models. On the other hand, traditional deep learning methods, especially those for image segmentation, are usually computationally intensive. Considering the current severe pandemic situation, fast training/testing and a low computational load are essential for the quick deployment and development of computer-aided COVID-19 screening systems.

It is widely accepted that overfitting is more likely when a model has more parameters and less training data. To address the above problems of COVID-19 segmentation, we observe that lightweight networks are not only less prone to overfitting, owing to their small number of parameters, but also likely to be efficient, making them suitable for computer-aided COVID-19 screening systems. Therefore, we adopt lightweight COVID-19 segmentation as the technical route of this paper. The key question is how to achieve accurate segmentation under tight constraints on the number of network parameters and computational cost. Although replacing the vanilla convolution with the combination of depthwise separable convolution (DSConv) and pointwise convolution [11], [12] can reduce the number of network parameters, the accuracy usually decreases as the network shrinks [11]-[15]. To achieve our goal, we find that the accuracy of image segmentation can be improved with better multi-scale learning.
Many multi-scale learning strategies have significantly pushed forward the state of the art in segmentation [16]-[24]. Hence, a proper multi-scale learning strategy has the potential to ensure the segmentation accuracy of lightweight networks. Based on the above analysis, our effort starts with the design of an Attentive Hierarchical Spatial Pyramid (AHSP) module for effective lightweight multi-scale learning. AHSP first builds a spatial pyramid of dilated depthwise separable convolutions and feature pooling for learning multi-scale semantic features. Then, the learned multi-scale features are fused hierarchically to enhance the capacity of multi-scale representation. Finally, the multi-scale features are merged under the guidance of an attention mechanism, which learns to highlight essential information and filter out noisy information in radiography images.

With the AHSP module incorporated, we propose an extremely minimum network for efficient segmentation of COVID-19 infected areas in chest CT images. Our method, namely MiniSeg, only has 472K parameters, two orders of magnitude fewer than traditional image segmentation methods, so that the currently limited COVID-19 data suffice for training MiniSeg. We build a comprehensive COVID-19 segmentation benchmark, including well-known methods for both medical image segmentation and semantic image segmentation, for extensively comparing MiniSeg with previous state-of-the-art methods. Experiments demonstrate that MiniSeg performs favorably against previous state-of-the-art segmentation methods with high efficiency and limited COVID-19 training data. Code and models will be released to promote future research on and deployment of computer-aided COVID-19 screening. In summary, our contributions are threefold:

• We propose an Attentive Hierarchical Spatial Pyramid (AHSP) module for effective lightweight multi-scale learning, which plays an essential role in image segmentation.

• With the AHSP module incorporated, we present an extremely minimum network, MiniSeg, for accurate and efficient COVID-19 segmentation with limited training data.

• For extensive comparison of MiniSeg with previous state-of-the-art segmentation methods, we build a comprehensive COVID-19 segmentation benchmark on which MiniSeg performs favorably against previous competitors with high efficiency.

In this section, we briefly review recent developments in image segmentation and techniques for designing efficient networks. We also discuss some recent studies on computer-aided COVID-19 screening.

Image segmentation is a hot topic due to its wide range of applications. Since the invention of fully convolutional networks (FCNs) [25], FCN-based methods have dominated this field. Objects in images usually exhibit very large scale changes, so multi-scale learning plays an essential role in image segmentation. Hence, most current state-of-the-art methods aim at designing FCNs to learn effective multi-scale representations from input images. For example, Ronneberger et al. [26] proposed the well-known U-Net architecture, an encoder-decoder network that fuses deep features from the top to the bottom layers. DeconvNet [27] and SegNet [28] carefully refine the U-Net architecture. U-Net++ [29] improves U-Net by introducing a series of nested, dense skip connections between the encoder and decoder sub-networks. Attention U-Net [30] improves U-Net by using the attention mechanism to learn to focus on target structures.
Some studies [31]-[34] also aggregate multi-scale deep features from multi-level layers for the final dense prediction. DeepLab [16] and its variants [17], [18], [35] design atrous spatial pyramid pooling (ASPP) modules, using dilated convolutions with different dilation rates to learn multi-scale features. Based on ASPP, DenseASPP [19] connects a set of dilated convolutional layers densely, so that it generates multi-scale features that cover a larger scale range densely. Besides multi-scale learning, some studies focus on exploiting global context information through pyramid pooling [20], context encoding [36], or non-local operations [37], [38]. Moreover, DFN [24] introduces a smooth network to handle the intra-class inconsistency problem and a border network to make the bilateral features of boundaries distinguishable. Wu et al. [39] tried to find a good compromise between network depth and width to improve segmentation accuracy. Some methods [35], [40], [41] use conditional random fields (CRFs) or Markov random fields (MRFs) to model spatial relationships in semantic segmentation. The above models aim at improving segmentation accuracy without considering model size or inference speed, so they are impractical for COVID-19 segmentation, which only has limited training data and requires high efficiency. In the experiment section, we will show that the limited COVID-19 training data cannot optimize these large models well.

Lightweight networks aim at reducing the parameters and improving the efficiency of deep networks. Convolutional factorization is an intuitive way to reduce the computational complexity of convolution operations. Specifically, many well-known network architectures decompose the standard convolution into multiple steps to reduce the computational complexity, including the Flattened Model [42], Inception networks [43]-[45], Xception [11], ResNeXt [46], and MobileNets [12], [13]. Among them, Xception [11] and MobileNets [12], [13] factorize a convolution operation into a pointwise convolution and a DSConv. The pointwise convolution is a 1 × 1 convolution used for interaction among channels. The depthwise convolution is a grouped convolution with the number of groups equal to the number of output channels, so that it processes each feature channel separately. ShuffleNets [14], [15] further factorize a pointwise convolution into a channel shuffle operation and a grouped pointwise convolution to reduce the parameters and complexity. A standard 2D convolution can also be factorized into two asymmetric 1D convolutions [43]-[45], [47]. Some studies focus on designing efficient semantic segmentation networks [21], [22], [43], [48]-[50]. ESPNet [21] decomposes the standard convolution into a pointwise convolution and a spatial pyramid of dilated convolutions. ESPNetv2 [22] extends ESPNet [21] using grouped pointwise and dilated depthwise separable convolutions. For COVID-19 segmentation, our proposed MiniSeg should have a small number of parameters so that it can be trained with limited data. Our observation of the essential role of multi-scale learning in image segmentation helps MiniSeg achieve higher accuracy while running at a fast speed.
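To make the parameter saving of the pointwise + depthwise factorization concrete, here is a quick back-of-the-envelope count; the channel width of 256 is an arbitrary illustrative choice, not a value from the paper:

```python
# Parameter count of one 3x3 convolutional layer with C input and C output
# channels, in the vanilla form vs. the depthwise-separable factorization.
C, k = 256, 3                         # illustrative channel width and kernel size
vanilla = k * k * C * C               # vanilla 3x3 convolution: 589,824 weights
factorized = k * k * C + C * C        # depthwise 3x3 + pointwise 1x1: 67,840 weights
print(f"reduction: {vanilla / factorized:.1f}x")  # ~8.7x fewer parameters
```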
Another way to build efficient networks is network compression. Previous studies have adopted various techniques, such as shrinking [51], parameter quantization [52], pruning [53], and hashing [54], to compress networks. Some research [55]-[57] also quantizes network weights into low bits to reduce the model size and computational complexity. In this paper, we must avoid training large networks owing to the shortage of COVID-19 data. Therefore, these methods are unsuitable for our goal, because they aim at compressing pretrained large networks rather than directly training a lightweight model.

Computer-aided COVID-19 screening has already attracted the attention of medical imaging researchers as a way to alleviate the shortage of PCR supply. Some studies [58]-[60] design deep neural networks to classify chest CT images for COVID-19 screening, but their code is not released. Inspired by the open-source efforts of the research community, a chest X-ray classification network [7] was proposed for COVID-19 screening, and its code is publicly available. There are also some other X-ray classification based COVID-19 screening networks [61], [62]. In this paper, we focus on segmenting COVID-19 infected areas from chest CT images, because segmentation can provide more useful information than image classification, and chest CT has been demonstrated to be a very useful tool for COVID-19 screening [4], [5].

In this section, we first introduce our Attentive Hierarchical Spatial Pyramid (AHSP) module for effective and lightweight multi-scale learning. Then, we present the network architecture of MiniSeg for the segmentation of COVID-19 infected lung areas. At last, we provide the training strategies of MiniSeg.

Although the factorization of a convolution operation into a pointwise convolution and a depthwise separable convolution (DSConv) can significantly reduce the number of network parameters and the computational complexity, it usually comes with a decrease in accuracy [11]-[15]. Inspired by the fact that effective multi-scale learning plays an essential role in improving segmentation accuracy [16]-[24], we propose the AHSP module for effective multi-scale learning in a lightweight and efficient setting. Besides common convolution operations such as the vanilla convolution, pointwise convolution, and DSConv, we introduce the dilated DSConv, which adopts a dilated convolution kernel for each input channel. Let $F^{k \times k}_r$ denote a vanilla convolution, where $k \times k$ is the size of the convolution kernel and $r$ is the dilation rate. Let $\tilde{F}^{k \times k}_r$ denote a depthwise separable convolution, where $k \times k$ and $r$ have the same meanings as in $F^{k \times k}_r$. The subscript $r$ is omitted without ambiguity if the dilation rate is 1, i.e., $r = 1$. For example, $F^{1 \times 1}$ represents a pointwise convolution (i.e., a $1 \times 1$ convolution), and $\tilde{F}^{3 \times 3}_2$ represents a dilated DSConv with a dilation rate of 2.
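As a concrete reference for this notation, here is a minimal PyTorch sketch of the two basic operations; the helper names are ours, not the paper's:

```python
import torch.nn as nn

def pointwise_conv(in_ch, out_ch, groups=1):
    """F^{1x1}: a 1x1 convolution for cross-channel interaction."""
    return nn.Conv2d(in_ch, out_ch, kernel_size=1, groups=groups, bias=False)

def dilated_dsconv(channels, k=3, dilation=1, stride=1):
    """~F^{kxk}_r: a depthwise convolution (groups == channels) whose kernel
    is dilated by rate r; the padding keeps the size when stride == 1."""
    return nn.Conv2d(channels, channels, kernel_size=k, stride=stride,
                     padding=dilation * (k - 1) // 2, dilation=dilation,
                     groups=channels, bias=False)
```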
With the above definitions of basic operations, we continue by introducing the proposed AHSP module, which is illustrated in Fig. 1. Let $X \in \mathbb{R}^{C \times H \times W}$ be the input feature map, so that the output feature map is $\mathcal{H}(X) \in \mathbb{R}^{C' \times H' \times W'}$, where $\mathcal{H}$ denotes the transformation function of AHSP for its input. $C$, $H$, and $W$ are the number of channels, height, and width of the input feature map $X$, respectively; similar definitions hold for $C'$, $H'$, and $W'$. The input feature map $X$ is first processed by a pointwise convolution to shrink the number of channels to $C'/K$, in which $K$ is the number of parallel dilated branches described below. This operation can be written as

$S = F^{1 \times 1}(X)$.    (1)

Then, the generated feature map $S$ is fed into $K$ parallel dilated DSConvs, i.e.,

$F_k = \tilde{F}^{3 \times 3}_{2^{k-1}}(S), \quad k \in \{1, 2, \cdots, K\}$,    (2)

where the dilation rate is increased exponentially for enlarging the receptive field. Equation (2) is the basis for multi-scale learning, with large dilation rates capturing large-scale information and small dilation rates capturing local information. We also apply an average pooling operation to $S$ to enrich the multi-scale information, i.e.,

$F_0 = \mathrm{AvgPool}^{3 \times 3}(S)$,    (3)

where $\mathrm{AvgPool}^{3 \times 3}$ represents average pooling with a kernel size of $3 \times 3$. Note that we have $F_k \in \mathbb{R}^{\frac{C'}{K} \times H' \times W'}$ for $k = 0, 1, \cdots, K$. If $H' \neq H$ or $W' \neq W$, the convolution and pooling operations in (2) and (3) have a stride of 2 to downsample the feature map by a factor of 2; otherwise, the stride is 1.

The multi-scale feature maps produced by (2) and (3) are merged in an attentive hierarchical manner. We first add them up hierarchically as

$\dot{F}_k = \dot{F}_{k-1} + F_k, \quad \dot{F}_0 = F_0, \quad k \in \{1, 2, \cdots, K\}$,    (4)

where feature maps are gradually fused from small scales to large scales to enhance the representation capability of multi-scale learning. We further adopt a spatial attention mechanism to make the AHSP module automatically learn to focus on target structures of varying scales. The attention mechanism can also learn to suppress irrelevant information at some feature scales and emphasize essential information at others. Such self-attention makes each scale speak for itself to decide how important it is in the multi-scale learning process. The transformation of $\dot{F}_k$ by spatial attention can be formulated as

$\tilde{F}_k = \sigma(F^{1 \times 1}(\dot{F}_k)) \otimes \dot{F}_k$,    (5)

in which $\sigma$ is the sigmoid activation function and $\otimes$ indicates element-wise multiplication. The pointwise convolution in (5) outputs a single-channel feature map, which is then transformed into a spatial attention map by the sigmoid function. This attention map is replicated to the same size as $\dot{F}_k$, i.e., $\frac{C'}{K} \times H' \times W'$, before the element-wise multiplication. For efficiency, we can compute the attention maps for all $K$ branches together, as in

$A = \sigma(F^{1 \times 1}(\mathrm{Concat}(\dot{F}_1, \dot{F}_2, \cdots, \dot{F}_K)))$,    (6)

where $\mathrm{Concat}(\cdot)$ concatenates a series of feature maps along the channel dimension. The pointwise convolution in (6) is a $K$-grouped convolution with $K$ output channels, so we have $A \in \mathbb{R}^{K \times H' \times W'}$. Hence we can rewrite (5) as

$\tilde{F}_k = A[k] \otimes \dot{F}_k$,    (7)

in which $A[k]$ denotes the $k$-th channel of $A$. Note that (5) is equivalent to (6) and (7). Finally, we merge and fuse the above hierarchical feature maps as

$\tilde{F} = \mathrm{Concat}(\tilde{F}_1, \tilde{F}_2, \cdots, \tilde{F}_K), \quad \mathcal{H}(X) = \mathrm{PReLU}(\mathrm{BatchNorm}(F^{1 \times 1}(\tilde{F})))$,    (8)

where $\mathrm{BatchNorm}(\cdot)$ denotes batch normalization [63] and $\mathrm{PReLU}(\cdot)$ indicates the PReLU (i.e., Parametric ReLU) activation function [9]. The pointwise convolution in (8) is a $K$-grouped convolution with $C'$ output channels, so this pointwise convolution fuses each $\tilde{F}_k$ ($k = 1, 2, \cdots, K$) separately, i.e., it adds cross-channel connections for the depthwise convolutions in (2). The fusion among hierarchical scales is conducted by the first pointwise convolution in the next AHSP module of MiniSeg, which means that (1) also serves to fuse features of various scales from the previous AHSP module. Such a design reduces the number of convolution parameters in (8) by $K$ times compared with a vanilla pointwise convolution, i.e., $C'^2/K$ vs. $C'^2$.

Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, we can compute the output feature map $\mathcal{H}(X) \in \mathbb{R}^{C' \times H' \times W'}$ of an AHSP module using (1)-(8). We can easily see that increasing $K$ reduces the number of AHSP parameters. Considering the balance between segmentation accuracy and efficiency, we set $K = 4$ in our experiments. The proposed AHSP module not only significantly reduces the number of parameters but also enables learning effective multi-scale features, so that we can use the limited COVID-19 data to train a high-quality segmenter.
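To make the data flow of (1)-(8) concrete, the following is a minimal PyTorch sketch of an AHSP-style block under the reconstruction above. It is not the authors' released implementation: the class interface, placing BatchNorm/PReLU only in the final fusion, and the stride handling are our assumptions.

```python
import torch
import torch.nn as nn

class AHSP(nn.Module):
    """Sketch of the Attentive Hierarchical Spatial Pyramid module."""
    def __init__(self, in_ch, out_ch, K=4, stride=1):
        super().__init__()
        assert out_ch % K == 0
        self.K, branch_ch = K, out_ch // K
        self.reduce = nn.Conv2d(in_ch, branch_ch, 1, bias=False)          # (1)
        self.branches = nn.ModuleList([                                   # (2)
            nn.Conv2d(branch_ch, branch_ch, 3, stride=stride,
                      padding=2 ** k, dilation=2 ** k, groups=branch_ch,
                      bias=False)
            for k in range(K)])                                           # rates 1,2,4,8
        self.pool = nn.AvgPool2d(3, stride=stride, padding=1)             # (3)
        self.attn = nn.Conv2d(out_ch, K, 1, groups=K, bias=False)         # (6)
        self.fuse = nn.Sequential(                                        # (8)
            nn.Conv2d(out_ch, out_ch, 1, groups=K, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(out_ch))

    def forward(self, x):
        s = self.reduce(x)
        f = self.pool(s)                                   # F_0
        hier = []                                          # (4): hierarchical sums
        for branch in self.branches:                       # F_1 .. F_K
            f = f + branch(s)
            hier.append(f)
        a = torch.sigmoid(self.attn(torch.cat(hier, 1)))   # (6): one map per branch
        out = torch.cat([a[:, k:k + 1] * hier[k]           # (7): A[k] * F_dot_k
                         for k in range(self.K)], 1)
        return self.fuse(out)                              # (8)
```

For instance, AHSP(64, 128, K=4, stride=2) would halve the spatial resolution while expanding to 128 channels.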
Here, we describe in detail the network architecture of the proposed lightweight COVID-19 segmenter, i.e., MiniSeg. MiniSeg has an encoder-decoder structure, where the encoder sub-network focuses on learning effective multi-scale representations of the input image, while the decoder sub-network gradually aggregates the representations at different levels of the encoder to predict COVID-19 infected areas. The network architecture of MiniSeg is displayed in Fig. 2.

The encoder sub-network uses AHSP as the basic module and consists of two paths that are connected through a series of nested skip pathways. Suppose $I \in \mathbb{R}^{3 \times H \times W}$ denotes an input chest CT image, where a grayscale CT image is replicated three times to make its number of channels the same as that of color images. The input $I$ is downsampled four times in the encoder sub-network, resulting in four scales of 1/2, 1/4, 1/8, and 1/16; we only downsample until the 1/16 scale for enlarging the receptive field while limiting the computational complexity. In the encoder sub-network, we denote the output feature map of the $j$-th block in the $i$-th stage as $E^i_j$, w.r.t. $i \in \{1, 2, 3, 4\}$ and $j \in \{1, 2, \cdots, N_i\}$, where $N_i$ indicates the number of blocks in the $i$-th stage. Therefore, we have $E^i_j \in \mathbb{R}^{C_i \times \frac{H}{2^i} \times \frac{W}{2^i}}$, in which $C_i$ is the number of feature channels at the $i$-th stage. The above-mentioned block refers to the proposed AHSP module, except for the first stage, whose basic block is the vanilla Convolution Block (CB). Since the number of feature channels at the first stage (i.e., $C_1$) is small, the vanilla convolution does not introduce too many parameters. Without ambiguity, let $H^i_j(\cdot)$ be the transformation function of the $j$-th block in the $i$-th stage, without distinguishing whether this block is a vanilla convolution or an AHSP module.

For the other path, we propose a Downsampler Block (DB). The transformation function of a DB is denoted as $\bar{H}^i(\cdot)$ ($i \in \{1, 2, 3, 4\}$):

$\bar{H}^i(X) = \tilde{F}^{5 \times 5}(X)$,    (9)

where $\tilde{F}^{5 \times 5}(\cdot)$ has a stride of 2 for downsampling. Suppose the output of $\bar{H}^i(\cdot)$ is $\bar{E}^i$. For the first block of the first stage, we have

$E^1_1 = H^1_1(I)$.    (10)

For the first block of the other stages, we compute the output feature map as

$E^i_1 = H^i_1(E^{i-1}_{N_{i-1}}), \quad i \in \{2, 3, 4\}$.    (11)

Here, $H^i_1(\cdot)$ ($i \in \{1, 2, 3, 4\}$) has a stride of 2 for feature downsampling by a scale of 2. For the other blocks, the output feature map is computed as

$E^i_j = H^i_j(E^i_{j-1}) + E^i_{j-1} + \bar{E}^i, \quad j \in \{2, 3, \cdots, N_i\}$,    (12)

where $H^i_j(\cdot)$ has a stride of 1 and a residual connection is included for better optimization. As shown in (12), the output of the other path, $\bar{H}^i(\cdot)$, is connected to the main path. The computation of $\bar{E}^i$ can be formulated as

$\bar{E}^i = \bar{H}^i(\bar{E}^{i-1}), \quad i \in \{2, 3, 4\}$,    (13)

except for $i = 1$:

$\bar{E}^1 = \bar{H}^1(I)$.    (14)

Through (12) and (13), the two paths of the encoder sub-network build nested skip connections. Such a design benefits the multi-scale learning of the encoder. Considering the balance among the number of network parameters, segmentation accuracy, and efficiency, we set $C_i$ to 16, 64, 128, and 256, and set $N_i$ to 3, 4, 9, and 7, for $i = 1, 2, 3, 4$, respectively.
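Under the reconstruction of (9)-(14) above, the two-path encoder can be sketched as follows. The DB internals beyond the strided 5 × 5 DSConv, and the exact point where the downsampler-path feature joins the main path, are assumptions on our part (this sketch reuses the AHSP class from the previous listing):

```python
class DownsamplerBlock(nn.Module):
    """DB sketch: a 5x5 DSConv with stride 2, per (9); the pointwise
    projection to the stage width C_i is our assumption."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 5, stride=2, padding=2,
                            groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pw(self.dw(x))

class EncoderStage(nn.Module):
    """One encoder stage: a strided first block (11) followed by N_i - 1
    residual blocks, each also receiving the downsampler-path feature
    e_bar, which must match the stage's shape (cf. (12))."""
    def __init__(self, in_ch, out_ch, num_blocks, K=4):
        super().__init__()
        self.first = AHSP(in_ch, out_ch, K=K, stride=2)
        self.rest = nn.ModuleList(
            [AHSP(out_ch, out_ch, K=K) for _ in range(num_blocks - 1)])

    def forward(self, x, e_bar):
        e = self.first(x)
        for block in self.rest:
            e = block(e) + e + e_bar   # residual + skip from the second path
        return e

# Stage widths and depths from the text: C_i = (16, 64, 128, 256) and
# N_i = (3, 4, 9, 7) for i = 1..4; the first stage uses plain convolution
# blocks (CB) instead of AHSP.
```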
The decoder sub-network is kept simple for efficient multi-scale feature decoding. Since the top feature map of the encoder has a scale of 1/16 of the original input, it is suboptimal to predict COVID-19 infected areas from it directly, owing to the loss of fine details. We instead utilize a simple decoder sub-network to gradually upsample and fuse the learned feature map at each scale of the encoder sub-network. A Feature Fusion Module (FFM) is proposed for feature aggregation. Let $\hat{H}^i(\cdot)$ represent the function of FFM, which fuses two input feature maps as

$\hat{H}^i(X_1, X_2) = F^{1 \times 1}(\mathrm{Concat}(X_1, X_2))$,    (15)

in which the output of $\hat{H}^i$ ($i = 1, 2, 3$) has $C_i$ channels, and the pointwise convolution is utilized to adjust the number of channels.

We denote the feature map of the decoder at the $i$-th scale as $D_i$, with $D_4 = E^4_{N_4}$. We compute the other $D_i$ ($i = 3, 2, 1$) as

$D_i = \hat{H}^i(E^i_{N_i}, \mathrm{Upsample}(D_{i+1}, 2))$,    (16)

where $\mathrm{Upsample}(\cdot, t)$ means to upsample a feature map by a scale of $t$ using bilinear interpolation. In this way, the decoder sub-network enhances the high-level semantic features with low-level fine details, so that MiniSeg can make accurate predictions for COVID-19 infected areas. With $D_i$ ($i = 1, 2, 3, 4$) computed, we can make dense predictions using a pointwise convolution, i.e.,

$P_i = \mathrm{Softmax}(F^{1 \times 1}(D_i))$,    (17)

where $\mathrm{Softmax}(\cdot)$ is the standard softmax function and this pointwise convolution has two output channels, representing the two classes of background and COVID-19 infected areas, respectively. $P_i \in \mathbb{R}^{1 \times H \times W}$ is the predicted class label map. We utilize $P_1$ as the final output prediction. In the training, we impose deep supervision [66] by replacing the softmax function in (17) with the well-known cross-entropy loss function, i.e.,

$L_i = \mathrm{CEL}(F^{1 \times 1}(D_i), G)$,    (18)

in which $\mathrm{CEL}(\cdot)$ indicates the standard cross-entropy loss function and $G$ is the ground-truth label map. The total loss is calculated as

$L = L_1 + \lambda \sum_{i=2}^{4} L_i$,    (19)

where $\lambda$ is a weight to balance the losses at different stages. In this paper, we follow previous studies [20], [36], [51] and empirically set $\lambda$ to 0.4.

Implementation details. We implement the proposed MiniSeg network using the well-known PyTorch [67] framework. Adam optimization [68] is used for training with a weight decay of 1e-4. We adopt the poly learning rate policy with an initial learning rate of 1e-3 and train for 100 epochs on the training set with a batch size of 5. Note that we train all previous state-of-the-art segmentation methods using the same training settings as our MiniSeg for a fair comparison. All experiments are performed on a TITAN RTX GPU.
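As a minimal sketch of the deep-supervision objective in (18)-(19), assuming each side output is bilinearly upsampled to the label resolution before the cross-entropy (an implementation detail the text does not spell out):

```python
import torch.nn.functional as F

def miniseg_loss(side_logits, target, lam=0.4):
    """side_logits: [P1, P2, P3, P4] logits, each of shape (N, 2, h_i, w_i);
    target: (N, H, W) integer label map G. Returns L = L1 + lam*(L2+L3+L4)."""
    losses = []
    for logits in side_logits:
        logits = F.interpolate(logits, size=target.shape[-2:],
                               mode='bilinear', align_corners=False)
        losses.append(F.cross_entropy(logits, target))
    return losses[0] + lam * sum(losses[1:])
```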
We utilize an open-access COVID-19 CT segmentation dataset [69] to evaluate MiniSeg. Due to the constraints of personal privacy and government policy, this dataset is the only publicly available COVID-19 segmentation dataset. It consists of 100 axial CT images from about 60 patients with COVID-19, so it is a small but diverse dataset, with each patient contributing about 1.6 axial CT images. Each CT image is carefully annotated by a radiologist to provide the segmentation mask of COVID-19 infected areas. We randomly choose 60 CT images for training and use the other 40 CT images for performance evaluation. In the training, we resize each image into multiple scales, i.e., 0.5, 0.75, 1.0, 1.25, and 1.5, and we also utilize standard cropping and random flipping for data augmentation. Thanks to the high diversity of this dataset, the evaluation results on it are representative.

Evaluation metrics. In this paper, we evaluate the COVID-19 segmentation accuracy using five evaluation metrics that are widely used in medical imaging analysis, i.e., mean intersection over union (mIoU), sensitivity (SEN), specificity (SPC), Dice similarity coefficient (DSC), and the Hausdorff distance (HD). The metric of mIoU is a typical measure for semantic segmentation, computed as the overlap rate between prediction and ground truth for each class, averaged across all classes. Sensitivity represents the fraction of the ground-truth COVID-19 infected area that is correctly predicted as infected. Specificity represents the fraction of the ground-truth background region that is correctly predicted as background. The Dice similarity coefficient is an overlap index that represents the degree of similarity between the predicted and labeled COVID-19 areas. The Hausdorff distance (HD) measures the structural difference between two given objects: it is the maximum distance from a point on one boundary to the closest point on the other boundary. These metrics are defined as follows:

$\mathrm{SEN} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{SPC} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \quad \mathrm{DSC} = \frac{2 \cdot \mathrm{TP}}{2 \cdot \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$,

$\mathrm{mIoU} = \frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} + \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN} + \mathrm{FP}}\right), \quad \mathrm{HD} = \max(h(S_m, S_a), h(S_a, S_m))$,

where TP, FP, TN, and FN indicate the number of pixels in the true positive, false positive, true negative, and false negative regions, respectively. $S_m = \{s_{m1}, s_{m2}, \dots\}$ is the curve generated from the ground truth of the COVID-19 infected area, and $S_a = \{s_{a1}, s_{a2}, \dots\}$ is the curve produced by a segmentation method. Here, $h(S_m, S_a) = \max_{s_m \in S_m} \min_{s_a \in S_a} \|s_m - s_a\|$, where $\|\cdot\|$ is the Euclidean distance; $h(S_a, S_m)$ is defined analogously. SEN, SPC, and DSC range between 0 and 1, and larger values indicate a better model, while a lower HD indicates better segmentation accuracy.
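For reference, the five metrics can be computed from a pair of binary masks as below. Using all foreground pixels as the point sets for the Hausdorff distance (rather than extracted contour points) is a simplification on our part, and the sketch assumes both masks are non-empty:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def covid_seg_metrics(pred, gt):
    """pred, gt: binary HxW masks (1 = infected area). Follows the
    definitions above; mIoU averages foreground and background IoU."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    tn = np.sum(~pred & ~gt)
    fn = np.sum(~pred & gt)
    sen = tp / (tp + fn)                           # sensitivity
    spc = tn / (tn + fp)                           # specificity
    dsc = 2 * tp / (2 * tp + fp + fn)              # Dice coefficient
    miou = 0.5 * (tp / (tp + fp + fn) + tn / (tn + fn + fp))
    s_a, s_m = np.argwhere(pred), np.argwhere(gt)  # point sets S_a, S_m
    hd = max(directed_hausdorff(s_m, s_a)[0],      # h(S_m, S_a)
             directed_hausdorff(s_a, s_m)[0])      # h(S_a, S_m)
    return dict(mIoU=miou, SEN=sen, SPC=spc, DSC=dsc, HD=hd)
```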
Before comparing with state-of-the-art competitors, we conduct ablation studies to demonstrate the effectiveness of our model components. The effect of the main components is summarized in Table I. We start with a single-branch module that only has a DSConv with a dilation rate of 1. Replacing all AHSP modules in MiniSeg with such single-branch modules and removing the two-path design of the MiniSeg encoder, we achieve an mIoU of 78.10%. Then, we extend this single-branch module to a multi-branch module using the spatial pyramid as in the AHSP module; such a multi-branch module improves the mIoU to 78.79%, which demonstrates the importance of multi-scale learning. Next, we add our attentive hierarchical fusion strategy to obtain the full AHSP module, improving the mIoU to 80.30%, which proves the superiority of the attentive hierarchical fusion. We continue by adding the two-path design to the encoder sub-network to recover MiniSeg, achieving an mIoU of 81.27%, which validates that the two-path design benefits network optimization. At last, pretraining the MiniSeg encoder sub-network on the ImageNet dataset [70] further pushes the mIoU to 82.12%. These ablation studies demonstrate that the main components of MiniSeg are all effective for COVID-19 segmentation.

Besides the above main components, we also conduct ablation studies on some design choices of MiniSeg. The results are shown in Table II. We first replace the PReLU activation function [9] with the ReLU function [71]. Then, we remove the decoder sub-network and change the stride of the last stage from 2 to 1, so that we directly make predictions at the 1/8 scale and upsample to the original size, as in previous studies [16], [19]-[21], [23], [24], [36]-[38], [47]-[49], [64], [65]. Next, we remove the deep supervision in training, which means that only $L_1$ in (19) is used. Furthermore, we replace the Convolution Blocks (CB) in the first stage with AHSP modules. Besides, we replace the 5 × 5 DSConv in the Downsampler Blocks (DB) with a 3 × 3 DSConv. At last, we replace the Feature Fusion Modules (FFM) in the decoder sub-network with AHSP modules. We can see that the default setting achieves the best overall performance, especially in terms of the most important and best-known mIoU metric.

To compare MiniSeg with previous state-of-the-art competitors and promote research on COVID-19 segmentation, we build a comprehensive benchmark. This benchmark contains 26 previous state-of-the-art image segmentation methods, including U-Net [26], FCN-8s [25], FRRN [23], SegNet [28], PSPNet [20], DeepLabv3 [17], DeepLabv3+ [18], UNet++ [29], Attention U-Net [30], BiSeNet [72], DenseASPP [19], DFN [24], EncNet [36], OCNet [73], DANet [74], MobileNet [12], MobileNetv2 [13], MobileNetv3 [75], ShuffleNet [14], ShuffleNetv2 [15], ENet [64], CGNet [48], EDANet [49], LEDNet [50], ESPNet [21], and ESPNetv2 [22]. Among them, MobileNet, MobileNetv2, MobileNetv3, ShuffleNet, and ShuffleNetv2 are designed for lightweight image classification; we view them as encoders and add the decoder of MiniSeg to them, so that they are reformed into image segmentation models. Besides, ENet, CGNet, EDANet, LEDNet, ESPNet, and ESPNetv2 are well-known lightweight segmentation models. We not only evaluate these methods using the five widely used metrics for medical image segmentation but also report their numbers of parameters, numbers of FLOPs, and speed, where FLOPs and speed are measured using a 512 × 512 input image on a TITAN RTX GPU. We believe this benchmark will be useful for future research on COVID-19 segmentation.

The evaluation results of MiniSeg and the other competitors are displayed in Table III. We can see that lightweight models tend to outperform traditional segmentation networks, which supports our conjecture that networks with a small number of parameters are more suitable for COVID-19 segmentation due to the limited training data. Among traditional large networks, FCN-8s [25] achieves the best performance. We attribute this to the fact that the simple FCN-8s is easier to train than other carefully designed networks, which further demonstrates that previous state-of-the-art segmentation networks with too many parameters are not suitable for COVID-19 segmentation. Lightweight models, including MobileNets [12], [13], [75], ShuffleNets [14], [15], and lightweight segmentation models, achieve very competitive performance, but the proposed MiniSeg is more stable across the various evaluation metrics. In terms of the best-known metric, mIoU, MiniSeg achieves the best performance whether trained with or without ImageNet pretraining [70]. This demonstrates the superiority of MiniSeg in COVID-19 infected area segmentation. Compared with other methods, MiniSeg consistently achieves the best performance in terms of four metrics: mIoU, SPC, DSC, and HD. For the metric of SEN, MiniSeg performs slightly worse than the best method. Furthermore, we note that no competitor performs consistently well on all metrics. The fact that MiniSeg consistently outperforms the other competitors demonstrates its effectiveness. MiniSeg also has a low computational load (i.e., fewer FLOPs) and a fast speed, making it convenient for practical deployment, which is of high importance in the current serious situation of COVID-19. Fig. 3 provides some examples of segmenting COVID-19 infected areas from chest CT images using MiniSeg. We can observe that the results of MiniSeg are very close to the ground-truth segmentation.

In this paper, we focus on segmenting COVID-19 infected areas from chest CT images. To address the lack of COVID-19 training data and meet the efficiency requirements for the deployment of computer-aided COVID-19 screening systems, we propose an extremely minimum network, i.e., MiniSeg, for accurate and efficient COVID-19 segmentation.
MiniSeg adopts a novel multi-scale learning module, i.e., the Attentive Hierarchical Spatial Pyramid (AHSP) module, to ensure its accuracy under the constraint of an extremely small network size. To extensively compare MiniSeg with previous state-of-the-art image segmentation methods and promote research on COVID-19 segmentation, we build a comprehensive benchmark that will be useful for future research. The comparison of MiniSeg with state-of-the-art image segmentation methods demonstrates that MiniSeg not only achieves the best performance but also has high efficiency. The code and models of this paper will be released.

References

[1] Detection of SARS-CoV-2 in different types of clinical specimens.
[2] Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China.
[3] Imaging profile of the COVID-19 infection: Radiologic findings and literature review.
[4] Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: A report of 1014 cases.
[5] Sensitivity of chest CT for COVID-19: Comparison to RT-PCR.
[6] Rethinking computer-aided tuberculosis diagnosis.
[7] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images.
[8] Deep learning face representation by joint identification-verification.
[9] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
[10] Richer convolutional features for edge detection.
[11] Xception: Deep learning with depthwise separable convolutions.
[12] MobileNets: Efficient convolutional neural networks for mobile vision applications.
[13] MobileNetV2: Inverted residuals and linear bottlenecks.
[14] ShuffleNet: An extremely efficient convolutional neural network for mobile devices.
[15] ShuffleNet V2: Practical guidelines for efficient CNN architecture design.
[16] DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.
[17] Rethinking atrous convolution for semantic image segmentation.
[18] Encoder-decoder with atrous separable convolution for semantic image segmentation.
[19] DenseASPP for semantic segmentation in street scenes.
[20] Pyramid scene parsing network.
[21] ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation.
[22] ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network.
[23] Full-resolution residual networks for semantic segmentation in street scenes.
[24] Learning a discriminative feature network for semantic segmentation.
[25] Fully convolutional networks for semantic segmentation.
[26] U-Net: Convolutional networks for biomedical image segmentation.
[27] Learning deconvolution network for semantic segmentation.
[28] SegNet: A deep convolutional encoder-decoder architecture for image segmentation.
[29] UNet++: A nested U-Net architecture for medical image segmentation.
[30] Attention U-Net: Learning where to look for the pancreas.
[31] Attention to scale: Scale-aware semantic image segmentation.
[32] Hypercolumns for object segmentation and fine-grained localization.
[33] RefineNet: Multi-path refinement networks for high-resolution semantic segmentation.
[34] Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net.
[35] Semantic image segmentation with deep convolutional nets and fully connected CRFs.
[36] Context encoding for semantic segmentation.
[37] CCNet: Criss-cross attention for semantic segmentation.
[38] Asymmetric non-local neural networks for semantic segmentation.
[39] Wider or deeper: Revisiting the ResNet model for visual recognition.
[40] Semantic image segmentation via deep parsing network.
[41] Conditional random fields as recurrent neural networks.
[42] Flattened convolutional neural networks for feedforward acceleration.
[43] Going deeper with convolutions.
[44] Rethinking the Inception architecture for computer vision.
[45] Inception-v4, Inception-ResNet and the impact of residual connections on learning.
[46] Aggregated residual transformations for deep neural networks.
[47] ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation.
[48] CGNet: A light-weight context guided network for semantic segmentation.
[49] Efficient dense modules of asymmetric convolution for real-time semantic segmentation.
[50] LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation.
[51] ICNet for real-time semantic segmentation on high-resolution images.
[52] Quantized convolutional neural networks for mobile devices.
[53] Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.
[54] Compressing neural networks with the hashing trick.
[55] Quantized neural networks: Training neural networks with low precision weights and activations.
[56] XNOR-Net: ImageNet classification using binary convolutional neural networks.
[57] Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1.
[58] Rapid AI development cycle for the coronavirus (COVID-19) pandemic: Initial results for automated detection & patient monitoring using deep learning CT image analysis.
[59] Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT.
[60] Deep learning system to screen coronavirus disease 2019 pneumonia.
[61] Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks.
[62] COVID-19 screening on chest X-ray images using deep learning based anomaly detection.
[63] Batch normalization: Accelerating deep network training by reducing internal covariate shift.
[64] ENet: A deep neural network architecture for real-time semantic segmentation.
[65] ContextNet: Exploring context and detail for semantic segmentation in real-time.
[66] Deeply-supervised nets.
[67] Automatic differentiation in PyTorch.
[68] Adam: A method for stochastic optimization.
[69] COVID-19 CT segmentation dataset.
[70] ImageNet large scale visual recognition challenge.
[71] Rectified linear units improve restricted Boltzmann machines.
[72] BiSeNet: Bilateral segmentation network for real-time semantic segmentation.
[73] OCNet: Object context network for scene parsing.
[74] Dual attention network for scene segmentation.
[75] Searching for MobileNetV3.