Comprehensive Comparison of Deep Learning Models for Lung and COVID-19 Lesion Segmentation in CT Scans

Paschalis Bizopoulos, Nicholas Vretos, Petros Daras

2020-09-10

Abstract

Recently there has been an explosion in the use of Deep Learning (DL) methods for medical image segmentation. However, the field's reliability is hindered by the lack of a common base of reference for accuracy/performance evaluation and by the fact that previous research uses different datasets for evaluation. In this paper, an extensive comparison of DL models for lung and COVID-19 lesion segmentation in Computerized Tomography (CT) scans is presented, which can also be used as a benchmark for testing medical image segmentation models. Four DL architectures (Unet, Linknet, FPN, PSPNet) are combined with 25 randomly initialized and pretrained encoders (variations of VGG, DenseNet, ResNet, ResNext, DPN, MobileNet, Xception, Inception-v4, EfficientNet) to construct 200 tested models. Three experimental setups are conducted for lung segmentation, lesion segmentation and lesion segmentation using the original lung masks. A public COVID-19 dataset with 100 CT scan images (80 for training, 20 for validation) is used for training/validation, and a different public dataset consisting of 829 images from 9 CT scan volumes is used for testing. Multiple findings are provided, including the best architecture-encoder models for each experiment, as well as mean Dice results for each experiment, architecture and encoder independently. Finally, the upper-bound improvements when using lung masks as a preprocessing step or when using pretrained models are quantified. The source code and 600 pretrained models for the three experiments are provided, suitable for fine-tuning in experimental setups without GPU capabilities.

1 Introduction

Coronavirus Disease 2019 emerged in December 2019 and was declared a pandemic in March 2020 [1]. The Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has certain properties that make it highly infectious, rendering government policy measures such as social distancing ineffective and increasing the need for fast and accurate diagnosis of the disease. A well-established, high-resolution imaging procedure that targets the lungs and depicts rich pathological information is the Computerized Tomography (CT) scan. More specifically, for a COVID-19 patient, CT scan images show bilateral patchy shadows or ground glass opacity in the infected region [2], which are not always visible in common X-ray scans [3]. Another method that has been used for COVID-19 diagnosis is the so-called Reverse-Transcription Polymerase Chain Reaction (RT-PCR), which, however, has been found to have lower sensitivity than CT scans [4] and to be more time consuming.

Medical experts often need to examine a large number of CT scan images, which is an error-prone and time-consuming process. To that aim, automatic segmentation methods are being proposed that segment regions-of-interest (ROIs) of different size and shape, such as lungs, nodules and lesions, taking advantage of the CT scan resolution. These methods facilitate medical experts in diagnosing by letting them focus on the ROIs instead of the whole image. Methods for automatic segmentation of the lung area from the literature include the use of morphological operations [5], active contours [6] and fuzzy clustering [7].
Feature engineering methods, however, were surpassed by end-to-end learning methods such as Deep Learning (DL) [8], which have been successfully applied to medical image segmentation tasks [9]. More specifically, applications of DL methods in medical image segmentation primarily target lungs [10, 11], pathological lungs [12], infections [13, 14, 15], lungs and infections [16], and lungs and COVID-19 lesions [17]. The majority of them use encoder-decoder architectures such as Unet [18] and its variations.

A major issue in the field of lung/lesion segmentation (and medical image segmentation in general) is the use of different datasets for evaluating newly proposed models. Moreover, there is a lack of benchmark baseline models that could serve as a reference for evaluating the accuracy and performance of proposed models. Benchmarks for COVID-19 in CT scan images have been provided in the literature, such as Ma et al. [19], which has a limited number of cases, and He et al. [20], which tests 20 models for lung segmentation of COVID-19 patients. Other comparison studies on similar tasks, such as lung nodule segmentation, were proposed in [21], which compares three non-learnable algorithms, each created by a different research group. Comparison studies of deep learning image segmentation have also been conducted on non-medical images, such as coral reef images [22], where the authors test four models, and aerial city images [23], where the authors test 12 different models. No previous work, to the best of our knowledge, includes a comprehensive quantified comparison of 600 DL models on the task of image segmentation.

In this paper, four of the most widely used DL image segmentation architectures are explored, namely Unet [18], Linknet [24], Feature Pyramid Network (FPN) [25] and Pyramid Scene Parsing Network (PSPNet) [26], combined with 25 encoders for lung and COVID-19 lesion segmentation in CT scan images. The contribution of this paper in the field of medical image segmentation can be summarized as follows:

• derivation of the best architecture-encoder combinations for the three experiments that are conducted (the lung, lesion and lesion-with-lung-masks experiments),
• quantitative comparison of architectures,
• quantitative comparison of encoders,
• quantitative comparison of lesion segmentation with and without masks (in this case lung masks) as a preprocessing step,
• quantitative comparison of random and ImageNet initialization,
• open source implementation 1,
• release of 600 pretrained models of all experimental setups for use by external researchers.

The rest of the paper is organized as follows: a detailed description of the models and their components (architecture, encoder) is provided in Section 2, the datasets used are presented in Section 3, the experimental setup used to evaluate the models is described in Section 4 and the results are demonstrated in Section 5. Finally, the findings are discussed in relation to the previous literature in Section 6 and the final remarks are given in Section 7.

2 Models

In this Section the problem of segmentation is formalized, and the architectures and encoders that are used in this study are presented.

2.1 Problem Formulation

Let $D$ be a dataset containing images $X \in \mathbb{R}^{n_r \times n_c}$ and corresponding target masks $Y \in \{0, 1\}^{n_r \times n_c}$ (in our case $n_r = 512$, $n_c = 512$). Let $m_{ex,ar,en,ew}$ be a DL model for segmentation, where $ex$ denotes the specific experiment, $ar$ the architecture, $en$ the encoder and $ew$ the encoder weights.
The 'encoder' is defined as the part of the model that performs the feature extraction. The model $m_{ex,ar,en,ew}$ is trained on a dataset $D_{train} \subset D$ consisting of $X_{train} \in \mathbb{R}^{n_r \times n_c}$ and $Y_{train} \in \{0, 1\}^{n_r \times n_c}$. Moreover, a validation dataset is defined as $D_{val} \subset D$, where $D_{val} \cap D_{train} = \emptyset$, consisting of $X_{val} \in \mathbb{R}^{n_r \times n_c}$ and $Y_{val} \in \{0, 1\}^{n_r \times n_c}$. Therefore, the objective of the experiments conducted in this study can be stated as finding an optimal point in the parameter space of $m_{ex,ar,en,ew}$ during training such that, when presented with an input from $D$ such as $X_{val}$, its prediction $\hat{Y}_{val} \in [0, 1]^{n_r \times n_c}$ is as close as possible to the target $Y_{val}$. This is implemented by selecting the model that achieves the minimum validation error over all epochs. Subsequently, the selected models are tested for their generalization ability on an unseen $D_{test}$ with $D_{test} \cap D = \emptyset$. A high level overview of the training of the models can be seen in Fig. 1.

Figure 1: High level overview of a lesion segmentation model with an architecture consisting of an encoder, trained using a single augmented training image. $X$ is the input, $Y$ is the target mask, $\hat{Y}$ is the predicted mask and $L$ is the loss function, which in this case is the Dice loss. Green and red pixels in $\hat{Y}$ depict correctly and falsely classified pixels, while green pixels in $Y$ depict the pixels of the target mask. Arrows denote the flow of the feed-forward and backpropagation passes. $X$ is passed to the architecture consisting of a specific encoder and $\hat{Y}$ is calculated. Then $\hat{Y}$ and $Y$ are used to calculate the loss, which is then used to backpropagate the error to the weights. This procedure is repeated using more training examples until $L$ converges.

2.2 Architectures

Four architectures are used as the basis of the models to be tested:

• Unet [18]
• Linknet [24]
• Feature Pyramid Network (FPN) [25]
• Pyramid Scene Parsing Network (PSPNet) [26]

Unet [18] combines an encoder that scales the features down to a lower dimensional bottleneck with a decoder that scales them up to the original dimensions. It also uses skip connections, which have been proven to improve image segmentation results [27]. Linknet [24] is similar to Unet, with the difference of using residual [28] instead of convolutional blocks in its encoder and decoder networks. Feature Pyramid Network (FPN) [25] is also similar to Unet, with the difference of applying a 1 × 1 convolution layer and adding the features instead of copying and appending them as done in the Unet architecture. The Pyramid Scene Parsing Network (PSPNet) [26] exploits a pyramid pooling module to aggregate global image context information, together with an auxiliary loss [29].

2.3 Encoders

The following encoders are used, with their variations denoted in parentheses:

• VGG [30] (11, 13, 19)
• DenseNet [31] (121, 161, 169, 201)
• ResNet [28] (18, 34, 50, 101, 152)
• ResNext [32]
• Dual Path Networks (DPN) [33] (68, 98)
• MobileNet [34]
• Xception [35]
• Inception-v4 [36]
• EfficientNet [37] (b0, b1, b3, b4, b5, b6)

VGG [30] is named after the Visual Geometry Group that proposed it, and it took second place in the ImageNet competition of 2014 [38]. It was one of the first models that demonstrated the importance of depth in DL, and it is preferred for tasks such as feature extraction due to its simple repeating structure.
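Since each of the four architectures can be paired with any of these encoders, the combinations can be built programmatically. Below is a minimal sketch using the 'Segmentation Models Pytorch' library [44] that Section 4 names as the implementation basis; the exact keyword arguments and encoder name strings reflect our understanding of that library's API and should be treated as assumptions, and `build_model` is a hypothetical helper, not the paper's code:

```python
import segmentation_models_pytorch as smp

# The four architecture classes used in this study.
ARCHITECTURES = {
    "unet": smp.Unet,
    "linknet": smp.Linknet,
    "fpn": smp.FPN,
    "pspnet": smp.PSPNet,
}

def build_model(architecture, encoder, pretrained):
    """Construct one architecture-encoder combination.

    architecture: one of the keys in ARCHITECTURES.
    encoder: an encoder name accepted by the library,
             e.g. "resnet18", "vgg11", "densenet121", "efficientnet-b0".
    pretrained: True for ImageNet-pretrained encoder weights,
                False for random initialization.
    """
    return ARCHITECTURES[architecture](
        encoder_name=encoder,
        encoder_weights="imagenet" if pretrained else None,
        in_channels=1,   # single-channel CT slices
        classes=1,       # binary segmentation
        activation="sigmoid",
    )

# Example: a randomly initialized Unet with a ResNet-18 encoder.
model = build_model("unet", "resnet18", pretrained=False)
```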
On the other hand, the contributions of ResNet [28] (an abbreviation of Residual Networks) allowed training deep networks by using layers that learn residual functions with reference to the layer inputs, while DenseNet [31] uses connections between each layer and every other layer in a feed-forward fashion. Moreover, ResNext [32] consists of a stack of residual blocks, which are subject to two rules: first, layers that output spatial maps of the same size share hyper-parameters, and second, when a spatial map is downsampled by two, the width of the blocks is multiplied by two. Dual Path Networks (DPN) [33] were proposed as networks that combine the feature re-usage of ResNet with the feature exploration of DenseNet, while MobileNet [34] was constructed to fill the need for training and inference on devices with low computational capabilities, such as embedded devices and mobile phones. Xception [35] is a variation of the Inception network [39] in which the inception modules have been replaced with depthwise convolutions followed by a pointwise convolution. Finally, Inception-v4 [36] combines previous inception architectures with residual connections, achieving state-of-the-art performance on ImageNet, while EfficientNet [37] is an improvement of MobileNet in which the compound scaling module was proposed as an efficient way to uniformly scale depth, width and resolution.

3 Datasets

Two public COVID-19 CT scan datasets with lung and lesion masks were used. The first dataset 2 consists of 100 axial CT scans from more than 40 patients, of size 512 × 512, with corresponding lung masks from [40] and lesion masks labeled with four classes (none, ground-glass, consolidation, pleural effusion). The original dataset, without the annotations, was obtained from the Italian Society of Medical and Interventional Radiology 3. The second dataset 4 consists of 829 images from 9 CT scan volumes (a set of CT scan images acquired from the same patient at the same moment) with corresponding target masks. 373 out of the 829 images were annotated as positive and segmented by a radiologist from the same group as the first dataset. Raw data from both datasets contain samples in Hounsfield units [41].

Regarding preprocessing, first the positive classes of the pixels of the images in the first dataset are merged into one, converting the problem into a binary segmentation problem. The CT scan images from the second dataset are resized to 512 × 512, and both datasets are normalized with µ = −500 and σ = 500. We use 80 scans from the first dataset for training the models, 20 scans for validation and all scans from the second dataset for testing.

In Fig. 2 the histograms of the pixel intensities of all the CT scan images and the target masks in the test dataset after normalization are depicted for each of the three experiments. The considerable overlap between the histograms makes thresholding models unsuitable for this kind of problem, thus justifying the use of learning models such as DL.

4 Experimental Setup

In this Section the experimental setup is presented. In total, three experiments are conducted:

• lung segmentation
• lesion segmentation (referred to as 'lesion segmentation A')
• lesion segmentation with lung masks (referred to as 'lesion segmentation B')

The choice of these experiments covers balanced (lung segmentation), unbalanced (lesion segmentation A) and unbalanced-with-preprocessing (lesion segmentation B) image segmentation tasks, and the findings could also apply to image segmentation tasks with non-medical images.
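As an illustration, the preprocessing described in Section 3 can be sketched as follows. This is a minimal sketch assuming NumPy arrays of raw Hounsfield-unit slices and OpenCV for resizing (neither library is named in the paper, and all function and variable names here are hypothetical):

```python
import cv2
import numpy as np

def preprocess(ct_slice, mask=None, size=512, mu=-500.0, sigma=500.0):
    """Resize a raw CT slice (Hounsfield units) and normalize it.

    mu and sigma are the normalization constants reported in Section 3.
    """
    x = cv2.resize(ct_slice.astype(np.float32), (size, size))
    x = (x - mu) / sigma
    if mask is None:
        return x, None
    # Nearest-neighbor interpolation keeps mask values discrete;
    # merging all positive classes yields a binary segmentation target.
    y = cv2.resize(mask, (size, size), interpolation=cv2.INTER_NEAREST)
    y = (y > 0).astype(np.float32)
    return x, y
```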
Each of the performed experiments uses a different target mask $Y$; for 'lesion segmentation B' the corresponding lung mask is also applied to the input image $X$. Each model is constructed using a unique combination of the four architectures described in Subsection 2.2 and the 25 encoders referenced in Subsection 2.3. The selection of architectures and encoders was based on the restriction of the GPU memory of our graphics card, combined with the value of the batch size. Then, for each model we test both the randomly initialized and the ImageNet-pretrained version. The default values for every hyperparameter of the models were used (as seen in Table 1), to avoid favoring models that were proposed after being evaluated in a controlled experimental setup. The activation function for all architectures was the sigmoid, which squashes the output into the range [0, 1].

For all experiments and in each epoch during training, data augmentation is applied on the images from the training dataset:

• horizontal/vertical flip, each with probability 50%.

During training, the Soft Dice Loss is used to calculate the error of the model on the training dataset as:

$$L = 1 - \frac{2 \sum_{i,j} Y_{ij} \hat{Y}_{ij} + \epsilon}{\sum_{i,j} Y_{ij} + \sum_{i,j} \hat{Y}_{ij} + \epsilon}$$

where $Y_{ij}$, $\hat{Y}_{ij}$ are the pixel intensities at the $i$-th column, $j$-th row of the target mask and the predicted mask, respectively (which applies to $Y_{train}$, $Y_{val}$ and $Y_{test}$) and $\epsilon = 10^{-5}$. The model selection is done using the Soft Dice Loss on the validation dataset in each epoch during training. During testing, the predicted mask $\hat{Y}_{test}$ is binarized with a threshold value of 0.5, allowing us to use hard metrics for testing the models, such as the Dice coefficient:

$$Dice = \frac{2TP + \epsilon}{2TP + FP + FN + \epsilon}$$

where $TP$, $TN$, $FP$ and $FN$ are the true positives, true negatives, false positives and false negatives of $\hat{Y}$ w.r.t. $Y$, respectively, and $\epsilon = 10^{-5}$ prevents division by zero. When $TP + FP = 0$ the model has correctly identified that the input does not have any positive pixel, and in that case all metrics are set to 1.

We train a total of 600 different models for the three experiments, each one for 200 epochs with a batch size of 2, which was the maximum possible considering the GPU memory restriction. We use the Adam optimizer [42] with the default values of learning rate 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, without weight decay. PyTorch [43] and the 'Segmentation Models Pytorch' library [44] were used for implementing the experiments, on a GeForce RTX 2080 Ti graphics card with 11 GB of memory from NVIDIA and an Intel Core i9-9900K CPU @ 3.60GHz, on a Linux-based operating system; training the models took two weeks. A pseudocode implementation of the experimental setup is shown in Algorithm 1.

Algorithm 1: Experimental setup
for ex = 1 to n_experiments do
    for ar = 1 to n_architectures do
        for en = 1 to n_encoders do
            for ew = 1 to n_encoder_weights do
                for ep = 1 to epochs do
                    train m_{ex,ar,en,ew} for one epoch on the augmented training set
                    evaluate the Soft Dice Loss on the validation set
                    keep the weights with the minimum validation loss so far
                end for
            end for
        end for
    end for
end for

5 Results

In this Section the results of the three experimental setups are demonstrated, along with several comparisons between experiments, architectures, encoders and weight initialization schemes. In Table 2 the resulting metrics of all experiments are presented. The best combination of architecture and encoder for each combination of encoder weight initialization, experimental setup and metric is shown in bold. The mean Dice results for each experiment are 93.18% ± 1.3% for lung segmentation, 85.47% ± 1.17% for 'lesion segmentation A' and 86.44% ± 1.04% for 'lesion segmentation B'. The best performing models w.r.t. Dice for each experiment were resnet50-xception (95.58%) for lung segmentation, resnet18-xception (87.56%) for lesion segmentation A and dpn98-efficientnet-b4 (89.0%) for lesion segmentation B.
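To make the setup of Section 4 concrete, the following is a minimal PyTorch sketch of the Soft Dice Loss, the hard Dice metric and a single optimization step with the reported hyperparameters; `model` is assumed to be built as in the earlier sketch, `train_loader` is a hypothetical data loader, and none of this is the paper's own code:

```python
import torch

def soft_dice_loss(y_pred, y_true, eps=1e-5):
    """Soft Dice Loss from Section 4; tensors of shape (N, 1, H, W)."""
    intersection = (y_true * y_pred).sum(dim=(1, 2, 3))
    denominator = y_true.sum(dim=(1, 2, 3)) + y_pred.sum(dim=(1, 2, 3))
    return (1 - (2 * intersection + eps) / (denominator + eps)).mean()

def hard_dice(y_pred, y_true, threshold=0.5, eps=1e-5):
    """Hard Dice on a binarized prediction, per Section 4.

    When TP + FP = 0 and the target is also empty (FN = 0), the
    epsilon terms make this evaluate to 1, matching the convention
    described in the text.
    """
    y_hat = (y_pred > threshold).float()
    tp = (y_hat * y_true).sum()
    fp = (y_hat * (1 - y_true)).sum()
    fn = ((1 - y_hat) * y_true).sum()
    return (2 * tp + eps) / (2 * tp + fp + fn + eps)

# One optimization step with the hyperparameters reported in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
x, y = next(iter(train_loader))
optimizer.zero_grad()
loss = soft_dice_loss(model(x), y)
loss.backward()
optimizer.step()
```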
In Fig. 3 the predicted masks for 24 out of the 600 models are depicted, demonstrating the difference in segmentation quality between the best (efficientnet-b4) and the worst (vgg19) performing encoder for each architecture, with randomly initialized weights. In Fig. 4 the Dice vs. the number of parameters is plotted, showing a positive, though not statistically significant, correlation, which suggests that segmentation generally improves when using a higher number of parameters. It is worth noting that the best model is not the one with the largest number of parameters.

In Fig. 5 the training and validation loss vs. epochs of the four architectures, grouped by encoder, is depicted. In all experiments and architectures, the training loss decreases fast during the first 5 epochs and at a slower rate during the subsequent epochs. We observe the same behaviour for the validation loss during the first 15 epochs but with more variability, which can be explained by the use of the Dice loss as a validation metric. More specifically, we observe faster convergence of the training loss for PSPNet compared to the other architectures, greater variance for FPN and slower convergence for Linknet in both training and validation. In Fig. 6 the Dice boxplots for the three experiments are plotted. Regarding time performance for training and inference, the fastest architecture is PSPNet and the slowest is Linknet, which also has more parameters than Unet.

In Fig. 7 the predictions for the three experiments, visualized as volumes for the resnet18 encoder, demonstrate a good match with the original masks. The bottom two subfigures of Fig. 5 depict the difference between the training and validation losses of 'lesion segmentation A' and 'lesion segmentation B'. We observe convergence between the losses, which is an indication that when training for a large number of epochs the use of lung masks as a preprocessing step becomes less necessary. In Fig. 8 the trained weights are depicted, in which we observe that with random initialization the weights exhibit high and low frequency textures after training. The Dice when using random initialization is 87.1% ± 4.1% and when using ImageNet initialization it is 88.36% ± 3.62%.

6 Discussion

A motivation for this study is that the large number of newly proposed models rarely come with ablation studies or comparisons against simple baselines. This study can be used as a set of baseline models against which DL model designers can test, to confirm and evaluate whether their novel model performs better than other models, e.g. by comparing its accuracy with models with the same number of parameters and/or the same training/validation time.

A common preprocessing step for lesion segmentation is the use of lung masks, obtained either from manual annotation or from an automatic method. This step naturally improves lesion segmentation, since the model only needs to search within the lung region instead of the whole image. Moreover, this step is necessary in cases where the lesion is orders of magnitude smaller than the lungs and the background, justifying the characterization of lesion datasets as 'unbalanced'. The arguments against this preprocessing step are that the complexity of the model and the cost of annotation by the experts are increased. The question to be answered by the expert is whether the additional lesion segmentation accuracy is worth the increase in model complexity and annotation cost.
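For illustration, the masking step debated above reduces to a single element-wise operation; a minimal sketch, assuming the image and its lung mask are arrays of equal size (the function name is hypothetical):

```python
import numpy as np

def apply_lung_mask(image: np.ndarray, lung_mask: np.ndarray) -> np.ndarray:
    """Zero out everything outside the lungs before lesion segmentation,
    as in the 'lesion segmentation B' experiment."""
    return image * (lung_mask > 0)
```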
Related previous work was conducted by Shi et al. [45], who categorized COVID-19 segmentation models into:

• lung-lesion-oriented models, which directly segment lesions, and
• lung-region-oriented models, which first segment the lungs and then pass the masked region for further segmentation or classification.

Regarding the encoder weight initialization experiment, we confirm previous research such as [46] showing that pretrained weights significantly improve segmentation results; however, we hypothesize that as the number of epochs increases, the accuracy gap between the two initializations decreases. Similar positive findings regarding transfer learning, such as improved performance and faster convergence, were also reported in [47].

Limitations of this study include the use of a small amount of training data; however, this is partially addressed by the data augmentation methods that are applied in each training epoch. Moreover, it is costly to gather and annotate medical images, especially when extreme events such as the COVID-19 outbreak occur. Therefore, the training dataset of this study is representative of the medical datasets that exist in the wild, as summarized in [48], which contain samples on the order of $10^2$ to low $10^3$. Future work includes the use of neuron and layer attribution methods to investigate the reasons that specific combinations of architectures and encoders perform better than others.

7 Conclusion

The need for fast, accurate and automatic diagnosis of COVID-19 requires highly reliable and publicly available models. We demonstrate specific properties that increase segmentation accuracy and help experts improve diagnosis, by publicly providing pretrained models ready to be fine-tuned in experimental setups without GPU capabilities.

References

[1] Coronavirus disease 2019 (covid-19): situation report.
[2] Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in wuhan, china.
[3] Imaging profile of the covid-19 infection: radiologic findings and literature review.
[4] Correlation of chest ct and rt-pcr testing in coronavirus disease 2019 (covid-19) in china: a report of 1014 cases.
[5] Automatic lung segmentation for accurate quantitation of volumetric x-ray ct images.
[6] Lung nodule segmentation and recognition using svm classifier and active contour modeling: A complete intelligent system.
[7] Lung cancer detection using fuzzy auto-seed cluster means morphological segmentation and svm classifier.
[8] Deep learning.
[9] Image segmentation using deep learning: A survey.
[10] Lung ct image segmentation using deep neural networks.
[11] Ai-assisted ct imaging analysis for covid-19 screening: Building and deploying a medical ai system in four weeks.
[12] Progressive and multi-path holistically nested neural networks for pathological lung segmentation from ct images.
[13] Inf-net: Automatic covid-19 lung infection segmentation from ct images.
[14] Residual attention u-net for automated multi-class segmentation of covid-19 chest ct images.
[15] A noise-robust framework for automatic segmentation of covid-19 pneumonia lesions from ct images.
[16] Covid-19 chest ct image segmentation - a deep convolutional neural network solution.
[17] Lung infection quantification of covid-19 in ct images with deep learning.
[18] U-net: Convolutional networks for biomedical image segmentation.
[19] Towards efficient covid-19 ct annotation: A benchmark for lung and infection segmentation.
[20] Benchmarking deep learning models and automated model design for covid-19 detection with chest ct scans, medRxiv.
[21] A comparison of lung nodule segmentation algorithms: methods and results from a multi-institutional study.
[22] A comparison of deep learning methods for semantic segmentation of coral reef survey images.
[23] A comparison of deep learning architectures for semantic mapping of very high resolution images.
[24] Linknet: Exploiting encoder representations for efficient semantic segmentation.
[25] Feature pyramid networks for object detection.
[26] Pyramid scene parsing network.
[27] The importance of skip connections in biomedical image segmentation.
[28] Deep residual learning for image recognition.
[29] A comparison and strategy of semantic segmentation on remote sensing images.
[30] Very deep convolutional networks for large-scale image recognition.
[31] Densely connected convolutional networks.
[32] Aggregated residual transformations for deep neural networks.
[33] Dual path networks.
[34] Mobilenets: Efficient convolutional neural networks for mobile vision applications.
[35] Xception: Deep learning with depthwise separable convolutions.
[36] Inception-v4, inception-resnet and the impact of residual connections on learning.
[37] Efficientnet: Rethinking model scaling for convolutional neural networks.
[38] ImageNet: A large-scale hierarchical image database, 2009 IEEE conference on computer vision and pattern recognition.
[39] Going deeper with convolutions.
[40] Automatic lung segmentation in routine imaging is a data diversity problem, not a methodology problem.
[41] The calibration of ct hounsfield units for radiotherapy treatment planning.
[42] Adam: A method for stochastic optimization.
[43] Pytorch: An imperative style, high-performance deep learning library.
[44] Segmentation models, GitHub repository.
[45] Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for covid-19.
[46] In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images.
[47] Does non-covid19 lung lesion help? investigating transferability in covid-19 ct image segmentation.
[48] Deep learning techniques for medical image segmentation: Achievements and challenges.

ACKNOWLEDGMENT

This work received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 875325 (TeNDER, affecTive basEd iNtegrateD carE for betteR Quality of Life).