key: cord-0546016-mi4oqo78 authors: Saeedizadeh, Narges; Minaee, Shervin; Kafieh, Rahele; Yazdani, Shakib; Sonka, Milan title: COVID TV-UNet: Segmenting COVID-19 Chest CT Images Using Connectivity Imposed U-Net date: 2020-07-24 journal: nan DOI: nan sha: 2a2abdab3d7de2d83ac7a9ec63f0c3bc4a17c167 doc_id: 546016 cord_uid: mi4oqo78
The novel coronavirus disease (COVID-19) pandemic has caused a major outbreak in more than 200 countries around the world, with a severe impact on the health and life of many people globally. As of mid-July 2020, more than 12 million people were infected, and more than 570,000 deaths were reported. Computed Tomography (CT) images can be used as an alternative to the time-consuming RT-PCR test to detect COVID-19. In this work we propose a segmentation framework to detect the regions of chest CT images that are infected by COVID-19. We use an architecture similar to the U-Net model, and train it to detect ground glass regions at the pixel level. As the infected regions tend to form connected components (rather than randomly distributed pixels), we add a suitable regularization term to the loss function to promote connectivity of the segmentation map for COVID-19 pixels. 2D anisotropic total variation is used for this purpose, and therefore the proposed model is called "TV-UNet". Through experimental results on a relatively large-scale CT segmentation dataset of around 900 images, we show that adding this new regularization term leads to a 2% gain in overall segmentation performance compared to the U-Net model. Our experimental analysis, ranging from visual evaluation of the predicted segmentation results to quantitative assessment of segmentation performance (precision, recall, Dice score, and mIoU), demonstrates a strong ability to identify COVID-19 associated regions of the lungs, achieving an mIoU of over 99% and a Dice score of around 86%.
Since December 2019, a novel coronavirus (SARS-CoV-2) has spread from Wuhan to the whole of China, and then to many other countries. At the end of January 2020, the World Health Organization (WHO) declared COVID-19 a Public Health Emergency of International Concern [2]. By July 15, 2020, more than 12 million confirmed cases and more than 570,000 deaths had been reported across the world [1]. While infection rates are decreasing in some countries, the number of new infections continues to grow quickly in many others, signaling the continuing global threat of COVID-19 [4]-[6]. Up to this point, no effective treatment has been proven for COVID-19. Therefore, accurate and rapid testing is critical for promptly preventing the spread of COVID-19. Reverse transcription polymerase chain reaction (RT-PCR) has been considered the gold standard in diagnosing COVID-19. However, the shortage of available tests and testing equipment in many areas of the world limits rapid and accurate screening of suspected subjects. Even under the best circumstances, obtaining RT-PCR test results takes more than 6 hours, and the routinely achieved sensitivity of RT-PCR is insufficient [3].
Narges Saeedizadeh is with Medical Image and Signal Processing Research Center, Isfahan University of Medical Sciences, Iran (e-mail: saeidy-narges@gmail.com). Shervin Minaee is with Snap Inc., Seattle, WA, USA (e-mail: smi-naee@snapchat.com). Rahele Kafieh is with Medical Image and Signal Processing Research Center, Isfahan University of Medical Sciences, Iran (e-mail: rkafieh@amt.mui.ac.ir). Shakib Yazdani is with ECE Department, Isfahan University of Technology, Iran (e-mail: shyazdani@ec.iut.ac.ir). Milan Sonka is with Iowa Institute for Biomedical Imaging, The University of Iowa, Iowa City, USA (e-mail: milan-sonka@uiowa.edu). M. Sonka's research effort is supported, in part, by NIH grant R01-EB004640.
On the other hand, radiological imaging techniques such as chest X-rays and computed tomography (CT), followed by automated image analysis [26], may successfully complement RT-PCR testing. CT screening provides a three-dimensional view of the lung and is therefore more sensitive (although less widely available) than chest X-ray radiography. The authors of a systematic review [9] indicated that CT images can reveal COVID-19 before some clinical symptoms are observed. Typical signs of COVID-19 in CT images consist of unilateral, multifocal and peripherally based ground glass opacities (GGO), interlobular septal thickening, thickening of the adjacent pleura, presence of pulmonary nodules, round cystic changes, bronchiectasis, pleural effusion, and lymphadenopathy [7], [8]. Accurate and rapid detection and localization of these pathological tissue changes is critical for early diagnosis and reliable severity assessment of COVID-19 infection. As the number of affected patients is high in most hospitals, manual annotation by well-trained expert radiologists is time consuming and subject to inter- and intra-observer variability. Such annotation is tedious and labor-intensive for radiologists and slows down the CT analysis. The urgent need for automatic segmentation of typical COVID-19 CT signatures is widely appreciated, and deep learning methods can offer a unique solution for identifying COVID-19 signs of infection in clinical-quality images, which frequently suffer from variations in CT acquisition parameters and protocols [10].
In this work, we present a deep learning based framework for automatic segmentation of pathologic COVID-19-associated tissue areas from clinical CT images taken from publicly available COVID-19 segmentation datasets. The performance of image segmentation models based on various deep learning frameworks has progressed considerably in recent years [11]-[16].
Our solution is based on adapting and enhancing U-Net, a popular deep learning architecture for medical image segmentation, for the COVID-19 segmentation task. As COVID-19 tissue regions tend to form connected regions identifiable in individual CT slices, a "connectivity promoting regularization" term was added to the specifically designed training loss function to encourage the model to prefer sufficiently large connected segmentation regions of desirable properties.
It is worth mentioning that a few works on COVID-19 segmentation from CT images have been proposed very recently. In [17], Fan et al. proposed Inf-Net to identify infected regions from chest CT slices. In their proposed model, a parallel partial decoder is used to aggregate the high-level features and generate a global map. Then, implicit reverse attention and explicit edge attention are utilized to model the boundaries and enhance the representations. However, this model was developed on a very small dataset of labeled CT images, consisting of only 100 CT slices in total, which makes it hard to generalize the results and to compare against them. In terms of Dice score, our model achieves a much higher value on a larger test set. In [18], Elharrouss et al. proposed an encoder-decoder based model for lung infection segmentation using CT-scan images. The proposed model first uses image structure and texture to extract the ROI of the infected area, and then uses the ROI along with image structure to predict the infected region. They also train this model on a small dataset of CT images, and achieve reasonable performance. In [19], Ma et al. prepared a new benchmark of 3D CT data with 20 cases containing 1800+ annotated slices, and provided several pre-trained baseline models that serve as out-of-the-box 3D segmentation tools.
The main contributions of our work can be summarized as follows:
• Development of an image segmentation framework for detecting pathologic COVID-19 regions in pulmonary CT images.
• Development of a novel connectivity-promoting regularization loss function.
• Quantitative validation showing better performance than several existing state-of-the-art segmentation approaches.
• Public sharing of the developed software code to facilitate use by the research and medical communities.
Despite the large number of patients suffering from COVID-19 and the growing number of COVID-19 volumetric CT scans, the availability of labeled CT images that can be used for training deep learning methods is still limited. Therefore, our strategy relies heavily on transfer learning: we initiate the training from a model previously developed for medical image segmentation (segmentation of neuronal structures in electron microscopic stacks) and adapt it to this task. To better suit the segmentation task at hand, we employed an architecture similar to U-Net, one of the most successful deep learning medical image segmentation approaches, and modified its loss function to prefer COVID-19 specific foreground mask connectivity.
U-Net is a popular segmentation model based on an encoder-decoder neural architecture with skip connections, and was originally proposed by Ronneberger et al. [16]. The network architecture of U-Net is illustrated in Fig. 2. In the encoder part, the model takes an image as input, applies multiple layers of convolution, max-pooling and ReLU activation, and compresses the data into a latent space. In the decoder part, the network attempts to decode the information from the latent space using the transposed convolution operation (deconvolution) and produces the segmentation mask of the image. The rest of the operations are similar to the aforementioned ones in the encoder part.
One difference between U-Net and a plain encoder-decoder model is the use of skip connections, which send information from the high-resolution layers of the encoder to the corresponding layers of the decoder and help the network better capture small details that are present at high resolution. Fig. 2 illustrates the general architecture of a U-Net model. The loss of the U-Net model is a pixel-wise weighted cross-entropy, which can be written as
$$L_{UNet} = -\sum_{x \in \Omega} \omega(x) \log\big(p_{\ell(x)}(x)\big),$$
where $\ell : \Omega \to \{1, \dots, K\}$ maps each pixel to its true label, $\Omega \subset \mathbb{Z}^2$ is the pixel grid, and $K$ denotes the total number of classes in the dataset. Moreover, the soft-max is defined as
$$p_k(x) = \frac{\exp\big(a_k(x)\big)}{\sum_{k'=1}^{K} \exp\big(a_{k'}(x)\big)},$$
where $a_k(x)$ represents the activation in feature channel $k$ at pixel $x$. Additionally, $\omega : \Omega \to \mathbb{R}$ is a weight map that gives some pixels more importance.
Segmentation maps usually consist of a number of connected components, and single-pixel regions are rare. To encourage our segmentation model to generate segmentation maps with connected components of desirable sizes, we found that incorporating an explicit regularization term in the training loss function can greatly improve the connectivity of the predicted segmentation regions. It is worth noting that a conventional U-Net can also implicitly learn such behavior from training data to some extent, provided that enough data is available, which is not quite the case in our situation. Several strategies have been developed to impose the desired connectivity within images, such as adding group sparsity or incorporating total variation terms [23]-[25]. Based on our experience, we decided to use total variation (TV), as its gradient update is computationally attractive during the backward pass. TV penalizes generated images that have large variations among neighboring pixels, leading to more connected and smoother solutions [24].
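The effect described above can be illustrated with a small NumPy sketch. It is not the implementation used in the experiments: the function name `anisotropic_tv` and the two toy masks are assumptions made for illustration, and the penalty is written as the sum of absolute differences between vertically and horizontally adjacent pixels, which is one common discrete form of TV. Both masks contain four foreground pixels, but the connected one receives the smaller penalty.

```python
import numpy as np

def anisotropic_tv(mask):
    """Sum of absolute differences between vertically and horizontally adjacent pixels."""
    mask = np.asarray(mask, dtype=float)
    vertical = np.abs(np.diff(mask, axis=0)).sum()    # differences between rows
    horizontal = np.abs(np.diff(mask, axis=1)).sum()  # differences between columns
    return float(vertical + horizontal)

# Toy 6x6 masks (hypothetical data), each with four foreground pixels.
blob = np.zeros((6, 6))
blob[2:4, 2:4] = 1.0  # one connected 2x2 region

scattered = np.zeros((6, 6))
for r, c in [(1, 1), (1, 4), (4, 1), (4, 4)]:
    scattered[r, c] = 1.0  # four isolated pixels

tv_blob = anisotropic_tv(blob)
tv_scattered = anisotropic_tv(scattered)
print(tv_blob, tv_scattered)  # 8.0 16.0
```

Because the scattered mask has twice the TV of the connected one, a loss that adds this quantity to the data term favors the connected prediction when both fit the data equally well.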
The total variation of a differentiable function $f$ defined on an interval $[a, b] \subset \mathbb{R}$ has the following expression if $f'$ is Riemann-integrable:
$$TV(f) = \int_a^b |f'(x)| \, dx.$$
The total variation of a 1D discrete signal $y = [y_1, \dots, y_N]$ is straightforward, and can be defined as
$$TV(y) = \sum_{n=1}^{N-1} |y_{n+1} - y_n| = \|Dy\|_1,$$
where $D$ is the $(N-1) \times N$ matrix
$$D = \begin{bmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{bmatrix}.$$
For 2D signals $Y = [y_{i,j}]$, we can use the isotropic or anisotropic version of 2D total variation [23]. To simplify our optimization problem, we have used the anisotropic version of TV, which is defined as the sum of the absolute horizontal and vertical gradients at each pixel:
$$TV(Y) = \sum_{i,j} \big( |y_{i+1,j} - y_{i,j}| + |y_{i,j+1} - y_{i,j}| \big).$$
In our case, we add the total variation of the predicted binary mask for COVID-19 pixels to the loss function. Adding this 2D-TV regularization term to our framework promotes the connectivity of the produced segmentation regions. The new loss function for our model then takes the form
$$L_{TV\text{-}UNet} = L_{UNet} + \lambda \, TV(\hat{Y}),$$
where $L_{UNet}$ is the binary cross-entropy loss defined above, computed as the sum of the cross-entropy over all pixels, $\hat{Y}$ denotes the predicted COVID-19 mask, and $\lambda$ is a weight that balances the two terms.
We have used the COVID-19 CT segmentation dataset [28], which contains two versions. The first version of this dataset contains 100 images from 40 patients, all of which are labeled as COVID-19. This version has three types of ground truth masks, called Ground Glass, Consolidation and Pleural Effusion. The original CT images and all ground truth masks have a size of 512 x 512. The second version of the dataset was expanded to 829 images (from 9 patients), of which 373 are labeled as COVID-19 and the rest as normal. All images labeled as COVID-19 have a Ground Glass mask, but the majority of them are missing the other two masks. The size of images and masks in the second version of this dataset is 630 x 630. We combined these two versions, which together contain images from a total of 49 patients. Four sample images from this dataset are shown in Figure 3. The images in the first and second rows denote the original images and their corresponding COVID-19 masks, respectively.
The images in the first and second columns denote two sample images of normal people, and the images in the third and fourth columns denote two COVID-19 images. The white and gray regions in the ground-truth masks denote COVID-19 regions, while the black pixels denote healthy regions (note that if the mask is entirely black, the given CT image belongs to a healthy person). The red boundary contours are drawn to better show the parts containing COVID-19, and are not a part of the original image. After precise examination of the three types of ground truth masks, and after consulting with a board-certified radiologist, we decided to focus on the Ground Glass mask and to remove the Consolidation and Pleural Effusion masks, for two reasons. First, very few images have all three types of masks, and most of them are missing the latter two. Second, our radiologist verified that the result obtained using only the Ground Glass mask is also acceptable and can be used to infer the presence of COVID-19. Two sample images of the COVID-19 class with the three types of ground truth masks are shown in Figure 4.
To evaluate the effect of different ground truth masks and the effect of different training/testing set composition, two different splits (of training, validation, and test sets) are selected from this dataset. In split-1, we have 729 images (associated with 46 patients) in the train and validation sets, and 200 images (associated with 3 patients) in the test set. In split-2, we have 654 images (associated with 35 patients) in the train and validation sets, and 275 images (associated with 14 patients) in the test set. The details of these splits are provided in Table I. Additionally, a semi-supervised COVID-19 segmentation dataset (COVID-SemiSeg), recently reported in [17] and [29], was used to compare our TV-UNet approach with other methods. The COVID-SemiSeg dataset consists of two sets.
The first one contains 1600 pseudo labels generated by the Semi-Inf-Net model and 50 labels provided by expert physicians. The second set includes 50 multi-class labels. In both sets, 48 images are used for performance assessment.
In this section we provide a detailed experimental analysis of the proposed segmentation framework, presenting both qualitative and quantitative results and comparing our results with a baseline approach. Several metrics are used by the research community to measure the performance of segmentation models, including precision, recall, Dice coefficient and mean IoU (mIoU). These metrics are also widely used in the medical domain, and are defined below. Precision is the ratio of pixels correctly predicted as COVID-19 to the total number of pixels predicted as COVID-19, and is defined as Eq. 6:
$$\text{Precision} = \frac{TP}{TP + FP}, \tag{6}$$
where TP refers to the true positives (the number of pixels correctly predicted as COVID-19) and FP refers to the false positives (the number of pixels wrongly predicted as COVID-19). Recall is the ratio of pixels correctly predicted as COVID-19 to the total number of actual COVID-19 pixels, and is defined as Eq. 7:
$$\text{Recall} = \frac{TP}{TP + FN}, \tag{7}$$
where TP is the number of true positives, as above, and FN refers to the false negatives, that is, the number of COVID-19 pixels mistakenly predicted as non-COVID. Precision and recall are widely used in the medical domain, and they are usually reported as a pair to get an overall picture of model performance. The Precision-Recall (PR) curve is a popular way to look at model performance holistically; it is a plot of the precision (y-axis) versus the recall (x-axis) for different thresholds. The Dice coefficient (also known as Dice score, or DSC) is another popular metric, especially for multi-class image segmentation. The Dice score is defined as Eq. 8:
$$\text{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}, \tag{8}$$
where A and B denote the predicted and ground-truth masks.
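As a minimal sketch of how the three metrics above can be computed from a pair of binary masks, the NumPy function below counts TP, FP and FN at the pixel level. The function name `precision_recall_dice`, the toy masks, and the convention of returning 1.0 when a denominator is zero are assumptions made for illustration, not part of the evaluation code used for the reported results.

```python
import numpy as np

def precision_recall_dice(pred, truth):
    """Pixel-level precision, recall and Dice score for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.logical_and(pred, truth).sum())    # predicted COVID, truly COVID
    fp = int(np.logical_and(pred, ~truth).sum())   # predicted COVID, truly healthy
    fn = int(np.logical_and(~pred, truth).sum())   # predicted healthy, truly COVID
    # Assumed convention: an empty denominator yields a perfect score of 1.0.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 1.0
    return precision, recall, dice

# Hypothetical 4x4 example: the ground truth is a 2x2 region.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True

pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True
pred[2, 2] = False   # one missed COVID pixel (false negative)
pred[0, 0] = True    # two spurious pixels (false positives)
pred[0, 1] = True

precision, recall, dice = precision_recall_dice(pred, truth)
```

Here TP = 3, FP = 2 and FN = 1, so precision is 3/5, recall is 3/4, and the Dice score is 6/9. The Dice expression 2TP / (2TP + FP + FN) used in the code is the same quantity as Eq. 8, since |A| = TP + FP and |B| = TP + FN.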
Intersection over Union (IoU, also known as the Jaccard index) is another popular metric used to evaluate the similarity between ground truth and predicted segmentation masks. It is defined as the size of the intersection divided by the size of the union of the target mask and the predicted segmentation map (Eq. 9):
$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \tag{9}$$
where A and B are the predicted and ground-truth masks. If A and B are both empty, IoU(A, B) is defined as 1. IoU ranges between 0 and 1. Mean-IoU is the average of the IoU values over all classes. It is worth mentioning that the Dice coefficient and IoU are positively correlated.
Hyper-parameters must be tuned properly during the training of machine learning models to achieve good performance, especially in the case of deep neural networks. Hyper-parameter tuning can be done in two different ways, automatically and manually. In this work, we manually evaluated a few different combinations of hyper-parameters and selected the best combination. To simplify the tuning process, we fixed the number of epochs to 100 and the batch size to 32. We designed and compared different loss functions (such as binary cross-entropy (BCE), Dice coefficient loss, and BCE plus total variation regularization), different optimizers (such as ADAM, Adagrad, Adadelta and stochastic gradient descent (SGD)), and different learning rates. We used adaptive learning rate scheduling and early stopping criteria as below, which achieved reasonable performance on the validation set:
• The learning rate is decayed whenever the validation loss does not improve for 5 consecutive epochs.
• Early stopping is applied whenever the validation loss does not improve for 10 consecutive epochs.
Table II shows the impact of the loss function design on the model performance; the loss that combines binary cross-entropy with the proposed connectivity regularization achieves the best performance. The impact of the optimizer on the model performance is shown in Table III.
As we can see, ADAM achieves the highest performance in terms of all metrics. Table IV provides the analysis of model performance for two different learning rate values when using the best performing optimizer, ADAM. Note that the model predicts a probability for each pixel, showing the likelihood of it belonging to the pathologic COVID-19 region (zero denotes non-COVID pixels and one denotes COVID-19 pathology). These probabilities are thresholded, and different thresholds yield different sensitivity/specificity rates. A threshold value of 0.3 achieved the best performance on the validation set and was therefore used to report the results of the proposed model. The impact of modifying the threshold value on the model accuracy is given in Section IV-D.
Qualitative results showing how close our predicted masks are to the ground-truth masks are given in Figure 5 for 5 sample images from the test set. As can be seen, when the desired region is very small, the conventional U-Net model (fine-tuned on our dataset) cannot distinguish the segmentation region from the background very well, while the proposed TV-UNet model performs notably better.
As discussed previously, our model predicts a probability score for each pixel, showing the likelihood of its being in a COVID-19 pathology region. Different cut-off thresholds can be applied to those probabilities to decide the COVID-19 labeling. As the cut-off threshold increases, fewer and fewer pixels are labeled as COVID-19 pathology. Tables V and VI show the model performance (in terms of precision, recall, and mIoU) for eight different values of the cut-off threshold for the Split-1 and Split-2 datasets. The cut-off threshold of 0.3 results in the highest Dice score and mIoU, and was therefore employed to compare our model with other baseline models.
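The threshold selection described above can be sketched as a simple sweep over candidate cut-offs. The snippet below is illustrative only: the helper names `dice_score` and `sweep_thresholds`, the hand-made 4x4 probability map, and the four candidate thresholds are all assumptions, and the values are toy numbers rather than outputs of the trained model.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice score of two binary masks; defined as 1.0 when both are empty (assumed convention)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = int(pred.sum()) + int(truth.sum())
    if denom == 0:
        return 1.0
    return 2.0 * int(np.logical_and(pred, truth).sum()) / denom

def sweep_thresholds(prob, truth, thresholds):
    """Return (threshold, number of predicted COVID pixels, Dice score) for each cut-off."""
    results = []
    for t in thresholds:
        pred = prob >= t  # pixels whose probability reaches the cut-off
        results.append((t, int(pred.sum()), dice_score(pred, truth)))
    return results

# Hypothetical ground truth and per-pixel probabilities (toy values chosen by hand).
truth = np.array([[0, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 0]])
prob = np.array([[0.05, 0.10, 0.20, 0.05],
                 [0.10, 0.90, 0.60, 0.35],
                 [0.05, 0.70, 0.40, 0.10],
                 [0.05, 0.10, 0.05, 0.05]])

results = sweep_thresholds(prob, truth, [0.1, 0.3, 0.5, 0.7])
best_threshold = max(results, key=lambda r: r[2])[0]
```

On this toy example the number of pixels labeled as COVID-19 shrinks from 10 to 2 as the threshold grows, and the best Dice score is reached at an intermediate cut-off. In practice the sweep would be run on the validation set, so that the chosen threshold is not tuned on test data.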
To give a holistic view of the proposed model's performance over all possible threshold values, Figures 6 and 7 provide the precision-recall curves on the test sets of Split 1 and Split 2, respectively. Figure 6 shows an average precision of 0.92 for the conventional U-Net and 0.94 for our TV-UNet on the Split 1 dataset (an improvement of around 0.02 in terms of average precision). Figure 7 shows an average precision of 0.67 for the conventional U-Net and 0.88 for our TV-UNet on Split 2, a relative improvement of 31%. For a fair comparison between the proposed TV-UNet model and the conventional fine-tuned U-Net model, cut-off thresholds that yield similar recall rates were identified for each model, and the models were then compared in terms of the other performance metrics. Tables VII and VIII provide the comparison between these two models for four different recall rates. Our TV-UNet model consistently outperforms the fine-tuned U-Net model on all metrics, showing the added value of the connectivity-promoting regularization. We obtain an average improvement of around 2% in terms of Dice score in Split 1 and 10.9% in Split 2.
Quantitative analyses of COVID-19 segmentation performance on CT images are beginning to appear in publications by others. One such recent model is Inf-Net [17], in which a reverse attention mechanism is used in an encoder-decoder based model for COVID-19 segmentation. That model is trained on the COVID-SemiSeg dataset, which was explained in Section III, and tested on a subset of the first version of the COVID-19 CT segmentation dataset. Therefore, for the comparisons in this section our model is also trained on the COVID-SemiSeg dataset, to have a fair model evaluation setting. Here we compare the proposed TV-UNet model with Inf-Net and a few promising image segmentation models trained on the COVID-SemiSeg dataset, including UNet++ [20], Semi-Inf-Net [17], DeepLab-v3 [21], FCN8s [22], and Semi-Inf-Net+FCN8s [17].
Tables IX, X and XI provide the performance comparisons in terms of recall and Dice coefficient, in different settings. As can be seen from these tables, the proposed TV-UNet model achieves very promising results, outperforming the other models in all three experiments with different settings. To show the model convergence during training, we provide the loss function, recall, and precision of the model at different epochs in Figures 8, 9 and 10. It is worth mentioning that the default threshold value of 0.5 is used for precision and recall in these figures.
A novel deep learning framework for COVID-19 segmentation from CT images was reported. We used the popular U-Net architecture as the main framework, and improved its performance by adding a connectivity-promoting regularization term that encourages the model to generate larger contiguous connected segmentation maps. We showed that the trained model is able to achieve a reasonably high accuracy rate for detecting pathologic COVID-19 regions. We report the model performance under various hyper-parameter settings, which can help future research by the community to understand the impact of different parameters on the final results. We will further extend this work to a semi-supervised setting, in which a combination of labeled and unlabeled data will be used for training the model. Such an approach is expected to be very useful, as collecting accurate segmentation labels for COVID-19 remains very challenging.
"Coronavirus disease (COVID-19) pandemic"
"Comparison to RT-PCR," Radiology
Coronavirus infections - more than just the common cold
COVID-19 and Italy: what next?
COVID-19 in Iran: A Deeper Look Into The Future
"Chest CT findings of COVID-19 pneumonia by duration of symptoms,"
"CT imaging and clinical course of asymptomatic cases with COVID-19 pneumonia at admission in Wuhan, China
COVID-19): a systematic review of imaging findings in 919 patients
"Lung infection quantification of COVID-19 in ct images with deep learning," arXiv preprint
Segnet: A deep convolutional encoder-decoder architecture for image segmentation
Encoder-decoder with atrous separable convolution for semantic image segmentation
Ccnet: Criss-cross attention for semantic segmentation
Image segmentation using deep learning: A survey
Biometric recognition using deep learning: A survey
Unet: Convolutional networks for biomedical image segmentation
Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Images
An encoder-decoder-based method for COVID-19 lung infection segmentation
Towards Efficient COVID-19 CT Annotation: A Benchmark for Lung and Infection Segmentation
Unet++: A nested u-net architecture for medical image segmentation
Encoder-decoder with atrous separable convolution for semantic image segmentation
Fully convolutional networks for semantic segmentation
An algorithm for total variation minimization and applications
An ADMM approach to masked signal decomposition using subspace representation
Group-based sparse representation for image restoration
Deep-covid: Predicting covid-19 from chest x-ray images using deep transfer learning
The authors would like to thank our radiologist, Doctor Ghazaleh Soufi, for her advice on the important signals in chest CT images for detecting COVID-19. We would also like to thank the providers of the publicly available pulmonary CT datasets. M. Sonka's research effort is supported, in part, by NIH grant R01-EB004640.