Towards Unbiased COVID-19 Lesion Localisation and Segmentation via Weakly Supervised Learning

Yang Yang, Jiancong Chen, Ruixuan Wang, Ting Ma, Lingwei Wang, Jie Chen, Wei-Shi Zheng, Tong Zhang
Date: 2021-03-01

Abstract: Despite tremendous efforts, it is very challenging to build a robust model to assist in the accurate quantitative assessment of COVID-19 on chest CT images. Because lesion boundaries are inherently blurred, supervised segmentation methods usually suffer from annotation biases. To support unbiased lesion localisation and to minimise the labelling cost, we propose a data-driven framework supervised by only image-level labels. The framework can explicitly separate potential lesions from original images, with the help of a generative adversarial network and a lesion-specific decoder. Experiments on two COVID-19 datasets demonstrate the effectiveness of the proposed framework and its superior performance over several existing methods.

Since the first case was reported in December 2019, the novel Coronavirus Disease (COVID-19) has grown into a global pandemic. As of 4 October 2020, there had been 34,724,785 confirmed cases of COVID-19, including 1,030,160 deaths, according to the WHO [1]. Accurate lesion localisation and segmentation methods are in great demand to aid fast disease diagnosis and stage monitoring. Among the diagnostic imaging modalities, computed tomography (CT) has proven effective and has been widely used for the assessment and evaluation of disease evolution [2, 3]. Patchy ground-glass opacity (GGO) with consolidation is often found in CT images as a typical sign of lung infection, so the quantitative evaluation of such lung lesions can aid diagnosis.

Recently, deep learning algorithms, e.g., Convolutional Neural Networks (CNNs) [4], have been widely used to detect lung diseases from CT images. For example, researchers have applied existing CNN frameworks, such as U-Net [5], to the automatic segmentation of COVID-19 CT scans [6, 7, 8]. To achieve satisfactory results, highly accurate lesion annotations are essential; however, obtaining a large amount of infection annotations is expensive and time-consuming. COPLE-Net was designed to enhance the robustness of detection using labels polluted by noisy data from non-experts [9]. Another way is to use a weakly supervised framework for classification and localization of lesions [10]. Yet it remains difficult for these methods to identify the boundaries of GGO because of its low contrast and blurred appearance.

To overcome the above challenges, we propose a novel weakly supervised framework for automatic localization and segmentation of COVID-19 pneumonia lesions with the help of only image-level label information. The framework consists of a generative adversarial network and an additional decoder specifically for lesion estimation. It can explicitly decompose any image into two images: one containing the normal information in the original image, and the other containing possible lesion information, if present in the original image. An effective training strategy with new loss terms is proposed to help decompose potential lesions from the normal information in images.

* Corresponding authors. Email: zhangt02@pcl.ac.cn, wangruix5@mail.sysu.edu.cn
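To make the decomposition idea concrete before the formal description, the following is a minimal PyTorch sketch of how a shared encoder and the two decoders could be wired together. The class and argument names are ours, not the authors' released implementation, and the U-Net skip connections used in the actual architecture are omitted for brevity.

```python
import torch
import torch.nn as nn

class DecompositionNet(nn.Module):
    """Splits an input slice into a normal component and a lesion residue."""
    def __init__(self, encoder: nn.Module, g1: nn.Module, g2: nn.Module):
        super().__init__()
        self.encoder = encoder  # shared encoder E
        self.g1 = g1            # generator G1: normal version of the input
        self.g2 = g2            # decoder  G2: lesion information only

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)
        normal = self.g1(z)     # should look like a healthy slice
        lesion = self.g2(z)     # should be (near) empty for healthy inputs
        return normal, lesion

# Toy usage with placeholder sub-networks (real ones would be U-Net halves):
net = DecompositionNet(nn.Identity(), nn.Identity(), nn.Identity())
normal, lesion = net(torch.randn(1, 1, 256, 256))
```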
Extensive evaluations (including a cross-dataset evaluation) on two COVID-19 datasets confirmed the effectiveness of the proposed method in lesion localization and segmentation.

We propose a novel weakly supervised framework for automatic localization and segmentation of COVID-19 pneumonia lesions with the help of only image-level label information (Figure 1). The basic idea is to explicitly decompose any image (either normal or with lesion) into a corresponding normal version and a remaining lesion version, with the constraint that, for any normal image, the lesion version should contain no lesion. To help obtain realistic normal versions from lesioned images, a discriminator $D$ is employed to judge whether the decomposed normal versions are realistic compared to real normal images. The framework can therefore be considered as the fusion of a generative adversarial network (GAN) and a lesion decoder, with the lesion decoder $G_2$ sharing the same encoder $E$ with the generator $G_1$ (Figure 1).

Fig. 1. The proposed weakly supervised framework for lesion localization and segmentation. It consists of the encoder $E$, the generator $G_1$ for estimation of normal information from the input, the decoder $G_2$ for estimation of lesion information from the input, and the discriminator (critic) $D$, which judges whether the generator's outputs are realistically normal. A lung segmentation model was pre-trained and applied to the original CT slices before they are input to the network model.

Suppose a set of CT slices $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with lung regions of interest (ROI) pre-segmented is available to train the network model, where $x_i$ denotes the $i$-th CT slice and $y_i$ denotes whether the slice is normal ($y_i = 0$) or contains lesion ($y_i = 1$). For any slice image $x_i$ input to the model, denote by $G_1(E(x_i))$ the output of the generator $G_1$, representing the normal version of the original input $x_i$, and by $G_2(E(x_i))$ the output of the decoder $G_2$, representing the lesion information in the original input $x_i$. If the decomposition process works well, the recombination of the two decomposed components should be close to the original input, i.e., the reconstruction loss

$$L_r = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - \big( G_1(E(x_i)) + G_2(E(x_i)) \big) \right\|_p$$

should be small, where $\|\cdot\|_p$ represents the $L_p$ norm with $p = 1$ or $2$. Since normal images contain no lesion, if the decomposition works well, the normal version $G_1(E(x_j))$ itself should be close to the original input for any normal input $x_j$, i.e., the normal fidelity loss

$$L_g = \frac{1}{N_1} \sum_{j=1}^{N_1} \left\| x_j - G_1(E(x_j)) \right\|_p$$

should be small, where $x_j$ is the $j$-th normal image and $N_1$ is the total number of normal images; the set of normal images is a subset of the whole dataset $\mathcal{D}$.

While minimizing $L_r$ and $L_g$ together may help the model reconstruct normal images well, it may not be enough for the decoder output $G_2(E(x_i))$ to correctly estimate the lesion information when the input image $x_i$ contains lesion, because there could exist multiple, or even infinitely many, decompositions satisfying the constraint, i.e., making $L_r$ minimal. An extreme case is a generator $G_1$ that always outputs the original input, whether or not the input contains lesion, which would make the lesion decoder output little or no lesion information. To separate lesion from healthy parts in lesioned images, the proposed framework therefore uses a discriminator to judge whether the decomposed normal versions $G_1(E(x_i))$ are really similar to real normal images.
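The two fidelity terms above translate directly into code. Below is a minimal sketch, assuming the $L_1$-norm option ($p = 1$) and additive recombination of the two components; the function and variable names are illustrative, not from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(x, normal, lesion):
    # L_r: the two decomposed components, recombined (here by addition),
    # should reproduce the original slice.
    return F.l1_loss(normal + lesion, x)

def normal_fidelity_loss(x_normal, normal_out):
    # L_g: for a healthy slice, G1(E(x)) alone should already match the
    # input, leaving nothing for the lesion decoder to explain.
    return F.l1_loss(normal_out, x_normal)
```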
Here the Wasserstein GAN with gradient penalty (WGAN-GP) [11] is adopted to train the discriminator (also called the critic) $D$. In the current task, the sampled version of the loss function for the critic of WGAN-GP is

$$L_c = \frac{1}{N} \sum_{i=1}^{N} D\big(G_1(E(x_i))\big) - \frac{1}{N_1} \sum_{j=1}^{N_1} D(x_j) + \lambda \, GP,$$

where $GP$ stands for the gradient penalty term (see its detailed form in [11]) and $\lambda$ is its corresponding weight. This loss aims at maximizing the critic output for the real normal data $x_j$ while minimizing the critic output for the estimated normal data $G_1(E(x_i))$ from the generator $G_1$; a higher output indicates that the input to the critic is more realistic. On the other hand, as part of the well-known alternating GAN training strategy, the generator $G_1$ together with the encoder $E$ can be trained by minimizing the adversarial loss

$$L_a = -\frac{1}{N} \sum_{i=1}^{N} D\big(G_1(E(x_i))\big).$$

Minimizing this loss helps the generator output more realistic normal estimates, i.e., estimates that obtain higher outputs from the critic $D$. Overall, the generator $G_1$, the encoder $E$, and the lesion decoder $G_2$ are trained together by minimizing the combination of the loss terms $L_a$, $L_r$, and $L_g$,

$$L = \alpha_1 L_a + \alpha_2 L_r + \alpha_3 L_g,$$

where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are coefficients trading off the importance of the three loss terms. $L_c$ and $L$ are minimized alternately to train the critic (discriminator) and the other parts of the network model.

The proposed network was trained on 2007 normal lung CT slices and 870 lesioned slices randomly sampled from the COVID-cell dataset [2], and then evaluated on two test sets of lesioned images. One test set includes 128 lesioned images that were randomly sampled from the COVID-cell dataset and then annotated at the pixel level by a radiologist; note that there is no overlap between the training set and this test set, although both come from the COVID-cell dataset. The other test set consists of 493 lesioned slices from the COVID-19 Image Data Collection [12], which contains 20 COVID-19 cases. This dataset was released with lesion area annotations, although the pixel-level annotations are not very accurate, particularly around the boundaries of the lesion areas. It is worth noting that none of the pixel-level lesion annotations were used for model training; they served only for quantitative evaluation of the proposed model.

As a pre-processing step, the lung regions in all images in both the training and test sets were segmented out with a U-Net segmentation model pre-trained on the COVID-cell dataset. The visual information outside the lung region was removed from each image based on the segmentation mask before the image was used for training or testing. Each image was resized to 256 × 256 pixels and then normalized by the mean and standard deviation of the pixel values over all training images.

In the proposed framework, the encoder $E$ and the generator $G_1$ form the well-known U-Net, and similarly the encoder $E$ and the lesion decoder $G_2$ form another U-Net. The only modification is the addition of a Tanh activation function at the last layer of the generator $G_1$ and of the decoder $G_2$, to constrain the pixel values of the outputs within the same range $(-1, 1)$ as the model's input. A seven-layer CNN was used for the discriminator (critic) $D$, with the outputs of three down-sampling stages pooled globally and then concatenated to form the input to the final fully connected layer. For model training, the gradient penalty coefficient $\lambda$ in the WGAN loss was empirically set to 10, and the coefficients were set to $\alpha_1 = 0.01$ and $\alpha_2 = \alpha_3 = 100$.
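The critic and generator objectives above could be implemented as in the following sketch, using the stated hyperparameters ($\lambda = 10$, $\alpha_1 = 0.01$, $\alpha_2 = \alpha_3 = 100$). The pairing of each coefficient with its loss term follows our reading of the text rather than a confirmed mapping, and the function names are ours.

```python
import torch

def gradient_penalty(critic, real, fake):
    # Standard WGAN-GP penalty on random interpolates between real and
    # generated normal images [11].
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat,
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

def critic_loss(critic, real_normal, fake_normal, lam=10.0):
    # L_c: lower critic scores for generated normals, higher for real ones.
    fake = fake_normal.detach()
    return (critic(fake).mean() - critic(real_normal).mean()
            + lam * gradient_penalty(critic, real_normal, fake))

def generator_loss(critic, fake_normal, l_r, l_g,
                   a1=0.01, a2=100.0, a3=100.0):
    # Combined loss L = a1*L_a + a2*L_r + a3*L_g,
    # with L_a = -mean D(G1(E(x))); assumed coefficient-to-term pairing.
    return a1 * (-critic(fake_normal).mean()) + a2 * l_r + a3 * l_g
```

In training, `critic_loss` and `generator_loss` would be minimized alternately, matching the alternating scheme described above.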
Adam was adopted as the optimizer during model training, with the default learning rate 0.0002 and batch size 8.

PR curves are used to evaluate the localization performance of the proposed model and the baseline methods; they were generated by comparing the pixel-level lesion estimates (from the output of the decoder $G_2$) with the ground-truth annotations. The output values of the decoder were normalized from $[-1, 1]$ to $[0, 1]$ for both the quantitative and qualitative evaluations. We compared the lesion localization ability of our method with the commonly used visualization techniques CAM [13] and Grad-CAM [14], the recent visual feature attribution using Wasserstein GANs (VA-GAN) [15], and the biomarker localization (BL) method [16]. ResNet18 was chosen as the backbone of CAM and Grad-CAM, trained on a binary classification task. From Figure 2 (Left), it can be observed that CAM fails to detect most lesion areas, while Grad-CAM detects the lesions but introduces many irrelevant areas. BL can indeed localize some lesions, but often fails to detect small or indistinct ones. In comparison, our method provides much more precise localization (and therefore segmentation) of lesions, even when the lesions have irregular shapes or vague boundaries, demonstrating its superior performance. This is also confirmed by the quantitative evaluation in the PR curves (Figure 2, Right), with an area under the PR curve (AUC) of 0.63 for our approach, 0.36 for BL, 0.179 for Grad-CAM, and 0.061 for CAM.

Based on the localization results, lesions can be automatically segmented by thresholding the heatmaps. Figure 3 shows that over a wide range of thresholds ([0.1, 0.4]), this simple threshold-based segmentation yields Dice scores above 0.7, suggesting that the proposed framework can provide reasonably good segmentation based on image-level labels alone.

In this section, we evaluate the effect of the loss terms and the different components of our framework in an ablation study. As can be seen from Figure 4 (Left), removing individual framework components or loss terms degrades the localization and corresponding segmentation (threshold = 0.4) performance to varying degrees. In particular, removing the lesion decoder or the discriminator causes the model to fail to discriminate the lesion regions from many normal regions, and without the normal fidelity loss, some normal boundaries are mistakenly considered lesions. The quantitative evaluation (Figure 4, Right) further confirms the contribution of each component and loss term to the performance.

In this paper, an effective weakly supervised localization and segmentation framework is proposed. Experiments on two lung CT datasets demonstrate that the proposed framework achieves superior performance compared with widely used visualization methods and a recent lesion localization method. Without detailed lesion-region annotations, the proposed framework provides a novel and effective way for clinicians to efficiently analyze the extent of lesions based on the automatic localization and segmentation results, particularly for the diagnosis of COVID-19. Source code implemented in PyTorch and MindSpore will be available at https://git.pcl.ac.cn/capepoint after the conference.

This work proposes a general machine learning framework for weakly supervised image localisation and segmentation.
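As a concrete reference for the evaluation protocol above, here is a minimal NumPy sketch of the heatmap rescaling, threshold-based segmentation, and Dice computation; the function names are hypothetical.

```python
import numpy as np

def heatmap_from_decoder(lesion_out: np.ndarray) -> np.ndarray:
    # Rescale the decoder's Tanh output from [-1, 1] to [0, 1].
    return (lesion_out + 1.0) / 2.0

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    # Dice coefficient between two boolean lesion masks.
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

# Example: segment by thresholding the heatmap at 0.4, as in Figure 3.
# mask = heatmap_from_decoder(lesion_out) > 0.4
```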
Our experimental data were collected from open-source datasets, which were ethically approved as indicated in the references [2, 12].

References
[1] A novel coronavirus outbreak of global health concern.
[2] Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography.
[3] The role of chest imaging in patient management during the COVID-19 pandemic: a multinational consensus statement from the Fleischner Society.
[4] Deep learning with convolutional neural network in radiology.
[5] U-Net: Convolutional networks for biomedical image segmentation.
[6] Serial quantitative chest CT assessment of COVID-19: Deep-learning approach.
[7] Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT.
[8] Longitudinal assessment of COVID-19 using a deep learning-based quantitative CT pipeline: Illustration of two cases.
[9] A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images.
[10] A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT.
[11] Improved training of Wasserstein GANs.
[12] COVID-19 image data collection.
[13] Learning deep features for discriminative localization.
[14] Grad-CAM: Visual explanations from deep networks via gradient-based localization.
[15] Visual feature attribution using Wasserstein GANs.
[16] Biomarker localization by combining CNN classifier and generative adversarial network.