key: cord-0523857-fitte5bf
authors: Hu, Shaoping; Gao, Yuan; Niu, Zhangming; Jiang, Yinghui; Li, Lao; Xiao, Xianglu; Wang, Minhao; Fang, Evandro Fei; Menpes-Smith, Wade; Xia, Jun; Ye, Hui; Yang, Guang
title: Weakly Supervised Deep Learning for COVID-19 Infection Detection and Classification from CT Images
date: 2020-04-14
journal: nan
DOI: nan
sha: 003d2e515e1aaf06f0052769953e861ed8e56608
doc_id: 523857
cord_uid: fitte5bf

An outbreak of a novel coronavirus disease (i.e., COVID-19) has been recorded in Wuhan, China since late December 2019, and it subsequently became a pandemic around the world. Although COVID-19 is an acute, treatable disease, it can also be fatal, with a case fatality rate of 4.03% in China and the highest rates of 13.04% in Algeria and 12.67% in Italy (as of 8th April 2020). The onset of serious illness may result in death as a consequence of substantial alveolar damage and progressive respiratory failure. Although laboratory testing, e.g., using reverse transcription polymerase chain reaction (RT-PCR), is the gold standard for clinical diagnosis, the tests may produce false negatives. Moreover, under the pandemic situation, a shortage of RT-PCR testing resources may also delay subsequent clinical decisions and treatment. Under such circumstances, chest CT imaging has become a valuable tool for both diagnosis and prognosis of COVID-19 patients. In this study, we propose a weakly supervised deep learning strategy for detecting and classifying COVID-19 infection from CT images. The proposed method can minimise the requirement for manual labelling of CT images while still obtaining accurate infection detection and distinguishing COVID-19 from non-COVID-19 cases. Based on the promising results obtained qualitatively and quantitatively, we can envisage a wide deployment of our developed technique in large-scale clinical studies.

COVID-19 is highly contagious, and severe cases can lead to acute respiratory distress or multiple organ failure [3]. On 11 March 2020, the WHO made the assessment that COVID-19 could be characterised as a pandemic. As of 8th April 2020, in total, 1,391,890 cases of COVID-19 had been recorded, and the death toll had reached 81,478, with a rapid increase of cases in Europe and North America. The disease can be confirmed by using the reverse-transcription polymerase chain reaction (RT-PCR) test [4]. While RT-PCR is the gold standard for diagnosis, confirming COVID-19 patients with it is time-consuming, and both high false-negative rates and low sensitivities may hinder presumptive patients from being identified and treated early [3][5][6]. As a non-invasive imaging technique, computed tomography (CT) can detect the characteristics manifested in the COVID-19 infected lung, e.g., bilateral patchy shadows or ground-glass opacity (GGO) [7][8]. Hence, CT may serve as an important tool for COVID-19 patients to be screened and diagnosed early. Despite these advantages, CT images of COVID-19 share some common imagery characteristics with other types of pneumonia, making the automated distinction difficult. Recently, deep learning based artificial intelligence (AI) technology has demonstrated tremendous success in the field of medical data analysis due to its capacity to extract rich features from multimodal clinical datasets [9]. Previously, deep learning was developed for diagnosing and distinguishing bacterial and viral pneumonia from thoracic imaging data [10]. In addition, attempts have been made to detect various chest CT imaging features [11].
In the current COVID-19 pandemic, deep learning based methods have been rapidly developed for chest CT data analysis and classification [2][3][12]. Besides, deep learning algorithms have been proposed for patient monitoring [13], screening [14] and prediction of hospital stay [15]. A full list of current AI applications for COVID-19 related research can be found elsewhere [16]. In this study, we focus on chest CT image based localisation of the infected areas, together with disease classification and diagnosis, for COVID-19 patients. Although initial studies have demonstrated promising results using chest CT for the diagnosis of COVID-19 and the detection of infected regions, most existing methods are based on the commonly used supervised learning scheme. This requires a considerable amount of manual labelling of the data; however, in such an outbreak situation clinicians have very limited time to perform the tedious manual drawing, which may prevent the implementation of such supervised deep learning methods. In this study, we propose a weakly supervised deep learning framework to detect COVID-19 infected regions fully automatically using chest CT data acquired from multiple centres and multiple scanners. Based on the detection results, we can also achieve diagnosis of COVID-19 patients. In addition, we test the hypothesis that, based on CT radiological features, the deep neural networks we developed can distinguish COVID-19 cases from community acquired pneumonia (CAP) and non-pneumonia (NP) scans.

The CT images were scanned with matrix = 512×512 and field of view = 500 mm × 500 mm, and the reconstructed slice thickness varies between 1 mm, 2.5 mm and 3 mm. Data pre-processing steps were performed to standardise data acquired from multiple centres and multiple scanners. Instead of normalising input slices into a pre-defined Hounsfield unit (HU) window, we designed a more flexible scheme based on previously proposed image enhancement techniques. For lung segmentation, we used a multi-view U-Net based segmentation network consisting of a multi-window voting post-processing procedure and a sequential information attention module, in order to utilise the information from each view of the 3D volume and reinforce the integrity of the 3D lung structure in the delineation results. Our lung segmentation model was trained, cross-validated and tested on the TCIA dataset with manual ground truth. The trained lung segmentation model was then used to infer the delineation of the lung anatomy of the COVID-19, CAP and NP patients included in this study.

Detection and Classification Network. Inspired by the VGG architecture [22], we adopted the configuration that increases CNN depth using small convolution filters stacked with non-linearity injected in between, as depicted in Figure 1. All convolution layers consisted of 3×3 kernels, batch normalisation and Rectified Linear Units. The proposed CNN was fully convolutional, consisting of five convolutional blocks, i.e., Conv1, Conv2, Conv3, Conv4 and Conv5, in the backbone architecture. The full architecture, using shorthand notation, is 2× C(32,3,1)-MP-2× C(64,3,1)-MP-3× C(128,3,1)-MP-3× C(256,3,1)-MP-3× C(256,3,1)-MP, where C(d,f,s) indicates a convolution layer with d filters of spatial size f×f, applied to the input with stride s. MP represents a non-overlapping max-pooling operation with a kernel size of 2×2.
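The backbone shorthand above can be read as the following sketch, written with tf.keras for illustration; the single-channel 512×512 input, the layer names and the multi-output model are assumptions made for exposition, not the authors' released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters, n_convs, name):
    """n_convs stacked 3x3 convolutions (stride 1), each with batch norm and ReLU, then 2x2 max-pooling."""
    for i in range(n_convs):
        x = layers.Conv2D(filters, 3, strides=1, padding='same', name=f'{name}_conv{i + 1}')(x)
        x = layers.BatchNormalization(name=f'{name}_bn{i + 1}')(x)
        x = layers.ReLU(name=f'{name}_relu{i + 1}')(x)
    return layers.MaxPooling2D(2, name=f'{name}_pool')(x)

def build_backbone(input_shape=(512, 512, 1)):
    """2xC(32,3,1)-MP - 2xC(64,3,1)-MP - 3xC(128,3,1)-MP - 3xC(256,3,1)-MP - 3xC(256,3,1)-MP."""
    inputs = layers.Input(shape=input_shape)
    c1 = conv_block(inputs, 32, 2, 'conv1')
    c2 = conv_block(c1, 64, 2, 'conv2')
    c3 = conv_block(c2, 128, 3, 'conv3')   # Conv3: later fed to a weakly supervised head
    c4 = conv_block(c3, 256, 3, 'conv4')   # Conv4: later fed to a weakly supervised head
    c5 = conv_block(c4, 256, 3, 'conv5')   # Conv5: later fed to a weakly supervised head
    return models.Model(inputs, [c3, c4, c5], name='backbone')
```

Exposing Conv3, Conv4 and Conv5 as model outputs mirrors the multi-scale scheme described in the next section.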
From previous findings using CT [23][24][25], it is known that COVID-19 infections share common radiographic features with CAP, such as GGO and airspace consolidation. They are frequently distributed bilaterally and peripherally, with lower zone predominance, and the infected areas can vary significantly in size depending on the condition of the patient. For example, in mild cases the abnormalities appear to be small, but in severe cases they appear scattered and spread over a large area. Therefore, we proposed a multi-scale learning scheme to cope with variations in the size and location of the lesions. To implement this, we fed the intermediate CNN representations, i.e., the feature maps at Conv3, Conv4 and Conv5, respectively, into weakly supervised classification layers, in which a 1×1 convolution was applied to map the feature maps down to class score maps (i.e., class activation maps). We then applied spatial aggregation with a Global Max Pooling (GMP) operation to obtain categorical scores. The score vectors at the Conv3, Conv4 and Conv5 levels were aggregated by summation to make a final prediction with a Softmax function (a minimal sketch of this head and its loss is given below). We then trained the proposed model end-to-end by minimising the following objective function

L = -\frac{1}{N}\sum_{n=1}^{N} w_{c_n}\, f_n \,\log P_{c_n}(x_n),   (1)

where there are N training images x_n and K training classes, S_k(x_n) is the k-th component of the score vector S(x_n) \in \mathbb{R}^K, P_k(x_n) is the corresponding Softmax probability, and c_n is the true class of x_n. As we encountered an imbalanced classification problem, we added a class-balanced weighting factor w_k to the cross-entropy loss, set by inverse class frequency, i.e., w_k \propto 1/N_k with N_k the number of training samples in class k, together with a focal modulating factor f_n = (1 - P_{c_n}(x_n))^{\gamma}. When an example was misclassified and P_{c_n}(x_n) was small, the factor f_n was near 1 and the loss was unaffected. As P_{c_n}(x_n) \to 1, the factor went to 0 and the loss for well-classified examples was down-weighted. The parameter γ is a positive integer which can smoothly adjust the rate at which easy examples are down-weighted; as γ is increased, the modulating effect of the factor f_n increases.

After determining the class score maps and the image category in a forward pass through the network, the discriminative patterns corresponding to that category can then be localised in the image. A coarse localisation could already be achieved by directly relating each of the neurons in the class score maps to its receptive field in the original image. However, it is also possible to obtain pixel-wise maps containing information about the location of class-specific target structures at the resolution of the original input images. This can be achieved by calculating how much each pixel influences the activation of the neurons in the target score map. Such maps can be used to obtain a much more accurate localisation, like the examples shown in Figure 2. In the following, we will show how category-specific saliency maps can be obtained through integrated gradients. Besides, we will also show how to post-process the saliency maps to extract bounding boxes around the detected lesions. Generally, suppose we have a flattened input image denoted as x = (x_1, ..., x_n) \in \mathbb{R}^n, where n is the number of pixels; a category-specific saliency map can be obtained by calculating the gradient of the predicted class score S(x) with respect to the input x: g = \partial S(x)/\partial x = (g_1, ..., g_n) \in \mathbb{R}^n, where g_i represents the contribution of the individual pixel x_i to the prediction. In addition, the gradient can be estimated by back-propagating the final prediction score through each layer of the network.
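A minimal sketch of the multi-scale weakly supervised head and the class-balanced focal cross-entropy loss described above, using tf.keras and TensorFlow ops for illustration; K_CLASSES, the helper names and the way class weights are passed in are assumptions rather than the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers

K_CLASSES = 3  # NP, CAP and COVID-19 in the three-way setting

def class_score_head(feature_map, name):
    """1x1 convolution -> K class activation maps, then global max pooling -> score vector."""
    cams = layers.Conv2D(K_CLASSES, 1, name=f'{name}_cam')(feature_map)
    scores = layers.GlobalMaxPooling2D(name=f'{name}_gmp')(cams)
    return cams, scores

def multi_scale_logits(c3, c4, c5):
    """Aggregate the per-level score vectors by summation before the Softmax."""
    _, s3 = class_score_head(c3, 'conv3')
    _, s4 = class_score_head(c4, 'conv4')
    _, s5 = class_score_head(c5, 'conv5')
    return s3 + s4 + s5

def balanced_focal_loss(labels, logits, class_weights, gamma=1.0):
    """Eq. 1: mean over the batch of w_c * (1 - P_c)^gamma * (-log P_c)."""
    probs = tf.nn.softmax(logits)
    p_true = tf.reduce_sum(probs * tf.one_hot(labels, K_CLASSES), axis=-1)
    w = tf.gather(class_weights, labels)        # inverse-class-frequency weights
    modulator = tf.pow(1.0 - p_true, gamma)     # down-weights easy, well-classified examples
    return tf.reduce_mean(-w * modulator * tf.math.log(p_true + 1e-8))
```

Here class_weights is expected to be a length-K tensor of inverse class frequencies and labels an integer tensor of true classes.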
There are many state-of-the-art back-propagation approaches, including Guided-Backpropagation [27], DeepLift [28] and Layer-wise Relevance Propagation (LRP) [29]. However, the Guided-Backpropagation method may break gradient sensitivity because it back-propagates through a ReLU node only if the ReLU is turned on at the input. In particular, this lack of sensitivity causes gradients to focus on irrelevant features and results in undesired saliency localisation. The DeepLift and LRP methods tackle the sensitivity issue by computing discrete gradients instead of instantaneous gradients at the input. However, they fail to satisfy implementation invariance because the chain rule does not hold for discrete gradients in general; as a result, the back-propagated gradients are potentially sensitive to unimportant features of the models. To deal with these limitations, we employ a feature attribution method named "Integrated Gradients" [30] that assigns an importance score \phi_i(S(x), x) (similar to a pixel-wise gradient) to the i-th pixel, representing how much the pixel value adds to or subtracts from the network output. A large positive score indicates that the pixel strongly increases the prediction score S(x), while an importance score close to zero indicates that the pixel does not influence S(x). To compute the importance score, we need to introduce a baseline input representing the "absence" of the feature input, denoted as x' = (x'_1, ..., x'_n) \in \mathbb{R}^n, which in our study was a null image (filled with zeros) with the same shape as the input image x. We considered the straight-line path, i.e., point-to-point from the baseline x' to the input x, and computed the gradients at all points along the path. Integrated gradients can then be defined as

\phi_i(S(x), x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial S(x' + \alpha (x - x'))}{\partial x_i}\, d\alpha,   (2)

where \alpha \in [0, 1]. Intuitively, integrated gradients obtain importance scores by accumulating gradients on images interpolated between the baseline value and the current input. The integral in Eq. 2 can be efficiently approximated via a summation of the gradients as

\phi_i(S(x), x) \approx (x_i - x'_i) \times \frac{1}{m} \sum_{k=1}^{m} \frac{\partial S\left(x' + \frac{k}{m}(x - x')\right)}{\partial x_i},   (3)

where m is the number of steps in the Riemann approximation of the integral. We compute the approximation in a loop over the interpolated inputs, i.e., for k = 1, ..., m. The integrated gradients are computed at different feature levels, which in our experiments are Conv3, Conv4 and Conv5, as shown in Figure 2(b), Figure 2(c) and Figure 2(d). Then, a joint saliency can be obtained, as depicted in Figure 2(e), by pixel-wise multiplication of the multi-scale integrated gradients.

Next, we post-processed the joint saliency map to extract bounding boxes. Firstly, we took the absolute value of the joint saliency map and blurred it with a 5 × 5 Gaussian kernel. Then, we thresholded the blurred saliency map using the Isodata thresholding method [31], which iteratively determines a threshold segmenting the image into foreground and background such that the threshold lies midway between the mean intensities of the sampled foreground and background pixels. In doing so, we obtained a binary mask on which we applied morphological operations (dilation followed by erosion) to close the small holes in the foreground. Finally, we took the connected components with areas above a certain threshold and fitted the minimum rectangular bounding boxes around them. An example is shown in Figure 2(f).
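The sketch below illustrates the m-step approximation of Eq. 3 with a zero baseline and the saliency post-processing that yields bounding boxes. The grad_fn interface, the sigma of the Gaussian blur and the minimum-area threshold are illustrative assumptions, and scikit-image/SciPy stand in for whichever image-processing library was actually used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.filters import threshold_isodata
from skimage.measure import label, regionprops
from skimage.morphology import binary_closing

def integrated_gradients(x, grad_fn, m=25):
    """Approximate Eq. 3; grad_fn(x) must return dS/dx for the predicted class at input x."""
    baseline = np.zeros_like(x)                      # null-image baseline ("absence" of input)
    total = np.zeros_like(x)
    for k in range(1, m + 1):                        # points along the straight-line path
        total += grad_fn(baseline + (k / m) * (x - baseline))
    return (x - baseline) * total / m

def saliency_to_boxes(joint_saliency, min_area=50):
    """Blur -> Isodata threshold -> morphological closing -> connected components -> boxes."""
    blurred = gaussian_filter(np.abs(joint_saliency), sigma=1.0)
    mask = blurred > threshold_isodata(blurred)
    mask = binary_closing(mask)                      # dilation followed by erosion
    boxes = []
    for region in regionprops(label(mask)):
        if region.area >= min_area:                  # keep components above an area threshold
            boxes.append(region.bbox)                # (min_row, min_col, max_row, max_col)
    return boxes
```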
Experiments Setup: We trained the proposed model for both a three-way classification (i.e., K = 3 for NP, CAP and COVID-19) and three binary classification tasks (K = 2), i.e., NP vs. COVID-19, NP vs. CAP and CAP vs. COVID-19, respectively. In the three-way classification setting, we first trained individual classifiers at different convolution blocks; in our experiment, we chose Conv3, Conv4 and Conv5, respectively. Then, we trained a joint classifier on the aggregated prediction scores (as described in the "Multi-Scale Learning" Section). All the classifiers were trained with the loss in Eq. 1. Finally, we conducted 5-fold cross-validation on all tasks such that, in each category, we split the datasets into training, validation and test sets. This ensures that no samples (images) originating from validation and test patients were used for training. In each fold, we held out ~20% of all samples for validation and testing, and the remainder was used for training.

Training Configurations: We implemented the proposed model (as depicted in Figure 1) using TensorFlow 1.14.0. All models were trained from scratch on four Nvidia GeForce GTX 1080 Ti GPUs with an Adam optimiser (learning rate: 10 , β1 = 0.5, β2 = 0.9 and ϵ = 10 ). We set γ to 1 in the focal modulator, and the total number of training iterations was set to 20,000. Early stopping was enabled to terminate training automatically when the validation loss stopped decreasing for 1,000 iterations. We ran validation once every 500 training iterations, and a checkpoint was saved automatically if the current validation accuracy exceeded the previous best validation accuracy. Once training was terminated, we generated a frozen graph from the latest checkpoint and saved it in .pb format. For testing, we simply loaded the frozen graphs and retrieved the required nodes. Empirically, we found that 20 to 30 steps were sufficient to approximate the integral when computing the integrated gradients; thus, we fixed m = 25 in Eq. 3.

Using positive results of the RT-PCR testing as the ground truth labelling for the COVID-19 group, and the diagnosis results of the CAP and NP patients, the accuracy, precision, sensitivity and specificity [32,33] of our classification framework were calculated. We also carried out analysis of the area under the receiver operating characteristic curve (AUC) to quantify our classification performance. For the lung segmentation, we used the Dice score [34] to evaluate the accuracy. In order to evaluate the lung segmentation network, we randomly split the 60 TCIA datasets with ground truth into 40 training, 10 validation and 10 independent testing datasets. Ablation study results of different pre-processing and post-processing methods were evaluated using Dice scores.
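A minimal sketch of how the reported metrics can be computed for one binary task from the held-out predictions, using scikit-learn and NumPy; the variable names (y_true, y_prob) and the 0.5 decision threshold are placeholders rather than details taken from the original code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, sensitivity, specificity and AUC for one binary task."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        'ACC': (tp + tn) / (tp + tn + fp + fn),
        'PRC': tp / (tp + fp),
        'SEN': tp / (tp + fn),                 # sensitivity (recall)
        'SPE': tn / (tn + fp),                 # specificity
        'AUC': roc_auc_score(y_true, y_prob),
    }

def dice_score(pred_mask, true_mask):
    """Dice coefficient used to evaluate the lung segmentation masks."""
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    denom = pred.sum() + true.sum()
    return 2.0 * np.logical_and(pred, true).sum() / denom if denom else 1.0
```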
Qualitatively, the class activation maps (CAMs, Figure 4) capture large patchy-like lesions, such as crazy paving sign and consolidation, and also small nodule-like lesions, such as ground-glass opacities (GGO) and bronchovascular thickening. Notably, we found the mid-level layers, i.e., Conv3 and Conv4, learn to detect small lesions (GGO most frequently), especially those distributed peripherally and subpleurally. However, they are not able to capture larger patchy-like lesions, and this may be because of the limited receptive field at the mid-level layers. In contrast, the high-level layer, i.e., Conv5, having a sufficiently large receptive field, learns well to detect the large patchy-like lesions, such as crazy paving sign and consolidation, which are often distributed centrally and peribronchially.

Figure 5 shows examples of the category-specific joint saliency computed by integrated gradients; it shows the original inputs on the left and the overlaid saliency on the right. The CAMs shown in Figure 4 only depict the spatial distribution of infection; they cannot be used for precise localisation of the lesions. The saliency maps, on the other hand, can provide pixel-level information that delineates the exact extent of the lesions, thus providing precise localisation. Furthermore, this can also be clinically useful for diagnosis: with the saliency maps, we can estimate the percentage of infected lung area (a simple sketch is given below). These saliency maps highlight the pixels that contribute to increasing the category-specific scores: the brighter the pixels, the more significant the contribution. Intuitively, one can also interpret this as follows: the brighter the pixels, the more critical the features are for the network's decision (prediction). It is of note that in Figure 4 and Figure 5 there is not only inter-class contrast variation (because the data were collected from multiple institutions) but also intra-class contrast variation, especially in the COVID-19 group. In our experiments, we found that histogram matching can suppress lesions, especially on COVID-19 images; for instance, GGOs disappear or become less apparent, and this leads to inferior detection performance. Therefore, instead of directly applying histogram matching, we applied random on-the-fly contrast adjustment for data augmentation at training time (also sketched below). This turned out to be very effective: as demonstrated in Figure 5, our proposed model learns to be invariant to image contrast and precisely captures the lesions. In addition, from the COVID-19 and CAP saliency, we found that CAP lesions are generally smaller and more locally constrained compared with COVID-19 cases, which often have multiple infected regions with massive, scattered lesions. It should also be noted that COVID-19 and CAP lesions do share similar radiographic features, such as GGO and airspace consolidation, and GGOs also appear frequently in subpleural regions in CAP cases. Interestingly, from the saliency maps for the NP cases, we found the network takes the pulmonary arteries as the salient feature. Finally, Figure 6 shows the bounding boxes extracted from the COVID-19 and CAP saliency maps (corresponding to the examples in Figure 5). We found the results agree with our earlier findings that CAP cases have fewer infected areas and often a single instance of infection, whereas COVID-19 lesions vary greatly in extent. Overall, CAP infection areas are smaller compared with those of COVID-19.
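A minimal sketch of the random on-the-fly contrast adjustment and of the infected-lung-percentage estimate mentioned above; the contrast range, the assumption of intensities normalised to [0, 1] and the function names are illustrative placeholders, not values or procedures reported by the authors.

```python
import numpy as np

def random_contrast(image, rng, lower=0.7, upper=1.3):
    """Rescale intensities around the slice mean by a random factor, then clip to [0, 1]."""
    factor = rng.uniform(lower, upper)
    adjusted = image.mean() + factor * (image - image.mean())
    return np.clip(adjusted, 0.0, 1.0)

def infection_percentage(infection_mask, lung_mask):
    """Percentage of lung pixels flagged as infected by the thresholded joint saliency."""
    lung = lung_mask.astype(bool)
    infected = np.logical_and(infection_mask.astype(bool), lung)
    return 100.0 * infected.sum() / max(lung.sum(), 1)

# Example usage with a fixed seed for reproducibility:
# rng = np.random.default_rng(0)
# augmented = random_contrast(ct_slice, rng)
```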
Performance of our proposed model on each specific task was evaluated with 5-fold cross-validation, and the results on the test set are reported and summarised in Table 2. We use five evaluation metrics: accuracy (ACC), precision (PRC), sensitivity (SEN), specificity (SPE) and the area under the ROC curve (AUC). We report the mean of the 5-fold cross-validation results for each metric with the 95% confidence interval. We also compared our proposed method with a reimplementation of the Navigator-Teacher-Scrutinizer Network (NTS-NET) [35]. As described earlier in the experimental settings, we have two groups of tasks, namely three-way classification tasks and binary classification tasks, and two learning configurations: single-scale learning, which assigns an auxiliary classifier to a specific feature level, and multi-scale learning, which aggregates the multi-level prediction scores and trains a joint classifier on them. All the binary tasks listed were trained with multi-scale learning.

In terms of three-way classification, we found that multi-scale learning with the joint classifier achieves better overall performance than any of the single-scale learning configurations. It is of note that, among the single-scale learning tasks, classification with Conv4 and Conv5 features achieves very similar performance in every metric, which is significantly better than classification with mid-level (Conv3) features. One possible explanation is that the mid-level features are not sufficiently semantic compared with the higher-level features, i.e., Conv4 and Conv5. As we know, high-level CNN representations are semantically strong but poor at preserving spatial details, whereas mid- and lower-level CNN representations preserve local features well but lack semantic information. Furthermore, it is of note that, overall, the binary classification tasks achieve significantly better performance than the three-way classification, especially tasks such as NP/COVID-19 and NP/CAP. The results suggest that our proposed model is reasonably good at distinguishing COVID-19 cases from NP cases, achieving a mean ACC of 96.2%, PRC of 97.3%, SEN of 94.5%, SPE of 95.3% and AUC of 0.970, respectively. One explanation is that binary classification is less complicated and involves less uncertainty than three-way classification. It may also be because COVID-19 and CAP image features are intrinsically discriminative compared with the NP cases; for instance, as demonstrated for the COVID-19 cases earlier, there is often a combination of various disease patterns and large areas of infection on the scans. Last but not least, we found that the performance of the COVID-19/CAP classification is the weakest among all the binary classification tasks. One possible reason is that COVID-19 shares similar radiographic features with CAP, such as GGO and airspace consolidation, and the network capacity may not be sufficient to learn disease-specific representations. Nevertheless, the results obtained using our proposed method outperformed those obtained by the NTS-NET.

We also break down the overall performance of the joint classifier into classes, and the classification metrics are reported for each class, as shown in Table 3 and Figure 7. We found that the "COVID-19" and "NP" classes achieve comparable performance in each metric, and the "NP" class has higher sensitivity (91.3%) than "COVID-19" (87.6%) and "CAP" (83.0%). Besides, we found that, overall, "COVID-19" remains the best-performing and most discriminative class, with a mean AUC of 0.923, compared with "CAP" (0.864) and "NP" (0.901). It can also be noted that the overall results for the "CAP" class are moderately lower than those of "NP" and "COVID-19". This correlates with our finding in the COVID-19/CAP classification that, because of the similar appearance, the "CAP" class is sometimes misclassified as "COVID-19". Another possible reason is that the network could have been distracted by a few "NP noises", as there might be a small number of non-infected slices among the CAP training samples. This is because we sampled all the available slices from each subject, and a few slices may have no infection.

However, the high-level representation tends to discard small local lesions.
This is well complemented by the mid-level representations (Figure 4), i.e., Conv3 and Conv4, from which the detected lesions also correspond to our clinical findings that the infections are usually located in the peripheral lung (95%), mainly in the inferior lobe of the lungs (65%), and especially in the posterior segment (51%). We speculate that this is mainly because there are more well-developed bronchioles and alveoli, richer blood flow, and immune cells such as lymphatic cells in the periphery; these immune cells play a vital role in the inflammation caused by the virus. We have also demonstrated that combining the multi-scale saliency maps generated by integrated gradients is key to achieving precise localisation of multi-instance lesions. Furthermore, from a clinical perspective, the joint saliency is useful in that it provides a reasonable estimate of the percentage of infected lung area, which is a crucial factor that clinicians take into account when evaluating the severity of a COVID-19 patient. Besides, the classification performance of the proposed network has been studied extensively: we conducted not only three-way classification but also binary classification combining any two of the classes.

We found that one limitation of the proposed network is that it is not discriminative enough when it comes to separating CAP from COVID-19. We suspect this is due to the limited capacity of the backbone CNN; a straightforward way of boosting CNN capacity is to increase the number of feature channels at each level. Another future direction would be to employ more advanced backbone architectures, such as ResNet and Inception. A further limitation of this work is that we trained the networks on individual slices (images), using all available samples for each subject. However, for the CAP or COVID-19 subjects, there might be a fraction of non-infected slices in between, which could introduce noise into training. In the future, we can address this limitation by attention-based multiple instance learning: instead of training on individual slices, we would put the patient-specific slices into a bag and train on bags. The network would learn to assign weights to individual slices in a COVID-19 or CAP positive bag and automatically sample those highly weighted slices for infection detection.

In this study, we designed a weakly supervised deep learning framework for fast and fully automated detection and classification of COVID-19 infection using retrospectively extracted CT images from multiple scanners and multiple centres. Our framework can distinguish COVID-19 cases accurately from CAP and NP patients. It can also pinpoint the exact position of the lesions or inflammation caused by COVID-19, and can therefore potentially provide advice on patient severity to guide the subsequent triage and treatment. Experimental findings have indicated that the proposed model achieves high accuracy, precision and AUC for the classification, as well as promising qualitative visualisation for the lesion detection. Based on these findings, we can envisage a large-scale deployment of the developed framework.
References

[1] Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: a prospective study
[2] Lung Infection Quantification of COVID-19 in CT Images with Deep Learning
[3] Artificial Intelligence Distinguishes COVID-19 from Community Acquired Pneumonia on Chest CT
[4] Chest CT for typical 2019-nCoV pneumonia: relationship to negative RT-PCR testing
[5] Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases
[6] Sensitivity of chest CT for COVID-19: comparison to RT-PCR
[7] Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China
[8] Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China
[9] A survey on deep learning in medical image analysis
[10] Visualization and interpretation of convolutional neural network predictions in detecting pneumonia in pediatric chest radiographs
[11] Lung pattern classification for interstitial lung diseases using a deep convolutional neural network
[12] Deep learning Enables Accurate Diagnosis of Novel Coronavirus (COVID-19) with CT images
[13] Rapid AI Development Cycle for the Coronavirus (COVID-19) Pandemic: Initial Results for Automated Detection & Patient Monitoring using Deep Learning CT Image Analysis
[14] Deep Learning System to Screen Coronavirus Disease
[15] Machine learning-based CT radiomics model for predicting hospital stay in patients with pneumonia associated with SARS-CoV-2 infection: A multicenter study
[16] Mapping the Landscape of Artificial Intelligence Applications against COVID-19
[17] Autosegmentation for thoracic radiation treatment planning: A grand challenge at AAPM 2017
[18] An algorithm for fast adaptive image binarization with applications in radiotherapy imaging
[19] Sliding window adaptive histogram equalization of intraoral radiographs: Effect on image quality. Dentomaxillofacial Radiology
[20] Simultaneous left atrium anatomy and scar segmentations via deep learning in multiview information with attention
[21] U-net: Convolutional networks for biomedical image segmentation
[22] Very deep convolutional networks for large-scale image recognition
[23] Time Course of Lung Changes On Chest CT During Recovery From
[24] Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study
[25] Relation Between Chest CT Findings and Clinical Conditions of Coronavirus Disease (COVID-19) Pneumonia: A Multicenter Study
[26] Focal loss for dense object detection
[27] Striving for simplicity: the all convolutional net
[28] Not just a black box: Learning important features through propagating activation differences
[29] Layer-wise relevance propagation for neural networks with local renormalisation layers
[30] Axiomatic Attribution for Deep Networks
[31] Images thresholding using isodata technique with gamma distribution
[32] Comparison Study of Radiomics and Deep Learning Based Methods for Thyroid Nodules Classification Using Ultrasound Images
[33] Discrete wavelet transform-based whole-spectral and subspectral analysis for improved brain tumor clustering using single voxel MR spectroscopy
[34] MV-RAN: Multiview recurrent aggregation network for echocardiographic sequences segmentation and full cardiac cycle analysis
[35] Learning to navigate for fine-grained classification