key: cord-0485555-1n9kisk9 authors: Georgescu, Bogdan; Chaganti, Shikha; Aleman, Gorka Bastarrika; Barbosa, Eduardo Jose Mortani; Cabrero, Jordi Broncano; Chabin, Guillaume; Flohr, Thomas; Grenier, Philippe; Grbic, Sasa; Gupta, Nakul; Mellot, François; Nicolaou, Savvas; Re, Thomas; Sanelli, Pina; Sauter, Alexander W.; Yoo, Youngjin; Ziebandt, Valentin; Comaniciu, Dorin title: Machine Learning Automatically Detects COVID-19 using Chest CTs in a Large Multicenter Cohort date: 2020-06-09 journal: nan DOI: nan sha: 4a1a93f035bc7ef2bdc44595ed601cc4704e218a doc_id: 485555 cord_uid: 1n9kisk9

Purpose: To investigate whether AI-based classifiers can distinguish COVID-19 from other pulmonary diseases and from normal controls using chest CT images, and to study the interpretability of discriminative features for COVID-19 detection. Materials and Methods: Our database consists of 2096 CT exams, including CTs from 1150 COVID-19 patients. Training was performed on 1000 COVID-19, 131 ILD, 113 other pneumonia, and 559 normal CTs; testing was performed on 100 COVID-19, 30 ILD, 30 other pneumonia, and 34 normal CTs. A metrics-based approach to COVID-19 classification used interpretable features, relying on logistic regression and random forests. A deep learning-based classifier differentiated COVID-19 based on 3D features extracted directly from CT intensities and from the probability distribution of airspace opacities. Results: The most discriminative features of COVID-19 are the percentage of airspace opacity, ground glass opacities, consolidations, and peripheral and basal opacities, which coincide with the typical characterization of COVID-19 in the literature. Unsupervised hierarchical clustering compares the distribution of these features across the COVID-19 and control cohorts. The metrics-based classifier achieved an AUC of 0.85, sensitivity of 0.81, and specificity of 0.77. The DL-based classifier achieved an AUC of 0.90, sensitivity of 0.86, and specificity of 0.81. Most of the ambiguity comes from non-COVID-19 pneumonia with manifestations that overlap with COVID-19, as well as from COVID-19 cases in early stages. Conclusion: A new method discriminates COVID-19 from other types of pneumonia, ILD, and normal cases, using quantitative patterns from chest CT. Our models balance interpretability of results and classification performance, and may therefore be useful to expedite and improve the diagnosis of COVID-19.

Coronavirus disease 2019 (COVID-19) has caused a global pandemic associated with an immense human toll and health care burden across the world (1). COVID-19 can manifest as pneumonia, which may lead to acute hypoxemic respiratory failure, the main reason for hospitalization and mortality. A consensus statement provided by the Fleischner Society indicates the use of lung imaging for triage of patients with moderate to severe clinical symptoms, especially in resource-constrained environments (2). The most typical pulmonary CT imaging features related to COVID-19 are multifocal (often bilateral and peripheral predominant) airspace opacities, comprising ground glass opacities and/or consolidation, which may be associated with interlobular and intralobular septal thickening ("crazy paving") (3). A study comparing the differences between COVID-19 and other types of viral pneumonia demonstrated that the distinguishing features more typical of COVID-19 are a predominance of ground glass opacities, peripheral distribution, and vascular thickening (4).
A consensus statement on the reporting of COVID-19 by the Radiological Society of North America (RSNA) describes the typical appearance of COVID-19 as a peripheral and bilateral distribution of ground glass opacities with or without consolidation or a crazy paving pattern, and possibly with the 'reverse halo' sign (5). Confirmatory diagnosis of COVID-19 requires identification of the virus on nasopharyngeal swabs via RT-PCR (reverse transcription polymerase chain reaction), a test that is highly specific (>99%) but with sensitivity ranging from 50% to 80% (6, 7). Given the imperfect sensitivity of RT-PCR and potential resource constraints, the role of chest CT imaging in the diagnosis of COVID-19 is still under investigation.

Recently, several groups have shown that COVID-19 can be distinguished from other types of lung disease on CT with variable accuracy. Mei et al. showed that chest CT scans in patients who were positive for COVID-19 by RT-PCR testing could be distinguished from chest CT scans in patients who tested negative, with an AUC of 0.92, using machine learning and deep learning (8). While this classification is potentially valuable, it is limited by a lack of detail on the types and distribution of findings in the negative cases. It is important to be able to distinguish COVID-19 related pulmonary disease not just from healthy subjects, but also from other types of lung disease that are not related to COVID-19, including other infections, malignancy, ILD, and COPD. This is especially important as COVID-19 can manifest similarly to other respiratory infections such as influenza, which can lead to confusion in triage and diagnosis. Bai et al. showed that an artificial intelligence system can assist radiologists in distinguishing between COVID-19 and other types of pneumonia, improving their diagnostic sensitivity to 88% and specificity to 90% (9). However, the two cohorts compared in that study came from two different countries; therefore, the generalizability of their model is limited. Similarly, some of the studies that show promising classification results do not provide a detailed description of the imaging cohorts in terms of acquisition protocols or the countries from which the data were acquired (10, 11). This information is important because different institutions have varied CT acquisition protocols and different clinical indications for CT usage, which can lead to distinct patient populations.

In this manuscript, we compute CT-derived quantitative imaging metrics corresponding to the typical clinical presentation of COVID-19 and evaluate the discriminative power of these metrics for the diagnosis of COVID-19. We perform unsupervised clustering of interpretable features to visualize how COVID-19 patients differ from controls. We compare the performance of metrics-based classifiers to a deep learning-based model. Our large training and test datasets comprise chest CTs obtained in confirmed COVID-19 patients and negative controls from North America and Europe, making this one of the first large studies to demonstrate differences between COVID-19 and non-COVID-19 imaging cohorts outside of China.

All authors have either been employed or partially supported by BLINDED. The data used in this work have been acquired from 16 different centers in North America and Europe after anonymization and ethical review at the respective institutions.
Our dataset consists of chest CT scans of 1150 patients who were positive for COVID-19 and 946 chest CT scans of patients without COVID-19, including patients with pneumonia (n=159), interstitial lung disease (ILD) (n=177), and patients without any pathology on chest CT (n=610). All CT scans in the COVID-19 cohort from North America were confirmed by an RT-PCR test. The COVID-19 cohort from Europe was either confirmed by an RT-PCR test or diagnosed based on clinical symptoms, epidemiological exposure, and radiological assessment. The pneumonia cohort consists of patients with non-COVID-19 viral pneumonia, organizing pneumonia, or aspiration pneumonia. The ILD cohort consists of patients with various types of ILD exhibiting ground glass opacities, reticulation, honeycombing, and consolidation to different degrees. The dataset was divided into training, validation, and test sets (see Table 1). Model training and selection were performed on the training and validation sets. The final performance of the selected models is reported on the test dataset. Refer to Table S1 in the supplemental material for a detailed breakdown of the demographic and scanning information for each cohort. Note that some of this information is unavailable due to the anonymization protocols of some centers.

We computed several metrics of severity based on abnormalities known to be associated with COVID-19, as well as lung and lobar segmentation. We used a previously developed Deep Image-to-Image Network, trained on a large cohort of healthy and abnormal cases, for segmentation of the lungs and lobes (12). Next, we used a DenseUNet to identify the abnormalities related to COVID-19, such as ground glass opacities (GGO) and consolidations (12). Based on these segmentations, we computed thirty severity metrics to summarize the distribution, location, and extent of airspace disease in the two lungs. The complete list of metrics and their detailed description is provided in the supplementary section. Mutual information was used to select the severity metrics that are most discriminative between COVID-19 and non-COVID-19 abnormalities. The k best features were incrementally selected based on an internal validation split. Based on the selected metrics, an unsupervised hierarchical cluster analysis was performed to identify clusters of images with similar features. The pairwise Euclidean distance was used to compute a distance matrix, and the average linkage method was used for hierarchical clustering (13). The resulting clustering was visualized as a heatmap using the Python Seaborn package (14).

Two metrics-based classifiers were trained on the thirty computed metrics. First, we trained a Random Forest classifier, M1, using the k features selected by mutual information. Subsequently, we trained a second classifier, M2, that uses logistic regression (LR) after a feature transformation based on gradient boosted trees (GBT) (15). For training the GBT, we used 2000 estimators with a maximum depth of 3 and 3 features for each split; a boosting fraction of 0.8 was used for fitting the individual trees. The LR classifier was trained with L2 regularization (C=0.2). The class weights were adjusted according to the class frequencies to account for the imbalance between COVID-19 and non-COVID-19 cases.
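The metrics-based pipeline described above (mutual-information feature selection, hierarchical clustering of the selected severity metrics, and the classifiers M1 and M2) can be illustrated with a minimal Python sketch using scikit-learn and seaborn. This is not the authors' code: the feature matrix X, label vector y, and the random forest settings are assumptions; only the M2 hyperparameters (2000 estimators, maximum depth 3, 3 features per split, subsample fraction 0.8, L2-regularized logistic regression with C=0.2, class weights matched to class frequencies) follow the description in the text.

```python
# Illustrative sketch (not the authors' code) of the metrics-based pipeline:
# mutual-information feature selection, hierarchical clustering of the selected
# severity metrics, and the two classifiers M1 (random forest) and
# M2 (gradient-boosted-tree feature transform + logistic regression).
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

def select_features(X, y, k=7):
    """Keep the k severity metrics with the highest mutual information w.r.t. the label."""
    selector = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    return selector, selector.transform(X)

def cluster_heatmap(X_sel, feature_names):
    """Unsupervised hierarchical clustering of cases on the selected metrics
    (Euclidean distance, average linkage), visualized as a heatmap."""
    df = pd.DataFrame(X_sel, columns=feature_names)
    return sns.clustermap(df, metric="euclidean", method="average",
                          standard_scale=1, cmap="viridis")

def train_m1(X_sel, y):
    """M1: random forest on the selected metrics (settings here are assumptions)."""
    rf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
    return rf.fit(X_sel, y)

def train_m2(X, y):
    """M2: gradient-boosted trees used as a feature transform (leaf indices,
    one-hot encoded), followed by L2-regularized logistic regression.
    Hyperparameters follow the text: 2000 estimators, max depth 3,
    3 features per split, subsample fraction 0.8, LR with C=0.2."""
    gbt = GradientBoostingClassifier(n_estimators=2000, max_depth=3,
                                     max_features=3, subsample=0.8, random_state=0)
    gbt.fit(X, y)
    leaves = gbt.apply(X)[:, :, 0]                 # (n_samples, n_estimators) leaf ids
    enc = OneHotEncoder(handle_unknown="ignore").fit(leaves)
    lr = LogisticRegression(penalty="l2", C=0.2, class_weight="balanced", max_iter=1000)
    lr.fit(enc.transform(leaves), y)
    return gbt, enc, lr

def predict_m2(gbt, enc, lr, X_new):
    """Probability of the COVID-19 class for new cases under the M2 pipeline."""
    leaves = gbt.apply(X_new)[:, :, 0]
    return lr.predict_proba(enc.transform(leaves))[:, 1]
```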
A deep-learning-based 3D neural network model, M3, was trained to separate the positive class (COVID-19) from the negative class (non-COVID-19). As input, we considered a two-channel 3D tensor, with the first channel containing the CT Hounsfield units masked by the lung region segmentation and the second channel containing the probability map of a previously proposed opacity classifier (12). The 3D network uses anisotropic 3D kernels to balance resolution and speed and consists of deep dense blocks that gradually aggregate features down to a binary output. The network was trained end-to-end as a classification system using binary cross entropy and uses probabilistic sampling of the training data to adjust for the label imbalance in the training dataset. A separate validation dataset was used for final model selection before the performance was measured on the test set. The input 3D tensor size is fixed (2x128x384x384), corresponding to the lung segmentation from the CT data rescaled to 3x1x1 mm resolution. The first two blocks are anisotropic and consist of convolution (kernels 1x3x3), batch normalization, LeakyReLU, and max-pooling (kernels 1x2x2, stride 1x2x2). The subsequent five blocks are isotropic, with convolution (kernels 3x3x3), batch normalization, LeakyReLU, and max-pooling (kernels 2x2x2, stride 2x2x2), followed by a final linear classifier with a 144-dimensional input. Figure 1 shows an overview of our 3D DL classifier.
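As an illustration of the block structure just described, the following is a minimal PyTorch sketch, not the authors' implementation. The two-channel 2x128x384x384 input, the two anisotropic blocks (1x3x3 convolutions, 1x2x2 pooling), the five isotropic blocks (3x3x3 convolutions, 2x2x2 pooling), and the final linear classifier on a 144-dimensional vector follow the text; the channel widths are not specified in the paper and are chosen here only so that the flattened feature vector is 144-dimensional, and the dense-block internals are simplified to plain convolutional blocks.

```python
# Illustrative PyTorch sketch of a 3D classifier with the block structure described
# in the text: two anisotropic blocks (1x3x3 conv, 1x2x2 pooling) followed by five
# isotropic blocks (3x3x3 conv, 2x2x2 pooling) and a linear classifier on a
# 144-dimensional feature vector. Channel widths are NOT given in the paper; they are
# chosen here only so that the flattened output has 144 dimensions.
import torch
import torch.nn as nn

def block(in_ch, out_ch, ksize, pool):
    """Convolution + batch normalization + LeakyReLU + max-pooling."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=ksize, padding=tuple(k // 2 for k in ksize)),
        nn.BatchNorm3d(out_ch),
        nn.LeakyReLU(inplace=True),
        nn.MaxPool3d(kernel_size=pool, stride=pool),
    )

class Covid3DNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Anisotropic blocks: pool only in-plane, keep the 128 slices.
            block(2, 16, (1, 3, 3), (1, 2, 2)),    # 2x128x384x384 -> 16x128x192x192
            block(16, 32, (1, 3, 3), (1, 2, 2)),   # -> 32x128x96x96
            # Isotropic blocks: pool along all three axes.
            block(32, 64, (3, 3, 3), (2, 2, 2)),   # -> 64x64x48x48
            block(64, 64, (3, 3, 3), (2, 2, 2)),   # -> 64x32x24x24
            block(64, 64, (3, 3, 3), (2, 2, 2)),   # -> 64x16x12x12
            block(64, 32, (3, 3, 3), (2, 2, 2)),   # -> 32x8x6x6
            block(32, 4, (3, 3, 3), (2, 2, 2)),    # -> 4x4x3x3 = 144 values
        )
        self.classifier = nn.Linear(144, 1)        # COVID-19 vs. non-COVID-19 logit

    def forward(self, x):
        # x: (batch, 2, 128, 384, 384), masked HU volume + opacity probability map
        f = self.features(x)
        return self.classifier(torch.flatten(f, start_dim=1))

# Training would use binary cross entropy on the logit, e.g.:
# loss = nn.BCEWithLogitsLoss()(model(x), labels.float().unsqueeze(1))
```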
Seven features were selected by computing the mutual information between each feature and the class label on the training dataset of 999 COVID-19 cases and 801 controls (pneumonia, ILD, and healthy). Note that one COVID-19 case was excluded from training due to field-of-view issues, one pneumonia control was excluded because the z-axis resolution was less than 10 mm, and another pneumonia control was excluded due to incorrect DICOM parameters and artifact issues. The most discriminative features include the percentage of airspace opacity, ground glass opacities, consolidations, and peripheral and basal opacities. The metrics-based classifier achieved an AUC of 0.85, sensitivity of 0.81, and specificity of 0.77, while the DL-based classifier achieved an AUC of 0.90, sensitivity of 0.86, and specificity of 0.81; 95% confidence intervals were estimated by bootstrapping (17). The corresponding confusion matrices for the three models are shown in Table 2. Figure 4 shows typical CT images from COVID-19 patients, and Figure 5 shows negative examples from ILD and non-COVID-19 pneumonia patients. Overlaid in red are the areas identified by the opacity classifier. Figure 6 illustrates examples of cases incorrectly labeled by both classifiers, and Figure 7 shows cases that are incorrectly labeled by the metrics-based classifier but correctly labeled by the DL classifier, which uses additional texture features extracted directly from the images.

In this research, we evaluated the ability of machine learning algorithms to distinguish between chest CTs of patients positive for COVID-19 and a control cohort comprising chest CTs obtained to evaluate other pneumonias, ILD, and normal cases. We performed an analysis based on clinically interpretable severity metrics computed from automated segmentation of abnormal regions in a chest CT scan, as well as a black-box approach using a deep learning system. Unsupervised clustering on the selected severity metrics shows that while there are dominant characteristics that can be observed in COVID-19, such as the presence of ground glass opacities and a peripheral and basal distribution, these characteristics are not observed in all cases of COVID-19. On the other hand, some subjects with ILD and pneumonia can exhibit similar characteristics. We found that the performance of the system can be improved by mapping these metrics into a higher dimensional space prior to training a classifier, as shown by model M2 in Figure 2.

The best classification accuracy is achieved by the deep learning system, which is essentially a high-dimensional, non-linear model. The deep learning method achieves reduced false positive and false negative rates relative to the metrics-based classifier, suggesting that there might be other latent radiological representations of COVID-19 that distinguish it from interstitial lung diseases or other types of pneumonia. It would be interesting to investigate how to incorporate the common imaging features into our 3D DL classifier as prior information. The proposed AI-based method has been trained and tested on a database of 2096 CT datasets, with 1150 COVID-19 patients and 946 datasets from other categories. We also compared our method to the one published by Li et al (10) and found that our method achieves a higher AUC as well as higher sensitivity. Further details are provided in the supplementary section.

One limitation of this study is that our training set is biased toward COVID-19 and healthy controls. This bias could have influenced the specificity for discriminating against other types of lung pathology. Another limitation is that the validation set is relatively small, which might not capture the entire data distribution of clinical use cases for proper model selection. Among the strengths of this study are the diversity of the training and testing CT scans, which were acquired from a variety of manufacturers, institutions, and regions as shown in Table S1, ensuring that our results are robust and likely generalizable to different environments. We also included not only healthy subjects but also various types of lung pathology, from ILD and pneumonia, in the COVID-19 negative control group.

The system described in this paper provides clinical value in several respects. It can be used for rapid triage of positive cases, particularly in resource-constrained environments where radiologic expertise may not be immediately available and RT-PCR results may take up to several hours. This system could help radiologists prioritize interpreting CTs in patients with COVID-19 by screening out lower probability cases. In addition to rapidity and efficiency, the output of our deep learning classifier is easily reproducible and replicable, mitigating the inter-reader variability of manually read radiology studies. While RT-PCR will remain the reference standard for confirmatory diagnosis of COVID-19, machine learning methods applied to quantitative CT can perform with high diagnostic accuracy, increasing the value of imaging in the diagnosis and management of this disease. Furthermore, the algorithms described in this paper could potentially be integrated into a surveillance effort for COVID-19, even in unsuspected patients. All chest CT scans obtained for pulmonary and non-pulmonary pathology (e.g., coronary artery exams, chest trauma evaluation) would be automatically assessed for evidence of COVID-19 lung disease as well as for non-COVID-19 pneumonia, and referring clinicians could be alerted, allowing more rapid institution of isolation protocols. Finally, the method could potentially be applied retrospectively to large numbers of chest CT exams from institutional PACS systems worldwide to uncover the origin and trace the spread of SARS-CoV-2 in communities prior to the implementation of widespread testing efforts.
In the future, we plan to deploy and validate the algorithm in a clinical setting, to evaluate its clinical utility and diagnostic accuracy on prospective data, and to investigate the correlation of the proposed metrics with the clinical severity of COVID-19 and disease progression over time. COVID-19 severity could be further quantified using features from contrast CT angiography, such as the detection and measurement of acute pulmonary embolism, which has been reported to be associated with severe COVID-19 infections (18, 19). In addition, clinical decision models could be improved by training a classifier that incorporates other clinical data, such as pulse oximetry, cell counts, and liver enzymes, in addition to imaging features.

Metric #1-6: Percentage of Opacity (%) or PO. The total percent volume of the lung parenchyma that is affected by airspace disease. Computed for both lungs and for each lobe.

Metric #7-12: Percentage of High Opacity (%) or PHO. The total percent volume of the lung parenchyma that is affected by severe disease, i.e., high opacity regions including consolidation and vascular thickening. High opacity is defined as the airspace disease region with mean HU greater than -200. Computed for both lungs and for each lobe.

Metric #13-18: Percentage of High Opacity (%) 2. The total percent volume of the lung parenchyma that is affected by denser airspace disease, i.e., high opacity regions including consolidation. High opacity is here defined as the airspace disease region with mean HU between -200 and 50. Computed for both lungs and for each lobe.

Metric #19: Sum of the severity scores of the five lobes. Based on the PO of each lobe, the severity score of a lobe is: 0 if the lobe is not affected, 1 if 1-25% is affected, 2 if 25-50% is affected, 3 if 50-75% is affected, and 4 if 75-100% is affected (20).

Metric #20: Lung High Opacity Score (LHOS). Sum of the severity scores of the five lobes, for high opacity regions only. Based on the PHO of each lobe, the severity score of a lobe is: 0 if the lobe is not affected, 1 if 1-25% is affected, 2 if 25-50% is affected, 3 if 50-75% is affected, and 4 if 75-100% is affected.

Metric #21: Sum of the severity scores of the five lobes, for high opacity regions excluding vasculature (threshold 50 HU). Based on the PHO of each lobe, the severity score of a lobe is: 0 if the lobe is not affected, 1 if 1-25% is affected, 2 if 25-50% is affected, 3 if 50-75% is affected, and 4 if 75-100% is affected.

Metric #22: True if both the right and left lungs are involved, false if only one of the two or neither is involved.

Metric #23: Number of lobes affected by the disease.

Metric #24: Number of affected regions in the lung.

Metric #25: Number of lesions in the periphery of the lung, not including the apex and mediastinal regions (see Fig S1(a)). Any abnormality that intersects with the peripheral border is considered a peripheral lesion (16).

Metric #26: Number of Lesions in the Rind. Number of regions that are in the rind of the lung as defined in (17) (see Fig S1(b)). Any abnormality that intersects with the rind is considered a lesion in the rind.

Metric #27: Number of regions that are in the core of the lung as defined in (17) (see Fig S1(b)). Any abnormality that does not intersect with the rind is considered a core lesion.

Metric #28: The number of peripheral lesions divided by the number of total lesions.

Metric #29: The total percent volume of the lung parenchyma that is affected by disease for peripheral lesions only.

Metric #30: The total percent volume of the lung parenchyma that is affected by less dense airspace disease, i.e., lesions characterized as GGO only. GGO is defined as the airspace disease region with mean HU less than -200.
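To make the metric definitions concrete, here is a small Python sketch of how the Percentage of Opacity and the per-lobe 0-4 severity score used by metrics #19-21 could be computed from binary segmentation masks. This is an illustration under assumptions, not the authors' code: lobe_masks and opacity_mask are assumed to be boolean 3D numpy arrays defined on the same voxel grid.

```python
# Illustrative sketch (not the authors' code) of two of the severity metrics defined
# above: Percentage of Opacity (PO) and the per-lobe severity score (0-4) that is
# summed over the five lobes to obtain the lung severity score.
import numpy as np

def percentage_of_opacity(lung_mask, opacity_mask):
    """PO: percent of lung (or lobe) volume covered by airspace opacities."""
    lung_vox = lung_mask.sum()
    if lung_vox == 0:
        return 0.0
    return 100.0 * np.logical_and(lung_mask, opacity_mask).sum() / lung_vox

def lobe_severity(po_lobe):
    """Map a lobe's PO into the 0-4 score: 0 if not affected, 1 for 1-25%,
    2 for 25-50%, 3 for 50-75%, 4 for 75-100%."""
    if po_lobe <= 0:
        return 0
    for score, upper in enumerate([25.0, 50.0, 75.0, 100.0], start=1):
        if po_lobe <= upper:
            return score
    return 4

def lung_severity_score(lobe_masks, opacity_mask):
    """Sum of per-lobe severity scores over the five lobes."""
    return sum(lobe_severity(percentage_of_opacity(m, opacity_mask)) for m in lobe_masks)
```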
We compared the models in this work to the one published by Li et al (10). They investigated a deep learning method to distinguish COVID-19 from community-acquired pneumonia and healthy subjects using chest CT. Their proposed DL method is based on extracting 2D features on each CT slice, followed by feature pooling across slices and a final linear classifier. While the results were promising, the distribution of the training and test data was not specified in detail in terms of scanning protocols and geography. There are two main differences between the DL method proposed in this article and the one proposed by Li et al (10). First, our method is fundamentally based on 3D deep learning, which better exploits the 3D image context; second, our method uses as input the location of the lung regions affected by opacities, which focuses the classifier on the regions of interest. We trained and tested the method of Li et al on our dataset using their published code (10); the results are reported in Table S2.

References:
1. JHU. Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering.
2. The Role of Chest Imaging in Patient Management during the COVID-19 Pandemic: A Multinational Consensus Statement from the Fleischner Society.
3. Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection.
4. Performance of radiologists in differentiating COVID-19 from viral pneumonia on chest CT.
5. Radiological Society of North America Expert Consensus Statement on Reporting Chest CT Findings Related to COVID-19. Endorsed by the Society of Thoracic Radiology, the American College of Radiology, and RSNA.
6. Essentials for radiologists on COVID-19: an update-radiology scientific expert panel.
7. Sensitivity of chest CT for COVID-19: comparison to RT-PCR.
8. Artificial intelligence-enabled rapid diagnosis of COVID-19 patients. medRxiv.
9. AI Augmentation of Radiologist Performance in Distinguishing COVID-19 from Pneumonia of Other Origin at Chest CT.
10. Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT.
11. Classification of COVID-19 patients from chest CT images using multi-objective differential evolution-based convolutional neural networks.
12. Quantification of tomographic patterns associated with COVID-19 from chest CT.
13. Modern hierarchical, agglomerative clustering algorithms.
14. seaborn: statistical data visualization.
15. Greedy function approximation: a gradient boosting machine.
16. CO-RADS: A categorical CT assessment scheme for patients with suspected COVID-19: definition and evaluation.
17. ROC-ing along: Evaluation and interpretation of receiver operating characteristic curves.
18. Acute pulmonary embolism associated with COVID-19 pneumonia detected by pulmonary CT angiography.
19. Hypoxaemia related to COVID-19: vascular and perfusion abnormalities on dual-energy CT.
20. Chest CT findings in COVID-19.

Table S1 (excerpt):
Data Origin: North America: 131 / North America: 16, EU: 5
Sex: F: 54, M: 54, Unknown: 23 / F: 9, M: 4, Unknown: 3 / F: 13
Age: Median: 59 yrs, IQR: 56.5-73 / Median: 58 yrs
Manufacturer: Siemens: 43, GE: 55, Philips: 7, Toshiba: 5 / Siemens: 24, GE: 1

We gratefully acknowledge the contributions of multiple frontline hospitals to this collaboration. The authors also thank COPDGene for providing data. The COPDGene study (NCT00608764) was funded by NHLBI U01 HL089897 and U01 HL089856 and was also supported by the COPD Foundation through contributions made to an Industry Advisory Committee comprised of AstraZeneca, Boehringer Ingelheim, GlaxoSmithKline, Novartis, and Sunovion.
We thank the many colleagues who made this work possible in a short amount of time, with special recognition to Brian Teixeira and Sebastien Piat, who were instrumental in implementing and managing the data infrastructure.

Comparison with the model published by Li et al (10): for the model proposed by Li et al, we trained and tested on our dataset using the code provided by the authors. The 95% confidence intervals (shown as a band) are computed by bootstrapping over 1000 samples with replacement from the predicted scores.
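The bootstrap procedure mentioned in the caption can be sketched as follows. This is an illustrative Python snippet, not the authors' code, under the assumption that y_true holds the binary labels and scores holds the predicted probabilities.

```python
# Illustrative sketch of the bootstrap described in the caption: 95% confidence
# intervals for the AUC from 1000 resamples, drawn with replacement from the
# predicted scores and their labels.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # sample with replacement
        if len(np.unique(y_true[idx])) < 2:               # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```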