key: cord-0805093-ffati4ua authors: Javor, D.; Kaplan, H.; Kaplan, A.; Puchner, S.B.; Krestan, C.; Baltzer, P. title: Deep learning analysis provides accurate COVID-19 diagnosis on chest computed tomography date: 2020-11-04 journal: Eur J Radiol DOI: 10.1016/j.ejrad.2020.109402 sha: 5a005d16ac7ea5167de3a0375c66ab4d7d89cc44 doc_id: 805093 cord_uid: ffati4ua INTRODUCTION: Computed Tomography is an essential diagnostic tool in the management of COVID-19. Considering the large amount of examinations in high case-load scenarios, an automated tool could facilitate and save critical time in the diagnosis and risk stratification of the disease. METHODS: A novel deep learning derived machine learning (ML) classifier was developed using a simplified programming approach and an open source dataset consisting of 6868 chest CT images from 418 patients which was split into training and validation subsets. The diagnostic performance was then evaluated and compared to experienced radiologists on an independent testing dataset. Diagnostic performance metrics were calculated using Receiver Operating Characteristics (ROC) analysis. Operating points with high positive (>10) and low negative (<0.01) likelihood ratios to stratify the risk of COVID-19 being present were identified and validated. RESULTS: The model achieved an overall accuracy of 0.956 (AUC) on an independent testing dataset of 90 patients. Both rule-in and rule out thresholds were identified and tested. At the rule-in operating point, sensitivity and specificity were 84.4% and 93.3% and did not differ from both radiologists (p > 0.05). At the rule-out threshold, sensitivity (100%) and specificity (60%) differed significantly from the radiologists (p < 0.05). Likelihood ratios and a Fagan nomogram provide prevalence independent test performance estimates. CONCLUSION: Accurate diagnosis of COVID-19 using a basic deep learning approach is feasible using open-source CT image data. 
In addition, the machine learning classifier provided validated rule-in and rule-out criteria that could be used to stratify the risk of COVID-19 being present.

Since the outbreak of coronavirus disease 2019 (COVID-19) and its declaration as a pandemic by the WHO on March 11, 2020, various measures have been implemented worldwide to contain the disease (1, 2, 3). One key factor has proven to be the rapid and accurate identification of infected patients and their isolation, along with social distancing for the general population (4, 5). While reverse-transcription polymerase chain reaction (RT-PCR) remains the main diagnostic tool to date, the role of computed tomography (CT) of the chest as a complementary method or even a reliable alternative is increasingly recognized, given the vast numbers of potential carriers of the disease (6, 7, 8, 9, 10). Artificial intelligence based on deep learning has been explored as a way to improve the capability of CT to distinguish typical COVID-19 features from other types of pneumonia (11, 12). Such technology could expedite the identification of diseased patients and further improve risk stratification in cases with indeterminate CT findings when other diagnostic methods are unavailable. A recently published study reported high accuracy of a newly developed machine learning (ML) model for detecting COVID-19 on chest CT images (11), using extensive preprocessing (lung segmentation) and a large dataset. The aim of the present study was to evaluate whether a robust deep learning derived classifier can be built with a simplified programming approach, without extensive preprocessing and with a smaller open source dataset, and to compare such a model to experienced radiologists in terms of diagnostic accuracy.

We assembled a dataset of 6868 chest CT images from 418 patients from public sources (see supplemental material).
3102 images from 209 patients were labeled as COVID-19-positive according to the information provided with the images; only exams from patients that were unequivocally categorized and described as COVID-19-positive (indicating a positive RT-PCR result) by the providing source were included. 3766 images from 209 patients were labeled COVID-19-negative and included other lung pathologies, such as lobar bacterial pneumonias, atypical or viral pneumonias, lung cancer, organizing pneumonia, infectious bronchiolitis and other diseases. To avoid class imbalance within the training data resulting in over-classification of the majority group (non-Covid19) due to its increased prior probability (13) , the datasets were intentionally balanced (50:50 ratio). The patient disease statistics are summarized in Table 1 . All exams with inferior image quality or uncertain COVID-19-status were excluded. From the entire dataset, an independent testing set was assembled containing 90 images of 90 patients. Forty-five COVID-19-positive patients were selected randomly and 45 negative patients were selected manually to ensure a similar distribution of diseases as in the training dataset and to avoid an insufficient amount of infectious diseases similar to COVID-19, such as virus pneumonias, due to randomisation. The training dataset (6778 images of 328 patients) was then further split for training the model and internal validation (20% of the samples). The independent testing set was not used for training nor for internal validation. The validation set was used for internal validation but not for training. After selection, the validation and test datasets were not presented to the DL model in the training phase. The input sources originally were plain png and jpeg files. Out of these, 82.8% were jpeg, while the rest (17.3%) were png files. In order to homogenise the data and minimize any bias due to compression artefacts, all png files were converted to jpeg files. 
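The dataset arithmetic and the 80/20 hold-out described above can be sketched as follows. This is a minimal illustration, not the authors' script: the helper name `split_train_valid` is hypothetical, the split is assumed to be at image level (the text says "20% of the samples"), and reusing the seed 43 for this split is an assumption.

```python
import random


def split_train_valid(items, valid_fraction=0.2, seed=43):
    """Shuffle items reproducibly and hold out a validation fraction.

    Hypothetical helper: the seed value 43 is the one reported in the text,
    but using it for this particular split is an assumption.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    # First n_valid shuffled items form the validation set, the rest the training set.
    return shuffled[n_valid:], shuffled[:n_valid]


# Counts reported in the text.
total_images = 3102 + 3766      # COVID-19-positive + COVID-19-negative images
total_patients = 209 + 209      # balanced 50:50 at patient level
test_images = test_patients = 90

pool_images = total_images - test_images        # images left for training + validation
pool_patients = total_patients - test_patients  # patients left for training + validation

# Image indices stand in for the actual files (assumed image-level split).
train_ids, valid_ids = split_train_valid(range(pool_images))
print(pool_images, pool_patients, len(train_ids), len(valid_ids))
```

If several images of one patient are highly similar, a patient-level split would be the stricter choice for internal validation; the independent testing set in the text is already separated by patient.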
The deep learning model extracts 2D features from the submitted images. The convolutional neural network was a plain, default ResNet50 (14). Images were standardized to 448x448 pixels. The random seed was set to 43 and kept constant so that our results can be reproduced. We used only the default transformations and augmentations offered by the fastai2 (15) deep learning framework. All hyperparameters are provided in supplemental material section II. The model was trained for 17 epochs with the fastai2 library on a single Nvidia Tesla GPU with 16 GB of VRAM. Batch size was set to 32. In contrast to other approaches, no prior lung segmentation or preprocessing was performed, which reduces the required computational time to a minimum and significantly increases feasibility. In keeping with an open science approach, the trained model can be downloaded and used with a sample script (inference.py). The Python training scripts are also released as open source on GitHub.

To evaluate a possible bias from recognition of the various image sources, a separate ML model with the same architecture and hyperparameters as the original model was developed to identify the image sources.

The trained ML classifier was compared to two radiologists, each with more than 15 years of experience, on the same independent test dataset. The test cases were presented to both radiologists independently, in different reading sessions and in random order. The readers were aware neither of the diagnoses nor of the prevalence of positive cases. They used diagnostic criteria known at the time of reading (16, 17) and assigned a diagnosis (COVID-19 positive vs negative).

The diagnostic performance of the ML classifier was calculated using ROC statistics. The area under the ROC curve (AUC) was used as the measure of diagnostic accuracy for this study. The null hypothesis for diagnosis of COVID-19 was pure chance (an AUC of 0.5).
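To make the AUC measure concrete, the sketch below computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the Mann-Whitney interpretation). The function name `roc_auc` and the toy scores are illustrative assumptions; the study itself does not state which software computed the ROC statistics.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair. An AUC of 0.5 corresponds
    to the null hypothesis of pure chance.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy classifier outputs, made up for illustration only.
covid_scores = [0.95, 0.80, 0.62, 0.41, 0.30]
other_scores = [0.55, 0.35, 0.20, 0.10, 0.05]

auc = roc_auc(covid_scores, other_scores)
print(f"AUC = {auc:.2f}")
```

The pairwise loop is quadratic in the number of cases, which is unproblematic for a test set of 90 patients; rank-based implementations are preferable for large datasets.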
To determine exploratory thresholds to either rule in or rule out COVID-19, operating points on the ROC curve of the ML classifier in the validation dataset were chosen that led to a positive likelihood ratio (LR+) of >10 (rule-in criterion for COVID-19) or a negative likelihood ratio (LR-) of <0.01 (rule-out criterion for COVID-19). These thresholds were then applied to the independent testing dataset. The areas under the ROC curves (AUC) of the ML classifier and the two radiologists were compared with each other using the DeLong method. Further, sensitivity and specificity at these thresholds were compared to the classification results of both radiologists using McNemar tests. P-values <0.05 were considered significant. No alpha error correction was applied, as all statistics are exploratory in nature. To illustrate the effect of positive and negative test results, pre-test probabilities were mapped to post-test probabilities using the likelihood ratios in a Fagan nomogram. The nomogram can be used as a graphic calculator to map any given pre-test probability to a post-test probability using LR+ or LR- for a positive or negative test result (18).

The trained ML model is available to the general public (https://labs.deep-insights.ai). The server running the web page is an IBM x3550 M2 with an Intel(R) Xeon(R) CPU X5460 @ 3.16GHz (an old CPU from 2007). Inference on this approx. 10-year-old, low-powered system was fast: 1.2 seconds per image.

The ML classifier trained to identify the image source, in order to test whether the source introduced a bias, achieved a correct classification rate of 61% (with 50% being chance), indicating that a bias based on the source of the images can be largely excluded. The diagnostic performance metrics of the ML model and the human readers (radiologists R1 and R2) for detection of COVID-19 are summarized in figure 1 and Table 2.
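The likelihood-ratio criteria from the statistical analysis (LR+ >10 for rule-in, LR- <0.01 for rule-out) can be turned into a threshold search along these lines. This is a sketch under assumptions: the function names, the tie-breaking rule (lowest qualifying rule-in threshold, highest qualifying rule-out threshold) and the toy validation scores are all hypothetical and are not taken from the study.

```python
def rates_at(threshold, scores, labels):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return tp / n_pos, tn / n_neg


def likelihood_ratios(sens, spec):
    """LR+ = sens / (1 - spec); LR- = (1 - sens) / spec."""
    def ratio(num, den):
        # Zero denominator: infinity for a positive numerator, NaN for 0/0.
        if den == 0.0:
            return float("inf") if num > 0.0 else float("nan")
        return num / den
    return ratio(sens, 1.0 - spec), ratio(1.0 - sens, spec)


def find_thresholds(scores, labels, min_lr_pos=10.0, max_lr_neg=0.01):
    """Return (rule_in, rule_out) thresholds, or None where no threshold qualifies.

    rule_in:  lowest threshold with LR+ > min_lr_pos (keeps sensitivity high).
    rule_out: highest threshold with LR- < max_lr_neg (keeps specificity high).
    """
    rule_in = rule_out = None
    for t in sorted(set(scores)):
        lr_pos, lr_neg = likelihood_ratios(*rates_at(t, scores, labels))
        if rule_in is None and lr_pos > min_lr_pos:
            rule_in = t
        if lr_neg < max_lr_neg:
            rule_out = t  # overwritten while ascending, so the highest one is kept
    return rule_in, rule_out


# Toy validation data, made up for illustration (1 = COVID-19, 0 = other).
scores = [0.9, 0.8, 0.7, 0.45, 0.2, 0.6, 0.3, 0.15, 0.1, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

rule_in, rule_out = find_thresholds(scores, labels)
print("rule-in threshold:", rule_in, "rule-out threshold:", rule_out)
```

In this reading, a score at or above the rule-in threshold counts as a positive rule-in test, and a score below the rule-out threshold counts as a negative rule-out test. Thresholds found this way on validation data still need to be confirmed on independent data, as was done in the study.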
The area under the ROC curve of the ML classifier was numerically higher than that of both readers, but the difference was statistically significant only compared to R1 (8.9% difference, 95%-CI 0.7-17%, p=0.033) and not compared to R2 (6.7% difference, 95%-CI -1.2-14.5%, p=0.097). R1 and R2 did not differ significantly from each other (2.2% difference, 95%-CI -7.6-12.1%, p=0.659). While the radiologists' test response was dichotomous, the continuous (metric) character of the ML classifier results allowed the definition of rule-in and rule-out thresholds on the validation dataset, which were then applied to the independent testing dataset (figure 1, table 2). The rule-in threshold coincided with the automatically selected threshold based on the Youden index. Positive likelihood ratios >10 were present at this threshold (predicted probability of COVID-19 by the ML classifier of >40%) for both the validation and the testing dataset (table 2). The negative likelihood ratio at the rule-out threshold was <0.01 for both datasets (table 2).

Comparison of ML classifier sensitivity and specificity with radiologists' performance

The radiologists' dichotomous performance in terms of sensitivity (Radiologist 1: 80%, Radiologist 2: 82.2%) and specificity (Radiologist 1: 91.1%, Radiologist 2: 97.8%) did not differ from that of the ML classifier (sensitivity: 84.4%, specificity: 93.3%) at the rule-in threshold (p>0.05, respectively). At the rule-out threshold (sensitivity and specificity of the ML classifier: 100% and 60%), the radiologists' sensitivity was significantly inferior and their specificity significantly superior (p<0.05, respectively, table 2). The likelihood ratios calculated within the independent testing dataset can be mapped to post-test probabilities, given pre-test probabilities based on the prevalence of COVID-19 reported in the literature (19), in a Fagan nomogram (example given in figure 2) for both rule-in and rule-out thresholds.
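The mapping that a Fagan nomogram performs graphically is Bayes' theorem in odds form: post-test odds = pre-test odds x likelihood ratio. The sketch below applies it to the reported test-set results. The counts (38 of 45 and 42 of 45) are inferred here from the reported percentages and the 45/45 test-set composition, and the pre-test probabilities in the loop are hypothetical examples, not values from the paper.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Bayes' theorem in odds form, for 0 <= pre_test_probability < 1."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Rule-in threshold on the test set: counts inferred from 84.4% and 93.3% of 45.
tp, fn = 38, 7
tn, fp = 42, 3
sens_in = tp / (tp + fn)
spec_in = tn / (tn + fp)
lr_pos = sens_in / (1.0 - spec_in)          # positive likelihood ratio

# Rule-out threshold on the test set: sensitivity 100%, specificity 60%.
sens_out, spec_out = 1.0, 0.60
lr_neg = (1.0 - sens_out) / spec_out        # negative likelihood ratio

# Hypothetical pre-test probabilities (local prevalence), for illustration only.
for pre in (0.10, 0.30, 0.50):
    after_pos = post_test_probability(pre, lr_pos)
    after_neg = post_test_probability(pre, lr_neg)
    print(f"pre-test {pre:.0%}: rule-in positive -> {after_pos:.1%}, "
          f"rule-out negative -> {after_neg:.1%}")
```

Because likelihood ratios do not depend on prevalence, the same two numbers can be reused for any clinical setting by changing only the pre-test probability. A negative likelihood ratio of exactly zero on 45 positive cases should be read as an estimate with a finite-sample uncertainty rather than as proof of perfect sensitivity.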
In this study, we developed a novel deep learning derived classifier for detecting COVID-19 on CT images. Despite omitting the preprocessing step of lung segmentation and using an open source dataset that is rather small for deep learning methodology, the model achieved high accuracy with an AUC of 0.956 on an independent testing dataset. We compared the novel ML classifier with two radiologists, each with more than 15 years of experience, on the same test dataset and found a slightly superior overall diagnostic accuracy in terms of the AUC. In contrast to the dichotomous decision by the radiologists, the continuous character of the ML classifier predictions allowed us to identify thresholds that were highly sensitive or highly specific for the diagnosis of COVID-19. This is of high clinical relevance, since decision making requires a clear-cut distinction between cases that have a high, medium or low level of suspicion. Using Bayes' theorem in a Fagan nomogram, one can map various pre-test probabilities based on the local prevalence of COVID-19 to post-test probabilities using the likelihood ratios at the respective thresholds. Establishing exploratory rule-in and rule-out thresholds for high and low predicted probabilities enables a "traffic light" system for patient triage that could be combined with clinical parameters. If the result is positive, the patient should be isolated until (multiple) RT-PCR tests have confirmed or rejected the predicted diagnosis. If it is negative, the post-test probability is low enough to practically rule out COVID-19 as the suspected underlying lung disease. RT-PCR as a diagnostic test for COVID-19 has its own limitations: the test is not universally available, turnaround times can be lengthy, and reported sensitivities vary (20).
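The "traffic light" triage idea can be illustrated with a small function that maps the classifier's predicted probability to one of three categories. This is only a sketch: the function name `triage` is hypothetical, the rule-in cut-off of 0.40 follows the >40% predicted probability reported in the results, and the rule-out cut-off of 0.05 is a placeholder because the numerical rule-out threshold is not stated in the text.

```python
def triage(predicted_probability, rule_in=0.40, rule_out=0.05):
    """Map a predicted COVID-19 probability to a traffic-light category.

    rule_in=0.40 follows the >40% threshold reported in the results.
    rule_out=0.05 is a hypothetical placeholder, not a value from the study.
    """
    if not 0.0 <= predicted_probability <= 1.0:
        raise ValueError("predicted_probability must lie in [0, 1]")
    if predicted_probability > rule_in:
        return "red"      # high suspicion: isolate and confirm by RT-PCR
    if predicted_probability < rule_out:
        return "green"    # low suspicion: COVID-19 practically ruled out
    return "yellow"       # indeterminate: combine with clinical parameters


for p in (0.02, 0.20, 0.85):
    print(f"predicted probability {p:.2f} -> {triage(p)}")
```

Any real deployment would need the validated rule-out threshold from the study's supplemental data and would combine the category with clinical findings, as the text suggests.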
The combination of chest CT and ML has the advantage of providing an immediate result, practically right after image reconstruction and while the patient is still on the examination bed, independent of the presence of a radiologist (21, 22, 23, 24). This scenario could play a role in smaller peripheral hospitals or in places with a shortage of radiological personnel, for example due to a high load of COVID-19 cases, and could also support inexperienced radiologists during night duty. In the case of staff shortages and overwhelming patient loads, an initial reading from an AI-based tool could dramatically reduce waiting times and facilitate rapid risk stratification. A specialist could follow up the AI system's reading with a more thorough diagnosis later (25). Since our ML algorithm is published as open source and only open source images were used to build it, easy access to a fast preliminary diagnosis can be offered. This has various applications, from low-income countries to settings with acutely high patient numbers or staff shortages.

Compared to a recently published study in which the model was trained on a much larger dataset (11), our results are only slightly inferior in terms of diagnostic accuracy, although the results cannot be compared directly because the datasets differ. Nevertheless, the diagnostic accuracy was comparable and even slightly superior to that of experienced radiologists, and in contrast to the above-mentioned study, our approach required no complex preprocessing with lung segmentation, which reduces the required time to a minimum and significantly increases overall feasibility. Further developments could be compared directly to our ML classifier, as all data used in this study are freely accessible and the ML model itself is provided without access restrictions. Our results indicate that the complex preprocessing step of lung segmentation may be omitted while still achieving high sensitivity and specificity.
ResNet-50 was used as the backbone residual neural

Although our results indicate that the complex preprocessing step of lung segmentation may be omitted, the lack of lung segmentation can also lead to a possible bias due to extrathoracic objects, such as intubation tubes in intensive care patients. While this cannot be excluded entirely, only a small percentage of patients in the COVID-19 and the control group showed any kind of extrathoracic object. Finally, the balanced training dataset with a high prevalence of COVID-19 cases yields sensitivity and specificity estimates that are potentially shifted towards higher sensitivity and lower specificity (29). While this is less likely to affect the ML classifier than human readers, the effect of prevalence on predictive values is of a more direct mathematical nature (30). Consequently, we determined the operating points to either rule in or rule out COVID-19 based on prevalence-independent likelihood ratios, together with a Fagan nomogram, to adapt the ML classifier results to different clinical settings (31).

COVID-19: learning from experience
Radiology department strategies to protect radiologic technologists against COVID19: Experience from Wuhan
Adapting to a new normal? 5 key operational principles for a radiology service facing the COVID-19 pandemic
Strategies shift as coronavirus pandemic looms
Isolation, quarantine, social distancing and community containment: pivotal role for old-style public health measures in the novel coronavirus (2019-nCoV) outbreak
Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases
Comparison to RT-PCR
Imaging of coronavirus disease 2019: A Chinese expert consensus statement
Characteristic CT findings distinguishing 2019 novel coronavirus disease (COVID-19) from influenza pneumonia
A diagnostic model for coronavirus disease 2019 (COVID-19) based on radiological semantic and clinical features: a multi-center study
AI Augmentation of Radiologist Performance in Distinguishing COVID-19 from Pneumonia of
Survey on deep learning with class imbalance
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition
Fastai: A Layered API for Deep Learning
CT Features of Coronavirus Disease 2019 (COVID-19) Pneumonia in 62 Patients in Wuhan, China
Coronavirus Disease 2019 (COVID-19): A Systematic Review of Imaging Findings in 919 Patients
Nomogram for Bayes theorem
Diagnostic Performance of CT and Reverse Transcriptase-Polymerase Chain Reaction for Coronavirus Disease
A role for CT in COVID-19? What data really tell us so far
Artificial intelligence applications for thoracic imaging
A deep residual learning network for predicting lung adenocarcinoma manifesting as ground-glass nodule on CT images
Evaluation of an AI-based, automatic coronary artery calcium scoring software
Deep learning-enabled system for rapid pneumothorax screening on chest CT
Doctors are using AI to triage covid-19 patients. The tools may be here to stay
The role of imaging in 2019 novel coronavirus pneumonia (COVID-19)
Chest CT findings in cases from the cruise ship "Diamond Princess
Deep learning-based multi-view fusion model for screening 2019 novel coronavirus pneumonia: A multicentre study
Variation of sensitivity, specificity, likelihood ratios and predictive values with disease prevalence
Statistics Notes: Diagnostic tests 2: predictive values
Letter: Nomogram for Bayes theorem