title: A Prospective Observational Study to Investigate Performance of a Chest X-ray Artificial Intelligence Diagnostic Support Tool Across 12 U.S. Hospitals
authors: Sun, Ju; Peng, Le; Li, Taihui; Adila, Dyah; Zaiman, Zach; Melton, Genevieve B.; Ingraham, Nicholas; Murray, Eric; Boley, Daniel; Switzer, Sean; Burns, John L.; Huang, Kun; Allen, Tadashi; Steenburg, Scott D.; Gichoya, Judy Wawira; Kummerfeld, Erich; Tignanelli, Christopher
date: 2021-06-03
journal: ArXiv

IMPORTANCE: An artificial intelligence (AI)-based model to predict COVID-19 likelihood from chest x-ray (CXR) findings can serve as an important adjunct to accelerate and improve immediate clinical decision making. Despite significant efforts, many limitations and biases exist in previously developed AI diagnostic models for COVID-19. Utilizing a large set of local and international CXR images, we developed an AI model with high performance on temporal and external validation.

OBJECTIVE: Investigate real-time performance of an AI-enabled COVID-19 diagnostic support system across a 12-hospital system.

DESIGN: Prospective observational study.

SETTING: Labeled frontal CXR images (samples of COVID-19 and non-COVID-19) from M Health Fairview (Minnesota, USA), the Valencian Region Medical ImageBank (Spain), MIMIC-CXR, the Open-I 2013 Chest X-ray Collection, the GitHub COVID-19 Image Data Collection (International), Indiana University (Indiana, USA), and Emory University (Georgia, USA).

PARTICIPANTS: Internal (training, temporal, and real-time validation): 51,592 CXRs; public: 27,424 CXRs; external (Indiana University): 10,002 CXRs; external (Emory University): 2,002 CXRs.

MAIN OUTCOME AND MEASURE: Model performance assessed via receiver operating characteristic (ROC) curves, precision-recall curves, and F1 score.

RESULTS: Patients who were COVID-19 positive had significantly higher COVID-19 Diagnostic Scores than patients who were COVID-19 negative (median 0.1 [IQR: 0.0–0.8] vs. median 0.0 [IQR: 0.0–0.1], p < 0.001). Pre-implementation, the AI model performed well on temporal validation (AUROC 0.8) and external validation (AUROC 0.76 at Indiana University, AUROC 0.72 at Emory University). The model was noted to have unrealistic performance (AUROC > 0.95) using publicly available databases. Real-time model performance was unchanged over 19 weeks of implementation (AUROC 0.70). On subgroup analysis, the model had improved discrimination for patients with "severe" as compared to "mild or moderate" disease (p < 0.001). Model performance was highest in Asians, lowest in whites, and similar between males and females.

CONCLUSIONS AND RELEVANCE: AI-based diagnostic tools may serve as an adjunct, but not a replacement, for clinical decision support of COVID-19 diagnosis, which largely hinges on exposure history, signs, and symptoms. While AI-based tools have not yet reached full diagnostic potential in COVID-19, they may still offer valuable information that clinicians can take into consideration along with clinical signs and symptoms.

For COVID-19 negative cases, we collected cases and frontal images combined from: (1) 2011–2016 MIMIC-CXR.12 For model development, 38,508 M Health Fairview CXRs (2,220 positives and 36,288 negatives) were used for training.
Model training was supplemented with publicly available images of COVID-19 positive and negative patients (9,592 total, with a positive:negative ratio of 1:16) to maximize model generalizability. Within the training set, 444 positives and 7,257 negatives were held out for tuning the deep learning models' hyperparameters, and the rest were used to train the models.

Our main model pipeline consisted of lung segmentation, outlier detection, and a feature extraction/classification component, as illustrated in Figure 2. To ensure the AI system relies on medically relevant pulmonary pathology (and to minimize AI "shortcuts"6), we performed lung segmentation to focus learning on the lung parenchyma, where the COVID-19 radiomic features are located (Figure 1).14-17 Segmentation was performed using a modified U-Net model19 (adapted from Kaggle18), an architecture widely used for biomedical image segmentation. The segmentation model was trained using three public lung segmentation datasets that provide manual segmentation masks: Montgomery20, HIN21, and the Japanese Society of Radiological Technology Digital Image Database22 (Figure 1).

Real-world X-rays vary widely, and extreme cases (e.g., caused by over- or under-exposure, skewed positioning, or incorrect position attributes) can substantially contaminate model training or prediction. Rather than overburden the model (robustness is a grand challenge for modern AI23), we chose to isolate these extreme and infrequent cases for human screening (Figure 2). We implemented two sequential procedures for this. First, before lung segmentation, we trained a conditional Generative Adversarial Network (GAN)24 on the training CXRs to separate potential outliers, feeding the class labels into the conditional GAN as the "conditional" information. After training, any sample assigned a discriminator score lower than 0.1 under both the positive and the negative conditional labels was declared an outlier. Second, on the remaining samples, after lung segmentation, we calculated the ratio of the area of the predicted lung mask to the area of the whole X-ray image; any CXR with a ratio below 0.1 or above 0.9 was removed as an outlier. Together, the two procedures rejected about 10% of all input images, which were visually confirmed as outliers. An example of an outlier is shown in Figure 2, where a lateral CXR was inappropriately labeled as frontal.

For classification, we used DenseNet-12125 pre-trained on the ImageNet dataset (the largest natural image benchmark dataset)26 and fine-tuned it on our CXR datasets to diagnose COVID-19. The difference between the prediction and the target (1 for positive, 0 for negative) was measured using the standard cross-entropy loss (Figure 2). The network was implemented using the deep learning package PyTorch 1.5.0.27 Our data were imbalanced between positive cases and negative controls, reflecting the intrinsically skewed distribution of COVID-19 cases in the population. To counter the adverse effects of this imbalance on learning, we set the training objective to the maximum of the average loss over the positive cases and the average loss over the negative cases (a minimal illustrative sketch of this objective is given below).

Pre-implementation Validation: Prior to implementation, the model underwent multiple temporal and external validations. To simulate real-time performance, temporal validation included all adult CXRs within the M Health Fairview system obtained between July 1 and July 30, 2020.
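The following is a minimal sketch, not the authors' released code, of how the class-balanced objective described above could be implemented when fine-tuning an ImageNet pre-trained DenseNet-121 in PyTorch. The single-logit/binary cross-entropy formulation, optimizer, learning rate, and data-loader details are assumptions.

```python
# Hypothetical sketch of the "max of per-class average loss" objective.
# The model choice follows the paper (ImageNet pre-trained DenseNet-121);
# everything else is an illustrative assumption.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.densenet121(pretrained=True)                    # ImageNet weights
model.classifier = nn.Linear(model.classifier.in_features, 1)  # single COVID-19 logit

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # assumed optimizer
bce = nn.BCEWithLogitsLoss(reduction="none")                   # keep per-sample losses


def class_balanced_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Maximum of the average loss over positives and over negatives."""
    per_sample = bce(logits.squeeze(1), labels.float())
    pos, neg = labels == 1, labels == 0
    zero = torch.zeros((), device=logits.device)
    pos_loss = per_sample[pos].mean() if pos.any() else zero
    neg_loss = per_sample[neg].mean() if neg.any() else zero
    return torch.max(pos_loss, neg_loss)


# Training-loop sketch: the loader is assumed to yield lung-segmented CXR
# tensors and 0/1 COVID-19 labels.
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = class_balanced_loss(model(images), labels)
#     loss.backward()
#     optimizer.step()
```

Taking the maximum of the two per-class averages forces the optimizer to keep improving whichever class is currently handled worse, rather than letting the far more numerous negatives dominate a plain mean loss.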
To investigate model performance under differing COVID-19 prevalence, varying ratios of case imbalance were evaluated, from 1:1 (50% positive) to 1:20 (4.8% positive), and the area under the precision-recall curve (AUPRC) was calculated for each ratio. During this prospective period, 5,228 CXRs were obtained from patients who tested negative for COVID-19.

External validation included 2,002 CXRs of patients aged 18 years and older within the Emory University hospital system, collected between March 1, 2020 and July 30, 2020. COVID-19 positive and negative CXRs were equally distributed. Patient demographics are provided in Table 1. In collaboration with Epic Cognitive Computing, the AI model was integrated into the M Health Fairview system (Table 1). A subanalysis was conducted to evaluate model performance for patients with "severe" disease, defined as patients who required ICU admission, and "moderate" disease, defined as patients who required hospital (but not ICU) admission (Supplemental Figure).

To investigate how real-time performance correlates with performance obtained using publicly available COVID-19 datasets, performance was also evaluated on a sample of publicly available COVID-19 CXRs. The mean AUROC and AUPRC are shown in Supplemental Table 1. Models were externally validated at Indiana University and Emory University (Supplemental Table 2). Ethnicity and gender data were available for positive cases and negative controls for both the external validation at Emory University and the real-time validation at M Health Fairview; model performance by subgroup is reported in Table 3. In both datasets, the model had improved performance in males and non-white patients (Table 3). Performance was highest in Asian patients (AUROC 0.94, 95% CI 0.86-1.0).

This study represents a prospective observational study to investigate the real-world performance of an AI model for COVID-19 diagnosis based on CXR findings alone. Specifically, this study sought to characterize real-world performance, model drift, and equity. In this study we identified that COVID-19 CXR diagnostic models perform well for patients with "severe" COVID-19 (patients with a high COVID-19 Diagnostic AI score); however, they fail to differentiate patients with "mild" COVID-19, who may present with minimal CXR findings and thus a low COVID-19 Diagnostic AI score. There may also be coexisting chest pathology or chronic lung disease that is the only imaging finding in a newly diagnosed COVID-19 positive patient, or there may be overlapping findings. However, the possibility that AI could differentiate these diseases based on features not seen by the "naked eye" prompted efforts to test this hypothesis. Second, it is possible that adequate training data have not yet been collected to train such a generalizable model: despite training our model on approximately 50,000 images collected both locally and internationally, we observed an AUROC of 0.7 on real-world validation. A third possibility is that a rigorous approach to developing and evaluating AI models for medical imaging has not yet been defined, and there may be a lack of communication between AI model developers and medical researchers. For example, a recent review of 62 AI models for COVID-19 based on biomedical imaging found significant limitations in the models published to date, with nearly all designated as having a high risk of bias.5
These biases include the lack of external validation, lack of equity analysis by race and gender, lack of reporting of patient demographics, inadequate numbers of images, lack of reporting of real-time performance, and the use of "unrealistic" training datasets that fail to represent the environment where the model will ultimately be deployed. Finally, albeit unlikely, it is possible that some AI false positives were actually patients with COVID-19 who had a false-negative PCR test. The sensitivity of rapid PCR testing varies significantly with a patient's viral load: patients with viral load cycle threshold (Ct) levels < 25 have a sensitivity of 90%, whereas in patients with lower viral loads (higher Ct), PCR sensitivity drops to 76%.31

A question raised by our findings is: what is the performance bar for AI models in diagnostic clinical decision support? Does a model with an AUROC of 0.7 or 0.8 not add information that the clinician can integrate into decision making, similar to how an elevated white blood cell count (AUROC 0.70-0.7532,33) in a patient with right lower quadrant tenderness adds diagnostic information to a work-up for appendicitis? Our findings suggest that AI analysis of chest x-rays alone is not adequate to diagnose COVID-19. However, AI-enabled clinical decision support may add information that ED providers can integrate into clinical decision making when developing a differential diagnosis and determining whether the patient needs confirmatory testing and isolation for COVID-19.

Correspondingly, what is the standardized evaluation process to assess achievement of the performance bar?34,35 In this study, prior to implementation we performed a temporal validation to simulate performance had the model been implemented live in July 2020. Following acceptable performance, we conducted two external validations, including an equity evaluation at one site. Following usability optimization, the model was implemented for investigational use, and an 8-week proactive educational campaign was initiated across our system to educate providers about the model and its investigational use. Performance was evaluated during a 1-week pilot immediately following implementation to ensure no significant performance drops compared with pre-implementation validation. We then conducted a prospective observational study to investigate real-time model performance, drift, and equity.36 We encourage model developers to implement their models and accurately evaluate real-world performance before publishing overly optimistic results.30 We also encourage exercising maximal discretion when interpreting or utilizing performance reported on publicly available data; our model obtained unrealistic performance (AUROCs > 0.96) using such publicly available data.

Currently, the need for a rapid diagnostic algorithm for COVID-19 is less urgent given the development and wide utilization of rapid PCR testing. However, we believe continued investigation into model optimization is warranted to better inform development for future viral pandemics and other AI tasks. Moreover, limited-resource settings may not have access to testing, and imaging may therefore be used for initial triage, especially when resources are overwhelmed, as may occur in a pandemic. Differentiating COVID-19, which presents with non-specific ARDS findings, is significantly harder than differentiating other disease processes such as acute pneumothorax.
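As one illustration of the real-time drift monitoring described above, the sketch below computes a weekly AUROC from logged model scores and subsequently confirmed labels; the column names and thresholding logic are hypothetical and are not taken from the authors' deployment code.

```python
# Illustrative sketch (hypothetical field names): tracking weekly AUROC after
# deployment to detect model drift, as in the 19-week real-time evaluation.
import pandas as pd
from sklearn.metrics import roc_auc_score


def weekly_auroc(log: pd.DataFrame) -> pd.Series:
    """log is assumed to have columns: 'timestamp', 'score' (model output),
    and 'label' (0/1 confirmed COVID-19 status)."""
    log = log.assign(week=pd.to_datetime(log["timestamp"]).dt.to_period("W"))
    return log.groupby("week").apply(
        lambda g: roc_auc_score(g["label"], g["score"])
        if g["label"].nunique() == 2 else float("nan")  # need both classes in a week
    )

# A sustained drop relative to the pre-implementation benchmark (AUROC ~0.8 on
# temporal validation) would prompt review and possible retraining.
```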
We observed that when the model generates a high score, it is typically correct in its identification of COVID-19. Given the high positive predictive value (0.98) and lower negative predictive value (0.8) of PCR testing for COVID-19, our clinical decision support model ran only on patients with unknown or negative COVID-19 test results; thus, it is possible that performance would have been better had the model also run on patients with known COVID-19. We observed that many patients with "mild" COVID-19 have a low score and thus overlap with negative controls. We propose the development of a hierarchical, two-step model that first passes all CXRs through the algorithm to generate a score; for patients with a low score, a second model, trained to differentiate "mild" COVID-19 from non-COVID-19, would then be applied to improve discrimination at the lower end of the scale (a schematic sketch of this two-step logic is given at the end of this section). Additionally, we propose the integration of structured data and unstructured note data into model training. For example, vital signs, laboratory values, and signs and symptoms extracted from clinical notes may significantly improve diagnostic accuracy in combination with findings from radiographic models.

A source of bias in most models is the lack of adequate analysis ensuring that they perform similarly across different populations, specifically gender and racial groups. We and others have reported that COVID-19 has disproportionately burdened minority populations.37,38 To ensure the model performed equitably, we tested it across race and gender. Notably, the model performed slightly better in males and in minority populations. Male gender and minority populations have been found to be at higher risk for severe disease,37-39 and one study found imaging severity to be higher in minority populations compared with white patients.40 This may explain the improved performance we noted in non-white patients, as our pre-implementation model performance was superior for patients with "severe" versus "moderate" COVID-19 (Supplemental Table 2). Importantly, the model does perform equitably, and there is limited risk that it would further widen the disparate COVID-19 outcomes being experienced by minority populations.

This study is not without limitations. First, our negative controls were not selected from a target population of suspected COVID-19 patients. We included all x-rays when training the model, to reflect a "real-world" environment and optimize realistic performance; however, this limits the potential usefulness of the model outside of the ED and early inpatient setting. Second, CXR findings for COVID-19 are nonspecific and overlap with a number of other infectious and non-infectious etiologies, which could complicate interpretation. Third, our model only ran on patients with unknown or negative COVID-19 status. Given the high PPV of COVID-19 PCR testing, it is unnecessary to deploy an AI model when the diagnosis is already confirmed; thus, the reported performance is truly pragmatic. However, data do not exist regarding model performance for patients who had a positive PCR test result prior to CXR. This study does, however, encompass a period before rapid PCR testing. Lastly, these models were trained and validated on fixed data, and it is anticipated that the models will evolve as new data arrive; it is possible to modify the models so that they gradually improve over time, leveraging advances in online machine learning. Finally, the integration of radiomic characteristics of COVID-19 positive patients may further improve the models.
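Below is a hypothetical sketch of the proposed two-step scoring logic. The threshold, model names, and the second-stage "mild vs. non-COVID" classifier are illustrative assumptions; the second stage is proposed future work, not part of the deployed system.

```python
# Hypothetical sketch of the proposed hierarchical (two-step) scoring logic.
import torch

LOW_SCORE_THRESHOLD = 0.2  # assumed cut-off separating "low" scores


def two_step_score(cxr: torch.Tensor, primary_model, mild_vs_noncovid_model) -> float:
    """Return a COVID-19 likelihood score using the proposed two-step hierarchy."""
    with torch.no_grad():
        score = torch.sigmoid(primary_model(cxr)).item()
        if score >= LOW_SCORE_THRESHOLD:
            # High scores are typically correct identifications of COVID-19.
            return score
        # Low scores overlap with negative controls; defer to a second model
        # trained specifically to separate "mild" COVID-19 from non-COVID-19.
        return torch.sigmoid(mild_vs_noncovid_model(cxr)).item()
```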
In conclusion, AI-based diagnostic tools may serve as an adjunct, but not a replacement, for clinical decision support of COVID-19 diagnosis, which largely hinges on exposure history, signs, and symptoms. While AI-based tools have not yet reached full diagnostic potential in COVID-19, they may still offer valuable information that clinicians can take into consideration along with clinical signs and symptoms.

References:
A Framework for Rationing Ventilators and Critical Care Beds During the COVID-19 Pandemic
Interim Guidelines for Collecting, Handling, and Testing Clinical Specimens from Persons for Coronavirus Disease
False Negative Tests for SARS-CoV-2 Infection - Challenges and Implications
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
AI for radiographic COVID-19 detection selects shortcuts over signal. medRxiv
Was there COVID-19 back in 2012? Challenge for AI in Diagnosis with Similar Indications
Can medical practitioners rely on prediction models for COVID-19? A systematic review
BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients
Covid-19 Chest Xray Dataset
MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports
Chest X-ray in new Coronavirus Disease 2019 (COVID-19) infection: findings and correlation with clinical outcome
Frequency and Distribution of Chest Radiographic Findings in Patients Positive for COVID-19
Coronavirus Disease 2019 (COVID-19): A Systematic Review of Imaging Findings in 919 Patients
Lung Segmentation from Chest X-Ray dataset
Convolutional Networks for Biomedical Image Segmentation
Two public chest X-ray datasets for computer-aided screening of pulmonary diseases
Requirements for Minimum Sample Size for Sensitivity and Specificity Analysis
Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal
Artificial intelligence in the diagnosis of COVID-19: challenges and perspectives
Rapid coronavirus tests: a guide for the perplexed
Clinical value of total white blood cells and neutrophil counts in patients with suspected appendicitis: retrospective study
Receiver operating characteristic analysis of leukocyte counts in operations for suspected appendicitis
The Lancet Digital Health. Artificial intelligence for COVID-19: saviour or saboteur? Lancet
Racial and Ethnic Disparities in COVID-19-Related Infections, Hospitalizations, and Deaths: A Systematic Review
Understanding the renin-angiotensin-aldosterone-SARS-CoV axis: a comprehensive review
Racial and Ethnic Disparities in Disease Severity on Admission Chest Radiographs among Patients Admitted with Confirmed Coronavirus Disease 2019: A Retrospective Cohort Study

Acknowledgments: The authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing resources that contributed to the research results reported within this paper (http://www.msi.umn.edu). The authors have no other conflict of interest to declare. All authors contributed significantly to study design, data analysis and/or interpretation, and the development, writing, and revision of this manuscript.