key: cord-157444-huvnyali
authors: Nabulsi, Zaid; Sellergren, Andrew; Jamshy, Shahar; Lau, Charles; Santos, Eddie; Kiraly, Atilla P.; Ye, Wenxing; Yang, Jie; Kazemzadeh, Sahar; Yu, Jin; Kalidindi, Raju; Etemadi, Mozziyar; Vicente, Florencia Garcia; Melnick, David; Corrado, Greg S.; Peng, Lily; Eswaran, Krish; Tse, Daniel; Beladia, Neeral; Liu, Yun; Chen, Po-Hsuan Cameron; Shetty, Shravya
title: Deep Learning for Distinguishing Normal versus Abnormal Chest Radiographs and Generalization to Unseen Diseases
date: 2020-10-22
journal: nan
DOI: nan
sha: 
doc_id: 157444
cord_uid: huvnyali

Chest radiography (CXR) is the most widely-used thoracic clinical imaging modality and is crucial for guiding the management of cardiothoracic conditions. The detection of specific CXR findings has been the main focus of several artificial intelligence (AI) systems. However, the wide range of possible CXR abnormalities makes it impractical to build specific systems to detect every possible condition. In this work, we developed and evaluated an AI system to classify CXRs as normal or abnormal. For development, we used a de-identified dataset of 248,445 patients from a multi-city hospital network in India. To assess generalizability, we evaluated our system using 6 international datasets from India, China, and the United States. Of these datasets, 4 focused on diseases that the AI was not trained to detect: 2 datasets with tuberculosis and 2 datasets with coronavirus disease 2019. Our results suggest that the AI system generalizes to new patient populations and abnormalities. In a simulated workflow where the AI system prioritized abnormal cases, the turnaround time for abnormal cases reduced by 7-28%. These results represent an important step towards evaluating whether AI can be safely used to flag cases in a general setting where previously unseen abnormalities exist.

Chest radiography (CXR) is a crucial thoracic imaging modality to detect, diagnose, and guide the management of numerous cardiothoracic conditions. Approximately 837 million CXRs are obtained annually worldwide 1 , resulting in a high reviewing burden for radiologists and other healthcare professionals. 2, 3 In the United Kingdom, for example, a shortage in the radiology workforce is limiting access to care, increasing wait times, and delaying diagnoses. 4 The need to reduce radiologist workload and improve turnaround time has sparked a surge of interest in developing artificial intelligence (AI)-based tools to interpret CXRs for a broad range of findings. [5] [6] [7] Many algorithms have been shown to detect specific findings, such as pneumonia, pleural effusion, and fracture, with comparable or higher performance than radiologists. [5] [6] [7] [8] [9] [10] However, by virtue of being developed to detect specific findings, these algorithms are unlikely to properly report other abnormalities that they were not trained to detect. [11] [12] [13] For example, interstitial lung disease may not necessarily trigger a pneumonia detector. If these detectors are indeed highly specific, they can only be used to detect specific diseases, and are not suitable as comprehensive prioritization tools. Moreover, because developing accurate AI algorithms generally requires large labeled datasets, developing algorithms for every potential abnormality that may be encountered in a broad clinical setting is impractical. Therefore, a different problem framing is required for use as an effective prioritization tool: algorithms are needed to distinguish normal versus abnormal CXRs more generally. A reliable AI system for distinguishing normal CXRs from abnormal ones can contribute to prompt patient workup and management. There are several use cases for such a system. 
First, in scenarios with a high reviewing burden for radiologists, the AI algorithm could be used to identify cases that are unlikely to contain findings, empowering healthcare professionals to quickly exclude certain differential diagnoses and allowing the diagnostic workup to proceed in other directions without delay. Cases that are likely to contain findings can also be grouped together for prioritized review, reducing the turnaround time. Second, in settings where clinical demand outstrips the availability of radiologists (for example, in the midst of a large disease outbreak), such a system might be used as a frontline point-of-care tool for non-radiologists. Importantly, the AI needs to be evaluated on CXRs with "unseen" abnormalities (i.e. those that it had not encountered during development), to validate its robustness towards new diseases or new manifestations of diseases.

In this work, we developed a deep learning system (DLS) that classifies CXRs as normal or abnormal with data from 5 clusters of hospitals from 5 cities in India. We then evaluated the DLS for its generalization to unseen data sources and unseen diseases using 6 independent datasets from India, China, and the United States. These datasets comprise two broad clinical datasets, two tuberculosis (TB) datasets with microbiologically confirmed positive and negative cases, and two coronavirus disease 2019 (COVID-19) datasets with reverse transcription polymerase chain reaction (RT-PCR)-confirmed positive and negative cases.

Dataset curation

Figure 1 shows the overall study design. Our training set consisted of 250,066 CXRs of 213,889 patients from 5 clusters of hospitals from 5 cities in India (Supplementary Table 1, Supplementary Figure 1). In the training set, all known TB cases were excluded and COVID-19 cases were absent. To evaluate the trained DLS, we used 6 datasets with a total of 11,576 CXRs from 11,298 patients (Table 1, Supplementary Figure 1). These include 2 broad clinical datasets (Dataset 1 [DS-1] and ChestX-ray14 [CXR-14], n=8,557 total cases) with 2,423 abnormal cases, 2 datasets (TB-1 and TB-2, n=595 total cases) with 294 TB-positive cases, and 2 datasets (COV-1 and COV-2, n=2,424 total cases) with 873 COVID-19-positive cases. DS-1, COV-1, and COV-2 were obtained from a mixture of general outpatient and inpatient settings and thus represent a wide spectrum of CXRs seen across different populations. Evaluation on these broad datasets mitigates the risk of selecting only the most obvious cases while excluding more difficult images. CXR-14, TB-1, and TB-2 were enriched for rare conditions and were publicly available. Evaluation on these datasets specifically validates the DLS's performance on rarer conditions, and enables benchmarking with other studies using the same data.

To define high-sensitivity and high-specificity operating points for the DLS, we created four small operating point selection datasets for four scenarios: DS-1, CXR-14, TB, and COVID-19; n=200 cases each (see Figure 1B and "Operating point selection datasets" section in Methods). Across these datasets, we collected 48,877 labels from 31 radiologists for either the reference standard or to serve as a comparison for the DLS (see "Labels" section in Methods).

The DLS was first evaluated for its ability to classify CXRs as normal or abnormal on the test split of DS-1 and an independent test set, CXR-14. We obtained the normal and abnormal labels from the majority vote of three radiologists (see "Labels" section in Methods). The percentages of abnormal images were 24% and 71% in DS-1 and CXR-14, respectively (Figure 2A). To gain a comprehensive understanding of the DLS, we measured sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), the percentage of predicted positives, and the percentage of predicted negatives at a high-sensitivity operating point and a high-specificity operating point ("Evaluation metrics" section in Methods). With the high-sensitivity operating point (see "Operating Point Selection" section in Methods), the DLS predicted 29.9% of DS-1 and 24.0% of CXR-14 as normal, with NPVs of 0.98 and 0.85, respectively (Table 2). With the high-specificity operating point, the DLS predicted 22.2% of DS-1 and 11.7% of CXR-14 as abnormal, with PPVs of 0.68 and 0.99, respectively (Table 2). The NPVs and PPVs across different operating points are plotted in Figure 3.
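The metrics listed above can all be derived from a single confusion matrix at a chosen threshold. The following is a minimal sketch, not the study's evaluation code; the function name `metrics_at_threshold` and the convention that a case is predicted abnormal when its score is greater than or equal to the threshold are assumptions made for illustration.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Compute binary metrics for one operating point.

    scores: DLS scores in [0, 1]; labels: 1 = abnormal, 0 = normal.
    Assumption: a case is predicted abnormal when score >= threshold.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_abnormal = score >= threshold
        if predicted_abnormal and label:
            tp += 1
        elif predicted_abnormal and not label:
            fp += 1
        elif not predicted_abnormal and label:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    total = tp + fp + tn + fn
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "frac_predicted_positive": ratio(tp + fp, total),
        "frac_predicted_negative": ratio(tn + fn, total),
    }
```

Because NPV and PPV depend on the prevalence of abnormal cases, the same threshold can yield quite different predictive values on DS-1 (24% abnormal) and CXR-14 (71% abnormal).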

To put the performance of the DLS in context, two independent board-certified radiologists reviewed both the test split of DS-1 and CXR-14. The radiologists had average NPVs of approximately 0.87 and 0.70 and PPVs of 0.75 and 0.96 on DS-1 and CXR-14, respectively (Table 3). The radiologists' sensitivity and specificity are illustrated on the ROC curves (Figure 2A).

Radiographic findings vary in their difficulty and importance of detection. Thus we next conducted subgroup analyses for each abnormality listed in Supplementary 

The DLS was next evaluated on two diseases that it had not been trained to detect (TB and COVID-19) across four disease-specific datasets: TB-1, TB-2, COV-1, and COV-2. In these analyses, the DLS was evaluated against the reference standard for each specific disease (TB or COVID-19, respectively; see "Labels" section in Methods). For TB (where the percentages of disease-positive images were 52% and 40% in TB-1 and TB-2; Table 1), the AUCs were 0.95 (95% CI: 0.93-0.97) in TB-1 and 0.97 (95% CI: 0.94-0.99) in TB-2 (Table 2, Figure 2B). At the high-sensitivity operating point, the DLS predicted 43.1% of TB-1 and 38.3% of TB-2 as negative, with NPVs of 0.88 and 0.98, respectively (Table 2A). The NPVs and PPVs across different operating points are also plotted in Figure 3. However, CXRs that were labeled (TB) negative could nonetheless contain other abnormalities (see "Labels" section in Methods). Hence, the PPVs (Table 2A-B) need to be interpreted with caution: a low PPV for identifying TB-positive radiographs as abnormal does not necessarily reflect the PPV for correctly identifying images with other findings in those datasets (see "Distributional shift between datasets" below).

Every image in TB-1 and TB-2 was also annotated as normal or abnormal by one radiologist from a cohort of 8 consultant radiologists from India. The radiologist NPVs were 0.74 and 0.88 and their PPVs were 0.93 and 0.93 on TB-1 and TB-2, respectively (Table 3 and Figure 2B).

For COVID-19 (where the percentages of disease-positive images were 32% and 48% in COV-1 and COV-2; Table 1), the AUCs were 0.68 (95% CI: 0.66-0.71) in COV-1 and 0.65 (95% CI: 0.60-0.69) in COV-2 (Table 2, Figure 2C). At the high-sensitivity operating point, the DLS predicted 5.9% of COV-1 and 9.8% of COV-2 as negative, with NPVs of 0.85 and 0.56, respectively (Table 2). The NPVs and PPVs for different operating points are plotted in Figure 3.

Similar to the TB case above, images that were negative for COVID-19 often contained other abnormalities (see "Distributional shift between datasets" section below).

Every image in COV-1 and COV-2 was also reviewed by one radiologist from a cohort of four US board-certified radiologists. The radiologist NPVs were 0.78 and 0.62 and their PPVs were 0.51 and 0.60 on COV-1 and COV-2, respectively (Table 3 and Figure 2C ).

Finally, to better understand the potential impact of the algorithm in the setting of imperfect RT-PCR sensitivity, we conducted a subanalysis of COVID-19 cases that had a "false negative" RT-PCR test result on initial testing, defined as a negative RT-PCR test followed by a positive one within five days. In the 21 such cases, the DLS achieved a 95.2% sensitivity, with the CXR taken at the time of the negative test.

To better understand the data shifts between applications (general clinical setting in DS-1 vs. the enriched CXR-14; the broad clinical settings vs. TB; and the broad clinical settings vs. COVID-19), we next examined the distributions of the DLS predictive scores across all 6 test datasets and their corresponding operating point selection sets (Figure 4, see "Operating Point Selection Datasets" in Methods). We observed similarly peaked DLS prediction score distributions (near 1.0) for positive cases, whether for general abnormalities, specific conditions, TB, or COVID-19 (see red histograms in Figure 4A-C). However, although the distributions for "negative" cases were mostly similar, they did have a small degree of variability, even among datasets of the same scenario from different sites. For example, comparing TB-1 and TB-2, which have similar CXR findings (TB) but were from two independent sites, negative cases in TB-2 had higher scores than in TB-1. Similarly, comparison between COV-1 and COV-2 also shows slight differences in the scores for negative cases. These observations confirm the existence of data shifts, suggesting that scenario-specific operating points are essential, and that site-specific operating points may further improve the DLS's performance.

Although scores for positive and negative cases in DS-1, CXR-14, TB-1, and TB-2 were well-separated, there was significant overlap between the distributions of positive and negative cases for the COVID-19 datasets. In fact, further review of the images revealed that 24.9% of negatives in COV-1 and 31.5% of negatives in COV-2 had other CXR findings, and were thus abnormal. A breakdown of the type of finding in these "negatives" is presented in Supplementary Figure 5. Examples of challenging cases of each condition and associated saliency maps highlighting the regions with the greatest influence on DLS predictions are presented in Figure 5.

To understand how the developed DLS can assist practicing radiologists, we investigated two simulated DLS-based workflows. In the first setup, to assist radiologists in prioritizing review of abnormal cases, the DLS sorted cases by the predicted likelihood of being abnormal (Figure 1D). We measured the differences in expected turnaround time for the abnormal cases with and without DLS prioritization. For simplicity, in this simulation, we assumed the same review time for each case, and that the review time per case does not vary based on review order. The DLS-based prioritization reduced the mean turnaround time of abnormal cases by 8-29% for DS-1 and CXR-14, 21-28% for TB-1 and TB-2, and 8-13% for COV-1 and COV-2 (Figure 6). In the second setup, we investigated a simulated sequential reading setup where the DLS identified cases that were unlikely to contain findings, and the radiologist reviewed only the remaining cases (Figure 1D). Though the deprioritized cases could be reviewed at a later time, we computed the effective immediate performance assuming the DLS-negatives were not yet reviewed by radiologists and considered them to be interpreted as "normal" for evaluation purposes. There were minimal performance differences between radiologists and the sequential DLS-radiologist setup, but the effective "urgent" caseload was reduced by 25-30% for DS-1 and CXR-14, about 40% for the TB datasets, and about 5-10% for the COVID-19 datasets (Supplementary Table 7).
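The second (sequential reading) setup can be illustrated with a short sketch. This is an illustration under stated assumptions rather than the study's code: the function name `sequential_read`, the input format, and the rule that a case is a DLS-negative when its score falls below the high-sensitivity threshold are all hypothetical choices.

```python
def sequential_read(scores, radiologist_calls, threshold):
    """Simulate the sequential DLS-then-radiologist workflow.

    scores: DLS scores in [0, 1].
    radiologist_calls: 1 = abnormal, 0 = normal, one call per case.
    threshold: high-sensitivity operating point (assumed: score < threshold
        means the DLS considers the case unlikely to contain findings).

    Returns the combined calls and the fraction of cases removed from the
    radiologist's immediate ("urgent") caseload.
    """
    if not scores:
        raise ValueError("at least one case is required")
    combined = []
    deferred = 0
    for score, call in zip(scores, radiologist_calls):
        if score < threshold:
            # DLS-negative: not yet reviewed, counted as "normal" for evaluation.
            combined.append(0)
            deferred += 1
        else:
            # Remaining cases keep the radiologist's interpretation.
            combined.append(int(call))
    return combined, deferred / len(scores)
```

The combined calls can then be scored against the reference standard in the same way as the radiologist-only calls, which is how a performance difference between the two setups would be measured.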

We have developed and evaluated a DLS for interpreting CXRs as normal or abnormal, instead of detecting individual abnormalities. We further validated that it generalized with acceptable performance using six datasets: two broad clinical datasets (AUC: 0.87 and 0.94), two datasets with one unseen disease (TB; AUC: 0.95 and 0.97), and two datasets with a second unseen disease (COVID-19; AUC: 0.68 and 0.65).

Generalizability to different datasets and patient populations is critical for evaluation of AI systems in medicine. Studies have shown that many factors might lead to challenges of generalization of AI systems to new populations, such as dataset shift and confounders. 14 Furthermore, with CXRs, as with all medical imagery, the number of potential manifestations is unbounded, especially with the emergence of new diseases over time. Understanding model performance on this set of unseen diseases is an imperative step in developing a robust and clinically useful model that can be trusted in real world situations. In this work, we evaluated the DLS's performance on 6 independent test sets consisting of different patient populations, spanning three countries, and with two unseen diseases (TB and COVID-19). The DLS's high-sensitivity operating point for ruling out normal CXRs performed on par with board-certified radiologists, with the DLS NPVs of 0.85-0.95 (general abnormalities), 0.88-0.98 (TB), and 0.56-0.85 (COVID-19), comparable to radiologist NPVs of 0.67-0.87 (general abnormalities), 0.74-0.88 (TB), and 0.62-0.78 (COVID-19). These results highlight the DLS's generalizability across real-world dataset shifts, increasing the likelihood of such a system to also generalize to new datasets and new manifestations. The "lower" observed AUCs of the DLS on the COVID-19 datasets were likely caused by our deliberate application of a general abnormality detector to a cohort enriched for patients with a clinical presentation consistent with COVID-19 and thus tested for COVID-19. However, as other acute diseases may share a similar clinical presentation, many cases negative for COVID-19 exhibited abnormal CXR findings that likely triggered the DLS (Figure 5, Supplementary Figure 5). In addition, a substantial number of COVID-19 patients can present with a normal CXR 15 , which would also contribute to a lower observed AUC.

The variability in patient population and clinical environment across different datasets also meant that the same operating point was unlikely to be appropriate across all settings. For example, a general outpatient setting is substantially less likely to contain CXR findings compared to a cohort of patients with respiratory symptoms or fevers in the midst of the COVID-19 pandemic. Similarly, datasets that are deliberately enriched for specific conditions (CXR-14 and TB) are skewed and are not representative of a general disease screening population. Thus, we used a small number of cases (n=200) from each setting to determine the operating points specific to that setting. Consistent with this hypothesis, these operating points then generalized well to another dataset, such as from TB-1 to TB-2 and from COV-1 to COV-2. However, further performance improvement is likely possible with site-specific operating point selection sets. We anticipate that this simple operating point selection strategy using a small number of cases may be useful when evaluating an AI system in a new setting, institution, or patient population.

In addition to general performance across the 6 datasets, subgroup analysis of the DLS's performance on each specific abnormal CXR finding of DS-1 and CXR-14 (Supplementary Tables 4 and 5) revealed consistently high NPVs, suggesting that the DLS was not overtly biased towards any particular abnormal finding. In addition, the DLS outperformed radiologists on atelectasis, pleural effusion, cardiomegaly / enlarged cardiac silhouette, and lung nodules, suggesting that the DLS as a prioritization tool could be particularly valuable in emergency medicine where dyspnea, cardiogenic pulmonary edema, and incidental lung cancer detection are commonly encountered. Furthermore, the DLS also outperformed radiologists in settings where an abnormal chest radiographic finding was present but the abnormality was not one of the predefined chest radiographic findings (e.g. perihilar mass) or radiologists agreed on the presence of a finding but disagreed as to its characterization (indicating case ambiguity; see "Other" in Supplementary Tables 4 and 5). This suggests that the DLS may be robust in the setting of chest radiographic findings that are uncommon or difficult to reach consensus on.

To further evaluate the potential utility of our system, we simulated a setup where the DLS prioritizes cases that are likely to contain findings for radiologists' review. Our evaluation suggests a potential reduction in turnaround time for abnormal cases by 7-28%, indicating the DLS's potential to be a powerful first-line prioritization tool. Whether deployed in a relatively healthy outpatient practice or in the midst of an unusually busy inpatient or outpatient setting, such a system could help prioritize abnormal CXRs for expedited radiologist interpretation. In radiology teams where CXR interpretation responsibilities are shared between general and subspecialist (i.e. cardiothoracic) radiologists, such a system could be used to distribute work. For non-radiologist healthcare professionals, a rapid determination regarding the presence or absence of an abnormality on CXR could help prevent the release of a patient who needs care and enable alternative diagnostic workup to proceed without delay while the case is pending radiologist review. Finally, a radiologist's productivity might increase by batching negative CXRs for streamlined formal review.

Finally, to facilitate the continued development of AI models for chest radiography, we are releasing our abnormal versus normal labels from 3 radiologists (2,430 labels on 810 images) for the publicly-available CXR-14 test set. We believe this will be useful for future work because label quality is of paramount importance for any AI study in healthcare. In CXR-14, the binary abnormal labels were derived through an automated natural language processing (NLP) algorithm on the radiology report. 7 However, editorials have questioned the quality of labels derived from clinical reports. 16 Hence, in this study we obtained labels from multiple experts to establish the reference standard for evaluation, and a confusion matrix of our majority vote expert labels against the public NLP labels is shown in Supplementary Table 6.

Prior studies have demonstrated an algorithm's potential to differentiate normal and abnormal CXRs. [17] [18] [19] [20] [21] Hwang et al. evaluated a commercially available system with comparison to radiology residents. 19 Annarumma et al. further demonstrated the system's utility in a simulated prioritization workflow using held-out data from the same institution as the training dataset. 18 Our study complements prior works by performing extensive evaluations on model generalizability, including generalization to multiple datasets in different continents, different patient population settings, and with the presence of unseen diseases. In addition, we also obtained radiologist reviews as benchmarks to understand the DLS's performance. Lastly, we presented two simulated workflows; one demonstrated reduced turnaround time for abnormal cases, and the other showed comparable performance while reducing effective caseload.

Our study has several limitations. First, there is a wide range of abnormalities and diseases that were not represented among the CXRs available for this study. Although it is infeasible to exhaustively obtain and annotate datasets for every possible finding, further increasing the conditions and diseases considered in this study could help both in the DLS development and evaluation. Second, we only had labeled data regarding disease-positive and disease-negative for TB and COVID-19. The absence of normal and abnormal labels for the TB and COVID-19 datasets led to added complexity in understanding the performance metrics of PPVs and specificities for these scenarios. Third, to provide a comparison with the DLS, which only had CXRs as input, the radiologists reviewed the cases solely based on CXRs without referencing additional clinical or patient data. In a real clinical setting, this information is generally available, and likely influences a radiologist's decisions. Lastly, the results were based on retrospective data. The utility of the DLS-assisted workflows was based on simulation with many assumptions, such as identical radiologist diagnosis regardless of the review order and identical review time across normal and abnormal cases. Hence, the true effects will need to be determined through future evaluation in a prospective setting.

In conclusion, we have developed and evaluated a clinically relevant artificial intelligence model for chest radiographic interpretation and evaluated its generalizability across a diverse set of images in 6 distinct datasets. These results suggest the potential for the AI system to generalize to new patient populations and unseen diseases. Using the AI system in a simulated workflow to prioritize abnormal cases, the turnaround time for abnormal cases reduced by 7-28%. Lastly, we hope that the performance analyses reported here on the publicly available datasets can serve as a useful resource to facilitate the continued development of clinically useful AI models for CXR interpretation.

In this study, we utilized 6 independent datasets for DLS development and evaluation. The DLS was evaluated in two ways: distinguishing normal vs. abnormal cases in a general setting with multiple radiologist-confirmed abnormalities (first 2 datasets), and in the setting of diseases that the DLS was not exposed to during training (TB was excluded from the train set and COVID-19 was not present; last 4 datasets). All data were stored in the Digital Imaging and Communications in Medicine (DICOM) format and de-identified prior to transfer to study investigators. Details regarding these datasets and patient characteristics are summarized in Table 1, Supplementary Table 1 , and Supplementary Figure 1 . This study using de-identified retrospective data was reviewed by Advarra IRB (Columbia, MD), which determined that it was exempt from further review under 45 CFR 46.

The first dataset (DS-1) was from five clusters of hospitals across five different cities in India (Bangalore, Bhubaneswar, Chennai, Hyderabad, and New Delhi). 5 DS-1 consisted of images from consecutive inpatient and outpatient encounters between November 2010 and January 2018, and reflected the natural population incidence of the abnormalities in the populations. All TB cases were excluded and COVID-19 cases were not present. In total, DS-1 originally contained 1,052,274 CXRs from 794,501 patients before exclusions (Supplementary Figure 1A) . This dataset was randomly split into training, tuning, and testing sets in a 0.775:0.1:0.125 ratio while ensuring that images from the same patient remained in the same split. The split is consistent with our previous study. 5 The DLS was developed solely using the training and tuning splits of DS-1. Because outpatient management is primarily done using posterior-anterior (PA) CXRs, while inpatient management is primarily done on anterior-posterior (AP) CXRs, we emphasized PA CXRs in the tune split to better represent an outpatient use case. Both PA and AP images are used in the test datasets.
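A patient-level split of the kind described above can be sketched as follows. This is a minimal illustration, not the study's procedure: the function name `split_by_patient`, the seed, and the choice to apply the 0.775:0.1:0.125 ratio to patients (rather than to images) are assumptions.

```python
import random


def split_by_patient(image_to_patient, ratios=(0.775, 0.1, 0.125), seed=0):
    """Assign every image to train/tune/test so that no patient spans splits.

    image_to_patient: dict mapping image ID -> patient ID.
    ratios: train, tune, test fractions (assumed to apply to patients).
    """
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)  # reproducible random order

    n_train = round(ratios[0] * len(patients))
    n_tune = round(ratios[1] * len(patients))

    split_of_patient = {}
    for index, patient in enumerate(patients):
        if index < n_train:
            split_of_patient[patient] = "train"
        elif index < n_train + n_tune:
            split_of_patient[patient] = "tune"
        else:
            split_of_patient[patient] = "test"

    # Every image inherits the split of its patient.
    return {image: split_of_patient[patient]
            for image, patient in image_to_patient.items()}
```

Splitting on patient IDs rather than image IDs avoids leakage, since repeat CXRs of the same patient tend to look alike and would otherwise inflate test performance.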

To select operating points for each of the four scenarios (two general abnormalities, TB, COVID-19), 200 images were randomly selected as the operating point selection sets. For general abnormalities, we selected two independent operating points using 200 randomly sampled images from the DS-1 tune set and 200 randomly sampled images from CXR-14's publicly-specified combined train and tune set 7, 22 . For TB, 200 randomly sampled images from TB-1 were used. For COVID-19, 200 randomly sampled images from COV-1 were used. These images were only used to determine an operating point for that scenario, and once used for operating point selection, were excluded from the test set (Supplementary Figure 1 ).

Two datasets were used to evaluate the DLS's performance in distinguishing normal and abnormal findings in a general abnormality detection setting. The first dataset contains 7,747 randomly selected PA CXRs from the original test split of DS-1. 5 These sampled images were expertly labelled as normal or abnormal for the purposes of this study. The second dataset contains 2,000 randomly selected CXRs from the publicly-specified test set (25,596 CXRs from 2,797 patients) of CXR-14 from the National Institutes of Health. 7, 22 From these 2,000 CXRs (also used in prior work 5 ), we removed all the patients younger than 18 years of age and all the AP scans (to focus on an outpatient setting, see tune split procedure above), leaving us with 810 images.

To evaluate the DLS performance in unseen diseases, we curated 2 datasets for TB and 2 datasets for COVID-19 (1 CXR per patient, Supplementary Figure 1C 

For development and evaluation of the DLS, we obtained labels to indicate whether abnormalities were present in each CXR. Each image was annotated as either "normal" or "abnormal", where an "abnormal" scan is defined as a scan containing at least one clinically-significant finding that may warrant further follow-up. For example, degenerative changes and old fractures were not labeled abnormal because no further management is required.

For the train and tune split of DS-1, we obtained the abnormal and normal labels using NLP (regular expressions) on the radiology reports (Supplementary Table 2 ). For the normal images, radiology report templates were often used, meaning the same report indicating a normal scan was often used for numerous images. We extracted the most commonly used radiology reports, manually confirmed those that indicated normal reports, and obtained all images that used one of these normal template reports. Examples of these radiology reports along with their frequencies are shown in Supplementary Table 2 . For the abnormal images, we obtained all images that did not contain keywords indicating the scan is normal in their respective radiology reports.
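The template-based labeling described above might look roughly like the sketch below. It is illustrative only: the keyword pattern `NORMAL_KEYWORDS`, the helper names, and the decision to leave reports that mention a normal keyword without matching a confirmed template unlabeled are all assumptions, and the actual expressions used in the study are those in Supplementary Table 2.

```python
import re
from collections import Counter

# Hypothetical keywords indicating a normal scan (not the study's list).
NORMAL_KEYWORDS = re.compile(
    r"\b(no (significant )?abnormality|normal study|within normal limits)\b"
)


def normalize(report):
    """Lowercase and collapse whitespace so identical templates compare equal."""
    return re.sub(r"\s+", " ", report.strip().lower())


def most_common_reports(reports, k):
    """Return the k most frequent report texts as candidates for manual review."""
    return Counter(normalize(r) for r in reports).most_common(k)


def label_report(report, normal_templates):
    """Return 'normal', 'abnormal', or None (left unlabeled).

    normal_templates: set of normalized report texts that a human has
    confirmed to describe a normal scan.
    """
    text = normalize(report)
    if text in normal_templates:
        return "normal"
    if NORMAL_KEYWORDS.search(text):
        return None  # mentions normality but is not a confirmed template
    return "abnormal"
```

In this sketch, `most_common_reports` surfaces candidate templates for manual confirmation, and only the confirmed set is passed to `label_report`.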

For the test sets of DS-1 and CXR-14, a group of US board-certified radiologists reviewed the images to provide reference standard labels. For each image in DS-1, three readers were randomly assigned from a cohort of 18 US board-certified radiologists (range of experience 2-24 years in general radiology). For CXR-14, we obtained labels from three US board-certified radiologists (years of experience: 5, 12, and 24). In both cases, the majority vote of the three radiologists was taken to determine the final reference standard label.
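The majority-vote reference standard is simple to express in code. The sketch below assumes binary votes (1 = abnormal, 0 = normal) and an odd number of readers; the function name `majority_vote` is illustrative.

```python
def majority_vote(votes):
    """Return the label chosen by most readers (1 = abnormal, 0 = normal).

    An odd number of votes is required so that a tie cannot occur.
    """
    if len(votes) % 2 == 0:
        raise ValueError("an odd number of votes is required")
    return int(sum(votes) > len(votes) / 2)
```

For three readers, `majority_vote([1, 0, 1])` yields 1 (abnormal) and `majority_vote([0, 0, 1])` yields 0 (normal).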

For both DS-1 and CXR-14, in addition to the normal versus abnormal label, we also obtained labels for a selected set of findings present in the abnormal images for subgroup analysis (Supplementary Table 3). Note that the lists of findings for DS-1 and CXR-14 differ. For DS-1, we selected a slightly different list of findings to represent conditions that were more clinically reliable, mutually exclusive, and for which the CXR is reasonably sensitive and specific at characterizing (Supplementary Methods and Supplementary Table 3). Similarly to the normal versus abnormal label, the majority vote was taken for each specific finding. For CXR-14, the differences between the majority-voted labels and the publicly available labels are shown in a confusion matrix in Supplementary Table 6.

TB labels

TB-positive cases were microbiologically confirmed. The first TB dataset 23 

For the COVID-19 datasets COV-1 and COV-2, patients with RT-PCR tests and CXRs were included (Supplementary Figure 1) . The COVID-19-positive labels were derived from positive RT-PCR tests. In accordance with current Centers for Disease Control and Prevention (CDC) guidelines 26 , COVID-19-negative labels consisted of CXRs from patients with at least two consecutive negative RT-PCR tests and no positive test. As false negative rates for RT-PCR have been reported to be ≥20% in symptomatic COVID-19-positive patients, CXRs from patients with only one negative RT-PCR test were excluded. 27
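The inclusion rule above can be written as a small decision function. This is a sketch of the rule as stated in the text, with a hypothetical name (`covid_label`) and input format (a chronologically ordered list of RT-PCR results per patient).

```python
def covid_label(rt_pcr_results):
    """Derive a COVID-19 label from a patient's RT-PCR history.

    rt_pcr_results: chronologically ordered booleans, True = positive test.
    Returns 'positive', 'negative', or None when the patient is excluded.
    """
    if any(rt_pcr_results):
        return "positive"  # at least one positive RT-PCR test
    if len(rt_pcr_results) >= 2:
        # No positive test, so two or more tests are consecutive negatives.
        return "negative"
    return None  # a single negative test (or no test) is excluded
```

Requiring two negatives trades a smaller negative cohort for fewer mislabeled positives, given the reported RT-PCR false negative rates.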

We trained a convolutional neural network (CNN) with a single output to distinguish between abnormal and normal CXRs. The CNN uses EfficientNet-B7 28 as its feature extractor, which was pre-trained on 300 million natural images 29 . Since the CNN was pre-trained on three-channel RGB natural images, we tiled the single channel CXR image to three channels for technical compatibility. We trained the CNN using the cross-entropy loss and the momentum optimizer 30 with a constant learning rate of 0.0004 and a momentum value of 0.9. During training, all images were scaled to 600x600 pixels with bilinear interpolation and image pixel values were normalized on a per-image basis to be between 0 and 1. The original bit depth for each image was used (Table 1) . For regularization, we applied dropout 31 , with a dropout "keep probability" of 0.5. Furthermore, data augmentation techniques were applied to the input images, including horizontal flipping, padding, cropping, and changes in brightness, saturation, hue, and contrast. All hyperparameters were selected based on the empirical performance on the DS-1 tuning set. We developed the network using TensorFlow and used 10 NVIDIA Tesla V100 graphics processing units for training.
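Two of the preprocessing steps above, per-image normalization to the range 0 to 1 and tiling the single channel to three channels, can be sketched with NumPy. This is not the study's TensorFlow pipeline: the function name `to_model_input` is hypothetical, min-max scaling is an assumed reading of "normalized on a per-image basis", the handling of constant images is an added safeguard, and resizing to 600x600 with bilinear interpolation is omitted.

```python
import numpy as np


def to_model_input(cxr):
    """Normalize a single-channel CXR to [0, 1] and tile it to three channels.

    cxr: 2-D array of raw pixel values at the image's original bit depth.
    Returns a float32 array of shape (height, width, 3).
    """
    image = np.asarray(cxr, dtype=np.float32)
    if image.ndim != 2:
        raise ValueError("expected a single-channel 2-D image")

    low, high = image.min(), image.max()
    if high > low:
        image = (image - low) / (high - low)  # per-image min-max scaling
    else:
        image = np.zeros_like(image)  # constant image: avoid division by zero

    # Repeat the grayscale channel so the input matches an RGB-pretrained CNN.
    return np.repeat(image[..., np.newaxis], 3, axis=-1)
```

Per-image scaling makes the input range independent of the original bit depth, which differs between datasets (Table 1).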

Given a CXR, the DLS predicts a continuous score between 0 and 1 representing the likelihood of the CXR being abnormal. For making clinical decisions, operating points are needed to threshold the scores and produce binary normal or abnormal categorizations. In this study, we selected two operating points (see "Operating point selection datasets" section above), a high sensitivity operating point (95% sensitivity) and a high specificity operating point (95% specificity) for each scenario: general abnormalities for a general clinical setting in DS-1, general abnormalities for an enriched dataset in CXR-14, TB, and COVID-19.
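One way to derive such operating points from a set of scored cases is sketched below. This is an illustration under stated assumptions, not the study's code: a case is called abnormal when its score is at or above the threshold, candidate thresholds are the observed scores, and the function names and toy data are hypothetical.

```python
import math


def sensitivity(scores, labels, threshold):
    """Fraction of abnormal cases (label 1) scoring at or above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def specificity(scores, labels, threshold):
    """Fraction of normal cases (label 0) scoring below the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s < threshold for s in negatives) / len(negatives)


def high_sensitivity_threshold(scores, labels, target=0.95):
    """Largest observed score that still reaches the target sensitivity."""
    for t in sorted(set(scores), reverse=True):
        if sensitivity(scores, labels, t) >= target:
            return t
    return -math.inf  # unreachable when at least one positive exists


def high_specificity_threshold(scores, labels, target=0.95):
    """Smallest observed score that reaches the target specificity."""
    for t in sorted(set(scores)):
        if specificity(scores, labels, t) >= target:
            return t
    return math.inf  # no observed score suffices: call every case normal


# Toy operating-point-selection set (made-up scores and labels).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
t_sens = high_sensitivity_threshold(scores, labels)
t_spec = high_specificity_threshold(scores, labels)
```

In practice the thresholds would be chosen on a selection set that is separate from the test set, as in the study, and then applied unchanged to the test cases.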

To compare the DLS with radiologists in classifying CXRs as normal versus abnormal, additional radiologists reviewed all test images without referencing additional clinical or patient data. All images in the DS-1 and CXR-14 test sets were independently interpreted by two board-certified radiologists (with 2 and 13 years of experience), who classified each CXR as normal or abnormal. These radiologists were independent from the cohort of radiologists who contributed to the reference standard labels.

Each image in TB-1 and TB-2 was reviewed by one radiologist randomly selected from a cohort of 8 consultant radiologists in India. Each image was annotated as abnormal or normal. Each image in COV-1 and COV-2 was reviewed by one of four board-certified radiologists (with 2, 5, 13, and 22 years of experience). Similarly, each image was annotated as abnormal or normal.

We simulated two setups in which the DLS was leveraged to optimize radiologists' workflow (Figure 1D). In the first setup, we randomly sampled 200 CXRs from each of our 6 datasets to simulate a "batch" workload for a radiologist in a busy clinical environment. For these CXRs, we compared the turnaround time for the abnormal CXRs when (1) they were sorted randomly (to simulate a clinical workflow without the DLS) and (2) when the CXRs were sorted in descending order based on the DLS-predicted scores, such that cases with higher scores appeared earlier.

We repeated each simulation 1,000 times per dataset to obtain the empirical distribution of turnaround differences.
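The first setup can be sketched as follows. This is a toy illustration, not the study's simulation: it assumes every case takes the same time to read, so that an abnormal case's turnaround time is proportional to its position in the reading queue, and it uses synthetic scores and labels. All names and data-generating choices are hypothetical.

```python
import random
from statistics import mean


def mean_abnormal_position(order, labels):
    """Mean 1-based queue position of the abnormal cases in a reading order."""
    return mean(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)


def simulate_once(scores, labels, batch_size, rng):
    """Relative turnaround reduction for abnormal cases in one sampled batch."""
    batch = rng.sample(range(len(scores)), batch_size)
    if not any(labels[i] == 1 for i in batch):
        return None  # no abnormal case in this batch, nothing to measure
    random_order = list(batch)
    rng.shuffle(random_order)  # workflow without the DLS
    prioritized_order = sorted(batch, key=lambda i: scores[i], reverse=True)
    baseline = mean_abnormal_position(random_order, labels)
    prioritized = mean_abnormal_position(prioritized_order, labels)
    return (baseline - prioritized) / baseline


# Synthetic dataset: about 30% abnormal, scores loosely separated by class.
rng = random.Random(0)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(2000)]
scores = [rng.uniform(0.4, 1.0) if y == 1 else rng.uniform(0.0, 0.6) for y in labels]

results = [simulate_once(scores, labels, 200, rng) for _ in range(1000)]
reductions = [r for r in results if r is not None]
mean_reduction = mean(reductions)
```

The list `reductions` plays the role of the empirical distribution of turnaround differences. Its values depend entirely on the synthetic data and are not comparable to the reported results.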

In the second setup, we analyzed an extreme use case where the DLS identified CXRs that were unlikely to contain findings using a high sensitivity threshold, and the radiologists only reviewed the remaining cases. All cases skipped by radiologists were labeled negative. We compared the sensitivity between this simulated "reduced workload" workflow and a normal workflow in which the radiologists reviewed all cases.
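A minimal sketch of this "reduced workload" comparison is given below, assuming binary radiologist calls and a threshold taken from a high sensitivity operating point. The function names, threshold, and toy data are hypothetical.

```python
def workflow_sensitivity(final_calls, labels):
    """Sensitivity of a set of final binary calls against reference labels."""
    true_positives = sum(1 for c, y in zip(final_calls, labels) if c == 1 and y == 1)
    return true_positives / sum(labels)


def reduced_workload_calls(scores, radiologist_calls, threshold):
    """Label cases below the threshold negative; keep the radiologist's call otherwise."""
    return [call if s >= threshold else 0 for s, call in zip(scores, radiologist_calls)]


# Toy data: reference labels, DLS scores, and one radiologist's calls.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.1, 0.7, 0.3, 0.2, 0.05]
radiologist_calls = [1, 1, 0, 1, 0, 0, 1, 0]
threshold = 0.25  # hypothetical high sensitivity operating point

reduced = reduced_workload_calls(scores, radiologist_calls, threshold)
sens_reduced = workflow_sensitivity(reduced, labels)
sens_full = workflow_sensitivity(radiologist_calls, labels)
fraction_skipped = sum(s < threshold for s in scores) / len(scores)
```

In this toy example the abnormal case with score 0.1 is skipped and therefore missed, which is why the reduced workflow can only match or fall below the sensitivity of the full review.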

To evaluate the DLS across different operating points, we calculated the areas under receiver operating characteristic curves (area under ROC, AUC). To evaluate the performance of the DLS in classifying CXRs as normal or abnormal, we measured negative predictive values (NPV), positive predictive values (PPV), sensitivity, specificity, percentage of predicted negatives, and percentage of predicted positives at a high specificity and a high sensitivity operating point chosen for each scenario (see "Operating point selection" in Deep learning system development). For evaluating the DLS for each individual type of finding, we considered an "each abnormality versus normal" setup where negatives consisted of all normal CXRs, and positives consisted of only the CXRs with that particular finding. As such, specificity values were the same across all findings in a given dataset.
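These metrics can be computed from scores and reference labels as in the following sketch. It assumes a case is predicted abnormal when its score is at or above the threshold, and it computes the AUC as the probability that a randomly chosen abnormal case outscores a randomly chosen normal case (ties counted as half). The function names and toy data are hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe division that yields NaN when a metric is undefined."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(scores, labels, threshold):
    """Confusion-matrix metrics at one operating point."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    n = len(scores)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "pct_predicted_positive": _ratio(tp + fp, n),
        "pct_predicted_negative": _ratio(tn + fn, n),
    }


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data (made-up scores and labels).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
metrics = binary_metrics(scores, labels, threshold=0.5)
area = auc(scores, labels)
```

For the per-finding analysis, the same functions would be applied after restricting the positives to CXRs with the finding of interest while keeping all normal CXRs as negatives.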

We measured the same set of metrics to evaluate the DLS performance with unseen diseases (TB and COVID-19). However, the ground truth here was defined by either the respective TB or COVID-19 tests, and not whether each image contained any abnormal finding. Thus "negative" TB and COVID-19 cases could still contain other abnormalities.

Confidence intervals (CI) for all evaluation metrics were calculated using the non-parametric bootstrap method with n=1,000 resamples at the image level.
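A percentile bootstrap at the image level might look like the sketch below. The percentile form of the interval, the 95% level, and the example metric are assumptions, since the text above does not specify them. The function names and toy data are hypothetical.

```python
import random


def sensitivity_at_half(scores, labels):
    """Example metric: sensitivity at a fixed threshold of 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= 0.5 for s in positives) / len(positives)


def bootstrap_ci(scores, labels, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling images with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(metric([scores[i] for i in idx], [labels[i] for i in idx]))
        except ZeroDivisionError:
            continue  # degenerate resample (e.g. no positives): skip it
    stats.sort()
    lower_index = round((alpha / 2) * len(stats))
    upper_index = max(round((1 - alpha / 2) * len(stats)) - 1, lower_index)
    return stats[lower_index], stats[upper_index]


# Toy data: 50 abnormal cases, 40 of which score above 0.5 (sensitivity 0.8).
labels = [i % 2 for i in range(100)]
scores = [(0.2 if i % 10 == 1 else 0.9) if i % 2 == 1 else 0.1 for i in range(100)]
point_estimate = sensitivity_at_half(scores, labels)
lower, upper = bootstrap_ci(scores, labels, sensitivity_at_half)
```

Resampling whole images keeps each score paired with its label, so the interval reflects sampling variability in the test set rather than in the model.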

To compare the performance of DLS with the radiologists in a DLS-assisted workflow, non-inferiority tests with paired binary data were performed using the Wald test procedure with a 5% margin. 32 To correct for multiple hypothesis testing, we used Bonferroni correction, yielding α=0.003125 (one-sided test with α=0.025 divided by 8 comparisons). 33
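The test and the corrected significance level can be sketched as follows. This assumes the common Wald statistic for a difference of paired proportions, computed from discordant-pair counts, which may differ in detail from the cited procedure. The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_noninferiority(n10, n01, n, margin=0.05):
    """One-sided Wald test of H0: p_dls - p_rad <= -margin for paired binary data.

    n10: cases where only the DLS was correct
    n01: cases where only the radiologist was correct
    n:   total number of paired cases
    Returns the observed difference in proportions and the one-sided p-value.
    """
    diff = (n10 - n01) / n
    variance = ((n10 + n01) - (n10 - n01) ** 2 / n) / n ** 2
    se = sqrt(variance)
    if se == 0:
        # No discordant pairs: the difference is exactly zero.
        return diff, (0.0 if diff + margin > 0 else 1.0)
    z = (diff + margin) / se
    return diff, 1.0 - NormalDist().cdf(z)


# Bonferroni-corrected level: one-sided 0.025 split across 8 comparisons.
alpha = 0.025 / 8

# Made-up discordant-pair counts for one comparison.
diff, p_value = wald_noninferiority(n10=30, n01=28, n=1000)
non_inferior = p_value < alpha
```

Concordant pairs contribute nothing to the difference or its variance beyond the total count, so only the discordant counts and `n` are needed.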

To provide a visual explanation of how the DLS makes predictions, we utilized gradient-weighted class activation mapping (Grad-CAM) 34 to identify the image regions critical to the model's decision-making process (Figure 5). Because overlaying activation maps on an image obscures the original image, a common Grad-CAM visualization shows two images: the original image, and the image with the overlaid activation maps. Here, to balance brevity and clarity, we present the activation maps as outlines highlighting the regions of interest. The outlines were obtained by taking a horizontal cross-section of the activation maps' three-dimensional contour plot, where the x and y axes represent the spatial location, and the z-axis represents the magnitude of activation.
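Taking a horizontal cross-section at a fixed activation level amounts to extracting the boundary of the region whose activation reaches that level. The sketch below illustrates this on a small grid. It does not compute Grad-CAM itself, and the level, the 4-neighbor boundary rule, and the toy activation map are assumptions made for illustration.

```python
def outline_at_level(activation, level):
    """Pixels at or above `level` that touch a pixel below it (or the image edge)."""
    height, width = len(activation), len(activation[0])
    inside = [[activation[r][c] >= level for c in range(width)] for r in range(height)]
    outline = []
    for r in range(height):
        for c in range(width):
            if not inside[r][c]:
                continue
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            on_boundary = any(
                not (0 <= nr < height and 0 <= nc < width) or not inside[nr][nc]
                for nr, nc in neighbors
            )
            if on_boundary:
                outline.append((r, c))
    return outline


# Toy 5x5 activation map with a single peak in the center.
activation = [
    [0.0, 0.1, 0.1, 0.1, 0.0],
    [0.1, 0.6, 0.7, 0.6, 0.1],
    [0.1, 0.7, 1.0, 0.7, 0.1],
    [0.1, 0.6, 0.7, 0.6, 0.1],
    [0.0, 0.1, 0.1, 0.1, 0.0],
]
outline = outline_at_level(activation, level=0.5)
```

Drawing only these boundary pixels over the radiograph leaves the underlying anatomy visible, which is the motivation given above for using outlines instead of a filled heat map.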

Many of the datasets used in this study are publicly available. CXR-14 is a public dataset provided by the NIH. 7,22 TB-1 and TB-2 are publicly available. 23 Other than these public datasets, DS-1, COV-1, and COV-2 are owned by their respective institutions and are not publicly available.

The deep learning framework used here (TensorFlow) is available at https://www.tensorflow.org/ and the neural network architecture is available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet. The Python libraries used for computation and plotting of the performance metrics (SciPy, NumPy, Lifelines, and Matplotlib) are available from https://www.scipy.org/, http://www.numpy.org/, and https://matplotlib.org/, respectively.

N/A indicates information was not available. *Abnormal images in the disease-specific datasets include both those positive for TB or COVID-19 and those with other findings; the numbers of images that contained other findings were not available.

Table 1 and Supplementary Table 1 Table 3 ). Positive CXRs in the two TB datasets are from patients with tuberculosis. Positive CXRs in the two COVID-19 datasets are from patients with reverse transcription polymerase chain reaction (RT-PCR)-verified COVID-19. Radiologists' performances in distinguishing the test cases as normal or abnormal are also highlighted in the figures.

Each image has the saliency presented as red outlines that indicate the areas the DLS is focusing on for identifying abnormalities, and yellow outlines representing regions of interest indicated by radiologists. Text descriptions for each CXR are below the respective image. Note that the general abnormality false negative example is shown with abnormal saliency maps. However, the DLS predictive score on the case was lower than the selected threshold; hence the image was classified as "normal". *Note that the TB false positive image was saved in the system with inverted colors, and presented to the model that way. Colors have been uninverted for visualization purposes.

Supplementary Figure 1. The STARD diagrams with inclusion and exclusion criteria for the 6 datasets. *For COVID-19, the first CXR during the patient's hospital encounter was selected. †Negative tests had to be administered at least 12 hours apart.

Tables Supplementary Table 1

United Nations Scientific Committee on the Effects of Atomic Radiation. Sources and effects of ionizing radiation

Radiologist supply and workload: international comparison

Training for rural radiology and imaging in sub-saharan Africa: addressing the mismatch between services and population

Clinical radiology UK workforce census 2019 report. The Royal College of Radiologists

Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation

Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists

ChestX-ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases

Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks

Development and Validation of Deep Learning-based Automatic Detection Algorithm for Malignant Pulmonary Nodules on

CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison

Likelihood Ratios for Out-of-Distribution Detection

Concrete Problems in AI Safety

Machine learning for COVID-19-asking the right questions

Key challenges for delivering clinical impact with artificial intelligence

Clinical Characteristics of Coronavirus Disease 2019 in China

Assessing Radiology Research on Artificial Intelligence: A Brief Guide for Authors, Reviewers, and Readers-From the

Machine learning 'red dot': open-source, cloud, deep convolutional neural networks in chest radiograph binary normality classification

Automated Triaging of Adult Chest Radiographs with Deep Artificial Neural Networks

Deep Learning for Chest Radiograph Diagnosis in the Emergency Department

Automated abnormality classification of chest radiographs using deep convolutional neural networks

Training and Validating a Deep Convolutional Neural Network for Computer-Aided Detection and Classification of Abnormalities on Frontal Chest Radiographs

NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories

Two public chest X-ray datasets for computer-aided screening of pulmonary diseases

Automatic tuberculosis screening using chest radiographs

Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration

Criteria for Return to Work for Healthcare Personnel with SARS-CoV-2 Infection (Interim Guidance

Variation in False-Negative Rate of Reverse Transcriptase Polymerase Chain Reaction-Based SARS-CoV-2 Tests by Time Since Exposure

Rethinking Model Scaling for Convolutional Neural Networks

Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

The authors thank the members of the Google Health Radiology and labeling software teams for software infrastructure support, logistical support, and assistance in data labeling. For tuberculosis data collection, thanks go to Sameer Antani, Stefan Jaeger, Sema Candemir, Zhiyun Xue, Alex Karargyris, George R. Thomas, Pu-Xuan Lu, Yi-Xiang Wang, Michael Bonifant, Ellan Kim, Sonia Qasba, and Jonathan Musco. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study, Jonny Wong for coordinating the imaging annotation work, and David F. Steiner, Kunal Nagpal, and Michael D. Howell for providing feedback on the manuscript.

List of specific findings for DS-1

We modified the list of findings from CXR-14 to include conditions that were more likely to be clinically actionable, mutually exclusive, and for which CXR is reasonably sensitive and specific for characterizing (Supplementary Table 3). For example, findings in CXR-14 such as "emphysema" (for which CXR lacks both sensitivity and specificity) and "infiltration" (an ambiguous term that overlaps other CXR-14 findings such as "pneumonia" and "atelectasis") were replaced by more specific terms. On the other hand, clinically relevant and distinct findings commonly encountered on CXR were also introduced (e.g. "hilar enlargement", "acute fracture") or augmented (e.g. "abnormal mediastinal mass/widening" rather than "hiatal hernia"). Our choice of findings for the DS-1 dataset also recognized inherent limitations of CXR for reliably distinguishing between some conditions; hence "focal/multifocal lung opacity" was adopted as a single finding, rather than distinct findings for "consolidation", "atelectasis", and "fibroconsolidative opacity".

*Note: "Other" was not part of the public labels; we added it to indicate findings not covered by CXR-14's original 14 conditions, and to label CXRs for which the radiologists did not reach a majority opinion regarding the specific finding.