key: cord-0551732-p7wdny0m authors: Kazemzadeh, Sahar; Yu, Jin; Jamshy, Shahar; Pilgrim, Rory; Nabulsi, Zaid; Chen, Christina; Beladia, Neeral; Lau, Charles; McKinney, Scott Mayer; Hughes, Thad; Kiraly, Atilla; Kalidindi, Sreenivasa Raju; Muyoyeta, Monde; Malemela, Jameson; Shih, Ting; Corrado, Greg S.; Peng, Lily; Chou, Katherine; Chen, Po-Hsuan Cameron; Liu, Yun; Eswaran, Krish; Tse, Daniel; Shetty, Shravya; Prabhakara, Shruthi title: Deep learning for detecting pulmonary tuberculosis via chest radiography: an international study across 10 countries date: 2021-05-16 journal: nan DOI: nan sha: cb8656389274bfb187de4541fcfaa837fffd86dc doc_id: 551732 cord_uid: p7wdny0m Tuberculosis (TB) is a top-10 cause of death worldwide. Though the WHO recommends chest radiographs (CXRs) for TB screening, the limited availability of CXR interpretation is a barrier. We trained a deep learning system (DLS) to detect active pulmonary TB using CXRs from 9 countries across Africa, Asia, and Europe, and utilized large-scale CXR pretraining, attention pooling, and noisy student semi-supervised learning. Evaluation was on (1) a combined test set spanning China, India, US, and Zambia, and (2) an independent mining population in South Africa. Given WHO targets of 90% sensitivity and 70% specificity, the DLS's operating point was prespecified to favor sensitivity over specificity. On the combined test set, the DLS's ROC curve was above all 9 India-based radiologists, with an AUC of 0.90 (95%CI 0.87-0.92). The DLS's sensitivity (88%) was higher than the India-based radiologists (75% mean sensitivity), p<0.001 for superiority; and its specificity (79%) was non-inferior to the radiologists (84% mean specificity), p=0.004. Similar trends were observed within HIV positive and sputum smear positive sub-groups, and in the South Africa test set. 
We found that 5 US-based radiologists (where TB isn't endemic) were more sensitive and less specific than the India-based radiologists (where TB is endemic). The DLS also remained non-inferior to the US-based radiologists. In simulations, using the DLS as a prioritization tool for confirmatory testing reduced the cost per positive case detected by 40-80% compared to using confirmatory testing alone. To conclude, our DLS generalized to 5 countries, and merits prospective evaluation to assist cost-effective screening efforts in radiologist-limited settings. Operating point flexibility may permit customization of the DLS to account for site-specific factors such as TB prevalence, demographics, clinical resources, and customary practice patterns. Globally, 1 in 4 people are infected with Mycobacterium tuberculosis, and 5-10% of these individuals will develop active tuberculosis (TB) disease in their lifetime 1,2 . In 2019, the estimated TB mortality was 1.4 million, including 200,000 people who were human immunodeficiency virus (HIV) positive, and an estimated 2.9 million people who contracted TB were not formally reported due to a combination of underreporting, underdiagnosis, and pretreatment loss to follow up. Almost 90% of the active TB cases occur in a few dozen "high-burden" countries, many with scarce resources needed to tackle this public health problem. 3 The anticipated rising burden of drug resistant TB poses an increased threat to both endemic and non-endemic parts of the world. 4 Lastly, the COVID-19 pandemic that has caused devastation around the world has also disrupted efforts to combat TB: globally, 21% fewer (1.4 million) people received care for TB in 2020 than in 2019. 5 In the past decade, there has been steady global support to combat this health crisis through the World Health Organization (WHO)'s End TB Strategy, the United Nations (UN)'s Sustainable Development Goals, and the Global Fund to fight AIDS, TB and malaria. 
6 Cost-effective pulmonary TB screening using CXR has the potential to increase equity in access to healthcare, particularly in difficult-to-reach populations. 7 In light of high patient volumes and limited access to timely expert interpretation of CXRs in many regions, there has been active research into using artificial intelligence to screen with a CXR followed by a corroborating diagnostic test. 8-21 Such artificial intelligence-based triaging followed by GeneXpert testing for a confirmatory diagnosis was shown to be cost-effective compared to GeneXpert alone, and also substantially increased patient throughput. 13 As part of their recently-published 2021 guidance, the WHO evaluated three independent computer-aided detection (CAD) software systems, and determined that the diagnostic accuracy and performance of CAD software was similar to human readers. 7, 9, 13, 17 Given the scarcity of experienced readers, as an alternative to human interpretation of CXR, the WHO now recommends CAD for both screening and triage in individuals 15 years or older. 7 However, the WHO emphasized the importance of using a performant CAD system that has been tested on a population that is representative of the target population. In this study, we developed a deep learning system (DLS) to interpret CXRs for imaging features of active TB. Developing a universal TB classifier can be challenging, not only because of the array of potential imaging features but also because prevailing imaging features, severity of disease at presentation, and prevalence of TB and HIV can differ broadly by locale. Therefore, we validated our DLS using an aggregate of datasets from China, India, US, and Zambia that together reflect different regions, race/ethnicities, and local disease prevalence. 
We evaluated the DLS under two conditions: having a single prespecified operating point across all datasets, and when customized to radiologists' performance in each locale. As diagnostic performance may be influenced by disease prevalence, we compared the DLS with two different cohorts of radiologists: one based in a TB-endemic region (India) and one based in a TB non-endemic region (United States). An analysis of HIV positive and sputum smear positive subgroups was also performed. Finally, we estimate cost savings for using this DLS as a triaging solution for nucleic acid amplification testing (NAAT) in screening settings. For this work, we leveraged de-identified CXR images from multiple datasets spanning 9 countries for training and 4 countries for validating the DLS, for a total of 10 countries (Table 1) . Our DLS was trained using 160,187 images from Europe 22 (Azerbaijan, Belarus, Georgia, Moldova, Romania), India, and South Africa, and tuned using 3,258 images from China 16, 23, 24 , India, and Zambia 25 . Additionally, we used 550,297 images 26, 27 for pretraining purposes (10,310 of which overlapped with the train sets and none of which overlapped with the tune sets), and 138 images with labeled lung segmentation masks from the US dataset for training and tuning of the lung cropping model. We then validated the DLS using 1,262 images from China 16,23 , India, US 16, 23 , and Zambia, and 1040 images from South Africa, using 1 image per patient in both cases. Additional details including inclusion/exclusion criteria, enrichment, and reference standard are presented in Supplementary Figure S1 and Supplementary Table S1 . This retrospective study was approved by the respective Ethics Committee or Institutional Review Board at each participating institution and all data were de-identified prior to transfer. 
For all test and tune datasets, positive TB status was confirmed via microbiology (sputum culture or sputum smear) or NAAT (GeneXpert MTB/RIF, Cepheid, Sunnyvale, CA); see Table 1 and Supplementary Table S1. On the train datasets, the reference standard varied due to site-specific practice differences and data availability, including microbiology, radiologist interpretation, clinical diagnoses (based on medical history and imaging), and NAAT. We developed a DLS to detect evidence of active pulmonary TB on CXRs. The system consists of three modules: a lung cropping model for identifying a bounding box spanning the lungs, a detection model for identifying regions containing possible imaging features of active tuberculosis (nodules, airspace opacities with cavitation, airspace opacities without cavitation, pleural effusion, granulomas, and fibroproductive lung opacities), and a classification model that takes the output from both the lung cropping model and the detection model to predict the likelihood of the CXR being TB positive (Figure 1). For the lung cropping model, we used Mask RCNN 28 with a ResNet-101-FPN 29 feature extractor to train for both pixel-level segmentation and bounding boxes as outputs. We then cropped each CXR using the model's output bounding box enclosing the lungs as the input for the classification model. For the detection model, we used a Single Shot MultiBox Detector (SSD) 30 to create bounding boxes around potential TB-relevant imaging features. Based on the predicted bounding boxes, a probabilistic attention mask was calculated for use in the final pooling layer of the classification model. For the classification model, we combined an EfficientNet-B7 31 pre-trained on classifying CXRs as normal or abnormal 27 with an attention pooling layer and a fully-connected layer. 
The attention pooling layer utilizes the probabilistic attention mask generated from the detection model to perform a weighted average of the feature maps before feeding to the final fully connected layer. The classification model classifies CXRs into 1 of 3 classes: TB-positive, TB-negative but abnormal, and normal. We took the prediction score for TB-positive class as the output prediction for all TB-related analysis. Training the individual components of the DLS described above is a multi-step process ( Figure 1 ). First, we trained the lung cropping model using lung segmentation masks from the US dataset of 138 images with 80% used for training and 20% used for tuning. To train the detection model, radiologists annotated 9,871 bounding boxes around TB-indicative abnormalities (nodules, airspace opacities with cavitation, airspace opacities without cavitation, pleural effusion, granulomas, lymphadenopathy, and fibroproductive lung opacities). Both the detection and classification models were trained using the Europe dataset and the two India train datasets. Due to the limited amount of labeled data, we used the noisy-student 32 semi-supervised learning approach to leverage a much larger set of unlabeled data. Specifically, we obtained "noisy" TB labels by running inference using the initial version of the DLS on the South Africa train dataset with more than 150,000 unlabeled CXRs. These data with generated labels were combined with the original dataset to train 6 classification models, which were then ensembled by taking the mean of the scores. For the detection model, we used a dropout keep probability of 0.99, and augmentation included random cropping, rotation, flipping, jitter on the bounding boxes, multi-scale anchors, and a box matcher with intersection-over-union. For the classification model, we applied dropout, with a dropout keep probability of 0.5. 
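The attention pooling step described above (a weighted average of the feature maps using the probabilistic attention mask) can be illustrated with a minimal sketch. The function name `attention_pool`, the nested-list layout (height x width x channels), and the choice to normalize by the sum of the mask are assumptions made for illustration; the text does not specify these details, and the actual system was implemented in TensorFlow.

```python
from typing import List


def attention_pool(features: List[List[List[float]]],
                   mask: List[List[float]]) -> List[float]:
    """Weighted average of an H x W x C feature map using an H x W mask.

    Assumption: weights are normalized by the total mask weight.
    """
    total_weight = sum(sum(row) for row in mask)
    if total_weight <= 0:
        raise ValueError("attention mask must have positive total weight")
    channels = len(features[0][0])
    pooled = [0.0] * channels
    for feature_row, mask_row in zip(features, mask):
        for feature_vector, weight in zip(feature_row, mask_row):
            for c in range(channels):
                pooled[c] += weight * feature_vector[c]
    return [value / total_weight for value in pooled]


# Toy 2 x 2 feature map with 2 channels (illustrative values only).
features = [[[1.0, 0.0], [3.0, 2.0]],
            [[5.0, 4.0], [7.0, 6.0]]]
# Toy attention mask that concentrates weight on the right-hand column.
mask = [[0.0, 1.0],
        [0.0, 3.0]]
pooled = attention_pool(features, mask)
```

With a uniform mask, this reduces to ordinary global average pooling; a mask concentrated on detected regions pulls the pooled vector toward the features at those locations.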
Furthermore, we applied data augmentation such as horizontal flipping, random shears, and random deformations. All hyperparameters were selected based on empirical performance on the tune sets. Training was done using TensorFlow on third-generation tensor processing units with a 4x4 topology. All images were scaled to 1024 x 1024 pixels, and image pixel values were normalized on a per-image basis to be between 0 and 1. For model selection (checkpoint selection and other hyperparameter optimization), we selected models to maximize the area under the receiver operating characteristic curve (area under ROC curve, or AUC) corresponding to the range of radiologists' sensitivities in the tune sets. This approach was used to help explicitly select models that were performant across the range of radiologist sensitivities, instead of potentially optimizing for ranges that were beyond the scope of customary clinical practice. In order to gauge the performance of the DLS across datasets containing cases of different levels of difficulty, all test set cases were reviewed by a team of radiologists, whose performance not only served as a baseline for comparison, but also as an indirect indicator of difficulty level. As the performance characteristics of radiologists accustomed to practice in endemic vs. non-endemic settings can vary, this team consisted of two cohorts of radiologists (10 India-based consultant radiologists and 5 US-based board-certified radiologists). The India-based radiologists had an average of 6 years of experience (range 3-9), while the US-based radiologists had an average of 10.8 years of experience (range 3-22). These radiologists were provided with both the image and additional clinical information about the patient when available (age, sex, symptoms, and HIV status), whereas the DLS was blinded to this information. 
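The model-selection criterion described above (AUC restricted to the range of radiologists' sensitivities) can be approximated as the mean specificity achieved over that sensitivity range. The sketch below is one possible reading, not the authors' implementation: the function names, the grid-based approximation, and the toy scores and radiologist sensitivities are all hypothetical.

```python
from typing import Sequence


def specificity_at_sensitivity(scores: Sequence[float],
                               labels: Sequence[int],
                               target_sensitivity: float) -> float:
    """Highest specificity among thresholds whose sensitivity meets the target."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        # Small tolerance guards against floating-point error in the grid.
        if sensitivity >= target_sensitivity - 1e-12:
            best = max(best, specificity)
    return best


def mean_specificity_in_range(scores: Sequence[float],
                              labels: Sequence[int],
                              sens_low: float,
                              sens_high: float,
                              steps: int = 11) -> float:
    """Grid approximation of the ROC area over [sens_low, sens_high],
    normalized by the width of the range."""
    grid = [sens_low + (sens_high - sens_low) * i / (steps - 1)
            for i in range(steps)]
    return sum(specificity_at_sensitivity(scores, labels, t) for t in grid) / steps


# Hypothetical tune-set scores and labels (1 = TB positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
# Hypothetical radiologist sensitivities that define the range of interest.
radiologist_sensitivities = [0.75, 0.9, 1.0]
selection_metric = mean_specificity_in_range(
    scores, labels,
    min(radiologist_sensitivities), max(radiologist_sensitivities))
```

Under this reading, a checkpoint would be preferred when its `selection_metric` is higher, which rewards specificity only at sensitivities that radiologists actually operate at.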
For each image, the radiologists labeled it for the presence/absence of TB, other pulmonary findings, and optionally whether there were any minor technical issues visible on the image. The tune sets were labeled similarly. Our primary analyses compared the performance of the DLS with that of the India-based radiologists on the pooled combination of 4 test datasets. To support comparisons with the binary judgments of experts, we thresholded the DLS's continuous score using an operating point of 0.45, chosen based on an analysis of the tune datasets (conducted prior to evaluating the DLS on any of the test sets). We tested for noninferiority of sensitivity and specificity, both with a 10% absolute margin. To account for correlations within case and within radiologist, we used the Obuchowski-Rockette-Hillis procedure 33,34 configured for binary data 35 and adapted to compare readers with the standalone algorithm 36 in a noninferiority setting 37 . A p-value below 0.0125 was considered significant for the primary analyses (a conservative one-sided alpha of 0.025 was halved for a Bonferroni correction for 2 tests). Subsequent superiority testing was prespecified if non-inferiority was met, which does not require multiple testing correction. 38 Prespecified secondary analyses included per-dataset subgroup analysis for ROC; sensitivity and specificity at the prespecified operating point; operating points corresponding to the WHO thresholds; matched sensitivity/specificity analysis on a per-dataset and per-radiologist level; comparisons of the India-based and US-based radiologists, and comparison of the DLS to the US-based radiologists. Additional secondary analyses were on subgroups based on HIV status, images flagged by the reviewing radiologists to have minor technical issues, demographic information, and symptoms (work in progress). Exploratory subgroup analysis based on sputum smear was also conducted. 
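The thresholding and significance arithmetic described above can be sketched as follows. The operating point (0.45), the one-sided alpha (0.025), and the number of primary tests (2) come from the text; the function names, the convention that a score equal to the threshold counts as positive, and the toy data are assumptions. The sketch does not implement the Obuchowski-Rockette-Hillis procedure itself.

```python
from typing import Sequence, Tuple

OPERATING_POINT = 0.45  # prespecified threshold reported in the text


def sensitivity_specificity(scores: Sequence[float],
                            labels: Sequence[int],
                            threshold: float = OPERATING_POINT
                            ) -> Tuple[float, float]:
    """Binarize continuous scores and return (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold  # assumed tie convention
        if label == 1:
            tp += predicted_positive
            fn += not predicted_positive
        else:
            fp += predicted_positive
            tn += not predicted_positive
    return tp / (tp + fn), tn / (tn + fp)


# Bonferroni correction for the two primary tests (sensitivity, specificity).
ONE_SIDED_ALPHA = 0.025
NUM_PRIMARY_TESTS = 2
bonferroni_alpha = ONE_SIDED_ALPHA / NUM_PRIMARY_TESTS


def is_significant(p_value: float) -> bool:
    """True if a primary-analysis p-value clears the corrected threshold."""
    return p_value < bonferroni_alpha


# Toy scores and labels (1 = TB positive), for illustration only.
toy_scores = [0.9, 0.5, 0.3, 0.2, 0.6, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_sensitivity, toy_specificity = sensitivity_specificity(toy_scores, toy_labels)
```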
Unless otherwise specified, 95% confidence intervals (CIs) were calculated using the bootstrap method with 1000 samples. To understand the performance of the "abnormality" detector in the DLS, we additionally labeled the India test dataset for any actionable abnormal CXR findings. Each case was reviewed by 3 US-based radiologists. Because follow-up testing such as a repeat CXR or computed tomography was not available, "ground truth" was based on how many radiologists indicated the presence of an abnormal finding: at least 1 of 3, at least 2 of 3, and all 3 of 3. Finally, we simulated the potential cost savings of using our DLS as a TB screening intervention. Recent studies have estimated the overall cost for subsidized GeneXpert to be about US$13.06 per test, including equipment, resources, maintenance, and consumables. 13 The cost to acquire a single digital CXR was estimated to be US$1.49, including equipment and running costs, but not radiology interpretation. 13, 39 In our simulation, the DLS is used for initial TB screening, and patients who meet the threshold (based on our prespecified operating point) proceed with GeneXpert testing. The expected total GeneXpert testing cost was computed using the prevalence, sensitivity, and specificity to get DLS-positive rates and multiplying by the cost of GeneXpert. The total expected cost included both this testing cost for DLS-positive patients and the cost of CXR screening for all patients. Finally, we divided the total cost by the number of true positive TB cases caught to derive the cost per positive TB case. We then analyzed the effect of prevalence on the cost, which makes the simplifying assumption that there are no changes in case severity or other factors that may affect DLS performance. DLS performance was first evaluated on a combined test dataset incorporating a diverse population representing multiple races and ethnicities, drawn from 4 countries: China, India, US, and Zambia (Table 1). 
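The cost simulation described in the Methods can be written out as a short per-patient calculation. The unit costs (US$13.06 per GeneXpert test, US$1.49 per CXR) come from the text. The function names are hypothetical, and the sketch additionally assumes that GeneXpert detects every TB case it is run on, so that the GeneXpert-only cost per positive case is simply the test cost divided by prevalence.

```python
GENEXPERT_COST = 13.06  # US$ per test, as cited in the text
CXR_COST = 1.49         # US$ per digital CXR, as cited in the text


def cost_per_positive_with_dls(prevalence: float,
                               sensitivity: float,
                               specificity: float) -> float:
    """Expected cost per true-positive TB case when the DLS gates GeneXpert."""
    true_positive_rate = prevalence * sensitivity
    false_positive_rate = (1 - prevalence) * (1 - specificity)
    dls_positive_rate = true_positive_rate + false_positive_rate
    # Every patient gets a CXR; only DLS-positive patients get GeneXpert.
    total_cost = CXR_COST + GENEXPERT_COST * dls_positive_rate
    return total_cost / true_positive_rate


def cost_per_positive_genexpert_only(prevalence: float) -> float:
    """Assumes GeneXpert is run on everyone and misses no TB cases."""
    return GENEXPERT_COST / prevalence


def savings(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Fractional reduction in cost per positive case versus GeneXpert alone."""
    with_dls = cost_per_positive_with_dls(prevalence, sensitivity, specificity)
    return 1 - with_dls / cost_per_positive_genexpert_only(prevalence)


# DLS performance on the India dataset (94% sensitivity, 95% specificity).
india_savings_10pct = savings(0.10, 0.94, 0.95)
india_savings_1pct = savings(0.01, 0.94, 0.95)
# WHO target performance (90% sensitivity, 70% specificity).
who_savings_10pct = savings(0.10, 0.90, 0.70)
who_savings_1pct = savings(0.01, 0.90, 0.70)
```

Under these assumptions, the sketch gives savings of about 73% and 82% for the India-dataset performance at 10% and 1% prevalence, and about 47% and 53% for the WHO target performance, which agrees with the savings reported in the cost analysis.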
Among a total of 1,262 images from 1,262 patients, there were 217 TB cases based on positive culture or GeneXpert. DLS development (training and tuning) and operating point selection was conducted on the tune datasets, independently of the test datasets. Patient sources for these 4 datasets included TB referral centers, outpatient clinics, and active case finding. The India test dataset was from a site independent of those used in development. An independent dataset from South Africa comprising a mining population served as an additional test set. In our combined test dataset across 4 countries, the DLS achieved an AUC of 0.90 ( Figure 2A , Table 2 ). To contextualize the model's performance and better understand the case spectrum, we obtained radiologist interpretations for the same cases from two cohorts of radiologists: radiologists based in India, a country where TB is endemic, and radiologists based in the US. One India-based radiologist was found to have a rate of flagging positives (and consequently sensitivity) substantially below the others (Supplementary Figure S2) and so was excluded from subsequent analyses to avoid underrepresenting radiologist performance. The DLS's ROC curve was above the performance points of all 9 remaining India-based radiologists ( Figure 2A ). Our prespecified primary analyses involved comparisons of the DLS at a prespecified operating point (0.45) with India-based radiologists. The DLS's sensitivity (88%, 95% CI 83-94%) was higher than (superiority test was conducted if non-inferiority passed, see Methods) the India-based radiologists (median sensitivity: 74%; IQR: 72-76%), p<0.001. At the same operating point, the DLS's specificity (79%, 95%CI 75-82%) was similarly non-inferior to the India-based radiologists (median specificity: 86%; IQR: 81-87%), p=0.003. 
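The confidence intervals quoted above were computed with the bootstrap method using 1000 samples (Methods). A minimal sketch of one common variant, the percentile bootstrap over cases, is shown below. The choice of the percentile variant, the function name `bootstrap_ci`, the fixed seed, and the toy data (100 hypothetical TB-positive cases, 88 of them flagged) are assumptions and are not study data.

```python
import random
from typing import Callable, Sequence, Tuple


def bootstrap_ci(values: Sequence[float],
                 statistic: Callable[[Sequence[float]], float],
                 n_resamples: int = 1000,
                 confidence: float = 0.95,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI obtained by resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = 1 - confidence
    lower = stats[round((alpha / 2) * (n_resamples - 1))]
    upper = stats[round((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


# Hypothetical per-case indicators: 1 if a TB-positive case was flagged.
hits = [1] * 88 + [0] * 12
point_estimate = mean(hits)  # sensitivity on the toy data
ci_low, ci_high = bootstrap_ci(hits, mean)
```

This resamples cases only; it does not account for reader-level correlation, which the primary analyses handle separately.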
While both India-based and US-based radiologists had sensitivities and specificities that tracked closely and slightly below the ROC curve of our model, the conservativeness with which the two groups of radiologists called cases as positive for TB appeared to differ (Figure 2A -C, Supplementary Figure S2 ). India-based radiologists appeared to be more specific but less sensitive than US-based radiologists, who had a median sensitivity of 84% (IQR 76-86%) and a median specificity of 71% (IQR 67-81%). The DLS's sensitivity and specificity remained comparable to the US-based radiologists (p-value for noninferiority: 0.022 for sensitivity; 0.018 for specificity). Next, we conducted subgroup analysis on a per-dataset level ( Table 2 and Figure 2B ). The China and US datasets were similarly-constructed case-control datasets, with normal CXRs selected to match the TB positive CXRs. On these two datasets, while the India-based radiologists achieved high specificity (96-99%), their sensitivity was lower (53-65%) compared to in the combined dataset. At the prespecified operating point, both the DLS's sensitivity and specificity were non-inferior to the radiologists in both datasets (p<0.001 for all 4 comparisons). In the India dataset, which consisted of TB presumptive patients identified in a tertiary hospital, the DLS was similarly non-inferior in both sensitivity and specificity (p<0.001 for both). In the Zambia dataset, which was taken from a trial 25 , NAAT was associated with cases where the CAD4TB system had flagged an abnormal CXR, resulting in substantial enrichment for CXR-abnormal TB-negatives. In this dataset, at the prespecified operating point, the DLS was non-inferior for sensitivity (p<0.001) but not for specificity (p=0.504), though 8 of 9 India-based radiologists were below the ROC curve. In addition to the 4 datasets above, we evaluated the DLS on another independent dataset from a mining population in South Africa ( Figure 2D and Supplementary Table S2 ). 
The ROC curve of the model was above all but 1 radiologist. At the same prespecified operating point as the other datasets, the DLS was non-inferior both in terms of sensitivity and specificity to both India-based and US-based radiologists (p<0.05 for all). At a higher (lower sensitivity) operating point selected based on the South Africa tune datasets, the DLS was again non-inferior in both sensitivity and specificity compared to the India-based radiologists, but had higher specificity (p=0.012) at the cost of not being non-inferior in sensitivity (p=0.571). To better understand inter-dataset differences, histograms of DLS prediction scores were plotted separately for TB positive and TB negative cases for each dataset (Figure 3). The distribution of DLS scores for both TB positive and TB negative cases remained similar across the China, India, and US datasets (Supplementary Table S5). However, there was a higher proportion of TB-negative cases with high DLS scores in the Zambia dataset. This appears to have been a consequence of first-round CAD screening of the Zambia dataset which censored many normal-appearing CXRs, resulting in a more challenging dataset with a relative paucity of normal CXRs. To facilitate comparisons despite the wide range in radiologists' sensitivities and specificities, both across datasets and readers, we next conducted a matched analysis by shifting the DLS's operating point on a per-dataset level to (1) compare sensitivities at mean radiologist specificity, and (2) compare specificities at mean radiologist sensitivity. These analyses were done separately for the India-based radiologists and US-based radiologists, for a total of 16 analyses (4 datasets * 2 comparator radiologist groups * matching sensitivity/specificity), and are presented in Table 3. The DLS had non-inferior performance in 15 out of these 16 analyses (p<0.05 for 15 and p=0.068 for the remaining). 
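The matched analysis described above shifts the DLS operating point until its specificity reaches the radiologists' mean specificity, and then compares sensitivities. One direction of that matching is sketched below. The function name, the rule of taking the lowest threshold that meets the specificity target, and all numeric values are hypothetical, and the sketch omits the accompanying non-inferiority test.

```python
from typing import Sequence, Tuple


def sensitivity_at_matched_specificity(scores: Sequence[float],
                                       labels: Sequence[int],
                                       target_specificity: float
                                       ) -> Tuple[float, float]:
    """Return (threshold, sensitivity) at the lowest threshold whose
    specificity reaches the target."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # An infinite threshold flags nothing, so specificity 1.0 is always reachable.
    candidates = sorted(set(scores)) + [float("inf")]
    for threshold in candidates:
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity - 1e-12:
            sensitivity = sum(s >= threshold for s in positives) / len(positives)
            return threshold, sensitivity
    raise ValueError("target specificity must be at most 1.0")


# Hypothetical per-dataset scores and labels (1 = TB positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
# Hypothetical mean radiologist performance on the same cases.
radiologist_mean_specificity = 0.75
radiologist_mean_sensitivity = 0.75

matched_threshold, matched_sensitivity = sensitivity_at_matched_specificity(
    scores, labels, radiologist_mean_specificity)
sensitivity_difference = matched_sensitivity - radiologist_mean_sensitivity
```

Because specificity can only rise as the threshold rises, the first qualifying threshold in ascending order is also the one with the highest sensitivity. The reverse comparison (specificity at matched sensitivity) follows the same pattern with the roles swapped.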
Next, we adjusted the DLS's operating point to match each individual radiologist's sensitivity and specificity, focusing on the two larger datasets (India, Zambia) to improve statistical power. With 14 radiologists, 2 datasets, and matching sensitivity/specificity, this amounted to 56 analyses (Supplementary Table S3). In 50 of these analyses, the DLS was non-inferior (p<0.05), with 4 of the 6 non-passing tests in the enriched Zambia dataset and in comparison with US-based radiologists. The WHO "target product profile" for a TB screening test recommends a sensitivity ≥90% and a specificity ≥70%. To further understand the performance of the DLS, we conducted a matched performance analysis, similar to the radiologist-matched analysis above. At 90% sensitivity, the DLS had a specificity of 77% on the combined dataset; and at 70% specificity the DLS had a sensitivity of 93%, both of which met the recommendations. This remained true in the China, India, and US datasets, but not in the enriched Zambia dataset (Table 4). We next considered subgroups based on HIV status where available (this included most patients in the Zambia dataset). The DLS found the HIV-positive subgroup more challenging than the HIV-negative subgroup (DLS AUC: 0.81 vs 0.92), and a similar lowering of sensitivity and specificity was observed for the radiologists (Figure 2C). However, the DLS remained comparable to the radiologists in both subgroups, notably despite the DLS not having access to the HIV status as the radiologists did. Sputum smear microscopy is fast, inexpensive, and specific for Mycobacterium tuberculosis. Despite the low sensitivity, it is still used for rapid diagnosis in resource limited settings. 40 We evaluated the performance of our model on this subset using our Zambia dataset, and evaluated the DLS's sensitivity for TB-positive cases with different sputum smear results. 
Although this subset was small with only 12 smear positive and 14 smear negative patients, at our prespecified operating point, our model had 100% sensitivity for smear-positive TB-positive patients and 71% sensitivity in smear-negative TB-positive patients. We also evaluated the DLS in subgroups based on age and sex (Supplementary Figures S3-4). The DLS's AUC ranged from 0.86 to 0.97 within these subgroups, with similar trends of the ROC curve remaining higher than almost all of the radiologists. During case review, radiologists could indicate that images had technical issues that hindered confident interpretation. As the number of radiologists who indicated such issues grew from 0 to 1 to 2, the DLS AUC decreased, from 0.99 to 0.91 to 0.82 (Supplementary Figure S5). When grouping these images using a cumulative approach (i.e., "1 or more", "2 or more"), the trends were similar. However, of the 45 images where 3 or more radiologists indicated a technical issue, the AUC trend reversed to 0.90, though the confidence intervals grew. As may be expected, the radiologists' sensitivities and specificities moved in a similar manner for cases they had indicated issues with. We further evaluated the "abnormality" detector in the DLS on the India test set, using labels provided by 3 US-based radiologists as the "ground truth" (Methods). The DLS was then evaluated using this ground truth in 2 ways. First, we used the entire India test set and defined a positive case as either being TB-positive or having another abnormal CXR finding. Depending on how many radiologists indicated the presence of the abnormality, the DLS's AUC ranged from 0.80 to 0.96 (Supplementary Figure S6). Second, using only TB-negative cases, we defined a positive case as having any abnormal CXR finding, and plotted the ROC for the DLS's "non-TB abnormality" prediction alone. The AUC of the DLS ranged from 0.71 to 0.85, though with wider confidence intervals (Supplementary Figure S6). 
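The three "ground truth" definitions used for the abnormality analysis (at least 1 of 3, at least 2 of 3, and all 3 of 3 radiologists indicating an abnormal finding) can be derived from per-reader votes as in the sketch below. The function name and the toy votes are hypothetical.

```python
from typing import List, Sequence


def consensus_labels(votes: Sequence[Sequence[int]],
                     min_readers: int) -> List[int]:
    """Label a case abnormal (1) if at least `min_readers` readers flagged it."""
    return [1 if sum(case_votes) >= min_readers else 0 for case_votes in votes]


# Toy votes from 3 readers on 4 cases (1 = abnormal finding indicated).
votes = [(1, 1, 1), (1, 0, 0), (0, 0, 0), (1, 1, 0)]

at_least_1_of_3 = consensus_labels(votes, 1)
at_least_2_of_3 = consensus_labels(votes, 2)
all_3_of_3 = consensus_labels(votes, 3)
```

Stricter definitions shrink the positive set to the most clear-cut abnormalities, which is one plausible reason an AUC would vary across the three definitions.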
Finally, in our analysis of potential cost savings, we simulated a workflow where patients only proceed to GeneXpert testing if they are flagged as positive by the DLS. This workflow has a reduced overall sensitivity (though still exceeding the WHO target of 90%), but substantially reduces cost via a lower number of confirmatory tests being conducted and thus improves cost effectiveness as measured by cost per positive TB case detected. We then simulated the cost of using the DLS performance on the India dataset (94% sensitivity and 95% specificity), the WHO target performance (90% sensitivity and 70% specificity), a lower-specificity device (90% sensitivity and 65% specificity), and GeneXpert only (no CXR). Based on the performance on the India dataset, as prevalence decreases from 10% to 1%, the cost per positive TB case detected increased substantially, and the cost savings compared to using GeneXpert alone increased from 73% to 82% ( Figure 5 ). The corresponding cost savings at low prevalence is not as profound when simulating the WHO target (47% to 53%) and a lower-specificity device (42 to 48%). In order to achieve the long term public health vision of global elimination of TB, there is a pressing need to scale up identification and treatment in resource-constrained settings. The recently-released 2021 WHO consolidated guidelines stated that CAD technologies had the potential to "increase equity in the reach of TB screening interventions and in access to TB care." They also emphasized the importance of using a high-performing CAD that was tested on CXRs drawn from a representative population for the corresponding use case. 7, 41 We have developed a DLS using data from 9 countries and validated the DLS in 5 countries, together covering many of the high-TB-burden countries and a wide range of race/ethnicities and clinical settings. 
In this combined international test dataset, the DLS at its prespecified operating point demonstrated higher sensitivity and non-inferior specificity relative to a large cohort of India-based radiologists. The development of a DLS with robust performance across a broad spectrum of patient settings has the potential to equip public health organizations and healthcare providers with a powerful tool to reduce inequities in efforts to screen and triage TB throughout the world. When considering each dataset individually, the DLS's performance was excellent in two commonly used case-control datasets from China and US, and generalized well to an external validation set in India. Moreover, the DLS's performance was maintained in the enriched Zambia dataset, which was filtered by another CAD device. Since many images that were considered radiologically clear were excluded from this dataset, the difficulty of triaging the remaining cases was likely increased. The DLS also performed well when radiologists indicated minor technical issues with the image, indicating robustness to real-world issues. In addition to performing well in different countries with a wide range of race/ethnicities, the model was also comparable to radiologists in important subgroups. First, HIV infection increases the risk of active TB disease up to 40-fold compared to background rates. 1,42 Patients with HIV-associated pulmonary TB often have an atypical presentation on CXR, making them more difficult to screen. 43 Thus, the fact that the DLS's detection performance remained comparable with radiologists in HIV-positive patients is reassuring. Second, sputum smear has a fast turnaround and a low cost, leading to its importance in resource-limited settings despite having limited sensitivity. Though the subgroups were small, the DLS was able to identify all sputum-positive cases and remained accurate on sputum-negative cases. 
The fact that the use of the DLS should not lead to missing cases that would otherwise be detected by a relatively accessible procedure is comforting, but will need to be further validated in a larger population. We further verified that the DLS remained comparable to the radiologists in important populations based on demographic information, for patients without a prior history of TB, and subgroups based on symptoms including WHO's recommended 4-symptom screen: cough, weight loss, fever, or night sweats. Importantly, our test set comprising a gold mining population in South Africa is supportive evidence for the DLS's potential to help with this subgroup recommended by the WHO for systematic screening. Finally, the DLS was able to accurately detect other non-TB abnormalities that were identified by radiologists. Such a capability resolves one of the drawbacks that traditional CAD systems were noted to have by the WHO: that unlike human readers, the CAD systems could not simultaneously screen for pulmonary or thoracic conditions. Although NAATs such as GeneXpert have high positive predictive value, many populations are unable to derive the broadest possible benefit from such tests because of their higher relative per-unit cost. However, if coupled to an inexpensive but relatively sensitive first-line filter like CXR (i.e., only cases screening positive on CXR are tested using NAAT), the benefits of NAAT could effectively reach a larger population due to more targeted use. Two-stage screening strategies of this type would traditionally be intractable in many locales because settings with constrained access to NAAT often also lack providers trained to reliably interpret CXRs for TB-related abnormalities. In these settings, in accordance with current WHO guidelines, a robustly performing CAD can increase the viability of this strategy by serving as an effective alternative to human readers. 
Our cost analysis of this two-stage screening workflow using the DLS suggests that it has the potential to provide 40-80% cost savings at 1-10% prevalence. The cost savings increase further as prevalence falls, which is an important financial consideration in disease eradication.

Our comprehensive analysis with a large cohort of radiologists also revealed several important subtleties. First, irrespective of practice location, radiologists demonstrated a wide range of sensitivities and specificities. For example, even among our 9 India-based radiologists, sensitivities spanned an 18-percentage-point range (69-87%). This variability is documented in the literature, with clinical experience being a potential contributing factor. 7,44-46 However, this means that direct, single-operating-point comparisons with any individual radiologist can be difficult to interpret without matching the DLS operating point to either the sensitivity or specificity of that reader. Second, radiologists practicing in India were generally more specific and less sensitive than those practicing in the US. This may partially be due to practice patterns: in India, where TB is endemic, radiologists' calls need to be highly specific to avoid testing an overwhelming number of patients. By contrast, in the US, where TB is relatively rare and the goal is to avoid outbreaks, radiologists are incentivized to make calls that are highly sensitive at the expense of specificity. The wide range in performance between individuals and across practice locations suggests that future CAD for TB studies will likely need to take into account the practice locations of the comparator radiologists, and ensure that a sufficient number of radiologists are recruited to represent the natural variability.
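Matching the DLS operating point to a given reader's sensitivity, as discussed above, amounts to choosing a threshold on the model's continuous output. The sketch below shows one simple way to do this; it is not the procedure used in the study. The function names and the toy scores and labels are hypothetical, and a case is called positive when its score is greater than or equal to the threshold.

```python
import numpy as np


def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when calling score >= threshold positive."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    calls = scores >= threshold
    sens = calls[labels == 1].mean()
    spec = (~calls)[labels == 0].mean()
    return float(sens), float(spec)


def threshold_for_sensitivity(scores, labels, target_sens):
    """Highest threshold whose sensitivity is at least target_sens."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = np.sort(scores[labels == 1])  # ascending positive-case scores
    # At least k positive cases must score at or above the threshold.
    k = max(int(np.ceil(target_sens * len(pos))), 1)
    return float(pos[len(pos) - k])


if __name__ == "__main__":
    # Hypothetical toy data: 4 TB-positive and 4 TB-negative cases.
    scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
    labels = [0, 0, 1, 0, 1, 0, 1, 1]
    reader_sensitivity = 0.75  # hypothetical reader to match
    t = threshold_for_sensitivity(scores, labels, reader_sensitivity)
    print(t, sens_spec(scores, labels, t))
```

Once sensitivity is matched, the remaining difference in specificity gives a like-for-like comparison with that reader. In practice the threshold would be chosen on a held-out tuning set rather than on the data used for the comparison.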
Remarkably, despite variability in individual radiologists' sensitivity and specificity, their performance tracked closely with the DLS's ROC curve, with the clearest evidence of this trend seen in the enriched Zambia dataset, which exhibits a marked rightward shift toward lower specificity. This suggests that the DLS's ability to provide continuous "scores" as output for thresholding can likely help individual sites customize the triggering rate to their local practice patterns and resource needs, while trusting that the customized operating point has a similar effect to calibrating the "conservativeness" of a radiologist. As suggested by the WHO, this ability to calibrate the operating point may be critical over time even for the same population, as prevalence and disease severity change. Statistical methods to tune operating points for each dataset and to detect when operating points should be updated over time may be useful in this regard and represent an important direction of future work for real-world use cases.

Our study has limitations. First, this study was retrospective, and prospective validation will be needed to better understand challenges in integrating into real-world workflows and to rigorously determine the TB status via mycobacterial culture for all patients. Second, as highlighted by the WHO, TB screening often happens in a broader context of patient care; patients can present with symptoms suspicious for TB but have other pulmonary or thoracic conditions instead. The subgroup analyses presented here were also not fully comprehensive due to the lack of important variables (such as HIV status and symptoms) in several datasets. As our datasets all had relatively high prevalence, we will need to evaluate the DLS's performance in populations with lower prevalence. Finally, the cost analysis is a simulation that makes simplifying assumptions such as DLS performance being unaffected by prevalence changes.
In practice, prevalence changes may be associated with case severity, which may affect DLS sensitivity or specificity.

In this study, we developed a DLS and demonstrated its generalization via international test datasets spanning 5 countries encompassing a wide range of race/ethnicities: China, India, South Africa, US, and Zambia. At a uniform, prespecified operating point, the DLS had significantly higher sensitivity while maintaining non-inferior specificity compared to 9 India-based radiologists. When compared with US-based radiologists, who were more sensitive but less specific, the DLS was non-inferior in both sensitivity and specificity. The DLS further meets the WHO targets when its operating point is matched to either 90% sensitivity or 70% specificity. The DLS may be able to facilitate TB screening in areas with scarce radiologist resources, and merits further prospective clinical validation.

Figure 1. Overview of our deep learning system (DLS). The system consists of three modules: a lung segmentation model to specifically crop the lungs, a detection model to identify regions of interest, and a classification model that takes the output from the other two models to predict the likelihood of the CXR being TB positive. The large-scale abnormality pretraining and noisy student semi-supervised learning used to train these modules are not visualized here.

Figure S2); presented here for completeness.

Latent tuberculosis infection: updated and consolidated guidelines for programmatic management. (World Health Organization)
Management of drug-resistant tuberculosis
How COVID hurt the fight against other dangerous diseases
General Assembly adopts Declaration of the first-ever United Nations High Level Meeting on TB
WHO consolidated guidelines on tuberculosis: Module 2: screening - systematic screening for tuberculosis disease. (World Health Organization)
Automatic detection of mycobacterium tuberculosis using artificial intelligence
Using artificial intelligence to read chest radiographs for tuberculosis detection: A multi-site evaluation of the diagnostic accuracy of three deep learning systems
Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks
Artificial neural network models to support the diagnosis of pleural tuberculosis in adult patients
Chest x-ray analysis with deep learning-based software as a triage test for pulmonary tuberculosis: a prospective study of diagnostic accuracy for culture-confirmed disease
Computer aided detection of tuberculosis on chest radiographs: An evaluation of the CAD4TB v6 system
Detection of tuberculosis patterns in digital photographs of chest X-ray images using Deep Learning: feasibility study
A novel stacked generalization of models for improved TB detection in chest radiographs
Automatic tuberculosis screening using chest radiographs
Performance of Qure.ai automatic classifiers against a large annotated database of patients with diverse forms of tuberculosis
Development and Validation of a Deep Learning-based Automatic Detection Algorithm for Active Pulmonary Tuberculosis on Chest Radiographs
A new resource on artificial intelligence powered computer automated detection software products for tuberculosis programmes and implementers
CheXpedition: Investigating Generalization Challenges for Translation of Chest X-Ray Algorithms to the Clinical Setting
The TB Portals: an Open-Access, Web-Based Platform for Global Drug-Resistant-Tuberculosis Data Sharing and Analysis
Two public chest X-ray datasets for computer-aided screening of pulmonary diseases
Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration
Active TB case finding in a high burden setting; comparison of community and facility-based strategies in Lusaka, Zambia
Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation
Deep Learning for Distinguishing Normal versus Abnormal Chest Radiographs and Generalization to Unseen Diseases
SSD: Single Shot MultiBox Detector
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Self-training with Noisy Student improves ImageNet classification
Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations
A comparison of denominator degrees of freedom methods for multiple observer ROC analysis
Multireader multicase reader studies with binary agreement data: simulation, analysis, validation, and sizing
Observer performance methods for diagnostic imaging: foundations, modeling, and applications with R-based examples
Hypothesis testing in noninferiority and equivalence MRMC ROC studies
Committee for Proprietary Medicinal Products. Points to consider on switching between superiority and non-inferiority
Automated chest-radiography as a triage for Xpert testing in resource-constrained settings: a prospective study of diagnostic accuracy and costs
Guidance for Studies Evaluating the Accuracy of Sputum-Based Tests to Diagnose Tuberculosis
Module 2: screening - systematic screening for tuberculosis disease
A prospective study of the risk of tuberculosis among intravenous drug users with human immunodeficiency virus infection
Pulmonary TB: varying radiological presentations in individuals with HIV in Soweto, South Africa
Scoring systems using chest radiographic features for the diagnosis of pulmonary tuberculosis in adults: a systematic review
A systematic review of the sensitivity and specificity of symptom- and chest-radiography screening for active pulmonary tuberculosis in HIV-negative persons and persons with unknown HIV status
Diagnostic accuracy of chest radiography for the diagnosis of tuberculosis (TB) and its role in the detection of latent TB infection: a systematic review
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization

The authors thank the members of the Google Health Radiology and labeling software teams for software infrastructure support, logistical support, and assistance in data labeling. For tuberculosis data collection, thanks go to Sameer Antani, Stefan Jaeger, Sema Candemir, Zhiyun Xue, Alex Karargyris, George R. Thomas, Pu-Xuan Lu, Yi-Xiang Wang, Michael Bonifant, Ellan Kim, Sonia Qasba, and Jonathan Musco. The training dataset from Europe/India was obtained from the TB Portals (https://tbportals.niaid.nih.gov), which is an open-access TB data resource supported by the National Institute of Allergy and Infectious Diseases (NIAID) Office of Cyber Infrastructure and Computational Biology (OCICB) in Bethesda, MD. These data were collected and submitted by members of the TB Portals Consortium (https://tbportals.niaid.nih.gov/Partners). Investigators and other data contributors that originally submitted the data to the TB Portals did not participate in the design or analysis of this study. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study, Jonny Wong for coordinating the imaging annotation work, Anna Majkowska for initial modeling efforts, Joshua Reicher for early input, Christopher Semturs for team guidance, Rayman Huang for statistical input, T Saensuksopa for figure and user interface design, and Akinori Mitani and Craig H. Mermel for manuscript feedback.

Supplementary Figure S1. STARD diagrams.
Supplementary Figure S2. Boxplots showing the rates of flagging images as positive per grader. The outlier India-based radiologist was excluded owing to a substantially lower rate of flagging positives (15% vs 22-33%) and accordingly an anomalously low sensitivity (57% vs 69-86%).
Outliers were defined using matplotlib default settings, as points beyond the whiskers (1.5 * interquartile range away from the first or third quartiles).
Supplementary Table S1. Detailed dataset information.
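The outlier rule mentioned in the Supplementary Figure S2 caption can be reproduced without plotting. The sketch below applies the 1.5 * interquartile range criterion, which is matplotlib's default `whis=1.5` for boxplots, using NumPy percentiles. The function name and the per-grader flag rates are hypothetical and are not the study's data.

```python
import numpy as np


def boxplot_outliers(values, whis=1.5):
    """Return the values lying beyond whis * IQR from the first or third quartile."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return v[(v < lower) | (v > upper)]


if __name__ == "__main__":
    # Hypothetical fractions of images flagged positive by each grader.
    flag_rates = [0.15, 0.22, 0.24, 0.25, 0.26, 0.27, 0.28, 0.30, 0.31, 0.33]
    print(boxplot_outliers(flag_rates))
```

With these made-up rates, only the lowest grader falls below the lower fence, mirroring how a single reader with an unusually low flagging rate would be identified.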