key: cord-0308864-gih0w3pq authors: Koyuncu, D.; Niazi, M. K. K.; Tavolara, T.; Abeijon, C.; Ginese, M.; Liao, Y.; Mark, C.; Gower, A. C.; Gatti, D. M.; Kramnik, I.; Gurcan, M.; Yener, B.; Beamer, G. title: Tuberculosis biomarkers discovered using Diversity Outbred mice. date: 2021-01-09 journal: nan DOI: 10.1101/2021.01.08.20249024 sha: f4d7323a925fe1e934c60d723e5973baafbff3da doc_id: 308864 cord_uid: gih0w3pq Background: Biomarker discovery for pulmonary tuberculosis (TB) may be accelerated by modeling human genotypic diversity and phenotypic responses to Mycobacterium tuberculosis (Mtb). To meet these objectives, we use the Diversity Outbred (DO) mouse population and apply novel classifiers to identify informative biomarkers from multidimensional data sets. Method: To identify biomarkers, we infected DO mice with aerosolized Mtb confirmed a human-like spectrum of phenotypes, examined gene expression, and inflammatory and immune mediators in the lungs. We measured 11 proteins in 453 Mtb-infected and 29 non-infected mice. We have searched all combinations of six classification algorithms and 239 biomarker subsets and independently validated the selected classifiers. Finally, we selected two mouse lung biomarkers to test as candidate biomarkers of active TB, measuring their diagnostic performance in human sera acquired from the Foundation for Innovative New Diagnostics. Findings: DO mice discovered two translationally relevant biomarkers, CXCL1 and MMP8 that accurately diagnosed active TB in humans with > 90% sensitivity and specificity compared to controls. We identified them through the two classifiers that accurately diagnosed supersusceptible DO mice with early-onset TB: Logistic Regression using MMP8 as a single biomarker, and Gradient Tree Boosting using a panel of 4 biomarkers (CXCL1, CXCL2, TNF, IL-10). Interpretation: This work confirms that the DO population models human responses and can accelerate discovery of translationally relevant TB biomarkers. Biomarker discovery for pulmonary tuberculosis (TB) may be accelerated by modeling human genotypic diversity and phenotypic responses to Mycobacterium tuberculosis (Mtb). To meet these 20 objectives, we use the Diversity Outbred (DO) mouse population and apply novel classifiers to identify informative biomarkers from multidimensional data sets. To identify biomarkers, we infected DO mice with aerosolized Mtb confirmed a human-like spectrum of phenotypes, examined gene expression, and inflammatory and immune mediators in 25 the lungs. We measured 11 proteins in 453 Mtb-infected and 29 non-infected mice. We have searched all combinations of six classification algorithms and 239 biomarker subsets and independently validated the selected classifiers. Finally, we selected two mouse lung biomarkers to test as candidate biomarkers of active TB, measuring their diagnostic performance in human sera acquired from the Foundation for Innovative New Diagnostics. 30 Findings DO mice discovered two translationally relevant biomarkers, CXCL1 and MMP8 that accurately diagnosed active TB in humans with > 90% sensitivity and specificity compared to controls. We identified them through the two classifiers that accurately diagnosed supersusceptible DO mice with early-onset TB: Logistic Regression using MMP8 as a single biomarker, and Gradient Tree Introduction achieved higher sensitivities and specificities than the WHO TPPs: Logistic Regression with matrix metalloproteinase 8 (MMP8) and Gradient Tree Boosting 14 with a panel of four biomarkers (CXCL1, CXCL2, tumor necrosis Factor (TNF), interleukin-10 (IL-10)). MMP8 and CXCL1 are 70 significantly higher in human pulmonary TB patients than latent Mtb infection (LTBI) and normal controls. We show the DO population's capacity to model human TB allows identification and validation of diagnostic biomarkers with strong translational potential to improve diagnosis of human pulmonary TB patients. Tufts Institutional Animal Care and Use Committee (IACUC) and the Institutional Biosafety Committee (IBC) approved experiments under IACUC protocols: G2012-53; G2015-33; G2018-33 and IBC registrations GRIA04; GRIA10; and GRIA17. 80 Female DO and C57BL/6J mice were purchased from The Jackson Laboratory (Bar Harbor, ME) and housed in sterile BSL3 conditions at the New England Regional Biosafety Laboratory (Tufts University, Cummings School of Veterinary Medicine, North Grafton, MA). At 8-10-weeks old, mice were infected aerosolized Mtb strain Erdman bacilli (~100 in experiments 1 and 2; ~25 in 85 experiments 3, 4, 5) using a CH Technologies system. We examined mice daily for weakness/lethargy, respiratory distress, and body condition score 15 . Mice were weighed at least twice per week. All mice were confirmed to be infected by recovering live bacilli from homogenized lung tissue as described 11, 16 . 90 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint One lung lobe was homogenized in TRIzol™ and stored at -80C until extraction using Pure Link . The CEL files were also normalized using Expression Console (build 1.4.1.46) and the default probesets defined by Affymetrix to assess array quality using the AUC metric. Moderated t tests and ANOVAs were performed using the limma package (version 3.39.19) (i.e., creating simple linear models with lmFit, followed by empirical Bayesian adjustment with eBayes). Correction for multiple hypothesis testing was accomplished using the 105 Benjamini-Hochberg false discovery rate (FDR) 22 . Human homologs of mouse genes were identified using HomoloGene (version 68) 23 . All microarray analyses were performed using the R environment for statistical computing (version 3.6.0). Enrichr (https://amp.pharm.mssm.edu/Enrichr) was used to determine the overrepresentation of Gene Ontology (GO) biological processes (version 2018) within an input set of official mouse gene 110 symbols. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Two lung lobes from each mouse were homogenized in 2mL of phosphate buffered saline and 115 stored at -80C until use. Lung samples were serially diluted and tested for CXCL5, CXCL2, CXCL1, TNF, MMP8, S100A8, IFN-γ, IL12p40, IL-12p70 heterodimer, IL-10, and VEGF by sandwich ELISA using antibody pairs and standards from R&D Systems (Minneapolis, MN), Invitrogen (Carlsbad, CA), eBioscience (San Diego, CA), or BD Biosciences (San Jose, CA, USA), per kit instructions. 120 Serum samples were obtained from the Foundation for Innovative Diagnostics (FIND; Geneva, Switzerland). Patients were HIV-negative adults diagnosed with ATB by clinical signs and confirmed with sputum microscopy or culture. Patients diagnosed with LTBI did not have clinical, radiographical, or microbiological evidence of disease but were immunologically reactive to Mtb antigens. For additional details see Table S6 . MMP8 and CXCL1 were quantified by ELISA using 125 antibody pairs and standards from R&D Systems (Minneapolis, MN), per kit instructions. Survival, weight, and ELISA data were analyzed and graphed in GraphPad Prism 8.4.2 with significance set at p < 0.05 and adjusted for multiple comparisons. Survival curves were analyzed 130 using Log-rank (Mantel-Cox) test. For lung biomarkers and weights, data was analyzed for normal or lognormal distributions prior to Kruskal-Wallis one-way ANOVA and Dunnett's post-tests, ***p<0.001. In DO mouse studies, sample sizes were determined based on a recent systematic review with recommendations for human biomarker studies 6 . The studies using human sera were pilot studies to generate proof-of-concept results. All available human serum samples were used 135 in ELISA testing. The AUC analysis on the human sera samples is done using GraphPad Prism 8.4.3. If there are multiple operating points that satisfies the >90% sensitivity and >70% All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. VEGF, S100A8 are included in a biomarker combination than the second set is used and if none of them are included then the first set is used. The samples are split 75% for classifier training and 150 validation and 25% for classifier testing, stratified by class (Supersusceptible (SS), and not Supersusceptible (nSS)) and experiment number. Leave-one-experiment-out setting: In the leave-one-experiment-out setting, three of the four experiments in the discovery cohort are combined and used as the training set and the remaining 155 one is used as the validation set. Resulting in four different train/validation pairs. When the training portion of the discovery cohort is used the training and validation set sizes of the four pairs are 253,59; 267,45; 269,43;147,165 respectively and when the testing portion of the discovery cohort is used the training and validation set sizes of the four pairs are 253,17; 267,13; 269,12;147,53 respectively. If any of the biomarkers MMP8, VEGF, S100A8 are included in the subset selection 160 then only the samples with complete measurements for all ten biomarkers are used. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. Classification algorithm, hyper-parameter and feature selection: We have used two different variations of best-subset selection. Using the first approach, we have identified the Logistic 185 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. Hyper-parameters searched: For Gradient Boosting Tree, logloss is used and as hyper-parameters learning rate (0.001, 0.01), number of trees (1, 3, 5) , and max depth of a tree (1,3,5) are searched All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. Operating point selection: We have selected the prediction threshold as a hyper-parameter and selected it through K-fold-CV. In the first approach, 5-fold-CV is used and the threshold that achieves the highest sensitivity while maintaining at least 70% specificity is selected. In the second 225 approach, in order to reduce the error resulting from selecting a single threshold for K-classifiers we have used a higher number of folds where during the classifier selection 100-fold-CV is used and 150-fold-CV is used when the selected classifier is retrained using both training and the testing datasets. The operating point that maximizes the experiment-wise sensitivity while achieving at least 70% specificity in each of the experiments is selected. If two operating points satisfy the 230 criteria and achieve a similar (<0.1%) experiment-wise sensitivity the one with the higher experiment-wise specificity is selected. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint 235 One third of DO mice are supersusceptible (SS) to Mtb, developing early-onset disease within 8 weeks of aerosol infection by ~25 bacilli, significantly reducing survival compared to age-and sex-matched non-infected DO mice or to Mtb-infected C57BL/6J inbred mice ( Figure 1A ). The SS phenotype is consistent across sexes, institutions, and aerosol infection methods 11, 25 . Peak mortality occurs 20-35 days after Mtb ( Figure 1B ) and disease onset begins as early as 4 days 240 after infection ( Figure 1C ). The remaining ~70% of Mtb-infected DO mice and all C57BL/6J mice show no morbidity or mortality over 8 weeks. 245 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. (Table S1) . Mechanistically, all gene pathways 260 specific to SS converge on acute inflammatory responses: Neutrophil (granulocyte) recruitment and degranulation; positive inflammatory signaling, cytokines and chemokines; and extracellular matrix degradation. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Figure 2 . 121 genes are unniquely and highly expressed in lungs of SS DO mice, compared to other groups. Microarray gene expression profiling identified a set of 121 genes that changed significantly >2-fold in SS (dark brown) relative to noninfected (gray) and to not-supersusceptible (tan) DO mice. Rows and columns correspond to genes and individual DO mice, respectively. Hierarchical clustering was performed across all rows and was also performed separately within each group. We z-normalized the expression values for each gene to a mean of zero and standard deviation of one across all samples in each row; blue, white, and red indicate z-scores of ≤ -2, 0, and ≥ +2, respectively. To generate data sets for classifier and biomarker selection, we used CXCL2, S100A8, and MMP8 proteins (products of highly expressed genes in SS DO mouse lungs identified by microarray); TH1 proinflammatory cytokines required for resistance to Mtb (IFN-γ, TNF, and IL- All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint 12 (p40 and p70); immune suppressive IL-10; and neutrophil chemokines that cause inflammation 270 in Mtb infected lungs: CXCL1, CXCL5 26-28 . We included Vascular Endothelial Growth Factor (VEGF) because VEGF is a component of a promising biomarker panel 7 . We measured proteins in the lungs of Mtb-infected DO mice, C57BL/6J mice, and non-infected age-and sex-matched control DO mice ( Figure 3 , panels A-K). All biomarkers except IL-12p40, IL-12p70 heterodimer, and VEGF were significantly higher in the lungs of SS DO mice compared to all other groups. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. To identify diagnostic biomarkers, we organized data from five different experimental infections into discovery and independent cohorts ( Figure 4 ). 280 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint First, we applied best-subset feature selection with 5-fold cross validation to the training data. From the 478 classifiers (a classifier refers to a specific classification algorithm and specific biomarkers) Logistic Regression with MMP8 (Table S2 ) achieved 0.95 AUC, 94.1% sensitivity and 87.4% specify for classifying SS and nSS mice. To avoid overfitting, we validated 285 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint performance of Logistic Regression with MMP8 on the test data, achieving 0.96 AUC with 94.4% sensitivity and 87.3% specificity. On data from the independent cohort, the classifier achieved 0.987 AUC ( Figure S1A ), 78.3% sensitivity, 100% specificity, 100% positive predictive value (PPV) and 94.9% negative predictive value (NPV). Although its sensitivity was below the TPP triage test target, Logistic Regression with MMP8 was highly attractive because it used a single 290 biomarker and a linear decision boundary. However, the classifier had unacceptable variation in specificity across experiments, which led us to investigate whether we could identify a different classifier that performed with greater consistency. We sought a classifier capable of 90% sensitivity and 70% specificity in each experimental infection used in discovery. Among the identified classifiers only three (Table S3 ) had >80% 295 minimum experiment sensitivity and 70% minimum experiment specificity in the leave-oneexperiment-out setting. From the three, we pursued Gradient Tree Boosting with the biomarker panel CXCL1, CXCL2, TNF, and IL-10 because it achieved the highest minimum experiment sensitivity (93.3%), had high experiment-wise sensitivity (97.3%) and overall sensitivity (96.6%). In the testing portion of the discovery cohort, the four-biomarker panel (CXCL1, CXCL2, 300 TNF, and IL-10) achieved 96.9% and 94.4% sensitivity and specificity, respectively, across all four experiments combined. In the leave-one-experiment-out setting, this classifier had 87.5% and 70% minimum experiment sensitivity and specificity, respectively. In the independent cohort, the four-biomarker panel achieved 91.3% sensitivity and 81.7% specificity, satisfying the WHO performance criteria for a triage test. The PPV and NPV of the classifier were 55.3% and 97.4%, 305 respectively, and its AUC was 0.95 ( Figure S1B ). We evaluated the classifier further using the 27 age-and sex-matched non-infected DO controls and it achieved 100% specificity. Of the 4 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint biomarkers in the panel, CXCL1 and CXCL2 were more important than TNF and IL-10, shown by higher variable importance values of 0.62, 0.32, 0.04, and 0.02, respectively. To determine how each biomarker contributes to classifier performance, we measured the 310 average percent difference. MMP8 achieved the highest effect, with an average percent difference of 11.31%, followed by CXCL1 (5.8%) and CXCL2 (4.82%) (Table S4 ). Figure 5 demonstrates this effect graphically by showing that biomarker panels lacking CXCL1, CXCL2, or MMP8 have lower AUC values than the biomarker sets that contain at least 1 of these 3. For human study, we pursued MMP8 and CXCL1 because neither have been fully investigated in humans and both performed well under different conditions in DO mice. MMP8 is the biomarker used in the first classifier and CXCL1 had the highest variable importance among the panel used in the second classifier. We obtained human sera from HIV-negative adults with 320 active pulmonary TB (ATB) or LTBI from the Foundation for Innovative Diagnostics. Pooled All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint serum from healthy individuals served as controls. MMP8 and CXCL1 protein levels were significantly higher in sera from patients with ATB to other groups ( Figure 6) (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint between 2010 and 2017 6 , identified 44 panels or single biomarkers that satisfies a TPP criterion. 335 However, in terms of the QUADAS-2 assessments (study design, sampling, negative population, timing, reference standard and blinding), only one panel had a low risk of bias 6 To address that need and accelerate TB biomarker discovery, we modeled human responses 345 to Mtb by using the DO mouse population. Like humans, this population is highly genetically diverse and demonstrates variable responses and disease outcomes to Mtb, ranging from supersusceptible to highly resistant 11, 25 . We classify Mtb-infected DO mice as supersusceptible based on survival <8 weeks. Our supersusceptible DO class is somewhat analogous to progressor DO mice recently described 30 . We prefer survival for ground truth as the early peak in DO 350 mortality is robust and reproducible across hundreds of mice, while Mtb burden and inflammation scores (used to identify DO progressors) fall along a continuous spectrum and become unstable ground truth as mouse numbers increase. Regardless, the richness of the DO responses to Mtb (Figures 3, 4) clearly provides an unprecedented tool for discovery, validation, and testing on independent cohorts. 355 We identified two classifiers that can accurately discriminate between supersusceptible and not-supersusceptible, namely Logistic Regression with MMP8 and Gradient Tree Boosting with a All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint panel of four biomarkers (CXCL1, CXCL2, TNF, and IL10). The first classifier had an advantage of being a single biomarker and a simpler regression model, with MMP8 satisfying the minimal requirements of a WHO TPP detection test. However, its performance showed batch variation 360 across experiments attributed to the different criterion used in feature selection. The second classifier had an advantage of performing consistently across experimental batches. Gradient Tree Boosting with CXCL1, CXCL2, TNF, and IL10 also achieved the WHO triage test specifications with the operating points selected in the discovery cohort. To our knowledge, using DO mice for translational biomarker discovery has not been 365 explored previously. Two highly performing biomarkers in DO mice were MMP8, the only biomarker used in the first classifier, and CXCL1, the biomarker that achieved the highest variable importance in the second classifier. Additionally, expression of Mmp8 gene was significantly higher in the lungs of supersusceptible DO mice compared to other groups. In human patients, both MMP8 and CXCL1 were significantly higher in patients active pulmonary TB compared to 370 patients with LTBI or to pooled sera from normal individuals. In our proof-of-concept study, each biomarker had sensitivity and specificity values higher than the minimum requirements of the TPP triage test to distinguish ATB from pooled normal samples. Discriminating ATB from LTBI in human sera using MMP8 or CXCL1 was more challenging, likely related to the small sample size. The AUC of MMP8 has dropped to 0.676 but CXCL1 maintained a relatively high AUC (0.844) 375 and its performance was only slightly below the triage test target. MMP8 and CXCL1 have not been fully investigated in humans, and we suggest they be rigorously pursued. MMP8 and CXCL1 were not among the TB biomarkers with the highest AUC values reported in humans which were C-reactive protein, transferrin, IFN-γ, interferon (IFN)-γ inducible protein-10 (IP-10), IL-27, and interferon-inducible T Cell Alpha Chemoattractant 380 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. Our work here discovering, validating, and independently testing TB biomarkers in DO mice is important, yet has three limitations. First, we used two separate datasets in the discovery cohort (n=407 or 345) because of missing data. Ideally, when comparing two different classifiers, 390 we should use identical data sets to estimate out-of-sample performance. However, since mouse lungs are small with limited volumes, not all samples were available to measure all biomarkers. This contributed to missing values, and for simplicity, samples with missing values were excluded from training, validation, and testing. Second, the WHO Target Product Profiles (TPPs) were intended for non-sputum samples of humans 9 and not lung samples from experimentally infected 395 mice. We believe, however, that benchmarking against the TPPs increases the translational relevance of our findings and future studies will use non-lung samples from Mtb-infected DO mice. Third, the number of human serum samples we used is smaller than the recent biomarker publications such as 7 which has 179 TB and 138 other TB-like disease patients. Our aim, however, is to demonstrate a proof of concept that DO mice can discover translationally relevant TB 400 biomarkers. We acknowledge that larger studies including TB patients with different forms of disease, co-morbidities and including patients with non-TB lung disease are required for final validation and independent testing in humans. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. BY -Funding acquisition, methodology, project administration, and review and editing of the manuscript. 425 GB -Funding acquisition, conceptualization, project administration; sample and data collection, assay performance, curation, data analysis, writing, review, editing of the manuscript. Authors declare no competing interests. 430 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. Table S1 -S6 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Fig. S1 . 560 Receiver Operating Characteristics (ROC) curves of classifier performance to diagnose supersusceptible Mtb-infected DO mice. A) Logistic Regression with MMP8, and B) Gradient Tree Boosting with CXCL1, CXCL2, TNF, and IL-10. Blue and orange curves correspond to the AUC in the discovery cohort (n=345 for A and n=407 for B) and independent cohort (n=116) respectively for both A and B. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Confusion matrices for classifying between SS and nSS mice in the independent cohort. A) 570 Logistic Regression with MMP8 and B) Gradient Tree Boosting with CXCL1, CXCL2, TNF, and IL-10. Row labels and column labels indicate the correct and the predicted labels respectively. A B 575 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Confusion matrices for the biomarkers in human sera. A) and B) confusion matrices of CXCL1 and MMP8 for classifying between ATB and pooled normal samples. C) and D) confusion matrices of CXCL1 and MMP8 for classifying between ATB and LTBI. For each confusion matrix row labels and column labels indicate the correct and the predicted labels respectively. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Fig. S4 . Additional lung biomarkers we measured are displayed in (A-D). All data were lognormal 585 distributed and analyzed by Kruskal-Wallis one-way ANOVA with Dunnett's multiple comparisons post-tests (*p<0.05; ***p<0.001). Each dot repsesents 1 mouse. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Table S1 . Pulmonary TB in SS DO mice reflects acute inflammation, neutrophil recruitment/activation and extracellular matrix degradation. Enrichr identified the biological pathways within a set of 121 genes that were highly expressed in SS mice relative to non-infected DO mice and non-SS mice. GO terms with adjusted p < 0.01 are shown, along with the human homologs of the genes 595 overlapping with the term. Expressed genes in bold were pursued as diagnostic biomarkers. IL1A, IL6, IL1B, CXCR2, CCL4, CCL3 , PROK2, S100A9, S100A8 neutrophil chemotaxis GO:0030593 2.27E-08 1.29E-05 CXCR2, CCL4, CCL3, SAA1, CXCL3, S100A9, S100A8 granulocyte chemotaxis GO:0071621 3.37E-08 1.72E-05 CXCR2, CCL4, CCL3, SAA1, CXCL3, S100A9, S100A8 positive regulation of inflammatory response GO:0050729 2.18E-07 1.01E-04 IL6, SERPINE1, CCL4, CCL3, OSM, S100A9, S100A8 (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Table S2 . 5-fold-CV results of the initial selection. The first three rows achieved the highest AUC and the remaining rows achieved comparable AUC with a simpler model. "Threshold" denotes the probability threshold selected. The standard deviation of the five folds is denoted with ±. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Table S3 . The 100-fold-CV performance of the three classifiers that satisfy the criterion for both leave-oneexperiment-out and four-experiments-combined settings. Columns under "Leave-one-exp-out" and "Four-exp-combined" correspond to the performance in leave-one-experiment-out setting and 610 four-experiments-combined setting respectively. "Spec." denotes specificity and "Sens." denotes sensitivity. Sensitivity (specificity) under the "Exp-Wise" columns indicate experiment-wise sensitivity (specificity) and sensitivity (specificity) under the "Min. Exp." columns indicate minimum experiment sensitivity (specificity) and sensitivity (specificity) under the "Overall" columns indicate sensitivity (specificity) (As defined in Materials and methods). (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Table S4 . Feature rankings of the ten biomarkers. "Avg." is short of average and "Diff." is short of difference. 620 In Materials and Methods, how the average % difference is calculated is described. To calculate the average difference the same method is used but instead of averaging over % difference, it's averaged over difference. Average AUC denotes the average AUC of the all biomarker panels that contain the biomarker. Average top 5 denotes the average AUC of the five biomarker panels with highest AUC that contain the biomarker and average bottom 5 denotes the average AUC of the 625 five biomarker panels with the least AUC that contain the biomarker. Avg (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Table S5 . 630 The AUC values of the ten biomarkers in the discovery cohort for classifying between SS and nSS mice. 95% confidence intervals are denoted in the parenthesis. AUC values with the confidence intervals were calculated using pROC 44 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Table S6 . The demographics of the human sera. Patients that are missing the demographic information are omitted while calculating the statistics displayed. For rows next to "Country" and "Sex" the 640 number of patients in each category and its percentage is given. At the final row the median and the IQR of the age is given. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Table S7 . (Uploaded as a standalone file.) The 5-fold-CV AUC values of the 478 classifiers that are searched during the first classifier selection. Highlighted in yellow are the six selected classifiers whose sensitivity and specificity values are reported in Table S2 . The standard deviation of the five folds is denoted with ±. Table S8 . (Uploaded as a standalone file.) The 100-fold-CV performance of the 478 classifiers that are searched during the second classifier selection. Highlighted in yellow are the three classifiers that satisfied the criterion. Columns under "Leave-one-exp-out" and "Four-exp-combined" correspond to the performance in the leave-one-655 experiment-out setting and four-experiments-combined setting respectively. In the leave-oneexperiment-out setting, a different experiment is used as the validation set and the columns D-K denote the specificity and sensitivity values on the four validation sets. Columns L and M denote the average of the specificity and sensitivity values in columns D-K respectively. Columns N and O denote the minimum of the specificity and sensitivity values in columns D-K respectively. In 660 the four-experiments-combined setting, four experiments are combined into a dataset. Columns P-W denote the specificity and sensitivity values calculated using only the samples from a single experiment. Columns X and Y denote the average of the specificity and sensitivity values in columns P-W respectively. Columns Z and AA denote the minimum of the specificity and sensitivity values in columns P-W respectively. Columns AB and AC denote the sensitivity and 665 specificity values computed in the usual way i.e. using all of the samples. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Classification algorithm, hyper-parameter and feature selection details 670 In the first approach, for further analysis, three classifiers with the highest AUC are selected and three classifiers that use a linear decision boundary and a single biomarker which achieved comparable AUCs are selected (Table S2 ). Operating points of the selected classifiers are tuned to achieve >90% sensitivity and >70% specificity. Of those, we selected the classifier with the highest 675 sensitivity whose specificity is >70% specificity after subtracting one std. In the second approach, after selecting the hyper-parameters of the classifiers, for each classifier the operating point that maximizes the experiment-wise sensitivity while achieving at least 70% specificity in each of the experiments is selected. Afterward, we identified the classifiers that achieved both minimum experiment sensitivity ≥90% and specificity ≥70 in the four-experiments 680 combined setting and minimum experiment sensitivity ≥80% and specificity ≥70% in the leaveone-experiment-out setting (Table S3) . Among the classifiers identified the one with the highest minimum experiment sensitivity (93.3%) is selected. 685 In the initial classifier selection, we have already measured the AUC of the 239 biomarker combinations using 5-fold-CV on the training portion of the discovery cohort. To calculate the average percent improvement for a biomarker, first, a subset of the 239 biomarker combinations that includes that biomarker are identified. Then, for each identified biomarker combination, the percent difference between the AUC of that combination and the AUC of the biomarker 690 combination without that biomarker is calculated. Finally, percent differences calculated for each of the identified combinations are averaged to calculate the average percent difference for that biomarker. As the AUC of a biomarker combination, we have used the AUC of the Logistic Regression with L1 regularization and the hyper-parameter with the highest AUC is used. The same methodology is used for the initial classifier selection with the exception that all 1023 combinations of the 10 biomarkers are searched and (0.01, 0.3, 0.5,1.) are searched for the hyperparameters of Gradient Tree Boosting. We have used the Scikit-learn 24 implementation of Logistic Regression with L1 Regularization whose mathematical description is given below. Let " ∈ % denote the feature vector and " ∈ {−1,1} as the label of i'th sample respectively where D is the number of biomarkers in the panel and {−1,1} denote nSS and SS respectively. Let ∈ % denote the weights of the features and 705 ∈ the bias term, the optimal parameters are found by optimizing the following problem. * , * = arg min (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint where ‖. ‖ J denotes the L1 norm. We have treated as a hyper-parameter and applied grid search to select it as described in Materials and Methods. Gradient boosting is a way of combining a set of weak learners. Each stage a weak learner is fit to the "pseudo-residuals" resulting from combinations of the weak learners so far. After fitting the weak learner, its prediction function is added to the combined prediction function after 715 multiplied by a shrinkage factor 45 . Gradient Tree Boosting uses regression trees as weak learners and uses a slight variation of gradient boosting algorithm where after fitting the regression tree to the "pseudo-residuals" the prediction value for each region is updated again 45 . The algorithm without the shrinkage is given in Algorithm 10.3 in 45 . 720 We have treated maximum depth of the trees, the number of weak learners (referred to as the number of trees), and the shrinkage factor (referred to as the learning rate) as hyper-parameters and selected them through grid search as described in Materials and Methods. The remaining hyper-parameters are used as the default values provided in Scikit-learn 24 . 725 All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint Selection of the first classifier. Three classifiers achieved the highest area under the receiver operating characteristics curve 730 (AUC) values: Gradient Tree Boosting with CXCL1, IL-12, MMP8, VEGF, and S100A8; Gradient Tree Boosting with CXCL1 and MMP8; and Gradient Tree Boosting with CXCL1, IL-12, and MMP8. The other three classifiers had comparable AUC values and used a single biomarker with a linear decision boundary: Logistic Regression with CXCL2, CXCL1, or MMP8 (Table S2 ). All six classifiers performed well, with AUCs between 0.95 and 0.98. To refine further, we tuned the 735 operating points to meet the WHO TPP triage test specifications (>90% sensitivity and >70% specificity): Logistic Regression with CXCL1; Logistic Regression with MMP8; and Gradient Tree Boosting with CXCL1, IL-12, MMP8, VEGF, and S100A8. Of those, we selected Logistic Regression with MMP8 for validation, because its sensitivity (94.1%) is higher than that of the Gradient Tree Boosting classifier with five proteins, while its specificity is comparable (87.4%). In the four-experiment combined setting, 21 out of 478 classifiers achieve minimum experiment sensitivity ≥90 and specificity ≥70. In the leave-one-experiment task, 11 out of the 478 classifiers 745 achieved minimum testing sensitivity ≥80% and specificity ≥70%. None of them achieved higher than 90% minimum testing sensitivity. Three classifiers that satisfy both criterions is listed in (Table S3 ). We have observed that the number of classifiers achieved sensitivity ≥90 and specificity ≥70 in each of the experiments is less than to the number of classifiers that have experiment wise sensitivity ≥90 and specificity ≥70. We have also observed that the number of 750 classifiers that achieved overall sensitivity ≥90 and specificity ≥70 is higher than the number of classifiers that achieved the same values using experiment-wise averages. Logistic regression with MMP8 is evaluated in the testing portion of the discovery cohort in four-755 experiments combined setting. It achieved sensitivity values within Experiments 1-4, 100%, 100%, 100%, and 89%, respectively and specificity values are 40%, 100%, 70%, and 95% respectively. In the testing portion of the discovery cohort, Gradient Tree Boosting with CXCL1, CXCL2, TNF, and IL-10 is evaluated in four-experiments combined setting. Its sensitivity values within Experiments 1-4 were 87.5%, 100%, 100%, and 100%, respectively, with specificities of 77.8%, 760 100%, 100%, and 100%, respectively. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.20249024 doi: medRxiv preprint WHO coronavirus disease (COVID-19) dashboard Pathogenesis of post primary tuberculosis: immunity and hypersensitivity in the development of cavities Lung necrosis and neutrophils reflect common pathways of susceptibility to Mycobacterium tuberculosis in genetically diverse, immunecompetent mice Mouse model of necrotic tuberculosis granulomas 470 develops hypoxic lesions Tuberculosis Susceptibility and Vaccine Protection Are Independently Controlled by Host Genotype Greedy function approximation: A gradient boosting machine Body condition scoring: a rapid and accurate method for assessing health status in mice Genetically diverse mice are novel and valuable models of age-associated susceptibility to Mycobacterium tuberculosis Exploration, normalization, and summaries of high density oligonucleotide array probe level data affy--analysis of Affymetrix GeneChip data at the probe level Bioconductor: open software development for 485 computational biology and bioinformatics Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data Quality Assessment for Short Oligonucleotide Microarray Data Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Database resources of the National Center for Biotechnology Information Scikit-learn: Machine Learning in Python The Diversity Outbred Mouse Population Is an Improved Animal Model of Vaccination against Tuberculosis That Reflects Heterogeneity of Protection T cells and adaptive immunity to Mycobacterium tuberculosis in humans The role of IL-10 in immune regulation during M. tuberculosis infection CXCL5-secreting pulmonary epithelial cells drive 505 destructive neutrophilic inflammation in tuberculosis Diagnostic Potential of Novel Salivary Host Biomarkers as Candidates for the Immunological Diagnosis of Tuberculosis Disease and Monitoring of Tuberculosis Treatment Response pROC: an open-source package for R and S+ to 765 analyze and compare ROC curves The Elements of Statistical Learning: Data Mining, Inference, and Prediction We thank Julie Tzipori, Curtis Rich, Donald Girouard, and Sam Telford III for services in the New England Regional Biosafety Laboratory at Tufts University Cummings School of All rights reserved. No reuse allowed without permission.(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.