key: cord-0254516-qiogakwo authors: Albright, J.; Mick, E.; Sanchez-Guerrero, E.; Kamm, J.; Mitchell, A.; Detweiler, A. M.; Neff, N.; Tsitsiklis, A.; Hayakawa Serpa, P.; Ratnasiri, K.; Havlir, D.; Kistler, A.; DeRisi, J.; Pisco, A. O.; Langelier, C. title: A 2-Gene Host Signature for Improved Accuracy of COVID-19 Diagnosis Agnostic to Viral Variants date: 2022-01-07 journal: nan DOI: 10.1101/2022.01.06.21268498 sha: 328e8ce61884bcda704113e1187d68e0a2010952 doc_id: 254516 cord_uid: qiogakwo The continued emergence of SARS-CoV-2 variants is one of several factors that may cause false negative viral PCR test results. Such tests are also susceptible to false positive results due to trace contamination from high viral titer samples. Host immune response markers provide an orthogonal indication of infection that can mitigate these concerns when combined with direct viral detection. Here, we leverage nasopharyngeal swab RNA-seq data from patients with COVID-19, other viral acute respiratory illnesses and non-viral conditions (n=318) to develop support vector machine classifiers that rely on a parsimonious 2-gene host signature to predict COVID-19. Optimal classifiers achieve an area under the receiver operating characteristic curve (AUC) greater than 0.9 when evaluated on an independent RNA-seq cohort (n=553). We show that a classifier relying on a single interferon-stimulated gene, such as IFI6 or IFI44, measured in RT-qPCR assays (n=144) achieves AUC values as high as 0.88. Addition of a second gene, such as GBP5, significantly improves the specificity compared to other respiratory viruses. The performance of a clinically practical 2-gene RT-qPCR classifier is robust across common SARS-CoV-2 variants, including Omicron, and is unaffected by cross-contamination, demonstrating its utility for improving accuracy of COVID-19 diagnostics. 
The COVID-19 pandemic has inflicted unprecedented human health consequences, with millions of deaths reported worldwide since December 2019 [1]. [...]

[...] used the selected genes as features, calculated using 5-fold cross-validation within the training set (Figure 1a). Thus, a first gene was selected to maximize the AUC it achieved, and a second gene was selected to maximize the AUC when combined with the first gene. Table 1a lists nine combinations composed of each of the three best 'first' genes and their respective three best 'second' genes. The 'first' genes in the top combinations were the interferon-stimulated genes (ISGs) IFI6, IFI44L and HERC6, which we previously showed are strongly induced in COVID-19 [10]. Most of the 'second' genes were also related to immune and inflammatory processes.

The performance of the nine 2-gene combinations on previously unseen data was estimated by: i) 10,000 rounds of 5-fold cross-validation within the training set, ii) 10,000 rounds of 5-fold cross-validation within the testing set, or iii) training on the training set and prediction on the testing set (Table 1a). Using the third approach, we observed AUC values as high as 0.93 (Figure 1b). We further validated the classifiers using an external, independently generated and quantified NP swab RNA-seq dataset from a cohort of n=553 patients in New York (166 with COVID-19, 79 with other viral ARIs, 308 with non-viral conditions) [12] (Supp. Table 1; Supp. Data [...]

[...] this ISG was typically higher in other viral ARIs (Figure 1c). Given this pattern, and because our ultimate goal was a qPCR assay in which small effect sizes are more difficult to discern, we refined our candidate genes by also considering the expression fold-change between COVID-19 and the two other patient groups.
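The two-step gene selection described earlier in this section (choose the single gene with the best cross-validated AUC, then the partner gene that maximizes AUC in combination with it) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: a nearest-centroid score stands in for the SVM so that the example needs only the standard library, and the gene names, toy expression values and helper functions (`auc`, `stratified_folds`, `cv_auc`, `greedy_two_genes`) are all hypothetical.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(train_X, train_y, test_X):
    """Stand-in for the SVM (assumption): higher score = closer to the positive centroid."""
    dims = range(len(train_X[0]))

    def centroid(label):
        rows = [x for x, l in zip(train_X, train_y) if l == label]
        return [sum(r[d] for r in rows) / len(rows) for d in dims]

    def sqdist(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    c_pos, c_neg = centroid(1), centroid(0)
    return [sqdist(x, c_neg) - sqdist(x, c_pos) for x in test_X]


def stratified_folds(labels, k, seed):
    """Split sample indices into k folds, each holding both classes."""
    rng = random.Random(seed)
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    return [pos[i::k] + neg[i::k] for i in range(k)]


def cv_auc(data, labels, genes, k=5, seed=0):
    """Mean held-out AUC over k folds, using only the listed genes as features."""
    fold_aucs = []
    for fold in stratified_folds(labels, k, seed):
        held = set(fold)
        train = [i for i in range(len(labels)) if i not in held]
        train_X = [[data[g][i] for g in genes] for i in train]
        test_X = [[data[g][i] for g in genes] for i in fold]
        scores = centroid_scores(train_X, [labels[i] for i in train], test_X)
        fold_aucs.append(auc(scores, [labels[i] for i in fold]))
    return sum(fold_aucs) / k


def greedy_two_genes(data, labels):
    """Pick the best single gene, then the best partner for that gene."""
    first = max(data, key=lambda g: cv_auc(data, labels, [g]))
    partners = [g for g in data if g != first]
    second = max(partners, key=lambda g: cv_auc(data, labels, [first, g]))
    return first, second


# Hypothetical toy data: 20 positive samples followed by 20 negative samples.
toy_labels = [1] * 20 + [0] * 20
toy = {
    "GENE_A": [5.0 + 0.1 * i for i in range(20)] + [1.0 + 0.1 * i for i in range(20)],
    "GENE_B": [float(i % 4) for i in range(40)],  # carries no class information
    "GENE_C": [2.0 + 0.5 * (i % 5) for i in range(20)] + [1.0 + 0.5 * (i % 5) for i in range(20)],
}
first, second = greedy_two_genes(toy, toy_labels)
print(first, second)
```

Because the search is greedy, it evaluates each gene once and then each remaining gene once as a partner, rather than scoring every possible pair. The trade-off is that a pair whose members are individually weak but jointly strong can be missed.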
[...] fold-change of that gene between the COVID-19 and non-viral samples, where both measures were averaged between the full UCSF cohort and the New York cohort (Figure 1d; Supp. Data File 2). As expected, several ISGs exhibited equivalently robust predictive value as well as substantial fold-changes (log2 FC ~2-4) that should be readily detectable by qPCR. We then [...]

[...] suggested they were mostly non-viral. All four genes were able to clearly separate the majority of samples with or without COVID-19 in the qPCR data (Figure 2a). SVM classifiers relying on single [...]

medRxiv preprint; this version posted January 7, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.

Classifiers that combined one of these genes with a 'first' gene achieved near perfect separation of the COVID-19 and other viral samples (Figure 2d). This performance is likely overly optimistic, due in part to the relatively small size of the other virus group in the qPCR data, but it is overall consistent with the performance observed in the larger RNA-seq datasets. These results demonstrate that 2-gene signatures can successfully predict COVID-19 status from qPCR data.

Host signatures are robust to SARS-CoV-2 variants and laboratory cross-contamination

We next assessed whether a 2-gene host classifier was robust across SARS-CoV-2 variants, which could conceivably yield an altered host response and/or harbor mutations that [...]

[...] contemporaneously in the laboratory [9]. To examine whether the IFI6+GBP5 host classifier would be affected in such cross-contamination events, we spiked extracted NP swab RNA from a sample with very high SARS-CoV-2 viral load (Ct ≈ 12) into n=7 COVID-19 negative swab specimens at a dilution of 1:10^5, which would be expected to yield a positive viral PCR with Ct < 30.
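The expected Ct of the spiked specimens follows from simple arithmetic, sketched below. The calculation assumes ideal PCR efficiency (the template doubles every cycle), which the text does not state; the function name `expected_ct` and its `efficiency` parameter are hypothetical and serve only as illustration.

```python
import math


def expected_ct(source_ct, dilution_factor, efficiency=1.0):
    """Ct after diluting a sample, assuming (1 + efficiency)-fold amplification per cycle.

    efficiency=1.0 means perfect doubling each cycle (an idealizing assumption).
    """
    extra_cycles = math.log2(dilution_factor) / math.log2(1.0 + efficiency)
    return source_ct + extra_cycles


# Source sample with Ct of about 12, diluted 1:10^5 as in the spike-in experiment.
ct_after_spike = expected_ct(12.0, 1e5)
print(round(ct_after_spike, 1))  # about 28.6, i.e. below the Ct < 30 mark
```

Under this assumption a 10^5-fold dilution costs log2(10^5) ≈ 16.6 additional cycles, so a Ct ≈ 12 source would contaminate a negative specimen to roughly Ct 28.6, consistent with the stated expectation of a positive viral PCR with Ct < 30.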
The probability of COVID-19 estimated by the host classifier was not significantly affected in the [...]

[...] targets is likely to improve diagnostic accuracy; however, a prospective assessment using clinically confirmed false-positive and false-negative viral tests is needed. Moreover, our classifier [...]

The UCSF cohort used to develop the RNA-seq classifiers was initially described in our study applying metagenomic sequencing to NP swabs from adult patients tested for COVID-19 [...] samples were assigned to one of three viral status groups: 1) samples with a positive clinical RT-PCR test for SARS-CoV-2 were assigned to the "COVID-19" group, 2) samples with another pathogenic respiratory virus detected by the ID-Seq pipeline [18] in the metagenomic sequencing data were assigned to the "other virus" group, and 3) remaining samples were assigned to the [...]

[...] with additional swabs collected, sequenced and analyzed in the same manner.

To rigorously assess the performance of the SVM 2-gene models, we employed three approaches: (1) running 10,000 rounds of 5-fold cross-validation on the UCSF 70% training set and calculating the average AUC and standard deviation, (2) running 10,000 rounds of 5-fold cross-validation on the UCSF 30% testing set and calculating the average AUC and standard deviation, and (3) training each model on the UCSF 70% training set and testing it on the 30% testing set to generate an AUC score (Table 1a).
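The repeated cross-validation in approaches (1) and (2) can be sketched as below: each round reshuffles the samples into five folds, the per-round AUC is the mean over folds, and the reported figures are the mean and standard deviation across rounds. This is a hedged, standard-library-only illustration rather than the authors' pipeline. It uses a single hypothetical gene, a trivial stand-in model that learns only the direction of the expression difference in place of the SVM, invented toy values, and 100 rounds by default instead of the 10,000 used in the study.

```python
import random
import statistics


def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_round_cv_auc(values, labels, k, rng):
    """One round of stratified k-fold CV; returns the mean held-out AUC."""
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [pos[i::k] + neg[i::k] for i in range(k)]
    fold_aucs = []
    for fold in folds:
        held = set(fold)
        train = [i for i in range(len(labels)) if i not in held]
        # Stand-in "training" (assumption): learn whether positives run high or low.
        mean_pos = statistics.mean(values[i] for i in train if labels[i] == 1)
        mean_neg = statistics.mean(values[i] for i in train if labels[i] == 0)
        sign = 1.0 if mean_pos >= mean_neg else -1.0
        scores = [sign * values[i] for i in fold]
        fold_aucs.append(auc(scores, [labels[i] for i in fold]))
    return statistics.mean(fold_aucs)


def repeated_cv(values, labels, rounds=100, k=5, seed=0):
    """Mean and standard deviation of the per-round AUC over many reshuffled rounds."""
    rng = random.Random(seed)
    per_round = [one_round_cv_auc(values, labels, k, rng) for _ in range(rounds)]
    return statistics.mean(per_round), statistics.stdev(per_round)


# Hypothetical toy expression values with partial overlap between the classes.
toy_labels = [1] * 20 + [0] * 20
toy_values = [3.0 + 0.2 * i for i in range(20)] + [1.0 + 0.2 * i for i in range(20)]
mean_auc, sd_auc = repeated_cv(toy_values, toy_labels)
print(f"AUC = {mean_auc:.3f} +/- {sd_auc:.3f}")
```

Repeating the fold assignment many times averages out the luck of any single split, which matters most for small sets such as a 30% hold-out.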
We then validated the 2-gene models on the external New York cohort, using two approaches: (1) running 10,000 rounds of 5-fold cross-validation on the New York cohort and calculating the average AUC and standard deviation, and (2) training each model on the UCSF 70% training set and testing it on the New York cohort to generate an AUC score (Table 1b). [...]

Performance of SVM classifiers to distinguish between the samples with (n=72) and without (n=72) COVID-19 (Figure 2b), or to distinguish between the samples with COVID-19 and other viral ARIs (n=17) (Figure 2d), was assessed by 5-fold cross-validation.

The IFI6+GBP5 classifier, which was used to predict the COVID-19 status of variant samples (Figure 2e) and of samples that had been purposely contaminated with a 1:10^5 dilution from a high SARS-CoV-2 viral load sample (Figure 2f), was trained on the set of samples with [...]

World Health Organization. WHO COVID-19 Dashboard.
Identification of a Polymorphism in the N Gene of SARS-CoV-2 That Adversely Impacts Detection by Reverse Transcription-PCR.
World Health Organization. The top 10 causes of death.