key: cord-0689012-krypatd3
authors: Peterson, Derick R; Baran, Andrea M; Bhattacharya, Soumyaroop; Branche, Angela R; Croft, Daniel P; Corbett, Anthony M; Walsh, Edward E; Falsey, Ann R; Mariani, Thomas J
title: Gene Expression Risk Scores for COVID-19 Illness Severity
date: 2021-11-30
journal: J Infect Dis
DOI: 10.1093/infdis/jiab568
sha: 63f68caed79283bcb9cb26f1657a409680895775
doc_id: 689012
cord_uid: krypatd3

BACKGROUND: The correlates of COVID-19 illness severity following infection with SARS-Coronavirus 2 (SARS-CoV-2) are incompletely understood.
METHODS: We assessed peripheral blood gene expression in 53 adults with confirmed SARS-CoV-2 infection clinically adjudicated as having mild, moderate or severe disease. Supervised principal components analysis was used to build a weighted gene expression risk score (WGERS) to discriminate between severe and non-severe COVID.
RESULTS: Gene expression patterns in participants with mild and moderate illness were similar, but significantly different from severe illness. When comparing severe versus non-severe illness, we identified >4000 genes differentially expressed (FDR<0.05). Biological pathways increased in severe COVID-19 were associated with platelet activation and coagulation, and those significantly decreased with T cell signaling and differentiation. A WGERS based on 18 genes distinguished severe illness in our training cohort (cross-validated ROC-AUC=0.98), and need for intensive care in an independent cohort (ROC-AUC=0.85). Dichotomizing the WGERS yielded 100% sensitivity and 85% specificity for classifying severe illness in our training cohort, and 84% sensitivity and 74% specificity for defining the need for intensive care in the validation cohort.
CONCLUSION: These data suggest that gene expression classifiers may provide clinical utility as predictors of COVID-19 illness severity.
In December 2019 a novel coronavirus, SARS-CoV-2, was identified in China as a cause of severe pneumonia with explosive human-to-human transmission [1]. Illness due to SARS-CoV-2 has been designated COVID-19, and on March 11, 2020, the World Health Organization officially declared COVID-19 a pandemic. To date there have been over 240 million infections and over 5 million deaths globally due to COVID-19 (source: https://covid19.who.int/). Although most patients experience mild to moderate disease, 5-10% progress to severe or critical illness with severe pneumonia or respiratory failure [2, 3]. Early in the pandemic it became clear that certain underlying chronic medical conditions, and principally age, were key risk factors for severe disease [4, 5]. While severe disease can occur early in illness, a distinct progression to severe illness occurs in some individuals 7-12 days after symptom onset, suggesting transition from a viral phase to an inflammatory phase [6]. In addition, some young individuals without co-morbidities have also developed severe illness, highlighting the incomplete understanding of disease pathogenesis due to SARS-CoV-2 infection [7].

Gene expression provides an unbiased measure of the host response to a pathogen on a cellular level. We and others have previously demonstrated the potential for peripheral blood gene expression patterns to classify the ontogeny and severity of viral respiratory illness [8, 9]. We hypothesized that analysis of gene expression in the blood of patients with SARS-CoV-2-related COVID-19 might help identify those at greatest risk for severe symptoms and in need of intensive care. Gene expression analysis might also identify pathways underlying disease pathogenesis and suggest new targets amenable to potential therapeutic interventions.
Methods

Acute Illness Evaluation: Adults ≥18 years of age, either hospitalized or community recruited, exhibiting COVID-19 symptoms and documented to have SARS-CoV-2 by PCR, were eligible for the study. Participants with immunosuppression or symptom onset greater than 28 days prior to admission were excluded. Hospitalized participants were assessed within 24 hours of admission and outpatients were brought to the clinic within 1-2 days of being identified as SARS-CoV-2 positive. Demographic, clinical, radiographic and laboratory information, date of symptom onset and signs and symptoms of the illness were collected. Medication use was recorded with attention to drugs that may affect transcriptional profiling.

Clinical severity assessment: Severity for COVID-19 participants at enrollment and throughout the illness was assessed using a combination of clinical variables as well as the National Early Warning Score (NEWS) of 7 graded physiological measurements (respiratory rate; oxygen saturation; oxygen supplementation; temperature; blood pressure; heart rate; level of consciousness) [10]. Severe illness was defined as requiring any of the following: ICU care, high flow oxygen, ventilator support, vasopressor support or evidence of new end organ failure. Non-severe illness was defined as illness not meeting severe criteria. In addition, a panel of 4 physicians (3 infectious disease and 1 pulmonary critical care) adjudicated all non-severe illnesses and categorized them as mild or moderate using the NEWS as well as symptoms and physiologic parameters in the context of underlying diseases and baseline oxygen requirements. Participants were followed for the duration of hospitalization and illness, and outcomes were recorded as the highest level of care required or death.
Sample Collection and Processing: Approximately 3 ml of whole blood was collected in a Tempus™ Blood RNA Tube at the time of enrollment and stored at -80°C until the time of processing. The median time from symptom onset to blood collection ranged from 4-9 days, as shown in Table 1. Following centrifugation, RNA was isolated from the pellet using the Tempus Spin RNA Isolation Kit according to the manufacturer's recommended protocol. Total RNA was processed for globin reduction using the GLOBINclear Human Kit as described previously [9].

RNA Sequencing: cDNA libraries were generated using 200 ng of globin-reduced total RNA. Library construction was performed using the TruSeq Stranded mRNA library kit (Illumina, San Diego, CA). cDNA quantity was determined with the Qubit Fluorometer (Life Technologies, Grand Island, NY) and quality was assessed using the Agilent Bioanalyzer 2100 (Agilent, Santa Clara, CA). Libraries were sequenced on the Illumina NovaSeq6000 at a target read depth of ~20 million 1 × 100-bp single-end reads per sample. Sequences were aligned against the human genome version hg38 using the Spliced Transcripts Alignment to a Reference (STAR) algorithm [11], and counts were generated using HTSeq [12]. Raw counts were divided by participant-specific library size (in millions) to yield counts per million (CPM)-normalized expression, borrowing no information across participants, and gene and sample level filtering was performed to remove outlier samples and low expressing genes. Normalized and filtered analytical data sets were log2-transformed (after adding a pseudocount of 1 CPM) prior to analysis. We excluded data from 19,861 genes with uniformly zero reads, leaving a data set comprising 39,225 genes from 53 participants. Finally, we retained genes that had normalized counts exceeding 1 CPM in greater than 14 participants (the smallest class size). This resulted in an analytical dataset of 14,228 CPM-normalized genes.
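The normalization and gene-filtering steps described above can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `cpm_log2_filter`, its parameters, and the use of NumPy are hypothetical choices made for illustration, and the outlier-sample filtering step is omitted.

```python
import numpy as np


def cpm_log2_filter(counts, min_cpm=1.0, min_samples=14, pseudocount=1.0):
    """Normalize raw counts to log2(CPM + pseudocount) and filter genes.

    counts : (n_genes, n_samples) array of raw read counts
    Returns the filtered log2 matrix and the row indices of retained genes.
    """
    counts = np.asarray(counts, dtype=float)
    # Library size per participant, in millions of reads.
    lib_size_millions = counts.sum(axis=0) / 1e6
    # Per-sample scaling only: no information is shared across participants.
    cpm = counts / lib_size_millions
    # Drop genes with uniformly zero reads, then require CPM > min_cpm
    # in more than min_samples participants (14 = smallest class size).
    expressed = counts.sum(axis=1) > 0
    keep = expressed & ((cpm > min_cpm).sum(axis=1) > min_samples)
    kept_idx = np.flatnonzero(keep)
    # Log-transform after adding a pseudocount of 1 CPM.
    return np.log2(cpm[keep] + pseudocount), kept_idx
```

Because each column is scaled only by its own library size, adding or removing participants does not change any other participant's normalized values. The sketch assumes every sample has a nonzero library size.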
The raw sequence and normalized data are currently being deposited to dbGAP (https://www.ncbi.nlm.nih.gov/gap/). The accession number for this data series will be provided as soon as the submission is approved. In the meantime, summary counts data will be available upon request to the corresponding author(s).

Continuous clinical variables were compared by COVID severity levels using the nonparametric Kruskal-Wallis test, and binary variables by Fisher's exact test. Differential expression by COVID-19 severity was assessed using the nonparametric Wilcoxon rank sum test. To allow adjustment for important clinical covariates, we fit semiparametric Cox proportional hazards models for normalized gene expression as a function of severe vs non-severe COVID-19, adjusted for race, sex, BMI, days since symptom onset, and library size. The Benjamini-Hochberg procedure was used to control the False Discovery Rate (FDR). Pathway analysis of significantly differentially expressed genes was performed using ENRICHR [13].

While negative binomial regression is a useful generalized linear model (GLM) for the log of the mean of raw count data, assuming the variance of the count is a quadratic function of its mean, we used a more flexible semiparametric Cox model, which can be viewed as a GLM with complementary log-log link. The Cox model makes no assumption about the shape of the distribution of (normalized) counts, and it is invariant to monotonic transformations of the response since it depends only on the ranks. The familiar nonparametric rank-based Wilcoxon and Kruskal-Wallis tests we used are closely related to the logrank test, a special case of the Cox model.

To perform an independent validation of our risk score, we used a dataset from Overmyer et al. [14], which had a different definition of severity in the outcome (ICU vs non-ICU), and used a different normalization for the gene expression data (TPM).
Of the 18 genes used in our risk score, 2 were missing in the validation data. We imputed data for these 2 genes via multiple linear regression with coefficients estimated by regressing each on the 16 nonmissing CPM-normalized log gene expression values in the training data. We standardized the TPM-normalized validation gene expression data using means and SDs estimated from the training data, and then applied the risk score coefficients from the training data to construct a risk score for each validation subject. Apparent miscalibration required choosing a different WGERS threshold for the validation data due to gene expression measures being generally lower in the validation data compared to the training data. An ROC curve with associated AUC was used to assess the performance of the risk score in the validation data.

Between April 30th and June 29th 2020, 58 participants with PCR-documented COVID-19 illnesses were enrolled from inpatient and outpatient settings. Of these, 3 participants did not have blood samples collected and 2 did not meet inclusion criteria, leaving 53 participants for RNA sequencing analysis. Illnesses were adjudicated as 20 severe and 33 non-severe (14 mild and 19 moderate). This categorization was consistent with the severity separation in the NEWS (Supplemental Figure S1). Two severely ill participants received one dose of Remdesivir prior to blood collection. No subject received steroids or any other experimental COVID-19 treatment prior to enrollment. Five hospitalized participants had rapidly progressive hypoxemia and hemodynamic instability after enrollment and required transfer to intensive care, and 3 subsequently were mechanically ventilated. No mild outpatient illness progressed in severity to require medical attention.
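The validation scoring procedure described in the Methods (regression imputation of the missing genes, standardization with training means and SDs, application of the training weights, then ROC analysis) could be sketched roughly as follows. This is an illustrative sketch only: all function names are hypothetical, the actual WGERS gene weights are not reproduced here, and the convention that scores above the threshold are called severe is an assumption.

```python
import numpy as np


def fit_imputation(train, missing):
    """Fit linear regressions of each missing gene on the observed genes.

    train   : (n_samples, n_genes) training log-expression matrix
    missing : column indices of genes absent from the validation data
    Returns the observed column indices and an (n_observed + 1, n_missing)
    coefficient matrix whose first row holds the intercepts.
    """
    observed = [j for j in range(train.shape[1]) if j not in missing]
    design = np.column_stack([np.ones(train.shape[0]), train[:, observed]])
    coef, *_ = np.linalg.lstsq(design, train[:, missing], rcond=None)
    return observed, coef


def impute(valid_observed, coef):
    """Predict the missing genes from the observed validation genes."""
    design = np.column_stack([np.ones(valid_observed.shape[0]), valid_observed])
    return design @ coef


def risk_score(expr, train_mean, train_sd, weights):
    """Standardize with training means/SDs, then apply training weights."""
    return ((expr - train_mean) / train_sd) @ weights


def roc_auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 1/2."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when scores above the threshold are called severe."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    called = scores > threshold
    return float(called[labels].mean()), float((~called[~labels]).mean())
```

The pairwise form of `roc_auc` is the Wilcoxon rank sum (Mann-Whitney) statistic divided by the number of case-control pairs, so it depends only on the ranks of the scores. A shift in expression scale between cohorts therefore leaves the AUC unchanged while still requiring a cohort-specific threshold for `sens_spec`.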
There was insufficient evidence of any difference in demographic characteristics or underlying conditions by disease severity, except for race and time from disease onset (Table 1): white non-Hispanic participants comprised 93% of mild vs 50-58% of moderate and severe cases (p=0.02), and median time from symptom onset to enrollment was 4 days among mild, 9 days among moderate, and 6.5 days among severe (p=0.047, due to heterogeneity among non-severe). The median age of participants was 62 years, and 53% were male. As expected, dyspnea, hypoxemia, the presence of infiltrates, and use of supplemental oxygen were more common in moderate and severe illness than in mild illness. All severely ill patients required intensive care; 15 were enrolled in the ICU and 5 were moved to the ICU within 48 hours of blood sampling. All severely ill participants required supplemental oxygen; 12 (60%) were mechanically ventilated, one was supported with ECMO and survived, 13 (65%) required vasopressor support and one subject died. Median NEWS differed among the 3 groups (Figure S1). Inflammatory markers were not available for most outpatients but were notably elevated in those hospitalized with moderate to severe disease (Table 1).

Blood gene expression profiling from SARS-CoV-2 positive cases (n=53) was completed by standard mRNA sequencing (RNAseq) of globin mRNA-reduced RNA isolated from whole blood at the time of recruitment. On average 58 ± 6 million reads were generated from each of the cDNA libraries, with a mapping rate of 94.2 ± 0.6% and transcriptome coverage of 41.3 ± 1.3% (Supplemental Figure S2). Exploratory Principal Components Analysis suggested that participants with mild and moderate illness might share similar patterns of gene expression, which appeared distinct from those of participants with severe illness (Figure 1A).
Statistical analysis for differential gene expression confirmed significant differences when comparing mild vs severe, and moderate vs severe, but not mild vs moderate COVID (Figure 1B). We next tested for differences in gene expression when comparing participants with severe (n=20) vs non-severe illness (n=33), pooling the 14 mild and 19 moderate cases. We tested for differential gene expression without (univariate) and with adjustment for variables potentially associated with severe outcome (race, sex, BMI), the number of days since onset of symptoms, and library size (Figure 1C and Supplemental Table S1). These analyses identified 6483 (46% of tested) and 8435 (59% of tested) differentially expressed genes, with and without multivariate adjustment, respectively.

We performed ontology analysis for the 6483 genes identified as differentially expressed in severe COVID illness, focusing on the fully adjusted analysis (Figure 2). This analysis identified 74 pathways over-represented by genes (n=936) significantly upregulated in severe COVID, and 25 pathways over-represented by genes (n=5547) significantly downregulated (Figure 2 and Supplemental Table S2). Activated pathways included a number associated with infectious diseases as well as TNFα and NFkB signaling. Notably, there was also evidence for significant upregulation of genes associated with platelet activation and coagulation. Among pathways associated with downregulated genes in severe COVID were multiple pathways involved in general host RNA metabolism as well as multiple pathways specifically associated with T cell regulation, including Th2 and Th17 differentiation. The most significantly downregulated pathway was associated with HSV1 infection.

Given the substantial number of differentially expressed genes when comparing severe vs non-severe COVID, we investigated the ability of gene expression patterns to discriminate severe illness.
Gene-specific thresholds for univariate AUC and magnitude change were chosen via the cross-validation procedure and used to produce an 18 gene weighted gene expression risk score (WGERS) for severe illness. Nested cross-validation was used to estimate performance via the stratified AUC (CV-AUC=0.98). The pooled CV-AUC of 0.93 corresponds with a cross-validated ROC curve that graphically summarizes performance (Figure 3A). The pooled CV-ROC curve was also used to select a risk score threshold (-1.04) with 95% sensitivity and 88% specificity, which corresponded with apparent (non-cross-validated) sensitivity of 100%, specificity of 85%, and error rate of 9% (5/53), represented via the WGERS distributions for the training data (Figure 4A). All 5 misclassified participants had moderate illness (Figure 4B).

We next identified an independent validation data set describing peripheral blood-based gene expression profiling of COVID subjects who were either admitted (n=50) or not admitted (n=50) to the ICU due to the severity of their acute illness [14]. Our 18 gene WGERS discriminated between ICU and non-ICU patients with an AUC of 0.85, and thresholding at 1.77 yielded 84% sensitivity and 74% specificity (Figures 3B and 4C). Furthermore, all 18 genes selected in the training data were differentially expressed (FDR < 0.01) in the validation data (Supplemental Table S3).

asymptomatic, respiratory illness to severe pneumonia with multisystem failure and death. Although measurements of inflammatory markers such as C-reactive protein and serum IL-6 levels are often associated with worse disease, their use to predict poor outcomes is imperfect [15-17]. Viral characteristics, such as shedding kinetics or gene sequence variation, are not reliable predictors of clinical outcome [18, 19].
Genome-wide expression profiling, a powerful and unbiased tool, can be used for multiple purposes such as relating activation or suppression of molecular pathways to clinical manifestations of disease, identification of biomarkers that may allow individual prediction of disease severity, and identification of novel gene targets for therapeutic intervention. Early predictors to identify patients who will decompensate following SARS-CoV-2 infection would be highly impactful. The goal of our study is to use gene expression analysis to identify peripheral markers and pathways associated with COVID-19 severity, which may serve as predictors of disease severity and potential therapeutic targets.

In this study of 53 SARS-CoV-2 infected adults with illness ranging from very mild upper respiratory infection to acute respiratory failure, we identified >6,000 differentially expressed genes (DEGs) (FDR < 0.05) between severe and non-severe illness. The vast majority (85%) of DEGs were under-expressed, most notably with a marked effect on lymphocytes and altered function [20, 21]. Pathway analysis revealed inhibition of Th1, Th2 and Th17 cell differentiation, as well as inhibition of the T cell receptor signaling pathway. These effects are likely related to the marked lymphopenia and poor adaptive immune response in persons with severe SARS-CoV-2 infection [22]. Also notable in severe illness is the inhibition of the mRNA surveillance pathways, which include the nonsense-mediated mRNA decay pathway that can degrade viral mRNA. Using a model coronavirus, murine hepatitis virus, Wada and colleagues showed that viral transcription is enhanced by blocking this host cell pathway, an effect demonstrated to be mediated by the viral nucleocapsid protein [23].

Several activated pathways we identified in our studies are worth comment, given what is already known about SARS-CoV-2 and COVID-19.
Activation of the NF-kappa B and TNF signaling pathways in a setting of heightened inflammatory process is not surprising. Activation of the platelet, complement, and coagulation cascade pathways is also expected, given the characteristic hypercoagulable state that has been observed in severe illness [24]. Thrombocytopenia and activated platelets are associated with the high incidence of venous and arterial clotting, while elevated levels of serum D-dimer, a fibrin degradation product, and increased INR are all features of severe COVID-19 [25]. It is interesting that the infection-related pathways most significantly activated include those principally associated with intracellular bacterial (legionella, mycobacterial) and parasitic (toxoplasma, leishmania and trypanosome [Chagas]) infections. These infections are associated with marked activation of macrophages, and thus may be consistent with activation of the osteoclast differentiation pathway, as osteoclasts and macrophages have many similarities [26, 27].

Although our study was not designed to identify and validate early predictors of severe disease, the data do offer a first step. Using gene expression data we were able to develop and validate an 18 gene signature for severe disease (fully concordant with requiring ICU care), with an AUC of 0.85, 84% sensitivity, and 74% specificity in an independent validation data set. In a recent paper, Guardela et al. assessed the utility of blood transcript levels of 50 genes known to predict mortality in Idiopathic Pulmonary Fibrosis patients to classify illness severity in COVID-19 [31]. A discovery cohort of eight subjects was used, and the profile was then validated using a publicly available data set of 128 subjects [14]. The gene expression risk profile discriminated ICU admission, need for mechanical ventilation, and in-hospital mortality with an AUC of 77%, 75%, and 74%, respectively (p < 0.001) in a COVID-19 validation cohort.
Our current study has several limitations which are worth noting, including its relatively small sample size, the non-standardized interval between symptom onset and sample collection,

References

A Novel Coronavirus from Patients with Pneumonia in China
Clinical Characteristics of Coronavirus Disease 2019 in China
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
Characteristics of and Important Lessons From the Coronavirus Disease 2019 (COVID-19) Outbreak in China: Summary of a Report of 72314 Cases From the Chinese Center for Disease Control and Prevention
Population risk factors for severe disease and mortality in COVID-19: A global systematic review and meta-analysis
COVID-19 illness in native and immunosuppressed states: A clinical-therapeutic staging proposal
Characteristics of Adults Aged 18-49 Years Without Underlying Conditions Hospitalized With Laboratory-Confirmed Coronavirus Disease 2019 in the United States: COVID-NET
Airway gene-expression classifiers for respiratory syncytial virus (RSV) disease severity in infants
Discriminate Bacterial from Nonbacterial Infection in Adults Hospitalized with Respiratory Illness
Use of the first National Early Warning Score recorded within 24 hours of admission to estimate the risk of in-hospital mortality in unplanned COVID-19 patients: a retrospective cohort study
STAR: ultrafast universal RNA-seq aligner
HTSeq--a Python framework to work with high-throughput sequencing data
Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool
Large-Scale Multi-omic Analysis of COVID-19
Clinical Characterization and Prediction of Clinical Severity of SARS-CoV-2 Infection Among US Adults Using Data From the US National COVID Cohort Collaborative
Primary Care Relevant Risk Factors for Adverse Outcomes in Patients With COVID-19 Infection: A Systematic Review
Baricitinib plus Remdesivir for Hospitalized Adults with Covid-19
COVID-19 viral load not associated with disease severity: findings from a retrospective cohort study
Viral dynamics in mild and severe cases of COVID-19
Cytotoxic cell populations developed during treatment with tyrosine kinase inhibitors protect autologous CD4+ T cells from HIV-1 infection
T cell immunity to SARS-CoV-2 following natural infection and vaccination
Antigen-Specific Adaptive Immunity to SARS-CoV-2 in Acute COVID-19 and Associations with Age and Disease Severity
Interplay between coronavirus, a cytoplasmic RNA virus, and nonsense-mediated mRNA decay pathway
Collagen coatings reduce the incidence of capsule contracture around soft silicone rubber implants in animals
COVID-19: Thrombosis, thromboinflammation, and anticoagulation considerations
Tuberculosis and the art of macrophage manipulation
Macrophages as host, effector and immunoregulatory cells in leishmaniasis: Impact of tissue micro-environment and metabolism
Host transcriptomic profiling of COVID-19 patients with mild, moderate, and severe clinical outcomes
A blood RNA transcriptome signature for COVID-19
Downregulated Gene Expression Spectrum and Immune Responses Changed During the Disease Progression in Patients With COVID-19
Baricitinib as potential treatment for 2019-nCoV acute respiratory disease