key: cord-0846434-biqf1uro
authors: Cano-Gamez, E.; Burnham, K. L.; Goh, C.; Malick, Z. H.; Kwok, A.; Smith, D. A.; Peters-Sengers, H.; Antcliffe, D.; GAinS Investigators,; McKechnie, S.; Scicluna, B. P.; van der Poll, T.; Gordon, A. C.; Hinds, C. J.; Davenport, E. E.; Knight, J. C.
title: An immune dysfunction score for stratification of patients with acute infection based on whole blood gene expression
date: 2022-03-18
journal: nan
DOI: 10.1101/2022.03.17.22272427
sha: b01d102d4d39432a185b1437cfefbbcc1c71b178
doc_id: 846434
cord_uid: biqf1uro

Dysregulated host responses to infection can lead to organ dysfunction and sepsis, causing millions of deaths globally each year. To alleviate this burden, improved prognostication and biomarkers of response are urgently needed. We investigated the use of whole blood transcriptomics for stratification of patients with severe infection by integrating data from 3,149 samples of sepsis patients and healthy individuals into a gene expression reference map. We used this map to derive a quantitative sepsis response signature (SRSq) score reflective of immune dysfunction and predictive of clinical outcomes, which can be estimated using a 19-gene signature. Finally, we built a machine learning framework, SepstratifieR, to deploy SRSq in sepsis, H1N1 influenza, and COVID-19, demonstrating clinically relevant stratification across diseases and revealing the physiological alterations linking immune dysregulation to mortality. Our method enables early identification of individuals with dysfunctional immune profiles, thus bringing us closer to precision medicine in infection.

Infectious diseases result in considerable global morbidity and mortality. 1 Notably, the recent H1N1 influenza 2 and SARS-CoV-2 3 pandemics illustrate how individuals without significant risk factors can still develop critical illness following infection. In extreme cases, this can lead to sepsis, a dysregulated host response accompanied by major organ dysfunction 4 which accounted for 11 million deaths in 2017 alone. 5 This highlights the importance of better We previously proposed a 7-gene signature predictive of SRS. 7, 8 We now asked whether this signature was applicable to RNA-seq by assessing the contribution of these genes to CC1 ( Figure 1B) . We observed significant contributions for 6 out of 7 genes (Figure 1C) , demonstrating this signature is relevant to both microarray and RNA-seq. To assess if the signature can also be detected with rapid turn-around methods, we used quantitative reverse transcription polymerase chain reaction (qRT-PCR) to profile these genes in 115 samples from the GAinS study 7 , of which 72 and 94 also had microarray and RNA-seq measurements, respectively (Methods and Supplementary Table 1) . We observed a very high level of agreement between profiling methods (Figure 1D) , suggesting that the signature could potentially be used for qRT-PCR-based point-of-care testing.

An important limitation of our signature is its bias towards genes with higher expression in SRS2, with only one SRS1-associated gene (TDRD9). We reasoned that including more genes could make predictions more resilient. We combined our signature with a set of genes ranked amongst the top 1% with highest CC1 contribution ( Figure 1C and Methods), which are expected to be robust to technological variation. This resulted in twelve additional genes, all of which showed comparable expression to the original signature (Supplementary Figure   1A) . 7 We henceforth refer to the 7-gene set as the Davenport signature and to the 19-gene set as the Extended signature.

We used these signatures to construct cross-platform reference maps of gene expression in sepsis. We compiled data from 1,044 patients in the GAinS study, profiled with up to three platforms, and integrated them based on our signatures. To make reference maps representative of a wider patient population, we also included three cohorts of healthy individuals (Supplementary Table 1 ). Integration was performed using mutual nearest neighbours (mNN), a method borrowed from single-cell omics which matches samples in one batch to their nearest neighbours in other batches (Methods). 16 This resulted in two reference maps: the Davenport map, containing 3,264 samples (1,413 sepsis and 1,609 healthy), seven genes, and three modalities; and the Extended map, containing 3,149 samples (1,406 sepsis and 1,609 healthy), 19 genes, and two modalities. Samples in both reference maps clustered by endotype rather than technology, with the main axis of variation corresponding to a separation into healthy volunteer, SRS1, and SRS2 groups ( Figure 1E) . Thus, our reference maps capture a wide spectrum of transcriptional variation spanning health and critical illness.

We then used these signatures to predict endotypes. We split our reference maps into training (n=909) and test (n=2,355) sets, which we used to train random forest classifiers for each . CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint signature (Methods). Training sets contained all patient samples taken at ICU admission and for which SRS membership was known 7,8 (n=639) , as well as 270 randomly selected healthy volunteer samples, with all remaining samples allocated to the test set. Healthy volunteers were assigned to a newly defined SRS3 group. SRS3 is designed to capture healthy individuals, as well as patients in the low severity/recovery spectrum (i.e. transcriptionally closer to health).

Cross-validation revealed high accuracy across all endotype classes (AUROCs > 0.97; Figure   2A ). This was recapitulated in the test set, with significant agreement between our predictions and previously proposed SRS labels for these samples. 7, 8 The Extended signature (Cohen's Kappa = 0.91) marginally outperformed the Davenport signature (Cohen's Kappa = 0.88).

However, both reached a consensus for the majority of samples (97% and 84% agreement in microarray and RNA-seq, respectively; Supplementary Figure 1B ). Predictions were consistent across technologies, with most samples assigned to the same SRS group regardless of profiling platform ( Figure 2B) . Thus, our models simultaneously achieve high accuracy and cross-technology applicability.

We next assessed if our models recapitulated known gene expression differences between endotypes. Differential expression analysis (Methods) revealed upregulation of neutrophil and innate immunity genes (e.g. MMP8, GPR84, and CD177) in SRS1, as well as downregulation of T cell function and antigen presentation genes (e.g. CD27, CD6, CCR3, and HLA class II molecules; Figure 2C -D). The top SRS1-associated pathways were Toll-like receptor (TLR) signalling, cytokine production, and glycolysis. In contrast, SRS2 was associated with T cell receptor (TCR) engagement, CD28-costimulation, and IFN signalling ( Supplementary   Figure 2A) . This was supported by decreased lymphocyte and increased polymorphonuclear cell counts in SRS1 ( Figure 2E ). These observations agree with previous literature, demonstrating that our models distinguish biologically relevant molecular profiles. [7] [8] [9] Finally, we asked whether SRS groups differed in clinical outcome. SRS1 patients showed higher Sequential Organ Failure Assessment (SOFA) scores, indicative of more severe organ dysfunction ( Figure 2F ). In the RNA-seq cohort, this was accompanied by increased Acute Physiology and Chronic Health Evaluation (APACHE) II scores (Supplementary Figure 2B) .

Survival analysis (Methods) further revealed that SRS1 patients are at increased risk of 28day mortality (Figure 2G) , demonstrating that our models can predict clinical outcomes.

. CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ;  https://doi.org/10.1101/2022.03.17.22272427 doi: medRxiv preprint (horizontal bars) or whether there is agreement between labels derived from multiple modalities (vertical bars). Grey vertical bars indicate samples for which only one data modality was available. The colour scale in the heatmap indicates the overlap between two categories, as estimated using Jaccard indexes. C) Volcano plot showing differentially expressed genes between SRS1 and SRS2 groups. Each dot corresponds to a gene, for which its log-fold change (X axis) and log10 p value (Y axis) are indicated. Red dots represent genes significantly upregulated in the SRS1 endotype, with blue dots representing genes upregulated in the SRS2 group. Grey dots represent non-significant genes. Gene names were added to a subset of significantly differentially expressed genes of immune relevance. D) Correlation between log-fold changes between SRS1 and SRS2 estimated from microarray (X axis) and RNA-seq (Y axis) data. Each dot represents a gene. The identity line (red) is shown as a reference. Cor = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. E) Box plots comparing cell counts between SRS groups in the microarray (top) and RNA-seq (bottom) GAinS cohorts. Each dot represents a sample from a sepsis patient, colour coded by sepsis endotype. Each patient may contribute up to three samples, obtained at different times following ICU-admission. Box plots were defined in terms of medians (central line) and interquartile range (IQR; upper and lower box limits), with whiskers extending by ±1.5 the IQR from the limits of each box. p = p values from T tests (top) and Kruskal-Wallis tests (bottom). F) Box plots comparing SOFA scores between SRS groups in the microarray (left) and RNA-seq (right) cohorts. Dots represent samples obtained at the latest available time point from each patient, colour coded by endotype. Box plots are defined in terms of medians and IQR, with whiskers extending by ±1.5 the IQR. p = p values from either a T test (left) and a Kruskal-Wallis test (right). G) Kaplan-Meier curves comparing the 28-day survival of sepsis patients in the SRS1 and SRS2 groups. SRS groups were defined using samples from the latest available time point for each patient. Lines represent average survival probabilities (Y axis) of each endotype over time (X axis), with shaded areas indicating 95% confidence intervals. Patient numbers are shown at the bottom; p = p value from log-rank tests.

A quantitative score reflective of immune dysfunction Sepsis can be viewed as a spectrum of illnesses with varying severities [17] [18] [19] , which raises the possibility of modelling patients as a continuum. We did this by using diffusion maps (Methods), a method designed to embed samples into a low-dimensional space that reflects their original connectivity. 20 The first diffusion component (DC1) separated samples into a progression which started at SRS3 and gradually transitioned towards the SRS2 and SRS1 groups ( Figure 3A and Supplementary Figure 3A) . This provided evidence of the existence of a patient continuum.

We used DC1 to derive a quantitative metric reflective of the position of individuals along this continuum, which we refer to as the quantitative sepsis response signature score (SRSq; Methods). SRSq is bound between 0 and 1, with lower values indicating a patient is transcriptionally closer to health and higher values indicating similarity to the highest extreme of SRS1 ( Figure 3B ). SRSq scores derived using our two gene signatures were highly correlated (Pearson correlation = 0.84). However, the extended signature achieved better separation of healthy volunteers (Supplementary Figure 3B) . To make SRSq calculation more straightforward, we trained machine learning models to predict this variable. We subdivided samples into training and test sets (as defined above) and trained random forest models for each gene signature (Methods). Model performance was high in both cross-. CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint We next investigated the molecular changes underlying SRSq. Both gene expression ( Figure   3C and Supplementary Figure 3D ) and cell counts (Supplementary Figure 3E ) changed linearly as a function of SRSq, with lymphocyte counts decreasing proportionally to it. While both SRS and SRSq captured similar gene expression programs (99.7% of genes differentially expressed between SRS1 and SRS2 were also associated with SRSq; Figure 3D ), our analysis identified 4,121 SRSq-associated genes which were not significantly different between SRS endotypes ( Figure 3C ). This more than doubled the associated gene set, demonstrating the power of modelling patients as a continuum.

We next investigated the relationship between SRSq and illness severity. We observed significant associations between SRSq and 28-day mortality (Methods) in both microarray and RNA-seq ( Figure 3E ), which were significant in Cox Proportional-Hazards models (Methods), even when accounting for age and source of sepsis ( Figure 3F) . Additionally, SRSq correlated with the severity of ICU-acquired infections ( Figure 3G ). Hazard ratios (HR = 2 and 1.6 in microarray and RNA-seq, respectively) indicated that a 0.1 increase in SRSq decreased patient survival as much as if the patient were a decade older. This illustrates the value of SRSq in risk estimation.

We subsequently investigated the processes linking immune dysfunction and death. Mediation analysis is a statistical method which takes candidate hypotheses and tests their compatibility with existing data by simulating how response variables change when other variables are altered one at a time. [21] [22] [23] This is roughly equivalent to a computational randomised experiment. Results from mediation analysis suggest probable causal mechanisms, thus supporting hypothesis generation. We applied mediation analysis to SRSq and clinical measurements (Methods and Supplementary Table 2 ), assuming a model where SRSq influences organ dysfunction (i.e. SOFA scores), in turn increasing mortality ( Figure 3H ). This enabled us to estimate the direct effect of SRSq on death (i.e. how much mortality would increase if SRSq was raised while keeping SOFA constant), as well as its mediation effect (i.e. how much mortality would increase if SOFA was raised while keeping SRSq constant).

Almost all the effect of SRSq on mortality was mediated by organ dysfunction is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; https://doi.org/10.1101/2022.03.17.22272427 doi: medRxiv preprint each cohort within our reference map. SRSq was calculated by min-max scaling the first component of the diffusion map. Box plots are stratified by cohort. Each dot represents a sample, colour coded by sepsis endotype. Box plots were defined in terms of medians (central line) and interquartile ranges (IQR; upper and lower box limits), with whiskers extending by ±1.5 the IQR from the limits of each box. p = p value from a Kruskal-Wallis test. C) Top genes positively (top panels) and negatively (bottom panels) associated with SRSq. Each plot represents the expression level for the gene in question as measured using microarrays (log-VSN normalised intensities; Y axis) for different SRSq values (X axis). Each dot represents a sample, colour coded by sepsis endotype. D) Upset plot (bottom) and scatter plot (top) showing the agreement between differential gene expression along SRSq and differential gene expression between SRS1 and SRS2 endotypes. Bar heights in the UpSet plot indicate the number of genes called as differentially expressed in each comparison, with dots and lines showing different types of overlap. The scatter plot shows log-fold changes estimated between SRS1 and SRS2 (X axis) and along SRSq (Y axis) in microarray samples from the GAinS study. Each dot represents a gene; Cor = Pearson correlation coefficient. E) Association between SRSq and mortality in microarray (left) and RNA-seq (right) samples from GAinS. A sliding window was used to estimate the mortality of sets of samples with increasingly higher SRSq values. Dark dots represent 28-day mortality estimates and light lines show the associated 95% confidence intervals. F) Hazard ratio (HR) estimates obtained from applying a Cox Proportional Hazards model to microarray (left) and RNA-seq (right) samples in GAinS. Dots indicate the estimated HR for each variable, with lines showing 95% confidence intervals. G) Box plots showing the association between SRSq (Y axis) and ICU-acquired infection scores (ICU-AI; Supplementary Table 2) in microarray (left) and RNA-seq (right) samples from GAinS. Each dot represents a sample, colour coded by sepsis endotype. Box plots were defined in terms of medians (central line) and interquartile ranges (IQR; upper and lower box limits), with whiskers extending by ±1.5 the IQR from the limits of each box. ϐ = regression coefficient as estimated using proportional odds logistic regression (POLR); p = associated p value. H) Schematic diagram representing the structure of the causal model assumed for mediation analysis. Arrows represent hypothesised relationships, with arrow directions indicative of the direction of causality. Effect estimates from mediation analysis are provided for each arrow, along with their associated 95% confidence intervals and significance levels. I) Effect size estimates obtained from applying mediation analysis to microarray (left) and RNA-seq (right) samples in GAinS. Each row shows estimates obtained for a different clinical variable. Dots represent estimated effect sizes, with lines showing the associated 95% confidence interval. Solid and dotted lines represent effect sizes estimated with respect to samples with high SRSq (treatment condition) and low SRSq (control condition), respectively. ACME = Average Causal Mediation Effect; ADE = Average Direct Effect; p = mediation p value associated with ACME estimation.

We next assessed the role of individual organs by performing mediation analysis on all subcomponents of SOFA. The effect of SRSq on death was mediated by alterations in mean arterial pressure, coagulation (platelet counts), and, to a lesser extent, renal function (creatinine; Figure 3H -I). In contrast, we found no evidence of liver or lung dysfunction mediating this effect ( Figure 3I SepstratifieR: a machine learning framework for patient stratification We next developed a software for SRS/SRSq prediction. We collected the models described above in an algorithmic framework called SepstratifieR. For a data set of interest, . CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; https://doi.org/10.1101/2022.03.17.22272427 doi: medRxiv preprint SepstratifieR: 1) extracts expression measurements for all signature genes (i.e. the Davenport or Extended signature, as specified by the user); 2) aligns samples to the corresponding reference map using mNN; and 3) predicts SRS and SRSq using random forest classifiers ( Figure 4 ). This can be achieved in a single line of code, as described in https://github.com/jknightlab/SepstratifieR. Publicly available data were used to construct sepsis reference maps based on small gene signatures by aligning different technologies into the same space. Next, random forest models were trained to predict SRS and SRSq, respectively. When applying SepstratifieR to a set of new samples, genes in the signature of interest are extracted and used to align the new samples to the sepsis reference map. After alignment, SRS and SRSq are predicted using pre-trained random forest models.

We evaluated the performance of SepstratifieR in two cohorts of sepsis patients (Supplementary Table 1 is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Figure 4B) . These differences were also evident for SRSq, with elevated SRSq scores associated with higher APACHE II and mortality (Supplementary Figure 4C and I). In contrast, 84% of the SRS3 group consisted of healthy volunteers (Supplementary Figure   5B) . Surprisingly, the Mars1 endotype did not correspond to any of the groups identified by SepstratifieR, suggesting it represents an orthogonal axis of variation unrelated to our gene signatures. Finally, differential expression analysis showed highly correlated effect sizes between studies (Pearson correlation = 0.83) and a similar set of differentially active pathways is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; https://doi.org/10.1101/2022.03.17.22272427 doi: medRxiv preprint shock and organ dysfunction. We hypothesised that unmeasured factors could be severing the link between SRSq and death. For example, confounding or moderator variables could influence SRSq and mortality in opposite directions (Supplementary Figure 5I) . We tested this using mediation analysis (Methods). While the total effect of SRSq on death was not significant, we observed significant mediation effects of SRSq on death via shock and SOFA scores (Supplementary Figure 5J) . This suggests that an increase in SRSq leads to higher probabilities of shock and organ failure, which in turn increases mortality, but that unobserved variables might counterbalance this effect. For example, corticosteroid treatment has been shown to worsen the outcome of SRS2 patients who would otherwise be at lower risk. 12 Thus, SepstratifieR can separate sepsis patients by molecular profile and predict illness severity in cohorts of different demographic and clinical composition.

SepstratifieR was trained on samples spanning a range of severities, from health to critical illness. This should extend its applicability to patients who do not fulfil sepsis criteria. To test this, we deployed SepstratifieR in a cohort of patients hospitalised with influenza. 25 Exploratory analysis revealed a gradation of illness severities, with patients separating by degree of oxygen supplementation ( Figure 5A ). This gradation agreed with increases in SRSq scores ( Figure 5A) . At the molecular level, we confirmed increased expression of innate immunity genes proportionally to SRSq (Figure 5B) , with a significant correlation of effect sizes between sepsis and influenza (Pearson correlation = 0.69; Figure 5B -C). This provided evidence that SRSq is applicable to acute viral infection.

We next tested the association between SRSq and illness severity. We observed an increase in SRSq proportional to the extent of oxygen supplementation, with patients on mechanical

ventilation showing an SRSq about 0.2 units higher than patients without supplemental oxygen ( Figure 5D ). In addition, we found a decrease of SRSq over time, with most patients displaying SRSq values equivalent to healthy volunteers after 4 weeks ( Figure 5E) . Interestingly, while patients with high SRSq upon admission (> 0.4) showed variable rates of SRSq decrease, patients with low initial SRSq (< 0.4) showed no changes over time ( Figure 5F ). These observations demonstrate that SepstratifieR is applicable to influenza, even when patients do not fulfil traditional sepsis criteria, and reveal complex patient trajectories. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ;  panels) and negatively (bottom panels) associated with SRSq. Each plot represents the expression level for the gene in question (Y axis) for different SRSq values (X axis). Each dot represents a sample, colour coded by sepsis endotype. D) Box plots showing the association between SRSq (X axis) and supplemental oxygen requirement (Y axis). Each dot represents a sample. Box plots were defined in terms of median and IQR, with whiskers extending by ±1.5 the IQR from the limits of each box. Overall differences in SRSq distribution were tested using a Kruskal-Wallis test, followed by Dunn's post-hoc test for each combination of categories (*** indicates a Bonferroni-adjusted Dunn's test p < 0.01; n.s. Indicates a Bonferroni-adjusted Dunn's test p > 0.05). E) Box plots showing the association between SRSq and time since patient admission. Each dot represents a sample, colour coded by sepsis endotype. Samples are stratified by oxygen requirement. Box plots are defined in terms of median and IQR, with whiskers extending by ±1.5 the IQR. p = p value from Kruskal-Wallis tests. F) Line plot showing changes of SRSq (Y axis) over time (X axis). Each dot represents a sample, colour coded by sepsis endotype. Lines join together samples from the same patient and are coloured based on the sepsis endotype assigned to that patient upon admission.

Finally, we applied SepstratifieR to two COVID-19 cohorts: the COVID-19 Multi-Omic Blood Atlas (COMBAT) 26 27 In both instances, whole-transcriptome analysis showed a clear separation between patients and controls ( Figure 6A and Supplementary Figure 6A) . In COMBAT, patients further separated by severity ( Figure 6A ). While 90% of healthy volunteers in DeCOI were assigned to SRS3, 80% of COVID-19 samples were classified as either SRS1 or SRS2 (Supplementary Figure 6B) . In COMBAT, the SRS3 group contained a mixture of healthy volunteers and community COVID-19 cases, who were never hospitalised. In contrast, SRS2 and SRS1 were enriched for patients with severe illness and in critical care, respectively Figure 6F and Supplementary Figure 7C) . Furthermore, SRSq positively correlated with neutrophil and negatively correlated with lymphocyte counts (Figure 6D and Supplementary Figure 6G) . Interestingly, elevated SRSq scores in COVID-19 were also associated with type I interferon signalling, a pathway not observed in sepsis.

We next evaluated the utility of SRSq for outcome prediction. SRSq was significantly associated with C reactive protein (CRP), respiratory function (P/F ratios), and SOFA scores, is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; https://doi.org/10.1101/2022.03.17.22272427 doi: medRxiv preprint as well as with pneumonia indexes estimated by DeCOI investigators (Figure 6E and Supplementary Figure 6H) . To assess if this resulted in differential outcomes, we evaluated the relationship between SRSq and 28-day mortality in COMBAT. While all patients with SRSq < 0.6 survived, we observed a sharp and linear increase in mortality in patients with SRSq > 0.6 ( Figure 6F) . This association was significant in a Cox Proportional-Hazards model (HR = 3.1 per 0.1-unit increase in SRSq), even when accounting for age ( Figure 6G) . Consequently, SRSq is predictive of clinical outcomes.

Finally, we performed mediation analysis assuming that immune dysfunction leads to higher rates of organ failure, in turn increasing mortality (Figure 6H-I) . Surprisingly, we found no evidence of total SOFA scores mediating the effect of SRSq on death. Instead, this effect was entirely explained by respiratory function (i.e. P/F ratios; Figure 6H -I). While this analysis is limited by sample size, it suggests marked differences between causes of death in sepsis and COVID-19. In particular, in COMBAT, where SARS-CoV-2 infected patients span a wider severity range, respiratory failure plays a more prominent role, presumably because major systemic inflammation is not as prevalent.

In summary, SRSq is a quantitative score reflective of immune dysfunction and applicable to a variety of infections, including COVID-19. Elevated SRSq scores indicate decreased lymphocyte function and antigen presentation, increased neutrophil counts and TLR signalling, more severe illness, and higher likelihood of poor outcomes. This is likely explained by alterations in coagulation and blood pressure in sepsis, but by respiratory failure in COVID- 19 . These factors, possibly combined with a differential response to immunomodulatory therapy, increase patient mortality (Figure 6J) . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; https://doi.org/10.1101/2022.03.17.22272427 doi: medRxiv preprint agreement between sepsis endotypes and clinical severity groups. Colours represent the level of agreement between two categories, as estimated using Jaccard indexes. C) Box plots showing the association between SRSq (X axis) and clinical severity (Y axis). Each dot represents a sample. Box plots were defined in terms of median and IQR, with whiskers extending by ±1.5 the IQR from the limits of each box. p = p value from a Kruskal-Wallis test. D) Association between SRSq (X axis) and cell counts (Z-scored; Y axis). Each dot represents a sample, coloured by sepsis endotype. Lines indicate the best linear fit, with shaded areas showing their associated 95% confidence intervals. Cor = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. E) Association between SRSq (X axis) and clinical variables (Y axis). Each dot represents a sample, coloured by sepsis endotype. Lines indicate the best linear fit, with shaded areas showing their associated 95% confidence intervals. Cor = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. F) Association between SRSq and mortality. A sliding window was used to estimate the mortality of sets of samples with increasingly higher SRSq values. Dark dots represent 28-day mortality estimates and light lines show the associated 95% confidence intervals. G) Hazard ratio (HR) estimates obtained from a Cox Proportional Hazards model. Dots indicate the estimated HR for each variable, with lines showing 95% confidence intervals. H) Effect size estimates obtained from mediation analysis, using total SOFA score (left) and P/F ratio (right) as mediator variables. Dots represent estimated effects, with lines showing their associated 95% confidence interval. Solid and dotted lines represent effect sizes estimated with respect to samples with high SRSq (treatment condition) and low SRSq (control condition), respectively. ACME = Average Causal Mediation Effect; ADE = Average Direct Effect; p = mediation p value associated with ACME estimation. I) Schematic diagram representing the causal model inferred from mediation analysis. Arrows represent causal relationships, with arrow directions indicative of the direction of causality. Effect estimates from mediation analysis are provided for each arrow, along with their 95% confidence intervals and significance levels. J) Graphical summary of SRSq and its biological interpretation.

We introduced SepstratifieR, a collection of models for stratification of patients with acute infection. SepstratifieR can be used in conjunction with numerous gene expression profiling methods and deployed across infections, successfully separating patients by severity and predicting clinical outcomes.

Our study addresses long-standing challenges. Firstly, it furthers our ability to identify endotypes at point of care. While this has been achieved in other diseases like inflammatory bowel disease 14,28 and asthma 13 , similar methods are lacking for infectious disease.

SepstratifieR provides a framework which can be used in conjunction with qRT-PCR testing, but remains applicable to full-transcriptome technologies. Secondly, SepstratifieR's ability to model patients as a continuum makes it useful across a range of infections, independently of patients fulfilling sepsis criteria. This suggests that sepsis is one extreme along a spectrum of immune dysregulation. Interestingly, it was previously thought that systemic inflammatory response syndrome (SIRS), sepsis, and septic shock formed a continuum 17 , but evidence of different molecular processes operating in these conditions led to SIRS being removed as a diagnosis in Sepsis-3. 29, 30 While this simplifies diagnostic criteria, a spectrum of severities and mechanisms of immune dysfunction exists within sepsis 19 , and it is recognised that Sepsis-3 . CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; https://doi.org/10.1101/2022.03.17.22272427 doi: medRxiv preprint licence to any Author Accepted Manuscript version arising from this submission. We thank all the patients, patient families, nurses, and clinicians who participated in the GAinS study. 

Sample collection was performed as described by Davenport et al. 7, 8 Briefly, whole blood (~10 mL) was obtained from patients on the first, third, and/or fifth day following ICU admission.

Leukocyte isolation was performed at the bedside using the LeukoLOCK system (Thermo Scientific), with RNA extracted using the Total RNA Isolation Protocol (Ambion). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Polymerase, and nuclease-free water) using the following cycle: 1) 94°C for 2 min; 2) 40 cycles of 94°C for 30s, 55°C for 30s, and 72°C for 45s; and 3) 72°C for 5 min.

Raw data were processed with CFX Manager TM (BioRad) and Cq values <35 were removed.

Mean Cq values for each gene were calculated by averaging over technical replicates, and normalisation to housekeeping genes was performed by subtracting the geometric mean of housekeeping genes from each gene in the SRS signature. Normalised Cq values were used for batch correction with limma (v3.44.1). 44

Publicly available gene expression data were collected from three cohorts of healthy volunteers of German, Dutch and Finnish ancestry, as well as six infectious disease cohorts comprising all-cause sepsis, influenza and COVID-19 patients of different geographical origins (Supplementary Table 1) . 9, 24, 25, 27, [45] [46] [47] [48] [49] [50] For microarray studies, probes were quality filtered (detection p value < 0.01 in over 20% of samples), and measurements were averaged across all probes uniquely mapping to each gene. For RNA-seq studies, counts were normalised and log-transformed and genes were quality filtered (>1 CPM in over 10% of samples).

Canonical correlation analysis (CCA) was performed using the sparse CCA algorithm in the PMA package (v1.2.1). 51 Following CCA, datasets in the GAinS study were integrated with healthy volunteer data from three cohorts (Supplementary Table 1 ) based on a set of seven or 19 genes. Technical differences between cohorts were corrected using mutual nearest neighbours (mNN) with the batchelor package (v1.4.0). 16 Batches were defined based on technology (i.e. Illumina HumanHT-12 arrays, RNA-seq, or qPCR).

Definition of a quantitative sepsis response score Diffusion maps were constructed using the destiny package. 52 The first diffusion component (DC1) was used to derive a quantitative sepsis response score (SRSq), defined as the minmax scaled DC1 coordinate of each sample:

. CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint 

Random forests were trained using Breiman's algorithm with the randomForest (v4.6.14) and caret (v6.0.86) packages. 5354,55 Five hundred decision trees were built per forest, using either 7 or 19 genes as predictor variables. Either SRS labels or SRSq scores were used as response variables. Performance was evaluated using leave-one-out cross-validation (LOOCV) by estimating Cohen's Kappa (for SRS) or root-mean-square errors (RMSE; for SRSq). 56 The optimal number of predictor variables sampled at each split (i.e. 'mtry') was defined by assessing model performance over a range of values.

Differential expression analysis Differential gene expression between SRS endotypes and along SRSq was assessed using moderated T-tests with limma. 44 Genes were called differential expressed whenever: 1) |fold-change| > 1.5 between SRS groups at a false discovery rate (FDR) of 0.05; or 2) |fold-change| > 3.5 per unit increase in SRSq at an FDR of 0.05 (i.e. a one-fold increase in expression for every 0.3 SRSq units).

Pathway enrichment was assessed using XGR (v1.1.8) 57 using pathways listed in REACTOME. 58 Pathways were deemed significantly enriched if they passed the 0.05 FDR threshold.

Pre-processing and integration with SRS/SRSq GAinS clinical information was collected by local investigators using electronic case report forms (Supplementary Table 2) . Data were quality filtered and assembled into a database for ease of access.

Associations between SRS and clinical variables were tested using Kruskal-Wallis one-way analysis of variance (for numeric variables) or Mood's median test (for ordinal variables) with the RVAideMemoire package (v0.9.80). 59 Associations between SRSq and clinical variables . CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ; https://doi.org/10.1101/2022.03.17.22272427 doi: medRxiv preprint were assessed using correlation tests (for numeric variables) or proportional odds logistic regression (for ordinal variables) with the MASS package. 60, 61 Survival analysis Mortality and time to death were censored (i.e. capped) at 28 days post ICU-admission.

Kaplain-Meier curves were built using the survival (v3.1.12) package, modelling time to event as a function of SRS (with SRS measured at the latest time point available per patient).

Visualisation was performed using survminer (v0.4.9) 62 , with significance estimated by logrank tests.

To test for associations between SRSq and survival, samples were sorted by increasing SRSq (at the latest time point available) and a sliding window containing 35% of samples was used to estimate 28-day survival. Windows were slid one sample at a time using the zoo (v1.8.7) package 63 until reaching the sample with highest SRSq. Survival within each window was estimated using survival. Associations were subsequently tested using Cox Proportional Hazard models. 64 Hazard ratios (relative to 0.1-unit increases in SRSq) were modelled as an exponential function of SRSq, adjusting for age and source of sepsis (CAP or FP). Significance was assessed by log-rank tests.

Mediation analysis is best conceptualised using causal diagrams. In the diagram below, X represents an independent variable and Y a dependent variable, with M being a mediator lying in the causal path between X and Y. Arrows indicate the direction of causality:

Mediation analysis estimates how much of the effect of X on Y depends on changes in M (the average causal mediation effect, ACME), as well as how much is independent of M (the direct causal effect, ADE). 21, 65 To do so, Y is modelled as a function of M and X, M itself being a function of X. ACME and ADE are then estimated either exactly or with simulations. 22, 66 In this study, SOFA scores, organ function measurements, and the presence/absence of shock were considered potential mediators of SRSq on mortality. Thus, each of them was modelled as a function of SRSq while accounting for covariates (age and source of sepsis).

More specifically, numeric variables (e.g. SOFA scores) were modelled using a standard linear regression:

. CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Where M is the mediator variable, Ci the i-th covariate, and and the regression coefficients and random error term, respectively.

In contrast, binary variables (e.g. shock) were modelled using logistic regression with a probit model:

Where p(M) is the probability of the mediator, Ci the i-th covariate, and the regression coefficients and random error term, respectively, and the cumulative distribution function (CDF) of the normal distribution:

Assuming X is normally distributed with zero mean and unit standard deviation:

The dependent variable (i.e. mortality) was then modelled as a function of each mediator, one variable at a time, while accounting for SRSq and covariates. This was done using a logistic regression with probit model:

Where p(D) is the probability of death within 28 days of ICU admission, M the mediator of interest, Ci the i-th covariate, and the regression coefficients and random error term, respectively, and the cumulative distribution function (CDF) of the normal distribution.

ACMEs, ADEs, and their 95% confidence intervals were estimated using Monte Carlo simulations, as implemented in the mediation package (v4.5.0). 22, 23, 67 Effect sizes were estimated relative to 0.1-unit increases in SRSq.

Mediation effects are best interpreted in terms of counterfactuals. 21 In particular, ADEs tell us how much mortality would increase if SRSq were increased by 0.1 units while keeping the . CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint 

All codes will be made publicly available upon publication. The SepstratifieR package can be downloaded and installed from https://github.com/jknightlab/SepstratifieR. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ;  https://doi.org/10.1101/2022.03.17.22272427 doi: medRxiv preprint profiling platform used. B) Correlation between SRSq values estimated using the Davenport (X axis) and Extended (Y axis) gene signatures. Each dot represents a sample. Cor = Pearson correlation coefficient. C) Correlation between SRSq scores as originally defined using diffusion maps (X axis) and SRSq scores as predicted with a random forest model (Y axis) based on the Extended gene signature. Each dot represents a sample, colour coded by sepsis endotype. R = Pearson correlation coefficient; RMSE = Root Mean Square Error; p = Correlation p value estimated using a two-tailed T test. D) Top genes positively (top panels) and negatively (bottom panels) associated with SRSq. Each plot represents the expression level for the gene in question as measured using RNA-seq (log-CPM; Y axis) for different SRSq values (X axis). Each dot represents a sample, colour coded by sepsis endotype. E) Association between SRSq (X axis) and log-transformed cell counts (Y axis)in GAinS microarray (top) and RNA-seq (bottom) samples. Each dot represents a sample, coloured by sepsis endotype. Lines indicate the best linear fit, with shaded areas showing their associated 95% confidence intervals. R = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. F) Effect size estimates obtained from applying mediation analysis to microarray (left) and RNA-seq (right) samples in GAinS, using SOFA score as a mediator variable. Dots represent estimated effect sizes, with lines showing the associated 95% confidence interval. Solid and dotted lines represent effect sizes estimated with respect to samples with high SRSq (treatment condition) and low SRSq (control condition), respectively. ACME = Average Causal Mediation Effect; ADE = Average Direct Effect; p = mediation p value associated with ACME estimation. G) Schematic diagram representing the causal model inferred from mediation analysis. Arrows represent causal relationships, with arrow directions indicative of the direction of causality. Effect estimates from mediation analysis are provided for each arrow, along with their 95% confidence intervals and significance levels. H) Correlation between SRSq (X axis) and clinical variables (Y axis) in GAinS microarray (top panels) and RNA-seq (bottom panel) samples. Rows show different clinical variables. Each dot represents a sample, colour coded by sepsis endotype. Lines indicate the best linear fit, with shaded areas showing the associated 95% confidence intervals. R = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint In the line plots, each dot represents a sample colour coded by sepsis endotype, with lines connecting samples which correspond to the same patient repeatedly sampled over time. Lines are coloured based on the sepsis endotype assigned to each patient at the earliest time point available. I) Association between SRSq (X axis) and clinical variables (Y axis). Each dot represents a sample, coloured by sepsis endotype. Lines indicate the best linear fit, with shaded areas showing their associated 95% confidence intervals. R = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The colour scale in the heatmap shows the overlap between the two endotype definitions as estimated using Jaccard indexes. C) Volcano plot (bottom) showing genes differentially expressed between SRS1 and SRS2 samples. Each dot is a gene, with red indicating upregulation and blue downregulation in SRS1 samples. Grey dots show genes which do not reach statistical significance. Gene names are provided for a subset of significant genes with immune relevance. The associated scatter plot (top) shows the agreement between log-fold changes in the GAinS (X axis) and MARS (Y axis) studies. Cor = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. D) Top genes identified as positively (top panels) and negatively (bottom panels) associated with SRSq in GAinS. Each plot represents the expression level for the gene in question in the MARS cohort (Y axis) for different SRSq values (X axis). Each dot is a sample, colour coded by SRS endotype. E) A subset of immune-relevant pathways significantly enriched for genes differentially expressed between SRS1 and SRS2. Plots show either pathways upregulated (left) or downregulated (right) in SRS1 samples. Dots show the estimated log2-fold enrichment, with whiskers indicating the associated 95% confidence interval. All pathways were significant at FDR 0.05. F) Proportion of MARS patients diagnosed with septic shock in each SRS group. Bar heights indicate patient proportions per group, with colours corresponding to the presence or absence of shock; p = p value from Fisher's exact test. G) Box plot showing the distribution of SOFA scores within each SRS group (top panel), and scatter plot showing the correlation between SOFA and SRSq scores (bottom panel). Each dot represents a sample, colour coded by SRS endotype. Box plots were defined in terms of medians (central line) and IQRs (box limits), with p values estimated using a Kruskal-Wallis test. The solid line in the scatter plot shows the best linear fit, with R = Pearson correlation coefficient and p = correlation p value obtained using a two-tailed T test. The associated plots show the distribution of organ-specific SOFA scores for each SRS group. Red lines indicate median values, with p values calculated using Mood's median test. H) Kaplan-Meier curves comparing the 28-day survival of patients in the SRS1 and SRS2 groups. Each line represents average survival probabilities (Y axis) over time (X axis), with shaded areas indicating 95% confidence intervals. All patient numbers are shown at the bottom; p = p value from log-rank tests. I) Causal model used for mediation analysis. Arrows represent hypothesised relationships, with arrowheads indicative of the direction of causality. J) Effect size estimates obtained from mediation analysis, using either septic shock (left) or SOFA score (right) as the mediator variable. Dots represent estimated effect sizes, with lines showing the associated 95% confidence interval. Solid and dotted lines represent effect sizes estimated with respect to samples with high SRSq (treatment condition) and low SRSq (control condition), respectively. ACME = Average Causal Mediation Effect; ADE = Average Direct Effect; p = mediation p value associated with ACME estimation. the heatmap shows the overlap between endotypes and clinical groups as estimated using Jaccard indexes. C) Box plots showing the association between SRSq (X axis) and clinical group (Y axis). Each dot represents a sample, colour coded by sepsis endotype (red = SRS1, light blue = SRS2, dark blue = SRS3). Box plots were defined in terms of median and IQR, with whiskers extending by ±1.5 the IQR; p = p value from a Kruskal-Wallis test. D) Top genes identified as positively (top panels) and negatively (bottom panels) associated with SRSq in GAinS. Each plot represents the expression level for the gene in question in the DeCOI study (Y axis) for different SRSq values (X axis). Each dot is a sample, colour coded by sepsis endotype. E) Volcano plot showing genes differentially expressed along SRSq. Each dot is a gene, with red indicating a positive and blue a negative association with SRSq. Grey dots represent genes which do not reach statistical significance. Gene names are provided for a subset of significant genes with immune relevance. The associated scatter plot (right) shows the agreement between log-fold changes in the GAinS (X axis) and DeCOI (Y axis) studies. Cor = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. F) A subset of immune-relevant pathways significantly enriched for genes differentially expressed along SRSq. Plots show either pathways positively (left) or negatively (right) associated with SRSq. Dots show the estimated log2-fold enrichment, with whiskers indicating the associated 95% confidence interval. All pathways were significant at FDR 0.05. G) Association between SRSq (X axis) and cell counts (Z-scored; Y axis). Each dot represents a sample, coloured by sepsis endotype. Lines indicate the best linear fit, with shaded areas showing their associated 95% confidence intervals. Cor = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. H) Association between SRSq (X axis) and clinical variables (Y axis). Each dot represents a sample, coloured by sepsis endotype. Lines indicate the best linear fit, with shaded areas showing their associated 95% confidence intervals, with Cor = Pearson correlation coefficient and p = correlation p value estimated using a two-tailed T test. Box plots were defined in terms of median and IQR, with whiskers extending by ±1.5 the IQR and p = p value from a T test.

. CC-BY 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Figure 7 . Transcriptional profiles associated with SRSq in COVID-19 patients from the COMBAT study. A) Volcano plot showing genes differentially expressed along SRSq. Each dot is a gene, with red indicating a positive and blue a negative association with SRSq. Grey dots represent genes which do not reach statistical significance. Gene names are provided for a subset of significant genes with immune relevance. The associated scatter plot (right) shows the agreement between log-fold changes in the GAinS (X axis) and COMBAT (Y axis) studies. Cor = Pearson correlation coefficient; p = correlation p value estimated using a two-tailed T test. B) Top genes identified as positively (top panels) and negatively (bottom panels) associated with SRSq in GAinS. Each plot represents the expression level of the gene in question in the COMBAT study (Y axis) for different SRSq values (X axis). Each dot is a sample, colour coded by sepsis endotype. C) A subset of immune-relevant pathways significantly enriched for genes differentially between sepsis endotypes. Plots show either pathways upregulated (left) or downregulated (right) in SRS1 samples. Dots show the estimated log2fold enrichment, with whiskers indicating the associated 95% confidence interval. All pathways were significant at FDR 0.05. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted March 18, 2022. ;  

Global burden of 369 diseases and injuries in 204 countries and territories, 1990-2019: a systematic analysis for the Global Burden of Disease Study

Severe respiratory disease concurrent with the circulation of H1N1 influenza

Predictors of COVID-19 severity: A literature review

The immunopathology of sepsis and potential therapeutic targets

Global, regional, and national sepsis incidence and mortality, 1990-2017: analysis for the Global Burden of Disease Study

Panomics for Precision Medicine

Genomic landscape of the individual host response and outcomes in sepsis: a prospective cohort study

Shared and Distinct Aspects of the Sepsis Transcriptomic Response to Fecal Peritonitis and Pneumonia

Classification of patients with sepsis according to blood genomic endotype: a prospective cohort study

A community approach to mortality prediction in sepsis via gene expression analysis

Predicting sepsis severity at first clinical presentation: The role of endotypes and mechanistic signatures

Transcriptomic Signatures in Sepsis and a Differential Response to Steroids. From the VANISH Randomized Trial

A qPCR-based metric of Th2 airway inflammation in asthma

A blood-based prognostic biomarker in IBD

Canonical Correlation Analysis: Uses and Interpretation

Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors

The natural history of the systemic inflammatory response syndrome (SIRS). A prospective study

SIRS in the Time of Sepsis-3

Elucidating the Spectrum of Disease Severity Encompassed by Sepsis

Diffusion maps

Interpretation and Identification of Causal Mediation

A general approach to causal mediation analysis

Identification, Inference and Sensitivity Analysis for Causal Mediation Effects

Identifying key regulatory genes in the whole blood of septic patients to monitor underlying immune dysfunctions

Progression of whole-blood transcriptional signatures from interferoninduced to neutrophil-associated patterns in severe influenza

A blood atlas of COVID-19 defines hallmarks of disease severity and specificity

Disease severity-specific neutrophil signatures in blood transcriptomes stratify COVID-19 patients

Gene expression profiling of CD8+ T cells predicts prognosis in patients with Crohn disease and ulcerative colitis

The 'cytokine profile': a code for sepsis

The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3)

A genomic storm in critically injured humans

A genomic score prognostic of outcome in trauma patients

Inborn errors of type I IFN immunity in patients with life-threatening COVID-19

Autoantibodies against type I IFNs in patients with life-threatening COVID-19

Neutrophil extracellular traps in patients with sepsis

Neutrophil extracellular traps (NETs) are increased in the alveolar spaces of patients with ventilator-associated pneumonia

Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis

Variance stabilization applied to microarray data calibration and to the quantification of differential expression

MixupMapper: correcting sample mix-ups in genome-wide datasets increases power to detect small genetic effects

Adjusting batch effects in microarray expression data using empirical Bayes methods

STAR: ultrafast universal RNA-seq aligner

featureCounts: an efficient general purpose program for assigning sequence reads to genomic features

Housekeeping genes as internal standards: use and limits

limma powers differential expression analyses for RNA-sequencing and microarray studies

An Immune Response Network Associated with Blood Lipid Levels

Identification of genetic loci associated with Helicobacter pylori serologic status

Systematic identification of trans eQTLs as putative drivers of known disease associations

Deconvolution of bulk blood eQTL effects into immune cell subpopulations

Understanding human immune function using the resources from the Human Functional Genomics Project

A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. bioRxiv (2021)

A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis

destiny: diffusion maps for large-scale single-cell data in R

Random Forests

caret: Classification and Regression Training

Classification and regression by randomForest

A Coefficient of Agreement for Nominal Scales

XGR software for enhanced interpretation of genomic summary data, illustrated by application to immunological traits

The reactome pathway knowledgebase

Use of Ranks in One-Criterion Variance Analysis

Regression Models for Ordinal Data

Modeling Survival Data: Extending the Cox Model

S3 Infrastructure for Regular and Irregular Time Series

Regression models and life-tables

The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations

Making the Most of Statistical Analyses: Improving Interpretation and Presentation

mediation: R package for causal mediation analysis

and Southmead Hospital

Roser Faras-Arraya

St Bartholomew's Hospital and Royal London Hospital

The following investigators, listed alphabetically by institution, were involved in patient recruitment, sample collection, and/or sample processing within the GAinS study: