key: cord-0101772-2ow5hzvr authors: Niyakan, Seyednami; Qian, Xiaoning title: COVID-Datathon: Biomarker identification for COVID-19 severity based on BALF scRNA-seq data date: 2021-10-11 journal: nan DOI: nan sha: 491555ca8ac818f1a1fc91d98a98373463bc82b4 doc_id: 101772 cord_uid: 2ow5hzvr The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emergence began in late 2019 and has since spread rapidly worldwide. The characteristics of respiratory immune response to this emerging virus is not clear. Recently, Single-cell RNA sequencing (scRNA-seq) transcriptome profiling of Bronchoalveolar lavage fluid (BALF) cells has been done to elucidate the potential mechanisms underlying in COVID-19. With the aim of better utilizing this atlas of BALF cells in response to the virus, here we propose a bioinformatics pipeline to identify candidate biomarkers of COVID-19 severity, which may help characterize BALF cells to have better mechanistic understanding of SARS-CoV-2 infection. The proposed pipeline is implemented in R and is available at https://github.com/namini94/scBALF_Hackathon. Globally, the outbreak of Coronavirus disease 2019 , caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has led to more than 233 million infections and more than 4.7 million deaths according to the statistics of World Health Organization (WHO) as of October 1, 2021 [Ren and et al., 2021] . The clinical spectrum of COVID-19 is broad: Many COVID-19 patients are asymptomatic or experience only mild symptoms; however, some patients progress to severe life-threatening conditions [Ren and et al., 2021] . Accurate classification of COVID-19 severity in patients may aid in delivering proper healthcare and reducing mortality. Thus it is of great importance to understand the underlying molecular mechanisms of the disease and identify the biomarkers of the illness severity accurately [de Terwangne et al., 2020] . Single-cell RNA sequencing (scRNA-seq) is a powerful tool at dissecting the cellular processes and characterizing immune responses [Ren and et al., 2021, Niyakan et al., 2021] . Many groups have been applying scRNA-seq to COVID-19 studies to better understand the human immune response to the infection [Ren and et al., 2021 , Wauters et al., 2021 , Cao et al., 2020 . A recent study performed scRNA-seq on Bronchoalveolar lavage fluid (BALF) cells of patients with different COVID-19 severity levels to characterize the respiratory immune properties associated with COVID-19 severity [Liao et al., 2020] . To help better understand the inherent biological signals in the recently published single-cell RNA-seq data from BALF cells [Liao et al., 2020] for COVID-19 severity prediction, we develop a bioinformatics pipeline to first identify the COVID-19 severity biomarker genes and then classify BALF cells using these genes. We apply several classification algorithms, including linear classifiers, such as Linear Discriminant Analysis (LDA), and non-linear ones, such as Quadratic and Flexible Discriminant Analysis (QDA and FDA [Hastie et al., 1994] ), Random Forests (RF) [Breiman, 2001] and Support Vector Machines (SVM), to evaluate our pipeline in classifying BALF cells based on COVID-19 severity. We have used the scBALF-COVID-19 dataset prepared for the single-cell transcriptomics challenge of the IEEE COVID-19 Data Hackathon. This dataset is derived from public data [Liao et al., 2020] and contains a set of Broncho Alveolar Lavage Fluid (BALF) cells from patients categorized clinically as having mild, severe or no COVID-19 infection. In detail, the dataset has scRNA-seq data of 23189 BALF cells from three classes based on COVID-19 severity: mild infection (3292 cells), severe infection (7919 cells), and no infection (11978 cells) across 1999 genes. The dataset has been normalized for technical differences between patients (batch normalization) as well as sequencing depth differences between cells in each patient. This normalization step, has removed the overdispersion inherent in typical scRNA-seq data. As a result, it may not be beneficial to apply typical scRNA-seq tools for biomarker identification and cell clustering on this normalized data set, such as scVI [Lopez et al., 2018] , DESeq2 [Love et al., 2014] and SimCD [Niyakan et al., 2021] , in which Negative Binomial (NB) distribution models and their extensions were developed to model scRNA-seq data. As shown in the histogram plot of normalized gene expression values for a random gene "CXCR1" in the left plot of Fig 1, normalized gene expression profiles across samples can be approximated with a normal distribution. At the first step of our proposed pipeline, we detect genes that are differentially expressed (DE) across cells with different levels of COVID-19 severity. In order to do this we perform three different sets of DE analyses, each time as one label vs the rest (e.g. severe cells vs other cells). DE analyses are based on the R package Monocle [Qiu et al., 2017] designed specifically for scRNA-seq experiments with the capability of handling normally distributed data. This can be done by passing the argument expressionFamily to be uninormal. In detail, what Monocle does for DE analyses is fitting vector generalized linear models (VGLM) to both full and reduced models and then computing likelihood ratio Chi-squared statistics and corresponding p-values. After that, for each of three analyses, we rank genes based on their adjusted p-values in an increasing order. Then we report the intersection of top 100 DE genes (also having lower adjusted p-values) in all three DE analyses as COVID-19 severity potential biomarkers. After identifying the candidate genes as biomarkers to classify BALF cells for COVID-19 infection severity, we run various well-known classifiers, including both linear and non-linear classifiers. We first run LDA as it assumes that predictors are normally distributed and that the different classes have class-specific means and equal variance. Then we try multiple non-linear classifiers, including QDA, FDA, RF and SVM with the radial basis function (RBF) kernel to better capture the non-linearity in the data. For LDA and QDA, we use lda and qda from the R package MASS. For FDA, we use fda with the regression method to be multivariate adaptive regression splines (MARS), from R package mda. For RF, we use randomForest function from a R package with the same name. We perform SVM with RBF, with the svm from the R package e1071. Also, it worth mentioning that in order to account for the imbalance in class labels we set the option class.weights to be inverse so that weights for samples with different class labels be chosen inversely proportional to the corresponding class size. In this section, we present the results of applying our bioinformatics pipeline discussed in Sections 2.2 and 2.3 to the BALF cell scRNA-seq data. First, we discuss the biomarker identification results and then go over the results of cell classification tools. After applying the DE analyses detailed in Section 2.2, we end up having 12 genes in the intersection set of top 100 DE genes in three sets of DE analyses performed by Monocle: RSAD2, CXCL10, IDO1, GCH1, CXCL11, CRYBA4, CCL3, LGMN, IFIT1, CTSB, GBP1 and CCL2. Half of these genes (RSAD2, CXCL10, CXCL11, CCL3, IFIT1, CCL2) have been previously reported as genetic biomarkers for COVID-19 severity in a recent study that profiled the immune response signatures in the BALF cells of eight COVID-19 cases [Zhou et al., 2020] . The rest of the genes detected by our pipeline can be new potential biomarkers for immune response to COVID-19 in BALF cells. The middel and right box plots in Fig. 1 show the normalized gene expression values of GBP1 and RSAD2, two identified potential biomarkers by our pipeline. RSAD2 has been previously studied to be associated with respiratory immune response severity in BALF cells but GBP1 can be a new potential biomarker. After having the ranking lists of genes based on their adjusted p-values in three sets of one class vs rest DE analyses, we make three different sets of candidate gene features for cell classification: Applying these to our BALF scRNA-seq data, we find 12, 88 and 194 genes in the gene sets G1, G2 and G3 respectively. Venn diagram of top 100 DE genes from the three sets of DE analyzes is shown in left plot of Figure 2 . We then apply the previously described classification algorithms in Section 2.3 with the gene expression values from each of the gene sets to classify cells. In order to measure the accuracy of each classifier by each of three gene sets, we first randomly split data to 75% and 25% of the total size of the available cell samples to construct training and test sample sets. Then for each gene set and each classifier, we train the classifier three times (each time one label vs the rest) on the training set and calculate area under Receiver Operator Characteristic (AUC-ROC) values by testing the trained model on the test data. In each case of label comparison, classification algorithm and gene set, we run the code three times and then report the average and standard deviations of AUC-ROC values. Table 1 shows these AUC-ROC values for each of combinations of running one of five different classifiers on three gene sets. As it was expected, the linear classifier LDA has the worst performance in comparison with non-linear ones. Furthermore, as we expect, almost for all classifiers the AUC-ROC is higher when using the gene set G3 comparing to other two gene sets G1 and G2. Similar trends when comparing the results using G2 vs G1. This is reasonable as Table.1) the numbers of genes (biomarkers/features) in the gene sets G2 and G3 are 7 and 16 times higher than the number of genes in the gene set G1, respectively. Actually, the performance of RF, SVM with the RBF kernel, and FDA on the gene set G1, which only has 12 genes, show the practicality of our proposed biomarker identification approach to detect COVID-19 severity in BALF cells. Table 1 , also presents the superior performance of the random forest classification algorithm on dissecting BALF cells based on their COVID-19 severity. Also, it worth mentioning that one can almost accurately classify BALF cells, by running the RF algorithm on the proposed set of genes from the biomarker identification step in our bioinformatics pipeline. Due to the outbreak of COVID-19 infection across the world and daily increasing number of mortality in the patients having the infection, it is of great importance to better understand the underlying biological mechanisms in human immune response to this infection. As a result of this, here, we presented a bioinformatics pipeline, capable of first, identifying potential COVID-19 severity biomarkers in BALF cells and secondly, accurately classifying cells based on the constructed set of genes. The proposed pipeline is implemented in R and all the codes are available in the Github page: https://github.com/namini94/scBALF_Hackathon. Covid-19 immune features revealed by a large-scale single-cell transcriptome atlas Predictive accuracy of covid-19 world health organization (who) severity classification and comparison with a bayesian-method-based severity score (epi-score) Simcd: Simultaneous clustering and differential expression analysis for single-cell transcriptomic data Discriminating mild from critical covid-19 by innate and adaptive immune single-cell profiling of bronchoalveolar lavages Potent neutralizing antibodies against sars-cov-2 identified by high-throughput single-cell sequencing of convalescent patients'b cells Single-cell landscape of bronchoalveolar immune cells in patients with covid-19 Flexible discriminant analysis by optimal scoring Random forests Deep generative modeling for single-cell transcriptomics Moderated estimation of fold change and dispersion for rna-seq data with deseq2 Reversed graph embedding resolves complex single-cell trajectories Heightened innate immune responses in the respiratory tract of covid-19 patients