key: cord-0312447-zby12m60 authors: Büttner, M.; Ostner, J.; Müller, CL.; Theis, FJ.; Schubert, B. title: scCODA: A Bayesian model for compositional single-cell data analysis date: 2020-12-17 journal: bioRxiv DOI: 10.1101/2020.12.14.422688 sha: 527d2681208fc7c8db5ca9a71018f30c91465f15 doc_id: 312447 cord_uid: zby12m60 Compositional changes of cell types are main drivers of biological processes. Their detection through single-cell experiments is difficult due to the compositionality of the data and low sample sizes. We introduce scCODA (https://github.com/theislab/scCODA), a Bayesian model addressing these issues enabling the study of complex cell type effects in disease, and other stimuli. scCODA demonstrated excellent detection performance and identified experimentally verified cell type changes that were missed in original analyses. Recent advances in single-cell RNA-sequencing (scRNA-seq) allow large-scale quantitative transcriptional profiling of individual cells across a wide range of tissues, thus enabling the monitoring of transcriptional changes between conditions or developmental stages and the data-driven identification of distinct cell types. Although being important drivers of biological processes such as in disease 1 , development 2 , aging 3 , and immunity 4 , shifts in cell type compositions are non-trivial to detect using scRNA-seq. Statistical tests need to account for multiple sources of technical and methodological limitations, including low number of experimental replications. The total number of cells per sample is restricted in most single-cell technologies, implying that cell type counts are proportional in nature. This, in turn, leads to a negative bias in cell type correlation estimation 5 (Fig. 1a) . For example, if only a specific cell type is depleted after perturbation, the relative frequency of others will rise. If taken at face value, this would lead to an inflation of differential cell types. Therefore, standard univariate statistical models that test compositional changes of each cell type independently may falsely deem certain population shifts as real effects, even though they were solely induced by the inherent negative correlations of the cell type proportions. Yet, common statistical approaches currently applied in compositional cell type analysis ignore this effect. For example, Haber et al. 6 applied a univariate test based on Poisson regression, Hashimoto et al. 3 a Wilcoxon rank-sum test, and Cao et al. 7 proposed a method based on a generalized linear regression framework with a Poisson likelihood, all thus not addressing the issue of compositionality. To account for the inherent bias present in cell type compositions, we drew inspiration from methods for compositional analysis of microbiome data 8, 9 and propose a Bayesian approach for cell type composition differential abundance analysis to further address the low replicate issue. The singlecell compositional data analysis (scCODA) framework models cell type counts with a hierarchical Dirichlet-Multinomial distribution that accounts for the uncertainty in cell type proportions and the negative correlative bias via joint modeling of all measured cell type proportions instead of individual ones (Fig. 1b, Online Methods -Model) . The model uses a Logit-normal spike-and-slab prior 10 with a log-link function to estimate effects of binary (or continuous) covariates on cell type proportions in a parsimonious fashion. This implicitly assumes that only few cell types change upon perturbation. Since compositional analysis always requires a reference to be able to identify compositional changes 5 , scCODA currently relies on the manual specification of a reference cell type 11 . This implies that credible changes detected by scCODA have to be interpreted in relation to the selected reference. We first performed comprehensive benchmarks on synthetic data across a wide range of scenarios (Online Methods -Simulation) that focused on scCODA's primary application: the behavior of a single binary covariate that models the effect of a perturbation of interest in the respective scRNAseq experiment. To detect statistically credible changes in cell type compositions, we calculate the model inclusion probability for each covariate determined by the spike-and-slab prior (Online Methods -Model description). The benchmark scenarios were then used to identify the optimal inclusion probability threshold to deem a covariate statistically credibly affecting cell type composition. We found that the optimal inclusion threshold is roughly proportional to the number of cell types ( Supplementary Fig. 1a-e) . This reflects the necessity to have higher confidence in the identified effect with increasing number of cell types, mimicking the behavior of frequentistic multiple testing correction. We next compared scCODA's performance to state-of-the-art differential compositional testing schemes from the microbiome field as well as all non-compositional tests recently applied to single-cell data ( Fig. 1c-e) . In our synthetic benchmarks, we found scCODA to significantly outperform all approaches across a wide variety of effect sizes and experimental settings with an average Matthews correlation coefficient (MCC) of 0.75. Considering effect sizes, scCODA showed the highest MCC across all scenarios and consistent improvement with effect size strength (Fig. 1c) . Other compositional (Bayesian) models such as a standard Dirichlet Multinomial (DM), ANCOM 12 , ALDEx2 13 , and additive log-ratio (ALR) transformed proportions combined with a t-test (Online Methods -Model Comparison) showed similar behavior, albeit with lower MCC. Considering the number of replicates per group, scCODA had a considerable edge over other methods in the common scenario with low number of replicates per group (seven or less) (Fig. 1d) . ANCOM and the standard DM model achieved comparable performance with higher number of replicates, outcompeting ALDEx2 and ALR-transform models. Non-compositional models (purple lines in Fig. 1c -e) included more false-positives with increasing effect size (exemplified in Fig. 1e ) and number of replicates per group, highlighting the detrimental effect of compositional data on these models. Since extensive replication of scRNA-seq experiments is still costly and hence rare, yet essential for studying compositional changes, we also investigated the sample size dependency of effect size and rarity of affected cell type on scCODA's performance (Fig. 1f) . We performed a power analysis fitting a linear model (R 2 =0.816±0.006, Online Methods -Power Analysis) on empirical logit transformed 14 and scaled MCC values to infer the required sample size to reach an MCC of 0.8 for varying log-fold changes (Fig. 1f, Supplementary Fig. 1f-g) . We estimated that a relative change of 1 (log2 scale) in abundant cell types (e.g., 1,000 out of 5,000 cells) can be determined with two samples, while the same relative change requires between 20 and 30 samples in a rare cell type (e.g., 125 out of 5,000 cells). Notably, large relative changes (log-fold changes of 3) in rare cell types could be detected from only two samples. While this implies that for many situations only two replicates are useful, we would advise to increase the number of samples when detection of compositional changes in rare cell types is relevant. Next, we applied scCODA to a number of scRNA-seq data examples 1,3,4,6,15 ( Fig. 2 and Supplementary Fig. 2-5) . To confirm scCODA's applicability on real data with known ground truth, we first considered a recent study of age-related changes in peripheral blood mononuclear cells (PBMCs) 3 , where cellular characteristics of supercentenarians (n=7) were compared against the ones of younger controls (n=5; Fig. 2a ). The original study used a Wilcoxon rank-sum test and reported a significant decrease of B-cells in supercentenarians, which is known from literature 16 . Moreover, the result was validated by FACS measurements. scCODA also identified B-cell populations as the sole affected cell type using Erythroblasts as reference. The effect could be consistently detected with only three samples per condition (in 7/10 stratified random subsamples) while the Wilcoxon rank-sum test failed to detect the effect when removing even just one sample from each condition. This suggests that scRNA-seq data indeed comprise enough information to study compositional changes, and that scCODA can correctly identify the experimentally validated age-related decrease of B-cells even in low sample regimes. scCODA detects staining confirmed increase of disease-associated microglia in Alzheimer's disease on few replicates Second, we analyzed the compositional changes of three microglia cell types in an Alzheimer's disease (AD) mouse model 15 (Fig. 2b) . Here, the number of replicates of sorted cells from cortex and cerebellum was low (n=2 per group), thus challenging standard statistical testing scenarios. In the cortex, scCODA identified statistically credible changes both in microglia 2 and disease-associated microglia (DAM) using the most abundant tissue-resident microglia 1 as reference cell type. By contrast, scCODA detected no statistically credible change in the cerebellum, which is known to be unperturbed in AD. Keren-Shaul et al. 15 quantified the increase of DAM in the cortex of the AD mouse model via staining. While DAM localize in close proximity to amyloid-beta plaques and show a distinct inflammatory gene expression pattern, microglia 2 tend to represent an intermediate state between DAM and homeostatic microglia 1 15 (Supplementary Fig. 3) . Therefore, our analysis with scCODA supports the contribution of DAM in AD. scCODA scales to large sample sizes and cell type numbers We next analyzed compositional changes of cell types in single-cell data from patients with ulcerative colitis (UC) compared to healthy donors 1 . Here, biopsy samples from the epithelium and the underlying lamina propria (Fig. 2c) were enzymatically separated and subsequently analyzed with scRNA-seq, resulting in 51 cell types from 133 samples. The epithelium and the lamina propria represent two different compartments and were tested separately. However, some epithelial cells scCODA detects cell type changes in COVID-19 patients that were not detected with non-compositional tests but confirmed in larger-scale studies Next, we reanalyzed a recent COVID-19 single-cell study comparing compositional changes of major cell types in bronchoalveolar lavage fluid between healthy controls (n=4), severe (n=6) and moderate (n=3) COVID-19 cases 4 using plasma cells as reference (Fig. 2f) . The study originally reported significant differential changes in pDC's between all groups, and depletion of T cells in severe (b) Microglia associated with Alzheimer's disease (AD) are significantly more abundant in the cortex, but not in cerebellum 15 (n=2 in AD and wild type mice, respectively), reference was set to microglia 1, HMC chain length was set to 20,000 with burn-in of 5,000. (c-e) Changes in epithelium and lamina propria in the human colon 1 in ulcerative colitis (UC) (n=133 from 18 UC patients, 12 healthy donors). Credible and significant results are depicted as colored bars (Red: scCODA, green: Dirichlet Regression). Stars indicate the significance level (*: adjusted p<0.05, **: adjusted p<0.01, ****: adjusted p<0.001; Benjamini-Hochberg corrected). (c) Epithelium and Lamina propria are distinct tissues, which are studied separately. (d) Compositional changes from healthy to non-inflamed and inflamed biopsies of the intestinal epithelium, reference was set to Non-Epithelial cells, HMC chain length was set to 150,000 with burn-in of 10,000. (e) Compositional changes from healthy to non-inflamed and inflamed biopsies in the lamina propria, reference was set to six different cell types (see Methods), HMC chain length was set to 400,000 with burn-in of 10,000. (f) Compositional changes in bronchoalveolar cells in COVID-19 patients (n=4 healthy, n=3 mild, n=6 severe disease progression) 4 . Credible and significant results are depicted as colored bars (Red: scCODA, orange: t-test). Stars indicate the significance level (*: adjusted p<0.05, **: adjusted p<0.01, ***: adjusted p<0.001; Benjamini-Hochberg corrected), reference was set to Plasma, HMC chain length was set to 80,000 with a burn-in of 10,000. cases vs. healthy controls using a t-test without multiple testing correction. scCODA confirmed these results but additionally identified a statistically credible depletion of T cells in severe vs. moderate cases, increases of NK and B cells in moderate vs. healthy cases, NK depletions in severe vs. moderate cases, as well as a decrease in mDCs in severe vs. moderate, and a Neutrophil increase in severe cases vs. moderate and healthy cases. Correlation of T cell abundances with severity is well established and has been used as risk factors for severe cases 18, 19 . A decrease of NK cells with COVID-19 severity was observed between recovered and diseased patients 19 In summary, using a comprehensive set of synthetic and scRNA-derived compositional data sets and application scenarios, we established scCODA's excellent performance for identifying statistically credible changes in cell type compositions. scCODA compared favorably to commonly used models for single-cell and microbiome compositional analysis, particularly when only a low number of experimental replicates are available or when detecting compositional changes with small effect sizes is of importance. We believe this is due to the Bayesian nature of the model as it adequately accounts for the uncertainty of observed cell counts, automatically performs model selection, and does not rely on asymptotic assumptions. scCODA not only correctly reproduced previously discovered and partially FACS-verified compositional changes in recent scRNA-seq studies, but also identified additional cell type shifts that were confirmed by independent studies, including Treg cell enrichment in UC patients and Neutrophils increase in severe COVID-19 cases. Using synthetic benchmarks, we confirmed that standard univariate tests, such as Poisson regression models or t-tests, are inadequate for cell type analysis, since they do not account for the compositional nature of the data. While log-ratio transforms from compositional data analysis (such as the ALR used here) can partially mitigate these shortcomings, our Bayesian scCODA framework provided substantial performance improvements across all tested scenarios and is particularly preferable when only few replicates are available. In its present form, our scCODA framework relies on pre-specified cell type definitions which, in turn, hinge on statistically sound and biologically meaningful clustering assignments. In situations where crisp clustering boundaries are elusive, for instance, due to the presence of the transient developmental processes underlying the data, joint modeling of different resolution hierarchies 22 or modeling compositional processes 23,24 may help account for such continuities changes. We believe that our scCODA framework offers an ideal starting point to model such advanced processes thanks to its hierarchical and extendable nature. We seek to identify the credibly associated covariates X NxM to observed cell counts Y NxK of K cell types measured in a single cell experiment with N samples and M covariates. We address this question with a Bayesian generalized linear multivariate regression framework using a Dirichlet Multinomial model with a log-link function to account for the compositional nature and uncertainties in the observed data. Covariate coefficients are hierarchically modeled using a non-centered parameterization 25, 26 . For automatic model selection and identification of credibly associated covariates and affected cell types, we utilize a logit-normal prior as a continuous relaxation of the spike-and-slab prior 10 To prevent identifiability issues of the covariate parameters, we reparametrize the model and choose one cell type as reference, forcing its covariates : = 0 as in Maier et al. 11 . Parameter inference is performed via Hamiltonian Monte Carlo (HMC) using ten leapfrog steps per iteration with automatic step size adjustment according to Betancourt et al. 28 The mean vector for each sample was calculated from the mean abundance of the first cell type in control samples (no effect) i , and the mean change in abundance of the first cell type between the two groups ′. All other cell types were modeled to be equally abundant, leading to = The sets of generation parameters were as follows: • Spike-and-slab threshold determination (Supplementary Fig. 1a- Spike-and-slab threshold determination To identify statistically credible effects, scCODA compares the posterior inclusion probability to a threshold . To find a value for that reliably identifies credible effects on single-cell population data, we applied scCODA to a range of 9,000 simulated data sets (Simulation description) and compared the inclusion probability for each cell type to possible thresholding values ranging from 0 to 1. We measured the performance of scCODA by different binary classification metrics, noting a dependence on the number of cell types in all metrics (Supplementary Fig. 1a-d) . To enable scCODA to evaluate datasets with dimensionality outside of the range of the simulation study, we determined the optimal thresholding values in terms of MCC for each and approximated these as a function of the form = 1 − ⋅ " . We found the optimal values for and by performing a grid search, using the sum of squared residuals as a metric (Supplementary Fig. 1e) . We excluded the case = 2 from this optimization, as it represents an edge case where the only unchanged cell type is also the reference cell type. We found the optimal parameters to be = 0.77, and = −0.29. To be able to estimate the required sample sizes for an intended MCC, we fitted a linear regression We performed a backward model selection with repeated 10-fold cross-validation to reduce the feature set. The final model consisted of sample size, log fold-change, absolute cell count change, and the interaction effect between absolute cell count and log-fold change, which was expected given that we observed an interaction effect between these two variables in the raw benchmark results (Supplementary Fig. 1d-f) . With the fitted model, we can now inverse estimate the required total sample size ™™ with fixed MCC • HZZ , log fold-change šZ , and absolute cell count change ZZ as: Fig. 2) . All analysis steps were carried out using scanpy v.1.5.1. Single-cell RNA-seq data of microglia in Alzheimer's disease (AD) mouse model We downloaded the raw single-cell RNA-seq count matrices (deposited at GEO, accession code GSE98969) comprising immune cells isolated from the mouse brain in wild type (WT) and AD mice 15 . The complete dataset with all samples consists of 37,248 cells. We filtered out ERCC spike-ins before computing the quality metrics of all cells. We then excluded 12,053 cells with less than 500 UMI counts and 11,065 genes, which were not expressed. We subsequently normalized by library size with target sum 10,000 counts (CPM normalization) and log+1 scaled. Following the analysis of Keren-Shaul et al. 15 , we selected the samples of six-month-old mice from AD and WT, which haven't been sorted by brain region, resulting in 9,196 cells. It must be noted that Keren-Shaul et al. Fig. 6 ). Specifically, we tested chain lengths of 20,000, 40,000, 80,000, and 150,000 iterations with a burn-in of 10,000 in the Epi case, while we tested chain lengths of 200,000, 400,000, and 800,000 iterations with a burn-in of 10,000 in the LP case due to the larger number of cell types in LP compared to Epi. For the subsampling analysis in scCODA and Dirichlet regression, we used the Epi data and excluded 12 to 20 donors with all corresponding replicates. The dataset contained 29 donors (12 healthy, 17 UC patients), and we successively reduced the number of donors in the dataset from 17 to 9, where one donor from each group was removed in each step (stratified subsampling). Hence, when the subsampled dataset with 17 donors consisted of 6 healthy donors and 11 UC patients, while the dataset with 9 donors consisted of 3 healthy donors and 6 UC patients. We repeated each round 5 times with a different subset of donors to reduce selection bias (Supplementary Fig. 4 ). Single-cell RNA-seq data of bronchoalveolar immune cells in patients with We used the annotated single-cell RNA-seq data of the bronchoalveolar lavage fluid cells from three patients with moderate COVID-19 progression, six patients with severe COVID-19 progression, four healthy donors, and a publicly available sample 4 We compared scCODA's ability to correctly identify significant compositional changes in a setting typical for single-cell experiments to other methods recently used in scRNA-seq analysis and approaches from the field of microbial population analysis. We applied all methods to each of the 5,000 datasets generated for the comparison analysis (Online Methods -Simulation description) and recorded which of the 5 components each method found to statistically differ between the two groups. We then compared these results to the ground truth assumption of only the first component changing significantly via binary classification metrics (significant vs. non-significant changes). We chose Matthews' correlation coefficient as our primary metric, as it best accounts for the numerical imbalance between the two groups. Details on the individual methods can be found in Supplementary Table 1 . The method has been implemented in Python 3 using Tensorflow >= 2.3.1 30 , Tensorflow-Probability >= 0.11 31 , ArviZ >= 0.9 32 , numpy >= 1.19, and Scanpy >= 1.5 33 . The Power Analysis was performed using Scikit-learn >= 0.23 34 (Python 3.8) and caret package 35 (R 3.7). Source code can be found at https://github.com/theislab/sccoda. All code to reproduce the presented analyses can be found at https://github.com/theislab/scCODA_reproducibility. Simulated data and benchmark analyses results are available at https://doi.org/10.5281/zenodo.4305907 Supplement Supplementary Figure 1 Benchmark evaluation results for spike-and-slab threshold determination (ae) and overall benchmark (f-g) (Online Methods -Simulation Description). The optimal thresholding value for the posterior inclusion probability of the spike-and-slab prior on each effect depends on the number of cell types. The blue line always represents the chosen optimal thresholding function = 1 − 0.77 ⋅`i .Sj (Online Methods -Spike-and-slab threshold determination). Benchmark performance for different numbers of cell types and thresholding values in terms of (a) Matthews correlation coefficient, (b) false discovery rate, (c) true positive rate, and (d) true negative rate. (e) The optimal threshold function (orange) closely matches the MCC-maximizing threshold values (blue) for all numbers of cell types. (f) The performance of scCODA (measured by MCC) depends on the amount of change in abundance. The "Base" value represents the mean cell count of the only differentially abundant cell type. For cell types with higher initial abundance, scCODA can detect smaller relative (log2-fold) changes between the two groups. (g) For cell types with higher initial abundance, the absolute (count) change must be higher to reliably detect changes in abundance. Intra-and Inter-cellular Rewiring of the Human Colon during Ulcerative Colitis A single-cell molecular map of mouse gastrulation and early organogenesis Single-cell transcriptomics reveals expansion of cytotoxic CD4 T cells in supercentenarians Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19 The statistical analysis of compositional data A single-cell survey of the small intestinal epithelium scDC: single cell differential composition analysis Microbiome Datasets Are Compositional: And This Is Not Optional An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data Simultaneous parameter estimation and variable selection via the logit-normal continuous analogue of the spike-and-slab prior Dirichlet regression for compositional data in R Analysis of composition of microbiomes: a novel method for studying microbial composition ALDEx2: ANOVA-Like Differential Expression tool for compositional data Modelling Binary Data A Unique Microglia Type Associated with Restricting Development of Alzheimer's Disease The immunology of exceptional individuals: the lesson of centenarians Functional CD4+CD25high regulatory T cells are enriched in the colonic mucosa of patients with active ulcerative colitis and increase with disease activity Predictors of mortality for patients with COVID-19 pneumonia caused by SARS-CoV-2: a prospective cohort study The dynamics of immune response in COVID-19 patients with different illness severity Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in Wuhan, China Targeting potential drivers of COVID-19: Neutrophil extracellular traps Tree-Aggregated Predictive Modeling of Microbiome Data Modeling and Analysis of Compositional Data Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing Online References Non-centered parameterisations for hierarchical models and data augmentation A General Framework for the Parametrization of Hierarchical Models On the Half-Cauchy Prior for a Global Scale Parameter Optimizing The Integrator Step Size for Hamiltonian Monte Carlo Empirical logit analysis is not logistic regression Tensorflow: A system for large-scale machine learning ArviZ a unified library for exploratory analysis of Bayesian models in Python SCANPY: large-scale single-cell gene expression data analysis Scikit-learn: Machine learning in Python. the caret: Classification and regression training. R package version 6 Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis Re-analysis of microglia data in Alzheimer's disease (AD) mouse model 15 . (a) Joint cell type annotation of cells. (b) Cell distribution in the both wild type (WT) and AD mouse model Compositional analysis of Haber et al. 6 on the response to pathogen infection in the small intestinal epithelium of the mouse. Significant and credible results are depicted as colored bars (Red: scCODA, purple: Dirichlet regression), stars depict the significance of the Dirichlet regression model Supplementary Figure 6: Convergence of HMC sampling for many cell types (in data of Smillie et al.). (a-b) Inclusion probabilities for pairwise tests in the epithelium (a) and lamina propria (b) of healthy donors and patients of UC. Colors depict the tested levels, symbols depict the addition of pseudocounts, symbol sizes depict different references Trace plots of different chain lengths for the parameter inference in Goblet cells with pseudocount comparing healthy and inflamed samples (reference Non-Epithelial) (c) and Cycling T cells without pseudocount comparing healthy and inflamed samples We would like to thank Malte Luecken for his support in the initial design of the study, as well as Karin (corresponds to subsample 20 in the plot), where one donor from each group was removed in each step. Each subsampling step was repeated five times. Ground truth for MCC computation is the set of statistically differentially abundant cell types in the full dataset.