key: cord-0318525-x5u2w0nn authors: Weinberger, Ethan; Lin, Chris; Lee, Su-In title: Isolating salient variations of interest in single-cell transcriptomic data with contrastiveVI date: 2022-04-06 journal: bioRxiv DOI: 10.1101/2021.12.21.473757 sha: 163031b84889439d63dc4de2852b4c17470391db doc_id: 318525 cord_uid: x5u2w0nn

Single-cell RNA sequencing (scRNA-seq) technologies enable a better understanding of previously unexplored biological diversity. Oftentimes, researchers are specifically interested in modeling the latent structures and variations enriched in one target scRNA-seq dataset as compared to another background dataset generated from sources of variation irrelevant to the task at hand. For example, we may wish to isolate factors of variation only present in measurements from patients with a given disease as opposed to those shared with data from healthy control subjects. Here we introduce Contrastive Variational Inference (contrastiveVI; https://github.com/suinleelab/contrastiveVI), a framework for end-to-end analysis of target scRNA-seq datasets that decomposes the variations into shared and target-specific factors of variation. On four target-background dataset pairs, we apply contrastiveVI to perform a number of standard analysis tasks, including visualization, clustering, and differential expression testing, and we consistently achieve results that agree with known biological ground truths.

Figure 1: Overview of contrastiveVI. Given a reference background dataset and a second target dataset of interest, contrastiveVI separates the variations shared between the two datasets and the variations enriched in the target dataset. a, Example background and target data pairs. Samples from both conditions produce an RNA count matrix with each cell labeled as background or target. b, Schematic of the contrastiveVI model.
A shared encoder network $q_{\phi_z}$ transforms a cell into the parameters of the posterior distribution for $z$, a low-dimensional set of latent factors shared across target and background data. For target data points only, a second encoder $q_{\phi_t}$ encodes target data points into the parameters of the posterior distribution for $t$, a second set of latent factors encoding variations enriched in the target dataset and not present in the background.

To illustrate the advantages of contrastiveVI, we benchmarked its performance against that of three previously proposed methods for analyzing raw scRNA-seq count data. First, to demonstrate that our contrastive approach is necessary for isolating enriched variations in target datasets, we compared against scVI [30]. scVI has achieved state-of-the-art results on many tasks; however, it was not specifically designed for the contrastive analysis (CA) setting and thus may struggle to capture salient variations in target samples. We also compared against two contrastive methods designed for analyzing scRNA-seq count data: the contrastive Poisson latent variable model (CPLVM) and the contrastive generalized latent variable model (CGLVM) [22]. While these methods are designed for the contrastive setting, they both make the strong assumption that linear models can accurately capture the complex variations in scRNA-seq data. To our knowledge, CPLVM and CGLVM are the only existing contrastive methods for analyzing scRNA-seq count data.

Qualitatively (Fig. 2a), we find that none of these baseline models are able to separate pre- and post-transplant cells as well as contrastiveVI can. This finding is further confirmed by quantitative results (Fig. 2b). Across all of our metrics we find that contrastiveVI significantly outperforms baseline models, with especially large gains in the ARI and AMI. These results indicate that contrastiveVI recovered the variations enriched in the AML patient data far better than baseline models.

[Figure 2 caption, partial] Higher values indicate better performance for all metrics. For each method, the mean and standard error across five random trials are plotted. c, contrastiveVI's salient latent representations of the target dataset were clustered into two groups, and pathway enrichment analysis was then performed on the differentially expressed genes between the two clusters.

contrastiveVI separates intestinal epithelial cells by infection type

We also applied contrastiveVI to data collected in Haber et al. [...] Here our goal is to separate cells by infection type in the salient latent space. On the other hand, any separations in the background latent space should reflect variations shared between healthy and infected cells, such as those due to differences between cell types. We present our results in Figure 3.

We find that contrastiveVI successfully separates cells by infection type in its salient latent space (Fig. 3a). Moreover, we find that cells mix across infection types in the contrastiveVI background latent space as expected (Fig. 3b). These results indicate that enriched varia- [...] Table 2). These enriched pathways are consistent with previous findings that lipids and lipoproteins partake in innate immunity [44, 25] [...] of Salmonella via autophagy [20]. Furthermore, six of the ten differentially expressed genes in the enriched pathways were found to have pathogen-specific expression in Haber et al. [17] (e.g. Apoc2 and Fabp1), while the other four genes belong to the same families as differentially expressed genes specific to Salmonella or H. polygyrus (e.g. Apoc3 and Fabp2). These results show that contrastiveVI can be used to identify and interpret biologically relevant subgroups in target data.
Figure 3: contrastiveVI isolates responses to different infections in mouse intestinal epithelial cells. a,b, UMAP plots of contrastiveVI's salient and background representations colored by infection type. Cells are correctly separated by infection type in the salient space, while they mix across infection types in the background space. c, Clustering metrics quantify how well cells separate by infection type for scVI's single latent space and contrastive models' salient latent spaces, with means and standard errors across five random trials plotted. d,e, UMAP plots of contrastiveVI's salient and background representations colored by cell type. Cells separate well by cell type in the background space, while they mix across cell types in the salient space. f, Quantifying how well cells separate by cell type in scVI's single latent space and contrastive models' background latent spaces, with means and standard errors across five random trials for each method.

For this dataset we further validated contrastiveVI's ability to disentangle target and background variations using ground truth cell type labels provided by Haber et al. [17]. In particular, we found strong mixing across cell types in contrastiveVI's salient latent space (Fig. 3d). This result agrees with the analysis in Haber et al. [17], which found that responses to the two pathogens were mostly cell-type agnostic. On the other hand, cell types separated clearly in the background latent space (Fig. 3e). This result also agrees with prior biological knowledge, as we would expect the underlying factors of variation that distinguish cell types to be shared across healthy and infected cells. We quantified the degree of this cell-type separation in contrastiveVI's background latent space using our set of clustering metrics (Fig. 3f).
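As an illustration of one of the clustering metrics referred to above, the following is a minimal sketch of the average silhouette width computed from latent embeddings and reference labels. It is not the authors' evaluation code: the function name `average_silhouette_width`, the use of Euclidean distance, the zero score assigned to singleton clusters, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math
from collections import defaultdict


def average_silhouette_width(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention adopted here for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: distance to the point's own cluster
        b = min(  # separation: distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D "latent embeddings": two tight, well-separated groups of cells.
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
matching_labels = ["type A"] * 3 + ["type B"] * 3
mixed_labels = ["type A", "type B"] * 3

separated_score = average_silhouette_width(embeddings, matching_labels)
mixed_score = average_silhouette_width(embeddings, mixed_labels)
```

Scores near 1 indicate that cells lie close to other cells sharing their label and far from the rest, whereas scores near 0 or below indicate mixing across labels, as in the second labeling.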
We find that contrastiveVI's background latent space is far better at capturing differences between cell types than previously proposed contrastive methods' background latent spaces. Moreover, we find that contrastiveVI's background latent space separates cell types to a similar degree as the non-contrastive scVI's latent space.

Qualitatively, contrastiveVI's salient latent space stratifies cells based on TP53 mutation status (Fig. 4a). Our quantitative metrics also indicate that contrastiveVI separates the two classes of target cells more clearly than baseline methods (Fig. 4b). Moreover, we find that the clusters identified by applying k-means clustering to the contrastiveVI salient latent space have differentially expressed genes enriched in the p53 signaling pathway (Fig. 4c). It is worth noting that the p53 signaling pathway is the only statistically significant (under 0.05 false discovery rate) pathway identified by contrastiveVI. These results demonstrate that contrastiveVI captures salient variations in the target samples treated with idasanutlin that specifically relate to the ground truth mechanism of idasanutlin perturbation.

We further evaluated contrastiveVI's performance on this dataset by embedding all cells, whether treated with DMSO or idasanutlin, into the model's background latent space. Ideally, contrastiveVI's background latent space would only capture variations that distinguish cell lines and not those related to treatment response. In particular, we would expect strong mixing between DMSO- and idasanutlin-treated cells even for cell lines with wild type TP53. We find that wild type TP53 cell lines clearly separate by treatment type in the original data [...]

[Figure 4 caption, partial] b, The average silhouette width (silhouette), adjusted Rand index (ARI), and normalized mutual information (NMI), with mean and standard error across five random trials plotted for each method.
c, Two clusters identified by k-means clustering on contrastiveVI's salient latent representations of the idasanutlin-treated cells. Highly differentially expressed genes were identified from the two clusters, and these genes were used to perform pathway enrichment analysis. d,e,f, UMAP plots of contrastiveVI's background latent space colored by treatment type (d), TP53 mutation status (e), and cell line (f).

Finally, we applied contrastiveVI to data collected using the Perturb-Seq [11, 4] platform. [...] regulatory circuits related to innate immunity [21], the unfolded protein response pathway [4], and the T cell receptor signaling pathway [10], among other applications. Despite these successes, recent work [22] has suggested that naive approaches for analyzing Perturb-Seq [...] We would expect cells to separate by these gene programs; however, in the latent space of an scVI model we observed significant mixing between cells with different gene program labels (Fig. 5a). Moreover, using data from cells treated with control guides as a background dataset, we find that the previously proposed contrastive models CPLVM ( [...]

On the other hand, using the same background dataset, we find qualitatively that contrastiveVI better separates cells by gene program in its salient latent space (Fig. 5d). Furthermore, we find that the relative positions of the different gene programs in the con- [...]

[...] gamma distribution is parameterized by the mean $\rho_{ng} \in \mathbb{R}_+$ and shape $\theta_g \in \mathbb{R}_+$. Furthermore, following the generative process, $\theta_g$ is equivalent to a gene-specific inverse dispersion parameter for a negative binomial distribution, and $\theta \in \mathbb{R}^G_+$ is estimated via variational Bayesian inference. $f_w$ and $f_g$ in the generative process are neural networks that transform the latent space and batch annotations to the original gene space, i.e.: [...] where $d$ is the size of the concatenated salient and background latent spaces.
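The relationship between the gamma parameterization and the negative binomial distribution stated above can be checked numerically. The sketch below assumes a deliberately simplified setting (a single gene, no library size scaling, no dropout); the helper names and the parameter values are hypothetical. It draws a gamma variable with mean $\rho$ and shape $\theta$, uses it as a Poisson rate, and compares the empirical variance of the resulting counts with the negative binomial variance $\rho + \rho^2/\theta$ implied by an inverse dispersion of $\theta$.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson sample with Knuth's multiplication method (small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_gamma_poisson(mean, shape, n_samples, seed=0):
    """Counts y ~ Poisson(w), where w ~ Gamma with the given mean and shape."""
    rng = random.Random(seed)
    scale = mean / shape  # shape * scale equals the requested gamma mean
    return [
        sample_poisson(rng.gammavariate(shape, scale), rng)
        for _ in range(n_samples)
    ]


rho, theta = 4.0, 2.0  # illustrative mean and shape (inverse dispersion)
counts = sample_gamma_poisson(rho, theta, n_samples=50_000)

empirical_mean = sum(counts) / len(counts)
empirical_var = sum((c - empirical_mean) ** 2 for c in counts) / (len(counts) - 1)

# Variance of a negative binomial with mean rho and inverse dispersion theta.
nb_var = rho + rho ** 2 / theta
```

With these values the negative binomial variance is 12, three times the Poisson variance of 4, which illustrates how a smaller $\theta_g$ corresponds to stronger overdispersion for that gene.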
The network $f_w$ is constrained during inference to encode the mean proportion of transcripts expressed across all genes by using a softmax activation function in the last layer. That is, letting $f_w^g(z_n, t_n, s_n)$ denote the entry in the output of $f_w$ corresponding to gene $g$, we have $\sum_g f_w^g(z_n, t_n, s_n) = 1$. The neural network $f_h$ encodes whether a particular gene's expression has dropped out in a cell due to technical factors.

[...]

$$q_{\phi_x}(z_n, t_n, \ell_n \mid x_n, s_n) = q_{\phi_z}(z_n \mid x_n, s_n)\, q_{\phi_t}(t_n \mid x_n, s_n)\, q_{\phi_\ell}(\ell_n \mid x_n, s_n). \tag{1}$$

Here $\phi_x$ denotes a set of learned weights used to infer the parameters of our approximate [...] (2)

Next, for background data points we approximate the posterior using the factorization: [...]

where $r^g_{a_i,b_j} := \log_2(\rho^g_{a_i}) - \log_2(\rho^g_{b_j})$ is the log fold change of the denoised, library size-normalized expression of gene $g$, and $\delta$ is a pre-defined threshold for log fold change magnitude to be considered biologically meaningful. The posterior probability of differential expression is therefore expressed as $p(M^g_1 \mid x_{a_i}, x_{b_j})$, which can be obtained via marginalization of the latent variables and categorical covariates: [...] $\iint p(M^g_1 \mid x_a, x_b)\, dp(a)\, dp(b)$, where we assume that the cells $a$ and $b$ are independently sampled $a \sim \mathcal{U}(a_1, \ldots, a_m)$ and [...]

[...] CA methods that explicitly model count-based scRNA-seq normalization [22]. We present a summary of previous work in CA in Supplementary Fig. 2 [...]

[...] various small-molecule therapies. For our target dataset, we used data from cells that were exposed to idasanutlin, and for our background we used data from cells that were exposed [...]

[...] where $n_{ij}$ is the number of cells assigned to cluster $i$ based on the reference labels and cluster $j$ based on a clustering algorithm, $a_i$ is the number of cells assigned to cluster $i$ in the reference set, and $b_j$ is the number of cells assigned to cluster $j$ by the clustering algorithm.
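The quantities $n_{ij}$, $a_i$, and $b_j$ defined above are the ingredients of the standard (Hubert and Arabie) adjusted Rand index. Assuming that this standard form is the one intended here, a minimal sketch of the computation from two label vectors could look as follows; the function name, the handling of the degenerate case, and the toy labels are illustrative assumptions rather than the authors' code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(reference, predicted):
    """Adjusted Rand index between reference labels and predicted cluster labels."""
    if len(reference) != len(predicted):
        raise ValueError("label vectors must have the same length")
    n = len(reference)
    if n < 2:
        raise ValueError("at least two cells are required")

    n_ij = Counter(zip(reference, predicted))  # contingency table entries
    a_i = Counter(reference)                   # reference cluster sizes
    b_j = Counter(predicted)                   # predicted cluster sizes

    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_a = sum(comb(c, 2) for c in a_i.values())
    sum_b = sum(comb(c, 2) for c in b_j.values())

    expected = sum_a * sum_b / comb(n, 2)  # index expected under random labeling
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both labelings place all cells in one cluster);
        # returning 1.0 is a convention adopted for this sketch.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


reference_labels = ["A", "A", "A", "B", "B", "B"]
perfect_clusters = [0, 0, 0, 1, 1, 1]
imperfect_clusters = [0, 0, 1, 1, 1, 1]

ari_perfect = adjusted_rand_index(reference_labels, perfect_clusters)
ari_imperfect = adjusted_rand_index(reference_labels, imperfect_clusters)
```

In this toy example the first clustering reproduces the reference partition exactly, while the second misplaces one cell and therefore receives a lower score.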
ARI values closer to 1 indicate stronger agreement between the reference labels and labels assigned by a clustering algorithm.

Normalized mutual information

The normalized mutual information (NMI) measures the agreement between reference clustering labels and labels assigned by a clustering algorithm. The NMI is calculated as [...] where $P$ and $T$ denote empirical distributions for the predicted and true clusterings, $I$ denotes mutual information, and $H$ the Shannon entropy.

10x Genomics. Support: single cell gene expression datasets
Vitamin D differentially regulates Salmonella-induced intestine epithelial
linking CRISPR-pooled screens with single-cell RNA-seq
Contrastive latent variable modeling with application to case-control sequencing experiments
KEGG: Kyoto Encyclopedia of Genes and Genomes
Ten years of pathway analysis: current approaches and outstanding challenges
Thematic review series: The pathogenesis of atherosclerosis. Effects of infection and inflammation on lipid and lipoprotein metabolism: mechanisms and consequences to the host
Adam: A method for stochastic optimization
Auto-encoding variational Bayes
GATA-1 reprograms avian myelomonocytic cell lines into eosinophils, thromboblasts, and erythroblasts
Probabilistic contrastive principal component analysis
Deep generative modeling for single-cell transcriptomics
scGen predicts single-cell perturbation responses
Mapping single-cell data to reference atlases by transfer learning
Pathway-level information extractor (PLIER) for gene expression data
Single-cell transcriptomic analysis of Alzheimer's disease
Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action
MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices
Exploring genetic interaction manifolds constructed from rich single-cell phenotypes
A general and flexible method for signal extraction from single-cell RNA-seq data
Learning interpretable latent autoencoder representations with annotations of feature sets. bioRxiv
Unsupervised learning with contrastive latent variable models
Massively multiplex chemical transcriptomics at single-cell resolution
Interpretable factor models of single-cell RNA-seq via variational autoencoders
Interaction of pathogens with host cholesterol metabolism
Defining a cancer dependency map
In vivo activation of the p53 pathway by small-molecule antagonists of MDM2
GATA-1 but not SCL induces megakaryocytic differentiation in an early myeloid line
Moment matching deep contrastive latent variable models
Vitamin D signaling, infectious diseases, and regulation of innate immunity. Infection and Immunity
A single-cell atlas of the peripheral immune response in patients with severe COVID-19
Scanpy: large-scale single-cell gene expression data analysis
Single-cell profiling of tumor heterogeneity and the microenvironment in advanced non-small cell lung cancer
Pioneer transcription factors: establishing competence for gene expression
Definition of a FOXA1 cistrome that is crucial for G1 to S-phase cell-cycle transit in castration-resistant prostate cancer
Massively parallel digital transcriptional profiling of single cells
Contrastive learning using spectral