key: cord-0022519-495psy61
authors: Shen, Hongru; Li, Yang; Feng, Mengyao; Shen, Xilin; Wu, Dan; Zhang, Chao; Yang, Yichen; Yang, Meng; Hu, Jiani; Liu, Jilei; Wang, Wei; Zhang, Qiang; Song, Fangfang; Yang, Jilong; Chen, Kexin; Li, Xiangchun
title: Miscell: An efficient self-supervised learning approach for dissecting single-cell transcriptome
date: 2021-10-02
journal: iScience
DOI: 10.1016/j.isci.2021.103200
sha: ebec92d953eeb09fb0b9daaaacde43eab8a528ea
doc_id: 22519
cord_uid: 495psy61

We developed Miscell, a self-supervised learning approach with deep neural network as latent feature encoder for mining information from single-cell transcriptomes. We demonstrated the capability of Miscell with canonical single-cell analysis tasks including delineation of single-cell clusters and identification of cluster-specific marker genes. We evaluated Miscell along with three state-of-the-art methods on three heterogeneous datasets. Miscell achieved at least comparable or better performance than the other methods by significant margin on a variety of clustering metrics such as adjusted rand index, normalized mutual information, and V-measure score. Miscell can identify cell-type specific markers by quantifying the influence of genes on cell clusters via deep learning approach.

Iterative training Time + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ++ + + + + + + Survival probability 

Comprehensive analysis of single-cell RNA-seq (scRNA-seq) is helpful for dissecting tumor microenvironment, exploring developmental lineages, and characterizing dynamic evolution of immune checkpoint blockade (ICB) therapy response in multiple cancer types (Chalmers et al., 2017; Guo et al., 2018; Jerby-Arnon et al., 2018; Zhang et al., 2018; Zheng et al., 2017) . Jerby and colleagues identified a T cell exclusion signature via analyzing scRNA-seq data of malignant melanoma (Jerby-Arnon et al., 2018) . Meanwhile, the heterogeneity, clonal expansion and migration of T cells are reported to be key players in the efficacy of ICB therapy including exhaustion and dysfunction of tumor infiltrating lymphocytes (Li et al., 2019; Tirosh et al., 2016a) . There are several scRNA-seq analysis methods such as Seurat (Butler et al., 2018; Kang et al., 2018) , Scanpy and scVI (Lopez et al., 2018) that have been proposed to elucidate the complex cellular biological processes involved in tumor immune microenvironment and organ development (Asp et al., 2019; Guo et al., 2018; Liao et al., 2020; Zheng et al., 2017) . However, the gene expression profile of single-cell is highly sparse, heterogeneous, noisy and subjected to unavoidable batch effect, which poses a tremendous challenge to learn cell-type invariant representation from scRNA-seq data.

The core procedure of scRNA-seq analysis methods is to learn feature representation of single-cells by embedding each cell into a lower and more compact feature space, where downstream tasks such as cell type clustering and annotation could be performed more efficiently. The aforementioned methods Seurat (Butler et al., 2018; Kang et al., 2018) and Scanpy rely on linear dimensional reduction such as principal component analysis of gene expression profiles to capture the relationships among genes and cells. However, the rationality of the linear assumption of gene expression data remained hypothetical and unproven (Lopez et al., 2018) . Deep learning algorithms offer an alternative way to modeling nonlinear features of gene expression profiles with deep neural networks. For instance, scVI (Lopez et al., 2018) and MARS (Brbic et al., 2020) employed deep generative models to learn single-cell representation for variational inference and unknown cell-type detection, respectively. Deep generative models learn the latent representation space with a unified encoder and decoder by reconstructing the input and output data via minimizing reconstruction loss, which is often difficult to optimize and associated with high computational cost (Grill et al., 2020) . Self-supervised learning is a new paradigm for learning the latent representation of high dimensional data by using itself as a label to predict any parts of its input from any observed parts . The self-supervised learning is rapidly approaching the performance of supervised learning in computer vision including image classification (Chen et al., 2020a (Chen et al., , 2020b .

The analytical framework of Miscell ( Figure 1 ) consists of developing a momentum contrast self-supervised learning framework (Chen et al., 2020b; He et al., 2019) for feature representation learning of gene expression ( Figure 1A) , application of the learned features in downstream tasks including delineation of single-cell clusters ( Figure 1B) , identification of cluster-specific genes ( Figure 1C) , classification of different cell types ( Figure 1D ), and examination of the learned single-cell features with clinical phenotypes in bulk tumors (Figure 1E) . The feature encoder used in the self-supervised learning framework is a densely connected deep neural network (DenseNet) with 21 layers (Data S1). The DenseNet has the advantages of improving feature propagation, encouraging feature reuse and alleviating vanishing-gradient problems (Huang et al., 2016) . The feature representation component in Miscell is trained end-to-end to minimize the contrastive loss function (Chen et al., 2020b) of the original gene expression pattern x and its corresponding augmented version x 0 . Data augmentation for gene expression includes random zeroing out and shuffling 20% of genes. Parameters of the encoder f ðq k Þ and multiple layer perceptron (MLP) gðq k Þ projection head are updated end-to-end by stochastic gradient descent, whereas parameters of momentum encoder f ðq q Þ and MLP gðq q Þ are updated gradually in a moving-averaged manner ( Figure 1A and see STAR Methods). The MLP maps latent features learned by the encoder to a new space that is beneficial to contrastive learning (Chen et al., 2020a) . Miscell maintains a queue of negative examples in that the oldest one is discarded and the current one is enqueued. The momentum encoder can prevent solution collapsing (Chen et al., 2020b) . The developed encoder can be viewed as a trainable feature extractor for condensing highdimensional and sparse single-cell expression profiles into much lower and more compact features. The extracted latent features y are used as input for downstream tasks including delineation of single-cell clusters and identification of cluster-specific marker genes. To detect cluster-specific markers, we added a 1-layer linear classifier at the end of the developed encoder and trained it end-to-end by freezing the parameters of the encoder. Subsequently, we applied an integrated gradient algorithm (Sundararajan et al., 2017) to quantify the impact of each gene on the decision made by the linear classifier as a surrogate statistic to identify marker genes ( Figure 1D and STAR Methods). We examined the association of single-cell features with clinical phenotypes such as immune chokepoint blockade therapy response and prognosis in bulk tumors by applying the developed encoder to expression profiles of bulk tumors.

We evaluated the clustering performance of Miscell on three datasets including 43,095 peripheral blood mononuclear cells (PBMC) subjected to 10x sequencing, 55,656 single-cells across 20 mouse organs in the Muris (Tabula Muris et al., 2018) and 71,494 single-cells subjected to smart-seq sequencing protocol (Smart-Seq) that were aggregated from 10 previous studies (Filbin et al., 2018; Guo et al., 2018; Jerby-Arnon et al., 2018; Puram et al., 2017; Tirosh et al., 2016a Tirosh et al., , 2016b Venteicher et al., 2017; Zhang et al., 2018 Zhang et al., , 2020 Zheng et al., 2017) (Table S1 ), which encompass tumor cells, stroma cells, dendritic cells, CD4+ T cells, and CD8+ T cells from 7 cancer types. We set the queue size (q) = 1024 and momentum coefficient (q) = 0.999 according to the parameter tuning results on the Smart-seq dataset ( We systematically compared the clustering performance of Miscell to three state-of-the-art methods that are widely used for single-cell transcriptome analysis. These three methods include Scanpy , Seurat (Butler et al., 2018; Kang et al., 2018) and scVI (Lopez et al., 2018) . Miscell performed clustering in the latent feature space projected by the encoderf ðq k Þ. Scanpy and Seurat performed clustering in a lower dimensional space by applying principal component analysis to the original expression matrices, whereas scVI performed clustering in latent space projected by a deep generative model. The t-SNE embedded scatterplots shown in Figure 

iScience 24, 103200, November 19, 2021 3 iScience Article dataset, the cluster labels and true labels (i.e., cell types) are shown adjacently (Figures 2A-2D , 2F-2I, and 2K-2N). The true labels for Smart-seq dataset are only available for CD4+ and CD8+ T cells from Fluorescence-Activated Cell Sorting, which acounted for 30% (21,213/71,494) of total cells. In the PBMC dataset, Miscell is better than the other methods in identifying cell types of small numbers. For example, Miscell can detect distinct clusters of platelets and dendritic cells, whereas the other methods often intermingle these iScience Article small number of cells into other cell clusters ( Figure S1 ). Miscell achieved higher clustering metrics as measured by true labels or cluster compactness ( Figure 2E ). The qualitative values of these clustering metrics are provided in Table S3 . In the Smart-seq dataset, Miscell is able to identify visually more compact and less admixed clusters of CD4+ and CD8+ T cells as compared with the other three methods ( Figures 2F-2I ). Miscell achieved significantly higher clustering performance among a variety of clustering metrics in the Smart-seq dataset (Figures 2J; Table S4 ). In the Muris dataset, Miscell is able to identify cell clusters (Table  S5 ) across 20 mouse organs ( Figure 2K ). More specifically, in the PBMC dataset, Miscell achieved a gain of 33% for ARI, 6% for NMI, and 7% for V-measure score over the second-best method ( Figure 2E ). Meanwhile, Miscell achieved a gain of 120% for ARI, 66% for NMI, and 67% for V-measure score over the second-best method on the Smart-seq dataset ( Figure 2J ), and Miscell gained 10% for ARI and comparable NMI and V-measure as compared with the second-best method on the Muris dataset ( Figure 2O ; Table S6 ). In addition, a better clustering result ( Figure S2 ) of Miscell was also observed when it was applied to the batchcorrected expression matrix of the PBMC dataset corrected by Harmony (Korsunsky et al., 2019) .

In addition, Scanpy can also benefit from latent features learned by Miscell. Scanpy achieved significant improvement across these two datasets among all clustering metrics except ARI; however, its performance in this scenario is still inferior to Miscell ( Figure S3 ).

Miscell allows for identifying cluster-specific marker genes with an attribution method (see STAR Methods), which is to quantify the impact of a given gene on a specific cluster. We took the Smart-seq dataset as an exemplar to demonstrate the ability of Miscell in this scenario. Miscell detected 63 clusters ( Figure 3A ) on the Smart-seq dataset. CD4+ T cells are dispersed in cluster C29, C49, C54, and C55, whereas CD8+ T cells are dispersed in cluster C8, C47, C50, C51, C52, and C53 ( Figure 3B ). We calculated the attribution scores of each gene for each cluster. The attribution scores reflect the quantitative influence of genes on cell clusters and are used as a surrogate statistical value in downstream functional annotation. The result shows that Miscell is able to identify cell-type specific markers. For example, C29, a sub-cluster of CD4+ T cells, is characterized by high attribution score of SLAMF7 and PRF1, which are specific markers of cytotoxic CD4+ T cells (Della-Torre et al., 2018; Herndler-Brandstetter et al., 2018) ( Figure 3B ). C49 is a sub-cluster of CD4+ T cells, which is marked with genes associated with CD4+ Treg cells (Kavanagh et al., 2008) such as TIGIT and CCDC22, and enrichment of T cell dysfunction ( Figure 3B ). C8, a sub-cluster of CD8+ T cells, is marked with XCR1, SLC4A10, and CTLA4 ( Figure 3B ). Functional annotation indicated that C8 was associated with tumor evasion (Qin et al., 2019; Ribas and Wolchok, 2018) . The C53 cluster is featured by overrepresentation of KLRG1, which is a marker of KLRG1+ effector CD8+ T cell and can induce proliferation of effector CD8+ T cells while receiving inflammatory signals (Herndler-Brandstetter et al., 2018) .

We compared the marker genes of CD4+ and CD8+ T cells identified by Miscell, Seurat and Scanpy from the Smart-seq dataset (Table S8) . Miscell is able to identify traditional marker genes of CD4+ T cells (i.e., CD4 and CD40LG) and for CD8+ T cells (i.e., CD8A, CD8B, and KLRK1) that were also found by Scanpy and Seurat. In addition, Miscell identified CCR4 as the marker gene of CD4+ T cell (Sugiyama et al., 2013) and XCL1 as the marker gene of CD8+ T cell (Brewitz et al., 2017) .

Here, we demonstrated that Miscell is capable of transferring knowledge learned from single-cells to extract biologically meaningful features from bulk tumor expression data. We applied Miscell to extract latent features from an ICB clinical trial of 298 urothelial carcinoma patients subjected to bulk tumor RNA sequencing. Sixty-eight patients achieved CR/PR whereas 230 achieved SD/PD (see STAR Methods). Among 64 latent features, three of them are significantly associated with ICB therapy response (Wilcoxon test, adjusted p < 0.05, Figure 4A ). For instance, patients with low score (i.e., less than 25% quantile value) of 31 th feature was characterized by upregulation of genes related to activated immune environment including IFNG (Spranger et al., 2013) , IL12B, CXCR3 (House et al., 2020) , and CD19 (Yazawa et al., 2003) ( Figure 4C ). In addition, one of these three features exhibited significant association with prognosis (Log rank test, p = 0.004) whereas the other two also exhibited marginal significant association ( Figure S4 ). Analogously, we examined the association between extracted latent features by Miscell from TCGA pan-cancer expression data and overall survival. We found that the 54 th feature ( Figure 4B ) and the other 16 features are significantly associated with prognosis after controlling for Age, Sex, and TNM stage (Table S7 and iScience Article genomic and transcriptomic alteration signatures (Thorsson et al., 2018) , Hallmarks of cancer and immunerelated circuits (Liberzon et al., 2015) ( Figure 4D ). For instance, the 12 th feature is positively correlated with Age, Proliferation, Wound healing, G2M checkpoint and tumor evasion ( Figure 4D ).

We developed a cell type classifier based on the latent features of Miscell applied to the PBMC dataset with a single-layer linear classifier (STAR Methods). Miscell achieved high classification performance in detecting 7 cell types of PBMC with area under the curve ranging from 0.908 for CD8+ T cell to 0.997 for monocyte and accuracy from 92.6% to 99.8% ( Figure S5 ).

Miscell is a self-supervised learning approach, which is used to address the fundamental analytical tasks of single-cell transcriptomes including delineation of single-cell clusters and identification of cluster-specific marker genes. Noticeably, Miscell can transfer knowledge learned from single-cells to extract meaningful features from bulk tumors. Miscell outperforms state-of-the-art methods examined in this study by significant margin on multiple clustering metrics. This is probably because of better feature representation learning capability of Miscell enabled by the utilization of DenseNet as feature encoder and the contrastive momentum self-supervised learning. The DenseNet can improve information flow, enable feature reuse and substantially reduce the number of parameters; consequently, it is less prone to overfitting (Huang et al., 2016) . Study showed that the surface of loss function of DenseNet is smoother than its plain and sequential counterpart; therefore, it is easier to minimize the loss function (Li et al., 2017) . Apart from the advantage of using DenseNet as encoder in Miscell, momentum contrastive self-supervised learning is the key determinant leading to its superior performance. Momentum contrastive self-supervised learning achieved comparable performance in visual representation learning of images as compared with supervised representation learning (Chen et al., 2020b; He et al., 2019) . As compared with a similar method proposed by Ciotran and colleagues (Ciortan and Defrance, 2021) used a network of 3 linear layers as feature encoder, we observed that Miscell achieved better performance ( Figure S6 ). This is probably because of better representation capacity of the feature encoder used by Miscell.

The use of integrated gradient algorithm to identify cluster-specific markers by Miscell has unique advantage in that it defines markers in the context of complex interactions among all genes as compared with differential analysis such as t test and Wilcoxon rank-sum test adopted by Seurat and Scanpy, which assume independent expression of genes. Our analysis showed that Miscell is able to identify KLRG1 as markers of KLRG1+ effector CD8+ T cell and SLAMF7 and PRF1 as markers of cytotoxic CD4+ T cell, which justifies the feasibility of using the integrated gradient algorithm as marker detection. Besides, Miscell has the potential to discover new cell types. For example, C8, a subgroup of CD8+ T cells, is marked with overrepresentation of CTLA4 ( Figure 3 ) and involved in tumor escape (Pardoll, 2012) . C49, a subgroup of CD4+ Treg cells, is enriched for TIGIT and CCDC22 ( Figures 3A  and 3B ) and may contribute to HIV persistence during antiretroviral therapy (Fromentin et al., 2016) . Miscell is able to identify traditional marker genes for CD4+ and for CD8+ T cells and new marker genes that might be missed by Scanpy and Seurat such as CCR4 and XCL1. CCR4 is a chemokine receptor of CD4+ T cells whose upregulation could evoke antitumor immune response (Sugiyama et al., 2013) . XCL1 was reported to be significantly associated with the number of tumor-infiltrating CD8+ T cells in squamous cell carcinoma (Tamura et al., 2020). Miscell would complement Scanpy and Seurat in identifying marker genes as it uses a different strategy. In this study, the identification of cluster-specific marker genes by Miscell is limited to the highly variable genes. However, cluster-specific marker genes may be missed in HVGs. Therefore, training Miscell on a full gene list would allow it to identify new marker genes. iScience Article improvement in the identification of single-cell clusters albeit it is still inferior to Miscell ( Figure S3 ). This indicates Miscell is better at disentangling the complex expression patterns of single cells, therefore, beneficial for downstream clustering tasks. Although both Miscell and scVI are all based on deep learning algorithms, the former is based on self-supervised learning whereas the latter is based on deep generative approach. Miscell uses contrastive loss to minimize the similarity of the input data and its various augmented versions, whereas scVI needs original feature generation and may not be necessary for feature representation learning. As Miscell is built upon deep neural networks, it requires a large amount of data for training. Therefore, it would not be appropriate for small sample sizes. In addition, there are different batch-correction methods such as Harmony (Korsunsky et al., 2019) , MNN (Haghverdi et al., 2018) and BBKNN (Pola nski et al., 2020) . The selection of batch-correction often depends on the scientific issues to address; therefore, we did not include batch-correction in Miscell and leave it as a data preprocessing step.

Analogous to autoencoders, the trained encoder of Miscell can be used to extract latent features from bulk tumors. These latent features showed significant associations with cancer immunotherapy response (Figure 4A ), overall survival ( Figures 4B and S4) , alteration of immune signatures and cancer hallmarks (Liberzon et al., 2015) ( Figure 4D ). This indicates that Miscell may be helpful to facilitate basic research findings into clinical utilities. The biological mechanisms underlying the latent features of Miscell trained on these single-cell datasets remain unknown and necessitate exploration in future study.

In summary, we present a new approach Miscell that is built upon contrastive momentum self-supervised learning to dissect single-cell transcriptomes with state-of-the-art performance. Miscell will facilitate basic research of single-cell transcriptomics into clinical utility.

Analogous to methods built upon deep neural networks, Miscell demands a large sample size to obtain good performance. Miscell does not internally include any batch-correction methods. Therefore, users are required to correct for batch effects if necessary before running Miscell. The clinically significant latent feature extracted by Miscell is lack of explainability and future works are required to address this issue.

Detailed methods are provided in the online version of this paper and include the following: Data and code availability d Data analyzed in the manuscript are available in open access database. All data relevant to the study are included in the main text or supplemental information. The detailed information for the collected datasets was provided in the Key resources table and Table S1 . The raw count matrix of the datasets was downloaded from the Gene Expression Omnibus including GEO: GSE96583, GSE103322, GSE89567, GSE108989, GSE98638, GSE102130, GSE146771, GSE99254, GSE72056, GSE115978, GSE70630. The Tabula Muris data were collected from Figshare Data: https://figshare.com/projects/Tabula_ Muris_Senis/64982. The source code of Miscell is available at https://github.com/lixiangchun/Miscell. d Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

We downloaded single-cell gene expression matrices (Filbin et al., 2018; Guo et al., 2018; Jerby-Arnon et al., 2018; Kang et al., 2018; Puram et al., 2017; Tirosh et al., 2016a Tirosh et al., , 2016b Venteicher et al., 2017; Zhang et al., 2018 Zhang et al., , 2020 Zheng et al., 2017) subjected to 10x sequencing. This PBMC dataset includes B cells, CD4+ T cells, CD8+ T cells, dendritic cells, platelets, monocytes and natural killer cells. The Smart-seq dataset consists of 71,494 single cells subjected to smart-seq sequencing aggregated from 10 studies (Filbin et al., 2018; Guo et al., 2018; Jerby-Arnon et al., 2018; Puram et al., 2017; Tirosh et al., 2016a Tirosh et al., , 2016b Venteicher et al., 2017; Zhang et al., 2018 Zhang et al., , 2020 Zheng et al., 2017) encompassing tumour cells, stroma cells, dendritic cells, CD4+ and CD8+ T cells. The Tabula Muris dataset consists of 55,656 single-cells across 20 mouse organs, which were subjected to smart-seq sequencing. We used the cell types sorted by Fluorescence-Activated Cell Sorting as gold standard labels. We collected gene expression and clinical data from a dataset of 298 patients with urothelial carcinoma who received immune checkpoint blockade (ICB) therapy (Mariathasan et al., 2018) .

We discarded genes that were expressed in less than 3 cells in each batch and identified the HVG using Scanpy toolkit. We denoted the gene expression matrix as M where rows are samples and columns are genes. The raw gene count expression data was transformed into counts per million with the cpm in R package edgeR (v3.20.8) followed by log2 transformation. We normalized the gene expression matrix M by columns via subtracting each column by its mean and dividing with standard deviation. The preprocessing steps for bulk tumors consisted of conversion of gene count to counts per million, log2 transformation and feature normalization. The expression matrix for the highly variable genes (HVGs) obtained from the Smart-Seq data was inputted to the model for feature extraction.

Miscell implements self-supervised feature representation learning with momentum contrast by minimizing the contrastive loss of the original single-cell and its corresponding augmented version. The feature encoder of Miscell is a DenseNet (Huang et al., 2016) with 21 layers consisted of 4 dense blocks. We replaced the convolutional layer in the original convolutional DenseNet with linear layer to make it able to process gene expression data. For each layer, all the outputs from preceding layers are used as input via feature concatenation, which could make feature transmission more efficient. The use of multi-layer perceptron (MLP) as project head was demonstrated to be beneficial for contrastive learning (Chen et al., 2020b) .

Miscell trains a feature encoder of single-cell gene expression by matching an encoded query cell q to a dictionary of encoded keysfcell k1 ; cell k2 ; ::: ; cell kn g defined on-the-fly by a set of trained cells. The

iScience 24, 103200, November 19, 2021 13 iScience Article dictionary is built as a queue, and the contrastive loss function is used to iteratively optimize the model. The contrastive loss function is defined as (Chen et al., 2020b) :

The cell q is the feature of query cell, where cell k + is the feature of matching cell in the queue. fcell k À g are the features of negative examples that do not match cell q . t is a hyper-parameter whose value was set to 0.2. q q are updated by back-propagation while q k were updated slowly via q k )mq k + ð1 À mÞq q , where m˛½0; 1Þ. m is a momentum coefficient. The strategy of self-supervised learning is to minimize the differences between the latent features of two augmented versions of each sample towards zero. Therefore, the self-supervised labels are all zeros.

We used stochastic gradient descent to train the network for 500 epochs with a weight decay of 0.0001, initial learning rate of 0.24 and batch size of 256. The learning rate was decayed by 0.1 at epoch 150 and 250. We use data augmentation to increase data diversity and mimic data variation. The data augmentation operations include random zeroing out and random shuffling of gene expression values. We conducted parameter tuning for queue size (q) and momentum coefficient (q). We implemented Miscell with PyTorch framework (v1.6.0).

The developed feature encoder f ðq q Þ can be used as a nonlinear feature extractor f ðq q Þ : M N3G /M N3F , where F is the dimension of the extracted features. The extracted features can be used as input for downstream tasks.

We applied dbscan algorithm (Ester et al., 1996) implemented in scikit-learn package to the embedded features of fast interpolation-based t-SNE (van der Maaten and Hinton, 2008) applied to the extracted features mapped by f ðq q Þ. In addition, the extracted features obtained from Miscell were used as input to the K-Nearest Neighbors (KNN) algorithm to construct KNN graph for subsequent clustering by Leiden (Traag et al., 2019) algorithm using Scanpy with default parameters named Miscell-KNN.

We constructed a new deep neural network (denoted as F : R n /½0; 1) by adding a single linear classifier at the end of the trained feature encoder to predict the identity of the cell clusters. We froze the parameters of the developed encoder and train only the newly added linear classifier. Identification of cluster-specific marker genes can be viewed as quantifying the impact of each input feature (i.e. gene expression) on the classification made by the classifier. We used the attribution score calculated by integrated gradient algorithm (Sundararajan et al., 2017) as surrogate metric for the impact of each gene on classification output. Genes exhibited high variation of attribution scores (i.e. variation coefficient >0.15) among different clusters are considered to be cluster-specific marker genes.

For each cell cluster, we applied integrating gradient algorithm to calculate the important score of the i th gene as the gradient of FðxÞ along the i th dimension, which is expanded as (Sundararajan et al., 2017):

The x andx 0 are the actual and baseline expression levels, respectively. We used zero as the baseline expression.

We applied the developed feature encoder f ðq k Þ to extract features from bulk tumors, i.e. f ðq k Þ : Z N3G / Z N3F , where Z N3G is the gene expression matrix of bulk tumor and Z N3F is the corresponding extracted feature matrix. N denoted the number of samples and G the number of genes. We then analyzed the associations of extracted features with clinical and biological features.

iScience Article Benchmarking tools

We systematically compared the clustering performance of Miscell to Scanpy , Seurat (Butler et al., 2018; Kang et al., 2018) and scVI (Lopez et al., 2018) . Miscell, Scanpy and scVI use the same set of HVGs obtained from Scanpy while Seurat used its own routine to identify HVGs.

Scanpy is a scalable toolkit for analyzing single-cell gene expression data implemented in Python. It implements canonical single-cell analysis tasks such as clustering and differential expression testing etc. Scanpy selects highly variable genes (HVGs) by calculating the dispersion coefficient (defined as the ratio of variance to mean). It places genes into a given bin based on average expression. The HVGs are identified based on the condition that the expression values of genes are highly variable compared to genes with similar average expression within the bin. Scanpy captures the relationships among genes and cells relied on principal components of the expression profiles of HVGs, on which a neighborhood graph of cells is built using KNN algorithm. Subsequently, cell cluster cells are detected by Leiden algorithm (Traag et al., 2019) . We filtered out cells with less than 200 genes expressed or more than 30% mitochondrial counts. The HVGs were selected using Scanpy with default parameter (i.e. min_mean = 0.0125 and max_mean = 3). Leiden community detection algorithm was applied to the first 50 principal components obtained from Scanpy with the default resolution (i.e. resolution = 1). In addition, we used rank_genes_groups from Scanpy to identify cluster-specific markers parameters.

Seurat is an R package designed for analysis and exploration of single-cell RNA-seq data. Seurat is able to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate diverse-source single-cell data. Seurat constructs a Shared Nearest Neighbor (SNN) graph (Xu and Su, 2015) via the first 50 principal components of expression matrix for optimal modularity detection (Waltman and Van Eck, 2013) to determine cell clusters. The gene expression matrix was log-normalized with a scale factor of 1.0 3 10 À4 . HVGs were selected by FindVariableGenes with the number of HVGs set to 2000. We used the default 50 principal components to construct the Shared Nearest-Neighbor (SNN) graph and subsequently applied FindClusters to delineate cluster with default parameter.

scVI (single-cell variational inference) is a PyTorch-based package for probabilistic modeling of single-cell omics data. scVI uses deep neural network and stochastic optimization to aggregate information across similar cells and genes. The scVI supports fundamental analysis tasks of single-cell including clustering and differential expression. We used the default parameters for scVI. The number of neurons of the hidden layer is set to 128, the output neurons to 10 dimensions and dropout rate to 0.1. scVI was trained with a batch size of 256, learning rate of 0.001.

We curated a number of gene sets to annotate the identified cell clusters. The function of these gene sets include markers natural killer cell, tumor cell, fibroblast, B cell, monocyte and macrophage, T cell and endothelial cell, follicular T cell, mitotic CD8+ T cell, mitotic macrophage, M1 macrophage, M2 macrophage, dendritic cell, monocyte, naïve T cell, co-simulator, T regulatory cell, cytotoxic CD8+ T cell, CD8+ T effector memory cell and CD8+ resident memory cell, antigen presentation, ligand, cell adhesion, receptor, and co-inhibitor, interferon associated immune checkpoint blockade therapy, dysfunction T cell, cytotoxic T lymphocyte and tumor evasion. Gene set enrichment analysis was conducted with in R package fgsea (v 1.6.0). The marker genes of the 31 th feature were analyzed with R package limma (v 3.34.9), and the patients with urothelium carcinoma were stratified into two clusters (i.e. <25% quantile group versus >75% quantile group). A differential expressed marker gene was required to satisfy these criteria: absolute value of log-transformed fold-change > 0.15 and adjusted p value < 0.05. In addition, we annotated the marker genes with clinical information including Sex, Stage, Tumor mutation burden (TMB), Response Evaluation Criteria in Solid Tumours (RECIST), Cytotoxic Lymphocytes Infiltration (CTL) score and the expression of CTLA4, CD274 and PDCD1. The CTL score was calculated as the average gene expression level of CD8A, CD8B, GZMA, GZMB and PRF1.

We used V-measure score, normalized mutual information (NMI), adjusted rand index (ARI), Calinski-Harabaz index and Davies Bouldin index as measurement of clustering. We compared the performance of Miscell versus Scanpy (v1.6.0), scVI (v0.6.8), and Seurat (v3. 

A spatiotemporal organ-wide gene expression and cell atlas of the developing human heart

MARS: discovering novel cell types across heterogeneous single-cell experiments

CD8(+) T cells orchestrate pDC-XCR1(+) dendritic cell spatial and functional cooperativity to optimize priming

Integrating single-cell transcriptomic data across different conditions, technologies, and species

Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden

A simple framework for contrastive learning of visual representations

Improved baselines with momentum contrastive learning. arXiv

Contrastive self-supervised clustering of scRNA-seq data

A CD8alpha-subset of CD4+SLAMF7+ cytotoxic T cells is expanded in patients with IgG4-related disease and decreases following glucocorticoid treatment

A density-based algorithm for discovering clusters in large spatial databases with noise

Developmental and oncogenic programs in H3K27M gliomas dissected by single-cell RNAseq

CD4+ T cells expressing PD-1, TIGIT and LAG-3 contribute to HIV persistence during ART

Bootstrap your own latent: a new approach to self-supervised learning

Global characterization of T cells in nonsmall-cell lung cancer by single-cell sequencing

Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors

KLRG1(+) effector CD8(+) T cells lose KLRG1, differentiate into all memory T cell lineages, and convey enhanced protective immunity

Macrophage-derived CXCL9 and CXCL10 are required for antitumor immune responses following immune checkpoint blockade

Multiplexed droplet single-cell RNA-sequencing using natural genetic variation

CTLA4 blockade expands FoxP3+ regulatory and activated effector CD4+ T cells in a dose-dependent fashion

Fast, sensitive and accurate integration of single-cell data with Harmony

Visualizing the loss landscape of neural nets

Dysfunctional CD8 T cells Form a proliferative, dynamically regulated compartment within human melanoma

Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19

The molecular signatures database (MSigDB) hallmark gene set collection

Self-supervised learning: generative or contrastive

Deep generative modeling for single-cell transcriptomics

Visualizing data using t-SNE

TGFbeta attenuates tumour response to PD-L1 blockade by contributing to exclusion of T cells

The blockade of immune checkpoints in cancer immunotherapy

BBKNN: fast batch alignment of single cell transcriptomes

Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer

Novel immune checkpoint targets: moving beyond PD-1 and CTLA-4

Cancer immunotherapy using checkpoint blockade

Upregulation of PD-L1, IDO, and T(regs) in the melanoma tumor microenvironment is driven by CD8(+) T cells

Anti-CCR4 mAb selectively depletes effectortype FoxP3+CD4+ regulatory T cells, evoking antitumor immune responses in humans

The immune landscape of cancer

Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq

Singlecell RNA-seq supports a developmental hierarchy in human oligodendroglioma

From Louvain to Leiden: guaranteeing wellconnected communities

Decoupling genetics, lineages, and microenvironment in IDH-mutant gliomas by single-cell RNA-seq

A smart local moving algorithm for large-scale modularity-based community detection

SCANPY: large-scale single-cell gene expression data analysis

Identification of cell types from single-cell transcriptomes using a novel clustering method

CD19 regulates innate immunity by the toll-like receptor RP105 signaling in B lymphocytes

Lineage tracking reveals dynamic relationships of T cells in colorectal cancer

Single-cell analyses inform mechanisms of myeloid-targeted therapies in colon cancer

We are grateful for researchers for their generosity to made their data publicly available. This work was supported by the Chinese National Key Research and Development Project (no. 2018YFC1315600), National Natural Science Foundation of China (no. 31801117 to X.L., no. 31900471 to M.Y. and 82073287 to Q.Z.), Tianjin Municipal Health Commission Foundation (grant no. RC20027 to Y.L) and IRT_14R40 from the Program for Changjiang Scholars and Innovative Research Team in University in China (K.C.).

Further information and requests for resources and materials should be directed to and will be fulfilled by the lead contact, Xiangchun Li (lixiangchun2014@foxmail.com 2) python package. The confidence interval of mean and standard deviation of these metrics were calculated via random permutation with 1000 times.V-measure score is defined as the harmonic average value of homogeneity and compactness. The homogeneity score is defined as:

HðCjKÞ is the conditional entropy of category division under given cluster division conditions, which is calculated as:HðCÞ is the conditional entropy of category division, which is defined as:where n is the number of instances.n c and n k are the number of instances in cluster c and k, respectively.n c;k is the number of instances in cluster c which are classified into cluster k.The compactness score is defined as c: c = 1 À HðKjCÞ HðCÞ V-measure score is defined as:Normalized Mutual Information (NMI). It is used to evaluate the similarity between two label assignments. Assuming that the clustering labels and actual labels of N cells are U and V, their entropy is defined as:PðiÞlogðPðiÞÞ where pðiÞ = jU i j=N is the probability that a cell picked at random from U falls into U i . In a similar manner, H(V) is defined as:The mutual information (MI) between U and V is calculated by: where n ij denoted the numbers of cell in common between clustering labels and actual labels, a i the sum of i th row and a j the sum of j th column from the contingency table.Calinski-Harabaz Index. It is also known as variance ratio criterion, and based on the concept of dense and well-separated cluster to comprehensively measure the dispersions for between-cluster and within-cluster.Assuming that a population of N cells can be partitioned into k clusters, the Calinskin-Harabaz index CH k is defined as the ratio of between-clusters dispersion and the within-cluster dispersion:where W k is the within-cluster dispersion matrix and defined as:B k is the between group dispersion matrix defined as:where N is the total number of cells, C q the set of cells in cluster q, c q the cell in center of cluster q, c the center of cluster q, n q be the number of cells in cluster q.Davies Bouldin Index. Davies Bouldin index is defined as the average similarity between each cluster C i for i = 1; .; k and most similar one C j . Lower Davies Bouldin index sugggests better clustering results. The Davies Bouldin index is defined as:S i is the average distance between each example in cluster i and the centroid of cluster i. d ij is the distance between centroids of cluster i and j.

The area under the receiver operating curve (AUROC) was used to evaluate classification performance on cell type identification. The AUROC value was calculated by roc routine implemented in R package pROC (v1.10.0). We performed 10-fold cross-validation with sklearn (v0.21.2) to evaluate the classification performance.

We performed the Kaplan-Meier survival analyses to examine the association between latent features learned by Miscell and prognosis using R package survival (v 3.1.12). We stratified patients into two clusters (i.e. C1/2) by determining whether the extracted feature is greater than zero. The survival curves of C1/2 were plotted by R package survminer and their difference was evaluated by log-rank test. p values were subjected for multiple hypothesis test. Two-sided test was used if not specified.

This study did not generate additional data.

iScience 24, 103200, November 19, 2021