title: Deep learning tackles single-cell analysis: a survey of deep learning for scRNA-seq analysis
authors: Flores, Mario; Liu, Zhentao; Zhang, Ting-He; Hasib, Md Musaddaqui; Chiu, Yu-Chiao; Ye, Zhenqing; Paniagua, Karla; Jo, Sumin; Zhang, Jianqiu; Gao, Shou-Jiang; Jin, Yu-Fang; Chen, Yidong; Huang, Yufei
date: 2021-09-25

Since single-cell sequencing was selected as the Method of the Year in 2013, single-cell technologies have matured enough to answer complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in the data collected from single-cell profiling, resulting in computational challenges in processing these massive and complicated datasets. To address these challenges, deep learning (DL) is positioned as a competitive alternative to traditional machine learning approaches for single-cell analyses. Here we present a processing pipeline for single-cell RNA-seq data and survey a total of 25 DL algorithms and their applicability to specific steps in that pipeline. Specifically, we establish a unified mathematical representation of all variational autoencoder, autoencoder, and generative adversarial network models, compare the training strategies and loss functions of these models, and relate the loss functions to the specific objectives of each data-processing step. This presentation will allow readers to choose suitable algorithms for their particular objective at each step in the pipeline. We envision that this survey will serve as an important information portal for learning the application of DL to scRNA-seq analysis and will inspire innovative uses of DL to address a broader range of new challenges in emerging multi-omics and spatial single-cell sequencing.

Single-cell sequencing technology has been a rapidly developing area for studying genomics, transcriptomics, proteomics, metabolomics, and cellular interactions at the single-cell level for cell-type identification, tissue composition, and reprogramming [1, 2]. Specifically, sequencing of the transcriptome of single cells, or single-cell RNA-sequencing (scRNA-seq), has become the dominant technology in many frontier research areas such as disease progression and drug discovery [3, 4]. One particular area where scRNA-seq has made a tangible impact is cancer, where it is becoming a powerful tool for understanding invasion, intratumor heterogeneity, metastasis, epigenetic alterations, rare cancer stem cells, and therapeutic response [5, 6]. Currently, scRNA-seq is applied to develop personalized therapeutic strategies that are potentially useful in cancer diagnosis, in therapy resistance during cancer progression, and for the survival of patients [5, 7]. scRNA-seq has also been adopted to combat COVID-19, elucidating how miscommunication between the innate and adaptive host immune systems worsens the immunopathology produced during this viral infection [8, 9]. These studies have led to a massive amount of scRNA-seq data deposited in public databases such as the 10X single-cell gene expression datasets, the Human Cell Atlas, and the Mouse Cell Atlas. Expression profiles of millions of cells from 18 species have been collected and deposited, awaiting further analysis.
On the other hand, due to biological and technical factors, scRNA-seq data present several analytical challenges related to their complex characteristics: missing expression values, high technical and biological variance, noise, sparse gene coverage, and elusive cell identities [1]. These characteristics make it difficult to directly apply commonly used bulk RNA-seq analysis techniques and have called for novel statistical approaches for scRNA-seq data cleaning and computational algorithms for data analysis and interpretation. To this end, specialized scRNA-seq analysis pipelines such as Seurat [10] and Scanpy [11], along with a large collection of task-specific tools, have been developed to address the intricate technical and biological complexity of scRNA-seq data. Recently, deep learning has demonstrated significant advantages in natural language processing and in speech and facial recognition with massive data [12-14]. These advantages have motivated the application of DL to scRNA-seq data analysis as a competitive alternative to conventional machine learning approaches for cell clustering [15, 16], cell-type identification [15, 17], gene imputation [18-20], and batch correction [21]. Compared to conventional machine learning (ML) approaches, DL is more powerful at capturing complex features of high-dimensional scRNA-seq data. It is also more versatile: a single model can be trained to address multiple tasks or be adapted and transferred to different tasks. Moreover, DL training scales more favorably with the number of cells in scRNA-seq data, making it particularly attractive for handling the ever-increasing volume of single-cell data. Indeed, the growing body of DL-based tools has demonstrated DL's exciting potential as a learning paradigm to significantly advance the tools we use to interrogate scRNA-seq data. In this paper, we present a comprehensive review of recent advances in DL methods for solving the present challenges in scRNA-seq data analysis (Table 1), spanning quality control, normalization/batch-effect reduction, dimension reduction, visualization, feature selection, and data interpretation, by surveying deep learning papers published up to April 2021. To maintain high quality for this review, we chose not to include preprint (bioRxiv/arXiv) papers, although a proportion of these manuscripts contain important new findings that will be published once they complete peer review. Previous efforts to review recent advances in machine learning methods focused on efficient integration of single-cell data [22, 23]. A recent review of DL applications to single-cell data summarized 21 DL algorithms that might be deployed in single-cell studies [24]; it also evaluated the clustering and data-correction performance of these algorithms using 11 datasets. In this review, we focus more closely on the DL algorithms themselves, with much more detailed explanation and comparison. Further, to better understand the relationship of each surveyed DL model to the overall scRNA-seq analysis pipeline, we organize the survey according to the challenges the models address and discuss the models following the analysis pipeline. A unified mathematical description of the surveyed DL models is presented, and specific model features are discussed when reviewing each method. This also sheds light on the modeling connections among the surveyed DL methods and highlights the uniqueness of each model.
Besides the models, we also summarize the evaluation metrics used by these DL algorithms and the methods each algorithm was compared against. The online location of the code, the development platform, and the datasets used for each method are also cataloged to facilitate their use and any additional efforts to improve them. Finally, we have created a companion online version of the paper at https://huang-ai4medicine-lab.github.io/survey-of-DL-for-scRNA-seqanalysis/gitbook/_book, which includes expanded discussion as well as a survey of additional methods. We envision that this survey will serve as an important information portal for learning the application of DL to scRNA-seq analysis and inspire innovative uses of DL to address a broader range of new challenges in emerging multi-omics and spatial single-cell sequencing.

Various scRNA-seq techniques (such as SMART-seq, Drop-seq, and 10X Genomics sequencing [25, 26]) are available nowadays, each with its own advantages and disadvantages. Despite the differences among scRNA-seq techniques, the data content and processing steps of scRNA-seq data are quite standard and conventional. A typical scRNA-seq dataset consists of three files: quantified genes (gene IDs), quantified cells (cellular barcodes), and a count matrix (number of cells × number of genes), irrespective of the technology or pipeline used. A series of essential steps in the scRNA-seq data processing pipeline, and optional tools for each step with both ML and DL approaches, are illustrated in Fig. 1. While scRNA-seq has the advantage of identifying each cell and using unique molecular identifiers (UMIs) for the expression of each gene in a single cell, the data carry increased technical noise and biases [27]. Quality control (QC) is the first and key step, filtering out dead cells, doublets, and cells with failed chemistry or other technical artifacts. The three most commonly adopted QC covariates are the number of counts (count depth) per barcode identifying each cell, the number of genes per barcode, and the fraction of counts from mitochondrial genes per barcode [28]. Normalization is designed to eliminate imbalanced sampling, differences in cell differentiation and viability, and many other factors. Approaches tailored for scRNA-seq have been developed, including BASiCS, a Bayesian method coupled with spike-ins [29]; scran, a deconvolution approach [30]; and sctransform in Seurat, which uses regularized negative binomial regression [31]. Two further steps, batch correction and imputation, are carried out if required by the analysis. • Batch correction addresses a common source of technical variation in high-throughput sequencing experiments arising from varying experimental conditions, such as different technicians and experiment times, which imposes a major challenge in scRNA-seq data analysis. Batch-effect correction algorithms include detection of mutual nearest neighbors (MNNs) [32], canonical correlation analysis (CCA) in Seurat [33], and the Harmony algorithm based on cell-type representation [34]. • Imputation is necessary to handle the highly sparse data matrix caused by missing values, or dropouts, in scRNA-seq data. Several tools have been developed to "impute" zero values in scRNA-seq data, such as SCRABBLE [35], SAVER [36], and scImpute [37]. Dimensionality reduction and visualization are essential steps to represent biologically meaningful variation in high-dimensional data at significantly reduced computational cost.
Dimensionality reduction methods, such as PCA, are widely used in scRNA-seq data analysis for this purpose. More advanced nonlinear approaches that preserve topological structure and avoid overcrowding in the lower-dimensional representation, such as LLE [38] (used in SLICER [39]), t-SNE [40], and UMAP [41], have also been developed and adopted as standards in single-cell data visualization. Clustering analysis is a key step in identifying cell subpopulations or distinct cell types, unraveling the extent of heterogeneity and the associated cell-type-specific markers. Unsupervised clustering is frequently used here to categorize cells into clusters by their similarity, often taking the aforementioned dimensionality-reduced representations as input; examples include the community-detection algorithms Louvain [42] and Leiden [43], and data-driven dimensionality reduction followed by k-means clustering in SIMLR [44]. Feature selection is another important step in single-cell RNA-seq analysis, selecting a subset of genes, or features, for cell-type identification and functional enrichment of each cluster. This step is achieved by differential expression analysis designed for scRNA-seq, such as MAST, which uses linear model fitting and likelihood-ratio testing [45]; SCDE, which adopts a Bayesian approach with a negative binomial model for gene expression and a Poisson process for dropouts [46]; and DEsingle, which uses a zero-inflated negative binomial model to estimate dropouts [47]. Beyond these key steps, downstream analysis can include cell-type identification, co-expression analysis, and prediction of perturbation response, where DL has also been applied. Other advanced analyses, including trajectory inference, RNA velocity, and pseudotime analysis, are not discussed here because most approaches for these topics are not DL-based. As batch correction, dimension reduction, imputation, and clustering are unsupervised learning tasks, we start our review by introducing the general formulations of the variational autoencoder (VAE), the autoencoder (AE), and the generative adversarial network (GAN) for scRNA-seq, together with their training strategies. We will focus on the distinctive features of each method and bring attention to their unique and novel applications to scRNA-seq data in Section 4.

Let $x_n$ represent a $G \times 1$ vector of gene expression (UMI counts or normalized, log-transformed expression) of $G$ genes in cell $n$, where $p(y_{gn} \mid \mu_{gn}, \theta_{gn})$ follows some distribution (e.g., zero-inflated negative binomial (ZINB) or Gaussian), with $\mu_{gn}$ and $\theta_{gn}$ being distribution parameters (e.g., mean, variance, or dispersion) (Fig. 2A). We consider $\mu_{gn}$ to be of particular interest (e.g., the mean counts), and it is thus further modeled by a decoder neural network $f_w$ (Fig. 2A) as

$$\boldsymbol{\mu}_n = f_w(z_n, s_n), \quad (1)$$

where the $g$-th element of $\boldsymbol{\mu}_n$ is $\mu_{gn}$, $w$ is a vector of decoder weights, $z_n \in \mathbb{R}^d$ is a latent representation of the gene expression used for visualization and clustering, and $s_n$ is an observed covariate (e.g., the batch ID). For the VAE, $z_n$ is commonly assumed to follow a multivariate standard normal prior, i.e., $p(z_n) = N(0, I_d)$, with $I_d$ being a $d \times d$ identity matrix. Further, $\theta_{gn}$ in $p(y_{gn} \mid \mu_{gn}, \theta_{gn})$ is a nuisance parameter, which has a prior distribution $p(\theta_{gn})$ and can be either estimated or marginalized in variational inference. Now define $\Theta = \{w, \theta_{gn}\ \forall n, g\}$. Then $p(y_{gn} \mid \mu_{gn}, \theta_{gn})$ and Eq. (1) together define the likelihood $p(x_n \mid z_n, s_n, \Theta)$.
The goal of training is to compute the maximum likelihood estimate of $\Theta$,

$$\hat{\Theta} = \arg\max_{\Theta} \mathcal{L}(\Theta), \quad (2)$$

where $\mathcal{L}(\Theta)$ is the evidence lower bound (ELBO),

$$\mathcal{L}(\Theta) = \mathbb{E}_{q(z_n \mid x_n, s_n)}\left[\log p(x_n \mid z_n, s_n, \Theta)\right] - \mathrm{KL}\!\left(q(z_n \mid x_n, s_n) \,\|\, p(z_n)\right), \quad (3)$$

and $q(z_n \mid x_n, s_n)$ is an approximation to the posterior $p(z_n \mid x_n, s_n)$, assumed to be

$$q(z_n \mid x_n, s_n) = N\!\left(\mu_{z_n}, \sigma_{z_n}^2 I_d\right), \quad (4)$$

with $(\mu_{z_n}, \sigma_{z_n}^2)$ given by an encoder network $g_\phi$ (Fig. 2A) as

$$(\mu_{z_n}, \sigma_{z_n}^2) = g_\phi(x_n, s_n), \quad (5)$$

where $\phi$ is the vector of encoder weights. Now $\Theta = \{w, \phi, \theta_{gn}\ \forall n, g\}$, and equation (2) is solved by stochastic gradient descent while the model is trained. All the surveyed papers that deploy a VAE follow this general modeling process. However, a more general formulation has a loss function defined as

$$L(\Theta) = -\mathcal{L}(\Theta) + \sum_{k=1}^{K} \lambda_k L_k, \quad (6)$$

where the $L_k$, $k = 1, \dots, K$, are losses for different functions (clustering, cell-type prediction, etc.) and the $\lambda_k$ are Lagrange multipliers. With this general formulation in hand, for each paper we examined the specific choice of data distribution $p(y_{gn} \mid \mu_{gn}, \theta_{gn})$ that defines $\mathcal{L}(\Theta)$, the different $L_k$ designed for specific functions, and how the decoder and encoder were applied to model different aspects of scRNA-seq data.

AEs learn the low-dimensional latent representation $z_n \in \mathbb{R}^d$ of expression $x_n$. The AE includes an encoder $g_\phi$ and a decoder $f_w$ (Fig. 2B) such that

$$z_n = g_\phi(x_n), \qquad \hat{x}_n = f_w(z_n), \quad (7)$$

where $\Theta = \{w, \phi\}$ are the encoder and decoder weight parameters, and $\hat{x}_n$ defines the parameters (e.g., mean) of the likelihood $p(x_n \mid \Theta)$ (Fig. 2B) and is often taken as the imputed and denoised expression. Additional designs can be included in the AE model for batch correction, clustering, and other objectives. Training of the AE is generally carried out by stochastic gradient descent to minimize a loss similar to Eq. (6), except that $\mathcal{L}(\Theta) = -\log p(x_n \mid \Theta)$. When $p(x_n \mid \Theta)$ is Gaussian, $\mathcal{L}(\Theta)$ becomes the mean squared error (MSE) loss

$$\mathcal{L}(\Theta) = \sum_n \lVert x_n - \hat{x}_n \rVert^2. \quad (8)$$

Because different AE models differ in their architectures and loss functions, we discuss the specific architecture and loss function of each reviewed DL model in Section 4.

GANs have been used for imputation, data generation, and augmentation in scRNA-seq analysis. Without loss of generality, a GAN applied to scRNA-seq is designed to learn how to generate gene expression profiles from $p_x$, the distribution of $x_n$. The vanilla GAN consists of two deep neural networks [48]. The first network is the generator $G_w(z_n, c_n)$ with parameters $w$, a noise vector $z_n$ drawn from a distribution $p_z$, and a class label $c_n$ (e.g., cell type); it is trained to generate $\tilde{x}$, a "fake" gene expression profile (Fig. 2C). The second network is the discriminator $D_{\phi_D}$ with parameters $\phi_D$, trained to distinguish the "real" $x$ from the fake $\tilde{x}$ (Fig. 2C). The two networks $G_w$ and $D_{\phi_D}$ are trained to outplay each other in a minimax game, in which $G_w$ is forced by $D_{\phi_D}$ to produce better samples, which, at convergence, can fool the discriminator $D_{\phi_D}$ and thus become samples from $p_x$. The vanilla GAN suffers heavily from training instability and mode collapse [49]. To that end, the Wasserstein GAN (WGAN) [49] was developed with the WGAN loss [50]

$$\min_{w} \max_{\phi_D}\; \mathbb{E}_{x \sim p_x}\!\left[D_{\phi_D}(x)\right] - \mathbb{E}_{z \sim p_z}\!\left[D_{\phi_D}(G_w(z))\right]. \quad (9)$$

Additional terms can be added to equation (9) to constrain the functions of the generator. Training with the WGAN loss in Eq. (9) amounts to a min-max optimization that iterates between the discriminator and the generator, where each optimization step is achieved by stochastic gradient descent through backpropagation. The WGAN requires $D_{\phi_D}$ to be K-Lipschitz continuous [50], which can be satisfied by adding a gradient penalty to the WGAN loss [49]. Once training is done, the generator $G_{\hat{w}}$ can be used to generate gene expression profiles of new cells.
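To make this unified formulation concrete, the following is a minimal, self-contained PyTorch sketch of a VAE trained with a ZINB likelihood, i.e., Eqs. (1)-(6) with no extra $L_k$ terms. The layer sizes, the log1p transform of the encoder input, and the gene-wise dispersion parameterization are our own illustrative assumptions, not the exact implementation of any surveyed method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def zinb_nll(y, mu, theta, pi_logit, eps=1e-8):
    """Negative log-likelihood of counts y under ZINB(mu, theta, pi)."""
    log_theta_mu = torch.log(theta + mu + eps)
    # log NB(y; mu, theta)
    log_nb = (theta * (torch.log(theta + eps) - log_theta_mu)
              + y * (torch.log(mu + eps) - log_theta_mu)
              + torch.lgamma(y + theta) - torch.lgamma(theta) - torch.lgamma(y + 1.0))
    # y == 0: log(pi + (1 - pi) * NB(0));  y > 0: log(1 - pi) + log NB(y)
    log_nb_zero = theta * (torch.log(theta + eps) - log_theta_mu)
    zero_case = F.softplus(-pi_logit + log_nb_zero) - F.softplus(-pi_logit)
    nonzero_case = log_nb - F.softplus(pi_logit)
    return -torch.where(y < eps, zero_case, nonzero_case).sum(-1).mean()

class ZINBVAE(nn.Module):
    def __init__(self, n_genes, d_latent=10, n_hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.ReLU())
        self.mu_z = nn.Linear(n_hidden, d_latent)      # mean of q(z|x), Eq. (5)
        self.logvar_z = nn.Linear(n_hidden, d_latent)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(d_latent, n_hidden), nn.ReLU())
        self.out_mu = nn.Linear(n_hidden, n_genes)     # log-mean of ZINB, Eq. (1)
        self.out_pi = nn.Linear(n_hidden, n_genes)     # dropout logits
        self.log_theta = nn.Parameter(torch.zeros(n_genes))  # gene-wise dispersion

    def forward(self, x):
        h = self.enc(torch.log1p(x))                   # encoder g_phi
        mu_z, logvar_z = self.mu_z(h), self.logvar_z(h)
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)  # reparameterize
        h = self.dec(z)                                # decoder f_w
        return (torch.exp(self.out_mu(h)), self.log_theta.exp(),
                self.out_pi(h), mu_z, logvar_z)

def neg_elbo(x, model):
    """-ELBO of Eq. (3): ZINB reconstruction NLL plus KL(q(z|x) || N(0, I))."""
    mu, theta, pi_logit, mu_z, logvar_z = model(x)
    kl = -0.5 * (1 + logvar_z - mu_z.pow(2) - logvar_z.exp()).sum(-1).mean()
    return zinb_nll(x, mu, theta, pi_logit) + kl
```

Training iterates standard optimizer steps on `neg_elbo`; batch covariates $s_n$, additional losses $L_k$, and alternative likelihoods slot into this same skeleton, which is the pattern most of the surveyed VAE methods follow.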
In this section, we survey applications of DL models to scRNA-seq analysis. To better relate the problems each surveyed work addresses to the key challenges in the general scRNA-seq processing pipeline, we divide the survey into sections according to the steps of the pipeline illustrated in Fig. 1. For each DL model, we present the model details under the general framework introduced in Section 3 and discuss the specific loss functions. We also survey the evaluation metrics and summarize the evaluation results. To facilitate cross-referencing, we summarize all algorithms reviewed in this section in Table 1 and tabulate the datasets and evaluation metrics used in each paper in Tables 2 and 3. We also list the other algorithms that each surveyed method was evaluated against in Fig. 3, highlighting how extensively these algorithms have been assessed.

DCA [18] is an AE for imputation (Figs. 2B, 4B) and has been integrated into the Scanpy framework. Model. DCA models UMI counts with missing values using the ZINB distribution

$$p(y_{gn} \mid \Theta) = \pi_{gn}\,\delta_0(y_{gn}) + (1 - \pi_{gn})\,\mathrm{NB}(y_{gn};\, \mu_{gn}, \theta_{gn}), \quad g = 1, \dots, G;\ n = 1, \dots, N, \quad (10)$$

where $\delta_0(\cdot)$ is a Dirac delta function, $\mathrm{NB}(\cdot,\cdot)$ denotes the negative binomial distribution, and $\pi_{gn}$, $\mu_{gn}$, and $\theta_{gn}$, representing the dropout rate, mean, and dispersion, respectively, are functions of the decoder output $\hat{x}_n$, computed through three parallel output layers with additional weight matrices $W_\pi$, $W_\mu$, and $W_\theta$ to be estimated (Eq. (11)). The DCA encoder and decoder follow the general AE formulation of Eq. (7), but the encoder takes the normalized, log-transformed expression as input. To train the model, DCA uses a constrained log-likelihood as the loss function (Eq. (12)) with $\Theta = \{w, \phi, W_\pi, W_\mu, W_\theta\}$. Once DCA is trained, the mean counts $\boldsymbol{\mu}_n$ are used as the denoised and imputed counts for cell $n$. Results. For evaluation, DCA was compared to other methods using simulated data (generated with the Splatter R package) and real bulk transcriptomics data from a developmental C. elegans time-course experiment with simulated single-cell-specific noise added. Gene expression was measured from 206 developmentally synchronized young adults over a twelve-hour period. Single-cell-specific noise was added in silico by gene-wise subtracting values drawn from an exponential distribution such that 80% of values were zeros. The rationale is that bulk transcriptomics data contain less noise than single-cell data and can thus aid in evaluating single-cell denoising methods by providing a good ground truth. The authors also compared against other methods, including SAVER [36], scImpute [37], and MAGIC [51]. DCA denoising recovered the original time-course gene expression pattern while removing single-cell-specific noise. Overall, DCA demonstrated the strongest recovery of the top 500 genes most strongly associated with development in the original, noise-free data. DCA was also shown to outperform existing methods in capturing cell population structure in real data (PBMC and CITE-seq), and its runtime scales linearly with the number of cells.

SAVER-X [52] is an AE model (Figs. 2B, 4B) developed to denoise and impute scRNA-seq data with transfer learning from other data resources. Model. SAVER-X decomposes the variation in the observed counts $x_n$
with missing values into three components: i) a predictable structured component representing the shared variation across genes, ii) unpredictable cell-level biological variation and gene-specific dispersion, and iii) technical noise. Specifically, $y_{gn}$ is modeled by a Poisson-Gamma hierarchical model, where $l_n$ is the sequencing depth of cell $n$, $\mu_{gn}$ is the mean, and $\theta_g$ is the dispersion. This Poisson-Gamma mixture is equivalent to the NB distribution, and thus the ZINB distribution of Eq. (10) is adopted to model missing values. The loss is similar to Eq. (12). However, $\mu_{gn}$ is initially learned by an AE pre-trained on external datasets from an identical or similar tissue and then transferred to the $x_n$ to be denoised. Such transfer learning can be applied to data across species (e.g., human and mouse in the study), cell types, batches, and single-cell profiling technologies. After $\mu_{gn}$ is inferred, SAVER-X generates the final denoised data $\hat{y}_{gn}$ by empirical Bayesian shrinkage. Results. SAVER-X was applied to multiple human single-cell datasets in different scenarios: i) T-cell subtypes, ii) a cell type (CD4+ regulatory T cells) that was absent from the pretraining dataset, iii) gene-protein correlations in CITE-seq data, and iv) immune cells of primary breast cancer samples, with pretraining on normal immune cells. SAVER-X with pretraining on HCA and/or PBMCs outperformed the same model without pretraining, as well as other denoising methods including DCA [18], scVI [17], scImpute [37], and MAGIC [51]. The model achieved promising results even for genes with very low UMI counts. SAVER-X was also applied to a cross-species study in which the model was pre-trained on a human or mouse dataset and transferred to denoise the other. The results demonstrated the merit of transferring public data resources to denoise in-house scRNA-seq data even when the study species, cell types, or single-cell profiling technologies differ.

DeepImpute [20] imputes genes in a divide-and-conquer approach, using a bank of DNN models (Fig. 4A), each with 512 outputs, to predict gene expression levels of a cell. Model. For each dataset, DeepImpute selects a list of genes to impute, or highly variable genes (variance-over-mean ratio, default = 0.5). Each sub-network aims to learn the relationship between the input genes and a subset of target genes. Genes are first divided into $N$ random subsets of 512 target genes. For each subset, a two-layer DNN is trained whose input comprises genes that are among the top 5 best-correlated genes with the target genes but are not themselves targets in the subset. The loss is a weighted MSE that gives higher weights to genes with higher expression values; a minimal sketch of such a loss follows.
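The sketch below illustrates the weighted-MSE idea. Weighting each entry by its target expression value is our assumption of one simple choice, not necessarily DeepImpute's exact weighting scheme:

```python
import numpy as np

def weighted_mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """MSE in which highly expressed genes contribute more to the loss."""
    w = y_true  # assumed weight: proportional to the target expression level
    return float(np.mean(w * (y_true - y_pred) ** 2))
```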
Results. DeepImpute had the highest overall accuracy and offered shorter computation time with less demand on computer memory than other methods such as MAGIC, DrImpute, scImpute, SAVER, VIPER, and DCA. Using simulated and experimental datasets (Table 2), DeepImpute showed benefits in improving clustering results and identifying significantly differentially expressed genes. DeepImpute and DCA show overall advantages over the other methods, and of the two, DeepImpute performs even better. The properties of DeepImpute that contribute to its superior performance include: 1) a divide-and-conquer approach which, in contrast to the autoencoder implemented in DCA, results in lower complexity in each sub-model and stabilizes the neural networks; and 2) sub-networks that are trained without using the target genes as input, which reduces overfitting while forcing the network to learn true relationships between genes.

LATE [53] is an AE (Figs. 2B, 4B) whose encoder takes the log-transformed expression as input. Model. LATE sets all missing values to zero and generates the imputed expressions. LATE minimizes the MSE loss of Eq. (8). One issue is that some zeros could be real and reflect an actual lack of expression. Result. LATE was evaluated using synthetic data generated from pre-imputed data followed by random zeroing of entries.

scGMAI is an AE-based clustering framework that first imputes the expression data with an AE. After imputation, scGMAI uses fast independent component analysis (ICA) on the AE-reconstructed expressions to reduce the dimension and then applies a Gaussian mixture model to the ICA-reduced data to perform the clustering. Results. To assess performance, the AE in scGMAI was replaced by five other imputation methods, including SAVER [36], MAGIC [51], DCA [18], scImpute [37], and CIDR [55]. A scGMAI implementation without the AE was also compared. Seventeen scRNA-seq datasets (some listed in Tables 2b and 2c, as marked) were used to evaluate cell clustering performance. The results indicated that the AE significantly improved the clustering performance in eight of the seventeen scRNA-seq datasets.

Imputation approaches based on information from cells with similar expression suffer from oversmoothing, especially for rare cell types. scIGANs [19] is a GAN-based imputation algorithm (Figs. 2C, 4E) that overcomes this problem by training a GAN model to generate samples with imputed expressions. Model. scIGANs takes image-like reshaped gene expression data $x_n$ as input (a minimal reshape sketch follows below). The model follows the BEGAN [56] framework, which replaces the GAN discriminator $D$ with an autoencoder critic that computes the reconstruction MSE; the Wasserstein distance between the reconstruction errors of the real and generated samples is then computed. This framework forces the model to meet two objectives, i.e., reconstructing the real samples and discriminating between real and generated samples. Proportional control theory was applied to balance these two goals during training [57]. After training, the generator (decoder) is used to generate new samples of a specific cell type. Then the k-nearest neighbors (KNN) approach is applied to the real and generated samples to impute the missing expressions of the real samples.
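The image-like input just described can be produced by zero-padding a cell's expression vector to a square and reshaping it. This is a minimal sketch of that idea; the padding scheme is our assumption rather than scIGANs' exact preprocessing:

```python
import numpy as np

def to_image(x: np.ndarray) -> np.ndarray:
    """Zero-pad a 1-D expression vector and reshape it into a square 'image'."""
    side = int(np.ceil(np.sqrt(x.size)))
    padded = np.zeros(side * side, dtype=x.dtype)
    padded[: x.size] = x
    return padded.reshape(side, side)
```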
Results. scIGANs was first tested on simulated samples with different dropout rates. Its ability to rescue the correct clusters was compared with 11 existing imputation approaches, including DCA, DeepImpute, SAVER, scImpute, and MAGIC; scIGANs reported the best performance on all metrics. scIGANs was next evaluated for its ability to correctly cluster cell types on human brain scRNA-seq data, where it again outperformed existing methods. scIGANs was then evaluated for identifying cell-cycle states using scRNA-seq datasets from mouse embryonic stem cells, and the results showed that it outperformed competing approaches in recovering subcellular states of cell-cycle dynamics. scIGANs was further shown to improve the identification of differentially expressed genes and to enhance the inference of cellular trajectories using time-course scRNA-seq data from the differentiation of H1 ESCs to definitive endoderm cells (DECs). Finally, scIGANs was shown to scale to different scRNA-seq methods and data sizes.

BERMUDA [58] deploys a transfer-learning method (Figs. 2B, 4B) to remove batch effects. It performs correction on the cell clusters shared among batches and therefore preserves batch-specific cell populations. Model. BERMUDA is an AE that takes normalized, log-transformed expression as input. Its loss consists of two parts,

$$L(\Theta) = \mathcal{L}(\Theta) + \lambda L_{\mathrm{MMD}}, \quad (16)$$

where $\mathcal{L}(\Theta)$ is the MSE loss and $L_{\mathrm{MMD}}$ is the maximum mean discrepancy (MMD) [59] loss, which measures the differences in distribution between pairs of similar cell clusters shared among batches,

$$L_{\mathrm{MMD}} = \sum_{\text{shared cluster pairs}} \mathrm{MMD}\!\left(z_{i,j},\, z_{i',j'}\right),$$

where $z_{i,j}$ is the latent variable of $x_{i,j}$, the input expression of a cell from cluster $j$ of batch $i$ (an MMD sketch appears later in this subsection). Results. BERMUDA was shown to outperform other methods such as mnnCorrect [32], BBKNN [61], Seurat [10], and scVI [17] in removing batch effects on simulated and human pancreas data while preserving batch-specific biological signals. BERMUDA provides several improvements over existing methods: 1) it can remove batch effects even when the cell population compositions across batches are vastly different, and 2) it preserves batch-specific biological signals through transfer learning, which enables the discovery of new information that might be hard to extract by analyzing each batch individually.

DESC [62] is an AE model (Figs. 2B, 4B) that removes batch effects through clustering, under the hypothesis that batch differences in expression are smaller than true biological variation between cell types; therefore, properly clustering cells across multiple batches can remove batch effects without explicitly defining the batches. Model. DESC has a conventional AE architecture. Its encoder takes normalized, log-transformed expression, and the decoder output $\hat{x}_n$ is the reconstructed gene expression, which is equivalent to a Gaussian data distribution with mean $\hat{x}_n$. The loss function is similar to Eq. (16), except that the second loss $L_C$ is a clustering loss that regularizes the learned feature representations to form clusters, as in deep embedded clustering [63]. The model is first trained to minimize $\mathcal{L}(\Theta)$ alone to obtain initial weights before minimizing the combined loss. After training, each cell is assigned a cluster ID. Results. DESC was applied to the macaque retina dataset, which includes animal-level, region-level, and sample-level batch effects. The results showed that DESC is effective in removing batch effects, whereas CCA [33], MNN [32], Seurat 3.0 [10], scVI [17], BERMUDA [58], and Scanorama [64] were all sensitive to batch definitions. DESC was then applied to human pancreas datasets to test its ability to remove batch effects across multiple scRNA-seq platforms and yielded the highest ARI among the compared approaches. When applied to human PBMC data with interferon-beta stimulation, where biological variation is compounded by batch effects, DESC was shown to be the best at removing batch effects while preserving biological variation. DESC was also shown to remove batch effects in the monocyte and mouse bone marrow data while preserving the pseudotemporal structure. Finally, DESC scales linearly with the number of cells, and its running time is not affected by an increasing number of batches.
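BERMUDA above, and SAUCIE below, align batches with MMD losses. The following is a compact sketch of an RBF-kernel MMD between two batches' latent codes; the kernel choice and single fixed bandwidth are simplifying assumptions:

```python
import torch

def mmd_rbf(z_a: torch.Tensor, z_b: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Squared MMD between two samples of latent codes under an RBF kernel."""
    def k(x, y):
        return torch.exp(-gamma * torch.cdist(x, y) ** 2).mean()
    return k(z_a, z_a) + k(z_b, z_b) - 2.0 * k(z_a, z_b)
```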
iMAP [65] combines an AE and a GAN for batch correction; it is designed to remove batch biases while preserving dataset-specific biological variation. Model. iMAP consists of two processing stages, each including a separate DL model. In the first stage, a special AE is trained whose decoder combines the outputs of two separate networks, $f_{w_1}(s_n)$ and $f_{w_2}(z_n)$, where $s_n$ is the one-hot encoded batch number of cell $n$; $f_{w_1}$ can be understood as decoding the batch noise, whereas $f_{w_2}$ reconstructs batch-removed expression from the latent variable $z_n$. Training minimizes the loss in Eq. (16), except that the second term is a content loss $L_c(\Theta)$ in which a random batch number $\tilde{s}$ replaces the true batch. Minimizing $L_c(\Theta)$ further ensures that the reconstructed expression $\hat{x}_n$ is batch-agnostic and has the same content as $x_n$. However, due to the limitations of the AE, this step is still insufficient for batch removal. Therefore, a second stage applies a GAN model to make the expression distributions of cell types shared across batches indistinguishable. To identify the shared cell types, a mutual nearest neighbors (MNN) strategy adapted from [32] was developed to identify MNN pairs across batches using the batch-effect-independent $z_n$ rather than $x_n$. A mapping generator $G_w$ is then trained on the MNN pairs with a GAN, where $x_n^{(b)}$ and $x_n^{(a)}$ are MNN pairs from batch $b$ and an anchor batch $a$. The WGAN-GP loss of Eq. (9) was adopted for GAN training. After training, $G_w$ is applied to all cells of a batch to generate batch-corrected expression. Results: iMAP was first tested on benchmark datasets from human dendritic cells and Jurkat and 293T cell lines, and then on human pancreas datasets from five different platforms. All the datasets contain both batch-specific and batch-shared cell types. iMAP was shown to separate the batch-specific cell types while mixing the batch-shared cell types, outperforming nine other batch-correction methods including Harmony, scVI, fastMNN, and Seurat. iMAP was then applied to the large-scale Tabula Muris datasets containing over 100K cells sequenced on two platforms; it not only reliably integrated cells from the same tissues but also identified cells from platform-specific tissues. Finally, iMAP was applied to datasets of tumor-infiltrating immune cells and shown to reduce the dropout ratio and the percentage of ribosomal genes and non-coding RNAs, thereby improving the detection of rare cell types and ligand-receptor interactions. iMAP scales with the number of cells, showing minimal increase in time cost after the number of cells exceeds a few thousand. Its performance is also robust to model hyperparameters.

This study [66] considers AEs (Figs. 2B, 4B) for learning low-dimensional representations and specifically explores the benefit of incorporating prior biological knowledge of gene-gene interactions to regularize the AE network architecture. Model. Several AE models with one or two hidden layers are implemented that incorporate gene interactions reflecting transcription factor (TF) regulation and protein-protein interactions (PPIs). The models take normalized, log-transformed expressions and follow the general AE structure, including dimension-reducing and reconstructing layers, but the network architectures are not symmetrical.
Specifically, gene interactions are incorporated such that each node of the first hidden layer represents a TF or a protein in the PPI network; only genes targeted by the TF or involved in the PPI are connected to the node. The corresponding weights of $g_\phi$ and $f_w$ are set to be trainable, and all others are fixed at zero throughout training. Both unsupervised (AE-like) and supervised (cell-type label) learning were studied. Results. Regularizing encoder connections with TF and PPI information reduced the model complexity by almost 90% (from 7.5-7.6M to 1.0-1.1M parameters). The clusters formed on the data representations learned by the models with or without TF/PPI information were compared to those from PCA, NMF, independent component analysis (ICA), t-SNE, and SIMLR [44]. The model with TF/PPI information and two hidden layers achieved the best performance on five of the six measures and the best average performance. In terms of cell-type retrieval of single cells, the encoder models with and without TF/PPI information achieved the best performance in 4 and 3 cell types, respectively; PCA yielded the best performance in only 2 cell types. The DNN model with TF/PPI information and two hidden layers again achieved the best average performance across all cell types. In summary, this study demonstrated a biologically meaningful way to regularize AEs with prior biological knowledge for learning representations of scRNA-seq data for cell clustering and retrieval.

Dhaka [67] was proposed to reduce the dimension of scRNA-seq data for efficient stratification of tumor subpopulations. Model. Dhaka adopts the general VAE formulation (Figs. 2A, 4C). It takes the normalized, log-transformed expressions of a cell as input and outputs the low-dimensional representation. Result. Dhaka was first tested on a simulated dataset containing 500 cells, each with 3K genes, grouped into 5 clusters of 100 cells each.

scvis is a VAE (Figs. 2A, 4C) that learns low-dimensional representations capturing both the local and global neighboring structures in scRNA-seq data. Model: scvis adopts the generic VAE formulation described in Section 3.1 but has a unique loss function,

$$L(\Theta) = -\mathcal{L}(\Theta) + \lambda L_{\mathrm{tsne}},$$

where $\mathcal{L}(\Theta)$ is the ELBO of Eq. (3) and $L_{\mathrm{tsne}}$ is a regularizer using the non-symmetrized t-SNE objective function [70],

$$L_{\mathrm{tsne}} = \sum_{j} \sum_{i \neq j} p_{i|j} \log \frac{p_{i|j}}{q_{i|j}},$$

where $i$ and $j$ are two different cells, $p_{i|j}$ measures the local cell relationship in the data space, and $q_{i|j}$ measures this relationship in the latent space. Because the t-SNE objective preserves the local structure of the high-dimensional space, $L_{\mathrm{tsne}}$ encourages learning the local structure of cells. Results. scvis was tested on simulated data and outperformed t-SNE on a nine-dimensional task, preserving both local and global structure: the relative positions of all clusters were well kept, although outliers were scattered around the clusters. Using simulated data and comparing to t-SNE, scvis generally produced consistent and reproducible patterns.

scVAE [71] comprises VAE models with several count likelihoods and replaces the standard normal prior on $z_n$ with a Gaussian mixture, resulting in a GMVAE. The inference process still follows that of a VAE, as discussed in Section 3.1. Results. scVAE models were evaluated on the PBMC data and compared with factor analysis (FA) models. The results showed that the GMVAE with negative binomial distribution achieved the highest lower bound and ARI, with the zero-inflated Poisson distribution performing second best. All scVAE models outperformed the baseline linear factor-analysis model, which suggested that a non-linear model is needed to capture single-cell genomic features.
GMVAE was also compared with Seurat and shown to perform better on withheld data. However, scVAE performed no better than scVI [17] or scvis [70], both of which are VAE models.

VASC [72] is another VAE (Figs. 2A, 4C) for dimension reduction and latent representation, but it explicitly models dropout.

scDeepCluster [74] is an AE network (Figs. 2B, 4B) that simultaneously learns feature representations and performs clustering via explicit modeling of cell clusters, as in DESC. Model: Similar to DCA, scDeepCluster adopts a ZINB distribution for $x_n$, as in Eqs. (10) and (12). The loss is similar to Eq. (16), with the first term being the negative log-likelihood of the ZINB data distribution as defined in Eq. (12) and the second term $L_C$ being a clustering loss based on KL divergence, as in the DESC algorithm. Compared to scvis, scDeepCluster focuses more on clustering assignment due to the KL divergence term. Results. scDeepCluster was first tested on simulated data and compared with seven other methods: DCA [18], two multi-kernel spectral clustering methods (MPSSC [75] and SIMLR [44]), CIDR [55], PCA + k-means, scvis [70], and DEC [76].

scGAN is a GAN model (Figs. 2C, 4E) for generating realistic scRNA-seq data, with a conditional variant, cscGAN, for generating cell-type-specific samples. Results: The performance of scGAN was first evaluated using PBMC data. The generated samples were shown to capture the desired clusters and the regulons of the real data. Additionally, the AUC for classifying real versus generated samples with a random forest classifier reached only 0.65, close to the 0.5 of indistinguishable samples. Finally, scGAN's generated samples had a smaller MMD than those of Splatter, a state-of-the-art scRNA-seq data simulator [79]. A larger MMD was observed for scGAN when compared with SUGAR [80], another scRNA-seq simulator, but SUGAR was noted for its prohibitively high runtime and memory. scGAN was further trained and assessed on the bigger mouse brain data and shown to model expression dynamics across tissues. The performance of cscGAN in generating cell-type-specific samples was then evaluated using the PBMC data; cscGAN was shown to generate high-quality scRNA-seq data for specific cell types. Finally, the real PBMC samples were augmented with generated samples; this augmentation improved the identification of rare cell types and the ability to capture transitional cell states in trajectory analysis.

Given the versatility of AEs and VAEs in addressing different scRNA-seq analysis challenges, DL models possessing multiple analysis functions have been developed. We survey these models in this section. scVI [17] is designed to address a range of fundamental analysis tasks, including batch correction, visualization, clustering, and differential expression. Model. scVI is a VAE (Figs. 2A, 4C) that models the counts of each cell from different batches. scVI adopts a ZINB distribution for $y_{gn}$ (Eq. (22)), defined similarly to Eq. (11) in DCA, except that $l_n$ denotes the scaling factor for cell $n$, which follows a log-normal prior $p(l_n) = \mathrm{LogNormal}(\mu_{l_n}, \sigma_{l_n}^2)$; the mean $\rho_{gn}$ therefore represents the counts normalized by $l_n$. Now let $s_n \in \{0,1\}^B$ be the batch ID of cell $n$, with $B$ being the total number of batches. Then $\mu_{gn}$ and $\pi_{gn}$ are further modeled as functions of the $d$-dimensional latent variable $z_n \in \mathbb{R}^d$ and the batch ID $s_n$ by the decoder networks $f_{w_\mu}$ and $f_{w_\pi}$ as

$$\boldsymbol{\mu}_n = f_{w_\mu}(z_n, s_n), \qquad \boldsymbol{\pi}_n = f_{w_\pi}(z_n, s_n), \quad (23)$$

where the $g$-th elements of $\boldsymbol{\mu}_n$ and $\boldsymbol{\pi}_n$ are $\mu_{gn}$ and $\pi_{gn}$, respectively, and $w_\mu$ and $w_\pi$ are the decoder weights. Note that the lower layers of the two decoders are shared. For inference, both $z_n$ and $l_n$
are considered latent variables, and therefore $q(z_n, l_n \mid x_n) = q(z_n \mid x_n)\, q(l_n \mid x_n)$, with the variational parameters given by two encoder networks whose weights are $\phi_z$ and $\phi_l$. Note that, as with the decoders, the lower layers of the two encoders are shared. Overall, the model parameters estimated by the variational optimization are $\Theta = \{w_\mu, w_\pi, \phi_z, \phi_l, \theta_g\}$. After inference, the $z_n$ are used for visualization and clustering, and $\mu_{gn}$ provides a batch-corrected, size-factor-normalized estimate of the expression of each gene $g$ in each cell $n$. An added advantage of scVI's probabilistic representation is that it provides a natural probabilistic treatment of the subsequent differential analysis, resulting in lower variance in the adopted hypothesis tests. Results: scVI was evaluated for its scalability and imputation performance. For scalability, scVI was shown to be faster than most non-DL algorithms and able to handle twice as many cells as non-DL algorithms with fixed memory. For imputation, scVI, together with other ZINB-based models, performed better than methods using alternative distributions; however, it underperformed on the dataset with fewer cells (HEMATO). For the latent space, scVI was shown to provide a stratification of cells comparable to previously annotated cell types. Although scVI failed to rival SIMLR, it is among the best at capturing biological structure (hierarchy, dynamics, etc.) and recognizing noise in the data. For batch correction, it outperforms ComBat. For normalizing sequencing depth, the size factor inferred by scVI was shown to be strongly correlated with the sequencing depth. Interestingly, the negative binomial component of the ZINB was found to explain the proportion of zero expression in cells, whereas the zero probability $\pi_{gn}$ was found to be more correlated with alignment errors. For differential expression analysis, scVI was shown to be among the best.

LDVAE [81] is an adaptation of scVI that improves model interpretability while still benefiting from scVI's scalability and efficiency. The formulation applies to general VAE models and is thus not restricted to scRNA-seq analysis. Model. LDVAE follows scVI's formulation but replaces the decoder $f_{w_\mu}$ in Eq. (23) with a linear model

$$\boldsymbol{\mu}_n = W z_n,$$

where $W$ is a weight matrix. The linear decoder provides interpretability in the sense that the relationship between the latent representation $z_n$ and the gene expression $\boldsymbol{\mu}_n$ can be readily identified. LDVAE still follows the same loss and non-linear inference scheme as scVI. Results. LDVAE's latent variable $z_n$ can be used to cluster cells with accuracy similar to a VAE. Although LDVAE had a higher reconstruction error than the VAE, due to the linear decoder, variations along the different axes of $z_n$ establish direct linear relationships with input genes. As an example from an analysis of mouse embryo scRNA-seq data, $z_{2,n}$, the second element of $z_n$, was shown to relate to simultaneous variations in the expression of a pair of genes. In contrast, such interpretability would be intractable without approximation for a standard VAE. LDVAE was also shown to induce fewer correlations between latent variables and to improve the grouping of regulatory programs. LDVAE is capable of scaling to large datasets of roughly 2M cells.

SAUCIE [15] is an AE (Figs. 2B, 4B) designed to perform multiple functions, including clustering, batch correction, imputation, and visualization. SAUCIE is applied to normalized data rather than count data. Model.
SAUCIE includes multiple model components designed for different functions. 1. Clustering: SAUCIE first introduces a "digital" binary encoding layer $h^c \in \{0,1\}^K$ in the decoder $f_w$ that encodes the cluster ID. To learn this encoding, an entropy loss is introduced,

$$L_e = -\sum_{k} p_k \log p_k,$$

where $p_k$ is the probability (proportion) of activation of neuron $k$ by the previous layer. Minimizing this entropy loss promotes sparse neuron activation, thus forcing a binary encoding. To encourage clustering behavior, SAUCIE also introduces an intracluster loss $L_d$, which computes the distance between the expressions of pairs of cells that have the same cluster ID ($h_i^c = h_j^c$). 2. Batch correction: to correct batch effects, an MMD loss is introduced to measure the distributional differences between batches in the latent space,

$$L_b = \sum_{b=1}^{B} \mathrm{MMD}\!\left(z_{\mathrm{ref}},\, z_b\right),$$

where $B$ is the total number of batches and $z_{\mathrm{ref}}$ is the latent variable of an arbitrarily chosen reference batch. 3. Imputation and visualization: SAUCIE takes the output of the decoder as an imputed version of the input gene expression. To visualize the data directly, without an additional dimension-reduction step, the dimension of the latent variable $z_n$ is forced to 2. Training the model consists of two sequential runs. In the first run, an autoencoder is trained to minimize the loss $L_R + \lambda_b L_b$, with $L_R$ being the MSE reconstruction loss of Eq. (8), so that a batch-corrected, imputed input $\tilde{x}$ is obtained at the output of the decoder. In the second run, the bottleneck layer of the encoder from the first run is replaced by a 2-D latent code for visualization, and the digital encoding layer is introduced. This model takes the cleaned $\tilde{x}$ as input and is trained for clustering by minimizing the loss $L_R + \lambda_e L_e + \lambda_d L_d$. After the model is trained, $\hat{x}$ is the imputed, batch-corrected gene expression, the 2-D latent code is used for visualization, and the binary encoder encodes the cluster ID.

scScope [87] is an AE (Figs. 2B, 4D) with recurrent steps designed for imputation and batch correction. Model. scScope has the following design for batch correction and imputation. 1. Batch correction: the batch effect is first removed from the input via a learned correction matrix,

$$x_n^{c} = \mathrm{ReLU}\!\left(x_n - U u_n\right),$$

where ReLU is the rectified-linear activation function, $U$ is the batch-correction matrix, $u_n \in \{0,1\}^{B}$ is an indicator vector whose nonzero entry indicates the batch of the input, and $B$ is the total number of batches. 2. Recursive imputation: instead of using the reconstructed expression $\hat{x}_n$ directly as the imputed expression, as in SAUCIE, scScope adds an imputer to $\hat{x}_n$ to recursively improve the imputation result (see the masking sketch below). The imputer is a single-layer autoencoder whose decoder performs the imputation as

$$\hat{x}_n^{i} = M_i\!\left(f_i(\tilde{z}_n)\right),$$

where $\tilde{z}_n$ is the output of the imputer encoder, $f_i$ is the imputer decoder, and $M_i$ is a masking function that sets the elements of the output corresponding to the non-missing values to zero. Then $\hat{x}_n^{i}$ is fed back to fill the missing values in the batch-corrected input as $x_n^{c} + \hat{x}_n^{i}$, which is passed on to the main autoencoder. This recursive imputation can iterate for a chosen number of cycles. The loss function is defined as

$$L = \sum_{t=1}^{T} \left\lVert M_o\!\left(x_n^{c} - \hat{x}_n^{t}\right) \right\rVert^2,$$

where $T$ is the total number of recursions, $\hat{x}_n^{t}$ is the reconstructed expression at the $t$-th recursion, and $M_o$ is another masking function that restricts the loss to the non-missing values of $x_n^{c}$.
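The masking logic behind this recursive imputation is easy to state in code. Below is a minimal sketch under the assumption that zeros mark missing entries; it fills only the missing positions of the batch-corrected input with the imputer's predictions:

```python
import torch

def impute_step(x_c: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Fill only the missing (zero) entries of x_c with predictions x_hat."""
    missing = (x_c == 0).float()   # mask selecting missing entries (role of M_i)
    return x_c + missing * x_hat   # non-missing values pass through unchanged
```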
Results. scScope was evaluated for scalability, clustering, imputation, and batch correction, and was compared with PCA, MAGIC, ZINB-WaVE, SIMLR, AE, scVI, and DCA. For scalability and training speed, scScope was shown to scale to more than 100K cells with high efficiency (faster than most of the compared approaches). For clustering, scScope outperformed most of the algorithms on small simulated datasets and offered similar performance on large simulated datasets. For batch correction, scScope performed comparably to the other approaches but with faster runtime. For imputation, scScope produced consistently smaller errors across a range of expression levels. scScope was further shown to be able to identify rare cell populations from a large mix of cells.

scRNA-seq can catalog cell types in complex tissues under different conditions. However, the commonly adopted manual cell-typing approach based on known markers is time-consuming and less reproducible. Here we survey deep learning models for automated cell-type identification.

DigitalDLSorter [88] was proposed to identify and quantify the immune cells infiltrating tumors captured in bulk RNA-seq, utilizing single-cell RNA-seq data. Model. DigitalDLSorter is a 4-layer DNN (Fig. 4A) with 2 hidden layers of 200 neurons each and an output of 10 cell types. It is trained with two single-cell datasets: breast cancer [89] and colorectal cancer [90]. Each cell is first determined to be a tumor or non-tumor cell using an RNA-seq-based CNV method [89], and the xCell algorithm [91] is then used to assign immune cell types to the non-tumor cells. Pseudo-bulk RNA-seq datasets (each from 100 cells) were prepared with known mixture proportions to train the DNN. The output of DigitalDLSorter is the predicted proportion of each cell type in the input bulk sample. Result. DigitalDLSorter was first tested on simulated bulk RNA-seq samples and achieved excellent agreement in predicting cell-type proportions (a linear correlation of 0.99 for colorectal cancer, and good agreement with a quadratic relationship for breast cancer). The proportions of immune and non-immune cell subtypes in bulk TCGA test samples predicted by DigitalDLSorter correlated very well with those from other deconvolution tools, including TIMER [89], ESTIMATE [92], EPIC [93], and MCPCounter [94]. Using DigitalDLSorter's predicted CD8+ (good prognosis for overall and disease-free survival) and monocyte-macrophage (MM, an indicator of protumoral activity) proportions, it was found that patients with a higher CD8+/MM ratio had better survival in both cancer types than those with a lower CD8+/MM ratio. Both EPIC and MCPCounter yielded non-significant survival associations using their cell-proportion estimates.

scCapsNet [95] is an interpretable capsule network designed for cell-type prediction. The paper showed that the trained network can be interpreted to reveal marker genes and regulatory modules of cell types. Model. The two-layer architecture of scCapsNet takes log-transformed, normalized expressions as input to a feature-extraction network (consisting of $L$ parallel single-layer neural networks) followed by a capsule network for cell-type classification (type capsules). Each of the $L$ parallel feature-extraction layers generates a primary capsule $u_l \in \mathbb{R}^{d_1}$ as

$$u_l = W_{f,l}\, x_n, \quad l = 1, \dots, L,$$

where $W_{f,l} \in \mathbb{R}^{d_1 \times G}$ is a weight matrix.
Then, the primary capsules are fed into the capsule network to compute type capsules $v_k \in \mathbb{R}^{d_2}$, one for each cell type, as

$$v_k = \mathrm{squash}\!\left(\sum_{l=1}^{L} c_{kl}\, W_{kl}\, u_l\right),$$

where squash is the squashing function [96] that normalizes the magnitude of its input vector to be less than one, $W_{kl}$ is another trainable weight matrix, and the $c_{kl}$, $l = 1, \dots, L$, are coupling coefficients that represent the probability distribution of each primary capsule's impact on the prediction of cell type $k$. The $c_{kl}$ are not trained but computed through the dynamic routing process proposed in the original capsule network [96]. The magnitude of each type capsule $v_k$ represents the probability that a single cell $x_n$ belongs to cell type $k$, which is used for cell-type classification. The network is trained to minimize the cross-entropy loss by backpropagation. Results. scCapsNet's performance was evaluated on human PBMC [97] and mouse retinal bipolar cell [98] datasets and shown to achieve comparable accuracies on both (99% on the PBMC dataset).

netAE [99] is a semi-supervised VAE designed for dimension reduction and cell-type classification when labeled cells are few. Model. netAE works with UMI counts and assumes a ZINB distribution for $y_{gn}$, as in Eq. (22) of scVI. However, netAE adopts the general VAE loss of Eq. (6) with two function-specific losses,

$$L(\Theta) = -\mathcal{L}(\Theta) - \lambda_1 \sum_{n \in A} Q(z_n) - \lambda_2 \sum_{n \in A_l} \log \sigma(c_n \mid z_n), \quad (34)$$

where $A$ is the set of indices of all cells and $A_l$ is the subset of $A$ containing only cells with cell-type labels, $Q$ is the modified Newman-Girvan modularity [100] that quantifies cluster strength using $z_n$, $\sigma$ is the softmax function, and $c_n$ is the cell-type label. The second loss in Eq. (34) functions as a clustering constraint, and the last term is the cross-entropy loss that constrains the cell-type classification. Results: netAE was compared with popular dimension-reduction methods including scVI, ZIFA, PCA, and AE, as well as the semi-supervised method scANVI [101].

scDGN [102] is a supervised domain-adversarial network that jointly classifies cell types and aligns batches (domains). Its loss combines $L_c$, the cross-entropy loss for cell-type classification, and $L_d$, a contrastive loss as described in [103]. The domain loss has an adversarial formulation: minimizing it maximizes the misclassification of cells from different batches, thus making the batches indistinguishable. Similar to GAN training, scDGN is trained by iteratively solving a min-max optimization between the cell-type classifier and the domain discriminator. scDGN was tested for classifying cell types and aligning batches on datasets ranging in size from 10 to 39 cell types and from 4 to 155 batches. The performance was compared to a series of deep learning and traditional machine learning methods, including the DNN of Lin et al. [66], CaSTLe [104], MNN [32], scVI [17], and Seurat [10].

Predicting biological function and response to treatment at the level of single cells or cell types is critical to understanding how cellular systems function and respond to stimulation. DL models are capable of capturing gene-gene relationships and their properties in the latent space. The models reviewed below provide exciting approaches to learning complex biological functions and outcomes. CNNC [105] is proposed to infer causal interactions between genes from scRNA-seq data. Model. CNNC is a convolutional neural network (CNN) (Fig. 4F) that operates on an image-like representation of a gene pair's joint expression across cells and predicts the pair's interaction; a plausible construction of this input is sketched below.
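One plausible construction of such an input, sketched below, is a normalized 2-D histogram of the two genes' log-transformed expression across cells; the bin count and normalization here are our assumptions rather than CNNC's exact recipe:

```python
import numpy as np

def pair_image(g1: np.ndarray, g2: np.ndarray, bins: int = 32) -> np.ndarray:
    """Image-like joint histogram of two genes' expression across cells."""
    hist, _, _ = np.histogram2d(np.log1p(g1), np.log1p(g2), bins=bins)
    hist = np.log1p(hist)                  # compress dynamic range
    return hist / (hist.max() + 1e-8)      # normalize to [0, 1]
```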
Result. CNNC was trained to predict transcription factor (TF)-gene interactions using the mESC data from scQuery [106], where the ground-truth interactions were obtained from the ChIP-seq data of the GTRD database [107]. The performance was compared with a DNN, count statistics [108], and a mutual-information-based approach [109]. CNNC was shown to have more than 20% higher AUPRC than the other methods and reported almost no false positives among its top 5% of predictions. CNNC was also trained to predict pathway regulator-target gene pairs. The positive regulator-gene pairs were obtained from KEGG [110] and Reactome [111], and negative samples were gene pairs that appear in pathways but do not interact. CNNC was shown to predict regulator-gene pairs on both KEGG and Reactome pathways better than other methods, including Pearson correlation, count statistics, GENIE3 [112], mutual information, Bayesian directed networks (BDN), and DREMI [109]. CNNC was also applied to causality prediction between two genes, that is, whether two genes regulate each other and, if they do, which gene is the regulator. The ground-truth causal relationships were also obtained from the KEGG and Reactome datasets. Again, CNNC reported better performance than BDN, the common method for learning causal relationships from gene expression data. CNNC was finally trained to assign three essential cell functions (cell cycle, circadian rhythm, and immune system) to genes. This was achieved by training CNNC to label pairs of genes from the same function (e.g., the cell cycle as defined by MSigDB from gene set enrichment analysis (GSEA) [113]) as 1 and all other pairs as 0. The performance was compared with "guilt by association" and a DNN; CNNC was shown to have more than 4% higher AUROC and reported all positives among its top 10% of predictions.

scGen [114] is designed to learn the cellular response to a given perturbation (drug treatment, gene knockout, etc.) from single-cell expression data and to predict the response to the same perturbation for a new sample or a new cell type. The novelty of scGen is that it learns the response in the latent space instead of in the expression data space. Model. scGen follows the general VAE formulation (Figs. 2A, 4C) for scRNA-seq data but uses latent-space vector arithmetic to encode and transfer the perturbation response. Result. scGen was applied to predict out-of-sample perturbation responses in human PBMC data and showed a high average correlation (R^2 = 0.948) between predicted and real data across six cell types. Compared with other methods, including CVAE [115], style-transfer GAN [116], linear approaches based on vector arithmetic (VA) [114], and PCA+VA, scGen predicted the full distribution of the ISG15 gene (the gene most strongly regulated by IFN-b) response to IFN-b [117], whereas the others could predict the mean (CVAE and style-transfer GAN) but failed to reproduce the full distribution. scGen was also tested on predicting the response of intestinal epithelial cells to infection [118]; for early transit-amplifying cells, scGen showed good predictions (R^2 = 0.98) for both H. poly and Salmonella infection. Finally, scGen was evaluated for perturbation prediction across species using the scRNA-seq dataset of Hagai et al. [119], which comprises bone-marrow-derived mononuclear phagocytes from mice, rats, rabbits, and pigs perturbed with lipopolysaccharide (LPS). scGen's predictions of the LPS perturbation responses were shown to be highly correlated (R^2 = 0.91) with the real responses.

We systematically surveyed 25 DL models according to the challenges they address. Unlike other surveys, we categorize the major DL models into VAEs, AEs, and GANs with a unified mathematical formulation in Section 3 (graphical model representations in Fig. 2), which guides readers to focus on the DL model selection, training strategies, and loss functions of each algorithm. Specifically, the differences in loss functions are highlighted in each DL model's application to meet specific objectives. The DL/ML models that the 25 surveyed models were evaluated against are presented in Fig. 3, providing a straightforward way for readers to pick the most suitable DL model for a specific step of their own scRNA-seq data analysis.
We systematically surveyed 25 DL models according to the challenges they address. Unlike other surveys, we categorize the major DL models into VAE, AE, and GAN with a unified mathematical formulation in Section 3 (graphical model representations in Fig. 2), which guides readers to focus on the model selection, training strategies, and loss functions of each algorithm. In particular, the differences in loss functions are highlighted for each DL model's application to meet specific objectives. The DL/ML models against which the 25 surveyed models were evaluated are presented in Fig. 3, providing a straightforward way for readers to pick the most suitable DL model for a specific step of their own scRNA-seq data analysis. All evaluation metrics are listed in Table 3, which we foresee serving as an easy recipe book for researchers establishing their scRNA-seq pipelines. In addition, a summary of all 25 DL models, covering their model types, evaluation metrics, implementation environments, and downloadable source code, is presented in Table 1. Taken together, this survey provides a rich resource for selecting DL methods suited to particular research applications, and we expect it to inspire new DL model development for scRNA-seq analysis.

In this review, we focused our survey on common DL models, such as AE, VAE, and GAN, and their variations or combinations for addressing single-cell data analysis challenges. With the advancement of multi-omics single-cell technologies, DL models are also being extended beyond scRNA-seq, for example to CyTOF (SAUCIE [15]), spatial transcriptomics (a DNN-based method [120]), and CITE-seq, which simultaneously generates read counts for surface protein expression along with gene expression [121, 122]. Beyond the three most common DL models, we also included network frameworks such as capsule networks (scCapsNet [95]), convolutional neural networks (CNNC [105]), and domain-adaptation learning (scDGN [102]). It is expected that more DL models will be developed and implemented for the most challenging steps of scRNA-seq analysis, including but not limited to data interpretation. For example, integrating protein-protein interaction graphs into DL models has been shown to be advantageous because of the biological knowledge and nonlinear interactions embedded in the graphs [123-125]. Indeed, a recently published scRNA-seq analysis pipeline, scGNN [126], incorporates three iterative autoencoders (including one graph autoencoder) and successfully demonstrated its use in dissecting Alzheimer's disease-related neural development and differentiation mechanisms. We expect that the careful organization of this review will provide a basic understanding of DL models for scRNA-seq and a list of critical elements to consider in future DL developments.

This article's publication costs were supported partially by the National Institutes of Health.

Table 3. Evaluation metrics.
- Pseudobulk correlation: the average of normalized (log2-transformed) scRNA-seq counts across cells is calculated, and the correlation coefficient between this pseudobulk and the actual bulk RNA-seq profile of the same cell type is evaluated.
- Mean squared error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{x}_i)^2$ assesses the quality of a predictor or estimator from a collection of observed data $x$, with $\hat{x}$ being the predicted values.
- Spearman correlation coefficient: defined as the Pearson correlation coefficient between the rank variables, where $r_X$ is the rank of $X$.
- Entropy of accuracy, $H_{acc}$ [21]: $H_{acc} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{j} p_i(x_j) \log p_i(x_j)$ measures the diversity of the ground-truth labels within each of the $M$ predicted clusters.
- Entropy of purity, $H_{pur}$ [21]: $H_{pur} = -\frac{1}{K} \sum_{i=1}^{K} \sum_{j} q_i(x_j) \log q_i(x_j)$ measures the diversity of the predicted labels within each of the $K$ ground-truth clusters. Here $p_i(x_j)$ (or $q_i(x_j)$) is the proportion of cells in the $j$-th ground-truth cluster (or predicted cluster) relative to the total number of cells in the $i$-th predicted cluster (or ground-truth cluster), respectively.
- Rand index (RI): a measure of concordance between two clustering outcomes, $RI = (a + b)/\binom{n}{2}$, where $a$ (or $b$) is the number of pairs of cells that fall in the same cluster (or in different clusters) under one clustering algorithm and also fall in the same cluster (or in different clusters) under the other.
- Adjusted Rand index (ARI) [165]: the Rand index corrected for chance, $\mathrm{ARI} = (RI - E[RI]) / (\max(RI) - E[RI])$.
- k-nearest-neighbor batch-effect test (kBET): given a dataset of $N$ cells from $B$ batches, with $N_b$ denoting the number of cells in batch $b$, $N_{kb}^{(i)}$ the number of cells from batch $b$ among the $k$ nearest neighbors of cell $i$, $f_b = N_b / N$ the global fraction of cells in batch $b$, and $\chi^2_{B-1}$ the $\chi^2$ distribution with $B - 1$ degrees of freedom, kBET uses a $\chi^2$-based test on random neighborhoods of fixed size to determine whether batches are well mixed.
- Homogeneity score (HS): $\mathrm{HS} = 1 - H(U \mid V)/H(U)$, where $H(\cdot)$ is the entropy, $U$ is the ground-truth assignment, and $V$ is the predicted assignment. HS ranges from 0 to 1, where 1 indicates perfectly homogeneous labeling.
- Local Inverse Simpson's Index (LISI): the effective number of batches (or cell types) represented in the neighborhood of each cell, computed as the inverse Simpson's index of the neighborhood's composition.
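As an example of the entropy-based metrics in Table 3, the following NumPy sketch computes $H_{acc}$ and $H_{pur}$ directly from their definitions; `truth` and `pred` are assumed to be integer label arrays of equal length, and the function names are illustrative.

```python
import numpy as np

def entropy_of_accuracy(truth, pred):
    """H_acc: average, over predicted clusters, of the entropy of the
    ground-truth labels within each predicted cluster (lower = purer)."""
    h = []
    for c in np.unique(pred):
        _, counts = np.unique(truth[pred == c], return_counts=True)
        p = counts / counts.sum()          # label proportions inside cluster c
        h.append(-(p * np.log(p)).sum())   # entropy of cluster c
    return float(np.mean(h))

def entropy_of_purity(truth, pred):
    """H_pur: the same computation with the roles of the ground-truth and
    predicted assignments swapped."""
    return entropy_of_accuracy(pred, truth)
```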
References
- Eleven grand challenges in single-cell data science
- Sequencing thousands of single-cell genomes with combinatorial indexing
- Single-Cell Sequencing for Drug Discovery and Drug Development
- Pan-cancer single-cell RNA-seq identifies recurring programs of cellular heterogeneity
- The first five years of single-cell cancer genomics and beyond
- Single-Cell RNA Sequencing in Cancer: Lessons Learned and Emerging Challenges
- Application of single-cell sequencing technologies in pancreatic cancer
- Discriminating mild from critical COVID-19 by innate and adaptive immune single-cell profiling of bronchoalveolar lavages
- Host-Viral Infection Maps Reveal Signatures of Severe COVID-19 Patients
- Comprehensive Integration of Single-Cell Data
- SCANPY: large-scale single-cell gene expression data analysis
- Attention is all you need
- Large-scale video classification with convolutional neural networks
- Deep learning in natural language processing
- Exploring single-cell data with deep multitasking neural networks
- A hybrid deep clustering approach for robust cell type profiling using single-cell RNA-seq data
- Deep generative modeling for single-cell transcriptomics
- Single-cell RNA-seq denoising using a deep count autoencoder
- scIGANs: single-cell RNA-seq imputation using generative adversarial networks
- DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data
- A benchmark of batch-effect correction methods for single-cell RNA sequencing data
- Machine learning and statistical methods for clustering single-cell RNA-sequencing data
- A comparison of automatic cell identification methods for single-cell RNA sequencing data
- A comparison of deep learning-based pre-processing and clustering approaches for single-cell RNA sequencing data
- Smart-seq2 for sensitive full-length transcriptome profiling in single cells
- Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets
- Single-cell RNA-seq analysis software providers scramble to offer solutions
- Single-Cell RNA-Seq Technologies and Related Computational Data Analysis
- Bayesian Analysis of Single-Cell Sequencing Data
- Pooling across cells to normalize single-cell RNA sequencing data with many zero counts
- Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression
- Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors
- Integrating single-cell transcriptomic data across different conditions, technologies, and species
- Fast, sensitive and accurate integration of single-cell data with Harmony
- SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data
- SAVER: gene expression recovery for single-cell RNA sequencing
- An accurate and robust imputation method scImpute for single-cell RNA-seq data
- Nonlinear dimensionality reduction by locally linear embedding
- SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data
- Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data
- Dimensionality reduction for visualizing single-cell data using UMAP
- Unfolding communities in large complex networks: combining defensive and offensive label propagation for core extraction
- Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning
- MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data
- Bayesian approach to single-cell differential expression analysis
- DEsingle for detecting three types of differential expression in single-cell RNA-seq data
- Generative adversarial networks
- Improved training of Wasserstein GANs
- Recovering Gene Interactions from Single-Cell Data Using Data Diffusion
- Data denoising with transfer learning in single-cell transcriptomics
- Imputation of single-cell gene expression with an autoencoder neural network
- scGMAI: a Gaussian mixture model for clustering single-cell RNA-Seq data based on deep autoencoder
- CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data
- BEGAN: Boundary Equilibrium Generative Adversarial Networks
- BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes
- Integrating structured biological data by Kernel Maximum Mean Discrepancy
- Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor
- BBKNN: fast batch alignment of single cell transcriptomes
- Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis
- Improved deep embedded clustering with local structure preservation
- Efficient integration of heterogeneous single-cell transcriptomes using Scanorama
- iMAP: integration of multiple single-cell datasets by adversarial paired transfer networks
- Using neural networks for reducing the dimensions of single-cell RNA-Seq data
- Dhaka: Variational Autoencoder for Unmasking Tumor Heterogeneity from Single Cell Genomic Data
- Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq
- Scalable whole-genome single-cell library preparation without preamplification
- Interpretable dimensionality reduction of single cell transcriptome data with deep generative models
- scVAE: variational auto-encoders for single-cell gene expression data
- VASC: Dimension Reduction and Visualization of Single-cell RNA-seq Data by Deep Variational Autoencoder
- Categorical reparameterization with Gumbel-softmax (arXiv)
- Clustering single-cell RNA-seq data with a model-based deep learning approach
- The Human Cell Atlas
- Unsupervised deep embedding for clustering analysis
- Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks
- cGANs with projection discriminator
- Splatter: simulation of single-cell RNA sequencing data
- Geometry-based data generation
- Interpretable factor models of single-cell RNA-seq via variational autoencoders
- Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis
- Reversed graph embedding resolves complex single-cell trajectories
- UMAP: uniform manifold approximation and projection for dimension reduction
- Visualizing data using t-SNE
- PHATE: a dimensionality reduction method for visualizing trajectory structures in high-dimensional biological data
- Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning
- DigitalDLSorter: Deep-Learning on scRNA-Seq to Deconvolute Gene Expression Data
- Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer
- Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors
- xCell: digitally portraying the tissue cellular heterogeneity landscape
- Inferring tumour purity and stromal and immune cell admixture from expression data
- Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data
- Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression
- An interpretable deep-learning architecture of capsule networks for identifying cell-type gene expression programs from single-cell RNA-sequencing data
- Neural network implementation using bit streams
- Massively parallel digital transcriptional profiling of single cells
- netAE: semi-supervised dimensionality reduction of single-cell RNA sequencing to facilitate cell labeling
- Modularity and community structure in networks
- Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models
- Supervised Adversarial Alignment of Single-Cell RNA-seq Data
- Dimensionality reduction by learning an invariant mapping
- CaSTLe: classification of single cells by transfer learning: harnessing the power of publicly available single cell RNA sequencing experiments to annotate new experiments
- Deep learning for inferring gene relationships from single-cell expression data
- A web server for comparative analysis of single-cell RNA-seq data
- GTRD: a database of transcription factor binding sites identified by ChIP-seq experiments
- Gene coexpression measures in large heterogeneous samples using count statistics
- Systems biology: Conditional density-based analysis of T cell signaling in single-cell data
- KEGG: new perspectives on genomes, pathways, diseases and drugs
- The Reactome Pathway Knowledgebase
- Inferring regulatory networks from expression data using tree-based methods
- Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
- scGen predicts single-cell perturbation responses
- Advances in Neural Information Processing Systems 28
- Aligning biological manifolds
- Multiplexed droplet single-cell RNA-sequencing using natural genetic variation
- A single-cell survey of the small intestinal epithelium
- Gene expression variability across cells and species shapes innate immunity
- DEEPsc: A Deep Learning-Based Map Connecting Single-Cell Transcriptomics and Spatial Imaging Data
- Clustering single cell CITE-seq data with a canonical correlation based deep learning method
- Surface protein imputation from single cell transcriptomes by deep neural networks
- Classification of Cancer Types Using Graph Convolutional Neural Networks
- Prediction and interpretation of cancer survival using graph convolution neural networks
- Relational inductive biases, deep learning, and graph networks
- scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses
- Human embryonic stem cells extracellular vesicles and their effects on immortalized human retinal Muller cells
- Simultaneous epitope and transcriptome measurement in single cells
- Molecular Diversity of Midbrain Development in Mouse, Human, and Stem Cells
- Single-Cell Map of Diverse Immune Phenotypes in the Breast Tumor Microenvironment
- Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm
- A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure
- Multilineage communication regulates human liver bud development from pluripotency
- A Single-Cell Transcriptome Atlas of the Human Pancreas
- A survey of human brain transcriptome diversity at the single cell level
- Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma
- Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma
- Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex
- RNA Sequencing of Single Human Islet Cells Reveals Type 2 Diabetes Genes
- Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells
- An Immune Atlas of Clear Cell Renal Cell Carcinoma
- Mapping the Mouse Cell Atlas by Microwell-Seq
- Single-cell analysis of experience-dependent transcriptomic states in the mouse visual cortex
- Single-Cell Transcriptomics Reveals that Differentiation and Spatial Signatures Shape Epidermal and Hair Follicle Heterogeneity
- Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors
- Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells
- Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing
- Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells
- Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells
- Heterogeneity in Oct4 and Sox2 Targets Biases Cell Fate in 4-Cell Mouse Embryos
- Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression
- Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing
- Brain structure: Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq
- Single-Cell Transcriptomic Map of the Human and Mouse Bladders
- Population snapshots predict early haematopoietic and erythroid hierarchies
- A single-cell molecular map of mouse gastrulation and early organogenesis
- The single-cell transcriptional landscape of mammalian organogenesis
- Wishbone identifies bifurcating developmental trajectories from single-cell data
- A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation
- Comprehensive single-cell transcriptional profiling of a multicellular organism
- Cluster ensembles: a knowledge reuse framework for combining multiple partitions
- Normalized mutual information to evaluate overlapping community finding algorithms
- Information theory, inference and learning algorithms
- Comparing Partitions
- A test metric for assessing single-cell RNA-seq batch correction
- V-measure: A conditional entropy-based external cluster evaluation measure