key: cord-0779652-9a48btos
authors: Chen, Yanshuo; Wang, Yixuan; Chen, Yuelong; Cheng, Yuqi; Wei, Yumeng; Li, Yunxiang; Chan, Ting-Fung; Li, Yu
title: Deep autoencoder for interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis
date: 2021-11-01
journal: bioRxiv
DOI: 10.1101/2021.10.26.465846
sha: 77ec80d65de1659fb0bde966a3c1277b02770705
doc_id: 779652
cord_uid: 9a48btos

Single-cell RNA-sequencing (RNA-seq) has become a powerful tool to study biologically significant characteristics at explicitly high resolution. However, its application on emerging data is currently limited by its intrinsic techniques. Though many methods have been proposed to analyze bulk data using single-cell profile as a reference, they are limited on the interpretability, processing speed, and data size requirement. Here, we introduce Tissue-AdaPtive autoEncoder (TAPE), a deep learning method connecting bulk RNA-seq and single-cell RNA-seq, to achieve precise prediction in short time. By constructing an interpretable decoder and training under a unique scheme, TAPE can predict cell-type fractions and cell-type-specific gene expression tissue-adaptively. Compared with existing methods on several benchmarking datasets, TAPE is more accurate (up to 40% improvement on the real bulk data) and faster. Thus, it is sensitive enough to provide biologically meaningful predictions. For example, only TAPE can predict the tendency of increasing monocytes-to-lymphocytes ratio in COVID-19 patients from mild to serious symptoms, of which estimated indices are consistent with laboratory data. More importantly, through the analysis of clinical data, TAPE shows its ability to predict cell-type-specific gene expression profiles with biological significance. Combining with single-sample gene set enrichment analysis (ssGSEA), TAPE also provides valuable clues to investigate the immune response in different virus-infected patients. We believe that TAPE will enable and accelerate the precise analysis of high-throughput clinical data in a wide range.

Bulk RNA sequencing (RNA-seq), a widely used high-throughput sequencing technique, provides a powerful tool to investigate transcriptome variation of biological events [1] . RNA-seq measures averaged expression levels, which gives a macro atlas of large samples from transcription level without cell-specific information. However, it is also important to study the cellular composition and proportion of the sample in some cases, especially in a system with cellular development and proliferation (e.g., cancer) [2, 3] .

Recently, single-cell RNA sequencing (scRNA-seq) has given unprecedented opportunities to identify and analyze the cell heterogeneity of complex tissues [4] . While scRNA-seq provides impressive resolution in cell granularity, it is still costly and vulnerable to noise, prohibiting sequencing the large-scale samples [5, 6] . To overcome these obstacles, we may combine the abundant bulk RNA-seq data with the scRNAseq data, performing cell-type deconvolution from the bulk RNA-seq samples with reference to a small scRNA-seq dataset.

Many single-cell profile-assisted algorithms have sprung up to dissect bulk RNA-seq data in the past few years. The existing methods can be roughly divided into two categories: statistical learning-based and deep learning-based methods. Based on traditional regression models like non-negative least squares (NNLS) and support vector regression (SVR), a series of methods like CIBERSORT (CS) [7] , MuSiC 1 [8] , CIBERSORTx (CSx) [9] , Bisque [10] , DWLS [11] , and RNA-Sieve [12] have been developed. All these tools need a pre-selected cell-type-specific gene expression profile (GEP). In contrast, Scaden [13] , a deep learning method, utilizes simulated bulk data for training without relying on a pre-defined GEP. Despite this progress, these methods ignore the running time cost, especially regarding the growing demands of dealing with big datasets. Moreover, except CSx, other methods, like Scaden, can not predict the crucial cell-type-specific gene expression. This limitation leads to the poor interpretability of Scaden and other methods. Even for CSx, its function to predict cell-type-specific gene expression can not accommodate small data sizes (<15) [9] .

To overcome these limitations, we propose an accurate, efficient, and interpretable deep-learning algorithm, Tissue-AdaPtive autoEncoder (TAPE), using deep neural networks (DNNs). The basic idea is that the encoder can learn higher-order latent representations and decoder can realize the interpretability of the output in the framework of autoencoder. Moreover, we introduce a new training scheme named adaptive training to optimize the GEP tissue-adaptively. Empirically, our method could achieve a better performance up to 34% on simulated data and up to 40% on real data than the previous state-of-the-art methods. To demonstrate the clinical application of TAPE, we use three datasets to show that TAPE is sensitive to biological changes. To be specific, TAPE is the only method that predicts the increasing tendency of monocytes-to-lymphocytes ratio (MLR) value which is suitable within the clinical report (0.29-0.88) [14] in the COVID-19 peripheral blood mononuclear cells (PBMC) dataset. Furthermore, TAPE can predict cell-type-specific GEP tissue-adaptively with a lower input data size requirement. Thus, TAPE can provide a valuable reference for biologists to more conveniently investigate differentially expressed genes (DEGs). We further combine TAPE with single-sample gene set enrichment analysis (ssGSEA) on the virus-infected PBMC dataset to prove its capability of analyzing cross-viral functional differences among cells.

In this work, we build TAPE to precisely predict cellular fractions and cell-type-specific gene expression. The novelty lies in adopting the architecture of autoencoder as well as introducing a new training scheme at the adaptive stage. Compared with the state-of-art methods, TAPE shows a competitive performance and has the fastest processing speed on benchmarking datasets. In addition, TAPE predicts the cell-type-specific gene expression tissue-adaptively, allowing the dissection of bulk gene expression into different cell types and discovering the potential differential gene expression among cell types.

As shown in Figure 1 , the basic architecture of our method is a DNN-based autoencoder (AE), taking bulk gene expression profile as input and outputting cell-type proportions and cell-type-specific gene expression profiles (GEPs). There are three stages of using TAPE. The first stage is to create training data through simulation. Simulated bulk data is the sum of selected single-cell gene expression profiles with the pre-defined cell fractions and the total cell numbers. The single-cell profile and the real bulk profile should come from the same tissue. The next is the training stage. We want to train the model to output the proper cell fractions after the encoder and use the cell fractions to reconstruct the bulk profile. More than only using the reconstruction loss in the classic AE model, we try to minimize the mean absolute error (MAE) between the ground truth and the predicted cell fractions to make it supervised. When the model is required to predict the cell fractions and the cell-type-specific GEPs on the real bulk data, it enters the adaptive stage. In this process, inspired by the classic AE's training process, we only use real bulk data to train the model in an unsupervised manner. More specifically, the model is iteratively greedily optimized on the decoder and the encoder. That is, it would not optimize the parameters of the encoder until it achieves the temporally best parameters on the decoder. As for the decoder, we require it to reconstruct the real bulk data and maintain the concordance with itself, while the encoder is required to predict the proper cell fractions which should be similar to the primary prediction after the training stage. Since we require the decoder to output the reconstructed bulk gene expression based on the cell fractions, the parameters of the decoder are the cell-type-specific GEPs as a matter of course, so we could directly output these parameters as the GEPs after the adaptive stage. Figure 1 : TAPE workflow and clarification of adaptive stage. a TAPE takes scRNA-seq data from human or mouse and RNA-seq data from the homologous tissue as input, then performs the deconvolution as well as the prediction of cell-type specific GEPs via a training stage and an adaptive stage. b Generation of the cell-type-specific GEPs has two separate modes. The first is the "highresolution" mode: TAPE takes the RNA-seq data from one sample at a time as input and outputs the adapted cell-type-specific signature matrix for each sample. The second is the "overall" mode: TAPE takes all the RNA-seq data at one time as input and outputs one signature matrix adapted to all samples.

We used TAPE to analyze pseudo-bulk data generated in silico from single-cell gene expression profiles. To simulate the real deconvolution scenario, we added artificial noise, like Gaussian noise and drop-out events when constructing the pseudo-bulk data. On the simulated data generated from the single-cell data of five mouse organs [15] , we compared the performance of TAPE to that of four representative deconvolution methods, namely Scaden, RNA-Sieve, CSx, and DWLS, with 5-fold validation. Performance was evaluated by the mean absolute error (MAE) and Lin's concordance correlation coefficient (CCC) [16] between the prediction and the ground truth. More details of the simulation process, dataset, and metrics for evaluation are in the Method part. Figure 2 : Comparison of deconvolution algorithms on benchmark datasets. a Deconvolution results on simulated data and real data. CCC represents the Lin's concordance correlation coefficient, measuring the concordance between the predicted fraction and the ground truth. MAE represents mean absolute error, measuring the accuracy of prediction. Blue represents a good performance, and red represents a bad performance. b Deconvolution procedure diagram. Bulk RNA-seq data and singlecell data should come from a homologous tissue. c Detailed comparison between the performance of Scaden and TAPE on 5 real bulk datasets using 50 different random seeds. The dots on the upper right represent better performance than dots on the bottom left. d Time complexity analysis of different methods. These tests are conducted on the simulated data. The time limit is set to 2500s. Any longer test was not conducted. TAPE performs as a constant time algorithm.

As shown in Figure 2a , TAPE achieves the best performance on all the datasets. TAPE reduces 13%-34% of the MAE and raises 2%-34% of the CCC in comparison with Scaden. This result lies in the decoder we bring in, which regularizes the output fractions of the encoder to approach the specific biological significance via minimizing the reconstruction loss. Regarding handling the artificial noise, deep learning-based methods are more robust than linear models, which is highly related to the structural superiority of neural networks. The likelihood-based inference method, RNA-Sieve, is 5%-40% worse than TAPE while performing better than CSx and DWLS.

We further evaluated TAPE and the other four representative cell deconvolution methods on real tissue expression datasets with the corresponding ground truth. First, we assessed deconvolution performance on two human PBMC bulk RNA-seq datasets, SDY67 [17] and the S13 cohort from Monaco et al. [18] . Another PBMC microarray dataset was obtained from Newman et al. [7] . Second, we deconvolved the ROSMAP human brain RNA-seq dataset [19] with both human brain single-cell RNA-seq and mouse brain single-cell RNA-seq as references. Through immunohistochemistry analysis, the cell-type fractions of 41 samples of the ROSMAP dataset were recently given [20] . Detailed deconvolution software comparison and settings are in the Method part.

Among all the real datasets considered, TAPE achieves the best MAE on the SDY67 (0.061), Monaco's (0.044), and ROSMAP datasets (human scRNA-seq as reference 0.036; mouse scRNA-seq as reference 0.044), while Scaden performs (0.070) slightly better than TAPE (0.073) only in the Newman's dataset. Newman's dataset is a microarray dataset. Unlike the scRNA-seq or RNA-seq that we use for training and prediction, the differentiation among the expression of genes in the microarray dataset is very small. Moreover, Scaden is an ensemble of 3 different models which may be more robust than TAPE in this scenario. RNA-Sieve obtains the worst MAE (0.26 and 0.25) in the SDY67 and Monaco's datasets for the poor learning ability for unknown cell type ratios. But RNA-Sieve achieves better performance (human scRNA-seq as reference: 0.09; mouse scRNA-seq as reference: 0.06) in the ROSMAP dataset than other statistical deconvolution methods. CSx is a powerful computational tool that only has slightly worse results than Scaden when deconvolving bulk RNA-seq. For instance, CSx gets an MAE of 0.14 in the SDY67 dataset, 0.089 in the Monaco's dataset, 0.088 (0.065) in the human (mouse) referenced ROSMAP dataset. The benchmarking results valued by MAE and CCC are detailed in Figure 2a .

Furthermore, since TAPE and Scaden are both DNN-based methods, we made a head-to-head comparison between them using different random seeds to evaluate TAPE's stability. In Figure 2c , the two colors stand for the different methods, and the two coordinates of each dot represent MAE and CCC respectively. As shown in the figure, the dots of TAPE occupy the upper-right of the figures with low variance, showing TAPE's better performance and stability than Scaden.

Except for accuracy and stability, the methods' scalability is also important in practice. Therefore, we evaluated the time consumption of the four representative methods mentioned above on the same pseudo-bulk samples. We ran TAPE, Scaden, RNA-Sieve, and DWLS on the same workstation with Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz, CentOS Linux release 7.9.2009 (Core), Nvidia 3090 GPU. CSx was tested on the web-based application. Detailed implementations are in the Method part.

It is worth noticing that TAPE runs the fastest among all the other methods. According to the time record, TAPE deconvolves 800 samples in about 120 seconds, which is 33% faster than Scaden. Besides optimization of the sampling process, TAPE is lighter than Scaden that combines 3 models. Once the training process is finished, TAPE takes little time to deconvolve compared to the statistical methods. Written in Python, RNA-Sieve runs faster than DWLS and CSx. RNA-Sieve can deconvolve 800 samples in less than 400 seconds, which is quite good. However, the time consumption of RNA-Sieve increases linearly, Thus, it is incapable of dealing with massive samples. DWLS and CSx run at a relatively slow speed. Since the web station of CSx did not provide a timer, we did manual timing. And it needs more than 40 mins to deconvolve 300 samples. Figure 2d is truncated at 2,500 seconds.

We further evaluated whether TAPE could predict cell-type proportions consistent with clinical prior knowledge. Here, we selected three datasets with clinical information or related prior knowledge: 1. the ROSMAP dataset [21] that is obtained from patients with Alzheimer's Disease (AD); 2. the COVID-19 PBMC dataset [22] containing clinical information about different severity of COVID-19 (mild, moderate, and serious) and different stages of patients (treatment stage, convalescence stage, and rehabilitation stage); 3. the cultured pancreatic islet dataset [23] which has RNA-seq data of islet from three different conditions (normal, SARS-CoV-2 infected, and SARS-CoV-2 infected tissue with Remdesivir treatment). Detailed information of these datasets is in the Method part.

For the ROSMAP dataset, we used the human brain single-cell profile from Darmanis et al. [24] as reference. As we know, neuron cell loss is a significant symptom in patients with AD. In the ROSMAP dataset, the Braak stage is given as a measurement of the severity of AD [25] . So we expected neuron fraction would decrease with the development of AD. Additionally, we investigated each sample's Braak stage to the estimated fraction of microglia whose proportion will increase with AD severity and will decrease at stage 6 as shown by previous study [26, 27] . The results (Figure 3a , 3b) show that TAPE can predict the tendency of neuron loss and have a good prediction of microglia activation and deactivation among 532 samples with clinical information. Moreover, according to the immunohistochemistry analysis of AD from a previous study, the proportion of neurons or microglia cells ranging from 0.32-0.55 and 0.06-0.12 respectively [20] . Impressively, only TAPE could predict proportions in this range, which shows the great accuracy of TAPE's prediction.

Next, we used the PBMC data8k [28] dataset as the reference to deconvolve the COVID-19 PBMC dataset. According to existing clinical observations and researches, metrics like neutrophil-to-lymphocyte ratio (NLR) and monocyte-to-lymphocyte ratio (MLR) are closely associated with the progression of the COVID-19, thus which are used to indicate its severity clinically. [29, 30] . In practice, patients with the Figure 3 : Deconvolution benchmark on datasets with clinical information. a Comparison of estimated neuron cell proportion on different Braak stages between different models on the ROSMAP dataset. Neuron content is expected to decrease along with the development of AD. b Microglia content estimated by different methods on Braak stage. The fraction is expected to increase from stage 0 to 5 while there will be a decrease from stage 5 to 6. c Estimated MLR value calculated from the estimated monocytes fraction divided by the sum of estimated proportions of CD4 + T cell, CD8 + T cell, and B cell. Only TAPE's prediction has a proper tendency and is accordant with laboratory data. d Estimated beta-cell fractions of cultured islet in different conditions. The middle one represents samples infected with SARS-CoV-2, and the right one means samples treated with Remdesivir after infection. The model should predict the restoration of beta-cell content after being treated with medication. higher NLR or MLR show a more serious symptom in COVID-19. Since the data we used is obtained from PBMC which do not contain neutrophils, we only tested the correlation of MLR and the severity of COVID-19 patients. More specifically, we only considered the treatment stage data because patients in the convalescence or rehabilitation stage do not represent the same pathology characteristics as the real infected circumstances. We used the estimated MLR value predicted from different models to compare the tendency between different severity (Figure 3c ). Only TAPE predicted the increasing tendency of MLR value, and the value range is suitable with the clinical report (0.29-0.88) [14] .

To deconvolve the cultured islet dataset, we selected the endocrine cells (alpha, beta, gamma, delta, and epsilon cells) from Baron et al. [31] to generate training data because pancreatic islet only contains endocrine cells. Since the infection of SARS-CoV-2 usually causes metabolic dysregulation and Mellitus [23] , we expected the decrease of beta-cell fraction in COVID-19 patients. Here, we used the sequencing data of in vitro cultured islets to deconvolve and expected TAPE to predict the decrease of beta cell proportion after infection. Furthermore, the proportion of beta cell should restore after treatment with Remdisivir, a very famous antiviral medication used to treat COVID-19 ( Figure 3d ). We found that only TAPE and Scaden can predict both beta cell loss and restoration in this experiment. The accurate deconvolution results of these controlled experiments demonstrate that TAPE is sensitive to the biological changes in the bulk RNA-seq data and can produce biologically significant results.

More than only predicting cell fractions of bulk RNA-seq data like the existing deep-learning method, TAPE could also predict the cell-type-specific gene expression tissue-adaptively. That is, TAPE only needs simulated data from healthy samples to train, but it can predict the cell-type-specific gene expression in pathological conditions if the corresponding bulk RNA-seq data is given. This feature enables TAPE to dissect bulk gene expression into different cell types and discover some potentially differentially expressed genes in different cell types.

We began with testing the correctness of the predicted cell-type-specific GEPs. To test this, we measured the concordance between the predicted gene expression value of each cell type and the original gene expression value obtained from single-cell RNA-seq (Fig 4a) . Here, the PBMC bulk data is from Monoco et al. [18] , while the single-cell data is the data8k dataset from the 10X website [28] . Since we transformed the input RNA-seq data into 0-1 value using Log 2 and M inM axScaler() in the training stage (more in the Method part), the sums of gene expression value grouped by cell types are also transformed in this way to compare with the predicted relative gene expression value. Note that only gene expression in monocytes does not have a good concordance. This may be caused by the batch effect from different individuals. The concordance shown in the figure proves that TAPE predicts the signature matrix correctly and establishes the base for further gene expression analysis.

More than the concordance, we also expect that TAPE can assign the gene expression value in bulk data to different values at a cell-type level. To test this, we used the ROSMAP RNA-seq dataset [21] and human brain single-cell profile [24] to perform adaptive training in the "overall" mode. The deconvolution result (Fig 4d) of cell-type-specific GEPs shows that TAPE indeed predicted the differentially expressed genes in different cell types. However, since TAPE takes single-cell gene expression as input, these differences may be inherent from single-cell data. So, we compared the original signature matrix from single-cell data to the adapted signature matrix using the heatmap (Fig 4c) . We further investigated whether TAPE just inherits the data distribution from bulk RNA-seq data and whether the different distributions of different cell types are randomly assigned. We selected the NRGN gene to study it. Since the NRGN gene has been shown to be closely associated with AD [32] , we expected it to have a high gene expression level in neurons or other nerve cells. Interestingly, comparing the relative NRGN expression value in bulk GEP, single-cell GEPs, and predicted GEPs (Fig 4b) , we found that TAPE can successfully predict a high expression value of NRGN in neurons while a low expression value of NRGN in endothelial cells. More specifically, this shows that the prediction of cell-type-specific GEPs is a product of two-side information from both bulk and single-cell profile, not randomly assigned or guessed.

For HIV-infected patients, they can be classified into two different classes based on the existence of broadly neutralizing antibodies (BNab). Recently, a study about the development mechanism of BNab in HIV patients used bulk RNA-seq and population sorted RNA-seq to investigate the most differentially The relative gene expression value of NRGN from different sources. The dashed line represents the total relative NRGN expression value in the AD patients' brain tissue. c, d Estimated signature matrix after the adaptive stage in the "overall" mode. The gene expression is normalized with Z-Score. The genes are selected by the differential expression in different cell types after the adaptive stage. The differences between before and after adaptive stage represent TAPE could not only make the signature matrix adapted to new data and but also maintain concordance with the original one. e Boxplots of the estimated RAB11FIP5 gene expression value in different cell types. This gene has been proved to have differential expression in NK cells through experiments. TAPE predicts this phenomenon through digital analysis without physical experiments. 8 expressed gene (DEG) [33] . In this study, researchers first made conclusion that RAB11FIP5 is the most differentially expressed gene in natural killer (NK) cells and leads to the development of BNab. The steps they used to find the relation between RAB11FIP5 and NK could be replaced with the cell-typespecific gene expression analysis at high resolution. We used TAPE to tissue-adaptively deconvolve the HIV PBMC data [33] . To avoid batch effects and harmful effects caused by the low-quality single-cell data, we combined data6k, data8k, and data10k PBMC single-cell data [28, 34, 35] as reference. In the "high-resolution" mode of TAPE, TAPE could predict cell-type-specific GEPs for each sample. After obtaining the predicted GEPs at high resolution, we calculated the adjusted p-value and fold change for each cell type and the original bulk RNA-seq data (Fig 4e) . The results show that TAPE successfully predicts that RAB11FIP5 is differentially expressed in NK cells, which means that TAPE could tell NK as a source of producing the most differentially expressed gene in PBMC correctly and precisely.

Although the GEPs prediction is not as reliable as real experiments, TAPE could be a valuable reference for biologists investigating differentially expressed genes at a cell-type-specific level.

To further prove the versatility of TAPE, we applied TAPE on the PBMC RNA-seq data of three kinds of virus-infected samples, including SARS-CoV-2 infected, which is the severe acute respiratory syndrome coronavirus 2 that has been sweeping the world, hepatitis C virus (HCV) infected, which caused 290,000 death in 2019, and human immunodeficiency virus (HIV) infected, which will lead acquired immunodeficiency syndrome (AIDS). These three virus infections damage the host immune system but lead to different syndromes. Knowing the specific function differences in specific cells could help us in both the treatment and prevention of these infections. Besides the differential expressed genes, we also investigated the functions of each cell type by incorporating cell-specific GEPs and ssGSEA. As the ssGSEA algorithm only needs the gene rank, which could be provided by our cell-specific GEPs, we could predict the activities of each function pathway for each sample. Considering function pathways that significantly (p adj < 0.05) (in)activated in at least one sample, we found the samples that were infected by different viruses clustered (Pearson for distance, ward.D2 for cluster) together (Figure 5a ). Comparing SARS-CoV-2 infected samples with the other two virus-infected samples, we found that functional pathways were potential to activate then to inactivate. Also, the HIV-infected samples are similar to HCV-infected samples, showing the difference from the SARS-CoV-2 infection. Besides, subsets of samples within each virus-infected sample could also be identified, which might present the heterogeneous samples within the same virus infection.

Even the activities of the significant function pathways show differences among the three virus infections. The proportions of common significant enriched pathways were large in different cell types ( Figure  5b , 5c, 5d). More significant enriched function pathways were observed in SARS-CoV-2 infected samples, compared to the other two virus-infection samples. In the B cells, HIV-infected samples share 99% of significantly enriched pathways with HBV-infected samples, while SARS-CoV-2 occupied more than 40% of the significantly enriched pathways privately (Figure 5b) . These SARS-CoV-2 private enriched pathways contributed to the identification of the subset samples (Figure 5e ).

Of note, monocytes and NK cells contributed to distinguishing these three kinds of virus-infected samples (Figure 5f, 5g) . We noticed that the number of common enriched pathways in these two cell types are much larger than the numbers of mono-enriched or di-enriched pathways, indicating the activation differences, rather than functional differences, make the various of three virus-infection samples.

We develop TAPE as a novel deep-learning algorithm for digital tissue dissection. Key features distinguishing it from previous methods include 1. highly accurate and sensitive deconvolution to capture the biologically significant changes in clinical data, and 2. tissue-adaptive cell-type-specific gene expression profile prediction to identify potential gene expression differences at the cell-type level. TAPE benefits from the architecture of the autoencoder and the unique training method in the adaptive stage. The encoder-decoder architecture enables an interpretable decoder to answer why the encoder makes such predictions. More interestingly, the decoder is a natural cell-type-specific signature matrix that can be learned after the training stage and then adapted to the bulk data after the adaptive stage. Notice that the special training process of TAPE makes it fundamentally different from other methods, which only predict cell fractions or need large cohort bulk RNA-seq data to impute cell-type-specific GEPs. Another advantage of TAPE is the constant running time when deconvolving a large number of samples. Running on the popular graphic card, TAPE is much faster than traditional statistical methods and even faster than the previous deep-learning method.

As highlighted before, TAPE can predict cell-type-specific GEPs tissue-adaptively. But admittedly, it can be improved further. Firstly, a normalized gene expression value has lost the original information about total read count, which is hard for researchers to analyze DEGs perfectly. Secondly, although we have proved that TAPE could capture the biological significance of the highly expressed genes and highly differentially expressed genes, the fidelity of the insignificant genes is still unknown.

In summary, TAPE represents a widely applicable framework for deciphering heterogeneity of tissues at a cell-type level, and provides a practical training scheme for supervised autoencoder to perform domain adaptation. Considering it can be integrated with other tools seamlessly, we believe that TAPE will be helpful to investigate the connection between the single-cell data and the abundant bulk data.

In this work, we used several public single-cell RNA-seq datasets, bulk RNA-seq datasets, and microarray datasets to perform our experiments. In the pseudo-bulk test, a single-cell dataset of mouse atlas from Tabular Muris [15] was used. This dataset consists of 20 organs and tissues with cell type labels provided by the authors. Only five tissues' (kidney, liver, lung, skin, thymus) data produced by the fluorescence activated cell sorting (FACS) method was used to perform the pseudo-bulk test.

In the experiments of real bulk data with ground truth, we used several real bulk datasets with the corresponding cell fractions. The first PBMC dataset SDY67 was created by Zimmermann et al., but it was indirectly obtained from Scaden's training data with unknown fractions. The second PBMC dataset created by Monaco et al. could be downloaded from the GEO database with accession code GSE107011. The corresponding cell fractions data was provided as supplementary information of the original paper. More specifically, the unknown fraction was calculated by one minus sum of known proportions and cell types of the same kind were added together to fit cell types in training data. For example, monocytes C, monocytes I, and monocytes NC are different kinds of monocytes, so their fractions will be added together as the total fraction of monocytes. The third PBMC dataset is created by Newman et al. Its expression data were downloaded from GEO with code GSE65133 and its cell fractions were provided on the webpage of CIBERSORT [7] . Next, the dataset we used to deconvolve human tissue with Alzheimer's Disease (AD) is from a project called Religious Orders Study and Memory and Aging Project (ROSMAP) [21] . This dataset consists of about 600 samples of RNA-seq data from AD patients, while 41 of them have cell-type proportion information measured by immunohistochemistry in another study [20] . The gene expression data was obtained from supplementary data of Scaden rather than the original program of ROSMAP to maintain consistency during the test. As for the single-cell datasets, 8k PBMC dataset from healthy donors was downloaded from 10X Genomics [28] , mouse and human brain datasets were obtained from the GEO database with accession code GSE87544 and GSE67835 respectively [24, 36] . All of these datasets were preprocessed to generate cell-type labels using the same procedure in Scaden. Notably, if the training data is available in Scaden, like PBMC and mouse brain datasets, we just used the training data provided by the author of Scaden to assess performance

In the advanced analysis of real bulk data with clinical information, three different datasets were involved. Since the ROSMAP dataset has been introduced above, here we only describe the other two datasets. The first is COVID-19 PBMC dataset [22] from a longitude study of patients with COVID-19, this dataset has 39 RNA-seq data of PBMC constitutes of different stages (treatment stage, convalescence stage, and rehabilitation stage) and different types (mild, moderate, and serious) from 16 patients. The second is the COVID-19 islet dataset which is from a study of the SARS-CoV-2 infected islet. This dataset only has six samples which are divided into three groups: normal cultured group, infected group, and Remdesivir treated group. The single-cell dataset used as reference is from Baron et al. [31] (GEO accession code: GSE84133) which has 14 labeled cell types in pancreas tissue. Instead of using all the cells in the dataset, we selected only endocrine cells: alpha cell, beta-cell, delta cell, gamma cell, and epsilon cell to constitute the reference dataset.

In the final analysis of tissue-adaptive GEP, we introduced an HIV PBMC dataset from the GEO database with accession number GSE115449. This dataset has PBMC data collected from 91 HIV patients, half of them have developed BNab and the others do not have BNab. Furthermore, when we used ssGSEA to analyze cellular function changes in PBMC cross different viruses infection, we used an HCV infected bulk RNA-seq data of PBMC (GEO database, accession number: GSE119117). This dataset is also from a longitude study of patients. RNA-seq data is collected from individuals before, during, and after acute HCV infection. Acute HCV infection resulted in spontaneous viral resolution (n=6) or chronic infection (n=8). Four time points were examined per subject: i) Pre-infection baseline (Variable); ii) Early acute (2-9 weeks, mean 6 weeks); iii) Late acute (15 -20 weeks, mean 17 weeks); and iv) Follow up (25-71 weeks, mean 52 weeks). Another virus-infected PBMC dataset is the COVID-19 PBMC dataset which has been mentioned before.

Note that, all the datasets involved in this study might use different ways to represent genes. To maintain the concordance, we processed all the different representations into gene names through BioMart [37] .

Usually, deep learning models need a large amount of training data to optimize. So, it is crucial to generate pseudo-bulk data from a single-cell dataset to train the model.

Single-cell expression data with cell-type fractions are used to generate pseudo-bulk data. By definition, pseudo-bulk expression data is the sum of single-cell expression data from a subset of cells. So, to generate pseudo-bulk data, cells should be sampled with a given cell type proportions (ground truth) and total cell number like the stratified sampling.

The random cell-type fractions were first generated using the randn() function from numpy.random package [38] and then divided by the sum of random numbers to ensure the sum of each cell type proportion is equal to 1. Next, we multiply the total cell number with the generated cell fractions for each sample to acquire the exact sampling number for each cell type. After that, we use a stratified sampling method to sample cells of each cell type with the given number. Finally, the pseudo-bulk expression profile is created by summing the expression values of the randomly selected single-cell expression profiles for each sample.

Additionally, if users want to predict tissue-adaptive GEPs and investigate the relative gene expression value (output GEP value is between 0 and 1), users need to consider the data shift between different sequencing methods. For example, counts data from the 10X sequencing platform represents the real expression value while counts data from smart-seq need to be further normalized using a method like TPM or FPKM to show the real expression value. Here we provide a simple function counts2FPKM (or TPM) to transform raw counts to FPKM (or TPM). Due to the original information loss of the processed single-cell expression profile, we just normalized raw counts of a certain gene with its maximum transcripts length obtained from BioMart [37] . So, we recommend users prepare a suitable single-cell profile in advance to avoid information loss.

According to the previous study from Scaden [13] , different sampling distributions and single-cell datasets with a heavy bias of different cell types do not affect deconvolution performance notably. Furthermore, we investigate the impact of varying total cell numbers; the results demonstrate that there is no significant difference when total cells number ranges from 100 to 1,000.

To illustrate our model more clearly, it is necessary to define the problem in advance. All of the symbols defined in this section are consistent throughout the article. Intuitively, we expect gene expression profile from bulk RNA-seq would be a linear combination of each cells' GEP from single-cell RNA-seq. Furthermore, if cells belonging to one kind of cell type have the same gene expression pattern, we could use the signature gene expression pattern and the number of cells for each type to reconstruct the GEP of a bulk RNA-seq data. So, given the number of k cell types, m genes, and n samples in bulk RNA-seq data, an ideal mathematic model could be defined as:

where B is an n × m matrix representing GEPs of bulk RNA-seq; S is a k × m signature matrix; X is an n × k matrix representing cell-type fractions in each sample.

Given the well-defined problem, we just need to modify the equation to accommodate deep learning:

Here, f ϕ and f ψ represent two coordinated deep neural network, symbols with tilde likeB refers to the output of the model, and S refers to the explicit matrix form of f ψ . Usually, f ϕ and f ψ are called encoder and decoder respectively in the classical architecture of AE. f ϕ is a regression model which is responsible for mapping the high dimensional bulk gene expression data to a low dimensional representation of cell compositions. In contrast, f ψ is the inverse function of f ϕ which is expected to reconstruct bulk data based on the cell fractions. Obviously, f ψ functions like the signature matrix discussed above. Therefore, we want to make it have an explicit matrix-form to enforce the interpretability of f ϕ . Because an ideal f ϕ could also be represented in matrix-form due to the inverse relation in math. To achieve the progress in interpretability of deep model, f ψ was designed without activation layers or bias, it is only the regularized value of dot product of five weight matrices, thus the signature matrix is visible in the deep model:

where ReLU(x) = (x) + = max(0, x). The reason to design such an equation to represent S rather than a single matrix is that more parameters could enable the model to learn a good signature matrix more quickly and easily and the ReLU(·) function is used to ensure the biological meaning of the signature matrix.

We need to stress that, it seems that our model assumes that cell proportions could be inferred from the bulk data directly through the function f ϕ without the signature matrix. However, if we consider f ψ , we will find that parameters of f ϕ is affected by f ψ during optimization. Just like other statistical methods computing the pseudo-inverse of the signature matrix in the fitting process, we also use the inverse relationship between f ϕ and f ψ in the training stage. Therefore, compared to the previous machine learning methods using a single function to predict fractions without regularization from the signature matrix, this architecture makes sense better

Although the input datasets varied between platforms and protocols, we utilize the same processing approach to prepare them for deep-learning models and avoid the dimensionality curse. As for the bulk data (real or simulated), it was first transformed to the Log 2 space with a pre-added one to avoid null value. Next, to maintain the meaningful signature matrix, we decide to use M inM axScaler() function provided by scikit-learn [39] to scale data into a range between 0 and 1. This function is described below:

, j = 1, 2, 3, . . . , m.

As previously stated, there are two stages of training in TAPE. The first is the training stage, in this stage, we use mean absolute error (MAE) between prediction and ground truth to optimize the parameters of encoder and MAE between reconstructed input and input to optimize both the decoder and the encoder. The loss functions are defined as:

Usually, we found that MAE(X,X) is stable after 5,000 iterations with batch size 128, and we just stopped training to avoid overfitting.

In the adaptive stage, we aimed to train the parameters to adapt new data rather than predicting cell fractions with the same parameters in all situations. To achieve this goal, we designed a greedily iterative optimizing method in a new manner: step 1. optimize the decoder with loss function MAE(B,B) + MAE(S,S 0 ) until MAE(B,B) did not decrease; step 2. optimize the encoder with loss function MAE(B,B) + MAE(X,X 0 ) until MAE(B,B) did not decrease. Usually, repeating step 1 and step 2 several times would make the parameters of TAPE adapt to the new data. In our practice, after the adaptive stage, the prediction of cell fractions will improve a little and it always outputs an adaptive signature matrix.

Generally, there are two different ways to analyze cell-type-specific GEP: 1. The "overall" mode: using all the samples at once to capture an overall cell-type-specific GEP in a certain condition. 2. The "high-resolution" mode: predicting all the samples one by one to maintain the differences between each sample. Certainly, the latter will consume more time than the first one. The choice of different modes mainly depends on users' demands. For example, if users want to discover the differentially expressed genes at a cell-type level, they should choose to predict GEPs in the "high-resolution" mode, thus they could calculate the p-value. On the other hand, if users only need to investigate the highly expressed genes in different cell types or have an overall look at the cell-type level, they should choose to predict GEPs in the "overall" mode.

It is worth noting that, the value of GEP predicted by TAPE is between 0 and 1 which represents the relative expression value among a single sample. This limitation is caused by the Min-Max scaling method when we preprocess the input data. Using this GEP may encounter some problems when users need to analyze the fold change of a certain gene due to the information loss induced by the nonlinear scaling function.

TAPE's encoder and decoder are both made up of five fully connected layers with the same weight size in the corresponding position. For example, the first layer of the encoder and the last layer of the decoder each have 512 nodes. More specifically, the number of nodes in each encoder layer is 512, 256, 128, 64, and the number of cell types in sequential order. Before the first four fully connected layers, each has a dropout function with a probability of 0.5; after each layer, each has a nonlinear activation function CELU(·), defined as CELU(x) = max(0, x) + min(0, e x−1 ). Decoder, on the other hand, does not contain any bias in the fully connected layers or nonlinear functions except the ReLU(·) function, as we mentioned before. During the training stage, we use Adam with a learning rate 1times10 −4 to optimize parameters. Other parameters of Adam are set as default in PyTorch. We train the network for 5,000 iterations with batch size 128. These training hyperparameters are succeeded from Scaden. While in the adaptive stage, we use Adam with the same learning rate 1 × 10 −4 to fine-tune the parameters on the new data. We train both the encoder and the decoder for 300 steps within each iteration. The max iteration number is flexible for users and we recommend users set it at least 2 to make it output a well-adapted signature matrix.

Within the main text above, we combined mean absolute error (MAE) with Lin's concordance correlation coefficient (CCC) [16] to evaluate different algorithms' performance because only one metric is hard to assess performance reasonably in all situations. For instance, if there are only two kinds of cell type in tissue and one type's fraction range from 80%-90% in the ground truth, if the model predicts this cell type fraction is 100%, then the CCC value will perform well but the MAE is not so good. So, to avoid the situation of discarding fractions of minor cell types, it is necessary to combine MAE with CCC. Generally, a higher CCC value and a lower MAE suggest a better deconvolution performance. These metrics are defined as follows:

MAE(X,X) = i,j |X i,j −X i,j | n × k ,

CCC(x,x) = 2 × cov(x,x) σ 2

x + σ 2

where cov(x,x) stands for the covariance between these two vectors. Notably, these two metrics are applied to all data points of the predicted matrixX and the ground truth matrix X. More specifically, as for the CCC value, we reshape the matrix into a vector and then calculate the total CCC between two vectors. This calculation pattern usually results in a higher CCC value than computing the average CCC value for each cell type.

To evaluate the performance of TAPE compared with other methods, we selected several representative methods for comparison. Except for Scaden, other methods were tested following the instruction and tutorials provided by each package. For deconvolution performance on the pseudo-bulk and real bulk data with ground truth, we benchmark Scaden, RNAsieve, CIBERSORTx, and DWLS [9, [11] [12] [13] . We will describe the details of the benchmarking procedure below.

For Scaden, since the package provided by the author did not have a clear API document, we implemented a PyTorch-based Scaden. The training hyperparameters were set following the instruction of the original article and source code. Though we tried our best to make it the same as the original Scaden, it still had some different behaviors, for example, loss plot and deconvolution performance on the SDY67 dataset were different from reported data. These differences were probably caused by the different deep learning backends (Keras or PyTorch). In general, since the differences were not huge, the implementation of Scaden is acceptable.

For RNA-Sieve, we used the python package provided by the author and used the default settings to deconvolve data. We first validated its performance on pseudo-bulk data and the original data provided by the authors. The results showed that it could perform well on the simulated data but it could not reproduce the same results as they reported on Newman's dataset and Monaco's dataset. There may exist some problems in the python-based RNA-Sieve.

For CIBERSORTx (CSx), we used the web-based application to test. As for the pseudo-bulk test, we did not perform the 5-fold evaluation on CSx because it consumed too much time to deconvolve. We only used the first batch (800 samples) of simulated data to deconvolve. We first used a single-cell profile to generate a signature matrix and then we used the corresponding bulk data to deconvolve with S mode batch-correction. Other settings were default.

For DWLS, we used the core functions and packages written in R programming language with the default settings to generate signature matrices and therefore deconvolving the targeted pseudo-bulk data and the real ones. To guarantee the rationality of our implementation, we carefully followed the example of the intestine stem cell provided by the manual of DWLS. Since the deconvolution function provided by DWLS only deconvolves one sample at a time, a for-loop is brought in because we need to deal with some large samples. Furthermore, our data needed to be logged to avoid overflow, because of the very strict input requirements.

It should be pointed out that, for all statistical methods (RNA-sieve, CSx, and DWLS), all PBMC datasets were deconvolved using a signature matrix generated from PBMC data8k dataset, reference of mouse brain dataset is generated from Chen et al. [36] , and signature matrix of human brain dataset is generated from Darmanis et al. [24] 5 Data availability All the datasets we used are listed in the Method part. Only the ROSMAP human brain dataset is not public, researchers need to download it from Synapse (ID: syn3219045) with a request. For convenience, we listed these datasets on the webpage: https://sctape.readthedocs.io/datasets/.

The open source implementation of TAPE is available at https://github.com/poseidonchan/TAPE, and the experiments conducted to produce the main results of this article are also stored in this repository. The documentation of TAPE is published at https://sctape.readthedocs.io/.

Rna sequencing: new technologies and applications in cancer research

Single-cell analysis supports a luminal-neuroendocrine transdifferentiation in human prostate cancer

Single-cell rna-seq reveals a subpopulation of prostate cancer cells with enhanced cell-cycle-related transcription and attenuated androgen response

Single-cell rna sequencing to explore immune cell heterogeneity

Computational and analytical challenges in single-cell transcriptomics

Self-supervised contrastive learning for integrative single cell rna-seq data analysis. bioRxiv

Robust enumeration of cell subsets from tissue expression profiles

Bulk tissue cell type deconvolution with multi-subject single-cell expression reference

Determining cell type abundance and expression from bulk tissues with digital cytometry

Accurate estimation of cell composition in bulk expression through robust integration of single-cell information

Accurate estimation of cell-type composition from gene expression data

A likelihood-based deconvolution of bulk gene expression data using single-cell references

Deep learning-based cell composition analysis from tissue expression profiles

Elevated monocyte to lymphocyte ratio and increased mortality among patients with chronic kidney disease hospitalized for covid-19

Single-cell transcriptomics of 20 mouse organs creates a tabula muris

A concordance correlation coefficient to evaluate reproducibility

System-wide associations between dna-methylation, gene expression, and humoral immune response to influenza vaccination

Rna-seq signatures normalized by mrna abundance allow absolute deconvolution of human immune cell types

Religious orders study and rush memory and aging project

Deconvolving the contributions of cell-type heterogeneity on cortical gene expression

A multi-omic atlas of the human frontal cortex for aging and alzheimer's disease research

Longitudinal transcriptome analyses show robust t cell immunity during recovery from covid-19

Sars-cov-2 infects and replicates in cells of the human endocrine and exocrine pancreas

A survey of human brain transcriptome diversity at the single cell level

Neuropathological stageing of alzheimer-related changes

Microglia in alzheimer's disease

Microglia in alzheimer's disease: Activated, dysfunctional or degenerative

8k pbmcs from a healthy donor, single cell gene expression dataset by cell ranger 2.1.0. 10X

Evidence of abnormally low lymphocyte-to-monocyte ratio in covid-19-induced severe acute respiratory syndrome

The diagnostic and predictive role of nlr, d-nlr and plr in covid-19 patients

A single-cell transcriptomic map of the human and mouse pancreas reveals interand intra-cell population structure

Association of neurogranin gene expression with alzheimer's disease pathology in the perirhinal cortex

Rab11fip5 expression and altered natural killer cell function are associated with induction of hiv broadly neutralizing antibody responses

6k pbmcs from a healthy donor single cell gene expression dataset by cell ranger 1.1.0. 10X

single cell gene expression dataset by cell ranger 3.0.0

Single-cell rna-seq reveals hypothalamic cell diversity

The biomart community portal: an innovative alternative to large, centralized data repositories

Array programming with numpy

Scikit-learn: Machine learning in python. the

We thank Yuqi Cheng for his advice on analysis of COVID-19 PBMC data. We also thank Yuemeng Sun for his help to accelerate the process to generate simulated data. Special thanks to people who helped us request for the ROSMAP dataset and the dataset creators from Religious Orders Study (ROS) or the Rush Memory and Aging Project (MAP).