AdRoit: an accurate and robust method to infer complex transcriptome composition


 1 

AdRoit: an accurate and robust method to infer complex 1 
transcriptome composition 2 
 3 
Tao Yang1, Nicole Alessandri-Haber1, Wen Fury1, Michael Schaner1, Robert Breese1, Michael LaCroix-Fralish2, 4 

Jinrang Kim1, Christina Adler1, Lynn E. Macdonald1, Gurinder S. Atwal1, Yu Bai1, * 5 

 6 
Affiliations 7 

1. Regeneron Pharmaceuticals, Inc., Tarrytown NY 10591 8 

2. Cellular Longevity, Inc., San Francisco, CA 94103 9 

 10 
*Corresponding author 11 
 12 

Abstract 13 

RNA sequencing technology promises an unprecedented opportunity in learning disease 14 

mechanisms and discovering new treatment targets. Recent spatial transcriptomics methods 15 

further enable the transcriptome profiling at spatially resolved spots in a tissue section. In 16 

controlled experiments, it is often of immense importance to know the cell composition in 17 

different samples. Understanding the cell type content in each tissue spot is also crucial to the 18 

spatial transcriptome data interpretation. Though single cell RNA-seq has the power to reveal 19 

cell type composition and expression heterogeneity in different cells, it remains costly and 20 

sometimes infeasible when live cells cannot be obtained or sufficiently dissociated. To 21 

computationally resolve the cell composition in RNA-seq data of mixed cells, we present AdRoit, 22 

an accurate and robust method to infer transcriptome composition. The method estimates the 23 

proportions of each cell type in the compound RNA-seq data using known single cell data of 24 

relevant cell types. It uniquely uses an adaptive learning approach to correct the bias gene-wise 25 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 2 

due to the difference in sequencing techniques. AdRoit also utilizes cell type specific genes 26 

while control their cross-sample variability. Our systematic benchmarking, spanning from 27 

simple to complex tissues, shows that AdRoit has superior sensitivity and specificity compared 28 

to other existing methods. Its performance holds for multiple single cell and compound RNA-29 

seq platforms. In addition, AdRoit is computationally efficient and runs one to two orders of 30 

magnitude faster than some of the state-of-the-art methods. 31 

 32 

Introduction 33 

RNA sequencing is a powerful tool to address the transcriptomic perturbations in disease 34 

tissues and help understand the underlying mechanism to develop treatments1. Due to the 35 

presence of heterogeneous cell populations, bulk tissue transcriptome only characterizes the 36 

averaged expression of genes over a mixture of different types of cells. The identity of 37 

individual cell types and their prevalence remain unelucidated in the bulk data. However, 38 

knowledge of the cell type composition and gene expression perturbation at the cell type level 39 

is often critical to identifying disease-manifesting cells and designing targeted therapies. For 40 

instance, the constitution of stromal and immune cells sculpts the tumor microenvironment 41 

that is essential in cancer progression and control2–6. Excessive expression of cytokines in 42 

particular leukocyte types underlines the etiology of many chronic inflammatory diseases 7–11. 43 

Such information cannot be directly read out from the bulk RNA-Seq. 44 

 45 

Recent breakthroughs in spatial transcriptomics methods enable characterizing whole 46 

transcriptome-wise gene expressions at spatially resolved locations in a tissue section12. 47 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 3 

However, it remains challenging to reach a single cell resolution while measuring tens of 48 

thousands of genes transcriptome-wise. Some widely used technologies can achieve a 49 

resolution of 50-100 μm, equivalent to 3–30 cells depending on the tissue type12,13. The 50 

transcripts therein may originate from one or more cell types. Unlike the bulk RNA-seq, the 51 

profiling data at each spot contains substantial dropouts as merely a few cells are sequenced, 52 

imposing additional challenges to demystify the cell type content. We refer to bulk RNA-seq 53 

and spatial transcriptomics data at the multi-cell resolution as compound RNA-seq data 54 

hereafter. 55 

 56 

The rapid development of single-cell RNA-seq (scRNA-seq) technologies has allowed for cell-57 

type specific transcriptome profiling14. It provides the information missing from the compound 58 

RNA-seq data. Nevertheless, the technologies have low sensitivity and substantial noise due to 59 

the high dropout rate and the cell-to-cell variability. Consequently, scRNA-seq technologies 60 

require a large number of cells (thousands to tens of thousands) to ensure statistical 61 

significance in the results. In addition, the cells must remain viable during capture. These 62 

requirements render the scRNA-seq technologies costly, prohibiting their application in clinical 63 

studies that involve many subjects or cannot allow real time tissue dissociation and cell capture. 64 

Furthermore, scRNA-seq technologies may not be well suited to characterizing cell-type 65 

proportions in solid tissues because the dissociation and capture steps can be ineffective to 66 

certain cell types 15–17. 67 

 68 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 4 

As sequencing at the single cell level is not always feasible, in silico approaches have been 69 

developed to infer cell type proportions from compound RNA-seq data18–24. The most common 70 

strategy is to conduct a statistical inference through the maximum likelihood estimation 71 

(MLE)25 or the maximum a posterior estimation (MAP)26 on a constrained linear regression 72 

framework, wherein the unobserved mixing proportion of a finite number of cell types are part 73 

of the latent variables to be optimized. 1921–24The deconvolution methods are often applied to 74 

dissect the immune cell compositions in blood samples27–31. However, their performance in 75 

more complex tissues, such as the nervous, ocular, respiratory and gastrointestinal organs, 76 

remains unclear. These tissues often contain many cell types (10-102) and the difference among 77 

related cells can be subtle, rendering the deconvolution a challenging task. For example, a 78 

recent study on the mouse nervous system contains more than 200 cell clusters and many are 79 

highly similar neuronal subtypes32. 80 

 81 

Earlier works often utilized the transcriptome profiling of the purified cell populations to 82 

estimate the gene expressions per cell type (e.g. Cibersort)19. More recently, acquiring cell type 83 

specific expression from the scRNA-seq data was shown to be an intriguing alternative21–24. 84 

Though it provides higher throughput by measuring multiple cell types in one experiment, 85 

profiling at single cell level is substantially noisy. Deconvolution using scRNA-seq data as 86 

reference can be biased by noise non-relevant to cell identities if not treated properly. 87 

Moreover, the platform difference between the compound data and the single cell data cannot 88 

be ignored. 89 

 90 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 5 

To overcome these challenges, additional information from the data may be considered. A 91 

recent method that weighs genes according to their expression variances across samples 92 

greatly improved the accuracy22, highlighting the importance of gene variability in inferring cell 93 

type composition. Some other methods and applications have pointed out the importance of 94 

cell type specific genes24,28,31,33. In these works, the cell type specific expression was only used 95 

to select the input genes (e.g., markers). Nonetheless, it measures how informative a gene is in 96 

distinguishing cell types and thus can be incorporated as a part of the model. To address the 97 

platform difference between the compound data and the single cell data it is usually assumed 98 

there exists a single scaling factor or a linearly scaled bias for all genes that can be learned and 99 

corrected accordingly22,23. This assumption is hardly held because the impact of the platform 100 

difference to each gene is different. Though learning a uniform scaling factor would correct the 101 

difference in the majority of genes, a few genes that remain significantly biased can easily 102 

confound the estimation, especially under a linear model framework. Thus, a gene-wise 103 

correction should be considered.  104 

 105 

In this work, we presented a new deconvolution method, AdRoit, a unified framework that 106 

jointly models the gene-wise technology bias, genes’ cell type specificity and cross-sample 107 

variability. The method estimated the cell type constitution in the compound RNA-seq samples 108 

using relevant single cell data as a training source. Genes used for deconvolution were 109 

automatically selected from the single cell data based on their information richness. Uniquely, 110 

it uses an adaptively learning approach to estimate gene-wise scaling factors, addressing the 111 

issue that different platforms impact genes differently. The model of AdRoit is further 112 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 6 

regularized to avoid collinearity among closely related cell subtypes that are common in 113 

complex tissues. Over a comprehensive benchmarking data sets with a varying cell composition 114 

complexity, AdRoit showed superior sensitivity and specificity to other existing methods. 115 

Applications to real RNA-seq bulk data and spatial transcriptomics data revealed strong and 116 

expected biologically relevant information. We believe AdRoit offers an accurate and robust 117 

tool for cell type deconvolution and will promote the value of the bulk RNA-seq and the spatial 118 

transcriptomics profiling. 119 

 120 

Results 121 

Overview of the AdRoit framework 122 

AdRoit estimates the proportions of cell types from compound transcriptome data including but 123 

not limited to bulk RNA-seq and spatial transcriptome. It directly models the raw reads without 124 

normalization, preserving the difference in total amounts of RNA transcript in different cell 125 

types. The method utilizes as reference the relevant pre-existing single cell RNA-seq data with 126 

cell identity annotation. It selects informative genes, estimates the mean and dispersion of the 127 

expression of selected genes per cell type, and constructs a weighted regularized linear model 128 

to infer percent combinations (Fig. 1a). Because sequencing platform bias impacts genes 129 

differently15,34,35, a uniform scaling factor for all genes does not sufficiently eliminate such bias. 130 

A key innovation of AdRoit is that it uniquely adopts an adaptive learning approach, where the 131 

bias was first estimated for each gene, then adjusted such that more biased gene is corrected 132 

with a larger scaling factor (Fig. 1b). 133 

 134 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 7 

We also attribute the success of AdRoit to the consideration of a comprehensive set of other 135 

relevant factors including genes’ cross-sample variability, cell type specificity and collinearity of 136 

expression profiles among closely related cell types. The cross-sample variability of a gene 137 

confounds its biological expression variability due to the variety of cell types. The latter is 138 

referred as the cell type specific expression that helps identify the cell type. AdRoit weighs 139 

down genes with high cross-sample variability whilst weighs up those with an expression highly 140 

specific to certain cell types. The definition of cross-sample variability and cell type specificity 141 

also accounts for the overdispersion nature in counts data. Lastly, AdRoit adopted a linear 142 

model to ensure the interpretability of the coefficients. At the same time, AdRoit included a 143 

regularization term to minimize the impact of the statistical collinearity. Each of the factors 144 

contributes an indispensable part to AdRoit, leading to an accurate and robust deconvolution 145 

method for inferring complex cell compositions. 146 

 147 

To evaluate the performance, we compared AdRoit with MuSiC22 and NNLS18,36 for bulk data 148 

deconvolution, and stereoscope23 for spatial transcriptomics data deconvolution. When 149 

evaluating the algorithms, a common practice is to pool the single cell data to synthesize a 150 

“bulk” sample with the known ground truth of the cell composition. We measured the 151 

performance by comparing the estimated cell proportions with true proportions using four 152 

metrics: mean absolution difference (mAD), rooted mean squared deviation (RMSD) and two 153 

correlation statistics (i.e., Pearson and Spearman). Both correlations are included because 154 

Pearson reflects linearity, while Spearman avoids the artificial high scores driven by outliers 155 

when majority of estimates are tiny. Good estimations feature low mAD and RMSD along with 156 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 8 

high correlations. When estimating cell proportions for a synthetic sample, cells from this 157 

sample are excluded from the input single cell reference (i.e., leave-one-out) to avoid 158 

overfitting. We further applied AdRoit to real bulk RNA-seq data and validated the results by 159 

available RNA fluorescence in-situ hybridization (RNA-FISH) data. The estimates were further 160 

confirmed by relevant biology knowledge of human pancreatic islets. We also used AdRoit to 161 

map cell types on spatial spots, and the accuracy was verified by in-situ hybridization (ISH) 162 

images from Allen mouse brain atlas37. 163 

 164 

AdRoit excels in datasets with both simple and complex cell constitutions 165 

We started with a simple human pancreatic islets dataset that contains 1492 cells and four 166 

distinct endocrine cell types (i.e., Alpha, Beta, Delta, and PP cells)38 (Extended Data Fig. 1a; 167 

Supplementary Table 1). The synthesized bulk data were constructed by mixing the single cell 168 

data at known proportions. Though all three methods achieved satisfactory performance 169 

according to the evaluation metrics, AdRoit has slightly better performance as reflected by 170 

scatterplots of estimated proportion vs. true proportion (Extended Data Fig. 1b, Supplementary 171 

Table 2). It has moderately lower mAD (0.029 vs. 0.031 for MuSiC and 0.066 for NNLS), and 172 

RMSD (0.039 vs. 0.046 for MuSiC and 0.095 for NNLS) and comparable correlations (Pearson: 173 

0.99 vs 0.98 for MuSiC and 0.93 for NNLS; Spearman: 0.97 vs 0.98 for MuSiC and 0.91 for NNLS) 174 

(Extended Data Fig. 1c). This performance was expected because there were only four cell types 175 

with very distinct transcriptome profiles. Deconvoluting such data was a relatively easy task for 176 

all three methods. 177 

 178 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 9 

We then tested the methods on a couple of complex tissues that are more challenging to 179 

deconvolute. One is the human trabecular meshwork (TM) tissue. We acquired published single 180 

cell data that contains 8758 cells and 12 cell types from 8 donors39. The data include 3 similar 181 

types of endothelial cells, 2 types of Schwann cells and 2 types TM cells (Supplementary Fig. 1; 182 

Supplementary Table 3). Cells from each donor were pooled as a synthetic bulk sample. The cell 183 

type proportions vary from <1% to 43%. These proportions were the ground truth cell 184 

composition and were compared head-to-head with the estimated proportions inferred by 185 

AdRoit, MuSiC and NNLS. For each synthetic bulk sample, estimations were performed using a 186 

reference built from cells of other donors (i.e., leaving-one-out). In each of the 8 samples, the 187 

estimates made by AdRoit best approximated the true proportions. In particular, AdRoit had 188 

significantly lower mAD (0.016) and RMSD (0.025), and higher correlations (Pearson = 0.97; 189 

Spearman = 0.94), comparing to MuSiC (mAD = 0.038; RMSD = 0.06; Pearson = 0.83; Spearman 190 

= 0.73) and NNLS (mAD = 0.06; RMSD = 0.088; Pearson = 0.69; Spearman = 0.63) (Fig. 2a). We 191 

further assessed the deviation of the estimates from the true proportions for each cell type. 192 

AdRoit consistently had the lowest deviations from the true proportions for all cell types, as 193 

well as the lowest variation among 8 samples (Fig. 2b, blue dots), indicating a higher robustness 194 

over various cell types and samples. Notably, AdRoit only missed one rare cell type (true 195 

proportion = 0.3%) out of 12 cell types in one sample, while MuSiC missed 1 to 5 cell types in 6 196 

of the 8 samples, and NNLS missed 3 to 7 cell types in all 8 samples (Supplementary Fig. 2, 197 

Supplementary Table 4). 198 

 199 

AdRoit has better sensitivity and specificity 200 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 10 

We next systematically addressed the sensitivity and specificity of these algorithms. In the 201 

context of the cell type deconvolution, a false negative occurs when the proportion of an 202 

existing cell type is predicted to be zero (or below a given threshold). Conversely, a non-zero 203 

prediction (or above a given threshold) of an absent cell type results in a false positive. False 204 

negatives and false positives measure the sensitivity and specificity of a deconvolution 205 

algorithm, respectively. Both quantities are crucial to establish the utility of the algorithm. 206 

Particularly, in real world applications, it is often difficult to know a prior what cell types exist in 207 

a bulk sample, users may inform the algorithm to consider more possible cell types than what 208 

are actually in the sample. False positive predictions in this situation would make the algorithm 209 

unusable. 210 

 211 

We designed a simulation to test the sensitivity and specificity. we selected 6 out of the 12 cell 212 

types, i.e., Schwann-cell like cell, TM1, smooth muscle cell, melanocyte, macrophage and 213 

pericyte, from each donor sample and pooled them within that sample to synthesize 8 new bulk 214 

samples. The unselected 6 cell types are considered absent in the bulk samples. Some cell types 215 

in presence are highly similar to those in absence, challenging the programs to pinpoint the 216 

right cell type present in the bulk among similar candidates. We provided the full list of 12 217 

single cell types as reference to the programs to estimate the cell type proportions. NNLS was 218 

excluded from this evaluation due to its low benchmarking performance observed earlier (Fig. 219 

2a, b). 220 

 221 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 11 

Consistently across 8 samples, AdRoit had very accurate estimates for the 6 present cell types, 222 

and zero or close-to-zero estimated values for the non-existing cell types in the synthetic bulk 223 

data. MuSiC was notably less accurate on the 6 selected cell types, meanwhile it had many non-224 

negligible values (>1% for 26 out 48 estimates) of the 6 cell types excluded in the 8 synthetic 225 

samples (Fig. 2c, Supplementary Table 5). For example, smooth muscle cells accounted for 226 

~14% in donor 4 but was largely missed (~0.03%) by MuSiC. We noted that TM2 had false non-227 

zero estimates from both methods though not included. This is because TM2 is easily mistaken 228 

as TM1 due to their high similarity39. Nonetheless, AdRoit’s estimates of TM2 were consistently 229 

small across samples (<1% for 44 out of 48 estimates), while MuSiC had significantly larger 230 

estimates of TM2 that occasionally even exceeded the TM1 estimates (donors 5 and 8 in Fig. 2c 231 

right). For a systematic comparison, we constructed the receiver operating characteristic (ROC) 232 

curve by varying the threshold of detection (i.e., a cutoff below which the cell type was deemed 233 

undetected) (Fig. 2d). AdRoit had significantly higher area under the curve (AUC) than MuSiC 234 

(0.95 vs. 0.74), implying a dominantly better sensitivity and specificity. 235 

 236 

AdRoit outperforms in deconvoluting closely related subtypes 237 

To further evaluate AdRoit when multiple cell subtypes present in a complex tissue, we 238 

performed scRNA-seq experiment on mouse lumbar dorsal root ganglion (DRG) from five mice. 239 

Following the standard analysis pipeline (Methods), we obtained 3352 single cells after quality 240 

control procedures. After clustering and annotation, we discovered 14 cell types including 241 

multiple subtypes of neuronal cells (Fig. 3a, Supplementary Table 6). The heatmap of the top 242 

marker genes showed distinct patterns of the major cell types as well as similar patterns of the 243 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 12 

subtypes (Extended Data Fig. 2a), and the cell type proportions varied from 0.5% to 33.71% 244 

(Extended Data Fig. 2b). These 14 cell types include 3 subtypes of neurofilament containing 245 

neurons (i.e., NF_Calb1, NF_Pvalb, NF_Ntrk2.Necab2), 3 subtypes of non-peptidergic neurons 246 

(i.e., NP_Nts, NP_Mrgpra3, NP_Mrgprd), and 5 subtypes of peptidergic neurons (i.e., PEP1_Dcn, 247 

PEP1_S100a11.Tagln2, PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a, PEP3_Trpm8). Also discovered 248 

were tyrosine hydroxylase containing neurons (Th), satellite glia and endothelial cells. Such 249 

complex compositions formed a challenging testing ground for evaluating the ability to 250 

distinguish closely related cell types. We again did the leave-one-out deconvolution on five 251 

synthesized bulk samples. 252 

 253 

AdRoit had highly accurate estimations on all cell types across samples (Fig. 3b). It is worth to 254 

mention that, for the rare cell types that account for less than 5%, AdRoit still had a good 255 

estimation that is fairly close to the true proportions and never missed a single cell type, 256 

showing that AdRoit is very robust on rare cell types. For example, 0.51% endothelial cells were 257 

predicted to be 0.35%, and 1.05% NF2_Ntrk2.Necab2 cells were predicted to be 0.85% 258 

(Supplementary Fig. 3, Supplementary Table 7). On the contrary, MuSiC and NNLS were notably 259 

less accurate, especially for the cell types less than 5%, and missed multiple cell types including 260 

some large cell clusters taking account of ~10% (PEP1_Slc7a3.Sstr2 cells of Sample5). We 261 

further examined how much the variability of the estimates was in each individual sample. We 262 

computed the 4 metrics to evaluate the performance on each of the 5 synthetic samples and 263 

compared them head-to-head among the algorithms. This fine comparison showed AdRoit 264 

significantly outperformed MuSiC and NNLS on every sample (Fig. 3c). Further, the performance 265 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 13 

metrics of AdRoit were highly consistent across samples with the lowest variability among the 266 

three methods. 267 

 268 

AdRoit excels on simulated spatial transcriptomics data 269 

Given the promising performance on complex tissues, we continued to test AdRoit’s 270 

applicability to spatial transcriptomics data. Spatial transcriptomics data differs from bulk RNA-271 

seq data in that each spot only contains transcripts from a handful of cells (3-30)12. Some of the 272 

spots contain multiple cells of the same type, while others may have mixtures of cell types at 273 

varying mixing percentages (e.g., spatial spots at the boundary of different cell types). Also, 274 

because the mixture is a pool of only a few cells, the variations across spatial spots are 275 

expected to be greater than in bulk samples. We simulated a large number of spatial spots 276 

(3200 in total) by using sampled cells from the DRG single cell data above (Methods), then 277 

compared AdRoit with Stereoscope over a range of simulation scenarios. 278 

 279 

We first tested whether the methods could correctly infer a single cell type when the spots 280 

contain cells from that same type. For each of the 14 cell types from DRG, we sampled 10 cells 281 

and pooled them to form a spatial spot. We repeated the simulation for 100 times for a robust 282 

testing, then used the full set of 14 cell types as reference to deconvolute the 1400 simulated 283 

spots. Both methods were able to identify the correct cell types with indistinguishable accuracy 284 

on the simulated cell types (i.e., estimates close to 1) and comparably low estimated values 285 

(i.e., estimates close to zero) for other cell types not included when simulating the spots 286 

(Extended Data Fig. 3). 287 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 14 

 288 

We then continued a difficult scenario where we sampled cells from the 5 PEP subtypes and 289 

mixed them. We created three simulation schemes for a comprehensive evaluation: 1) 5 PEP 290 

subtypes had same percent of 0.2; 2) PEP1_Dcn was 0.1 and the other 4 were 0.225; 3) 291 

PEP1_S100a11.Tagln2 and PEPE1_Dcn were 0.1, PEP2_Htr3a.Sema5a and PEP1_Slc7a3.Sstr2 292 

were 0.2, and PEP3_Trpm8 was 0.4. Again, each simulation scheme was repeated 100 times. 293 

Under each scheme, the estimates by AdRoit consistently centered around true proportions 294 

and the other cell types had very low estimated values (close to zero) (Fig. 4a, Supplementary 295 

Table 8). In comparison, though the estimates for the other cell types were also generally close 296 

to zero, the estimates of the PEP cells by Stereoscope systematically deviated from the true 297 

proportions for all three simulated schemes except for PEP1_S100a11.Tagln2. 298 

 299 

We further expanded the simulated spatial spots to the mixture of 3 NP cell types and mixture 300 

of 3 NF cell types. In addition, we sampled NP_Mrgpra3 cells and mixed them with other 301 

distinct cell types (i.e., Th, satellite glia and endothelial), as well as NF_Calb1 cells mixed with 302 

other distinct cell types, and PEP3_Trpm8 mixed with other distinct cell types. For all these 303 

simulated spatial spots, AdRoit’s estimates were consistently centered at true proportions, 304 

whereas Stereoscope’s estimates deviated in almost all simulated schemes (Extended Data Fig. 305 

4, Supplementary Table 8). We speculate the main reason Stereoscope underperformed at 306 

these simulated spots is that it normalizes the total UMI counts to the same number for all 307 

cells. In real world, a spatial spot is unlikely to be a pool of cells that have the same total RNA 308 

transcripts sampled, especially when a spot contains different cell types (e.g., immune cells 309 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 15 

have about 10-fold less total UMIs than the neuronal cells or subtypes of neuronal cells). Our 310 

simulation pooled the sampled cells by adding up the raw UMI counts per gene, which we 311 

believe best mimics the real data. 312 

 313 

Next, we asked how sensitive the methods are in detecting rare cell populations. We simulated 314 

mixtures of 3 PEP subtypes (i.e., PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a, PEP3_Trpm8) with a 315 

series of low percent PEP3_Trpm8 (from 0.01 to 0.1 by 0.01), and the other two cell types 316 

sharing the rest percentage equally (Methods). At each given percent, the simulation was 317 

repeated 100 times. We then checked how accurately the percent of PEP3_Trpm8 cells was 318 

estimated. The medians of AdRoit’s estimates were always close to the true proportions (Fig. 319 

4b, red lines), whereas that of Stereoscope’s estimates were largely lower than true 320 

proportions. Stereoscope also missed the majority of PEP3_Trpm8 cell type when the simulated 321 

proportion was below 0.06. This comparison implied AdRoit is more advantageous in detecting 322 

low percent cells. For a complete comparison, we also simulated 5 other types of cell mixtures 323 

in the same way. At each given low percent, we computed how many times out of 100 the low 324 

percent cell component was detected (estimates > 0.005). AdRoit had systematically higher 325 

detection rates, as well as higher consistency across different cell mixtures (Fig. 4c, 326 

Supplementary Table 9). Notably, at a simulated percent of 5%, AdRoit achieved >90% of 327 

detention rate, making it a powerful tool in detecting rare cells. 328 

 329 

Though MuSiC was not designed for deconvoluting spatial spots, theoretically it also can be 330 

applied to spatial transcriptomics data. We thus also compared AdRoit to MuSiC on the same 331 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 16 

sets of simulation data above. We observed AdRoit was also significantly more accurate over all 332 

simulation scenarios of spatial spots (Fig. 4a, Extended Data Fig. 3 and 4, Supplementary Fig. 4), 333 

and more sensitive when detecting low percent cells (Fig. 4b, c, Supplementary Fig. 5). 334 

 335 

Application to real bulk RNA-seq data of human pancreatic islets  336 

Though using synthetic bulk data based on mixing of single cells is a useful benchmarking 337 

strategy, the bulk and single cell RNA-seq often use distinct RNA library preparation and 338 

sequencing protocols. The capability of a method to deconvolute real bulk samples shall be 339 

addressed to ensure it is useful in the real-world applications. We acquired 70 real human 340 

pancreatic islets bulk samples from published studies38,40,41 (Supplementary Table 10) and used 341 

single cell data of the same tissue38 as reference to infer the percentages of 4 endocrine cell 342 

types (i.e., Alpha, Beta, Delta, PP). The 70 bulk samples were collected from 39 distinct donors, 343 

including 26 healthy donors, and 13 donors with type 2 diabetes (T2D). Each donor contributed 344 

1 to 5 replicated bulk RNA samples. 345 

 346 

Replicates from the same donor are expected to have similar compositions and thus were used 347 

to assess the reproducibility of the estimates from AdRoit. For all cell types, AdRoit had highly 348 

consistent estimates for the same donors (Fig. 5a, Supplementary Table 11). The average 349 

standard deviations did not exceed 1% for all 4 cell types (i.e., Alpha: 0.010; Beta: 0.008; Delta: 350 

0.004; PP: 0.002). To seek an independent validation, we obtained cell sorting results by RNA-351 

FISH for 4 of the 39 donors38 (Supplementary Table 12). The estimated cell proportions of the 4 352 

were highly consistent with the percentages measured by RNA-FISH (Fig. 5b), and the 353 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 17 

consistency held for both major cells (Alpha and Beta) and the minor cells (Delta and PP). 354 

Reproducibility and independent validation showed AdRoit is reliable in deconvoluting real bulk 355 

RNA-seq data. 356 

 357 

We then asked if AdRoit can detect known biological differences between healthy and T2D 358 

donors. Loss of functional insulin-producing Beta cells is a prominent characteristic of T2D42–44, 359 

typically reflected by elevated level of hemoglobin A1c (HbA1c)45,46. Among the healthy donors, 360 

the majority of Beta cell proportions estimated by AdRoit ranged from 50% to 75% (Fig. 5c), 361 

agreed with the known percent range of Beta cells in human islets tissue47,48. A significant 362 

decreasing of the estimated Beta cell proportions was seen in T2D patients (P value = 4.1e-6). 363 

Further, a linear regression of estimated Beta cell proportions on HbA1c levels showed a 364 

statistically significant negative association (P value = 1.8e-6). AdRoit adequately reflected the 365 

cell composition difference between healthy donors and T2D patients. 366 

 367 

Application to mouse brain spatial transcriptomics 368 

We lastly demonstrated an application to the real spatial transcriptomics data. Given the 369 

molecular architecture of brain tissue has been well studied, we chose mouse brain spatial 370 

transcriptomics data generated by 10x genomics, containing 2703 spatial spots (Methods). The 371 

reference single cell data were acquired from an independent study which contains a 372 

comprehensive set of nervous cell types in brain32. We curated the cell types by merging highly 373 

similar clusters and came down to a consolidated set of 46 distinct brain cell types (Methods, 374 

Supplementary Table 13). 375 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 18 

 376 

The cell contents inferred by AdRoit per spot appear to accurately match the expected cell 377 

types at that location (Extended Data Fig. 5, Supplementary Table 14). For example, the three 378 

subtypes of cortex excitatory neurons each occupied a sub-area in the cerebral cortex region. 379 

As another example, the shape of hippocampal region was delineated by the estimated 380 

percentages of dentate gyrus granule/excitatory neurons. For an independent validation, we 381 

checked the consistency between estimated cell types with the in-situ hybridization (ISH) 382 

images from Allen mouse brain atlas49. We chose 4 genes highly expressed in 4 brain regions 383 

respectively, i.e., Spink8 for hippocampal field CA1, C1ql2 for dentate gyrus, Clic6 for choroid 384 

plexus, and Synpo2 for thalamus32. The spots enriched with the 4 cell types (i.e., hippocampal 385 

CA1 excitatory neuron type 2, dentate gyrus granule neuron type 2, choroid plexus cell, 386 

thalamus excitatory neuron type 1), as mapped by AdRoit, precisely co-localized with the strong 387 

signals of the 4 marker genes on the ISH images respectively (Fig. 5d). This agreement 388 

confirmed that the spatial mapping of cell types by AdRoit is reliable. 389 

 390 

Computational efficiency 391 

Besides the accuracy and robustness, another major advantage of AdRoit is its magnitude 392 

higher computational efficiency. AdRoit uses a two-step procedure to do the inference. The first 393 

step prepares the reference on single cell data where per-gene means and dispersions are 394 

estimated, and cell type specificity is subsequently computed. The built reference can be saved 395 

and reused. We tested the running time on the reference building using the aforementioned 396 

mouse brain single cell dataset containing ~15,000 cells. It took about 4.5 minutes on a CPU 397 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 19 

that has 24 cores (23 used for parallel computing). The second step inputs the built reference 398 

and target compound data and does the estimation. Deconvoluting ~2700 compound RNA-seq 399 

samples took around 5 minutes. Therefore, AdRoit in total took less than 10 minutes and ~3Gb 400 

memory usage on a regular CPU. As a comparison, MuSiC took about 1 hour and 37 minutes on 401 

the same data using the same CPU. Stereoscope ran about 24 hours continuously with the 402 

published parameter setting (-scb 256 -sce 75000 -topn_genes 5000 -ste 75000 -lr 0.01 -stb 100 403 

-scb 100) on a powerful V100 GPU with 80 cores and 16G memory, which is prohibitive for 404 

seeking a quick turnaround. 405 

 406 

Discussion 407 

In this work we have demonstrated that AdRoit is capable of deconvoluting the cell 408 

compositions from the compound RNA-seq data with a leading accuracy, measured by the 409 

consistency between the true and predicted cell proportions. Its advantage over the existing 410 

state-of-the-art methods was verified over a wide range of use cases. In particular, AdRoit 411 

excelled in complex tissues composed of more than ten different cell types with wide range of 412 

cell proportions (e.g., trabecular meshwork, dorsal root ganglion). In both cases, AdRoit 413 

performed significantly better than the comparators MuSiC and NNLS on deconvoluting bulk 414 

RNA-seq data. AdRoit is also more accurate and sensitive than Stereoscope in demystifying 415 

spatial transcriptomics spots, especially in detecting low percent cells. Previous benchmarking 416 

often assumed the types of cells in the synthetic bulk data are not more or less than the cell 417 

types collected in the reference, and thus the only unknown was the proportion of each cell 418 

type. This assumption may not hold. Missing existing cell types or false predictions of non-419 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 20 

existing ones can hinder the utility of an algorithm. Thus, besides the overall accuracy, we also 420 

examined the sensitivity and specificity of the algorithms. We observed a superior sensitivity 421 

and specificity in AdRoit, an important leverage for its usage in practice. 422 

 423 

The reference single cell data used by AdRoit came from different platforms, such as the 10x 424 

Genomics Chromium Instrument (the mouse dorsal root ganglion), and the Fluidigm C1 system 425 

(the human pancreatic islets data). AdRoit consistently exhibited excellent performance across 426 

all benchmarking datasets independent of their single cell sequencing technology platforms. 427 

More importantly, this statement holds not only for deconvoluting the synthesized bulk data, 428 

but also for the real bulk RNA-seq data. The latter typically does not apply the unique molecular 429 

barcoding and requires a significantly different cDNA amplification procedure from what is used 430 

in the single cell RNA-seq (Methods). Besides, the sequencing depth, read mapping and gene 431 

expression quantification are dissimilar as well. The fact that AdRoit accurately dissected the 432 

cell compositions in the real bulk samples based on the single cell reference data further 433 

supports its cross-platform applicability. 434 

 435 

We attribute the power of AdRoit to its comprehensive modeling of relevant factors. Firstly, we 436 

think a common rescaling factor is not sufficient to correct the platform difference between 437 

single cells and the compound data. Rather, the impact of platform difference to genes is quite 438 

different and hardly is linearly scaled. Correcting such differences entails rescaling factors 439 

specifically tailored to each gene. AdRoit uses an adaptive learning approach to estimate such 440 

gene-wise correcting factor and does the correction in a unified model. In addition, the 441 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 21 

contribution of a gene in a cell type to the loss function is jointly weighted by its specificity and 442 

variability in a cell type, where specificity and variability are defined in a way accounting for the 443 

overdispersion property of counts data. Our observations over the multiple benchmarking 444 

dataset also show that the coexistence of similar cell types may have induced a collinearity 445 

condition that negatively impacted the regression-based methods developed by others. Being 446 

able to alleviate this problem gives AdRoit an edge to outperform. All these factors help AdRoit 447 

to distinguish similar cell clusters while sensitive enough to separate rare cell types. 448 

 449 

Technically, the input profiles of individual cell types to AdRoit does not necessarily come from 450 

the single cell RNA-seq. Bulk RNA-seq profiles of individual isolated cell types can be used as 451 

well. Nevertheless, using single cell RNA-seq data as the reference has a few key advantages. It 452 

is a high throughput approach wherein multiple cell types can be interrogated simultaneously. 453 

Prior knowledge of the cell types in presence as well as their specific gene markers are not 454 

required, which allows novel cell types to be identified. Although detection of lowly expressing 455 

genes has been a challenge for the single cell RNA-seq, significant enhancements have been 456 

demonstrated. For example, the number of detectable genes currently can reach an order of 457 

10,000 per cell and keeps improving50. As AdRoit focuses on the informative genes whose 458 

expressions are generally high, the detection limit of the single cell RNA-seq does not impose a 459 

significant drawback. Indeed, given the single cell reference profiles, AdRoit successfully 460 

deconvoluted the real bulk RNA-seq data and spatial transcriptomics data. The results suggest 461 

that, besides enriching our understanding of the bulk transcriptome data, AdRoit can leverage 462 

the usage of the vast amount and continuously growing single cell data as well. 463 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 22 

 464 

AdRoit is a reference-based deconvolution algorithm. A comprehensive collection of the 465 

possible cell components is important. However, completeness may not always be guaranteed. 466 

Even with the single cell acquisition that is independent of prior knowledge, rare and/or fragile 467 

cell types may not survive through the capture procedure and hence are excluded. It is also 468 

difficult to generate a solid reference profile for cells that are versatile from sample to sample 469 

(e.g., tumor cells). Currently AdRoit deals implicitly with the components unknown to the 470 

reference. If an unknown cell type reassembles one of the referenced ones, it may be 471 

considered as part of the known cell type and their joint population is predicted. Such an 472 

outcome is acceptable as treating two similar cell types as one is still biologically meaningful 473 

although the resolution of the system may be compromised. If the unknown component is 474 

dissimilar to all the known ones, it will be ignored by AdRoit because its representative markers 475 

are unlikely among the top weighted genes associated with the known components. At the 476 

same time, the distinct component is expected to have a unique gene expression pattern and 477 

thus unlikely interferes significantly with the gene expressions from the known cell types. 478 

Therefore, AdRoit essentially deconvolutes the relative populations among the known cell 479 

components. For example, AdRoit was able to correctly uncover the populations of 4 endocrine 480 

cell types from the human islet bulk data despite the absence of many other cell types such as 481 

macrophages, Schwann cells and endothelial cells in the input single cell reference20. Although 482 

under such a circumstance, the absolute percentages of the cells remain obscure, we expect 483 

their relative proportions can be studied and valuable. A future improvement is to explicitly 484 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 23 

model the unknown cell types and estimate their percentages upon the signals in the 485 

compound data that cannot be explained by the contribution from the known components. 486 

 487 

Methods 488 

Gene selection  489 

AdRoit selects genes that contain information about cell type identity, excluding non-490 

informative genes that potentially introduce noise. There are two ways for selecting such 491 

genes: 1) union of the genes whose expression is enriched in one or more cell types in the 492 

single cell UMI count matrix. These genes are referred as marker genes; 2) union of the genes 493 

that vary the most across all the cells in the single cell UMI count matrix, referred as the highly 494 

variable genes. For marker genes, we recommend selecting top ~200 genes (P value < 0.05), 495 

ranked by fold change, from each cell type for resolving complex compound transcriptome 496 

data. Considering some genes may mark more than one cell types, we further require selected 497 

markers presenting in no more than 5 cell types to ensure specificity. We also suggest select a 498 

minimal of 1000 total number unique genes for an accurate estimation. If not satisfied, one 499 

may consider expand the number of top genes and/or loose the P value cutoff. 500 

 501 

AdRoit also offer the option to use highly variable genes. To avoid the selected highly variable 502 

genes being dominated by large cell clusters whilst underrepresents small clusters, AdRoit first 503 

balances the cell types in the single cell UMI count matrix by finding the median size among all 504 

cell clusters, then sample cells from each cluster to make them equal to this size. Next, AdRoit 505 

computes the variance of each gene across the cells in the balanced single cell UMI matrix. Due 506 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 24 

to the well-known dispersion effect in RNA-seq data, directly computing variances from count 507 

matrix can results in overestimation. We thus compute variances on the normalized data done 508 

by variance-stabilizing transformation (VST)51. Genes with top 2000 large variances are then 509 

selected. 510 

 511 

In both ways, mitochondria genes were excluded as their expression do not have information of 512 

cell identity. The results shown in current paper were based the marker genes as described 513 

above. But we also demonstrated that using the balanced highly variable genes yields 514 

comparably accurate estimations (Supplementary Fig. 6). 515 

 516 

Estimate gene mean and dispersion per cell type 517 

Modeling single cell RNA-seq data is challenging due to the cellular heterogeneity, technical 518 

sensitivity, and noise. While the expression of some genes can be not detected by chance, other 519 

genes may be found to be highly dispersed. These factors can lead to excessive variability even 520 

within the same cell type. AdRoit combats high noise and computational complexity by building 521 

models with estimated mean and dispersion per cell type. This strategy reduced the data 522 

complexity while preserve the cell type specific information. 523 

 524 

Although typical analyses of RNA-seq data starts with normalization, Adroit does not do 525 

normalization prior to the mean estimation. Performing a normalization across all cell types 526 

forces every cell type to have the same amount of RNA transcripts, measured by the total 527 

unique molecular identifier (UMI) counts per cell. However, different cell types can have 528 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 25 

dramatically different amounts of transcripts. For example, the amount of RNA transcripts in 529 

neuronal cells is about 10 times fold of that in glial cells. Thus, normalization can falsely alter 530 

the relative abundance of cell types, misleading the estimation of cell type percentages. To 531 

avoid this problem, AdRoit models the means using the raw UMI counts.  532 

 533 

Studies have shown that UMI counts follows negative binomial distribution52,53, we therefore fit 534 

negative binomial distributions to single cells of each cell type and build the model based on 535 

the estimated means and dispersions from the selected genes. More specifically, let 𝑋!"be the 536 

set of single cell UMI counts of gene i ∈ 1,..,I for all cells in cell type k ∈ 1,…,K. I is the number 537 

of selected genes, and K denotes number of cell types in the single cell reference. The 538 

distribution of 𝑋!"follows negative binomial distribution, 539 

𝑋!" ∼ 𝑁𝐵(𝜆!",𝑝!"),   (1) 540 

where 𝜆!" is the dispersion parameter of the gene i in cell type k, and 𝑝!" is the success 541 

probability, i.e., the probability of gene i in cell type k getting one UMI. The two parameters are 542 

estimated by MLE. The likelihood function is 543 

𝐿𝐻(𝜆!",𝑝!"|𝑋!") = ∏ 𝑓(𝑋!"|𝜆!",𝑝!")
#!
!$% ,  (2) 544 

where 𝑛" is the number of cells in cell type k, and f is the probability mass function of negative 545 

binomial distribution. The MLE estimates are then given by 546 

(𝜆&"2 ,𝑝&")2 = 𝑎𝑟𝑔max
'"!,)"!

𝐿𝐻(𝜆!",𝑝!"|𝑋!").  (3) 547 

Once success probability and dispersion are estimated, the mean estimates can be computed 548 

numerically according to the property of negative binomial distribution, 549 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 26 

𝜇!" =
'#!* ∙)#!,
%-)#!,

,   (4) 550 

𝜎!"
. = '#!

* ∙)#!,
(%-)#!, )$

.   (5) 551 

Estimation using MLE has been readily coded in many R packages. We choose ‘fitdist’ function 552 

from ‘fitdistrplus’ package54 for its fast computation speed and flexibility in selecting 553 

distributions. Estimations are done for each selected gene in each cell type, resulting in a 𝐼 × 𝐾 554 

matrix of cell type means. 555 

 556 

Cell type specificity of genes 557 

Genes with cell-type specific expression patterns better represent cell types, thus are more 558 

important when be used for resolving cell type composition. In line with this property, AdRoit 559 

weights genes with high specificity more than less specific ones. Highly specific genes usually 560 

have consistently high expression and thus relatively low variance among cells within a cell 561 

type. To compute cell type specificity of a gene, we first identify the cell type in which the gene 562 

has the highest expression (i.e., most specifically expressed cell type), then defines the 563 

specificity of this gene as the mean-to-variance ratio within the cell type. A high ratio renders 564 

high weight to the gene in the model. We use the estimated means and variances from 565 

negative binomial fitting (𝜇!" and 𝜎!"
.  in eq. 4 and 5). Let 𝑘1 be the index of cell type that has the 566 

highest mean expression of gene i, 567 

𝑘1 = 𝑎𝑟𝑔max
"

{𝜇!"|	𝑘	𝜖	1…𝐾},  (6) 568 

then the cell type specificity weight for gene i, denoting 𝑤!
2, is given by, 569 

𝑤!
2 =

3"!%
4"!%
$ ,  (7) 570 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 27 

and it is computed for each gene in the set of selected genes. 571 

 572 

Cross-sample gene variability 573 

The variability of a gene contrasts how much stable a gene is across samples. The idea of 574 

weighting genes based on variability across samples is first explored by Wang et al22, where 575 

variability was defined as the cross-sample variance. By weighting down the high variability 576 

genes, the authors achieved a great advantage over the traditional unweighted method. Genes 577 

with low cross-sample variability better represent the population, hence are more trust-worthy 578 

to be used to learn the cell composition. AdRoit incorporates the same notion to weight the 579 

importance of genes, however, defines the variability in a more sophisticated way. Similar as 580 

we define the cell type specificity, AdRoit utilizes mean and variance, and computes variance-581 

to-mean ratio (VMR) to stand for cross-sample gene variability. But here the mean and variance 582 

are computed across samples. The VMR is better scaled than the simple variance, and it can 583 

avoid underweighting genes that has low expression, while circumvent overweighting genes 584 

hugely dispersed. 585 

 586 

In addition, AdRoit extends the method to fit the case where multiple samples are not 587 

available. We proposed three ways to compute the VMR, depending on whether multi-sample 588 

data is available. Typically, the compound transcriptome data to be deconvolved have multiple 589 

samples. In bulk RNA-seq data, multiple samples are usually included to control for biological 590 

variability. In spatial transcriptome data, the spatial dots can be seen as multiple samples. 591 

Therefore, we first consider computing the cross-sample gene variability from compound 592 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 28 

transcriptome data. In case multi-sample for compound data is not available, AdRoit utilizes the 593 

single cell reference, and synthesizes compound samples by pooling all cells belonging to the 594 

same sample. If multi-sample is not available for both data, AdRoit subsample single cells and 595 

pool them to make pseudo samples. Let 𝑌!5 denote the counts of sequences for gene i in 596 

sample j ∈ 1,…,J, then 597 

𝑌!5 ∼ 𝑁𝐵(𝜆!5,𝑝!5),   (8) 598 

where 𝜆!5 is the dispersion parameter of the gene i in sample j, and 𝑝!5 is the success 599 

probability. Again, we use MLE to get the estimates 𝜆&62and 𝑝&6G, following which cross-sample 600 

mean and variance can be numerically computed: 601 

𝜇!
2 =

'#&* ∙)#&,
%-)#&,

,   (9) 602 

(𝜎!
.)2 =

'#&* ∙)#&,

7%-)#&, 8
$,   (10) 603 

and cross-sample variability for gene i is then defined as 604 

𝑉𝑀𝑅! =
(4"

$)'

3"
' =

%
9"
(,   (11) 605 

where 𝑤!
: is later used in the model. The cross-sample variability weight is computed for each 606 

gene in the set of selected genes. 607 

 608 

Gene-wise scaling factor to correct platform bias 609 

When linking the compound data to the single cell data, rescaling factor is often used to 610 

account for the library size and platform difference. The existing methods adopt a single 611 

rescaling factor for each unit of sample, i.e., all genes of a single sample are multiplied by the 612 

same factor22,23. This operation is based on a strong assumption that the impact of platform 613 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 29 

difference to every gene is the same and linearly scaled among different cell types, which is 614 

hardly true. In addition, because estimates can be easily affected by outliers in linear model, 615 

estimation of cell proportions can be steered away from the truth by extremely high expression 616 

genes. Therefore, applying a uniform scaling factor to all gene is inappropriate. 617 

 618 

To overcome this problem, AdRoit instead estimates gene-wise scaling factors via an adaptive 619 

learning strategy and rescales each gene with its respective scaling factor. To proceed, we first 620 

input the mean gene expression from the compound samples (𝜇!
2in eq. 9) and the estimated 621 

means of each cell type from the single cell data (𝜇!" in eq. 4), then apply a traditional non-622 

negative least square regression (NNLS) to get a rough estimation of the proportions of each 623 

cell type, denoting 𝜏". For each gene, a predicted mean expression (∑ 𝜏"G;" 𝜇!" in eq. 13) is 624 

computed as the weighted sum of the means of each cell type wherein the weights are the 625 

roughly estimated proportions. The regression equation is given by, 626 

𝜇!
2 = 𝐴 ∙ (∑ 𝜏";" 𝜇!" + 𝜀), 0 < 𝜏", ∑ 𝜏"

;
" = 1 (12) 627 

where A is a constant to ensure 𝜏"’s sum to 1 and 𝜀 is the error term. We use ‘nnls’ function in 628 

the ‘nnls’ package55 to estimate 𝜏"’s. Next, we calculate the ratio between the mean expression 629 

from compound samples and the predicted means, and define the gene-wise rescaling factor as 630 

the logarithm of the ratio plus 1, 631 

𝑟! = log	(
3"
)

∑ =!,
*
! 3"!

+ 1).  (13) 632 

Given the dispersion property of count data, the logarithm of the ratio is a more appropriate 633 

statistic as it results in relatively stable scaling factors. The addition of 1 avoids taking logarithm 634 

on zero. By multiplying the flexible gene-wise rescaling factor, the “outlier” genes will be 635 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 30 

pushed toward the truth regression line direction, while the genes around the true regression 636 

lines are less affected (Fig. 1b). 637 

 638 

Weighted and regularized model 639 

We next designed a model that incorporates all these factors to do the actual estimation of cell 640 

type proportions. AdRoit builds upon non-negative least square regression model. It gives high 641 

weights to the genes with high cell type specificity and low cross-sample variability. This was 642 

done by optimizing a weighted sum of squared loss function L, where the weights consist of 643 

two components (𝑤!
: in eq. 7, 𝑤!

2 in eq. 11). The gene-wise scaling factor tailored for each gene 644 

effectively corrects the bias due to technology difference between compound sample and 645 

single cell data (𝑟!in eq 13). In cases of complex tissues (e.g., neural tissues) where many highly 646 

similar subtypes are common, closely related subtypes can have strong collinearity, leading to 647 

overestimation of some cell types whilst underestimate or miss some others. AdRoit handles 648 

this problem by including a L2 norm of the estimates as the regularization component. Denote 649 

𝛽" as the unscaled coefficient for cell type k. For a compound transcriptome sample j, the loss 650 

function is given by, 651 

𝐿5(𝛽%,…,𝛽;|𝑦!5,𝑤!
:,𝑤!

2,𝑟!,𝜇&"G) = ∑ 𝑤!
: ∙ 𝑤!

2 ∙ (𝑦!5 − 𝑟! ∙ ∑ 𝛽"𝜇&"G;" ).
>
! + ∑ 𝛽"

.;
" . (14) 652 

Then the coefficient 𝛽" can be estimated by minimizing the loss function with the constraint 653 

𝛽%,…,𝛽; > 0, 654 

𝛽%2,…,𝛽;2 = argmax
?+,…,?*

?+,…,?*AB𝐿5. (15) 655 

The estimation is done by a gradient projection method by Byrd et al56. We derive the gradient 656 

function by taking partial derivative of the loss function with w.r.t. 𝛽", 657 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 31 

𝐺" = ∇?!𝐿5 = −2∑ 𝑟! ∙ 𝜇&"G ∙ 𝑤!
: ∙ 𝑤!

2 ∙ ^𝑦!5 − 𝑟! ∙ ∑ 𝛽"𝜇&"G;" _ +
>
!	 2𝛽". (16) 658 

AdRoit uses the function ‘optim’ from the R package ‘stats’ to do the estimation57, providing the 659 

loss function (eq. 15) and the gradient (eq. 16). To get the final estimates of cell type 660 

proportions, we rescale the coefficients 𝛽"’s to ensure a summation of 1, 661 

𝜃" =
?!*

∑ ?!*
*
!

.   (17) 662 

Each compound sample j is independently estimated by the model described above. 663 

 664 

Simulation of bulk RNA-seq and spatial transcriptomics data 665 

Bulk RNA-seq data used for benchmarking are synthesized by adding up the raw UMI reads per 666 

gene from all single cells of a sample regardless of cell types. Denote 𝑡" as a cell in cell type k, 667 

and 𝑡" ∈ 1, …,	𝑇", where 𝑇" is the number of cells in cell type k. Let 𝑌!5
D be the read count of 668 

gene i in a synthesized bulk sample j, and 𝑋!5E! be the UMI count of the gene, then 669 

𝑌!5
D = ∑ ∑ 𝑋!5E!

F!
E!

;
" . 670 

The true proportion of cell type k is given by, 671 

𝜃"
B = F!

∑ F!
*
!

. 672 

 673 

To simulate spatial transcriptomic spots, we first sample 10 cells without replacement from 674 

each cell type and added them up, then mix them with designed proportions. For example, to 675 

simulate a spot with 𝑝" percent of cell type k, the read count 𝑌!5
G of gene i in a spatial spot j is 676 

given by, 677 

𝑌!5
G = ∑ 𝑝";" ∑ 𝑋!"#%B#$% , 678 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 32 

where 𝑋!"G is UMI count of gene i in a sampled cell n of cell type k. For each mixing scheme, the 679 

simulation is repeated 100 times. 680 

 681 

Evaluation statistics 682 

We compared the estimated cell type proportions with the ground truth by calculating 4 683 

statistics. The mAD and RMSD are given by, 684 

𝑚𝐴𝐷 =
∑ HI!-I!

,	H*!
;

, 685 

𝑅𝑀𝑆𝐷 =
∑ 7I!-I!

,8
$*

!
;

. 686 

Pearson correlation coefficient is computed as, 687 

𝜌) =
∑ 7I!-I!JJJJ8KI!

,-I!
,JJJJL*!

M∑ 7I!-I!JJJJ8
*
!

$M∑ KI!
,-I!

,JJJJL
$*

!

, 688 

where 𝜃"ggg and 𝜃"
Bggg are means of the estimated proportions and true proportions, respectively. 689 

Spearman correlation coefficient is given by, 690 

𝜌G =
∑ (N!-N!JJJJ)KN!

,-N!
,JJJJL*!

M∑ (N!-N!JJJJ)
*
!

$M∑ KN!
,-N!

,JJJJL
$*

!

, 691 

where 𝑟"is the rank of 𝜃". 692 

 693 

Single cell RNA sequencing of mouse dorsal root ganglion 694 

As described previously58, lumbar DRGs were isolated from adult C57BL/6 mice and transferred 695 

to a dissociation buffer (Dulbecco's modified Eagle's medium supplemented with 10% heat-696 

inactivated Fetal Calf Serum) (Gibco; cat # A38400-02). To generate a single cell suspension, 697 

DRGs were subjected to a 2 step-enzymatic dissociation followed by a mechanical dissociation. 698 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 33 

In brief, DRGs were first incubated with 0.125% collagenase P from Clostridium histolyticum 699 

(Roche Applied Science; cat # 11249002001) for 90 minutes in an Eppendorf Thermomixer C 700 

(37°C; intermittent 750 rpm shaking for about 10 sec every 2 minutes). Then, DRGs were 701 

transferred to a Hank's Balanced Salt Solution (HBSS, Mg2+ and Ca2+ free; Invitrogen) 702 

supplemented with 0.25% Trypsin (Worthington biochemical corp.; cat # LSoo3707) and 703 

0.0025% EDTA and incubated for 10 minutes at 37°C in the Eppendorf Thermomixer C. Trypsin 704 

was neutralized by the addition of 2.5 mg/ml MgSO4 (Sigma; cat #M-3937) and DRGs were 705 

triturated with Pasteur pipettes. The resulting cell suspension was passed through a 70 µm 706 

mesh filter to remove remaining chunks of tissues and centrifuged for 5 minutes at 2500 rpm at 707 

room temperature. The pellet was resuspended in HBSS (Ca2+, Mg2+ free; Invitrogen) and the 708 

cell suspension was run on a 30% Percoll Plus gradient (Sigma GE17-5445-02) to further remove 709 

debris. Finally, cells were resuspended in PBS supplemented with 0.04% BSA at a concentration 710 

of 200 cells/µl and cell viability was determined using the automated cell analyzer 711 

NucleoCounter® NC-250™. The suspended single cells were loaded on a Chromium Single Cell 712 

Instrument (10X Genomics) with about 6000 cells per lane to minimize the presence of 713 

doublets. 2000-3000 cells per lane were recovered. RNA-seq libraries were constructed using 714 

Chromium Single Cell 3’ Library, Gel Beads & Multiplex Kit (10X Genomics). Single end 715 

sequencing was performed on Illumina NextSeq500. Read 1 starts with a 26-bp UMI and cell 716 

barcode, followed by an 8-bp i7 sample index. Read 2 contains a 55-bp transcript read. Sample 717 

de-multiplexing, alignment, filtering, and UMI counting were conducted using Cell Ranger 718 

Single-Cell Software Suite59 (10X Genomics, v2.0.0). Mouse mm10 Genome assembly and UCSC 719 

gene model were used for the alignment. 720 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 34 

 721 

Data preprocessing 722 

DRG single cell data 723 

The UMI data output from Cell Ranger Single-Cell Software Suite (10X Genomics, v2.0.0) was 724 

analyzed using Seurat package60 to assess the cell quality and identify cell types, similar to what 725 

described previously39. Cells with the number of detected genes less than 500 or over 15000, or 726 

with a UMI ratio of mitochondria encoded genes versus all genes over 0.1 were also removed. 727 

The UMI data was normalized by the ‘NormalizeData’ method in Seurat with default settings. 728 

To avoid potential sample-to-sample variation caused by technical variation at various 729 

experiment steps, we employed Seurat data integration method. The top 2000 variable genes 730 

of each of the 5 samples were identified using ‘FindVariableFeatures’ with 731 

selection.method=‘vst’. Based on the union of these variable genes, the anchor cells in each 732 

sample were identified by ‘FindIntegrationAnchors’. All the samples were then integrated by 733 

‘IntegrateData’. We subsequently scaled the integrated data (‘ScaleData’) and performed 734 

dimension reduction (‘RunPCA’). Cells were then clustered based on the first 15 principal 735 

components by applying ‘FindNeighbors’ and ‘FindClusters’ (resolution=0.6, algorithm=1). 736 

Marker genes for each cluster were identified using ‘FindAllMarkers’. Parameters were used 737 

such that these genes were expressed in at least 25% of the cells in the cluster, and on average 738 

2-fold higher than the rest of cells with a multiple-testing adjusted Wilcoxon test p value of less 739 

than 0.01. The specificity of the canonical cell type-specific genes or cell cluster-specific genes 740 

were further examined by visualizations (Extended Data Fig. 2) and used to define the cell type 741 

for each cluster. At the end, the original UMI data from 17271 genes and 3352 cells that passed 742 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 35 

the quality control were organized into a matrix (genes as rows and cell identifiers as columns). 743 

This matrix, together with the cell type label for each cell therein, were loaded into AdRoit as 744 

reference profiles. 745 

 746 

Mouse brain single cell data 747 

The scRNA-seq reference data of the mouse brain were obtained from Zeisel et. al32. Among all 748 

the available data, we only retained 96,572 cells that were acquired from the brain regions, had 749 

an assigned cell type by the authors and a minimal total UMI of 1000. These cells corresponded 750 

to 183 clusters at the finest taxonomy level in the original study. As many of the clusters are 751 

highly similar, we decided to merge some of them to simplify the reference landscape. First, the 752 

top 50 cluster enriched markers were derived using Scanpy61 via the ‘rank_genes_groups’ 753 

function (method=‘wilcoxon’), following the normalization (‘normalize_per_cell’), log 754 

transformation (‘log1p’) and regressing out (‘regress_out’) the variances associated with the 755 

total UMI and the percentage of mitochondrial chromosome encoded genes per cell. Then, the 756 

pair-wise overlapping p-values among the clusters were calculated using the top 50 marker 757 

genes assuming the hypergeometric null distribution. Last, clusters with overlapping p-values 758 

more significant than 1e-10 were merged and new names were assigned by combinedly 759 

considering the original annotation, the molecular features and the specificity to certain brain 760 

regions. A total of 46 cell types were determined that cover all the 12 brain regions and their 761 

important substructures37 (Supplementary Table 13). To make the reference dataset more 762 

manageable in size and more balanced in the representation of cell types, we down sampled 763 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 36 

each cluster to no more than 360 cells. A final set of 14,666 cells over 46 cell types were used 764 

for the deconvolution of the mouse brain spatial transcriptome data. 765 

 766 

Human Islets 767 

We used the 1492 high quality human islets single cell and annotation from Xin et al38. The 768 

RPKM expression table was directly downloaded and used as is. The RNA-FISH data was also 769 

from this study38. For the real bulk human pancreatic islets data38,40,41, the read counts table 770 

were deconvoluted. Only data from donors with HbA1C level available were included in the 771 

regression of Beta cell proportion on HbA1C level (Fig. 4c, Supplementary Table 10). 772 

 773 

Trabecular Meshwork 774 

We downloaded the raw sequence data and followed the same analysis procedure as in Patel et 775 

al39 for quality control and cell type identification. 776 

 777 

Mouse Brain Spatial transcriptomics data by 10x Visium platform 778 

The filtered cell matrix, tissue image and the spatial coordinates of a coronal section of an adult 779 

C57BL/6 mouse brain from the 10x Genomics were available for download and used as is. 780 

 781 

Mouse Brian ISH images 782 

The ISH images were directly downloaded from Allen mouse Brain Atlas37 by searching the gene 783 

names. THE images were used with further editing except for cropping. 784 

 785 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 37 

Data availability 786 

DRG single cell data are deposited at NCBI GEO (accession number: GSE163252) . The bulk RNA-787 

seq and RNA-FISH data for human pancreatic islets were initially published as aggregated data 788 

where the data processing and experimental procedure were described therein38,40,41. We 789 

acquired the individual sample data from the authors and released them along with the current 790 

study (Supplementary Table 10 and Supplementary Table 12). The other public data analyzed in 791 

this study are available from: GEO (human pancreatic islets single cell data: GSE81608); NCBI 792 

(human trabecular meshwork single cell data: PRJNA616025; mouse brain single cell data: 793 

SRP135960). Mouse brain spatial transcriptomic data was downloaded from the 10x Genomics 794 

website (https://support.10xgenomics.com/spatial-gene-795 

expression/datasets/1.1.0/V1_Adult_Mouse_Brain_Coronal_Section). 796 

 797 

Code availability 798 

AdRoit’s source code is available on Github (https://github.com/TaoYang-dev/AdRoit). 799 

 800 

Software 801 

The statistical analyses were done with R statistical software (v3.6.0)57 and python (v3.7.2)62. 802 

The packages used include Seurat (v3.0.1)60, scanpy (v1.6.0)61, dplyr (v0.8.0.1)63, doParallel 803 

(v1.0.14)64, data.table (v1.12.4)65, fitdistrplus (v1.1-1)54, nnls (v1.4)55. 804 

 805 

Reference 806 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 38 

1. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: A revolutionary tool for transcriptomics. 807 

Nature Reviews Genetics (2009) doi:10.1038/nrg2484. 808 

2. Chu, G. C., Kimmelman, A. C., Hezel, A. F. & DePinho, R. A. Stromal biology of pancreatic 809 

cancer. Journal of Cellular Biochemistry (2007) doi:10.1002/jcb.21209. 810 

3. Bussard, K. M., Mutkus, L., Stumpf, K., Gomez-Manzano, C. & Marini, F. C. Tumor-811 

associated stromal cells as key contributors to the tumor microenvironment. Breast 812 

Cancer Research (2016) doi:10.1186/s13058-016-0740-2. 813 

4. Munn, D. H. & Bronte, V. Immune suppressive mechanisms in the tumor 814 

microenvironment. Current Opinion in Immunology (2016) 815 

doi:10.1016/j.coi.2015.10.009. 816 

5. Gonzalez, H., Hagerling, C. & Werb, Z. Roles of the immune system in cancer: From tumor 817 

initiation to metastatic progression. Genes and Development (2018) 818 

doi:10.1101/GAD.314617.118. 819 

6. Garner, H. & de Visser, K. E. Immune crosstalk in cancer progression and metastatic 820 

spread: a complex conversation. Nature Reviews Immunology (2020) 821 

doi:10.1038/s41577-019-0271-z. 822 

7. Singh, U. P. et al. Chemokine and cytokine levels in inflammatory bowel disease patients. 823 

Cytokine (2016) doi:10.1016/j.cyto.2015.10.008. 824 

8. Van Lint, P. & Libert, C. Chemokine and cytokine processing by matrix metalloproteinases 825 

and its effect on leukocyte migration and inflammation. J. Leukoc. Biol. (2007) 826 

doi:10.1189/jlb.0607338. 827 

9. Zelová, H. & Hošek, J. TNF-α signalling and inflammation: Interactions between old 828 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 39 

acquaintances. Inflammation Research (2013) doi:10.1007/s00011-013-0633-0. 829 

10. Koelman, L., Pivovarova-Ramich, O., Pfeiffer, A. F. H., Grune, T. & Aleksandrova, K. 830 

Cytokines for evaluation of chronic inflammatory status in ageing research: Reliability 831 

and phenotypic characterisation. Immun. Ageing (2019) doi:10.1186/s12979-019-0151-1. 832 

11. Landskron, G., De La Fuente, M., Thuwajit, P., Thuwajit, C. & Hermoso, M. A. Chronic 833 

inflammation and cytokines in the tumor microenvironment. Journal of Immunology 834 

Research (2014) doi:10.1155/2014/149185. 835 

12. Ståhl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial 836 

transcriptomics. Science (2016) doi:10.1126/science.aaf2403. 837 

13. Vickovic, S. et al. High-definition spatial transcriptomics for in situ tissue profiling. Nat. 838 

Methods (2019) doi:10.1038/s41592-019-0548-y. 839 

14. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 840 

(2009) doi:10.1038/nmeth.1315. 841 

15. Denisenko, E. et al. Systematic assessment of tissue dissociation and storage biases in 842 

single-cell and single-nucleus RNA-seq workflows. Genome Biol. (2020) 843 

doi:10.1186/s13059-020-02048-6. 844 

16. Nguyen, Q. H., Pervolarakis, N., Nee, K. & Kessenbrock, K. Experimental considerations 845 

for single-cell RNA sequencing approaches. Frontiers in Cell and Developmental Biology 846 

(2018) doi:10.3389/fcell.2018.00108. 847 

17. Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. 848 

Nature (2017) doi:10.1038/nature21350. 849 

18. Abbas, A. R., Wolslegel, K., Seshasayee, D., Modrusan, Z. & Clark, H. F. Deconvolution of 850 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 40 

blood microarray data identifies cellular activation patterns in systemic lupus 851 

erythematosus. PLoS One (2009) doi:10.1371/journal.pone.0006098. 852 

19. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. 853 

Nat. Methods (2015) doi:10.1038/nmeth.3337. 854 

20. Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas 855 

Reveals Inter- and Intra-cell Population Structure. Cell Syst. (2016) 856 

doi:10.1016/j.cels.2016.08.011. 857 

21. Tsoucas, D. et al. Accurate estimation of cell-type composition from gene expression 858 

data. Nat. Commun. (2019) doi:10.1038/s41467-019-10802-z. 859 

22. Wang, X., Park, J., Susztak, K., Zhang, N. R. & Li, M. Bulk tissue cell type deconvolution 860 

with multi-subject single-cell expression reference. Nat. Commun. (2019) 861 

doi:10.1038/s41467-018-08023-x. 862 

23. Andersson, A. et al. Single-cell and spatial transcriptomics enables probabilistic inference 863 

of cell type topography. Commun. Biol. 3, 565 (2020). 864 

24. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues 865 

with digital cytometry. Nat. Biotechnol. (2019) doi:10.1038/s41587-019-0114-2. 866 

25. Myung, I. J. Tutorial on maximum likelihood estimation. J. Math. Psychol. (2003) 867 

doi:10.1016/S0022-2496(02)00028-7. 868 

26. Bassett, R. & Deride, J. Maximum a posteriori estimators as a limit of Bayes estimators. 869 

Math. Program. (2019) doi:10.1007/s10107-018-1241-0. 870 

27. Zhao, Y. & Simon, R. Gene expression deconvolution in clinical samples. Genome 871 

Medicine (2010) doi:10.1186/gm214. 872 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 41 

28. Chiu, Y. J., Hsieh, Y. H. & Huang, Y. H. Improved cell composition deconvolution method 873 

of bulk gene expression profiles to quantify subsets of immune cells. BMC Med. 874 

Genomics (2019) doi:10.1186/s12920-019-0613-5. 875 

29. Kang, K. et al. CDSeq: A novel complete deconvolution method for dissecting 876 

heterogeneous samples using gene expression data. PLoS Comput. Biol. (2019) 877 

doi:10.1371/journal.pcbi.1007510. 878 

30. Qiao, W. et al. PERT: A Method for Expression Deconvolution of Human Blood Samples 879 

from Varied Microenvironmental and Developmental Conditions. PLoS Comput. Biol. 880 

(2012) doi:10.1371/journal.pcbi.1002838. 881 

31. Zaitsev, K., Bambouskova, M., Swain, A. & Artyomov, M. N. Complete deconvolution of 882 

cellular mixtures based on linearity of transcriptional signatures. Nat. Commun. (2019) 883 

doi:10.1038/s41467-019-09990-5. 884 

32. Zeisel, A. et al. Molecular Architecture of the Mouse Nervous System. Cell (2018) 885 

doi:10.1016/j.cell.2018.06.021. 886 

33. Donovan, M. K. R., D’Antonio-Chronowska, A., D’Antonio, M. & Frazer, K. A. Cellular 887 

deconvolution of GTEx tissues powers discovery of disease and cell-type associated 888 

regulatory variants. Nat. Commun. (2020) doi:10.1038/s41467-020-14561-0. 889 

34. Phipson, B., Zappia, L. & Oshlack, A. Gene length and detection bias in single cell RNA 890 

sequencing protocols. F1000Research (2017) doi:10.12688/f1000research.11290.1. 891 

35. Chen, G., Ning, B. & Shi, T. Single-cell RNA-seq technologies and related computational 892 

data analysis. Frontiers in Genetics (2019) doi:10.3389/fgene.2019.00317. 893 

36. Chen, D. & Plemmons, R. J. Nonnegativity constraints in numerical analysis. in The Birth 894 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 42 

of Numerical Analysis (2009). doi:10.1142/9789812836267_0008. 895 

37. Lein, E. S. et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 896 

(2007) doi:10.1038/nature05453. 897 

38. Xin, Y. et al. RNA Sequencing of Single Human Islet Cells Reveals Type 2 Diabetes Genes. 898 

Cell Metab. (2016) doi:10.1016/j.cmet.2016.08.018. 899 

39. Patel, G. et al. Molecular taxonomy of human ocular outflow tissues defined by single-900 

cell transcriptomics. Proc. Natl. Acad. Sci. 117, 12856 LP – 12867 (2020). 901 

40. Xin, Y. et al. Pseudotime ordering of single human B-cells reveals states of insulin 902 

production and unfolded protein response. Diabetes (2018) doi:10.2337/db18-0365. 903 

41. Gutierrez, G. D. et al. Gene signature of proliferating human pancreatic a cells. 904 

Endocrinology (2018) doi:10.1210/en.2018-00469. 905 

42. Cerf, M. E. Beta cell dysfunction and insulin resistance. Frontiers in Endocrinology (2013) 906 

doi:10.3389/fendo.2013.00037. 907 

43. Maedler, K. & Donath, M. Y. Beta-cells in type 2 diabetes: a loss of function and mass. 908 

Hormone research (2004). 909 

44. Donath, M. Y. et al. Mechanisms of β-cell death in type 2 diabetes. Diabetes (2005) 910 

doi:10.2337/diabetes.54.suppl_2.S108. 911 

45. Calanna, S. et al. Alpha- and beta-cell abnormalities in haemoglobin A1c-defined 912 

prediabetes and type 2 diabetes. Acta Diabetol. (2014) doi:10.1007/s00592-014-0555-5. 913 

46. Kanat, M. et al. The Relationship Between β-Cell Function and Glycated Hemoglobin. 914 

Diabetes Care 34, 1006 LP – 1010 (2011). 915 

47. Nepton, S. Beta-Cell Function and Failure. in Type 1 Diabetes (2013). doi:10.5772/52153. 916 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 43 

48. Dolenšek, J., Rupnik, M. S. & Stožer, A. Structural similarities and differences between 917 

the human and the mouse pancreas. Islets (2015) doi:10.1080/19382014.2015.1024405. 918 

49. Lein, E. S. et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 919 

445, 168–176 (2007). 920 

50. Vieth, B., Parekh, S., Ziegenhain, C., Enard, W. & Hellmann, I. A systematic evaluation of 921 

single cell RNA-seq analysis pipelines. Nat. Commun. (2019) doi:10.1038/s41467-019-922 

12266-7. 923 

51. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome 924 

Biol. (2010) doi:10.1186/gb-2010-11-10-r106. 925 

52. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-926 

seq data using regularized negative binomial regression. Genome Biol. (2019) 927 

doi:10.1186/s13059-019-1874-1. 928 

53. Svensson, V. Droplet scRNA-seq is not zero-inflated. Nature Biotechnology (2020) 929 

doi:10.1038/s41587-019-0379-5. 930 

54. Delignette-Muller, M. L. & Dutang, C. fitdistrplus: An R package for fitting distributions. J. 931 

Stat. Softw. (2015) doi:10.18637/jss.v064.i04. 932 

55. Mullen, Katharine M., I. H. M. van S. nnls: The Lawson-Hanson algorithm for non-933 

negative least squares (NNLS). R Packag. version 1.4 (2012). 934 

56. Byrd, R. H., Lu, P., Nocedal, J. & Zhu, C. A Limited Memory Algorithm for Bound 935 

Constrained Optimization. SIAM J. Sci. Comput. (1995) doi:10.1137/0916069. 936 

57. The R Core Team. R: A Language and Environment for Statistical Computing. R 937 

Foundation for Statistical Computing (2019). 938 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 44 

58. Alessandri-Haber, N. et al. Hypotonicity induces TRPV4-mediated nociception in rat. 939 

Neuron (2003) doi:10.1016/S0896-6273(03)00462-8. 940 

59. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. 941 

Commun. (2017) doi:10.1038/ncomms14049. 942 

60. Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell (2019) 943 

doi:10.1016/j.cell.2019.05.031. 944 

61. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: Large-scale single-cell gene expression data 945 

analysis. Genome Biol. (2018) doi:10.1186/s13059-017-1382-0. 946 

62. van Rossum, G. & Drake, F. L. Python 3 Reference Manual. Scotts Valley, CA (2009). 947 

63. Wickham, H. & Francois, R. dplyr: A Grammar of Data Manipulation. R Packag. version 948 

0.4.2. (2015). 949 

64. Weston, S., Calaway, R. & Tenenbaum, D. doParallel: Foreach Parallel Adaptor for the 950 

Parallel Package. Cran (2014). 951 

65. Dowle, M. & Srinivasan, A. data.table: Extension of ‘data.frame’. R Package Version 952 

1.12.8. Manual (2019). 953 

 954 

Acknowledgements 955 

We thank Yurong Xin for pointing us to the relevant public data resource. We also thank Gabor 956 

Halasz and Yuan Zhu for the advice to algorithm design. 957 

 958 

Author contributions 959 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 45 

T.Y., Y.B., W.F., N.A.-H., M.L.-F., L.E.M. and G.S.A. designed the research. T.Y., Y.B., and W.F. 960 

developed the algorithm. T.Y., Y.B., W.F. and J.K. participated in the data analyzing. M.S. and 961 

R.B. performed the DRG tissue collection. C.A. performed the single cell library preparation and 962 

sequencing experiment. T.Y., Y.B., N.A.-H. and G.S.A. wrote the manuscript.  963 

 964 

Competing interests 965 

T.Y., Y.B., W.F. and G.S.A. have filed a patent application relating to the AdRoit computational 966 

framework. M.L.-F. is an employee of Cellular Longevity. All other authors are employees and 967 

shareholders of Regeneron Pharmaceuticals, although the manuscript’s subject matter does not 968 

have any relationship to any products or services of this corporation. 969 

 970 

Figure legends 971 

Fig. 1: Schematic representation of AdRoit computational framework. a, AdRoit inputs bulk or 972 

spatial RNA-seq data, single cell RNA-seq data and cell type annotations. It first selects 973 

informative genes and estimates their means and dispersions, based on which the cell type 974 

specificity of genes is computed. Depending on multi-sample availability, cross-sample gene 975 

variability is estimated from compound data, or single cell samples (dashed arrow). Lastly the 976 

gene-wise scaling factors are estimated using both compound data and single cell data. These 977 

computed quantities are fed to a weighted regularized model to infer the transcriptome 978 

composition. b, A mock example to illustrate the role of gene-wise scaling factor. Ideally, an 979 

accurate estimation of slop (i.e., cell proportion) would be the slope of the green line, however 980 

direct fitting would result in the red line due to the impact of the outlier genes. Outlier genes 981 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 46 

can be induced due to platform difference affecting genes differently. AdRoit adopts an 982 

adaptive learning approach that first learns a rough estimation of the slop (red line), then 983 

moves the outlier genes toward it such that the more deviated genes will be moved more 984 

toward the true line (i.e., longer arrows). After the adjustment, the new estimated slop (blue 985 

line) is closer to the truth (green line), thus is a more accurate estimation. 986 

 987 

Fig.2: Benchmark on simulated bulk data synthesized from trabecular meshwork (TM) single 988 

cells data. a, AdRoit has the closest estimation to the true cell proportion comparing to MuSiC 989 

and NNLS. Each dot is a cell type from one donor. b, For each cell type in TM, AdRoit has the 990 

smallest differences from the true cell type proportion and the smallest variance of estimates 991 

across the 8 donors. For each cell type, a dot on the graph denotes a donor, and the bars 992 

represent the 1.5	 × interquartile ranges. Estimation was done by using the single cell as 993 

reference leaving out the donor used for synthesizing bulk. c, AdRoit’s estimates are more 994 

accurate and specific than MuSiC’s estimates on synthetic bulk that contains partial cell types. 995 

The synthetic bulk was simulated by using only 6 out of the 12 cell types per donor, then 996 

estimated with the reference of 12 cell types. AdRoit has notably fewer false positive estimates 997 

of the 6 cell types not included, and more accurate estimation of the 6 cell types used for 998 

synthesizing bulk. d, Receiver operating characteristic (ROC) curve shows AdRoit has a 999 

significantly higher AUC than MuSiC (0.95 vs 0.74), meaning better sensitivity and specificity. 1000 

 1001 

Fig. 3: Benchmark on scRNA-seq data from dorsal root ganglion (DRG) where these exist many 1002 

closely related subtypes of neuronal cells. a, 14 cell types were identified from scRNA-seq 1003 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 47 

samples of 5 mice, including multiple subtypes of neurofilaments (NF), peptidergic (PEP) and 1004 

non-peptidergic (NP) neurons. b, Benchmarking with the synthetic data shows AdRoit’s 1005 

estimation of cell type proportions are highly accurate. In particular, AdRoit achieves 1006 

reasonably high accuracy when the cells are rare (e.g., < 5%). Each dot represents a cell type 1007 

from one sample. c, For each individual sample, mAD, RMSD, Pearson and Spearman 1008 

correlations were computed and compared across three methods. AdRoit has the lowest mAD 1009 

and RMSD, and highest Pearson and Spearman correlations. In addition, AdRoit’s estimation is 1010 

also the most stable across samples. Each dot on the boxplot is a sample. Estimation was done 1011 

by using the single cell reference leaving out the sample used for synthesizing bulk. 1012 

 1013 

Fig. 4: AdRoit is more accurate and sensitive than Stereoscope on spatial spots simulated 1014 

from real DRG cells. a, AdRoit and Stereoscope estimations on simulated spatial spots that 1015 

contains 5 PEP neuron subtypes. True mixing proportions were denoted by the red dashed 1016 

lines. Three schemes were simulated: 1) the proportions of 5 PEP cell types are the same and 1017 

equal to 0.2; 2) PEP1_Dcn is 0.1 and the other 4 are 0.225; 3) PEP1_Dcn and 1018 

PEP1_S100a11.Tagln2 are 0.1, PEP1_Slc7a3.Sstr2 and PEP2_Htr3a.Sema5a 0.2 are 0.2, and 1019 

PEP3_Trpm8 is 0.4. In all simulation schemes, AdRoit’s estimates are more consistently 1020 

centered around the true proportions than Stereoscope’s estimates. b, AdRoit is more accurate 1021 

in estimating rare cells in spatial spots. The spots were simulated by simulating mixtures of 3 1022 

PEP cell types (i.e., PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a and PEP3_Trpm8), with a series of 1023 

low percent of PEP3_Trpm8 cell type from 1% to 10% and the other two cell types sharing the 1024 

rest proportion equally. AdRoit’s estimates are systematically closer to the true simulated 1025 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 48 

proportions than Stereoscope’s estimates. c, AdRoit is consistently more sensitive than 1026 

Stereoscope in detecting low percent cells (estimates > 0.5% deemed as detected) in simulated 1027 

spots of 1) low percent of NF_Calb1 mixed with NF_Pvalb and NF2_Ntrk2.Necab2, 2) low 1028 

percent of NP_Mrgpra3 mixed with NP_Mrgprd and NP_Nts, 3) low percent of PEP3_Trpm8 1029 

mixed with PEP1_Slc7a3.Sstr2 and PEP2_Htr3a.Sema5a, 4) low percent of NF_Calb1 mixed with 1030 

Th, satellite glia and endothelial, 5) low percent of NP_Mrgpra3 mixed with Th, satellite glia and 1031 

endothelial, and 6) low percent of PEP_Trpm8 mixed with Th, satellite glia and endothelial. 1032 

 1033 

Fig. 5: Applications to real bulk human islets RNA-seq data and mouse brain spatial 1034 

transcriptome data. a, AdRoit’s estimates on real human Islets bulk RNA-seq data were highly 1035 

reproducible for the repeated samples from same donor. b, AdRoit estimated cell type 1036 

proportions agreed with the RNA-FISH measurements. c, AdRoit estimated Beta cell 1037 

proportions in type 2 diabetes patients are significantly lower than that in healthy subjects. In 1038 

addition, the estimated proportions have a significant negative linear association with donors’ 1039 

HbA1C level. d, The spatial mapping of 4 mouse brain cell types is consistent with the ISH 1040 

images of 4 marker genes from Allen mouse brain atlas37 respectively. The 4 genes, Spink8 1041 

(marker of hippocampal field CA1), C1ql2 (marker of Dentate Gyrus), Clic6 (marker of Choroid 1042 

Plexus), Synpo2 (marker of Thalamus) were identified as markers of corresponding tissues by 1043 

Zeisel et al32. 1044 

 1045 

Extended Data Fig. 1: Benchmark three methods on human pancreatic islets data. a, Human 1046 

islets single cell data contains 4 cell types from 18 subjects including two major cell types Alpha 1047 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 49 

and Beta cells, and two minor cells PP and Delta cells38. The cell proportion varies across 1048 

different subjects. b, c, AdRoit achieves leading accuracy when applied to the bulk data 1049 

synthesized from the single cell data. Each dot on scatterplot is a cell type from one subject. 1050 

Estimation was done by using the single cell reference leaving out the subject used to 1051 

synthesize bulk. 1052 

 1053 

Extended Data Fig. 2: Dorsal root ganglion single cell shows 14 cell types including 3 subtypes 1054 

of neurofilament, 3 subtypes of non-peptidergic neurons, and 5 subtypes of peptidergic 1055 

neurons. a, Heatmap of top markers shows distinction between cell types as well as similarity 1056 

between subtypes. b, The proportion of each cell type varies from 0.5% to 33.71% across 1057 

different samples. 1058 

 1059 

Extended Data Fig. 3: Comparing the performance on estimated simulated spatial spots of 14 1060 

pure cell type respectively. a, Estimates by AdRoit and b, estimates by Stereoscope are 1061 

comparably accurate. Simulations were done by sampling cells from the same cell type and 1062 

adding up the read counts per gene. For each of the 14 cell types of the DRG tissue, we 1063 

repeated the simulation 100 times. The results shown were a summary of 100 simulations for 1064 

each cell type. For both methods, the median estimates of the sampled cell type were close to 1065 

1 (red lines), whereas the cell type not sampled has zero or close-to-zero values. 1066 

 1067 

Extended Data Fig. 4: The comparison of AdRoit and Stereoscope on the simulated spots of 1068 

additional cell mixing schemes. 5 more types of mixed spatial spots were simulated: 1) mixture 1069 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 50 

of 3 neurofilaments (NF); 2) mixture of 3 non-peptidergic (NP) cell types; 3) NF2_Ntrk2.Necab2 1070 

mixing with Th, satellite glia and endothelial; 4) NP_Nts mixing with Th, satellite glia and 1071 

endothelial; and 5) PEP3_Trpm8 mixing with Th, satellite glia and endothelial. Each simulation 1072 

was repeated 100 times. Consistently for all simulation schemes, AdRoit’s estimates were 1073 

always closer to the true simulated proportions (red lines), whereas Stereoscope’s estimates 1074 

largely deviated from the true proportions. 1075 

 1076 

Extended Data Fig. 5: Spatial mapping of 46 cell types with AdRoit quantitative depicts the 1077 

content in each spot. Spatial transcriptomics data was downloaded from 10x genomics 1078 

(https://support.10xgenomics.com/spatial-gene-1079 

expression/datasets/1.1.0/V1_Adult_Mouse_Brain_Coronal_Section). The reference single cells 1080 

were sampled from Zeisel et al32 and curated into 46 cell types. 1081 

 1082 

 1083 

 1084 

 1085 

 1086 

 1087 

 1088 

 1089 

 1090 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 51 

Figures 1091 

Fig. 1 1092 

 1093 

 1094 

 1095 

 1096 

 1097 

 1098 

 1099 

 1100 

 1101 

 1102 

 1103 

 1104 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 52 

Fig. 2 1105 

 1106 

 1107 

 1108 

 1109 

 1110 

 1111 

 1112 

 1113 

 1114 

 1115 

 1116 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 53 

Fig. 3 1117 

 1118 

 1119 

 1120 

 1121 

 1122 

 1123 

 1124 

 1125 

 1126 

 1127 

 1128 

 1129 

 1130 

 1131 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 54 

Fig. 4 1132 

 1133 

 1134 

 1135 

 1136 

 1137 

 1138 

 1139 

 1140 

 1141 

 1142 

 1143 

 1144 

 1145 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 55 

Fig. 5 1146 

 1147 

 1148 

 1149 

 1150 

 1151 

 1152 

 1153 

 1154 

 1155 

 1156 

 1157 

 1158 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 56 

Extended Data Fig. 1 1159 

 1160 

 1161 

 1162 

 1163 

 1164 

 1165 

 1166 

 1167 

 1168 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 57 

Extended Data Fig. 2 1169 

 1170 

 1171 

 1172 

 1173 

 1174 

 1175 

 1176 

 1177 

 1178 

 1179 

 1180 

 1181 

 1182 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 58 

Extended Data Fig. 3 1183 

 1184 

 1185 

 1186 

 1187 

 1188 

 1189 

 1190 

 1191 

 1192 

 1193 

 1194 

 1195 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 59 

Extended Data Fig. 4 1196 

 1197 

 1198 

 1199 

 1200 

 1201 

 1202 

 1203 

 1204 

 1205 

 1206 

 1207 

 1208 

 1209 

 1210 

 1211 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697


 60 

Extended Data Fig. 5 1212 

 1213 

 1214 

(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 
The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint 

https://doi.org/10.1101/2020.12.14.422697