AdRoit: an accurate and robust method to infer complex transcriptome composition 1 AdRoit: an accurate and robust method to infer complex 1 transcriptome composition 2 3 Tao Yang1, Nicole Alessandri-Haber1, Wen Fury1, Michael Schaner1, Robert Breese1, Michael LaCroix-Fralish2, 4 Jinrang Kim1, Christina Adler1, Lynn E. Macdonald1, Gurinder S. Atwal1, Yu Bai1, * 5 6 Affiliations 7 1. Regeneron Pharmaceuticals, Inc., Tarrytown NY 10591 8 2. Cellular Longevity, Inc., San Francisco, CA 94103 9 10 *Corresponding author 11 12 Abstract 13 RNA sequencing technology promises an unprecedented opportunity in learning disease 14 mechanisms and discovering new treatment targets. Recent spatial transcriptomics methods 15 further enable the transcriptome profiling at spatially resolved spots in a tissue section. In 16 controlled experiments, it is often of immense importance to know the cell composition in 17 different samples. Understanding the cell type content in each tissue spot is also crucial to the 18 spatial transcriptome data interpretation. Though single cell RNA-seq has the power to reveal 19 cell type composition and expression heterogeneity in different cells, it remains costly and 20 sometimes infeasible when live cells cannot be obtained or sufficiently dissociated. To 21 computationally resolve the cell composition in RNA-seq data of mixed cells, we present AdRoit, 22 an accurate and robust method to infer transcriptome composition. The method estimates the 23 proportions of each cell type in the compound RNA-seq data using known single cell data of 24 relevant cell types. It uniquely uses an adaptive learning approach to correct the bias gene-wise 25 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 2 due to the difference in sequencing techniques. AdRoit also utilizes cell type specific genes 26 while control their cross-sample variability. Our systematic benchmarking, spanning from 27 simple to complex tissues, shows that AdRoit has superior sensitivity and specificity compared 28 to other existing methods. Its performance holds for multiple single cell and compound RNA-29 seq platforms. In addition, AdRoit is computationally efficient and runs one to two orders of 30 magnitude faster than some of the state-of-the-art methods. 31 32 Introduction 33 RNA sequencing is a powerful tool to address the transcriptomic perturbations in disease 34 tissues and help understand the underlying mechanism to develop treatments1. Due to the 35 presence of heterogeneous cell populations, bulk tissue transcriptome only characterizes the 36 averaged expression of genes over a mixture of different types of cells. The identity of 37 individual cell types and their prevalence remain unelucidated in the bulk data. However, 38 knowledge of the cell type composition and gene expression perturbation at the cell type level 39 is often critical to identifying disease-manifesting cells and designing targeted therapies. For 40 instance, the constitution of stromal and immune cells sculpts the tumor microenvironment 41 that is essential in cancer progression and control2–6. Excessive expression of cytokines in 42 particular leukocyte types underlines the etiology of many chronic inflammatory diseases 7–11. 43 Such information cannot be directly read out from the bulk RNA-Seq. 44 45 Recent breakthroughs in spatial transcriptomics methods enable characterizing whole 46 transcriptome-wise gene expressions at spatially resolved locations in a tissue section12. 47 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 3 However, it remains challenging to reach a single cell resolution while measuring tens of 48 thousands of genes transcriptome-wise. Some widely used technologies can achieve a 49 resolution of 50-100 μm, equivalent to 3–30 cells depending on the tissue type12,13. The 50 transcripts therein may originate from one or more cell types. Unlike the bulk RNA-seq, the 51 profiling data at each spot contains substantial dropouts as merely a few cells are sequenced, 52 imposing additional challenges to demystify the cell type content. We refer to bulk RNA-seq 53 and spatial transcriptomics data at the multi-cell resolution as compound RNA-seq data 54 hereafter. 55 56 The rapid development of single-cell RNA-seq (scRNA-seq) technologies has allowed for cell-57 type specific transcriptome profiling14. It provides the information missing from the compound 58 RNA-seq data. Nevertheless, the technologies have low sensitivity and substantial noise due to 59 the high dropout rate and the cell-to-cell variability. Consequently, scRNA-seq technologies 60 require a large number of cells (thousands to tens of thousands) to ensure statistical 61 significance in the results. In addition, the cells must remain viable during capture. These 62 requirements render the scRNA-seq technologies costly, prohibiting their application in clinical 63 studies that involve many subjects or cannot allow real time tissue dissociation and cell capture. 64 Furthermore, scRNA-seq technologies may not be well suited to characterizing cell-type 65 proportions in solid tissues because the dissociation and capture steps can be ineffective to 66 certain cell types 15–17. 67 68 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 4 As sequencing at the single cell level is not always feasible, in silico approaches have been 69 developed to infer cell type proportions from compound RNA-seq data18–24. The most common 70 strategy is to conduct a statistical inference through the maximum likelihood estimation 71 (MLE)25 or the maximum a posterior estimation (MAP)26 on a constrained linear regression 72 framework, wherein the unobserved mixing proportion of a finite number of cell types are part 73 of the latent variables to be optimized. 1921–24The deconvolution methods are often applied to 74 dissect the immune cell compositions in blood samples27–31. However, their performance in 75 more complex tissues, such as the nervous, ocular, respiratory and gastrointestinal organs, 76 remains unclear. These tissues often contain many cell types (10-102) and the difference among 77 related cells can be subtle, rendering the deconvolution a challenging task. For example, a 78 recent study on the mouse nervous system contains more than 200 cell clusters and many are 79 highly similar neuronal subtypes32. 80 81 Earlier works often utilized the transcriptome profiling of the purified cell populations to 82 estimate the gene expressions per cell type (e.g. Cibersort)19. More recently, acquiring cell type 83 specific expression from the scRNA-seq data was shown to be an intriguing alternative21–24. 84 Though it provides higher throughput by measuring multiple cell types in one experiment, 85 profiling at single cell level is substantially noisy. Deconvolution using scRNA-seq data as 86 reference can be biased by noise non-relevant to cell identities if not treated properly. 87 Moreover, the platform difference between the compound data and the single cell data cannot 88 be ignored. 89 90 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 5 To overcome these challenges, additional information from the data may be considered. A 91 recent method that weighs genes according to their expression variances across samples 92 greatly improved the accuracy22, highlighting the importance of gene variability in inferring cell 93 type composition. Some other methods and applications have pointed out the importance of 94 cell type specific genes24,28,31,33. In these works, the cell type specific expression was only used 95 to select the input genes (e.g., markers). Nonetheless, it measures how informative a gene is in 96 distinguishing cell types and thus can be incorporated as a part of the model. To address the 97 platform difference between the compound data and the single cell data it is usually assumed 98 there exists a single scaling factor or a linearly scaled bias for all genes that can be learned and 99 corrected accordingly22,23. This assumption is hardly held because the impact of the platform 100 difference to each gene is different. Though learning a uniform scaling factor would correct the 101 difference in the majority of genes, a few genes that remain significantly biased can easily 102 confound the estimation, especially under a linear model framework. Thus, a gene-wise 103 correction should be considered. 104 105 In this work, we presented a new deconvolution method, AdRoit, a unified framework that 106 jointly models the gene-wise technology bias, genes’ cell type specificity and cross-sample 107 variability. The method estimated the cell type constitution in the compound RNA-seq samples 108 using relevant single cell data as a training source. Genes used for deconvolution were 109 automatically selected from the single cell data based on their information richness. Uniquely, 110 it uses an adaptively learning approach to estimate gene-wise scaling factors, addressing the 111 issue that different platforms impact genes differently. The model of AdRoit is further 112 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 6 regularized to avoid collinearity among closely related cell subtypes that are common in 113 complex tissues. Over a comprehensive benchmarking data sets with a varying cell composition 114 complexity, AdRoit showed superior sensitivity and specificity to other existing methods. 115 Applications to real RNA-seq bulk data and spatial transcriptomics data revealed strong and 116 expected biologically relevant information. We believe AdRoit offers an accurate and robust 117 tool for cell type deconvolution and will promote the value of the bulk RNA-seq and the spatial 118 transcriptomics profiling. 119 120 Results 121 Overview of the AdRoit framework 122 AdRoit estimates the proportions of cell types from compound transcriptome data including but 123 not limited to bulk RNA-seq and spatial transcriptome. It directly models the raw reads without 124 normalization, preserving the difference in total amounts of RNA transcript in different cell 125 types. The method utilizes as reference the relevant pre-existing single cell RNA-seq data with 126 cell identity annotation. It selects informative genes, estimates the mean and dispersion of the 127 expression of selected genes per cell type, and constructs a weighted regularized linear model 128 to infer percent combinations (Fig. 1a). Because sequencing platform bias impacts genes 129 differently15,34,35, a uniform scaling factor for all genes does not sufficiently eliminate such bias. 130 A key innovation of AdRoit is that it uniquely adopts an adaptive learning approach, where the 131 bias was first estimated for each gene, then adjusted such that more biased gene is corrected 132 with a larger scaling factor (Fig. 1b). 133 134 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 7 We also attribute the success of AdRoit to the consideration of a comprehensive set of other 135 relevant factors including genes’ cross-sample variability, cell type specificity and collinearity of 136 expression profiles among closely related cell types. The cross-sample variability of a gene 137 confounds its biological expression variability due to the variety of cell types. The latter is 138 referred as the cell type specific expression that helps identify the cell type. AdRoit weighs 139 down genes with high cross-sample variability whilst weighs up those with an expression highly 140 specific to certain cell types. The definition of cross-sample variability and cell type specificity 141 also accounts for the overdispersion nature in counts data. Lastly, AdRoit adopted a linear 142 model to ensure the interpretability of the coefficients. At the same time, AdRoit included a 143 regularization term to minimize the impact of the statistical collinearity. Each of the factors 144 contributes an indispensable part to AdRoit, leading to an accurate and robust deconvolution 145 method for inferring complex cell compositions. 146 147 To evaluate the performance, we compared AdRoit with MuSiC22 and NNLS18,36 for bulk data 148 deconvolution, and stereoscope23 for spatial transcriptomics data deconvolution. When 149 evaluating the algorithms, a common practice is to pool the single cell data to synthesize a 150 “bulk” sample with the known ground truth of the cell composition. We measured the 151 performance by comparing the estimated cell proportions with true proportions using four 152 metrics: mean absolution difference (mAD), rooted mean squared deviation (RMSD) and two 153 correlation statistics (i.e., Pearson and Spearman). Both correlations are included because 154 Pearson reflects linearity, while Spearman avoids the artificial high scores driven by outliers 155 when majority of estimates are tiny. Good estimations feature low mAD and RMSD along with 156 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 8 high correlations. When estimating cell proportions for a synthetic sample, cells from this 157 sample are excluded from the input single cell reference (i.e., leave-one-out) to avoid 158 overfitting. We further applied AdRoit to real bulk RNA-seq data and validated the results by 159 available RNA fluorescence in-situ hybridization (RNA-FISH) data. The estimates were further 160 confirmed by relevant biology knowledge of human pancreatic islets. We also used AdRoit to 161 map cell types on spatial spots, and the accuracy was verified by in-situ hybridization (ISH) 162 images from Allen mouse brain atlas37. 163 164 AdRoit excels in datasets with both simple and complex cell constitutions 165 We started with a simple human pancreatic islets dataset that contains 1492 cells and four 166 distinct endocrine cell types (i.e., Alpha, Beta, Delta, and PP cells)38 (Extended Data Fig. 1a; 167 Supplementary Table 1). The synthesized bulk data were constructed by mixing the single cell 168 data at known proportions. Though all three methods achieved satisfactory performance 169 according to the evaluation metrics, AdRoit has slightly better performance as reflected by 170 scatterplots of estimated proportion vs. true proportion (Extended Data Fig. 1b, Supplementary 171 Table 2). It has moderately lower mAD (0.029 vs. 0.031 for MuSiC and 0.066 for NNLS), and 172 RMSD (0.039 vs. 0.046 for MuSiC and 0.095 for NNLS) and comparable correlations (Pearson: 173 0.99 vs 0.98 for MuSiC and 0.93 for NNLS; Spearman: 0.97 vs 0.98 for MuSiC and 0.91 for NNLS) 174 (Extended Data Fig. 1c). This performance was expected because there were only four cell types 175 with very distinct transcriptome profiles. Deconvoluting such data was a relatively easy task for 176 all three methods. 177 178 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 9 We then tested the methods on a couple of complex tissues that are more challenging to 179 deconvolute. One is the human trabecular meshwork (TM) tissue. We acquired published single 180 cell data that contains 8758 cells and 12 cell types from 8 donors39. The data include 3 similar 181 types of endothelial cells, 2 types of Schwann cells and 2 types TM cells (Supplementary Fig. 1; 182 Supplementary Table 3). Cells from each donor were pooled as a synthetic bulk sample. The cell 183 type proportions vary from <1% to 43%. These proportions were the ground truth cell 184 composition and were compared head-to-head with the estimated proportions inferred by 185 AdRoit, MuSiC and NNLS. For each synthetic bulk sample, estimations were performed using a 186 reference built from cells of other donors (i.e., leaving-one-out). In each of the 8 samples, the 187 estimates made by AdRoit best approximated the true proportions. In particular, AdRoit had 188 significantly lower mAD (0.016) and RMSD (0.025), and higher correlations (Pearson = 0.97; 189 Spearman = 0.94), comparing to MuSiC (mAD = 0.038; RMSD = 0.06; Pearson = 0.83; Spearman 190 = 0.73) and NNLS (mAD = 0.06; RMSD = 0.088; Pearson = 0.69; Spearman = 0.63) (Fig. 2a). We 191 further assessed the deviation of the estimates from the true proportions for each cell type. 192 AdRoit consistently had the lowest deviations from the true proportions for all cell types, as 193 well as the lowest variation among 8 samples (Fig. 2b, blue dots), indicating a higher robustness 194 over various cell types and samples. Notably, AdRoit only missed one rare cell type (true 195 proportion = 0.3%) out of 12 cell types in one sample, while MuSiC missed 1 to 5 cell types in 6 196 of the 8 samples, and NNLS missed 3 to 7 cell types in all 8 samples (Supplementary Fig. 2, 197 Supplementary Table 4). 198 199 AdRoit has better sensitivity and specificity 200 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 10 We next systematically addressed the sensitivity and specificity of these algorithms. In the 201 context of the cell type deconvolution, a false negative occurs when the proportion of an 202 existing cell type is predicted to be zero (or below a given threshold). Conversely, a non-zero 203 prediction (or above a given threshold) of an absent cell type results in a false positive. False 204 negatives and false positives measure the sensitivity and specificity of a deconvolution 205 algorithm, respectively. Both quantities are crucial to establish the utility of the algorithm. 206 Particularly, in real world applications, it is often difficult to know a prior what cell types exist in 207 a bulk sample, users may inform the algorithm to consider more possible cell types than what 208 are actually in the sample. False positive predictions in this situation would make the algorithm 209 unusable. 210 211 We designed a simulation to test the sensitivity and specificity. we selected 6 out of the 12 cell 212 types, i.e., Schwann-cell like cell, TM1, smooth muscle cell, melanocyte, macrophage and 213 pericyte, from each donor sample and pooled them within that sample to synthesize 8 new bulk 214 samples. The unselected 6 cell types are considered absent in the bulk samples. Some cell types 215 in presence are highly similar to those in absence, challenging the programs to pinpoint the 216 right cell type present in the bulk among similar candidates. We provided the full list of 12 217 single cell types as reference to the programs to estimate the cell type proportions. NNLS was 218 excluded from this evaluation due to its low benchmarking performance observed earlier (Fig. 219 2a, b). 220 221 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 11 Consistently across 8 samples, AdRoit had very accurate estimates for the 6 present cell types, 222 and zero or close-to-zero estimated values for the non-existing cell types in the synthetic bulk 223 data. MuSiC was notably less accurate on the 6 selected cell types, meanwhile it had many non-224 negligible values (>1% for 26 out 48 estimates) of the 6 cell types excluded in the 8 synthetic 225 samples (Fig. 2c, Supplementary Table 5). For example, smooth muscle cells accounted for 226 ~14% in donor 4 but was largely missed (~0.03%) by MuSiC. We noted that TM2 had false non-227 zero estimates from both methods though not included. This is because TM2 is easily mistaken 228 as TM1 due to their high similarity39. Nonetheless, AdRoit’s estimates of TM2 were consistently 229 small across samples (<1% for 44 out of 48 estimates), while MuSiC had significantly larger 230 estimates of TM2 that occasionally even exceeded the TM1 estimates (donors 5 and 8 in Fig. 2c 231 right). For a systematic comparison, we constructed the receiver operating characteristic (ROC) 232 curve by varying the threshold of detection (i.e., a cutoff below which the cell type was deemed 233 undetected) (Fig. 2d). AdRoit had significantly higher area under the curve (AUC) than MuSiC 234 (0.95 vs. 0.74), implying a dominantly better sensitivity and specificity. 235 236 AdRoit outperforms in deconvoluting closely related subtypes 237 To further evaluate AdRoit when multiple cell subtypes present in a complex tissue, we 238 performed scRNA-seq experiment on mouse lumbar dorsal root ganglion (DRG) from five mice. 239 Following the standard analysis pipeline (Methods), we obtained 3352 single cells after quality 240 control procedures. After clustering and annotation, we discovered 14 cell types including 241 multiple subtypes of neuronal cells (Fig. 3a, Supplementary Table 6). The heatmap of the top 242 marker genes showed distinct patterns of the major cell types as well as similar patterns of the 243 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 12 subtypes (Extended Data Fig. 2a), and the cell type proportions varied from 0.5% to 33.71% 244 (Extended Data Fig. 2b). These 14 cell types include 3 subtypes of neurofilament containing 245 neurons (i.e., NF_Calb1, NF_Pvalb, NF_Ntrk2.Necab2), 3 subtypes of non-peptidergic neurons 246 (i.e., NP_Nts, NP_Mrgpra3, NP_Mrgprd), and 5 subtypes of peptidergic neurons (i.e., PEP1_Dcn, 247 PEP1_S100a11.Tagln2, PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a, PEP3_Trpm8). Also discovered 248 were tyrosine hydroxylase containing neurons (Th), satellite glia and endothelial cells. Such 249 complex compositions formed a challenging testing ground for evaluating the ability to 250 distinguish closely related cell types. We again did the leave-one-out deconvolution on five 251 synthesized bulk samples. 252 253 AdRoit had highly accurate estimations on all cell types across samples (Fig. 3b). It is worth to 254 mention that, for the rare cell types that account for less than 5%, AdRoit still had a good 255 estimation that is fairly close to the true proportions and never missed a single cell type, 256 showing that AdRoit is very robust on rare cell types. For example, 0.51% endothelial cells were 257 predicted to be 0.35%, and 1.05% NF2_Ntrk2.Necab2 cells were predicted to be 0.85% 258 (Supplementary Fig. 3, Supplementary Table 7). On the contrary, MuSiC and NNLS were notably 259 less accurate, especially for the cell types less than 5%, and missed multiple cell types including 260 some large cell clusters taking account of ~10% (PEP1_Slc7a3.Sstr2 cells of Sample5). We 261 further examined how much the variability of the estimates was in each individual sample. We 262 computed the 4 metrics to evaluate the performance on each of the 5 synthetic samples and 263 compared them head-to-head among the algorithms. This fine comparison showed AdRoit 264 significantly outperformed MuSiC and NNLS on every sample (Fig. 3c). Further, the performance 265 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 13 metrics of AdRoit were highly consistent across samples with the lowest variability among the 266 three methods. 267 268 AdRoit excels on simulated spatial transcriptomics data 269 Given the promising performance on complex tissues, we continued to test AdRoit’s 270 applicability to spatial transcriptomics data. Spatial transcriptomics data differs from bulk RNA-271 seq data in that each spot only contains transcripts from a handful of cells (3-30)12. Some of the 272 spots contain multiple cells of the same type, while others may have mixtures of cell types at 273 varying mixing percentages (e.g., spatial spots at the boundary of different cell types). Also, 274 because the mixture is a pool of only a few cells, the variations across spatial spots are 275 expected to be greater than in bulk samples. We simulated a large number of spatial spots 276 (3200 in total) by using sampled cells from the DRG single cell data above (Methods), then 277 compared AdRoit with Stereoscope over a range of simulation scenarios. 278 279 We first tested whether the methods could correctly infer a single cell type when the spots 280 contain cells from that same type. For each of the 14 cell types from DRG, we sampled 10 cells 281 and pooled them to form a spatial spot. We repeated the simulation for 100 times for a robust 282 testing, then used the full set of 14 cell types as reference to deconvolute the 1400 simulated 283 spots. Both methods were able to identify the correct cell types with indistinguishable accuracy 284 on the simulated cell types (i.e., estimates close to 1) and comparably low estimated values 285 (i.e., estimates close to zero) for other cell types not included when simulating the spots 286 (Extended Data Fig. 3). 287 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 14 288 We then continued a difficult scenario where we sampled cells from the 5 PEP subtypes and 289 mixed them. We created three simulation schemes for a comprehensive evaluation: 1) 5 PEP 290 subtypes had same percent of 0.2; 2) PEP1_Dcn was 0.1 and the other 4 were 0.225; 3) 291 PEP1_S100a11.Tagln2 and PEPE1_Dcn were 0.1, PEP2_Htr3a.Sema5a and PEP1_Slc7a3.Sstr2 292 were 0.2, and PEP3_Trpm8 was 0.4. Again, each simulation scheme was repeated 100 times. 293 Under each scheme, the estimates by AdRoit consistently centered around true proportions 294 and the other cell types had very low estimated values (close to zero) (Fig. 4a, Supplementary 295 Table 8). In comparison, though the estimates for the other cell types were also generally close 296 to zero, the estimates of the PEP cells by Stereoscope systematically deviated from the true 297 proportions for all three simulated schemes except for PEP1_S100a11.Tagln2. 298 299 We further expanded the simulated spatial spots to the mixture of 3 NP cell types and mixture 300 of 3 NF cell types. In addition, we sampled NP_Mrgpra3 cells and mixed them with other 301 distinct cell types (i.e., Th, satellite glia and endothelial), as well as NF_Calb1 cells mixed with 302 other distinct cell types, and PEP3_Trpm8 mixed with other distinct cell types. For all these 303 simulated spatial spots, AdRoit’s estimates were consistently centered at true proportions, 304 whereas Stereoscope’s estimates deviated in almost all simulated schemes (Extended Data Fig. 305 4, Supplementary Table 8). We speculate the main reason Stereoscope underperformed at 306 these simulated spots is that it normalizes the total UMI counts to the same number for all 307 cells. In real world, a spatial spot is unlikely to be a pool of cells that have the same total RNA 308 transcripts sampled, especially when a spot contains different cell types (e.g., immune cells 309 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 15 have about 10-fold less total UMIs than the neuronal cells or subtypes of neuronal cells). Our 310 simulation pooled the sampled cells by adding up the raw UMI counts per gene, which we 311 believe best mimics the real data. 312 313 Next, we asked how sensitive the methods are in detecting rare cell populations. We simulated 314 mixtures of 3 PEP subtypes (i.e., PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a, PEP3_Trpm8) with a 315 series of low percent PEP3_Trpm8 (from 0.01 to 0.1 by 0.01), and the other two cell types 316 sharing the rest percentage equally (Methods). At each given percent, the simulation was 317 repeated 100 times. We then checked how accurately the percent of PEP3_Trpm8 cells was 318 estimated. The medians of AdRoit’s estimates were always close to the true proportions (Fig. 319 4b, red lines), whereas that of Stereoscope’s estimates were largely lower than true 320 proportions. Stereoscope also missed the majority of PEP3_Trpm8 cell type when the simulated 321 proportion was below 0.06. This comparison implied AdRoit is more advantageous in detecting 322 low percent cells. For a complete comparison, we also simulated 5 other types of cell mixtures 323 in the same way. At each given low percent, we computed how many times out of 100 the low 324 percent cell component was detected (estimates > 0.005). AdRoit had systematically higher 325 detection rates, as well as higher consistency across different cell mixtures (Fig. 4c, 326 Supplementary Table 9). Notably, at a simulated percent of 5%, AdRoit achieved >90% of 327 detention rate, making it a powerful tool in detecting rare cells. 328 329 Though MuSiC was not designed for deconvoluting spatial spots, theoretically it also can be 330 applied to spatial transcriptomics data. We thus also compared AdRoit to MuSiC on the same 331 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 16 sets of simulation data above. We observed AdRoit was also significantly more accurate over all 332 simulation scenarios of spatial spots (Fig. 4a, Extended Data Fig. 3 and 4, Supplementary Fig. 4), 333 and more sensitive when detecting low percent cells (Fig. 4b, c, Supplementary Fig. 5). 334 335 Application to real bulk RNA-seq data of human pancreatic islets 336 Though using synthetic bulk data based on mixing of single cells is a useful benchmarking 337 strategy, the bulk and single cell RNA-seq often use distinct RNA library preparation and 338 sequencing protocols. The capability of a method to deconvolute real bulk samples shall be 339 addressed to ensure it is useful in the real-world applications. We acquired 70 real human 340 pancreatic islets bulk samples from published studies38,40,41 (Supplementary Table 10) and used 341 single cell data of the same tissue38 as reference to infer the percentages of 4 endocrine cell 342 types (i.e., Alpha, Beta, Delta, PP). The 70 bulk samples were collected from 39 distinct donors, 343 including 26 healthy donors, and 13 donors with type 2 diabetes (T2D). Each donor contributed 344 1 to 5 replicated bulk RNA samples. 345 346 Replicates from the same donor are expected to have similar compositions and thus were used 347 to assess the reproducibility of the estimates from AdRoit. For all cell types, AdRoit had highly 348 consistent estimates for the same donors (Fig. 5a, Supplementary Table 11). The average 349 standard deviations did not exceed 1% for all 4 cell types (i.e., Alpha: 0.010; Beta: 0.008; Delta: 350 0.004; PP: 0.002). To seek an independent validation, we obtained cell sorting results by RNA-351 FISH for 4 of the 39 donors38 (Supplementary Table 12). The estimated cell proportions of the 4 352 were highly consistent with the percentages measured by RNA-FISH (Fig. 5b), and the 353 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 17 consistency held for both major cells (Alpha and Beta) and the minor cells (Delta and PP). 354 Reproducibility and independent validation showed AdRoit is reliable in deconvoluting real bulk 355 RNA-seq data. 356 357 We then asked if AdRoit can detect known biological differences between healthy and T2D 358 donors. Loss of functional insulin-producing Beta cells is a prominent characteristic of T2D42–44, 359 typically reflected by elevated level of hemoglobin A1c (HbA1c)45,46. Among the healthy donors, 360 the majority of Beta cell proportions estimated by AdRoit ranged from 50% to 75% (Fig. 5c), 361 agreed with the known percent range of Beta cells in human islets tissue47,48. A significant 362 decreasing of the estimated Beta cell proportions was seen in T2D patients (P value = 4.1e-6). 363 Further, a linear regression of estimated Beta cell proportions on HbA1c levels showed a 364 statistically significant negative association (P value = 1.8e-6). AdRoit adequately reflected the 365 cell composition difference between healthy donors and T2D patients. 366 367 Application to mouse brain spatial transcriptomics 368 We lastly demonstrated an application to the real spatial transcriptomics data. Given the 369 molecular architecture of brain tissue has been well studied, we chose mouse brain spatial 370 transcriptomics data generated by 10x genomics, containing 2703 spatial spots (Methods). The 371 reference single cell data were acquired from an independent study which contains a 372 comprehensive set of nervous cell types in brain32. We curated the cell types by merging highly 373 similar clusters and came down to a consolidated set of 46 distinct brain cell types (Methods, 374 Supplementary Table 13). 375 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 18 376 The cell contents inferred by AdRoit per spot appear to accurately match the expected cell 377 types at that location (Extended Data Fig. 5, Supplementary Table 14). For example, the three 378 subtypes of cortex excitatory neurons each occupied a sub-area in the cerebral cortex region. 379 As another example, the shape of hippocampal region was delineated by the estimated 380 percentages of dentate gyrus granule/excitatory neurons. For an independent validation, we 381 checked the consistency between estimated cell types with the in-situ hybridization (ISH) 382 images from Allen mouse brain atlas49. We chose 4 genes highly expressed in 4 brain regions 383 respectively, i.e., Spink8 for hippocampal field CA1, C1ql2 for dentate gyrus, Clic6 for choroid 384 plexus, and Synpo2 for thalamus32. The spots enriched with the 4 cell types (i.e., hippocampal 385 CA1 excitatory neuron type 2, dentate gyrus granule neuron type 2, choroid plexus cell, 386 thalamus excitatory neuron type 1), as mapped by AdRoit, precisely co-localized with the strong 387 signals of the 4 marker genes on the ISH images respectively (Fig. 5d). This agreement 388 confirmed that the spatial mapping of cell types by AdRoit is reliable. 389 390 Computational efficiency 391 Besides the accuracy and robustness, another major advantage of AdRoit is its magnitude 392 higher computational efficiency. AdRoit uses a two-step procedure to do the inference. The first 393 step prepares the reference on single cell data where per-gene means and dispersions are 394 estimated, and cell type specificity is subsequently computed. The built reference can be saved 395 and reused. We tested the running time on the reference building using the aforementioned 396 mouse brain single cell dataset containing ~15,000 cells. It took about 4.5 minutes on a CPU 397 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 19 that has 24 cores (23 used for parallel computing). The second step inputs the built reference 398 and target compound data and does the estimation. Deconvoluting ~2700 compound RNA-seq 399 samples took around 5 minutes. Therefore, AdRoit in total took less than 10 minutes and ~3Gb 400 memory usage on a regular CPU. As a comparison, MuSiC took about 1 hour and 37 minutes on 401 the same data using the same CPU. Stereoscope ran about 24 hours continuously with the 402 published parameter setting (-scb 256 -sce 75000 -topn_genes 5000 -ste 75000 -lr 0.01 -stb 100 403 -scb 100) on a powerful V100 GPU with 80 cores and 16G memory, which is prohibitive for 404 seeking a quick turnaround. 405 406 Discussion 407 In this work we have demonstrated that AdRoit is capable of deconvoluting the cell 408 compositions from the compound RNA-seq data with a leading accuracy, measured by the 409 consistency between the true and predicted cell proportions. Its advantage over the existing 410 state-of-the-art methods was verified over a wide range of use cases. In particular, AdRoit 411 excelled in complex tissues composed of more than ten different cell types with wide range of 412 cell proportions (e.g., trabecular meshwork, dorsal root ganglion). In both cases, AdRoit 413 performed significantly better than the comparators MuSiC and NNLS on deconvoluting bulk 414 RNA-seq data. AdRoit is also more accurate and sensitive than Stereoscope in demystifying 415 spatial transcriptomics spots, especially in detecting low percent cells. Previous benchmarking 416 often assumed the types of cells in the synthetic bulk data are not more or less than the cell 417 types collected in the reference, and thus the only unknown was the proportion of each cell 418 type. This assumption may not hold. Missing existing cell types or false predictions of non-419 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 20 existing ones can hinder the utility of an algorithm. Thus, besides the overall accuracy, we also 420 examined the sensitivity and specificity of the algorithms. We observed a superior sensitivity 421 and specificity in AdRoit, an important leverage for its usage in practice. 422 423 The reference single cell data used by AdRoit came from different platforms, such as the 10x 424 Genomics Chromium Instrument (the mouse dorsal root ganglion), and the Fluidigm C1 system 425 (the human pancreatic islets data). AdRoit consistently exhibited excellent performance across 426 all benchmarking datasets independent of their single cell sequencing technology platforms. 427 More importantly, this statement holds not only for deconvoluting the synthesized bulk data, 428 but also for the real bulk RNA-seq data. The latter typically does not apply the unique molecular 429 barcoding and requires a significantly different cDNA amplification procedure from what is used 430 in the single cell RNA-seq (Methods). Besides, the sequencing depth, read mapping and gene 431 expression quantification are dissimilar as well. The fact that AdRoit accurately dissected the 432 cell compositions in the real bulk samples based on the single cell reference data further 433 supports its cross-platform applicability. 434 435 We attribute the power of AdRoit to its comprehensive modeling of relevant factors. Firstly, we 436 think a common rescaling factor is not sufficient to correct the platform difference between 437 single cells and the compound data. Rather, the impact of platform difference to genes is quite 438 different and hardly is linearly scaled. Correcting such differences entails rescaling factors 439 specifically tailored to each gene. AdRoit uses an adaptive learning approach to estimate such 440 gene-wise correcting factor and does the correction in a unified model. In addition, the 441 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 21 contribution of a gene in a cell type to the loss function is jointly weighted by its specificity and 442 variability in a cell type, where specificity and variability are defined in a way accounting for the 443 overdispersion property of counts data. Our observations over the multiple benchmarking 444 dataset also show that the coexistence of similar cell types may have induced a collinearity 445 condition that negatively impacted the regression-based methods developed by others. Being 446 able to alleviate this problem gives AdRoit an edge to outperform. All these factors help AdRoit 447 to distinguish similar cell clusters while sensitive enough to separate rare cell types. 448 449 Technically, the input profiles of individual cell types to AdRoit does not necessarily come from 450 the single cell RNA-seq. Bulk RNA-seq profiles of individual isolated cell types can be used as 451 well. Nevertheless, using single cell RNA-seq data as the reference has a few key advantages. It 452 is a high throughput approach wherein multiple cell types can be interrogated simultaneously. 453 Prior knowledge of the cell types in presence as well as their specific gene markers are not 454 required, which allows novel cell types to be identified. Although detection of lowly expressing 455 genes has been a challenge for the single cell RNA-seq, significant enhancements have been 456 demonstrated. For example, the number of detectable genes currently can reach an order of 457 10,000 per cell and keeps improving50. As AdRoit focuses on the informative genes whose 458 expressions are generally high, the detection limit of the single cell RNA-seq does not impose a 459 significant drawback. Indeed, given the single cell reference profiles, AdRoit successfully 460 deconvoluted the real bulk RNA-seq data and spatial transcriptomics data. The results suggest 461 that, besides enriching our understanding of the bulk transcriptome data, AdRoit can leverage 462 the usage of the vast amount and continuously growing single cell data as well. 463 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 22 464 AdRoit is a reference-based deconvolution algorithm. A comprehensive collection of the 465 possible cell components is important. However, completeness may not always be guaranteed. 466 Even with the single cell acquisition that is independent of prior knowledge, rare and/or fragile 467 cell types may not survive through the capture procedure and hence are excluded. It is also 468 difficult to generate a solid reference profile for cells that are versatile from sample to sample 469 (e.g., tumor cells). Currently AdRoit deals implicitly with the components unknown to the 470 reference. If an unknown cell type reassembles one of the referenced ones, it may be 471 considered as part of the known cell type and their joint population is predicted. Such an 472 outcome is acceptable as treating two similar cell types as one is still biologically meaningful 473 although the resolution of the system may be compromised. If the unknown component is 474 dissimilar to all the known ones, it will be ignored by AdRoit because its representative markers 475 are unlikely among the top weighted genes associated with the known components. At the 476 same time, the distinct component is expected to have a unique gene expression pattern and 477 thus unlikely interferes significantly with the gene expressions from the known cell types. 478 Therefore, AdRoit essentially deconvolutes the relative populations among the known cell 479 components. For example, AdRoit was able to correctly uncover the populations of 4 endocrine 480 cell types from the human islet bulk data despite the absence of many other cell types such as 481 macrophages, Schwann cells and endothelial cells in the input single cell reference20. Although 482 under such a circumstance, the absolute percentages of the cells remain obscure, we expect 483 their relative proportions can be studied and valuable. A future improvement is to explicitly 484 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 23 model the unknown cell types and estimate their percentages upon the signals in the 485 compound data that cannot be explained by the contribution from the known components. 486 487 Methods 488 Gene selection 489 AdRoit selects genes that contain information about cell type identity, excluding non-490 informative genes that potentially introduce noise. There are two ways for selecting such 491 genes: 1) union of the genes whose expression is enriched in one or more cell types in the 492 single cell UMI count matrix. These genes are referred as marker genes; 2) union of the genes 493 that vary the most across all the cells in the single cell UMI count matrix, referred as the highly 494 variable genes. For marker genes, we recommend selecting top ~200 genes (P value < 0.05), 495 ranked by fold change, from each cell type for resolving complex compound transcriptome 496 data. Considering some genes may mark more than one cell types, we further require selected 497 markers presenting in no more than 5 cell types to ensure specificity. We also suggest select a 498 minimal of 1000 total number unique genes for an accurate estimation. If not satisfied, one 499 may consider expand the number of top genes and/or loose the P value cutoff. 500 501 AdRoit also offer the option to use highly variable genes. To avoid the selected highly variable 502 genes being dominated by large cell clusters whilst underrepresents small clusters, AdRoit first 503 balances the cell types in the single cell UMI count matrix by finding the median size among all 504 cell clusters, then sample cells from each cluster to make them equal to this size. Next, AdRoit 505 computes the variance of each gene across the cells in the balanced single cell UMI matrix. Due 506 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 24 to the well-known dispersion effect in RNA-seq data, directly computing variances from count 507 matrix can results in overestimation. We thus compute variances on the normalized data done 508 by variance-stabilizing transformation (VST)51. Genes with top 2000 large variances are then 509 selected. 510 511 In both ways, mitochondria genes were excluded as their expression do not have information of 512 cell identity. The results shown in current paper were based the marker genes as described 513 above. But we also demonstrated that using the balanced highly variable genes yields 514 comparably accurate estimations (Supplementary Fig. 6). 515 516 Estimate gene mean and dispersion per cell type 517 Modeling single cell RNA-seq data is challenging due to the cellular heterogeneity, technical 518 sensitivity, and noise. While the expression of some genes can be not detected by chance, other 519 genes may be found to be highly dispersed. These factors can lead to excessive variability even 520 within the same cell type. AdRoit combats high noise and computational complexity by building 521 models with estimated mean and dispersion per cell type. This strategy reduced the data 522 complexity while preserve the cell type specific information. 523 524 Although typical analyses of RNA-seq data starts with normalization, Adroit does not do 525 normalization prior to the mean estimation. Performing a normalization across all cell types 526 forces every cell type to have the same amount of RNA transcripts, measured by the total 527 unique molecular identifier (UMI) counts per cell. However, different cell types can have 528 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 25 dramatically different amounts of transcripts. For example, the amount of RNA transcripts in 529 neuronal cells is about 10 times fold of that in glial cells. Thus, normalization can falsely alter 530 the relative abundance of cell types, misleading the estimation of cell type percentages. To 531 avoid this problem, AdRoit models the means using the raw UMI counts. 532 533 Studies have shown that UMI counts follows negative binomial distribution52,53, we therefore fit 534 negative binomial distributions to single cells of each cell type and build the model based on 535 the estimated means and dispersions from the selected genes. More specifically, let 𝑋!"be the 536 set of single cell UMI counts of gene i ∈ 1,..,I for all cells in cell type k ∈ 1,…,K. I is the number 537 of selected genes, and K denotes number of cell types in the single cell reference. The 538 distribution of 𝑋!"follows negative binomial distribution, 539 𝑋!" ∼ 𝑁𝐵(𝜆!",𝑝!"), (1) 540 where 𝜆!" is the dispersion parameter of the gene i in cell type k, and 𝑝!" is the success 541 probability, i.e., the probability of gene i in cell type k getting one UMI. The two parameters are 542 estimated by MLE. The likelihood function is 543 𝐿𝐻(𝜆!",𝑝!"|𝑋!") = ∏ 𝑓(𝑋!"|𝜆!",𝑝!") #! !$% , (2) 544 where 𝑛" is the number of cells in cell type k, and f is the probability mass function of negative 545 binomial distribution. The MLE estimates are then given by 546 (𝜆&"2 ,𝑝&")2 = 𝑎𝑟𝑔max '"!,)"! 𝐿𝐻(𝜆!",𝑝!"|𝑋!"). (3) 547 Once success probability and dispersion are estimated, the mean estimates can be computed 548 numerically according to the property of negative binomial distribution, 549 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 26 𝜇!" = '#!* ∙)#!, %-)#!, , (4) 550 𝜎!" . = '#! * ∙)#!, (%-)#!, )$ . (5) 551 Estimation using MLE has been readily coded in many R packages. We choose ‘fitdist’ function 552 from ‘fitdistrplus’ package54 for its fast computation speed and flexibility in selecting 553 distributions. Estimations are done for each selected gene in each cell type, resulting in a 𝐼 × 𝐾 554 matrix of cell type means. 555 556 Cell type specificity of genes 557 Genes with cell-type specific expression patterns better represent cell types, thus are more 558 important when be used for resolving cell type composition. In line with this property, AdRoit 559 weights genes with high specificity more than less specific ones. Highly specific genes usually 560 have consistently high expression and thus relatively low variance among cells within a cell 561 type. To compute cell type specificity of a gene, we first identify the cell type in which the gene 562 has the highest expression (i.e., most specifically expressed cell type), then defines the 563 specificity of this gene as the mean-to-variance ratio within the cell type. A high ratio renders 564 high weight to the gene in the model. We use the estimated means and variances from 565 negative binomial fitting (𝜇!" and 𝜎!" . in eq. 4 and 5). Let 𝑘1 be the index of cell type that has the 566 highest mean expression of gene i, 567 𝑘1 = 𝑎𝑟𝑔max " {𝜇!"| 𝑘 𝜖 1…𝐾}, (6) 568 then the cell type specificity weight for gene i, denoting 𝑤! 2, is given by, 569 𝑤! 2 = 3"!% 4"!% $ , (7) 570 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 27 and it is computed for each gene in the set of selected genes. 571 572 Cross-sample gene variability 573 The variability of a gene contrasts how much stable a gene is across samples. The idea of 574 weighting genes based on variability across samples is first explored by Wang et al22, where 575 variability was defined as the cross-sample variance. By weighting down the high variability 576 genes, the authors achieved a great advantage over the traditional unweighted method. Genes 577 with low cross-sample variability better represent the population, hence are more trust-worthy 578 to be used to learn the cell composition. AdRoit incorporates the same notion to weight the 579 importance of genes, however, defines the variability in a more sophisticated way. Similar as 580 we define the cell type specificity, AdRoit utilizes mean and variance, and computes variance-581 to-mean ratio (VMR) to stand for cross-sample gene variability. But here the mean and variance 582 are computed across samples. The VMR is better scaled than the simple variance, and it can 583 avoid underweighting genes that has low expression, while circumvent overweighting genes 584 hugely dispersed. 585 586 In addition, AdRoit extends the method to fit the case where multiple samples are not 587 available. We proposed three ways to compute the VMR, depending on whether multi-sample 588 data is available. Typically, the compound transcriptome data to be deconvolved have multiple 589 samples. In bulk RNA-seq data, multiple samples are usually included to control for biological 590 variability. In spatial transcriptome data, the spatial dots can be seen as multiple samples. 591 Therefore, we first consider computing the cross-sample gene variability from compound 592 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 28 transcriptome data. In case multi-sample for compound data is not available, AdRoit utilizes the 593 single cell reference, and synthesizes compound samples by pooling all cells belonging to the 594 same sample. If multi-sample is not available for both data, AdRoit subsample single cells and 595 pool them to make pseudo samples. Let 𝑌!5 denote the counts of sequences for gene i in 596 sample j ∈ 1,…,J, then 597 𝑌!5 ∼ 𝑁𝐵(𝜆!5,𝑝!5), (8) 598 where 𝜆!5 is the dispersion parameter of the gene i in sample j, and 𝑝!5 is the success 599 probability. Again, we use MLE to get the estimates 𝜆&62and 𝑝&6G, following which cross-sample 600 mean and variance can be numerically computed: 601 𝜇! 2 = '#&* ∙)#&, %-)#&, , (9) 602 (𝜎! .)2 = '#&* ∙)#&, 7%-)#&, 8 $, (10) 603 and cross-sample variability for gene i is then defined as 604 𝑉𝑀𝑅! = (4" $)' 3" ' = % 9" (, (11) 605 where 𝑤! : is later used in the model. The cross-sample variability weight is computed for each 606 gene in the set of selected genes. 607 608 Gene-wise scaling factor to correct platform bias 609 When linking the compound data to the single cell data, rescaling factor is often used to 610 account for the library size and platform difference. The existing methods adopt a single 611 rescaling factor for each unit of sample, i.e., all genes of a single sample are multiplied by the 612 same factor22,23. This operation is based on a strong assumption that the impact of platform 613 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 29 difference to every gene is the same and linearly scaled among different cell types, which is 614 hardly true. In addition, because estimates can be easily affected by outliers in linear model, 615 estimation of cell proportions can be steered away from the truth by extremely high expression 616 genes. Therefore, applying a uniform scaling factor to all gene is inappropriate. 617 618 To overcome this problem, AdRoit instead estimates gene-wise scaling factors via an adaptive 619 learning strategy and rescales each gene with its respective scaling factor. To proceed, we first 620 input the mean gene expression from the compound samples (𝜇! 2in eq. 9) and the estimated 621 means of each cell type from the single cell data (𝜇!" in eq. 4), then apply a traditional non-622 negative least square regression (NNLS) to get a rough estimation of the proportions of each 623 cell type, denoting 𝜏". For each gene, a predicted mean expression (∑ 𝜏"G;" 𝜇!" in eq. 13) is 624 computed as the weighted sum of the means of each cell type wherein the weights are the 625 roughly estimated proportions. The regression equation is given by, 626 𝜇! 2 = 𝐴 ∙ (∑ 𝜏";" 𝜇!" + 𝜀), 0 < 𝜏", ∑ 𝜏" ; " = 1 (12) 627 where A is a constant to ensure 𝜏"’s sum to 1 and 𝜀 is the error term. We use ‘nnls’ function in 628 the ‘nnls’ package55 to estimate 𝜏"’s. Next, we calculate the ratio between the mean expression 629 from compound samples and the predicted means, and define the gene-wise rescaling factor as 630 the logarithm of the ratio plus 1, 631 𝑟! = log ( 3" ) ∑ =!, * ! 3"! + 1). (13) 632 Given the dispersion property of count data, the logarithm of the ratio is a more appropriate 633 statistic as it results in relatively stable scaling factors. The addition of 1 avoids taking logarithm 634 on zero. By multiplying the flexible gene-wise rescaling factor, the “outlier” genes will be 635 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 30 pushed toward the truth regression line direction, while the genes around the true regression 636 lines are less affected (Fig. 1b). 637 638 Weighted and regularized model 639 We next designed a model that incorporates all these factors to do the actual estimation of cell 640 type proportions. AdRoit builds upon non-negative least square regression model. It gives high 641 weights to the genes with high cell type specificity and low cross-sample variability. This was 642 done by optimizing a weighted sum of squared loss function L, where the weights consist of 643 two components (𝑤! : in eq. 7, 𝑤! 2 in eq. 11). The gene-wise scaling factor tailored for each gene 644 effectively corrects the bias due to technology difference between compound sample and 645 single cell data (𝑟!in eq 13). In cases of complex tissues (e.g., neural tissues) where many highly 646 similar subtypes are common, closely related subtypes can have strong collinearity, leading to 647 overestimation of some cell types whilst underestimate or miss some others. AdRoit handles 648 this problem by including a L2 norm of the estimates as the regularization component. Denote 649 𝛽" as the unscaled coefficient for cell type k. For a compound transcriptome sample j, the loss 650 function is given by, 651 𝐿5(𝛽%,…,𝛽;|𝑦!5,𝑤! :,𝑤! 2,𝑟!,𝜇&"G) = ∑ 𝑤! : ∙ 𝑤! 2 ∙ (𝑦!5 − 𝑟! ∙ ∑ 𝛽"𝜇&"G;" ). > ! + ∑ 𝛽" .; " . (14) 652 Then the coefficient 𝛽" can be estimated by minimizing the loss function with the constraint 653 𝛽%,…,𝛽; > 0, 654 𝛽%2,…,𝛽;2 = argmax ?+,…,?* ?+,…,?*AB𝐿5. (15) 655 The estimation is done by a gradient projection method by Byrd et al56. We derive the gradient 656 function by taking partial derivative of the loss function with w.r.t. 𝛽", 657 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 31 𝐺" = ∇?!𝐿5 = −2∑ 𝑟! ∙ 𝜇&"G ∙ 𝑤! : ∙ 𝑤! 2 ∙ ^𝑦!5 − 𝑟! ∙ ∑ 𝛽"𝜇&"G;" _ + > ! 2𝛽". (16) 658 AdRoit uses the function ‘optim’ from the R package ‘stats’ to do the estimation57, providing the 659 loss function (eq. 15) and the gradient (eq. 16). To get the final estimates of cell type 660 proportions, we rescale the coefficients 𝛽"’s to ensure a summation of 1, 661 𝜃" = ?!* ∑ ?!* * ! . (17) 662 Each compound sample j is independently estimated by the model described above. 663 664 Simulation of bulk RNA-seq and spatial transcriptomics data 665 Bulk RNA-seq data used for benchmarking are synthesized by adding up the raw UMI reads per 666 gene from all single cells of a sample regardless of cell types. Denote 𝑡" as a cell in cell type k, 667 and 𝑡" ∈ 1, …, 𝑇", where 𝑇" is the number of cells in cell type k. Let 𝑌!5 D be the read count of 668 gene i in a synthesized bulk sample j, and 𝑋!5E! be the UMI count of the gene, then 669 𝑌!5 D = ∑ ∑ 𝑋!5E! F! E! ; " . 670 The true proportion of cell type k is given by, 671 𝜃" B = F! ∑ F! * ! . 672 673 To simulate spatial transcriptomic spots, we first sample 10 cells without replacement from 674 each cell type and added them up, then mix them with designed proportions. For example, to 675 simulate a spot with 𝑝" percent of cell type k, the read count 𝑌!5 G of gene i in a spatial spot j is 676 given by, 677 𝑌!5 G = ∑ 𝑝";" ∑ 𝑋!"#%B#$% , 678 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 32 where 𝑋!"G is UMI count of gene i in a sampled cell n of cell type k. For each mixing scheme, the 679 simulation is repeated 100 times. 680 681 Evaluation statistics 682 We compared the estimated cell type proportions with the ground truth by calculating 4 683 statistics. The mAD and RMSD are given by, 684 𝑚𝐴𝐷 = ∑ HI!-I! , H*! ; , 685 𝑅𝑀𝑆𝐷 = ∑ 7I!-I! ,8 $* ! ; . 686 Pearson correlation coefficient is computed as, 687 𝜌) = ∑ 7I!-I!JJJJ8KI! ,-I! ,JJJJL*! M∑ 7I!-I!JJJJ8 * ! $M∑ KI! ,-I! ,JJJJL $* ! , 688 where 𝜃"ggg and 𝜃" Bggg are means of the estimated proportions and true proportions, respectively. 689 Spearman correlation coefficient is given by, 690 𝜌G = ∑ (N!-N!JJJJ)KN! ,-N! ,JJJJL*! M∑ (N!-N!JJJJ) * ! $M∑ KN! ,-N! ,JJJJL $* ! , 691 where 𝑟"is the rank of 𝜃". 692 693 Single cell RNA sequencing of mouse dorsal root ganglion 694 As described previously58, lumbar DRGs were isolated from adult C57BL/6 mice and transferred 695 to a dissociation buffer (Dulbecco's modified Eagle's medium supplemented with 10% heat-696 inactivated Fetal Calf Serum) (Gibco; cat # A38400-02). To generate a single cell suspension, 697 DRGs were subjected to a 2 step-enzymatic dissociation followed by a mechanical dissociation. 698 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 33 In brief, DRGs were first incubated with 0.125% collagenase P from Clostridium histolyticum 699 (Roche Applied Science; cat # 11249002001) for 90 minutes in an Eppendorf Thermomixer C 700 (37°C; intermittent 750 rpm shaking for about 10 sec every 2 minutes). Then, DRGs were 701 transferred to a Hank's Balanced Salt Solution (HBSS, Mg2+ and Ca2+ free; Invitrogen) 702 supplemented with 0.25% Trypsin (Worthington biochemical corp.; cat # LSoo3707) and 703 0.0025% EDTA and incubated for 10 minutes at 37°C in the Eppendorf Thermomixer C. Trypsin 704 was neutralized by the addition of 2.5 mg/ml MgSO4 (Sigma; cat #M-3937) and DRGs were 705 triturated with Pasteur pipettes. The resulting cell suspension was passed through a 70 µm 706 mesh filter to remove remaining chunks of tissues and centrifuged for 5 minutes at 2500 rpm at 707 room temperature. The pellet was resuspended in HBSS (Ca2+, Mg2+ free; Invitrogen) and the 708 cell suspension was run on a 30% Percoll Plus gradient (Sigma GE17-5445-02) to further remove 709 debris. Finally, cells were resuspended in PBS supplemented with 0.04% BSA at a concentration 710 of 200 cells/µl and cell viability was determined using the automated cell analyzer 711 NucleoCounter® NC-250™. The suspended single cells were loaded on a Chromium Single Cell 712 Instrument (10X Genomics) with about 6000 cells per lane to minimize the presence of 713 doublets. 2000-3000 cells per lane were recovered. RNA-seq libraries were constructed using 714 Chromium Single Cell 3’ Library, Gel Beads & Multiplex Kit (10X Genomics). Single end 715 sequencing was performed on Illumina NextSeq500. Read 1 starts with a 26-bp UMI and cell 716 barcode, followed by an 8-bp i7 sample index. Read 2 contains a 55-bp transcript read. Sample 717 de-multiplexing, alignment, filtering, and UMI counting were conducted using Cell Ranger 718 Single-Cell Software Suite59 (10X Genomics, v2.0.0). Mouse mm10 Genome assembly and UCSC 719 gene model were used for the alignment. 720 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 34 721 Data preprocessing 722 DRG single cell data 723 The UMI data output from Cell Ranger Single-Cell Software Suite (10X Genomics, v2.0.0) was 724 analyzed using Seurat package60 to assess the cell quality and identify cell types, similar to what 725 described previously39. Cells with the number of detected genes less than 500 or over 15000, or 726 with a UMI ratio of mitochondria encoded genes versus all genes over 0.1 were also removed. 727 The UMI data was normalized by the ‘NormalizeData’ method in Seurat with default settings. 728 To avoid potential sample-to-sample variation caused by technical variation at various 729 experiment steps, we employed Seurat data integration method. The top 2000 variable genes 730 of each of the 5 samples were identified using ‘FindVariableFeatures’ with 731 selection.method=‘vst’. Based on the union of these variable genes, the anchor cells in each 732 sample were identified by ‘FindIntegrationAnchors’. All the samples were then integrated by 733 ‘IntegrateData’. We subsequently scaled the integrated data (‘ScaleData’) and performed 734 dimension reduction (‘RunPCA’). Cells were then clustered based on the first 15 principal 735 components by applying ‘FindNeighbors’ and ‘FindClusters’ (resolution=0.6, algorithm=1). 736 Marker genes for each cluster were identified using ‘FindAllMarkers’. Parameters were used 737 such that these genes were expressed in at least 25% of the cells in the cluster, and on average 738 2-fold higher than the rest of cells with a multiple-testing adjusted Wilcoxon test p value of less 739 than 0.01. The specificity of the canonical cell type-specific genes or cell cluster-specific genes 740 were further examined by visualizations (Extended Data Fig. 2) and used to define the cell type 741 for each cluster. At the end, the original UMI data from 17271 genes and 3352 cells that passed 742 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 35 the quality control were organized into a matrix (genes as rows and cell identifiers as columns). 743 This matrix, together with the cell type label for each cell therein, were loaded into AdRoit as 744 reference profiles. 745 746 Mouse brain single cell data 747 The scRNA-seq reference data of the mouse brain were obtained from Zeisel et. al32. Among all 748 the available data, we only retained 96,572 cells that were acquired from the brain regions, had 749 an assigned cell type by the authors and a minimal total UMI of 1000. These cells corresponded 750 to 183 clusters at the finest taxonomy level in the original study. As many of the clusters are 751 highly similar, we decided to merge some of them to simplify the reference landscape. First, the 752 top 50 cluster enriched markers were derived using Scanpy61 via the ‘rank_genes_groups’ 753 function (method=‘wilcoxon’), following the normalization (‘normalize_per_cell’), log 754 transformation (‘log1p’) and regressing out (‘regress_out’) the variances associated with the 755 total UMI and the percentage of mitochondrial chromosome encoded genes per cell. Then, the 756 pair-wise overlapping p-values among the clusters were calculated using the top 50 marker 757 genes assuming the hypergeometric null distribution. Last, clusters with overlapping p-values 758 more significant than 1e-10 were merged and new names were assigned by combinedly 759 considering the original annotation, the molecular features and the specificity to certain brain 760 regions. A total of 46 cell types were determined that cover all the 12 brain regions and their 761 important substructures37 (Supplementary Table 13). To make the reference dataset more 762 manageable in size and more balanced in the representation of cell types, we down sampled 763 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 36 each cluster to no more than 360 cells. A final set of 14,666 cells over 46 cell types were used 764 for the deconvolution of the mouse brain spatial transcriptome data. 765 766 Human Islets 767 We used the 1492 high quality human islets single cell and annotation from Xin et al38. The 768 RPKM expression table was directly downloaded and used as is. The RNA-FISH data was also 769 from this study38. For the real bulk human pancreatic islets data38,40,41, the read counts table 770 were deconvoluted. Only data from donors with HbA1C level available were included in the 771 regression of Beta cell proportion on HbA1C level (Fig. 4c, Supplementary Table 10). 772 773 Trabecular Meshwork 774 We downloaded the raw sequence data and followed the same analysis procedure as in Patel et 775 al39 for quality control and cell type identification. 776 777 Mouse Brain Spatial transcriptomics data by 10x Visium platform 778 The filtered cell matrix, tissue image and the spatial coordinates of a coronal section of an adult 779 C57BL/6 mouse brain from the 10x Genomics were available for download and used as is. 780 781 Mouse Brian ISH images 782 The ISH images were directly downloaded from Allen mouse Brain Atlas37 by searching the gene 783 names. THE images were used with further editing except for cropping. 784 785 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 37 Data availability 786 DRG single cell data are deposited at NCBI GEO (accession number: GSE163252) . The bulk RNA-787 seq and RNA-FISH data for human pancreatic islets were initially published as aggregated data 788 where the data processing and experimental procedure were described therein38,40,41. We 789 acquired the individual sample data from the authors and released them along with the current 790 study (Supplementary Table 10 and Supplementary Table 12). The other public data analyzed in 791 this study are available from: GEO (human pancreatic islets single cell data: GSE81608); NCBI 792 (human trabecular meshwork single cell data: PRJNA616025; mouse brain single cell data: 793 SRP135960). Mouse brain spatial transcriptomic data was downloaded from the 10x Genomics 794 website (https://support.10xgenomics.com/spatial-gene-795 expression/datasets/1.1.0/V1_Adult_Mouse_Brain_Coronal_Section). 796 797 Code availability 798 AdRoit’s source code is available on Github (https://github.com/TaoYang-dev/AdRoit). 799 800 Software 801 The statistical analyses were done with R statistical software (v3.6.0)57 and python (v3.7.2)62. 802 The packages used include Seurat (v3.0.1)60, scanpy (v1.6.0)61, dplyr (v0.8.0.1)63, doParallel 803 (v1.0.14)64, data.table (v1.12.4)65, fitdistrplus (v1.1-1)54, nnls (v1.4)55. 804 805 Reference 806 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 38 1. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: A revolutionary tool for transcriptomics. 807 Nature Reviews Genetics (2009) doi:10.1038/nrg2484. 808 2. Chu, G. C., Kimmelman, A. C., Hezel, A. F. & DePinho, R. A. Stromal biology of pancreatic 809 cancer. Journal of Cellular Biochemistry (2007) doi:10.1002/jcb.21209. 810 3. Bussard, K. M., Mutkus, L., Stumpf, K., Gomez-Manzano, C. & Marini, F. C. Tumor-811 associated stromal cells as key contributors to the tumor microenvironment. Breast 812 Cancer Research (2016) doi:10.1186/s13058-016-0740-2. 813 4. Munn, D. H. & Bronte, V. Immune suppressive mechanisms in the tumor 814 microenvironment. Current Opinion in Immunology (2016) 815 doi:10.1016/j.coi.2015.10.009. 816 5. Gonzalez, H., Hagerling, C. & Werb, Z. Roles of the immune system in cancer: From tumor 817 initiation to metastatic progression. Genes and Development (2018) 818 doi:10.1101/GAD.314617.118. 819 6. Garner, H. & de Visser, K. E. Immune crosstalk in cancer progression and metastatic 820 spread: a complex conversation. Nature Reviews Immunology (2020) 821 doi:10.1038/s41577-019-0271-z. 822 7. Singh, U. P. et al. Chemokine and cytokine levels in inflammatory bowel disease patients. 823 Cytokine (2016) doi:10.1016/j.cyto.2015.10.008. 824 8. Van Lint, P. & Libert, C. Chemokine and cytokine processing by matrix metalloproteinases 825 and its effect on leukocyte migration and inflammation. J. Leukoc. Biol. (2007) 826 doi:10.1189/jlb.0607338. 827 9. Zelová, H. & Hošek, J. TNF-α signalling and inflammation: Interactions between old 828 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 39 acquaintances. Inflammation Research (2013) doi:10.1007/s00011-013-0633-0. 829 10. Koelman, L., Pivovarova-Ramich, O., Pfeiffer, A. F. H., Grune, T. & Aleksandrova, K. 830 Cytokines for evaluation of chronic inflammatory status in ageing research: Reliability 831 and phenotypic characterisation. Immun. Ageing (2019) doi:10.1186/s12979-019-0151-1. 832 11. Landskron, G., De La Fuente, M., Thuwajit, P., Thuwajit, C. & Hermoso, M. A. Chronic 833 inflammation and cytokines in the tumor microenvironment. Journal of Immunology 834 Research (2014) doi:10.1155/2014/149185. 835 12. Ståhl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial 836 transcriptomics. Science (2016) doi:10.1126/science.aaf2403. 837 13. Vickovic, S. et al. High-definition spatial transcriptomics for in situ tissue profiling. Nat. 838 Methods (2019) doi:10.1038/s41592-019-0548-y. 839 14. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 840 (2009) doi:10.1038/nmeth.1315. 841 15. Denisenko, E. et al. Systematic assessment of tissue dissociation and storage biases in 842 single-cell and single-nucleus RNA-seq workflows. Genome Biol. (2020) 843 doi:10.1186/s13059-020-02048-6. 844 16. Nguyen, Q. H., Pervolarakis, N., Nee, K. & Kessenbrock, K. Experimental considerations 845 for single-cell RNA sequencing approaches. Frontiers in Cell and Developmental Biology 846 (2018) doi:10.3389/fcell.2018.00108. 847 17. Tanay, A. & Regev, A. Scaling single-cell genomics from phenomenology to mechanism. 848 Nature (2017) doi:10.1038/nature21350. 849 18. Abbas, A. R., Wolslegel, K., Seshasayee, D., Modrusan, Z. & Clark, H. F. Deconvolution of 850 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 40 blood microarray data identifies cellular activation patterns in systemic lupus 851 erythematosus. PLoS One (2009) doi:10.1371/journal.pone.0006098. 852 19. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. 853 Nat. Methods (2015) doi:10.1038/nmeth.3337. 854 20. Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas 855 Reveals Inter- and Intra-cell Population Structure. Cell Syst. (2016) 856 doi:10.1016/j.cels.2016.08.011. 857 21. Tsoucas, D. et al. Accurate estimation of cell-type composition from gene expression 858 data. Nat. Commun. (2019) doi:10.1038/s41467-019-10802-z. 859 22. Wang, X., Park, J., Susztak, K., Zhang, N. R. & Li, M. Bulk tissue cell type deconvolution 860 with multi-subject single-cell expression reference. Nat. Commun. (2019) 861 doi:10.1038/s41467-018-08023-x. 862 23. Andersson, A. et al. Single-cell and spatial transcriptomics enables probabilistic inference 863 of cell type topography. Commun. Biol. 3, 565 (2020). 864 24. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues 865 with digital cytometry. Nat. Biotechnol. (2019) doi:10.1038/s41587-019-0114-2. 866 25. Myung, I. J. Tutorial on maximum likelihood estimation. J. Math. Psychol. (2003) 867 doi:10.1016/S0022-2496(02)00028-7. 868 26. Bassett, R. & Deride, J. Maximum a posteriori estimators as a limit of Bayes estimators. 869 Math. Program. (2019) doi:10.1007/s10107-018-1241-0. 870 27. Zhao, Y. & Simon, R. Gene expression deconvolution in clinical samples. Genome 871 Medicine (2010) doi:10.1186/gm214. 872 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 41 28. Chiu, Y. J., Hsieh, Y. H. & Huang, Y. H. Improved cell composition deconvolution method 873 of bulk gene expression profiles to quantify subsets of immune cells. BMC Med. 874 Genomics (2019) doi:10.1186/s12920-019-0613-5. 875 29. Kang, K. et al. CDSeq: A novel complete deconvolution method for dissecting 876 heterogeneous samples using gene expression data. PLoS Comput. Biol. (2019) 877 doi:10.1371/journal.pcbi.1007510. 878 30. Qiao, W. et al. PERT: A Method for Expression Deconvolution of Human Blood Samples 879 from Varied Microenvironmental and Developmental Conditions. PLoS Comput. Biol. 880 (2012) doi:10.1371/journal.pcbi.1002838. 881 31. Zaitsev, K., Bambouskova, M., Swain, A. & Artyomov, M. N. Complete deconvolution of 882 cellular mixtures based on linearity of transcriptional signatures. Nat. Commun. (2019) 883 doi:10.1038/s41467-019-09990-5. 884 32. Zeisel, A. et al. Molecular Architecture of the Mouse Nervous System. Cell (2018) 885 doi:10.1016/j.cell.2018.06.021. 886 33. Donovan, M. K. R., D’Antonio-Chronowska, A., D’Antonio, M. & Frazer, K. A. Cellular 887 deconvolution of GTEx tissues powers discovery of disease and cell-type associated 888 regulatory variants. Nat. Commun. (2020) doi:10.1038/s41467-020-14561-0. 889 34. Phipson, B., Zappia, L. & Oshlack, A. Gene length and detection bias in single cell RNA 890 sequencing protocols. F1000Research (2017) doi:10.12688/f1000research.11290.1. 891 35. Chen, G., Ning, B. & Shi, T. Single-cell RNA-seq technologies and related computational 892 data analysis. Frontiers in Genetics (2019) doi:10.3389/fgene.2019.00317. 893 36. Chen, D. & Plemmons, R. J. Nonnegativity constraints in numerical analysis. in The Birth 894 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 42 of Numerical Analysis (2009). doi:10.1142/9789812836267_0008. 895 37. Lein, E. S. et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 896 (2007) doi:10.1038/nature05453. 897 38. Xin, Y. et al. RNA Sequencing of Single Human Islet Cells Reveals Type 2 Diabetes Genes. 898 Cell Metab. (2016) doi:10.1016/j.cmet.2016.08.018. 899 39. Patel, G. et al. Molecular taxonomy of human ocular outflow tissues defined by single-900 cell transcriptomics. Proc. Natl. Acad. Sci. 117, 12856 LP – 12867 (2020). 901 40. Xin, Y. et al. Pseudotime ordering of single human B-cells reveals states of insulin 902 production and unfolded protein response. Diabetes (2018) doi:10.2337/db18-0365. 903 41. Gutierrez, G. D. et al. Gene signature of proliferating human pancreatic a cells. 904 Endocrinology (2018) doi:10.1210/en.2018-00469. 905 42. Cerf, M. E. Beta cell dysfunction and insulin resistance. Frontiers in Endocrinology (2013) 906 doi:10.3389/fendo.2013.00037. 907 43. Maedler, K. & Donath, M. Y. Beta-cells in type 2 diabetes: a loss of function and mass. 908 Hormone research (2004). 909 44. Donath, M. Y. et al. Mechanisms of β-cell death in type 2 diabetes. Diabetes (2005) 910 doi:10.2337/diabetes.54.suppl_2.S108. 911 45. Calanna, S. et al. Alpha- and beta-cell abnormalities in haemoglobin A1c-defined 912 prediabetes and type 2 diabetes. Acta Diabetol. (2014) doi:10.1007/s00592-014-0555-5. 913 46. Kanat, M. et al. The Relationship Between β-Cell Function and Glycated Hemoglobin. 914 Diabetes Care 34, 1006 LP – 1010 (2011). 915 47. Nepton, S. Beta-Cell Function and Failure. in Type 1 Diabetes (2013). doi:10.5772/52153. 916 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 43 48. Dolenšek, J., Rupnik, M. S. & Stožer, A. Structural similarities and differences between 917 the human and the mouse pancreas. Islets (2015) doi:10.1080/19382014.2015.1024405. 918 49. Lein, E. S. et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature 919 445, 168–176 (2007). 920 50. Vieth, B., Parekh, S., Ziegenhain, C., Enard, W. & Hellmann, I. A systematic evaluation of 921 single cell RNA-seq analysis pipelines. Nat. Commun. (2019) doi:10.1038/s41467-019-922 12266-7. 923 51. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome 924 Biol. (2010) doi:10.1186/gb-2010-11-10-r106. 925 52. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-926 seq data using regularized negative binomial regression. Genome Biol. (2019) 927 doi:10.1186/s13059-019-1874-1. 928 53. Svensson, V. Droplet scRNA-seq is not zero-inflated. Nature Biotechnology (2020) 929 doi:10.1038/s41587-019-0379-5. 930 54. Delignette-Muller, M. L. & Dutang, C. fitdistrplus: An R package for fitting distributions. J. 931 Stat. Softw. (2015) doi:10.18637/jss.v064.i04. 932 55. Mullen, Katharine M., I. H. M. van S. nnls: The Lawson-Hanson algorithm for non-933 negative least squares (NNLS). R Packag. version 1.4 (2012). 934 56. Byrd, R. H., Lu, P., Nocedal, J. & Zhu, C. A Limited Memory Algorithm for Bound 935 Constrained Optimization. SIAM J. Sci. Comput. (1995) doi:10.1137/0916069. 936 57. The R Core Team. R: A Language and Environment for Statistical Computing. R 937 Foundation for Statistical Computing (2019). 938 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 44 58. Alessandri-Haber, N. et al. Hypotonicity induces TRPV4-mediated nociception in rat. 939 Neuron (2003) doi:10.1016/S0896-6273(03)00462-8. 940 59. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. 941 Commun. (2017) doi:10.1038/ncomms14049. 942 60. Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell (2019) 943 doi:10.1016/j.cell.2019.05.031. 944 61. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: Large-scale single-cell gene expression data 945 analysis. Genome Biol. (2018) doi:10.1186/s13059-017-1382-0. 946 62. van Rossum, G. & Drake, F. L. Python 3 Reference Manual. Scotts Valley, CA (2009). 947 63. Wickham, H. & Francois, R. dplyr: A Grammar of Data Manipulation. R Packag. version 948 0.4.2. (2015). 949 64. Weston, S., Calaway, R. & Tenenbaum, D. doParallel: Foreach Parallel Adaptor for the 950 Parallel Package. Cran (2014). 951 65. Dowle, M. & Srinivasan, A. data.table: Extension of ‘data.frame’. R Package Version 952 1.12.8. Manual (2019). 953 954 Acknowledgements 955 We thank Yurong Xin for pointing us to the relevant public data resource. We also thank Gabor 956 Halasz and Yuan Zhu for the advice to algorithm design. 957 958 Author contributions 959 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 45 T.Y., Y.B., W.F., N.A.-H., M.L.-F., L.E.M. and G.S.A. designed the research. T.Y., Y.B., and W.F. 960 developed the algorithm. T.Y., Y.B., W.F. and J.K. participated in the data analyzing. M.S. and 961 R.B. performed the DRG tissue collection. C.A. performed the single cell library preparation and 962 sequencing experiment. T.Y., Y.B., N.A.-H. and G.S.A. wrote the manuscript. 963 964 Competing interests 965 T.Y., Y.B., W.F. and G.S.A. have filed a patent application relating to the AdRoit computational 966 framework. M.L.-F. is an employee of Cellular Longevity. All other authors are employees and 967 shareholders of Regeneron Pharmaceuticals, although the manuscript’s subject matter does not 968 have any relationship to any products or services of this corporation. 969 970 Figure legends 971 Fig. 1: Schematic representation of AdRoit computational framework. a, AdRoit inputs bulk or 972 spatial RNA-seq data, single cell RNA-seq data and cell type annotations. It first selects 973 informative genes and estimates their means and dispersions, based on which the cell type 974 specificity of genes is computed. Depending on multi-sample availability, cross-sample gene 975 variability is estimated from compound data, or single cell samples (dashed arrow). Lastly the 976 gene-wise scaling factors are estimated using both compound data and single cell data. These 977 computed quantities are fed to a weighted regularized model to infer the transcriptome 978 composition. b, A mock example to illustrate the role of gene-wise scaling factor. Ideally, an 979 accurate estimation of slop (i.e., cell proportion) would be the slope of the green line, however 980 direct fitting would result in the red line due to the impact of the outlier genes. Outlier genes 981 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 46 can be induced due to platform difference affecting genes differently. AdRoit adopts an 982 adaptive learning approach that first learns a rough estimation of the slop (red line), then 983 moves the outlier genes toward it such that the more deviated genes will be moved more 984 toward the true line (i.e., longer arrows). After the adjustment, the new estimated slop (blue 985 line) is closer to the truth (green line), thus is a more accurate estimation. 986 987 Fig.2: Benchmark on simulated bulk data synthesized from trabecular meshwork (TM) single 988 cells data. a, AdRoit has the closest estimation to the true cell proportion comparing to MuSiC 989 and NNLS. Each dot is a cell type from one donor. b, For each cell type in TM, AdRoit has the 990 smallest differences from the true cell type proportion and the smallest variance of estimates 991 across the 8 donors. For each cell type, a dot on the graph denotes a donor, and the bars 992 represent the 1.5 × interquartile ranges. Estimation was done by using the single cell as 993 reference leaving out the donor used for synthesizing bulk. c, AdRoit’s estimates are more 994 accurate and specific than MuSiC’s estimates on synthetic bulk that contains partial cell types. 995 The synthetic bulk was simulated by using only 6 out of the 12 cell types per donor, then 996 estimated with the reference of 12 cell types. AdRoit has notably fewer false positive estimates 997 of the 6 cell types not included, and more accurate estimation of the 6 cell types used for 998 synthesizing bulk. d, Receiver operating characteristic (ROC) curve shows AdRoit has a 999 significantly higher AUC than MuSiC (0.95 vs 0.74), meaning better sensitivity and specificity. 1000 1001 Fig. 3: Benchmark on scRNA-seq data from dorsal root ganglion (DRG) where these exist many 1002 closely related subtypes of neuronal cells. a, 14 cell types were identified from scRNA-seq 1003 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 47 samples of 5 mice, including multiple subtypes of neurofilaments (NF), peptidergic (PEP) and 1004 non-peptidergic (NP) neurons. b, Benchmarking with the synthetic data shows AdRoit’s 1005 estimation of cell type proportions are highly accurate. In particular, AdRoit achieves 1006 reasonably high accuracy when the cells are rare (e.g., < 5%). Each dot represents a cell type 1007 from one sample. c, For each individual sample, mAD, RMSD, Pearson and Spearman 1008 correlations were computed and compared across three methods. AdRoit has the lowest mAD 1009 and RMSD, and highest Pearson and Spearman correlations. In addition, AdRoit’s estimation is 1010 also the most stable across samples. Each dot on the boxplot is a sample. Estimation was done 1011 by using the single cell reference leaving out the sample used for synthesizing bulk. 1012 1013 Fig. 4: AdRoit is more accurate and sensitive than Stereoscope on spatial spots simulated 1014 from real DRG cells. a, AdRoit and Stereoscope estimations on simulated spatial spots that 1015 contains 5 PEP neuron subtypes. True mixing proportions were denoted by the red dashed 1016 lines. Three schemes were simulated: 1) the proportions of 5 PEP cell types are the same and 1017 equal to 0.2; 2) PEP1_Dcn is 0.1 and the other 4 are 0.225; 3) PEP1_Dcn and 1018 PEP1_S100a11.Tagln2 are 0.1, PEP1_Slc7a3.Sstr2 and PEP2_Htr3a.Sema5a 0.2 are 0.2, and 1019 PEP3_Trpm8 is 0.4. In all simulation schemes, AdRoit’s estimates are more consistently 1020 centered around the true proportions than Stereoscope’s estimates. b, AdRoit is more accurate 1021 in estimating rare cells in spatial spots. The spots were simulated by simulating mixtures of 3 1022 PEP cell types (i.e., PEP1_Slc7a3.Sstr2, PEP2_Htr3a.Sema5a and PEP3_Trpm8), with a series of 1023 low percent of PEP3_Trpm8 cell type from 1% to 10% and the other two cell types sharing the 1024 rest proportion equally. AdRoit’s estimates are systematically closer to the true simulated 1025 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 48 proportions than Stereoscope’s estimates. c, AdRoit is consistently more sensitive than 1026 Stereoscope in detecting low percent cells (estimates > 0.5% deemed as detected) in simulated 1027 spots of 1) low percent of NF_Calb1 mixed with NF_Pvalb and NF2_Ntrk2.Necab2, 2) low 1028 percent of NP_Mrgpra3 mixed with NP_Mrgprd and NP_Nts, 3) low percent of PEP3_Trpm8 1029 mixed with PEP1_Slc7a3.Sstr2 and PEP2_Htr3a.Sema5a, 4) low percent of NF_Calb1 mixed with 1030 Th, satellite glia and endothelial, 5) low percent of NP_Mrgpra3 mixed with Th, satellite glia and 1031 endothelial, and 6) low percent of PEP_Trpm8 mixed with Th, satellite glia and endothelial. 1032 1033 Fig. 5: Applications to real bulk human islets RNA-seq data and mouse brain spatial 1034 transcriptome data. a, AdRoit’s estimates on real human Islets bulk RNA-seq data were highly 1035 reproducible for the repeated samples from same donor. b, AdRoit estimated cell type 1036 proportions agreed with the RNA-FISH measurements. c, AdRoit estimated Beta cell 1037 proportions in type 2 diabetes patients are significantly lower than that in healthy subjects. In 1038 addition, the estimated proportions have a significant negative linear association with donors’ 1039 HbA1C level. d, The spatial mapping of 4 mouse brain cell types is consistent with the ISH 1040 images of 4 marker genes from Allen mouse brain atlas37 respectively. The 4 genes, Spink8 1041 (marker of hippocampal field CA1), C1ql2 (marker of Dentate Gyrus), Clic6 (marker of Choroid 1042 Plexus), Synpo2 (marker of Thalamus) were identified as markers of corresponding tissues by 1043 Zeisel et al32. 1044 1045 Extended Data Fig. 1: Benchmark three methods on human pancreatic islets data. a, Human 1046 islets single cell data contains 4 cell types from 18 subjects including two major cell types Alpha 1047 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 49 and Beta cells, and two minor cells PP and Delta cells38. The cell proportion varies across 1048 different subjects. b, c, AdRoit achieves leading accuracy when applied to the bulk data 1049 synthesized from the single cell data. Each dot on scatterplot is a cell type from one subject. 1050 Estimation was done by using the single cell reference leaving out the subject used to 1051 synthesize bulk. 1052 1053 Extended Data Fig. 2: Dorsal root ganglion single cell shows 14 cell types including 3 subtypes 1054 of neurofilament, 3 subtypes of non-peptidergic neurons, and 5 subtypes of peptidergic 1055 neurons. a, Heatmap of top markers shows distinction between cell types as well as similarity 1056 between subtypes. b, The proportion of each cell type varies from 0.5% to 33.71% across 1057 different samples. 1058 1059 Extended Data Fig. 3: Comparing the performance on estimated simulated spatial spots of 14 1060 pure cell type respectively. a, Estimates by AdRoit and b, estimates by Stereoscope are 1061 comparably accurate. Simulations were done by sampling cells from the same cell type and 1062 adding up the read counts per gene. For each of the 14 cell types of the DRG tissue, we 1063 repeated the simulation 100 times. The results shown were a summary of 100 simulations for 1064 each cell type. For both methods, the median estimates of the sampled cell type were close to 1065 1 (red lines), whereas the cell type not sampled has zero or close-to-zero values. 1066 1067 Extended Data Fig. 4: The comparison of AdRoit and Stereoscope on the simulated spots of 1068 additional cell mixing schemes. 5 more types of mixed spatial spots were simulated: 1) mixture 1069 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 50 of 3 neurofilaments (NF); 2) mixture of 3 non-peptidergic (NP) cell types; 3) NF2_Ntrk2.Necab2 1070 mixing with Th, satellite glia and endothelial; 4) NP_Nts mixing with Th, satellite glia and 1071 endothelial; and 5) PEP3_Trpm8 mixing with Th, satellite glia and endothelial. Each simulation 1072 was repeated 100 times. Consistently for all simulation schemes, AdRoit’s estimates were 1073 always closer to the true simulated proportions (red lines), whereas Stereoscope’s estimates 1074 largely deviated from the true proportions. 1075 1076 Extended Data Fig. 5: Spatial mapping of 46 cell types with AdRoit quantitative depicts the 1077 content in each spot. Spatial transcriptomics data was downloaded from 10x genomics 1078 (https://support.10xgenomics.com/spatial-gene-1079 expression/datasets/1.1.0/V1_Adult_Mouse_Brain_Coronal_Section). The reference single cells 1080 were sampled from Zeisel et al32 and curated into 46 cell types. 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 51 Figures 1091 Fig. 1 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 52 Fig. 2 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 53 Fig. 3 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 54 Fig. 4 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 55 Fig. 5 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 56 Extended Data Fig. 1 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 57 Extended Data Fig. 2 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 58 Extended Data Fig. 3 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 59 Extended Data Fig. 4 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697 60 Extended Data Fig. 5 1212 1213 1214 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 4, 2021. ; https://doi.org/10.1101/2020.12.14.422697doi: bioRxiv preprint https://doi.org/10.1101/2020.12.14.422697