A read count-based method to detect multiplets and their cellular origins from snATAC-seq data 1 A read count-based method to detect multiplets and their cellular origins from 1 snATAC-seq data 2 Asa Thibodeau1*, Alper Eroglu1*, Nathan Lawlor1, Djamel Nehar-Belaid1, Romy Kursawe1, Radu Marches1, 3 George A. Kuchel2, Jacques Banchereau1, Michael L. Stitzel1,3,4, A. Ercument Cicek5,6, Duygu Ucar1,3,4 4 5 1 The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA 6 2 University of Connecticut Center on Aging, UConn Health Center, Farmington, CT, 06030, USA 7 3 Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, 8 06030, USA 9 4 Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT, 06030, USA. 10 5 Computer Engineering Department, Bilkent University, Ankara, 06800, Turkey 11 6 Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA 12 * These authors contributed equally to this work. 13 Correspondence: duygu.ucar@jax.org 14 15 ABSTRACT 16 Similar to other droplet-based single cell assays, single nucleus ATAC-seq (snATAC-seq) data harbor multiplets 17 that confound downstream analyses. Detecting multiplets in snATAC-seq data is particularly challenging due to 18 its sparsity and trinary nature (0 reads: closed chromatin, 1: open in one allele, 2: open in both alleles), yet offers 19 a unique opportunity to infer multiplets when >2 uniquely aligned reads are observed at multiple loci. Here, we 20 implemented the first read count-based multiplet detection method, ATAC-DoubletDetector, that detects 21 multiplets independently of cell-type. Using PBMC and pancreatic islet datasets, ATAC-DoubletDetector 22 captured simulated heterotypic multiplets (different cell-types) with ~0.60 recall, showing ~24% improvement 23 over state of the art. ATAC-DoubletDetector detected homotypic multiplets with ~0.61 recall, representing the 24 first method to detect multiplets originating from the same cell type. Using our novel clustering-based algorithm, 25 multiplets were annotated to their cellular origins with ~85% accuracy. Application of ATAC-DoubletDetector will 26 improve downstream analysis of snATAC-seq.27 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 2 MAIN 28 Single nucleus ATAC-seq (snATAC-seq)1–3 technology is widely used to study epigenomes of diverse cells and 29 tissues with increased resolution3,4. However, as with other droplet based single cell technologies, snATAC-seq 30 data harbor multiplet nuclei5. The presence of multiplets can confound downstream analyses by introducing 31 combined epigenomic profiles that originate from two or more nuclei, increasing the difficulty of clustering and 32 comparing different cell types within a sample. Compared to other single cell assays, the difficulty of detecting 33 multiplets in snATAC-seq is further increased due to data sparsity and the trinary nature of chromatin accessibility 34 levels (e.g., 0 reads: closed chromatin, 1: open in one allele, 2: open in both alleles). 35 The current state of the art for detecting multiplets in snATAC-seq data adapt detection methods 36 developed for single cell RNAseq (scRNA-seq). Notably, two snATAC-seq data analysis packages, SnapATAC6 37 and ArchR7, either employ or implement a method similar to multiplet detection methods (i.e., DoubletFinder8 38 and Scrublet9) for scRNA-seq. In these methods, synthetic heterotypic multiplets (i.e., originating from different 39 cell types) are simulated by combining profiles of two or more cells, which are then used to detect putative 40 multiplets based on cluster similarity. Such algorithms assume that multiplets and singlets exhibit distinct 41 genomic profiles, which becomes problematic when true singlets share genomic profiles with two or more cell 42 types. Under this assumption, these methods will fail to detect homotypic multiplets (i.e., originating from the 43 same cell type) since their overall genomic profile is considered to be similar to that of the underlying cell type. 44 However, homotypic multiplets are characterized by increased read counts compared to singlets, suggesting 45 new methods that utilize read counts can detect them. In order to overcome the limitations of existing methods 46 to detect both homotypic and heterotypic multiplets, we developed a novel multiplet detection method, ATAC-47 DoubletDetector, that exploits read count distributions to infer multiplets in snATAC-seq data. 48 ATAC-DoubletDetector’s efficacy was tested in two snATAC-seq datasets generated from peripheral 49 blood mononuclear cells (PBMCs) samples (n=2) and pancreatic islet (n=2) tissues. We identified multiplets in 50 these tissues and quantified the algorithm’s efficacy using simulated homotypic and heterotypic multiplets. We 51 found that when snATAC-seq samples were adequately sequenced (e.g., >20k valid read pairs per cell), ATAC-52 DoubletDetector proved very effective for detecting both homotypic and heterotypic multiplets (recall ranging 53 from 0.74-0.89 in PBMCs). In addition, ATAC-DoubletDetector includes a novel clustering-based algorithm that 54 accurately annotates the cellular origins of detected multiplets (85% average accuracy in our simulations), 55 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 3 providing further data quality insights. ATAC-DoubletDetector is provided as a user-friendly computational 56 framework with documentation and source code freely available at: https://github.com/UcarLab/ATAC-57 DoubletDetector. 58 59 Results 60 ATAC-DoubletDetector leverages the fact that the expected number of uniquely aligned reads for a given locus 61 ranges from 0 to 2 per nucleus in snATAC-seq data: 0 = closed chromatin, 1 = open in one allele (i.e., from either 62 maternal or paternal chromosomes), 2 = open in two alleles (i.e., both maternal and paternal chromosomes) 63 (Fig. 1a). A locus can have more than two reads (>2) when: 1) it contains repetitive sequences; 2) there are 64 sequencing or alignment errors; or 3) reads stem from multiplet nuclei. In the case of multiplets, we expect to 65 observe many loci with >2 reads since their epigenomic profiles are derived from two or more nuclei resulting in 66 increased accessible DNA. ATAC-DoubletDetector identifies all loci with >2 reads for each cell/nucleus (Fig. 1b) 67 by utilizing sorted read alignments to detect their overlapping read intervals (22-39 bp on average across all 68 samples). A unified list of these loci across all nuclei is then generated to quantify the number of occurrences 69 where >2 reads align to a locus in a given nucleus (Fig. 1c). As a proof of concept, highly significant multiplets 70 (P-Values < 10-324) can be clearly seen harboring many more loci with >2 reads (924-1054 loci) than average 71 (~23 loci per nuclei) (Extended Data Fig.1). Random occurrences of loci with >2 reads (i.e., due to sequencing 72 or alignment errors) were modeled with the Poisson cumulative distribution function using the mean number of 73 overlaps detected across all cells. Nuclei that harbor significantly more loci with >2 reads are identified as 74 multiplets based on their deviations from the distribution using False Discovery Rate (FDR) (Fig. 1c). To trace 75 multiplets back to their cellular origins, we employed a clustering-based algorithm as part of the ATAC-76 DoubletDetector framework. Marker peaks are detected to generate reference accessibility profiles for each cell 77 type using single cell clustering. Epigenomic similarity scores at marker peaks are then used to compare multiplet 78 profiles with singlet profiles to differentiate between heterotypic and homotypic multiplets and annotate them. 79 We demonstrate the utility and performance of our computational framework by applying our methods in 80 PBMC and islet sample datasets (Fig. 1d). First, we simulated artificial multiplets in PBMC and islet samples and 81 quantified ATAC-DoubletDetector’s ability to identify and annotate these multiplets. Second, we compared 82 ATAC-DoubletDetector to ArchR, measuring their overall performances and their ability to detect simulated 83 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 4 heterotypic and homotypic multiplets. Finally, we measure the efficacy of our annotation method and analyze 84 multiplet cellular origins to understand whether cell type influences the rate of multiplet occurrences. 85 86 ATAC-DoubletDector detects heterotypic and homotypic multiplets in PBMC and islet samples. We 87 generated snATAC-seq libraries from two human PBMC and two human pancreatic islet samples using 10x 88 Genomics Chromium platform3. Sequence reads were preprocessed using Cell Ranger ATAC pipeline 89 (methods), resulting in an average of 5,559 and 6,173 nuclei per sample and an average of 24,393 and 16,625 90 valid read pairs per cell for PBMC and islet samples respectively (Fig. 2a). Valid read pairs refer to all pairs of 91 paired end reads that align to autosomes and pass quality control flags/thresholds (methods). Despite deeper 92 sequencing for islet samples, fewer valid read pairs were observed in islet samples compared to PBMC samples 93 (Fig. 2b), which can be explained by increased mitochondrial reads in islets (114,821,502 and 47,522,248 total 94 reads aligned to chrM) compared to PBMCs (2,610,761and 947,233 total reads aligned to chrM). 95 Nuclei clustering using an in-house implementation (methods) of a two-pass clustering method3 for 96 snATAC-seq data identified 16 and 15 clusters for PBMC1 and PBMC2. Correlating pseudo-bulk accessibility 97 profiles of these clusters with accessibility maps from sorted bulk ATAC-seq data10 (Extended Data Fig. 2a,b) 98 grouped them into 5 major cell types: myeloid (including CD14+, CD16 monocytes and conventional dendritic 99 cells), B, CD4+ T, CD8+ T, and NK cells (Extended Data Fig. 2c,d). These annotations were confirmed based on 100 chromatin accessibility patterns at cell-specific marker genes (Extended Data Fig. 3a,b). The same clustering 101 procedure identified 14 and 12 distinct clusters for islet1 and islet2, which were then annotated as alpha, beta, 102 delta, and ductal cells by integrating their accessibility profiles with in-house islet scRNA-seq data (Extended 103 Data Fig. 4a,b). These annotations were confirmed by analyzing the chromatin accessibility patterns at known 104 cell-specific marker genes11 (Extended Data Fig. 4c,d). 105 We applied ATAC-DoubletDetector on PBMCs and human islet samples using an FDR cutoff of 0.01 106 (Methods). Nuclei detected as multiplets were distributed throughout all clusters (Fig. 2c-d, Extended Data Fig. 107 5) and in one case (PBMC1) multiplets formed their own distinct cluster (see selected multiplets in Fig. 2d). The 108 percentage of detected multiplets were higher in PBMCs (7%, 10.84%) compared to islets (5% for both samples) 109 (Fig. 2e), which is likely due to the lower valid read pairs per nuclei in islets as previously mentioned (Fig. 2b). 110 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 5 To further study the biological relevance of these detected multiplets, we selected a cluster which 111 exclusively encompassed multiplets (Fig. 2d; PBMC 1 selected multiplets) and analyzed their chromatin 112 accessibility profiles (Fig. 2f). The selected multiplets were characterized by a high chromatin accessibility at the 113 promoters of both CD3G (T cell marker gene) and LYZ (monocyte marker gene), suggesting T cell-monocyte 114 multiplets. These results demonstrate how read count distribution information from snATAC-seq can be used to 115 effectively detect multiplets. 116 117 ATAC-DoubletDetector effectively detects simulated heterotypic and homotypic multiplets. To quantify 118 the efficacy of ATAC-DoubletDetector, we generated artificial multiplets by randomly selecting 5% of nuclei in a 119 sample and pairing them together to artificially form multiplets (repeated 10 times per sample). This resulted in 120 artificial multiplets at 2.5% of the total number of nuclei within a sample. These artificial multiplets serve as 121 positive multiplet examples and enable us to measure recall (i.e., the fraction of detected artificial multiplets 122 among all artificial multiplets introduced in the sample). We first evaluated ATAC-DoubletDetector’s ability to 123 detect heterotypic, homotypic, and a combination of both multiplet types. We then compared it’s performance in 124 comparison to another method ArchR7. 125 ATAC-DoubletDetector detected heterotypic multiplets introduced in PBMC samples with high recall 126 (average recall 0.80 for PBMC1 and 0.90 for PBMC2 over 10 runs), outperforming ArchR (0.23 and 0.24 127 respectively) (Fig. 3a). Average recall for ATAC-DoubletDetector was lower in islet1 and islet2 than PBMCs (0.37 128 and 0.34 average recall respectively) whereas the average recall showed improvement for ArchR (0.68 and 0.30 129 average recall respectively). Decreased performance of ATAC-DoubletDetector’s in islets can be explained by 130 low number of valid read pairs per nuclei in islet samples compared to PBMCs (Fig 2b). Notably, ATAC-Doublet 131 detector was equally effective for detecting homotypic multiplets (average recall 0.82 and 0.91 for PBMC 1 and 132 PBMC 2, 0.38 and 0.31 for islet 1 and islet 2) (Fig. 3b), demonstrating the utility of using read counts to detect 133 multiplets. As expected, ArchR had low recall for detecting homotypic multiplets (average between 0.07 and 0.11 134 for all samples), as this algorithm identifies multiplets with distinct genomic profiles from singlets. Finally, we 135 measured the efficacy to simultaneously detect both types of multiplets by introducing a more realistic- 136 heterotypic and homotypic multiplet 1:1 ratio (Extended Data Fig. 6a). As expected, the average recall values of 137 ATAC-DoubletDetector’s were similar (0.82 and 0.92 for PBMC1 and PBMC 2, 0.34 and 0.33 for islet1 and islet2 138 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 6 respectively), while, those of ArchR were lower (0.13 and 0.16 for PBMC1 and PBMC2, 0.40 and 0.17 for islet1 139 and islet2), likely due to its poor homotypic multiplet detection performance. 140 To further study how the valid read pairs influence ATAC-DoubletDetector’s performance, we generated 141 artificial multiplets using cells with ranging reads per nucleus (Fig 3c-d, Extended Data Fig. 6b). We observed a 142 noticeable increase in average recall (> 0.96 recall) for ATAC-DoubletDetector, when the number of valid read 143 pairs was above 47.2k, corresponding to an average of 23.6k valid reads pairs per nucleus. In contrast, ArchR 144 did not show significant differences in performances with respect to the number of valid read pairs per nucleus 145 (Extended Data Fig. 6b), as it relies more on genomic profile similarity to detect multiplets. More exhaustive 146 analyses of 100 repetitions per sample further confirmed that the majority (96%, 98% for PBMC1 and PBMC2 147 and 83%, 72% for islet1 and islet2) of multiplets with >40k valid read pairs (i.e., multiplets formed from nuclei 148 with 20k valid read pairs each) were detected with this method (Extended Data Fig. 7). Together, these analyses 149 suggest that when >20k valid read pairs are captured per nucleus, ATAC-DoubletDetector is very effective in 150 detecting both homotypic and heterotypic multiplets from snATAC-seq data. 151 To compare ATAC-DoubletDetector and ArchR performances, we ran ArchR with recommended 152 parameter settings (i.e., k=10 nearest neighbors and 1.5 filter ratio). Only 38 to 78 multiplets across all samples 153 were detected by both methods (Fig 3e-f, Extended Data Fig. 8, Extended Data Fig. 9a-b) and majority of these 154 multiplets were among the ones that formed their own clusters (i.e., heterotypic multiplets). For example, the 155 majority of selected multiplets detected in cluster in Fig 2d were detected by both methods (Extended Data Fig. 156 8), which are multiplets that have unique epigenomic profiles; hence easier to detect with the synthetic multiplet-157 based method employed by ArchR. Notably, 47.35% of Delta cells were identified as multiplets by ArchR for 158 Islet1 (Figure 3f, Extended Data Fig. 8). Delta cells resemble both alpha and beta cells in their genomic profile, 159 hence these cells were mistakenly detected as multiplets by ArchR, demonstrating a pitfall for synthetic multiplet-160 based methods. Multiplets are expected to have higher read counts than singlets since they combine chromatin 161 accessibility profiles of more than one nucleus. In alignment with this, multiplets detected by ATAC-162 DoubletDetector had significantly higher valid read pair counts compared to singlets (average valid read pairs of 163 46,980 for multiplets and 18,561 for singlets for all samples) (P-Values < 1.375 x 10-152). In contrast, read counts 164 for ArchR multiplets were significantly lower (average P-Values < 1.016 x 10-57) than ATAC-DoubletDetector 165 multiplets, observing read counts closer to that of singlets (average read count per cell 23,703 for ArchR 166 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 7 multiplets and 19,951 for singlets) (Extended Data Fig. 9c). In summary, these analyses showed that when there 167 is sufficient number of valid read pairs per cell (> 20k), count based methods are advantageous over synthetic 168 multiplet-based methods as they can accurately detect both homotypic and heterotypic multiplets. 169 170 Marker peaks can effectively annotate cellular origins of multiplets. Cellular origin annotations of multiplets 171 were inferred using a three-step algorithm (Fig. 4a). First, nuclei were clustered and annotated to their respective 172 cell types. Second, marker peaks were detected for each cluster/cell type. Third, we calculated epigenomic 173 similarity of each multiplet to different cell types by counting marker peak reads for the multiplet and the k=15 174 nearest neighbor nuclei (Methods). Cluster similarity scores were then used to annotate multiplets. For example, 175 in PBMCs, for each multiplet we calculated 5 scores, where each score represents the similarity of the multiplet 176 epigenome to that of the five studied clusters (Figure 4b). The distribution of these similarity scores are used to 177 first distinguish heterotypic and homotypic multiplets, by comparing their profiles to annotated singlets (Methods). 178 For example, in PBMC1, nuclei in B cell cluster (cluster 5) had high similarity score for B cell marker peaks and 179 low scores for all other cell types (Figure 4b). In contrast, nuclei in cluster 13 had high similarity scores for NK, 180 CD4+ T, CD8+ T and myeloid cells, a signature of heterotypic multiplets (Fig. 4b). Once the multiplet type is 181 identified, their cellular origins are annotated using the highest scoring cell type(s). 182 We evaluated the efficacy of this annotation pipeline using artificial multiplets, where cells were randomly 183 selected and paired together to form both heterotypic and homotypic multiplets. Using these artificial multiplets, 184 we categorized multiplets as homotypic or heterotypic and annotated multiplets with respect to the number of 185 cell types associated with them. We identified the cellular origins of both types of multiplets with an average 186 accuracy of 82.47%, 85.87% in PBMC1, PBMC2 and 85.7%, 85.5% in islet1, islet2 (Fig. 4c). For example, in 187 PBMC1, 96% of all simulated B and myeloid multiplets were correctly annotated. Cell types that have similar 188 functions, hence similar epigenomes, observed lower annotation accuracies; such as 86% for simulated NK and 189 CD8+ T cells. Our annotations were equally effective for annotating both homotypic and heterotypic multiplets, 190 showing 83.65% accuracy on average to annotate homotypic multiplets and 85.59% accuracy to annotate 191 heterotypic multiplets. 192 193 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 8 Multiplet cell-type compositions reflect cellular compositions of the underlying tissue. Using ATAC-194 DoubletDetector’s annotation pipeline, we annotated all detected multiplets in PBMCs and islets. Inspection of 195 aggregate accessibility profiles at marker gene promoters (MS4A1, CD3G, CD4, CD8A, TREM1, NKG7, and 196 KLRF1) for each cell type in PBMC2 (Fig. 5a) revealed that annotated multiplets have accessibility at relevant 197 marker gene promoters. For instance, homotypic B cell multiplets had strong signal at the promoter of B cell 198 marker gene MS4A1, whereas heterotypic multiplets originating from CD8+ T cell and B cells had high 199 accessibility signals for both B cell marker gene MS4A1 and CD8+ T cell marker gene CD8A. 200 As expected, homotypic multiplets clustered together with the underlying cell type, whereas heterotypic 201 multiplets typically formed their own clusters (Fig. 5b-c, Extended Data Fig. 10a-b). The majority of heterotypic 202 multiplets for islet1 were found between major cell type clusters and near the delta cell cluster while homotypic 203 multiplets resided within the boundaries of singular cell type clusters (Fig. 5d). For PBMC1, the majority of 204 multiplets resided within multiplet cluster we previously identified and as a subcluster of CD8+ T cells (Fig. 5e). 205 As before, homotypic multiplets were found within corresponding cell type clusters. Overall, the majority of 206 detected multiplets were homotypic (76.7-84.3% in islets, 63-78.7% in PBMCs), with cell types being distributed 207 with respect to their cell proportions for both homotypic and heterotypic multiplet types (Fig. 5d-e, Extended Data 208 Fig. 10c-d). Indeed, in both tissues, the propensity of a cell type to form a multiplet was positively correlated 209 with the percent of that cell type within the tissue (Pearson’s R = 0.824, 0.897, P-Value < 0.087, 0.04 for PBMC1 210 and PBMC2, Pearson’s R = 0.931, 0.475 P-Value < 0.07, 0.525 for islet1 and islet2) (Fig. 5f-g, Extended Data 211 Fig. 10e-f), suggesting that snATAC-seq multiplets are more likely to occur randomly than through specific 212 interactions between nuclei. For example, the most abundant cell type in islet1 was beta cells (46.62% of the 213 cell population) which contributed to 51.96% of multiplets (Fig. 5f). Heterotypic multiplet annotations in islet 214 samples mostly originated from alpha, beta and delta cells. In PBMCs, the most frequent heterotypic multiplets 215 were the ones stemming from CD4+ T and CD8+ T cells (Fig. 5f, Extended Data Fig. 10e). 216 217 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 9 DISCUSSION 218 Detecting and discarding multiplets from snATAC-seq data is a critical step for improving data quality as 219 multiplets can form their own clusters and can confound downstream analyses. ATAC-DoubletDetector exploits 220 read count distributions for a given nucleus to effectively detect and eliminate multiplets without requiring prior 221 knowledge of cell-type information. It accomplishes this by first efficiently counting loci with >2 uniquely aligned 222 reads per nucleus and identifying nuclei with read count distributions deviating from expectations. Unlike other 223 methods that utilize artificial multiplet examples to identify putative multiplets (i.e., ArchR), ATAC-224 DoubletDetector is capable of detecting both homotypic (i.e., multiplets originating from the same cell type) and 225 heterotypic multiplets (i.e., multiplets originating from different cell types). Eliminating heterotypic multiplets is 226 essential for improved clustering and differential analyses between clusters and samples, whereas homotypic 227 multiplets introduce bias in allele-specific analyses. Hence, detecting and removing both types of multiplets will 228 improve downstream analyses. 229 The number of valid read pairs per cells is the most important factor affecting the performance of ATAC-230 DoubletDetector. When read depth per nucleus is sufficiently high (e.g., >20k read pairs per nucleus), ATAC-231 DoubletDetector is very effective in detecting both heterotypic and homotypic multiplets (average recall = 0.836 232 to detect artificial multiplets in PBMCs). Since ATAC-DoubletDetector does not depend on artificial multiplet 233 examples, it is not inherently biased towards cell types that resemble others. For example, in islets, delta cells 234 transcriptionally resemble alpha and beta cells, hence artificial multiplets generated by combining alpha and beta 235 cells have genomic profiles that resemble delta cells. These instances are particularly challenging for methods 236 that depend on artificial multiplet examples (e.g., ArchR for snATAC7, DoubletFinder8 and Scrublet9 for scRNA-237 seq). In alignment with this, ArchR categorized 47.35% of delta cells as multiplets in islet1. Given the success of 238 ATAC-DoubletDetector for identifying multiplets from snATAC-seq data with enough reads per nuclei, it can also 239 be effective in detecting and eliminating multiplets in recent multi-ome transcriptome and epigenome assays12. 240 Epigenomic signal at marker peaks is an effective way to annotate cellular origins of multiplets, where 241 we achieved 84.69% accuracy on average in simulations. Annotations of detected multiplets showed that 242 majority are homotypic. Furthermore, the propensity of nuclei to form multiplets was positively correlated with 243 the abundance of that cell type within the tissue. Since cells are lysed and nuclei are profiled in snATAC-seq 244 protocols3; these assays will likely not be prone to biological multiplets due to cell-cell interactions). Therefore, 245 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 10 snATAC-seq multiplets likely occur randomly among all cells; hence the most abundant cells are the most likely 246 to form multiplets. 247 Quantifying the efficacy of multiplet detection methods is a challenging task since true examples of singlet 248 and multiplets are not known. To overcome this challenge, we evaluated ATAC-DoubletDetector’s ability to 249 capture multiplets by simulating artificial multiplets, enabling us to measure recall. ATAC-DoubletDetector 250 identified 5-10.84% of cells as multiplets in islet and PBMC samples, which was in alignment with expectations. 251 Hence, we believe false positive calls are also restricted in our method. Although we quantified our method by 252 forming artificial multiplets, ATAC-DoubletDetector pipeline can be easily extended to capture and annotate 253 multiplets that include data from multiple nuclei. 254 Multiplets are inevitable in single cell sequencing and performing better data analyses calls for their 255 removal. ATAC-DoubletDetector introduces a novel and effective count-based solution for detecting multiplets 256 and provides a framework for annotating their cellular origins, improving future downstream analyses. ATAC-257 DoubletDetector code and documentation is freely available at https://github.com/UcarLab/ATAC-258 DoubletDetector, providing an easy to use interface for all backgrounds. Our multiplet detection algorithm is fast 259 and can be incorporated into data analyses pipelines, where processing of an average library (i.e., ~5,886 cells 260 at ~20,508 valid read pairs per cell) takes <30 minutes. 261 262 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 11 METHODS 263 snATAC-seq cell labeling, capture, library preparation, and sequencing. For single nucleus ATAC 264 sequencing (snATACseq) experiments, viable single cell suspensions from each sample were used to generate 265 snATACseq data using the 10X Chromium platform according to the manufacturer’s protocols (Demonstrated 266 Protocol Nuclei Isolation for ATAC Sequencing Document CG000169; Chromium Single Cell ATAC_User Guide 267 RevB Document CG000168). Briefly, >100,000 cells of interest were centrifuged, the supernatant was removed 268 without disrupting the cell pellet, Lysis Buffer was added for 5 minutes on ice to generate isolated and 269 permeabilized nuclei, followed by quenching by dilution with Wash Buffer. After centrifugation to pellet the 270 washed nuclei, Diluted Nuclei Buffer was used to re-suspend nuclei at the desired nuclei concentration as 271 determined using a Countess II FL Automated Cell Counter and combined with ATAC Buffer and ATAC Enzyme 272 to form a Transposition Mix. Transposed nuclei were immediately combined with Barcoding Reagent, Reducing 273 Agent B and Barcoding Enzyme and loaded onto a 10X Chromium Chip E for droplet generation, followed by 274 library construction. The barcoded sequencing libraries were subjected to bead clean-up and checked for quality 275 on an Agilent 4200 TapeStation, quantified by qPCR (KAPA Biosystems Library Quantification Kit for Illumina 276 platforms), and pooled for sequencing on an Illumina NovaSeq 6000 S2 flow cell (paired-end libraries 2x50bp). 277 278 Human islet isolation 279 Human islets were obtained through partnerships with the Integrated Islet Distribution Program (IIDP, 280 http://iidp.coh.org/). Assessment of human islet function was performed by islet GSIS static incubation assay on 281 the day after arrival, following the IIDP protocol. Primary human islets were cultured in Prodo media (PIM-S + 282 supplements PIM-G + PIM-ABS) in 5% CO2 at 37oC for ~24 hours prior to beginning studies. In preparation of 283 single cell suspension for 10x platform, human islets were dispersed with StemPro Accutase (Thermo Fisher 284 Scientific) 1ml/1000IEq for 10min at 37oC. Islet single cell suspension was washed three times in PBS-0.03% 285 BSA and cell number determined using Countess II FL Automated Cell Counter (Life Tech). Nuclei isolation for 286 single cell ATAC sequencing was performed following the 10x protocol 287 (https://assets.ctfassets.net/an68im79xiti/5g035d2ngCW1aB9DFqPphO/71445a59fb282ea273a866c26cb5d31288 9/CG000169_DemonstratedProtocol_NucleiIsolation_ATAC_Sequencing_RevD.pdf, based on the OMNI 289 nucleiprep by Corces et al.13). 290 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 12 291 Identifying snATAC-seq loci with >2 reads. Position sorted paired-end read alignments from snATAC-seq 292 data are compared to detect all loci with >2 unique reads per nucleus. To avoid instances where reads overlap 293 due to technical reasons, we removed all read pairs that are marked using the following parameters in the 294 HTSJDK14 library: 1) ReadPairedFlag = True, 2) ReadUnmappedFlag = False, 3) MateUnmappedFlag = False, 295 4) SecondaryOrSupplementary = False, 5) DuplicateReadFlag = False, and ReferenceIndex != 296 MateReferenceIndex (i.e., read pairs map to the same chromosome). To reduce overlaps due to alignment 297 errors, reads are excluded based on i) mapping quality scores less than or equal to 30, and ii) insert sizes (i.e., 298 the end to end distance between 5’ and 3’ read positions) greater than 900bp (~6 nucleosomes) in length. 299 To identify instances of >2 reads overlapping at any specific locus, all intervals are identified for which 300 an overlap was observed for at least two valid read pairs. Reads defining each interval are then compared to 301 one another to identify all subintervals that exceed the specified overlap threshold (i.e., 2). To efficiently identify 302 these subintervals, for each subset, interval breakpoints were defined at the start and end positions of each 303 paired end read. For each interval breakpoint, an integer value of 1 was assigned to all breakpoints originating 304 from start positions, and -1 to all breakpoints originating from an end position. Interval breakpoints are then 305 visited in start position sorted order to generate a cumulative sum based on the assigned values at each 306 breakpoint. The cumulative sum indicates the total number of overlaps between two interval breakpoints and 307 efficiently identifies all sub-intervals with a number of overlaps greater than the specified threshold. 308 Once all subintervals satisfying the threshold are identified for a subset of reads, the algorithm repeats 309 this process for the remaining paired end read subsets. Each step is performed using a linear time algorithm 310 (i.e., O(n), n is the number of total reads), with an additional O(log(m)) (m equals the number of nuclei) overhead 311 for each read to identify their respective nucleus origin, resulting in O(n*log(m)) runtime. The runtime can be 312 reduced to an expected O(n) runtime by instead using an appropriate hash function for cell identifiers/barcodes. 313 Note that this algorithm assumes that reads are sorted beforehand and is otherwise superseded by time it takes 314 to sort reads by their chromosome and start positions (i.e., O(n*log(n)). 315 316 Detecting significant multiplets from snATAC overlap counts. Loci with >2 reads were first filtered using 317 simple repeats, segmental duplications, repeat masker and blacklist regions obtained from UCSC Genome 318 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 13 Browser15 and ENCODE16,17. Next, filtered regions from all nuclei were merged if they overlapped by at least one 319 base pair. Using this unified list of loci, a binary matrix was generated where rows in the matrix represent loci 320 with >2 reads for at least one nucleus, and the columns represent the individual cells within the sample. Values 321 within the matrix were assigned to 1 if the cell and genomic region combination observed >2 reads overlapping, 322 and 0 otherwise. From this matrix, multiplets can be detected using column sums (i.e., the total number of >2 323 read overlap instances for each nucleus) while repetitive element sequences can be inferred using row sums 324 (i.e., the total number of cells observing >2 reads at the same locus). 325 The events of observing >2 reads overlapping within the same region for multiple cells or across multiple 326 regions within the same cell can be modeled using the Poisson distribution. Occurrences of these events are 327 independent, counted within set intervals (i.e., counting regions across the entire genome within cells or counting 328 cells within the same genomic regions), are either present or not within these intervals, and have a constant 329 average rate of occurring, satisfying the assumptions of the Poisson distribution. We therefore detected 330 significant multiplets and inferred repetitive sequences using the Poisson cumulative distribution function, using 331 respective mean row and column sum counts as the expected values to calculate Poisson probabilities. In this 332 process, we first use Poisson probabilities to infer repetitive sequences where a significant number of nuclei 333 observe >2 reads at the same genomic region. All inferred repetitive sequence loci are removed from further 334 analysis. Next, we calculate the Poisson probability of observing more loci with >2 reads than expected in a 335 nucleus(i.e., multiplets) using column sums. Poisson probabilities for both inferring repetitive sequence and 336 multiplet detection were corrected using the Benjamini Hochberg procedure to adjust for multiple hypothesis 337 testing. Repetitive sequence inferences and multiplets were predicted by selecting regions or cells with adjusted 338 Poisson probabilities less than 0.01. 339 340 Multiplet annotation pipeline. Detected multiplets are annotated using clusters identified for snATAC-seq 341 samples, merging them with respect to specific cell types present in the cell population. In our study, PBMC 342 clusters were merged to represent CD4+T, CD8+T, Natural Killer (NK), myeloid and B cells and islet clusters 343 were merged to represent alpha, beta, delta and ductal cells. Marker peaks for all cell type clusters with at least 344 150 cells were identified with the FindMarkers function in Seurat18, using the logistic regression setting. For the 345 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 14 sake of unison, the top 100 marker peaks are then identified for each cell type cluster based on Bonferroni 346 adjusted p-value of average log fold changes. 347 To account for data sparsity in snATAC-seq data, aggregate read profiles are calculated for each cell 348 and marker peak. Aggregate read profiles are found by taking average read counts for each cell’s 15 nearest 349 neighbors using the top 50 singular value decomposition (SVD) components. The cumulative distribution function 350 in R (i.e., ecdf) is then used to find the abundance of reads for each cluster’s marker peaks. Distribution scores 351 represent the percent of each cell type’s accessibility profiles present within the cell. In order to distinguish 352 multiplet types (i.e., heterotypic or homotypic) singlet profiles were calculated for each cell type in the sample. 353 For each cell type’s singlet cells, abundance scores at every marker peak were averaged to find the representive 354 abundance score profile for that cell type. Multiplets that have a profile close to their abundant cell type’s singlet 355 profile were classified as homotypic. Euclidean distance was used to measure the similarity between the profiles 356 of multiplets and singlets. Mixture models were then fitted to the distances with the Mclust R package19 to group 357 the closeness of the multiplets to their corresponding cell type’s singlet profile. Multiplets in the group with largest 358 distance to the singlet profile are considered heterotypic. Multiplets are then annotated using the top 1 (for 359 homotypic) or 2 (for heterotypic) abundance scores. 360 361 snATAC-seq nuclei clustering. To cluster nuclei from snATAC-seq data, we employed an in-house 362 implementation (https://github.com/UcarLab/snATACClusteringPipeline) of a two pass clustering method 363 previously described3 with notable differences. First, we restrict the number of 2.5kb bins in the first pass 364 clustering to the top 50k bins, up from 20k bins. For second pass clustering, we increase the number of peaks 365 to include all peaks identified in pass 1 up to 200k. 366 367 Integration of scRNA-seq and snATAC-seq data. Integrative clustering and analysis of single cell 368 transcriptomes and single nucleus epigenomes was performed using the R package Seurat18,20. First, gene 369 activity scores were derived from the resultant snATAC-seq peak count-matrix using the 370 CreateGeneActivityMatrix function with default parameters. Next, single nuclei with < 5,000 total read counts 371 were discarded from analyses. The resultant single nuclei and gene activity scores were log normalized and 372 scaled. Using the processed scRNA-seq data (also analyzed with Seurat), we identified anchors between the 373 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 15 snATAC-seq gene activity score matrix and scRNA-seq gene expression matrix following the methodology 374 described by Butler et al. (2018)18. After identifying anchors between the datasets, cell-type labels from the 375 scRNA-seq dataset were transferred to the snATAC-seq dataset and a prediction and confidence score was 376 assigned for each cell. 377 378 Simulating artificial multiplets to measure multiplet detection performances. To measure recall for 379 detecting multiplets, artificial multiplets were simulated by combining accessibility profiles of nuclei within each 380 sample population tested. For each sample, cells were randomly selected equal to 5% of the total cell population 381 and paired together to introduce artificial multiplets equivalent to 2.5% of the total population. Introducing 2.5% 382 artificial multiplets ensured that they were not the majority compared to real multiplets (5-11% of cells across all 383 samples) present in the data. Cell pairs were randomly reselected until they formed heterotypic, homotypic, or 384 1:1 ratio of heterotypic and homotypic multiplets based on cell type annotations. Simulations measuring the 385 number of valid read pairs per nucleus did not have restrictions based on cell type and were selected based on 386 read depth when stratifying by number of valid read pairs (i.e., Fig. 3c-d, Extended Data Fig. 6b) or completely 387 at random (i.e., Extended Data Fig. 7). Once cell pairs were identified, artificial multiplets were introduced by 388 generating modified barcode mappings (for ATAC-DoubletDetector) or barcodes in fragment files (for ArchR7), 389 which assigned artificial multiplet reads to the same cell identifier (i.e., the first nucleus in the pair). Artificial 390 multiplets were simulated 10 or 100 runs depending on the analysis. 391 392 CODE AVAILABILITY 393 ATAC-DoubletDetector is provided as a user-friendly computational framework with documentation and source 394 code freely available at: https://github.com/UcarLab/ATAC-DoubletDetector. 395 396 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 16 REFERENCES 397 1. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 398 523, 486–490 (2015). 399 2. Cusanovich, D. A. et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular 400 indexing. Science 348, 910–914 (2015). 401 3. Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell 402 development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019). 403 4. Rai, V. et al. Single-cell ATAC-Seq in human pancreatic islets and deep learning upscaling of rare cells 404 reveals cell-specific type 2 diabetes regulatory signatures. Mol. Metab. 32, 109–121 (2020). 405 5. Lareau, C. A., Ma, S., Duarte, F. M. & Buenrostro, J. D. Inference and effects of barcode multiplets in 406 droplet-based single-cell assays. Nat. Commun. 11, 866 (2020). 407 6. Fang, R. et al. SnapATAC: A Comprehensive Analysis Package for Single Cell ATAC-seq. 408 https://www.biorxiv.org/content/10.1101/615179v3 (2020). 409 7. Granja, J. M. et al. ArchR: An integrative and scalable software package for single-cell chromatin 410 accessibility analysis. http://biorxiv.org/lookup/doi/10.1101/2020.04.28.066498 (2020) 411 doi:10.1101/2020.04.28.066498. 412 8. McGinnis, C. S., Murrow, L. M. & Gartner, Z. J. DoubletFinder: Doublet Detection in Single-Cell RNA 413 Sequencing Data Using Artificial Nearest Neighbors. Cell Syst. 8, 329-337.e4 (2019). 414 9. Wolock, S. L., Lopez, R. & Klein, A. M. Scrublet: Computational Identification of Cell Doublets in Single-415 Cell Transcriptomic Data. Cell Syst. 8, 281-291.e9 (2019). 416 10. Ucar, D. et al. The chromatin accessibility signature of human immune aging stems from CD8+ T cells. J. 417 Exp. Med. 214, 3123–3144 (2017). 418 11. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific 419 expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017). 420 12. Ma, S. et al. Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin. Cell 421 183, 1103-1116.e20 (2020). 422 13. Corces, M. R. et al. An improved ATAC-seq protocol reduces background and enables interrogation of 423 frozen tissues. Nat. Methods 14, 959–962 (2017). 424 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 17 14. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinforma. Oxf. Engl. 25, 2078–2079 425 (2009). 426 15. Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47, D853–427 D858 (2019). 428 16. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 429 489, 57–74 (2012). 430 17. Davis, C. A. et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 431 46, D794–D801 (2018). 432 18. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data 433 across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018). 434 19. Scrucca, L., Fop, M., Murphy, T. B. & Raftery, A. E. mclust 5: Clustering, Classification and Density 435 Estimation Using Gaussian Finite Mixture Models. R J. 8, 289–317 (2016). 436 20. Stuart, T. et al. Comprehensive Integration of Single-Cell Data. Cell 177, 1888-1902.e21 (2019). 437 438 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 18 439 Fig. 1: Overview of detecting multiplets in snATAC-seq. a, Tn5 transposase cleaves accessible DNA at maternal and paternal chromosomes. Number of ATAC-seq read counts per loci per nucleus are expected to be 0, 1, or 2. b, Instances where more than 2 (>2) reads are observed for any locus in a cell are identified using an efficient algorithm for counting the number of overlapping reads. c, Poisson cumulative distribution function is used to detect multiplets based on deviations from expected number of loci with >2 reads. d, Overview of downstream analyses: 1) quantification of multiplet detection performances using artificial multiplets, 2) comparison of ATAC- DoubletDetector to alternative method ArchR, 3) annotating cellular origins of multiplets using a clustering-based method. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 19 440 Fig. 2: ATAC-DoubletDetector identifies heterotypic and homotypic multiplets in human PBMC snATAC-seq data. a, Summary of snATAC-seq samples generated and used in this study from human PBMC and islets. b, Valid read pair distributions for PBMC and islet snATAC-seq samples. c, PBMC clusters were annotated based on their correlations with sorted bulk ATAC-seq data (See. Extended Data Fig.2). d, All multiplets (heterotypic and homotypic) detected by ATAC- DoubletDetector in PBMC1. Selected multiplets refer to multiplets for which aggregated profiles are shown in panel f of this figure. e, The number of cells and percentage of multiplets detected by ATAC-DoubletDetector in PBMC and islet samples. f, Chromatin accessibility profiles of CD4+ T, myeloid, and selected multiplets around for T cell marker gene (CD3G) and myeloid cell marker gene (LYZ). CD4+ T and myeloid cells show strong accessibility signals for their relevant marker genes while selected multiplets have accessible chromatin for both marker genes. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 20 441 Fig. 3: ATAC-DoubletDetector detects multiplets with high recall when read depth is sufficient. a-b, Recall for detecting heterotypic (a) and homotypic (b) artificial multiplets. ATAC-DoubletDetector consistently detected both heterotypic and homotypic multiplets with similar recall, while ArchR was only effective for predicting heterotypic multiplets for data with high heterogeneity. c-d, Performance of detecting artificial multiplets at increasing valid read pair (insertions) distributions for PBMC1(c) and islet1(d). ATAC-DoubletDetector effectively detects multiplets at the >40k valid read pairs per nucleus. ArchR’s performance did not observe the same level of effect for read depth. e, Reference annotations for islet1. Islet1 annotations correspond to alpha, beta, delta and ductal cell types. f, Representative UMAP plots for multiplets detected by ATAC- DoubletDetector and ArchR for islet1 (other samples shown in Extended Fig. 8). We identified islet clusters for Alpha, Beta, Delta, and Ductal cells. Majority of multiplets detected were not shared between the two methods. Heterotypic multiplets were the most common. Note: ArchR detected the majority of Delta cells as multiplets. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 21 442 Fig. 4: Multiplet cell-type origins are predicted with high accuracy. a, Overview of the cell origin annotation pipeline. First, cells are clustered. Second, marker peaks are identified. Third, multiplets and their k-nearest neighbor cells are used to generate cluster similarity scores. b, Example of aggregate cluster profiles for predicting cell origin annotations. Clusters corresponding to cell types observe strong signal for their respective cell types (e.g., Cluster 5) while clusters corresponding to multiplets show a mixed profile of cell types (e.g., Cluster 13). c, Heatmaps of cell origin annotation accuracies for predicting artificial multiplets derived from cells of the specific cell type pairings. Multiplet annotations showed high accuracies for the majority of cell type compositions. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 22 443 Fig. 5: Majority of multiplets are homotypic and correspond to cell type proportions. a, Accessibility maps for cell origin annotations for multiplets identified in PBMC2. Homotypic multiplets observe strong signal for their respective marker genes. Heterotypic multiplets observe a combined signal at respective marker genes corresponding to the respective annotated cell types. b-c, UMAP clustering for heterotypic and homotypic multiplet annotations in PBMC1 (b) and islet1 (c). Heterotypic multiplets are found between major cell type clusters. Homotypic multiplets are observed on the periphery of major cell type clusters. d-e, Heterotypic and homotypic multiplet cell distributions (left bars). Homotypic cell type annotations (right bars) for PBMC (d) and islet (e) samples. Majority of multiplets are annotated as homotypic. Homotypic cell type distributions show similar distribution to the overall proportions of each cell type in their respective samples. f-g, Cell and multiplet proportions for PBMC2(f) and islet1(g). Multiplet cell type proportions are highly correlated with overall cell proportions. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 23 444 445 Extended Data Fig. 1: Multiplets observe many loci with >2 reads. The binary matrix of loci with >2 reads per cell reveals high 446 confidence multiplet (marked by arrows) that harbor many loci with >2 reads. These multiplets can be clearly seen compared to the other 447 cells in the subset. 448 449 450 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 24 451 452 Extended Data Fig. 2: Pseudo-bulk snATAC-seq profile correlations with sorted bulk ATAC-seq revealed 5 major cell types. a, 453 b, Spearman correlation heatmaps between pseudo-bulk (snATAC) and sorted bulk ATAC-seq accessibility profiles for PBMC1 (a) and 454 PBMC2 (b). Pseudo-bulk profiles cluster with four major cell types: Myeloid, B, CD4+ T, CD8+ T and Natural Killer (NK). c, d, Annotated 455 UMAP clusters for PBMC1 (c) and PBMC2 (d). Myeloid, B form distinct clusters for both samples. CD4+T, CD8+T and NK cell types share 456 more accessible loci and tend to cluster more closely to one another. 457 458 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 25 459 Extended Data Fig. 3: Annotated snATAC-seq clusters reflect accessibility at cell specific promoters. a, b, Annotated UMAPs for 460 PBMC1 (a) and PBMC2 (b) at the promoters of CD3G (T-Cell Marker), CD4 (CD4+ T cell marker), CD8A (CD8+ T cell marker), MS4A1 461 (B cell marker), NKG7 (NK cell marker), and TREM1 (Myeloid cell marker). Accessibility was binarized to 0 or 1 based on the presence 462 or absence of a read within these promoters. Using these markers, B and Myeloid cell types are clearly annotated with their respective 463 markers. CD4+ T and CD8+ T cells can be observed by combining CD3G with CD4 and CD8A markers respectively whereas NK cells are 464 can be seen using NKG7 and excluding nuclei with accessibility at CD3G promoter. 465 466 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 26 467 Extended Data Fig. 4: Islet snATAC-seq clusters correspond to scRNA-seq and cell marker annotations. a, b, UMAP clusters of 468 snATAC-seq data for islet1 (a) and islet2 (b) annotated as alpha, beta delta or ductal cells via integration with annotated scRNA-seq data. 469 Four distinct clusters are observed with these cell types. c, d. Cell specific clusters correspond to their respective marker peaks for both 470 islet 1(c) and islet2 (d). Accessibility was binarized to 0 or 1 based on the presence or absence of a read within these promoters. Alpha, 471 beta, delta and ductal cells are clearly identified with their respective marker genes: GCG (Alpha), INS (Beta), SST (Delta), and KRT19 472 (Ductal). 473 474 475 476 477 478 479 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 27 480 Extended Data Fig. 5: Multiplets are distributed throughout snATAC-seq clusters. Multiplet annotated UMAP clustering of PBMC1, 481 PBMC2, islet1 and islet2 reveal that multiplets are distributed throughout all identified clusters and in some cases form their own multiplet 482 clusters (i.e., center cluster in PBMC1). Multiplets between major cell type clusters are likely to be heterotypic whereas multiplets at the 483 periphery of annotated clusters are likely to be homotypic. 484 485 486 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 28 487 Extended Data Fig. 6: ATAC-DoubletDetector detects both homotypic and heterotypic multiplets at high read depth. a, Recall for 488 detected both homotypic and heterotypic artificial multiplets at a 1:1 ratio. ATAC-DoubletDetector did not observe noticeable differences 489 in performances due to its robustness for detecting both multiplet types. ArchR showed reduced performance compared to heterotypic 490 multiplet only detection due to the inclusion of homotypic multiplets. b, Recall for multiplets stratified by read count distributions (top for 491 each sample) and valid read pair distributions for each multiplet subset (bottom for each sample). ATAC-DoubletDetector performances 492 increased when the number of valid read pairs exceeded ~40k valid read pairs per nuclei, suggesting multiplets can be reliably detected 493 when nuclei have >20k valid read pairs each. ArchR did not show significant differences in performance due to read depth. 494 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 29 495 Extended Data Fig. 7: Artificial multiplets are detected when combined valid read pairs exceed 40k. For each sample, multiplets 496 were detected (Top left for each sample) or not detect (Top right for each sample), depending on whether one or both nuclei exceeded 497 20k valid read pairs. Histogram of combined profiles revealed that the majority of detected multiplets (bottom left for each sample) had at 498 least 20k valid read pairs while multiplets not detected were those with less than 40kb valid read pairs (bottom right for each sample). 499 When nuclei are sequenced for 20k valid reads per nuclei, multiplets will harbor 40k valid read pairs and can be detected by ATAC-500 DoubletDetector. 501 502 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 30 503 504 Extended Data Fig. 8: ATAC-DoubletDetector and ArchR identify different multiplet subsets. UMAP clusters annotating ATAC-505 DoubletDetector multiplets (green), ArchR multiplets (orange), or their intersection (black). Majority of multiplets detected by both ATAC-506 DoubletDetector and ArchR were between major cell type clusters (i.e., heterotypic multiplets). 507 508 509 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 31 510 Extended Data Fig. 9: ATAC-DoubletDetector and ArchR multiplets comparisons reveal nature of their underlying algorithms. 511 a, Venn diagrams and total number of multiplets detected by ATAC-DoubletDetector and ArchR. Only a small subset of multiplets is 512 detected by both methods. b, Total number of nuclei and multiplets detected by each method. Differences in number of nuclei are due to 513 differences in inputs (i.e., alignment (BAM) files for ATAC-DoubletDetector and fragment files (Cell Ranger output) for ArchR). Overall, 514 ArchR detects more multiplets using default parameters than ATAC-DoubletDetector. c, Valid read pair distributions between multiplets 515 and singlets detected by ATAC-DoubletDetector and ArchR. Differences in number of valid read pairs between multiplet and singlets 516 were more significant for ATAC-DoubletDetector than ArchR while the number valid read pairs for ATAC-DoubletDetector were 517 significantly greater than ArchR multiplet. 518 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250 32 519 520 Extended Data Fig. 10: Multiplet annotations correspond to cell proportions. a-b, UMAP clustering for heterotypic and homotypic 521 multiplet annotations in PBMC2 (a) and islet2 (b). Heterotypic multiplets are found between major cell type clusters. Homotypic multiplets 522 are observed on the periphery of major cell type clusters. c-d, Heterotypic cell type annotations for PBMC (d) and islet (e) samples. 523 Majority of multiplets are annotated as homotypic. f-g, Cell and multiplet proportions for PBMC1(f) and islet2(g). Multiplet cell type 524 proportions are highly correlated with overall cell proportions. Islet2 observed more beta cell multiplets than other cell types/samples, 525 reducing correlation and significance for islet2. 526 527 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425250doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425250