key: cord-0284421-ojo36tgc authors: Hao, Yuhan; Stuart, Tim; Kowalski, Madeline; Choudhary, Saket; Hoffman, Paul; Hartman, Austin; Srivastava, Avi; Molla, Gesmira; Madad, Shaista; Fernandez-Granda, Carlos; Satija, Rahul title: Dictionary learning for integrative, multimodal, and scalable single-cell analysis date: 2022-02-26 journal: bioRxiv DOI: 10.1101/2022.02.24.481684 sha: a22fdbc38c28904a3d0ed2f4acba597727d350d0 doc_id: 284421 cord_uid: ojo36tgc Mapping single-cell sequencing profiles to comprehensive reference datasets represents a powerful alternative to unsupervised analysis. Reference datasets, however, are predominantly constructed from single-cell RNA-seq data, and cannot be used to annotate datasets that do not measure gene expression. Here we introduce ‘bridge integration’, a method to harmonize singlecell datasets across modalities by leveraging a multi-omic dataset as a molecular bridge. Each cell in the multi-omic dataset comprises an element in a ‘dictionary’, which can be used to reconstruct unimodal datasets and transform them into a shared space. We demonstrate that our procedure can accurately harmonize transcriptomic data with independent single cell measurements of chromatin accessibility, histone modifications, DNA methylation, and protein levels. Moreover, we demonstrate how dictionary learning can be combined with sketching techniques to substantially improve computational scalability, and harmonize 8.6 million human immune cell profiles from sequencing and mass cytometry experiments. Our approach aims to broaden the utility of single-cell reference datasets and facilitate comparisons across diverse molecular modalities. Availability Installation instructions, documentations, and vignettes are available at http://www.satijalab.org/seurat In the same way that read mapping tools have transformed genome sequence analysis 1-3 , the ability to map new datasets to established references represents an exciting opportunity for the field of single-cell genomics. As an alternative to fully unsupervised clustering, supervised mapping approaches leverage large and well-curated references to interpret and annotate query profiles. This strategy is enabled by the curation and public release of reference datasets, as well as the development of new computational tools, including statistical learning 4-7 and deep learningbased approaches 8, 9 that have been successfully applied towards this goal. While powerful, a significant current limitation of existing approaches is their primary focus on single-cell RNA-seq (scRNA-seq) data. Single cell transcriptomics is well-suited for the assembly and annotation of reference datasets, particularly as differentially expressed gene markers can typically be interpreted to help annotate cell clusters. This has led to the development of highquality, carefully curated, and expertly annotated references, particularly from consortia including the Human Cell Atlas (HCA) 10 , Human Biomolecular Atlas Project (HuBMAP 11 ), and the Chan Zuckerberg Biohub 12 . Mapping to these references facilitates data harmonization, standardization of cell ontologies and naming schemes, and comparison of scRNA-seq datasets across experimental conditions and disease states. A crucial challenge is to extend reference-mapping to additional molecular modalities, including single-cell measurements of chromatin accessibility (e.g. scATAC-seq 13, 14 ) , DNA methylation (scBS-seq 15 ), histone modifications (scCUT&Tag 16, 17 ) , and protein levels (CyTOF 18 ), each of which measures a different set of features than scRNA-seq. The lack of transcriptome-wide measurements creates challenges for unsupervised annotation. Ideally, datasets from different modalities could be mapped onto scRNA-seq references, ensuring that established cell labels and ontologies would be preserved. We and others have proposed methods to map datasets across modalities [19] [20] [21] , but these make strict biological assumptions (for example, that accessible chromatin is associated with active transcription) that may not always hold true, particularly when analyzing cellular transitions or developmental trajectories 22 . Here we introduce 'bridge integration', which performs integration of single-cell datasets measuring different modalities by leveraging a separate dataset where both modalities are simultaneously measured as a molecular 'bridge'. The multi-omic bridge dataset, which can be generated by a diverse set of technologies 23-32 (Fig. 1a) , helps to translate information between disparate measurements, resulting in robust integration without requiring any limiting biological assumptions. We illustrate the broad applicability of our approach, demonstrating its performance across five different molecular modalities, and highlighting specific requirements for the multiomic dataset that can help to guide experimental design. Bridge integration leverages tools from a subfield of representation learning known as 'dictionary learning', which are commonly used in image analysis 33 . The goal of dictionary learning is to find a representation of the input data as a weighted linear combination of individual basic elements. We show that dictionary learning has multiple potential applications for single-cell analysis. Our bridge integration procedure is enabled by treating each cell in a multi-omic dataset as elements of a dictionary that can be utilized to reconstruct single-modality datasets. Moreover, we demonstrate how the development of compact dictionaries via dataset sketching can dramatically improve the computational efficiency of large-scale single-cell analysis, and enable rapid integration of dozens of datasets spanning millions of cells. We aimed to develop a flexible and robust integration strategy to integrate data from single-cell sequencing experiments where different modalities are measured ('single-modality datasets'). The fundamental challenge is that different single-modality datasets measure different sets of features. For example, scRNA-seq measures the expression level of individual transcripts, while scATAC-seq or scBS-seq measure DNA accessibility or methylation levels (Fig. 1a) . Previously proposed methods from our group and others [19] [20] [21] attempt to convert one set of features into another, for example, taking the gene-body sum of ATAC-seq signal (or the inverse of the DNA methylation levels), as a proxy for transcriptional output. While this conversion facilitates downstream integration, it assumes a strict and simplistic biological relationship between modalities that may not hold true, particularly in developing or transitioning systems. We reasoned that an alternative approach would be to leverage a multi-omic dataset as a bridge that can help to translate between disparate modalities. To perform this translation, we were inspired by the field of dictionary learning, a form of representation learning that is commonly utilized in image analysis and also genomics 33-36 . The goal of dictionary learning is to represent input data, a noisy image for example, in terms of individual elements. These elements, such as image patches, are called atoms and together comprise a dictionary. Reconstructing an image as a weighted linear combination of these atoms is an effective tool for denoising, and represents a transformation of the image dataset into a dictionary-defined space. We find that dictionary learning is a powerful tool for enabling cross-modality bridge integration at single-cell resolution. Our key insight is to treat a multi-omic dataset as a dictionary, with each individual cell's multi-omic profile representing an atom. We learn a 'dictionary representation' of each unimodal dataset based on these atoms. This transformation takes datasets in which completely different sets of features were measured and represents them each in a space where the defining features represent the same set of atoms (Fig. 1b) . Once different modalities can be represented using the same set of features, they can be readily aligned in a final step. Our bridge integration is illustrated in Fig. 1b and described fully in the Supplementary Methods, and we note a few key points below. First, our procedure makes no assumptions about the relationships between modalities, as these are learned automatically from the multi-omic dataset. Second, the key advance we present here is a transformation to project datasets profiling different modalities to be represented by a shared set of features. Once transformed, the final alignment step is compatible with a wide diversity of single-cell integration techniques including Harmony 37 , mnnCorrect 38 , Seurat 19 , Scanorama 39 , or scVI 40 . In this manuscript, we perform this step with an implementation of the mnnCorrect algorithm 38 . Lastly, we found that when working with sizable bridge datasets, the large number of atoms (single cells in the bridge dataset) created a substantial computational burden. Motivated by a similar problem addressed by Laplacian Eigenmaps 41 , we compute the graph laplacian for the multi-omic dataset, and calculate an eigendecomposition, thereby reducing the dimensionality from the number of atoms to the number of selected eigenvectors (Supplementary Methods). We then utilize these eigenvectors to transform the learned dictionary representations into the same lowerdimensional space, substantially increasing the efficiency of our bridge integration procedure. We first demonstrate our bridge integration strategy by performing cross-modality mapping on scATAC-seq and scRNA-seq samples of human bone marrow mononuclear cells (BMMCs). These samples consist of cells representing the full spectrum of hematopoietic differentiation, including hematopoietic stem cells, multi and oligopotent progenitors, and fully differentiated cells. As part of HuBMAP, we have leveraged public datasets to construct a comprehensive scRNAseq reference ('Azimuth reference'; 297,627 cells) of human BMMC, carefully annotating 10 progenitor and 25 differentiated cell states (Fig. 2a) . We aimed to map scATAC-seq 'query' datasets of human BMMC 42 (16,266 whole bone marrow profiles and 9,893 CD34 + enriched profiles) to this reference (Fig. 2b) . We used a 10x Multiome dataset 43 (32,368 cells paired snRNA-seq + scATAC-seq) that was publicly released as part of NeurIPS 2021, as a bridge. Our bridge procedure successfully mapped the scATAC-seq dataset on our Azimuth reference, enabling joint visualization and annotation of scATAC-seq and scRNA-seq data (Fig. 2c) . Reference-mapping also aligned shared cell populations across multiple samples, mitigating sample-specific batch effects. Query samples representing CD34 + BMMC fractions mapped exclusively to the HSC and progenitor components in the reference dataset, demonstrating that bridge integration can robustly handle cases where the query dataset represent a subset of the reference, while whole fractions mapped to all 35 cell states (Supplementary Fig. 1a) . Our reference-derived annotations were concordant with the annotations accompanying the query dataset produced by the original authors ( Supplementary Fig. 1b) , but we found that bridge integration annotated additional rare and high-resolution subpopulations. For example, our annotations separated monocytes into CD14 + and CD16 + fractions, NK cells into CD56 bright and CD56 dim subgroups, and cytotoxic T cells into CD8 + and mucosal associated invariant T (MAIT) subpopulations (Fig. 2d,e and Supplementary Fig. 1c,d) . While these subdivisions were not identified in the unsupervised scATAC-seq analysis, we confirmed these predictions by observing differential accessibility at canonical loci (i.e. elevated accessibility at the FCGR3A/CD16 gene locus in CD16 + monocytes), after grouping by reference-derived annotations. Similarly, bridge integration identified extremely rare groups of innate lymphoid cells (ILC; 0.15%), and recently discovered AXL + SIGLEC6 + (ASDC) dendritic cells 44,45 (0.10%) ( Fig. 2f and Supplementary Fig. 1e,f) . To our knowledge, these cell populations have not been previously identified in scATACseq data. Again, we found that differentially accessible sites, such as an ASDC-specific peak in the SIGLEC6 gene (Fig. 2f) , fully supported the accuracy of our mapping procedure. Our reference-mapping procedure not only enables the transfer of discrete annotations, but by projecting datasets from multiple modalities into a common space, allows us to explore how variation in one corresponds to variation in another. For example, after integration, we applied diffusion maps to the harmonized measurements to construct a joint differentiation trajectory spanning multiple progenitor states during myeloid differentiation (Fig. 2g) . Since this trajectory represents both reference and query cells, we can explore how pseudo-temporal variation in chromatin accessibility correlates with gene expression, even though the two modalities were measured in separate experiments. Consistent with previous findings, we identified cases where gene expression changes 'lagged' behind variation in chromatin accessibility. For example, while myeloperoxidase (MPO) expression is expressed in granulocyte-macrophage progenitors (GMP) and is associated with myeloid fate commitment 46,47 , the regulatory region immediately upstream acquired accessibility in lymphoid-primed multipotent progenitors (LMPP) (Fig. 2h-j) . We utilized a cross-correlationbased metric to systematically identify 236 'lagging' loci (Supplementary Methods) across this trajectory. KEGG pathway enrichment analysis revealed a strong enrichment for genes involved in cell cycle and DNA replication (Fig. 2k) . These loci were characterized by accessible chromatin at the earliest stages of differentiation (hematopoietic stem cells), but there is a delay before the associated genes become transcriptionally active (Fig. 2l) . The accessible state of these loci in the earliest progenitors may represent a form of priming to enable rapid cell-cycle entry once the decision to differentiate has been made and represents the type of discovery that can be enabled through integrative analysis across modalities. As our strategy relies on the ability for the dictionary to represent and reconstruct individual datasets, we explored how the size and composition of the multi-omic dataset affected the accuracy of integration. We sequentially downsampled the multi-omic dataset, repeated bridge integration, and compared the results with our original findings. Downsampling the bridge generally returned results that were concordant with the full analysis, but as expected, could affect annotation accuracy for rare cell types which are most sensitive to downsampling (Fig. 3a) . We found that if a bridge dataset contained at least 50 cells ('atoms') representing a given cell type, this was sufficient for robust integration. We note that this threshold is not a strict requirement; we found that integration can be successful for rare cell types such as ASDC even when fewer than ten cells are present in the bridge, but we also observed failure modes in this regime. We note that generating bridge datasets consisting of more than 50 cells per subpopulation is quite feasible for many multi-omic technologies, and that our findings represent guidelines to assist in experimental design when performing multi-omic experiments. We next compared the performance of bridge integration against two recently proposed methods for integrated analysis of multimodal and single-modality datasets. Both multiVI 48 and Cobolt 49 utilize variational autoencoders for integration, and while they do not explicitly treat multi-omic datasets as a bridge, they aim to integrate datasets across technologies and modalities into a shared space. When applied to the previously described datasets, both methods were broadly successful in integrating scRNA-seq and scATAC-seq data, but did not identify matches at the same level of resolution (for example, neither method successfully matched ASDC in scATACseq data to the ASDC in the Azimuth reference) ( Fig. 3b and Supplementary Fig. 2a-c) . When comparing computational efficiency, bridge integration (0.8 hours, not including 1.2 hours of preprocessing time), and Cobolt (3.3 hours) were the most efficient methods, while multiVI required more computational resources (15.7 hours). To quantitatively benchmark performance, we split the 10x multi-omic bridge dataset into two groups. In one group, we treated the scRNA-seq and scATAC-seq data as if they were from separate experiments, representing a benchmark dataset for integration where ground-truth correspondences were known. The second group of cells was used as a multi-omic bridge dataset. After aligning cells across modalities, we calculated the Jaccard similarity metric between each scATAC-seq cell and its matched scRNA-seq counterpart. We found that our bridge integration strategy consistently maximized this similarity metric, demonstrating that our procedure most effectively matched cells in the same biological state across modalities ( Fig. 3c and Supplementary Fig. 2d) . Consistent with our previous results, we found that the strongest improvements were observed when mapping rare cell types including plasma cells and dendritic cells (Supplementary Fig. 2d ). As our procedure is compatible with multiple integration techniques, we compared the performance of bridge integration when using either mnnCorrect 38 or Seurat v3 19 for the final alignment step, and observed very similar results (Supplementary Fig. 2d ). As a second quantitative benchmark with ground-truth data, we pursued a similar strategy using a recently published Paired-Tag dataset 26 , where individual histone modification binding profiles via scCUT&Tag were simultaneously measured with RNA transcriptomes. Since each Paired-Tag experiment was performed with biological replicates, we used one replicate as a multi-omic bridge dataset and split the other replicate into separate modalities for benchmarking. We performed cross-modality integration between scRNA-seq and scCUT&Tag for active histone marks (H3K27ac), repressive histone marks (H3K27me3), and enhancer histone marks (H3K4me1). In each case, bridge integration successfully integrated cells across modalities, and returned the highest Jaccard similarity between matched scRNA-seq and scCUT&Tag profiles ( Fig. 3c and Supplementary Fig. 2e-g) . To further demonstrate the flexibility of our approach, we used bridge integration to map and annotate a snmC-seq dataset, which measures DNA methylation profiles in single cells from the human cortex 50 . As a reference, we utilized a dataset from the Allen Brain Atlas which defines a taxonomy of cell-types in the human cortex, and is accompanied by an expertly curated and multilevel cell ontology 51 . Using a snmC2T-seq dataset which simultaneously measures methylation and gene expression as a bridge 28 , we were able to annotate the snmC-seq profiles with high confidence (Supplementary Fig. 2h) . Even when our reference-derived annotations did not augment the resolution to unsupervised clustering of snmC-seq data, they did add substantial interpretability ( Fig. 3d-f ). For example, unsupervised clustering identified multiple populations of L6 neurons (labeled as L6-1, L6-2, and L6-3), but RNA-assisted annotation clearly labeled these clusters as either 'Near Projecting' (NP) or deep neocortical laminar 6b (L6b) excitatory neurons (Fig. 3f) . Taken together, these results demonstrate the accuracy, robustness, and flexibility of our bridge integration procedure. We demonstrate applications on multiple modalities and data types, as well as best-in-class performance via quantitative and ground-truth benchmark comparisons. We demonstrate how cross-modality mapping can help interpret and improve the resolution of cell type annotation, including extremely rare cell types whose identification is facilitated by curated annotation in a reference dataset. Moreover, projecting datasets into a harmonized space also enables exploration of cross-modality relationships. The recent increase in publicly available single-cell datasets poses a significant challenge for integrative analysis. For example, multiple tissues have now been profiled across dozens of studies, representing hundreds of individuals and millions of cells. We refer to the challenge of harmonizing a broad swath (or the entirety) of publicly available single-cell datasets from a single organ as 'community-wide' integration. While a rich diversity of analytical methods can harmonize datasets of hundreds of thousands of cells, performing unsupervised 'community-wide' integration remains challenging, even when analyzing a single modality. We were inspired by previous work on 'geometric sketching' which first selects a representative subset of cells (a 'sketch') across all datasets, integrates them, and then propagates the integrated result back to the full dataset 52, 53 . This pioneering approach substantially improves the scalability of integration as the heaviest computational steps are focused on subsets of the data. However, this approach is dependent on the results of principal components analysis that must first be performed on the full dataset. As datasets continue to grow in scale, more sophisticated computational infrastructure is required to load full collections of data into memory, and even performing dimensional reduction can become a limiting step. We aimed to devise a strategy that could integrate large compendiums of datasets, without ever needing to simultaneously analyze or perform intensive computation on the full set of cells. We reasoned that dictionary learning could also enable efficient and large-scale integrative analysis. We first select a representative sketch of cells (i.e. 5,000 cells) from each dataset, and treat these cells as atoms in a dictionary (Fig. 4a , Supplementary Methods). We next learn a dictionary representation, representing a weighted linear combination of atoms that can reconstruct the full dataset. These steps can occur for each dataset independently, allowing for efficient processing. We then perform integration on the atoms from each dataset. This is the only step that simultaneously analyzes cells from multiple datasets, but since only the atoms are considered, this does not impose scalability challenges. Finally, we apply our previously learned dictionary representations to the harmonized atoms from each dataset individually, and reconstruct harmonized profiles for the full dataset. We refer to this procedure as 'atomic sketch integration'. We highlight that for this application, the 'atoms' used to reconstruct a dataset represent a subset of cells from the dataset itself. Contrastingly, in bridge integration, the atoms refer to cells from a different (multi-omic) dataset. The success of atomic sketch integration rests on identifying a representative subset of cells for each dataset. Sketching techniques for single-cell analysis aim to find subsamples that preserve the overall geometry of these datasets [52] [53] [54] . These methods do not require a pre-clustering of the data, but aim to ensure that the sketched dataset represents both rare and abundant cell states, even after downsampling. Here, we perform sketching using a leverage-score sampling based strategy that has been proposed for large-scale information retrieval problems 55 and can be rapidly and efficiently computed on sparse datasets (Supplementary Methods). We emphasize that atomic sketch integration represents a general strategy for improving scalability that can be broadly coupled with existing methods. For example, a wide variety of integration techniquesincluding Harmony 37 , Scanorama 39 , mnnCorrect 38 , scVI 40 , and Seurat 19 , can be used to integrate the atom elements in each dictionary, with our procedure then enabling these results to be extended to full datasets. To demonstrate the potential of atomic sketch integration to perform 'community-wide' analysis, we first considered scRNA-seq datasets of the human lung. During the COVID-19 pandemic, there has been widespread scRNA-seq data collection from respiratory tissues, particularly by the Human Cell Atlas Lung Biological Network 56 . Leveraging a recently published 'database' of scRNA-seq studies 57 , as well as collection of openly released lung and upper airway datasets from the Human Cell Atlas, we assembled a group of 19 datasets spanning 1,525,710 total cells. We created an atomic dictionary consisting of 5,000 cells from each dataset (95,000 total atoms), integrated these cells, and then reconstructed the full datasets. Our atomic sketch integration procedure performed all these steps (including preprocessing) in 55 minutes, using a single computational core. Our results exhibit the advantages of community-scale integration compared to individual analysis. First, by matching biological states across datasets and technologies, the integrated reference can help to standardize cell ontologies and naming schemes (Fig. 4b,c) . When observing previously assigned annotations derived from each study, we found that matched cell populations were often assigned slightly different names (Supplementary Fig. 3a) . Unsupervised integration at this scale is a valuable tool for identifying these conflicts and can assist in the development of authoritative and standardized cell ontologies. As a second benefit, we found that community-scale integration enabled consistent identification of ultra-rare populations, and in particular, a population of Foxi1-expressing 'pulmonary ionocytes' that were recently discovered in both human and mouse lungs 58 (Fig. 4d) . While these cells were only independently annotated in 6 out of 19 studies, our integrated analysis discovered at least one pulmonary ionocyte in 17 out of 19. The identified ionocytes were extremely rare (0.047%), but exhibited clear expression of canonical markers (Fig. 4b) , highlighting the potential value for pooling multiple datasets to characterize these cells. We note that selection of dictionary atoms by sketching, or leverage-score sampling is essential for optimal performance ( Supplementary Fig. 3b,c) ; repeating the analysis using a set of atoms determined by random downsampling successfully integrated abundant cell types, but failed to integrate ionocytes as they were not sufficiently represented in the dictionary. Finally, we found that community-scale integration can substantially improve the identification of differentially expressed (DE) cell-type markers. The use of 19 study replicates specifically enables us to identify genes that show consistent patterns across laboratories and technologies, representing robust and reproducible markers. We grouped cells by both sample replicate and cell type identity, and performed differential expression on the resulting pseudobulk profiles ( Fig. 4e and Supplementary Fig. 4) . For example, we identified 116 positive markers for pulmonary ionocytes, representing one of the deepest transcriptional characterizations of this cell type. These markers included both canonical markers such as the transcription factor FOXI1, but also revealed clear ontology enrichments for ATPases (e.g. ATP6V1G3, ATP6V0A4) and chloride channels (e.g. CLCNKA, CLCNKB, CFTR), supporting the role of these cells in regulating chemical concentrations in the lung (Fig. 4f) . One advantage of working with pseudobulk values is increased quantification accuracy for lowly expressed genes. Indeed, we repeatedly found that top DE markers found using this strategy tended to capture more genes at a lower range of average expression values (Fig. 4g) . As a final demonstration, we considered a similar problem of community-wide integration for circulating human peripheral blood cells, which is one of the most widely profiled systems with diverse single-cell technologies. Exploring publicly available studies of either COVID-19 samples or healthy controls, we accumulated a collection of 14 studies with scRNA-seq measurements, representing a total of 3.46M cells from 639 individuals. Data from 11 of the studies was obtained from a recently published collection of standardized single-cell sequencing datasets 59 . We performed unsupervised atomic sketch integration, yielding a harmonized collection in which we annotated 30 cell states (Fig. 5a) . As a subset of our samples were not depleted for granulocytes, our collection includes a distinct population of neutrophils that were absent in our previous Azimuth reference of human PBMC. Moreover, we identified specific populations of activated granulocytes and B cells that were specific to COVID-19 samples (Supplementary Fig. 5a ). Consistent with previous reports, monocytes in COVID-19 samples sharply upregulated interferon response genes 60,61 , but were correctly harmonized with healthy monocytes ( Fig. 5b and Supplementary Fig. 5b) . By matching shared cell types across disease states (while still allowing for the possibility of disease-specific subpopulations), this collection represents a valuable resource for identifying cell-type specific transcriptional changes that reproduce across multiple studies. We characterized cell-type specific responses for eight additional cell types, each of which exhibited a conserved interferon-driven response alongside the activation of cell typespecific response genes (Supplementary Fig. 6 ). While single-cell sequencing technologies are capable of measuring RNA transcripts and surface proteins in thousands of single cells, cytometry-based techniques can measure both extracellular and intracellular proteins in millions of cells. As our bridge integration procedure should enable the mapping of CyTOF profiles onto scRNA-seq datasets, we obtained a collection of CyTOF datasets spanning 119 individuals and 5,170,249 total cells 62 . We used our previously collected CITE-seq dataset of 161,764 PBMC from healthy donors as a multi-omic bridge 4 . The CyTOF and CITE-seq dataset both shared 30 cell surface protein features, while the CyTOF dataset also measured 17 unique proteins which included intracellular targets that cannot be measured via CITE-seq. Bridge integration annotated each CyTOF dataset with cluster labels derived from our 3.46M cell scRNA-seq collection, and allowed us to infer intracellular protein levels for each of these clusters (Fig. 5c) . Predicted regulatory CD4 + T cells expressed high levels of the transcription factor Foxp3 63 , and effector T cells exhibited enriched Klrg1 levels 64 (Fig. 5d) . We also found that among cytotoxic lymphocyte populations, MAIT cells were uniquely depleted for expression of the cytotoxic protease Granzyme B, consistent with previous reports 65 . Each of these patterns supports the accuracy of our cross-modality mapping. Finally, we successfully annotated a rare populations of innate lymphoid cells (0.024%), which were not independently identified in the CyTOF dataset, but correctly exhibited a CD25 + CD127 + CD161 + CD56immunophenotype 4,66 (Fig. 5d,e) . Taken together, we conclude that dictionary learning enhances the scalability of integration, as well as the ability to integrate and compare diverse molecular modalities. In order to map datasets measuring a diverse set of modalities to scRNA-seq reference datasets, we developed bridge integration, an approach for cross-modality alignment that leverages a multiomic dataset as a bridge. We characterize specific compositional requirements for the bridge dataset, perform quantitative benchmarking analyses with ground-truth datasets, and demonstrate the broad applicability of our method to a wide variety of technologies and modalities. Finally, we demonstrate how to use atomic sketch integration to extend the scalability of our approach to harmonize dozens of datasets spanning millions of cells. We anticipate that our methods will be valuable to both individual labs but also larger consortia that have already invested in constructing and annotating comprehensive scRNA-seq references. For example, the Human Cell Atlas, Human Biomolecular Atlas Project, Tabula Sapiens 67 , and Human Cell Landscape 68 , all have released scRNA-seq references spanning hundreds of thousands of cells for multiple human tissues. Similar efforts are present in model organisms as well, including the Fly Cell Atlas 69 , and Plant Cell Atlas projects 70 . In each case, these efforts involve careful, collaborative, and expert-driven cell annotation alongside the curation of reference cell ontologies. While repeating this manual effort for each modality is infeasible, bridge integration enables the mapping of new modalities without having to modify the reference. As additional multi-omic datasets become available, we expect that tools such as Azimuth will begin to map additional modalities as well. We note that the bridge integration is particularly well-suited for experimental designs where multiomic technologies can be applied to a subset, rather than all, experimental samples. This is a common occurrence, particularly because multi-omic technologies often are associated with increased cost, lower throughput, and reduced data quality for each individual measurement type. In particular, we note that combinatorial indexing approaches can be readily applied using commercial instrumentation to profile a single modality in hundreds of thousands of cells 71, 72 , but the same is not true for multi-omic technologies. We propose that the collection of large singlemodality datasets, harmonized via a smaller but representative multi-omic bridge, may represent an efficient and robust strategy to explore cross-modality relationships across millions of cells. Our identification of cell cycle 'priming' in hematopoietic stem cells represents an example of cross-modality insights that can be derived via bridge integration. We note that future extensions of our work can further broaden the applicability of bridge integration or demonstrate its potential in new contexts. For example, performing bridge integration on spatially resolved unimodal datasets (e.g. CODEX 73 ), could help to better characterize the spatial localization of scRNA-seq defined cell types in large tissue sections. New multi-omic technologies that couple high-resolution mass spectrometry imaging to single-cell or spatial transcriptomics could serve as a bridge to harmonize lipidomic and metabolic profiles 74, 75 with sequencing-based references. In addition, future computational improvements will further lower the requirements of the bridge dataset, enabling robust integration with an even smaller number of multi-omic cells. We emphasize the ability for bridge and atomic sketch integration to identify and characterize rare cell populations, including AXL + SIGLEC6 + dendritic cells and pulmonary ionocytes. Single cell transcriptome profiling played an essential role in the initial discovery of these cell types, but a deeper understanding of their biological role and function will benefit from multimodal characterization. The goal of moving beyond an initial taxonomic classification of cell types towards a complete multimodal reference will not be accomplished with a single experiment or technology. We envision that computational tools for cross-modality integration will play key contributions to the construction of this map. Author Contributions: TS, YH, and RS conceived the research. YH, TS, MK, SC, PH, AH, AS, GM, SM performed computational analysis, supervised by CG and RS. YH TS and RS wrote the manuscript, with input and assistance from all authors. The authors would like to thank all the members of the Satija Lab for thoughtful discussions related to this work. We thank Andrew Butler and Harsh Srivastava for assistance in identifying and locating scRNA-seq datasets from human lung and PBMC. We acknowledge the Gottardo and Newell labs for publicly releasing a standardized compendium of human PBMC scRNA-seq datasets. This work was supported by the Chan Zuckerberg Initiative (EOSS-0000000082, HCA-A-1704-01895 to R.S.), and the NIH (K99HG011489-01 to T.S; K99CA267677 to A.S; RM1HG011014-02, 1OT2OD026673-01, DP2HG009623-01, R01HD096770, R35NS097404 to R.S). (a) Per cell-type prediction accuracy of bridge integration, based on the number of cells representing each cell type in the multi-omic dataset. Accuracy results were obtained by serially downsampling the multi-omic dataset, repeating bridge integration, and comparing resulting query annotations with those derived from the full dataset. Boxplots represent the observed range of values across 20 cell types. (b) Coverage plots for the SIGLEC6 locus, after performing crossmodality annotation with bridge integration, multiVI, and Cobolt. Only cells called as ASDC by bridge integration exhibit celltype-specific accessibility at this locus. Additional loci shown in Supplementary Fig. 2. (c) Ground truth benchmarking analysis. RNA and ATAC profiles from a 10x multiome dataset were unpaired and integrated. Barplots show the average Jaccard similarity between each scATAC-seq cell and its matched scRNA-seq cell. Results are split by individual cell types in Supplementary Fig. 2 . Results are also shown for Paired-Tag datasets for three histone modification profiles. In each case, bridge integration achieves the highest Jaccard similarity. (d) scRNA-seq reference of the human motor cortex. (e,f) Mapping of single cell DNA methylation profiles of human cortical cells onto the reference using a snmC2T-seq multi-omic dataset as a bridge. Cells are colored by the methylation-derived annotations from the original study (e), or the scRNA-seq derived labels from bridge integration (f). Reference-derived labels at higher levels of granularity are shown in Supplementary Fig. 2 . (a) UMAP visualization of the mapped scATAC-seq dataset in Fig. 2 , computed using the multiVI and Cobolt, and Bridge integration approaches. In Fig. 2 , we visualize the results of bridge integration on the scRNA-seq reference-derived UMAP. Here, we compute a new UMAP visualization to enable comparison with alternative methods. (b) Since we do not have ground truth of cell labels for this dataset, we visualize chromatin accessibility patterns to assess annotation accuracy. At the CD8A locus, bridge integration shows the clearest celltype-specific accessibility patterns, suggesting that alternative methods blend CD8 and CD4 naïve cell populations together during integration. (a) UMAP visualization of integrated scRNA-seq profiles from human PBMC. Same as in Fig. 5a , but split into cells from healthy controls, or samples associated with mild, moderate, and severe COVID. Activated B cells are at sharply reduced frequency in the healthy control samples (which were also depleted for granulocytes). Activated granulocytes are observed primarily in moderate and severe COVID samples. (b) Same as Fig. 5b, but showing cell type-specific responses that correlate with disease severity for eight additional cell states. Each column represents a pseudobulk average from one of 506 individuals. Expression of the module is correlated with disease severity within the individual, which is indicated by the color scale above the heatmap. Heatmap showing the top transcriptional markers of 30 cell states that are consistent across multiple studies. Each column represents a pseudobulk average of all cells from a single cell type and single study. The top ten markers are shown for each cell state. Unsupervised labels (from DNA methylation) scRNA-seq reference Predicted labels (annotated from scRNA-seq) Coverage plots for the SIGLEC6 locus, after performing cross-modality annotation with bridge integration, multiVI, and Cobolt. Only cells called as ASDC by bridge integration exhibit celltype-specific accessibility at this locus. Additional loci shown in Supplementary Fig. 2. (c) Ground truth benchmarking analysis. RNA and ATAC profiles from a 10x multiome dataset were unpaired and integrated. Barplots show the average Jaccard similarity between each scATAC-seq cell and its matched scRNA-seq cell. Results are split by individual cell types in Supplementary Fig. 2 . Results are also shown for Paired-Tag datasets for three histone modification profiles. In each case, bridge integration achieves the highest Jaccard similarity. (d) scRNA-seq reference of the human motor cortex. (e,f) Mapping of single cell DNA methylation profiles of human cortical cells onto the reference using a snmC2T-seq multi-omic dataset as a bridge. Cells are colored by the methylation-derived annotations from the original study (e), or the scRNA-seq derived labels from bridge integration (f). Reference-derived labels at higher levels of granularity are shown in Supplementary Fig. 2 . Supplementary Fig. 4 . (f) GO ontology enrichment terms for ionocyte markers. (g) Expression distributions of top transcriptional markers recovered from single-cell differential expression analysis (red), or pseudobulk analysis (blue). All atoms All scRNA cells Learn dictionary representations Reconstruction from the integrated atoms T re g C D 4 T C M C D 4 T C M C D 4 N a iv e C D 4 N a iv e C D 4 N a iv e C D 4 N a iv e C D 8 N a iv e C D 8 N a iv e C D 8 N a iv e C D 8 N a iv e C D y sm oo th mu sc le Alv eo lar Fib rob las t alv eo lar ma cro ph ag e art eri al en do the lia l ce ll AT 1 AT 2 B ce ll ba sa l ba so ph il bro nc hia l cili ate d CA P1 CA P2 CD 4+ T ce ll e ec tor me mo ry CD 8+ T ce ll e ec tor me mo ry CD 8+ T ce ll na ive cD C1 my elo id de nd riti c ce ll cD C2 my elo id de nd riti c ce ll cla ss ica l mo no cy te clu b go ble t Int ers titi al ma cro ph ag e ion oc yte lym ph ati c en do the lia l ce ll me so the lia l ce ll My o bro bla st na tur al kill er ne uro en do cri ne ne utr op hil no n-c las sic al mo no cy te pe ric yte pla sm a ce ll pla sm ac yto id de nd riti c ce ll pro life rat ing NK /T pro life rat ing alv eo lar ma cro ph ag e pro life rat ing ba sa l pu lm on ary ve no us en do the lia l ce ll se rou s Sq ua mo us SERINC2 KRT23 PROM2 SPRR1B IVL S100A16 EPS8L1 KRT6A TRIM29 IL36A SLPI ZG16B DMBT1 PRB3 LTF SCGB1A1 NUAK1 SRPX PLAT CCL23 LIFR RAMP3 PRSS23 SELP PLA1A CKS2 LMNB2 ANLN UBE2C CDK1 HELLS C19orf48 TK1 ZWINT TOP2A CENPU POC1A COQ2 GTSE1 AURKA HMMR TCF19 RAD51AP1 HERC5 MKI67 SKA2 AURKB TMPO CCNA2 NUF2 KIFC1 NDC80 HMGB2 NUSAP1 IDH3A CLEC4C SMPD3 PLAC8 PLD4 PTCRA SHD LILRA4 IRF8 IRF7 XBP1 IGHG1 JCHAIN PIM2 DERL3 JSRP1 TNFRSF17 MZB1 IGKV6D-21 IGKV2D-40 NOTCH3 ARHGEF17 LZTS1 COX4I2 CDH6 HIGD1B GJC1 KCNK3 PDGFRB FAM162B C19orf38 CTSS AIF1 GPBAR1 COTL1 LILRA1 LST1 LILRB2 LRRC25 IL1R2 SOD2 MXD1 PLEK LILRB3 LUCAT1 HCAR3 NAMPT GK AQP9 CHGA CHGB SYT1 GP2 INA GRP CALCA DNAJC12 CPLX2 GNG4 CX3CR1 IL2RB S1PR5 TTC38 TBX21 SH2D1B KLRF1 PRF1 COL5A1 POSTN ASPN COL6A3 COL1A2 THBS2 FKBP10 COL3A1 CTHRC1 COL1A1 CRB2 CALB2 FAM110C TFPI2 HAS1 C3 BNC1 PRG4 ITLN1 GALNT9 NRP2 TFPI PARD6G PROX1 KBTBD11 FLT4 KLHL4 MMRN1 CCL21 GPR182 CLCNKB STAC2 C9orf152 FAM131C BSND ATP6V1A CEL RASSF6 HEPACAM2 FOXI1 SLCO2B1 MS4A4A MS4A6A STAB1 MFSD1 GAL3ST4 FPR3 CD209 FOLR2 LGMN AGR2 SCGB3A1 MUC5B CXCL6 MSMB CYP2F1 CP SPDEF BPIFB1 MGST1 PIGR KLK11 CLIC6 MET HSD17B13 SFTPB C16orf89 CYB5A SCGB3A2 ASGR2 TNFRSF1B RNASE2 EREG NLRP3 S100A12 FGR FCN1 VCAN CD300E HLA-DPB1 HLA-DQB1 S100B CD207 CD1A FCGR2B FCER1A CPVL CD1C CLEC10A CD80 FSCN1 HMSD CD83 GPR157 IDO1 DAPP1 CCL22 BIRC3 CCR7 NKG7 KLRC1 TBC1D10C GNLY GZMH KLRD1 GZMA CD3E CCL5 LAG3 CD3D TIGIT CD2 GZMK CD8B CD8A TNFSF8 CTLA4 PIK3IP1 TRAT1 LEPROTL1 CD6 IL7R CD28 TCF7 CD40LG EFNB1 TSPAN12 ROBO4 SOSTDC1 PCDH12 HPGD PRX B3GALNT1 EDNRB EMCN CD93 ADGRL4 STC2 GPIHBP1 ARHGEF15 VWF ENG PLVAP ITGA6 NOSTRIN C9orf135 MAPK15 ZBBX MDH1B EFHC1 RSPH9 LRRC71 ROPN1L C1orf158 FBXO15 TPSAB1 NTRK1 MS4A2 RGS13 VWA5A SLC18A2 CPA3 KIT HDC HPGDS PLCH2 IRF6 FAT2 PTPRZ1 GPR87 KRT5 TNS4 KRT17 TP63 KRT15 VPREB3 FCER2 FCRLA TNFRSF13C CXCR5 CD79B LINC00926 BANK1 CD19 MS4A1 SLC34A2 SFTPA2 NAPSA ABCA3 PEBP4 CPB2 SFTPA1 ALOX15B PLA2G4F MFSD2A ANXA3 ANKRD29 RAB11FIP1 LMO7 ANKRD1 GPRIN2 AGER GPRC5A RTKN2 EMP2 NPR3 SOX17 TSPAN2 BMX HEY1 ENPP2 LINC00551 EFNB2 DKK2 GJA5 GLIPR2 APOC1 FBP1 LTA4H OLR1 MSR1 MARCO ACP5 GPD1 CYP27A1 FBLN5 SCN7A ANGPT1 WNT2 HSD11B1 PLEKHH2 A2M GPC3 FGFR4 BMP5 PLN CNN1 LMOD1 ATP1A2 SPEG ITIH3 HSPB7 DES MYH11 ACTG2 OSR1 DCN SFRP2 PODN CCDC80 FBLN1 SFRP1 PI16 MFAP5 SCARA5 Ad ve nti tia l Fib rob las t air wa y sm oo th mu sc le Alv eo lar Fib rob las t alv eo lar ma cro ph ag e art eri al en do the lia l ce ll AT 1 AT 2 B ce ll ba sa l ba so ph il bro nc hia l cili ate d CA P1 CA P2 CD 4+ T ce ll e ec tor me mo ry CD 8+ T ce ll e ec tor me mo ry CD 8+ T ce ll na ive cD C1 my elo id de nd riti c ce ll cD C2 my elo id de nd riti c ce ll cla ss ica l mo no cy te clu b go ble t Int ers titi al ma cro ph ag e ion oc yte lym ph ati c en do the lia l ce ll me so the lia l ce ll My o bro bla st na tur al kill er ne uro en do cri ne ne utr op hil no n-c las sic al mo no cy te pe ric yte pla sm a ce ll pla sm ac yto id de nd riti c ce ll pro life rat ing NK /T pro life rat ing alv eo lar ma cro ph ag e pro life rat ing ba sa l pu lm on ary ve no us en do the lia l ce ll se rou s Sq ua mo us LGALS3BP IFI35 IFI44L IFI27 XAF1 ODF3B IRF7 OASL PLSCR1 IFI6 ANKRD22 TRIM22 IGFLR1 LAP3 HERC5 EPSTI1 GRN SERPING1 PLAC8 NCF1 LY6E EPSTI1 ISG15 IFITM3 PSMB9 LGALS9 IFI27 IFI44L IFI6 CMPK2 STK4 TAP1 SLC31A2 IRF7 SAMD9L DRAP1 IFITM1 BST2 LGALS3BP IFI35 LY6E STXBP2 EPSTI1 ALDOA IFI44L TNFSF13 IFI44 UBE2L6 IL2RG MNDA STAT1 LGALS9 XAF1 EIF2AK2 IFI27 ISG15 PLSCR1 MX1 ISG20 SAMD9L XAF1 IFI44L HERC5 MX1 IFI6 IRF7 SAMD9L EPSTI1 EIF2AK2 ISG15 LY6E PLSCR1 SRSF11 IFITM1 IRF9 IFIT3 TRIM22 IL2RG TRAPPC5 IFI6 IFI44L EPSTI1 XAF1 IRF7 EIF2AK2 ISG15 CSNK2B IFIT3 MX1 IRF9 IFI27 IL2RG ARPC1B RPS17 RNASEK CTSD TOMM6 EEF1G NDUFA13 IFI44L XAF1 ISG15 IL2RG CORO7 IFI6 IRF9 RPS17 PTPRCAP GOLGA4 RNASEK TRAPPC5 LY6E IRF7 TOMM6 EIF2AK2 SMCHD1 CSNK2B GIMAP1 EEF1G IFI6 XAF1 IRF7 ISG15 IFI44L EIF2AK2 MX1 RPS17 TAP2 CSNK2B RNASEK ARPC4 IL2RG CTSD NDUFA13 TOMM6 EEF1G KLRK1 EIF4A1 PTPRCAP IFI6 TAP2 IFI44L MICU2 IRF9 XAF1 ICE1 CSNK2B TRIM22 ARID4B SRSF10 SON CD46 PLSCR1 USP8 ARPC4 SMCHD1 TLK1 SLC4A10 ADAR (a) UMAP visualization of integrated scRNA-seq profiles from human PBMC. Same as in Fig. 5a , but split into cells from healthy controls, or samples associated with mild, moderate, and severe COVID. Activated B cells are at sharply reduced frequency in the healthy control samples (which were also depleted for granulocytes). Activated granulocytes are observed primarily in moderate and severe COVID samples. (b) Same as Fig. 5b , but showing cell type-specific responses that correlate with disease severity for eight additional cell states. Each column represents a pseudobulk average from one of 506 individuals. Expression of the module is correlated with disease severity within the individual, which is indicated by the color scale above the heatmap. Tre g LINC00426 ICOS IKZF2 CTLA4 STAM FANK1 DUSP4 IL2RA RTKN2 FOXP3 RACGAP1 MCM7 CHEK1 TYMS UHRF1 LIG1 MCM5 DNMT1 CHAF1B STMN1 DNM3 PDE5A RAB27B TNNC2 ZNF185 GNAZ TMEM40 CXCL5 ALOX12 HGD SUB1 MANF KPNA2 PBK CALR BUB1 GTSE1 SHCBP1 LRRC59 PDIA6 FKBP11 SEC11C PRDX4 CD38 IGF1 IGHA1 ERLEC1 IGHA2 MZB1 SSR4 PHEX PACSIN1 TMEM8B ZFAT SHD TPM2 SERPINF1 SCAMP5 LAMP5 SMPD3 PYROXD2 NCAM1 IL2RB CMC1 XCL2 ZMAT4 XCL1 KIR2DL4 SPTSSB KLRC1 TKTL1 NMUR1 CD160 S1PR5 PRF1 TTC38 PRSS23 CD247 SPON2 CREB5 STEAP4 MXD1 MCL1 VNN2 FFAR2 IL1RN AQP9 NABP1 CMTM2 TSPAN15 COLQ KLRB1 RORC ADAM12 GPR171 NCR3 ME1 IL23R SLC4A10 HPGD LINC01229 GIMAP6 TTLL10 KIT FXYD7 TNFRSF18 GATA3 TNFRSF4 PTGDR2 CDK6 CYTL1 CPXM1 CRHBP BAALC PROM1 NPR3 CD34 TEKT1 AGR2 DEFA3 CEBPE S100P HP PGLYRP1 BPI S100A8 ANXA3 RETN DEFA8P FEZ1 GZMM COL6A2 TRDC TRGV9 TRDV2 KLRG1 TRGC1 EPB42 FAM210B SLC25A39 GYPA DCAF12 AHSP GYPB ALAS2 IFIT1B SLC4A1 S1PR4 LAYN CLDND1 SH2D1A DENND2D PGM2L1 IL10 FXYD2 CHI3L2 TIMD4 LGALS2 PID1 CACNA2D3 CD1E ENHO MRC1 CD1D FCER1A CLEC10A CD1C PLCD1 HLA-DPB1 BATF3 CCND1 RAB7B CPVL IDO1 C1orf54 SLAMF8 CLEC9A PPP2R2B TRGV4 FGFBP2 CCL5 CD3G FCRL6 GZMH ZNF683 CD8A EPHA1-AS1 PDE3B LINC02446 LRRN3 OXNAD1 CD8B NELL2 CD248 REG4 PTGER2 KDSR GZMK SLAMF1 DPP4 LYAR NSG1 IL7R SERINC5 RCAN3 CDC14A INPP4B TRADD CD40LG AP3M2 CAMK4 AK5 TRAT1 LEF1 FHIT TRABD2A TSHZ2 CHRM3-AS2 MAL ANKRD55 GPBAR1 HMOX1 LRRC25 CTSL LYPD2 SIGLEC10 ZNF703 MS4A7 PPP1R17 CDKN1C RBP7 CD163 LRP1 HNMT FCN1 NRG1 ASGR1 CD14 VCAN STAB1 AFF3 BACH2 VPREB3 FADS3 IL4R TCL1A ABCB4 DSP SLC38A11 COL19A1 BLK LINC01781 AIM2 BTNL9 BAIAP3 TEX9 GRAMD1C RASSF6 COCH SSPN RALGPS2 KLK1 EBF1 CLECL1 BANK1 CNFN MS4A1 MPP6 SYT1 ARL4D ARHGAP24 ABCA6 COL9A3 FMOD ZBTB32 ARHGAP44 RAPGEF3 RASGRF1 APOD METTL8 VASH1 ADAM33 FCGRT UPK3A TNNI2 TRPM4 HAMP CDH1 SIGLEC6 Figure 6 . Heatmap of reproducible gene expression markers in human PBMC cell states. Related to Figure 5 . Heatmap showing the top transcriptional markers of 30 cell states that are consistent across multiple studies. Each column represents a pseudobulk average of all cells from a single cell type and single study. The top ten markers are shown for each cell state. BLAT-the BLAST-like alignment tool Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Fast and accurate short read alignment with Burrows-Wheeler transform Integrated analysis of multimodal single-cell data Efficient and precise single-cell reference atlas mapping with Symphony scmap: projection of single-cell RNA-seq data across data sets Cross-tissue immune cell analysis reveals tissue-specific adaptations and clonal architecture in humans. bioRxiv Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models Mapping single-cell data to reference atlases by transfer learning The human body at cellular resolution: the NIH Human Biomolecular Atlas Program Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris Single-cell chromatin accessibility reveals principles of regulatory variation Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing Genome-wide base-resolution mapping of DNA methylation in single cells using single-cell bisulfite sequencing (scBS-seq) Single-cell CUT&Tag analysis of chromatin modifications in differentiation and tumor progression Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum Comprehensive integration of single-cell data Joint analysis of heterogeneous single-cell RNA-seq dataset collections Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity Chromatin state dynamics during blood formation Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex Comparative cellular analysis of motor cortex in human, marmoset and mouse Geometric Sketching Compactly Summarizes the Single-Cell Transcriptomic Landscape Hopper: a mathematically optimal algorithm for sketching biological data mbkmeans: Fast clustering for single cell data using mini-batch k-means Low-rank approximation and regression in input sparsity time The Human Lung Cell Atlas: A High-Resolution Reference Map of the Human Lung in Health and Disease A curated database reveals trends in singlecell transcriptomics A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte Single-cell immunology of SARS-CoV-2 infection The type I interferon response in COVID-19: implications for treatment A single-cell atlas of the peripheral immune response in patients with severe COVID-19 A blood atlas of COVID-19 defines hallmarks of disease severity and specificity Regulatory T cells and Foxp3 Increased expression of the NK cell receptor KLRG1 by virus-specific CD8 T cells during persistent antigen stimulation MAIT cells are licensed through granzyme exchange to kill bacterially sensitized targets The heterogeneity of human CD127(+) innate lymphoid cells revealed by single-cell RNA sequencing The Tabula Sapiens: a multiple organ single cell transcriptomic atlas of humans. bioRxiv Construction of a human cell landscape at single-cell level Fly Cell Atlas: a single-cell transcriptomic atlas of the adult fruit fly. bioRxiv Vision, challenges and opportunities for a Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility Ultra-high-throughput single-cell RNA sequencing and perturbation screening with combinatorial fluidic indexing Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging Single-cell lipidomics with high structural specificity by mass spectrometry