64487503 Learning association for single-cell transcriptomics by integrating profiling of gene expression and alternative polyadenylation Guoli Ji 1,4 , Wujing Xuan 1,4 , Yibo Zhuang 2 , Lishan Ye 3,4 , Sheng Zhu 1,4 , Wenbin Ye 1,4 , Xi Wang 1,4 and Xiaohui Wu 1,4* 1 Department of Automation, Xiamen University, Xiamen 361005, China 2 Xiamen YLZ Yihui Technology Co., Ltd, Xiamen, Fujian 361008, China 3 Xiamen Health and Medical Big Data Center, Xiamen, Fujian 361008, China 4 National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361102, China Keywords: cell type clustering; alternative polyadenylation; single-cell RNA-seq; integrative analysis; software Guoli Ji is a professor with the Department of Automation in Xiamen University. His research interests include bioinformatics, advanced control, data mining and information system. Wujing Xuan is a graduate student with the Department of Automation in Xiamen University. His research interests are bioinformatics and data mining. Yibo Zhuang is an employee in Xiamen YLZ Yihui Technology company. His research interests are software design, cloud computing and big data. Lishan Ye is the director of Xiamen Health and Medical Big Data Center. Her research interests are cloud computing and healthcare big data. Sheng Zhu is a Ph.D. candidate with the Department of Automation in Xiamen University. His research interests are bioinformatics and healthcare big data. Wenbin Ye is a Ph.D. candidate with the Department of Automation in Xiamen University. Her research interests are bioinformatics and mRNA processing. Xi Wang is a graduate student with the Department of Automation in Xiamen University. Her research interests are bioinformatics and data mining. Xiaohui Wu is an associate professor with the Department of Automation in Xiamen University. Her research interests are mRNA processing, bioinformatics, and data mining. * Corresponding author. E-mail: xhuister@xmu.edu.cn, Tel: 86+13459029440 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Abstract Single-cell RNA-sequencing (scRNA-seq) has enabled transcriptome-wide profiling of gene expressions in individual cells. A myriad of computational methods have been proposed to learn cell-cell similarities and/or cluster cells, however, high variability and dropout rate inherent in scRNA-seq confounds reliable quantification of cell-cell associations based on the gene expression profile alone. Lately bioinformatics studies have emerged to capture key transcriptome information on alternative polyadenylation (APA) from standard scRNA-seq and revealed APA dynamics among cell types, suggesting the possibility of discerning cell identities with the APA profile. Complementary information at both layers of APA isoforms and genes creates great potential to develop cost-efficient approaches to dissect cell types based on multiple modalities derived from existing scRNA-seq data without changing experimental technologies. We proposed a toolkit called scLAPA for learning association for single-cell transcriptomics by combing single-cell profiling of gene expression and alternative polyadenylation derived from the same scRNA-seq data. We compared scLAPA with seven similarity metrics and five clustering methods using diverse scRNA-seq datasets. Comparative results showed that scLAPA is more effective and robust for learning cell-cell similarities and clustering cell types than competing methods. Moreover, with scLAPA we found two hidden subpopulations of peripheral blood mononuclear cells that were undetectable using the gene expression data alone. As a comprehensive toolkit, scLAPA provides a unique strategy to learn cell-cell associations, improve cell type clustering and discover novel cell types by augmentation of gene expression profiles with polyadenylation information, which can be incorporated in most existing scRNA-seq pipelines. scLAPA is available at https://github.com/BMILAB/scLAPA. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Introduction Single-cell RNA-sequencing (scRNA-seq) has enabled transcriptome-wide profiling of gene expressions in individual cells, which has great potential to reveal cellular composition of tissues, transcriptional heterogeneity among cells and structure of cell types [1]. Cell-type identification is a critical step in most scRNA-seq data analyses, and a myriad of computational methods have emerged to detect novel cell types, previously un-appreciated sub-types of cells and rare cells [2]. Fundamentally, these numerous clustering methods rely on cell-cell associations (or similarities) for categorizing individual cells into different clusters [3]. A wide range of computational tools have been proposed to cluster cells, which implicitly or explicitly rely on a similarity concept [2]. SIMLR (Single-cell Interpretation via Multikernel Learning) adapts k-means by simultaneously training a similarity measure based on multiple kernel learning [4]. RaceID extends k-means with outlier detection to discover rare cell types [5]. SC3 (Single-Cell Consensus Clustering) utilizes a consensus approach to combine multiple clustering solutions [6]. PhenoGraph combines shared nearest-neighbour graphs and Louvain community detection to fast identify cell clusters [7]. Despite of the considerable progress, there is no strong consensus on which is the best clustering approach to define cell types for all situations [2, 8, 9]. Particularly, high variability and dropout rate inherent in scRNA-seq confounds the reliable quantification of lowly and/or moderately expressed genes [10, 11], resulting in extremely sparse gene-cell count matrix. Consequently, there might be little satisfactory overlap of observed genes among cells, hindering reliable quantification of cell-cell similarities based on the gene expression profile alone. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Recently, multi-omics methods that leverage additional aspects of the cell, such as the DNA methylome, open chromatin or proteome, are beginning to appear [12]. Seurat v3 [13] harmonizes scRNA-seq and scATAC-seq data from a similar tissue to identify subpopulations of cells that are undistinguishable using the scATAC-seq data alone. LIGER [14], a method based on integrative non-negative matrix factorization (iNMF), was proposed to classify cortical cells profiled from single-cell bisulfite sequencing by integrating scRNA-seq data. Additional modalities of individual cells provide valuable information about the phenotype and genetic cellular state not manifested by the transcriptome. However, not all scRNA-seq data is accompanied data from different modalities. Even that multimodal omics data are gradually available, integrative multimodal analysis is still in its infancy [12]. It remains a challenge to reconcile the heterogeneity across modalities as different modalities are normally profiled from cells sampled from the same tissue rather than the same cells. Although most scRNA-seq studies focus on gene expression profiling, key information on transcript isoforms, e.g., alternative splicing (AS) and/or alternative polyadenylation (APA), can be obtained, enabling multiple aspects of transcriptome information to be derived from standard scRNA-seq without changing experimental technologies [15-20]. Lately, several computational methods, such as scAPAtrap [15], Sierra [16] and scAPA [18], have been proposed to identify APA sites in single cells from diverse 3′ tag-based scRNA-seq protocols, e.g., Drop-seq [21], CEL-seq [22] and 10x Genomics [23]. Cell-to-cell heterogeneity in APA site usage was also observed [15-18]. Particularly, the previous study [15] revealed that the APA profile, even that from non-differentially expressed genes, can distinguish mouse cells in different stages during sperm cell differentiation, suggesting the possibility of discerning cell identities with APA usages independent of gene expression. Recent efforts have (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 pioneered methods to identify APA sites or explore APA dynamics across different cell types [16-18, 24-26], however, most studies profiled APA among cells with predefined cell type labels rather than discern cell types in an unsupervised manner. Complementary information at both layers of APA isoforms and genes can be refined from the same cells [15-20], which creates great potential to develop more sophisticated and cost-efficient approaches to dissect cell types based on multiple modalities derived from existing scRNA-seq experiments. Here we proposed a toolkit called scLAPA for learning association for single-cell transcriptomics by combing single-cell profiling of gene expression and alternative polyadenylation. scLAPA leverages the resolution and huge abundance of scRNA-seq, boosting the gene-level analysis with additional layer of APA information directly derived from the same scRNA-seq data. By employing the strategy of similarity network fusion, scLAPA effectively learns highly informative cell-cell associations from expression profiles of both genes and APA isoforms. We compared scLAPA with seven similarity metrics and five clustering methods, using diverse scRNA-seq data from different experimental technologies and species. Comparative results showed that scLAPA is more effective and robust for learning cell-cell similarities and clustering cell types than competing methods. Moreover, with scLAPA we found two hidden subpopulations of cells in peripheral blood mononuclear cells (PBMCs) that were undetectable using the gene expression data alone. As a comprehensive toolkit, scLAPA provides a unique strategy to learn cell-cell associations, improve cell type clustering and discover novel cell types by augmentation of gene expression profiles with polyadenylation information, which can be incorporated in many other standard scRNA-seq pipelines for single-cell analyses. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Materials and methods scRNA-seq datasets We used five publicly available scRNA-seq datasets from animals and plants generated by 3′ tag-based scRNA-seq protocols (Table S1), spanning a wide spectrum of tissues, cell types and species. Raw data except for the PBMC data were downloaded from NCBI GEO (Gene Expression Omnibus). Cell types and cell labels of the data of Amygdala, Mammary and Root were obtained from the corresponding studies; cell labels of the Hypothalamus data were obtained from PanglaoDB [27]. The PBMC 4k dataset was downloaded from the 10x Genomics website (https://www.10xgenomics.com/). For cell type annotation of PBMCs, we followed the tutorial of Seurat v3 [13] to cluster cells on the basis of the gene-cell expression matrix. Specifically, cells with total read counts less then 300 were discarded. The LogNormalize method was adopted for normalization. Top 2000 highly variable features were selected by the vst method. PCA (Principal Component Analysis) was used for dimensionality reduction and top 20 principal components were retained. Finally, cells were clustered by Seurat’s FundClusters with argument ‘resolution=0.9. For cell type annotation of cell clusters, known marker genes of PBMCs were complied from relevant studies (Table S2). Differentially expressed (DE) genes for each cell group were calculated with Seurat’s FindAllMarkers. We also calculated, for each cell cluster, the number of cells where a DE gene is expressed and the mean expression level of a DE gene. The cell type was carefully assigned to a cell cluster according to the presence and expression level of marker gene(s). (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Overview of scLAPA scLAPA mainly consists of four modules (Figure S1): (i) the input module, (ii) cell-cell distance, (iii) distance fusion, (iv) cell type clustering. The input module prepares the input for scLAPA, including a poly(A) site expression matrix (hereinafter referred as PA-matrix) and a gene expression matrix (hereinafter referred as GE-matrix). The PA-matrix is generated from raw scRNA-seq with scAPAtrap [15], which stores expression levels of poly(A) sites, with each row denoting a poly(A) site and each column denoting a cell. The GE-matrix can be obtained from websites like NCBI GEO and 10x Genomics, or generated by various routine scRNA-seq analysis tools like Cell Ranger. In the module of cell-cell distance, a cell-cell distance matrix is learned for PA-matrix (called PA-dist) and GE-matrix (called GE-dist), respectively. The module of distance fusion employs similarity network fusion (SNF) [28] to integrate the two distance matrices (PA-dist and GE-dist) into one cell-cell distance matrix. The cell type clustering module clusters cells based on the fused distance matrix with various clustering methods. scLAPA was implemented as an open source R package and is available at https://github.com/BMILAB/scLAPA. Scripts and data used in this study are also available at the GitHub website. Identification of poly(A) sites from scRNA-seq We followed the tutorial provided at the scAPAtrap website (https://github.com/BMILAB/scAPAtrap) to identify poly(A) sites with scAPAtrap [15]. It should be noted that alternative tools, such as Sierra [16] and scAPA [18], can also be used. Briefly, raw FASTQ reads were mapped with Cell Ranger 2.1.0 (https://www.10xgenomics.com/) and then uniquely mapped reads were obtained with SAMtools (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://github.com/BMILAB/scAPAtrap https://github.com/BMILAB/scAPAtrap https://doi.org/10.1101/2021.01.04.425335 (http://samtools.sourceforge.net/). Then UMI-tools [29] was employed to remove polymerase chain reaction (PCR) duplicates and extract unique molecular identifiers (UMIs). The findTails function in the scAPAtrap package was used to determine exact locations of poly(A) sites from reads with A/T stretches and the findPeaks function was adopted to identify all potential peaks of poly(A) sites from the whole genome level. Finally consensus poly(A) sites supported by both of the peak and the tail evidence were used. The featureCounts function in the Subread toolkit [30] was adopted to quantify the expression level for each poly(A) site. Poly(A) site annotation was performed with the movAPA package [31], using the latest genome annotation of the respective species -- TAIR10 for Arabidopsis, mm10 for mouse and GRCh38 for human. Briefly, poly(A) sites identified from scAPAtrap were annotated with rich information, such as genomic regions (i.e., 5′ UTR, 3′ UTR, coding sequence (CDS), intron, exon and intergenic) and gene id. Similar to previous studies [32-35], annotated 3′ UTRs were extended by a length of 1000 bp to recruit intergenic sites that may originate from authentic 3′ UTRs. Calculation of cell-cell distance scLAPA learns a cell-cell distance matrix for PA-matrix and GE-matrix, respectively. Various distance metrics can be chosen, including Euclidean distance, Pearson correlation, two metrics of proportionality (𝜌𝑝 and ∅𝑠) [3], RAFSIL (RAndom Forest based SImilarity Learning) [36] and SIMLR [4]. Euclidean distance and Pearson correlation are widely used in either single-cell or bulk transcriptomics. The two measures of proportionality were found to have strong performance according to a comprehensive benchmarking analysis of a large single-cell transcriptome compendium (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 [3]. RAFSIL is a random forest based approach that learns cell-cell similarities from scRNA-seq data, including two variations -- RAFSIL1/2. SIMLR learns a distance metric that fits the structure of the scRNA-seq data by combining multiple kernels corresponding to different informative representations of the data. Euclidean distance and Pearson correlation were calculated by the dists and cor functions in the R package stats, respectively; SIMLR metric was calculated by the SIMLR R package with argument ‘cores.ratio=0’; RAFSIL metric was calculated by the RAFSIL R package with arguments ‘nrep=50, gene_filter=FALSE’; 𝜌𝑝 and ∅𝑠 were calculated by the perb and phis functions in the R package propr, respectively. For each distance metric, cell-cell distance matrices, PA-dist and GE-dist, can be learned for PA-matrix and GE-matrix, respectively. PA-dist represents the cell-cell similarity network learned from the APA isoform layer, whereas GE-dist reflects the network learned from the gene layer, each of which encapsulates complementary information about cell-cell associations absent in the other genomic layer. Distance fusion After learning PA-dist and GE-dist, similarity network fusion (SNF) [28] is utilized to flexibly integrate the two layers of cell-cell similarities into one similarity matrix. First, PA-dist and GE-dist were iteratively and gradually fused to a consensus network, utilizing the non-linear method of message passing theory [37]. Then weak similarities representing potential noise were discarded, and strong similarities were retained. By generating coherent cell-cell similarities from both APA isoform and gene layers, SNF profiles a more comprehensive biological relationship among cells, beyond the scope of methods solely based on GE-matrix. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Given a PA-matrix storing expression levels of 𝑚 poly(A) sites in 𝑛 cells or a GE-matrix recording expression levels of 𝑚 genes in 𝑛 cells, the corresponding cell-cell distance matrix (PA-dist or GE-dist) can be obtained using a selected distance metric. The distance matrix can also be denoted as a graph 𝐺 =< 𝑉, 𝐸, 𝑊 >, with vertices 𝑉 {𝑐1 , … , 𝑐𝑛 } corresponding to cells, edges 𝐸 representing cell-cell link and edge weights 𝑊[𝑛×𝑛] denoting the kernel representation of cell-cell similarities. The weight of an edge linking cells 𝑐𝑖 and 𝑐𝑗 is determined using a scaled exponential similarity kernel: 𝑊𝑖𝑗 = 𝑒𝑥𝑝 − 𝑑𝑖𝑗 2 𝜇𝛽𝑖𝑗 (1) Here 𝑑𝑖𝑗 represents the distance between cells 𝑐𝑖 and 𝑐𝑗 measured by a distance metric (e.g. Pearson correlation). 𝜇 is an empirical hyperparameter with a recommended value in a sizable range of [0.3, 0.8] [28]. 𝛽𝑖𝑗 is a scaling factor defined as follows: 𝛽𝑖𝑗 = 𝑑 𝑐𝑖,𝑁𝑖 +𝑑 𝑐𝑗 ,𝑁𝑗 +𝑑𝑖𝑗 3 (2) where 𝑁𝑖 are neighboring cells of 𝑐𝑖 and 𝑑 𝑐𝑖, 𝑁𝑖 is the average distance of 𝑐𝑖 to its neighbors. To obtain a fused network from PA-dist and GE-dist, a full and sparse kernel on the vertex set 𝑉 is derived from the weight matrix 𝑊. The full kernel is a normalized weight matrix 𝑊 [𝑛×𝑛] which stores the full information of cell-cell similarities. The normalized weight between 𝑐𝑖 and 𝑐𝑗 is defined as: 𝑊 𝑖𝑗 = 𝑊𝑖𝑗 2 𝑊𝑖𝑘𝑘≠𝑖 𝑤ℎ𝑒𝑛 𝑖 ≠ 𝑗 0.5 𝑤ℎ𝑒𝑛 𝑖 = 𝑗 (3) Another matrix 𝐴[𝑛×𝑛] encodes the local affinity that measures similarities of a cell to its 𝐾 most similar cells: (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 𝐴𝑖𝑗 = 𝑊𝑖𝑗 𝑊𝑖𝑘𝑘≠𝑖 𝑤ℎ𝑒𝑛 𝑗 ∈ 𝑁𝑖 0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 (4) Here 𝑁𝑖 is the set of cell 𝑐𝑖 and its neighbors in the graph 𝐺. The network fusion initiates from 𝑊 , using 𝐴 as the kernel matrix to capture local structure of the graph. To fuse the two distance matrices (PA-dist and GE-dist), first 𝑊𝑃𝐴 and 𝑊𝐺𝐸 were computed, respectively. Then the corresponding initial state matrices 𝑊 𝑃𝐴 and 𝑊 𝐺𝐸 were derived from the two similarity matrices, and the kernel matrices 𝐴𝑃𝐴 and 𝐴𝐺𝐸 were also computed. Given the initial two status matrices at 𝑡 = 0, 𝑊 𝑡=0 𝑃𝐴 and 𝑊 𝑡=0 𝐺𝐸 , the fusion process iteratively updates the respective similarity matrix: 𝑊 𝑡+1 𝑃𝐴 = 𝐴𝑃𝐴 × 𝑊 𝑡 𝑃𝐴 × (𝐴𝑃𝐴 )𝑇 𝑊 𝑡+1 𝐺𝐸 = 𝐴𝐺𝐸 × 𝑊 𝑡 𝐺𝐸 × (𝐴𝐺𝐸 )𝑇 (5) Then after 𝑡 iterations, the final status matrix is obtained: 𝑊 = 𝑊 𝑡 𝑃𝐴 +𝑊 𝑡 𝐺𝐸 2 (6) 𝑊 is the fused cell-cell distance network by incorporating cells’ APA isoform and gene expression profiles. The corresponding cell-cell similarity matrix is 1 − 𝑊 . The distance or similarity matrix can be used for downstream cell type clustering. Single cell clustering Four widely-used clustering methods were provided in scLAPA to cluster cells on the basis of the fused cell-cell similarity matrix, including Louvain clustering [38], hierarchical clustering (HC) [39], spectral clustering (SC) [40] and k-means. The Louvain clustering was implemented by the (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 cluster_louvain function in the R package igraph, with arguments ‘mode=undirected, weighted=TRUE, diag = TRUE’. The spectral clustering was implemented by the SpectralClustering function in the R package SNFtool with default settings [28]. The hierarchical clustering [39] was performed by the flashClust function in the R package flashClust with default settings [41]. The k-means clustering was implemented by the kmeans function of the R package stats with arguments ‘iter.max=1e+9, nstart=1000’. Performance evaluation We distinguished two scenarios, similarity learning and clustering, to evaluate our approach. For each scenario, we applied scLAPA to four scRNA-seq datasets with pre-annotated cell labels, and compared results with other competing approaches. For the scenario of similarity learning, we compared scLAPA with seven similarity measures, including three measures designed for scRNA-seq (RAFSIL1/2 and SIMLR), two measures of proportionality (𝜌𝑝 and ∅𝑠) and two traditional similarity measures (Euclidean distance and Pearson correlation). Each of these measures was applied to a given GE-matrix to learn a cell-cell similarity matrix. For scLAPA, we applied each measure to learn two cell-cell similarity matrices from PA-matrix and GE-matrix and fused them into one matrix. We also applied different clustering methods including Louvain, HC, SC and k-means on the similarity matrix learned from each similarity measure to assess different similarity measures in the context of clustering. For the scenario of clustering, we compared scLAPA with five state of the art clustering methods for scRNA-seq data, including SC3 [6], Seurat v3 [13], SINCERA [42], SNN-Cliq [43] and dynamic tree (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 cut method (dynamicTreeCut) [44]. None of these approaches provides explicit similarity learning procedure, instead they provide cell labels by unsupervised learning on the GE-matrix. Each approach was applied to a given GE-matrix for cell clustering and class labels of cells were obtained. For scLAPA, we applied each of the four methods (Louvain, HC, SC and k-means) on the fused similarity matrix to obtain clustering results. Two internal validation metrics, Dunn index [45] and Connectivity [45], were employed for the first scenario to quantitatively assess the goodness of a clustering structure without relying on any clustering methods or knowing external information about class labels. The Dunn index [45] evaluates non-linear combinations of the between-group separation and the within-group compactness. The Connectivity reflects the extent of observations that are present in the same group as their neighbors in the data space. The original value of Connectivity ranges from zero to infinity, with smaller value denoting higher performance. Here we used a transform, 1/log10(Connectivity +1), to make Connectivity consistent with Dunn. The larger the score of Connectivity or Dunn, the better the separation is. The R package clValid [45] was adopted to calculate the Connectivity and Dunn index. Additionally, we used three popular metrics to evaluate the performance of scLAPA in the context of clustering, including the ARI (Adjusted Rand Index), Jaccard and NMI (Normalized Mutual Information). The value of ARI ranges from -1 to 1, and values of NMI and Jaccard range from 0 to 1, with the higher value indicating the better performance. ARI is a widely-used metric for measuring the concordance between two clustering results. The Jaccard index quantifies the similarity between two datasets. NMI is a variation of mutual information for evaluating clustering results, which corrects the bias of the consistency caused by chance. ARI and Jaccard were calculated using the adjustedRand (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://en.wikipedia.org/wiki/Mutual_information https://en.wikipedia.org/wiki/Cluster_Analysis https://doi.org/10.1101/2021.01.04.425335 function in the R package clues [46]; NMI was obtained by the compare function in the R package igraph (https://igraph.org/r/). Bioinformatics analyses UMAP [47] was adopted for visualization of distributions of single cells, which employs the non-linear dimensional reduction technique to group similar cells in low-dimensional space. UMAP was implemented by the calculateUMAP function in the scater R package [48]. For the analysis of the Arabidopsis root data, DESeq2 [49] was adopted to identify DE genes and DE poly(A) sites. First GE-matrix and PA-matrix were normalized by the median ratio method provided in DESeq2. Then the DESeq function was applied for DE detection. Gene or poly(A) sites with log2 fold change>=0.8 and adjusted P-value<=0.1 were considered as DE. Results Single-cell polyadenylation profile distinguishes cells Recently, scRNA-seq has emerged as a unique tool to explore cell-specific gene or isoform expression in plants [50-54]. A previous study [51] utilized root-hair and nonhair cell types as models and revealed the potential of using scRNA-seq data for inferring specific cells during the process of cell-type differentiation. Here we focused on the epidermal tissue and analyzed differential expression on both gene and APA levels between root-hair and nonhair cells. A total of 294 root-hair cells and 195 nonhair cells were defined by the previous study [51]. Although both GE-matrix and PA-matrix were obtained from the same scRNA-seq data, we still found four genes exclusively present in the (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 PA-matrix (Figure 1A). For example, AT1G64140, a WRKY transcription factor gene, was absent in the single-cell GE-matrix, while it has one poly(A) site (coord: 23803757) with much higher expression level in nonhair than in hair cells according to the PA-matrix. Interestingly, this poly(A) site is an annotated poly(A) site in extended 3′ UTR, which was supported by bulk 3′-seq data according to PlantAPAdb [55]. Similarly, AT3G2522, a hypothetical protein coding gene, is missing in the GE-matrix, while its one poly(A) site (coord: 9184927) is expressed much higher in nonhair cells than in hair cells. This poly(A) site was also annotated as a 3′ UTR site in PlantAPAdb. Moreover, 1422 genes possess at least one differentially used poly(A) site, among which 171 genes were not DE genes (Table S3). For example, AT1G59725 is a DNAJ heat shock family protein expressed in root. Although both AT1G59725 and its one poly(A) site are expressed higher in root hair cells than in nonhair cells, the difference between the two cell types characterized by the poly(A) profile is much more pronounced than that by the gene profile (Figure 1B). Further, using only the GE-matrix, a subset of cells are indistinguishable between hair and nonhair cell types (Figure 1C). In contrast, cells from the two cell types were clearly separated on the basis of the PA-matrix and two potential subpopulations of nonhair cells were observed (Figure 1C). Therefore, we anticipate that the poly(A) site expression profile may encode complementary information that is absent or insignificant in the gene expression profile, which could be useful to distinguish cell types. There is a great potential to develop integrative approaches for discerning cell identities that can properly incorporate single-cell profiling of both gene expression and polyadenylation information. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Learning cell-cell similarities with scLAPA We proposed the scLAPA toolkit that can learn cell-cell similarities by taking advantage of the complementarities from both layers of APA isoforms and genes. Here we compared the performance of the similarity metric learned from scLAPA with other seven similarity metrics by analyzing four scRNA-seq datasets. Two metrics, Dunn and Connectivity, were adopted to quantitatively measure cell separation independent of clustering methods. Generally, scLAPA provides higher or comparable performance than other metrics across all the four datasets, whereas Pearson correlation or Euclidean has a consistently lower performance (Figures 2A and S2). In terms of both Dunn and Connectivity, scLAPA and SIMLR perform significantly better than other three metrics. Particularly, SIMLR outperforms scLAPA on the Hypothalamus data whereas scLAPA outperforms SIMLR on the Mammary data. Overall, scLAPA performs better than at least six out of the seven metrics in all the four datasets, never being the worst in any case. According to the Dunn index (Figure 2A), even for datasets where the performance of scLAPA is not the best, scLAPA is always the close match to the best. For example, the Dunn score from scLAPA on the Hypothalamus data is 0.94, which is very close to the best score (0.985 from SIMLR). Next we used the radar chart to compare the performance of these similarity metrics more intuitively. Apparently, scLAPA and SIMLR stand out as universally better than the others, and discrepancies of performance of other six metrics across different datasets were observed (Figure 2B). For example, the overall similarity based on the RAFSIL1/2 metric is much higher on Mammary and Hypothalamus data than the other two datasets, revealing the instability of performance of RAFSIL across different (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 datasets. In contrast, for all these four datasets, both Euclidean and Pearson correlation emerge as the worst similarity metric. In contrast, scLAPA provides a more robust result regardless of datasets. scLAPA is integrative and flexible in that different distance metrics can be chosen to learn cell-cell similarities for distance fusion. Next we examined the effect of using different distance metrics in scLAPA. The performance of scLAPA according to the Dunn index is highly robust across all datasets regardless of distance metrics used in scLAPA (Figure 2C). It is widely accepted that it is highly challenging to determine an optimal distance metric for profiling true cell-cell relationships from the complex and heterogeneous scRNA-seq data [3]. However, the integrative framework of scLAPA provides an effective solution of distance fusion by assembling results from multiple data layers into one ensemble result, which can mitigate limitations in individual similarity metrics or data layers and facilitate the generalization and adaption for different scRNA-seq datasets. Take the Hypothalamus data as an example. Apparently, the matrix with block structures obtained from scLAPA showed higher consistency with true labels than did other similarity metrics (Figure S3). Block structures learned by SIMLR are indistinguishable from background signatures; block structures learned by Pearson correlation, Euclidean or the two measures of proportion are also mixed with background signatures; block structures learned from RAFSIL are generally consistent with true structures except that cell types with small number of cells are less distinguishable. Overall, scLAPA provides more divergent clusters with higher distinction, and individual clusters obtained by scLAPA are more compact than those by other similarity metrics. These results demonstrate the ability and robustness of scLAPA in improving the cell separation across numerous scRNA-seq datasets. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Cell type clustering with scLAPA Cell-cell similarities learned by different similarity metrics can be adapted to other clustering methods that take similarities as inputs. Here we performed extensive comparisons of scLAPA with other seven similarity metrics by applying different clustering methods for cell clustering. First we applied Louvain [2], a graph-based method for community detection, to different similarity metrics for clustering. According to the ARI score, similarities learned by scLAPA and SIMLR significantly outperform similarities obtained from Euclidean, Pearson correlation or RAFSIL1/2 (Figure 3A). Overall, SIMLR shows similar performance with scLAPA, whereas scLAPA outperforms SIMLR in three out of the four datasets. Particularly, Euclidean and Pearson correlation present the worst performance in two datasets, Mammary and Root. Similar results were obtained in terms of other two indexes, NMI and Jaccard (Figure S4). In addition to Louvain clustering, we also investigated other three popular clustering methods, including hierarchical clustering [39], spectral clustering [40] and k-means [56], to evaluate the robustness of results by applying different clustering methods on the same similarity metric (Figures S5-7). Particularly, the performance of scLAPA and RAFSIL1/2 are robust regardless of clustering methods used, whereas scLAPA consistently outperforms RAFSIL. In contrast, SIMLR, Euclidean and Pearson correlation are very sensitive to clustering methods applied (Figure 3B). Surprisingly, although SIMLR achieves comparable performance with scLAPA based on Louvain clustering (Figure 3A), its performance is the worst using k-means or spectral clustering (Figure 3B). Take the Mammary data for example, the ARI score of SIMLR drops from 0.769 when using Louvain clustering to an extremely low median value of 0.026 when using k-means. Moreover, (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 we noted that, ARI scores from individual runs of k-means clustering on SIMLR similarities varied greatly, revealing the relatively poor robustness of SIMLR with k-means clustering (Figure S5). These results demonstrate that the cell-cell similarity matrix learned from scLAPA is more effective and robust than competing similarity metrics in clustering cell subpopulations. During the preparation of this manuscript, we noticed another method scDaPars [57], which quantifies and recovers APA events in single cells using standard scRNA-seq data. The authors also integrated APA information identified by scDaPars with imputed gene expression by similarity network fusion to reveal novel cell subpopulations during human embryonic development. Different from scDaPars that employs the (imputed) percentage of distal poly(A) site usage index (PDUI) to measure APA usage, scLAPA directly utilizes raw poly(A) expression profile. Here we compared the performance of scLAPA and scDaPars by applying them to the four scRNA-seq datasets in our benchmarking analysis. Following the process in Gao et al. [57], we calculated PDUI based on the PA-matrix and imputed APA profiles using scDaPars. Then we applied five similarity metrics on the scDaPars-imputed APA profile and the GE-matrix to generate scDaPars-dist and GE-dist, respectively. After fusing the two distance matrices with SNF, we applied Louvain clustering on the fused cell-cell similarities to cluster cells. According to the ARI score (Figure 4), scLAPA significantly outperforms scDaPars on all the four datasets. Particularly, ARI scores of scDaPars with different similarity metrics varied greatly whereas the performance of scLAPA is robust with different similarity metrics (Figure 4 vs. Figure 2C), revealing that the poly(A) expression profile used in scLAPA is more efficient and robust than the PDUI profile used in scDaPars for clustering cells. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Next we expanded the benchmarking analysis by comparing clustering results of scLAPA with other single-cell clustering methods that directly take the gene-cell expression matrix as input without an explicit procedure of similarity learning. Specifically, we included five popular tools for comparison, including SC3 [6], Seurat v3 [13], SINCERA [42], SNN-Cliq [43] and dynamicTreeCut [44]. According to the ARI score, scLAPA achieves generally higher or comparable performance than other methods, whereas dynamicTreeCut provides a consistently lower performance (Figure 5). Similar results were observed using indexes of Jaccard or NMI (Figure S8). Specifically, scLAPA provides the best ARI score in three out of the four datasets (Figure 5). For the Hypothalamus data where SC3 performs the best, scLAPA presents very close ARI score to SC3 (scLAPA=0.985; SC3=0.99). Particularly, for three datasets (Mammary, Hypothalamus and Root), ARI scores of individual SC3 runs varied greatly, reflecting the performance of SC3 may be unstable on some kinds of datasets. Overall, the performance of scLAPA is robust and consistently high across diverse scRNA-seq datasets. scLAPA identifies hidden subpopulations of cells We next applied scLAPA on the human PBMC 4k dataset from 10x Genomics for cell type clustering. First we examined the cell type composition of the PBMCs by applying Seurat to the gene-cell expression matrix (GE-matrix). Ten distinct cell clusters were yielded (Figure 6A). Based on the expression of known markers (Table S2), nine clusters were annotated. Up to 13,512 poly(A) sites from 9601 genes were identified from the raw RNA-seq data with scAPAtrap. We learned cell-cell similarities with scLAPA by jointly considering expression profiles of APA isoforms and genes. After (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 applying Louvain clustering on the cell-cell similarity matrix, 14 cell clusters were obtained and 11 clusters were successfully annotated. These 11 clusters covered the nine clusters identified by Seurat and contained two new small clusters (Figure 6B). Both subclusters were supported by the expression patterns of markers, suggesting that they represented distinct cell types. One subcluster was annotated as regulatory T cell on the basis of elevated expression of three markers, CCR10, FOXP3 and IL2RA (Figure S9). Depending only on the gene expression profile, regulatory T cells were not well resolved and are indistinguishable among other T cells (Figure 6A). Although the gene expression of the marker CCR10 is sparse and weak among T cells, we could still distinguish clearly regulatory T cells from other T cell types according to the UMAP visualization of the gene expression profile (Figure 6C). Particularly, CCR10 has four annotated poly(A) sites according to APASdb [58], whereas only one poly(A) site was identified from scRNA-seq data. This is not unexpected as the bulk 3′-seq data contain more diverse tissue samples than the PBMC data and scRNA-seq data is generally too sparse to identify all poly(A) sites. However, we have shown that, even for a single poly(A) site, it could encapsulate useful information beyond the gene expression profile (Figure 1). The other subcluster where cell markers such as PPBP and PF4 are expressed, was annotated as megakanyocyte progenitors (Figures 6D and S10). According to the PA-matrix, PPBP carries three poly(A) sites, and five poly(A) sites of PPBP were annotated in APASdb. These three poly(A) sites were all highly expressed in megakanyocyte progenitors (cluster 10) (Figure 6E). These results demonstrate that scLAPA facilitates the capture and identification of hidden subpopulations of cells that are unrecognizable based on the gene expression profile alone. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Discussion scLAPA is an integrative framework for learning association for single-cell transcriptomics by leveraging expression profiles of genes and APA isoforms in individual cells, which highlights the inclusion of polyadenylation signatures for improving cell type clustering and discovering new cell types. The effectiveness of scLAPA for cell-cell similarity learning and cell type clustering is evidenced by comparisons with various similarity metrics and single-cell clustering methods on several scRNA-seq datasets. scLAPA has a number of desirable features. First, scLAPA incorporates existing tools to extract and quantify poly(A) sites directly from scRNA-seq, which augments the gene-level analysis with additional layer of APA information without altering the scRNA-seq protocol or performing additional sequencing experiment. Second, by employing the strategy of similarity network fusion, scLAPA jointly considers expression profiles at both levels of APA isoforms and genes for learning highly informative cell-cell similarities. Third, in contrast to many other methods that cluster cells without explicit similarity learning step, scLAPA provides two independent but connected modules for similarity learning and cell clustering, each with various methods for users’ choice. Accordingly, users can freely combine different similarity metrics and clustering methods in scLAPA to evaluate the clustering results for any given dataset. Fourth, the framework of scLAPA is highly flexible, which can be seamlessly embedded into most existing scRNA-seq pipelines or tools for downstream analyses, such as dimension reduction, cell type clustering and differential expression analysis. Accordingly, existing tools, such as those designed for dropout imputation, normalization and similarity learning, can also be easily incorporated into scLAPA. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 With scLAPA, distinct cell-cell similarity networks can be effectively learned from profiles of gene expression and polyadenylation separately by various similarity metrics. scLAPA then employed the strategy of similarity network fusion for scalable and robust integration of similarity networks learned from different data layers. This strategy has the advantage to exploit complementarities in distinct data layers for fully profiling the spectrum of underlying data. Moreover, the consensus set of cell-cell interactions and associations from the APA layer and the gene layer can be learned from the given data, mitigating noise and dropouts in conventional gene-cell expression profile and thus enhancing accuracy for downstream analyses. By combining expression profiles of APA and gene through similarity network fusion, we found two hidden subpopulations of PBMCs that were undetectable using only gene expression data (Figure 6). Moreover, the augmentation of gene expression profiles with polyadenylation information enhances single-cell clustering results and generates more discriminative cell types (Figures 2-5). As a comprehensive toolkit, scLAPA provides a unique strategy to improve cell type clustering and discover novel cell types, by combining gene expression with polyadenylation information at single-cell resolution. scLAPA consists of three core function modules, including learning cell-cell similarities, distance fusion and clustering. Currently, numerous methods are available to learn cell-cell similarities or cluster cells with reasonable accuracy [3]. However each method has its own strengths and limitations, and it is extremely challenging, if not impossible, to determine an optimal method for all kinds of datasets as different methods may exploit different characteristics in the data [59]. Moreover, some similarity metrics may be overly dependent on downstream clustering methods, exacerbating difficulties in choosing a universally applicable combination of similarity and clustering methods. For (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 example, based on the GE-matrix alone, similarities learned from SIMLR provide an overall high performance across datasets in terms of internal validation indexes (Figure 2A). However, SIMLR is highly dependent on downstream clustering methods for single-cell clustering; it achieves high performance with Louvain clustering (Figure 3A), whereas its performance drops sharply with k-means or spectral clustering (Figure 3B). In contrast, our benchmarking analyses showed that performances of scLAPA are robust and consistently high across diverse datasets regardless of distance metrics or clustering methods selected in scLAPA (Figures 2-5). The unique strength of scLAPA may be due to that it efficiently fuses rich structures stored in GE-matrix as well as the accompanied PA-matrix, thus can amplify biological signals and augment cell-cell relationships. scLAPA is an easy-to-use and highly flexible framework. The input of scLAPA is the GE- and PA-matrix, without using any priori biological information. Even with raw scRNA-seq data, it is easy obtain the prerequisite GE-matrix and/or PA-matrix using various tools, e.g. Cell Ranger for GE-matrix, scAPAtrap and Sierra for PA-matrix. Lately another tool, scDaPars [57], was proposed to quantify and recover APA usages from scRNA-seq data, which uses the relative usage of the distal poly(A) site called PDUI to measure a gene’s APA usage. With scDaPars, Gao et al. [57] analyzed cell-type-specific APA regulation and discovered hidden cell subpopulations from cancer and human endoderm differentiation scRNA-seq data. In scLAPA the input PA-matrix can be replaced with any other gene-cell-like matrix, thus the scDaPars-imputed PDUI matrix can be used readily in scLAPA for downstream cell type clustering. However, although the scDaPars-imputed PDUI profile seems to be effective in revealing APA dynamics among cell types in the previous study [57], we found that, for cell type clustering, the performance with the PDUI-matrix is much lower and less robust than that (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 with scLAPA’s PA-matrix (Figure 4). This may be due to several reasons. First, only genes with at least two 3′ UTR poly(A) sites can be used for scDaPars’ PDUI calculation, consequently the PDUI-matrix is much more sparse than the PA-matrix and information encoded in genes with single poly(A) site is lost. Second, although the PDUI profile can be imputed with scDaPars, limited information in the highly sparse PDUI-matrix confounds reliable imputation and may lead to propagation of errors or noises during the imputation process. Third, unlike scLAPA which is specifically designed for learning cell-cell similarities and cell type clustering, the main function of scDarpas is to analyze cell-type-specific APA dynamics and identify novel APA-related cell types. We anticipate that the PA-matrix used in scLAPA may contain more comprehensive and reliable information than the PDUI-matrix or the imputed PDUI-matrix, which can significantly enhance the accuracy of cell type clustering. Overall, the PA-matrix is simple but effective which can be easily obtained from scRNA-seq data by various tools, making it more convenient to use scLAPA for scRNA-seq analyses. For practical application purpose, the current version of scLAPA implements seven similarity metrics and four clustering methods for users’ choice, which allows users to investigate their own strategies for evaluation of the effect of different combinations of distance metrics and clustering methods. Moreover, scLAPA is easily expandable in that additional distance metrics or clustering methods can be readily incorporated. Meanwhile, scRNA-seq preprocessing steps, such as dropout imputation and normalization, can also be easily applied before similarity learning. scLAPA can also be used as a plug-in architecture for most existing scRNA-seq pipelines for similarity learning and cell clustering. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Supplementary Data The file of supplemental materials contains all the Supplementary Figures, and Tables. Funding This work was supported by the National Natural Science Foundation of China (Nos. 61871463 to X.W. and 61573296 to G.J.) and Xiamen YLZ Yihui Technology Co., Ltd (XDHT2020131A). References 1. Ziegenhain C, Vieth B, Parekh S et al. Comparative Analysis of Single-Cell RNA Sequencing Methods, Mol Cell 2017;65:631-643.e634. 2. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data, Nature Reviews Genetics 2019. 3. Skinnider MA, Squair JW, Foster LJ. Evaluating measures of association for single-cell transcriptomics, Nature Methods 2019;16:381-386. 4. Wang B, Zhu J, Pierson E et al. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning, Nature Methods 2017;14:414. 5. Grun D, Lyubimova A, Kester L et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types, Nature 2015;525:251-255. 6. Kiselev VY, Kirschner K, Schaub MT et al. SC3: consensus clustering of single-cell RNA-seq data, Nature Methods 2017;14:483. 7. Levine Jacob H, Simonds Erin F, Bendall Sean C et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis, Cell 2015;162:184-197. Key Points  We proposed a computational toolkit called scLAPA for learning association for single-cell transcriptomics from scRNA-seq data.  scLAPA improves cell-cell similarity learning and cell type clustering by integrating single-cell profiling of gene expression and alternative polyadenylation.  Objective benchmarking analyses using diverse scRNA-seq datasets demonstrate higher performance and robustness of scLAPA than competing methods in cell-cell similarity learning and cell type clustering.  scLAPA discovers hidden subpopulations of cells that are unrecognizable based on the gene expression profile alone. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 8. Qi R, Ma A, Ma Q et al. Clustering and classification methods for single-cell RNA-sequencing data, Briefings in Bioinformatics 2020;21:1196-1208. 9. Petegrosso R, Li Z, Kuang R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data, Briefings in Bioinformatics 2020;21:1209-1223. 10. Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis, Nature Methods 2014;11:740. 11. Grun D, Kester L, van Oudenaarden A. Validation of noise models for single-cell transcriptomics, Nat Methods 2014;11:637-640. 12. Stuart T, Satija R. Integrative single-cell analysis, Nature Reviews Genetics 2019;20:257-272. 13. Stuart T, Butler A, Hoffman P et al. Comprehensive Integration of Single-Cell Data, Cell 2019;177:1888-1902.e1821. 14. Welch JD, Kozareva V, Ferreira A et al. Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity, Cell 2019;177:1873-1887.e1817. 15. Wu X, Liu T, Ye C et al. scAPAtrap: identification and quantification of alternative polyadenylation sites from single-cell RNA-seq data, Briefings in Bioinformatics 2020. 16. Patrick R, Humphreys DT, Janbandhu V et al. Sierra: discovery of differential transcript usage from polyA-captured single-cell RNA-seq data, Genome Biol 2020;21:167. 17. Levin M, Zalts H, Mostov N et al. Gene expression dynamics are a proxy for selective pressures on alternatively polyadenylated isoforms, Nucleic Acids Res 2020;48:5926-5938. 18. Shulman ED, Elkon R. Cell-type-specific analysis of alternative polyadenylation using single-cell transcriptomics data, Nucleic Acids Res 2019;47:10027-10039. 19. Arzalluz-Luque A, Conesa A. Single-cell RNAseq for the study of isoforms-how is that possible?, Genome Biology 2018;19:110. 20. Song Y, Botvinnik OB, Lovci MT et al. Single-Cell Alternative Splicing Analysis with Expedition Reveals Splicing Dynamics during Neuron Differentiation, Molecular Cell 2017;67:148-161.e145. 21. Macosko EZ, Basu A, Satija R et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets, Cell 2015;161:1202-1214. 22. Hashimshony T, Wagner F, Sher N et al. CEL-Seq: Single-Cell RNA-Seq by Multiplexed Linear Amplification, Cell Rep 2012;2:666-673. 23. Zheng GX, Terry JM, Belgrader P et al. Massively parallel digital transcriptional profiling of single cells, Nat Commun 2017;8:14049. 24. Ye C, Zhou Q, Hong Y et al. Role of alternative polyadenylation dynamics in acute myeloid leukaemia at single-cell resolution, Rna Biology 2019;16:785-797. 25. Kim N, Chung W, Eum HH et al. Alternative polyadenylation of single cells delineates cell types and serves as a prognostic marker in early stage breast cancer, PloS one 2019;14:e0217196. 26. Velten L, Anders S, Pekowska A et al. Single-cell polyadenylation site mapping reveals 3' isoform choice variability, Molecular Systems Biology 2015;11:812-812. 27. Franzén O, Gan L-M, Björkegren JLM. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data, Database 2019;2019. 28. Wang B, Mezlini AM, Demir F et al. Similarity network fusion for aggregating data types on a genomic scale, Nature Methods 2014;11:333. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 29. Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy, Genome Research 2017;27:491-499. 30. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features, Bioinformatics 2013;30:923-930. 31. Ye W, Liu T, Fu H et al. movAPA: Modeling and visualization of dynamics of alternative polyadenylation across biological samples, Bioinformatics 2020. 32. Shen Y, Ji G, Haas BJ et al. Genome level analysis of rice mRNA 3'-end processing signals and alternative polyadenylation, Nucleic Acids Research 2008;36:3150-3161. 33. Wu X, Liu M, Downie B et al. Genome-wide landscape of polyadenylation in Arabidopsis provides evidence for extensive alternative polyadenylation, Proceedings of the National Academy of Sciences, USA 2011;108:12533-12538. 34. Zhao Z, Wu X, Raj Kumar PK et al. Bioinformatics Analysis of Alternative Polyadenylation in Green Alga Chlamydomonas reinhardtii Using Transcriptome Sequences from Three Different Sequencing Platforms, G3: Genes|Genomes|Genetics 2014;4:871-883. 35. Wu X, Gaffney B, Hunt A et al. Genome-wide determination of poly(A) sites in Medicago truncatula: evolutionary conservation of alternative poly(A) site choice, BMC Genomics 2014;15:615. 36. Pouyan MB, Kostka D. Random forest based similarity learning for single cell RNA sequencing data, Bioinformatics 2018;34:i79-i88. 37. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. 38. Blondel VD, Guillaume J-L, Lambiotte R et al. Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008;2008:P10008. 39. Eisen MB, Spellman PT, Brown PO et al. Cluster analysis and display of genome-wide expression patterns, Proc Natl Acad Sci U S A 1998;95:14863-14868. 40. Ng AY, Jordan M, Weiss Y. On Spectral Clustering: Analysis and an Algorithm. Advances in neural information processing systems. 2001, 849–856. 41. Langfelder P, Horvath S. Fast R Functions for Robust Correlations and Hierarchical Clustering, Journal of Statistical Software 2012;46:1-17. 42. Guo M, Wang H, Potter SS et al. SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis, PLoS Comput Biol 2015;11:e1004575-e1004575. 43. Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method, Bioinformatics 2015;31:1974-1980. 44. Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R, Bioinformatics 2007;24:719-720. 45. Guy Brock, Vasyl Pihur, Susmita Datta et al. clValid, an R package for cluster validation, Journal of Statistical Software 2011;25:1-22. 46. Chang F, Qiu W, Zamar RH et al. clues: An R Package for Nonparametric Clustering Based on Local Shrinking, Journal of Statistical Software 2010;33:16. 47. McInnes L, Healy J, Saul N et al. UMAP: Uniform Manifold Approximation and Projection, Journal of Open Source Software 2018;3:861. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 48. McCarthy DJ, Campbell KR, Lun AT et al. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R, Bioinformatics 2017;33:1179-1186. 49. Love M, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2, Genome Biology 2014;15:550. 50. Jean-Baptiste K, McFaline-Figueroa JL, Alexandre CM et al. Dynamics of Gene Expression in Single Root Cells of Arabidopsis thaliana, The Plant Cell 2019;31:993-1011. 51. Ryu KH, Huang L, Kang HM et al. Single-Cell RNA Sequencing Resolves Molecular Relationships Among Individual Plant Cells, Plant Physiology 2019;179:1444-1456. 52. Shahan R, Hsu C-W, Nolan TM et al. A single cell Arabidopsis root atlas reveals developmental trajectories in wild type and cell identity mutants. 2020. 53. Shulse CN, Cole BJ, Ciobanu D et al. High-Throughput Single-Cell Transcriptome Profiling of Plant Cell Types, Cell Reports 2019;27. 54. Zhang T-Q, Xu Z-G, Shang G-D et al. A Single-Cell RNA Sequencing Profiles the Developmental Landscape of Arabidopsis Root, Molecular Plant 2019;12:648-660. 55. Zhu S, Ye W, Ye L et al. PlantAPAdb: A Comprehensive Database for Alternative Polyadenylation Sites in Plants, Plant Physiology 2020;182:228-242. 56. Kaufmann L, Rousseeuw P. Clustering by means of medoids. In: Dodge Y. (ed) Statistical data analysis based on the L1-norm and related methods. Amsterdam: North-Holland, 1987, 405–416. 57. Gao Y, Li L, Amos CI et al. Dynamic Analysis of Alternative Polyadenylation from Single-Cell RNA-Seq(scDaPars) Reveals Cell Subpopulations Invisible to Gene Expression Analysis, bioRxiv 2020:2020.2009.2023.310649. 58. You L, Wu J, Feng Y et al. APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals, Nucleic Acids Research 2014;43:D59-D67. 59. Shirkhorshidi AS, Aghabozorgi S, Wah TY. A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data, PLoS One 2015;10:e0144059. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Figure legends Figure 1. Single-cell poly(A) profile in root hair and nonhair cells. (A) Genes exclusively present in the PA-matrix. Four genes (AT3G22970, AT1G64140, AT4G09200 and AT3G25221) were not present in the GE-matrix, whereas they had at least one poly(A) site according to the PA-matrix. For each gene, the violin plot shows expression levels of its poly(A) site in hair and nonhair cells and the UMAP visualization shows the 2D embeddings of poly(A) profile. (B) Two example genes (AT1G59725 and AT4G18940) that are not differentially expressed (DE) but possess at least one DE poly(A) site. The upper panel places the violin plot and UMAP visualization showing the poly(A) profile of the respective gene in hair and nonhair cells. The lower panel shows the gene profile. (C) Single-cell poly(A) profile distinguishes root hair and root nonhair cells. The left plot is the UMAP representation on the basis of 171 genes that are not DE but with at least one DE poly(A) site, the right plot is the UMAP representation on the basis of poly(A) profile of the 171 genes. Figure 2. Benchmarking of similarity learning with scLAPA on four published scRNA-seq datasets. (A) The internal validation metric of Dunn was employed to measure the cell separation. (B) Radar chart showing the performance of different similarity metrics across datasets. Dataset names are shown near the vertex of the plot. Each vertex denoting the Dunn score of a metric on the respective dataset. The larger the area of a polygon displayed in a radar chart is, the higher the overall performance is. (C) Radar chart showing the performance of scLAPA with different distance metrics for distance fusion. Each vertex denotes the Dunn score of using different distance metrics on the respective dataset. Figure 3. Benchmarking of similarity learning with scLAPA in the context of clustering on four published scRNA-seq datasets. (A) ARI was employed to measure the concordance between inferred and true cluster labels. Louvain clustering was applied on the similarity matrices obtained from different methods. (B) Radar charts showing ARI scores by applying different clustering methods on cell-cell similarities learned by each similarity metric. Each plot represents results of one dataset. Clustering methods are shown near the vertex of the plot. The vertex of a plot denotes the ARI score of applying a clustering method on different metrics. The larger the area of a polygon displayed in a radar chart is, the higher the overall performance is. HC, hierarchical clustering; SC, spectral clustering. Figure 4. Comparison of performance between scLAPA and scDaPars across four scRNA-seq datasets. Five similarity metrics were applied on the scDaPars-imputed PDUI profile and the GE-matrix to generate scDaPars-dist and GE-dist, respectively. After fusing the two distance matrices with SNF, Louvain clustering was applied on the fused cell-cell similarities to cluster cells. We did not include RAFSIL in this experiment due to its slow calculation speed. For scLAPA, Pearson correlation was used for similarity learning and Louvain was used for clustering. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Figure 5. ARI scores from six clustering methods across four scRNA-seq datasets. For scLAPA, Pearson correlation was used for similarity learning and Louvain was used for clustering. Figure 6. scLAPA identifies hidden subpopulations of cells from human PBMCs. (A) UMAP representation of Seurat’s clustering results on the basis of GE-matrix. Ten clusters were obtained and nine were annotated with known cell types: Naive T cell (1), CD14+ Monocytes (2), CD8+ T cell (3), B cell (4), CD4+ Memory T (5), NK cell (6), CD16+ Monocytes (7), Monocyte Derived Dendritic (8, 10) and Plasmacytorid Dendritic (9). (B) UMAP representation of scLAPA’s clustering results on the basis of GE-matrix and PA-matrix. Fourteen clusters were obtained and 11 clusters were annotated with known cell types: Regulatory T cell (1), Naive T cell (2, 3), Plasmacytorid Dendritic (4), CD4+ Memory T (5), CD8+ T cell (6), CD14+ Monocytes (7), Monocyte Derived Dendritic (8, 11, 14), CD16+ Monocytes (9), Megakaryocyte Progenitors (10), B cell (12) and NK cell (13). The two arrows mark two new subpopulations of cells identified by scLAPA. (C) Gene expression of CCR10 distinguishes regulatory T cells from other T cell types according to the UMAP visualization of the gene expression profile. The details in the dashed line box are shown in the solid line box. (D) Gene expression of PPBP distinguishes megakanyocyte progenitors from other cell types. (E) Three poly(A) sites of PPBP are all highly expressed in megakanyocyte progenitors. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 0 1 2 3 Hair Nonhair Hair 0 1 2 3 0 1 2 3 0 1 2 3 P A 7 6 4 0 0 1 2 3 A T 1 G 5 9 7 2 5 (B) (C) (A) 0 1 2 3 A T 4 G 0 9 2 0 0 (P A c o o rd : 5 8 6 1 8 6 5 ) A T 3 G 2 2 9 7 0 (P A c o o rd : 8 1 5 4 1 1 0 ) A T 1 G 6 4 1 4 0 (P A c o o rd : 2 3 8 0 3 7 5 7 ) A T 3 G 2 5 2 2 1 (P A c o o rd : 9 1 8 4 9 2 7 ) Hair Nonhair UMAP 1 UMAP 1 U M A P 2 0 1 2 3 A T 4 G 1 8 9 4 0 0 1 2 3 P A 3 0 7 9 2 -4 -2 0 2 -3 0 3 6 -2 -1 0 1 2 -4 0 4 U M A P 2 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Amygdala Hypothalamus Mammary Root 0.00 0.25 0.50 0.75 1.00 D u n n Euclidean Pearson SIMLR RAFSIL1 RAFSIL2 scLAPA (A) p  s  Hypothalamus Mammary Amygdala Root Hypothalamus MammaryRoot Amygdala (B) (C) Euclidean Pearson p  s  SIMLR RAFSIL1 RAFSIL2 scLAPA Euclidean+Euclidean Pearson+Pearson + + p  s  SIMLR+SIMLR RAFSIL1+RAFSIL1 RAFSIL2+RAFSIL2p s  (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Amygdala Hypothalamus Mammary Root 0.00 0.25 0.50 0.75 1.00 A R I Euclidean Pearson SIMLR RAFSIL1 RAFSIL2 scLAPA Hypothalamus Mammary Amygdala Root (A) p  s  HC SC K-means Louvain (B) Euclidean Pearson p  s  SIMLR RAFSIL1 RAFSIL2 scLAPA HC SC K-means Louvain HC SC K-means Louvain HC SC K-means Louvain (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Amygdala Hypothalamus Mammary Root 0.00 0.25 0.50 0.75 1.00 A R I Euclidean Pearson SIMLR scLAPA p  s  scDaPars+ (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 Amygdala Hypothalamus Mammary Root 0.00 0.25 0.50 0.75 1.00 A R I SC3 SINCERA SNNclip Seurat dynamicTreeCut scLAPA (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335 (A) (C) 4 11 10 8 9 0 4 8 12 -12 -9 -6 UMAP 1 U M A P 2 0 1 2 3 4 5 (E) 6 3 5 1 2 -15 -10 -5 0 5 -15 -10 -5 0 5 10 UMAP 1 U M A P 2 0 1 2 (D) 0 2 4 6 PA11108 (coord:73986478) 0 2 4 6 PA11107 (coord:73986076 ) 0 2 4 6 1 2 3 4 5 6 7 8 91011121314 Identity E x p re s s io n L e v e l PA11109 (coord:73987038) 1 2 3 4 5 6 7 8 91011121314 1 2 3 4 5 6 7 8 91011121314 4 10 5 1 2 5 1 9 8 4 3 6 710 -10 0 10 -10 -5 0 5 10 UMAP 1 U M A P 2 -10 0 10 -10 -5 0 5 10 UMAP 1 U M A P 2 Regulatory T cell Megakaryocyte Progenitors 7 6 3 5 4 11 12 13 2 10 8 14 9 1 (B) (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.04.425335doi: bioRxiv preprint https://doi.org/10.1101/2021.01.04.425335