key: cord-0290613-p16e7hte
authors: Shen, Hongru; Shen, Xilin; Feng, Mengyao; Wu, Dan; Zhang, Chao; Yang, Yichen; Yang, Meng; Hu, Jiani; Liu, Jilei; Wang, Wei; Li, Yang; Zhang, Qiang; Yang, Jilong; Chen, Kexin; Li, Xiangchun
title: A universal approach for integrating super large-scale single-cell transcriptomes by exploring gene rankings
date: 2021-08-24
journal: bioRxiv
DOI: 10.1101/2021.08.23.457305
sha: d61f3d3d267e8ab78893b90d52cbc6a95a08f2ec
doc_id: 290613
cord_uid: p16e7hte

Advances in single-cell RNA sequencing have led to an exponential accumulation of single-cell expression data. However, there is still a lack of tools that can integrate this ever-growing volume of single-cell expression data. Here, we present iSEEEK, a universal approach for integrating super large-scale single-cell expression data by exploring the expression rankings of top-expressing genes. We developed iSEEEK with 13.7 million single cells. We demonstrated the efficiency of iSEEEK on canonical single-cell downstream tasks using five heterogeneous datasets encompassing human and mouse samples. iSEEEK achieved good clustering performance when benchmarked against well-annotated cell labels. In addition, iSEEEK could transfer the knowledge learned from large-scale expression data to a new dataset that was not involved in its development. iSEEEK also enables identification of gene-gene interaction networks that are characteristic of specific cell types. Our study presents a simple yet effective method to integrate super large-scale single-cell transcriptomes and would facilitate translational single-cell research from bench to bedside.

Additionally, we found that iSEEEK can work effectively on a new dataset that was not

iSEEEK enables discovery of marker genes and gene interaction modules

We added and trained a classifier at the end of iSEEEK to identify marker genes on the dataset of FACS-sorted CD4/8+ T cells (see Methods).
An apparent separation of CD4+ and CD8+ T cells was observed on the UMAP visualization plot (Figure 4A). We identified cell-type-specific markers for these CD4/8+ T cells (see

KLRK1 and NKG7 (Figure 4B). We respectively obtained gene interaction networks that are characteristic of CD4+ and

interactions among cytotoxic genes including GNLY, NKG7, PRF1, LCK and KLRD1 [36]. In addition, the CD8+ T cell recruitment gene CCL5 [37] exhibited strong interactions with markers of CD8+ T cells including CD8A, CD8B and GZMB. Gene interactions from the CD8+ T cell-specific module are enriched in the STRING database (12/144 interactions; hypergeometric test, p = 1.3e-3).

In this study, we presented iSEEEK, a universal approach for integrating super large-scale single-cell transcriptomes by exploring the rankings of top-expressing genes.

simple and yet effective way. iSEEEK can make use of single-cell transcriptomes from different species, as exemplified by the integration of data from Homo sapiens and Mus musculus in our study. iSEEEK circumvents the tremendous challenge of batch correction in single-cell integration by modeling gene expression rankings rather than actual expression levels. Because iSEEEK relies on the ranking of top-expressing genes rather than on actual expression levels, it is less sensitive to batch effects, which was verified in this study (Supplementary Figure 7).

Batch-correction methods

In this study, we formulate single-cell transcriptome integration as a language modeling task, so recent advances in natural language processing can benefit single-cell integration. The pretraining-then-finetuning paradigm is the de facto procedure in natural language processing because it is robust to overfitting, makes use of super large-scale data, and reduces the need for large datasets in downstream tasks [40].
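To make the rank-based encoding discussed above concrete, the following is a minimal sketch, not the authors' implementation. It turns one cell's expression values into a sequence of gene symbols ordered from highest to lowest expression, and shows that an order-preserving distortion of the values leaves the sequence unchanged. The function name `rank_encode`, the `top_n` cutoff, the toy gene vocabulary and the rescaling used to mimic a batch effect are all illustrative assumptions; real batch effects are not guaranteed to preserve gene order.

```python
from typing import Dict, List, Set


def rank_encode(expression: Dict[str, float], vocabulary: Set[str], top_n: int = 4) -> List[str]:
    """Return up to `top_n` expressed genes, ordered from highest to lowest expression.

    Genes outside `vocabulary` and genes with zero expression are ignored.
    `top_n` is a hypothetical cutoff chosen for this toy example.
    """
    expressed = [(gene, value) for gene, value in expression.items()
                 if value > 0 and gene in vocabulary]
    # Sort by descending expression; break ties alphabetically so the output is deterministic.
    expressed.sort(key=lambda pair: (-pair[1], pair[0]))
    return [gene for gene, _ in expressed[:top_n]]


# Toy vocabulary standing in for a dictionary of protein-encoding genes (assumption).
vocabulary = {"CD8A", "NKG7", "GZMB", "CCL5", "CD4"}

# Toy expression profile of one cell; XIST is not in the vocabulary and is skipped.
cell = {"CD8A": 12.0, "NKG7": 30.0, "GZMB": 7.0, "CCL5": 18.0, "CD4": 0.0, "XIST": 50.0}

# Hypothetical batch effect modelled as an order-preserving rescaling of non-zero values.
cell_other_batch = {gene: 2.5 * value + (1.0 if value > 0 else 0.0)
                    for gene, value in cell.items()}

tokens_a = rank_encode(cell, vocabulary)
tokens_b = rank_encode(cell_other_batch, vocabulary)
print(tokens_a)
print(tokens_a == tokens_b)
```

The gene sequence plays the role of a sentence for the language model, which is why only the order of the genes, not their measured magnitudes, reaches the model in this sketch.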
Herein, we provided a universal, scalable, transferable, effective and easy-to-use

Taylor, W. L. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly 30, 415-433 (1953).

sentences.

The five datasets used in the downstream tasks of iSEEEK are described below:

We constructed a dictionary with protein-encoding genes. For each cell, we prepared a

For a specific cell type, we rank the influence of genes by the average value of ∆, and the genes ranked at the top are considered to be marker genes.

The symmetric matrix Q can be decomposed as $UAU^{T}$. Let

A family of approximated diffusion maps, parameterized by the timescale t, is defined as:

θ is a threshold to filter out low attentions, and a value of 0.05 was used in this study. Because the attention from gene i to gene j is not identical to that from gene j to gene i, the attention matrix of a specific cell type was further refined as:

Single-cell clustering and evaluation

We extracted the feature representation of each single cell with the pretrained iSEEEK. The extracted features were used as input to the K-Nearest Neighbors (KNN) algorithm

For comparison, we also performed single-cell clustering with Scanpy (v1.6.0) as the benchmarking tool, following the conventional single-cell analysis based on gene expression. We first filtered out cells meeting either of the following criteria: fewer than 200 expressed genes or more than 30% mitochondrial counts. The highly variable genes (HVGs) were selected with default parameters (i.e. max_mean=3 and min_mean=0.0125). We used the default 50 principal components to construct the KNN graph and subsequently applied the Leiden community detection algorithm to delineate clusters with the default parameter (i.e. resolution=1).
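The cell-level quality filter used in the benchmarking pipeline above can be sketched in plain Python, without any single-cell library. This is an illustration of the two stated criteria only (fewer than 200 expressed genes, or more than 30% mitochondrial counts); the function name `keep_cell`, the dictionary-based input format and the `MT-` gene-name prefix used to recognize mitochondrial genes are assumptions, not details taken from the study.

```python
from typing import Dict


def keep_cell(counts: Dict[str, int], min_genes: int = 200,
              max_mito_pct: float = 30.0, mito_prefix: str = "MT-") -> bool:
    """Return True if a cell passes both quality criteria.

    A cell is discarded when it expresses fewer than `min_genes` genes or when
    more than `max_mito_pct` percent of its counts come from mitochondrial genes.
    Mitochondrial genes are recognized by name prefix (an assumption; the
    case-insensitive match also covers mouse-style names such as mt-Co1).
    """
    n_expressed = sum(1 for count in counts.values() if count > 0)
    total = sum(counts.values())
    mito = sum(count for gene, count in counts.items()
               if gene.upper().startswith(mito_prefix))
    mito_pct = 100.0 * mito / total if total > 0 else 0.0
    return n_expressed >= min_genes and mito_pct <= max_mito_pct


# Three toy cells built from placeholder gene names (hypothetical data).
good_cell = {f"GENE{i}": 1 for i in range(250)}
good_cell["MT-CO1"] = 50            # 50 / 300 counts, about 16.7% mitochondrial

sparse_cell = {f"GENE{i}": 1 for i in range(100)}   # only 100 expressed genes

stressed_cell = {f"GENE{i}": 1 for i in range(250)}
stressed_cell["MT-CO1"] = 200       # 200 / 450 counts, about 44.4% mitochondrial

for name, cell in [("good", good_cell), ("sparse", sparse_cell), ("stressed", stressed_cell)]:
    print(name, keep_cell(cell))
```

In a Scanpy workflow the same thresholds would normally be applied through its preprocessing utilities; the standalone version is shown here only to make the criteria explicit.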
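We used the adjusted Rand index (ARI) as the clustering measure to evaluate the clustering

$$
\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_{i}}{2} + \sum_{j}\binom{b_{j}}{2}\right] - \left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right] \Big/ \binom{n}{2}}
$$

Here $n_{ij}$ is the number of cells assigned to both reference class i and cluster j, $a_{i}$ and $b_{j}$ are the corresponding row and column totals of the contingency table, and n is the total number of cells.

Expression profiling

This work was supported by the National Natural Science Foundation of China (no.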
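As an illustration of the adjusted Rand index (ARI) used above for clustering evaluation, the sketch below computes it from two label lists with the standard contingency-table formula. It uses only the Python standard library; the function name `adjusted_rand_index` and the toy labels are illustrative, and returning 1.0 in the degenerate case where the denominator is zero is a convention chosen for this sketch.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_true: Sequence[Hashable],
                        labels_pred: Sequence[Hashable]) -> float:
    """Compute the ARI between a reference labelling and a clustering."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    # Contingency table entries n_ij, plus row totals a_i and column totals b_j.
    contingency = Counter(zip(labels_true, labels_pred))
    row_totals = Counter(labels_true)
    col_totals = Counter(labels_pred)

    sum_nij = sum(comb(count, 2) for count in contingency.values())
    sum_a = sum(comb(count, 2) for count in row_totals.values())
    sum_b = sum(comb(count, 2) for count in col_totals.values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: no pairs to compare

    expected = sum_a * sum_b / total_pairs   # index expected by chance
    max_index = 0.5 * (sum_a + sum_b)        # largest attainable index
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions put all cells together
    return (sum_nij - expected) / (max_index - expected)


# Toy example: cluster ids are permuted relative to the reference labels.
reference = ["CD4", "CD4", "CD8", "CD8"]
relabelled = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]

print(adjusted_rand_index(reference, relabelled))
print(adjusted_rand_index(reference, crossed))
```

Because the index depends only on which cells are grouped together, renaming the clusters does not change the score, whereas a clustering that splits every reference class scores below zero.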