key: cord-1020092-k4z86lk8
authors: Han, Wenkai; Cheng, Yuqi; Chen, Jiayang; Zhong, Huawen; Hu, Zhihang; Chen, Siyuan; Zong, Licheng; King, Irwin; Gao, Xin; Li, Yu
title: Self-supervised contrastive learning for integrative single cell RNA-seq data analysis
date: 2021-07-27
journal: bioRxiv
DOI: 10.1101/2021.07.26.453730
sha: e3f1228ad26afcd6975f68814b825e70468862bb
doc_id: 1020092
cord_uid: k4z86lk8

Single-cell RNA-sequencing (scRNA-seq) has become a powerful tool to reveal the complex biological diversity and heterogeneity among cell populations. However, the technical noise and bias of the technology still have negative impacts on the downstream analysis. Here, we present a self-supervised Contrastive LEArning framework for scRNA-seq (CLEAR) profile representation and the downstream analysis. CLEAR overcomes the heterogeneity of the experimental data with a specifically designed representation learning task and thus can handle batch effects and dropout events. In the task, the deep learning model learns to pull together the representations of similar cells while pushing apart distinct cells, without manual labeling. It achieves superior performance on a broad range of fundamental tasks, including clustering, visualization, dropout correction, batch effect removal, and pseudo-time inference. The proposed method successfully identifies and illustrates inflammatory-related mechanisms in a COVID-19 disease study with 43,695 single cells from peripheral blood mononuclear cells. Further experiments to process a million-scale single-cell dataset demonstrate the scalability of CLEAR. This scalable method generates effective scRNA-seq data representation while eliminating technical noise, and it will serve as a general computational framework for single-cell data analysis.

Single-cell RNA sequencing (scRNA-seq) has been a powerful tool for measuring the transcriptome-2 wide gene expression in individual cells and understanding the heterogeneity among cell populations 1, 3 2 . It has been facilitating researchers to investigate several critical biomedical topics, such as cancer 3 4 and autoimmunity 4 . Despite its promises, the unique properties of the scRNA-seq data, such as extreme 5 sparsity and high variability 5 , have posed a number of computational challenges to researchers 6, 7 . To 6 analyze the data, among all the steps 7 , the key processing is to obtain a reliable low-dimension 7 representation for each cell, which can preserve the biological signature of the cell for downstream 8 analysis while eliminating technical noise 8, 9 .

The existing commonly used methods to perform the above processing are based on different backbone 10 algorithms and assumptions. The earliest methods utilize the traditional dimension reduction algorithms, 11 such as Principal Component Analysis (PCA), followed by k-means or hierarchical clustering to group 12 cells 5, 10-15 . Although these methods are widely used, their assumption, that is, the complex single-cell 13 transcriptomics can be accurately mapped onto a low-dimensional space by a generalized linear model, 14 may not be necessarily justified 8 . Considering the complexity of the data, researchers have developed 15 multiple kernel-based spectral clustering methods to learn more robust similarity matrices for cells 16, 17 . 16 However, the time and space complexity of such methods impede the broad applications of the methods 5 .

In contrast, the graph-based methods enjoy high speed and scalability 14, 15, 18, 19 . But such methods are what the deep learning models will learn, although some very recent studies try to impose constraints 24 and our prior knowledge about the problem onto the low-dimensional space 5, 27 . Researchers have also 25 tried to utilize manual labeling as supervision for training the models, accompanied by transfer 26 learning 22 or meta-learning 30 , but such methods encounter scalability issues and have strong 27 assumptions on the homogeneity of different datasets, making them less popular than the above methods.

As discussed above, almost all the existing methods are based on unsupervised learning 7 , regardless of 1 the specific algorithm. Without accessible supervision, for the deep learning-based methods, it is hard 2 to guide the training process of the model and explain why a particular transformation is learned, 3 although the model may work well. To promote the scRNA-seq data analysis, we indeed have some 4 specific requirements for the model. For example, the functionally similar cells should be close in the 5 transformed space, while distinct cells should be distant 7 ; the model should overcome the batch effect 6 and map the cells of the same type but from different experiments into the same region 8 . Unsupervised 7 learning methods may have difficulty in incorporating these requirements explicitly. Here, we propose 8 a novel method, CLEAR, for integrative single-cell RNA-seq data analysis, based on a new machine 9 learning scenario, self-supervised learning, which can model all the above requirements explicitly.

More specifically, we design our method based on self-supervised contrastive learning 31 , where we 11 construct the training labels from the unlabeled data. For the gene expression profile of each cell, we 12 distort the data slightly by adding noise to the raw data, which mimics the technical noise in the 

Overview of CLEAR 2 Unlike most existing methods, which are based on unsupervised learning to map the single-cell gene 3 expression profile to the low-dimension space, we develop CLEAR base on self-supervised learning.

That is, although we do not have the golden standard supervised information, such as the cell type, we 5 train the deep learning model using the supervision constructed from the unlabeled data themselves.

Notice that we can incorporate our prior knowledge about single-cell RNA-seq data, such as noise and 7 dropout events, into the model training process implicitly and seamlessly when we build the label from 8 the unlabeled data. More specifically, we design CLEAR based on self-supervised contrastive learning 31 .

As shown in Fig. 1 , eventually, we also want to train a deep learning encoder to map the gene expression 10 profile into the low-dimension space. However, in addition to that, we further want the trained model 11 to force functionally similar cells close in the transformed space while distinct cells being distant. Here, 12 the model should also be robust to technical noise, such as dropout events. That is, the profiles from the 13 same cell, no matter with or without dropout events, should be mapped into the same place in the low-14 dimension space. Although it is difficult to estimate the noise level of the real dataset, we can add 15 simulated noise to the data and force the trained to be robust to them. Based on the above idea, we 16 design CLEAR as shown in Fig. 1 . Given the single-cell gene expression profile, we add different 17 simulated noise, such as Gaussian noise and simulated dropout events, to it (data augmentation), To access how the representation from CLEAR helps to cluster, we evaluate the proposed method, 2 combined with the -means clustering algorithm, on ten published datasets with expert-annotated 3 labels 32-38 . The label information is only available during testing. We compare our model with several 4 state-of-the-art methods that are widely used for scRNA-seq data and belong to different categories, 5 including PCA-based tools (Seurat 10 , SC3 11 , CIDR 12 , SINCERA 13 ), graph-based methods (Seurat 18 , 6 scGNN 24 ), deep generative models (scVI 8 , scDHA 21 , scGNN 24 , ItClust 22 ), and transfer learning 7 approach (ItClust 22 ). Evaluated on the same datasets with five-fold cross validation, CLEAR achieves 8 substantially better performance in clustering adjusted Rand index (ARI) score than all the other 9 methods on most datasets (Fig. 2a, Supplementary Table 6 ). In particular, on average, CLEAR 10 improves over the second-best method, scDHA, by 4.56% regarding the score. To evaluate the 11 performance more comprehensively, we also use other metrics, such as normalized mutual 12 information(NMI), where the compared methods show similar results. To better understand the 13 representation produced by each method, we use uniform manifold approximation (UMAP) to project 14 the internal representations into a two-dimensional space and visualize them. (Fig. 2b, Supp Fig 1-9 suggesting that the proposed framework can implicitly capture the data's biological features.

Furthermore, to demonstrate the effect of the proposed self-supervised contrastive learning settings, we 25 perform an ablation study on the data augmentation operations, removing each operation one by one 26 and recording the performance change. As shown in Supplementary Table 8,10, removing either 27 augmentation step will lead to the decreased ARI performance of CLEAR, which strongly indicates 28 that the introduced noise and instance discrimination task help the model to capture the real cell-cell 1 relationships. In addition, we further check the influence of using highly variable genes, discovering 2 that those low variable genes may reduce the signal ratio and harm the model performance, which is 3 consistent with the previous study 24, 39 properly. We next evaluate the robustness of CLEAR when encountering dropout events. Although it 9 is impossible to recover the actual gene expression levels and determine how dropouts impact the data, 10 we simulate the dropout effects by randomly masking non-zero entries into zero with a hypergeometric 11 distribution. Given the additional artificial dropouts, clustering becomes much more difficult. We test 12 the eight competing approaches together with CLEAR on the Hravtin dataset, containing 48,266 single 13 cells with 25,187 genes and thus 1.2 billion read counts. Among these reads, 94.2% of them are zeros.

14 We set 10%, 30%, 60%, 80% dropout rates for the non-zero entries, respectively, resulting in a masked 15 dataset with up to 98.8% zeros (Supp Fig. 16) . CLEAR achieves the best performance in handling 16 dropout events in terms of clustering, even when 80% of the non-zero entries are masked, suggesting 17 that it is robust and has the potential to extract important features in some extreme cases. Although the 18 performance of scDHA is similar to that of CLEAR when no dropouts are introduced, it becomes worse 19 when the dropout rate is 80%. (Fig. 2d) .

Although several methods have been proposed to correct batch effects, which are undesirable variability 21 in the scRNA-seq datasets from technical and biological noise, most of them work as separate modules,

focusing on one variable, and thus cannot generalize to the large complex atlas projects. CLEAR, 23 however, has the potential to model multiple batch effects in an end-to-end fashion. Here, we assess 24 CLEAR on correcting batch effects. Specifically, we first evaluate CLEAR on a dataset 40 , consisting of We further quantify the performance of different methods regarding batch effect removal with two 7 metrics, average silhouette width (ASW) and adjusted rand index(ARI), on six datasets (Datasets). We 

To demonstrate the application potential of CLEAR on real-world biology research, we apply it to as Type I interferon response (Fig. 5i) , which may indicate a more active interferon level in moderate 25 COVID patients and have the potential to become a clinical blood test marker to monitor COVID 26 progress. Interestingly, we also find a secretion pathway and myeloid leukocyte activation upregulation 27 in severe samples (Fig. 5j) . This may suggest a dysregulated CD14+ monocytes activation in patients 1 with ARDS. annotated manually according to each cluster's top 100 differential expressed genes (Fig. 6a) . Among 15 them, we find 14 subtypes and then plot selected marker genes for each cell type. Satisfyingly, a 16 significant expression track of these marker genes is obtained under the higher level (e.g., CD4+ T cells 17 and CD8+ T cells are combined as T cells) of these subtypes (Fig. 6b) . Performing sensitive feature 18 extraction while eliminating technical noise on the million-scale dataset, CLEAR is an easy-to-use and

well-performed large-scale scRNA-seq data analysis tool, which has the potential to assist the 20 construction and refinement of cell atlases. 6 Furthermore, it scalable enough to handle a million-scale dataset, which suggests its potential to handle 7 the emerging cell atlases.

In the future, CLEAR can be further developed from both the biological aspect and the machine learning 9 aspect. Regarding the biological application, obviously, CLEAR is a very flexible framework to 10 perform data integration, no matter the single-omics, multi-dataset integration (cell atlases 11 construction), or multi-omics integration (e.g., the integration of scRNA-seq and scATAC-seq data). In 12 terms of the machine learning technical details, more advanced methods to handle data imbalance and 13 incorporate prior knowledge, such as partially labeled data, should be developed. We believe that our 14 framework, CLEAR, will become an alternative approach for single-cell data analysis. Pollen, and Tabula Muris Senis datasets, to improve the data quality, we filtered out low-quality cells 8 with fewer than 200 genes and genes expressed in less than three cells.

Yan dataset. The Yan dataset refers to the human preimplantation embryos and embryonic stem cells.

In this dataset, 90 cells were sequenced with the Tang protocol. We downloaded the dataset from 11 Hemberg Group's website. We used Scanpy to log-transform the dataset, and then each cell was 12 normalized to 10,000 read counts. After that, highly variable genes were selected. Finally, we scaled 13 the dataset to unit variance and zero mean. 

We removed the low-quality data and the spiked-in cells. The top 2,000 most variable genes were 26 selected for downstream analysis after we normalized each cell to 10,000 read counts. 27 1 human pancreas. We downloaded the data from the GEO database (accession number GSE85241), and 2 removed the low-quality data and cells containing a higher number of mitochondrial genes and spike-3 in RNAs. The highly variable genes were then selected after log transformation.

Pollen dataset. The dataset, sequenced with the SMARTer protocol, contains 301 cells in the developing 5 cerebral cortex from 11 populations. We downloaded it from the Hemberg Group's website, removing 6 the low-quality data and cells having a higher number of mitochondrial genes and spike-in RNAs. The 7 highly variable genes were then selected after log transformation.

Tabula Muris Senis dataset. The entire dataset, sequenced with 10X, contains more than 100,000 cells.

It was generated across the lifespan of mice, including 23 tissues and organs, but here we only focus on 10 four tissues: bladder, mammary gland, limb muscle, and diaphragm. These datasets allow us to examine 11 the batch effect and cell clustering. After downloading the raw data from 12 https://figshare.com/projects/Tabula_Muris_Senis/64982, we removed the low-quality data and cells 13 containing a higher number of mitochondrial genes and spike-in RNAs. Each cell was normalized to 14 10,000 read counts. The highly variable genes were then selected after log transformation. Finally, we 15 scaled the dataset to unit variance and zero mean.

Human dendritic cells dataset. It consists of human blood dendritic cell (DC) data from Villani et al 55 .

We downloaded the data from https://github.com/JinmiaoChenLab/Batch-effect-removal- 

Finally, we scaled the dataset to unit variance and zero mean. The cell type information was annotated 6 using marker genes by experts.

The CLEAR framework 9

The key idea of CLEAR is to learn effective cell representations, considering noise in the data, and to 1. Data augmentation. We first perform data augmentation to generate training pairs. Each cell will 24 have two augmented versions, and thus a minibatch of cells is augmented to 2 cells. This step will 25 be discussed in detail in Data augmentation.

2. Constructing negative labels with data from multiple minibatches. For data in one minibatch, we can 27 consider the two data points generated from the same gene expression profile as a positive pair while 28 the other combinations as negatives. However, if we only consider the negatives within a minibatch, 1 the learned mapping function is less likely to be effective for global clustering. To make the locally 2 smooth function have a global effect, we should consider negatives from other minibatches. We 3 achieve that by maintaining a queue with data from multiple minibatches. When the current minibatch 4 is enqueued, the oldest minibatch will be dequeued. Within the queue, a specific distorted profile only 5 has one positive pair match, while all the other profiles are negatives for it. Notice that the conceptual 6 difference between minibatch and the queue arises from the hardware limitation. If the GPU memory 7 is large enough and we can feed all the data in one minibatch, we can discard the queue maintenance. (1)

Note that , is asymmetrical. Suppose we put all the pairs in an order, such that 2 − 1 and 2 denote 17 for the paired augmentations, then the summed-up loss is:

(2) propagation. However, for the key encoder, we update it with momentum:

Here ∈ [0,1) is a momentum coefficient. The momentum update makes the encoder network evolve 1 smoothly.

2 5. Inference. After we train the model, the query encoder network is the final productive network, 3 which outputs the representation of a single cell gene expression profile. After obtaining the 4 representations of all the cells in a dataset, we cluster the cells with the common clustering algorithms 5 (e.g., k-means algorithm, Louvain algorithm, and Leiden algorithm). Finally, cell types are assigned to 6 the discovered clusters based on the differential expression genes in the cluster. 

In addition, we also used cell ARI (cARI) and batch ARI (bARI), as well as cell ASW (cASW) and 

For the dropout clustering experiment, we create a dropout_sampling function for random sampling.

Random seeds are set to ensure each sampling is unique. Software parameters are the same as the 25 baseline clustering experiment. 

Single-cell sequencing-based technologies 12 will revolutionize whole-organism science

Single-cell RNA-seq reveals dynamic paracrine control of cellular 14 variation

Therapy-Induced Evolution of Human Lung Cancer Revealed by 16 Single-Cell RNA Sequencing

Single-Cell RNA-Seq Reveals AML Hierarchies Relevant to Disease 18 Progression and Immunity

Model-based deep embedding for 20 constrained clustering analysis of single cell RNA-seq data

Computational and analytical challenges 23 in single-cell transcriptomics

Challenges in unsupervised clustering of 25 single-cell RNA-seq data

Deep generative modeling 27 for single-cell transcriptomics

Scalable analysis of cell-type 29 composition from single-cell transcriptomics using deep recurrent learning

Spatial reconstruction of 32 single-cell gene expression data

SC3: consensus clustering of single-cell RNA-seq data

Ultrafast and accurate clustering through 36 imputation for single-cell RNA-seq data

A Pipeline for Single-1

SCANPY: large-scale single-cell gene expression 3 data analysis

Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like 5 Cells that Correlate with Prognosis

Visualization and 7 analysis of single-cell RNA-seq data by kernel-based similarity learning

Spectral clustering based on learning similarity matrix

Integrating single-cell 12 transcriptomic data across different conditions, technologies, and species

Recovering Gene Interactions from Single-Cell Data Using Data 15 Diffusion

Single-cell RNA-seq 17 denoising using a deep count autoencoder

Fast and precise single-cell data analysis using a hierarchical 19 autoencoder

Iterative transfer learning with neural network for clustering and cell type 21 classification in single-cell RNA-seq analysis

Data denoising with transfer learning in single-cell transcriptomics

scGNN is a novel graph neural network framework for single-cell 25 RNA-Seq analyses

Deep learning enables accurate clustering with batch effect removal in 27 single-cell RNA-seq analysis

Interpretable dimensionality reduction of single 29 cell transcriptome data with deep generative models

Deep generative model embedding of single-cell RNA-Seq 32 profiles on hyperspheres and hyperbolic spaces

ZIFA: Dimensionality reduction for zero-inflated single-cell gene 34 expression analysis

A general and flexible 36 method for signal extraction from single-cell RNA-seq data

MARS: discovering novel cell types across heterogeneous single-cell 39 experiments

A Simple Framework for Contrastive 41 Learning of Visual Representations

Single-cell RNA-seq reveals 43 dynamic, random monoallelic gene expression in mammalian cells

47 34. Pollen, A.A. et al. Low-coverage single-cell mRNA sequencing reveals cellular 1 heterogeneity and activated signaling pathways in developing cerebral cortex

Single cell RNA-sequencing of pluripotent states unlocks 4 modular transcriptional variation

A single-cell transcriptome atlas of the human pancreas

Single-cell analysis of experience-dependent transcriptomic states in 8 the mouse visual cortex

A single cell transcriptomic atlas characterizes aging tissues in the 10 mouse

Benchmarking atlas-level data integration in single-cell 12 genomics

A benchmark of batch-effect correction methods for single-cell 14 RNA sequencing data

PAGA: graph abstraction reconciles clustering with trajectory 16 inference through a topology preserving map of single cells

The dynamics and regulators of cell fate decisions are revealed by 19 pseudotemporal ordering of single cells

A single-cell atlas of the peripheral immune response in patients with 21 severe COVID-19

Immunologic perturbations in severe COVID-19/SARS-CoV-2 23 infection

Comprehensive mapping of immune perturbations 25 associated with severe COVID-19

Antibody Responses to SARS-CoV-2 in Patients With Novel Coronavirus 27 Disease

The interplay between inflammatory 29 pathways and COVID-19: A critical review on pathogenesis and therapeutic options

The cytokine storm and COVID-19

Suppressive myeloid cells are a hallmark of severe 34 COVID-19. medRxiv

Single-Cell Omics Reveals Dyssynchrony of the Innate and 36

Adaptive Immune System in Progressive COVID-19. medRxiv

Single-cell analysis of two severe COVID-19 patients reveals a 39 monocyte-associated and tocilizumab-responding cytokine storm

Cytokine Storm; What We Know So Far

Severe COVID-19 Is Marked by a Dysregulated Myeloid 44 Cell Compartment

Contributions 7

The authors declare no competing interests. 9Additional Information 10 11