key: cord-0267682-kmqxtt1w
authors: Hu, Yinlei; Li, Bin; Zhang, Wen; Liu, Nianping; Cai, Pengfei; Chen, Falai; Qu, Kun
title: WEDGE: imputation of gene expression values from single-cell RNA-seq datasets using biased matrix decomposition
date: 2020-10-01
journal: bioRxiv
DOI: 10.1101/864488
sha: 8a9c27da1835e1150a11a1eec706b69960c6be41
doc_id: 267682
cord_uid: kmqxtt1w

The low capture rate of expressed RNAs from single-cell sequencing technology is one of the major obstacles to downstream functional genomics analyses. Recently, a number of imputation methods have emerged for single-cell transcriptome data; however, recovering missing values in very sparse expression matrices remains a substantial challenge. Here, we propose a new algorithm, WEDGE (WEighted Decomposition of Gene Expression), to impute gene expression matrices by using a biased low-rank matrix decomposition method (bLRMD). WEDGE successfully recovered expression matrices, reproduced the cell-wise and gene-wise correlations, and improved the clustering of cells, performing particularly well on datasets with multiple cell types and high dropout rates. Overall, this study demonstrates a potent approach for imputing sparse expression matrix data, and our WEDGE algorithm should help many researchers to more profitably explore the biological meanings embedded in their scRNA-seq datasets.

Single-cell sequencing technology has been widely used in studies of many biological systems, including but not limited to embryonic development (Xue et [...]

[...] netNMF-sc (Elyanow et al. 2020), each of which seeks to improve recovery of the expression matrix for single-cell data. However, for datasets with high dropout rates, which therefore have very sparse expression matrices, it is still a challenge to [...]

[...] wherein a higher ARI value indicates that the clustering result is closer to the "true" cell types. Using the expression matrix imputed by WEDGE, we can clearly distinguish these cell types. The ARI value of the cell clusters from the WEDGE-imputed matrix is 0.99, higher than those from the other three imputation methods.

To examine the performance of WEDGE on real scRNA-seq data, we applied it to Zeisel's dataset (Zeisel et al. 2015) on mouse brain scRNA-seq. We first constructed [...] all the genes detected in more than 40% of cells, and then generated an "observed" matrix with high sparsity by randomly setting a large proportion of the non-zero elements to zero (dropout rate = 0.85). From the heatmaps of gene expression matrices (Fig. 2A), we can see that WEDGE recovered the expression of the DE genes, especially those differentially expressed between interneurons and S1 pyramidal cells.

[...] CMD indicates that the imputed data is closer to the reference data (Fig. 2C). For the matrix generated by WEDGE, the cell-to-cell CMD is 0.03 and the gene-to-gene CMD is 0.12, each of which is tied for the lowest among all tested methods. Together, these comparisons highlight that WEDGE can recover both the cell-cell and gene-gene correlations from sparse single-cell RNA-seq datasets.
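The construction of an "observed" matrix from a reference matrix, as described above for Zeisel's dataset, can be sketched as follows. This is only an illustration under stated assumptions: the function name `simulate_dropout` is hypothetical, the toy reference matrix is synthetic, and the dropout rate is interpreted here as the fraction of non-zero entries that are set to zero, which the text does not state explicitly.

```python
import numpy as np

def simulate_dropout(reference, dropout_rate, seed=0):
    """Return a copy of `reference` in which a random fraction
    (`dropout_rate`) of the non-zero entries has been set to zero.
    Assumption: the dropout rate is relative to the non-zero entries."""
    rng = np.random.default_rng(seed)
    observed = reference.copy()
    rows, cols = np.nonzero(reference)            # positions of non-zero entries
    n_drop = int(round(dropout_rate * rows.size))  # how many entries to zero out
    drop = rng.choice(rows.size, size=n_drop, replace=False)
    observed[rows[drop], cols[drop]] = 0
    return observed

# Synthetic toy "reference" matrix (genes x cells), for illustration only.
toy_rng = np.random.default_rng(1)
reference = toy_rng.poisson(5.0, size=(200, 50)).astype(float)
observed = simulate_dropout(reference, dropout_rate=0.85)
```

Entries that survive the down-sampling keep their reference values, so the reference matrix can serve as ground truth when the imputed matrix is evaluated.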

In the tSNE map of cells, WEDGE can clearly distinguish interneurons, S1 pyramidal neurons, and CA1 pyramidal neurons, and the ARI value of 0.56 for the clustering result calculated from its imputed matrix is the highest among all tested methods ( [...]

[...] ALRA, and ENHANCE enhanced the expression of Cr2 and Fcer2a in some cells, but these methods also amplified batch effects, and clustering based on the imputed data from these methods did not clearly distinguish splenic B cells into FO and MZ subpopulations. SAVER-X did not classify the FO and MZ subpopulations based on differential expression trends for Cr2 or Fcer2a. In addition, VIPER and SCRABBLE were unable to obtain imputation results from this dataset within 100 hours on a computer with 72 CPU cores (2.2 GHz) and 1 TB of memory, and netNMF-sc did not complete because of memory errors.

[...] 1.00; Fig. 4D). Notably, the DE genes of cluster 9 generated from the WEDGE-imputed data cover 99% of the DE genes in the raw data (Fig. 4C). The WEDGE-imputed data also increased the expression bias of [...]

[...] to complete the imputation process, which was close to that of the MAGIC method (Fig. 5A&B). To further assess the computational resources that WEDGE requires on datasets of various sizes, we applied it to impute datasets comprising different numbers of cells (5000~1000000) but a fixed number of genes (2000), which were sampled from the mouse brain atlas project (see Methods). The runtime of WEDGE increased linearly with the number of cells, and its speed was close to that of DCA and MAGIC (Fig. 5C). For the dataset containing 1 million cells and 2000 genes, WEDGE finished the imputation of missing values in 12 minutes. Notably, WEDGE offers a visual interactive interface, making it convenient for researchers to use. We have uploaded WEDGE and the datasets used in this study to GitHub (https://github.com/QuKunLab/WEDGE).
Here, we present an approach, WEDGE, to impute missing gene expression information in single-cell sequencing datasets that is based on the combination of low-rank matrix decomposition and biased weight parameters for the zero and non-zero elements in the expression matrix. We demonstrate that the use of WEDGE significantly improves the clustering accuracy of many scRNA-seq datasets, amplifies the contribution of differential genes to identifying cell types, and helps distinguish more cell subpopulations from low-quality data.

[...] recovery performance is insensitive to λ for datasets with dropout rates less than 0.6 (Supplemental Fig. S11; see Methods). For datasets with dropout rates greater than 0.6, λ values between 0.1 and 0.15 produce the best recovery results. We thus set λ = 0.15 for all datasets presented in this paper. The imputation contribution of the zero elements decreases as matrix sparsity increases, but it cannot be ignored, which implies that some zero elements may reflect the low expression of certain genes rather than simply experimental noise.

There are still challenges for the informative imputation of scRNA-seq datasets, such as how to recover the heterogeneity between cell types instead of experimental batches, how to discover cell subtypes with very few cells from the imputed data, and how to use limited computer resources to process large datasets containing millions of cells. Moreover, it is necessary to assess whether current imputation methods are applicable to datasets obtained using diverse bioanalytical methods beyond standard RNA-seq (e.g., single-cell ATAC-seq and profiling methods for various epigenomic modifications).

[...] Users can tune λ ∈ [0, 1] to balance the contributions of the two terms of the objective function. To study the influence of λ on the recovery performance, we down-sampled the reference data of the Zeisel and Baron datasets and adopted different dropout rates. By computing the correlation matrix distances (CMDs) ( [...]

where ̃ is the combination of + and 0 according to the original order of their elements in , and is the ith row of . In this case, optimizing is equivalent to solving non-negative least-squares problems (2) in parallel (Lawson and Hanson 1995). After was obtained, we fixed it and solved using an algorithm similar to the one described above.

Step2: from a given , solve in parallel with a non-negative least-squares method.

Step3: from the obtained in step 2, calculate a new .

Step4: repeat steps 2 and 3 until the relative difference in the objective function between two consecutive iterations is less than 1×10⁻⁵ or the maximum specified number of iterations is reached.
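The alternating scheme in the steps above can be illustrated with a small sketch. It is not the authors' implementation. It assumes an objective of the form "squared error on the non-zero entries plus λ times the squared reconstruction on the zero entries", which is inferred from the description of two terms balanced by λ. The factor names `W` and `H`, the function names, the random initialization, and the sequential loops (in place of parallel solves) are all assumptions made for illustration. Each non-negative least-squares subproblem is solved with `scipy.optimize.nnls`.

```python
import numpy as np
from scipy.optimize import nnls

def weighted_objective(A, W, H, lam):
    """Assumed objective: fit the non-zero entries of A, and penalize the
    reconstruction at the zero entries with weight lam."""
    R = W @ H
    nz = A > 0
    return np.sum((A[nz] - R[nz]) ** 2) + lam * np.sum(R[~nz] ** 2)

def weighted_nnls_factorize(A, rank, lam=0.15, tol=1e-5, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Square roots of the per-entry weights: 1 for non-zeros, lam for zeros.
    sqrt_w = np.where(A > 0, 1.0, np.sqrt(lam))
    W = rng.random((m, rank))      # assumption: random non-negative start
    H = np.zeros((rank, n))
    history = []
    for _ in range(max_iter):
        # With W fixed, every column of H is an independent NNLS problem.
        for j in range(n):
            s = sqrt_w[:, j]
            H[:, j], _ = nnls(s[:, None] * W, s * A[:, j])
        # With H fixed, every row of W is an independent NNLS problem.
        for i in range(m):
            s = sqrt_w[i, :]
            W[i, :], _ = nnls((H * s).T, s * A[i, :])
        history.append(weighted_objective(A, W, H, lam))
        # Stop when the relative change of the objective falls below tol.
        if len(history) > 1:
            prev, cur = history[-2], history[-1]
            if abs(prev - cur) <= tol * max(prev, 1e-12):
                break
    return W, H, history

# Synthetic low-rank count matrix, for illustration only.
toy_rng = np.random.default_rng(0)
A = toy_rng.poisson(4.0 * toy_rng.random((30, 3)) @ toy_rng.random((3, 20))).astype(float)
W, H, history = weighted_nnls_factorize(A, rank=3, lam=0.15, max_iter=20)
imputed = W @ H
```

Because each half-step solves its subproblem exactly, the recorded objective should not increase from one sweep to the next, which is what makes a relative-change stopping rule meaningful.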

CMD is commonly used to quantify the difference between two correlation matrices. It [...]

[...] With the fixed number of genes, we sampled 1000, 5000, 10000, 100000, 500000, and 1000000 cells from the raw dataset to simulate experiments of different scales.
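For reference, the correlation matrix distance mentioned above is conventionally defined for two correlation matrices R1 and R2 as CMD(R1, R2) = 1 − tr(R1 R2) / (‖R1‖F ‖R2‖F). It is 0 when the two matrices are equal up to a scaling factor and approaches 1 as they differ maximally. The sketch below computes cell-to-cell and gene-to-gene CMDs between an imputed and a reference matrix. The function names, the use of Pearson correlation, the genes-by-cells orientation, and the toy data are assumptions for illustration.

```python
import numpy as np

def correlation_matrix_distance(R1, R2):
    """CMD = 1 - tr(R1 R2) / (||R1||_F * ||R2||_F)."""
    num = np.trace(R1 @ R2)
    den = np.linalg.norm(R1, "fro") * np.linalg.norm(R2, "fro")
    return 1.0 - num / den

def cell_and_gene_cmd(imputed, reference):
    """Both inputs are assumed to be genes x cells, with no constant rows
    or columns (a constant vector has an undefined Pearson correlation)."""
    # np.corrcoef correlates rows, so transpose to correlate cells.
    cell_cmd = correlation_matrix_distance(
        np.corrcoef(imputed.T), np.corrcoef(reference.T))
    gene_cmd = correlation_matrix_distance(
        np.corrcoef(imputed), np.corrcoef(reference))
    return cell_cmd, gene_cmd

# Synthetic toy data: a reference matrix and a noisy copy of it.
toy_rng = np.random.default_rng(0)
reference = toy_rng.normal(size=(40, 25))
noisy = reference + 0.5 * toy_rng.normal(size=reference.shape)
cell_cmd, gene_cmd = cell_and_gene_cmd(noisy, reference)
self_cmd = correlation_matrix_distance(np.corrcoef(reference), np.corrcoef(reference))
```

A lower value means the correlation structure of the imputed data is closer to that of the reference, which is how the CMD values in the Results are read.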

(1) Settings for dimension reduction: For the simulated dataset, Baron's dataset, and Zeisel's dataset, we used the first 20 principal components to perform tSNE analysis.

A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure

Integrating single-cell transcriptomic data across different conditions, technologies, and species

VIPER: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies

netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis

Single-cell RNA-seq denoising using a deep count autoencoder

Validation of noise models for single-cell transcriptomics

Single-cell analysis of two severe COVID-19 patients reveals a monocyte-associated and tocilizumab-responding cytokine storm

Global characterization of T cells in non-small-cell lung cancer by single-cell sequencing

Correlation matrix distance, a meaningful measure for evaluation of non-stationary MIMO channels

A systematic evaluation of single-cell RNA-sequencing imputation methods

SAVER: gene expression recovery for single-cell RNA sequencing

Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method

Weighted nonnegative matrix factorization

SC3: consensus clustering of single-cell RNA-seq data

Distinct Transcriptomic Features are Associated with Transitional and Mature B-Cell Populations in the Mouse Spleen

Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells

Neuronal subtypes and diversity revealed by single-nucleus RNA sequencing of the human brain

Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain

Solving least squares problems

Distance between Sets

Zero-preserving imputation of scRNA-seq data using low-rank approximation

Marginal-zone B cells

Maintenance of the marginal-zone B cell compartment specifically requires the RNA-binding protein ZFP36L1

Single-cell RNA-seq highlights intra-tumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma

SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data

Computational and analytical challenges in single-cell transcriptomics

Smibert P, Satija R. 2019a. Comprehensive integration of single-cell data

Smibert P, Satija R

Organ c, processing, Library p, sequencing, Computational data a, Cell type a, Writing g et al

Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments

Accurate denoising of single-cell RNA-Seq data using unbiased principal component analysis

Data denoising with transfer learning in single-cell transcriptomics

Orthogonal rank-one matrix pursuit for low rank matrix completion

A single-cell atlas of the peripheral immune response in patients with severe COVID-19

SCANPY: large-scale single-cell gene expression data analysis

Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing

Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells

Splatter: simulation of single-cell RNA sequencing data

Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq

Lineage tracking reveals dynamic relationships of T cells in colorectal cancer

Transcriptome Network Underlying Gastric Premalignant Lesions and Early Gastric Cancer

Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing

The authors declare that they have no competing interests.