key: cord-0827188-fj23vjts authors: Booeshaghi, A Sina; Pachter, Lior title: Normalization of single-cell RNA-seq counts by log(x + 1)(*) or log(1 + x)(*) date: 2021-03-02 journal: Bioinformatics DOI: 10.1093/bioinformatics/btab085 sha: fc027d1b839a8ac419352e7192a3edcf6e198653 doc_id: 827188 cord_uid: fj23vjts : Single-cell RNA-seq technologies have been successfully employed over the past decade to generate many high resolution cell atlases. These have proved invaluable in recent efforts aimed at understanding the cell type specificity of host genes involved in SARS-CoV-2 infections. While single-cell atlases are based on well-sampled highly-expressed genes, many of the genes of interest for understanding SARS-CoV-2 can be expressed at very low levels. Common assumptions underlying standard single-cell analyses don’t hold when examining low-expressed genes, with the result that standard workflows can produce misleading results. SUPPLEMENTARY INFORMATION: Supplementary data and all of the code to reproduce Figure 1 are available here: https://github.com/pachterlab/BP_2020_2/. The ACE2 receptor, which facilitates entry of SARS-Cov-2 into cells 1, has become one of the most studied genes in the history of genomics over the past two months. There are already hundreds of preprints about the gene (Google Scholar), and it is currently the default gene displayed on the UCSC genome browser 2. Several studies have reported on the expression of ACE2 at single-cell resolution, and papers have been rife with speculation about implications of differential ACE2 mRNA abundance for severity of disease. As is common in single-cell RNA-seq, the expression estimates of ACE2 are derived from counts that are filtered and normalized. Figure 1a shows an analysis of ACE2 mRNA in mice lungs (data from 3). The expression is computed from cells containing at least one copy of the gene. While single-cell RNA-seq expression data has been modeled with many different distributions 4; 5, for simplicity in illustrating our points we model this count data with a simple Poisson random variable X with parameter λ in order to demonstrate the implications of this restriction. Application of the filter amounts to computing While this is approximately λ when λ is large, it is close to 1 when λ is small 6. Figure 1b shows the fraction of cells containing at least one copy of ACE2 7. Evidently, Figure 1a creates a misleading impression. While it may appear that average ACE2 expression is similar between young and old mice, when comparing the fraction of cells with nonzero expression of ACE2 it is clear that ACE2 has significantly lower mRNA expression in the lungs of aged mice than young mice. The fraction f of cells with nonzero expression of a gene has a useful statistical interpretation. We leave it as an exercise for the reader to show that the the following estimator for the Poisson rate is consistent: ( Since f is approximately equal to this expression when f is small, this provides an interpretation of the fraction of cells with at least one copy of a low-abundance gene as an estimate of the rate parameter λ in a Poisson distribution. Another mistake that we've found to be common in reporting ACE2 expression has to do with the log transformation, frequently used as part of a normalization of counts. Counts are log transformed for two reasons: the first is to stabilize the variance, as the log transform has the property that it stabilizes the variance for random variables whose variance is quadratic in the mean 8; 9. There are two main considerations for this step: first when performing PCA on the gene expression matrix to find a reduced-dimensional representation that captures the variance, it is desirable that all genes contribute equally. The second consideration for the log transform is that it converts multiplicative relative changes to additive differences. In the context of PCA, this allows for interpreting the projection axes in terms of relative, rather than absolute, abundances of genes. A seemingly minor technical issue in log transforming counts is that zero counts cannot be "logged", as log(0) is undefined. To circumvent this problem, it is customary to add a "pseudocount", e.g. +1, to each gene count prior to log transforming the data 10. We denote log(1 + x) by log1p in accordance with nomenclature standard in scientific computing 11. For a gene with an average of λ counts where λ is large, it is intuitive that the average of the log1p transformed counts is approximately log(λ). However, this is not true for small λ. An understanding of the result of applying the log1p transform begins with the observation that for a random variable X, E[f (X)] is not, in general, equal to f (E[X]). For example, if X is a Poisson random variable with parameter λ, it is not true that E[log(1 + X)] = log(λ + 1). By Taylor approximation, This shows that for low-expressed genes, the average log1p expression can differ considerably from log(λ), with the maximum difference according to the Taylor approximation at λ ≈ 1. (see Figure 1c) . Thus, while a 2-fold change for large λ translates to a log(2) difference after log1p, that is not the case for small λ. In summary, while single-cell RNA-seq atlases offer detailed information about the transcriptomic profiles of distinct cell types, their use to examine specific genes, as has been done recently in the study of SARS-CoV-2 infection related genes, requires care. Methods should not be used unless their limitations are understood. For example, while it doesn't matter whether one uses log(x + 1) or log(1 + x), the filtering and normalization applied to counts can affect comparative estimates in non-intuitive ways. For example, the SCnorm normalization 12 is based on a preliminary filter for all cells with at least one count, thus subjecting the method to the problem seen in Figure 1a ,b. Indeed, there have been reports of problems with SCnorm when applying the method to sparse datasets with many zeroes 13, leading to the development of methods that account for this 14. Moreover, there are subtle problems that arise when working with small counts that transcend the elementary issues we have raised 15; 16. These matters are not theoretical; we leave the identification of published preprints and papers that have ignored the issues we've raised, and hence reported misleading results, as another exercise for the reader. Angiotensin-converting enzyme 2 (ACE2 ) as a SARS-CoV-2 receptor: molecular mechanisms and potential therapeutic target UCSC Genome Browser enters 20th year An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics Bayesian gamma-negative binomial modeling of single-cell RNA sequencing data Observation weights unlock bulk RNAseq tools for zero inflation and single-cell applications Analyse des infiniment petits Moutard Decrease in ACE2 mRNA expression in aged mouse lung. bioRxiv, 2 The Use of Transformations A New Family of Power Transformations to Improve Normality or Symmetry scClustViz-Single-cell RNAseq cluster assessment and visualization Berkeley Elementary Functions Test Suite SCnorm: robust normalization of single-cell RNA-seq data Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression Why you cannot transform your way out of trouble for small counts Overcoming systematic errors caused by logtransformation of normalized single-cell RNA sequencing data. bioRxiv Article title ) Article title Oxidative stress regulated genes in nigral dopaminergic neurnol cell: correlation with the known pathology in Parkinson's disease Chapter title The future of clinical cancer management: one tumor, one chip Chapter title Sur un syndrome d'obesite infantile avec polydactylie et retinite pigmentaire (contribution a l'etude des formes cliniques de l'obesite hypophysaire) We thank Charles Herring, Michael Hoffman, Johan Gustafsson, Harold Pimentel, Jeffrey Spence, and Valentine Svensson for helpful comments. A.S.B. and L.P. were partially funded by NIH U19MH114830.