Benchmarking Association Analyses of Continuous Exposures with RNA-seq in Observational Studies 1 Benchmarking Association Analyses of Continuous Exposures with RNA-seq in Observational Studies Tamar Sofer1,2,*, Nuzulul Kurniansyah1,*, François Aguet3, Kristin Ardlie3, Peter Durda4, Deborah A. Nickerson5, Joshua D. Smith5, Yongmei Liu6, Sina A. Gharib7, Susan Redline1, Stephen S. Rich8, Jerome I. Rotter9, Kent D. Taylor9 1Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, MA, USA 2Departments of Medicine and of Biostatistics, Harvard University, Boston, MA, USA 3The Broad Institute of MIT and Harvard, Cambridge, MA, USA 4Department of Pathology and Laboratory Medicine, Larner College of Medicine, University of Vermont, Burlington, VT, USA 5Department of Genome Sciences, University of Washington, Seattle, WA, USA 6Duke Molecular Physiology Institute, Department of Medicine, Division of Cardiology, Duke University Medical Center, Durham, NC, USA 7Computational Medicine Core, Center for Lung Biology, UW Medicine Sleep Center, Department of Medicine, University of Washington, Seattle, WA, USA 8Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA 9The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA USA *These authors contributed equally to the work. Correspondence: Tamar Sofer Email: tsofer@bwh.harvard.edu 221 Longwood Ave .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 2 Boston, MA 02115 Abstract Large datasets of hundreds to thousands of individuals measuring RNA-seq in observational studies are becoming available. Many popular software packages for analysis of RNA-seq data were constructed to study differences in expression signatures in an experimental design with well-defined conditions (exposures). In contrast, observational studies may have varying levels of confounding of the transcript-exposure associations; further, exposure measures may vary from discrete (exposed, yes/no) to continuous (levels of exposure), with non-normal distributions of exposure. We compare popular software for gene expression - DESeq2, edgeR, and limma - as well as linear regression-based analyses for studying the association of continuous exposures with RNA-seq. We developed a computation pipeline that includes transformation, filtering, and generation of empirical null distribution of association p-values, and we apply the pipeline to compute empirical p-values with multiple testing correction. We employ a resampling approach that allows for assessment of false positive detection across methods, power comparison, and the computation of quantile empirical p-values. The results suggest that linear regression methods are substantially faster with better control of false detections than other methods, even with the resampling method to compute empirical p-values. We provide the proposed pipeline with fast algorithms in R. Introduction .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 3 Many studies of phenotypes associated with gene expression from RNA-seq consist of small sample sizes (tens of subjects) and are focused on comparisons of transcriptional expression patterns between well-delineated states, such as different experimental conditions, tumor versus non-tumor cells (1; 2), and disease vs non-disease groups (3). Some studies are designed to identify differential expression across hidden, discrete conditions (4). Epidemiological cohorts have recently utilized stored samples to facilitate the use of RNA-seq data in studies of association with subclinical phenotypes such as blood biomarkers, imaging, and other physiological measures, with often continuous measures being used in statistical analyses. High throughput RNA sequencing enables broad assaying of a sample’s transcriptome (5) and has been in increasing use for over a decade (6). A large variety of analytic and statistical approaches have been developed to address scientific questions such as alternative splicing, differential expression, and more (4; 7-11), often building on methods developed for analyses of expression microarrays (12-14); comprehensive reviews are available (15-19). In this work, we are specifically interested in differential expression analysis with continuous exposures, and we assume that count data are already prepared and available to the analyst. Popular software packages for differential expression analysis include the DESeq2 R package (9), which models the expression counts as following a negative binomial distribution, with shrinkage imposed on both the mean and the dispersion parameters, based on estimates from the entire transcriptome, or user-supplied values. EdgeR (7) uses a negative binomial model similar to the DESeq2 model for transcript counts, in combination with overdispersion moderation. EdgeR was primarily designed for differential expression analysis between two groups when at least one of the groups has replicated measurements (20). Limma (21) uses linear models, which are very .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 4 flexible and can effectively accommodate many study designs and hypotheses. Similar to the DESeq2 and edgeR packages, Limma also uses an empirical Bayes method to borrow information across transcripts to estimate a global variance parameter that is applied for the computation of variance parameters of each single transcript. It uses log transformation and weighting, known as the “voom” transformation, in the final linear model that is used for differential expression analysis. We refer to it henceforth as the limma-voom. Prior to differential expression analysis, library normalization is performed (22). Popular approaches are the TMM (trimmed-means of M-values) normalization (23), implemented in edgeR, and the size factors normalization (24), implemented in DESeq2. Sleep disordered breathing phenotypes, such as the Apnea-Hypopnea Index (AHI), the number of apnea and hypopnea events per hour of sleep, provides a quantitative assessment of the severity of the disorder, with no clear threshold above which different biological processes occur (although thresholds are used for clinical decision making and health insurance reimbursement). Association analysis with continuous exposures provides different challenges than those traditionally encountered. The distribution of such exposures may have strong effects on the association analysis results, regardless of the underlying associations, due to the combination of skewed exposure distributions and the distribution of RNA-seq read count data, that are generally over-dispersed with occasional extreme values. As observational study data analyses may include covariates, statistical methods from experimental studies (e.g., exact tests) cannot be applied. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 5 In this manuscript, we compare the DESeq2, edgeR, and limma-voom analysis approaches for differential expression analysis, with linear regression–based approaches that do not use the empirical Bayes approach for estimating variance parameter across the transcriptome. We study the computation of p-values using resampling of phenotype residuals, while preserving the structure of the data. This addresses the limitation of permutation noted by others in the context of differential expression analysis of RNA-seq (21), where permutation may not be tuned to test a specific null hypothesis because in its standard form it “breaks” all relationships between the permuted variable and the rest of the dataset. Finally, we study the use of empirical p-values that tune the original p-values based on the residual resampling scheme. Throughout, we use a dataset with sleep disordered breathing phenotypes and RNA-seq from the Multi-Ethnic Study of Atherosclerosis as a case study. We demonstrate the statistical implications of performing association analysis of RNA-seq with continuous, non-normal exposures, compare analysis methods, and develop recommendations. Methods The Multi-Ethnic Study of Atherosclerosis (MESA) MESA is a longitudinal cohort study, established in 2000, that prospectively collected risk factors for development of subclinical and clinical cardiovascular disease among participants in six field centers across the United States (Baltimore City and Baltimore County, MD; Chicago, IL; Forsyth County, NC; Los Angeles County, CA; Northern Manhattan and the Bronx, NY; and St. Paul, MN). The cohort has been studied every few years. The present analysis considers N = 462 individuals who participated in a sleep ancillary study performed shortly following the .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 6 participants Exam 5 during 2010-2013 (25; 26), with RNA-seq measured via the Trans-Omics in Precision Medicine (TOPMed) program. Here, we used RNA-seq data with RNA extracted from whole blood drawn in Exam 5 (2010-2012). Sleep data were collected using standardized full in- home level-2 polysomnography (Compumedics Somte Systems, Abbotsville, Australia, AU0), as described before (26). Of the 462 participants in the current analysis, there were 196 African- Americans (AA), 259 European-Americans (EA) and 125 Hispanic-Europeans (HA). RNA sequencing in MESA is briefly described in the Supplementary Materials. Sleep disordered breathing measures As examples for continuous exposures from population-based studies, we took three sleep disordered breathing measures: (1) the Apnea-Hypopnea Index (AHI), defined as the number of apnea (breathing cessation) and hypopnea (at least 30% reduction of breath volume, accompanied by 3% or higher reduction of oxyhemoglobin saturation compared to the baseline saturation) per 1 hour or sleep; (2) minimum oxyhemoglobin saturation during sleep (MinO2), and (3) average oxyhemoglobin saturation during sleep (AvgO2). We chose these traits because they are clinically relevant, often used in sleep research studies, and represent exposures that may alter gene expression (via hypoxemia and sympathetic activation). The AHI had the least skewed distribution of the considered phenotypes, and AvgO2 had the longest “tail” of small values in the residual distribution. Residuals were obtained by regression the sleep measures on age, sex, body mass index (BMI), study center, and self-reported race/ethnic group. Compared tests of associations between exposure and transcripts .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 7 We compared the standard packages DESeq2, edgeR, limma, and linear regression-based approaches, in which we always applied log transformation on the transcript counts, and then applied linear regression. Because some of the observed transcript count values are zero, which cannot be log transformed, we compared a few approaches for replacing zero values. For a given transcript �, denote the minimum observed transcript level that is higher than zero by m� � min ����, … , ���: ��� � 0 for � � 1, … , ��. We compare the following approaches, applied on each transcript �, � � 1, … , � separately: A1. SubHalfMin: Replace zero values with �� � . A2. AddHalfMin: Replace all values ��� by ��� � �� � . A3. AddHalf: Replace all values ��� by ��� � � � . Conceptual framework for studying analysis approaches To study performance of various analysis approaches, we performed simulation studies. Simulation study 1 was used to assess type 1 error across methods when using output p-values, and when using “empirical p-values”, which are p-values that account for true distribution of the p-values under the null and are described later. Simulation study 2 was used to assess power in transcriptome-wide analysis settings, when using methods that control the type 1 error according to simulation study 1. In addition, we performed a simulation study (Supplementary Materials) to assess power for testing of individual transcript according to various distributional characteristics of transcript counts. The goal was to identify approaches for filtering transcripts for association analysis that will optimize power. All simulations used a “residual permutation” (below). The reported criteria for declaring differentially expressed transcripts were False Discovery Rate (FDR) controlling p-values <0.05 based on the Benjamini-Hochberg (BH) procedure, and based .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 8 on the local FDR procedure implemented in the qvalue R package, Family-Wise Error Rate (FWER) controlling p-values <0.05 based on the Holms procedure, and an arbitrary threshold of p-value<10-5. Residual permutation approach for simulations and for empirical p-value computation To generate realistic simulation studies in which: (a) the data structure, including the exposure, covariates, and outcome distributions; and (b) their relationships, aside from the exposure- outcome association, are the same as in the real data, we used a residual permutation approach. We regressed each sleep exposure of interest � on the covariates � and estimated their effect �. We then obtained residuals, defined as: � � � � ���. To study type 1 error, we permuted these residuals at random to obtain ��� !�", and generated a sleep exposure unassociated with any of the RNA-seq measures by: ����� � ��� � ��� !�". We repeated this procedure 1000 times for evaluating type 1 error control. We generated simulated data under four power simulations in a similar approach, with the difference that we forced a specific correlation value between the simulated sleep exposure and a specific transcript. To this end, for a given transcript � measured on individuals � � 1, … , �, we computed the rank of each individual: ��!���", … , ��!���". To set a correlation $ between the simulated � and transcript � we sampled $ % � (rounded) indices from 1, … , �, corresponding to $ % � individuals for which we forced their ranks in the permuted residual values, now denoted by ��� !�" , to be the same as their ranks in the transcript values (note that the transcript values are never changed). For the rest of the individuals, the permuted residuals are completely random. When .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 9 multiple individuals have the same transcript counts (i.e., their ranks are tied), we randomly assign their ranks. For example, if 100 people have zero counts for a given transcript, each of these individuals will be equally likely to have the rank of 1, 2, …, or 100. The code for generating this residual permutation approach is provided in the Supplementary Information and in a dedicated GitHub repository https://github.com/nkurniansyah/RNA- Seq_continuous_exposure. Empirical p-values to account for the null distribution of p-values We used the residual permutation approach, under the null hypothesis, to generate a null distribution of p-values and to compute empirical p-values. When the distribution of p-values under the null hypothesis is unknown, and specifically when it is not uniform, their values are not reliable for hypothesis testing. Alternative approaches compute “empirical p-values” with the goal of generating an appropriate p-value distribution, i.e., in which an empirical p-value � satisfies Pr!� ' 0.05|*�" � 0.05 (Supplementary Materials). For computing empirical p-values, we use a relatively small number of residual permutations (in comparison to the number of permutations used for computing permutation p-values) followed by transcriptome-wide association studies. We use the results of these transcriptome-wide tests under permutation to compute the null distribution of p-values, which is then used to compute the empirical p-values. We compare two types of empirical p-values: quantile empirical p-values, and Storey empirical p-values implemented in the qvalue R package (27). The quantile empirical p-value approach is inspired by previously proposed procedures based on permutation (28) of phenotypes (rather than residuals). It estimates the null distribution of p-values non- .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 10 parametrically, and the quantile empirical p-value is the quantile of the raw p-value in this distribution. The Storey empirical p-values uses the null distribution of the test statistics to identify whether a transcript is likely sampled from the null or a non-null distribution. Both implementations assume that the empirical null distribution is the same for all transcripts. We used 100 residual permutations to compute test statistics and p-values under the null and compared the empirical p-values to standard permutation p-values. Resampling approach for binary exposure phenotypes We compared the analysis of a continuous exposure to that of a dichotomized variable. Instead of a sleep measure, we used body mass index (BMI), because it is known to have large impact of gene expression and is therefore a powerful phenotype for such a comparison. BMI was dichotomized to “obese” if BMI + 30kg/m2 and non-obese otherwise. Because obesity is binary and, therefore, the residual permutation approach is not appropriate as proposed for continuous variables, we generated a binomial obesity variable based on BMI probability given covariates. Given a logistic model -./��0�!123� � 1"4 � 5� �6, we estimated the covariates’ association parameters 67 and obtained estimated probabilities for obesity for each person � � 1, … , � by �̂!123� � 1" � �9���!5� �67". Based on these estimated outcome probabilities, we sampled random obesity status as binomial variables. Results MESA participant characteristics are provided in the Supplementary Material, Table S1. The distribution of the raw phenotypes AHI, MinO2, and AvgO2, and their residuals after regression on covariates is provided in Figure 1, demonstrating the high non-normality. Simulations were .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 11 performed after normalizing the data so that each library has the same size (prior to filtering), which we set to the median observed value (i.e., median normalization) in the raw reads, or 23,210,672. Results for some of the settings in simulation study 1 under TMM and size factor normalizations are provided in the Supplementary Materials. Simulation study 1: type 1 error analysis After normalization, we applied filters to remove lowly expressed transcripts. There were 58,311 transcripts. After applying filters requiring that the (a) maximum read count is >10 and that (b) the proportion of individuals with zero counts for a transcript across the sample is not higher than 0.75 (see Supplementary Materials for more information on filters), 23,004 transcripts were available for the simulation study. We used residual permutation to generate simulated SDB phenotypes that are not associated with the transcripts, but maintain the same correlation structure with the transcript and covariates. We generated 100 datasets with simulated SDB phenotypes, and performed analyses. Complete results showing the average number of false positive detection based on the existing packages limma, edgeR, and DESeq2, as well as the three linear regression analyses described here, are provided in Supplementary Figures S3-S5. These results include comparisons of raw p-values, the proposed quantile empirical p-values, and the empirical p-values provided in the qvalue R package (27), and for the three SDB phenotypes. We found that the number of false positives vary with the exposure phenotypes, with analyses of MinO2 (Figure 2) generally resulting in more false positive detections than analyses of the AHI, with intermediate numbers for AvgO2 (Figures S3-S5 in the Supplementary Materials). Figure 2 compares the average number of falsely discovered transcript associations when using simulated .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 12 sleep phenotypes mimicking MinO2 using the residual permutation approach by focusing on limma, edgeR, DESeq2, and linear regression applied on log2 of expression counts with SubHalfMin. For each method, type I error was determined using raw p-values and Storey empirical p-values, with significance thresholds based on Benjamini-Hochberg (BH) FDR, local FDR (qvalue package), and Holms Family-Wise Error Rate (FWER). Empirical p-values usually reduced the number of false detections, with the method in the qvalue package being usually more conservative than the quantile-based empirical p-values method. Compared to linear regression-based approaches, DESeq2, edgeR and limma-voom had many false detections when using the raw p-values, even after applying multiple testing corrections. The three linear regression-based methods described here were quite similar, with the AddHalf approach often resulting in slightly more false detections. Based on these results, we chose to move forward for the next set of simulations with linear regression with SubHalfMin for handling of zero counts. Simulation study 2: power analysis We performed simulations that mimic transcriptome-wide analysis to assess power. Based on simulations comparing power by transcript distributional characteristics (see Supplementary Materials), we only considered 19,742 transcripts for which no more than 50% of the sample had zero counts. We chose two transcripts, and for each of these and each of the sleep phenotypes, we performed 100 simulations in which we used the residual permutation approach to generate association between the sleep phenotype and the transcript with correlation $ � 0.3. We performed transcriptome-wide association analysis using DESeq2, edgeR, and linear regression with SubHalfMin transformation (limma-voom was not used, given its high rate of false .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 13 positive detections in some of the settings in simulation study 1). For power, we always used empirical p-values (both types) and determined whether the specific transcript of interest passed the significance threshold based on FDR-adjusted (29) empirical p-value < 0.05. Power was defined as the proportion of the simulations in which the associations was significant, and was consistently higher for the linear regression-based approach compared to DESeq2 or edgeR. For linear regression, the quantile empirical p-values performed essentially the same as Storey’s empirical p-values, while Storey’s empirical p-values resulted in substantially higher statistical power when using DESeq2 and edgeR. We illustrate power comparisons in Figure 3 using Storey’s empirical p-values. Power comparisons using quantile empirical p-values are provided in the Supplementary Materials Figure S8. Proposed analysis approach Based on the above simulation studies, we developed an analytic pipeline as depicted in Figure 4: (a) the raw read count are normalized; (b) filters are applied to remove lowly expressed transcripts and those for which the statistical power is low, as determined by simulations, (c) AddHalfMin transformation is applied for each individual separately, then log transformation is applied on all transcripts, (d) association analyses is performed using linear regression to compute effect sizes and p-values, (e) permutations are computed 100 times on exposure residuals after regressing on covariates, to generate simulated traits that maintain the data structure, (f) each of 100 vectors of simulated traits are analyzed using the same approach as the raw trait, generating p-values, (g) p-values from the analysis of the 100 simulated traits are combined to generate an empirical null distribution of p-values, that are used to generate .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 14 empirical p-values for the raw trait using the qvalue package, and (h) multiple testing correction is applied on the empirical p-values. Comparison of analysis of continuous BMI with analysis of dichotomous obesity status We compared the differential expression of transcripts in analysis of BMI and obesity. Residual permutation procedure was used and quantile-empirical p-values generated for both analyses. A total of 925 MESA individuals had BMI measure available and, for analysis, at least 50% non- zero transcripts were required. For obesity, several non-zero transcript thresholds were examined: 50%, 40%, and 30%. The results were similar for all thresholds, resulting in many more identified transcript associations (446 vs. 251) with continuous BMI compared to using a dichotomous trait (Supplementary Information Figure S9). Computing time comparison The compute time for transcriptome-wide association study was obtained for analyses using DESeq2, edgeR, and our linear regression implementation. Using our linear regression implementation on a single core, a single transcriptome-wide association study applied on ~19K transcripts and N=462 individuals took less than a minute; when 100 transcriptome-wide association studies applied to residual permutations were included to compute empirical p- values, the time reached 7 minutes, and the maximum memory used was 1.3GB. In comparison, DESeq2 took 53.5 minutes and edgeR took 18.8 minutes for a single transcriptome-wide association study. The maximum memory used for DESeq2 and edgeR was similar at 3.1GB. R package .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 15 Code for implementing the proposed procedure and for a shiny app is provided in the GitHub repository https://github.com/nkurniansyah/Olivia. The code also provides test of multiple exposure variable at the same time, which applies the multivariate-Wald test, and an efficient implementation of a permutation test when considering a single transcript, rather than a transcriptome-wide analysis. The repository also includes code used for simulations. Data availability MESA data are available through application to dbGaP. Phenotypes are available in MESA study accession phs000209.v13.p3, and RNA-seq data has been deposited and will become available through the TOPMed-MESA study accession phs001416.v2.p1. Discussion We systematically assessed the approaches for studying the association of gene expression, estimated using RNA sequencing, with continuous and non-normally distributed exposure phenotypes. We found that linear regression-based analysis performs well for continuous phenotype associations, and is computationally highly efficient. We used a residual permutation approach to study the distribution of p-values under the null of no association between the phenotypes and RNA-seq, and used this approach to further study power, and to compute empirical p-values. Notably, the residual permutation approach allows for the dataset to have the same correlation structures and associations between the phenotypes and the transcripts and covariates, while eliminating the transcript-phenotype associations. We implemented this approach in an R package and developed an R shiny app, to make our pipeline easily accessible to the research community. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 16 Recently, van Rooij et al (30) also performed a benchmarking study comparing analysis approaches for transcriptome-wide analysis of RNA-seq in population-based studies, including when using continuous phenotypes in association testing. While we used similar statistical methods to theirs, we took a different analytical approach. Van Rooij et al. used multiple datasets to apply association analysis between a phenotype and transcripts, and assessed replication between analyses. We, on the other hand, leveraged simulations to generate data under a known association structure. In addition, we were motivated by a specific problem: highly non-normal sleep exposure measures, often leading to suboptimal control of Type 1 error. Thus, it was critical to assess control of false discovery under the null hypothesis. Notably, sleep phenotypes are less often available and there are no other large observational studies data sets to our knowledge with both RNA-seq measures and similar SDB phenotypes. Some of our findings are similar to those of van Rooij et al.: they also recommend using linear regression analysis, and they also found that using a continuous phenotype is generally more powerful than dichotomizing it (in agreement with what is known from statistical literature). Similarly, they found that normalization method had very little effect on the results. However, they recommend testing all genes, while we recommend filtering transcripts with at least 50% zero counts, based on our power simulations. Additional future work is needed to evaluate various filtering criteria, and to develop methods that allow for flexible, non-linear modeling of the association between phenotype and gene expression while remaining computationally efficient to allow for permutation analysis. We propose to compute p-values under the null hypothesis of no association between the transcript and the exposure phenotype by permuting residuals of the exposure phenotype after .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 17 regressing on covariates, and re-structuring the exposure by summing the permuted residuals with the estimated mean, and thus maintain the overall data structure except for the exposure- outcome association of interest. Outside the gene expression literature, others have proposed to permute residuals rather than the outcome. For example, previous permutation methods proposed to permute residuals of the outcome after regressing on covariates (31), or to permute the residuals of the exposure phenotypes without constructing a new exposure phenotype by summing the permuted residuals with the estimated mean (32). It will be interesting to perform a more comprehensive study of statistical permutation approaches for RNA-seq association analyses, as well as studying them in the context of mixed models. We recommend using empirical p-values, which require 100 residual permutation, and therefore, performing 101 transcriptome-wide association analyses instead of one. Considering Figures S3- S5 in the Supplementary Information, one can see that in most settings, linear regression methods do not have many false positive detections even when raw p-values are used. However, we chose to be more conservative by strongly protecting the analysis from false positive detections. Importantly, the linear regression analysis with empirical p-values had higher power than the other common approaches (DESeq2, edgeR), indicating simultaneous improvement in controlling false positives and increasing power. Unfortunately, we cannot effectively estimate the FDR in these simulations. FDR is defined as the proportion of false discoveries out of all discovered (significant) associations. In simulation study 1, none of the transcripts were associated with the outcomes, so that any estimated FDR would be 100%. Under the alternative, one can suggest to use the number of wrongly discovered associations to estimate the FDR. However, many transcripts are highly correlated with the one simulated to be associated with the .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 18 exposures, and are therefore associated with the exposure by design, and thus the number of transcripts falsely detected as associated with the exposure cannot be easily determined. The empirical p-values procedure uses p-values from the entire tested transcriptome to compute the empirical null distribution. This encapsulates the assumption that the null distribution of p- values is the same for all transcripts, which is generally a limitation, but has been shown to be often acceptable since it will lead to less power, rather than increasing the number of false detections (33; 34). An approach that does not require this assumption estimates the null distribution for p-value for each transcript separately, which is a standard permutation approach. We investigated this issue by comparing the quantile empirical p-values with the permutation p- values that use 100,000 residual permutations to estimate the null distribution of the p-value of each transcript separately (Figure S2 in the Supplementary Materials). The two p-value distributions are very similar. Therefore, a computationally expensive permutation approach, as well as other approaches proposed by investigators, such as estimating null distributions across sets of transcripts with similar properties (33; 35), are likely unnecessary and not superior to the computationally efficient empirical p-values method. Another approach for estimating the null distribution of p-values uses the primary results, without any permutation (36; 37). These approaches also use the assumption that the null p-value distribution is the same across transcripts (i.e. a shared null distribution exists). Given the computationally fast implementation of the transcriptome-wide association study, we believe that using residual permutation is beneficial because it allows for a more precise quantification of the null p-value distribution. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 19 Batch effects are important to account for in studies of RNA-seq. Here, we did not study their effect because it was beyond the scope of our investigation. van Rooij et al (30) in their benchmarking study focusing on replication across cohorts, compared a few approaches for adjusting for technical covariates, including estimating and adjusting for latent confounders (38). They concluded that inclusion of more technical adjusting covariates, including hidden confounders, increases the rate of replication between studies. To summarize, we highlighted the problem of high false positive findings in RNA-seq data when studying the association of continuous exposure phenotypes that are highly non-normal. We developed a computationally efficient pipeline to address the false positive detection problem, and studied strategies to optimize statistical power. Our approach will be particularly useful for epidemiological studies with RNA-seq data that were not designed as disease-focused case-control studies. Acknowledgements This work was supported by the National Heart Lung and Blood Institute grant R35HL135818. MESA and the MESA SHARe projects are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts 75N92020D00001, HHSN268201500003I, N01-HC- 95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC- 95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420. Also supported in .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 20 part by the National Center for Advancing Translational Sciences, CTSI grant UL1TR001881, and the National Institute of Diabetes and Digestive and Kidney Disease Diabetes Research Center (DRC) grant DK063491 to the Southern California Diabetes Endocrinology Research Center. Molecular data for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the National Heart, Lung and Blood Institute (NHLBI). RNA-Seq for “NHLBI TOPMed: Multi- Ethnic Study of Atherosclerosis (MESA)” (phs001416.v1.p1) was performed at the Northwest Genomics Center (HHSN268201600032I). Core support including centralized genomic read mapping and genotype calling, along with variant quality metrics and filtering were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Core support including phenotype harmonization, data management, sample-identity QC, and general program coordination were provided by the TOPMed Data Coordinating Center (R01HL-120393; U01HL-120393; contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed. Author contributions TS conceptualized and drafted the manuscript and supervised the analysis. NZ performed all statistical analysis and data visualization and developed the R package and R shiny app. DN, and JS performed RNA sequencing. FA and KA generated the MESA processed the sequenced RNA to generate the RNA-seq dataset. PD, YL, SSR, JIR, and KDT designed the RNA-seq study in .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 21 MESA. SR designed and supervised the MESA sleep ancillary study. All authors critically reviewed and approved the manuscript. References 1. Zhai W, Yao XD, Xu YF, Peng B, Zhang HM, et al. 2014. Transcriptome profiling of prostate tumor and matched normal samples by RNA-Seq. Eur Rev Med Pharmacol Sci 18:1354-60 2. Peng L, Bian XW, Li DK, Xu C, Wang GM, et al. 2015. Large-scale RNA-Seq Transcriptome Analysis of 4043 Cancers and 548 Normal Tissue Controls across 12 TCGA Cancer Types. Sci Rep 5:13413 3. Kim WJ, Lim JH, Lee JS, Lee SD, Kim JH, Oh YM. 2015. Comprehensive Analysis of Transcriptome Sequencing Data in the Lung Tissues of COPD Subjects. Int J Genomics 2015:206937 4. Klambauer G, Unterthiner T, Hochreiter S. 2013. DEXUS: identifying differential expression in RNA-Seq studies with unknown conditions. Nucleic Acids Res 41:e198 5. Auer PL, Doerge RW. 2010. Statistical design and analysis of RNA sequencing data. Genetics 185:405-16 6. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621-8 7. Law CW, Alhamdoosh M, Su S, Dong X, Tian L, et al. 2016. RNA-seq analysis is easy as 1- 2-3 with limma, Glimma and edgeR. F1000Research 5 8. Liu R, Holik AZ, Su S, Jansz N, Chen K, et al. 2015. Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses. Nucleic Acids Res 43:e97 9. Love MI, Huber W, Anders S. 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15:550 10. Pimentel H, Bray NL, Puente S, Melsted P, Pachter L. 2017. Differential analysis of RNA- seq incorporating quantification uncertainty. Nat Methods 14:687-90 11. Wolf JBW. 2013. Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial. Molecular Ecology Resources 13:559-72 12. Kathleen Kerr M, A. Churchill G. 2001. Statistical design and the analysis of gene expression microarray data. Genetical Research 77:123-8 13. Durbin BP, Hardin JS, Hawkins DM, Rocke DM. 2002. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18:S105-S10 14. Mostafavi S, Battle A, Zhu X, Urban AE, Levinson D, et al. 2013. Normalizing RNA- Sequencing Data by Modeling Hidden Covariates with Prior Knowledge. PLOS ONE 8:e68141 15. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, et al. 2016. A survey of best practices for RNA-seq data analysis. Genome Biol 17:13 16. Costa-Silva J, Domingues D, Lopes FM. 2017. RNA-Seq differential expression analysis: An extended review and a software tool. PLoS ONE 12 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 22 17. Ge SX, Son EW, Yao R. 2018. iDEP: an integrated web application for differential expression and pathway analysis of RNA-Seq data. BMC Bioinformatics 19:534 18. Hrdlickova R, Toloue M, Tian B. 2017. RNA-Seq methods for transcriptome analysis. Wiley Interdisciplinary Reviews: RNA 8:e1364 19. Li WV, Li JJ. 2018. Modeling and analysis of RNA-seq data: a review from a statistical perspective. Quantitative Biology 6:195-209 20. Robinson MD, McCarthy DJ, Smyth GK. 2009. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139-40 21. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, et al. 2015. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43:e47 22. Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, et al. 2012. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics 14:671-83 23. Robinson MD, Oshlack A. 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11:R25 24. Anders S, Huber W. 2010. Differential expression analysis for sequence count data. Nature Precedings 25. Bild DE, Bluemke DA, Burke GL, Detrano R, Diez Roux AV, et al. 2002. Multi-Ethnic Study of Atherosclerosis: objectives and design. Am J Epidemiol 156:871-81 26. Chen X, Wang R, Zee P, Lutsey PL, Javaheri S, et al. 2015. Racial/Ethnic Differences in Sleep Disturbances: The Multi-Ethnic Study of Atherosclerosis (MESA). Sleep 38:877-88 27. Storey J, Bass A, Dabney A, Robinson D. 2019. qvalue: Q-value estimation for false discovery rate control. In R package version 2.18.0. 28. van der Laan MJ, Hubbard AE. 2006. Quantile-function based null distribution in resampling based multiple testing. Stat Appl Genet Mol Biol 5:Article14 29. Benjamini Y, Hochberg Y. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B 57:289-300 30. van Rooij J, Mandaviya PR, Claringbould A, Felix JF, van Dongen J, et al. 2019. Evaluation of commonly used analysis strategies for epigenome- and transcriptome-wide association studies through replication of large-scale population studies. Genome Biology 20:235 31. Anderson MJ, Legendre P. 1999. An empirical comparison of permutation methods for tests of partial regression coefficients in a linear model. Journal of Statistical Computation and Simulation 62:271-303 32. Werft W, Benner A. 2010. glmperm: A Permutation of Regressor Residuals Test for Inference in Generalized Linear Models. The R Journal 2:39 33. Yang H, Churchill G. 2006. Estimating p-values in small microarray experiments. Bioinformatics 23:38-43 34. Storey JD, Tibshirani R. 2003. SAM Thresholding and False Discovery Rates for Detecting Differential Gene Expression in DNA Microarrays. In The Analysis of Gene Expression Data: Methods and Software, ed. G Parmigiani, ES Garrett, RA Irizarry, SL Zeger:272-90. New York, NY: Springer New York. Number of 272-90 pp. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 23 35. Fan J, Chen Y, Chan HM, Tam PKH, Ren Y. 2005. Removing intensity effects and identifying significant genes for Affymetrix arrays in macrophage migration inhibitory factor-suppressed neuroblastoma cells. Proceedings of the National Academy of Sciences of the United States of America 102:17751 36. van Iterson M, van Zwet EW, Heijmans BT, the BC. 2017. Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution. Genome Biology 18:19 37. Efron B. 2004. Large-Scale Simultaneous Hypothesis Testing. Journal of the American Statistical Association 99:96-104 38. Wang J, Zhao Q, Hastie T, Owen AB. 2017. CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING. Annals of statistics 45:1863-94 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 24 Figure legends Figure 1: Distributions of the three sleep-disordered breathing exposure phenotypes used as case studies in this manuscript. The left column provides the empirical density functions of the raw phenotypes, the right column provides the empirical density function of their residuals after regressing on age, sex, BMI, self-reported race/ethnic group, and study center. AvgO2: average oxyhemoglobin saturation during sleep. MinO2: minimum oxyhemoglobin saturation during sleep. AHI: Apnea Hypopnea Index. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 25 Figure 2: Average number of false positive transcript associations detected by various methods used in simulation study 1 and computed over 100 repetitions. We used the residual permutation approach to mimic the MESA data set with the sleep phenotype MinO2. The methods reported here are linear regression (applied on log2-transformed transcript counts, with zero values replaced with SubHalfMin); DESeq2, edgeR, and limma-voom. The left column provides results when using raw p-values, the middle corresponds to use of quantile- empirical p-values, and the right corresponds to Storey empirical p-values. We report false positive detections as those with Benjamini-Hochberg (BH) False Discovery Rate adjusted (FDR) adjusted p-value < 0.05, Local FDR <0.05 (qvalue package) and with Holms Family-Wise Error Rate (FWER) adjusted p-values < 0.05. Error bars reflect the mean ; standard error. In Supplementary Figures S3-S5, we provide complete results, including for additional sleep phenotypes: AHI and AvgO2. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 26 Figure 3: Estimated power for detecting a transcript simulated as associated with the three sleep traits when using Storey empirical p-values, and association is determined significant if its BH FDR-adjusted p- value is <0.05. The transcripts were randomly selected out of available transcripts (after filtering of transcripts with 50% or higher zero counts across the sample). We compared linear regression, DESeq2, and edgeR in transcriptome-wide association analysis for each of the sleep phenotypes. For each transcript used in simulations, we show both power and the box plot of its distribution in the sample after Median normalization. .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/ 27 Figure 4: Analysis pipeline for association transcriptome-wide association analysis of continuous exposure phenotypes. The raw data is normalized using library-size normalization, followed by filtering of transcripts, transformation of transcript expression values, then single-transcript testing to obtain raw p-values. In parallel, residual permutation is applied under the null 100 times, and p-values are used to construct an empirical p-value distribution under the null, and to compute empirical p-values. Finally, the quantile empirical p-values are corrected for multiple testing. 27 s .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430989doi: bioRxiv preprint https://doi.org/10.1101/2021.02.12.430989 http://creativecommons.org/licenses/by-nc/4.0/