key: cord-0301198-fdbp4td4 authors: Bhattacharya, A.; Hirbo, J. B.; Zhou, D.; Zhou, W.; Zheng, J.; Kanai, M.; The Global Biobank Meta-analysis Initiative,; Daly, M. J.; Pasaniuc, B.; Gamazon, E. R.; Cox, N. J. title: Best practices of multi-ancestry, meta-analytic transcriptome-wide association studies: lessons from the Global Biobank Meta-analysis Initiative date: 2021-11-26 journal: nan DOI: 10.1101/2021.11.24.21266825 sha: 89761f1bcdf15f44c80836ddcb54837945127507 doc_id: 301198 cord_uid: fdbp4td4 The Global Biobank Meta-analysis Initiative (GBMI), through its genetic and demographic diversity, provides a valuable opportunity to study population-wide and ancestry-specific genetic associations. However, with multiple ascertainment strategies and multi-ethnic study populations across biobanks, the GBMI provides a distinct set of challenges in implementing statistical genetics methods. Transcriptome-wide association studies (TWAS) are a popular tool to boost detection power for and provide biological context to genetic associations by integrating single nucleotide polymorphism to trait (SNP-trait) associations from genome-wide association studies (GWAS) with SNP-based predictive models of gene expression. TWAS presents unique challenges beyond GWAS, especially in a multi-biobank and meta-analytic setting like the GBMI. In this work, we present the GBMI TWAS pipeline, outlining practical considerations for ancestry and tissue specificity and meta-analytic strategies, as well as open challenges at every step of the framework. Our work provides a strong foundation for adding tissue-specific gene expression context to biobank-linked genetic association studies, allowing for ancestry-aware discovery to accelerate genomic medicine. In these analyses, one reason ancestry-unaware models may perform poorly in AFR samples is due to 144 differences in minor allele frequency (MAF) of highly predictive SNPs between EUR and AFR ancestry 145 populations. It is important to note that this discrepancy is not generally specific to any one ancestry; 146 rather, ancestry imbalance in the training or reference datasets may lead to poor portability of genetic 147 models due to differences in allele frequency. To account for common SNPs in both AFR or EUR 148 ancestry populations, we additionally trained ancestry-unaware and ancestry-specific models using SNPs 149 with minor allele frequency (MAF) exceeding various thresholds in both AFR and EUR samples. 150 Excluding SNPs with MAF < 0.01 improved predictive performance of ancestry-unaware models across 151 all tissues ( Figure S4 , Table S5 ). However, the gap in predictive performance between ancestry-specific 152 and ancestry-unaware models did not decrease when the MAF cutoff was increased ( Figure 2B , Table 153 S4). This observation may reflect that dropping ancestry-specific rarer SNPs ignores variants with large 163 Another critical consideration for the GBMI involves meta-analysis while using GWAS summary statistics. 164 TWAS estimates the association between GReX and the phenotype by weighting the standardized SNP-165 trait effect sizes from GWAS summary statistics by SNP-gene weights from the expression models. To 166 account for the correlation between SNPs, an external LD reference panel, like the 1000Genomes Project 167 (Auton et al., 2015) , is used to estimate the standard error of the TWAS association. Accordingly, the 168 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. ; https://doi.org/10.1101/2021.11.24.21266825 doi: medRxiv preprint accuracy of this reference panel to the LD structure in the GWAS cohort dictates how aligned the 169 summary-statistics based TWAS association is to the TWAS association from direct imputation into 170 individual-level genotypes in the GWAS cohort. Ideally, in-sample LD will give the best estimate of the 171 TWAS standard error, but several biobanks do not provide this information under their specific genetic 172 data sharing and privacy policies. Even departures in LD across subgroups of European ancestry 173 populations may influence the standard error estimate. In addition, as the estimates of SNP-gene weights 174 are influenced by the LD in the eQTL panel, differences in LD between the eQTL and GWAS panel will 175 also affect the TWAS effect size. As LD structure greatly differs across ancestry groups (Shifman et al., 2003) , pooling ancestry groups in 178 TWAS may lead to reduced power. We conducted TWAS for asthma risk using ancestry-unaware and 179 EUR-and AFR-specific models of whole blood expression (4,782 genes with heritable expression at 180 nominal P < 0.01 and models with cross-validation R 2 > 0.01 with nominal P < 0.05, trained via elastic net 181 regression). Ancestry-specific TWAS Z-scores across EUR and AFR ancestry groups were not strongly 182 correlated ( = 0.11), potentially due to differences in sample size and eQTL and GWAS architecture 183 ( Figure 3A , Figure S5 -S6) (Shang et al., 2020; Wyss et al., 2018) . In fact, we detected only two genes 184 across both EUR and AFR with P < 2.5 × 10 −6 . One of these genes, DFFA, has been implicated with 185 asthma risk through GWAS and colocalization in EUR (Vicente et al., 2017) . However, the TWAS 186 associations across EUR and AFR were in opposite directions using blood tissue. In the other 4 tissues 187 explored, DFFA TWAS associations did not reach transcriptome-wide significance but effect directions 188 were generally concordant ( Figure S7) . In blood, lead local-eQTLs (within 1 Megabase) of DFFA show 189 are in opposite directions, though only nominally significant at P < 0.05 ( Figure S8) We investigated 5 different meta-analytic strategies empirically: meta-analyzing across ancestry-specific, 198 per-biobank GWAS summary statistics using (1) inverse-variance weighting (IVW) and (2) sample-size 199 weighting (SSW), meta-analyzing across ancestry-specific meta-analyzed GWAS summary statistics 200 using (3) IVW and (4) SSW, and (5) TWAS using ancestry-unaware models and meta-analyzed GWAS 201 summary statistics across EUR and AFR ancestry groups (STAR Methods). QQ-plots in Figure 3B show 202 earlier departure of Z-scores from the QQ-line for SSW meta-analyzed Z-scores and the ancestry- However, it is unclear whether ancestry-specific IVW meta-analysis to the per-biobank level is necessary. 213 As shown in Figure S10 , Z-scores from these two IVW methods are moderately positively correlated ( = CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. are more likely to be tissue-or cell-type-specific (Yang et al., 2017) , the association signal from these 281 distal-eQTLs could also be leveraged in cross-tissue fine-mapping strategies. Table S6 ). These phenotypes include multiple inflammations of 320 organs (e.g., laryngitis, osteitis, meningitis, inflammations of the digestive and respiratory system, etc). 321 We also detected several associations with related respiratory diseases and traits. Similarly, for the two 322 previously implicated genes, we find enrichments for respiratory and hematopoietic GTAs for ILRAP18 323 and across multiple categories for TMEM258, consistent with the categorized functions and associations 324 of these genes (Figure S12-S13, Table S6 ). The utility of GBMI's robust roster of phenotypes enables 325 GReX-PheWAS to add biological and clinical context to novel TWAS associations. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. ; https://doi.org/10.1101/2021.11.24.21266825 doi: medRxiv preprint used in training (aligned imputation panel), we use leave-one-out CV when measuring predictive 505 performance. Lastly, when imputing into AFR and EUR samples using the ancestry-unaware models, we 506 use leave-one-out CV, as well, but only cross-validating over the AFR or EUR samples, respectively. 507 508 Comparison of meta-analytic strategies 509 We compared 5 different meta-analytic strategies empirically: meta-analyzing across ancestry-specific, 510 per-biobank GWAS summary statistics using (1) inverse-variance weighting (IVW) and (2) For the ancestry-unaware TWAS, we use ancestry-unaware elastic net regression models and integrate 529 with GWAS summary statistics meta-analyzed across all ancestry groups and biobanks. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. ; https://doi.org/10.1101/2021.11.24.21266825 doi: medRxiv preprint 532 Transcriptome-wide significant genes are further prioritized by performing GReX-PheWAS to categorize 533 associations across a broad spectrum of phenotypes. Using UKBB summary statistics from European 534 ancestry patients (Bycroft et al., 2018) , we tested for GTAs for 731 phenotypes grouped into 9 categories: 535 dermatologic, digestive, endocrine/metabolic, genitourinary, hematopoietic, musculoskeletal, neoplasms, 536 neurological, and respiratory. Here, we illustrate GReX-PheWAS using three genes from the European- CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. ; https://doi.org/10.1101/2021.11.24.21266825 doi: medRxiv preprint . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. ; https://doi.org/10.1101/2021.11.24.21266825 doi: medRxiv preprint Figure 1 : Overview of GBMI transcriptome-wide association study (TWAS) pipeline with challenges at every data level. (A) Each level of data in TWAS introduces a unique set of challenges: (1) genetics data include confounding from genetic ancestry, population structure and relatedness, and complex linkage disequilibrium patterns, (2) gene expression data introduces context-specific factors, such as tissue-, celltype-, or cell-state-specific expression, and (3) phenotypic data, especially in the meta-analyses of multiple biobanks, involve challenges in acquiring and aggregating phenotypes, properly defining controls for phenotypes, and ascertainment and selection bias from non-random sampling. (B) An overview of the GBMI TWAS pipeline: (1) JTI and MOSTWAS for model training, (2) inverse-variance weighted metaanalysis using per-biobank, per-ancestry group TWAS summary statistics, and (3) various follow-up tests, including conditional or permutation tests, distal-SNPs added last test, probabilistic fine-mapping using FOCUS, and tests for SNP horizontal pleiotropy. Dotted lines represent associations that are tested in the TWAS pipeline, while the solid lines represent a link built through predictive modeling. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. ; https://doi.org/10.1101/2021.11.24.21266825 doi: medRxiv preprint Distribution of difference in adjusted R 2 (Y-axis) when predicting expression in the AFR imputation sample between models trained in EUR and in AFR training samples across tissue (X-axis). (B) Distribution of difference in adjusted R 2 between ancestry-specific and ancestry-unaware models imputing into EUR (left) and AFR (right) samples. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. ; https://doi.org/10.1101/2021.11.24.21266825 doi: medRxiv preprint Figure 3 : Comparison of meta-analytic strategies for multi-biobank, trans-ancestry TWAS. (A) Scatterplot of per-ancestry meta-analyzed TWAS scores across EUR (X-axis) and AFR ancestry (Y-axis). The dotted horizontal and vertical lines indicate P < 2.5 × 10 −6 with a diagonal line for reference. Points are colored based on which ancestry population the TWAS association meets P < 2.5 × 10 −6 . (B) QQ-plot of TWAS Z-scores, colored by meta-analytic strategies. Per ancestry refers to TWAS meta-analysis across metaanalyzed ancestry-specific GWAS summary statistics. Per bank/per ancestry refers to TWAS metaanalysis using all biobank-and ancestry-specific GWAS summary statistics. (C) Effect sizes and Bonferroni-corrected confidence intervals (CIs) for TWAS associations across 17 individual biobanks (stratified by ancestry group with EUR in green and AFR in red) and 2 IVW meta-analysis strategies (in yellow) for 5 representative genes. The Higgins-Thompson I 2 statistic for heterogeneity is provided, with the dotted line showing the null. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. ; https://doi.org/10.1101/2021.11.24.21266825 doi: medRxiv preprint . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 26, 2021. ; https://doi.org/10.1101/2021.11.24.21266825 doi: medRxiv preprint Detecting shared genetic 693 architecture among multiple phenotypes by hierarchical clustering of gene-level association statistics TIGAR: An Improved Bayesian Tool for Transcriptomic Data 697 Imputation Enhances Gene Mapping of Complex Traits Fast and accurate imputation of summary statistics enhances 701 evidence of functional enrichment Integrative analyses identify susceptibility genes underlying 704 COVID-19 hospitalization Predicting gene 706 Phenotypic and functional translation of IL1RL1 710 locus polymorphisms in lung tissue and asthmatic airway epithelium Association of IL1RL1, IL18R1, and IL18RAP gene cluster polymorphisms with 713 asthma and atopy The candidate gene approach in asthma: 715 what happens with the neighbours? Genetic Architecture of Gene Expression in European and African Americans: An eQTL Mapping 718 Study in GENOA A novel random effect model for GWAS meta-analysis and its application to 720 trans-ethnic meta-analysis Linkage disequilibrium patterns of 722 the human genome across populations Mendelian randomization': can genetic epidemiology contribute to 724 understanding environmental determinants of disease? A comparison of multiple testing adjustment 726 methods with block-correlation positivelydependent tests National population-based biobanks for genetic 728 research GRIK5 Genetically Regulated Expression Associated with Eye and 731 Vascular Phenomes: Discovery through Iteration among Biobanks, Electronic Health Records, and 732 Zebrafish Phenome-based approach identifies RIC1 -linked Mendelian syndrome through 735 zebrafish models, biobank associations and clinical studies How powerful are summary-based methods for identifying 737 expression-trait associations under different genetic architectures? Lessons from ten years of genome-wide 740 association studies of asthma Opportunities and challenges for transcriptome-743 wide association studies Evaluating phecodes, clinical classification software, and ICD-9-CM codes 746 for phenome-wide association studies in the electronic health record The importance of cohort studies in the post-GWAS era Multiethnic meta-analysis identifies ancestry-specific and cross-751 ancestry loci for pulmonary function Identifying cis-mediators for trans-eQTLs across 753 many human tissues using genomic mediation analysis GCTA: a tool for genome-wide complex 755 trait analysis Testing and 757 controlling for horizontal pleiotropy with probabilistic Mendelian randomization in transcriptome-wide 758 association studies PTWAS: Investigating tissue-relevant causal molecular mechanisms of complex traits using probabilistic 761 TWAS analysis On Using Local Ancestry to Characterize the 763 Genetic Architecture of Human Traits: Genetic Regulation of Gene Expression in Multiethnic or Admixed 764 Populations A unified framework for 766 joint-tissue transcriptome-wide association and Mendelian randomization analysis Efficiently controlling for case-control imbalance and sample 770 relatedness in large-scale genetic association studies Global Biobank Meta-analysis Initiative: powering genetic discovery across 773 human diseases Identifying causal genes mediating a trait through Bayesian estimation of allelic heterogeneity Shared genetic and experimental links between obesity-related traits and 779 asthma subtypes in UK Biobank