Title: NIAGADS Alzheimer’s GenomicsDB: A resource for exploring Alzheimer’s Disease genetic and genomic knowledge

Authors
Emily Greenfest-Allen (2,4), Conor Klamann (1,2,3), Prabhakaran Gangadharan (1,2,3), Amanda Kuzma (1,2,3), Yuk Yee Leung (1,2,3), Otto Valladares (1,2,3), Gerard Schellenberg (1,2,3), Christian J. Stoeckert Jr. (1,2,4), Li-San Wang (1,2,3)

Affiliations
1 Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
2 Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
3 Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
4 Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA

Corresponding Author
Emily Greenfest-Allen (allenem@pennmedicine.upenn.edu)
Li-San Wang (lswang@pennmedicine.upenn.edu)

bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.310276; this version posted February 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract

INTRODUCTION: The NIAGADS Alzheimer’s Genomics Database (GenomicsDB) is an interactive knowledgebase for Alzheimer’s disease (AD) genetics that provides access to GWAS summary statistics datasets deposited at NIAGADS, a national genetics data repository for AD and related dementia (ADRD).

METHODS: The website makes available >70 genome-wide summary statistics datasets from GWAS and genome sequencing analysis for AD/ADRD. Variants identified from these datasets are mapped to up-to-date variant and gene annotations from a variety of resources and linked to functional genomics data.
The resource is powered by a relational database optimized for big data and uses ontologies to consistently annotate study designs and phenotypes, facilitating data harmonization, efficient real-time data analysis, and variant or gene report generation.

RESULTS: Detailed variant reports provide tabular and interactive graphical summaries of known ADRD associations and highlight variants flagged by the Alzheimer’s Disease Sequencing Project (ADSP). Gene reports provide summaries of co-located ADRD risk-associated variants and have been expanded to include meta-analysis results from aggregate association tests performed by the ADSP, allowing us to flag genes with genetic evidence for AD.

DISCUSSION: The GenomicsDB makes available >150 million variant annotations, including ~30 million (5 million novel) variants identified as AD-relevant by ADSP, for browsing and real-time mining via the website. With a newly redesigned, efficient search interface and comprehensive record pages linking summary statistics to variant and gene annotations, this resource makes these data both accessible and interpretable, establishing itself as a valuable tool for AD research.

1 Background

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that affected 5.8 million people in the US in 2018, is effectively untreatable, and invariably progresses to complete incapacitation and death 10 or more years after onset. Early work in the 1990s identified mutations in the amyloid precursor protein (APP) gene and in the presenilin 1 and 2 genes that cause AD, and alleles of the apolipoprotein E gene (APOE) that increase (ε4) or decrease (ε2) susceptibility to late-onset Alzheimer’s disease (LOAD).
Heritability of AD is high, with estimates ranging from roughly 60% to 80% in the best-fitting model [1,2]. However, apart from APOE, there is no simple pattern of inheritance for LOAD. Instead, it is likely caused by a complex combination of common, polygenic variants [3] acting together with a small number of rare variants with a large effect [4,5].

Our current understanding of genetic risk for AD has resulted mainly from massive genotyping and sequencing efforts such as the Alzheimer’s Disease Genetics Consortium (ADGC), the International Genomics of Alzheimer’s Project (IGAP), and the Alzheimer’s Disease Sequencing Project (ADSP). Large-scale genome-wide association studies (GWAS) and GWAS-derived meta-analyses have been performed by each of these groups [4–7], the results of which are deposited at the National Institute on Aging (NIA) Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) at the University of Pennsylvania [8].

NIAGADS is an NIA-designated essential national infrastructure, providing a one-stop access portal for Alzheimer’s disease ′omics datasets. Qualified investigators can submit data use requests to access protected personal genetic information. NIAGADS also disseminates unrestricted meta-analysis results and GWAS summary statistics to promote data reuse, allowing researchers to explore known evidence for AD genetic risk. However, substantial bioinformatics expertise and compute power are required to annotate and mine these datasets, which are significant hurdles for many researchers planning to explore this large and ever-increasing volume of data. Assembly of unrestricted genomic knowledge into an integrated, interactive web resource would help overcome this barrier.

Here, we introduce the NIAGADS Alzheimer’s Genomics Database (GenomicsDB), which was developed in collaboration with the ADGC and ADSP with this goal in mind.
The GenomicsDB is a user-friendly workspace for data sharing, discovery, and analysis designed to facilitate the quest for a better understanding of the complex genetic underpinnings of AD neurodegeneration and to accelerate research on AD and AD-related dementias (ADRD). It accomplishes this by making summary genetic evidence for AD/ADRD both accessible to and interpretable by molecular biologists, clinicians, and bioinformaticians alike, regardless of computational skills.

2 Methods

2.1 Genomics Datasets

2.1.1 NIAGADS GWAS summary statistics

As of December 2020, the NIAGADS GenomicsDB provides unrestricted access to genome-wide summary statistics p-values from >70 GWAS and ADSP meta-analyses. Summary statistics results are linked to >150 million ADSP-annotated single-nucleotide variants (SNVs) and indels. GWAS summary statistics datasets deposited at NIAGADS are integrated into the GenomicsDB as they become publicly available via publication or permission of the submitting researchers. These include studies that focus specifically on AD and late-onset AD (LOAD), as well as those on ADRD-related neuropathologies and biomarkers. A full listing of the summary statistics datasets currently available through the NIAGADS GenomicsDB is provided in Supplementary Table S1.

Prior to loading into the database, the datasets are annotated (e.g., provenance, phenotypes, study design) and variant representation is normalized to ensure consistency with ADSP analysis pipelines and facilitate harmonization with third-party annotations.
To ensure the privacy of personal health information, the NIAGADS GenomicsDB website only makes p-values from the summary statistics available for browsing (on dataset, gene, and variant reports and as genome browser tracks) and analysis. Access to the full summary statistics (including genome-wide allele frequencies and effect sizes) and corresponding GWAS or sequencing results is managed via formal data-access requests made to NIAGADS. All datasets included in the GenomicsDB are properly credited to the submitting researchers or sequencing project.

2.1.2 NHGRI-EBI GWAS Catalog

Variants and summary statistics curated in the NHGRI-EBI GWAS Catalog [9] are listed in NIAGADS GenomicsDB variant reports, and a track is available on the genome browser. Variants linked to AD/ADRD are highlighted.

2.1.3 ADSP meta-analysis results

The NIAGADS GenomicsDB has recently expanded its scope to include meta-analysis results offering genetic evidence for gene-level and single-variant risk associations for AD. Currently available are case/control association results recently published by the ADSP [7] and deposited at NIAGADS (Accession No. NG00065).

2.2 Variant annotation

2.2.1 Variant identification

Single nucleotide polymorphisms (SNPs) and short indels are uniquely identified by position and alleles. This allows accurate mapping of risk-association statistics to specific mutations and to external variant annotations from resources such as gnomAD (https://gnomad.broadinstitute.org/) [10] and GTEx (https://www.gtexportal.org/home/) [11]. All variants are mapped to dbSNP (https://www.ncbi.nlm.nih.gov/snp/) [12] and linked to refSNP identifiers when possible.
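Identifying a variant by position and alleles can be illustrated with a short sketch. The `chrom:pos:ref:alt` key format, the allele-trimming rule, and the function names below are assumptions made for illustration; they are not the GenomicsDB's documented identifier scheme, and the coordinates are arbitrary.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position of the first reference base
    ref: str
    alt: str


def normalize(chrom: str, pos: int, ref: str, alt: str) -> Variant:
    """Trim bases shared by both alleles so equivalent records compare equal.

    Only allele trimming is shown. Full indel normalization also needs
    left-alignment against the reference sequence, which is omitted here.
    """
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    ref, alt = ref.upper(), alt.upper()
    # Trim shared trailing bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, advancing the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return Variant(chrom, pos, ref, alt)


def variant_key(v: Variant) -> str:
    """Hypothetical position-and-allele key used to join annotations."""
    return f"{v.chrom}:{v.pos}:{v.ref}:{v.alt}"


# Two differently written records of the same substitution (arbitrary coordinates).
a = normalize("chr19", 100, "CAT", "CGT")
b = normalize("19", 101, "a", "g")
print(variant_key(a), variant_key(b))
```

Both records reduce to the same key, which is what lets statistics from one dataset be matched to annotations from another without relying on refSNP identifiers.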
2.2.2 ADSP variant annotations

Annotated variants in the NIAGADS GenomicsDB include the >29 million SNPs and ~50,000 short indels identified during the ADSP Discovery Phase whole-genome (WGS) and whole-exome sequencing (WES) efforts [13]. These variants are highlighted in variant and dataset reports and their quality control status is provided. As part of this sequencing effort, the ADSP developed an annotation pipeline that builds on Ensembl’s VEP software [14] to efficiently integrate standard annotations and rank potential variant impacts according to predicted effect (such as codon changes, loss of function, and potential deleteriousness) [13,15]. Variant tracks annotated by these results are available for both the WES and WGS variants on the GenomicsDB genome browser. The pipeline has been applied to all variants in the GenomicsDB. These annotations can be browsed on variant reports or used to filter search results. User-uploaded lists of variants are automatically annotated in real-time.

2.2.3 Allele frequencies

The NIAGADS GenomicsDB includes allele frequency data from 1000 Genomes (phase 3, version 1) (https://www.internationalgenome.org/home) [16], ExAC (http://exac.broadinstitute.org/) [17], and gnomAD [10].

2.2.4 Linkage disequilibrium

Linkage-disequilibrium (LD) structure around annotated variants is estimated using phase 3 version 1 (11 May 2011) of the 1000 Genomes Project [16]. LD estimates were made using PLINK v1.90b2i 64-bit [18]. Only LD-scores meeting a correlation threshold of r² ≥ 0.2 are stored in the database. LocusZoom.js [19,20] is used to render LD-scores in the context of the GWAS summary statistics datasets.
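The effect of the r² ≥ 0.2 storage threshold can be sketched as follows. This is an illustration only: it takes r² to be the squared Pearson correlation of alternate-allele counts, whereas the text does not say which PLINK options were used, so the estimator may differ from the authors' computation. The genotype values, variant names, and function names are invented.

```python
from itertools import combinations

R2_THRESHOLD = 0.2  # cutoff stated in the text


def r_squared(x, y):
    """Squared Pearson correlation between two vectors of allele counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic in this sample: correlation is undefined
    return cov * cov / (var_x * var_y)


def ld_pairs(dosages, threshold=R2_THRESHOLD):
    """Return (variant_a, variant_b, r2) for every pair meeting the threshold."""
    pairs = []
    for (name_a, x), (name_b, y) in combinations(dosages.items(), 2):
        r2 = r_squared(x, y)
        if r2 >= threshold:
            pairs.append((name_a, name_b, r2))
    return pairs


# Toy alternate-allele counts (0, 1, 2) for six samples at three made-up variants.
toy = {
    "varA": [0, 1, 2, 0, 1, 2],
    "varB": [0, 1, 2, 0, 1, 2],
    "varC": [2, 0, 1, 0, 2, 1],
}
stored = ld_pairs(toy)
print(stored)
```

Dropping pairs below the cutoff keeps the stored LD table much smaller than the full pairwise matrix while retaining the pairs that matter for regional plots.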
2.3 Gene and transcript annotation

2.3.1 Gene identification

Gene and transcript models are obtained from the GENCODE Release 19 (GRCh37.p13) reference gene annotation [21]. A GRCh38 version of the NIAGADS GenomicsDB is planned for 2021. Standard gene nomenclature is imported from the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute [22] and used to link annotated genes to external resources such as UniProt (https://www.uniprot.org/) [23], the UCSC Genome Browser (http://genome.ucsc.edu) [24], and the Online Mendelian Inheritance in Man (OMIM) database (https://omim.org/) [25,26].

2.3.2 Functional annotation

Annotations of the functions of genes and gene products are taken from packaged releases of the Gene Ontology (GO; http://geneontology.org) and GO-gene associations [27] and are updated regularly. GO-gene associations are reported in summary tables on gene reports and include details on annotation sources, as well as new information from the GO causal modeling (GO-CAM) framework that allows better understanding of how different gene products work together to effect biological processes [28]. Users can run functional enrichment analysis on gene search results or uploaded gene lists. Gene set enrichment and semantic similarity scores are calculated using the goatools Python library for GO analysis [29].
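Enrichment analyses of this kind are commonly framed as an over-representation test. The sketch below is not the goatools implementation. It is a dependency-free illustration using a one-sided hypergeometric (Fisher's exact) test, with Benjamini-Hochberg adjustment as an assumed choice of multiple-testing correction. Gene identifiers, term names, and function names are invented.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n carry the annotation."""
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, min(n, N) + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrichment(study, background, term_to_genes):
    """Return {term: (overlap, p_value, adjusted_p)} for each annotation term."""
    background = set(background)
    study = set(study) & background
    terms, overlaps, pvals = [], [], []
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        k = len(study & annotated)
        terms.append(term)
        overlaps.append(k)
        pvals.append(hypergeom_sf(k, len(background), len(annotated), len(study)))
    adj = benjamini_hochberg(pvals)
    return {t: (k, p, q) for t, k, p, q in zip(terms, overlaps, pvals, adj)}


# Toy data: 20 background genes, two made-up terms, a 5-gene study list.
background = [f"g{i}" for i in range(1, 21)]
terms = {
    "term_A": [f"g{i}" for i in range(1, 6)],
    "term_B": [f"g{i}" for i in range(6, 16)],
}
results = enrichment(["g1", "g2", "g3", "g4", "g16"], background, terms)
print(results)
```

In this toy case four of the five study genes carry `term_A`, so it receives a small p-value, while `term_B` has no overlap and a p-value of 1.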
2.3.3 Pathways

Gene membership in molecular and metabolic pathways is provided from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (https://www.genome.jp/kegg/) [30] and Reactome (https://reactome.org/) [31]. Users can run pathway enrichment analysis on gene search results or uploaded gene lists. Pathway enrichment statistics are calculated using Fisher’s exact test with multiple-hypothesis correction, implemented using the SciPy, pandas, and statsmodels Python packages.

2.4 Functional genomics

Hundreds of functional genomics tracks have been integrated into the NIAGADS GenomicsDB and mapped against AD/ADRD-associated variants. These tracks are queried from the NIAGADS functional genomics repository (FILER), which provides harmonized functional genomics datasets that have been indexed with GIGGLE [32] for quick lookups [33]. FILER tracks made available through the GenomicsDB have been pulled from established functional genomics resources, including the Encyclopedia of DNA Elements (ENCODE) [34,35], the Functional Annotation of the Mouse/Mammalian Genome (FANTOM5) enhancer atlas [36], and the NIH Roadmap Epigenomics Mapping Consortium [37]. Genome browser tracks are available for all functional genomics datasets and are organized by data source, biotype (e.g., cell, tissue, or cell line), type of functional annotation (e.g., expressed enhancers, transcription factor binding sites, histone modifications), and platform or assay type to facilitate track selection.

2.5 Overview of database design

An overview of the NIAGADS GenomicsDB systems architecture is provided in Figure 1. The GenomicsDB is powered by a PostgreSQL relational database system that has been optimized for parallel big data querying, allowing for efficient real-time data mining. Data are organized using the modular Genomics Unified Schema version 4 (GUS4), designed for scalable integration and dissemination of large-scale ′omics datasets.
Loading of all data is managed by the GUS4 application layer (https://github.com/VEuPathDB/GusAppFramework), which ensures the accuracy of data integration.

2.6 Overview of website design and organization

The NIAGADS GenomicsDB is powered by an open-source database system and web-development kit (WDK; https://github.com/VEuPathDB/WDK) developed and successfully deployed by the Eukaryotic Pathogen, Vector and Host Informatics (VEuPathDB) Bioinformatics Resource Center [38,39]. The VEuPathDB WDK provides a query engine that ties the database system to the website via an easily extensible XML data model. The data model is used to automatically generate and organize searches, search results, and reports, with concepts and data organized by topics from the EMBRACE Data And Methods (EDAM) ontology, which defines a comprehensive set of concepts that are prevalent within bioinformatics [40]. This facilitates updates of third-party data and rapid integration of new datasets as they become publicly available.

The WDK also provides a framework for lightweight Java/Jersey representational state transfer (REST) services for data querying. This allows search results and reports to be returned in multiple file formats (e.g., delimited-text, XML, and JSON) in addition to browsable, interactive web pages. This new feature of GenomicsDB has enabled the inclusion of sophisticated visualizations for summarizing search results and annotations in gene and variant reports.
API development is ongoing, with plans for a flexible API that allows researchers to integrate GenomicsDB datasets and annotations into analysis pipelines.

The GenomicsDB uses a combination of an in-house JavaScript genomics visualization toolkit and established third-party visualization tools, including the HighCharts.js (https://www.highcharts.com/) charting library for rendering scatter, pie, and bar charts, ideogram.js (https://github.com/eweitz/ideogram) for chromosome visualization, LocusZoom.js for rendering LD structure in the context of NIAGADS GWAS summary statistics datasets, and an IGV.js-powered genome browser [41]. All code used to generate the WDK website, including the JavaScript genomics visualizations, is available on GitHub (https://github.com/NIAGADS).

2.7 Overview of the NIAGADS genome browser

The NIAGADS genome browser enables researchers to visually inspect and browse GWAS summary statistics datasets in a genomic context. The genome browser allows users to compare NIAGADS GWAS summary statistics tracks to each other, against annotated gene or variant tracks, or to the functional genomics tracks from the NIAGADS FILER functional genomics repository. This tool is powered by IGV.js, with track data queried in real-time by NIAGADS GenomicsDB REST services. The browser also provides a track selection tool that allows users to easily find tracks of interest by keyword search, data source, biotype (e.g., cell, tissue, or cell line), or type of functional annotation (Fig. 2).

3 Results
The NIAGADS Alzheimer’s GenomicsDB creates a public forum for sharing, discovery, and analysis of genetic evidence for Alzheimer’s disease that is made accessible via an interface designed for easy mastery by biological researchers, regardless of background. The GenomicsDB provides four main routes for data exploration and mining. First, detailed reports compile all available data concerning summary statistics datasets and genetic evidence linking AD/ADRD to genes and variants. Second, datasets can be mined in real-time to isolate a refined set of variants that share biological characteristics of interest. Third, visualization tools such as LocusZoom.js and the NIAGADS Genome Browser offer the ability to quickly view and draw conclusions from comparisons of summary statistics or ADSP-annotated variants to different types of sequence data in a genomic area of interest. Fourth, tools such as enrichment analyses offer opportunities for users to link variants to biological processes via impacted genes.

3.1 Finding variants, genes, and datasets

The GenomicsDB homepage and navigation menu contain a site search allowing users to quickly find variants, genes, and datasets of interest by identifier or keyword. This search is paired with interactive graphics found throughout the site that provide shortcuts to resources and annotations of interest to the AD/ADRD research community (Fig. 3A, B). The GenomicsDB also provides a dataset browser that allows users to search for GWAS summary statistics datasets by AD/ADRD phenotype, population, genotype, attribution, and sequencing center.
3.2 Browsing and mining NIAGADS GWAS summary statistics

A detailed report is provided for each of the GWAS summary statistics and ADSP meta-analysis datasets in the NIAGADS GenomicsDB (Fig. 4A). These reports allow users to browse the genetic variants with genome-wide significance in the dataset (p-value ≤ 5 × 10⁻⁸, a threshold that accounts for false positives arising from testing associations of millions of variants simultaneously) via tables and interactive plots that provide an overview of the distribution and potential functional or regulatory impacts of the top variants (and proximal gene loci) across the genome. All genes and variants listed in a dataset report are linked to reports in the GenomicsDB that provide detailed information about genetic evidence for AD for the sequence feature (see next sections). Dataset reports also provide quick links back to their parent accession in the NIAGADS repository, where users can download the complete set of p-values or make formal data access requests for the full summary statistics and related GWAS, expression, or sequencing data associated with the accession. The reports also provide an inline search allowing users to mine the summary statistics in real-time via the website, setting their own p-value cut-off (see section 3.5 for more information).

3.3 Detailed variant reports

Variant reports include a basic summary about the variant (alleles, variant type, flanking sequence, genomic location) and a graphical overview of NIAGADS GWAS summary statistics datasets in which the variant has genome-wide significance (Fig. 5A). All other information in the report is subdivided into multiple sections that can be expanded or hidden at the user’s discretion.
These sections include sub-reports on genetic variation (e.g., allele population frequencies and LD), function prediction determined via the ADSP annotation pipeline (including transcript and regulatory consequences), and comprehensive listings of GWAS-inferred disease or trait associations from both NIAGADS summary statistics and the NHGRI-EBI GWAS Catalog. Tables listing summary statistics results can be dynamically filtered by p-value, dataset, phenotypes, or covariates, and the filtered results are downloadable. Links to the source datasets for each reported statistic are also provided, leading to detailed dataset reports (e.g., NIAGADS GWAS summary statistics) or to the source publication (e.g., curated variant catalogs). These tables are paired with browsable LocusZoom.js views of the LD structure surrounding the variant in the context of selected GWAS summary statistics datasets. Links to the NIAGADS Alzheimer’s Disease Variant Portal (ADVP) and external resources for additional information (e.g., dbSNP, ClinVar) are also provided.

3.4 Detailed gene reports

Like the variant reports, gene reports provide basic summary information about the gene (nomenclature, gene type, genomic span) and a graphical overview of NIAGADS GWAS summary statistics-linked variants proximal to or within the footprint of the gene (Fig. 5B). Two types of gene-linked genetic evidence for AD are provided in the GenomicsDB gene reports. First, we have surveyed the top risk-associated variants from the NIAGADS GWAS summary statistics datasets and provide a comprehensive listing of and links to those contained within ±100 kb of each gene (Fig. 5C). Second, we report meta-analysis results from gene-based rare variant aggregation tests performed as part of the ADSP discovery phase case/control analysis [42]. Genes found to have a significant p-value in these results are flagged as having genetic evidence for AD.
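The ±100 kb rule for linking variants to a gene amounts to an interval test around the gene's span. The sketch below uses made-up coordinates and identifiers, and the function name is hypothetical. The GenomicsDB performs this lookup in its database, which is not shown here.

```python
FLANK_BP = 100_000  # ±100 kb window stated in the text


def variants_near_gene(gene, variants, flank=FLANK_BP):
    """Return IDs of variants on the gene's chromosome within `flank` bp of its span.

    gene: (chrom, start, end), 1-based inclusive coordinates.
    variants: iterable of (variant_id, chrom, pos).
    """
    chrom, start, end = gene
    lo, hi = max(1, start - flank), end + flank
    return [vid for vid, vchrom, pos in variants if vchrom == chrom and lo <= pos <= hi]


# Made-up gene span and variant positions.
gene = ("19", 1_000_000, 1_050_000)
variants = [
    ("v1", "19", 900_000),    # exactly 100 kb upstream: included
    ("v2", "19", 1_150_001),  # 1 bp beyond the downstream flank: excluded
    ("v3", "7", 1_020_000),   # different chromosome: excluded
    ("v4", "19", 1_020_000),  # inside the gene footprint: included
]
nearby = variants_near_gene(gene, variants)
print(nearby)
```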
Also provided on the gene report are sections reporting function prediction (Gene Ontology associations and evidence) and pathway membership (KEGG and Reactome). Tables reporting these results or annotations can be dynamically filtered or downloaded. Links to the NIAGADS ADVP and to external resources (e.g., UniProtKB, OMIM, and ExAC) are also provided.

3.5 Workspaces

The GenomicsDB provides an interactive workspace for exploring a dataset in more depth. As an example, dataset reports provide an inline search allowing users to mine the summary statistics. Variants meeting the search criterion are reported in an interactive workspace that includes both tabular and graphical summaries. Users are initially presented with a table that can be sorted or filtered by annotations (e.g., variant type, predicted effect, deleteriousness) (Fig. 4B). A per-chromosome genome view is also available, allowing users to explore an interactive ideogram depicting the distribution of variants meeting the search and filter criteria across the genome and to inspect LD structure among proximal variants (Fig. 4C). Tables of results can be downloaded or requested via the API for programmatic processing. Registered users also have the option to save and share search results both privately and publicly; publicly shared search results are assigned a stable URL that can be referenced in publications.

3.6 Genome Browser

The NIAGADS genome browser can be used to visually inspect any of the NIAGADS GWAS summary statistics datasets in a broader genomic context and compare against annotated ADSP variant tracks or other ′omics tracks in the GenomicsDB or FILER (see section 2.7, Fig. 2B).
4 Discussion

The NIAGADS Alzheimer’s Genomics Database is a user-friendly platform for interactive browsing and real-time in-depth mining of published genetic evidence and genetic risk factors for AD. It provides open, real-time access to summary statistics datasets from genome-wide association studies (GWAS) of Alzheimer’s disease and related neuropathologies. Flexible search options allow users to easily retrieve AD risk-associated variants, conditioned on phenotypes such as ethnicity and age of onset. Users can compare the NIAGADS datasets against personal gene or variant lists. Every entry in the GenomicsDB has been linked with relevant external resources and functional genomics annotations to supply further information and assist researchers in interpreting the potential functional or regulatory role of risk-associated variants and susceptibility loci.

The GenomicsDB is updated periodically with enhanced features and with new datasets and annotations as they are reported. The AD research community is actively encouraged through outreach and collaboration to submit data to NIAGADS to keep this public platform updated and timely.

The GenomicsDB is integrated with other resources available at NIAGADS. Users can follow links back to the NIAGADS repository to view comprehensive details about all GWAS summary statistics datasets from a NIAGADS accession or to request access to the primary data. The REST services used to query the database and generate data or feature reports provide the foundation of an API that allows programmatic access to the database, which we plan to integrate with cloud-based NIAGADS analysis pipelines.

The GenomicsDB is regularly updated to keep up with advances in Alzheimer’s disease genomics research. New AD-related GWAS summary statistics datasets and meta-analysis results from the ADSP are added as they become available. Reference databases are updated yearly.
All genomics data in the current version of the GenomicsDB are aligned and mapped to the GRCh37.p13 genome build. A GRCh38 version of the database is planned for release in early 2021, which will include variants from the ongoing ADSP sequencing effort, including 20K WES in 2020 and 17K WGS in 2021.

GenomicsDB is a potent platform for the AD genetics community to host comprehensive AD genetic and genomic findings. It uses the latest web and database technologies to allow integration with new tools, and NIAGADS is constantly improving. As more data and tools become available, the NIAGADS Alzheimer’s Genomics Database will become a central hub for AD/ADRD research and data analysis.

5 Conflicts of Interest

The authors have no financial interests to disclose.

6 Acknowledgements and Funding Information

This work is supported by the NIH National Institute on Aging (grant number U24-AG041689). The ADSP Discovery Phase analysis of sequence data is supported through UF1AG047133 (to Drs. Schellenberg, Farrer, Pericak-Vance, Mayeux, and Haines); U01AG049505 to Dr. Seshadri; U01AG049506 to Dr. Boerwinkle; U01AG049507 to Dr. Wijsman; and U01AG049508 to Dr. Goate. Additional funding and acknowledgement statements for the ADSP can be found in the supplement.

7 References

[1] Gatz M, Reynolds CA, Fratiglioni L, Johansson B, Mortimer JA, Berg S, et al. Role of genes and environments for explaining Alzheimer disease. Arch Gen Psychiatry 2006;63:168–74. https://doi.org/10.1001/archpsyc.63.2.168.
[2] Jansen IE, Savage JE, Watanabe K, Bryois J, Williams DM, Steinberg S, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk.
Nature Genetics 2019;51:404–13. https://doi.org/10.1038/s41588-018-0311-9.
[3] Hollingworth P, Harold D, Sims R, Gerrish A, Lambert J-C, Carrasquillo MM, et al. Common variants in ABCA7, MS4A6A/MS4A4E, EPHA1, CD33 and CD2AP are associated with Alzheimer’s disease. Nat Genet 2011;43:429–35. https://doi.org/10.1038/ng.803.
[4] Lambert J-C, Ibrahim-Verbaas CA, Harold D, Naj AC, Sims R, Bellenguez C, et al. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nature Genetics 2013;45:1452–8. https://doi.org/10.1038/ng.2802.
[5] Kunkle BW, Grenier-Boley B, Sims R, Bis JC, Damotte V, Naj AC, et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing. Nat Genet 2019;51:414–30. https://doi.org/10.1038/s41588-019-0358-2.
[6] Naj AC, Jun G, Beecham GW, Wang L-S, Vardarajan BN, Buros J, et al. Common variants at MS4A4/MS4A6E, CD2AP, CD33 and EPHA1 are associated with late-onset Alzheimer’s disease. Nature Genetics 2011;43:436–41. https://doi.org/10.1038/ng.801.
[7] Bis JC, Jian X, Kunkle BW, Chen Y, Hamilton-Nelson KL, Bush WS, et al. Whole exome sequencing study identifies novel rare and common Alzheimer’s-Associated variants involved in immune response and transcriptional regulation. Molecular Psychiatry 2018:1–17. https://doi.org/10.1038/s41380-018-0112-7.
[8] Kuzma A, Valladares O, Cweibel R, Greenfest-Allen E, Childress DM, Malamon J, et al. NIAGADS: The NIA Genetics of Alzheimer’s Disease Data Storage Site. Alzheimer’s & Dementia 2016;12:1200–3. https://doi.org/10.1016/j.jalz.2016.08.018.
[9] Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res 2019;47:D1005–12. https://doi.org/10.1093/nar/gky1120. [10] Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. BioRxiv 2019:531210. https://doi.org/10.1101/531210. [11] Gamazon ER, Segrè AV, van de Bunt M, Wen X, Xi HS, Hormozdiari F, et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait- associated variation. Nature Genetics 2018;50:956–67. https://doi.org/10.1038/s41588-018- 0154-4. [12] Sherry ST, Ward M-H, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11. [13] Butkiewicz M, Blue EE, Leung YY, Jian X, Marcora E, Renton AE, et al. Functional annotation of genomic variants in studies of late-onset Alzheimer’s disease. Bioinformatics 2018;34:2724–31. https://doi.org/10.1093/bioinformatics/bty177. [14] McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biol 2016;17. https://doi.org/10.1186/s13059-016-0974-4. [15] Wheeler NR, Benchek P, Kunkle BW, Hamilton-Nelson KL, Warfe M, Fondran JR, et al. Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies. Pac Symp Biocomput 2020;25:523–34. [16] Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, et al. A global reference for human genetic variation. Nature 2015;526:68–74. https://doi.org/10.1038/nature15393. [17] Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016;536:285–91. https://doi.org/10.1038/nature19057. 
[18] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007;81:559–75. https://doi.org/10.1086/519795. [19] Pruim RJ, Welch RP, Sanna S, Teslovich TM, Chines PS, Gliedt TP, et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 2010;26:2336–7. https://doi.org/10.1093/bioinformatics/btq419. [20] Clark CP, Flickinger M, Welch R, VandeHaar P, Taliun D, Boehnke M, et al. LocusZoom.js: Web-based plugin for interactive analysis of genome and phenome wide association studies. Presented at the 66th Annual Meeting of The American Society of Human Genetics, Vancouver: 2016, p. 189T. [21] Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res 2019;47:D766–73. https://doi.org/10.1093/nar/gky955. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.09.23.310276doi: bioRxiv preprint https://doi.org/10.1101/2020.09.23.310276 [22] Braschi B, Denny P, Gray K, Jones T, Seal R, Tweedie S, et al. Genenames.org: the HGNC and VGNC resources in 2019. Nucleic Acids Res 2019;47:D786–92. https://doi.org/10.1093/nar/gky930. [23] UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 2019;47:D506–15. https://doi.org/10.1093/nar/gky1049. [24] Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The Human Genome Browser at UCSC. Genome Res 2002;12:996–1006. https://doi.org/10.1101/gr.229102. [25] Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res 2015;43:D789-798. 
https://doi.org/10.1093/nar/gku1205. [26] Amberger JS, Bocchini CA, Scott AF, Hamosh A. OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res 2019;47:D1038–43. https://doi.org/10.1093/nar/gky1151. [27] The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res 2019;47:D330–8. https://doi.org/10.1093/nar/gky1055. [28] Thomas PD, Hill DP, Mi H, Osumi-Sutherland D, Auken KV, Carbon S, et al. Gene Ontology Causal Activity Modeling (GO-CAM) moves beyond GO annotations to structured descriptions of biological functions and systems. Nature Genetics 2019;51:1429–33. https://doi.org/10.1038/s41588-019-0500-1. [29] Klopfenstein DV, Zhang L, Pedersen BS, Ramírez F, Warwick Vesztrocy A, Naldi A, et al. GOATOOLS: A Python library for Gene Ontology analyses. Scientific Reports 2018;8:1– 17. https://doi.org/10.1038/s41598-018-28948-z. [30] Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28:27–30. https://doi.org/10.1093/nar/28.1.27. [31] Jassal B, Matthews L, Viteri G, Gong C, Lorente P, Fabregat A, et al. The reactome pathway knowledgebase. Nucleic Acids Res 2020;48:D498–503. https://doi.org/10.1093/nar/gkz1031. [32] Layer RM, Pedersen BS, DiSera T, Marth GT, Gertz J, Quinlan AR. GIGGLE: a search engine for large-scale integrated genome analysis. Nat Methods 2018;15:123–6. https://doi.org/10.1038/nmeth.4556. [33] Kuksa PP, Gangadharan P, Katanic Z, Kleidermacher L, Amlie-Wolf A, Lee C-Y, et al. FILER: large-scale, harmonized FunctIonaL gEnomics Repository. BioRxiv 2021:2021.01.22.427681. https://doi.org/10.1101/2021.01.22.427681. [34] ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74. https://doi.org/10.1038/nature11247. [35] Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res 2018;46:D794–801. 
https://doi.org/10.1093/nar/gkx1081. [36] Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature 2014;507:455–61. https://doi.org/10.1038/nature12787. [37] Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. Integrative analysis of 111 reference human epigenomes. Nature 2015;518:317–30. https://doi.org/10.1038/nature14248. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 12, 2021. ; https://doi.org/10.1101/2020.09.23.310276doi: bioRxiv preprint https://doi.org/10.1101/2020.09.23.310276 [38] Fischer S, Aurrecoechea C, Brunk BP, Gao X, Harb OS, Kraemer ET, et al. The strategies WDK: a graphical search interface and web development kit for functional genomics databases. Database (Oxford) 2011;2011. https://doi.org/10.1093/database/bar027. [39] Aurrecoechea C, Barreto A, Basenko EY, Brestelli J, Brunk BP, Cade S, et al. EuPathDB: the eukaryotic pathogen genomics database resource. Nucleic Acids Res 2017;45:D581–91. https://doi.org/10.1093/nar/gkw1105. [40] Ison J, Kalas M, Jonassen I, Bolser D, Uludag M, McWilliam H, et al. EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats. Bioinformatics 2013;29:1325–32. https://doi.org/10.1093/bioinformatics/btt113. [41] Robinson JT, Thorvaldsdóttir H, Turner D, Mesirov JP. igv.js: an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV). BioRxiv 2020:2020.05.03.075499. https://doi.org/10.1101/2020.05.03.075499. [42] Bis JC, Jian X, Kunkle BW, Chen Y, Hamilton-Nelson KL, Bush WS, et al. Whole exome sequencing study identifies novel rare and common Alzheimer’s-Associated variants involved in immune response and transcriptional regulation. Mol Psychiatry 2018. https://doi.org/10.1038/s41380-018-0112-7. 
[Figure: GenomicsDB architecture. Data: GWAS summary statistics; variant annotations; gene annotations; FILER functional genomics; ADSP meta-analysis results. GUS API: provides transaction management and ensures data harmonization and referential integrity. GUS Database: modular, scalable, and big-data optimized for quick lookups and real-time analysis. GenomicsDB Website: scalable RESTful services and graphical front end for interactively browsing detailed feature reports and real-time mining of datasets. Access: programmatic (JSON) for integration with analysis pipelines; interactive browsing or mining of data and annotations in popular web browsers; links back to the NIAGADS repository to learn more about accessions and make formal data-access requests.]

[Figure: genomic region view, 59,600 kb to 60,600 kb, panels A and B. Tracks: Ensembl Genes; ADSP Single-Variant Risk Association: European (Model 2) (Bis et al. 2018); ADSP Variants (WES); IGAP: Stage 1 (Kunkle et al. 2019); IGAP APOE-Stratified Analysis: APOEε4 Non-Carriers (Jun et al. 2016); IGAP APOE-Stratified Analysis: APOEε4 Carriers (Jun et al. 2016); Roadmap Enh: NH-A Astrocytes. Color scale: -log10 p, binned <1, 3, 6, 9, 12, >15. Genes shown: MS4A4E, MS4A6A, MS4A2, STX3, MS4A4A, MS4A6E, MS4A5, MS4A12, MS4A8, MS4A18, MS4A15, ZP1, LINC00301, MS4A3, TCN1, GIF.]

[Figure: panels A to C with callouts 1 to 3. Legend: Variant; Span containing multiple variants.]