key: cord-1005009-rgbda4bh authors: Rehm, Heidi L.; Page, Angela J.H.; Smith, Lindsay; Adams, Jeremy B.; Alterovitz, Gil; Babb, Lawrence J.; Barkley, Maxmillian P.; Baudis, Michael; Beauvais, Michael J.S.; Beck, Tim; Beckmann, Jacques S.; Beltran, Sergi; Bernick, David; Bernier, Alexander; Bonfield, James K.; Boughtwood, Tiffany F.; Bourque, Guillaume; Bowers, Sarion R.; Brookes, Anthony J.; Brudno, Michael; Brush, Matthew H.; Bujold, David; Burdett, Tony; Buske, Orion J.; Cabili, Moran N.; Cameron, Daniel L.; Carroll, Robert J.; Casas-Silva, Esmeralda; Chakravarty, Debyani; Chaudhari, Bimal P.; Chen, Shu Hui; Cherry, J. Michael; Chung, Justina; Cline, Melissa; Clissold, Hayley L.; Cook-Deegan, Robert M.; Courtot, Mélanie; Cunningham, Fiona; Cupak, Miro; Davies, Robert M.; Denisko, Danielle; Doerr, Megan J.; Dolman, Lena I.; Dove, Edward S.; Dursi, L. Jonathan; Dyke, Stephanie O.M.; Eddy, James A.; Eilbeck, Karen; Ellrott, Kyle P.; Fairley, Susan; Fakhro, Khalid A.; Firth, Helen V.; Fitzsimons, Michael S.; Fiume, Marc; Flicek, Paul; Fore, Ian M.; Freeberg, Mallory A.; Freimuth, Robert R.; Fromont, Lauren A.; Fuerth, Jonathan; Gaff, Clara L.; Gan, Weiniu; Ghanaim, Elena M.; Glazer, David; Green, Robert C.; Griffith, Malachi; Griffith, Obi L.; Grossman, Robert L.; Groza, Tudor; Auvil, Jaime M. 
Guidry; Guigó, Roderic; Gupta, Dipayan; Haendel, Melissa A.; Hamosh, Ada; Hansen, David P.; Hart, Reece K.; Hartley, Dean Mitchell; Haussler, David; Hendricks-Sturrup, Rachele M.; Ho, Calvin W.L.; Hobb, Ashley E.; Hoffman, Michael M.; Hofmann, Oliver M.; Holub, Petr; Hsu, Jacob Shujui; Hubaux, Jean-Pierre; Hunt, Sarah E.; Husami, Ammar; Jacobsen, Julius O.; Jamuar, Saumya S.; Janes, Elizabeth L.; Jeanson, Francis; Jené, Aina; Johns, Amber L.; Joly, Yann; Jones, Steven J.M.; Kanitz, Alexander; Kato, Kazuto; Keane, Thomas M.; Kekesi-Lafrance, Kristina; Kelleher, Jerome; Kerry, Giselle; Khor, Seik-Soon; Knoppers, Bartha M.; Konopko, Melissa A.; Kosaki, Kenjiro; Kuba, Martin; Lawson, Jonathan; Leinonen, Rasko; Li, Stephanie; Lin, Michael F.; Linden, Mikael; Liu, Xianglin; Udara Liyanage, Isuru; Lopez, Javier; Lucassen, Anneke M.; Lukowski, Michael; Mann, Alice L.; Marshall, John; Mattioni, Michele; Metke-Jimenez, Alejandro; Middleton, Anna; Milne, Richard J.; Molnár-Gábor, Fruzsina; Mulder, Nicola; Munoz-Torres, Monica C.; Nag, Rishi; Nakagawa, Hidewaki; Nasir, Jamal; Navarro, Arcadi; Nelson, Tristan H.; Niewielska, Ania; Nisselle, Amy; Niu, Jeffrey; Nyrönen, Tommi H.; O’Connor, Brian D.; Oesterle, Sabine; Ogishima, Soichi; Wang, Vivian Ota; Paglione, Laura A.D.; Palumbo, Emilio; Parkinson, Helen E.; Philippakis, Anthony A.; Pizarro, Angel D.; Prlic, Andreas; Rambla, Jordi; Rendon, Augusto; Rider, Renee A.; Robinson, Peter N.; Rodarmer, Kurt W.; Rodriguez, Laura Lyman; Rubin, Alan F.; Rueda, Manuel; Rushton, Gregory A.; Ryan, Rosalyn S.; Saunders, Gary I.; Schuilenburg, Helen; Schwede, Torsten; Scollen, Serena; Senf, Alexander; Sheffield, Nathan C.; Skantharajah, Neerjah; Smith, Albert V.; Sofia, Heidi J.; Spalding, Dylan; Spurdle, Amanda B.; Stark, Zornitza; Stein, Lincoln D.; Suematsu, Makoto; Tan, Patrick; Tedds, Jonathan A.; Thomson, Alastair A.; Thorogood, Adrian; Tickle, Timothy L.; Tokunaga, Katsushi; Törnroos, Juha; Torrents, David; Upchurch, Sean; Valencia, 
Alfonso; Guimera, Roman Valls; Vamathevan, Jessica; Varma, Susheel; Vears, Danya F.; Viner, Coby; Voisin, Craig; Wagner, Alex H.; Wallace, Susan E.; Walsh, Brian P.; Williams, Marc S.; Winkler, Eva C.; Wold, Barbara J.; Wood, Grant M.; Woolley, J. Patrick; Yamasaki, Chisato; Yates, Andrew D.; Yung, Christina K.; Zass, Lyndon J.; Zaytseva, Ksenia; Zhang, Junjun; Goodhand, Peter; North, Kathryn; Birney, Ewan title: GA4GH: International policies and standards for data sharing across genomic research and healthcare date: 2021-11-10 journal: Cell Genom DOI: 10.1016/j.xgen.2021.100029 sha: 0de0f121e70379fba129f9b401cf86c4b0f62e45 doc_id: 1005009 cord_uid: rgbda4bh The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits. 
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
The Universal Declaration of Human Rights states that everyone has the right to share in scientific advancement and its benefits. 1,2 In order to fully deliver the benefits from genomic science to the broad human population, researchers and clinicians must come together to agree on common methods for collecting, storing, transferring, accessing, and analyzing molecular and other health-related data. Otherwise, this information will remain siloed within individual disease areas, institutions, countries, or other jurisdictions, locking away its potential to contribute to research and medical advances. The Global Alliance for Genomics and Health (GA4GH) is a worldwide alliance of genomics researchers, data scientists, healthcare practitioners, and other stakeholders. We are collaborating to establish policy frameworks and technical standards for responsible, international sharing of genomic and other molecular data as well as related health data. Founded in 2013, 3 the GA4GH community now consists of more than 1,000 individuals across more than 90 countries working together to enable broad sharing that transcends the boundaries of any single institution or country (see https://www.ga4gh.org). In this perspective, we present the strategic goals of GA4GH and detail current strategies and operational approaches to enable responsible sharing of clinical and genomic data, through both harmonized data aggregation and federated approaches, to advance genomic medicine and research. We describe technical and policy development activities of the eight GA4GH Work Streams and implementation activities across 24 real-world genomic data initiatives ("Driver Projects"). We review how GA4GH is addressing the major areas in which genomics is currently deployed including rare disease, common disease, cancer, and infectious disease.
Finally, we describe differences between genomic sequence data that are generated for research versus healthcare purposes, and define strategies for meeting the unique challenges of responsibly enabling access to data acquired in the clinical setting. As the costs associated with human genomic sequencing continue to decline, genomic assays are increasingly used in both research and healthcare. As a result, we expect tens of millions of human whole-exome or whole-genome sequences to be generated within the next decade, with a high proportion of that data coming from the healthcare setting and therefore associated with clinical information. 4 If they can be shared, these datasets hold great promise for research into the genetic basis of disease 5 and will represent more diverse populations than have traditionally been accessible in research; however, data from individual healthcare systems are rarely accessible outside of institutional boundaries. GA4GH aims to enable the responsible sharing of clinical and genomic data across both research and healthcare by developing standards and facilitating their uptake. 6 We believe that without such a consortium, the emerging utility of genomics in clinical practice will be slower, more expensive, and fragmented, with little harmonization between countries. 7 GA4GH standards (see Table 1 ) allow researchers to securely and responsibly access data regardless of where they are physically located. Technical standards give researchers the confidence that someone else could reproduce their work by running the same packaged method over the same underlying data, using the same persistent identifiers. Standards also give data providers confidence that their data are being accessed in accordance with their data use policies, by researchers they have authorized, without losing control of multiple downloaded copies of the data. 
As a result, data providers can enable research with the assurance that their legal and ethical requirements are being upheld, while researchers benefit from the use of global data resources and tools. As nascent genomic medicine programs emerge in many countries, we believe that federated approaches (see Federated access below), in addition to centralized data sharing where feasible, are necessary to satisfy the goals of both the research and healthcare communities. In addition, many commercial and public organizations aim to minimize the costs and risks of the complex technical software needed to either contribute to genomic medicine or deliver genomic tools. A complex, multi-stakeholder ecosystem requires neutral and technically competent standards; these standards must be adaptable for disparate purposes and useful for the broad set of end-users: clinical, academic, commercial, and public. Finally, standards must be developed to intentionally support the global research community with specific attention to policies of equity, diversity, and inclusion to tangibly enable progress for all global communities. GA4GH has partnered with 24 real-world genomic data initiatives (Driver Projects) to ensure its standards are fit for purpose and driven by real-world needs. Driver Projects make a commitment to help guide GA4GH development efforts and pilot GA4GH standards (see Table 2 ). Each Driver Project is expected to dedicate at least two full-time equivalents to GA4GH standards development, which takes place in the context of GA4GH Work Streams (see Figure 1 ). Work Streams are the key production teams of GA4GH, tackling challenges in eight distinct areas across the data life cycle (see Box 1). Work Streams consist of experts from their respective sub-disciplines and include membership from Driver Projects as well as hundreds of other organizations across the international genomics and health community. 
GA4GH Work Streams and Driver Projects have identified, and are actively developing, the technical specifications and policy frameworks they believe to be of most relevance to enable widespread data sharing, federated approaches, and interoperability across datasets to facilitate genomic research (see supplemental information for more details on the product development process); the areas of focus are outlined in Box 1, with individual products defined in Table 1 and in the 2020/2021 GA4GH Roadmap (https://www.ga4gh.org/roadmap). Each GA4GH deliverable can be implemented on its own to enable interoperability and consistency in a single area. However, when implemented together, they support broader activities in the research and clinical domains and enable productive genomic data sharing and collaborative analyses that can leverage global datasets produced in distinct locations around the world. Each approved GA4GH deliverable is reviewed by a panel of internal and external experts not involved in the product's development, and then by the GA4GH Steering Committee (https://www.ga4gh.org/about-us/governance-andleadership-2/#steering). GA4GH standards are not typically accredited by a national or international standards body, and instead follow a model inspired by the Internet Engineering Task Force (IETF; https://www.ietf.org) and the World Wide Web Consortium (W3C; http://www.w3.org). This enables a flexible and rapid response to community needs and a focus on lowering barriers to interoperability through the development and adoption of pragmatic standards. However, there are occasions when certain standards benefit from a more formal accreditation process, especially when there is a direct link into healthcare usage (see next section and Box 2).
To achieve greater international coordination and consistency of standards development, GA4GH proactively collaborates with other standards development organizations working in genomics, e.g., Health Level Seven (HL7; http://www.hl7.org), the International Organization for Standardization (ISO; https://www.iso.org), and the Open Biological and Biomedical Ontology Foundry (OBO; http://www.obofoundry.org/). While defined work processes between GA4GH and other standards development bodies are still under development, GA4GH has initiated several pilot projects to explore mechanisms of collaboration. One such approach is the submission of GA4GH standards to ISO's technical committees for approval as ISO international standards. Using a product development timeline that aligns the ISO approval process with the GA4GH approval process, both communities are able to contribute to the development of a standard in a harmonized manner. These efforts expand the diversity of contributors to both organizations, leading to more robust and internationally applicable standards. Another approach, guided by HL7 working groups and experts, is the translation of GA4GH standards into HL7 Fast Healthcare Interoperability Resources (FHIR) Implementation Guides. These implementation guides enable interoperability of GA4GH standards with clinical systems and accelerate the use of clinical data for research. GA4GH also aims to support and interoperate with existing translational models, ontologies, and terminologies (e.g., FHIR, HGVS, OMOP, PCORnet, Human Phenotype Ontology, SNOMED CT) for clinical genetics and genomics. [21] [22] [23] Before launching a new standards development project, GA4GH Work Streams are encouraged to complete a landscape analysis that both identifies relevant existing standards and describes how they will influence the development of the new standard.
Coordination activities-such as joint meetings, shared documentation, and process harmonization between GA4GH work and these health standards-focused efforts-are critical for bridging the research-clinical divide and keeping respective products aligned. This helps prevent unnecessary proliferation of redundant standards and minimizes the development of semantically and syntactically conflicting standards that could hamper large-scale interoperability and lead to confusion within the adopter community (see Box 2). Federated approaches-the ability to analyze data across multiple distinct and secure sites-are increasingly seen as an important strategy where data cannot be pooled for legal or practical reasons. These approaches are characterized by independent organizations hosting data in secure processing environments (e.g., clouds, trusted research environments) while adopting technical standards that enable analysis at scale. 24 Application programming interfaces (APIs) can be deployed to enable researchers and portable workflows to visit multiple databases even where the data and computing environment are variably configured. 25 Tools like "identity federation" can facilitate even closer integration across organizations. [26] [27] [28] 29 the Japanese Genotype-phenotype Archive (JGA), or the database of Genotypes and Phenotypes (dbGaP). Researchers worldwide will draw on these openly shared genomic datasets for their own studies, increasing the amount of knowledge derived from each genome. 34 However, while such research genomes are more readily available, these datasets usually do not include the type or extent of longitudinal, standardized, or interoperable clinical data needed for genomic medicine. 35 Healthcare-based research and testing have an entirely different financial, legal, and social landscape, with the structure, provision, and regulation varying by country, covering the full spectrum from state-run to private schemes.
7 In each system, the cost of an assay in healthcare-genomics included-is often considered in light of its benefits to the health of an individual and cost effectiveness within the healthcare system. 36 In theory, if a genomic assay demonstrates clinical utility for a specific application within a healthcare system-especially if it is cost effective-the only limit to its deployment is the number of patients who will potentially benefit. In practice, however, there are logistical, financial, regulatory, educational, scientific, and clinical-based hurdles to overcome before a genomic test becomes a routine clinical offering. In addition, barriers to healthcare access will likely remain impediments to large-scale implementation in many countries. The current case for implementing genomics in healthcare can be presented in four broad disease areas: rare disease, cancer, common/chronic disease, and infectious disease. In the following sections we outline the case for healthcare-funded sequencing in each disease area. We also highlight challenges to implementation in each area and GA4GH deliverables aimed at overcoming these issues. Arguably, the rare disease space has seen the most successful deployment of genomics in healthcare, with many reporting diagnostic rates of at least 20%-30%, and health economic studies demonstrating cost-effectiveness and diagnostic utility. [36] [37] [38] [39] [40] [41] Clinical geneticists have used single-gene or small gene panel tests since the early 1990s to support diagnosis and some treatment decisions for many of these diseases. The cost of assaying broader genomic regions-including exome and genome sequencing-has fallen considerably, with a substantial impact on rare-disease diagnosis and discovery research. 42, 43 However, with more than 10,000 rare diseases 44 affecting more than 300 million patients worldwide, 45 diagnosing and discovering treatments for many of these diseases has been challenging. 
As such, the rare disease community has embraced data sharing in order to facilitate global knowledge exchange and improve patient diagnostic rates, understand disease progression, and augment care strategies. 41 To further enable progress, clinical and research laboratories and health systems must support several key activities to effectively identify, diagnose, and eventually treat the genetic causes of rare disease: (1) aggregate genomic and phenotypic data, needed for discerning population allele frequencies in disease and non-disease populations and implicating new genes in rare disease; (2) catalog the validity of gene-disease associations using consistent annotation models and terminologies; 46 (3) collectively build knowledge bases to understand variant pathogenicity; (4) define the natural histories of rare diseases to predict disease progression and enable a foundation upon which to develop clinical trials; and (5) monitor treatment efficacy of emerging therapeutics. GA4GH standards and policies already enable and will continue to build upon these activities. For example, the Matchmaker Exchange-a rare disease gene discovery platform which has benefited from GA4GH guidance on API-based data exchange formats as well as consent 47 and data security policies 48,49 -illustrates the power of bringing practicing clinicians and researchers together, as cases from across the globe are necessary to build evidence to confirm new gene-disease relationships. 48 GA4GH promotes knowledge sharing in ClinVar, a database which has accelerated improvements in variant classification across the clinical laboratory community. 50 Additional methods are now being deployed to move beyond manual submission of variant classifications to a centralized database; such advances will enable more timely access to siloed laboratory knowledge and evidence-based variant classification. 
Real-time sharing with ClinVar-facilitated by APIs and with entries linked to rich, case-level data-will be needed to scale our understanding of the more than 750 million variants so far identified in the human genome (e.g., within gnomAD; https://gnomad.broadinstitute.org). The Variation Representation (VRS) 18 and Variant Annotation (VA) specifications aim to support the exchange of variant data; the Phenopackets and Pedigree representations aim to support the use of standardized clinical and family history data; and new APIs (e.g., Beacon v2 API and Data Connect API) aim to enable the identification of data for further access and analysis. The aim is for these standards to support a more global and federated approach to rare disease data and knowledge sharing that will be critical to advancing diagnosis and treatment of rare diseases. One in five men and one in six women worldwide will have a cancer diagnosis in their lifetime. 51 This risk is 2- to 3-fold greater in higher-resource countries, 51 with estimates as high as one in two people in the UK, for example. 52 An altered somatic genome is a consistent hallmark of cancer, often associated with specific pathogenic mutations. 53 In some individuals with hereditary cancer syndromes, germline variants can disrupt cancer-related pathways and increase the risk of developing a "heritable" malignancy. [54] [55] [56] Characterizing a cancer by sequencing a patient's tumor genome alongside their germline genome has resulted in profound insights into molecular mechanisms of malignant transformation and discovery of potential therapeutic targets. 57, 58 Tumor/normal sequencing has demonstrated applications in disease monitoring 59 as well as diagnosis, 60 prognosis, 61 and therapeutic response prediction, 62 both at initial presentation 63 and disease recurrence. 64 Applying cancer genomics in the clinic is more complicated than that for rare diseases.
For cancer patients, treatment strategy time frames are commonly measured in weeks, and incorporating genomic information into clinical decision making within such an urgent turnaround time is logistically challenging. 65 Additionally, while the use of genomics for diagnosis and improved symptom management can lead to substantial improvements for rare disease patients and their families, application of genomics in cancer treatment is more complex and may include dual assessment of both somatic and germline genomes to determine heritable cancer risk and the assessment of the evolving tumor genome due to changing selective pressures in response to targeted therapies. Cancer genomic information is most useful if it informs treatment options, yet systems that match patients to appropriate clinical trials are needed to fully realize the benefits of genomic tumor data, given that estimated clinical trial enrollment among patients with cancer stands at ~8%. 66 Genomic information is increasingly important in clinical decision making through routine clinical sequencing assays and molecular tumor boards. 67 The heterogeneity of cancer as a disease-of each individual tumor and of any concurrent or subsequent manifestation, such as metastasis or recurrence-adds many layers of complexity to genomic analysis. 68 To address this complexity, it is important to analyze somatic and germline variation data together to understand their contribution to cancer risk. 69 Most of the same standards and workflows important for rare disease apply to tumor sequencing, including data storage and compression standards (e.g., CRAM), variation representation (e.g., VCF and VRS), analysis (e.g., cloud-based workflows), and linkage to patient records (e.g., Phenopackets).
However, discovery of oncogenic driver mutations also requires significant coordination and standardization to track outcome data (e.g., progression and response to treatment), a key element in determining the clinical significance of variation found in cancer patients. 70 As such, many groups have created knowledge bases to annotate cancer genomic variation associated with evidence of pathogenicity or relevant treatment options; however, these knowledge bases can have limited levels of interoperability. In 2014, a GA4GH task team launched the Variant Interpretation for Cancer Consortium (VICC), which standardizes and coordinates clinical somatic cancer curation efforts and has created an open community resource to provide the aggregated information. 71 Moving forward, major oncogenomic resources are now working with GA4GH on the harmonization of variant interpretation evidence, through refinement and adoption of standards such as the Beacon API, the Data Use Ontology (DUO), 9 VA, and VRS. Additionally, these standards are being implemented across multiple GA4GH Driver Projects (see Table 2) that capture genomic data and/or diagnostic variant interpretation across the longitudinal evolution of cancer.

Common/chronic disease

"Common disease" is a catchall phrase describing a vast spectrum of diseases that have complex environmental and genetic etiologies. Accurate prediction of common diseases from genetics has been a topic of study since the inception of human genetics, yet genomic information is still not widely used in clinical practice for this purpose. The discovery of a large number of genetic susceptibility loci (polygenic architecture) supported the common-disease common-variant hypothesis 72 and has led to the generation of polygenic risk scores summarizing common disease risk.
73 Studies are now beginning to demonstrate the clinical benefits of applying polygenic risk scores in practice through stratification of the population for deploying disease management strategies. [74] [75] [76] As the assay of choice moves from genotype arrays to sequencing, there will be integration between common disease and rare disease applications; this is already the case for certain diseases such as susceptibility to breast cancer 75 or heart disease. 77 When such genomic information can be used clinically for common diseases, it will be more justifiable to sequence entire populations. Population-scale sequencing is in place already in some countries (e.g., Iceland) and is likely to become more commonplace in the next two decades. To support the discovery of the genetic causes and contributors to common disease across all populations, researchers must be able to identify and access aggregated data from large-scale cohort population studies from diverse backgrounds, carried out by multiple distinct sites such as biobanks in the UK (UK Biobank, Generation Scotland), China (China Kadoorie Biobank), the US (NIH All of Us Research Program), and Japan (Tohoku Medical Megabank, Japanese BioBank); and whole population cohorts in Iceland (deCODE), Estonia (Estonian Genome Project), and Finland (FinnGen). Doing so requires the data to be harmonized across all sites using common data models and terminologies. Furthermore, since genomic datasets of this scale are too large to download and manipulate at individual sites, researchers must be able to bring analytical tools to the data, regardless of their location. Protocols are needed to deploy these tools consistently and effectively across distinct federated sites.
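To make the idea of bringing analytical tools to the data more concrete, the following is a minimal sketch in Python, not a GA4GH API or any project's actual implementation. It assumes each site exposes only an aggregate allele count for a variant; the `Site` class, the `count_alleles` method, the `federated_allele_frequency` function, and the toy genotypes are all hypothetical names and data introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Diploid genotype call: 0 = reference allele, 1 = alternate allele.
Genotype = Tuple[int, int]


@dataclass(frozen=True)
class AlleleCounts:
    """Aggregate returned by a site; contains no individual-level data."""
    alt: int
    total: int


class Site:
    """Hypothetical data-holding site. Genotypes stay inside this object."""

    def __init__(self, name: str, genotypes: Dict[str, List[Genotype]]):
        self.name = name
        self._genotypes = genotypes

    def count_alleles(self, variant_id: str) -> AlleleCounts:
        # The analysis runs where the data live; only counts are returned.
        calls = self._genotypes.get(variant_id, [])
        alt = sum(a + b for a, b in calls)
        return AlleleCounts(alt=alt, total=2 * len(calls))


def federated_allele_frequency(sites: Iterable[Site], variant_id: str) -> float:
    """Combine per-site aggregates into one pooled allele frequency."""
    results = [site.count_alleles(variant_id) for site in sites]
    alt = sum(r.alt for r in results)
    total = sum(r.total for r in results)
    if total == 0:
        raise ValueError(f"no site holds genotype calls for {variant_id}")
    return alt / total


# Toy data: two sites, one variant (illustrative values only).
sites = [
    Site("site_a", {"var1": [(0, 1), (0, 0), (1, 1)]}),
    Site("site_b", {"var1": [(0, 0), (0, 1)]}),
]
freq = federated_allele_frequency(sites, "var1")
print(f"pooled alternate allele frequency: {freq:.2f}")
```

In a real deployment, the per-site step would run inside each site's secure environment, and a production system would also need authorization, data-use checks, and safeguards for small counts, none of which are modeled here.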
GA4GH products support this critical type of biological study across the typical research life cycle from data discovery to analysis: (1) identify and access datasets relevant to a disease study (e.g., GA4GH Passports, DUO, multiple data discovery APIs), (2) access secure genotype and phenotype information on patients with related traits (e.g., Phenopackets, Data Repository Service [DRS] API, VRS, VA), and (3) remotely run analytical methods on data of interest (e.g., Task Execution Service [TES], Workflow Execution Service [WES] API, htsget API 12 ), avoiding the need for inter-jurisdictional transfers and disparate regulatory requirements. Genomics can be used to identify the infectious agents of disease with more confidence and precision than ever before, and at increasing speed, allowing treatments that can quickly resolve infections [78] [79] [80] as well as identifying the evolution of new species that may evade antibiotics, antivirals, and vaccines. The main challenges to deployment of genomics in infectious disease care are managing cost and logistics, tracking disease progression and its characterization, achieving precise phenotypic prediction (e.g., antibiotic resistance), and harmonizing historical knowledge bases from non-genomic-based assays to integrate with contemporary genomic tests. The COVID-19 pandemic tested this infrastructure, with diagnostic testing becoming widespread, viral genomic sequencing enabling tracking of strains, and human genome sequencing of symptomatic individuals contributing to a better understanding of the basis of COVID-19 disease severity. 81 Infectious disease genomic research and surveillance primarily rely on sequencing bacterial and viral pathogens and the organisms in which they are carried and transmitted. These genomes vary greatly in size, content, and associated metadata, so the standards and APIs created for human genomic data may be insufficient for infectious disease data. 
However, while the specific data standards needed to advance pathogen genomics differ from those in human genomics, there is still considerable overlap in the mechanics of sharing the data. Through a variety of strategic alignments with organizations such as the Public Health Alliance for Genomic Epidemiology (PHA4GE; https://pha4ge.org/), the International COVID-19 Data Alliance (ICODA; http://www.icoda-research.org), and the European COVID-19 Data Portal (http://www.covid19dataportal.org), GA4GH is working to ensure that the species-agnostic elements of genomic data sharing standards are transferred into the infectious disease community. In addition, some GA4GH standards have begun to explore how they should adapt to support infectious disease data; for example, the Phenopackets standard was improved to support case-level presentation for infectious diseases in 2020 in response to the COVID-19 pandemic. In addition, recently launched initiatives such as large-scale tuberculosis sequencing in several countries, 82 rapid identification of Ebola and Zika virus strains, 83 and tracing hospital outbreaks using genomics 84,85 demonstrate a vibrant, functional interface between research, public health institutions, and clinical practice. We envision the global clinical and research communities collaborating seamlessly in the context of practicing healthcare 86, 87 to enable a true "learning healthcare system" (LHS). The LHS concept has existed for over a decade; 88,89 however, implementation is still in its infancy, facing several barriers. 90 Some useful implementations are found across medicine, 91-94 including genomic medicine. 95 Increasing numbers of institutions and countries have begun biobanks, in many cases connected to their healthcare system (see Common/chronic disease above), providing fertile grounds on which to bring healthcare data-including clinical genomic data-into research.
To enable these efforts to reach their full potential, disparate systems must be able to share genomic and clinical data, requiring the community to overcome key challenges, particularly in the areas of infrastructure development, patient and physician incentives, ethics and regulation, privacy and security, and socio-cultural expectations (see Box 3). We believe these challenges can be overcome-but only if the genomics and healthcare communities commit to broad-based advocacy and coordinated efforts worldwide. This has already been successfully modeled through the Clinical Genome Resource (ClinGen; a GA4GH Driver Project), where healthcare providers, clinical laboratory staff, and researchers work together to develop standards for gene and variant curation, share underlying evidence, and then apply that evidence through a consensus-driven process to classify genes and variants, which are made freely accessible to the broader community to support both research and clinical care. 96, 97

Developing clinical data standards

Much of the clinical data contained within healthcare are not encoded in a standardized format. 98 Multiple electronic health record (EHR) vendors exist today and are highly proprietary in their technical structures, making standardization across EHRs and with downstream research systems difficult. Although data recorded in EHRs often use standardized clinical terminologies (e.g., ICD, SNOMED CT), the intent of these systems is generally to present clinical information on individuals to healthcare providers and, in some regions, facilitate billing practices. This presents a challenge for secondary users, where it is difficult to make accurate, population-scale conclusions, often requiring extensive efforts to understand practices and generate useful research data.
99 To promote adoption of standardized formats in research and ultimately within EHRs, GA4GH is developing standardized information models (e.g., Phenopackets, Pedigree) to describe clinical phenotypes and family histories. Standardizing the representation of phenotype and pedigree information will allow patients, care providers, and researchers to share this information more easily between healthcare and research systems and enable software tools to use this information to improve genome analysis and diagnosis.

Resource limitations for healthcare providers and patients also impact their ability to share valuable clinical data. Some healthcare institutions (e.g., NHS England [https://www.england.nhs.uk/genomics/nhs-genomic-med-service], Dana-Farber Cancer Institute [http://www.dana-farber.org/for-patients-and-families/becominga-patient/preparing-for-your-first-appointment/checklist-for-new-adult-patients], Danish healthcare 100 ) have built layered consent procedures into the regular routine of medical practice. 101 Others support parallel biobanking efforts to separately consent patients for research. 102-106 Still others have built this into their operations as an inherent part of the healthcare system. 100 Further incentives can be built if providers can experience the direct benefits of research. For example, the clinical laboratory genetic testing industry largely participates voluntarily in data sharing through ClinVar, in part because laboratories directly benefit from accurate variant interpretation. 50, 107, 108 Several laboratories also joined when the US insurance industry began requiring submission as a condition of test reimbursement. 109 However, despite progress in the sharing of variant knowledge, additional incentives and infrastructure are needed to support access to case-level results (e.g., variants interpreted for a patient indication) as well as full sequencing data, along with rich clinical phenotypes.
Currently, most genetic test results are returned through PDF-based reports or accessed through external portals outside the medical system. Although standards exist for the exchange of genetic test results (see, for example, HL7's guide in the web resources), 110 robust standards that capture highly detailed, discrete genomic data are still under development. Adoption of those standards has been motivated by the implementation of downstream clinical decision support, 111-113 but more incentives and infrastructure will be needed. To date, GA4GH has worked on maintaining and evolving standardized file formats for raw and annotated genomic data (SAM, BAM, CRAM, VCF/BCF); individual variant representation and interpretation (VRS, VA); and transmission of individual phenotype data and interpreted results (Phenopackets), all of which are critical for the evolving use of genomics in healthcare systems-particularly clinical laboratory workflows to share genomic data and genetic testing results. Future areas of development include better representation of structural variants, unambiguous representation of complex multi-allelic loci, and research into new, more scalable formats for storing and exchanging genetic variation. Population-scale sequencing programs in which healthcare systems share clinical genomic data for research are unlikely to allow large-scale aggregation of data to migrate beyond national boundaries, but federated analysis-in which analytical algorithms or queries are brought to the data in its location without data egress-is feasible and is a major area of focus of GA4GH's standards development. Ethical considerations for patients and populations, together with responsible regulation, are essential for healthcare-funded genomics, which involves complex national regulation and legislation. 
Different countries and institutions have individual values and policies that relate to allowing access to personal information, with some embracing more open regulatory norms and systems on data collection, access, and sharing, and others being more restrictive. Nevertheless, most systems have some mechanism for researchers to access both research and clinical data. The GA4GH Regulatory and Ethics Work Stream (REWS) develops ready-to-use policy guidance to support responsible, international genomic and health-related data sharing. In Box 4, we list central components of the GA4GH Regulatory & Ethics Toolkit, including policies, consent tools, and data access guidance. The REWS also reviews all GA4GH technical standards for consideration of any regulatory or ethics issues that may be relevant. The first REWS product was the GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, 115 which is built on the human right to benefit from scientific progress and its applications, as well as privacy, non-discrimination, and procedural fairness. It provides guidance for the responsible sharing of human genomic and health-related data, including personal health data and other types of data that may have predictive power in relation to health. The Framework has now been translated into 14 languages and has been used to inform local data sharing approaches around the globe, including, for example, the World Economic Forum, 116 the Academy of Science of South Africa, 117 DNA.Land, Health Data Research UK, 118 and the Horizon-2020 CORBEL project. 119 Keeping the fundamental human right to benefit from science at the heart of clinical and genomic data sharing ensures a universal approach to balancing the benefits and potential risks. We believe that most healthcare system actors can ultimately participate in responsible, worldwide data sharing while remaining compliant with applicable laws and institutional policies.
Federating large volumes of sensitive clinical and genomic data across internationally distributed virtual computing environments presents formidable challenges in assuring data integrity, service availability, and individual privacy. Some of these challenges call for innovative application of well-established security standards, frameworks, and protocols (such as identity federation on a global scale), and some GA4GH standards already do so (e.g., crypt4GH, Authentication & Authorization Infrastructure [AAI] / Passports). Another crucial challenge is to enable secure, privacy-preserving federated analysis, wherein researchers can extract information without having to transfer raw data. This evolution is key to fostering inter-institutional and international collaboration and will be a strong incentive to improve ontology homogeneity. Several technical solutions are available, based either on hardware devices or on software algorithms. The former are computationally efficient, but require trusting a vendor and are prone to side-channel attacks. The latter are computationally slower, but are mathematically proven and are a better response to GA4GH expectations. Recent results have demonstrated the effectiveness of a software-based approach (a combination of homomorphic cryptography and secure multi-party computation called "Multi-party Homomorphic Encryption" or MHE); these approaches have been positioned with respect to the GDPR. 120, 121 One of the major strengths of MHE is that partial aggregates can be considered anonymized, and not just pseudonymous, in the sense of the GDPR, thus potentially obviating the need for data transfer and use agreements (DTUAs).

Societal challenges of allowing access to genomic data within the healthcare ecosystem include maintaining public trust, overcoming differences in objectives and methods between research and healthcare, and breaking down unproductive divides between disciplines.
Our vision for healthcare data ecosystems is one in which vetted researchers around the world can, with appropriate oversight and policy enforcement, gain access to human health data for the benefit of patients. GA4GH has defined the core elements of responsible data sharing, including transparency, accountability, recognition, and attribution as well as sanctions for misuse which form a framework to respect and maintain the trust of participants. 122 In particular, the GA4GH Engagement Framework (see Box 4) further assists researchers in designing and understanding engagement with public, patient, and participant stakeholders through the central themes of fairness, context, heterogeneity, and the recognition of tensions. Through the implementation arm of GA4GH, the Genomics in Health Implementation Forum (https://www.ga4gh.org/implementation) described below and other engagement efforts, GA4GH is tackling the broader societal implementation issues including education and engagement of the public, healthcare providers, and regulators in order to build trust within the community. The GA4GH "Your DNA, Your Say" survey, an effort to gather international public attitudes toward genomic data sharing, has provided an evidence base for understanding which factors are important to maintaining public trust in the generation and sharing of genomic data, as well as how concerns differ according to geography. 123, 124 These findings help ensure that GA4GH's work can enhance the public trust in a global context upon which the future of genomics depends. With more than 30 GA4GH standards approved, and dozens of production-ready implementations of those standards deployed around the world, GA4GH is now shifting its focus toward demonstrating how standards can work together to provide seamless support of genomic activities. 
Interconnected standards that are compatible and interoperable with each other and are hardened for real-world use will enable solutions for federated analyses across platforms and use cases. To drive this effort, GA4GH has established the Federated Analysis System Project (FASP), which aims to demonstrate how GA4GH APIs, when used in concert, can support real-world, scientific use cases (see https://www.ga4gh.org/genomic-data-toolkit/2020-connection-demos/). A key outcome of FASP is a series of scripts that represent working examples of clients accessing real-world GA4GH-compatible services to solve a spectrum of challenges across the search-access-analyze workflow. The scripts illustrate how these services have adopted GA4GH standards to solve challenges, such as dataset discoverability and controlled data access, in order to drive larger-scale and more powerful analyses. By developing working implementations of GA4GH standards that are pressure-tested in real-world scenarios, the FASP team has identified specific areas of improvement within the standards. As a result of this work, new features will be added to existing GA4GH specifications to further facilitate secure, real-world federated data sharing and analysis. Most notably, the group is working toward a standardized solution for using a GA4GH Passport to access a controlled-access dataset from a Data Repository Service (DRS), while fulfilling robust security requirements, such as preventing escalation of privilege. These efforts will be critical to support access to valuable datasets across the globe. To date, GA4GH has primarily focused on overcoming the challenges of enabling interoperability within new initiatives built on a foundation of cloud infrastructure. However, an additional (and potentially more significant) challenge is bringing high-performance computing (HPC) infrastructures that are not already focused on cloud interoperability into the federated network envisioned by this community.
While more ambitious goals are on the horizon for connecting and extending GA4GH standards (e.g., discovery of datasets; matching requests, analyses, and datasets; describing phenotypes; reporting on variants), FASP has shown through its real-world demonstrations of access across distributed but interoperable datasets that the initial groundwork for federated analysis is now in place. The Data Repository Service (DRS) allows data custodians to make controlled-access data available at multiple sites; the Workflow Execution and Task Execution Services (WES & TES) allow researchers to encapsulate and run analyses on those data; and AAI and Passports allow for federated authorization and authentication, streamlining the data access process for both researchers and data custodians. In 2021, GA4GH began developing the GA4GH Starter Kit, a set of open source reference implementations (for example, code bases that demonstrate the standards working in practice), to help ensure existing HPC environments can interoperate with the wider GA4GH network. These resources consist of "plug-and-play" code that any institution (cloud-based or HPC) can use to quickly achieve GA4GH compatibility and will facilitate the progressive movement of established large-scale systems toward interoperability. In addition, a testing suite will be developed to ensure deployments of both reference and non-reference implementations are compliant with their respective GA4GH specifications. Once standards have been piloted in real-world Driver Project settings and shown to enable true federated analysis in FASP, they can begin to be promoted more broadly in the research and clinical genomics communities.
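To make the division of labor between DRS and WES described above more concrete, the following is a minimal Python sketch of how a client might prepare the two requests. It only builds request descriptions and performs no network calls. The host names, object identifier, workflow URL, and token are hypothetical placeholders, and passing the access token as a bearer header is an assumption made for illustration, since the text notes that Passport-based access to DRS was still being standardized.

```python
import json
from urllib.parse import quote, urlparse


def resolve_drs_uri(drs_uri: str) -> str:
    """Map a hostname-based DRS URI (drs://host/id) to its HTTPS object endpoint."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


def wes_run_request(wes_base_url: str, workflow_url: str, drs_uri: str, token: str) -> dict:
    """Describe a WES 'run workflow' request that takes a DRS URI as its input."""
    return {
        "method": "POST",
        "url": f"{wes_base_url.rstrip('/')}/ga4gh/wes/v1/runs",
        # Assumption: the caller's credential is sent as a bearer token.
        "headers": {"Authorization": f"Bearer {token}"},
        "form": {
            "workflow_url": workflow_url,
            "workflow_type": "WDL",
            "workflow_type_version": "1.0",
            # "input_file" is a hypothetical workflow parameter name.
            "workflow_params": json.dumps({"input_file": drs_uri}),
        },
    }


# Hypothetical endpoints and identifiers, for illustration only.
object_url = resolve_drs_uri("drs://drs.example.org/314159")
run_plan = wes_run_request(
    "https://wes.example.org",
    "https://workflows.example.org/variant-calling.wdl",
    "drs://drs.example.org/314159",
    token="PLACEHOLDER_TOKEN",
)
print(object_url)
print(run_plan["url"])
```

In this arrangement the data never need to be copied to the researcher: the workflow engine resolves the DRS URI close to where the data are held, which is the pattern the FASP demonstrations exercise.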
Launched in 2020, the Genomics in Health Implementation Forum (GHIF) brings together a group of national-scale genomic data initiatives to share resources, experiences, and best practices for implementing GA4GH standards, as well as broader experience in rolling out national and international data sharing activities. GHIF aims to support more accurate data interpretation and disease diagnosis plus other innovative solutions across healthcare through global cooperation in data sharing and clinical implementation of genomics. Broad uptake of GA4GH standards among GHIF members, which include both GA4GH Driver Projects and other national and multi-national initiatives (see https://ga4gh.org/implementation for full list), will provide strong evidence that GA4GH standards are supporting the community's actual data sharing needs.

Implementation of GA4GH policies and standards throughout the scientific and healthcare communities will allow researchers to access data across the globe, a critical step toward answering otherwise impenetrable questions about disease and basic human biology. As the volume of genomic and health-related data grows exponentially around the world, researchers, clinicians, and bioinformaticians have a responsibility to make that data appropriately accessible and to use it to realize benefits for all humans everywhere. The promise of genomic medicine lies at a crossroads that depends on harmonization across the global community to significantly enhance human health and medicine. We believe that GA4GH, by embracing collaborative innovation and knowledge exchange, is well poised to meet this challenge.

Refer to Web version on PubMed Central for supplementary material.

By aligning with existing standards, tools, and resources, GA4GH aims to minimize redundancy and the unnecessary proliferation of competing standards.
We outline three specific examples that demonstrate GA4GH efforts to align with existing standards and standards development organizations. The PED format is a well-known standard for exchanging pedigree information and is widely used in both research and clinical settings (see PLINK in web resources). 20 However, PED only allows for the representation of basic parent-child relationships, and does not represent all of the data elements and relationships needed by the genomics community. Building upon this format, the GA4GH Pedigree Subgroup has mapped PED format data elements to the Pedigree data model, allowing adopters to transition to a more robust representation of family health history without data loss and enabling compatibility with pre-existing family health history tools. Phenopackets, a standard for case-level phenotypic data exchange, can be compared to a hierarchical structure of "slots" that can be populated with ontology terms and other data. In order to maximize utility of computational analyses, these slots are compatible with any pre-existing terminologies or ontologies, such as the Human Phenotype Ontology for human disease phenotypes, NCI Thesaurus for cancer, LOINC for laboratory results, and MONDO for diseases. The modular design of the standard also enables interoperability with complementary GA4GH deliverables, like Pedigree and the Variation Representation Specification (VRS), by integrating them within the structure of the phenopacket. The GA4GH Variation Representation Specification (VRS) and Variant Annotation (VA) framework were developed to address the diverse methods used to access reference genome sequence and genomic annotation (e.g., genes, variation, regulatory regions, expression). Associated metadata can often be unstructured. VRS and VA aim to enable the provision, sharing, and computational representation of genomic variation information in a way that is unambiguous and semantically rigorous. 
These specifications are developed with bidirectional feedback with the standards of the Health Level 7 (HL7) clinical genomics working group, which supports the reporting of clinical genomic test results and related information with electronic health records (EHRs). Alignment between these specifications is a critical step toward supporting data exchange and system interoperability across the clinical-translational-research spectrum.

The GA4GH Regulatory and Ethics Work Stream (REWS) develops ready-to-use policy guidance to support responsible, international genomic and health-related data sharing. Here, we list central components of the GA4GH Regulatory & Ethics Toolkit. The REWS also reviews all GA4GH technical standards for any regulatory or ethics issues that may be relevant.

GA4GH has developed five policy guidance documents (or "Frameworks") that build on the Framework for Responsible Sharing of Genomic and Health-Related Data, each aiming to address a specific area of responsible data sharing:

• Consent Policy Framework: describes how to maximize responsible and respectful international data sharing through the design of consents for prospective data collection and through the assessment of existing consents for retrospective data sharing (https://www.ga4gh.org/wp-content/uploads/GA4GH-Final-Revised-Consent-Policy_16Sept2019.pdf)

• Framework: provides principled and practical guidance for processing data in a way that protects and promotes the security, integrity, and availability of data and services, and the privacy of individuals, families, and communities whose data are processed (https://www.ga4gh.org/wp-content/uploads/GA4GH-Data-Privacy-and-Security-Policy_FINAL-August-2019_wPolicyVersions.pdf)

• Ethics Review Recognition Policy Framework: provides essential elements for the ethics review process of multi-jurisdictional research involving health-related data so as to foster recognition of extrajurisdictional ethics reviews and efficient and responsible health-related data sharing (https://www.ga4gh.org/wp-content/uploads/GA4GH-Ethics-Review-Recognition-Policy.pdf)

• Framework: provides principled and practical best practices for sharing data in a way that protects and promotes the confidentiality, integrity, and availability of data and services, and the privacy of individuals, families, and communities whose data are shared (https://www.ga4gh.org/wp-content/uploads/Privacy-and-Security-Policy.pdf)

• Results: provides a reference point for managing the return of clinically actionable research results that recognizes the importance of the accountability and transparency of genomic researchers toward participants (https://www.ga4gh.org/wp-content/uploads/2021-Policyon-Clinically-Actionable-Genomic-Research-Results.pdf)

A typology of model consent clauses that aim to assist researchers in the drafting of interoperable consent forms and ensure they use language that matches cutting-edge GA4GH international standards. A typology of clauses has been developed for genomics research (https://www.ga4gh.org/wp-content/uploads/Consent-Clauses-for-Genomic-Research.docx.pdf), familial consent (https://www.ga4gh.org/wp-content/uploads/Familial-Consent-Clauses-6.pdf), 114 pediatric consent (forthcoming), and rare disease (https://bmcmedethics.biomedcentral.com/articles/10.1186/s12910-019-0390-x/tables/3). Additional typologies are forthcoming for large-scale initiatives and clinical whole-genome sequencing.

The MRCG provides instructions for researchers to integrate standard data-sharing language into consent forms in a way that can be translated into a computable language. Machine-readable consent language can be attached to datasets and stored in their descriptive data using DUO terms.
Researchers can then search for datasets that have been consented for their research purposes (https://www.ga4gh.org/wp-content/uploads/Machine-readable-Consent-Guidance_6JUL2020-1.pdf).

DACReS is a set of procedural standards for data access committees that facilitate consistency, effectiveness, and robustness of reviews for data access requests to genomic and health-related data.

This framework enables researchers and others to robustly design engagement with various public and patient audiences implicated in genomic data sharing. Through reflexive questions centered around themes of fairness, context, heterogeneity, and the recognition of tension, the framework facilitates critical inquiry into stakeholder engagement (https://www.ga4gh.org/wp-content/uploads/GA4GH_Engagement-policy_V1.0_July2021-1.pdf).

These monthly briefs answer important questions about the impact of the European General Data Protection Regulation on various aspects of international health research and genomic and health-related data sharing (https://www.ga4gh.org/genomic-data-toolkit/regulatory-ethics-toolkit/gdpr-forum/).

GA4GH is a community of diverse stakeholders from Driver Projects and other institutions working together in the context of Work Streams. Each GA4GH Driver Project is expected to dedicate two full-time equivalents across at least two GA4GH Work Streams. As foundational groups that review all GA4GH deliverables, the Regulatory and Ethics and Data Security Work Streams must have representation from every Driver Project. In addition to Driver Projects, any member of the community, regardless of domain, sector, nation, or affiliation, is invited to participate in any GA4GH Work Stream. Supplemental information includes details on how each of the 24 GA4GH Driver Projects intersects with the six technical Work Streams.
It provides a framework for public web services responding to queries against genomic data collections, for instance from population-based or disease-specific genome repositories. Beacon is designed to (1) focus on robustness and easy implementation, (2) be maintained by individual organizations and assembled into a federated network, (3) be general-purpose and able to be used to report on any variant collection, (4) provide a boolean (or quantitative) answer about the observation of a variant, and (5) protect privacy, with queries not returning information about single individuals. A new version of the API will include support for more granular control based on a user's identity authorization and will enable discovery of cohorts, cases (patients), biological samples, and genomic variants and associated knowledge. More details can be found on the Beacon Project website.

Data Connect https://github.com/ga4gh-discovery/data-connect API data custodians, researchers, and API & tool developers

Data Connect is a specification for discovery and search of biomedical data, which provides a mechanism for describing data and its data model, and for searching data within the given data model.

and tool developers

The GA4GH Passport specification aims to support data access policies within current and evolving data access governance systems. This specification defines Passports and Passport Visas as the standard way of communicating a user's data access authorizations based on either their role (e.g., researcher), affiliation, or access status. Passport Visas from trusted organizations can therefore express data access authorizations that require either a registration process (for the Registered Access data access model 11 ) or custom data access approval (such as the Controlled Access applications used for many datasets).
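As an illustration of how Passport Visas can carry a data access authorization, the sketch below builds a demo Passport as a JSON Web Token whose `ga4gh_passport_v1` claim holds Visa tokens, each with a `ga4gh_visa_v1` claim, and then checks for a `ControlledAccessGrants` Visa naming a dataset. The dataset URL, issuer, and helper function names are hypothetical. The demo tokens are unsigned and the payload is decoded without verification, purely to show the claim layout; a real service must validate signatures and issuers before trusting any Visa.

```python
import base64
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_unsigned_jwt(payload: dict) -> str:
    """Build a demo token with an empty signature (illustration only)."""
    header = {"alg": "none", "typ": "JWT"}
    return ".".join([
        _b64url(json.dumps(header).encode("utf-8")),
        _b64url(json.dumps(payload).encode("utf-8")),
        "",
    ])


def decode_payload(token: str) -> dict:
    """Decode the payload segment WITHOUT verifying the signature."""
    segment = token.split(".")[1]
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def visas_in_passport(passport_token: str) -> list:
    """Return the visa objects embedded in a Passport token."""
    passport = decode_payload(passport_token)
    return [decode_payload(v)["ga4gh_visa_v1"]
            for v in passport.get("ga4gh_passport_v1", [])]


def has_controlled_access(passport_token: str, dataset_url: str) -> bool:
    """True if any visa grants controlled access to the given dataset."""
    return any(v["type"] == "ControlledAccessGrants" and v["value"] == dataset_url
               for v in visas_in_passport(passport_token))


# Hypothetical dataset and issuing organization.
DATASET = "https://repository.example.org/datasets/710"
visa = make_unsigned_jwt({
    "sub": "researcher-123",
    "ga4gh_visa_v1": {
        "type": "ControlledAccessGrants",
        "asserted": 1609459200,
        "value": DATASET,
        "source": "https://dac.example.org",
        "by": "dac",
    },
})
passport = make_unsigned_jwt({"sub": "researcher-123", "ga4gh_passport_v1": [visa]})
print(has_controlled_access(passport, DATASET))
```

Keeping each Visa as its own token lets different trusted organizations assert different facts about the same researcher, which matches the role, affiliation, and access-status distinctions drawn in the text.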
Service Info https://github.com/ga4gh-discovery/ga4gh-service-info API API and tool developers

Service discovery is at the root of any computational workflow using web-based APIs. Traditionally, this is hard-coded into workflows, and discovery is a manual process. Service Info provides a way for an API to expose a set of metadata to help discovery and aggregation of services via computational methods. It also allows a server/implementation to describe its capabilities and limitations. Service-info is described in GA4GH OpenAPI specification, which can be visualized using Swagger Editor (https://editor.swagger.io/?url=https://raw.githubusercontent.com/ga4gh-discovery/ga4gh-service-info/develop/service-info.yaml).

Service Registry https://github.com/ga4gh-discovery/ga4gh-service-registry API API and tool developers

Service registry is a GA4GH service providing information about other GA4GH services, primarily for the purpose of organizing services into networks or groups and service discovery across organizational boundaries. Information about the individual services in the registry is described in the complementary Service Info specification (see above). The Service Registry specification is useful when dealing with technologies that handle multiple GA4GH services. Common use cases include creating networks or groups of services of a certain type (e.g., Beacon Network searches networks of Beacon services across multiple organizations, a workflow can be executed by a specific group of Workflow Execution Services, or Data Connect search on biomedical data is federated across a set of nodes), or a certain host (e.g., an organization provides implementations of Beacon, Data Connect, and Data Repository Service APIs, or a server hosts an implementation of refget and htsget APIs).
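The two use cases above (grouping services by type and grouping by host organization) can be sketched with plain Python dictionaries shaped like Service Info documents, where `type` carries `group`, `artifact`, and `version`. The registry contents, organization names, and URLs below are invented for illustration, and the helper functions are hypothetical rather than part of either specification.

```python
from collections import defaultdict

# A hypothetical registry listing; each entry mirrors a Service Info document
# plus the URL at which the service can be reached.
REGISTRY = [
    {"id": "org.example-a.beacon", "name": "Example A Beacon",
     "type": {"group": "org.ga4gh", "artifact": "beacon", "version": "1.0.0"},
     "organization": {"name": "Example A", "url": "https://a.example.org"},
     "version": "1.2.0", "url": "https://a.example.org/beacon"},
    {"id": "org.example-b.beacon", "name": "Example B Beacon",
     "type": {"group": "org.ga4gh", "artifact": "beacon", "version": "1.0.0"},
     "organization": {"name": "Example B", "url": "https://b.example.org"},
     "version": "0.9.1", "url": "https://b.example.org/beacon"},
    {"id": "org.example-a.drs", "name": "Example A DRS",
     "type": {"group": "org.ga4gh", "artifact": "drs", "version": "1.1.0"},
     "organization": {"name": "Example A", "url": "https://a.example.org"},
     "version": "2.0.0", "url": "https://a.example.org/drs"},
]


def services_of_type(registry: list, artifact: str) -> list:
    """Select services implementing one API, e.g. to form a Beacon network."""
    return [s for s in registry if s["type"]["artifact"] == artifact]


def services_by_organization(registry: list) -> dict:
    """Group service URLs by the organization that hosts them."""
    grouped = defaultdict(list)
    for service in registry:
        grouped[service["organization"]["name"]].append(service["url"])
    return dict(grouped)


beacon_network = [s["url"] for s in services_of_type(REGISTRY, "beacon")]
print(beacon_network)
print(services_by_organization(REGISTRY))
```

Because the selection relies only on the self-described `type` metadata, a client does not need service locations hard-coded into a workflow, which is the manual step Service Info is meant to remove.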
Remotely run analytical methods on data of interest

htsget 12 samtools.github.io/hts-specs/htsget.html API API and tool developers, researchers

htsget is a data retrieval API that bridges from existing genomics file formats to a client/server model with the following features:

• Incumbent data formats (BAM, CRAM, VCF) are preferred initially, with a future path to others.

• Multiple server implementations are supported, including those that do format transcoding on the fly, and those that return essentially unaltered filesystem data.

• Multiple use cases are supported, including access to small subsets of genomic data (e.g., for browsing a given region) and to full genomes (e.g., for calling variants).

Cell Genom. Author manuscript; available in PMC 2022 January 20.

infrastructure. An increasing number of GA4GH projects rely on Cloud services to pursue their goals, and the GA4GH Cloud Work Stream is working on several products to make the GA4GH community take full advantage of the Cloud paradigm. However, the use of the Cloud poses significant security and privacy challenges that need to be carefully evaluated and addressed. The purpose of the Cloud Security and Privacy Policy is to outline a common security technology framework that can be used to systematically assess the products developed by the CWS from a security perspective. Product developers and reviewers can leverage the information contained herein to identify requirements, threats, and countermeasures related to the products they are working on, thus facilitating the production of secure standards.

• Significantly better lossless compression than BAM

• To permit simple and lossless transformations to and from BAM files

• Support for controlled loss of data

The first two objectives allow users to take immediate advantage of the CRAM format while offering a smooth transition path from using BAM files.
The third objective supports the exploration of different lossy compression strategies and provides a framework in which to effect these choices. Data in CRAM is stored in a columnar fashion, with each column being compressed with either a general-purpose compressor or a custom method. If aligned, sequences may be stored as differences against a reference sequence, which is optionally stored within the CRAM file. External references may be either a local file or obtained remotely via the refget API. Data may be retrieved either as whole alignment records, or selectively only for the fields (columns) required. Purpose exchange mechanisms. It will also provide a formal framework for defining custom extensions to the core model -allowing community-driven development of VA-based data models for new data types and use cases. A more detailed description of these components can be found online. The VA-Spec is being authored by a partnership among national resource providers and major public initiatives within GA4GH. It has been informed by and will be tested in diverse, established, and actively developed Driver Projects, including ClinGen, VICC, Genomics England, the Monarch Initiative, BRCA Exchange, and Australian Genomics. In these contexts, it will be used to support different types of tools and information systems, including variant curation tools and interpretation platforms (e.g., ClinGen, CIViC, Genomics England), variant annotation services (e.g., CellBase), knowledge aggregators/portals (e.g., BRCA Exchange, Monarch Initiative), matchmaking applications (e.g., Matchmaker Exchange), and clinical information systems and decision support tools. API and tool developers, data custodians Maximizing the personal, public, research, and clinical value of genomic information will require that clinicians, researchers, and testing laboratories exchange genetic variation data reliably. 
The Variation Representation Specification (VRS, pronounced "verse"), written by a partnership among national information resource providers, major public initiatives, and diagnostic testing laboratories, is an open specification to standardize the exchange of variation data. The primary contributions of VRS include (1) terminology and an information model, (2) a machine-readable schema, (3) conventions that promote reliable data sharing, (4) globally unique computed identifiers, and (5) a Python implementation (available at vrs-python) that demonstrates the above schema and algorithms and supports translation of existing variant representation schemes into VRS for use in genomic data sharing. It may be used as the basis for development in Python, but it is not required in order to use VRS. The machine-readable schema definitions and example code are available online at the VRS repository. Readers may wish to view a complete example before reading the specification. For a discussion of VRS with respect to existing standards, such as HGVS, SPDI, and VCF, see "Relationship of VRS to existing standards," an appendix to the specification documentation.

deletions, and structural variants, together with rich annotations. VCF may hold data for multiple samples within the same file. The specification contains the header meta-data fields, a series of mandatory columns describing the variants, and details of the optional annotations, which are either per-site or per-sample. VCF and its binary counterpart, BCF, are usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome.

The GA4GH Toolkit outlines a suite of secure standards and frameworks that will enable more meaningful research and patient data harmonization and sharing.
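To give a feel for the "globally unique computed identifiers" listed among the VRS contributions above, here is a small Python sketch of a truncated-digest identifier: the object is serialized deterministically, hashed with SHA-512, truncated to 24 bytes, and base64url-encoded. The serialization used here (sorted keys, compact separators) and the example allele fields are simplifying assumptions for illustration; the VRS specification defines its own, more detailed serialization rules, and the vrs-python package should be used for real identifiers.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded (32 characters)."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def computed_identifier(obj: dict, type_prefix: str) -> str:
    """Derive an identifier from content alone (simplified serialization)."""
    # Assumption: sorted keys and compact separators stand in for the
    # canonical serialization that the specification prescribes.
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return f"ga4gh:{type_prefix}.{sha512t24u(blob)}"


# A hypothetical, heavily simplified allele description.
allele_a = {"sequence": "refseq:NC_000013.11", "start": 32936731, "end": 32936732, "alt": "C"}
allele_b = {"alt": "C", "end": 32936732, "start": 32936731, "sequence": "refseq:NC_000013.11"}

# Two parties holding the same content derive the same identifier
# independently, without consulting a central registry.
print(computed_identifier(allele_a, "VA"))
print(computed_identifier(allele_a, "VA") == computed_identifier(allele_b, "VA"))
```

The design point is that identity follows from content: any change to the represented variation yields a different identifier, while key order or whitespace in the source data does not.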
This suite addresses a variety of challenges across the data sharing life cycle and is applicable across the world's accessible medical and patient-centered systems, knowledgebases, and raw data sources.

Universal Declaration on the Human Genome and Human Rights (revised draft)
Creating a Global Alliance to Enable Responsible Sharing of Genomic and Clinical Data
Genomics in healthcare: GA4GH looks to 2022
The next 20 years of human genomics must be more equitable and more open
GENOMICS. A federated ecosystem for sharing genomic, clinical data
Integrating Genomics into Healthcare: A Global Responsibility
Federated discovery and sharing of genomic data using Beacons
The Data Use Ontology to streamline responsible access to diverse datasets
GA4GH Passport standard for digital identity and access permissions
Registered access: authorizing data access
htsget: a protocol for securely streaming genomic data
Refget: standardised access to reference sequences
Efficient storage of high throughput DNA sequencing data using reference-based compression
Crypt4GH: a file format standard enabling native access to encrypted data
Empirical Validation of an Automated Approach to Data Use Oversight
The Sequence Alignment/Map format and SAMtools
The GA4GH Variation Representation Specification: A Computational Framework for variation representation and Federated Identification
The variant call format and VCFtools
PLINK: a tool set for whole-genome association and population-based linkage analyses
The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species
The Human Phenotype Ontology in 2021
Classification, Ontology, and Precision Medicine
International Federation of Genomic Medicine Databases Using GA4GH Standards
Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language
Resource entitlement management system
Federated Identity Management for research collaborations
Common ELIXIR Service for Researcher Authentication and Authorisation
Federated Identity Management for Research
Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL)
Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease
Sharing genomic data from clinical testing with researchers: public survey of expectations of clinical genomic data management in Queensland, Australia
Laboratory and clinical genomic data sharing is crucial to improving genetic health care: a position statement of the American College of Medical Genetics and Genomics
NCBI's Database of Genotypes and Phenotypes: dbGaP
Feasibility of using Clinical Element Models (CEM) to standardize phenotype variables in the database of genotypes and phenotypes (dbGaP)
Prospective comparison of the cost-effectiveness of clinical whole-exome sequencing with that of usual care overwhelmingly supports early use and reimbursement
Whole-genome sequencing expands diagnostic utility and improves clinical management in paediatric medicine
Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases
Clinical whole genome sequencing as a first-tier test at a resource-limited dysmorphology clinic in Mexico
Rapid whole-genome sequencing decreases infant morbidity and cost of hospitalization
The case for open science: rare diseases
Mendelian Gene Discovery: Fast and Furious with No End in Sight
A Randomized, Controlled Trial of the Analytic and Diagnostic Performance of Singleton and Trio, Rapid Genome and Exome Sequencing in Ill Infants
How many rare diseases are there?
Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database
Evaluating the Clinical Validity of Gene-Disease Associations: An Evidence-Based Framework Developed by the Clinical Genome Resource
"Matching" consent to purpose: The example of the Matchmaker Exchange
The Matchmaker Exchange: a platform for rare disease gene discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
Scaling resolution of variant classification differences in ClinVar between 41 clinical laboratories through an outlier approach
Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries
Trends in the lifetime risk of developing cancer in Great Britain: comparison of risk for those born from 1930 to 1960
Hallmarks of cancer: the next generation
Prevalence of germline mutations in cancer predisposition genes in patients with pancreatic cancer
The relationship between the roles of BRCA genes in DNA repair and cancer predisposition
Realizing the promise of cancer predisposition genes
Refractory alveolar rhabdomyosarcoma in an 11-year-old male
TRIM28 congenital predisposition to Wilms' tumor: novel mutations and presentation in a sibling pair
Treatment response and tumor evolution: lessons from an extended series of multianalyte liquid biopsies in a metastatic breast cancer patient
DICER1 and FOXL2 mutations in ovarian sex cord-stromal tumours: a GINECO Group study
DNMT3A mutations in acute myeloid leukemia
Osimertinib: First Global Approval
The diagnostic challenges and clinical course of a myeloid/lymphoid neoplasm with eosinophilia and ZBTB20-JAK2 gene fusion presenting as B-lymphoblastic leukemia
The pivotal role of sampling recurrent tumors in the precision care of patients with tumors of the central nervous system
Genomics-Driven Precision Medicine for Advanced Pancreatic Cancer: Early Results from the COMPASS
Systematic Review and Meta-Analysis of the Magnitude of Structural, Clinical, and Physician and Patient Barriers to Cancer Clinical Trial Participation
Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients
The challenge of intratumour heterogeneity in precision medicine
Translating Germline Cancer Risk into Precision Prevention
All the World's a Stage: Facilitating Discovery Science and Improved Cancer Care through the Global Alliance for Genomics and Health
Variant Interpretation for Cancer Consortium (2020). A harmonized meta-knowledgebase of clinical interpretations of somatic genomic variants in cancer
Common Disease-Common Variant Hypothesis
The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation
Genome-Wide Polygenic Score and Cardiovascular Outcomes With Evacetrapib in Patients With High-Risk Vascular Disease: A Nested Case-Control Study
Polygenic background modifies penetrance of monogenic variants for tier 1 genomic conditions
Performance of Atrial Fibrillation Risk Prediction Models in Over Four Million Individuals
Whole-Genome Sequencing to Characterize Monogenic and Polygenic Contributions in Patients Hospitalized With Early-Onset Myocardial Infarction
Public health genomics and the new molecular epidemiology of bacterial pathogens
The potential of whole genome NGS for infectious disease diagnosis
Clinical Pathogen Genomics
Mapping the human genetic architecture of COVID-19
England world leaders in the use of whole genome sequencing to diagnose TB
Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak
Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus aureus: a descriptive study
Rapid whole-genome sequencing for investigation of a neonatal MRSA outbreak
Genomics in healthcare: GA4GH looks to 2022
The Convergence of Research and Clinical Genomics
The Learning Healthcare System: Workshop Summary (IOM Roundtable on Evidence-Based Medicine)
Best Care at Lower Cost: The Path to Continuously Learning Health Care in America
Barriers to Achieving Economies of Scale in Analysis of EHR Data. A Cautionary Tale
Using a network organisational architecture to support the development of Learning Healthcare Systems
A new approach to clinical research: Integrating clinical care, quality reporting, and research using a wound care network-based learning healthcare system
Recent Approaches to Improve Medication Adherence in Patients with Coronary Heart Disease: Progress Towards a Learning Healthcare System
Eunice Kennedy Shriver National Institute of Child Health and Human Development Collaborative Pediatric Critical Care Research Network
Patient-Centered Precision Health In A Learning Health Care System: Geisinger's Genomic Medicine Experience
Development of Clinical Domain Working Groups for the Clinical Genome Resource (ClinGen): lessons learned and plans for the future
ClinGen - the Clinical Genome Resource
Common Problems, Common Data Model Solutions: Evidence Generation for Health Technology Assessment
Extracting research-quality phenotypes from electronic health records to support precision medicine
Better Use of Health Data
Color Data v2: a user-friendly, open-access database with hereditary cancer and hereditary cardiovascular conditions datasets
The UK Biobank resource with deep phenotyping and genomic data
Cohort Profile: Estonian Biobank of the Estonian Genome Center
The Geisinger MyCode community health initiative: an electronic health record-linked biobank for precision medicine research
Million Veteran Program: A mega-biobank to study genetic influences on health and disease
Development of a large-scale de-identified DNA biobank to enable personalized medicine
The value of genomic variant ClinVar submissions from clinical providers: Beyond the addition of novel variants
Clinical laboratories collaborate to resolve differences in variant interpretations submitted to ClinVar
A new era in the interpretation of human genomic variation
The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future
Cohort Profile: The Right Drug, Right Dose, Right Time: Using Genomic Data to Individualize Treatment Protocol (RIGHT Protocol)
Real-world integration of genomic data into the electronic health record: the PennChart Genomics Initiative
PG4KDS: a model for the clinical implementation of pre-emptive pharmacogenetics
The Genetic Family as Patient?
Framework for responsible sharing of genomic and health-related data
Federated Data Systems: Balancing Innovation and Trust in the Use of Sensitive Data
ASSAf Statement on Academic Freedom and the Values of Science
DIGITAL INNOVATION HUB PROGRAMME PROSPECTUS APPENDIX
Sharing and reuse of individual participant data from clinical trials: principles and recommendations
Truly Privacy-Preserving Federated Analytics for Precision Medicine with Multiparty Homomorphic Encryption. bioRxiv
Revolutionizing Medical Data Sharing Using Advanced Privacy-Enhancing Technologies: Technical, Legal, and Ethical Synthesis
Toward better governance of human genomic data
Global Public Perceptions of Genomic Data Sharing: What Shapes the Willingness to Donate DNA and Health Data?
Demonstrating trustworthiness when collecting and sharing genomic data: public views across 22 countries

Box 1. The GA4GH Work Streams are the key production teams of the organization. Each tackles a specific area in the data life cycle, as described below (URLs listed in the web resources).
1. Data use & researcher identities: Develops ontologies and data models to streamline global access to datasets generated in any country [9, 10]
2. Genomic knowledge standards: Develops specifications and data models for exchanging genomic variant observations and knowledge [18]
3. Cloud: Develops federated analysis approaches to support the statistical rigor needed to learn from large datasets
4. Data security: Develops guidelines and recommendations to ensure identifiable genomic and phenotypic data remain appropriately secure without sacrificing their analytic potential
5. Regulatory & ethics: Develops policies and recommendations for ensuring individual-level data are interoperable with existing norms and follow core ethical principles
6. Discovery: Develops data models and APIs to make data findable, accessible, interoperable, and reusable (FAIR)
7. Clinical & phenotypic data capture & exchange: Develops data models to ensure genomic data is most impactful through rich metadata collected in a standardized way
8. Large-scale genomics: Develops APIs and file formats to ensure harmonized technological platforms can support large-scale computing

Box 3. Here we outline some of the major challenges to achieving the broad goal of responsible sharing of genomic and related health data. This includes setting up the infrastructure to support the flow of data from clinical practice into research, as well as establishing data-access and accountability mechanisms that are appropriate to research settings. These need to be consistent with the legal frameworks of the healthcare setting, and respectful of the rights of the individual data donor including their privacy, the security of their data, and their autonomy with regard to research participation.
Inconsistency and lack of version control in data-generating pipelines
Lack of dataset interoperability due to disparate data models and terminologies
Inadequate infrastructure for ingesting and storing data
Difficulty or lack of resources for enabling access to data
Insufficient consent for data sharing and lack of resources to support the consent process
Data privacy and security issues, as well as real and perceived regulatory issues
Challenges to ensuring patients understand how their data are used and have sufficient autonomy around data sharing participation
Differences in priorities, experiences, and trust levels concerning data sharing between different population groups and stakeholders
Lack of incentives in the clinical care system for prioritizing data sharing and research
Lack of data-sharing mandates