key: cord-0785783-m7amgd9f
authors: Li, Cuiping; Tian, Dongmei; Tang, Bixia; Liu, Xiaonan; Teng, Xufei; Zhao, Wenming; Zhang, Zhang; Song, Shuhui
title: Genome Variation Map: a worldwide collection of genome variations across multiple species
date: 2020-11-10
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkaa1005
sha: 70a8ee43b779caa52898af8f3e0fd03dffe486f4
doc_id: 785783
cord_uid: m7amgd9f

The Genome Variation Map (GVM; http://bigd.big.ac.cn/gvm/) is a public data repository of genome variations. It aims to collect and integrate genome variations for a wide range of species, accepts submissions of different variation types from all over the world and provides free open access to all publicly available data in support of worldwide research activities. Compared with the previous version, particularly, a total of 22 species, 115 projects, 55 935 samples, 463 429 609 variants, 66 220 associations and 56 submissions (as of 7 September 2020) were newly added in the current version of GVM. In the current release, GVM houses a total of ∼960 million variants from 41 species, including 13 animals, 25 plants and 3 viruses. Moreover, it incorporates 64 819 individual genotypes and 260 393 manually curated high-quality genotype-to-phenotype associations. Since its inception, GVM has archived genomic variation data of 43 754 samples submitted by worldwide users and served >1 million data download requests. Collectively, as a core resource in the National Genomics Data Center, GVM provides valuable genome variations for a diversity of species and thus plays an important role in both functional genomics studies and molecular breeding.

The Genome Variation Map (GVM; https://bigd.big.ac.cn/gvm/), as a core resource of the National Genomics Data Center (CNCB-NGDC) (1), part of the China National Center for Bioinformation (CNCB), is a public data repository of genome variations.
Since its inception in 2017 (2), GVM has served as a central public resource for genome variations and played an important role in both functional genomics studies and molecular breeding (3, 4). For instance, variants and knowledge associations deposited in GVM have been used in several data resources (e.g. IC4R (5), SR4R (6), MBKbase for rice (7), GWAS Atlas (8) and Animal-ImputeDB (9)). Over the past several years, advances in high-throughput sequencing technologies have empowered large-scale population genome sequencing projects, leading to massive numbers of genome variations being identified at unprecedented rates. Consequently, GVM has accepted >50 data submissions (10-12) from all over the world and, as of September 2020, housed a large number of genome variations from 41 species, including not only human but also domesticated animals, cultivated plants and viruses, particularly SARS-CoV-2, the coronavirus responsible for the ongoing global pandemic. Meanwhile, GVM has served >1 million data download requests (https://bigd.big.ac.cn/gvm/statistics). Importantly, to provide high-quality variant data and metadata and deliver user-friendly data services, GVM has been frequently updated in the past years by standardizing the curation model and process, improving the web functionalities for data submission, browsing and download, providing the database tutorial in PPT and video formats, and adding external links to other public databases, such as dbSNP (13), GWAS Catalog (14), NCBI genome (15), ENSEMBL (16), JGI (17), MaizeDB (18) and DRDB (19). Here we present an updated release of GVM and briefly describe its recent updates and data growth.

Whole-genome resequencing projects were collected from the published literature, and raw sequence data were downloaded from the Sequence Read Archive (SRA) (20) and the Genome Sequence Archive (GSA) (21, 22).
All collected raw sequence reads were subjected to quality control using Trimmomatic-0.36, and cleaned reads were aligned to the reference genomes using BWA-MEM (23). The aligned reads were then merged into a single BAM file and sorted by SAMtools (19), and duplicates were marked using MarkDuplicates in GATK-4.0.5.0 (24). After removing duplicate reads, high-quality variants were identified by both GATK HaplotypeCaller and SAMtools mpileup, and base quality was recalibrated by Base Quality Score Recalibration (BQSR). Then, an intermediate genomic VCF (GVCF) file for each sample was produced by running HaplotypeCaller in GVCF mode, and GenotypeGVCFs in GATK was applied to pool all GVCF files together to create a VCF file containing all raw variants. These raw variants were further filtered by using SelectVariants and VariantFiltration in GATK. Default parameters were used in the variant calling. The effects of all variants were annotated using VEP (25) as well as in-house pipelines. Functional annotation of variants was performed based on GO (26), UniProt (27) and Pfam (28). Furthermore, the genotype-to-phenotype (G2P) associations were manually curated from the published GWAS literature, and the relationships between sequence variants and phenotypic traits were established.

Over the past several years, GVM has been significantly updated regarding data modules and data volume. To better present genomic variants, all relevant entities and metadata in GVM are organized into six modules in terms of species, project, sample, variant, association and submission. Moreover, the number of genomic variants hosted in GVM has grown rapidly from ∼497 million in 19 species in August 2017 to ∼960 million in 41 species in August 2020 (Table 1). An illustration of all collected species and data statistics is presented in Figure 1 (with details in Supplementary Table S1).
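The core of the variant-calling workflow described in the methods above can be sketched as a short script that only assembles the command lines, without running them. This is a minimal illustration and not the authors' actual pipeline: the file names, the read-group fields (including the `PL:ILLUMINA` platform tag) and the use of CombineGVCFs to pool per-sample GVCFs are assumptions, and the Trimmomatic, BQSR, SAMtools mpileup and variant-filtration steps are left out because the text does not give their parameters.

```python
def per_sample_commands(sample, ref, fq1, fq2):
    """Build shell commands that take one sample from cleaned reads to a GVCF.

    All file names are hypothetical; nothing is executed here.
    """
    # HaplotypeCaller needs a sample name (SM) in the read group.
    # The platform tag is an assumption, not stated in the text.
    read_group = rf"@RG\tID:{sample}\tSM:{sample}\tPL:ILLUMINA"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    return [
        # Align with BWA-MEM and sort the alignments with SAMtools.
        f"bwa mem -R '{read_group}' {ref} {fq1} {fq2} "
        f"| samtools sort -o {sorted_bam} -",
        # Mark duplicate reads.
        f"gatk MarkDuplicates -I {sorted_bam} -O {dedup_bam} "
        f"-M {sample}.dup_metrics.txt",
        # Index the BAM so that GATK can read it.
        f"samtools index {dedup_bam}",
        # Call variants in GVCF mode, one intermediate file per sample.
        f"gatk HaplotypeCaller -R {ref} -I {dedup_bam} "
        f"-O {sample}.g.vcf.gz -ERC GVCF",
    ]


def joint_genotyping_commands(samples, ref, cohort="cohort"):
    """Build commands that pool per-sample GVCFs into one raw VCF."""
    variants = " ".join(f"-V {s}.g.vcf.gz" for s in samples)
    return [
        # Merge the per-sample GVCFs (one assumed way to pool them).
        f"gatk CombineGVCFs -R {ref} {variants} -O {cohort}.g.vcf.gz",
        # Joint genotyping produces the VCF with all raw variants.
        f"gatk GenotypeGVCFs -R {ref} -V {cohort}.g.vcf.gz "
        f"-O {cohort}.raw.vcf.gz",
    ]
```

Printing the returned strings for a few sample names gives a reviewable command list that could be passed to a shell or a workflow manager once paths and parameters have been checked against the installed tool versions.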
Compared with the previous version, a total of 22 species, 115 projects, 55 935 samples, 463 429 609 variants, 66 220 associations and 56 submissions (as of 7 September 2020) were newly added in the current version of GVM.

The Species module provides a comprehensive overview of all collected species as well as their associated projects, samples, variants and associations (if available), which together are organized in a table and linked to internal and external resources (Figure 2A). As of 7 September 2020, there are a total of 41 species, including 13 animals, 25 plants and 3 viruses. The newly updated species include three animals (cat, horse and tarpan), 10 cultivated plants (carrot, cassava, common bean, cotton, cucumber, date palm, grape, apricot, rapeseed and wheat), five traditional Chinese medicine plants (Catharanthus roseus, Cannabis, Ganoderma, Jatropha and Salvia miltiorrhiza) and three coronaviruses (SARS-CoV-2, SARS-CoV and MERS-CoV). Since the outbreak of the severe respiratory disease COVID-19 in late December 2019, SARS-CoV-2 has spread rapidly, causing a global pandemic. To help worldwide researchers better understand the genome variation and transmission of SARS-CoV-2 (29), we analyzed genome sequences of SARS-CoV-2 as well as two close relatives (SARS-CoV and MERS-CoV) and made their genomic variants publicly available for the global research community through 2019nCoVR (19). As of 7 September 2020, there are a total of 16 934 variants identified from 52 466 high-quality SARS-CoV-2 assemblies as well as 477 and 1742 variants from 105 SARS-CoV and 248 MERS-CoV assemblies, respectively.

In the Project and Sample modules, we compiled the metadata of whole-genome resequencing projects (Figure 2B) and samples (Figure 2C), respectively. The Project module displays an overview of all resequencing projects, involving sequenced sample size, sampling material, sequencing technology, data type and average sequencing coverage.
In addition, bibliographic details (e.g. title, year, journal, PubMed ID) and a short description of each publication are summarized, helping users quickly grasp the outline of the sequencing project(s) of interest. Likewise, the Sample module provides a detailed description of sequenced samples, including sample name, cultivar/breeder, geographic information (from which sequenced samples were collected), etc. A unique accession ID was assigned to each sample, and the number of sampled materials for each geographic region was plotted on a world map, providing a worldwide landscape of the distribution of samples for each species and helping researchers evaluate sample representativeness and genetic diversity.

The Variant module (Figure 2D) provides a catalog of genome variations, including SNPs and indels, identified from a diversity of species (see the methods for details). For each variant, a unique identifier was assigned and its related details, including variant coordinate, reference and alternative alleles, minor allele frequency and hyperlinks to external databases (e.g. dbSNP, ClinVar), were provided. To help users prioritize potentially functional SNPs, GVM provides comprehensive annotations for each variant, including consequence type, variant effect, population frequency and phenotype association, and also incorporates functional domain information from UniProt and Pfam. Moreover, with the rapid accumulation of huge amounts of genomic variants, we further calculated the SNP density for each species and found that the number of SNP markers ranged from 1 to 64 per kb, with an average distance of 131 bp between adjacent SNP loci (Figure 1). In short, the SNP-based high-density genetic map for each species is critically important for a wide range of functional studies.
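To make the relationship between the two density figures above concrete, the following sketch converts between SNPs per kb and the mean spacing between adjacent SNPs. It is only a worked piece of arithmetic under the simplifying assumption that SNPs are spread evenly along the genome; the function names and the example SNP count and genome length are hypothetical and are not GVM data.

```python
def snps_per_kb(n_snps: int, genome_length_bp: int) -> float:
    """SNP density as the number of SNPs per 1000 bp of genome."""
    return n_snps / genome_length_bp * 1000


def mean_spacing_bp(density_per_kb: float) -> float:
    """Mean distance in bp between adjacent SNPs for a given density."""
    return 1000 / density_per_kb


def density_from_spacing(spacing_bp: float) -> float:
    """Inverse conversion: SNPs per kb implied by a mean spacing in bp."""
    return 1000 / spacing_bp


if __name__ == "__main__":
    # A mean spacing of 131 bp corresponds to roughly 7.6 SNPs per kb.
    print(round(density_from_spacing(131), 2))
    # The reported range of 1 to 64 SNPs per kb corresponds to mean
    # spacings between 1000 bp and about 15.6 bp.
    print(mean_spacing_bp(1), mean_spacing_bp(64))
    # Hypothetical species: 3 million SNPs in a 400 Mb genome.
    print(round(snps_per_kb(3_000_000, 400_000_000), 2))
```

Under this assumption, the 131 bp average sits well inside the reported 1 to 64 per kb range, closer to its sparse end.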
The Association module (Figure 2E) integrates a total of 78 950 high-quality (P < 0.001) genotype-to-phenotype (G2P) GWAS associations for 12 non-human species that were manually curated from 304 publications. These G2P associations account for 735 traits across seven cultivated plants (cotton, Japanese apricot, maize, rapeseed, rice, sorghum and soybean) and five domesticated animals (chicken, cattle, goat, pig and sheep). More importantly, all associations and traits have been further annotated and organized based on a suite of ontologies (Plant Trait Ontology, Animal Trait Ontology for Livestock, etc.) in GWAS Atlas (8), and these G2P associations are of great significance for genetic research on important traits and for breeding applications.

The Submission module offers online data submission services and accepts multiple data formats including VCF, GVCF and HapMap. It allows variation submission for any species and from any particular genome (e.g. mito-

GVM provides open access to all publicly available variants, which are downloadable in both VCF and FASTA formats at https://bigd.big.ac.cn/gvm/download. The brief VCF file contains genomic position, reference and alternative alleles and variant quality, and the FASTA file provides 50-nt flanking sequences for each variant. In response to user feedback, we have added a detailed VCF file containing the genotype information for all samples, which should be more useful for users conducting in-depth GWAS functional analysis.

GVM, as a public data repository of genomic variants, features comprehensive integration of different types of genome variations for a wide range of species. With the development of high-throughput sequencing technology, GVM is expected to continue to grow rapidly over the coming years. As GVM offers a high-density variation map for each species, these variants are of critical significance for population genetics, evolutionary analysis, association studies and genomic breeding.
Thus, future development will focus on generating different reference SNP panels, including hapmapSNPs after data filtration and genotype imputation, tagSNPs after removing linkage disequilibrium-based redundant SNPs, fixedSNPs selected from genes exhibiting selective sweep signatures and barcodeSNPs selected from DNA fingerprinting simulation. In fact, this approach has been implemented in the 3000 Rice Genome Project (30) and SNP Ready for Rice (SR4R, http://sr4r.ic4r.org/) (6). Additionally, these SNP datasets will be readily applicable to the optimal design of low-density (LD), medium-density (MD) or high-density (HD) SNP chips, which would help to develop a rapid, accurate and efficient method for genotyping several hundred or thousand polymorphisms in large numbers of individuals. Furthermore, ongoing efforts will also include optimization of curation models and processes, integration of more variation datasets, enhancement of genomic variant annotation, and improvement of web interfaces and data analysis pipelines.
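As an illustration of how a tagSNP panel might be derived by removing linkage disequilibrium-based redundancy, the sketch below applies a simple greedy pruning on pairwise r². It is a generic textbook-style procedure and not the method used by GVM or SR4R; the 0.8 threshold, the 0/1/2 genotype coding and the toy genotype matrix are all assumptions. Production tools usually restrict comparisons to a sliding window along each chromosome, which this sketch omits.

```python
from math import fsum


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = fsum(x) / n, fsum(y) / n
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic site: correlation is undefined, treated here as 0.
        # Such sites are best filtered out before pruning.
        return 0.0
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_tag_snps(genotypes, r2_threshold=0.8):
    """Return indices of SNPs kept after greedy LD pruning.

    genotypes: one list per SNP, each holding one 0/1/2 value per sample.
    A SNP is kept only if its r^2 with every SNP kept so far is below
    the threshold, so the first SNP of each correlated group survives.
    """
    kept = []
    for i, snp in enumerate(genotypes):
        if all(r_squared(snp, genotypes[j]) < r2_threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    toy = [
        [0, 1, 2, 0, 1, 2],  # SNP 0
        [0, 1, 2, 0, 1, 2],  # SNP 1: identical to SNP 0, so redundant
        [0, 0, 1, 2, 2, 1],  # SNP 2: uncorrelated with SNP 0
    ]
    print(greedy_tag_snps(toy))  # prints [0, 2]
```

The greedy pass is order-dependent, so sorting SNPs by genomic position or by minor allele frequency beforehand changes which representative of each correlated group is retained.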
Database resources of the national genomics data center in 2020
Genome Variation Map: a data repository of genome variations in BIG Data Center
Inferring the population history of Tai-Kadai-speaking people and southernmost Han Chinese on Hainan Island by genome-wide array genotyping
Mapping regulatory variants controlling gene expression in drought response and tolerance in maize
IC4R-2.0: Rice genome reannotation using massive RNA-seq data
SR4R: An integrative SNP resource for genomic breeding and population research in rice
MBKbase for rice: an integrated omics knowledgebase for molecular breeding in rice
GWAS Atlas: a curated resource of genome-wide variant-trait associations in plants and animals
Animal-ImputeDB: a comprehensive database with multiple animal reference panels for genotype imputation
An intercross population study reveals genes associated with body size and plumage color in ducks
Genomic variation in Pekin duck populations developed in three different countries as revealed by whole-genome data
Pan-Genome of wild and cultivated soybeans
dbSNP: the NCBI database of genetic variation
The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019
NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy
The genome portal of the Department of Energy Joint Genome Institute: 2014 updates
MaizeDB - a functional genomics perspective
The 2019 novel coronavirus resource
The Sequence Read Archive: explosive growth of sequencing data
GSA: Genome Sequence Archive
Fast and accurate long-read alignment with Burrows-Wheeler transform
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data
The Ensembl variant effect predictor
The Gene Ontology Resource: 20 years and still GOing strong
UniProt: the universal protein knowledgebase
The Pfam protein families database in 2019
The Elements of Data Sharing
Genomic variation in 3,010 diverse accessions of Asian cultivated rice

We thank our colleagues, students, and a number of users for reporting bugs and sending comments.

Supplementary Data are available at NAR Online.