key: cord-0326405-trtv08yr authors: Chen, Meili; Ma, Yingke; Wu, Song; Zheng, Xinchang; Kang, Hongen; Sang, Jian; Xu, Xingjian; Hao, Lili; Li, Zhaohua; Gong, Zheng; Xiao, Jingfa; Zhang, Zhang; Zhao, Wenming; Bao, Yiming title: Genome Warehouse: A Public Repository Housing Genome-scale Data date: 2021-02-10 journal: bioRxiv DOI: 10.1101/2021.02.10.430367 sha: cee40d92f73c50917be4ae49e93b80938d170501 doc_id: 326405 cord_uid: trtv08yr

The Genome Warehouse (GWH) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB, https://bigd.big.ac.cn/), GWH accepts both full genome and partial genome (chloroplast, mitochondrion, and plasmid) sequences at different assembly levels, as well as updates of existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata, including biological project, sample, and genome assembly information, in addition to the genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. By December 2020, GWH had received 17,264 direct submissions covering 949 species and had released 3370 of them. Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://bigd.big.ac.cn/gwh/.

Genome assemblies at different assembly levels (complete, draft in chromosome, scaffold, and contig) are all acceptable, and existing genome assemblies are allowed to be updated.
Accession numbers are assigned according to the following rules (Figure 1): (1) each genome assembly has an accession number prefixed with "GWH", followed by four capital letters and eight zeros (e.g., GWHAAAA00000000); (2) genome sequences have the same accession number format as their corresponding genome assembly, except that the eight digits start from 00000001 and increase in order (e.g., GWHAAAA00000001); (3) genes have an accession pattern similar to that of genome sequences, with the letter "G" added between the GWH prefix and the four capital letters, and six digits at the end instead of eight (e.g., GWHGAAAA000001); (4) transcripts use the letter "T" in place of the "G" in gene accession numbers (e.g., GWHTAAAA000001); (5) proteins use the letter "P" in place of the "G" in gene accession numbers (e.g., GWHPAAAA000001); (6) if the submission is an update of an existing submission in GWH, its accession number is given a dot and an incremental number to represent the version (e.g., GWHAAAA00000000.1).

GWH is a centralized resource housing genome-scale data, intended to archive high-quality genome sequences and annotation information. GWH is equipped with a series of web services for genome data submission, release, and sharing, and accordingly involves three major components: data submission, quality control, and archive and release (Figure 2).

GWH not only accepts genome assembly associated data through an on-line submission system but also allows off-line batch submissions. Users must first register and then provide a complete description of the submitted genome sequences. Biological project and sample information should be provided (through BioProject and BioSample, respectively) together with the genome assembly sequence, annotation, and associated metadata.
Metadata mainly consist of information about the submitter, general assembly, file(s), sequence assignment, and publication (if available). After submission, GWH runs an automated quality control pipeline to check the validity and consistency of the submitted genome sequence and genome annotation files. Accession numbers are assigned to assemblies and sequences once they pass quality control. Updated assembly data can also be submitted to GWH. Note that, consistent with the INSDC members (e.g., NCBI GenBank), it is the responsibility of the submitters to ensure data quality, completeness, and consistency; GWH does not warrant or assume any legal liability or responsibility for the accuracy of the data.

(https://www.ncbi.nlm.nih.gov/tools/vecscreen/). If there is an error, a report is automatically sent to the submitter by email. To complete a submission, the submitter needs to fix all errors and resubmit the files until they pass the QC process.

Archive and release

Once a submitted genome assembly passes quality control, GWH assigns it a unique accession number, allots accession numbers to each genome sequence, gene, transcript, and protein, and generates and backs up downloadable files of the genome sequence and annotation in FASTA, GFF3, and TSV formats. These files are generated with in-house scripts from the submitted genome sequence and annotation files. To ensure the security of submitted data, a backup copy is stored on a physically separate disk. GWH releases sequence data on a user-specified date, unless a paper citing the sequence or accession number is published before the specified release date, in which case the sequence is released immediately. For the released data, GWH generates web pages containing two primary tables: genome and assembly.
The former shows species taxonomy information and genome assemblies, and the latter contains general information about the assembly (including external links to other related resources) and statistics of the genome assembly and its corresponding annotation. All released data are publicly available at the GWH FTP site (ftp://download.big.ac.cn/gwh/). GWH provides data visualization for both genome sequence and genome annotation using JBrowse [14]. It offers statistics and charts covering total holdings, assembly levels, genome representations, citing articles, submitting organizations, sequencing platforms, assembly methods, and downloads. GWH provides user-friendly web interfaces for browsing and querying data using BIG Search [13], to help users find any released data of interest. For a released genome assembly, GWH also provides machine-readable APIs (Application Programming Interfaces) for publicly sharing and automatically obtaining information on its associated BioProject, BioSample, genome, and assembly metadata and file paths.

The genome assembly accession number is prefixed with "GWH", followed by four capital letters (represented by XXXX) and eight zeros. For genome sequence accessions, the eight digits increase in order. For gene sequence, transcript sequence, and protein sequence accessions, G, T, and P, respectively, follow the GWH prefix, with six digits at the end that increase in order.
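The accession rules above can be illustrated with a short parser. This is a minimal sketch based only on the patterns and examples stated in the text, not GWH's own code: the names `GwhAccession` and `parse_gwh_accession` are hypothetical, and attaching a version suffix only to assembly accessions is an assumption, since the text gives a versioned example for assemblies only.

```python
import re
from typing import NamedTuple, Optional


class GwhAccession(NamedTuple):
    """Parsed parts of an accession (hypothetical helper type)."""
    kind: str               # "assembly", "sequence", "gene", "transcript", or "protein"
    letters: str            # the four capital letters
    serial: int             # numeric part (0 for an assembly)
    version: Optional[int]  # number after the dot, if present


# GWH + four capital letters + eight digits, optionally ".<version>"
_ASSEMBLY_OR_SEQUENCE = re.compile(r"GWH([A-Z]{4})(\d{8})(?:\.(\d+))?")
# GWH + G/T/P + four capital letters + six digits
_FEATURE = re.compile(r"GWH([GTP])([A-Z]{4})(\d{6})")
_FEATURE_KINDS = {"G": "gene", "T": "transcript", "P": "protein"}


def parse_gwh_accession(text: str) -> Optional[GwhAccession]:
    """Return the parsed accession, or None if the text matches no rule."""
    m = _FEATURE.fullmatch(text)
    if m:
        return GwhAccession(_FEATURE_KINDS[m.group(1)], m.group(2), int(m.group(3)), None)
    m = _ASSEMBLY_OR_SEQUENCE.fullmatch(text)
    if m:
        serial = int(m.group(2))
        version = int(m.group(3)) if m.group(3) else None
        if serial == 0:
            # Eight zeros identify the assembly itself.
            return GwhAccession("assembly", m.group(1), 0, version)
        if version is None:
            # Assumption: only assembly accessions carry a version suffix.
            return GwhAccession("sequence", m.group(1), serial, None)
    return None
```

For example, `parse_gwh_accession("GWHAAAA00000000.1")` is classified as an assembly with version 1, whereas a string with the wrong number of digits is rejected with `None`.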
Table 1

References

Pan-genome of wild and cultivated soybeans
Whole-genome and time-course dual RNA-Seq analyses reveal chronic pathogenicity-related gene dynamics in the ginseng rusty root rot pathogen Ilyonectria robusta
MethBank 3.0: a database of DNA methylomes across a variety of species
EWAS Data Hub: a resource of DNA methylation array data and metadata
Genome Variation Map: a data repository of genome variations in BIG Data Center
iDog: an integrated resource for domestic dogs and wild canids
Biodiversity conservation in China: policies and practice
Chromosome-level genome assembly for giant panda provides novel insights into Carnivora chromosome evolution
A draft sequence for the genome of the domesticated silkworm (Bombyx mori)
Taraxacum kok-saghyz Rodin provides new insights into rubber biosynthesis
The HuangZaoSi maize genome provides insights into genomic variation and improvement history of maize
The international nucleotide sequence database collaboration
Database resources of the National Genomics Data Center, China National Center for Bioinformation in 2021
JBrowse: a dynamic web platform for genome visualization and analysis
The 2019 novel coronavirus resource
The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR
VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank
Database resources of the National Center for Biotechnology Information
CNGBdb: China National GeneBank DataBase
World data centre for microorganisms: an information infrastructure to explore and utilize preserved microbial strains worldwide
The elements of data sharing
a new generation of protein database search programs

The authors have declared no competing interests.

Acknowledgments

We thank Profs.
Jingchu Luo and Weimin Zhu for their helpful suggestions and a number of users for reporting bugs and sending comments. We also thank the NCBI