bioRxiv preprint doi: https://doi.org/10.1101/2021.02.10.430367; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Genome Warehouse: A Public Repository Housing Genome-scale Data

Meili Chen1,2,#, Yingke Ma1,2,#, Song Wu1,2,3, Xinchang Zheng1,2, Hongen Kang1,2,3, Jian Sang1,2,3,†, Xingjian Xu1,2,3,††, Lili Hao1,2, Zhaohua Li1,2,3, Zheng Gong1,2,3, Jingfa Xiao1,2,3, Zhang Zhang1,2,3, Wenming Zhao1,2,3, Yiming Bao1,2,3,*

1 National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, Beijing 100101, China
2 CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
3 University of Chinese Academy of Sciences, Beijing 100049, China

# Equal contribution.
* Corresponding author.
E-mail: baoym@big.ac.cn (Bao Y).
† Current address: Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
†† Current address: College of Computer Science Technology, Inner Mongolia Normal University, Hohhot, Inner Mongolia 010010, China

Running title: Chen M et al / Genome Assembly Data Repository

Total letter counts (Title): 63
Total letter counts (Running title): 46
Total word counts (Abstract): 193
Total keywords: 5
Total word counts (from “Introduction” to “Conclusions” or “Materials and methods”): 1799
Total figures: 3
Total tables: 1
Total supplementary figures: 0
Total supplementary tables: 0
Total supplementary files: 0

Abstract

The Genome Warehouse (GWH) is a public repository that houses genome assembly data for a wide range of species and delivers a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB, https://bigd.big.ac.cn/), GWH accepts both full and partial (chloroplast, mitochondrion, and plasmid) genome sequences at different assembly levels, as well as updates to existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata, including biological project, sample, and genome assembly information, in addition to the genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. In addition to basic browse and search functions, all released genome sequences and annotations can be visualized with JBrowse. As of December 2020, GWH had received 17,264 direct submissions covering 949 species and had released 3,370 of them.
Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://bigd.big.ac.cn/gwh/.

KEYWORDS: Genome submission; Genome sequence; Genome annotation; Genome warehouse; Quality control

Introduction

Genome sequences and annotations are fundamental to a wide range of genome-related studies, including analyses of various omics data such as genome [1], transcriptome [2], epigenome [3,4], and genome variation [5,6] data. China, as one of the most biodiverse countries in the world, harbors more than 10% of the world’s known species [7]. In the past decades, the genomes of a large number of featured and important animals and crops in China have been sequenced and assembled [1, 8–11], and most of these assemblies were submitted to International Nucleotide Sequence Database Collaboration (INSDC) members (National Center for Biotechnology Information (NCBI), European Bioinformatics Institute (EBI), and DNA Data Bank of Japan (DDBJ)) [12]. With the rapid growth of genome assembly data, researchers in China, for example, have been hindered from efficiently submitting their data to INSDC members by large genome data sizes, slow data transfer rates caused by limited international network bandwidth, and language barriers in communicating technical issues. Together, these obstacles call for a centralized genomic data repository within China to complement the INSDC.
Here, we report the Genome Warehouse (GWH, https://bigd.big.ac.cn/gwh/), a centralized resource that houses genome assembly data and delivers a series of genome data services. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB, https://bigd.big.ac.cn/) [13], GWH aims to accept data submissions worldwide and to provide an important resource for genome data quality control, archiving, rapid release, and public sharing (e.g., with INSDC) in support of research activities all over the world. To date, GWH has received a total of 12,366 genome submissions (including 14 international submissions), demonstrating its increasingly important role in global genome data management and sharing.

Data model

Designed for compatibility with the INSDC data model, each genome assembly in GWH is linked to a BioProject (https://bigd.big.ac.cn/bioproject) and a BioSample (https://bigd.big.ac.cn/biosample), which are two fundamental resources for metadata description in CNCB-NGDC. Full or partial (chloroplast, mitochondrion, and plasmid) genome assemblies at different assembly levels (complete, draft in chromosome, scaffold, and contig) are all acceptable, and existing genome assemblies can be updated.
Accession numbers are assigned according to the following rules (Figure 1):
(1) each genome assembly has an accession number prefixed with "GWH", followed by four capital letters and eight zeros (e.g., GWHAAAA00000000);
(2) genome sequences have the same accession number format as their corresponding genome assembly, except that the eight digits start from 00000001 and increase in order (e.g., GWHAAAA00000001);
(3) genes follow a pattern similar to that of genome sequences, with the letter “G” added between the GWH prefix and the four capital letters and with six digits at the end instead of eight (e.g., GWHGAAAA000001);
(4) transcripts use the letter “T” in place of the “G” in gene accession numbers (e.g., GWHTAAAA000001);
(5) proteins use the letter “P” in place of the “G” in gene accession numbers (e.g., GWHPAAAA000001);
(6) if the submission is an update of an existing submission in GWH, a dot and an incremental number are appended to represent the version (e.g., GWHAAAA00000000.1).

Database components

GWH is a centralized resource housing genome-scale data, with the purpose of archiving high-quality genome sequences and annotation information. GWH is equipped with a series of web services for genome data submission, release, and sharing, and accordingly involves three major components: data submission, quality control, and archive and release (Figure 2).

Data submission

GWH accepts genome assembly data through an online submission system and also allows offline batch submissions. Users need to register first and then provide a complete description of the submitted genome sequences. Biological project and sample information should be provided (through BioProject and BioSample, respectively) together with the genome assembly sequence, annotation, and associated metadata.
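Returning to accession rules (1) to (6) under "Data model", the formats can be illustrated with a small parser. This is a minimal sketch rather than GWH's own code: the function name `parse_gwh_accession` and the regular expressions are assumptions made for illustration, and the sketch further assumes that the version suffix of rule (6) may follow any accession type, although the text shows it only on an assembly accession.

```python
import re

# Hypothetical patterns derived from rules (1)-(6); not an official GWH validator.
_ASSEMBLY_OR_SEQUENCE = re.compile(r"^GWH([A-Z]{4})(\d{8})(?:\.(\d+))?$")
_FEATURE = re.compile(r"^GWH([GTP])([A-Z]{4})(\d{6})(?:\.(\d+))?$")
_FEATURE_KINDS = {"G": "gene", "T": "transcript", "P": "protein"}


def parse_gwh_accession(accession):
    """Classify an accession string; return None if it matches no rule."""
    m = _ASSEMBLY_OR_SEQUENCE.match(accession)
    if m:
        letters, digits, version = m.groups()
        # Rule (1): eight zeros mark the assembly; rule (2): 00000001 onward are sequences.
        kind = "assembly" if digits == "00000000" else "sequence"
    else:
        m = _FEATURE.match(accession)
        if not m:
            return None
        tag, letters, digits, version = m.groups()
        kind = _FEATURE_KINDS[tag]  # rules (3)-(5)
    return {
        "kind": kind,
        "letters": letters,
        "serial": int(digits),
        "version": int(version) if version else None,  # rule (6)
    }


if __name__ == "__main__":
    for acc in ("GWHAAAA00000000", "GWHAAAA00000001", "GWHGAAAA000001",
                "GWHTAAAA000001", "GWHPAAAA000001", "GWHAAAA00000000.1"):
        print(acc, parse_gwh_accession(acc))
```

Because assembly and sequence accessions carry four letters and eight digits while gene, transcript, and protein accessions carry five letters and six digits, the two patterns cannot match the same string.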
Metadata mainly consist of information about the submitter, general assembly, file(s), sequence assignment, and publication (if available). After submission, GWH runs an automated quality control pipeline to check the validity and consistency of the submitted genome sequence and genome annotation files. Accession numbers are assigned to assemblies and sequences once they pass quality control. Updated assembly data can also be submitted to GWH. Note that, consistent with the practice of INSDC members (e.g., NCBI GenBank), submitters are responsible for ensuring data quality, completeness, and consistency; GWH does not warrant or assume any legal liability or responsibility for data accuracy.

Quality control

After metadata and file(s) are received, GWH automatically runs standardized quality control (QC) to check for 45 types of errors in the submitted genome sequences and annotations and, if needed, to scan for contaminated genome sequences (see details at https://bigd.big.ac.cn/gwh/documents) (Figure 2). The procedure falls roughly into 5 QC steps:
(1) The component checks the consistency of the file(s) according to filename and md5 code.
(2) For genome sequences, the component checks the validity of the genome sequence ID and sequence content, e.g., unique sequence ID, sequence composition (A/T/C/G or degenerate base), and sequence length (≥ 200 bp).
(3) For genome annotations, the component checks gene structure completeness and consistency, e.g., unique ID, exon/CDS/UTR coordinates falling within the corresponding gene coordinates, strand consistency for all features (including gene/transcript/exon/CDS/UTR), and codon validity (e.g., valid start/stop codon, no internal stop codon).
(4) It checks the internal consistency of the genome sequence and annotation, e.g., each sequence ID in the genome annotation must match a genome sequence ID, and each feature coordinate must fall within the range of the corresponding genome sequence.
(5) Genome sequences are also scanned for vectors, adaptors, primers, and indices (collected from the UniVec database, ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/) using NCBI’s VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/).
If an error is found, a report is automatically sent to the submitter by email. To complete a submission, the submitter needs to fix all errors and resubmit the files until they pass the QC process.

Archive and release

Once a submitted genome assembly passes quality control, GWH assigns it a unique accession number, allots accession numbers to each genome sequence, gene, transcript, and protein, and generates and backs up downloadable files of the genome sequence and annotation in FASTA, GFF3, and TSV formats. Data generation is performed with in-house scripts based on the submitted genome sequence and annotation files.
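Some of the checks in QC steps (2) and (4) under "Quality control" can be made concrete with a short sketch. It covers only a handful of the 45 error types and is not the GWH pipeline itself. Several details are assumptions made for illustration: the function names, the in-memory inputs (rather than FASTA/GFF3 parsing), the use of the IUPAC nucleotide codes as the set of allowed degenerate bases, and 1-based inclusive feature coordinates as in GFF3.

```python
# Illustrative subset of QC steps (2) and (4); not the GWH pipeline itself.
IUPAC_BASES = set("ACGTRYSWKMBDHVN")  # assumption: IUPAC nucleotide codes
MIN_LENGTH = 200  # minimum sequence length in bp, from step (2)


def check_sequences(records):
    """records: iterable of (seq_id, sequence). Returns a list of error strings."""
    errors, seen = [], set()
    for seq_id, seq in records:
        if seq_id in seen:
            errors.append(f"{seq_id}: duplicate sequence ID")
        seen.add(seq_id)
        invalid = set(seq.upper()) - IUPAC_BASES
        if invalid:
            errors.append(f"{seq_id}: invalid characters {sorted(invalid)}")
        if len(seq) < MIN_LENGTH:
            errors.append(f"{seq_id}: length {len(seq)} bp is below {MIN_LENGTH} bp")
    return errors


def check_features(features, seq_lengths):
    """features: iterable of (seq_id, start, end), 1-based inclusive.
    seq_lengths: dict mapping each genome sequence ID to its length."""
    errors = []
    for seq_id, start, end in features:
        if seq_id not in seq_lengths:
            errors.append(f"{seq_id}: annotation refers to an unknown sequence ID")
        elif not (1 <= start <= end <= seq_lengths[seq_id]):
            errors.append(f"{seq_id}: feature {start}-{end} is outside the sequence range")
    return errors


if __name__ == "__main__":
    seqs = [("chr1", "ACGT" * 60), ("chr1", "ACGX" * 60), ("scaffold2", "ACGN")]
    for message in check_sequences(seqs):
        print(message)
    for message in check_features([("chr1", 230, 250), ("chr9", 1, 10)], {"chr1": 240}):
        print(message)
```

Collecting all errors in a list, instead of stopping at the first one, mirrors the report-and-resubmit cycle described above, in which the submitter receives a report and fixes every error before resubmitting.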
To ensure the security of submitted data, a backup copy is stored on a physically separate disk. GWH releases sequence data on a user-specified date, unless a paper citing the sequence or accession number is published before the specified release date, in which case the sequence is released immediately. For released data, GWH generates web pages containing two primary tables: genome and assembly. The former shows species taxonomy information and genome assemblies, and the latter contains general information about the assembly (including external links to other related resources) and statistics of the genome assembly and its corresponding annotation. All released data are publicly available at the GWH FTP site (ftp://download.big.ac.cn/gwh/). GWH provides data visualization for both genome sequences and genome annotations using JBrowse [14]. It also offers statistics and charts covering total holdings, assembly levels, genome representations, citing articles, submitting organizations, sequencing platforms, assembly methods, and downloads. GWH provides user-friendly web interfaces for browsing and querying data using BIG Search [13] to help users find any released data of interest. For a released genome assembly, GWH also provides machine-readable APIs (Application Programming Interfaces) for publicly sharing and automatically obtaining information on its associated BioProject, BioSample, genome, and assembly metadata and file paths.
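The release rule stated above (release on the user-specified date unless a citing paper is published earlier) can be written as a small function. This is a sketch under stated assumptions: the function name `effective_release_date` is hypothetical, and "released immediately" is interpreted here as release on the publication date, which the text does not specify.

```python
from datetime import date


def effective_release_date(requested, publication=None):
    """Return the date on which a submission would become public.

    requested:   release date chosen by the submitter.
    publication: date of a paper citing the sequence or accession, if any.
    Assumption: early publication triggers release on the publication date.
    """
    if publication is not None and publication < requested:
        return publication
    return requested


if __name__ == "__main__":
    print(effective_release_date(date(2021, 6, 1)))
    print(effective_release_date(date(2021, 6, 1), date(2021, 3, 15)))
```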

Global sharing of SARS-CoV-2 and coronavirus genomes

During the COVID-19 outbreak, GWH, in support of the 2019 Novel Coronavirus Resource (2019nCoVR) [15, 16], has received worldwide submissions of more than a thousand SARS-CoV-2 genome assemblies with standardized genome annotations [17] and has released 134 of them. To broaden the international reach of the data, 62 of the released sequences have been shared, with the submitters’ permission, in GenBank [18] through a data exchange mechanism established with NCBI. In this model, GWH accessions are represented as secondary accessions in NCBI GenBank records, which are retrievable through the NCBI Entrez system. This model sets a good example for data sharing among different data centers.

In addition, GWH offers sequences of the Coronaviridae family so that researchers can access the data conveniently and study the relationship between SARS-CoV-2 and other coronaviruses. To promote data sharing and make all relevant information on the Coronaviridae readily available, GWH integrates genomic and proteomic sequences, together with their metadata, from NCBI [19], the China National GeneBank Database (CNGBdb) [20], the National Microbiology Data Center (NMDC) [21], and CNCB-NGDC. Duplicated records from different sources are identified and removed to obtain a non-redundant dataset. As of December 31, 2020, the dataset contained 83,095 nucleotide and 575,438 protein sequences of the Coronaviridae. Filters are provided to narrow down the required Coronaviridae sequences using multiple conditions, including country/region, host, isolation source, length, and collection date. Both the metadata and the sequences of the filtered results can be selected and downloaded as a separate file.
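The de-duplication and filtering described above can be sketched as follows. The text does not state how duplicated records are identified, so this example assumes, purely for illustration, that two records are duplicates when their sequences are identical. The record field names (`sequence`, `country`, `host`, `collection_date`) and both function names are likewise hypothetical, and only a subset of the listed filter conditions is shown.

```python
import hashlib
from datetime import date


def deduplicate(records):
    """Keep the first record seen for each distinct sequence (assumed criterion)."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(rec["sequence"].upper().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


def filter_records(records, country=None, host=None, min_length=None, collected_from=None):
    """Return the records that satisfy every condition that is not None."""
    selected = []
    for rec in records:
        if country is not None and rec.get("country") != country:
            continue
        if host is not None and rec.get("host") != host:
            continue
        if min_length is not None and len(rec["sequence"]) < min_length:
            continue
        collected = rec.get("collection_date")
        if collected_from is not None and (collected is None or collected < collected_from):
            continue
        selected.append(rec)
    return selected


if __name__ == "__main__":
    records = [
        {"source": "NCBI", "sequence": "ACGTACGT", "country": "China",
         "host": "Homo sapiens", "collection_date": date(2020, 1, 5)},
        {"source": "CNGBdb", "sequence": "acgtacgt", "country": "China",
         "host": "Homo sapiens", "collection_date": date(2020, 1, 5)},
        {"source": "NMDC", "sequence": "ACGTTT", "country": "Italy",
         "host": "Homo sapiens", "collection_date": date(2020, 3, 1)},
    ]
    unique = deduplicate(records)
    print(len(unique))
    print(filter_records(unique, country="China", min_length=8))
```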
Both the daily updated sequences and the full set of sequences can also be downloaded from the FTP site (ftp://download.big.ac.cn/Genome/Viruses/Coronaviridae/).

Data statistics

As of December 2020, GWH had received 17,264 direct submissions covering a broad diversity of species (Table 1) at different assembly levels (Figure 3). These genome assemblies link to 301 BioProjects and 16,538 BioSamples and were submitted by 231 submitters from 61 institutions (including 5 international submitters from 2 countries). A total of 3,370 submissions have been released, and these were reported in 83 articles from 44 journals. GWH has received over 135,000 visits from 153 countries/regions, with ~891,000 downloads. The amount of data and the numbers of visits and downloads have increased dramatically over the past years, showing the utility of GWH in genome-scale data management.

Summary and future directions

Collectively, GWH is a user-friendly portal for genome data submission, release, and sharing, supported by a matched series of services. The rapid growth of genome assembly submissions demonstrates the great potential of GWH as an important resource for accelerating genomic research worldwide.
With the aim of fully realizing the findability, accessibility, interoperability, and reusability (FAIR) of genome data [22], GWH has made ongoing efforts, including but not limited to improvement of the web interfaces for data submission, presentation, and visualization, continuous integration of newly sequenced genomes, and development of useful online tools to help users analyse genome data (such as BLAST [23]). Going forward, we will devote more effort to providing genome annotation services, especially for bacterial and archaeal genomes, because uniform, standardized annotation determines the accuracy of downstream data analysis. In addition, we will extend the Coronaviridae dataset to other important pathogens to improve the capacity for public health emergency response. Finally, we plan to share and exchange all public genome assembly data with the INSDC members to provide comprehensive data for researchers globally.

CRediT author statement

Meili Chen: Methodology, Software, Investigation, Data Curation, Writing - Original Draft, Project administration. Yingke Ma: Software, Writing - Original Draft. Song Wu: Software, Data Curation. Xinchang Zheng: Data Curation. Hongen Kang: Software. Jian Sang: Investigation, Data Curation. Xingjian Xu: Software. Lili Hao: Investigation. Zhaohua Li: Data Curation. Zheng Gong: Data Curation. Jingfa Xiao: Writing - Review & Editing. Zhang Zhang: Writing - Review & Editing.
Wenming Zhao: Writing - Review & Editing. Yiming Bao: Conceptualization, Writing - Review & Editing, Supervision.

Competing interests

The authors have declared no competing interests.

Acknowledgments

We thank Profs. Jingchu Luo and Weimin Zhu for their helpful suggestions and a number of users for reporting bugs and sending comments. We also thank the NCBI GenBank group, especially Ilene Mizrachi, Karen Clark, Mark Cavanaugh, and Linda Yankie, for their valuable advice on sequence contamination scanning and SARS-CoV-2 sequence exchange. This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences [XDB38060100 and XDB38030200 to YB; XDB38050300 to WZ; XDB38030400 to JX; XDA19050302 to ZZ]; National Key Research and Development Program of China [2016YFE0206600 to YB; 2020YFC0847000, 2018YFD1000505, 2017YFC1201202, and 2016YFC0901603 to WZ; 2017YFC0907502 to ZZ]; The 13th Five-year Informatization Plan of Chinese Academy of Sciences [XXH13505-05 to YB]; Genomics Data Center Construction of Chinese Academy of Sciences [XXH-13514-0202 to YB]; Open Biodiversity and Health Big Data Initiative of IUBS [to YB]; The Professional Association of the Alliance of International Science Organizations [ANSO-PA-2020-07 to YB]; National Natural Science Foundation of China [32030021 and 31871328 to ZZ]; International Partnership Program of the Chinese Academy of Sciences [153F11KYSB20160008 to ZZ].

ORCID

ORCID: 0000-0003-0102-0292 (Chen Meili)
ORCID: 0000-0002-9460-4117 (Ma Yingke)
ORCID: 0000-0002-0923-639X (Wu Song)
ORCID: 0000-0001-5739-861X (Zheng Xinchang)
ORCID: 0000-0002-9581-1329 (Kang Hongen)
ORCID: 0000-0003-4953-3417 (Sang Jian)
ORCID: 0000-0002-4466-3821 (Xu Xingjian)
ORCID: 0000-0003-3432-7151 (Hao Lili)
ORCID: 0000-0002-2673-0103 (Li Zhaohua)
ORCID: 0000-0001-7285-2630 (Gong Zheng)
ORCID: 0000-0002-2835-4340 (Xiao Jingfa)
ORCID: 0000-0001-6603-5060 (Zhang Zhang)
ORCID: 0000-0002-4396-8287 (Zhao Wenming)
ORCID: 0000-0002-9922-9723 (Bao Yiming)

References

[1] Liu Y, Du H, Li P, Shen Y, Peng H, Liu S, et al. Pan-genome of wild and cultivated soybeans. Cell 2020;182:162–76.e13.
[2] Guan Y, Chen M, Ma Y, Du Z, Yuan N, Li Y, et al. Whole-genome and time-course dual RNA-Seq analyses reveal chronic pathogenicity-related gene dynamics in the ginseng rusty root rot pathogen Ilyonectria robusta. Sci Rep 2020;10:1586.
[3] Li R, Liang F, Li M, Zou D, Sun S, Zhao Y, et al. MethBank 3.0: a database of DNA methylomes across a variety of species. Nucleic Acids Res 2018;46:D288–D95.
[4] Xiong Z, Li M, Yang F, Ma Y, Sang J, Li R, et al. EWAS Data Hub: a resource of DNA methylation array data and metadata. Nucleic Acids Res 2020;48:D890–D5.
[5] Song S, Tian D, Li C, Tang B, Dong L, Xiao J, et al. Genome Variation Map: a data repository of genome variations in BIG Data Center. Nucleic Acids Res 2018;46:D944–D9.
[6] Tang B, Zhou Q, Dong L, Li W, Zhang X, Lan L, et al. iDog: an integrated resource for domestic dogs and wild canids. Nucleic Acids Res 2019;47:D793–D800.
[7] McBeath J, McBeath JH. Biodiversity conservation in China: policies and practice. Journal of International Wildlife Law & Policy 2006;9:293–317.
[8] Fan H, Wu Q, Wei F, Yang F, Ng BL, Hu Y. Chromosome-level genome assembly for giant panda provides novel insights into Carnivora chromosome evolution. Genome Biol 2019;20:267.
[9] Xia Q, Zhou Z, Lu C, Cheng D, Dai F, Li B, et al. A draft sequence for the genome of the domesticated silkworm (Bombyx mori). Science 2004;306:1937–40.
[10] Lin T, Xu X, Ruan J, Liu SZ, Wu SG, Shao XJ, et al. Genome analysis of Taraxacum kok-saghyz Rodin provides new insights into rubber biosynthesis. Natl Sci Rev 2018;5:78–87.
[11] Li C, Song W, Luo Y, Gao S, Zhang R, Shi Z, et al. The HuangZaoSi maize genome provides insights into genomic variation and improvement history of maize. Mol Plant 2019;12:402–9.
[12] Arita M, Karsch-Mizrachi I, Cochrane G. The international nucleotide sequence database collaboration. Nucleic Acids Res 2021;49:D121–D4.
[13] CNCB-NGDC Members and Partners. Database resources of the National Genomics Data Center, China National Center for Bioinformation in 2021. Nucleic Acids Res 2021;49:D18–D28.
[14] Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G, et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol 2016;17:66.
[15] Zhao WM, Song SH, Chen ML, Zou D, Ma LN, Ma YK, et al. The 2019 novel coronavirus resource. Yi Chuan 2020;42:212–21.
[16] Song S, Ma L, Zou D, Tian D, Li C, Zhu J, et al.
The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR. Genomics Proteomics Bioinformatics 2020. [DOI: https://doi.org/10.1016/j.gpb.2020.09.001]
[17] Shean RC, Makhsous N, Stoddard GD, Lin MJ, Greninger AL. VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank. BMC Bioinformatics 2019;20:48.
[18] Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch-Mizrachi I. GenBank. Nucleic Acids Res 2020;48:D84–D6.
[19] Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2021;49:D10–D7.
[20] Chen FZ, You LJ, Yang F, Wang LN, Guo XQ, Gao F, et al. CNGBdb: China National GeneBank DataBase. Yi Chuan 2020;42:799–809.
[21] Wu L, Sun Q, Desmeth P, Sugawara H, Xu Z, McCluskey K, et al. World data centre for microorganisms: an information infrastructure to explore and utilize preserved microbial strains worldwide. Nucleic Acids Res 2017;45:D611–D8.
[22] Zhang Z, Song S, Yu J, Zhao W, Xiao J, Bao Y. The elements of data sharing. Genomics Proteomics Bioinformatics 2020;18:1–4.
[23] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.

Figure legends

Figure 1 Data model in GWH
A genome assembly accession number is prefixed with "GWH", followed by four capital letters (represented by XXXX) and eight zeros. For genome sequence accessions, the eight digits increase in order. For gene, transcript, and protein sequence accessions, the letters G, T, and P, respectively, follow the GWH prefix, and the six digits at the end increase in order.

Figure 2 Major components in GWH data processing workflow

Figure 3 Statistics of genome assembly in GWH (as of December 31, 2020)

Tables

Table 1 Total data holdings in GWH

Status    Type      Animals        Plants        Fungi       Bacteria      Archaea     Viruses       Metagenomes    Others       Total
Released  Assembly  187 (5.55%)    210 (6.23%)   13 (0.39%)  220 (6.53%)   73 (2.17%)  701 (20.80%)  1957 (58.07%)  9 (0.27%)    3370
Released  Species   72 (19.41%)    139 (37.47%)  12 (3.23%)  106 (28.57%)  11 (2.96%)  19 (5.12%)    3 (0.81%)      9 (2.43%)    371
Unpublic  Assembly  6783 (48.82%)  926 (6.66%)   5 (0.04%)   68 (0.49%)    13 (0.09%)  939 (6.76%)   4702 (33.84%)  458 (3.30%)  13,894
Unpublic  Species   22 (3.67%)     549 (91.50%)  5 (0.83%)   7 (1.17%)     2 (0.33%)   6 (1.00%)     5 (0.83%)      4 (0.67%)    600
Total     Assembly  6970 (40.37%)  1136 (6.58%)  18 (0.10%)  288 (1.67%)   86 (0.50%)  1640 (9.50%)  6659 (38.57%)  467 (2.71%)  17,264
Total     Species   92 (9.69%)     675 (71.13%)  16 (1.69%)  110 (11.59%)  13 (1.37%)  24 (2.53%)    7 (0.74%)      12 (1.26%)   949