key: cord-0710974-dr48u7p3 authors: Sayers, Eric W; Cavanaugh, Mark; Clark, Karen; Pruitt, Kim D; Schoch, Conrad L; Sherry, Stephen T; Karsch-Mizrachi, Ilene title: GenBank date: 2021-12-01 journal: Nucleic Acids Res DOI: 10.1093/nar/gkab1135 sha: e0cdc784b313e6ab40e04425e2c2a37375f057ff doc_id: 710974 cord_uid: dr48u7p3

GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 15.3 trillion base pairs from over 2.5 billion nucleotide sequences for 504 000 formally described species. Recent updates include resources for data from the SARS-CoV-2 virus, including a SARS-CoV-2 landing page, NCBI Datasets, NCBI Virus and the Submission Portal. We also discuss upcoming changes to GI identifiers, a new data management interface for BioProject, and advice for providing contextual metadata in submissions.

GenBank (1) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotations built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD, USA. After discussing updates to SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) resources, this paper summarizes the growth of GenBank in the past year and briefly reviews recent updates and developments.

As part of our ongoing response to the COVID-19 pandemic that emerged in early 2020, NCBI continues to update several tools and interfaces to support both submitters and consumers of sequence data for SARS-CoV-2. These include the SARS-CoV-2 landing page, NCBI Datasets, NCBI Virus, and the Submission Portal.

The SARS-CoV-2 landing page (https://www.ncbi.nlm.nih.gov/sars-cov-2/) collects a wide variety of data and resources related to SARS-CoV-2, including all relevant data in GenBank.
Of particular interest to users seeking GenBank data are links to NCBI Datasets and NCBI Virus (see below) along with a link to download the full list of nucleotide accessions for SARS-CoV-2.

NCBI Datasets is an experimental product that allows users to download complex genomic datasets easily using either a web interface, an API or a UNIX/LINUX command-line tool (https://www.ncbi.nlm.nih.gov/datasets/). The specialized coronavirus page released last year now provides genome downloads for almost 430 000 complete SARS-CoV-2 genomes, an annual increase of 29-fold (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes). This page provides downloads of metadata tables for SARS-CoV-2 genomes as well as complete genomic datasets. Users interested in SARS-CoV-2 proteins can access these data on a separate specialized page (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/proteins/). Finally, NCBI Datasets also includes a new genome interface that supports taxonomic searches and selection based on the taxonomic tree. This page may be of interest to users seeking data for other coronaviruses not included on the specialized SARS-CoV-2 pages.

The NCBI Virus resource contains a SARS-CoV-2 Hub (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/sarscov-2) that organizes an extensive set of data and visualizations for SARS-CoV-2 (Figure 1). Users can subset these data by collection date and release date, and these filters update the map display. A 'Tabular View' option loads an interactive table listing all SARS-CoV-2 sequences with 21 filters, including sequence length, collection date, and geographic region. Conveniently, any filters set on the Dashboard transfer to the table, allowing easy exploration of the data. Users can also launch alignments and build phylogenetic trees from this table.

NCBI continues to update a customized submission portal for both assembled and unassembled SARS-CoV-2 sequences (https://submit.ncbi.nlm.nih.gov/sarscov2/).
On average, this portal returns accessions to submitters in 1-2 h, and assembled sequences will be annotated with VADR (2). We encourage submitters to use these portals, as this ensures that sequence data are made available not only through the INSDC databases but also through the NCBI Virus resource (3), RefSeq (4), and BLAST (5). We also encourage submitters to submit both reads and traditional GenBank sequences, and to submit data to BioProject and BioSample. We are actively updating all of these resources to support novel variants and to update the content of the SARS-CoV-2 pages discussed above.

GenBank sequences are organized into 21 divisions, each of which is represented by a three-letter abbreviation (Table 1). As shown in Table 1, especially large increases occurred in the VRL, UNA and INV divisions. Not surprisingly, the large increase in the VRL division resulted from the many submissions of SARS-CoV-2 sequences (Figure 2).

As previously discussed (1), improving sequencing technologies are now capable of producing very long sequences, some of which are longer than a signed 32-bit integer can represent (about 2.1 Gbp). In such cases, submitters must split the records in order to submit them to GenBank. A recent example is chromosome 1 from the West African lungfish, Protopterus annectens. The total length of this chromosome is 5.26 Gbp, and so in GenBank it is represented by three records: CM033073 (2.00 Gbp), CM033074 (2.00 Gbp) and CM033075 (1.26 Gbp). We encourage GenBank users and developers of products that rely on GenBank data to be aware of the implications involved in representing very long sequences and to consider preparing their own tools for sequence lengths and feature locations that will require 64-bit integers.
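As a rough illustration of the arithmetic behind this recommendation, the following Python sketch checks whether a length exceeds the signed 32-bit range, plans a split into consecutive records, and maps a chromosome coordinate onto one of those records. The function names (`needs_64_bit`, `plan_split`, `locate`), the 2 000 000 000 bp record size, and the rounded 5.26 Gbp length are assumptions chosen to mirror the lungfish example; they are not GenBank's submission rules or the exact record lengths.

```python
INT32_MAX = 2**31 - 1  # 2 147 483 647, the largest signed 32-bit integer


def needs_64_bit(value: int) -> bool:
    """Return True if value cannot be stored in a signed 32-bit integer."""
    return value > INT32_MAX


def plan_split(total_length: int, max_record_length: int) -> list[int]:
    """Split total_length into consecutive record lengths, each at most
    max_record_length (hypothetical helper, not a GenBank rule)."""
    if total_length <= 0 or max_record_length <= 0:
        raise ValueError("lengths must be positive")
    full, rest = divmod(total_length, max_record_length)
    return [max_record_length] * full + ([rest] if rest else [])


def locate(position: int, record_lengths: list[int]) -> tuple[int, int]:
    """Map a 1-based chromosome position to
    (0-based record index, 1-based position within that record)."""
    if position < 1:
        raise ValueError("position is 1-based")
    for index, length in enumerate(record_lengths):
        if position <= length:
            return index, position
        position -= length
    raise ValueError("position lies beyond the end of the sequence")


# Rounded length of the lungfish chromosome 1 example (about 5.26 Gbp).
chromosome_length = 5_260_000_000
# Assumed record size of 2.00 Gbp, mirroring the three-record example.
records = plan_split(chromosome_length, 2_000_000_000)
```

In a language with fixed-width integers, the same check would decide whether lengths and feature locations need a 64-bit type; Python integers are arbitrary precision, so the sketch only makes the boundary explicit.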
In addition to the above issue of handling very long individual sequences, GenBank is approaching a point where the number of sequences will exhaust the space of GI identifiers provided by 32-bit integers. To mitigate this, we are taking multiple approaches. First, we continue to recommend that users shift to using accession.version identifiers to refer to all GenBank data (6). Most external NCBI interfaces, including the Entrez web interface and the E-utilities API, now accept and return accession.version identifiers for all sequences. Second, we are transitioning our internal software to use 64-bit integers for GI identifiers. Once the transition occurs, GenBank users will encounter these identifiers in the XML and ASN.1 presentations of GenBank data provided through the Entrez web interface and in GenBank FTP products (https://ncbiinsights.ncbi.nlm.nih.gov/2021/09/02/64-bit-gis/). We encourage developers who rely on GenBank data to ensure that their software is capable of handling these 64-bit identifiers. Such identifiers are easy to recognize, as they are any integer greater than 2 147 483 647.

When submitters register sequencing projects in the BioProject database (https://www.ncbi.nlm.nih.gov/bioproject), we can create reliable linkages between such sequencing projects and the data they produce, and in many cases to the BioSample database (7) that provides additional information about the biological materials used in the study. In many cases, submitters create BioProject records before they have collected all relevant data and published the results of the study. We have now made it easier for submitters to update their BioProject records with such information by offering a 'Manage Data' interface in the Submission portal (https://dataview.ncbi.nlm.nih.gov/?archive=bioproject). Using this interface, submitters can add publications and grants or edit text metadata such as the BioProject title and description.
We hope this will allow BioProject to better reflect the current state of these projects and provide a better service to the community.

Contextual metadata. As discussed previously (1), we continue to encourage submitters to provide contextual metadata, particularly data that specify the sampling location (e.g. country, latitude, and longitude). The importance of such basic geographic information, such as country codes displayed on public sequence records (https://insdc.org/country), has only grown with the urgency to verify and track distribution of biodiversity in the current era. Including other data such as the isolate name or number and applicable museum/collection identifiers is also helpful. Where possible, adding links to permanent samples or vouchers at biorepositories provides access to sources with important, richly populated information. This facilitates replication and validation, while also allowing for analyses across scientific disciplines (8). GenBank has long followed the standard of structuring vouchers using Darwin Core formats (9), which allows us to link to specimen pages at external biorepositories using URLs curated in the NCBI BioCollections database (10). Recently, BioCollections added a new category, 'digital repository', that will include online aggregators of collection data that do not hold physical specimens. Additionally, to prepare the way for a more comprehensive treatment of these data elements and to make their presence required as part of the submission process, GenBank and the INSDC developed a set of standardized terms to clearly indicate when submitters cannot provide voucher information, for example in cases where the data were not collected or cannot be reported because of privacy concerns (https://www.insdc.org/missing-value-reporting).

In addition to the above, there are other ways in which submitters can enhance their data.
Submitters can use evidence tags to provide information about supporting evidence for annotations (https://www.ncbi.nlm.nih.gov/genbank/evidence/). They can cite within their submission the accession numbers of any publicly available sequencing reads they used to improve the quality of their assemblies. When submitting prokaryotic genomes, they can create annotated genomes with NCBI's Prokaryotic Genome Annotation Pipeline (PGAP; https://www.ncbi.nlm.nih.gov/genome/annotation_prok/) either by submitting FASTA files and requesting PGAP during submission of the genomes to GenBank or by running the public version of PGAP themselves and then submitting the GenBank-ready ASN.1 output file.

D164 Nucleic Acids Research, 2022, Vol. 50, Database issue

Acquiring the database. NCBI provides GenBank sequence records in both the traditional flat file format and in a structured ASN.1 format by anonymous FTP at ftp.ncbi.nlm.nih.gov/genbank. For release 245 (15 August 2021) there are 4032 files requiring 1888 GB of uncompressed disk storage. In addition, daily GenBank incremental update files containing new records and those updated since the most recent release are available in flat file format at ftp.ncbi.nlm.nih.gov/genbank/daily-nc/.

If you use the GenBank database in your published research, we ask that this article be cited.

Funding for open access charge: Intramural Research Program of the National Library of Medicine, National Institutes of Health.

Conflict of interest statement. None declared.
2. VADR: validation and annotation of virus sequence submissions to GenBank
3. NCBI viral genomes resource
4. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation
5. BLAST: a more efficient report with usability improvements
7. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata
8. Integrating biodiversity infrastructure into pathogen discovery and mitigation of emerging infectious diseases
9. Darwin Core: an evolving community-developed biodiversity data standard