key: cord-0907285-wjf8t8vp
authors: Brister, J. Rodney; Ako-adjei, Danso; Bao, Yiming; Blinkova, Olga
title: NCBI Viral Genomes Resource
date: 2015-01-28
journal: Nucleic Acids Res
DOI: 10.1093/nar/gku1207
sha: b44258fb9be12d1ddecb77617f2cf7914313d35c
doc_id: 907285
cord_uid: wjf8t8vp

Recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. Yet, any potential benefits from the billowing cloud of next generation sequence data hinge upon well implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. The NCBI Viral Genomes Resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. The resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/ and catalogs all publicly available virus genome sequences and curates reference genome sequences. As the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences. The rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. This, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets.

Recent outbreaks of Ebolavirus (1, 2) and Middle East respiratory syndrome coronavirus (MERS-CoV) (3, 4) clearly demonstrate the power of sequence analysis in viral surveillance, host reservoir identification and public health policy debate.
As these viruses have filled media headlines, their genome sequences have spilled into international public databases. Such real time analysis promises to fundamentally alter our understanding of viral biology and significantly impact public health responses to viral disease, but it also places renewed emphasis on the public research infrastructure that is necessary to support the storage and analysis of sequence data. This infrastructure includes the primary databases that together comprise the International Nucleotide Sequence Database Collaboration (INSDC) (5), namely GenBank (6), the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) (7) and the DNA Data Bank of Japan (DDBJ) (8), and reference databases like the ViralZone Resource at the Swiss Institute of Bioinformatics (http://viralzone.expasy.org) (9) and the Viral Genome Resource at the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/genome/viruses/) (10). Whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them the transfer of annotation to related genomes (11-13), sequence assembly and virus discovery (14-17), the study of viral dynamics and evolution (18-20) and pathogen detection (14, 21-23).

The NCBI Viral Genomes Project was established in response to the growing need for a public, virus-specific, reference sequence resource (24). The project catalogs all complete viral genomes deposited in INSDC databases and creates so-called RefSeq records for each viral species. Each RefSeq is derived from an INSDC sequence record, but may include additional annotation and/or other information. Accessions for RefSeq genome records include the prefix 'NC_', allowing them to be easily differentiated from INSDC records.
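The accession convention just described can be checked mechanically. The following is a minimal sketch, assuming that viral RefSeq genome accessions consist of 'NC_' followed by six digits and an optional version suffix; the function names are hypothetical and are not part of any NCBI tool.

```python
import re

# Assumption: a RefSeq genome accession is "NC_" plus six digits, optionally
# followed by ".version" (e.g. "NC_000866" or "NC_000866.4").
REFSEQ_GENOME = re.compile(r"^NC_\d{6}(\.\d+)?$")


def is_refseq_genome(accession: str) -> bool:
    """Return True when the accession carries the RefSeq 'NC_' genome prefix."""
    return bool(REFSEQ_GENOME.match(accession.strip()))


def split_accessions(accessions):
    """Partition accessions into (RefSeq genomes, all other records)."""
    refseq = [a for a in accessions if is_refseq_genome(a)]
    other = [a for a in accessions if not is_refseq_genome(a)]
    return refseq, other
```

Applied to a mixed list of accessions, `split_accessions` separates the curated RefSeq genome records from everything else, which here would mostly be the INSDC records they were derived from.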
For example, the RefSeq genome record for Enterobacteria phage T4 has the accession NC_000866 but was derived from the INSDC record AF158101. Typically, the first genome submitted for a particular species is selected as a RefSeq, and once a RefSeq is created, other validated genomes for that species are indexed as 'genome neighbors'. As such, the viral RefSeq data model is taxonomy centric, or more specifically, species centric, and all RefSeq records and genome neighbors are indexed at the species level. This model requires both the demarcation of individual viral species and the grouping of genome sequences into defined species.

Table 1

Virus genome type                             RefSeq genome segments   Total genome segments   Total INSDC sequences
dsDNA viruses, no RNA stage                   1755                     3023                    115 911
dsRNA viruses                                 919                      17 929                  56 699
ssDNA viruses                                 669                      6692                    40 337
ssRNA negative-strand viruses                 187                      4384                    478 791
ssRNA positive-strand viruses, no DNA stage   917                      14 441                  415 664
Retro-transcribing viruses                    123                      8614                    727 762

The table does not include influenza virus sequences. These sequences are stored in a specialized database (11, 25).

There are now 71 628 validated viral and viroid genome segments deposited within INSDC databases, not including influenza sequences, which are stored in a specialized database (11, 25). This figure represents a nearly 9-fold increase since 2000 (Figure 1), and this rise reflects both steady increases in the number of novel viruses sequenced--as measured by the number of RefSeq genome segments--and a large increase in the number of genome neighbors, i.e. genome sequences belonging to viral species already represented by a RefSeq (Figure 1). As shown in Table 1, RefSeq genome segments are distributed among all viruses, but genome neighbor segments are concentrated among smaller, ssDNA, RNA, and retro-transcribing viruses.
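The concentration of genome neighbors in particular genome types can be illustrated directly from the Table 1 counts. The sketch below divides the total genome segments by the RefSeq genome segments for each genome type; treating this ratio as a rough measure of how many sequenced genomes exist per reference is an interpretation made here for illustration, not a metric defined by the resource.

```python
# Counts copied from Table 1: (RefSeq genome segments, total genome segments).
TABLE_1 = {
    "dsDNA viruses, no RNA stage": (1755, 3023),
    "dsRNA viruses": (919, 17929),
    "ssDNA viruses": (669, 6692),
    "ssRNA negative-strand viruses": (187, 4384),
    "ssRNA positive-strand viruses, no DNA stage": (917, 14441),
    "Retro-transcribing viruses": (123, 8614),
}


def segments_per_refseq(table):
    """Total genome segments divided by RefSeq genome segments, per genome type."""
    return {name: total / refseq for name, (refseq, total) in table.items()}


def ranked(table):
    """Genome types sorted from the highest to the lowest ratio."""
    ratios = segments_per_refseq(table)
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, ratio in ranked(TABLE_1):
        print(f"{name}: {ratio:.1f} genome segments per RefSeq segment")
```

With these counts the dsDNA viruses come out lowest, at under two genome segments per RefSeq segment, while the retro-transcribing viruses come out highest, at roughly 70.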
Although many of these neighbor genomes are concentrated among human pathogens, there are also several viruses of agricultural importance with high numbers of sequenced genomes (Table 2). While most of the viruses in Table 2 are well studied in the laboratory, many other sequenced viruses are not.

The RefSeq data model for most organisms underscores the importance of very well annotated reference sequence records (26). Unfortunately, only a minority of viral systems are experimentally well defined, so there is often little primary data on which to base genome annotations. In some cases, sequence homologies allow the transfer of annotation from experimentally defined to poorly characterized genomes (11-13). Yet, genomes are often annotated by purely ab initio processes (27-29). Given the difficulty of implementing a purely well annotated representation of viral genome sequences, the viral RefSeq model has evolved into a more flexible approach that includes both reference and representative sequences. Reference RefSeq records provide sources of well annotated sequence features, whereas representative records provide coverage of extant sequence variation. The comment 'REVIEWED REFSEQ' is added to RefSeq records to highlight those that include additional annotation, and as of this writing, there are 747 reviewed viral RefSeq records, including references for several human pathogens, such as human immunodeficiency virus 1 (NC_001802), Measles virus (NC_001498) and Poliovirus (NC_002058) and several other important viral systems such as Enterobacteria phage T4 (NC_000866), Enterobacteria phage T7 (NC_001604) and Tobacco mosaic virus (29, 30). Moreover, some viral communities are developing well defined subspecies classifications such as the genotyping schemes for hepatitis B virus and hepatitis C virus (31-33).
These genotyping schemes can provide an important framework for the interpretation of genome sequence data (34), and more communities are expected to develop genotyping schemes in the coming years. Finally, there are cases when the best characterized viral isolate is a laboratory variant, and it may be important to create multiple RefSeq records in order to provide both experimentally annotated references and sufficient sequence representation of circulating isolates. Together these cases highlight the need for both reference genome sequences that capture the best possible annotation and representative genome sequences that capture important intraspecies variation or define subspecies categories. Therefore the viral RefSeq model has expanded to include both reference and representative genome sequences to better serve community needs.

The rising pace of viral discovery has a number of implications for data processing by the Viral Genomes Group. Viral taxonomy within the NCBI Taxonomy database is based on the list of valid species names and classifications provided by the International Committee on Taxonomy of Viruses (ICTV) (35, 36). When the Viral Genomes Project was initiated, there were many more viral species recognized by the ICTV than viral RefSeq genome sequence records (Figure 2). However, as the rate of viral genome sequencing has increased over the past decade, so too has the pace of viral discovery. As a result, many RefSeqs are made from viruses that are clearly distinct from existing ones but lack an official taxonomy designation. Taxonomy also affects the interpretation of genome sequence data, and technical difficulties encountered when sequencing the termini of some ssRNA and ssDNA viruses often lead to differing community standards for 'complete genomes' (37). This means that some difficult to sequence genomes are considered complete if they include the entire coding region but are missing some terminal sequence.
Improved methods may eventually resolve this issue (38), but in the meantime it would be useful for communities to define completeness standards with regard to current technology. In addition to manual selection based on genome length, the taxonomy of both RefSeq genome records and INSDC genome neighbor records is validated. Indeed, given that many novel virus genome sequences are submitted before analysis by the ICTV (see Figure 2), validation of taxonomy assignment is a major facet of curation. Taxonomy is important to the overall usability of NCBI viral genome resources and, when properly implemented, creates a framework for groups of related sequences. Using standards established by individual ICTV study sections (36) and published reports, the taxonomy of each viral genome is validated and updated as necessary. Newly submitted viral genomes without official ICTV assignment are placed in 'uncharacterized' taxonomy bins that are easily distinguished from those recognized by the ICTV. Often little information is included in the INSDC sequence record, and a growing number of sequences do not include any linked publications. Using sequence analysis and comparative genomics, every attempt is made to place new genomes into a family (i.e. the 'uncharacterized' bin associated with a specific family) or lower order classification bin. However, some genomes are very distinct from previously characterized ones and only higher order classification is possible.

Reference viral RefSeq records are generally curated by biologists using in-house annotation tools and the scientific literature as guides. A panel of Viral Genome Advisors from outside NCBI bolsters curation efforts by offering expert guidance or taking responsibility for specific RefSeq records themselves. This approach is used for the maintenance of Adenovirus and Herpesvirus RefSeq records (39) and could be extended to other virus genomes (29).
Even with these efforts, the growing number of viral genomes submitted to INSDC databases and the rapid pace of scientific discovery make maintenance of up-to-date references difficult. Therefore collaboration with scientific communities is critical to providing accurate annotation. Sometimes these collaborative efforts are directed at curating a single RefSeq record, and all of the reviewed RefSeq records mentioned in the previous section were curated in collaboration with experts from individual viral communities. Other times these collaborations are more extensive and touch many sequence records. For example, overlapping gene annotations were corrected on RefSeq records from 14 virus families (Arteriviridae, Arteriviridae, Bunyaviridae, Caliciviridae, Circoviridae, Dicistroviridae, Flaviviridae, Luteoviridae, Paramyxoviridae, Parvoviridae, Picornaviridae, Potyviridae, Reoviridae, Togaviridae) as directed by experimental or predictive analysis (40, 41). A new emphasis has been placed on initiating annotation collaborations at the beginning of a large genome sequencing program so that reference annotations, isolate naming schemes and other standards can be established prior to sequence submission (42-44). These collaborations often include members of the UniProt Viral Protein Annotation Program (9, 45) and/or curators from sequencing centers and other databases (46), in addition to members of the relevant viral communities, and they effectively ensure both well annotated references and consistently annotated INSDC sequence records. Such arrangements underscore the extensive impact of viral genome annotation issues--from public databases to sequencing centers to individual researcher communities--and were formalized within the Viral Genome Annotation Working Group, which brings together stakeholders and provides a forum for the discussion of annotation issues (29, 47).
In addition to protein annotation and isolate naming issues, this group is working to define standards for viral genome sequence data.

As the number of viral sequences has risen, so has the demand for curated metadata describing sequences. The Viral Genomes Group has implemented two models designed to capture and standardize metadata. In the first model, exemplified by the Virus Variation Resource, host, isolation country and other important metadata are parsed from individual sequence records, mapped against vocabulary lists and standardized (25, 48). Sequences can then be searched using these standardized metadata terms. Currently, only a small subset of viral sequences is included in the Virus Variation Resource, including those for influenza, dengue and West Nile viruses, but the ultimate goal is to expand this semi-automated model to include more viruses. The second model captures and standardizes host information for all viruses: whenever a new RefSeq record is created, a manually curated 'viral host' property is assigned to the relevant species within the NCBI Taxonomy database. The property defines higher order, biologically relevant taxonomic host groups--algae, archaea, bacteria, diatom, environment, fungi, human, invertebrates, plants, protozoa and vertebrates--and enables sorting and selection of sequences within the NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) and the Viral Genomes Resource. For example, searching the NCBI Taxonomy database with the term 'vhost fungi'[Properties] (quotes included) will return a list of taxonomy groups comprised of viruses that infect fungi. Users can then select the 'Genome' database from the 'Find related data' link on the Taxonomy search page to view all viral genomes associated with viruses retrieved from the search. In cases where a virus infects multiple types of organisms, multiple terms are assigned, for example 'invertebrates, plants'.
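The viral host search terms can also be assembled programmatically. The following is a minimal sketch that builds the term from the host groups listed above and joins several groups with AND, which is how the text combines hosts. It assumes the quoted phrase is typed with double quotes, and it assumes the public E-utilities esearch endpoint with its db, term and retmax parameters; the helper names are hypothetical, and only the URL is constructed, with no request sent.

```python
from urllib.parse import urlencode

# Host groups listed in the text for the 'viral host' property.
VHOST_GROUPS = {
    "algae", "archaea", "bacteria", "diatom", "environment", "fungi",
    "human", "invertebrates", "plants", "protozoa", "vertebrates",
}

# Assumption: the public NCBI E-utilities esearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def vhost_term(*hosts):
    """Build a Taxonomy search term such as "vhost fungi"[Properties]."""
    if not hosts:
        raise ValueError("at least one host group is required")
    unknown = [h for h in hosts if h not in VHOST_GROUPS]
    if unknown:
        raise ValueError(f"unknown host group(s): {unknown}")
    # Several host groups are combined with AND.
    return " AND ".join(f'"vhost {h}"[Properties]' for h in hosts)


def esearch_url(term, retmax=20):
    """Return an esearch URL for the Taxonomy database (no request is made)."""
    query = urlencode({"db": "taxonomy", "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"
```

For instance, `vhost_term("invertebrates", "plants")` yields the combined term for viruses assigned both host groups, and `esearch_url` wraps any such term in a URL that could be fetched with an HTTP client.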
To search NCBI Taxonomy for viruses that infect multiple hosts, simply include 'AND' between search terms, for example 'vhost invertebrates'[Properties] AND 'vhost plants'[Properties] (quotes included). The current distribution of assigned viral host terms is shown in Figure 3.

The NCBI Viral Genome Resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/. On this home page, users will find FTP links for downloading the accession list of all viral and viroid genomes (RefSeq and genome neighbors) and the complete viral and viroid RefSeq dataset. Perhaps the central features of the resource are the viral and viroid genome browsers. These tables list all viral and viroid species represented by a reference sequence and include links to genome neighbor sequences. Users can navigate to specific taxonomic groups and sort the table by viral host type. Once a dataset has been defined by taxonomy and host types, users can download the resultant table, the list of RefSeq accessions in the table, or a list that includes RefSeq and genome neighbor accessions as well as taxonomy and viral host information.

Several specialized viral resources and tools are also linked through the Viral Genomes Resource home page. These include specialized resources for influenza, dengue, West Nile and other viruses that are part of the Virus Variation Resource (http://www.ncbi.nlm.nih.gov/genome/viruses/variation/) (25, 48, 49). The link to the Retrovirus Resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the Retrovirus Genotyping Tool and the HIV-1, Human Interaction Database (50, 51). These tools are designed to assist retroviral researchers in the identification and classification of sequences and to document HIV-1 and human protein and replication interactions through a searchable interface.
Finally, there is a link to the Pairwise Sequence Comparison Tool (PASC) (http://www.ncbi.nlm.nih.gov/sutils/pasc), a Blast-based tool with graphical output that can be used to establish taxonomic classification criteria of some viruses and classify viruses with newly sequenced genomes (52, 53) . Both RefSeq records and other genomes for species are linked throughout NCBI resources and can be used in a variety of operations. Among these, the RefSeq dataset can be used to reduce the redundancy of Blast searches (http://blast.ncbi.nlm.nih.gov/Blast.cgi) (54), providing fewer, higher quality sequences within search results. To restrict nucleotide Blast searches to include only viral RefSeq genomes, employ the 'Choose Search Set' options in the Blast search interface (55): Select 'Reference genomic sequences (RefSeq genomic)' in the database field and enter 'Viruses' in the 'Organism' field text box. For protein Blast searches, the viral RefSeq protein set can be used by selecting 'Reference proteins' (RefSeq proteins) in the database field and entering 'Viruses' in the 'Organism' field text box. Data derived from viral RefSeqs are also used to support a number of other databases including Gene (56) and Protein Clusters (57) . Each species that includes a RefSeq can be found in the Genome database (http://www.ncbi.nlm.nih.gov/genome) (56) . This resource can be searched by taxonomy names, and retrieved genome records include links to all RefSeqs for that species. Each individual genome record also includes links to neighbor sequences for that species under 'Related information', and these can be viewed by selecting the 'Other genomes for species' option. These links display all genome neighbor records in the nucleotide database where they can be viewed and/or downloaded. 
Genome neighbor records can also be retrieved from multiple genome records using the 'Find related data' options, allowing the user to search for an entire viral family or similar group and then retrieve all genome neighbor records defined by the original search criteria. Simply select 'Nucleotide' in the 'Database' pull-down menu and 'Other genomes for species' from the 'Option' pull-down menu to return all genome neighbors for all the species listed in the search results.

As the sequencing revolution continues to gather steam, and the rate of viral genome sequencing increases, reference databases will be pressed to serve growing community needs. Meeting these needs will require further collaboration with individual viral communities and across public databases. Data models will also need to shift to better represent the extant sequence universe and provide better standardized sequence annotation. Once annotated, large-scale genome sequence data will need to be presented in ways that facilitate human data sorting and discovery operations. This will require semi-automated metadata capture and standardization, as well as innovative interfaces and tools that leverage metadata in discovery operations. Many of these approaches and processes are currently being tested within the NCBI Virus Variation Resource (25), where users can readily find sequences based on specific, standardized sequence descriptors, greatly improving the accessibility and utility of viral sequence data. While currently limited to a handful of human pathogens, our intent is to expand the Virus Variation data model to include more viruses from more viral communities. This should open up a number of possibilities and will support the aggregation and retrieval of sequences based on community-defined criteria like genotypes or complete genome sets, as is currently possible for influenza virus sequences (11, 25).
The growing cloud of viral genome sequences also poses significant barriers to the maintenance of reference genome records. The pace of experimental discovery and the number and breadth of viral genomes make it increasingly difficult to provide well annotated, up-to-date reference sequences. To counter this, we must leverage community knowledge and activities toward the goal of better viral RefSeq resources and must collaborate with viral communities to maintain well annotated reference sequences, develop community-accepted gene and protein naming standards and define community-established subspecies classification schemes. Though collaborations have been initiated within some communities (29, 42-44, 47), they need to be scaled to include more groups. As a public resource, we serve a range of communities--from public health to basic research--and rely on them to both better inform our mission and help support it. Only by engaging our stakeholders and working together on shared goals can we provide the rigorous resources necessary to support viral sequence data activities.
Emergence of Zaire Ebola virus disease in Guinea--preliminary report
Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak
Middle East respiratory syndrome coronavirus in dromedary camels: an outbreak investigation
Transmission and evolution of the Middle East respiratory syndrome coronavirus in Saudi Arabia: a descriptive genomic study
The International Nucleotide Sequence Database Collaboration
The European Bioinformatics Institute's data resources 2014
DDBJ progress report: a new submission system for leading to a correct annotation
ViralZone: recent updates to the virus knowledge resource
NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins
FLAN: a web server for influenza virus genome annotation
VIGOR extended to annotate genomes for additional 12 different viruses
VIGOR, an annotation program for small viral genomes
Evaluation of alignment algorithms for discovery and identification of pathogens using RNA-Seq
Identification of a novel polyomavirus from patients with acute respiratory tract infections
Klassevirus 1, a previously undescribed member of the family Picornaviridae, is globally widespread
A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes
Deep sequencing of norovirus genomes defines evolutionary patterns in an urban tropical setting
Molecular epidemiology of contemporary G2P[4] human rotaviruses cocirculating in a single U.S. community: footprints of a globally transitioning genotype
Going viral: next-generation sequencing applied to phage populations in the human gut
PathSeq: software to identify or discover microbes by deep sequencing of human tissue
VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data
A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples
National Center for Biotechnology Information Viral Genomes Project
Virus Variation Resource--recent updates and future directions
NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy
Improving gene annotation of complete viral genomes
Identification of proteins associated with murine cytomegalovirus virions
Microbial virus genome annotation-mustering the troops to fight the sequence onslaught
Imbroglios of viral taxonomy: genetic exchange and failings of phenetic approaches
Molecular identification of hepatitis B virus genotypes/subgenotypes: revised classification hurdles and updated resolutions
Consensus proposals for a unified system of nomenclature of hepatitis C virus genotypes
Expanded classification of hepatitis C virus into 7 genotypes and 67 subtypes: updated criteria and genotype assignment web resource
Is there any value to hepatitis B virus genotype analysis?
The NCBI Taxonomy database
Virus Taxonomy: Classification and Nomenclature of Viruses: Ninth Report of the International Committee on Taxonomy of Viruses
Rapid cDNA synthesis and sequencing techniques for the genetic study of bluetongue and other dsRNA viruses
A new approach to determining whole viral genomic sequences including termini using a single deep sequencing run
Herpesvirus systematics
Evolution of viral proteins originated de novo by overprinting
Overlapping genes produce proteins with unusual sequence properties and offer insight into de novo protein creation
Uniformity of rotavirus strain nomenclature proposed by the Rotavirus Classification Working Group (RCWG)
Virus nomenclature below the species level: a standardized nomenclature for natural variants of viruses assigned to the family Filoviridae
Virus nomenclature below the species level: a standardized nomenclature for laboratory animal-adapted strains and variants of viruses assigned to the family Filoviridae
The Universal Protein Resource (UniProt) in 2010
ViPR: an open bioinformatics database and analysis resource for virology research
Towards Viral Genome Annotation Standards
Virus variation resources at the National Center for Biotechnology Information: dengue virus
The influenza virus resource at the National Center for Biotechnology Information
A web-based genotyping resource for viral sequences
Human immunodeficiency virus type 1, human protein interaction database at NCBI
PAirwise Sequence Comparison (PASC) and its application in the classification of filoviruses
Improvements to pairwise sequence comparison (PASC): a genome-based web tool for virus classification
BLAST: a more efficient report with usability improvements
NCBI BLAST: a better web interface
Database resources of the National Center for Biotechnology Information
The National Center for Biotechnology Information's Protein Clusters Database

We would like to thank Vyacheslav Chetvernin, Boris Fedorov, Sergey Resenchuck, Igor Tolstoy, Tatiana Tatusova and Jim Ostell for their development and support.