key: cord-0721531-1pau5m3s authors: Phadke, Sujal; Macherla, Saichetana; Scheuermann, Richard H. title: Database and Analytical Resources for Viral Research Community date: 2019-12-31 journal: Reference Module in Life Sciences DOI: 10.1016/b978-0-12-809633-8.20995-3 sha: 4c9485fe81cd65bef00fbcc374672ac941c7f2fb doc_id: 721531 cord_uid: 1pau5m3s Abstract Many public databases and analytical resources are available to facilitate virology research. The Virus Pathogen Database and Analysis Resource (ViPR, see “Relevant Websites section”) and Influenza Research Database (IRD, see “Relevant Websites section”) are comprehensive and highly curated repositories of genome and protein sequence records and annotations, protein structures, immune epitopes, and epidemiological and surveillance data about human and related viral pathogens. These data are acquired from public repositories, direct submissions and in-house bioinformatics analyses. The resources offer seamless integration of data, analytics and visualization, and are freely available without cost or restriction to facilitate diagnostics and therapeutics development for priority pathogens. Viruses infect all kingdoms of life. In this article, we focus on database resources for viruses that infect humans and other animals. We call attention to databases such as Plant Viruses Online (see "Relevant Websites section") and the Prokaryotic Virus Ortholog Groups (pVOGs) (see "Relevant Websites section") for readers interested in viruses that infect other host organisms, which are out of scope of this article. The landscape of databases and analytical tools available for human virology research is guided by research and development goals for priority pathogens. The available resources can be categorized as databases that store specific data types and bioinformatics webtools that offer specific analytical capabilities. 
These two essential functions have also been combined and integrated in comprehensive resources such as ViPR and IRD. Several types of databases are available for virology research that can be distinguished based on the type of data they contain or the pathogen area of focus (Table 1) . For instance, many popular databases focus on storing information about specific biomolecules, such as gene and protein sequences, immune epitopes, or protein structures. These databases can be further distinguished as sequence archives such as GenBank (see "Relevant Websites section"), and UniProt (see "Relevant Websites section"), where data is deposited by the primary investigators and curated DBs such as RefSeq (see "Relevant Websites section") that integrate additional knowledge (e.g., annotations) with sequence records to provide an enhanced knowledgebase. Biomolecule information other than sequences is also stored in other databases, including the Protein Data Bank (PDB; see "Relevant Websites section"), which stores 3D structural data, the Immune Epitope Database (IEDB; see "Relevant Websites section"), which catalogs experimental data on B cell and T cell epitopes studied in humans and other animals and the Virus Particle Explorer (VIPERdb; see "Relevant Websites section"), which stores the structures of viruses with icosahedral virions. Virology databases have also been designed to focus on particular taxa of viral pathogens. For example, recognizing hepatitis B virus as a major public health problem worldwide, the Hepatitis B Virus Database (HBVDb; see "Relevant Websites section") has been designed to facilitate research on the genetic variability of HBV and its resistance to treatment. HBVDb allows the analysis of annotated sequences for genotyping and drug resistance profiling. 
Similarly, a collection of databases for research on the Human Immunodeficiency Virus (HIV) are available (see "Relevant Websites section") that contain comprehensive data on genome and protein sequences and immunological epitopes. Because influenza virus poses perhaps the most persistent major global public health threat, several databases are dedicated to research on influenza. For instance, the Global Initiative on Sharing All Influenza Data (GISAID; see "Relevant Websites section") is an access-controlled resource of influenza sequence information and related epidemiological data. FluNet (See Relevant Websites Section) is a global web-based influenza surveillance data collection, maintained at the World Health Organization (WHO) and available for tracking the movement of flu viruses globally. The Influenza Virus Resource (see "Relevant Websites section") supports the search and analysis of influenza genomic and protein sequences at National Center for Biotechnology Information (NCBI). The Influenza Research Database (IRD; see "Relevant Websites section") provides the most comprehensive collection of influenza virus-related data and an integrated suite of analytical and visualization capabilities for research on influenza virus. In contrast to the aforementioned resources that are focused on a particular data type or virus, the ViPR resource (see "Relevant Websites section") is unique in that it provides cross-referenced data of multiple types on all high priority human pathogenic viruses that pose a threat to public health, except HIV. Each virus family has a dedicated portal within ViPR that offers intuitive, customized search interfaces and analytical options tailored for each of the virus families. ViralZone (see "Relevant Websites section") provides access to a highly curated and extensive knowledgebase about a wide range of viruses. Research in virology is heavily dependent on data mining using sophisticated bioinformatics tools. 
With the foresight into the importance of such capabilities, several dedicated webtools are available for the users to conduct various types of analyses on viral genomes. For instance, tools such as IDSeq (see "Relevant Websites section"), Virome (see "Relevant Websites section") and VirusDetect (see "Relevant Websites section") allow detection of viruses from deep-sequencing of metagenomics samples. IDSeq is designed with the aim of real-time pathogen detection from metagenomes. Virome focuses on environmental metagenomes, whereas VirusDetect specifically uses small RNA datasets to detect viruses using both de novo and reference-based assemblies. Once a novel virus isolate or variant is detected, tools such as the Viral Genome ORF Reader (VIGOR; see "Relevant Websites section") enable genome annotation. VIGOR is a homology-driven viral gene prediction program that yields predicted proteins and mature peptides for newly sequenced isolates and variants of human virus. The software uses a set of highly curated databases enabling VIGOR to annotate a given viral genome. Currently VIGOR supports gene prediction and annotation of about 25 different virus taxonomic groups. Tools are also available to study the viral pathogen in the context of its host environment. For instance, STRING-Viruses (see "Relevant Websites section") is a webtool available as part of the STRING database that allows assessment of protein-protein interactions using visualization tools such as Cytoscape. This webtool is particularly important for studying how viral proteins interact with host proteins during various stages of infection. Likewise, NextStrain (see "Relevant Websites section") is a webtool that enables rapid, real-time tracking of evolving pathogen populations during infectious outbreaks. 
NextStrain is an open source system that tracks mutation marker data on pathogen phylogenies to make inferences about epidemiologically relevant parameters such as the spatio-temporal spread of an infection within a host population. For the remainder of this article, we focus on describing two related database and analytical resources available for research on human viral pathogens, ViPR and IRD, as examples of how these types of resources are developed and used. For a more comprehensive list of other available virus database and analysis resources, we encourage the reader to explore the resources listed at ViralZone (see "Relevant Websites section").

The National Institute of Allergy and Infectious Diseases (NIAID) at the U.S. National Institutes of Health (NIH) implemented the Bioinformatics Resource Centers (BRCs) for Infectious Diseases program to support research on priority pathogens of humans. As a result, the BRC focused on viral pathogens has developed the ViPR and IRD resources as publicly accessible online repositories for viruses that adversely affect public health, with the aim of integrating research and surveillance data. ViPR (see "Relevant Websites section") is unique among virus-centered databases in offering a wealth of information on a large number of viral families. In contrast, IRD (see "Relevant Websites section") is a parallel resource that is focused exclusively on influenza virus. The objective of both resources is to provide virus data and analytical capabilities to advance the understanding of virus transmission, pathogenesis, and host range, and to support the development of diagnostics and therapeutic interventions.

The ViPR and IRD databases integrate data from three sources (Table 2): public repositories, direct submissions, and in-house bioinformatics analyses. The ViPR and IRD databases capture various data types from multiple publicly accessible data archives.
ViPR and IRD integrate genomic sequence information from GenBank (see "Relevant Websites section"), protein sequences from UniProt (see "Relevant Websites section"), protein structures from the Protein Data Bank (PDB; see "Relevant Websites section"), experimentally determined T-cell and B-cell epitopes from the Immune Epitope Database (IEDB; see "Relevant Websites section"), and Gene Ontology annotations from the GO database (GO; see "Relevant Websites section"). All data types are regularly updated and are searchable using their original accession numbers within intuitive web-based user interfaces. In some cases, active research projects supported by the U.S. National Institutes of Health and other interested parties submit data and related metadata directly to ViPR and IRD. For instance, the NIAID-funded Systems Biology Consortium for Infectious Diseases research programs submit a variety of transcriptomic, proteomic, and metabolomic datasets that investigate in vivo and in vitro host responses to viral infections. The Genomic Sequencing Centers for Infectious Diseases (GCID) program submits detailed structured metadata, including clinical information such as disease symptoms, severity, and diagnostic test outcomes, that are linked with sequence records of the corresponding virus isolate obtained from GenBank. IRD serves as the repository for the influenza human and animal surveillance data collected by the Centers of Excellence for Influenza Research and Surveillance (CEIRS) program. The IRD and ViPR development team generates and integrates unique derived data from bioinformatics analysis pipelines run in-house and tailored specifically to a given taxonomic group. Derived data include improved and consistent metadata annotations including strain name, clade and genotype information, virus taxonomy, host and country of isolation, and collection date.
For instance, the ViPR annotation process extends information available in the representative RefSeq strain for each species. The process uses multiple sequence alignment to identify homologous regions across related viral genomes and thereby map mature peptide cleavage sites on polyproteins. Likewise, a custom annotation pipeline is used in IRD to predict open reading frames and sequences for variants of influenza proteins including PA-X, PA-N155, PA-N182, M42, NS3 and PB1-40. The predicted variant proteins can be retrieved from the Nucleotide and Protein Sequence Search pages. Various tree-based clade classification tools are also available and have been used to predict clades and genotypes of pathogenic strains of several viruses, including Zika, rotavirus A, and Hepatitis C virus in ViPR and H1N1, H5N1 and swine H1 strains in IRD. Furthermore, Sequence Features (SFs) are derived using information integrated from UniProt, GenBank, IEDB and the scientific literature, followed by inspection and validation by domain experts. SFs are protein regions with important structural, functional, immune epitope, or sequence alteration characteristics. Once the SF protein regions are defined, the extent of sequence variation observed in each region is determined as a series of Variant Types (VTs). Lastly, the host factor component of IRD/ViPR contains a variety of derived data that give insights into systems-level infection dynamics. For instance, host factor biosets are groups of genes/proteins/metabolites that are significantly differentially expressed/abundant at different times post infection. Data models derived using Weighted Gene Coexpression Network Analysis (WGCNA) are available to aid identification of co-expressed genes that may be functionally related, tightly co-regulated or members of a similar pathway. The set of co-expressed genes can also be visualized as Cytoscape networks where nodes represent genes and edges represent the strength of co-expression.
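To make the network representation concrete, the following minimal Python sketch builds a weighted co-expression edge list from expression profiles. It assumes Pearson correlation raised to a soft-thresholding power, which is the unsigned adjacency used by WGCNA; the gene names, the power `beta`, and the cutoff `min_weight` are hypothetical values chosen for illustration and are not taken from the ViPR/IRD pipeline.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def coexpression_edges(profiles, beta=6, min_weight=0.5):
    """Return (gene1, gene2, weight) edges with weight = |r| ** beta.

    beta and min_weight are illustrative assumptions, not ViPR/IRD settings.
    """
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        weight = abs(pearson(profiles[g1], profiles[g2])) ** beta
        if weight >= min_weight:
            edges.append((g1, g2, weight))
    return edges


# Hypothetical expression values at four time points post infection.
profiles = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(profiles)
for g1, g2, weight in edges:
    print(f"{g1}\t{g2}\t{weight:.3f}")
```

Each tuple corresponds to one edge in the network view: the two genes are the nodes and the weight is the co-expression strength. An edge list of this shape can be written to a tab-delimited file, a format that network viewers such as Cytoscape can import.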
ViPR and IRD offer frequent updates on all data types. Genome sequence data are updated daily (IRD) or weekly (ViPR), while all other data types are updated with each bimonthly release. As of September 23, 2019, ViPR provides data on 667,249 virus strains from 6126 viral species belonging to 20 families, including Arenaviridae, Caliciviridae, Coronaviridae, Fimoviridae, Filoviridae, Flaviviridae, Hantaviridae, Hepeviridae, Herpesviridae, Nairoviridae, Paramyxoviridae, Peribunyaviridae, Phasmaviridae, Phenuviridae, Picornaviridae, Poxviridae, Reoviridae, Rhabdoviridae and Togaviridae. It contains sequences from 883,170 genomes, of which 110,742 are complete genome sequences. Sequence data on 42,100,000 proteins are also available and include various attributes such as annotations, mature peptide data, and experimentally determined epitopes. ViPR contains a total of 16,945 3D protein structures from PDB and 61,816 experimentally determined immune epitopes. Table 3 displays a breakdown of available data; details may be found at the link (see "Relevant Websites section"). The IRD holds 751,002 total influenza genome segment sequences, 1748 PDB structures and 1,184,929 proteins with predicted epitopes. Also, the IRD is unique in providing host factor datasets generated from experimental infections of host organisms and cell lines with various viral strains. These cover a range of pathogens in the Orthomyxoviridae and Coronaviridae families.

• Unique data types including host factor, sequence features and animal and human surveillance information

ViPR and IRD offer highly curated data that have been vetted using computational and manual curation strategies. For instance, an in-house curation and annotation pipeline flags sequence anomalies so that the affected sequences can be removed during downstream analysis.
Along with the sequence data, the ViPR team has manually-curated the scientific literature to provide improved and consistent annotations of metadata including the geographic location, year, and host for many clinically-relevant taxonomic groups. The highly curated strain level data are displayed with a Genome Map image and a Protein Information table from which detailed structural and functional information for a given gene/protein can be obtained. ViPR utilizes RefSeq strains to extend the manually-curated annotations to strains belonging to the same taxon. Furthermore, RefSeq sequences are used to construct virus ortholog groups and their associated annotations, which enable identification of proteins with similar function within a given virus taxon. ViPR and IRD also offer curated data on T-cell and B-cell immune epitopes and their predicted positioning on protein structures from the IEDB. Data curation in ViPR and IRD continues to grow and expand beyond sequence and strain level information. For example, both databases offer curated antiviral drug data from the DrugBank (see "Relevant Websites section"), including the descriptive drug information, 3D structures for target complexes, interaction sites as sequence features and antiviral resistance mutations to aid in assessing the risk of anti-viral drug resistance development. ViPR offers customized search interfaces to allow for the retrieval of selected genomic, structural and other data records using specific metadata for different virus families. Users initiate the search by selecting a virus family on the home page. A user can narrow the search data specific to a virus strain by querying the database using genus or species, geographical location and date of isolation, virus host, and clinical or experimental data. The user also has an option to cast a wider net using keywords with further narrowing using advanced search options. 
Once a strain or set of strains is selected, detailed genomic and protein sequence information and associated annotation can be accessed. These data can then be directly analyzed using any of the appropriate tools available from within ViPR. Because IRD is dedicated to influenza viruses, the search interface design is guided by the availability of influenza strain-specific data. Users can query the database using the branching logic inherent in the database. For instance, users can search for complete or partial genome and segment sequences, and proteins, by directly entering the name of the strain(s) of interest. Users can also choose among several metadata fields such as host, geographic location and the date of pathogen isolation. Once a particular taxon, strain or metadata category is selected as a search criterion, additional search criteria appear dynamically to allow users to perform more focused searches. Moreover, users can use the advanced search options to refine the search results with more fine-grained criteria. For instance, users can choose to view data on strains isolated in specific months of a given year(s), or limit their search to specific host attributes such as gender and age, specimen type, laboratory strains and organism detection method. Lastly, users can customize their viewing options to specific display fields through the advanced search menu. An example of the various search options is shown in Fig. 1.

ViPR and IRD provide users an option to retrieve certain data types using command line utilities via Application Programming Interfaces (APIs). Specifically, the sequence search API allows users to retrieve sequences and associated metadata using GenBank and protein accession IDs. The retrieved sequences can be obtained in either FASTA or JSON format with user-defined associated metadata.
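As an illustration of consuming FASTA output on the client side, the Python sketch below parses FASTA text into records. The API endpoint and its parameters are not reproduced here; the sample text, the placeholder accessions, and the pipe-delimited header layout (`accession|strain|host`) are assumptions made for the example, since the actual header fields depend on the metadata the user requests.

```python
# Hypothetical header layout; the real fields depend on the metadata requested.
FIELDS = ("accession", "strain", "host")


def parse_fasta(text, fields=FIELDS):
    """Parse FASTA text into a list of dicts, one per sequence record."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: split the metadata and start a new record.
            values = line[1:].split("|")
            current = dict(zip(fields, values))
            current["sequence"] = ""
            records.append(current)
        elif current is not None:
            # Sequence lines may wrap; concatenate them.
            current["sequence"] += line
    return records


# Placeholder data standing in for an API response saved as text.
sample = (
    ">ACC0001|A/example/1/2009|Human\n"
    "ATGGCA\n"
    "TTGA\n"
    ">ACC0002|A/example/2/2010|Avian\n"
    "ATGC\n"
)
records = parse_fasta(sample)
for record in records:
    print(record["accession"], record["host"], len(record["sequence"]))
```

For JSON output, the standard-library `json.loads` would play the same role, returning dictionaries directly without any header parsing.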
The surveillance API allows retrieval of surveillance records and metadata from host surveillance samples. IRD allows users to submit sequences for large phylogenetic analysis jobs through an API to the high-performance computing environment provided by the NSF-sponsored Cyber-Infrastructure for Phylogenetic Research (CIPRES) Gateway. Tree calculations are made using the high-performance computing environment, and the resulting phylogenetic tree is returned for visualization using the Archaeopteryx tool in IRD.

ViPR and IRD host a comprehensive suite of bioinformatics tools for data analysis and visualization, closely integrated with the supported data. These include popular bioinformatics webtools constructed by the ViPR team or contributed by users/collaborators. Examples of the types of analysis that can be performed and the webtools that are available are described below. For a complete list of analytical tools, the reader is directed to the ViPR (see "Relevant Websites section") and IRD (see "Relevant Websites section") homepages. The sequence annotation pipelines allow users to upload and annotate genomic sequences to predict segment type, CDS location, and genotype information, and to identify possible sequencing artifacts. Users can use popular tools such as BLAST and MUSCLE within ViPR and IRD. Sequences can be selected from a search result or from a working set in the user's personal workbench. Users can also perform manual exploration and curation of sequence alignments, including relabeling the sequences and adding sequence features. After an alignment is completed, users have the option to download the input sequences and output files in a variety of formats or to pass the alignment to another tool, such as SNP analysis or meta-CATS.

Fig. 1 Protein search interface in IRD. The search page supports queries based on "classical" as well as "variant" proteins and associated metadata. A search query can be made more specific by choosing various query features.
For example, users can search for specific strain(s) by entering the strain name and subtypes in the appropriate search fields. Additionally, choosing a type of host, such as avian, brings up a drop-down menu from which the user can choose one or more species to make the search criterion more specific. Users may also limit their search results by geographic region by choosing one or more countries in the drop-down menu. Search results may be limited to a specific date range by entering the years or by choosing a month range through advanced options. Multiple other search criteria, such as keyword search, submission date, and host gender and age, are available through advanced options to make the search results more specific.

Users can infer phylogenetic relationships using the RAxML and FastME algorithms, and visualize the results using a customized visualization tool developed in house called Archaeopteryx, which allows users to color-code and annotate tree branches and nodes using available metadata, such as geographical location, host, isolation year, and amino acid residues at user-selected positions (e.g., Fig. 2). The meta-CATS tool provides a statistical analysis of sequences to identify genome and protein positions that show significantly different residue distributions between groups of sequences using the Chi-squared statistic. Sequences can be segregated into groups manually or automatically based on selected metadata values, such as year of isolation, geographical location, or host species. Thus, a user can put the analysis of sequences in the context of infection and infer an association between variation in a genomic region and a particular infection characteristic. Users can search for protein structures using multiple types of queries, including PDB IDs, gene symbol, Entrez ID, UniProt accession, and gene product names.
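As an aside on the meta-CATS comparison described above, the per-position test can be sketched as a Pearson chi-squared statistic on a groups-by-residues contingency table. The Python sketch below is a simplified illustration under stated assumptions, not the meta-CATS implementation; the group names and residue counts are hypothetical.

```python
from collections import Counter


def chi_squared_position(groups):
    """Chi-squared statistic and degrees of freedom for one alignment column.

    groups maps a group label (e.g. a host species) to the list of residues
    observed at that position in the group's sequences.
    """
    counts = {name: Counter(residues) for name, residues in groups.items()}
    if any(sum(c.values()) == 0 for c in counts.values()):
        raise ValueError("every group needs at least one residue")
    residues = sorted(set().union(*counts.values()))
    row_totals = {name: sum(c.values()) for name, c in counts.items()}
    col_totals = {r: sum(c[r] for c in counts.values()) for r in residues}
    total = sum(row_totals.values())

    statistic = 0.0
    for name, c in counts.items():
        for r in residues:
            # Expected count if residue usage were independent of group.
            expected = row_totals[name] * col_totals[r] / total
            statistic += (c[r] - expected) ** 2 / expected
    dof = (len(counts) - 1) * (len(residues) - 1)
    return statistic, dof


# Hypothetical residues at one protein position in two groups of sequences.
groups = {
    "human": ["K"] * 8 + ["R"] * 2,
    "avian": ["K"] * 2 + ["R"] * 8,
}
statistic, dof = chi_squared_position(groups)
print(f"chi-squared = {statistic:.2f}, degrees of freedom = {dof}")
```

A p-value can then be obtained from the chi-squared distribution with `dof` degrees of freedom, for example with `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available. Repeating the test over every alignment column would also call for a multiple-testing correction, which is omitted here.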
Furthermore, search can be restricted to include only proteins with experimentally determined epitopes, experimentally determined active sites and proteins with sequence features. Additionally, users can use advanced options to query the database using theoretical structures. Once a particular structure from the search results is selected, users can customize the general appearance of protein structures. For example, users can highlight ligands, active sites, epitopes and sequence features on the 3D structures. Individual residues within the protein structures are mapped to homologous positions from UniProt records, which allows comparison between protein structures. Annotating a 3D structure with important residues and regions of interest can yield testable hypotheses about the functional relevance of the protein. Lastly, users can download the highlighted protein structure as a publication quality image file or a structure movie. Fig. 2 Example of a phylogenetic tree constructed in ViPR. The search interface was used to retrieve unique sequences belonging to the 450 bp region that codes for the C-terminal 150 amino acids of the N nucleocapsid protein of the human measles Morbillivirus. A total of 554 unique sequences were obtained. Multiple sequence alignments were performed with MUSCLE and phylogenetic relationships inferred using RAxML for visualization with Archaeopteryx.js. The legend shows options for users to customize the tree visuals and highlight desired metadata in Archaeopteryx. For instance, the phylogram display with aligned labels has been chosen from the top left panel for improved readability. Likewise, the "Dyna Hide" option on the left panel has been selected to only show representative sequence names. Names of the nodes have been color coded to indicate the year of isolation. The node color indicates the country of origin. The available sequences separate into 3 clades belonging to subtypes D8, B3 and H1. 
The subtypes are represented in the parentheses following the names of the sequences. While the subtypes D8 and B3 circulate more globally, causing infections across different countries, the subtype H1 is predominantly found in China. Moreover, in a given country, strains belonging to the same subtype have repeatedly caused measles infections over several years, indicating pathogen persistence in the population, likely due to imperfect and inadequate vaccination practices.

Users can use the VIGOR software tool along with its collection of highly curated reference databases for different viruses to predict viral protein open reading frames and sequences, and to identify typical viral transcriptional and translational exceptions including RNA editing, stop codon read-throughs and ribosomal slippage. ViPR and IRD offer two types of user/community contributed tools for virus classification. A clade classification tool infers clades for a query sequence from its position within a reference phylogenetic tree. Currently, clade classification is available for Zika virus and Hepatitis C Virus (HCV) in ViPR and for swine H1 and H5N1 influenza viruses in IRD. The H5N1 classification tool uses phylogenetic analysis to classify HA sequences according to the WHO H5 classification scheme. The H5N1 classifier has been verified to have >98% accuracy for sequences of at least 300 nucleotides of HA1. On the other hand, the H1N1 classification tool in IRD is a robust application of BLAST to recognize sequences closely related to pandemic sequences. BLAST-based classification is also available for classifying rotavirus A sequences in ViPR. IRD has implemented an HA subtype numbering conversion tool that allows users to convert HA sequence coordinates among any selected subtypes based on protein structure alignment rather than sequence-based alignment.
Using this tool, the user can convert the coordinates of an HA protein sequence to the corresponding coordinates in other subtypes, to compare substitutions associated with phenotypic changes and to identify cross-reactive immune epitopes. The tool can also be integrated with sequence variation analysis and meta-CATS.

• Seamless integration of data and analysis/visualization tools
• Analysis of user data in combination with database data

Users can establish personal workspaces under the "workbench" feature within IRD and ViPR. This tool provides an interface that allows users to save previous search or analysis results, so that they can re-use their work without re-running the analysis. It also allows users to combine multiple analyses. Users can upload and save their own private data and metadata to their workspace to be analyzed using the analytical and visualization tools provided by IRD and ViPR. The saved data and analysis results can be shared with collaborators through their workbench accounts.

The IRD and ViPR databases are open access resources and can be used and shared without restrictions. The databases offer multifaceted user support (Fig. 3). Users can report a problem or ask a question using the forms provided online. The development and management teams of both IRD and ViPR are responsive to questions from the helpdesk and to suggestions for enhancements. Users can join a newsletter mailing list to get information about updates to the resources. Both IRD and ViPR provide extensive tutorials, training modules and manuals. For additional support, the development and management teams engage in outreach sessions that include webinars, tutorials, and training workshops at various geographical locations. For expert users, the analytical tools developed by the ViPR/IRD team are also available on GitHub, which gives users the option of running the tools outside of the IRD and ViPR resources on their preferred platform.
The ViPR and IRD databases continue to provide critical resources for research studies, as evidenced by the increasing number of citations in the scientific literature (Fig. 4). Together, the two databases have been cited in 1080 publications as of May 10, 2019. The number of new sessions initiated per week in 2018 (Google Analytics) was 1488 at ViPR and 1482 at IRD. Importantly, these sessions have been documented from 181 countries for ViPR and 174 countries for IRD.

Virology research is dependent on timely availability of reliable data on viral pathogens, their hosts and the infection/outbreak dynamics. ViPR and IRD offer comprehensive, highly curated data on human viral pathogens along with an intuitive search interface and seamless integration of the data with analytical and visualization tools. The resources are available freely without restrictions. The availability of such resources streamlines and expedites experimental discovery, advancing the ultimate goal of developing improved diagnostics and therapeutics for priority pathogenic viruses.

Fig. 4 The number of peer-reviewed citations for the ViPR and IRD resources.

Relevant Websites
org: Free epitope database and prediction resource
Plant Viruses Online: Descriptions and Lists from the VIDE Database
The Hepatitis B Virus database
Virology links - ViralZone page
Virus Pathogen Database and Analysis Resource. www.viprbrc.org
wwPDB: Worldwide Protein Data Bank. www.pdb.org