key: cord-0882443-3ulketgy
authors: Snyder, E. E.; Kampanya, N.; Lu, J.; Nordberg, E. K.; Karur, H. R.; Shukla, M.; Soneja, J.; Tian, Y.; Xue, T.; Yoo, H.; Zhang, F.; Dharmanolla, C.; Dongre, N. V.; Gillespie, J. J.; Hamelius, J.; Hance, M.; Huntington, K. I.; Jukneliene, D.; Koziski, J.; Mackasmiel, L.; Mane, S. P.; Nguyen, V.; Purkayastha, A.; Shallom, J.; Yu, G.; Guo, Y.; Gabbard, J.; Hix, D.; Azad, A. F.; Baker, S. C.; Boyle, S. M.; Khudyakov, Y.; Meng, X. J.; Rupprecht, C.; Vinje, J.; Crasta, O. R.; Czar, M. J.; Dickerman, A.; Eckart, J. D.; Kenyon, R.; Will, R.; Setubal, J. C.; Sobral, B. W. S.
title: PATRIC: The VBI PathoSystems Resource Integration Center
date: 2006-11-16
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkl858
sha: 2cf60380bb05becd140053f71fa4cc4e4eaa0d5f
doc_id: 882443
cord_uid: 3ulketgy

The PathoSystems Resource Integration Center (PATRIC) is one of eight Bioinformatics Resource Centers (BRCs) funded by the National Institute of Allergy and Infectious Diseases (NIAID) to create a data and analysis resource for selected NIAID priority pathogens, specifically proteobacteria of the genera Brucella, Rickettsia and Coxiella, and corona-, calici- and lyssaviruses and viruses associated with hepatitis A and E. The goal of the project is to provide a comprehensive bioinformatics resource for these pathogens, including consistently annotated genome, proteome and metabolic pathway data to facilitate research into counter-measures, including drugs, vaccines and diagnostics. The project's curation strategy has three prongs: a ‘breadth first’ approach beginning with whole-genome and proteome curation using standardized protocols, a ‘targeted’ approach addressing the specific needs of researchers, and an integrative strategy to leverage high-throughput experimental data (e.g. microarrays, proteomics) and literature. The PATRIC infrastructure consists of a relational database, analytical pipelines and a website that supports browsing, querying, data visualization and the download of raw and curated data in standard formats. At present, the site warehouses complete sequences for 17 bacterial and 332 viral genomes. The PATRIC website (https://patric.vbi.vt.edu) will continually grow with the addition of data, analysis and functionality over the course of the project.

Bioterrorism became an important national security issue (1) following the deliberate release of anthrax spores into the US postal system in October 2001 (2). Meanwhile, emerging and reemerging infectious diseases (3) have had profound effects on public health in many parts of the world. Recognizing the pathogens responsible for these diseases as threats to homeland security, the National Institute of Allergy and Infectious Diseases (NIAID) of the US National Institutes of Health has embarked upon a series of initiatives aimed at developing a comprehensive understanding of the organisms identified as NIAID category A, B and C priority pathogens (for a complete list, see http://www3.niaid.nih.gov/biodefense/bandc_priority.htm). The Virginia Bioinformatics Institute's PathoSystems Resource Integration Center (PATRIC) is one of eight Bioinformatics Resource Centers (BRCs) established to study the NIAID priority pathogens and develop these information resources for the research community. While database resources for bacterial ((4) and those cited in (5)) and viral (6, 7) genomics have been available for a number of years, this project seeks to integrate genomics with comparative genomics and pathway analysis and ultimately proteomics, transcriptomics, immune epitope mapping, host response and other downstream technologies. The goal is to help researchers and clinicians better detect and respond to biothreat agents (and infectious diseases in general) by facilitating the development of diagnostics, vaccines and therapeutics. This requires access to comprehensive information on the molecular biology, physiology and pathogenicity of these organisms.

PATRIC is responsible for the eight organism categories listed in Table 1. The three genera of proteobacteria are all intracellular pathogens that are known or potential biowarfare agents. In the 1950s, Brucella suis was the first infectious agent developed for use as a biowarfare agent by the United States. Brucellosis, caused by Brucella sp., is an important agricultural disease infecting cattle, sheep, goats and swine as well as humans. It is highly contagious and readily dispersed as an aerosol (8). Coxiella burnetii, the causative agent of Q fever, is a highly infectious agent of relatively low lethality. Its interest as a biowarfare agent stems from its high infectivity, stability to heat and desiccation and potential for aerosol dispersal. The genus Rickettsia contains the organisms responsible for numerous types of typhus and arthropod-borne spotted fevers (9, 10). Rickettsia prowazekii was developed as a bioweapon by the USSR in the 1930s and was used by the Japanese in Manchuria during World War II (11).

The five categories of viruses studied by PATRIC are all positive-strand ssRNA viruses, with the exception of Lyssaviruses, which have negative-strand ssRNA genomes. While there are no reports of any of these viruses being weaponized, they represent the causative agents for a number of emerging and reemerging diseases including Severe Acute Respiratory Syndrome (SARS), rabies and transmissible gastroenteritis. Recombinant vaccines for these viruses are either still in development or unavailable in areas where these infections are endemic or epidemic, compounding the public health risk.

The pace of research on these organisms has increased significantly since the turn of the millennium, with outbreaks, such as that of SARS in 2003 (12, 13), spawning a flurry of scientific activity. The widespread use of automated DNA sequencing, microarray gene expression analysis and other high-throughput laboratory technologies has increased the volume of data produced, but not necessarily its accessibility. Currently, significant genomics and bioinformatics expertise is required to extract, process and interpret this wealth of data.

To address these problems, PATRIC has created an interdisciplinary team of bioinformaticians, software engineers, computational biologists and organism experts to build a publicly accessible resource aimed at providing high quality, analyzed and curated data to the infectious disease community working on these pathogens. To date, we have achieved the following objectives:

(i) collection and organization of existing genomic data for the eight pathosystems under a single, unified framework;
(ii) genome annotation and curation following standardized procedures;
(iii) visualization of raw data from analytical programs, as well as curated data;
(iv) creation of orthologous gene groups within each organism category, allowing comparative analysis of gene content;
(v) prediction and visualization of bacterial metabolic pathways to complement functional analysis of proteins;
(vi) integration of online literature reviews from PathInfo (14) for selected organisms.

Longer-term goals include integration of data from gene expression and proteomics experiments (including host response), predicted protein and RNA secondary and tertiary structures, and well-cataloged literature compilations. Ultimately, we hope our website will become an essential tool for researchers working on these pathogens and provide networking opportunities within the pathogen research communities.

PATRIC is implemented on Oracle 9i RDBMS using the Genomics Unified Schema (GUS) version 3.5, developed at the Computational Biology and Informatics Laboratory at the University of Pennsylvania (see http://www.gusdb.org). GUS is used to store all sequence data and associated annotation with the exception of metabolic pathway data, which is stored in a Pathway Tools database (15).

The database is populated with all known full-length or nearly full-length genomic sequences for the eight organism categories listed in Table 1. Automated scripts query GenBank (16) daily to identify new or updated records. The corresponding sequences, annotation and associated literature are retrieved from NCBI and loaded following curatorial review to remove redundancies and assign unique names to each genome. RefSeq (17) records are used when available to take advantage of their more thorough and consistent annotation. Draft genome sequences from Joint Genome Institute (JGI)/Los Alamos National Labs (LANL) and the NIAID-funded Microbial Sequencing Centers will also be part of the PATRIC dataset. In addition to genome sequences and primary annotation from the original GenBank or RefSeq entry, the database stores the results of all automated and manual analyses described in the following section.

Our motivation to invest resources in sequence-level annotation is to maintain a high standard of quality over time. Even when good reference annotation is available, there are many reasons to re-annotate microbial genomes (18) . GenBank data are of variable quality and there is a trend towards depositing draft genome sequences with no annotation at all. In-house annotation also allows us to present supporting evidence and keep the annotation up to date. This is of particular importance for alignment-based annotation since databases such as GenBank (16) and UniProt (19) continue to grow at a prodigious rate.

Due to the large number of closely related genomes in each organism category, we have adopted an annotation strategy in which automated methods are applied to all genomes while detailed manual curation is applied to a limited number of reference genomes. The strains B. suis 1330, C. burnetii RSA 493 and R. prowazekii str. Madrid E were chosen as reference genomes for their respective categories. Each viral category has (or will have) multiple reference genomes, representing phylogenetically diverse strains.

Automated nucleic acid and protein sequence annotation is accomplished using a Java-based genome annotation pipeline (unpublished), which reads an XML script containing the names and parameters of the analytical applications. The bacterial pipeline executes the gene prediction programs Glimmer (20) and GeneMark (21, 22) followed by start site correction programs RBSfinder (23) and TICO (24). BLASTX (25) searches the non-redundant protein database, complementing the ab initio gene prediction methods. RNA genes are identified by tRNAscan-SE (26) and BLASTN searching against a ribosomal RNA database (27, 28). The annotation protocol containing the full list of applications and parameters is available online at https://patric.vbi.vt.edu/documents/ under 'standard operating procedures'.

Results of the genome analysis pipeline are merged with original GenBank or RefSeq features for automated interpretation. A decision tree is used to classify genes into categories based on the level of agreement between the various prediction methods. Genes that are unambiguously predicted by multiple methods are automatically 'finalized', creating new 'gene', 'CDS' and/or '[t/r]RNA' features. The remaining genes are marked for manual curation. For viral genomes, an abbreviated pipeline is executed that emphasizes sequence alignment for gene identification and employs GeneMarkHMM optimized for mammalian (host) genomes.
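The agreement-based classification described above can be illustrated with a minimal Python sketch. The `Prediction` record, the rule that two agreeing methods suffice, and the category labels are assumptions made for this example only; the actual decision tree is defined in the PATRIC standard operating procedures and is not reproduced here.

```python
from collections import defaultdict
from typing import NamedTuple


class Prediction(NamedTuple):
    method: str  # e.g. "Glimmer", "GeneMark", "GenBank" (labels are illustrative)
    start: int   # position of the predicted start codon
    stop: int    # position of the predicted stop codon
    strand: str  # "+" or "-"


def classify(predictions, min_methods=2):
    """Label each locus 'finalized' or 'manual'.

    Hypothetical rule: predictions sharing a stop codon and strand describe
    the same locus. The locus is 'finalized' when at least `min_methods`
    distinct methods also agree on the start codon; otherwise it is marked
    for 'manual' curation.
    """
    by_locus = defaultdict(list)
    for p in predictions:
        by_locus[(p.stop, p.strand)].append(p)

    labels = {}
    for locus, group in by_locus.items():
        methods_per_start = defaultdict(set)
        for p in group:
            methods_per_start[p.start].add(p.method)
        best_support = max(len(m) for m in methods_per_start.values())
        labels[locus] = "finalized" if best_support >= min_methods else "manual"
    return labels


predictions = [
    Prediction("Glimmer", 100, 400, "+"),
    Prediction("GeneMark", 100, 400, "+"),   # same start and stop: agreement
    Prediction("Glimmer", 900, 1500, "+"),
    Prediction("GeneMark", 930, 1500, "+"),  # same stop, different start
    Prediction("BLASTX", 2300, 2000, "-"),   # supported by one method only
]
labels = classify(predictions)
```

In this toy input only the first locus is finalized; the start-site disagreement and the single-method prediction are both routed to a curator, which mirrors the division of labor described in the text.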

After curatorial review, finalized protein-coding (CDS) features are translated and subjected to another pipeline executing InterProScan and structure prediction methods such as MEMSAT 2 (29). Currently, each protein is associated with GO terms (30), TIGRroles and Enzyme Commission numbers (31) based on Pfam (32) and TIGRfam alignments (for a description of TIGRfam and TIGRroles, see http://www.tigr.org/TIGRFAMs/). The protocol for automated proteome annotation is also available online. Manually curated protein sequences will be available in early 2007.

Once protein sequences are inferred from each genome in an organism category, putative ortholog groups are generated using BLASTP for all pairwise genome combinations and applying the conventional bidirectional-best-hit (BBH) criterion (33) . While putative ortholog groups within the bacterial categories are generally well defined, many viral proteins cannot be readily clustered using the stringent BBH criterion. This is an active area of curation. Using the ortholog groups as a starting point, a reference protein list is created for each bacterial category consisting of the proteins of the reference genome (each representing one ortholog group) plus a representative protein from each ortholog group identified in the associated genomes. A gene occurring in only a single genome constitutes a 'group' of one and would be included in the reference list. The reference protein lists will be manually curated and include, whenever possible, detailed functional descriptions, gene symbols, GO terms and EC numbers. Thus, every protein in the database will either be manually curated or be linked to an ortholog group member that has been manually curated.
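The bidirectional-best-hit criterion can be sketched in a few lines of Python. The sketch assumes that the BLASTP results for one pair of genomes have already been reduced to dictionaries of bit scores keyed by (query, subject); the protein identifiers and scores are invented for illustration, and ties are resolved by keeping the first hit seen, which is a simplification.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein.

    `scores` is a dict {(query, subject): bit_score} for one direction
    of a genome-versus-genome BLASTP comparison.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which each protein is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented scores for two small genomes, A and B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b2"): 120, ("a3", "b1"): 90}
b_vs_a = {("b1", "a1"): 198, ("b2", "a2"): 118, ("b2", "a1"): 40}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

Here `a3` finds `b1` as its best hit, but `b1` prefers `a1`, so `a3` is left unpaired. A protein left unpaired in every genome comparison corresponds to the 'group' of one mentioned above.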

The ortholog groups are further processed to create multiple sequence alignments (MSAs) using MUSCLE (34) with default parameters. Phylogenetic trees are estimated from trimmed alignments by the neighbor-joining method (35) using PHYLIP (36) and validated by bootstrapping (37) with a minimum of 100 replicates.
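The resampling step behind bootstrap validation can be sketched as follows. This is not the PHYLIP implementation; it is a minimal Python illustration in which the toy alignment, the function names and the fixed random seed are assumptions. Each replicate is built by drawing alignment columns with replacement until the original alignment length is reached.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    `alignment` maps sequence names to aligned sequences of equal length.
    Columns are sampled with replacement, so some columns appear several
    times in the replicate and others not at all.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n=100, seed=0):
    """Generate `n` replicates reproducibly from a seeded random generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]


# Toy alignment, invented for illustration.
alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "AGGTAC"}
replicates = bootstrap_replicates(alignment, n=100, seed=1)
```

A tree would then be estimated from every replicate, and the support for a clade reported as the fraction of replicate trees that contain it.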

To help users understand the function of the bacterial proteins in context, we have adopted the Pathway Tools system (15) to derive pathways from genome annotation and to fill potential gaps in annotation known as pathway holes. The system takes a list of protein names, descriptions and EC numbers as input. Proteins with EC numbers can be assigned roles directly; the roles of other proteins are suggested by lexicographic analysis of descriptive information and/or analysis of gene order from homologous regions of related genomes and confirmed or rejected by the curation staff. The output is a database with integrated web server that allows users to browse and query the organism's metabolic pathways. This system has been integrated with the PATRIC website, allowing users to access pathway information for all bacterial reference genomes. The current analysis was based on preexisting RefSeq or GenBank annotations; later releases will incorporate data curated in house, unifying the genomic and pathway versions of the data. Pathway analysis can facilitate the identification of metabolic choke points, that is, critical enzymes that are candidate targets for antimicrobial drugs.
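The notion of a pathway hole can be made concrete with a small Python sketch. It is not the Pathway Tools algorithm, which draws on a curated pathway database and additional evidence; the pathway names, the EC numbers chosen and the rule that a pathway counts as present when at least half of its reactions are annotated are all assumptions for illustration.

```python
def pathway_holes(pathways, annotated_ec, min_fraction=0.5):
    """Report reactions lacking an annotated enzyme in pathways judged present.

    `pathways` maps a pathway name to the list of EC numbers of its reactions.
    `annotated_ec` is the set of EC numbers assigned to proteins in the genome.
    A pathway is treated as present (hypothetical rule) when at least
    `min_fraction` of its reactions have an annotated enzyme.
    """
    report = {}
    for name, reactions in pathways.items():
        missing = [ec for ec in reactions if ec not in annotated_ec]
        fraction_present = (len(reactions) - len(missing)) / len(reactions)
        if missing and fraction_present >= min_fraction:
            report[name] = missing
    return report


# Toy pathways; the EC numbers serve only as example identifiers.
pathways = {
    "toy_pathway_A": ["2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"],
    "toy_pathway_B": ["1.1.1.1", "1.2.1.3"],
}
annotated_ec = {"2.7.1.1", "2.7.1.11", "4.1.2.13"}

holes = pathway_holes(pathways, annotated_ec)
```

In this example `toy_pathway_A` has three of its four reactions annotated and is reported with one hole, whereas `toy_pathway_B` has no annotated reactions and is treated as absent rather than as a pathway made entirely of holes.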

Pathway analysis can also yield clues to pathogenesis by comparing virulent and avirulent strains and examining the roles of genes not present in both strains.

The PATRIC website is hosted on a Sun Microsystems v20z server running SuSE Linux using the Apache web server. Applications are written in PHP and Perl, accessing data from an Oracle 9i server hosted on a Sun Microsystems E15000 running Sun OS.

The conceptual organization of the website is described in Figure 1. The website's home page contains news, a navigation bar and the list of PATRIC organisms. Users can select their organism of interest from the list to access the corresponding organism category page. This page contains a table of genomes currently in our database with links to the three principal representations of individual genomes: the genome summary, genome browser and gene table. These pages allow users to view a summary of genome sequencing information and to identify specific genes and link to their corresponding gene, protein and pathway information pages. The gene information page displays the output of sequence analysis software run by the annotation pipeline, as well as curated data. Similarly, the protein information page displays InterProScan and TIGRfam alignments and associated information such as GO terms and EC numbers. For bacterial genomes, the pathway information page illustrates the protein's position in the organism's metabolic network and links to a wealth of information provided by Pathway Tools.

Figure 1. Conceptual map of PATRIC website. Arrows show the relationship between the principal datatype on a page and related data on neighboring pages. Solid arrows represent 'drilling down' to more specific information (e.g. from genome to gene). Dashed arrows represent links between different views of conceptually similar data (e.g. between ortholog group and phylogenetic tree). This figure represents only a subset of the pages and links on the actual website.

The organism category page also contains links to a pathogen summary, ortholog group table and a phylogenetic tree based on 16S rRNAs for bacteria or a selected protein family for viruses. For bacterial genomes, detailed pathosystem information is available, provided by the VBI PathInfo documents (14). The ortholog group table shows the presence or absence of each reference-list protein in every genome of the organism category and provides links to an MSA and tree viewer and the Base-By-Base MSA editor (38) for every ortholog group. Base-By-Base allows users to add sequences to the MSA, recalculate it using Clustal (39), T-Coffee (40) or MUSCLE and generate the corresponding tree using neighbor-joining or a number of clustering algorithms.

The PATRIC website also supports analytical and query tools. A database search page allows user-supplied sequences to be searched with BLAST against reference and curated sequences from PATRIC organisms. The page also supports MUMmer (41) comparisons between genomes in the database or with a user-supplied sequence. A query tool is available throughout the site, allowing users to retrieve genes by name, ID or description, as well as by GO and EC identifiers and descriptions. Questions, comments and suggestions concerning the website and its contents may be submitted via the 'feedback' page, accessible from the menu bar.

The PATRIC database is hosted at the Virginia Bioinformatics Institute at Virginia Tech and can be accessed via web browser at https://patric.vbi.vt.edu. Sequences and annotation in GFF3 format (see http://song.sourceforge.net/gff3.shtml) can be downloaded by following the 'downloads' link on the main menu bar. GFF3 files are also available through BRC-Central at: http://brc-central.org.
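For readers who wish to process the downloaded files, the following Python sketch parses a single GFF3 feature line into a dictionary. GFF3 lines carry nine tab-separated columns, the last of which holds `key=value` attributes separated by semicolons with URL-style escaping. The example record, including its sequence identifier, source label and attribute values, is invented and does not come from a PATRIC file.

```python
from urllib.parse import unquote

GFF3_COLUMNS = 9


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) != GFF3_COLUMNS:
        raise ValueError(f"expected {GFF3_COLUMNS} columns, found {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, raw_attributes = columns

    # Attributes are key=value pairs; a value may hold several comma-separated items.
    attributes = {}
    for pair in raw_attributes.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }


# Hypothetical record for illustration only.
example = "seq1\texample_source\tCDS\t100\t400\t.\t+\t0\tID=cds1;Name=hypothetical%20protein\n"
record = parse_gff3_line(example)
```

Returning attribute values as lists keeps multi-valued attributes such as `Parent` or `Dbxref` intact, at the cost of a one-element list for the common single-valued case.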

This paper presents the first detailed description of the PATRIC website. Future development will advance on several fronts. Genome and proteome curation will continue, complemented by improved tools for query, analysis and visualization. For viruses, we will transition to the more widely accepted ICTV taxonomy (42). The website's user interface is being enhanced to integrate organism-, tool/task- and data-centric approaches to data access, allowing users more efficient and effective access to PATRIC resources. This will be followed up by prioritized curation targeted at potential drug and vaccine targets, virulence factors and genes with differential representation or polymorphisms associated with clinically significant phenotypes. Leveraging another NIAID-funded VBI project, the Administrative Resource for Biodefense Proteomics Research (http://www.proteomicsresource.org/), we plan to integrate expression profiling and proteomics data from pathogen and host to better understand the pathosystem's biology and help the community identify targets for counter-measures. We anticipate that integrating these disparate data types into a single, easy-to-use system will enable pathogen researchers to make full use of available data to develop diagnostics, vaccines and therapeutics.

Biodefence on the research agenda

Investigation of bioterrorism-related anthrax

The challenge of emerging and re-emerging infectious diseases

The comprehensive microbial resource

xBASE, a collection of online databases for bacterial comparative genomics

New bioinformatics tools for viral genome analyses at Viral Bioinformatics-Canada

VirGen: a comprehensive viral genome resource

Bichat guidelines for the clinical management of brucellosis and bioterrorism-related brucellosis

The past and present threat of rickettsial diseases to military medicine and international public health

Rickettsial pathogens and their arthropod vectors

Principles of the malicious use of infectious agents to create terror: reasons for concern for organisms of the genus Rickettsia

Outbreak of severe acute respiratory syndrome-worldwide

Aetiology: Koch's postulates fulfilled for SARS virus

PIML: the Pathogen Information Markup Language

The Pathway Tools software

NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins

The past, present and future of genome-wide re-annotation

The Universal Protein Resource (UniProt): an expanding universe of protein information

Improved microbial gene identification with GLIMMER

GeneMark.hmm: new solutions for gene finding

GenMark: parallel gene recognition for both DNA strands

A probabilistic method for identifying start codons in bacterial genomes

TICO: a tool for improving predictions of prokaryotic translation initiation sites

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence

The European ribosomal RNA database

The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs

A model recognition approach to the prediction of all-helical membrane protein structure and topology

The Gene Ontology (GO) database and informatics resource

The ENZYME database in 2000

The Pfam protein families database

A genomic perspective on protein families

MUSCLE: a multiple sequence alignment method with reduced time and space complexity

The neighbor-joining method: a new method for reconstructing phylogenetic trees

Phylogenetic analysis using PHYLIP

Confidence limits on phylogenies: an approach using the bootstrap

Base-By-Base: single nucleotide-level analysis of whole viral genome alignments

Multiple sequence alignment with the Clustal series of programs

T-Coffee: A novel method for fast and accurate multiple sequence alignment

Versatile and open software for comparing large genomes

International Committee on Taxonomy of Viruses and the 3,142 unassigned species

We would like to thank Chris Upton for making the application Base-By-Base (38) available for incorporation into our website, and Peter Karp for making a similar contribution with his Pathway Tools software (15). This work is funded through NIAID contract HHSN266200400035C to Bruno Sobral. Funding to pay the Open Access publication charges for this article was provided by NIAID contract HHSN266200400035C to Bruno Sobral.

Conflict of interest statement. None declared.