key: cord-0907508-xfzhn1n1 authors: Jabado, Omar J.; Liu, Yang; Conlan, Sean; Quan, P. Lan; Hegyi, Hédi; Lussier, Yves; Briese, Thomas; Palacios, Gustavo; Lipkin, W. I. title: Comprehensive viral oligonucleotide probe design using conserved protein regions date: 2007-12-13 journal: Nucleic Acids Res DOI: 10.1093/nar/gkm1106 sha: ad7e256a901378cfdb314e11ab3504eda659df42 doc_id: 907508 cord_uid: xfzhn1n1 Oligonucleotide microarrays have been applied to microbial surveillance and discovery where highly multiplexed assays are required to address a wide range of genetic targets. Although printing density continues to increase, the design of comprehensive microbial probe sets remains a daunting challenge, particularly in virology where rapid sequence evolution and database expansion confound static solutions. Here, we present a strategy for probe design based on protein sequences that is responsive to the unique problems posed in virus detection and discovery. The method uses the Protein Families database (Pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. In silico testing using an experimentally derived thermodynamic model indicated near complete coverage of the viral sequence database. The capacity of DNA microarrays to simultaneously screen for hundreds of viral agents makes them an attractive supplement to traditional methods in microbiology. Their utility has been demonstrated through detection of papilloma virus in cervical lesions (1) , SARS coronavirus in tissue culture (2) , parainfluenza virus 4 in nasopharyngeal aspirates (3) , influenza from nasal wash and throat swabs (4, 5) , gammaretrovirus in prostate tumors (6) , coronaviruses and rhinoviruses from nasal lavage (7) , metapneumovirus from bronchoalveolar lavage (8) , filoviruses and malarial parasites in blood in hemorrhagic fever (9) , and a wide variety of respiratory pathogens in nasal swabs and lung tissue (10) . Viral microarrays have increased in density and strain coverage as fabrication technologies have improved. cDNA pathogen arrays derived from reference strain nucleic acids (11, 12) have been replaced by oligonucleotide arrays due to their increased flexibility. Oligonucleotide design strategies have focused on pairwise sequence comparisons to identify conserved regions within a variety of viral pathogens (13) (14) (15) . Multiple alignments have been used to design probes for clinically important virus genera, e.g. rotaviruses (16) , orthopoxviruses (17) or influenzaviruses (18) . Viral resequencing arrays have recently been introduced that allow single nucleotide resolution (4, (19) (20) (21) . Although such tiling arrays enable accurate typing, the number of probes required to build a resequencing array for all viral sequences exceeds current art. A comprehensive viral microarray should address the entire viral sequence database. Pairwise nucleic acid comparisons, while rapid, do not scale well with sequence number and ignore valuable coding information. Nonoverlapping segments, heterogeneous sizes and the large number of sequences preclude automated multiple alignments of nucleic acids for probe design. Protein-protein comparisons are more sensitive for detecting conserved regions due to the power of substitution matrices (22) ; however, at the time of writing, no reported oligonucleotide design algorithm leverages this information. The Protein Families Database (Pfam) (23) is a repository of hand curated protein multiple alignments and Hidden Markov Models (HMMs) across all phylogenetic kingdoms. HMMs are probabilistic representations of protein alignments that are well suited to identifying homologies (24, 25) . Beginning with the Pfam database as a foundation, we established a tiered method for creating viral probes that uses all sequence information available for viruses. Our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and finally, sliding windows to ensure near complete coverage of the database. We pursued experiments to determine the effects of probetarget mismatch and background nucleic acid concentration on array sensitivity and specificity; results were used to derive parameters for probe design. West Nile virus RNA (WNV, strain New York 1999, AF202541) was used as template in hybridization experiments on an Agilent oligonucleotide array with 1131 complementary probes of length 60 nucleotides (nt). Approximately one third of the probes had between 1 and 20 randomly introduced mismatches. The plus and minus (reverse complement) strands of each sequence were deposited, in duplicate. In addition to the flaviviral specific probes, the array contained nearly 36 500 probes for other viral families, negative and positive controls. A volume containing 10 6 copies of WNV and 200 ng of background nucleic acid (human lung tissue RNA) was amplified using random primers and hybridized in four replicate experiments as previously described (10) . Hybridizing a WNV isolate of known sequence allowed prediction of probe-viral hybrid strength and correlation to fluorescence data. To predict hybrids with high accuracy, Smith-Waterman alignments of the virus sequence against microarray probes were generated using the EMBOSS bioinformatics suite (26) . The number of mismatches was calculated for each expected probe-target pair. The change in Gibbs free energy at 658C (hybridization temperature) was calculated using PairFold version 1.7 (27) as a separate measure of probe-template binding strength. PairFold employs a dynamic programming algorithm to compute the minimum free energy structure (excluding pseudo-knots); the standard free energy model is used (28) with empirical nearest neighbor energies (29) . The arrays were visualized with an Agilent slide scanner, then processed with the quantile normalization technique (30) . SPSS version 14 was used for statistics and data plots (http://www.spss.com/), fluorescence data is available as supplementary material. The EMBL nucleotide sequence database [July 2007, Release 91; 461,353 nucleic acid sequences (31) ] was chosen as the reference for this study because it is tightly integrated with the Pfam protein family database (23, 32 Taxon growth was estimated using a standard least squares method, with the SPSS statistical package. A non-redundant database comprising 74 044 sequences was generated with CD-Hit (33), using a similarity cutoff of 98% to define sequences as identical. Bacteriophages were not included in the analysis; however, data were retained to allow probe design using the EMBL phage database. The Pfam database is comprised of hand curated seed protein alignments that are converted to a probabilistic representation using HMMs. These HMMs are used to search the protein database for homologues that can be added to the seed to create a comprehensive alignment (23, 24) . Pfam domains were analyzed to identify short, conserved protein regions and corresponding nucleic acid sequences. In the first step, the log-odds score for each position of the HMM built from the seed alignment was summed; lower scores were considered to indicate conservancy. The most conserved, non-overlapping 20 amino acid (aa) regions were identified. In the second step, protein alignments of all Pfam-A families were extracted and mapped to their underlying nucleotide sequences by cross reference to the EMBL records. HMM parsing modules from the BioPerl package were used. In the third step, the underlying nucleotide sequences were extracted and stored. In cases where the region contained gaps, flanking nucleotides were brought together to yield sequences of length 60. These sequences formed the basis for downstream probe design. Domain alignments in the Pfam-B were not used in probe design because they are of lower quality; also, as domain quality improves these alignments will be integrated into Pfam-A (23). All coding nucleic acid sequences that were not part of a Pfam-A alignment were extracted. In this step, the most conserved regions within homologous genes were identified for probe design. Sequences were clustered at the protein level with CD-Hit, using a similarity threshold of 80%. All sequence clusters were subjected to a MEME motif search (34) using the following parameters: motif width of 20, zero or one motif allowed per sequence, a minimum of two sequences per motif. Three motifs were selected for each sequence cluster. The underlying nucleic acid sequence extracted for each protein motif was used for probe design. A sliding window approach was used for highly divergent sequences that did not share any motifs. Using the PAM250 matrix (35) a summed log odds scores for every 20 aa subsequence in the protein was calculated; the three least likely to vary (lowest log odds score) were selected as regions for probe design. Viruses often have highly conserved non-coding regions at the termini of their genomes or genome segments that serve critical roles in replication, transcription, and packaging. We reasoned that probes based in these regions may be useful in microarray design. We identified conserved probes across homologous regions in sequences annotated as 5 0 UTR, 3 0 UTR, LTR, and those without annotation. Sequences were first clustered at the 80% threshold. Clustered sequences were then subjected to a motif search using the same parameters employed for proteins, except that a length of 60 nt per motif was specified. We addressed sequences that did not contain a shared motif separately; three non-overlapping 60 nt subsequences were chosen as probes. Probe selection and minimization with set cover algorithm An algorithm was designed to automate identification of the minimum set of probes required to address a repertoire of potential viral targets (36) . In the first step of analysis, the number of mismatches between a probe and its viral target was computed; the algorithm considered a probe to be 'covering' if it had 5 mismatches to the template. Coverage data were converted to a matrix of binary values. A greedy algorithm was implemented to choose a probe combination from the matrix, minimizing the number required probes. Candidate probes were further screened to ensure a T m >608C, no repeats exceeding a length of 10 nt, no hairpins with stem lengths exceeding 11 nt, and <33% overall sequence identity to non-viral genomes. Because it is not feasible to test all probes with all known viruses, we tested probe validity using a Gibbs free energy model of hybridization. All probe sequences were compared to the non-redundant set of viral sequences by BLASTN (37) . Probe-target pairs were aligned by Smith-Waterman to ensure accuracy; mismatches and change in Gibbs free energy at 658C (hybridization temperature) were then calculated. To gauge the performance of our probe selection algorithm, another comprehensive method was devised that used only nucleic acid sequence. Sequences in the Reference Sequence Viral Genomes project (38) are evenly distributed among viral families; therefore, we reasoned that probes derived from these sequences would provide broad coverage. To contrast with our method, we selected 60 nt oligonucleotides end-to-end along all viral genomic RefSeq sequences (1701 viruses). This resulted in a tiling probe-set where the length of a sequence was proportional to the number of interrogating probes. The viral sequence database is dominated by gene fragments We queried the EMBL viral database to determine the frequencies of coding sequences and full genomic sequences. The majority of viral sequences were <1 kb 1982 1984 1986 1988 1990 1992 1994 1996 1998 A commonly used method to reduce sequence complexity is generation of a non-redundant sequence set by clustering (33) . We grouped sequences at the 98% identity level and selected the longest sequences as unique representatives of each group. This method was used to assess the growth of sequence diversity between January 2000 and the current release of July 2007. The database grew 600% in the 7-year period; doubling every three years. Unique sequences decreased as a proportion of the database, from 42% to 27%; overall growth of unique sequences was 378% (Figure 1b) . The current database comprised 74 044 unique sequence representatives at the 98% similarity level. Thus, the growth in the number of sequences in the viral database has been rapid, while growth in diversity has been more modest. One hypothesis to explain this slower growth of sequence diversity is that many of the existing viruses infecting humans have already been discovered and new isolates deposited are variants of well studied viruses. We charted the growth of viral taxonomic groups as a function of time to visualize trends in viral discovery (Figure 1c ). The number of families and genera has remained stable since 1996; however, the number of sequences that have been classified as a new species has steadily risen. A least squares fit of this growth indicates that the steady increase in new species characterization is likely to continue, while the discovery of new viral families will be less common. A tiered, protein-motif-based approach to probe design addresses all viral sequences Nucleotide sequences were divided into four subtypes: (i) coding sequences that corresponded to Pfam-A alignments (cPf), (ii) coding sequences not in the Pfam-A (cNPf), (iii) sequences that were annotated as untranslated regions (UTR) or long terminal repeats (LTR) and (iv) sequences that were unannotated (UA). We sought to match the quality of Pfam-A alignments in the non-Pfam coding sequences by clustering them into groups of related sequences, approximating homologous genes. These were then subjected to a protein motif finding program to identify the conserved regions within each cluster. The untranslated and unannotated sequences were subjected to a similar clustering analysis, but at the nucleotide level. All four subtypes were subjected to the same three step design method: identification of conserved regions, extraction of nucleotide probe sequences, and minimization of covering probes. By allowing a limited number of mismatches to cognate templates, the number of probes required can be reduced. The mismatch threshold was determined based on experiments with West Nile virus (strain New York 1999, AF202541) that indicated high, homogenous fluorescence signal was observed if probes had five or fewer mismatches to the viral template ( Figure 2 ). The probe minimization technique serves to lower microarray printing costs and simplify analysis while maintaining sequence coverage. A flowchart of the design method is depicted in Figure 3 . The most recent Pfam-A release (Version 22) comprised 9318 families, of which 1540 had viral members. Of 405 543 annotated protein sequences with length >20 aa, 278 119 (68.6%) belonged to a Pfam-A family, while 127 424 (31.4%) did not. Three probes were chosen for each gene, yielding a total of 104 467 cPf and 133 513 cNPf probes. Of sequences not contained in Pfam-A, only 5.6% (6956) were found in Pfam-B alignments. Thus, due to the lower quality of alignments (23) and poor viral representation, the Pfam-B was not used for probe design. The 12 428 untranslated regions processed yielded 4616 probes. For the 24 841 unannotated sequences processed, 13 740 probes were designed. Sequences that were not covered due to high/low GC%, low complexity, repetitive sequence or a preponderance of ambiguous nucleotides (4244) were processed with a sliding window strategy; 14 530 probes were designed. Overall, the number of probes required to address all viral sequences was 270 866. Sequence counts and probe counts for the most recent EMBL/Pfam release are detailed in Figure 4 . An example of typical probe distribution is shown with respect to the Dengue virus 1 genome (NC_001477; Figure 5 ). Probe sequence composition is a major determinant of hybridization signal and is responsible for much of the variance between probes that target the same nucleic acid strand. Probe-target thermodynamics have been successfully modeled to predict fluorescence (41, 42) , control for variance (43) and even estimate concentrations of target detected in samples (44) . Observing that some probes with more than five mismatches to their targets showed strong fluorescent signal, we concluded that sequence composition is a major factor in our array platform. We sought to validate the probe design method by generating a simple thermodynamic model to predict hybridization signal based on sequence composition. We computed the change in Gibbs free energy (ÁG) for all expected probe-viral nucleic acid pairs in the West Nile virus hybridization experiments described above. The calculation method employed finds the most thermodynamically stable structure (minimum free energy) (28) based on empirically established nearest neighbor energies (29) . Strong signal was observed from probe-virus hybrids with ÁG of À32.5 kJ or less. Thus, this value was chosen as the threshold to classify a probe as likely to generate high signal when the cognate viral target is present (Figure 6 ). Probes will be designed in the area of short motifs of 20aa or 60nt Figure 3 . Comprehensive motif-based probe design. The EMBL viral database is clustered with a threshold of 98% nucleotide identity to create a non-redundant sequence database. Coding sequences are subjected to an amino acid motif search, and then probes are made from the underlying nucleic acid sequences. Similarly, nucleic acid motifs are found in non-coding sequences and used to make probes. Database coverage is checked; supplementary probes for highly divergent sequences are designed as necessary. Acronyms: Pfam-Protein Families database, MEME-Multiple Expectation maximization for Motif Elicitation, UTR-untranslated region, LTR-long terminal repeat. Motif-based probe design provides higher coverage than virus genome tiling Use of motif finding and set cover minimization markedly increases the computational resources needed to generate probe sets. To determine whether increased complexity results in a more comprehensive probe set, we compared our method to a genome tiling strategy. Probes of 60 nt were designed end-to-end along the entire genome for all 1701 Reference Sequence viral strains available as of May 2007. The tiling probe set served as a contrast to our design method since it was based on nucleic acid sequence, had more probes per gene, required less computation, and included viruses from all genera. In comparison of the methods, the following rules were used to compute database coverage: sequences >400 nt in length were considered covered if six or more probes met hybridization criteria; sequences <400 nt in length were considered covered if two probes met hybridization criteria; sequences <200 nt in length were considered covered with a single probe meeting hybridization criteria. Coverage of the entire database was gauged by computing probe-template ÁG for all 74 044 unique sequence representatives. Database coverage using the tiling method was 47.8% and required 850 136 probes; coverage using the motif-based method was 99.7% and required 270 866 probes (Table 1) . Whereas probe design in motif-based arrays can exploit partial genome sequences, probes in tiling arrays are based on full length genome sequences. Complete Reference Sequence genomes represent 1% of EMBL sequence entries. Although at least one full length genome sequence is described for all viral genera, only 49% (1701 of 3441) of viral species have a fully sequenced representative genome. The impact of differences in the motif and tilingbased strategies for probe design is reflected in differences in coverage. For the tiling-based probe-set, 40 of 44 families with <80% sequence coverage included species lacking representative genomes. Coverage with motifbased probe-sets for these same species was !93%. There is an increasing appreciation for the power of microarray technology in clinical microbiology, public health and environmental surveillance. Viral microarray probe design poses unique challenges due to the rapid increase in sequence data and the high propensity for sequence divergence within viral taxa. To ensure coverage of the newest isolates it is essential to consider partial as well as complete genomic sequences in probe design. Probe design based on multiple alignments or pairwise comparisons of nucleic acids for all known sequences is computationally intensive and scales poorly with database size. Protein sequence comparisons are rapid and incorporate rich evolutionary models, but require a cumbersome mapping step to extract underlying nucleic acid sequence. We have described a method that capitalizes on the Pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions. The protein motif-centric method has several advantages: (i) the majority of viral nucleic acid sequences encode proteins; thus, using this information leverages knowledge about function; (ii) protein sequence comparison and the resulting probesets are independent of viral taxonomy; this may enable incorporation of misclassified sequences; (iii) the Pfam is a well established and highly annotated database that will provide a basis for future design efforts; and (iv) probes designed in conserved regions may be able to detect novel viruses. A second application of this design method is viral expression profiling. Insights into the replication cycle, host evasion and virulence factors may be obtained by monitoring viral transcript levels during infection. To this end, arrays could be synthesized that combine probes for a single viral family and all host genes. Because the viral probe sets generated by our method account for known variants across all genes, a variety of strains could be profiled with a single array. This would provide a unique experimental platform for investigating virus biology, while minimizing fabrication cost and simplifying analysis. The thresholds used to design and validate probes were experimentally determined for the Agilent Technologies array platform and the types of clinical samples our Figure 6 . Gibbs free energy model of hybridization signal. The change in Gibbs free energy of probe-West Nile virus hybrids was computed. Aliquots of West Nile virus (New York 1999 strain RNA) at 10 6 copies were spiked into 200 ng of human lung (background) RNA. The fluorescent signal values of replicate arrays were log 2 transformed, normalized, and converted to Z-scores. 95% confidence intervals of the mean for fluorescence versus Gibbs energy is plotted. Probe-virus hybrids with free energy -32.5 kJ had high fluorescence; this value was chosen as the threshold for considering a probe likely to generate a strong signal when the target virus is present (dotted line). laboratory encounters. Probe length can be selected to emphasize efficient coverage of higher order taxa or speciation. The goal of this project is to cover all known viral sequences and optimize potential for detecting related viral sequences. Thus, we designed 60 nt probes because they can better tolerate mismatched templates than 25 nt oligonucleotide probes (45) . Using an empirical approach, appropriate thresholds can be determined for other array platforms, hybridization conditions, and probe lengths. The method of probe design and setcover minimization is flexible and agnostic of platform; application to bead, solution, or surface-based hybridization technology should be straightforward. Although the growth of the public sequence databases has been rapid, sequence diversity has not grown as quickly. If this trend continues, we anticipate that only incremental updates to a core set of probes will be needed to maintain array integrity. An update strategy would require periodic testing of probe sets against newly deposited sequences and fresh design only in the cases of high sequence divergence. Supplementary Data are available at NAR Online. Correlation of cervical carcinoma and precancerous lesions with human papillomavirus (HPV) genotypes detected with the HPV DNA chip microarray method Viral discovery and sequence recovery using DNA microarrays Microarray detection of human parainfluenzavirus 4 infection associated with respiratory failure in an immunocompetent adult Broad-spectrum respiratory tract pathogen identification using resequencing DNA microarrays Experimental evaluation of the FluChip diagnostic microarray for influenza virus surveillance Identification of a novel Gammaretrovirus in prostate tumors of patients homozygous for R462Q RNASEL variant Pan-viral screening of respiratory tract infections in adults with and without asthma reveals unexpected human coronavirus and human rhinovirus diversity Diagnosis of a critical respiratory illness caused by human metapneumovirus by use of a pan-virus microarray Panmicrobial oligonucleotide array for diagnosis of infectious diseases Detection of respiratory viruses and subtype identification of influenza A viruses by GreeneChipResp Oligonucleotide Microarray DNA microarrays for virus detection in cases of central nervous system infection Detection of potato viruses using microarray technology: towards a generic method for plant viral disease diagnosis Microarray-based detection and genotyping of viral pathogens Database to dynamically aid probe design for virus identification Design of microarray probes for virus identification and detection of emerging viruses at the genus level Detection and genotyping of human group A rotaviruses by oligonucleotide microarray hybridization Detection and discrimination of orthopoxviruses using microarrays of immobilized oligonucleotides Robust sequence selection method used to develop the FluChip diagnostic microarray for influenza virus Sequence-specific identification of 18 pathogenic microorganisms using microarray technology Tracking the evolution of the SARS coronavirus using highthroughput, high-density resequencing arrays GeneChip resequencing of the smallpox virus genome can identify novel strains: a biodefense application Amino acid substitution matrices from an information theoretic perspective Pfam: a comprehensive database of protein domain families based on seed alignments Profile hidden Markov models Sequence comparison and protein structure prediction EMBOSS: the European Molecular Biology Open Software Suite RNAsoft: A suite of RNA secondary structure prediction and design software tools Calculating nucleic acid secondary structure A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics A comparison of normalization methods for high density oligonucleotide array data based on variance and bias EMBL Nucleotide Sequence Database: developments in 2005 Pfam: clans, web tools and services Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences An artificial intelligence approach to motif discovery in protein sequences: application to steriod dehydrogenases Atlas of Protein Sequence and Structure Greene SCPrimer: a rapid comprehensive tool for designing degenerate primers from multiple sequence alignments Gapped BLAST and PSI-BLAST: a new generation of protein database search programs National center for biotechnology information viral genomes project Rationale and uses of a public HIV drugresistance database Global epidemiology of HIV Modeling of DNA microarray data by using physical properties of hybridization Thermodynamic calculations and statistical correlations for oligo-probes design Improving comparability between microarray probe signals by thermodynamic intensity correction Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer The work presented here was supported by National Institutes of Health awards (AI070411, Northeast Biodefense Center U54-AI057158-Lipkin, AI056118, HL083850 EY017404 and T32GM008224). We thank Carolyn Morrison for excellent technical assistance. Funding to pay the Open Access publication charges for this article was provided by NIH U54-AI057158-Lipkin.Conflict of interest statement. None declared.