key: cord-0727237-7avs10yv
authors: Slezak, Tom; Gardner, Shea; Allen, Jonathan; Vitalis, Elizabeth; Torres, Marisa; Torres, Clinton; Jaing, Crystal
title: Design of Genomic Signatures for Pathogen Identification and Characterization
date: 2010-09-23
journal: Microbial Forensics
DOI: 10.1016/b978-0-12-382006-8.00029-3
sha: 88ee8a63427422c89a01061e86807f4ac8148681
doc_id: 727237
cord_uid: 7avs10yv

This chapter addresses issues associated with the identification of signatures based on genomic DNA/RNA, which can be used to identify and characterize pathogens for biodefense and microbial forensic goals. Genomic signature-based identification techniques have the advantage of being precise, highly sensitive, and relatively fast in comparison to biochemical typing methods and protein signatures. Classic biochemical typing methods were developed long before knowledge of DNA and resulted in dozens of tests that are used to roughly characterize the major known pathogens. Genomic signatures can be intended for many different purposes and applied at multiple different resolutions. Organism signatures are intended to uniquely identify the organisms involved. Mechanism signatures can be best thought of as identifying particular genes that result in functional properties such as virulence, antibiotic resistance, or host range. The primary reason to identify mechanisms, independent of organisms, is to detect potential genetic engineering. A secondary reason is that nature has shared many important mechanisms on its own over the millennia, and thus they may not be sufficiently unique to identify specific organisms. Method signatures present yet another dimension of analyzing pathogens: evidence of potential bacterial genetic engineering may be seen in a genome by checking for traces of the bacterial vectors that may have been used to insert one or more foreign genes and related components into the genome being modified.

Lawrence Livermore National Laboratory, Livermore, California

This chapter addresses some of the many issues associated with the identification of signatures based on genomic DNA/RNA, which can be used to identify and characterize pathogens for biodefense and microbial forensic goals. For the purposes of this chapter, we define a "signature" as one or more strings of contiguous genomic DNA or RNA bases sufficient to identify a pathogenic target of interest at the desired resolution and that could be instantiated with particular detection chemistry on a particular platform. The target may be a whole organism, an individual functional mechanism (e.g., a toxin gene), or simply a nucleic acid indicative of the organism. The desired resolution will vary with each program's goals but could easily range from family to genus to species to strain to isolate. Resolution may not be taxonomically based but rather pan-mechanistic in nature: detecting virulence or antibiotic-resistance genes shared by multiple microbes. Entire industries exist around different detection chemistries and instrument platforms for identification of pathogens, and we only briefly mention a few of the techniques that have been used at Lawrence Livermore National Laboratory (LLNL) to support our biosecurity-related work since 2000. Most nucleic acid-based detection chemistries involve the ability to isolate and amplify the signature target region(s), combined with a technique to detect amplification.

Genomic signature-based identification techniques have the advantage of being precise, highly sensitive, and relatively fast in comparison to biochemical typing methods and protein signatures. Classic biochemical typing methods were developed long before knowledge of DNA and resulted in dozens of tests (Gram's stain, differential growth characteristics media, etc.) that could be used to roughly characterize the major known pathogens (of course, some are uncultivable). These tests could take many days to complete and precise resolution of species and strains is not always possible. In contrast, protein recognition signatures composed of antibodies or synthetic high-affinity ligands offer extremely fast results but require a large quantity of the target to be present. False positives/negatives are also a problem with some protein-based techniques (home pregnancy kits use this basic approach).

Genomic signatures can be intended for many different purposes and applied at multiple different resolutions. At LLNL, we have been working on signatures that can be broken out into several categories: (i) organism signatures, (ii) mechanism signatures, and (iii) method signatures.

Organism signatures are intended to uniquely identify the organism(s) involved. Mechanism signatures can be best thought of as identifying particular genes that result in functional properties such as virulence, antibiotic resistance, or host range. The primary reason to identify mechanisms, independent of organisms, is to detect potential genetic engineering. A secondary reason is that nature has shared many important mechanisms on its own over the millennia, and thus they may not be sufficiently unique to identify specific organisms. Knowledge of whether a particular isolate has the full virulence kit or possesses unusual antibiotic resistance properties and whether it is human transmissible is important for biodefense and public health responses. Method signatures present yet another dimension of analyzing pathogens: evidence of potential bacterial genetic engineering may be seen in a genome by checking for traces of the bacterial vector(s) that may have been used to insert one or more foreign genes and related components (promoters, etc.) into the genome being modified. In the future, host range signatures might indicate that an otherwise uncharacterized pathogen was potentially capable of evading or defeating the immune system of a particular host organism. 

Organism detection signatures must satisfy two requirements. They must lie in conserved sequence, so that they reliably detect all intended organisms and minimize false negatives, and they must lie in unique sequence, specific to the target organism and not detecting nontarget organisms, so that they minimize false positives. Organism detection signatures can be designed at different taxonomic resolutions, typically genus, species, or strain.

In biosecurity applications, high-resolution signatures are needed to precisely identify particular isolates or strains. In past years, a large distinction was drawn between identification or detection signatures and forensic signatures, where forensic signatures were typically thought of as at the strain level or below (typically thought of as substrain or isolate specific). More recently the distinction has become blurred because taxonomic distinctions have become less certain and because new signature techniques provide increased resolution levels. Using current commercially available microarray technologies that allow several millions of signatures to be designed on each chip, one can interrogate the entire resolution range (genus, species, strain, and isolate) for desired pathogen targets, providing both detection and forensic resolution. Signature design today is a combination of the desired signature purpose, our current understanding of the diversity of the organism being targeted, and the particular mission constraints that may dictate the detection chemistry and platform to be used for either biodefense or public health.

There is no single resource for all genomic sequence data pertinent to signature design. The most comprehensive public source for genomic sequence data is GenBank, which is located at the National Center for Biotechnology Information (NCBI) Web site (http://www.ncbi.nlm.nih.gov/). The NCBI has reciprocal data exchange agreements with the European Molecular Biology Laboratory in the United Kingdom and the DNA Data Bank of Japan, which are equivalent databases used heavily in those parts of the world. Most authors of published sequence data submit a final version of their sequence data sets to GenBank. However, numerous sequence databases exist that have organism-specific data that may not be found in GenBank during the interim period of data generation and manuscript preparation.

Finished whole genome microbial sequences have undergone an iterative process to assemble contigs and then use a variety of techniques to order and orient them and close any gaps. This often lengthy and costly process, when completed, produces a single string of high-quality bases from the individual and scrambled contigs of the draft sequence. Finished genomes are superior to drafts for annotating gene content or other features, as well as for performing multiple sequence alignments to compare two or more genomes. In our experience at LLNL, an 8-10× Sanger draft genome provides sufficient information for DNA signature design purposes (1). Given that finished microbial genomes can be 4-10 times as expensive as drafts, it is not surprising that many microbial genomes will never be finished. Increasingly, short-read sequences are being mapped to reference genomes in lieu of a de novo assembly.

Another increasingly important category of data is the metagenomic sequence, where no attempt has been made to isolate individual organisms for sequencing. Sometimes this is because no way is known to isolate and culture the particular organism(s) of interest. Only a tiny fraction of organisms can be cultured in vitro and our knowledge base is greatly skewed toward those that can. At other times it is because what is desired is a sampling of an entire community of organisms.

Although numerous metagenomic samples have been sequenced, it is exceedingly rare for complete assemblies of sequence from multiple organisms to result. One exception is a very small symbiotic bacterial community found living in an extremely harsh acidic environment in a mine (2). Metagenomic data are not currently of much utility for genomic signature development. A recent paper on the acid mine bacterial community is providing clues about the evolution of viral resistance (3), which illustrates the vital role metagenomic sequencing will play in expanding our systems biology knowledge at both the organism and the ecosystem level.

Searching for sequence data based on free-text queries can be problematic. For example, GenBank does not enforce consistency with sequence designation. Not all complete genomes have "complete genome" in the title, and some that do are not actually complete genomes. We have encountered complete genomes that were labeled "complete cds" (coding sequence), "complete gene," or otherwise unlabeled as a complete genome. Curation is required to validate any sequence data obtained from a public resource, and periodic in-house testing against benchmark data is necessary to maintain a database of high fidelity. A related problem is distinguishing when a new finished genome should replace a prior draft, as strain name, authors, or institutions may have changed.

Finding regions of conservation across all target genomes can be done with "alignment-based" methods and with "alignment-free" methods. The difference between methods revolves around a trade-off between time and quality.

The first issue to be faced when searching for conservation with a multiple sequence alignment (MSA) is the amount of sequence (breadth) that an alignment method can handle. Alignments sometimes fail when input sequences are very long or when there are a large number of sequences to be aligned (depth), even if the sequences are not particularly long. Failure happens because an MSA takes impractically long to finish due to the intractable computational complexity involved or due to a lack of memory. These limitations mean the optimal alignment approach may vary depending on the breadth and depth of sequences used as input. The recent explosion of genome sequence data has resulted in a lack of MSA algorithms that can scale appropriately.

Alignment-free methods for finding consensus can be a shortcut if a complete MSA is impractical or not needed for downstream analysis. Building an alignment-free consensus relies on one sequence serving as a reference for the sequence order of the remaining sequences. This reference sequence is compared pair-wise with the remaining sequences, and the consensus is expressed in the sequence order of the reference. This is often less computationally complex than performing a complete MSA, and results are of sufficient quality to identify suitably conserved regions for potential signatures.
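The reference-based idea can be illustrated with a minimal Python sketch. It is not the LLNL program (which is BLAST-based and unpublished); as a simplifying assumption it substitutes exact k-mer lookups for the pair-wise search, and the function name `conserved_regions` and the parameter `k` are hypothetical. It reports the intervals of the reference that are covered by k-mers present in every other sequence, expressed in reference coordinates.

```python
def conserved_regions(reference, others, k=8):
    """Return half-open (start, end) intervals of `reference` covered by
    k-mers that occur exactly in every sequence in `others`.

    Assumption: exact k-mer matching stands in for a pair-wise search
    such as BLAST, so divergent but homologous regions are missed.
    """
    # One k-mer set per non-reference sequence.
    kmer_sets = [
        {seq[i:i + k] for i in range(len(seq) - k + 1)} for seq in others
    ]
    covered = [False] * len(reference)
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        if all(kmer in kmers for kmers in kmer_sets):
            for j in range(i, i + k):
                covered[j] = True
    # Collapse covered positions into intervals in reference order.
    regions, start = [], None
    for pos, flag in enumerate(covered):
        if flag and start is None:
            start = pos
        elif not flag and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(reference)))
    return regions
```

Because everything is expressed in the coordinates of one reference, a region that is absent from the reference can never be reported, which mirrors the dependence on the choice of reference described above.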

Another topic of concern when identifying conserved sequence regions is whether an approach can incorporate incomplete and/or draft sequences. Incomplete sequences do not cover the complete genome of the organism. Draft sequences may cover the complete genome but may be of lower quality, particularly near the ends of contigs. Increasingly, far fewer genomes are finished to completion than are left incomplete or in draft form. MUMmer (4) is a notable MSA program in this respect because it can align draft and complete genomes. Note that any use of incomplete genomes carries an inherent risk because regions not present in the incomplete genome(s) will not appear to be conserved and thus may not be considered for signature mining.

Finally, viruses are often highly divergent at the nucleotide level. This extreme divergence, common among many RNA viruses, can cause even alignment-free methods that rely on a pair-wise sequence search to fail at finding all shared genetic regions. Some nonviral organisms have also been observed with enough divergence to make using alignment-free methods error prone. To help overcome the hurdles of divergent targets, we have developed a novel method of signature generation, "minimal set clustering" (MSC), described later.

Finding regions of sequence unique to the target organism is done by searching large sequence databases. There is a trade-off in sequence search between execution time and search sensitivity. "Heuristic" algorithms (methods that take reasonable shortcuts, which may decrease sensitivity) offer the best time performance. "Nonheuristic" algorithms (methods that guarantee complete coverage within the problem space) can have more sensitive results than heuristics, but are slower and the additional sensitivity is not always significant.

Heuristics are used most commonly because they make it possible to search extremely large databases such as NCBI's NT (not nonredundant nucleotide database) quickly. The most popular of these is BLAST (5), which can scale to provide fast results with large databases by splitting the search space into many parallel processes across compute clusters. If additional limitations in search sensitivity are acceptable, other approaches, such as suffix tree-based Vmatch (http://www.vmatch.de/), can be faster. Another heuristic approach is to compute hidden Markov models that represent the sequence families of interest, such as in the program HMMER (http://hmmer.janelia.org/).

After pathogen target regions that are both conserved and unique are found, they are mined for detection signatures. Signatures are found by searching for oligonucleotides with appropriate length, melting temperature, and GC ratio and by searching for oligonucleotide combinations with appropriate overall amplicon size and minimal interoligonucleotide hybridization potential. Programs such as Primer3 (6) can perform some or all of the signature selection work given a target sequence input. Primer3 can be integrated into any signature development pipeline, unlike other packages that only offer a manual graphic interface.
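As a rough illustration of the oligonucleotide screening step, the following Python sketch enumerates candidate oligos from a conserved, unique region and keeps those within length, GC, and melting temperature windows. It is a sketch under stated assumptions, not what Primer3 does: the melting temperature uses the simple Wallace rule (2 °C per A/T, 4 °C per G/C), which is only a crude estimate for short oligos, whereas Primer3 uses more detailed thermodynamic models. The function names and default thresholds are hypothetical.

```python
def gc_fraction(oligo):
    """Fraction of bases in the oligo that are G or C."""
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def wallace_tm(oligo):
    """Crude melting temperature estimate (Wallace rule), in degrees C."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def candidate_oligos(seq, min_len=18, max_len=22,
                     gc_range=(0.4, 0.6), tm_range=(52, 62)):
    """Return (start, oligo) pairs that pass the length, GC, and Tm filters.

    All thresholds are illustrative assumptions, not values from the text.
    """
    seq = seq.upper()
    hits = []
    for length in range(min_len, max_len + 1):
        for start in range(len(seq) - length + 1):
            oligo = seq[start:start + length]
            if set(oligo) - set("ACGT"):
                continue  # skip windows containing ambiguous bases
            if (gc_range[0] <= gc_fraction(oligo) <= gc_range[1]
                    and tm_range[0] <= wallace_tm(oligo) <= tm_range[1]):
                hits.append((start, oligo))
    return hits
```

A full signature search would additionally pair forward and reverse candidates by amplicon size and screen each combination for inter-oligonucleotide hybridization, which this sketch omits.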

This section discusses major design criteria that the LLNL KPATH (7) signature design pipeline was built around. KPATH's native signature format, which we describe, is TaqMan® PCR. Its ability to handle several other formats is not described here.

The process begins by looking across all complete target genomes for sequence conservation. We use an in-house, alignment-free, BLAST-based program for finding conservation (unpublished results).

Conserved regions of the target genomes are next screened across our complete genome database in search of potential cross-reactions. Because the oligonucleotides of TaqMan signatures are about 18 to 30 bp long, a fairly large seed length of 18 is acceptable (which means that some short perfectly matching sequences may be omitted from results). Larger seed lengths make it possible for us to search much larger databases in reasonable amounts of time. We currently use Vmatch for large database searches.
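The effect of the seed length can be sketched in Python as an exact-match screen. This is only an illustration of the idea, not how Vmatch works internally: it builds an in-memory set of every seed-length substring of the nontarget sequences (both strands) and reports where a candidate region shares such a seed. The function names are hypothetical, and sequences are assumed to contain only uppercase A, C, G, and T.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def shared_seed_positions(candidate, nontargets, seed=18):
    """Return start positions in `candidate` whose seed-length substring
    occurs exactly in any nontarget sequence on either strand.

    Matches shorter than `seed` are invisible to this screen, which is the
    trade-off accepted in exchange for speed on large databases.
    """
    index = set()
    for genome in nontargets:
        for strand in (genome, reverse_complement(genome)):
            for i in range(len(strand) - seed + 1):
                index.add(strand[i:i + seed])
    return [
        i for i in range(len(candidate) - seed + 1)
        if candidate[i:i + seed] in index
    ]
```

Positions returned by the screen would be masked out before signature mining; a set of all seeds does not scale to whole genome databases, which is where suffix-based indexes become necessary.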

The resulting conserved and apparently unique sequence, which has no significant similarity to other known sequences, is now mined for signatures. It is important to note that we only find apparent uniqueness based on the state of the current whole genome database available to us. We anticipate that as additional pathogen targets, near-neighbor organisms, and other organisms are sequenced, our regions of conservation and uniqueness will diminish. For this reason, signature design is an iterative process and not an end point. The original KPATH system used Primer3 in a single execution to identify TaqMan signature candidates with a forward primer, reverse primer, and a hybridization probe.

To let us enforce additional signature design constraints and options without ruling out potential target regions, we converted signature identification into two executions of Primer3: one for primer pairs and one for probes. Separate primer and probe results are combined with an in-house signature builder and scorer to allow us to identify the best combinations of primers and probes.

Next, signatures are filtered down so there is little or no overlap of candidate signatures within the target organism. When exhaustive signature searches are performed, many of the mathematically best signature candidates will share oligonucleotides and generally be very similar. This means that choosing the best scoring signatures for any given locus helps us remove excess redundancy from the pool of signature candidates. We note that in recent years other DNA signature pipelines have been built that take a reverse approach. Like LLNL's minimal set clustering, described later, they first generate all potential valid TaqMan PCR signatures for each available genome of a target organism and then BLAST them to check for sufficient conservation and uniqueness.

High rates of mutation and lack of genome repair mechanisms in many viruses generate high levels of intraspecific diversity and result in quasispecies, particularly for many single-stranded RNA viruses. Consequently, PCR-based signatures for viral detection often require high levels of degeneracy or multiplexing in order to detect all variants robustly. Large amounts of sequence data are often required to represent the range of target diversity, sometimes dozens to hundreds of genomes. As noted previously, building multiple sequence alignments with many diverse genomes taxes the capabilities of most available software. Once an alignment is built, it may reveal insufficient consensus for even a single primer, much less a pair, to detect all members of some species (e.g., human immunodeficiency virus-1 or influenza A).

One solution is to subdivide the targets into smaller or more closely related subgroups, such as clade, serotype, or phenotype, of interest (examples of phenotypes could include virulent versus vaccine, domestic versus foreign), and attempt to find signatures separately for each subgroup. This approach implies that multiple signatures will be required for species-level detection of all subgroups. One must make an assessment in advance of signature design of how best to subdivide the target sequences. A second approach is to allow degenerate or inosine bases so that a single signature will detect more diverse genomes. Specificity may suffer if some combinations of degenerate bases also pick up nontarget species. Sensitivity may decline, as the specific priming sequence for a given target is diluted in the degenerate mix. A number of tools that require a multiple sequence alignment as input are available for degenerate primer design (e.g., SCPrimer, PrimaClade, Primo, Amplicon, and HYDEN). A third approach is to forego sequence alignment altogether and to look for sets of primer-length oligomers of length k, or "k-mers," present in many targets and unique relative to nontarget sequences. Using combinatoric or greedy algorithms, one can build a signature set of k-mers such that each target contains at least two k-mers to function as forward and reverse primers. This approach demands large amounts of computing memory to store all candidate k-mers for large or many genomes, especially as k increases above 20, and may require suffix trees or other techniques for data compression.
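The second approach (degenerate bases) can be made concrete with a short Python sketch. Given the same primer site extracted from an alignment of several targets, it builds the IUPAC consensus and counts the degeneracy, that is, how many distinct oligos the degenerate primer represents. This is an illustrative sketch and not the algorithm of any of the tools named above; the function name is hypothetical, and the input is assumed to be gap-free, equal-length, uppercase A/C/G/T strings.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def degenerate_consensus(aligned_sites):
    """Return (consensus, degeneracy) for equal-length aligned primer sites.

    `degeneracy` is the number of distinct oligos in the degenerate mix,
    the product of the number of bases observed at each position.
    """
    if len({len(site) for site in aligned_sites}) != 1:
        raise ValueError("all sites must have the same length")
    consensus, degeneracy = [], 1
    for column in zip(*aligned_sites):
        bases = frozenset(column)
        consensus.append(IUPAC_CODES[bases])
        degeneracy *= len(bases)
    return "".join(consensus), degeneracy
```

The degeneracy count gives a quick handle on the sensitivity concern raised above: with a degeneracy of 64, for example, the exact-match oligo for any one target makes up only 1/64 of the primer mix.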

A fourth approach employed is called MSC. Because it avoids the need for multiple sequence alignment or a priori subgrouping of target sequences, this method can be run blindly without expert knowledge of the target species. It begins by removing nonunique regions from consideration as primers or probes from each of the target sequences relative to a database of nontarget sequences. The remaining unique regions of each target sequence are mined for all or many candidate signatures, without regard for conservation among other targets, yet satisfying user specifications for primer and probe length, Tm, GC%, amplicon length, and so on. All candidate signatures are compared to all targets and clustered by the subset of targets they are predicted to detect.

To predict detection, we may require that a signature's primers and probe have a perfect match to target in the correct orientation and proximity, or we may relax the match requirements to allow a limited number of mismatches, as long as Tm remains above a specified threshold or those mismatches do not occur too close to the 3′ end of a primer. Signatures within a given cluster are equivalent in that they are predicted to detect the same subset of targets, so by clustering we reduce the redundancy and size of the problem to finding a small set of signatures that detect all targets. Nevertheless, finding the optimal solution of the fewest clusters to detect all targets is an "NP complete" problem, so for large data sets we use a greedy algorithm to find a small number of clusters that together should pick up all targets. LLNL has used this method to design signature sets for numerous RNA viruses, including influenza A HA serotypes, foot and mouth disease, Norwalk, Crimean-Congo hemorrhagic fever, Ebola, and other divergent viruses. Figure 29.1 shows the result of an MSC computation for Crimean-Congo hemorrhagic fever performed in 2005, with the resulting signatures displayed against a whole genome phylogenetic tree of all the sequences available at that time.
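The clustering and greedy selection steps can be sketched in Python as a standard greedy set cover. This is a minimal illustration rather than LLNL's implementation: it assumes the detection predictions have already been computed and are supplied as a mapping from signature identifier to the targets that signature is predicted to detect. The function names and the tie-breaking rule are assumptions.

```python
from collections import defaultdict


def cluster_by_targets(detects):
    """Group signatures that are predicted to detect the same target subset.

    detects: dict mapping signature id -> iterable of target ids.
    Returns a dict mapping frozenset(targets) -> sorted list of signature ids.
    """
    clusters = defaultdict(list)
    for signature, targets in detects.items():
        clusters[frozenset(targets)].append(signature)
    return {subset: sorted(sigs) for subset, sigs in clusters.items()}


def greedy_cover(clusters, all_targets):
    """Greedily choose clusters until every target is detected.

    Returns (chosen_subsets, uncovered_targets). The result is small but
    not guaranteed to be minimal, since exact set cover is NP-complete.
    """
    uncovered = set(all_targets)
    chosen = []
    # Fixed iteration order makes tie-breaking deterministic (an assumption).
    candidates = sorted(clusters, key=sorted)
    while uncovered:
        best = max(candidates, key=lambda subset: len(subset & uncovered))
        gained = best & uncovered
        if not gained:
            break  # remaining targets are not detected by any signature
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered
```

Any signature from a chosen cluster can represent that cluster, so secondary criteria such as a signature score can be applied within each cluster without changing target coverage.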

Detecting evidence for genetic engineering in bacteria is challenging when the target modification is not known and the effects of an outbreak on human health are not well understood. We may, for example, anticipate a biological outbreak that employs a bacterial host containing a foreign toxin, but the observed effects of the toxin may not implicate a known gene. Even in cases where the gene is known, it may be difficult to rule out a natural origin for the outbreak. In such cases, it may be useful to search for more direct evidence of the genetic engineering tools used to insert and express foreign genes in a bacterial host. Among the most widely used and readily available tools for genetic engineering in bacteria are artificial vector DNA molecules. Genetic engineering with artificial vectors began with efforts to improve on early work using natural plasmids for gene cloning. Natural plasmids are extrachromosomal replicons (self-replicating molecules) that come in both circular and linear form and are generally nonessential genetic material for the bacterial host but can confer important phenotypes such as virulence and drug resistance. These plasmids are mobile genetic elements that serve as a natural mechanism for the exchange of genetic material across different bacterial species (8). Artificial vectors are natural plasmid derivatives designed to improve support for the insertion and manipulation of foreign genetic elements in the carrier plasmid.

We use the term "artificial vector" to refer to replicons created through human intervention to explicitly distinguish them from their natural plasmid precursors. Sequence features designed to support genetic manipulation form the basis for methods used to distinguish artificial vector sequence from natural plasmids. The most common artificial vector-specific feature is the multiple cloning site region, which is a sequence insert containing clusters of restriction enzyme sites used to facilitate insertion of the foreign gene elements. Selection marker genes also play an important role in selecting bacteria that maintain the artificial vector. The gene transcription control unit, which includes a promoter sequence and transcription terminator sequence for the foreign gene elements, is also an important feature, along with the origin of replication site required for maintenance of the artificial vector in the bacterial colony (9).

Detecting an artificial vector sequence in a mixed bacterial sample potentially requires testing a broad range of sequence targets. This suggests use of an assay with a high degree of multiplex capability that tests for the presence of a large number of sequences simultaneously. Microarray-based assays are a logical choice for accommodating a large number of artificial vector detection probes. The large collection of artificial vector sequences can be clustered according to exact k-mer sequence matching to find the k-mers shared among different vector sequences (10). Sequence length k corresponds to the desired probe length used in the microarray design. Each cluster of shared sequence is compared against all available sequenced natural chromosomal bacterial and viral genomes, including natural plasmids, to identify which k-mers in the artificial vector sequence are distinct from the natural background. These unique k-mers are called candidate signatures. After candidate signatures are found, a probe set is created that ensures that each vector contributes a preset minimum number of candidate signatures to the final microarray probe set design. A greedy algorithm can be used to pick signatures shared by the greatest number of artificial vectors, selecting candidate signatures in decreasing order.
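The probe selection procedure can be sketched in Python as follows. This is a simplified illustration of the steps described above, not the published method (10): it uses exact k-mer sets held in memory, ignores reverse complements, and breaks ties alphabetically. The function names and the parameter `min_per_vector` are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_vector_probes(vectors, background, k, min_per_vector):
    """Pick background-free k-mers so each vector gets enough probes.

    vectors: dict mapping vector name -> sequence.
    background: iterable of natural (nontarget) sequences.
    Returns (chosen_kmers, probes_per_vector).
    """
    background_kmers = set()
    for genome in background:
        background_kmers |= kmers(genome, k)

    # Candidate signatures: vector k-mers absent from the natural background,
    # each mapped to the set of vectors that carry it.
    carriers = {}
    for name, seq in vectors.items():
        for kmer in kmers(seq, k) - background_kmers:
            carriers.setdefault(kmer, set()).add(name)

    counts = {name: 0 for name in vectors}
    chosen = []
    # Greedy pass: most widely shared candidates first.
    for kmer in sorted(carriers, key=lambda c: (-len(carriers[c]), c)):
        if any(counts[name] < min_per_vector for name in carriers[kmer]):
            chosen.append(kmer)
            for name in carriers[kmer]:
                counts[name] += 1
    return chosen, counts
```

A vector whose final count is below `min_per_vector` has too few background-free k-mers at this k, which is itself useful to know when judging how detectable that vector is against the natural background.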

Additional postprocessing steps may further improve the quality of the signature probe set design to achieve the ultimate goal of sensitive detection, while maintaining a hybridization pattern on the microarray that distinguishes artificial vectors from the natural background found in a mixed sample. Once the initial probe set is designed, a BLAST search can be used to tune the probe set by replacing candidate signatures that have near matches to the background with candidates showing a greater percentage of vector-unique variation. Crossvalidation can be used to estimate a similarity threshold for distinguishing artificial and natural genomic sets. [An example of this approach using crossvalidation is given elsewhere (10).] Another postprocessing step is to tune the probe set to ensure probes derived from each vector come from multiple functional regions. Confidence in vector detection is boosted when probes are found for multiple functional locations. Using probes from multiple regions may also provide useful forensic information on the origins and function of the detected artificial vector. Given the similarities between artificial vectors and natural plasmids, having additional probes for natural plasmids allows for direct comparison with the natural plasmid hybridization pattern, which could reduce the potential for false-positive predictions.

Numerous microarrays have been designed for viral discovery, detection, and resequencing (11-14). Resequencing arrays can provide sequence information for viruses closely related (90% similarity) to sequences from which the array was designed. Discovery arrays to detect more diverse and more distantly related organisms have been built using techniques for selecting probes from regions of known conservation based on BLAST nucleotide sequence similarity (15) or profile HMM and motif indications of amino acid sequence conservation (14). Array design to span an entire kingdom on a single microarray demands substantial investment in probe selection algorithms. LLNL designed a microarray to detect all bacteria, plasmids, and viruses based on all available whole genome, whole segment, and whole plasmid sequences and is in the process of including probes for highly conserved fungal genes as well. We attempted to find probes that are unique to each viral and bacterial family, and favor probes conserved within a family. We used probes 50-65 bases long, enabling sensitive detection of targets with some sequence variation relative to the probe. We used a greedy minimal set cover algorithm to ensure that all sequences have at least 50 (for viruses) or 15 (for bacteria and plasmids) probes per sequence. We allowed some mismatches between probe and target, based on previous mismatch experiments in which we determined that probes with a contiguous match at least 29 bases long and with 85% sequence similarity between probe and target still gave a strong signal intensity. Our design should characterize unknowns to at least the family level, and in all cases tested so far, including blinded clinical samples containing multiple viruses, we are able to accurately detect and characterize all viruses contained in that sample to the species or strain level (16).
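The mismatch tolerance described above can be expressed as a simple predicate. The Python sketch below assumes the probe and its target site have already been aligned without gaps to equal length, and it treats both thresholds as inclusive lower bounds; those are assumptions for illustration, as are the function names.

```python
def longest_match_run(probe, target_site):
    """Length of the longest run of consecutive identical positions."""
    best = run = 0
    for p, t in zip(probe, target_site):
        run = run + 1 if p == t else 0
        best = max(best, run)
    return best


def predicted_to_hybridize(probe, target_site, min_run=29, min_identity=0.85):
    """Apply the contiguous-match and overall-similarity criteria.

    Assumes an ungapped alignment of equal-length strings; thresholds
    default to the 29-base and 85% values quoted in the text.
    """
    if len(probe) != len(target_site):
        raise ValueError("probe and target site must be the same length")
    matches = sum(p == t for p, t in zip(probe, target_site))
    identity = matches / len(probe)
    return (longest_match_run(probe, target_site) >= min_run
            and identity >= min_identity)
```

The two criteria are independent: a 60-base probe with a mismatch every 10 bases is 90% identical to its target yet fails the test because its longest contiguous match is only 9 bases.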

Our first-generation viral array included 36,000 probes designed from family-specific regions of all 72 viral families, and our second version included 170,000 viral probes, again from family-specific regions. No probe region matched human or bacterial sequence over more than 25 bp, and none matched other nontarget viral families over more than 17 bp. In addition, we included the 20,000 probes from the Virochip developed by Dr. Joseph DeRisi from University of California, San Francisco, as a control (11).

Preliminary testing using NimbleGen arrays with mixed DNA and RNA viruses and with blinded clinical samples showed accurate detection of multiple viruses in a single sample. In addition, we can identify the exact strains and isolates hybridized as a mixed sample, although the array was designed to guarantee discrimination only to family. We developed a novel statistical method that is based on likelihood maximization within a Bayesian network, incorporating a sophisticated probabilistic model of probe-target hybridization developed and validated with experimental data from hundreds of thousands of probe intensity measurements. The method is designed to enable quantifiable predictions of likelihood for the presence of each of multiple organisms in a complex, mixed sample, which is especially important in an environmental sample or one with chimeric organisms. Future detection chip designs will include probes from conserved regions of bacterial families and plasmids and fungal families. This chip will be a major platform for identification of known and unknown pathogens.

The Future of Genomic Signatures

Issues related to scaling, taxonomy, and technology advances appear to be the main drivers for the future of genomic signatures.

Scaling problems all stem from the exponential rate at which genomic sequence data are growing. Although it is inexpensive to buy sufficient hardware to store data physically, the current generation of bioinformatics tools was designed in an era when it was a luxury to have a handful of genomes of a particular pathogen available to work with. In recent years the Influenza Community Sequencing Project (17) has deposited many thousands of complete influenza genomes into GenBank, far exceeding the capacity of most tools to handle them. Similarly, some of the new sequencing technologies can generate billions of bases in a single run from metagenomic samples (18), but truly efficient software that takes full advantage of this information is lacking. It will likely take several years for research funding to be focused properly to close this bioinformatics tool gap. Another aspect of scaling problems is that few researchers have access to computers with large enough memories to be able to process certain classes of sequence analyses related to genomic signature design. Computer clusters optimal for physical science problems (where each node represents a point in a three-dimensional physical grid representation and almost all communication is with nearest neighbor nodes) are suboptimal for some classes of biological sequence algorithms, for which a large-memory computer would be better.

Earlier we mentioned difficulties with the evolving taxonomy of pathogenic organisms, as classification schemes originally developed based on phenomenology are faced now with genomic inconsistencies. The current flood of metagenomic data is presenting us with an even larger problem: what exactly do concepts such as "species" and "strains" mean if it turns out that microbial life is a broad spectrum with few well-defined transitions? It is now common to refer to a "core genome" and additional distinct gene content variation that presumably is responsible for different phenotypes (19). It is possible that new concepts and terminology will be needed to map existing taxonomic categories into the genomic reality of the 21st century.

The rate of advancement in sequencing technology exceeds even that of computers, fueled by the promise of personalized medicine if individual drug and disease reactions can be determined and if individual genetic variation can be determined efficiently via low-cost sequencing. The field of pathogen diagnostics is riding this technology wave but is too small a market to have any direct influence on it. Note that the read lengths of some new sequencing technologies may be too short to provide confident pathogen identification based on a single read, meaning that direct metagenomic identification of human pathogens from complex clinical or environmental samples carries some degree of uncertainty. Microarrays will have to ride their own faster/less expensive/more-information-per-chip curve if they are not to become obsolete within a few years. Alternatively, one could argue that future advances in protein detection technology could lead to breakthroughs in fast dipstick assays (similar to current home pregnancy test kits) that could provide fast, accurate, and inexpensive results for pathogen detection. In all likelihood, all these techniques will continue to compete as they evolve asynchronously.

Another technological advance is seen in the recent breakthroughs in gene and genome synthesis (20). Not only do we need to deal with emerging natural viruses from every remote corner of the planet, but now we also need to deal with the fact that for relatively modest amounts of money, it is possible to synthesize combinatorial versions of any DNA one might wish to (re)create. This potential ability to create a new class of supercharged pathogens, as well as the possibility of synthesized pathogens that do not exist in nature, puts a new urgency into ensuring that we have adequate tools to deal with these evolving biothreats.

What all this means for genomic signature design is that we will have to operate amid a data avalanche, new analysis tools, and rapidly evolving technologies. Against this background of change, we will have to deal with new missions and new challenges from adversaries equipped with the latest technologies. Fittingly for biodefense, it is indeed a very Darwinian challenge that faces us.

Draft versus finished sequence data for DNA and protein diagnostic signature development

Genome dynamics in a natural archaeal population

Virus population dynamics and acquired virus resistance in natural microbial communities

Versatile and open software for comparing large genomes

Basic local alignment search tool

Primer3 on the WWW for general users and for biologist programmers

Comparative genomics tools applied to bioterrorism defense

Mobile genetic elements: The agents of open source evolution

DNA signatures for detecting genetic engineering in bacteria

Microarray-based detection and genotyping of viral pathogens

Panmicrobial oligonucleotide array for diagnosis of infectious diseases

Broad-spectrum respiratory tract pathogen identification using resequencing DNA microarrays

Comprehensive viral oligonucleotide probe design using conserved protein regions

Viral discovery and sequence recovery using DNA microarrays

A microbial detection array (MDA) for viral and bacterial detection

Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution

The impact of next-generation sequencing technology on genetics

Whole genome comparisons of serotype 4b and 1/2a strains of the food-borne pathogen Listeria monocytogenes reveal new insights into the core genome components of this species

Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome

This chapter was prepared as an account of work sponsored by an agency of the U.S. government. Neither the U.S. government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed or represents that its use would not infringe on privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the U.S. government or Lawrence Livermore National Security, LLC. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the U.S. government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.