key: cord- -uu oz ei authors: kumar, ranjit; lawrence, mark l.; watt, james; cooksey, amanda m.; burgess, shane c.; nanduri, bindu title: rna-seq based transcriptional map of bovine respiratory disease pathogen “histophilus somni ” date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: uu oz ei genome structural annotation, i.e., identification and demarcation of the boundaries for all the functional elements in a genome (e.g., genes, non-coding rnas, proteins and regulatory elements), is a prerequisite for systems level analysis. current genome annotation programs do not identify all of the functional elements of the genome, especially small non-coding rnas (srnas). whole genome transcriptome analysis is a complementary method to identify “novel” genes, small rnas, regulatory regions, and operon structures, thus improving the structural annotation in bacteria. in particular, the identification of non-coding rnas has revealed their widespread occurrence and functional importance in gene regulation, stress and virulence. however, very little is known about non-coding transcripts in histophilus somni, one of the causative agents of bovine respiratory disease (brd) as well as bovine infertility, abortion, septicemia, arthritis, myocarditis, and thrombotic meningoencephalitis. in this study, we report a single nucleotide resolution transcriptome map of h. somni strain using rna-seq method. the rna-seq based transcriptome map identified srnas in the h. somni genome of which srnas were never predicted or reported in earlier studies. we also identified novel potential protein coding open reading frames that were absent in the current genome annotation. the transcriptome map allowed the identification of operon (total genes) structures in the genome. when compared with the genome sequence of a non-virulent strain pt, a disproportionate number of srnas (∼ %) were located in genomic region unique to strain (∼ % of the total genome). 
this observation suggests that a number of the newly identified srnas in strain may be involved in strain-specific adaptations. systems biology approaches are designed to facilitate the study of complex interactions among genes, proteins, and other genomic elements [ , , ] . in the context of infectious disease, systems biology has the potential to complement reductionist approaches to resolve the complex interactions between host and pathogen that determine disease outcome. however, a prerequisite for systems biology is the description of the system's components. therefore, genome structural annotation or the identification and demarcation of boundaries of functional elements in a genome (e.g., genes, non-coding rnas, proteins, and regulatory elements) are critical elements in infectious disease systems biology. bovine respiratory disease (brd) costs the cattle industry in the united states as much as $ billion annually [ , ] . brd is the outcome of complex interactions among host, environment, bacterial, and viral pathogens [ ] . histophilus somni, a gram-negative, pleomorphic species, is one of the important causative agents of brd [ ] . h. somni causes bovine infertility, abortion, septicemia, arthritis, myocarditis, and thrombotic meningoencephalitis [ ] . h. somni strain , the serotype used in this study and isolated from pneumonic calf lung, has a . mbp genome and predicted open reading frames (orfs), of which ( %) have an assigned biological function. genome structural annotation is a multi-level process that includes prediction of coding genes, pseudogenes, promoter regions, repeat elements, regulatory elements in intergenic regions such as small non-coding rnas (srna), and other genomic features of biological significance. computational gene prediction methods such as glimmer [ ] or genemark [ ] use hidden markov models which are based on a training set of well-annotated genes. 
although these methods are quite efficient, they often miss genes with anomalous nucleotide composition and have several well-described shortcomings: because bacterial genomes do not have introns, detecting gene boundaries is comparatively difficult; due to the usage of more than one start codon, computational genome annotation methods may predict overlapping orfs [ ] ; prediction programs use arbitrary minimum cutoff lengths to filter short orfs, which may lead to under-representation of small genes. in case of srna (small non-coding rna) prediction, the lack of dna sequence conservation, lack of a protein coding frame, and the limited accuracy of transcriptional signal prediction programs (promoter/rho terminator prediction) confound computational prediction [ , ] . computational prediction methods are a ''first pass'' genome structural annotation. whole genome transcriptome studies (such as whole genome tiling arrays [ , , ] and high throughput sequencing [ , ] ) are complementary experimental approaches for bacterial genome annotation and can identify ''novel'' genes, gene boundaries, regulatory regions, intergenic regions, and operon structures. for example, a transcriptomic analysis of mycoplasma pneumoniae identified previously unknown transcripts, many of which were non-coding rnas, and two novel genes [ ] . transcriptome analyses identified novel, non-coding regions in other species, including srnas in caulobacter crescentus [ ] , srnas in salmonella typhimurium [ ] , and a large number of putative srnas in vibrio cholerae [ ] . srnas found in pathogen genomes are known to be involved in various housekeeping activities and virulence [ ] . in this study we used rna-seq for the experimental annotation of the h. somni strain genome and to construct a single nucleotide resolution transcriptome map. novel expressed elements were identified, and where appropriate, computational predictions of previously described gene boundaries were corrected. 
in the complete genome sequence of the h. somni strain became available (genbank cp ). the , , bp circular genome has a gc content of . %, and % of the sequence is annotated to coding regions. the genome has computationally predicted genes, of which are protein coding. we sequenced the transcriptome of h. somni using illumina rna-seq methodology, and obtained , , reads, with an average read length of approximately bp. we mapped approximately . % of the reads onto the reference dna sequence of h. somni strain using the alignment program bowtie [ ] . to determine expressed regions in the genome, we estimated the average coverage depth of reads mapped per nucleotide/base. we used the pileup format, which represents the signal map file for the whole genome in which alignment results (coverage depth) are represented in per-base format. regions where coverage depth was greater than the lower tenth percentile of expressed genes were considered significantly expressed [ ] ; in the current study, this corresponded to a coverage depth of reads/bp in pileup format. as another measure for estimating background expression level, we analyzed the coverage in the intergenic regions of the genome. we assumed that at least half of the intergenic region is not expressed (considering the presence of known expressed regions, such as and utr of genes, intergenic region of the operons, and srnas) and calculated the coverage, which corresponded to # reads per base, lower than our first cutoff estimate. we retained the most conservative cutoff for expression, i.e., reads per base for describing the expression map of h. somni. nucleotides in the genome sequence with coverage depth above our threshold value were considered to be expressed. this resulted in the generation of a whole genome transcriptome profile of h. somni at a single nucleotide resolution. figure shows the steps involved in the analysis of expressed intergenic regions. 
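The thresholding step described above can be illustrated with a minimal Python sketch. It is not the in-house Perl code used in the study. The function names, the nearest-rank percentile rule, and the toy depth values are all assumptions made for illustration, and the actual cutoff value is not reproduced here.

```python
from typing import List, Tuple


def lower_percentile_cutoff(gene_depths: List[float], pct: int = 10) -> float:
    """Nearest-rank percentile of per-gene mean coverage depths.

    Assumption: the source does not state which percentile rule was used;
    nearest-rank is chosen here only because it is simple to verify.
    """
    ordered = sorted(gene_depths)
    n = len(ordered)
    # Ceiling of n * pct / 100 using integer arithmetic, at least rank 1.
    rank = max(1, -(-n * pct // 100))
    return ordered[rank - 1]


def expressed_regions(depth: List[int], cutoff: float,
                      min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) runs where depth > cutoff.

    `depth` holds one coverage value per genome position, as in the
    per-base column of a pileup file. Runs shorter than `min_len`
    (a hypothetical length filter) are dropped.
    """
    regions = []
    start = None
    for pos, d in enumerate(depth, start=1):
        if d > cutoff:
            if start is None:
                start = pos          # open a new expressed run
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos - 1))
            start = None             # close the run
    # Close a run that extends to the last base.
    if start is not None and len(depth) - start + 1 >= min_len:
        regions.append((start, len(depth)))
    return regions


if __name__ == "__main__":
    toy_depth = [0, 5, 6, 7, 0, 0, 9, 9]   # made-up coverage values
    print(expressed_regions(toy_depth, cutoff=4))
```

Expressed runs that fall outside annotated genes would then be the candidates compared against the genome annotation in the next step.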
we compared the rna-seq based transcriptome map with the available genome annotation to identify expressed, novel, and intergenic regions in the genome. promoters and terminators were predicted across the genome to add confidence to the identified novel elements. for the first time, we report the identification of srnas (table ) in the h. somni genome. the start and end for srna in table refer to the boundaries of transcriptionally active regions (tar, putative srnas). of these, twelve were similar to well-characterized srna families that are described in many bacterial species, such as tmrna, s, and fmn ( figure ). the total of novel srnas reported in this study has not been reported earlier. the majority of the identified srnas (. %) were shorter than nucleotides (length range - nucleotides). the average gc content of srna at . % was slightly higher compared with the . % gc content of the genome. promoters within nt upstream/downstream of the tar boundaries were predicted for srna. similarly, rho-independent transcription terminators were predicted within bp upstream/downstream of srna. figure shows the depth of coverage for one of the identified novel srnas, ''hs '', viewed in the artemis genome browser [ ] . blast analysis of the srna sequences against the nonredundant nucleotide database at ncbi revealed that of the srna sequences were unique to the h. somni genome. another were highly conserved (. % identity with . % coverage) only in h. somni strain pt, which is a commensal, preputial isolate. a set of srnas were conserved in the related pasteurellaceae family, which includes genomes such as p. multocida, h. influenzae, h. parainfluenzae, and h. ovis. only srnas were conserved in distant bacterial genomes from genera streptococcus, clostridium, actinobacillus, vibrio, and others. 
this lack of srna sequence conservation beyond the species could indicate that srna sequences are under strong selection pressure, and that they could be responsible for the adaptation of many species to different environmental niches. we searched all h. somni srna sequences against the rfam database [ ] to determine their putative functions. we found that srnas were homologs to well characterized srnas in other genomes. the identified functional categories included fmn riboswitches, gcvb, glycine, intron_gpii, lysine, alpha_rbs, lr-pk , isrk, mocorna, rnasep_bact_a, tmrna, and s. srnas for which no rfam function could be predicted represent a completely novel set of non-coding srnas. functions of these novel srna need to be determined by further experiments. we evaluated the coding potential of all expressed intergenic regions, by conducting blastx based sequence searches against the non-redundant protein database at ncbi followed by manual analysis and interpretation. we identified novel protein coding regions ( table ). the average length of the identified novel proteins was around amino acids (ranged from to amino acids). the majority of the novel proteins ( ) were conserved hypothetical proteins present in related species such as h. somni pt, m. haemolytica, and h. influenzae. some of the novel proteins had predicted functions, such as dnak suppressor protein, toxic membrane protein tnac, and predicted toxic peptide ibsb ( table ). figure shows an example of a novel protein ''hsp '' that is similar ( % similarity and % coverage) to a putative, phage-related dna-binding protein of neisseria polysaccharea. the single nucleotide resolution map described in this study enabled us to correct the start site for five genes based on the current genome annotation (table ) . these genes were annotated as phospholipid synthesis protein, ribosomal protein s , aconitate hydratase , peptide chain release factor , and duf , a protein of unknown function. 
based on evidence from rna-seq data, we performed a blast comparison with other phylogenetically similar proteins to confirm the new gene boundaries (table ) . the comparison of the transcriptome map of the h. somni genome with predicted proteins revealed the presence of frameshift mutations. four genes have non-functional start codons, resulting in a predicted protein truncated at the amino terminus (based on blast comparison with homologous proteins in other species), although full length mrna was present. an example is presented for the gene ''hsm_ '', annotated as ''alpha-l-fucosidase'' ( figure s ). the other three genes, hsm_ , hsm_ and hsm_ , encode a hypothetical protein, type iii restriction protein res subunit, and ctp synthase, respectively. two genes with frameshifts causing protein truncations (based on blast comparison with homologous proteins) are hsm_ (beta-hydroxyacyl dehydratase, faba) and hsm_ (alcohol dehydrogenase zinc-binding domain protein). the transcriptome map revealed a full length mrna for these two genes that code for truncated proteins. our transcriptome map of h. somni identified expression from (approximately %) of the predicted genes. the expressed genes were distributed evenly across all tigrfam functional categories (table s ). the transcriptome map allowed identification of operon structures at a genome scale, critical for identifying co-expressed genes and for understanding coordinated regulation of the bacterial transcriptome. we identified co-expression for pairs (total genes) of h. somni genes ( table s ) that were transcribed together and constituted a minimal operon. by joining consecutive overlapping pairs of co-expressed genes, we identified distinct transcription units (table s ) . we compared our experimentally identified co-expressed genes with computationally predicted operons. the overlap between computational prediction of co-expressed genes using door [ ] and this study was % ( gene pairs) (table s ) . 
thus, our dataset validates expression of computational gene-pair predictions. we identified new gene pairs that are co-expressed and were not predicted by door, which could be part of unidentified, new operon structures. for example, further in-depth analysis indicated a new operon consisting of three genes: hsm , hsm and hsm , annotated as ribosomal protein l , ribosomal protein l , and translation initiation factor if- respectively, which were not predicted computationally ( figure ). the orthologs of these genes are well known to form a functional operon of ribosomal proteins (if -l -l ) in escherichia coli [ ] . in this study using rna-seq we describe the whole genome transcriptome profile of h. somni , a bovine respiratory disease pathogen. the single nucleotide resolution map helped uncover the structure and complexity of this pathogen's transcriptome and led to the identification of novel, small rnas and protein coding genes as well as gene co-expression. prokaryotic genome annotation is performed often using computational gene prediction programs [ , ] . however, these prediction algorithms are not able to identify the non-coding srnas, antisense transcripts, and other small proteins. to overcome the shortcomings of computational genome structural annotation, various experimental methods are used for identification of novel expressed elements [ , , , , , , , , ] . deep transcriptome sequencing (rna-seq) has emerged recently as a method that enables the study of rna-based structural and regulatory regions at the genome scale. rna-seq technology has many advantages compared with existing array based methods for transcriptome analysis. in particular, rna-seq does not require probes, so the process is free from probe design issues or bias from hybridization issues. also, the transcriptome coverage from rna-seq is very high [ , ] . 
rna-seq was demonstrated to be effective for the discovery of bacterial non-coding rnas, accurate operon definition, and correction of gene annotation [ , , ] . therefore, in the current study, we used rna-seq for profiling h. somni transcriptome. mapping of rna-seq reads onto the h. somni genome sequence resulted in more than % coverage with at least one read per base. this observation is consistent with the reported % genome expression in bacillus anthracis, . % in sulfolobus solfataricus, and % in burkholderia cenocepacia, studied under one or more experimental growth conditions using rna-seq [ , , ] . these results indicate that most of the bacterial genome sequence is expressed at some basal level. to identify significantly expressed regions above this baseline, we used two alternative methods (discussed in results section) to estimate the background expression. both methods yielded similar results ( - reads per base). we selected the higher stringency cutoff of reads per base to minimize the number of false positives. we identified a total of srnas in the h. somni genome. twelve of these were predicted by rfam [ ] and are similar to conserved srna (e.g., s, tmrna, fmn) in other bacterial species, which helps validate our approach. the novel h. somni srnas may have housekeeping function, regulatory activity, or participate in virulence as described in other pathogenic bacteria [ , , ] . the identified srnas did not show any location specific bias across the genome. similarly, genes known to be associated with virulence are known to be scattered across bacterial genomes [ , ] . however, the tendency to form clusters was observed with srnas, which could indicate that functionally related srnas tend to be located in close proximity. the rna-seq based transcriptome map of h. somni identified novel protein coding genes that were missed by the initial annotation. 
the average length of the proteins coded by these genes exceeds amino acids, suggesting that length based cutoff was not the main reason that these genes were missed by computational gene prediction programs. the novel protein coding genes identified in the current study could serve as a training set to improve gene prediction algorithms. the transcriptome map helped to identify incorrect annotation of start codons in the genome. transcriptional mapping does not provide direct evidence of translational start sites. however, location of identified transcriptional start sites suggest that the annotated start codons are incorrect, an observation that is confirmed by blast comparisons against homologous genes in other bacterial species. transcriptional mapping revealed genes where the untranslated sequence extended well beyond the translational start. blast comparisons indicated that these genes have either nonsense or missense base changes relative to homologous genes in other bacterial species, causing apparent ''truncated'' proteins compared with those in other species. further work is needed to determine whether these untranslated regions serve regulatory functions or they are vestigial. rna-seq data enabled us to determine operon structures at a genome scale, and it allowed identification of some operons not predicted by the computational operon prediction method. operon structures that include genes not expressed under the experimental growth condition used in the current study, could not be identified. our results support the notion that using a combination of experimental operon identification by rna-seq and computational prediction can improve operon identification in bacterial genomes [ ] . for the first time, we report the rna-seq based transcriptome map of h. somni and describe novel expressed regions in the genome. whereas the results are interesting, we are aware of the limitations of the study. 
because the rna-seq protocol was not strand specific, we could not determine the strand specificity of expressed novel transcripts. therefore, table lacks information about srna orientation in the genome. because strand specific information was missing, we could not describe antisense expression in the genome. for protein coding genes, we derived strand specificity based on alignment of the blast hit. despite this shortcoming, we identified novel expressed regions and transcriptional patterns across the whole genome at a high coverage, which is not possible by other transcriptome analysis methods. overall, this study describes rna-seq based transcriptome map of h. somni for identification of functional elements in a pathogen of importance to agriculture. our genome-wide survey predicts numerous, novel, expressed regions that need biological characterization for understanding disease pathogenesis. description of all functional elements in the h. somni system is a prerequisite for conducting holistic systems approaches to understand the complex pathogenesis of bovine respiratory disease. we propagated h. somni on three tsa-blood plates (with % sheep red blood cells) for hr or until a fresh lawn of cells was visible. ibc approval was not required for acquiring the plates as they were purchased through a commercial vendor: fisher scientific (pittsburgh, pa), and manufactured by becton dickinson diagnostic systems, (franklin lakes, nj). we washed the plates with brain heart infusion (bhi) broth, adjusted the culture to an od nm = . , and supplemented with rnaprotect reagent. the cells were harvested by centrifugation and stored at uc. we extracted total rna using the rneasy mini kit (qiagen, valencia, ca) following the manufacturer's protocol. total rna was treated with rnase-free dnase (invitrogen, carlsbad, ca). using bioanalyzer (agilent technologies, santa clara, ca), we determined the rna integrity number (rin) of total rna to be greater than . 
the microbexpress tm kit (ambion, tx, usa), which specifically removes rrnas, was used for mrna enrichment. small rnas (i.e., trna and s rrna) are not removed with this enrichment step (confirmed by bioanalyzer). we used ng of enriched mrna with the illumina mrna-seq sample preparation kit (illumina, san diego, ca) for library construction following the manufacturer's protocols. briefly, mrna was fragmented chemically by divalent zinc cations and randomly primed for cdna synthesis. after ligating paired-end sequence adaptors to cdna, we isolated fragments of approximately bp by gel electrophoresis and amplified them. we checked all illumina reads for quality and removed sequence reads containing ''ns''. a custom perl script was written to convert illumina reads into fastq format. the script ''fq_all std.pl'' from maq [ ] converted fastq format to sanger fastq format. reads in sanger fastq format were mapped onto the histophilus somni genome sequence (genbank accession number. cp ) using the alignment tool bowtie [ ] , allowing for a maximum of two mismatches. the reads that mapped to more than one location were discarded. we used samtools [ ] to convert data into sam/bam format, and to generate alignment results in a pileup format. pileup format provides the signal map file and has per-base format coverage. custom perl scripts were written to calculate the background expression. processed data was deposited in geo with the accession number gse . we used in-house perl scripts to extract novel expressed intergenic regions to identify novel small rnas, riboswitches, and putative novel proteins. srna , bp in length were discarded to minimize the number of false positives. for each novel expressed region, blast sequence searches were performed against the non-redundant protein database at ncbi to identify potential protein coding regions. intergenic regions within predicted operons [ ] represent expressed regions and can be mis-classified as srnas. 
therefore, these regions were excluded. we analyzed blast results manually to identify novel protein coding regions and start codon corrections. if no protein coding region was found in the intergenic expressed regions, the presence of a promoter or a rho-independent terminator allowed us to classify the regions as srna. bacterial promoter sequences were predicted by the neural network promoter prediction program (http://www.fruitfly.org/seq_tools/promoter.html) [ ] . rho-independent transcription terminators were identified using the program transtermhp [ ] . for functional annotation, all identified srna sequences were searched against the rfam database [ ] . srna sequence conservation among other genomes was determined by blastn searches against the nonredundant nucleotide database at ncbi. we mapped srnas, along with additional features, onto genome browsers like igv [ ] and artemis [ ] for further visualization, manual analysis, and interpretation. gene expression: expressed reads with coverage above background were mapped onto the annotated genes of h. somni . genes that had a significantly higher proportion of their length (. %) covered by expressed reads were considered to be expressed. operons: rna-seq can identify and predict operon structures in bacteria. we considered two or more consecutive genes to be part of an operon if they fulfilled the following criteria: (a) they are expressed; (b) they are transcribed in the same direction; and (c) the intergenic region between the genes is expressed. overlapping pairs of such genes were joined together to identify large operon structures. we used in-house perl scripts for the analyses. figure s mutated start codon. the figure shows that the predicted protein coding frame (mh_ ) is shorter at the end than the corresponding transcript level shown by the rna-seq coverage. 
although the transcript is longer near end, no start codon is found in that region which might be a result of the mutation in that region of the start codon. this was further validated using homology searches of the full length transcript which shows high homology ( % identity and . % coverage) to a alpha-l-fucosidase protein from m. haemolytica phl . (tif) host-pathogen systems biology a systems biology approach to infectious disease research: innovating the pathogen-host research paradigm virus-host interactions: from systems biology to translational research infectious bovine rhinotracheitis, parainfluenza- and bovine respiratory coronavirus economic impact associated with respiratory disease in beef cattle the immunology of the bovine respiratory disease complex bovine platelets activated by haemophilus somnus and its los induce apoptosis in bovine endothelial cells microbial gene identification using interpolated markov models genemarks: a self-training method for prediction of gene starts in microbial genomes. implications for finding sequence motifs in regulatory regions large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions? 
computational approaches for the discovery of bacterial small rnas computational prediction of srnas and their targets in bacteria transcriptome analysis of escherichia coli using high-density oligonucleotide probe arrays wholegenome tiling array analysis of mycobacterium leprae rna reveals high expression of pseudogenes and noncoding regions small non-coding rnas in caulobacter crescentus experimental discovery of srnas in vibrio cholerae by direct cloning, s/trna depletion and parallel sequencing deep sequencing analysis of small noncoding rna and mrna targets of the global post-transcriptional regulator transcriptome complexity in a genome-reduced bacterium identification of small rnas in diverse bacterial species lack of development of new antimicrobial drugs: a potential serious threat to public health pathogen proteomes during infection: a basis for infection research and novel control strategies artemis: sequence visualization and annotation rfam: annotating non-coding rnas in complete genomes door: a database for prokaryotic operons resistance or decreased susceptibility to glycopeptides, daptomycin, and linezolid in methicillin-resistant staphylococcus aureus deep rna sequencing improved the structural annotation of the tuber melanosporum transcriptome bacillus anthracis genome organization in light of whole transcriptome sequencing identification of novel non-coding small rnas from streptococcus pneumoniae tigr using high-resolution genome tiling arrays studying bacterial transcriptomes using rna-seq next generation sequencing of microbial transcriptomes: challenges and opportunities a strand-specific rna-seq analysis of the transcriptome of the typhoid bacillus salmonella typhi mapping the burkholderia cenocepacia niche response via high-throughput sequencing structure and complexity of a bacterial transcriptome a single-base resolution map of an archaeal transcriptome small noncoding rnas controlling pathogenesis regulatory rna in bacterial pathogens 
vfdb: a reference database for bacterial virulence factors a genomic window into the virulence of histophilus somni the relative value of operon predictions extended-spectrum beta-lactamase-producing enterobacteriaceae: an emerging public-health concern ultrafast and memoryefficient alignment of short dna sequences to the human genome bad bugs, no drugs: no eskape! an update from the infectious diseases society of america application of a time-delay neural network to promoter annotation in the drosophila melanogaster genome rapid, accurate, computational discovery of rho-independent transcription terminators illuminates their relationship to dna uptake biologicalnetworks-tools enabling the integration of multi-scale data for the host-pathogen studies host-microbe interaction systems biology: lifecycle transcriptomics and comparative genomics we thank dr. john harkness and dr. stephen b. pruett for editing the final version of the manuscript. key: cord- -p v wi authors: bigot, yves; samain, sylvie; augé-gouillou, corinne; federici, brian a title: molecular evidence for the evolution of ichnoviruses from ascoviruses by symbiogenesis date: - - journal: bmc evol biol doi: . / - - - sha: doc_id: cord_uid: p v wi background: female endoparasitic ichneumonid wasps inject virus-like particles into their caterpillar hosts to suppress immunity. these particles are classified as ichnovirus virions and resemble ascovirus virions, which are also transmitted by parasitic wasps and attack caterpillars. ascoviruses replicate dna and produce virions. polydnavirus dna consists of wasp dna replicated by the wasp from its genome, which also directs particle synthesis. structural similarities between ascovirus and ichnovirus particles and the biology of their transmission suggest that ichnoviruses evolved from ascoviruses, although molecular evidence for this hypothesis is lacking. 
results: here we show that a family of unique pox-d ntpase proteins in the glypta fumiferanae ichnovirus are related to three diadromus pulchellus ascovirus proteins encoded by orfs , and . a new alignment technique also shows that two proteins from a related ichnovirus are orthologs of other ascovirus virion proteins. conclusion: our results provide molecular evidence supporting the origin of ichnoviruses from ascoviruses by lateral transfer of ascoviral genes into ichneumonid wasp genomes, perhaps the first example of symbiogenesis between large dna viruses and eukaryotic organisms. we also discuss the limits of this evidence through complementary studies, which revealed that passive lateral transfer of viral genes among polydnaviral, bacterial, and wasp genomes may have occurred repeatedly through an intimate coupling of both recombination and replication of viral genomes during evolution. the impact of passive lateral transfers on evolutionary relationships between polydnaviruses and viruses with large double-stranded genomes is considered in the context of the theory of symbiogenesis. approximately two-thirds of these wasps are endoparasites, meaning that the larval stages develop within the body cavity of their hosts, typically other insects. among the most successful of these endoparasitic wasps are those that use lepidopteran larvae as hosts. owing to the economic importance of these insects and the utility of their wasp parasites as biological control agents, the ability of these parasites to develop within lepidopteran hosts without triggering an intense immune response has been the subject of numerous studies over the past forty years. 
early studies of the mediterranean flour moth, ephestia kuhniella, parasitized by the ichneumonid, venturia canescens, showed that eggs of this species are coated with particles that resemble virions [ ] [ ] [ ] and contain surface proteins that mimic host proteins, thus keeping the eggs and larvae from being recognized as foreign material by their host. these particles lack dna, and thus are not considered virions [ ] . with respect to both species number and mechanisms that lead to successful parasitism, endoparasitic wasps are known to inject secretions at oviposition, but only a few lineages use viruses or virus-like particles (vlps) to evade or to suppress host defences. in the family ichneumonidae, for example, four types of host defence suppression mediated by the injection of fluids or suspensions are known that lead to successful parasitism. 1) fluid injected with eggs bypasses host defences without the aid of viruses or vlps [ ] . 2) wasps inject a virus that replicates in both the wasp and lepidopteran host. one example is the wasp diadromus pulchellus, which injects an ascovirus, dpav [ ] into host pupae to circumvent host defence response. 3) the wasp injects vlps capable of molecular mimicry and/or direct defence suppression. 4) the wasp injects polydnavirus particles that contain genes coding for proteins that interfere with host defence responses. the last mechanism is by far the best-studied type of direct immune suppression by ichneumonid wasps, and occurs in many species belonging to genera campoletis, hyposoter and tranosema (ichneumonidae, campopleginae), and glypta (ichneumonidae, banchinae) [ ] . in these cases, female wasps inject eggs along with ichnovirus particles into their hosts. similarly, in certain lineages of endoparasitic braconid wasps, other types of immunosuppressive particles containing dna occur in the fluid injected along with eggs [ [ ] ; for a review, [ ] ]. 
once in the host, ichneumonid and braconid particles enter host nuclei and their dna is transcribed, producing proteins that selectively suppress various steps in the host defence response. as a result of this unusual biology, these particles were described as symbiotic viruses belonging to a new viral family, polydnaviridae [ ] [ ] [ ] . since the 's, it was assumed that the dna in the polydnavirus particles, as with all other viruses, encoded typical enzymes and proteins for viral replication and virion assembly and structure. however, several recent genomic studies have shown that only a small number of the genes vectored into lepidopteran hosts, less than %, have homologs in other viruses. most viral dna is noncoding, except that which codes for wasp proteins involved in suppression of immune pathways, such as phenoloxidase activation and the toll pathways [ , , ] . even before these genomic studies, it was suggested that these particles were more similar to organelles than viruses [ ] . the similarities between particle structure and virions of known types of complex dna insect viruses are striking, and suggest these immunosuppressive particles originated by symbiogenesis between viruses and endoparasitic wasps, the same evolutionary process by which mitochondria and plastids originated from symbiotic bacteria [ ] . for example, most braconid wasps produce enveloped bacilliform particles classified as bracoviruses, and these resemble baculovirus and nudivirus virions [ , ] . similarly, ichneumonid wasps produce enveloped spindle-shaped particles classified as ichnoviruses that resemble virions of ascoviruses, viruses lethal to lepidopterans, which, interestingly, are vectored by endoparasitic wasps [ ] . it must also be noted that ichnoviruses resemble other true virus particles that are structurally very similar to virions of ascoviruses, but which remain unclassified because of the lack of information about their genomes [ ] [ ] [ ] [ ] [ ] . 
however, ascoviruses and ichnoviruses display very different genome properties; similar genomic differences occur between bracoviruses and baculoviruses or nudiviruses, suggesting that convergent evolution led to the origin of the different polydnavirus types from at least two different types of viruses. in ascoviruses, the genome consists of a single circular dna molecule ranging from -to -kbp in size [ ] . phylogenetic analyses of several viral genes have revealed that ascoviruses are closely related to iridoviruses [ ] , and likely evolved from them. in contrast, the genome of ichnoviruses is composed of multiple circular dna molecules ( to ) representing a total size of to kbp, all of which are replicated from the wasp chromosomes. the ichnovirus proviral genome is specifically excised and amplified in several segments in the female calyx cells, the only wasp tissue in which ichnovirus virogenesis occurs. after assembly, these particles are secreted into the female genital tract. once injected into the host, the ichnovirus genome does not replicate, and does not lead to the production of a new virus generation. the third characteristic of ichnoviruses is that most of the genes borne by the particles are not related to viral genes. among the annotated ichnovirus gene families, there are four (rep, prrp, n, and trv) for which no homology with known eukaryotic (or prokaryotic) proteins has been detected and for which no function has been proposed. among the remaining three (cys, ank and inx), cys-motif proteins have no clear homologs among eukaryotic (or prokaryotic) proteins, although the "cysteine knot" that they form is a folding domain found in many proteins, but not one that is necessarily related to eukaryotic host immune systems [ , ] . however, some protein domains and their putative functions suggest that they might be related to regulatory components of eukaryotic host defence systems that are not sufficiently elucidated. 
although the resemblance of the polydnavirus virions to those of conventional insect viruses suggests that the former evolved from the latter, to date no molecular evidence supports this hypothesis. in the case of ascoviruses and ichnoviruses, well-conserved genes found among the three ascoviruses sequenced so far (sfav a [ ] , tnav c [ ] , and hvav e [ ] ) are not found in ichnovirus genomes. as noted above, the principal reason for this is that the genomes of the latter viruses appear to contain mainly wasp genes, not viral genes. this highlights the need for new and alternative types of sequence data obtained from pertinent biological systems. in this regard, dpav has features that could provide important insights. indeed, it is the only ascovirus known to replicate in both its wasp and caterpillar hosts. it is transmitted vertically from wasp to caterpillars to suppress the defence response of the latter host, thereby enabling parasite development [ , ] . moreover, in males and females of d. pulchellus, the dpav genome resides in the nuclei of all host cells, providing a possible example of what may have been an intermediate stage in the symbiogenesis that led to the evolutionary origin of ichnoviruses. we recently sequenced the dpav genome, and a combination of our analysis of this genome and recent data from new types of ichnoviruses, as well as new software programs that elucidate protein relationships based on structural analysis, have enabled us to detect phylogenetic relationships between proteins encoded by open reading frames of dpav and the glypta fumiferanae (gfiv) and campoletis sonorensis (csiv) ichnoviruses. in support of the symbiogenesis hypothesis for the origin of ichnoviruses, data and analyses suggest two independent symbiogenic events, in agreement with what was previously proposed [ ] . the first led to the ichnoviruses in the banchinae lineage. this hypothesis is based on the occurrence of a gene cluster present in gfiv and dpav . 
the second symbiogenic event led to ichnoviruses in the campopleginae wasp lineage. this hypothesis is based on relationships of the major capsid proteins among csiv, ascoviruses and iridoviruses. extending our investigations to proteins encoded by open reading frames of certain ascoviruses and bracoviruses, hosts and bacteria, in the light of recent analyses about the involvement of the replication machinery of virus groups related to ascoviruses in lateral gene transfer [ ] , we discuss the robustness and the limits of the molecular evidence supporting an ascovirus origin for ichnovirus lineages. the dpav genome sequenced by genoscope (france) is , -bp in length. its organization, gene content and evolutionary characteristics will be detailed in a separate publication (manuscript in preparation; additional file ). however, blast results obtained with several orfs in the dpav genome provide evidence that certain ichnovirus orfs have their closest relatives in an ascovirus genome. specifically, we identified a -kbp region that contains a cluster of three genes ( fig. , orf , and ; additional files and ) that have close homologs in a gfiv gene family composed of seven members [ ] . all contain a domain similar to a conserved domain found in the pox-d family of ntpases. to date, this pox-d domain has been identified as a ntp binding domain of about amino acid residues found only in viral proteins encoded by poxvirus, iridovirus, ascovirus and mimivirus genomes. these genes seem to be specific to gfiv, as they are absent in the three sequenced genomes of other ichnoviruses, namely csiv, tranosema rostrales ichnovirus (triv), and hyposoter fugitivus ichnovirus (hfiv). more specifically, in dpav , orf encodes a protein of amino acid residues that is % similar from position to to a protein of amino acid residues encoded by the orf contained in the segment c in the gfiv genome (fig. ) . these two proteins can therefore be considered putative orthologs. 
the c-terminal residues of this dpav protein are also % similar to the cterminal domain of the protein homologs encoded by the orf of the d and d gfiv segments, % similar to the n-terminal and the c-terminal domains of the protein encoded by the orfs r and l of the iridovirus civ and lcdv, and % similar with those encoded by orfs , and in the ascovirus genomes of hvav e, sfav a and tnav c, respectively. overall, this indicates that this dpav protein is more closely related to that of gfiv than to those found in other ascovirus and iridovirus genomes currently available in databases. orf encodes a protein of amino acid residues similar only with the c-terminal domain of three proteins encoded by the orfs , and , contained, respectively, in gfiv segments d , d and d . in contrast, orf is closer to iridovirus and ascovirus genes than to gfiv genes. this protein of amino acid residues is % similar over all its length to civ orf r orthologs in all iridoviral and ascoviral genomes and is only % similar over amino acid residues to the c-terminal domain of the gfiv protein homologs encoded by the orf , , , , and in, respectively, the c , c , d , d , d and d segments of this virus. analysis of the genes surrounding the dpav orf- - - cluster confirms that this virus has an ascovirus origin since this region contains orfs that are close homologs of genes in iridovirus and ascovirus genomes. upstream from the orf- - - cluster, an orf encoding the dna-dependent rna polymerase subunit c is present, which is an ortholog of the iridoviral civ orf r and the ascoviral sfav a orf . downstream from this cluster, there are two genes, absent in known ascoviral genomes, but similar to the iridoviral civ orf l and civ orf l. these two genes encode, respectively, a chromosomal replication initiation protein and zinc finger protein. 
in between them, a gene encoding a small protein is present that is similar to that encoded by the orf l of the iridovirus civ, and which corresponds to the ali-like protein also found in entomopoxviruses [ ] . since the three dpav genes have relatives in all ascovirus and iridovirus genomes sequenced so far, their presence in the dpav genome cannot result from a lateral transfer that occurred from an ichnovirus genome related to gfiv to dpav . thus, as these dpav genes are the closest relatives of the pox-d gene family present in gfiv identified so far, they could be considered a landmark of the symbiogenic ascovirus origin of the ichnovirus lineage to which this polydnavirus belongs. an alternative explanation is that the presence of dpav -like genes in the genome of gfiv resulted from a lateral transfer from viral genomes closely related to those of gfiv and dpav . indeed, this might have happened when a glypta wasp was infected by an ancestral virus related to dpav . nevertheless, the symbiogenic origin of gfiv from ascoviruses is also supported by morphological features of its virions [ ] , which, aside from similarities in shape, also show reticulations on their surface in negatively stained preparations, a characteristic of the virions of all ascovirus species examined to date [ ] . because ascovirus virions and ichnovirus particles display structural similarities, we developed an approach to search for homologs of virion structural proteins in ichnoviruses. these approaches were initiated in and recently finalized, but some of the conclusions have been published [ ] . to date, only two virion proteins from the campoletis sonorensis ichnovirus (csiv) have been characterized [ , ] . the first is the p (acc n° aad ), a structural protein that appears to be located as a layer between the outer envelope and nucleocapsid, and the second, p , a capsid protein (acc n° af ). presently, there are more than one hundred ascoviral or iridoviral mcp sequences in databases. 
blast searches using these sequences failed to detect any similarities between csiv virion proteins and ascoviral or iridoviral mcps, or any other proteins [ ] . to evaluate the possibility that homology between ichnovirus and ascovirus virion proteins may simply not be detectable by conventional blastp searches, we used a different method, wapam (weighted automata pattern matching; [ ] ). the models were designed on the basis of a previous study [ ] demonstrating that mcp encoded by ascovirus, iridovirus, phycodnavirus and asfarvirus genomes are related, and all contain conserved domains separated by hinges of very variable size. we investigated these conserved domains further using hydrophobic cluster analysis (hca, [ ] ). this analysis revealed that most conservation occurred at the level of hydrophobic residues, as expected for structural proteins (additional file a and b). the size variability of the hinges between conserved domains and the conservation of hydrophobic residues might explain why blast searches using iridoviral and ascoviral mcp sequences have limited ability to detect mcp orthologs in phycodnavirus and asfarvirus genomes. we designed two syntactic models (see materials and methods), which together were able to specifically align all mcp sequences of the four virus families. 

figure: map of the -kbp region of the dpav genome (embl acc. n° cu and cu ) that contains the gene cluster with direct homologs in the genome of the glypta fumiferanae ichnovirus.

figure: amino acid sequence comparison resulting from a blast search done with the dpav orf as a query, and the best hit corresponding to the protein encoded by the orf of the ichnovirus segment gfv-c (subject; genbank acc. n° yp_ ).
importantly, wapam aligned the csiv ichnovirus p structural protein with both models. complementary structural and hca confirmed the presence of the seven conserved domains in this csiv structural protein ( fig. a and additional file c). in addition to the above analysis, ten syntactic models were developed using proteins conserved in the three sequenced ascovirus species (sfav a, tnav c, and hvav a) and twelve iridoviruses [ ] . none of these , developed from small proteins encoded by the dpav orf , sfav a orf , hvav a orf , and tnav c orf in the ascovirus genomes, and iridovirus civ orf l and mimivirus miv orf r genomes, respectively. importantly, these proteins have orthologs in vertebrate iridoviruses, phycodnaviruses, and asfarvirus. 

figure: ... and , typed in black) , dpav (lanes and , typed in blue) and sfav a (lanes and , typed in purple) . conserved positions among the amino acid sequence of csiv and those of dpav and sfav a are highlighted in grey. secondary structures in the three sfav a orf orthologs were calculated with the network protein sequence analysis at http://npsa-pbil.ibcp.fr/ and the statistical relevance of the secondary structures was evaluated with psipred at http://bioinf.cs.ucl.ac.uk/psipred/. c, e and h in lanes to respectively indicate for each amino acid that it is involved in a coiled, β-sheet or α-helix structure. using default parameters of psipred, upper case letters indicate that the predicted secondary structure is statistically significant in psipred results. significant secondary structures are highlighted in yellow. in (a), the comparisons were limited to three of the seven conserved domains (additional file a, b and c), the , and . indeed, classical in silico methods appeared to be inappropriate to predict statistically significant secondary structures in conserved structural proteins rich in β strands such as iridovirus and ascovirus mcp. in contrast, a complete and coherent domain comparison was obtained by hca profiles (fig. s b, c) .
in sfav a, the peptide encoded by orf is one of the virion components. in ascoviruses, iridoviruses, phycodnaviruses, and the asfarvirus, they have been annotated as thioredoxines, proteins that play a role in initiating viral infection [ ] [ ] [ ] . database mining with our model revealed four hits with csiv sequences (acc n°. m , s , af , af ) each a homolog orf of sfav a orf . in fact, these sequences correspond to several variants of a single region contained in the b segment of the csiv genome. to date, these have not been annotated in the final csiv genome, probably because they overlap a recombination site. hca analyses confirmed that the hydrophobic cores were conserved ( fig. b and additional file d and e). the chromosomal locations of genes encoding these two csiv proteins, i.e., p and p , were also consistent with the symbiogenesis hypothesis. in fact, the orf encoding p is not found in proviral dna. it is notable that no orfs encoding orthologs of p or other structural proteins such as mcps are found in any of the other three ichnovirus genomes sequenced -triv, gfiv, hfiv [ , ] . therefore, this indicates that the orthologs of ichnovirus mcps and other virion structural proteins are also probably located in the genomes of these wasps, i.e., not in proviral dna. in contrast to this, we found that the gene encoding the csiv ortholog of sfav a orf is located within the proviral dna. whether ortholog proteins are similarly involved in the triv, gfiv and hfiv biology, their genes are not found in proviral dna, since no matches were detected in their viral genomes. the phylogenetic analysis performed previously on p and the sfav a orf orthologs [ ] indicated that they have an ancestor close to that of the ascoviruses and iridoviruses. 
as in the case of genes encoding the pox-d family of ntpases in all ascoviruses, iridoviruses, and gfiv, genes encoding virion proteins cannot result from a horizontal transfer from a campoplegine or banchine ichnovirus genome to all ascovirus, iridovirus, phycodnavirus and asfarvirus genomes. as the ascovirus genes encoding the two virion proteins investigated here are the closest relatives of virion proteins in csiv, they can be considered a landmark reflecting the symbiogenic origin of the two ichnovirus lineages from ascoviruses closely related to dpav . in fact, the difficulty encountered in elucidating their sequence relationships can be explained by a combination of the marked transition from ascovirus to ichnovirus, and the significant selection constraints that resulted as the latter virus type evolved from the former. analysis of available ascovirus, iridovirus and ichnovirus genomes provides some of the first molecular support for the hypothesis that ichnoviruses evolved from ascoviruses by symbiogenesis. however, examining genes shared only by ascovirus, iridovirus and ichnovirus genomes likely limits the sources of genes that contributed to the evolution and complexity of these viruses, especially of the role of lateral gene transfer. relevant to this is the recent finding that an important part of the mimivirus and phycodnavirus genomes had a bacterial origin [ ] . obviously, this did not lead to the conclusion that these viruses had a bacterial origin. the cytoplasmic environment in which these viruses replicate is rich in bacterial dna because their amoebae and unicellular algae hosts feed on bacteria that they digest in their cytoplasm. thus, it has been proposed [ ] that lateral transfers of bacterial dna within these viral genomes were driven by intimate coupling of recombination and viral genome replication. indeed, replication of these viruses is similar to that of bacteriophage t . 
this mode of replication has been called recombination-primed replication. it permits integration of dna molecules with sequence homology as short as -bp [ , ] . the replication machinery used by ascoviruses, iridoviruses, mimiviruses, phycodnaviruses, and other nucleocytoplasmic large dna viruses (ncldv) [ , ] is common to all of them, despite differences in the specifics of replication in each virus family. it can therefore be expected that recombination-primed replication occurred repeatedly during evolution of both these viruses and the genome of their eukaryotic hosts. in an eukaryotic cellular environment in which bacteria, chromosomes, ncldv viruses and non-ncldvs (such as baculoviruses) intimately cohabit temporarily or permanently, recombination-primed replication is able to allow reciprocal passive lateral transfers between viral genomes, host chromosomes, and bacterial dna. under these conditions, lateral transfers are considered passive since they just result from the intimate environment and not from an active mechanism dedicated to genetic exchanges. in ascoviruses and iridoviruses, the occurrence of such lateral transfers is supported by blastp searches that detected the presence of orfs whose closest relatives have their origin within eukaryotic genomes (e.g., for dpav , in additional data , orfs , , , , , ), bacterial genomes (e.g., for dpav , in additional data , orfs , , , , and ) or viruses belonging to other ncldv and non-ncldv families (e.g., for dpav , in additional data , orfs , , , ). the transmission of ascoviruses is unusual in that they are poorly infectious per os and appear to be transmitted among lepidopteran hosts by parasite wasp vectors at oviposition [ , ] . the genome of the ascoviruses can be replicated in presence of polydnavirus dna either within the reproductive tissues of female wasps or within the body of the parasitized hosts infected by both polydnavirus and ascovirus. 
consequently, integrated sequences of ascovirus origin can be expected within wasp and polydnavirus genomes. reciprocally, sequences of polydnavirus origin may have been integrated in ascovirus genomes, whatever the wasp origin, ichneumonid or braconid. one gene family related to a bacterial family of n-acetyl-l-glutamate -phosphotransferase (acc. n° of the closest bacterial relatives yp_ , cam , zp_ , zp_ ), identified only within the sfav a, hvav e and tnav c genomes, supports this conclusion. it has been found in the genome of a bracovirus, cotesia congregata bracovirus (ccbv [ ] ; fig. ). since this gene is absent in the genome of microplitis demolitor bv, a related bracovirus [ ] , it is difficult to infer the direction of the lateral transfer between the common ancestors of the three ascoviruses and of the wasp c. congregata. however, these observations unambiguously indicate that there was at least one lateral transfer for this gene between the common ancestor of ascoviruses and the parasitic wasp. since iridoviruses, like ascoviruses and other virus species [ , ] , are, in some cases, vectored by parasitic wasps, databases were mined using all the available ichnovirus proteins as queries. we found no significant relationships between csiv, hfiv and triv genomes and the genomes of their putative closest ncldv and non-ncldv relatives. this indicates that passive lateral gene transfers from virus to eukaryotes that are successfully spread and maintained in ichnovirus genomes remain rare events. one case of such lateral transfer was described in the ccbv genome. in this genome, aside from the presence of cardinal endogenous eukaryotic retrotransposons and polintons that transposed in the chromosomal dna of the proviral form of ccbv [ ] [ ] [ ] , two genes encoding acmnpv p -related proteins, which have their closest relatives among granuloviruses (xcgv), were found. 
this suggests that ccbv contained at least two cases of lateral transfers between non-ncldv and a bracovirus. our results provide another source of evidence that passive lateral gene transfers have occurred regularly during evolution from bacteria to viruses and eukaryotes, and between viruses and eukaryotes [ ] [ ] [ ] [ ] . even if the pox-d ntpase genes in the gfiv genome, and the mcp and sfav -like genes in the csiv genome, indicate that they have an ascovirus origin, they provide only limited evidence supporting an ascovirus origin of ichnoviruses. indeed, their sequence conservation and biological characteristics suggest that there were repeated lateral transfers during evolution between ascoviruses and wasp genomes, including the proviral ichnovirus loci. this raises an important issue about the role of lateral transfers during co-evolution of the ncldvs and non-ncldvs, ichnovirus, wasp and parasitized host. indeed, genetic materials of various origins have been exchanged and maintained during co-evolution. this therefore suggests that ichnoviruses might be chimeric entities partly resulting from sev-

symbiogenesis was first proposed as an evolutionary mechanism when it became widely recognized that mitochondria and plastids originated from free-living prokaryotes [ ] . the genomes of the endosymbiotic cyanobacteria and proteobacteria, respectively, at the origin of chloroplasts and mitochondria have evolved by reduction of several orders of magnitude to the approximate size of plasmids. concurrently, nuclear genomes have been the recipients of plastid genomes. this relocation of the genes encoding most proteins of the endosymbiotic bacteria to the host nucleus is the ultimate step of this evolutionary process, so-called endosymbiogenesis [ , ] . recent studies of plants have revealed a constant deluge of dna from organelles to the nucleus since the origin of organelles [ ] . 
this allows the host cell to have the genetic control on its organelles, in a relationship that is closer to enslavement or domestication than to a symbiosis or a mutualism in which the organelles would recover benefits from their contribution to the eukaryotic cell well-being. to date, this deluge of dna is considered to correspond to passive lateral transfers that result from the interactions between the life cycle of the organelle and nuclear replication. numerous cases of symbiogenesis between endocellular bacteria and a wide variety of eukaryotic hosts have been characterized. however, recent work has demonstrated that this evolutionary process was not restricted to bacteria. it also occurred between endocellular eukaryotes such as unicellular algae and fungal endophyte in plants [ , ] . endosymbiogenesis was also proposed as the evolutionary mechanism that allowed some invertebrate viruses with a large double-stranded dna genome related to the nudiviruses and the ascoviruses [ ] , to have led, respectively, to the origin of bracoviruses and ichnoviruses, which are currently recognized as forming two genera within the family polydnaviridae. although presently there is no definitive evidence ruling out the hypothesis that the resemblance between ichnovirus and ascovirus virions is only an evolutionary convergence, the genomic differences between ascovirus and ichnoviruses are in good agreement with the symbiogenetic hypothesis. indeed, they match an evolutionary scenario of endosymbiogenesis during which, from a single integration event of symbiotic virus genome, viral genes were lost and/or translocated from the provirus to other chromosomal regions (fig. ). in parallel, host genes of interest for the wasp parasitoid were integrated and diversified by selection and gene duplication in the proviral dna. in this scenario, the more ancient symbiogenesis, the rarer the traces of genes from viral origin in the ichnovirus genome would be. 
this constitutes a constraint that dramatically limits the possibility to investigate the evolutionary links between ascovirus and ichnovirus. results of our analyses demonstrate that the situation is also complicated by the fact that lateral gene transfers unrelated to the origin of ichnoviruses cause important misleading background noise. moreover, the scenario in figure is close to a previously proposed version [ ] , but is not consistent with results presented here, nor with recently accumulated knowledge on dna transfer from organelles into the nucleus. since endocellular environments favour lateral transfers between virus and wasp nucleus, it can be proposed that genes of virus origin that are involved in the ichnovirus biology were passively integrated in one or several loci, step by step over time, alone or through transfers of gene clusters, or even the entire viral genome. since parasitoid wasps are able to vector different viruses [ , ] , this second scenario opens the exciting possibility that virus genes involved in the ichnovirus biology might correspond to a gene patchwork resulting from transfers from viruses belonging to different ncldv and non-ncldv families. because of the background noise due to lateral gene transfers found in these systems, elucidating the origins of ichnoviruses will be very time-consuming, requiring new accurate experimental approaches to generate more robust evidence. sequencing wasp genomes to identify proteins of viral origin that are components of virions and involved in the assembly of these may well contribute to our understanding of how ichnoviruses and bracoviruses evolved from other insect dna viruses.

searches for similarities were mainly developed using facilities of blast programs at two websites, http://www.ncbi.nlm.nih.gov/blast/blast.cgi and http://genoweb.univ-rennes .fr/serveur-gpo/outils.php ?id_rubrique= . 
for dpav genes having their origin within eukaryotic, bacterial or virus genomes belonging to ncldv and non-ncldv families, the closest gene was located using the distance trees supplied with each blast search at the ncbi website. construction of syntactic models: conserved amino acid blocks and positions described previously [ , ] and with new data sets were verified or determined using meme at http://meme.sdsc.edu/meme/meme.html. in the first step, we used motifs resulting from meme to make mast minings in databases at http://meme.sdsc.edu/meme/mast.html. since meme motifs depend significantly on the data set used to calculate them, this approach did not enable an exhaustive detection of homologs among ascoviruses, iridoviruses, phycodnaviruses, mimiviruses and asfarviruses, and the detection sensitivity was ultimately very similar to that obtained with blast. to reach our detection objectives, we therefore constructed syntactic models that only included the most conserved positions and their variable spacing using wapam at the website http://genoweb.univ-rennes .fr/serveur-gpo/outils_acces.php ?id_syndic= &lang=en. these models were refined empirically until they allowed an exhaustive detection in refseq-protein and genbank databases of the homologs among ascoviruses, iridoviruses, phycodnaviruses, mimiviruses and asfarviruses. the procedure was repeated until we detected only exact matches with the syntactic model. whatever the results obtained with wapam, they required confirmation with other approaches. here, we used psipred result comparison for regions with scores over and hca analyses for regions having scores lower than with psipred. this simplified the statistical treatment of the results obtained with wapam, since all exact matches have significance or a score of %. 
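the idea of a syntactic model built from "the most conserved positions and their variable spacing" can be illustrated with a minimal sketch. the code below is not wapam and does not reproduce the models used in this study: it is an assumed simplification in which each conserved block is written as a regular-expression fragment and each hinge as a bounded run of arbitrary residues, so that only exact matches are reported (weights and mismatches, which a weighted automaton can handle, are ignored). the names `compile_model`, `match_model` and `TOY_MODEL`, as well as the toy blocks and gap bounds, are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# A model is an ordered list of (block, (min_gap, max_gap)) pairs.
# "block" is a regex fragment describing one conserved block; the gap
# range bounds the variable-size hinge that FOLLOWS the block.
# The gap range of the last block is ignored.
Model = List[Tuple[str, Tuple[int, int]]]

# Hypothetical toy model (not taken from the paper): two short conserved
# blocks separated by a hinge of 2 to 6 residues.
TOY_MODEL: Model = [
    ("G[ILV]K", (2, 6)),
    ("D.[FYW]", (0, 0)),
]


def compile_model(model: Model) -> re.Pattern:
    """Translate conserved blocks and hinge ranges into a single regex."""
    parts = []
    for index, (block, (low, high)) in enumerate(model):
        parts.append(f"({block})")  # one capture group per conserved block
        if index < len(model) - 1:
            parts.append(f"[A-Z]{{{low},{high}}}")  # variable-size hinge
    return re.compile("".join(parts))


def match_model(model: Model, sequence: str) -> Optional[List[Tuple[int, int]]]:
    """Return (start, end) spans of each conserved block, or None if no exact match."""
    hit = compile_model(model).search(sequence.upper())
    if hit is None:
        return None
    return [hit.span(group) for group in range(1, len(model) + 1)]


if __name__ == "__main__":
    # Hinge of 4 residues: both blocks are found.
    print(match_model(TOY_MODEL, "MAAGIKTTTTDAWQQ"))  # [(3, 6), (10, 13)]
    # Hinge of 1 residue: outside the allowed range, so no match.
    print(match_model(TOY_MODEL, "MAAGIKTDAWQQ"))     # None
```

because a sequence either satisfies every block and hinge constraint or is rejected, a hit in this sketch is all-or-nothing, which mirrors the remark above that exact matches do not need a graded score; any such hit would still need confirmation by structural comparison.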
figure: hypothetical mechanism for the integration and evolution of ascovirus genomes in endoparasitic wasps. schematic representation of the three-step process of symbiogenesis, and dna rearrangements that putatively occurred in the germ line of the wasp ancestors in the banchinae and campopleginae lineages, from the integration of an ascoviral genome to the proviral ichnoviral genome. sequences that originate from the ascovirus are in blue, those of the wasp host and its chromosomes are in pink. genes of ascoviral origin are surrounded by a thin black or white line, depending on their final chromosomal location. two solutions can account for the final chromosomal organisation of the proviral ichnovirus genome, monolocus or multilocus, since this question is not fully understood in either wasp lineage. more complex alternatives to this three-step process might also be proposed and would involve, for example, the complete de novo creation of a mono or multi locus proviral genome from the recruitment by recombination or transposition of ascoviral and host genes located elsewhere in the wasp chromosomes. this model for the chromosomal organization of proviral dna in polydnaviruses is consistent with data recently published [ ] . 
immune surface of eggs of a parasitic insect the resistance of insect parasitoids to the defense reactions of their hosts an insect glycoprotein: a study of the particles responsible for the resistance of a parasitoid's egg to the defence reactions of its insect host role of virus-like particles in parasitoid-host interaction of insects venom from the endoparasitic wasp pimpla hypochondriaca adversely affects the morphology, viability, and immune function of hemocytes from larvae of the tomato moth, lacanobia oleracea characteristics of pathogenic and mutualistic relationships of ascoviruses in field populations of parasitoid wasps polydnavirus genomes reflect their dual roles as mutualists and pathogens particles containing dna associated with the oocyte of an insect parasitoid family polydnaviridae. in virus taxonomy. eighth report of the international commitee on taxonomy of viruses edited by: fauquet cm virus in aparasitoid wasp: suppression of the cellular immune response in the parasitoid's host polydnaviridae -a proposed family of insect viruses with segmented, doublestranded, circular dna genomes genome sequence of a polydnavirus: insights into symbiotic virus evolution shared and species-specific features among ichnovirus genomes origin and evolution of polydnaviruses by symbiogenesis of insect dna viruses in endoparasitic wasps symbiosis in cell evolution hyenoptera: formicidae) from brazil the ultrastructure of microorganisms in the tissues of casenaria infesta (hymenoptera: ichneumonidae) apparent replication of an unusual viruslike particle in both parasitoid wasp and its host an unusual virus from the parasitic wasp cotesia melanoscela. 
virology viruslike particles in the ovaries of microctonus aethiopoides loan (hymenoptera: braconidae), a parasitoid of adult weevils (coleoptera: curculionidae) evidence for the evolution of ascoviruses from iridoviruses genomic sequence of spodoptera frugiperda ascovirus a, an enveloped, double-stranded dna insect virus that manipulates apoptosis for viral reproduction sequence and organization of the trichoplusia ni ascovirus c (ascoviridae) genome. virology sequenceand organization of the heliothis virescens ascovirus genome biological and molecular features of the relationships between diadromus pulchellus ascovirus, a parasitoid hymenopteran wasp (diadromus pulchellus) and its lepidopteran host, acrolepiopsis assectella dpav- , on thehemocytic encapsulation response and capsule melanization of the leek-moth pupa, acrolepiopsis assectella genomic and morphological features of a banchine polydnavirus: comparison with bracoviruses and ichnoviruses i am what i eat and i eat what i am: acquisition of bacterial genes by giant viruses the genome of melanoplus sanguinipes entomopoxvirus cloning and expression of a gene encoding a campoletis sonorensis polydnavirus structural protein a gene encoding a polydnavirus structural polypeptide is not encapsidated what does structure tell us about virus evolution? 
cluster of re-configurable nodes for scanning large genomic banks deciphering protein sequence information through hydrophobic cluster analysis (hca): current status and perspectives comparative genomic analysis of the family iridoviridae: reannotating and defining the core set of iridovirus genes the thioredoxin system in retroviral infection and apoptosis mimivirus giant particles incorporate a large fraction of anonymous and unique gene products cell entry by enveloped viruses: redox considerations for hiv and sars-coronavirus genetic recombination of the dna plant virus pbcv- in a chlorella alga common origin of four diverse families of large eukaryotic dna viruses evolutionary genomics of nucleo-cytoplasmic large dna viruses effects of the nonoccluded virus of spodoptera frugiperda (lepidoptera: noctuidae) on the development of a parasitoid parasitoid-mediated transmission of an iridescent virus non-poly-dna viruses, their parasitic wasp, and hosts the few virus-like genes of cotesia congragata self-synthesizing dna transposons in eukaryotes marvericks, a novel class of giant transposable elements widespread in eukaryotes and related to dna viruses evolution of viruses by acquisition of cellular rna or dna nucleotide sequences and genes: an introduction microbialgenes in the human genome: lateral transfer or gene loss? science are there bugs in our genome? science express genome-wide survey for genes horizontally transferred from cellular organisms to baculoviruses morphogenesis by symbiogenesis endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes a cryptic intracellular green alga in ginkgo biloba: ribosomal dna markers reveal worldwide distribution forest succession suppressed by an introduced plant-fungal symbiosis unfolding the evolutionary story of polydnaviruses structure and evolution of a proviral locus of glyptapanteles indiensis bracovirus this research was funded by grants from the c.n.r.s. 
(pics n° ), the genoscope, the a.n.r. project in bioinformatics modulome, the ministère de l'education nationale, de yb is the leader of all aspects of the research on the biology, genomics, and evolution of dpav . ss coordinated the sequencing, assembly, and sequence quality control of the dpav genome. cag participated in the bioinformatics analysis of the dpav genome and the development of the manuscript. baf contributed original concepts regarding the evolutionary origins and role of polydnaviruses in endoparasitoid biology, provided virological expertise to optimize data interpretation, and participated in writing the manuscript. predicted orfs in dpav genome. key: cord- - s kuno authors: jaiswal, arun kumar; tiwari, sandeep; jamal, syed babar; de castro oliveira, letícia; alves, leandro gomes; azevedo, vasco; ghosh, preetam; oliveira, carlo jose freira; soares, siomar c. title: the pan-genome of treponema pallidum reveals differences in genome plasticity between subspecies related to venereal and non-venereal syphilis date: - - journal: bmc genomics doi: . /s - - - sha: doc_id: cord_uid: s kuno background: spirochetal organisms of the treponema genus are responsible for causing treponematoses. treponema pallidum is a gram-negative, motile, spirochete pathogen that causes syphilis in humans. treponema pallidum subsp. endemicum (ten) causes endemic syphilis (bejel); t. pallidum subsp. pallidum (tpa) causes venereal syphilis; t. pallidum subsp. pertenue (tpe) causes yaws; and t. pallidum subsp. carateum causes pinta. out of these four high morbidity diseases, venereal syphilis is transmitted by sexual contact; the other three diseases are transmitted by close personal contact. the global distribution of syphilis is alarming and there is an increasing need for proper treatment and preventive measures. unfortunately, effective measures are limited. results: here, the genome sequences of t. 
pallidum strains isolated from different parts of the world and a diverse range of hosts were comparatively analysed using pan-genomic strategy. phylogenomic, pan-genomic, core genomic and singleton analysis disclosed the close connection among all strains of the pathogen t. pallidum, its clonal behaviour and showed increases in the sizes of the pan-genome. based on the genome plasticity analysis of the subsets containing the subspecies t pallidum subsp. pallidum, t. pallidum subsp. endemicum and t. pallidum subsp. pertenue, we found differences in the presence/absence of pathogenicity islands (pais) and genomic islands (gis) on subsp.-based study. conclusions: in summary, we identified four pathogenicity islands (pais), eight genomic islands (gis) in subsp. pallidum, whereas subsp. endemicum has three pais and seven gis and subsp. pertenue harbours three pais and eight gis. concerning the presence of genes in pais and gis, we found some genes related to lipid and amino acid biosynthesis that were only present in the subsp. of t. pallidum, compared to t. pallidum subsp. endemicum and t. pallidum subsp. pertenue. spirochetal organisms of the treponema genus are responsible for causing treponematoses. pathogenic treponemes cause multi-stage infections like endemic syphilis, venereal syphilis, yaws and pinta. these infections have many similarities, but they can be differentiated based on epidemiological, clinical and geographical criteria [ ] [ ] [ ] . primarily, the pathogenic treponemes can be classified based on the clinical symptoms of the respective disease they cause. treponema pallidum subsp. endemicum causes endemic syphilis; t. pallidum subsp. pallidum causes venereal syphilis; t. pallidum subsp. pertenue causes yaws; and t. pallidum subsp. carateum causes pinta. out of these four high morbidity diseases, venereal syphilis is only transmitted by sexual contact; the other three diseases are transmitted by close personal contact [ ] . 
the world health organization (who) estimates that there are million new cases of syphilis annually and that the aggregated cases of yaws, bejel, and pinta (the endemic treponematoses) are approximately . million globally, although good surveillance data are not available. the infections caused by t. pallidum are characterized by periods of active clinical disease interrupted by episodes of asymptomatic latent infection and may cause life-long infections in untreated individuals [ , ] . treponema pallidum is a gram-negative, motile, spirochete human pathogen. syphilis is a multistage infectious disease that can be transmitted between sexual partners through active lesions or from an infected woman to her fetus during pregnancy [ , ] . syphilis has a worldwide distribution (e.g. africa has a high incidence), affecting every country and continent except perhaps antarctica [ ] [ ] [ ] [ ] [ ] . the stages of syphilis have been divided on the basis of clinical findings that guide treatment and follow-up. syphilis chancres may go unnoticed, primarily because of their well-documented painless nature and because they may occur in parts of the body that are difficult to visualize (e.g. cervix, throat or anus/rectum) [ ] . furthermore, because of their pleomorphic appearance and the lack of physician familiarity with the expressions of syphilis, the lesions may be misdiagnosed. secondary syphilis may manifest itself through severe rashes that may go unobserved by the patient or may mimic an extensive condition [ ] . t. pallidum remains fully sensitive to penicillin, despite the use of this antibiotic for seven decades in treating syphilis infections. standard treatment of uncomplicated syphilis with parenteral benzathine penicillin g is highly effective at all stages. however, resistance to other antibiotics (e.g. macrolide and clindamycin resistance) has been reported in several countries [ ] . 
the ongoing high rate of syphilis worldwide, despite the availability of inexpensive and effective treatment, presents the most convincing argument for the need of developing new and potent vaccine against syphilis [ ] . despite the who's initiative for the global elimination of congenital syphilis, an intensive syphilis-targeted public health control has been undertaken to reduce the incidence; however, it has not been achieved yet [ ] . specifically, the reasons for failure are multifactorial; some of the responsibility can be attributed to the difficulty in the diagnosis of syphilis and treatment, and lack of access or use of prenatal screening programs [ ] . the advancement in the field of genomics and cost-effective sequencing technologies has transformed the human bacterial pathogens study and helped in the improvement of vaccine designing technologies. a new and emerging methodology to get deep insight of the genome of a species or genus is the pan-genomics approach, which was introduced by tettelin and collaborators in working with streptococcus agalactiae [ ] . pan-genome provides us with the complete and non-redundant collection of genes from a species or genus and is composed of three subsets (core genome, shared genome and singletons): the core genome, which is the collection of all the genes commonly shared between all the genomes used as dataset; the shared genome, which contains only the genes shared between two or more strains, which are not present in all strains of the dataset; and, the singletons, which are present only in one strain and are referred to as strainspecific genes. the first genome of t. pallidum subsp. pallidum (strain nichols) was sequenced in . the organism has a comparatively small genome and only % of t. pallidum's open reading frames are recognized to have a biological function, which indicates that it uses host biosynthesis to complete some of its metabolic needs [ ] . 
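The three pan-genome subsets defined above (core genome, shared genome and singletons) can be illustrated with a short sketch. It assumes that each strain has already been reduced to a set of orthologous-group identifiers; the function name, strain names and group labels are hypothetical and are not part of the published workflow.

```python
from collections import Counter

def partition_pangenome(genomes):
    """Split ortholog groups into core, shared and singleton subsets.

    `genomes` maps a strain name to the set of ortholog-group ids it
    carries. At least two strains are assumed; with a single strain the
    core and singleton definitions would coincide.
    """
    n_strains = len(genomes)
    # Count in how many strains each ortholog group occurs.
    counts = Counter(g for groups in genomes.values() for g in set(groups))
    core = {g for g, c in counts.items() if c == n_strains}
    singletons = {g for g, c in counts.items() if c == 1}
    # Everything else occurs in two or more strains, but not in all.
    shared = set(counts) - core - singletons
    return core, shared, singletons

if __name__ == "__main__":
    toy = {
        "strain_1": {"og_a", "og_b", "og_c"},
        "strain_2": {"og_a", "og_b", "og_d"},
        "strain_3": {"og_a", "og_e"},
    }
    core, shared, singletons = partition_pangenome(toy)
    print(sorted(core), sorted(shared), sorted(singletons))
```

The pan-genome itself is simply the union of the three subsets, so the partition is non-redundant by construction.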
the dna-dna hybridization studies showed homology between dna of venereal syphilis spirochete and dna of culturable treponemes (t. phagedenis and its biotypes reiter and kazan) was less than % identical, but was indistinguishable from dna of the yaws spirochete t. pallidum [ , , ] . this study led to the reclassification of the agents of endemic syphilis, venereal syphilis and yaws as t. pallidum subsp. endemicum, treponema pallidum subsp. pallidum and t. pallidum subsp. pertenue, respectively. genomic sequencing has recognized these subspecies as clonal, but forming distinct genetic clusters [ , ] . in this work, we perform a pan-genome approach to better understand the differences of treponema pallidum infections in the broad spectrum and how genome plasticity is related to the symptom patterns. for pangenomic comparative analyses, we used t. pallidum strains. we present phylo-genomic correlations between all t. pallidum strains. furthermore, we describe the "pan-genome", which is the complete inventory of genes found in any member of the species; the "core genome", which is important for basic life processes; and the "singletons", which are normally related to environmental fitness and adaptation to host. finally, we provide insights into the specific subsets (singletons and the panand core genomes) of genomes of t pallidum strains and correlate these subsets with the plasticity of pathogenicity islands and virulence genes. the phylogenomics relationships between t. pallidum strains were determined using gegenees [ ] . furthermore, all genome sequences were cross-compared to generate a phylogenomic tree and to plot a heatmap. according to the generated phylogenomic tree, closely related strains appeared in the same cluster. the subspecies responsible for non-venereal syphilis is treponema pallidum subsp. endemicum (ten) and t. pallidum subsp. pertenue (tpe) strains appeared in closely related clusters (fig. ) . the t. 
pallidum subspecies strains responsible for venereal syphilis formed different clusters. additionally, t. pallidum strain bosniaa (subsp. endemicum) was positioned between the clusters of treponema pallidum subsp. pertenue and venereal syphilis (treponema pallidum subsp. pallidum). according to the heatmap, the non-venereal isolates are % similar to each other and many of the venereal isolates are % similar to each other, but the two groups show some difference (additional file : figure s ). moreover, the heatmap indicated the clonal-like behavior of t. pallidum subsp., compared with the isolates other than genital, anal or neurosyphilitic samples, which showed similarities ranging from to %. the pan-genome, core genome and singletons of treponema pallidum the main goal of the pan-genome is the comparison of different strains of the same species or even genus at the genomic level. the resulting pan-genome of pan all ( fig. a -a ), pan subsp_pallidum ( fig. b -b ), and pan_subsp_pertenue ( fig. c -c ) , of t. pallidum contains a total of , , and genes respectively. the formula (α = -γ) inferred that the pan-genome of t. pallidum is increasing with an α of . . the extrapolation was also separately calculated for all divided subsets for the analysis in this work. the α value for each subset pan subsp_pallidum and pan_subsp_pertenue, were . and . respectively. the α values for all datasets used in this work are less than which indicates that all have an open pan-genome. however, although the pan-genome is still open, it increases at a very low rate [ , ] . the core genome and singletons of the complete dataset and all the subsets of t. 
pallidum were calculated by the least-squares fit of the exponential regression decay to the mean values, as represented by the formula n = k * exp[-x/τ] + tg(θ), where n is the expected subset of genes for a given number of genomes, x is the number of genomes, exp is euler's number, and the other terms are constants defined to fit the specific curve. the resulting core genome of the complete dataset (pan all), the subsets pan subsp_pallidum and pan subsp_pertenue, have the following tg(θ) values, respectively:~ , , and~ . concerning the singletons of the complete dataset (pan all) and the subsets pan subsp_ pallidum, and pan subsp_pertenue, have the following tg(θ) values, respectively:~ ,~ . , and~ . . according to the least-squares fit of the exponential regression decay, the tg(θ) represents the point where the curve stabilizes, which may be translated to the number of genes in the core genome after stabilization and the number of singletons that will be added to the pan-genome for each newly sequenced genome. considering this rule, the core genome of the subset subsp_pertenue have higher number of core genes ( -number of core genes) after stabilization, whereas, the complete dataset has the smallest number of core genes ( -number of core genes). for the singletons, the tg(θ) value for all the dataset indicates only one gene will be added, whereas, the subsets from pan subsp_pallidum and pan subsp_pertenue will have and . newly added genes respectively. the core genes of the complete dataset, the subsets pan subsp_pallidum and pan subsp_pertenue, of t. pallidum were classified by cog (cluster of orthologous genes) functional category. according to the chart in fig. a -c, the core genome of all the strains had many genes related to the "metabolism" and "information storage and processing" categories. moreover, the majority of the core genome of all the strains were classified as "poorly characterized" (additional file : table s a -c). 
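The two extrapolations used in this section can be sketched in a few lines. The sketch assumes the usual form of Heaps' law, n = k * N^γ with α = 1 - γ, and the decay model n = k * exp[-x/τ] + tg(θ) given above. The fitting strategy (log-log regression for Heaps' law, a grid search over τ for the decay curve) and all function names are assumptions made for illustration; the in-house scripts of the study are not available here and may work differently.

```python
import math

def fit_heaps(genome_counts, gene_counts):
    """Fit n = k * N**gamma by linear regression on log-log values.

    Returns (k, gamma, alpha) with alpha = 1 - gamma.
    """
    xs = [math.log(v) for v in genome_counts]
    ys = [math.log(v) for v in gene_counts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    k = math.exp(my - gamma * mx)
    return k, gamma, 1.0 - gamma

def fit_decay(xs, ys, taus):
    """Fit n = k * exp(-x / tau) + c, where c plays the role of tg(theta).

    For each candidate tau the model is linear in k and c, so both are
    obtained in closed form; the tau with the smallest squared error wins.
    Returns (k, tau, c).
    """
    best = None
    n = len(xs)
    my = sum(ys) / n
    for tau in taus:
        e = [math.exp(-x / tau) for x in xs]
        me = sum(e) / n
        see = sum((v - me) ** 2 for v in e)
        if see == 0.0:
            continue  # degenerate candidate, cannot estimate k
        k = sum((v - me) * (y - my) for v, y in zip(e, ys)) / see
        c = my - k * me
        sse = sum((k * v + c - y) ** 2 for v, y in zip(e, ys))
        if best is None or sse < best[0]:
            best = (sse, k, tau, c)
    return best[1], best[2], best[3]

if __name__ == "__main__":
    # Synthetic data only; these are not values from the study.
    n_genomes = list(range(1, 11))
    pan_sizes = [1000 * g ** 0.05 for g in n_genomes]
    print(fit_heaps(n_genomes, pan_sizes))
    core_sizes = [200 * math.exp(-g / 2.0) + 900 for g in n_genomes]
    print(fit_decay(n_genomes, core_sizes, [i / 10 for i in range(5, 105)]))
```

In this formulation the fitted constant c is the plateau that the curve approaches as more genomes are added, which is how tg(θ) is interpreted in the text: the stabilized core genome size, or the number of singletons contributed by each new genome.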
the presence of pathogenicity islands (pais) is generally related to evolution in a different genomic environment [ ] . however, it may only be the effect of relaxation of purifying selection genes involved in increasing the range of environmental responses. interspecies genome plasticity may result from several events, of which horizontal gene transfer is particularly important because it can cause the acquisition of blocks of genes (genomic islands, or gis), producing evolution by quantum leaps [ ] . these genes are often flanked by transposases (insertion elements), have altered g + c content and skew, suggesting their acquisition through horizontal gene transfer (hgt), intermediated by phages or recombination [ ] . pais are important in this context because they represent a class of gis that carry virulence genes, i.e., factors that enable or enhance the parasitic growth of an organism inside a host [ ] . the genome plasticity of all t. pallidum strains was determined by using gipsy (genomic island prediction software) on subspecies-based study. the software brig (blast ring image generator) [ ] was used for the circular genome comparison visualization. some of the other strains from the representing cluster of the dendogram were also used for the circular genome visualization. we found differences in the presence/absence of pathogenicity islands (pais) and genomic islands (gis) on subspecies-based study: four pathogenicity islands (pais) eight genomic islands (gis) in subsp. regarding the presence of genes in pais and gis, we compared the genes in all the subsp. of t. pallidum to each other. when compared to each other, we found high similarity of the genes in all the subsp. of t. pallidum. the genomic region related to pais and pais of subsp. pertenue and endemicum (non-venereal subsp.) were similar to the pais and pais of subsp. pallidum. when we compared the genes related to pais of subsp. 
pertenue and endemicum, there were differences of three genes found that were only present in subsp. pertenue. out of those three genes, two were hypothetical proteins and one was rna polymerase sigma factor. furthermore, the genes clusters related to the pais of subs. pertenue and endemicum were similar to pais of subsp. pallidum. interestingly, we found the genomic region related to pais of subsp. pertenue and endemicum (non-venereal subsp.) were not present in any of the gis or pais of subsp. pallidum. the list of genes related to pai of subsp. pertenue and endemicum is mentioned in table . on the other hand, we found that the genes present in pais of subsp. pallidum were not present in any of the gis or pais of subsp. pertenue and endemicum (nonvenereal subsp.). this may reflect the fact that the genomic signature of those regions has already adapted in subsp. pallidum to cause different modes of transmission. the list of genes related to pai of subsp. pallidum is mentioned in table excluding the hypothetical genes. moreover, we also compared gis of all subspecies; as a result, we found that the genes of some gis which are present in the gi and gi in pallidum subspecies and are not reported in any of gis of the subspecies endemicum and pertenue ( table ) . most of the genes present in gi and gi of pallidum subspecies are hypothetical genes but some genes are chemotaxis protein (chea) that are associated with the transmission of sensory signals from the chemoreceptors to the flagellar motors [ ] . the mechanisms by which t. pallidum sense and respond to nutrient gradients help in pathogenic processes such as crossing the endothelial barrier to reach the bloodstream. the subspecies t. pallidum subsp. endemicum (ten) and t. pallidum subsp. pertenue (tpe), are reasons for the diseases bejel and yaws, respectively. in the last few years, t. pallidum subsp. pallidum (tpa), has been reported as a reemerging pathogen [ , ] . these three subsp. 
of treponema pallidum are so close to each other that they cannot be differentiated serologically; they are morphologically indistinguishable and antigenically cross-reactive [ , ] . for the most part, the disease phenotypes caused by these pathogens can only be distinguished clinically and geographically. the distribution of venereal syphilis is global, non-venereal yaws usually affects children in hot and/or humid regions of africa and asia, and endemic syphilis occurs in dry places such as sahelian africa and saudi arabia [ , ] . t. pallidum is highly invasive. it circulates through the bloodstream and lymphatics and invades a wide range of tissues and organs. as demonstrated by the widespread clinical manifestations related to syphilis infections, treponema pallidum subsp. pallidum crosses the placental, endothelial and blood-brain barriers early in infection; congenital syphilis occurs, and invasion of the central nervous system has been observed in almost % of early syphilis patients. however, the understanding of the mechanisms responsible for the widespread distribution capability of t. pallidum is still very limited [ , ] . yaws is transmitted by direct skin contact with a primary cutaneous lesion. transmission is facilitated by a damaged skin surface. scratching or rubbing these damaged parts of the body can help the lesions spread across the body [ , ] . in contrast, endemic syphilis is an acute infection. primary lesions of endemic syphilis can be seen in children between and years of age in dry and arid climates. while the mode of transmission is not known, it is believed that it may occur through mucosal and skin contact, even via shared eating utensils or drinking vessels [ , ] . the precise relationships among these bacteria are still debated. the expansion of next-generation sequencing (ngs) in the last few decades has influenced the fields of treatment and prevention, especially for bacterial diseases [ ] . the availability of genomics data of t. 
pallidum gives us a better understanding of the biology involving its interaction with its hosts. a comprehensive in silico pan-genome study was carried out for sequenced genomes of t. pallidum, which indicates that the pan-genome of t. pallidum is still open; however, it is increasing at a very low rate as represented by the α of . for the pan all and the α of . and . for pan subsp_pallidum and pan_subsp_pertenue, respectively. moreover, the α of . indicates that the pan subsp pertenue is almost closed, which is corroborated by the tg(θ) of~ . . the genome plasticity analysis reveals the differences in the presence and absence of some genome regions when compared at the subspecies level. pathogenicity islands carry genes related to virulence, which are essential and characterize a class of genomic island [ ] . the comparative analysis of pais and gis showed the absence of genes at the subspecies level. we found that gene clusters related to amino acid and lipid biosynthesis, belonging to pais of t. pallidum subsp. pallidum, have not been identified in any pais or gis of t. pallidum subsp. endemicum and t. pallidum subsp. pertenue. it is possible that these genes help the bacteria to execute different modes of infection at the subspecies level of t. pallidum. acyl carrier protein (acp) synthase (acps) catalyzes the transfer of the 4′-phosphopantetheine moiety from coenzyme a (coa) onto a serine residue of apo-acp, to convert apo-acp to the functional holo-acp. during the biosynthesis of fatty acids and phospholipids, the holo form of bacterial acp plays a vital role in mediating the transfer of acyl fatty acid. acps is therefore an attractive target for therapeutic intervention. it has been reported that acps enzymes from mycoplasma pneumoniae and s. pneumoniae may play a crucial role in the acylation of fatty acids derived from human tissues for their lipid biosynthesis, suggesting that acps is a more attractive antimicrobial target for the discovery of novel antibiotics than bacterial fatty acid biosynthetic enzymes [ , ] . 
fig. pan-genome, core genome and singletons of t. pallidum pan_subsp_pertenue. c /c /c , respectively, showing the pan-genome, core genome and singletons development using strains belonging to subspecies pertenue. 
moreover, the presence of chemotaxis protein (chea) in different gis of t. pallidum subsp. pallidum might be responsible for different molecular modes of infection, as the t. pallidum genome contains two operons for the che response regulators [ , ] . the bacterial transcription-repair coupling factor (trcf) is a large, multi-domain, sf atpase that is generally conserved. it couples nucleotide excision repair with transcription by dislodging inactive rna polymerase molecules stalled at template dna lesions, and by increasing the rate at which the uvr(a)bc excinuclease acts at these sites [ ] . pathogens frequently use antigenic variation mechanisms to elude the adaptive immune response, which ultimately results in persistent infection [ ] . variation in the expression of different tpr proteins in the syphilis spirochete, treponema pallidum subsp. pallidum, might have important implications for its ability to elude host immune detection [ ] . a membered protein family, treponema pallidum repeat (tpr), has been identified in t. pallidum subsp. pallidum, which may be involved in the pathogenesis of t. pallidum. despite the host's efforts to eliminate the infection, mechanisms of t. pallidum's persistence include residence within intracellular or immune-privileged sites to hide from the immune effectors. t. pallidum can also coat its surface with host serum proteins or mucopolysaccharides to evade the immune response, together with immunosuppression of the host triggered by syphilis infection [ ] . freeze-fracture electron microscopy of t. 
pallidum has revealed lack of integral membrane proteins in the outer membrane (om) of t. pallidum, conceivably accounting for the reasonably poor antigenicity of this spirochete's surface [ , , ] . however, as t. pallidum could be phagocytized in the presence of opsonic antibody, antibody targets must be present on the surface of the bacterium. furthermore, the treponemes harvested from the tissues of later stage infections after the elimination of majority of treponemes are resistant to opsonophagocytosis. it raised the likelihood of antigenic variation occurring in t. pallidum, but no exact variable antigen was identified [ , ] . following the identification and investigation of tprk, provides the first candidate antigen of t. pallidum that might function in fudging the immune response. tprk vary among and within t. pallidum strains, with diversity of sequence localized in seven distinct regions (v -v ) bordered by conserved domains [ , , , ] . during experimental infection, these v regions are the main targets of the host humoral immune response [ ] . antigenic variation of the tprk antigen has been acknowledged to explain the persistence of t. pallidum in the host. recent work of dan liu et al. [ ] has recognized an improved number of variants within these seven v regions of the tprk gene in the samples of secondary syphilis. a -bp changing pattern was observed in the sequences within each v region of the protein. however, same pattern of change was observed in variable sequences within the v regions of tprk in the secondary syphilis. notably, the amino acid sequences iasdggaikh and iasedg-sagnlkh in v are not only present in high proportion in inter-strain comparison but also were found at a quite high frequency in the populations. the alignment of all amino acid sequences revealed some really stable pattern within each v region of the primary and secondary syphilis samples, particularly the amino acid sequences iasdggaikh and iasedgsagnlkh in v region. 
the highly stable peptides found in v region are likely promising vaccine components. the highly heterogeneous regions (e.g., v ) could help to understand the role of tprk in evading the immune response. however, in our analysis, we found that some tpr genes (tprc, tprd, tprf, tpri, tprj) were present in some of the pais or gis of t. pallidum subsp. endemicum (ten) and t. pallidum subsp. pertenue (tpe), whereas in the gis and pais of t. pallidum subsp. pallidum we identified only some tpr domain proteins. maděrankova et al. reported that tpr genes are responsible for the adaptive evolution of the pathogen [ ] . apart from establishing phylogenetic relationships among treponemal species and subspecies, comparative genomics was also required to explain the lower degree of virulence associated with t. pallidum subsp. pertenue than with t. pallidum subsp. pallidum. unlike syphilis, yaws is said not to be transmitted vertically or to affect the central nervous system. it is rather limited to skin, bones, joints and soft tissues. in the s, a very limited genetic diversity between these pathogens was established when hybridization experiments were carried out with dna isolated from yaws and syphilis strains [ ] . our work also showed that genomes of syphilis, yaws, and bejel treponemes share - % overall similarity, as well as an identical organization. this evidence suggests that small genetic changes in key genes among these organisms could be responsible for the reported differences in disease pathogenesis. considering the genes in pais and gis, we found that some pathogenicity islands are absent in some subspecies. genes which are present in pallidum subspecies pathogenicity islands (pais) or genomic islands (gis) are absent in the subspecies endemicum and pertenue. the findings of this analysis are very important, as they can help in understanding the molecular basis of infections by t. pallidum subsps. 
furthermore, the core genes represent the most desirable source for the selection of conserved genes; therefore, characterization of such poorly studied proteins helps in understanding the cellular metabolism, mode of infection

the genome sequences of t. pallidum strains were retrieved from the ncbi (national center for biotechnology information) database. six genomes from the africa and australia/oceania continents (strain samoad, cdc , gauthier, cdc , ghana and lmnp- ) from subsp. pertenue were isolated from humans, baboons and rabbits (additional file : table s ). one genome of treponema pallidum subsp. endemicum (strain bosniaa) was isolated in europe from human tongue and tonsils. the genome of treponema denticola strain atcc was used as the non-pathogenic bacterium in this work. the general information about all t. pallidum strains and the complete workflow applied in this work are given in additional file : table s and figure s , respectively.

for phylogenomic analysis of all treponema pallidum strains, gegenees (version . ) [ ] was used to perform an all-versus-all similarity search. it divides the genomes into small sequences and determines the minimum content shared by all the genomes. subsequently, the minimum shared content was subtracted from all the genomes, leaving the variable content, which was then compared across all strains to calculate the percentages of similarity. finally, these percentages were plotted in a heatmap with a spectrum ranging from low similarity (red) to high similarity (green). the gegenees data were exported as a distance matrix file in nexus format (.nex), and the distance matrix was used as the input file for splitstree (version . . ) [ ], using the neighbour joining method to create a dendrogram [ , ].

prediction of pan-genome, core-genome and singleton

we divided the strains of t. pallidum into subsets for pan-genome calculation.
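as a rough illustration of the pan-genome bookkeeping used in this section, the sketch below classifies orthologous groups as core, shared or singleton and fits heaps' law, n = k * N^gamma, by least squares on log-transformed counts. it is not the authors' in-house script: the function names, the input layout (a dictionary mapping each orthogroup to the set of strains in which it occurs) and the log-log fitting procedure are assumptions made for illustration, and the open/closed decision assumes the usual convention alpha = 1 - gamma.

```python
import math


def classify_orthogroups(orthogroups, strains):
    """Split orthogroups into core, shared and singleton lists.

    orthogroups: dict mapping an orthogroup name to the set of strains
                 that carry at least one member gene (hypothetical layout).
    strains:     list of all strain names in the subset being analysed.
    """
    strain_set = set(strains)
    core, shared, singletons = [], [], []
    for name, members in orthogroups.items():
        present = len(set(members) & strain_set)
        if present == len(strain_set):
            core.append(name)          # found in every strain
        elif present >= 2:
            shared.append(name)        # found in two or more, but not all
        elif present == 1:
            singletons.append(name)    # strain-specific
    return core, shared, singletons


def fit_heaps(genome_counts, gene_counts):
    """Fit n = k * N**gamma by linear regression of log(n) on log(N)."""
    xs = [math.log(N) for N in genome_counts]
    ys = [math.log(n) for n in gene_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct genome counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    k = math.exp(mean_y - gamma * mean_x)
    return k, gamma


def pan_genome_state(gamma):
    """Return 'closed' when alpha = 1 - gamma exceeds 1, otherwise 'open'."""
    alpha = 1.0 - gamma
    return "closed" if alpha > 1.0 else "open"
```

in practice the gene counts passed to `fit_heaps` would be the pan-genome sizes observed as genomes are added one at a time, typically averaged over several random orderings of the genomes; that averaging step is omitted here for brevity.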
we performed pan_all (with all strains of t. pallidum), pan_subsp_pallidum and pan_subsp_pertenue (based on subspecies). for the identification of the core genome (genes shared by all strains), the shared genome (genes present in two or more strains but not in all strains) and singletons (strain-specific genes), we used orthofinder [ ]. briefly, orthofinder uses the .faa amino acid sequence file of each genome to perform an all-vs-all blastp for the orthology analysis. it uses the mcl (markov clustering algorithm) program to determine the orthologous genes [ ]. the cut-off value of e − was used for pan-genome, core-genome and singleton identification for all the subsets. furthermore, in-house scripts were used to estimate the fixed parameters for heaps' law (pan-genome analyses) [ , ] and for the least-squares fit of the exponential regression decay (core-genome and singletons). the extrapolations of the pan-genomes from the complete dataset and all subsets were calculated based on heaps' law [ , ], which was used to determine whether the pan-genome was open or closed. heaps' law is an empirical law represented by the formula $n = k N^{\gamma}$; it describes the number of distinct words in a document (or set of documents) as a function of the document length. in a genetic context, $n$ is the expected number of genes for a given number of genomes, $N$ is the number of genomes, and $k$ and $\gamma$ (with $\alpha = 1 - \gamma$) are free parameters that are determined empirically. according to heaps' law, when $\alpha > 1$ ($\gamma < 0$), the pan-genome is considered to be closed, and there will be no significant increase in the number of genes with the addition of a new genome. on the other hand, when $\alpha < 1$ ($0 < \gamma < 1$), the pan-genome is open and there will be a significant increase in the number of genes for each newly added genome.

this section describes the analyses that were performed for the prediction of genomic and pathogenicity islands in three datasets based on the subspecies: a) using t.
pallidum subsp. pallidum strain nichols as a reference; b) using t. pallidum subsp. pertenue strain samoad as a reference; and c) using t. pallidum subsp. endemicum strain bosniaa as a reference. the island predictions for the three datasets were made using gipsy (genomic island prediction software) [ ]. gipsy is a multi-step approach that predicts genomic islands (gis) and pathogenicity islands (pais). pai and gi predictions are based on commonly shared features such as genomic signature deviation (anomalous g + c content and codon usage deviation); presence of transposase genes; metabolism, virulence, antibiotic resistance, or symbiosis-related genes; flanking trna genes; and absence in other organisms of the same genus or closely related species [ ]. t. denticola strain atcc was used as a non-pathogenic species from the same treponema genus for gi and pai prediction [ ]. the sizes of the islands were compared across all the other strains with the act (artemis comparison tool) software [ ]. pai regions were plotted using brig [ ]. following the curation of the pais, the genes of all the islands in each strain were assessed for their presence/absence in all the other strains.

supplementary information accompanies this paper at https://doi.org/ . /s - - - .
additional file : table s . general information about the treponema pallidum strains used in this work. list of all treponema pallidum strains (with features) retrieved from the ncbi (national center for biotechnology information) database.
figure s . the complete workflow applied in this work. the figure represents the methodology and software used in this analysis.
figure s . the heatmap analysis of strains of treponema pallidum. the figure represents the comparison between the variable content of all strains. the percentages were plotted in the heatmap with a spectrum ranging from red (low similarity) to green (high similarity).
the names of the strains on the left side of the figure (vertically) are organized in the same order in the top part of the figure (horizontally). because gegenees uses the similarities in the variable contents, the outgroup normally presents a very small percentage of similarity to the other strains.

the pathogenesis of syphilis: the great mimicker, revisited
advances in the diagnosis of endemic treponematoses: yaws, bejel, and pinta
treponema pallidum, the syphilis spirochete: making a living as a stealth pathogen
molecular differentiation of treponema pallidum subspecies
syphilis: presentations in general medicine
global challenge of antibiotic-resistant treponema pallidum
sexually transmitted infections and hiv: epidemiology and interventions
sexually transmitted diseases in children in india
an in silico identification of common putative vaccine candidates against treponema pallidum: a reverse vaccinology and subtractive genomics based approach
china's syphilis epidemic: epidemiology, proximate determinants of spread, and control responses
trends in the epidemiology of bacterial sexually transmitted infections in eastern europe
diagnosis and management of syphilis
current status of syphilis vaccine development: need, challenges, prospects
genome analysis of multiple pathogenic isolates of streptococcus agalactiae: implications for the microbial "pan-genome"
genetics of treponema: relationship between treponema pallidum and five cultivable treponemes
genetic relationship between treponema pallidum and treponema pertenue, two noncultivable human pathogens
gegenees: fragmented alignment of multiple genomes for determining phylogenomic distances and genetic signatures unique for specified target groups
inside the pan-genome - methods and software overview
comparative genomics: the bacterial pan-genome
the bacterial pan-genome: a new paradigm in microbiology
pathogenicity islands in bacterial pathogenesis
a vibrio cholerae pathogenicity island associated with epidemic and pandemic strains
blast ring image generator (brig): simple prokaryote genome comparisons
identification, sequences, and expression of treponema pallidum chemotaxis genes
treponemal infection in nonhuman primates as possible reservoir for human yaws
tools for opening new chapters in the book of treponema pallidum evolutionary history
the endemic treponematoses
a defined syphilis vaccine candidate inhibits dissemination of treponema pallidum subspecies pallidum
biological basis for syphilis
origin of modern syphilis and emergence of a pandemic treponema pallidum cluster
gipsy: genomic island prediction software
fatty acid biosynthesis as a target for novel antibacterials
acyl carrier protein synthases from gram-negative, gram-positive, and atypical bacterial species: biochemical and structural properties and physiological implications
complete genome sequence of treponema pallidum, the syphilis spirochete
the bacterial transcription repair coupling factor
antigenic variation in treponema pallidum: tprk sequence diversity accumulates in response to immune pressure during experimental syphilis
antibody responses elicited against the treponema pallidum repeat proteins differ during infection with different isolates of treponema pallidum subsp. pallidum
van voorhis wc. subfamily i treponema pallidum repeat protein family: sequence variation and immunity
treponema pallidum major sheath protein homologue tpr k is a target of opsonic antibody and the protective immune response
analysis of outer membrane ultrastructure of pathogenic treponema and borrelia species by freeze-fracture electron microscopy
genome-scale analysis of the noncultivable treponema pallidum reveals extensive within-patient genetic variation
a subpopulation of treponema pallidum is resistant to phagocytosis: possible mechanism of persistence
the tprk gene is heterogeneous among treponema pallidum strains and has multiple alleles
sequence diversity of treponema pallidum subsp. pallidum tprk in human syphilis lesions and rabbit-propagated isolates
insights into the genetic variation profile of tprk in treponema pallidum during the development of natural human syphilis infection
identification of positively selected genes in human pathogenic treponemes: syphilis-, yaws-, and bejel-causing strains differ in sets of genes showing adaptive evolution
drawing explicit phylogenetic networks and their integration into splitstree
application of phylogenetic networks in evolutionary studies
the pan-genome of the animal pathogen corynebacterium pseudotuberculosis reveals differences in genome plasticity between the biovar ovis and equi strains
orthofinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy
an efficient algorithm for large-scale detection of protein families
comparison of the genome of the oral pathogen treponema denticola with other spirochete genomes
act: the artemis comparison tool

publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations

we acknowledge the collaboration and assistance of all team members and the brazilian funding agencies capes (coordenação de aperfeiçoamento de pessoal de nível superior, brasil) and fapemig (fundação de amparo à pesquisa de minas gerais). arun kumar jaiswal was supported by the capes (coordenação de aperfeiçoamento de pessoal de nível superior, brasil) fellowship for doctoral studies. syed babar jamal acknowledges the "twas-cnpq postgraduate fellowship programme" for granting a fellowship for doctoral studies.

authors' contributions
akj, st, sbj, lco, lga, scs conceived, designed the protocol, collected and analysed initial data, wrote the paper; st, scs, va, cjfo coordinated and led the entire project; akj, st, sbj, scs, pg, va, cjfo cross-checked all data, analysis, wrote the paper. all authors read and approved the manuscript.

no funding supported this research.

all data generated and analysed during this study are included in this published article and its supplementary information files.

ethics approval and consent to participate
not applicable.

not applicable.

the authors declare that they have no competing interests.

author details

key: cord- -p of kq authors: celniker, susan e.; dillon, laura a. l.; gerstein, mark b.; gunsalus, kristin c.; henikoff, steven; karpen, gary h.; kellis, manolis; lai, eric c.; lieb, jason d.; macalpine, david m.; micklem, gos; piano, fabio; snyder, michael; stein, lincoln; white, kevin p.; waterston, robert h. title: unlocking the secrets of the genome date: - - journal: nature doi: . / a sha: doc_id: cord_uid: p of kq

despite the successes of genomics, little is known about how genetic information produces complex organisms. a look at the crucial functional elements of fly and worm genomes could change that.

supplementary information: the online version of this article (doi: . / a) contains supplementary material, which is available to authorized users.
the primary objective of the human genome project was to produce high-quality sequences not just for the human genome but also for those of the chief model organisms: escherichia coli, yeast (saccharomyces cerevisiae), worm (caenorhabditis elegans), fly (drosophila melanogaster) and mouse (mus musculus). free access to the resultant data has prompted much biological research, including development of a map of common human genetic variants (the international hapmap project), expression profiling of healthy and diseased cells and in-depth studies of many individual genes. these genome sequences have enabled researchers to carry out genetic and functional genomic studies not previously possible, revealing new biological insights with broad relevance across the animal kingdom. nevertheless, our understanding of how the information encoded in a genome can produce a complex multicellular organism remains far from complete. to interpret the genome accurately requires a complete list of functionally important elements and a description of their dynamic activities over time and across different cell types. as well as genes for proteins and non-coding rnas, functionally important elements include regulatory sequences that direct essential functions such as gene expression, dna replication and chromosome inheritance. although geneticists have been quick to decode the functional elements in the yeast s. cerevisiae, with its small compact genome and powerful experimental tools [ ] [ ], our understanding of the more complex genomes of human, mouse, fly and worm is still rudimentary. intrinsic signals that define the boundaries of protein-coding genes can only be partly recognized by current algorithms, and signals for other functional elements are even harder to find and interpret. experimental approaches, notably the sequencing of complementary dna and expressed sequence tags, have been invaluable, but unfortunately these data sets remain incomplete.
non-coding rna genes present an even greater challenge [ ] [ ] [ ], and many remain to be discovered, particularly those that have not been strongly conserved during evolution. flies and worms have roughly the same number of known transcription factors as humans, but comprehensive molecular studies of gene regulatory networks have yet to be tackled in any of these species. in an attempt to remedy this situation, the national human genome research institute (nhgri) launched the encode (encyclopedia of dna elements) project in , with the goal of defining the functional elements in the human genome. the pilot phase of the project focused on % of the human genome and a parallel effort to foster technology development. the initial encode analysis revealed new findings but also made clear just how complex the biology is and how our grasp of it is far from complete. on the basis of this experience, the nhgri launched two complementary programmes in : an expansion of the human encode project to the whole genome (www.genome.gov/encode) and the model organism encode (modencode) project to generate a comprehensive annotation of the functional elements in the c. elegans and d. melanogaster genomes (www.modencode.org; www.genome.gov/modencode). these two model organisms, with their ease of husbandry and genetic manipulation, are pillars of modern biological research, and a systematic catalogue of their functional genomic elements promises to pave the way to a more complete understanding of the human genome. studies of these animals have provided key insights into many basic metazoan processes, including developmental patterning, cellular signalling, dna replication and inheritance, programmed cell death and rna interference (rnai). the genomes are small enough to be investigated comprehensively with current technologies and findings can be validated in vivo.
the research communities that study these two organisms will rapidly make use of the modencode results, deploying powerful experimental approaches that are often not possible or practical in mammals, including genetic, genomic, transgenic, biochemical and rnai assays. modencode, with its potential for biological validation, will add value to the human encode effort by illuminating the relationship between molecular and biological events. the modencode project (table ) complements other systematic investigations into these highly studied organisms. in both organisms, rnai collections have been developed and used to uncover novel gene functions [ ] [ ] [ ] [ ] [ ]. mutants are being recovered through insertional mutagenesis and targeted deletions (http://celeganskoconsortium.omrf.org;

the modencode project will operate as an open consortium and participants can join on the understanding that they will abide by the set criteria (www.genome.gov/ ). an important aim of the project is to respond to the needs of the broader drosophila and c. elegans scientific communities, and several avenues will be open for suggestions on which experiments to prioritize. for example, researchers can visit www.modencode.org/vote.shtml now to help prioritize transcription factors for studies using chromatin immunoprecipitation followed by dna microarray or dna sequencing (chip-chip and chip-seq), and can also indicate whether they have useful antibodies. we will seek community input on other issues as the opportunities arise.

the core of the modencode project consists of ten groups who use high-throughput methods to identify functional elements (see table ). a data coordinating center (dcc) will collect, integrate and display the data. together, the groups expect to identify the principal classes of functional element for d. melanogaster and c. elegans.
they will work closely together to complete the precise annotation of protein-coding genes, identify small rnas and non-coding rna transcripts, map transcription start sites, identify promoter motif elements, elucidate functional elements within ʹ untranslated regions, and identify alternatively spliced transcripts as well as the signals required for splicing. genomic sites bound by sequence-specific transcription factors will also be comprehensively identified. charting the chromatin 'landscapes' will include the characterization of key histone modifications and variants, nucleosome phasing, rna polymerase ii isoforms and proteins involved in dosage compensation, centromere function, replication, homologue pairing, recombination and associations of chromosomes with the nuclear envelope. integrative analysis of these data across the different types of functional element will be used to reveal fundamental principles of fly and worm genome biology and to begin to uncover the emergent properties of these complex genomes. some topics the modencode groups, along with interested members of the wider community, intend to explore are outlined below, but these are only a beginning. our intention is to create a resource that will provide the foundation for ongoing analysis by scientists for years to come. our two model organisms share many similarities with other metazoans, including humans. they also differ from other organisms in some striking ways, particularly in details of the establishment and maintenance of cellular identity, centromere biology and heterochromatin function. to help understand how the similarities and differences in worm and fly biology are reflected in their genome sequences and how they are specified by genome function at the molecular level, we will carry out comparative analyses of transcription, splicing, cis-regulatory and post-transcriptional elements and chromatin function. 
we will subsequently investigate how our findings apply to the control of gene expression in the human genome.

origin mapping, timing, differential replication

we also plan to use genome-wide data on pre- and post-transcriptional functional elements to expand our understanding of gene-regulatory networks. we will study how these two layers of control complement or reinforce each other during development. for example, the availability of full-length transcripts and promoter structures for microrna (mirna) genes will enable us to develop models of regulatory circuits that integrate the upstream regulation of mirna genes with that of other regulatory factors (such as transcription factors) and the effects of mirnas on their downstream targets. we will search global patterns identified in the regulatory programs for emerging principles of gene regulation within and across species; as part of this endeavour, we will evaluate evidence for the modular structure of regulatory networks. because several developmental stages and diverse tissues will be sampled in both animals, we will be able to investigate the global and dynamic activities of functional elements across the entire genome in multiple cell types and stages of differentiation. we aim to define the characteristics and rules that distinguish regulatory programs in different cell types and developmental stages at the dna, chromatin, and post-transcriptional levels. this will enable us to identify the types of element that function together in various spatio-temporal environments and find new types of functional element, perhaps including those used in restricted developmental contexts. an important objective is to generate specific biological hypotheses that can be refined and tested experimentally by the broader scientific community.
for example, these analyses might identify transcribed regions with novel regulatory roles, structural regions that function in the establishment of chromatin structure or three-dimensional conformation, enhancers far away from the gene they control, and alternative promoter regions. in addition, we will use comparative analyses of the sequenced genomes from different species to clarify the extent of conservation and the functional constraints associated with potential new classes of element and to characterize their evolutionary signatures. another objective of the modencode project is the creation of reference data sets of maximum utility. we have agreed that, whenever possible, a common set of reagents will be used to facilitate comparison of data sets generated by different groups. for example, the fly and worm groups using chip-chip and related methods to map the genome-wide distributions of histone modifications will use a common set of validated antibodies. in addition, we will use common fly and worm strains, and in the case of drosophila, the common cell lines kc , s -drsc, cme w cl. + and ml-dmbg -c . the fly and worm genomes are about a thirtieth of the size of their mammalian counterparts, making current methods for high-throughput genomic analysis cost-effective. we will use high-density tiling dna microarrays to interrogate the genome on a single microarray (c. elegans, base pair (bp) median spacing; d. melanogaster, bp median spacing) at a resolution sufficient for chip-chip experiments. denser arrays (d. melanogaster, bp median spacing), which promise higher resolution, will be used in a move to high-throughput sequencing platforms such as the illumina genome analyzer to generate sufficient sequence coverage for transcript mapping and mirna and chip experiments.
the biological significance of the genomic features identified will be tested in experiments designed to evaluate the accuracy and functionality of subsets of the structural and regulatory annotations. for example, we will carry out chip experiments on extracts from whole animals or cells that lack selected regulators (using mutants or rnai). the tissue-specific dna-binding patterns of selected regulators will be validated in transgenic animals. figure summarizes the dna elements to be interrogated and the methods to be used. data generated by the modencode consortium, including those from validation experiments, will be collected, quality checked, integrated and distributed through the modencode dcc (www.modencode.org). the dcc will collate detailed metadata for each submitted data set to ensure broad and long-term usability. where appropriate, the data will also be submitted to public databases, for example, genbank (www.ncbi.nlm.nih.gov) and the gene expression omnibus (www.ncbi.nlm.nih.gov/geo) or array express (www.ebi.ac.uk/microarray-as/aer/entry) and the university of california, santa cruz genome bioinformatics site (http://genome.ucsc.edu). the dcc will also work closely with wormbase (www.wormbase.org) and flybase (www.flybase.org) to facilitate integration of the modencode data with selected data from these databases and with other information about these organisms. all data will be available for bulk download through an ftp site and through a number of generic model organism database tools (www.gmod.org): biomart (www.biomart.org) will provide powerful data-mining capabilities, and intermine (www.intermine.org) will provide a flexible interface for complex querying of the data, a library of canned queries, and powerful list-based tools and operations (http://intermine.modencode.org). as for the encode pilot project data (www.genome.gov/ ), new data can be examined alongside existing data using interactive genome browsers for both the fly (www.modencode.org/cgi-bin/gbrowse/fly) and the worm (www.modencode.org/cgi-bin/gbrowse/worm).

the drosophila and c. elegans communities have thrived because of their open culture. in keeping with this tradition and with those of the genome sequencing projects, hapmap and the encode pilot project, modencode is a 'community resource project' subject to the nhgri's data-sharing policy. the success of this policy is based on mutual and independent responsibilities for the production and use of the resource. we will release data rapidly (table ), before publication, once they have been established to be reproducible (verification; see www.modencode.org/'publication policy link' for the criteria), even if the data have not been sampled to determine if there is biological meaning (validation). in turn, users are asked to recognize the source of the data and to respect the legitimate interest of the resource producers to publish an initial report of their work (see www.genome.gov/modencode for more details).

chromatin structure and function: identify sites of association between dna and chromosomal proteins involved in centromere specification, meiotic recombination, dosage compensation, nuclear envelope and matrix interactions and chromosome condensation. identify sites of incorporation of histone variants and specifically modified histones. correlate transcription maps for meta-analysis of developmental chromatin dynamics.
dna replication: identify cell- and tissue-specific origins of replication. correlate with cell- and tissue-specific transcription and chromatin marks.

finally, the funding agencies recognize the need to support the analysis and dissemination of the data. in addition, a variety of physical resources (for example, dna constructs and transgenic strains) will be produced that are likely to be of use to the broader community and to which that community will have unrestricted access.
we expect to cooperate with data users in the worm and fly communities to set the gold standard for data release and openness. the human genome project benefited enormously from the technology developed and the experience acquired in sequencing the significantly smaller genomes of model organisms, particularly c. elegans and d. melanogaster. the modencode project is dedicated to the next phase of decoding the information stored in these genomes: the comprehensive identification of sequence-based functional elements. having laid the foundation for the discovery of many of the genetic programs underlying metazoan development and behaviour, drosophila and caenorhabditis will serve as ideal model systems to identify dna-based functional elements on a genome-wide basis. in the future, these data will provide a powerful platform for characterizing the functional networks that direct multicellular biology, thereby linking genomic data with the biological programs of higher organisms, including humans.

, usa. center for genomics and systems biology tennis court road, cambridge cb qr, uk. department of molecular

acknowledgements we thank brenda andrews and tim hughes for discussions on the status of yeast functional genomics.

author information correspondence should be addressed to s.e.c. (celniker@fruitfly.org).

supplementary information a full list of names and addresses of current consortium participants is linked to the online version of this feature at http://tinyurl.com/modencode

key: cord- -wj q f authors: lázaro, ester title: genetic variability in rna viruses: consequences in epidemiology and in the development of new strategies for the extinction of infectivity date: journal: structural approaches to sequence evolution doi: . / - - - - _ sha: doc_id: cord_uid: wj q f

viruses constitute one of the simplest biological entities in nature.
they possess some properties typical of life, such as the transmission of genetic information through generations, but lack a metabolism of their own and a system to translate the genetic information into proteins. this ambivalence places them at the border between living and non-living matter. to reproduce themselves, viruses are forced to infect a host cell, behaving as intracellular parasites. despite their simplicity, viruses have been able to develop a wide repertoire of infection mechanisms and replication strategies to adapt to the broad diversity of the cellular world in order to execute their genetic program. all known viruses consist of one or several genomic nucleic acid molecules, covered by protective layers. usually, there is one protein capsid that can be enclosed by a lipid bilayer membrane derived from the host cell. the number of proteins encoded by viral genomes is rather small, so their success in replicating and producing offspring depends on their ability to take advantage of the enzymatic activities provided by the host cell. after infection, cellular protein synthesis is stopped and most of the subcellular machinery is directed to produce copies of the viral nucleic acids and proteins. these newly synthesized viral components are assembled inside the cell into mature virions that can infect other cells of the same organism or establish transmission chains between different individuals. the stability of these chains strongly conditions the survival of the virus in nature. outside an adequate host, viruses can still persist for some time in a latent state in which they are unable to replicate and are exposed to irreversible damage by the physical conditions of the environment. when they interact with the specific cellular receptors of a suitable host, they can initiate an infection that, if successfully transmitted in the population, can seriously compromise the survival of the host species.
in all cellular organisms, the genetic information is contained in the dna. before being translated into proteins (the molecules that execute the functions necessary for the performance of the cell), the dna has to be copied to mrna. but genetic information also needs to be maintained through generations, a process that takes place by means of dna replication, in which many enzymatic activities are involved. in contrast to cells, viruses are more versatile and can use dna or rna to store the genetic information. dna viruses can follow a scheme similar to that followed by cells to replicate their genomes and to synthesize their proteins. however, rna viruses need another process, rna replication, which is not among the functions carried out routinely in the cell [ ] . therefore, they have to encode and express the enzymatic activities necessary to copy their genomes. these enzymes are the rna replicases and the reverse transcriptases, which in many cases are co-encapsidated with the nucleic acid during the assembly of the viral particles. in this way, they are available at the start of the infection. all living systems must reach a compromise between the correct copy of the nucleotide sequence of their genomes and the ability to adapt to an environment that is continuously changing [ ] . the generation of mutants upon replication provides the necessary diversity from which natural selection can choose the best adapted variants in a concrete environment. the observed divergences among mutation rates in different species suggest that possibly this character is selected depending on the variability of the environment [ ] . cellular systems are able to maintain a relative constancy in the intra-cytoplasmatic medium and because of that, they do not need a high genetic variability. thus, evolution has emphasized the selection of a replicative machinery with several corrector activities that permit a high copying accuracy. 
however, even in the cellular world, mutation rate is not a fixed character. it can be modified in response to environmental changes by selection of variants with higher or lower error rates. the isolation of hypermutator strains, which show deficiencies in some of the polymerase corrector activities, is frequent in conditions of environmental stress and is a proof of the versatility of mutability as a character that can be modified when the environment requires it [ , ] . a relevant characteristic of rna viruses is that they replicate their genomes with a copying fidelity several orders of magnitude lower than that of cellular dna replication [ ] . this fact has been interpreted as a consequence of the fluctuating environments that viruses have to face. highly error-prone replication, together with the short replication times and large population sizes typical of rna viruses, is not a handicap for survival but provides an extraordinary evolutionary advantage by permitting the generation of a wide reservoir of mutants with different phenotypic properties [ ] . the high variability of rna viruses facilitates their survival in the presence of antibodies and other defence mechanisms produced by the immune system of the host. it also makes possible the acquisition of novel pathogenic properties that have on occasion allowed viruses to cross species boundaries, favouring the infection of alternative hosts [ , ] . finally, the heterogeneity of rna virus populations also makes it difficult to eradicate diseases with antiviral drugs, due to the emergence of drug-resistant mutants [ ] , a problem that will be treated in more detail in the next sections. whether the high genetic variability of rna viruses is a selected character, necessary for survival in highly fluctuating environments, or simply a consequence of the lack of corrector activities of rna replicases and reverse transcriptases is a debated question.
however, the fact that dna organisms, which usually live in constant environments, have evolved corrector activities, whereas rna viruses have not, suggests that replication with high error rates is a selected character that strongly favours viral adaptation to fast changing conditions. the first requisite for the evolution of any population is the generation of a significantly wide genetic diversity on which natural selection and genetic drift can act to shape the properties of the new populations generated at subsequent generations. the genetic variation attained by rna viruses is mainly the result of mutation and recombination, two processes that are dependent on the properties of the enzymes that replicate their genomes. genome segment reassortment occurs during encapsidation and can add extra variability in the case of viruses with segmented genomes. the replication of rna viruses takes place through two main mechanisms that involve the use of different enzymatic activities [ ] . riboviruses, including many prokaryotic rna viruses as well as many animal and plant viruses (poliovirus, influenza virus, hepatitis a and c viruses, etc.), replicate their genomes using rna replicases that catalyze the rna-dependent rna synthesis. the template rna can be of positive polarity (it can work as mrna) or negative polarity (it is the complementary strand that is translated into proteins). viruses with positive polarity genomes deliver the nucleic acid directly to the cellular ribosomes and begin infections with translation. in contrast, viruses with negative polarity genomes begin infections with transcription to obtain mrna molecules that can be translated. retroviruses, hiv- (human immunodeficiency virus type ) being the best known example, replicate their genomes through a different mechanism with an intermediary step that consists in the copy of the genomic rna to dna. 
this process is catalyzed by the enzyme reverse transcriptase, an rna-dependent dna polymerase carried by the viral particle. the dna obtained in this way is integrated into the host chromosome, where it is transcribed by the cellular enzyme rna polymerase ii to produce transcripts that can function either as precursors of mrnas or as genomic rnas that can be assembled into progeny viruses. the lack of corrector activities in both classes of enzymes, rna replicases and reverse transcriptases, results in high mutation rates, which have been estimated at − to − misincorporations per nucleotide copied. for a virus with a genome length of , nucleotides, this amounts to the incorporation of one incorrect nucleotide per genome copied on average [ , ] . thus, each new viral genome differs from its parent at one or two nucleotide positions. the relative proportion of a specific mutant in the viral population depends on the rate at which the mutant is generated and on its fitness, which is defined as the ability to give rise to progeny in competition with the rest of the viruses replicating under certain environmental conditions [ ] . the number of mutations that occur per unit of time is also influenced by the number of replication rounds during that period, that is, by the generation time. for viruses with polymerases of similar error rate, the shorter the generation time, the larger the number of mutants that are produced in the same time interval. recombination takes place when a new genome is built from fragments belonging to different parental molecules. in rna viruses, this process usually occurs by template switching during rna or cdna synthesis. most studies suggest that recombination rates in rna viruses are lower than in other organisms [ ] , although there are some notable exceptions, such as hiv- , in which the recombination rate seems to be higher than the mutation rate [ ] .
recombination can be a powerful mechanism to create advantageous genomes and to purge deleterious mutations in a very short time. however, the actual effects of recombination in rna viruses have not been studied in detail, and it is not clear whether it is beneficial or it has a negative effect on fitness [ ] . genome segment reassortment occurs in viruses with segmented genomes and consists in the encapsidation in the same viral particle of genome segments proceeding from different parental viruses. influenza viruses are the typical example in which this process has been responsible for antigenic shifts, probably resulting from combinations of segments of influenza virus of different specificity [ ] . the natural reservoir of influenza is aquatic birds, although the virus can also infect domestic birds and mammals (human or pigs preferably). when a reassortant influenza virus emerges, its pathogenic potential can dramatically increase, because the infected host is not able to recognize the antigenic determinants of the new virus generated. these reassortant strains have been responsible for a number of pandemics through history and most studies suggest that a new influenza pandemic is unavoidable [ ] . the structure of viral populations results from the concerted action of the processes of mutation and selection acting in very large ensembles of replicating units. population size fluctuations, which frequently take place during transmission of viruses in nature, constitute an additional and important factor influencing the extension of genetic diversity from which a new virus population will be generated. the evolution and self-organization of heterogeneous populations composed by a large number of molecules subjected to error-prone replication and exposed to selection was first studied theoretically [ ] . 
these studies showed that, for large population sizes and after long growth times in a constant environment, a steady-state is reached where each mutant represents a constant fraction of the total population. this equilibrium population was called quasi-species [ , ] . the most frequently occurring molecular species, usually the one with the highest fitness, is called the master sequence. this sequence is accompanied by a mutant spectrum, composed by an ensemble of variants that differ in one or several nucleotide positions that can be responsible for fitness variations in individual mutants. the number of nucleotide differences between two sequences is called the hamming distance. the consensus sequence is defined as the sequence of the most represented nucleotides at each genomic position in the ensemble of genomes constituting the population. the correspondence between fitness values and sequences (or between phenotypes and genotypes) reveals that fitness landscapes (a surface in the genotype space representing the fitness of each genotype as a point placed at a different height) are rather rugged, since relatively small sequence differences can cause great differences in fitness values. analysis of rna virus populations, either at the phenotypic or genotypic level showed that these populations have a structure similar to the molecular quasi-species described theoretically [ , ] . they present a master sequence surrounded by a mutant cloud and, in the absence of a cloning method to separate individual genomes, only the consensus sequence can be determined. however, two main differences have to be taken into account when comparing theoretical and viral quasi-species. the first one is that viral quasi-species usually are not equilibrium populations because viruses are continually confronted with many environmental perturbations that cause variations in the fitness distribution of the population. 
the second difference is that viral fitness is not only determined by the genomic replicative ability. in spite of its simplicity, a virus must complete successfully many processes to originate an infective progeny. these include recognition of the cellular receptors, uncoating and release of the nucleic acid inside the cell, interaction with many enzymes and cellular structures, correct assembly to give rise to new viruses and exit out of the cell. the ability of a virus to perform correctly all these processes, together with the replicative ability of its genome, is what determines its fitness value. these differences introduce uncertainties and additional complexity to viral evolution compared to molecular evolution described theoretically. viral quasi-species constitute very dynamical structures in which the processes of generation of new mutants, selection of the best adapted and elimination of the less fit are continuously acting. quasi-species replicating during a long time in a near-constant environment in the absence of large population size fluctuations can present a low rate of fixation of mutations in the consensus sequence, despite the continuous occurrence of mutants that is characteristic of the underlying dynamics of the population. in this case, the quasi-species is well adapted to the environment and can maintain a low rate of evolution, as determined by the stability of the consensus sequence. many of the mutants generated are lethal or very deleterious and are eliminated or maintained at low frequency by the action of negative selection. in contrast, mutants with a selective advantage can be present at high frequency, even if they are produced at a low rate [ ] . neutral mutations are thought to be very restricted in rna viruses because of their highly compact genomes [ ] . 
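the definitions of master sequence, consensus sequence and hamming distance given above can be made concrete with a short sketch. the toy population, the function names and the tie-breaking behaviour are assumptions made only for illustration; they are not taken from the chapter.

```python
from collections import Counter

def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def consensus_sequence(genomes: list[str]) -> str:
    """Most represented nucleotide at each position of the population."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*genomes))

def master_sequence(genomes: list[str]) -> str:
    """Most frequently occurring complete genome in the population."""
    return Counter(genomes).most_common(1)[0][0]

# Hypothetical mutant spectrum: three copies of one genome plus four single mutants.
population = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTAT", "ACCTAC", "TCGTAC", "ACGAAC"]

master = master_sequence(population)
consensus = consensus_sequence(population)
distances = [hamming_distance(master, g) for g in population]
print(master, consensus, distances)
```

in this toy population more than half of the genomes carry a mutation, yet the consensus is unchanged, which is the sense in which a consensus sequence can stay invariant while mutants are continuously produced.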
low rates of evolution have been described for viruses well adapted to their animal reservoir (the host in which the virus is usually maintained in nature), such as influenza in birds or hantavirus in rodents. the same happens in the laboratory, where viruses are usually cultivated for years in the same cell type. in both cases an almost invariant consensus sequence can be found, although the dynamics of the quasi-species is always dominated by the processes of mutation and selection acting in close concert. the factors that promote the fixation of mutations in the consensus sequence are usually environmental changes that favour the selection of the best adapted genomes in the new conditions, or drastic reductions in the number of individuals that will originate a new population, known as population bottlenecks. the occurrence of genetic alterations with an adaptive advantage in the absence of environmental perturbations is also possible, although it happens more rarely. in this section, we will focus on some features that favour the action of positive selection and amplification of advantageous mutants. we will mention three examples that make virus eradication enormously difficult: . treatment with antiviral drugs. when a viral infection is treated with an antiviral agent the usual outcome is that, after a short time of success, the treatment loses its efficacy. the failure is generally due to the presence in the mutant spectrum of some genomes able to resist the action of the drug. usually, these genomes have lower fitness in the absence of the drug and they are maintained at low frequencies by the action of negative selection. the presence of the drug inhibits the replication of the sensitive genomes, but not of the resistant ones, which are selected and amplified. it can also occur that, at the beginning of the treatment, no drug-resistant genomes are present in the population.
however, the high mutation rates and large population sizes of rna viruses make it highly probable that, after a variable time lag, a resistant mutant appears, which in a short time can dominate the population. perhaps the most dramatic example of drug-induced resistance occurs in patients infected with hiv- , in which variants resistant to all currently used drugs have been isolated [ , ] . at present, the most effective treatment to control hiv- infection is the so-called highly active anti-retroviral therapy (haart), which involves a combination of several drugs, aimed at preventing the emergence of variants with mutations conferring resistance to all the drugs at the same time. . antigenic drift. there are considerable differences in the nature and duration of the immune response elicited by different viruses [ ] . some human viruses, such as measles or chicken pox, can only infect once, because their antigenic determinants evolve very slowly and the immunological response of the memory cells remains effective throughout the life of the individual. in contrast, there are other viruses, influenza being the paradigmatic example, that can infect the same organism repeatedly. most experimental evidence leads to the conclusion that the tolerance of a virus to immune-escape mutations is limited by the need to conserve the cell tropism [ ] . modifications in the capsid antigen domains of measles virus seem to have very deleterious effects, possibly because they affect the recognition of the cellular receptors. in contrast, influenza can experience a continuous change in the antigenic properties of the two main surface proteins involved in the entry of the virus into the cell, the hemagglutinin and the neuraminidase.
the evolution of the virus seems to be strongly influenced by selection of new antigenic variants to escape the immune system at the same time that the capacity of interaction with the cellular receptor is preserved [ ] . in the case of the hemagglutinin gene, codon sites have been identified in which non-synonymous nucleotide substitutions are much more frequent than synonymous [ , ] . the remaining sites show the more common pattern of synonymous substitutions, indicating that possibly they are subjected to stronger evolutionary constraints. . change of tropism. many viruses are maintained in nature in animal reservoirs that do not manifest symptoms of disease. this probably occurs because the relation virus-host is very old (hundreds or thousands of years) and both species have had enough time to co-evolve, meaning that the virus has attenuated its virulence and the host also has acquired some properties that permit coexistence with the pathogen. the long time of evolution in the same host has permitted to these viruses to be close to the equilibrium between mutation and selection processes and to maintain a high stability in the consensus sequence. occasionally, a virus well-adapted for replication in a particular host can cross the species boundaries and infect a new host. this can be facilitated by genetic changes in the virus and/or by ecological factors that involve alterations in the relationships established among different species in nature [ , ] . the infection of a new host constitutes a sudden change in the environment in which viral replication takes place, usually with the consequence of a drastic decrease in the average fitness of the virus population, which prevents further transmission. 
the success of a virus to establish as a new infectious agent in the new host relies largely on two features (a) its ability to interact with a cellular receptor that permits the entry inside the cells and (b) the acquisition and fixation of mutations that allow efficient replication and capacity of transmission between organisms. most recent virus emergences in humans include hiv- , whose closest animal ancestor seems to be the simian immunodeficiency virus found in a particular species of chimpanzees (sivcpz) [ ] , the coronavirus causing sars (severe acute respiratory syndrome) [ ] and the influenza virus h n , an avian virus strain that can infect directly humans without further human-to-human transmission [ ] . virus dynamics in nature cannot be separated from host population dynamics, constituting two processes in continuous interaction [ , , ] . factors such as the transmission mode, the basic reproductive number (r ), the duration of the infectious period, the renewal of susceptible hosts and the durability of the immune response contribute to shape the genetic heterogeneity of viruses and the quasi-species structure. they also strongly condition the evolution of the pathogen along the time, adding a great complexity to the epidemiological and phylogenetic studies on rna viruses. an important factor in viral evolution that takes place at the inter-host level is the number of viral particles that are transmitted from one host to the other [ ] . when this number is very small, a population bottleneck takes place. then, only one or few individuals originate a new population, resulting in a strong reduction in the genetic diversity. the consequence is that any mutation present in the founder genomes will have a high probability of being transmitted to the progeny, accelerating in this way the rate of fixation of mutations ( fig. . ). 
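the effect of a bottleneck on the fixation of mutations can be sketched with a small simulation. this is a deliberately simplified, hypothetical model rather than the one used in the studies cited: selection is ignored, every genome leaves exactly two copies per generation, each copy gains at most one new mutation with probability `mu_genome`, and a single randomly chosen genome founds the next population. all names and parameter values are assumptions for illustration.

```python
import random

def passage_through_bottlenecks(n_passages: int, generations: int,
                                mu_genome: float, seed: int = 0) -> list[int]:
    """Return the number of mutations carried by the founder after each bottleneck.

    Because every genome of a new population descends from its single founder,
    the founder's mutations are present in all progeny, i.e. fixed in the
    consensus sequence of that population.
    """
    rng = random.Random(seed)
    founder = 0                      # mutations carried by the current founder
    history = []
    for _ in range(n_passages):
        population = [founder]
        for _ in range(generations):
            # each genome is copied twice; each copy may acquire one new mutation
            population = [m + (rng.random() < mu_genome)
                          for m in population for _ in range(2)]
        founder = rng.choice(population)   # bottleneck of size one
        history.append(founder)
    return history

print(passage_through_bottlenecks(n_passages=10, generations=8, mu_genome=0.3))
```

in this sketch the count of fixed mutations can only grow from one passage to the next, since no back mutation or selection is modelled; in a real population negative selection between bottlenecks would slow the accumulation.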
since most mutations are deleterious, the expected effect of their accumulation through repeated bottlenecks is a decrease in the average fitness that eventually could lead to the extinction of the population.
fig. . . accumulation of mutations in the consensus sequence of a heterogeneous virus population when a single genome (in the box) is selected to found a new population. all the mutations carried by this genome are transmitted to the progeny, and consequently they will be fixed in the consensus sequence. (panel labels: mutant spectrum; consensus sequence.)
bottlenecks are very frequent in nature, during the inter-organ or inter-host transmission of many viruses. thus, in each new infected organism, the quasispecies must be rebuilt from one or a few founder genomes, a fact that could lead to a wide diversity in diseases in which the usual form of transmission is mediated through bottlenecks. the persistence of viruses in nature and the limited number of circulating strains in diseases such as influenza, despite the frequent occurrence of bottlenecks, is paradoxical [ ] . it is believed that the inter-host competition that can induce stochastic losses of the less fit variants, together with the action of previous immune responses on genetically related virus variants (the so-called cross-immunity), are factors that restrict the strain diversity. the intra-host competition that takes place after each transmission event is an additional factor that favours the optimization of the viral population inside each infected individual and also contributes to the resistance to extinction of viruses transmitted through bottlenecks. grenfell et al. [ ] have classified rna viruses in four phylodynamic categories, according to factors pertaining to the host-pathogen interactions (mainly the duration of the infection and the nature and strength of the immune response). they are briefly described: . short infections with strong cross-immunity.
the best known viruses included in this category belong to the family of morbilliviruses (measles being a well studied example). in these viruses, epidemic cycles are mainly determined by the lifelong immunity elicited by the pathogen, so that the renewal of susceptible hosts takes place only at the birth of new individuals. the existence of a strong immune response that is powerful against all circulating strains (strain-transcending immunity) would prevent the action of selection. in these viruses, the burden of many different strains seems to be limited by spatio-temporal parameters of the dynamics of the epidemic process. . short infections with partial cross-immunity. the best example of this category is influenza a virus. the high mutation rate characteristic of rna viruses, together with the transmission of influenza through bottleneck events, stands in contrast to the limited variability observed within lineages. in contrast to measles, cross-immunity against virus variants is only partial and the replacement of susceptible individuals takes place, not only through the birth of new hosts but also through generation of new influenza strains that may affect individuals previously exposed to the virus. evolution of influenza and its epidemic dynamics have been modelled in several studies, trying to reproduce the strong seasonality of infections and the replacement of strains at each epidemic. the most successful models reproduce the behaviour of influenza epidemics when a short-lived strain-transcending immunity (in contrast to the long-lived immunity characteristic of viruses in the previous category) is included as an essential factor limiting viral diversity in the host population [ ] . however, the role of within-host dynamics after each bottleneck mediating transmission remains to be added to the epidemic model. . infections with immune enhancement. these are infections with the possibility of antibody-dependent enhancement (ade).
an example is dengue virus, which comprises four serotypes co-circulating in tropical regions. because of ade, secondary infections produced by a different virus serotype usually present with more severe symptoms than primary infections. . persistent infections. this category includes viruses such as hiv and hcv (hepatitis c virus) that can persist in their host for long periods of time. for these viruses, inter-host dynamics is slow, whereas intra-host evolution, driven by continuous and strong immune pressure, is faster and more important. since the high genetic heterogeneity of rna viruses provides an enormous adaptive capacity, it could be naively expected that additional increases in the replication error rate would make evolutionary adaptation even more efficient. however, much theoretical and experimental evidence shows that rna viruses have selected the maximal error rate that is compatible with the preservation of their genetic information. theoretical studies on molecular evolution postulate that the higher the error rate and the genome length, the smaller is the probability of obtaining a progeny identical to the parental genome and of conserving the master sequence in the population [ , ] . there is a sharp limit, called error threshold, which cannot be crossed without catastrophic consequences for the survival of the population (see the chapter by jain and krug in this book). below this limit, the quasi-species can maintain a large genetic variability from which the best adapted molecules are selected. when the threshold is crossed, the dispersing force of mutation cannot be compensated by selection of the best adapted phenotypes and the genetic information melts away in a process with the physical characteristics of a first order phase transition, such as the melting of a solid. the transition takes place in an 'information space' that is multidimensional, comprising n sequences of length n [ ] .
the error rate that can be maintained is related to the genome length according to this relation:

$$N_{\max} = \frac{\ln s}{1 - q}$$

here n max is the maximal length of the genome and it is inversely proportional to the error rate per nucleotide (1 − q). the factor s indicates the selective advantage of the master sequence in relation to the mutant spectrum. measurements of the chain lengths and the replication error rates of rna viruses show that the genome lengths of rna viruses are close to the maximum that can be maintained at the error rates of their replication. moreover, phylogenetic analysis of rna viruses reveals a negative correlation between rates of nucleotide substitution and genome size [ ] . as a direct consequence, all viral functions must be encoded within a limited genomic space ( - kb on average for most rna viruses), meaning that certain regions of the genome will often have to participate in several functions at the same time, resulting in restrictions to the capacity of rna viruses to alter their nucleotide sequences. the most frequent evolutionary constraints identified are the following [ ] :
1. usually, the antigenic determinants of a virus are domains of the same proteins involved in the recognition of the cellular receptor [ ] . this fact restricts the possibilities of immune escape to the occurrence of mutations in domains that are not crucial for penetration of the virus into the cell.
2. genomic coding regions can also be involved in the interaction with enzymes or cellular structures and in the regulation of the correct synthesis and assembly of the viral components to constitute mature particles.
3. synonymous mutations may not be silent and may have an effect on fitness because they can affect the secondary structure of rna domains critical for keeping the stability and functionality of the molecule [ ] .
4. sometimes the same genomic region can encode several proteins through the use of overlapping reading frames.
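as a rough numerical illustration of the error threshold, the sketch below assumes the standard quasispecies form of the relation, n max = ln s / (1 − q), where q is the copying fidelity per nucleotide and s the selective advantage of the master sequence. the function names and the example values of q and s are hypothetical and are not measurements reported in this chapter.

```python
import math

def error_free_copy_probability(q: float, length: int) -> float:
    """Probability that a genome of `length` nucleotides is copied without
    any error, given a per-nucleotide copying fidelity q."""
    return q ** length

def max_genome_length(s: float, q: float) -> float:
    """Maximal genome length that can be maintained (assumed relation
    N_max = ln(s) / (1 - q)), for selective advantage s > 1 and fidelity 0 < q < 1."""
    if s <= 1.0:
        raise ValueError("the master sequence needs a selective advantage s > 1")
    if not 0.0 < q < 1.0:
        raise ValueError("fidelity q must lie strictly between 0 and 1")
    return math.log(s) / (1.0 - q)

# Hypothetical values chosen only to show the scaling.
s = 10.0
for error_rate in (1e-3, 1e-4, 1e-5):
    q = 1.0 - error_rate
    n_max = max_genome_length(s, q)
    print(f"error rate {error_rate:g}: N_max ~ {n_max:,.0f} nt, "
          f"error-free copies at N_max: {error_free_copy_probability(q, round(n_max)):.3f}")
```

under this assumed relation a tenfold increase in the error rate shortens the maintainable genome tenfold, and at n max the fraction of error-free copies is close to 1/s, so the advantage of the master sequence only just compensates for the copies lost to mutation.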
in most rna viruses, a high amount of particles is not infectious, suggesting that viral populations operate near the error threshold and most mutations are not easily tolerated, possibly due to the above-mentioned constraints. the large population sizes constituted by rna viruses seem to be necessary to avoid stochastic extinctions that could happen due to the generation of many deleterious mutants. given the high mutation rate of rna viruses, and the increased fraction of deleterious mutations over advantageous ones that occur when a population is well adapted to the environment, one can think of two alternative strategies for driving a viral population to extinction. both of them involve an increase in the number of mutations in individual viral genomes, which can be related, although not necessarily, to changes in the consensus sequence of the population. the first pathway is the classical one described by molecular evolution error catastrophe theories. it consists in the increase of the replication error rate, usually through the use of mutagens. the new populations generated exhibit larger complexity than the initial ones. advantageous mutations, even if they occur, would be spoiled by the continuous generation of deleterious mutations, before they can be fixed by natural selection. in this case, it is the strong dispersing force of mutation what dominates the dynamics of the population. the second pathway consists in the application of successive bottlenecks to the population. after each bottleneck, the founder genomes give rise to a new population through a limited number of replication rounds. the larger the number of generations between bottlenecks, the closer is the new population to the equilibrium between mutation and selection [ ] . the resulting populations have two essential characteristics. 
the first one is their low complexity, because the low number of copy rounds taking place between bottlenecks does not permit to generate a large genetic diversity. the second one is an increased rate in the fixation of mutations in the consensus sequence, since most mutations present in the founder genomes are transmitted to the descendants (fig. . ) . given the high amount of deleterious mutations, the expected result of their accumulation is a progressive reduction in the average fitness of the population that could lead to the extinction of infectivity. the structure of the viral populations generated through the two pathways described here have different evolutionary consequences that have been explored experimentally by several groups. next sections contain a review of the main results published in this field. there are many experimental evidences documenting extinction of rna viruses experiencing an increased mutation rate due to the action of mutagens [ ] [ ] [ ] [ ] [ ] [ ] . the mutagens most currently used are -fluorouracil (fu), -azacytidine (azc), azidothymidine (azt), ribavirin and -hydroxyurea. some of them are nucleoside analogues that, in addition to increasing the rate of erroneous incorporation of nucleotides, can also interfere with other cellular or viral processes, such as endogenous nucleotide metabolism, viral replication or transcription. foot-and-mouth disease virus (fmdv), poliovirus, hiv- and lymphocytic choriomeningitis virus (lcmv) are some examples of rna viruses in which successful extinctions of infectivity have been documented. the results agree with molecular evolution theories that postulate that viral replication operates very close to an error threshold that cannot be crossed without compromising the transmission of genetic information and the existence of the population (reviewed in [ ] ). 
although many studies have been devoted to the characterization of the mutant spectrum of pre-extinction populations [ , , ] , it is not clear how the quasi-species loses its infective capacity. it is not known whether all the genomes carry lethal mutations and are therefore unable to replicate, or whether it is the disorganization of the mutant spectrum that renders the quasi-species non-infective. in the latter case, the quasi-species could still conserve some viable genomes that, in the absence of the interfering mutants, could initiate the development of an infective population. mutagenized populations of fmdv treated with azc and fu could be efficiently extinguished [ ] . as expected, the characterization of the rna genomes composing the pre-extinction populations did not show mutations in the consensus sequences, but displayed an increase in the complexity of the mutant spectrum (reviewed in [ ] ). the maximum increases in complexity occurred in the polymerase gene, which usually is well conserved. other studies have also demonstrated the invariance of the consensus sequence, despite the occurrence of a high number of mutations in individual genomes [ ] . the same mutagenic agent can behave differently in different viruses. as an example, lcmv was systematically extinguished after only two or three passages in the presence of fu [ , ] , whereas extinction of fmdv was stochastic and required a larger number of passages in the presence of similar amounts of mutagenic agent. differences in the susceptibility of a virus to a mutagen can be explained by a number of factors, including different affinity of the polymerase for the mutagen, the effect of the mutagen on other viral or cellular processes, and the type of mutations preferentially induced by the mutagen, which can affect viral functions differently depending on the nucleotide composition of the virus genomes [ ] .
the influence of variations in the mutation rate of different virus polymerases in the capacity of mutagens to extinguish infections is not well known. in principle, it should be expected that the closer is the virus to the error threshold, the easier should be its extinction by increased mutagenesis. however, the error rate of the polymerase is very difficult to estimate, and it can change depending on environmental factors and the region of the genome sequenced. mutation rates are usually obtained from measurements of mutation frequency, a procedure that can lead to underestimation of the true mutation rates, because only replicating genomes are abundant enough to be detected. the isolation of a poliovirus mutant with a high fidelity polymerase [ , ] that is resistant to the action of ribavirin and other mutagens clearly indicates that the error rate of a particular polymerase is a relevant factor contributing to the efficiency of increased mutagenesis to extinguish viral infections. studies in both riboviruses and retroviruses suggest that host enzymes also represent a potential source of variation by rna editing [ ] . there are some cellular enzymes able to produce hypermutation in the viral genomes, which occurs as clusters of specific base substitutions. a documented example is the enzyme apobec g, which has been shown to generate g→a hypermutations in hiv- . enzymes of this type could act as a natural strategy for limiting viral infection by increasing mutagenesis above the error threshold. the discovery of these host factors constitutes an alternative for the development of agents that specifically enhance the natural antiviral activity of cells. recently, extinction by lethal mutagenesis has been shown to involve more complex mechanisms than those affecting only the replicative ability of genomes [ ] . 
it is well known that in a normal infection a variable amount of the viruses produced are non-infective because they are unable to code for all functional proteins [ ] . however, inside the cell, it is plausible that many of these non-infective genomes behave as parasites and replicate using the proteins produced by other viruses. when the mutation rate is kept below a critical threshold, defective mutants maintain an equilibrium with viable genomes. the increase in the mutation rate forces the appearance of a larger amount of defective genomes that, beyond a critical fraction, can exhaust the resources necessary for viral replication, becoming an additional force that can promote extinction. this conceptual framework derives from several 'in vitro' experiments with lcmv [ ] . infective viruses and rna genomic molecules were monitored during a virological steady-state persistent infection of bhk- cells by lcmv in the absence and presence of -fu (fig. . ) . in the course of the infection, there is a clear increase in the number of genomic rna molecules, both in the intra-cellular fraction and in supernatants of control and mutagenized virus. however, in fu-treated virus cultures, infectivity declines and falls below detection, despite the high number of genomic rna molecules. the number of infective units per rna molecule as a function of the mutation frequency yields a curve with a sharp decay when the mutation frequency overcomes a critical threshold. the sudden loss of infectivity takes place through a transition analogous to that predicted by error catastrophe theories.
fig. . . although a high amount of rna is still present in the samples, infectivity declines until undetectable levels, indicating that replicative ability does not disappear simultaneously with infectivity. further details of this experiment can be found in [ ] . open symbols correspond to the intra-cellular fraction. filled symbols correspond to the supernatant fraction
however, the unexpected outcome of the experiment was the presence of large numbers of rna molecules, revealing that the replicative ability does not disappear simultaneously with infectivity. similar results have been found with poliovirus and hantaan virus where decreases in infectivity preceded decreases in viral rna levels [ , , ] . lethal mutagenesis probably presents many of the same difficulties as conventional antiviral therapy. an important problem takes place in viruses, such as retroviruses, that can stay in a latent state during a long time in cellular or anatomical reservoirs. activation of these latent viruses can contribute to the resurgence of the disease after interruption of drug treatment in vivo [ ] . however, the strongest obstacle to antiviral mutagenesis is the appearance of drug-resistant mutants due to the presence of enhanced fidelity polymerases. possibly these mutants are less able to generate resistances to other antiviral drugs, due to the diminished ability of adaptation that results from the reduction of the genetic diversity because of the higher fidelity of the polymerase. therefore, combined therapies consisting of lethal mutagenesis and other antivirals could be a promising strategy for the treatment of viral infections [ ] . the probability of extinction of small asexual populations due to the accumulation of mutations was first studied by muller several decades ago [ ] . he predicted that the genomes with the lowest mutational load could be stochastically lost due to population fluctuations through a mechanism similar to the clicks of a ratchet. when the ratchet clicks the first time, this means that the genomes with no mutations are lost and the least loaded class corresponds to individuals carrying one mutation. in the next click, the one-mutation class disappears by a similar mechanism, and the least mutated class corresponds now to genomes with two mutations and so on. 
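the ratchet mechanism described above can be sketched with a toy simulation. this is a minimal illustration under stated assumptions, not muller's original formulation: each genome is reduced to its count of deleterious mutations, back mutations are excluded, every mutation lowers the chance of being copied by the same factor, and all parameter names and default values are hypothetical.

```python
import random

def mullers_ratchet(pop_size=50, mutation_prob=0.3, selection=0.1,
                    generations=200, seed=1):
    """Return the mutation count of the least-loaded class at each generation.

    Hypothetical toy model: constant population size, no back mutation,
    multiplicative fitness (1 - selection) ** load, at most one new
    mutation per genome per generation.
    """
    rng = random.Random(seed)
    loads = [0] * pop_size          # every genome starts mutation-free
    least_loaded = []
    for _ in range(generations):
        # Less-mutated genomes are more likely to be chosen as parents.
        weights = [(1.0 - selection) ** k for k in loads]
        parents = rng.choices(loads, weights=weights, k=pop_size)
        # Each offspring inherits the parental load and may gain one mutation.
        loads = [k + (1 if rng.random() < mutation_prob else 0) for k in parents]
        least_loaded.append(min(loads))
    return least_loaded

history = mullers_ratchet()
clicks = sum(1 for a, b in zip(history, history[1:]) if b > a)
print("least-loaded class after the last generation:", history[-1])
print("number of generations in which the ratchet advanced:", clicks)
```

because no back mutation is allowed in this sketch, the least-loaded class can never decrease: once the best class is lost by chance it cannot be regenerated, which is the irreversibility that the ratchet metaphor refers to.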
at that time it was believed that the least mutated genomes were the best adapted and that reversions were the only mechanism able to recover fitness. thus, this process, which is particularly effective at high mutation rates, as happens in rna viruses, should inevitably imply a progressive fitness loss that can lead populations to extinction.
the experimental study of the transmission of rna viruses through successive bottlenecks is usually carried out by making serial plaque-to-plaque transfers ( fig. . ) . at each transfer, the viral population is plated at low multiplicity of infection to get well-isolated lytic plaques that are the result of the infection by a single virus, which after several replication rounds gives rise to a progeny. since at each transfer the effective population size is reduced to one individual, this constitutes the most extreme form of bottleneck. the population contained in a randomly chosen plaque is isolated, properly diluted and plated again in a process that is serially repeated. the consequences on fitness of successive repetitions of this process have been analyzed with several rna viruses including bacteriophages ms [ ] and phi [ ] , vesicular stomatitis virus (vsv) [ ] [ ] [ ] , fmdv [ ] [ ] [ ] and hiv- [ ] . in all these studies, progressive fitness declines were found, although extinctions of infectivity were only observed in the case of hiv- . the most complete study on the effect that the accumulation of mutations through plaque-to-plaque transfers has on fitness evolution has been carried out with fmdv [ ] [ ] [ ] . in this study, the titer of the plaques (determined as the number of infectious units per plaque or pfu) at each transfer was taken as a measure of fitness. the mutations that accumulated along the process were identified by determining the consensus sequence of the viral population isolated from single plaques at different transfers.
the expected result of the experiment was a progressive decrease in fitness accompanied by an increase in the number of mutations fixed in the consensus sequence. after a certain number of transfers, extinctions of infectivity were expected. in contrast to these expectations, a biphasic dynamics of fitness decrease was observed. there was an initial period of roughly exponential fitness loss, but after a variable number of passages, a statistically stationary state of fitness with large fluctuations around a constant mean value was reached (fig. . ) . in this state, the virus exhibits a great resistance to extinction, since when it reaches a very low fitness value, the usual outcome at the next passage is a sudden fitness recovery. a detailed statistical analysis of the viral titers at the stationary state showed that fluctuations in the viral yield followed a weibull distribution [ ] . this distribution is indicative of an underlying dynamics with two main features: (a) an exponential amplification of the founder genomes during the development of each plaque, so that small fitness differences are considerably amplified, and (b) large variations in the initial state of the system at each transfer, which is determined by the stochastic nature of the sampling process. strikingly, mutations accumulated at the same rate in the phase of fitness decrease and in the stationary state [ ] . this might indicate that the nature and effects of mutations can vary with the transfer number, depending on the restrictions imposed by the selection of the genomes able to form plaques. when the population is well adapted to the environment, as happens at the beginning of the experiment, deleterious mutations are well tolerated. however, as the population becomes more debilitated, less deleterious mutations can be accepted and possibly there are many extinctions of individual genomes that become unable to replicate. nevertheless, a fraction of the genomes contained in a plaque can possess advantageous mutations, on some occasions because the mutation has a positive effect 'per se' and on others because it has a compensatory effect in a concrete genome carrying a particular combination of mutations. in the stationary state, where average fitness values possibly are the lowest ones compatible with virus survival, advantageous mutations would be more easily selected, because only the genomes carrying them can form plaques and be chosen for the next transfer. each advantageous mutation produces a fitness increase that moves the genome to a different position in the fitness landscape. this permits the acceptance of additional deleterious mutations, originating the fluctuating pattern of infectivity that is observed in the experiments.
fig. . . infectious units per plaque produced along the process of plaque-to-plaque transfers experienced by the viral clone c . after an exponential decay of infectivity, a statistical stationary state with strong fluctuations is attained
an interesting result is the preferential accumulation of mutations in certain genomic regions that present a mutation frequency significantly higher than the average obtained considering the whole genome [ ] (fig. . ). an unusual distribution of mutations has also been found in bottlenecked hiv- clones in which there was a higher accumulation of mutations in the gene gag and the first third of the genome, compared to the gene env, which is less conserved in natural populations of the virus [ ] . bottlenecked vsv clones also accumulated a high number of mutations in the n open reading frame, contrasting with the conservation of this region in natural isolates [ ] . all these results suggest that bottlenecks permit the isolation of genomes that otherwise would be eliminated under the action of positive selection that dominates virus optimization. genomic regions that seem to be highly conserved might mutate at the same rate as the rest of the genome, although subjected to stronger constraints. nevertheless, the evolutionary relevance of the mutation clusters observed in the bottlenecked fmdv clones and the molecular mechanism by which they are generated are unknown, and further experiments are in progress to answer this question.
fig. . . [ ] . the lines below the genome indicate the non-synonymous (ns) and synonymous (s) mutations present in the virus. the boxes indicate the genomic regions where the number of mutations is significantly higher than the average for the whole genome
a numerical model of evolution through bottlenecks was developed with the aim of identifying the parameters that are responsible for the biphasic dynamics of fitness loss [ , ] . the main features of the model are the occurrence with low probability of advantageous mutations and the presence of an extinction threshold, which means that genomes reaching the minimal allowed fitness value are eliminated. the results of the simulations were very similar to those observed in the experiments: a biphasic dynamics of fitness decrease and large fluctuations in the fitness values attained at the stationary state. moreover, the statistical analysis of fitness values reveals that, similarly to the experimental results, they follow a weibull distribution, strongly supporting that the underlying dynamics must be the same in both the simulations and the experiments. the elimination of individuals as their fitness falls below the extinction threshold and the probability of selecting for the subsequent transfer genomes with compensatory mutations constitute two factors acting in close concert to avoid extinctions due to an excessive accumulation of deleterious mutations.
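the two ingredients named above, rare advantageous mutations and an extinction threshold, can be combined in a small simulation of serial single-genome bottlenecks. the code below is only a sketch under assumptions chosen for brevity and is not the published model: every genome leaves two copies per replication round, each copy receives at most one mutation of fixed multiplicative effect, and all names and parameter values are hypothetical.

```python
import random

def plaque_to_plaque(transfers=30, rounds=6, p_del=0.3, p_adv=0.01,
                     effect=0.2, min_fitness=0.05, seed=0):
    """Fitness of the founder genome at each transfer (hypothetical toy model)."""
    rng = random.Random(seed)
    founder = 1.0
    founders = [founder]
    for _ in range(transfers):
        plaque = [founder]                      # each plaque starts from one genome
        for _ in range(rounds):
            progeny = []
            for fitness in plaque:
                for _ in range(2):              # two copies per genome per round
                    r = rng.random()
                    if r < p_adv:               # rare advantageous mutation
                        child = fitness * (1.0 + effect)
                    elif r < p_adv + p_del:     # frequent deleterious mutation
                        child = fitness * (1.0 - effect)
                    else:
                        child = fitness
                    if child >= min_fitness:    # extinction threshold
                        progeny.append(child)
            plaque = progeny
            if not plaque:
                return founders                 # the whole lineage went extinct
        founder = rng.choice(plaque)            # bottleneck: pick one genome
        founders.append(founder)
    return founders

trajectory = plaque_to_plaque()
print("transfers completed:", len(trajectory) - 1)
print("lowest founder fitness reached:", min(trajectory))
```

in this sketch genomes below the threshold never found a plaque, so the recorded founder fitness cannot fall under min_fitness; whether a given parameter set reproduces the biphasic decline and the fluctuating stationary state reported in the experiments would have to be checked by running it.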
the occurrence of compensatory, advantageous mutations was not introduced in most models of muller's ratchet that considered that back mutations were the only mechanism to revert the negative effect of deleterious mutations [ , ] . however, compensatory mutations are much more frequent than reversions as a mechanism to increase fitness, as it has been demonstrated in several theoretical and experimental studies [ , ] . accordingly, during the process of fitness recovery of fmdv and vsv bottlenecked clones upon large population passages, both reversions and compensatory mutations were found to be responsible for the observed fitness increases [ , , ] . none of the recovered strains reverted to a wild type sequence, confirming that bottlenecks move the quasi-species through the fitness landscape towards regions where the adaptive value of mutations can be drastically altered. the results of all these studies show that there are different mechanisms able to modulate the adaptive value of mutations. when the environment is altered, a new fitness landscape appears where the effect of particular mutations varies. in a similar way, even if the fitness landscape is not modified, bottlenecks constitute an effective way to explore new regions, where the selective value of mutations can differ from that present in the initial quasi-species. this means that the effect of mutations can vary depending on the mutations previously accumulated in the genome, a fact that points to epistatic interactions. sanjuán et al. have studied the effect of pair of mutations in the vsv genome, compared to their effects as single mutations [ ] . they found mainly antagonistic interactions between deleterious mutations (the effect of both mutations appearing together is smaller than the sum of the separate effect of each mutation). this finding can partially explain the non-linear dynamics of fitness loss observed in the fmdv clones. 
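the notion of antagonistic interaction between deleterious mutations can be made concrete with a small calculation. the sketch assumes one common convention, in which fitness values are expressed relative to the wild type and epistasis is the difference between the observed double-mutant fitness and the product of the single-mutant fitnesses; this convention, the function names and the example values are assumptions made here and are not taken from the study cited.

```python
def epistasis(w_a, w_b, w_ab):
    """Observed double-mutant fitness minus the multiplicative expectation."""
    return w_ab - w_a * w_b

def classify_deleterious_pair(w_a, w_b, w_ab, tol=1e-9):
    """Label the interaction between two deleterious mutations.

    A positive value means the double mutant is fitter than expected,
    i.e. the combined effect is smaller than predicted (antagonistic).
    """
    e = epistasis(w_a, w_b, w_ab)
    if e > tol:
        return "antagonistic"
    if e < -tol:
        return "synergistic"
    return "none"

# Hypothetical relative fitness values for two single mutants and the double mutant.
print(classify_deleterious_pair(0.8, 0.7, 0.65))
```

with antagonistic interactions each additional deleterious mutation costs less than the previous ones, which is consistent with a fitness decline that slows down as mutations accumulate.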
some theoretical studies also show that antagonistic epistasis can reduce the speed of the ratchet. a relevant question concerns the effect that the high mutational load of viral populations with a long history of bottlenecks has on their adaptability. the studies of novella [ ] have shown that bottlenecked viruses, even if they have recovered fitness through massive passages, always lose in competition experiments with the wild type, meaning that they have lower adaptability. it would be quite interesting to investigate whether the high number of mutations accumulated in bottlenecked viruses also has negative consequences for adaptation to a new environment with a different fitness landscape. these studies can be carried out with viruses carrying different combinations of mutations and having the same fitness value, such as those obtained at different transfer numbers in the stationary state attained by fmdv bottlenecked clones. the results of experiments of this type would provide more insight into the alternative adaptive solutions that can be explored by rna virus populations differing in the consensus sequence.
most of the viruses that are important human pathogens have rna as genetic material. all of them share high mutability and a great potential for adaptation that makes their eradication enormously difficult. the isolation of drug-resistant mutants, the emergence of new diseases in humans caused by viruses that usually are maintained in animal reservoirs, and the appearance of viral variants able to resist the action of the immune system of the host constitute important challenges for research in this century. one of the most promising strategies for the control of viral diseases consists in the increase of the error rate of viral replication above the threshold that prevents further transmission of genetic information.
the difficulty of applying lethal mutagenesis to the treatment of viral infections largely relies, as happens with other antiviral drugs, on the emergence of resistant mutants, which in this case would probably be those carrying high fidelity polymerases. the knowledge of the exact mechanisms leading a population to error catastrophe implies a detailed study of the composition and structure of the mutant spectrum of the quasi-species. in this sense, the comparison with the structure of bottlenecked populations that have accumulated a large number of mutations still compatible with survival can help to design new strategies for the extinction of infectivity.
the author thanks s.c. manrubia, c. escarmís, a. grande-pérez, j. pérez-mercader and e. domingo for valuable suggestions and their participation in many of the studies detailed in this chapter. j.e. gonzález-pastor is also acknowledged for critical reading of the manuscript. work at centro de astrobiología has been supported by inta (instituto nacional de técnica aeroespacial).
key: cord- -yjvwa ot authors: mitchell, michael title: taxonomy date: - - journal: viruses and the lung doi: . / - - - - _ sha: doc_id: cord_uid: yjvwa ot this chapter addresses the classification and taxonomy of viruses with special attention to viruses that show pneumotropic properties. information provided in this chapter supplements that provided in other chapters in parts ii–v of this volume that discuss individual viral pathogens.
taxonomy may be defined as a logical discipline for the identification and classification of biological entities based on objective, measurable characteristics of relevant entities. useful taxonomic systems should be broadly applicable across diverse types of biological groups. they should also be flexible, so that new data from technological advances may be integrated into the classification scheme. primary goals of systemic taxonomy, regardless of biological discipline, include the following:
• establishing groups (taxa) that reflect varying degrees of evolutionary relatedness among the different biological entities studied
• establishing criteria for assignment of known or unknown clinical isolates to a given group
• establishing a clear and unequivocal nomenclature
the origins of biological taxonomy are firmly rooted in botany and zoology. early taxonomic systems relied on gross characteristics, like biological niche, internal and external morphology, reproductive strategies and compatibilities, and fossil records. the seminal works of the swedish botanist carl linnaeus used a hierarchical scheme to represent biological relatedness and established the simplified binomial system of nomenclature that serves as the basis for modern classification systems. the modern scientific classification in biology is designed to describe all biological entities within a hierarchy consisting of the following taxa:
a basic assumption underlying the establishment of such a hierarchy is that all biological entities have evolved from a single common cellular life-form. different biological entities have evolved as a result of accumulated changes in dna that have provided survival advantages in different ecological niches.
species may be classified on the basis of phylogenetic and evolutionary relatedness: members of a given species are the most closely related, different species within a single genus are more closely related to each other than to a species within a different genus, and so on. newer technologies like microscopy, improved biochemical and physiological analysis, and advanced protein and molecular analytical methods have resulted in an enormous expansion of characteristics that may be studied for the classification of biological entities and validation of taxonomic systems (woese et al. ). there are a number of excellent texts that discuss the clinical and laboratory aspects of virus biology (knipe and howley ; richman et al. ; versalovic ) . though viruses are certainly "biological entities," they are fundamentally different from the cellular life-forms classified by previous taxonomic schemes. viruses have no autonomous metabolic or replicative ability; they are completely dependent on cellular life-forms. however, within their biological milieu, viruses do replicate and evolve, and they are composed of the same types of organic macromolecules as are cellular life-forms. because of their intimate relationship with cellular life-forms, it seems legitimate to integrate the schemes for classification of viruses with the schemes used for biological classification of cellular life-forms (lefkowitz ) . initially, various features, like host range, cross-immunity, clinical disease, and pathologic features, were used to classify viruses. technological advances have led to more detailed and integrated classification, taxonomy, and phylogenetic characterization (evolutionary relatedness) of viruses. sophisticated nucleic acid sequence analysis has emerged as a powerful tool for virus classification and phylogenetic determination, in spite of some limitations (holmes ; mccormack and clewley ; zanotto et al. ) .
a robust system for classification of viruses developed by david baltimore has gained wide acceptance (baltimore ) . classification is based on the genomic nucleic acid used by the virus (dna or rna), strandedness (single or double stranded), and method of replication. the system has been used to define seven classes of viruses:
class i: double-stranded dna (dsdna)
class ii: single-stranded dna (ssdna)
the primary classification of viruses is into species. a virus species is defined as a polythetic class of viruses that constitute a replicating lineage and occupy a specific ecological niche (international committee on taxonomy of viruses ). in polythetic classifications, group members share a number of characteristics, but no single characteristic is necessary or sufficient to define members of the group. higher-level taxa are monothetic, i.e., there are characteristics that are necessary and sufficient to define members of the class. it is important to note that not all viruses can be assigned through all taxonomic levels. virus species may be assigned to a genus or remain unassigned. similarly, a genus may be assigned to a family or subfamily, or remain unassigned, and so on up the taxonomic hierarchy. each genus has a type species. the type species is the virus that necessitated the creation of the genus; it is always linked to the genus. in the most recent publication ( ), the ictv recognized orders, families, subfamilies, genera, and , species. important characteristics used by the ictv to define and classify viruses within these taxa include the following:
• susceptible host range: most viruses have a restricted range of hosts which they are able to infect.
• virus structure: the viral genome is surrounded by a protective shell of proteins called a capsid. the capsid may also enclose proteins, like reverse transcriptase or proteins required for organization of the nucleocapsid.
a nucleocapsid refers to a viral nucleic acid surrounded by an intact capsid. the nucleocapsids of certain viruses are also surrounded by an envelope of host-derived membranes. the complete virus particle is referred to as a virion. icosahedral capsids are very common; these quasi-spherical shells are composed of identical equilateral triangles with edges and vertices. icosahedral capsids are very efficient geometrically (internal volume versus protein content) and genetically (many small sides require fewer and smaller genes to code for capsid proteins). the nucleocapsid proteins of some viruses, like the influenza viruses, form helical tubes with the nucleic acid incorporated directly into the helical structure. the nucleocapsids of some viruses are surrounded by envelopes composed of lipid bilayers and host- or viral-encoded proteins. envelopes are typically acquired by budding of the nucleocapsid through a virally modified portion of a specific host-cell membrane (plasma, endoplasmic reticulum, golgi, nucleus) . the shape of the virus nucleocapsid or intact virion is usually determined by electron microscopy. the shape and dimensions of the nucleocapsid and intact virion, and the presence or absence of an envelope, are useful characteristics for classifying viruses.
• genome: the viral genome is either dna or rna; the nucleic acids may be single or double stranded. the genome size may be expressed in terms of kilobases (kb) for single-stranded genomes or kilobase pairs (kbp) for double-stranded genomes. the sequence of genes of positive-sense ssrna may be directly translated by the host into viral proteins. the sequence of negative-sense ssrna is complementary to the coding sequence for translation, so mrna must be synthesized by rna polymerase, typically carried within the virion, before translation into viral proteins.
the sequence of positive-sense ssdna is the same as that of the mrna coding for viral proteins; negative-sense ssdna is complementary to mrna and may be transcribed into mrna for viral protein synthesis. ambisense single-stranded nucleic acids use both positive-sense and negative-sense sequences. the viral nucleic acid may be linear or circular; the nucleic acid may be in the form of a single molecule or broken into two or more segments. in addition to the type of nucleic acid, the size of the viral genome, measured in number of bases or base pairs, is an important characteristic used for classification.
• nucleic acid sequence analysis: the analysis of specific viral nucleic acid sequences is increasingly used as a powerful tool for taxonomic assignment and assessment of evolutionary relatedness. the utility is greatest for related groups of viruses (lauber and gorbalenya a , b ) , but has been challenging for more divergent groups of viruses. sequence analysis alone has not provided a reliable single criterion on which all viruses may be classified. construction of a universal phylogenetic tree for viruses, as has been proposed for cellular life-forms, may not be possible for viruses. it is not clear that all viruses emerged from a single progenitor virus; there is evidence for multiple, independent origins of existing viruses. phylogenetic analysis using nucleic acid sequences is further complicated by recombination, reassortment, incorporation of host nucleic acid sequences, and other factors (domingo ; holmes ) .
currently, expert consensus, considering laboratory, phenotypic, clinical, and other characteristics, remains the most accurate and robust method for the classification and taxonomic assignment of viruses. note that the formal names assigned at all taxonomic levels are italicized, while the common names, which are often used clinically, are not italicized. the viruses that have been associated with human infections are shown in table . .
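the genome-based criteria described above (nucleic acid type, strandedness, sense and method of replication) can be expressed as a small lookup. the sketch below is illustrative only: classes iii to vii are filled in from the standard baltimore scheme because the class list in the text breaks off after class ii, ambisense genomes are deliberately not handled, and the function names and argument spellings are hypothetical.

```python
def baltimore_class(nucleic_acid, strands, sense=None, reverse_transcribing=False):
    """Return the Baltimore class (I to VII) for a genome description.

    Classes III to VII follow the standard Baltimore scheme; they are not
    listed in the surrounding text. Ambisense genomes are not handled.
    """
    key = (nucleic_acid, strands)
    if key == ("dna", "double"):
        return "VII" if reverse_transcribing else "I"
    if key == ("dna", "single"):
        return "II"
    if key == ("rna", "double"):
        return "III"
    if key == ("rna", "single"):
        if sense == "positive":
            return "VI" if reverse_transcribing else "IV"
        if sense == "negative":
            return "V"
    raise ValueError("unsupported genome description")

RNA_COMPLEMENT = str.maketrans("acgu", "ugca")

def mrna_from_ssrna(genome, sense):
    """mRNA-sense sequence (5' to 3') for a single-stranded RNA genome."""
    genome = genome.lower()
    if sense == "positive":
        return genome                                   # genome is already mRNA-sense
    if sense == "negative":
        return genome.translate(RNA_COMPLEMENT)[::-1]   # reverse complement
    raise ValueError("sense must be 'positive' or 'negative'")

print(baltimore_class("rna", "single", sense="negative"))
print(mrna_from_ssrna("uacg", "negative"))
```

the second function mirrors the point made in the genome bullet: a positive-sense genome can be read directly, whereas a negative-sense genome must first be copied into its complement before translation.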
among the families of viruses able to infect humans and other vertebrate hosts, there are many species that target and cause disease in the lung. these viruses commonly use airborne transmission as an effective mode of transmission between an infected host and a new susceptible host. characteristics of viruses that directly or indirectly cause pulmonary disease are discussed in this section.
adenoviridae: adenoviruses are pathogenic for humans and other vertebrate species. a structural protein at each of the icosahedral nucleocapsid vertices anchors a rodlike projection with a terminal knob, which interacts with specific host surface receptor molecules and which confers the hemagglutination pattern and tissue tropism for the different groups of adenoviruses. the genome encodes ~ genes (davison et al. a ) , including common genes and species-specific genes. genes are grouped into early, delayed early, and late transcribed genes. the genome contains inverted repeat sequences at both ends. sequences of both dna strands are transcribed to mrna; mrna splicing is used for expression of many adenovirus genes. the family adenoviridae has not been assigned to an order. within this family, there are five genera. the seven species that cause human infection are human adenovirus a, b, c, d, e, f, and g, all within the mastadenovirus genus; there are accepted serotypes (buckwalter et al. ) . endemic respiratory infections are most commonly caused by serotypes of human adenovirus c (the type species of the genus); most epidemic respiratory infections are caused by serotypes within species adenovirus b and adenovirus e.
arenaviridae: arenaviruses may cause several hemorrhagic fever syndromes. specific rodents are the reservoir for each arenavirus; human disease is incidental and is usually transmitted by infectious aerosols.
viruses of this family are enveloped; evenly spaced glycoprotein complexes (a tetramer of viral gp with viral gp ionically bound as a globular head) are attached to the envelope giving complete virions a studded spherical morphology. complete virions are ~ nm in diameter, but show significant pleomorphism (range, - nm). the genome is divided into two segments which are complexed with nucleoproteins (peters ). complementary sequences at the ′ and ′ ends of each segment result in the formation of two circular nucleocapsids. arenaviruses use both negative-sense and ambisense coding strategies. host ribosomes are often incorporated within the envelope of complete virions. this family of viruses is not assigned to an order. there is one genus, arenavirus, with species that fall into two complexes on the basis of serologic and genetic relatedness. the old world, or african, species include lassa virus (lassa fever) and lujo virus. the new world species include guanarito virus (venezuelan hf), junín virus (argentine hf), and machupo virus (bolivian hf). the type species of the genus arenavirus is lymphocytic choriomeningitis virus.
bunyaviridae: bunyaviruses may cause several hemorrhagic fever syndromes. viruses
coronaviridae: transmembrane proteins produce blunt projections from the surface of coronaviruses, resulting in a "crown-like" appearance on electron microscopic studies ( - nm in diameter). translation of the coronavirus genome is unique and includes production of polyproteins, discontinuous synthesis, overlapping reading frames, ribosomal frame shifting, and post-translational proteolytic processing (marra et al. ; rota et al. ; theil et al. ). the major structural proteins, spike glycoprotein (s), membrane glycoprotein (m), nucleocapsid phosphoprotein (n), hemagglutinin-esterase glycoprotein (he), and envelope protein (e), are present in all coronaviruses. nonstructural proteins are encoded in - unique or overlapping reading frames (lai et al. ).
the human coronaviruses are assigned to the order nidovirales, family coronaviridae, and subfamily coronavirinae. there are four genera and three serological groups. relevant viruses include human coronavirus e and human coronavirus nl of the genus alphacoronavirus (antigenic group i), human coronavirus hku , betacoronavirus and severe acute respiratory syndrome-related coronavirus of the genus betacoronavirus (antigenic group ii).
filoviridae: filoviruses may cause several hemorrhagic fever syndromes. the filoviruses have a unique threadlike morphology. the helical nucleocapsids are surrounded by an envelope studded by spikes formed by a single type of glycoprotein (gp). the genome consists of a single segment of negative-sense ssrna that encodes seven proteins (kuhn et al. ). the presence of gene overlap for several genes is an unusual feature of filoviruses. in ebolaviruses, the surface glycoprotein is encoded by two adjacent reading frames. a truncated version (sgp), which lacks the hydrophobic anchor, results from translation of the upstream reading frame only. this protein is secreted from cells and may serve as a decoy for the host's immunological response. the full-length gp is formed only when the rna polymerase misreads a poly-u editing site between the reading frames. the full-length gp is inserted, as homotrimers, into the host membranes that will form the virion envelope. a helical nucleocapsid is formed by association of the ssrna with nucleoproteins. the nucleocapsid is ~ nm in diameter, with a central axial space ~ nm in diameter. the nucleocapsid is attached to the envelope by matrix protein. the complete virions are ~ nm in diameter, but the virion length may vary from to , nm. the family filoviridae is assigned to the order mononegavirales. there are two genera within the family: ebolavirus and marburgvirus. there are five ebolavirus species, including sudan ebolavirus and zaire ebolavirus (the type species).
the genus marburgvirus consists of one species, marburg marburgvirus. humans and nonhuman primates are susceptible to ebolavirus and marburgvirus infection; the host reservoirs for these viruses are unknown. humans may be infected sporadically by presumed contact with the host species or by direct contact with virus-containing body fluids taken from acutely infected humans or nonhuman primates. nosocomial and laboratory-acquired infections are well described.
flaviviridae: flaviviruses may cause several hemorrhagic fever syndromes. hepatitis c virus is also a flavivirus species. flaviviruses are surrounded by an envelope studded with dimers of viral e glycoprotein and m protein which give the mature virion a herringbone appearance with icosahedral symmetry. the genome consists of a single segment of positive-sense ssrna (chambers et al. ; osatomi and sumiyoshi ). cyclization of the genome, through hybridization of rna sequences of the ′ and ′ ends of the genome, may be required for mrna synthesis (alvarez et al. ). there is a long open reading frame that codes for three structural proteins at the ′ end; downstream of this region are genes for seven nonstructural proteins (thurner et al. ). the positive-sense genome is directly translated into a large polyprotein, which undergoes intra- and post-translational cleavage. strain evolution and clinical diversity have been driven by a high rate of mutation during replication and through molecular recombination. the nucleocapsid is formed by interaction of genomic rna with capsid proteins. the complete virion has a spherical morphology approximately nm in diameter. this family of viruses is not assigned to an order. there are four genera within the family flaviviridae. within the genus flavivirus, there are species, including dengue virus (simmons et al. hepatitis c virus (hcv) is the type species of the genus hepacivirus in the family flaviviridae.
the physical properties of hcv have not been as well defined as other flaviviruses because there is no efficient method for in vitro replication of hcv. virion morphology is consistent with other flaviviruses; complete, enveloped virions have a diameter of - nm. the single segment positive-sense ssrna is ~ . kb in length (hijikata et al. ). a single open reading frame is flanked by highly conserved regions at the ′ and ′ ends. cap-independent protein synthesis, typical of flavivirus species, is initiated at an internal ribosomal entry site (ires) within the ′ untranslated region. this results in synthesis of a polyprotein that undergoes cleavage and further processing during and after translation. a unique and highly conserved sequence upstream of the ires interacts with liver-specific microrna and is required for efficient replication. circulating hcv is associated with host ldl/vldl, which may play a role in delivery of virions to hepatocytes. the error-prone rna polymerase and high replication rate of hcv have resulted in great genetic diversity and heterogeneity of clinical isolates. hcv isolates can be grouped by genotypic analysis into six groups and many subgroups. there are differences with respect to responses to antiviral therapy among the genotypes, but intrinsic virulence is similar. the vast majority of strains in the united states are genotypes a, b, and , whereas central african strains are almost exclusively genotype .
hemorrhagic fever (hf) syndromes: viral hemorrhagic fever syndromes may be caused by many species of viruses from four different families: arenaviridae, bunyaviridae, flaviviridae, and filoviridae; all are single-stranded rna viruses. see the discussions above for specific information related to these virus families. typical symptoms of viral hemorrhagic fever infection include fever, malaise, hypotension, and coagulation defects.
with the exception of dengue virus, the hf viral agents are maintained in nonhuman vertebrate hosts; humans are coincidental, dead-end hosts. in dengue, human infection is maintained through a mosquito vector. the epidemiologic distribution of disease reflects the geographic range of the reservoir host. hf viruses primarily infect dendritic cells, macrophages, and monocytes, which are present in virtually all tissues and organ systems; parenchymal cells may also be susceptible to infection, depending on the virus. infected cells release mediators that result in markedly increased vascular permeability, compromising the function of critical organ systems. suppression of cellular type interferon response is a significant contributor to pathogenesis (habjan et al. ).
hepadnaviridae: in the family hepadnaviridae, there are two genera, avihepadnavirus (two species) and orthohepadnavirus (four species); hepatitis b virus (hbv), the type species of orthohepadnavirus, is the only human pathogen in the family. the family hepadnaviridae is not assigned to an order. eight distinct hbv genotypes (a-h) and subtypes can be recognized on the basis of antigenic or sequence variation. the genotypes show geographic and ethnic variability; the hbv genotype influences the severity and outcome of disease (garfein et al. ; lin and kao ). the complete, enveloped hbv virion (dane particle) is - nm in diameter. the icosahedral nucleocapsid (~ nm in diameter) of the virion contains a single molecule of partially double-stranded dna with a dna-dependent polymerase covalently linked to the ′ end of the complete dna strand, hepatitis b e antigen (hbeag) and hepatitis b core antigen (hbcag). the nucleocapsid is surrounded by an envelope derived from host-cell membrane and viral envelope proteins, including hepatitis b surface antigen. the genome of hbv is a circular, partially double-stranded dna molecule which is replicated by a unique process of reverse transcription of an rna intermediate.
the minus dna strand runs the entire length of the hbv genome; the plus strand covers only about two-thirds of the genome. the genome is replicated by synthesis of a full-length ssrna transcript (pre-genomic rna), followed by dsdna synthesis by reverse transcription of the ssrna by viral-encoded reverse transcriptase/dna polymerase. transcripts for all viral proteins are also produced from the minus dna strand. there are four overlapping open reading frames, all read in the same direction (liang ).
herpesviridae: the herpesvirus species associated with human infections (hsv- , hsv- , cmv, ebv, vzv, hhv- , hhv- , and hhv- ) belong to the family herpesviridae within the order herpesvirales. there are four subfamilies of the herpesviridae: alphaherpesvirinae ( genera), betaherpesvirinae ( genera), gammaherpesvirinae ( genera), and a single genus in an unassigned subfamily. specific human herpesviruses are discussed in the sections below. the herpesviruses are double-stranded dna viruses. the icosahedral capsid (~ nm diameter) is surrounded by an envelope studded by a variety of short glycoproteins. the nucleocapsid is a dense toroid complex with an outer diameter ~ nm and inner diameter ~ nm. an irregular "tegument" fills the space between the envelope and capsid. depending on the thickness of the tegument layer, complete virions range in size from ~ to > nm. the size and organization of the dsdna genome vary among the species causing human disease (mcgeoch et al. ). the genomes of human herpesviruses include unique sequences and repeated sequences. though the genomes are linear in virions, they circularize in the nucleus of infected cells, which is mediated through repeat sequences at both ends of the dsdna genome. for hhv and hhv (class a genome), a large unique sequence region is flanked by a region that is repeated at both ends of the linear strand of dsdna.
the genomes of ebv and the kaposi's sarcoma-associated herpesvirus (class c genome) have smaller left and right terminal repeat sequences, while repeat sequences r to r divide the unique sequence nucleic acid into four discrete regions. for vzv (class d genome), a large terminal sequence is inverted and inserted into the genome, resulting in a large unique sequence region (ul) and a small unique sequence region (us). hsv- , hsv- , and cmv (class e genomes) are the most complex. there are repeat sequence regions at both ends of the linear dsdna molecule. the unique sequence dsdna is divided into ul and us regions by a sequence composed of juxtaposed copies of the terminal repeat sequences inserted in an inverted orientation. typical of dsdna viruses, a large number of proteins are produced by various herpesviruses. the organization of the coding regions is complex, with ′ and ′ reading frames, gene overlap, spliced genes, and intron regions. forty genes are conserved among the α-, β-, and γ-herpesviruses. these core genes are divided among seven gene blocks (albà et al. ); within each block the order and polarity of genes are conserved, including genes for gene regulation, nucleotide metabolism, dna replication, virion maturation, envelope glycoprotein synthesis, and capsid, fusion and tegument protein synthesis. diseases caused by human herpesviruses range from systemic to localized infection of virtually all organ systems, although the host-cell range and typical disease characteristics vary by species. a characteristic of herpesvirus infections is latency, which is commonly associated with reactivation and symptomatic infections (e.g., shingles). while active infection with herpesviruses results in the destruction of the infected host cell, latently infected cells remain viable. in latently infected cells, the viral genome forms circularized molecules within the host nucleus with limited expression of viral genes. (gompels et al. 
) of these viruses has the simplest organization and lowest %g + c content compared to the other herpesviruses. reading frames are present on each strand of the dsdna. core herpesvirus proteins are clustered near the center of the strands, while species-specific genes are located toward the ends of the strands (braun et al. ). hhv- and hhv- are assigned to the genus roseolovirus in the subfamily betaherpesvirinae, family herpesviridae, and order herpesvirales. there are two distinct hhv- species: human herpesvirus a (the roseolovirus type species) and human herpesvirus b. hhv- b is the agent of exanthem subitum. there is a single hhv- species, human herpesvirus , which is also a cause of exanthem subitum. t-lymphocytes are the primary target cell of hhv- and hhv- viruses.
• kaposi's sarcoma-associated herpesvirus: the complete virions of kaposi's sarcoma-associated herpesvirus (kshv) have a diameter ~ nm. in addition to virus-specific proteins, the tegument also carries viral mrnas, probably the result of passive incorporation during the cytoplasmic envelopment process (bechtel et al. ). the envelopes of complete virions bear kshv-specific glycoproteins. the genome (~ kbp) has class c organization (russo et al. ) typical of gammaherpesvirinae. the conserved herpesvirus genes are clustered in four blocks; kshv-specific genes are typically distributed in the regions outside and between these blocks (renne et al. ). the kshv species designation is human herpesvirus , which is assigned to the genus rhadinovirus within the subfamily gammaherpesvirinae. the virus has tropism for b-lymphocytes and is implicated in all forms of kaposi's sarcoma. four clades, a-d, with distinctive geographical distributions, have been identified by genotypic analysis; the a and c clades cluster together and are most typical for isolates from europe and the united states.
• varicella-zoster virus (vzv): the dense core of vzv is enclosed in an icosahedral capsid ( - nm diameter), which is surrounded by an amorphous tegument. the envelope may be derived from multiple types of host-cell membranes during transit from the nucleus through the cytoplasm; specific viral-encoded glycoproteins are embedded in the envelope of the complete virions, which may be spherical or pleomorphic ( - nm in diameter). vzv has a class d dsdna genome (~ kbp) (clarke et al. ; davison ), resulting in production of two isomeric genomic forms by infected cells through inversion of the us region (ecker and hyman ). the genome encodes more than proteins. the organization includes grouping of several genes into single transcription units, genes with overlapping reading frames, and spliced segments (davison and scott ). the species designation for vzv is human herpesvirus . it is the type species of the genus varicellovirus within the subfamily alphaherpesvirinae. there is only a single serotype of vzv. for epidemiologic purposes, vzv isolates may be genotyped on the basis of minor differences in dna sequence; different genotypes may be classified as european, japanese, or mosaic (loparev et al. ). the host range of vzv is restricted to cells of humans or other primates; in humans, vzv has tropism for human t-lymphocytes and establishes latent infection in the cells of the dorsal root ganglia.
orthomyxoviridae: influenza viruses belong to the family orthomyxoviridae. they are polymorphic; viruses may be spherical (~ nm diameter) or filamentous. complete virions are surrounded by an envelope derived from the host cytoplasmic membrane. viral hemagglutinin and neuraminidase proteins are embedded in the envelope resulting in characteristic - nm spikes projecting from the surface of virions.
in addition to the ha and na proteins, m protein is embedded into the envelope of influenza a viruses; nb and bm proteins are embedded into the envelopes of influenza b viruses. the matrix protein (m ) is located just below the envelope. the nucleocapsid is composed of viral rna and nonstructural proteins, including ribonucleoproteins and polymerases. the genome of influenza viruses is composed of negative-sense ssrna. all viral rna synthesis occurs in the nucleus of the host cell. the a and b influenza virus genomes are composed of eight segments, while the influenza c virus genome consists of seven segments (hayden and palese ). the segments range in size from ~ to , nucleotides in length. each segment codes for one or more viral proteins (mccauley et al. ). the ′ and ′ ends of each segment contain noncoding, regulatory regions (fujii et al. ). the three largest segments code for various components of rna polymerase; the pb segment of influenza a virus has a second open reading frame that encodes the pro-apoptotic protein pb -f . in influenza types a and b, the fourth and sixth segments encode the surface hemagglutinin (ha) and neuraminidase (na) glycoproteins, respectively. the influenza a surface protein m is encoded by the seventh segment; influenza b surface protein nb is encoded by the sixth segment, while the bm is encoded by the seventh segment. the fifth segment of both a and b influenza viruses encodes the rna-binding nucleoprotein (np). the matrix protein m is encoded by the seventh segment of both viruses. the eighth and smallest rna segment of influenza a and b viruses encodes ns , a multifunctional protein with interferon antagonistic properties, and nep/ns protein, which is involved in transport of vrnps across the nuclear membrane of the host cell.
the names of clinical isolates of human influenza viruses include the species of origin, isolation location, number of the isolate, and year of isolation; influenza a virus isolates also include the hemagglutinin (h to h ) and neuraminidase (n to n ) subtypes (atmar and lindstrom ). for example, a/california/ / (h n ), a/victoria/ / (h n ), and b/wisconsin/ / viruses were recommended for the - seasonal influenza vaccine. large outbreaks have occurred only with hemagglutinin h , h , and h and neuraminidase n and n viral subtypes. antigenic drift and antigenic shift contribute to reinfection with influenza viruses (taubenberger and kash ). antigenic drift is caused by a gradual accumulation of point mutations in hemagglutinin and neuraminidase genes, which result in minor antigenic changes in these proteins. antigenic shift is caused by a virus created by reassortment of influenza virus rna segments during coinfection of a host, usually with a human influenza virus and an avian or swine influenza virus, or by introduction of a nonhuman influenza virus strain into human populations after mutation during a host-species infection creates a new isolate permissive for interspecies transmission.
papillomaviridae: the papillomaviruses (pvs) represent a large (and growing) family of viruses that currently includes different genera and species; the taxonomy has undergone significant reorganization in recent years (bravo et al. ). the oncogenic potential of human papillomaviruses is well established. pvs are non-enveloped; virions are icosahedral with diameters of - nm. the capsid contains two structural proteins, l , the most abundant viral protein, and l . the pv genome consists of a single molecule of circularized dsdna (zheng and baker ). the open reading frames for all viral genes are located on only one of the dna strands, and transcription proceeds in a single direction.
there are eight early (e) open reading frames that encode regulatory proteins that control viral metabolism and dna synthesis. the e proteins of high-risk hpv types have anti-apoptotic effects and interfere with p regulatory function in infected host cells (howley et al. ). two late (l) reading frames encode the structural proteins l and l . epithelial cells of a wide variety of vertebrate hosts are susceptible to papillomavirus infection, but the different host species are only susceptible to species-specific viruses. papillomaviruses have been classified according to susceptible host species and the type of disease produced, but comparison of sequence differences of the l reading frame has provided a more detailed description of papillomavirus phylogeny (de villiers et al. ). the family papillomaviridae is not assigned to an order. human pathogens are clustered within five papillomavirus genera.
paramyxoviridae: the paramyxoviruses are enveloped (host cytoplasmic membrane) with an unsegmented negative-sense ssrna genome ( - kb). the viral rna serves as the template for synthesis of mrna and of antigenomic (positive-sense) rna, which in turn is the template for the negative-sense rna of new virions. there are six to ten genes; genes for the six major proteins are linked in the following ′ to ′ order: nucleocapsid (n) → phosphoprotein (p) → matrix (m) → fusion (f) → hemagglutinin/neuraminidase (hn) → large polymerase (l). there is an untranslated leader sequence at the ′ end and untranslated trailer sequence at the ′ end. the genes are separated by untranslated sequences and do not overlap, with the exception of the m and l genes of human metapneumovirus. transcription is initiated at the ′ end and proceeds directly through to the ′ end. because the rna polymerase is unstable and may detach at the untranslated regions between genes, there is a gradient in the concentration of gene products from ′ to ′.
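the gradient in gene products can be sketched with a simple model. the following python example assumes, purely for illustration, that the polymerase starts at the first gene and detaches with the same fixed probability at every junction between genes; the detachment probability of 0.3 is a hypothetical value, not a measured one, and the function name is introduced here only for the sketch.

```python
# Minimal sketch of a transcription gradient along an unsegmented genome.
# Assumption (not from the text): the polymerase detaches with one fixed
# probability at each intergenic junction and cannot re-enter downstream.
GENE_ORDER = ["N", "P", "M", "F", "HN", "L"]  # order given in the text


def transcript_gradient(genes, detach_prob):
    """Return relative mRNA abundance per gene, with the first gene set to 1.0."""
    if not 0.0 <= detach_prob <= 1.0:
        raise ValueError("detach_prob must be between 0 and 1")
    # A polymerase reaches gene i only if it survived i junctions.
    return {gene: (1.0 - detach_prob) ** i for i, gene in enumerate(genes)}


if __name__ == "__main__":
    # 0.3 is a hypothetical detachment probability chosen for illustration.
    for gene, level in transcript_gradient(GENE_ORDER, 0.3).items():
        print(f"{gene:>2}: {level:.3f}")
```

under this assumption the first gene in the order (n) is the most abundantly transcribed and the last (l) the least, which matches the direction of the gradient described above.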
in different species, other proteins are produced by additional small genes, mrna editing, or overlapping reading frames within the p gene. the v and c proteins regulate viral rna transcription and also interfere with host interferon signaling and other aspects of the immune response to paramyxovirus infection (andrejeva et al. ; durbin et al. ; swedan et al. ). formation of the nucleocapsid core is constrained by a required association of one n protein to every six genomic nucleotides (kolakofsky et al. ; skiadopoulos et al. ). the resulting helical structure has a diameter of nm with a nm central core. p proteins (a polymerase cofactor) are attached to this rigid rod and serve as attachment sites for l proteins, which interact to provide enzymatic activity for rna synthesis. this core structure, rather than free genomic rna, serves as the template for mrna and antigenomic rna synthesis. the paramyxovirus m proteins surround and organize the nucleocapsid and interact with the cytoplasmic tails of transmembrane envelope proteins. the envelope formed from modified host-cell plasma membranes is studded by viral protein complexes, including hn proteins, which mediate virion attachment to target cells, and f protein, which mediates ph-independent fusion of the viral envelope and cell cytoplasmic membrane. the paramyxoviridae are one of the four families within the order mononegavirales and include significant and frequent pathogens of humans and animals. there are two subfamilies: the paramyxovirinae and the pneumovirinae. there are seven genera and thirty-one species in the subfamily paramyxovirinae and two genera and five species in the pneumovirinae subfamily.
• henipaviruses: in the subfamily paramyxovirinae, there are two species within the genus henipavirus: hendra virus (hev) and nipah virus (niv); hev is the type species. henipavirus virions are pleomorphic (spherical to helical forms).
electron microscopy of hendra virus shows a "double-fringe" appearance due to short and long surface projections (hyatt et al. ). complete virions range in size from to , nm in longest dimension. the genome (~ kb) includes genes typical of paramyxoviridae. long untranslated sequences are attached to the ′ end of five of the six genes, resulting in the larger genome size of henipaviruses compared to other paramyxoviruses (eaton et al. ; wang et al. ). the p gene also codes for v and w proteins by mrna editing and c protein by a shifted reading frame. henipaviruses are assigned to the family paramyxoviridae and subfamily paramyxovirinae. henipavirus infections are zoonotic; fruit bats are the presumed reservoir for hendra virus infections, while fruit bats (johara et al. ) includes two open reading frames; the function of m - protein is undefined, while m - protein is a regulator of viral transcription. there is no gene for hemagglutinin-neuraminidase; the product of the g gene serves as the major attachment protein. metapneumoviruses are members of the subfamily pneumovirinae. human metapneumovirus is a species in the genus metapneumovirus. there is a single hmpv serotype, with two antigenic subtypes a and b.
• human respiratory syncytial viruses: the envelope of human respiratory syncytial virus (hrsv) is studded with three viral glycoproteins: g protein (the major attachment protein), f protein, and sh protein (a small hydrophobic protein). complete virions are pleomorphic (spherical to filamentous forms). the nucleocapsid diameter, - nm, is smaller than typical for other paramyxoviruses (hall ). the hrsv genome is ~ kb and includes ten genes (collins and wertz ). the first eight reading frames are nonoverlapping; the last two genes, m and l, overlap by nucleotides. in addition to n, p, and l proteins, m - protein, a transcription factor, is associated with the nucleocapsid.
the overall organization of the hrsv genome is similar to other paramyxoviruses. in addition to the typical genes, the hrsv genome includes genes ns and ns (nonstructural proteins that interfere with interferon induction and signaling), sh, and m - and m - (nonstructural proteins involved in regulation of transcription). there is no hn gene; attachment is mediated by the g gene product. human respiratory syncytial virus is the type species of the genus pneumovirus in the subfamily pneumovirinae. there is a single hrsv serotype, with two antigenic subtypes a and b. (henrickson ). the genome of human parainfluenza viruses is ~ kb in length with an organization and six reading frames (n, p, m, f, hn, l) typical of the paramyxoviridae (karron and collins ). there are no overlapping reading frames. accessory proteins, c (hpiv and ), v (hpiv and ), and d (hpiv ), are produced by mrna editing of the p gene. n proteins are tightly bound to viral and antigenomic rna; p and l proteins are also bound to the nucleocapsid, forming functional complexes for rna polymerization and processing. human parainfluenza viruses are assigned to two genera in the subfamily paramyxovirinae. (berns ). virions are stable in the environment and thought to transmit infection by attachment to specific receptors of actively dividing cells. the parvovirus genome is composed of unsegmented ssdna (cotmore and tattersall ; shade et al. ; zhi et al. ). complete virions of different species may contain negative-sense or both negative- and positive-sense dna in various proportions. there are two major reading frames: one encoding capsid proteins and the other coding for nonstructural proteins. noncoding sequences at the ′ and ′ ends include complementary sequences which result in the formation of hairpin structures that serve to regulate nucleic acid synthesis (deiss et al. ). various host-cell molecules mediate attachment and infection by parvoviruses.
erythrocyte p antigen is the major receptor for human parvovirus b . viruses are taken up by endocytosis, followed by transport into the host-cell nucleus. viral dna replication depends on host-cell polymerases during the s phase of host-cell replication. human infections are caused by parvovirus b and bocavirus (schildgen et al. ; vicente et al. ). the family parvoviridae is not assigned to an order; there are two subfamilies, the densovirinae and the parvovirinae. there are five genera in the subfamily parvovirinae: amdovirus ( species), bocavirus ( species including the type species bovine parvovirus ), dependovirus ( species, including adeno-associated viruses), erythrovirus ( species including the type species human parvovirus b ), and parvovirus ( species).
picornaviridae: infections of the respiratory tract and other organ systems by enteroviruses and parechoviruses are well described. enteroviruses were initially classified on the basis of clinical disease and epidemiology, suckling mouse inoculation, replication in cell culture, electron microscopic studies, physical properties, and the vast range of specific antigenic differences. the major subgroups were poliovirus, coxsackievirus (a and b), and echovirus. a characteristic of these viruses is their relative stability in acidic media and nonionic detergents. translation of the positive-sense ssrna genome is regulated by a ′ non-translated region (lindberg and polacek ) that is covalently linked to protein vpg (virion protein, genome linked); the short ′ noncoding region is polyadenylated. translation results in synthesis of a single polyprotein, which is cleaved into functional proteins by post-translational processing (nicklin et al. ; pallansch and roos ). there are three functional regions delimited by ribosomal entry sites. the p region codes for capsid proteins, while regions p and p code for nonstructural proteins.
capsid proteins vp , vp , and vp are exposed externally and account for the serological diversity of the viruses. with the advent of molecular phylogenetic analysis, the enteroviruses have been reclassified by the ictv. enteroviruses are in the order picornavirales, family picornaviridae, and genus enterovirus. the enteroviruses have been assigned to species, including human enterovirus a ( serotypes including coxsackieviruses and enteroviruses), human enterovirus b ( serotypes, including coxsackieviruses, echoviruses, and enteroviruses), human enterovirus c (the type species; serotypes including coxsackieviruses, all human polioviruses, and enteroviruses), and human enterovirus d ( enterovirus serotypes). in addition to the enteroviruses, the genus enterovirus also includes rhinovirus species, human rhinovirus a, b, and c, and more than serotypes. also within the family picornaviridae is the genus parechovirus. human parechovirus is the type species for the genus. there are parechovirus serotypes.
polyomaviridae: polyomaviruses may infect a variety of primate and non-primate vertebrate host species; the oncogenic potential of polyomaviruses is well established (white and khalili ). sialic acid and/or gangliosides on the host-cell membranes serve as receptors for attachment of human polyomavirus. though these molecules are widespread on human cells, there is a restricted tropism. respiratory epithelial cells and cells of lymphoid origin are the likely targets for initial infection, followed by hematogenous spread to target organs. the virions are non-enveloped; the icosahedral capsids ( - nm diameter) are composed of three proteins (vp , vp , and vp ), which enclose the circular dsdna genome (~ kbp). the genome is divided into three regions. the early region encodes proteins involved in viral processes that occur prior to dna replication, including t (tumor) antigens (benjamin ).
the late region encodes proteins involved in processes that primarily occur after dna replication. the early and late regions do not overlap and are transcribed from opposite strands of the viral dna and in opposite directions. a number of viral proteins are encoded as a result of alternative splicing and other post-transcriptional modifications of mrna. polyomaviruses are members of the family polyomaviridae, which is not assigned to an order. there is one genus, polyomavirus, and species, including the human pathogens bk polyomavirus and jc polyomavirus and simian virus 40 (type species). retroviridae: the retroviruses are a unique group of viruses, including human immunodeficiency virus types 1 and 2 and human t-cell leukemia virus type 1; they may infect a wide range of vertebrate host species. the human immunodeficiency viruses and human t-cell leukemia virus are able to cause disease in humans. these rna viruses have a unique replication cycle that uses a "reverse flow" of genetic information from rna to dna: viral rna is reverse transcribed and converted into a dsdna copy of the viral genome, which is integrated into the host-cell genome. integration of the proviral dna allows the viruses to establish persistent, presumably lifelong, infection. another consequence of insertion of the viral dna is functional mutation of the host genome at the site of insertion, which may alter the host gene or the regulation of a gene's expression; the oncogenic potential of retroviral infection is well described in humans and other vertebrate host species. the electron microscopic morphology of retroviruses shows a dense nucleocapsid core (cylindrical or cone shaped) (chrystie and almeida ; gelderblom et al. ). viruses are functionally diploid: the core includes two copies of the positive-sense ssrna genome, which are closely complexed with viral nucleoproteins.
the sequences of the two ssrna molecules may differ because of errors in transcription of new genomic ssrna molecules during replication. the core also includes several functional viral proteins, including reverse transcriptase, integrase, and protease. the core is surrounded by capsid proteins; the nucleocapsid is surrounded by viral matrix protein. complete virions are surrounded by an envelope derived from virus-modified host-cell cytoplasmic membranes; the envelope is studded with viral glycoproteins. the transmembrane protein extends from the matrix layer through the lipid bilayer to the external surface. the receptor-binding complex is anchored to the external portion of the transmembrane protein. mature virions are spherical (~ nm diameter). the ssrna genomes of retroviruses are similar to the host-cell mrna. a repeat sequence is present at both ends of the ssrna; the 5′ end is capped and the 3′ end polyadenylated. the order of sequences from the 5′ end to the 3′ end is cap → repeat sequence → unique sequence (u5) → the site for initiation of minus-strand dna synthesis → gag gene → pol gene → env gene → the initiation site for plus-strand dna synthesis → a unique sequence (u3) → repeat sequence → poly(a) sequence. after entry into the cytoplasm of a susceptible cell, double-stranded dna is synthesized by reverse transcription of both copies of the retroviral ssrna. the viral-encoded dna is transported into the nucleus, after which it is integrated into the host's genomic dna. the process of forming new virions is initiated by transcription of the proviral dna. the processed viral rna is exported into the cytoplasm and genes for precursor viral proteins are translated. virions are assembled at the cytoplasmic membrane and then released by budding; final virion maturation occurs by extracellular processing of viral proteins. a characteristic of retroviruses is the high mutation rate and marked genomic heterogeneity of isolates.
the major factors that contribute to this phenomenon include (1) error-prone reverse transcription, without proofreading correction, of the infecting virus genome; (2) recombination between the two genomic ssrna strands during reverse transcription; and (3) the very high-level production of progeny viruses from infected cells. retroviruses are not assigned to a taxonomic order. the family retroviridae has two subfamilies. the orthoretrovirinae includes six genera, including deltaretrovirus and lentivirus. htlv-1 is assigned the species name primate t-lymphotropic virus 1 in the deltaretrovirus genus. hiv-1 and hiv-2 are named human immunodeficiency virus 1 (type species) and human immunodeficiency virus 2, respectively, in the genus lentivirus (clavel et al. b). • human immunodeficiency viruses: the human immunodeficiency viruses have a conical core surrounded by an envelope derived from viral-modified host-cell cytoplasmic membrane. binding and entry of hiv into susceptible cells requires several specific receptors: cd4 (present on host helper t cells, cd4+ macrophages, and some dendritic cells) plus chemokine receptors, including ccr5 and cxcr4 (klatzman et al. ; simmons et al. ). the biological properties of hiv-1 isolates depend on the chemokine coreceptor(s) used by the virus (berger et al. ). isolates that exclusively use cxcr4 are t-cell tropic with rapid replication and syncytium formation. isolates that use ccr5 exclusively are tropic to macrophages, replicate more slowly, and do not induce syncytium formation. isolates that can use either cxcr4 or ccr5 have intermediate phenotypes. the gag, pro, pol, and env genes are translated from full-length mrna transcripts of the proviral genome: gag and env in one reading frame and pro and pol from a second reading frame. in addition, several genes are transcribed from overlapping or unique reading frames, including several spliced gene products.
human immunodeficiency virus types 1 and 2 evolved from simian viruses (gao et al. ; peeters et al. ; daniel et al. ; marx et al. ). these viruses may be distinguished by a number of characteristics, including clinical disease, specific antigens, and gene sequences (clavel et al. a). hiv-1 isolates may be further characterized into genetic groups and subtypes or clades (wainberg ). most hiv-1 isolates are in the m (main) group, which has a number of well-defined subgroups and recombinant forms with heterogeneous global distribution; clade b viruses are the predominant isolates in north america and europe (hemelaar et al. ; osmanov et al. ). group o (outlier) strains have mainly been isolated or acquired in western africa. group n (non-m, non-o) and recombinant forms are also most commonly isolated from western africa. • human t-cell leukemia virus type 1: mature htlv virions have a spherical core, symmetrically placed within the envelope. the host-cell receptor is glut-1, a surface glucose transport molecule (manel et al. ). the gag, pro, pol, and env genes are translated from full-length mrna transcripts of the proviral genome: gag in one reading frame, pro and env from a second, and pol from a third reading frame. in addition, several spliced genes are transcribed from overlapping reading frames. recent and continuing progress to develop and use standardized and widely accepted methods for biological and taxonomic classification of viral pathogens has resulted in improvement in the medical response to viral illnesses. at a very basic level, these systems allow clinicians and scientists to communicate effectively and ensure the comparability of data generated by clinical or basic scientific studies. further, accurate and standardized data are critical for understanding issues related to transmission, prevention, and treatment of viral illnesses.
establishing phylogenetic similarity to known viral pathogens may allow clinicians to anticipate the clinical behavior of new and emerging viral pathogens, as may be seen when virus mutation results in acquisition of new pathogenic mechanisms, like changes to antigens associated with evasion of the immune response of the host species or changes that allow a viral pathogen to jump from one species into new, susceptible species. as analytical tools improve, even more informative data relevant to clinical and pathologic characteristics of viral pathogens is anticipated. genomewide function conservation and phylogeny in the herpesviridae longrange rna-rna interactions circularize the dengue virus genome the v proteins of paramyxoviruses bind the ifn-inducible rna helicase, mda- , and inhibit its activation of the inf-beta promoter infl uenza viruses, chap . in: versalovik j (ed) manual of clinical microbiology dna sequence and expression of the b - epstein-barr virus genome expression of animal virus genomes rnas in the virion of kaposi's sarcoma-associated herpesvirus polyoma virus: old fi ndings and new challenges a new classifi cation for hiv- human herpesvirus the clinical importance of understanding the evolution of papillomaviruses real-time qualitative pcr for human adenovirus types from multiple specimen sources flavivirus genome organization, expression, and replication the morphology of human immunodefi ciency virus (hiv) by negative staining confi guration and terminal sequences of the simian varicella virus genome isolation of a new human retrovirus from west african patients with aids molecular cloning and polymorphism of the human immune defi ciency virus type cdna cloning and transcriptional mapping of nine polyadenylylated rnas encoded by the genome of human respiratory syncytial virus characterization and molecular cloning of a human parvovirus genome isolation of t-cell tropic htlv-iii-like retrovirus from macaques structure of the genome termini of 
varicella-zoster virus the complete dna sequence of varicella-zoster virus genetic content and evolution of adenoviruses the human cytomegalovirus genome revisited: comparison with the chimpanzee cytomegalovirus genome classifi cation of papillomaviruses cloning of the human parvovirus b genome and structural analysis of its palindromic termini the genome sequence of herpes simplex virus type et al ( ) transcriptional map of the measles virus genome functional profi ling of a human cytomegalovirus genome mutations in the c, d, and v open reading frames of human parainfl uenza virus type attenuate replication in rodents and primates henipaviruses, chap virus taxonomy: one step forward, two steps back varicella-zoster virus dna exists as two isomers importance of both the coding and the segment-specifi c noncoding regions of the infl uenza type a virus ns segment for its effi cient incorporation into virions origin of hiv- in the chimpanzee pan troglodytes troglodytes factors associated with fulminant liver failure during an outbreak among injection drug users with acute hepatitis b morphogenesis and morphology of hiv. structure-function relations the dna sequence of human herpesvirus- : structure, coding content, and genome evolution processing of genome ′ termini as a strategy of negative-strand rna viruses to avoid rig-idependent interferon induction respiratory syncytial virus and parainfl uenza virus infl uenza virus, chap global and regional distribution of hiv- genetic subtypes and recombinants in parainfl uenza viruses gene mapping of the putative structural region of the hepatitis c virus genome by in vitro processing analysis evolutionary history and phylogeography of human viruses what does virus evolution tell us about virus origins? 
association of human papillomavirus types and e proteins with p ultrastructure of hendra virus and nipah virus within cultured cells and host animals the international code of virus classifi cation and nomenclature nipah virus infection in bats (order chiroptera) in peninsular malaysia t-lymphocyte t molecule behaves as the receptor for human retrovirus lav paramyxovirus mrna editing, the "rule of six" and error catastrophe: a hypothesis proposal for a revised taxonomy of the family filoviridae: classifi cation, names of taxa and viruses, and virus abbreviations partitioning the genetic diversity of a virus family: approach and evaluation through a case study of picornaviruses toward genetics-based virus taxonomy: comparative analysis of a geneticsbased classifi cation and the taxonomy of picornaviruses taxonomy and classifi cation of viruses, chap . in: versalovic j (ed) manual of clinical microbiology hepatitis b: the virus and disease hepatitis b viral factors and clinical outcomes of chronic hepatitis b molecular analysis of the prototype coxsackievirus b genome global identifi cation of three major genotypes of varicella-zoster virus: longitudinal clustering and strategies for genotyping the ubiquitous glucose transporter glut- is a receptor for htlv the genome sequence of the sars-associated coronavirus isolation of a simian immunodefi ciency virus related to human immunodefi ciency virus type from a west african pet sooty mangabey structure and function of the infl uenza virus genome the application of molecular phylogenetics to the analysis of viral genome diversity and evolution topics in herpesvirus genomics and evolution bunyaviridae: bunyaviruses, phleboviruses, nairoviruses, and hantaviruses, chap site-specifi c inversion sequences of the herpes simplex virus genome: domain and structural features poliovirus polypeptide precursors: expression in vitro and processing by exogenous c and a proteinases complete nucleotide sequence of dengue type virus 
genome rna estimated global distribution and regional spread of hiv- genetic subtypes in the year enteroviruses: polioviruses, coxsackieviruses, echoviruses, and newer enteroviruses, chapter isolation and partial characterization of an hiv-related virus occurring naturally in chimpanzees in gabon arenaviruses, chap the size and conformation of kaposi's sarcoma-associated herpesvirus (human herpesvirus ) dna in infected cells and virions characterization of a novel coronavirus associated with severe acute respiratory syndrome nucleotide sequence of the kaposi sarcoma-associated herpesvirus (hhv ) human bocavirus: passenger or pathogen in acute respiratory tract infections? nucleotide sequence and genome organization of human parvovirus b isolated from the serum of a child during aplastic crisis cxcr as a functional coreceptor for human immunodeficiency virus type infection of primary macrophages the genome length of human parainfl uenza virus type follows the rule of six, and recombinant viruses recovered from non-poly-hexameric-length antigenomic cdnas contain a biased distribution of correcting mutations respiratory syncytial virus nonstructural proteins decrease levels of multiple members of the cellular interferon pathways infl uenza virus evolution, host adaptation, and pandemic formation mechanisms and enzymes involved in sars coronavirus genome expression conserved rna secondary structures in flaviviridae genomes analysis of the genomic sequence of a human metapneumovirus manual of clinical microbiology human bocavirus, a respiratory and enteric virus hiv- subtype distribution and the problem of drug resistance the exceptionally large genome of hendra virus: support for the creation of a new genus within the family parmyxoviridae polyomaviruses and human cancer: molecular mechanisms underlying patterns of tumerogenesis towards a natural system of organisms: proposal for the domains archaea, bacteria and eucarya a reevaluation of the higher taxonomy of viruses 
based on rna polymerases papillomavirus genome structure, expression, and post-transcriptional regulation construction and sequencing of an infectious clone of the human parvovirus b key: cord- -gfn aa authors: muse, spencer title: genomics and bioinformatics date: - - journal: introduction to biomedical engineering doi: . /b - - - - . -x sha: doc_id: cord_uid: gfn aa this chapter discusses the basic principles of molecular biology regarding genome science and describes the major types of data involved in genome projects, including technologies for collecting them. genome science is heavily driven by new technological advances that allow for rapid and inexpensive collection of various types of data. the emergence of genomic science has not simply provided a rich set of tools and data for studying molecular biology. it has been the catalyst for an astounding burst of interdisciplinary research, and it has challenged long-established hierarchies found in most institutions of higher learning. the next generation of biologists needs to be as comfortable at a computer workstation as they are at the lab bench. recognizing this fact, many universities have already reorganized their departments and their curricula to accommodate the demands of genomic science. the chapter discusses practical applications and uses of genomic data. for example, gene therapies that can repair genetic defects are in the foreseeable future. at the conclusion of this chapter, the reader will be able to: use key bioinformatics databases and web resources. in april , sequencing of all three billion nucleotides in the human genome was declared complete. this landmark of modern science brought with it high hopes for the understanding and treatment of human genetic disorders.
there is plenty of evidence to suggest that the hopes will become reality: human genetic diseases are now associated with known dna sequences, compared to the less than that were known at the initiation of the human genome project (hgp) in . the success of this project (it came in almost years ahead of time and % under budget, while at the same time providing more data than originally planned) depended on innovations in a variety of areas: breakthroughs in basic molecular biology to allow manipulation of dna and other compounds; improved engineering and manufacturing technology to produce equipment for reading the sequences of dna; advances in robotics and laboratory automation; development of statistical methods to interpret data from sequencing projects; and the creation of specialized computing hardware and software systems to circumvent massive computational barriers that faced genome scientists. clearly, the hgp served as an incubator for interdisciplinary research at both the basic and applied levels. the human was not the only organism whose genome was targeted during the genomic era. as of june , complete genomes were available for viruses, microbes, and eukaryotes ranging from the malaria parasite plasmodium falciparum to yeast, rice, and humans. continued advances in technology are necessary to accelerate the pace and to reduce the expense of data acquisition projects. improved computational and statistical methods are needed to interpret the mountains of data. the increase in the rate of data accumulation is outpacing the rate of increases in computer processor speed, stressing the importance of both applied and basic theoretical work in the mathematical and computational sciences. in this chapter, the key technologies that are being used to collect data in the laboratory, as well as some of the important mathematical techniques that are being used to analyze the data, are surveyed. applications to medicine are used as examples when appropriate.
understanding the applications of genomic technologies requires an understanding of three key sets of concepts: how genetic information is stored, how that information is processed, and how that information is transmitted from parent to offspring. in most organisms, the genetic information is stored in molecules of dna, deoxyribonucleic acid ( fig. . ) . some viruses maintain their genetic data in rna, but no emphasis will be placed on such exceptions. the size of genomes, measured in counts of nucleotides or base pairs, varies tremendously, and a curious observation is that genome size is only loosely associated with organismal complexity (table . ). most of the known functional units of genomes are called genes. for purposes of this chapter, a gene can be defined as a contiguous block of nucleotides operating for a single purpose. this definition is necessarily vague, for there are a number of types of genes, and even within a given type of gene, experts have difficulty agreeing on precisely where the beginning and ending boundaries of those genes lie. a structural gene is a gene that codes instructions for creating a protein ( fig. . ). a second category of genes with many members is the collection of rna genes. an rna gene does not contain protein information; instead, its function is determined by its ability to fold into a specific three-dimensional configuration, at which point it is able to interact with other molecules and play a part in a biochemical process. a common rna gene found in most forms of life is the trna gene illustrated in figure . . structural genes are the entities most scientists envision when the word ''gene'' is mentioned, and from this point on, the term gene will be used to mean ''structural gene'' unless specified otherwise. the number and variety of genes in organisms is a current topic of importance for genome scientists. gene number in organisms ranges from tiny ( in mycoplasma) to enormous ( , or more in plants). 
non-free-living organisms have even smaller gene numbers (hiv contains only nine). the number of genes in a typical human genome has been estimated to be about , , perhaps the single most surprising finding from the human genome project. this number was thought to be as large as , as recently as . the confusion over this number arose in part because there is not a ''one gene, one protein'' rule in humans, or indeed, in many eukaryotic organisms. instead, a single gene region can contain the information needed to produce multiple proteins. to understand this fact, the series of steps involved in creating a functional protein from the underlying dna sequence instructions must be understood. the central dogma of molecular biology states that genetic information is stored in dna, copied to rna, and then interpreted from the rna copy to form a functional protein ( fig. . ). the process of copying the genetic information in dna into an rna copy is known as transcription (see chapter ). the process is thought by many to be a remnant of an early rna world, in which the earliest life forms were based on rna genomes. it is at the level of transcription that gene expression is regulated, determining where and when a particular gene is turned on or off. the transcription of a gene occurs when an enzyme known as rna polymerase binds to the beginning of a gene and proceeds to create a molecule of rna that matches the dna in the genome. it is this molecule of messenger rna (mrna) that will serve as a template for producing a protein. however, it is necessary for organisms to regulate the expression of genes to avoid having all genes expressed in all cells at all times. transcription factors interact with either the genomic dna or the polymerase molecule to allow delicate control of the gene expression process.
a feedback loop is created whereby an environmental stimulus such as a drug leads to the production of a transcription factor, which triggers the expression of a gene. in addition to this example of a positive control mechanism, negative control is also possible. an emerging theme is that sets of genes are often coregulated by a single transcription factor or a small group of transcription factors. these sets of genes often share a short upstream dna sequence that serves as a binding site for the transcription factor.
figure . the transfer rna (trna) is an example of a non-protein-coding gene. its function is the result of the specific two- and three-dimensional structures formed by the rna sequence itself.
one of the earliest surprises of the genomic era was the discovery that many eukaryotic gene sequences are not contiguous, but are instead interrupted by dna sequences known as introns. as shown in figure . , introns are physically cut, or spliced, from the mrna sequence before the rna is converted into a protein. the presence of introns helps to explain why more proteins are produced in an organism than there are genes present. the process of alternative splicing allows for exons to be assembled in a combinatoric fashion, resulting in a multitude of potential proteins. for example, consider a gene sequence with exons e1, e2, and e3 interrupted by introns i1 and i2. if both introns are spliced, the resulting protein would be encoded as e1-e2-e3. however, it is also possible to splice the gene in a way that produces protein e1-e3, skipping exon e2. just as transcription factors regulate gene expression, there are factors that help to regulate alternative splicing. a common theme is to find a single gene that is spliced in different ways to produce isoforms that are expressed in specific tissues. the process of reading the template in an mrna molecule and using it to produce a protein is known as translation.
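The exon-skipping example and the translation step just defined can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the exon sequences are made up, the names `EXONS`, `isoforms`, `splice`, and `translate` are hypothetical, introns are assumed to be already removed, and only isoforms that keep the first and last exon are enumerated. The codon table is the standard genetic code, written with T in place of U for simplicity.

```python
from itertools import combinations, product

# Made-up exon sequences for a toy gene; names follow the e1/e2/e3 example.
EXONS = {"e1": "ATGGCT", "e2": "GAATTC", "e3": "AAGTAA"}

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def isoforms(exon_names):
    """Yield exon combinations that keep the first and last exon
    and skip any subset of the internal exons (an assumption of this sketch)."""
    first, *internal, last = exon_names
    for k in range(len(internal), -1, -1):
        for kept in combinations(internal, k):
            yield [first, *kept, last]


def splice(exon_names):
    """Join the chosen exons into one mature coding sequence."""
    return "".join(EXONS[name] for name in exon_names)


def translate(coding_seq):
    """Read the sequence three bases at a time until a stop codon ('*')."""
    protein = []
    for i in range(0, len(coding_seq) - 2, 3):
        amino_acid = CODON_TABLE[coding_seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    for names in isoforms(list(EXONS)):
        print("-".join(names), "->", translate(splice(names)))
```

With these toy sequences the full-length isoform e1-e2-e3 and the exon-skipping isoform e1-e3 yield two different short peptides from the same gene, which is the point of the example in the text.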
conceptually, translation is much simpler than the transcription and splicing processes. a structure known as a ribosome binds to the mrna molecule. the ribosome then moves along the rna in units of three nucleotides. each of these triplets, or codons, encodes one of 20 amino acids. at each codon the ribosome interacts with trnas to interpret a codon and add the proper amino acid to the growing chain before moving along to the next codon in the sequence (see chapter ). genome science is heavily driven by new technological advances that allow for the rapid and inexpensive collection of various types of data. it has been said that the field is data-driven rather than hypothesis-driven, a reflection of the tendency for researchers to collect large amounts of genomic data with the (realistic) expectation that subsequent data analyses, along with the experiments they suggest, will lead to better understanding of genetic processes. although the list of important biotechnologies changes on an almost daily basis, there are three prominent data types in today's environment: (1) genome sequences provide the starting point that allows scientists to begin understanding the genetic underpinnings of an organism; (2) measurements of gene expression levels facilitate studies of gene regulation, which, among other things, help us to understand how an organism's genome interacts with its environment; and (3) genetic polymorphisms are variations from individual to individual within species, and understanding how these variations correlate with phenotypes such as disease susceptibility is a crucial element of modern biomedical research. the basic principles for obtaining dna sequences have remained rather stable over the past few decades, although the specific technologies have evolved dramatically.
the most widely used sequencing techniques rely on attaching some sort of ''reporter'' to each nucleotide in a dna sequence, then measuring how quickly or how far the nucleotide migrates through a medium. the principles of sanger sequencing, originally developed in , are illustrated in figure . . dna sequences have an orientation. the 5' end of a sequence can be considered to be the left end, and the 3' end is on the right. sanger sequencing begins by creating all possible subsequences of the target sequence that begin at the same 5' nucleotide. a reporter, originally radioactive but now fluorescent, is attached to the final 3' nucleotide in each subsequence. by using a unique reporter for each of the four nucleotides, it is possible to identify the final 3' nucleotide in each of the subsequences. consider the task of sequencing the dna molecule aggt. there are four possible subsequences that begin with the 5' a: a, ag, agg, and aggt. the technology of sanger sequencing produces each of those four sequences and attaches the reporter to the final nucleotide. the subsequences are sorted from shortest to longest based on the rate at which they migrate through a medium. the shortest sequence would correspond to the subsequence a; its reporter tells us that the final nucleotide is an a. the second shortest subsequence is ag, with a final nucleotide of g. by arranging the subsequences in a ''ladder'' from shortest to longest, the complete target sequence can be found simply by reading off the final nucleotide of each subsequence. a series of new advances has allowed sanger sequencing to be applied in a high-throughput way, paving the way for sequencing of entire genomes, including that of the human. radioactive reporters have been replaced with safer and cheaper fluorescent dyes, and automatic laser-based systems now read the sequence of fluorescent-labeled nucleotides directly as they migrate.
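The ladder-reading logic of the aggt example can be written out as a minimal sketch. It models only the bookkeeping (prefix fragments, a reporter on the last base, sorting by length), not the chemistry; the function names are hypothetical, and sorting by length stands in for migration through the medium.

```python
import random


def sanger_fragments(target):
    """Return every prefix of the target (all share its 5' start),
    paired with the reporter for the prefix's final (3') nucleotide."""
    return [(target[:n], target[n - 1]) for n in range(1, len(target) + 1)]


def read_ladder(fragments):
    """Order fragments from shortest to longest, as migration would,
    and read off the reporter attached to each one."""
    ordered = sorted(fragments, key=lambda frag: len(frag[0]))
    return "".join(reporter for _, reporter in ordered)


if __name__ == "__main__":
    fragments = sanger_fragments("AGGT")
    random.Random(0).shuffle(fragments)  # fragments start out mixed together
    print(fragments)
    print(read_ladder(fragments))        # recovers the target sequence
```

The shuffle is there only to emphasize that the fragments are produced as a mixture; it is the size-based sorting that restores the order of the bases.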
early versions of sanger sequencing only allowed for reading a few hundred nucleotides at a time; modern sequencing devices can read sequences of nucleotides or more. perhaps most important has been the replacement of ''slab gel'' systems with capillary sequencers. the older system required much labor and a steady hand; capillary systems, in conjunction with the development of necessary robotic devices for manipulating samples, have allowed almost completely automated sequencing pipelines to be developed. not to be ignored in the series of technological advances is the development of automated base-calling algorithms. a laser reads the intensities of each of the four fluorescent reporter dyes as each nucleotide passes it. the resulting graph of those intensities is a chromatogram. statistical algorithms, including the landmark program phred, are able to accept chromatograms as input and output dna sequences with very high levels of accuracy, reducing the need for laborious human intervention. by assessing the relative levels of the four curves, the base-calling algorithms not only report the most likely nucleotide at each position, but they also provide an error probability for each site. a single state-of-the-art dna sequencing machine can currently produce upwards of one million nucleotides per day. large regions of dna are not sequenced in single pieces. instead, larger contigs of dna are fragmented into multiple, short, overlapping sequences. the emergence of shotgun sequencing (fig. . ), pioneered by dr. craig venter, has revolutionized approaches for obtaining complete genome sequences. 
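The idea of cutting DNA into short, overlapping sequences and rebuilding the original from matching ends can be illustrated with a greedy merge. This is a toy sketch, not the algorithm of any real assembler: the fragments are invented, the function names are hypothetical, and it assumes error-free reads from one strand, no fragment contained inside another, and no repeats long enough to create false overlaps.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if no overlap of at least min_len bases exists."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap.
    Returns the remaining contigs (a single one if coverage has no gaps)."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join: a gap in coverage
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]  # hypothetical reads
    print(greedy_assemble(reads))
```

The all-pairs comparison in the inner loops grows quadratically with the number of fragments, which hints at why aligning millions of fragments was long regarded as the daunting part of the shotgun approach.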
the fundamental approach to shotgun sequencing of a genome is simple: (1) create many identical copies of a genome; (2) randomly cut the genomes into millions of fragments, each short enough to be sequenced individually; (3) align the overlapping fragments by identifying matching nucleotides at the ends of fragments; and finally (4) read the complete genome sequence by following a gap-free path through the fragments. until venter's work, the idea of shotgun sequencing was considered unfeasible for a variety of reasons. perhaps most daunting was the computational task of aligning the millions of fragments generated in the shotgun process. specialized hardware systems and associated algorithms were developed to handle these problems. following in the footsteps of high-throughput genome sequencing came technology that allowed scientists to survey the relative abundance of thousands of individual gene products. these technologies are, in essence, a modern high-throughput replacement of the northern blot procedure. for each member in a collection of several thousand genes, the assays provide a quantitative estimate of the number of mrna copies of each gene being produced in a particular tissue. two technologies, cdna and oligonucleotide microarrays, currently dominate the field, and they have opened the door to many exploratory analyses that were previously impossible. as a first example, consider taking two samples of cells from an individual cancer patient: one sample from a tumor and one from normal tissue. a microarray experiment makes it possible to identify the set of genes that are produced at different levels in the two tissue types. it is likely that this set of differentially expressed genes contains many genes involved in biological processes related to tumor formation and proliferation ( fig. . ). a second common type of study is a time course experiment. microarray data is collected from the same tissues at periodic intervals over some length of time.
for instance, gene expression levels may be measured in -hour increments following the administration of a drug. for each gene, the change in gene expression level can be plotted against time (fig. . ). groups of coregulated genes will be identified as having points in time where they all experience either an increase or decrease in expression levels. a likely cause for this behavior is that all genes in the coregulated set are governed by a single transcription factor.
a final important medical application of microarray technologies involves diagnosis. suppose that a physician obtains microarray data from tumor cells of a patient. the data, consisting of the relative levels of gene expression for a suite of many genes, can be compared to similar data collected from tumors of known types. if the patient's gene expression profile matches the profile of one of the reference samples, the patient can be diagnosed with that tumor type. the advent of microarray techniques has rapidly improved the accuracy of this type of diagnosis in a variety of cancers.
figure . shotgun sequencing of genomes or other large fragments of dna proceeds by cutting the original dna into many smaller segments, sequencing the smaller fragments, and assembling the sequenced fragments by identifying overlapping ends.
cdna microarrays were the first, and are still the most widely used, form of high-throughput gene expression methods. the procedure begins by attaching the dna sequences of thousands of genes onto a microscope slide in a pattern of spots, with each spot containing only dna sequences of a single gene. a variety of technologies have emerged for creating such slides, ranging from simple pin spotting devices to technologies using laser jet printing techniques. the rna of expressed genes is next collected from the target cell population. through the process of reverse transcription, a cdna version of each rna is created.
a cdna molecule is complementary to the genomic dna sequence in the sense that complementary base pairs will physically bind to one another. for example, a cdna reading gttac could physically bind to the genomic dna sequence caatg. during the process of creating the cdna collection, each cdna is labeled with a fluorescent dye. the collection of labeled cdnas is poured over the microscope slide and its set of attached dna molecules. the cdnas that match a dna on the slide physically bind to their mates, and unbound cdnas are washed from the slide. finally, the number of bound molecules at each spot (gene) can be read by measuring the fluorescence level at each spot. highly expressed genes will create more rna, which results in more labeled cdnas binding to those spots. a more common variant on the basic cdna approach is illustrated in figure . . in this experiment, rna from two different tissues or individuals is collected, labeled with two different dyes, and competitively hybridized on a single slide. the relative abundance of the two dyes allows the scientist to state, for instance, that a particular gene is expressed fivefold more in one tissue than in the other. oligonucleotide arrays take a slightly different approach to assaying the relative abundance of rna sequences. instead of attaching full-length dnas to a slide, oligonucleotide systems make use of short oligonucleotides chosen to be specific to individual genes. for each gene included in the array, approximately to different oligonucleotides of length - nucleotides are designed and printed onto a chip. the use of multiple oligonucleotides for each gene helps to reduce the effects of a variety of potential errors. fluorescently labeled rna (rather than cdna) is collected from the target tissue and hybridized against the oligonucleotide array. one limitation of the oligonucleotide approach is that only a single sample can be assayed on a single chip; competitive hybridization is not possible.
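two ideas from the cdna discussion above lend themselves to a short sketch: base-pair complementarity (the gttac / caatg example) and the comparison of two dye signals in a competitive hybridization. the function names and the intensity values below are hypothetical, and reporting the comparison as a log2 ratio is a common convention assumed here rather than something stated in the text.

```python
import math

# Watson-Crick pairing partners for DNA bases
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def complement(seq: str) -> str:
    """Return the position-by-position complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in seq.lower())

def can_pair(cdna: str, target: str) -> bool:
    """True if every base of cdna pairs with the base at the same position of target.

    Strand orientation is ignored, matching the way the example is written in the text.
    """
    return complement(cdna) == target.lower()

def log2_ratio(signal_a: float, signal_b: float) -> float:
    """Log2 of the ratio of two background-corrected dye intensities for one spot."""
    return math.log2(signal_a / signal_b)

if __name__ == "__main__":
    print(complement("gttac"), can_pair("gttac", "caatg"))
    # hypothetical intensities: dye A is five times brighter than dye B at this spot
    ratio = log2_ratio(500.0, 100.0)
    print(ratio, 2 ** ratio)
```

working on the log scale makes a fivefold increase and a fivefold decrease symmetric about zero, which is one reason this form of the ratio is often preferred.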
although oligonucleotide and cdna approaches to assaying gene expression rely on the same basic principles, each has its own advantages and disadvantages. as already noted, competitive hybridization is currently only possible in cdna systems. the design of oligonucleotide arrays requires that the sequences of genes for the chip are already available. the design phase is very expensive, and oligonucleotide systems are only available for commercially important and model organisms. in contrast, cdna arrays can be developed fairly quickly even in organisms without sequenced genomes. in their favor, oligonucleotide arrays allow for more genes to be spotted in a given area (thus allowing more measurements to be made on a single chip) and tend to offer higher repeatability of measurements. both of these facts reduce the overall level of experimental error in oligonucleotide arrays relative to cdna microarrays, although at a higher per-observation cost. because of the trade-off between obtaining many cheap noisy measurements versus a smaller number of more precise but expensive measurements, it is not clear that either technology has an obvious cost advantage.
figure . a cdna microarray slide is created by ( ) attaching dna to spots on a glass slide, ( ) collecting expressed rna sequences (expressed sequence tags, ests) from tissue samples, ( ) converting the rna to dna and labeling the molecules with fluorescent dyes, ( ) hybridizing the labeled dna molecules to the dna bound to the slide, and ( ) extracting the quantity of each expressed sequence by measuring the fluorescence levels of the dyes.
both techniques share the same major disadvantage: they measure only rna levels. these measurements are used as surrogates for the much more desirable and useful quantities of the amount of protein produced for each gene. it appears that rna levels are correlated with protein levels, but the extent and strength of this relationship is not understood well.
the near future promises a growing role for protein microarray systems, which are currently seeing limited use because of their very high costs. the ''final draft'' of the human genome was announced in april . it included roughly . billion nucleotides, with some , to , genes spread across pairs of chromosomes. the next phase of major data acquisition on the human genome is to discover how differences, both large and small, from individual to individual, result in variation at the phenotypic level. toward this end, a major effort has been made to find and document genetic polymorphisms. polymorphisms have long been important to studies of genetics. variations of the banding patterns in polytene chromosomes, for instance, have been studied for many decades. allozyme assays, based on differences in the overall charge of amino acid sequences, were popular in the s. most modern studies of genetic polymorphisms, though, focus on identifying variation at the individual nucleotide level. the international snp consortium (http://snp.cshl.org) is a collaboration of public and private organizations that discovered and characterized approximately . million single nucleotide polymorphisms (snps) in human populations. in medicine, the expectation is that knowledge of these individual nucleotide variants will accelerate the description of genetic diseases and the drug development process. pharmaceutical companies are optimistic that surveys of variation will be of use for selecting the proper drug for individual patients and for predicting likely side effects on an individual-to-individual basis. most snps (pronounced ''snips'') are the result of a mutation from one nucleotide to another, whereas a minority are insertions and deletions of individual nucleotides. surveys of snps have demonstrated that their frequencies vary from organism to organism and from region to region within organisms. in the human genome, a snp is found about every to nucleotides. 
however, the frequency of snps is much higher in noncoding regions of the genome than in coding regions, the result of natural selection eliminating deleterious alleles from the population. furthermore, synonymous or silent polymorphisms, which do not result in a change of the encoded amino acids, are more frequent than nonsynonymous or replacement polymorphisms. the fields of population genetics and molecular evolution provide many empirical surveys of snp variation, along with mathematical theory, for analyzing and predicting the frequencies of snps under a variety of biologically important settings. simple sequence repeats (ssrs) consist of a moderate ( - ) number of tandemly repeated copies of the same short sequence of to nucleotides. ssrs are an important class of polymorphisms because of their high mutation rates, which lead to ssr loci being highly variable in most populations. this high level of variability makes ssr markers ideal for work in individual identification. ssrs are the markers typically employed for dna fingerprinting in the forensics setting. in human populations, an ssr locus usually has or more alleles and a per generation mutation rate of . . the fbi uses a set of tetranucleotide repeats for identification purposes, and experts claim that no two unrelated individuals have the exact same collection of alleles at all of those loci. as the technology for collecting genomic data has improved, so has the need for new methods for management and analysis of the massive amounts of accumulated data. the term bioinformatics has evolved to include the mathematical, statistical, and computational analysis of genomic data. work in bioinformatics ranges from database design to systems engineering to artificial intelligence to applied mathematics and statistics, all with an underlying focus on genomic science. a variety of bioinformatics topics may be illustrated using the core technologies described in the preceding section. 
it is necessary to carry out sequence alignments in order to assemble sequence fragments. all of these sequences, along with the vital information about their sources, functions, and so on, must be stored in databases, which must be readily available to users in a variety of locations. once a sequence has been obtained, it is necessary to annotate its function. one of the most fundamental annotation tasks is that of computational gene finding, in which a genome or chromosome sequence is input to an algorithm that subsequently outputs the predicted location of genes. a gene sequence, whether predicted or experimentally determined, must have its function predicted, and many bioinformatics tools are available for this task. once microarray data are available, it is necessary to identify subsets of coregulated genes and to identify genes that are differentially expressed between two or more treatments or tissue types. polymorphism data from snps are used to search for correlations with, for example, the presence or absence of a disease in family pedigrees. these questions are all of fundamental importance and draw on many different fields. by necessity, bioinformatics is a highly multidisciplinary field. genome projects involve far-reaching collaborations among many researchers in many fields around the globe, and it is critical that the resulting data be easily available both to project members and to the general scientific community. in light of this requirement, a number of key central data repositories have emerged. in addition to providing storage and retrieval of gene sequences, several of these databases also offer advanced sequence analysis methods and powerful visualization tools. in the united states, the primary public genomics resource is that of the national center for biotechnology information (ncbi). the ncbi website (http://www.ncbi.nlm.nih.gov) provides a seemingly endless collection of data and data analysis tools.
perhaps the most important element of the ncbi collection is the genbank database of dna and rna sequences. ncbi provides a variety of tools for searching genbank, and elements in genbank are linked to other databases, both within and outside of ncbi. figure . shows some results from a simple query of the genbank nucleotide database. genbank data files contain a wealth of information. figure . shows a simple genbank file for a prion sequence from duck. the accession number, af , is the unique identifier for this entry. the genbank file contains a dna sequence of nucleotides, its predicted amino acid sequence, and a citation to the chinese laboratory that obtained the data. the ''links'' icon in the upper right provides access to related information found in other databases. it is essential for those working in genomics or bioinformatics to become familiar with genbank and the content of genbank files.
figure . the result of a simple query of the genbank database at ncbi. this query found entries in the genbank nucleotide database containing the term ''tyrosine kinase.'' each entry can be clicked to find additional information.
figure . a simple genbank file containing the dna sequence for a prion protein gene.
ncbi is also the home of the blast database searching tool. blast uses algorithms for sequence alignment (described later in this chapter) to find sequences in genbank that are similar to a query sequence provided by the user. to illustrate the use of blast, consider a study by professor eske willerslev at the university of copenhagen. willerslev and his colleagues collected samples from siberian permafrost that included a variety of preserved plant and animal material estimated to be , - , years old. they were able to extract short dna sequences from the rbcl gene. these short sequences were used as input to the blast algorithm, which reported a list of similar sequences.
it is likely that the most similar sequences come from close relatives of the organisms that provided the ancient dna. the european bioinformatics institute (ebi, http://www.ebi.ac.uk) is the european ''equivalent'' of ncbi. users who explore the ebi website will find much of the same type of functionality as provided by ncbi. of particular note is the ensembl project (http://www.ensembl.org), a joint venture between ebi and the sanger institute. ensembl has particularly nice tools for exploring genome project data through its genome browser. figure . shows a portion of the display for a region of human chromosome . ensembl provides comparisons to other completed genome sequences (rat, mouse, and chimpanzee), along with annotations of the locations of genes and other interesting features. most of the items in the display are clickable and provide links to more detailed information on each display component. many other databases and web resources play important roles in the day-to-day working of genome scientists. table . includes a selection of these resources, along with short descriptions of their unique features. the most fundamental computational algorithm in bioinformatics is that of pairwise sequence alignment. not only is it of immediate practical value, but the underlying dynamic programming algorithm also serves as a conceptual framework for many other important bioinformatics techniques. the goal of sequence alignment is to accept as input two or more dna, rna, or amino acid sequences; identify the regions of the sequences that are similar to one another according to some measure; and output the sequences with the similar positions aligned in columns. an alignment of six sequences from hiv strains is shown in figure . . sequence alignments have numerous uses. alignments of pairs of sequences help us to determine whether or not they have the same or similar functions. 
regions of alignments with little sequence variation likely correspond to important structural or functional regions of protein coding genes. by studying patterns of similarity in an alignment of genes from several species, it is possible to infer the evolutionary history of the species, and even to reconstruct dna or amino acid sequences that were present in the ancestral organisms. many methods for annotation, including assigning protein function and identifying transcription factor binding sites, rely on multiple sequence alignments as input. to illustrate the principles underlying sequence alignment, consider the special case of aligning two dna sequences. if the two sequences are similar, it is most likely because they have evolved from a common ancestral sequence at some time in the past. as illustrated in figure . , the sequences differ from the ancestral sequence and from each other because of past mutations. most mutations fall into one of two classes: nucleotide substitutions, which result in these two sequences being different at the location of the mutation (fig. . a), and insertions or deletions of short sequences (fig. . b). the term indel is often used to denote an insertion or deletion mutation. figure . b shows that indels lead to one sequence having nucleotides present at certain positions, whereas the second sequence has no nucleotides at those positions. to align two sequences without error, it would be necessary to have knowledge of the entire collection of mutations in the history of the two sequences. since this information is not available, it is necessary to rely on computational algorithms for reconstructing the likely locations of the various mutation events. a score function is chosen to evaluate alignment quality, and the algorithms attempt to find the pairwise alignment that has the highest numerical score among all possible alignments. consider aligning the two short sequences cagg and cga.
it can be shown that there are possible ways to align these two sequences, several of which are shown in figure . . how does one determine which of the possibilities is best? alignments (a) and (b) each have two positions with matching nucleotides; however, alignment (b) includes three columns with indels, whereas (a) has only one. on the other hand, alignment (a) has one mismatch to (b)'s zero. there is no definitive answer to the question of which alignment is best; however, it makes sense that ''good'' alignments will tend to have more matches and fewer mismatches and indels. it is possible to quantify that intuition by invoking a scoring scheme in which each column receives a score, s_i, according to the formula
$$ s_i = \begin{cases} m, & \text{the bases at column } i \text{ match} \\ d, & \text{the bases at column } i \text{ do not match} \\ i, & \text{there is an indel at column } i \end{cases} $$
using this scheme with match score m = , mismatch score d = - , and indel score i = - , the alignment in figure . a would receive a score of - + - = . similarly, the alignment in figure . b has a score of - - + - = . the remaining alignments in figure . have scores of , , , , - , and , respectively. alignment (a) is considered best under the standards of this scoring scheme, and, in fact, it has the best score of all possible alignments. this example suggests an algorithm for finding the best scoring alignment of any two sequences: enumerate all possible alignments, calculate the score for each, and select the alignment with the highest score. unfortunately, it turns out that this approach is not practical for real data. it can be shown that the number of possible alignments of sequences of length n is approximately $2^{2n}/\sqrt{\pi n}$ when n is large. even for a pair of short sequences of length , the number of alignments is , orders of magnitude larger than avogadro's number!
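the column scoring scheme above, and the reason brute-force enumeration fails, can be made concrete with a short sketch. the score values used here (match 1, mismatch -1, indel -2) are illustrative assumptions, not the chapter's values. the dynamic programming routine is the standard needleman-wunsch recurrence for global alignment, shown as one way to find the best score without enumeration. the count of alignments is taken to be the central binomial coefficient c(2n, n), an assumption chosen because its large-n approximation is 2^(2n)/sqrt(pi n).

```python
from math import comb, pi, sqrt

# illustrative scores only (assumptions); not the values used in the chapter
MATCH, MISMATCH, INDEL = 1, -1, -2

def column_score(x, y, m=MATCH, d=MISMATCH, i=INDEL):
    """Score one alignment column; '-' marks an indel."""
    if x == "-" or y == "-":
        return i
    return m if x == y else d

def alignment_score(row1, row2, m=MATCH, d=MISMATCH, i=INDEL):
    """Sum the column scores of a given pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    return sum(column_score(x, y, m, d, i) for x, y in zip(row1, row2))

def needleman_wunsch_score(s, t, m=MATCH, d=MISMATCH, i=INDEL):
    """Best global alignment score of s and t, in O(len(s) * len(t)) time."""
    prev = [j * i for j in range(len(t) + 1)]      # row 0: t prefix against indels
    for a in range(1, len(s) + 1):
        cur = [a * i]                              # column 0: s prefix against indels
        for b in range(1, len(t) + 1):
            diag = prev[b - 1] + (m if s[a - 1] == t[b - 1] else d)
            cur.append(max(diag, prev[b] + i, cur[b - 1] + i))
        prev = cur
    return prev[-1]

def central_binomial(n):
    """C(2n, n), used here as the assumed count of alignments of two length-n sequences."""
    return comb(2 * n, n)

def stirling_estimate(n):
    """Large-n approximation 2^(2n) / sqrt(pi * n) of C(2n, n)."""
    return 4 ** n / sqrt(pi * n)

if __name__ == "__main__":
    # one alignment of cagg and cga with two matches, one mismatch, and one indel
    print(alignment_score("cagg", "c-ga"))
    print(needleman_wunsch_score("cagg", "cga"))
    print(central_binomial(10), round(stirling_estimate(10)))
```

the dynamic programming table has only (len(s) + 1) x (len(t) + 1) cells, so its cost grows with the product of the sequence lengths, whereas the number of candidate alignments grows exponentially.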
techniques such as the needleman-wunsch and smith-waterman algorithms, which allow for computationally efficient identifications of the optimal alignments, are important practical and theoretical components of bioinformatics. conceptually, the task of aligning three or more sequences is essentially the same as that of aligning pairs of sequences. the computational task, however, becomes enormously more complex, growing exponentially with the number of sequences to be aligned. no practically useful solutions have been found, and the problem has been shown to belong to a class of fundamentally hard computational problems said to be np-complete. in addition to the increased computation, there is one important new concept that arises when shifting from pairwise alignment to multiple alignment. scoring columns in the pairwise case was simple; that is not the case for multiple sequences. complications arise because the evolutionary tree relating the sequences to be aligned is typically unknown, which makes assigning biologically plausible scores difficult. this problem is often ignored, and columns are scored using a sum of pairs scoring scheme in which the score for a column is the sum of all possible pairwise scores. for example, the score for a column containing the three nucleotides cgg, again using the scores m = , d = - , and i = - , is - - + = . other algorithms, such as the popular clustalw program, use an approach known as progressive alignment to circumvent this issue. almost all widely used methods for finding sequence alignments rely on a scoring scheme similar to the one used in the preceding paragraphs. clearly, this formula has very little biological basis. furthermore, how does one select the scores for matches, mismatches, and indels? considerable work has addressed these issues with varying degrees of success.
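as a concrete check of the sum of pairs rule described earlier in this passage, the sketch below scores a single column of a multiple alignment by adding up the scores of every pair of characters in it. the score values (match 1, mismatch -1, indel -2) are illustrative assumptions, and scoring a pair of two indels as zero is a common convention assumed here, not something the text specifies.

```python
from itertools import combinations

def pair_score(x, y, m=1, d=-1, i=-2):
    """Score one pair of characters from an alignment column ('-' marks an indel)."""
    if x == "-" and y == "-":
        return 0          # assumed convention: two indels facing each other cost nothing
    if x == "-" or y == "-":
        return i
    return m if x == y else d

def sum_of_pairs(column, m=1, d=-1, i=-2):
    """Sum the pairwise scores over all unordered pairs of characters in a column."""
    return sum(pair_score(x, y, m, d, i) for x, y in combinations(column, 2))

if __name__ == "__main__":
    # the column cgg has pairs c/g, c/g, g/g: two mismatches and one match
    print(sum_of_pairs("cgg"))
```

a column of k sequences contributes k(k-1)/2 pairs, so the cost of scoring a column grows quadratically with the number of aligned sequences.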
the most important improvement is the replacement of the simple match and mismatch scores with scoring matrices obtained from empirical collections of amino acid sequences. rather than assigning, for example, all mismatches a value of - , the blosum and pam matrices provide a different penalty for each possible pair of amino acids. since these penalties are derived from actual data, mismatches between chemically similar amino acids such as leucine and isoleucine receive smaller penalties than mismatches between chemically different ones. a second area of improvement is in the assignment of indel penalties. the alignments in figure . b and . e each have a total of three sites with indels. however, the indels at sites and of figure . b could have been the result of a single insertion or deletion event. recognizing this fact, it is common to use separate open and extension penalties for indels. if the open penalty is o = - and the extension penalty is e = - , then a series of three consecutive indels would receive a score of - - - = - . the most common bioinformatics task is searching a molecular database such as genbank for sequences that are similar to a query sequence of interest. for example, the query sequence may be a gene sequence from a newly isolated viral outbreak, and the search task may be to find out if any known viral sequences are similar to this new one. it turns out that this type of database searching is a special case of pairwise sequence alignment. essentially, all sequences in the database are concatenated end to end, and this new ''supersequence'' is aligned to our query sequence. since the supersequence is many, many times longer than the query sequence, the resulting pairwise alignment would consist mostly of gaps and provide relatively little useful information. a more useful procedure is to ask if the supersequence contains a short subsequence that aligns well with the query sequence.
this problem is known as local sequence alignment, and it can be solved with algorithms very similar to those for the basic alignment problem. the smith-waterman algorithm is guaranteed to find the best such local alignment. even though the smith-waterman algorithm provides a solution to the database search problem for many applications, it is still too slow for high-volume installations such as ncbi, where multiple query requests are handled every second. for these settings, a variety of heuristic searches have been developed. these tools, including blast and fasta, are not guaranteed to find the best local alignment, but they usually do and are, therefore, valuable research tools. it is no exaggeration to claim that blast (http://www.ncbi.nlm.nih.gov/blast) is one of the most influential research tools of any field in the history of science. the algorithm has been cited in upwards of , studies to date. in addition to providing a fast and effective method for database searching, the use of blast spread rapidly because of the statistical theory developed to accompany it. when searching a very large database with a short sequence, it is very likely that one or more instances of the query sequence will be found in the database simply by chance alone. when blast reports a list of database matches, it sorts them according to an e-value, which is the number of matches of that quality expected to be found by chance. an e-value of . indicates a match that would only be found once every searches, and it suggests that the match is biologically interesting. on the other hand, an e-value of . implies that two matches of the observed quality would be found every search simply by chance, and therefore, the match is probably not of interest. consider an effort to identify the virus responsible for sars. the sequence of the protease gene, a ubiquitous viral protein, was isolated and stored under genbank accession number ay . 
if that sequence is submitted to the tblastx variant of blast at ncbi, the best matching non-sars entries in genbank (remember that the sars entries would not have been in the database at the time) all belong to coronaviruses, providing strong evidence that sars is caused by a coronavirus. this type of comparative genomic approach has become invaluable in the field of epidemiology. much work in genomic science and bioinformatics focuses on problems of identifying biologically important regions of very long dna sequences, such as chromosomes or genomes. many important regions such as genes or binding sites come in the form of relatively short contiguous blocks of dna. hidden markov models (hmms) are a class of mathematical tools that excel at identifying this type of feature. historically, hmms have been used in problems as diverse as finding sources of pollution in rivers, formal mathematical descriptions of written languages, and speech recognition, so there is a rich body of existing theory. predictably, many successful applications of hmms to new problems in genomic science have been seen in recent years. hmms have proven to be excellent tools for identifying genes in newly sequenced genomes, predicting the functional class of proteins, finding boundaries between introns and exons, and predicting the higher-order structure of protein and rna sequences. to introduce the concept of an hmm in the context of a dna sequence, consider the phenomenon of isochores, regions of dna with base frequencies unique from neighboring regions. data from the human genome demonstrate that regions of a million or more bases have g+c content varying from % to %, a much higher range than one would expect to see if base composition were homogeneous across the entire genome. a simple model of the genome assigns each nucleotide to one of three possible classes (fig. . a) : a high g+c class (h), a low g+c class (l), or a normal g+c class (n). 
in the normal class, each of the four bases a, c, g, and t is used with equal frequency ( %). in the high g+c class, the frequencies of the four bases are % a, % c, % g, and % t, and in the low g+c class, the frequencies are % a, % c, % g, and % t. in the parlance of hmms, these three classes are called hidden states, since they are not observed directly. instead, the emitted characters a, c, g, and t are the observations. thus, this simple model of a genome consists of successive blocks of nucleotides from each of the three classes (fig. . b). the formal mathematical details of hmms will not be discussed, but it is useful to understand the basic components of the models (fig. . ). each hidden state in an hmm is able to emit characters, but the emission probabilities vary among hidden states. the model must also describe the pattern of hidden states, and the transition probabilities determine both the expected lengths of blocks of a single hidden state and the likelihood of one hidden state following another (e.g., is it likely for a block of high g+c to follow a block of low g+c?). the transition probabilities play important roles in applications such as gene finding. what is the chance of seeing a block of high g+c nucleotides shorter than ? these types of questions will be addressed in the examples discussed in the next section.
figure . an hmm for the states in fig. . . transition probabilities govern the chance that one hidden state follows another. for example, an n state is followed by another n state % of the time, by an l state % of the time, and by an h state % of the time. emission probabilities control the frequency of the four nucleotides found at each type of hidden state. in the hidden state l, there is % a, % c, % g, and % t.
the task of gene prediction is conceptually simple to describe: given a very long sequence of dna, identify the locations of genes. unfortunately, the solution of the problem is not quite as simple. as a first pass, one might simply find all pairs of start (atg) and stop (tag, tga, taa) codons. blocks of sequence longer than, say, nucleotides that are flanked by start and stop codons and that have lengths in multiples of three are likely to be protein coding genes. although this simple method will be likely to find many genes, it will probably have a high false positive rate (incorrectly predict that a sequence is a gene), and it will certainly have a high false negative rate (fail to predict real genes). for instance, the method fails to consider the possibility of introns, and it is unable to predict short genes. gene finding algorithms rely on a variety of additional information to make predictions, including the known structure of genes, the distribution of lengths of introns and exons in known genes, and the consensus sequences of known regulatory sequences. hmms turn out to be exceptionally well-suited for gene finding, and the basic structure of a simple gene finding hmm is shown in figure . . note that the hmm includes hidden states for promoter regions, the start and stop codons, exons, introns, and the noncoding dna falling between different genes. also note that not all hidden states are connected to one another. this fact reflects an understanding of gene and genome structure. the sequence of states start -intron -stop -exon -promoter is not biologically possible, whereas the series noncoding -promoter -start -exon -intron -exon -intron -exon -stop -noncoding is. good hmms incorporate this type of knowledge extensively. to put the hmm of figure . to use, the model must first be trained. the training step involves taking existing sequences of known genes and estimating all of the transition and emission probabilities for each of the model's hidden states. for example, if a training data set included introns, the observed frequency of c in those intron sequences could be used to come up with the emission probability for c in the hidden state intron.
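the emission-estimation step just described, and the way a trained hmm assigns hidden states to a sequence, can be sketched in a few lines. this is a minimal illustration, not the chapter's model: the three-state layout follows the h/l/n isochore example, but every probability below is an assumed value, the pseudocount is an added smoothing choice, and the decoding routine is the standard viterbi algorithm, which the text does not name.

```python
import math
from collections import Counter

def estimate_emissions(training_seqs, alphabet="acgt", pseudocount=1.0):
    """Estimate emission probabilities for one hidden state from example sequences."""
    counts = Counter()
    for seq in training_seqs:
        counts.update(ch for ch in seq.lower() if ch in alphabet)
    total = sum(counts[ch] for ch in alphabet) + pseudocount * len(alphabet)
    return {ch: (counts[ch] + pseudocount) / total for ch in alphabet}

def viterbi(obs, states, start, trans, emit):
    """Most probable hidden-state path for obs, computed in log space."""
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []                                   # back[t][s] = best predecessor of s
    for ch in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = score[best] + math.log(trans[best][s]) + math.log(emit[s][ch])
            pointers[s] = best
        score = new_score
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):             # trace the best path backwards
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

# assumed, illustrative parameters for the high (H), low (L), and normal (N) g+c states
STATES = "HLN"
START = {s: 1 / 3 for s in STATES}
TRANS = {p: {s: (0.9 if p == s else 0.05) for s in STATES} for p in STATES}
EMIT = {
    "H": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
    "L": {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4},
    "N": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
}

if __name__ == "__main__":
    print(estimate_emissions(["ccgcgg", "gcgcca"]))
    print(viterbi("gcgcgcgcgc" + "atatatatat", STATES, START, TRANS, EMIT))
```

the large self-transition probability is what makes the decoded path fall into long blocks: switching states is penalized, so the model changes state only when the emitted bases favor it strongly enough.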
the average lengths of introns and exons would be used to estimate the transition probabilities to and from the exon and intron hidden states. once the training step is complete, the hmm machinery can be used to predict the locations of genes in a long sequence of dna, along with their intron/exon boundaries, promoter sites, and so forth. gene finding algorithms in actual use are much more complex than the one shown in figure . , but they retain the same basic structure. the performance of gene finders continues to improve as more genomes are studied and the quality of the underlying hmms is improved. in bacteria, modern gene finding algorithms are rarely incorrect. upwards of % of the predicted genes are subsequently found to be actual genes, and only - % of true genes are missed by the algorithms. the situation is not as rosy for eukaryotic gene prediction, however. eukaryotic genomes are much larger, and the gene structure is more complex (most notably, eukaryotic genes have introns). the effectiveness of gene finding algorithms is usually measured in terms of sensitivity and specificity. if these quantities are measured on a per nucleotide basis, an algorithm's sensitivity, s_n, is defined to be the percentage of nucleotides in real genes that are actually predicted to be in genes. the specificity, s_p, is the percentage of nucleotides predicted to be in genes that truly are in genes. good gene predictors have high sensitivity and high specificity. the best eukaryotic gene finders today have sensitivities and specificities around % at the individual nucleotide level. if the quantities are measured at the level of entire exons (e.g., did the algorithm correctly predict the location of the entire exon or not?), the values drop to around %. an emerging and powerful way to predict the location of genes uses comparative genomics.
the entire human genome sequence is now available, and the locations of tens of thousands of genes are known. suppose that a laboratory now sequences the genome of the cheetah. since humans and cheetahs are both mammals, they should have reasonably similar genomes. in particular, most of the gene sequences should be quite similar. gene prediction can proceed by doing a pairwise sequence alignment of the two genomes and then predicting that positions in the cheetah genome corresponding to locations of known human genes are also genes in the cheetah. this approach is remarkably effective, although it will obviously miss genes that are unique to one species or the other. the degree of relatedness of the two organisms also has a major impact on the utility of this approach. the human genome could be used to predict genes in the gorilla genome much better than it could be used to predict genes in the sunflower or paramecium genomes. in addition to hmms and comparative genomics approaches, a variety of other techniques are being used for gene prediction. neural networks and other artificial intelligence methods have been used effectively. perhaps most intriguing, as more and more genomes become available, are hybrid methods that integrate, for example, hmms with comparative genomic data from two or more genomes. once a genome is sequenced and its genes are found or predicted, the next step in the bioinformatics pipeline is to determine the biological function of the genes. ideally, molecular biological work would be carried out in the laboratory to study each gene's function, but clearly that approach is not feasible. two basic computational approaches will be described, one using comparative genomics and the other using hmms. comparative genomics approaches to assigning function to genes rely on a simple logical assumption: if a gene in species a is very similar to a gene in species b, then the two genes most likely have the same or related functions. 
this logic has long been applied at higher biological levels (e.g., the kidneys of different species have the same basic biological function even though the exact details may differ in the two species). at the level of genes, the inference is less accurate, especially if the species involved in the comparison are not closely related, but the approach is nonetheless useful and usually effective. simple database searches are the most straightforward comparative genomic approach to functional annotation. a newly discovered gene sequence that returns matches to cytochrome oxidase genes when input to blast is likely to be a cytochrome oxidase gene itself. complications arise when matches are to distantly related species, when the matching regions are very short, or when the sequence matches members of a multigene family. in the first case, the functions of the genes may have changed during the tens or hundreds of millions of years since the two organisms shared a common ancestor. however, if two or more such distantly related organisms have gene sequences that are nearly identical, a strong argument can be made that the gene is critical in both organisms and that the same function has been maintained throughout evolutionary history. short matches may arise simply as a result of elementary protein structure. for example, two sequences may have regions that match simply because they both encode alpha helical regions. such matches provide useful structural information, but the stronger inference of shared function is not justified. multigene families are the result of gene duplications followed by functional divergence. examples include the globin and amylase families of genes. at some point in the past, a single gene in one organism was completely duplicated in the genome. at that point, the duplicated copy was free to evolve a new, but often related, function. subsequent duplications allow for the growth and diversification of such families. 
because of their shared ancestry, all members of a gene family tend to have similar dna sequences. this fact makes it difficult to assign function with high accuracy when matches appear in database searches, but it often provides a general class of functions for the query sequence. efforts have been made to classify all known proteins into functional groups using comparative genomics. suppose that the genbank protein database is queried with protein sequence a and the result is that its closest match is protein sequence b. if the database is next queried using sequence b and the closest match for b is found to be sequence a, then these two proteins are said to be reciprocal best matches, and they are likely to have the same function. likewise, if the best match to sequence a is b, the best match to b is c, and the best match to c is a, then a, b, and c are likely to have the same function. this general principle has been used to create clusters of genes that are predicted to have similar or identical functions. the cogs (clusters of orthologous groups of proteins) database at ncbi (http://www.ncbi.nlm.nih.gov/cog) represents a comprehensive clustering of the entire genbank protein database using this type of scheme. there are many known examples of proteins or individual protein domains that have the same function or structure. the pfam (protein family) database (http:// www.sanger.ac.uk/software/pfam) includes multiple sequence alignments of almost such protein families. using the sequence data for each alignment, the pfam project members created a special type of hmm called a profile hmm. this database makes it possible to take a query sequence and, for each of the families and their associated profile hmms, ask the question, ''is the query sequence a member of this gene family?'' a query to the pfam results in a probability assigned to each of the included protein families, providing not only the best matches but also indications of the strength of the matches. 
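the reciprocal best match rule described above can be sketched under the assumption that the best database hit for every sequence has already been computed (for example from blast output) and stored in a dictionary. the name `reciprocal_best_matches` and the toy data are hypothetical; longer cycles such as a, b, c are not handled by this sketch.

```python
def reciprocal_best_matches(best_hit):
    """Return sorted pairs (a, b) where a's best hit is b and b's is a.

    best_hit: dict mapping each sequence id to the id of its closest
              match (assumed to be precomputed by a database search).
    """
    pairs = set()
    for a, b in best_hit.items():
        # skip self-hits; keep the pair only if the relation is mutual
        if a != b and best_hit.get(b) == a:
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


# toy example: A and B are reciprocal; C, D, E form a three-way cycle
hits = {"A": "B", "B": "A", "C": "D", "D": "E", "E": "C"}
pairs = reciprocal_best_matches(hits)
```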
currently, about % of the proteins in genbank have a match in pfam, indicating a fairly high likelihood of any newly discovered protein having a pfam match. pfam is of interest not only because of its effectiveness, but also because of its theoretical approach of combining comparative genomic and hmm components. a common experiment is to use microarray or oligonucleotide array technology to measure the expression level for several thousand genes under two different ''treatments.'' it is often the case that one treatment is a control while the other is an environmental stimulus such as a drug, chemical, or change in a physical variable such as temperature or ph. other possibilities include comparisons between two tissue types (e.g., brain vs. heart), between diseased and undiseased tissues (e.g., tumor vs. normal), or between samples at two developmental phases (e.g., embryo vs. adult). one of the primary reasons to carry out such an experiment is to identify the genes that are differentially expressed between the two treatments. the basic format of the data from a simple two-treatment microarray experiment is the following: each spot on a microarray corresponds to a single gene, and in competitive hybridization experiments, a single spot usually provides measurements of gene expression under two different treatments. note that the first column has been intentionally labeled ''spot'' instead of ''gene.'' it is important that the same gene be used and measured multiple times; therefore, a number of different spots will typically correspond to the same gene. the final column of data is the most important for interpreting this experiment. the most extreme difference in relative expression levels is found at spot , where the gene is expressed almost fourfold higher under treatment . 
the question now becomes, ''how large (or small) must the ratio be to say that the expression levels are really different?'' this question is one of variability and of statistical significance. phrased differently, would a ratio near . for spot be likely if the experiment were repeated? the data in the table do not provide the necessary information to answer this question, and this fact points out the importance of replication in experimental design. whenever quantitative measurements are to be compared, replication is needed in order to estimate the variance of the measurements. this fundamental tenet of experimental design was largely ignored during the early history of microarray studies. fortunately, recent work has included careful attention to experimental design and proper analysis using the analysis of variance (anova). typical experiments now include five or more replicate measurements of each gene. in order to detect very small treatment effects on levels of expression, even larger amounts of replication are needed. a second type of microarray experiment is designed not to find differentially expressed genes, but to identify sets of genes that respond to two or more treatments in the same manner. this type of study is best illustrated with a time course study in which expression levels are measured at a series of time intervals. examples of such studies might involve measuring expression levels in laboratory mice each hour following exposure to a toxic chemical, expression levels in a mother or fetus at each trimester of a pregnancy, or expression levels in patients each year following infection with hiv. if plots of expression levels (y axis) against time (x axis) for each gene are overlaid as shown in figure . , it is possible to visually compare the expression profiles of genes. the desired pattern is a group of genes that tend to increase or decrease their expression levels in unison. in figure .
it appears that genes and have very similar expression profiles, as do genes and . the similarity between the expression profiles of two genes can be described using the correlation coefficient, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$, where x_i and y_i are the expression levels of genes x and y at time point i, and \bar{x} and \bar{y} are the mean expression levels of the two genes over all time points. values near 1 or -1 indicate that the two genes have very similar profiles. when faced with thousands of profiles, the task becomes more difficult. a common theme is to cluster genes on the basis of the similarity in their profiles, and many algorithms for carrying out the clustering have been published. all of these algorithms share the objective of assigning genes to clusters so that there is little variation among profiles within clusters, but considerable variation between clusters. top down clustering begins with all genes in a single cluster, then recursively partitions the genes into smaller and smaller clusters. bottom up methods start with each gene in its own cluster and progressively merge smaller clusters into larger ones. clustering algorithms may also be supervised, meaning that the user specifies ahead of time the final number of clusters, or unsupervised, in which case the algorithm determines the final number of clusters. the emergence of genomic science has not simply provided a rich set of tools and data for studying molecular biology. it has been the catalyst for an astounding burst of interdisciplinary research, and it has challenged long-established hierarchies found in most institutions of higher learning. the next generation of biologists will need to be as comfortable at a computer workstation as they are at the lab bench. recognizing this fact, many universities have already reorganized their departments and their curricula to accommodate the demands of genomic science. from a more practical point of view, the results of genomic research will begin to trickle into medicine. already, diagnostic procedures are changing rapidly as a result of genomics.
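the correlation coefficient used above to compare two expression profiles can be computed directly. the sketch below assumes the usual pearson definition (covariance divided by the product of the standard deviations); the function name `correlation` and the toy profiles are illustrative only.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation between two expression profiles.

    x, y: expression levels of two genes at the same time points.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


# toy profiles: the second gene is expressed at twice the level of the
# first at every time point, the third moves in the opposite direction
r_same = correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_opposite = correlation([1, 2, 3, 4], [8, 6, 4, 2])
```

note that a gene expressed at a uniformly higher level than another can still have a correlation near 1, because the coefficient compares the shapes of the profiles rather than their absolute levels.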
the next phase of genomics will focus on relating genotypes to complex phenotypes, and as those connections are uncovered, new therapies and drugs will follow. consider, for example, a drug that is of significant benefit to % of users, but causes serious side effects in the remaining %. such drugs currently have difficulty remaining in the marketplace. however, the use of genetic screens to identify the patients likely to suffer side effects should make it possible for these drugs to be used safely and effectively. less imminent, but certainly in the foreseeable future, are gene therapies that will allow for repair of genetic defects. the continued interplay of biology, engineering, and the mathematical sciences will be responsible for exploration of these frontiers.
figure . overlaid expression profiles for genes.
exercises
how many possible proteins could be formed by a gene region containing four exons?
in general, eukaryotes have introns, whereas prokaryotes do not. what are possible advantages and disadvantages of introns?
most amino acids are encoded by more than a single codon. if one of these synonymous codons is energetically more efficient for the organism to use, what effect would that have on the organism's genome content? how might this fact be used in gene finding algorithms?
what is the chance that a -nucleotide oligonucleotide matches a sequence other than the one it was designed to match? assume for simplicity that all nucleotides have frequency %. how many matches to that oligonucleotide would one expect to find in the human genome?
if each of the ssr markers used by the fbi for identification purposes has equally frequent alleles, what is the chance that two randomly chosen individuals have the same collection of alleles at those markers?
how many mammalian genomes have been completely sequenced? what are they?
what is the size of the anopheles gambiae genome? how many chromosomes does it have?
how many genes does it have?
what is the length of the drosophila melanogaster alcohol dehydrogenase gene?
consider the following two alignments for the sequences cggtca and cagca:
c-ggtca c-ggtca
ca-g-ca ca-gca
a. find the score of each alignment using a match score of , mismatch penalty of - , and gap penalty of - . b. find the score of each if the gap penalty is - for opening and - for extending.
suppose a computer can calculate the scores for one million alignments per second. how long would it take to find the best alignment of two bp sequences by exhaustive search?
find an example of a zinc finger gene sequence using genbank. use blast to discover how many genbank sequences are similar to the sequence you found. what does the result tell you about zinc finger genes?
what are some additional features that might be added to the simple gene finding hmm of fig. . ?
draw a diagram of a simple gene finding hmm that might be useful for prokaryotes. the hmm should contain hidden states for exons and intergenic regions, and it should guarantee that exons have lengths that are multiples of three.
use the pfam website to give a brief description of the structure and function of members of the hamartin gene family.
gene seems to be expressed at higher levels than gene . justify the claim that the two genes have similar profiles and might be coregulated.
compute the correlation coefficient for each pair of genes. do any of them have similar profiles?
the expression levels for two genes measured at four times are: implication of the correlation coefficient?
often, the gene sequences placed on microarray slides are of unknown function. suppose that an experiment identifies such a gene as being important for formation of a particular type of tumor.
when carrying out a database search using blast with a protein coding gene as the query sequence, there are two possible approaches. first, it is possible to query using the original dna sequence.
second, one could translate the coding dna and query using the amino acid sequence of the encoded protein.
key: cord- -f uvhstb authors: sintchenko, vitali title: informatics for infectious disease research and control date: - - journal: infectious disease informatics doi: .
/ - - - - _ sha: doc_id: cord_uid: f uvhstb the goal of infectious disease informatics is to optimize the clinical and public health management of infectious diseases through improvements in the development and use of antimicrobials, the design of more effective vaccines, the identification of biomarkers for life-threatening infections, a better understanding of host-pathogen interactions, and biosurveillance and clinical decision support. infectious disease informatics can lead to more targeted and effective approaches for the prevention, diagnosis and treatment of infections through a comprehensive review of the genetic repertoire and metabolic profiles of a pathogen. the developments in informatics have been critical in boosting the translational science and in supporting both reductionist and integrative research paradigms. "new age" infectious disease informatics rests on advances in microbial genomics, the sequencing and comparative study of the genomes of pathogens, and proteomics or the identification and characterization of their protein related properties and reconstruction of metabolic and regulatory pathways (bansal ) . the speed of microbial genome sequencing has been steadily accelerating since the introduction of modern dna sequencing methods more than thirty years ago (sanger et al. ) . the accumulation of sequenced genomes of bacteria shows a good fit to exponential functions with a doubling time of approximately months (koonin and wolf ) . despite the historical bias towards the "working horses" of bacterial genomics, such as commensals e. coli and b. subtilis (collado-vides et al. ) , the depth and breadth of the coverage of sequences belonging to different species of viral, bacterial, fungal and protozoan pathogens has been rapidly expanding. microbial genomes are thousands or millions of base pairs in length, requiring both a global view of the genome and the ability to zoom in on details for the purpose of analysis and annotation. 
annotation is the extraction of biological knowledge from raw nucleotide sequences (médigue and moszer ) . such decoding of the genomes allows the prediction of protein-coding genes and therefore, the proteins the organism is able to produce. desktop computer sequence editors such as chromas lite (http://chromas-lite.software.informer.com/), trace edit (http://www.ridom.de/traceedit/) or commercial products like lasergene (http://www.dnastar.com/products/lasergene.php) or sequencher (http://www. sequencher.com/) are helpful in the initial sequence assessment. the task of assembling of sequences from re-sequencing experiments, when a reference sequence is available, can be supported by tools like traceeditpro (http://www . ridom.de/traceeditpro/) or seqscape. different software pipelines have been developed to automate microbial genome annotation and assembly (table . ). the integrated microbial genome (img) system, hosted by the joint genome institute (jgi), and the rast (rapid annotation using subsystem technology) server are examples of open resources. major sequencing centers offer genome viewers and browsers through their websites (mcneil et al. ) . for example, manatee (j. craig venter institute (jcvi)) has been developed to view and to alter initial automatic annotations of prokaryotic genomes. the sanger institute's pathogen sequencing unit has been maintaining freeware for sequence analysis, viewing and annotation, such as artemis and the artemis comparison tool (act) (carver et al. ) . the alignment of genomes of three strains of staphylococcus aureus using act is shown in fig. . . alternatively, multiple genome alignments in the presence of large-scale evolutionary events, such as rearrangement and inversion, can be efficiently constructed and visualized using the mauve program (http://gel.ahabs.wisc.edu/mauve/download. php) (darling et al. ). 
fig. . alignment of genomes of three strains of staphylococcus aureus. dna sequences that find a perfect match are connected with red lines or blocks. blue areas are inversions or transitions and white areas represent indels. the figure was produced using artemis software (the wellcome trust sanger institute, uk).
these tools assist in the rapid identification of protein-coding genes, as well as other features like non-coding rna genes, repetitive sequences or recently acquired dna. web servers like integrated microbial genomes (joint genome institute; http://img.jgi.doe.gov) or the bacterial annotation system (basys, http://wishart.biology.ualberta.ca/basys/cgi/submit.pl) also support comparative analysis and the automated annotation of bacterial genomic (chromosomal and plasmid) sequences (van domselaar et al. ). they accept raw sequence data and gene identification information, and provide textual annotation and hyperlinked image output. strings of nucleotides are assembled into draft sequences that can be characterized by the following: ( ) > % of genome in contigs, ( ) average contig length > kb, ( ) > % of a set of conserved genes present, ( ) contig n length > kb, ( ) > % of bases > × read coverage, ( ) scaffold n length > kb. the information used to annotate genomes comes from three types of analysis: ( ) ab initio gene finding programs, which are run on the dna sequence to predict protein coding genes; ( ) evidence-based gene calling or translating alignments of the dna sequence to known proteins; and ( ) aligning cdnas from the same or related species. gene finding has progressed far beyond the simple identification of open reading frames.
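for reference, the "simple identification of open reading frames" that modern gene finders have moved beyond can be sketched as a scan for an atg start codon followed in the same frame by a stop codon. this is only a baseline illustration: it looks at the forward strand alone, and the function name `find_orfs`, the minimum length, and the toy sequence are assumptions for the example.

```python
STOP_CODONS = {"taa", "tag", "tga"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of forward-strand ORFs.

    An ORF here runs from the first atg in a frame to the next in-frame
    stop codon; end is exclusive and includes the stop codon.
    min_codons is an assumed length cutoff (start and stop included).
    """
    dna = dna.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "atg":
                start = i  # open a candidate ORF at the first atg
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next atg
    return sorted(orfs)


# toy sequence with one ORF (atg aaa ccc taa) starting at index 2
orfs = find_orfs("ccatgaaaccctaagg")
```

a scan like this reports every sufficiently long open frame, which is why evidence from trained models and from alignments to known proteins is needed to separate real genes from chance orfs.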
the programs aligning cdna and protein sequences to genomic dna can locate the protein coding regions by searching the publicly available databases or by applying machine learning algorithms such as hidden markov models (hmm). there is a long list of such programs including genemark, morfind, prodigal (prokaryotic dynamic programming genefinding algorithm), argon and glimmer (gene locator and interpolated markov modeller) (delcher et al. ; suzek et al. ; majoros ) . they differ in the time required for automated annotation as well as the quality of gene calling (guigo et al. ) . problems with the accuracy of current gene finders reflect not only the performance of their algorithms but also the quality of the primary resources and the abundance of non-coding dna regions in microbial genomes. genome assembly annotation methods and tools including new applications for rna genes, were reviewed in detail elsewhere (stothard and wishart ; médigue and moszer ; brent ; pop and salzberg ) . recent breakthroughs in high-throughput sequencing technologies have posed new challenges for genome assembly, annotation and analysis. these technologies make it feasible to sequence not only static genomes but also entire transcriptomes expressed under different conditions (shendure and ji ) . however, they can produce read lengths as short as - nucleotides, which cannot be analyzed with software developed for sanger data as they are often non-unique, lack neighborhood context and have a different distribution of errors. the task of linking such short-reads may be accomplished using a comparative assembly algorithm, in which new sequences are put together by mapping them onto close relatives or the "reference genomes." not surprisingly, the comparative assembly strategy works best when the two species are more than % identical. 
alternatively, when no "reference genome" is available, a new cohort of assembly algorithms based on de bruijn graphs (a way to transform sequence data into a network structure) has risen to the task (chaisson and pevzner ; maclean et al. ). strategies and systems that address these new challenges have recently been reviewed elsewhere (pop and salzberg ; maclean et al. ; ussery et al. ). metagenomics, or the sequencing of genomes of complex mixed communities, has emerged at the interface of genomics, microbiology and information technology. this field examines the interplay of hundreds of microbial species present at specific sites of potential infections in space and time (hutchinson ; smarr et al. ). significantly, metagenomics has extended its focus from environmental microorganisms to microbial communities or "community whole genome sequences" of the human host (field et al. ; verberkmoes et al. ). most of the - trillion microorganisms in the human gastrointestinal tract live in the colon (turnbaugh et al. ). the genomes of these microbial symbionts have been collectively defined as the microbiome, an ecosystem in which the number of microbial genes is estimated to be many fold higher than the number present in the human genome. the human gut microbiome initiative, a logical conceptual extension of the human genome project, aims to discover genomes of at least new intestinal species. this approach has targeted the totality of genes involved in the gut biofilms, the mechanisms of horizontal gene transfer, and the role of the microbial pan-genome (field et al. ). the microbiome project aims to address some of the most inspiring and fundamental scientific questions today in order to identify new ways to determine health and predisposition to diseases and define parameters. in addition to conventional strings of nucleotides, large-scale sequencing can provide new types of data reflecting global genome architecture and the properties of pathogens.
these data include the size of a genome and its nucleotide composition, the locations of genes and intergenic regions, gc percentage and gene density. microbial genomes are compared by the number of particular sets of genes, gene order (synteny) and the presence or absence of important genes. other metrics include gene set properties (the number of two component system regulatory genes) and nucleotide sequence-based measures (distance between paired twocomponent system genes and consensus sequence) (whitworth ; ussery et al. ). these metrics represent a global view of genomes but often have limited biological meaning. thus, "signature" sequences have been suggested as a means of identifying organisms or genes with sequence profiles correlating with the pathogen phenotype or disease outcomes. examples of genome characteristics that are more directly related to biologically important behavior are bacterial iq (a measure of the number of signal transduction proteins as a function of genome size) and extrovertedness (the proportion of signaling proteins predicted to sense external stimuli) (galperin ) . analyses of genomics data challenge the traditional taxonomy of microbial species. recent projects have focused on producing simple analytical diagnostic tools based on strong taxonomic knowledge collated in the dna reference libraries such as the dna barcode of life data system (bold; http://www.boldsystems. org). these types of data enable the acquisition, storage, analysis and publication of dna barcode results, and provide clues about the global distribution of species. their genetic diversity and structure is based on two postulates: first, that every species is represented by a unique dna barcode (indeed there are possible atgc combinations compared to an estimated million species remaining to be discovered (frézal and leblois ) ), and second, that the genetic variation between species exceeds the variation within species. 
dna barcoding requires a minimum sequence length of bp and more than three individual sequences per species. the initial barcode of life framework was based on the sequence of a single universal marker -the cytochrome c oxidase gene -but has evolved since then, giving rise to a flexible description of dna barcoding, a larger range of applications and the broader use of the term "barcode" (frézal and leblois ) . for example, the whole microbial genome's barcodes were defined as frequency distributions of periodic dna sequences or k-mers across the whole genome (zhou et al. ). it has been postulated that such barcode similarities are proportional to the genomes' phylogenetic closeness and could be utilized in metagenome analyses (zhou et al. ) . microbial species diversity can be also estimated by the average nucleotide identity (ani) using the list of orthologs and deriving the overall divergence of the core genome by averaging the percentages of identity at the nucleotide level (konstantinidis and tiedje ) . another approach to measure distances between genomes is based on estimating the proportion of common genes by calculating the ratio of orthologs to the total number of genes of the reference genome. more recently, similar methods such as dna content, blast distance phylogeny and the mum (maximal unique and exact matches) index have been suggested as more sensitive measures for intra-species comparisons (deloger et al. ). the true power of large-scale comparative genomic studies lies in their ability to identify and characterize biological trends and rules that explain particular phenomena (field et al. ). computational methods have become essential steps in formulating hypotheses about gene functions. the comparative approach has not only yielded fundamental insights into the function and evolution of microbial genomes, but has also led to practical results. 
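the idea of describing a genome by the frequency distribution of its k-mers, mentioned above, can be illustrated with a small sketch. this is a simplified stand-in and not the barcode construction of zhou et al.: the function names, the choice of k, and the distance measure (half the sum of absolute frequency differences) are all assumptions for the example.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Return relative frequencies of all 4**k k-mers in a fixed order."""
    seq = seq.lower()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("acgt", repeat=k)]
    total = sum(counts[km] for km in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[km] / total for km in kmers]


def profile_distance(p, q):
    """Half the summed absolute difference: 0 = identical, 1 = disjoint."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


# toy comparison of single-nucleotide (k=1) composition
d_same = profile_distance(kmer_profile("acgt", 1), kmer_profile("tgca", 1))
d_far = profile_distance(kmer_profile("aaaa", 1), kmer_profile("cccc", 1))
```

with real genomes one would use a larger k and typically count both strands; whether such distances track phylogenetic closeness is the postulate reported in the text, not something this sketch demonstrates.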
comparative genomics has allowed the accurate estimation of the structure of genomes and the speed of gene movements, including the role of natural selection versus genetic drift, the origin of the pandemic strains, and the ecology of a pathogen in its natural reservoir (yang et al. a). computational studies identified unexpected relationships between genomic features and ecological niches, demonstrated diversity in the microbial world and helped to reconstruct evolutionary relationships among genomes (binnewies et al. ; field et al. ). comparisons made between different genomes can also generate new hypotheses for testing, usually relating to the unexpected presence or absence of particular genes with respect to other genomes (whitworth ). the studies of three main forces shaping genome evolution (gene loss, gain and change) have been especially fruitful in this respect (burrack et al. ; whitworth ). discoveries of gene duplication in many bacterial pathogens, resulting in increased numbers of key gene clusters or the expansion of important protein families, have led to the development of new diagnostic methods. for example, such gene clusters encode a secreted protein called the early secretory antigenic target (esat ), which was identified as one of the key virulence factors in mycobacterium tuberculosis and was subsequently used in the interferon-gamma release assays for the diagnosis of tuberculosis (pallen and wren ; behr ). comparative genomics has also revealed that pathogens undergo a process of genome decay, or a reduction in the number of biosynthetic pathways, resulting in a dependence on the infected host for certain essential functions. the most surprising snapshots of genome decay have come from relatively recently emerged pathogens that have changed their lifestyles by adopting a simpler host-associated niche. for example, the genomes of yersinia pestis (parkhill et al.
b) and salmonella enterica serovar typhi (parkhill et al. a) contain hundreds of pseudogenes. these findings challenge the traditional view that bacterial genomes never contain "junk" dna and that every gene in a bacterial genome must have a function. instead, every genome should be viewed as a work in progress, burdened with some non-functional "baggage of history" (pallen and wren). as the smallest-scale variation in microbial genomes occurs at the level of single-nucleotide polymorphisms (snps), snp detection has been applied extensively to many pathogens (yao et al.). while snps are generally considered rare, at one per several thousand base pairs, two m. tuberculosis isolates, with genomes of mb each, may still differ by some , snps (behr). whole-genome sequencing has proven to be an even more powerful tool for detecting snps. it enabled the differentiation of escherichia coli strains that had diverged for as few as generations (shendure and ji) and revealed genomic changes in pathogens in the process of human infection (chen et al.; forst; pallen and wren). in the pre-informatics era, virulence factors were typically identified either by biochemical studies or through genetic screens. informatics has enabled innovative strategies for the recognition of virulence genes through the analysis of genetic signatures (pallen and wren). despite the variety of microbial life styles and associated genomic and metabolic complexity, pathogen genomes share common architectural principles. as a result, computational techniques assist in exploring similarities between virulence factors and other genes with known functions. this association can then be tested using targeted genetic methods such as the inactivation of the putative virulence gene followed by the comparison of phenotypes of the original and modified microorganisms (raskin et al.).
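the snp comparison mentioned above can be made concrete with a small python sketch. it assumes the two sequences have already been aligned to equal length (the alignment step itself is not shown), and the function names are hypothetical; production snp callers also weigh read depth and base quality, which this sketch ignores.

```python
def snp_positions(ref: str, alt: str) -> list[int]:
    """0-based positions where two pre-aligned sequences carry different bases.

    Columns with gaps or ambiguous bases are skipped, so only true
    single-nucleotide substitutions are reported.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    bases = set("ACGT")
    return [
        i
        for i, (r, a) in enumerate(zip(ref.upper(), alt.upper()))
        if r in bases and a in bases and r != a
    ]


def snps_per_kb(ref: str, alt: str) -> float:
    """SNP density per 1,000 aligned columns (0.0 for empty input)."""
    if not ref:
        return 0.0
    return 1000 * len(snp_positions(ref, alt)) / len(ref)
```

the density helper expresses the "one per several thousand base pairs" intuition as a number that can be compared across pairs of isolates.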
a strategy that does not rely on sequence similarity for identifying potential genes is the detection of coding sequences, which is based on gene context "grammars" supplemented with machine learning models (garrido et al.). for example, the gene recognition tools genemark and glimmer employ hidden markov models, in which the preceding nucleotide bases are used to predict the next base in a coding region, and the algorithm is trained on a trusted set of sequences. gene coding regions are then identified using probability estimates of the correct coding "grammar" in a region (dougherty et al.). different statistical and machine learning methods for gene prediction have been reviewed elsewhere (majoros). gene-gene interactions specifically associated with a phenotype or a particular disease can be explored with or without prior biological knowledge. several techniques utilizing bayesian networks, pair-wise mutual information and graphical gaussian models have been proposed for this purpose. coupled with biological knowledge, the identification of such phenotype-specific interactions can shed light on the responsible pathways. the complexity of data handling and visualization has led to efforts to develop dedicated comparative genomics resources such as gendb (meyer et al.), cmr, act (table . ), xbase and microbes online as well as data management systems such as seed (table . ) (chaudhuri et al.). informatics has been instrumental in the change from a static to a dynamic view of the microbial world. in contrast to the static view of genome annotations focused on gene or protein prediction, the dynamic view places the information obtained into a biological context to identify interactions between genomic components and to reconstruct regulatory networks (médigue and moszer; sakata and winzeler).
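the idea of using preceding bases to predict the next base, introduced above for gene finders, can be sketched as a fixed-order markov chain with a log-odds score. this is a deliberately reduced illustration, not the model used by genemark or glimmer: the function names, the order-1 default and the add-one smoothing are assumptions made for the example.

```python
from collections import defaultdict
from math import log

ALPHABET = "ACGT"


def train_markov(sequences, order=1):
    """Estimate P(next base | previous `order` bases) with add-one smoothing."""
    counts = defaultdict(lambda: {b: 1 for b in ALPHABET})
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, nxt = seq[i - order:i], seq[i]
            if nxt in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][nxt] += 1
    return {
        context: {b: c[b] / sum(c.values()) for b in ALPHABET}
        for context, c in counts.items()
    }


def log_likelihood(seq, model, order=1):
    """Log-probability of `seq`; unseen contexts fall back to a uniform guess."""
    uniform = 1 / len(ALPHABET)
    seq = seq.upper()
    return sum(
        log(model.get(seq[i - order:i], {}).get(seq[i], uniform))
        for i in range(order, len(seq))
    )


def coding_log_odds(seq, coding_model, background_model, order=1):
    """Positive values mean `seq` looks more like the coding training set."""
    return log_likelihood(seq, coding_model, order) - log_likelihood(
        seq, background_model, order
    )
```

a region would be called "coding" when its log-odds score exceeds a chosen threshold; the quality of the trusted training set matters more than the scoring arithmetic.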
under the network vision of the microbial world, microbial chromosomes are not envisaged as strictly defined genotypes gradually changing in time but rather as islands of temporary, relative dynamic stability that form tightly connected (vertically and horizontally) areas of the network (koonin and wolf ) . the infection cycle should be considered as a whole and the links between growth, virulence, immune evasion and transmission should be assessed (restif ). biological interactions vary in their nature and are spatially and temporally heterogeneous. one can abstract the actions of proteins and metabolites by representing genes acting on other genes as a gene network or as genetic regulatory, transcription or expression networks. such networks can be constructed using computationally assigned functional linkages inferred by rosetta stone, operon or similar methods (rachman and kaufmann ; harrington et al. ) , and often point to highly connected and central proteins frequently referred to as "hubs" (wu et al. ) . biological interaction and communication networks share several commonalities: they are scale free (only a few nodes are highly connected) and are small world networks (highly clustered with short distances between any two nodes) (kann ) . increasingly, disease pathogenesis and the mechanisms of drug action are viewed from a biological systems perspective (wu et al. ) . from this perspective, a deeper understanding of infectious diseases may rely on an exhaustive characterization of all potential interactions occurring between proteins encoded by viruses and those expressed in infected cells. thus, the integration of all protein-protein interactions into an infected cellular network, or "infectome," offers a powerful framework for the virtual modeling and analysis of infections (navrati et al. ). the terms "interactome" and "phenomics" have been coined in this context (lussier and liu ) . 
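the notion of highly connected "hubs" described above can be sketched by counting interaction partners per protein. the edge list format, the function names and the top-fraction cut-off are assumptions for illustration; real analyses often use additional centrality measures beyond degree.

```python
from collections import defaultdict


def degrees(edges):
    """Number of distinct interaction partners for each node in an undirected network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(partners) for node, partners in neighbours.items()}


def hubs(edges, top_fraction=0.1):
    """Return the most connected nodes (at least one), highest degree first."""
    deg = degrees(edges)
    if not deg:
        return []
    n_top = max(1, round(len(deg) * top_fraction))
    ranked = sorted(deg, key=lambda node: (-deg[node], node))
    return ranked[:n_top]
```

in a scale-free network most nodes have few partners, so a small `top_fraction` usually captures the handful of hubs that hold the network together.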
numerous resources have been developed to explore host-pathogen interactions (phi) (table . ). specifically, phi-base (winnenburg et al.), phidias (xiang et al.), biohealthbase (squires et al.), pig (driscoll et al.), virusmint (chatr-aryamontri et al.) and virhostnet (navrati et al.) have been suggested to study and visualize pathogen-related pathways. for example, virhostnet is a knowledge base for the management and analysis of proteome-wide virus-host interaction networks and a resource of manually curated interactions defined for a wide range of viral species (navrati et al.).
[table residue: entries "virulence prediction", "drug resistance prediction", "effect of diseases on gene expression" and "drug target identification", citing lengauer et al., navrati et al., raman et al., reddy et al., squires et al. and stavrinides et al.]
genomic and proteomic data are often informationally synergistic, allowing for the reconstruction of known pathways from first principles. the combination of these forms of data has been used to identify libraries of recurring motifs, where the mixed semantics of the pattern promises to be more informative than any single data source taken in isolation in building biological networks (michael et al.; stavrinides et al.). systems biology has arisen from various attempts to move away from the reductionist approach, which is hindered by the difficulty of breaking a system into separable and meaningful parts. it encompasses several high-throughput analytic technologies, including genomics, transcriptomics to measure gene expression and its regulation at the level of messenger rna and microrna production, proteomics to measure changes in protein production, and computational biology, which depends on analytic software packages for analyzing, organizing, and interpreting those data (sakata and winzeler).
such an approach treats pathogens and their environments as a series of hierarchical levels or networks from gene products to whole organisms and integrates the time dimension in order to structure knowledge and to determine rules that would allow navigation between levels (lisacek et al.). this approach demands new tools for data management, the integration of which offers the opportunity to correlate multiple lines of evidence and to reduce uncorrelated noise. the major difference between the pre- and post-genomics eras is that one can now potentially account for and keep track of all components at once. however, the gathering of a large collection of data does not guarantee that we can make sense of it or that new knowledge will emerge (collado-vides et al.). the chance for enriching biomedical knowledge can be increased by mixing various streams of data and gaining robustness from the "cross-validation" of the knowledge sources (guyet et al.). public websites like galaxy (http://galaxy.psu.edu) and interpro (http://www.ebi.ac.uk/interpro/) offer integration toolsets for genomics and proteomics analyses. as generating data remains a costly undertaking, computational models have a pivotal role to play in integrative science. they help researchers to illuminate the underlying processes and identify the key questions that need to be addressed experimentally (restif). compared to conventional, small-scale experimental approaches, they give a wider, often more relevant view of host responses to infections or other health insults. these computational models have the capacity to guide and direct wet lab experimental efforts, complementing traditional in vivo, in situ, and in vitro testing with the emerging in silico approach (lengauer et al.; raman et al.). some impressive starts have been made on bacterial models in the form of simulation tools.
for example, the reconstruction of metabolic networks gave birth to the first examples of in silico strains that can be utilized to explore alternative ways of identifying new drug targets (jamshidi and palsson ) . the end result of these simulations may be the genomic bioengineering of microorganisms based on knowledge of interacting systems and networks of genes and gene products. text mining tools are being created to query the pubmed literature database and to integrate the available genomic and proteomic information to map the genes and their interrelationship with particular networks of a disease (korbel et al. ; jelier et al. ; rzhetsky et al. ; zaremba et al. ). an unsupervised, systematic approach for associating genes and phenotypic characteristics (g p) that combines literature mining with comparative genome analysis has been successfully applied and has uncovered clusters of unsuspected g p associations (korbel et al. ). the phase of history in which biomedical science could be significantly advanced by individual researchers without data sharing has come to a close. the global, collaborative analyses of data and the exchange of the results across social, political and technological boundaries have created the demand for new cyber-infrastructures for research. there has been a major effort, in the form of e-science, to develop technologies to fulfill these demands (craddock et al. ). the chance of making a discovery or replicating the finding is greatly increased if there are effective mechanisms for different groups to share data and thereby enlarge the number of samples that are studied. this paradigm has been successful in both human genomics and infectious disease research (e.g., including the rapid discovery and identification of emerged pathogens such as the nipah virus and the novel coronavirus that caused the sars epidemic). 
post-genomic era solutions such as federated databases and other technologies that enhance connectivity and data retrieval have created a new knowledge environment (birkholtz et al. ; thorisson et al. ). the level of technical competence required of the users is being reduced by the provision of "off-the-shelf" solutions. for example, the gen phen project offers "database-in-a-box" installation packages, which include an open-source complete genetic association database system with the option for federation (thorisson et al. ). alternative infrastructures for e-science with significant advantages over conventional internet technologies are offered by grid and cloud computing and the semantic web (numann and prusak ; craddock et al. ) . first, grids provide unique access to high performance computing power, distributed applications and sources (see chap. for examples). second, grids increase data storage spaces, and allow data and tools to be shared by geographically dispersed users. however, developing and maintaining grid or cloud architectures remains a complex task and requires further advances in security and privacy models before they can be embraced by diagnostic laboratories (lisacek et al. ). tasks that require an e-science approach or global science that is performed in silico are typically computationally intensive and use heterogeneous resources that must be integrated across distributed networks (craddock et al. ) . increasingly, the genomic, proteomic and metabolomic data have to be integrated with traditional literature in a machine-readable way. typical sets of experimental data yield component lists with quantitative content data and a catalog of interactions and networks. this requires the establishment of a middleware to convert experimental data into a format suitable for manipulation and viewing by end-users. 
for example, the generic model organism database project (gmod; http://gmod.org) aims to link experimental data with corresponding contextual meta-data about experimental conditions and protocols in a multi-user, multi-center environment. it offers a collection of open source tools for creating and managing genome-scale biological databases ranging from a small database of genome annotations to a large web-accessible community database. another approach is to trade off the width of integration for more depth with regard to a particular analysis task, and to employ workflow systems such as inforsense (http://www.inforsense.com) or taverna (http://taverna.sf.net). these act as glue layers between various data sources and analysis packages and are also often referred to as pipelines, in silico protocols or e-experiments (turnbaigh et al. ) . "pipeline" is mostly used to describe executable workflows, while the other terms are dedicated to abstract workflows (lisacek et al. ) . many innovative solutions for the multi-dimensional integration of data produced by experimental laboratories have been introduced by bioinformatics resource centers for biodefense and emerging/re-emerging infectious diseases through regional biodefense centers of excellence (mcneal et al ; greene et al. ) . sets of task-and domain-specific online query and display tools are being developed to allow the end-user to view data in a number of different formats and to run informative comparisons of data with existing libraries (louie et al. ; glassner et al. ) . the most striking change in data collection and representation is expressed by the move from flat databases to atlases or collections of interconnected maps (lisacek et al. ). the uneven content and quality of data and the constant evolution of biomedical knowledge remain the main obstacles to data integration (lisacek et al. ). 
the quality of data is affected by a number of factors including the accuracy of the mapping algorithms and reference datasets, the standardization of data formats and the level of detail of the experiment description (stead et al. ). in addition, an increasing number of genomes are being released in "draft" form, before the finishing stage of a sequencing project, with high sequencing error rates (de keersmaesher et al. ; médigue and moszer ) . recent developments in databases and browsers for genomics have been summarized by schattner ( ) . there is an urgent need for data structures suitable for infectious disease space that can be applied to emerging "omics" data sets. the pathogen information markup language (piml) has also recently been introduced to enhance the interoperability of microbiology datasets for pathogens with epidemic potential (he et al. ) by capturing the data elements that describe determinants of pathogen profiles. however, the jury is still out on the question of which data integration architectures are best suited to assembling large scale and highly diverse genomic data. integrating high-throughput techniques with other analytic tools brings a new understanding of infectious processes and introduces an era of personalized strategies for managing infectious diseases. in this way, informatics becomes an irreplaceable platform for the constant cross-fertilization and interplay between focused and genome-wide studies. rapid and standardizable molecular identification systems have emerged during the last decade, with the development of sequence based species identification and sub-typing as the alternative to slow, labor-intensive and underpowered phenotypic techniques. molecular identification usually relies on the detection of a single gene or multiple gene targets, or requires the comparison of whole microbial genomes. 
for example, in the pragmatic world of diagnostic bacteriology, conserved housekeeping genes such as the s rrna gene, rpob gene and others have been accepted as reliable targets. they are found in all microorganisms and show enough sequence conservation for accurate alignment as well as enough variation for phylogenetic analyses (christen). furthermore, the s rrna gene based phylogeny is sufficiently congruent with those based on whole genome approaches. sequencing of six to eight genes or loci, as is typically done in multilocus sequence typing analysis, may constitute a reasonable compromise between single gene-based and whole genome-based methods for species diversity studies. to streamline the process of the translation of sequencing-based identification into clinical practice, the concept of the pathogen profile has been introduced. a pathogen profile is a single, multivariate observation or set of observations, composed of classes of specific attributes (e.g., genome, transcriptome, proteome or metabolome data), which are designed to allow the interrogation of existing or future databases, and the integration of genomics and post-genomics data with clinical observations and patient outcomes. the profile may indicate the probability that a specific marker is associated with a clinically relevant phenotype such as in vivo antimicrobial resistance or high transmissibility. this information allows the classification of strains into "risk groups" for treatment failure or a propensity to cause outbreaks of infections. it is often important to capture the quantitative information about a pathogen in vivo, i.e. viral or bacterial loads and their units of measurement. in contrast to traditional subtyping, which is based on phenotypic characteristics such as serotype, biotype, phage type or antimicrobial susceptibility, genetic profiling describes the phenotypic potential in the nucleic acid sequence.
a pathogen profile is a synthesis of various markers and clinical end-points, which can be extracted from medical charts that characterize an individual patient's clinical and public health outcomes. the profile may be heuristic, when only a single genetic marker is associated with a specific patient outcome, while more insights can be achieved when attributes from different levels of the biological hierarchy (i.e. gene detection, gene expression, metabolite profiles etc) corroborate and complement each other. machine learning algorithms, such as e-predict (urisman et al. ) , are being developed to identify viruses and bacteria present in clinical samples. these profiles are based on the microarray hybridization patterns or dna sequences of pathogens. many computerized evidence-based guidelines and decision support systems (dss) have been designed to improve the effectiveness and efficiency of antibiotic prescribing (samore et al. ; buising et al. ) . the most frequently utilized are electronic guidelines and protocols, especially for the empirical selection of antibiotics. the majority of dss result in improvement in clinical performance and, in at least half of the published trials, in improved patient outcomes (finch and low ; sintchenko et al. a) . the revival of interest in prescribing-decision-support reflects the recent change in emphasis from support for diagnostic decisions towards support for patient management, and the changing focus from systems targeting a broad range of clinical diagnoses to task-and condition-specific decision aids. despite reported successes of individual applications, the safety of electronic prescribing systems in routine practice has recently been identified as an issue of potential concern. bioinformatics assisted prescribing has become a new frontier in reducing the complexities of prescribing combinations of antimicrobials in the era of multidrug resistance. 
the great diversity of mutational patterns contributing to antimicrobial resistance complicates the choice of optimal therapies. a range of bioinformatics tools to predict drug resistance or response to therapy from a genotype has been developed to support clinical decision-making (beerenwinkel et al.; lengauer and singh). these tools use either a statistical approach, in which the inferred model and prediction are treated as regression problems, or machine learning algorithms, in which the model is addressed as a classification problem (sintchenko et al. a). a statistical learning approach to the ranking of therapeutic choices often relies on a direct correlation between the baseline microbial profiles, the therapeutic decision and the patient's response to treatment (e.g., expected reduction in viral load resulting from anti-hiv combination therapy). for example, several susceptibility scores have been used for combination antiretroviral therapy. these take into account specific resistance mutations and add up the activities of individual drugs in the regimen (lengauer and singh). computer-assisted therapy depends on the availability of widely shared databases that can correlate quality-controlled data from genotypic resistance assays and treatment regimens with short- and long-term clinical outcomes. databases such as ardb (liu and pop) capture differences in antimicrobial sensitivities and reflect variation in the amino acid composition of resistant microbes, but simply counting mutations may not be enough to predict functional differences, which affect treatment outcomes. the molecular profiling of pathogens is based on the concept that various pathogens can be associated with different clinical outcomes. it brings together the pathogen and host factors as the pathogenesis and natural history of infection are determined by both the pathogen and human genetic susceptibility.
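the additive susceptibility scores mentioned above (adding up the activities of the individual drugs in a regimen, given the detected resistance mutations) can be sketched as follows. everything specific here is hypothetical: the drug names, mutation labels and activity values are placeholders, not clinical data, and the rule of taking the lowest activity implied by any detected mutation is only one possible convention.

```python
def drug_activity(drug, mutations, resistance_table):
    """Activity of one drug: 1.0 = fully active, 0.0 = resistant.

    `resistance_table` maps drug -> {mutation: residual activity}; when several
    detected mutations affect the drug, the lowest residual activity is used.
    """
    effects = resistance_table.get(drug, {})
    return min((effects[m] for m in mutations if m in effects), default=1.0)


def regimen_score(regimen, mutations, resistance_table):
    """Sum of per-drug activities; higher means more predicted active drugs."""
    return sum(drug_activity(d, mutations, resistance_table) for d in regimen)


# hypothetical placeholder values, for illustration only
example_table = {
    "drug_a": {"mut_1": 0.0},
    "drug_b": {"mut_2": 0.5},
}
example_score = regimen_score(
    ["drug_a", "drug_b", "drug_c"], {"mut_1", "mut_2"}, example_table
)
```

candidate regimens can then be ranked by score, with the caveat raised in the text that counting mutations alone may miss functional differences that affect outcomes.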
the effectiveness of combining host and pathogen genetics in a single system or "genetics-squared" has been proven in studies of viral infections (persson and vance ) . investigations of the impact of host genetics on the susceptibility to hiv infection and the rate of disease progression have mainly used a candidate gene approach to reveal associations with a number of different genes. the genome-wide association studies look at the genetic variation across the human genome in order to uncover factors not previously suspected of influencing infection outcomes. for example, this strategy identified variants of the hiv virus associated with differences in the control of viral load at set points and in disease progression. however, unraveling the interaction between the host and microbial genetic factors requires large clinical trials, reinforcing the role of collaborative networks and data repositories. informatics methods have become critical for data mining to decipher links between genetic variation and disease pathogenesis in order to define markers of disease progression, to guide the optimum use of therapeutics and to refine the drug and vaccine development (mansmann ) . a better understanding of the function of genes and other parts of the genome has enabled the reverse engineering approach, which may lead to the characterization and discovery of potential drug targets, vaccine candidates and diagnostic or prognostic markers (davies and flower ; yang et al. b) . proteins with essential biological functions present in multiple pathogens could be the best drug targets. once the target genes essential for pathogen survival are identified, their susceptibility to specific compounds derived from large chemical libraries is examined in silico and in vitro (muzzi et al. ; biswas et al. ). 
increases in the use of electronic medical records and the availability of information technology tools have created opportunities for the automation of surveillance and facilitation of surveillance based on either syndromic or disease-specific signals (amadoz and gonzales-candelas; m'ikanatha et al.). the automation of data collection improves the timeliness and completeness of surveillance and allows infection control professionals to focus on interventions (hota et al.; young and stevenson). the comparison of chromosomal sequences allows the identification of the unique genomic signatures of pathogens for the purposes of infection control and "microbial forensics." molecular typing methodologies, in contrast to classical phenotypic methods, allow the discrimination of variations among strains within a species, the elucidation of the route of contamination, the identification of the source of infection as well as the analysis of epidemics. the identification of the natural reservoir and any possible intermediate hosts of pathogens is critical for understanding the transmission modes, designing a long-term disease control strategy, and preventing future reintroduction. bioinformatics assisted biosurveillance addresses the inefficiencies of traditional surveillance, as well as the need for more timely and comprehensive infectious disease monitoring and control. it leverages recent breakthroughs in the rapid, high-throughput molecular profiling of microorganisms and text mining, as well as the growing electronic body of knowledge about the molecular epidemiology of pathogens with epidemic potential. such a framework combines the genetic and geographic data of a pathogen to reconstruct its history and to identify the migration routes through which the strains spread regionally and internationally (cantón; sintchenko et al. b).
computer-based geographic information systems (gis) have offered an efficient way to visualize the dynamics of the transmission of infections, especially in the setting of a community outbreak (mckee et al.; schreiber et al.). another way to track infectious diseases of public health concern is to monitor health-seeking behavior in the form of queries to online search engines used by the general public or health professionals. epidemics of seasonal influenza in areas with a large population of internet users have been successfully detected using google search data and then correlated with visits to a doctor (ginsberg et al.; brownstein et al.). the advent of news aggregators has led to the development of new disease surveillance tools that can continuously mine, categorize, filter, and visualize multilingual online information about epidemics. the global public health intelligence network (gphin), developed almost a decade ago by health canada in collaboration with who, healthmap (http://www.healthmap.org/en) (fig. . ) or geosentinel (http://www.istm.org/geosentinel/main.html) among many others are examples of such early warning systems. resources for infection prevention and control on the world wide web have been recently reviewed elsewhere (brownstein et al.; johnson et al.).
the reductionist approach to biomedical research focusing on the study of cells and molecules has peaked with the sequencing of the human genome. however, it is becoming increasingly clear that "taking apart" analyses have reached their limit, and the time has perhaps come for integrative science (an and faeder). developments in informatics have been critical in supporting and engaging with both reductionist and integrative paradigms. on one hand, informatics has equipped comparative genomics with tools to scrutinize genes and explore genetic polymorphisms.
on the other hand, informatics has enabled the generation of integrative and testable hypotheses through the discovery of knowledge in databases and through the study of gene-phenotype connections between a pathogen and its host environment. a variety of data sets can be integrated, including the patient's demographics and clinical presentation, the laboratory results, the pathogen's gene regulation and expression, and metabolic maps with different parameters reflecting the phenotypic behavior of a pathogen and host factors. in the early years some skeptics saw informatics-assisted research as a distraction of effort and funding away from traditional hypothesis-driven inquiry. since then, infectious disease informatics has established its status as a platform for hypothesis generation and testing. new breakthroughs in infectious disease informatics (idi) are the result of cross-pollination between different disciplines that use technologies to gather and disseminate knowledge (fig. . ). microbial genome sequence analysis and metagenomics have contributed intriguing new data types and data sources to idi. bioinformatics has brought to idi a range of analytic tools, databases and data standards. conventional health informatics and computer science have provided high performance solutions for data storage, sharing, analysis and visualization as well as clinical terminology libraries, data standards, decision support and technology evaluation frameworks. importantly, the infectious disease informatics community has fed the lessons learnt from the implementation of clinical and public health systems back to the broader audience. as the subsequent chapters of this volume testify, infectious disease informatics is set to lead to the more targeted and effective prevention, diagnosis and treatment of infections through a comprehensive review of the genetic repertoire and metabolic profiles of pathogens.
the post-genomic era offers new opportunities for the efficient discovery of safe and efficacious subunit vaccines by shortcutting the enormous economic burden of the experimental process. our analytical capacity has already become the rate-limiting step in biomedical research. at the same time, it provides an opportunity to apply the engineering paradigm to biomedical research, thereby mandating the development of tools that can dynamically represent a body of current knowledge. however, the simplistic application of brute force computational power to massive reams of biomedical data is unlikely to result in meaningful mechanistic insight. it cannot be overstressed that informatics initiatives should complement "wet laboratory" practices. an iterative loop of discovery and validation between the two methodologies remains the best way forward.
epipath: an information system for the storage and management of molecular epidemiology data from infectious pathogens detailed qualitative dynamic knowledge representation using a bionet gen model of tlr- signaling and preconditioning bioinformatics in microbial biotechnology -a mini review geno pheno: estimating phenotypic drug resistance from hiv- genotypes mycobacterium du jour: what's on tomorrow's menu? microb infect ten years of bacterial genome sequencing: comparative-genomics-based discoveries integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?
key: cord- -ji tvsl authors: jakupciak, john p.; colwell, rita r. title: biological agent detection technologies date: - - journal: mol ecol resour doi: . /j. - . . .x sha: doc_id: cord_uid: ji tvsl the challenge for first responders, physicians in the emergency room, public health personnel, as well as for food manufacturers, distributors and retailers is accurate and reliable identification of pathogenic agents and their corresponding diseases. this is the weakest point in biological agent detection capability today. there is intense research into new molecular detection technologies that could be used for very accurate detection of pathogens of concern to first responders. the need spans sensors for applications as varied as understanding the ecology of pathogenic micro‐organisms, forensics, environmental sampling for detect‐to‐treat applications, biological sensors for ‘detect to warn’ in infrastructure protection, responses to reports of ‘suspicious powders’, and customs and borders enforcement, to cite a few examples. the benefits of accurate detection include saving millions of dollars annually by reducing disruption of the workforce and the national economy and improving delivery of correct countermeasures to those who are most in need of the information to provide protective and/or response measures. the availability of sensitive and cost-effective diagnostic methods is paramount to the success of biological agent detection systems for public health protection and across industry sectors. currently available methods are generally costly and have evolved basically from diagnostic development and applications. these methods range from cell culture to antibody (mason et al. ; gessler et al. ), polymerase chain reaction (pcr; christensen et al. ), microarray (brodie et al. ), and sequencing approaches (margulies et al. ).
partial genome sequencing and comparison with known sequence data, which requires a priori knowledge of the bioagent, is not effective, particularly since bioengineering makes it possible to modify organisms to be more infectious, better at avoiding immune responses or resistant to medical countermeasures (jackson et al. ). critical to the success of biothreat surveillance is the ability to screen for and detect multiple agents rapidly in a single reaction with minimal sample processing (cirino et al. ). traditional microbial typing technologies employed for characterization of pathogenic micro-organisms and monitoring their global spread are often difficult to standardize, are poorly portable, and lack sufficient ease of use, throughput, and automation. a survey published in , reported that emergency and primary care physicians and their local health care systems were not well prepared to respond to potential disease outbreaks or biological attacks, and many believed that more resources should be allocated to equipping a response system. these findings highlight the importance of expanding bioterrorism preparedness efforts to improve the public health system (alexander et al. ). other analyses of bioterrorism preparedness have recommended modernization of the public health response to emergencies and investigations (the century foundation ). similar analyses of technology gaps by groups such as the department of energy and the heritage foundation emphasize the need to improve emergency response, regional coordination and technologies (heritage foundation ). while antibody-based approaches are very widespread, they have many well-recognized limitations, for example, poor quality control of antibodies. antibody production depends in part on cell culture, and limited understanding of the complexity of biological communities contributes to the limited use of methods relying on culture. nonculturable organisms can only be identified by molecular genetic methods.
singleplex, multiplex and reverse transcriptase-pcr (rt-pcr) approaches have become standard methods for most laboratories. furthermore, nucleic acid-based methods (pcr and sequencing) are more sensitive than antibody-based detection systems (lim et al. ). pcr-based methods nevertheless have critical limitations: they depend on a priori knowledge of what sequence to detect in a sample, a dependence further complicated by recent demonstrations of greater variability in genomic sequence than expected. in addition, pcr probe sequence resolution erodes (table , detection methods comparison). a platform for genome identification of a specimen from any source must not only be sensitive and specific, but must also detect a variety of pathogens with high accuracy, including modified or previously uncharacterized agents, and this challenge is daunting when identification must be achieved using nucleic acids in a complex sample matrix. the available devices and instruments are severely limited by the length of time required for analysis, the complexity of the process employed, and the lack of a systems approach, that is, from extraction to identification without a preconceived notion of what may be in the sample (hybridization surfaces, chips or microarrays). it is widely understood by microbial ecologists and, more recently, medical microbiologists that a microbial species encompasses significant variability with, perhaps, a core genome incorporated in a wider, more variable genome. furthermore, the first described strain is designated the 'type species,' but it may not be (and probably is not) the median strain that should serve as the reference strain for sequencing comparison (colwell & liston a, b, c). specialists from the federal bureau of investigation, experts from hospitals and university research centres and key opinion leaders from the private sector agree that the most effective approach for comprehensive genetic variation discovery is sequencing (budowle et al. ).
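the dependence of pcr on a priori sequence knowledge, noted above, can be illustrated with a small in silico sketch. the sequences, the primer and the function names `count_mismatches` and `find_primer_sites` are hypothetical and chosen only for illustration; real primer design also considers melting temperature, 3' end stability and both strands, none of which is modelled here.

```python
def count_mismatches(primer: str, site: str) -> int:
    """Number of positions at which primer and candidate site differ."""
    return sum(1 for a, b in zip(primer, site) if a != b)


def find_primer_sites(template: str, primer: str, max_mismatches: int = 0) -> list[int]:
    """Return 0-based start positions where the primer matches the template
    with at most `max_mismatches` substitutions (same strand only)."""
    n = len(primer)
    hits = []
    for start in range(len(template) - n + 1):
        if count_mismatches(primer, template[start:start + n]) <= max_mismatches:
            hits.append(start)
    return hits


# Hypothetical reference strain and a variant carrying two substitutions
# inside the primer binding site.
reference = "ATGGCGTACCTGAAGCTTGGA"
variant = "ATGGCGTACTTGAACCTTGGA"
primer = "TACCTGAAGC"

print(find_primer_sites(reference, primer))                   # exact site found
print(find_primer_sites(variant, primer))                     # no exact site
print(find_primer_sites(variant, primer, max_mismatches=2))   # found only if mismatches are tolerated
```

in this toy case an assay designed against the reference would miss the variant unless mismatches are tolerated, which is the limitation the text describes for modified or previously uncharacterized agents.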
rapid advances in biological engineering have dramatically impacted the design and capabilities of dna sequencing tools. these have led to an increase in the number of base pairs sequenced per day by more than -fold while the costs have decreased by three orders of magnitude. despite these advances, extension of sequencing technology should include the capacity to distinguish naturally occurring micro-organisms from intentionally distributed pathogens. this represents a considerable challenge because the diversity of the microbial world is effectively unknown. dna sequencing has the capacity to address this need. the emergence of non-sanger sequencing as well as microarrays and flow cells has changed the dna sequencing landscape. development of a genome identification technology that meets the goal of a '$ genome' is in progress, but the goal has not yet been reached. very likely, the engineering challenges and hurdles will be overcome and the basic science for dna processing and detection will eventually be done. given the trends of the past, it is highly probable that a 'eureka' vision will introduce new concepts within a few years. the build-out of genome identification dna sequencing technology in the form of practical instrumentation will be achieved by meeting the critical requirements: accurate long reads, no dependence on template amplification, and the capacity to handle terabytes of data, so as to provide reliable and useful identification of genetic sequences within any unknown sample, whether clinical, environmental, or other type of specimen. with advancement of dna sequencing technology, molecular typing methods based on nucleic acid fingerprints or 'mini-sequencing' are currently being replaced by more sensitive genome-wide single nucleotide polymorphism-based methods, as exemplified by anthrax bacilli (patil et al. ). de novo genome sequence determination is not the maximum capability of sequencing technology.
the technology forecast includes extending the capability to perform metagenomic (environmental) sequencing. rather than focusing on an individually isolated genome, environmental sequencing is aimed at sampling populations of genomes as found, for example, in bodies of water or different types of soil and within or on the surface of the human body. in this way, an analysis of sequence and gene diversity can be obtained from organisms that cannot be cultured using conventional techniques. genome sequencing capability will facilitate the evaluation of 'deep' resequencing methods, which compare different sources of dna across one or a few genes, for example, k-ras gene mutations associated with cancer. the most widely used technology for dna sequencing is a capillary electrophoresis (ce)-based system employing the sanger method. although chemistries and automation advances have made sanger-based dna sequencing easier and faster, the basic technology remains the same. an immense challenge is that of managing the variety and complexity of data types, the hierarchy of biology, and the inevitable need to acquire data by a wide variety of modalities. inexpensive dna sequencing will revolutionize medicine by making personalized treatments possible. rapid genome sequencing is regarded as the next great frontier for science that will allow doctors to determine individual susceptibility to disease and the genetic links to cancer or cardiovascular disease. sanger sequencing is a method introduced by frederick sanger (sanger et al. ). it is remarkable that sanger sequencing-based methods are so ubiquitous and long-lived. the flexibility of the method has been its strongest asset. furthermore, the ability to use genomic dna directly, that is, cloneless libraries, greatly accelerated dna analyses. over the years, instrumentation for dna sequencing has improved dramatically in terms of read length and throughput (fig. ).
dna sequencing methods reaching the market include bead-based, microfluidic-based and microarray-based approaches as well as emerging concepts for sequencing, for example, nanopore-based sequencing (rhee & burns ). unlike sanger sequencing, the use of pyrosequencing for sequencing-by-synthesis does not require fluorescent labelling. incorporation of each dntp is accompanied by release of pyrophosphate, which is converted by sulfurylase into atp; the atp then drives the luciferase-catalysed conversion of luciferin to oxyluciferin, which releases light. however, asynchronous extensions may occur because of slight variations in the dispensing order of the dntps. bridge amplification in a flow cell represents another alternative to sanger sequencing. it employs four-colour tagging, but uses forward-thinking, acid-labile reversible terminator chemistry. synthesis-by-hybridization defines a resequencing method. dna microarrays are the modern, massively parallel version of classic molecular biology hybridization techniques. the technique permits analysis of genetic material (dna) and monitoring of expression changes (rna, in practice measured as cdna) occurring in a biological sample under various conditions. microarrays have been used successfully in various research areas including dna sequencing. while microarray-based approaches enable high throughput, they are limited to genes present in reference databases and hence require pre-selected sequence information, limiting what dna samples can be interrogated (fig. , dna sequencing methods comparison). the genomes of most organisms, from the simplest unicellular organism to more complex species, consist of a variety of genomic landscapes, each with a unique profile of genetic content, gc-richness, high-copy repeats and low-copy repeats (bacolla et al. ). particularly near regions of structurally important sequences, such as the centromere, the genomic landscape can become quite problematic for analysis.
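the pyrosequencing principle described above, in which nucleotides are dispensed one at a time and the light signal is proportional to the number of bases incorporated, can be sketched as an idealized flowgram. the function names `simulate_flowgram` and `call_bases` and the flow order "TACG" are assumptions made for this sketch; noise, incomplete extension and signal decay, which cause real homopolymer errors, are deliberately ignored.

```python
def simulate_flowgram(read: str, flow_order: str = "TACG", max_flows: int = 100) -> list[tuple[str, int]]:
    """Idealized pyrosequencing: for each dispensed nucleotide, the signal is
    the length of the homopolymer run incorporated at the current position
    (0 if the dispensed nucleotide is not the next base)."""
    position = 0
    signals = []
    flow = 0
    while position < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run_length = 0
        while position < len(read) and read[position] == base:
            run_length += 1
            position += 1
        signals.append((base, run_length))
        flow += 1
    return signals


def call_bases(signals: list[tuple[str, int]]) -> str:
    """Reconstruct the read from (nucleotide, signal) pairs."""
    return "".join(base * count for base, count in signals)


signals = simulate_flowgram("TTAGGC")
print(signals)
print(call_bases(signals))
```

in this idealized model the read is recovered exactly; in practice the signal for long homopolymer runs is not cleanly proportional to run length, which is one reason base calling in such runs is error-prone.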
the primary factor contributing to the difficulty in studying these areas within the genome is that clusters of genomic duplications and unusual repeat structures often lie in close proximity. consequently, sequence similarity-based methods of global genome assembly fail to properly assign the correct positions of duplicated sequences. annotation and interpretation of data from current dna sequencing technologies can be difficult, because artificial overlaps form, significant warping of working draft sequences occurs, and numerous gaps appear in the assembly, all of which reduce the overall quality and relevance of the assembly. these effects are further compounded by the absence of unique sequence-tagged sites (sts) or dna landmarks within such regions and a general under-representation of such areas in clone-by-clone sequencing. thus, current dna sequencing instruments are challenged by the presence of large spans of duplicated sequence, which interfere with genome analysis. with faster methods to collect data, more attention has turned to sequence annotation. researchers can quickly generate sequence information and want to annotate their data. an annotated sequence provides a wealth of information about the organism that is not directly obvious from the sequence. it also acts as a standard, giving investigators the ability to work on the same basic gene structures and to compare findings. the annotation process is a challenging task, especially in light of the limited infrastructure and expertise.
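the assembly problem caused by duplicated sequence, described above, can be made concrete with a minimal sketch that lists every k-mer occurring at more than one position. the toy sequence and the function name `repeated_kmers` are hypothetical; real assemblers work with much larger k, both strands and sequencing errors.

```python
from collections import defaultdict


def repeated_kmers(sequence: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer that occurs more than once to its 0-based start positions."""
    positions = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}


# Toy sequence containing the duplicated segment CGTCGTA at two locations.
genome = "AAACGTCGTACCCGTCGTAGGG"

for kmer, starts in repeated_kmers(genome, 6).items():
    print(kmer, starts)
```

a read that falls entirely inside a duplicated segment matches every copy equally well, so similarity alone cannot place it; only reads long enough to span the repeat into unique flanking sequence resolve the ambiguity.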
new opportunities to develop faster and less expensive methods for sequencing dna are pushing genome sequencing technology to provide answers on:
• information on the nature and source of a sample
• effective data collection for comparison of samples (from known and probable locations)
• confidence in data comparison
• sample divergence from a common ancestor (the mechanism of the variation or the heterotachy of the mutation)
• genome-wide information analysis, including evolutionary distance to enable the measurement of the variation
• evidence of genetic engineering
• all or partial data exclusion as a contaminant or failure in sample handling.
a major obstacle to identification of micro-organisms is having a reference or comparison strain. for all named microbial species, there is a requirement for deposition of the culture in a collection so that others may have the reference strain. research over the last years has shown that the reference strain rarely is the 'median' strain and often is an outlier to the species. furthermore, recent genome sequence data show that strains passaged several times in media will display single nucleotide polymorphism (snp) differences and that multiple isolates of a single species will not be identical base pair by base pair. this reality must be dealt with in identifying pathogens, especially isolates or even the nucleic acid from a natural environment. the high-throughput nature of the sequencing device inevitably produces sequencing errors and limits data quality. the errors in standard genome sequencing projects can be reduced by applying efficient genome assembly techniques. potential sequencing errors can also be further minimized by posterior computational processes (gajer et al. ; huse et al. ). in october , the use of the us mail system as a method for disseminating a weapon set off a national incident for biological agent detection.
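the snp differences between isolates and the notion of evolutionary distance mentioned in the list above can be illustrated with a minimal sketch that compares two already aligned sequences. the sequences and the function names `snp_positions` and `p_distance` are hypothetical; the p-distance used here (proportion of differing sites) is the simplest distance measure and applies no correction for multiple substitutions.

```python
def snp_positions(a: str, b: str) -> list[int]:
    """0-based alignment columns where two aligned sequences differ.
    Columns containing a gap ('-') in either sequence are ignored."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y and x != "-" and y != "-"]


def p_distance(a: str, b: str) -> float:
    """Proportion of compared (gap-free) columns that differ."""
    compared = sum(1 for x, y in zip(a, b) if x != "-" and y != "-")
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return len(snp_positions(a, b)) / compared


# Hypothetical aligned fragments from a reference strain and a passaged isolate.
reference_strain = "ATGCCGTA-A"
passaged_isolate = "ATGTCGTAGA"

print(snp_positions(reference_strain, passaged_isolate))
print(round(p_distance(reference_strain, passaged_isolate), 3))
```

because even isolates of one species differ at some positions, a practical identification scheme has to decide how large a distance still counts as the same organism, rather than demanding base-by-base identity with a single reference.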
traditional microbiological laboratory methods have served as the standard for identification of viable pathogens, but are time-consuming and labour intensive. the sheer number of pathogens and their complex biology, diversity and capacity to exchange genetic material complicate interpretation of current constrained data collection. moreover, it is estimated that % of microbial species cannot be cultured (torsvik et al. ; amann et al. ; zak & visser ; ranjard et al. ; bridge & spooner ; anderson & cairney ; harayama et al. ). even when culturing is effective, the long process is a limitation for investigators who need rapid sample analysis-to-answer. to accomplish this, data acquisition must distinguish individual isolates from similar samples to the most precise level possible (ideally to a single source). an example of an unexpected and deadly pathogen is the severe acute respiratory syndrome (sars) virus. the natural hosts of the virus are thought to be wild civets and bats (guan et al. ; li et al. ). the epidemic of sars appears to have originated in guangdong province, china in november , although the chinese government neither disclosed the information outside china nor informed the world health organization of the outbreak of a hitherto unknown infectious disease. the lack of a proper diagnostic method for sars resulted in a public health crisis in (bloom ). right after sars was recognized as a potential threat to global health, several leading laboratories were brought in to identify the causal agent. initial electron microscopic examination in hong kong and germany found viral particles with structures suggesting paramyxovirus in respiratory secretions of sars patients (hassler et al. ). in contrast, chinese researchers reported that a chlamydia-like agent may be behind sars.
the mutation rates of rapidly evolving microbial genomes can be up to in bp. with virus production as rapid as virions per day, genetic diversity is very high (neumann et al. ) and measurable, because even viral samples taken close together in time are unlikely to harbour identical genomes. furthermore, because of recombination, insertion sequences, rearrangements or gene duplications, the genome size of isolates from the same species can be different (swiecicka ). following on the heels of the publication of the sequence analysis of human dna comes the realization that there is more-than-expected variation (sutton et al. ). this observation, combined with the recent demonstration from the j. craig venter institute (jcvi) of the first transplant of a bacterial genome, heightens the need for genome identification technology. the technology must be able to distinguish human variation and to detect engineered organisms. genetic material transplant is not just gene transfer, but the transfer of an entire chromosome. thus, the outer shell of the organism no longer represents its genomic content; this is the trojan horse of micro-organism population genetics. the surface appearance cloaks the hidden genetic content. in the case of sars, detection was based on the physical appearance of the unknown biothreat, but today and in the future, organism identification will require genome identification. genome transplantation is an essential enabling step in the field of synthetic genomics as it is a key mechanism by which chemically synthesized chromosomes can be activated within a cell. the ability to transfer naked dna isolated from one species into a second microbial species paves the way for subsequent experiments to transplant a fully synthetic bacterial chromosome into a living organism. current genome identification strategies use a reference to detect the next unknown. forward-looking strategies encompass not just the detection of a target or key signature, but also the characterization of the population and the identification of genomes in a mixture in an environment.
approaches based on antibodies, pcr probes or microarray probes can only capture information on predicted answers. sequencing based on a metagenomic approach has established the capability to trace or identify an organism as a causative agent (constantin et al. ) and can further distinguish between nontoxigenic and pathogenic versions (mohapatra et al. ). because biodiversity is large both between and within species, sequencing provides a solution for bioweapon detection, pathogen identification, prediction of disease outbreaks and, ultimately, true personalized medicine. at first glance, the extent of biodiversity could seem to be too much of a challenge to fully characterize; however, sequencing technologies can measure that diversity and accurately bin and distinguish organisms based on population genetics principles, mutation rates, genome stability, host interactions, genetic mobility, microbial ecology and population dynamics. comparative genomics will provide genome identification because it is based on identification of populations and not on a technique that tries to find the best match to a pre-defined reference. the genome revolution leads far beyond an instrument to meet the national human genome research institute goal of a complete human genome for $ . this goal challenges the technical community, is machine independent, and does not include the more challenging specifications required of a device to identify micro-organisms. recent market trends have resulted in a large number of diverse approaches to sequencing, represented by a variety of commercial and academic research centres, each of which must dedicate teams of scientists, technicians and engineers to its singular goal; it is likely that these trends will move in the opposite direction.
nevertheless, we can expect to consolidate the independent efforts into large collaborative efforts over the next several years with sharper focus on the identification of all life forms and characterization of their populations.
physicians preparedness for bioterrorism and other public health priorities
phylogenetic identification and in situ detection of individual microbial cells without cultivation
diversity and ecology of soil fungal communities
long homopurine homopyrimidine sequences are characteristic of genes expressed in brain and the pseudoautosomal region
soil fungi: diversity and detection
urban aerosols harbor diverse and dynamic bacterial populations
genetic analysis and attribution of microbial forensics evidence
detection of biological threat agents by real-time pcr: comparison of assay performance on the rapid, the lightcycler, and the smart cycler platforms
multiplex diagnostic platforms for detection of biothreat agents
taxonomy of xanthomonas and pseudomonas
taxonomic analysis with the electronic computer of some xanthomonas and pseudomonas species
taxonomic relationships among the pseudomonads
environmental signatures associated with cholera epidemics. proceedings of the national academy of sciences
automated correction of genome sequence errors
evaluation of lateral flow assays for the detection of botulinum neurotoxin type a and their application in lab diagnosis of botulism
isolation and characterization of viruses related to the sars coronavirus from animals in southern china
microbial communities in oil-contaminated seawater
heritage foundation. empowering america: a proposal for enhancing regional preparedness
accuracy and quality of massively-parallel dna pyrosequencing
expression of mouse interleukin by a recombinant ectromelia virus suppresses cytolytic lymphocyte responses and overcomes genetic resistance to mousepox
bats are natural reservoirs of sars-like coronaviruses
current and developing technologies for monitoring agents of bioterrorism and biowarfare
genome sequencing in open microfabricated high density picoliter reactions
taxonomic identification of microorganisms by capture and intrinsic fluorescence detection
determination of relationships among non-toxigenic vibrio cholerae o biotype el tor strains from housekeeping gene sequences and ribotype patterns
hepatitis c viral dynamics in vivo and the anti-viral efficacy of interferon therapy
blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome
monitoring complex bacterial communities using culture-independent molecular techniques
nanopore sequencing technology: nanopore preparations
dna sequencing with chain-terminating inhibitors
the diploid genome sequence of an individual human
molecular typing by pulsed-field gel electrophoresis of bacillus thuringiensis from root voles
the century foundation. 'are bioterrorism dollars making us safer?'
high diversity in dna of soil bacteria
an appraisal of soil fungal biodiversity
the authors have no conflict of interest to declare and note that the funders of this research had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
key: cord- - ubh u r authors: nelson, oranmiyan w.; garrity, george m. title: genome sequences published outside of standards in genomic sciences, july - october date: - - journal: stand genomic sci doi: . /sigs.
sha: doc_id: cord_uid: ubh u r the purpose of this table is to provide the community with a citable record of ongoing genome sequencing projects that have led to a publication in the scientific literature. while our goal is to make the list complete, there is no guarantee that we have not omitted one or more publications appearing in this time frame. readers and authors who wish to have publications added to subsequent versions of this list are invited to provide the bibliographic data for such references to the sigs editorial office. aeromonas aquariorum, sequence accession bafl through bafl , and ap [ ] aerococcus viridans ll , sequence accession ajtg [ ] bacillus anthracis h , sequence accession cp . (chromosome), cp . . (plasmid pxo ), and cp . (plasmid pxo ) [ ] bacillus atrophaeus c , sequence accession ajrj [ ] bacillus cereus nc , sequence accession ap (chromosome), ap (plasmid pnccld), ap (plasmid pnc , kb), ap (plasmid pnc , kb), ap (plasmid pnc , kb), and ap (plasmid pnc , kb) [ ] bacillus siamensis kctc t , sequence accession ajvf [ ] bacillus sp. strain b , sequence accession ajst [ ] bacillus sp. strain , sequence accession afsu [ ] citreicella aestuarii strain , sequence accession ajkj [ ] clostridium beijerinckii strain g , sequence accession akwa [ ] corynebacterium pseudotuberculosis strain / -a, sequence accession cp [ ] enterobacter sp. isolate ag , sequence accession akxm [ ] enterococcus faecalis d , sequence accession cp through cp [ ] enterococcus faecalis strain np- , sequence accession ab [ ] enterococcus hirae (streptococcus faecalis) atcc , sequence accession cp (chromosome), nc_ (plasmid ptg ) [ ] "geobacillus thermoglucosidans" tno- . 
, sequence accession ajjn [ ] lactococcus garvieae ipla , sequence accession akfo [ ] lactobacillus mucosae lm , sequence accession ahit [ ] lactobacillus rossiae dsm , sequence accession akzk [ ] paenibacillus polymyxa osy-df, sequence accession aipp [ ] pediococcus pentosaceus strain ie- , sequence accession cahu through cahu [ ] pelosinus fermentans a , sequence accession akvm [ ] pelosinus fermentans b , sequence accession akvj [ ] pelosinus fermentans jbw , sequence accession akvo [ ] pelosinus fermentans r , sequence accession akvn [ ] planococcus antarcticus dsm , sequence accession ajyb [ ] pseudomonas stutzeri strain jm , sequence accession cp [ ] rhodococcus sp. strain dk , sequence accession ajlq [ ] staphylococcus aureus strain lct-sa , sequence accession ajlp [ ] staphylococcus capitis qn , sequence accession ajtg [ ] staphylococcus equorum subsp. equorum mu , sequence accession cajl to cajl [ ] staphylococcus hominis zbw , sequence accession akgc [ ] staphylococcus saprophyticus subsp. saprophyticus m - , sequence accession ahkb [ ] streptococcus mutans gs- , sequence accession cp [ ] streptococcus pyogenes m , sequence accession ap [ ] streptococcus salivarius ps , sequence accession ajfw [ ] streptococcus thermophilus strain mn-zlw- , sequence accession cp [ ] ureibacillus thermosphaericus strain thermo-bf, sequence accession ajik [ ] phylum tenericutes mycoplasma leachii strain pg t , sequence accession cp . [ ] mycoplasma mycoides subsp. mycoides, sequence accession cp . [ ] mycoplasma wenyonii strain massachusetts, sequence accession cp [ ] phylum actinobacteria actinomyces massiliensis strain t , sequence accession akio [ ] bifidobacterium animalis subsp. lactis b , sequence accession cp [ ] bifidobacterium animalis subsp.
lactis bi- , sequence accession cp [ ] bifidobacterium bifidum strain bgn , sequence accession cp [ ] brevibacterium massiliense strain t , sequence accession cajd [ ] corynebacterium bovis dsm , sequence accession aenj [ ] corynebacterium diphtheriae biovar intermedius nctc , sequence accession ajvh [ ] corynebacterium pseudotuberculosis strain / -a, sequence accession cp [ ] corynebacterium pseudotuberculosis strain / - , sequence accession cp . [ ] corynebacterium pseudotuberculosis strain / -a, sequence accession cp [ ] microbacterium yannicii, sequence accession cajf through cajf [ ] micromonospora lupini lupac , sequence accession caie [ ] mycobacterium bolletii strain m , sequence accession ajly [ ] mycobacterium intracellulare clinical strain mott- y, sequence accession cp [ ] mycobacterium massiliense m , sequence accession ajsc [ ] mycobacterium massiliense strain go , sequence accession cp [ ] mycobacterium massiliense strain m , sequence accession ajma [ ] mycobacterium tuberculosis rgtb , sequence accession cp [ ] mycobacterium tuberculosis mtb , sequence accession cp [ ] parascardovia denticolens ipla , sequence accession akii [ ] saccharothrix espanaensis dsm t , sequence accession he [ ] streptomyces auratus strain agr , sequence accession ajgv [ ] "streptomyces cattleya" dsm t , sequence accession fq and fq [ ] streptomyces globisporus c- , sequence accession ajuo [ ] streptococcus mutans gs- , sequence accession cp [ ] streptomyces sp. strain aa , sequence accession alap [ ] streptomyces sulphureus l , sequence accession ajtq [ ] phylum spirochaetes borrelia crocidurae, sequence accession cp (chromosome), cp to cp (plasmids) [ ] treponema sp. strain jc , sequence accession jq [ ] phylum bacteroidetes flavobacterium sp. strain f , sequence accession akzq [ ] fusobacterium nucleatum subsp. 
fusiforme atcc t , sequence accession akxi [ ] "imtechella halotolerans" k t , sequence accession ajju [ ] standards in genomic sciences phage clp , sequence accession jn [ ] pseudomonas aeruginosa siphophage mp , sequence accession jx [ ] pseudomonas aeruginosa temperate phage mp , sequence accession eu [ ] pseudomonas aeruginosa temperate phage mp , sequence accession jq [ ] pseudomonas phage Φ-s , sequence accession jx [ ] siphophage mp , sequence accession jx [ ] staphylococcus aureus bacteriophage gh , sequence accession jq [ ] vibrio vulnificus bacteriophage ssp , sequence accession jq [ ] african bovine rotaviruses rva/cowwt/zaf/ / /g p, sequence accession s (vp ) jn , s (vp ) jn , s (vp ) jn , s (vp ) jn , s (vp ) jn , s (vp ) jn , s (nsp ) jn , s (nsp ) jn , s (nsp ) jn , s (nsp ) jn , s (nsp ) jn [ ] african bovine rotaviruses rva/cowwt/zaf/ / /g p, sequence accession s (vp ) jn , s (vp ) jn , s (vp ) jn , s (vp ) jn , s (vp ) jn , s (vp ) jn , s (nsp ) jn , s (nsp ) jn , s (nsp ) jn , s (nsp ) jn , s (nsp ) jn [ ] african bovine rotaviruses rva/cowwt/zaf/ / /g p, sequence accession s (vp ) jn , s (vp ) jn , s (vp ) jn , s (vp ) jn , s (vp ) jn , s (vp ) jn , s (nsp ) jn , s (nsp ) jn , s (nsp ) jn , s (nsp ) jn , s (nsp ) jn [ ] avian leukosis virus, sequence accession jx [ ] avian influenza virus h n , sequence accession jx through jx [ ] avian influenza virus h n , sequence accession jq through jq [ ] avian-like h n swine influenza, sequence accession jx through jx [ ] avian paramyxovirus, sequence accession jq [ ] avian tembusu-related virus strain wr, sequence accession jx [ ] bluetongue virus serotype , sequence accession jx to jx [ ] bluetongue virus serotype , sequence accession [ ] bombyx mori nucleopolyhedrovirus, sequence accession jq [ ] bovine viral diarrhea virus , sequence accession jf [ ] bovine foamy viruses, sequence accession jx [ ] canine noroviruses, sequence accession fj and fj [ ] chicken anemia virus, sequence accession jx [ ] 
chikungunya virus, sequence accession jx [ ] chinese virulent avian coronavirus gx-yl , sequence accession hq [ ] chinese virulent avian coronavirus gx-yl , sequence accession hq [ ] coxsackievirus b , sequence accession jx [ ] enterovirus c (hev-c ), sequence accession jx [ ] genotype hepatitis e virus strain, sequence accession jq [ ] h n avian influenza virus, sequence accession jq to jq [ ] h n subtype influenza virus fjg , sequence accession jf . , jn . through jn . [ ] . herpes simplex virus strain mckrae, sequence accession jx [ ] human coronavirus nl , sequence accession jx [ ] human g p rotavirus, sequence accession ab through ab [ ] ikoma lyssavirus, sequence accession jx [ ] korean sacbrood viruses amsbv-kor , sequence accession jq [ ] korean sacbrood viruses amsbv-kor , sequence accession jq [ ] mitochondrion of frankliniella occidentalis, sequence accession jn [ ] new circular dna virus from grapevine, sequence accession jq [ ] novel porcine epidemic diarrhea virus, sequence accession jx [ ] pararetrovirus, sequence accession jq [ ] parechovirus, sequence accession jx [ ] peste des petits ruminants virus, sequence accession jx [ ] polyomavirus, sequence accession jq [ ] porcine circovirus b strain cc , sequence accession jq [ ] porcine circovirus type (pcv ), sequence accession jx [ ] porcine epidemic diarrhea virus strain aj , sequence accession jx [ ] porcine [ ] waterfowl aviadenovirus goose adenovirus , sequence accession jf [ ] plant genomes plants cpdna of smilax china, sequence accession nc_ [ ] elodea canadensis, sequence accession jq [ ] ogura-type mitochondrial genome, sequence accession ab [ ] fungus aspergillus oryzae strain . 
, sequence accession akhy [ ] rhodosporidium toruloides mtcc , sequence accession ajmj [ ] animal genomes helicoverpa armigera, sequence accession hq [ ] plasmids plasmidincn plasmid prsb , sequence accession jn [ ] plasmidincn plasmid prsb , sequence accession jn [ ] plasmidincn plasmid prsb , sequence accession jn [ ] plasmidincn plasmid prsb , sequence accession jn [ ] complete genome sequence of the hyperthermophilic cellulolytic crenarchaeon "thermogladius cellulolyticus" complete genome sequence of methanomassiliicoccus luminyensis, the largest genome of a human-associated archaea species isolated from a deep-sea hydrothermal sulfide chimney on the juan de fuca ridge complete genome sequence of leptospirillum ferrooxidans strain c - , isolated from a fresh volcanic ash deposit on the island of miyake sequence analysis of a complete . mb prochlorococcus marinus med genome cloned in yeast genome sequence of acinetobacter sp. strain ha, isolated from the gut of the polyphagous insect pest helicoverpa armigera draft genome sequence of the hydrocarbon-degrading and emulsan-producing strain acinetobacter venetianus rag- t genome sequence and mutational analysis of plant-growth-promoting bacterium agrobacterium tumefaciens ccnwgs isolated from a zinc-lead mine tailing draft genome sequence of alcaligenes faecalis subsp. 
faecalis ncib (ccug ) genome sequence of pectin-degrading alishewanella agri, isolated from landfill soil genome sequence of pectin-degrading alishewanella aestuarii strain b t, isolated from tidal flat sediment genome sequence of thermotolerant bacillus methanolicus: features and regulation related to methylotrophy and production of l-lysine and l-glutamate from methanol draft genome sequence of the sulfur-oxidizing bacterium "candidatus sulfurovum sediminum" ar, which belongs to the epsilonproteobacteria genome sequence of bartonella birtlesii, a bacterium isolated from small rodents of the genus apodemus complete genome sequence of brucella abortus a , a new strain isolated from the fetal gastric fluid of dairy cattle complete genome sequence of brucella canis strain hsk a , isolated from the blood of an infected dog genome sequence of brucella melitensis s , an isolate of sequence type , prevalent in china genome sequences of brucella melitensis m and its two derivatives m w and m w, which evolved in vivo complete genome sequence of the endophytic bacterium burkholderia sp. 
strain kj draft genome sequence of the soil bacterium burkholderia terrae strain bs , which interacts with fungal surface structures revised genome sequence of burkholderia thailandensis msmb with improved annotation complete genome sequence of the opportunistic food-borne pathogen cronobacter sakazakii es genome sequence of the rice pathogen dickeya zeae strain zju genome sequence of the plant growth-promoting bacterium enterobacter cloacae gs genome sequence of enterococcus faecium clinical isolate lct-ef genome sequence of enterobacter radicincitans dsm t, a plant growth-promoting endophyte genome sequence of escherichia coli j , a reference strain for genetic studies draft genome sequence of escherichia coli lct-ec complete genome sequence of klebsiella oxytoca e , a new delhi metallo-β-lactamase- -producing nosocomial strain complete genome sequences of six strains of the genus methylobacterium genome sequence of methylobacterium sp. strain gxf , a xylem-associated bacterium isolated from vitis vinifera l. grapevine complete genome sequences of methylophaga sp. strain jam and methylophaga sp. strain jam genome sequence of radiation-resistant modestobacter marinus strain bc , a representative actinobacterium that thrives on calcareous stone surfaces genome sequence of mycobacterium massiliense m , isolated from a lymph node biopsy specimen genome sequence of a neisseria meningitidis capsule null locus strain from the clonal complex of sequence type genome sequence of novosphingobium sp strain rr - , a nopaline crown gall-associated bacterium isolated from vitis vinifera l. 
grapevine complete genome sequence of providencia stuartii clinical isolate mrsn whole-genome shotgun sequence of the sulfur-oxidizing chemoautotroph pseudaminobacter salicylatoxidans kct genome sequence of pseudomonas aeruginosa strain sjtd- , a bacterium capable of degrading long-chain alkanes and crude oil genome sequence of the lactate-utilizing pseudomonas aeruginosa strain xmg genome sequence of the rice pathogen pseudomonas fuscovaginae cb draft genome sequence of arctic marine bacterium pseudoalteromonas issachenkonii pamc draft genome sequence of high-siderophore-yielding pseudomonas sp. strain hys genome sequence of the polychlorinated-biphenyl degrader pseudomonas pseudoalcaligenes kf complete genome sequence of the naphthalene-degrading pseudomonas putida strain nd genome sequence of pseudomonas putida strain sjte- , a bacterium capable of degrading estrogens and persistent organic pollutants draft genome sequence of pseudomonas sp. strain m t , carried by bursaphelenchus xylophilus isolated from pinus pinaster genome sequence of the moderately halotolerant, arsenite-oxidizing bacterium pseudomonas stutzeri ts genome sequence of ralstonia sp. strain pba, a bacterium involved in the biodegradation of -aminobenzenesulfonate genome sequences for six rhodanobacter strains, isolated from soils and the terrestrial subsurface, with variable denitrification capabilities genome sequence of rickettsia conorii subsp. caspia, the agent of astrakhan fever genome sequence of rickettsia australis, the agent of queensland tick typhus draft genome sequence of rickettsia sp. strain meam , isolated from the whitefly bemisia tabaci genome sequence of rickettsia conorii subsp. 
israelensis, the agent of israeli spotted fever draft genome sequence of the antagonistic rhizosphere bacterium serratia plymuthica strain pri- c draft genome sequence of serratia marcescens strain lct-sm complete genome sequence of the broad-host-range strain sinorhizobium fredii usda genome sequence of sphingobium indicum b a, a hexachlorocyclohexane-degrading bacterium genome sequence of stenotrophomonas maltophilia pml , which displays baeyer-villiger monooxygenase activity draft genome sequences of eight salmonella enterica serotype newport strains from diverse hosts and locations whole-genome sequences and comparative genomics of salmonella enterica serovar typhi isolates from patients with fatal and nonfatal typhoid fever in papua new guinea draft genome sequence of serratia sp. strain m t , isolated from pinewood disease nematode bursaphelenchus xylophilus draft genome sequence of a psychrotolerant sulfur-oxidizing bacterium, sulfuricella denitrificans skb , and proteomic insights into cold adaptation genome sequence of xanthomonas campestris jx, an industrially productive strain for xanthan gum draft genome sequence of yersinia pestis strain , an isolate from the great gerbil plague focus in xinjiang genome sequence of a novel human pathogen, aeromonas aquariorum genome sequence of aerococcus viridans ll complete genome sequence of bacillus anthracis h , an isolate from a korean patient with anthrax draft genome sequence of the sponge-associated strain bacillus atrophaeus c , a potential producer of marine drugs complete genome sequence of bacillus cereus nc , which produces high levels of the emetic toxin cereulide draft genome sequence of the plant growth-promoting bacterium bacillus siamensis kctc t isolated from a cherry tree genome sequence of the plant growth-promoting rhizobacterium bacillus sp. 
strain draft genome sequence of citreicella aestuarii strain , a member of the roseobacter clade isolated without xenobiotic pressure from a petroleum-polluted beach draft genome sequence of butanol-acetone-producing clostridium beijerinckii strain g complete genome sequence of corynebacterium pseudotuberculosis strain / -a, isolated from a horse in north america draft genome sequences of enterobacter sp. isolate ag from the midgut of the malaria mosquito anopheles gambiae complete genome sequence of the porcine isolate enterococcus faecalis d complete genome sequence of bacteriophage bc- specifically infecting enterococcus faecalis strain np- genome sequence of enterococcus hirae (streptococcus faecalis) atcc , a model organism for the study of ion transport, bioenergetics, and copper homeostasis complete genome sequence of geobacillus thermoglucosidans tno- . , a thermophilic sporeformer associated with a dairy-processing environment genome sequence of lactococcus garvieae ipla , a bacteriocin-producing, tetracycline-resistant strain isolated from a raw-milk cheese genome sequence of lactobacillus mucosae lm , isolated from piglet feces draft genome sequence of lactobacillus rossiae dsm t draft genome sequence of paenibacillus polymyxa osy-df, which coproduces a lantibiotic, paenibacillin, and polymyxin e genome sequence of pediococcus pentosaceus strain ie- draft genome sequences for two metal-reducing pelosinus fermentans strains isolated from a cr(vi)-contaminated site and for type strain r draft genome sequence of pelosinus fermentans jbw , isolated during in situ stimulation for cr(vi) reduction draft genome sequences for two metal-reducing pelosinus fermentans strains isolated from a cr(vi)-contaminated site and for type strain r genome sequence of the antarctic psychrophile bacterium planococcus antarcticus dsm genome sequence of pseudomonas stutzeri strain jm (dsm ), a soil isolate and model organism for natural transformation draft genome sequence and 
comparative analysis of the superb aromatic-hydrocarbon degrader rhodococcus sp. strain dk whole-genome sequence of staphylococcus aureus strain lct-sa genome sequence of staphylococcus capitis qn , which causes infective endocarditis genome sequence of staphylococcus equorum subsp. equorum mu , isolated from a french smear-ripened cheese whole-genome sequence of staphylococcus hominis, an opportunistic pathogen draft genome sequence of staphylococcus saprophyticus subsp. saprophyticus m - , isolated from the gills of a korean rockfish, sebastes schlegeli hilgendorf, after high hydrostatic pressure processing complete genome sequence of streptococcus mutans gs- , a serotype c strain complete genome sequence of streptococcus pyogenes m , isolated from a patient with streptococcal toxic shock syndrome complete genome sequence of streptococcus salivarius ps , a strain isolated from human milk complete genome sequence of streptococcus thermophilus strain mn-zlw- draft genome sequence of ureibacillus thermosphaericus strain thermo-bf, isolated from ramsar hot springs in iran complete genome sequences of mycoplasma leachii strain pg t and the pathogenic mycoplasma mycoides subsp. mycoides small colony biotype strain gladysdale complete genome sequence of mycoplasma wenyonii strain massachusetts draft genome sequence of actinomyces massiliensis strain t complete genome sequences of probiotic strains bifidobacterium animalis subsp. 
lactis b and bi- complete genome sequence of the probiotic bacterium bifidobacterium bifidum strain bgn draft genome sequence of brevibacterium massiliense strain t draft genome sequence of corynebacterium bovis dsm , which causes clinical mastitis in dairy cows draft genome sequence of corynebacterium diphtheriae biovar intermedius nctc complete genome sequence of corynebacterium pseudotuberculosis strain / -a, isolated from a horse in north america complete genome sequences of corynebacterium pseudotuberculosis strains / - and / -a, isolated from sheep in scotland and australia, respectively genome sequence of microbacterium yannicii, a bacterium isolated from a cystic fibrosis patient genome sequence of micromonospora lupini lupac , isolated from root nodules of lupinus angustifolius draft genome sequence of mycobacterium bolletii strain m , a rapidly growing mycobacterium of contentious taxonomic status complete genome sequence of mycobacterium intracellulare clinical strain mott- y, belonging to the int genotype genome sequence of mycobacterium massiliense m , isolated from a lymph node biopsy specimen complete genome sequence of mycobacterium massiliense annotated genome sequence of mycobacterium massiliense strain m , belonging to the recently created taxon mycobacterium abscessus subsp. bolletii comb. 
nov whole-genome sequences of two clinical isolates of mycobacterium tuberculosis from kerala, south india genome sequence of parascardovia denticolens ipla , isolated from human breast milk complete genome sequence of saccharothrix espanaensis dsm t and comparison to the other completely sequenced pseudonocardiaceae insights into fluorometabolite biosynthesis in streptomyces cattleya dsm through genome sequence and knockout mutants draft genome sequence of streptomyces globisporus c- , which produces an antitumor antibiotic consisting of a nine-membered enediyne with a chromoprotein complete genome sequence of streptococcus mutans gs- , a serotype c strain draft genome sequence of the marine streptomyces sp. strain aa , isolated from the yellow sea draft genome sequence of the marine actinomycete streptomyces sulphureus l , isolated from marine sediment complete genome sequence of borrelia crocidurae draft genome sequence of treponema sp. strain jc , a novel spirochete isolated from the bovine rumen draft genome sequence of flavobacterium sp. strain f , isolated from the rhizosphere of bell pepper (capsicum annuum l. cv. maccabi) draft genome sequence of fusobacterium nucleatum subsp. fusiforme atcc t genome sequence of the halotolerant bacterium imtechella halotolerans k t genome sequence of a novel actinophage pis isolated from a strain of saccharomonospora sp complete genome sequence of aeromonas hydrophila phage cc complete genome sequence of bacteriophage bc- specifically infecting enterococcus faecalis strain np- complete genome sequence of bacteriophage ssu specific for salmonella enterica serovar typhimurium rough strains genome sequence of blattabacterium sp. 
strain bgiga, endosymbiont of the blaberus giganteus cockroach complete genome sequence of caulobacter crescentus bacteriophage φcbk complete genome sequence of celeribacter bacteriophage p l complete genome sequence of croceibacter bacteriophage p s complete genome sequence of cronobacter sakazakii temperate bacteriophage phies complete genome sequence of marinomonas bacteriophage p complete genome sequence of phytopathogenic pectobacterium carotovorum subsp. carotovorum bacteriophage pp complete genome sequences of two persicivirga bacteriophages, p s and p l genome sequence of the phage clp , which infects the beer spoilage bacterium pediococcus damnosus complete genome sequence of pseudomonas aeruginosa siphophage mp complete genome sequences of two pseudomonas aeruginosa temperate phages, mp and mp , which lack the phage-host crispr interaction genome sequence of the broad-host-range pseudomonas phage Φ-s complete genome sequence of pseudomonas aeruginosa siphophage mp complete genome sequence of staphylococcus aureus bacteriophage gh complete genome sequence of vibrio vulnificus bacteriophage ssp whole genome sequence analyses of three african bovine rotaviruses reveal that they emerged through multiple reassortment events between rotaviruses from different mammalian species complete genome sequence of an avian leukosis virus isolate associated with hemangioma and myeloid leukosis in egg-type and meat-type chickens genome sequence of a novel reassortant h n avian influenza virus in southern china complete genome sequence of an h n avian influenza virus isolated from a parrot in southern china complete genome sequence of an avian-like h n swine influenza virus discovered in southern china complete genome sequence of a novel avian paramyxovirus complete genome sequence of avian tembusu-related virus strain wr isolated from white kaiya ducks in fujian complete genome sequence of bluetongue virus serotype : implications for serotyping complete genome sequence of 
bluetongue virus serotype of goat origin from india genome sequence of a bombyx mori nucleopolyhedrovirus strain with cubic occlusion bodies complete genome sequence of a bovine viral diarrhea virus from commercial fetal bovine serum complete genome sequences of two novel european clade bovine foamy viruses from germany and poland complete genome sequences of novel canine noroviruses in hong kong complete genome sequence analysis of a recent chicken anemia virus isolate and comparison with a chicken anemia virus isolate from human fecal samples in china complete genome sequence of a chikungunya virus isolated in guangdong complete genome sequences of two chinese virulent avian coronavirus infectious bronchitis virus variants complete genome sequence of a recombinant coxsackievirus b from a patient with a fatal case of hand, foot, and mouth disease in guangxi complete genome sequence of a novel human enterovirus c (hev-c ) identified in a child with community-acquired pneumonia complete genome sequence of the genotype hepatitis e virus strain prevalent in swine in jiangsu province, china, reveals a close relationship with that from the human population in this area complete genome sequence of an h n avian influenza virus isolated from a live bird market in southern china complete genome sequence of a novel h n subtype influenza virus fjg strain in china reveals a natural reassortant event characterization and complete genome sequence of human coronavirus nl isolated in china whole genome sequence analyses of three african bovine rotaviruses reveal that they emerged through multiple reassortment events between rotaviruses from different mammalian species complete genome sequence of ikoma lyssavirus analysis of the complete genome sequence of two korean sacbrood viruses in the honey bee, apis mellifera the complete mitochondrial genome sequence of the western flower thrips frankliniella occidentalis (thysanoptera: thripidae) contains triplicate putative control 
regions genome sequence of methylobacterium sp. strain gxf , a xylem-associated bacterium isolated from vitis vinifera l. grapevine complete genome sequence of porcine epidemic diarrhea virus strain aj isolated from a suckling piglet with acute diarrhea in china complete genome sequence of a novel pararetrovirus isolated from soybean complete genome sequence of a novel type of human parechovirus strain reveals natural recombination events complete genome sequence of a peste des petits ruminants virus recovered from wild bharal in tibet complete genome sequence of a polyomavirus isolated from horses complete genome sequence of a novel field strain of rearranged porcine circovirus type in southern china complete genome sequence of a novel field strain of rearranged porcine circovirus type in southern china complete genome sequence of porcine epidemic diarrhea virus strain aj isolated from a suckling piglet with acute diarrhea in china complete genome sequence of a novel porcine sapelovirus strain yc isolated from piglets with diarrhea complete genome sequence of porcine reproductive and respiratory syndrome virus strain qy reveals a novel subgroup emerging in china genome sequences of sat foot-and-mouth disease viruses from egypt and palestinian autonomous territories (gaza strip) complete genome sequence of a street rabies virus from mexico genome sequence of a waterfowl aviadenovirus, goose adenovirus jenny) xiang q-y. complete cpdna genome sequence of smilax china and phylogenetic placement of liliales -influences of gene partitions and taxon sampling complete chloroplast genome sequence of elodea canadensis and comparative analyses with other monocot plastid genomes a complete mitochondrial genome sequence of ogura-type male-sterile cytoplasm and its comparative analysis with that of normal cytoplasm in radish (raphanus sativus l.) draft genome sequence of aspergillus oryzae strain . 
genome sequence of the oleaginous red yeast rhodosporidium toruloides mtcc complete genome sequence of a monosense densovirus infecting the cotton bollworm, helicoverpa armigera the complete genome sequences of four new incn plasmids from wastewater treatment plant effluent provide new insights into incn plasmid diversity and evolution key: cord- -bsypo l authors: van dorp, lucy; acman, mislav; richard, damien; shaw, liam p.; ford, charlotte e.; ormond, louise; owen, christopher j.; pang, juanita; tan, cedric c.s.; boshier, florencia a.t.; ortiz, arturo torres; balloux, françois title: emergence of genomic diversity and recurrent mutations in sars-cov- date: - - journal: infect genet evol doi: . /j.meegid. . sha: doc_id: cord_uid: bsypo l sars-cov- is a sars-like coronavirus of likely zoonotic origin first identified in december in wuhan, the capital of china's hubei province. the virus has since spread globally, resulting in the currently ongoing covid- pandemic. the first whole genome sequence was published on january , , and thousands of genomes have been sequenced since this date. this resource allows unprecedented insights into the past demography of sars-cov- but also monitoring of how the virus is adapting to its novel human host, providing information to direct drug and vaccine design. we curated a dataset of public genome assemblies and analysed the emergence of genomic diversity over time. our results are in line with previous estimates and point to all sequences sharing a common ancestor towards the end of , supporting this as the period when sars-cov- jumped into its human host. due to extensive transmission, the genetic diversity of the virus in several countries recapitulates a large fraction of its worldwide genetic diversity. we identify regions of the sars-cov- genome that have remained largely invariant to date, and others that have already accumulated diversity. 
by focusing on mutations which have emerged independently multiple times (homoplasies), we identify filtered recurrent mutations in the sars-cov- genome. nearly % of the recurrent mutations produced non-synonymous changes at the protein level, suggesting possible ongoing adaptation of sars-cov- . three sites in orf ab in the regions encoding nsp , nsp , nsp , and one in the spike protein are characterised by a particularly large number of recurrent mutations (> events) which may signpost convergent evolution and are of particular interest in the context of adaptation of sars-cov- to the human host. we additionally provide an interactive user-friendly web-application to query the alignment of the sars-cov- genomes. on december , china notified the world health organisation (who) about a cluster of pneumonia cases of unknown aetiology in wuhan, the capital of the hubei province. the initial evidence suggested that the outbreak was associated with a seafood market in wuhan, which was closed on january . the aetiological agent was characterised as a sars-like betacoronavirus, later named sars-cov- , and the first whole genome sequence (wuhan-hu- ) was deposited on ncbi genbank on january ( ). human-to-human transmission was confirmed on january , by which time sars-cov- had already spread to many countries throughout the world. further extensive global transmission led to the who declaring covid- a pandemic on march . betacoronaviridae comprise a large number of lineages that are found in a wide range of mammals and birds ( ), including the other human zoonotic pathogens sars-cov- and mers-cov. the propensity of betacoronaviridae to undergo frequent host jumps supports sars-cov- also being of zoonotic origin. to date, the genetically closest-known lineage is found in horseshoe bats (batcov ratg ) ( ). however, this lineage shares % identity with sars-cov- , which is not sufficiently high to implicate it as the immediate ancestor of sars-cov- ( ). 
the zoonotic source of the virus remains unidentified at the date of writing (april ). the analysis of genetic sequence data from pathogens is increasingly recognised as an important tool in infectious disease epidemiology ( , ). genetic sequence data sheds light on key epidemiological parameters such as the doubling time of an outbreak or epidemic, and supports the reconstruction of transmission routes and the identification of possible sources and animal reservoirs. additionally, whole-genome sequence data can inform drug and vaccine design. indeed, genomic data can be used to identify pathogen genes interacting with the host and allows characterization of the more evolutionarily constrained regions of a pathogen genome, which should be preferentially targeted to avoid the rapid emergence of drug and vaccine escape mutants. there are thousands of global sars-cov- whole-genome sequences available on the rapid data sharing service hosted by the global initiative on sharing all influenza data (gisaid; https://www.epicov.org) ( , ). the extraordinary availability of genomic data during the covid- pandemic has been made possible thanks to a tremendous effort by hundreds of researchers globally depositing sars-cov- assemblies (table s ) and the proliferation of close-to-real-time data visualisation and analysis tools including nextstrain (https://nextstrain.org) and cov-glue (http://cov-glue.cvr.gla.ac.uk). in this work we use this data to analyse the genomic diversity that has emerged in the global population of sars-cov- since the beginning of the covid- pandemic, based on a download of assemblies. we focus in particular on mutations that have emerged independently multiple times (homoplasies) as these are likely candidates for ongoing adaptation of sars-cov-measured via the site specific consistency index. for this analysis all ambiguous sites in the alignment were set to 'n'. 
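as an illustration of the masking step just described, the following python sketch replaces every character that is not an unambiguous base or a gap with 'N'. the function name `mask_ambiguous` and the decision to leave gap characters ('-') untouched are assumptions made for this example, not details taken from the pipeline used in the study.

```python
# Characters kept as-is: the four unambiguous bases (either case) and the gap symbol.
# Keeping '-' is an assumption of this sketch, not something stated in the text.
UNAMBIGUOUS = set("ACGTacgt-")


def mask_ambiguous(seq: str) -> str:
    """Return seq with every ambiguous character (e.g. R, Y, K, n) set to 'N'."""
    return "".join(base if base in UNAMBIGUOUS else "N" for base in seq)


if __name__ == "__main__":
    # Toy aligned sequence containing IUPAC ambiguity codes.
    print(mask_ambiguous("ACGTRYN-acgtk"))
```

applied to every sequence of an alignment, this leaves the alignment length unchanged, so site coordinates remain comparable across isolates.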
to assess whether any particular open reading frame (orf) showed evidence of more homoplasies than expected given the length of the orf, an empirical distribution was obtained by sampling, with replacement, equivalent length windows and recording the number of homoplasies detected (table s ) . homoplasyfinder identified homoplasies ( excluding masked sites), which were distributed over the sars-cov- genome ( figure s , table s ). of these, sites have a derived allele at > % of the total isolates. however, homoplasies can arise due to convergent evolution (putatively adaptive), recombination, or via errors during the processing of sequence data. the latter is particularly problematic here due to the mix of technologies and methods employed by different contributing research groups. we therefore filtered identified homoplasies using a set of thresholds attempting to circumvent this problem (filtering scripts and figures are available at https://github.com/liampshaw/cov-homoplasy-filtering). in summary, for each homoplasy we computed the proportion of isolates with the homoplasy pnn where the nearest neighbouring isolate in the phylogeny also carried the homoplasy (excluding identical sequences). this metric ranges between pnn= (all isolates with the homoplasy present as singletons) and pnn= (no singletons i.e. clustering of isolates with the homoplasy in the phylogeny). we reasoned that artefactual sequencing homoplasies would tend to show up as singletons, so excluded all homoplasies with pnn< . from further analysis. to obtain a set of high confidence homoplasies, we then used the following criteria: ≥ . % isolates in the alignment share the homoplasy (equivalent to > isolates), pnn> . , and derived allele found in strains sequenced from > originating lab and > submitting lab. we also required the proportion of isolates where the homoplasic site was in close proximity to an ambiguous base (± bp) to be zero. 
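the nearest-neighbour proportion pnn described above can be sketched as follows. this is a minimal illustration and not the authors' filtering script (that lives in the repository linked above): the function name `pnn`, the toy matrix `demo_dist` and the tie-handling rule are assumptions made here, and the thresholds, which are elided in the text, are deliberately left out.

```python
def pnn(dist, carriers):
    """Proportion of carrier isolates whose nearest non-identical
    neighbour in the tree also carries the homoplasy.

    dist     -- square list-of-lists of pairwise tree distances
    carriers -- set of isolate indices carrying the derived allele
    """
    if not carriers:
        return 0.0
    hits = 0
    for i in carriers:
        # Candidate neighbours: every other isolate at non-zero distance,
        # so identical sequences are excluded as in the text.
        others = [(dist[i][j], j) for j in range(len(dist))
                  if j != i and dist[i][j] > 0]
        if not others:
            continue  # no usable neighbour: counted as a non-hit
        nearest = min(d for d, _ in others)
        # Assumption: if several neighbours tie for nearest, one carrier
        # among them is enough to count the isolate as clustered.
        if any(j in carriers for d, j in others if d == nearest):
            hits += 1
    return hits / len(carriers)


# Toy distances for four isolates: 0 and 1 are close, 2 and 3 are close.
demo_dist = [
    [0, 1, 4, 5],
    [1, 0, 4, 5],
    [4, 4, 0, 2],
    [5, 5, 2, 0],
]
```

with this toy matrix, carriers {0, 1} sit next to each other and give pnn = 1, whereas carriers {0, 2} are both singletons and give pnn = 0, which is the pattern the filter treats as likely artefactual.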
the application of these various filters reduced the number of homoplasies to (table s ) . we also plotted the distributions of cophenetic distances between isolates carrying each homoplasy compared to the distribution for all isolates ( figure s ) , and inspected the distribution of all identified homoplasies in the phylogenies from our own analyses and on the phylogenetic visualisation platform provided by nextstrain. finally, we examined whether ambiguous bases were seen more often at homoplasic sites than at random bases (excluding masked sites), which was not the case ( figure s ). to further validate the homoplasy detection method applied to the alignment of the sars-cov- genome assemblies, we took advantage of the genome sequences for which raw reads were available on the short read archive (sra). a variant calling pipeline (available at https://github.com/damienfr/cov-homoplasy) was used to obtain high-confidence alignments for the (out of as of april ) sra genomic datasets both meeting our quality criteria and matching gisaid assemblies. the topology of the maximum likelihood phylogeny of these samples was compared to that of the corresponding samples from the gisaid genome assemblies using a mantel test and the phytools r package ( ) (figures s -s , see supplementary text). ≥ %), and homoplasies were kept in the sra dataset and in the gisaid dataset, respectively. nine sites were detected in both datasets. for sites which failed the filtering thresholds, this was largely due to the low number of studied accessions, which increases the probability of an isolated strain displaying a homoplasy e.g. if n= isolates have a homoplasy, by definition they cannot be nearest neighbours, so pnn= . the alignment was translated to amino acid sequences using seaview v ( ) . sites were identified as synonymous or non-synonymous and amino acid changes corresponding to these mutations were retrieved via multiple sequence alignment. 
we assessed the change in hydrophobicity and charge of amino acid residues arising due to homoplastic non-synonymous mutations using the hydrophobicity scale proposed by janin ( ) . the ten most hydrophobic residues on this scale were considered hydrophobic and the rest as hydrophilic. in addition, amino acid residues were either classified as positively charged, negatively charged or neutral at ph . the charge of each residue can either increase, decrease or remain the same (neutral mutation) due to mutation ( figure s ). sars-cov- and mers-cov are both zoonotic pathogens related to sars-cov- , which underwent a host jump into the human host previously. we investigated whether the major homoplasies we detect in sars-cov- affect sites which also underwent recurrent mutations in these related viruses as these adapted to their human host. all coronaviridae assemblies were downloaded (ncbi taxid: ) on th of april and human associated mers-cov and sars-cov- assemblies extracted. this gave a total of assemblies for sars-cov- and assemblies for mers-cov. following the same protocol (augur align) as applied to sars-cov- assemblies, each species was aligned against the respective refseq reference genomes: nc_ . for sars-cov- and nc_ . for mers-cov. this produced alignments of , bp ( snps) and , bp ( snps) respectively. the sars-cov- genomes offer an excellent geographical and temporal coverage of the covid- pandemic (figure a-b) . the genomic diversity of the sars-cov- genomes is represented as maximum likelihood phylogenies in a radial (figure c ) and linear layout ( figure s -s ). there is a robust temporal signal in the data, captured by a statistically significant correlation between sampling dates and 'root-to-tip' distances for the sars-cov- ( figure s ; r = . , p< . ). such positive association between sampling time and evolution is expected to arise in the presence of measurable evolution over the timeframe over which the genetic data was collected. 
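the correlation between sampling dates and root-to-tip distances mentioned above is an ordinary least-squares regression, and the point where the fitted line crosses zero distance is commonly read as a rough estimate of the root date. a minimal sketch, assuming plain python lists as input; the function name, the return layout and the toy numbers are illustrative only and are not taken from the study.

```python
def root_to_tip_regression(times, distances):
    """Least-squares fit of root-to-tip distance on sampling time.

    Returns (slope, intercept, r_squared, x_intercept), where
    x_intercept is the time at which the fitted distance is zero.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    syy = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(times, distances))
    slope = sxy / sxx                    # substitutions per site per unit time
    intercept = mean_d - slope * mean_t
    r_squared = (sxy * sxy) / (sxx * syy)
    x_intercept = -intercept / slope     # cursory estimate of the root date
    return slope, intercept, r_squared, x_intercept


# Made-up data lying exactly on the line d = 2.0 * (t - 0.1).
demo_times = [0.2, 0.3, 0.4, 0.5]
demo_distances = [0.2, 0.4, 0.6, 0.8]
demo_fit = root_to_tip_regression(demo_times, demo_distances)
```

tips of a tree share ancestry and are therefore not independent observations, so a fit like this is best treated as a diagnostic of temporal signal rather than a formal estimate; dedicated dating tools are the appropriate next step.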
specifically, more recently sampled strains have accumulated additional mutations in their genome than older ones since their divergence from the most recent common ancestor (mrca, root of the tree). the origin of the regression between sampling dates and 'root-to-tip' distances ( figure s ) provides a cursory point estimate for the time to the mrca (tmrca) around late . using treedater ( ), we observe an estimated tmrca, which corresponds to the start of the covid- epidemic, of october - december ( % cis) ( figure s ). these dates for the start of the epidemic are in broad agreement with previous estimates performed on smaller subsets of the covid- genomic data using various computational methods ( table ) , though they should still be taken with some caution. indeed, the sheer size of the dataset precludes the use of some of the more sophisticated inference methods available. the sars-cov- global population has accumulated only moderate genetic diversity at this stage of the covid- pandemic with an average pairwise difference of . snps between any two genomes, providing further support for a relatively recent common ancestor. we estimated a mutation rate underlying the global diversity of sars-cov- of ~ × - nucleotides/genome/year (ci: x - - x - ) obtained following time calibration of the maximum likelihood phylogeny. this rate is largely unremarkable for an rna virus ( , ) , despite coronaviridae having the unusual capacity amongst viruses of proofreading during nucleotide replication, thanks to the non-structural protein nsp exonuclease, which excises erroneous nucleotides inserted by their main rna polymerase nsp ( , ) . some of the major clades in the maximum likelihood phylogeny (figure c and figure s ) are formed predominantly by strains sampled from the same continent. however, this likely represents a temporal rather than a geographic signal. 
indeed, the earliest available strains were collected in asia, where the covid- pandemic started, followed by extensive genome sequencing efforts first in europe and then in the usa. the sars-cov- genomic diversity found in most countries (with sufficient sequences) essentially recapitulates the global diversity of covid- from the -genome dataset. figure highlights the proportion of the global genetic diversity found in the uk, the usa, iceland and china. in the uk, the usa and iceland, the majority of the global genetic diversity of sars-cov- is recapitulated, with representatives of all major clades present in each of the countries (figure a-c) . the same is true for other countries such as australia ( figure s a ). this genetic diversity of sars-cov- populations circulating in different countries points to each of these local epidemics having been seeded by a large number of independent introductions of the virus. the main exception to this pattern is china, the source of the initial outbreak, where only a fraction of the global diversity is present (figure d ). this is also to an extent the case for italy (figure s b) , which was an early focus of the covid- pandemic. however, this global dataset includes only sars-cov- genomes from italy, so some of the genetic diversity of sars-cov- strains in circulation likely remains unsampled. the genomic diversity of the global sars-cov- population being recapitulated in multiple countries points to extensive worldwide transmission of covid- , likely from extremely early on in the pandemic. the sars-cov- alignment can be considered as broken into a large two-part open reading frame (orf) encoding non-structural proteins, four structural proteins: spike (s), envelope (e), membrane (m) and nucleocapsid (n), and a set of small accessory factors (figure a ). there is variation in genetic diversity across the alignment, with polymorphisms often found in neighbouring clusters ( figure s ) . 
a simple permutation resampling approach suggests that both orf a and n exhibit snps which fall in the th percentile of the empirical distribution (table s ) . however, not all of these sites can be confirmed as true variant positions, due to the lack of accompanying sequence read data. we nevertheless closely inspected those sites that appear to have arisen multiple times following a maximum parsimony tree building step. we identified a large number of putative homoplasies (n= excluding masked regions), which were filtered to a high confidence cohort of positions (see methods). these positions in the sars-cov- genome alignment ( . % of all sites) were associated with amino acid changes across all genomes. of these amino acid changes, comprised non-synonymous and comprised synonymous mutations. two non-synonymous mutations involving the introduction or removal of stop codons were found (* y, * g). of the remaining non-synonymous mutations involved neutral hydrophobicity changes ( figure s a ). in addition, of the remaining non-synonymous mutations involved neutral changes ( figure s b ). both orf ab and n had a four-fold higher frequency of hydrophilic → hydrophobic mutations than hydrophobic → hydrophilic mutations ( figure s ). in addition, neutral hydrophobic changes were clearly favoured in the s protein. lastly, of the remaining non-synonymous mutations involved neutral charge changes. amongst the strongest filtered homoplasic sites (> change points on the tree), three are found within orf ab (nucleotide positions , , ) and s ( ). we exemplify the strongest signal and our approach using position in figure and provide a full list of homoplasic sites, both filtered and unfiltered, in tables s - . the strongest hit in terms of the inferred minimum number of changes required (figure b -c) at orf ab ( , codon ) falls over a region encoding the non-structural protein, nsp , and is also observed in our analyses of the sra dataset (table s ) . 
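the window resampling behind the empirical distribution referred to above (sampling equal-length windows with replacement and counting the homoplasies in each) might look like the sketch below. all names, the default number of samples, the fixed seed and the 'less than or equal' percentile convention are assumptions made for illustration; the source does not specify them.

```python
import random
from bisect import bisect_left, bisect_right


def count_in_window(sorted_positions, start, end):
    """Number of positions p with start <= p <= end (list must be sorted)."""
    return bisect_right(sorted_positions, end) - bisect_left(sorted_positions, start)


def window_percentile(positions, genome_length, orf_start, orf_end,
                      n_samples=1000, seed=0):
    """Compare the homoplasy count in one ORF with random windows of equal length.

    Coordinates are 1-based and inclusive. Returns (observed_count, percentile),
    where percentile is the share of sampled windows whose count is less than
    or equal to the observed count.
    """
    rng = random.Random(seed)            # fixed seed keeps the sketch reproducible
    pos = sorted(positions)
    width = orf_end - orf_start + 1
    observed = count_in_window(pos, orf_start, orf_end)
    at_or_below = 0
    for _ in range(n_samples):
        # Windows are drawn independently, so the same window can recur.
        start = rng.randint(1, genome_length - width + 1)
        if count_in_window(pos, start, start + width - 1) <= observed:
            at_or_below += 1
    return observed, 100.0 * at_or_below / n_samples
```

a call such as `window_percentile(homoplasy_positions, genome_length, orf_start, orf_end)` returns the observed count and its percentile; an orf whose count sits in the upper tail of the sampled windows carries more homoplasies than its length alone would predict.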
we note that some of the hits also overlap with positions identified as putatively under selection using other approaches (http://virological.org/t/selection-analysis-of-gisaid-sars-cov- data/ / , accessed april ), with orf ab consistently identified as a region comprising several candidates for non-neutral evolution. orf ab has orthologues in other human-associated betacoronaviruses, in particular sars-cov- and mers-cov, which both underwent host jumps into humans from likely bat reservoirs ( , ) . we performed an equivalent analysis on human-associated virus assemblies available on the ncbi virus platform. we identified six putative homoplasic sites within sars-cov- , two occurring within the c-like proteinase just upstream of nsp ( , ) and a further two homoplasies within orf ab at nsp and nsp ( figure s ). in addition, one homoplasy was identified in the spike protein and one in the membrane protein orfs. for mers-cov, multiple unfiltered homoplasies were detected, consistent with previous observations of high recombination in this species ( ) , though only one invoked more than a minimum number of changes on the maximum parsimony tree ( figure s ) . this corresponded to a further homoplasy identified in orf ab nsp (position ). it is of note that this genomic region coincides with the strongest homoplasy in sars-cov- which also occurs in the nsp encoding region of orf ab. codon of orf ab shares a leucine residue in mers-cov and sars-cov- , though a valine in sars-cov. the exact role of these and other homoplasic mutations in human-associated betacoronaviruses represents an important area of future work, although it appears that the orf ab region may exhibit multiple putatively adapted variants across human betacoronavirus lineages. the genome alignment of the sars-cov- genomes can be queried through an open access, interactive web-application (https://macman .shinyapps.io/ugi-scov -alignmentscreen/). 
it provides users with information on every snp and homoplasy detected across our global sars-cov- alignment and allows visual inspection both within the sequence alignment and across the maximum likelihood tree phylogeny. figure illustrates some of the functionalities of the web application using position in the alignment as an example. this particular homoplasy was observed times across the genomes and requires a minimum of character-site changes to become congruent with the observed sars-cov- phylogeny (figure a and b ). pandemics have been affecting humanity for millennia ( ) . over the last century alone, several global epidemics have claimed millions of lives, including the / influenza a (h n ) pandemic, the sixth ( - ) and seventh 'el tor' cholera pandemics ( ) , as well as the hiv/aids pandemic ( -today). covid- acts as an unwelcome reminder of the major threat that infectious diseases represent in terms of deaths and disruption. one positive aspect of the current situation, relative to previous pandemics, is the unprecedented availability of scientific and technological means to face covid- . in particular, the rapid development of drugs and vaccines has already begun. modern drug and vaccine development are largely based on genetic engineering and an understanding of host-pathogen interactions at a molecular level. the mobilisation to address the covid- pandemic by scientists worldwide has been remarkable. this includes the feat of the global scientific community, which has already produced and publicly shared well over , complete sars-cov- genome sequences at the time of writing (april ), which we have used here with gratitude. further initiatives in the united kingdom (https://www.cogconsortium.uk/data/) have to date produced over , genomes, some of which overlap with those already available on gisaid. 
to put these numbers of sars-cov- genomes in context, it is interesting to consider parallels with the h n pdm influenza pandemic, the first epidemic for which genetic sequence data was generated in near-real time ( , ) . the genetic data available at the time looks staggeringly small in comparison to the amount that has already been generated for sars-cov- during the early stages of the covid- pandemic. for example, fraser et al. considered partial hemagglutinin gene sequences two months after the who had declared h n pdm influenza a pandemic ( ) . this unprecedented genomic resource has already provided strong conclusions about the pandemic. for example, analyses by multiple independent groups place the start of the covid- pandemic towards the end of ( table ). this rules out any scenario that assumes sars-cov- may have been in circulation long before it was identified, and hence have already infected large proportions of the population. extensive genomic resources for sars-cov- should in principle also be key to informing on optimal drug and vaccine design, particularly when coupled with knowledge of human proteome and immune interactions ( ) . ideally, drugs and vaccines should target relatively invariant, strongly constrained regions of the sars-cov- genome, to avoid drug resistance and vaccine evasion. therefore ongoing monitoring of genomic changes in the virus will be essential to gain a better understanding of fundamental host-pathogen interactions that can inform drug and vaccine design. the vast majority of mutations observed so far in sars-cov- circulating in humans are likely neutral ( , ) or even deleterious ( ) . homoplasies, such as those we detect here, can arise by product of neutral evolution or as a result of ongoing selection. of the homoplasies we detect (after applying stringent filters), some proportion are very likely genuine targets of positive selection which signpost to ongoing adaptation of sars-cov- to its new human host. 
indeed, we do observe an enrichment for non-synonymous changes ( %) in our filtered sites. as such, our provided list (table s ) contains candidates for mutations which may affect the phenotype of sars-cov- and virus-host interactions and which require ongoing monitoring. conversely, the finding that % of the homoplasic mutations involve no polarity change could still reflect strong evolutionary constraints at these positions ( , ) . the remaining non-neutral changes to amino acid properties at homoplasic sites may be enriched in candidates for functionally relevant adaptation and could warrant further experimental investigation. one of the strongest homoplasies lies at site in the sars-cov- genome in a region of orf a encoding nsp . this site passed our stringent filtering criteria and was also present in our analysis of the sra dataset (table s ) . interestingly, this region overlaps a putative immunogenic peptide predicted to result in both cd + and cd + t-cell reactivity ( ) . more minor homoplasies amongst our top candidates, identified within orf a (table s ) , also map to a predicted cd t cell epitope. while the immune response to sars-cov- is poorly understood at this point, key roles for cd t cells, which activate b cells for antibody production, and cytotoxic cd t cells, which kill virus-infected cells, are known to be important in mediating clearance in respiratory viral infections ( ) . of note, we also identify a strong recurrent mutation in nucleotide position , corresponding to the sars-cov- spike protein (codon ). while the spike protein is the known mediator of host-cell entry, our detected homoplasy falls outside of the n-terminal and receptor binding domains. our analyses presented here provide a snapshot in time of a rapidly changing situation based on available data. 
although we have attempted to filter out homoplasies caused by sequencing error with stringent thresholds, and also used available short-read data to validate a subset of homoplasic sites in a smaller dataset, our analysis nevertheless remains reliant on the underlying quality of the publicly available assemblies. as such, it is possible that some results might be artefactual, and further investigation will be warranted as additional raw sequencing data becomes available. however, given the crucial importance of identifying potential signatures of adaptation in sars-cov- for guiding ongoing development of vaccines and treatments, we have suggested what we believe to be a plausible approach and initial list in order to facilitate future work and interpretation of the observed patterns. more data continues to be made available, which will allow ongoing investigation by ourselves and others. we believe it is important to continue to monitor sars-cov- evolution in this way and to make the results available to the scientific community. in this context, we hope that the interactive web-application we provide will help identify key recurrent mutations in sars-cov- as they emerge and spread. figure . global sequencing efforts have contributed hugely to our understanding of the genomic diversity of sars-cov- . a) viral assemblies available from global regions as of / / . b) cumulative total of viral assemblies uploaded to gisaid included in our analysis. c) radial maximum likelihood phylogeny for complete sars-cov- genomes. colours represent continents where isolates were collected. green: asia; red: europe; purple: north america; orange: oceania; dark blue: south america according to metadata annotations available on nextstrain (https://github.com/nextstrain/ncov/tree/master/data). figure c .  phylogenetic estimates support that the covid- pandemic started sometime around october - december , which corresponds to the time of the host-jump into humans. 
 the diversity of sars-cov- strains in many countries recapitulates its full global diversity, consistent with multiple introductions of the virus to regions throughout the world seeding local transmission events.  sites in the sars-cov- genome appear to have already undergone recurrent, independent mutations based on a large-scale analysis of public genome assemblies.  detected recurrent mutations may indicate ongoing adaptation of sars-cov- to its novel human host.  monitoring the build-up and patterns of genetic diversity in sars-cov- has potential to inform targets for drug and vaccine development. a new coronavirus associated with human respiratory disease in china the phylogenetic range of bacterial and viral pathogens of vertebrates a pneumonia outbreak associated with a new coronavirus of probable bat origin the genomic and epidemiological dynamics of human influenza a virus unifying the epidemiological and evolutionary dynamics of pathogens disease and diplomacy: gisaid's innovative contribution to global health global initiative on sharing all influenza datafrom vision to reality mafft multiple sequence alignment software version : improvements in performance and usability raxml-ng: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference ggtree: an r package for visualization and annotation of phylogenetic trees with their covariates and other associated data bayesian inference of ancestral dates on bacterial phylogenetic trees scalable relaxed clock phylogenetic dating mpboot: fast phylogenetic maximum parsimony tree inference and bootstrap approximation homoplasyfinder: a simple tool to identify homoplasies on a phylogeny toward defining course of evolution -minimum change for a specific tree topology phytools: an r package for phylogenetic comparative biology (and other things) seaview version : a multiplatform graphical user interface for sequence alignment and phylogenetic tree building surface and inside volumes in 
globular proteins an unusually high substitution rate in transplant-associated bk polyomavirus in vivo is further concentrated in hla-c-bound viral peptides the evolution of ebola virus: insights from the - epidemic unique and conserved features of genome and proteome of sars-coronavirus, an early split-off from the coronavirus group lineage discovery of an rna virus '-> ' exoribonuclease that is critically involved in coronavirus rna synthesis severe acute respiratory syndrome coronavirus-like virus in chinese horseshoe bats middle east respiratory syndrome coronavirus in bats, saudi arabia mers-cov recombination: implications about the reservoir and potential for adaptation what are pathogens, and what have they done to and for us? pandemic potential of a strain of influenza a (h n ) : early findings origins and evolutionary genomics of the swine-origin h n influenza a epidemic a sars-cov- -human protein-protein interaction map reveals drug targets and potential drug-repurposing infectious diseases of humans a dynamic nomenclature proposal for sars-cov- to assist genomic epidemiology computational inference of selection underlying the evolution of the novel coronavirus, sars-cov- a sars-cov- vaccine candidate would likely match all currently circulating strains synonymous mutations and the molecular evolution of sars-cov- origins looking for darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level distribution of the strength of selection against amino acid replacements in human proteins a sequence homology and bioinformatic approach can predict candidate targets for immune responses to sars-cov- immunity to respiratory viruses transmission dynamics and evolutionary history of -ncov the first two cases of -ncov in italy: where they come from genomic epidemiology of sars-cov- in guangdong province % bci may % bci november rate-estimated relaxed clock model ( ) % ci october % ci november unreported clock model 
(beast % hpd november strict clock model (beast v . ) relaxed clock model (beast v . ) % ci november o analysed data and performed computational analyses l.v.d and f.b. acknowledge financial support from the newton fund uk-china nsfc initiative (grant mr/p / ) and the bbsrc (equipment grant bb/r x/ ). computational analyses were performed on ucl computer science cluster and the south green bioinformatics platform hosted on the cirad hpc cluster. we thank jaspal puri for insights and assistance on the development of the alignment visualisation tool and nicholas mcgranahan and rachel rosenthal for their comments on the manuscript. we additionally wish to acknowledge the very large number of scientists in originating and submitting labs who have readily made available sars-cov- assemblies to the research community. key: cord- -oreg rnj authors: spyrou, maria a.; bos, kirsten i.; herbig, alexander; krause, johannes title: ancient pathogen genomics as an emerging tool for infectious disease research date: - - journal: nat rev genet doi: . /s - - - sha: doc_id: cord_uid: oreg rnj over the past decade, a genomics revolution, made possible through the development of high-throughput sequencing, has triggered considerable progress in the study of ancient dna, enabling complete genomes of past organisms to be reconstructed. a newly established branch of this field, ancient pathogen genomics, affords an in-depth view of microbial evolution by providing a molecular fossil record for a number of human-associated pathogens. recent accomplishments include the confident identification of causative agents from past pandemics, the discovery of microbial lineages that are now extinct, the extrapolation of past emergence events on a chronological scale and the characterization of long-term evolutionary history of microorganisms that remain relevant to public health today. 
in this review, we discuss methodological advancements, persistent challenges and novel revelations gained through the study of ancient pathogen genomes. the long shared history between humans and infectious disease places ancient pathogen genomics within the interest of several fields such as microbiology, evolutionary biology, history and anthropology. research on this topic aims to better understand the interactions between pathogens and their hosts on an evolutionary timescale, to uncover the origins of pathogens and to disentangle the genetic processes involved in their epidemic emergence among human populations. over the past , years, major transitions in human subsistence strategies, such as those that accompanied the neolithic revolution , likely exposed our species to a novel range of infectious agents . closer contact with domesticated animals would have increased the frequency of zoonotic transmission events, and higher human population densities would have enhanced the potential of pathogens to propagate within and between groups. throughout human history, a number of epidemics and pandemics have been recorded or are hypothesized to have occurred (fig. ) . although most of their causative agents still remain speculative, robust molecular methods coupled with archaeological and historical data can confidently demonstrate the involvement of certain pathogens in these episodes. the investigation of past infectious diseases has traditionally been conducted through palaeopathological assessment of ancient skeletal assemblages , , although this approach is limited by the fact that most acute infections do not leave visible traces on bone. since the s, the field of ancient dna (adna) has brought molecular techniques to this study, providing a diachronic genetic perspective to infectious disease research. 
initial attempts relied on pcr technology [ ] [ ] [ ] [ ] [ ] , which restricted the study of ancient microbial dna to targeted, short genomic fragments that were amplified from ancient human remains. this method made infectious disease detection possible but gave limited information on the evolutionary history of the pathogen. in addition, complications associated with the study of adna, which is typically present at low quantities, is heavily fragmented and harbours chemical modifications [ ] [ ] [ ] , hampered efforts to reproduce and authenticate early findings [ ] [ ] [ ] . over the past decade, major advancements in genomics, in particular, the development of high-throughput sequencing, also called next-generation sequencing (ngs) , radically increased the amount of data that can be retrieved from ancient remains. this technology has assisted the development of quantitative methods for adna authentication , , [ ] [ ] [ ] and has enabled the retrieval of whole ancient pathogen genomes from archaeological specimens. the first such genome, published in (ref. ), was that of the notorious bacterial pathogen yersinia pestis, the causative agent of plague. since then, the field has expanded its directions to the in-depth study of infectious disease evolution, providing a unique resource for understanding human history. here, we review the latest methodological innovations that have aided the whole-genome retrieval and evolutionary analysis of various ancient pathogens (table ) , most of which are still relevant to public health today. a scientific field focused on the study of whole pathogen genomes retrieved from ancient human, animal or plant remains. the cultural transition associated with the adoption of farming, animal husbandry and domestication as well as the practice of a sedentary lifestyle among human populations. the infectious disease transmission from animals to humans. 
in the second half of this review, we highlight the utility of this approach by discussing evolutionary events in the history of y. pestis that have been uniquely revealed through the study of ancient genomes. methods for isolating ancient microbial dna the sweet spot for ancient pathogen dna. the retrieval of dna from ancient human, animal or plant remains carries with it a number of challenges, namely, its limited preservation and hence low abundance, its highly fragmented and damaged state and the pervasive modern dna contamination that necessitates a confident evaluation of its authenticity , . efficient adna recovery is best accomplished via sampling of the anatomical element that contains the highest quantity of dna from the target organism. for human adna analysis, bone and teeth have been the preferred study material, given their abundance in the archaeological record. recent studies suggest that the inner ear portion of the petrous bone and the cementum layer of teeth have the greatest potential for successful human dna retrieval. however, petrous bone sampling and shotgun ngs sequencing of adna from five bronze age skeletons previously shown to be carrying y. pestis failed to detect the bacterium in this source material, suggesting that its preservation potential for pathogen dna is low . direct sampling from skeletal lesions, where present, has proved a rich source of adna for some chronic disease-causing bacteria, such as mycobacterium tuberculosis, which was isolated from vertebrae ; mycobacterium leprae, which could be isolated from portions of the maxilla and various long bones , ; and treponema pallidum subsp. pallidum and t. pallidum subsp. pertenue, which have been isolated from long bones . of note, the sampling methods for recovering pathogen dna do not generally follow a standardized procedure, in part because of the great diversity in tissue tropism and resulting disease progression. 
in addition, acute blood-borne infections do not typically produce diagnostic bone changes, as opposed to those that affect their hosts chronically. therefore, if infections have caused mortality in the acute phase, as is the case for individuals from epidemic contexts who do not display skeletal evidence of infection, the preferred study material has been the inner cavities of teeth. pathogen adna is thought to be preserved within the remnants of the pulp chamber, likely as part of desiccated blood. consequently, tooth sampling has proved successful in the retrieval of whole genomes or genome-wide data (that is, low-coverage genomes that have provided limited analytical resolution) from ancient bacteria such as y. pestis, borrelia recurrentis and salmonella enterica; ancient eukaryotic pathogens such as plasmodium falciparum; and ancient viruses such as hepatitis b virus (hbv) and human parvovirus b (b v). even m. leprae, which commonly manifests in the chronic form, has been retrieved from ancient teeth. other types of specimen have also shown potential for adna retrieval. examples are dental calculus as a source of oral pathogens, such as tannerella forsythia

pandemics refers to increased, often sudden, disease occurrence within populations across more than one region or continent, whereas epidemics refers to increased disease occurrences within a confined region or country.
the evaluation of the health status of ancient individuals or populations, usually through the analysis of disease marker presence on skeletal assemblages.
ancient dna (adna). the dna that has been retrieved from historical, archaeological or palaeontological remains.
tropism refers to the type of tissue or cell in which infection is established and supported.

segregating the metagenomic soup: methods for pathogen detection. regardless of the source of genetic material, most ancient specimens yield complex metagenomic data sets.
poorly preserved adna usually makes up a minuscule fraction of the total genetic material extracted from a sample (< %), and the majority of dna usually stems from organisms residing in the environment. hence, specialized protocols are necessary for the detection and isolation of ancient pathogen dna and its confident segregation from a rich environmental dna background (fig. ). in this context, laboratory-based techniques are separated into those that target a specific microorganism and those that screen for several pathogenic microorganisms simultaneously (fig. ). methods that screen for a single microorganism have used species-specific assays of conventional or quantitative pcr (also known as real-time pcr), as well as hybridization-based enrichment techniques (fig. ). these methods are particularly useful when the target microorganism is known, for example, in the presence of diagnostic skeletal lesions among the studied individuals, or when a hypothesis exists for the causative agent of an epidemic. by contrast, broad laboratory-based pathogen screening in adna research has used microarrays for both targeted enrichment and detection, whereby probes are designed to represent unique or conserved regions from a range of pathogenic bacteria, parasites or viruses. although amplification-based or fluorescence-based approaches can be fast and cost-effective for screening large sample collections, enrichment-based techniques are usually coupled with ngs and therefore provide data that can be used to assess adna authenticity.

a term used to describe a specimen or data set that includes nucleic acid sequences from all organisms within the sampled proportion.

the diagram provides an overview of techniques used for pathogen dna detection in ancient remains by distinguishing between laboratory and computational methods. in both cases, processing begins with the extraction of dna from ancient specimens. as part of the laboratory pipeline, direct screening of extracts can be performed by pcr (quantitative (qpcr) or conventional) against species-specific genes, as done previously. pcr techniques alone, however, can suffer from frequent false-positive results and should therefore always be coupled with further verification methods such as downstream genome enrichment and/or next-generation sequencing (ngs) in order to ensure ancient dna (adna) authentication of putatively positive samples. alternatively, construction of ngs libraries has enabled pathogen screening via fluorescence-based detection on microarrays and via dna enrichment approaches. the latter has been achieved through single-locus in-solution capture or through simultaneous screening for multiple pathogens using microarray-based enrichment of species-specific loci, and enables post-ngs adna authentication. in addition, data produced by direct (shotgun) sequencing of ngs libraries before enrichment can also be used for pathogen screening using computational tools. after pre-processing, reads can be directly mapped against a target reference genome (in cases for which contextual information is suggestive of a causative organism) or against a multigenome reference composed of closely related species to achieve increased mapping specificity of ancient reads. alternatively, ancient pathogen dna can also be detected using metagenomic profiling methods, as presented elsewhere, through taxonomic assignment of shotgun ngs reads. both approaches allow for subsequent assessment of adna authenticity and can be followed by whole pathogen genome retrieval through targeted enrichment or direct sequencing of positive sample libraries.

when shotgun sequencing data are generated, computational screening approaches can be used to detect the presence of pathogen dna as well as for metagenomic profiling of ancient specimens (fig. ). in cases for which a causative agent is suspected, ngs reads can be directly mapped (for example, using the read alignment software burrows-wheeler aligner) against a specific reference genome or against a multigenome reference that includes several species of a certain genus, with the purpose of achieving a higher mapping specificity to the target organism (fig. ). in addition, broad approaches involve the use of metagenomic techniques for pathogen screening. examples of tools that have shown their effectiveness with ancient metagenomic dna include the widely used basic local alignment search tool (blast); the megan alignment tool (malt), which involves a taxonomic binning algorithm that can use whole-genome databases (such as the national center for biotechnology information (ncbi) reference sequence (refseq) database); metagenomic phylogenetic analysis (metaphlan), which is also integrated into the metagenomic pipeline metabit and uses thousands (or millions) of marker genes for the distinction of specific microbial clades; or kraken, an alignment-free sequence classifier that is based on k-mer matching of a query to a constructed database. taxonomic sequence assignments from the above methods, however, should be interpreted with caution, mainly because some pathogenic microorganisms have close environmental relatives that are often insufficiently represented in public databases. for example, a > % sequence identity was shown between environmental taxa and human-associated pathogens such as m. tuberculosis and y. pestis according to an analysis of s ribosomal rna genes.
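
the idea of alignment-free classification by exact k-mer matching can be illustrated with a minimal sketch. this is not kraken's implementation: it omits the lowest-common-ancestor step and any compact database format, and the function names, the toy reference sequences and the choice to count only k-mers unique to one taxon are assumptions made for illustration. it also shows why sparse databases are a concern: a read from an unrepresented environmental relative can only be assigned to whatever taxa are present.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k from seq (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of reference labels that contain it.

    references is a hypothetical dict of {label: sequence}.
    """
    index = {}
    for label, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(label)
    return index


def classify(read, index, k):
    """Return (best_label, hit_counts) for a read.

    Only k-mers found in exactly one reference label are counted, so shared
    (uninformative) k-mers never vote. Returns (None, empty Counter) when the
    read has no informative k-mer.
    """
    hits = Counter()
    for km in kmers(read, k):
        labels = index.get(km)
        if labels and len(labels) == 1:
            hits[next(iter(labels))] += 1
    if not hits:
        return None, hits
    return hits.most_common(1)[0][0], hits


# toy usage with made-up sequences
refs = {"taxon_a": "ACGTACGTTAGC", "taxon_b": "TTGACCGGATCA"}
idx = build_index(refs, k=4)
label, counts = classify("GTTAGC", idx, k=4)
```

in practice, an assignment like this would still need the qualitative checks discussed in the text (mapping specificity, edit distance and damage patterns) before a read is accepted as pathogen derived.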
as such, given that environmental dna often dominates ancient remains that stem from burial contexts, analyses should always ensure a qualitative assessment of assigned reads, that is, an evaluation of their mapping specificity and their genetic distance (also called edit distance) to the putatively detected organism. in addition, one should consider the known adna damage characteristics as criteria for data authenticity. although several types of chemical damage can affect post-mortem dna survival, certain characteristics have been more extensively quantified. the first, termed depurination, is a hydrolytic mechanism under which purine bases become excised from dna strands. this process results in the formation of abasic sites and is a known contributor to the fragmentation patterns observed in adna. as such, an increased base frequency of a and g compared with c and t immediately preceding the ʹ ends of adna fragments is often considered a criterion for authenticity. a second type of damage commonly identified among adna data sets is the hydrolytic deamination of c, whereby a c base is converted into u (and detected as its dna analogue, t). this base modification usually occurs at single-stranded dna overhangs that are most accessible to environmental insults, resulting in an increased frequency of miscoding lesions at the terminal ends of adna fragments. consequently, the evaluation of dna damage profiles (for instance, by using mapdamage . (ref. )) is a prerequisite for authenticating ancient pathogen dna and is necessary for ensuring adna data integrity in general. more detailed overviews of authentication criteria in ancient pathogen research are available elsewhere.

targeted enrichment approaches to isolate whole ancient pathogen genomes. evolutionary relationships between past and present infectious agents are best determined through the use of whole-genome sequences of pathogens.
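
the terminal c-to-t signal described in the authentication criteria above can be illustrated with a small sketch that tabulates, for each of the first few read positions, how often a reference c is read as t. this is a simplified illustration and not the mapdamage algorithm: the function name, the input format (gap-free reference/read pairs of equal length, written from the read start) and the default number of positions are assumptions.

```python
def c_to_t_profile(pairs, n_positions=5):
    """Per-position C->T misincorporation frequency at read starts.

    pairs: iterable of (reference, read) strings, aligned without gaps and
    written from the start of the read. For each of the first n_positions,
    returns (reads showing T where the reference has C) / (reference C count),
    or None where no reference C was observed at that position.
    """
    ref_c = [0] * n_positions
    c_to_t = [0] * n_positions
    for ref, read in pairs:
        ref, read = ref.upper(), read.upper()
        for i in range(min(n_positions, len(ref), len(read))):
            if ref[i] == "C":
                ref_c[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    return [c_to_t[i] / ref_c[i] if ref_c[i] else None
            for i in range(n_positions)]


# toy usage with made-up alignments
toy_pairs = [("CACG", "TACG"), ("CCGT", "CCGT"), ("ACCA", "ATCA")]
profile = c_to_t_profile(toy_pairs, n_positions=3)
```

for authentic adna, such a profile would be expected to show elevated values at the outermost positions that decline towards the read interior; a flat profile is more consistent with modern dna.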
however, the recovery of high-quality data is often challenging owing to the aforementioned characteristics of adna and therefore requires specialized sample processing. for example, in cases in which adna authenticity has already been established in the detection step, u residues resulting from post-mortem c deamination can be entirely or partially excised from adna molecules using the enzyme uracil dna glycosylase (udg) to avoid their interference with downstream read mapping and variant calling. in addition, given the low proportion of pathogen dna in ancient remains, a common and cost-effective approach for whole-genome retrieval involves microarray-based or in-solution-based hybridization capture. both methods constitute a form of genomic selection of continuous or discontinuous genomic regions through the design and use of single-stranded dna or rna probes that are complementary to the desired target. microarray-based capture utilizes densely packed probes that are immobilized on a glass slide. it is cost-effective in that it permits the parallel enrichment of molecules from several libraries that can be subsequently recovered through deep sequencing, although competition over the probes can impair enrichment efficiencies in specimens with comparatively lower target dna contents. nevertheless, this type of capture has shown its effectiveness in the recovery of both ancient pathogen and human dna. more recently, in-solution-based capture approaches have gained popularity owing to their capacity for greater sample throughput without compromising capture efficiency; every sample library can be captured individually, thus providing, in principle, an equal probe density per specimen.
this technique has contributed to the increased number of specimens from which human genome-wide single-nucleotide polymorphism (snp) data could be retrieved, even from climate zones that pose challenges to adna preservation (presented elsewhere). in addition, in-solution-based capture has recently become the preferred method for microbial pathogen genome recovery for both bacteria and dna viruses (for examples, see refs ). nevertheless, deep shotgun sequencing alone has also been used for human and pathogen high-quality genome reconstruction, especially for specimens with fairly high endogenous dna yields, although this frequently carries with it a greater production cost.

an algorithm that assigns metagenomic dna reads to a species or a higher taxonomic rank (for example, genus or family) based on the sequence specificity.
the matching, for each read, of multiple subsequences of length k without mismatches to a database.
a hydrolytic reaction in which the β-n-glycosidic bond of a purine (adenine or guanine) is cleaved, causing its excision from a dna strand.
the hydrolytic removal of an amine group (nh ) from a molecule. in ancient dna studies, the term deamination most often refers to the deamination of cytosine residues into uracils.
the identification of polymorphisms (nucleotide differences) in sequenced data by comparison to a reference.

www.nature.com/nrg | june | volume

in the absence of ancient pathogen genomes, the timings of infectious disease emergence and early spread are inferred mainly through comparative genomics of modern pathogen diversity, palaeopathological evaluation of ancient skeletal remains or analysis of historical records. such approaches are highly valuable and, when combined, can be used to build an interdisciplinary picture of infectious disease history; however, limitations also exist.
for example, the analysis of contemporary pathogen genetic diversity considers only a short time depth of available data and cannot predict evolutionary scenarios that derive from lineages that are now extinct. in addition, skeletal markers of specific infections in past populations only exist for a few conditions and, when present, can rarely be considered as definitive, as numerous differential diagnoses can exist for a given skeletal pathology. similarly, historically recorded symptoms can often be misinterpreted given that past descriptions may be unspecific and do not always conform to modern medical terminology. in the past decade, the reconstruction of ancient pathogen genomes has complemented such analyses with direct molecular evidence, often revealing aspects of past infections that were unexpected on the basis of existing data. the recent identification of hbv dna in a mummified individual showing a vesicopustular rash, which is usually considered characteristic of infection with variola virus (varv), highlights the importance of molecular methods in evaluating differential diagnoses. the oldest recovered genomic evidence of hbv to date was from a , year old individual from present-day germany, which shows that this pathogen has affected human populations since the neolithic period. in addition, the virus was identified recently in human remains from the bronze age, iron age and up until the th century of the current era (ce) in eurasia. regarding bacterial pathogens, the identification of b. recurrentis in a th century individual from norway showed that, aside from y. pestis, other vector-borne pathogens were also circulating in medieval europe. furthermore, the causative agents of syphilis and yaws, t. pallidum subsp. pallidum and t. pallidum subsp. pertenue, respectively, were recently identified in different individuals from colonial mexico who exhibited similar skeletal lesions.
this study demonstrates the power of ancient pathogen genomics in distinguishing past infectious disease agents that are genetically and phenotypically similar but that differ greatly in their public health significance. finally, the identification of g. vaginalis and s. saprophyticus in calcified nodules from a woman's remains ( th century troy) directly implicates these bacteria in pregnancy-related complications in the past. these findings, as well as other insights gained from analyses of ancient pathogen genomes (table ), demonstrate the ability of adna to contribute aspects of infectious disease history beyond those accessible by the palaeopathological, historical and modern genetic records. the reconstruction of whole pathogen genomes has not only been a tool for demonstrating infectious disease presence in the past but has also aided the robust inference of microbial phylogeography, which is important for understanding the processes that influence pathogen distribution and diversity over time. the evaluation of genetic relationships between ancient and modern pathogens is often conducted by direct whole-genome or genome-wide snp comparisons of bacteria, viruses, or mitochondrial genomes and nuclear genome data from eukaryotic microorganisms. hence, accurate variant calling is critical for drawing reliable evolutionary inferences, although this process is often a challenge when handling data sets derived from samples with high rates of dna fragmentation (resulting in ultrashort read data), low endogenous dna content and high levels of dna damage. in these cases, increased accuracy is best achieved through stringent ngs read mapping parameters and through visual inspection of the sequences overlapping the studied snps.
in addition, histograms of snp allele frequencies, used to estimate the frequency of heterozygous calls in haploid organisms, can often demonstrate the effects of environmental contamination on ancient microbial data sets. once variant calls are authenticated, one of the most common types of evolutionary inference in pathogen research is phylogenetic analysis, which is a powerful means of resolving the genetic history of clonal microorganisms (fig. ). among the most commonly used tools in ancient microbial genomics are mega, which comprises several phylogenetic methods; phyml, raxml and iq-tree, which implement maximum likelihood approaches; mrbayes, which uses a bayesian approach; and programs used for phylogenetic network inference, such as splitstree. two notable studies that examined phylogenetic relationships among ancient m. leprae genomes revealed a high strain diversity in europe between the th and th centuries ce. considered alongside the oldest palaeopathological cases of leprosy dating to as early as the copper and bronze age in eurasia, and the high frequency of protective immune variants against the disease identified in modern-day europeans, these results may suggest a long history of m. leprae presence in this region. moreover, the phylogenetic analysis of a th century s. enterica subsp. enterica genome from europe showed its placement within the paratyphi c lineage. further identification of the bacterium in th century colonial mexico revealed it as a previously unknown candidate pathogen that was likely introduced to the americas through european contact. given the low frequency of paratyphi c today, these results may be indicative of a higher prevalence in past populations. finally, an example from viral genomics is the recovery of hiv rna from degraded serum specimens, which highlighted the importance of archival collections in reconciling the expansion of recent pandemics.
specifically, these data were able to dispute a long-standing hypothesis regarding the initiation of hiv spread in the usa. when the evolutionary histories of pathogens are influenced equally by mutation and recombination, additional tools have been used to identify recombining loci and to determine genetic relationships within and between microbial populations (fig. ). for example, the programs clonalframeml and recombination detection program (rdp ) have been used to infer potential recombination regions within ancient , , , respectively. in addition, principal component analysis (pca) and ancient admixture component estimation using the bayesian modelling frameworks structure and finestructure on both multilocus sequence typing (mlst) and whole-genome data were recently used for population assignment of a , year old h. pylori genome. these analyses revealed key information on changes of the bacterial population structure that occurred in europe over time. furthermore, the recent study of ancient t. pallidum subsp. pallidum and t. pallidum subsp. pertenue used the program tree-puzzle, a maximum-likelihood-based phylogenetic algorithm, to gain a more robust phylogenetic resolution of ambiguous branching patterns among bacterial lineages. such whole-genome analyses of both clonal and recombining pathogens have helped to elucidate not only past infectious disease phylogeography but also possible zoonotic or anthroponotic transmission events that reveal disease interaction networks through time. among others (table ), a notable example is that of , year old pre-columbian m. tuberculosis genomes isolated from human remains, which showed a phylogenetic placement among animal-adapted lineages, being most closely related to a strain circulating in modern-day seals and sea lions.
although the extent to which these strains were capable of human-to-human transmission is unclear, this study supports the existence of tuberculosis in pre-columbian south america and is helping to delineate the genomic and adaptive history of m. tuberculosis in the region before european contact. another example of intriguing evolutionary relationships revealed uniquely through the study of ancient pathogen genomes includes analyses of neolithic and bronze age hbv. these genomes grouped in extinct lineages that are most closely related to modern strains identified exclusively among african non-human primates, a result that raises further questions regarding past transmission events in hbv history. finally, the phylogenetic analysis of medieval m. leprae genomes suggested a european source for leprosy in the americas, reinforcing the hypothesis that humans passed the disease to the nine-banded armadillo, the most common reservoir for this disease in the new world. importantly, the resolution of evolutionary analyses will depend on the quality, size and evenness of spatial sampling in the comparative data set. therefore, the incomplete and often biased sampling of ancient and modern microbial strains can introduce challenges for discerning true biological relationships and past evolutionary events. nevertheless, in recent years, marked reductions in ngs costs have aided the increased production of large whole-genome microbial data sets from present-day strains. current efforts for centralized data repositories that are continuously curated (such as the pathosystems resource integration center (patric) database and the recently introduced enterobase) and the development of robust phylogenetic frameworks that can accommodate genome-wide data from > , strains (for example, grapetree) are becoming valuable for integrating large sample sizes into microbial evolutionary analyses.
in combination with the increasing number of ancient microbial data sets, these tools will aid in the evaluation of genetic relationships by offering higher resolution.

inferring divergence times through molecular dating. apart from providing a molecular fossil record and revealing diachronic evolutionary relationships, a third analytical advantage gained from the retrieval of ancient pathogen genomes is that their ages can be directly used for calibration of a molecular clock. the ages of ancient specimens can be determined through contextual information, through archaeological artefacts or directly through radiocarbon dating, predominantly of bone or tooth collagen. such temporal calibrations are required for high-accuracy estimations of microbial nucleotide substitution rates and in turn lineage divergence dates (fig. ), particularly because both estimations seem to be highly influenced by the time depth covered by the genomic data set. for such analyses, the most widely used program is the bayesian statistical framework beast. a characteristic example of how ancient calibration points can considerably affect divergence date estimates is that of m. tuberculosis. according to modern genetic data and human demographic events, the m. tuberculosis complex (mtbc) evolution was suggested to have followed human migrations out of africa, with its emergence estimated at more than , years ago. recently, its emergence was re-estimated to a maximum of , years ago on the basis of the , year old mycobacterial genomes from peru, a result that was further corroborated by the incorporation of th century european mtbc genomes in the dating analysis.

the diagram is an overview of whole-genome analysis applied to date for ancient microbial data sets and distinguishes the methods used for clonal and recombining pathogens; of note, the depicted summary is not meant to represent an exhaustive pipeline of all possible analyses that could be undertaken. ancient genome reconstruction is usually initiated through reference-based mapping or through de novo assembly of the data, although the latter has only been possible in exceptional cases of ancient dna (adna) preservation. subsequently, the genomes are assessed for their coverage depth and gene content for evaluation of their quality, which is also relevant for the comparative identification of virulence genes over their evolutionary time frames. here, we show an example of virulence factor presence-or-absence analysis in the form of a heat map, as done previously. in addition, a comparison of the ancient genome or genomes with modern genomes can be carried out for single-nucleotide polymorphism (snp) identification and for assessment of snp effects (using snpeff), which is particularly relevant for variants that seem to be unique to the ancient genome or genomes. initial evolutionary inference can often be carried out through phylogenetic analysis and by testing for possible evidence of recombination in the analysed data set, for example, by comparing the support of different phylogenetic topologies and by identifying potential recombination regions and homoplasies. if the data support clonal evolution, robust phylogenetic inference (for example, through a maximum-likelihood approach) is followed by assessment of the temporal signal in the data. if the data set shows a sufficient phylogenetic signal, molecular dating analysis and demographic modelling are considered possible, although the size of the data set will determine whether such analyses will be feasible and meaningful. alternatively, if recombination is confirmed, genetic relationships between microbial clades or populations can be determined through phylogenetic network analysis or through the use of population genetic methods such as principal component analysis (pca) and identification of ancestral admixture components. in this case, caution is advised when assessing the temporal signal and proceeding with molecular dating analysis, which are likely best performed after exclusion of recombination regions from all genomes in the data set. mrca, most recent common ancestor. ngs, next-generation sequencing.

a term used to describe that genome evolution occurs as a function of time and, therefore, the genetic distance between two living forms is proportional to the time of their divergence.
a technique to estimate the age of a specimen on the basis of the amount of incorporated radiocarbon ( c) that after the death of an organism gradually becomes lost over time.
denotes the frequency of substitution accumulation in an organism within a given time; usually represented as substitutions per site per year.
the dates of separation between two phylogenetic lineages, for example, the split between two species.

in molecular phylogenies, the length of each individual branch usually reflects the number of substitutions acquired by an organism within a given period of time and, as such, varying branch lengths should represent heterochronous sequences. therefore, an important prerequisite for a robust dating analysis is that the nucleotide substitution rate of the species whose phylogeny is to be dated behaves in a 'clock-like' manner, meaning that phylogenetic branch lengths correlate with archaeological dates or sampling times. such relationships can be assessed through date randomization and root-to-tip regression tests (fig. ).
the former is used to assess the effect of arbitrary exchange of phylogenetic tip dates on the nucleotide substitution rate and divergence date estimates, whereas the latter is used for estimation of a correlation coefficient (r) and coefficient of determination (r²) by relating the tip date of each taxon to its snp distance from the tree root (using, for example, the program tempest). the resulting values determine whether there is a temporal signal in the data and suggest whether branches within a phylogeny evolve at a constant rate, in which case a strict molecular clock can be statistically tested, for example, using mega or marginal likelihood estimations, and applied. if branches are affected by differences in their evolutionary rates, a relaxed clock would be more appropriate. in general, a constant molecular clock will rarely reliably describe the history of a microbial species, even more so for infectious pathogens whose replication rates vary between active and latent or between epidemic and dormant phases. in certain cases, neither of the two models may fit the data, such as when extensive rate variation weakens the temporal signal. this challenge was encountered in initial attempts to date the y. pestis phylogeny using too few ancient calibration points. similar limitations can arise when the evolutionary history of a microorganism is vastly affected by recombination, as observed for hbv, although hbv molecular dating was recently attempted using a different genomic data set and suggested that the currently explored diversity of old and new world primate lineages (including all human genotypes) may have emerged within the last , years. molecular dating analysis requires the use of an appropriate demographic model for the available data, which can be determined through model testing approaches (for example, through marginal likelihood estimations).
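
the root-to-tip regression described above amounts to an ordinary least-squares fit of root-to-tip distance against tip date. the sketch below is an illustration under stated assumptions, not the tempest implementation: the function name and the input format (one calendar date and one root-to-tip distance per tip, already extracted from a rooted tree) are hypothetical, and no root optimization is performed.

```python
import math


def root_to_tip_regression(tip_dates, root_to_tip_distances):
    """Least-squares fit of root-to-tip distance (y) on tip date (x).

    Returns (slope, intercept, r, r_squared). Under a clock-like signal the
    slope approximates the substitution rate per unit of time, and the
    x-intercept (-intercept / slope) approximates the date of the root.
    """
    n = len(tip_dates)
    if n < 3 or n != len(root_to_tip_distances):
        raise ValueError("need at least three tips with matching distances")
    mean_x = sum(tip_dates) / n
    mean_y = sum(root_to_tip_distances) / n
    sxx = sum((x - mean_x) ** 2 for x in tip_dates)
    syy = sum((y - mean_y) ** 2 for y in root_to_tip_distances)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(tip_dates, root_to_tip_distances))
    if sxx == 0 or syy == 0:
        raise ValueError("tip dates and distances must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r, r * r


# toy usage: made-up calendar years and snp distances from the root
slope, intercept, r, r2 = root_to_tip_regression(
    [1000, 1200, 1400, 1600], [10, 14, 18, 22])
```

a date randomization test could reuse the same function by repeatedly shuffling the tip dates and comparing the resulting slopes with the slope obtained from the true dates.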
currently, the most widely used models for estimating dates of divergence are the coalescent constant size, which assumes a constant population size history and is unrealistic for epidemic pathogens, and the coalescent skyline, which can estimate effective population size (n_e) changes over time. moreover, the birth-death demographic model, which is currently unexplored within adna frameworks, may prove an insightful analysis tool in the future. this model has shown its applicability on comprehensive pathogen data sets from modern-day epidemic contexts. it has the ability to incorporate prior knowledge on incomplete sampling proportions and sampling biases within a data set, a frequent caveat of adna studies that is currently unaccounted for within molecular dating analyses. finally, recently developed fast dating algorithms should also be noted, for example, the least-squares dating (lsd) program, which does not use constrained demographic models but can handle uncorrelated rate variation among phylogenetic branches and has shown potential for analysing large genomic data sets.

the pathogen best studied using adna analysis so far is y. pestis, the causative agent of plague. to date, ancient genomes of this bacterium have been published (fig. ), and their analyses have yielded valuable information on past pandemic emergence as well as in-depth microbial evolution. integration of such knowledge into human population frameworks has provided key insights into the association of human migrations and infectious disease transmission in the past. this section describes the evolutionary history of y. pestis with the aim of demonstrating aspects of its emergence and spread as revealed through adna research. plague is a well-defined infectious disease caused by the gram-negative bacterium y. pestis, which belongs to the family enterobacteriaceae.
it evolved from a close relative, yersinia pseudotuberculosis, which is an environmental enteric diseasecausing bacterium . although the two species are clearly distinguishable in terms of their vir ulence potential and transmission mechanisms, their nucleotide genomic identity reaches % among chromo somal protein coding genes . in addition, they share the virulence plasmid pcd , which encodes a type iii secretion system common to three known pathogenic yersinia: y. pestis, y. pseudotuberculosis and yersinia enterocolitica. the distinct transmission mechanism and pathogenicity of y. pestis are conferred by the unique acquisition of two plasmids, ppcp , which contributes to the invasive potential of the bacterium , and pmt , which is involved in flea colonization , , as well as by chromosomal gene pseudogenization or loss throughout its evolutionary history . y. pestis is not human adapted. its primary hosts are sylvatic rodents such as marmots, mice, great gerbils, voles and prairie dogs, among others, in which it is continuously or intermittently maintained in so called reservoirs or foci [ ] [ ] [ ] . its global distribution includes numerous rodent species , and encompasses regions in eastern europe, asia, africa and the americas (fig. ) , where the bacterium persists in active foci, some of which have existed for centuries or even millennia , , , , . y. pestis transmission among hosts is facilitated by a flea vector (fig. ) . the best yet characterized is the oriental rat flea, xenopsylla cheopis, although others are also known to play important roles in y. pestis transmission , , . notably, recent modelling infer ences suggest important roles for ectoparasites such as body lice and human fleas in its propagation during human epidemics . landmark studies investigating the classical model of transmission have shown that y. 
pestis has the unique ability to colonize and form a biofilm within the flea, which blocks a portion of its foregut, the proventriculus (fig. ). this phenotype is determined by the unique acquisition and activity of certain genomic loci in y. pestis, namely, the yersinia murine toxin (ymt) gene, which is present on the pmt plasmid and facilitates colonization of the arthropod midgut. in addition, it is dependent on the pseudogenization of certain genes, namely, the biofilm downregulators rcsa, pde (also known as rtn), pde (also known as y ) and the urease gene ured, which are, by contrast, active in y. pseudotuberculosis. the biofilm prevents a blood meal from entering the flea's digestive tract, leaving it starving; as a result, the insect intensifies its feeding behaviour and promotes bacterial transmission to uninfected hosts. this continuous transmission cycle among fleas and rodents, also called the enzootic phase of maintenance (fig. ), is thought to drive the preservation of plague foci around the world and is dependent on environmental and climatic factors as well as on host population densities. disruption of this equilibrium for reasons that are not well understood can cause disease eruption among susceptible rodent species, leading to so-called plague epizootics (fig. ).
during that time, marked reductions in the rodent populations force fleas to seek alternative hosts, which can lead to infections in humans and, as a result, trigger the initiation of epidemics or pandemics.
plague manifestation in humans has three disease forms, namely, bubonic, pneumonic and septicaemic. bubonic plague is the most common form of the disease and can cause up to % mortality when left untreated. subsequent to the bite of an infected flea, bacteria travel to the closest lymph node, where excessive replication occurs, giving rise to large swellings, the so-called buboes. in addition, following primary bubonic plague, bacteria can disseminate into the bloodstream to cause septicaemia (secondary septicaemic plague) and to the lungs, causing secondary pneumonic disease. both forms are highly lethal disease presentations and cause nearly % mortality when left untreated. only the pneumonic form can result in direct human-to-human transmission.
[figure caption, beginning missing] ... are shown as grey circles within their geographical country or region of isolation, and the size of each circle is proportional to the number of strains sequenced from each location (number indicated when more than one genome is shown). the areas highlighted in brown are regions that contain active plague foci as determined by contemporary or historical data. ybp, years before present. adapted with permission from the 'global distribution of natural plague foci as of march ' from https://www.who.int/csr/disease/plague/plague-map- .pdf.
early evolution: plague in prehistory. the time of divergence between y. pestis and y. pseudotuberculosis has been difficult to determine given the wide temporal interval produced by recent molecular dating attempts based on adna data ( , - , years before present (ybp)). nevertheless, y.
pestis identification in human remains from neolithic and bronze age eurasia suggests that it caused human infections during these periods and originated more than , years ago. these data have revealed important details about the early evolution of the bacterium. genomic and phylogenetic analyses have shown that strains from the late neolithic and bronze age (lnba) occupy a basal lineage in the y. pestis phylogeny, and a recent study suggests the presence of even more basal variants in neolithic europe (fig. ). such analyses have demonstrated that, during its early evolution, the bacterium had not yet acquired important virulence factors consistent with the complex transmission cycle common to historical and extant strains. one of these genes is ymt, whose absence has been associated with an inability for flea midgut colonization in y. pestis. in addition, these strains possess the active forms of the rcsa, pde , pde and ured genes, which suggests an impaired ability to form a biofilm and block the flea's proventriculus. finally, they possess an active flagellin gene (flhd), which is present as a pseudogene in all other y. pestis, as it is a potent inducer of the innate immune response of the host. as a result, during its initial evolutionary stages, y. pestis may have been unable to efficiently transmit via a flea vector. flea-borne transmission of y. pestis is a known prerequisite for bubonic plague development; hence, it has been suggested that this disease phenotype was not present during prehistoric times. in addition, these results have raised uncertainty regarding the possible vector and host mammalian species of the bacterium.
the bronze age in eurasia was a period of intense human migrations, which shaped the genomic landscape of modern-day europe. remarkably, the y. pestis lnba lineage was shown to mirror human movements during that time and was found in regions that do not host wild reservoir populations today (fig.
). the wide geographical distribution of these strains, their supposed limited bubonic disease potential and their relationship with human migration routes might together be indicative of a different reservoir host species compared to the wild rodents that have a central role in plague transmission in areas such as central and east asia, where the disease is endemic today.
nevertheless, an alternative mode of flea transmission, termed early-phase transmission, which occurs during the initial phases of infection and was suggested to be biofilm-independent, should also be considered as a possible way of y. pestis propagation during its early evolution. although this transmission mechanism is currently not well understood, its comparative mode and efficiency in different rodent species have recently started to be assessed.
evidence showing the full capacity for flea colonization similar to modern and historic strains was identified in two , year old skeletons from the samara region of modern-day russia. although this strain was shown to occupy a phylogenetic position among modern y. pestis lineages (fig. ), molecular dating analysis indicated that it originated ~ , years ago, suggesting that it overlapped temporally with the other bronze age strains that lacked the genetic prerequisites for arthropod transmission. similar characteristics were previously identified in a low-coverage , year old isolate from modern-day armenia, which suggests that multiple forms of the bacterium, which may have had different transmission cycles and produced different disease phenotypes, were circulating in eurasia between , and , years ago. as the propagation mechanisms of those strains are still uncertain, and the exact timing of flea adaptation in y. pestis is unknown, additional metagenomic screening from human and animal remains may provide relevant information on disease reservoirs and hosts across neolithic and bronze age eurasia.
it is becoming increasingly apparent that, aside from plague, other infectious diseases, such as those caused by hbv and b v (table ), were circulating during the same time periods. further pathogen screening coupled with a temporal assessment of human immune-associated genomic variants may reveal key aspects of disease prevalence and susceptibility during this pivotal period of human history.
after the bronze age, bubonic plague has been associated with three historically recorded pandemics. the earliest accounts of the so-called first plague pandemic, which began with the plague of justinian ( ce), suggest that it erupted in northern africa in the mid th century ce and subsequently spread through europe and the vicinity until ~ ce. the second historically recorded plague pandemic began with the infamous black death ( - ce) and continued with outbreaks in europe until the th century ce. the most recent, third plague pandemic began in the mid th century in the yunnan province of china and has persisted until today in active foci in africa, asia and the americas; it was during this pandemic that alexandre e. j. yersin first described the bacterium in hong kong, in (fig. ). although the majority of modern plague cases derive from strains disseminated in this global dispersal, the pandemic is considered to have largely subsided since the s.
the association of y. pestis with the two earlier pandemics has, until recent years, been contentious. on the basis of their serological characterization, modern strains were traditionally grouped into three distinct biovars, namely, 'antiqua', 'medievalis' and 'orientalis', according to their ability to ferment glycerol and reduce nitrate.
in addition, historical accounts of the disease seemed to correlate with the supposed distinct geographical distributions of these biovars, and their phylogenetic relationships, as inferred from mlst data, reinforced the hypothesis that each was responsible for a single pandemic. by contrast, later studies identified additional, atypical biovars, and more robust phylogenetic analysis suggested that phylogeography does not correlate clearly with the phenotypic distinctions described between these bacterial populations.
recent genomic analyses have revealed high genetic diversity of the bacterium in east asia, which invariably led to the assumption that y. pestis emerged there. however, a strong research focus on the diversity of the bacterium in these endemic regions, mainly china, has contributed to a profound sampling bias in the available modern data (fig. ). more recent investigations have revealed previously uncharacterized genetic diversity in the caucasus region and in the central asian steppe that ought to be further explored (fig. ). currently, the evolutionary tree of the bacterium is characterized by five main phylogenetic branches (fig. ). the most ancestral, branch , includes strains distributed across china, mongolia and the areas encompassing the former soviet union. the more phylogenetically derived branches - were formed through a rapid population expansion event and are today found in asia, africa and the americas. their wide distribution mainly reflects the geographical breadth of branch , which is associated with the third plague pandemic that spread worldwide during the th and th centuries and is still responsible for more confined epidemics such as those reported in madagascar.
the analysis of adna from historical epidemic contexts has generated important information regarding the evolutionary history of plague. the recovery of y.
pestis dna via pcr from remnants of human dental pulp suggested the involvement of the bacterium in both the first and second pandemics; however, these results were difficult to authenticate. subsequent pcr-based snp typing of ancient specimens offered some phylogenetic resolution and revealed an expected ancestral placement of medieval strains in the y. pestis phylogeny. more recently, full characterization and authentication of the bacterium were achieved using plasmid and whole-genome enrichment coupled with ngs.
historical accounts of the first plague pandemic ( th to th centuries ce) suggest that the disease expanded mainly across the mediterranean basin; however, its exact breadth and impact have been difficult to assess given the limited availability of historical and archaeological data, with the latter being currently under revision. two recent studies have reconstructed th century y. pestis genomes from southern germany (fig. ), a region that lacked historical documentation of the pandemic. phylogenetic analysis showed that both genomes belong to a lineage that is today extinct and is closely related to strains from modern-day china, which suggests the possibility of an east asian origin of the first pandemic. this hypothesis was recently reinforced by the publication of a nd century to rd century y. pestis genome from the tian shan mountains of modern-day kyrgyzstan, which shares a common ancestor with the justinianic plague lineage (figs , ). however, given the > year age difference between these strains, as well as the aforementioned east asian sampling bias of modern y. pestis data, the geographical origin of the pandemic remains hypothetical. retrieval of additional y. pestis strain diversity from that time period, particularly from areas known to have played an important role in the entry of this bacterium into europe, that is, the eastern mediterranean region, may hold clues about its putative source.
the beginning of the second plague pandemic, years later, was marked by the notorious black death of europe ( - ce), estimated to have caused an up to % reduction of the continental population in only years. historical records suggest that the first outbreaks occurred in the lower volga region of russia, and the disease then spread into southern europe through the crimean peninsula. initial analysis of y. pestis via pcr from victims of the black death revealed a distinct phylogenetic positioning of two mid-to-late th century strains and led to the proposal that the disease entered the continent through independent pulses. by contrast, whole-genome analysis of ancient strains from western, northern and southern europe demonstrated a lack of y. pestis diversity during the black death, which suggests its fast spread through the continent and favours a single-wave entry model of the bacterium into europe, although the possible presence of additional strain diversity during that time has recently been explored. intriguingly, the phylogenetic positioning of the black death y. pestis genomes places them on branch , only two nucleotide substitutions away from the 'star-like' diversification of branches - (fig. ), which gave rise to most of the strain diversity identified around the world today.
after the black death, plague epidemics continued to affect europe until the th century. inferred climatic data from tree ring records in central asia and europe have recently suggested that such epidemics were likely caused by multiple introductions of the bacterium into europe as a result of climate-driven disruptions of pre-existing asian reservoirs. by contrast, ancient genetic and genomic evidence supports the persistence of the disease in europe for years after the black death. analysis of y.
pestis strains spanning from the late th to the th century ce has revealed the formation of at least two european lineages that were responsible for the ensuing medieval epidemics (fig. ). both lineages derive from the black death y. pestis strain identified in th century western, northern and southern europe, suggesting that they likely arose locally. the first lineage survives today and gave rise to modern branch strains (which are associated with the third plague pandemic), suggesting the european black death as a source for modern-day epidemics. the second lineage has not been identified among present-day diversity and currently encompasses strains from th century germany and th century france (great plague of marseille, - ce) (fig. ). these phylogenetic patterns are consistent with a continuous persistence of the bacterium in europe during the second plague pandemic. in addition, they are supported by analyses of historical records that suggest the existence of plague reservoirs in the continent until the th century ce.
y. pestis is absent from most of europe today; specifically, no active foci exist west of the black sea. plague is thought to have disappeared from most of europe at the end of the second pandemic ( th century ce). this finding is striking given the thousands of outbreaks that were recorded in the continent until that time. the reasons for its disappearance are unknown, although numerous hypotheses have been put forward, including a change in domestic rodent populations in europe, namely, the replacement of the black rat, rattus rattus, by the brown rat, rattus norvegicus; an acquired plague immunity among humans and/or rodents (although this hypothesis requires an update to accommodate the recent identification of y.
pestis in europe , years ago and the involvement of the bacterium in the first plague pandemic); increased living standards, such as better nutrition and hygienic conditions at the beginning of the early modern era, which may have contributed to improved overall health conditions in europe and likely decreased the number of rats and ectoparasites in human environments; and the potential disruption of the european wild rodent ecological niche owing to habitat loss and industrialization starting in ce. given the contribution that molecular data can offer in these discussions, future research on ancient sources of y. pestis dna will be instrumental in further revealing the history of one of humankind's most devastating pathogens.
the analysis of ancient pathogen genomes has afforded promising views into past infectious disease history. for y. pestis, adna exploration of its evolutionary past has revealed how a predominantly environmental bacterium and opportunistic gastroenteric pathogen developed into an extremely virulent form by acquisition of only a few virulence factors. we eagerly await revelations on a similar scale for other important pathogens that are expected to arise from deep temporal sampling and genomic reconstruction, as made possible through the recent advancements discussed here. integration of ancient pathogen genomes into disease modelling and human population genetic frameworks, as well as their analysis alongside the information offered by the archaeological, historical and palaeopathological records, will help build a more interdisciplinary and complete picture of host-pathogen interactions and human evolutionary history over time.
published online april
) in russia the genus yersinia: from genomics to function complete genome sequence of yersinia pestis strains antiqua and nepal : evidence of gene reduction in an emerging pathogen genome sequence of yersinia pestis, the causative agent of plague draft genome sequences of yersinia pestis strains from the plague epidemic of surat and shimla outbreak in india comparative genomics of seasonal plague (yersinia pestis) in new mexico complete genome sequences of yersinia pestis from natural foci in china a north american yersinia pestis draft genome sequence: snps and phylogenetic analysis yersinia pestis halotolerance illuminates plague reservoirs mycobacterium leprae genomes from a british medieval leprosy hospital: towards understanding an ancient epidemic characterization of the reconstructed spanish influenza pandemic virus the authors thank c. warinner for her valuable comments to the manuscript and m. keller for his contributions in assembling comprehensive meta-information for the y. pestis modern genomic data set. in addition, the authors thank all members of the molecular paleopathology and computational pathogenomics groups at the max planck institute for the science of human history for insightful discussions during meetings. moreover, they are grateful to m. o'reilly, h. shell and r. barquera for extensive assistance with the graphics. this work was supported by the max planck society. m.a.s. researched the literature and wrote the article. all authors provided substantial contributions to discussions of the content and reviewed and/or edited the manuscript. the authors declare no competing interests. springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. nature reviews genetics thanks e. willerslev and other anonymous reviewer(s) for their contribution to the peer review of this work. 
key: cord- -pbv mjfc authors: tong, yaojun; deng, zixin title: an aurora of natural products-based drug discovery is coming date: - - journal: synth syst biotechnol doi: . /j.synbio. . . sha: doc_id: cord_uid: pbv mjfc natural products (nps), nature's reservoir of structural and functional diversity far beyond the current reach of chemical synthesis, are proving to be among nature's most valuable gifts to humankind. for millennia, many of them have been used successfully as medicines, and they remain the most important sources of drug leads, food additives, and many industrially relevant products. most notably, more than half of the antibiotics and anti-cancer drugs currently in use are natural products or are derived from them. however, the speed and output of np-based drug discovery have slowed dramatically since the “low-hanging fruit” was harvested during the golden age of the s- s. with recent scientific advances combining metabolic sciences and technology, multi-omics, big data, combinatorial biosynthesis, synthetic biology, genome editing technology (such as crispr), artificial intelligence (ai), and d printing, the “high-hanging fruit” is becoming increasingly accessible at reduced cost. we are increasingly confident that a new age of natural products discovery is dawning. the number of unique permutations in chemical space is believed to be much larger than the number of stars in the currently known universe. natural products, especially the secondary metabolites from plants and microbes (bacteria and fungi), occupy a huge and uniquely biologically relevant chemical space that no current chemical synthesis can cover. this makes natural products very important resources for pharmaceuticals. natural products-based medicines can be traced back thousands of years to ancient mesopotamia's sophisticated medicinal system [ ].
in china, for example [ ], the first record is 五十二病方 (wushi'er bingfang), containing prescriptions, which dates from about b.c. it was followed by 神农本草经 (shennong herbal), from about b.c., containing drugs; 唐本草 (新修本草, 英公本草, tang herbal), from a. d., containing drugs; and 本草纲目 (compendium of materia medica), from a. d., containing , drugs. together with many other documented or undocumented chinese materia medica, they gave rise to traditional chinese medicine (tcm). however, owing to technological limitations, it was long unknown which natural product component(s) of the herbs were responsible for their medicinal activities. the discovery of penicillin from the filamentous fungus penicillium notatum in by sir alexander fleming [ ] not only opened the molecular age of natural products, but also dramatically shifted the focus of natural products discovery from plants to microbes. fleming shared the nobel prize in physiology or medicine with ernst b. chain and sir howard florey for that discovery. as a result of this change of focus, a number of other discoveries were made in rapid succession: in , selman a. waksman discovered streptomycin from streptomyces griseus [ ] (for which he was awarded the nobel prize in physiology or medicine); in , benjamin m. duggar discovered chlortetracycline from streptomyces aureofaciens [ ]; in , paul r. burkholder discovered chloramphenicol from streptomyces venezuelae [ ]; in , edmund c. kornfeld discovered vancomycin from amycolatopsis orientalis [ ]; and in , satoshi omura discovered avermectins from streptomyces avermitilis [ ] (he shared the nobel prize in physiology or medicine with william c. campbell and youyou tu for the discoveries of avermectins and artemisinin, respectively). these remarkable milestones during the s to s are known as the golden age of natural products-based drug discovery.
to date, research on the discovery of natural products has been awarded the nobel prize in physiology or medicine a total of three times in the prize's -year history. the first two were years apart, while the last two were years apart. this decline in research outcomes is partially due to the fact that the easily accessible discoveries have already been made by the classic "top-down" strategy (fig. ). it has been years now with no new approvals of natural product-based antibiotics [ ]. with the emergence of combinatorial synthesis, a large number of synthetic libraries have been synthesized. this led many pharmaceutical companies to step back from, or even exit, the natural products arena, which resulted in a sharp reduction in the output of both new drug leads and approved drugs from the drug development pipeline [ ]. however, with the advances in whole genome sequencing and genome mining, we now see a huge, as yet unexploited potential for secondary metabolites hidden in the genomes of microorganisms, both culturable and unculturable [ ]. typically, there are - biosynthetic gene clusters (bgcs) encoding secondary metabolites per actinomycete strain, less than % of which have been chemically identified under laboratory conditions. given that new bioactivities are inherently linked with novel chemical structures, we can imagine that a large number of new bioactive secondary metabolites are waiting to be discovered. the award of the nobel prize in physiology or medicine for research in natural products both validated and reinvigorated the whole natural product community [ ]. the incredible rate of development in genome sequencing, modern metabolic engineering, synthetic biology, advanced genome editing, big data, artificial intelligence (ai), and d printing, together with the growing microbial strain collections, enables us to access previously inaccessible natural products.
taken together, these advances leave no doubt that a new age of natural products discovery is coming. as the chinese proverb goes, "without rice, even the cleverest housewife cannot cook". natural products discovery also requires good resources. the resources discussed here are mainly microbial strain collections and high-quality genome/metagenome sequences. microbes, as the producers of natural products, are the starting point of the whole journey of microbial natural products discovery and are therefore irreplaceable. initiated by professor v. b. d. skerman, the world federation for culture collections (wfcc)-world data centre for microorganisms (wdcm) has grown into wfcc-mircen (microbial resources centres)-wdcm, the world's leading data center for microbial resources (http://www.wdcm.org). it has been hosted and maintained by the institute of microbiology, chinese academy of sciences since . to date, more than , , microbial strains from culture collections in countries have been registered in wfcc-mircen-wdcm. beyond these registered resources, there are also some well-established microbial strain collections in both non-profit research institutes and companies (table ). given that the majority of microorganisms are unculturable with current technologies, genomic and metagenomic sequence information is becoming more and more important, as we are now able to obtain encoded compounds directly from the genetic information. a detailed introduction to microbial genome resources can be found in the review [ ]. two such well-known databases are genbank (https://www.ncbi.nlm.nih.gov/genbank/) and the joint genome institute (jgi) genome portal (http://genome.jgi.doe.gov/portal/). to date, they have recorded > , and > , complete bacterial genomes, respectively. it is worth noting that a large part of these complete genomes are resequencing results of the same species. only around complete genomes for streptomycetes are available in genbank.
of course, many genome sequencing projects are ongoing, for example the k microbial sequencing project coordinated by dr. lixin zhang. in addition, many microbial genome sequences have not yet been deposited in public databases, for example the genomic database of over , strains that was originally assembled by warp drive bio and now belongs to ginkgo bioworks. with the rice ready, we now need tools to cook the meal. tools related to natural products discovery are discussed in the following sections. after the discovery of the double helix of dna and the human genome project (hgp), synthetic biology was crowned "the third biotechnology revolution". it is a product of multidisciplinary integration and one of the most active areas in biological and biotechnological development. this highly interdisciplinary area applies engineering principles to biology. it involves biology, evolution, chemistry, physics, mathematics, engineering, and informatics. applications of synthetic biology have made significant achievements in many areas, such as bioenergy, biomaterials, biomedicine, and bulk chemicals. it greatly enables us to better understand and even engineer life. given the complexity of the biosynthetic pathways of natural products, the "top-down" strategy is obsolete; the more acceptable way to study natural products is the "bottom-up" approach (fig. ). it starts with genome mining (the analysis of high-quality whole genome information), which requires bioinformatics, big data, and even ai; proceeds to pathway cloning (refactoring), expression, and fermentation, which need design-build-test-learn (dbtl) cycle-based metabolic engineering; then to identification of the target natural product, which requires modern chemical analysis; and finally to compound modification and clinical studies, which need biochemistry and cell biology. this procedure perfectly matches the principles of synthetic biology.
applying synthetic biology to natural products-based drug discovery will surely bring about a renaissance of natural products discovery [ ]. natural products are a subset of specialized metabolites produced by a given cell factory (the living organism). as a complicated system, the cell itself can be considered a delicate factory in which each metabolic pathway works as a pipeline and all pathways together form a complicated network [ ]. the expression and high yield of a given natural product are closely linked to the whole metabolic network (primary and secondary metabolism) [ ]. we have to know which pathways/enzymes have crosstalk; when, where, and how many enzymes are needed; how to balance cell growth and production; and how to direct the metabolic flow to the target pathway to reach the theoretical yield. a classic example is the production of penicillin, whose yield was increased more than , times by simple strain (the cell factory) improvement. the continuous development of crispr-based genome editing techniques, such as crispr-cas [ ], crispr base editors [ , ], and crispr prime editing [ ], enables much faster and more detailed strain improvement. given that most of the microbial gene clusters encoding secondary metabolites are so-called "cryptic/silent" gene clusters, named for their non-expression and/or trace expression, heterologous expression becomes a powerful approach. with the growth of the metabolism knowledge base, and with the advances in genome editing and metabolic modeling, we are now able to design and construct better microbial cell factories for activation and/or high production of natural products. for example, the anti-malaria drug artemisinin was originally obtained primarily from a. annua. because of low yield and seasonal and regional limitations, this supply could not meet market demand.
however, artemisinin can easily be chemically synthesized from the precursor artemisinic acid. researchers therefore reconstituted the biosynthetic pathway of artemisinic acid in a yeast cell factory, with yields reaching as high as g/l in fermentation [ ]. similarly, the dependence of opioid production on field-grown poppies has been broken: researchers recently produced hydrocodone in a yeast cell factory by reconstituting a complete hydrocodone biosynthetic pathway involving enzymatic steps [ ]. besides yeast cell factories, like saccharomyces cerevisiae and yarrowia lipolytica, some other microbes have great potential as cell factories for natural product production, such as e. coli, which has the best-established knowledge base and toolkit; pseudomonas putida, which in general has high tolerance to many chemicals; and actinomycetes (s. albus, s. coelicolor, s. avermitilis, s. ambofaciens, saccharopolyspora erythraea, etc.), which could be more suitable for expressing biosynthetic gene clusters with high gc content. the design and construction of actinomycete cell factories have been heavily limited by the modest capabilities of traditional genetic manipulation approaches; the recently established crispr-based genome editing methods for actinomycetes [ ] [ ] [ ] make it possible to build good actinomycete cell factories efficiently. moreover, the emergence of crispr-based biosensors [ ] enables faster and more sensitive detection of target compounds produced by microbial cell factories. during the past years, we have witnessed the incredible development of dna sequencing technologies. the first whole genome to be sequenced was that of the bacteriophage ϕx in [ ], with only , bp. to get a human genome ( , . mb) sequenced, the hgp took years ( - ), involved universities in countries, and spent ~ billion us dollars. by comparison, the broad institute announced that it had sequenced , whole human genomes by .
with the fast evolution of sequencing-related technologies, from shotgun sequencing, to pyrosequencing ( ), to illumina sequencing, and now to long-read pacbio sequencing and nanopore sequencing, dna sequencing has become much cheaper, easier, and orders of magnitude faster. we are now entering the age of the k genome (it only costs , us dollars for whole genome sequencing of a human). the increased ability to read (sequence) dna has made dna sequence information indispensable for biological research. the exponentially accumulating whole genome sequence information is also changing how natural products are discovered: instead of the traditional "top-down" approach, we are shifting to the genome mining-based "bottom-up" strategy. genome mining confirms that there is still a huge potential of novel natural products in microbial genomes [ ]. compared to the ability to read dna, the ability to write dna still has a long way to go. current nucleic acid synthesis relies heavily on chemical synthesis, with relatively high cost and a long processing time. can we develop a bio-based (mimicking living organisms), high-fidelity, cheap, and fast nucleic acid synthesis platform to reach the stage of "made-to-order" dna for desired purposes? one bottleneck of the aforementioned "bottom-up" strategy is the cloning of long biosynthetic gene clusters with high gc content. one direct solution, of course, is the complete synthesis of the whole biosynthetic gene cluster, and even the genome, at reasonable cost and processing time [ ]. we are accumulating enormous amounts of data of every kind. for example, microbial whole genome sequences are becoming big data. for natural products discovery, what we are looking for in the big data of genomes is clear: the right biosynthetic gene clusters with all necessary factors. however, these pieces of information are still like a needle in a haystack.
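as a hedged illustration of what looking for that needle can mean in practice, the sketch below groups genes that an upstream annotation step has already flagged as biosynthetic into candidate clusters when they lie close together on a contig. it is a toy heuristic, not the algorithm of any published genome mining tool; the `Gene` record, the `candidate_clusters` function, and the `max_gap` and `min_hits` thresholds are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record; not a format used by any real tool."""
    name: str            # illustrative locus tag
    start: int           # start coordinate on the contig (bp)
    end: int             # end coordinate on the contig (bp)
    biosynthetic: bool   # True if an upstream annotation step flagged the gene


def candidate_clusters(
    genes: List[Gene], max_gap: int = 10_000, min_hits: int = 2
) -> List[List[Gene]]:
    """Group flagged genes into runs whose consecutive members are separated
    by at most max_gap bp, keeping runs with at least min_hits genes.
    Both thresholds are assumed values chosen only for illustration."""
    hits = sorted((g for g in genes if g.biosynthetic), key=lambda g: g.start)
    clusters: List[List[Gene]] = []
    current: List[Gene] = []
    for gene in hits:
        # Close the current run when the next flagged gene is too far away.
        if current and gene.start - current[-1].end > max_gap:
            if len(current) >= min_hits:
                clusters.append(current)
            current = []
        current.append(gene)
    # Flush the final run.
    if len(current) >= min_hits:
        clusters.append(current)
    return clusters
```

with these assumed defaults, flagged genes lying within 10 kb of one another are reported together, and an isolated flagged gene is discarded. real tools rely on far richer evidence (for example, profile-based detection of enzyme domains), which this sketch deliberately leaves out.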
therefore, the processing of big data to determine the right way to find that needle is critical. more and more software and algorithms have been developed for natural products discovery. a good summary can be found in the secondary metabolite bioinformatics portal (https://www.secondarymetabolites.org) [ ], which currently contains such tools. the advances in ai and automation will also facilitate natural products discovery. d printing is a process of building a three-dimensional object from a computer-aided design. it has been successfully applied in manufacturing, medicine, industry, and many other areas. the list of these potential areas is growing. to date, d printing has not yet reached the molecular, let alone the atomic, level. natural products-based drugs (or small molecules) that could be directly d printed on demand would be highly advantageous. some pioneering work has already been done [ ]. if the day of d printing at the molecular level comes, and surely it will, it can help us to solve many problems, such as the shortage of medical supplies in the fight against sudden outbreaks of diseases such as covid- caused by the coronavirus sars-cov- . we live in the best of times: much of what was considered impossible is becoming possible. previously inaccessible resources are becoming accessible. it is an age of pioneering and innovation. worldwide scientific collaborations are extraordinarily frequent and easy. it is becoming more and more clear that multidisciplinary integration, as exemplified by synthetic biology, is the trend of scientific development. we are the witnesses of history, but we also need to be prepared to be the builders of history. with the world population growing, areas such as health, resources, and the environment have drawn more and more attention. natural products would be one of the keys to a bio-sustainable world. we rely on nature, learn from nature, and eventually we may outperform nature.
imagine that in the near future big data assists us in designing molecules for given diseases, ai helps us to construct the optimal biosynthetic pathways, which are then synthesized by modern dna synthesis, and synthetic biology helps us to express the pathways and obtain the desired compounds. an even wilder thought would be to obtain the desired compounds directly by d printing. there are no conflicts to declare.
the beginnings of drug therapy: ancient mesopotamian medicine
the pharmacology of chinese herbs
the discovery of penicillin
isolation of streptomycin-producing strains of streptomyces-griseus
aureomycin; a product of the continuing search for new antibiotics
chloromycetin, a new antibiotic from a soil actinomycete
some laboratory and clinical experiences with a new antibiotic, vancomycin
avermectins, new family of potent anthelmintic agents: producing organism and fermentation
natural products as sources of new drugs from to
drug discovery and natural products: end of an era or an endless frontier?
opportunities for natural products in (st) century antibiotic discovery
a new golden age of natural products drug discovery
web resources for microbial data
genome engineering and modification toward synthetic biology for the production of antibiotics
engineering and modification of microbial chassis for systems and synthetic biology
strategies for terpenoid overproduction and new terpenoid discovery
multiplex genome engineering using crispr/cas systems
targeted nucleotide editing using hybrid prokaryotic and vertebrate adaptive immune systems
programmable editing of a target base in genomic dna without double-stranded dna cleavage
search-and-replace genome editing without double-strand breaks or donor dna
high-level semi-synthetic production of the potent antimalarial artemisinin
complete biosynthesis of opioids in yeast
highly efficient dsb-free base editing for streptomycetes with crispr-best
crispr/cas-based genome engineering in natural product discovery
crispr-cas based engineering of actinomycetal genomes
a crispr-cas a-derived biosensing platform for the highly sensitive detection of diverse small molecules
nucleotide sequence of bacteriophage phi x dna
genome mining approaches to bacterial natural product discovery
synthetic genomics: from dna synthesis to genome design
the secondary metabolite bioinformatics portal: computational tools to facilitate synthetic biology of secondary metabolite production
digitization of multistep organic synthesis in reactionware for on-demand pharmaceuticals
the authors thank dr. wen-jun li and dr. lixin zhang for agreeing to disclose the number of their microbial strain collections and providing unpublished information. the authors thank simon shaw for proofreading the manuscript. y.t. acknowledges funding from the novo nordisk foundation (nnf cc ; nnf oc ; and nnf oc ). z.d. acknowledges funding from the national natural science foundation of china ( ).
key: cord- -fs dj dp authors: liu, yu-tsueng title: infectious disease genomics date: - - journal: genetics and evolution of infectious disease doi: . /b - - - - . - sha: doc_id: cord_uid: fs dj dp the history and development of infectious disease genomics are discussed in this chapter. hgp must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. the completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. the polysaccharide capsule is important for meningococci to escape from complement-mediated killing. with the completion of the genome sequence of a virulent menb strain, a “reverse vaccinology” approach was applied for the development of a universal menb vaccine by novartis. the indispensable fatty acid synthase (fas) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents. through a systematic screening of , natural product extracts, a merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from streptomyces platensis. vector biology network was formed to achieve three goals ( ) to develop basic tools for the stable transformation of anopheline mosquitoes by the year ; ( ) to engineer a mosquito incapable of carrying the malaria parasite by ; and ( ) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by . the most immediate impact of a completely sequenced pathogen genome is for infectious disease diagnosis. the history and development of infectious disease genomics are tightly associated with the human genome project (hgp) (watson, ) . 
a series of important discussions about the hgp took place in and (dulbecco, ; watson, ), which led to the appointment of a special national research council (nrc) committee by the national academy of sciences to address needs and concerns such as the project's impact, leadership, and funding sources. the committee recommended that the united states begin the hgp in (nrc, ). they emphasized the need for technological improvements in the efficiency of gene mapping, sequencing, and data analysis capabilities. in order to understand potential functions of human genes through comparative sequence analyses, they also advised that the hgp must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. in the meantime, the office of technology assessment (ota) of the us congress also issued a similar report to support the hgp (ota, ). in , the department of energy (doe) and the national institutes of health (nih) jointly presented an initial -year plan for the hgp (dhhs and doe, ). in october , the sanger center/institute (hinxton, uk) was officially opened to join the hgp. the cost of dna sequencing was about $ - per base in , and the initial aim was to reduce the costs to less than $ . per base before large-scale sequencing (dhhs and doe, ). the sequencing cost gradually declined during the subsequent years. in , the national human genome research institute (nhgri) challenged scientists to achieve a $ , human genome ( gb/haploid genome) by and a $ genome by to meet the needs of genomic medicine. the first complete genome to be sequenced was the phix bacteriophage ( . kb) by sanger's group in (sanger et al., ). the complete genome sequence of sv polyomavirus ( . kb) was published in (fiers et al., ; reddy et al., ). the human epstein-barr virus ( kb) genome was determined in (baer et al., ). the first completed free-living organism genome was haemophilus influenzae ( .
mb), sequenced through a whole-genome shotgun approach in (fleischmann et al., ). the second sequenced bacterial genome, mycoplasma genitalium ( kb), was completed in less than a month in the same year using the same approach (smith, ). the doe was the first to start a microbial genome program (mgp) as a companion to its hgp in (doe, ). the initial focus was on nonpathogenic microbes. along with the development of the hgp, there was exponential growth in the number of completely sequenced free-living organism genomes. the fungal genome initiative (fgi) (fgi, ) was established in to accelerate the slow pace of fungal genome sequencing since the report of the genome of saccharomyces cerevisiae in (goffeau et al., ). one of the major interests was to sequence organisms that are important in human health and commercial activities. as of september , completed genome projects, a . -fold increase from years ago, had been documented (liolios et al., ). these include bacterial, archaeal, and eukaryotic genomes. in addition, more than other ongoing sequencing projects were reported. the genomes of the human malaria parasite plasmodium falciparum and its major mosquito vector anopheles gambiae were published in (gardner et al., ; holt et al., ). the effort to sequence the malaria genome began in by taking advantage of a clone derived from a laboratory-adapted strain (hoffman et al., ). many parasites have complex life cycles that involve both vertebrate and invertebrate hosts and are difficult to maintain in the laboratory. currently, a few other important human pathogenic parasites, such as trypanosomes (el-sayed et al., ), leishmania (ivens et al., ), and schistosomes (berriman et al., ; consortium, ), have been either completely or partially sequenced (brindley et al., ; aurrecoechea et al., ). in the meantime, the genome sequence of aedes aegypti, the primary vector for yellow fever and dengue fever, was published in .
the genome size ( mb) of this mosquito vector is about times larger than that of the previously sequenced malaria vector anopheles gambiae. approximately % of the genome consists of transposable elements. in , the genome sequence of the body louse (pediculus humanus humanus), an obligatory parasite of humans and the main vector of epidemic typhus (rickettsia prowazekii), relapsing fever (borrelia recurrentis), and trench fever (bartonella quintana), was reported (kirkness et al., ). its mb genome is the smallest among the known insect genomes. genome sequencing projects for other important human disease vectors are in progress (megy et al., ). these include culex pipiens (mosquito vector of west nile virus), ixodes scapularis (tick vector of lyme disease, babesia, and anaplasma), and glossina morsitans (tsetse fly vector of african trypanosomiasis). sequencing the genome of an insect vector is a much greater challenge than sequencing that of a microbe. for example, the genomes of ticks were estimated to be between and gb and may have a significant proportion of repetitive dna sequences, which may be a problem for genome assembly (pagel van zee et al., ). furthermore, the evolutionary distances among insect species may also affect homology-based gene predictions. from the perspective of human health, understanding the sequence diversity within a species is as important as de novo sequencing of a reference genome. this is true for both hosts and pathogens (feero et al., ; alcais et al., ). the goal of the genomes project is to find most genetic variants that have frequencies of at least % in the human populations studied (kaiser, ). a similar effort for human pathogens is the nih influenza genome sequencing project. when this project began in november , only seven human influenza h n isolates had been completely sequenced and deposited in the genbank database (fauci, ; ghedin et al., ).
as of may , more than human and avian isolates have been completely sequenced, including the "spanish" influenza virus (taubenberger et al., ) . databases for human immunodeficiency virus (hiv) and hepatitis c virus have also been established. while most human studies of microbes have focused on the disease-causing organisms, interest in resident microorganisms has also been growing. in fact, it has been estimated that the human body is colonized by at least times more prokaryotic and eukaryotic microorganisms than the number of human cells (savage, ) . it was suggested to have "the second human genome project" to sequence human microbiome (relman and falkow, ) . highly variable intestinal microbial flora among normal individuals has been well documented (eckburg et al., ; costello et al., ; turnbaugh et al., ) . therefore, the human microbiome project (hmp) was initiated by the nih to study samples from multiple body sites from each of at least "normal" volunteers to determine whether there are associations between changes in the microbiome and several different medical conditions, and to provide both standardized data resources and new technological approaches (peterson et al., ) . the completed or ongoing genome projects (table . ) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. specific examples will be provided to illustrate how the information provided by various genome projects may help achieve the goal of promoting human health. meningococcal isolates produce of antigenically distinct capsular polysaccharides, but only (a, b, c, w , and y) are commonly associated with disease (lo et al., ) . the polysaccharide capsule is important for meningococci to escape from complement-mediated killing. 
while conventional vaccines that conjugate capsular polysaccharides to carrier proteins have been clinically successful for meningococcus serogroups a, c, y, and w- , the same approach failed to produce a clinically useful vaccine for serogroup b (menb). the capsule polysaccharide (α - n-acetylneuraminic acid) of menb is identical to human polysialic acid and therefore is poorly immunogenic (finne et al., ) . alternatively, vaccines consisting of outer membrane vesicles (omv) have been successfully developed to control menb outbreaks in areas where epidemics are dominated by one particular strain (bjune et al., ; sierra et al., ; boslego et al., ; jackson et al., ) . the most significant limitation of this type of vaccine is that the immune response is strain-specific, mostly directed against the porin protein, pora, which varies substantially in both expression level and sequence across strains (martin et al., ; pizza et al., ) . with the completion of the genome sequence of a virulent menb strain, a "reverse vaccinology" approach was applied by novartis to the development of a universal menb vaccine (pizza et al., ; tettelin et al., ; giuliani et al., ) . through bioinformatic searching for surface-exposed antigens, which may be the most suitable vaccine candidates because they can be readily recognized by the immune system, open reading frames (orfs) were selected from a total of orfs of the mc genome. eventually, five antigens were chosen as the vaccine components based on a series of criteria, including the ability of candidates to be expressed in escherichia coli as recombinant proteins ( candidates), the confirmation of surface exposure by immunological analyses, the ability to induce protective antibodies in experimental animals ( candidates), and the conservation of antigens within a panel of diverse meningococcal strains, primarily the disease-associated menb strains (pizza et al., ; giuliani et al., ; rinaudo et al., ) .
the vaccine formulation consists of an fhbp-gna fusion protein, a gna -gna fusion protein, nada, and omv from the new zealand menzb vaccine strain, which contains the immunogenic pora. initial phase ii clinical results in adults and infants showed that this vaccine could induce a protective immune response against three diverse menb strains in to % of subjects following three vaccinations and to % after four vaccinations (rinaudo et al., ) . in , a phase iii trial for this vaccine ( cmenb) met its primary endpoint. targeting an essential pathway is a necessary but not sufficient requirement for an effective antimicrobial agent (brinster et al., ) . identification of essential genes in a completely sequenced genome has been actively pursued with various approaches (hutchison et al., ; ji et al., ) . the indispensable fatty acid synthase (fas) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents (wright and reynolds, ) . the subcellular organization of the fatty acid biosynthesis components differs between mammals (type i fas) and bacteria (dissociated type ii fas), which raises the likelihood that drugs targeting the bacterial system will spare the host. comparison of the available genome sequences of various species of prokaryotes reveals highly conserved fas ii systems, suggesting that such an antimicrobial agent can be broad spectrum (zhang et al., ) . in addition, through computational analyses, new members of the fas ii system have been discovered in different bacterial species (heath and rock, ; marrakchi et al., ) . one of the protein components in this system, fabi, is the target of the anti-tuberculosis drug isoniazid and of the general antibacterial and antifungal agent triclosan (banerjee et al., ; levy et al., ; zhang et al., ) .
through a systematic screening of , natural product extracts, a merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from streptomyces platensis and is a selective fabf/b inhibitor in the fas ii system (wang et al., ) . treatment with platensimycin eradicated staphylococcus aureus infection in mice. platensimycin showed no cross-resistance in antibiotic-resistant strains in vitro, including methicillin-resistant s. aureus, vancomycin-intermediate s. aureus, and vancomycin-resistant enterococci. no toxicity was observed using a cultured human cell line. the activity of platensimycin was not affected by the presence of human serum in this study. however, the fas ii system appears to be dispensable for another gram-positive bacterium, streptococcus agalactiae, when exogenous fatty acids are available, such as in human serum (brinster et al., ; balemans et al., ) . differences in susceptibility to inhibitors targeting the fas ii system indicate heterogeneity among gram-positive pathogens in fatty acid synthesis or in the acquisition of exogenous fatty acids (balemans et al., ) . comparative genomic approaches may be useful to identify and develop a strategy to target the salvage pathway of streptococcus agalactiae. alternatively, approaches similar to those described earlier for the menb vaccine may also be applied to streptococcus agalactiae (group b streptococcus) (maione et al., ) . an early mathematical model for malaria control suggested that the most vulnerable element in the malaria cycle was survivorship of adult female mosquitoes (macdonald, ; enayati and hemingway, ) . therefore, insect control is an important part of reducing transmission. the use of ddt as an indoor residual spray in the global malaria eradication program from to reduced the population at risk of malaria to < % by compared with % in (hay et al., ; enayati and hemingway, ) .
engineering genetically modified mosquitoes refractory to malaria infection appeared to be an alternative approach (curtis, ) given the environmental impact of ddt and the emergence of insecticide-resistant insects. the vector biology network (vbn) was formed in and proposed a -year plan with the world health organization (who) in to achieve three major goals: (1) to develop basic tools for the stable transformation of anopheline mosquitoes by the year ; (2) to engineer a mosquito incapable of carrying the malaria parasite by ; and (3) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by (alphey et al., ; morel et al., ; beaty et al., ) . while some proof-of-concept experiments were achieved for the first two aims in , when the anopheles gambiae genome was completely sequenced (catteruccia et al., ; ito et al., ) , progress has been relatively slow (marshall and taylor, ) . genomic loci of anopheles gambiae responsible for plasmodium falciparum resistance have been identified by surveying a mosquito population in a west african malaria transmission zone (riehle et al., ) . a candidate gene, anopheles plasmodium-responsive leucine-rich repeat (apl ), was discovered. subsequently, other resistance genes have also been identified (blandin et al., ; povelones et al., ) . studying the genetic basis of resistance to malaria parasites and the immunity of the mosquito vector will be important for controlling malaria transmission. perhaps the most immediate impact of a completely sequenced pathogen genome is on infectious disease diagnosis. the information may be of great importance to public health when a newly emerged or re-emerged pathogen is discovered. the swine-origin influenza a virus (s-oiv) (dawood et al., ) and the sars (severe acute respiratory syndrome) coronavirus (rota et al., ) are the two most recent examples.
s-oiv emerged in the spring of in mexico and was also discovered in specimens from two unrelated children in the san diego area in april (cdc, ; dawood et al., ) . those samples were positive for influenza a but negative for both human h and h subtypes. the complete genome sequence and a real-time pcr-based diagnostic assay were released to the public in late april. the outbreak evolved rapidly and the who declared the highest phase worldwide pandemic alert on june , . s-oiv has three genome segments (ha, np, ns) from the classic north american swine (h n ) lineage, two segments (pb , pa) from the north american avian lineage, one segment (pb ) from the seasonal h n , and most notably, two segments (na, m) from the eurasian swine (h n ) lineage (dawood et al., ) . with the available influenza genome database, diagnostic assays to distinguish previous seasonal h n , h n , and s-oiv can be easily accomplished (lu et al., ) . a comprehensive pathogen genome database is not only useful for infectious disease diagnosis but also for novel pathogen discovery (liu, ) . homologous sequences within the same family or among different family members are important for new pathogen identification even with the advent of third-generation sequencing technology (munroe and harris, ) . de novo pathogen discovery may be also complicated by coexisting microorganisms, such as commensal bacteria in the human body. without prior knowledge of these microorganisms, one may be misled. in , a microarray-based assay, designated virochip, was used to help discover the sars coronavirus (wang et al., ) . the virochip contained the most highly conserved mer sequences from every fully sequenced reference viral genome in genbank. the computational search for conservation was performed across all known viral families. 
a microarray hybridized with a reaction derived from a viral isolate cultivated from a sars patient revealed that the strongest hybridizing array elements belonged to the families astroviridae and coronaviridae. alignment of the oligonucleotide probes with the highest signals showed that all four hybridizing oligonucleotides from the astroviridae and one oligonucleotide from avian infectious bronchitis virus, an avian coronavirus, shared a core consensus motif spanning nucleotides. interestingly, it had been known previously through bioinformatic analyses that this sequence is present in the utr of all astroviruses, avian infectious bronchitis virus, and an equine rhinovirus (jonassen et al., ) . therefore, a new coronavirus was identified through the unique hybridization pattern and subsequent confirmations. the finding of the seventh human oncogenic virus, merkel cell polyomavirus (mcv) (feng et al., ) , in is another example of why conserved sequences are important for novel pathogen discovery. mcv is the etiological agent of merkel cell carcinoma (mcc), which is a rare but aggressive skin cancer of neuroendocrine origin. two cdna libraries derived from mcc tumors were subjected to high-throughput sequencing on a next-generation roche/ sequencer. nearly , sequence reads were generated. the majority ( . %) of the sequences, which were of human origin, were removed from further analyses. only one of the remaining cdnas was homologous to the t antigen of two known polyomaviruses. one additional cdna was subsequently identified as part of the mcv sequence once the complete viral sequence was known. later analyses showed that % ( / ) of the mcc tumors had mcv integrated in the human genome. monoclonal viral integration was revealed by the patterns of southern blot analysis. only to % of control tissues had a low copy number of mcv infection.
while we can expect that the efforts of a variety of genome projects may improve human health, the socioeconomic issues that are not discussed in this chapter may be substantial. in addition, the tremendous amount of information derived from these projects will also be a challenge for scientists as well as nonscientists to follow and understand.

human genetics of infectious diseases: between proof of principle and paradigm
malaria control with genetically manipulated insect vectors
eupathdb: a portal to eukaryotic pathogen databases
dna sequence and expression of the b - epstein-barr virus genome
essentiality of fasii pathway for staphylococcus aureus
inha, a gene encoding a target for isoniazid and ethionamide in mycobacterium tuberculosis
the influenza virus resource at the national center for biotechnology information
from tucson to genomics and transgenics: the vector biology network and the emergence of modern vector biology
the genome of the african trypanosome trypanosoma brucei
the genome of the blood fluke schistosoma mansoni
effect of outer membrane vesicle vaccine against group b meningococcal disease in norway
dissecting the genetic basis of resistance to malaria parasites in anopheles gambiae
efficacy, safety, and immunogenicity of a meningococcal group b ( :p . ) outer membrane protein vaccine in iquique, chile. chilean national committee for meningococcal disease
helminth genomics: the implications for human health
type ii fatty acid synthesis is not a suitable antibiotic target for gram-positive pathogens
stable germline transformation of the malaria mosquito anopheles stephensi
swine influenza a (h n ) infection in two children-southern california, march-april
the schistosoma japonicum genome reveals features of host-parasite interplay
bacterial community variation in human body habitats across space and time
possible use of translocations to fix desirable genes in insect pest populations
the comprehensive microbial resource
understanding our genetic inheritance, the u.s. human genome project: the first five years: fiscal years
microbial genome program
a turning point in cancer research: sequencing the human genome
diversity of the human intestinal microbial flora
the microbial rosetta stone database: a compilation of global and emerging infectious microorganisms and bioterrorist threat agents
the genome sequence of trypanosoma cruzi, etiologic agent of chagas disease
malaria management: past, present, and future
the genome gets personal-almost
clonal integration of a polyomavirus in human merkel cell carcinoma
fungal genome initiative
complete nucleotide sequence of sv dna
an igg monoclonal antibody to group b meningococci cross-reacts with developmentally regulated polysialic acid units of glycoproteins in neural and extraneural tissues
whole-genome random sequencing and assembly of haemophilus influenzae rd
genome sequence of the human malaria parasite plasmodium falciparum
large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution
a universal vaccine for serogroup b meningococcus
life with genes
the global distribution and population at risk of malaria: past, present, and future
funding for malaria genome sequencing
the genome sequence of the malaria mosquito anopheles gambiae
global transposon mutagenesis and a minimal mycoplasma genome
transgenic anopheline mosquitoes impaired in transmission of a malaria parasite
the genome of the kinetoplastid parasite, leishmania major
phase ii meningococcal b vesicle vaccine trial in new zealand infants
identification of critical staphylococcal genes using conditional phenotypes generated by antisense rna
a common rna motif in the end of the genomes of astroviruses, avian infectious bronchitis virus and an equine rhinovirus
dna sequencing. a plan to capture human diversity in genomes
ensembl genomes: extending ensembl across the taxonomic space
genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle
a novel coronavirus associated with severe acute respiratory syndrome
vectorbase: a data resource for invertebrate vector genomics
molecular basis of triclosan activity
the genomes online database (gold) in : status of genomic and metagenomic projects and their associated metadata
a technological update of molecular diagnostics for infectious diseases
mechanisms of avoidance of host immunity by neisseria meningitidis and its effect on vaccine development
detection in of the swine origin influenza a (h n ) virus by a subtyping microarray
the epidemiology and control of malaria
identification of a universal group b streptococcus vaccine by multiple genome screen
a new mechanism for anaerobic unsaturated fatty acid formation in streptococcus pneumoniae
malaria control with transgenic mosquitoes
effect of sequence variation in meningococcal pora outer membrane protein on the effectiveness of a hexavalent pora outer membrane vesicle vaccine
genomic resources for invertebrate vectors of human pathogens, and the role of vectorbase
the mosquito genome-a breakthrough for public health
third-generation sequencing fireworks at marco island
a catalog of reference genomes from the human microbiome
genome sequence of aedes aegypti, a major arbovirus vector
mapping and sequencing the human genome
mapping our genes-genome projects: how big? how fast?
tick genomics: the ixodes genome project and beyond
the nih human microbiome project
identification of vaccine candidates against serogroup b meningococcus by whole-genome sequencing
leucine-rich repeat protein complex activates mosquito complement in defense against plasmodium parasites
the genome of simian virus
the meaning and impact of the human genome sequence for microbiology
natural malaria infection in anopheles gambiae is regulated by a single genomic control region
vaccinology in the genome era
characterization of a novel coronavirus associated with severe acute respiratory syndrome
nucleotide sequence of bacteriophage phi x dna
microbial ecology of the gastrointestinal tract
database resources of the national center for biotechnology information
gemina, genomic metadata for infectious agents, a geospatial surveillance pathogen database
vaccine against group b neisseria meningitidis: protection trial and mass vaccination results in cuba
history of microbial genomics
characterization of the influenza virus polymerase genes
complete genome sequence of neisseria meningitidis serogroup b strain mc
a core gut microbiome in obese and lean twins
viral discovery and sequence recovery using dna microarrays
platensimycin is a selective fabf inhibitor with potent antibiotic properties
the human genome project: past, present, and future
antibacterial targets in fatty acid biosynthesis
the application of computational methods to explore the diversity and structure of bacterial fatty acid synthase
inhibiting bacterial fatty acid synthesis

key: cord- -ew orn z authors: zhao, xiangyan; tian, yonglei; yang, ronghua; feng, haiping; ouyang, qingjian; tian, you; tan, zhongyang; li, mingfu; niu, yile; jiang, jianhui; shen, guoli; yu, ruqin title: coevolution between simple sequence repeats (ssrs) and virus genome size date: - - journal: bmc genomics doi: .
/ - - - sha: doc_id: cord_uid: ew orn z background: the relationship between the level of repetitiveness in genomic sequence and genome size has been investigated using complete prokaryotic and eukaryotic genomes, but few such studies have been made in virus genomes. results: in this study, a total of viruses were examined, which cover % of genera. the results showed that the occurrence of simple sequence repeats (ssrs) is strongly, positively and significantly correlated with genome size. each repeat class is distributed over a particular range of genome sequence length. mono-, di- and tri- repeats are widely distributed in all virus genomes; tetra- ssrs are a common component of genomes more than kb in size; among genomes < kb, no more than % contain penta- or hexa- ssrs. principal components analysis (pca) indicated that dinucleotide repeats contribute most strongly to the differences in ssrs among virus genomes. the results showed that ssrs tend to accumulate in larger virus genomes, and that the longer the genome sequence, the longer the repeat units. conclusions: we conducted this research at the level of viruses as a whole. we concluded that genome size is an important factor affecting the occurrence of ssrs; hosts are also responsible, to a certain degree, for the variance in ssr content. viruses are small infectious agents that are found wherever there is life and have probably existed since living cells first evolved [ , ] . there are millions of virus types [ ] . the virus species reported so far are sorted into dsdna, ssdna, dsdna-rt, ssrna-rt, dsrna, (−)ssrna and (+)ssrna viruses based on their genome types; they can also be sorted into algae, archaea, bacteria, fungi, invertebrates, plants, protozoa and vertebrates viruses based on the general host categories according to the ictv (international committee on the taxonomy of viruses) [ ] .
these viruses can infect all types of organisms, including archaea, bacteria, plants and animals [ ] . many common human diseases are caused by viruses, such as the common cold, influenza, chickenpox and cold sores. in addition, many serious diseases such as ebola, aids, avian influenza and sars are also caused by viruses. moreover, many viruses are responsible for cancers, for example human papillomavirus, hepatitis b virus, hepatitis c virus, epstein-barr virus, kaposi's sarcoma-associated herpesvirus and human t-lymphotropic virus (http://en.wikipedia.org/wiki/virus). although there are three main theories on the origin of viruses (the regressive, cellular and coevolution origin theories), it is still unclear how viruses originated because, unlike other organisms, they do not form fossils [ , ] . studying viruses via molecular information has therefore been the most useful means of investigating how they arose and evolved [ , [ ] [ ] [ ] . success in viral genome research will promote our understanding and solution of numerous problems, including the origin, evolution, infection mechanisms and disease treatment of viruses. the genome sizes (defined as haploid dna content) of viruses vary greatly between species. the smallest viral genomes, those of the ssdna circoviruses (family circoviridae), code for only two proteins and have a genome size of only kb; the largest, those of the mimiviruses, have genome sizes of over . mb and code for over one thousand proteins [ , ] . two main mechanisms have been implicated in changes of genome size: one is the accumulation of transposable elements [ , ] ; the other is the accumulation of tandemly repetitive sequences [ ] . simple sequence repeats (ssrs), also known as microsatellites, are generally defined as simple sequences of - nucleotides that are repeated multiple times and are present in both coding and non-coding regions of the genome [ , ] . ssrs are ubiquitous and highly abundant in eukaryotic [ ] [ ] [ ] [ ] and prokaryotic genomes [ , ] .
dna repeats are primarily expanded by three models: replication, repair and recombination [ ] . meiotic recombination plays a key role in the maintenance of sequence diversity in the human genome, and ssrs have been reported to be hot spots for recombination as well as sites for random integration [ , ] . thus, alterations in ssrs lie at the center of the dna evolution and sequence diversity that drive adaptation; on the other hand, changes in repetitive sequences can have deleterious effects on gene expression and function, leading to disease [ ] . ssr instability has been identified as a pathway leading to colorectal cancer [ ] . it is now accepted that unstable maintenance of microsatellites occurs in about % of sporadic colorectal cancers [ , ] . microsatellite instability is also frequently associated with other diseases, such as ovarian cancers and malignant tumors of the endometrium [ ] , small intestine [ ] , stomach [ ] , skin [ ] and brain. the features of microsatellite instability observed in bacteria, yeast, mice and man can provide general clues as to how genomes evolve and how certain instability could contribute to human disease [ ] . some pathogens use ssrs in a strategy that counteracts the host immune response by increasing the antigenic variance of the pathogen population [ ] . genome sequences of diverse lengths make it possible to investigate the relationship between genome size and the accumulation of ssrs in all virus genera whose complete genome sequences have been reported. therefore, scatter plots and regression analysis were used to survey the correlation between repetitiveness (ssr occurrence as well as ssr length) and genome size. the distributions of different repeat classes were also surveyed among virus genomes of various sizes.
relative abundance and relative density were examined to make ssr comparisons commensurable among genomes of different sizes; principal component analysis (pca) was used to investigate which repeat class(es) contributed most to the variance among virus species, as well as the relationships between repeat classes. the eighth report of the ictv (international committee on taxonomy of viruses) provided information on orders, families, subfamilies, genera and virus species [ ] ; of these, genera have complete genome sequences reported on ncbi, and one typical species was identified as the representative of each genus according to the listing in taxonomic order (http://ictvdb.bio-mirror.cn/ictv/index.htm). the genome sequences were therefore selected as samples for analyzing the relationship between ssr distribution and genome size at the level of viruses as a whole. all the genome sequences were downloaded in both genbank and fasta formats from the ncbi (ftp://ncbi.nlm.nih.gov/genbank/). the sequences obtained include dna and rna, so both t and u bases were represented as t. some genomes are segmented or multipartite and consist of two or more segments of various sizes (additional file ). ssrs were identified and localized using the software ssr identification tool (ssrit), which identifies perfect di-, tri-, tetra-, penta- and hexanucleotide repeats. for further analysis we considered only those repeats in which the motif was repeated more than times. mononucleotide repeats (with a repeat length of nt) were identified using the tool imex (imperfect microsatellite extractor), which can extract perfect as well as imperfect microsatellites. here we present the data for all perfect repeat types.
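The identification step described above can be illustrated with a minimal Python sketch. It is not the SSRIT or IMEX implementation; it only shows one way to scan a sequence for perfect repeats of a given unit length, after mapping U to T as in the text. The function names and the `min_repeats` parameter are hypothetical, and the thresholds used in the example calls are illustrative values, not the cut-offs applied in the study. The two helper functions compute relative abundance (SSRs per kb) and relative density (SSR base pairs per kb) from the hits.

```python
def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ACAC')."""
    k = len(motif)
    return all(motif != motif[:d] * (k // d) for d in range(1, k) if k % d == 0)


def find_perfect_ssrs(seq, unit_len, min_repeats):
    """Return (start, motif, repeat_count) for each perfect SSR of the given unit length.

    min_repeats is a hypothetical parameter; the study's own thresholds are not assumed here.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alike
    hits, i, n = [], 0, len(seq)
    while i + unit_len <= n:
        motif = seq[i:i + unit_len]
        if set(motif) <= set("ACGT") and is_primitive(motif):
            j = i + unit_len
            while seq[j:j + unit_len] == motif:  # extend the array one unit at a time
                j += unit_len
            repeats = (j - i) // unit_len
            if repeats >= min_repeats:
                hits.append((i, motif, repeats))
                i = j  # continue after the repeat array
                continue
        i += 1
    return hits


def relative_abundance(hits, genome_len_bp):
    """Number of SSRs per kb of sequence analyzed."""
    return len(hits) / (genome_len_bp / 1000)


def relative_density(hits, genome_len_bp):
    """Base pairs contributed by SSRs per kb of sequence analyzed."""
    ssr_bp = sum(len(motif) * repeats for _, motif, repeats in hits)
    return ssr_bp / (genome_len_bp / 1000)


print(find_perfect_ssrs("GGACACACACTT", 2, 3))  # [(2, 'AC', 4)]
```

Rejecting non-primitive motifs keeps a mononucleotide run such as AAAAAA from being counted again as a dinucleotide repeat of AA.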
no distinction was made between repeats occurring in coding and noncoding regions. the rationale for this decision was that coding regions usually account for a large proportion of the genome (mean value approximately %), whereas noncoding regions are usually very short; moreover, overlapping genes are very common in virus genomes. many of the details are presented in additional file . the total numbers were normalized as relative abundance and relative density of ssrs so that genome sequences of different sizes could be compared directly. relative abundance was calculated by dividing the number of ssrs by the length of the sequence analyzed in kilobase pairs (kb); relative density (bp/kb) was calculated by dividing the number of base pairs of sequence contributed by the ssrs by the length of the sequence analyzed (kb). principal components analysis (pca) is a well-known statistical technique with wide-ranging applications. its main goal is to reduce dimensionality by decomposing the total variance observed in the original data set; that is, pca transforms a set of original variables into a new set of uncorrelated variables. mathematically, pca is a coordinate transformation, so each principal component (pc) is a linear combination of the original variables. mathematical model. if the sample size is $n$ and each sample has $p$ observed variables $(x_1, x_2, \cdots, x_p)$, the original data set can be written as the matrix

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} = (x_1, x_2, \cdots, x_p).$$

forming linear combinations of the $p$ variables $(x_1, x_2, \cdots, x_p)$ of the original data matrix $X$ gives

$$y_i = e_{1i} x_1 + e_{2i} x_2 + \cdots + e_{pi} x_p, \qquad i = 1, 2, \cdots, p.$$

here $y_i$ is the $i$-th principal component, which must meet the following conditions: (1) there is no correlation between $y_i$ and $y_j$ ($i \neq j$; $i, j = 1, 2, \cdots, p$); (2) the variance of $y_i$ is the largest among $y_i, y_{i+1}, \cdots, y_p$.
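For two variables, the principal components can be obtained in closed form as a rotation of the centred data, which makes the two defining properties (no correlation between components, largest variance in the first) easy to check. The following Python sketch is an illustration under that two-variable assumption, not the procedure used in the study; the function names and the sample data are hypothetical.

```python
import math


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (n - 1)


def pca_2d(x1, x2):
    """Return (theta, y1, y2): rotation angle and the two principal component scores."""
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    c1 = [v - m1 for v in x1]  # centre each variable
    c2 = [v - m2 for v in x2]
    s11, s22, s12 = covariance(c1, c1), covariance(c2, c2), covariance(c1, c2)
    # angle that diagonalises the 2 x 2 covariance matrix
    theta = 0.5 * math.atan2(2 * s12, s11 - s22)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    y1 = [a * cos_t + b * sin_t for a, b in zip(c1, c2)]   # first principal component
    y2 = [-a * sin_t + b * cos_t for a, b in zip(c1, c2)]  # second principal component
    return theta, y1, y2


# hypothetical, strongly correlated data
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
theta, y1, y2 = pca_2d(x1, x2)
print(abs(covariance(y1, y2)) < 1e-9)             # components are uncorrelated
print(covariance(y1, y1) >= covariance(y2, y2))   # first component carries most variance
```

With more than two variables the same construction is usually carried out by an eigendecomposition of the covariance (or correlation) matrix rather than by an explicit angle.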
suppose that the sample contains $n$ individuals, each with two variables $x_1$ and $x_2$, and that the variables follow a bivariate normal distribution, so that the geometric meaning of pca can be illustrated in the plane. the sample points are then distributed roughly in the shape of an ellipse (figure ). rotating the original rectangular coordinate system formed by $x_1$ and $x_2$ orthogonally through an angle $\theta$ transforms the two correlated original variables $(x_1, x_2)$ into two uncorrelated composite variables $(y_1, y_2)$, and the relationship between the original and the new axes is:

$$y_1 = x_1\cos\theta + x_2\sin\theta, \qquad y_2 = -x_1\sin\theta + x_2\cos\theta.$$

because the variance of the data is greater along the $y_1$ axis than along the $y_2$ axis, the least information is lost if the composite variable $y_1$ is used to replace all the original variables. hence $y_1$ is defined as the first principal component; the variance along the $y_2$ axis is smaller and explains only a minor part of the information relative to $y_1$, so $y_2$ is called the second principal component.

to obtain an expansive and unbiased data set, all virus genera with complete genome sequences reported on ncbi were scanned for ssrs; one typical species was selected as the representative of each genus according to the ictvdb (http://ictvdb.bio-mirror.cn/ictv/index.htm). we therefore analyzed perfect ssrs over bp long from the completely sequenced virus genomes. genome size varies widely, ranging from bp (s -(−)ssrna- , hepatitis delta virus, nc_ ) to bp (s -dsdna- , emiliania huxleyi virus , nc_ ) (additional file ). we constructed two sets of scatter plots and then performed regression analysis of ssrs (occurrence and length) versus complete genome size for all analyzed viruses to examine the relationship between ssrs and genome size.

figure. geometric meaning of pca explained using bivariate normally distributed variables. the sample points are distributed roughly in the shape of an ellipse; rotating the original rectangular coordinates formed by $x_1$ and $x_2$ orthogonally through an angle $\theta$ transforms the two correlated original variables $(x_1, x_2)$ into two uncorrelated composite variables $(y_1, y_2)$, of which $y_1$, along the axis of greater variance, is the first principal component and $y_2$ is the second.

first, scatter plots were made with genome size as the independent variable, and all analyzed data were split into two groups (genome > bp and ≤ bp) to make the points and curves clearly visible (figures , ) ; then curves (linear, logarithmic, inverse, quadratic, cubic, compound, power, s, growth and exponential) were fitted according to their respective mathematical models using the software spss . . parameter estimates and visual inspection showed that the goodness of fit varies greatly among models; the curves with the best fit were selected for the correlation analysis between ssrs (occurrence and length) and genome size (figures , ). the number of repeat arrays varies from in nodamura virus genome (s -(+)ssrna- ) to in amsacta moorei entomopoxvirus 'l' genome (s -dsdna- ) (additional file ). regression analysis showed that the power function model gives the best fit between ssr occurrence and genome size for all studied genomes, and the results display a very strong and significant positive relationship between the occurrence of ssrs and genome size (r = . , p < . ) (figure a ). the power function and the cubic model give the best fit for the genome > bp and ≤ bp groups, respectively ( figure b,c) .
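As a sketch of how a power function model can be fitted to such data, the following Python example estimates $y = a x^{b}$ by ordinary least squares on log-transformed values. This linearisation is one common way to fit a power curve and is assumed here for illustration; it is not claimed to reproduce the SPSS procedure used in the study. The function name and the data are hypothetical, and the reported $R^2$ refers to the log-log fit.

```python
import math


def fit_power_law(sizes, counts):
    """Fit counts = a * sizes**b by least squares on ln-ln values; return (a, b, r_squared).

    All sizes and counts must be positive, and the counts must not all be equal.
    """
    lx = [math.log(v) for v in sizes]
    ly = [math.log(v) for v in counts]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                    # exponent of the power function
    a = math.exp(my - b * mx)        # multiplicative constant
    r_squared = sxy * sxy / (sxx * syy)
    return a, b, r_squared


# hypothetical genome sizes (bp) and SSR counts
sizes = [2000, 8000, 30000, 120000, 400000]
counts = [5, 16, 70, 210, 900]
a, b, r2 = fit_power_law(sizes, counts)
print(f"counts = {a:.4g} * size^{b:.3f}, R^2 (log-log) = {r2:.3f}")
```

An exponent close to 1 would indicate that SSR occurrence grows roughly in proportion to genome size; an outlier such as the entomopoxvirus genome discussed in the text would appear as a large positive residual.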
clearly, the ssrs occurrence is strongly, significantly and positively related to the genome size in both genome > bp (r = . , p < . ) and ≤ bp (r = . , p < . ) group. especially in the group of genome ≤ bp, the values of ssr occurrences fluctuate with a relatively narrow range. an exceptional case is worth noting. one point of the scatter plot locating far above the fitted curve represents the value of ssrs in amsacta moorei entomopoxvirus 'l' genome (s -dsdna- , nc_ ) with the size of bp, in which the ssrs occurrence is a total of , far more than ssrs in any other analyzed virus genome. the length of ssrs varies from bp in nodamura virus genome (s -(+)ssrna- ) to bp in amsacta moorei entomopoxvirus 'l' genome (s -dsdna- ); and the percentage of ssrs varies from . % in nodamura virus genome (s -(+)ssrna- ) to . % in amsacta moorei entomopoxvirus 'l' genome (s -dsdna- ) (additional file ). similarly, we investigated the correlation between ssrs length and genome size. figure showed that the distribution of ssrs length is similar to the ssrs occurrence in differently-sized genomes, and it indicated that ssrs length is also significantly and positively correlated with the genome size to all analyzed data (r = . , p < . ), to genome > bp group (r = . , p < . ) and to genome ≤ bp (r = . , p < . ) group. likewise, amsacta moorei entomopoxvirus 'l' genome (s -dsdna- , nc_ ) shows features out of the ordinary, with the total ssrs length of bp and ssrs percentage of . %, occupying the number-one spot in length and percentage of ssrs among all analyzed virus genomes. except that, other points float up and down the curve with a small range ( figure ). the above results indicated that genome size is an important factor in affecting repetitiveness of microsatellites in viruses. we surveyed the distribution of different ssr classes in virus genomes to investigate the relationship between repeat classes (mono-, di-, tri-, tetra-, penta-and hexa-) and genome sequence length. 
the data of genome size < kb group are not in our consideration here, because too small sample sizes lead to statistical insignificance. data presents such a trend that, for the same repeat class, the ratio of genomes with corresponding ssrs to all genomes increases with the genome sequence growing, although the genome distribution is uneven among different genome ranges (table ) . for example, the ratio of genomes with hexanucleotide ssrs is in group of ~ kb, and it is . % in ~ kb, . % in ~ kb, . % in ~ kb and . % in > kb group, respectively. for the same range of genome sizes, tendency seems to be that the ratio decreases with the increase of the length of repeat unit. for example, in the genome range of ~ kb, the ratio is % (mono-), % (di-), . % (tri-), . % (tetra-), . % (penta-) and . % (hexa-), respectively. observed value per virus genome showed a rising trend with the increase of the genome sequence. additionally, long repeat units such as penta-and hexa-ssrs were rarely, or even not, observed in small genomes, and certain repeat unit class distributed in genomes with a certain range of sequence length. all mono-and di-repeats were observed in analyzed genomes except duck hepatitis b virus (s -dsdna-rt- ), cryphonectria parasitica mitovirus (s -(+) ssrna- ) and nodamura virus (s -(+)ssrna- ) in which mono-repeats were not found; tri-repeats seemed to widely distribute in all virus genomes; and tetra-ssrs, as a common component, consist in genomes with size more than kb ( . % of the virus genomes contain tetra-in group of genome > kb); in contrast, it is rarely observed in genomes with size < kb; and genomes containing penta-and hexa-ssrs are not more than % in < kb group. moreover, the number of tetra-, penta-and hexa-ssrs is very small in genome range of < kb (table ) . results indicated that the correlation is strong between length of repeat unit and genome size. the longer the genome sequence, the longer repeat units. 
for the same repeat unit class such as mononucleotide ssrs, the number of ssrs increases with the genome length increasing. it confirmed a preference that ssrs tend to accumulate in larger virus genomes. because of the irregular sizes of analyzed virus genomes, we calculated the relative abundance and relative density of ssrs to make the comparison of ssrs abundance parallel among differently-sized genomes. frequency of virus genomes with the ssrs relative abundance of . ~ . is quite high with the value of ( . % of all analyzed viruses). wherein, genomes ( . % of all analyzed viruses) were found to have the ssrs relative abundance of . ~ . . however, genomes with the ssrs relative abundance of < . and > . are relatively fewer (with the total number of , accounting for . % of all analyzed viruses) (figure , additional file ). paralleling, frequency of genomes is relatively high in the ssrs relative density range of ~ bp/kb with the genome number of ( . % of all analyzed viruses), and genomes ( . %) have the ssrs relative density among ~ bp/kb; moreover, genomes ( . %) have the ssrs density of ~ bp/kb ( figure , additional file ). the relationship between ssrs relative abundance, relative density and genome size were investigated respectively. scatter plots showed that the correlations between the ssrs relative abundance and genome size and between the relative density and genome size are quite weak (additional file , additional file ). the results indicated that the genome size has slightly affected the relative abundance and relative density of ssrs in virus genomes. chen et al. [ ] also found that the relative abundance and relative density of ssrs were not significantly related to genome size. on the contrary, ssrs are distributed in the virus genomes with a certain proportion. pca was used to examine which factor(s) primarily lead (s) to differences in ssrs abundance among the virus species. 
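the two normalized measures used above can be sketched in code. this is a minimal illustration that assumes the definitions commonly used in the ssr literature: relative abundance is the number of ssrs per kb of sequence, and relative density is the total ssr length in bp per kb. the helper names, the min_len threshold and the toy sequence are hypothetical; the scanner is a simple maximal-run search for perfect repeats, not the tool used by the authors.

```python
def is_primitive(unit):
    """True if the motif is not itself a repeat of a shorter motif (e.g. 'ATAT')."""
    k = len(unit)
    return not any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k))

def find_perfect_ssrs(seq, min_len=6, max_unit=6):
    """Return (start, end, unit) for perfect repeats of 1..max_unit bp motifs.

    min_len is an assumed minimum tract length in bp (hypothetical default).
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for k in range(1, max_unit + 1):
        i = 0
        while i + 2 * k <= n:
            j = i + k
            while j < n and seq[j] == seq[j - k]:   # extend while period k holds
                j += 1
            run = (j - i) - (j - i) % k             # keep whole motif copies only
            unit = seq[i:i + k]
            if (run >= 2 * k and run >= min_len
                    and is_primitive(unit) and set(unit) <= set("ACGT")):
                hits.append((i, i + run, unit))
                i += run
            else:
                i += 1
    return hits

def relative_abundance_and_density(hits, genome_len):
    """SSRs per kb and SSR bp per kb for a genome of genome_len bp."""
    kb = genome_len / 1000.0
    total_bp = sum(end - start for start, end, _ in hits)
    return len(hits) / kb, total_bp / kb

# Toy sequence: one (A)8 tract and one (AT)5 tract.
toy = "CG" + "A" * 8 + "CG" + "AT" * 5 + "GC"
hits = find_perfect_ssrs(toy)
abundance, density = relative_abundance_and_density(hits, len(toy))
print(hits, abundance, density)
```

dividing by genome length in kb is what makes genomes of very different sizes comparable, which is why these measures can stay nearly flat even when raw ssr counts rise steeply with genome size.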
figure regression analysis of relationship between ssrs length and genome size.

the sample with the size of (n = virus genomes) contains variables (p = , including the percentages of mono-, di-, tri-, tetra-, penta-, hexa-, respectively). on average, di-ssrs are the most abundant and hexa-ssrs the least, but the standard deviation is very large for each repeat unit class among the virus genomes (additional file ). even so, the correlation between the original variables is still strong and extremely significant (additional file ). the results showed that the two principal components with eigenvalues of . and . together account for . % of all differences of ssrs abundance among viruses; the first component accounts for . % and the second for . % of the total variance. the other components played a less important role in explaining the differences of ssrs abundance among virus genomes. comparing the coefficients of the parameters for the first and second components showed that the first component carries the major loading on the differences of ssrs among the analyzed genomes ( table ). the results indicated that the ssrs differences among virus genomes are mainly due to the following parameters: mono-, di-, tri- and tetra-. of these, di- affects the differences of ssrs among virus genomes most strongly, with a loading of . , followed by tri-, mono- and tetra-. in this component, penta- and hexa- played a relatively minor role in explaining the differences of ssrs among virus genomes. in the second component, hexa- had a high positive coefficient and tetra- and penta- had negative coefficients; hexa- played the most important role in explaining differences of ssrs abundance. overall, the results of pca indicated that di- affected the ssrs variances among virus genomes most strongly, followed by tri-, mono- and tetra-, and then by hexa-; penta- played the weakest role in influencing the variances of ssrs abundance among viruses.
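the eigenvalues, explained-variance percentages and loadings reported above can be computed as in the following minimal sketch. it assumes pca is run on standardized variables, that is, on the correlation matrix, which is a common default but is not stated in the text; the function name pca_from_correlation and the random six-column data set are hypothetical stand-ins for the six ssr class percentages.

```python
import numpy as np

def pca_from_correlation(X):
    """PCA on standardized variables via eigen-decomposition of the correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)          # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)         # guard against tiny negative round-off
    explained = eigvals / eigvals.sum()           # share of total variance per component
    loadings = eigvecs * np.sqrt(eigvals)         # variable-component correlations
    return eigvals, explained, loadings

# Hypothetical data: 200 "genomes" x 6 "repeat-class percentages".
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.5 * rng.normal(size=(200, 6))

eigvals, explained, loadings = pca_from_correlation(X)
print(np.round(eigvals, 3))
print(np.round(explained[:2].sum() * 100, 1), "% explained by the first two components")
```

with six standardized variables the eigenvalues sum to six, so a component with an eigenvalue above one explains more variance than any single original variable; the sign and size of each loading show how strongly a repeat class contributes to that component.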
all results of kaiser-meyer-olkin (kmo), bartlett's and scree test indicated that it is significantly meaningful to analyze our data using pca ( table ). the kmo measure with the value of . is close to , and bartlett's test (< . ) approximates to , and scree plot displays the "cliff" and the "screes" vividly (additional file ). moreover, the correlation is strong between the original variables (additional file ). ssrs vary greatly in repeat classes and motifs among analyzed virus genomes (table , additional file , additional file , additional file and additional file ). dinucleotide ssrs accounts for the largest proportion of . % in all repeat classes, followed by mono-( . %) and trinucleotide ssrs ( . %). both a and t mono-ssrs are much more than c and g ssrs, and they make up about . %, . %, . % and . % of all ssrs in analyzed viruses respectively. at/ta ssrs predominate in dinucleotide repeats with the proportion of . %, and it is slightly more than a and t mono-ssrs ( . %, . %); other di-repeat motifs are neck and neck in occurrence, but they are all higher than c and g mono-ssrs (table ) . repeat motif group of aat/ata/att/ taa/tat/tta showed the highest percentage and agt/act/cta/gta/tac/tag showed the lowest percentage in tri-ssrs. tetra-, penta-and hexanucleotide ssrs are rare, accounting for . % more or less. it's abnormal that penta-ssrs are less than hexa-ssrs with . %, which is approximately only one third of hexa-ssrs. however, it is usually assumed that the longer repeat unit, the lower frequency it occurred. repeat motifs differ greatly among different virus genomes (details in additional file , additional file , additional , additional file ). these analyses extend those in chen et al. 
[ ] in three ways: firstly, by using larger sample such that these analyses cover almost all taxonomic virus genera; secondly, by making the data more comprehensive because the genome size varies greatly, ranging from bp (s -(−)ssrna- , hepatitis delta virus, nc_ ) to bp (s -dsdna- , emiliania huxleyi virus , nc_ ), (additional file ); and thirdly, by applying statistically significant methods. the above extension made it possible to investigate the relationship between repetitiveness of microsatellites and genome size more fully and deeply. the previous analysis [ ] simply considered the correlation between microsatellites and genome size based on relatively small sample with complete hepatitis c virus (hcv) genomes, and they found that the number of ssrs is weakly correlated with genome size. we believe that chen's result is lacking of statistical significance due to the relatively small sample size and uniform genome length. here, the sample made up of representative virus genome sequences was designed to investigate the relationship between ssrs and genome size on the level of the whole virus. the result of our data showed a very strong and significant positive relationship between the occurrence, or length of ssrs and genome size with the value of r = . , p < . ( figure a ) and r = . , p < . ( figure a ), respectively. that is, the longer the virus genome sequence, the more ssrs extracted. hancock [ , , ] confirmed that the simple sequence repeats were positively and significantly correlated with the genome size in both archaea and eubacteria, and ssrs accumulate preferentially in organisms with larger genomes. moreover, there is evidence proved that short ssrs ( - bp length) exist in reduced genomes, but long ssrs ( - bp length) consist in larger genomes in prokaryotes [ ] . 
the overall level of repetition in genomes is related to genome size and to the degree of repetition, and the entire genome accepts simple sequences in a concerted manner when its size increases [ , ] . a relative scarcity of repeating dna is a major factor in causing the relatively compact size of the avian genome [ , ] . what's more, differences in genome size account for approximately % of the variance in genomic repetition in archaea and eubacteria [ ] , suggesting that other factors can also play important roles. dna structure and base-stacking determined the number and length distributions of microsatellites in vertebrate genomes over evolutionary time [ ] . hosts are responsible for the variances of ssrs content to a certain degree. for example, with the similar genome size, viruses infecting vertebrates and invertebrates tend to be higher than viruses attacking bacteria in ssrs content, relative abundance and relative density of ssrs overall (additional file ). this can be explained by the following statements. genomes of reptiles are estimated to consist of about - % repeats, birds have been estimated to consist of - % of repeats [ , ] , mus musculus of . % [ , ] , and . % of human genome were occupied by repeats [ , ] . while ssr tracts make up . % of the e. coli genome [ ] , significantly less than vertebrates'. ssrs have been reported to be hot spots for recombination as well as sites for random integration [ , ] . thus, the increase of viral ssrs content is maybe due to combining partial genome sequences of hosts in the process of infecting vertebrates and invertebrates. as we know, hosts evolved a number of defense systems in response to the challenge from parasites. meanwhile, the parasites evolved multiple counter-defense mechanism as well under the selection pressure from hosts. 
bacteria have developed crispr/ cas (crispr, clustered regularly interspaced short palindromic repeats; cas, crispr-associated) immune system to defend against bacteriophages by cleaving their dna [ ] . antagonistic coevolution between bacteria and their ubiquitous parasites, bacteriophage (phage), is well known [ , ] . the genomic regions of crispr/cas are hot spot of recombination, and crispr/cas modules underwent rapid evolution in natural environments because of recurrent selection pressure exerted by coevolving viruses [ ] . meanwhile, viruses may combine partial crispr/cas sequence in response to the counterdefense of bacteria. therefore, it is no coincidence that ssrs content is high in both viruses that infect vertebrates and invertebrates and these hosts themselves. the recombination enhanced the virus's ability of infection and anti-immunity to a certain extent. evolutionarily speaking, it is the result of selection in the process of interaction between viruses and hosts. it has proposed that reduced genome size represents an adaptation to the high rate of oxidative metabolism in birds, which results primarily from the demands of flight, and the relatively small genome size of birds in general may reflect the selective pressure to minimize the amount of repetitive dna [ , ] . overall, the longer genome sequence, the stronger capability the genome holding long ssrs. each type of repeat unit is distributed in a certain length range of genomes. mono-and di-ssrs were observed in almost all analyzed virus genomes; tri-repeats appeared to widely distribute in all virus genomes but it's number is obviously less than mono-and di-ssrs; tetra-ssrs as a common component consist in genomes with size more than kb ( . % of the genomes contain tetra-ssrs in group of genome > kb). in contrast, it is relatively rare in genomes with the size < kb; genomes containing penta-and hexa-ssrs are not more than % in < kb group. 
moreover, the number of tetra-, penta- and hexa-ssrs is very small ( table ) . dinucleotide and trinucleotide ssrs were observed in all analyzed hiv genomes (genome size approximately kb), but almost no tetra-, penta- and hexanucleotide ssrs were found [ ] . tetranucleotide ssrs are present in . % of the analyzed potyvirus genomes (genome size approximately kb), but their number is small [ ] . tetra-, penta- and hexanucleotide ssrs are also rare in mycoplasma, but they are relatively abundant in bacterial [ , ] , fungal [ ] , plant [ ] , vertebrate [ , ] and human [ , ] genomes. these results confirm that ssrs distribution is closely related to genome size. the accumulation of simple sequence repeats can be attributed to selection in the course of evolution. it is well known that viruses such as influenza virus, hepatitis virus and human immunodeficiency virus (hiv) have high mutation rates during replication and/or recombination, which help them resist drugs and vaccines and are one of the reasons why flu, hepatitis and acquired immunodeficiency syndrome (aids) are difficult to cure. moreover, viruses lack complete repair mechanisms. long ssrs are therefore rarely found in viruses. in the opinion of mrázek et al. [ ] , small genomes are under strong negative selection against long ssrs because of their strong constraints against expansion.

genome size is an important factor affecting the occurrence and the total length of ssrs, and both are positively correlated with it. hosts also account for some of the variance in ssrs content. for example, at similar genome sizes, viruses infecting vertebrates and invertebrates tend to have higher ssrs content, relative abundance and relative density than viruses infecting bacteria.
we inferred that maybe viruses combined partial genome sequences of hosts in infecting, resulting in relative large genome and high content of ssrs. evolutionarily speaking, it is the result of selection in the process of interaction between viruses and hosts. virus is a group of parasite, so studying of ssrs in viruses is helpful to the research of many etiopathogenesis of its hosts. whole genome molecular phylogeny of large dsdna viruses using composition vector method evolutionary genomics of nucleo-cytoplasmic large dna viruses here a virus, there a virus, everywhere the same virus? virus taxonomy, viiith report of the ictv the ancient virus world and evolution of cells rapid evolution of rna genomes the evolution of viral emergence viruses at the edge of adaptation evolution experiments with microorganisms: the dynamics and genetic bases of adaptation belshaw r: viral mutation rates distinctive features of large complex virus genomes and proteomes dna viruses: the really big ones (giruses) transposable element contributions to plant gene and genome evolution international human genome sequencing consortium: initial sequencing and analysis of the human genome genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects microsatellites within genes: structure, function, and evolution features of trinucleotide repeat instability in vivo abundance and length of simple repeats in vertebrate genomes are determined by their structural properties simple sequence repeats in organellar genomes of rice: frequency and distribution in genic and intergenic regions mining microsatellites in eukaryotic genomes computational and experimental analysis of microsatellites in rice (oryza sativa l.): frequency, length variation, transposon associations, and genetic marker potential analysis of distribution indicates diverse functions of simple sequence repeats in mycoplasma genomes simple sequence repeats in prokaryotic 
genomes expandable dna repeats and human disease meiotic recombination hot spots and human dna diversity high-resolution genome-wide mapping of transposon integration in mammals microsatellite instability in colorectal cancer heritable germline epimutation of msh in a family with hereditary nonpolyposis colorectal cancer germline epimutation of mlh in individuals with multiple cancers familial endometrial cancer in female carriers of msh germline mutations tumor microsatellite instability in early onset gastric cancer biallelic somatic inactivation of the mismatch repair gene mlh in a primary skin melanoma an appraisal of the potential for illegitimate recombination in bacterial genomes and its consequences: from duplications to genome reduction microsatellite is an important component of complete hepatitis c virus genomes the contribution of slippage-like processes to genome evolution simple sequences in a 'minimal' genome simple sequences and the expanding genome dna repeat arrays in chicken and human genomes and the adaptive evolution of avian genome size the repetitive landscape of the chicken genome dna sequence organization in avian genomes genome-wide analysis of microsatellite polymorphism in chicken circumventing the ascertainment bias initial sequencing and comparative analysis of the mouse genome mgsc genome assembly release analysis of the largest tandemly repeated dna families in the human genome comparative analyses of human single-and multilocus tandem repeats simple sequence repeats in escherichia coli: abundance, distribution, composition, and polymorphism the crispr/cas bacterial immune system cleaves bacteriophage and plasmid dna bacteria-phage antagonistic coevolution in soil coevolution with viruses drives the evolution of bacterial mutation rates nature and intensity of selection pressure on crispr-associated genes a bird's-eye view of the c-value enigma: genome size, cell size, and metabolic rate in the class aves cell size and nuclear dna 
content in vertebrates similar distribution of simple sequence repeats in diverse completed human immunodeficiency virus type genomes microsatellites in different potyvirus genomes: survey and analysis compound microsatellites in complete escherichia coli genomes survey of simple sequence repeats in completed fungal genomes genomic distribution of simple sequence repeats in brassica rapa the genome-wide determinants of human and chimpanzee microsatellite evolution genome-wide analysis of tandem repeats in daphnia pulex-a comparative approach coevolution between simple sequence repeats (ssrs) and virus genome size

we would like to thank chuansheng he for the language editing and anonymous reviewers for constructive comments on the earlier version of the manuscript.

additional file : scree plot. it displays the "cliff" and the "screes" vividly, which visually confirms that pca is well suited to the current data set.

the authors declare that they have no competing interests.

authors' contributions: zt and ml conceived and designed this study. xz and yt performed the study and drafted the manuscript. hf, qo, yt and yn participated in the data processing. ry, jj, gs and ry were involved in revising the manuscript critically for important intellectual content. all authors read and approved the final manuscript.

key: cord- -kqcx lrq authors: ladner, jason t.; beitzel, brett; chain, patrick s. g.; davenport, matthew g.; donaldson, eric; frieman, matthew; kugelman, jeffrey; kuhn, jens h.; o’rear, jules; sabeti, pardis c.; wentworth, david e.; wiley, michael r.; yu, guo-yun; sozhamannan, shanmuga; bradburne, christopher; palacios, gustavo title: standards for sequencing viral genomes in the era of high-throughput sequencing date: - - journal: mbio doi: . /mbio.
- sha: doc_id: cord_uid: kqcx lrq

thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. however, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences. here, we propose five “standard” categories that encompass all stages of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. we also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade-offs associated with different levels of finishing. our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques.

viruses represent the greatest source of biological diversity on earth, and with the help of high-throughput (ht) sequencing technologies, great strides are being made toward the genomic characterization of this diversity ( ) ( ) ( ) . genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. here, we outline a set of viral genome quality standards, similar in concept to those proposed for large dna genomes ( ) but focused on the particular challenges of and needs for research on small rna/dna viruses, including characterization of the genomic diversity inherent in all viral samples/populations. our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques.
despite the small sizes of viral genomes, complications related to limited rna quantities, host "contamination," and secondary structure mean that it is often not time-or cost-effective to finish every genome, and given the intended use, finishing may be unnecessary ( ) . therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. each viral family/species comes with its own challenges (e.g., secondary structure and gc content); therefore, we provide only loose guidance on the depth of sequence coverage likely required to obtain different levels of finishing. in reality, a similar amount of data will generate genomes with different levels of finishing for different viruses. to alleviate any reliance on particular aspects of the different sequencing technologies, we have made two assumptions that should be valid in most viral sequencing projects. the first assumption is a basic understanding of the genomic structure of the virus being sequenced, including the expected size of the genome, the number of segments, and the number and distribution of major open reading frames (orfs). fortunately, genome structure is highly conserved within viral groups ( ) , and although new viruses are constantly being uncovered, the discovery of a novel family or even genus remains relatively uncommon ( ) . in the absence of such information, the defined standards can still be applied following further analysis to determine genome structure. the second assumption is that the genetic material of the virus being described can be accurately separated from the genomes of the host and/or other microbes, either physically or bioinformatically. 
depending on the technology used, it is critical that the potential for cross-contamination of samples during the sample indexing/bar coding process and sequencing procedure be addressed with appropriate internal controls and procedural methods ( ) . for a summary of the proposed categories for whole-genome sequencing of viruses, see fig. and table .

the "standard draft" category is for whole shotgun genome assemblies with coverage that is low and/or uneven enough to prevent the assembly of a single contig for ≥ genome segments. genomes in this category are likely to result from samples with low viral titers, such as clinical and environmental samples, or to be those containing regions that are difficult to sequence across (e.g., intergenic hairpin regions) ( ) . to distinguish standard drafts from targeted amplification of partial viral sequences, standard drafts should contain at least contig for each genomic segment and should be prepared in a manner that allows the possibility of sequencing the vast majority of a virus's genome. to avoid the inclusion of small pieces of genomes as "drafts," there needs to be some type of minimum cutoff for breadth of coverage. therefore, we suggest that at least a majority (≥ %) of the genome be present for a set of sequences to be considered a draft genome.

high quality (hq). genomes should be considered high quality if no gaps remain (i.e., a single contig per genome/segment), even if one or more orfs remain incomplete due to missing sequence at the ends of segments. an hq genome can often be achieved with modest levels of ht sequencing coverage (~ to ×) or through sanger-mediated gap resolution of an sd.

coding complete (cc). the "coding complete" category indicates that in addition to the lack of gaps, all orfs are complete. this level of completion is typically possible with high levels of ht sequencing coverage (> ×) or may require the use of conserved pcr primers targeting the ends of the segments.

complete. a genome is complete when the genome sequence has been fully resolved, including all non-protein-coding sequences at the ends of the segment(s). this is typically achieved through rapid amplification of cdna ends (race) or similar procedures.

finished. this final category represents a special instance in which, in addition to having a completed consensus genome sequence, there has been a population-level characterization of genomic diversity. typically this requires ~ to , × coverage (see below). this provides the most complete picture of a viral population; however, this designation will apply only for a single stock. additional characterizations will be necessary for future passages.

population-level characterization. ht sequencing technologies provide powerful platforms for investigating the genetic diversity within viral populations, which is integral to our understanding of viral evolution and pathogenesis ( , ) . population-level characterization requires very high levels of ht sequencing coverage ( , ); however, the exact level will depend on the background error profiles of the sequencing technology and the desired level of sensitivity. as an example, wang et al. ( ) determined that for pyrosequencing data, ~ × coverage is necessary to identify minor variants present at % frequency with . % confidence, and ~ , × coverage is needed for variants with a frequency of . %. targeted amplification of the viral genome is often necessary to achieve these coverage requirements. due to the modest sequence lengths of most ht technologies, the state of the art for population-level analysis has been the characterization of unphased polymorphisms. however, single-molecule technologies, with maximum read lengths of > kb, are opening the door for complete genome haplotype phasing ( ) .

identification of contaminants or adventitious agents.
after isolation, viruses are often maintained as stocks, which are propagated within host cells in tissue culture and thus amplified and preserved for future use. despite careful laboratory practices, it is possible for these stocks to become contaminated with additional microbes. contaminating microbes are often detrimental to subsequent applications such as vaccine development or the testing of therapeutics, making it imperative to monitor the purity of viral stocks. ht sequencing provides a powerful method for not only detecting the presence of contaminants within a sample but also for identification and characterization of any contaminants. the level of sequencing required for contamination analysis is dependent on the desired sensitivity, with more sequencing required to ensure detection of contaminants present at very low levels. for most approaches, hq-level sequencing should be sufficient. depending on the intended applications, analysis may need to be repeated after further passaging to ensure that no additional contaminants have been introduced. description of novel viruses. despite the rapidly growing collection of viral sequences, the description of novel viruses is likely to remain an important aspect of viral genome sequencing ( , , ) . this is true in part because viruses evolve rapidly and are capable of recombining to form novel genotypes ( , ) . it is also true that most of the viruses that are currently circulating remain uncharacterized ( ) . particularly lacking are representatives from groups that are not currently known to infect humans or organisms of economic importance. it would be imprudent, however, to continue to ignore these uncharacterized reservoirs of diversity, because it is difficult to predict the source of future emerging diseases ( ) ( ) ( ) . 
additionally, with the current suite of primarily sequence similarity-based pathogen identification tools, the ability to detect novel pathogens is wholly dependent on high-quality reference databases ( ) . there is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the time and resources required to complete the last to % of a viral genome are often prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization. therefore, for the majority of viral characterization projects, we recommend, at a minimum, a cc genome. this will ensure a complete description of the viral proteome and will allow accurate phylogenetic placement.
molecular epidemiology. one of the most common and important applications for viral genomes is in the study of viral epidemiology, which encompasses our understanding of the patterns, causes, and effects of disease. early studies of molecular epidemiology targeted small pieces of viral genomes; however, this type of analysis is likely to miss important changes elsewhere in the genome. therefore, there has been a strong focus in recent years toward the sequencing of "full" viral genomes. institutes such as the broad institute and the j. craig venter institute (jcvi) have been instrumental in breaking ground in the collection of large numbers of good-quality viral sequences. their newly identified genomes typically fall within our cc category. this is likely to remain the gold standard for studies involving a large number of genome sequences, especially when some come from low-titer clinical samples, often necessitating amplicon-based sequencing methods. cc genomes allow for interrogation of changes throughout the coding portion of the viral genome and often include partial noncoding regions.
in the absence of high-throughput race alternatives, the time and resources required to complete hundreds or thousands of genomes are likely to continue to outweigh the potential information gained from completing the terminal sequences.
countermeasure development. advancements in our capabilities to sequence viral genomes are changing the way we counteract global pandemics and acts of bioterrorism. there are two important aspects of countermeasure development that can benefit strongly from the availability of genome sequences and ht sequencing data: the detection of the infectious agent and the treatment of the disease caused by the agent. taxonomic classification and detection through dna/rna-based inclusivity assays (i.e., using techniques such as pcr to detect the presence of a pathogen) can be designed using fragmented and incomplete genomes (e.g., sd and hq sequences). fully resolved orfs (cc) further enable the development of immunological assays, such as enzyme-linked immunosorbent assays (elisa) and immunofluorescence assays (ifa), for protein-based detection, and obtaining a complete genome opens the door to a plethora of additional downstream applications, including the design of exclusivity tests, the establishment of reverse genetics systems, and the design of robust forensics protocols. however, for effective development and testing of animal models, therapeutics, vaccines, and prophylactics, it is necessary to obtain a complete picture of the variability present within both the challenge stock and postinfection populations, thereby necessitating finished genomes. in these medical applications, it is also important to demonstrate the absence of adventitious agents.
in addition to standardizing the vocabulary of viral genome assemblies, it is also critical for researchers to routinely provide raw sequencing reads. without these, it is impossible for others to independently verify the quality of an assembly.
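the application-specific recommendations in the preceding sections can be collected into a small lookup table. the sketch below is one possible encoding, not part of the proposed standard: the ordering sd < hq < cc < complete < finished is inferred from the text, and the application keys are hypothetical labels introduced only for illustration.

```python
from enum import IntEnum


class Category(IntEnum):
    """Assembly categories, ordered from least to most complete (assumed order)."""
    SD = 1
    HQ = 2
    CC = 3
    COMPLETE = 4
    FINISHED = 5


# Minimum category suggested in the text for each application.
# The keys are illustrative labels, not terms defined by the source.
MINIMUM_CATEGORY = {
    "nucleic_acid_detection_assay": Category.SD,
    "contaminant_screening": Category.HQ,
    "novel_virus_description": Category.CC,
    "molecular_epidemiology": Category.CC,
    "immunological_assay": Category.CC,
    "reverse_genetics": Category.COMPLETE,
    "therapeutic_or_vaccine_testing": Category.FINISHED,
}


def is_sufficient(application: str, category: Category) -> bool:
    """Return True if `category` meets the minimum for `application`."""
    return category >= MINIMUM_CATEGORY[application]


if __name__ == "__main__":
    print(is_sufficient("molecular_epidemiology", Category.CC))
    print(is_sufficient("reverse_genetics", Category.CC))
```

using an ordered enumeration keeps the comparison explicit, so a genome that exceeds the minimum category is accepted without listing every allowed category per application.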
data repositories such as genbank already provide a platform for depositing ht sequencing reads, but this is not a requirement for the submission of a genome, nor is this option typically utilized. wider analysis of data will ultimately result in higher-quality assemblies. it is worth considering broader implementation of a wiki-like, crowdsourcing strategy to genome assembly, similar to the annotation strategies that have been adopted for specific genomes of high interest ( , ) . this approach would allow multiple parties to work on genome assembly and annotation at the same time and would provide instant updates for the entire community to evaluate and utilize in their own research. our primary goal here is to initiate a conversation. the rate at which viral genomes are being sequenced is only going to increase in the coming years, and without some standardization, it will be impossible for these valuable resources to be utilized to their full potential. we present these categories as a starting point, with the goal of adjusting and refining them over time as our capabilities and needs continue to change. crystal ball. the viriosphere: the greatest biological diversity on earth and driver of global processes metagenomic analysis of coastal rna virus communities the search for meaning in virus discovery genome project standards in a new era of sequencing next generation sequencing of viral rna genomes . virus taxonomy. 
ninth report of the international committee on taxonomy of viruses
human viruses: discovery and emergence
double indexing overcomes inaccuracies in multiplex sequencing on the illumina platform
rescue of the prototypic arenavirus lcmv entirely from plasmid
viruses as quasispecies: biological implications
quasispecies diversity determines pathogenesis through cooperative interactions in a viral population
characterization of mutation spectra with ultra-deep pyrosequencing: application to hiv- drug resistance
highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data
the advantages of smrt sequencing
a strategy to estimate unknown viral diversity in mammals
the changing face of pathogen discovery and surveillance
the evolution of epidemic influenza
characterization of the candiru antigenic complex (bunyaviridae: phlebovirus), a highly diverse and reassorting group of viruses affecting humans in tropical america
isolation and characterization of viruses related to the sars coronavirus from animals in southern china
the emerging novel middle east respiratory syndrome coronavirus: the "knowns" and "unknowns"
relationship between domestic and wild birds in live poultry market and a novel human h n virus in china
computational tools for viral metagenomics and their application in clinical research
web apollo: a web-based genomic annotation editing platform
pseudomonas genome database: improved comparative analysis and population genomics capability for pseudomonas genomes
key: cord- -j jpqd k authors: o'brien, stephen j. title: cats date: - - journal: curr biol doi: . /j.cub. . . sha: doc_id: cord_uid: j jpqd k
ecological niches on every continent except australia and the poles.
wild cats dominate their habitat but require vast expanses to survive, which explains the tragic depredation such that every species of felidae, except the domestic cat, is considered either endangered or threatened in the wild today by cites, iucn red book and other monitors of the world's most endangered species. there are lots of them, small and easy to keep in the laboratory. whether we love cats or hate them, biology has learned much from studies of their comparative anatomy, comparative physiology (notably neurology) and behavior. for many biological subdisciplines, the cat species are considered as three groups: large, medium and small, a testament to their amazing similarity. the exception is reproduction, where each species has evolved exquisite co-adapted strategies for ovulation, hormone level regulation, sperm production, estrous incidence, mating preference and social organization. scrutiny by behavioral ecologists has provided a rich literature of distinctive reproductive parameters for several cat species, facilitating advances in assisted reproduction such as artificial insemination, cryopreservation, embryo transfer, in vitro fertilization and the first cloning of a domestic cat in . the advanced stage of reproductive assessment will one day soon lead to feline embryonic stem cells, transgenic and gene knockout cats, and protocols for stem cell and gene therapy trials. one of the most powerful biomedical models for cats involves the interaction of deadly infectious agents and the cat host's genome. domestic cats first gave us feline leukemia virus, which allowed the discovery of scores of 'oncogenes' in the s and s. when homologous human oncogenes were rapidly discovered thereafter, their misfiring in signal transduction pinpointed the molecular basis for many aggressive cancers. 
more recently, feline immunodeficiency virus (fiv), a first cousin of hiv, was discovered in house cats as the cause of a depletion of the cd t-cell subset that is a prelude to immune system collapse and pathology, providing the only naturally occurring model of aids. interestingly, over eight free-ranging wild species of felidae are infected with their own species-specific fiv strain (based on fiv gene sequence monophyly) that in most cases seems to be attenuated by historic selection of genetically resistant survivors in today's wild places. the devastating sars human coronavirus has a feline counterpart that causes a deadly feline infectious peritonitis (fip) syndrome in domestic cats. an outbreak of fip in a genetically uniform african cheetah colony led to % morbidity and % mortality, emphasizing the sensitivity of genetically inbred hosts to viral outbreaks. cats and their wild relatives have given some sobering lessons about emerging virus outbreaks. in the mid s, a feline pan-leucopenia virus cultivated in a cat vaccine factory abruptly jumped from cats to dogs, producing a hyper-virulent strain in puppies that within a few months caused widespread puppy mortality across the world. payback from the dogs came when a strain of canine distemper, endemic in the pet dogs of masai tribesmen in tanzania, jumped to hyenas and then to african lions, killing a third of the huge lion population of serengeti national park in a six month interval in . add to the list of verified cat-specific agents: alpha-herpesvirus, toxoplasmosis, cryptococcus, plague, q-fever, chlamydiosis and rotavirus infections, ehrlichiosis, calicivirus infection, poxvirus infection and mycobacteriosis. cats are also highly resistant to anthrax, with obvious implications. all these infections, and more, could prove valuable to biomedical research, provided we have a better working knowledge of the innate and adaptive immune system of cats.
[figure: a persian breed domestic cat. photo courtesy of m. calingo.]
domestic cats and dogs enjoy more medical scrutiny than any species except humans. the world's veterinary schools produce thousands of practitioners each year, providing extensive documentation of genetic and chronic diseases with relevance to human maladies. the result is a comprehensive veterinary literature, which has described some feline genetic diseases (http://www.angis.org.au/databases/birx/omia/). these disease models offer not only insight into disease development but also opportunities for better diagnostics and experimental treatments.
the national human genome research institute (nhgri) announced its decision to support sequencing of the domestic cat genome, along with those of seven other species: elephant, armadillo, tenrec, common shrew, guinea pig, hedgehog and rabbit. these species were chosen to complement those already selected for whole genome sequencing (human, mouse, rat, cow, chimpanzee, macaque, opossum and platypus) and to reflect the diversity among the living species of mammals as a first step in annotating the human genome's largely uninterpreted coding, regulatory and evolutionarily conserved sequences. the cat offers the promise of a second carnivore species (in addition to the dog, which shares a common ancestor with cats dating back to approximately million years ago) to improve human genome annotation, as well as to complement the biomedical and genomic discoveries that make the feline genome attractive. genome evolution in mammals appears to proceed at two very different rates. the common default rate of chromosomal exchange is very slow and deliberate, so that the genome organization can be inferred for the common ancestor of all primates, carnivores and placental mammals. but a more rapid mode of genome rearrangement is seen in some lineages, such as gibbons, owl monkeys, dogs, bears and murid rodents, the genomes of which appear to have been re-shuffled several times relative to the ancestral form.
cats and humans both have genomes in the primitive, un-rearranged form, so the cat provides a good opportunity to study the constraints on genome organization that have characterized the million year old mammalian radiations. the conserved genome of the cat is retained in the other felidae species, as well as most of the species of the carnivora order, the only reshuffled exceptions occurring in the dog and bear families.
what can we expect after the cat genome sequence becomes available? many areas of biological and medical research will benefit from the projected cat genome sequence. human genome interpretation and annotation will be augmented by, on average, a single variant for each of its three billion base pairs. cats will enjoy genomic tools for inspection of feline hereditary disease, as well as the discovery of candidate gene variants that may explain evolved genomic defenses of infectious diseases that threaten cats and man. evolutionary biologists will identify specific genes contributing to survival and species formation as unabridged gene/sequence maps narrow the search for adaptations. the observed constraints on genome reorganization will become a challenge for inferring the footprints of evolutionary steps that led to the modern cat species. an increased and informed database of mechanistic developmental specializations will add yet another reason to conserve the surviving cat species that are the keystone of species and habitat conservation projects, from cheetahs in namibia, to tigers in india and the russian far east, to lions in east africa and jaguars in the amazon.
the natural history of the wild cats
the laboratory cat
wild cats status survey and conservation action plan iucn gland switzerland
tears of the cheetah and other tales from the genetic frontier
the promise of comparative genomics in mammals
the feline genome project
wild cats of the world
key: cord- -wy zk p authors: blinov, v. m.; zverev, v. v.; krasnov, g. s.; filatov, f.
p.; shargunov, a. v. title: viral component of the human genome date: - - journal: mol biol doi: . /s sha: doc_id: cord_uid: wy zk p relationships between viruses and their human host are traditionally described from the point of view taking into consideration hosts as victims of viral aggression, which results in infectious diseases. however, these relations are in fact two-sided and involve modifications of both the virus and host genomes. mutations that accumulate in the populations of viruses and hosts may provide them advantages such as the ability to overcome defense barriers of host cells or to create more efficient barriers to deal with the attack of the viral agent. one of the most common ways of reinforcing anti-viral barriers is the horizontal transfer of viral genes into the host genome. within the host genome, these genes may be modified and extensively expressed to compete with viral copies and inhibit the synthesis of their products or modulate their functions in other ways. this review summarizes the available data on the horizontal gene transfer between viral and human genomes and discusses related problems. the relationships between viruses and their hosts are in fact more complex and diverse than is generally perceived. if their interactions are transient and limited to the single virus host paradigm, it largely determines the diagnosis of an infection, and the virus is considered an absolute parasite; this is the approach currently practiced by clinicians. however, on a larger time scale, transient interactions of this kind represent only a specific case of a more general role of viruses, since they are the basis of the evolutionary progression of the whole biological system where viruses and their hosts are constantly adapting to each other, either gaining certain advantages or suffering considerable losses. this is a continuous, ongoing process with a varying rate. 
both sides act on the population level and exhibit different extents of aggression and plasticity. for this reason, it can be extremely difficult to determine the causes and the nature of a pathology (especially chronic) in a given individual because they are often modified by virological events that occurred many generations ago. therefore, it seems fairly reasonable to consider a potential viral origin (in a broad sense) for almost any human disease with unclear etiology, even though it appears noninfectious, especially taking into account that, in particular, this origin implies the possibility of horizontal gene transfer, a phenomenon that is most efficiently mediated by viruses and largely determines the evolutionary progress as the consequence of the total sum of numerous elementary interspecies interactions [ ] . the initiation of a viral infection depends on the presence of specific receptors on host cells, i.e., on the host sensitivity to a particular virus. the host's ability to resist infection or to develop a mutual relationship with the virus determines a favorable outcome, while weaker elements can be eliminated from the host population by the lethal course of infection. the distribution of genes that encode these receptors in the human genome is shown in fig. [ ] . the parallel evolution of viruses and their host's sensitivity on the immune response level can lead to a decrease in the virus's pathogenicity [ ] . for instance, this process underlaid the spreading of the low-virulence poxvirus variant, alastrim, in the years that preceded the eradication of smallpox [ ] [ ] [ ] . viruses recognize specific receptors on the surface of host cells; otherwise, if the expression of these receptors is inhibited, they become noninfectious. for instance, this happens if the human gene that encodes the ccr receptor to hiv is damaged by a -bp deletion [ ] . 
in some cases, a virus receptor can be blocked by a protein, the gene of which was previously acquired by the host from the same virus via horizontal transfer. on the molecular level, a mutualistic solution has the form of a latent infection; a viral agent can persist for a long time in a host's cells, which protect it from external factors, while the host organism can make use of the reactivation of the virus, the expression of certain viral genes, or the production of latent rnas for its own benefit; at the same time, sporadic reactivation and release from the host body enable the virus to maintain the level of genomic variation sufficient for its evolutionary promotion. horizontal transfer involves fragments of genetic information that vary strongly in size, in particular depending on the buffer genome capacity of each participant. in the human genome, this capacity is determined by the portion of chromosomal dna, which does not contain species-specific protein-encoding sequences and, thus, can basically make a place for novel information that will be modified to reach a new balance. if we consider full-size genes, the essential sequences occupy ~ % of the human genome, while only . % of the genome are gene exons. the reverse process, i.e., the acquisition of host genes or shorter sequences by viruses, is also possible, although viral genomes obviously have a lower absolute capacity for storing the acquired material. however, there still are certain provisions; for instance, it was shown that up to % of genes in the herpesvirus [ ] [ ] [ ] [ ] and adenovirus [ ] [ ] [ ] [ ] genomes can be removed and substituted with foreign dna without losing virus viability. in some cases, it is difficult to identify the direction of the initial horizontal transfer (i.e., to determine whether a gene was transferred from the virus to the host or the other way around), because these genes start to perform important functions in both the virus and the host.
considerable interest is drawn to viral trna-like (clover leaf, or l-form) structures present in the human genome and in some viral genomes, such as alphaviruses and endogenous retroviruses. these structures can participate in the stabilization of viral rna, as well as in viral replication and translation; all of these functions are determined by the folding of these structures [ , ] . in human dna, viral insertions can be present as full-size genome sequences, but also as smaller genome segments, individual viral genes or their clusters, and short sequence fragments. the genetic material of all known virus types using all possible replication strategies can reach animal germ cells and be transferred to subsequent generations, which determines the evolutionary role of the gene flow from viruses to animals [ ] . retroviruses have certainly left the most extensive and frequent evolutionarily ancient viral traces in host dna: in the human genome, sequences of human endogenous retroviruses (herv) amount to ~ % of its total size [ ] [ ] [ ] and are derived from at least phylogenetically different sources [ ] . this fact is related to specific characteristics of the retrovirus replication machinery, which transcribes the genetic information carried by the viral rna into dna that is subsequently incorporated into the host genome. these incorporated viral sequences can be maintained in the host genome for a long time, either in the initial form, or with some modifications, and can be inherited. modifications are determined by the activity of a number of different factors, including mobile elements of the host genome and transposons, which makes it usually very difficult to identify the source of a given insertion. further on, viral genes integrated into host chromosomes can act as alleles that modify the host phenotype and sometimes provide a considerable selective advantage.
for example, it is believed that this phenomenon contributed to the evolution of viviparous placental mammals; their genomes carry retroviral insertions that encode syncytins, proteins that serve to form the syncytiotrophoblast layer of the placenta and to ensure the immunological tolerance of the mother towards the embryo. importantly, not all of the mammals (and not only mammals) possess a well-developed placenta, but those who do also have syncytins derived from surface glycoproteins of different retroviruses, the insertion of which occurred at different moments of mammalian evolution. in retroviruses, these surface glycoproteins contain immunosuppressing domains [ , [ ] [ ] [ ] [ ] [ ] [ ] , and it is these domains that are used by the novel hosts. relic retroviral sequences (herv-k) can be found in the genomes of human ancestors, old world primates, nearly to the moment of their separation from the new world primates [ ] . human chromosomal dna contains - herv-k copies, some of which contain genes that exhibit low levels of expression in normal testicular and placental tissue [ ] . at the same time, endogenous retroviruses and retrotransposons can induce carcinogenesis in somatic cells [ , ] .
other rna viruses. in fact, the scope of the described phenomena is not limited to retroviruses as such, since the ubiquity of retroviral elements in animal genomes, their activity in germline cells [ ] , along with the fact that viral replication depends significantly on rna expression, allow retroviruses to contribute in different ways to the insertion of nonretroviral genes into animal germline cells. the genomic integration of nonretroviral genes can be mediated by nonhomologous recombination with chromosomal dna [ ] or by interaction with retroelements of the host cells [ , ] . it has been shown that retrotransposons can help the host borrow sequences from different rna and dna viruses.
for example, recombination between the rna of the lymphocytic choriomeningitis virus and a murine iap retrotransposon results in reverse transcription of the rna and in its integration into the host genome [ ] . in some species, such insertions were shown to provide advantages; for example, bees that have acquired the gene that encodes structural proteins of dicistrovirus become resistant to this agent, which causes acute paralysis in wild-type individuals [ ] . this phenomenon is very common in the kingdoms of plants and fungi [ , ] . in humans, genomic incorporation of nonretroviral sequences has been described for such rna viruses as ebola [ , , , ] and marburg viruses (family filoviridae), agents of borna disease (family bornaviridae) [ , , , ] , and polioviruses (order picornavirales) [ ] . the first two families belong to the order mononegavirales, which also includes paramyxoviruses and rhabdoviruses. these viruses have different virion structures, but share a common trait: their genome is composed of a single-stranded negative-sense rna molecule up to kb long. filoviruses cause extremely severe acute infections in humans (with - % lethality), and bornaviruses cause an equally severe disease in horses (with up to % lethality). bornaviruses have also been detected in humans: in patients with severe mental disorders, such as schizophrenia [ ] , as well as in individuals without any pronounced clinical presentation, in the cells of which these viruses persisted over long periods of time [ ] . fragments of filo- and bornavirus genomes are inserted into the host genome via interaction with lines [ ] , the most common mobile elements in higher eukaryotes, which bear a reverse transcriptase gene. the most frequent findings in the host genomes are inserted fragments containing bornavirus genes n and l, which encode nucleocapsid protein n and rna-dependent rna polymerase (p ), respectively.
it is difficult to say when these fragments were inserted into the host genomes; presumably, this event occurred about ma ago. its initiation and rate varied among different hosts and for different bornavirus species. at present, in many animal species, bornavirus-derived genes have evolved into homologous genes of the host itself, ebln and ebln [ ] . these acquisitions provided an important selective advantage, enabling the host to resist devastating bornavirus epizootics. it was shown that animals that possess ebln genes are resistant to species-specific bornavirus infections, or the course of the disease is less severe in them [ ] . the molecular basis of this resistance is the excessive synthesis of a protein n analog, which inhibits p polymerase and, thus, decreases the virus yield. the initial functions of inserted genes can undergo gradual modification with time; this phenomenon is referred to as exaptation. filovirus genes appropriated by vertebrates are those that encode np nucleoprotein and structural vp protein, which inhibits interferon production in the host. cellular analogs of these genes have been found in bats, bandicoot, wallaby, and other animals of the area. the borrowed viral genes can be partially transcribed, and the resulting truncated n-terminal np fragments (the full-size sequence of the host gene apparently is not expressed [ ] ) compete with the corresponding viral component, inhibiting the replication of the virus [ ] . retroviruses are not the only group for which nucleotide sequences can be fixed in eukaryotic genomes. in the s, v.m. zhdanov, an outstanding soviet virologist, hypothesized that it should be possible for other rna viruses [ ] . in a later work, l.yu. frolova et al. showed that trna-encoding sequences in the human genome homologous to trna-like elements encoded by ltrs of endogenous retroviruses can act as targets for alphaviruses [ ] .
presumably, virus-specific revertase is not at all necessary for a viral nucleic acid to insert into a host genome [ ] . some features of this process are similar in very different viruses and are worth a detailed analysis. for example, the stable incorporation of viral genes into the host genome typical for retroviruses was also described in filoviruses (although it is less frequent), while in many other viruses, this feature is less prominent or unknown at all. the ebola virus genome has an interesting feature, which probably does not explain its ability to integrate into the host genome (most probably occurring by homologous recombination), but is nevertheless worth mentioning. it is a short ( - amino acids) immunosuppressing fragment p e, which exhibits a high level of homology in ebola and retroviruses [ ] . importantly, the function of this domain is activated as a result of the incorporation of an additional adenine, which results in a reading frame shift. this adenine insertion in p e serves as a marker of pathogenicity of both retroviruses and the ebola virus when they infect a new host. the human genome contains a number of immunosuppressing fragments: they are expressed within syncytin genes in placental cells [ ] . insertions of nonretrovirus genes can basically have occurred as a result of interactions between the gene source and a retrovirus (most commonly, a line retrotransposon). among the proteins, the genes of which were borrowed from a virus, those that perform a primarily protective function are the most likely to be fixed in their new environment. for example, fv is a protein similar to ca protein of the murine leukemia virus and competing with it; it binds the viral capsid and blocks reverse transcription, providing insensitivity to infection. in humans and other animals, there are also protective proteins encoded by genetic elements of viral origin incorporated in the host genome.
in particular, trim (and its analogs present in some primates) can inhibit the proliferation of some retroviruses in largely the same way as fv and, at the same time, affects the proinflammatory transcription factors nf-κb and ap- , which control the expression of genes involved in immune response, apoptosis, and cell cycle [ ] .
dna retroviruses. elements derived from genomes of dna retroviruses (e.g., the duck hepatitis b virus) are found significantly less frequently in host dna than those originating from rna retroviruses [ , , ] , even though they possess revertase, which could be expected to enable efficient incorporation of viral sequences into the host genome.
other dna viruses. host genomes also bear traces of encounters with dna viruses. most commonly, these are members of the large parvovirus family (parvoviridae): dependoparvoviruses (adeno-associated agents that can replicate only in the presence of a helper adenovirus or herpesvirus), nonpathogenic in humans. dependoparvovirus genes have been found in the dna of pigs, cattle, rats, mice, and other animals [ , ] . probably (although it has not been proven), these insertions serve to protect against parvovirus infections. papillomaviruses were also shown to integrate into the human genome [ ] . apart from the above agents, it was also found that the pig genome contains relic copies of the circovirus genome (noninfectious for humans). genes of nano- and geminiviruses were found in plant genomes [ , ] . human dna was also found to contain genes and larger genome elements of herpesviruses, including the epstein-barr virus (human herpesvirus , hhv ), human herpesvirus (hhv ), and other members of this superfamily [ ] . finally, at least % of the human genome is composed of fairly large virus-like sequences: the so-called selfish dna, the origin of which is unclear, while the only observed type of activity is autoreplication.
the most active group are transposons of the line class (long interspersed elements); in human, they harbor approximately one in thousand genetic mutations [ ] [ ] [ ] . on the whole, virus-like components of the human genome account for nearly a half of the chromosomal dna, and some of them play an important role in the host organism, but hardly anything is known about the origin and functions of the others. antiviral host response can transform an acute infection into the chronic or even the latent form (as it happens with herpesviruses), and a reservoir of viruses is thus conserved in their natural host, who will remain their target in case of reactivation. it should be underlined that, for many important reasons, the body of relevant data available from the existing publications is far from being complete [ ] . first of all, not all of the known viruses have been studied as potential sequence donors for the host genome, and not all the potential host species have their genomes sequenced. secondly, an insertion of a viral sequence may represent a temporary outcome of a single infection event and will not be maintained in subsequent host generations. moreover, the host species may also be eliminated from the evolution (as a result of extinction), which means that the acquired insertions will only be conserved if they are no longer limited to the extinct species and have become specific for a more general taxonomic branch, such as genus, order, or higher. next, the relatively recent insertions may be insufficiently widespread to be identified and thus evade observation. 
finally, the ability to incorporate parts of the viral genome into the chromosomal dna of host germline cells can vary strongly among different taxonomic groups of viruses, i.e., orders, families, genera, and even species. if insertions of viral sequences remain functionally active in the host cell genome, they can give rise to either proteins that function in a new environment or untranslated rnas of different sizes. if these insertions are inactive, they can merely bear witness to a history of close and evolutionarily long-term interactions between the virus and the host. the characteristic trait of human herpesviruses is that, in their typical latent state, they can persist and replicate in the form of an episome in the direct vicinity of the host genomic dna. these viruses have long coexisted with their hosts and the hosts' phylogenetic ancestors, and their genomes carry full-size genes that in turn were captured at some moment from the host and can often be expressed in their new environment [ , ] . the nucleotide sequences of these genes, the encoded amino acids, and even the functions of the resulting proteins do not correspond strictly to their cell counterparts, but the range of their functions in the virus certainly suggests a relationship between the viral and the host genes. some herpesviruses (such as hhv associated with kaposi's sarcoma) have acquired genes of serpentines (g-protein-coupled receptors; gpcr), which they employ at the early stage of lytic infection of lymphoid cells and of sarcoma itself; moreover, the inhibition of viral dna synthesis does not affect the functioning of these genes. their function is not quite clear, since they act mainly at the early stages of infection: gpcr transcripts, which have mainly bicistronic structure, protect the '-region of the coding sequence of another hhv gene, k ; while monocistronic transcripts analyzed in model experiments did not exhibit such properties. 
this probably indicates that the translation of gpcr transcripts may be reinitiated and suggests the need for further analysis of all the functions of gpcr itself in the pathogenesis of kaposi's sarcoma. the product of the gpcr-encoding gene bilf captured by hhv acts as a specific inhibitor of class i major histocompatibility complex. the reverse process can also be imagined easily, as well as subsequent gpcr modifications as a result of the repeated gene capture with the acquisition of new functions. in fact, no such data are currently available, but the research has only begun very recently. a recent study detected the first endogenous herpesvirus (genus roseolovirus) in the genome of the philippine tarsier (order primates), while insertions of nearly full-size hhv -and hhv -like genomes were found in the dna of other primates: aye-aye, lemur, and chimpanzee [ ] . in this context, it is also worth mentioning that more data exist on the integration of herpesviruses into host genomes [ ] , as well as on the incorporation of herpesvirus dna (hhv ; also of the genus roseolovirus) into the telomeric zones of human chromosomes. the significance of these insertions (which can be transmitted vertically if occur in germline cells) for human diseases or for the functioning of the immune system is currently absolutely unclear [ , , ] . similar data were obtained for viruses of other taxonomic groups. a systematic analysis showed that sequences derived from a wide range of animal viruses other than retroviruses are present as endogenous elements in mammalian, avian, and insect genomes. these elements of animal genomes represent the full spectrum of viral replication strategies [ ] ; moreover, the larger the sample of animal genomes, the wider the diversity of endogenous viral elements. obviously, the more ancient these elements are, the smaller the number of host species needed to detect them. 
to identify more recent viral insertions, a much wider sample on the order and genus level is required. however, as we have pointed out above, the currently available data are far from complete. the diversity of the known virus isolates as represented in virus gene/protein banks is but a small portion of the total virus diversity. in view of their likely ancient origin, members of many virus families may be much more widely distributed among their mammalian hosts than we currently imagine, both as separate entities and as genome fragments in the host dna. this is also reflected in the virus phylogeny, which was constructed using endogenous virus insertions along with exogenous viruses; close exogenous relatives frequently are either not identified or were only described in the past decade [ ] [ ] [ ] . the recently discovered relationship between filoviruses and marsupials has unexpectedly identified this infraclass of mammals as a potential filovirus reservoir. the presence of viral insertions may become an important factor in the evaluation of findings obtained using metagenomic approaches [ , , ] . all of these data have been discussed in several comprehensive reviews [ , , ] . the capture of certain fragments of a viral genome by the host may be a random event; however, should the captured genes reach germline cells and prove useful to the host, they can become fixed in subsequent generations. it may prove interesting to consider information that concerns shorter insertions of viral origin. they can be expected to exist in much higher numbers, since the probabilities of their incorporation, as well as of transposition or multiplication, seem higher than for larger insertions. fragments - nt long would have the same size as the biologically active rnas, such as those involved in rna interference. numerous studies, e.g., those reviewed in [ ] , indicate that such molecules can directly participate in the regulation of mammalian viruses. 
for instance, liver-specific mir- suppresses the replication of hepatitis c virus [ ] , while several human micrornas, such as mir- a- p, mir- , and mir- a- p, inhibit hepatitis b virus [ ] . other human micrornas are targeted against the influenza virus (mir- , mir- , and mir- ), the vesicular stomatitis virus (mir- and mir- ), and against hiv (mir- , mir a, mir- b, mir- , mir- , and mir- ) [ ] . there are mirnas against hhv and hhv herpesviruses (mir- / and mir- b/ ) [ ] [ ] [ ] , coxsackie virus (mir- - p) [ ] , and human papilloma virus [ ] . it is also known that some viral mirnas can circulate in humans. for instance, mirna-ul - is encoded by the human cytomegalovirus (hhv ; genus cytomegalovirus, family herpesviridae), and its target is mrna of the pre-early viral protein ie [ ] ; mirna of hhv is targeted against bart lmp mrna [ , ] , while mirnas of herpes simplex viruses (hhv and hhv ), mir-h , mir-h , and mir-h , are directed against mrna of viral proteins icp and icp . [ ] . similar mirna-mrna pairs were detected in experimental models of a retroviral infection [ ] . it was shown that endogenous small interfering rnas (sirnas) can regulate the activity of some endogenous retroviruses [ ] . however, it is currently hardly possible to identify the targets of micrornas based on their sequences only, since they do not need to be strictly complementary. in addition to the clip-seq and par-clip techniques that employ immunoprecipitation of rna-protein complexes with subsequent sequencing [ ] , it may prove helpful to analyze the data on the coexpression of mirnas and their putative gene targets [ ] . two other groups of molecules participating in rna interference are sirnas and so-called piwirnas (or pirnas). the latter are - nt long; their activity is mediated by a specific mechanism involving enzymes of the piwi family, and their targets are usually transposons, retrotransposons, and endogenous retroviruses [ ] . 
the - -nt long fragments that form sirnas and pirnas must be almost fully complementary to their target sequences (with mismatches of no more than nt). these rnas are mainly targeted at exons, which facilitates the in silico search for putative small insertions in the recipient dna. the functional activity of sirnas and pirnas should be determined experimentally. the detection of nucleotide fragments of the specific size in the escherichia coli genome prompted the hypothesis that bacteria possess an antiviral immune defense system [ ] , which later became the basis for the development of the revolutionary gene editing tools, crispr cas /cpf [ , ] . we attempted to identify -to -nt-long homology stretches (which we provisionally refer to as hits) in the human genome and in the genomes of human adenoviruses and herpesviruses and showed that their number was significantly higher than in extended samples that included either viruses that do not directly infect humans, such as bacteriophages, or artificially generated sequences of the same length as the herpesvirus genome [ ] . later, we showed that, even in a given group of viruses (hhvs), the portion of the viral exome that corresponds to hits, i.e., short - -nt long sequences homologous between the virus and the host, is specific to each virus type and, as a first approximation, can be related to the destructive effect of the viral infection [ ] . we have also proposed a hypothesis that reactivated hhv could exhibit oncolytic properties in vivo [ ] , because the body of rl (which encodes icp . protein) contains a large number of virus/human homologous sequences (hits); in artificial hhv -based oncolytic constructs, this gene is switched off first. obviously, our hypothesis requires experimental verification. the figure schematically shows the load of human dna with viral relics. 
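As a rough illustration of the hit search described above, the following Python sketch lists every k-nt stretch of a viral sequence that also occurs verbatim in a host sequence, and generates a random sequence of equal length as a control. It is a minimal sketch under stated assumptions, not the authors' pipeline: the names `shared_hits` and `random_sequence`, the parameter `k`, the restriction to exact forward-strand identity, and the toy sequences are all hypothetical choices made for this example.

```python
import random


def shared_hits(virus, host, k):
    """Return (position_in_virus, kmer) for each k-mer of `virus` found in `host`.

    Assumption: a "hit" is an exact, forward-strand match of length k.
    """
    # Index every k-mer of the host once so each lookup is O(1).
    host_kmers = {host[i:i + k] for i in range(len(host) - k + 1)}
    return [
        (i, virus[i:i + k])
        for i in range(len(virus) - k + 1)
        if virus[i:i + k] in host_kmers
    ]


def random_sequence(length, seed):
    """Hypothetical control: a uniformly random DNA string of the given length."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    host = "GGACGTACCC"    # toy stand-in for host DNA
    virus = "ACGTACGTTT"   # toy stand-in for a viral genome
    print(shared_hits(virus, host, k=5))  # [(0, 'ACGTA'), (1, 'CGTAC')]
    control = random_sequence(len(virus), seed=0)
    print(len(shared_hits(control, host, k=5)))  # hit count for the control
```

Comparing the hit count for a real viral sequence with the counts for many such controls is one simple way to ask whether the observed number is higher than chance; a real analysis would also need to handle both strands and genome-scale indexing.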
we define the known viral insertions and their derivatives, including transposons, as relics, since we are currently able to identify only those that were incorporated into the host genome a long time ago. human translated genes constitute approximately % of the total information volume of genomic dna, whereas virus relics and virus-like structures (with some reservations, in addition to endogenous retroviruses, these may be assumed to include retrotransposons and dna transposons) amount to nearly half of the human genome. this information suggests that there might be a need to reconsider the notion of a virus. from the traditional medical point of view, a virus is a parasite that infects a sensitive cell and ultimately causes its destruction. the virus (its genome) can also infiltrate the host dna, occupying the free space or adding to its size. by replicating in the host genome, the virus ensures its genetic diversity and at the same time acquires a safe depot protecting it from external factors. this viral activity can be reasonably classified as parasitic. however, from this viewpoint, we overlook the potential evolutionary advantages that the host population may gain as a result of a viral infection: viruses are the most efficient vectors, which might transfer novel genetic information. for this reason, we propose to define a virus as an information carrier indistinguishable from a specific one and possessing an autonomous autoreplication program that employs the addressee's reading, synthesizing, and metabolic machinery for its own realization. this definition underlines the major evolutionary function of viruses as carriers of genetic information and is free from moral judgment, which is foreign to nature. moreover, this definition has a more general character than the one specifying viruses as parasites. certainly, these considerations do not imply that homo sapiens does not need to struggle with destructive viral infections. 
moreover, the fact that, in nature, [ ] [ ] [ ] .

fig. portions of the human genome occupied by protein-coding genes ( %) and relic viral sequences: endogenous retroviruses ( %), dna transposons ( %), sines ( %), and lines ( %). the dark area of the human genome ( %) also contains other viral sequences (see text), genes whose activity is limited to transcription, and structures whose function is currently unknown.

as for the short regions of homology to herpesvirus or adenovirus dna (which can basically be viral insertions), their distribution in the human genome is specific for every virus species, and on the whole, as we showed previously [ ] , it can hardly be random. it was not the main objective of our brief review to list all of the available data on the question. the body of these data is rapidly growing, which, however, does not seem to change the existing consensus view on the problem. obviously, mutual insertions of genomic fragments in virus-host pairs occur regularly, if not very frequently. their exact frequency remains to be determined both on the population level and on the level of single individuals. the objective is not so much to determine the number of already incorporated foreign genes but rather to find out how this frequency depends on the type of the virus and its host. it is also important to determine the consequences that each insertion, either transient or fixed, can have for the human host within the lifetime of the hosts or of their ancestors. it should be evaluated how often these insertions reach the host's germline cells and what key factors govern these events. there are currently no answers to these questions, but these issues are addressed in different areas of study, which sometimes produces unexpected results. for instance, the haldane and waddington problem about the number of generations required to obtain recombinant inbred strains was initially solved by the authors only for the cases of two and three genes [ ] . 
recently, for the general case samal and martin proposed an approach based on a statistical formalism rarely used in areas other than physics [ ] ; surprisingly, it provided an exact solution to the problem involving any given number of genes. although the above example lies in the area of population genetics, while the number of studies on our topic of interest is insufficient to allow population-level generalizations, it nevertheless confirms the need to continue the efforts and emphasizes the productivity of breaking traditional thinking patterns, even in such conservative fields as medical diagnostics and therapy. the exchange of genetic information between living organisms is a complex albeit an infrequent phenomenon and generates a considerable uncertainty when we attempt to provide a comprehensive explanation of the causes and nature of a given pathological condition. the dynamic equilibrium between the human genome sensu stricto and the integrated viral sequences that perform protective and regulatory functions in the host organism represents a much deeper relationship than the organism's interaction with the internal microbiota. currently, these relationships constitute the subject of separate omicstype research. they are the consequence of a multisided (not just a two-sided) encounter, where one side is represented by a viral population, which is heterogeneous in each individual case and consists of different proportions that are both fully featured and defective, as well as mutant particles, while the other side is a multicellular organism that developed as a result of macro-and microevolution. viruses have enriched the host genome in functional virus relics, the amounts and diversity of which greatly exceed the total of the host's own genes. on the other hand, the multicellular organism has developed a system of antiviral defense and continues to develop it while employing the full range of available molecular mechanisms. 
predicting the consequences of this encounter and the results of their targeted modification for the benefit of the infected macroorganism is a problem that can only be solved after considerable advances in the techniques of analyzing and modeling the underlying mechanisms. vol. no. endogenous viral elements in animal genomes cellular analogs of viral proteins genes in the terminal regions of orthopoxvirus genomes experience adaptive molecular evolution a probable molecular factor responsible for generalization of variola virus infection entire coding sequence of the variola virus comparison of the genetic maps of variola and vaccinia viruses resistance to hiv- infection in caucasian individuals bearing mutant alleles of the ccr- chemokine receptor gene halliburton i.w. . molecular genetics of herpes simplex virus: demonstration of regions of obligatory and nonobligatory identity within diploid regions of the genome by sequence replacement and insertion clustering of genes dispensable for growth in culture in the s component of the hsv- genome dna of herpes group viruses the open reading frames ul , ul , ul , and ul are dispensable for the replication of herpes simplex virus in cell culture adenoviral vectors for gene transfer and therapy production of first generation adenovirus vectors: a review adenovirus: from foe to friend adenovirus: the first effective in vivo gene delivery vector comparison and functional implications of the d architectures of viral trna-like structures an rna tertiary switch by modifying how helices are tethered long-term reinfection of the human genome by endogenous retroviruses initial sequencing and analysis of the human genome new bioinformatic tool for quick identification of functionally relevant endogenous retroviral inserts in human genome the envelope glycoprotein of ebola virus contains an immunosuppressive-like domain similar to oncogenic retroviruses dormant" immunosuppressive domain in filoviruses the gp-protein of marburg 
virus contains the region similar to the 'immunosuppressive domain' of oncogenic retrovirus p e proteins when genes go walkabout transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals can viruses make us human? retroviruses and primate evolution herv-k: the biologically most active human endogenous retrovirus family quantitative studies of naturally occurring murine leukemia virus infection of akr mice molecular functions of human endogenous retroviruses in health and disease reverse transcriptase activity in mature spermatozoa of mouse nonreplicative rna recombination in poliovirus effects of retroviruses on host genome function recombination of retrotransposon and exogenous rna virus results in nonretroviral cdna integration epigenetic regulation of an iap retrotransposon in the aging mouse: progressive demethylation and de-silencing of the element by its repetitive induction isolation and characterization of israeli acute paralysis virus, a dicistrovirus affecting honeybees in israel: evidence for diversity due to intra-and inter-species recombination unexpected inheritance: multiple integrations of ancient bornavirus and ebolavirus/marburgvirus sequences in vertebrate genomes endogenous non-retroviral rna virus elements in mammalian genomes ebola-associated genes in the human genome: implications for novel targets ebola virus vp protein binds double-stranded rna and inhibits alpha/beta interferon production induced by rig-i signaling virology: bornavirus enters the genome rna from borna disease virus in patients with schizophrenia, schizoaffective patients, and in their biological relatives borna disease virus infection, a human mental-health risk non-retroviral fossils in vertebrate genomes functional mapping of the nucleoprotein of ebola virus integration of viral genomes mutations in the highly conserved ggq motif of class polypeptide release factors abolish ability of human erf to trigger peptidyl-trna 
hydrolysis genomewide screening reveals high levels of insertional polymorphism in the human endogenous retrovirus family herv-k(hml ): implications for present-day activity genomic fossils calibrate the long-term evolution of hepadnaviruses discovery and characterization of mammalian endogenous parvoviruses integration of human papillomavirus type into the human genome correlates with a selective growth advantage of cells identification of new herpesvirus gene homologs in the human genome mammalian retroelements transposable elements and the evolution of gene expression transposable elements as sources of variation in animals and plants the latent human herpesvirus- a genome specifically integrates in telomeres of human chromosomes in vivo and in vitro the first endogenous herpesvirus, identified in the tarsier genome, and novel sequences from primate rhadinoviruses and lymphocryptoviruses herpesviruses and chromosomal integration chromosomally integrated human herpesvirus : questions and answers mapping the telomere integrated genome of human herpesvirus a and b quaranfil, johnston atoll, and lake chad viruses are novel members of the family orthomyxoviridae liao ning virus, a new chinese seadornavirus that replicates in transformed and embryonic mammalian cells multiple diverse circoviruses infect farm animals and are commonly found in human and chimpanzee feces rapid identification of known and new rna viruses from animal tissues bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses a technique for genome-wide identification of differences in the interspersed repeats integrations between closely related genomes and its application to detection of human-specific integrations of herv-k ltrs antiviral effects of human micrornas and conservation of their target sites modulation of hepatitis c virus rna abundance by a liver-specific microrna role of micrornas in hepatitis b virus replication and pathogenesis micrornas 
and human retroviruses cullen b.r. . the viral and cellular microrna targetome in lymphoblastoid cell lines ebv and human micrornas co-target oncogenic and apoptotic viral and human genes during latency kaposi's sarcoma-associated herpesviral il- and human il- open reading frames contain mirna binding sites and are subject to cellular mirna regulation regulation of cellular mirna expression by human papillomaviruses suppression of immediate-early viral gene expression by herpesvirus-coded micrornas: implications for latency modulation of lmp protein expression by ebv-encoded micrornas modulation of lmp a expression by a newly identified epstein-barr virus-encoded microrna mir-bart mammalian alphaherpesvirus mirnas replication competent hiv- viruses that express intragenomic microrna reveal discrete rna-interference mechanisms that affect viral replication an inside job for sirnas genome-wide identification of mirna targets by par-clip crosshub: a tool for multi-way analysis of the cancer genome atlas (tcga) in the context of gene expression regulation mechanisms rna interference against viruses: strike and counterstrike unusual nucleotide arrangement with repeated sequences in the escherichia coli k- chromosome multiplex genome engineering using crispr/ cas systems cpf is a single rna-guided endonuclease of a class crispr-cas system how many antiviral small interfering rnas may be encoded by the mammalian genomes? short nucleotide sequences in herpesviral genomes identical to the human dna rna interference: the next genetics revolution? in: horizon symposia; understanding the rnaissance rna interference and its role in cancer therapy the second coming of rnai inbreeding and linkage statistical physics methods provide the exact solution to a long-standing problem of genetics key: cord- -yjwavea authors: kidgell, claire; winzeler, elizabeth a. title: elucidating genetic diversity with oligonucleotide arrays date: journal: chromosome res doi: . 
/s - - - sha: doc_id: cord_uid: yjwavea dna microarrays, initially designed to measure gene expression levels, also provide an ideal platform for determining genetic diversity. oligonucleotide microarrays, predominantly high-density oligonucleotide arrays, have emerged as the principal platforms for performing genome-wide diversity analysis. they have wide-ranging potential applications including comparative genomics, polymorphism discovery and genotyping. the identification of inheritable genetic markers also permits the analysis of quantitative traits, population studies and linkage analysis. in this review, we will discuss the application of oligonucleotide arrays, in particular high-density oligonucleotide arrays for elucidating genetic diversity and highlight some of the directions that the field may take. elucidating the diversity of genes between and within individual species forms the basis of understanding the evolution and adaptation of an organism. the level of similarity (homogeneity) or difference (heterogeneity) within a species population indicates the diversity of the gene pool. this diversity can be assessed through the observation of expressed genetic traits (phenotype) or by determining the variants of individual genes (genotype). such information can then be related to understanding the molecular basis of pathogenesis and disease transmission, for example, as well as deciphering the evolutionary history of a species. understanding genetic variation is also important in the analysis of an organism's transcriptome, where a change in gene expression may result from coding sequence variability rather than differential gene regulation. genetic diversity can be introduced by either mutation or recombination. mutation is a change in the dna sequence of a gene within an organism. a single nucleotide change, or polymorphism, results in either a change in amino acid (non-synonymous) or a silent mutation, where no change in amino acid occurs (synonymous). 
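The distinction between synonymous and non-synonymous changes can be made concrete with a short Python sketch that translates a reference codon and a variant codon with the standard genetic code and compares the resulting amino acids. This is only an illustration: the function name `classify_substitution` is hypothetical, and the sketch assumes the standard nuclear code and complete, unambiguous DNA codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a codon change as 'synonymous' or 'non-synonymous'.

    Hypothetical helper: both inputs must be three-letter DNA codons.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"


if __name__ == "__main__":
    print(classify_substitution("GAA", "GAG"))  # Glu -> Glu: synonymous
    print(classify_substitution("GAG", "GTG"))  # Glu -> Val: non-synonymous
```

Organisms or organelles that use a different genetic code would need a different table, which is why the table is kept separate from the classification function.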
recombination produces different combinations of alleles as a result of the physical exchange of dna between two different chromosomes in the case of higher organisms, or with another isolate or closely related species in the case of prokaryotic organisms. other sources of genetic variation can be sequence rearrangements, genetic insertions and/or deletions. in all cases, the frequencies at which these events occur within a particular species are influenced by biological and ecological factors and help drive the evolutionary processes within the organism. the genetic analysis of organisms, both prokaryotic and eukaryotic, is a fast-moving and expanding scientific field. population diversity can be measured by a variety of techniques, with the simplest method being an assessment of the observed phenotype of a population, such as eye colour or hair colour. however, for those species where the expressed phenotypes are not readily determined, biochemical and molecular approaches are needed for a more in-depth analysis of genetic relatedness. at the moment there is a wide range of approaches available for identifying genetic diversity or relatedness in a population. these include pulsed-field gel electrophoresis (pfge) (beadle et al. ) , microsatellite mapping (dearlove ) , along with more recent techniques such as mass spectrometry-based genotyping (pusch et al. ) . however, while traditional techniques, such as pfge, are highly discriminatory and have added greatly to the study of genetic diversity, they do not illustrate the total variability of the organism in question. the only way to show total variability is to sequence the genome. the advances in whole genome dna sequencing within the last few years have permitted considerable progress towards the assessment of genetic diversity within certain organisms (bentley & parkhill ) . 
comparative genomics can facilitate a detailed catalogue of the biological similarities and differences between species, revealing fascinating insights into the genome evolution and biology of numerous organisms (herrero et al. , nelson et al. , mcclelland et al. ). however, typically, only one strain or individual of a particular species is sequenced, thereby limiting the extent of analysis possible by this whole-genome approach. the availability of whole genome sequences from many organisms has directly influenced the development of dna microarrays as rapid high-throughput molecular analysis platforms that are now commonplace in laboratory research. the microarray system is a powerful apparatus that has revolutionized functional and genomic analyses in a variety of species. a dna microarray enables all or selected open reading frames from an annotated genome to be represented on a single microscope slide or window. generally, there are two microarray technologies that dominate the field, the glass spotted dna microarray and the high-density oligonucleotide array (yauk et al. ) . two methods can be used in the generation of the glass spotted microarray. either pcr amplified open reading frames (orfs) or cdna clones from the genome of interest are robotically spotted onto poly-l-lysine-coated glass slides, or long oligonucleotides ( - mers) are directly synthesized and subsequently also spotted onto poly-l-lysine slides. in contrast, a very different approach is taken in the manufacturing of high-density oligonucleotide arrays. affymetrix (affymetrix, santa clara, ca) have pioneered the production of the high-density array in which photolithographic chemistry is utilized to synthesize in situ short oligonucleotides, typically -mers, directly onto the microarray platform (lockhart & winzeler ) . there are advantages and disadvantages associated with both techniques. 
greater flexibility can be attained with the spotted microarray as specific pcr products can be easily generated and as such there is no fixed probeset. however, spotted pcr amplicons only facilitate a limited coverage of the genome and, due to the size of each pcr product representing an orf, issues of cross hybridization can also arise. inconsistencies in the hybridization properties of long oligonucleotide probes can also be an issue. consequently, although the high-density oligonucleotide array platform overcomes both of these issues, the photolithographic manufacturing process required to directly build the -mer feature onto the microarray platform significantly reduces the flexibility of this system. however, such an array produces better specificity, quality control and portability and the smaller feature size allows for significantly greater coverage of any genome. the increased number of probes for every orf also permits novel observations that would be impossible using glass spotted microarrays. both the spotted dna microarray and oligonucleotide array platforms can be successfully applied to genetically characterize a strain or species. however, the increased feature density and reproducibility offered by the high-density oligonucleotide arrays makes them the preferred platform for genomic analysis (figure ). the identification of single nucleotide polymorphisms (snps) within any genome can provide information as to the age and diversity of the organism in question. however, by only sampling particular regions of a genome, as in nucleotide sequencing, one can obtain a biased picture of the true extent of any such diversity. although originally designed as a tool for gene expression analysis (lockhart et al. ) , hybridization of genomic dna to an affymetrix high-density oligonucleotide array can be used effectively for the genome-wide detection of variation in the alternative hybridizing strain relative to the reference strain. 
due to the short probe sequences ( -mer) used in the construction of the microarray, the alteration in hybridization signal caused by a single base change between the target and probe sequence can be readily identified. consequently, because the exact location of each -mer probe in the genome is known, the position of these potential single feature polymorphisms (sfp) can be found. an sfp can represent a single nucleotide polymorphism, small insertion/deletion or full deletion. these genetic markers can then be used in population studies, the analysis of quantitative traits, genetic mapping and linkage analysis (steinmetz et al. ) . generally, the high-density arrays for snp genotyping in humans contain thousands of allele-specific oligonucleotides for each snp to be analysed. the probes contain all possible sequences at the site of the snp, and multiplex hybridizations are undertaken (hacia et al. ). in the affymetrix genechip assay, a computer algorithm is then implemented to assign the genotype of each snp. however, while snp detection and genotyping in humans is now feasible on a microarray platform, mapping studies in which - snps were analysed using high-density allele-specific oligonucleotide microarrays showed that the assay failed to distinguish between heterozygous and homozygous snp genotypes for a large fraction of the snps (wang et al. ) . indeed, the pcr amplification step required to achieve sensitive and specific snp genotyping is a principal factor that limits the use of high-throughput hybridization-based assays in human snp genotyping (syvanen ) . to overcome such issues, approaches including dna-polymerase assisted single nucleotide primer extension are now being implemented in order to perform parallel genotyping of snps on microarrays. in studies, this method provided a ten-fold better power of discrimination between genotypes in comparison with hybridization with allele-specific oligonucleotide probes (syvanen ) .
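the single feature polymorphism (sfp) logic described above, in which a probe of known position loses signal when the hybridizing strain differs from the reference, can be illustrated with a minimal sketch. it assumes that per-probe hybridization intensities are already available for both strains, keyed by probe position; the function name `call_sfps`, the log2 cut-off of 1.0 and the toy intensities are hypothetical choices for illustration and are not taken from the cited studies. real analyses also normalize arrays and use replicate hybridizations, which this sketch omits.

```python
import math


def call_sfps(ref_signal, test_signal, min_log2_drop=1.0):
    """Flag probes whose signal in the test strain drops relative to the reference.

    ref_signal / test_signal: dict mapping probe position -> intensity (> 0).
    min_log2_drop: assumed cut-off on log2(reference / test); hypothetical value.
    Returns the sorted list of probe positions called as candidate SFPs.
    """
    sfps = []
    for pos in sorted(ref_signal):
        if pos not in test_signal:
            continue  # probe not measured in the test hybridization
        drop = math.log2(ref_signal[pos] / test_signal[pos])
        if drop >= min_log2_drop:
            sfps.append(pos)
    return sfps


# toy data: only the probe at position 125 loses most of its signal
ref = {100: 2000.0, 125: 1800.0, 150: 2100.0}
test = {100: 1900.0, 125: 400.0, 150: 2050.0}
print(call_sfps(ref, test))
```

because every probe position is known, the returned positions localize each candidate polymorphism to within one probe length, but they do not reveal the sequence of the alternative allele.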
since a number of complex issues still remain with high-throughput microarray-based snp genotyping in humans, in the remainder of this review, we will discuss the application of high-density oligonucleotide arrays to elucidate genetic diversity, with particular focus on studies undertaken with saccharomyces cerevisiae (winzeler et al. ) , arabidopsis thaliana (borevitz et al. ) and the pathogenic organisms, plasmodium falciparum (volkman et al. ) and mycobacterium tuberculosis (tsolaki et al. ). we will summarize by highlighting some of the directions that the field may take. malaria is responsible for at least . million deaths annually worldwide (breman et al. ) . currently, there is no commercially available malaria vaccine and parasite resistance to the cheap yet effective anti-malarials is rapidly increasing. despite the release of the complete genome sequence of the most lethal causative agent, the apicomplexan parasite p. falciparum (gardner et al. ) , relatively little is known about the extent of genetic diversity within this complex organism. genetic diversity in the malaria parasite facilitates both its survival and propagation and therefore an understanding of such variation is critical if long-term control measures are to be implemented (clark ) . detecting variation, such as snps and putative deletions, on a genome-wide scale using established molecular systems in this organism is technically challenging. however, it was recently reasoned that a custom-designed affymetrix oligonucleotide array would facilitate such an analysis. in order to investigate this, a high-density oligonucleotide array based on the chromosome sequence of p. falciparum was designed. this microarray consisted of unique single-stranded -mer probes, positioned approximately every nucleotides to provide complete coverage of all , base-pairs of chromosome (volkman et al. ) .
the array also consisted of , probes for , cdnas from human tissues so that the amount of host mrna contamination occurring during the culture and extraction of the parasite dna could be measured. the at content of the malaria genome is extremely high ( %) (gardner et al. ) , yet this did not pose any problems in the subsequent analysis. although single nucleotide changes have previously been identified in p. falciparum (clark ) , the genome-wide analysis facilitated by hybridization of genomic dna to the affymetrix microarray identified significant differences in potential selection pressure across different gene families and locations within the chromosome (volkman et al. ) . a total of snps was identified across the four strains, along with a large -kb pair deletion in the strain w isolated in southeast asia. polymorphisms were predominantly located in those genes associated with varying the antigenic and adhesive character of the parasite. these gene families are of particular importance as currently they are being widely investigated as potential vaccine candidate epitopes due to their antigenicity. elevated levels of genetic diversity were shown to be located in the subtelomeric ends of each chromosome, which is consistent with the extensive breakage and recombination previously reported in plasmodium spp. (freitas-junior et al. ) and yeast (winzeler et al. ) . the broad genetic diversity observed within chromosome warranted an analysis of the complete p. falciparum genome in order to elucidate the extent of genetic variation in this human pathogen. with this in mind, an affymetrix custom oligonucleotide array containing -mer single-stranded probes specific to the coding and non-coding sequences of the entire p. falciparum genome was designed. the probes were placed on average every bp on both strands (le roch et al. ).
an additional human and mouse sequences corresponding to genes that are highly expressed in blood cells, affymetrix controls and background controls were also included. to date, genomic dna from several strains of p. falciparum has been hybridized to this array and initial results illustrate extensive diversity across and within the series of global isolates analysed. in addition, the ability of oligonucleotide arrays to identify areas within the malaria genome likely to be under selection pressure from the host's immune system may have a significant impact on the choice of future targets for vaccine and drug development. although snps and deletions can be readily identified using affymetrix high-density arrays, more complex types of genetic diversity may also be determined using this platform. identifying inheritable markers permits the relatedness of different strains to be easily determined. the hybridization of fourteen strains of s. cerevisiae (laboratory and wild-type isolates) to an affymetrix s oligonucleotide array containing -mer probes allowed approximately % of the yeast genome to be investigated. a total of markers were derived that detected variation in at least one of the strains. the criterion for determining relatedness was obtained by comparing each of the strains in a pair-wise fashion and determining whether each marker was present in both, one or neither of the strains. the genealogical relationship between all fourteen strains could then be easily plotted (winzeler et al. ) . typically, phylogenetic trees are plotted from data derived from selected sequences of a particular genome. however, plotting relatedness by the use of high-density oligonucleotide arrays offers a significantly more refined and, through the increased number of markers, more in-depth approach to determining ancestral genetic relationships.
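the pair-wise comparison of marker presence described above can be sketched as a simple distance calculation. this is only an illustration under stated assumptions: each strain is represented by a hypothetical list of 0/1 marker calls, and the distance is taken to be the fraction of markers present in exactly one of the two strains. the strain names, the marker values and the choice of distance are invented for the example and are not the criterion used in the cited yeast study.

```python
from itertools import combinations


def marker_distance(a, b):
    """Fraction of markers present in exactly one of two strains."""
    if len(a) != len(b):
        raise ValueError("marker profiles must have the same length")
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing / len(a)


def distance_matrix(profiles):
    """Pair-wise distances for a dict of strain name -> list of 0/1 marker calls."""
    names = sorted(profiles)
    dist = {(n, n): 0.0 for n in names}
    for p, q in combinations(names, 2):
        d = marker_distance(profiles[p], profiles[q])
        dist[(p, q)] = dist[(q, p)] = d  # the measure is symmetric
    return dist


# hypothetical marker calls for three strains (1 = marker present)
profiles = {
    "strain_a": [1, 1, 0, 1, 0, 0],
    "strain_b": [1, 1, 0, 0, 0, 0],
    "strain_c": [0, 0, 1, 1, 1, 0],
}
dist = distance_matrix(profiles)
for pair in [("strain_a", "strain_b"), ("strain_a", "strain_c"), ("strain_b", "strain_c")]:
    print(pair, round(dist[pair], 3))
```

a matrix of this kind is the usual input to a distance-based tree-building method; with many markers spread across the genome, the estimate rests on far more sites than a tree built from a few selected sequences.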
traditionally, identifying the genes responsible for quantitative genetic traits in complex genomes is challenging and laborious (lander & schork ) . in highly pathogenic organisms such as plasmodium spp., genetic mapping and linkage analysis are essential to locate and verify genetic determinants involved in traits of drug resistance (su et al. ) , virulence (wellems et al. , day et al. ) and transmission (vaidya et al. ) . marker location data combined with information regarding recombination frequencies can subsequently be used as a basis for exploring the genetic structure and variation in populations. high-density oligonucleotide arrays offer a novel approach to dissecting the genetic loci responsible for such phenotypes. for example, the affymetrix yeast microarray (winzeler et al. ) was successful in determining the quantitative trait responsible for the high temperature growth phenotype (htg) common in clinical isolates of yeast. the hybridization of total genomic dna from the reference strain s (htg-) and the haploid strain, yjm (htg+) identified a total of bi-allelic markers, with markers spaced approximately every bp across the genome. a series of htg+ segregants was subsequently analysed in genome-wide scans and meiotic recombination breakpoints were identified which permitted the genomic intervals inherited from the same parent to be identified (steinmetz et al. ) . this analysis indicated that a combination of both common and rare variants is likely to underlie quantitative traits and the number of genes responsible for each trait is generally far higher than anticipated. consequently, genome-wide analysis of heritability permitted by oligonucleotide microarrays will probably be critical to effectively mapping quantitative traits in the future. the mapping of recessive mutations by dna hybridization has been demonstrated in the plant, a. thaliana. the erecta mutation is a recessive mutation in the a.
thaliana landsberg erecta strain (ler) and maps to a defined region on chromosome . genomic dna was extracted from pooled samples of either the reference strain columbia (col) and ler f plants showing the erecta phenotype or from the wild-type col/ler f plants and hybridized to an oligonucleotide microarray consisting of perfect match (pm) features specific to the reference strain, col. a total of markers (sfps) were scored following a single hybridization and the position of the erecta gene was mapped to within cm from the exact position of the erecta gene, within a % confidence interval (borevitz et al. ) . the ability to perform genetic analyses with oligonucleotide arrays offers renewed hope in complex organisms such as plasmodium spp. where genetic and biochemical manipulation is severely limited due to the cultivation and life-cycle stages of this haploid eukaryote. the incidence of malaria in endemic countries is rising rapidly due to the appearance of multidrug-resistant parasites (white ) and insecticide-resistant mosquito vectors (roberts & andre ) . however, one must be able to genetically characterize the drug-resistant phenotype in order to fully understand the basis of resistance and facilitate the implementation of novel therapies. the chloroquine resistance locus was mapped using microsatellite markers to a -kbp segment on chromosome (su et al. ) after typing progeny strains from a cross between chloroquine-resistant and chloroquine-sensitive parental strains (wellems et al. ) . although linkage analysis based on laboratory crosses has been highly successful in identifying the genetic basis of genetic traits such as drug resistance (su et al. ) and parasite development (guinet & wellems ), traditional molecular analysis is difficult and limited by the number of specific markers available for each chromosome.
in contrast, microarray hybridization has been shown to be a valuable and highly efficient method to map inheritance markers and localize key genetic traits in a number of organisms (steinmetz et al. , borevitz et al. ) and offers renewed optimism for genome-wide linkage analysis of quantitative traits in clinically important organisms. whole genome comparisons can reveal extensive differences in gene content and genome organization between related organisms, which enable a better understanding of the genome function and evolution of a species. the loss of genetic material can be both deleterious and advantageous. for example, short-term evolutionary pressure from the immune system may favour the elimination of a gene that is a drug target, whereas long-term physiological requirements may be in place to maintain the gene in the population. the enteric bacteria salmonella enterica serovar typhimurium, the cause of human gastroenteritis, and salmonella enterica serovar typhi, which causes human typhoid, share . % identity at the genome level (edwards et al. ), yet % of s. typhimurium genes are lost or inactivated (pseudogene) in the s. typhi genome. this is widely thought to have contributed to the evolution from host generalist (s. typhimurium) to human-restricted variants (s. typhi; mcclelland et al. ) . a host of clinical outcomes in humans can be attributed to chromosomal gene deletions. a large deletion in a region of the long arm of chromosome is associated with the cause of williams-beuren syndrome and loss of the long arm of chromosome has been identified previously as a common occurrence in adenocarcinomas of the oesophagus and gastro-oesophageal junction (rumpel et al. ) . comparative genomics, although a vital tool for identifying potential deletions between two genomes, is limited by the number of complete genome sequences available.
following the hybridization of genomic material to affymetrix gene expression arrays, clusters of sfps that exhibit a low hybridization signal can be considered potential deletions (borevitz et al. ) . this approach provides a means by which to interrogate hundreds of unsequenced genomes and gain information as to the diversity of a particular population. the comparison of complete genome sequences of two strains of the causal agent of tuberculosis in humans, m. tuberculosis, identified large sequence polymorphisms (lsp) and snps but clues as to the molecular basis of virulence and pathogenicity remained unresolved (fleischmann et al. ) . in contrast, the hybridization of epidemiologically well-characterized clinical isolates of m. tuberculosis to an affymetrix m. tuberculosis high-density oligonucleotide array facilitated an improved understanding of genomic deletions (tsolaki et al. ). approximately . % of the m. tuberculosis reference genome h rv was found to be completely or partially absent from the clinical isolates investigated. gene deletions were observed in many functional classes of genes; however, the findings also suggested that these deletions did not have a strong effect on the isolate phenotype. deletions were not found to be evenly distributed throughout the genome, with certain closely related isolates exhibiting distinct deletions suggesting genomically disruptive processes specific to an individual mycobacterial lineage (tsolaki et al. ). the specificity of oligonucleotide arrays in detecting single base-pair variations suggests that microarrays could also play a role in the detection and genotyping of viral and bacterial pathogens. multiple pathogens and variable sequences could be detected in a single assay, which would revolutionize current typing and detection methods.
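the deletion-calling idea mentioned earlier in this section, in which clusters of adjacent probes with low hybridization signal are treated as potential deletions, can be sketched as a scan for runs of low-signal probes. the sketch assumes probes sorted by genome position and a log2(test/reference) ratio per probe; the function name `call_deletions`, the threshold of -1.0 and the minimum run length of three probes are hypothetical parameters chosen for illustration, not values from the cited work.

```python
def call_deletions(positions, log_ratios, threshold=-1.0, min_run=3):
    """Report runs of consecutive low-signal probes as candidate deletions.

    positions: probe coordinates, sorted along the genome.
    log_ratios: log2(test / reference) signal for each probe, same order.
    Returns a list of (first_probe_position, last_probe_position) intervals.
    """
    deletions = []
    run = []
    for pos, ratio in zip(positions, log_ratios):
        if ratio <= threshold:
            run.append(pos)  # extend the current run of low-signal probes
        else:
            if len(run) >= min_run:
                deletions.append((run[0], run[-1]))
            run = []
    if len(run) >= min_run:  # a run may extend to the last probe
        deletions.append((run[0], run[-1]))
    return deletions


# toy data: four adjacent probes lose signal; one isolated probe does not count
positions = [10, 35, 60, 85, 110, 135, 160, 185]
log_ratios = [0.1, -2.3, -2.8, -3.1, -2.5, 0.0, -1.9, 0.2]
print(call_deletions(positions, log_ratios))
```

requiring several adjacent probes is a design choice that separates a likely deletion from a single polymorphic probe; an isolated low-signal probe is more plausibly a single feature polymorphism.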
although the affymetrix array platform offers the highest probe density and resolution, the high price and lack of flexibility of this technology currently limit its application in microbial diagnostics. to date, there are few examples of such technology being used in the clinical setting but preliminary studies suggest that oligonucleotide arrays would offer a fast high-throughput alternative for the parallel detection of organisms, which would overcome some of the limitations encountered when using traditional molecular-based techniques (bodrossy & sessitsch ) . in , a study was published describing the use of oligonucleotide arrays in the detection of common pathogenic bacteria causing foodborne infections (hong et al. ). twenty-one species-specific oligonucleotide probes of the s rrna gene from sixteen bacterial species were synthesized and spotted onto nylon membranes. the results were extremely encouraging, showing that the custom oligonucleotide array was successful in distinguishing between nine of the pathogenic bacteria. however, the results also highlighted a potential drawback with the approach in that closely related bacterial strains may not be so easily distinguished. indeed, the widely used s rrna markers do not facilitate resolution to below the species level and, in cases of the enterobacteriaceae, do not allow even species differentiation. therefore, a wide range of highly validated markers that allow resolution at a number of taxonomic levels would be required in order for microarrays to be used in clinical applications. a more extensive study of the discriminatory power of oligonucleotide arrays was undertaken in the detection of viral pathogens. a series of overlapping -mer oligonucleotides specific to the most highly conserved sequences within a viral gene family from sequenced viral genomes were represented on the microarray. this custom-designed array is capable of detecting hundreds of viruses (wang et al. ) .
following a random sequence independent amplification step, a diverse set of viruses such as rhinovirus and para-influenza virus could be detected from human respiratory samples. differential patterns of hybridization also enabled discrimination between viral sero-types, greatly increasing the ease with which viruses can be typed. consequently, this information may also form the basis for the re-assignment of established viral subtypes. the use of the most conserved sequences also permits unsequenced, unidentified or newly evolved viral family members to be detected, as hybridization of dna isolated from an unsequenced/uncharacterized virus to the conserved regions within a viral gene family suggests a common lineage (wang et al. ) . in conclusion, although oligonucleotide arrays are currently not routinely used in molecular diagnostics, preliminary studies are encouraging (wang et al. , hong et al. ) and suggest that microarrays can be used as a rapid, accurate and efficient approach for identifying diversity within a species. however, issues with specificity and sensitivity still have to be resolved. while the application of oligonucleotide arrays in current scientific research is vast, there are also disadvantages and pitfalls associated with such a technique (table ) . cost considerations and the requirement for high-throughput screening equipment currently limit the use of this technology to those laboratories that have the financial resources to fund such experiments. with regard to genetic diversity analyses, an important practical limitation is the number of specific probe features that can be synthesized on a single array platform, which restricts the extent of variability that can be discovered. for example, only % of the yeast genome s. cerevisiae was represented on a single affymetrix chip (winzeler et al. ) . although recent array designs have enabled up to unique features to be represented (le roch et al.
) , this still does not facilitate complete genome coverage for many organisms. consequently, although the false discovery rate for sfp detection is low (borevitz et al. ) , the false negative discovery rate must also be considered. secondly, the detection of sfps using a -mer oligonucleotide array only permits the resolution of polymorphisms within base pairs and does not provide the actual sequence of the alternative allele. this is a disadvantage in terms of detecting novel drug resistant or disease-related alleles where, in some cases, a single point mutation may be all that is required to confer the phenotype. in population biology, this level of resolution does not permit the assignment of a synonymous or non-synonymous nucleotide change and consequently limits the extent of evolutionary analysis that can be undertaken. it has also been shown that the location of the sfp within the -mer sequence is important (borevitz et al. , winzeler et al. ) . polymorphisms located near the central base of the -mer feature are more likely to be detected than those near to the or end. frequently the features on an oligonucleotide array are specific to one particular strain or individual of a species. consequently, all subsequent hybridization analysis and comparisons are relative to this reference strain. although a higher intensity of hybridization signal at a particular locus may be characteristic of a gene duplication, insertions and re-arrangements within the hybridizing strain will not be identified with this approach thereby limiting the extent to which genetic diversity can be characterized. whole genome tiling arrays offer a redefined approach for elucidating genetic diversity and sfp discovery. in contrast to the affymetrix expression arrays where only a proportion of the genome of interest is represented, tiling arrays facilitate significantly higher genome coverage.
whole genome tiling arrays interrogate the whole genome in an unbiased approach allowing various features of the genome to be investigated (mockler & ecker ) . non-overlapping or partially overlapping probes are designed that tile the whole genome from end to end. this not only enables the identification of many more sfps within a particular genome but overlapping probes increase the resolution of sfp discovery. for the -mbp arabidopsis genome, only chips were required to tile both strands of the genome at a -base-pair resolution. furthermore, at an -base-pair resolution, only chips were needed to cover the complete genome (yamada et al. ) . although such an approach is viable for the smaller eukaryote and prokaryotic genomes, the cost and technical practicalities of this method mean it may not be feasible for the analysis of larger eukaryotic genomes.

table . advantages and disadvantages associated with elucidating genetic diversity by high-density oligonucleotide arrays.
advantages | disadvantages
genome-wide analysis. | all data derived is relative to the reference strain (generally the sequenced strain).
extremely rapid and reproducible. | cost prohibitive (arrays and expensive laboratory equipment are required to undertake experiments).
large amounts of data (potential deletions and single nucleotide polymorphisms) can be obtained from a single hybridization. | diversity resulting from insertions, deletions and rearrangements is not readily determined.
high resolution (up to bp). | the alternative allele and position of a snp can only be determined with resequencing arrays or traditional methods.
inheritable markers can be used to easily map quantitative traits such as drug resistance. | the affymetrix array platform is currently not suitable for microbial genotyping and identification due to the lack of specific species markers.
high-throughput. | low flexibility.
low false-positive rate of sfp identification. | false-negative rate of sfp identification must be considered.

resequencing arrays offer the highest resolution and discriminatory power in array technology to date, essentially resequencing a genome or portion of a genome relative to the reference sequence. the resequencing array is designed with a set of tiled probes at an ultra-high resolution ( bp). essentially, four probes are designed for each base pair in any given sequence. one probe is specific for the reference sequence while the three remaining probes vary at the central base and code for the three alternative nucleotides at that position (figure ). using the same principles as the affymetrix expression arrays for detecting mismatches (sfp), following the labelling and hybridization of genomic dna, three out of the four probes should show a decrease in hybridization signal between the target and probe sequence whereas the fourth probe, containing the correct nucleotide at the central position, should show a stronger hybridization signal. using this approach, an entire genome can be resequenced in a single hybridization experiment (wong et al. ) , facilitating in-depth genome-wide analysis of diversity. the recent use of resequencing arrays in decoding the -kb genome of several clinical isolates of the sars-coronavirus (sars-cov) demonstrates the application of this array-based technology as a rapid and reliable approach for assessing genetic diversity within a species population and undertaking epidemiological studies of outbreaks of disease (wong et al. ) . the advantage of resequencing arrays in determining the alternative alleles present at each nucleotide mismatch has important implications for population biology and diversity analysis within a species.
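the four-probe design described above lends itself to a short base-calling sketch: at each position the probe with the strongest signal names the base, and positions where no probe clearly wins are left uncalled. the data layout (one dict of four intensities per position), the `min_ratio` cut-off of 1.5 and the toy signals are assumptions made for this illustration; production base-calling software for resequencing arrays uses more elaborate statistical models.

```python
BASES = "ACGT"


def call_base(intensities, min_ratio=1.5):
    """Call the base whose probe has the strongest signal at one position.

    intensities: dict mapping each of A, C, G, T to the signal of the probe
    carrying that base at its central position.
    Returns 'N' when the best probe is not at least min_ratio times the runner-up.
    """
    ranked = sorted(BASES, key=lambda b: intensities[b], reverse=True)
    best, second = ranked[0], ranked[1]
    if intensities[second] > 0 and intensities[best] / intensities[second] < min_ratio:
        return "N"  # ambiguous position, no confident call
    return best


def resequence(reference, probe_signals):
    """Return the called sequence and a list of (index, ref_base, called_base) variants."""
    called = "".join(call_base(s) for s in probe_signals)
    variants = [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, called))
        if c != "N" and c != r
    ]
    return called, variants


# toy signals for a four-base reference; position 1 carries a C-to-T change
signals = [
    {"A": 900, "C": 100, "G": 120, "T": 90},
    {"A": 100, "C": 150, "G": 110, "T": 800},
    {"A": 300, "C": 280, "G": 310, "T": 290},
    {"A": 80, "C": 90, "G": 70, "T": 1000},
]
called, variants = resequence("ACGT", signals)
print(called, variants)
```

unlike the sfp approach, the output names the alternative allele at each variant position, which is what allows synonymous and non-synonymous changes to be distinguished downstream.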
while sequence rearrangements and novel sequences or insertions will still not be identifiable by such an approach, the possibilities
there is no doubt that the dna microarray is a powerful tool that has revolutionized the field of genetic diversity. a single hybridization can quickly yield vast quantities of data on the relatedness of one strain to another, at a single base-pair resolution. while traditional typing techniques suffer from severe limitations, microarrays offer a comprehensive and unbiased approach to analysing diversity and permit observations that would be overlooked with established techniques where only small regions of a given genome are investigated. in today's society, accurately distinguishing between closely related strains is essential in cases of pathogens associated with food-borne diseases or bioterrorism. indeed, microarrays played a critical role in epidemiological studies concerned with the emergence of the deadly new pathogen, sars. we also foresee the application of microarrays in the typing and classification of organisms such as bacteria and viruses where traditional antibody typing can sometimes fall short. in summary, the application of oligonucleotide arrays in elucidating genetic diversity is immense and will continue to make a significant impact in future population and evolutionary genetic studies.
electrophoretic karyotype analysis in fungi comparative genomic structure of prokaryotes oligonucleotide microarrays in microbial diagnostics large-scale identification of single-feature polymorphisms in complex genomes the intolerable burden of malaria: a new look at the numbers population genetics: malaria variorum genes necessary for expression of a virulence determinant and for transmission of plasmodium falciparum are located on a .
-megabase region of chromosome high throughput genotyping technologies comparative genomics of closely related salmonellae wholegenome comparison of mycobacterium tuberculosis clinical and laboratory strains frequent ectopic recombination of virulence factor genes in telomeric chromosome clusters of p. falciparum genome sequence of the human malaria parasite plasmodium falciparum physical mapping of a defect in plasmodium falciparum male gametocytogenesis to an kb segment of chromosome strategies for mutational analysis of the large multiexon atm gene using high-density oligonucleotide arrays comparative genomics of yeast species: new insights into their biology application of oligonucleotide array technology for the rapid detection of pathogenic bacteria of foodborne infections genetic dissection of complex traits discovery of gene function by expression profiling of the malaria parasite life cycle genomics, gene expression and dna arrays expression monitoring by hybridization to high-density oligonucleotide arrays complete genome sequence of salmonella enterica serovar typhimurium lt comparison of genome degradation in paratyphi a and typhi, human-restricted serovars of salmonella enterica that cause typhoid applications of dna tiling arrays for whole-genome analysis whole genome comparisons of serotype b and / a strains of the food-borne pathogen listeria monocytogenes reveal new insights into the core genome components of this species maldi-tof mass spectrometry-based snp genotyping insecticide resistance issues in vector-borne disease control mapping of genetic deletions on the long arm of chromosome in human esophageal adenocarcinomas dissecting the architecture of a quantitative trait locus in yeast complex polymorphisms in an approximately kda protein are linked to chloroquine-resistant p. 
falciparum in southeast asia and africa accessing genetic variation: genotyping single nucleotide polymorphisms functional and evolutionary genomics of mycobacterium tuberculosis: insights from genomic deletions in strains a genetic locus on plasmodium falciparum chromosome linked to a defect in mosquito-infectivity and male gametogenesis excess polymorphisms in genes for membrane proteins in plasmodium falciparum large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome microarraybased detection and genotyping of viral pathogens a histidinerich protein gene marks a linkage group favored strongly in a genetic cross of plasmodium falciparum chloroquine resistance not linked to mdr-like genes in a plasmodium falciparum cross antimalarial drug resistance direct allelic variation scanning of the yeast genome genetic diversity in yeast assessed with whole-genome oligonucleotide arrays tracking the evolution of the sars coronavirus using high-throughput, high-density resequencing arrays empirical analysis of transcriptional activity in the arabidopsis genome comprehensive comparison of six microarray technologies key: cord- -vmtjc ct authors: georgiev, vassil st. title: genomic and postgenomic research date: journal: national institute of allergy and infectious diseases, nih doi: . / - - - - _ sha: doc_id: cord_uid: vmtjc ct the word genomics was first coined by t. roderick from the jackson laboratories in as the name for the new field of science focused on the analysis and comparison of complete genome sequences of organisms and related high-throughput technologies. two basic computational methods are used for genome analysis: gene finding and whole genome comparison ( ) . gene finding. 
using a computational method that can scan the genome and analyze the statistical features of the sequence is a fast and remarkably accurate way to find the genes in the genome of prokaryotic organisms (bacteria, archaea, viruses) compared with the still difficult problem of finding genes in higher eukaryotes. by using modern bioinformatics software, finding the genes in a bacterial genome will result in a highly accurate, rich set of annotations that provide the basis for further research into the functions of those genes. the absence of introns-those portions of the dna that lie between two exons and are transcribed into a rna but will not appear in that rna after maturation and therefore are not expressed (as proteins) in the protein synthesis-will remove one of the major barriers to computational analysis of the genome sequence, allowing gene finding to identify more than % of the genes of most genomes without any human intervention. next, these gene predictions can be further refined by searching for nearby regulatory sites such as the ribosome-binding sites, as well as by aligning protein sequences to other species. these steps can be automated using freely available software and databases ( ) . gene finding in single-cell eukaryotes is of intermediate difficulty, with some organisms, such as trypanosoma brucei, having so few introns that a bacterial gene finder is sufficient to find their genes. other eukaryote organisms (e.g., plasmodium falciparum) have numerous introns and would require the use of special-purpose gene finder, such as glimmerm ( , ) . whole genome comparison. this computational method refers to the problem of aligning the entire deoxyribonucleic acid (dna) sequence of one organism to that of another, with the goal of detecting all similarities as well as rearrangements, insertions, deletions, and polymorphisms ( ) . 
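the gene-finding step described above for intron-free prokaryotic genomes can be illustrated with a naive open reading frame (orf) scan. this is a deliberately simplified sketch and not the method used by statistical gene finders such as the glimmerm program mentioned above, which rely on trained sequence models: it only looks for an atg start codon followed in frame by a stop codon, on both strands. the function names, the `min_codons` cut-off and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Naive ORF scan over the three reading frames of both strands.

    Returns (strand, start, end, orf_sequence) tuples. Coordinates are 0-based
    and end-exclusive, and for the '-' strand they refer to the
    reverse-complemented sequence. min_codons counts the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


# toy sequence containing one short ORF (ATG AAA TTT TAG) on the forward strand
orfs = find_orfs("CCATGAAATTTTAGGG")
print(orfs)
```

because no introns interrupt the coding sequence, a scan of this kind already recovers candidate genes directly from the dna; the refinements mentioned in the text, such as checking for a nearby ribosome-binding site or aligning the predicted protein to other species, would then filter these candidates.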
with the increasing availability of complete genome sequences from multiple, closely related species, such comparisons are providing a powerful tool for genomic analysis. using suffix trees-data structures that contains all of the subsequences from a particular sequence and can be built and searched in linear time-this computational task can be accomplished in minimal time and space. because the suffix tree algorithm is both time and space efficient, it is able to align large eukaryotic chromosomes with only slightly greater requirements than those for bacterial genomes ( ) . bacterial genome annotation. the major goal of the bacterial genome annotation is to identify the functions of all genes in a genome as accurately and consistently as possible by using initially automated annotation methods for preliminary assignment of functions to genes, followed by a second stage of manual curation by teams of scientists. the family enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (salmonella, yersinia, klebsiella, shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic escherichia coli k . many of these pathogens have been subject to genome sequencing or are under study. genome comparisons among these organisms have revealed the presence of a core set of genes and functions along a generally collinear genomic backbone. however, there are also many regions and points of difference, such as large insertions and deletions (including pathogenicity islands), integrated bacteriophages, small insertions and deletions, point mutations, and chromosomal rearrangements ( ). the first genome sequence of escherichia coli k (reference strain mg ) was completed and published in ( ) . later, the genome sequence of two other genotypes of e. coli, the enterohemorrhagic e. coli o :h (ehec; strains edl and rimd -sakai) ( , ) and the uropathogenic e. 
coli (upec; strain cft ) ( ) , were sequenced and the information published. currently, it is accepted that shigellae are part of the e. coli species complex, and information on the genome of shigella flexneri strain a has been published ( ) . a comparison of all three pathogenic e. coli with the archetypal nonpathogenic e. coli k revealed that the genomes were essentially collinear, displaying conservation in both sequence and gene order ( ) . the genes that were predicted to be encoded within the conserved sequence displayed more than % sequence identity and have been termed the core genes. similar observations were made for the shigella flexneri genome, which also shares . mb of common sequence with e. coli ( ) . a comparison of the three e. coli genomes revealed that genes shared by all genomes amounted to , ( ) from a total of , , and about , and , predicted protein-coding sequences for e. coli k , ehec, and upec, respectively ( ) . the region encoding these core genes is known as the backbone sequence. it was also apparent from these comparisons that interspersed throughout this backbone sequence were large regions unique to the different genotypes. moreover, several studies had shown that some of these unique loci were present in clinical disease-causing isolates but were apparently absent from their comparatively benign relatives ( ) . one such well-characterized region is the locus of enterocyte effacement (lee) in the enteropathogenic e. coli (epec). an epec infection results in effacement of the intestinal microvilli and the intimate adherence of bacterial cells to enterocytes. furthermore, epec also subverts the structural integrity of the cell and forces the polymerization of actin, which accumulates below the adhered epec cells, forming cup-like pedestals ( ) . this is called an attaching and effacing (ae) lesion. subsequently, lee was found in all bacteria known to be able to elicit an ae lesion ( ). 
the presence of many regions in the backbone sequence similar to lee have been characterized in both gram-negative and gram-positive bacteria ( ) . this led to the concept of pathogenicity islands (pais) and the formulation of a definition to describe their features ( ) . typically, pais are inserted adjacent to stable rna genes and have an atypical g+c content. in addition to virulence-related functions, the pathogenicity islands often carry genes encoding transposase or integrase-like proteins and are unstable and self-mobilizable ( , ) . it was also noted that pais possess a high proportion of gene fragments or disrupted genes when compared with the backbone regions ( ) . it is generally accepted that the pathogenic e. coli genotypes have evolved from a much smaller nonpathogenic relative by the acquisition of foreign dna. this laterally acquired dna has been attributed with conferring on the different genotypes the ability to colonize alternative niches in the host and the ability to cause a range of different disease outcomes ( ) . although sharing some of the features of pais and considered to be parts of the pais, some genomic loci are unlikely to impinge on pathogenicity. to take account of this, the concept of pais has been extended to include islands or strainspecific loops, which represent discrete genetic loci that are lineage-specific but are as yet not known to be involved in virulence ( , ) . currently, there are more than , salmonella serovars in two species, s. enterica and s. bongori. all salmonellae are closely related, sharing a median dna identity for the reciprocal best match of between % and % ( , ) . despite their homogeneity, there are still significant differences in the pathogenesis and host range of the different salmonella serovars. thus, whereas s. enterica subspecies enterica serovar typhi (s. typhi) is only pathogenic to humans causing severe typhoid fever, s. 
typhimurium causes gastroenteritis in humans but also a systemic infection in mice and has a broad host range ( ) . like e. coli, the salmonellae are also known to possess pais, known as salmonella pathogenicity islands (spis). it is thought that spis have been acquired laterally. for example, the gene products encoded by spi- ( , ) and spi- ( , ) have been shown to play important roles in the different stages of the infection process. both of these islands possess type iii secretion systems and their associated secreted protein effectors. spi- is known to confer on all salmonellae the ability to invade epithelial cells. spi- is important in various aspects of the systemic infection, allowing salmonella to spread from the intestinal tissue into the blood and eventually to infect, and survive within, the macrophages of the liver and spleen ( ) . spi- , like lee and pai- of upec, is inserted alongside the selc trna gene and carries the gene mgtc, which is required for the intramacrophage survival and growth in the low-magnesium environment thought to be encountered in the phagosome ( ) . other salmonella spis encode type iii-secreted effector proteins, chaperone-usher fimbrial operons, vi antigen biosynthetic gene, a type ivb pilus operon, and many other determinants associated with the salmonellae enteropathogenicity ( ). although the mobile nature of pais is frequently discussed in the literature, there is little direct experimental evidence to support these observations. one possible explanation for this may be that on integration, the mobility genes of the pais subsequently become degraded, thereby fixing their position ( ) . certainly, there is evidence to support this hypothesis, as many proposed pais carry integrase or transposase pseudogenes or remnants. one excellent example of this is the high-pathogenicity island (hpi) first characterized in yersinia ( ) . 
the yersinia hpis can be split into two lineages based on the integrity of the phage integrase gene (int) carried in the island: (i) y. enterocolitica biotype b and (ii) y. pestis and y. pseudotuberculosis. the y. enterocolitica hpi int gene carries a point mutation, whereas the analogous gene is intact in the y. pestis and y. pseudotuberculosis hpis. the yersinia hpi is a -to -kb island that possesses genes for the production and uptake of the siderophore yersiniabactin, as well as genes, such as int, thought to be involved in the mobility of the island. hpi-like elements are widely distributed in enterobacteria, including e. coli, klebsiella, enterobacter, and citrobacter spp., and like many prophages, these hpis are found adjacent to asn-trna genes ( ) . trna genes are common sites for bacteriophage integration into the genome ( ) . integration at these sites typically involves site-specific recombination between short stretches of identical dna located on the phage (attp) and at the integration site on the bacterial genomes (attb). the trna genes represent common sites for the integration of many other pais and bacteriophages, with the secc trna locus being the most heavily used integration site in the enterics ( ). integrated bacteriophages, also known as prophages, are also commonly found in bacterial genomes ( ) . for example, in the s loops of the e. coli o :h strain edl (ehec) unique regions, nearly % were phage related. in addition to the prophage sequences detected in the genome of ehec strain sakai ( ) , the genomes of e. coli k , upec, and s. flexneri have all been shown to carry multiple prophage or prophage-like elements ( , , , ) . moreover, comparison of the genome sequences of ehec o :h strain edl and strain sakai revealed marked variations in the complement and integration sites of the prophages, as did internal regions within highly related phages ( , ) . 
in addition to genes essential for their own replication, phages often carry genes that, for example, prevent superinfection by other bacteriophages, such as old and tin ( , ) . however, other genes carried in prophages appear to be of nonphage origin and can encode determinants that enhance the virulence of the bacterial host by a process known as lysogenic conversion ( ) . in addition to the presence of the lee pai and the ability to elicit ae lesion, another defining characteristics of the enterohemorrhagic e. coli (ehec) is the production of shiga toxins (stx). the shiga toxins represent a family of potent cytotoxins that, on entry into the eukaryotic cell, will act as glycosylases by cleaving the s ribosomal rna (rrna) thereby inactivating the ribosome and consequently preventing the protein synthesis ( ) . other enteric pathogens such as s. typhi, s. typhimurium, and y. pestis are also known to possess significant numbers of prophages ( , , ) . thus, the principal virulence determinants of the salmonellae are the type iii secretion systems, carried by spi- and spi- , and their associated protein effectors ( , ) . a significant number of these type iii secreted effector proteins are present in the genomes of prophages and have a dramatic influence on the ability of their bacterial hosts to cause disease ( ). small insertions and deletions. even though the large pais play a major role in defining the phenotypes of different strains of the enteric bacteria, there are many other differences resulting from small insertions and deletions, which must be taken into account when considering the overall genomic picture of enterobacteriaceae ( ) . thus, the comparisons between e. coli k and e. coli o :h and between s. typhi and s. typhimurium have indicated the existence of many small differences that exist aside from the large pathogenicity islands. 
for example, a count of the separate insertion and deletion events shows that there are events of genes or fewer compared with events of genes or more for the s. typhi and s. typhimurium comparison. furthermore, comparison between s. typhi and e. coli revealed events of genes or fewer compared with just events of genes or more. even taking into account that the larger islands contain many more genes per insertion or deletion event, it becomes clear that nearly as many species-specific genes are attributable to insertion or deletion events involving genes or fewer as to events involving genes or more. these data lend credence to the assertion that the acquisition and exchange of small islands is important in defining the overall phenotype of the organism ( ) . in the majority of cases studied to date, there is no evidence to suggest the presence of genes that may allow these small islands to be self-mobile. it is far more likely that small islands of this type are exchanged between members of a species and constitute part of the species gene pool. once acquired by one member of the species, they can be easily exchanged by generalized transduction mechanisms, followed by homologous recombination between the near-identical flanking genes to allow integration into the chromosome ( ) . this sort of mechanism of genetic exchange would also make possible nonorthologous gene replacement, involving the exchange of related genes at identical regions in the backbone. a specific example of such a possibility is the observed capsular switching of neisseria meningitidis ( ) and streptococcus pneumoniae ( , ) , in which different sets of genes responsible for the biosynthesis of different capsular polysaccharides are found at identical regions in the chromosome and flanked by conserved genes. 
the implied mechanism for capsular switching involves replacement of the polysaccharide-specific gene sites by homologous recombination between the chromosome and exogenous dna in the flanking genes ( ) . point mutations and pseudogenes. one of the most surprising observations to come from enterobacterial genome research has been the discovery of a large number of pseudogenes. the pseudogenes appeared to be untranslatable due to the presence of stop codons, frameshifts, internal deletions, or insertion sequence (is) element insertions. the presence of pseudogenes seems to run contrary to the general assumption that the bacterial genome is a highly "streamlined" system that does not carry "junk dna" ( ). for example, salmonella typhi, the etiologic agent of typhoid fever, is host restricted and appears only capable of infecting a human host, whereas s. typhimurium, which causes a milder disease in humans, has a much broader host range. upon analysis, the genome of s. typhi contained more than pseudogenes ( ) , whereas it was predicted that the number of pseudogenes in the genome of s. typhimurium would be around ( ) . from this observation, it becomes clear that the pseudogenes in s. typhi were not randomly spread throughout its genome-in fact, they were overrepresented in genes that were unique to s. typhi when compared with e. coli, and many of the pseudogenes in s. typhi have intact counterparts in s. typhimurium that have been shown to be involved in aspects of virulence and host interaction. given this distribution of pseudogenes, it has been suggested that the host specificity of s. typhi may be the result of the loss of its ability to interact with a broader range of hosts caused by functional inactivation of the necessary genes ( ) . in contrast with other microorganisms containing multiple pseudogenes, such as mycobacterium leprae ( ) , most of the pseudogenes in s. 
typhi were caused by a single mutation, suggesting that they have been inactivated relatively recently. taken together, these observations suggest an evolutionary scenario in which the recent ancestor of s. typhi had changed its niche in a human host, evolving from an ancestor (similar to s. typhimurium) limited to localized infection and invasion around the gut epithelium into one capable of invading the deeper tissues of the human hosts ( ) . a similar evolutionary scenario has been suggested for another recently evolved enteric pathogen, yersinia pestis. this bacterium has also recently changed from a gut bacterium (y. pseudotuberculosis), transmitted via the fecal-oral route, to an organism capable of using a flea vector for systemic infection ( , ) . again, this change in niche was accompanied by pseudogene formation, and genes involved in virulence and host interaction are overrepresented in the set of genes inactivated ( ) . yet another example of such an evolutional scenario is shigella flexneri a, a member of the species e. coli (which is predicted to have more than pseudogenes), and is again restricted to the human body ( ) . all of these organisms demonstrate that the enterobacterial evolution has been a process that has involved both gene loss and gene gain, and that the remnants of the genes lost in the evolutionary process can be readily detected ( ). the focus in the postgenomic era is on functional genomics, in which proteomics plays an essential role. the living cell is a dynamic and complex system that cannot be predicted from the genome sequence. whereas genomes will disclose important information on the biological importance of the organism, it is still static and will not reveal information on the expression of a particular gene or of posttranslational modifications or on how a protein is regulated in a specific biological situation ( ) . 
thus, whereas the complete genome sequence provides the basis for experimental identification of expressed proteins at the cellular level, very little has been accomplished to identify all expressed and potentially modified proteins. direct investigation of the total content of proteins in a cell is the task of proteomics. the proteome is defined as the complete set of posttranslationally modified and processed proteins in a well-defined biological environment under specific circumstances, such as growth conditions and time of investigation ( , ) . a proteomic analysis proceeds in two separate steps: separation of the proteins in a sample, followed by identification of the proteins. the common methodology used for separating proteins is two-dimensional polyacrylamide gel electrophoresis ( d page). the principal method for large-scale identification is mass spectrometry (ms), but other identification methods, such as n-terminal sequencing, immunoblotting, overexpression, spot colocalization, and gene knockouts, can also be used. because of its high resolving power, d page is currently the best methodology to achieve global visualization of the proteins of a microorganism. in the first dimension, isoelectric focusing is carried out to separate the proteins in a ph gradient according to their isoelectric point (pi). in the second dimension, the proteins are separated according to their molecular weight by sds-page (sodium dodecyl sulfate-page). the resulting gel image is a pattern of spots in which pi and the relative molecular weight (m r ) can be read off as in a coordinate system ( ) . a critical step during the d page procedure is sample preparation: no single method can be universally applied, because different reagents are superior for different samples. chaotropes such as urea, which act by changing the parameters of the solvent, are used in most d page procedures. 
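the two gel coordinates described above, pi and molecular weight, can be estimated from a protein sequence alone. the sketch below is a minimal illustration, not a validated predictor: the residue masses are approximate average values, the pka table is one assumed set (published tables differ, so the predicted pi shifts with the table chosen), and posttranslational modifications, which the text notes are central to proteomics, are ignored. all function and variable names are hypothetical.

```python
# Approximate average residue masses in daltons.
RESIDUE_MASS = {
    "A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14,
    "E": 129.12, "Q": 128.13, "G": 57.05, "H": 137.14, "I": 113.16,
    "L": 113.16, "K": 128.17, "M": 131.19, "F": 147.18, "P": 97.12,
    "S": 87.08, "T": 101.10, "W": 186.21, "Y": 163.18, "V": 99.13,
}
WATER_MASS = 18.02  # one water is added for the free N- and C-termini

# Assumed pKa values; other published tables give somewhat different numbers.
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6
PKA_BASIC = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_ACIDIC = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def molecular_weight(seq: str) -> float:
    """Approximate average mass of an unmodified peptide, in daltons."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def net_charge(seq: str, ph: float) -> float:
    """Net charge at a given pH from the Henderson-Hasselbalch relation."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for aa in seq:
        if aa in PKA_BASIC:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_BASIC[aa]))
        elif aa in PKA_ACIDIC:
            negative += 1.0 / (1.0 + 10 ** (PKA_ACIDIC[aa] - ph))
    return positive - negative


def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """Find the pH of zero net charge by bisection (charge falls as pH rises)."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    peptide = "MKDEHLLK"  # hypothetical example sequence
    print(round(molecular_weight(peptide), 2), round(isoelectric_point(peptide), 2))
```

a predicted (pi, mass) pair of this kind only suggests where an unmodified protein might appear on the gel; observed spots can differ from it because of processing and modification.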
major problems to overcome in d page sample preparation arise because of limited entry into the gel of high-molecular-weight proteins and the presence of highly hydrophobic and/or basic proteins ( , ) . for protein separation, the protein mixture is loaded onto an acrylamide gel strip in which a ph gradient is established. when a high voltage is applied over the strip, the proteins focus at the ph at which they carry zero net charge. the ph gradient is established using either carrier ampholytes in a slab gel ( ) or a precast polyacrylamide gel with an immobilized ph gradient (ipg) ( ) . the latter method is advantageous because of improved reproducibility. samples are preferably applied to ipg dry strips by rehydration. rehydration of dried ipgs under application of a low voltage ( to v) has significantly improved the recovery, especially of high-molecular-weight proteins. mass spectrometry is the method of choice for identifying proteins in proteomics. the proteins are converted into gas-phase ions whose masses can be measured with an accuracy better than ppm ( ) . two widely used techniques for ionization are matrix-assisted laser desorption ionization (maldi) ( ) and electrospray ionization ( ) . maldi is usually coupled with a tof (time of flight) device for measuring the masses. the ionized peptides are accelerated by an applied electric field and then drift through the tof analyzer until they reach a detector; their flight time is used to calculate their mass/charge ratio ( ) . in electrospray ionization, the peptides are sprayed into the spectrometer ( ) . ionization is achieved when the charged droplets evaporate. an alternative procedure for measuring masses is the ion trap ( ) , which selects ions with certain mass/charge ratios by keeping them in sinusoidal motion between two electrodes. 
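the time-of-flight measurement described above can be made concrete with the idealized relation for a linear tof analyzer: an ion of mass m and charge ze accelerated through a potential u gains kinetic energy zeu = mv^2/2 and then crosses a field-free drift length l in time t = l/v, so that m/z = 2eut^2/l^2. the sketch below applies this relation and also computes a mass error in parts per million. the instrument parameters (20 kv, 1 m drift length) are hypothetical, and real instruments add reflectrons, delayed extraction, and empirical calibration that this sketch ignores.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms


def flight_time(mz: float, voltage: float, length: float) -> float:
    """Drift time in seconds for an ion of given m/z (daltons per charge).

    Assumes an ideal linear TOF: all ions start at rest and the time
    spent in the acceleration region is neglected.
    """
    return length * math.sqrt(mz * DALTON / (2.0 * ELEMENTARY_CHARGE * voltage))


def mz_from_flight_time(t: float, voltage: float, length: float) -> float:
    """Invert the relation above: m/z = 2 e U t^2 / L^2, in daltons per charge."""
    return 2.0 * ELEMENTARY_CHARGE * voltage * t ** 2 / (length ** 2 * DALTON)


def ppm_error(measured: float, theoretical: float) -> float:
    """Mass measurement error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


if __name__ == "__main__":
    t = flight_time(1000.0, voltage=20000.0, length=1.0)
    print(f"flight time: {t * 1e6:.2f} microseconds")
    print(f"recovered m/z: {mz_from_flight_time(t, 20000.0, 1.0):.4f}")
```

because the flight time scales with the square root of m/z, heavier peptides arrive later, and small timing errors translate into the ppm-level mass errors that determine how confidently a peptide can be matched to a database entry.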
in , the first microbe sequencing project, haemophilus influenzae (a bacterium causing upper respiratory infection), was completed with a speed that stunned scientists (http:// www .niaid.nih.gov/research/topics/pathogen/introduction. htm). encouraged by the success of that initial effort, researchers have continued to sequence an astonishing array of other medically important microorganisms. to this end, niaid has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. in addition, niaid is collaborating with other funding agencies to sequence larger genomes of protozoan pathogens such as the organism causing malaria. the availability of microbial and human dna sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as the host's immune response and an individual's genetic susceptibility to pathogens. when scientists identify microbial genes that play a role in disease, drugs can be designed to block the activities controlled by those genes. because most genes contain the instructions for making proteins, drugs can be designed to inhibit specific proteins or to use those proteins as candidates for vaccine testing. genetic variations can also be used to study the spread of a virulent or drug-resistant form of a pathogen. niaid has launched initiatives to provide comprehensive genomic, proteomic, and bioinformatic resources. these resources, listed below, are available to scientists conducting basic and applied research on a broad array of pathogenic microorganisms (http://www .niaid.nih.gov/research/topics/ pathogen/initiatives.htm): r niaid's microbial sequencing centers (nscs). 
the niaid's microbial sequencing centers are state-of-the-art high-throughput dna sequencing centers that can sequence genomes of microbes and invertebrate vectors of infectious diseases. genomes that can be sequenced include microorganisms considered agents of bioterrorism and those responsible for emerging and re-emerging infectious diseases. the pathogen functional genomics resource center (pfgrc) is a centralized facility that provides scientists, at no cost, with the resources and reagents necessary to conduct functional genomics research on human pathogens and invertebrate vectors. the pfgrc provides scientists with genomic resources and reagents such as microarrays, protein expression clones, genotyping, and bioinformatics services. the pfgrc supports the training of scientists in the latest techniques in functional genomics and emerging genomic technologies. r niaid's proteomics centers. the primary goal of these centers is to characterize the pathogen and/or host cell proteome by identifying proteins associated with the biology of the microorganisms, mechanisms of microbial pathogenesis, innate and adaptive immune responses to infectious agents, and/or non-immune-mediated host responses that contribute to microbial pathogenesis. it is anticipated that the research programs will discover targets for potential candidates for the next generation of vaccines, therapeutics, and diagnostics. this will be accomplished by using existing proteomics technologies, augmenting existing technologies, and creating novel proteomics approaches, as well as by performing early-stage validation of these targets. r administrative resource for biodefense proteomic centers (arbpcs). the arbpcs consolidate data generated by each proteomics research center and make it available to the scientific community through a publicly accessible web site. this database (www.proteomicsresource.org) serves as a central information source for reagents and validated protein targets and has recently been populated with the first data released. 
r niaid's bioinformatics resource centers. the niaid's bioinformatics resource centers will design, develop, maintain, and continuously update multiorganism databases, especially those related to biodefense. organisms of particular interest are the niaid category a to c priority pathogens and those causing emerging and re-emerging diseases. the ultimate goal is to establish databases that will allow scientists to access a large amount of genomic and related data. this will facilitate the identification of potential targets for the development of vaccines, therapeutics, and diagnostics. each contract will include establishing and maintaining an analysis resource that will serve as a companion to the databases to provide, develop, and enhance standard and advanced analytical tools to help researchers access and analyze data. tb structural genomics consortium. a collaboration of scientists in six countries formed to determine and analyze the structures of about proteins from mycobacterium tuberculosis. the group seeks to optimize the technical and management aspects of highthroughput structure determination and will develop a database of structures and functions. niaid, which is co-funding this project with nigms, anticipates that this information will also lead to the design of new and improved drugs and vaccines for tuberculosis. structural genomics of pathogenic protozoa consortium. this consortium is aiming to develop new ways to solve protein structures from organisms known as protozoans, many species of which cause deadly diseases such as sleeping sickness, malaria, and chagas' disease. the national institute of allergy and infectious diseases is providing support to the microbial genome sequencing centers (mscs) at the j. 
craig venter institute [formerly, the institute for genomic research (tigr)], the broad institute at the massachusetts institute of technology (mit), and harvard university for a rapid and cost-efficient production of high-quality, microbial genome sequences and primary annotations. niaid's mscs (http://www.niaid.nih.gov/dmid/genomes/mscs/) are responding to the scientific community and national and federal agencies' priorities for genome sequencing, filling in sequence gaps, and therefore providing genome sequencing data for multiple uses including understanding the biology of microorganisms, forensic strain identification, and identifying targets for drugs, vaccines, and diagnostics. in addition, the niaid's mscs have developed web sites that provide descriptive information about the sequencing projects and their progress (http://www.broad.mit.edu/seq/msc/and http://msc.tigr.org/status.shtml). genomes to be sequenced include microorganisms considered to be potential agents of bioterrorism (niaid category a, b, and c), related organisms, clinical isolates, closely related species, and invertebrate vectors of infectious diseases and microorganisms responsible for emerging and re-emerging infectious diseases. in addition, in response to a recommendation from a niaid-sponsored blue ribbon panel on bioterrorism and its implication for biomedical research to support genomic sequencing of microorganisms considered agents of bioterrorism and related organisms, the mscs will address the institute's need for additional sequencing of such microorganisms and invertebrate vectors of disease and/or those that are responsible for emerging and re-emerging diseases (http://www.niaid.nih.gov/dmid/ genomes/mscs/overview.htm). the panel's recommendation included careful selection of species, strains, and clinical isolates to generate genomic data for different uses such as identification of strains and targets for diagnostics, vaccines, antimicrobials, and other drug developments. 
the mscs have the capacity to rapidly and costeffectively sequence genomic dna and provide preliminary identification of open reading frames and annotation of gene function for a wide variety of microorganisms, including viruses, bacteria, protozoa, parasites, and fungi. sequencing projects will be considered for both complete, finished genome sequencing and other levels of sequence coverage. the choice and justification of complete versus draft sequence is likely to depend on the nature and scope of the proposed project. large-scale prepublication information on genome sequences is a unique research resource for the scientific community, and rapid and unrestricted sharing of microbial genome sequence data is essential for advancing research on infectious agents responsible for human disease. therefore, it is anticipated that prepublication data on genome sequences produced at the niaid microbial sequencing centers will be made freely and publicly available via an appropriate publicly searchable database as rapidly as possible. niaid-supported investigators have completed genome sequencing projects for bacteria, fungi, parasitic protozoa, invertebrate vectors of infectious diseases, and one plant (http://www.niaid.nih.gov/dmid/genomes/ mscs/req process.htm). in addition, niaid completed the sequence for , influenza genomes. in , genome sequencing projects were completed for pathogens as described in section . . . genome sequencing data is publicly available through web sites such as genbank, and data for the influenza genome sequences have been published in . furthermore, through the niaid's microbial sequencing centers, the niaid has funded the sequence, assembly, and annotation of three invertebrate vectors of infectious diseases. in , the final sequence, assembly, and the annotation of aedes aegyptii were released, as well as the preliminary sequence and assembly of the genomes for ixodes scapularis and culex pipiens; the final results for i. scapularis and c. 
pipiens will be released in . in , niaid supported nearly large-scale genome sequencing projects for additional strains of viruses, bacteria, fungi, parasites, and invertebrate vectors. new projects included additional strains of borrelia, clostridium, escherichia coli, salmonella, streptococcus pneumoniae, ureaplasma, coccidioides, penicillium marneffei, talaromyces stipitatus, lacazia loboi, histoplasma capsulatum, blastomyces dermatitidis, cryptosporidium muris, and dengue viruses, as well as additional sequencing and annotation of aedes aegypti. in , niaid launched the influenza genome sequencing project (igsp) (http://www.niaid.nih.gov/dmid/genomes/mscs/influenza.htm), which has provided the scientific community with complete genome sequence data for thousands of human and animal influenza viruses. the influenza sequence data has been rapidly placed in the public domain, through genbank, an international searchable database, and the niaid-funded bioinformatics resource center with accompanying data analysis tools. all of this information will enable scientists to further study how influenza viruses evolve, spread, and cause disease and may ultimately lead to improved methods of treatment and prevention. this sequence information is now providing a larger and more representative sample of influenza than was previously publicly available. the influenza genome sequencing project has the capacity to sequence more than genomes per month and is a collaborative effort among niaid (including the niaid's division of intramural research), the national center for biotechnology information, and other collaborators. niaid is continuing its support for the pathogen functional genomics resource center (pfgrc) (http://www.niaid.nih.gov/dmid/genomes/pfgrc/default.htm) at the institute for genomic research (tigr) (currently part of the j. craig venter institute). 
the pfgrc was established in to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases. in addition, the pfgrc was expanded to provide the research community with the resources and reagents needed to conduct both basic and applied research on microorganisms responsible for emerging and re-emerging infectious diseases and those considered agents of bioterrorism. one of the priorities for the pfgrc has been to provide the scientific community with access to the reagents and genomic and proteomic data that the pfgrc generated. a new software tool, called snp filtering tool, was developed for affymetrix resequencing arrays to analyze the single nucleotide polymorphism (snp) data. enhancements have been made to other tools for microarray data analysis, including a tool for analyzing slide images. a new layout for the tigr-pfgrc web site (http://pfgrc.tigr.org/) has been developed and launched and has the potential to be more user-friendly for the scientific community to access the pfgrc research and development projects, poster presentations, publications, reagents, and their descriptions and data. the number of organism-specific microarrays produced and distributed to the scientific community increased to pfgrc has continued to collaborate with the national institute of dental and craniofacial research (nidcr/nih) in producing and distributing five organism-specific microarrays, including arrays for actinobacillus actinomycetemcomitans, fusobacterium nucleatum, porphyromonas gingivalis, streptococcus mutans, and treponema denticola. pfgrc has also developed the methods and pipeline for generating organism-specific clones for protein expression. 
seven complete clone sets are now available for human severe acute respiratory syndrome coronavirus (sars-cov), bacillus anthracis, yersinia pestis, francisella tularensis, streptococcus pneumoniae, staphylococcus aureus, and mycobacterium tuberculosis. in addition, individual custom clone sets are available for more than organisms upon request. comparative genomic analysis of the available bacillus anthracis sequence data, together with the snps it revealed, was used to develop a new bacterial typing system for screening anthrax strains. this system allowed niaid-funded scientists to define detailed phylogenetic lineages of bacillus anthracis and to identify three major lineages (a, b, c), with the ancestral root located between the a+b and c branches. in addition, a genotyping genechip, which has been developed and validated for bacillus anthracis, will be used to genotype about different strains of bacillus anthracis. pfgrc has developed additional comparative genomic platforms, both for resequencing a bacterial genome on a chip to identify sequence variation among strains and for discovering novel genes. a pilot project has been completed with streptococcus pneumoniae for sequencing different strains using resequencing chip technology. in collaboration with the department of homeland security (dhs), a resequencing chip has been developed and is now being used to screen a number of francisella tularensis strains to identify snps and other genetic polymorphisms. sixteen francisella tularensis strains are being genotyped using the newly developed resequencing chip. additional collaboration with dhs led to the development of a gene discovery platform aimed at discovering novel genes among different strains of yersinia pestis. to this end, nine strains are being analyzed using this platform to discover novel gene sets. pfgrc is developing proteomics technologies for protein arrays and comparative profiling of microbial proteins.
a protein expression platform is under development, and a pilot comparative protein profiling project using staphylococcus aureus has already been completed and published. a protein profiling project using yersinia pestis to compare proteomes in different strains is now under way, complementing ongoing proteomics projects supported by niaid; numerous proteins are currently being identified that are differently abundant during different growth conditions. a new project was added in for comparative profiling of proteins on the proteomes of e. coli and shigella dysenteriae to provide the scientific community with reference data on differential protein expression in animal models versus cultured systems infected with the pathogen. in , niaid continued to support the population genetics analysis program: immunity to vaccines/infections. a joint project between niaid's division of allergy, immunity, and transplantation (dait) and the division of microbiology and infectious diseases (dmid), this program is aimed to identify associations between specific genetic variations or polymorphisms in immune response genes and the susceptibility to infection or response to vaccination, with a focus on one or more niaid category a to c pathogens and influenza. niaid awarded six centers to study the genetic basis for the variable human response to immunization (smallpox, typhoid fever, cholera, and anthrax) and susceptibility to disease (tuberculosis, influenza, encapsulated bacterial diseases, and west nile virus infection). the centers are comparing genetic variance in specific immune response genes as well as more generally associated genetic variance across the whole genome in affected and nonaffected individuals. the physiologic differences associated with these genome variations will also be studied. in , these centers focused on recruiting the samples needed for genotyping. 
for example, more than , smallpox-vaccinated individuals and controls were recruited and blood and peripheral blood mononuclear cell (pbmc) samples were obtained for whole genome association studies, which were conducted in . in another example, one of the centers used genome-wide linkage approaches to map, isolate, and validate human host genes that confer susceptibility to influenza infection. nearly , individuals with susceptibility to influenza and , control individuals were recruited using an iceland genealogy database. by late , the center had recruited more than individuals and had genotyped more than in this subproject of the study. during , niaid continued its support of the eight bioinformatics resource centers (brcs) (http://www. niaid.nih.gov/dmid/genomes/brc/default.htm) with the goal of providing the scientific community with a publicly accessible resource that allows easy access to genomic and related data for the niaid category a to c priority pathogens, invertebrate vectors of infectious diseases, and pathogens causing emerging and re-emerging infectious diseases. the brcs are supported by multidisciplinary teams of scientists to develop new and improved computational tools and interfaces that can facilitate the analysis and interpretation of the genomic-related data by the scientific community. in , each publicly accessible brc web site continued to be developed, the user interfaces were improved, and a variety of genomics data types were integrated, including gene expression and proteomics information, host/pathogen interactions, and signaling/metabolic pathways data. a public portal of information, data, and open-source software tools generated by all the brcs is available at http://www.brccentral.org/. 
in , many genomes of microbial species were sequenced by the niaid's microbial sequencing centers as well as by other national and international sequencing efforts, and the brcs provided either long-term maintenance of the genome sequence data and annotation or the initial annotation for a number of particular microbial genomes. for example, niaid's brc vectorbase collaborated with niaid's mscs to annotate the genome of aedes aegypti with the scientific community and will continue the curation of this genome. in , niaid continued to support contracts for seven biodefense proteomics research centers (bprcs) to characterize the proteome of niaid category a to c bioweapon agents and to develop and enhance innovative proteomic technologies and apply them to the understanding of the pathogen and/or host cell proteome (http://www.niaid.nih.gov/dmid/genomes/prc/default.htm). these centers conducted a range of proteomics studies covering six category a pathogens, six category b pathogens, and one category c emerging disease organism. data, reagents, and protocols developed in the research centers are released to the web site of the niaid-funded administrative resource for biodefense proteomics research centers (www.proteomicsresource.org) within months of validation. the administrative resource web site was created to integrate the diverse data generated by the bprcs. in , more than potential targets for vaccines, therapeutics, and diagnostics were generated. examples of progress include: in , more than , potential new pathogen targets for vaccines, therapeutics, and diagnostics were identified, and more than , new corresponding host targets were generated. in addition: (i) two more sars-cov structures were solved. (ii) ninety-six percent of the orfs for b. anthracis were cloned with % sequence validated. (iii) a custom b. anthracis affymetrix genechip was developed.
(iv) fifty-three polyclonal sera generated against novel toxoplasma gondii and cryptosporidium parvum proteins were characterized, and accurate time and mass tag databases were populated for salmonella typhi, monkeypox, and vaccinia virus.

• niaid staff are participating in two related nih-wide genomic initiatives that focus on examining and identifying genetic variations in genes across the human genome that may be linked to, or may influence, susceptibility to or risk of a common human disease, such as asthma, autoimmunity, cancer, eye diseases, mental illness, and infectious diseases, or the response to a treatment such as a vaccine. the approach is to conduct genome-wide association studies in which a dense set of snps across the human genome is genotyped in a large, defined group of control and disease samples to identify genetic variations that may contribute to or have a role in the disease, with the hope of identifying an association between a genetic variant in a gene or group of genes and the disease.

• niaid has continued to participate in a coordinated federal effort in biodefense genomics and is a major participant in the national inter-agency genomics sciences coordinating committee (nigscc), which includes many federal agencies. this committee was formed in to address the most serious gaps in the comprehensive genomic analysis of microorganisms considered agents of bioterrorism. a comprehensive list of microorganisms considered agents of bioterrorism was developed that identifies species, strains, and clinical and environmental isolates that have been sequenced, that are currently being sequenced, and that should be sequenced. in , the committee focused on category a agents and provided the cdc with new technological approaches for sequencing additional smallpox viral strains. affymetrix-based microarray technology for genome sequencing was established, as well as additional bioinformatics expertise for analyzing the genomic sequencing data.
in , as a result of this continuing coordination of federal agencies in genome sequencing efforts for biodefense, niaid developed a formal interagency agreement with the department of homeland security (dhs) to perform comparative genomics analysis to characterize biothreat agents at the genetic level and to examine polymorphisms for identifying genetic variations and relatedness within and between species.

• niaid continues to participate in the microbe project interagency working group (iwg), which has developed a coordinated, interagency, -year action plan on microbial genomics, including functional genomics and bioinformatics in (http://www.ostp.gov/html/microbial/start.htm). in , the microbe project interagency working group developed guidelines for sharing prepublication genomic sequencing data that serve as guiding principles, so that federal agencies have consistent policies for sharing sequencing data with the scientific community and can then implement their own detailed version of the data release plan. in , the microbe project iwg supported a workshop on "an experimental approach to genome annotation," which was coordinated by the american society for microbiology, and discussed issues faced in annotating microbial genome sequences that have been completed or will be completed in the next few years. in , the microbe project iwg developed a strategic plan and implementation steps as an updated action plan for coordinating microbial genomics among federal agencies, and the plan was finalized in .

• niaid continues to participate with other federal agencies in coordinating medical diagnostics for biodefense and influenza across the federal government and in facilitating the development of a set of contracts to support advanced development toward the approval of new or improved point-of-care diagnostic tests for the influenza virus and early manufacturing and commercialization.
• niaid continues to participate in the nih roadmap initiatives, with niaid staff serving as lead science officers for one of the national centers for biomedical computation and one of the national technology centers for networks and pathways. seven biomedical computing centers are developing a universal computing infrastructure and creating innovative software programs and other tools that would enable the biomedical community to integrate, analyze, model, simulate, and share data on human health and disease. five technology centers were created in and to cooperate in a u.s. national effort to develop new technologies for proteomics and the study of dynamic biological systems.

• supramolecular architecture of severe acute respiratory syndrome coronavirus (sars-cov). coronaviruses derive their name from their protruding oligomers of the spike glycoprotein (s), which forms a coronal ridge around the virion. the understanding of the virion and its organization has previously been limited to x-ray crystallography of homogenous symmetric virions, whereas coronaviruses are neither homogenous nor symmetric. in this study, a novel methodology of single-particle image analysis was applied to selected coronavirus features to obtain a detailed model of the oligomeric state and spatial relationships among viral structural proteins. the two-dimensional structures of s, m, and n structural proteins of sars-cov and two other coronaviruses were determined and refined to a resolution of approximately nm. these results demonstrated a higher level of supramolecular organization than was previously known for coronaviruses and provided the first detailed view of the coronavirus ultrastructure. understanding the architecture of the virion is a necessary first step to defining the assembly pathway of sars-cov and may aid in developing new or improved therapeutics ( ).

• large-scale sequence analysis of avian influenza isolates.
avian influenza is a significant global human health threat because of its potential to infect humans and result in a global influenza pandemic. however, very little sequence information for avian influenza virus (aiv) has been in the public domain. a more comprehensive collection of publicly available sequence data for aiv is necessary for research on influenza to understand how flu evolves, spreads, and causes disease, to shed light on the emergence of influenza epidemics and pandemics, and to uncover new targets for drugs, vaccines, and diagnostics. in this study, the investigators released genomic data from the first large-scale sequencing of aiv isolates, doubling the amount of aiv sequence data in the public domain. these sequence data include , aiv genes and complete genomes from a diverse sample of birds. the preliminary analysis of these sequences, along with other aiv data from the public domain, revealed new information about aiv, including the identification of a genome sequence that may be a determinant of virulence. this study provides valuable sequencing data to the scientific community and demonstrates how informative large-scale sequence analysis can be in identifying potential markers of disease ( ) . genome sequencing project. the analysis of the first full genome sequences from human influenza strains, deposited in genbank through the niaid influenza genome sequencing project, was published in ( ) . influenza isolates were chosen in a relatively unbiased manner, allowing a comprehensive look at the influenza virus population circulating within the same geographic region over several seasons, which provided a real picture of the dynamics of influenza virus mutation and evolution. analysis demonstrated that the circulating strains of influenza included alternative minor lineages that could provide genetic variation for the dominant strain. 
this may allow a novel strain to emerge within a human host and would explain the unexpected emergence of the fujian influenza strain in - that resulted in a vaccine mismatch. these findings demonstrate the usefulness of full genomic sequences for providing new information on influenza viruses and lend further support for the need for large-scale influenza sequencing and the availability of sequence data in the public domain. within the influenza community, public availability of influenza sequence data and sharing of strains has been an important issue. the niaid has been instrumental in promoting the sharing of influenza sequence information, notably by sequencing more than , complete influenza genome sequences and depositing the sequences in the public domain through genbank as soon as sequencing has been completed.

history of microbial genomics tools for gene finding and whole genome comparison interpolated markov models for eukaryotic gene finding computational gene finding in plants the genomes of pathogenic enterobacteria the complete genome sequence of escherichia coli k- genome sequence of enterohemorrhagic escherichia coli o :h complete genome sequence of enterohemorrhagic escherichia coli o :h and genomic comparison with a laboratory strain k- extensive mosaic structure revealed by the complete genome sequence of uropathogenic escherichia coli genome sequence of shigella flexneri a: insights into pathogenicity through comparison with genomes of escherichia coli k and o large, unstable inserts in the chromosome affect virulence properties of uropathogenic escherichia coli o strain escherichia coli that cause diarrhea: enterotoxigenic, enteropathogenic, enteroinvasive, enterohemorrhagic, and enteroadherent pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution excision of large dna regions termed pathogenicity islands from trna-specific loci in the chromosome of an escherichia coli wild-type pathogen complete
genome sequence of multiple drug resistant salmonella enterica serovar typhi ct complete genome sequence of salmonella enterica serovar typhimurium lt cloning and nucleotide sequence of the salmonella typhimurium lt gnd gene and its homology with the corresponding sequence of escherichia coli k a kb chromosomal fragment encoding salmonella typhimurium invasion genes is absent from the corresponding region of the escherichia coli k- chromosome molecular genetic bases of salmonella entry into host cells identification of a virulence locus encoding a second type iii secretion system in salmonella typhimurium identification of a pathogenicity island required for salmonella survival in host cells pathogenicity islands and host adaptation of salmonella serovars the salmonella selc locus contains a pathogenicity island mediating intramacrophage survival the -kb unstable region of yersinia pestis comprises a high-pathogenicity island linked to a pigmentation segment which undergoes internal rearrangement transfer rna genes frequently serve as integration sites for prokaryotic genetic elements complete nucleotide sequence of the prophage vt -sakai carrying the verotoxin genes of the enterohemorrhagic escherichia coli o :h derived from the sakai outbreak a novel mechanism of virus-virus interactions: bacteriophage p tin protein inhibits phage t dna synthesis by poisoning the t single-stranded dna binding protein, go the old exonuclease of bacteriophage p filamentous phages linked to virulence of vibrio cholerae shiga toxin: purification, structure, and function genome sequence of yersinia pestis, the causative agent of plague salmonella pathogenicity islands encoding type iii secretion systems the salmonella pathogenicity island- type iii secretion system capsule switching of neisseria meningitides capsules and cassettes: genetic organization of the capsule locus of streptococcus pneumoniae genetic and molecular characterization of capsular polysaccharide biosynthesis in 
streptococcus pneumoniae type massive gene decay in the leprosy bacillus yersinia pestis -etiologic agent of plague yersinia pestis, the cause of plague, is a recently emerged clone of yersinia pseudotuberculosis microbial proteomics from proteins to proteomes: large scale protein identification by twodimensional electrophoresis and amino acid analysis membrane proteins and proteomics: un amour impossible? two-dimensional electrophoresis of membrane proteins: a current challenge for immobilized ph gradients new developments in isoelectric focusing isoelectric focusing in immobilized ph gradients: principle, methodology and some applications laser desorption ionization of proteins with molecular masses exceeding , daltons electrospray ionization for mass spectrometry of large biomolecules ion trap mass spectrometry supramolecular architecture of severe acute respiratory syndrome coronavirus revealed by electron cryomicroscopy large-scale sequence analysis of avian influenza isolates large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution key: cord- -wa gjbck authors: gibbs, richard a. title: the human genome project changed everything date: - - journal: nat rev genet doi: . /s - - - sha: doc_id: cord_uid: wa gjbck thirty years on from the launch of the human genome project, richard gibbs reflects on the promises that this voyage of discovery bore. its success should be measured by how this project transformed the rules of research, the way of practising biological discovery and the ubiquitous digitization of biological science. the joint announcement of the release of the human 'draft' genome sequences occurred years ago, at a ceremony in the white house. the first analyses by two groups, the publicly funded international human genome project (hgp) consortium and celera genomics, were published in nature and science , respectively, shortly after. 
while the analyses were superficial by contemporary standards, this was nevertheless a milestone that provided exciting first glimpses into the entire human genome. the announcement was hailed as 'the end of the beginning' and a launch pad for a new era. after two decades, have the aspirational aims of the hgp been realized? without doubt, the answer is yes; it is simply inconceivable today that we would not have the genome at our fingertips -as unimaginable, perhaps, as not having computers or the internet. critics cite a failure to meet the most outlandish visions as evidence that the hgp has not lived up to all promises. the project was initially conceived with fairly sober predictions, including the benefits of a complete cancer genome, advances in genetics and the development of improved technologies . it was not until closer to the programme launch in and at milestones along the way that the rhetoric was loudly elevated to claims of revolutionizing biology, biotechnology, drug development and even society. a favourite prediction was the personalization of therapies and the liberation of drugs that otherwise were unusable, through identification of the few individuals with adverse responses. the mysteries of the architecture of common complex diseases were to be revealed and even behavioural traits might be solved. the predictions included the possibility to breed 'super babies' based on this new knowledge and, at the same time, perhaps even predict criminality . in hindsight, there was plenty of hype that was shared with the media and the wider community. critics are correct that the apex of these claims was not reached. the hyperbole that we look back on did not, however, come from the front line. it came from those who championed the programme, mindful of its long-term benefits. thanks to them, they generated the enthusiasm to fund this transformative work. among those immersed in the delivery of the primary aims of the project, the mood was more measured. 
'basic' biologists wanted their favourite model organisms characterized so that human gene homologues could be identified. clinical geneticists were fixated on discovery and genetic dissection of the molecular basis of inherited childhood disorders, while adult disease specialists sought answers to why some suffered common maladies, such as cardiovascular disease or cancer. technologists recognized that this was the gateway to the new era of high-throughput, digital biology. there were still lofty goals, and major contributors who were convinced of the imperative of completing the project shared core beliefs of the broad impact of a completed human sequence. all recognized that, for the first time, these studies would share a characteristic comprehensiveness that was an uncommon luxury in biology. for the first time, there would be knowledge on all genes, all diseases and all genetic variants. participants recognized the power of broad data sharing and the legacy of the bermuda principles for future biology . the organizational rigor required to manage the hgp was new for biology, and it was apparent that future programmes would benefit from hgp lessons in logistics. these ambitions were the backdrop for the knowledge of how difficult the task would be, without advanced computers, automated sequencing or any roadmap from a similar effort.

richard gibbs, ac, phd is a human geneticist and the founding director of the baylor college of medicine human genome sequencing center (hgsc). he graduated from the university of melbourne in genetics and radiation biology and moved to houston, tx, to study the molecular basis of genetic disease. he developed basic methods for dna and mutation analysis and was an early contributor to the human genome project (hgp), leading one of five sites that generated the majority of the sequence. since the completion of the hgp, he has led multiple genome projects including the generation of the first personalized whole-genome diploid human sequences. his group pioneered the oligonucleotide exon-capture methods that are widely used today for whole-exome sequencing, and he is currently leading programmes for translation of genomic data into the clinic.

a -plus-year timetable

there was also a realistic insiders' view of likely post-hgp rates of progress and how difficult biological discovery can be, in the best of circumstances. the hgp was foundational and the project would lead to new ways to do things, but not all thought progress would be easy. the hgp took just years, as after the announcement we all worked an extra years to finish the 'essentially complete genome' , and it is interesting to compare that period to other transitional milestones in biology. in , the groups of francis collins and lap-chee tsui discovered the gene that contains the variants that underlie cystic fibrosis . that discovery (pre-hgp) was appropriately hailed as the first step towards a cure. in , the first resulting drug to treat a subset of patients with cystic fibrosis was approved by the fda. for huntington disease, a similar time span was needed to go from gene discovery to a new treatment that is only now being tested .
the familial breast cancer gene is another example of the time between discovery and action; linkage to brca was identified in the s with initial hopes that isolating the gene underlying the % of cases that were familial would give insights into the vast majority of sufferers with sporadic disease. that connection was not obvious, and the complicated relationship between this gene, its germline and somatic variants, related genes and interacting proteins, and the consequences for cancer are still being unraveled . a - -year period between discovery and impact on health care is more the rule than the exception. parallel transformations hgp participants trusted their own power to innovate but also hoped for other developments to leverage the programme. while the project unfolded, a revolution occurred in computation. in the late s, the only computers in the laboratories of genomicists were the earliest pcs and apple products. by , we had all been connected by the internet, bandwidth was adequate to move the genome data, and adequate processing power was accessible. a strength of the hgp and its participants was that these parallel developments were rapidly incorporated into the framework of biology. necessity speeds invention -and the need to manage copious amounts of digital genome data was the real driver of the growth of computational biology, ahead of the demands of physiologists or structural biologists. most importantly, a generation of bioinformatics experts and computational biologists emerged who brought the genome data to the widest audiences. the power of advances in genomics and computers was revealed in the spectacular series of post-hgp projects that were of comparable scale. after multiple mammalian genome projects, programmes including the haplotype mapping (hapmap) project , the genomes project and the cancer genome atlas (tcga) progressively illustrated the advancement of knowledge by more sophisticated data sharing, comparison and analysis. 
as these and other projects unfolded, new constituencies were engaged and more scientists and clinicians became 'digital' and 'genomic' . the projects were emblematic of the advancement of scaling, digitization and sharing that was sparked by the hgp. some still tally the success of the hgp from lists of new drugs or therapies and argue that world-changing examples in biology, such as the spectacular advances of gene editing tools or the expansion of cancer therapeutics through targeted immunotherapy, are largely based on microbial, cellular and animal studies rather than genomics. this argument misses the point. these are among the myriad of discoveries that occurred in the backdrop of a new era. new ideas and primary discovery may still be the 'quiet conversation with nature' of the experimental biologist -but validation, contextualization, deployment and translation are all streamlined by the fruits of the hgp. it is a vastly different world today in , compared with . human genome sequences cost less than us$ , per genome, all trainees in experimental biology and genetics are pressed to be proficient in computer languages, and easy access to mountains of primary and derived data has come to be expected. as the recent coronavirus pandemic emerged, thousands of trainees, forced to remain out of the wet-lab, pivoted to computational studies; years ago they would have been lost. the real fruits of the hgp lie in the contrast between the primitive state of digital biology in the late s and the current ease with which all scholars can access, harness and analyse biological data. 
initial sequencing and analysis of the human genome
the sequence of the human genome
the gene wars: science, politics, and the human genome
the criminal law implications of the human genome project: reimagining a genetically oriented criminal justice system
bermuda rules: community spirit, with teeth
identification of the cystic fibrosis gene: chromosome walking and jumping
therapeutic update on huntington's disease: symptomatic treatments and emerging disease-modifying therapies
the "race" to clone brca
the international hapmap project
a map of human genome variation from population-scale sequencing

a.g. is partially supported by grants from the national human genome research institute. the author declares no competing interests.

key: cord- -td wj authors: paszkiewicz, konrad h.; giezen, mark van der title: omics, bioinformatics, and infectious disease research date: - - journal: genetics and evolution of infectious disease doi: . /b - - - - . - sha: doc_id: cord_uid: td wj

bioinformatics is basically the study of informatic processes in biotic systems. what actually constitutes bioinformatics is not entirely clear and arguably varies depending on who tries to define it. this chapter discusses the considerable progress in infectious diseases research that has been made in recent years using various "omics" case studies. bioinformatics is tasked with making sense of the resulting data, mining it, storing it, disseminating it, and ensuring valid biological conclusions can be drawn from it. this chapter discusses the current state of play of bioinformatics related to genomics and transcriptomics and briefly covers metagenomics, which finds use in infectious disease research, as well as the random sequencing of genomes from a variety of organisms. this chapter also explains the various possibilities of the pan-genome and transcriptional reshaping, and the enormous progress of proteomics studies.
bioinformatic algorithms and tools are crucial in analyzing the data. the chapter also attempts to provide some details on the various problems and solutions in bioinformatics that current-day scientists face, concentrating on second-generation sequencing strategies. although bioinformatics is generally perceived to be a modern science, the term was put forward over thirty years ago by paulien hogeweg and ben hesper for "the study of informatic processes in biotic systems" (hogeweg, ; hogeweg and hesper, ). the term is necessarily nebulous: bioinformatics spans many disciplines and can have many shades of meaning. indeed, it can be argued that it is the collation and analysis of data from different disciplines that has provided some of the greatest insights. even within genomics and transcriptomics, bioinformatics is incredibly diverse. evolution, epidemiology, ecology, and the response of an organism to its environment are all fields that require bioinformatics to accurately process various sources of data and place them into context. at the heart of genomics and transcriptomics is the generation and analysis of vast quantities of sequence data. dna sequencing took off in the late s when applied biosystems developed the first automated sequencing machine. the subsequent development of more efficient ways to sequence resulted in the phenomenal growth of the number of sequences deposited in genbank (figure . ). obviously, with over million sequences deposited in genbank, it is not feasible to do any serious manual work with such a large dataset. the volume of data obtained from modern second-generation sequencers is on the order of times greater than that obtained from capillary-based sequencers. it is now possible to routinely generate many gigabases of sequence data. bioinformatics is tasked with making sense of it, mining it, storing it, disseminating it, and ensuring valid biological conclusions can be drawn from it. 
many of the recent high-throughput functional genomics technologies rely on a bioinformatics component, though bioinformatics is just one part of the process. for example, identification of proteins by mass spectroscopy, quantitative analysis of expression data, phylogenetics, and so on all make use of bioinformatics tools, methods, and databases. bioinformatics plays a key role at several steps in genomics, comparative genomics, and functional genomics: sequence alignment, assembly, identification of single nucleotide polymorphisms (snp), gene prediction, quantitative analysis of transcription data, etc. in this chapter, we will discuss the current state of play of bioinformatics related to genomics and transcriptomics and use relevant examples from the field of infectious diseases. the term "metagenomics" was originally used to describe the sequencing of genomes of uncultured microorganisms in order to explore their abilities to produce natural products (handelsman et al., , rondon et al., and subsequently resulted in novel insights into the ecology and evolution of microorganisms on a scale not imagined possible before (see cardenas and tiedje, ; hugenholtz and tyson, for an overview). however, metagenomics now finds use in infectious disease research as well as the random sequencing of genomes from a variety of organisms from, for example, patient material that could lead to the identification of the cause of disease. in a quite straightforward metagenomics approach to identify pathogens in sputa from cystic fibrosis patients, standard microbiological culture techniques were compared to molecular methods using s rdna pcr (bittar et al., ) . the well-known disadvantage of the microbiological methods is that they normally employ "selective" media that are designed to pick up those bacterial pathogens that are thought to be present. emerging pathogens will be missed using traditional culture techniques. indeed, bittar et al. 
identified bacteria using cultivation while bacterial species were detected using molecular methods (based on blast comparisons; altschul et al., ) , interestingly, % of the latter were anaerobes, organisms missed in the routine cultivation methods. many bacteria identified using the molecular methods are traditionally not thought to be associated with cystic fibrosis. whether these novel species are associated with the physiopathology of disease remains to be studied. bittar et al. ( ) also noted that the number of bacteria detected increased with increased numbers of clones sequenced, a well-known phenomenon in environmental sequencing that relates to sample depth (huber et al., ; huse et al., ) . however, with the increased use of next-generation sequencing methods in infectious disease research, the lessons learned from environmental studies relating to diversity and relative abundance of different microbes can be put to effective use. an example of the use of second-generation sequencing in a metagenomics approach of patient material is the study by nakamura et al. ( ) to identify viruses in nasal and fecal material. in this study, rna was isolated from patient material obtained during seasonal influenza infections and norovirus outbreaks. this rna was reverse transcribed into cdna, which was subsequently subjected to large-scale parallel pyrosequencing resulting in , reads on average per sample. although the influenza samples were mainly (. %) human in origin, it was nonetheless possible to identify the influenza subtypes in each sample (nakamura et al., ) . as the fecal samples were cleared of human and bacterial cells, yields were much better and the complete norovirus gii. subtype genome was sequenced with an average cover depth of up to . in addition to being able to identify the influenza and noroviruses, two recently identified human viruses were also identified: wu polyomavirus and human coronavirus hku (nakamura et al., ) . 
major bacterial species normally found in the respiratory tract were also identified. although nakamura et al. suggest that the high-throughput sequencing is more sensitive than standard pcr-based analysis and might result in the detection of additional possible pathogens, they also warn that the increased sensitivity might necessitate follow-up work to decide which of the detected pathogens is the actual cause of the disease. important results are expected from the human microbiome project (http:// www.hmpdacc.org/), which will obtain metagenomic information from various human microenvironments such as the gastrointestinal, nasooral, and urogenital cavities as well as the skin. understanding the human microbiome is thought to answer questions such as whether changes in the human microbiome are related to human health. however, large-scale metagenomics projects that include eukaryotic genomes have thus far been quite costly and laborious due to the generally large genomes of eukaryotes. the lowering of sequencing costs may alleviate part of the problem, but sequence data are still accumulating at a faster rate than developments in computational analysis (hugenholtz and tyson, ) . organisms that have attracted the attention of genome centers are those that cause disease followed by those from model organisms such as saccharomyces cerevisiae (goffeau et al., ) and caenorhabditis elegans (the c. elegans sequencing consortium, ), for example. indeed, the first bacterial genomes sequenced were those from pathogens fraser et al., ; tomb et al., ) , and these were preceded by many bacteriophage genomes such as bacteriophage ms (fiers et al., ) and ϕx (sanger et al., ) and viral genomes (fiers et al., ) . currently, pathogen genomes represent at least one third of all sequenced genomes. 
obviously, for comparative genomics two genomes are required, and indeed, when the second bacterial pathogen was sequenced (mycoplasma genitalium by fraser et al., ) , it was immediately compared with the first one (haemophilus influenzae by fleischmann et al., ) . interestingly, the h. influenzae genome was completed using a "bioinformatics" approach. unlike previous sequencing projects, the used shotgun approach relied on a computational justification that sufficient random sequencing of small fragments would result in a complete coverage of the whole genome. comparing the m. genitalium genome with the haemophilus genome suggested that the percentage of the total genome dedicated to genes is similar albeit that m. genitalium has far fewer genes (fraser et al., ) . although the genome of m. genitalium is about three times smaller than that of h. influenzae, its smaller genome has not resulted in an increase in gene density or decrease in gene size. detection of several repeats of components of the mycoplasma adhesin, which elicits a strong immune response in humans, suggests that recombination might underlie its ability to evade the human immune response. that this initial genome study was only the tip of the comparative genomics iceberg was already clear from fleischmann et al. ( ) last sentence: "knowledge of the complete genomes of pathogenic organisms could lead to new vaccines." a whole-genome effort at identifying vaccine candidates appeared some years later when pizza et al. ( ) employed bioinformatics to extract putative surface-exposed antigens by genome analysis. although effective vaccines against neisseria meningitidis, the causative agent of meningococcal meningitis and sepsis, did exist, these vaccines did not cover all pathogenic serogroups. serogroup b had evaded the development of a good vaccine as its capsular polysaccharide (against which the vaccines of the other serogroups were developed) is identical to a human carbohydrate. 
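the computational justification for shotgun sequencing mentioned above for the h. influenzae genome is often expressed with the lander-waterman coverage model; naming that model is an assumption here, since the text does not say which calculation was used. under that model, if reads land uniformly at random, the expected fraction of the genome covered at depth c = n·l/g is 1 − e^(−c). the read counts, read length, and genome size below are hypothetical and chosen only for illustration.

```python
import math


def coverage_depth(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average sequencing depth c = N * L / G."""
    return n_reads * read_len / genome_len


def expected_fraction_covered(depth: float) -> float:
    """Expected fraction of bases covered by at least one read,
    assuming reads start uniformly at random (Lander-Waterman style)."""
    return 1.0 - math.exp(-depth)


# Hypothetical values, not taken from the chapter.
GENOME_LEN = 1_800_000
READ_LEN = 500

for n_reads in (3_600, 18_000, 36_000):
    c = coverage_depth(n_reads, READ_LEN, GENOME_LEN)
    frac = expected_fraction_covered(c)
    print(f"reads={n_reads:>6}  depth={c:4.1f}x  covered={frac:.5f}")
```

the point of the sketch is only that coverage approaches completeness quickly as depth grows; real projects also have to deal with repeats and cloning bias, which this simple model ignores.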
in order to identify putative candidates for vaccine development, pizza et al. decided to sequence the whole genome of a serogroup b strain. all potential open reading frames (orfs) were analyzed for putative cellular locations using blastx. those orfs likely to be cytosolic were excluded from further analysis. the remaining orfs were analyzed to determine whether they encoded proteins that contained transmembrane domains, leader peptides, and outer membrane anchoring motifs using a variety of databases such as pfam (finn et al., ) and prodom (servant et al., ). this resulted in orfs encoding putative exposed antigens. these putative genes were cloned in escherichia coli and pizza et al. successfully expressed orfs. these recombinant proteins were used to generate antisera that were tested in enzyme-linked immunosorbent assay (elisa) and fluorescence-activated cell sorter (facs) analyses to test whether they detected proteins on the outer surface of serogroup b meningococcus strains. in addition, the sera were tested for bactericidal activity. of the proteins, reacted positively in at least one assay but only were positive in all three assays. these were subsequently tested on a large variety of strains to analyze their efficacy. a total of seemed able to provide protection against n. meningitidis strains and, in addition, those proteins are – % similar to the homologous n. gonorrhoeae proteins, suggesting they might provide successful protection against that pathogen as well (pizza et al., ). arguably the most striking aspect of this study is that, using a novel genomics/bioinformatics approach, the authors identified in months more vaccine candidates than had been identified in the preceding years (seib et al., ). this study resulted in a vaccine that is currently in phase iii clinical trials (giuliani et al., ). 
protozoan infections are a major burden on developing nations; they take of the diseases targeted by the world health organization's special program for research and training in tropical diseases (http://www.who.int/tdr). over the last years or so, more than parasitic genomes have been sequenced in the hope that their sequences would reveal weak spots to target these pathogens. the trypanosomatids cause serious disease in africa and south america. trypanosoma brucei causes sleeping sickness in humans and wasting disease in cattle. trypanosoma cruzi is the causative agent of chagas disease and leishmania major leads to skin lesions. the completion of their genomes , el-sayed et al., a , ivens et al., and the comparative analysis of all three genomes (el-sayed et al., b) may be able to focus efforts toward obtaining vaccines, as current drugs have serious toxicity issues. although their genomes encode a different number of protein-encoding genes (around in t. brucei; in l. major; , in t. cruzi), comparative analysis resulted in the identification of about genes that entail the trypanosomatid core proteome. all protein coding genes were compared in a three-way manner using blastp (el-sayed et al., b) and the mutual best hits were grouped as clusters of orthologous genes or cogs ( figure . ). trypanosomatid specific proteins from these might be used in a broadscale vaccine. the remainder of the protein-encoding genes from each parasite ( % of the genes in t. brucei; % in l. major; % in t. cruzi) consists of species-specific genes. interestingly, a large proportion of these genes encode surface antigens and this might relate to the different mechanisms these parasites employ to evade the host immune system. in addition, it was noted that many genes encoding surface antigens are found at or near telomeres and that many retroelements seem to be present in these regions as well. this might be related to the enormous antigenic variation observed in both trypanosoma species. 
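the three-way blastp comparison described above, in which mutual best hits are grouped into clusters of orthologous genes, can be sketched as follows. this is a minimal illustration, not the procedure of el-sayed et al.: the gene identifiers and bit scores are hypothetical, hits are assumed to be already parsed into (query, subject, score) tuples, and a cluster is only reported when all three pairwise relationships are mutual best hits.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep the highest-scoring subject for every query."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)


def three_way_clusters(rbh_ab, rbh_bc, rbh_ac) -> List[Tuple[str, str, str]]:
    """Triples (a, b, c) supported by all three pairwise mutual best hits."""
    b_of_a, c_of_b, c_of_a = dict(rbh_ab), dict(rbh_bc), dict(rbh_ac)
    clusters = []
    for a, b in b_of_a.items():
        c = c_of_b.get(b)
        if c is not None and c_of_a.get(a) == c:
            clusters.append((a, b, c))
    return sorted(clusters)


# Hypothetical hits between three proteomes (Tb, Lm, Tc).
tb_lm = [("Tb1", "Lm1", 200), ("Tb1", "Lm2", 50), ("Tb2", "Lm2", 180)]
lm_tb = [("Lm1", "Tb1", 195), ("Lm2", "Tb2", 170), ("Lm2", "Tb1", 40)]
lm_tc = [("Lm1", "Tc1", 210), ("Lm2", "Tc9", 90)]
tc_lm = [("Tc1", "Lm1", 205), ("Tc9", "Lm3", 120)]  # Tc9 prefers Lm3
tb_tc = [("Tb1", "Tc1", 220), ("Tb2", "Tc9", 100)]
tc_tb = [("Tc1", "Tb1", 215), ("Tc9", "Tb2", 95)]

clusters = three_way_clusters(
    reciprocal_best_hits(tb_lm, lm_tb),
    reciprocal_best_hits(lm_tc, tc_lm),
    reciprocal_best_hits(tb_tc, tc_tb),
)
print(clusters)
```

genes that fail the mutual test in any pair, such as the hypothetical tb2 above, fall outside the core clusters and would be candidates for the species-specific sets discussed in the text.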
the presence of novel genes in these areas might suggest that their products play an unknown role in antigenic variation as well which warrants further studies into these uncharacterized genes (el-sayed et al., b) . detailed knowledge of well-studied pathogens might be successfully used to understand the biology of closely related emerging pathogens. this was the driving force for the sequencing of six candida species (butler et al., ) . candida species are the most common opportunistic fungal infections in the world and c. albicans is the most common of all candida species causing infection. however, c. albicans incidence is declining while other species are emerging. comparison of eight candida species indicated that although genome size was variable, gene content was nearly identical across all species. as the analysis included pathogenic and nonpathogenic species, butler et al. ( ) specifically studied differences between these two groups. of the over gene families analyzed, were significantly enriched in pathogenic species. many gene families known to be involved in pathogenesis were present in these families (e.g., lipases, oligopeptide transporters, and adhesins). more interestingly, several poorly characterized gene families were also identified, suggesting these might play an unexpected role in pathogenesis as well. this comparative study revealed a wealth of new avenues to explore, which, combined with the large body of work performed on c. albicans, will aid understanding the newly emerging pathogenic candida species (butler et al., ). although comparative studies using multiple species can reveal hitherto unknown features as evidenced from the mentioned trypanosomatid and candida studies, they can also reveal something unexpected. because the definition of a bacterial species has been debated for a long time, tettelin et al. 
( ) set out to address this question by sequencing multiple strains from streptococcus agalactiae, the most common cause of illness or death among newborns. unexpectedly, despite the presence of a "core-genome" shared between all genomes, mathematical modeling suggested that each additional sequenced genome would add new genes to the "dispensable genome." an additional analysis using s. pyogenes also suggested that sequencing additional genomes would continue to add new genes to the pool resulting in a pan-genome that can be defined as the global gene repertoire of a species . this cannot be extrapolated ad infinitum, as a similar analysis of bacillus anthracis indicated that after the fourth genome, no additional genes were identified in agreement with its known limited genetic diversity (keim and smith, ) . subsequent analyses have confirmed the presence of pan-genomes for many bacterial species (hiller et al., ; lefébure and stanhope, ; rasko et al., ; schoen et al., ; lefébure and stanhope, ) and the ultimate gene repertoire of a bacterial species is much larger than generally perceived. whether this would be the case for eukaryotes remains to be shown. despite the apparently ever-expanding possibilities of the pan-genome, it has also resulted in a universal vaccine candidate for group b streptococcus (gbs). because various gbs serotypes exist, current vaccines only offer protection against a limited set of serotypes. eight genomes from six serotypes were compared resulting in the identification of a core-genome of genes and a dispensable genome of genes, which were not present in each strain . both genomes were analyzed for the presence of putative surface-associated and secreted proteins. of the identified genes, one third were part of the dispensable genome ( genes). the authors subsequently produced recombinant tagged proteins in e. coli that were used to immunize mice. 
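the core-genome and dispensable-genome split described above comes down to set operations on per-strain gene content. the sketch below is only a counting illustration with hypothetical strain and gene-family names; it does not reproduce the mathematical modeling that tettelin et al. used to extrapolate to unsequenced genomes.

```python
from typing import Dict, List, Set, Tuple


def core_and_dispensable(strains: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Core = gene families present in every strain;
    dispensable = families present in some but not all strains."""
    gene_sets = list(strains.values())
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return core, pan - core


def pan_genome_growth(strains: Dict[str, Set[str]], order: List[str]) -> List[int]:
    """Pan-genome size after adding each strain in the given order."""
    seen: Set[str] = set()
    sizes = []
    for name in order:
        seen |= strains[name]
        sizes.append(len(seen))
    return sizes


# Hypothetical gene-family content for three strains.
strains = {
    "s1": {"g1", "g2", "g3"},
    "s2": {"g1", "g2", "g4"},
    "s3": {"g1", "g2", "g5", "g6"},
}

core, dispensable = core_and_dispensable(strains)
growth = pan_genome_growth(strains, ["s1", "s2", "s3"])
print(sorted(core), sorted(dispensable), growth)
```

a growth curve that keeps rising with each added strain corresponds to the open pan-genome reported for s. agalactiae, whereas a curve that flattens after a few genomes corresponds to the b. anthracis case.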
ultimately, a combination of four antigens turned out to be highly effective against all major gbs serotypes. three of these antigens were part of the dispensable genome. in addition, this bioinformatics approach highlights the importance of not dismissing unidentified orfs on genomes (generally up to % of sequenced genomes), as all four antigens had no assigned function. because of their identification using this method, it became obvious they were part of a pilus-like structure that had never been seen before in group b streptococcus (lauer et al., ). the presence of protective antigens on these pilus-like structures suggests that the structures might play a role in pathogenicity. genomic information is useful as a scaffold. however, in a given environment pathogens and hosts only express a subset of their genes at any one time. the presence of pan-genomes complicates matters further. to investigate the response of an organism to an environmental or other stress it is necessary to examine the expression pattern of proteins. at present, this is not possible to accomplish directly on a large scale, but a good approximation can be made by sequencing and counting mrna molecules. the process currently involves converting the rna to cdna, which can introduce biases; nonetheless, sequencing has a great many advantages over traditional microarrays (ledford, ). these include high specificity with little or no background noise and nucleotide-level resolution of expression. despite their comparative drawbacks, microarrays are still extremely powerful tools for understanding levels of gene expression, as is evident from the study by toledo-arana et al., who discovered novel regulatory mechanisms in listeria (toledo-arana et al., ). l. monocytogenes is normally harmless but can lead to serious food-borne infections. 
environmental change, from the soil through the stomach to the intestinal lumen and ultimately into the bloodstream, is thought to be responsible for the up- and downregulation of a plethora of genes. comparison with the genome of the nonpathogenic l. innocua has resulted in the identification of a virulence locus (glaser et al., ). using microarrays, transcripts of one strain grown at c in rich medium were compared to those from three different conditions: stationary phase, hypoxia, and low temperature ( c). in addition, knockout mutants in three known regulators of listeria virulence gene expression (prfa, sigb, and hfq) were compared to the control strain. rna was also extracted from the intestine of inoculated mice and from the blood of healthy human donors, both of which were infected with three different strains (control and prfa and sigb knockouts). this analysis resulted in the discovery of massive transcriptional reshaping under the control of sigb when listeria enters the intestines. however, in the bloodstream, gene expression is under the control of prfa. various noncoding rnas were uncovered that show the same expression patterns as virulence genes, suggesting a potential role in virulence (toledo-arana et al., ). because microarray data are based on a comparative difference in hybridization, high-throughput next-generation sequencing is seen as more quantitative, as it is based on the number of hits for each sequenced transcript ( van vliet, ). however, when making cdna for next-generation sequencing transcriptomics in prokaryotes, there are several difficulties not found in eukaryotes, such as high levels of rrna and trna molecules as well as a lack of poly-a tails, making extraction of mrna difficult. nonetheless, it is possible to overcome these either by reducing the amount of rrna and trna using commercially available kits or by bioinformatic removal of such sequences post-sequencing ( van vliet, ). to date, some rna-seq style experiments have been performed on prokaryotes. 
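the bioinformatic removal of rrna and trna sequences after sequencing, followed by counting hits per transcript, can be sketched in a few lines. everything concrete here is an assumption made for illustration: reads are taken to be already aligned and reduced to start and end coordinates, the feature table and its names are hypothetical, and strand is ignored.

```python
from typing import Dict, List, NamedTuple


class Feature(NamedTuple):
    name: str
    kind: str   # "CDS", "rRNA" or "tRNA"
    start: int  # 1-based, inclusive
    end: int    # inclusive


class Read(NamedTuple):
    start: int
    end: int


def overlaps(read: Read, feat: Feature) -> bool:
    """True if the aligned read shares at least one base with the feature."""
    return read.start <= feat.end and feat.start <= read.end


def count_mrna_reads(reads: List[Read], features: List[Feature]) -> Dict[str, int]:
    """Discard reads overlapping rRNA/tRNA features, then count reads per CDS."""
    structural = [f for f in features if f.kind in ("rRNA", "tRNA")]
    coding = [f for f in features if f.kind == "CDS"]
    counts = {f.name: 0 for f in coding}
    for read in reads:
        if any(overlaps(read, f) for f in structural):
            continue  # in-silico depletion of structural RNA
        for f in coding:
            if overlaps(read, f):
                counts[f.name] += 1
    return counts


# Hypothetical annotation and alignments.
features = [
    Feature("rrsA", "rRNA", 100, 1600),
    Feature("geneA", "CDS", 2000, 2900),
    Feature("geneB", "CDS", 3000, 3500),
]
reads = [Read(150, 185), Read(1590, 1625), Read(2010, 2045),
         Read(2880, 2915), Read(3100, 3135), Read(5000, 5035)]

counts = count_mrna_reads(reads, features)
print(counts)
```

raw counts like these would still need normalizing for gene length and sequencing depth before expression levels are compared between genes or samples.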
to give an example of the sort of novel insights that can be gleaned using such technology, passalacqua et al. ( ) sequenced the bacillus anthracis transcriptome using solid and illumina sequencing and clearly showed the polycistronic nature of many transcripts on a whole genome scale. although known for individual operons, this had never been shown on a genome-wide scale. they were also able to test the current genome annotations and discovered that loci that were removed as nongenes showed significant transcriptional activity. in addition, nonannotated regions had clear levels of transcription and should therefore be considered as genes (passalacqua et al., ) . as internal methionines could have incidentally been identified as start codons, they also checked whether upstream regions were included in the transcribed region. in cases this proved to be the case suggesting the original start codons were incorrectly annotated. reassuringly, when comparing their data with microarray data, a strong correlation was observed. interestingly, because of the very high resolution of sequence-based transcriptomics studies, it is possible to identify novel regulatory elements. for example, when comparing expression levels under o -and co -rich conditions, the first gene of an eight-gene operon did not show a marked difference in expression level while all the others were significantly upregulated under co (passalacqua et al., ). indeed, a bioinformatics approach had suggested the presence of a t-box riboswitch between genes and of this operon (griffiths-jones et al., ) . a similar approach to study how burkholderia cenocepacia, an opportunistic cystic fibrosis pathogen, responds to environmental changes revealed several new potential virulence factors (yoder-himes et al., ). as b. 
cenocepacia is routinely isolated from soil, two strains (one isolated from a cystic fibrosis patient and one from soil) were analyzed for their response to a change between growth in synthetic human sputum medium and growth in soil medium. although their overall nucleotide identity is . %, and homologous genes showed a significant difference in expression between the two strains when grown in synthetic sputum medium and soil medium, respectively. this suggests that despite the high level of relatedness, differential gene expression plays a large role in adaptation to their ecological niche (yoder-himes et al., ). interestingly, as in the study by passalacqua et al. ( ), several expressed noncoding rnas were uncovered with different expression levels depending on environmental condition. the significance of this needs to be investigated, but it highlights the ability of second-generation sequencing to unearth novel findings. although, under the pan-genome concept, a species' genome could well be larger than the actual genome content of any one member of that species, an organism's proteome is far more complex still. as discussed earlier, transcriptomics will reveal which subset of the genome is expressed under a given condition. however, posttranslational modifications of proteins make the actual proteome far more complex than the transcriptome. this is also the strength of proteomics, as can be seen in a study of the obligate intracellular parasite chlamydia pneumoniae. c. pneumoniae is the third-most-common cause of respiratory infections in the world, which is made possible in part by the unique biphasic life cycle of this bacterial pathogen. chlamydia spreads via a metabolically inert infectious particle called the elementary body. these elementary bodies enter the host cell where they differentiate into reticulate bodies. 
as the elementary body is the infectious phase, proteins presented on the outer membrane would be ideal candidates for vaccine development, especially as effective vaccines are lacking and treatment is via antibiotic therapy. a large-scale genomics-proteomics study by montigiani et al. ( ) systematically assessed putative exposed antigens for possible use in vaccine development. of the c. pneumoniae genes, have assigned functions; of the latter are predicted to be peripherally located and were therefore selected for follow-up studies. in addition, the remaining orfs were subjected to a series of search algorithms aimed at identifying putative surface-exposed antigens. in total, orfs were identified as being possibly located on the cell surface. these were subsequently used to produce recombinant proteins in e. coli. because both his-tagged and gst-tagged versions were made, a total of recombinant proteins were produced and used for immunizations of mice. all antisera were used in facs analysis to test if they could bind to the c. pneumoniae cell surface. this resulted in the identification of putative surface-exposed antigens. interestingly, apart from well-known antigens, antigens from unidentified orfs were part of this group of potential vaccine candidates. all candidates were tested on western blots to see whether they generated a clean band of the expected size or whether they cross-reacted with other proteins; of the were specific. finally, montigiani et al. conducted a proteomic analysis of total protein from the elementary body phase, identifying spots using mass spectrometry. protein sequencing using maldi-tof identified putative surface-exposed antigens on the c. pneumoniae d gels (montigiani et al., ). a follow-up study by thorpe et al. ( ) clearly showed that one of the identified candidates, lcre, induced, amongst others, cd and cd t cell activation and completely cleared infection in a murine model. 
interestingly, lcre is homologous to a protein that is thought to be part of the type iii secretion system of yersinia. the exposed nature of lcre on the c. pneumoniae cell surface suggests that a type iii secretion system plays a role in chlamydia infection (montigiani et al., ). the importance of exposed outer membrane proteins as potential vaccine candidates prompted berlanda scorza et al. to assess the complement of outer membrane proteins from an extraintestinal pathogenic e. coli strain (berlanda scorza et al., ). extraintestinal pathogenic e. coli is the leading cause of severe sepsis, and current increases in drug resistance warrant the search for novel vaccine targets. in addition, current whole-cell vaccines suffer from undesired cross-reactions to commensal e. coli. the novel approach by berlanda scorza et al. is based on the observation that some gram-negative bacteria release outer membrane vesicles (omvs) into the culture medium, albeit in minute quantities. a tolr mutant appeared to release many more omvs than wild-type cells, and subsequent large-scale mass spectrometric analysis of their protein content resulted in the identification of proteins. the majority of these were outer membrane and periplasmic proteins. intriguingly, three subunits from the cytolethal distending toxin (cdt) were included. this toxin is unusual in that one of its subunits is targeted to the eukaryotic host cell, where it breaks double-stranded dna, resulting in cell death (de rycke and oswald, ). to check whether the presence of cdt in the omvs was due to the tolr knockout, wild-type extraintestinal pathogenic e. coli was tested using western blotting. indeed, cdt was detected in wild-type omvs as well (berlanda scorza et al., ). this suggests that toxin delivery via vesicles might well be the key event in pathogenesis. interestingly, of the identified proteins were not predicted to be targeted to the periplasm or outer membrane by psortb (gardy et al., ). 
we see here excellent opportunities to train protein targeting algorithms with new wet-bench data, as these algorithms generally have been trained on a limited set of model organisms that do not reflect the diversity encountered in real life. despite the enormous progress in genomics of infectious diseases, the discovery of new drugs has not kept pace. for example, no candidate drugs have been identified after high-throughput screens using validated bacterial drug targets (payne et al., ). although broad-spectrum drugs might be more desirable, there has been a recent trend toward targeting specific proteins from specific pathogens using structural biology. several structural genomics initiatives have been set up to target specific groups of pathogens. for example, the seattle structural genomics center for infectious diseases (http://ssgcid.org) and the center for structural genomics of infectious diseases (http://www.csgid.org) work on category a to c agents listed by the national institute of allergy and infectious diseases (niaid). other centers focus on specific organisms such as mycobacterium tuberculosis. examples are the mycobacterium tuberculosis structural proteomics project (http://xmtb.org) and the mycobacterium tuberculosis structural proteomics consortium (http://www.doe-mbi.ucla.edu/tb). the field of structural genomics aims to solve as many protein structures as possible from human pathogens in order to come up with new drug targets or vaccines (van voorhis et al., ). obviously, correct selection of candidates for structural genomics projects is paramount and various criteria have been put forward (anderson, ; van voorhis et al., ). the fact that a protein is already a validated drug target obviously aids selection. the proteins need to be essential for the pathogen and, ideally, absent in humans. proteins involved in the uptake of essential nutrients are another target. classically, drug design has focused on substrate binding sites. 
more recently, small molecules interfering with subunit binding have started to attract attention. as eukaryotic and prokaryotic inorganic pyrophosphatases differ in composition (the former are homodimers, while the latter are homohexamers), efforts are aimed at compounds that interfere with the oligomeric state of the enzyme. in contrast, the highly conserved active site of inorganic pyrophosphatase would not have been a good target (van voorhis et al., ). the sars outbreak that caught the infectious diseases community (if not the whole world) by surprise is one example where structural genomics has made enormous progress. although coronaviruses were known to cause serious diseases in animals, the fact that they only caused mild disease in humans meant that there was very little knowledge about coronavirus biology. the subsequent effort to understand viral assembly and replication/transcription, for example, has resulted in the elucidation of sars-cov protein structures. interestingly, the novel fold-discovery rate was nearly %, while it would normally be closer to % (bartlam et al., ). in addition, one key protein, the sars-cov main protease, has since been at the center of structure-based drug discovery. because of the nature of the discipline, structural genomics is dependent on various other disciplines such as biochemistry, microbiology, structural biology, computational biology, and bioinformatics, and can only flourish in a truly interdisciplinary environment (anderson, ). it is now possible to sequence the entire genome of a bacterial pathogen, assemble the raw sequence reads, perform automated annotation, and visualize the results within weeks. at the same time (indeed even on the same sequencer) it is also possible to selectively sequence the transcriptome (rna-seq), regions of dna bound to protein (chip-seq), small rna molecules, or, for relevant species, methylated dna to study epigenetic effects. 
it is also possible to perform the very same sequencing on the host organism at the same time. bioinformatic algorithms and tools are crucial for analyzing such unprecedented volumes of data. these data volumes have emerged as a result of second-generation sequencers such as the roche/ , illumina, and abi/solid systems. although useful information can be extracted by single researchers by targeted analysis of the sequencer output, to gain the most information out of such data, it is becoming increasingly common for multiple researchers or research groups with widely differing areas of expertise to collaborate. this collaboration is absolutely crucial if relevant insights are to be gained from large-scale datasets. as a result a vast array of data is generated, which must be annotated and curated as well as analyzed for information relevant to any particular experiment. in addition this information needs to be stored, shared, and distributed in a manner that enables reanalysis if and when new hypotheses are generated. platforms as produced by the gmod consortium (http://gmod.org), such as gbrowse, and underlying databases are excellent web-based tools for visualizing and comparing datasets. however, they currently offer limited scope for collaborative annotation or curation of datasets where relevant expertise can be brought to bear from a variety of different research groups. this problem is magnified with the advent of second-generation sequencers since much smaller groups of researchers tend to be involved, meaning that the available expertise is much smaller than what large collaborations (such as the influenza research database [fludb], http://www.fludb.org/) can muster. thus there is a need for integrated annotation and visualization pipelines to enable individual researchers to perform comparative genomics and transcriptomics.
the broad institute offers a number of useful visualization tools to the individual researcher such as argo (http://www.broadinstitute.org/annotation/argo/) and the integrative genomics viewer (igv) (http://www.broadinstitute.org/igv/). argo offers the ability to manually annotate and visualize a genome as well as provide a good graphical overview for comparative genomics and transcriptomics. currently, there is no one standard for bioinformatics pipeline development for next-generation sequencing. several efforts are underway or can be adapted from sanger sequencing pipelines. these include the prokaryote annotation pipeline xbase and the isga server (hemmerich et al., ). these enable de novo sequenced prokaryote genomes to be annotated automatically and corrected manually at a later date. alternative sanger adaptations such as maker can also be used once an assembly has been generated. a large array of programs is now available to either align reads to a reference genome or to assemble them de novo (miller et al., ; paszkiewicz and studholme, ). they will not be listed in detail here as there are many considerations, including the sequencing platform used, the read length in use, the expected genome size, the length of the longest repetitive elements, gc content, and whether paired-end reads are in use. the proprietary newbler software from roche is the most popular method of de novo assembly of reads (typically – bp). popular assemblers for short reads (i.e., mostly from illumina or solid platforms) are velvet (http://www.ebi.ac.uk/~zerbino/velvet) for the assembly of genomic dna or oases, from the same group, for the assembly of reads from transcriptomic cdna (http://www.ebi.ac.uk/~zerbino/oases) (zerbino and birney, ). other assemblers such as abyss (simpson et al., ), allpaths (butler et al., ) or soapdenovo (http://soap.genomics.org.cn/soapdenovo.html) are also popular. abyss enables assembly to be parallelized, thus speeding up assembly.
allpaths has been shown to offer superior performance when multiple paired-end libraries are used. independent of read length, it is crucial that paired-end libraries are used when constructing de novo assemblies of any genome. note that the use of short-read sequences only can lead to significant gaps being left in the final assembly due to repetitive elements. however, for many analyses (especially for prokaryotic organisms) these gaps are generally not considered to be significant. in cases where closure of these gaps is desirable, the addition of sanger or long-range pcr data can often help. where significant quantities of long- and short-read data are available, a joint assembly can be attempted. a recommended protocol is to assemble the short and long reads separately using their respective packages and to then merge the two assemblies using programs such as minimus (sommer et al., ). another option is to use a template sequence from a related organism to help guide the assembly (note: this is distinct from remapping as described). the amoscmp package is useful for this purpose (pop et al., ). finally, whatever assembly method is used, it is important to remember that a longer assembly is not necessarily a better one. examining the reads making up a contig (e.g., using the amos package (http://amos.sourceforge.net) or the tablet viewer (http://bioinf.scri.ac.uk/tablet)) and alignment to a core-conserved group of genes should be standard practice to ensure that blatant errors are corrected. remapping of short reads to a reference genome is also a valid method of comparison. although software such as blat (kent, ) can be used with longer reads, it is not an ideal tool for shorter read technologies where data volumes are much greater. where such a genome is available, software such as maq, its successor bwa, bowtie, soap, and others offer a wealth of tools to identify indels, snps, and other variants which may be of interest.
crucially, in these cases it is important to have sufficient depth of coverage to ensure snp calls are valid. paired-end data are also valuable for highlighting the presence of indels. after remapping it is also common practice to assemble unmapped reads using the de novo assembly software to reveal any novel sequence variants, which may be absent in the reference. in the case where pathogens and hosts are sequenced together, if the sequence of at least one is known, then it is relatively straightforward to separate the two using bioinformatic techniques. to deal with transcriptomic data where a reference sequence is available, software such as erange (http://woldlab.caltech.edu/rnaseq/), tophat (trapnell et al., ), and cufflinks (http://cufflinks.cbcb.umd.edu/) is extremely useful. the cufflinks module in particular offers the ability to predict the most likely exon isoform expression pattern using a combination of bayesian statistics and graph-based algorithms. we are aware that our treatment of the use of "omics" and bioinformatics in infectious disease research is not exhaustive. as mentioned in the introduction, what constitutes bioinformatics is not entirely clear and arguably varies depending on who tries to define it. however, we have attempted to show the considerable progress in infectious diseases research that has been made in recent years using various "omics" case studies. in addition, the last section is an attempt to provide a brief overview of the problems and (bioinformatics) solutions that face current-day scientists who embark on second-generation sequencing strategies. this is a fast-moving field, but the provided references and websites should be a good first approach for those who wish to make further strides toward eradicating infectious diseases from our planet.
basic local alignment search tool structural genomics and drug discovery for infectious diseases structural proteomics of the sars coronavirus: a model response to emerging infectious diseases proteomics characterization of outer membrane vesicles from the extraintestinal pathogenic escherichia coli Δtolr ihe mutant the genome of the african trypanosome trypanosoma brucei molecular detection of multiple emerging pathogens in sputa from cystic fibrosis patients allpaths: de novo assembly of whole-genome shotgun microreads evolution of pathogenicity and sexual reproduction in eight candida genomes new tools for discovering and characterizing microbial diversity cytolethal distending toxin (cdt): a bacterial weapon to control host cell proliferation? the genome sequence of trypanosoma cruzi, etiologic agent of chagas disease comparative genomics of trypanosomatid parasitic protozoa complete nucleotide sequence of bacteriophage ms rna: primary and secondary structure of the replicase gene complete nucleotide sequence of sv dna the pfam protein families database whole-genome random sequencing and assembly of haemophilus influenzae rd the minimal gene complement of mycoplasma genitalium psortb v. . 
: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis a universal vaccine for serogroup b meningococcus comparative genomics of listeria species life with genes rfam: annotating non-coding rnas in complete genomes molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products an ergatisbased prokaryotic genome annotation web server comparative genomic analyses of seventeen streptococcus pneumoniae strains: insights into the pneumococcal supragenome interactive instruction on population interactions microbial population structures in the deep marine biosphere microbiology: metagenomics ironing out the wrinkles in the rare biosphere through improved otu clustering the genome of the kinetoplastid parasite, leishmania major bacillus anthracis evolution and epidemiology blat-the blast-like alignment tool genome analysis reveals pili in group b streptococcus the death of microarrays? evolution of the core and pan-genome of streptococcus: positive selection, recombination, and genome composition pervasive, genome-wide positive selection leading to functional divergence in the bacterial genus campylobacter identification of a universal group b streptococcus vaccine by multiple genome screen the microbial pangenome assembly algorithms for next-generation sequencing data genomic approach for analysis of surface proteins in chlamydia pneumoniae direct metagenomic detection of viral pathogens in nasal and fecal specimens using an unbiased high-throughput sequencing approach structure and complexity of a bacterial transcriptome de novo assembly of short sequence reads identification of vaccine candidates against serogroup b meningococcus by whole-genome sequencing the pangenome structure of escherichia coli: comparative genomic analysis of e. 
coli commensal and pathogenic isolates cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms nucleotide sequence of bacteriophage phix dna whole-genome comparison of disease and carriage strains provides insights into virulence evolution in neisseria meningitidis the key role of genomics in modern vaccine and drug design for emerging infectious diseases prodom: automated clustering of homologous domains abyss: a parallel assembler for short read sequence data minimus: a fast, lightweight genome assembler genome analysis of multiple pathogenic isolates of streptococcus agalactiae: implications for the microbial "pan-genome genome sequence of the nematode c. elegans: a platform for investigating biology discovery of a vaccine antigen that protects mice from chlamydia pneumoniae infection the listeria transcriptional landscape from saprophytism to virulence the complete genome sequence of the gastric pathogen helicobacter pylori tophat: discovering splice junctions with rna-seq next generation sequencing of microbial transcriptomes: challenges and opportunities the role of medical structural genomics in discovering new drugs for infectious diseases mapping the burkholderia cenocepacia niche response via high-throughput sequencing velvet: algorithms for de novo short read assembly using de bruijn graphs we would like to acknowledge our colleague dr. david j. studholme for his suggestions and feedback. key: cord- -g sjyytr authors: phillippy, adam m; deng, xiangyu; zhang, wei; salzberg, steven l title: efficient oligonucleotide probe selection for pan-genomic tiling arrays date: - - journal: bmc bioinformatics doi: . / - - - sha: doc_id: cord_uid: g sjyytr background: array comparative genomic hybridization is a fast and cost-effective method for detecting, genotyping, and comparing the genomic sequence of unknown bacterial isolates. 
this method, as with all microarray applications, requires adequate coverage of probes targeting the regions of interest. an unbiased tiling of probes across the entire length of the genome is the most flexible design approach. however, such a whole-genome tiling requires that the genome sequence is known in advance. for the accurate analysis of uncharacterized bacteria, an array must query a fully representative set of sequences from the species' pan-genome. prior microarrays have included only a single strain per array or the conserved sequences of gene families. these arrays omit potentially important genes and sequence variants from the pan-genome. results: this paper presents a new probe selection algorithm (panarray) that can tile multiple whole genomes using a minimal number of probes. unlike arrays built on clustered gene families, panarray uses an unbiased, probe-centric approach that does not rely on annotations, gene clustering, or multi-alignments. instead, probes are evenly tiled across all sequences of the pan-genome at a consistent level of coverage. to minimize the required number of probes, probes conserved across multiple strains in the pan-genome are selected first, and additional probes are used only where necessary to span polymorphic regions of the genome. the viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a , probe array that fully tiles the genomes of different listeria monocytogenes strains with overlapping probes at greater than twofold coverage. conclusion: panarray is an oligonucleotide probe selection algorithm for tiling multiple genome sequences using a minimal number of probes. it is capable of fully tiling all genomes of a species on a single microarray chip. these unique pan-genome tiling arrays provide maximum flexibility for the analysis of both known and uncharacterized strains. 
microarrays are well known for their success in studying gene expression [ ] . as one of their many other roles, dna microarrays can also be used to characterize both large-scale and small-scale genetic variations. for instance, array comparative genomic hybridization (acgh) is commonly used in human cancer studies to genotype cell lines by detecting gene loss and copy number variations [ ] . at a finer resolution, microarrays are also used to detect single nucleotide polymorphisms at targeted loci [ ] . in addition to human screens, microarrays have been widely used for the detection and genotyping of microbial species. notably, a viral genotyping microarray [ ] was one of the methods used to etiologically link severe acute respiratory syndrome (sars) to a novel coronavirus [ ] . arrays for the detection and comparative analysis of bacterial genomes have also been developed, including arrays for listeria monocytogenes [ ] [ ] [ ] [ ] [ ] , and many other bacterial species. however, these earlier, low-density arrays did not contain enough probes to target the entire genome of the bacterium, and were forced to probe only a small subset of the known genes. as the density of dna microarrays increased in recent years, it has become possible to probe the entire genome of an organism in addition to only specific genes. an array providing unbiased coverage of probes across a genome is commonly referred to as a whole-genome tiling array. such arrays have been very successful for genome-scale analysis, including the discovery of novel transcripts, splicing variants, protein binding sites, and polymorphisms [ ] . depending on the offset between adjacent probe locations, whole-genome tilings can be either gapped, end-to-end, or overlapping (figure a). in the human genome, tiling arrays are designed to probe the genome at evenly spaced intervals.
to maximize the expected specificity of the array, repetitive probes must be avoided and experimental conditions, such as melting temperature, equalized. this creates an optimization problem in choosing which sequences should be included on the array [ , ] . in smaller microbial genomes, it is possible to target every position of the genome with overlapping probes, simplifying the design process. for example, extreme high-density arrays can now accommodate . million variable length probes on a single chip (roche nimblegen, inc). for an average mb sized bacterial genome and nt probe length, probes can be offset by only a single base-pair and still span the entire genome, generating a coverage of ×. by tiling the entire genome, some suboptimal probes will be included on the array, but can be identified and corrected for in the analysis. these overlapping arrays are capable of identifying polymorphism at a much finer resolution than gapped arrays. tiling arrays have traditionally been constructed based on the genome of a single reference strain and used to locate genomic differences contained in the experimental strains. however, single-genome arrays can only detect and analyze sequences similar to those included on the array, and cannot discover or analyze sequences absent in the reference strain. after the introduction of the pan-genome concept [ , ] , it has become increasingly clear that some microbial species contain significant genetic diversity, and it is not suitable to compare against only a single reference strain. the pan-genome hypothesis states that any given species has two sets of genes. first, a set of core genes present in all strains that define the species; and second, a set of dispensable genes present in only one or a few of the strains that presumably mediate adaptation. a single genome describes the genomic material for a particular strain, but the pan-genome describes the genomic makeup for an entire species. single reference tiling arrays cannot survey this full diversity.
figure: illustration of different tiling densities, and an example pan-genome tiling. genomes are represented as horizontal lines and probes as colored rectangles. the offset between probes is the distance between the start of one probe and the start of the next. (a) three different tiling densities are shown for genome a. the top figure illustrates a gapped tiling, the middle an end-to-end tiling, and the bottom an overlapping tiling. (b) a pan-genome tiling is shown for two genomes. genomes a and b are identical except for a small insertion in b, represented by vertical red bars. solid blue probes are conserved in both genomes, and probes spanning the insertion event are colored by variant. set h shows the non-redundant set of probes needed to tile the pan-genome including a and b.
ideally, an array for analyzing new strains should cover the genomic diversity of the entire pan-genome. with the explosion in microarray densities, it is now possible to design pan-genome tiling arrays that contain all genomic sequence from the known pan-genome. the simplest strategy is to fully tile the genomes of each strain independently. however, due to similarities between the strains, some sequences would be tiled with excessive redundancy, and this approach would be cost ineffective. instead, a pan-genome array should aim to minimize costs by using the minimal probe set necessary to target every element of the pan-genome with adequate coverage. the typical approach for targeting multiple strains is to group individual genes into gene families and then probe only the conserved sequences of those families [ ] [ ] [ ] . for example, willenbrock et al. designed an innovative strain escherichia coli pan-genome array by clustering homologous genes based on pairwise alignment similarity [ ] .
homology was defined as gene alignments with an e-value < - , a bitscore > , and alignment coverage of at least % of the gene length. for each resulting gene group, a consensus sequence was generated via multiple alignment, and probes were designed to target the most conserved regions of the consensus. the resulting array comprised , probes, targeting , gene groups, with a median coverage of probes per gene group. targeting only the conserved sequence of gene families is an effective and efficient method for detecting--at a low resolution--the presence and absence of gene families; however, for studies that require a finer resolution, this method omits many potentially significant sequences from the array. firstly, a slight variation in a gene (e.g. a partial deletion) can be responsible for a significantly different phenotype. by only targeting the conserved portion of gene families, the variable regions responsible for these differences will not be included on the array. secondly, a gene-centric design includes only coding sequences. therefore, these designs cannot be used to detect differences in intergenic regions which may include regulatory elements, or used for studies that require a whole-genome tiling, such as transcriptome mapping or chromatin immunoprecipitation-chip (chip-chip) studies. finally, gene-centric design models depend on an accurate annotation of the genome. if genes have been mis-annotated or omitted from the annotation, such genes cannot be properly represented on the array. this is particularly troublesome for many draft-quality genomes that have highly fragmented sequence assemblies and lack accurate annotations. for these reasons, a whole-genome tiling is preferable for applications that require more flexibility or an unbiased tiling of the genome. however, no methods have been described for efficiently tiling multiple whole-genome sequences.
this paper describes a method for pan-genome tiling array design that both minimizes the number of probes required and guarantees that all sequences in the pan-genome are fully tiled by the array. the prior gene-centric approaches are abandoned in favor of a more concrete, probe-centric approach that relies only on the genomic sequences and not the annotation. to summarize the new approach, let the pan-genome $G$ be the set of all genomes from a species, and let $P$ be the non-redundant set of all length $k$ substrings from $G$. due to sequence conservation between genomes, a single probe may match to multiple locations (genomes) of the pan-genome. call these matches the probe targets. the pan-tiling problem is to find a minimum cardinality subset $H \subseteq P$ such that all sequences of $G$ are targeted by probes in $H$ and no target is offset more than maxoff from an adjacent target (or sequence end). constructing a full tiling of the pan-genome seems like it would require a large number of probes, but by leveraging the similarities between strains, a reasonably sized probe set can be constructed that fully covers a large pan-genome with adequate redundancy. the key to the strategy is choosing probes that will hybridize to as many of the strains as possible, while using only as many probes as necessary to cover polymorphisms (insertions, deletions, variants). for example, figure b shows a pan-genome tiling for two miniature genomes, with a maxoff of one-third the probe length. genomes a and b are identical except for a small insertion in the middle of b. fully tiling both genomes requires a total of probe targets ( for a and for b), but probe set $H$ illustrates that these targets can be tiled with just probes. conserved probes are used to tile the left and right of both genomes, and distinct probes are used to tile the two polymorphism variants. this is obviously a simplified example. the problem becomes more difficult as the number of genomes and the complexity of polymorphisms increase.
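the relationship between probe targets and the non-redundant probe candidate set can be illustrated with a small sketch. the two toy genomes, the probe length of 6 and the function names below are assumptions made for illustration only; they are not taken from the panarray implementation.

```python
def kmers(seq, k):
    """Return (start, substring) for every length-k substring of seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def probe_targets(genomes, k):
    """Map each distinct k-mer (probe candidate) to all of its targets.

    genomes: dict of genome name -> sequence (hypothetical toy input).
    Returns a dict: probe sequence -> list of (genome name, start).
    """
    targets = {}
    for name, seq in genomes.items():
        for pos, kmer in kmers(seq, k):
            targets.setdefault(kmer, []).append((name, pos))
    return targets


# Toy genomes: B is identical to A except for a 3-base insertion.
genome_a = "ACGTTGCAAGCTTAGGCTAACG"
genome_b = genome_a[:11] + "TTT" + genome_a[11:]

targets = probe_targets({"A": genome_a, "B": genome_b}, k=6)
n_targets = sum(len(hits) for hits in targets.values())
print(f"{n_targets} targets are reachable with {len(targets)} distinct probes")
```

probes that lie entirely to the left or right of the insertion have one target in each genome, which is why the number of distinct candidates is smaller than the number of targets.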
the methods presented in this paper were developed to aid the design of a pan-genome cgh tiling array for listeria monocytogenes--the causative agent of listeriosis and a niaid category b biodefense agent that is of significant food safety and public health concern [ ] . the species of l. monocytogenes is composed of three primary genetic lineages (named i, ii, and iii) that display different capabilities of environmental survival and pathogenic potential to cause human infectious disease [ ] . in order to both characterize new strains based on genetic content, and detect polymorphism at a higher resolution in small rnas (srnas) and intergenic sequences, the array was required to cover all pan-genomic sequences with a high density of probes. this bacterial species is particularly well suited for pan-genome array design because there are a remarkable number of strains that have been sequenced. at the time of writing, a total of l. monocytogenes complete or draft genome sequences were available, totaling . mbp (table ) . genomic sequences and annotations were obtained from the national microbial pathogen database resource (nmpdr) [ ] . the sequence conservation for the sequenced strains was computed with nucmer [ ] , and ranges between % and % in nucleotide identity versus the completed egd-e reference strain. even with such substantial diversity within the species, the pan-array algorithm is able to design a pan-genome tiling covering each genome at more than twofold coverage using only , -mer probes. a similar density tiling for a single l. monocytogenes strain would require , probes, meaning the panarray design covers × more genomes using only × more probes. a description of this design, along with array designs for six other bacterial pan-genomes is given in the results section. the general strategy of the panarray design algorithm is best summarized by analogy to the well-known minimum hitting set problem in computer science [ , ] . 
let $P$ be a set of $n$ points and $F = \{P_1, P_2, \ldots, P_m\}$ be a family of $m$ subsets of $P$. minimum hitting set is the problem of selecting the minimum cardinality subset $H \subseteq P$ such that $H$ contains at least one element from each subset in $F$. although finding a minimum hitting set is known to be np-hard, it is a well-studied problem and efficient approximation algorithms are known. to see the similarities between the pan-tiling and minimum hitting set problems, let the sequence $G$ be a concatenation of all the genomes from a species, and let $W = \{w_1, w_2, \ldots, w_m\}$ be the set of $m$ intervals that results from segmenting $G$ into non-overlapping, end-to-end, length $l$ windows. let $P$ be the non-redundant set of length $k$ substrings from $G$. a probe candidate $p \in P$ is said to hit a window $w \in W$ if a match between $p$ and a substring of $G$ begins in the interval $w$. let $P_i \subseteq P$ be the subset of probes that hit the window $w_i$, and $F = \{P_1, P_2, \ldots, P_m\}$ for the $m$ windows of $W$. a minimum hitting set $H$ of $F$ is a minimum cardinality subset of probes $H \subseteq P$ such that every window of the pan-genome is hit by at least one probe in $H$. therefore, finding $H$ effectively tiles the entire pan-genome using a small number of probes. windowing the genome simplifies the pan-tiling problem by casting it as a minimum hitting set problem, and at the same time enforces the maxoff constraint. because each window is forced to contain at least one target, any two adjacent targets cannot be separated by more than twice the window length. therefore, the window length is set to one half of maxoff. for example, given a maximum offset of $2l$, windows are marked off every $l$ bases of the pan-genome--with the first window $w_1$ covering the interval $[1, l]$, the second window $w_2$ covering $[l+1, 2l]$, and so on. assuming one target is chosen per window, and the target locations are evenly distributed within windows, the average distance between adjacent targets is expected to be equal to the window length.
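the windowing step described above can be sketched as follows. this is a minimal illustration under stated assumptions, not the panarray code: the function names are hypothetical, only exact matches are considered, and each genome is windowed separately (rather than concatenated) so that no probe candidate spans the junction between two genomes.

```python
def window_hit_sets(genomes, k, window_len):
    """Build the family F: one set of probe candidates per window.

    genomes: list of sequences (hypothetical toy input).
    Entry i of the result holds every k-mer whose match begins
    inside window i.
    """
    family = []
    for seq in genomes:
        n_starts = len(seq) - k + 1          # valid probe start positions
        for w_start in range(0, n_starts, window_len):
            w_end = min(w_start + window_len, n_starts)
            family.append({seq[i:i + k] for i in range(w_start, w_end)})
    return family


def is_hitting_set(probes, family):
    """True if the probe set hits every window at least once."""
    return all(window & probes for window in family)


family = window_hit_sets(["ACGTTGCAAGCTTAGGCTAACG"], k=6, window_len=4)
all_candidates = set().union(*family)
print(len(family), is_hitting_set(all_candidates, family))
```

the full candidate set is trivially a hitting set; the design problem is to find a much smaller subset with the same property.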
for a window length $l$ equal to the probe length $k$, the resulting depth of coverage averages one, because the probes are spaced $k$ bases apart on average. for any other window length $l$, the resulting depth of coverage $c$ is expected to be $c \approx k/l$. the extreme case is $l = 1$, which results in exactly $k$-fold coverage because a probe must hit every position in $G$. to solve the minimum hitting set problem, once the pan-genome is discretized into a set of windows, each window must be mapped to the set of probe hits it contains. as before, a probe $p$ hits a window if a match between $p$ and $G$ begins within the window's interval. thus far exact matches have been assumed, but a match can be defined by any criteria necessary for efficient hybridization. to help reduce probe redundancy, the panarray implementation can optionally use inexact matches containing a single mismatch. any suitable k-mer indexing algorithm can be utilized for this phase, but allowing for mismatches can be computationally expensive. the implementation uses a fast, but memory intensive, compressed keyword tree for indexing all probe hits. alternatively, a slower, but memory efficient, hashing scheme would also work. to index the 1-mismatch hits, each probe's $3k$ possible 1-mismatch permutations are added to the index as well. the result of the indexing is a list of positions and windows for all k-mers of the pan-genome (the probe candidates). at this stage, the final list of probe candidates may be manually filtered based on typical criteria such as melting temperature, gc content, secondary structure, etc. for ungapped tilings, it is impossible to avoid suboptimal probes. however, highly repetitive probes can be identified by the number of genomic positions they map to, and should be discarded if they threaten to confound the array analysis (e.g. by affecting normalization). alternatively, the input sequences may be masked prior to k-mer indexing to avoid repetitive or unwanted sequence altogether.
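two quantities from this passage are easy to check numerically: the expected coverage $c \approx k/l$, and the number of single-mismatch variants of a probe. each of the $k$ positions can be replaced by three other bases, so a probe over the dna alphabet has $3k$ such variants. the sketch below uses hypothetical function names and an arbitrary example probe; it is not the keyword-tree index used by the implementation.

```python
def expected_coverage(k, window_len):
    """Expected depth of coverage c ~ k / l for probe length k, window length l."""
    return k / window_len


def one_mismatch_variants(probe, alphabet="ACGT"):
    """List every sequence that differs from probe at exactly one position."""
    variants = []
    for i, base in enumerate(probe):
        for alt in alphabet:
            if alt != base:
                variants.append(probe[:i] + alt + probe[i + 1:])
    return variants


probe = "ACGTAC"
variants = one_mismatch_variants(probe)
print(len(variants) == 3 * len(probe))        # 3k variants for a k-mer
print(expected_coverage(k=50, window_len=25)) # twofold coverage
```

adding all variants to a hash-based index is the memory-light, slower alternative mentioned in the text; the index grows by a factor of roughly $3k + 1$ per probe candidate.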
for cgh arrays, each probe is considered equivalent to its reverse complement, but for expression or transcriptome arrays, forward and reverse strand probes must be considered independently. probe matches are listed on the strand on which they appear, so for single-stranded samples, the sequence to be synthesized for the array will need to be reverse complemented. for dna tiling arrays it is helpful to assume the sample will be double-stranded so that genomic inversions in one or more of the strains do not have to be tiled separately. as detailed above, selecting a minimum probe set for tiling $G$ is equivalent to finding the minimum hitting set of $F$. as before, $W$ is the windowed pan-genome. let $W_p$ be the subset of windows hit by probe $p$, and $U$ be the set of currently uncovered windows. let a window hit by at least one probe be termed covered, and the coverage of a probe be the number of windows it hits, $|W_p|$. a naive algorithm for finding a small hitting set $H$ is to choose, for each uncovered window, a probe hitting the window that also hits the most other windows. the idea is that choosing probes with the highest coverage will minimize the total number of probes necessary to cover all windows. however, this approach does not properly account for the probe coverages. only a single probe is needed to cover a window, so after selecting a probe $p$, all other probes that hit a window in $W_p$ will see their effective coverage reduced. take for instance two probes $p$ and $q$ that hit the exact same set of windows. choosing $p$ reduces the effective coverage of $q$ to zero, because all of $q$'s windows have already been covered by $p$. let the residual coverage $r_p$ of a probe be the effective coverage after some other set of probes have already been chosen ($r_p = |W_p \cap U|$). a greedy algorithm first suggested by johnson [ ] improves on the naive approach by reconsidering the residual coverage of probes after each iteration.
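the two-probe example above can be reproduced in a few lines. the probe names and window ids are made up for illustration, and `residual_coverage` is a hypothetical helper, not part of any published code.

```python
def residual_coverage(probe_windows, uncovered):
    """Compute r_p = |W_p & U| for every probe p."""
    return {p: len(ws & uncovered) for p, ws in probe_windows.items()}


# p and q hit exactly the same windows; r overlaps them in window 2.
probe_windows = {"p": {0, 1, 2}, "q": {0, 1, 2}, "r": {2, 3}}
uncovered = {0, 1, 2, 3}

before = residual_coverage(probe_windows, uncovered)
uncovered -= probe_windows["p"]              # select probe p
after = residual_coverage(probe_windows, uncovered)
print(before, after)
```

after selecting p, the residual coverage of q drops to zero and that of r drops to one, even though their initial coverages were three and two.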
this algorithm has since been shown to be essentially a best-possible approximation for the minimum hitting set problem [ ] . when adapted for the current problem, the algorithm chooses, while uncovered windows remain, the probe that hits the most currently uncovered windows. the greedy panarray algorithm is: while $U \neq \emptyset$, select $p = \arg\max_{p \in P} |W_p \cap U|$, add $p$ to $H$, and remove $W_p$ from $U$. the algorithm itself is straightforward, but it must be carefully implemented to run efficiently. it is infeasible to recompute the residual coverage $|W_p \cap U|$ for all $W_p$ during each iteration, because both $P$ and $W$ can be on the order of millions for a large pan-genome. to avoid this complexity, the panarray implementation exploits a property of the residual coverages that allows it to recompute only a few values at each iteration. note that for any $p$, its residual coverage $r_p$ can never increase. a probe's coverage either remains the same, or decreases because one of its windows was hit by the prior iteration. therefore, instead of recomputing all residuals after each iteration, it is sufficient to maintain a priority queue of residual coverages and only update stale values at the front of the queue. at the start of the algorithm, all initial coverages are inserted into the queue. to maintain the priority queue after a new probe is chosen, all residual coverages are considered invalid. during the next iteration, a new $r_p$ value is computed for the front of the queue, marked as valid, and reinserted into the queue. this process is repeated until a valid residual returns to the front of the queue. often, newly computed residuals will return quickly to the head of the queue before the others have been updated. at this point it is unnecessary to update any other residuals because their new values cannot be greater than their current value. therefore, the head of the queue must be the updated maximum. this lazy evaluation of the residuals avoids many unnecessary computations and drastically improves the performance of the algorithm.
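the lazy evaluation described above can be sketched in a few lines. this is a minimal illustration and not the c++ panarray implementation: it assumes the probe-to-window mapping fits in a dictionary, uses python's `heapq` as the priority queue, and replaces the explicit valid/invalid marking with an equivalent test (a refreshed residual that is still at least as large as the next stored value must be the maximum, because stored values are upper bounds). the name `greedy_hitting_set` is hypothetical.

```python
import heapq


def greedy_hitting_set(probe_windows: dict[str, set[int]]) -> list[str]:
    """Greedy hitting set with lazily refreshed residual coverages."""
    uncovered = set().union(*probe_windows.values())
    # heapq is a min-heap, so coverages are stored negated.
    heap = [(-len(windows), probe) for probe, windows in probe_windows.items()]
    heapq.heapify(heap)
    chosen = []
    while uncovered and heap:
        _, probe = heapq.heappop(heap)
        residual = len(probe_windows[probe] & uncovered)
        if residual == 0:
            continue  # every window of this probe is already covered
        if heap and residual < -heap[0][0]:
            # Stored value was stale; reinsert with the refreshed residual.
            heapq.heappush(heap, (-residual, probe))
            continue
        # Refreshed residual is >= every stored upper bound: it is the maximum.
        chosen.append(probe)
        uncovered -= probe_windows[probe]
    return chosen


# Toy example (hypothetical probes and windows)
example = {"a": {0, 1, 2, 3}, "b": {2, 3, 4}, "c": {4, 5}}
selected = greedy_hitting_set(example)
```

in the toy example, probe `b` starts with a stored coverage of three, but after `a` is chosen its refreshed residual drops to one, so it is pushed back and `c` is selected instead.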
the greedy algorithm without this speedup takes days to complete, but with the speedup runs in a matter of seconds. the flexibility of the panarray design algorithm is a result of its probe-centric approach. because it does not require any identification or clustering of genes, the design is independent of any genome annotation. therefore, instead of building the annotation into the design of the array, the annotation can be mapped onto the array after the design. most importantly, this strategy allows for intergenic sequence and unannotated genomes to be included on the array, and annotation updates to be incorporated as they become available. for example, after the l. monocytogenes array had been designed (see results), over new srnas were discovered in listeria [ ] . neatly, the sequences of each had already been tiled by the array design, and the updated annotation was easily remapped onto the array. as another example, the gene counts provided by nmpdr in table are inconsistent and vary between , and , genes per genome, suggesting considerable annotation error. uncoupling the array design from the annotations removes any possibility that annotation errors will affect the design. included with the final probe set h is the list of locations on the pan-genome that each probe matches. if the genome sequence is updated, the location information can be easily recovered by remapping the probes to the genome using a matching tool such as mummer [ ] or vmatch [ ] . to annotate the array, probes are mapped to all annotation features with a coinciding location. the result is a many-to-many mapping with each feature being targeted by multiple probes, and a single probe possibly targeting multiple features (e.g. conserved genes between strains). with this mapping, all probes targeting a specific gene in the pan-genome can be quickly recovered. as suggested in the introduction, l. 
monocytogenes is a good candidate for constructing a pan-genome tiling array because the species has been widely sequenced, with complete or draft genome sequences available. to confirm that the sequenced genomes contain the majority of l. monocytogenes genetic diversity, the pan-genome size was estimated using the methods of tettelin et al. [ ] as implemented in the ergatis package [ ] . seventeen of the eighteen l. monocytogenes genomes listed as annotated by nmpdr in table were used in the analysis (strain / a f was unavailable at the time). according to the cited method, the addition of an n th genome was simulated by searching the annotated genes of each genome against all possible permutations of n- other genomes. genes without a match over % protein similarity for at least % of their length were recorded as "new". the number of new genes n expected to be discovered in the n th sequenced genome was modeled by the power law n = κnα , and the parameters κ and α were estimated from the data via non-linear least squares regression using the r function nls [ ]. the regression was performed on the full set of over million data points. a power law model was found to fit the l. monocytogenes data better than the originally proposed exponential model. this agrees with a recent suggestion that a power law is a more appropriate model of the pan-genome phenomenon [ ] . the estimated number of undiscovered genes is shown in figure . the power law exponent α was found to be . ± . , suggesting that the l. monocytogenes pan-genome is closed (i.e. has a finite number of genes), and the sequencing of more genomes would eventually sample the entire set of dispensable genes. therefore, it appears the vast majority of l. monocytogenes genes have been sequenced and are included on the array. this model predicts that the addition of a st genome to table would yield only ~ new genes. 
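the power-law fit described above can be approximated with a short sketch. the authors used non-linear least squares via the r function nls; the code below instead uses ordinary least squares on log-transformed values, a simpler substitute that weights errors differently and will not reproduce nls estimates exactly. it takes the model in the form written in the text, n = κ·N^α, where N is the genome count and α is whatever sign the fit produces. the function name and the data are hypothetical.

```python
import math


def fit_power_law(genome_counts: list[int], new_genes: list[float]) -> tuple[float, float]:
    """Fit n = kappa * N**alpha by linear regression of log(n) on log(N).

    All inputs must be strictly positive. Returns (kappa, alpha).
    """
    xs = [math.log(count) for count in genome_counts]
    ys = [math.log(genes) for genes in new_genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    kappa = math.exp(mean_y - alpha * mean_x)
    return kappa, alpha


# Synthetic data generated from n = 100 * N**-2 (not measured values)
counts = [2, 3, 4, 5, 6]
genes = [100.0 * c ** -2 for c in counts]
kappa, alpha = fit_power_law(counts, genes)
```

with this convention, a declining number of new genes per added genome gives a negative fitted α; other presentations of the model carry the minus sign in the exponent explicitly.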
however, only a single lineage iii genome was included in this analysis, so this prediction might be artificially low for a new lineage iii strain. the sole lineage iii strain analyzed (fsl j - ) contains genes absent in any of the lineage i and ii strains. to capture the full diversity of l. monocytogenes, all genomes listed in table were included in the design, with a combined sequence length of , , bp and a total of , annotated genes. to avoid tiling low quality or contaminant sequence, contigs less than kbp in length were discarded--reducing the tiled sequence length to , , bp. the design was constrained to a , feature nimblegen array with a probe length of nt. because hybridization of a -mer probe will tolerate a few mismatches, probes differing by a single mismatch were considered equivalent during the design phase. the window length was set to bp, enforcing a maximum target offset of , an expected depth of coverage of about / = . ×, and resulting in approximately . million windows. these parameters guarantee that every base-pair of the pan-genome will be covered by at least one probe, since the maximum offset is less than the probe length. to cover each window, the panarray algorithm selected , distinct probes mapping to , , positions in the pan-genome. on average, each probe in the design targets about different positions in the pan-genome. rather than being repeated sequences within the same genome, these different locations most often refer to a conserved locus in multiple strains (figure ). interestingly, the degree of probe reuse corresponds well with the known evolutionary relationship of the strains. included on the chip are genomes from lineage i, from lineage ii, and from lineage iii. this would suggest that the peak at genomes = in figure is for strain-specific probes; the peaks around and are for lineage-specific probes; and the peak around is for species-specific probes that are conserved in all l. monocytogenes genomes. 
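the coverage guarantee mentioned above (every base is covered when the maximum offset between adjacent probe targets is less than the probe length) can be checked directly on a list of probe target positions. the sketch below is an illustration under assumed inputs; the function names and the example values are hypothetical and are not the parameters of the l. monocytogenes design.

```python
def expected_depth(probe_len: int, window_len: int) -> float:
    """Expected depth of coverage, c ≈ k / l."""
    return probe_len / window_len


def uncovered_bases(starts: list[int], probe_len: int, seq_len: int) -> int:
    """Count bases of a sequence not covered by any probe target."""
    covered_until = 0  # first position not yet known to be covered
    gaps = 0
    for start in sorted(starts):
        if start > covered_until:
            gaps += start - covered_until  # bases skipped before this target
        covered_until = max(covered_until, start + probe_len)
    gaps += max(0, seq_len - covered_until)  # uncovered tail, if any
    return gaps


# Offsets of 3 with probes of length 4: no base is left uncovered.
full = uncovered_bases([0, 3, 6], probe_len=4, seq_len=10)
# An offset of 5 with probes of length 4 leaves gaps.
gapped = uncovered_bases([0, 5], probe_len=4, seq_len=10)
```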
because this is a dense tiling of the entire genome, it was unnecessary to optimize probes for uniqueness, as is done in standard expression arrays with only a few probes per gene. probes were screened for repetitive sequences, but the l. monocytogenes strains were found to contain few repeats. the most repetitive -mer occurs only times per genome, and the most repetitive -mer probe used in the design targets a "cell wall surface anchor protein" family and occurs a maximum of times per genome. altogether, . % of the probes target at most one location per genome. to augment the original panarray design, an additional negative control probes were added to the array, chosen from bacillus spp., which is a known cohabitant of listeria. the negative control probes were chosen to be specific to bacillus spp. using the insignia genomic signature design pipeline [ ] . the remaining , features on the array were filled by selecting individual probes to supplement the lowest coverage regions of the design. all probes were checked to conform to nimblegen design specifications, and a few probes were trimmed to meet synthesis cycle limits. the resulting l. monocytogenes pan-genome array has an average depth-of-coverage of . ×, with a median probe offset of bp, and a modal offset equal to the window length of bp. the full distribution of probe offsets is given in figure . as expected, the average offset is equal to the window length ( bp). the uneven distribution and pronounced mode are caused by non-random tie breaking.
figure: the number of new genes n predicted to be discovered with the addition of an n th listeria monocytogenes genome sequence. a power law fit to the simulated data is given by the solid curve. the circles represent the mean value for each n, and error bars show the % confidence intervals.
in the case of a conserved sequence, where every probe hits the same number of genomes, the first probe of the window is always chosen. also, the heavy left tail indicates that many windows are covered by more than one probe and that the solution is slightly denser than expected ( . × actual vs. . × expected). this may be a consequence of the sequence composition, or may indicate a non-optimal solution. finally, the majority of targeted sequences exactly match their probe ( %) and the remainder match with a single mismatch ( %). the performance gain of panarray over more naive methods is significant. for instance, selecting a single probe from each window requires roughly . million probes. the slightly more principled naive algorithm, which does not recompute residual coverages, chooses , , probes, which is still well over the , probe limit. the greedy panarray algorithm meets this limit and vastly outperforms the other methods--requiring only , probes to cover the entire pan-genome. with the lazy evaluation speedup, the panarray algorithm is also comparable in runtime to the naive algorithms. on a single . ghz processor, the naive algorithm took seconds; the greedy algorithm without lazy evaluation was terminated without completing after a few days; and the greedy panarray algorithm with lazy evaluation took only seconds. the runtime for the final design process was dominated by building the k-mer index, which required minutes using a compressed keyword tree. using panarray, additional arrays were designed for a total of seven bacterial pan-genomes, for which a large number of genomes have been sequenced. the additional species include: francisella tularensis, staphylococcus aureus, bacillus anthracis, vibrio cholerae, burkholderia pseudomallei, escherichia coli, and shigella spp. due to their high similarity, e. coli and shigella spp. were considered as a single pan-genome.
to facilitate easy comparison, all designs were created with a window length of bp, a probe length of nt, and allowing for probes to contain a single mismatch to their target. as with the l. monocytogenes design above, draft genomes were included, but contigs less than kbp were discarded. the results are given in table . probe "reuse" is measured in the average number of targets per probe. it is rare for a -mer probe to match to more than one location per genome, so the number of targets per probe is roughly equivalent to the average number of genomes that a probe matches. the highly conserved species of b. anthracis exhibits near perfect probe reuse. almost every b. anthracis probe matches all of the included strains; therefore, the number of probes required to tile the nine sequenced strains is nearly the same as is required to tile one strain. this is because the pan-genome of b. anthracis is closed and the strains are highly conserved at the nucleotide level (usually containing only a few snps per strain). adding successive b. anthracis strains to the array would increase the required number of probes very gradually. in contrast, l. monocytogenes has the lowest degree of probe reuse, with each probe targeting on average only % of the included strains.
figure: histogram of offsets between adjacent probe targets in listeria monocytogenes. the offset between two adjacent probe targets is given on the horizontal axis. targets may contain up to one mismatch to the probe.
table notes: avg. length is the average genome length for a species. pan length is the sum of all genome lengths for a species. targets is the total number of locations targeted by the probes. a single probe may target multiple genomes in the species. reuse is the average number of targets per probe, and a normalized reuse is given in parentheses as the reuse divided by the number of genomes.
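the reuse statistics defined for the table can be computed as follows. this is a small worked sketch; the function name and the numbers in the example are hypothetical and are not values from the table.

```python
def probe_reuse(targets: int, probes: int, genomes: int) -> tuple[float, float]:
    """Return (reuse, normalized reuse) for an array design.

    reuse is the average number of targets per probe; normalized reuse
    is reuse divided by the number of genomes in the pan-genome.
    """
    reuse = targets / probes
    return reuse, reuse / genomes


# Hypothetical design: 100 probes hitting 900 locations across 9 genomes.
reuse, normalized = probe_reuse(targets=900, probes=100, genomes=9)
```

a normalized reuse near one corresponds to the b. anthracis case, where almost every probe matches every strain.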
this is a reflection of the diversity of strains that have been sequenced and the low level of nucleotide conservation between strains, with some strains differing by as much as % (see table ). any snp rate higher than % ( per bp) exceeds the mismatch threshold per probe and requires additional probes to target the divergent sequence. however, as more variants are added to the array, the addition of each successive genome requires fewer new probes than the last, on average. figure shows this relationship for the l. monocytogenes strains. successive strains are added by order of lineage, from the bottom of table to the top, and the design is recomputed at each step. there are pronounced jumps in the number of probes required when the first of a new lineage is added, but the number of probes needed to tile the rest of the lineage quickly levels off. escherichia coli and shigella spp. form the largest pan-genome currently sequenced, totaling over mbp of genomic sequence. even for a pan-genome of this size and diversity, panarray effectively tiles all sequences at an average of × coverage using only , probes--well below the maximum number of probes available on current arrays. the b. pseudomallei pan-genome is roughly equivalent in total number of pan-genome bases, but requires considerably fewer probes because of higher probe reuse. due to the large number of sequenced genomes and relatively high similarity between strains, the b. pseudomallei design exhibits the highest probe reuse factor of all the designs ( . ×). creating a × coverage tiling by choosing one probe every bp would require roughly . million probes for the b. pseudomallei pan-genome, but panarray was able to create a . × tiling of the same pan-genome with only , probes. the panarray algorithm was implemented in c++, and the source code is freely available at http://www.cbcb.umd.edu/software/panarray.
the listeria monocytogenes array design described above is available from the gene expression omnibus [ ] under geo accession number gpl . the panarray algorithm described above is ideal for high-density tilings of overlapping or closely spaced probes. the results section has shown that this algorithm is applicable for all currently available bacterial pan-genomes. however, if the maximum number of probes is limited, or the genome size is extremely large, it may be necessary to design a tiling with gaps between the probe targets (i.e. a maximum offset greater than the probe length). in this case, it is necessary to choose unique probes that avoid unwanted cross hybridization between repetitive sequences within the genome. to achieve this, repetitive probes can be filtered, or the coverage scores used in the panarray algorithm can be weighted to penalize repetitive probes. for example, probe coverage can be redefined as the number of genomes a probe targets, rather than the number of windows, and probes targeting multiple windows in the same genome can be appropriately downweighted. in many cases, probes within the same window will share the same coverage score, and rules can be applied for breaking the tie and choosing the most reliable probe. similar schemes could be devised to favor probes with any other desirable criteria. array analysis of acgh experiments is typically conducted on signal ratios between a reference and experimental hybridization. duplications or deletions in the experimental samples are evident as non-zero values of the log ratio of the two normalized signals. so-called segmentation algorithms examine this log ratio across multiple positions in the reference sequence to determine the boundaries of the variations [ , ] . the most accurate methods consider not just individual probes, but a context of probes around a genomic location, and can identify even small polymorphisms between the strains.
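the log-ratio signal described above can be illustrated with a short sketch. it only computes per-probe log2 ratios and applies a fixed threshold; it is not a segmentation algorithm and ignores the probe context that the more accurate methods use. the function names, the threshold, and the intensities are hypothetical, and the inputs are assumed to be already normalized and strictly positive.

```python
import math


def log_ratios(reference: list[float], experimental: list[float]) -> list[float]:
    """Per-probe log2 ratio of experimental to reference signal."""
    return [math.log2(exp / ref) for ref, exp in zip(reference, experimental)]


def call_probes(ratios: list[float], threshold: float = 1.0) -> list[str]:
    """Label each probe by comparing its log ratio to +/- threshold."""
    calls = []
    for ratio in ratios:
        if ratio >= threshold:
            calls.append("gain")
        elif ratio <= -threshold:
            calls.append("loss")
        else:
            calls.append("unchanged")
    return calls


# Hypothetical normalized intensities for three probes
ratios = log_ratios([100.0, 100.0, 100.0], [100.0, 400.0, 25.0])
calls = call_probes(ratios)
```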
these analyses require both a reference signal and a reference coordinate system on which the probes are tiled. usually a whole-genome tiling is constructed for a single reference strain, but because panarray provides a whole-genome tiling for every reference strain included in the array, the same array design can be used to perform segmentation analysis against any reference strain on the array. in addition to segmentation analysis versus a reference genome, a pan-genome array makes it possible to analyze uncharacterized strains in the context of the entire pan-genome. in some cases, it is preferable to use a multistrain control [ ] , but depending on the number of genomes, it can be impractical to co-hybridize all reference strains included on the chip. in these cases, traditional segmentation or log-ratio analysis must be replaced by a method that does not require a reference hybridization signal. for gene-level analysis, direct analysis of the individual probe intensities provides comparable sensitivity and specificity versus segmentation analysis [ ] , and various methods have been developed that operate independently of a signal ratio [ , , ] . a probe-based approach provides the most flexibility for pan-genome array analysis, because each probe can be individually scored based on its own intensity, and the genes can be classified based on the aggregated scores of the individual probes without the need for a control hybridization. pan-genome tiling arrays have all the applications of single-strain tiling arrays, but with enhanced flexibility and the ability to analyze previously uncharacterized strains. pan-genome acgh offers an economical alternative to sequencing for determining the genomic makeup of uncharacterized strains in a species and explaining the causative factors of phenotypic differences between strains.
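the probe-based, reference-free gene classification described above can be sketched as follows. the aggregation rule (median probe intensity compared against a fixed cutoff) is an assumption made for illustration and is not one of the cited methods; the function name, cutoff, and data are hypothetical. the probe-to-gene mapping is many-to-many, as in the array annotation step.

```python
from collections import defaultdict
from statistics import median


def classify_genes(
    intensities: dict[str, float],
    probe_genes: dict[str, list[str]],
    cutoff: float,
) -> dict[str, bool]:
    """Call each gene present (True) or absent (False).

    A gene's score is the median intensity of all probes mapped to it;
    one probe may contribute to several genes.
    """
    per_gene = defaultdict(list)
    for probe, genes in probe_genes.items():
        if probe not in intensities:
            continue  # probe missing from this hybridization
        for gene in genes:
            per_gene[gene].append(intensities[probe])
    return {gene: median(values) >= cutoff for gene, values in per_gene.items()}


# Hypothetical intensities and annotation
signal = {"p1": 10.0, "p2": 12.0, "p3": 1.0, "p4": 2.0, "p5": 11.0}
mapping = {"p1": ["geneA"], "p2": ["geneA", "geneB"], "p3": ["geneB"],
           "p4": ["geneB"], "p5": ["geneA"]}
presence = classify_genes(signal, mapping, cutoff=5.0)
```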
probe-based methods, such as microarrays, are especially well suited for situations where sequencing is inefficient because there is a low abundance of target dna and a high abundance of background dna intermixed. for example, applications such as real-time pathogen detection, surveillance, and diagnostics require a known sequence of dna to be targeted from a vast environment [ , , ] . a pan-genome array could be used for the detection and genotyping of pathogens from a large environment, without needing to isolate the individual cells. pan-genome arrays could also be used to capture all species- or locus-specific genomic material from an environment, which could then be directly processed or sequenced separately from the metagenome. microarray-based genomic capture has already been applied to targeted human resequencing as an efficient means of enriching for desired sequencing templates [ ] [ ] [ ] . without the need for sequencing additional genomes of the same species, pan-genomic acgh has become an increasingly popular and cost-effective approach to compare and characterize genomic contents of unknown bacterial isolates. prior multi-strain arrays have targeted the conserved sequences of gene families, or a selected group of polymorphisms, and therefore provide only partial coverage of the pan-genome. panarray is a probe selection algorithm capable of designing a tiling array that fully covers all genomes of a species using a minimal number of probes. the viability of this method is demonstrated by array designs for seven different bacterial pan-genomes, each of which can fit on a single microarray slide. by constructing an unbiased tiling of all known sequences, these unique pan-genome tiling arrays provide maximum flexibility for the analysis, detection, or capture of genomic material for entire species.
quantitative monitoring of gene expression patterns with a complementary dna microarray high resolution analysis of dna copy number variation using comparative genomic hybridization to microarrays large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome microarray-based detection and genotyping of viral pathogens a novel coronavirus associated with severe acute respiratory syndrome identification of listeria species by microarray-based assay new aspects regarding evolution and virulence of listeria monocytogenes revealed by comparative genomics and dna arrays mixed-genome microarrays reveal multiple serotype and lineage-specific differences among strains of listeria monocytogenes selective discrimination of listeria monocytogenes epidemic strains by a mixed-genome dna microarray compared to discrimination by pulsed-field gel electrophoresis, ribotyping, and multilocus sequence typing genome diversification in phylogenetic lineages i and ii of listeria monocytogenes: identification of segments unique to lineage ii populations applications of dna tiling arrays for whole-genome analysis design optimization methods for genomic dna tiling arrays optimized design and assessment of whole genome tiling arrays the microbial pan-genome genome analysis of multiple pathogenic isolates of streptococcus agalactiae: implications for the microbial "pan-genome characterization of probiotic escherichia coli isolates with a novel pangenome microarray a fast and flexible approach to oligonucleotide probe design for genomes and gene families. 
bioinformatics design of long oligonucleotide probes for functional gene detection in a microbial community listeria monocytogenes, a food-borne pathogen ribotypes and virulence gene polymorphisms suggest three distinct listeria monocytogenes lineages with differences in pathogenic potential the national microbial pathogen database resource (nmpdr): a genomics platform based on subsystem annotation versatile and open software for comparing large genomes computers and intractability: a guide to the theory of np-completeness complexity and approximation: combinatorial optimization problems and their approximability properties approximation algorithms for combinatorial problems a threshold of ln n for approximating set cover the listeria transcriptional landscape from saprophytism to virulence the vmatch large scale sequence analysis software . r: a language and environment for statistical computing comparative genomics: the bacterial pan-genome comprehensive dna signature discovery and validation ncbi geo: archive for high-throughput functional genomic data circular binary segmentation for the analysis of array-based dna copy number data a comparison study: applying segmentation to array cgh data for downstream analyses optimal control and analysis of two-color genomotyping experiments using bacterial multistrain arrays improved analysis of bacterial cgh data beyond the log-ratio paradigm detection of divergent genes in microbial acgh experiments comparative genomics tools applied to bioterrorism defence oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays multiplex amplification of large sets of human exons microarray-based genomic selection for high-throughput resequencing direct selection of human genomic loci by microarray hybridization the authors would like to thank arthur delcher for a helpful critique of the manuscript draft, and hervé tettelin and david riley for help running their pan-genome analysis software. 
this work was supported in part by the us department of homeland security science and technology directorate under award nbch . amp conceived the problem, designed and implemented the algorithm, performed the analyses, and wrote the manuscript. xd contributed to the design and analysis, and edited the manuscript. wz and sls helped conceive the problem, edited the manuscript, and coordinated the project. all authors read and approved the final manuscript. key: cord- -tv ntug authors: gautam, ablesh; tiwari, ashish; malik, yashpal singh title: bioinformatics applications in advancing animal virus research date: - - journal: recent advances in animal virology doi: . / - - - - _ sha: doc_id: cord_uid: tv ntug viruses serve as infectious agents for all living entities. there have been various research groups that focus on understanding the viruses in terms of their host-viral relationships, pathogenesis and immune evasion. however, with the current advances in the field of science, now the research field has widened up at the ‘omics’ level. apparently, generation of viral sequence data has been increasing. there are numerous bioinformatics tools available that not only aid in analysing such sequence data but also aid in deducing useful information that can be exploited in developing preventive and therapeutic measures. this chapter elaborates on bioinformatics tools that are specifically designed for animal viruses as well as other generic tools that can be exploited to study animal viruses. the chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (orf) recognition and tools that enable to analyse host-viral interactions, gene prediction in the viral genome, etc. various databases that organize information on animal and human viruses have also been described. 
the chapter gives an overview of the current advances, online and downloadable tools and databases in the field of bioinformatics that will enable researchers to study animal viruses at the gene level. viruses are known to infect all forms of life ranging from bacteria to chordates. in humans, viruses are known to cause infectious diseases such as influenza, hepatitis, aids, diarrhoea, encephalitis, dengue fever and, more recently, severe acute respiratory syndrome (sars), ebola (singh et al. a), zika (singh et al. b), etc. despite the vaccines and treatments for such diseases, morbidity and mortality both occur as a result of viral infections. viral disease of animals not only affects production but is also a threat to humans (saminathan et al. ) . sequencing methods have become rapidly more available, and a vast amount of viral sequence data has been generated in recent times. thus, it is imperative to decipher this data using more advanced tools such as bioinformatics resources. a large number of bioinformatics tools that can aid in the analysis of viral genomes and in the development of preventive and therapeutic strategies have been developed for human as well as animal viruses. this chapter will introduce virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data to elucidate viral dynamics, evolution and preventive therapeutics. analysis of viral sequence involves use of certain tools that are employable on any novel sequence, for example, gene identification, orf identification, functional annotation and phylogeny. however, due to small genome size, viruses have complex methods to maximize the coding potential of genomes and evolution. many viruses utilize overlapping reading frames or translational frameshifts to code for multiple proteins from limited genome sequences.
also, higher rates of mutations and recombination between related viruses pose a challenge in accurate phylogenetic and evolutionary analysis of viruses using general-purpose software. lately, enormous growth in the volume and diversity of viral sequences in the databases has been seen. now, it has become imperative to organize data of these viral sequences in virus family-specific resources tailored for accurate analysis of a specific virus. one of the most common applications of bioinformatics in virology was to use phylogenetic analysis of the viral isolates to aid in the epidemiological analysis of viral outbreaks. general-purpose phylogeny programs such as phylip (felsenstein ) have been used extensively for the phylogeny and molecular epidemiology of viruses. a comprehensive list of these packages and web servers is maintained by joe felsenstein at http://evolution.genetics.washington.edu/phylip/software.html. an open reading frame (orf) is the part of a genome that translates into a protein. finding orfs is one of the key steps in viral genome analysis. it forms the basis for further analysis such as homology search, predicting proteins, functional analysis and viral vaccine and antiviral target discovery. if an orf translates a surface protein that is unique to that virus, it may elicit immune responses and could potentially be a vaccine candidate. orf finder by ncbi is an orf prediction program (rombel et al. ) . the program outputs the range of each orf along with its protein translation in six possible reading frames from the input dna sequence. it can be used to search newly sequenced dna for potential protein encoding sequences and to verify predicted proteins using smart blast or blastp (altschul et al. ). however, the web version of the program is limited to a query sequence length of kb only. a standalone system has no limitation on length but is available only for the linux operating system.
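the six-frame orf search described above can be illustrated with a minimal sketch. this is not the algorithm used by ncbi orf finder: it assumes the standard stop codons, accepts only atg as a start codon, and does not handle overlapping orfs, alternative start codons, or frameshifts, all of which matter for viral genomes. the function names, the default minimum length, and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[str, int, int, int]]:
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and relative to the scanned strand (for "-" they refer
    to the reverse complement). min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


# Toy sequence with one short ORF (ATG AAA TAG) in forward frame 2
found = find_orfs("CCATGAAATAGCC", min_codons=2)
```

each reported range can then be translated and compared against protein databases, as the text describes for blastp.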
neg , a -codon novel orf in segment of influenza virus, was visualized using orf finder (clifford et al. ). using the orf finder in association with the basic local alignment search tool blast, orfs were found in the hz- virus genome (cheng et al. ) . due to small genome size, viruses employ multiple strategies to maximize the coding potential including frameshifts and alternative codon usage. thus, virus-specific programs have been developed to overcome these challenges. genemark (http://opal.biology. gatech.edu/genemark/genemarks.cgi) provides gene prediction tools for viruses (besemer and borodovsky ) . viral genome organizer (vgo) -a java-based web tool -offers identification of gene and orf identification in viral sequences (upton et al. ) . identification of immune epitopes is important in designing new vaccine candidates and in diagnostics. an epitope is the part of an antigen that is recognized by the receptors of immune system components such as antibodies, b cells or t cells. epitopes have been generally classified as either linear or conformational epitopes. t cells recognize linear epitopes, short continuous strings of amino acids derived from protein antigen, presented with mhc class i molecules. b cells and antibodies, on the other hand, recognize conformational epitopes which are formed by interactions of amino acids with multiple discontinuous segments forming a threedimensional antigen (barlow et al. ). owing to the simple linear structure of t cell epitopes, their interaction with receptors can be modelled with high accuracy (delisi and berzofsky ) . a large number of prediction databases and servers thus are available for linear epitope prediction. mhcpep (brusic et al. ) , syfpeithi (rammensee et al. ) , fimm (schonbach et al. ) , mhcbn (bhasin et al. ) and epimhc (reche et al. ) are some of the commonly used t cell epitope prediction programs. immune epitope database and analysis resource (https://www.iedb.org) (vita et al. 
) offers the most comprehensive set of tools for epitope analysis and prediction, covering hla-a and hla-b for humans as well as chimpanzee, macaque, gorilla, cow, pig and mouse, and is one of the few databases that cover such a variety of organisms. since , iedb has used netmhcpan as its prediction method. the netmhc server uses an artificial neural network method to predict binding of peptides to different mhc alleles from humans as well as animals including cattle and pig ( from core). the database also contains curated data for many viruses including influenza and herpesviruses. interactions between b cell receptors and epitopes are more complex than those of the linear t cell epitopes; thus, the accuracy of b cell epitope prediction is relatively low. furthermore, most of the current databases are centred on linear rather than conformational epitopes. bcipep is a database of linear b cell epitopes (saha et al. ). epitome is a database of structure-inferred antigenic residues in proteins (schlessinger et al. ). epitome is especially useful in the analysis of antibody-antigen complex interactions. the database is available at http://www.rostlab.org/services/epitome/. antijen is a detailed database with entries on both t cell and b cell epitopes. it emphasizes the integration of kinetic, thermodynamic, functional and cellular data within the context of immunology and vaccinology (toseland et al. ) (fig. . a ). three-dimensional structure prediction of viral proteins can be used to relate protein structure to antigenic sites, folding surfaces and functional motifs. such structural modelling tools may be applied to identify and design novel candidates for antiviral inhibitors and vaccine targets. secondary structures may be predicted using the tool predictprotein (http://www.predictprotein.org/) (rost et al. ).
using this online tool, solvent accessibility and possible transmembrane helices can be predicted along with secondary structures. it also provides the expected accuracy of the prediction methods. swiss-model (http://swissmodel.expasy.org/) is a popular tool for the prediction of the three-dimensional structure of a protein. three-dimensional structure prediction programs usually employ homology modelling, using similar, known protein structures as templates. one of the most commonly used databases for such templates is the protein data bank (pdb) (reddy et al. ). output from the swiss-model program includes the template selected, the alignment between the query sequence and the template, and the predicted three-dimensional model. results of swiss-model are, however, only sent by email (figs. . b, . c, . d and . e). for a long time, bioinformatic analysis of viruses relied on common bioinformatics tools developed for other organisms. however, analysing viral genomes using general bioinformatics tools can compromise the accuracy and sensitivity of the analysis. virus genomes are often too small (e.g. < kb) for reliable codon usage statistics to be computed. to maximize their coding potential, viruses use unusual codon usage patterns and overlapping coding and non-coding functional elements. additionally, viruses rely on other translational mechanisms such as stop codon read-through, frameshifting, leaky scanning and internal ribosome entry sites. comparative genomic analysis of viruses is complicated by the fact that highly conserved sequences may not code for anything. strong conservation may instead indicate regions where cdss and/or non-coding functional elements overlap. novel virus types may also contain new cdss that differ from previously known cdss. multiple databases and tools are available for the analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses.
in this section, some of the databases and resources useful for the analysis of veterinary viruses are discussed (table . ). viruses are among the most diverse and dynamic microorganisms. with the increase in viral genome sequencing, bioinformatics tools were needed to compare and analyse the voluminous data. one downloadable software package that meets this requirement is base-by-base, which aids in the analysis of whole viral genome alignments at the single nucleotide level (brodie et al. ). moreover, with the online resource genome information broker for viruses (gib-v), comparative studies can be made using generic tools such as clustalw, blast and keyword search (hirahata et al. ). viroblast, a stand-alone blast web server, can be used for queries against multiple databases (deng et al. ). sequences from a variety of viral strains can be analysed simultaneously using the alvira software, a multiple sequence alignment tool that also provides a graphical representation (enault et al. ). furthermore, comparative analysis of coronavirus genes and genomes can be carried out using covdb (coronavirus database) (huang et al. ). the online resource viralzone is designed specifically to help users understand viral diversity and to provide information on viral molecular biology, hosts, taxonomy, epidemiology and structures (hulo et al. ). the simmonics program was upgraded to the simple sequence editor (sse) software package, in which user-supplied sequences can be aligned and annotated and then analysed for diversity and phylogeny (simmonds ). evolutionary changes in a viral genome lead to polymorphisms in its proteins, which in turn result in changes in viral phenotype such as virulence, virus-host interactions, etc.
the digital database viralorfeome not only stores all variants and mutants of viral orfs but also provides tools to design orf-specific cloning primers (pellet et al. ). further, degenerate primer pairs can be selected and matched to amplify user-defined viral genomes using the online tool prism (yu et al. ). recent advances in next-generation sequencing technologies have made it possible to study viral populations in much greater depth. viral population biodiversity and dynamics can be studied using phaccs (phage communities from contig spectrum), the first tool developed for this purpose, which analyses shotgun sequence data to estimate the structure and diversity of phage communities (angly et al. ). later, more tools and resources were developed to analyse viral metagenomic sequences, such as viral informatics resource for metagenomic exploration (virome), viral metagenome annotation pipeline (vmgap) and metavir (lorenzi et al. , roux et al. , wommack et al. ). novel viruses can be identified from a pool of specimen types using a dedicated computational pipeline, virushunter. genetic recombination in viruses is responsible for the emergence of new viruses, increased virulence and host range, immune evasion and the development of antiviral resistance. viral recombination can be detected with two bioinformatics tools, viz. jphmm (jumping profile hidden markov model) and virema (virus recombination mapper) (schultz et al. ; routh and johnson ). jphmm, a web server, can be used for predicting recombination in hiv- and hbv, whereas virema, a downloadable software, can be used to analyse next-generation sequencing data. additionally, another software called vipr hmm (viral identification with a probabilistic algorithm incorporating hidden markov model) can detect recombinant and non-recombinant viruses using microbial detection microarrays (allred et al. ).
further, viral genome sequences can be searched for degenerate locus of recombination (lox)-like sites with a web server called selox (surendranath et al. ). virapops, a downloadable software, is a forward simulator of rna virus populations (petitjean and vanet ). with this software, features of rapidly evolving rna viruses such as mutability, recombination, variation and covariation can be simulated to predict their effects on viral populations. seqmap is a tool capable of identifying viral integration sites (vis) from ligation-mediated pcr (lm-pcr), linear amplification-mediated pcr (lam-pcr) and non-restrictive lam-pcr (nrlam-pcr) reactions and of mapping short sequences to the genome (hawkins et al. ). vis can also be detected by three other tools, virusseq, viralfusionseq and virusfinder (li et al. ). for more precise vis prediction, virologists can use all four tools. mirnas: a microrna (mirna) is a small, regulatory, non-coding rna molecule that regulates the translation or stability of viral and host target mrnas, thereby affecting viral pathogenesis. this host-virus regulatory relationship can be investigated with a database called vita, which curates known viral mirna genes and known or putative host mirna target sites (hsu et al. ). vita uses miranda and targetscan to scan viral genomes and determine mirna targets. vita also annotates viruses, virus-infected tissues and the tissue specificity of host mirnas. subtypes of viruses, for example influenza viruses, and the conserved regions in various viruses can also be compared using the vita database. viral mirna candidate hairpins can be predicted using the database vir-mir. it serves as a platform to query the predicted viral mirna hairpins (based on taxonomic classification) and host target genes (based on the rnahybrid program) in human, mouse, rat, zebrafish, rice and arabidopsis (li et al. ).
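the idea of scanning a viral sequence for host mirna target sites can be sketched in a few lines of python. this is only an illustrative seed-match search, assuming the common convention that the seed is nucleotides 2 to 8 of the mirna and that a candidate site is a perfect reverse-complement match to it. it is not the scoring scheme of miranda, targetscan or rnahybrid, which also weigh pairing energy and site context; the function name and the example sequences are hypothetical.

```python
from typing import List


def seed_sites(mirna: str, target_rna: str) -> List[int]:
    """Return 0-based positions in target_rna that perfectly pair with the
    miRNA seed (nucleotides 2-8, counted from the 5' end of the miRNA)."""
    mirna = mirna.upper().replace("T", "U")
    target = target_rna.upper().replace("T", "U")
    seed = mirna[1:8]  # nucleotides 2-8
    # The target site is the reverse complement of the seed.
    site = seed.translate(str.maketrans("ACGU", "UGCA"))[::-1]
    hits = []
    pos = target.find(site)
    while pos != -1:
        hits.append(pos)
        pos = target.find(site, pos + 1)  # allow overlapping matches
    return hits


# Illustrative sequences only (not taken from any database discussed here).
example_mirna = "UAGCAGCACGUAAAUAUUGGCG"
example_target = "AAAUGCUGCUAAAA"
print(seed_sites(example_mirna, example_target))
```

a perfect seed match is a necessary but weak filter; dedicated predictors add thermodynamic and conservation criteria on top of it.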
sirna: a small interfering rna (sirna), like a mirna, operates within the rna interference (rnai) pathway. it interferes with the expression of specific genes and is therefore used in post-transcriptional gene silencing. virsirnadb is an online curated repository that stores experimentally validated data on sirna and short hairpin rna (shrna) targeting diverse genes of important human viruses, including influenza virus (tyagi et al. , thakur et al. ). the current database includes experimental information on sirna sequence, virus subtype, target gene, genbank accession, design algorithm, cell type, test object, method, efficacy, etc. sivirus is a web-based antiviral sirna design software that supports influenza virus, hiv- , hcv and sars coronavirus (naito et al. ). further, viral sirna sequence data sets can be analysed using the software packages visitor and virome (antoniewski ; watson et al. ). a perl script called paparazzi enables reconstruction of a viral genome from the viral sirnas in a given sample (vodovar et al. ). host-pathogen interactions play an important role in determining the pathogenicity of a pathogen or its evasion of host immunity. to comprehend such interactions between viral and host cellular proteins, various databases and software tools are available. one such database is phever, which enables the exploration of virus-virus and virus-host lateral gene transfers by providing evolutionary and phylogenetic information (palmeira et al. ). this database catalogues homologous families between different viral sequences and between viral and host sequences. it compiles extensive data from completely sequenced genomes ( non-redundant viral genomes, non-redundant prokaryotic genomes, eukaryotic genomes ranging from plants to vertebrates). proteins are thus compiled into homologous families, each containing at least one viral sequence, together with the related alignments and phylogenies for each family.
with the increasing availability of viral genome sequences, data mining, curation and genome annotation have become essential for understanding the structure and function of genome components. this information can further be exploited to develop diagnostics, vaccines and therapeutics. a number of tools are available for annotation and classification of viral sequences, such as ncbi genotyping tool (rozanov et al. ), vigor (viral genome orf reader) (wang et al. ), viral genome organizer (vgo) (upton et al. ), genome annotation transfer utility (gatu) (tcherepanov et al. ), virus genotyping tools (alcantara et al. ), zcurve_v (guo and zhang ) and star (subtype analyser) (myers et al. ). vgo is a web-based genome browser that allows viewing and predicting genes and orfs in one or more viral genomes. it also allows searching within viral genomes and retrieving information about a genome, such as the locations of genes, orfs, start/stop codons, etc. within a genome, the sequences can be searched for regular expressions, fuzzy motif patterns, genes with the highest at composition, etc. using vgo, comparative analyses can be made between different viral genomes. vgo uses a graphical user interface (gui) for constructing alignments and displaying orthologues in a set of genomes. it also allows searching the translated genome for matches to mass spectrometry peptides. vigor is an online gene prediction tool that was developed by the j. craig venter institute in . it started with gene prediction in small viral genomes such as coronavirus, influenza, rhinovirus and rotavirus. with the updated version in (https://www.ncbi.nlm.nih.gov/pmc/articles/pmc /), vigor is now capable of gene prediction in more viruses: measles virus, mumps virus, rubella virus, respiratory syncytial virus, alphavirus and venezuelan equine encephalitis virus, norovirus, metapneumovirus, yellow fever virus, japanese encephalitis virus, parainfluenza virus and sendai virus.
with vigor, based on sequence similarity searches, users are able to predict protein-coding regions, start and stop codons and other complex gene features such as rna editing, stop codon leakage and ribosomal shunting. further, features such as frameshifts, overlapping genes, embedded genes, etc. can be predicted in the virus genome. additionally, mature peptides can be predicted in a given polypeptide open reading frame. vigor is also capable of genotyping influenza virus and rotavirus. vigor produces four output files: a gene prediction file, a complementary dna file, an alignment file and a gene feature table file. the gene feature table can be used directly for genbank submission. genome annotation transfer utility (gatu) facilitates quick and efficient annotation of a target genome using similar reference genomes that have already been annotated. the users can later manually curate the annotated genome. the newly annotated genomes can be saved in genbank, embl or xml file format. although it does not provide a complete annotation system, gatu is a very useful tool for preliminary work in genome annotation. gatu uses the tblastn and blastn algorithms to map genes from an annotated reference genome onto the new target genome. as a result, the majority of the new genome's genes are annotated in a single step. with gatu, users can also identify open reading frames that are present in the target genome but absent from the reference genome. these orfs can be scrutinized further using other bioinformatics tools such as blast and vgo to determine whether they should be included in the annotation. multiple-exon genes and mature peptides can also be analysed using gatu. a primer design tool, primerhunter, allows the design of highly sensitive and specific primers for virus subtyping by pcr (duitama et al. ). primerhunter selects forward and reverse primers that are specific with respect to a given set of dna sequences.
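the notion of a subtype-specific primer can be made concrete with a toy python sketch: a candidate is a short sequence found in every target genome and in none of the non-target genomes. this is a deliberately simplified stand-in and not primerhunter's method; exact k-mer matching replaces any melting temperature or mismatch model, and the function names, the default primer length and the gc limits are assumptions chosen for illustration.

```python
from typing import Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_primers(targets: Iterable[str], non_targets: Iterable[str],
                      k: int = 18, gc_min: float = 0.4,
                      gc_max: float = 0.6) -> List[str]:
    """k-mers present in every target, absent from every non-target,
    and within the (hypothetical) GC-content window."""
    target_sets = [kmers(t.upper(), k) for t in targets]
    if not target_sets:
        return []
    shared = set.intersection(*target_sets)  # sensitivity: in all targets
    for seq in non_targets:
        shared -= kmers(seq.upper(), k)      # specificity: in no non-target
    return sorted(p for p in shared if gc_min <= gc_fraction(p) <= gc_max)
```

a real design tool would additionally pair forward and reverse candidates, check amplicon length and tolerate mismatches; the sketch only shows the sensitivity and specificity filtering step.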
phylotype is a web-based as well as downloadable software that uses parsimony to reconstruct ancestral traits and to select phylotypes (chevenet et al. ). rotac is an automated genotyping tool for group a rotaviruses (maes et al. ). it works by comparing a complete orf of interest with other complete orfs of cognate genes available in the genbank database by performing blast searches. viroligo is a database that acts as a repository of virus-specific oligonucleotides for virus detection (onodera and melcher ). the database comprises oligo data and common data tables. the oligo data table lists pcr primers and hybridization probes that are used for viral nucleic acid detection, while the common data table contains the pcr and hybridization experimental conditions used in their detection. each oligo data entry provides the name of the oligonucleotide, the oligonucleotide sequence, the target region, the type of usage (pcr primer, pcr probe, hybridization or other), a note and the direction of the pcr oligonucleotide (forward or reverse). each oligonucleotide entry also contains direct links to pubmed, genbank, ncbi taxonomy databases and blast. in the updated version of viroligo as of september , the database contains a complete listing of oligonucleotides specific to various animal viruses. the viruses are vaccinia virus; canine parvovirus; porcine parvovirus; rodent parvovirus; tobamovirus; potyvirus; borna virus; bovine herpesvirus types , , and ; bovine viral diarrhoea virus; bovine parainfluenza virus; bovine respiratory syncytial virus; bovine adenovirus; bovine rhinovirus; bovine coronavirus; bovine reovirus; bovine enterovirus; foot-and-mouth disease (fmd) virus; and alcelaphine herpesvirus. virus-ploc is a web server for prediction of subcellular localization of viral proteins within host and virus-infected cells (shen and chou ).
another web server developed a little later, iloc-virus, is a multi-label learning classifier that predicts the subcellular locations of viral proteins with single and multiple sites (xiao et al. ). similarly, a more recent web server, ploc-mvirus (cheng et al. ), is a predictor that identifies the subcellular localization of viral proteins with both single and multiple location sites. it works by extracting information from the gene ontology (go) database and is claimed to be more successful than the earlier state-of-the-art method, iloc-virus, in predicting the subcellular localization of viral proteins. avppred is an antiviral peptide prediction algorithm built on peptides with experimentally proven antiviral activity (thakur et al. ). the prediction is based on peptide sequence features, peptide motifs, sequence alignment, amino acid composition and physicochemical properties. vips is a viral internal ribosomal entry site (ires) prediction system that can predict ires secondary structures (hong et al. ). vips uses the rna fold program, which predicts local rna secondary structures, the rna align program, which compares predicted structures, and the pknotsrg program (reeder et al. ), which calculates pseudoknot structures. vazymolo, a database that deals with viral sequences at the protein level, defines and classifies viral protein modularity (ferron et al. ). it extracts complete genome sequences of various viruses from genbank and refseq and organizes the information on the modularity of viral orfs (fig. . f). web-based tools are also available to predict and analyse structural aspects of viruses. learncoil-vmf is a computational tool for predicting coiled-coil-like regions in viral membrane fusion proteins (singh et al. ). membrane fusion proteins are known to be diverse and share no sequence similarity between most pairs of viruses in the same or different families.
the learncoil-vmf is also capable of characterizing the core structure of these membrane fusion proteins. viperdb (virus particle explorer database) is a web-based database that enables manual curation of icosahedral virus capsid structures (carrillo-tripp et al. ). this database serves as a comprehensive resource for specific needs of structural virology and comparatives of data derived from structural and computational analyses of capsids. with the updated version, viperdb ( ), capsid protein residues in the icosahedral asymmetric unit (iau) can be deduced using phi-psi (phi-psi) diagrams (azimuthal polar orthographic projections) (ref: https://www.ncbi.nlm.nih. gov/pubmed/ ). these diagrams can be depicted as dynamic interface and surface residues and interface and core residues and can be mapped to the database using a new application programming interface (api). this aids in identifying family-wide conserved residues at the interfaces. additionally, jmol and strap are built in the system to visualize an interactive model of viral molecular structures. vida is a database that organizes animal virus genome open reading frames from partial and complete genomic sequences (alba et al. ) . presently, vida includes a complete collection of homologous protein families from genbank for herpesviridae, papillomaviridae, poxviridae, coronaviridae and arteriviridae. the homologous proteins in vida include both orthologous and paralogous sequences. vida retrieves virus sequences from genbank and the files are parsed into subfields. the parsed fields contain all the information such as genbank accession number, genbank identifier (gi numbers), protein sequence source, sequence length, gene name and gene product. in order to eliminate % redundancy, the virus protein sequences thus retrieved are filtered and a list of synonymous gis is created for reference. 
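the redundancy filter described for vida can be pictured with a small python sketch. it assumes that redundancy means identical protein sequences, which the text suggests but does not state exactly, and it keeps the first identifier seen as the representative while recording the others as synonyms. the function name and the identifiers in the usage note are hypothetical and do not reflect vida's actual implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def collapse_identical(
    records: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Collapse identical protein sequences.

    records: (identifier, protein_sequence) pairs, e.g. parsed from GenBank.
    Returns (representatives, synonyms):
      representatives maps the first identifier seen to its sequence;
      synonyms maps that identifier to the other identifiers that carry
      exactly the same sequence.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for ident, seq in records:
        # Normalise case and drop a trailing stop symbol before comparing.
        groups[seq.upper().rstrip("*")].append(ident)
    representatives = {ids[0]: seq for seq, ids in groups.items()}
    synonyms = {ids[0]: ids[1:] for ids in groups.values()}
    return representatives, synonyms
```

for example, the records `[("gi1", "MKV"), ("gi2", "mkv*"), ("gi3", "MKL")]` collapse to two representatives, `gi1` and `gi3`, with `gi2` listed as a synonym of `gi1`.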
the orfs from complete and partial virus genomes are further organized into homologous protein families, on the basis of sequence similarity. furthermore, the structure of known viral proteins or homologous to viral proteins is also mapped onto homologous protein families. vida also provides functional classification of virus proteins into broad functional classes based on typical virus processes such as dna and rna replication, virus structural proteins, nucleotide and nucleic acid metabolism, transcription, glycoproteins and others. this database also provides alignment of the conserved regions based on potential functional importance. apart from functional classification, vida also provides a taxonomical classification of the proteins and protein families. the protein families serve as a tool for functional and evolutionary studies, whereas alignments of conserved sequences provide crucial information on conserved amino acids or construction of sequence profiles. the viral bioinformatics resource center (vbrc) is one of eight nih-sponsored bioinformatics resource centers (http://www.oxfordjournals.org/nar/database/ summary/ ). it is an online platform that provides informational and analytical tools and resources to scientific community. the vbrc is oriented to conduct basic and applied research to better comprehend the viruses included on the nih/niaid list of priority pathogens. these viruses are selected based on their possibility of bioterrorism threats or as emerging or re-emerging infectious diseases. the vbrc focuses specifically on large dna viruses. it includes the viruses that belong to the arenaviridae, bunyaviridae, filoviridae, flaviviridae, paramyxoviridae, poxviridae and togaviridae families. it serves as a relational database and web application tool that allows data storage, annotation, analysis and information exchange of the data. the current version (v . ) consists of complete genomic sequences. 
using the vbrc, each viral gene and genome can be curated. the result is a comprehensive and searchable summary of the genotype and phenotype of the genes. the role of the genes in host-pathogen relationships is also emphasized in these curations. additionally, the vbrc houses multiple analytical tools, such as tools for genome annotation, comparative analysis, whole genome alignment and phylogenetic analysis. further, this database also aims to include high-throughput data derived from other studies, such as microarray gene expression data, proteomic analyses and population genetics data. the poxvirus bioinformatics resource center (pbrc, now merged into vbrc) is an online platform that serves as an informational and analytical resource to better understand the poxviridae family of viruses. it allows storage, annotation, analysis and exchange of data. influenza virus is one of the major global health concerns. it gained attention after the emergence of pandemic influenza a virus (h n , swine flu) in . there are a total of web portals and tools that focus only on influenza virus. these include the influenza virus database (ivdb), influenza research database (ird) and ncbi influenza virus resource (ncbi-ivr) (chang et al. ; bao et al. ; squires et al. ). researchers can use all three websites for their sequence databases as well as for various basic tools such as blast, multiple sequence alignment, phylogenetic tree construction, etc.
ivdb provides access to additional tools such as (i) the sequence distribution tool, which provides global geographical distribution of a given viral genotype as well as correlates its genomic data with epidemiological data, and (ii) the quality filter system, which according to their sequence content (coding sequence [cds], 'untranslated region [ 'utr] , and 'utr) and integrity (complete [c] or partial [p]) categorizes a given viral nucleotide sequence into either of the seven categories of c to c and p to p , respectively. ncbi-ivr is the most widely used and cited online resource. with ncbi-ivr, the given viral genomic sequences can be annotated using a genome annotation tool and flu annotation (flan) tool. additionally, large phylogenetic trees may be constructed and can be visualized in aggregated form with sub-scale details (bao et al. ; bao et al. ; zaslavsky et al. ) . ird provides tools for genomic and proteomic intervention, immune epitope prediction and surveillance data for viral nucleotide sequences (squires et al. ) . furthermore, this resource is also equipped with tools that provide insight into hostpathogen interactions, type of virulence, host range and a correlation of sequence variation and these processes. there are other repositories available: global initiative on sharing avian influenza data (gisaid) consortium that mediated the epiflu database and flugenome database that exclusively provides genotyping of influenza a virus and aids in detecting reassortments taking place in divergent lines (lu et al. ) . furthermore, reassortment events in influenza viruses exclusively can be identified by a program giraf (graph-incompatibility-based reassortment finder) that can be downloaded (nagarajan and kingsford ) . 
another distinct repository, influenza sequence and epitope database (ised), provides viral sequences and epitopes from asian countries; the information could be exploited to understand and study evolutionary divergence and migration of strains (yang et al. ). the web server ativs (analytical tool for influenza virus surveillance) provides an antigenic map for conducting surveillance and selection of vaccine strains by scrutinizing the serological data of haemagglutinin sequence data of influenza a/h n viruses and influenza subtypes (liao et al. ). there is another online repository openfludb (an isolate-centred inventory), where information of an isolate such as virus type, host, date of isolation, geographical distribution, predicted antiviral resistance, enhanced pathogenicity or human adaptation propensity may be obtained (liechti et al. ) . for influenza viruses, primers and probes can be designed using the influenza primer design resource (ipdr) (bose et al. ). further, prospective influenza seasonal epidemics or pandemics can be predicted using a stochastic model, flute (chao et al. ) (table . ). the ncbi virus variation resource (ncbi-vvr) is a web-based database of a set of viruses, viz. influenza virus, dengue virus, rotavirus, west nile virus, ebola virus, zika virus and mers coronavirus (resch et al. ). it enables the user to submit their viral sequences along with relevant metadata such as sample collection time, isolation source, geographic location, host, disease severity, etc. it further allows integrating and analysing the viral sequences using the generic tools such as multiple sequence alignment and phylogenetic tree construction. rotavirus a (rva) is the most frequent cause of severe diarrhoea in human and animal infants worldwide and remains as a major global threat for childhood morbidity and mortality (minakshi et al. ; basera et al. ) . 
in recent years, extensive research effort has gone into the development of live, orally administered vaccines. in india, an orally administered vaccine, rotavac, was introduced after successful clinical trials in and became available to clinicians in . these vaccines will nevertheless have to be scrutinized and updated regularly to accommodate emerging rotavirus genotype variations, which makes molecular and genetic characterization of new circulating and emerging genotypes of rotavirus strains in humans and animals necessary. recently, a classification system for rvas has been described by the rotavirus classification working group (rcwg), in which each genomic rna segment is assigned a particular letter followed by the particular genotype number. the classification system helps in explaining the importance of genetic reassortment among rvas, host range, transfer of gene segments between different genotypes and adaptation to different hosts. to differentiate between the gene segments of rvas, an online web-based tool, rotac, was developed by researchers from the rega institute, ku leuven, belgium, in (table . ). it is an easy-to-use and reliable classification tool for rvas that works in agreement with the rcwg. it is a platform-independent tool that works in any web browser at its url (http://rotac.regatools.be/) and has been released without any restriction on use by academics or anyone else. as claimed, the rotac web-based tool will be updated regularly to reflect the established as well as newly emerging genotypes announced by the rcwg from time to time. much research on animal viral diseases is being conducted at the genomic level. handling the enormous amount of data obtained from sequencing is often daunting to researchers. this chapter provides a categorized list of bioinformatics approaches that are useful in data mining.
there are tables that list these bioinformatics programs by application. the tables also list databases that organize information on human and animal viruses such as genomic data, orfs, oligonucleotides, etc. an illustration has also been provided in the chapter showing the application of the tool predictprotein, which is used for prediction of three-dimensional structures of viral proteins. the major goal of the chapter has been to provide a roadmap to bioinformatics approaches in the field of animal viral diseases. although the chapter elaborates on virus-specific bioinformatics programs, most of these programs are designed for human viruses. bioinformatics tools that are specific to animal viruses do exist, but they are limited in number. hence, in many cases, researchers have to switch to either human virus-specific tools or other generic tools. in many situations, such tools may not be as accurate for studying animal viruses or animal diseases as specialized tools would be. users should take care when choosing the settings of such tools, and the results obtained also need to be scrutinized. therefore, the development of new bioinformatics programs and tools specifically designed for animal viruses and diseases should be taken up vigorously. specialized tools will provide more accurate results and predictions, thereby accelerating bioinformatics research in the field of animal viral diseases.
acknowledgements all the authors of the manuscript thank and acknowledge their respective universities and institutes. there is no conflict of interest. key: cord- -pyb pt authors: newell-mcgloughlin, martina; re, edward title: the flowering of the age of biotechnology – date: journal: the evolution of biotechnology doi: .
/ - - - _ sha: doc_id: cord_uid: pyb pt the significance of developing genetic and physical maps of the genome, and the importance of comparing the human genome with those of other species. it also suggested a preliminary focus on improving current technology. at the request of the u.s. congress, the office of technology assessment (ota) also studied the issue and issued a document in , within days of the nrc report, that was similarly supportive. the ota report discussed, in addition to scientific issues, the social and ethical implications of a genome program, together with the problems of managing funding, negotiating policy, and coordinating research efforts. prompted by advisers at a meeting in reston, virginia, james wyngaarden, then director of the national institutes of health (nih), decided that the agency should be a major player in the hgp, effectively seizing the lead from doe. the joint effort started in may (with an "official" start in october); a -year plan detailing the goals of the u.s. human genome project was presented to members of congressional appropriations committees in mid-february. this document, co-authored by doe and nih and titled "understanding our genetic inheritance, the u.s. human genome project: the first five years," examined the then-current state of genome science. the plan also set forth the complementary approaches of the two agencies for attaining scientific goals and presented plans for administering the research agenda; it described collaboration between u.s. and international agencies and presented budget projections for the project. according to the document, "a centrally coordinated project, focused on specific objectives, is believed to be the most efficient and least expensive way" to obtain the -billion base pair map of the human genome.
in the course of the project, especially in the early years, the plan stated that "much new technology will be developed that will facilitate biomedical and a broad range of biological research, bring down the cost of many experiments (mapping and sequencing), and find applications in numerous other fields." the plan built upon the reports of the office of technology assessment and the national research council on mapping and sequencing the human genome. "in the intervening two years," the document said, "improvements in technology for almost every aspect of genomics research have taken place. as a result, more specific goals can now be set for the project." the document described objectives in the following areas: mapping and sequencing the human genome and the genomes of model organisms; data collection and distribution; ethical, legal, and social considerations; research training; technology development; and technology transfer. these goals were to be reviewed each year and updated as further advances occurred in the underlying technologies. the overall budget needs were identified as the same as those estimated by ota and nrc, namely about $ million per year for approximately years. this came to $ billion over the entire period of the project. considering that in july , the dna databases contained only seven sequences greater than . mb, this was a major leap of faith. this approach was a major departure from the single-investigator, gene-of-interest focus that research had taken hitherto, and it sparked much controversy both before and after its inception. critics questioned the usefulness of genomic sequencing; they objected to the high cost and suggested it might divert funds from other, more focused, basic research.
the prime argument supporting the latter position was that there appeared to be far fewer genes than the mass of dna could account for, which suggested that the major part of the sequencing effort would be spent on long stretches of base pairs with no known function, the so-called "junk dna." and that was in the days when the number of genes was presumed to be - , . had the estimate at that stage been closer to the later figure of - , (subsequently reduced to - , ), the task would have seemed even more foolhardy and less worthwhile to some. however, the ever-powerful incentive of new diagnostics and treatments for human disease, beyond what could be gleaned from the gene-by-gene approach, together with rapidly evolving technologies, especially automated sequencing, made it both an attractive and a plausible aim. charles cantor ( ), a principal scientist for the department of energy's genome project, contended that doe and nih were cooperating effectively to develop organizational structures and scientific priorities that would keep the project on schedule and within its budget. he noted that there would be small short-term costs to traditional biology, but that the long-term benefits would be immeasurable. genome projects were also discussed and developed in other countries, and sequencing efforts began in japan, france, italy, the united kingdom, and canada. even as the soviet union collapsed, a genome project survived as part of the russian science program. the scale of the venture and the manageable prospect of pooling data via computer made sequencing the human genome a truly international initiative. in an effort to include developing countries in the project, unesco assembled an advisory committee in to examine unesco's role in facilitating international dialogue and cooperation. a privately funded human genome organization (hugo) had been founded in to coordinate international efforts and serve as a clearinghouse for data.
in that same year the european commission (ec) introduced a proposal entitled the "predictive medicine programme." a few ec countries, notably germany and denmark, claimed the proposal lacked ethical sensitivity; objections to the possible eugenic implications of the program were especially strong in germany (dickson ). the initial proposal was dropped but later modified and adopted in as the "human genome analysis programme" (dickman and aldhous ). this program committed substantial resources to the study of ethical issues. the need for an organization to coordinate these multiple international efforts quickly became apparent. thus the human genome organization (hugo), which has been called the "u.n. for the human genome," was born in the spring of . with a founding council of scientists from seventeen countries, hugo aimed to encourage international collaboration through coordination of research, exchange of data and research techniques, training, and debates on the implications of the projects (bodmer ). in august nih began large-scale sequencing trials on four model organisms: the parasitic, cell-wall-lacking pathogenic microbe mycoplasma capricolum; the prokaryotic microbial lab rat escherichia coli; the simplest animal, caenorhabditis elegans; and the eukaryotic microbial lab rat saccharomyces cerevisiae. each research group agreed to sequence megabases (mb) at cents a base within years. a sub-living organism was actually fully sequenced: the complete sequence of the human cytomegalovirus (hcmv) genome was . mb. that year also saw the first salvo fired in the protracted debate on "ownership" of genetic information, beginning with the more tangible question of ownership of cells. and, as with the debates of the early eighties, which were to be revisited later in the nineties, the respondent was the university of california. moore v.
regents of the university of california was the first case in the united states to address the issue of who owns the rights to an individual's cells. diagnosed with leukemia, john moore had blood and bone marrow withdrawn for medical tests. suspicious of repeated requests to give samples because he had already been cured, moore discovered that his doctors had patented a cell line derived from his cells, and so he sued. the california supreme court found that moore's doctor had not obtained proper informed consent, but it also found that moore could not claim property rights over his body. the quest for the holy grail of the human genome was both inspired by the rapidly evolving technologies for mapping and sequencing and, in turn, a spur to the development of ever more efficient tools and techniques. advances in analytical tools, automation, and chemistries, as well as in computational power and algorithms, revolutionized the ability to generate and analyze immense amounts of dna sequence and genotype information. in addition to leading to the determination of the complete sequences of a variety of microorganisms and a rapidly increasing number of model organisms, these technologies have provided insights into the repertoire of genes that are required for life, their allelic diversity, and their organization in the genome. but back in many of these were still nascent technologies. the technologies required to achieve this end could be broadly divided into three categories: equipment, techniques, and computational analysis. these are not truly discrete divisions, and there was much overlap in their influence on each other. as noted, lloyd smith, michael and tim hunkapiller, and leroy hood conceived the automated sequencer, and applied biosystems inc. brought it to market in june . there is little doubt that, once applied biosystems inc. put it on the market, what had been a dream came decidedly closer to an achievable reality.
in automating sanger's chain-termination sequencing system, hood modified both the chemistry and the data-gathering processes. in the sequencing reaction itself, he replaced radioactive labels, which were unstable, posed a health hazard, and required separate gel lanes for each of the four bases. hood developed chemistry that used fluorescent dyes of a different color for each of the four dna bases. this system of "color-coding" eliminated the need to run the four reactions in separate lanes. the fluorescent labels also addressed one of the major concerns of sequencing: data gathering. hood integrated laser and computer technology, eliminating the tedious process of gathering information by hand. as the fragments of dna passed a laser beam on their way through the gel, the fluorescent labels were stimulated to emit light. the emitted light was transmitted by a lens, and the intensity and spectral characteristics of the fluorescence were measured by a photomultiplier tube and converted to a digital format that could be read directly into a computer. during the next thirteen years, the machine was constantly improved, and by a fully automated instrument could sequence up to , , base pairs per year. in three groups came up with a variation on this approach. they developed what is termed capillary electrophoresis; one team was led by lloyd smith (luckey, ), the second by barry karger, and the third by norman dovichi. in molecular dynamics introduced the megabace, a capillary sequencing machine. not to be outdone, the following year, in , the original of the species came up with the abi prism sequencing machine, also a capillary-based machine, designed to run about eight sets of sequence reactions per day. on the biology side, one of the biggest challenges was the construction of a physical map, compiled from many diverse sources and approaches in such a way as to ensure continuity of physical mapping data over long stretches of dna.
the development of dna sequence-tagged sites (stss) to correlate diverse types of dna clones aided this standardization of the mapping component by providing mappers with a common language and a system of landmarks for all the libraries from such varied sources as cosmids, yeast artificial chromosomes (yacs), and other recombinant dna clones. in this way each mapped element (individual clone, contig, or sequenced region) would be defined by a unique sts. a crude map of the entire genome, showing the order and spacing of stss, could then be constructed. determining the order and spacing of the unique identifier sequences that compose an sts map was made possible by the development of mullis' polymerase chain reaction (pcr), which allows rapid production of multiple copies of a specific dna fragment, for example, an sts fragment. sequence information generated in this way could be recalled easily and, once reported to a database, would be available to other investigators. with the sts sequence stored electronically, there would be no need to obtain a probe or any other reagents from the original investigator. no longer would it be necessary to exchange and store hundreds of thousands of clones for full-scale sequencing of the human genome, a significant saving of money, effort, and time. by providing a common language and landmarks for mapping, stss allowed genetic and physical maps to be cross-referenced. refining this technique to go after actual genes, sydney brenner proposed sequencing human cdnas to provide rapid access to the genes, stating that 'one obvious way of finding at least a large part of the important [fraction] of the human genome is to look at the sequences of the messenger rna's of expressed genes' (brenner, ). the following year, the man who was to play a pivotal role on the world stage that the human genome project became suggested a way to implement brenner's approach. that player, nih biologist j.
craig venter announced a strategy to find expressed genes using expressed sequence tags (ests) (adams, ). each est represents a unique stretch of dna within a coding region of a gene, which, as brenner suggested, would be useful for identifying full-length genes and as a landmark for mapping. using this approach, projects were begun to mark gene sites on chromosome maps as sites of mrna expression. to help with this, a more efficient method of handling large chunks of sequence was needed, and two approaches were developed. yeast artificial chromosomes, which were developed by david burke, maynard olson, and george carle, increased insert size -fold (david t. burke et al., ). caltech's second major contribution to the genome project was developed by melvin simon and hiroaki shizuya. their approach to handling large dna segments was to develop "bacterial artificial chromosomes" (bacs), which basically allow bacteria to replicate chunks greater than , base pairs in length. the efficient production of more stable, large-insert bacs made them an even more attractive option, as they had greater flexibility than yacs. in , in a collaboration that presaged the snp consortium, washington university, st. louis, mo, was funded by the pharmaceutical company merck and the national cancer institute to provide sequence from those ests. more than half a million ests were submitted during the project (murr l et al., ). on the analysis side, the major challenge was to manage and mine the vast amount of dna sequence data being generated. a rate-limiting step was the need to develop semi-intelligent algorithms to achieve this herculean task. this is where the discipline of bioinformatics came into play. it had been evolving as a discipline since margaret oakley dayhoff used her knowledge of chemistry, mathematics, biology, and computer science to develop this entirely new field in the early sixties.
she is in fact credited today as a founder of the field of bioinformatics, in which biology, computer science, and information technology merge into a single discipline. the ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. there are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access to and management of different types of information. paralleling the rapid and very public ascent of recombinant dna technology during the previous two decades, the analytic and management tools of the discipline that was to become bioinformatics evolved at a more subdued but equally impressive pace. some of the key developments included the needleman-wunsch algorithm for sequence comparison, which appeared as early as , even before recombinant dna technology had been demonstrated; the smith-waterman algorithm for sequence alignment ( ); the fastp algorithm ( ); the fasta algorithm for sequence comparison by pearson and lipman in ; and perl (practical extraction and report language), released by larry wall in . on the data management side, several databases with ever more effective storage and mining capabilities were developed over the same period. the first bioinformatic/biological databases were constructed a few years after the first protein sequences began to become available. the first protein sequence reported was that of bovine insulin in , consisting of residues. nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine trna with bases.
just one year later, dayhoff gathered all the available sequence data to create the first bioinformatic database. one of the first dedicated databases was the brookhaven protein databank, whose collection consisted of ten x-ray crystallographic protein structures (acta. cryst. b, ). the year saw the creation of the genetics computer group (gcg) as a part of the university of wisconsin biotechnology center. the group's primary and much-used product was the wisconsin suite of molecular biology tools. it was spun off as a private company in . the swiss-prot database made its debut in in europe at the department of medical biochemistry of the university of geneva and the european molecular biology laboratory (embl). the first dedicated "bioinformatics" company, intelligenetics, inc., was founded in california in . its primary product was the intelligenetics suite of programs for dna and protein sequence analysis. the first unified federal effort, the national center for biotechnology information (ncbi), was created at nih/nlm in , and it was to play a crucial part in coordinating public databases, developing software tools for analyzing genome data, and disseminating information. on the other side of the atlantic, oxford molecular group, ltd. (omg) was founded in oxford, uk, by anthony marchington, david ricketts, james hiddleston, anthony rees, and w. graham richards. their primary focus was on rational drug design, and their products, such as anaconda, asp, and chameleon, reflected this, being applied in molecular modeling and protein design engineering. within two years ncbi was making its mark: david lipman, eugene myers, and colleagues at the ncbi published the basic local alignment search tool (blast) algorithm for aligning sequences (altschul et al., ). it is used to compare a novel sequence with those contained in nucleotide and protein databases by aligning the novel sequence with previously characterized genes.
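the idea of aligning a novel sequence against a characterized one can be illustrated with a short python sketch of smith-waterman local alignment scoring, the exact dynamic-programming method named among the key developments above. this is not the blast algorithm itself, which relies on faster heuristics; the function name `smith_waterman_score` and the scoring values (match 2, mismatch -1, gap -2) are illustrative assumptions, not values taken from the text.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    # prev is the previous row of the dynamic-programming matrix;
    # only two rows are kept because just the score is needed.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally (match or mismatch) ...
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # ... or open a gap in one of the two sequences.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Flooring at 0 lets a new local alignment start anywhere.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

in the example call, only the shared `ACG` stretch contributes to the score and the dissimilar flanking bases are ignored, which is the sense in which the detected similarity is local rather than spanning the whole of both sequences.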
the emphasis of blast is on finding regions of sequence similarity, which yield functional and evolutionary clues about the structure and function of the novel sequence. regions of similarity detected via this type of alignment tool can be either local, where the similarity is confined to one region of the sequences, or global, where the similarity extends across the full length of the sequences being compared. the fundamental unit of blast algorithm output is the high-scoring segment pair (hsp). an hsp consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cutoff score. this system has been refined and modified over the years, the two principal variants presently in use being ncbi blast and wu-blast (wu signifying washington university). the same year that blast was launched, two other bioinformatics companies were launched. one was informax in bethesda, md, whose products addressed sequence analysis, database and data management, searching, publication graphics, clone construction, mapping, and primer design. the second, molecular applications group in california (michael levitt and chris lee), was to play a bigger part on the proteomics end. its primary products were look and segmod, which are used for molecular modeling and protein design. the following year, in , the human chromosome mapping data repository, the genome data base (gdb), was established. on a more global level, the development of computational capabilities in general and the internet in particular was also to play a considerable part in the sharing of data and access to databases that made the rapid forward momentum of the hgp possible. also in , edward uberbacher of oak ridge national laboratory in tennessee developed grail, the first of many gene-finding programs. in the first two "genomics" companies made their appearance.
incyte pharmaceuticals, a genomics company headquartered in palo alto, california, was formed, and myriad genetics, inc. was founded in utah. incyte's stated goal was to lead in the discovery of major common human disease genes and their related pathways. the company discovered and sequenced, with its academic collaborators (originally synteni from pat brown's lab at stanford), a number of important genes, including brca and brca (with mary-claire king, epidemiologist at uc-berkeley), the genes linked to breast cancer in families with a high degree of incidence before age . by a low-resolution genetic linkage map of the entire human genome had been published, and u.s. and french teams had completed genetic maps of both mouse and man: the mouse map, with an average marker spacing of . cm, was determined by eric lander and colleagues at whitehead, and the human map, with an average marker spacing of cm, by jean weissenbach and colleagues at ceph (centre d'etude du polymorphisme humain). the latter institute was the subject of a rather scathing book by paul rabinow ( ) based on what it did with this genome map. in , an american biotechnology company, millennium pharmaceuticals, and the ceph developed plans for a collaborative effort to discover diabetes genes. the results of this collaboration could have been medically significant and financially lucrative. the two parties had agreed that ceph would supply millennium with germplasm collected from a large coterie of french families, and that millennium would supply funding and expertise in new technologies to accelerate the identification of the genes, terms to which the french government had agreed. but in early , just as the collaboration was to begin, the french government cried halt! the government explained that the ceph could not be permitted to give the americans that most precious of substances, for which there was no precedent in law: french dna.
rabinow's book discusses the tangled relations and conceptions involved, such as whether a country can be said to have its own genetic material, in the first but hardly the last franco-american disavowal of détente (paul rabinow, ). the latest facilities, such as the joint genome institute (jgi), walnut creek, ca, are now able to sequence up to mb per day, which makes it possible to sequence whole microbial genomes within a day. technologies currently under development will probably increase this capacity yet further through massively parallel sequencing and/or microfluidic processing, making it possible to sequence multiple genotypes from several species. nineteen ninety-two saw one of the first shake-ups in the progress of the hgp. that was the year that the first major outsider entered the race, when britain's wellcome trust plunked down $ million to join the hgp. this caused a mere ripple, while the principal shake-ups occurred stateside. much of the debate, and subsequently the direction, all the way through the hgp process was shaped by the personalities involved. as noted, the application of one of the innovative techniques, namely ests, to do an end run on patenting introduced one of those major players to the fray: craig venter. venter, the high school dropout who reached the age of majority in the killing fields of vietnam, was to play a pivotal role in a more "civilized" but no less combative field of human endeavor. he came onto the world stage through his initial work on ests while at the national institute of neurological disorders and stroke (ninds) from to . he noted in an interview with the scientist magazine in that there was a degree of ambivalence at ninds about his venturing into the field of genomics: while they liked the prestige of hosting one of the leaders and innovators in this newly emerging field, they were concerned about him moving outside the ninds purview of the human brain and nervous system.
ultimately, while he professed to like the security and service infrastructure this institute afforded him, that same system became too restrictive for his interests and talent. he wanted the whole canvas of human gene expression to be his universe, not just what was confined to the central nervous system. he was becoming more interested in taking a whole-genome approach to understanding the overall structure of genomes and genome evolution, which was much broader than the mission of ninds. he noted, with some irony, in later years that the then-current nih director harold varmus had wished in hindsight that nih had pushed to do a similar database in the public domain; clearly, in venter's opinion, varmus was in need of a refresher course in history! bernadine healy, nih director in , was one of the few in a leadership role who saw the technical and fiscal promise of venter's work, and, as for any good administrator, it also presented her with an opportunity to resolve a thorny "personnel" issue. she appointed him head of the ad hoc committee to have an intramural genome program at nih, to give the head of the hgp (that other larger-than-life personality, jim watson) notice that he was not the sole arbiter of the direction of the human genome project. however, venter very soon established himself as an equally non-conformist character, with the tacit consent of his erstwhile benefactor. he initially assumed the mantle of a non-conformist through guilt by association rather than direct action, when it was revealed that nih was filing patent applications on thousands of these partial genes based on his ests, catalyzing the first hgp fight at a congressional hearing. nih's move was widely criticized by the scientific community because, at the time, the function of the genes associated with the partial sequences was unknown. critics charged that patent protection for the gene segments would forestall future research on them.
the patent office eventually rejected the patents, but the applications sparked an international controversy over patenting genes whose functions were still unknown. interestingly enough, despite nih's reliance on the est/cdna technique, venter, who was now clearly venturing outside the ninds-mandated rubric, could not obtain government funding to expand his research, prompting him to leave nih in . he moved on to become president and director of the institute for genomic research (tigr), a nonprofit research center based in gaithersburg, md. at the same time william haseltine formed a sister company, human genome sciences (hgs), to commercialize tigr products. venter continued est work at tigr, but also began thinking about sequencing entire genomes. again, he came up with a faster method: whole genome shotgun sequencing. he applied for an nih grant to use the method on haemophilus influenzae, but started the project before the funding decision was returned. when the genome was nearly complete, nih rejected his proposal, saying the method would not work. in a triumphal flurry in late may, and with a metaphorical nose-thumbing at his recently rejected "unworkable" grant, venter announced that tigr and collaborators had fully sequenced the genome of the first free-living organism, haemophilus influenzae. in november , controversy surrounding venter's research escalated. access restrictions associated with a cdna database developed by tigr and its rockville, md.-based biotech associate, human genome sciences (hgs) inc., including hgs's right to preview papers on resulting discoveries and to take first options to license products, prompted merck and co. inc. to fund a rival database project. in that year also britain "officially" entered the hgp race when the wellcome trust plunked down $ million (as mentioned earlier). the following year hgs was involved in yet another patenting debacle, forced by the rapid march of technology into uncharted patent law territory.
on june , hgs applied for a patent on a gene that produces a "receptor" protein that was later called ccr . at that time hgs had no idea that ccr was an hiv receptor. in december , u.s. researcher robert gallo, the co-discoverer of hiv, and colleagues found three chemicals that inhibit the aids virus, but they did not know how the chemicals worked. in february , edward berger at the nih discovered that gallo's inhibitors work in late-stage aids by blocking a receptor on the surface of t-cells. in june of that year, in a period of just days, five groups of scientists published papers saying ccr is the receptor for virtually all strains of hiv. in january , schering-plough researchers told a san francisco aids conference that they had discovered new inhibitors. they knew that merck researchers had made similar discoveries. as a significant valentine, the u.s. patent and trademark office (uspto) granted hgs a patent on the gene that makes ccr and on techniques for producing ccr artificially. the decision sent hgs stock flying and dismayed researchers. it also caused the uspto to revise its definition of a "patentable" drug target. in the meantime haseltine's partner in rewriting patenting history, venter, turned his focus to the human genome. he left tigr and started the for-profit company celera, a division of pe biosystems, the company that at times, thanks to hood and hunkapiller, led the world in the production of sequencing machines. using these machines, and the world's largest civilian supercomputer, venter finished assembling the human genome in just three years. following the debacle with the then nih director bernadine healy over patenting the partial genes that resulted from est analysis, another major personality-driven event occurred in that same year. watson strongly opposed the idea of patenting gene fragments, fearing that it would discourage research, and commented that "the automated sequencing machines 'could be run by monkeys.'
" (nature june , ) with this dismissal watson resigned his nih nchgr post in to devote his full-time effort to directing cold spring harbor laboratory. his replacement was of a rather more pragmatic, less flamboyant nature. while venter maybe was described as an idiosyncratic shogun of the shotgun, francis collins was once described as the king arthur of the holy grail that is the human genome project. collins became the director of the national human genome research institute in . he was considered the right man for the job following his success (along with lap-chee tsui) in identifying the gene for the cystic fibrosis transmembrane (cftr) chloride channel receptor that, when mutated, can lead to the onset of cystic fibrosis. although now indelibly connected with the topic non-plus tout in biology, like many great innovators in this field before him, francis collins had little interest in biology as he grew up on a farm in the shenandoah valley of virginia. from his childhood he seemed destined to be at the center of drama, his father was professor of dramatic arts at mary baldwin college and the early stage management of career was performed on a stage he built on the farm. while the physical and mathematical sciences held appeal for him, being possessed of a highly logical mind, collins found the format in which biology was taught in the high school of his day mind-numbingly boring, filled with dissections and rote memorization. he found the contemplation of the infinite outcomes of dividing by zero (done deliberately rather than by accident as in einstein's case) far more appealing than contemplating the innards of a frog. that biology could be gloriously logical only became clear to collins when, in , he entered yale with a degree in chemistry from the university of virginia and was first exposed to the nascent field of molecular biology. 
anecdotally it was the tome what is life?, penned by the theoretical physicist father of molecular biology, erwin schrödinger, while exiled in trinity college dublin in , that was the catalyst for his conversion. like schrödinger he wanted to do something more obviously meaningful (for less than hardcore physicists at least!) than theoretical physics, so he went to medical school at unc-chapel hill after completing his chemistry doctorate at yale, and returned to the site of his road to damascus for post-doctoral study in the application of his newfound interest in human genetics. during this sojourn at yale, collins began working on developing novel tools to search the genome for genes that cause human disease. he continued this work, which he dubbed "positional cloning," after moving to the university of michigan as a professor in . he placed himself on the genetic map when he succeeded in using this method to put the gene that causes cystic fibrosis on the physical map. while a less colorful, less in-your-face character than venter, he has his own personality quirks; for example, he pastes a new sticker onto the back of his motorcycle helmet every time he finds a new disease gene. one imagines that particular piece of real estate is getting rather crowded. interestingly, it was not these four-hundred-pound us gorillas who proposed the eventually prescient timeline for a working draft but two from the old power base. in meetings in the us in , john sulston and bob waterston proposed to produce a 'draft' sequence of the human genome by , a full five years ahead of schedule. while agreed by most to be feasible, it meant a rethinking of strategy and involved focusing resources on larger centers and emphasizing sequence acquisition. just as important, it asserted the value of draft-quality sequence to biomedical research. discussion started with the british-based wellcome trust as possible sponsors (marshall e. ).
by a rough draft of the human genome map was produced showing the locations of more than , genes. the map was produced using yeast artificial chromosomes and some chromosomes -notably the littlest -were mapped in finer detail. these maps marked an important step toward clone-based sequencing. the importance was illustrated in the devotion of an entire edition of the journal nature to the subject. (nature : - ) the duel between the public and private face of the hgp progressed at a pace over the next five years. following release of the mapping data some level of international agreement was decided on sequence data release and databases. they agreed on the release of sequence data, specifically, that primary genomic sequence should be in the public domain to encourage research and development to maximize its benefit to society. also that it be rapidly released on a daily basis with assemblies of greater than kb and that the finished annotated sequence should be submitted immediately to the public databases. in an international consortium completed the sequence of the genome of the workhorse yeast saccharomyces cerevisiae. data had been released as the individual chromosomes were completed. the saccharomyces genome database (sgd) was created to curate this information. the project collects information and maintains a database of the molecular biology of s. cerevisiae. this database includes a variety of genomic and biological information and is maintained and updated by sgd curators. the sgd also maintains the s. cerevisiae gene name registry, a complete list of all gene names used in s. cerevisiae. in a new more powerful diagnostic tool termed snps (single nucleotide polymorphisms) was developed. snps are changes in single letters in our dna code that can act as markers in the dna landscape. some snps are associated closely with susceptibility to genetic disease, our response to drugs or our ability to remove toxins. 
the snp consortium although designated a limited company is a nonprofit foundation organized for the purpose of providing public genomic data. it is a collaborative effort between pharmaceutical companies and the wellcome trust with the idea of making available widely accepted, high-quality, extensive, and publicly accessible snp map. its mission was to develop up to , snps distributed evenly throughout the human genome and to make the information related to these snps available to the public without intellectual property restrictions. the project started in april and was anticipated to continue until the end of . in the end, many more snps, about . million total, were discovered than was originally planned. by the complete genome sequence of mycobacterium tuberculosis was published by teams from the uk, france, us and denmark in june . the abi prism sequencing machine, a capillary-based machine designed to run about eight sets of sequence reactions per day also reached the market that year. that same year the genome sequence of the first multicellular organism, c. elegans was completed. c. elegans has a genome of about mb and, as noted, is a primitive animal model organism used in a range of biological disciplines. by november the human genome draft sequence reached mb and the first complete human chromosome was sequenced -this first was reached on the east side of the atlantic by the hgp team led by the sanger centre, producing a finished sequence for chromosome , which is about million base-pairs and includes at least genes. according to anecdotal evidence when visiting his namesake centre, sanger asked: "what does this machine do then?" "dideoxy sequencing" came the reply, to which fred retorted: "haven't they come up with anything better yet?" as will be elaborated in the final chapter the real highlight of was production of a 'working draft' sequence of the human genome, which was announced simultaneously in the us and the uk. 
in a joint event, celera genomics announced completion of their 'first assembly' of the genome. in a remarkable special issue, nature included a -page article by the human genome project partners, studies of mapping and variation, as well as analysis of the sequence by experts in different areas of biology. science published the article by celera on their assembly of hgp and celera data as well as analyses of the use of the sequence. however to demonstrate the sensitivity of the market place to presidential utterances the joint appearances by bill clinton and tony blair touting this major milestone turned into a major cold shower when clinton's reassurance of access of the people to their genetic information caused a precipitous drop in celera's share value overnight. clinton's assurance that, "the effort to decipher the human genome will be the scientific breakthrough of the century -perhaps of all time. we have a profound responsibility to ensure that the life-saving benefits of any cutting-edge research are available to all human beings." (president bill clinton, wednesday, march , ) stands in sharp contrast to the statement from venter's colleague that " any company that wants to be in the business of using genes, proteins, or antibodies as drugs has a very high probability of running afoul of our patents. from a commercial point of view, they are severely constrained -and far more than they realize." (william a. haseltine, chairman and ceo, human genome sciences). the huge sell-off in stocks ended weeks of biotech buying in which those same stocks soared to unprecedented highs. by the next day, however, the genomic company spin doctors began to recover ground in a brilliant move which turned the clinton announcement into a public relations coup. all major genomics companies issued press releases applauding president clinton's announcement. 
the real news, they argued, was that "for the first time a president strongly affirmed the importance of gene based patents." and the same bill haseltine of human genome sciences positively gushed as he happily pointed out that he "could begin his next annual report with the [president's] monumental statement, and quote today as a monumental day." as distinguished harvard biologist richard lewontin notes: "no prominent molecular biologist of my acquaintance is without a financial stake in the biotechnology business. as a result, serious conflicts of interest have emerged in universities and in government service" (lewontin, ). away from the spin doctors, perhaps eric lander best summed up the herculean effort when he opined that for him "the human genome project has been the ultimate fulfilment: the chance to share common purpose with hundreds of wonderful colleagues towards a goal larger than ourselves. in the long run, the human genome project's greatest impact might not be the three billion nucleotides of the human chromosomes, but its model of scientific community." (ridley, ).

gene therapy

the year also marked the passing of another milestone that was intimately connected to one of the fundamental drivers of the hgp. the california hereditary disorders act came into force and with it one of the potential solutions for human hereditary disorders. w. french anderson in the usa reported the first successful application of gene therapy in humans. the first successful gene therapy for a human disease was achieved for severe combined immune deficiency (scid) by introducing the missing gene, adenosine deaminase (ada), into the peripheral lymphocytes of a -year-old girl and returning the modified lymphocytes to her.
although the results are difficult to interpret because of the concurrent use of polyethylene glycol-conjugated ada commonly referred to as pegylated ada (pgla) in all patients, strong evidence for in vivo efficacy was demonstrated. ada-modified t cells persisted in vivo for up to three years and were associated with increases in t-cell number and ada enzyme levels, t cells derived from transduced pgla were progressively replaced by marrow-derived t cells, confirming successful gene transfer into long-lived progenitor cells. ashanthi desilva, the girl who received the first credible gene therapy, continues to do well more than a decade later. cynthia cutshall, the second child to receive gene therapy for the same disorder as desilva, also continues to do well. within years (by january ), more than gene therapy protocols had been approved in the us and worldwide, researchers launched more than clinical trials to test gene therapy against a wide array of illnesses. surprisingly, a disease not typically heading the charts of heritable disorders, cancer has dominated the research. in cancer patients were treated with the tumor necrosis factor gene, a natural tumor fighting protein which worked to a limited extent. even more surprisingly, after the initial flurry of success little has worked. gene therapy, the promising miracle of failed to deliver on its early promise over the decade. apart from those examples, there are many diseases whose molecular pathology is, or soon will be, well understood, but for which no satisfactory treatments have yet been developed. at the beginning of the nineties it appeared that gene therapy did offer new opportunities to treat these disorders both by restoring gene functions that have been lost through mutation and by introducing genes that can inhibit the replication of infectious agents, render cells resistant to cytotoxic drugs, or cause the elimination of aberrant cells. 
from this "genomic" viewpoint genes could be said to be viewed as medicines, and their development as therapeutics should embrace the issues facing the development of small-molecule and protein therapeutics such as bioavailability, specificity, toxicity, potency, and the ability to be manufactured at large scale in a cost-effective manner. of course for such a radical approach certain basal level criteria needed to be established for selecting disease candidates for human gene therapy. these include, such factors as the disease is an incurable, life-threatening disease; organ, tissue, and cell types affected by the disease have been identified; the normal counterpart of the defective gene has been isolated and cloned; either the normal gene can be introduced into a substantial subfraction of the cells from the affected tissue, or the introduction of the gene into the available target tissue, such as bone marrow, will somehow alter the disease process in the tissue affected by the disease; the gene can be expressed adequately (it will direct the production of enough normal protein to make a difference); and techniques are available to verify the safety of the procedure. an ideal gene therapeutic should, therefore, be stably formulated at room temperature and amenable to administration either as an injectable or aerosol or by oral delivery in liquid or capsule form. the therapeutic should also be suitable for repeat therapy, and when delivered, it should neither generate an immune response nor be destroyed by tissue-scavenging mechanisms. when delivered to the target cell, the therapeutic gene should then be transported to the nucleus, where it should be maintained as a stable plasmid or chromosomal integrant, and be expressed in a predictable, controlled fashion at the desired potency in a cell-specific or tissue-specific manner. 
in addition to the ada gene transfer in children with severe combined immunodeficiency syndrome, a gene-marking study of epstein-barr virus-specific cytotoxic t cells and trials of gene-modified t cells expressing suicide or viral resistance genes in patients infected with hiv were carried out in the early nineties. additional strategies for t-cell gene therapy, which were pursued later in the decade, involve the engineering of novel t-cell receptors that impart antigen specificity for virally infected or malignant cells. issues that are still not resolved include nuclear transport, integration, regulated gene expression and immune surveillance. this knowledge, when finally understood and applied to the design of delivery vehicles of either viral or non-viral origin, will assist in the realization of gene therapeutics as safe and beneficial medicines that are suited to the routine management of human health. scientists are also working on using gene therapy to generate antibodies directly inside cells to block the production of harmful viruses such as hiv or even cancer-inducing proteins. there is a specific connection with francis collins, as his motivation for pursuing the hgp was his pursuit of defective genes, beginning with the cystic fibrosis gene. this gene, called the cf transmembrane conductance regulator, codes for an ion channel protein that regulates salts in the lung tissue. the faulty gene prevents cells from excreting salt properly, causing a thick sticky mucus to build up and destroy lung tissue. scientists have spliced copies of the normal genes into disabled adenoviruses that target lung tissues and have used bronchoscopes to deliver them to the lungs. the procedure worked well in animal studies; however, clinical trials in humans were not an unmitigated success. because the cells lining the lungs are continuously being replaced, the effect is not permanent and the treatment must be repeated.
studies are underway to develop gene therapy techniques to replace other faulty genes: for example, to replace the genes responsible for factor viii and factor ix production, whose malfunctioning causes hemophilia a and b respectively, and to alleviate the effects of the faulty gene in dopamine production that results in parkinson's disease. apart from technical challenges, such a radical therapy also engenders ethical debate. many persons who voice concerns about somatic-cell gene therapy use a "slippery slope" argument: it sounds good in theory, but where does one draw the line? there are many issues yet to be resolved in this thorny field of ethics, including "good" and "bad" uses of gene modification and the difficulty of following patients in long-term clinical research. many gene therapy candidates are children who are too young to understand the ramifications of this treatment. there are also conflicts of interest, which pit individuals' reproductive liberties and privacy interests against the interests of insurance companies or society. one approach that is unlikely ever to gain acceptance is germline therapy, the removal of deleterious genes from the population. issues of justice and resource allocation have also been raised: in a time of strain on our health care system, can we afford such expensive therapy? who should receive gene therapy? if it is made available only to those who can afford it, then a number of civil rights groups claim that the distribution of desirable biological traits among different socioeconomic and ethnic groups would become badly skewed, adding a new and disturbing layer of discriminatory behavior. indeed, a major setback occurred before the end of the decade in .
jesse gelsinger was the first person to die from gene therapy, on september , , and his death created another unprecedented situation when his family sued not only the research team involved in the experiment (u penn), the company genovo inc., but also the ethicist who offered moral advice on the controversial project. this inclusion of the ethicist as a defendant alongside the scientists and school was a surprising legal move that puts this specialty on notice, as will no doubt be the case with other evolving technologies such as stem cells and therapeutic cloning, that its members could be vulnerable to litigation over the philosophical guidance they provide to researchers. the penn group principal investigator james wilson approached ethicist arthur caplan about their plans to test the safety of a genetically engineered virus on babies with a deadly form of the liver disorder, ornithine transcarbamylase deficiency. the disorder allows poisonous levels of ammonia to build up in the blood system. caplan steered the researchers away from sick infants, arguing that desperate parents could not provide true informed consent. he said it would be better to experiment on adults with a less lethal form of the disease who were relatively healthy. gelsinger fell into that category. although he had suffered serious bouts of ammonia buildup, he was doing well on a special drug and diet regimen. the decision to use relatively healthy adults was controversial because risky, unproven experimental protocols generally use very ill people who have exhausted more traditional treatments, so have little to lose. in this case, the virus used to deliver the genes was known to cause liver damage, so some scientists were concerned it might trigger an ammonia crisis in the adults. 
wilson underestimated the risk of the experiment, omitted the disclosure about possible liver damage in earlier volunteers in the experiment and failed to mention the deaths of monkeys given a similar treatment during pre-clinical studies. a food and drug administration investigation after gelsinger's death found numerous regulatory violations by wilson's team, including the failure to stop the experiment and inform the fda after four successive volunteers suffered serious liver damage prior to the teen's treatment. in addition, the fda said gelsinger did not qualify for the experiment, because his blood ammonia levels were too high just before he underwent the infusion of genetic material. the fda suspended all human gene experiments by wilson and the university of penn subsequently restricting him solely to animal studies. a follow-up fda investigation subsequently alleged he improperly tested the experimental treatment on animals. financial conflicts of interest also surrounded james wilson, who stood to personally profit from the experiment through genovo his biotechnology company. the lawsuit was settled out of court for undisclosed terms in november . the fda also suspended gene therapy trials at st. elizabeth's medical center in boston, a major teaching affiliate of tufts university school of medicine, which sought to use gene therapy to reverse heart disease, because scientists there failed to follow protocols and may have contributed to at least one patient death. in addition, the fda temporarily suspended two liver cancer studies sponsored by the schering-plough corporation because of technical similarities to the university of pennsylvania study. some research groups voluntarily suspended gene therapy studies, including two experiments sponsored by the cystic fibrosis foundation and studies at beth israel deaconess medical center in boston aimed at hemophilia. the scientists paused to make sure they learned from the mistakes. 
the nineties also saw the development of another "high-throughput" breakthrough, a derivative of the other high-tech revolution, namely dna chips. in biochips were developed for commercial use under the guidance of affymetrix. dna chips or microarrays represent a "massively parallel" genomic technology. they facilitate high-throughput analysis of thousands of genes simultaneously, and are thus potentially very powerful tools for gaining insight into the complexities of higher organisms, including analysis of gene expression, detecting genetic variation, making new gene discoveries, fingerprinting strains and developing new diagnostic tools. these technologies permit scientists to conduct large-scale surveys of gene expression in organisms, thus adding to our knowledge of how they develop over time or respond to various environmental stimuli. these techniques are especially useful in gaining an integrated view of how multiple genes are expressed in a coordinated manner. these dna chips have broad commercial applications and are now used in many areas of basic and clinical research, including the detection of drug resistance mutations in infectious organisms, direct dna sequence comparison of large segments of the human genome, the monitoring of multiple human genes for disease-associated mutations, the quantitative and parallel measurement of mrna expression for thousands of human genes, and the physical and genetic mapping of genomes. however, the initial technologies, or more accurately the algorithms used to extract information, were far from robust and reproducible. the erstwhile serial entrepreneur al zaffaroni (the rebel who in founded alza when syntex ignored his interest in developing new ways to deliver drugs) founded yet another company, affymetrix, under the stewardship of stephen fodor, which was subject to much abuse for providing only final extracted data and not allowing access to raw data.
as with other personalities of this high-throughput era, seattle-bred steve fodor was also somewhat of a polymath, having contributed to two major technologies, microarrays and combinatorial chemistry; the former has delivered on its promise while the latter, like gene therapy, is still in a somewhat extended gestation. and despite the limitations of being an industrial scientist he has had a rather prolific portfolio of publications. his seminal manuscripts describing this work have been published in all the journals of note, science, nature and pnas, and he was recognized in by the aaas with the newcomb-cleveland award for an outstanding paper published in science. fodor began his industrial career in yet another zaffaroni firm. in he was recruited to the affymax research institute in palo alto, where he spearheaded the effort to develop high-density arrays of biological compounds. his initial interest was in the broad area of what came to be called combinatorial chemistry. of the techniques developed, one approach permitted high-resolution chemical synthesis in a light-directed, spatially defined format. in the days before positive selection vectors, a researcher might have screened thousands of clones by hand with an oligonucleotide probe just to find one elusive insert. fodor's (and his successors') dna array technology reverses that approach. instead of screening an array of unknowns with a defined probe (a cloned gene, pcr product, or synthetic oligonucleotide), each position or "probe cell" in the array is occupied by a defined dna fragment, and the array is probed with the unknown sample. fodor used his chemistry and biophysics background to develop very dense arrays of these biomolecules by combining photolithographic methods with traditional chemical techniques. the typical array may contain all possible combinations of all possible oligonucleotides ( -mers, for example) that occur as a "window" which is tracked along a dna sequence.
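the idea of a fixed-width "window" tracked along a dna sequence, with one probe cell per possible oligonucleotide, can be illustrated with a short python sketch. this is only a toy model under stated assumptions: the function names (`tiling_probes`, `all_possible_oligos`, `array_layout`) are hypothetical, the probe length `k` is left as a parameter because the source does not preserve the actual length, and nothing here reflects how affymetrix designs or synthesizes its chips.

```python
from itertools import product


def tiling_probes(sequence, k):
    """Return every k-mer seen as a window of width k slides along the sequence."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def all_possible_oligos(k, alphabet="ACGT"):
    """Enumerate all 4**k oligonucleotides of length k (one per probe cell)."""
    return ["".join(bases) for bases in product(alphabet, repeat=k)]


def array_layout(sequence, k):
    """Map each possible k-mer to the window positions where it occurs.

    Hypothetical helper: probe cells whose k-mer never occurs in the
    sequence keep an empty position list.
    """
    layout = {oligo: [] for oligo in all_possible_oligos(k)}
    for position, kmer in enumerate(tiling_probes(sequence, k)):
        if kmer in layout:  # skip windows containing non-ACGT symbols
            layout[kmer].append(position)
    return layout


if __name__ == "__main__":
    toy_sequence = "ACGTACG"  # made-up sequence, for illustration only
    layout = array_layout(toy_sequence, 3)
    occupied = {oligo: pos for oligo, pos in layout.items() if pos}
    print(len(layout), "probe cells,", len(occupied), "matched by the sequence")
```

the number of probe cells grows as 4 to the power k, which is one reason dense synthesis methods mattered for arrays of this kind.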
it might contain longer oligonucleotides designed from all the open reading frames identified from a complete genome sequence. or it might contain cdnas -of known or unknown sequence -or pcr products. of course it is one thing to produce data it is quite another to extract it in a meaningful manner. fodor's group also developed techniques to read these arrays, employing fluorescent labeling methods and confocal laser scanning to measure each individual binding event on the surface of the chip with extraordinary sensitivity and precision. this general platform of microarray based analysis coupled to confocal laser scanning has become the standard in industry and academia for large-scale genomics studies. in , fodor co-founded affymetrix where the chip technology has been used to synthesize many varieties of high density oligonucleotide arrays containing hundreds of thousands of dna probes. in , steve fodor founded perlegen, inc., a new venture that applied the chip technology towards uncovering the basic patterns of human diversity. his company's stated goals are to analyze more than one million genetic variations in clinical trial participants to explain and predict the efficacy and adverse effect profiles of prescription drugs. in addition, perlegen also applies this expertise to discovering genetic variants associated with disease in order to pave the way for new therapeutics and diagnostics. fodor's former company diversified into plant applications by developing a chip of the archetypal model of plant systems arabidopsis and supplied pioneer hi bred with custom dna chips for monitoring maize gene expression. they (affymetrix) have established programs where academic scientists can use company facilities at a reduced price and set up 'user centers' at selected universities. a related but less complex technology called 'spotted' dna chips involves precisely spotting very small droplets of genomic or cdna clones or pcr samples on a microscope slide. 
the process uses a robotic device with a print head bearing fine "repeatograph" tips that work like fountain pens to draw up dna samples from a -well plate and spot tiny amounts on a slide. up to , individual clones can be spotted in a dense array within one square centimeter on a glass slide. after hybridization with a fluorescent target mrna, signals are detected by a custom scanner. this is the basis of the systems used by molecular dynamics and incyte (which acquired this technology when it took over synteni). in , incyte was looking to gather more data for its library and perform experiments for corporate subscribers. the company considered buying affymetrix genechips but opted instead to purchase the smaller synteni, which had sprung out of pat brown's stanford array effort. synteni's contact printing technology resulted in dense (and cheaper) arrays. though incyte used the chips only internally, affymetrix sued, claiming synteni/incyte was infringing on its chip density patents. the suit argued that dense biochips, regardless of whether they use photolithography, cannot be made without a license from affymetrix! and in a litigious conga line endemic to this hi-tech era, incyte countersued and for good measure also filed against genetic database competitor gene logic for infringing incyte's patents on database building. meanwhile, hyseq sued affymetrix, claiming infringement of nucleotide hybridization patents obtained by its cso. affymetrix, in turn, filed a countersuit, claiming hyseq infringed the spotted array patents. hyseq then reached back and found an additional hybridization patent it claimed that affymetrix had infringed. and so on into the next millennium! in part to avoid all of this, another california company, nanogen, inc., took a different approach to single nucleotide polymorphism discrimination technology.
in an article in the april edition of nature biotechnology, entitled "single nucleotide polymorphic discrimination by an electronic dot blot assay on semiconductor microchips," nanogen describes the use of microchips to identify variants of the mannose binding protein gene that differ from one another by only a single dna base. the mannose binding protein (mbp) is a key component of the innate immune system in children who have not yet developed immunity to a variety of pathogens. to date, four distinct variants (alleles) of this gene have been identified, all differing by only a single nucleotide of dna. mbp was selected for this study because of its potential clinical relevance and its genetic complexity. the samples were assembled at the nci laboratory in conjunction with the national institutes of health and transferred to nanogen for analysis. however, from a high-throughput perspective there is a question mark over microarrays. mark benjamin, senior director of business development at rosetta inpharmatics (kirkland, wa), is skeptical about the long-term prospects for standard dna arrays in high-throughput screening, as the first steps require exposing cells and then isolating rna, which is very hard to do in a high-throughput format. another drawback is that most of the useful targets are likely to be unknown (particularly in the agricultural sciences where genome sequencing is still in its infancy), and dna arrays that are currently available test only for previously sequenced genes. indeed, some argue that current dna arrays may not be sufficiently sensitive to detect the low expression levels of genes encoding targets of particular interest. and the added complication of the companies' reluctance to provide "raw data" means that derived data sets may be created with less than optimum algorithms, thereby irretrievably losing potentially valuable information from the starting material.
reverse engineering is a possible approach, but this is laborious and time consuming and, being prohibited by many contracts, may arouse the interest of the ever-vigilant corporate lawyers. over the course of the nineties, outgrowths of functional genomics have been termed proteomics and metabolomics, which are the global studies of gene expression at the protein and metabolite levels respectively. the study of the integration of information flow within an organism is emerging as the field of systems biology. in the area of proteomics, the methods for global analysis of protein profiles and cataloging protein-protein interactions on a genome-wide scale are technically more difficult but improving rapidly, especially for microbes. these approaches generate vast amounts of quantitative data. the amount of expression data becoming available in the public and private sectors is already increasing exponentially. gene and protein expression data rapidly dwarfed the dna sequence data and are considerably more difficult to manage and exploit. in microbes, the small sizes of the genomes and the ease of handling microbial cultures will enable high-throughput, targeted deletion of every gene in a genome, individually and in combinations. this is already available on a moderate throughput scale in model microbes such as e. coli and yeast. combining targeted gene deletions and modifications with genome-wide assay of mrna and protein levels will enable intricate inter-dependencies among genes to be unraveled. simultaneous measurement of many metabolites, particularly in microbes, is beginning to allow the comprehensive modeling and regulation of fluxes through interdependent pathways. metabolomics can be defined as the quantitative measurement of all low molecular weight metabolites in an organism's cells at a specified time under specific environmental conditions.
combining information from metabolomics, proteomics and genomics will help us to obtain an integrated understanding of cell biology. the next hierarchical level of phenotype considers how the proteome within and among cells cooperates to produce the biochemistry and physiology of individual cells and organisms. several authors have tentatively offered "physiomics" as a descriptor for this approach. the final hierarchical levels of phenotype include anatomy and function for cells and whole organisms. the term "phenomics" has been applied to this level of study, and unquestionably the better known omics, namely economics, has application across all those fields. and, coming slightly out of left field this time, the spectre of eugenics, needless to say, was raised in the omics era. in the year american and british scientists unveiled a technique which has come to be known as pre-implantation genetic diagnosis (pid) for testing embryos in vitro for genetic abnormalities such as cystic fibrosis, hemophilia, and down's syndrome (wald, ) . this might be seen by most as a step forward, but it led ethicist david s. king ( ) to decry pid as a technology that could exacerbate the eugenic features of prenatal testing and make possible an expanded form of free-market eugenics. he further argues that, due to social pressures and eugenic attitudes held by clinical geneticists in most countries, it results in eugenic outcomes even though no state coercion is involved and that, as abortion is not involved and multiple embryos are available, pid is radically more effective as a tool of genetic selection. the first regulatory approval of a recombinant dna technology in the u.s. food supply was not a plant but an industrial enzyme that has become the hallmark of food biotechnology success. enzymes were important agents in food production long before modern biotechnology was developed.
they were used, for instance, in the clotting of milk to prepare cheese, the production of bread and the production of alcoholic beverages. nowadays, enzymes are indispensable to modern food processing technology and have a great variety of functions. they are used in almost all areas of food production including grain processing, milk products, beer, juices, wine, sugar and meat. chymosin, known also as rennin, is a proteolytic enzyme whose role in digestion is to curdle or coagulate milk in the stomach, efficiently converting liquid milk to a semisolid like cottage cheese, allowing it to be retained for longer periods in a neonate's stomach. the dairy industry takes advantage of this property to conduct the first step in cheese production. chy-max™, an artificially produced form of the chymosin enzyme for cheese-making, was approved in . in some instances such enzymes replace less acceptable "older" technology, chymosin being one example. unlike crops, industrial enzymes have had relatively easy passage to acceptance for a number of reasons. as noted, they are part of the processing system and theoretically do not appear in the final product. today about % of the hard cheese in the us and uk is made using chymosin from genetically modified microbes. it is easier to purify, more active ( % as compared to %) and less expensive to produce (microbes are more prolific, more productive and cheaper to keep than calves). like all enzymes it is required only in very small quantities, and because it is a relatively unstable protein it breaks down as the cheese matures. indeed, if the enzyme remained active for too long it would adversely affect the development of the cheese, as it would degrade the milk proteins to too great a degree. such enzymes have gained the support of vegetarian organizations and of some religious authorities.
for plants the nineties was the era of the first widespread commercialization of what came to be known, in often deprecating and literally inaccurate terms, as gmos (genetically modified organisms). when the nineties dawned, dicotyledonous plants were relatively easily transformed with agrobacterium tumefaciens, but many economically important plants, including the cereals, remained inaccessible for genetic manipulation because of a lack of effective transformation techniques. in this changed with the technology that overcame this limitation. michael fromm, a molecular biologist at the plant gene expression center, reported the stable transformation of corn using a high-speed gene gun. the method, known as biolistics, uses a "particle gun" to shoot metal particles coated with dna into cells. initially a gunpowder charge, subsequently replaced by helium gas, was used to accelerate the particles in the gun. there is minimal disruption of tissue and the success rate has been extremely high for applications in several plant species. the technology rights are now owned by dupont. in some of the first field trials of the crops that would dominate the second half of the nineties began, including bt corn (with the bacillus thuringiensis cry protein discussed in chapter three). in the fda declared that genetically engineered foods are "not inherently dangerous" and do not require special regulation. since , researchers have pinpointed and cloned several of the genes that make selected plants resistant to certain bacterial and fungal infections; some of these genes have been successfully inserted into crop plants that lack them. many more infection-resistant crops are expected in the near future, as scientists find more plant genes in nature that make plants resistant to pests. plant genes, however, are just a portion of the arsenal; microorganisms other than bt also are being mined for genes that could help plants fend off invaders that cause crop damage.
the major milestone of the decade in crop biotechnology was approval of the first bioengineered crop plant in . it represented a double first: not just the first approved food crop but also the first commercial validation of a technology which was to be surpassed later in the decade. that technology, antisense technology, works because nucleic acids have a natural affinity for each other. when a gene coding for the target in the genome is introduced in the opposite orientation, the reverse rna strand anneals and effectively blocks expression of the enzyme. this technology was patented by calgene for plant applications and was the technology behind the famous flavr savr tomatoes. the first success for antisense in medicine was in when the u.s. food and drug administration gave the go-ahead to the cytomegalovirus (cmv) inhibitor fomivirsen, a phosphorothioate antiviral for the aids-related condition cmv retinitis, making it the first drug belonging to isis, and the first antisense drug ever, to be approved. another technology, although not apparent at the time, was behind the second approval and also behind the first, and to date only, successful commercial biotech application in a tree fruit. the former was a virus-resistant squash, the latter the papaya ringspot-resistant papaya. both owed their existence as much to historic experience as to modern technology. genetically engineered virus-resistant strains of squash and cantaloupe, for example, would never have made it to farmers' fields if plant breeders in the 's had not noticed that plants infected with a mild strain of a virus do not succumb to more destructive strains of the same virus. that finding led plant pathologist roger beachy, then at washington university in saint louis, to wonder exactly how such "cross-protection" worked: did part of the virus prompt it? in collaboration with researchers at monsanto, beachy used an a.
tumefaciens vector to insert into tomato plants a gene that produces one of the proteins that makes up the protein coat of the tobacco mosaic virus. he then inoculated these plants with the virus and was pleased to discover, as reported in , that the vast majority of plants did not succumb to the virus. eight years later, in , virus-resistant squash seeds created with beachy's method reached the market, to be followed soon by bioengineered virus-resistant seeds for cantaloupes, potatoes, and papayas. (breeders had already created virus-resistant tomato seeds by using traditional techniques.) and the method of protection still remained a mystery when the first approvals were given in and . gene silencing was perceived initially as an unpredictable and inconvenient side effect of introducing transgenes into plants. it now seems that it is the consequence of accidentally triggering the plant's adaptive defense mechanism against viruses and transposable elements. this recently discovered mechanism, although mechanistically different, has a number of parallels with the immune system of mammals. how this system worked was not elucidated until later in the decade by a researcher who was seeking a very different holy grail: the black rose! rick jorgensen, at that time at dna plant technologies in oakland, ca, and subsequently of the university of california davis, attempted to overexpress the chalcone synthase gene by introducing a modified copy under a strong promoter. surprisingly, he obtained white flowers, and many strange variegated purple and white variations in between. this was the first demonstration of what has come to be known as post-transcriptional gene silencing (ptgs). while initially it was considered a strange phenomenon limited to petunias and a few other plant species, it is now one of the hottest topics in molecular biology.
rna interference (rnai) in animals and basal eukaryotes, quelling in fungi, and ptgs in plants are examples of a broad family of phenomena collectively called rna silencing (hannon ; plasterk ) . in addition to its occurrence in these species it has roles in viral defense (as demonstrated by beachy) and transposon silencing mechanisms among other things. perhaps most exciting, however, is the emerging use of ptgs and, in particular, rnai (ptgs initiated by the introduction of double-stranded rna, or dsrna) as a tool to knock out expression of specific genes in a variety of organisms. nineteen ninety one also heralded yet another first. the february , issue of science reported the patenting of "molecular scissors": the nobel-prize winning discovery of enzymatic rna, or "ribozymes," by thomas cech of the university of colorado. it was noted that the u.s. patent and trademark office had awarded an "unusually broad" patent for ribozymes. the patent is u.s. patent no. , , , claim of which reads as follows: "an enzymatic rna molecule not naturally occurring in nature having an endonuclease activity independent of any protein, said endonuclease activity being specific for a nucleotide sequence defining a cleavage site comprising single-stranded rna in a separate rna molecule, and causing cleavage at said cleavage site by a transesterification reaction." although enzymes made of protein are the dominant form of biocatalyst in modern cells, there are at least eight natural rna enzymes, or ribozymes, that catalyze fundamental biological processes. one of these was yet another discovery by plant virologists: the hairpin ribozyme, discovered by george bruening at uc davis. the self-cleavage structure was originally called a paperclip by the bruening laboratory, which discovered the reactions. as mentioned in chapter , it is believed that these ribozymes might be the remnants of an ancient form of life that was guided entirely by rna.
since a ribozyme is a catalytic rna molecule capable of cleaving itself and other target rnas, it can be useful as a control system for turning off genes or targeting viruses. the possibility of designing ribozymes to cleave any specific target rna has rendered them valuable tools in both basic research and therapeutic applications. in the therapeutics area, they have been exploited to target viral rnas in infectious diseases, dominant oncogenes in cancers and specific somatic mutations in genetic disorders. most notably, several ribozyme gene therapy protocols for hiv patients are already in phase trials. more recently, ribozymes have been used for transgenic animal research, gene target validation and pathway elucidation. however, targeting ribozymes to the cellular compartment containing their target rnas has proved a challenge. at the other bookend of the decade in , samarsky et al. reported that a family of small rnas in the nucleolus (snornas) can readily transport ribozymes into this subcellular organelle. in addition to the already extensive panoply of rna entities, yet another has potential for mischief. viroids are small, single-stranded, circular rnas containing - nucleotides arranged in a rod-like secondary structure and are the smallest pathogenic agents yet described. the smallest viroid characterized to date is rice yellow mottle sobemovirus (rymv), at nucleotides. in comparison, the genome of the smallest known viruses capable of causing an infection by themselves, the single-stranded circular dna of circoviruses, is around kilobases in size. the first viroid to be identified was the potato spindle tuber viroid (pstvd). some species have been identified to date. unlike the many satellite or defective interfering rnas associated with plant viruses, viroids replicate autonomously on inoculation of a susceptible host.
the absence of a protein capsid and of detectable messenger rna activity implies that the information necessary for replication and pathogenesis resides within the unusual structure of the viroid genome. the replication mechanism actually involves interaction with rna polymerase ii, an enzyme normally associated with synthesis of messenger rna, and "rolling circle" synthesis of new rna. some viroids have ribozyme activity, which allows self-cleavage and ligation of unit-size genomes from larger replication intermediates. it has been proposed that viroids are "escaped introns". viroids are usually transmitted by seed or pollen. infected plants can show distorted growth. from its earliest years, biotechnology attracted interest outside scientific circles. initially the main focus of public interest was on the safety of recombinant dna technology, and on the possible risks of creating uncontrollable and harmful novel organisms (berg , ) . the debate on the deliberate release of genetically modified organisms, and on consumer products containing or comprising them, followed some years later (nas, ) . it is interesting to note that within the broad ethical tableau of potential issues within the science and products of biotechnology, the seemingly innocuous field of plant modification has been one of the major players of the 's. the success of agricultural biotechnology is heavily dependent on its acceptance by the public, and the regulatory framework in which the industry operates is also influenced by public opinion. as the focus for molecular biology research shifted from the basic pursuit of knowledge to the pursuit of lucrative applications, once again, as in the previous two decades, the specter of risk arose as the potential of new products and applications had to be evaluated outside the confines of a laboratory.
however, the specter now became far more global as the implications of commercial applications brought not just worker safety into the loop but also the environment, agricultural and industrial products and the safety and well-being of all living things. beyond "deliberate" release, the rac guidelines were not designed to address these issues, so the matter moved into the realm of the federal agencies that had regulatory authority which could be interpreted to oversee biotechnology issues. this adaptation of oversight is very much a dynamic process as the various agencies wrestle with the task of applying existing regulations and developing new ones for oversight of this technology in transition. as the decade progressed, focus shifted from basic biotic stress resistance to more complex modifications. the next generation of plants will focus on value-added traits in which valuable genes and metabolites will be identified and isolated, with some of the latter compounds being produced in mass quantities for niche markets. two of the more promising markets are nutraceuticals, or so-called "functional foods", and plants developed as bioreactors for the production of valuable proteins and compounds, a field known as plant molecular farming. developing plants with improved quality traits involves overcoming a variety of technical challenges inherent to metabolic engineering programs. both traditional plant breeding and biotechnology techniques are needed to produce plants carrying the desired quality traits. continuing improvements in molecular and genomic technologies are contributing to the acceleration of product development in this space. by the end of the decade in , applying nutritional genomics, della penna ( ) isolated a gene that converts the lower activity precursors to the highest activity vitamin e compound, alpha-tocopherol.
with this technology, the vitamin e content of arabidopsis seed oil has been increased nearly -fold and progress has been made to move the technology to crops such as soybean, maize, and canola. this has also been done for folates in rice. omega three fatty acids play a significant role in human health. eicosapentaenoic acid (epa) and docosahexaenoic acid (dha), which are present in the retina of the eye and cerebral cortex of the brain, respectively, are among the best documented from a clinical perspective. it is believed that epa and dha play an important role in the regulation of inflammatory immune reactions and blood pressure, treatment of conditions such as cardiovascular disease and cystic fibrosis, brain development in utero, and, in early postnatal life, the development of cognitive function. they are mainly found in fish oil and the supply is limited. by the end of the decade ursin ( ) had succeeded in engineering canola to produce these fatty acids. from a global perspective another value-added development had far greater impact both technologically and socio-economically. a team led by ingo potrykus ( ) engineered rice to produce pro-vitamin a, which is an essential micronutrient. widespread dietary deficiency of this vitamin in rice-eating asian countries, which predisposes children to diseases such as blindness and measles, has tragic consequences. improved vitamin a nutrition would alleviate serious health problems and, according to unicef, could also prevent up to two million infant deaths due to vitamin a deficiency. adoption of the next stage of gm crops may proceed more slowly, as the market confronts issues of how to determine price, share value, and adjust marketing and handling to accommodate specialized end-use characteristics. furthermore, competition from existing products will not evaporate.
challenges that have accompanied gm crops with improved agronomic traits, such as the stalled regulatory processes in europe, will also affect adoption of nutritionally improved gm products. beyond all of this, credible scientific research is still needed to confirm the benefits of any particular food or component. for functional foods to deliver their potential public health benefits, consumers must have a clear understanding of, and a strong confidence level in, the scientific criteria that are used to document health effects and claims. because these decisions will require an understanding of plant biochemistry, mammalian physiology, and food chemistry, strong interdisciplinary collaborations will be needed among plant scientists, nutritionists, and food scientists to ensure a safe and healthful food supply. in addition to being a source of nutrition, plants have been a valuable wellspring of therapeutics for centuries. during the nineties, however, intensive research has focused on expanding this source through rdna biotechnology and essentially using plants and animals as living factories for the commercial production of vaccines, therapeutics and other valuable products such as industrial enzymes and biosynthetic feedstocks. possibilities in the medical field include a wide variety of compounds, ranging from edible vaccine antigens against hepatitis b and norwalk viruses (arntzen, ) and pseudomonas aeruginosa and staphylococcus aureus to vaccines against cancer and diabetes, enzymes, hormones, cytokines, interleukins, plasma proteins, and human alpha- -antitrypsin. thus, plant cells are capable of expressing a large variety of recombinant proteins and protein complexes. therapeutics produced in this way are termed plant-made pharmaceuticals (pmps), and non-therapeutics are termed plant-made industrial products (pmips) (newell-mcgloughlin, ) .
the first reported results of successful human clinical trials with transgenic plant-derived pharmaceuticals were published in . they were an edible vaccine against e. coli-induced diarrhea and a secretory monoclonal antibody directed against streptococcus mutans, for preventative immunotherapy to reduce incidence of dental caries. haq et al. ( ) reported the expression in potato plants of a vaccine against e. coli enterotoxin (etec) that provided an immune response against the toxin in mice. human clinical trials suggest that oral vaccination against either of the closely related enterotoxins of vibrio cholerae and e. coli induces production of antibodies that can neutralize the respective toxins by preventing them from binding to gut cells. similar results were found for norwalk virus oral vaccines in potatoes. for developing countries, the intention is to deliver them in bananas or tomatoes (newell-mcgloughlin, ) . plants are also faster, cheaper, more convenient and more efficient for the production of pharmaceuticals than the principal eukaryotic production system, namely chinese hamster ovary (cho) cells. hundreds of acres of protein-containing seeds could inexpensively double the production of a cho bioreactor factory. in addition, proteins can be expressed at the highest levels in the harvestable seed, and plant-made proteins and enzymes formulated in seeds have been found to be extremely stable, reducing storage and shipping costs. pharming may also enable research on drugs that cannot currently be produced. for example, croptech in blacksburg, va., is investigating a protein that seems to be a very effective anticancer agent. the problem is that this protein is difficult to produce in mammalian cell culture systems as it inhibits cell growth. this should not be a problem in plants. furthermore, production size is flexible and easily adjustable to the needs of changing markets.
making pharmaceuticals from plants is also a sustainable process, because the plants and crops used as raw materials are renewable. the system also has the potential to address problems associated with provision of vaccines to people in developing countries. products from these alternative sources do not require a so-called "cold chain" for refrigerated transport and storage. those being developed for oral delivery obviate the need for needles and aseptic conditions, which often are a problem in those areas. apart from those specific applications where the plant system is optimum, there are many other advantages to using plant production. many new pharmaceuticals based on recombinant proteins will receive regulatory approval from the united states food and drug administration (fda) in the next few years. as these therapeutics make their way through clinical trials and evaluation, the pharmaceutical industry faces a production capacity challenge. pharmaceutical discovery companies are exploring plant-based production to overcome capacity limitations, enable production of complex therapeutic proteins, and fully realize the commercial potential of their biopharmaceuticals (newell-mcgloughlin, ) . nineteen ninety also marked a major milestone in the animal biotech world when herman made his appearance on the world's stage. since palmiter's mouse, transgenic technology has been applied to several species including agricultural species such as sheep, cattle, goats, pigs, rabbits, poultry, and fish. herman was the first transgenic bovine, created by genpharm international, inc., in a laboratory in the netherlands at the early embryo stage. scientists microinjected recently fertilized eggs with the gene coding for human lactoferrin. the scientists then cultured the cells in vitro to the embryo stage and transferred them to recipient cattle. lactoferrin, an iron-containing anti-bacterial protein, is essential for infant growth.
since cow's milk doesn't contain lactoferrin, infants must be fed from other sources that are rich in iron, such as formula or mother's milk (newell-mcgloughlin, ) . as herman was a boy he would be unable to provide the source; that would require the production of daughters, which was not necessarily a straightforward process. the dutch parliament's permission was required. in they finally approved a measure that permitted the world's first genetically engineered bull to reproduce. the leiden-based gene pharming proceeded to artificially inseminate cows with herman's sperm, with a promise that the protein, lactoferrin, would be the first in a new generation of inexpensive, high-tech drugs derived from cows' milk to treat complex diseases like aids and cancer. herman became the father of at least eight female calves in , and each one inherited the gene for lactoferrin production. while their birth was initially greeted as a scientific advancement that could have far-reaching effects for children in developing nations, the levels of expression were too low to be commercially viable. by , herman, who likes to listen to rap music to relax, had sired calves and outlived them all. his offspring were all killed and destroyed after the end of the experiment, in line with dutch health legislation. herman was also slated for the abattoir, but the dutch public, proud of making history with herman, rose up in protest, especially after a television program screened footage showing the amiable bull licking a kitten. herman won a bill of clemency from parliament. however, instead of retirement on a comfortable bed of straw, listening to rap music, herman was pressed into service again. he now stars at a permanent biotech exhibit in naturalis, a natural history museum in the dutch city of leiden. after his death, he will be stuffed and remain in the museum in perpetuity (a fate similar to what awaited an even more famous mammalian first born later in the decade).
the applications for transgenic animal research fall broadly into two distinct areas, namely medical and agricultural applications. the recent focus on developing animals as bioreactors to produce valuable proteins in their milk can be catalogued under both areas. underlying each of these, of course, is a more fundamental application, that is, the use of those techniques as tools to ascertain the molecular and physiological bases of gene expression and animal development. this understanding can then lead to the creation of techniques to modify development pathways. in a european decision with rather more far-reaching implications than herman's sex life was made. the first european patent on a transgenic animal was issued for a transgenic mouse sensitive to carcinogens, harvard's "oncomouse". the oncomouse patent application was refused in europe in due primarily to an established ban on animal patenting. the application was revised to make narrower claims, and the patent was granted in . this has since been repeatedly challenged, primarily by groups objecting to the judgement that benefits to humans outweigh the suffering of the animal. currently, the patent applicant is awaiting protestors' responses to a series of possible modifications to the application. predictions are that agreement will not likely be forthcoming and that the legal wrangling will continue into the future. bringing animals into the field of controversy starting to swirl around gmos, and preceding the latter's commercialization, was the approval by the fda of bovine somatotropin (bst) for increased milk production in dairy cows. the fda's center for veterinary medicine (cvm) regulates the manufacture and distribution of food additives and drugs that will be given to animals. biotechnology products are a growing proportion of the animal health products and feed components regulated by the cvm.
the center requires that food products from treated animals be shown to be safe for human consumption. applicants must show that the drug is effective and safe for the animal and that its manufacture will not affect the environment. they must also conduct geographically dispersed clinical trials under an investigational new animal drug application with the fda, through which the agency controls the use of the unapproved compound in food animals. unlike within the eu, possible economic and social issues cannot be taken into consideration by the fda in the premarket drug approval process. under these considerations the safety and efficacy of rbst was determined. it was also determined that special labeling for milk derived from cows that had been treated with rbst is not required under fda food labeling laws, because the use of rbst does not affect the quality or the composition of the milk. work with fish proceeded apace throughout the decade. gene transfer techniques have been applied to a large number of aquatic organisms, both vertebrates and invertebrates. gene transfer experiments have targeted a wide variety of applications, including the study of gene structure and function, aquaculture production, and use in fisheries management programs. because fish have high fecundity and large eggs, and do not require reimplantation of embryos, transgenic fish prove attractive model systems in which to study gene expression. transgenic zebrafish have found utility in studies of embryogenesis, with expression of transgenes marking cell lineages or providing the basis for study of promoter or structural gene function. although not as widely used as zebrafish, transgenic medaka and goldfish have been used for studies of promoter function. this body of research indicates that transgenic fish provide useful models of gene expression, reliably modeling that in "higher" vertebrates.
perhaps the largest number of gene transfer experiments address the goal of genetic improvement for aquaculture production purposes. the principal area of research has focused on growth performance, and initial transgenic growth hormone (gh) fish models have demonstrated accelerated and beneficial phenotypes. dna microinjection methods have propelled the many studies reported and have been most effective due to the relative ease of working with fish embryos. bob devlin's group in vancouver has demonstrated extraordinary growth rate in coho salmon which were transformed with a growth hormone from sockeye salmon. the transgenics achieve up to eleven times the size of their littermates within six months, reaching maturity in about half the time. interestingly, this dramatic effect is only observed in feeding pens, where the transgenics' ferocious appetites demand constant feeding. if the fish are left to their own devices and must forage for themselves, they appear to be out-competed by their smarter siblings. however, most studies, such as those involving transgenic atlantic salmon and channel catfish, report growth rate enhancement on the order of - %. in addition to the species mentioned, gh genes also have been transferred into striped bass, tilapia, rainbow trout, gilthead sea bream, common carp, bluntnose bream, loach, and other fishes. shellfish also are subject to gene transfer toward the goal of intensifying aquaculture production. growth of abalone expressing an introduced gh gene is being evaluated; accelerated growth would prove a boon for culture of the slow-growing mollusk. a marker gene was introduced successfully into giant prawn, demonstrating feasibility of gene transfer in crustaceans, and opening the possibility of work involving genes affecting economically important traits. in the ornamental fish sector of aquaculture, ongoing work addresses the development of fish with unique coloring or patterning.
a number of companies have been founded to pursue commercialization of transgenics for aquaculture. as most aquaculture species mature at - years of age, most transgenic lines are still in development and have yet to be tested for performance under culture conditions. extending earlier research that identified methylfarnesoate (mf) as a juvenile hormone in crustaceans and determined its role in reproduction, researchers at the university of connecticut have developed technology to synchronize shrimp egg production and to increase the number and quality of eggs produced. females injected with mf are stimulated to produce eggs ready for fertilization. the procedure produces percent more eggs than the traditional crude method of removing the eyestalk gland. this will increase aquaculture efficiency. a number of experiments utilize gene transfer to develop genetic lines of potential utility in fisheries management. transfer of gh genes into northern pike, walleye, and largemouth bass are aimed at improving the growth rate of sport fishes. gene transfer has been posed as an option for reducing losses of rainbow trout to whirling disease, although suitable candidate genes have yet to be identified. richard winn of the university of georgia is developing transgenic killifish and medaka as biomonitors for environmental mutagens, which carry the bacteriophage phi x as a target for mutation detection. development of transgenic lines for fisheries management applications generally is at an early stage, often at the founder or f generation. broad application of transgenic aquatic organisms in aquaculture and fisheries management will depend on showing that particular gmos can be used in the environment both effectively and safely. although our base of knowledge for assessing ecological and genetic safety of aquatic gmos currently is limited, some early studies supported by the usda biotechnology risk assessment program have yielded results. 
data from outdoor pond-based studies on transgenic catfish reported by rex dunham of auburn university show that transgenic and non-transgenic individuals interbreed freely, that survival and growth of transgenics in unfed ponds was equal to or less than that of non-transgenics, and that predator avoidance is not affected by expression of the transgene. however, unquestionably the seminal event for animal biotech in the nineties was ian wilmut's landmark work using nuclear transfer technology to generate the lambs morag and megan reported in (from an embryonic cell nuclei) and the truly ground-breaking work of creating dolly from an adult somatic cell nucleus, reported in february, (wilmut, ) . wilmut and his colleagues at the roslin institute demonstrated for the first time with the birth of dolly the sheep that the nucleus of an adult somatic cell can be transferred to an enucleated egg to create cloned offspring. it had been assumed for some time that only embryonic cells could be used as the cellular source for nuclear transfer. this assumption was shattered with the birth of dolly. this example of cloning an animal using the nucleus of an adult cell was significant because it demonstrated the ability of egg cell cytoplasm to "reprogram" an adult nucleus. when cells differentiate, that is, evolve from primitive embryonic cells to functionally defined adult cells, they lose the ability to express most genes and can only express those genes necessary for the cell's differentiated function. for example, skin cells only express genes necessary for skin function, and brain cells only express genes necessary for brain function. the procedure that produced dolly demonstrated that egg cytoplasm is capable of reprogramming an adult differentiated cell (which is only expressing genes related to the function of that cell type). 
this reprogramming enables the differentiated cell nucleus to once again express all the genes required for the full embryonic development of the adult animal. since dolly was cloned, similar techniques have been used to clone a veritable zoo of vertebrates including mice, cattle, rabbits, mules, horses, fish, cats and dogs from donor cells obtained from adult animals. these spectacular examples of cloning normal animals from fully differentiated adult cells demonstrate the universality of nuclear reprogramming, although the next decade called some of these assumptions into question. this technology supports the production of genetically identical and genetically modified animals. thus, the successful "cloning" of dolly has captured the imagination of researchers around the world. this technological breakthrough should play a significant role in the development of new procedures for genetic engineering in a number of mammalian species. it should be noted that nuclear cloning, with nuclei obtained from either mammalian stem cells or differentiated "adult" cells, is an especially important development for transgenic animal research. as the decade reached its end, the clones began arriving rapidly, with specific advances made by a japanese group who used cumulus cells rather than fibroblasts to clone calves. they found that the percentage of cultured, reconstructed eggs that developed into blastocysts was % for cumulus cells and % for oviductal cells. these rates are higher than the % previously reported for transfer of nuclei from bovine fetal fibroblasts. following on the heels of dolly, polly and molly became the first genetically engineered transgenic sheep produced through nuclear transfer technology. polly and molly were engineered to produce human factor ix (for hemophiliacs) by transfer of nuclei from transfected fetal fibroblasts. until then, germline competent transgenics had only been produced in mammalian species, other than mice, using dna microinjection.
researchers at the university of massachusetts and advanced cell technology (worcester, ma) teamed up to produce genetically identical calves utilizing a strategy similar to that used to produce transgenic sheep. in contrast to the sheep cloning experiment, the bovine experiment involved the transfer of nuclei from an actively dividing population of cells. previous results from the sheep experiments suggested that induction of quiescence by serum starvation was required to reprogram the donor nuclei for successful nuclear transfer. the current bovine experiments indicate that this step may not be necessary. typically about embryos needed to be microinjected to obtain one transgenic cow, whereas nuclear transfer produced three transgenic calves from reconstructed embryos. this efficiency is comparable to the previous sheep research where six transgenic lambs were produced from reconstructed embryos. the ability to select for genetically modified cells in culture prior to nuclear transfer opens up the possibility of applying the powerful gene targeting techniques that have been developed for mice. one of the limitations of using primary cells, however, is their limited lifespan in culture. primary cell cultures such as the fetal fibroblasts can only undergo about population doublings before they senesce. this limited lifespan would preclude the ability to perform multiple rounds of selection. to overcome this problem of cell senescence, these researchers showed that fibroblast lifespan could be prolonged by nuclear transfer. a fetus, which was developed by nuclear transfer from genetically modified cells, could in turn be used to establish a second generation of fetal fibroblasts. these fetal cells would then be capable of undergoing another population doublings, which would provide sufficient time for selection of a second genetic modification. as noted, there is still some uncertainty over whether quiescent cells are required for successful nuclear transfer. 
induction into quiescence was originally thought to be necessary for successful nuclear reprogramming of the donor nucleus. however, cloned calves have been previously produced using non-quiescent fetal cells. furthermore, transfer of nuclei from sertoli and neuronal cells, which do not normally divide in adults, did not produce a liveborn mouse, whereas nuclei transferred from actively dividing cumulus cells did produce cloned mice. the fetuses used for establishing fetal cell lines in a tufts goat study were generated by mating nontransgenic females to a transgenic male containing a human antithrombin (at) iii transgene. this at transgene directs high level expression of human at into milk of lactating transgenic females. as expected, all three offspring derived from female fetal cells were females. one of these cloned goats was hormonally induced to lactate. this goat secreted . - . grams per liter of at in her milk. this level of at expression was comparable to that detected in the milk of transgenic goats from the same line obtained by natural breeding. the successful secretion of at in milk was a key result because it showed that a cloned animal could still synthesize and secrete a foreign protein at the expected level. it will be interesting to see if all three cloned goats secrete human at at the identical level. if so, then the goal of creating a herd of identical transgenic animals, which secrete identical levels of an important pharmaceutical, would become a reality. no longer would variable production levels exist in subsequent generations due to genetically similar but not identical animals. this homogeneity would greatly aid in the production and processing of a uniform product. as nuclear transfer technology continues to be refined and applied to other species, it may eventually replace microinjection as the method of choice for generating transgenic livestock.
nuclear transfer has a number of advantages: (1) nuclear transfer is more efficient than microinjection at producing a transgenic animal, (2) the fate of the integrated foreign dna can be examined prior to production of the transgenic animal, (3) the sex of the transgenic animal can be predetermined, and (4) the problem of mosaicism in first generation transgenic animals can be eliminated. dna microinjection has not been a very efficient mechanism to produce transgenic mammals. however, in november, , a team of wisconsin researchers reported a nearly % efficient method for generating transgenic cattle. the established method of cattle transgenesis involves injecting dna into the pronuclei of a fertilized egg or zygote. in contrast, the wisconsin team injected a replication-defective retroviral vector into the perivitelline space of an unfertilized oocyte. the perivitelline space is the region between the oocyte membrane and the protective coating surrounding the oocyte known as the zona pellucida. in addition to es (embryonic stem) cells, other sources of donor nuclei for nuclear transfer, such as embryonic cell lines, primordial germ cells, or spermatogonia, might be used to produce offspring. the utility of es cells or related methodologies to provide efficient and targeted in vivo genetic manipulations offers the prospect of profoundly useful animal models for biomedical, biological and agricultural applications. the road to such success has been most challenging, but recent developments in this field are extremely encouraging. with the may announcement that geron was buying out ian wilmut's company roslin biomed, they declared it the dawn of a new era in biomedical research. geron's technologies for deriving transplantable cells from human pluripotent stem cells (hpscs) and extending their replicative capacity with telomerase were combined with the roslin institute nuclear transfer technology, the technology that produced dolly the cloned sheep.
the goal was to produce transplantable, tissue-matched cells that provide extended therapeutic benefits without triggering immune rejection. such cells could be used to treat numerous major chronic degenerative diseases and conditions such as heart disease, stroke, parkinson's disease, alzheimer's disease, spinal cord injury, diabetes, osteoarthritis, bone marrow failure and burns. the stem cell is a unique and essential cell type found in animals. many kinds of stem cells are found in the body, with some more differentiated, or committed, to a particular function than others. in other words, when stem cells divide, some of the progeny mature into cells of a specific type (heart, muscle, blood, or brain cells), while others remain stem cells, ready to repair some of the everyday wear and tear undergone by our bodies. these stem cells are capable of continually reproducing themselves and serve to renew tissue throughout an individual's life. for example, they continually regenerate the lining of the gut, revitalize skin, and produce a whole range of blood cells. although the term "stem cell" commonly is used to refer to the cells within the adult organism that renew tissue (e.g., hematopoietic stem cells, a type of cell found in the blood), the most fundamental and extraordinary of the stem cells are found in the early-stage embryo. these embryonic stem (es) cells, unlike the more differentiated adult stem cells or other cell types, retain the special ability to develop into nearly any cell type. embryonic germ (eg) cells, which originate from the primordial reproductive cells of the developing fetus, have properties similar to es cells. it is the potentially unique versatility of the es and eg cells derived, respectively, from the early-stage embryo and cadaveric fetal tissue that presents such unusual scientific and therapeutic promise. 
indeed, scientists have long recognized the possibility of using such cells to generate more specialized cells or tissue, which could allow the generation of new cells to be used to treat injuries or diseases, such as alzheimer's disease, parkinson's disease, heart disease, and kidney failure. likewise, scientists regard these cells as an important -perhaps essential -means for understanding the earliest stages of human development and as an important tool in the development of life-saving drugs and cell-replacement therapies to treat disorders caused by early cell death or impairment. geron corporation and its collaborators at the university of wisconsin -madison (dr. james a. thomson) and johns hopkins university (dr. john d. gearhart) announced in november the first successful derivation of hpscs from two sources: (i) human embryonic stem (hes) cells derived from in vitro fertilized blastocysts (thomson ) and (ii) human embryonic germ (heg) cells derived from fetal material obtained from medically terminated pregnancies (shamblott et al. ) . although derived from different sources by different laboratory processes, these two cell types share certain characteristics but are referred to collectively as human pluripotent stem cells (hpscs). because hes cells have been more thoroughly studied, the characteristics of hpscs most closely describe the known properties of hes cells. stem cells represent a tremendous scientific advancement in two ways: first, as a tool to study developmental and cell biology; and second, as the starting point for therapies to develop medications to treat some of the most deadly diseases. the derivation of stem cells is fundamental to scientific research in understanding basic cellular and embryonic development. observing the development of stem cells as they differentiate into a number of cell types will enable scientists to better understand cellular processes and ways to repair cells when they malfunction. 
it also holds great potential to yield revolutionary treatments by transplanting new tissue to treat heart disease, atherosclerosis, blood disorders, diabetes, parkinson's, alzheimer's, stroke, spinal cord injuries, rheumatoid arthritis, and many other diseases. by using stem cells, scientists may be able to grow human skin cells to treat wounds and burns. and, it will aid the understanding of fertility disorders. many patient and scientific organizations recognize the vast potential of stem cell research. another possible therapeutic technique is the generation of "customized" stem cells. a researcher or doctor might need to develop a special cell line that contains the dna of a person living with a disease. by using a technique called "somatic cell nuclear transfer" the researcher can transfer a nucleus from the patient into an enucleated human egg cell. this reformed cell can then be activated to form a blastocyst from which customized stem cell lines can be derived to treat the individual from whom the nucleus was extracted. by using the individual's own dna, the stem cell line would be fully compatible and not be rejected by the person when the stem cells are transferred back to that person for the treatment. preliminary research is occurring on other approaches to produce pluripotent human es cells without the need to use human oocytes. human oocytes may not be available in quantities that would meet the needs of millions of potential patients. however, no peer-reviewed papers have yet appeared from which to judge whether animal oocytes could be used to manufacture "customized" human es cells and whether they can be developed on a realistic timescale. additional approaches under consideration include early experimental studies on the use of cytoplasmic-like media that might allow a viable approach in laboratory cultures. 
on a much longer timeline, it may be possible to use sophisticated genetic modification techniques to eliminate the major histocompatibility complexes and other cell-surface antigens from foreign cells to prepare master stem cell lines with less likelihood of rejection. this could lead to the development of a bank of universal donor cells or multiple types of compatible donor cells of invaluable benefit to treat all patients. however, the human immune system is sensitive to many minor histocompatibility complexes and immunosuppressive therapy carries life-threatening complications. stem cells also show great potential to aid research and development of new drugs and biologics. now, stem cells can serve as a source for normal human differentiated cells to be used for drug screening and testing, drug toxicology studies and to identify new drug targets. the ability to evaluate drug toxicity in human cell lines grown from stem cells could significantly reduce the need to test a drug's safety in animal models. there are other sources of stem cells, including stem cells that are found in blood. recent reports note the possible isolation of stem cells for the brain from the lining of the spinal cord. other reports indicate that some stem cells that were thought to have differentiated into one type of cell can also become other types of cells, in particular brain stem cells with the potential to become blood cells. however, since these reports reflect very early cellular research about which little is known, we should continue to pursue basic research on all types of stem cells. some religious leaders will advocate that researchers should only use certain types of stem cells. however, because human embryonic stem cells hold the potential to differentiate into any type of cell in the human body, no avenue of research should be foreclosed. 
rather, we must find ways to facilitate the pursuit of all research using stem cells while addressing the ethical concerns that may be raised. another seminal and intimately related event at the end of the nineties occurred in madison, wisconsin. up until november of , isolating es cells in mammals other than mice proved elusive, but in a milestone paper in the november , issue of science, james a. thomson ( ), a developmental biologist at uw-madison, reported the first successful isolation, derivation and maintenance of a culture of human embryonic stem cells (hes cells). it is interesting to note that this leap was made from mouse to man. as thomson himself put it, these cells are different from all other human stem cells isolated to date and, as the source of all cell types, they hold great promise for use in transplantation medicine, drug discovery and development, and the study of human developmental biology. the new century is rapidly exploiting this vision. when steve fodor was asked in "how do you really take the human genome sequence and transform it into knowledge?" he answered that, from affymetrix's perspective, it is a technology development task. he sees the colloquially named affychips as being the equivalent of a cd-rom of the genome. they take information from the genome and write it down. the company has come a long way from the early days of venter's ests and less than robust algorithms as described earlier. one surprising fact unearthed by the newer, more sophisticated generation of chips is that to percent of the non-repetitive dna is being expressed, whereas accepted knowledge was that only . to percent of the genome would be expressed. since much of that sequence has no protein-coding capacity, it is most likely coding for regulatory functions. in a parallel to astrophysics, this is often referred to in common parlance as the "dark matter of the genome" and, like dark matter, for many it is the most exciting and challenging aspect of uncovering the occult genome.
it could be, and most probably is, involved in regulatory functions, networks, or development. and like physical dark matter it may change our whole concept of what exactly a gene is or is not! since beadle and tatum's circumspect view of the protein world no longer holds true, it adds a layer of complexity to organizing chip design. depending on which sequences are present in a particular transcript, you can, theoretically, design a set of probes to uniquely distinguish that variant. at the dna level itself there is much potential for looking at variants, either expressed or not, at a very basic level as a diagnostic system, but ultimately the real paydirt is the information that can be gained from looking at the consequence of non-coding sequence variation on the transcriptome itself. and fine tuning when this matters and when it is irrelevant as a predictive model falls under the auspices of the affymetrix spin-off perlegen. perlegen came into being in late to accelerate the development of high-resolution, whole genome scanning. and they have stuck to that purity of purpose. to paraphrase dragnet's sergeant joe friday, they focus on the facts of dna, just the dna. perlegen owes its true genesis to the desire of one of its cofounders to use dna chips to help understand the dynamics underlying genetic diseases. brad margus' two sons have the rare disease "ataxia telangiectasia" (a-t). a-t is a progressive, neurodegenerative childhood disease that affects the brain and other body systems. the first signs of the disease, which include delayed development of motor skills, poor balance, and slurred speech, usually occur during the first decade of life. telangiectasias (tiny, red "spider" veins), which appear in the corners of the eyes or on the surface of the ears and cheeks, are characteristic of the disease, but are not always present. many individuals with a-t have a weakened immune system, making them susceptible to recurrent respiratory infections.
about % of those with a-t develop cancer, most frequently acute lymphocytic leukemia or lymphoma, suggesting that the sentinel competence of the immune system is compromised. having a focus so close to home is a powerful driver for any scientist. his co-founder david cox is a polymath pediatrician whose training in the latter informs his application of the former in the development of patient-centered tools. from that perspective, perlegen's stated mission is to collaborate with partners to rescue or improve drugs and to uncover the genetic bases of diseases. they have created a whole genome association approach that enables them to genotype millions of unique snps in thousands of cases and controls in a timeframe of months rather than years. as mentioned previously, snp (single nucleotide polymorphism) markers are preferred over microsatellite markers for association studies because of their abundance along the human genome, their low mutation rate, and their accessibility to high-throughput genotyping. since most diseases, and indeed responses to drug interventions, are the products of multiple genetic and environmental factors, it is a challenge to develop discriminating diagnostics and, even more so, targeted therapeutics. because mutations involved in complex diseases act probabilistically (that is, the clinical outcome depends on many factors in addition to variation in the sequence of a single gene), the effect of any specific mutation is smaller. thus, such effects can only be revealed by searching for variants that differ in frequency among large numbers of patients and controls drawn from the general population. analysis of these snp patterns provides a powerful tool to help achieve this goal. although most bi-allelic snps are rare, it has been estimated that just over million common snps, each with a frequency of between and %, account for the bulk of the dna sequence difference between humans. such snps are present in the human genome once every base pairs or so.
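the case-control comparison described above (searching for variants that differ in frequency between patients and controls) can be illustrated with a minimal sketch. it assumes a simple allelic test on a 2x2 table of allele counts for a single snp; the function name `allele_chi_square` and the counts in the usage note are hypothetical, and nothing here is meant to represent perlegen's actual analysis pipeline.

```python
from math import erfc, sqrt


def allele_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    Rows are cases and controls, columns are the alternate and reference
    allele. Returns (chi2, p_value). Hypothetical helper for illustration.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    col_ref = case_ref + ctrl_ref
    # A degenerate table (an empty row or column) carries no information.
    if 0 in (row_case, row_ctrl, col_alt, col_ref):
        return 0.0, 1.0
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / (
        row_case * row_ctrl * col_alt * col_ref
    )
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Made-up counts: the alternate allele is more common in cases.
    chi2, p = allele_chi_square(60, 40, 40, 60)
    print(f"chi2={chi2:.2f} p={p:.4f}")
```

in a genome-wide setting a test of this kind would be repeated for every snp, which is why the stringency of the significance threshold matters so much when millions of variants are examined.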
as is to be expected from linkage disequilibrium studies, alleles making up blocks of such snps in close physical proximity are often correlated, resulting in reduced genetic variability and defining a limited number of "snp haplotypes," each of which reflects descent from a single, ancient ancestral chromosome. in cox's group, using high level scanning with some old-fashioned somatic cell genetics, constructed the snp map of chromosome . the surprising findings were blocks of limited haplotype diversity in which more than % of a global human sample can typically be characterized by only three common haplotypes (interestingly enough, the prevalence of each haplotype in the examined population was in the ratio : : . ). from this the conclusion could be drawn that, by comparing the frequency of genetic variants in unrelated cases and controls, genetic association studies could potentially identify specific haplotypes in the human genome that play important roles in disease, without need of knowledge of the history or source of the underlying sequence, a hypothesis they subsequently went on to prove. following cox et al.'s pioneering work on "blocking" chromosome into characteristic haplotypes, tien chen came to visit him from university of southern california, and following the visit his group developed discriminating algorithms which took advantage of the fact that the haplotype structure can be decomposed into large blocks with high linkage disequilibrium and relatively limited haplotype diversity, separated by short regions of low disequilibrium. one of the practical implications of this observation is, as suggested by cox, that only a small fraction of all the snps, which they refer to as "tag" snps, need be chosen for mapping genes responsible for complex human diseases, which can significantly reduce genotyping effort without much loss of power. they developed algorithms to partition haplotypes into blocks with the minimum number of tag snps for an entire chromosome.
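the idea of tag snps can be made concrete with a small sketch. it assumes a block of phased haplotypes written as equal-length strings of "0"/"1" alleles and uses a simple greedy heuristic to pick positions that together distinguish every distinct haplotype in the block. this is an illustrative stand-in, not the block-partitioning algorithm the passage refers to, and a greedy choice is not guaranteed to give the true minimum; the name `greedy_tag_snps` and the example haplotypes are hypothetical.

```python
from itertools import combinations


def greedy_tag_snps(haplotypes):
    """Choose SNP positions that tell apart all distinct haplotypes.

    haplotypes: equal-length strings of allele symbols, e.g. "00110".
    Returns a sorted list of 0-based positions (greedy, not provably minimal).
    """
    distinct = sorted(set(haplotypes))
    n_snps = len(distinct[0]) if distinct else 0
    # Every pair of distinct haplotypes still has to be separated.
    unresolved = set(combinations(range(len(distinct)), 2))
    tags = []
    while unresolved:
        def pairs_split(pos):
            # Number of unresolved pairs that differ at this position.
            return sum(
                1 for i, j in unresolved if distinct[i][pos] != distinct[j][pos]
            )

        # Distinct haplotypes differ somewhere, so the best position
        # always separates at least one pair and the loop terminates.
        best = max(range(n_snps), key=pairs_split)
        unresolved = {
            (i, j) for i, j in unresolved if distinct[i][best] == distinct[j][best]
        }
        tags.append(best)
    return sorted(tags)


if __name__ == "__main__":
    # Three common haplotypes over five SNPs; positions 0-2 are perfectly
    # correlated with each other, as are positions 3-4.
    block = ["00000", "11100", "00011", "00000"]
    print(greedy_tag_snps(block))
```

in the toy block, two positions are enough to identify which of the three haplotypes a chromosome carries, so the remaining three snps would not need to be genotyped.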
in they reported that they had developed an optimized suite of programs to analyze these block linkage disequilibrium patterns and to select the corresponding tag snps that will pick the minimum number of tags for the given criteria. in addition, the updated suite allows haplotype data and genotype data from unrelated individuals and from general pedigrees to be analyzed. using an approach similar to richard michelmore's bulk segregant analysis in plants of more than a decade previously, perlegen subsequently made use of these snp haplotype and statistical probability tools to estimate total genetic variability of a particular complex trait coded for by many genes, with any single gene accounting for no more than a few percent of the overall variability of the trait. cox's group has determined that fewer than total individuals provide adequate power to identify genes accounting for only a few percent of the overall genetic variability of a complex trait, even using the very stringent significance levels required when testing large numbers of dna variants. from this it is possible to identify the set of major genetic risk factors contributing to the variability of a complex disease and/or treatment response. so, while a single genetic risk factor is not a good predictor of treatment outcome, the sum of a large fraction of risk factors contributing to a treatment response or common disease can be used to optimize personalized treatments without requiring knowledge of the underlying mechanisms of the disease. they feel that a saturating level of coverage is required to produce repeatable prediction of response to medication or predisposition to disease and that taking shortcuts will for the most part lead to incomplete, clinically irrelevant results. in hinds et al. in science describe even more dramatic progress. they describe a publicly available, genome-wide data set of .
million common single-nucleotide polymorphisms (snps) that have been accurately genotyped in each of people from three population samples. a second public data set of more than million snps typed in each of people has been generated by the international haplotype map (hapmap) project. these two public data sets, combined with multiple new technologies for rapid and inexpensive snp genotyping, are paving the way for comprehensive association studies involving common human genetic variations. perlegen is basically taking to the next level fodor's stated reason for the creation of affymetrix: the belief that understanding the correlation between genetic variability and its role in health and disease would be the next step in the genomics revolution. the other interesting aspect of this level of coverage is, of course, that the notion of discrete identifiable groups based on ethnicity, centers of origin and such breaks down, and a spectrum of variation arises across all populations, which makes the perlegen chip, at one level, a true unifier of humanity but at another adds a whole layer of complexity for hmos! at the turn of the century, this personalized chip approach to medicine received some validation at a simpler level, in a disease area closely related to the one to which one fifth of a-t patients ultimately succumb, when researchers at the whitehead institute used dna chips to distinguish different forms of leukemia based on patterns of gene expression in different populations of cells. moving cancer diagnosis away from visually based systems to such molecular based systems is a major goal of the national cancer institute. in the study, scientists used a dna chip to examine gene activity in bone marrow samples from patients with two different types of acute leukemia: acute myeloid leukemia (aml) and acute lymphoblastic leukemia (all). then, using an algorithm developed at the whitehead, they identified signature patterns that could distinguish the two types.
when they cross-checked the diagnoses made by the chip against known differences in the two types of leukemia, they found that the chip method could automatically make the distinction between aml and all without previous knowledge of these classes. taking it to a level beyond where perlegen are initially aiming, eric lander, leader of the study, said that mapping not only what is in the genome, but also what the things in the genome do, is the real secret to comprehending and ultimately curing cancer and other diseases. chips gained recognition on the world stage in when they played a key role in the search for the cause of severe acute respiratory syndrome (sars) and probably won a macarthur genius award for their creator. ucsf assistant professor joseph derisi, already famous in the scientific community as the wunderkind originator of the online diy chip maker in pat brown's lab at stanford, built a gene microarray containing all known completely sequenced viruses ( , of them) and, using a robot arm that he also customized, in a three day period used it to classify a pathogen isolated from sars patients as a novel coronavirus. when a whole galaxy of dots lit up across the spectrum of known vertebrate coronaviruses, derisi knew this was a new variant. interestingly, the sequence had the hottest signal with avian infectious bronchitis virus. his work subsequently led epidemiologists to target the masked palm civet, a tree-dwelling animal with a weasel-like face and a catlike body, as the probable primary host. the role that derisi's team at ucsf played in identifying a coronavirus as a suspected cause of sars came to the attention of the national media when cdc director dr. julie gerberding recognized joe in a march , press conference and in when joe was honored with one of the coveted macarthur genius awards.
this and other tools arising from information gathered from the human genome sequence and complementary discoveries in cell and molecular biology, new tools such as gene-expression profiling, and proteomics analysis are converging to finally show that rapid robust diagnostics and "rational" drug design have a future in disease research. another virus that puts sars deaths in perspective benefitted from rational drug design at the turn of the century. influenza, or flu, is an acute respiratory infection caused by a variety of influenza viruses. each year, up to million americans develop the flu, with an average of about , being hospitalized and , to , people dying from influenza and its complications. the use of current influenza treatments has been limited due to a lack of activity against all influenza strains, adverse side effects, and rapid development of viral resistance. influenza costs the united states an annual $ . billion in physician visits, lost productivity and lost wages. and lest we still dismiss it as a nuisance, we would do well to remember that the "spanish" influenza pandemic killed over million people in and , making it the worst infectious pandemic in history, beating out even the more notorious black death of the middle ages. this fear has been rekindled as the dreaded h n (h for haemagglutinin and n for neuraminidase as described below) strain of bird flu has the potential to mutate and recognise homo sapiens as a desirable host. since rna viruses are notoriously faulty in their replication, this accelerated evolutionary process gives them a distinct advantage when adapting to new environments and therefore finding more amenable hosts. although inactivated influenza vaccines are available, their efficacy is suboptimal partly because of their limited ability to elicit local iga and cytotoxic t cell responses. the choices of treatments and preventions for influenza hold much more promise in this millennium.
clinical trials of cold-adapted live influenza vaccines now under way suggest that such vaccines are optimally attenuated, so that they will not cause influenza symptoms but will still induce protective immunity. aviron (mountain view, ca), biochem pharma (laval, quebec, canada), merck (whitehouse station, nj), chiron (emeryville, ca), and cortecs (london), all had influenza vaccines in the clinic at the turn of the century, with some of them given intra-nasally or orally. meanwhile, the team of gilead sciences (foster city, ca) and hoffmann-la roche (basel, switzerland) and also glaxowellcome (london) in put on the market neuraminidase inhibitors that block the replication of the influenza virus. gilead was one of the first biotechnology companies to come out with an anti-flu therapeutic. tamiflu™ (oseltamivir phosphate) was the first flu pill from this new class of drugs called neuraminidase inhibitors (ni) that are designed to be active against all common strains of the influenza virus. neuraminidase inhibitors block viral replication by targeting a site on one of the two main surface structures of the influenza virus, preventing the virus from infecting new cells. neuraminidase is found protruding from the surface of the two main types of influenza virus, type a and type b. it enables newly formed viral particles to travel from one cell to another in the body. tamiflu is designed to prevent all common strains of the influenza virus from replicating. the replication process is what contributes to the worsening of symptoms in a person infected with the influenza virus. by inactivating neuraminidase, viral replication is stopped, halting the influenza virus in its tracks. in marked contrast to the usual protracted process of clinical trials for new therapeutics, the road from conception to application for tamiflu was remarkably expeditious. 
in , gilead and hoffmann-la roche entered into a collaborative agreement to develop and market therapies that treat and prevent viral influenza. in , as gilead's worldwide development and marketing partner, roche led the final development of tamiflu. months after the first patient was dosed in clinical trials in april , roche and gilead announced the submission of a new drug application to the u.s. food and drug administration (fda) for the treatment of influenza. additionally, roche filed a marketing authorisation application (maa) in the european union under the centralized procedure in early may . six months later in october , gilead and roche announced that the fda approved tamiflu for the treatment of influenza a and b in adults. these accelerated efforts allowed tamiflu to reach the u.s. market in time for the - flu season. one of gilead's studies showed an increase in efficacy from % when the vaccine was used alone to % when the vaccine was used in conjunction with a neuraminidase inhibitor. outside of the u.s., tamiflu also has been approved for the treatment of influenza a and b in argentina, brazil, canada, mexico, peru and switzerland. regulatory review of the tamiflu maa by european authorities is ongoing. with the h n bird flu strain's relentless march (or rather flight) across asia, in through eastern europe to a french farmyard, an unwelcome stowaway on a winged migration, and no vaccine in sight, tamiflu, although untested for this species and seen as the last line of defense, is now being hoarded and its patented production rights fought over like an alchemist's formula. tamiflu's main competitor, zanamivir, marketed as relenza™, was one of a group of molecules developed by glaxowellcome and academic collaborators using structure-based drug design methods targeted, like tamiflu, at a region of the neuraminidase surface glycoprotein of influenza viruses that is highly conserved from strain to strain.
glaxo filed for marketing approval for relenza in europe and canada. the food and drug administration's accelerated drug-approval timetable began to show results by , its evaluation of novartis's gleevec took just three months compared with the standard - months. another factor in improving biotherapeutic fortunes in the new century was the staggering profits of early successes. in , $ . billion of the $ . billion in revenue collected by genentech in south san francisco came from oncology products, mostly the monoclonal antibody-based drugs rituxan, used to treat non-hodgkin's lymphoma, and herceptin for breast cancer. in fact two of the first cancer drugs to use the new tools for 'rational' design herceptin and gleevec, a small-molecule chemotherapeutic for some forms of leukemia are proving successful, and others such as avastin (an anti-vascular endothelial growth factor) for colon cancer and erbitux are already following in their footsteps. gleevec led the way in exploiting signal-transduction pathways to treat cancer as it blocks a mutant form of tyrosine kinase (termed the philadelphia translocation recognized in 's) that can help to trigger out-of-control cell division. about % of biotech companies raising venture capital during the third quarter of listed cancer as their primary focus, according to online newsletter venturereporter. by according to the pharmaceutical research and manufacturers of america, medicines were in development for cancer up from in . another new avenue in cancer research is to combine drugs. wyeth's mylotarg, for instance, links an antibody to a chemotherapeutic, and homes in on cd receptors on acute myeloid leukemia cells. expertise in biochemistry, cell biology and immunology is required to develop such a drug. this trend has created some bright spots in cancer research and development, even though drug discovery in general has been adversely affected by mergers, a few high-profile failures and a shaky us economy in the early 's. 
as the millennium approached, observers as diverse as microsoft's bill gates and president bill clinton predicted the st century would be the "biology century". by the many programs and initiatives underway at major research institutions and leading companies were already giving shape to this assertion. these initiatives have ushered in a new era of biological research anticipated to generate technological changes of the magnitude associated with the industrial revolution and the computer-based information revolution.
key: cord- -kvyes lz authors: baker, susan c.; jukneliene, dalia; purkayastha, anjan; snyder, eric e.; crasta, oswald r.; czar, michael j.; setubal, joao c.; sobral, bruno w. title: developing bioinformatic resources for coronaviruses date: journal: the nidoviruses doi: . / - - - - _ sha: doc_id: cord_uid: kvyes lz contract from nih-niaid to establish a national bioinformatics resource center (brc) to facilitate research on microbial pathogens.
as part of this initiative, vbi is developing the pathosystems resource integration center (patric), a multi-organism relational database to support infectious disease research, especially as it affects biodefense and research on emerging infectious diseases (http://patric.vbi.vt.edu). we expect patric to be used as a computational resource to gain insight into mechanisms of microbial pathogenesis and to hasten the development of improved vaccines, diagnostics, and therapeutics. the database will contain high-quality curated data: sequence annotations from published whole and partial genomes; relevant experimental data; metabolic pathway data; taxonomic data; literature citations; and a suite of visualization and analysis tools. research experts and members of the scientific community will be closely involved at each step of the curation/annotation process. vbi is curating information on a set of eight different pathogen classes that include both bacteria and viruses. included in this set is the genus coronavirus (family coronaviridae). at present we have archived the annotations of the coronavirus species. these include both whole-genome ( ) and partial-genome ( ) annotations. this sequence archive represents the initial step in our efforts to curate data on coronavirus species. we welcome active participation by the coronavirus research community in developing patric as a useful computational resource for infectious disease research. to facilitate the large-scale annotation/curation project that we have undertaken, we have built an annotation pipeline and associated curation tool interface. the annotation pipeline is composed of gene-prediction programs, similarity search algorithms, and protein structure and function prediction programs. the results of these programs and searches assembled by the annotation pipeline are used to propose biological features that are also stored in the curation database that uses the genomics unified schema (gus). 
the scenario for user interaction with the tools is presented in figure . during the manual curation/annotation process, the curation tool interface retrieves the results of the automated annotation process [along with the proposed biological features] and presents them to a curator. curators review the computational evidence in light of their collective expertise and accept proposed features or edit/remove them. patric genomes are organized into categories based on phylogenetic relationships. the simplest of these patric categories consists of a relatively small number of sequenced genomes from a bacterial or viral family or genus. for the purposes of defining minimal, non-redundant set of genes characteristic of the category, one genome (usually the best-known or best-characterized) is identified as the "reference genome"; the remaining members of the class are called "associated genomes." for example, the tor and urbani isolates were the first two sars coronavirus genomes to be sequenced and therefore were named as reference genomes. efforts are underway to coordinate our system of reference and associated genomes with the refseqs from ncbi. for each organism category, a "reference gene set" is constructed consisting of a single representative of each orthologous group and is built by progressive identification of unique genes from the category's genomes. the reference genome has the highest precedence and therefore contributes its entire gene complement to the reference gene set. the reference set is then compared at the protein level to the first associated genome and vice versa. genes from the associated genome identified as orthologs according to the "bidirectional best hit" test are annotated as such. this allows high-value, manually curated information from the corresponding reference genes to be automatically linked to the associated genes, provided minimal similarity criteria based on automated sequence analysis are satisfied. 
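the "bidirectional best hit" test described above can be sketched as follows. this is a minimal illustration rather than the patric implementation: it assumes protein-level similarity scores are already available as dictionaries keyed by (query, subject) pairs, and the gene names, scores and the `min_score` threshold are all hypothetical stand-ins for the unspecified "minimal similarity criteria".

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(ref_vs_assoc, assoc_vs_ref, min_score=0.0):
    """Return (reference gene, associated gene) pairs that are each
    other's best hit and meet the minimal score in both directions."""
    forward = best_hits({k: v for k, v in ref_vs_assoc.items() if v >= min_score})
    reverse = best_hits({k: v for k, v in assoc_vs_ref.items() if v >= min_score})
    return {(ref, assoc) for ref, assoc in forward.items()
            if reverse.get(assoc) == ref}


# Hypothetical scores: reference genome searched against an associated
# genome, and the associated genome searched back against the reference.
ref_vs_assoc = {("refS", "asS"): 950, ("refS", "asX"): 40, ("refN", "asN"): 700}
assoc_vs_ref = {("asS", "refS"): 940, ("asN", "refN"): 690, ("asX", "refS"): 45}

orthologs = bidirectional_best_hits(ref_vs_assoc, assoc_vs_ref, min_score=100)
print(sorted(orthologs))
```

in this toy case the associated gene asX has no reciprocal best hit in the reference genome, so it is not annotated as an ortholog of any reference gene.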
however, because the orthologous genes from the reference genome are already present in the reference gene set, only genes that fail the orthology test are added to the reference set. these genes are presumed to be novel and characteristic of the associated genome. this process is repeated for the remaining associated genomes. the gap is an automated system for annotating prokaryotic and viral genomes. it consists of two conceptual units, the genomic sequence analysis pipeline (gsap) and protein analysis pipeline (pap), and is configured using gapml, an xml-based pipeline description language. submission of a genomic sequence to the database triggers pipeline execution. analysis begins in the gsap with programs to identify trna, rrna, and protein-coding genes. the programs trnascan-se, blastn, glimmer, and genemark, respectively, make the gene predictions. the sequence is processed by the "putative gene interval" (pgi) parser to segment the genome into fragments containing a single gene. this breaks the genome into a manageable size for similarity searches and simplifies interpretation of their results. because noncoding sequence is included within pgis, genomic features such as putative rna secondary structures, transcription regulatory sequences, and other features are annotated and queued for curatorial review. curators make the final call on the predicted gene coordinates and translation and review the other results prior to submission to the gus database. the translations are then passed to the pap, where each is first classified with respect to the reference protein set. (figure: gene table with navigation bar and links to vbi pathinfo and ncbi taxonomy.) the information presented above reflects our immediate plans for basic genome annotation. this lays the foundation for our future work, which will include the analysis of metabolic and regulatory pathways and comparative genomics.
in addition, we plan to relate this information to rna and protein expression as data becomes available. ultimately, the goal of this work is to help the biomedical research community leverage genomic information to better understand the physiology of these organisms and their interaction with their human and animal hosts. in time, this will lead to improved treatment and prophylaxis of disease caused by these potentially deadly organisms. this project is funded by niaid / nih contract hhsn c to bruno sobral.
key: cord- -mel fxw authors: o'malley, maureen a.; bostanci, adam; calvert, jane title: whole-genome patenting date: - - journal: nat rev genet doi: . /nrg sha: doc_id: cord_uid: mel fxw gene patenting is now a familiar commercial practice, but there is little awareness that several patents claim ownership of the complete genome sequence of a prokaryote or virus. when these patents are analysed and compared to those for other biological entities, it becomes clear that genome patents seek to exploit the genome as an information base and are part of a broader shift towards intangible intellectual property in genomics.
| june | volume www.nature.com/reviews/genetics
to be treating genome patents as if they were nothing other than standard dna patents. however, further analysis reveals that patent specifications describing whole-genome inventions use arguments that imply that genomes are qualitatively different from individual genes. whole-genome patents also use different arguments from microorganism patents, which might be thought of as a similar category of 'whole' biological patent.
these distinctions are further complicated by the way in which the european patent office (epo) has dealt with genome patent applications, a treatment that leads our exploration of genome patenting to the key issue of how arguments for the utility of dna fragments apply in genome patents. genome patenting has emerged as an expression of the recent informational shift in genomics and patenting. this shift is of potential interest to several groups of parties and observers. for patent professionals, genome patenting gives an indication of how developments in genomics and bioinformatics might be changing the nature of patenting. for scientists, genome patents blur the supposed line between research and its applications, with implications for how research is financed and data shared. for social scientists, the interactions between genomics and the patent system are of great interest for understanding how society might benefit from the genomics revolution and how commercial interests might shape the future development of this science. finally, for philosophers of biology, genome patenting raises issues about the consequences of conceptualizing genomes as sequence information or biochemical material, and indeed, what the study of genomes means for our understanding of biological entities. although a key tenet of patent law is that naturally occurring substances cannot be patented, substances that have been isolated and purified, such as dna, can be patented as long as they fulfil the criteria for patentability. in the united states, the basic criteria are novelty, non-obviousness and utility; in europe, the equivalent criteria are novelty, inventive step and industrial applicability. an invention is novel if it has not previously been made public. even if some gene sequences in a genome have already been published, genome sequences could be argued to be novel because not all features of the invention have previously been disclosed in a single publication.
non-obviousness or inventive step means that the invention would require more than a routine procedure by an individual who is "skilled in the art". dna patents have been subjected to heavy criticism from lawyers, scientists and the public for inadequately fulfilling the utility requirement. some of the strongest objections have been against attempts to patent ests for their use as probes in gene discovery. following public consultation, the uspto has recently tightened its assessment criteria. rather than just being generally useful, applications must now show "specific, substantial, and credible" utility . once the function of a gene is disclosed (producing a specific protein, for example), it is considered to have such a use. the epo has adopted similar standards . furthermore, epo patent applications must satisfy a "unity of invention" requirement, the implementation of which is currently being considered at the uspto. this standard allows several sub-inventions to be linked together by a common "general inventive concept", but prevents unrelated inventions from succeeding as a single
news of genome patenting is often met with surprise, disbelief or dismissal. nevertheless, several whole-genome patents have been issued by the us patent and trademark office (uspto) and further applications are pending. although gene patenting has been challenged on ethical grounds and in regard to data access and criteria for patentability , , , whole-genome patenting has so far gone almost unnoticed.
even the recent controversy surrounding patent applications for the genome sequence of the sars-associated coronavirus , (see also online links box) is primarily concerned with whether patenting is an appropriate and effective way to control access to data and stimulate research. the sars discussion does not address the implications of patenting a whole genome instead of the more common patenting of dna fragments. perhaps genome patents have escaped scrutiny because, at least superficially, they seem to be no more than simple extensions of the dna patenting that has flourished with the increased ease of entire genome sequencing. at first glance, patent offices certainly do seem whole-genome patenting general advantages of having a whole genome as the invention: the "clarification of the structure" of the genome (for example, adult t-cell leukaemia virus in table ), identification of constituent genes (for example, haemorrhagic enteritis virus (hev) in table ) and the capacity to distinguish similar genomes (for example, nanb). genome patents in the second category direct their claims to specified open reading frames (orfs) or polynucleotides, but do so in the context of a broader specification of the invention that argues for the whole sequenced genome as an integral part of the invention. the patents for haemophilus influenzae and mycoplasma genitalium were first filed as claim-specific genome patent applications, but during examination, their potentially far-reaching claims were restricted to specific orfs. this restriction was probably the result of objections by uspto examiners. the specifications of these patents, which do not normally change after filing, still persistently refer to the whole-genome sequence (and any sequence that is . % similar) as comprising or providing the basis of the invention. 
likewise, in the methanococcus jannaschii patent, the summary of the invention begins with the whole-genome sequence, after which the invention is further directed to the orfs in the claims. the claims of the two virus genome patents in this category genomic fragments but embed these claims in a broader argument that the whole genome constitutes a useful invention (table ) . it is not the legal status of any of these patents that concerns us (in fact, they are still untested in court), but the ways in which they make their arguments and show how genomes have been used in recent patenting. claim-specific whole-genome patents. the first category of genome patents, claim-specific whole-genome patents, places the whole sequence of a specified genome in the claims section as the primary invention. for example, the patent for bacteriophage rm claims the isolated and sequenced genome of the phage, as well as any recombinantly produced dna. although the buchnera sp. strain aps patent has only one claim -the isolated genome as represented by its sequence description -the other claim-specific genome patents extend their claims to cdna, proteins, vectors and host cells (as would most patents for dna fragments). utilities for genome patents in this category range from disease diagnosis and therapy to the development of thermostable enzymes and pesticides (table ) . although these utilities are often elaborated in relation to particular dna fragments or encoded polypeptides (for example, non-a non-b hepatitis virus (nanb) in table ), the descriptions of the inventions avoid excluding other genomic fragments from the overall invention by regularly invoking the rest of the genome. arguments are also made for the application. if, for example, a group of dna fragments or sequences can be linked together by an overarching concept, they can then be covered by one patent. 
it seems reasonable to think of the genome as a concept with the potential for serving that unifying function, and patent documents provide a good basis for examining the extent to which this suggestion is supported in patent practice. a few publications mention the existence or prospect of whole-genome patenting , , , but do not discuss the actual patents. to pursue the cross-disciplinary implications of such a practice, we searched the online databases of the uspto and the esp@cenet worldwide database of the epo with terms such as 'whole genome' or 'complete nucleotide sequence'. we discarded all search results that only asserted specific dna fragments without significant reference to the entire genome and those in which the genomes had been modified before sequencing. we ended up with whole-genome patents that were issued (making no claims to exhaustiveness), of which were for viral genomes and the remainder for prokaryotic genomes. the patents fell into two categories. in the first, the genome is the primary object of the patent's claims section (table ). this section is legally the most significant in defining the protection that the patent provides. patents in the second category, which we call 'contextual whole-genome patents', list
patents are usually based on the genomes of well-known but previously unsequenced organisms. the main difference between whole-genome and organism patents is the extent to which the patent attempts to cover further biological material. both categories of genome patent describe their inventions in terms that stretch to any and every nucleotide and polypeptide implicated by the sequence, as well as to vectors and host cells.
the only organism patents that follow this strategy are the few that refer to the dna of the specified organism, either to extend the coverage to further biological material , or to encompass all organisms in the same genus with a certain percentage of sequence similarity . in these cases, it seems that the dna and its potential uses are called on to reinforce the organism patent and expand the protection it provides. genome patents take this strategy one step further by claiming the complete sequence. the obvious question is whether a genome patent achieves more protection for the inventor than does a patent on a collection of dna fragments. archived epo examiners' reports (on the epo online public file inspection page) of past and present applications for whole-genome similarly focus on particular sequence fragments but base their inventions on the whole genome. the inventions of all of these patents are described very broadly, covering an extensive range of related biological material and its demonstrated and postulated uses. the three tigr/hgs (the institute of genomic research and human genome sciences) patents also originally claimed the computer-readable sequence as the invention. these claims now exist only in the specifications, which describe the "computer-related embodiments" of the genome as "a contiguous string of primary sequence information" suitable for storage and analysis by computer. having the genome in this in silico form, argue these patents, allows scientists to move beyond a gene-by-gene approach towards larger discoveries of genomic structure, function and evolutionary history, as well as the identification of "commercially important fragments". 
although the virus patents do not specify the in silico or machine-readable nature of the invention, they too rely on the knowledge provided by the whole-genome sequence for purposes such as engineering plant resistance to the virus or for establishing boundaries between strains (for example, maize chlorotic dwarf virus (mcdv-tn) in genomes are often thought of in a holistic way, which makes it logical to compare them to whole organisms. patenting microbes is a long-established practice, not only for modified microorganisms , but also for naturally occurring strains that have been isolated and cultured . all the genome patents listed in tables , are for microorganismal genomes (we include viral genomes in this category). the tradition of allowing microorganism patents might partly explain the absence of any patented eukaryote genomes and also why whole-genome patenting has not given rise to particular public concern. but what is the difference between whole-genome patents and patents that have been issued on whole microorganisms? not surprisingly, the utility arguments made for microorganism patents [ ] [ ] [ ] [ ] [ ] [ ] [ ] are generally similar to those for genome patents (tables , ). microorganism patents for archaea, for example, argue for utilities that are related to enzyme production in harsh environmental conditions, and bacteria patents claim applications that range from human health to bioremediation. however, novelty is established in subtly different ways. organism patents are based on previously unknown organisms that might have been isolated in unusual circumstances, whereas genome applications for the sars genome provoked worries that a privately held patent would function as a gatekeeper to all sars-related research and inhibit drug development. it is likely that similar fears would arise with any increase in the numbers and awareness of issued genome patents.
although empirical work on the impact of conventional dna patents shows that they do not always have negative effects on research access , , informational patents could be much more restrictive . anticipations of genome patenting could extend to questions about whether prokaryotic genome patents set precedents for eukaryotic genome patents (our nonexhaustive search found none of these, nor did we find any in a search of pending whole-genome patent applications). we see two general factors that inhibit this potential trend: tradition and genomic organization. as we noted earlier, prokaryotic microorganism patenting is well established, but unmodified multicellular organisms are far less commonly the objects of patents. we believe that resistance to the patenting of 'higher life forms' -including genetically modified ones such as the oncomouse -is likely to similarly discourage patenting the genomes of these organisms. at the human level, ethical arguments have been made stating that patenting a whole human genome violates the integrity of an individual in a way that patenting parts of the genome does not . differences between the uspto and other patent offices on the patentability of life forms will no doubt continue to be reflected in the international treatment of future whole-genome applications. moreover, the genomes of most eukaryotes have proportionately less protein-coding dna, meaning that it is more difficult to assign function to large amounts of sequence. once the surprise that whole-genome patents exist has dissipated, it might be tempting to conclude that such patents are either so few or so weak that their existence does not matter. however, the characteristics of whole-genome patenting indicate an important movement in dna patenting from biochemical tools and products to information resources , , . 
had the computer embodiments remained in the tigr/hgs claims, these patents would have attempted to control information in a way not yet realized in patent practice. as bioinformatics and in silico modelling gain deeper and more extensive purchase on every aspect of genomic science , a full shift to allow claims on intangible informational property seems inevitable. so far, there seem to have been no obvious commercial benefits from whole-genome patents, but industry has a long way to go before it catches up with all the dna patenting of the past two decades. harbingers of how this trend might develop can be seen in new patents for computer programs and business methods. in discussions of how commercial protection is increasingly being sought for the information that is produced by bioinformatics , , unannotated genome sequences (that is, primary information) are considered less patentable than secondary information about how gene products might interact in a cell. as yet, there have been no high-profile cases in which genome patents have been publicly or legally challenged, although the epo examiners' reports give an indication that future genome patents might be treated sceptically in europe with respect to unity of invention and novelty. because the validity of genome patents has yet to be tested in court, the extent to which they will restrict research on the patented genomes is still only a matter for informed speculation. patent patents take one step towards answering this question. most of these reports agree that a genome sequence is novel even when parts of the genome have already been sequenced, although some dispute the novelty of newly sequenced genomes from closely related strains. one examiner, for example, objects that a submitted genome sequence (chlamydia pneumoniae) is merely a definition of a particular strain from the many isolates available, thereby questioning the patentability of genomic variation. 
in another case (influenza a), the report notes that the identified genes are well known from other related strains and that their presence should therefore be expected. both these reports also argue that the sequencing of a genome is routine and does not in itself entail an inventive step. most tellingly, some of the reports argue that genomes can only be considered as a single invention if they share a uniting feature that is novel, inventive and technically relevant. according to the examiners' comments, the application examples of influenza a and c. pneumoniae do not constitute a unified invention that would meet epo standards. our initial hypothesis that the idea of a genome is sufficient to unite several genes and their functions into one invention is not, therefore, supported by these examples. a genome, at least in these cases, does not have the taken-for-granted unifying capacity of an organism: it is seen as merely a collection of fragments of dna. if there are any qualitative differences between patents for whole genomes and those for dna fragments, it seems likely that they will be found in the utility arguments -the most contested feature of recent gene patenting. are any special uses attributed to genomes that are not attributed to isolated fragments of dna? both claim-specific and contextual genome patents rely on the utility of the information provided by the whole sequence. such information is considered to be valuable because it allows better understanding of the organism, of specific genes, of chromosome structure and function, and of relationships with other genomes. these genomic utilities seem to be primarily research-orientated, in contrast to commercial applications that might arise directly out of the more specific biochemical functions attributed to genes and gene products. 
by exploiting the whole genome as the informational basis of the invention, these patents distinguish themselves from standard dna fragment patents that articulate their inventions as compositions of matter, analogous to chemical patents .
bioremediation: the use of microorganisms to degrade hazardous contaminants in soil and water to environmentally safe levels.
coronavirus: a genus of virus named after the projections that create a crown-effect around the outside of each virus particle. they infect various mammals and birds, causing respiratory and enteric illness. the sars-associated coronavirus is a previously unrecognized member of the genus with no close genetic relationship to known coronavirus sequences.
oncomouse (also known as the harvard mouse): a type of laboratory mouse that is genetically modified to carry genes that increase susceptibility to cancer (oncogenes).
therefore, genomic organization militates against the success of eukaryote genome patenting. if the complete genomes of complex multicellular organisms are ever to be commonly patented, it will probably be as informational components that are incorporated into system models that have diagnostic and other purposes. overall, the aspect of whole-genome patenting that lends itself most readily to investigation is conceptual. all the patents we have identified raise important questions about how genomes are conceptualized, especially in regard to how the utility of a genome can be specified. is the genome just a concept that is used to unite several nucleotide sequences into a single invention, or is it a causally efficacious phenomenon that does something more than an aggregation of genes can do? what is the relationship between the utility of a part (a gene) and any utility associated with the whole (the genome)?
the answers to these questions will be different depending on whether the genomes are thought of in terms of biochemistry or bioinformatics. when the relationship between organism and genome patents is examined, further conceptual questions arise, especially in terms of classification. is the genome sequence the representative of the organism? the genomic mosaicism of many viruses and microbes makes the construction of taxonomic relationships very complex, and reducing this complexity to single measures of overall relatedness is likely to obscure biologically meaningful connections , . existing whole-genome patents not only settle for simple measures of genomic relatedness, but do so inconsistently. some use sequence differences between strains as the basis of their genome-patent claim (for example, mcdv-tn), whereas others discount such variations between strains by arguing that the patent covers other sequences within a certain range of similarity (for example, m. genitalium, h. influenzae and hev). the informational patenting of genomic variation could have the benefit of bringing about better patent recognition of the complexity of genomic relationships. our overview of current practices of whole-genome patenting shows how these patents raise fundamental questions about genome utility, classification and the ownership of intangible biological information. all these issues mean that the future of genome patenting should be carefully watched by scientists, as much as by legal theorists, social scientists and philosophers of biology -not to mention the patent owners themselves. 
patents in the knowledge-based economy dna patents and scientific discovery and innovation: assessing benefits and risks dna patents and human dignity lifeform patents: the high and the low imbroglios of viral taxonomy: genetic exchange and failings of phenetic approaches prokaryotic evolution in the light of gene transfer human adult t-cell leukemia virus: complete nucleotide sequence of the provirus genome integrated in leukemia cell dna molecular cloning of the human hepatitis c virus genome from japanese patients with non-a, non-b hepatitis the complete dna sequence and genome organization of the avian adenovirus, hemorrhagic enteris virus genome sequence of the endocellular bacterial symbiont of aphids buchnera sp nucleotide sequence and taxonomy of maize chlorotic dwarf virus within the sequiviridae the s rna segment of tomato spotted wilt virus has an ambisense character tomato spotted wilt virus encodes a putative rna polymerase the nucleotide sequence of the m rna segment of tomato spotted wilt virus, a bunyavirus with two ambisense rna segements the minimal gene complement of mycoplasma genitalium whole-genome random sequencing and assembly of haemophilus influenzae rd complete genome sequence of the methanogenic archaeon, methanococcus jannaschii can patents deter innovation? the anticommons in biomedical research organisation for economic co-operation and development genetic inventions, intellectual property rights, and licensing practices: evidence and policies sars genome patent: symptom or disease? natural substances and patentable inventions reforming the patent system us patent and trademarks office utility examination guidelines intellectual property rights and genetics: a study into the impact and management of intellectual property rights within the healthcare system patents, genomics, research and other diagnostics world trade organization. 
the results of the uruguay round of multilateral trade negotiations: the legal texts thermococcus av and enzymes produced by the same thermopallium bacteria and enzymes obtainable therefrom haloalkaliphilic microorganisms. uspto bacillus thuringiensis isolates active against weevils a history of patenting life in the united states with comparative attention to europe and canada (citation of judge g.s. rich's published legal opinion therein) (office for official publications of the european communities re-examining the role of patents in appropriating the value of dna sequences bioinformatics -a patenting view are you ready for the revolution? living in an (imm)material world: bioinformatics and intellectual property protection we would like to thank w.f. doolittle and two anonymous referees for several useful suggestions. the research for this paper was supported by the economic and social research council (esrc), uk, as part of the programme of the esrc's centre for genomics in society (egenis). a.b. also acknowledges a travel grant from the chemical heritage foundation. the authors declare no competing financial interests. key: cord- -tjmx msm authors: sardar, rahila; satish, deepshikha; birla, shweta; gupta, dinesh title: comparative analyses of sar-cov genomes from different geographical locations and other coronavirus family genomes reveals unique features potentially consequential to host-virus interaction and pathogenesis date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: tjmx msm the ongoing pandemic of the coronavirus disease (covid- ) is an infectious disease caused by severe acute respiratory syndrome coronavirus (sars-cov ). 
we have performed an integrated sequence-based analysis of sars-cov genomes from different geographical locations in order to identify unique features that are absent in sars-cov and other related coronavirus family genomes and that confer unique infection, transmission, virulence and immunogenic properties on the virus. the phylogeny of the genomes yields some interesting results. systematic gene level mutational analysis of the genomes has enabled us to identify several unique features of the sars-cov genome, which include a unique mutation in the spike surface glycoprotein (a v ( c>t)) in the indian sars-cov , absent in other strains studied here. we have also predicted the impact of the mutations on spike glycoprotein function and stability, using a computational approach. to gain further insights into host responses to viral infection, we predict that antiviral host-mirnas may be controlling the viral pathogenesis. our analysis reveals nine host mirnas which can potentially target sars-cov genes. interestingly, the nine mirnas do not have targets in sars and mers genomes. also, hsa-mir- b is the only unique mirna which has a target gene in the indian sars-cov genome. we also predicted immune epitopes in the genomes.
the first case of covid- was reported in december at wuhan (china), and the disease has since spread worldwide to become a pandemic, with the highest number of deaths in italy, though initially the maximum mortality was reported from china ( ). according to a who report, as of th march there were confirmed , covid- cases and cases of deaths, including cases which were locally transmitted or imported ( ). there are published reports which suggest that sars-cov shares the highest similarity with bat sars-cov. scientists across the globe are trying to elucidate the genome characteristics using phylogenetic, structural and mutational analysis.
a recent paper identified specific mutations in the receptor binding domain (rbd) of the spike protein, which is the most variable part of the coronavirus genome ( ). there are more than sars-cov assembled genomes available at the ncbi database. sequence analysis of the genomes can give us a plethora of information which can be of use for drug and vaccine development research. in the current work we collected sars-cov genomes from different geographical origins, mainly from india, italy, usa, nepal and wuhan, to identify notable genomic features of sars-cov by integrated analysis. these analyses include identification of notable mutational signatures, host antiviral-mirna identification and epitope prediction. as a host defense mechanism, a repertoire of host mirnas also targets invading viruses. we followed the parameters used in various anti-viral mirna databases to predict host anti-viral mirnas against sars-cov . our analysis shows unique host-mirnas targeting sars-cov virus genes.
respectively, were retrieved from the ncbi genome database. sars-cov genomes from india, italy, usa and nepal, along with sars-cov and mers, were used as query genomes to compare with the wuhan sars-cov genome. genes and protein sequences of sars-cov were retrieved from the vipr database ( ). all assembled query genomes in fasta format were analyzed using genome
to understand the variation in genomes from the various geographical areas used in the study, we performed a phylogenetic analysis. the neighbor-joining method with a bootstrap value of replicates was used for the construction of a consensus tree using mega software ( ) ( . . version). the cello go ( ) server was used to infer the biological function of each protein of the sars-cov genome, along with its localization prediction. the mutations reported in the literature ( ) were catalogued and evaluated for pathogenicity. we used the mutpred ( ) server to distinguish disease-associated amino acid substitutions from neutral substitutions, with a p-value of >= . .
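the bootstrap step behind the neighbor-joining consensus tree described above can be illustrated with a minimal python sketch. it is not the mega implementation: the toy alignment, the function names and the replicate count of 100 are assumptions chosen for illustration, and tree construction itself is left out. each replicate resamples alignment columns with replacement and recomputes pairwise p-distances, which is the kind of input a neighbor-joining routine would consume.

```python
import random


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gap sites skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Pairwise p-distances for every ordered pair of sequence names."""
    names = sorted(alignment)
    return {(i, j): p_distance(alignment[i], alignment[j]) for i in names for j in names}


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping rows aligned."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


# Hypothetical toy alignment; real genomes would be aligned first.
toy = {
    "wuhan": "ATGCTAGCTA",
    "india": "ATGCTAGCTT",
    "italy": "ATGTTAGCTA",
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_replicates = 100      # assumed value, not the number used in the study
replicates = [distance_matrix(bootstrap_replicate(toy, rng)) for _ in range(n_replicates)]
```

each matrix in `replicates` would be passed to a tree-building routine, and the support for a clade is the fraction of replicate trees in which that clade appears.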
in order to assess the impact of snps on protein stability, we used two machine learning based prediction methods. the first method, the i-mutant server ( ), was used to predict the stability of the protein sequences at ph . and temperature ˚c. the second method is the mupro ( ) server; its predictions, taken together with those of the first method, help in reaching a consensus prediction. to predict host mirnas targeting the virus, we collected a list of experimentally verified antiviral mirnas with their targets from the virmirna database ( ). only these host mirnas were processed for downstream analysis (figure ). to identify potential host microrna target sites in the virus genome sequences, we used miranda ( . a version) ( , ) software, with an energy threshold of - kcal/mol. we also used the psrnatarget server to compare the targets predicted by the two methods ( ). all the genes and protein sequences for sars-cov were retrieved from the vipr database. to identify ctl and b-cell epitopes we used the ctlpred ( ) and abcpred ( ) servers with default parameters. chemopred ( ) and the vaxijen server ( ) were used to predict chemokines and probable protective antigens, respectively ( figure (c) ).
assembled sars-cov genome sequences in fasta format from india, usa, china, italy and nepal were used for coronavirus typing tool analysis. using the tool, we were able to locate query sars-cov genomes with known sars-cov to obtain a cladogram for evolutionary analysis as shown in figure . several mutations are revealed when sars-cov and sars-cov spike glycoproteins are compared. six frameshift mutations and an insertion in the genome that corresponds to s _q inssdld ( _ insagtgaccttgac) ( table s ) were also revealed. we also observed that there are several mutations located in the regions associated with high immune response (table s ). from the snp analysis we observed that all the mutations might bring about a decrease in stability without changing their properties, i.e. hydrophobicity to hydrophilicity or vice versa.
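the consensus reasoning in this paragraph (two stability predictors, plus a check on whether a substitution switches between hydrophobic and hydrophilic residues) can be sketched as follows. this is a minimal illustration and not the servers' own logic: the class and function names, the example values, the hydrophobic residue grouping, and the convention that a negative predicted value means decreased stability are all assumptions made for the sketch.

```python
from dataclasses import dataclass

# Simple illustrative grouping of hydrophobic residues (an assumption, not the study's definition).
HYDROPHOBIC = set("AVILMFWC")


@dataclass
class Substitution:
    wild: str            # wild-type residue, one-letter code
    position: int        # residue position in the protein
    mutant: str          # substituted residue, one-letter code
    ddg_method_a: float  # predicted stability change from the first predictor
    ddg_method_b: float  # predicted stability change from the second predictor


def stability_call(ddg):
    """Assumed convention: negative values mean decreased stability."""
    return "decrease" if ddg < 0 else "increase"


def consensus_call(sub):
    """Report a call only when both predictors agree on the direction."""
    a = stability_call(sub.ddg_method_a)
    b = stability_call(sub.ddg_method_b)
    return a if a == b else "no consensus"


def changes_hydropathy_class(sub):
    """True if the substitution moves between the hydrophobic and non-hydrophobic groups."""
    return (sub.wild in HYDROPHOBIC) != (sub.mutant in HYDROPHOBIC)


# Hypothetical substitutions with made-up predictor values.
examples = [
    Substitution("A", 10, "V", -0.8, -0.3),
    Substitution("L", 20, "S", -1.2, 0.4),
]

for sub in examples:
    print(sub.wild, sub.position, sub.mutant,
          consensus_call(sub), changes_hydropathy_class(sub))
```

requiring agreement between both predictors is a conservative choice: a substitution on which the two methods disagree is reported as having no consensus instead of being assigned to either class.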
the l y mutation is predicted to alter the ordered interface, disordered interface, stability and transmembrane protein properties, with a gain of gpi-anchor amidation at n position (table ). it is known, and also confirmed by gene ontology analysis, that the protein is involved in pathogenesis, membrane organization, reproduction, symbiosis, encompassing mutualism through parasitism, and locomotion. psrnatarget analysis, based on complementary matching between the srna sequence and the target mrna sequence with a predefined scoring schema, identified mirnas out of the identified mirnas to target sars-cov genes. the mirnas are predicted to act on the viral genomes by cleaving their target sites (table ). intriguingly, our analysis (s. figure ) revealed that there is only a single host mirna
we have used bioinformatics tools to investigate sars-cov sequences from different geographical locations. the phylogenetic analysis of the genomes, the nucleotide sequence diversity analysis of the genomes, the predicted antiviral host mirnas specific to the genomes and the prediction of immune active sequences in the genomes have yielded some interesting facts, including unique features. for the phylogenetic analysis, we compared the sequences of sars-cov isolates from different locations, namely wuhan, india, italy, usa and nepal, along with other coronavirus species ( figure ). as reported earlier ( , ), the virus from wuhan showed higher similarity with sars-cov. there was no phylogenetic segregation of the genomes based on geographic origin, whether from the same continent or a neighboring country (figure ); instead, the clustering varied, with italy and nepal clustered together, followed by india and usa. this reiterates the findings indicating the massive exchange and importation of carriers between the epicenter wuhan and these countries.
however, a detailed analysis, complemented with more sequences and patient metadata, will give further evolutionary insights regarding the fast spreading pandemic. the phylogenetic heterogeneity between different strains is explored by genome variation profiling to find alterations in genetic information during the course of evolution, outbreak, and clinical spectrum caused by the different strains. in the case of sars-cov and sars-cov too, few clinical characteristics differentiate them from each other and from other seasonal influenza infections, as reported recently ( ). interestingly, in the present analysis, in comparison to sars-cov, we observed at least one variation such as an indel, deletion, misalignment or frameshift in all the sars-cov proteins except orf , orf and orf (table s ). the ( ) . in line with the expectations for a rapidly transmitting pandemic virus, we observed various mutations located in the regions associated with immune response (table s ). these mutations may have a significant impact on the antigenic and immunogenic changes responsible for differences in the severity of the outbreak in different geographical regions. to gain further insights, we compared the genetic mutation spectrum identified in the four countries, namely usa, italy, india and nepal. surprisingly, the mutation spectra were different among these countries ( ( ), combined with other factors, a speculation which may be verified with more evidence. from this analysis, we also speculate that the presence of a country-specific mutation spectrum may help to explain the current scenario in these countries, such as severity of illness, containment of the outbreak, and the extent and timing of exposures to a symptomatic carrier. non-structural proteins have their specific roles in replication and transcription ( ). previous studies on sars-cov revealed nsp as a potential candidate for the therapeutic target ( ).
it is noteworthy to mention that in the present study; various mutations have been identified in all the non-structural proteins suggesting them to be an important and potential player in proposing therapeutic targets and should be explored experimentally. many studies have reported that mirnas not only act as the signature of tissue expression and function but also as potential biomarkers playing important role in regulating disease pathophysiology ( ) . in viral infections, host antiviral mirnas play a crucial role in the regulation of immune response to virus infection depending upon the viral agent. many known human mirnas appear to be able to target viral genes and their functions like interfering with replication, translation and expression. in the present study, we tried to predict the antiviral host-mirnas specific for ( ) . also there are studies on the regulatory role of mirna hsa-mir- b- p described in ace signaling ( ) . the results of the present study suggest a strong correlation between mirna hsa-mir- b- p and ace which needs to be confirmed experimentally in sars-cov cases. further, we tried to compare the mirnas in the genomes and observed some striking findings. we observed that out of all the mirnas, hsa-mir- b is the only unique based on our analysis, we speculate an important regulatory role of mir- b in sars-cov infection. the contradictory treatment outcomes may be due to the presence of the mir- b target in the indian genome specifically. it probably indicates that the specific genetic and mirna spectrum should be considered as the basis of the treatment management. the findings in the study have revealed unique features of the sars-cov genomes, which may be explored further. 
for example, one may analyse the link between severity of diseases to each of the variants, expression of the predicted host antiviral mirnas can be checked in the patients, the predicted epitopes may be explored for their immunogenicity, difference in treatment outcomes may also be correlated with genome variations, lastly the potential of the unique segments of the virus proteins and the unique host mirnas may be explored in development of novel antiviral therapies. probable pangolin origin of sars-cov- associated with the covid- outbreak coronavirus disease (covid- ). situation report - the proximal origin of sars-cov- vipr: an open bioinformatics database and analysis resource for virology research genome detective coronavirus typing tool for rapid identification and characterization of novel coronavirus genomes molecular evolutionary genetics analysis across computing platforms. molecular biology and evolution cello go: a web server for protein subcellular localization prediction with functional gene ontology annotation automated inference of molecular mechanisms of disease from amino acid substitutions : predicting stability changes upon mutation from the protein sequence or structure prediction of protein stability changes for single-site mutations using support vector machines virmirna: a comprehensive resource for experimentally validated viral mirnas and their targets. database : the journal of biological databases and curation the microrna.org resource: targets and expression. nucleic acids research pamirdb: a web resource for plant mirnas targeting viruses. 
scientific reports psrnatarget: a plant small rna target analysis server prediction of ctl epitopes using qm, svm and ann techniques prediction of continuous b-cell epitopes in an antigen using recurrent neural network prediction and classification of chemokines and their receptors vaxijen: a server for prediction of protective antigens, tumour antigens and subunit vaccines systematic comparison of two animal-to-human transmitted human coronaviruses: sars-cov- and sars-cov a novel coronavirus from patients with pneumonia in china composition and divergence of coronavirus spike proteins and host ace receptors predict potential intermediate hosts of sars-cov- recent progress in the discovery of inhibitors targeting coronavirus proteases coronavirus nonstructural protein mediates evasion of dsrna sensors and limits apoptosis in macrophages extracellular mirnas: the mystery of their origin and function. trends in biochemical sciences are patients with hypertension and diabetes mellitus at increased risk for covid- infection? the lancet respiratory medicine the ace /apelin signaling, micrornas, and hypertension. international journal of hypertension regulation of cyclin t and hiv- replication by micrornas in resting cd + t lymphocytes interferon-beta and interferon-gamma synergistically inhibit the replication of severe acute respiratory syndrome-associated coronavirus (sars-cov), virology diagnosis and treatment of novel coronavirus infection in children: a pressing issue key: cord- -r te xob authors: balloux, francois; brønstad brynildsrud, ola; van dorp, lucy; shaw, liam p.; chen, hongbin; harris, kathryn a.; wang, hui; eldholm, vegard title: from theory to practice: translating whole-genome sequencing (wgs) into the clinic date: - - journal: trends microbiol doi: . /j.tim. . . sha: doc_id: cord_uid: r te xob hospitals worldwide are facing an increasing incidence of hard-to-treat infections. 
limiting infections and providing patients with optimal drug regimens require timely strain identification as well as virulence and drug-resistance profiling. additionally, prophylactic interventions based on the identification of environmental sources of recurrent infections (e.g., contaminated sinks) and reconstruction of transmission chains (i.e., who infected whom) could help to reduce the incidence of nosocomial infections. wgs could hold the key to solving these issues. however, uptake in the clinic has been slow. some major scientific and logistical challenges need to be solved before wgs fulfils its potential in clinical microbial diagnostics. in this review we identify major bottlenecks that need to be resolved for wgs to routinely inform clinical intervention and discuss possible solutions. thanks to progress in high-throughput sequencing technologies over the last two decades, generating microbial genomes is now considered neither particularly challenging nor expensive. 
as a result, whole-genome sequencing (wgs) (see glossary) has been championed as the obvious and inevitable future of diagnostics in multiple reviews and opinion pieces dating back to [ ] [ ] [ ] [ ] . despite enthusiasm in the community, wgs diagnostics has not yet been widely adopted in clinical microbiology, which may seem at odds with the current suite of applications for which wgs has huge potential, and which are already widely used in the academic literature. common applications of wgs in diagnostic microbiology include isolate characterization, antimicrobial resistance (amr) profiling, and establishing the sources of recurrent infections and between-patient transmissions. all of these have obvious clinical relevance and provide case studies where wgs could, in principle, provide additional information and even replace the knowledge obtained through standard clinical microbiology techniques. this review reiterates the potential of wgs for clinical microbiology, but also its current limitations, and suggests possible solutions to some of the main bottlenecks to routine implementation. in particular, we argue that applying existing wgs pipelines developed for fundamental research is unlikely to produce the fast and robust tools required, and that new dedicated approaches are needed for wgs in the clinic.

highlights: in principle, wgs can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. however, despite this promise, the uptake of wgs in the clinic has been limited to date, and future implementation is likely to be a slow process. the increasing information provided by wgs can cause conflict with traditional microbiological concepts and typing schemes. decreasing raw sequencing costs have not translated into decreasing total costs for bacterial genomes, which have stabilised. existing research pipelines are not suitable for the clinic, and bespoke clinical pipelines should be developed.

at the most basic level, wgs can be used to characterize a clinical isolate, informing on the likely species and/or subtype and allowing phylogenetic placement of a given sequence relative to an existing set of isolates. wgs-based strain identification gives a far superior resolution compared to genetic marker-based approaches such as multilocus sequence typing (mlst) and can be used when standard techniques such as pulsed-field gel electrophoresis (pfge), variable-number tandem repeat (vntr) profiling, and maldi-tof are unable to accurately distinguish lineages [ ] . wgs-informed strain identification could be of particular significance for bacteria with large accessory genomes, which encompass many of the clinically most problematic bacteria, where much of the relevant genetic diversity is driven by differences in the accessory genome on the chromosome and/or plasmid carriage. somewhat ironically, the extremely rich information of wgs data, with every genome being unique, generates problems of its own. clinical microbiology tends to rely on often largely ad hoc taxonomical nomenclature, such as biochemical serovars for salmonella enterica or mycobacterial interspersed repetitive units (mirus) for mycobacterium tuberculosis. while the rich information contained in wgs should in principle allow superseding traditional taxonomic classifications [ , ] , defining an intuitive, meaningful and rigorous classification for genome sequences represents a major challenge. for strictly clonal species, which undergo no horizontal gene transfer (hgt), such as m. tuberculosis, it is possible to devise a 'natural' robust phylogenetically based classification [ ] . unfortunately, organisms undergoing regular hgt, and with a significant accessory genome, do not fall neatly into existing classification schemes. 
in fact, it is even questionable whether a completely satisfactory classification scheme could be devised for such organisms, as classifications based on the core genome, accessory genome, housekeeping genes (mlst), genotypic markers, plasmid sequence, virulence factors or amr profile may all produce incompatible categories ( figure ). beyond species identification and characterization, genome sequences provide a rich resource that can be exploited to predict the pathogen's phenotype. the main microbial traits of clinical relevance are amr and virulence, but may also include other traits such as the ability to form biofilms or survival in the environment. sequence-based drug profiling is one of the pillars of hiv treatment and has to be credited for the remarkable success of antiretroviral therapy (art) regimes. prediction of amr from sequence data has also received considerable attention for bacterial pathogens but has not led to comparable success at this stage. resistance against single drugs can be relatively straightforward to predict in some instances. for example, the presence of the sccmec cassette is a reliable predictor for broad-spectrum beta-lactam resistance in staphylococcus aureus, with strains carrying this element referred to as methicillin-resistant s. aureus (mrsa). in principle, wgs offers the possibility to predict the full resistance profile to multiple drugs (the 'resistome'). possibly the first real attempt to predict the resistome from wgs data was a study by holden et al. in , showing that, for a large dataset of s. aureus st isolates, . % of all phenotypic resistances could be explained by at least one previously documented amr element or mutation in the sequence data [ ] . since then, several tools have been developed for the prediction of resistance profiles from wgs. 
these include those designed for prediction of resistance phenotype from acquired amr genes, including resfinder [ ] and abricate (https://github.com/tseemann/abricate), together with those also taking into account point mutations in chromosome-borne genes such as arg-annot [ ] , the sequence search tool for antimicrobial resistance (sstar) [ ] , and the comprehensive antibiotic resistance database (card) [ ] . of these, resfinder and card can be implemented as online methods that, dependent on user traffic, can be considerably slower than most other tools that only use the command-line. they are, however, superior in terms of broad usability and are more intuitive than, for example, the graphical user interface of sstar.

glossary

accessory genome: the variable genome consisting of genes that are present only in some strains of a given species. many of the organisms representing the most severe amr threats are characterised by large accessory genomes containing important components of clinically relevant phenotypic diversity.

antimicrobial resistance (amr): the ability of a microorganism to reproduce in the presence of a specific antimicrobial compound. also referred to as antibiotic resistance (abr or ar). the sum of the detected amr genes in a sequenced isolate is sometimes referred to as the resistome.

horizontal gene transfer (hgt): the transmission of genetic material laterally between organisms outside 'vertical' parent-to-offspring inheritance, including across species boundaries. genetic elements related to clinically relevant phenotypes such as amr and virulence are often transmitted via hgt.

k-mer: a string of length k contained within a larger sequence. for example, the sequence 'attgt' contains two 4-mers: 'attg' and 'ttgt'. the analysis of the k-mer content of raw sequencing reads allows for rapid characterization of the genetic difference between isolates without the need for genome assembly.

multilocus sequence typing (mlst): a scheme used to assign types to bacteria based on the alleles present at a defined set of chromosome-borne housekeeping genes. also referred to as sequence typing (st).

phylogenetic tree: a representation of inferred evolutionary relationships based on the genetic differences between a set of sequences. also referred to as a phylogeny.

transmission chain: the route of transmission of a pathogen between hosts during an outbreak. this can often be characterized using wgs compared to traditional epidemiological inference based on, for example, tracing contacts between patients.

virulence: broadly, a pathogen's ability to cause damage to its host, for example through invasion, adhesion, immune evasion, and toxin production. however, virulence is currently loosely defined by indirect proxies either phenotypically (e.g., through serum-killing assays) or genetically (e.g., by the presence of genes involved in capsule synthesis or hypermucoviscosity).

whole-genome sequencing (wgs): the process of determining the complete nucleotide sequence of an organism's genome. this is generally achieved by 'shotgun' sequencing of short reads that are either assembled de novo or mapped onto a high-quality reference genome.

other tools exist for richer species-specific characterization such as phyresse [ ] and patric-rast [ ] . further tools have been developed to predict phenotype directly from unassembled sequencing reads, bypassing genome assembly [ , ] . it has been proposed that wgs-based phenotyping might, in some instances, be equally, if not more, accurate than traditional phenotyping [ ] [ ] [ ] [ ] . however, it is probably no coincidence that the most successful applications to date have primarily been on m. tuberculosis and s. aureus, which are characterised by essentially no, or very limited, accessory genomes, respectively. 
other successful examples include streptococcal pathogens, where wgs-based predictions and measured phenotypic resistance show good agreement even in large and diverse samples of isolates [ , ] . on the whole, however, predicting comprehensive amr profiles in organisms with open genomes, such as escherichia coli, where only % of genes are found in every single strain [ ] , is challenging and requires extremely extensive and well curated reference databases. the transition to wgs might appear relatively straightforward if viewed as merely replacing pcr panels, which are already used when traditional phenotyping can be cumbersome and unreliable. however, to put the problem in context, there are over described β-lactamase gene sequences responsible for multiresistance to β-lactam antibiotics such as penicillins, cephalosporins, and carbapenems [ ] . whilst β-lactam resistance in some pathogens, including s. pneumoniae, can be predicted through, for example, penicillin-binding protein (pbp) typing and machine-learning-based approaches [ ] , the general problem of reliably assigning resistance phenotype based on many described gene sequences is commonplace. at this stage, many of the amr reference databases are not well integrated or curated and have no minimum clinical standard. they often have varying predictive ranges and biases and produce fairly inaccessible output files with little guidance on how to interpret or utilise this information for clinical intervention. perhaps because of these limitations, and although such tools are of obvious benefit as part of a diagnostics platform, both awareness and uptake in the clinic have been limited. additionally, with some notable exceptions, such as the pneumococci [ ] , most amr profile predictions from wgs data are qualitative, simply predicting whether an isolate is expected to be resistant or susceptible against a compound despite amr generally being a continuous and often complex trait. 
the level of resistance of a strain to a drug can be affected by multiple epistatic amr elements or mutations [ ] , the copy number variation of these elements [ ] , the function of the genetic background of the strain [ - ], and modulating effects by the environment [ ] . the level of resistance is generally well captured by the semiquantitative phenotypic measurement minimum inhibitory concentration (mic), even if clinicians often use a discrete interpretation of mics into resistant/susceptible based on fairly arbitrary cut-off values. quantitative resistance predictions are not just of academic interest. in the clinic, strains with low-level resistance can still be treated with a given antibiotic provided the standard dose is increased, which can be the best option at hand, especially for drugs with low toxicity. the majority of efforts to predict phenotypes from bacterial genomes have been on amr profiling. yet, some tools have also been developed for multispecies virulence profiling, such as the virulence factors database (vfdb) [ ] or virulencefinder [ ] , as well as the bespoke virulence prediction tool for klebsiella pneumoniae, kleborate [ ] . one major challenge is that virulence is often a context-dependent trait. for example, in k. pneumoniae various imperfect proxies for virulence are used. these include capsule type, hypermucoviscosity, biofilm and siderophore production, or survival in serum-killing assays. while all of these traits are quantifiable and reproducible, and could thus in principle be predicted using wgs, it remains unclear how well they correlate with virulence in the patient. given that virulence is one of the most commonly studied phenotypes, yet lacks a clear definition, the general problem of predicting bacterial phenotype from genotype may be substantially more complex than the special case of amr, which is itself far from solved for all clinically relevant species. 
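the contrast drawn above between a continuous mic and its binary resistant/susceptible interpretation can be made concrete with a minimal python sketch. everything numeric here is hypothetical: the cut-off, the isolate names and the mic values are placeholders chosen for illustration, not clinical breakpoints, and the convention that an mic strictly above the cut-off is called resistant is an assumption of this sketch.

```python
import math

# Hypothetical cut-off in mg/L; NOT a real clinical breakpoint.
HYPOTHETICAL_CUTOFF = 2.0


def interpret_mic(mic, cutoff):
    """Binary call: 'R' if the MIC is strictly above the cut-off, else 'S'.

    The strict '>' convention is an assumption made for this sketch.
    """
    if mic <= 0 or cutoff <= 0:
        raise ValueError("MIC and cut-off must be positive")
    return "R" if mic > cutoff else "S"


def doubling_dilutions_from_cutoff(mic, cutoff):
    """Signed number of two-fold dilution steps between the MIC and the cut-off."""
    return math.log2(mic / cutoff)


# Hypothetical isolates with made-up MIC values (mg/L).
hypothetical_mics = {"isolate_a": 0.5, "isolate_b": 4.0, "isolate_c": 64.0}

for name, mic in hypothetical_mics.items():
    call = interpret_mic(mic, HYPOTHETICAL_CUTOFF)
    steps = doubling_dilutions_from_cutoff(mic, HYPOTHETICAL_CUTOFF)
    print(f"{name}: MIC={mic} mg/L, call={call}, dilution steps from cut-off={steps:+.0f}")
```

in this toy example isolate_b and isolate_c receive the same binary call even though their mics differ sixteen-fold, which is the information a purely qualitative prediction cannot convey.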
beyond phenotype prediction for individual isolates, wgs has allowed reconstructing outbreaks within hospitals and the community across a diversity of taxa ranging from carbapenem-resistant k. pneumoniae [ ] [ ] [ ] and acinetobacter baumannii [ ] to mrsa [ , ] , streptococcal disease [ ] , and neisseria gonorrhoeae [ ] , amongst others. wgs can reveal which isolates are part of an outbreak lineage and, by integrating epidemiological data with phylogenetic information, detect probable direct transmission events [ ] [ ] [ ] [ ] . timed phylogenies, for example generated through beast [ , ] , can provide likely time-windows on inferred transmissions, as well as dating when an outbreak lineage may have started to expand. approaches based on transmission chains can also be used to identify sources of recurrent infections (so-called 'super-spreaders'), and do not necessarily rely on all isolates within the outbreak having been sequenced, allowing for partial sampling and analyses of ongoing outbreaks [ ] . in this way wgs-based inference can elucidate patterns of infection which are impossible to recapitulate from standard sequence typing alone [ ] . however, wgs-informed outbreak tracking is usually performed only retrospectively. typically, the publication dates of academic literature relating to outbreak reconstruction lag greatly, often in the order of at least years since the initial identification of an outbreak [ , ] . even analyses published more rapidly are generally still too slow to inform on real-time interventions [ ] . some studies have shown retrospectively that near-real-time hospital outbreak reconstruction is feasible [ , ] or have performed analyses for ongoing outbreaks in close to real-time [ , ] , but these studies are still in a minority and remain largely within the academic literature. some of this time-lag probably relates to the difficulty of transmission-chain reconstruction at actionable time-scales. 
this can be relatively straightforward for viruses with high mutation rates, small genomes, and fast and constant transmission times, such as ebola [ ] and zika virus [ ] , but conversely, reconstructing outbreaks for bacteria and fungi poses a series of challenges. available tools tend to be sophisticated and complex to implement, and the sequence data needs extremely careful quality control and curation. unfortunately, in some cases insufficient genetic variation will have accumulated over the course of an outbreak, and a transmission chain simply cannot be inferred without this signal [ , ] . furthermore, extensive within-host genetic diversity (typical in chronic infections) can render the inference of transmission chains intractable [ ] . these complexities mean that a 'one-size fits all' bioinformatics approach to outbreak analyses simply does not exist. one of the key promises of wgs is in molecular surveillance and real-time tracking of infectious disease. this relies on transparent and standardized data sharing of the millions of genomes sequenced each year, together with accompanying metadata on isolation host, date of sampling, and geographic location. with enough data, surveillance initiatives have the potential to identify the likely geographic origin of emerging pathogens and amr genes, group seemingly unrelated cases into outbreaks, and clearly identify when sequences are divergent from other circulating strains. in a hospital setting, surveillance can help to detect transmission within the hospital and inflow from the community, optimize antimicrobial stewardship, and inform treatment decisions; at national and global scales, it can highlight worldwide emerging trends for which collated evidence can direct both retrospective but also anticipatory policy decisions. amongst the most successful global surveillance initiatives and analytical frameworks are those relating specifically to the spread of viruses. 
influenza surveillance is arguably the most developed, with large sequencing repositories such as the gisaid database (gisaid.org) and online data exploration and phylodynamics available through web tools such as nextflu [ ] and nextstrain (http://nextstrain.org), which also allows examination of other significant viruses including zika, ebola, and avian influenza. another popular tool for the sharing of data and visualization of phylogenetic trees and their accompanying meta-data is microreact (microreact.org) [ ] , which also allows for interactive data querying and includes bacteria and fungi. a further tool, predominately for bacterial data, is wgsa (www.wgsa.net). wgsa allows the upload of genome assemblies through a drag-and-drop web browser, allowing for a quick characterization of species, mlst type, resistance profile, and phylogenetic placement in the context of the existing species database based on core genes. at the time of writing wgsa comprises genomes predominantly from s. aureus, n. gonorrhoeae, and salmonella enterica serovar typhi, together with ebola and zika viruses, all with some associated metadata. although an exciting initiative, wgsa and associated platforms are still a reasonably long way off characterizing all clinically relevant isolates and often rely entirely on the sequences uploaded already being assembled. more generally, the success of any wgs surveillance is dependent on the timely and open sharing of information from around the globe. while sequence data from academic publications is near systematically deposited on public sequence databases (at least upon publication), such data are near useless if the accompanying metadata (see above) are not also released, as remains the case far too often. additionally, as more genomes are routinely sequenced in clinical settings as part of standard procedures, ensuring that the culture of sharing sequence data persists beyond academic research will become increasingly important. 
for wgs to be routinely adopted in clinical microbiology, it needs to be cost-effective. it is commonly accepted that sequencing costs are plummeting, with the national human genome research institute (nhgri) estimating the cost per raw megabase (mb) of dna sequence at . usd (www.genome.gov/sequencingcostsdata). this has led to claims that a draft bacterial genome can currently cost less than usd to generate [ ] . this is a misunderstanding, as one cannot simply extrapolate the cost of a bacterial genome by multiplying a high-throughput per-megabase (mb) dna sequencing cost by the size of its genome. for microbial sequencing, multiple samples must be multiplexed for cost efficiency, which is easier to achieve in large reference laboratories with high sample turnover. excluding indirect costs such as salaries for personnel, preparation of sequencing libraries now makes up the major fraction of microbial sequencing costs ( figure ). the precipitous drop in the cost of producing raw dna sequences in recent years (figure a ) mostly reflects a massive increase in output with new iterations of illumina production machines. these numbers ignore all other costs and simply reflect output relative to the cost of the sequencing kits/cartridges. realistic cost estimates for a microbial genome including library preparation on the best available platforms give a different picture ( figure b ). since the introduction of the illumina miseq platform in , new sequencing kits generating higher output have only marginally affected true microbial genome sequencing costs, as library preparation makes up a significant portion of the total ( usd of a total of usd for a typical bacterial genome in ). these costs have remained stable over time and are unlikely to go down significantly in the near future. indeed, the market seems to be consolidating in fewer hands (e.g., represented by the procurement of kapa by roche in ), which economic theory predicts will not favor price decrease. 
it is also important to keep in mind that these costs are massive underestimates which do not include indirect costs such as salaries for laboratory personnel and downstream bioinformatics. such indirect costs are difficult to estimate precisely in an academic setting but are far from trivial. per-genome sequencing and analysis costs are likely to be even higher in a clinical diagnostics environment due to the need for highly standardised and accredited procedures. however, a micro-costing analysis covering laboratory and personnel costs estimated the cost of clinical wgs to £ per m. tuberculosis isolate versus £ applying standard methods, representing relatively marginal cost savings but with significant time savings [ ] . wgs does indeed represent a potentially cost-effective and highly informative tool for clinical diagnostics, but for microbiology-scale sequencing we seem to be in a post-plummeting-costs age. one key feature of useful diagnostics tools is their ability to rapidly inform treatment. most applications of wgs so far have been for lab-cultured organisms (bacteria and fungi). traditional culture methods require long turnaround time, with most bacterial cultures taking - days, fungal cultures - days, and mycobacterial cultures up to - days. in this scenario, wgs is used as an adjunct technology primarily to provide information on the presence of amr and virulence genes, which is particularly useful for mechanisms that are difficult to determine phenotypically (e.g. carbapenem resistance). this use of wgs, whilst solving some of the current clinical problems, does not speed up the diagnosis of infection; it is more the case that new technology is replacing some of the more cumbersome laboratory techniques whilst providing additional information. wgs is more appealing as a microbiological fast diagnostics solution when combined with procedures that circumvent (or shorten) the traditional culture step. 
this can be achieved through direct sampling of clinical material (box ) or by using a protocol enriching for sequences of specific organism(s). such enrichment methods, generally based on the capture of known sequences through hybridization, are a particularly tractable approach for viruses due to their small genome size. for example, the vircap virome capture method targets all known viruses and can even enrich for novel sequences [ ] . similar methods targeting specific organisms have been developed and successfully deployed, representing an attractive option for unculturable organisms [ , [ ] [ ] [ ] [ ] . relative to the time required for culture and downstream analysis of the data, variation in the speed of different sequencing technologies is relatively modest. there is considerable enthusiasm for the oxford nanopore technology (ont) which outputs data in real time, although the ont requires a comparable amount of time to the popular illumina miseq sequencer to generate the same volume of sequence data. sequencing on the miseq sequencer takes between to hours, but as run time correlates with sequence output and read length, researchers tend to systematically favour runs of longer duration.

box . wgs beyond single genomes

wgs in the strict sense usually refers to sequencing the genome of a single organism, and it is common to distinguish between the sample (the material that has actually been taken from the patient) and the isolate (an organism that has been cultured and isolated from that sample). wgs methods traditionally sequence a cultured isolate to reduce contamination from other organisms, or sometimes rely on enrichment strategies targeting sequences from a specific organism [ , ] . however, this represents only a small fraction of the total microbial diversity present in a clinical sample. in contrast, metagenomic approaches sequence samples in an untargeted way. this approach is particularly relevant for clinical scenarios where the pathogen of interest cannot be predicted and/or is fastidious (i.e., has complex culturing requirements). example applications of clinical metagenomics include: when the disease causing agent is unexpected [ , ] ; investigating the spread of amr-carrying plasmids across species [ ] ; and characterizing the natural history of the microbiome [ ] . the removal of the culture requirement can drastically decrease turn-around time from sample to data and enable identification of both rare and novel pathogens. different samples, however, present different challenges. easy-to-collect sample sites (e.g., faeces and sputum) typically also have a resident microbiota, so it can be challenging to distinguish the etiological agent of disease from colonizing microbes. conversely, sites that are usually sterile (e.g., cerebrospinal fluid, pleural fluid) present a much better opportunity for metagenomics to contribute to clinical care. metagenomic data are more complex to analyze than single species wgs data and tend to rely on sophisticated computational tools, such as the desman software allowing inference of strain-level variation in a metagenomic sample [ ] . such approaches can be difficult to implement, are computationally very demanding, and are unlikely to be deployable in clinical microbiology in the near future, although cloud-based platforms may circumvent the need for computational resources in diagnostic laboratories. furthermore, some faster approaches for rapid strain characterization from raw sequence reads, such as mash [ ] and kmerfinder [ , ] , could find a use in diagnostics microbiology, with the latter having been shown to identify the presence of pathogenic strains even in culture-negative samples [ ] . however, the differences between these methods should not obscure their fundamental similarities. obtaining single-species genomes from culture is one end of a continuum of methods that stretches all the way to full-blown metagenomics of a sample. in principle, all methods produce the same kind of data: strings of bases. furthermore, in all cases what is clinically relevant represents only a small fraction of these data. integrating sequencing data from different methods into a single diagnostics pipeline is therefore an attractive prospect to quickly identify the genomic needles in the metagenomic haystack in a species-agnostic manner. for example, the presence of a particular antibiotic-resistance gene in sequencing data may recommend against the use of that antibiotic; whether the gene is present in data from a single-species isolate or from metagenomes is irrelevant. as an example, leggett et al. used minion metagenomic profiling to identify pathogen-specific amr genes present in a faecal sample from a critically ill infant all within h of taking the initial sample [ ] .

in the context of this review, genetic material from the human patient present in clinical samples represents contamination, a major obstacle to obtaining a high yield of microbial dna. protocols exist to deplete human dna prior to sequencing [ , ] but these are not completely problem-free as the depletion protocol is likely to bias estimates of the microbial community, and some human reads will likely remain. in particular, levels of human dna are significantly higher in faecal samples from hospitalized patients compared to healthy controls [ ] , suggesting that the problem is exacerbated in clinical settings. therefore, the ethical and legal issues raised by introducing human wgs into routine healthcare [ ] cannot be avoided by microbially focused clinical metagenomics. 
dismissing these concerns as minor may be an option for academic researchers uninterested in these human data, but it is naive to think that hospital ethics committees will share this view. even in the absence of human dna, metagenomic samples from multiple body sites can be used to identify individuals in datasets of hundreds of people [ ] . managing clinical metagenomics data in light of these concerns should be taken seriously, not only as a barrier to implementation but because of the real risks to patient privacy. a major problem in the analysis of wgs data is that there are currently very few (if any) accepted gold standards. the fundamental steps of wgs analyses in microbial genomics tend to be similar across applications and usually consist of the following steps: sequence data quality control; identification/confirmation of the sequenced biological material; characterization of the sequenced isolate (including typing efforts as well as characterization of virulence factors and putative amr elements/mutations); epidemiologic analysis; and finally, storage of the results ( figure ). however, how these analyses are implemented varies widely, both between microbial species and human labs. despite some commercial attempts at one-stop analysis suites such as ridom seqsphere+ (http://www.ridom.com/seqsphere/), most laboratories use a collection of open-source tools to perform particular subanalyses. typically, these tools are then woven together into a patchwork of software (a 'pipeline'). the idea of a pipeline is to allow within-laboratory standardized analysis of batches of isolates with relatively little manual bioinformatics work. such pipelines can be highly customizable for a wide range of questions. there are also some communal efforts at streamlining workflows across laboratories. as an example, galaxy (https://usegalaxy.org) is a framework that allows nonbioinformaticians to use a wide array of bioinformatics tools through a web interface. 
one major limitation to rapidly attaining useful information in a clinical setting is that analysis pipelines for microbial genomics have generally been developed for fundamental research or public health epidemiology [ ] . this usually means that the pipeline permits a very thorough and sophisticated workflow with a large number of options and moving parts. for example, at the time of writing (may, ), the 'qc and manipulation' step in galaxy alone consists of different tools, tests, and workflows that can be applied to an input sequence. while this is desirable from a researcher's perspective, it is clearly prohibitive for real-time analysis in a clinical setting. a user requires in-depth knowledge about the purpose each tool serves, the relative strengths and weaknesses of each approach, and a functional understanding of the important parameters. furthermore, most analysis pipelines require proficiency in linux systems and navigating the command line, something clinical microbiologists are rarely trained for. the road to stringent, exhaustive analysis of wgs data is long and paved with good intentions. in order to move towards real-time interpretable results for clinics it will be necessary to take certain shortcuts. the focus should be on rapid, automated analysis and clear, unambiguous results. some steps in the pipeline can simply be omitted for clinical purposes. as an example, genome assembly might appear to be a bottleneck for real-time wgs diagnostics, but is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. accurate identification of an isolate can be made rapidly with minhash-based k-mer matching methods such as mash [ ] , and amr elements can be identified from k-mers alone [ ] . another example of a computationally intensive step that could be omitted from a default pipeline is sophisticated phylogenetic inference. 
best practice for the creation of phylogenetic trees may involve evaluating the individual likelihood of a very wide range of possible trees given a sequence alignment or other distance metric, repeated for thousands of bootstrapped replicates, giving a tree with high confidence but with extreme computational time costs. a clinical pipeline could use much faster approaches and still provide an informative phylogenetic tree [ ] . in figure we outline our schematic vision of a computational pipeline specific to diagnostics in clinical microbiology. the clinical pipeline would only encompass a small subset of the research pipeline aimed at generating rapid and interpretable output. for epidemiological inference, pairwise distances between strains would be computed as a matrix of jaccard distances on the shared proportion of k-mers as outputted by mash [ ] . this matrix could be used to generate a phylogenetic tree using a computationally inexpensive method (e.g., neighbor-joining). additionally, a correlation between pairwise genetic distance and sampling date could be performed to test for evidence of temporal signal in the data (i.e., accumulation of a sufficient number of mutations over the sampling period). in the presence of temporal signal, the user would be provided with a transmission chain based on a fast algorithm such as seqtrack [ ] .
figure caption: steps on the right marked with an asterisk represent simplified versions optimised for speed. cgmlst, core genome multilocus sequence typing; snp, single-nucleotide polymorphism; wgmlst, whole genome multilocus sequence typing.
any bespoke pipeline for clinical diagnostics would need to be linked with regularly updated multispecies databases containing information about the latest developments in typing schemes, as well as clinically important factors such as amr determinants. results would have to be continuously validated, and international accreditation standards met at regular intervals.
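the distance-based steps of the clinical pipeline outlined above (pairwise k-mer distances, then a check for temporal signal) can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the authors' pipeline and not mash itself: it compares exact k-mer sets rather than minhash sketches, the function names and the toy value of k are hypothetical, and the temporal-signal check is a plain pearson correlation. the mash distance is the published transform of the jaccard index j, d = -(1/k) ln(2j / (1 + j)).

```python
import math
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-mers in seq (no reverse complements, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index: shared k-mers divided by all k-mers seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j, k):
    """Mash distance from a Jaccard index j: D = -(1/k) * ln(2j / (1 + j))."""
    if j == 0:
        return 1.0
    return max(0.0, min(1.0, -math.log(2 * j / (1 + j)) / k))


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def temporal_signal(isolates, k=5):
    """Correlate pairwise genetic distance with difference in sampling day.

    isolates: dict mapping isolate name -> (sequence, sampling_day).
    Returns (distance_matrix, correlation), where distance_matrix maps
    (name_a, name_b) -> Mash-style distance.
    """
    sets = {name: kmers(seq, k) for name, (seq, _) in isolates.items()}
    matrix, genetic, elapsed = {}, [], []
    for a, b in combinations(sorted(isolates), 2):
        d = mash_distance(jaccard(sets[a], sets[b]), k)
        matrix[(a, b)] = d
        genetic.append(d)
        elapsed.append(abs(isolates[a][1] - isolates[b][1]))
    return matrix, pearson(genetic, elapsed)
```

pairwise distances are not independent observations, so in practice a permutation-based test would be a more defensible check than the raw coefficient; the matrix returned here is the kind of input a neighbor-joining implementation would take.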
at a national level, accreditation bodies (e.g., ukas in the uk) may lack the expertise required. in our experience, many promising databases have collapsed after funding expired or the responsible postdoc left for another job. if wgs is ever to make it into the clinic it will be necessary to secure indefinite funding of both infrastructure and personnel for such databases. the lack of uptake of wgs-based diagnostics may also be in part due to an understandable desire to maintain the 'status quo' in a busy hospital environment with already established treatment and intervention systems. additionally, and perhaps significantly, it also highlights the difficulty of communicating the potential benefits of wgs for the day-to-day life of a clinic. the main proponents of wgs tend to be based in the public health/research environment and are rarely actively involved in clinical decision-making. this in itself can present something of a language barrier, hindering meaningful dialogue over how adoption of new approaches can lead to quantifiable improvements in existing systems. further, the physical planning, implementation and integration of wgs diagnostics may be unlikely to succeed without carefully planned introduction and continued training of its user base. this is of course challenged by the already resource-limited infrastructure of many clinical settings. despite its immense promise and some early successes, it is difficult to predict if and when wgs will completely supersede current standards in clinical microbiology. there are several major bottlenecks to its implementation as a routine approach to diagnose and characterise microbial infections (see outstanding questions).
these include, among others: the current costs of wgs, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable amr and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols. focusing in the near future on wgs applications that fulfil unmet diagnostic needs and demonstrate clear benefits to patients and healthcare professionals will help to drive the cultural changes required for the transition to wgs in clinical microbiology. however, irrespective of how this transition occurs and how complete it is, it is likely to feel highly disruptive for many clinical microbiologists. there is also a genuine risk that precious knowledge in basic microbiology will be lost after the transition to wgs, particularly if investment prioritises new technology at the expense of older expertise. more positively, irrespective of the future implementation of wgs in clinical microbiology, we should not forget that the availability of extensive genomic data has been instrumental in the development of a multitude of routine non-wgs typing schemes. efforts to develop wgs-based microbial diagnostics have unsurprisingly focused on high-resource settings. though, we can see an opportunity for low-/medium-income countries to get up to speed with the latest wgs-based developments in real-time clinical diagnostics, rather than adopting classical microbiological phenotyping which might eventually be largely phased out in high-income countries. one precedent for the successful adoption of a technology without transitions through its acknowledged historical predecessors is the widespread use of mobile phones in africa. this has greatly increased communication and allowed access to e-banking, despite the fact that many people previously had no traditional bank account and only limited access to landlines. most hospitals in the developing world do not currently benefit from a clinical microbiology laboratory. the installation of a molecular laboratory based around a standard sequencer, such as a benchtop miseq, might constitute an ideal investment, as it is neither far more expensive nor more complex than setting up a standard clinical microbiology laboratory.
outstanding questions
can wgs be used to develop robust classification schemes that account for the genetic diversity of organisms with open genomes?
which clinically relevant phenotypes can be reliably predicted using wgs, and for which organisms?
how can phylogenetic analyses of outbreaks be speeded up to meaningfully contribute to infection control at actionable time scales?
how can publicly available databases be reliably maintained to the required clinical accreditation standards over long time periods?
will the true cost of generating a bacterial genome remain stable as the sequencing market consolidates in fewer hands?
how can clinical metagenomic data be managed safely in line with the ethical considerations applicable to identifiable human dna?
how can unwieldy bioinformatics pipelines developed with academic research in mind be adapted for a clinical setting?
can current expertise in traditional clinical microbiology be maintained in the transition to wgs?
high-throughput sequencing and clinical microbiology: progress, opportunities and challenges transforming clinical microbiology with bacterial genome sequencing routine use of microbial whole genome sequencing in diagnostic and public health microbiology bacterial genome sequencing in the clinic: bioinformatic challenges and solutions utility of matrix-assisted laser desorption ionization-time of flight mass spectrometry following introduction for routine laboratory bacterial identification armed conflict and population displacement as drivers of the evolution and dispersal of mycobacterium tuberculosis multilocus sequence typing as a replacement for serotyping in salmonella enterica a robust snp barcode for typing mycobacterium tuberculosis complex strains a genomic portrait of the emergence, evolution, and global spread of a methicillin-resistant staphylococcus aureus pandemic benchmarking of methods for genomic taxonomy arg-annot, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. 
antimicrob sstar, a stand-alone easy-to-use antimicrobial resistance gene predictor phyresse: a web tool delineating mycobacterium tuberculosis antibiotic resistance and lineage from whole-genome sequencing data antimicrobial resistance prediction in patric and rast rapid determination of anti-tuberculosis drug resistance from whole-genome sequences rapid antibiotic-resistance predictions from genome sequence data for staphylococcus aureus and mycobacterium tuberculosis wgs accurately predicts antimicrobial resistance in escherichia coli prediction of staphylococcus aureus antimicrobial resistance by whole-genome sequencing whole-genome sequencing and epidemiological analysis do not provide evidence for cross-transmission of mycobacterium abscessus in a cohort of pediatric cystic fibrosis patients short-read whole genome sequencing for determination of antimicrobial resistance mechanisms and capsular serotypes of current invasive streptococcus agalactiae recovered in the usa using whole genome sequencing to identify resistance determinants and predict antimicrobial resistance phenotypes for year invasive pneumococcal disease isolates recovered in the united states comparison of sequenced escherichia coli genomes in silico serine beta-lactamases analysis reveals a huge potential resistome in environmental and pathogenic species validation of beta-lactam minimum inhibitory concentration predictions for pneumococcal isolates with newly encountered penicillin binding protein (pbp) sequences evolutionary mechanisms shaping the maintenance of antibiotic resistance multicopy plasmids potentiate the evolution of antibiotic resistance in bacteria spatiotemporal microbial evolution on antibiotic landscapes vfdb : hierarchical and refined dataset for big data analysis - years on real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic escherichia coli genetic diversity, mobilisation and spread of the yersiniabactin-encoding 
mobile element icekp in klebsiella pneumoniae populations tracking a hospital outbreak of kpcproducing st klebsiella pneumoniae with whole genome sequencing nested russian doll-like genetic mobility drives rapid dissemination of the carbapenem resistance gene bla(kpc) evolution and transmission of carbapenem-resistant klebsiella pneumoniae expressing the bla(oxa- ) gene during an institutional outbreak associated with endoscopic retrograde cholangiopancreatography utility of whole-genome sequencing in characterizing acinetobacter epidemiology and analyzing hospital outbreaks rapid whole-genome sequencing for investigation of a neonatal mrsa outbreak whole-genome sequencing for the investigation of a hospital outbreak of mrsa in china prolonged and large outbreak of invasive group a streptococcus disease within a nursing home: repeated intrafacility transmission of a single strain genomic analysis and comparison of two gonorrhea outbreaks simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks impact of hiv co-infection on the evolution and transmission of multidrug-resistant tuberculosis bayesian inference of infectious disease transmission from whole-genome sequence data microevolutionary analysis of clostridium difficile genomes to investigate transmission beast : a software platform for bayesian evolutionary analysis bayesian phylogenetics with beauti and the beast . 
genomic infectious disease epidemiology in partially sampled and ongoing outbreaks transmission of staphylococcus aureus between health-care workers, the environment, and patients in an intensive care unit: a longitudinal cohort study based on wholegenome sequencing whole-genome sequencing to determine transmission of neisseria gonorrhoeae: an observational study a pilot study of rapid benchtop sequencing of staphylococcus aureus and clostridium difficile for outbreak detection and surveillance whole-genome sequencing for analysis of an outbreak of methicillin-resistant staphylococcus aureus: a descriptive study real time application of whole genome sequencing for outbreak investigation -what is an achievable turnaround time? translating genomics into practice for real-time surveillance and response to carbapenemase-producing enterobacteriaceae: evidence from a complex multi-institutional kpc outbreak real-time, portable genome sequencing for ebola surveillance multiplex pcr method for minion and illumina sequencing of zika and other virus genomes directly from clinical samples inferences from tip-calibrated phylogenies: a review and a practical guide when are pathogen genome sequences informative of transmission events? 
within-host bacterial diversity hinders accurate reconstruction of transmission networks from genomic distance data nextflu: real-time tracking of seasonal influenza virus evolution in humans microreact: visualizing and sharing data for genomic epidemiology and phylogeography insights from years of bacterial genome sequencing rapid, comprehensive, and affordable mycobacterial diagnosis with whole-genome sequencing: a prospective study virome capture sequencing enables sensitive viral diagnosis and comprehensive virome analysis deep sequencing of viral genomes provides insight into the evolution and pathogenesis of varicella zoster virus and its vaccine in humans specific capture and whole-genome sequencing of viruses from clinical samples same-day diagnostic and surveillance data for tuberculosis via whole-genome sequencing of direct respiratory samples rapid whole genome sequencing of m. tuberculosis directly from clinical samples depletion of human dna in spiked clinical specimens for improvement of sensitivity of pathogen detection by next-generation sequencing a method for selectively enriching microbial dna from contaminating vertebrate host dna excretion of host dna in feces is associated with risk of clostridium difficile infection the ethical introduction of genomebased information and technologies into public health identifying personal microbiomes using metagenomic codes astrovirus va /hmo-c: an increasingly recognized neurotropic pathogen in immunocompromised patients human coronavirus oc associated with fatal encephalitis natural history of the infant gut microbiome and impact of antibiotic treatment on bacterial strain diversity and stability desman: a new tool for de novo extraction of strains from metagenomes mash: fast genome and metagenome distance estimation using minhash rapid whole-genome sequencing for detection and characterization of microorganisms directly from clinical samples rapid minion metagenomic profiling of the preterm infant gut 
microbiota to aid in pathogen diagnostics whole genome sequencing in clinical and public health microbiology evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study reconstructing disease outbreaks from genetic data: a graph approach we are grateful to nadia debech and jan oksens for their help with digging up historic pricing information for sequencing key: cord- -u q o e authors: shean, ryan c.; makhsous, negar; stoddard, graham d.; lin, michelle j.; greninger, alexander l. title: vapid: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to ncbi genbank date: - - journal: bmc bioinformatics doi: . /s - - -y sha: doc_id: cord_uid: u q o e background: with sequencing technologies becoming cheaper and easier to use, more groups are able to obtain whole genome sequences of viruses of public health and scientific importance. submission of genomic data to ncbi genbank is a requirement prior to publication and plays a critical role in making scientific data publicly available. genbank currently has automatic prokaryotic and eukaryotic genome annotation pipelines but has no viral annotation pipeline beyond influenza virus. annotation and submission of viral genome sequence is a non-trivial task, especially for groups that do not routinely interact with genbank for data submissions. results: we present viral annotation pipeline and identification (vapid), a portable and lightweight command-line tool for annotation and genbank deposition of viral genomes. vapid supports annotation of nearly all unsegmented viral genomes. 
the pipeline has been validated on human immunodeficiency virus, human parainfluenza virus – , human metapneumovirus, human coronaviruses ( e/oc /nl /hku /sars/mers), human enteroviruses/rhinoviruses, measles virus, mumps virus, hepatitis a-e virus, chikungunya virus, dengue virus, and west nile virus, as well as the human polyomaviruses bk/jc/mcv, human adenoviruses, and human papillomaviruses. the program can handle individual or batch submissions of different viruses to genbank and correctly annotates multiple viruses, including those that contain ribosomal slippage or rna editing without prior knowledge of the virus to be annotated. vapid is programmed in python and is compatible with windows, linux, and mac os systems. conclusions: we have created a portable, lightweight, user-friendly, internet-enabled, open-source, command-line genome annotation and submission package to facilitate virus genome submissions to ncbi genbank. instructions for downloading and installing vapid can be found at https://github.com/rcs /vapid. with sequencing technologies becoming cheaper and more accessible, genomic sequencing is becoming increasingly widespread. smaller groups are generating more sequencing data than they can analyze alone. in order to extract maximal scientific and public health value out of these data, sharing of assembled consensus genomes and raw sequence data is critical. the democratization of genomics takes a village. this is especially true for infectious diseases, where searchable sequence databases allow for real-time tracking of viral epidemics and solving of foodborne bacteria outbreaks [ ] [ ] [ ] . while recent high-profile cases receive the most attention, the fact remains that almost all infectious diseases exist in the context of an ongoing outbreak [ , ] . in the clinical world, metagenomic analysis pipelines depend on and take advantage of the availability of many infectious disease genomes to allow for faster and accurate alignments [ ] [ ] [ ] .
in the basic science world, a graduate student studying the function of a singular protein is greatly assisted by being able to pull the world's history of sequence diversity for that protein before designing experiments [ ] . many foresee a world in which nearly every infectious disease genome is sequenced and archived in a publicly searchable database [ , ] . state and federal public health laboratories have built capacity such that they now sequence more than influenza virus genomes and more than enteropathogenic bacterial genomes each year [ ] . major efforts in rationalizing workflows, from nucleic acid extraction to data deposition and analysis, have enabled these rapid growths in throughput [ ] [ ] [ ] . these tools allow public health and research laboratories to focus on the epidemiological or scientific insight gained from sequencing rather than rote protocols for data deposition. specifically in the area of genome annotation, the national center for biotechnology information (ncbi) has created the prokaryotic genome annotation pipeline, eukaryotic genome annotation pipeline, and the influenza virus sequence annotation tool [ , ] . surprisingly, other than for influenza virus, ncbi genbank does not currently have an automatic viral genome annotation pipeline. the incredible diversity of dna and rna viruses presents a challenge for development of a universal annotator [ ] . complex viral life cycles involving rna editing, ribosomal slippage, and overlapping reading frames create additional annotation issues, along with non-standard nomenclature for viral gene products [ ] [ ] [ ] . in order to accept submitted viral genomic data, ncbi genbank requires (1) viral sequence complete with at least one protein annotation, (2) author/depositor metadata, and (3) viral sequence metadata, such as strain, collection date, collection location, and coverage.
while manual annotation of nucleotide sequence can be done for small numbers of viruses, it is extremely time-consuming and labor-intensive. even after correct annotations have been obtained for all viral sequences, manually integrating author and sample metadata to create files to be submitted is an equally time-consuming effort and not a feasible solution for groups sequencing more than a few viruses. to date, existing viral annotation tools have focused largely on batch submissions of a single virus species. this may be a holdover from when specific pcr-based methods were required to faithfully recover viral whole genome sequence or due to the focus of researchers on a single virus at a time. with the increasing use of metagenomic or shotgun next-generation sequencing and the availability of more and more sequencing capacity, researchers can confidently batch many different rna or dna viruses together on a single sequencing run. in order to facilitate viral genome annotation, we have developed a lightweight and user-friendly command-line tool that takes fasta files of complete or near complete viral genomes as input, automatically annotates them, and outputs the required files for genbank submission over email. vapid handles batch submissions of multiple viruses of different types without prior knowledge of the viral species, correctly annotates rna editing and ribosomal slippage, performs spellchecking on annotations, handles batch or individual submission of metadata, runs with a simple one-line command, and creates annotated viral sequence files for genbank submission. vapid can be downloaded at https://github.com/rcs /vapid. an installation guide, usage instructions, and test data can also be found at the above webpage. the invocation of vapid is shown in fig. ; users must provide a standard fasta file containing all of the viral genomes they wish to annotate.
users also must provide a genbank submission template (.sbt file) that includes author, publication, and project metadata. the genbank submission template can be used for multiple viral sequences or submissions and is easily created at the ncbi submission portal (https://submit.ncbi.nlm.nih.gov/genbank/template/submission/). an optional sample metadata file (.csv file) can be provided to vapid to expedite the process of incorporating sample metadata. this optional file can also be used to include any of the source modifiers supported by ncbi (https://www.ncbi.nlm.nih.gov/sequin/modifiers.html). if no sample metadata file is provided, vapid will prompt the user to input the required sample metadata at runtime. additionally, users can provide a specified reference from which to annotate all viruses in a run, as well as provide their own blastn database or force vapid to search ncbi's nt database over the internet. the vapid pipeline is summarized in fig. . the first step is finding the correct reference sequence. this is accomplished in one of three ways: (1) using the provided reference database (default), (2) forcing vapid to execute an online blastn search of ncbi's nt database, or (3) inputting the accession number of a single ncbi sequence to use as the reference. in the default case, ncbi's blast+ tools are called from the command line to search against a reference database that is included with the vapid installation. this database was generated by downloading all complete viral genomes in ncbi on may , . the best result from this search is passed as the reference into the next steps. if the online option for finding the reference is specified, vapid finds an appropriate reference sequence for each genome to be annotated by performing an online blastn search with a word size of using biopython's NCBIWWW.qblast() function against the online ncbi nt database.
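choosing a reference from a list of blast hits can be sketched as below. the rule follows the text's description of the online search: take the best-scoring hit among the top results whose definition line contains "complete genome", and fall back to the top-scoring hit if there is none. everything else is an assumption made for illustration: the `Hit` record, the function name, and the `top_n` default (the paper's actual cutoff is not stated here) are hypothetical and are not taken from vapid's source.

```python
from typing import NamedTuple, Sequence


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record layout, not VAPID's own)."""
    accession: str
    title: str      # definition line of the hit
    score: float    # alignment score, higher is better


def pick_reference(hits: Sequence[Hit], top_n: int = 10) -> Hit:
    """Pick a reference hit.

    Prefer the best-scoring hit among the top_n results whose title
    mentions "complete genome"; otherwise return the top-scoring hit.
    top_n is a placeholder value chosen for this sketch.
    """
    if not hits:
        raise ValueError("no BLAST hits to choose from")
    # sorted() is stable, so ties keep their original order.
    top = sorted(hits, key=lambda h: h.score, reverse=True)[:top_n]
    complete = [h for h in top if "complete genome" in h.title.lower()]
    return complete[0] if complete else top[0]
```

keeping the fallback in the same function means every query sequence gets some reference, at the cost of occasionally annotating from a partial record; a stricter variant could raise an error instead.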
the blastn output is parsed for the best scoring alignment among the top results that contains "complete genome" in the reference definition line. if no complete genome is found in the top blastn results, the top-scoring hit is used as the reference sequence. if a specific reference is provided, vapid simply downloads it directly from ncbi. after the correct reference is downloaded, gene locations are stripped from the reference and a pairwise nucleotide alignment between the reference and the submitted sequence is generated using mafft [ ] . the relative locations of the genes on the reference sequence are then mapped onto the new sequence based off the alignment. this putative alignment only requires that start codons are in regions of high homology and does not rely on intergenic spacing or gene lengths. gene names are taken from the annotated reference sequence genbank entry. spellchecking is performed using ncbi's espell module. this module provides spellchecking of many biological strings including protein product names. an optional argument can be provided at execution that enables this step.
fig. example usage of vapid. the two required files are shown as genome.fasta and author_info.sbt. genome.fasta is all of the viral genomes you wish to submit, named as you want them to appear on genbank. in the example code provided in the github repository this example file is called example.fasta. the author_info.sbt file is an ncbi specific file for attaching sequence author names to sequin files and is a required part of properly submitting sequences to ncbi. this file can be generated at (https://submit.ncbi.nlm.nih.gov/genbank/template/submission/ ). the first optional command is a comma separated file in which you can include all relevant metadata. you can create additional columns here so long as they correspond to ncbi approved sequence metadata. a list and formatting requirements can be found at (https://www.ncbi.nlm.nih.gov/sequin/modifiers.html). note that fasta sequence names must be identical to names in the optional metadata sheet. additionally, one could omit the metadata sheet and vapid will prompt for strain name, collection-date, country, and coverage data automatically at runtime. the second optional argument is a location of a local blastn database, which will force vapid to use the specified database instead of the included database. the last optional argument will force vapid to send an online search query to ncbi's nt database.
fig. general design and information flow of vapid. first the provided sequences are used as queries for a local blast search (default) or an online blastn search. after results have been returned a reference annotation is downloaded, if a specific reference accession number is given then this reference is downloaded. next the original fasta file is aligned with the reference fasta and the resulting alignment is used to map the reference annotations onto the new fasta. then custom code runs through the file and handles rna editing, ribosomal slippage and splicing. these finalized annotations are then plugged into ncbi's tbl asn with the author information and sequin files are generated as well as .gbk files which can be used to manually verify accuracy of new annotations. quality checked .sqn files can be emailed directly to genbank.
the diverse array of methods viruses use to encode genes can present problems for any viral genome annotator. ribosomal slippage allows viruses to produce two proteins from a single mrna transcript by having the ribosome 'slip' one or two nucleotides along the mrna transcript, thus changing the reading frame. since ribosomal slippage is well conserved within viral species and complete reference genomes often list exactly where it occurs, custom code was used to strip the correct junction site and include it in the annotation. rna editing is another process by which viruses can include multiple proteins in a single gene.
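the coordinate transfer described above (mapping reference gene locations onto the new sequence through the pairwise alignment, including a ribosomal slippage junction) can be sketched as follows. this is an illustration under assumptions, not vapid's code: the alignment is supplied as two gapped strings of equal length rather than produced by mafft, the function names are hypothetical, and a reference base aligned to a gap is simply mapped to the nearest query base on its left. a gene is a list of (start, end) segments in 1-based coordinates, with two segments sharing a junction coordinate standing in for a slippage site.

```python
def build_position_map(aligned_ref, aligned_query):
    """Map each 1-based reference position to a 1-based query position.

    Both inputs are rows of a pairwise alignment with '-' for gaps.
    A reference base aligned to a query gap maps to the closest query
    base on its left (or position 1 if there is none yet).
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    ref_pos = 0
    query_pos = 0
    for ref_base, query_base in zip(aligned_ref, aligned_query):
        if query_base != "-":
            query_pos += 1
        if ref_base != "-":
            ref_pos += 1
            mapping[ref_pos] = max(query_pos, 1)
    return mapping


def transfer_location(segments, mapping):
    """Translate a list of (start, end) reference segments to query coordinates."""
    return [(mapping[start], mapping[end]) for start, end in segments]


def format_location(segments):
    """Render segments in GenBank-style location notation."""
    parts = [f"{start}..{end}" for start, end in segments]
    return parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"


# The query carries three extra bases upstream of the reference start,
# so every annotated coordinate shifts by three.
position_map = build_position_map("---ATGAAACCCTAG", "GGGATGAAACCCTAG")
slippage_gene = [(1, 6), (6, 12)]  # two segments sharing junction base 6
print(format_location(transfer_location(slippage_gene, position_map)))
```

because only the segment endpoints are looked up, the transfer tolerates insertions and deletions inside a gene, which is consistent with the text's point that the approach does not rely on intergenic spacing or gene lengths.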
in rna editing, the rna polymerase co-transcriptionally adds one or two nucleotides that are not on the template. these changes are specifically created during viral mrna transcription and not during viral genome replication. rna editing presents an annotation issue because the annotated protein sequence does not match the expected translated nucleotide sequence. to correctly annotate genes with rna editing, vapid parses the reference genome viral species, detects the rna editing locus, and mimics the rna polymerase. vapid adds the correct number of non-templated nucleotides for the viral species and provides an alternative protein translation. this process is hard-coded for human parainfluenza - , nipah virus, sendai virus, measles virus, and mumps virus. although rna editing occurs in ebola virus, references for ebola virus are annotated in the same way as ribosomal slippage, so code written for ribosomal slippage handles ebola virus annotations. after ribosomal slippage and rna editing are processed, files required for genbank submission are generated with the provided author and sample metadata. vapid first generates the .fsa file, .tbl file, and optional .cmt file. submission files for each viral genome are packaged into a separate folder for each sequence. vapid then runs tbl asn on each folder using the provided genbank submission template file (.sbt). tbl asn generates error reports and sequin (.sqn) and genbank (.gbk) files for manual verification and genbank submission via email attachment to gb-admin@ncbi.nlm.nih.gov. to illustrate our vision of how vapid will be useful to the scientific community we are providing two example use cases. this first example is the task that the authors originally wrote vapid for: annotating large numbers of genomes from different viral species, which mirrors the type of data that many clinical and public health laboratories may encounter.
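the rna editing step described above (insert non-templated nucleotides at the editing locus, then translate the edited transcript to obtain the alternative protein) can be illustrated with a small sketch. the function names, the choice of g as the inserted base, and the toy sequence are assumptions for illustration only; the per-species editing sites and insertion counts that vapid hard-codes are not reproduced here.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def apply_rna_editing(cds, site, n_inserted, base="G"):
    """Insert n_inserted copies of base before 0-based index `site`.

    This mimics co-transcriptional addition of non-templated nucleotides;
    site, n_inserted and base are parameters of this sketch.
    """
    return cds[:site] + base * n_inserted + cds[site:]


# Toy transcript: the unedited frame stops early, the edited frame reads on.
unedited = "ATGAAAGGGTAACCCTGA"
edited = apply_rna_editing(unedited, site=9, n_inserted=1)
print(translate(unedited), translate(edited))
```

the annotation problem named in the text is visible here: the protein recorded for the edited transcript cannot be recovered by translating the genomic sequence directly, so the annotator has to reproduce the insertion first.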
to illustrate this, synthetic viral genomes were created by manually mutagenizing sequences downloaded from ncbi genbank. only point mutations that did not lead to stop codons were used and each sequence was given about - changes depending on the viral genome length, equal to roughly one amino acid change per nucleotides. the species included were nipah, sendai, measles, mumps, parainfluenza - , ebola, rotavirus segments, mers, sars, coronavirus e, west nile virus, htlv, hiv , hepatitis a-c, hepatitis e, norovirus, enterovirus, jc virus, and bk polyomavirus. all genomes were put together into a single fasta file. metadata (collection date, country and coverage) for each of the viral genomes was put together into a single csv file. vapid was then executed with the command [python vapid.py example.fasta example.sbt --metadata_loc example_metadata.csv]. after running for seconds the program successfully generated complete and correct ncbi submittable annotation files (*.sqn), which could be immediately emailed to ncbi. when these same sequences were run with the online option it took minutes to complete, most of which was waiting for ncbi to return blast results. the above example shows how vapid would be ideal for busy groups who produce a diverse array of viral sequence such as public health or clinical testing labs. the second example use case highlights both the cross platform functionality of vapid as well as the ability to manually enter ncbi required metadata at run time and use an online blast search -reducing the number of required files at runtime to two. for this example, a sample human parainfluenza virus fasta was created. no accompanying metadata file was created (unlike in the first example). additionally, the syntax is exactly the same to use vapid across all operating systems. 
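the mutagenesis step used to build the synthetic genomes in the first example (point mutations only, none of which creates a stop codon) can be sketched in python as follows. this is a minimal illustration under stated assumptions, not the authors' procedure, which was manual: the function name `mutate_without_stops`, the fixed random seed, and the simplification that the whole input is one coding sequence in reading frame 0 are all hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def mutate_without_stops(cds, n_changes, seed=0):
    """Return a copy of `cds` with `n_changes` point mutations, none of which
    creates an in-frame stop codon.

    Assumption (not from the paper): `cds` is a single coding sequence that
    starts in reading frame 0.
    """
    rng = random.Random(seed)
    bases = list(cds.upper())
    if n_changes > len(bases):
        raise ValueError("more changes requested than positions available")

    changed = set()
    while len(changed) < n_changes:
        pos = rng.randrange(len(bases))
        if pos in changed:
            continue  # mutate each position at most once
        new_base = rng.choice([b for b in "ACGT" if b != bases[pos]])

        # Rebuild the codon that contains this position with the proposed base.
        codon_start = pos - pos % 3
        codon = bases[codon_start:codon_start + 3]
        codon[pos - codon_start] = new_base
        if "".join(codon) in STOP_CODONS:
            continue  # reject: this change would introduce a stop codon

        bases[pos] = new_base
        changed.add(pos)
    return "".join(bases)
```

a real genome contains several genes, possibly on both strands and in different frames, so a faithful version would apply the stop-codon check per annotated coding region rather than to the whole sequence.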
the single human parainfluenza virus took approximately min to annotate (including waiting about seconds for ncbi blast results to be returned) on a windows virtual machine with a single . ghz core and gb of ram. this example highlights how vapid can be used to annotate and submit viruses with only two input files (the author template file only needs to be created once) on almost any mac, linux, or windows computer with a python installation. two other programs exist to handle viral annotation: jcvi's vigor (https://doi.org/ . / - - - ) and the broad institute's viral-ngs suite (https://viral-ngs.readthedocs.io/en/latest/) (table ). vigor is an extremely fast and automatic viral annotator that is able to correctly annotate any number of virus genomes without prior knowledge of viral species or type. vigor is also able to correctly handle rna editing and ribosomal slippage. needing only an input fasta file with the genome to be annotated, it can produce ncbi-compatible .tbl annotation files. vigor works natively on mac and linux systems. the annotations that vigor produces are identical to those created by vapid, except that vapid only annotates cds regions by default whereas vigor will annotate all regions. two primary advantages vapid offers over vigor are 1) vapid can be installed and run on windows systems, and 2) vapid automates metadata and ncbi submission preparation steps. this fits our intended use case of a public health laboratory that may only own windows machines and does not want to spend extra time scripting tbl2asn and manually preparing metadata files. for those with powerful linux machines and some knowledge of bash scripting, we recommend stitching vigor and tbl2asn together with custom scripts. however, for those without the infrastructure or time to pursue this path, vapid offers a reasonable alternative. the broad institute's publicly available viral-ngs package, also developed in python, represents another alternative to vapid.
this suite of tools is custom developed for internal broad institute users but publicly available and well-documented on github. viral-ngs takes a reference genome and corresponding annotation file and transfers the reference annotations onto a new genome. this is very similar to the way vapid transfers annotation, except viral-ngs requires the user to provide a reference, which avoids the potential problem of low quality references propagating errors. viral-ngs is well suited for annotating large numbers of the same type of virus all at the same time. another advantage of viral-ngs is that it contains many tools for going from raw sequencing reads to complete viral genomes. while viral-ngs handles ribosomal slippage and most normal viral genes, because ncbi compatible annotation files do not contain enough information to correctly annotate rna editing viral-ngs fails in these cases. outside of cases involving rna editing vapid and viral-ngs produce identical cds annotations, with viral-ngs transferring all other features. this software, like vapid, is also cross platform and could be easily run on a windows machine with a slightly modified python installation. the main advantages vapid offers over viral-ngs are the ease of batching multiple viral types together and also as with vigor, the automated metadata input and ncbi packaging. as with all software tools and especially gene annotation programs, vapid is not without limitations. vapid was designed purely to expedite the process of submitting large batches of different human virus genomes from a clinical laboratory. as such, a major limitation is that vapid expects a "complete genome" to use as a reference to be available for each of the viral sequences submitted to it. vapid is not the preferred annotation tool for novel or extremely divergent viral species. 
however, we find that complete reference genomes are available for most viral species of clinical importance, especially for viruses that are commonly sequenced in clinical or public health laboratories. a further limitation of vapid is that it does not perform ab initio gene annotation. instead vapid simply transfers annotations from the closest reference genome with a few quality control steps, such as allowing for slightly different gene lengths and rna editing. a significant limitation of this strategy is that because the new annotation is generated from a downloaded reference, any errors that are in the downloaded reference will be transferred to the new genome. this means that vapid performs best on high-quality and accurate reference sequences. the downside of the vapid strategy is that if an inaccurate reference strain is used, error can propagate extremely fast. for example, early in our development we deposited roughly human parainfluenza viruses that had the matrix protein incorrectly annotated as 'matrix potein' [sic] due to a misspelling in the official ncbi reference sequence nc_ for hpiv . to combat this problem, we have included a variety of ways to pull references, including directly specifying the reference. however, due to ease of implementation this functionality only works if all the viruses to be submitted are the same type. for those submitting many viruses of the same type we recommend using this feature to ensure annotation quality. vapid also contains code that overwrites misspellings in protein product names using ncbi's espell utility. this spell-checking step flags and corrects errors such as the misspelling mentioned above. a helpful message of what was corrected is printed to the console to prevent erroneous corrections. while this takes care of most misspellings, it is not a perfect fix, and viruses that we tested previously could have erroneous references uploaded in the future and further propagate error. 
it is for this reason that we highly recommend that users inspect and verify the accuracy of their new annotations, which have the potential to be used for future references. vapid allows users to go from any number of viral genome sequences to genbank-ready submission files (.sqn) with a single command. vapid facilitates input of sample metadata for viral genomic sequence, especially when the user desires to submit only a few sequences of diverse viral species. vapid runs from the command line with two required arguments and one optional sample metadata file, with the option of inputting minimal sample metadata via the command line at the time of annotation. the latest version of vapid can be found at https://github.com/rcs /vapid along with detailed installation and usage instructions. this software will allow public health, clinical virology, and research laboratories that do not have the resources to develop their own in-house genome annotation and submission tools or to adapt other pre-existing tools to quickly and easily share the sequence information that they generate on a regular basis with a minimal effort. availability of data and materials all data generated or analyzed during this study are included in this published article, github, or ncbi bioproject prjna . code and databases are available on github (https://github.com/rcs /vapid). authors' contributions rs wrote the code. nm generated sequences for beta-testing and development of the software. ml helped implement several features to correctly annotate rna editing and ribosomal slippage. gs implemented cross-platform functionality. ag conceived of the software and oversaw its development. rs and ag wrote the paper. all authors contributed to debugging and testing of the final version of the software and all authors read and approved the final manuscript. 
ethics approval and consent to participate use of excess clinical samples to generate viral sequences used for beta-testing of vapid was approved by the university of washington human subjects division. not applicable. the authors declare that they have no competing interests.
key: cord- -bqqyly authors: zhao, suhui; wan, chengsong; ke, changwen; seto, jason; dehghan, shoaleh; zou, lirong; zhou, jie; cheng, zetao; jing, shuping; zeng, zhiwei; zhang, jing; wan, xuan; wu, xianbo; zhao, wei; zhu, li; seto, donald; zhang, qiwei title: re-emergent human adenovirus genome type d caused an acute respiratory disease outbreak in southern china after a twenty-one year absence date: - - journal: sci rep doi: . /srep sha: doc_id: cord_uid: bqqyly human adenoviruses (hadvs) are highly contagious pathogens causing acute respiratory disease (ard), among other illnesses. of the ard genotypes, hadv- presents with more severe morbidity and higher mortality than the others. we report the isolation and identification of a genome type hadv- d (dg _ ) from a recent outbreak in southern china. genome sequencing, phylogenetic analysis, and restriction endonuclease analysis (rea) comparisons with past pathogens indicate hadv- d has re-emerged in southern china after an absence of twenty-one years. recombination analysis reveals this genome differs from the s-era prototype and vaccine strains by a lateral gene transfer, substituting the coding region for the l / kda dna packaging protein from hadv- . dg _ descends from both a strain circulating in southwestern china ( ) and a strain from shaanxi causing a fatality and outbreak (northwestern china; ). due to the higher morbidity and mortality rates associated with hadv- , the surveillance, identification, and characterization of these strains in population-dense china by rea and/or whole genome sequencing are strongly indicated.
with these accurate identifications of specific hadv types and an epidemiological database of regional hadv pathogens, along with the hadv genome stability noted across time and space, the development, availability, and deployment of appropriate vaccines are needed. human adenoviruses (hadvs) are highly contagious pathogens that are associated with a wide spectrum of human illnesses involving the respiratory, ocular, gastrointestinal, and genitourinary systems, and a metabolic disorder (obesity).
of the several respiratory disease-associated hadv pathogens, hadv- and hadv- are among the most commonly reported and are associated with febrile respiratory disease, in particular acute respiratory disease (ard). these two genotypes, particularly hadv- , circulate globally and frequently in civilian populations [ ] [ ] [ ]. they are of such public health concern that specific vaccines have been developed and deployed against them at two different time periods by the u.s. military. these successful vaccine periods provided an unintentional experiment reinforcing the effectiveness of vaccines in public health. in between the two periods of highly effective vaccine deployments, the inexplicable suspension of the vaccination program resulted in a resurgence of ard cases to prevaccine-era levels. this demonstrates the effectiveness and supports the development, availability, and deployment of vaccines against the hadvs that affect certain populations routinely and predictably, particularly those under stressed and high-density conditions. along with vaccines, two important considerations are required for this public health program: 1) the ability to monitor and identify circulating hadv types and 2) the presence of molecular data quantifying and characterizing genome changes that may occur during viral evolution, particularly at the antigenic epitopes, e.g., the hexon hypervariable l /l regions [ ] [ ] [ ] [ ]. hadv- is of particular concern as it is often associated with illnesses presenting with more severe and higher levels of morbidity than other respiratory hadv pathogens, and may also result in higher levels of fatalities [ ] [ ] [ ] [ ] [ ]. as an example, a higher mortality rate was reported for children infected by hadv- in china during and also in korea (seoul) during - , with mortality rates of %, caused by hadv- genome types d and l, as compared with . % noted for hadv- infections.
hadv strains compete against each other in host populations, with the presumably more robust ones replacing the previously dominant circulating strains. in a survey of hadv strains from the u.s. and eastern ontario/canada, two genomic variants that were previously absent in the population were identified: hadv- d comprised % of the hadvs reported and hadv- h comprised %. hadv- d was first identified in the u.s. in and subsequently spread further in the u.s. and into eastern ontario. it was concluded that both genomic variants represented ''recent introduction[s]'' from ''previously geographically restricted areas… herald[ing] a shift in predominant [genome type] circulating in the [u.s.]''. the origin of hadv- d, presumably the parent of hadv- d , was noted as china, where it circulated from to , having ''replaced hadv- b as the predominant circulating virus'', over a period of years ( - ). hadv- d, in particular hadv- d , from china subsequently spread ''beyond [its] formerly geographically restricted regions'', e.g., to south korea ( ) ( ) ( ) and also japan, israel, the u.s., and canada. as another example of the severity of type hadv pathogens, the genomic variant hadv- h also produced more severe symptoms, including higher fatality levels, when it emerged and circulated in south america in the s. novel and re-emergent strains of highly contagious hadv pathogens identified within mainland china are of public health concern to the global community, and vice versa. therefore, we call attention to and report the apparent re-emergence of hadv- d, after an approximately nineteen-year absence in mainland china ( to ), and twenty-one years in southern china ( ), with the isolation, identification, and analysis of the genome sequence of an adenovirus from a child afflicted with ard during an outbreak in a primary school located in dongguan of the guangdong province in southern china.
this genome is nearly identical to two other recently characterized genomes, one isolated from shaanxi province in northwest china ( ) and the other from chongqing in southwestern china ( ) (unpublished; jx ). given the caveat that the viruses may have been circulating earlier but had not been properly identified or reported, the re-emergence of this genomic variant of hadv- has potentially serious consequences in china, and globally, if it follows a trajectory similar to that of the earlier hadv- d and hadv- d genome types that emerged in other countries and caused higher morbidity and mortality rates. adenovirus identification and genome annotation. all specimens collected tested ''pcr-positive'' for adenovirus and were identified as hadv- by type-specific pcr analysis. of these, two, isolated from hospitalized and presumably more severe cases, produced visible cpe upon culturing. they were archived as dg _ and dg _ . sequence analysis revealed identical hexons, which were identified as hadv type by blastn analysis. the genome from dg _ was then sequenced, assembled, annotated, and analyzed. figure presents the genomic organization and transcription map of dg _ . this genome contains , bp with a gc content of . %. a total of coding sequences were identified. these genome data, noted formally as ''human adenovirus strain chn/dg / / [p h f ]'' and in this report as ''dg _ '', were deposited in genbank (accession number kc ). genome type determination of hadv- strains dg _ , hz/shx/ , and cq _ . the genome type of dg _ was determined by comparing its in silico rea profiles with other hadv- genome types reported in the literature. although seemingly antiquated in comparison to genome sequencing, rea profiles are still useful for comparisons with unsequenced but previously reported genome types and strains, and also as rapid and less-expensive alternatives for large-scale characterizations of viruses, given a correct reference strain.
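an in silico rea profile of the kind described above can be sketched in python as follows. this is a minimal illustration, not the software used in this study: the names `SITES`, `digest`, and `same_profile` are hypothetical, the genome is treated as a linear molecule, and only enzymes with simple palindromic recognition sites are included (so scanning one strand is sufficient). the recognition sequences and cut offsets are the standard ones for these enzymes.

```python
# Recognition site and cut offset (bases from the site start) on the top strand.
SITES = {
    "BamHI": ("GGATCC", 1),
    "BclI": ("TGATCA", 1),
    "BglII": ("AGATCT", 1),
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "HpaI": ("GTTAAC", 3),
    "SalI": ("GTCGAC", 1),
    "SmaI": ("CCCGGG", 3),
    "XbaI": ("TCTAGA", 1),
    "XhoI": ("CTCGAG", 1),
}


def digest(genome, enzyme):
    """Return fragment lengths (largest first) from a linear in silico digest."""
    site, offset = SITES[enzyme]
    genome = genome.upper()
    cuts = []
    start = genome.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = genome.find(site, start + 1)
    edges = [0] + cuts + [len(genome)]
    return sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)


def same_profile(genome_a, genome_b, enzymes):
    """Report, per enzyme, whether two genomes give identical fragment lengths."""
    return {e: digest(genome_a, e) == digest(genome_b, e) for e in enzymes}
```

comparing exact fragment lengths is stricter than comparing gel band patterns, where fragments of similar size are indistinguishable; a tolerance on fragment length would be needed to mimic a published gel image.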
using the genome type denomination of li, et al., dg _ is identified as hadv- d, as evidenced by rea patterns identical with those of the first reported hadv- d, shown in figure . the rea patterns generated from dg _ ; from hz/shx/ , which caused acute bronchitis and pneumonia in an ard outbreak comprising cases amongst young children in the shaanxi province in , including one fatality; and from cq _ , which was associated with an epidemic in chongqing, southwestern china (ni, k., et al., unpublished; jx ), are identical to each other and also identical with those of hadv- d reported earlier in israel ( ) and japan ( ). the rea patterns of cq _ in this study provide evidence to amend the less-descriptive designation of ''mutant hadv- d '' noted by ni, k., et al. in the genbank entry for cq _ (jx ). for reference, the in silico rea profile for the prototype gomen hadv- is provided; the rea patterns for these recent isolates differ clearly from the prototype, as shown in figure . the hadv- prototype is the correct reference genome, as three rea profiles (bcli, sali, and xhoi) showed identical patterns and complement the rea patterns that differ, along with sequence similarities across the genome. phylogenetic analysis of hexon genes and whole genomes confirms the genome types. phylogenetic analysis of archived hadv- hexon genes showed that dg _ has an origin common to strains hz/shx/ , cq _ , hebei_sjz_ , and tw_ . these hexons form a subclade that is on the same branch with another subclade containing several non-china isolates, including hadv- d from the u.s., as shown in figure a. the bootstrap value of indicates the hexons from the china genomes are highly similar to each other, but are separate from the u.s. hadv- d subclade (bootstrap value ).
furthermore, the phylogenetic analysis of available hadv- whole genomes revealed dg _ , hz/shx/ , and cq _ forming a subclade comprising hadv- d, confirming their close relationships with each other and reaffirming a common lineage (figure b) that is distinct from the hadv- d strains of the u.s.a. (bootstrap value ). all of the genome types form subclades that are separate from the clade containing the prototype (gomen; ), with hadv- h forming a separate subclade in the genome phylogenetic analysis, in contrast to the hexon gene phylogenetic analysis. comparative genomic analysis and single nucleotide differences of hadv- strains causing ard outbreaks in china. comparative genomics analysis showed dg _ has near genome identity with an earlier hadv- isolate, hz/shx/ ( . %), and also with cq _ ( . %). comparative genomics analysis documented seven single nucleotide substitutions and one single base insertion between the dg _ and hz/shx/ genomes. of these, two single nucleotide substitutions were localized in the itrs, and one non-synonymous substitution each was located in the dna polymerase, penton base, and kda protein coding sequences (table ). one synonymous nucleotide substitution each was present in the -kda hexon assembly-associated protein and virus-associated (va) rna ii. the single nucleotide insertion was in a non-coding region of dg _ . there were three single nucleotide substitutions in coding sequences and seven base deletion differences in the itrs between the cq _ and hz/shx/ genomes. one synonymous substitution (c to t) was located in the hexon assembly-associated protein (a a), and the two non-synonymous substitutions, g to c and g to t, were located in the dna polymerase (s c) and kda protein (p q), respectively. the nucleotide deletions in the itrs of cq _ may be sequencing errors, given that the left itr was not identical with the right itr, or may represent recent mutations.
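the distinction drawn above between synonymous and non-synonymous substitutions can be made concrete with a short python sketch. it is an illustration under stated assumptions, not the analysis pipeline used in this study: the function name `classify_substitutions` is hypothetical, both inputs must be the same in-frame coding sequence on the coding strand, of equal length (substitutions only, no indels), and only unambiguous bases are handled. the codon table is the standard genetic code.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitutions(ref_cds, alt_cds):
    """List codons that differ between two aligned coding sequences.

    Returns tuples of (1-based codon number, reference amino acid,
    alternate amino acid, "synonymous" or "non-synonymous").
    """
    ref_cds, alt_cds = ref_cds.upper(), alt_cds.upper()
    if len(ref_cds) != len(alt_cds) or len(ref_cds) % 3 != 0:
        raise ValueError("inputs must be equal-length, in-frame coding sequences")

    calls = []
    for start in range(0, len(ref_cds), 3):
        ref_codon = ref_cds[start:start + 3]
        alt_codon = alt_cds[start:start + 3]
        if ref_codon == alt_codon:
            continue
        ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
        kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
        calls.append((start // 3 + 1, ref_aa, alt_aa, kind))
    return calls
```

for a gene encoded on the complementary strand, both sequences would first need to be reverse-complemented, which is why a genome-level base change and the coding-strand codon change can name different bases.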
exclusive of the itr differences, there were only three single nucleotide substitutions between the cq _ and hz/shx/ genomes ( . %). exclusive of the itr differences, strain dg _ had a higher genome identity with cq _ ( . %) than with hz/shx/ ( . %). there were only four single nucleotide substitutions and one single nucleotide insertion in a non-coding region between the two genomes, which led to three non-synonymous substitutions in the dna polymerase (d e, s c) and penton base gene (v a), respectively. nucleotide substitution rates and selection pressures for hadv- d strain dg _ major capsid protein genes. the selective pressures at the protein level for the three hadv- capsid protein genes, hexon, penton base and fiber, were examined by comparing synonymous and non-synonymous mutations. all three genes have ka/ks ratios of less than (table ). this is in accordance with the hypothesis that organismal evolution is dominated by negative selection, i.e., selection that removes mutations harmful to fitness. specifically, both the hexon and penton base genes have fewer non-synonymous substitutions per site, which leads to the low ka/ks ratios. although the number of non-synonymous substitutions and the ka/ks ratio of the fiber gene are also low, they are higher than those of the hexon and penton base genes. this may indicate that the fiber gene is under weaker negative selection pressure, likely due to tissue tropism being determined and constrained by the fiber gene. overall, the majority of mutations are synonymous and do not affect the integrity of the hexon, penton base, and fiber proteins. genome recombination analysis of hadv- d. genome recombination analysis using simplot software reveals a lateral transfer of a small portion of the genome upstream of the penton base gene. this recombination transfers the entire l / kda gene from hadv- into hadv- d, as shown in figure a. its importance remains to be revealed.
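the idea behind a similarity-plot recombination scan can be sketched in python as follows. this is a simplified illustration and not the simplot algorithm itself, which applies distance models to a multiple alignment: here plain percent identity is computed in sliding windows over two pre-aligned, equal-length sequences. the function name `windowed_identity` and the default window and step sizes are hypothetical choices.

```python
def windowed_identity(query, reference, window=200, step=20):
    """Percent identity between two aligned sequences in sliding windows.

    Returns a list of (window start, percent identity) tuples. Alignment
    columns are compared character by character, so a gap in only one
    sequence counts as a mismatch.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = reference[start:start + window]
        matches = sum(a == b for a, b in zip(q, r))
        results.append((start, 100.0 * matches / window))
    return results
```

scanning the query against each candidate parent in turn and plotting the resulting curves together shows a lateral transfer as a region where identity to one reference drops while identity to another rises.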
the gene transfer is also found in the genomes from the earlier strains cq _ (southwestern china; ; unpublished) and hz/shx/ (northwestern china; ) , respectively, shown in figure b , but not found in the prototype gomen hadv- genome, as displayed in figure c . among the two hadv species b respiratory pathogens most frequently associated with ard outbreaks globally, hadv- is reported to cause a higher mortality rate than hadv- in one long-term survey , as well as in a recent shorter term survey of adenoviral pneumonia cases in beijing ( - ) . genome type hadv- d apparently originated and circulated in china from - , becoming the predominant strain during the period of - . it was also the prevalent genome type found in korea during two outbreaks in - and - , accounting for - % all of the type hadv strains assayed . interestingly, despite reports of global circulation, hadv- , and in particular hadv- d, epidemics had not been reported in mainland china from to . in , hadv- was identified as the respiratory pathogen in an outbreak that included a fatality in shaanxi and also in a outbreak in chongqing (unpublished), signaling a reemergence. thorough characterization of these pathogens is evidenced by the availability of two genome sequences (jf and jx ), both of which are further identified as the hadv- d genome type in this report, and shown to be nearly identical to this report of an isolate from a ard outbreak in guangdong province (strain dg _ ) by comparative genomics and, in particular, in silico rea pattern analysis, as presented in figure . although not ideal and largely replaced by whole genome sequencing, rea patterns can still provide rapid and relatively inexpensive characterizations of the genomes of large number of pathogens in an outbreak [ ] [ ] [ ] [ ] [ ] [ ] . for hadv comparisons, the caveat is to use the correct reference genome; for example, hadv- contains a partial hexon gene from hadv- , comprising approximately only . 
% of the length of the genome, in a chassis of hadv- , comprising approximately . % of the length of the genome. using the genome of hadv- as a reference yields meaningless patterns that are subject to researcher-biased interpretations and leads to erroneous conclusions that hadv- is a genome type of hadv- . using the hadv- genome as a reference provides a closer approximation of the genome identities. however, the recombination event revealed by whole genome sequencing, with the conflicting ''trojan horse'' renal pathogen epitope observed with ard symptoms, indicates this was a novel and emergent pathogen [ ] [ ] [ ] [ ]. in contrast, for hadv- d, the prototype hadv- genome provides the correct reference: three rea patterns are identical (bcli, sali, and xhoi); four are obviously different (bamhi, bgli, hpai, and smai); and five are highly similar with a few differences in the band patterns (bglii, bsteii, hindiii, ecori, and xbai), shown in figure . the major advantages of rea comparisons are the value and abundance of earlier molecular epidemiology studies, prior to the genome sequencing era, presenting rea data, and, in many cases, relating particular genome types to clinical, epidemiological, and pathogenicity observations. all of these historical strains are physically lost and no longer available for further genomic or laboratory characterization. in essence, however, the value and knowledge of the outbreaks, pathogens, and researchers of the past are not entirely lost if genomes of current pathogenic strains of interest can be compared with published rea patterns of past pathogens, as demonstrated in the genome type identities presented in this report. whole genome characterization of hadv provides a higher-resolution perspective of understanding this pathogen, which may or may not lead to better public health strategies and measures to prevent outbreaks.
as noted for two species b ard pathogens, hadv- and hadv- , ''restricted use'' but effective vaccines can be and are deployed currently in the u.s. military to prevent ard outbreaks [ ] [ ] [ ]. however, even if there were no viable strategy to manage hadv outbreaks, knowing the genome type, either by rea or by whole genome sequencing, allows an understanding of the epidemiology, including potential morbidity and mortality profiles, of the circulating pathogens. as discussed earlier, genome types may have different pathogenicity, infectivity, and virulence profiles; for example, a higher mortality rate was reported for children infected by genome types d and l in korea, with mortality rates of %, compared to . % for hadv- infections. another genome type, hadv- h, also resulted in more severe symptoms, including fatalities, in south america. for their molecular epidemiological studies of hadv- , wadell and colleagues presented numerous rea patterns generated with restriction endonucleases (bamhi, bcli, bgli, bglii, bsteii, hindiii, hpai, smai, ecori, sali, xbai, and xhoi), parsing hadv- isolates from various regions and across many years to divide them into more than genome types [ ] [ ] [ ]. adenoviruses contain relatively stable double-stranded dna genomes. there are seven single base substitutions and a one-base insertion between strains dg _ and hz/shx/ , which led to three non-synonymous substitutions in the dna polymerase, penton base, and kda protein coding sequences. interestingly, there are only three single base substitutions between strains cq _ and hz/shx/ , exclusive of the nucleotide deletions in the itrs of cq _ , which may be due to sequencing errors. the high genome percent identity between strains cq _ and hz/shx/ and the adjacent locations of chongqing and shaanxi ( kilometers apart), where strains cq _ and hz/shx/ were isolated, indicate strain hz/shx/ may be the origin of strain cq _ .
strain dg _ has a higher genome identity with cq _ than with hz/shx/ , which also supports the hypothesis that strain cq _ is the ancestor of strain dg _ .

hadv genomes appear stable in terms of single base changes, as expected for double-stranded dna viruses and as observed in pairs of hadv genomes examined to date, e.g., the prototype versus circulating strains of hadv- and - , separated by approximately fifty years , . less common but biologically and clinically significant larger genome changes are nevertheless observed. these occur either as a single, small recombination event, such as the lateral transfer of the renal pathogen epsilon epitope (hadv- ) providing a ''trojan horse'' effect to the recombinant hadv- , an emergent acute respiratory disease (ard) pathogen in a putatively immune-naive host population , or as multiple and larger recombination events, such as the lateral transfer of the non-pathogen epsilon epitope (hadv-d ) along with multiple other sequences to an emergent recombinant resulting in the highly contagious ocular pathogen causing epidemic keratoconjunctivitis (ekc), hadv-d . additionally, the presence of the epsilon epitope of a nonpathogenic type, hadv- , found in several recently reported emergent recombinant ekc pathogens, hadv- , supports the hypothesis that recombination amongst hadvs is an important mechanism driving the molecular evolution and genesis of hadv pathogens . in both of these latter examples, newly emergent hadv pathogens have the ''serotype'' of nonpathogens but are potent, significant, and highly contagious human pathogens.

recombination appears to play another novel and major role in the molecular evolution of hadvs and genesis of human pathogens. recent reports of hadv genomes containing genome segments, including near-entire genomes, derived from simian adenoviruses (sadvs) indicate zoonosis is an avenue of lateral gene transfer. thus, nonhuman primates may be a wellspring of emergent human pathogens , , and vice versa .
a novel third type of lateral gene transfer is revealed in this newly reported genome of hadv- d strain dg _ , that of a ''moderate-sized'' single whole gene recombination. this serendipitous insight into the molecular evolution of these respiratory pathogens from hadv species b demonstrates that the genomes of individual hadv types, such as type , contain changes that are revealed only by high-resolution genome sequences. such changes may be important in the context of hadv molecular evolution, viral fitness, and the origins and bases of clinical and pathogenicity differences, and may account for emergent and re-emergent pathogens.

blast analysis reveals the recombinant region to encode the entire l / kda gene of hadv-b with flanking non-coding sequences. the blast scores indicate the first highly similar sequence, aside from several type sequences, is that from the hadv-b prototype (max. score , total score , query cover %, e-value . , ident. %) and a hadv-b recombinant ( , , %, . . %), with additional homologous and highly similar sequences found in hadv-b ( , , %, . , %) and hadv-b strains ( , , %, . , %). this gene encodes a dna-binding protein that is expressed in both the early and late stages of infection, suggesting it could play multiple roles in the adenoviral life cycle. the l / kda protein interacts with the iva protein and is also essential for dna packaging , . the effects of this particular moderate-sized recombination from hadv-b into the hadv-b genome chassis, and the nature of the resultant emergent pathogen, are unknown pending wet-bench investigations and additional clinical reports.

the hadv- prototype strain (ay ) analyzed is also known as the gomen strain, which was isolated as a clinical specimen from a throat washing of a u.s. military recruit with pharyngitis . this strain is nearly contemporaneous with the greider strain (ay ) , aka hadv- a, which was used to develop the vaccine strain , .
although there are minor genome differences, e.g., point mutations, between the prototype and the vaccine strains, pairwise genome dot plot analysis (pipmaker) indicated no recombination events . these observations strongly support and validate the recent paradigm change of using the genome data along with biological and clinical profile changes to recognize, characterize, type, and name novel hadvs rather than relying solely, as in the past, on the epsilon and/or gamma epitopes , , determined either by serology or imputed by limited dna sequencing.

with the exception of sporadic hadv- infections reported in children in guangzhou ( ) and the three recent outbreaks, the apparent absence of type ard pathogen circulating in the population of southern china before leads to a concern that the dense city populations in china are now immunogenically naïve with respect to hadv- . in northern china, recently, isolates were typed as hadv- by pcr and sequencing of hexon genes from hadv-positive specimens during - ; hadv- was associated with most of the severe lower respiratory hadv infections . coupled with increased opportunities for travel, a ''perfect storm'' for present and near-future outbreaks of the apparently more severe disease-causing hadv- d strains is foreboding.

in the former outbreak, a total of patients were sampled, with all of the patients being males, with ages between - years . in the latter, patients aged between - years were reported as afflicted with ard . in taiwan, there was a large community outbreak of hadv- in . in this instance, an abrupt increase in the percentage of hadv- infections occurred, from . % in - to % in . the hexon nucleotide sequences of five hadv- isolates collected in taiwan were identical to the sequence of hadv- strain hz/shx/ , which was also identical to dg _ . in the context of the data in this report, these ''taiwan'' hexon genes formed the same subclade with strains hz/shx/ , cq _ and dg _ (fig. a) .
given that only the hexon genes were sequenced, the exact genome types of these strains in the two outbreaks remain unknown. however, the possibility of a hadv- d genome type circulating is foreboding. further data, including complete genome sequencing and in silico rea, are important to confirm this possibility. in the interest of global public health, with these recent outbreaks and the identification of nearly identical contemporary hadv- d genome types, we strongly urge molecular surveillance and genotyping of newly isolated hadv strains in china by whole genome sequencing and/or in silico rea. additionally, the newly redeveloped vaccines, which are now only accessible to the u.s. military , should be made available to the civilian ''at-risk'' public to prevent ''preventable'' highly contagious outbreaks involving hadvs associated with high morbidity rates and fatalities , , [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] , , , , , . in particular, the vaccine against hadv- is urgently needed in china, due to the apparent decades-long absence of circulating hadv- , which presumably resulted in a correspondingly lower level of herd immunity in today's population. given the higher severity of diseases and fatality rates caused by hadv- , especially hadv- d, extensive surveillance and corresponding molecular investigation, including genotyping, genome typing, and genome sequencing, should be carried out when confronting outbreaks of hadv pathogens in the high-density populations of china to protect the public and the global community.

specimen collection and handling. during february to march of , twenty-three primary school children under the age of (dongguan; guangdong province) presented with flu-like symptoms, including fever, pharyngalgia, and coughing as well as other indications of ard. two were hospitalized with severe symptoms.
eleven throat swab specimens were collected into -ml viral transport media; transported at °c- °c; and preserved at °c for virus isolation and nucleic acids extraction. this study protocol was approved by the institutional ethics committee of the center for disease control and prevention of guangdong province (guangdong cdc) and was carried out in accordance with the approved guidelines. the guardians of all under-aged participants gave signed informed consent for participation in the study. data records of the samples and sample collection are de-identified and completely anonymous.

detection of respiratory pathogens. total nucleic acids were extracted from the specimens using the qiaamp minelute virus spin kit (qiagen; hombrechtikon, germany). human adenovirus, respiratory syncytial virus, influenza virus a and b, parainfluenza virus types - , human rhinovirus, human metapneumovirus, and human coronavirus oc and e were detected by real-time pcr as described earlier . for hadv identification, type-specific primers were used to characterize the type by pcr, as described in an earlier report .

adenovirus isolation and genomic dna extraction. adenovirus-positive throat swab specimens, identified by pcr analysis, were inoculated into a cell cultures, and grown in dulbecco's minimum essential medium supplemented with iu penicillin ml⁻¹, mg streptomycin ml⁻¹, and % (v/v) fetal calf serum, in an atmosphere of % (v/v) carbon dioxide. cytopathic effect (cpe) was monitored for at least ten days. viral genomic dna was extracted from infected cells for genomic analysis, as described by le, et al. .

genome sequencing and annotation. the genome of hadv strain dg _ was sequenced using a sanger chemistry-based, primer-walking method by pcr amplification, with overlapping regions sequenced , .
both ends of the genome (including both inverted terminal repeats) were sequenced directly by primers ad -ltrs a ( -gcctcttgacggaactcg- ) and ad -ltrs ( -ggtccctctaaatacacataca- ), respectively, using genomic dna as template; this ensured the accurate determination of the end sequences , . the sequence data, collected with an abi genetic analyzer, provided an average genome coverage of -to -fold, with both strands represented. gaps and ambiguous sequences were pcr-amplified using different primers and resequenced. these sequencing ladders were assembled with the seqman pro software . . (dnastar, inc.; madison, wi. usa). nucleotide and amino acid sequences were aligned with clustal and blast software. the genome sequence was annotated based on the previous annotation of hadv- prototype strain (gomen) and deposited into genbank with the accession number kc .

in silico restriction endonuclease analysis (rea). the specific adenovirus genome type was determined using in silico rea of the whole-genome sequences in accordance with the in vitro protocol described by li, et al . this was performed using the software vector nti advance . (invitrogen corp.; san diego, ca. usa). twelve restriction enzymes were used for this analysis, as performed by li, et al. : bamhi, bcli, bgli, bglii, bsteii, hindiii, hpai, smai, ecori, sali, xbai, and xhoi.

phylogenetic analyses of hadv- hexon genes and the whole genome sequences. the molecular evolutionary genetics analysis (mega) version . . software was used for phylogenetic analyses of the hadv- hexon genes and the whole genomes, with additional sequences retrieved from the genbank database, as described previously , . neighbor-joining phylogenetic trees with , bootstrap replicates were constructed using a maximum-composite-likelihood method with default parameters. bootstrap numbers shown at the nodes indicate the percentages of , replications producing the clade, with a value of noted as robust and significant.
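the in silico rea step described above amounts to locating each enzyme's recognition site in the genome sequence and listing the resulting fragment lengths, which can then be compared between strains. the following is a minimal sketch of that idea, not the vector nti procedure used by the authors. the function names, the toy sequence, and the choice of four enzymes are illustrative assumptions; the genome is treated as linear, and only enzymes with unambiguous six-base palindromic sites are included (sites with degenerate positions, such as those of bgli and bsteii, are left out).

```python
import re

# recognition site and top-strand cut offset for four of the twelve enzymes
# named in the text; the remaining enzymes are omitted from this sketch
ENZYMES = {
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "SmaI": ("CCCGGG", 3),     # CCC^GGG
}


def fragment_lengths(sequence, site, cut_offset):
    """return fragment lengths (largest first) for a linear sequence cut at every site."""
    seq = sequence.upper()
    # a lookahead pattern finds overlapping occurrences of the site as well
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


def rea_pattern(sequence):
    """map each enzyme name to its predicted fragment-length pattern."""
    return {
        name: fragment_lengths(sequence, site, offset)
        for name, (site, offset) in ENZYMES.items()
    }


if __name__ == "__main__":
    toy_genome = "AAAGGATCCTTTTGAATTCAA"  # hypothetical 21-nt sequence, not real data
    for enzyme, pattern in rea_pattern(toy_genome).items():
        print(enzyme, pattern)
```

two strains would be assigned to the same genome type for a given enzyme when their fragment-length lists agree; in practice, small length differences below gel resolution would also need to be tolerated, which this sketch does not attempt.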
archived hadv- genome sequences from genbank were used for phylogenetic analysis. these are as follows (for reference, the names include the corresponding genbank accession number, country of isolation, strain name, year of isolation (if available), and genome type (if available)): ay _gomen_ _ p, jx _chn_cq _ _ d, jf _chn_ hz/shx_ _ d, jx _usa_ak _ _ b, jx _usa_arg/ak _ _ h, jx _usa_ak _ _ d , jx _usa_ak _ _ d , jn _usa_fs _ _ d , jn _jpn_takeuchi_ _ , jn _ar_ - _ _ h, gq _chn_gz _ , hq _chn_gz _ , ay _usa_vaccine_ , ay _chn_vaccine, ay _usa_nhrc_ _ , and kc _chn_dg _ _ d. the hadv- hexon complete sequences used for these analyses are as follows: ab _gomen_ _ p, jn _jpn_takeuchi_ _ , af _usa_ _vaccine_ _ a, ay _usa_vaccine_ , af _chn_beijing, ay _chn_vaccine, jn _ar_ - _ _ h, af _jpn_ _ _ d, af _jpn_bal_ _ d , ay _kr_ - _ _ d, jx _usa_ak _ _ d , jx _usa_ak _ _ b, ay _usa_nhrc_ _ , af _jpn_s- _ _ a, ay _kr_ - _ _ l, ab _jpn_ _ dx, ab _jpn_osaka_ _ dx, jx _usa_arg/ak _ _ h, jx _usa_ak _ _ d , hq _chn_gz _ , gq _chn_gz _ , gu _chn_ hz/shx_ _ d, jn _usa_fs _ _ d , jx _chn_cq _ _ d, jq _chn_hebei_ /sjz_ , jq _chn_hebei_ /sjz_ , jq _chn_hebei_ /sjz_ , jx _tw_tw _ , jx _tw_tw _ , jx _tw_tw _ , jx _tw_tw _ , jx _tw_tw _ , and kc _chn_dg _ _ d. , along with the prototype gomen genome were analyzed for sequence recombination events using the software tool simplot (http://sray.med.som.jhmi.edu/scroftware/simplot/) . for the recombination analysis, mafft software was used first to align the hadv-b species sequences using default parameters (http://mafft.cbrc.jp/alignment/server/). default parameter settings for the simplot software were used for analyzing the whole genomes, along with the following input: window size ( nucleotides [nt]), step size ( nt), replicates used (n ), gap stripping (on), distance model (kimura), and tree model (neighbor-joining). 
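the sliding-window comparison that underlies a similarity plot can be sketched as follows. this is a simplified illustration and not the simplot implementation: it reports plain percent identity rather than a kimura-corrected distance, performs no bootstrap replicates, and assumes the two sequences have already been aligned (for example with mafft) so that they have equal length. the function name and the toy sequences are hypothetical.

```python
def window_identity(query, reference, window, step):
    """percent identity between two aligned sequences in sliding windows.

    columns with a gap in either sequence are removed first (gap stripping),
    so the returned midpoints are positions in the gap-stripped alignment.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (q, r)
        for q, r in zip(query.upper(), reference.upper())
        if q != "-" and r != "-"
    ]
    results = []
    for start in range(0, len(pairs) - window + 1, step):
        chunk = pairs[start:start + window]
        matches = sum(q == r for q, r in chunk)
        results.append((start + window // 2, 100.0 * matches / window))
    return results


if __name__ == "__main__":
    # toy aligned sequences differing at a single position
    for midpoint, identity in window_identity("ACGTACGTAC", "ACGTTCGTAC", 4, 2):
        print(midpoint, identity)
```

a recombinant region would show up as a stretch of windows in which the query is most similar to a different reference than in the rest of the genome.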
the following genomic sequences of hadv-b members were used: hadv-b p (ay ), hadv-b (ay ), hadv-b (ay ), hadv-b (ay ), hadv-b (ay ), hadv-b (ay ), hadv-b (ay ), hadv-b (ay ), hadv-b (ay ), and hadv-b (fj ).

substitution rate analysis of the hexon, penton base and fiber genes in hadv- . the numbers of non-synonymous (ka) and synonymous (ks) substitutions per site between sequences were noted and the ka/ks ratios were calculated. this hadv- analysis was conducted using the nei-gojobori model , and included nucleotide sequences from hexon genes, fiber genes, and penton base genes available from genbank. all positions containing gaps and missing data were eliminated automatically. evolutionary analyses were performed with mega . . . the hadv- complete hexon, penton base and fiber gene sequences available in genbank were retrieved for analysis. the hadv- complete hexon gene sequences used for this analysis are the same as those used in the phylogenetic analysis. the following hadv- complete fiber gene sequences were used: ay , ay , the hadv- complete penton base gene sequences

www.nature.com/scientificreports scientific reports | : | doi: . /srep

q.z. and d.s. conceived and designed experiments. s.z., c.w., c.k., j.s., s.d., l.zou, j.z., z.c., s.j., z.z., j.z., x.wan, x.wu, w.z., l.zhu, d.s. and q.z. performed the experiments and analyzed the data. s.z., c.w., d.s. and q.z. wrote the manuscript. all authors reviewed the manuscript. competing financial interests: the authors declare no competing financial interests.

key: cord- -i t cvr authors: pardo, a. title: the human genome and advances in medicine: limits and future prospects date: - - journal: archivos de bronconeumología ((english edition)) doi: . /s - ( ) - sha: doc_id: cord_uid: i t cvr

on april , , the international human genome sequencing consortium announced the successful completion of its task.
the correct sequence of the bases cytosine (c), thymine (t), adenine (a), and guanine (g) in the gene-containing regions of dna had been elucidated with an accuracy of . % for % of the euchromatin. this is considered to be the most that can be achieved with current technology, and all that now remains is to sequence the remaining regions, which are more difficult because they include almost highly repetitive dna fragments in addition to the centromeres, the structures that divide chromosomes. the consortium of which the human genome project (hgp) formed a part included centers in countries (china, france, germany, great britain, japan, and the united states of america). this international group chose to announce the completion of the task in april in order to coincide with the th anniversary of the publication, in april , of the paper by watson and crick that first described dna's double helix structure. the hgp's initial objectives were fulfilled years ahead of schedule, and, in addition to compiling a highly accurate sequence of the human genome which has been made freely available and accessible to everyone, the consortium has developed a set of new technologies and has constructed genetic maps of the genomes of various organisms. moreover, this program of scientific investigation is linked to a parallel bioethics program. it is also interesting to note that, thanks to advances in technology, this result was achieved for a cost lower than the initial budget, which estimated that mb would be sequenced annually at a cost of . dollars per finished base. the final figure was mb per year at a cost of . dollars per base. the size and scope of the hgp has also provided valuable lessons about the organization and management of large projects involving international collaboration, and those lessons will no doubt prove useful in the administration of other large scale projects. what was the genesis of this project? what general lessons has it taught us so far? 
how will it influence medicine? what future prospects, hopes, and fears has it given rise to? what ethical problems does it pose? these are just some of the general questions that this article will attempt to analyze.

the genome is the total set of genes carried by an organism, and each gene is a segment of dna's double helix structure containing the recipe for making a polypeptide chain in a protein. a protein may contain a single polypeptide chain, as in the case of insulin, and therefore a single gene will code for this protein, or it may contain more than one chain, as in the case of hemoglobin, so that this protein is encoded by more than one gene. there are around billion (us trillion) cells in the human organism, and each one of these contains a complete genome. this genome is found on the pairs of chromosomes in the cell nucleus. around . meters of dna containing approximately million ( billion us) base pairs is packed into the nucleus of each cell. the genetic code uses groups of dna bases to specify the amino acids that make up the polypeptide chains of proteins, the principal actors in life's drama.

one of the first genomes to be completely sequenced was that of simian virus (sv ), which contains nucleotides. by the beginning of the s, viral genomes containing over bases had been sequenced, making it possible for scientists to envisage the possibility of sequencing bacterial genomes containing over bases. when the idea of sequencing the human genome was first proposed during the mid- s, the undertaking seemed hardly feasible using the technology available at that time. however, after various preparatory meetings, the national institutes of health and the department of energy of the usa officially announced on october , the launch of a program to sequence the human genome, and james watson (of watson and crick) was appointed director of the recently created national center for human genome research.
around the same time, the public consortium known as the human genome project was formed, and this organization announced a -year plan (from to ) with the following objectives: a) to determine the complete nucleotide sequence of human dna and identify all the genes in human dna (estimated to number between and ); b) to build physical and genetic maps; c) to analyze the genomes of selected organisms used in research as model systems (eg, the mouse); d) to develop new technologies; and e) to analyze and debate the ethical and legal implications for individuals and for society as a whole. one of the difficulties that had to be overcome in the task of accurately sequencing the bases that make up the human genome was that approximately % of dna is highly repetitive. the strategy adopted by the hgp was to sequence the dna whose location on the chromosomes was already known. however, this strategy was challenged in by j. craig venter and his team, who had just set up a private company called celera genomics. taking advantage of recent advances in technology, this team proposed an alternative strategy based on cutting the genome into small segments and using a computer to reassemble the sequences by matching the overlapping ends of each fragment. with these innovations, this private consortium announced that they would sequence the human genome in years, in other words, that they would complete the task by . this undoubtedly brought immense pressure to bear on the public group, the hgp, headed since by francis s. collins, and also gave rise to fears that a private company might control a large part of the human genome through patents. after several unsuccessful attempts to get the private and public sector groups to collaborate, an agreement was reached to simultaneously publish a first draft of the human genome in february, . this draft did not, however, have the degree of precision of the current one. 
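the alternative strategy described above, cutting the genome into small segments and reassembling them by computer from their overlapping ends, can be illustrated with a greedy merge of the best-overlapping pair of fragments. this is a toy sketch under simplifying assumptions and not the assembler actually used by either group: reads are error-free, all come from the same strand, and the sequence contains no repeats longer than the minimum overlap. the function names and the example fragments are hypothetical.

```python
def overlap(a, b, min_len):
    """length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """repeatedly merge the pair of fragments with the longest overlap."""
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: the result stays in several contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # three overlapping fragments of a short made-up sequence
    print(greedy_assemble(["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]))
```

the sketch also hints at why highly repetitive dna was a difficulty: when the same stretch occurs in several places, overlaps no longer identify a unique neighbor, and a greedy merge can join fragments that are not adjacent in the genome.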
consequently, the hgp consortium published its results in nature in february , and celera did likewise in science. the sequences were subsequently corroborated with a greater degree of reliability, and in april , with the sequence practically complete, the hgp consortium declared the task to be completed. ,

discoveries and surprises

one of the surprising facts thrown up by the sequencing of the human genome was that it only contains approximately genes. owing to its size, it had been estimated that the genome would contain between and genes. in simple organisms, such as yeasts, the number of genes directly correlates with the size of the genome because most of the information in the genome clearly codes for proteins, and the individual genes have a well-defined beginning and a clear stop point and exit for the messenger rna. it had seemed logical, therefore, that the greater the complexity of the organism, the larger would be the number of genes. however, the sequencing of the genomes of other organisms has yielded unexpected results. for example, the common fruit fly, drosophila melanogaster, has approximately genes, fewer than other simpler organisms, such as the roundworm, caenorhabditis elegans, with genes, and the mustard plant, arabidopsis thaliana, with around . therefore, the human genome only has around more genes than arabidopsis despite its obviously greater biological complexity. so we have learned that the human genome has fewer genes than expected and also that the distance separating them is considerable. it has been calculated that the gene density in the human genome is around per bases, while in drosophila this figure is , and in arabidopsis, . it is important to understand that the genes in human dna, as in most eukaryotes, are highly fragmented; in other words, not all of the bases from the beginning to the end of the gene are read to make a protein.
the dna in the genes has coding regions, called exons, interrupted by long noncoding sequences, called introns (intragenic regions). these noncoding regions are removed by the process of splicing in the formation of messenger rna, so that the resulting messenger rna is much shorter than the original dna from which it was produced. for example, it has been reported that around % to % of genes in human chromosomes and undergo alternative splicing, in which the exons combine in different ways and produce various different proteins. , this means that the number and variety of proteins in an organism does not depend solely on the number of genes in the genome, but rather on the way these genes are used.

another important question thrown up by the results of the hgp was the following: if only % to % of the bases in the human genome code for proteins, then what do the rest do? an equivalent part of the noncoding portion of the genome probably contains most of the sequences that regulate the expression of genes, such as the promoters, regions that occur before the beginning of the gene. there are many other elements in the genome that affect the behavior of other components, such as the centromeres and telomeres. finally, a large part of the genome is made up of highly repetitive dna sequences, the function of which is little understood. why are there so many repetitive sequences in the human genome not found in the genomes of invertebrates? many dna sequences seem to have originated as a result of the movement of genetic elements called transposons, segments of dna that can move from one site to another within the genome. it has been postulated that many of the changes that have occurred during the evolution of vertebrates may have been triggered by the action of transposons which jumped to regulatory regions and modified the expression pattern of the genes.
genome sequencing is a tool that allows us to reconstruct the history of hundreds of millions of years of evolution marked by mutation, that is, the process of exchange and rearrangement of the sequences that has contributed to the formation of new species or has given rise to new genes. the task of solving these puzzles and fitting each piece into its place still presents a huge challenge because clues to our history still lie undiscovered in the noncoding sequences found in each chromosome, the sequences previously considered to be "junk dna." for example, the complete sequencing of the sex-determining y chromosome has revealed some very intriguing facts that have aroused great interest among geneticists and biologists who study evolution. these will be described in general terms in the following section. the human sex chromosomes, x and y, both had their origin in the same ancestral autosome several hundred million years ago, but their sequences diverged through evolution. as a result, sequences identical to those of the x chromosome that permit recombination between the two chromosomes in those regions only exist today in the terminal regions of the y chromosome. however, over % of the modern y chromosome has specific regions with no equivalents on another chromosome that would enable recombination during sperm production, and this is a rare example of persistence in the absence of sexual recombination. these regions contain genes that specifically code for testicular proteins as well as highly repetitive sequences which (probably because they are not understood) were previously considered to be nonfunctional "junk" dna. with the complete sequencing of these regions, it has been found that some of these sequences are palindromic (as in the phrase anita lava la tina); that is, they read the same from left to right as from right to left, on both strands of the double helix. 
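the notion of a palindromic dna sequence can be made concrete with a short sketch. it assumes the usual molecular-biology reading of the term: a sequence is palindromic when it equals its own reverse complement, so that both strands read the same in the 5' to 3' direction. the function names and the example sequences are illustrative choices and do not come from the y chromosome studies described above.

```python
# Minimal sketch: test whether a DNA sequence is palindromic in the
# molecular-biology sense (equal to its own reverse complement).
# Function names and example sequences are illustrative assumptions.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq: str) -> bool:
    """True when both strands read identically in the 5' to 3' direction."""
    return seq.upper() == reverse_complement(seq)


if __name__ == "__main__":
    print(is_dna_palindrome("GAATTC"))   # True
    print(is_dna_palindrome("GATTACA"))  # False
```

note that this differs from a text palindrome such as the phrase quoted above: the comparison is made against the complementary strand, not against the same string reversed.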
this fact has led to the hypothesis that x-y recombination has been replaced by recombination between the arms of the y chromosome in the regions where the palindromic sequences are located. in this context, the y chromosome reveals great powers of self-preservation, using evolutionary strategies to survive in the absence of recombination with another homologous chromosome. probably one of the greatest expectations generated by the sequencing of the human genome has been the hope that this knowledge might benefit humans through its medical applications. the understanding of the role played by genetic factors in human health and disease will make it possible for us to discover better ways to approach the prevention, diagnosis, and treatment of pathological processes. it is thought that the science of genomics will soon explain the mysteries of the hereditary factors associated with heart disease, cancer, diabetes, schizophrenia, and many other chronic degenerative processes. it is also hoped that it will give us a better understanding of the genetic factors that influence our susceptibility and/or response to various infectious diseases. genomics holds the promise of individualized medicine that can be tailored to each patient's genetic profile. one of the challenging aspects of any analysis of the influence of an individual's genes on the development of certain diseases is ascertaining whether a particular disease is caused by a single gene or the interaction between several genes. it is also essential to understand how the environment influences the expression of such interactions. there are relatively few known diseases that are associated with mutations in a single gene. they include sickle cell anemia and cystic fibrosis. in the case of the gene that causes cystic fibrosis, over different mutations have been identified that affect the function of the protein it encodes. 
in normal cells, the protein produced by this gene acts as a channel that allows cells to release chloride and other ions. in people with cystic fibrosis, however, this gene has a mutated sequence, and the protein produced is defective so that the cells do not release chloride. the result is an improper salt balance. this gives rise to the production of an abnormally thick mucus which, among other things, obstructs the airways and leads to infections. however, the origin of most human diseases and of the variations in individual responses to drugs is more complex and involves the interrelation between multiple genetic factors, such as genes and the proteins they produce, and nongenetic factors, such as the influence of the environment. although all individuals share dna sequences that are . % the same, each person has a unique genome. the remaining . % is responsible for the genetic diversity between individuals. many differences are due to a variation in a single base pair in a gene. single nucleotide polymorphisms (snps) are variations of a gene that occur because of a change in a single letter (nucleotide) in the dna sequence, for example, the substitution of "cta" for "cca." snps contribute to the differences between individuals. while most of these polymorphisms have no effect, others cause slight differences in certain characteristics that do not affect health, such as physical appearance. others, however, may increase or decrease the individual's risk of developing certain diseases. this happens, for example, in the case of acquired immune deficiency syndrome (aids). we now know that not all individuals exposed to the type human immunodeficiency virus (hiv) become infected, and that the progression period from infection to aids is highly variable among infected individuals. some patients may develop the disease in years, while others remain asymptomatic for more than years. 
although the reasons for these differences are not entirely understood, it has recently been discovered that genetic factors play a very important role in the transmission of the virus and progression to disease. there must be co-receptors on the surface of a cell in order for the virus to attach itself effectively and later infect the host cell. the first of these is cd , the key receptor for t lymphocyte facilitators, and the second is one of the members of the chemokine family of receptors. c-c chemokine receptor (ccr ) is one of the main co-receptors used by the virus to penetrate macrophages and t lymphocytes, so that it plays a critical role in the pathogenic process of aids. several studies have demonstrated that the polymorphic allele ccr -delta (which contains a base pair deletion) has a powerful protective effect in the progression of the hiv infection. similar findings will probably emerge in relation to other diseases, so that in the future we will understand such enigmas as why not all smokers develop chronic obstructive pulmonary disease or lung cancer, or why not everyone who is exposed to avian antigens develops hypersensitivity pneumonitis. scientists have started to compile a catalogue of the common variations in the human population, which includes snps, small deletions and insertions in the coding dna, and other structural differences. part of this database is already available to the public. another important point is that sets of nearby snps on the same chromosome are inherited in blocks. these patterns of snps on a block are known as haplotypes, and certain snps can be used as tags to identify the haplotypes in a block. the elucidation of the complete human genome has given rise to a new project the aim of which is to develop a haplotype map of the human genome called the hapmap. the hapmap locates blocks of haplotypes, and the specific snps that identify them are called snp tags. 
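the idea that a few snp tags can identify every haplotype in a block can be illustrated with a small sketch. it assumes that each haplotype is written as a string of alleles at the same ordered snp positions, and it uses an exhaustive search for the smallest set of positions that still tells all haplotypes apart. the function name, the toy haplotypes, and the brute-force strategy are illustrative assumptions; this is not the method used by the hapmap project.

```python
# Minimal sketch: choose the smallest set of SNP positions ("tags")
# that distinguishes every haplotype in a block.
# Brute-force search; only practical for small toy blocks.
from itertools import combinations


def find_tag_snps(haplotypes):
    """Return a tuple of 0-based positions that separates all distinct haplotypes."""
    distinct = set(haplotypes)
    n_sites = len(haplotypes[0])
    for k in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), k):
            # Project each haplotype onto the candidate tag positions.
            signatures = {tuple(h[i] for i in sites) for h in distinct}
            if len(signatures) == len(distinct):
                return sites
    return tuple(range(n_sites))


if __name__ == "__main__":
    block = ["AGCT", "AGTT", "CGTA"]  # hypothetical haplotypes, 4 SNPs each
    print(find_tag_snps(block))  # (0, 2)
```

in this toy block, genotyping only the first and third snps is enough to recognize all three haplotypes, which is the economy that tag snps are meant to provide.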
the international hapmap project was started in and will be of fundamental importance in examining the genome in relation to phenotypes. it will also be a tool that will enable researchers to identify the genes and genetic variations that affect health and illness. in addition to its use in analyzing the relationship between genes and disease, the hapmap will be a powerful resource for studying the genetic factors that contribute to individual variations in our response to environmental factors, susceptibility to infection, adverse reactions, and response to drugs and vaccines. using only the snp tags, researchers will be able to identify regions on the chromosomes with different distributions of haplotypes in two groups of people, for example, those who suffer from a disease and those who do not. this will also facilitate the development of tests that can predict which medicines and vaccines might be more effective in individuals with particular genotypes for the genes that affect the metabolism of these drugs. the complete sequencing of the genome of an organism is only the first step in the quest to understand its biology. it is still necessary to identify all the genes and ascertain the function of the products expressed by these genes, that is, functional rna and proteins. functional genomics is based on the key premise of the central dogma of molecular genetics, which states that dna sequences are used as templates for the synthesis of rna, and this rna is subsequently used as a template for the synthesis of proteins. moreover, scientists still have to analyze and understand the noncoding regulatory regions and other functional elements of the human genome and of the genomes of other organisms. this has led to the creation of a project called the encyclopedia of dna elements-or encode. 
the goals of this new project are to identify and map the exact location of all the protein-encoding and non-protein-encoding genes, and to identify other functional elements encoded in the dna sequences, such as promoters and other transcriptional regulatory sequences, as well as determinants of chromosome structure and function, such as origins of replication. the aim is to provide a comprehensive encyclopedia of all these elements in order to help researchers better understand human biology and predict potential disease risks, and to stimulate the development of new therapies for the prevention and treatment of disease. it has been said that the basis for understanding the genome of a mammal is the characterization of the part that is transcribed (ie, the transcriptome) and the identification of the proteins it produces (ie, the proteome). many technologies have been developed to study functional genomics, and foremost among these are the cdna microarrays or dna chips, which have been widely used to explore the expression profiles of thousands of genes simultaneously. , this technology has been used to gain a greater understanding of the molecular mechanisms of various diseases, such as, for example, pulmonary fibrosis. idiopathic pulmonary fibrosis belongs to the category of idiopathic interstitial pneumonias and is characterized by the relatively rapid destruction of the lung parenchyma. as a result, some % of patients die within years. in a recent study, lung biopsy samples from patients with idiopathic pulmonary fibrosis and other patients with normal lungs were analyzed using this technique of oligonucleotide microarrays. the results showed that gene expression patterns clearly distinguished normal from fibrotic lungs, and that many of the genes that were significantly increased in fibrotic lungs encoded proteins associated with the extracellular matrix and enzymes responsible for its replacement. 
this study, and others that have investigated various pathological processes, illustrates the analytical power of gene expression in the identification of the molecular pathways involved in disease. the identification of the different groups of genes involved in the pathogenic processes of human disease will also facilitate the discovery of new molecular targets that can eventually be used in the treatment of such diseases. for example, we have recently found in hypersensitivity pneumonitis, an inflammatory lung disease characterized by lymphocytic alveolitis, the exaggerated expression of a chemokine derived from dendritic cells known as ccl . this chemokine is a powerful attractor of t lymphocytes and, at least theoretically, blocking it for therapeutic reasons could reduce the lymphocyte infiltration that characterizes this disease. other new genomic technologies include: a) toxicogenomics, which studies the genetic basis of an individual's response to environmental factors, such as drugs and contaminants; and b) pharmacogenomics, which deals with the development of drugs designed for specific pathogenic processes that will target specific metabolic pathways. in general terms, the genomic sciences have been defined as those which study genes, their products, and their interactions. one of the earliest objectives of the hgp was to set up a program, called elsi, to analyze the ethical, legal and social implications of genomic sciences. 
in this context, unesco created the international bioethics committee, and in published a declaration that states, "recognizing that research on the human genome and the resulting applications open up vast prospects for progress in improving the health of individuals and of humankind as a whole, but emphasizing that such research should fully respect human dignity, freedom and human rights, as well as the prohibition of all forms of discrimination based on genetic characteristics, proclaims the principles that follow and adopts the present declaration." the articles of this declaration deal with the following topics: a) human dignity and the human genome; b) rights of individuals; c) research on the human genome; d) conditions for the exercise of scientific activity; e) solidarity and international cooperation, and f) the promotion of the principles set out in the declaration. article of this universal declaration on the human genome and human rights states: "the human genome underlies the fundamental unity of all members of the human family, as well as the recognition of their inherent dignity and diversity. in a symbolic sense, it is the heritage of humanity." the medical application of the information generated by genetics must be consistent with the general principles of medical ethics: a) beneficence, or acting for the good of individuals and their families; b) doing no harm; c) respecting the autonomy of the individual, that is, allowing individuals to make independent decisions after providing them with information; and d) individual and social justice. genetic information is confidential, and it is the responsibility of institutions and authorities not to interfere without prior consent. however, there are certain circumstances that could justify the intervention of the state, such as those related to public health issues, or the well-founded request of an authority in connection with a judicial investigation. 
how can we define the limits between what is permitted and what is prohibited, or between privacy and responsibility towards third parties? these are the kind of topics that must be discussed and analyzed by the ethics committees in each country, which should then inform their respective legislators on these issues. other aspects that need to be reported and considered include: privacy and justice in the use of genetic interpretation, nondiscrimination, and the need to distinguish between information that we individually prefer not to know and facts that must be revealed for family or social reasons. closely related to these ethical considerations is the problem of the privatization of knowledge and the granting of patents. for example, the last nucleotide in the genetic code of the coronavirus responsible for severe acute respiratory syndrome had hardly been read when the race had already begun to take control of the intellectual rights to the sequence. in private hands, a patent on a viral sequence could delay or increase the cost of developing a treatment or diagnostic tests for a particular disease. this question has caused concern among biomedical researchers, who are afraid that broad patents on genetic sequences will affect research work in universities and public institutions and will have a detrimental effect on future public health strategies. an example of this is the case of the predictive test for breast cancer, which uses the genes brca and brca . the curie institute in paris has been struggling for the right to continue analyzing these genes at a third of the price currently charged by the genome company myriad genetics (utah, usa), which was granted a european patent for these genes in . molecular biology has implicitly promised to transform medicine by elucidating the smallest details of the mechanisms of life. 
to the extent that the molecular processes of diseases are revealed, we will, in many cases, be able to prevent them or to design effective cures or individualized treatments. genetic tests will be able to predict an individual's susceptibility to a disease, and the diagnosis of many pathological processes will be much more detailed and specific than it is today. new drugs will be designed based on an understanding of the molecular mechanisms of common diseases, such as diabetes and systemic arterial hypertension, and it will be possible to treat these diseases by focusing on specific molecular targets. in the case of diseases such as cancer, for example, drugs can be adapted to the specific response of the patient and, within a few decades, it will be possible to cure many potential diseases at a molecular level before they develop. most probably these changes will not all occur in the immediate future. it will take us a long time to understand the human genome, the book of our species, with its chapters called chromosomes, each containing thousands of stories known as genes, composed of paragraphs called exons, interrupted by as yet indecipherable messages called introns, written in words called codons, made up of letters called bases. no doubt, access to the exact sequence of the genome will gradually modify, with increasingly greater impact, the practice of medicine in the coming decades, and in this context it is essential that this knowledge and these technologies be immediately incorporated into public and professional education; this is a priority and the task must begin today. prometheus stole fire from the gods for the benefit of mankind; it is up to us to ensure that our new promethean knowledge be used to throw light on many of the mysteries of biology.
key: cord- -vv gpldi authors: willemsen, anouk; zwart, mark p title: on the stability of sequences inserted into viral genomes date: - - journal: virus evol doi: . /ve/vez sha: doc_id: cord_uid: vv gpldi viruses are widely used as vectors for heterologous gene expression in cultured cells or natural hosts, and therefore a large number of viruses with exogenous sequences inserted into their genomes have been engineered. 
many of these engineered viruses are viable and express heterologous proteins at high levels, but the inserted sequences often prove to be unstable over time and are rapidly lost, limiting heterologous protein expression. although virologists are aware that inserted sequences can be unstable, processes leading to insert instability are rarely considered from an evolutionary perspective. here, we review experimental work on the stability of inserted sequences over a broad range of viruses, and we present some theoretical considerations concerning insert stability. different virus genome organizations strongly impact insert stability, and factors such as the position of insertion can have a strong effect. in addition, we argue that insert stability not only depends on the characteristics of a particular genome, but that it will also depend on the host environment and the demography of a virus population. the interplay between all factors affecting stability is complex, which makes it challenging to develop a general model to predict the stability of genomic insertions. we highlight key questions and future directions, finding that insert stability is a surprisingly complex problem and that there is need for mechanism-based, predictive models. combining theoretical models with experimental tests for stability under varying conditions can lead to improved engineering of viral modified genomes, which is a valuable tool for understanding genome evolution as well as for biotechnological applications, such as gene therapy. a large number of virus genomes have been engineered to carry additional sequences for a variety of purposes. viruses are often used as vectors for heterologous gene expression in cultured cells or the natural host. for example, the baculovirus expression system is widely used for expression work (chambers et al. 
), lentiviruses show great promise for gene therapy (milone and o'doherty ), and phage display allows for selection of desired epitopes (wu et al. ). marker genes have also been built into viruses to facilitate tracking infection spread (dolja, mcbride, and carrington ). because viruses evolve rapidly, including through genome rearrangements, it is unsurprising that the insertion of sequences into viral genomes often goes hand in hand with the rapid occurrence of deletions (koonin, dolja, and morris ; pijlman et al. ; zwart et al. ). the inserted sequence, and sometimes parts of the viral genome, are then rapidly lost. this genomic instability can have economic ramifications, leading to decreases in heterologous protein expression (kool et al. ; de gooijer et al. ; scholthof, scholthof, and jackson ). it can also introduce limitations and complications to working with marker genes (majer, daròs, and zwart ). understanding the stability of inserted sequences therefore has value from an applied perspective, but it could also shed light on basic questions. first, how stable are natural virus genomes, and under what conditions do they become unstable? second, since horizontal gene transfer (hgt) plays an important role in virus evolution, under what conditions are transferred sequences likely to be retained? in this review, we consider the stability of inserted sequences and the dynamics of their removal from virus genomes from an evolutionary perspective. 
first, we provide an overview of empirical results which shed light on insert-sequence stability for viruses, based on the baltimore classification. second, we present some conceptual considerations pertaining to sequence stability, identifying important parameters for understanding and potentially predicting stability. we identify theory and experiments that point toward viable strategies for mitigating the rapid loss of inserted genes, and point out key questions that should be addressed in future research. we argue that virus genome organization has a large impact on the stability of inserted sequences, whilst stability is a complex trait that can depend on environmental conditions. we provide an overview of empirical results for the stability of natural and engineered inserted sequences, following the baltimore classification. our primary focus is on engineered viruses: studies where gene insertions are an addition to the viral genome (leading to an increase in genome size) and where the subsequent fate of these inserted sequences has been tracked. as inserted sequences can incur a fitness cost, these are often quickly purged from the viral genome. often these fitness costs are related to a disruption of the viral genome (e.g. gene order). we therefore also consider studies on genome rearrangements in wild viruses and introduce other relevant modifications that shed light on what the impact of genomic inserts can be. we provide an overview of the results and main conclusions of our review in table . several studies relating to the stability of double-stranded (ds) dna viruses have been published. the dsdna viruses have a wide range of genome sizes, from the small polyomaviridae and papillomaviridae ranging from . to . kbp, to the relatively table . we provide an overview of the main conclusions, for all viruses and for the different baltimore classification groups. 
table columns: viruses; genera covered in relevant studies; conclusions of this review

all viruses
• inserted sequences are often unstable and rapidly lost upon passaging of an engineered virus
• the position at which a sequence is integrated in the genome can be important for stability
• sequence stability is not an intrinsic property of genomes because demographic parameters, such as population size and bottleneck size, can have important effects on sequence stability
• the multiplicity of cellular infection affects sequence stability, and can in some cases directly affect whether there is selection for deletion variants
• deletions are not the only class of mutations that can reduce the cost of inserted sequences, although they are the most common

i: dsdna (alphabaculovirus, lambdavirus, mastadenovirus, orthopoxvirus, t likevirus, varicellovirus)
• large genomes that are readily engineered and also highly plastic, as exemplified by the 'genome accordion' in poxviruses
• small insertions can be stable, but larger insertions are rapidly lost
• classic studies with phages exemplify how lower limits to the size of packaged genomes can be used to increase insertion stability

ii:

the inverted terminal repeats of vaccinia virus undergo rapid changes in size due to unequal crossover events leading to stable and unstable forms (moss, winters, and cooper ). the diversity in this region is needed for immune evasion and for the colonization of novel hosts and appears to be mainly regulated by recombination events. however, other processes such as mutation leading to accelerated rates of recombination cannot be ruled out. poxviruses, such as vaccinia virus, are classified as nucleocytoplasmic large dna viruses (ncldvs). these viruses have larger than average genome sizes and the more recently discovered giant viruses are also classified as such. 
the ncldvs appear to have undergone a dynamic evolution where gene gain and loss events go in parallel with host-switches between animal and protist hosts (koonin and yutin ). interestingly, the phylogenomic reconstructions performed by koonin and yutin ( ) suggest that giant viruses (for which the host range appears to be restricted to protists) have evolved from simpler viruses (infecting animals) on many independent occasions. this again suggests that the host plays an important role in genome stability where in animals the pressure for smaller virus genomes is stronger than in protists (koonin and yutin ). experimentally it has also been shown that vaccinia virus has a highly plastic genome. after deletion of one host range gene of vaccinia virus, another host range gene increases in copy number (elde et al. ), leading to genomic expansion. the increased gene expression is in itself beneficial, but the high gene copy number also increases the supply of beneficial gain-of-function mutations. once these gain-of-function mutations are fixed in the population, the other copies of the gene are lost and thus the vaccinia genome size decreases (associated with the cost of an increased genome size) (elde et al. ), leading to accordion-like evolutionary dynamics (andersson, slechta, and roth ). modified vaccinia virus ankara (mva) is used as a viral vector for the development of vaccines against infectious diseases such as malaria, influenza, tuberculosis, hiv/aids, and ebola (sutter and staib ; gómez et al. ; gilbert ; stanley et al. ). the optimization of poxvirus promoters in this viral vector has proven to be an effective strategy for increasing the stability of antigen (inserted sequence) expression, and therewith the development of mva-based vaccines (alharbi ). although live attenuated vaccines have substantially reduced rabies prevalence after oral-vaccination campaigns were conducted (lafay et al. ; macinnes et al. 
), such live vaccines are not efficacious in all rabies vector species. as an alternative, recombinant human adenovirus vaccine vectors expressing the rabies glycoprotein have been developed. the fitness of a replication-competent human adenovirus expressing the rabies glycoprotein was similar to that of the wild-type virus, as tested in vitro (knowles et al. ). moreover, the inserted rabies virus gene was stable during both in vivo and in vitro passaging (knowles et al. ), demonstrating the potential of this recombinant vaccine vector as an effective alternative. non-human adenoviruses can be used as alternative vaccine vectors, providing several advantages such as a limited host range and restricted replication in non-host species. by using bovine adenovirus type , a variety of antigens and cytokines were successfully expressed in vivo (ayalew et al. ). the stability of bovine adenovirus type was tested by inserting the eyfp marker and subsequently passaging the recombinant virus in cell culture (ren et al. ). although replication of this recombinant virus was less efficient than the wild-type virus, the inserted eyfp was stable. engineered alphabaculoviruses (infecting arthropods) are widely used as vectors for the expression of heterologous genes in insect cells. nonetheless, during serial passaging defective interfering (di) baculoviruses that lack large portions of the genome are rapidly produced, in what appears to be an intrinsic property of baculovirus infection (pijlman et al. ). as a result of having a smaller genome size, these dis most likely have a replicative advantage (higher fitness). especially in bioreactor configurations where the cellular multiplicity of infection (moi, the number of virus particles infecting a cell) is high, faster-replicating dis can rapidly reach high frequencies (kool et al. ). 
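how quickly a faster-replicating variant can take over a population can be illustrated with a deliberately simple sketch. it assumes a deterministic, haploid selection model in which the variant has a constant relative fitness w compared with the full-length virus, and it ignores the fact that dis depend on co-infection with a helper virus, so that their real advantage changes with moi. the function name, the starting frequency, and the fitness value are hypothetical and are not estimates from the studies cited above.

```python
# Minimal sketch: frequency of a faster-replicating variant under a
# deterministic haploid selection model with constant relative fitness.
# All parameter values are hypothetical; helper-virus dependence of
# defective interfering genomes is deliberately ignored.


def variant_frequency(p0: float, w: float, generations: int) -> list:
    """Return the variant frequency at each generation, starting from p0."""
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # The variant contributes p * w offspring, the full-length virus 1 - p.
        p = p * w / (p * w + (1.0 - p))
        trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    # A variant starting at 1% with a twofold replicative advantage.
    for generation, freq in enumerate(variant_frequency(0.01, 2.0, 10)):
        print(generation, round(freq, 3))
```

under these assumptions the odds of the variant are multiplied by w every generation, so even a rare variant becomes the majority within a handful of passages.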
the rapid generation of dis involves several recombination steps and prevents the development of stable baculovirus expression vectors, as inserted sequences are then also rapidly lost (pijlman et al. ). the loss of sequences inserted into baculovirus genomes is not only due to the formation of dis. when an origin of replication that is enriched in di genomes was removed, baculovirus genomic stability at high mois increased as no dis were observed. strikingly, inserted foreign sequences were still rapidly lost (pijlman, van schijndel, and vlak ), showing that rapid di generation is not the only impediment to the stability of inserted genes. addition of endogenous viral sequences (homologous repeat regions important for baculovirus replication) to inserted sequences promoted the stability of insertions (pijlman et al. ), highlighting the importance of the genomic context for insert stability. another study in which the importance of the genomic context was stressed involved the generation of infectious clones and determination of the stability of suid herpesvirus , the causal agent of aujeszky's disease. sequences inserted in infectious clones were genetically stable in escherichia coli. however, for the reconstituted viruses, the insertion at the gg locus was highly unstable, whereas the same insert was stable when inserted between the us and us genes (smith and enquist , ). stability was only determined in a short-term experiment, but these results nevertheless emphasize the importance of the genomic context for stability, even in viruses with relatively large and stable genomes. bacteriophages were instrumental in the development of molecular cloning methods. among dsdna phages, lambdaviruses of e. coli were widely used as cloning vectors, and methods were developed to increase the stability and maximum size of inserts (chauthaiwale, therwath, and deshpande ). 
one interesting approach made use of the fact that there is a minimum genome size for efficient packaging into virus particles. when endogenous genes that are non-essential for the lytic cycle are removed, not only can larger sequences be inserted, but there is also selection for maintaining the inserted sequences because they increase genome size and enable packaging (thomas, cameron, and davis ) . moreover, it has been shown that phage t engineered with a biofilm-degrading enzyme (dispersin b) was superior to unmodified phage at clearing short-term biofilms (lu and collins ) . although providing a 'public' benefit in the form of an exoenzyme that can degrade host defenses, surprisingly this insertion does not have a cost and is therefore stable (schmerer et al. ) . interestingly, the insertion of an endosialidase at the same locus was both beneficial and costly, although in this case evolutionary stability was not determined (gladstone, molineux, and bull ) . in summary, engineered dsdna viruses containing foreign gene insertions are relatively unstable and stability is only reached when the genomic context and demographic conditions (e.g. census population sizes, bottleneck sizes, and population structure) are optimal. contrarily, in natural conditions dsdna viruses appear to be highly plastic where increases and decreases in genome size occur on a relatively short evolutionary time scale. in particular, host-switches may play important roles in increased plasticity and stability of dsdna viral genomes. even though unstable viral genomes may help increase viral fitness by avoiding the hosts' immune system in natural conditions, it may also prevent the development of stable viral expression vectors in bioreactor configurations. the ssdna viruses have much smaller genome sizes as compared to the dsdna viruses (group i), ranging from . to . kbp genomes of the circoviridae to the . kbp genome of the spiraviridae. 
judging only by the small range in genome size, one would expect ssdna viruses to be less plastic than their dsdna counterparts, and thus less likely to accept foreign genes in their genomes. although few studies have addressed the genomic stability of ssdna viruses after an insertion, the geminiviridae, with genomes of about . - kbp (monopartite) or . - . kbp (bipartite), provide an example of wild viruses with frequent sequence insertions, duplications, and deletions. during the course of geminivirus infection in plants, shorter subgenomic dnas often arise. these subgenomic dnas vary in size, and some are defective dnas (stenger et al. ; stanley et al. ; patil et al. ) that replicate at the expense of the full-length genome. these subgenomic dnas can lead to reduced symptom severity in plants and thereby act as modulators of viral pathogenicity. it is speculated that the (sometimes stepwise) deletion process leading to subgenomic dnas can also lead to reversion to full-length dna molecules that, because of insertions or deletions, are bigger or smaller than the wild-type genome (martin et al. ). when sequences were inserted into the genome of maize streak virus (msv, geminiviridae), the infection efficiency decreased as the size of the insert increased (shen and hohn ). although some of the msv mutants acquired deletions and reverted to the wild-type length, the frequency of the deletion process did not increase linearly with the size of the insert, but rather depended on the nature of the sequence (shen and hohn ). deletion mutants of african cassava mosaic virus (acmv) have also been shown to revert to the original wild-type genome length through recombination between the two components of the bipartite genome (etessami, watts, and stanley ).
the selection pressure on the reversion to wild-type genome length is probably a strong size constraint on encapsidation, where in the case of acmv the size of encapsidated dna determines the multiplicity of geminivirus particles (frischmuth, ringel, and kocher ). the nanoviridae family includes ssdna viruses with a multipartite genome composed of six to eight circular segments. segmented ssdna viruses present unique challenges when thinking about the stability of inserted sequences, because the frequency of genomic segments is highly plastic for some of these viruses. these viruses might therefore lower the frequency of a segment to downregulate expression of the inserted sequence, even if the accompanying downregulation of co-localized homologous genes is costly. a lower frequency would also entail a lower mutational supply, limiting evolvability, i.e. the capacity of the virus to generate beneficial variation and subsequently adapt. segmented viruses might therefore display rapid adaptive responses to inserted sequences, whilst simultaneously limiting their potential for longer-term evolution. to the best of our knowledge, this potentially interesting tradeoff has not been shown. inserted sequences can be unstable in ssdna phages, which like their dsdna counterparts can also have an upper limit to genome size. inserts of up to bp were stable in φx174, despite markedly reducing fitness (russell and muller ). genomes with larger insertions were still infectious, although the insert was then rapidly lost. later, it was shown that short palindromic sequences could be inserted in φx174, but that these inserts become more unstable as the number of repeats is increased and when the repeats are identical in sequence (williams and müller ). in other work, it has been shown that phage display (wu et al.
) can be used to select clones coding for peptides with high affinity for a particular target, although selection for m phages with no insert-due to their presumed faster replication-can hamper 'phage panning' (tur et al. ) . based on the little evidence we obtained there appears to be strong selection for genome streamlining in ssdna viruses. after a sequence insertion, reversion to the wild-type genome size is observed in both natural and laboratory conditions. interestingly, the nature of the insert appears to be more important than the size of the insert, indicating that the genomic context also plays an important role in the stability of ssdna viruses. the dsrna viruses have a range of genome sizes ( . - . kb) that is similar to the ssdna viruses. most of the dsrna viruses contain segmented genomes, where during replication, positivesense ssrnas are packaged into procapsids and serve as templates for dsrna synthesis. thus, the progeny particles contain a complete set of equimolar genome segments. proper recognition and stoichiometrical packaging of the ssrnas is indispensable for multi-segmented genome assembly. although, different dsrna viruses employ different mechanisms for this assembly, these all rely on proper recognition of the ssrnas in either specific rna-protein or rna-rna interactions (borodavka, desselberger, and patton ) . we therefore expect that dsrna virus genomes are highly streamlined, since most gene insertions will probably disturb the recognition and packaging process of the ssrnas. interestingly, for rotaviruses it has been observed that genome segments containing sequence duplications are preferentially packaged into progeny viruses relative to wild-type segments (troupin et al. ) , indicating that an increase in genome/segment size may not be a hard constraint. we hypothesize that few if any gene insertions will lead to viable genomes due to the perturbation of segmented genome assembly into virus particles. 
if a gene insertion happens to be viable, it will probably be rapidly purged from the viral genome. one exception to this hypothesis could be a gene insertion originating from a closely related virus, for example a virus with similar packaging signals, leading to a fitness advantage, such as increased packaging efficiency. only a small number of studies that test the stability of inserts in dsrna viruses are available, and these concern the generation of recombinant rotavirus expressing foreign genes. group a rotavirus, consisting of eleven segments, has been engineered to express fluorescent proteins (kanai et al. ; komoto et al. ) such as enhanced green fluorescent protein (egfp) and mcherry. however, the segment in which these genes were introduced is expressed at low levels and is subject to proteasomal degradation. based on limited evidence, we tentatively conclude that stoichiometric packaging of segmented genomes may form an impediment to engineering and insert stability. however, recent work also suggests that careful engineering of dsrna viruses may lead to stable sequence insertions. the generality of these conclusions for the dsrna viruses, and their dependence on environmental and demographic conditions, remain to be seen. the ssrna(+) viruses range in genome size from . to kbp. it has been shown that both animal and plant ssrna(+) viruses can express inserted foreign genes. however, the nature of the ssrna(+) genomes poses several limitations to efficient expression and maintenance of the insert. most ssrna(+) genomes used for the expression of foreign genes code for a polyprotein, a single orf that is further processed after translation into different mature peptides. the processing occurs through autocatalytic cleavage at specific cleavage sites located between the different proteins to be expressed. insertions should therefore be carefully engineered, including proper cleavage sites corresponding to the site of insertion.
even when these design rules are respected, inserts may restrict viral replication because conformational constraints prevent proper protease cleavage. in addition, the genomes of rna viruses tend to be composed of overlapping genes (belshaw, pybus, and rambaut ), which limits their adaptive capacity (simon-loriere, holmes, and pagán ). overlap can form an impediment to engineering, and perhaps to the likelihood that an inserted sequence is maintained, as insertions will often affect multiple genes. we focus exclusively on engineered viruses, given that there are many examples for this virus group. poliovirus is a good candidate as a live viral vector for the expression of foreign genes, since the attenuated sabin strains of poliovirus elicit strong protective immune responses without causing disease (sabin ). insertions of up to bp of the rotavirus vp gene into sabin poliovirus gave rise to infectious viruses that expressed portions of the vp outer capsid protein (mattion et al. ). this is promising as antibodies generated to vp are able to neutralize the virus. nevertheless, the size of the insert in this construction is limited, as only inserts of about bp or smaller were stable upon serial passages in tissue culture, whereas larger insertions failed to produce infectious viruses (mattion et al. ). one of the major limitations appears to be the polyprotein nature of the genome. the recombinant viruses expressing the inserted gene were found to be slower in the assembly of infectious virus particles, and showed smaller plaques and lower virus titers. this is possibly due to slow cleavage at the artificial cleavage sites around the insert (mattion et al. ). sindbis virus, another ssrna(+) virus whose genome encodes a single orf, accepted relatively large inserts of . kbp in the . kbp genome (pugachev et al. ).
however, the recombinant sindbis viruses appeared unstable and especially inserts at the end were rapidly lost during serial passages, suggesting a positional effect. for members of the flaviviridae family, such as west nile virus and hepatitis c virus (hcv), inserted reporter genes appear to be unstable. this instability is related to the size of the insert, and comes about because of the disruption of structural rna elements required for viral replication (ruggli and rice ; pierson et al. ). to cope with these issues, recombinant flaviviridae viruses carrying the split-luciferase gene were generated (tamura et al. ), including dengue virus, japanese encephalitis virus, hcv and bovine viral diarrhea virus. in vitro, these recombinant viruses appear to be evolutionarily stable and their propagation was comparable to that of the wild-type virus, most probably due to the small amino acid insert size. to demonstrate the utility of the split reporter system (determining in vivo viral dynamics and the efficacy of antiviral reagents), the recombinant hcv was tested in chimeric mice. chronic infection was established and the luciferase gene was stably maintained in the viral genome (tamura et al. ). live attenuated vaccines for porcine reproductive and respiratory syndrome virus (prrsv) have failed to provide effective protection due to the genetic diversity of circulating prrsv strains. to improve the efficacy of prrsv vaccination, a recombinant virus expressing porcine interleukin- (a regulator of the immune response) was constructed (zhijun ). this recombinant virus remained stable upon serial passaging in vitro, and induced higher ratios of interleukin- and cd +cd + double-positive t cells in vivo. despite the presumably better immune response of the host, the recombinant prrsv vaccine did not significantly improve protection efficacy (zhijun ). in another attempt, granulocyte-macrophage colony-stimulating factor (gm-csf) was inserted in a prrsv vaccine strain.
the inserted gene was stably expressed upon serial passaging in vitro, and the presence of gm-csf led to increased surface expression of mhci+, mhcii+, and cd / + (yu et al. ). although evaluated solely in vitro, this recombinant strain is expected to elicit stronger immune responses and thereby improve vaccine efficacy against prrsv infection. it has been shown that many different plant viruses can express foreign genes, and they have the advantage of being able to express these directly in vivo. as an initial strategy to express foreign genes in plants, on many occasions viral genes were replaced with the gene of interest (gene replacement instead of gene insertion). this strategy appeared to be (partially) successful in plant ssdna viruses (hayes et al. ; ward, etessami, and stanley ; hayes, coutts, and buck ), as the replaced coat protein did not appear to play an essential role in virus spread throughout the plant host (ward, etessami, and stanley ). however, viral ssrna(+) genomes seem to be less plastic, as the replacement strategy was mostly unsuccessful in plant rna viruses. although the rna viral vectors permitted the expression of replaced genes, either they were only viable in protoplasts and not in whole plants (french, janda, and ahlquist ; joshi, joshi, and ow ), or they were unable to establish systemic infections (takamatsu et al. ; dawson, bubrick, and grantham ). shortly after, studies showed that gene insertion, rather than gene replacement, was better suited for expressing foreign genes in ssrna(+) viral genomes (dawson et al. ; donson et al. ; chapman, kavanagh, and baulcombe ). the chloramphenicol acetyltransferase (cat) gene (dawson et al. ), and the dihydrofolate reductase (dhfr) and the neomycin phosphotransferase (npt) genes (donson et al. ) were successfully expressed in plants by using tobacco mosaic virus (tmv) as a vector.
in addition, the bacterial gus gene has been shown to be successfully expressed when inserted into the genome of potato virus x (pvx) (chapman, kavanagh, and baulcombe ). however, in all these cases the presence of a foreign gene leads to genomic instability, resulting in the partial deletion of the gus and npt genes and a complete deletion of cat during systemic infection. this instability may result from the insert lowering the accumulation of genomic rna, destabilizing the mrna, and/or interfering with the synthesis of the viral proteins. sequence redundancy due to a promoter duplication can also lead to genomic instability and thus the subsequent deletion of the inserted sequence (dawson et al. ; chapman, kavanagh, and baulcombe ). indeed, for tmv and pvx it has been shown that replacing one of the promoter sequences with that from related viruses (donson et al. ), together with further removal of additional sequence duplications (dickmeis, fischer, and commandeur ), leads to increased stability of the insert. interestingly, as for the dna viruses, the site and size of the insert seem to be important for ssrna(+) viruses. first, positioning the cat gene downstream (instead of upstream) of the tmv coat protein resulted in a poorly replicating virus that was not able to systemically infect the host plants (dawson et al. ). second, the dhfr gene ( bp) inserted in a tmv background appears to be maintained stably through several passages, while the larger npt gene ( bp) in the same experimental setup was unstable during systemic movement of the virus. this may also be related to the nature of the insert, where sequences with a codon usage similar to that of the viral vector may be retained longer than those with a very different codon usage.
interestingly, chung, canto, and palukaitis ( ) generated recombinant plant viruses with inserted genes of unrelated plant viruses and observed instability and variation in the rate of partial or complete loss of the insert depending on the inserted sequence itself, the host used, or the viral vector used. sequences with a high toxicity for the host are also more likely to be deleted faster or to impede viral replication. in a previous study we reported on experimental evolution of pseudogenization in virus genomes using tobacco etch virus (tev) expressing egfp (zwart et al. ), a gene known to be toxic in many expression systems. in this case egfp can be considered a non-functional sequence, as it does not add any function to the viral genome. we showed that egfp has a high fitness cost in tev, and the loss of egfp depended on the passage length, where longer passages led to a faster and more certain loss. similarly, prolonged propagation of tev and plum pox potyvirus expressing gus (dolja, mcbride, and carrington ; dolja et al. ; guo, lópez-moya, and garcía ), and tmv expressing gfp (rabindran and dawson ), led to the appearance of spontaneous deletion variants. due to the increase in genome size, viruses that carry an insert are unlikely to be as fit as the parental (ancestral) virus, even if they accumulate initially to similar levels. the tev-egfp genomes that had lost the insert had a within-host competitive fitness advantage, where the smaller the genome the higher the within-host competitive fitness. interestingly, although the size of the deletions varied, convergent evolution did occur in terms of fixed point mutations (zwart et al. ). this result also suggests that a demographic 'sweet spot' exists, where heterologous insertions are not immediately lost while evolution can act to integrate them into the viral genome.
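the link between passage duration and insert loss can be made concrete with a short calculation. this is only a sketch under assumptions that are not taken from the studies cited above: both variants grow exponentially within a passage (far below carrying capacity), deletion variants arise from the full-length virus at a constant rate, and each passage is founded by a fixed number of genomes. the symbols $i$, $d$, $x_i$, $x_d$, $q$, $a$, $T$ and $s$ are introduced here purely for illustration.

```latex
% assumed within-passage dynamics: i = full-length virus, d = deletion variant,
% x_i and x_d = growth rates, q = rate at which deletions arise
\frac{di}{dt} = (x_i - q)\, i, \qquad \frac{dd}{dt} = x_d\, d + q\, i

% the ratio r = d/i then obeys, with s = x_d - x_i + q > 0 and r(0) = 0,
\frac{dr}{dt} = s\, r + q
\quad\Longrightarrow\quad
r(t) = \frac{q}{s}\left(e^{s t} - 1\right)

% a bottleneck of a genomes is expected to carry over a deletion variant
% once a\, r(T) \gtrsim 1 (using f_d \approx r while r is small), i.e. for passages longer than
T^{*} = \frac{1}{s}\,\ln\!\left(1 + \frac{s}{q\, a}\right)
```

under these assumptions $T^{*}$ shrinks when the deletion variant has a larger advantage $s$, when deletions arise more often (larger $q$), or when the bottleneck $a$ is wider, so short passages and narrow bottlenecks both delay loss of the insert.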
in summary, in several studies passage duration has an effect on insert stability, with inserts being more stable in shorter passages. we explore these effects in the conceptual section presented at the end of this paper (see also box ). here we illustrate how demography can affect the observed stability of an inserted sequence, using a simulation model. this model is based on (willemsen et al. ) and incorporates logistic virus growth, deterministic recombination with a fixed rate, and population bottlenecks after a given number of generations. to describe virus growth and recombination in each generation, two coupled ordinary differential equations are used. here, $i$ is the number of viruses with the insertion intact, $d$ is the number of viruses with a deletion, $x$ is the initial growth rate of each virus variant, $j$ is the carrying capacity, $q$ is the rate at which $i$ recombines to $d$, and $w$ is a constant determining the effect of each virus on the other's replication, with the effect of $d$ on $i$ being $w_d = x_d / x_i$ and vice versa $w_i = 1 / w_d$. the frequency of the deletion variant is $f_d = d / (i + d)$. at the start of each passage, to simulate the bottleneck we draw the number $i$ from a binomial distribution with size $a$ and success probability $1 - f_d$, with $f_d$ taken from the previous time point, and then $d = a - i$. to illustrate the effects of bottlenecks we chose the parameters in table , set the initial $f_d$ to zero, and considered various values of $a$. the difference in fitness between the virus with and without the insertion, expressed as the ratio $x_i / x_d$, is large. the simulation data illustrate how under these conditions narrow bottlenecks can lead to a stable inserted sequence (fig. ). during each round of passaging the frequency of the deletion variant increases, but as it does not reach a frequency near $1/a$ this variant is not sampled during the bottleneck. only when the bottleneck is wider is the probability of sampling the virus variant with a deletion large enough for this to occur regularly.
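the model described above can be sketched in a few lines of code. the two differential equations are not reproduced in the text, so the lotka-volterra-type form used below is an assumption chosen to match the listed ingredients (logistic growth, competition weights $w_d$ and $w_i$, one-way recombination at rate $q$); the function name, the euler integration, the step size and all parameter values are likewise hypothetical and are not the values from the table referred to above.

```python
import random


def simulate_passages(x_i, x_d, j, q, a, n_passages, t_passage, dt=0.01, seed=1):
    """Return the deletion-variant frequency f_d at the end of each passage."""
    rng = random.Random(seed)
    w_d = x_d / x_i  # effect of d on the replication of i
    w_i = 1.0 / w_d  # effect of i on the replication of d
    steps = int(round(t_passage / dt))
    f_d = 0.0  # the initial population carries no deletion variants
    history = []
    for _ in range(n_passages):
        # bottleneck: a founders, each a deletion variant with probability f_d
        d = float(sum(rng.random() < f_d for _ in range(a)))
        i = a - d
        for _ in range(steps):
            # assumed form of the two coupled equations (explicit euler step)
            di = x_i * i * (1.0 - (i + w_d * d) / j) - q * i
            dd = x_d * d * (1.0 - (d + w_i * i) / j) + q * i
            i = max(i + di * dt, 0.0)
            d = max(d + dd * dt, 0.0)
        f_d = d / (i + d)
        history.append(f_d)
    return history


if __name__ == "__main__":
    # hypothetical parameters: the deletion variant grows twice as fast
    for a in (1, 10, 1000):
        freqs = simulate_passages(x_i=1.0, x_d=2.0, j=1e6, q=1e-4,
                                  a=a, n_passages=10, t_passage=3.0)
        print(a, [round(f, 3) for f in freqs])
```

comparing the printed trajectories for different values of `a` gives a rough feel for how bottleneck width changes the chance that a deletion variant is carried into the next passage; the outcome for small `a` is stochastic and depends on the seed.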
once a deletion variant has been sampled during the bottleneck, it rapidly goes to fixation as it has a much higher fitness than the full-length virus. figure provides a simple illustration of the same principle. when considering host species jumps using the same tev-egfp vector, we showed that host switches can radically change evolutionary dynamics (willemsen, zwart, and elena ). after over half a year of evolution in two semi-permissive host species, with a large difference in virus-induced virulence, the egfp insert appears to remain stable. a fitness cost of egfp was only found in the host for which tev has low virulence. in the host for which tev has high virulence there was no fitness cost and viral adaptation was observed. this contradicts theories that suggest that high virulence could hinder between-host transmission. when considering the evolution of genome architecture, host species jumps might play a very important role, by allowing evolutionary intermediates to be competitive. the stability of an insert could change when considering insertions that might be beneficial for the virus. using the tev genome we simulated two hgt events, by separately introducing functional exogenous sequences that are potentially beneficial for the virus (willemsen et al. ). in one case, the insertion was rapidly purged from the viral genome, restoring fitness to wild-type levels. in another case, the inserted gene (the b rna silencing suppressor from cucumber mosaic virus) did not seem to have a major impact on viral fitness and was therefore not lost during experimental evolution. interestingly, this insertion duplicated the rna silencing suppression function of another gene in the genome. when the corresponding functional domain of the tev gene was mutated, the inserted gene provided a replicative advantage. these observations suggest a potentially interesting role for hgt of short functional sequences in alleviating evolutionary constraints on viruses.
besides hgt, another mechanism for evolutionary innovation is gene duplication. the stability of a genetically redundant insert might be variable. on one hand, one would expect the duplicated copy to be rapidly deleted from the genome as it does not confer an additional function. on the other hand, if a duplicated sequence is stable it may act as a stepping stone to the evolution of new biological functions. we have investigated the stability of genetically redundant sequences by generating tev viruses with potentially beneficial gene duplications (willemsen et al. ). all gene duplications resulted in a loss of viability or in a significant reduction in viral fitness. experimental evolution always led to deletion of the duplicated gene copy and maintenance of the ancestral copy. however, the stability of the different duplicated genes was highly divergent, suggesting that passage duration is not the main factor determining whether the insert will be stable or unstable. the deletion dynamics of the duplicated genes were associated with the passage duration and the size of the duplicated copy. by developing a mathematical model we showed that the fitness effects alone are not enough to predict genomic stability. a context-dependent recombination rate is also required, with the context being the identity of the insert and its position. in summary, these experimental observations demonstrate the deleterious nature of gene insertions in ssrna(+) viruses, where the highly streamlined genomes limit sequence space for the evolution of novel functions, and in turn adaptation to environmental changes. the ssrna(−) viruses have genomes that range from to . kbp in size. these viruses are particularly attractive candidates as viral vectors. while in ssrna(+) viruses inserts are subject to deletion, inserts in their ssrna(−) counterparts appear more stable (mebatsion et al. ; schnell et al. ).
one reason for this stability is that in general the genes in the ssrna(−) viral genomes are non-overlapping and are expressed as separate mrnas, giving a modular organization that can be easily manipulated for the insertion of foreign genes. if correctly engineered (e.g. without affecting any regulatory regions), one could expect gene insertions to be more stable in ssrna(−) viruses than in ssrna(+) viruses, since the complexities surrounding correct processing of a polyprotein are not an issue here. moreover, if expressed as a separate mrna, the size of the insert is probably restricted only by the packaging limits of ssrna(−) viruses. the low rate of homologous recombination in ssrna(−) viruses can be another explanation for higher genomic stability (chare, gould, and holmes ; han and worobey ). non-homologous recombination will probably rarely lead to variants with the insert deleted and other regions undisturbed, given it is less constrained than homologous recombination, and hence low homologous recombination rates could be a limiting factor on sequence evolution. however, genomic deletions that disrupt the inserted sequence will be subject to fewer constraints, as for example they can disrupt the reading frame of the insert without affecting the expression of virus genes. canine distemper virus (cdv), a species in the morbillivirus genus, is an important pathogen of a variety of animals, including the dog. this virus, however, has been shown to be a promising expression vector for the development of vaccines. although the replicative fitness of a recombinant cdv carrying the rabies virus glycoprotein was slightly lower than that of the wild-type cdv, the insert was stably expressed during serial passaging in vitro and inoculation in vivo induced specific neutralizing antibodies against both rabies and cdv.
similarly, genes expressing foreign antigens can be cloned into recombinant measles virus, where measles virus proteins and inserted genes are coexpressed. this relatively small vector can accept large gene insertions that in most cases are stably expressed (billeter, naim, and udem ; malczyk et al. ). for example, for the development of a vaccine against middle east respiratory syndrome coronavirus (mers-cov), it has been shown that a recombinant measles virus expressing the spike glycoprotein of mers-cov is genetically stable in vitro and induces strong humoral and cellular immunity in vivo (malczyk et al. ). vesicular stomatitis virus (vsv) is a commonly used vaccine vector that has been engineered to express surface proteins from diverse viruses, including ebola (garbutt et al. ), human immunodeficiency virus type (hiv- ) (johnson et al. ), and influenza a (roberts et al. ), which can stimulate protective immune responses against these pathogens (bukreyev et al. ). in addition, vsv has shown promise as a candidate for oncolytic virus therapy, as it replicates most efficiently in cells with diminished innate immunity such as cancer cells, which often have impaired production of and/or response to interferon (barber ). mutations that attenuate vsv growth in healthy immune-competent cells can further enhance the safety of this potential anti-cancer therapy (barber ). what is particularly interesting about the genome organization of vsv and other ssrna(−) viruses is that promoter-proximal genes are more efficiently expressed than promoter-distal ones (iverson and rose ; wertz, perepelitsa, and ball ; pesko et al. ). the efficiency of expression of the inserted gene (and with it the strength of the immune response) can thus be controlled through the position of the insert relative to the promoter (tokusumi et al. ; roberts et al. ). however, inserting a foreign gene close to the promoter can also reduce the expression of downstream vector genes (skiadopoulos et al.
), which in turn can negatively affect virus transcription and rna replication (wertz, moudy, and ball ; zhao and peeters ). these empirical observations again show that the site of the insert plays an important role in recombinant vector stability. when considering the size of the insert and its stability, ssrna(−) viruses accept relatively large inserts without drastic effects on virus replication. sendai virus, with a genome size of about . kbp, can carry and efficiently express gene insertions up to . kbp (sakai et al. ). however, here too the insert size is limited, as the final virus titers in vitro are proportionally reduced as the insert size increases. while in vivo no such size-dependent effect was observed, attenuated replication and pathogenicity were detected (sakai et al. ). insertions up to . kbp in the ∼ . kbp genome of the human parainfluenza virus were viable and replicated efficiently in vitro (skiadopoulos et al. ). nonetheless, insertions longer than , bp reduced the robustness of the virus to environmental perturbation, as temperature sensitivity was augmented and replication was restricted to certain sites in vivo (skiadopoulos et al. ). the ssrna(−) viruses seem promising expression vectors, where one can control gene expression and introduce relatively large inserts that, in many instances, appear to be stable. the constraints imposed on viral gene insertions seem to be the lowest in this group of viruses. however, the ideal vector that accepts all types and sizes of foreign gene insertions without decreasing viral replication has not yet been identified. retro-transcribing ssrna(+) viruses, or retroviruses, are small viruses varying in genome size from to kbp and are classified in the retroviridae family. after entering a host cell, the retroviral rna genome is converted into dsdna by reverse transcription. the viral dna integrates into the host genome, where viral genes are expressed.
therefore, these viruses are often used for gene therapy. retroviruses frequently undergo genomic rearrangements, including gene insertions and deletions (indels). moreover, recombination can be common due to the combination of 'diploid' virus particles and high intrinsic recombination rates (jetzt et al. ) . therefore as a general observation this viral group appears to have a highly plastic genome, and should be relatively open to foreign gene insertions. as retroviruses integrate into the host genome, the stability of inserts does not necessarily depend solely on the retrovirus genome configuration and demographic conditions. as host genomes are in general less streamlined than those of viruses, one could expect that gene insertions are stable after integration into the host genome. however, the random integration of retroviruses in the host genome makes it hard to predict genomic stability. as a wild example, hiv- frequently undergoes genomic rearrangements, where indels are significant source of evolutionary change. these indels appear to have an impact on virus transmission and adaptation as for example indels in the hiv- pol gene are associated with drug resistance (rakik et al. ) , and indels in the gag and vif genes are associated with disease progression and infectivity (alexander et al. ; aralaguppe et al. ). the hiv- surface envelope glycoprotein contains five variable regions (v -v ) that can tolerate a higher rate of indels than the rest of the genome. interestingly, indel rate estimates vary significantly among variable regions and subtypes (from different hosts) (palmer and poon ). when introducing gfp into the five variable regions of hiv- , certain regions (v and v ) were more tolerant to foreign gene insertions than the other variable regions (v , v , and v ) (nakane, iwamoto, and matsuda ) . 
in particular, gfp insertions into the v region showed lower levels of expression (nakane, iwamoto, and matsuda ) , which is consistent with v having the lowest indel rate (palmer and poon ) , and thus a lower stability after gene insertions. this piece of empirical evidence again shows that the site of insertion plays an important role in determining expression levels and stability. retroviruses have valuable potential as vectors for introducing therapeutic genes into cancer cells. murine retroviruses are the most commonly used vectors in clinical trials today, and seem promising candidates for human gene therapy as they target dividing cells with a high degree of efficiency and lead to stable gene transfer as they integrate into the chromosomes of the target cell (edelstein et al. ). however, important safety issues remain when using retroviruses for gene therapy. the random integration of retroviruses in the host genome poses a risk, as integration near the lmo proto-oncogene promoter can trigger the development of leukemia (hacein-bey-abina et al. ) . besides the risks related to retroviral gene therapy, the limited efficiency of in vivo gene transfer poses another obstacle. replication-defective retrovirus vectors are often used in clinical trials but are limited in that they can only infect a fraction of solid tumor cells (rainov and ren ) . for the delivery of the transgene to all tumor cells, replication-competent retroviral vectors are a promising alternative. the suitability of murine leukemia virus (mlv)-based vectors for cancer gene therapy has been analyzed in vitro and in vivo by paar et al. ( ) . they found that the choice of the virus strain, the position of the insert, and the host cells used can influence the replication kinetics, genomic stability, and transgene expression levels (paar et al. ) . concordantly, the egfp sequence was inserted into mlv under different configurations (i.e.
site of insertion and flanking sequence), and the reporter gene was deleted upon extended cell culture (duch et al. ). stability was improved by decreasing the length of the sequence repeats flanking the inserted sequence; eventually, however, egfp was always (partially or completely) deleted (duch et al. ) . in another study, transgenes of different sizes (gfp, hph, pac) were inserted into mlv. deletions were always observed, the deletion dynamics depended on the size of the insert, and preferred sites of recombination were detected (logg et al. ) . using retroviral vectors for the expression and transfer of foreign genes is central to the development of gene therapy. an advantage of using retro-transcribing ssrna(+) viruses is that after reverse transcription a dsdna molecule stably integrates into the host genome. with careful design, testing, and engineering, the retroviruses are promising vectors for the treatment of diseases such as cancer. the retro-transcribing dsdna (rt-dsdna) viruses have small genome sizes varying from to . kb, and include the viral families caulimoviridae and hepadnaviridae. as the name suggests, the rt-dsdna viruses replicate through an rna intermediate, and in some cases the pre-genomic rna is alternatively spliced. although genomic rearrangements appear to be frequent in rt-dsdna viruses, we hypothesize that gene insertions will often be unstable, because (i) they tend to have compact genomes, and (ii) insertions can easily disturb a viral regulatory sequence or lead to incorrect processing of the alternatively spliced products.

wild viruses

in contrast to the retroviruses (group vi), the genome replication of the caulimoviridae is entirely episomal. however, fragmented and rearranged endogenous caulimovirus sequences have been found in a wide variety of plant species (teycheney and geering ) .
for the hepadnaviridae, the viral genome can be integrated into the host genome through a process that exploits double-strand breaks in the host genome. although this is an infrequent event, the integrated viral dna often contains deletions, inversions, and duplications, which often inactivate the virus. in the case of hepatitis b virus (hbv), integration into the human genome can cause genetic damage and chromosomal instability leading to hbv-induced liver cancer (shafritz et al. ; furuta et al. ). several studies in the s already reported the possibility of inserting foreign dna into specific sites of the cauliflower mosaic virus (camv) genome without greatly affecting viral infectivity or function (gronenborn et al. ; howell, walker, and walden ; dixon, koenig, and hohn ; brisson et al. ; lefebvre, miki, and laliberté ) . in two of these studies, functional bacterial genes were introduced into the camv genome, where a fragment of the lac operator (gronenborn et al. ) and the dhfr gene (brisson et al. ) were successfully expressed. these studies raised issues regarding the stability of the insert: the lac operator was lost after five successive transfers and extended growth of the plants, and deletions in the dhfr gene started appearing after the second and third transfers. in contrast, an inserted mammalian metallothionein gene appeared to be stable and functional in the camv genome (lefebvre, miki, and laliberté ) . these studies suggest that the differences in stability of inserts in the camv genome depend on at least two factors. first, the site of the insert seems to be important, as many inserts are lethal for the virus (gronenborn et al. ; howell, walker, and walden ; dixon, koenig, and hohn ) . second, the size of the insert is important, as camv can accept only small foreign genes due to viral encapsidation limits (gronenborn et al. ; lefebvre, miki, and laliberté ) .
as described throughout this review, vectors containing gfp as an insert are often designed to study the infection dynamics of viruses. however, gfp is relatively large (around nt) and often leads to instability of vectors (zwart et al. ; nakane, iwamoto, and matsuda ) . to cope with this size limitation a split gfp system has been engineered (cabantous, terwilliger, and waldo ) , where only a small part of gfp is introduced in the viral vector and the other part is expressed using a transgenic host. when the two gfp fragments are together, spontaneous association leads to the formation of a fluorescent molecule. in the camv genome this system made it possible to track a camv protein in vivo (dáder et al. ). the partial gfp insertion was stable for ten or four serial passages, depending on the host plant species used, suggesting that demographic conditions such as the host play an important role in stability. although the number of studies on insert stability in rt-dsdna viruses is limited, we reason that several constraints limit insert stability in these viruses. although small inserts will allow viral infection dynamics to be tracked, the use of rt-dsdna viruses for gene therapy does not seem practicable, as integration into the host genome is a rare event for these viruses. sequence loss is inherently an evolutionary process, at a minimum involving mutation and selection, and therefore needs to be framed in an evolutionary context. here, we consider how theory might help to better understand and ultimately predict this process. first, inspired by empirical results, we consider the effects of virus population and bottleneck sizes on sequence loss. second, we consider whether there are different evolutionary trajectories that lead to a restoration of fitness following insertion of a sequence, and their implications for sequence stability. we understand demography to be a description of the size and structure of virus populations over time.
in this discussion we will consider virus populations that are divided into demes at the host or cell level. theory suggests that demography could have major implications for the loss of inserted sequences, with small population sizes, narrow bottlenecks, and short time intervals between bottlenecks resulting in high sequence stability. hence, the stability of the inserted sequence cannot be viewed solely as a property of a genome; rather, it is a phenotype and therefore depends on the environment. in this section, we motivate this argument and present a simulation model that highlights the effects of demography on the deletion of inserted sequences. at its core, the stability of genomic insertions in viral genomes depends on two key factors. first, the supply of mutations removing the insertion is crucial, because selection can only act on existing heritable variation. second, selection then acts to fix variants with the inserted sequence removed. all other things being equal, the larger the supply of mutations that remove the insertion and the larger the selection coefficients of variants with the insert removed, the less stable the insertion will be. the interplay between mutation and selection will govern the stability of genomic inserts, and in many cases demography has an important role in shaping this interplay. for example, low fitness can lead to small population sizes, which in turn will limit the mutation supply (chao ; lynch and gabriel ) . a high-cost inserted sequence might therefore limit viral evolvability, thereby promoting its own stability. genetic drift can also play an important role in determining the stability of inserted sequences, as inserts can have high stability if a viral population regularly passes through population bottlenecks. this idea is inspired by the empirical observation that inserted sequences in a group iv plant virus appear to be stable when short-duration passages are used, but not in long-duration passages (zwart et al. ; willemsen et al. , ) . viruses pass through bottlenecks at many points during infection, in vitro and in vivo (zwart and elena ) ; it is therefore important to consider these effects. even if there is a large supply of deletions and strong selection for the deletion mutant, if deletion mutants fail to reach a frequency ≥ 1/a, where a is the bottleneck size, they are unlikely to pass through the bottleneck (willemsen et al. , ) . this leads to a 'resetting' of the virus population by each bottleneck event (fig. ) , effectively resulting in high stability of the inserted sequences (box , see also fig. ). short passages shorten the time for deletion mutants to reach the frequency 1/a, making it more difficult for these variants to pass through bottlenecks and thereby promoting insert stability. it is important to remember that assays for detecting deletion mutants, such as deep sequencing or the polymerase chain reaction, have limited sensitivity. deletions may therefore also be detected more readily in longer passages, whilst low-frequency mutations that will be purged by bottlenecks may not be detected (bull, nuismer, and antia ) . demography can also modulate the strength of selection itself. the moi (cellular multiplicity of infection) is a key demographic parameter at the cellular level, as it describes the number of virus particles infecting a cell. if an inserted sequence affects viral fitness in trans at the within-cell level, for example by being toxic, then the moi will determine whether there can be selection (miyashita and kishino ) . at high mois there will be no selection, because the toxin is produced in all cells and affects the replication of both producers and nonproducers of the toxin (fig. ). an interesting conundrum is that high mois also tend to promote the evolution of di viruses (huang ) due to within-cell selection, and hence these two effects must be weighed accordingly.
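the dependence of selection on the moi described above can be made concrete with a small calculation. the sketch below is not taken from the review: it assumes that the numbers of particles of two genotypes, a and b, entering a cell are independent poisson variables with means moi × f_a and moi × (1 − f_a), and it reports the proportion of infected cells that contain only one genotype. the function name and parameters are hypothetical and chosen for illustration only.

```python
import math


def single_genotype_fraction(moi: float, freq_a: float) -> float:
    """Proportion of infected cells that carry only genotype a or only b.

    Assumption (not from the source text): particles of each genotype
    enter cells independently, with Poisson-distributed counts.
    """
    if moi <= 0:
        raise ValueError("moi must be positive")
    if not 0.0 <= freq_a <= 1.0:
        raise ValueError("freq_a must lie between 0 and 1")

    mean_a = moi * freq_a          # mean number of genotype-a particles per cell
    mean_b = moi * (1.0 - freq_a)  # mean number of genotype-b particles per cell

    # Cell receives at least one a particle and no b particle, and vice versa.
    only_a = (1.0 - math.exp(-mean_a)) * math.exp(-mean_b)
    only_b = math.exp(-mean_a) * (1.0 - math.exp(-mean_b))

    # Condition on the cell being infected by at least one particle.
    infected = 1.0 - math.exp(-moi)
    return (only_a + only_b) / infected


if __name__ == "__main__":
    for moi in (0.01, 0.1, 1.0, 10.0):
        print(moi, round(single_genotype_fraction(moi, freq_a=0.5), 4))
```

under these assumptions, single-genotype infections dominate at low moi and become rare at high moi, which is the condition under which selection against a trans-acting cost is expected to weaken.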
in this review, we considered only a few cases in which inserted sequences could potentially have beneficial effects on a virus (thomas, cameron, and davis ; gladstone, molineux, and bull ; schmerer et al. ; willemsen et al. ) . beneficial effects could promote insertion stability and are therefore interesting from a bioengineering perspective, but demography can once again play a role in determining sequence stability. heterologous expression of endosialidase, an exoenzyme that degrades a key biofilm component after phage-induced cell lysis, led to increased amplification of phage t in capsulated e. coli (gladstone, molineux, and bull ) . however, a phage that did not express the enzyme outcompeted the engineered virus, as it could reap the benefits of enzyme production whilst not bearing the costs. this tragedy of the commons is a reversal of the situation sketched above for high mois (fig. ) . one proposed strategy to increase stability would be to set up culture conditions such that phages grow in isolation or in spatially structured environments (gladstone, molineux, and bull ) , another example of a demography-based approach to increasing insert stability. it will certainly not always be possible to address issues of insert stability through demographic changes, but theory suggests this can be an interesting approach. some experimental protocols already exploit some of these principles, in particular strict adherence to low moi (fitzgerald et al. ) . one should caution against naive applications of evolutionary theory, as the details of each real-world system matter (schmerer et al. ). there are multiple, non-mutually exclusive mechanisms by which an inserted sequence can be costly for a virus. consequently, deletion of the inserted sequence may not be the only class of mutation that ameliorates the insert's effects on fitness, a possibility we explore in this section.
we argue that alternative trajectories may sometimes play a role, but that, given the mutation supply for different types of mutations, deletion of the inserted sequence is the most likely trajectory.

figure . in panel a, we illustrate how the cellular moi can have a direct effect on selection strength. consider a virus that expresses a product that is toxic and acts in trans within cells to lower replication levels, but deletions can remove the gene encoding this product. if there is a mixed virus population with variants with the insertion intact and deleted, at high moi all cells will be infected with both variants and the toxin will lower replication. the ubiquity of the toxin will limit selection in favor of the virus variant with the deletion. when moi is low, due to genetic drift at the cellular level not all cells will contain both variants, and the virus variant with the deletion is selected because those cells infected only with this variant have higher replication. in panel b, the relationship between the cellular moi (ordinate) and the frequency of single-genotype infection (abscissa) for a virus population with genotypes a and b is given, for different frequencies of the two virus genotypes in the population (f_a shown, f_b = 1 − f_a). note that the frequency of single-genotype infections is given as the proportion of infected cells in which only virus genotype a or only virus genotype b is present. as the moi increases, the frequency of single-genotype infections decreases, although it depends on the frequency of the two virus genotypes in the population. if genotype a expresses a gene that has fitness costs that act in trans (e.g. toxicity), then selection can only act against this genotype when there is an appreciable number of single-genotype infections.

a cost of the insert can arise because of the attributes of the inserted sequence (e.g. metabolic costs of expressing extra genes, toxicity of gene products), reorganization of the genome due to the insertion (e.g. disruption of the regulation of gene expression, polyprotein processing, and subgenomic rnas), or limitations on genome size imposed by virus-particle packaging. deletion of the inserted sequence is therefore not the only plausible class of mutation that can restore viral fitness, as other mutations can also affect fitness. these mutation types include regulatory mutations (e.g. promoter mutations) that downregulate gene expression (van opijnen, boerlijst, and berkhout ) , removal of immunogenic sequence motifs (fros et al. ) , alteration of unfavorable secondary rna structures (mcfadden et al. ) , and adoption of a more favorable codon usage (carrasco, de la iglesia, and elena ; agashe et al. ; cladel et al. ) , as synonymous mutations can have marked effects on virus fitness. these different mutation classes are likely to have different mutation rates, and mutation bias might therefore drive the evolutionary route that is followed (stoltzfus and mccandlish ) . for example, consider that recombination rates are high for many viruses (tromas and elena ) , and there are many possible recombination events that partially remove an insertion. in contrast, probably only a small fraction of point mutations will be beneficial (sanjuán, moya, and elena ; carrasco, de la iglesia, and elena ) , e.g. in this case by lowering expression of the inserted gene or leading to more favorable codon usage. we therefore conjecture that mutation supply is likely to favor the evolution of deletions in the transgene over beneficial point mutations that affect fitness cost. consider the 'genomic accordion' observed in poxviruses (elde et al. ) : beneficial point mutations typically occur long after gene amplification by copy number variation. likewise, we expect deletions that remove an insertion to be fixed before point mutations that also lessen its impact occur.
nevertheless, the occurrence of alternative evolutionary trajectories could, depending on the exact mutation supply and effect sizes for different classes of mutations, contribute to making stability of genomic inserts less repeatable and predictable in some cases (de visser and krug ; bolnick et al. ). whereas some sequences inserted into viral genomes are stable, others are clearly not. although there are some factors that appear to explain these differences, there is still a great deal about the relatively simple question of stability that we do not understand. on the other hand, these different outcomes are encouraging, because they suggest that if we understand the process well enough, we can design more stable insertions. for most viruses, strong selective constraints appear to exist against increasing genome size. in natural conditions, this is an impediment to evolutionary innovation by gene duplication or hgt. in laboratory conditions, this is an impediment to expressing a gene of interest by using engineered viral vectors. when stratifying by viral groups, we observe that the stability of viral genomes partially depends on the nature of the genome. viral genomes with separately expressed nonoverlapping orfs (group v: ssrna−) appear to have fewer constraints imposed on sequence insertions than genomes with genes encoded in one single orf (group iv: ssrna+). although the dsdna (group i) virus genomes are extremely plastic in natural conditions, this observation is not a good predictor for stability of engineered viral genomes, as inserts are generally lost. in the case of ssdna (group ii) viruses, the varying frequency of genomic segments might lead to rapid adaptive responses to inserted sequences. in the case of segmented dsrna (group iii) viruses, sequence insertions probably perturb segmented genome assembly.
when comparing the retro-transcribing viruses, the rt-ssrna(+) (group vi) viruses appear to successfully express sequences of interest after stable integration into the host genome, whilst the rt-dsdna (group vii) viruses are less stable and only rarely integrate into the host genome. multipartite viruses, represented in various groups, also present unique challenges when thinking about the stability of inserted sequences. when comparing all viral genome architectures, we conclude that genomic stability is not a fixed, intrinsic property. although we show that insert stability depends on the nature of the genome, the site and size of the insert, and the recombination rate, the host species and demographic conditions (i.e. population and bottleneck size) can radically change viral evolutionary dynamics. we have illustrated this idea with a simple simulation model that considers the effect of genetic bottlenecks (box ), where the observed stability of the viral genome decreases as the bottleneck is widened. the interplay between all factors affecting insert stability appears to be complex and unexpectedly sensitive to the exact conditions under which a virus population evolves. given these complexities, we think it may be challenging to develop predictive models of insert stability for different types of virus genomes under different conditions. we hope to see developments in this area, possibly linked to resurging interest in preventing and exploiting di viruses. however, we think that experimental tests of the stability of viral constructs will remain important in the foreseeable future. experimental evolution can detect design problems in engineered genomes by looking at fitness and evolutionary stability (springman et al.) .
as springman and collaborators suggest, experimental evolution may also prove useful for optimizing the stability of expression vectors by ameliorating constraints for which solutions are hard to predict because we lack a mechanistic understanding, such as codon usage (carrasco, de la iglesia, and elena ; agashe et al. ) . this approach can lead to improved engineering of viral genomes, which is also of interest for designing vectors with tags to follow viral infection, and for the use of viral vectors for gene therapy as well as for vaccine vectors. finally, for real-world applications it can be useful to determine quantitatively the impact of the loss of inserted sequences on the desired output. for example, models suggest that deletions in vector vaccines may not have a large impact on eliciting the desired immune response (bull, nuismer, and antia ) . we have noticed that a surprisingly large number of studies draw conclusions on the stability of inserted sequences in viral genomes based on experiments with either no or low replication. we cannot stress enough the importance of replication in studying genomic stability, in the first place because mutation is a stochastic process. moreover, as illustrated by our simple simulations-in which mutation is deterministic-bottlenecks and population dynamics can also introduce further stochastic effects that influence stability (fig. ) . furthermore, empirical studies with high levels of replication show the extent to which observed stability does vary between replicates (zwart et al. ). 
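the point that bottlenecks add stochasticity even when mutation is deterministic can be illustrated with a minimal serial-passage sketch. this is not the simulation model of the review's box; it is a simplified stand-in with assumed rules: within a passage the deletion-mutant frequency changes deterministically (per-generation deletion rate `mu`, selective advantage `s` of the deletion mutant), and each bottleneck samples a fixed number of founders at random. all function names, parameter names, and default values are hypothetical.

```python
import random


def grow(p: float, generations: int, mu: float, s: float) -> float:
    """Deterministic within-passage change of the deletion-mutant frequency p."""
    for _ in range(generations):
        # Deletion mutants replicate with advantage s; a fraction mu of
        # insert-carrying genomes loses the insert each generation.
        deleted = p * (1.0 + s) + (1.0 - p) * mu
        intact = (1.0 - p) * (1.0 - mu)
        p = deleted / (deleted + intact)
    return p


def bottleneck(p: float, size: int, rng: random.Random) -> float:
    """Draw `size` founders at random; return the mutant frequency among them."""
    mutants = sum(1 for _ in range(size) if rng.random() < p)
    return mutants / size


def serial_passage(passages: int, generations: int, size: int,
                   mu: float = 1e-4, s: float = 0.5, seed: int = 0) -> float:
    """Deletion-mutant frequency among the founders after the last bottleneck."""
    rng = random.Random(seed)
    p = 0.0  # every lineage starts with the insert intact
    for _ in range(passages):
        p = grow(p, generations, mu, s)
        p = bottleneck(p, size, rng)
    return p


def mean_final_frequency(replicates: int, passages: int, generations: int,
                         size: int, mu: float = 1e-4, s: float = 0.5) -> float:
    """Average outcome over independent replicate lineages (seeds 0, 1, ...)."""
    finals = [serial_passage(passages, generations, size, mu, s, seed=r)
              for r in range(replicates)]
    return sum(finals) / replicates


if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(size, mean_final_frequency(20, passages=15, generations=5, size=size))
```

in this sketch a deletion mutant must reach a frequency of roughly 1/size before a bottleneck is likely to transmit it, so narrow bottlenecks repeatedly reset the population, and replicate lineages with identical parameters can still end up with very different outcomes.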
good codons, bad transcript: large reductions in gene expression and fitness arising from synonymous mutations in a key enzyme inhibition of human immunodeficiency virus type (hiv- ) replication by a two-amino-acid insertion in hiv- vif from a nonprogressing mother and child poxviral promoters for improving the immunogenicity of mva delivered vaccines evidence that gene amplification underlies adaptive mutability of the bacterial lac operon increased replication capacity following evolution of pyxe insertion in gag-p is associated with enhanced virulence in hiv- subtype c from east africa bovine adenovirus- as a vaccine delivery vehicle vsv-tumor selective replication and protein translation the evolution of genome compression and genomic novelty in rna viruses reverse genetics of measles virus and resulting multivalent recombinant vaccines: applications of recombinant measles viruses (non)parallel evolution genome packaging in multi-segmented dsrna viruses: distinct mechanisms with similar outcomes', current opinion in virology expression of a bacterial gene in plants by using a viral vector nonsegmented negative-strand viruses as vaccine vectors recombinant vector vaccine evolution protein tagging and detection with engineered self-assembling fragments of green fluorescent protein distribution of fitness and virulence effects caused by single-nucleotide substitutions in tobacco etch virus overview of the baculovirus expression system fitness of rna virus decreased by muller's ratchet' potato virus x as a vector for gene expression in plants bacteriophage lambda as a cloning vector stability of recombinant plant viruses containing genes of unrelated plant viruses synonymous codon changes in the oncogenes of the cottontail rabbit papillomavirus lead to increased oncogenicity and immunogenicity of the virus split green fluorescent protein as a tool to study infection with a plant pathogen modifications of the tobacco mosaic virus coat protein gene affecting 
replication, movement, and symptomatology a tobacco mosaic virus-hybrid expresses and loses an added gene a structured dynamic model for the baculovirus infection process in insect-cell reactor configurations empirical fitness landscapes and the predictability of evolution potato virus x-based expression vectors are stabilized for long-term production of proteins and larger inserts mutagenesis of cauliflower mosaic virus tagging of plant potyvirus replication and movement by insertion of beta-glucuronidase into the viral polyprotein systemic expression of a bacterial gene by a tobacco mosaic virus-based vector transgene stability for three replication-competent murine leukemia virus vectors' gene therapy clinical trials worldwide - -an overview poxviruses deploy genomic accordions to adapt rapidly against host antiviral defenses size reversion of african cassava mosaic virus coat protein gene deletion mutants during infection of nicotiana benthamiana protein complex expression by using multigene baculoviral vectors' bacterial gene inserted in an engineered rna virus: efficient expression in monocotyledonous plant cells the size of encapsidated single-stranded dna determines the multiplicity of african cassava mosaic virus particles cpg and upa dinucleotides in both coding and non-coding regions of echovirus inhibit replication initiation post-entry', elife correction: characterization of hbv integration patterns and timing in liver cancer and hbv-infected livers properties of replication-competent vesicular stomatitis virus vectors expressing glycoproteins of filoviruses and arenaviruses clinical development of modified vaccinia virus ankara vaccines evolutionary principles and synthetic biology: avoiding a molecular tragedy of the commons with an engineered phage poxvirus vectors as hiv/aids vaccines in humans propagation of foreign dna in plants using cauliflower mosaic virus as vector susceptibility to recombination rearrangements of a chimeric plum pox 
potyvirus genome after insertion of a foreign gene lmo -associated clonal t cell proliferation in two patients after gene therapy for scid-x homologous recombination in negative sense rna viruses stability and expression of bacterial genes in replicating geminivirus vectors in plants rescue of in vitro generated mutants of cloned cauliflower mosaic virus genome in infected plants defective interfering viruses localized attenuation and discontinuous synthesis during vesicular stomatitis virus transcription high rate of recombination throughout the human immunodeficiency virus type genome specific targeting to cd þ cells of recombinant vesicular stomatitis viruses encoding human immunodeficiency virus envelope proteins bsmv genome mediated expression of a foreign gene in dicot and monocot plant cells entirely plasmid-based reverse genetics system for rotaviruses in vitro and in vivo genetic stability studies of a human adenovirus type recombinant rabies glycoprotein vaccine (onrab)', vaccine reverse genetics system demonstrates that rotavirus nonstructural protein nsp is not essential for viral replication in cell culture detection and analysis of autographa californica nuclear polyhedrosis virus mutants with defective interfering properties evolution and taxonomy of positive-strand rna viruses: implications of comparative analysis of amino acid sequences vaccination against rabies: construction and characterization of sag , a double avirulent derivative of sadbern mammalian metallothionein functions in plants a recombinant canine distemper virus expressing a modified rabies virus glycoprotein induces immune responses in mice rescue and evaluation of a recombinant prrsv expressing porcine interleukin- ' genomic stability of murine leukemia viruses containing insertions at the env ' untranslated region boundary dispersing biofilms with engineered enzymatic bacteriophage mutation load and the survival of small populations elimination of rabies from red foxes in eastern 
key: cord- - t xthk authors: gmyl, a. p.; agol, v. i. title: diverse mechanisms of rna recombination date: journal: mol biol doi: . /s - - -x sha: doc_id: cord_uid: t xthk recombination is widespread among rna viruses, but many molecular mechanisms of this phenomenon are still poorly understood. it was believed until recently that the only possible mechanism of rna recombination is replicative template switching, with synthesis of a complementary strand starting on one viral rna molecule and being completed on another. the newly synthesized rna is a primary recombinant molecule in this case. recent studies have revealed other mechanisms of replicative rna recombination. in addition, recombination between the genomes of rna viruses can be nonreplicative, resulting from a joining of preexisting parental molecules. recombination is a potent tool providing for both the variation and conservation of the genome in rna viruses. replicative and nonreplicative mechanisms may contribute differently to each of these evolutionary processes. in the form of trans splicing, nonreplicative recombination of cell rnas plays an important role in at least some organisms. it is conceivable that rna recombination continues to contribute to the evolution of dna genomes. the genomes of rna viruses change and evolve like the genomes of all other biological organisms. the major cause of their variation is the infidelity of template replication, which is partly determined by the fact that viral rna-dependent rna polymerases (rdrps) lack proofreading activity [ , ] .
rough estimates showed that, on average, one mutation arises in every newly synthesized molecule of viral rna. the mutation rate is so high in some viruses (e.g., in picornaviruses) that even a slight increase suffices to dramatically reduce or completely abolish the viability of the virus because of a high probability of substantial genetic lesions (mutation catastrophe) [ , ] . in addition, replication of viral rna genomes is accompanied by covalent rearrangements: deletions, duplications, and recombination. an illustrative example of deletions is provided by defective interfering (di) genomes, which accumulate in a virus population upon high-multiplicity infections and lack a fragment of the sequence coding for viral proteins [ ] [ ] [ ] . short duplications and deletions arise, in some cases, as pseudoreversions in response to artificial damaging mutations, as observed for the poliovirus and theiler's murine encephalomyelitis virus [ ] [ ] [ ] [ ] . duplications also have a place in the evolutionary history of viruses: for instance, the genome of the foot-and-mouth disease virus codes for three highly similar variants of the replicative protein vpg [ ] . there are grounds for believing that an extended sequence (more than nt in size) was duplicated in the ' untranslated region (utr) of the genome of an enterovirus precursor. an additional extended duplication took place more recently in another region of utr in the bovine enterovirus [ ] . a special role in the variation of rna viruses is played by recombination, the generation of new genomes from two or more parental rnas. recombination between viral rna molecules was observed for the first time as early as in the s in the poliovirus [ , ] . two related but phenotypically different strains were used as parents. a minor portion of the virus progeny resulting from co-infection with the two strains expressed traits of both parents.
a similar method was used soon afterwards to detect recombination in the foot-and-mouth disease virus [ ] . the recombination rate proved to vary among different pairs of poliovirus mutants [ ] . on the assumption that the recombination rate is proportional to the distance between the corresponding mutations, a genetic (linkage) map was constructed for the polioviral genome. the map proved to be additive, testifying again to the existence of rna recombination. it is indeed hardly conceivable that a different mechanism associates the genetic distance between mutations with the frequency of the corresponding double mutants. direct biochemical evidence for the inheritance of information from two parental viral rnas was obtained in the early s. proteins of recombinants resulting from crossing polioviruses of different serotypes were assayed by partial proteolysis and isoelectric focusing and proved to originate from different strains [ , ] . convincing evidence for rna recombination was provided by rna sequencing [ ] [ ] [ ] [ ] [ ] [ ] . early studies of rna recombination are described in detail elsewhere [ ] [ ] [ ] . recombination is characteristic of most, if not all, rna viruses of animals [ ] [ ] [ ] [ ] , plants [ , ] , and microorganisms [ , ] , but its rate varies considerably among different viruses. some viruses with a negative rna genome (e.g., hantaviruses [ ] ) have a low recombination rate. in some flaviviruses, rna recombination has not been detected so far [ ] . in many rna viruses, however, recombination occurs at a high rate under experimental conditions and is widespread in nature. for instance, there are grounds for believing that recombination events have taken place in the evolutionary history of all currently circulating enteroviruses [ ] [ ] [ ] [ ] [ ] [ ] [ ] . rna recombination can be observed not only in vivo but also in cell-free systems [ , [ ] [ ] [ ] [ ] [ ] .
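The additivity argument behind the poliovirus linkage map can be illustrated with a small numerical sketch. It assumes, purely for illustration, that crossovers occur independently with the same probability at every nucleotide; the marker positions and the per-nucleotide probability are hypothetical values, not measurements from the studies cited above.

```python
def recombinant_fraction(distance_nt, crossover_per_nt):
    """Fraction of progeny that is recombinant for two markers.

    A genome is recombinant for a marker pair when an odd number of
    crossovers falls between the markers. For independent crossovers with
    probability p per nucleotide over d nucleotides,
    P(odd) = (1 - (1 - 2p)**d) / 2.
    """
    return (1.0 - (1.0 - 2.0 * crossover_per_nt) ** distance_nt) / 2.0


# Hypothetical marker positions (nt) and crossover probability (assumptions).
markers = {"a": 1000, "b": 2000, "c": 4000}
p = 1e-6

r_ab = recombinant_fraction(markers["b"] - markers["a"], p)
r_bc = recombinant_fraction(markers["c"] - markers["b"], p)
r_ac = recombinant_fraction(markers["c"] - markers["a"], p)

# For closely linked markers the map is almost additive. The small deficit is
# due to double crossovers, which restore the parental marker combination.
deficit = (r_ab + r_bc) - r_ac
print(f"r_ab={r_ab:.6f} r_bc={r_bc:.6f} r_ac={r_ac:.6f} deficit={deficit:.2e}")
```

Under these assumptions the frequency for the outer marker pair is very nearly the sum of the two inner frequencies, which is the property an additive map relies on.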
mutations and covalent rearrangements of rna genomes contribute to their diversity, a resource providing for the evolution of rna viruses. here, we summarize the current views of the mechanisms and biological significance of rna recombination as a major generator of this resource, with special emphasis on the recent progress in the field. intermolecular rna recombination, as well as deletions and duplications, can theoretically proceed via two fundamentally different mechanisms, replicative and nonreplicative. recombination is associated with rna synthesis in the former case, while recombinant molecules are generated at the postsynthetic (postreplicative) level in the latter. according to the model of replicative template switching, synthesis of a complementary strand starts on one rna molecule and is completed on another; i.e., a newly synthesized molecule is a primary recombinant ( fig. ) . several variants are conceivable for the transfer of the nascent strand from one template to another. the strand can be removed from the primary template as a result of premature termination and can be transferred to another template with or without rdrp. a donor rna fragment can result from degradation of a previously synthesized rna; in this case, the start and end of generation of a recombinant molecule are temporally separate. cooper et al. [ , ] were the first to propose template switching as a mechanism of homologous (i.e., exact) recombination in polioviruses. one of the ideas underlying this hypothesis was simple and quite convincing: the enzymes that covalently join rna molecules were unknown at that time. kirkegaard and baltimore [ ] showed that replication is indeed essential for the generation of recombinants, providing experimental support for the hypothesis. it was concluded on circumstantial evidence that, in the system used, template switching took place predominantly during synthesis of the (-) rna strand on the template of the viral (+) rna strand. 
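The template-switching (copy-choice) model described above can be sketched in a few lines. This is a simplified illustration, not a model of any particular polymerase: it assumes two aligned parental templates of equal length and a precise (homologous) switch, and it returns the product in template sense without modelling strand complementarity. The sequences and the switch position are hypothetical.

```python
def template_switch(donor, acceptor, switch_pos):
    """Copy `donor` up to `switch_pos`, then continue on `acceptor`.

    Models precise copy-choice: synthesis resumes at the same coordinate on
    the second template, so the product has the parental length.
    """
    if len(donor) != len(acceptor):
        raise ValueError("this sketch assumes aligned templates of equal length")
    if not 0 <= switch_pos <= len(donor):
        raise ValueError("switch_pos lies outside the template")
    return donor[:switch_pos] + acceptor[switch_pos:]


# Two hypothetical parental genomes that differ at positions 6 and 16.
parent_1 = "AUGGCUAAAGGGCCCUUUGA"
parent_2 = "AUGGCUCAAGGGCCCUAUGA"

# A switch between the two markers yields a genome that carries the first
# marker from parent_1 and the second marker from parent_2.
recombinant = template_switch(parent_1, parent_2, 10)
print(recombinant)
```

A switch outside the interval between the two markers would return one of the parental marker combinations, which is why only crossovers between the markers are scored as recombinants.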
although the arguments were open to criticism [ , ] , the model of template switching was extrapolated to other viral systems [ ] and was considered to be a synonym of the replicative mechanism, which, in turn, was thought to be the only possible way of rna recombination. the replicative model is supported, though indirectly, by the fact that mutations altering the proteins of the replication complex affect the efficiency of recombination [ ] [ ] [ ] . the possibility of template switching was confirmed in experiments with purified rdrps of the poliovirus [ ] , the bovine viral diarrhea virus (bvdv) [ ] , and several plant viruses [ , ] . it is essential for template switching that rdrp is capable of using the ' end of the incomplete nascent rna strand as a primer to be elongated on a new template. such activity was demonstrated for rdrps of at least some rna viruses (e.g., see [ ] [ ] [ ] [ ] [ ] ). although the model of template switching is widely accepted, the molecular mechanisms underlying this phenomenon are still incompletely understood. in particular, three key questions are open. first, why is elongation on the first template interrupted? second, does the incomplete nascent strand dissociate alone or together with rdrp? third, how is the site of the acceptor template chosen to resume synthesis? these problems still lack ultimate answers and are a matter of more or less justified speculation. it is quite conceivable that elongation pauses, which are possibly determined by some elements of rna secondary structure, are among the factors favoring premature termination of the nascent rna strand and its dissociation from the template [ , , [ ] [ ] [ ] . possibly, the nascent strand dissociates more easily at rna regions enriched in u and a [ ] . 
moreover, premature termination and dissociation may be caused by degradation of the template, for instance, in au-rich, poorly structured regions [ ] or by erroneous addition of a mismatching nucleotide by rna polymerase [ ] . in the latter case, termination and dissociation of the incomplete nascent strand can be regarded as a special type of proofreading. a commonly accepted model of template switching suggests dissociation of the elongation complex. it has been proposed, however, that replicative rearrangements can arise without dissociation as well. according to one hypothesis, regions of two parental rna molecules are held together via complementary interactions with a third, supporting, molecule, while a recombinant molecule is generated when rdrp passes from one parental template onto the adjacent region of the second template ( fig. ) [ ] . another hypothesis suggests that rdrp sometimes slides back on the template, releasing a short unpaired '-terminal region of the nascent strand. when this region is anchored on the same or another template, resumed elongation results in deletion/duplication or recombination, respectively. this mechanism possibly underlies the origin of some short deletions [ ] . note that the capability to slide back on the template is well known for dna-dependent rna polymerases [ , ] . the major role in choosing the site to resume rna synthesis on the second template is played by rdrp, the ' end of the nascent rna strand, or both. one of the existing hypotheses suggests that rdrp, along with the nascent rna strand, binds to its recognition site, a replicative cis element, on the second template. this hypothesis is based on the fact that crossover sites cluster in the vicinity of promoters or replicative enhancers in the genomes of some plant viruses [ , [ ] [ ] [ ] [ ] . mutations of such elements impair the efficiency of recombination.
it is possible that recognition of replicative cis elements by viral rdrp contributes to the choice of crossover site during recombination in alphaviruses [ ] and coronaviruses [ ] , although, in the latter, a considerable role in choosing the landing site is played by complementary interactions between the ' end of the nascent strand and intergenic repeats of the template [ ] . the clustering of crossover sites in the vicinity of replicative cis elements provides evidence, though circumstantial, in favor of template switching. it is clear, however, that such an association with replicative cis elements is absent when crossover sites are relatively uniformly distributed throughout the genome, as is the case with picornaviruses. it is thought that the choice of anchorage site is facilitated when the ' end of the nascent strand is complementary to a region of the second template [ , ] . yet complementary sites may each consist of a few nucleotides, and an erroneous landing is highly probable because of the abundance of short direct repeats in rnas. a factor theoretically capable of bringing correct (homologous) regions of two templates together is the formation of a heteroduplex between two direct repeats corresponding to hairpins [ , ] . the effect of temperature on the distribution of crossover sites [ ] is probably associated with its effect on the secondary structure of rna and, consequently, on termination/dissociation of the primary elongation complex or the association of the incomplete nascent strand with another template. template switching is not the only possible mechanism of rna recombination. an alternative model was advanced on the basis of experimental data on replication of phage q β rna in a cell-free system [ ] . the system contained only highly purified q β replicase and ribonucleoside triphosphates and allowed exponential replication both for phage q β rna and for small rq rnas, natural satellites of the phage [ ] . 
to study rna recombination with this system, experiments were performed with ' and ' fragments of a satellite rq rna. the fragments complemented each other and, by themselves, could not be exponentially replicated by q β replicase. it is essential for replication that both ends of a molecule contain necessary cis elements, which could be achieved only via a fusion of two molecules into one as a result of recombination. the reaction products were analyzed by a molecular colony technique [ ] : an rna sample was applied to agarose gel containing q β replicase; gels were covered with a nylon membrane impregnated with a solution of four ntps; and generation of molecules capable of replication was inferred from the formation of rna colonies (clones), which are detectable, for instance, with ethidium bromide. detection of colonies directly demonstrated for the first time that rna recombination does not necessarily involve dna intermediates (dna synthesis was impossible because of the absence of dntps), nor does it require any protein other than q β replicase [ ] . the mechanism of recombination in the above system differs from template switching. first, recombinants were nonhomologous, notwithstanding the homology of overlapping ends, which were added to the fragments on purpose. only homologous recombinants were produced in a control sample containing not only the same reagents, but also reverse transcriptase, which is capable of jumping from one template to another. second, a hydroxyl group at the ' terminus of the ' fragment was critical for efficient recombination. most recombinants contained the full-length ' fragment and a part of the ' fragment. recombination was observed even when rna fragments of opposite polarities were used as partners [ ] . to explain the above findings, a mechanism was proposed that is similar to the mechanism of splicing and suggests that the '-hydroxyl group of one fragment attacks a phosphodiester bond of the other [ , ] .
this new type of rna recombination is considered to be replicative, because it depends on q β replicase [ ] and rna synthesis [ ] . it is thought that, during synthesis, replicase assumes a certain active conformation, which allows it to catalyze the above transesterification reaction [ ] . evidence is accumulating that viruses possess a fundamentally different, nonreplicative mechanism (or mechanisms) of rna recombination. in particular, this is evident from experiments with the poliovirus. the polioviral genome is a single-stranded rna of about . kb in size and of a positive polarity, being thereby capable of functioning as a template in translation. this rna contains a single extended open reading frame and codes for a precursor polyprotein, whose partial proteolysis by viral proteases yields viral proteins [ ] . translation of viral rna is initiated via a cap-independent mechanism, unusual for eukaryotic mrna: the polioviral rna lacks a cap, and the ribosome binds to what is known as the internal ribosome entry site (ires) in the utr [ , ] . replication of this rna requires at least three cis-acting replicative elements, which are at both ends and in the internal region of the molecule, as well as rdrp and some other proteins encoded by the viral genome [ ] . replication of the polioviral rna, along with other steps of virus reproduction, occurs in the cytoplasm. the possibility of nonreplicative recombination was first studied with pairs of polioviral rna fragments. one fragment in a pair contained the nearly full-length utr and lacked the coding region; the other had an intact open reading frame coding for the polyprotein, while the translational and replicative elements of the utr were removed or inactivated (fig. ) [ ] . introduced together into cells, these fragments allowed generation of a viable recombinant virus progeny. most recombinants were a result of imprecise (nonhomologous) recombination.
the changes observed in the utr did not affect functionally significant regulatory elements but were restricted to a region where the primary structure can be dramatically rearranged without impairing the infectivity of the virus [ , , [ ] [ ] [ ] . since the fragments were not in themselves capable of replication or translation, recombinant genomes could arise only by a nonreplicative mechanism. the possibility of nonreplicative rna recombination was demonstrated even more rigorously with pairs of rna fragments corresponding to the polioviral rna with a break in the rdrp-coding region (fig. ) [ ] . since each fragment contained only a part of the gene coding for rdrp, this enzyme, which is essential for virus replication, could be synthesized only after recombination. in one variant, the rdrp gene lacked a single phosphodiester bond (fig. a) . simple ligation would restore the integrity of the genome in this case. cotransfection of virus-sensitive cells with two partners yielded viable viruses. the efficiency of ligation was virtually independent of whether the fragments could form heteroduplexes in which the nucleotides to be ligated were close together. it was only necessary that the ' partner have a '-phosphate and the ' partner have a '-oh group. rna ligases that join a '-phosphorylated nucleotide with a '-hydroxylated nucleotide are still unknown, suggesting preliminary activation of the partner ends by cell enzymes. known rna ligases utilize as partners either terminal ', '-cyclophosphates with a '-phosphate or a '-hydroxyl group [ ] [ ] [ ] [ ] or '-oh with '-phosphate (or '-triphosphate) [ , ] . it can be assumed that, in the cell, the ends of the fragments are converted into the form suited to rna ligases. one of the possible variants of such activation is cyclization of the '-terminal phosphate of the ' fragment by rna- '-phosphate cyclase [ , ] .
however, conversion of the '-terminal phosphate of the ' fragment into ', '-cyclophosphate did not increase the efficiency of ligation [ ] . thus, cyclization of the '-terminal phosphate was either not essential or not limiting for ligation under the conditions used. fragments of other pairs had an overlap (fig. b) . the sequence coding for the active enzyme could be restored only as a result of precise (homologous) recombination. therefore, it was necessary that the extra segment(s) be deleted from one or both fragments to produce a viable genome. such deletions were indeed found in viable viruses resulting from cotransfection. an interesting association was observed between the location of crossover sites and the structure of the terminal nucleotides in partners (fig. ) . when both fragments used for transfection had phosphorylated terminal nucleotides (the ' fragment had a monophosphorylated ' end and the ' fragment had a tri-, di-, or monophosphorylated ' end), the full-length ' fragment was incorporated into the genome in most recombinants. when both fragments had dephosphorylated ends, it was the full-length ' fragment that was incorporated into the genome in most recombinants. in other words, the '-phosphorylated nucleotide of a ' fragment and the '-hydroxylated nucleotide of a ' fragment were capable of finding the correct site on the rna partner and integrating into it to yield a perfect recombinant genome. when the ' fragment was dephosphorylated and the ' fragment phosphorylated, the crossover site was within the overlap; i.e., precise internal crossing over was observed in this case. the available data are insufficient for proposing a mechanism for the recombination observed in the above experiments. moreover, it is still unclear whether terminal incorporation and internal crossing over proceed via the same or different mechanisms. another open question is how the accuracy of recombination is achieved.
it is most plausible that crossover sites are to some extent promiscuous (or at least have many possible locations) and that the precision of recombination is determined by selection of viable variants. first, in the experiments on recombination at the utr, precise recombination was not essential for generation of viable genomes and crossover sites were distributed throughout the permissible region in rna [ ] . second, inserts of one or two triplets were observed even for recombination at the rdrp-coding region, which was expected to require strong homology at the crossover site [ ] . third, recombination sometimes yielded malformed, though still viable, genomes with a mosaic rdrp gene and additional inserts in the utr [ ] . although occurring in many regions of viral rna, nonreplicative recombination does not necessarily proceed with similar efficiency in different regions. indeed, crossover sites cluster in hot spots when recombination takes place at the utr, regardless of whether the ' or the ' full-length partner is incorporated or recombination is internal ( [ ] ; e.v. belousov et al., unpublished data). the factors responsible for such selectivity are still unknown. it can be assumed that hot spots correspond to sites of preferential cleavage of the recombining partners by nucleases or cryptic ribozymes or, alternatively, to sites of preferential ligation (or incorporation into a polynucleotide chain). similar data were recently obtained with the bvdv, which belongs to the genus pestivirus of the family flaviviridae. like the polioviral rna, the genome of this cytoplasmic virus is a single-stranded rna of about . kb, which functions as a translational template and contains a single open reading frame coding for a polyprotein precursor of all viral proteins. the utr and the utr of the bvdv rna harbor cis elements essential for replication and translation, including an ires [ ] . infection with the bvdv is usually asymptomatic.
in some cases, however, it has severe complications and a lethal outcome. in such cases, two virus variants, or a virus pair, can be isolated from affected animals: one is noncytopathogenic and causes asymptomatic infection and the other is cytopathogenic and causes death [ ] . cytopathogenic variants commonly originate from noncytopathogenic viruses as a result of genome rearrangements (recombination). experiments with the bvdv provided additional evidence for nonreplicative rna recombination [ ] . transfection was performed with overlapping fragments of the bvdv rna, which lacked different parts of the rdrp-coding segment. it is clear that such fragments could not be replicated by themselves. however, cotransfection yielded viable recombinant viruses. both homologous and nonhomologous recombination were observed, with viral rdrp being altered by inserts and deletions in the latter case.
"spontaneous" rna recombination
nonreplicative rna recombination probably involves cell enzymes in the above cases. however, there are variants of recombination that seem to require no macromolecules other than rna partners. to prevent recombination depending on q β replicase (see above), rq rna fragments were incubated in the absence of this enzyme and ntps (the medium contained mg2+) or the ' fragment was oxidized and lacked the '-oh group, which is essential for the replicase-dependent mechanism. recombinants were generated again, although the generation rate was three orders of magnitude lower than in the presence of replicase [ ] . such "spontaneous" recombination occurred only between internal regions of the partners, and the crossover sites were distributed fairly uniformly. these findings made it possible to assume that the capability of exchanging fragments is a general property of rna molecules [ ] . however, such exchanges are rare: their rate was estimated at - recombination events per nucleotide per hour in the above experiments.
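The difference between precise (homologous) and imprecise (nonhomologous) joining of overlapping fragments, as in the poliovirus and bvdv cotransfection experiments described above, can be illustrated with a short sketch. The sequence, the overlap length, and the function name are hypothetical, and the code models only the sequence outcome of joining, not any enzymatic mechanism.

```python
def join_fragments(five_prime, three_prime, cut_5, cut_3):
    """Join five_prime[:cut_5] to three_prime[cut_3:] into one molecule."""
    return five_prime[:cut_5] + three_prime[cut_3:]


# Hypothetical genome split into two fragments that share a 6-nt overlap.
genome = "AUGGCUAAAGGGCCCUUUGA"
overlap = 6
frag_5 = genome[:12]              # 5' partner; its last 6 nt are the overlap
frag_3 = genome[12 - overlap:]    # 3' partner; its first 6 nt are the overlap

# Precise joining: the crossover may lie anywhere in the overlap, provided
# both cuts correspond to the same genome coordinate. offset == 0 keeps the
# full-length 3' partner, offset == overlap keeps the full-length 5' partner,
# and intermediate values correspond to internal crossing over.
precise_products = [
    join_fragments(frag_5, frag_3, len(frag_5) - overlap + offset, offset)
    for offset in range(overlap + 1)
]

# Imprecise joining: simple end-to-end ligation retains both copies of the
# overlap, so the product carries a 6-nt duplication.
imprecise = join_fragments(frag_5, frag_3, len(frag_5), 0)
print(imprecise, len(imprecise) - len(genome))
```

Every precise product is identical to the original genome regardless of where in the overlap the crossover falls, which is why sequencing alone locates a homologous crossover only to the interval between distinguishing markers.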
it has been known for a long time that nonreplicative covalent bonding between cell rnas or their fragments takes place during splicing of various types [ ] [ ] [ ] and some variants of rna editing [ ] . the main mechanisms of these processes are briefly considered below in order to compare them with nonreplicative recombination of viral rna genomes and to note some differences and similarities. intramolecular ( cis ) splicing is most common and consists in excision of internal segments (introns) from a primary rna transcript and ligation of the other segments (exons) of the same molecule. one mechanism of splicing is intricate and specialized and involves spliceosomes, which consist of tens of proteins and half a dozen low-molecular-weight rnas. another mechanism is due to the catalytic (ribozyme) activity of rnas themselves. in this case, a ribozyme is in an intron and its activity is determined, in particular, by the specific secondary and tertiary structures of the rna molecule. in addition to the complex specialized machinery (spliceosomes and ribozyme-containing introns), splicing requires recognition of several oligonucleotide signals located both in introns and in exons. natural splicing is strongly site-specific, involving certain sites of an rna molecule. one molecule can harbor several specific sites, which open up the possibility of alternative splicing [ ] . the molecular mechanisms of splicing are beyond the scope of this review. however, the problem of interest is clearly connected with trans splicing, a specific splicing variant that involves the joining of segments belonging to different rna transcripts. it is possible to consider trans splicing as a variant of nonreplicative rna recombination. trans splicing was detected in various organisms but is especially common in protozoans. trans splicing can proceed both via the spliceosome-dependent [ ] and, at least under laboratory conditions, via the ribozyme [ ] mechanisms. 
ribozymes are indeed capable of catalyzing partial reactions underlying trans splicing. for instance, some ribozymes introduce breaks in rna [ , ] and thereby generate potential partners for subsequent recombination. the capability of rna ligation is also characteristic of natural ribozymes [ , ] and ribozymes obtained by artificial selection in vitro [ ] [ ] [ ] . recent interest in trans splicing is due to two circumstances. on the one hand, its ribozyme variant may provide a model for important processes associated with the prebiotic stage of the evolution of the rna world. on the other hand, artificial trans splicing is promising for correcting pathologically changed rna molecules and, consequently, could be used for treatment and prevention of various disorders. several variants are known for artificial trans splicing catalyzed by ribozymes. one of the rna partners can be covalently linked with a ribozyme, while the other, free, partner must contain a short oligonucleotide recognizable by the given ribozyme. as a result of the ribozyme activity, the partners are joined in a single molecule and the ribozyme is released. this mechanism was successfully used to restore the integrity of a truncated lacz mrna [ ] and to correct the coding potential of several other modified mrnas [ ] [ ] [ ] . in another variant, a ribozyme (a self-splicing group ii intron) was cleaved into two components. its '-terminal fragment was ligated with the ' end of exon and the '-terminal fragment was ligated with the ' end of exon . as a result of its activity, the two exons were joined together and the intron was released [ ] . one more variant of trans splicing involves two rna partners and a ribozyme as separate molecules. for instance, the yeast group ii intron catalyzes reciprocal trans splicing of two rnas containing hexanucleotide sites for intron binding.
the reaction yields two chimeric rnas, one containing the ' end of one rna partner and the ' end of the other and the second one, vice versa [ ] . recombinant rnas were similarly obtained with tetrahymena and azoarcus group i introns acting as ribozymes [ ] . the azoarcus ribozyme performed the reaction with a high yield and imposed minimal requirements on the structure of recombination partners: the partners must only contain trinucleotides complementary to the corresponding sites of the intron. a specific variant of the covalent joining of rna molecules is reverse splicing, that is, insertion of an intron between two exons [ , ] . the reaction is not strongly specific: in addition to precise homing, an intron can be inserted in other sites that meet the minimal requirement of the presence of a specific tetranucleotide [ ] . reverse splicing probably contributes to the spreading of introns [ , ] . both the spliceosome-dependent and the ribozyme variants were studied in attempts to employ trans splicing in gene therapy. in the former case, cells are transformed with a construct designed to insert a therapeutic exon into a pathologically changed endogenous rna or to substitute it for a pathological exon [ ] . another approach is introducing a donor of a correcting rna sequence along with a covalently bound ribozyme to catalyze trans splicing. though less efficient, this method of correcting rna molecules is independent of the exon-intron structure of the target rna [ ] . a study was made of the possibility of using ribozyme-dependent trans splicing for treating persistent infection with the hepatitis c virus. a ribozyme was ligated with a fragment of the viral ires and the mrna coding for the diphtheria toxin. as a result of trans splicing with the viral rna, the exogenous mrna acquires the functional ires and synthesis of the diphtheria toxin causes death of the infected cell [ ] . 
thus, the covalent joining of the fragments of different rnas by means of trans splicing requires a sophisticated machinery (spliceosomes or ribozymes). under natural conditions, trans splicing is highly specific and joins only exons possessing the proper cis signals. yet the signals themselves are quite simple, providing for the possibility of nonspecific reactions [ ] . in any case, the necessity for an intricate machinery and a relatively high site specificity differentiate trans splicing from the known types of nonreplicative rna recombination.

recombination

recombination plays a dual role in the evolution of rna viruses. on the one hand, recombination facilitates elimination of harmful mutations arising during replication of viral rna and thereby provides a potent mechanism for stabilizing the genome [ , [ ] [ ] [ ] . homologous recombination is probably responsible for maintaining the conservation of the ' ends of plant rna viruses [ ] [ ] [ ] [ ] . on the other hand, rna recombination is an important factor in the variation of viruses, providing for the acquisition of qualitatively new genetic information as a result of transferring functionally significant modules from one virus to another or from the host to a virus [ , , , ] . while these general ideas are unquestionable, the biological significance of individual recombination events occurring in nature is still incompletely understood. for instance, intertypic recombinants are rapidly generated and become dominant in the intestine of recipients of the sabin poliomyelitis vaccine, which contains polioviruses of three serotypes [ , ] . this finding suggests a selective advantage of intertypic recombinants over the parental viruses. vaccination is performed with attenuated viruses, and recombination may eliminate the attenuating mutations.
this circumstance is probably responsible for the predominance of intertypic recombinants among viruses isolated from rare patients with paralytic poliomyelitis developing as a result of vaccination [ ] [ ] [ ] . under certain conditions (e.g., when population immunity is low), derivatives of vaccine viruses become capable of wide circulation. so far, all known long-circulating (which cause outbreaks of infection) derivatives of poliovirus vaccine strains are recombinants that contain rna regions acquired either from wild-type polioviruses or related enteroviruses [ ] . on the other hand, it cannot be excluded that, owing to its high rate, recombination is likely to occur when two compatible partners meet and that further fixation of recombinants is due to random bottlenecking rather than selection. in addition, it is difficult to determine whether particular natural recombination events are replicative or nonreplicative. it is possible to assume from general observations (e.g., the low processivity of viral rdrp) that, for example, replicative recombination is more common than nonreplicative recombination in coronaviruses. the clustering of crossover sites in the vicinity of replication promoters or enhancers suggests the same situation for some plant viruses [ , ] . however, it is still impossible to compare the frequency of the two mechanisms of recombination for other viruses, even for the best-studied system of polioviral infection. the type of recombination is difficult to identify even when viral rna fragments, rather than viruses, are recombination partners. it is most interesting in the context of this review that viable recombinants are generated from rna partners incapable of self-replication. such a situation was described, in particular, for overlapping rna fragments of the sindbis [ ] and rubella [ ] viruses. 
one fragment comprised about two-thirds of the genome from its ' end and coded for nonstructural proteins, and the other coded for structural proteins. cotransfection with these fragments yielded recombinants, and a considerable portion of these resulted from imprecise (nonhomologous) recombination. these findings were interpreted in terms of the replicative model, because the ' fragment was capable of directing rdrp synthesis. although possible, this interpretation is still questionable. first, rdrp of the rubella virus is insufficiently effective in trans [ , ] . second, it seems unnatural that the proportion of nonhomologous crossovers is rather high even though the overlap of the two fragments provides extended identical sequences. thus, the nonreplicative origin of recombinants cannot be excluded in these cases. data on the incorporation of cell nucleic acids into viral rnas are of particular interest. the evolutionary relatedness of virus and cell genes was demonstrated quite convincingly [ ] . here, we will only consider relatively recent cases of the acquisition of host genes or gene fragments by rna viruses. the bvdv provides the most illustrative example. as already noted, this virus usually causes asymptomatic infection both in cultured cells and in animals. however, a pair of related isolates can be obtained from animals with lethal infection: one isolate is a cytopathogenic recombinant and the major disease-producing factor [ ] . many cytopathogenic variants result from insertion of various cell nucleotide sequences into the viral genome. viral rna most commonly acquires the ubiquitin gene [ , , ] , which plays an important role in protein degradation within the cell.
in addition, integration into the viral genome was observed for other host sequences such as genes coding for ubiquitin-like proteins involved in nucleocytoplasmic transport [ ] and regulation of the cell cycle [ ] , other intracellular transport components [ , ] , a chaperone [ ] , and the ribosomal protein s a [ ] . such inserts affect the processing of the viral polyprotein and thereby change the virus phenotype. other examples of cell nucleotide sequences integrated into viral rnas are a fragment of the s rrna in the genome of the influenza virus [ ] and a sequence similar to an exon of tobacco chloroplast rna in the rna genome of the potato leafroll virus [ ] . recombination between host and viral rnas was also observed under laboratory conditions. structures originating from cell trnas were found in the '-terminal region of several di genomes of the sindbis virus [ ] and in small rq rnas from escherichia coli cells infected with phage qβ [ ] . fragments of other host genes were also detected in rq rnas. a pseudorevertant was isolated from cells transfected with a mutant rna transcript of the poliovirus and proved to contain an insert partly ( out of nt) identical to a region of the host s rrna [ ] . since the identical region is also present in the e. coli s rrna, it cannot be excluded that recombination between the viral rna and the contaminant bacterial rrna occurred during the experiment. working with nonpurified polioviral transcripts, we observed a fragment of the e. coli s rrna inserted into the utr of the polioviral rna (unpublished data). although incorporation of host nucleotide sequences into viral rnas can be explained in terms of template switching, the nonreplicative mechanism is equally possible in such cases. the nonhomologous character of the corresponding recombination events provides additional circumstantial evidence in favor of the nonreplicative mechanism. 
conclusions

covalent rearrangements of rna are widespread and play an important biological role. intramolecular or intermolecular splicing is a classical example of natural rearrangements in cell rnas. although splicing can be determined by different mechanisms (depending on spliceosomes or ribozymes), its important features are fairly strong site specificity and the involvement of complex macromolecular structures. recombination between viral rnas represents a special type of rna rearrangement. rna recombination can proceed through various mechanisms, both replicative and postreplicative (nonreplicative). in turn, the mechanisms of replicative and nonreplicative rna recombination can vary. some of them are probably similar to the processes of splicing, while others are based on different principles. in most cases, rna recombination shows no appreciable site specificity and does not require conserved or intricate rna structures. recombination continues to play an important role in the variation and evolution of rna viruses, facilitating the exchange of genes (or their fragments) between different viruses or between viruses and host cells. on the other hand, recombination performs an opposite function, maintaining the stability of viral rna genomes and eliminating unfavorable mutations. in addition, nonreplicative recombination between viral rnas has another aspect, going beyond the scope of virology. it is appealing to assume that the uninfected cell also provides room for some of the processes that underlie nonreplicative recombination and differ from the canonical exon-intron rearrangements associated with intramolecular or intermolecular splicing. if nonspecific covalent joining of cell rnas or their fragments does occur, then a natural next step is to consider the possibility of fixation of chimeric rna sequences in chromosomal dna via reverse transcription.
in other words, it is possible to assume that some of the mechanisms of nonreplicative rna recombination play an important role in the evolution of not only viral, but also cell genomes [ , ] . viral quasi-species and fitness variations error frequencies of picornavirus rna polymerases: evolutionary implications for virus populations response of foot-and-mouth disease virus to increased mutagenesis: influence of viral load and fitness in loss of infectivity rna virus error catastrophe: direct molecular test by using ribavirin defective interfering particles of poliovirus: . isolation and physical properties common and distinct region of defective-interfering rnas of sindbis virus primary structure of poliovirus defective-interfering particle genomes and possible generation mechanisms of the particles functional and genetic plasticities of the poliovirus genome: quasi-infectious rnas modified in the '-untranslated region yield a variety of pseudorevertants agol v.i. . a prokaryotic-like ciselement in the cap-independent internal initiation of translation on picornavirus rna attenuation of theiler's murine encephalomyelitis virus by modifications of the oligopyrimidine/aug tandem, a host-dependent translational ciselement distinct attenuation phenotypes caused by mutations in the translational starting window of theiler's murine encephalomyelitis virus nucleotide sequence and genome organization of foot-andmouth disease virus gross rearrangements within the '-untranslated region of the picornaviral genomes genetic recombination with newcastle disease virus, polioviruses, and influenza. cold spring harbor symp genetic recombination with poliovirus type . 
studies of crosses between a normal horse serum-resistant mutant and several guanidine-resistant mutants of the same strain evidence of genetic recombination in foot-and-mouth disease virus a genetic map of poliovirus temperature-sensitive mutants biochemical evidence for intertypic genetic recombination of polioviruses intertypic recombination in poliovirus: genetic and biochemical studies multiple sites of recombination within the rna genome of foot-and-mouth disease virus analysis of oligonucleotide maps as a method for identifying intertypic recombination in poliovirus the primary structure of crossover region in the genome of two intertypic polyovirus recombinants the primary structure of crossover regions of intertypic poliovirus recombinants: a model of recombination between rna genomes the mechanism of rna recombination in poliovirus genetics of picornaviruses recombination between rna genomes rna recombination in animal and plant viruses recombination and other genomic rearrangements in picornaviruses evolutionary aspects of recombination in rna viruses phylogenetic analysis reveals a low rate of homologous recombination in negative-sense rna virus new insights into the mechanisms of rna recombination how rna viruses exchange their genetic material heterologous recombination in the doublestranded rna bacteriophage phi recombination in bacteriophage qβ its satellite rnas: the in vivo and in vitro studies the extent of homologous recombination in members of the genus flavivirus genetic recombination in wild-type poliovirus evolution of the genome of human enterovirus b: incongruence between phylogenies of the vp and cd regions indicates frequent recombination within the species recombination in circulating enteroviruses recombination in uveitis-causing enterovirus strains rna recombination plays a major role in genomic change during circulation of coxsackie b viruses evidence for frequent recombination within species human enterovirus b based incomplete 
genomic sequences of all thirty-seven serotypes complete genome sequences of all members of the species human enterovirus a poliovirus rna recombination in cell-free extracts genetic recombination of poliovirus in a cell-free systems dissecting rna recombination in vitro: role of rna sequences and the viral replicase mechanism of rna recombination in carmo-and tombusviruses: evidence for template switching by the rna-dependent rna polymerase in vitro factors regulating template switch in vitro by viral rna-dependent rna polymerases: implications for rna-rna recombination on the nature of poliovirus genetic recombinants the puzzle of rna recombination a new look on rna recombination mutations in the helicase-like domain of protein a alter the sites of rna-rna recombination in brome mosaic virus a mutation in the putative rna polymerase gene inhibits nonhomologous, but not homologous, genetic recombination in rna virus mutations in the n terminus of the brome mosaic virus polymerase affect genetic rna-rna recombination poliovirus rnadependent rna polymerase ( dpol) is sufficient for template switching in vitro poliovirus-specific primer-dependent rna polymerase able to copy poly(a) encephalomyocarditis virus rna polymerase preparations, with and without rna helicase activity rna duplex unwinding activity of poliovirus rna-dependent rna polymerase d pol template/primer requirements and single nucleotide incorporation by hepatitis c virus nonstructural protein b polymerase viral rna-directed rna polymerases use diverse mechanisms to promote recombination between rna molecules studies on the recombination between rna genomes of poliovirus: the primary structure and nonrandom distribution of crossover regions in the genomes of intertypic poliovirus recombinants. 
virology targeting the site of rna-rna recombination in brome mosaic virus with antisense sequences role rna structure in nonhomologous recombination between genomic molecules of brome mosaic virus preferred sites of recombination in poliovirus rna: an analysis of intertypic crossover sequences homologous rna recombination in brome mosaic virus: au-rich sequence decreases the accuracy of crossovers a model for rearrangements in rna genomes the polymerase in its labyrinth: mechanisms and implications of rna recombination rna polymerase marching backward transcriptional arrest: escherichia coli rna polymerase translocates backward, leaving the ' end of the rna intact and extruded sequences and structures required for recombination between virus-associated rnas a transcriptionally active subgenomic promoter supports homologous crossovers in a plus-strand rna virus dissecting the requirement for subgenomic promoter sequences by rna recombination of brome mosaic virus in vivo: evidence for functional separation of transcription and recombination rna recombination between cucumoviruses: possible role of predicted stem-loop structures and an internal subgenomic promoter-like motif nonhomologous rna-rna recombination events at the ' nontranslated region of sindbis virus genome: hot spots and utilization of nonviral sequences unusual heterogeneity of leader-mrna fusion in a murine coronavirus: implications for the mechanism of rna transcription and recombination coronaviridae: the viruses and their replication genetic recombination of poliovirus in vitro and in vivo: temperature-dependent alteration of crossover sites nonhomologous rna recombination in a cell-free system: evidence for a transesterification mechanism guided by secondary structure on the nature of spontaneous rna synthesis by qβ replicase cloning of rna molecules in vitro spontaneous rearrangements in rna sequences picornavirus genome: an overview initiation of translation of picornavirus rnas: structure and 
function of the internal ribosome entry site translational control of picornavirus phenotype possible unifying mechanism of picornavirus genome replication agol v.i. . nonreplicative rna recombination in poliovirus construction of viable deletion and insertion mutants of the sabin strain of type poliovirus: function of the ' noncoding sequence in viral replication construction of less neurovirulent polioviruses by introducing deletions into the ' noncoding sequence of the genome poliovirus neurovirulence depends on the presence of a cryptic aug upstream of the initiator codon nonreplicative homologous rna recombination: promiscuous joining of rna pieces? rna trna splicing a host-specific function is required for ligation of a wide variety of ribozymeprocessed rnas bacteriophage t rna ligase (gp . ) exemplifies a family of rna ligases found in all phylogenetic domains two reactions of haloferax volcanii rna splicing enzymes: joining of exons and circularization of introns distinct functions of two rna ligases in active trypanosoma brucei rna editing complexes uridine insertion/deletion rna editing in trypanosome mitochondria: a complex business origin of splice junction phosphate in trnas processed by hela cell extract rna '-terminal phosphate cyclase activity and rna ligation in hela cell extract flaviviridae: the viruses and their replication molecular characterization of pestiviruses rna recombination in vivo in the absence of viral replication mechanisms of fidelity in pre-mrna splicing the chemical repertoire of natural ribozymes rna-protein interactions that regulate pre-mrna splicing alternative pre-mrna splicing and proteome expansion in metazoans trans and cis splicing in trypanosomatids: mechanism, factors, and regulation ribozyme-mediated repair of defective mrna by targeted trans-splicing structural diversity of self-cleaving ribozymes kinetics and thermodynamics of intermolecular catalysis by hairpin ribozymes the internal equilibrium of the hammerhead 
ribozyme reaction isolation of new ribozymes from a large pool of random sequences. science emergence of a dual-catalytic rna with metal-specific cleavage and ligase activities: the spandrels of rna evolution design and optimization of effector activated ribozyme ligases ribozyme-mediated repair of sickle β-globin mrnas in erythrocyte precursors design of highly specific cytotoxins by using trans-splicing ribozymes induction of wildtype p activity in human cancer cells by ribozymes that repair mutant p transcripts use of engineered ribozymes to catalyze chimeric gene assembly group ii intron rna-catalyzed recombination of rna in vitro generalized rnadirected recombination of rna reverse self-splicing of the tetrahymena group i intron: implication for the directionality of splicing and for intron transposition integration of group ii intron bi into a foreign rna by reversal of the selfsplicing reaction in vitro integration of the tetrahymena group i intron into bacterial rrna by reverse splicing in vivo unexpected abundance of self-splicing group i introns in the genome of bacteriophage twort: introns in multiple genes, a single gene with three introns, and exon skipping by group i ribozymes barriers to intron promiscuity in bacteria messenger rna reprogramming by spliceosome-mediated rna trans-splicing ribozyme-mediated revision of rna and dna ribozyme-mediated selective induction of new gene activity in hepatitis c virus internal ribosome entry site-expressing cells by targeted trans-splicing in vivo restoration of biologically active ' ends of virus-associated rnas by nonhomologous rna recombination and replacement of a terminal motif evolution of sex and the molecular clock in rna viruses picornavirus genetics: an overview genetic recombination between rna components of a multipartite plant virus rna recombination in turnip crinkle virus: its role in formation of chimeric rnas, multimers, and in '-end repair polymerization of nontemplate bases before 
transcription initiation at the ' ends of templates by rna-dependent rna polymerase: an activity involved in ' end repair of viral rnas efficient system of homologous rna recombination in brome mosaic virus: sequence and structure requirements and accuracy of crossovers evolution of positivestrand rna viruses molecular biology and evolution of closteroviruses: sophisticated buildup of large rna genomes intertypic genomic rearrangements of poliovirus strains in vaccines genomic features of intertypic recombinant sabin poliovirus strains excreted by primary vaccinees frequent isolation of intertypic poliovirus recombinants with serotype specificity from vaccine-associated polio cases polioviruses with natural recombinant genomes isolated from vaccine-associated paralytic poliomyelitis evolution of the sabin type poliovirus in humans: characterization of strains isolated from patients with vaccine-associated paralytic poliomyelitis circulating vaccine-derived polioviruses: current state of knowledge genesis of sindbis virus by in vitro recombination of nonreplicative rna precursors analysis of intermolecular rna-rna recombination by rubella virus rubella virus rna replication is cis-preferential and synthesis of negative-and positive-strand rnas is regulated by the processing of nonstructural protein rubella virus di rnas and replicons: requirement for nonstructural proteins acting in cis for amplification by helper virus origin of rna viral genomes: approaching the problem by comparative sequence analysis viral cytopathogenicity correlated with integration of ubiquitin-coding sequences nonhomologous rna recombination in bovine viral diarrhea virus: molecular characterization of a variety of subgenomic rnas isolated during an outbreak of fatal mucosal disease insertion of a bovine smt b gene in ns b and duplication of ns in a bovine viral diarrhea virus genome correlate with the cytopathogenicity of the virus insertion of cellular nedd coding sequences in a pestivirus 
insertion of a sequence encoding light chain of microtubule-associated proteins a and b in a pestivirus genome: connection with virus cytopathogenicity and induction of lethal disease in cattle cellular sequences in pestivirus genomes encoding gamma-aminobutyric acid receptor-associated protein and golgi-associated atpase enhancer of kilodaltons ribosomal s a coding sequences upstream of ubiquitin coding sequences in the genome of a pestivirus increased viral pathogenicity after insertion of a s ribosomal rna sequence into the haemagglutinin gene of an influenza virus the '-terminal sequence of potato leafroll virus rna: evidence of recombination between virus and host rna rnas from two independently isolated defective interfering particles of sindbis virus contain a cellular trna sequence at their ' ends an in vivo recombinant rna capable of autocatalytic synthesis by q beta replicase transduction of a human rna sequence by poliovirus we are grateful to a.b. chetverin for critical comments on the manuscript.our recent experimental work considered in the review was supported by a grant from intas (no. - ), the russian foundation for basic research (project nos. - - , - - ), a scientific school support grant (no. nsh- . . ) , and the ministry of science and technology of the russian federation (project no. . . . . /ipve). key: cord- -ef svn f authors: saitou, naruya title: eukaryote genomes date: - - journal: introduction to evolutionary genomics doi: . / - - - - _ sha: doc_id: cord_uid: ef svn f general overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk dnas. we then discuss the evolutionary features of eukaryote genomes, such as genome duplication, c-value paradox, and the relationship between genome size and mutation rates. genomes of multicellular organisms, plants, fungi, and animals are then briefly discussed. 
duplications sometimes occur in eukaryotes, especially in plants and in vertebrates, but genome duplication is so far not known for prokaryotic genomes. because the gene number of typical eukaryotic genomes is much larger than that of prokaryotes, there are many genes that are shared among most eukaryote genomes but absent from prokaryote genomes. some examples are listed in table . . for example, myosin is located in animal muscle tissues, and its homologous protein exists in the cytoskeleton of all eukaryotes but is not found in prokaryotes. recently, kryukov et al. ( ; [ ] ) constructed a new database on oligonucleotide sequence frequencies and conducted a series of statistical analyses. frequencies of all possible - oligonucleotides were counted for each genome, and these observed values were compared with expected values computed from the observed frequencies of oligonucleotides of length - . deviations from expected values were much larger for eukaryotes than for prokaryotes, except for fungal genomes. figure . shows the distribution of the deviation for various organismal groups. the biological reason for this difference is not known. there are two major types of organelles in eukaryotes: mitochondria and plastids. figure . shows schematic views of mitochondria and chloroplasts. each of these two organelles has its own independent genome. this suggests that they were initially independent organisms which started intracellular symbiosis with primordial eukaryotic cells. because most eukaryotes have mitochondria, the ancestral eukaryotes, a lineage that emerged from archaea, most probably started intracellular symbiosis with the mitochondrial ancestor. the parasitic bacterium rickettsia prowazekii is so far phylogenetically closest to mitochondria [ ] , and a rickettsia-like bacterium is the best candidate for the mitochondrial ancestor. however, there is an alternative "hydrogen hypothesis" [ ] .
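the observed-versus-expected comparison of oligonucleotide frequencies described above can be sketched as follows. this is only a minimal illustration, not the procedure of kryukov et al.: it assumes one common estimator in which the expected count of a word of length k is derived from the counts of its two overlapping (k-1)-mers and their shared (k-2)-mer, and the function names `kmer_counts`, `expected_count`, and `obs_exp_ratio` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(word, seq):
    """Expected count of `word` from shorter-word counts (assumed estimator).

    E(w1..wk) = N(w1..w[k-1]) * N(w2..wk) / N(w2..w[k-1])
    """
    k = len(word)
    if k < 3:
        raise ValueError("word must be at least 3 nucleotides long")
    shorter = kmer_counts(seq, k - 1)   # counts of (k-1)-mers
    core = kmer_counts(seq, k - 2)      # counts of (k-2)-mers
    middle = core[word[1:-1]]
    if middle == 0:
        return 0.0
    return shorter[word[:-1]] * shorter[word[1:]] / middle


def obs_exp_ratio(word, seq):
    """Observed/expected ratio; values far from 1 indicate a deviation."""
    observed = kmer_counts(seq, len(word))[word]
    expected = expected_count(word, seq)
    return observed / expected if expected else float("nan")
```

a ratio well above or below 1 marks an over- or under-represented word; summarizing such ratios over all words of a genome gives one possible measure of the deviation discussed in the text.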
plastids include chloroplasts, leucoplasts, and chromoplasts and exist in land plants, green algae, red algae, glaucophyte algae, and some protists like euglenoids. mitochondrial genome sizes of some representative eukaryotes are listed in table . . most animal mitochondrial genomes are less than kb, and the mitochondrial genomes of protists and fungi are somewhat larger. mitochondrial genomes of plants are much larger than those of other eukaryotic lineages, yet their size is mostly less than kb. an ancestral eukaryotic cell, probably an archaean lineage, hosted a bacterial cell, and intracellular symbiosis started. initially, archaea and bacteria shared genes responsible for basic metabolism, and the situation was a sort of gene duplication for many genes, though the homologous genes were not identical but had already diverged a long time before. in any case, division of labor followed, and only limited metabolic pathways were left in the bacterial system, which eventually became mitochondria. animal mitochondrial genomes contain a very small number of genes: for peptide subunits, for trna, and for rrna [ ] .

[table residue: genome size (kb); animals: homo sapiens (human), takifugu rubripes (torafugu fish); values not recoverable. representative animal species mitochondrial dna genomes.]

although most vertebrate mitochondrial dna genomes have the same gene order as in human ( fig. . a ), gene order may vary from phylum to phylum. yet the gene content and the genome size are more or less constant among animals. it is not clear why animal mitochondrial genomes are so small. one possibility is that animal individuals are highly integrated compared to fungi and plants, and this might have influenced a drastic reduction of the mitochondrial genome size. another interesting feature of animal mitochondrial dna genomes is the heterogeneous rate of gene order change. for example, platyhelminthes exhibit great variability in mitochondrial gene order (sakai and sakaizumi, ; [ ] ).
in contrast, plant mitochondrial genomes are much larger (see table . ). figure . shows the genome structure of the tobacco mitochondrial genome (from sugiyama et al. ; [ ] ). horizontal gene transfers are also known to occur in plant mitochondrial dnas even between remotely related species [ ] . the melon ( cucumis melo ) mitochondrial genome, ca. . mb in size, is exceptionally large, and recently its draft genome was determined [ ] . interestingly, the melon mitochondrial genome resembles a vertebrate nuclear genome in its contents, even though its size is similar to that of a bacterial genome. the protein coding gene region accounts for only . % of the genome, and about half of the genome is composed of repeats. the remaining part is mostly homologous to melon nuclear dna, and . % is homologous to melon chloroplast dna. most of the protein coding genes of melon mitochondrial dna are highly similar to those of the related cucurbit species watermelon and squash, whose mitochondrial genome sizes are kb and kb, respectively. this indicates that the huge expansion of the melon mitochondrial genome occurred only recently. interestingly, cucumber ( cucumis sativus ), a congeneric species, also has a ~ . -mb mitochondrial genome with many repeat sequences [ ] . it will be interesting to study whether the expansions of the mitochondrial genomes of melon and cucumber were independent or not. chloroplasts exist only in plants, algae, and some protists. a chloroplast may change into a leucoplast or a chromoplast. because of this, the generic name "plastids" may also be used. the origin of the chloroplast seems to be a cyanobacterium that started intracellular symbiosis, as in the case of mitochondria. a unique but common feature of chloroplast genomes is the existence of inverted repeats [ ] , and they mainly contain rrna genes. chloroplast dna contents may [ ] . chloroplast genomes were determined for more than species as of december [ ] .
their genome sizes range from kb ( rhizanthella gardneri ) to kb ( floydiella terrestris ). although the largest chloroplast genome is still much smaller than a typical bacterial genome, its average intergenic length is kb, much longer than that for bacterial genomes. fragments of mitochondrial dna are sometimes inserted into nuclear genomes, and they are called "numts." an extensive analysis of the human genome found over numts [ ] . their sequence patterns are random in terms of mitochondrial genome locations. this suggests that mitochondrial dnas themselves were inserted, not cdna reverse-transcribed from mitochondrial mrna. a possible source is sperm mitochondrial dna that was fragmented after fertilization [ ] . transfer in the reverse direction, from nucleus to mitochondria, was observed in melon, as discussed in subsection . . . an intron is a dna region of a gene that is eliminated during splicing after transcription of a long premature mrna molecule. the intron was discovered by phillip a. sharp and richard j. roberts in as an "intervening sequence" [ ] , but the name "intron," coined by walter gilbert in [ ] , is now widely used. it should be noted that some description of introns by kenmochi [ ] was used for writing this section. there are various types of introns, but they can be classified into two: those requiring spliceosomes (spliceosome type) and the self-splicing type. figure . shows the splicing mechanisms of these two major types. most introns in nuclear genomes of eukaryotes are of the spliceosome type, and there are the common gu-ag type and the rare au-ac type, depending on the nucleotide sequences of the intron-exon boundaries [ ] . the spliceosomes that process these two types differ [ ] . self-splicing introns are divided into three groups: groups i, ii, and iii. group i introns exist in organellar and nuclear rrna genes of eukaryotes and in prokaryotic trna genes. group ii introns are found in organellar and some eubacterial genomes.
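the distinction between the common gu-ag and rare au-ac spliceosomal introns mentioned above rests on the two nucleotides at each intron end, which can be sketched in a few lines. this is a toy illustration only: the function name `classify_intron` is hypothetical, the input is assumed to be the intron sequence alone (without flanking exons), and real annotation pipelines use more than the terminal dinucleotides.

```python
def classify_intron(intron):
    """Classify an intron by its terminal dinucleotides.

    Accepts RNA or DNA letters in either case; T is read as U.
    Returns "GU-AG", "AU-AC", or "other".
    """
    seq = intron.upper().replace("T", "U")
    if len(seq) < 4:
        raise ValueError("intron must contain at least 4 nucleotides")
    ends = (seq[:2], seq[-2:])   # (5' dinucleotide, 3' dinucleotide)
    if ends == ("GU", "AG"):
        return "GU-AG"
    if ends == ("AU", "AC"):
        return "AU-AC"
    return "other"
```

for example, `classify_intron("gtaagtag")` reads the dna letters as rna and reports the common type, while a sequence with neither boundary pair falls into "other".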
cavalier-smith [ ] suggested that spliceosome-type introns originated from group ii introns, because of the similarity in splicing mechanism and the structural similarity between group ii introns and spliceosomal rna. group iii introns exist in organellar genomes, and their splicing system is similar to that of group ii introns, though they are smaller and have a unique secondary structure. there is yet another type of intron, which exists only in trnas of single-cell eukaryotes and archaea [ ]. these introns do not have a self-splicing function; instead, an endonuclease and an rna ligase are involved in their splicing. this type of intron is often located at a particular position in the trna anticodon loop.

since the discovery of introns, their probable functions and evolutionary origin have long been debated (e.g., [ , ]). because self-splicing introns could have arisen at any time, even at a very early stage of the origin of life, we consider only spliceosome-type introns here. for brevity, we hereafter call this type simply "introns." there are two major hypotheses: introns early and introns late. the former claims that exons existed as functional units in the common ancestor of prokaryotes and eukaryotes, and "exon shuffling" was proposed as a way of creating new protein functions [ ]. under this view, introns, which separate exons, should also be of quite ancient origin [ , ]. in contrast, according to the introns-late hypothesis, introns emerged only in the eukaryotic lineage [ , ]. the protein "module" hypothesis proposed by go [ ] is related to the introns-early hypothesis. patterns of intron gain and loss have been estimated by various methods (e.g., [ , ]). kenmochi and his colleagues analyzed introns of ribosomal proteins of mitochondrial genomes and eukaryotic nuclear genomes in detail [ - ].
these studies supported the introns-late hypothesis, because introns in mitochondrial and cytosolic ribosomal protein genes seem to be of independent origin, and introns seem to have emerged in many ribosomal protein genes after eukaryotes appeared. introns, by definition, do not code for amino acid sequences. in this sense, most introns may be classified as junk dna (see the next section). there are, however, evolutionarily conserved regions in introns, suggesting that some introns have functional roles.

ohno ( ; [ ]) proposed that most of the mammalian genome is nonfunctional and coined the term "junk dna." with the advent of eukaryotic genome sequence data, it is now clear that he was right: eukaryotic genomes do contain a great deal of junk dna. junk dna, or nonfunctional dna, can be divided into repeat sequences and unique sequences. repeat sequences are of either the dispersed type or the tandem type. unique sequences include pseudogenes that retain homology with functional genes.

prokaryote genomes sometimes contain insertion sequences, but in many eukaryotic genomes dispersed repeats of this kind constitute the major portion of the genome. these interspersed elements are divided into two major categories according to their lengths: short ones (sines) and long ones (lines). one well-known example of a sine is the alu element of primate genomes. it is about bp in length and originated from the sl rna gene, which encodes the rna component of the signal recognition particle rather than a ribosomal rna. let us look at a real alu element from the human genome sequence. if we retrieve the ddbj/embl/genbank international sequence database entry with accession number ap (a part of chromosome ), there are alu elements in the kb sequence. the density is . alu elements per kb. if we extrapolate to the whole human genome of ~ billion bp, alu repeats are expected to exist in ~ . million copies.
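the extrapolation above is simple proportional scaling: copies ≈ (elements counted / length of the sampled region) × genome length. the python sketch below shows the arithmetic. the function name is hypothetical, and the numbers in the usage line are placeholders chosen for illustration, not the values from the database entry discussed here.

```python
def extrapolate_copy_number(elements_found: int,
                            sampled_bp: int,
                            genome_bp: int) -> float:
    """Scale a repeat count from a sampled region up to a whole genome.

    Assumes the sampled region is representative of the genome, which is
    only a rough approximation because repeat density varies by region.
    """
    if sampled_bp <= 0:
        raise ValueError("sampled_bp must be positive")
    density_per_bp = elements_found / sampled_bp
    return density_per_bp * genome_bp


# Placeholder inputs: 40 elements in a 100-kb sample, 3-billion-bp genome.
estimate = extrapolate_copy_number(40, 100_000, 3_000_000_000)
print(f"estimated copies: {estimate:,.0f}")
```

an estimate of this kind inherits any bias of the sampled region, so a gene-rich or repeat-poor sample would give a noticeably different figure.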
one example of an alu sequence, from coordinates to of this entry, is shown below:

ggcgggagcg atggctcacg cctgtaatgc cagcactttg ggaggccgag gtgggtggat cacaaggtca ggagatagag accatcctgg ctaacacggt gaaacactgt ctctactaaa aacacaaaaa actagccagg cgtggtggcg ggtgcctgta atcccagcta ctcgggaggc tgaggcagga gaatggtgtg aacccaggaa gtggagcttg cagtgagctc agattgcgcc actgcactcc agcctgggtg acagagtgag actccatctc aaaaaaaata aaataaataa aaaaaa

if we do a blast homology search (see chap. ) using the ddbj system (http://blast.ddbj.nig.ac.jp/blast/blastn) against nonhuman primate sequences (the pri division of the ddbj database), the best hit was obtained from chimpanzee chromosome , which is orthologous to human chromosome . i encourage interested readers to try this homology search for practice.

alu elements were first classified into the j and s subfamilies [ ]. the reason for choosing these two letters is not clear, but the two authors (jurka and smith) probably used the initials of their surnames. in any case, this division was based on the distance from the alu consensus sequence: alu elements closer to the consensus were classified as s and the others as j. later, a subset of the s subfamily was found to be highly similar to each other, and they were named y, after "young," because they appeared relatively recently. rough estimates of the divergence times of alu elements are as follows: the j subfamily appeared about million years ago, the s subfamily separated from j million years ago, and y separated further million years ago [ ]. figure . shows the overall pattern of alu element evolution (based on [ ]).

tandemly repeated sequences are also abundant in eukaryotic genomes, and the representative ones are found in heterochromatin regions. heterochromatin regions are highly condensed, nonfunctional regions of nuclear dna, in contrast to euchromatin regions, in which many genes are actively transcribed.
heterochromatin regions usually reside at telomeres, the terminal parts of chromosomes, and at centromeres, the internal parts of chromosomes to which spindle fibers attach during cell division. centromeric regions of more than mb in arabidopsis thaliana were found to consist of tandem repeats of a ca. -bp repeat unit [ , ]. the nucleotide sequence below is the arabidopsis thaliana tandemly repeated sequence ar (international sequence database accession number x ):

aagcttcttc ttgcttctca atgctttgtt ggtttagccg aagtccatat gagtctttgt ctttgtatct tctaacaagg aaacactact taggctttta ggataagatt gcggtttaag ttcttatact taatcataca catgccatca agtcatattc gtactccaaa acaataacc

the human genome also has a similar but nonhomologous sequence in centromeres, called "alphoid dna," with a -bp repeat unit [ ]. the following is its sequence (international sequence database accession number m ):

catcctcaga aacttctttg tgatgtgtgc attcaagtca cagagttgaa cattcccttt cgtacagcag tttttaaaca ctctttctgt agtatctgga agtgaacatt aggacagctt tcaggtctat ggtgagaaag gaaatatctt caaataaaaa ctagacagaa g

when a blast homology search (see chap. ) was run against the human genome sequences of the ncbi database, there was no hit with this alphoid sequence. this clearly shows that the human genome sequences currently available are far from complete, for they do not include most of these tandem repeat sequences. telomeres of the human genome are composed of hundreds of copies of the -bp repeat ttaggg. if we search the human genome with ncbi blast, using as the query a -bp-long sequence consisting of tandem copies of this repeat unit, many hits are obtained.

as we already discussed in chap. , authentic pseudogenes have no function, and they are genuine members of junk dna. when a gene duplication occurs, one of the two copies often becomes a pseudogene. because gene duplication is prevalent in eukaryote genomes, pseudogenes are also abundant. pseudogenes are, by definition, homologous to functional genes.
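the telomere search described above can also be tried locally on any sequence. the python sketch below builds a tandem-repeat query from a unit such as ttaggg and finds the longest uninterrupted run of that unit in a dna string. the function names and the toy sequence are hypothetical, and the code only detects perfect repeats on the given strand, whereas blast tolerates mismatches and searches both strands.

```python
import re


def make_tandem_query(unit: str, copies: int) -> str:
    """Return `copies` head-to-tail copies of a repeat unit, upper-cased."""
    if not unit or copies < 1:
        raise ValueError("need a non-empty unit and at least one copy")
    return unit.upper() * copies


def longest_tandem_run(sequence: str, unit: str) -> int:
    """Return the largest number of consecutive perfect copies of `unit`."""
    if not unit:
        raise ValueError("repeat unit must not be empty")
    unit = unit.upper()
    # A lookahead lets us test a greedy run starting at every position,
    # so runs that begin inside an earlier partial match are not missed.
    pattern = re.compile(f"(?=((?:{re.escape(unit)})+))")
    runs = [len(m.group(1)) // len(unit)
            for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


# Toy sequence (made up): five copies, a spacer, then two more copies.
toy = "ACGA" + "TTAGGG" * 5 + "CC" + "TTAGGG" * 2
print(make_tandem_query("ttaggg", 3))
print(longest_tandem_run(toy, "ttaggg"))
```

exact matching is used here only to keep the sketch short; diverged repeat copies, which are common in real heterochromatin, would need an alignment-based tool.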
however, after a long evolutionary time, many selectively neutral mutations accumulate in pseudogenes, and they eventually lose sequence homology with their functional counterparts. there are many unique sequences in eukaryote genomes, and the majority of them may be homology-lost pseudogenes of this kind.

a long rna is initially transcribed from a genomic region having an exon-intron structure, and then the rnas corresponding to introns are spliced out. these leftover rnas may be called "junk" rnas, for they are soon degraded by rnases. only a limited set of genes is transcribed in each tissue of a multicellular organism, but leaky expression of some genes may occur in tissues in which they should not be expressed. these transcripts are again "junk" rnas, and they are swiftly degraded. a series of studies (e.g., [ , ]) claimed that many noncoding dna regions are transcribed. however, van bakel et al. [ ] showed that most of these transcripts were artifacts of the tiling-array technologies used in those studies. if a nonsense or frameshift mutation occurs in a protein coding sequence, that gene can no longer make a functional protein, yet its mrna may continue to be produced until its promoter or enhancer becomes nonfunctional. such mutated genes thus produce junk rnas. if rnas are found in cells only in small quantities and are not evolutionarily conserved, they are probably some kind of junk rna.

as junk dnas and junk rnas exist, cells may also have "junk" proteins. if mature mrnas are not produced in the expected way, various aberrant mrna molecules will result, and ribosomes will try to translate them into peptides based on this wrong information. proteins produced in this way may be called "junk" proteins, for they often have little or no function. even a protein that is correctly translated and moved to its expected cellular location can still be considered a "junk" protein if it has lost its function.
one good example is the abcc transporter protein of dry-type cerumen (earwax): one nonsynonymous substitution in this gene made the protein essentially nonfunctional [ ].

there are various genomic features specific to eukaryotes other than the existence of introns and junk dna, such as genome duplication, rna editing, the c-value paradox, and the relationship between genome size and mutation rate. we will briefly discuss them in this section.

the most dramatic and influential change of genome structure is genome duplication. genome duplication is also called polyploidization, but that term is tightly linked to karyotypes, or chromosome constellations. prokaryotes are so far not known to have experienced genome duplications, which appear to be restricted to eukaryotes. interestingly, genome duplications are quite frequent in plants, while they are relatively rare in the other two multicellular eukaryotic lineages. an ancient genome duplication was found through the genome analysis of baker's yeast [ ], and rhizopus oryzae, a basal lineage fungus, was also found to have experienced a genome duplication [ ]. among protists, paramecium tetraurelia is known to have experienced at least three genome duplications [ ]. because we humans are vertebrates, and two rounds of genome duplication occurred in the common ancestor of vertebrates (see chap. ), we may be inclined to think that genome duplications happen often in animals. this is not the case. so far, only vertebrates and some insects are known to have experienced genome duplications. the reason for this scattered distribution of genome duplications is not known.

if we plot the distribution of the number of synonymous substitutions between duplogs within one genome, it is possible to detect a relatively recent genome duplication. this is because all genes duplicate at once when a genome duplication occurs, while only a small number of genes duplicate in other modes of gene duplication (see chap. ). figure .
shows a schematic view of the two cases: with and without genome duplication. lynch and conery ( ; [ ]) applied this method to various genome sequences and found that the arabidopsis thaliana genome showed a clear peak indicative of a relatively recent genome duplication, while the genome sequences of the nematode caenorhabditis elegans and the yeast saccharomyces cerevisiae showed curves of exponential decay. it is interesting to note that genome duplication was not known for arabidopsis thaliana before its genome sequence was determined, while the genome of saccharomyces cerevisiae was later shown to have been duplicated a long time ago [ ]. when a genome duplication occurred in the distant past, the number of synonymous substitutions may be saturated and cannot give an appropriate result. in this case, the number of amino acid substitutions may be used instead, even though proteins vary in their rates of amino acid substitution. in any case, accumulation of mutations will eventually make two homologous genes dissimilar to each other. therefore, although the possibility of genome duplications in prokaryotes has so far been rejected [ ], it is not possible to infer events of the remote past simply by searching for sequence similarity. we should be careful before reaching a final conclusion.

modification of particular rna molecules after they are produced by transcription is called rna editing. all three major classes of rna molecules (mrna, trna, and rrna) may experience editing [ ]. there are various patterns of rna editing; substitutions, in particular between c and u, and insertions and deletions, particularly of u, are mainly found in eukaryote genomes. guide rna molecules, which specify the location of editing, are used in one of the main rna editing mechanisms, but there are other mechanisms as well [ ]. it is not clear how the rna editing mechanism evolved. tillich et al.
[ ] studied chloroplast rna editing and concluded that many nucleotide sites of the chloroplast genome suddenly started to undergo rna editing, but that the number of edited sites later decreased steadily through mutational changes. they claimed that rna editing was not involved in gene expression. this result does not give rna editing a positive significance. because there are many types of rna molecules inside a cell, there also exist many sorts of enzymes that modify rnas. it is possible that some of them suddenly started to edit rnas because of a particular mutation. rna editing that did not cause deleterious effects may have survived by chance in the initial phase. this view suggests the involvement of a neutral evolutionary process in the evolution of rna editing.

organisms with complex metabolic pathways have many genes; multicellular organisms are such examples. generally speaking, their genome sizes are expected to be large. in contrast, viruses, whose genomes contain only a handful of genes, have small genome sizes. their possibilities for genome evolution are therefore rather limited. even if amino acid sequences change rapidly because of high mutation rates, protein function may not change. unless the gene number and genome size increase, viruses cannot evolve their genome structures. it is thus clear that the increase of genome size is crucial for producing the diversity of organisms. however, genomes often contain dna regions that are dispensable, and organisms with large genomes have many such junk dna regions. because of their existence, genome size and gene number are not necessarily highly correlated. this phenomenon was historically called the c-value paradox (e.g., [ ]): at around (e.g., [ - ]), the haploid dna amount was found to be constant within a species, yet to vary considerably among species.
"c-value" is the amount of haploid dna, and c probably stands for "constant" or "chromosomes." we now know that the majority of eukaryote genome dna is junk, and there is no longer a paradox in c-values among species.

]) found conserved noncoding dna sequences from insects, nematodes, and yeasts by comparing closely related species. we will discuss conserved noncoding sequences (cnss) of vertebrates further in chap. . as for plants, kaplinsky [ ] ) compared genome sequences of arabidopsis, grape, rice, and brachypodium and found > times more cnss in monocots than in dicots. hettiarachchi and saitou [ ] compared genome sequences of plant species and searched for lineage-specific cnss. they found and cnss shared by all vascular plants and by angiosperms, respectively, and also confirmed that monocot cnss are much more abundant than those of dicots.

what kind of relationship exists between genome size and mutation rate? if all the genetic information contained in the genome of an organism is necessary for its survival, an individual will die even if only one gene loses its function by mutation. an organism with a small genome and hence a small number of genes, such as a virus, can survive even if the mutation rate is high. in contrast, organisms with many genes may not be able to survive if highly deleterious mutations happen often; such organisms must therefore reduce the mutation rate. however, when lynch ( ; [ ]) compared the per-generation rate of nucleotide substitution mutations with whole-genome size, he found a positive correlation. more recently, lynch ( ; [ ]) admitted that for organisms with small genomes, these two values are in fact negatively correlated, while a positive correlation is observed when eukaryotes with large genomes are compared. we have to be careful when we discuss these two contradictory reports.
one measured the rate per physical year, while the other used one generation as the unit. another difference is whether only the protein coding regions or the whole genome is used as the genome size. the relationship between mutation rate and genome size is not simple. drake et al. ( ; [ ]) examined this problem and found that the mutation rate per genome per replication was approximately / for bacteria, while the mutation rates of multicellular eukaryotes vary between . and per genome per sexual or individual generation. table . lists the mutation rates and genome sizes of various organisms. apparently there is no clear tendency.

in this section we will discuss the genomes of three multicellular lineages of eukaryotes: plants, fungi, and animals. unfortunately, there seems to be no feature common to the genomes of multicellular organisms, so each lineage is discussed independently.

arabidopsis thaliana was the first plant species whose -mb genome was determined, in [ ]. a. thaliana is a model organism for flowering plants (angiosperms), with only a -month generation time. in spite of its small genome size, only % of that of the human genome, it has , protein coding genes. the genome sequence of a closely related species, a. lyrata, was also recently determined [ ]. angiosperms are divided into monocots and dicots. a. thaliana is a dicot, and genome sequences of six more species were determined as of december (see table . ). rice, oryza sativa, is a monocot, and its genome size, ~ mb, is much smaller than that of the wheat genome. the genomes of its japonica and indica subspecies were determined ([ ] and [ ]), and the origin of rice domestication is currently highly controversial, in particular whether it involved a single or multiple domestication events (e.g., [ , ]). the number of protein coding genes in the rice genome is , ~ , [ ]. wheat corresponds to the genus triticum, which contains many species.
the typical bread wheat is triticum aestivum, a hexaploid with ( × ) chromosomes. its genome arrangement is conventionally written as aabbdd [ ]. because it now behaves as a diploid, genomic sequencing of chromosomes (a -a , b -b , and d -d ) is under way (see http://www.wheatgenome.org/ for the current status). the hexaploid genome emerged by hybridization of the tetraploid (aabb) cultivated species t. durum and the diploid (dd) wild species aegilops tauschii [ ], followed by a genome duplication.

non-seed land plants are ferns, lycophytes, and bryophytes, in order of closeness to seed plants (e.g., [ ]). a draft genome sequence of a moss, physcomitrella patens, was reported in [ ], followed by the genome sequence of a lycophyte, selaginella moellendorffii, in [ ]. these genome sequences from different plant lineages are helping to decipher the stepwise evolution of land plants.

the genome sequence of baker's yeast (saccharomyces cerevisiae) was determined in , the first for a eukaryotic organism [ ]. there are chromosomes in s. cerevisiae, and its genome size is about mb. there are a total of , genes in its genome: , orfs and , other genes. the genome-wide gc content is %, slightly lower than that of the human genome. the proportion of introns is very small compared to that of the human genome, and the average length of an intron is only bp, in contrast to the , -bp average length of exons [ ]. as we already discussed, the ancestral genome of baker's yeast experienced a genome-wide duplication [ ]. pseudogenes, which are common in vertebrate genomes, are rather rare in the genome of baker's yeast; they constitute only % of the protein coding genes [ ]. baker's yeast is often considered a model organism for all eukaryotes; however, its genome may not be typical of eukaryote genomes.
as of december , genome sequences of more than fungal species are available (see the ncbi genome list at http://www.ncbi.nlm.nih.gov/genome/browse/ for the present situation). figure . shows the relationship between genome size and gene number for genomes. there is a clear positive correlation between them. however, there are some outliers. the perigord black truffle (tuber melanosporum), shown as a in fig. . , has the largest genome (~ mb) among the fungal species whose genome sequences have been determined so far, yet its number of genes is only ~ , [ ]. three other outlier species are postia placenta, ajellomyces dermatitidis, and melampsora larici-populina, shown as b, c, and d in fig. . , respectively. interestingly, these four outlier species are not clustered phylogenetically; two belong to pezizomycotina of ascomycota, and the other two belong to agaricomycotina and pucciniomycotina of basidiomycota. if we exclude these four outlier species, a good linear regression is obtained, as shown in fig. . . this straight line indicates that, on average, one gene corresponds to . kb of a typical fungal genome. if we apply this average to the truffle genome, its genome size should be ~ mb, but the real size is mb larger. this suggests that there is an unusually large amount of junk dna in this genome. in fact, % of its genome consists of transposable elements [ ]. the truffle genome must still have % more junk dna. gain and loss of genes in each branch of the phylogenetic tree for fungal species are shown in fig. . (based on [ ]). it will be interesting to examine the genome sizes of species related to the perigord black truffle, so as to infer the evolutionary period when the genome size expansion occurred.

[figure caption: the relationship between the genome size and gene numbers among fungal genomes]

system that is responsible for this is hox genes. we thus first discuss this gene system in this subsection. the genome of c.
elegans, the first animal genome to be determined, will be discussed next, followed by the genomes of insects and those of deuterostomes. because the genomes of many vertebrate species have been determined, we discuss them in chap. , and the human genome in particular in chap. .

hox genes were initially found by edward b. lewis through studies of homeotic mutations that dramatically change the segmental structure of drosophila [ ]. they code for transcription factors, and their dna-binding peptide, now called the homeobox domain, was later found in almost all animal phyla [ ]. figure . shows the hox gene clusters found in animal groups. there are four hox clusters in mammalian and avian genomes, and they were most probably generated by the two rounds of genome duplication in the common ancestor of vertebrates (see chap. ). interestingly, the physical order of hox genes on chromosomes corresponds to the order of their expression during development, a relationship called "collinearity" [ ]. this suggests that some sort of cis-regulation is operating in hox gene clusters; in fact, many long transcripts are found, and some of their transcription start sites are highly conserved among vertebrates [ ]. figure . shows highly conserved

the hox genes control the expression of different groups of downstream genes, such as transcription factors, elements of signaling pathways, or genes with basic cellular functions. hox gene products interact with other proteins, in particular in signaling pathways, and contribute to the modification of homologous structures and the creation of new morphological structures [ ].

there are other gene families that are thought to be involved in the diversification of animal body plans. one of them is the zic gene family [ ]. the zic gene family exists in many animal phyla with high amino acid sequence homology in a zinc-finger domain called zf, and members of this gene family are involved in neural and neural crest development, skeletal patterning, and left-right axis establishment.
this gene family has two additional domains, zoc and zf-bc. interestingly, cnidaria, platyhelminthes, and urochordata lack the zoc domain, and their zf-bc domain sequences are quite diverged compared to those of arthropoda, mollusca, annelida, echinodermata, and chordata. this distribution suggests that zic family genes with the entire set of three conserved domains already existed in the common ancestor of bilaterian animals, and that some of the domains may have been lost in parallel in platyhelminthes, nematodes, and urochordates [ ]. interestingly, the phyla that lost the zoc domain have quite distinct body plans, although they are bilaterians.

caenorhabditis elegans was the first animal species whose -mb draft genome sequence was determined, in [ ]. this organism belongs to the phylum nematoda, which includes a vast number of species [ ]. brenner ( ; [ ]) chose this species as a model organism for studying the neuronal system because of its short generation time (~ days) and small size (~ mm). the rest of this section is based on the information given in the online "wormbook" [ ]. there are , protein coding genes in c. elegans, including , alternatively spliced forms, with % confirmed to be transcribed at least partially. the number of trna genes is , and are located on the x chromosome. the three kinds of rrna genes ( s, . s, and s) are located on chromosome i in - tandem repeats, while ~ s rrna genes are also in tandem form but are located on chromosome v. the average protein coding gene length is kb, with an average of . coding exons per gene. in total, protein coding exons constitute . % of the whole genome. figure . shows the distribution of the protein coding genes, and fig. . the distribution of exon numbers per gene. both distributions have long tails. the median sizes of exons and introns are bp and bp, respectively. the introns of c. elegans are quite short compared to those of vertebrate genes (see chap. ).
the density of protein coding genes varies among chromosomes: it is slightly higher on the five autosomes than on the x chromosome, and higher in the central region of a chromosome than at its ends. processed (i.e., intronless) pseudogenes are rare, and a total of pseudogenes were reported in wormbase version ws . about half of them are homologous to functional chemoreceptor genes. genome sequences of four congeneric species of c. elegans (c. brenneri, c. briggsae, c. japonica, and c. remanei) have been determined (http://www.ncbi.nlm.nih.gov/genome/browse/).

the fruit fly drosophila melanogaster was used by thomas hunt morgan's group in the early twentieth century and has since been used in many genetic studies. because of this importance, its genome was the first among arthropods to be sequenced, in [ ]. heterochromatin regions of ~ mb were excluded from sequencing, [ ]. their genome sizes vary from to mb, and the number of genes is , - , . interestingly, d. melanogaster has the largest genome size and the smallest number of genes. a total of insect species other than drosophila species were sequenced by the end of [ ]. as of december , their genome sizes range from mb to mb, a more than fivefold difference, and their gene numbers range from , to , .

deuterostomes contain five phyla: echinodermata, hemichordata, chaetognatha, xenoturbellida, and chordata. the genome of the sea urchin strongylocentrotus purpuratus [ ] was determined in . its genome size is mb, with , genes. the genomes of two other echinoderms, the sea urchin lytechinus variegatus and the sea star patiria miniata, are also being sequenced, as is that of the hemichordate saccoglossus kowalevskii. chordata is classified into urochordata (ascidians), cephalochordata (lancelets or amphioxus), and vertebrata (vertebrates). because we will discuss the genomes of vertebrates in chap. , let us discuss only the genomes of ascidians and lancelets here.
the genome of the ascidian ciona intestinalis was determined in [ ], and the genome sequence of its congeneric species, c. savignyi, was determined three years later [ ]. the genome size of c. intestinalis is ~ mb, with ~ , genes. interestingly, it contains a group of genes for cellulose-synthesizing enzymes, which were probably introduced from some bacterial genome via horizontal gene transfer [ , ]. the c. intestinalis genome also contains several genes that are considered to be important for heart development [ ], which suggests that the hearts of ascidians and vertebrates may be homologous. through the superimposition of phylogenetic trees (see chapter a ) for five genes coding for muscle proteins, oota and saitou [ ] estimated that vertebrate heart muscle is phylogenetically closer to vertebrate skeletal muscle. if both results are true, the muscles used in the heart might have been replaced in the vertebrate lineage. the genome sequence of an amphioxus (the cephalochordate branchiostoma floridae) was determined in by holland et al. ( ; [ ]), and it provides good outgroup sequence data for vertebrates.

eukaryotic viruses rely on their eukaryotic host species for most metabolic pathways. therefore, the number of genes in virus genomes is usually very small. for example, influenza a virus has rna fragments coding for protein genes, and its total genome size is ~ . kb. as in bacteriophages, there are both dna and rna genomes in eukaryotic viruses. table . shows one example of a classification of eukaryotic viruses based on their genome structure [ ]. genomes of double-strand dna viruses are of four types: circular, simple linear, linear with proteins covalently attached to both ends, and linear with both ends closed. genomes of single-strand dna viruses are either circular or linear. genomes of rna viruses are all linear, in both the single- and double-strand types.
those of single-strand rna genomes are classifi ed into two types: plus strand and minus strand. a subset of single-plus strand rna genome type is experiencing [ ] . megavirus is phylogenetically close to mimivirus [ ] , a member of nucleoplasmic large dna viruses, including pox virus. recently, a larger genome size virus, pandoravirus, with more than . -mb genome, was discovered [ ] . the phylogenetic status of these large genome size dna viruses is unknown at this moment. analysis of the genome sequence of the fl owering plant arabidopsis thaliana the genome of the cucumber, cucumis sativu s l draft genome sequence of the oilseed species ricinus communis the genome of black cottonwood, populus trichocarpa the grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla genome sequence of foxtail millet ( setaria italica ) provides insights into grass evolution and biofuel potential a new database (gcd) on genome composition for eukaryote and prokaryote genome sequences and their initial analyses the genome sequence of rickettsia prowazekii and the origin of mitochondria the hydrogen hypothesis for the fi rst eukaryote mitochondrial genome the complete mitochondrial genome of dugesia japonica (platyhelminthes; order tricladida) the complete nucleotide sequence of the tobacco mitochondrial genome: comparative analysis of mitochondrial genomes in higher plants and multipartite organization widespread horizontal transfer of mitochondrial genes in fl owering plants determination of the melon chloroplast and mitochondrial genome sequences reveals that the largest reported mitochondrial genome in plants contains a significant amount of dna having a nuclear origin small, repetitive dnas contribute signifi cantly to the expanded mitochondrial genome of cucumber the complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression changes in the structure of dna molecules and the amount of dna per plastid during 
chloroplast development in maize pattern of organization of human mitochondrial pseudogenes in the nuclear genome why genes in pieces? introns. in encyclopedia of evolution. tokyo: kyoritsu shuppan comprehensive splice-site analysis using comparative genomics the ever-growing world of small nuclear ribonucleoproteins intron phylogeny: a new hypothesis trnomics: analysis of trna genes from genomes of eukarya, archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features the origin of introns and their role in eukaryogenesis: a compromise solution to the introns-early versus introns-late debate? the evolution of spliceosomal introns: patterns, puzzles and progress genes in pieces: were they ever together? nuclear volume control by nucleoskeletal dna, selection for cell volume and cell growth rate, and the solution of the dna c-value paradox the recent origins of spliceosomal introns revisited correlation of dna exonic regions with protein structural units in haemoglobin remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution new maximum likelihood estimators for eukaryotic intron evolution analysis of ribosomal protein gene structures: implications for intron evolution intron dynamics in ribosomal protein genes so much "junk" dna in our genome a fundamental division in the alu family of repeated sequences whole-genome analysis of alu repeat elements reveals complex evolutionary history characterization of highly repetitive sequences of arabidopsis thaliana centromeric repetitive sequences in arabidopsis thaliana sequence definition and organization of a human repeated dna empirical analysis of transcriptional activity in the arabidopsis genome identification and analysis of functional elements in % of the human genome by the encode pilot project most "dark matter" transcripts are associated with known genes a snp in the abcc gene is the determinant of human 
earwax type molecular evidence for an ancient duplication of the entire yeast genome genomic analysis of the basal lineage fungus rhizopus oryzae reveals a whole-genome duplication global trends of whole-genome duplications revealed by the ciliate paramecium tetraurelia size of the protein-coding genome and rate of molecular evolution the evolutionary fate and consequences of duplicated genes comparative genomics in prokaryotes functions and mechanisms of rna editing the evolution of chloroplast rna editing chromosome structure and the c-value paradox la teneur du noyau cellulaire en acide désoxyribonucléique à travers les organes, les individus et les espèces animales (in french) nucleoprotein determination in cytological preparations the constancy of deoxyribose nucleic acid in plant nuclei conserved linkage between the puffer fish (fugu rubripes) and human genes for platelet-derived growth factor receptor and macrophage colony-stimulating factor receptor conserved noncoding sequences are reliable guides to regulatory elements enrichment of regulatory signals in conserved non-coding genomic sequence evolution at two levels: on genes and form evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes utility and distribution of conserved noncoding sequences in the grasses conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution conserved noncoding sequences in the grasses arabidopsis intragenomic conserved noncoding sequence the banana (musa acuminata) genome and the evolution of monocotyledonous plants computational analysis and characterization of uce-like elements (ules) in plant genomes identification and analysis of conserved noncoding sequences in plants viral mutation rates the origins of eukaryotic gene structure evolution of the mutation rate rates of spontaneous mutation analysis of the genome sequence of the flowering plant arabidopsis 
thaliana the arabidopsis lyrata genome sequence and the basis of rapid genome size change a draft sequence of the rice genome (oryza sativa l. ssp. japonica) a draft sequence of the rice genome phylogeography of asian wild rice, oryza rufipogon, reveals multiple independent domestications of cultivated rice, oryza sativa independent domestication of asian rice followed by gene flow from japonica to indica curated genome annotation of oryza sativa ssp. japonica and comparative genome analysis with arabidopsis thaliana multigene phylogeny of land plants with special reference to bryophytes and the earliest land plants the physcomitrella genome reveals evolutionary insights into the conquest of land by plants the selaginella genome identifies genetic changes associated with the evolution of vascular plants overview of the yeast genome origin of genome architecture perigord black truffle genome uncovers evolutionary origins and mechanisms of symbiosis master control genes in development and evolution: the homeobox story from dna to diversity evolution of conserved non-coding sequences within the vertebrate hox clusters through the two-round whole genome duplications revealed by phylogenetic footprinting analysis wormbook - the online review of c. elegans biology function and specificity of hox genes a wide-range phylogenetic analysis of zic proteins: implications for correlations between protein structure conservation and body plan complexity genome sequence of the nematode c. 
elegans: a platform for investigating biology an improved molecular phylogeny of the nematoda with special emphasis on marine taxa the genetics of caenorhabditis elegans the genome sequence of drosophila melanogaster evolution of genes and genomes on the drosophila phylogeny the genome of the sea urchin strongylocentrotus purpuratus the draft genome of ciona intestinalis: insights into chordate and vertebrate origins assembly of polymorphic genomes: algorithms and application to ciona savignyi a functional cellulose synthase from ascidian epidermis phylogenetic relationship of muscle tissues deduced from superimposition of gene trees genome science and microorganismal molecular genetics distant mimivirus relative with a larger genome highlights the fundamental features of megaviridae the . -megabase sequence of mimivirus ultraconserved elements in the human genome genomu shinkagaku nyumon (written in japanese, meaning 'introduction to evolutionary genomics') the amphioxus genome illuminates vertebrate origins and cephalochordate biology pandoraviruses: amoeba viruses with genomes up to . mb reaching that of parasitic eukaryotes
key: cord- - pvln x authors: asbury, thomas m; mitman, matt; tang, jijun; zheng, w jim title: genome d: a viewer-model framework for integrating and visualizing multi-scale epigenomic information within a three-dimensional genome date: - - journal: bmc bioinformatics doi: . / - - - sha: doc_id: cord_uid: pvln x
background: new technologies are enabling the measurement of many types of genomic and epigenomic information at scales ranging from the atomic to nuclear. much of this new data is increasingly structural in nature, and is often difficult to coordinate with other data sets. there is a legitimate need for integrating and visualizing these disparate data sets to reveal structural relationships not apparent when looking at these data in isolation. 
results: we have applied object-oriented technology to develop a downloadable visualization tool, genome d, for integrating and displaying epigenomic data within a prescribed three-dimensional physical model of the human genome. to integrate and visualize large volumes of data, novel statistical and mathematical approaches have been developed to reduce the size of the data. to our knowledge, this is the first tool that can visualize the human genome in three dimensions. we describe here the major features of genome d and discuss our multi-scale data framework using a representative basic physical model. we then demonstrate many of the issues and benefits of multi-resolution data integration.
conclusions: genome d is a software visualization tool that explores a wide range of structural genomic and epigenetic data. data from various sources of differing scales can be integrated within a hierarchical framework that is easily adapted to new developments concerning the structure of the physical genome. in addition, our tool has a simple annotation mechanism to incorporate non-structural information. genome d is unique in its ability to manipulate large amounts of multi-resolution data from diverse sources to uncover complex and new structural relationships within the genome.
background
a significant portion of the genomic data that is currently being generated extends beyond traditional primary sequence information. genome-wide epigenetic characteristics such as dna and histone modifications and nucleosome distributions, together with structural insights into transcription and replication centers, are rapidly changing the way the genome is understood. indeed, these new data from high-throughput sources often demonstrate that much of the genome's functional landscape resides in extra-sequential properties. 
with this influx of new detail about the higher-level structure and dynamics of the genome, new techniques will be required to visualize and model the full extent of genomic interactions and function. genome browsers, such as the ucsc genome database browser [ ], are specifically aimed at viewing primary sequence information. although supplemental information can easily be annotated via new tracks, representing structural hierarchies and interactions is quite difficult, particularly across non-contiguous genomic segments [ ]. in addition, in spite of the many recent efforts to measure and model the genome structure at various resolutions and levels of detail [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ], little work has focused on combining these models into a plausible aggregate, or has taken advantage of the large amount of genomic and epigenomic data available from new high-throughput approaches. to address these issues, we have created an interactive d viewer, genome d, to enable integration and visualization of genomic and epigenomic data. the viewer is designed to display data from multiple scales and uses a hierarchical model of the relative positions of all nucleotide atoms in the cell nucleus, i.e., the complete physical genome. our model framework is flexible and can be adapted to handle new, more precise structural information as details emerge about the genome's physical arrangement. the large amounts of data generated by high-throughput or whole-genome experiments raise issues of scale, storage, interactivity and abstraction. novel methods will be required to extract useful knowledge. genome d is an early step toward such new approaches.
genome d is a gui-based c++ program which runs on windows (xp or later) platforms. its software architecture is based on the model-view-controller pattern [ ]. genome d is a viewer application for exploring an underlying physical model; it displays selections and annotations based on its current user settings. 
to support multiple resolutions and maintain a high level of interactivity, the model is designed using an object-oriented, hierarchical data architecture [ ]. genome d loads the model incrementally as needed to support user requests. once a model is loaded, genome d supports ucsc genome browser track annotations in the bed and wig formats [ ]. at the highest level of detail, a model of the physical genome requires a d position (x, y, z) for each atom of every base pair (bp) of the genome. the large amount of such data ( × bp × atoms/bp × positions × bytes ~ gigabytes for humans) is reduced by exploiting the data's hierarchical organization. we store three scales of data for each chromosome in compressed xml format. atomic positions are computed on demand and not saved. this technique reduces the storage size for a human genome to ~ . gigabytes, a more than × saving. there are several sample models available for download from the genome d project homepage. more information on our representative model and its data format can be found in additional file .
the range of scales and spatial organizations of dna within the human cell presents many visualization challenges. to meet these challenges, genome d manipulates and displays genomic data at multiple resolutions. figure shows several screen captures of the genome d application at various levels of detail. genome d allows the user to specify the degree of detail to view, and the corresponding data is loaded dynamically. because of the large amount of data and the limited memory that is available, only portions of the data can typically be viewed at high resolution. the interactivity of genome d facilitates exploring the model to find areas of interest. additionally, the user can configure various display parameters (such as color and shape) to highlight significant structural relationships. genome d features include:
• display of genomic data from nuclear to atomic scale. 
genome d has multiple windows to visualize the physical genome model from different viewpoints and scales simultaneously. the model resolution of the current viewing window is set by the user, and its viewing camera is controlled by the mouse. resolutions and viewpoints depend on the type of data that is being visualized.
• a fully interactive point-and-select d environment. the user can navigate to an arbitrary region of interest by selecting a low resolution region and then loading the corresponding higher resolution data, which appears in another viewing window.
• loading of multiple resolution user-created models with an open xml format. the genome d application adheres to the model-view-controller software design pattern [ ]. the viewing software is completely separated from the multiscale model that is being viewed. we have chosen a simple open format for each resolution of the model, and users can easily add their own models.
• image capture and povray/pdb model export support. genome d supports screen capture of the current display image to jpg format. for high-quality renders, it can export the current model and view in povray model format [ ] for off-line, print-quality rendering. in addition, atomic positions of selected dna can be saved to a pdb format file for downstream analysis.
• incorporation and user-defined visualization of ucsc annotation tracks onto the physical model. the ucsc genome database browser has a variety of epigenetic information that can be exported directly from its web site [ ]. this data can be loaded into genome d and displayed on the currently loaded genome model.
we now give a few examples of applying biological information to a model and suggest possible methods of inferring unique structural relationships at various resolutions. one of the advantages of a multi-scale model is the ability to integrate data from various sources, and perhaps gain insight into higher level relationships or organizations. 
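as an illustration of the annotation step described above, a bed track exported from the ucsc browser could be read into simple interval records before being mapped onto a model. the following is a minimal sketch in python, not the genome d loader: the `BedInterval` type and the `parse_bed` function are hypothetical names introduced here, and only the three required bed columns (chrom, chromStart, chromEnd, with 0-based half-open coordinates) plus the optional name and score columns are handled.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class BedInterval:
    """One BED record (hypothetical helper type, not part of any tool)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: Optional[str] = None
    score: Optional[float] = None

    @property
    def length(self) -> int:
        # Half-open coordinates make the length a plain difference.
        return self.end - self.start


def parse_bed(lines: Iterable[str]) -> List[BedInterval]:
    """Parse BED text lines, skipping blanks, comments and header lines."""
    intervals: List[BedInterval] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 columns: {line!r}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"invalid interval in BED line: {line!r}")
        name = fields[3] if len(fields) > 3 else None
        # Some files use "." as a placeholder for a missing score.
        has_score = len(fields) > 4 and fields[4] != "."
        score = float(fields[4]) if has_score else None
        intervals.append(BedInterval(chrom, start, end, name, score))
    return intervals
```

each returned record carries only sequence coordinates; placing it on a three-dimensional model would still require the model's own mapping from base-pair position to spatial position, which is not sketched here.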
we choose to concentrate on high-throughput data sets that are becoming commonplace in current research: genome-wide nucleosome positions, snps, histone methylations and gene expression profiles. the sample images, which can be visualized in genome d, were exported and rendered in povray [ ]. the impact of nucleosome position on gene regulation is well known [ ]. in addition to nucleosome restructuring/modification [ ], the rotation and phasing information of the dna sequence may also play a significant role in gene regulation [ ], particularly within non-coding regions. figures a, b show a non-coding nucleosome with multiple snps, using genome-wide histone positioning data [ ] combined with a snp dataset [ ]. the view highlights one of the advantages of three-dimensional genomic data by clearly showing the phasing of the snps relative to the histone. observations of this type and of more complicated structural relationships may provide insights for further analysis, and such hidden three-dimensional structure is perhaps best explored with the human eye using a physical model.
figure : two examples of nucleosome epigenomic variation. (a) top view of snp variants rs , rs , rs , and rs (numbered - respectively) within a non-coding histone of chromosome : - . the histone position was obtained from [ ]; the snps were taken from a recent study examining variants associated with hdl cholesterol [ ]. such images may reveal structural relationships between non-coding region snps and histone phasing. (b) side view of a. (c) a series of histone trimethylations within encode region enr on chromosome : - [ ]. the histone bp positions are from [ ]. each histone protein is shown as an approximate cylinder wedge: h a (yellow), h b (red), h (blue), h (green). the cα backbones of the h and h n-terminal tails are modeled using the crystal structure of the ncp (pdb a i) [ ]. the bright yellow spheres indicate h k me and h k me , and the orange spheres are h k me , h k me and h k me . 
another important source of epigenomic information is histone modification. genome-wide histone modifications are being studied through a combination of dna microarrays and chromatin immunoprecipitation (chip-chip assays) [ ]. histone methylations have important implications for gene regulation, and methylations have been shown to serve as binding platforms for transcription machinery. the encode initiative [ ] is creating high-resolution epigenetic information for ~ % of the human genome. although such modifications occur on histone proteins, current approaches to mapping and visualizing this information are limited to sequence coordinates in the genome. our physical genome model visualizes methylation of histone proteins at atomic detail as determined by crystal structure. figure c shows histone methylations for several histones within an encode region. an integrated physical genome model can show the interplay between histone modifications and other genomic data, such as snps, dna methylation, gene structure, promoters and transcription machinery. in addition to epigenomic data, the physical genome model also provides a platform to visualize high-throughput gene expression data and its interplay with global binding information of transcription factors. we consider a sample analysis of transcription factor p . genome-wide binding sites of p proteins [ ] can be combined with the gene expression results from a study investigating the dosing effect of p [ ]. this may identify genes that have p binding sites in their promoter regions and are responsive to the dosing effect of p protein. such large-scale microarray expression data is often displayed in a two-dimensional array format, emphasizing shared expression between genes, while p binding data are stored in tabular form. 
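the combination described above, finding genes that have a binding site in their promoter region and that also respond to the dosing effect, amounts to an interval overlap followed by an expression filter. the sketch below is a minimal python illustration under stated assumptions: the `Gene` type, the function names, the promoter window of 1000 bp upstream of the transcription start site and the log2 fold-change threshold of 1.0 are all hypothetical choices made for this example and are not taken from the cited studies.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Gene:
    """Minimal gene record (hypothetical, for illustration only)."""
    name: str
    chrom: str
    tss: int     # transcription start site, 0-based position
    strand: str  # "+" or "-"


def promoter_window(gene: Gene, upstream: int = 1000) -> Tuple[int, int]:
    """Half-open [start, end) interval covering the TSS and the bases upstream."""
    if gene.strand == "+":
        start, end = gene.tss - upstream, gene.tss + 1
    else:
        # On the minus strand, "upstream" means higher coordinates.
        start, end = gene.tss, gene.tss + upstream + 1
    return max(start, 0), end


def responsive_bound_genes(
    genes: Sequence[Gene],
    binding_sites: Sequence[Tuple[str, int, int]],  # (chrom, start, end), half-open
    log2_fold_change: Dict[str, float],
    min_abs_change: float = 1.0,  # assumed threshold
    upstream: int = 1000,         # assumed promoter size
) -> List[str]:
    """Names of genes with a promoter binding site and a large expression change."""
    hits: List[str] = []
    for gene in genes:
        p_start, p_end = promoter_window(gene, upstream)
        bound = any(
            chrom == gene.chrom and s < p_end and e > p_start
            for chrom, s, e in binding_sites
        )
        change = log2_fold_change.get(gene.name)
        if bound and change is not None and abs(change) >= min_abs_change:
            hits.append(gene.name)
    return hits
```

the linear scan over binding sites is adequate for small examples; a genome-wide analysis would normally sort the sites or use an interval index instead.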
with a physical model, expression levels of genes in response to p level can be mapped to genome positions together with global p binding information, revealing any structural bias of the expression. figure shows this type of physical genome annotation. drawing inferences from coupling averaged or "snap-shot" expression data with the dynamic architecture of the genome may be helpful in determining structural dependencies in expression patterns.
to illustrate the capability of genome d to integrate and examine data of appropriate scales, we constructed an elementary model of the physical genome (see additional file for details). this basic model is approximate, since the precise structure of the physical genome is largely unknown at present. however, the model's inaccuracies are secondary to its multi-scale approach, which provides a framework to improve and refine the model. current technologies are making significant progress toward capturing chromosome conformation within the nucleus at various scales [ , ]. because our multi-scale model is purely descriptive beyond the ncp scale, it can easily incorporate more accurate structural folding information, such as the 'fractal globule' behaviour [ ]. the genome d viewer, decoupled from the genome model, can be used to view any model that uses our model framework. building a d model of a complete physical genome is a non-trivial task. the structure and organization at a physical level is dynamic and heavily influenced by local and global constraints. a typical experiment may provide new data at a specific resolution or for a portion of the genome, and integrating these data with other information to flesh out a multi-resolution model is challenging. for example, an experiment may measure local chromatin structure around a transcription site. this structure can be expressed as a collection of dna strands, ncps, and perhaps lower resolution nm chromatin fibers. 
our data formats are flexible enough to allow partial integration of this information when the larger global structure is undetermined or is inferred from more global stochastic measurements in other experiments. combining such data across resolutions is often difficult, but establishing data formats and visualization tools provides a framework that may simplify the integration process. recent advances in determining chromosome folding principles [ ] highlight the need for new visualization methods. more detailed three-dimensional genomic models will help in discovering and characterizing epigenetic processes.
we have created a multi-scale genomic viewer, genome d, to display and investigate genomic and epigenomic information in a three-dimensional representation of the physical genome. the viewer software and its underlying data architecture are designed to handle the visualization and integration issues that arise when dealing with large amounts of data at multiple resolutions. our data structures can easily accommodate new advances in chromosome folding and organization. a common framework of established scales and formats could vastly improve multi-scale data integration and the ability to infer previously unknown relationships within the composite data. our model architecture defines clear demarcations between four scales (nuclear, fiber, nucleosome and dna), which facilitates data integration in a consistent and well-behaved manner. as more data become available, the ability to model, characterize, visualize, and perhaps most crucially, integrate information at many scales is necessary to achieve a fuller understanding of the human genome.
software development, and wjz oversaw the whole project. all authors read and approved the final manuscript. 
the ucsc genome browser database: update gene regulation in the third dimension polymer models for interphase chromosomes a randomwalk/ giant-loop model for interphase chromosomes a polymer, random walk model for the size-distribution of large dna fragments after high linear energy transfer radiation a chromatin folding model that incorporates linker variability generates fibers resembling the native structures capturing chromosome conformation modeling dna loops using the theory of elasticity computational modeling predicts the structure and dynamics of chromatin fiber multiscale modeling of nucleosome dynamics applications programming in smalltalk- : how to use model-view-controller (mvc) object-oriented biological system integration: a sars coronavirus example computer graphics: principles and practice persistence of vision pty. ltd., persistence of vision raytracer (version . ) cooperation between complexes that regulate chromatin structure and transcription the language of covalent histone modifications binding of nf to the mmtv promoter in nucleosomes: influence of rotational phasing, translational positioning and histone h dynamic regulation of nucleosome positioning in the human genome newly identified loci that influence lipid concentrations and risk of coronary artery disease genome-wide approaches to studying chromatin modifications a global map of p transcription-factor binding sites in the human genome gene expression profiling of isogenic cells with different tp gene dosage reveals numerous genes that are affected by tp dosage and identifies cspg as a direct target of p comprehensive mapping of long-range interactions reveals folding principles of the human genome organization of interphase chromatin the role of topological constraints in the kinetics of collapse of macromolecules the landscape of histone modifications across % of the human genome in five human cell lines crystal structure of the nucleosome core particle at . 
Å resolution
this work is partly supported by grants irg - - from the american cancer society, computational biology core of ul rr - , r gm - s , a pilot project and statistical core of grant p rr - , phrma foundation research starter grant, a pilot project from p rr to w.j.z, and nsf and r gm - s to jt. t.m.a. is supported by nlm training grant -t -lm - . the authors thank y. ruan for valuable discussion about the project, k. zhao and d.e. schones for providing nucleosome positioning data, m. boehnke for critical reading of the manuscript, and t. qin, l.c. tsoi, and k. sims for software testing. the high performance computing facility utilized in this project is supported by nih grants: r lm , p rr , t gm and t lm .
project name: genome d
project homepage: http://genomebioinfo.musc.edu/ genome d/index.html
operating system: windows-based operating systems (xp or later)
programming language: c++ and python
other requirements: opengl v . and glsl v . (may not be present on some older graphics adapters - see additional file )
any restrictions to use by non-academics: none
additional file : supplemental information. additional details about human physical genome model construction and the genome d software.
additional file : genome d v . readme. the readme file for genome d software.
authors' contributions: wjz conceived the initial concept of the project and developed the project with tma. tma developed the d genomic model and worked with mm to develop the genome d software. jt and wjz advised tma and mm on the
key: cord- -f ab authors: barr, j.n.; fearns, r. title: genetic instability of rna viruses date: - - journal: genome stability doi: . /b - - - - . 
- sha: doc_id: cord_uid: f ab despite having very limited coding capacity, rna viruses are able to withstand challenge of antiviral drugs, cause epidemics in previously exposed human populations, and, in some cases, infect multiple host species. they are able to achieve this by virtue of their ability to multiply very rapidly, coupled with their extraordinary degree of genetic heterogeneity. rna viruses exist not as single genotypes, but as a swarm of related variants, and this genomic diversity is an essential feature of their biology. rna viruses have a variety of mechanisms that act in combination to determine their genetic heterogeneity. these include polymerase fidelity, error-mitigation mechanisms, genomic recombination, and different modes of genome replication. rna viruses can vary in their ability to tolerate mutations, or “genetic robustness,” and several factors contribute to this. finally, there is evidence that some rna viruses exist close to a threshold where polymerase error rate has evolved to maximize the possible sequence space available, while avoiding the accumulation of a lethal load of deleterious mutations. we speculate that different viruses have evolved different error rates to complement the different “life-styles” they possess. viruses are enormously successful. they have been identified in organisms within all domains of life. despite decades of scientific effort to combat viruses that cause disease in humans and economically important crops and animals, there are relatively few cases in which we have succeeded. viruses have shown they are able to adapt and multiply to overcome almost any obstacle that is imposed on them. this remarkable adaptability can be attributed to their extremely high replication rate and their propensity for mutation. this is particularly true of the viruses that have rna genomes: the riboviruses and retroviruses. 
this chapter will focus on these rna viruses and on the exciting research that has provided valuable insight into how rna viruses benefit from their genetic variability. in the first two sections of the chapter, two fundamental concepts are introduced: the intimate relationship between rna viruses and their hosts, and the idea that viruses behave as quasispecies. having introduced these concepts, the remainder of the chapter discusses the viral and host mechanisms that govern rna virus genetic variability and the ability of viruses to withstand mutation. we then discuss evidence that at least some rna viruses have a replication fidelity that is poised to maximize genome sequence space without incurring catastrophic lethal mutations and describe how this can be exploited to control viral infections. throughout the chapter, we attempt to convey the diversity of rna virus biology and mutation frequency and we conclude by speculating that each rna virus has evolved an error rate that complements its genome replication strategy and mode of transmission.
rna viruses are very simple entities with small genomes that vary in length from about to kb, depending on the virus. thus, they have very limited coding capacity and so, like dna viruses, they are obligate intracellular parasites, depending on a host cell to provide energy-generating systems, ribo- and deoxyribonucleotides, cellular translation machinery, trnas and amino acids to translate their mrnas, cellular enzymes to posttranslationally modify their proteins, and cellular structures such as membranes, vesicular compartments, and/or cytoskeleton networks to act as a scaffold for assembling and transporting components required to make virus particles. there are many rna viruses and they vary enormously in their genome structures and mechanisms of replication. however, in its most distilled and generic form, the rna virus infection cycle consists of the steps shown in fig. 
first, a protein on the surface of the virus particle attaches to a receptor molecule on a host cell, enabling the viral genome to be delivered into the cell. the genome is expressed to produce viral proteins and replicated multiple times to produce progeny genomes. the progeny genomes are packaged with the proteins that make up the virus particle and are released to infect new cells. thus, viruses multiply by a process of genome replication, expression, and assembly, rather than division, and a cell infected by a single infectious virus particle could release thousands of progeny virions in a matter of hours. this enables viruses to multiply very rapidly and to achieve large population sizes. because viruses depend on a host cell to be able to replicate, their ability to multiply is heavily influenced by the biology of each cell that they encounter, such as the nature and density of surface molecules that can act as viral receptors, the cell's metabolic rate and availability of macromolecules, as well as the cell's innate antiviral defenses that have evolved to suppress viral replication. in addition to being able to replicate within a single cell type, most viruses require the capacity to replicate and spread within a multicellular host organism, which has tissues with varied cellular characteristics, physiological and anatomical constraints, and an adaptive immune response. while some viruses might only require the ability to infect one tissue to be successfully maintained in the environment, some viruses need to infect and multiply in different tissues to be spread to a new host and complete their transmission cycles. for example, measles virus initially infects alveolar macrophages and dendritic cells in the lung. it is then transferred to t- and b-lymphocytes and is amplified and spread systemically throughout the body. 
infected lymphocytes can then transfer the virus to the basolateral surface of lung epithelial cells by attaching to an epithelial cell receptor. the virus multiplies further in the lung epithelium and is spread to new hosts by coughing and sneezing [ , ] . thus, measles virus requires the ability to infect multiple cell types to complete its transmission cycle. viruses must also be capable of replicating within populations of hosts whose immune responses are shaped by different histories of virus exposure and some viruses even require the ability to replicate in different host species. for example, west nile virus transmission is dependent on the virus being able to replicate efficiently in both mosquitos and birds. in mosquitos the virus multiplies in the salivary glands and is transmitted to birds when the mosquito takes a blood meal. the virus is amplified in birds and can be transferred to further mosquitos [ ] .

[figure caption: an rna virus particle, or virion, consists of an rna genome (blue) surrounded by a protein coat or capsid (black). some viruses also have a lipid envelope studded with viral proteins surrounding the capsid (not shown). the virus particle attaches to a receptor on the surface of a susceptible host cell ( ), and becomes internalized ( ) . the viral genome codes for viral proteins (black shapes) ( ) and is replicated via a replication intermediate (red) ( ) . newly synthesized genomes and proteins assemble together ( ) and newly made virus particles are released ( ) .]

because rna viruses need to replicate in these highly variable and dynamic environments, they need to be highly adaptable to maintain their existence. this adaptability is conferred by their genetic heterogeneity. in the late s it was discovered that the nucleotide sequences of rna bacteriophage, qβ, are highly heterogeneous [ ] , and this observation has since been extended to all rna viruses.
accurate quantification of rna virus mutation rates is challenging, but they have been estimated at − to − per nucleotide per round of copying [ , ] . this equates to approximately one mutation per genome replication event, which is a considerably higher rate than that of bacteria, estimated at one mutation per genome replication events [ ] . in addition to point mutations, recombination between viral genomes can occur in high frequency in some rna viruses, resulting in replacement of different regions of genome sequence. any particular rna virus population is always in flux, with new mutations arising and deleterious mutations being lost through selection. the high mutation rate of rna viruses, coupled with their very high levels of replication and the large population sizes that they can achieve means that rna viruses exist as a swarm of variants rather than as a single genotype entity. thus, rna viruses are a genetic paradox: they are in one sense very simple entities, having very limited genetic information, but on the other hand, they are genetically complex, having the capability to access millions of sequence combinations. adding to this complexity, there is evidence for some rna viruses that they can exist as quasispecies in which the related genome sequences can complement each other and function cooperatively [ , ] . thus, when a virus spreads from cell to cell and host to host, it is the properties of a swarm of genetically related but distinct viruses that enables this to occur, not the properties of a single, isogenic virus. as described in detail later in this chapter, rna viruses require a high mutation rate to enable them to survive the varied environments that they encounter in the course of their transmission cycle. interestingly, they also have evolved genome sequences that have a bias that allows them to rapidly adapt. 
however, there is also evidence that at least some have a mutation rate that is so high that they are poised at the edge of a threshold of viability, with small increases in mutation rate causing them to accumulate so many lethal mutations that they are extinguished. together, these findings suggest that rna viruses have evolved to have a specific mutation rate and mutation bias to enable them to survive in the particular environments in which they need to exist. there are several sources of genetic variability in rna viruses, some are inherent to the biology of the virus and others are consequences of the cellular environment. the viral mutation rate is the rate at which a viral genome acquires mutations per genome replication event and is determined by the viral polymerase and any proofreading activities that the virus encodes. the mutation frequency of a virus is the frequency with which mutations accumulate over a virus infection cycle and can be impacted by the mode of virus replication, and cellular factors. thus, to understand how viral genetic heterogeneity arises, it is helpful to have an appreciation of the mechanisms by which rna viruses replicate their genomes. rna viruses can be divided into different classes by virtue of their distinct genome structures and strategies of genome replication [ ] (fig. . ) . the riboviruses replicate their genomes via an rna intermediate synthesized by a viral rna-dependent rna polymerase (rdrp). riboviruses can have single or double-stranded rna genomes; those with single-stranded genomes can be further characterized by being either positive or negative stranded (ie, having a genome that is of the same sense, or the opposite sense to mrna, respectively). riboviruses can also have genomes contained within a single piece of rna, or a genome that is divided into multiple segments. another class of rna virus is the retroviruses. 
these viruses have an rna genome, which is reverse transcribed by the viral reverse transcriptase enzyme into double-stranded dna. the virus-specific double-stranded dna then integrates into the host genome and becomes a template for cellular rna polymerase ii, which synthesizes multiple copies of rna to generate the progeny viral genomes. it is important to appreciate that this classification system does not relate in any way to the tissues or hosts that a virus can infect, or the way in which it is transmitted to new hosts. for example, hepatitis c virus (hcv) and west nile virus are both positive-strand rna viruses, but they cause very different diseases and are spread in different ways. both rdrps and reverse transcriptases have the potential to introduce deletions, insertions, and nucleotide mismatches into the nucleic acid product [ ] [ ] [ ] . unlike dna-based life forms, most rna viruses have no mechanisms to identify and repair mismatches [ , ] and so polymerase error is not corrected. the error-prone nature of polymerase activity, coupled with the absence of a proofreading mechanism, is the key reason why rna virus genomes acquire mutations and exist as a swarm of genetic variants. although all rdrps and reverse transcriptases are capable of introducing mutations, they are not equally error prone. for example, the viral mutation rate inversely correlates with genome size, such that viruses with larger genomes have a lower per nucleotide mutation rate than those with small genomes [ ] . this is intuitively logical as a high mutation rate in a virus with a large genome would increase the chance of genomes acquiring a lethal mutation and so viruses with low fidelity polymerases could not be sustained. this suggests that viruses with larger genomes have evolved to limit their mutation rate and some rna viruses encode proteins that function to mitigate polymerase error, as described in the following.
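The link between per-nucleotide mutation rate and genome size described above can be made concrete with a small calculation. The following is only a sketch: the mutation rate and genome lengths are hypothetical illustrative values, not figures taken from this chapter, and every site is assumed to mutate independently at the same rate.

```python
def expected_mutations_per_genome(rate_per_nt, genome_length_nt):
    """Mean number of mutations introduced in one copy of the genome."""
    return rate_per_nt * genome_length_nt


def prob_error_free_copy(rate_per_nt, genome_length_nt):
    """Probability that a single copy carries no mutation at all,
    assuming independent, identical error probability at each site."""
    return (1.0 - rate_per_nt) ** genome_length_nt


# Hypothetical values chosen only for illustration.
ASSUMED_RATE_PER_NT = 1e-4
ASSUMED_GENOME_LENGTHS_NT = (10_000, 30_000)

for length in ASSUMED_GENOME_LENGTHS_NT:
    mean = expected_mutations_per_genome(ASSUMED_RATE_PER_NT, length)
    clean = prob_error_free_copy(ASSUMED_RATE_PER_NT, length)
    print(f"{length} nt: {mean:.1f} mutations per copy, "
          f"{clean:.1%} of copies error-free")
```

Under these assumptions, tripling the genome length at a fixed per-nucleotide rate triples the mean mutation load per copy and sharply reduces the fraction of error-free copies, which is the intuition behind larger genomes requiring higher fidelity.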
however, even when related viruses with similar genome lengths are compared, there are differences in polymerase fidelity [ , ] . for example, in a side-by-side comparison, using in vitro biochemical assays, the rdrp of coxsackievirus b is of higher fidelity than that of poliovirus, even though these are highly related viruses [ ] . in sum, these facts suggest that polymerase error rate is determined by selection pressures related to viral genome size and other facets of virus biology. the molecular mechanisms that govern polymerase fidelity have been elucidated by detailed enzyme kinetics studies of wild-type polymerases and by studying mutant versions of polymerase with altered fidelity [ , [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . these studies have shown that the error rate of the polymerase can be modulated by single amino acid substitutions in the enzyme, and that substitutions outside the active site can have an effect. thus, the structure of the polymerase is tuned to enable it to manifest a particular fidelity. in addition to controlling the rate of replication error, polymerase determinants can also influence what substitution mutations are introduced. in a landmark study, a novel sequencing approach was employed to identify low-frequency mutations that accrued in the poliovirus genome under relatively constant conditions [ ] . viral populations present at different times were analyzed to determine what mutations accumulated in this stable environment, where selection pressure was minimized. this analysis showed that transitions occurred more frequently than transversions, and within these categories there was variation: c-to-u and g-to-a transitions accumulated more frequently than u-to-c or a-to-g. thus, these studies indicate that there is directionality to the mutation pattern of the viral swarm. similar findings had been made with hiv [ ] and studies with west nile virus have shown that different polymerase variants have different mutational biases [ ] . 
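The transition and transversion categories used in the analysis above follow directly from base chemistry: a transition swaps a purine for a purine or a pyrimidine for a pyrimidine, and a transversion swaps one class for the other. A minimal sketch of how a list of observed substitutions could be tallied is given below; the function names and the example input are hypothetical and are not data from the cited studies.

```python
from collections import Counter

PURINES = {"a", "g"}
PYRIMIDINES = {"c", "u"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for an rna base change."""
    ref, alt = ref.lower(), alt.lower()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of a, c, g, u")
    if ref == alt:
        raise ValueError("ref and alt are identical; not a substitution")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def mutation_spectrum(substitutions):
    """Count each directed change (e.g. 'c>u') and each category."""
    directed = Counter(f"{r.lower()}>{a.lower()}" for r, a in substitutions)
    category = Counter(classify_substitution(r, a) for r, a in substitutions)
    return directed, category


# Hypothetical observations, for illustration only.
example = [("c", "u"), ("c", "u"), ("g", "a"), ("u", "c"), ("a", "c")]
directed, category = mutation_spectrum(example)
print(directed)
print(category)
```

Keeping the directed counts separate from the category counts matters here, because the bias described in the text is directional: c-to-u and u-to-c are both transitions but need not occur at the same frequency.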
thus, rna viruses do not incur substitutions randomly, but have a mutation bias that is likely governed by the molecular determinants of fidelity in the polymerase. this bias might play an important role in allowing the virus to generate a favorable spectrum of sequences following a genetic bottleneck. although the viral polymerase is typically the key viral factor that determines how faithfully the viral genome will be copied, it is not the only factor and there are examples of viruses in which other proteins can come into play to reduce polymerase error. as noted earlier, the genomes of all rna viruses are relatively small compared to those of the largest dna viruses, and it is thought that the high mutation rate of rna virus polymerases imposes an upper limit on genome size. however, there is a wide range of genome sizes within the rna viruses, with the largest being those of the coronaviruses, at up to kb. this is more than twofold longer than most other rna viruses. it has now become apparent that the reason why the coronaviruses can sustain this relatively large rna genome is that they have an rna proofreading activity [ , ] facilitated by an exonuclease that probably functions by removing incorrect insertions at the ′-end of the rna product during rna synthesis [ ] . interestingly, the activity of the exonuclease is significantly enhanced by an additional coronavirus protein [ , ] . the fact that a multipartite complex performs the polymerization and rna proofreading activities raises the intriguing possibility that coronavirus fidelity could be regulated. some of the nonsegmented, negative-strand rna viruses also have an additional protein that might function to limit rdrp error. the pneumovirus subfamily has genomes of approximately kb, and so they are in the midrange of rna virus genome sizes. these viruses encode a small protein called m - . 
it has been found that deletion of m - of human metapneumovirus results in increased accumulation of transitions, transversions, deletions, and insertions in the viral rna, suggesting that m - serves to increase the fidelity of the viral polymerase [ ] . the mechanism by which m - functions is not known, but it has no known enzyme activity and so it is unlikely to function as an exonuclease, but instead might serve to increase fidelity by altering rdrp structure. a deoxyuridine-triphosphatase (dutpase) enzyme is expressed by some, but not all, retroviruses [ , ] . this enzyme hydrolyzes dutp and maintains low dutp:ttp ratios, thus limiting misincorporation of deoxyuridine into viral dna. the viral dutpase has been shown to limit the mutation rate of feline immunodeficiency virus and caprine arthritis-encephalitis virus [ , ] . interestingly, the primate lentiviruses, including hiv, do not encode a dutpase, but might package a cellular dna repair enzyme, uracil dna glycosylase, into their virions to help limit the mutation rate [ , ] . in addition to the mutations that can be introduced when the polymerase selects an incorrect nucleotide during rna synthesis, genetic variation can also arise by recombination. recombination can occur when two or more viral genomes enter the same cell and a part of one genome is incorporated into the other. this can result in significant changes in genome composition with dramatic impact on virus biology. for example, there is evidence that recombination might have been a factor that enabled the emergence of sars coronavirus [ ] and it is a key factor in emergence of pandemic influenza viruses [ ] . however, while recombination can impact diversity, there is debate as to whether it has evolved as a means to generate variability, or is merely a consequence of viral genome replication [ ] .
in this respect, it is interesting that even viruses with similar genome structures can undergo different rates of recombination, perhaps suggesting that recombination is also finely tuned by evolutionary pressure. there are three mechanisms by which rna viruses can recombine: template-switching recombination, nonreplicative recombination, and re-assortment (fig. . ) . template-switching, otherwise known as copy-choice recombination, can occur during the process of rna synthesis if the viral polymerase transfers from one template to another, while remaining attached to the nascent nucleic acid chain [ ] . this results in production of a mosaic genome. template switching tends to occur between sequences of close similarity to give rise to a homologous recombination event. nonhomologous recombination can also occur, but this typically results in defective genomes and is observed less frequently. viruses differ significantly in the rate with which they can recombine by template-switching [ ] . it can be highly frequent in retroviruses, particularly hiv, and also in coronaviruses. the high frequency of recombination in these viruses may be due to the replication strategies that they have. in the case of retroviruses, the reverse transcription process is highly complex and the reverse transcriptase must switch from one template to another during dna synthesis [ ] . likewise, in coronaviruses, transcription of the genome, to allow gene expression, requires the rdrp to transfer from one site to another on the genomic template [ ] . thus, the fact that the polymerases of these viruses have evolved to transfer to a different template sequence probably means they are more likely to do so during other aspects of rna synthesis. recombination can occur in other positive-strand rna viruses besides coronaviruses, and in double-stranded rna viruses, although the recombination frequency apparently varies between viruses.
for example, it occurs frequently in the positive-stranded enteroviruses, such as poliovirus, but less so in the flaviviruses, such as hcv [ ] . template switch recombination is much less frequent in the negative-strand rna viruses [ ] , probably because their genomes are not naked rna, but rather are encapsidated, or buried, in protein called nucleoprotein [ ] . the polymerase transiently displaces nucleoprotein as it moves along the template and only recognizes rna sequences as the nucleoprotein is displaced. this probably prevents the rdrp from transferring from one genome to a similar sequence in another genome to yield functional recombinants. however, negative-strand rna viruses containing gene duplications have arisen naturally [ ] and defective interfering genomes, which contain promoters and partial genome sequences, but not complete genomes, are often detected. these findings suggest that the rdrps of the negative-strand rna viruses are capable of jumping from one sequence to another, or that nonreplicative recombination (described later) can come into play, but perhaps most products of these events are nonviable and so are not detected. in contrast to template switch recombination, nonreplicative recombination seems to be a relatively rare event that to date has only been described for a few positive-strand rna viruses. this mechanism was documented by recovery of viable viruses (or replicative templates) following cotransfection of cells with two viral rna fragments, each of which was unable to function in replication independently. the fragments recombined to form functional rnas [ ] [ ] [ ] [ ] . this mechanism of nonreplicative recombination might not involve a viral enzyme activity. instead, it seems that the two rna strands are joined together either by a transesterification reaction [ ] , or by ligation, presumably by cellular ligases [ ] [ ] [ ] . 
thus, rna genome fragments created by physical shearing, nuclease cleavage, or cryptic ribozyme activity have the potential to be joined to form a novel viral genome, which can be further refined by homologous recombination to remove duplicated sequence [ ] . re-assortment is a process that can occur during coinfection of a cell with viruses with segmented genomes [ ] . during re-assortment, a virus can exchange one of its own segments for that of another related virus. this process is well studied in influenza virus, in which it occurs frequently. influenza a virus has eight genome segments that all need to be packaged into a virion for that virus particle to be infectious. this process is not completely random; there are packaging signals in the rna segments and epistatic interactions enable the correct complement of segments to be incorporated into virions. there are many subtypes of influenza virus, but if the packaging signals of two viral subtypes coinfecting a cell are sufficiently similar, this enables a segment from one virus to be incorporated into another, resulting in release of virions with a new genome composition. because different viral subtypes have different antigenic properties, this process has significant impact on influenza virus epidemiology [ ] . as described earlier, the mutation frequency of a virus differs from the mutation rate, in that it refers to the accumulation of mutations over a virus infection cycle, for example, from the point of entry of a virus into a cell until release of infectious progeny. in addition to having different genome structures and nucleic acid intermediates, different rna viruses have different numbers of replication events per infection cycle and so this can impact on mutation frequency. retrovirus reverse transcriptase only copies the genome twice during an infection cycle: once to generate cdna from the rna template and a second time to synthesize the complementary dna strand to generate double-stranded dna. 
thus, these are the only two occasions in the retrovirus infection cycle where the viral polymerase can introduce mutations. the cellular rna polymerase ii enzyme is responsible for generating multiple copies of genome rna that become packaged into viral particles, and while there is the potential for error to be introduced by rna polymerase ii, cellular proofreading mechanisms come into play at this step and so the major source of mutation during retrovirus genome replication is the reverse transcriptase [ ] . in the riboviruses, the viral rdrp is responsible for all genome replication events and it copies the genome multiple times. thus, in this case, there are many more opportunities for mutations to be introduced by polymerase error. within the riboviruses, there are different modes of genome replication, referred to as a stamping machine or geometric modes, and the degree to which a virus employs one mode versus the other will affect mutation frequency [ ] . in stamping machine mode, the infecting genome template is used to make multiple progeny genomes, but these genomes are not used as templates until they have been delivered into another cell. it is thought that double-stranded rna viruses use this mode primarily. in contrast, in geometric replication, an incoming genome template acts as a template to make multiple complementary strands (or antigenomes), which in turn act as templates to make multiple genome sense strands, within the same infection cycle. in this case, there are many more opportunities for mismatch errors to be introduced than in the stamping machine mode. positive and negative sense rna viruses probably use a combination of both modes, but the exact contribution of each to the output virus is not well characterized, except in a few cases [ ] . 
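The difference between the two replication modes can be sketched numerically. The model below is a deliberate simplification and not a description of any particular virus: it assumes that in stamping machine mode every progeny genome is a single copying event away from the infecting template, that in geometric mode every newly made strand is reused as a template, and that all progeny are taken from the final round. The function names and parameter values are hypothetical.

```python
def stamping_machine(mutations_per_copy, copies_per_template):
    """Only the infecting genome is used as a template.

    Returns (number of progeny, mean mutations per progeny)."""
    copy_events_in_lineage = 1  # assumed: progeny are direct copies
    progeny = copies_per_template
    return progeny, mutations_per_copy * copy_events_in_lineage


def geometric(mutations_per_copy, copies_per_template, cycles):
    """Each cycle is genome -> antigenome -> genome, and every new
    strand serves as a template for the next copying step.

    Returns (number of progeny, mean mutations per progeny)."""
    copy_events_in_lineage = 2 * cycles
    progeny = copies_per_template ** copy_events_in_lineage
    return progeny, mutations_per_copy * copy_events_in_lineage


# Hypothetical parameters: one mutation per copy on average.
print(stamping_machine(1.0, 10_000))   # many copies from one template
print(geometric(1.0, 10, cycles=2))    # few copies per template, deep lineage
```

With these illustrative numbers both modes yield the same number of progeny, but each geometric progeny genome has passed through four error-prone copying events rather than one, so its expected mutation load is fourfold higher.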
the mutation rate of the viral polymerase, coupled with the replication mode that the virus employs (and extrinsic factors, described in the following text) will determine the extent of genetic variability of viruses released from an infected cell. the cellular environment can impact virus mutation rates and frequency. for example, dntp pool imbalances can affect retrovirus mutation rates [ ] , and it has been suggested that differences in substitution rates between rna viruses are a consequence of differences in virus rna synthesis rates in different cell types [ ] . in addition to these effects, there are also cellular factors that can result in increased mutation in rna viruses. adenosine-to-inosine modification by enzymes called adenosine deaminase acting on rna (adar) is the most common form of rna base modification that occurs in mammals. a-to-i conversion has important consequences for the coding potential of substrate rnas, as inosine is decoded as a g by polymerases during template copying. the a-to-i conversion in a dsrna duplex also has consequences for the stability of rna secondary structures, as the resulting i:u pairing is less stable than a canonical a:u pair. this can have important consequences for rnas that depend on their structure rather than sequence for their function [ ] . adar modification of cellular double-stranded rna was shown to prevent its recognition by the cytoplasmic sensor of nonself rnas that would otherwise lead to chronic activation of innate immune pathways [ ] . there is also evidence that adar can modify viral rnas. sequence analysis of rna virus genomes has revealed that they preferentially accumulate a-to-g transitions, which are characteristic hallmarks of adar activity. measles virus is a negative-stranded rna virus, responsible for an acute disease predominantly in infants, but in rare instances associated with a fatal latent infection of the cns known as subacute sclerosing panencephalitis (sspe).
analysis of measles virus genomes from sspe victims has revealed abundant a-to-g transitions, suggesting a role for adar in establishment of sspe [ ] . consistent with an antiviral role for adar, measles virus infection of adar knock-out cell lines displayed increased cellular pathology, and similar findings were reported for other rna viruses, implicating adar as a cellular restriction factor for a wide range of negative-stranded rna viruses [ ] . direct evidence of adar modification of a viral rna genome comes from studies of hepatitis delta virus (hdv). hdv is the smallest of the rna viruses and encodes just two proteins, hdag-l and hdag-s, both of which are essential for virus viability. hdag-l and hdag-s share the same amino terminal open reading frame, but hdag-l possesses a carboxyl terminal extension that is accessed when the stop codon at the end of the hdag-s orf is bypassed. early during infection only the truncated hdag-s is expressed, but then at later times expression of hdag-l increases due to the site-specific modification of the stop codon by adar [ ] . this editing event is highly specific and is promoted by the extensive secondary structure of the hdv rna genome. this action by adar is clearly proviral, in that without the activity of adar, no infectious hdv particles would form. another family of cellular factors that can modify the sequence of viral genomes is the apobec family of enzymes. these comprise an extensive arm of the innate immune system [ ] . they are responsible for the modification by deamination of cytosine residues to uracil, which is an activity largely performed on single-stranded dna substrates, leading to the phenomenon of hypermutation. apobec activity can affect the retroviruses. hiv infection is blocked by apobec, unless the virus expresses the viral infectivity factor (vif).
the mechanism for this blockade relies on the packaging of multiple apobec family members within hiv virions, which can act on the hiv genome once it has been copied by reverse transcriptase into a complementary dna. the effect of apobec activity can be the modification of up to % of susceptible cytosine residues, resulting in a drop in infectivity of up to -fold. together, the studies described earlier show that there is a range of viral and host factors that combine to alter mutation frequency. the question that arises is: how do rna viruses withstand mutation? the ability of a genome to withstand genetic or environmental perturbations without a change in phenotype is referred to as genetic robustness [ ] . the high mutation rate that rna viruses incur comes at a cost. it has been estimated that - % of virus genomes generated during infection are defective [ ] and so at an individual level, most viral genomes are not robust. this is not surprising: the small size of rna virus genomes limits their coding potential and so they have limited genetic redundancy. moreover, rna virus genomes are highly compact, often containing overlapping reading frames, and nucleotide sequences that function at the rna level, for example, as cis-acting elements that enable genome replication, as well as at the protein level. however, robustness is influenced by the genetic background in which it operates and so in the case of rna viruses, genetic robustness is considered in the context of the viral swarm, rather than individual genotypes. rna viruses are not all equally robust, and even closely related viruses can exhibit different degrees of robustness [ ] . there are several factors that contribute toward this [ , ] which are described in the following paragraphs. robustness is conferred if a virus has an ability to more readily arrive at a new optimal or adapted genotype in the face of a changed environment, and the genetic composition of the viral swarm can facilitate this.
because the majority of nucleotide changes in rna virus genomes are either strongly deleterious or lethal, the population is perpetually refined as deleterious genomes become purged through selection, leaving only mutations with phenotypically neutral or advantageous consequences to persist [ , ] . the neutral mutations can impact robustness. an explanation for this is that if the virus encounters a new environment, multiple nucleotide changes might need to occur for it to arrive at an optimal genotype. if some of these changes are already in place, then the jump to the new genotype is more likely to occur. this means that a population that includes a high proportion of neutral mutations will be more adaptable in the face of environmental change, as genomes with neutral mutations can act as stepping-stones toward reaching the new adapted genotype [ ] (fig. . ) . thus one viral determinant of robustness is their high mutation frequency, which results in a more extensive neutral network [ , [ ] [ ] [ ] . consequently, factors that affect mutation frequency, such as polymerase fidelity and replication mode, will also impact robustness. interestingly, there is evidence that some rna virus genomes have evolved to enable rapid adaptation. experiments in which synonymous mutations were introduced into rna virus genomes and fitness was assessed showed that the rna nucleotide sequence has an effect on fitness, independently of its effect on protein sequence [ ] . this could be due to effects on rna structures and cis-acting elements. however, experiments with poliovirus showed that this might not be the only explanation. in this case, a region of the poliovirus genome that does not contain cis-acting rna structures was recoded with synonymous mutations. the virus variant containing the synonymous mutations had reduced robustness and was attenuated in an animal model [ ] . 
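One way to see why synonymous recoding could change adaptability, independently of rna structure, is that synonymous codons differ in which amino acids lie one nucleotide substitution away. The sketch below enumerates these single-step neighbours using the standard genetic code; the helper name `reachable_amino_acids` is hypothetical, and the two leucine codons are chosen purely as an illustration.

```python
from itertools import product

BASES = "ucag"
# Standard genetic code, codons ordered uuu, uuc, uua, uug, ucu, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reachable_amino_acids(codon):
    """Amino acids (or '*' for stop) that differ from the one encoded
    by `codon` and are reachable by a single nucleotide substitution."""
    codon = codon.lower()
    encoded = CODON_TABLE[codon]
    reachable = set()
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            neighbour = codon[:position] + base + codon[position + 1:]
            if CODON_TABLE[neighbour] != encoded:
                reachable.add(CODON_TABLE[neighbour])
    return reachable


for leucine_codon in ("uua", "cua"):
    print(leucine_codon, sorted(reachable_amino_acids(leucine_codon)))
```

Both codons encode leucine and both can reach isoleucine and valine, but uua can additionally reach serine, phenylalanine, and a stop codon, whereas cua can reach proline, glutamine, and arginine. A population carrying both codons therefore has access to a wider set of single-step amino acid changes than a population carrying only one.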
the attenuation of the recoded virus suggests that wild-type poliovirus occupies a sequence space that enables it to rapidly adapt to environmental pressure. another viral determinant of robustness relates to the ability of rna viruses to generate large numbers of genomes within individual infected cells. a consequence of the resulting polyploidy is that a genome containing a detrimental change can be complemented by the properties of another genome that is unaltered. this mechanism also has a downside in that it reduces the ability to purge poorly adapted genotypes, and thus their persistence in a population may lead to a reduction in its overall fitness. interestingly, the huge range in the extent of polyploidy that occurs throughout the infection cycle may allow different levels of robustness at different times of the virus life cycle, with more opportunity for complementation at later stages of infection when the copy number of viral genomes is at its highest. such a scenario may have important consequences for viruses that stimulate innate immunity early on in the infection cycle. the innate immune response poses a high adaptive requirement at a time when viral genome numbers are at their lowest. conversely, persistent viruses that maintain high copy numbers for extended periods of time without inducing cell death, such as hcv, may be particularly robust due to the wide range of genotypes contained within the massive population of persisting rnas. the presence of multiple genomes within the same cell can also enable recombination. recombination is another factor that influences robustness, as it can result in purging of multiple mutations from a genome in a single recombination or re-assortment event [ ] . rna virus robustness can also be impacted by host cell factors. the ability of chaperones to buffer mutations was first proposed for the groel molecular chaperone [ ] .
subsequently it has been experimentally observed that chaperones, such as members of the heat shock protein and families, play important roles in the infection cycles of many rna viruses. it has been proposed that viruses have evolved the ability to interact with chaperones in order to buffer the effects of deleterious coding mutations that would otherwise prevent their correct folding [ , ] . this provision is particularly important as viruses depend on assembly of high-order multimers to build their capsids, a major component of the virions that are released. in these cases, a single misfolded protein has the potential to disrupt the function of the entire complex and so mechanisms that facilitate appropriate protein folding can have a significant impact. although there are a number of properties of rna viruses that contribute to genetic robustness, the role of robustness in the natural history of rna viruses is a controversial topic. a virus population with increased neutral genotypic diversity and thus high robustness can readily adapt to new environments due to its inherent diversity, and increased availability of adaptive pathways. this has important implications for viral pathogenesis and robustness has been shown to increase virulence in host organisms [ ] . however, it appears that the converse can also be true and under certain conditions the neutral network can be composed of genotypes that are unable to reach a high level of fitness in the new environment [ ] . this suggests that it may be difficult to make generalizations over how robustness shapes virus adaptability. as mentioned at the beginning of this chapter, genetic heterogeneity of rna viruses is such a key facet of their biology that it brings up the question of whether their high mutation rates have been selected for and are of evolutionary benefit. fidelity comes at the price of elongation efficiency [ ] . 
thus, it is possible that the high mutation rates of rna viruses are simply a consequence of polymerases that are under selective pressure to replicate genomes very rapidly to ensure efficient viral infection [ ] [ ] [ ] . according to this view, rna viruses have evolved a balance between rapid genome synthesis and error, such that the mutations that they incur are tolerable and on occasion advantageous, but are not necessary for virus survival. however, while genome synthesis rate is certainly an important factor in virus fitness [ ] , for some viruses there is also evidence that the high mutation rate is beneficial and that rna virus polymerase fidelity is tuned, enabling the virus to maximize sequence space while avoiding the accumulation of so many deleterious mutations that the genomes become nonviable. this is the concept that rna viruses are "on the edge."

fig. (caption fragment) in this example, the green mutation alone is deleterious, but is neutral or beneficial in combination with the red mutation. if the neutral network contains genomes with red mutations, it provides a stepping-stone to enable introduction of the green mutation. (c) a neutral network containing genomes that have different codons for the same amino acid can provide a stepping-stone to genomes containing different spectra of amino acids following a single nucleotide substitution. in this example, a neutral network contains genomes coding for leucine at a given position, but the genomes differ by coding for leucine with either uua or cua. this expands the range of amino acids that could arise following a single nucleotide change.

as described earlier, most mutations that arise are deleterious and so there is a significant cost to having an error-prone polymerase.
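the leucine example in the figure caption can be checked directly. the following is a minimal sketch, assuming the standard genetic code; the function name `reachable_amino_acids` is introduced here for illustration only. it lists the amino acids (or a stop codon, shown as `*`) that each leucine codon can reach through one nucleotide substitution.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, with codons enumerated in U, C, A, G order
# at each of the three positions ('*' marks a stop codon).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reachable_amino_acids(codon):
    """Return the non-synonymous outcomes of every single-nucleotide change."""
    original = CODON_TABLE[codon]
    reachable = set()
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue  # not a substitution
            mutant = codon[:position] + base + codon[position + 1:]
            outcome = CODON_TABLE[mutant]
            if outcome != original:
                reachable.add(outcome)
    return reachable


from_uua = reachable_amino_acids("UUA")
from_cua = reachable_amino_acids("CUA")

print("UUA ->", sorted(from_uua))
print("CUA ->", sorted(from_cua))
print("either codon ->", sorted(from_uua | from_cua))
```

a population carrying both synonymous codons has access to the union of the two sets, which is larger than either set alone. this is the sense in which the neutral network acts as a stepping-stone.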
furthermore, while complementation between defective genomes can occur, enabling genetic robustness, it is also possible for defective genomes to have an antagonistic effect, for example, by expressing mutant proteins that function as dominant negatives. nonetheless, despite these disadvantages, it is possible to generate mutant viruses that have changes in the polymerase that result in its increased accuracy; these are known as high-fidelity mutants. elegant studies performed with a poliovirus high-fidelity mutant showed that efficient spread within a host requires a quasispecies, and an error-prone rdrp to generate it [ , ] . naturally, poliovirus replicates in the gut, but it can replicate in other tissues and spread to the spinal cord and brain. the ability to infect this variety of tissues requires poliovirus to overcome significant barriers to replication [ ] . experiments comparing the growth characteristics of wild type and a variant of poliovirus with a high-fidelity rdrp showed that the high-fidelity variant could replicate relatively efficiently compared to the wild-type virus in a single multiplication cycle in cell culture [ , ] , and if introduced into mice intravenously, it could replicate efficiently in the spleen, kidney, and small intestine [ ] . thus, in this case, genome replication was not significantly delayed by the increased accuracy in rna synthesis. however, in contrast to wild-type poliovirus, this high-fidelity mutant virus could not efficiently spread to the central nervous system (cns), hence the % lethal dose (ld ) was increased -fold [ , ] . to examine if the defect in virus spread was due specifically to the mutation (perhaps this variant of the rdrp could not function in a neuronal environment), or to the lack of genome diversity within its population, vignuzzi and coworkers increased the diversity of the high-fidelity virus by treating it with mutagens.
this had the dramatic effect of increasing the ability of the high-fidelity virus to replicate in the spinal cord and brain, and the ld was restored to the same level as wild-type poliovirus. this result showed that poliovirus spread to the cns is dependent on the virus being able to establish a highly diverse population. in addition, it was shown that coinfection of mice with wild-type and high-fidelity mutant virus enabled the high-fidelity virus to reach the brain [ ] . this indicates that different viral genotypes in the quasispecies can complement each other to facilitate infection spread. it is not known exactly how complementation functions in this case, but it is easy to imagine that one variant of a virus might be more efficient at subverting innate immune defenses (which could impact virus genomes within the same cell and neighboring cells), whereas another variant might express a capsid protein better adapted to bind to a new cell receptor. in its natural context, poliovirus is spread through ingestion of contaminated water and so there is no necessity for poliovirus to be able to spread to the cns to be able to complete its transmission cycle. however, these studies are important because they show that viruses can benefit from polymerase infidelity and a high mutation rate, particularly under conditions where they encounter a change in environment [ , ] . studies with a number of viruses indicate that these findings are widely applicable in rna virology and so it seems likely that rna viruses have evolved a high mutation rate that enables them to rapidly adapt to the dynamic and varied environments in which they exist. the studies described earlier show that rna viruses benefit by having an error-prone polymerase to enable them to adapt to new conditions. however, there is also a cost if the polymerase has mutations that decrease its fidelity, so that the error rate is increased. 
experiments performed with coxsackievirus b and poliovirus showed that low-fidelity mutants were able to replicate efficiently in cell culture when propagated at high multiplicity of infection (i.e., when the population size was large), but were extinguished when the viruses were propagated under low multiplicity conditions, which mimic the situation in which a virus first establishes infection in a host or has just overcome a barrier, such as adaptation to a new host cell type. consistent with these findings, both the coxsackievirus b and poliovirus low-fidelity mutants were attenuated in vivo. in the case of coxsackievirus b , low-fidelity mutants were unable to establish productive infection in the heart, the usual site for coxsackievirus b replication, and in the case of poliovirus they were unable to reach the cns [ , ] . comparison of the high- and low-fidelity poliovirus variants indicates how much latitude there is in the mutation rate for this virus. the high-fidelity rdrp had an approximately twofold decrease in nucleotide misincorporation rate, and the low-fidelity rdrp had a twofold increase [ ] . thus, the range in misincorporation rate that can lead to virus extinction in an animal host is not that substantial, even in a virus that is relatively genetically robust. this indicates that the fidelity of the polymerase, coupled with the impact that accuracy has on the rate of rna synthesis, is optimized to enable viruses to adapt to the many environments in which they need to exist while avoiding extinction [ ] . the propensity that rna viruses have for mutation seems to have opened this up as an avenue for host cell defense. pathogenic viruses and their hosts are engaged in an epochal "arms race" in which the host evolves immune defenses to suppress virus infection and the virus in turn evolves countermeasures to disable host defenses.
the existence of apobec and adar, cellular proteins that can increase virus mutation frequency, suggests that mammalian hosts have taken advantage of the high mutation rate of viruses and evolved mechanisms to induce further mutations in the viral genomes and push viruses toward extinction [ ] . conversely, primate lentiviruses have evolved vif, a protein that can target apobec for proteasomal degradation, indicating that these retroviruses have evolved a mechanism to counter this cellular defense [ , ] . likewise, the nonsegmented, negative-strand rna viruses, which are susceptible to adar, maintain their genomes encased in a ribonucleoprotein complex throughout the infection cycle, reducing the opportunity for them to adopt double-stranded rna structures, the substrate for adar. this perhaps prevents adar from causing as much damage as it otherwise might. the high mutation rate of rna viruses has often been an impediment to drug and vaccine development as viruses can rapidly gain resistance to antiviral drugs and to the immune response elicited by vaccines. however, our increasing understanding of the function and consequences of genetic variability has opened new avenues for controlling viral infection. as described earlier, small decreases in polymerase fidelity can have dramatic effects on viral infectivity. similarly, studies have shown that small increases in viral mutation rate caused by treatment with mutagenic compounds can result in significant decreases in viral fecundity [ , ] . thus, treatment with mutagens that increase the accumulation of mutations in the viral genome can lead directly to virus extinction, or can reduce virus infection to enable effective clearance with other inhibitors, given in combination, or by host immune responses [ , ] . the identification of high-fidelity mutant viruses that can infect animals has also suggested a means to exploit these mutants as vaccine candidates.
live-attenuated virus vaccines can be highly effective, but have the disadvantage that they can potentially revert to a wild-type pathogenic phenotype. by engineering recombinant viruses with increased fidelity, it is possible to generate viruses that are attenuated, as described earlier, and that elicit protective immune responses, with reduced risk of reversion [ ] . the rna viruses are hugely diverse, not only in their genome structures and replication strategies, but also in their "lifestyles," which can differ significantly, even between closely related viruses. what has emerged from studies of virus genetics is that rna viruses are also highly divergent in terms of their polymerase fidelity, recombination rates, replication modes, and genetic robustness. we speculate that rna viruses have evolved such that there is an intricate balance between these factors that is tuned to match the "lifestyle" of each virus, enabling it to occupy the niche in which it exists. there is some evidence to support this idea. for example, a side-by-side comparison of influenza virus and hiv polymerase fidelity showed that influenza virus rdrp is much less error prone than hiv reverse transcriptase. this may be a reflection of the fact that the influenza virus rdrp performs many more genome replication events during an infection cycle than the hiv reverse transcriptase and needs to be less error prone to avoid having a mutation frequency that is too high [ ] . another example comes from studies of west nile virus. while the fidelity of the west nile virus rdrp has not been directly compared to that of other viruses, there is a greater difference in fidelity between the wild-type west nile virus rdrp and a high-fidelity mutant than has been found for most rdrps [ ] . this could suggest that west nile virus rdrp is naturally more error prone than most. this could be a necessary feature of west nile virus to enable it to cycle back and forth between mosquito and avian hosts. 
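the relationship between polymerase fidelity, the number of genome replication events, and the resulting mutation load can be made concrete with a short calculation. this is a sketch under stated assumptions, not a model taken from the studies cited above: mutations are treated as independent (poisson) events, and the genome length, the baseline misincorporation rate, and the function names are hypothetical values chosen only for illustration. the twofold changes mirror the high- and low-fidelity poliovirus variants described earlier.

```python
import math


def expected_mutations(per_site_rate, genome_length, replication_rounds=1):
    """Mean number of mutations a genome accumulates over the given rounds."""
    return per_site_rate * genome_length * replication_rounds


def fraction_unmutated(per_site_rate, genome_length, replication_rounds=1):
    """Poisson estimate of the share of genomes that carry no mutation."""
    mean = expected_mutations(per_site_rate, genome_length, replication_rounds)
    return math.exp(-mean)


# Hypothetical values, for illustration only.
GENOME_LENGTH = 7500    # nucleotides
WILD_TYPE_RATE = 1e-4   # misincorporations per nucleotide per replication event

variants = {
    "high-fidelity (rate / 2)": WILD_TYPE_RATE / 2,
    "wild-type": WILD_TYPE_RATE,
    "low-fidelity (rate x 2)": WILD_TYPE_RATE * 2,
}

for name, rate in variants.items():
    for rounds in (1, 4):
        mean = expected_mutations(rate, GENOME_LENGTH, rounds)
        clean = fraction_unmutated(rate, GENOME_LENGTH, rounds)
        print(f"{name:26s} rounds={rounds}  "
              f"mean mutations={mean:.2f}  unmutated={clean:.3f}")
```

under these assumptions, a polymerase whose products pass through more replication events per infection cycle needs a proportionally lower per-event error rate to reach the same mutation frequency, which is the argument made above for the influenza virus rdrp.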
understandably, much of the work that has been performed so far has focused on viruses that are "model" viruses, those that are relatively easy to culture in vitro and replicate rapidly. however, a fuller understanding of how the factors that influence genetic diversity intertwine with virus biology will come from extending the work that has been performed so far and applying it to other viruses that have similar genome structures and replication strategies, but diverse lifestyles, such as west nile virus and hcv, or vesicular stomatitis virus and measles virus. research in this area will potentially be transformed by new sequencing techniques, such as cirseq, which can detect low-level genetic variants above the background of errors introduced during rna sequencing [ ] , and base-seq, a method for obtaining long stretches of sequence that can be used to identify haplotypes [ ] . ultimately, application of cutting-edge sequencing technologies, mathematical analyses, and virology studies to a range of viruses will enhance our understanding of the genesis and functional consequences of rna virus genetic instability.

complementation: the ability of the products of one viral genome to provide a function that cannot be performed by the products of another viral genome.
copy-choice recombination: a recombination event that occurs when the viral polymerase switches to another template while remaining attached to the nascent rna. also known as template switch recombination.
epistatic mutation: a phenomenon in which mutations have different effects in combination than individually.
fidelity: the accuracy with which the polymerase copies the template. a high-fidelity polymerase will make fewer errors than a low-fidelity polymerase.
genetic robustness: the degree to which a genome can withstand environmental or genetic perturbation.
geometric mode: a mode of genome replication in which the newly synthesized genomes become templates for further rounds of genome replication during the infection cycle.
infection cycle: the cycle of events by which an infectious virus particle infecting a cell results in release of virus progeny. in the virology field, this is often referred to as the virus replication cycle, but infection cycle was used here to avoid confusion with genome replication.
lethal dose (ld ): the quantity of infectious virus that is required to cause death in % of inoculated hosts.
live-attenuated virus vaccine: a vaccine that consists of a live (infection-competent) virus that contains mutations that reduce the disease symptoms, usually by impairing its ability to efficiently complete its infection cycle.
mutation rate: the rate at which a viral genome acquires mutations per genome replication event.
mutation frequency: the frequency at which a viral genome acquires mutations per viral infectious cycle. this frequency could be affected by cellular factors and the mode of viral replication, as well as by polymerase fidelity.
nonreplicative recombination: a recombination event in which two rna fragments are joined together by either a trans-esterification reaction, or ligation by cellular ligases.
persistent virus: a virus that can infect a host and maintain the infection for extended periods of time. hiv and hcv are examples of persistent viruses.
polyploidy: the presence of multiple viral genomes within the same cell.
quasispecies: a collection of closely related viral genomes, genetically linked through mutation, that compete within a highly mutagenic environment, interact cooperatively, and collectively contribute to the population phenotype.
re-assortment: a recombination event that can occur with viruses with segmented genomes, in which a genome segment from one virus is packaged into a virus particle in place of a genome segment from another virus, thus producing a virus with a novel complement of genome segments.
retrovirus: a class of rna viruses that replicate their genomes via a double-stranded dna intermediate.
reverse transcriptase: a viral enzyme encoded by retroviruses that is responsible for generating a double-stranded dna copy of the viral rna genome.
ribovirus: rna viruses that replicate their genomes via an rna intermediate.
rna-dependent rna polymerase: a viral enzyme encoded by riboviruses that is responsible for generating the viral genome rna and the rna replication intermediates.
rna virus: a virus that carries a genome composed of rna in the virus particle.
stamping machine mode: a mode of genome replication in which the incoming genome is reiteratively used as a template to produce multiple copies of replication product.
swarm: a population of closely related viruses, connected through mutation, similarly to a quasispecies. we have used the term swarm in many instances here because a population of virus variants might not always fully fulfill the definition of quasispecies.
synonymous mutation: a nucleotide substitution that does not result in an amino acid change.
template switch recombination: a recombination event that occurs when the viral polymerase switches to another template while remaining attached to the nascent rna, also known as copy-choice recombination.
transmission cycle: the cycle of events by which a virus is transmitted from one host to another host in the same species.
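the glossary definitions of the stamping machine and geometric replication modes imply different numbers of copying steps between the incoming genome and its progeny, and therefore different opportunities to accumulate polymerase errors. the sketch below is an idealized illustration, not a description of any particular virus: it assumes that in geometric mode every genome present is copied exactly once per cycle, and it ignores complementary-strand intermediates. the function names are hypothetical.

```python
from math import comb


def stamping_machine_depths(n_progeny):
    """Every progeny genome is copied directly from the incoming genome,
    so each one is a single copying step away from it."""
    return {1: n_progeny}


def geometric_depths(cycles):
    """Every genome present is copied once per cycle and stays available
    as a template. After `cycles` cycles, the number of genomes that are
    k copying steps from the incoming genome is C(cycles, k)."""
    return {k: comb(cycles, k) for k in range(cycles + 1)}


def mean_depth(depths):
    """Average number of copying steps per genome in the population."""
    total = sum(depths.values())
    return sum(k * count for k, count in depths.items()) / total


cycles = 10
geometric = geometric_depths(cycles)
population = sum(geometric.values())          # 2 ** cycles genomes
stamping = stamping_machine_depths(population)

print("population size:", population)
print("mean copying steps, geometric mode:", mean_depth(geometric))
print("mean copying steps, stamping machine mode:", mean_depth(stamping))
```

because the expected number of mutations per genome scales with the number of copying steps, the same polymerase fidelity yields a higher mutation frequency under this idealized geometric mode than under the stamping machine mode, consistent with the glossary note that mutation frequency depends on the mode of viral replication.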
the pathogenesis of measles nectin is the epithelial cell receptor for measles virus the global ecology and epidemiology of west nile virus nucleotide sequence heterogeneity of an rna phage population rates of spontaneous mutation viral mutation rates viral quasispecies rna virus populations as quasispecies expression of animal virus genomes fidelity of hiv- reverse transcriptase the accuracy of reverse transcriptase from hiv- incorporation fidelity of the viral rna-dependent rna polymerase: a kinetic, thermodynamic and structural perspective lack of evidence for proofreading mechanisms associated with an rna virus polymerase correlation between mutation rate and genome size in riboviruses: mutation rate of bacteriophage qβ mutational robustness of an rna virus influences sensitivity to lethal mutagenesis structure-function relationships underlying the replication fidelity of viral rna-dependent rna polymerases determinants of rna-dependent rna polymerase (in)fidelity revealed by kinetic analysis of the polymerase encoded by a foot-and-mouth disease virus mutant with reduced sensitivity to ribavirin poliovirus rna-dependent rna polymerase ( d pol ): pre-steady-state kinetic analysis of ribonucleotide incorporation in the presence of mg + remote site control of an active site fidelity checkpoint in a viral rna-dependent rna polymerase k r and k a substitutions in hiv- reverse transcriptase enhance polymerase fidelity by decreasing both dntp misinsertion and mispaired primer extension efficiencies poliovirus rna-dependent rna polymerase ( d pol ): kinetic, thermodynamic, and structural analysis of ribonucleotide selection structural dynamics as a contributor to error-prone replication by an rna-dependent rna polymerase mechanistic differences in rna-dependent dna polymerization and fidelity between murine leukemia virus and hiv- reverse transcriptases a role for dntp binding of human immunodeficiency virus type reverse transcriptase in viral mutagenesis mutational 
and fitness landscapes of an rna virus revealed through population sequencing sequence-specific fidelity alterations associated with west nile virus attenuation in mosquitoes coronaviruses: an rna proofreading machine regulates replication fidelity and diversity high fidelity of murine hepatitis virus replication is decreased in nsp exoribonuclease mutants insights into rna synthesis, capping, and proofreading mechanisms of sars-coronavirus rna '-end mismatch excision by the severe acute respiratory syndrome coronavirus nonstructural protein nsp /nsp exoribonuclease complex mutations in coronavirus nonstructural protein decrease virus replication fidelity deletion of human metapneumovirus m - increases mutation frequency and attenuates growth in hamsters distinct subsets of retroviruses encode dutpase characterization of equine infectious anemia virus dutpase: growth properties of a dutpase-deficient mutant increased mutation frequency of feline immunodeficiency virus lacking functional deoxyuridine-triphosphatase dutpase-minus caprine arthritis-encephalitis virus is attenuated for pathogenesis and accumulates g-to-a substitutions roles of uracil-dna glycosylase and dutpase in virus replication uracil dna glycosylase specifically interacts with vpr of both human immunodeficiency virus type and simian immunodeficiency virus of sooty mangabeys, but binding does not correlate with cell cycle arrest recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission evolution and ecology of influenza a viruses why do rna viruses recombine? the mechanism of rna recombination in poliovirus hiv- reverse transcription. 
cold spring harb a contemporary view of coronavirus transcription recombination in hepatitis c virus genetic recombination during coinfection of two mutants of human respiratory syncytial virus nucleoproteins and nucleocapsids of negative-strand rna viruses major changes in the g protein of human respiratory syncytial virus isolates introduced by a duplication of nucleotides nonhomologous rna recombination in a cell-free system: evidence for a transesterification mechanism guided by secondary structure noncytopathogenic pestivirus strains generated by nonhomologous rna recombination: alterations in the ns a/ns b coding region nonreplicative rna recombination in poliovirus nonreplicative homologous rna recombination: promiscuous joining of rna pieces? rna structural elements determine frequency and sites of nonhomologous recombination in an animal plus-strand rna virus nonhomologous recombination between defective poliovirus and coxsackievirus genomes suggests a new model of genetic plasticity for picornaviruses rna virus reassortment: an evolutionary mechanism for host jumps and immune evasion mutational analysis of hiv- long terminal repeats to explore the relative contribution of reverse transcriptase and rna polymerase ii to viral mutagenesis how does the genome structure and lifestyle of a virus affect its population variation? 
deoxyribonucleoside triphosphate pool imbalances in vivo are associated with an increased retroviral mutation rate cell tropism predicts long-term nucleotide substitution rates of mammalian rna viruses adars: viruses and innate immunity rna editing by adar prevents mda sensing of endogenous dsrna as nonself biased hypermutation of viral rna genomes could be due to unwinding/modification of doublestranded rna rna editing enzyme adenosine deaminase is a restriction factor for controlling measles virus replication that also is required for embryogenesis control of adar editing of hepatitis delta virus rnas apobecs and virus restriction perspective: evolution and detection of genetic robustness rna virus genetic robustness: possible causes and some consequences the role of mutational robustness in rna virus evolution ultradeep sequencing analysis of population dynamics of virus escape mutants in rnai-mediated resistant plants beyond the consensus: dissecting within-host viral population diversity of foot-and-mouth disease virus by using next-generation genome sequencing mutational robustness can facilitate adaptation the fittest versus the flattest: experimental confirmation of the quasispecies effect with subviral pathogens evolution of mutational robustness in an rna virus selection for robustness in mutagenized rna viruses the fitness effects of synonymous mutations in dna and rna viruses codon usage determines the mutational robustness, evolutionary capacity, and virulence of an rna virus endosymbiotic bacteria: groel buffers against deleterious mutations costs and benefits of mutational robustness in rna viruses the cost of replication fidelity in an rna virus the cost of replication fidelity in human immunodeficiency virus type viral mutation rates: modelling the roles of within-host viral dynamics and the trade-off between replication fidelity and speed rna virus population diversity, an optimum for maximal fitness and virulence increased fidelity reduces 
poliovirus fitness and virulence under selective pressure in mice quasispecies diversity determines pathogenesis through cooperative interactions in a viral population multiple host barriers restrict poliovirus trafficking in mice rna virus population diversity: implications for inter-species transmission viral population dynamics and virulence thresholds coxsackievirus b mutator strains are attenuated in vivo back to the future: revisiting hiv- lethal mutagenesis structure-activity relationships and design of viral mutagens and application to lethal mutagenesis hiv accessory proteins versus host restriction factors rna virus error catastrophe: direct molecular test by using ribavirin lethal mutagenesis of hiv with mutagenic nucleoside analogs viral error catastrophe by mutagenic nucleosides therapeutically targeting rna viruses via lethal mutagenesis engineering attenuated virus vaccines by controlling replication fidelity biochemical characterization of enzyme fidelity of influenza a virus rna polymerase complex base-seq: a method for obtaining long viral haplotypes from short sequence reads

key: cord- - qj ri authors: roux, simon; matthijnssens, jelle; dutilh, bas e. title: metagenomics in virology date: - - journal: reference module in life sciences doi: . /b - - - - . - sha: doc_id: cord_uid: qj ri

metagenomics, i.e., the sequencing and analysis of genomic information extracted directly from clinical or environmental samples, has become a fundamental tool to explore the viral world. against the background of an extensive viral diversity revealed by metagenomics across many environments, new sequence assembly approaches that reconstruct complete genome sequences from metagenomes have recently revealed surprisingly cosmopolitan viruses in specific ecological niches. metagenomics is also applied to clinical samples as a non-targeted diagnostic and surveillance tool.
by enabling the study of these uncultivated viruses, metagenomics provides invaluable insights into the virus-host interactions, epidemiology, ecology, and evolution of viruses across all ecosystems. historically, viruses have been primarily explored using laboratory cultivation: new viruses were obtained from clinical or environmental samples through propagation and isolation on cell cultures. this process is, however, biased and challenging to apply at large scales because (i) many viruses depend on host cells that are difficult to maintain as clonal culture in the laboratory, and (ii) even if the cells are available, propagating viruses may require specific conditions distinct from those used to cultivate the cells. these considerations are especially meaningful for viruses with microbial hosts, the vast majority of which remain uncultivated to date. metagenomics bypasses this requirement for cultivation and instead relies on the sequencing of viral genomic material extracted directly from a sample (see box and fig. ). thus far, the history of viral metagenomics has seen two major phases. initially, entire communities of viruses were assayed by analyzing and comparing short sequencing reads obtained from diverse environments. because of the fragmented nature of these data, most of these studies had to be conducted at the community scale, and identifying and distinguishing individual viruses in these datasets remained challenging. more recently, bioinformatic advances have enabled the reconstruction of individual viral genome sequences from metagenomes, allowing naturally occurring viruses to be identified and studied at high, genomic resolution. using a metagenomics approach, entirely new types of viruses can now be discovered, surveyed, and characterized even without cultivation. 
the unique ability offered by metagenomics to study uncultivated viruses led to the emergence of two parallel and interconnected fields: a clinical one, where metagenomics promises to be a catch-all method for the unbiased surveillance and diagnosis of viral pathogens, and one focused on natural biomes that aims to describe the diversity of the viral world and understand its ecological and evolutionary drivers and impacts. when the field of shotgun environmental metagenomics was pioneered in by the laboratory of forest rohwer at san diego state university, the first datasets consisted of three viral metagenomes (viromes) that together comprised just under , short genomic fragments derived from two natural marine viral communities and one human feces sample. while limited in scope and resolution, these and other early viromes provided an unprecedented view of complex viral communities in nature. both oceanic and human fecal viromes pointed toward the existence of an extensive virus diversity. this diversity of the virosphere was estimated by comparing the sequencing reads within each metagenome, and observing that almost every fragment was unique. moreover, comparing the short sequencing reads to a reference database of known viral genome sequences revealed that up to % were not similar to any known virus, suggesting that most of the virosphere was yet to be discovered. this uncharted genomic biodiversity became popularly known as "viral dark matter". in the years that followed, a broader range of environments was progressively surveyed using viromics, including freshwater lakes, hot springs, agricultural soils, and human skin, saliva, and gut samples. improvements in dna sequencing technologies, especially the advent of the popular pyrosequencing platform, which has since been surpassed and discontinued, increased the scale of these datasets by providing hundreds of thousands of short genomic fragments for each sample.
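the two read-level comparisons described above (how many fragments are unique within a metagenome, and how many resemble nothing in a reference database) can be illustrated with a toy calculation. this is a minimal sketch only: exact shared k-mers are used here as a simple stand-in for the similarity searches a real study would run, and the sequences, the value of `k`, and the function names are all invented for illustration.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_fraction(reads):
    """Share of reads that are distinct from one another."""
    return len(set(reads)) / len(reads)


def unknown_fraction(reads, references, k=8):
    """Share of reads that share no k-mer with any reference sequence."""
    reference_kmers = set()
    for reference in references:
        reference_kmers |= kmers(reference, k)
    unknown = [read for read in reads if not (kmers(read, k) & reference_kmers)]
    return len(unknown) / len(reads)


# Invented toy data: one "known" genome and five short reads.
references = ["ATGCGTACGTTAGCCGATAGCTAGGCTA"]
reads = [
    "CGTACGTTAGCC",      # drawn from the reference
    "GCCGATAGCTAG",      # drawn from the reference
    "TTTTGGGGCCCCAAAA",  # no match in the reference
    "GAGAGAGACTCTCTCT",  # no match in the reference
    "TTTTGGGGCCCCAAAA",  # duplicate of an unmatched read
]

print("unique reads:", unique_fraction(reads))
print("reads with no reference match:", unknown_fraction(reads, references))
```

in this toy setting the reads without a reference match play the role of "viral dark matter"; real analyses tolerate mismatches and use far larger databases.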
by directly comparing the sequences across these datasets, several studies indicated that virus genes tend to structure by environment rather than by sample location, implying that some of these genes may be globally distributed. in addition, when sampled from the same freshwater and hypersaline ponds across several days, weeks, and months, viral metagenomes revealed that the genetic composition of viral communities was coherent at a broad level, but some individual viral genes experienced rapid changes in relative abundance. while the analyses outlined above were foundational for our current understanding of virus diversity, they were limited by the short length of next-generation sequencing reads, which fragmented the view of viral genomes. these limitations were progressively overcome through an increase in sequencing throughput associated with improvements in sequence assembly and analysis tools. the first large-scale assemblies of viral genomes from short metagenomic fragments were published around , and quickly became a standard analysis in the viral metagenomic field so that by the year ∼ , complete or near-complete virus genomes were routinely reported and analyzed in viromics studies. in only a couple of years, metagenomics has thus transformed the way scientists can identify and study viruses in the environment, as illustrated by the quick rise of virus genomes and genome fragments assembled from metagenomes available in public databases (fig. ) . in , only viral genomes (fragments) assembled from metagenomes were publicly available, while this number reached , in , and , in . genome sequences of uncultivated viruses are frequently obtained not only from viral metagenomes, i.e., metagenomes from virus-targeted samples, but also from "bulk" metagenomes in which virus particles were not enriched and viral and microbial sequences are mixed.
combined with genome sequences obtained from isolates, these "uncultivated virus genomes" represent the foundation of an extensive mapping of the viral sequence space.

over the past few decades, a number of molecular techniques, such as (q)pcr or elisa, have been developed and used to detect pathogenic viruses in clinical samples. however, these techniques can only detect previously known viruses, and often require specific assays for each pathogen.

box: use of complementary methods to target different types of viruses

a number of approaches have been developed to specifically select and survey the genetic material contained by virus particles in a given sample. alternatively, viral genomes can also be analyzed from "bulk" metagenomes which include both virus particles and microbial cells. virus sequences obtained from "bulk" metagenomes will typically reflect viruses infecting their host cell at the time of sampling, either actively replicating or not, while viromes enable a deeper and more focused exploration of the virus diversity in a specific site or sample.

regardless of the type of sample, viromes are most often generated through a combination of centrifugation, filtration and dnase/rnase treatment, aiming at removing as much of the cellular genomes as possible (fig. ) . a typical protocol will notably include a filtration through . , . , or . µm membrane filters to remove bacteria and larger cells. depending on the initial concentration of virus particles, a concentration step using e.g., iron chloride (fecl), polyethyleneglycol (peg), or tangential flow filtration step(s), may be necessary to obtain enough material for sequencing library preparation. cesium chloride density gradient ultracentrifugation can also be used to further separate viruses from extracellular dna and large particles in complex samples, although this step can also lead to a substantial loss of viral material. finally, the virus particles obtained are typically treated with dnase or rnase to remove free dna and/or rna. depending on the type of virus studied, the corresponding protocols for rna or dna extraction and sequencing library preparation are then applied, after releasing the genetic material from the virus particle through e.g., a heat shock if necessary.

a critical step in this process is to recover enough material for sequencing. while micrograms of dna were initially needed, several protocols are now available which only require ∼ ng of dna. in addition, a dna/rna random amplification step, called "whole genome amplification", can also be conducted in order to gather enough input material. this type of approach was initially used in almost every virome study, and revealed important information for example on the unsuspected diversity of ssdna genome viruses in the environment (see below). however, the whole genome amplification process is inherently biased, and these datasets are not quantitative, i.e., one cannot draw any conclusion about the relative abundance of the viruses identified in these amplified metagenomes. thus, whole genome amplification methods have now often been replaced by advanced library preparation protocols which require nanogram-scale input but enable quantitative datasets well suited for ecological studies. alternatively, for cases in which target viruses represent a minor part of the templates, targeted sequence capture approaches have been used, mainly in a clinical framework as they can only be applied to viruses with known genomes but can detect these viruses with a very high sensitivity.

the recovery of virus genomes from bulk metagenomes and the recovery from viromes each have their own limitations. for bulk metagenomes, viruses typically represent only a minor fraction of all sequences compared to cellular genomes. this means that the virus genomes obtained this way will tend to be restricted to abundant viruses found in their host cells, while viruses that are not infecting at the time of sampling, viruses with a low frequency of infected hosts, or viruses infecting rare hosts will likely be missed. viromes provide a deeper description of the virus community, since most of the sequencing data will be obtained from virus genomes. in addition, virus particles will not represent only current infections but a more integrated sampling of all recent successful infections, the timing of which depends on the type of sample and the individual virus decay rate. yet viromes still suffer from several biases. notably, the size-based selection of virus particles excludes most of the larger viruses such as the "giant viruses", and viromes also tend to be dominated by viruses with high burst size while under-sampling viruses with low burst size and long infection time. all metagenomes (bulk and viromes) will struggle with very rare viruses, as well as hypervariable viruses whose genomes will not assemble well. hence complementary approaches such as targeted capture approaches for the former, and long read sequencing for the latter, are being developed (fig. ) .

overall, the different methods developed over the last decade to sequence genomes from uncultivated viruses are mostly complementary and can be individually tailored for specific applications. virus discovery can be achieved through bulk metagenomes or viromes, while viral ecology studies will tend to rely more on viromes as a reflection of virus activity and transport, and metagenomics used as a diagnostic tool in the clinic would be the most likely to use sequence capture. nevertheless, all these complementary approaches will be needed for achieving a comprehensive picture of viral diversity.
metagenomics instead offers the possibility to detect known and novel viruses without prior knowledge from a single analysis, and is thus well suited and already applied to study emerging and/or rare viruses, as well as cases which remain negative using the available diagnostic tests (see below). a number of challenges remain however for viral metagenomics to become a standard clinical procedure. first, given the current cost associated with sample processing and sequencing, metagenomes are still more expensive and slower than elisas or qpcr assays. second, there are no generally validated bioinformatics pipelines that can perform a rapid, sensitive, and specific analysis of the obtained data on a benchtop computer. finally, physicians will have to be trained and guided to deal with the obtained breadth of data. specifically, it is becoming clear that each individual is chronically infected with a dozen or more eukaryotic viruses (many of which have not been associated with any disease, e.g., anelloviruses), and that many known viral pathogens can also cause asymptomatic infections. therefore, a physician might get a list of viruses (and other potentially pathogenic or unknown organisms), and it will be a challenge to identify the actual cause of a particular disease. nevertheless, the price of sample preparation and high throughput sequencing has declined enormously in the last decade, including with the development of smaller and faster machines, while automatic virome analysis pipelines are being actively developed, so that metagenomics will likely be available in the near future as a routine test allowing physicians to get a viral diagnosis from a biological sample in a matter of minutes to hours in their home office or at the clinic bedside.
fig. overview of the viral metagenomics workflow. the overall process used to generate and analyze viral metagenomes can be divided into four major steps: (i) collection of environmental and/or clinical sample, (ii) sample preparation, (iii) library preparation, and (iv) sequence analysis. the sample preparation step can target either the cellular fraction (left) or the viral fraction (right) in which case viral particles are often further concentrated and purified to remove free nucleic acids. *targeted sequence capture can be applied to the extracted dna/rna to enrich for a specific virus. **while whole genome amplification was initially used routinely for viral metagenomes, it has now been supplanted by methods enabling the preparation of more quantitative libraries from low input (∼ ng), hence whole genome amplification is now primarily used in single-cell or single-virus-particle experiments. ***the genome assembly can be bypassed if using long-read sequencing technologies, although these long-read datasets require a more careful error correction. ****genome binning, i.e., the identification of multiple contigs assembled from a metagenome and corresponding to the same genome, is typically only used for large genomes (e.g., kb), and individual contigs are directly analyzed instead for most viruses.
currently, metagenomics is most often used in a diagnostic context when both conventional and enhanced molecular testing fail to identify a causative agent in a sample. these cases can represent a significant fraction of patients for diseases such as acute diarrhea, for which an etiological agent is identified in only ∼ % of cases. in this framework, metagenomic analysis can lead to the discovery of unexpected or novel viruses that are associated with a specific set of symptoms. first, metagenomics can successfully identify known viruses in unexpected sample types.
these studies include the detection of enterovirus d in clinical samples (rectal, throat, and oral swabs as well as blood samples) in cases of acute flaccid paralysis, the detection of herpes simplex virus (hsv- ) in cerebrospinal fluid samples of a patient with encephalitis, and the detection of mumps vaccine virus from the brain biopsy of a patient with chronic encephalitis. in addition, new human pathogens only distantly related to known viruses have also been discovered with metagenomics. these include the bas-congo virus, a rhabdovirus that was associated with a hemorrhagic fever outbreak in the african congo, as well as novel rhinovirus, bocavirus, arenavirus, and parechoviruses. finally, entirely novel types of potentially pathogenic viruses have been described through metagenomics, including previously unknown cycloviruses, cosaviruses, and klasseviruses. diagnostics through viral metagenomics has also been applied to non-human animals as well as plants, and similarly revealed new potential viral pathogens in organisms showing unexplained symptoms. multiple new virus types including novel parvoviruses, polyomaviruses, sapoviruses, and picornaviruses were for example identified in livestock samples (porcine and bovine), while a large diversity of persistent rna viruses were newly identified across several groups of plants. however, it is important to note that the detection of a (novel) virus in a sample from a patient with an illness of unknown etiology does not prove causation, even in cases of a demonstrated significant association between the presence of the virus sequence and the observed symptoms. hence, metagenomics will often be the first step of a longer process involving attempts to propagate the virus in culture, or monitoring healthy individuals exposed to the suspected pathogen (see below "future of viral metagenomics: major challenges and upcoming innovations"). 
fig. a. the total number of genomes from isolates was based on queries to the ncbi nucleotide database portal, while the number of uncultivated virus genomes was estimated by compiling data from the literature and from the img/vr database. the number of sequences is displayed on a log scale. b. comparison of complete viral genomes assembled from viral metagenomes sampled from the indian, pacific, and atlantic oceans, through the tara oceans expedition. these sequences were identified and analyzed as part of the "global ocean virome" dataset (gov). predicted genes are colored by functional annotation. c. overview of the host predictions available for uncultivated virus genomes in the img/vr database. host prediction was based on signals including sequence similarity with isolate viruses, prophages, and crispr spacers derived from known bacterial and archaeal genomes.
in parallel to the diagnostic application, metagenomics is also very well suited for environmental surveillance. species representing important reservoirs of viruses with high zoonotic pandemic potential such as mosquitoes, rodents, and bats have been specifically targeted in this context. a recent study investigating the virome of more than invertebrate species (a fraction of known invertebrate species) identified more than , novel rna viruses, exemplifying that the diversity of unknown eukaryotic viruses in the environment is enormous and only poorly characterized. since the majority of human pandemics have a zoonotic origin, one hope is that such metagenomic surveillance will allow a faster identification of novel pandemic viruses during outbreaks, as well as identify their natural reservoirs. this knowledge is crucial for an appropriate and fast response from a medical and global health perspective. as an example, in the last two decades zoonotic coronaviruses were able to jump from bats to humans and pigs.
both the sars (severe acute respiratory syndrome) and mers (middle east respiratory syndrome) viruses caused large-scale disease outbreaks in humans, whereas sads (swine acute diarrhea syndrome) caused an epidemic in the swine industry. ongoing efforts to characterize the virome of such reservoir animals will facilitate the implementation of control measures to prevent epidemics or enforce appropriate actions to stop ongoing epidemics. in an ideal situation, environmental virome data in combination with biochemical experiments could help with the early identification of candidate viruses with the potential to transfer to a human host. for instance, a combination of metagenomics and dna synthesis-based experiments revealed that a novel coronavirus (wiv -cov) initially detected in bat samples could be primed for transfer and emergence into human hosts. metagenomic analysis can also be leveraged in response to viral outbreaks, for example to rapidly determine viral subtypes in a novel infection source. this has been applied to cases of influenza infections as well as to a novel wild type ebola virus outbreak, for which metagenomic approaches could correctly identify the causative agent, even in cases where traditional methods were unsuccessful because the wild type virus was too distantly related to known ebola viruses. a correct and rapid identification of these viruses could enable the application of the correct therapeutics and guide preventive efforts against potential epidemics. while viruses of humans, animals, and plants may have direct clinical or economic relevance, the vast majority of the (estimated) virus particles on earth infect micro-organisms, including bacteria, archaea, protists, fungi, and other environmental microbes.
initial studies of environmental viral diversity focused on human feces, coastal and open ocean, freshwater lakes, as well as hypersaline and hot geothermal ponds, because protocols for efficient separation of virus particles from microbial cells were first developed for aquatic samples. importantly though, recent innovations and technology improvements now enable application of viromics to more complex samples such as soil, groundwater, or ice cores, helping to expand our view of global viral diversity both in the human microbiome and in the environment. a striking example of a viromics discovery is that of a highly abundant bacteriophage, named "crassphage", that was assembled from a set of human fecal viromes in . the crassphage genome was identified by combining information from individual viromes, which yielded a high-confidence kb sequence with matching ′ and ′ ends, suggesting that it represented a complete circular genome. this crassphage genome was mostly unrelated to any isolated phage genome known at the time: from the predicted proteins, less than half ( ) were even remotely similar to known proteins or domains, and only had a predicted function, such as "phage structural protein" or "dna helicase". while clearly novel, crassphage was also found to be uniquely abundant and ubiquitous: its genome was detected across metagenomes, primarily from human feces, at average levels that were six times higher than all other known phages combined. by applying several independent computational host-prediction approaches, a bacterial host (bacteroides) was predicted. thus, in this instance, metagenomics revealed what remains to date the most abundant and widespread phage associated with the human gut microbiome, which had until then evaded detection through classical approaches like laboratory cultivation and pcr. 
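the observation that matching contig ends suggest a complete circular genome can be illustrated with a short sketch. this is a minimal example under stated assumptions, not the procedure used in the crassphage study: the function names `terminal_overlap` and `trim_circular`, the `min_len` cutoff, and the toy sequence are all hypothetical, and only exact matches between the two ends are considered.

```python
def terminal_overlap(contig: str, min_len: int = 10) -> int:
    """Return the length of the longest prefix of `contig` that is repeated
    exactly at its end, or 0 if no repeat of at least `min_len` bases exists.

    Assumption: exact matching only; real assemblies may need to tolerate
    sequencing errors.
    """
    seq = contig.upper()
    # Overlaps longer than half the contig are not considered in this sketch.
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def trim_circular(contig: str, min_len: int = 10) -> str:
    """Remove the duplicated end so each base of the putative circle appears once."""
    k = terminal_overlap(contig, min_len)
    return contig[:-k] if k else contig


if __name__ == "__main__":
    # Hypothetical 34-base "genome"; the assembler is imagined to have read
    # through the circle and repeated the first 12 bases at the contig end.
    unit = "ATGGCGTACCTGAAGTCCGATTGACCGTAGGCTA"
    contig = unit + unit[:12]
    print(terminal_overlap(contig))
    print(trim_circular(contig) == unit)
```

a terminal repeat of this kind is only suggestive: linear genomes with direct terminal repeats produce the same signal, so the result is best read as a hint of completeness rather than proof of circularity.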
assembling genomes of uncultivated viruses can not only identify some of the most abundant and widespread viruses in an ecosystem, but these sequences also represent foundational data for targeted follow-up experiments aimed at further characterizing these novel viruses. in the case of crassphage, two major studies leveraged this initial genome sequence to better understand the diversity and host of these phages. first, predicted proteins from the original crassphage genome were used as "bait" to identify related phages in a broad range of metagenomes. this revealed an extensive and diverse group of "crass-like" phages predicted to represent a new family within the caudovirales order, that may be related to podoviridae. genome comparisons within this new family also enabled the identification of conserved structural and replication gene modules. meanwhile, another study was able to isolate a member of the crassphage-like family through broth enrichment on bacteroides intestinalis strains isolated from human gut samples, confirming the computational predictions from bioinformatic analyses that these phages were likely infecting bacteroidetes hosts and had a podoviridae-like morphology. in , a comprehensive effort to chart viral diversity across the global oceans yielded a similar observation. this study detected more than , viral genome fragments, and grouped them into clusters of closely related viruses, approximately consistent with genera in the viral taxonomy. two out of the four most highly abundant and ubiquitous clusters were entirely novel and had not been described before, while the other two were similar to known bacteriophages. with viral metagenomics being applied to a larger set of samples and environments, and with bioinformatic analyses including genome assembly and interpretation constantly improving, novel groups of dominant and widespread viruses may thus be progressively revealed across many environments. 
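grouping viral genome fragments into clusters of closely related viruses, as described above for the global ocean survey, can be sketched in a few lines. the example below is only an illustration and makes several assumptions that are not taken from that study: similarity is measured as the jaccard index of shared k-mers, clusters are built by single linkage, and the names `kmers`, `jaccard`, `cluster_contigs` as well as the values of `k` and `threshold` are hypothetical. published analyses typically rely on shared gene content or nucleotide identity instead.

```python
from __future__ import annotations

from itertools import combinations


def kmers(seq: str, k: int = 4) -> set[str]:
    """Return the set of all k-mers present in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index: shared k-mers divided by all k-mers seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_contigs(contigs: dict[str, str], k: int = 4,
                    threshold: float = 0.5) -> list[set[str]]:
    """Single-linkage clustering: two contigs are linked when their k-mer
    Jaccard similarity reaches `threshold`; linked contigs share a cluster."""
    parent = {name: name for name in contigs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    profiles = {name: kmers(seq, k) for name, seq in contigs.items()}
    for a, b in combinations(contigs, 2):
        if jaccard(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for name in contigs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


if __name__ == "__main__":
    # Toy sequences: v1 and v2 differ by one base, v3 is unrelated.
    toy = {
        "v1": "ACGATCGGCATCAGGCA",
        "v2": "ACGATCGGCATCAGGCT",
        "v3": "TTTTTTTTTT",
    }
    print(cluster_contigs(toy))
```

the all-against-all comparison scales quadratically with the number of contigs, so a dataset of the size discussed here would need an indexed or sketch-based similarity search rather than this direct loop.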
another group of viruses whose known diversity has been vastly expanded through metagenomics are the so-called "giant viruses", dsdna viruses with a uniquely large virion (∼ . - µm) and genome (often mb), blurring the boundaries between "simple" viruses and "complex" cellular life. following the isolation and characterization of the first giant virus in ("acanthamoeba polyphaga mimivirus"), around other members of this group have been isolated, the vast majority by using an acanthamoeba host. however, metagenome analyses suggest that the true diversity of giant viruses vastly exceeds the number of isolates. as early as , an analysis of metagenomes revealed that giant viruses could be found in the ocean at concentrations of ∼ genomes/ml. these initial studies were based on the detection of marker genes, since the technologies available at the time did not enable the assembly of complex and large genomes like those of giant viruses. more recently, four complete or near-complete giant virus genomes could be assembled from metagenomes of a wastewater treatment plant. this revealed a new subgroup of giant viruses named klosneuviruses which comprised some genomes with the largest set of translation system components found at the time in any virus, including aminoacyl transfer rna synthetases with specificity for all amino acids. undoubtedly, as our collective ability to assemble large genomes from metagenomes increases, the giant virus diversity will keep expanding. while most sequencing technologies are designed for dsdna templates (see box ), our knowledge of single-stranded dna (ssdna) and rna viruses has also been transformed by metagenomics. in both cases, specific sample processing steps are required to access these genomes; however, their relatively short length (usually < kb) means that complete genomes are routinely assembled from total community shotgun metagenomes that target all the nucleic acids in an environment.
as for dsdna viruses, metagenomics revealed that these ssdna and rna viruses were much more diverse and broadly distributed than previously inferred from isolation and cultivation approaches. enrichment for circular ssdna viruses can be achieved through phi -based whole genome amplification, which is known to over-amplify small circular ssdna templates. pragmatically, this translates into viral metagenomes that are dominated by ssdna viruses with circular genomes, which helped shed a new light on the diversity of two major groups: bacteriophages from the microviridae family, and eukaryotic viruses from the cress dna (circular rep-encoding ssdna) supergroup. the latter saw the more striking expansion: until , these viruses were known exclusively in plants and vertebrates, specifically pigs and birds, yet in less than a decade, cress dna viruses were detected in metagenomes sampled from primates, arthropods, and unicellular protists, as well as diverse aquatic, terrestrial, and man-made ecosystems. hence, while the exact host range and impact of these viruses remain to be fully characterized, metagenomics already revealed that ssdna viruses are ubiquitous and can be found associated with all types of cellular hosts. for rna viruses, several additional sample processing steps have to be performed to preferentially sequence viral rna, typically including reverse transcription and random amplification. the most comprehensive study of rna virus diversity to date included samples from invertebrate species across phyla, and led to the discovery of nearly , novel viruses across the major clades of rna viruses. in addition, the assembly of complete genomes provided new insights on the recombination patterns of these viruses, highlighting a remarkable propensity of rna viruses to exchange or acquire genes horizontally, both with other viruses and with their host. 
rna viruses were also detected in a much broader host range than currently known from isolates, although these host associations now have to be confirmed through laboratory experiments since virus detection in metagenomes does not equate to active infection. improvements in metagenomics protocols post ∼ enabled the analysis of dozens of samples in parallel. in the field of viral metagenomics, this increased capacity has been leveraged specifically to analyze viral signal along time series and thus investigate virus-host dynamics in nature. such datasets have notably been obtained from freshwater lakes, for which recurrent sampling across months or years can be done, and which usually harbor a high concentration of viruses. these first explorations of environmental viral diversity across months and seasons indicated that viruses display a large range of relative abundance patterns, from "ephemeral" ones with a single peak in abundance to "constitutive" ones detected in virtually all samples. some of these patterns were seasonal and possibly linked to similar patterns of abundance for their microbial hosts, while other viruses displayed drastic changes from one year to the next. for instance, although longitudinal virome studies of the human gut are scarce, available data suggest a rather stable population of gut viruses (almost exclusively phages) in adults over time, whereas the infant gut virome is much more variable and may be dominated by eukaryotic viruses at particular time points coinciding with an acute enteric infection. time series metagenomes are especially interesting to discover and predict virus-host associations, and to analyze dynamics of known virus-host pairs. the former approach already provided host prediction for several giant viruses that are so far known exclusively from metagenome assemblies, and suggested that these may be linked to uncultivated protist hosts.
the latter raised the intriguing possibility of complex and diverse virus-host relationships occurring in nature: while the expected patterns would be a strong correlation between virus and host abundances with possibly a short lag in the virus signal in a typical predator-prey fashion, these large-scale metagenome time series instead suggested that some of the viruses could peak prior to a peak in abundance of their host, while other virus-host pairs showed no similarity in relative abundance at all. these conflicting results likely reflect the complex interactions at play between viruses and microbes in nature, including variable host ranges, from viruses infecting a unique host strain to others infecting multiple host species sometimes across different genera, as well as the spectrum of infection dynamics from fast-acting lytic viruses to slower, temperate, and even chronic ones, and the development of resistance to the virus among the host population. despite these numerous challenges in their analysis, time series metagenomes are poised to become a key approach to complement laboratory experiments and untangle the intricate relationships between viruses and their hosts. metagenomics has quickly become a major tool for exploring viral diversity, yet several challenges need to be addressed in order to fully leverage the potential of these methods. first, metagenomes built from limited input material are still difficult to reliably obtain and interpret, and do not yet provide a comprehensive and quantitative view of the viral community present in the sample. this includes environments with low biomass such as some human tissues, hydrothermal vents, ice cores, or ancient samples, but also samples with a thick substrate or matrix to which cells and virus particles tend to adhere such as human lung mucus or coral samples. 
improvement in the recovery of cells and virions from these types of substrates and in the generation of quantitative libraries from sub-nanogram input will help better survey these viral communities. the second major challenge lies in the absence of direct host information for genomes assembled from metagenomes. in a clinical context, this means that one of koch's postulates, which requires that the candidate etiological agent be isolated from a diseased organism and grown in pure culture, cannot be fulfilled. already, several smacoviruses which had been detected in metagenomes of human samples and suspected to represent new human viral pathogens have been found to likely infect prokaryotic cells from the human microbiome instead. in a similar way, evidence is emerging that picobirnaviruses, which are believed to be eukaryotic viruses, might actually infect bacterial cells. these examples should thus serve as a cautionary tale when trying to detect entirely new viral pathogens from mixed samples containing both human and microbial cells. a modified koch's postulate for the metagenomic era has been proposed in which potential new pathogens must first be present and more abundant in the diseased subject compared to matched controls. then, experiments using either a sample from a diseased subject or an artificial virus obtained through dna synthesis and expression in cell cultures must be performed to demonstrate that this agent induces disease in another healthy subject. while not trivial, these additional experiments based on metagenomic results could still lead to the identification of viral pathogens much more quickly than classic culture techniques. in an ecological context, associating uncultivated viruses with their hosts is also critical to understand their impact on microbial communities and to meaningfully integrate viruses into ecosystem models.
because viral ecology studies typically include hundreds to thousands of viruses of interest, these host associations are typically derived from in silico approaches based on various types of genome sequence comparison. while methods for in vitro confirmation of these metagenome-derived virus-host pairs are currently being developed, they will need to improve both in terms of scale and resolution to provide meaningful host association for the vast diversity of uncultivated viruses. among the expected technological improvements, two stand out as likely to benefit the field of viral metagenomics in the near future. first, long-read sequencing technologies are progressively becoming amenable to the sequencing of environmental viral communities. pragmatically, this means that instead of having to assemble virus genomes from short reads, a process which can yield potentially erroneous and/or incomplete genome sequences, a complete viral genome could be sequenced as a single read. once broadly available, these long-read metagenomes will not only bypass assembly issues but also provide valuable information about virus genome evolution by enabling whole-genome phasing of polymorphisms. meanwhile, in an epidemiological context, long-read sequencing technologies associated with miniaturized devices, streamlined sample preparation, and live scanning of the sequencing results offer unique possibilities for real-time surveillance or diagnostics. this is especially the case for the minion sequencer based on nanopore sequencing technology, allowing the identification of viral pathogens from a patient sample in less than h, compared to more than h for other sequencing technologies. the computational framework to analyze and share these types of data in a timely, safe, and meaningful way remains to be built; however, it is likely that metagenomics through portable genome sequencers will become a major component of the epidemiological toolkit in the near future.
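one of the in silico host association signals mentioned in this review, similarity between crispr spacers of bacterial or archaeal genomes and viral sequences, can be sketched as follows. this is a simplified illustration under stated assumptions: the function `predict_hosts`, its input dictionaries, and all sequences are hypothetical, and only exact spacer matches on either strand are reported, whereas practical pipelines usually tolerate a small number of mismatches and combine several independent signals.

```python
from __future__ import annotations

# Translation table used to complement unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def predict_hosts(viral_contigs: dict[str, str],
                  host_spacers: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each viral contig to the hosts whose CRISPR spacers occur in it.

    Assumption: a spacer counts as a hit only if it matches the contig
    exactly, on the forward or the reverse strand.
    """
    predictions: dict[str, set[str]] = {}
    for virus, contig in viral_contigs.items():
        contig = contig.upper()
        hits: set[str] = set()
        for host, spacers in host_spacers.items():
            for spacer in spacers:
                spacer = spacer.upper()
                if spacer in contig or reverse_complement(spacer) in contig:
                    hits.add(host)
                    break  # one matching spacer is enough for this host
        predictions[virus] = hits
    return predictions


if __name__ == "__main__":
    spacer_1 = "ACGTTGCAAGGCTTAACCGGTATCGA"  # hypothetical spacer from host_1
    spacer_2 = "CCATGGTTAACCGGATATCCAGTC"    # hypothetical spacer from host_2
    contigs = {
        "phage_A": "TTGACC" + spacer_1 + "GGA",
        "phage_B": "AAA" + reverse_complement(spacer_2) + "TTT",
        "phage_C": "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGG",
    }
    spacers = {"host_1": [spacer_1], "host_2": [spacer_2]}
    for virus, hosts in predict_hosts(contigs, spacers).items():
        print(virus, sorted(hosts))
```

an empty set for a contig means only that no spacer evidence was found, not that the virus lacks a host among the genomes searched, which is consistent with the caution above that such predictions still need experimental confirmation.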
complementarily, the throughput of sample preparation protocols and short-read sequencing approaches is likely to keep increasing at a fast pace. concretely, these technological improvements will translate into a lower cost per sample, and an increased ability to process hundreds of samples in parallel in a timely fashion, in particular through laboratory robotics automation. for the detection of viral pathogens as well as the exploration of viral diversity and virus-host interactions in nature, this increased throughput will provide the opportunity to generate e.g., high-resolution time-series, possibly including paired cellular and viral size fractions with multiple replicates per sample, enabling more robust and sensitive data analyses. eventually, a fully developed virus metagenomics toolkit will enable the accurate identification of viruses in natural, clinical, and biotechnological samples for monitoring and diagnostics purposes. moreover, as bioinformatics analysis tools advance, the reconstruction of full viral genome sequences will allow predictions to be made for the most important viruses in different environments, leading to the reconstruction of environmental virus-host networks and, when combined with other 'omics' approaches, the comprehensive evaluation of viral activity across an entire ecosystem. 
collectively, these studies should lead to a deeper understanding of viral impacts on ecological, evolutionary, and metabolic processes as well as information on potentially new viral pathogens and putative molecular virus-host interactions which could then be further characterized through targeted experiments.
key: cord- - pjolkql authors: liu, y.-t. title: infectious disease genomics date: - - journal: genetics and evolution of infectious diseases doi: . /b - - - - . -x sha: doc_id: cord_uid: pjolkql the history and development of infectious disease genomics have been closely associated with the human genome project (hgp) during the past years.
it has been emphasized since the beginning of the hgp that such effort must not be restricted to the human genome and should include other organisms including mouse, bacteria, yeast, fruit fly, and worm for comparative sequence analyses. a brief history is reviewed in this chapter. as of , more than completed genome sequencing projects have been reported. one of the important motivations for these efforts is to develop preventative, diagnostic, and therapeutic strategies through the analysis of sequenced microorganisms, parasites, and vectors related to human health. a number of examples are discussed in this chapter. the history and development of infectious disease genomics are closely associated with the human genome project (hgp). a series of important discussions about the hgp were held from to , which led to the appointment of a special national research council (nrc) committee by the national academy of sciences to address needs and concerns such as its impact, leadership, and funding sources. the committee recommended that the united states begin the hgp in . they emphasized the need for technological improvements in the efficiency of gene mapping, sequencing, and data analysis capabilities. in order to understand potential functions of human genes through comparative sequence analyses, they also advised that the hgp must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. in the meantime, the office of technology assessment (ota) of the us congress also issued a similar report to support the hgp. in , the department of energy (doe) and the national institutes of health (nih) jointly presented an initial -year plan for the hgp. in october , the sanger center/institute (hinxton, uk) was officially opened to join the hgp. the cost of dna sequencing was about $ to $ per base in and the initial aim was to reduce the costs to less than $ . per base before large-scale sequencing.
the sequencing cost gradually declined during the subsequent years. in , the national human genome research institute (nhgri) challenged scientists to achieve a $ , human genome ( gb/haploid genome) by and a $ genome by to meet the need of genomic medicine. in early , illumina announced that the company would begin producing a new system to deliver full coverage human genomes for less than $ , . the first complete genome to be sequenced was the phix bacteriophage ( . kb) by sanger's group in . the complete genome sequence of sv polyomavirus ( . kb) was published in . the human epstein-barr virus ( kb) genome was determined in . the first completed free-living organism genome was haemophilus influenzae ( . mb), sequenced through a whole-genome shotgun approach in . the second sequenced bacterial genome, mycoplasma genitalium ( kb), was completed in less than month in the same year using the same approach. the doe was the first to start a microbial genome program (mgp) as a companion to its hgp in . the initial focus was on nonpathogenic microbes. along with the development of the hgp, there was exponential growth of the number of completely sequenced free-living organism genomes. the fungal genome initiative (fgi) was established in to accelerate the slow pace of fungal genome sequencing since the report of the genome of saccharomyces cerevisiae in . one of the major interests was to sequence organisms that are important in human health and commercial activities. with the explosion in the number of sequenced genomes, thanks to the development of next-generation sequencing methods, many genome-based studies have become popular. compared to years ago when only completed genome projects were documented, the gold (genomes online database) contains information for , genome-sequencing projects, of which were completed, as of august . the genomes of human malaria parasite plasmodium falciparum and its major mosquito vector anopheles gambiae were published in .
historically, the effort to sequence the malaria genome began in by taking advantage of a clone derived from a laboratory-adapted strain. notably, many parasites have complex life cycles that involve both vertebrate and invertebrate hosts and are difficult to maintain in the laboratory. a few other important human pathogenic parasites, such as trypanosomes, leishmania, and schistosomes, have been either completely or partially sequenced. in the meantime, the genome sequence of aedes aegypti, the primary vector for yellow fever and dengue fever, was published in . the genome size ( mb) of this mosquito vector is about times larger than the previously sequenced genome of the malaria vector a. gambiae. about % of the genome consists of transposable elements. in , the genome sequence of the body louse (pediculus humanus humanus), an obligatory parasite of humans and the main vector of epidemic typhus (rickettsia prowazekii), relapsing fever (borrelia recurrentis), and trench fever (bartonella quintana), was reported. its mb genome is the smallest among the known insect genomes. subsequently, more vector genomes have been published. genome-sequencing projects for other important human disease vectors are in progress. these include culex pipiens (mosquito vector of west nile virus), and ixodes scapularis (tick vector of lyme disease, babesia and anaplasma). the challenge of sequencing the genome of an insect vector is much greater than that of sequencing a microbe. for example, the genome of ticks was estimated to be between and gb and may have a significant proportion of repetitive dna sequences, which may be a problem for genome assembly. furthermore, the evolutionary distances among insect species may also affect homology-based gene predictions. it is as important to understand the sequence diversity within a species as to perform a de novo sequencing of a reference genome from the perspective of human health. this is true for both hosts and pathogens.
the goal of the genomes project is to find most genetic variants that have frequencies of at least % in the human populations studied. one of the similar efforts for human pathogens is the nih influenza genome sequencing project. when this project began in november , only seven human influenza h n isolates had been completely sequenced and deposited in the genbank database. as of may , more than human and avian isolates had been completely sequenced, including the "spanish" influenza virus. databases for human immunodeficiency virus (hiv) and hepatitis c virus have also been established. while most human studies of microbes have focused on the disease-causing organisms, interest in resident microorganisms has also been growing. in fact, it has been estimated that the human body is colonized by at least times more prokaryotic and eukaryotic microorganisms than the number of human cells. it was suggested that "the nd human genome project" be undertaken to sequence the human microbiome. highly variable intestinal microbial flora among normal individuals has been well documented. therefore, the human microbiome project (hmp) was initiated by the nih in late . the analysis and data of healthy adults at (for males) or (for females) body sites over months were published in .

the completed or ongoing genome projects (table . ) provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. specific examples are provided to illustrate how the information provided by various genome projects may help achieve the goal of promoting human health.

meningococcal isolates produce one of antigenically distinct capsular polysaccharides, but only five (a, b, c, w , and y) are commonly associated with disease.
the polysaccharide capsule is important for meningococci to escape from complement-mediated killing. while conventional vaccines consisting of the conjugation of capsular polysaccharides to carrier proteins for meningococcus serogroups a, c, y, and w- have been clinically successful, the same approach failed to produce a clinically useful vaccine for serogroup b (menb). the capsule polysaccharide (a e -n-acetylneuraminic acid) of menb is identical to human polysialic acid and is therefore poorly immunogenic. alternatively, vaccines consisting of outer-membrane vesicles (omvs) have been successfully developed to control menb outbreaks in areas where epidemics are dominated by one particular strain. the most significant limitation of this type of vaccine is that the immune response is strain specific, mostly directed against the porin protein, pora, which varies substantially in both expression level and sequence across strains. with the completion of the genome sequence of a virulent menb strain, a "reverse vaccinology" approach was applied for the development of a universal menb vaccine by novartis. through bioinformatic searching for surface exposed antigens, which may be the most suitable vaccine candidates due to their potential to be readily recognized by the immune system, open reading frames (orfs) were selected from a total of orfs of the mc genome. eventually, five antigens were chosen as the vaccine components based on a series of criteria including the ability of candidates to be expressed in escherichia coli as recombinant proteins ( candidates), the confirmation of surface exposure by immunological analyses, the ability to induce protective antibodies in experimental animals ( candidates), and the conservation of antigens within a panel of diverse meningococcal strains, primarily the disease-associated menb strains.
the vaccine formulation consists of an fhbp-gna fusion protein, a gna -gna fusion protein, nada, and omvs from the new zealand menzb vaccine strain, which contains the immunogenic pora. initial phase ii clinical results in adults and infants showed that this vaccine could induce a protective immune response against three diverse menb strains in – % of subjects following three vaccinations and – % after four vaccinations. this vaccine (bexsero) has been approved in the usa and in more than other countries.

natural products, especially microbial secondary metabolites, are an important source of bioactive compounds. actinomycetes have been a main source of natural-product discovery in bacteria. consequently, a high rediscovery rate of known compounds and scaffolds was inevitable with activity-based screening. genome mining of gene clusters that produce secondary metabolites has been a new approach to overcome this problem. for example, an antibiotic, clostrubin, was discovered through a search for novel compounds in clostridium beijerinckii, prompted by the presence of several cryptic gene clusters for secondary metabolite biosynthesis. genome mining starts with a genome-wide search for highly conserved members of the required biosynthesis gene cluster. computational programs that support the prediction of operons help to assign boundaries of newly identified biosynthesis gene clusters. a large-scale, high-throughput genome mining for the genetic potential for producing phosphonic acids by screening more than , actinomycetes has been achieved in . it was believed that phosphonates would have greater potential to become pharmaceuticals, with a past commercialization rate of % ( / ), such as fosfomycin, compared to the . % average for natural products as a whole.
in addition, bioinformatic discovery of phosphonate biosynthetic loci has been well established, as all but two previously characterized phosphonate biosynthetic pathways start with phosphoenolpyruvate (pep) mutase that is encoded by pepm. among , actinomycetes, only strains were confirmed to have pepm by polymerase chain reaction (pcr) screening and genome sequencing. a diverse collection of phosphonate biosynthetic gene clusters was identified within these strains. remarkably, out of the distinct clusters would direct the synthesis of unknown compounds. characterization of strains within five of these groups resulted in discovery of argolaphos, and other interesting compounds, including valinophos, and phosphonocystoximate. argolaphos showed broad-spectrum antibacterial activity against salmonella typhimurium, e. coli, and staphylococcus aureus.

targeting an essential pathway is a necessary but not sufficient requirement for an effective antimicrobial agent. identification of essential genes in a completely sequenced genome has been actively pursued with various approaches. the indispensable fatty acid synthase (fas) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents. the subcellular organization of the fatty acid biosynthesis components is different between mammals (type i fas) and bacteria (dissociated type ii fas), which raises the likelihood of host specificity of the targeting drugs. comparison of the available genome sequences of various species of prokaryotes reveals highly conserved fas ii systems suggesting that the antimicrobial agent can be broad spectrum. in addition, through computational analyses, new members of the fas ii system have been discovered in different bacterial species. one of the protein components in this system, fabi, is the target of an antituberculosis drug, isoniazid, and a general antibacterial and antifungal agent, triclosan.
through a systematic screening of , natural product extracts, a merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from streptomyces platensis and is a selective fabf/b inhibitor in the fas ii system. treatment with platensimycin eradicated s. aureus infection in mice. platensimycin showed no cross-resistance with other antibiotic-resistant strains in vitro, including methicillin-resistant s. aureus, vancomycin-intermediate s. aureus, and vancomycin-resistant enterococci. no toxicity was observed using a cultured human cell line and the activity of platensimycin was not affected by the presence of human serum in this study. however, the fas ii system appears to be dispensable for another gram-positive bacterium, streptococcus agalactiae, when exogenous fatty acids are available, such as in human serum. the susceptibility to inhibitors targeting the fas ii system indicates heterogeneity in fatty acid synthesis or in acquiring exogenous fatty acids among gram-positive pathogens. comparative genomic approaches may be useful to identify and develop a strategy to target the salvage pathway for s. agalactiae. alternatively, similar approaches as described earlier for menb vaccine may also be applied for s. agalactiae (group b streptococcus).

emergence of drug-resistant malaria to chloroquine in s and sulfadoxine-pyrimethamine in s occurred from western cambodia to the greater mekong subregion (gmsr, including cambodia, lao, myanmar, thailand, and vietnam) and to africa. the finding of artemisinin-resistant malaria in cambodia and gmsr raised a concern regarding the global spread of these parasites. while a number of studies, including population genetics and laboratory-based investigations, were conducted, no reliable molecular marker was identified until the major breakthrough reported in early .
clinical artemisinin resistance has been defined as a reduction of parasite-clearance rate, which is expressed as an increase of parasite-clearance half-life, or a persistence of microscopically detectable parasites days after artemisinin-based combination therapy (act). although artemisinin was thought to have broad-stage specificity against malaria throughout the life cycle, it was shown that artemisinin-resistant parasites had decreased artemisinin susceptibility only at ring stages, which was demonstrated by the ring-stage survival assay (rsa – h). an in vitro laboratory-based approach was conducted at a time when population-based genome-wide association studies (gwas) did not clearly identify the genes responsible for artemisinin resistance. for years, an artemisinin-resistant f -art parasite line was selected by culturing an artemisinin-sensitive f -tanzania clone under a dose-escalating, -cycle regimen of artemisinin. eight mutations in seven genes were eventually selected based on whole-genome sequence analysis of f -art and f -tem (its sibling clone cultured without artemisinin) at × and × average nucleotide coverage, respectively. to examine whether these in vitro selected mutations were associated with artemisinin resistance in cambodia, sequence polymorphisms in all seven genes were analyzed in culture-adapted clinical isolates and related to their rsa – h survival rates. only polymorphisms of a gene, k -propeller, showed a significant association with rsa – h survival rates. in total, four mutant alleles, each harboring a single nonsynonymous snp (y h, r t, i t, and c y) within a kelch repeat of the c-terminal k -propeller domain were identified. to confirm that k propeller polymorphism is a molecular marker of clinical artemisinin resistance, parasite-clearance half-lives in patients were correlated with their k alleles.
of the patients, carried parasites with a wild-type allele and the others carried parasites with only one of the three single nonsynonymous snps in the k propeller: c y (n = ), r t (n = ), and y h (n = ). the parasite-clearance half-life in patients with wild-type parasites is significantly shorter (median . h) than in those with these three mutant alleles (median . – . h). subsequently, clinical studies have validated the association between k propeller mutations and artemisinin resistance.

an early mathematical model for malaria control suggested that the most vulnerable element in the malaria cycle was survivorship of adult female mosquitos. therefore, insect control is an important part of reducing transmission. the use of ddt as an indoor residual spray in the global malaria eradication program from to reduced the population at risk of malaria to about % by compared with % in . engineering genetically modified mosquitoes refractory to malaria infection appeared to be an alternative approach, given the environmental impact of ddt and the emergence of insecticide-resistant insects. the vector biology network (vbn) was formed in and had proposed a -year plan with the who in to achieve three major goals: ( ) to develop basic tools for the stable transformation of anopheline mosquitoes by the year , ( ) to engineer a mosquito incapable of carrying the malaria parasite by , and ( ) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by . while some proof-of-concept experiments have been achieved for the first two aims in when the a. gambiae genome was completely sequenced, the progress has been relatively slow. genomic loci of a. gambiae responsible for p. falciparum resistance have been identified through surveying a mosquito population in a west african malaria transmission zone. a candidate gene, anopheles plasmodium-responsive leucine-rich repeat (apl ) was discovered.
subsequently, other resistance genes have also been identified. studying the genetic basis of resistance to malaria parasites and immunity of the mosquito vector will be important to control malaria transmission.

perhaps the most immediate impact of a completely sequenced pathogen genome is for infectious disease diagnosis. the information may be of great importance to the public health when a newly emerged or reemerged pathogen is discovered. a few examples will be described. a novel swine-origin influenza a virus (s-oiv) emerged in the spring of in mexico and subsequently was discovered in specimens from two unrelated children in the san diego area in mid-april . those samples were positive for influenza a but negative for both human h and h subtypes. the complete genome sequence and a real-time pcr-based diagnostic assay were released to the public in late april. the outbreak evolved rapidly and who declared the highest phase worldwide pandemic alert on june , . s-oiv has three genome segments (ha, np, and ns) from the classic north american swine (h n ) lineage, two segments (pb and pa) from the north american avian lineage, one segment (pb ) from the seasonal h n , and most notably, two segments (na and m) from the eurasian swine (h n ) lineage. with the available influenza genome database, diagnostic assays to distinguish previous seasonal h n , h n , and s-oiv can be easily accomplished.

a comprehensive pathogen genome database is not only useful for infectious disease diagnosis but also for novel pathogen discovery. homologous sequences within the same family or among different family members are important for new pathogen identification even with the advent of third-generation sequencing technology. de novo pathogen discovery may also be complicated by coexisting microorganisms, such as commensal bacteria in the human body. without prior knowledge of these microorganisms, one may be misled.
in , a microarray-based assay, designated virochip, was used to help discover the sars coronavirus. the virochip contained the most highly conserved mer sequences from every fully sequenced reference viral genome in genbank. the computational search for conservation was performed across all known viral families. a microarray hybridized with a reaction derived from a viral isolate cultivated from a sars (severe acute respiratory syndrome) patient revealed that the strongest hybridizing array elements belong to families astroviridae and coronaviridae. alignment of the oligonucleotide probes having the highest signals showed that all four hybridizing oligonucleotides from the astroviridae and one oligonucleotide from avian infectious bronchitis virus, an avian coronavirus, shared a core consensus motif spanning nucleotides. interestingly, it had been known previously through bioinformatics analyses that this sequence is present in the utr of all astroviruses, avian infectious bronchitis virus, and an equine rhinovirus. therefore, a new member of the coronavirus family was identified through the unique hybridizing pattern and subsequent confirmations.

the finding of the seventh human oncogenic virus, merkel cell polyomavirus (mcv) in is another example of why conserved sequences are important for novel pathogen discovery. mcv is the etiological agent of merkel cell carcinoma (mcc), which is a rare but aggressive skin cancer of neuroendocrine origin. two cdna libraries derived from mcc tumors were subjected to high-throughput sequencing by a next-generation roche/ sequencer. nearly , sequence reads were generated. the majority ( . %) of the sequences were of human origin and were removed from further analyses. only one of the remaining cdnas was homologous to the t antigen of two known polyomaviruses. one additional cdna was subsequently identified to be part of the mcv sequence when the complete viral sequence was known.
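the read-filtering logic behind this discovery can be illustrated with a toy sketch. it is not the pipeline used in the study: the sequences, the k-mer size, and the function names `kmers`, `shares_kmer`, and `subtract_host` are all hypothetical, and exact k-mer matching stands in for the alignment-based homology searches that a real analysis would use.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(read: str, reference_kmers: set[str], k: int) -> bool:
    """True if the read contains at least one k-mer from the reference set."""
    return not kmers(read, k).isdisjoint(reference_kmers)


def subtract_host(reads: list[str], host: str, k: int) -> list[str]:
    """Drop every read that shares any k-mer with the host sequence."""
    host_kmers = kmers(host, k)
    return [r for r in reads if not shares_kmer(r, host_kmers, k)]


# Hypothetical toy data: short strings standing in for genomes and cDNA reads.
K = 6
host_genome = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
known_virus = "TTGACCGGTTAACCGGAATTCCGGTTAA"
reads = [
    "GCCATTGTAATG",  # copied from host_genome, so it is subtracted
    "ACCGGTTAACCG",  # copied from known_virus, so it becomes a candidate
    "CACACACACACA",  # matches neither reference
]

# Step 1: remove host-derived reads. Step 2: keep reads resembling a known virus.
non_host = subtract_host(reads, host_genome, K)
virus_kmers = kmers(known_virus, K)
candidates = [r for r in non_host if shares_kmer(r, virus_kmers, K)]
print(candidates)
```

the third read survives host subtraction but matches no known virus, which mirrors the point made in the text: a sequence with no homolog in the reference database is easy to overlook until the complete viral genome is available.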
later analyses showed that % ( / ) of the mcc had integrated mcv in the human genome. monoclonal viral integration was revealed by the patterns of southern blot analysis. only – % of control tissues had low copy number of mcv infection.

in , an interesting and unexpected discovery of the malignant transformation of hymenolepis nana, a human tapeworm, in a human host was reported using conventional and next-generation sequencing approaches. initially, examination of a -year-old hiv-infected man revealed extensive lymphadenopathy. h. nana eggs and blastocystis hominis cysts were found in stool. the disease progressed to death despite antiparasitic and antiretroviral treatment. histological examination of biopsied lymph nodes revealed proliferative cells with overt malignant features. they were monomorphic with morphologic features characteristic of stem cells (a high nucleus-to-cytoplasm ratio). however, the small cell size (< ) suggested infection with an unfamiliar, possibly unicellular, eukaryotic organism. infection with a plasmodial slime mold rather than h. nana was considered because of the prominent syncytia formation and the primitive appearance of the atypical cells but lack of architecture identifiable as tapeworm tissue. pcr screening suggested that these cells were h. nana. next-generation genome sequencing and comparative analysis revealed h. nana variants harboring mutations typically found in cancer.

as of , next-generation sequencing technologies are gradually being applied for diagnosis and monitoring of infectious diseases, including genotypic resistance testing, direct detection of unknown disease-associated pathogens without culture, investigation of microbial population diversity in the host, and strain typing.
however promising, next-generation sequencing approaches for clinical diagnosis require further improvements for automation, standardization of technical and bioinformatic procedures, and other practical issues, such as costs and turnaround time. while we can expect that the efforts of a variety of genome projects may improve human health, the socioeconomic issues that are not discussed in this chapter may be substantial. in addition, the tremendous amount of information derived from these projects will also pose a challenge for scientists as well as nonscientists to follow and understand.

key: cord- - fctxk authors: proudfoot, chris; lillico, simon; tait-burkard, christine title: genome editing for disease resistance in pigs and chickens date: - - journal: anim front doi: . /af/vfz sha: doc_id: cord_uid: fctxk

for thousands of years, humans have used selective breeding to improve desirable traits in both livestock and companion animals. in livestock, targeted breeding has been common practice since the british agricultural revolution of the th century, with measurable production traits such as feed conversion in cattle or wool production in sheep actively selected for.
in the late th century, genomic selection was added to the livestock breeding tool box; by reading specific locations in the genome and assigning them to measurable production traits, faster improvement in livestock production efficiency has been achieved. one of the inherently difficult production traits to measure is resistance to a specific disease, as animals with less severe symptoms or pathology may simply have been exposed to less pathogen. experimental infections guaranteeing equal pathogen exposures are expensive and require large numbers of animals for genetic association studies, making them ethically questionable. genome editing offers new opportunities to livestock breeding for disease resistance, allowing the direct translation of laboratory research into disease-resistant or resilient animals. made? genome editors are custom enzymes that allow scientists to cut the dna strands in the nucleus of a cell at a specific position. the researcher can then influence how the dna is repaired, introducing very precise genetic changes at a target locus in their species of interest. this technology has been revolutionary and provides exciting possibilities for the production of livestock resistant to viral diseases. such opportunities are particularly pertinent given state efforts to improve global food security and reduce food waste throughout the production chain. the most prominent editor technology today, crispr/cas, uses a nucleotide rna guide to target its enzyme component to a designated locus in the genome. the probability of off-target cutting with a high fidelity cas enzyme is very low, because with four potential base combinations at each of the nucleotides there are over one trillion unique guide combinations. once the enzyme has cut the dna strands, the predominant repair pathway in most cells is nonhomologous end joining, an error-prone process which often introduces small insertions or deletions into the genetic code at the break site. 
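the size of the guide space quoted above follows from simple counting: with four possible bases at each position, a guide of n nucleotides has 4^n possible sequences. the sketch below assumes a guide length of 20 nucleotides, which is the commonly used length for cas9 guides and is consistent with the "over one trillion" figure; the function and constant names are illustrative only.

```python
def guide_space(length: int, alphabet_size: int = 4) -> int:
    """Count the distinct guide sequences of a given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return alphabet_size ** length


# Assumption: a 20-nucleotide guide (the length is not legible in the text).
ASSUMED_GUIDE_LENGTH = 20

n_guides = guide_space(ASSUMED_GUIDE_LENGTH)
print(f"{n_guides:,}")  # 1,099,511,627,776
```

this counts possible sequences only; it does not model mismatch tolerance or the non-random composition of a real genome, both of which affect actual off-target behaviour.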
if the target is within a gene, such perturbations can result in a disruption to the function of that gene, potentially leading to a loss of protein function. this can be very useful to basic science as it allows researchers to discover functions associated with novel genes. for many applications, a more precise change to the genome is required. to that end, scientists regularly make use of an alternative dna repair process called homology-directed repair. to do this, researchers provide a novel dna sequence alongside the crispr/cas reagents, whereby the cellular repair machinery uses the new dna as a template when fixing the break. this approach facilitates the introduction of defined changes at the genomic target locus and has sufficient refinement to alter a single nucleotide, allowing precise modification of gene function. finally, by introducing a pair of editors, it is possible to generate two concurrent dna breaks on the same chromosome. the cellular repair machinery then joins the ends of the cut sites, promoting the deletion of the intervening sequence.

implications
• genome editing technology enlarges the tool box of trait-selective breeding.
• methods for genome editing have developed over the past decades, making the technology more efficient and specific.
• technology to generate edited pigs and chickens is developing alongside genome editors to generate animals faster and more affordably.
• for two major pig diseases, it has been shown that resistant animals can be generated that are refractory to infection. in chickens there are promising laboratory results but no genome-edited, resistant chickens yet.
• genome editing allows us to overcome bottlenecks in trait-selective breeding and allows the incorporation of genetic traits from other breeds, related species, or laboratory results.
• two major hurdles still to be faced prior to the implementation of this promising technology are consumer acceptance and the regulatory framework.
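two of the editing outcomes described above, small insertions or deletions at a single break and removal of the sequence between a pair of breaks, can be illustrated with plain string operations. this is a toy sketch, not a model of real repair: the sequence, the cut positions, and the function names are all hypothetical, and a cut position here simply means the index at which the strand is broken.

```python
def delete_between_cuts(sequence: str, cut_a: int, cut_b: int) -> str:
    """Join the outer ends of two breaks, dropping the intervening bases.

    A cut position i means the break lies between bases i-1 and i.
    """
    left, right = sorted((cut_a, cut_b))
    if left < 0 or right > len(sequence):
        raise ValueError("cut positions must lie within the sequence")
    return sequence[:left] + sequence[right:]


def is_frameshift(original: str, edited: str) -> bool:
    """True if the length change is not a multiple of three bases."""
    return (len(original) - len(edited)) % 3 != 0


# Hypothetical coding sequence, used only for illustration.
seq = "ATGGCCTTTAAACCCGGGTGA"

paired = delete_between_cuts(seq, 6, 12)   # removes six bases (two codons)
single = delete_between_cuts(seq, 6, 7)    # removes one base, as a small indel might

print(paired, is_frameshift(seq, paired))  # ATGGCCCCCGGGTGA False
print(single, is_frameshift(seq, single))  # ATGGCCTTAAACCCGGGTGA True
```

the frameshift check reflects why small indels within a gene often abolish protein function, whereas a deletion spanning whole codons removes residues without shifting the downstream reading frame.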
all the editor reagents introduced to the cell are rapidly degraded, with only the alteration to the genomic sequence remaining to be propagated following cell division. genome editing has been applied to a wide variety of agricultural species including salmonids, poultry, and ruminants. however, due to its global economic value, relatively short generation time, and multiparous nature, the most edited livestock species to date is the pig. there are two main methods widely used for the generation of edited pigs: cloning of edited fibroblasts or direct injection of the zygotes with editor reagents. both work well, and each has specific advantages. in cloning, fibroblast cells can be maintained in the lab for prolonged periods. this allows researchers to introduce editor reagents into the cultured cells typically by lipofection, electroporation, or microinjection. editing events in each cell of a population can be characterized and individual cells with the desired alteration to their genome selected for the cloning process, whereby the fibroblast cell is fused with an enucleated oocyte shell in a process called somatic cell nuclear transfer ( figure a ). the reconstituted "zygote" is then transferred to a recipient gilt or sow (carlson et al., ) . despite the benefit of being able to prescreen the donor cells, cloning is generally inefficient with hundreds of reconstituted zygotes being transferred to a single recipient. cloning also yields reduced litter sizes when compared with standard breeding and offspring often demonstrate reduced viability. as an alternative to cloning, newly fertilized zygotes can be directly microinjected with genome-editing reagents and transferred immediately back to the oviduct of a recipient animal ( figure b ). in contrast to cloning, this approach (lillico et al., ) results in the efficient establishment of pregnancies and robust litters. 
however, without the prescreening of cells that is routine in cloning, offspring from direct zygote manipulation inevitably encompass a range of editing outcomes, since selection of a specific edit is not possible. porcine zygotes can also be generated by maturation of oocytes extracted from slaughterhouse-derived ovaries and in vitro fertilization. unfortunately, in vitro fertilization in pigs often results in polyspermy, rendering the resulting embryo inviable. however, in this controlled environment, editing rates can be increased and costs and animal use reduced. an emerging alternative to these proven methods could be the use of surrogate sires. as a first step towards this goal, pigs have been edited to remove a gene required for male fertility, generating an empty spermatogonial stem cell niche in the testis. spermatogonial stem cells can be isolated and cultured in vitro, opening the possibility to edit and characterize these cells before transfer to a recipient (park et al., ) (figure c). genetic modification of poultry poses unique challenges due to the very different physiology of the avian egg compared with a mammalian oocyte. as a result, isolation and transfer of a chicken yolk is not practical. one approach that has been taken is in ovo electroporation of editing reagents, which allowed the analysis of gene function in the neural crest (gandhi et al., ). however, others reported that electroporation resulted in mosaicism, with editing limited to a subset of cells, as the chicken embryo is already much further developed when an egg is laid compared with a zygote (veron et al., ) (figure d). as a result, it is unlikely that this approach could be efficiently utilized to generate edited birds. an alternative approach involves sperm transfection-assisted gene editing, whereby sperm are lipofected with editing reagents before use in artificial insemination (cooper et al., ) (figure e).
however, advances in chicken stem cell technology show the greatest promise for genome editing in chicken. primordial germ cells (stem cells that eventually develop into germ cells) can be isolated from the blood of developing chicks in ovo and cultured in vitro. as with mammalian fibroblasts, these cells can be edited and selected in vitro before transfer into the bloodstream of a stage-matched recipient where they migrate to and populate the developing gonad. the chicken embryo is accessed through an opening in the egg shell, which is sealed again until the chicken hatches. genome editing in primordial germ cells has been successfully demonstrated by a number of groups (park et al., ; taylor et al., ; idoko-akoh et al., ) and one group has generated modified birds (park et al., ). the founder birds generated from this editing method are chimeric due to the presence of preexisting germ cells. the offspring generated from breeding with the founders will therefore be a mixture of edited and nonedited birds. recipient chicken embryos devoid of germ cells are currently being developed that will significantly increase the efficiency of this process (m. mcgrew, unpublished results) (figure f). genome editors will undoubtedly have a significant role in the generation of disease-resistant animals, as exemplified below. it is important to note that currently the technology is limited to modifying a single gene or a snp with large effects; however, disease resistance in many cases is likely to be a polygenic trait. multiplexing technology is under development such that in the future polygenic traits could be altered in a single step. progress so far? porcine reproductive and respiratory syndrome virus. porcine reproductive and respiratory syndrome (prrs) is arguably the most economically important pig disease worldwide.
the causative agent of prrs is an arterivirus, named prrs virus (prrsv), that affects pigs of all ages but most importantly causes late-term abortions and stillbirth in sows and severe respiratory disease in piglets with severe morbidity and high mortality. prrsv also incapacitates the pig's immune response, providing an ideal breeding ground for severe secondary infections, mostly by bacteria, which leads to increased use of antibiotics. prrsv exclusively infects cells of the monocyte/macrophage lineage and two macrophage-specific proteins, cd and cd , were identified as receptors for the virus: cd acting on the surface of the cells and cd inside the internalizing transport vesicles (calvert et al., ; van gorp et al., ). the virus was thought to attach to cd to be taken up into the cells; however, genome-edited pigs lacking cd were not resistant to prrsv infection (prather et al., ). cd on the other hand is thought to act through a key-lock interaction with the virus to allow it to escape from the internalizing transport vesicles into the cytosol where it replicates. cd consists of nine globular domains, organized like beads on a string, with domain determined to mediate the key-lock interaction allowing viral entry into pig cells (van gorp et al., ). using genome editing to generate pigs lacking cd , whitworth et al. showed for the first time that this approach could be used to produce livestock resistant to important viral diseases, in this case prrs (whitworth et al., ). cd is known to have a range of important biological functions in homeostasis, inflammation, and immune responses. as a refinement on functional knock out of the entire cd protein, editing reagents were designed to remove only domain , leaving the remainder of the protein intact. the resulting animals were completely resistant to prrsv infection and maintained the biological functions associated with the remaining domains of cd (burkard et al., ; burkard et al., ).
porcine epidemic diarrhea virus/transmissible gastroenteritis virus. the two coronaviruses porcine epidemic diarrhea virus (pedv) and transmissible gastroenteritis virus (tgev) both cause severe diarrhea in preweaned piglets and are associated with high morbidity and mortality. in vitro host-pathogen studies identified aminopeptidase n as the receptor for tgev and a potential receptor for pedv (delmas et al., ; li et al., ). the use of genome editing to generate pigs lacking aminopeptidase n successfully showed that pigs resistant to tgev infection could be generated. however, the edited animals remained susceptible to pedv infection (whitworth et al., ). aminopeptidase n is important for peptide digestion in the small intestine and knockout mice were shown to have delayed mammary gland development. in humans, aminopeptidase n defects are associated with different types of leukemia and lymphoma. therefore, further investigation into the potential consequences of the absence of aminopeptidase n in pigs is warranted as it may affect the overall health and/or productivity of the animals. african swine fever virus. african swine fever virus (asfv) causes a severe hemorrhagic disease in domestic pigs (sus scrofa domesticus) and wild boars (sus scrofa ferus) with high mortality in pigs of all ages. asfv is highly contagious and can be transmitted by soft ticks of the ornithodoros genus. it was identified in and contained to africa with occasional transmission around the strait of gibraltar into portugal and spain. in an introduction of the virus into the caucasus region showed that the virus does not solely rely on ticks for transmission in the wild, as transport of contaminated material and direct contact between animals have been shown to be major routes of disease dissemination. since then, the virus has spread across eastern europe and russia and was recently found in western europe and china.
asfv poses a huge risk to the pig industry worldwide and is a limiting factor to a sustainable pig industry in many parts of africa. interestingly, asfv also infects wild suids, such as warthogs (phacochoerus africanus) and bushpigs (potamochoerus porcus), without causing overt disease. such infected wild suids are thought to act as a reservoir of the virus in africa. in the late stages of asfv infection, a cytokine storm, i.e., an overreaction of the immune system, is observed, which is thought to strongly contribute to the lethal outcome of disease. a comparison of the warthog and domestic pig genomes identified differences in the rel-like domain-containing protein a (rela, also known as p ), which is involved in nf-κb cytokine signaling; these differences were thought to underlie the different responses of the related species to asfv infection (palgrave et al., ). researchers used genome editing to convert a key region of the encoded domestic pig protein sequence to the warthog equivalent (lillico et al., ). data on susceptibility of the edited animals to asfv infection have yet to be reported. in this instance, it is important to differentiate between disease resistance, the ability of an animal to suppress the establishment and/or development of an infection, and disease resilience, where an infected host manages to maintain an acceptable level of productivity despite challenge pressure. should these pigs prove to be resilient to asfv infection, it is likely that their use may not be permitted in many jurisdictions, since they could act as reservoirs of infection. however, in environments where the disease is endemic, use of such animals could be beneficial. avian leucosis virus. avian leukosis virus infection results in inappetence, diarrhea, weight loss, a reduction in eggs laid, and often causes tumor formation in the chicken.
the virus is divided into six subgroups, with the avian leucosis virus subgroup j (alv-j) shown to be responsible for major disease outbreaks in china. the cellular receptor of alv-j was identified to be the chicken sodium/hydrogen exchanger protein on the cell surface. chicken somatic cell lines have been edited to introduce changes to this gene, conferring resistance to avian leucosis virus in vitro (lee et al., ). despite cells showing resistance to alv-j infection, no edited chickens have been produced to date. in both mice and humans, a lack of the sodium/hydrogen exchanger protein is associated with severe neurological disease; however, targeted changes to single amino acids may retain biological functions of the protein in chicken while resulting in disease resistance. avian influenza virus. in chickens, disease resistance to avian influenza is at the top of the wish list due to the serious impact on chicken health but also the risk of transmission to humans. similarly, influenza a is also one of the diseases on the resistance wish list for pigs, as they can act as an intermediate host, aiding virus adaptation to humans. the acidic leucine-rich nuclear phosphoprotein- a (anp a) was found to play a key role in avian influenza virus replication in both chicken and water fowl. although the virus polymerase protein readily interacts with the avian anp a, the human version of the same protein supports only limited replication of the viral genome. it has been demonstrated in vitro that deletion of a small region of chicken anp a can prevent replication of avian influenza virus (long et al., ; long et al., ). although the functional consequence of edited anp a has yet to be demonstrated in vivo, such approaches offer exciting opportunities that have the potential to benefit both industry and animal welfare.
as exemplified above, currently many gene editing approaches focus on targeting host genes involved in mediating entry of the virus, with a special focus on receptors. however, as the example for avian influenza shows, host genes play an important role in other steps of the pathogen replication cycle and also provide editing targets for disease resilience or resistance. more in-depth host-pathogen interaction studies, including genome-wide editing studies in vitro, will no doubt produce a variety of further candidate genes for genetic disease resistance. an alternative antipathogen approach pursued for decades is the generation of transgenic livestock, expressing antiviral or antibacterial agents, such as enzymes or small interfering rnas. genome editing can be used to improve the integration efficiency of these transgenes at specific locations in the genome; however, the discussion of transgenic disease-resistant animals is beyond the scope of this review. how does genome editing fit within existing selective breeding structures and how will it be regulated? selective breeding has generated highly productive, robust animals that are adapted to a modern production environment. livestock production is dynamic, with evolving challenges such as climate change and disease outbreaks coupled with societal pressure to reduce antimicrobial use. selective breeding for disease resistance has proven difficult, as outbreaks are often sporadic and resistant/resilient animals are often difficult to identify. in circumstances where a genetic trait for disease resistance can be identified in the breeding population, selection through selective breeding can be achieved. a good example of this is pigs with resistance to f type enterotoxigenic e. coli. association studies revealed that a polymorphism in the fucosyl transferase gene conferred resistance to these bacteria. there was initial concern that selection for the locus harboring this gene may counterselect for another gene associated with stress resistance. however, this proved not to be the case and genetic selection for the favorable fucosyl transferase allele has been integrated into many pig-breeding programs (coddens et al., ). this was possible, in part, because the favorable allele was present at sufficient prevalence (in most studies between % and %) in the breeding population to allow for selection while avoiding inbreeding.

figure . genetic resistance to disease and how genome editing can help integrate traits into highly productive lines. (a) genetic resistance to disease may be present in a small percentage of production animals and genetic selection for these animals may be associated with the risk of inbreeding, productivity loss, or the risk of losing other desirable traits. genome editing allows integration of the disease-resistance trait into a wider selection of pigs, ensuring genetic variability and maintenance of desirable traits. (b) genetic resistance to disease may be present in an indigenous or less productive breed. crossbreeding would result in productivity loss and the risk of losing other desirable traits, such as fur color. genome editing allows for incorporation of genetic disease resistance into highly bred lines without losing productivity. (c) genetic resistance may be observed in a closely related species, e.g., wild boar or wild suids in the case of the domestic pig. integration into highly bred domestic pig lines would only be possible by genome editing. (d) resistance genes may be identified in laboratory research but not in highly bred lines, making integration into those productive animals only possible using genome editing.
in circumstances where an allelic variant associated with a resistant phenotype is present at a much lower frequency, it may prove difficult to incorporate effective selection into a standard breeding regime without the risk of inbreeding and related longer-term productivity loss (figure a) . genome editing has the potential to contribute in such circumstances, allowing the direct introgression of a beneficial allele into the offspring of diverse, highly productive animals. similarly, disease-resistance traits associated with less productive indigenous breeds are unlikely to be introduced to highly productive populations by standard crossbreeding as this would result in a significant set-back in productivity, abrogating decades or even centuries of advances made through genetic selection ( figure b ). in circumstances where resistance or resilience is observed in a related species, crossbreeding is simply not possible. genome editing could bridge these gaps. one example of this is resilience of wild suids to african swine fever virus while domestic pigs can suffer from severe disease. it is not possible to crossbreed these species, so introduction of the genetics underlying resilience is not possible by this route. genetic comparison can be used to identify the functional differences underlying such traits, and genome editing employed to introduce appropriate variants into domestic pigs ( figure c ). finally, with a good understanding of host-pathogen interactions, novel genetics that has not been observed in live animals can be created and tested for efficacy in a laboratory environment. this was the case for both the cd /prrsv and apn/tgev examples in pigs and would be the case for the anp a/ influenza and the alv-j resistance in chicken, described above. in such circumstances, integration through genome editing presents a practical route to benefit from the findings ( figure d ). 
it is imperative that in such circumstances thorough phenotypic characterization of the edited animals be carried out as deletion of all or part of a functional protein could result in a loss of (systemic) biological function. a second measure worthy of consideration before embarking on an editing project is whether the gene is located within a locus that has been actively selected in breeding programs. this could indicate whether a potential target is associated with known production traits. this approach has been taken for prrsv-resistant pigs, with evaluation as to whether the cd gene locus has been selected for in pig breeding programs (johnsson et al., ). overall, genome editing holds vast promise for the future production of animals resistant or resilient to disease, improving productivity and animal welfare while reducing food waste throughout the production chain. through reduction of primary and secondary infections, it should also be possible to reduce antimicrobial use in livestock production. technical improvements in the generation of genome-edited animals will undoubtedly reduce the cost implications of this technology. the two major hurdles still to be faced prior to implementation of this promising technology are consumer acceptance and the regulatory framework. approval of edited animals for human consumption relies on national and multinational legislation, which is currently at early stages. encouragingly, some international jurisdictions such as argentina and brazil have already ruled that modifications, such as the prrsv-resistant pig, that do not have any new genetic information integrated into the animal, will be exempt from regulation.

about the authors
dr. chris proudfoot is a research fellow at the roslin institute/university of edinburgh since . his work centers on generation of genome-modified livestock, with particular emphasis on genome editors, to improve disease resistance or to accurately model human disease. dr. proudfoot has worked extensively with zfns, talens, and crispr/cas to produce a variety of edited animals. he was a member of the team that produced the first edited livestock using this method.
dr. simon lillico is a research associate at the roslin institute/university of edinburgh. he joined the institute on an industrial collaboration to produce high-value therapeutic proteins in hens' eggs and then applied his expertise in lentiviral transgenesis to generate livestock models of human diseases. the rapid expansion of the field of genome editors over the last yr has made practicable genome modifications which had previously been unattainable. dr. lillico has been at the forefront of application of these editors to livestock, creating either disease-resistant/resilient strains, or accurate models of human disease.
dr. christine tait-burkard is an assistant professor at the roslin institute/university of edinburgh since in the departments of genetics and genomics and infection and immunity. her research focuses on understanding host-pathogen interactions on a cellular and genetic level, developing new in vitro tools for virus research, improving and developing easy-to-use diagnostics, and devising strategies to combat viral disease in livestock in general and pigs in particular. she employs genome editing and genetic selection to generate animals genetically resistant to viral disease. corresponding author: christine.burkard@roslin.ed.ac.uk
precision engineering for prrsv resistance in pigs: macrophages from genome edited pigs lacking cd srcr domain are fully resistant to both prrsv genotypes while maintaining biological function pigs lacking the scavenger receptor cysteine-rich domain of cd are resistant to porcine reproductive and respiratory syndrome virus infection cd expression confers susceptibility to porcine reproductive and respiratory syndrome viruses efficient talen-mediated gene knockout in livestock the possibility of positive selection for both f (+)escherichia coli and stress resistant pigs opens new perspectives for pig breeding generation of gene edited birds in one generation using sperm transfection assisted gene editing (stage) aminopeptidase n is a major receptor for the entero-pathogenic coronavirus tgev optimization of crispr/cas genome editing for loss-of-function in the early chick embryo high fidelity crispr/cas increases precise monoallelic and biallelic editing events in primordial germ cells precise gene editing of chicken na+/h+ exchange type (chnhe ) confers resistance to avian leukosis virus subgroup j (alv-j) porcine aminopeptidase n is a functional receptor for the pedv coronavirus live pigs produced from genome edited zygotes mammalian interspecies substitution of immune modulatory alleles by genome editing species difference in anp a underlies influenza a virus polymerase host restriction avian anp b does not support influenza a virus polymerase and influenza a virus relies exclusively on anp a in chicken cells species-specific variation in rela underlies differences in nf-κb activity: a potential role in african swine fever pathogenesis successful genetic modification of porcine spermatogonial stem cells via an electrically responsive au nanowire injector generation of germline ablated male pigs by crispr/cas editing of the nanos gene targeted gene knockout in chickens mediated by talens an intact sialoadhesin (sn/siglec /cd ) is not required for 
attachment/internalization of the porcine reproductive and respiratory syndrome virus efficient talen-mediated gene targeting of chicken primordial germ cells sialoadhesin and cd join forces during entry of the porcine reproductive and respiratory syndrome virus identification of the cd protein domains involved in infection of the porcine reproductive and respiratory syndrome virus crispr mediated somatic cell genome engineering in the chicken gene-edited pigs are protected from porcine reproductive and respiratory syndrome virus resistance to coronavirus infection in amino peptidase n-deficient pigs we acknowledge financial support from the biotechnology and biological science research council (bbsrc) (bb/ r / , bb/n / ) and the bbsrc institute strategic programme grant funding to the roslin institute (bbs/ e/d/ and bbs/e/d/ ). key: cord- -r el nqm authors: domingo, esteban title: molecular basis of genetic variation of viruses: error-prone replication date: - - journal: virus as populations doi: . /b - - - - . - sha: doc_id: cord_uid: r el nqm genetic variation is a necessity of all biological systems. viruses use all known mechanisms of variation; mutation, several forms of recombination, and segment reassortment in the case of viruses with a segmented genome. these processes are intimately connected with the replicative machineries of viruses, as well as with fundamental physical-chemical properties of nucleotides when acting as template or substrate residues. recombination has been viewed as a means to rescue viable genomes from unfit parents or to produce large modifications for the exploration of phenotypic novelty. all types of genetic variation can act conjointly as blind processes to provide the raw materials for adaptation to the changing environments in which viruses must replicate. a distinction is made between mechanistically unavoidable and evolutionarily relevant mutation and recombination. 
genetic change was a prerequisite for the early life forms to be generated and maintained (chapter ), and it is also a requirement for the evolution of present-day life. we may willingly or inadvertently modify selective pressures, but genetic change is rooted in all replication machineries. the results of genetic modifications, regarding relative dominances of variant forms, are guided by selective pressures and random events. the replicative machinery itself has probably been influenced by natural selection; as an example, polymerases devoid of a capacity to generate variants should have endured a long-term selective disadvantage. however, once the replicative machinery was established, the mechanisms of variation acted independently of the selective pressures applied or to come. viruses use the same molecular mechanisms of genetic variation as other forms of life: mutation (that encompasses point mutations and insertions-deletions of different lengths), hypermutation, several types of recombination, and genome segment reassortment. mutation is observed in all viruses, with no known exceptions. recombination is also widespread, but its role in the generation of diversity appears to vary among viruses. its occurrence was soon accepted for dna viruses, but it was considered uncertain for the rna viruses. pioneering studies of poliovirus (pv) by p. d. cooper, v. i. agol and colleagues, and of foot-and-mouth disease virus (fmdv) by a.m. king and colleagues provided the first evidence of recombination in rna. the present perception is that recombination is more widespread than thought only a few decades ago and that its frequency and the types of genomic forms it generates are varied among viruses. for example, it appears that positive-strand rna viruses recombine more easily than negative-strand rna viruses to give rise to mosaic genomes of standard length.
several negative-strand rna viruses, however, can yield defective genomes through recombination, frequently characterized by deletions in their rna. a connection between the structure of replication complexes (as viewed by x-ray diffraction or high-resolution cryo-electron microscopy) and the propensity to produce defective genomes has not been established. defective genomes are increasingly perceived not only as unavoidable side-products of blind replicative imperfections but as classes of genome subpopulations that perform relevant biological roles for the standard, infectious viruses. genome segment reassortment, a type of variation close to chromosomal exchanges in sexual reproduction, is an adaptive asset of segmented viral genomes, as continuously evidenced by the ongoing evolution of the influenza viruses. the three modes of virus genome variation are not incompatible, and reassortant-recombinant-mutant genomes are continuously arising in present-day viruses. the potential for genetic variation of rna and dna viral genomes is remarkable, and it is the ultimate molecular mechanism that lies at the origin of the virus diversity delineated in chapter . mutation is a localized alteration of a nucleotide residue in a nucleic acid. it generally refers to an inheritable modification of the genetic material. in the case of viral genomes, mutations can result from different mechanisms: (i) template miscopying (direct incorporation of an incorrect nucleotide); (ii) primer-template misalignments that include miscoding followed by realignment, and misalignment of the template relative to the growing chain (polymerase "slippage" or "stuttering"); (iii) activity of cellular enzymes (e.g., deaminases); or (iv) chemical damage to the viral nucleic acids (deamination, depurination, depyrimidination, reactions with oxygen radicals, direct and indirect effects of ionizing radiation, photochemical reactions, etc.) (naegeli, ; bloomfield et al., ; friedberg et al., ).
the basis of nucleotide misincorporation during template copying (defined as the incorporation of a nucleotide different from that expected from the template residue at that position) lies mainly in the electronic structure of the bases that make up dna (adenine, a; guanine, g; cytosine, c; thymine, t) or rna (with uracil, u, instead of t). each base includes potential hydrogen-bonding donor sites (amino or imino protons) and hydrogen-bonding acceptor sites (carbonyl oxygens or aromatic nitrogens) that contribute to standard watson-crick base pairs (fig. . ), as well as wobble base pairs (nonstandard watson-crick, but fundamental for rna secondary structure and mrna translation) (fig. . ). the conformation of the purine and pyrimidine bases is highly dynamic. amino and methyl groups rotate about the bonds that link them to the ring structure. in dilute solution, hydrogen bonds are established with water, and they can be displaced by nucleotide or amino acid residues to give rise to nucleotide-nucleotide or nucleotide-amino acid interactions. the difference between the strength of the hydrogen bonds that a polynucleotide chain establishes with water and the strength of those established between two bases in separate polynucleotide chains determines whether a double-stranded polynucleotide will be formed. purine and pyrimidine bases can acquire different charge distributions and ionization states. as a consequence, in addition to the standard watson-crick and wobble pairs, other base pairs are found in naturally occurring nucleic acids (notably cellular rrna and trna) and in synthetic oligonucleotides (a-u or a-t hoogsteen, and a-g, c-u, g-g, and u-u pairs, as well as interactions involving ionized bases). one of the types of electronic redistribution leads to tautomeric changes, such as the keto-enol and amino-imino transitions, which modify the hydrogen-bonding properties of the base; tautomeric imino and enol forms of the standard bases can produce non-watson-crick pairs.
the proportion of the alternative tautomeric forms can be influenced by modifications in the purine and pyrimidine rings, which, in turn, can favor either the syn or anti conformation of a nucleoside, which is defined by the torsion angle of the bond between the carbon of the ribose and either n in pyrimidines or n in purines (fig. . ). the anti conformation is usually the most stable in standard nucleotides and polynucleotides. the transition from the anti to the syn conformation may alter the hydrogen-bonding properties of the base, thereby inducing mutagenesis (bloomfield et al., ; suzuki et al., ). understanding conformational effects on the base-pairing tendencies of nucleoside analogs in the context of the active site of a polymerase is very relevant to the design of specific mutagenic analogs for viral polymerases in lethal mutagenesis-centered antiviral approaches (chapter ).
[figure caption, beginning lost in extraction: ... and g-c, with ribose as pentose). phosphodiester bonds of two potential polynucleotide chains of different polarity (outer arrows) are indicated.]
base-base interactions are responsible not only for part of the mutations that occur during genome replication, but also for the formation of double-stranded nucleic acids, either within the same polynucleotide chain or between two different chains. transitions from a coil-like into an organized double-stranded (or other) structure are functionally relevant for both rna and dna. in the case of rna, double-stranded regions in adequate alternation with single-stranded regions determine key catalytic or macromolecule-attracting abilities, as for example, ribozyme activities (chapter ), the internal ribosome entry site (ires) of several viral and cellular mrnas, or multitudes of functional rna-protein interactions (denny and greenleaf, ). adjacent base stacking due to electronic interactions (rather than hydrophobic bonds as once thought) also contributes to the stability of double-helical regions in nucleic acid molecules.
structural transitions due to alternative stacking conformations, particularly within polypurine or polypyrimidine tracts, can affect nucleic acid-protein interactions. in turn, replication machineries (typically including viral and host proteins gathered in membrane structures) may also be affected by nucleic acid conformations; such effects are important in virology regarding consequences for mutant generation in a given template sequence context. these considerations on structural transitions are relevant to the non-neutral character of silent (also termed synonymous) mutations (those in open-reading frames that do not result in an amino acid substitution), a point to be addressed in the next section. transitions from a single-stranded into a double-stranded nucleic acid structure and the relative stability of the two forms depend on multiple factors that include the nucleotide sequence of the nucleic acid, its being a ribo- or a deoxyribo-polynucleotide, temperature, ionic environment, and ionic strength. positively charged counterions neutralize negatively charged phosphates, and favor duplex stability [as an overview of physical and chemical properties of nucleic acids and their nucleotide components, see (bloomfield et al., )].
figure . examples of a class of non-watson-crick base pairs termed wobble base pairs. the drawing is similar to that of fig. . , except that the sugar residues and phosphodiester bonds have been omitted. hydrogen bonds (discontinuous lines in red) are shown between i (inosine) and c, u, and a, and between g and u. wobble base pairs are important for codon-anticodon interactions, as described in the text.
mutations resulting from any of the mechanisms just summarized can be divided into transitions, transversions (both referred to as point mutations), and insertions and deletions (referred to as indels) (fig. . ).
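the classification of point mutations into transitions and transversions can be sketched in a few lines of python. this is a minimal illustration, not a method taken from the chapter: the function names `classify_substitution` and `count_point_mutations` are hypothetical, and the two sequences are assumed to be already aligned, of equal length, and free of indels.

```python
# Purine <-> purine or pyrimidine <-> pyrimidine changes are transitions;
# purine <-> pyrimidine changes are transversions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}  # T for DNA, U for RNA; do not mix both in one comparison


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unsupported base: {base!r}")
    if ref == alt:
        raise ValueError("identical bases are not a substitution")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_point_mutations(reference: str, variant: str) -> dict:
    """Count transitions and transversions between two pre-aligned sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned and of equal length")
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in zip(reference.upper(), variant.upper()):
        if ref != alt:
            counts[classify_substitution(ref, alt)] += 1
    return counts


if __name__ == "__main__":
    print(count_point_mutations("ACGU", "GCUU"))
```

for example, comparing `ACGU` with `GCUU` gives one transition (a to g) and one transversion (g to u). indels are deliberately left out of this sketch because detecting them requires an alignment step.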
indels occur preferentially at homopolymeric tracts and also at short, repeated sequences, which are prone to misalignment mutagenesis (fig. . ). an example is an editing mechanism for some viral mrnas, such as the phosphoprotein mrna of the paramyxovirinae (kolakofsky et al., ). other examples in vivo are hot spots for variation in reiterated sequences in complex dna genomes (yamaguchi et al., ; barrett and mcfadden, ; mcgeoch et al., ), or the insertion of two amino acids (often ser-ser, ser-gly, or ser-ala) between residues and of the hiv- reverse transcriptase, in concert with hiv- resistance to nucleoside inhibitors (winters and merigan, ) (chapter ). hairpin structures in rna and dna may also induce deletions as a result of slippage mutagenesis (pathak and temin, ; viguera et al., ). transition mutations occur more frequently than either transversions or indels during virus replication. nucleotide discrimination at the catalytic site of viral polymerases fits this observation because a purine or pyrimidine nucleotide is more likely to be replaced by its structurally more similar nucleotide. in some cases, however, an abundance of indels and similar numbers of transitions and transversions have been recorded (cheynier et al., ; malpica et al., ). the molecular bases of such unexpected behavior regarding mutational spectra are not well understood. the generation of point mutations and indels is subject to thermodynamic and quantum-mechanical uncertainties inherent to atomic fluctuations, rendering mutagenesis a highly unpredictable event, thus introducing stochasticity (randomness) in a key motor of evolution: the generation of diversity at the molecular level (domingo et al., ; eigen, ).
[figure caption, beginning lost in extraction: ... indicate point mutations, insertions or deletions (known as indels). a genome is depicted as an elongated rod. symbols on the rod (cross, circle, and line) represent mutations. hypermutation is generally associated with a high frequency of specific mutation types (crosses and lines). a region inserted or deleted from the genome is depicted as an empty rod.]
the effect of mutations on the structure and function of proteins is extremely relevant to understanding the mechanisms that drive virus evolution since selection acts on phenotypes that are often embodied in protein molecules. silent or synonymous mutations are those that do not give rise to an amino acid substitution despite being located in a protein-coding region of a genome. their occurrence is due to the degeneracy of the genetic code: the same amino acid can be coded for by two or more triplets (codons), with the exception of aug for methionine and ugg for tryptophan. synonymous mutations are not necessarily selectively neutral, neutral meaning that they have no discernible consequence for any viral function. the assumption that synonymous mutations are selectively neutral, and the fact that the early comparison of nucleotide sequences of homologous genes showed a dominance of synonymous over nonsynonymous mutations, contributed to the foundations of the neutral theory of molecular evolution. this theory attributes the evolution of organisms at the molecular level mainly to the random drift of genomes carrying neutral or quasi- (or nearly-) neutral mutations (king and jukes, ; kimura, , ). the terms quasi-neutral or nearly-neutral may seem ambiguous to molecular biologists. however, in the formulation of the neutral theory they had a precise meaning: the selection coefficient (a parameter that measures fitness differences) is lower than the inverse of the effective population size, with minor variations in the equations of some formulations (kimura, ). although random drift of genomes plays an important role in molecular evolution, evidence gathered over the last decades renders untenable the assumption that synonymous mutations are neutral.
evidence to the contrary has been obtained with viruses and cells, including mutations in the human genome that may affect enhancer functions (hirsch and birnbaum, ), mrna folding (faure et al., ; mittal et al., ), and microrna targeting (brest et al., ), among other processes (novella, ; novella et al., ; parmley et al., ; hamano et al., ; resch et al., ; lafforgue et al., ; nevot et al., , ; supek, ). there are several mechanisms by which synonymous mutations can affect virus behavior: alteration of cis-acting regulatory elements in viral genomes, decrease of the stability of duplex structures within the rna genome or between viral sequences and mirnas or sirnas, or changes of viral gene expression [splicing precision or translation fidelity through the modification of rna-rna or rna-protein interactions; reviewed in (martínez et al., )]. synonymous codons use different trnas for protein synthesis, and different trnas do not have the same relative abundance in different host cell types. thus, the rate of protein synthesis, an important phenotypic trait for cells and viruses, can be affected by the frequency of alternative synonymous codons present in mrnas (richmond, ; akashi, ). not only codon bias, but also specific codons or codon combinations may affect ribosome speed to regulate the folding of nascent proteins during translation (makhoul and trifonov, ; rocha, ; aragones et al., ; brule and grayhack, ). as a consequence, generation of rare codons by mutation of abundant codons (or vice versa) can modify viral fitness (chapter ). rare codons may also limit the fidelity of amino acid incorporation when the frequency of the required aminoacyl-trnas is low (ling et al., ; zaher and green, ; czech et al., ). the frequency of codon pairs in rna genomes is also a fitness determinant relevant to the preparation of attenuated viral vaccines.
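the degeneracy of the genetic code that underlies synonymous mutations and codon choice can be made concrete by grouping the codons of the standard genetic code by the amino acid they encode. the sketch below is illustrative only: the names `build_codon_table` and `synonymous_families` are hypothetical, and the standard code is assumed (some organelles and organisms use variant codes).

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# One-letter amino acids ('*' = stop) of the standard genetic code, in the
# order obtained by iterating first, second, and third codon base over UCAG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def build_codon_table() -> dict:
    """Map each of the 64 RNA codons to its amino acid (or '*' for stop)."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return dict(zip(codons, AMINO_ACIDS))


def synonymous_families(table: dict) -> dict:
    """Group sense codons by encoded amino acid (stop codons are skipped)."""
    families = defaultdict(list)
    for codon, amino_acid in table.items():
        if amino_acid != "*":
            families[amino_acid].append(codon)
    return dict(families)


if __name__ == "__main__":
    for amino_acid, codons in sorted(synonymous_families(build_codon_table()).items()):
        print(amino_acid, len(codons), " ".join(codons))
```

in this table only methionine (aug) and tryptophan (ugg) form single-codon families; every other amino acid has between two and six synonymous codons, which is the space in which codon bias and rare-codon effects operate.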
to complicate matters further, a synonymous mutation may be neutral or quasi-neutral in one environment, but it may contribute to selection in a different environment, because of the phenotypic effects of rna structure and codon usage. neutrality is relative to the environment. regarding the effects of mutations (box . ), the following general statements are applicable to viruses:
• although difficult to prove due to the limited number of environments used for experimentation, truly neutral mutations (i.e., with no influence on the virus in any environment) are probably very rare. this applies to synonymous, as well as to nonsynonymous mutations.
• mutations resulting in chemically conservative amino acid substitutions are more likely to be tolerated than those leading to chemically different amino acids. tolerability (quantified by substitution matrices among amino acids in protein evolution) should be distinguished from neutrality. a tolerated mutation may cause a reduction in fitness, which is nevertheless compatible with virus replication.
• a conservative amino acid substitution may have important biological consequences.
• the effect of any individual mutation is context-dependent in two ways: it may depend on other mutations in the same genome (epistasis, see also section . and chapter ) or on the mutant cloud that surrounds the genome harboring the mutation (effects of complementation, cooperation, or interference, discussed in section . of chapter ).
• the previous points do not deny the influence of random drift of genomes on intrahost and interhost evolution. the currently most accepted view is that positive and negative selection and random drift occur continuously during virus evolution (chapter ).
the proportion of transition versus transversion mutations may depend initially on the specific replication machinery of a virus that tends to produce some mutation types preferentially over others.
for a given virus, short-term evolution is often reflected in the dominance of transitions, a dominance which is less apparent when distantly related sequences of the same virus are compared. the effect of evolutionary distance on the transition to transversion ratio was observed in the fmdv genome sequence comparisons carried out in our laboratory over several decades, which ranged from analyses of mutant spectra relative to their corresponding consensus sequence to independent viral isolates from disease outbreaks separated by several decades [the work on fmdv evolution is reviewed in (domingo et al., , )]. these two levels of sequence comparisons (within mutant spectra vs. independent isolates) can be highly significant, as discussed in chapters and . the proportion of synonymous and nonsynonymous mutations that have mediated the diversification of viral genomic sequences that belong to the same phylogenetic lineage is often considered informative of the underlying evolutionary forces. probably because of the rooted (albeit uncertain) notion that biological function is more likely to reside in protein than in dna or rna, the ratio of the number of nonsynonymous substitutions per nonsynonymous site in the sequence under study (dN) to the number of synonymous substitutions per synonymous site (dS), termed ω (ω = dN/dS), is calculated to infer the dominant mode of evolution (nei and gojobori, ).
box . mutations may affect stem-loop or other secondary and higher-order structures involved in regulatory processes through nucleic acid-nucleic acid or nucleic acid-protein interactions. the primary sequence in nonstructured, noncoding regions may also be functionally relevant. in coding regions the effect of a mutation may be context-dependent in two manners: it may be affected by other mutations in the same genome (epistasis) or by other genomes of the surrounding mutant spectrum.
when ω = 1 the evolution is considered neutral, when ω < 1 purifying (or negative) selection is dominant, and when ω > 1 positive (or directional) selection prevails (yang and bielawski, ). the types of selection undergone by viruses are discussed in section . of chapter . there are several reasons to be cautious about the significance of ω:
(i) synonymous mutations need not be neutral, for reasons discussed in section . .
(ii) in the course of evolution, important but transient events of positive selection (termed episodic positive selection) due to one or a few amino acid substitutions may be accompanied by a larger number of synonymous, tolerated mutations. in this situation, ω computes as ω < 1, thus indicating purifying selection despite a critical role of positive selection triggered by one or a few nonsynonymous mutations in the evolutionary outcome (crandall et al., ).
(iii) in a striking proof of the above arguments, statistically significant mutational biases led to a value of ω indicative of positive selection in an in vitro evolution experiment simulating pseudogene evolution in which positive selection was not possible (vartanian et al., ); this study represents a warning which is rarely mentioned when discussing the limitations of conclusions based on the value of ω.
(iv) a synonymous change may permit the mutant codon to acquire a relevant nonsynonymous change through a point mutation. the term quasisynonymous has been used to describe codons that encode the same amino acid, but that have a different evolutionary potential regarding the amino acids that they can access through a point mutation. alternative codons for a given amino acid bring a replicative system close to points of sequence space from which a phenotypically relevant change has a different probability (chapters and ).
(v) finally, ω was initially proposed to compare distantly related rather than closely related genomes, as is often the case in the short-term evolution of viruses (kryazhimskiy and plotkin, ).
for all these reasons, ω values as a diagnostic of forces mediating dna and rna virus evolution must be regarded only as indirect and suggestive, not as a definitive parameter. despite these arguments, the use of ω to propose a model of virus evolution continues to be surprisingly unchallenged in the literature of virus evolution. we use ω only in a limited way in subsequent chapters because, in addition to the limitations just listed, it does not help in the interpretation of critical evolutionary events regarding viruses. related shortcomings apply to other tests of neutrality developed to interpret the origin of dna polymorphisms in the years following the summit of the neutralist-selectionist controversy (fu, ; achaz, ).
mutation rates quantify the number of misincorporations per nucleotide copied, irrespective of the fate (increase or decrease in frequency) of the mutated genome produced. a mutation rate for a genomic site measures a biochemical event dictated by the replication machinery and environmental parameters that affect the catalytic properties of the polymerase. in contrast, a mutant (or mutation) frequency describes the proportion of a mutant (or a set of mutants) in a genome population. the frequency of a mutant will depend on the rate at which it is generated (given by the mutation rate) and on its replication capacity relative to other genomes in the population (drake and holland, ) (fig. . ). a specific mutation may be produced at a modest rate, but then be found at high frequency because the mutation is advantageous for genome replication in that environment. the converse situation may also occur.
some mutational hot spots (in the sense of genomic sites where mutations tend to occur with high probability) may never be reflected among the repertoire of mutations found in a genome population because of the selective disadvantage they inflict upon the genome harboring them. a very significant example is the elongation of an internal oligoadenylate tract located between the two functional aug initiation codons in the fmdv genome. the homopolymeric tract constitutes a hot spot for variation due to polymerase slippage (fig. . b). the elongation of the internal oligoadenylate was dramatic because it sextuplicated the number of adenylate residues present at that site; it was only observed when fmdv was subjected to repeated plaque-to-plaque (bottleneck) transfers, not large population passages. in fact, this drastic genetic modification has not been recorded among natural isolates of the virus. the molecular instruction to elongate the oligoadenylate was very strong because it was observed in many independent biological clones subjected to bottleneck transfers (escarmís et al., ). although the tract qualifies as a hot spot for variation, the first event in fitness recovery when the clones were subjected to large population passages was the reversion of the elongated tract to its original size (escarmís et al., ). the interpretation of these findings, to be further analyzed in chapter , is that during plaque-to-plaque transfers the negative selection to eliminate unfit genomes is less intense than during large, highly competitive population passages. again, a clear molecular instruction to elongate a homopolymeric tract may not be reflected in a high frequency of the affected genomes. therefore, although mutation rates and frequencies for viruses bear some relationship, rates cannot be inferred from frequencies and vice versa (fig. . ).
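the distinction between mutation rate and mutant frequency can be illustrated with a deterministic mutation-selection recurrence. the sketch below is a textbook-style toy model, not a calculation from the chapter: it assumes a haploid population, one-way mutation from wild type to mutant at rate `mu` per replication, a constant relative fitness `w` of the mutant (wild type = 1), and no random drift. all names and parameter values are hypothetical.

```python
def next_frequency(p: float, mu: float, w: float) -> float:
    """Mutant frequency after one round of replication and mutation.

    p  : current mutant frequency
    mu : rate at which wild-type templates yield mutant progeny
    w  : replicative fitness of the mutant relative to wild type (= 1)
    """
    mean_fitness = p * w + (1.0 - p)
    mutant_progeny = p * w + (1.0 - p) * mu
    return mutant_progeny / mean_fitness


def mutant_frequency(mu: float, w: float, generations: int = 2000, p0: float = 0.0) -> float:
    """Iterate the recurrence and return the final mutant frequency."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, mu, w)
    return p


if __name__ == "__main__":
    mu = 1e-4  # same mutation rate in both scenarios (illustrative value)
    for w in (0.9, 1.1):
        print(f"relative fitness {w}: mutant frequency {mutant_frequency(mu, w):.6f}")
```

with the same mutation rate, a mutant with relative fitness 0.9 settles at a low frequency close to mu / (1 - w), whereas a mutant with relative fitness 1.1 approaches fixation. in this toy model, a single observed frequency is therefore compatible with many different rates, which is the point made in the text.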
the first calculations of mutation rates for cellular organisms and for some dna bacteriophages were carried out by j.w. drake, who pursued comparative measurements that have generally supported a difference between mutation rates for dna and rna viruses. the rates estimated for bacteriophages λ and t were about times higher than those of their host e. coli. an approximately constant rate of . mutations per genome per replication round was calculated for a number of dna-based microbes (drake, ), an observation sometimes referred to as "drake's rule." this rather surprising constancy suggests that different dna organisms have accommodated the template-copying fidelity of their replication machineries to achieve a narrow window in the mutational load measured as mutations fixed per genome, a remarkable fitting of biochemistry with evolutionary needs. the basal mutation rate in mammalian cells has been estimated at about À substitutions per nucleotide and cell generation [reviewed in (naegeli, ; domingo et al., ; friedberg et al., )] (table . ). the synonymous mutation rate measured with experimental populations of bacteria has been assumed to reflect the neutral mutation rate (despite limitations explained in section . ). values for e. coli have ranged from  À up to  À substitutions per synonymous site per generation (ochman et al., ) with  À as the most likely estimate (lenski et al., ).
figure . scheme that illustrates the difference between mutation rate and mutant frequency. residue a in a template residue (top) can be misread to incorporate a c, a, or g into the complementary strand (discontinuous lines), at a rate of À , À , and À substitutions per nucleotide, respectively. the replicative capacity of the newly generated templates (with g, u, and c, continuous lines) will determine widely different mutant frequencies with g > c > u.
the latter value is in agreement with a rate of  À to  À substitutions per base pair and generation based on whole-genome deep sequencing of an experimentally evolved lineage of myxococcus xanthus (velicer et al., ). there are biological phyla for which no mutation rates have been calculated. from current knowledge, we can assume that mutation rates in cells and viruses depend on the replicative machinery (generally a multiprotein complex that includes the relevant viral polymerase with additional viral and host proteins and membrane structures) and on multiple environmental parameters (template nucleotide sequence context, ionic environment, temperature, metabolites in interaction with components of the replication apparatus, etc.). whether bacteria are in the exponential or stationary phase of growth can affect intracellular metabolites and proton exchange rates which, in turn, may alter the proportion of tautomeric forms in nucleotides and misincorporation tendencies (friedberg et al., ). the sequence context of the template nucleic acids (presence of repeated sequences that can induce misalignment mutagenesis or g-c vs. a-t rich regions in relation to relative nucleotide substrate abundances, etc.) may promote or attenuate mutability. insertion elements may enhance mutation rates at neighboring sites in a bacterial genome (miller and day, ). despite these influences, vesicular stomatitis virus (vsv) displayed comparable mutation rates in several host cells (combe and sanjuan, ), suggesting that there is a limited range of average error rates needed for a virus to maintain fitness (chapters and ). in addition to the general environmental and sequence context consequences for template-copying fidelity that may affect any genome type, mutation rates for dna viruses will also be influenced by: (i) whether the dna polymerase that catalyzes viral dna synthesis includes or lacks a functional proofreading-repair activity.
high copying fidelity is typical of dna polymerases involved in cellular dna replication (bebenek and ziuzia-graczyk, ), and low copying fidelity is generally a feature of dna polymerases involved in dna repair (friedberg et al., ; ganai and johansson, ). thus, repair of lesions that by themselves might not be mutagenic may lead to the introduction of mutations during the error-prone repair process. (ii) expression of proteins active in repair encoded in the viral genome, such as uracil-dna glycosylase, dna repair endonucleases, etc. (iii) the mechanism of viral dna replication, particularly the occurrence of double-stranded versus single-stranded dna in replicative intermediates. (iv) intracellular site of replication and the availability of postreplicative dna repair proteins (regarding both intracellular location and concentration) to the viral replication factories. little is known of the spatial relationships and relative affinities of cellular and viral proteins and structures that may critically affect polymerase fidelity. comparative measurements of mutation rates at specific genome sites of dna viruses are needed, as a first step to define the cellular and biochemical influences on the fidelity of dna virus genome replication. general genetic variability affecting the entire virus genome should be distinguished from localized variability at hot spots in a genome. even the extremely complex human genome shows genetic instability at specific loci, some associated with genetic disease (domingo et al., ; alberts et al., ; bushman, ).
rna viruses: À to À
retroviruses: À to À
dna viruses: À to À
cellular dna: À to À
values are expressed as substitutions per nucleotide. the range of values is the most likely according to several independent studies. no distinction is made between mutation rates and frequencies. see text for comments and references.
genome size is a parameter pertinent to biological behavior, not only because it imposes a commensurate copying fidelity, but also because it affects the impact of genetic heterogeneity within infected organisms and upon the invasion of new hosts (chapter ). mutation frequencies measured by subjecting virus to a specific selective agent (e.g., mutants that escape the neutralizing activity of a monoclonal antibody or mutants that escape inhibition by a drug) span a broad range of values ( À to À ) for dna and rna viruses (smith and inglis, ; sarisky et al., ; domingo et al., ) (table . ). the technical details of any procedure used to calculate a mutation frequency should be carefully evaluated to translate its meaning to the genome level. important variables are the efficacy of the antibody or drug (which will be concentration-dependent) or the possibility of phenotypic hiding-mixing in the escape mutants to be quantified (holland et al., ; valcarcel and ortin, ). unexpectedly low levels of escape mutants (that would imply < À substitutions per nucleotide) for an rna virus can mean either a general or site-specific high polymerase fidelity, a selective disadvantage of the genome that harbors the mutation or, when a phenotypic alteration is measured, the requirement of two or more mutations to produce the alteration. conversely, a high mutation frequency for a dna virus whose replication is catalyzed by a high-fidelity dna polymerase may mean that either repair activities were not functional or that the mutant displayed a selective advantage and overgrew the wild type prior to the measurement of its frequency. mathematical treatments that take into account reversion of a low fitness mutant and its competition with wild-type virus have been used to calculate mutation rates (batschelet et al., ; coffin, ).
despite difficulties and limitations in the calculations, independent genetic and biochemical methods with different viruses support mutation rates for rna viruses in the range of À to À substitutions per nucleotide copied [as representative articles and reviews see (batschelet et al., ; domingo et al., , ; steinhauer and holland, ; eigen and biebricher, ; varela-echavarria et al., ; ward and flanegan, ; mansky and temin, ; preston and dougherty, ; drake and holland, ; sanjuan et al., ; bradwell et al., )] (table . ). a few early studies indicated unusually low mutation rates or frequencies for some rna viruses. as discussed in some of the reviews listed above, there are technical reasons to suggest that such values were probably underestimates of the true average mutation rates or frequencies. obviously, it cannot be excluded that some genomic sites or viruses under a given environment might be unusually refractory to the introduction of mutations, but most evidence supports the range of values listed in table . . the near million-fold higher mutation rates for rna viruses than cellular dna, whose biological implications were presciently anticipated by j. holland and colleagues (holland et al., ), have been confirmed. that is, for rna viruses of genome length between kb and kb, an average of . - mutation is introduced per template molecule copied in the replicating population. unless most mutations impeded viral replication, a continuous input of mutant genomes is expected, as indeed found experimentally (chapter ). high mutation rates for rna genomes are also supported by measurements of template-copying fidelity by rna polymerases, reverse transcriptases, and dna polymerases devoid of proofreading exonuclease (or under conditions in which such exonuclease is not functional) [(steinhauer et al., ; varela-echavarria et al., ; mansky and temin, ; domingo et al., ; friedberg et al., , ; menéndez-arias, ), and references therein].
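the link between a per-nucleotide mutation rate and the number of mutations per copied genome is a simple product, sketched below. the numerical values are placeholders chosen for illustration, not the figures reported in the chapter, and the function names are hypothetical. the sketch assumes that misincorporations are independent and occur at a uniform rate along the template, an idealization given that hot spots exist.

```python
import math


def expected_mutations(rate_per_nt: float, genome_length: int) -> float:
    """Average number of mutations introduced per genome copied."""
    return rate_per_nt * genome_length


def fraction_error_free(rate_per_nt: float, genome_length: int) -> float:
    """Probability that a copy carries no mutation (independent sites)."""
    return (1.0 - rate_per_nt) ** genome_length


def fraction_error_free_poisson(rate_per_nt: float, genome_length: int) -> float:
    """Poisson approximation of the same quantity, exp(-rate * length)."""
    return math.exp(-expected_mutations(rate_per_nt, genome_length))


if __name__ == "__main__":
    rate, length = 1e-4, 10_000  # illustrative values only
    print("mutations per genome copied:", expected_mutations(rate, length))
    print("error-free copies (exact):  ", fraction_error_free(rate, length))
    print("error-free copies (Poisson):", fraction_error_free_poisson(rate, length))
```

with these placeholder values the average is about one mutation per genome copied, and only roughly a third of the progeny templates are exact copies, which conveys why a continuous input of mutant genomes is expected at such rates.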
in vitro fidelity tests may be based on genetic or biochemical assays using homopolymeric or heteropolymeric template-primers. measurements include the kinetics of incorporation of an incorrect versus the correct nucleotide directed by a specific position of a template, or the capacity of a polymerase to elongate a mismatched template-primer end [these and other assays have been reviewed (menéndez-arias, )] (see also section . ). differences between related enzymes (e.g., amv rt is more accurate than hiv- rt), and the fact that amino acid substitutions in the polymerases affect nucleotide discrimination, demonstrate that proofreading-repair activities together with the structure of the polymerase and replication complexes are determinants of template-copying fidelity. the term mutation rate is often used loosely in the literature of virus evolution, probably driven by nomenclature from classical population genetics. it is used to mean mutation frequency, rate of evolution, and sometimes to mean mutation rate in its real sense (as explained in section . ). a particularly risky habit is to use mutation rate when what is measured is a mutation frequency. some studies have claimed that they have a replication system devoid of selection, and therefore, the number of mutations counted corresponds to the true mutation rate of the system. this is incorrect. there is no replicative system devoid of selection because at least selection to maintain replication is in continuous operation. furthermore, as taught by quasispecies dynamics (stadler, ), supported by mutational waves in hepatitis c virus upon prolonged replication in a constant cellular environment (moreno et al., ), the mutant spectrum per se is part of the environment. since the mutant spectrum is constantly changing, so is the environment in which replication takes place.
unfortunately, some studies have proposed the existence of mutational cold spots (sites at which mutations as a biochemical event occur at a very low frequency) ignoring that negative selection might have eliminated newly arising mutations. these false conclusions imply the existence of genomic regions particularly suitable as drug or antibody targets because they cannot mutate. these are the types of studies and incorrect conclusions that keep perpetuating the problem of control of viral diseases by encouraging antiviral and vaccine designs doomed to failure (chapters and ).

evolutionary origins, evolvability, and consequences of high mutation rates: fidelity mutants

the amino acid substitutions in the core polymerase that affect fidelity can be located either close to or away from the active site of the enzyme. the change in fidelity can reach almost one order of magnitude, but virus viability is not compromised. thus, error rates themselves can be subjected to selection, as supported by theoretical studies on evolvability (earl and deem, ). early studies documented heterogeneity in the mutation rates among individual plaque isolates of influenza a virus (suarez et al., ). it is not clear whether mutation rates of viruses have evolved to achieve a balance between adaptability and genetic stability, or whether other selective constraints have imposed the observed values. it has been suggested that because of the generally deleterious nature of most mutations, the adaptive value of the high mutation rates for rna viruses is debatable and that there might have been a trade-off between replication rate and copying fidelity (elena and sanjuan, ). mutation rates would be a consequence of rapid rna replication, and an increase in copying fidelity would come at a cost, resulting in a lower replication rate.
a connection between elongation and error rate has been suggested by results with some viral and cellular polymerases [reviewed in (kunkel and erie, )]. in an early study with the poliovirus rdrp in vitro, an increase in the error frequency was observed when the ph and mg2+ ion conditions were modified, and the decreased fidelity correlated with increased rna elongation rate (ward et al., ). a possible connection between elongation rate and copying fidelity cannot be ruled out, but current evidence points to template-copying fidelity as being the result of multiple factors, not necessarily linked to the rate of genome replication (vignuzzi and andino, ; campagnola et al., ; domingo and perales, ). there is ample support for an adaptive value of high mutation rates for rna viruses, independently of their biochemical origins. a poliovirus mutant, whose rdrp displayed a three- to fivefold higher fidelity than the wild-type enzyme, replicated at a slightly lower rate than wild-type virus in cell culture but displayed a strong selective disadvantage regarding the invasion of the brain of susceptible mice (pfeiffer and kirkegaard, ; vignuzzi et al., ). the impaired capacity to cause neuropathology was due at least partly to the limited complexity of the mutant spectrum, since its broadening through mutagenesis restored the capacity to produce neuropathology. these and other studies have provided evidence that mutant spectrum complexity, by virtue of its impact on fitness, can be a virulence determinant. the work by m. vignuzzi, j. pfeiffer, r. andino, c. cameron, k. kirkegaard, and their colleagues on poliovirus fidelity mutants opened a much-needed branch of research in virus evolution and quasispecies implications. as proof of this statement, the field of fidelity mutants is rapidly expanding (bordería et al., ), and references to the information they provide will be made in several chapters.
theoretical models and experimental observations suggest that mechanisms for error correction had to evolve to maintain functionality of increasingly complex genomes (swetina and schuster, ; eigen and biebricher, ; domingo et al., ; eigen, , ) (here complexity means genome size, provided no redundant information is encoded). the coronaviruses have the largest genomes among the known rna viruses, with to kb. this is about -fold more genetic information than encoded in the simple rna bacteriophages, such as ms or qβ. coronaviruses are replicated by complex rna-dependent rna polymerases, which include a domain that corresponds to a 3′-5′ exonuclease, proofreading-repair activity. the protein displays exonuclease activity in vitro, and its inactivation affects viral rna synthesis (minskaia et al., ) and results in increases of about -fold in the average mutation frequency (eckerle et al., , ). a coronavirus mutant devoid of this repair function is more susceptible to lethal mutagenesis than the corresponding, nonmutated virus, as expected from a connection between replication accuracy and proximity to an error threshold for the maintenance of genetic information (chapter ). thus, it is likely that a proofreading activity evolved (or was captured from a cellular counterpart) in rna genomes whose complexity was at the limit compatible with the fidelity achievable by standard rna replicases. it would be interesting to discover new rna viruses with a single rna molecule longer than kb as a genome to analyze whether they have evolved more accurate core polymerases or exhibit a proofreading-repair function during replication. toward the other end of the rna size scale, viroid rnas display a mutation rate higher than (or close to the highest) recorded for rna viruses, consistent with the correlation between genome size and template-copying accuracy (gago et al., ).
studies with bacteria have identified some of the factors that successively increase copying fidelity. it has been estimated that during e. coli dna replication the error rate would be À to À mutations per nucleotide copied if accuracy relied only upon the strength of interactions provided by base pairing (section . ). the error rate would decrease to À to À with base selection and proofreading-repair, to about À with the contribution of additional proteins present in the replication complex, and to about À misincorporations per nucleotide with the participation of postreplicative mismatch correction mechanisms (naegeli, ; kunkel and erie, ; friedberg et al., ). reduction of bacterial genome size results in an increase of mutation frequency (nishimura et al., ). the error rate of the bacteriophage φ29 dna polymerase is about À without the proofreading exonuclease activity, and it decreases to À with the correcting activity [(de vega et al., ) and references therein]. postreplicative repair pathways act on double-stranded dna, but not (or very inefficiently) on rna or dna-rna hybrids. therefore, the known postreplicative repair systems that operate in cellular dna do not make a significant contribution to error correction in rna viruses (steinhauer et al., ). the importance of copying fidelity for complex genomes is reflected in the fact that more than proteins are directly or indirectly involved in the repair of the human genome. elevated mutation rates in the range of those operating for rna viruses would be lethal for large dna genomes. localized genetic modification occurs physiologically in processes such as somatic hypermutation and class-switch recombination in b cells of the germinal centers, as mechanisms of diversification of immunoglobulin genes (upton et al., ; methot et al., ). chromosomal instability has long been associated with cancer (gatenby and frieden, ; stratton et al., ).
surveys have been (and are currently being) used to identify genes associated with chromosomal instability and their role in aging and disease (aguilera and garcia-muse, ; vijg et al., ) (see also chapter ). while uncontrolled high mutability is deleterious for differentiated cellular organisms, it constitutes a modus vivendi for a great majority of viruses. despite its attractiveness, definitive proof of the hypothesis of a direct relationship between error rate and limited genome complexity will require additional functional and biochemical studies. exceptions to the absence of repair activities in simple genetic elements have been described. a satellite rna of the plant virus turnip crinkle carmovirus evolved a -end rna repair mechanism. it involves the synthesis of short oligoribonucleotides by the viral replicase using the -end of the viral genome as a template. the mechanism probably consists of template-independent priming at the -end of the damaged rna to generate wild type, negative strand, and satellite rna (nagy and simon, ). a reversible, ntp-dependent excision of the residue of the nascent nucleic acid product has been described in some retroviruses and hepatitis c virus (meyer et al., ; jin et al., ). this activity is important for drug resistance, and it may also modulate the overall fidelity of some polymerases. it cannot be excluded that some type of point mutation correction may operate in rna genetic elements of less than kb. such putative mechanisms may even diminish mutation rates that would otherwise be prohibitively deleterious, but they do not overshadow high mutation rates as a feature of rna and some dna genomes (table . ). limited copying fidelity in the absence of proofreading-correction mechanisms can be regarded as an unavoidable consequence of the molecular mechanisms involved in template copying by viral polymerases.
most nucleic acid polymerases share a structure that resembles a right hand, with fingers, palm, and thumb domains (fig. . a). three-dimensional structures of viral rdrps and rts indicate that interactions between the incoming nucleotide or residues of the template-primer with amino acids of the polymerase must permit displacement of the growing polynucleotide chain along the channel located at the palm domain of the polymerase (steitz, ; ferrer-orta et al., ; wu and gong, ) (fig. . b).

figure . (a) on the left, the klenow fragment is represented with colored fingers, palm and thumb domains, next to an open right hand. the structure on the right is that of pv rdrp, next to a closed right hand. courtesy of n. verdaguer and l. vives-adrián (the hand is that of l. vives-adrián). (b) the structure of the ternary complex between fmdv d, an rna molecule, and utp as the substrate (pdb id. e z). the left panel is a front view of the complex, depicting the polymerase chain as a yellow ribbon, the rna in dark blue (template) and cyan (primer). the incoming utp and the pyrophosphate product are shown in atom type, and mg2+ ions as magenta balls. the right panel is the same complex in a top-down orientation. (figure courtesy of c. ferrer-orta and n. verdaguer). (c) scheme of the minimum number of steps involved in nucleotide incorporation. the first step consists of the binding of polymerase e to the template-primer r n (elongated up to nucleotide n) to form a complex er n . formation of the activated complex er n is governed by the rate constant k assembly (k a ). the activated er n complex binds a nucleotide ntp with an apparent binding affinity given by k d,app to form the er n ntp complex. catalysis to covalently incorporate the ntp to the growing primer chain to yield er n+1 and pyrophosphate (pp i ) is governed by the rate constant k pol . other constants depicted in the scheme are the inactivation rate constant (k inact ) of e, and dissociation of e from rna (k off, er n and k off, er n+1 ). based on arias, a., arnold, j.j., sierra, m., smidansky, e.d., domingo, e., et al., . determinants of rna-dependent rna polymerase (in)fidelity revealed by kinetic analysis of the polymerase encoded by a foot-and-mouth disease virus mutant with reduced sensitivity to ribavirin. j. virol , e , and previous studies with pv polymerase d by c. e. cameron and his colleagues.

the polymerase of classical swine fever virus and other pestiviruses, such as bovine viral diarrhea virus, includes an n-terminal extra domain of about amino acids; its interaction with the palm domain is important for template-copying fidelity. if interactions around the catalytic site to ensure the correct nucleotide incorporation were so strong as to preclude misincorporations, the movement of the growing polynucleotide chain would be hampered. again, this compromise suggests a match between biochemical and evolutionary needs. the orientation of the triphosphate moiety of the incoming nucleotide substrate is important for nucleotide incorporation (menéndez-arias, ; graci and cameron, ; ferrer-orta et al., ). one of the several steps involved in nucleotide incorporation is the formation of a ternary complex (polymerase with template-primer and the incoming nucleotide) that undergoes a conformational change (reorientation of the divalent ion-complexed triphosphate moiety of the incoming nucleotide). this conformational change activates the complex for phosphoryl transfer, to link the nucleoside monophosphate to the -terminus of the primer (or growing chain). steps involved in the nucleotide incorporation are represented in fig. . c. both the conformational change and the relative rate of phosphoryl transfer for an incorrect nucleotide versus the correct nucleotide influence the error rate at each site of the growing chain. critical kinetic constants in fig. .
c that are determined experimentally to quantify relative nucleotide incorporations and misincorporations are k d,app (expressed as µM), k pol (expressed as s⁻¹), and the ratio k pol /k d,app (µM⁻¹ s⁻¹), termed the catalytic efficiency. the ratio of k pol /k d,app for the incorporation of an incorrect nucleotide to k pol /k d,app for a correct nucleotide gives the frequency of that particular misincorporation, and an assessment of polymerase fidelity [(castro et al., ) and references therein]. modifications of polymerase residues by site-directed mutagenesis, combined with comparisons of the relevant structures, have identified critical amino acid residues involved in template-copying fidelity. high-fidelity mutants are frequently obtained by selecting viruses resistant to mutagenic nucleotide analogs such as the antiviral agent ribavirin (beaucourt and vignuzzi, ). limited incorporation of a deleterious nucleotide can be attained either through specific discrimination against the analog (ribavirin or other) or through a general decrease of all types of misincorporations, that is, a high-fidelity phenotype. structural modifications of viral polymerases that lead to high fidelity have inspired the design of mutant viral polymerases displaying either an increase or decrease of copying fidelity achieved through a single amino acid substitution (wainberg et al., ; menéndez-arias, ; mansky et al., ; pfeiffer and kirkegaard, ; arnold et al., ; domingo, ; vignuzzi et al., ; coffey et al., ; gnädig et al., ; meng and kwang, ; rozen-gagnon et al., ; bordería et al., ). the capacity of viruses to evolve at higher or lower rates than their ancestors is achievable through modest numbers of mutations (limited movements in sequence space, chapter ), again emphasizing the evolvability of mutation rates. the rates of mutation and recombination need not be independent. m.
vignuzzi and colleagues have shown that a mutator sindbis virus displays a higher recombination rate and enhanced production of di particles than the wild type virus (poirier et al., ). a connection between mutation and recombination rates strengthens the evolutionary consequences of the modifications of template-copying fidelity that can be achieved through a single amino acid substitution.

hypermutagenesis and its application to generating a variation: apobec and adar activities

some viral genomes either isolated from biological samples or evolved in cell culture show biased mutation types (e.g., monotonous g → a or c → u substitutions in the same genome), generally at frequencies of around À substitutions per nucleotide ( - to -fold higher than standard mutation rates and frequencies) (table . ). biased hypermutation was first observed in some defective interfering (di) rnas of vesicular stomatitis virus (vsv) (holland et al., ), and in variant forms of measles virus associated with postmeasles neurological disease, such as subacute sclerosing panencephalitis (cattaneo and billeter, ). hypermutation is mainly due to the activity of cellular deaminases, such as the apolipoprotein b mrna editing complex (apobec) or the adenosine deaminase acting on double-stranded rna (adar) families, that are involved in cellular editing and regulatory functions (sheehy et al., ; santiago and greene, ; nishikura, ; stavrou et al., ; pfaller et al., ; venkatesan et al., ). in the event of a viral infection, such cellular functions can become part of an innate defense mechanism against the invading virus. viral proteins (e.g., vif in hiv- ) bind some apobec proteins, thus inhibiting mutagenesis and permitting virus survival (sheehy et al., ).
in oncoretroviruses, retroviruses, and hepatitis b virus (hbv), the apobec- cytidine deaminase acts on single-stranded dna and results mainly in g → a and c → u hypermutation, which may affect % to % of the g residues. the preferred sequence context for g hypermutation in hiv- observed in vivo is gpa > gpg > gpt ≈ gpc. the specific dinucleotide context of the hypermutated sites provides a means to distinguish genomes that have undergone hypermutation by cellular activities from those that are heavily mutated by other mechanisms, such as the action of mutagenic agents (chapter ). apobec proteins play a role in cancer through cytidine deaminase mutagenesis and generation of double-strand breaks in chromosomal dna (wang et al., ). apobec levels in the cell may be regulated by cellular and viral proteins, for example, the human papillomavirus (hpv) oncoprotein e , which stabilizes apobec a in human keratinocytes and may thereby promote cervical cancer (westrich et al., ). the adar-associated hypermutation was identified in negative-strand rna riboviruses and results mainly in a → g and u → c hypermutation. it originates from a → i (inosine) modifications in double-stranded viral rna catalyzed by adar- l, one of more than proteins inducible by type i ifn (maas et al., ). inosine can be recognized as g by the replication machinery (valente and nishikura, ), although it can form wobble base pairs also with a and u (fig. . ). hypermutation can contribute to genetic variation of viruses (hirose et al., ). there are additional mechanisms of hypermutagenesis. higher than average mutation frequencies can occur as a result of replication in the presence of biased concentrations of the standard nucleotide substrates; this has been applied to the in vitro generation of genes mutated at frequencies of À to À (mutagenic pcr), as a powerful tool to study sequence-function relationships and functional robustness of nucleic acids and proteins (meyerhans and vartanian, ).
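the dinucleotide-context criterion mentioned above (gpa > gpg > gpt ≈ gpc) can be checked by tallying each g → a change according to the nucleotide that follows the mutated g. the sketch below is one possible way to do that in python, assuming two pre-aligned sequences of equal length written in the dna sense. the function name and the two short sequences are hypothetical and serve only as an illustration; the context is read from the reference sequence, which is itself a simplifying assumption.

```python
from collections import Counter

def g_to_a_contexts(reference: str, variant: str) -> Counter:
    """Count G->A substitutions by the nucleotide 3' of the mutated G (GpN)."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned and of equal length")
    reference, variant = reference.upper(), variant.upper()
    counts = Counter()
    # The last position has no 3' neighbor, so it is not scored.
    for i in range(len(reference) - 1):
        if reference[i] == "G" and variant[i] == "A":
            counts["Gp" + reference[i + 1]] += 1
    return counts

# Hypothetical aligned sequences, for illustration only.
ref = "ATGGACGATGGGTCAGAGCT"
var = "ATGAACAATGAGTCAAAGCT"
contexts = g_to_a_contexts(ref, var)
print(dict(contexts))
```

a tally dominated by gpa and gpg would be compatible with the apobec-type pattern described in the text, whereas changes spread evenly over all four contexts would point to another mutagenic mechanism. real analyses would also need to normalize by how often each dinucleotide occurs in the reference.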
error-prone pcr has been used in experiments of in vitro evolution of nucleic acid enzymes to generate heterogeneous collections of nucleic acid sequences to select for molecules capable of catalyzing specific reactions (joyce, ) (chapter ). high mutation rates have practical implications in laboratory studies on the behavior of virus mutants obtained by molecular cloning of a biological sample, or constructed by site-directed mutagenesis. a transition mutation that causes a strong fitness decrease but that still allows residual rna genome replication will most likely revert following infection or transfection of cells with the mutant construct and subsequent viral replication. double or triple mutants (preferentially including transversions) should be engineered (when possible according to the genetic code) to study the behavior of a viral mutant with an amino acid replacement of interest that may produce a fitness decrease. as an example, a c → u transition found in an open reading frame of an rna virus may convert a pro into a ser (ccg → ucg). since ser will revert to pro through a u → c transition in the triplet (a common type of misincorporation by most polymerases), ser should be engineered to be encoded by agu; in the course of replication, reversion to pro would require at least two transversions since the codons for pro are ccu, ccc, cca, or ccg. thus, if effects derived from the difference in the primary sequence of the rna or codon bias do not intervene in the behavior of the viral genome, codons with a high genetic barrier to reversion should be engineered for studies involving viral replication. in general, deletions revert at a much lower frequency than point mutations, and when appropriate for the question under study, a deletion should be introduced within the gene of interest to probe gene function in reverse-genetics studies.
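the pro/ser example above can be verified by counting, for each candidate ser codon, the minimum number of substitutions needed to reach any pro codon and how many of them are transversions. the following python sketch does this directly. the helper names are hypothetical, and the comparison considers only the direct codon-to-codon difference, ignoring the fitness of intermediate codons, which is an assumption of this illustration.

```python
PURINES = {"A", "G"}
PRO_CODONS = ["CCU", "CCC", "CCA", "CCG"]

def substitution_type(a: str, b: str):
    """Classify a single-base change; None when the bases are identical."""
    if a == b:
        return None
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"

def reversion_barrier(codon: str, target_codons: list) -> tuple:
    """Return (substitutions, transversions) for the closest target codon."""
    best = None
    for target in target_codons:
        kinds = [substitution_type(x, y) for x, y in zip(codon, target)]
        changes = sum(kind is not None for kind in kinds)
        candidate = (changes, kinds.count("transversion"))
        if best is None or candidate < best:
            best = candidate
    return best

# Ser codon produced by the C->U transition discussed in the text.
natural_ser = reversion_barrier("UCG", PRO_CODONS)
# Ser codon proposed in the text as having a higher barrier to reversion.
engineered_ser = reversion_barrier("AGU", PRO_CODONS)
print(natural_ser, engineered_ser)
```

the same function can be applied to any pair of amino acids by supplying the corresponding codon lists, which gives a quick way to pick the synonymous codon with the highest barrier before a mutant is constructed.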
high mutation rates also imply that infection or transfection with debilitated mutant viruses may result in progeny with sequences that differ from the input. v. i. agol and colleagues have coined the term quasi-infectious to refer to mutant viruses that are capable of yielding progeny, but the progeny differs from the initial genome (pseudorevertants) (gmyl et al., ; agol and gmyl, ) . the difference between the input mutant and the rescued progeny virus will depend on the type of genetic lesion in the input virus and its consequences for virus multiplication. a single point mutation that decreases replication is likely to evolve to yield a true revertant (return to the original sequence) upon replication. if the same reversion depends on two or more mutations, a true revertant will require extended replication for exploration of sequence space (chapter ), and selection of compensatory mutations elsewhere in the genome (sometimes referred to as second site revertants) becomes an alternative for fitness gain. the term compensatory applies to mutations that compensate for the deleteriousness of other mutations. a typical example is a mutation that decreases the stability of a stem in an rna stem-loop that functions as a cis-acting element. a compensatory mutation restores a stable stem needed for the activity. transfection of cells by an engineered virus with some preselected genetic modification (produced either from cdna copies of a viral genome or by chemical synthesis) may yield progeny genomes, which differ from the parent. if a substantial loss of replicative capacity is produced by a drastic genetic change (an indel, loss of a stem-loop structure, etc.) selection of a true revertant becomes extremely unlikely. the compensatory generation of alternative structures (or constellations of point mutations) that restores replication (partially or completely) becomes an interesting and informative possibility. 
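the stem-loop example of a compensatory mutation can be made concrete by counting base pairs in a stem before and after each change. the python sketch below pairs the 5′ arm of a stem with its 3′ arm in antiparallel orientation and counts watson-crick and g-u wobble pairs. the sequences and the function name are hypothetical, and the number of base pairs is only a crude stand-in for stem stability, which in practice would be estimated with a thermodynamic folding program.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def paired_positions(five_prime_arm: str, three_prime_arm: str) -> int:
    """Count base pairs when the two arms of a stem are aligned antiparallel."""
    if len(five_prime_arm) != len(three_prime_arm):
        raise ValueError("stem arms must have the same length")
    return sum(
        (x, y) in PAIRS
        for x, y in zip(five_prime_arm, reversed(three_prime_arm))
    )

# Hypothetical five-base-pair stem, for illustration only.
wild_type = paired_positions("GGCAC", "GUGCC")
# A->C change in the 5' arm disrupts one pair.
mutant = paired_positions("GGCCC", "GUGCC")
# U->G change in the 3' arm (second site) restores the pair.
compensated = paired_positions("GGCCC", "GGGCC")
print(wild_type, mutant, compensated)
```

the compensated stem has the same number of pairs as the wild type but a different sequence, which is why such progeny are pseudorevertants rather than true revertants.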
procedures to copy an entire viral rna genome into a cdna for reverse genetics studies are now available (fan and di bisceglie, ). if for technical reasons an infectious cdna clone is constructed from several molecules, which were copied from different genomes present in the mutant spectrum, the ligation product may be transcribed into an rna that is not infectious. this is because, while some constellations of mutations may be compatible with infectivity, others may not, or may allow limited, suboptimal replication, thus favoring the selection of additional mutations or reversions. the same applies to a synthetic genome based on one of the multiple genomic sequences from a viral isolate. individual mutations may be detrimental either per se, or in combination with other mutations. the joint effect of different mutations in the same genome is often referred to as epistasis. mutations that reinforce each other with regard to a viral function are said to produce positive epistasis, and those that interfere with each other produce negative epistasis (also mentioned in section . ). epistasis in rna viruses may be blurred by the weight of mutant spectra in determining viral behavior through intergenomic interactions (chapter ). an interesting contrast that recapitulates concepts given in sections . and . is the effect of an active proofreading-repair activity in maintaining the infectivity of a viral genome upon its extended replication in vitro (in a test tube, in the absence of cellular extracts). the , bp bacteriophage φ29 dna can be amplified at least -fold without detectable loss of infectivity due to the fidelity of φ29 dna polymerase conferred by a 3′-5′ proofreading-repair exonuclease activity (bernad et al., ). engineered φ29 dna polymerases provide a powerful amplification tool in genomics (de vega et al., ).
in contrast, the -nucleotide-long qβ rna rapidly loses its infectivity when replicated by qβ replicase in vitro due to the accumulation of mutations and deletions in the viral rna (mills et al., ; sabo et al., ). the error-prone qβ replicase is not adequate to amplify infectious viral rna, but it was at the origin of the quasispecies concept to be discussed in chapter . mutagenic dna polymerases (generally those involved in dna repair) are an alternative to mutagenic pcr (section . ) to produce randomly mutated collections of nucleic acid molecules (forloni et al., ).

figure . rna recombination and segment reassortment. (a) scheme of replicative and nonreplicative rna recombination. rna polarity is indicated by + and − symbols. replicative recombination is displayed as the result of template switching during minus-strand rna synthesis. nonreplicative recombination is depicted as the outcome of breakage and ligation (joining) of fragments of plus-strand rna. (b) an example of genome segment reassortment with the formation of a new segment constellation in which six genomic segments originate from one parent (blue) and two from the other (gold). influenza virus is the best-known example (see text).

recombination is the formation of a new genome by covalent linkage of genetic material from two or more different parental genomes (fig. . ). recombination can also involve different sites of the same genome to yield insertions or deletions, such as in the formation of defective interfering (di) genomes. it is a widespread mechanism of genetic variation in all biological systems, and in cells, it underlies critical physiological and developmental processes (splicing, generation of diversity in immunoglobulin genes and t cell receptors, transposition events, phase variation in bacteria, repair pathways that promote postreplicative error correction, etc.).
cellular dna recombination relates to replication, repair, and completion of dna replication, operations that involve multiple proteins displaying a variety of activities (smith and jones, ; alberts et al., ; nimonkar and boehmer, ; friedberg et al., ) . recombination occurs both with dna and rna viruses, often with the participation of the virus replication machinery. several types of recombination have been distinguished in viruses: homologous versus nonhomologous recombination, according to the extent of nucleotide sequence identity around the recombination (crossover) site, and replicative versus nonreplicative recombination, according to the requirement of viral genome replication for recombination to occur (kirkegaard and baltimore, ; king, ; lai, ; nagy and simon, ; plyusnin et al., ; boehmer and nimonkar, ; gmyl et al., ; chetverin et al., ; agol, ; simmonds, ; bujarski, ; perez-losada et al., ; agol and gmyl, ; bentley and evans, ) . as in the case of cells, homologous recombination in double-stranded dna viruses is intimately connected with dna replication and repair. it implicates multiple viral gene products (dna polymerase, single-stranded dna-binding proteins, processivity factors, helicase-primase, eukaryotic topoisomerase i, etc.), and a succession of protein-catalyzed steps (czarnecki and traktman, ) . in the copy choice (or template switching) mechanism, the nascent dna switches from one template molecule to another, resulting in the synthesis of recombinant, daughter dnas. in its basic form, recombination by breakage and rejoining starts with the introduction of a nick at one of the strands of each parental dna, strand invasion of one parental dna by the other, branch migration, ligation at the nicks (linking dna strands from the two parents), and further isomerization and cleavage reactions. 
dna recombination is responsible for the endonuclease-mediated isomerization of herpesvirus genomes [four isomers defined by the orientation of the long (l) and short (s) regions of the viral genome]. during the late phase of herpes simplex virus- replication, the frequency of recombination has been estimated at . % per kb of the genome. integration or excision of proviral dna or temperate bacteriophage dna are examples of site-specific recombination that involves specific enzyme activities (e.g., retroviral integrases), and requires a short stretch of nucleotide sequence identity. the copy choice mechanism of homologous rna recombination is also associated with genome replication. an rna polymerase molecule with its nascent rna product jumps into the corresponding position of another template molecule, to complete synthesis of the rna product (fig. . ). given the large numbers of viral genomes often present in replication complexes (also termed replication factories) in each infected cell, it is not surprising that this mechanism may give rise to frequent recombinant progeny genomes. formation of mosaic genomes has long been recognized as an essential feature of the genetics of some retroviruses and plant rna viruses. for hiv- and some plant rna viruses recombination frequencies have been estimated at % to % of progeny per nucleotides; for picornaviruses and coronaviruses the number of recombinants amounts to % to % of the progeny (king, ; lai, ; nagy and simon, ; levy et al., ; urbanowicz et al., ; sztuba-solinska et al., ). using a phylogenetic approach, the average recombination rate of hiv- in vivo was estimated at .  À recombination events/site/generation, which is about fivefold greater than the average point mutation rate (shriner et al., ). a ten-fold lower value of .  À recombination events/site/generation was estimated from the changes in the genetic composition of hiv- within single patients (neher and leitner, ). recombination is required for hiv- replication and genome integrity (rawson et al., ). in negative-strand rna viruses recombination may be inefficient or absent, but some of them can display homologous recombination (plyusnin et al., ), and a high rate of generation of di rnas and other types of defective genomes (roux et al., ; rezelj et al., ). recombination frequency may be altered by environmental factors that affect viral replication. a decrease of intracellular nucleotide levels as a result of treatment of cells with hydroxyurea may favor template switching, reflected in an increase of intra- and intermolecular recombination (pfeiffer et al., ; svarovskaia et al., ). homologous rna recombination can also be influenced by amino acid substitutions in the polymerase, the primary sequence in the rna (e.g., high frequency of template switching in au-rich regions), the sequence identity between the nascent strand and acceptor template, and secondary structures at or around the crossover sites, among other influences (nagy and simon, ; agol, ; agol and gmyl, ). since recombination necessitates coinfections of the same cell by at least two parental genomes, the persistence of a viral genome in a cell increases the likelihood of sequential coinfections, unless some reinfection or superinfection exclusion mechanism operates [(webster et al., ) and references therein]. without such restrictions, persistently infected cells may be an environment with a higher probability of recombination than transiently infected cells, assuming comparable genome loads at the sites of replication.
nonreplicative recombination does not require replication of the viral genome, and has been described upon cotransfection of cells with viral rna fragments that could not replicate by themselves (gmyl et al., ; gallei et al., ; agol, ; agol and gmyl, ; bentley and evans, ). it appears to be a promiscuous event with a required -phosphate in the partner rna and a -hydroxyl residue in the partner rna, mediated by cellular components whose mechanisms of activity are not understood. the emerging picture is that the frequency of recombination varies among viruses, and that as new tools for genome analyses have become available, recombination has been detected in an increasing number of viruses. recognition of recombination in a viral system is facilitated when a cell culture system is available. controlled infection of cells with genetically marked parental viruses has been essential to estimate recombination frequencies, and to distinguish true recombination from mutation-reversion events that may mimic the formation of recombinants. as with the concept of high genetic variation in rna viruses, recombination has often gone from being considered marginal to prominent and relevant; hcv is a typical example (galli and bukh, ), with presently at least one chimera established as a circulating recombinant form in the field. viral replicative machineries may be endowed with features that influence the occurrence of recombination. one such feature is processivity of the viral polymerase (capacity of continued copying of the same template molecule). detachment of the polymerase complex from one genome to bind either to a different genome or to a distant site of the same genome is part of the standard replicative cycle of viruses such as retroviruses and coronaviruses. reverse transcriptase participates in strand transfer during dna synthesis, and coronavirus polymerase switches from one template site to another during discontinuous rna synthesis.
it may be significant that they belong to viral families displaying high recombination frequencies (makino et al., ). thus, here again, we encounter a "molecular instruction" that evolved as an essential feature of viral genome replication, and that can be exploited to generate variation, and permit new genomic forms to undergo the scrutiny of selection (neher and leitner, ).
recombination is diagnosed by discordant positions of different genes or genomic regions in phylogenetic trees, as a result of the transfer of part of a viral genome from representatives of one lineage to representatives of another lineage (chapter ). a commonly used procedure measures similarity values between sequences using a sliding-window scanning method. the recombination crossover point (where the two parental sequences meet) is identified by the point (or region) where the similarity plot crosses from one sequence into another (salemi and vandamme, ; martin et al., ; kosakovsky pond et al., ; perez-losada et al., ). crossover points along a viral genome are not distributed at random, either because polymerase detachment from the template is sequence-dependent or because many of the recombination events do not lead to viable progeny. absence of recombinant viability introduces a parallel with the distinction between mutation rate and mutation frequency (section . ); that is, a difference between what does occur at the biochemical level during replication and what is subsequently observed upon analysis of the replication products. newly arising recombinants may be subjected to negative selection, and only a viable subset might be detectable in the progeny virus (king, ; lai, ). an elegant study by d. j. evans and colleagues documented "imprecise" enterovirus recombinant intermediates that were lost upon serial virus passage (lowry et al., ). recombination is viewed as a biphasic process consisting of initial imprecise events followed by a stage of resolution in favor of fit recombinants.
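the sliding-window scanning procedure described above can be illustrated with a minimal sketch. it assumes three pre-aligned sequences of equal length (a query and two putative parents); the function names, the window size, and the step are illustrative choices, not those of any published recombination-detection program. the crossover is reported only approximately, as the first window in which the closest parent changes.

```python
def window_identity(query, reference, window=20, step=5):
    """Fraction of identical positions in each window of two aligned sequences."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = reference[start:start + window]
        matches = sum(1 for a, b in zip(q, r) if a == b)
        results.append((start, matches / window))
    return results


def crossover_windows(query, parent_a, parent_b, window=20, step=5):
    """Label each window with the closer parent ("A", "B", or "=" for a tie)
    and list the window starts at which the closer parent changes."""
    sim_a = window_identity(query, parent_a, window, step)
    sim_b = window_identity(query, parent_b, window, step)
    labels = []
    for (start, id_a), (_, id_b) in zip(sim_a, sim_b):
        if id_a > id_b:
            labels.append((start, "A"))
        elif id_b > id_a:
            labels.append((start, "B"))
        else:
            labels.append((start, "="))
    switches = []
    previous = None
    for start, label in labels:
        if label == "=":
            continue  # ties carry no information about the closer parent
        if previous is not None and label != previous:
            switches.append(start)
        previous = label
    return labels, switches


if __name__ == "__main__":
    # Toy parents that differ at every position (hypothetical sequences).
    parent_a = "ACGT" * 15
    parent_b = "CATG" * 15
    # Query copied from parent A up to position 30, then from parent B.
    query = parent_a[:30] + parent_b[30:]
    labels, switches = crossover_windows(query, parent_a, parent_b)
    print(labels)
    print("closest parent changes in window(s) starting at:", switches)
```

real analyses work on multiple alignments and add statistical support for each breakpoint; the sketch only shows why the similarity plot "crosses" from one parent to the other near the crossover site.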
the distinction between generation and resolution events that applies both to mutants and recombinants has yet another implication for rna virus genetics. some mutants or recombinants that in isolation do not exhibit sufficient replicative fitness to acquire dominance in a population may nevertheless persist as minority genomes. they may display low-level replication or be maintained by complementation by partner genomes (as in the case of two fmdv genome segments that are described in section . ). as minority genomes, they may engage in modulatory activities (chapter ).
in viruses whose genomes are composed of two or more rna or dna segments, genome segment reassortment consists of the formation of new constellations of viral genomic segments from two or more parental genomes (mcdonald et al., ) (fig. . ). reassortment can produce new phenotypic traits. it is the main mechanism of antigenic shift of influenza a viruses, often associated with new influenza pandemics (webster et al., ; morse, ; gibbs et al., ; domingo et al., ), as opposed to antigenic drift, which is mediated by amino acid substitutions in the surface proteins hemagglutinin and neuraminidase (barbezange et al., ). reassortments occur among the double-stranded rna segments of the widespread reoviridae family (tanaka et al., ). fitness differences among all possible segment combinations (2^n, for two types of coinfecting particles with n genome segments) determine the types of genome segment groupings that dominate subsequent rounds of infection. in the laboratory, analysis of reassortant viruses has been applied to map a viral function into one segment or a combination of segments. genomic segments can be encapsidated either into a single virus particle (as in orthomyxoviruses or arenaviruses) or into separate particles (as in multipartite plant viruses). a multipartite virus can have either rna or dna as genetic material.
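the count of possible segment combinations mentioned above for reassortment can be made concrete with a short sketch. it assumes that each of the n segments of a progeny genome is drawn independently from one of the coinfecting parents, so that two parents yield 2^n constellations, two of which are the parental genotypes themselves. the function names are hypothetical, and the example of eight segments corresponds to influenza a virus.

```python
from itertools import product


def segment_constellations(n_segments, parents=("A", "B")):
    """Enumerate every assignment of a parental origin to each genome segment."""
    return list(product(parents, repeat=n_segments))


def count_constellations(n_segments, n_parents=2):
    """Return (total constellations, constellations that differ from every parent)."""
    total = n_parents ** n_segments
    return total, total - n_parents


if __name__ == "__main__":
    # Three segments, two parents: small enough to print in full.
    for combo in segment_constellations(3):
        print("".join(combo))
    # Eight segments, two parents (e.g., an influenza a virus coinfection).
    total, novel = count_constellations(8)
    print(total, "constellations,", novel, "of them reassortants")
```

the enumeration treats all constellations as equally likely to form; as stated in the text, fitness differences among them determine which groupings actually dominate subsequent rounds of infection.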
the plant nanoviruses have - molecules of single-stranded circular dna of about nucleotides, and each segment encodes a single protein. in the case of the nanovirus faba bean necrotic stunt virus, its eight segments vary in frequency in a host-dependent manner (sicard et al., ). this observation led the authors to propose a "setpoint genome formula," which may reflect the control of segment (gene) copy number that may provide some still unrecognized benefit to the multipartite phenotype (see next section . ). in principle, replication of multipartite viruses requires that each cell be coinfected by at least one of each type of particle harboring a different genome type, which in fact represents a remarkable cost for replicative efficiency. the fact that unsegmented and segmented rna viruses are well represented in our biosphere suggests that neither of the two organizations confers a definitive and general advantage for long-term survival. the origin of viral genome segmentation is a debated issue, although there is general agreement that it may confer adaptive flexibility to viruses. most proposals have been based on theoretical studies. segmentation has been viewed as a form of sex that facilitates genomic exchanges to counteract the effect of deleterious mutations (chao, ; szathmary, ). an alternative, not mutually exclusive, model is that segmentation confers an advantage because replication of shorter rna molecules is completed earlier than that of their unsegmented counterparts (nee, ). yet another possibility is that the lifestyle of a virus (in particular, the particle yield in connection with the number of surrounding susceptible cells), shaped over long evolutionary periods, may favor segmentation over intactness of a genetic message or vice versa. an experimental system of genome segmentation is available with the picornavirus foot-and-mouth disease virus (fmdv).
its single-stranded rna genome underwent a modification akin to genome segmentation when the standard virus was subjected to passages in bhk- cells at high multiplicity of infection. the experiment was originally intended to investigate the limits of fitness gain following prolonged multiplication in a defined environment, in this case, bhk- cells in culture. the starting fmdv had not been well adapted to the bhk- cell culture environment since it derived from a diseased swine during a disease outbreak, and it was minimally propagated in bhk- cells to obtain a biological clone by plaque isolation. upon extensive replication of the clone in bhk- cells, the virus evolved toward a bipartite genome (garcía-arriaza et al., ) . each of the two pieces of rna that composed the bipartite (or segmented) genome version contained in-frame deletions affecting trans-acting proteins ( fig. . ) . each segment in isolation could not infect cells productively, but, when present together, they were infectious by complementation, and killed cells in the absence of standard fmdv. a low multiplicity of infection rapidly selected the full-length genome as a result of recombination of the two parental, defective segments (garcía-arriaza et al., ) . the particles containing the shortened rna were thermally more stable than the standard particles (ojosnegros et al., ) , but this difference did not explain the initial trigger of the segmentation event. the solution to this question came with the demonstration that the transition toward genome segmentation was possible because of an extensive exploration of the mutational sequence space by the standard virus. indeed, the mutations that accumulated during serial passages enhanced the fitness of the segmented genome version to a much higher extent than the fitness of the standard genome (moreno et al., ) (fig. . ) . 
thus, gradual evolution (drift in sequence space) was a requirement for the major transition toward segmentation, adding reassortment to mutation and recombination as potential mechanisms of genetic variation of this laboratory-adapted picornavirus. cooperation and complementation are discussed in section . of chapter . it should be noted that segmented forms of rna viruses have been engineered, but little is known of the incurred fitness cost; it is therefore relevant to have shown that segmentation was possible upon unperturbed replication of a virus with an unsegmented rna genome. the experimental result suggests that in evolution there is no insurmountable barrier to the conversion between intact and split forms of the same genome, reflecting remarkable genome flexibility that will be emphasized in chapter in the context of the relevance of virus variation in the emergence of viral pathogens.
mutation, recombination, and segment reassortment contribute to the evolution of most dna and rna viruses. sometimes one form of genetic change appears to be more prominent than another, and sometimes the concerted action of recombination or reassortment with mutation is apparent [e.g., antigenic drift in influenza virus, following the origin of a new antigenic type through reassortment (ghedin et al., )]. mutation is a universal form of genetic change. it underlies numerous adaptive responses and critical biological transitions in viruses, and it is a prerequisite for recombination and reassortment to have a biological impact. if mutations were not present in different template molecules during replication, recombinants with the crossover point at equivalent positions of the parental genomes would be "silent," and display the same behavior as the parental genomes.
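the dependence of observable recombination on prior mutation can be shown with a toy model of copy choice. the sketch assumes a single template switch at an equivalent position of two parental genomes of equal length; the sequences and function names are hypothetical. a recombinant is counted as detectable only if its sequence differs from both parents, which requires marker mutations on both sides of the crossover.

```python
def copy_choice(donor, acceptor, crossover):
    """Copy the donor template up to `crossover`, then switch to the
    equivalent position of the acceptor template and finish the copy."""
    if len(donor) != len(acceptor):
        raise ValueError("parental genomes must have the same length")
    if not 0 <= crossover <= len(donor):
        raise ValueError("crossover must lie within the genome")
    return donor[:crossover] + acceptor[crossover:]


def is_detectable(donor, acceptor, crossover):
    """A recombinant is detectable by sequence only if it differs from both parents."""
    recombinant = copy_choice(donor, acceptor, crossover)
    return recombinant != donor and recombinant != acceptor


if __name__ == "__main__":
    wild_type = "AAAAAAAAAA"
    single_mutant = "AAGAAAAAAA"   # one marker, left of the crossover
    double_mutant = "AAGAAAAACA"   # markers on both sides of the crossover

    # Identical parents: the template switch is "silent".
    print(is_detectable(wild_type, wild_type, 5))       # False
    # One marker only: the product equals one of the parents.
    print(is_detectable(wild_type, single_mutant, 5))   # False
    # Markers flanking the crossover: a new genotype is produced.
    print(is_detectable(wild_type, double_mutant, 5))   # True
    print(copy_choice(wild_type, double_mutant, 5))     # AAAAAAAACA
```

even the detectable product in this example carries a single substitution relative to the wild type, so without knowledge of the parents it could be taken for a point mutant, which is the ambiguity noted for genomes of the same quasispecies swarm.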
figure . evolution toward rna genome segmentation in the laboratory. the monopartite, standard fmdv genome (clone c-s c or pmt , top) was subjected to passages in bhk- cells. the resulting population p lacked detectable standard genome that could be rescued by low moi passages. the evolved c-s p accumulated point mutations (depicted as vertical lines on the genome at the bottom) and consisted of two segments that were infectious by complementation: d , which lacked most of the l protease-coding region, and d , which lacked most of the vp , vp capsid protein-coding region. see text for further details and references.
apparently, "silent" recombination events may take place within replicative units; even if some mutations distinguished individual genomes of the same quasispecies swarm, a recombinant would not be distinguished from a mutant genome. the frequency of recombination in hiv- was noticed only when the acquired immune deficiency syndrome (aids) pandemic had advanced, and the virus had diversified through the accumulation of mutations. similar arguments apply to segment reassortment. genomes necessitate mutation-driven diversification for reassortment to provide a biological difference; detection of a reassortant will be easier the larger the replicative advantage it confers to the virus (see also chapter ). the evolutionary significance of recombination has been viewed in two opposite ways: as a means to rescue fit genomes from less fit parents (a conservative force that eliminates deleterious mutations), or as a means to explore new genomic forms for adaptive potential (a vast substrate for the exploration of sequence space; chapter ) [reviewed in (zimmern, ; lai, ; worobey and holmes, ; simmonds, ; perez-losada et al., )]. recombination has probably been at the origin of new viruses that presently occupy a well-established niche, and it is also at play today to expand diversity during the spread of viruses.
as a historical event, the coronavirus mouse hepatitis virus appears to have acquired its hemagglutinin-esterase gene by recombination with an influenza c virus. the alphavirus western equine encephalitis virus probably originated by recombination between sindbis-like and eastern equine encephalitis-like viruses [reviewed in different chapters of ]. several recent poliomyelitis outbreaks have been associated with recombinants between oral poliovirus vaccine (opv) viruses and other circulating enteroviruses (gavrilin et al., ; kew et al., ; oberste et al., ; muslin et al., ). intersubtype hiv- recombinants play a key role in current hiv- diversification, with around circulating recombinant forms (and the number is growing) displaying complex mosaic structures (multiple crossover sites) (thomson et al., ; gerhardt et al., ). in addition, other hiv- recombinants have been characterized that are not established epidemiologically. fewer hcv recombinants have been identified, but the number is likely to increase as the virus diversifies in nature. positive selection of hiv- recombinants that unite different drug-resistance mutations in the same genome offers an example of the conservative force of recombination to rescue fit viruses in the face of a strong selective constraint (menéndez-arias, ). recombination is expected to play an increasing role in the spread of drug resistance among viruses for which new antiviral agents are in use, such as hbv and hcv. some defective dna and rna genomes that include indels, notably di rnas, which originate from recombination events, can play an important role in the establishment and maintenance of persistent infections in cell culture, and can modulate viral infections in vivo (holland and villarreal, ; roux et al., ; rezelj et al., ). detailed genetic and biochemical analyses by a. huang, j.j.
holland and their colleagues on the generation of vsv dis and their interplay with the standard, infectious vsv contributed to unveiling a continuous dynamics of genetic variation, competition, and selection, observable within short time intervals, a hallmark of rna genetics (palma and huang, ; holland et al., ), fully confirmed by application of new sequencing techniques. di particles and defective genomes are present in populations of positive- and negative-strand rna viruses as they multiply in their natural hosts (nüesch et al., ; drolet et al., ; li et al., ; saira et al., ; ke et al., ; rezelj et al., ). their widespread presence in vivo may mean that they are an unavoidable side-product of the replication machineries (i.e., instruction to recombine) or that selection might have favored their generation. both possibilities are compatible. an instruction whose result is a means to modulate replication of the corresponding standard viruses or the antiviral immune response will be selected as a consequence of its biological effects. di rnas can be regarded as the tip of the iceberg of many classes of defective genomes with a range of interfering or potentiating capacities that may coexist with standard animal, plant, insect, and bacterial viruses, and that may facilitate persistence and modulate disease symptoms (holland et al., ; vogt and jackson, ; lópez-ferber et al., ; rosario et al., ; sachs and bull, ; villarreal, ; aaskov et al., ; rezelj et al., ). noncytopathic coxsackievirus b (cvb ) variants with deletions at the untranslated genomic region were isolated from hearts of mice inoculated with cvb . the variants replicated in vivo and were associated with long-term viral persistence (kim et al., ). despite the continuous dynamics of the escape of infectious virus from the interfering activities of dis, some authors consider dis as potential antiviral agents (dimmock and easton, ).
if defective genomes are competent in rna (or dna) synthesis or are complemented to replicate, they can act as dominant-negative swarms, provided they reach a sufficient load. in this manner, they may underlie the suppressive effects of mutant spectra of viral quasispecies. intramutant spectrum decrease of replicative capacity due to the presence of defective genomes is one of the mechanisms of virus extinction evoked by enhanced mutagenesis (chapters and ). recombination events must have been the last step in ancestral processes of horizontal gene transfer that mediated the incorporation of host genes (or gene segments) into viral genomes, and vice versa. host genes related to immune responses were probably captured by complex dna viruses at early stages of their evolution (alcami, ; mcfadden, ). mosaicism associated with nonhomologous recombination events is the norm among tailed bacteriophages (canchaya et al., ). nonhomologous recombination can give rise to genomic sequences with a viral and a nonviral moiety. they include di rnas of sindbis virus containing cellular rna sequences at their ends, some cytopathic forms of bovine viral diarrhea virus, rna of potato leafroll virus containing tobacco chloroplast rna, or an influenza virus with an insertion of ribosomal rna into the hemagglutinin gene, mentioned in chapter regarding transient, high fitness levels [reviewed in (domingo et al., )]. phylogenetic analyses have suggested that recombination between rna and dna viruses might have occurred to give rise to some present-day single-stranded dna viruses (stedman, ). however, the evidence for this attractive possibility is indirect and, to my knowledge, no experimental evidence of viral rna-viral dna recombination in cell culture or in vivo has been reported.
viability of mutant and recombinant viral genomes is severely constrained by the evolutionary history of the virus that has shaped viral genomes as coordinated sets of modules (botstein, , ; zimmern, ; koonin and dolja, ). experimental studies with engineered recombinant viruses have shown that modularity can restrict recombination (martin et al., ). the three molecular mechanisms of viral genome variation (mutation, recombination, and reassortment) are not incompatible, although it may sometimes be difficult to discern their occurrence (varsani et al., ). it would be truly remarkable if a viral system could be proven to be totally devoid of one of the mechanisms of genetic variation. it would imply that there are powerful molecular reasons to dispense with an effective adaptive mechanism. absence of a mechanism is extremely difficult to demonstrate but, if we could, its basis would open a new chapter of molecular virology.
there is an ongoing controversy regarding clonality versus nonclonality in biological evolution, particularly concerning the evolution of cellular parasites (heitman, ; tibayrenc and ayala, , ; ramirez and llewellyn, ; hauser and cushion, ). clonality means asexual progeny from a single ancestor. in the case of viruses, clonal evolution emphasizes reproduction without the exchange of genetic material among two or more parental genomes. sexual reproduction necessarily involves the exchange of genetic material. the question for viruses is interesting because we would be inclined to propose clonal evolution despite considerable promiscuity of recombination and reassortment. a tentative solution was offered based on one assumption and some experimental observations. the assumption is that recombination is not a requirement for viruses to complete their infection cycles.
despite the possibility that recombination might be imprinted in the replication apparatuses (rendering it inherent to replication), its occurrence is not a necessity. the experimental observations are of two sorts. one is that historically recombination and reassortment have been at the origin of the emergence and reemergence of viral pathogens [western equine encephalitis virus, pandemic influenza viruses, emergent circulating poliovirus, hiv- and hcv recombinants, etc. (section . )]. the second observation is that recombination-based genome segmentation can occur given the adequate population dynamics and competitive environment, as documented with fmdv (section . ). in consequence, the distinction was made between mechanistically unavoidable but biologically irrelevant, and meaningful recombination (perales et al., ) . the latter form of recombination requires prior diversification of parental genomes by mutation and a number of cellular and epidemiological conditions. despite its relevance to evolutionary transitions and viral emergence, it is not a requirement for virus survival, propagation, and evolution. this marks a contrast with the genomic exchanges associated with sexual reproduction. the proposal is that viruses evolve clonally at widely different time scales (intrahost or within-host evolution vs. long-term evolution at the epidemiological level). similar arguments apply to mutation. this point will be addressed again in the closing chapter , concerning implications of clonality. all forms of genetic variation of viruses must be viewed essentially as blind processes despite preferences of nucleotide sequences or structures for mutation and recombination events: hot spots with higher than average rate values, and cold spots with lower than average rate values. 
mutation originates largely in fluctuations of electronic structure that modify base-pairing properties, and from features of polymerase-template interactions, not subject to regulation, in the sense that we understand the regulation of gene expression or enzymatic activity. absence of regulation is not incompatible with long-term evolution having shaped the molecular interactions that yield a level of mutagenesis compatible with survival and adaptability. given the biological consequences of mutation rates, many additional studies are needed for the biological phyla, to quantify not only basal mutation rates but also the possible presence of mutator strains. similar arguments can be used for recombination and reassortment. the number of segments that enter a new genomic constellation may be regulated but not which variant forms of the individual segments will make up the new viral particles. it is short-term selection acting at the very center of replication and recombination complexes that preserves some mutant and recombinant forms to the detriment of others. subsequent levels of selection occur when variant forms expand in multiple rounds of infection, first within cells, then within an organism, and then at the epidemiological level. the very nature of life on our planet has been built upon an inherent tendency to instruct variation in an incessant fashion, as necessary and unavoidable as the physical principles that dictate the behavior of our universe. the net result of all mechanisms of genetic variation available to a virus is the generation of repertoires of variant genomes for random drift and selective forces to act upon. in other terms, genetic variation sets the scene for the actors of evolution to play their roles, and secures a continuous input of new forms despite subtle or catastrophic environmental perturbations.
the same forces that drive general evolution have produced the dominant virus forms we see in nature, with all their nuances in the interaction with cell components. the adaptation of viruses to participate in intracellular processes with cells dictates that genetic variation of viruses has its limits to prevent deleteriousness. this is currently exemplified by the effects of amino acid substitutions in viral polymerases that either increase or decrease template-copying fidelity. viruses have reached a compromise between the stability of core information and flexibility for adaptability. although not yet treated in this chapter, viral population numbers are a key parameter in the evolutionary events. next chapters address some of these questions, not only in general conceptual terms but also in the way evolution affects our daily confrontation with viral disease (see summary box).
• mutation, recombination, and genome segment reassortment are the mechanisms of genetic variation used by dna and rna viruses. mutations are due mainly to changes in the electronic distributions of the standard nucleotides, to damage of nucleotides by external influences, and to alignment alterations of the template relative to product polynucleotide chains. the effect of a mutation can range from being well-tolerated to highly detrimental or lethal.
• mutation frequencies are only an indirect consequence of mutation rates. their values for viruses whose replication is catalyzed by polymerases devoid of proofreading-repair activity are - to -fold higher than those displayed by replicative cellular dna polymerases. error-prone replication is a hallmark of rna viruses and some dna viruses. the larger the amount of genetic information encoded in a viral genome, the lower the mutation rate must be to maintain the genetic message.
• several mechanisms of genetic recombination have been described for dna and rna viruses. the best characterized is homologous recombination, whose frequency of occurrence is dependent on the replicative machinery, in particular, polymerase processivity. genome segment reassortment is operative in segmented genomes, and it gives rise to biologically relevant changes, such as antigenic shift in the influenza type a viruses.
• studies with foot-and-mouth disease virus have shown that an unsegmented rna genome subjected to extensive evolution has the potential to undergo a recombination-mediated transition akin to genome segmentation. therefore, segmented and unsegmented forms of rna viruses need not be considered as completely unrelated classes of genome organization.
• recombination and genome segment reassortment have been viewed as conservative forces to rescue viable genomes from a damaged pool, and also as a means to explore new genomic compositions that deviate from their parents. all forms of genetic variation give rise to repertoires of variant genomes on which selection and random drift act to produce the viral forms that we isolate and study.
long-term transmission of defective rna viruses in humans and aedes mosquitoes frequency spectrum neutrality tests: one for all and all for one picornaviruses as a model for studying the nature of rna recombination emergency services of viral rnas: repair and remodeling.
key: cord- -ty xob authors: denison, mark r; graham, rachel l; donaldson, eric f; eckerle, lance d; baric, ralph s title: coronaviruses: an rna proofreading machine regulates replication fidelity and diversity date: - - journal: rna biology doi: . /rna. . . sha: doc_id: cord_uid: ty xob in order to survive and propagate, rna viruses must achieve a balance between the capacity for adaptation to new environmental conditions or host cells with the need to maintain an intact and replication competent genome. several virus families in the order nidovirales, such as the coronaviruses (covs) must achieve these objectives with the largest and most complex replicating rna genomes known, up to kb of positive-sense rna. the covs encode sixteen nonstructural proteins (nsp – ) with known or predicted rna synthesis and modification activities, and it has been proposed that they are also responsible for the evolution of large genomes. 
the covs, including murine hepatitis virus (mhv) and sars-cov, encode a ′-to- ′ exoribonuclease activity (exon) in nsp . genetic inactivation of exon activity in engineered sars-cov and mhv genomes by alanine substitution at conserved de-d-d active site residues results in viable mutants that demonstrate - to -fold increases in mutation rates, up to times greater than those tolerated for fidelity mutants of other rna viruses. thus nsp -exon is essential for replication fidelity, and likely serves either as a direct mediator or regulator of a more complex rna proofreading machine, a process unprecedented in rna virus biology. elucidation of the mechanisms of nsp -mediated proofreading will have major implications for our understanding of the evolution of rna viruses, and also will provide a robust model to investigate the balance between fidelity, diversity and pathogenesis. the discovery of a protein distinct from a viral rdrp that regulates replication fidelity also raises the possibility that rna genome replication fidelity may be adaptable to differing replication environments and selective pressures, rather than being a fixed determinant.
coronaviruses (covs) are a family of rna viruses that cause significant diseases in humans such as severe acute respiratory syndrome (sars) and other respiratory infections, as well as a variety of respiratory, gastrointestinal and other infections in an increasingly large variety of mammals and birds. the capacity covs possess for trans-species movement and adaptation, long recognized in the laboratory, was confirmed in nature by the recent emergence of several animal coronavirus pathogens of domesticated animals and sars-cov. data on the latter emergence event indicated that sars likely resulted from human infection and adaptation by a bat sars-like cov. finally, the "post-sars" identification and analysis of a vast diversity of newly identified coronaviruses across bat species, along with the successful synthetic resurrection of a bat sars-like cov, suggests that many, if not all, mammalian covs may originate from bats. 
however, the mechanisms of cov host species movement and adaptation for replication and pathogenesis are poorly understood. this review will discuss the role of rna replication fidelity in rna virus replication and pathogenesis, and will focus on a novel exoribonuclease universally encoded within cov genomes that likely mediates rna-dependent rna proofreading during virus replication.
due to high error rates of rna replication, rna viruses exist as quasispecies, defined as "ensembles of related genotypes". there is evidence that evolutionary selection targets rna virus quasispecies populations rather than individual variants, and that cooperative interactions between variants influence rna virus pathogenesis. cell culture passage of a mumps vaccine strain associated with meningitis resulted in reduced neurovirulence that correlated with heterogeneity at specific positions in multiple viral genes. for west nile virus, increased genetic heterogeneity after mosquito cell passage correlated with growth fitness in those cells. another study provided evidence suggesting that innate immunity can limit poliovirus pathogenesis by restricting viral diversity during transit to vulnerable tissues such as brain.
despite the clear linkage between replication fidelity and pathogenesis, and although numerous studies support high mutation rates of rna viruses, the range of genetic variability tolerated by specific viruses is not well understood. a four-fold increase in mutation frequency of poliovirus through chemical mutagenesis reduced infectivity by %. in contrast, a -to- -fold decrease in mutation frequency ( dpol-g s) reduced the capacity of the mutant poliovirus to compete with wt virus in direct competition assays in culture and in mice and resulted in highly attenuated viruses with restricted tissue tropism in mice. vesicular stomatitis virus (vsv) likely has a narrow tolerance for alteration in replication fidelity, based on the finding that mutations from chemical mutagenesis at two defined single-nucleotide positions in vsv genomes could be only moderately increased ( - -fold) without abolishing infectivity. underscoring the contribution of diversity to pathogenesis, modulation of replication fidelity is a new and promising approach for engineering live-attenuated rna virus vaccines. poliovirus rdrp mutants with restricted genetic diversity elicit a protective immune response in mice comparable to the sabin type vaccine strain but have superior genetic stability since increased replication fidelity minimizes reversion to virulence.
although constitutively low replication fidelity (m = - to - substitutions per nt per round of replication) is thought to be a key determinant of rna virus quasispecies diversity, adaptation and virulence, the tolerated range of increased or decreased fidelity has appeared to be constrained to less than -fold. this is in sharp contrast to dna genomes of bacteria such as e. coli that may have profoundly greater fidelity ( - to - substitutions per gene per round of replication) but may tolerate up to , fold differences in replication fidelity in viable mutator phenotype bacteria.
covs are enveloped viruses that have positive-sense, non-segmented rna genomes - kb in length. the basic gene organization and replication are similar for all covs and are illustrated for sars-cov (fig. ). gene encodes all predicted replicase/transcriptase proteins, which are translated from input genome rna (rna ). genes - encode structural and accessory proteins, which are translated from separate subgenomic (sg) mrnas. covs, as members of the nidovirales order, generate not only new genome rna, but also a '-nested set (nido = nest) of subgenomic mrnas (sgrnas). along with a portion of the ' genome sequence, each cov sgrna also contains the first approximately nt of the ' leader sequence (fig. ). coronavirus rna synthesis can be conceptualized as involving two stages: genome replication and subgenomic rna transcription. in genome replication, the plus-strand genome rna is transcribed into a full-length minus-strand template rna, and then significantly more plus-strand genome rnas are synthesized from that minus-strand template. in subgenomic rna transcription, '-nested subgenomic rnas are transcribed to serve as templates for translation of the viral structural and accessory proteins. this stage of viral rna synthesis involves a discontinuous rna transcription model termed transcription attenuation during negative strand rna synthesis. during negative strand synthesis, the viral rdrp recognizes virus-specific conserved sequences termed transcriptional regulatory sequences (trss), located just upstream of each subgenomic orf. at these points, the polymerase either reads through to the next trs or dissociates from the template strand, then re-associates with the leader trs, located in the ' utr, and completes synthesis of a set of subgenomic length negative strand rna containing an antileader rna sequence and equivalent in size to each viral mrna. these subgenomic negative strand rnas then function as the principal templates for the production of subgenomic mrnas that are '-coterminal, and that all possess an ~ nt ' leader sequence. because of this transcription mechanism, alterations in trs sequences can influence viral replication efficiency. importantly, the primary trs sequence seems less important, as recombinant viruses engineered to encode completely new trs networks are viable, suggesting that regulatory sequences flanking the trs elements are critical regulators of subgenomic transcription. recently, wu and brian have shown that artificial, marked positive-sense subgenomic mrnas can function as templates for minus strand synthesis and likely contribute to amplifying the amounts of plus-strand sg mrnas synthesized during infection. moreover, mrnas also can serve as templates for the synthesis of smaller sg mrnas by recognition of internal trs elements as well.
the specific mechanism that confers the capacity of the polymerase to dissociate and reassociate is not well understood, but is thought to be mediated by complementary binding of the anti-trs on the nascent minus strand with the leader template trs on the plus strand. as rare misaligned leader-body junctions are occasionally seen during transcription, it is possible that one or more unique rna modifying activities, like nsp exon which encodes a '- ' exonuclease activity (see below), may process the ends of incomplete negative strand rnas to promote base-pairing and the priming of antileader rna synthesis.
recombination has shaped the population genetic structure of coronaviruses, promoting cross-species transmission and pathogenesis while complicating vaccine design. coronaviruses are quite capable of mediating homologous rna recombination, with rates approaching % during mixed infection of closely related strains in the same group. this high recombination frequency is likely due to the large size of the genome, paired with replication machinery that is already equipped to dissociate and reassociate from the template rna (site-assisted copy choice recombination), as well as the availability of full-length and subgenomic-length strands for template switching. this view is supported by studies showing targeted recombination between specially designed mutant subgenomic mrnas and genome length templates. viable mutant covs can be recovered with artificial trs sequences, as long as the leader and intergenic trs sequences match. further, these artificial trs sequence viruses prevent recombination with virus containing the native trs sequences. these factors combine to allow for the predicted rapid evolution of structural genes, especially within the spike gene, which undergoes high positive selective pressure during emergence and host-shift events.
the viral proteins responsible for cov replication, transcription and recombination are encoded as protein domains of the largest known rna virus polyproteins (fig. ). the cov orf a and fusion orf ab replicase polyproteins are expressed from the ~ kb gene on the input genome rna and subsequently are processed by viral proteinases to yield
in addition, multiple nsps may be required for certain functions in rna synthesis, as it has recently been shown that interactions of nsps , and are required for methyl transferase activity. remarkably, viruses with mutations that ablate the '-o-methyl transferase activity encoded in nsp are highly sensitive to the activity of the interferon stimulated gene ifit- , which likely inhibits the translation of rna molecules lacking '-o-methyl modifications.
) are the largest known rna genomes, at up to twice the length in nucleotides as those of the next-largest non-segmented rna virus. the bioinformatics prediction of a putative exoribonuclease (exon) encoded in replicase nonstructural protein (nsp ) of all covs led to the speculation that exon functions in proofreading during replication and that acquisition of exon by a precursor virus was critical for expansion and maintenance of the large rna genomes of covs. the amino-terminal half of the -kda nsp includes '-to- ' exon motifs i (de), ii (d) and iii (d), which were originally identified in cellular enzymes of the dedd superfamily, including those that catalyze dna proofreading. bacterially expressed sars-cov nsp has been shown to have '-to- ' exon activity in vitro, and alanine substitution of the de-d-d residues profoundly impairs or abolishes this activity. the intracellular rna targets for exon activity likely include viral rna intermediates; importantly, however, they may also include cellular mrnas, noncoding rnas and/or micrornas, which encode or regulate critical antiviral activities during infection. studies from our laboratories also indicated that the carboxy-terminal half of nsp has independent functions in rna synthesis and virulence in animals, a conclusion which was confirmed by the demonstration that the carboxy-terminal half of nsp encodes a novel cap n -methyltransferase function. proteomic studies, while controversial, indicate robust two-way interactions between nsp , nsp and orf b or nsp and nsp ; the latter study also hinting at less robust interactions with nsp and orf b. clearly, additional biochemical and genetic studies are needed to identify the nsp viral and cellular protein interactome.
substitution of alanine at de motif i of either mhv (m-exon) or sars-cov (s-exon) allows recovery of viable mutants with modest replication defects of less than log in peak titer compared to wild type, indicating that functional exon activity is not required for cov replication in culture. sequencing revealed -to -fold increases in mutation frequency and up to -fold increase in mutation rate compared with comparably isolated and sequenced wildtype mhv or sars-cov (fig. ).
it is reasonable to propose that subgenomic transcriptional and recombination repair may have evolved in covs and that nsp -exon is required for this process. a most likely model is that nsp cooperates with other cov enzymes to form a complex that is involved in error recognition and repair. 
although nsp alone has been shown so far only to hydrolyze nucleotides at the ' terminus of rna, sequential action of nsp -endou and nsp -exon would theoretically allow removal of internal, mismatched nucleotides. thus, whereas nsp can facilitate correction of residues at only the growing end of a nascent rna chain, cooperative interaction of nsp and nsp could facilitate correction of residues at other sites, and conceivably in full-length molecules that are no longer nascent. finally, covs encode a methyltransferase (nsp ) that interacts specifically with nsp for cap methylation and which could also be participating in fidelity or other rna modifications that allow expansion of the cov genome.
the potential for genetic variability of rna viruses has long been considered to be fundamental to their evolution, adaptation and escape from host responses. however, the effects of changes in replication fidelity, susceptibility to accumulation of deleterious mutations and lethal mutagenesis are not well studied for many viruses. genetic determinants including size of genome and presence of repair mechanisms such as proofreading, replicase fidelity and recombination, as well as other as yet undetermined factors, may have evolved quite differently in distinct virus families. the high mutation rates of rna viruses also render them particularly susceptible to repeated genetic bottleneck events during replication, transmission between hosts or spread within a host, resulting in progressive deviation from the consensus sequence associated with decreased viral fitness and sometimes extinction. the process by which populations of asexual organisms tend to accumulate deleterious mutations in the absence of recombination is referred to as muller's ratchet. 
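The ratchet described above can be illustrated with a toy simulation. The sketch below is not from the review: it assumes a Wright-Fisher-style asexual population with no recombination and no back mutation, and the function names and parameter values (`pop_size`, `mut_rate`, `sel_coeff`) are hypothetical choices made for illustration only. It tracks the smallest number of deleterious mutations carried by any genome in the population; because each offspring inherits its parent's load and can only add to it, that minimum can never decrease.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson-distributed integer (Knuth's method; fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_ratchet(pop_size=50, generations=200, mut_rate=0.3,
                     sel_coeff=0.02, seed=1):
    """Toy model of Muller's ratchet (all parameter values are illustrative).

    Each genome is represented only by its count of deleterious mutations.
    Parents are sampled in proportion to fitness (1 - sel_coeff) ** load,
    and every offspring gains a Poisson(mut_rate) number of new mutations.
    There is no recombination and no back mutation.
    Returns the minimum mutation load in the population at each generation.
    """
    rng = random.Random(seed)
    loads = [0] * pop_size
    min_load_history = []
    for _ in range(generations):
        weights = [(1.0 - sel_coeff) ** load for load in loads]
        parents = rng.choices(loads, weights=weights, k=pop_size)
        loads = [parent + poisson(rng, mut_rate) for parent in parents]
        min_load_history.append(min(loads))
    return min_load_history


history = simulate_ratchet()
print("least-loaded class, first and last generation:", history[0], history[-1])
```

Shrinking `pop_size` toward 1 mimics the plaque-to-plaque bottlenecks discussed in the text. Allowing genomes to exchange segments, which this sketch deliberately omits, is what would let a population rebuild a less-loaded class.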
muller's ratchet has been shown to be applicable to multiple rna viruses during plaque-to-plaque passage [ ] [ ] [ ] [ ] [ ] and to result in accumulation of mutations and lethal mutagenesis and extinction of plaque-passaged viruses. for example, some fmdv clones are susceptible to genetic bottleneck-mediated extinction, while others are resistant. mutagenesis has also been proposed to work as an antiviral strategy. , a major mechanism of action of ribavirin and other rna mutagens is lethal mutagenesis, as demonstrated with poliovirus, , , and other rna viruses, including hiv, sequencing (sanger) of wt and s-exon plaque isolates from a single round of infection. the results showed a -fold increase in mutational frequency and -fold increase in mutation rate across the s-exon genomes compared with wt sars-cov. for both s-exon and m-exon, mutations were distributed across the genomes with no statistical bias for regions of the genome, for type of mutation (codon position, transversion, transition) or for synonymous, non-synonymous or non-coding mutations. the analysis was performed only on viable replication-competent populations or plaques, and therefore excluded lethal or profoundly deleterious individual or combination mutations. thus, the results likely greatly under-represent the numbers, types and locations of mutations that would be detected in analysis of total viral genomes from a round of replication that would include both non-viable or minor mutations and defective interfering genomes. when the mutation patterns in the genomes from each of the plaque isolates of s-exon were compared, mutations were identified, and both the individual mutations and the "mutation sets" for each genome were unique. finally, the growth of multiple plaque isolates showed that all plaque isolates had growth defects compared to wt sars-cov but were indistinguishable from the initial recovered s-exon population and from each other.
the results from both m-exon and s-exon mutants, as well as the in vitro ssrna exonuclease activity of sars-cov nsp , all argue strongly that nsp -exon directly mediates or participates in the prevention or repair of mutations, which would constitute rna proofreading, an enzymatic activity not previously reported during the replication of rna viruses. a nuclease activity of influenza virions in removing non-cognate residues from the ' termini of capped primers was reported as evidence of proofreading, but this was not tested during replication and has not been further investigated or confirmed by other labs. rather, others have tested for and failed to find evidence for ' to ' exonuclease activity or proofreading in rna viruses, and have generally concluded that the energy and fitness cost exceeded the need for error recognition or repair mechanisms. thus it may be that the incorporation of nsp exon in the coronaviruses both allowed expansion of the genome and then was required for maintenance of the large and complex genome integrity. in this case the central unanswered research question remains: by what mechanism does nsp increase fidelity? there are several possible models, each of which would be unprecedented among rna viruses, and which could occur alone or in combination as functions of nsp or with other replicase proteins. ( ) nsp exon could directly mediate rna proofreading. this would be analogous to dna proofreading where the exonuclease activity is oftentimes provided by a subunit distinct from the polymerase activity. in this regard it is notable that dna proofreading exonucleases also belong to the de-d-d superfamily. ( ) nsp could stimulate an intrinsic putative '-to- ' exon activity of the rdrp. this would be similar to rna proofreading during cellular transcription by demonstrated that in population passage with selection limited only to growth at h, massive diversity in the population was tolerated and still allowed adaptation for increased growth. 
the effect of more stringent genetic bottleneck (muller's ratchet) was tested using mhv-exon, in which plaque-to-plaque passage of plaques each of wt mhv and m-exon was performed in parallel (fig. ) . ten clones each of wt-mhv-a and mhv-exon were subjected to ten serial plaque-to-plaque transfers in murine dbt cells. two m-exon clones became non-recoverable during this passage series (one at p and one at p ), whereas all mhv clones were recoverable throughout. moreover, titers from m-exon plaque passage showed a trend of decreasing average titer over passage, whereas the titer from wt-mhv remained constant. the results suggest that exon mutator viruses may be more susceptible to accumulation of deleterious mutations driven by repeated population bottlenecks. conversely, it was surprising that the plaque passage revealed no rapid extinction of the mutant, suggesting that other mechanisms have evolved to stabilize populations and prevent lethal mutagenesis, and/or that covs tolerate the accumulation of massive mutational loads across the expanded genomes. comparison of complete genome sequences of extended plaque and population passages will be required to test these possibilities. the genomes of positive-strand rna viruses have considerable capacity to evolve quickly in response to changing ecologic conditions and/or host environments. mutation rate is a critical parameter for understanding virus evolution, and restriction in genetic diversity within a population of viruses leads to lower adaptability and pathogenicity. moreover, a general trend toward an inverse correlation between genome size and replication fidelity has been demonstrated by high variations in rdrp error rates that range from about - to - . 
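the trade-off between genome size and fidelity can be made concrete with a short python sketch. it assumes independent errors at a uniform per-nucleotide rate, which is a simplification. the function names and the example values are hypothetical and are not measurements from the studies discussed here.

```python
def expected_mutations(error_rate, genome_length):
    """Mean number of misincorporations per genome copy (rate x length)."""
    return error_rate * genome_length


def error_free_fraction(error_rate, genome_length):
    """Probability that one copy carries no error, assuming independent sites."""
    return (1.0 - error_rate) ** genome_length


if __name__ == "__main__":
    # Hypothetical per-nucleotide error rate applied to two genome lengths.
    rate = 1e-4
    for length in (10_000, 30_000):
        print(length, expected_mutations(rate, length), error_free_fraction(rate, length))
```

for a fixed error rate, doubling `genome_length` doubles `expected_mutations` and squares `error_free_fraction`, so a longer genome would need a lower error rate to keep the same share of error-free copies.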
based primarily on studies with enteroviruses, rna viruses with smaller genome sizes seem to regulate replication fidelity by a long distance network of dynamic interactions throughout the dpol rdrp that function to regulate rna binding and recognition, ligand recognition and binding, protein conformation and rna synthesis. fidelity can be further modified by virus-host interactions that regulate rna replicase fidelity or rna recombination and repair. in contrast to other positive strand rna viruses, covs appear to have expanded replicase fidelity by acquiring and evolving a unique enzymatic activity, encoded with the exon nsp replicase protein. , clearly, the existing paradigm of a lumbering error-prone rdrp will be replaced with one that recognizes a more complex, highly tuned and regulated protein machine that was likely essential for the expansion of the cov genome. there is a critical need to elucidate the molecular mechanisms governing exon fidelity regulation, and we are in the process of designing in vitro exon mutant assays. nonetheless, the existence of a non-essential exogenous activity which appears to modify cov rdrp fidelity provides novel opportunities for experimental testing of the fundamental relationships between fidelity and replication, recombination, adaptation, cross species fmdv, lcmv and hantaan virus. [ ] [ ] [ ] [ ] in contrast, sars-cov is highly resistant to ribavirin in vitro and in vivo; , in fact, in some instances, the drug exacerbates in vivo disease. the susceptibility to varying degrees of genetic bottleneck has been addressed for the m-exon and s-exon mutator viruses in comparison with wt parental cloned strains. sars-cov and s-exon were passaged as populations (three independent passages each) at fixed intervals ( h) and with titer determination and equivalent moi for passage times (fig. ) . 
interestingly, both sars-cov and s-exon had similar responses, with selection for increased titer ( - log) by passage , after which adaptation reached a plateau. sequence analysis showed retention of the exon mutations at passage , indicating that a primary revertant was not the cause of increased growth. further, up to passage the total mutational diversity of the s-exon populations dramatically exceeded that of wt sars-cov. these results of the new emerging infectious diseases that were identified between - , . percent were zoonoses, . percent of which originated in wildlife. because pathogen emergence has also been increasing over time, together with greater rates of global dissemination, the threat to global health and economies is profound. strategies that can identify viral threats that emerge as a consequence of advantageous mutations in response to select evolutionary pressure(s) would provide profound advances in the ability of the global health response network to control emerging diseases. the existence of a defined exon-mediated mutator phenotype may allow for mechanistic insights and modeling of the mutation repertoires that govern: (a) the rapid selection of host-range-expanded mutant viruses that represent precursors to future epidemic emergences; (b) pathways of escape from therapeutic human monoclonal antibodies and drugs; (c) limits of genome variation and stability; and (d) replicase mutations and interacting networks that restore fidelity in passaged exon mutant viruses. for example, cov phylogeny is punctuated by numerous shifts between host species, and cross-species transmission is readily achieved in co-cultures and during persistent infections in vitro.
, , in nature, human coronavirus (hcov) oc emerged around , from closely related bovine covs, whereas hcov e likely emerged from closely related group bat coronaviruses around , in africa, , leading some to propose that nearly all human and animal covs originated from a vast reservoir of strains circulating in bat species. , sars-cov is also a zoonotic virus that crossed the species barrier, most likely originating from bats, following amplification in other species (civet cats, raccoon dogs), prior to transmission to humans. , , our group has used synthetic biology to reconstruct full-length genomes of sars-like bat cov precursors to the - epidemic strains. , , although these strains replicate but do not spread because they lack the appropriate receptor-binding domain, recombinant bat coronaviruses harboring the sars-cov receptor binding domain (rbd) replicate efficiently and use angiotensin converting enzyme (ace ) as a receptor for docking and entry. these data suggest that the trimeric spike glycoprotein of covs may be plastic and modular in design, readily allowing for protein domains to be exchanged between divergent s glycoproteins from different strains. we propose that blending the exon mutator phenotype into synthetically reconstructed zoonotic viruses provides a strategy to rapidly identify pathway components and mutation sets that govern trans-species movement in cell or organ cultures and in vivo. we hypothesize that the cov exon mutator phenotype constitutes a robust investigative platform to predict mutations and possibly recombinants in advance of their occurrence by identifying advantageous mutations governing host range and virus cross-species transmission. genetic diversity within a quasispecies has been proposed to contribute to pathogenesis by cooperative interactions among engineered variant viruses within a population. 
, however, the transmission, genome evolution and pathogenesis while simultaneously providing new avenues for therapeutic enhancement of lethal mutagenesis and the rational design of live attenuated vaccines. , historically, the extent of genetic diversity in rna virus populations has most often been analyzed by sequencing a small number of genomes at low coverage or a small region at high coverage. the former approach lacks resolution while the latter is narrowly focused, so extrapolation to the entire genome can be misleading. deep sequencing approaches like mrnaseq provide new opportunities for high resolution mapping of mutation distributions across genomes. the availability of mutator phenotype mutants of both mhv and sars that can tolerate up to -fold increase in mutation rate, accumulate massive mutational diversity at the population level and survive extended population and plaque passage, represents a powerful tool to directly test long-standing questions concerning the role of diversity and fidelity in virus replication, pathogenesis and evolution. how do covs maintain a large and complex genome over time while allowing sufficient mutation rates for adaptation and trans-species movement? is the fidelity of wt cov replication fixed by highly selected interactions between the rdrp, exon and possibly other viral and cellular proteins or is it flexible in response to altered environmental conditions as has been shown for bacteria such as e. coli? what are the limitations to cov genome diversity in vitro and in vivo? does the exon mutator phenotype result in more rapid adaptation or attenuation associated with lethal mutational load and rapid extinction during infection in vivo? does the mutator phenotype increase susceptibility to mutagens for lethal mutagenesis? clearly, covs provide a rich empirical platform to address these interesting and unique questions in virus evolution and adaptation. 
covs use a unique discontinuous mechanism to transcribe a series of progressively larger subgenomic mrnas, and each contains a leader rna sequence that is derived from the ' end of the genome. rna recombination and coronavirus subgenomic transcription occur by template switching mechanisms, which can occur either by sequence-assisted base pairing and hydrogen bonding networks or sequence independent processes in mismatched regions. given the potential need to appropriately process and match the ' ends of nascent rnas and their templates, the exon mutator phenotype provides a novel approach to investigate the role of fidelity in regulating both full length and subgenomic length rna synthesis, transcription attenuation during negative strand rna synthesis and as potential regulators of rna recombination distributions and frequencies across the genome. the possibility that exon interacts with other rna modifying enzymes such as nsp endou and nsp '-o-mtase or interact with putative rna processivity components encoded in nsp and to modify rna repair rates or recombination frequencies may serve as a rich environment for future research. whereas a two-to six-fold increase in mutation frequency was sufficient to cause lethal mutagenesis of poliovirus in cell culture. , , moreover, ribavirin is clearly ineffective against mouse-adapted sars-cov and appears to exacerbate disease, suggesting that the exon activity in wildtype viruses may reduce the efficacy of this important antiviral. the high conservation of nsp exon sequences among covs and lack of close orthologs in cells suggests that nsp exon might represent a promising target for design and development of antiviral drugs and raises the possibility that a single drug targeting exon might be effective against multiple coronaviruses, including potential zoonotic sars-cov-like viruses from bats that emerge in the future. 
however, given numerous examples of viruses evolving drug resistance, an exon-targeted companion drug in a combination therapy would not only attenuate pathogenesis by altering error rates, but also prevent reversion from other compounds in the cocktail. investigating the potential of exon targeted mutations as a universal strategy to construct live attenuated, reversion proof cov vaccines and antivirals seems broadly relevant. studies investigating the pathogenesis of exon mutants in animal models, along with their tenability as vaccine candidates, are currently in progress. the identification of a stable mutator phenotype and possible dedicated rna-dependent rna-proofreading complex is significant for several reasons. the availability of the m-exon and s-exon mutant viruses constitutes a unique system to directly study the impact of profoundly decreased replication fidelity and increased diversity on replication and pathogenesis of rna viruses. the use of both traditional di-deoxy (sanger) and ultradeep (solexa) sequencing in conjunction with virus passage and persistent infection will allow establishment of new models for understanding the range and limitations on diversity and mutational load and will aid in developing tool sets for evaluating, comparing and annotating sequence diversity across rna virus genomes. massive sequence annotation and analysis during different stages of acute infection (intracellular, virion), as well as over time (passage, persistent infection) will allow mapping of genetic regions highly tolerant or resistant to change and to define potential epistatic interaction networks not predictable by other approaches as well as conserved across the coronaviridae. 
passage and sequencing of exon mutator viruses also will provide a system to rapidly generate extensive libraries of individual mutations and complete mutation sets across the genome that can be tested in individual coding or regulatory domains for effects on immune response, host range and virulence. such studies of exon-generated decreased fidelity and increased diversity also represent two potentially broadly applicable strategies for live vaccine design that simultaneously attenuate and prevent reversion to virulence. finally, the mutator viruses and studies of exon mutant revertants will result in important insights into the evolutionary mechanisms by which the nidovirales acquired (or lost) the capacity for rna proofreading, and will allow testing of the limitations of size and complexity of rna as a replicating molecule. as such, exon should be considered as a high impact impact of reduced replication fidelity on pathogenesis remains largely untested for rna viruses. the cov exon mutator phenotype represents a unique property with unclear consequences for adaptation and pathogenesis in animals. although increased fidelity attenuates virulence of poliovirus and restricts tissue tropism, the exon mutator activity might increase virus virulence and tissue tropism because of the increased population diversity and spread into novel tissues. alternatively, exon decreased fidelity might attenuate virulence because the mutation frequency may approach error catastrophe and drive self-annihilation of s-exon in vivo. the growth defects observed in culture seem to support the prediction that s-exon will be attenuated in animals, but these impairments could be trumped by potential enhanced adaptability of s-exon. to assess these possibilities, pathogenesis studies in animal models are currently underway. of note, a low-fidelity exonuclease mutant of cytomegalovirus showed accelerated evolution of drug resistance in cell culture. increased mutation rates in the gii. 
noroviruses have been proposed to account for their epochal evolution, increased diversity and the striking increase in the frequency of human epidemics in winter. , thus, fidelity regulation is a broadly relevant topic with far-ranging appeal in rna virus evolution and pathogenesis. importantly, exon represents an important and unique high impact target for understanding cov replication fidelity, quasispecies diversity and pathogenesis and is strongly coupled with the potential of developing universal strategies to build safe live attenuated cov vaccines. live attenuated vaccines elicit strong protective immune responses with low risk of disease, leading to robust tools that protect the public health against pathogens like measles, poliovirus, mumps, smallpox, herpesviruses and rubella. safety concerns clearly exist as evidenced by vaccine reversion to virulence and the development of serious and lethal disease in a low percentage of vaccinees. the present regulatory environment in the us is now limiting live attenuated virus vaccine use because of safety concerns, attesting to the need for rational approaches that prevent reversion to virulence. the high conservation of nsp exon sequences among covs and lack of close orthologs in cells suggests that nsp exon may be a promising target for live attenuated virus design or antiviral therapeutics. clearly, studying the exon mutator phenotype in pathogenesis and as a rational approach to develop reversion-resistant live attenuated vaccines provides a potential rapid response strategy to control future emerging cov diseases in human and domesticated animals. current treatment regimens for sars-cov include ribavirin, a nucleoside analog that induces lethal mutagenesis of other rna viruses such as poliovirus, foot and mouth disease virus, hepatitis c virus among others. 
however, its precise mechanism of action against covs has not been determined and the high replication fidelity of wt mhv and sars-cov in cell culture suggests that drug-induced viral extinction therapies employed against other rna viruses might not be as effective against covs. a possible recalcitrance of covs to rna mutagens is also suggested by our demonstration that at least in cell culture sars-cov tolerates a . -fold increase in substitution frequency,
this work was supported by grants from the national institutes of health: serceb-u -ai (m.r.d., r.s.b., e.f.d.), r -ai (m.r.d., l.d.e.) and contract hhsn c. additionally, r.l.g. received funding from f -ai and l.d.e. received funding from t -ai . the funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. key: cord- -xvc hvpq authors: singh, roshan kumar; prasad, ashish; muthamilarasan, mehanathan; parida, swarup k.; prasad, manoj title: breeding and biotechnological interventions for trait improvement: status and prospects date: - - journal: planta doi: . /s - - - sha: doc_id: cord_uid: xvc hvpq main conclusion: present review describes the molecular tools and strategies deployed in the trait discovery and improvement of major crops. the prospects and challenges associated with these approaches are discussed. abstract: crop improvement relies on modulating the genes and genomic regions underlying key traits, either directly or indirectly. direct approaches include overexpression, rna interference, genome editing, etc., while breeding majorly constitutes the indirect approach. with the advent of latest tools and technologies, these strategies could hasten the improvement of crop species. next-generation sequencing, high-throughput genotyping, precision editing, use of space technology for accelerated growth, etc. had provided a new dimension to crop improvement programmes that work towards delivering better varieties to cope up with the challenges. also, studies have widened from understanding the response of plants to single stress to combined stress, which provides insights into the molecular mechanisms regulating tolerance to more than one stress at a given point of time.
altogether, next-generation genetics and genomics have made tremendous progress in delivering improved varieties; however, scope still exists to extend these approaches to other species that remain underutilized. in this context, the present review systematically analyses the different genomics approaches that are deployed for trait discovery and improvement in major species, which could serve as a roadmap for executing similar strategies in other crop species. the application, pros and cons, and scope for improvement of each approach are discussed with examples, and altogether, the review provides comprehensive coverage of the advances in genomics to meet the ever-growing demands for agricultural produce. electronic supplementary material: the online version of this article ( . /s - - - ) contains supplementary material, which is available to authorized users. the human population is growing at a tremendous rate, and it is projected to reach billion by . this makes it necessary to produce enough grain to ensure food and nutritional security. thus, agriculture holds the key to meeting the needs of the ever-growing population. however, various factors, including challenges associated with the availability of water and irrigation systems, deteriorating soil fertility, erratic rainfall, rise in atmospheric temperature and heatwaves, and insect pests and pathogens and their evolution into highly virulent strains, take a heavy toll on agricultural productivity. for example, extreme heat and frequent droughts have resulted in about a % reduction in yield of cereal crops throughout the world from to . the damage has been greater in developed countries, with an - % higher loss than in developing countries (lesk et al. ). on the other hand, intensification of crop improvement programmes using biotechnological and breeding interventions has resulted in trait discovery and the release of improved varieties that could mitigate the adverse effects (fig. ). 
conventional breeding transformed into molecular breeding, which then took the shape of genomics-assisted breeding to meet the challenges in agriculture. similarly, gene cloning and overexpression or knockout/knockdown diversified into different approaches, including rnai, vigs, and gene/genome editing, to develop lines with enhanced agronomic as well as climate-resilient traits. advances in high-throughput genomics strategies at the whole-genome level, including genetic association mapping, map-based cloning, genomic selection, and speed breeding, have also proven useful in improving genetic gains and expediting crop improvement. in one way or another, these approaches contributed to the increase in the yield of staple crops like maize, rice, and wheat (bailey-serres et al. ). however, there are some drawbacks, mainly due to the asymmetric support in sub-saharan and other impoverished areas. also, the lack of investment in underutilized crops and the replacement of fruits and vegetables with grain crops led to a dietary shift from foods rich in macro- and micronutrients to a calorie-rich diet (pingali ). achieving nutritional and food security by increasing the production of nutrient-rich fruits, vegetables, cereals, etc. is the need of the hour and a significant challenge for agricultural scientists (muthamilarasan et al. ; bailey-serres et al. ). while the technologies and approaches for crop improvement advance on the one hand, the problems in agriculture and productivity increase on the other, largely due to the changing climate. fao predicts that crop yields will decline by % by if we do not address climate change. also, studies on the effect of combined environmental stresses are limited compared to studies on individual stresses, which is again a limiting factor. 
though it has been repeatedly underlined that the response of plants towards combined stress is unique as compared to individual biotic or abiotic stress, studies in this direction to delineate the molecular machinery underlying such tolerance and extrapolating the information for crop improvement remain largely elusive. while studies in this direction are need of the hour, synergistic application of two or more integrated modern genomic approaches for developing better varieties also gains momentum. given this, the review enumerates the different genomics approaches being deployed for trait improvement with examples and provides the roadmap for studying the genomic regions regulating any trait and harnessing the information for an expedited improvement and release of elite varieties. genome sequencing has provided direct access to the structural and functional aspects of protein-coding genes organized within the chromosomes of any species. also, genome sequencing provides information about the noncoding elements, including transposons and promoters, that are crucial for understanding evolution and diversification. identification of upstream regulatory elements of each gene through genome sequencing has also enabled fine-tuning the expression of target genes, and all these are possible due to the advent of ngs approaches. started with arabidopsis thaliana (arabidopsis genome initiative ), genome sequencing expanded its horizon by including several crops, plants, and tree species. rice was the first crop to be sequenced (yu et al. ; goff et al. ; international rice genome sequencing project ) , followed by maize (schnable et al. ), sorghum (paterson et al. ), and soybean (schmutz et al. ) . recently, an annotation-grade whole-genome sequence data of wheat has been released (international wheat genome sequencing consortium ). in case of foxtail millet, zhang et al. 
( ) sequenced the genome and reported that around genes are unique to this crop, of which ~ are annotated as 'response to water'. reannotation of these genes provided insights into their classification, wherein a few stress-responsive genes were characterized at the genome-wide level ( prasad , a, b). singh et al. ( ) characterized aquaporin-encoding genes of foxtail millet and showed that overexpressing sipip ; and sisip ; in a heterologous yeast system provides tolerance to dehydration and salt stress. recently, sood et al. ( ) optimized the agrobacterium tumefaciens-mediated transformation procedure for foxtail millet, which would enable the overexpression of candidate genes in the crop per se and thereby contribute to trait improvement. the decoding of genome sequences of diverse major food crops provides useful genomic information related to structural, functional and comparative genomics for novel trait discovery and genetic enhancement of these crops. resequencing of germplasm enables an integrated genomics approach that uses sequence information to identify novel quantitative trait loci (qtls), genes, and snps that regulate any specific agronomic trait. significant efforts in this regard have been made in rice by sequencing its diverse accessions belonging to different populations (wang et al. ). resequencing of accessions of chickpea is one of the large-scale exercises conducted to determine the candidate genes underlying thirteen traits (varshney et al. ). in this context, the concept of the pangenome has gained significant attention; it includes the complete genetic information available within the accessions of a species (hurgobin and edwards ). a reference genome, together with the resequencing data of all available accessions of a given species, has considerable potential to expedite molecular breeding, as it deploys the entire genetic diversity existing in the species. 
development of pangenomes for crop species has been carried out in several plants, such as rice (schatz et al. ; zhao et al. ; zhou et al. ) , maize (hirsch et al. ; xu et al. ) , soybean (lam et al. ; li et al. ; liu et al. a, b) and brassica (golicz et al. a; bayer et al. ). the quality of the reference assembly determines the appliance of plant pangenomics in terms of size, completeness and annotation, selection, and dense phenotyping of appropriate genotypes (golicz et al. b) . weckwerth et al. ( ) had underlined the future of pangenomics in enhancing the accuracy for marker-dependent trait performance, increasing the resolution of markers, and combining this approach with other next-generation strategies for accelerating the crop improvement programmes. the application of pangenome instead of single genome sequence as a reference would allow us to determine the retention or loss of valuable genes due to breeding and artificial selection in cultivars. it will further expedite the exploitation of crop wild relatives as well as cultivated accessions/species to improve abiotic and biotic stress tolerance, nutritional properties, architecture, and other economically significant traits in popularly grown cultivars. genomics includes the analysis of gene function, their regulation, and inter-networking for association with biological traits. in the field of crop science, comparative functional genomics and transcriptomics are primarily focused on the identification of allelic variations responsible for the improved phenotype. condition-specific transcriptional activation of a large number of functional and regulatory genes is measured through micro-and macroarrays, quantitative pcr (qpcr), massively parallel signature sequencing (mpss), serial analysis of gene expression (sage) and ngs-based rna sequencing (rna-seq) techniques. 
global transcriptome profiling in diverse tissues and developmental stages of different crop accessions has led to the generation of atlases rich in differentially regulated transcripts/genes, indicating complex developmental intricacies in crops (ref). global transcriptomic resources may be utilized for an integrative genomics evaluation to delineate the function of genes (muthamilarasan et al. ; azodi et al. ). for example, genome-wide rna-seq analysis of two contrasting sorghum genotypes, is (drought-sensitive) and is (drought-tolerant), explored the correlation between physiological response to drought stress and differential gene expression (fracasso et al. ). the abundance of drought-related transcripts was higher in the drought-sensitive genotype. a total of up- and down-regulated genes were identified under drought conditions, among which and were exclusively up- and down-regulated in the tolerant genotype, respectively. the study revealed the different strategies adopted by the two genotypes in coping with drought conditions. the tolerant is genotype initiated the synthesis of secondary metabolites, including glycine betaine and glutathione, whereas the sensitive is genotype hydrolyzed carbohydrates and sugars. therefore, the extent of drought imposition and perception was greater in the susceptible genotype than in the tolerant one. the tolerant genotype could be used as a genetic donor in sorghum germplasm improvement for drought tolerance traits. similarly, global transcriptome analysis has identified , differentially expressed genes (degs) in response to high temperature and drought in wheat, degs under low- and high-nitrogen conditions in rice (xin et al. ), and degs under short- and long-term hypoxia in tomato roots (safavi-rizi et al. ), along with degs in various other crops under different conditions. the degs are being functionally characterized and utilized in crop improvement through breeding, genetic engineering, or genome editing. 
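the idea of calling a gene up- or down-regulated between two conditions can be illustrated with a log2 fold change on normalized counts. the sketch below is only a minimal illustration and is not the method of any study cited above: the gene names, counts, pseudocount, and the threshold of 1.0 are hypothetical, and real rna-seq analyses rely on statistical models with replicates and multiple-testing correction (for example, tools such as deseq2 or edger).

```python
import math


def log2_fold_change(control, treated, pseudocount=1.0):
    """Return log2 of (mean treated + pseudocount) / (mean control + pseudocount)."""
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


def classify_genes(counts, threshold=1.0):
    """Label each gene 'up', 'down' or 'unchanged' from its log2 fold change.

    counts maps gene name -> (control replicates, treated replicates).
    The threshold is a hypothetical cut-off, not a statistical test.
    """
    calls = {}
    for gene, (control, treated) in counts.items():
        lfc = log2_fold_change(control, treated)
        if lfc >= threshold:
            label = "up"
        elif lfc <= -threshold:
            label = "down"
        else:
            label = "unchanged"
        calls[gene] = (label, lfc)
    return calls


# Hypothetical normalized counts (well-watered vs. drought), three replicates each.
example_counts = {
    "geneA": ([10, 12, 11], [45, 50, 47]),
    "geneB": ([100, 95, 105], [20, 25, 22]),
    "geneC": ([30, 31, 29], [32, 30, 31]),
}

calls = classify_genes(example_counts)
for gene, (label, lfc) in calls.items():
    print(f"{gene}: {label} (log2FC = {lfc:.2f})")
```

the pseudocount keeps the ratio defined when a gene has zero counts in one condition; a fold-change cut-off alone does not account for variability between replicates.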
molecular markers form the backbone of classical genetics and plant breeding, as they enable selection and breeding for any given trait. development of molecular markers, construction of genetic linkage maps, mapping of qtls, saturation of maps, and fine mapping of the precise gene were once time-consuming and labour-intensive processes. however, the advent of next-generation sequencing (ngs) has enabled the development of large-scale molecular markers, including microsatellites or simple sequence repeats (ssrs), insertion-deletions (indels), and single nucleotide polymorphisms (snps). these markers enabled the development of high-density genetic maps useful for mapping target genes and utilizing them in crop breeding. molecular markers are also employed for the detection of genetic variation associated with valuable agronomic traits among cultivars of a species and facilitate the identification of appropriate parents for molecular breeding. they further ease the selection of desirable offspring resulting from the parental cross at an early stage of their development. genome-wide marker analysis in chinese spring wheat has led to the identification of , ssr markers from , , sequences of the genome with a density of . ssr markers/mb (han et al. ). a total of forms of ssr motifs were detected, with the largest proportion being dinucleotide repeats ( . %), followed by trinucleotide ( . %), hexanucleotide ( . %), tetranucleotide ( . %) and pentanucleotide ( . %) repeats in the genome. ag/ct, aag/ctt, agat/atct, aaaag/ctttt, and aaaatt/aatttt remained the most abundant repeats among the di- to hexanucleotide ssr motifs. similarly, in foxtail millet, , microsatellite-repeat motifs have been identified genome-wide, covering . mb of the whole-genome sequence (pandey et al. ). the abundance of trinucleotide repeats ( %) was higher in this case than that of dinucleotide repeats ( %). in barley, zhou et al. 
( ) have identified , indels throughout the genome after aligning the dna sequences from two different accessions, morex and barke. among these, indel markers were integrated with ssrs, dart (diversity arrays technology), and gene-based snp markers into a single barley genetic map. in addition to these, the development of other classes of markers, such as transposable element-based and mirna-based markers, was demonstrated in different crops, including foxtail millet (muthamilarasan and prasad ). these markers have proven useful in large-scale genotyping applications in millets, cereals, and bioenergy grass species. their application has also been extended to evolutionary studies and phylogenetic relationships, genetic diversity analysis, map-based cloning, and dna fingerprinting. garrido-cardenas et al. ( ) have described in detail the applications and trends of molecular marker research in the field of plant science. the single nucleotide polymorphism is another critical class of molecular marker, abundantly distributed in the genome and detected through a comparative study of whole-genome sequence or transcriptome data of different accessions or genotypes (habash et al. ). advancements in ngs technologies, with a simultaneous reduction in their cost, have prompted the detection and utilization of large-scale snp markers for crop improvement. approximately million snps from rice were identified after aligning the reads from genome sequences with the nipponbare genome as reference (alexandrov et al. ). in wheat, , gene-associated snps from a , high-density snp array were genetically mapped using a combination of eight mapping populations. in total, , , high-density snps were identified from different drought-responsive inbred maize lines and the b reference genome (xu et al. ). snps were most abundant in the intergenic (approximately %) and intronic ( . %) regions, followed by upstream, exon, utr, and splice sites. 
non-synonymous snp (nssnps)-associated candidate genes responsible for drought tolerance were also revealed (xu et al. ). resequencing of brassica napus accessions from countries has generated about , , snps and , , indels. through genome-wide association study (gwas), loci significantly associated with agronomic traits such as oil content, seed quality, stress tolerance were identified, which may be proven as a valuable resource for genetic improvement (lu et al. ). the study also revealed the origin of b. napus from the hybridization between domesticated brassica rapa and brassica oleracea approximately - years back. these genetic resources have enormous significance in diversity studies and understanding the genetic basis of trait variation throughout the population. the genome and transcript sequences available for diverse crops have led to the generation of numerous genomic and genetic sequence-based markers like ssrs, snps, and indels for their further use in genomics-assisted crop improvement. these markers are found efficient in rapid large-scale genotyping among natural germplasm accessions and bi-parental mapping populations through association and qtl mapping for trait discovery in crops. the analysis of quantitative trait locus (qtl) is a statistical approach that correlates the phenotypic measurements with the genotypic data to evaluate the genetic basis of variations among complex traits. the pipeline for qtl mapping requires a mapping population segregating for understudied agricultural traits, precise phenotyping, development of large scale high throughput genomic markers, construction of genetic map through genotyping of mapping population with polymorphic genomic markers and finally mapping of qtl utilizing both phenotypic and genotypic data (mir et al. ) . the biparental mapping population used for qtl mining is either of f , double haploids, backcrosses, nearisogenic lines (nils), or recombinant inbred lines (rils). 
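the statistical core of qtl analysis described above, correlating phenotypic measurements with genotypic data, can be sketched as a single-marker regression. this is a minimal illustration under stated assumptions and not the pipeline of any cited study: the marker names, the 0/1 coding of the two parental alleles in a hypothetical ril population, and the phenotype values are all invented for the example. practical qtl mapping uses interval or composite interval mapping with significance thresholds.

```python
def single_marker_regression(genotypes, phenotypes):
    """Fit phenotype = a + b * genotype by least squares.

    Returns (slope, r_squared): the slope is the estimated allele
    substitution effect, and r_squared is the share of phenotypic
    variance explained by the marker.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((p - mean_p) ** 2 for p in phenotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        # Monomorphic marker or constant phenotype: nothing to explain.
        return 0.0, 0.0
    return sxy / sxx, (sxy ** 2) / (sxx * syy)


# Hypothetical data: 8 recombinant inbred lines, alleles coded 0 (parent A) or 1 (parent B).
phenotype = [10, 11, 9, 10, 14, 15, 13, 14]
markers = {
    "marker_1": [0, 0, 0, 0, 1, 1, 1, 1],  # co-segregates with the trait
    "marker_2": [0, 1, 0, 1, 0, 1, 0, 1],  # largely unlinked to the trait
}

results = {name: single_marker_regression(g, phenotype) for name, g in markers.items()}
for name, (slope, r2) in results.items():
    print(f"{name}: effect = {slope:.2f}, R^2 = {r2:.3f}")
```

a marker with a large r-squared is a candidate for linkage with a qtl; in this toy data set only marker_1 explains most of the phenotypic variance.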
this approach of linkage analysis-based qtl mapping was used thoroughly during the last decade for various crops. however, this process encompasses several limitations, as illustrated by myles et al. ( ) ; therefore, to overcome the constraints, linkage disequilibrium (ld)-based association mapping was introduced to map qtls for dissecting complex agronomically significant traits ). the advantages of association mapping over bi-parental linkage mapping are ( ) superior mapping resolution through the exploration and utilization of each recombination event that happened in the evolutionary history of the species, ( ) use of natural germplasm collection rather than the development of specialized mapping population, ( ) less time consuming and cost-effective, ( ) use of same association mapping panel and genotyping data for mapping other traits and ( ) larger number of alleles can be mined compared to the linkage analysis-based qtl mapping where only two alleles are usually sampled (mir et al. ) . with the availability of crop genetic and genomic resources, genome-wide association study (gwas), candidate gene-based association mapping, qtl mapping, fine mapping, and map-based cloning are becoming popular to discover novel qtls, genes, and alleles associated with traits of agronomic importance in major food crops. more recently, the department of biotechnology (dbt), government of india, has initiated the genotypic and phenotypic characterization of more than , germplasm accessions of rice, wheat, minor pulses, and minor oilseeds conserved at national genebank through gwas for trait discovery and genetic improvement of these crops. multi-parent advanced generation inter-crosses (magic) and nested association mapping (nam) are specially designed population structures for multi-parent association studies (ladejobi et al. ) . 
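association mapping depends on linkage disequilibrium (ld) between a marker and the causal locus. a common way to quantify ld between two biallelic loci is the squared correlation r^2 = d^2 / (p_a (1 - p_a) p_b (1 - p_b)), where d = p_ab - p_a p_b. the sketch below computes it from phased haplotypes; the haplotype data and the 0/1 allele coding are hypothetical, and returning 0.0 for a monomorphic locus is a convention chosen here, since r^2 is undefined in that case.

```python
def ld_r_squared(haplotypes):
    """Compute r^2 between two biallelic loci from phased haplotypes.

    haplotypes is a list of (allele_at_locus_1, allele_at_locus_2)
    tuples with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n          # frequency of allele 1 at locus 1
    p_b = sum(h[1] for h in haplotypes) / n          # frequency of allele 1 at locus 2
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # r^2 is undefined for a monomorphic locus; 0.0 by convention here
    return d * d / denom


# Hypothetical haplotype samples.
complete_ld = [(1, 1)] * 4 + [(0, 0)] * 4
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)] * 2

print(ld_r_squared(complete_ld))
print(ld_r_squared(no_ld))
```

markers in high r^2 with a causal variant can tag it in a gwas panel, whereas rapid ld decay gives finer mapping resolution but requires denser markers.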
for genotyping, restriction-site associated sequencing (rad-seq), genotyping-by-sequencing (gbs), skim-sequencing, and whole-genome resequencing approaches are being exploited for mid- to high-density trait mapping through qtl-based analysis (roorkiwal et al. ). molecular markers linked with various agronomic traits derived from association mapping are reported in crops including soybean (hu et al. ), brassica (qu et al. ; zhu et al. ), rice (feng et al. ; rao et al. ), chickpea (bajaj et al. ), foxtail millet (jaiswal et al. a, b) and various other plants (reviewed by muthamilarasan et al. ). in plants, qtls have mostly been identified for a variety of agronomic traits, including abiotic and biotic stress tolerance, yield and yield-contributing factors, flowering time, root architecture and nutrient uptake, and nitrogen fixation (in the case of soybean). a few landmark qtls associated with nutritional traits in major cereal crops (rice, wheat, and maize) are listed in table . in addition to these genomic qtls, several other types of qtls, namely expression qtl (eqtl), proteomic qtl (pqtl), metabolic qtl (mqtl), and phenomic qtl (phqtl), are emerging in breeding for crop improvement. an eqtl explains the genetic variance of a gene expression phenotype (nica and dermitzakis ), while in pqtl, protein abundance is correlated with genetic polymorphism (rodziewicz et al. ). on the other hand, chromosomal regions that encompass loci contributing to genetic variation in phenotypic traits are called phqtl. targeted metabolome profiling of wheat kernels through lc-ms/ms followed by linkage analysis has resulted in the identification of mqtls distributed unevenly in the genome. twenty-two candidate genes underlying these mqtls, which regulate the level of different metabolites, were functionally annotated (shi et al. ). comprehensive information about these qtls is essential for facilitating the effective use of genes and genomic regions that regulate key traits. 
various modern ngs-driven qtl mapping strategies, such as bulk population resequencing (qtl-seq), individual population resequencing, and mutmap, which utilize bi-parental mapping and mutant populations, have been found expedient for the identification of major qtls modulating agronomic traits in crop plants. all these advanced genomics approaches, including novel qtl strategies, have made it possible to detect both major and minor qtls governing the gene regulatory networks underlying vital agronomic traits, for quantitative dissection of complex traits and further genetic improvement of crops. genomics-assisted breeding (gab) is initiated with the identification of genomic markers associated with the qtl or gene(s) related to the agronomic trait of interest, followed by their application in the breeding platform (fig. ). molecular markers assist in the selection of desired offspring in the breeding cycle at the early growth stage, utilizing ngs-based high-throughput genotyping platforms ( crossa et al. ). numerous gab strategies have been deployed for crop improvement, including marker-assisted backcrossing, marker-assisted recurrent selection, and genomic selection. recently, speed breeding has been added to the list to expedite breeding processes. countries like india predominantly rely on breeding for crop improvement and variety release. in such cases, strengthening the breeding strategies and modernizing the approaches are required to meet the challenges faced by agriculture on time.
[table fragment: cooking and eating quality of grain; grain protein content (blanco et al. )]
marker-assisted backcrossing (mabc) is the introgression of a genomic region (qtl, locus, or gene) contributing the desired trait from a donor genotype into a breeding line or elite cultivar, without linkage drag, through multiple generations of backcrossing. 
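to see why backcrossing takes multiple generations, it helps to work out the expected share of the recurrent (elite) parent genome. without any marker-based background selection, each backcross halves the remaining donor share, so the expected recurrent-parent fraction after the n-th backcross (bcn) is 1 - (1/2)^(n+1). the sketch below evaluates this standard expectation; the function name is hypothetical, and the values are averages over progeny, which is precisely why markers are used to pick individuals that exceed the expectation.

```python
def expected_recurrent_parent_fraction(backcross_generations):
    """Expected recurrent-parent genome fraction after BCn.

    Uses 1 - (1/2)**(n + 1); n = 0 corresponds to the F1 hybrid.
    This is an average without marker-assisted background selection.
    """
    if backcross_generations < 0:
        raise ValueError("number of backcross generations must be >= 0")
    return 1 - 0.5 ** (backcross_generations + 1)


for n in range(0, 5):
    label = "F1" if n == 0 else f"BC{n}"
    print(f"{label}: {expected_recurrent_parent_fraction(n) * 100:.2f}% recurrent parent genome")
```

under this expectation, about 94% of the genome is recovered only by the third backcross, which is why selecting the best individuals with background markers can save generations.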
the resultant product of mabc contains the whole genome of the elite parent together with the genetic loci, qtl, or gene(s) contributing to the desired phenotype from the donor parent (gupta et al. ). marker-assisted recurrent selection (mars) was introduced to counter the inefficiency of mabc in transferring multiple qtls regulating complex traits like yield or broad-spectrum disease resistance. mars involves the detection and selection of large qtls or multiple genomic regions controlling complex agronomic traits within a single population or across populations and their pyramiding in a single genotype (ribaut et al. ; kulwal et al. ). this approach makes use of the f population and is most effective for cross-pollinating species. in contrast to mabc, favourable alleles may be contributed by both parents, and the selected improved genotype becomes a chimera of its parents. the superior allele enrichment involves estimating the phenotypic and marker effects for desired traits in the f population, followed by two or more cycles of marker-assisted selection (eathington et al. ). in the past few years, the hyderabad-based international maize and wheat improvement center (cimmyt) has made significant headway in the development of drought-tolerant maize inbred lines through the mars approach in its asia maize drought tolerance (amdrout) project. other applications of this method have also been reported in rice, wheat, barley, soybean, cotton, pea, and sunflower improvement, particularly for evolving durable resistance. genomic selection (gs) or genome-wide selection (gws) utilizes large-scale dna markers dispersed throughout the genome to develop superior germplasm lines. thus, the genomic selection approach has the potential to capture multiple widely distributed qtls/genes with minor additive effects. 
rigorous phenotyping is not mandatory for a breeding population, and subsequent offspring selection is primarily based on genotypic predictions, which combine the genomic and pedigree data for several generations of the breeding cycle (nakaya and isobe ). the genomic estimated breeding value (gebv), which sums the estimated effects of genome-wide molecular markers, is the basis of recurrent selection. high-density molecular markers, where each qtl is in linkage disequilibrium with at least one genomic marker, are prerequisites for precise gebv and thus for gws (habier et al. ). the success of gs also depends on the size and diversity of the training population (breeding lines selected for the gws programme). the reduced number of selection events has decreased the time and cost of breeding. this approach can be applied to both cross- and self-pollinated species with slight alterations (bernardo ). a few examples of crop improvement through this approach are the development of wheat lines resistant to stem rust caused by puccinia graminis f. sp. tritici (rutkoski et al. ), drought-tolerant high-yielding lines in maize (ziyomo and bernardo ), improved yield and related traits under drought in chickpea, and improved productivity in superior hybrids of rice (cui et al. ). gs has also expanded its horizon towards underutilized or less studied crops. de c. lara et al. ( ) have demonstrated the use of gs in an autotetraploid forage grass, panicum maximum. similarly, in miscanthus, slavov et al. ( ) have combined index selection and genomic prediction to achieve multiple breeding targets. these include increased biomass, delayed flowering, reduced lignin, and increased cellulose contents. in cassava, torres et al. ( ) have deployed gs for early selection and breeding for agronomic traits such as fresh root yield, dry matter content, dry yield, fresh shoot yield, and harvest index. 
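the gebv used in genomic selection can be sketched as a sum of marker effects weighted by each candidate's genotype. the example below assumes that marker effects have already been estimated in a training population (in practice with models such as ridge regression blup or bayesian methods); the effect sizes, line names, and 0/1/2 allele-dosage coding are hypothetical and serve only to show how untested candidates are ranked from genotype data alone.

```python
def gebv(genotype, marker_effects):
    """Genomic estimated breeding value: sum of allele dosage x marker effect."""
    if len(genotype) != len(marker_effects):
        raise ValueError("genotype and marker_effects must have the same length")
    return sum(dosage * effect for dosage, effect in zip(genotype, marker_effects))


# Hypothetical additive effects of four markers, as if estimated in a training population.
marker_effects = [0.5, -0.2, 0.1, 0.3]

# Hypothetical selection candidates, genotyped but not phenotyped (allele dosage 0, 1 or 2).
candidates = {
    "line1": [2, 0, 1, 2],
    "line2": [0, 2, 2, 0],
    "line3": [1, 1, 1, 1],
}

gebv_by_line = {name: gebv(g, marker_effects) for name, g in candidates.items()}
ranking = sorted(gebv_by_line, key=gebv_by_line.get, reverse=True)

for name in ranking:
    print(f"{name}: GEBV = {gebv_by_line[name]:.2f}")
```

because many small effects are summed, gs can capture minor-effect qtls that individual marker tests would miss, provided the training population is large and diverse enough to estimate the effects reliably.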
these reports suggest the popularity and applicability of gs in enhancing traits at a quicker pace, which might lead to the early release of improved genotypes for agricultural production. in breeding, time is an important factor that decides the release of genotypes to farmers. conventional breeding takes - years for crossing experiments, followed by - years for testing the yield, diseases and quality, and another - years for the release of varieties. altogether, considerable time is invested in improving a single genotype or variety. given this, the approach of modulating daylight and its duration to accelerate the life cycle, termed 'speed breeding', has been introduced (fig. ). this recently developed speed breeding technology shortens the breeding cycle by accelerating crop generation through controlled, rapid growth-promoting conditions in glasshouses and growth chambers ( ghosh et al. ). by modulating lighting, photoperiod, humidity, temperature, and other factors, the approach can achieve six generations per year for crops like wheat, barley, chickpea, and canola (hickey et al. ). in contrast, in the glasshouse, these crops can undergo only three generations a year (hickey et al. ). early anthesis was reported in plants grown under the speed breeding setup, with fully viable mature seeds. seed yield (g per plant) did not differ between speed breeding and normal photoperiod conditions in almost all crops. adoption of the technique in breeding programmes will accelerate the generation of mapping populations, reduce the duration of mabc/mars/gws, and expedite the progression towards homozygosity. apart from major crops that are mostly annual or biennial, the method also has immense potential to hasten the improvement of woody shrubs and perennial plants. the optimization of methods that reduced the juvenile phase from years to months in apple and to years in chestnut is an example of the application of an accelerated breeding cycle in perennial crops (baier et al. ; van nocker and gardiner ). rana et al. ( ) coupled marker-assisted selection with speed breeding to develop salt-tolerant rice lines. similarly, bauerle ( ) has shown that the number of generations per year in hops could be increased from one (under field conditions) to four (through speed breeding), which could accelerate selection for flower yield and quality in this crop. jighly et al. ( ) have combined gs with speed breeding to enhance genetic gains in allogamous plants like tall fescue. the approach, named speedgs, is gaining popularity among breeders for achieving higher genetic gain per cycle, especially for traits with low heritability.
fig. mapping of quantitative trait loci associated with complex agronomic traits and their application in genomics-assisted breeding. linkage analysis in a mapping population segregating for the desired phenotype enables qtl identification, which is then generally employed in mas.
transgenic or genetically modified (gm) crops have genomes modified at the gene level through genetic engineering techniques. while breeding is time-consuming and allows only the transfer of genetic information between closely related species, genetic engineering or transgene-based research facilitates the transfer of genes from any source into plants. however, an established protocol for introducing the gene into the host species and rigorous selection are required to achieve greater success. agrobacterium tumefaciens-mediated genetic transformation is one of the reliable approaches used to obtain stable transgenic lines. in contrast, other techniques, including particle bombardment (biolistics), sonication, and electroporation, are used for transient expression of foreign dna. 
singh and prasad ( ) comprehensively discussed the merits and demerits of a. tumefaciens-mediated genetic transformation in cereals. the prime bottleneck in this approach is the lack of an optimized protocol for several important species, and the optimization is a time-consuming and labour-intensive process.
[fig. caption fragment: approximately takes - years to release an improved cereal variety, while speed-breeding-assisted mab would be completed within - years]
achieving higher transformation efficiency is another issue that requires the fine-tuning of the experimental parameters. once established, the protocol will serve as a key to introducing several genes into the target genome to attain better performance and phenotype. so far, transgenics in crops have been commercialized, of which zea mays accounts for the highest number. cultivation of transgenic crops has boosted agricultural productivity by about %, leading to a % increase in profits (kumar et al. ). several genetic engineering technologies have been utilized for crop improvement, which are briefly discussed in the following sections (fig. ). bt cotton has contributed to the indian economy for a while, and in low-income countries like bangladesh, bt brinjal has supported the economy and the livelihood of farmers. despite these advantages, the public acceptance of transgenic crops has been quite low, and concerns have been raised regarding ecological hazards and safety-related issues in the context of human consumption. however, there is no scientific evidence that shows transgenic crops to cause health hazards (tsatsakis et al. ; de vos and swanenburg ). irrespective of these, biotechnology holds the key to the future of agriculture as the challenges faced by the farming sector are on a steady rise. 
at present, intensive agriculture is securing the life and livelihood of farmers and contributes to the global stock of grains and vegetables; however, issues like the spread of new diseases, insect/pest attack, erroneous rainfall, lack of soil fertility due to overuse of synthetic fertilizers, monotonous cropping, etc. could soon pose a serious threat to the ongoing agriculture. plants need to withstand multiple stresses rather than single stress in their environment. modulation of genes that regulate multiple stress responses through biotechnological interventions will help in the development of plants with enhanced efficiency under such conditions (pandey et al. ) . several crops have been modified using biotechnology and have either been released or have the potential to be released. examples of such crop species have been provided in table . while addressing climate change and introducing good farming practices receives importance on one hand, it is also imperative to release varieties that could be climate-resilient or sustain in abnormal conditions, thus securing the food security of global population. gene cloning and isolation have facilitated the pulling out of a target gene from any genome that can then be transformed into any other genome for its expression. expression of 'cry' gene of bacillus thuringiensis in plants is a typical example of this approach, and it is still popular since there is a rise in pests and insects that attacks plants. expression of genes from other plant species for enhancing the agronomic or stress tolerance traits of target crops is also being practiced for a while. dreb (dehydration responsive element binding) protein-encoding genes are one such class of genes that were frequently isolated from one species and expressed in another for enhancing the tolerance to different abiotic and biotic stresses. 
further, these expression and overexpression strategies also assist in elucidating the function of genes, which is an important task considering a large number of genes are at our disposal due to the advancement of ngs technologies. their functional characterization becomes essential not only for basic research but also for application purposes. the t-dna insertion lines of arabidopsis thaliana have served as a vital resource for elucidating gene function. the next step after functional characterization is to utilize the gene for crop improvement programmes, and one of the most widely used methods is the overexpression of candidate genes. several success stories depict the immense potential of gene overexpression in crops. overexpression of argos genes in zea mays leads to a reduction in sensitivity to ethylene, and transgenic plants show enhanced drought resistance as well as higher grain yield in well-watered as well as drought conditions (shi et al. ) . similarly, transgenic glycine max plants overexpressing gmwri b show higher oil content and improved plant architecture under field conditions (guo et al. ). the list for crop improvement for biotic and abiotic stresses, nutritional enhancement, increase in yield, biofuel production, herbicide resistance, etc., through overexpression approach is quite long with new additions at an ever-increasing rate. some recent examples have been presented in supplementary table . the functional validation of new candidates is also increasing at a rapid pace generating novel resources for crop improvement programmes (lata and prasad ; puranik et al. ; singh et al. ) . 
the availability of genome sequence information in public domains has facilitated the large-scale analysis of genes and gene families. characterizing those genes for their physicochemical properties, genomic composition, promoter elements, and expression profiles in response to stress, hormonal treatments and developmental stages has pinpointed several candidate genes that could be overexpressed in target organisms to enhance the traits of interest. the discovery of rnai was a breakthrough in the history of biology, and since then it has been widely utilized in functional genomics, reverse genetics and crop improvement (rosa et al. ). the rnai pathway involves the generation of small rnas (srna), which include short interfering rna (sirna), microrna (mirna), trans-acting sirna (ta-sirna) and natural-antisense sirna (nat-sirna), which mediate silencing or epigenetic regulation of their target genes. rnai can be utilized through both transformative and non-transformative strategies. transformative rnai has been used in several modified forms like artificial mirna (amirna), artificial ta-sirna (ata-sirna), hairpin rna (hprna), intrinsic direct repeat, ′-untranslated region (utr) direct repeat, terminator-less, single-stranded promoter antisense and intron delivered promoter hprna (guo et al. ). numerous examples can be cited where rnai has been successfully utilized for improving important traits, such as modification of plant architecture; improvement in fruit quality in terms of high β-carotene and lycopene content; enhanced shelf life; nutritional enhancement like low gluten content; reduction in toxic terpenoids; biotic stress resistance against viruses, fungi, bacteria and nematodes; and abiotic stress resistance to heat, drought, salinity, and cold (kamthan et al. ). the non-transformative rnai technique, spray induced gene silencing (sigs), has gained widespread attention due to its low cost of application and feasibility of use.
it involves spraying plants with double-stranded (ds) rna/sirna and has been successfully utilized for controlling insect pests, which are the carriers of several viral pathogens (cagliari et al. ; worrall et al. ) . plants sprayed with dsrna/ srna targeting dcl and dcl of botrytis cinerea showed a significant reduction in grey mold disease symptoms highlighting the potential of this technology for the generation of next-generation eco-friendly biofungicides . transgenic plants are met with criticism in several countries, and widespread acceptance is still lacking. it is estimated that about million dollars are spent to bring a transgenic crop into commercialization (rosa et al. ) . however, despite all the efforts and promising features that gmos have to offer, anti-gmo responses follow. considering this, sigs being a non-gmo approach has enormous potential for crop improvement, and it is also crucial that we devise new dsrna/srna delivery strategies for silencing host as well as pathogen genes. recent crop improvements utilizing rnai as a tool have been summarized in supplementary table . in the functional genomics perspective, rnai was useful in gene characterization studies; however, the recent advent of virus-induced gene silencing has now established its prominence over rnai. precise genome editing has revolutionized genetic engineering, and this started in when for the first time, it was shown that dna binding zinc finger domains along with fok endonuclease domains could cleave dna at defined regions and act as site-specific nucleases (ssns) (kim et al. ) . further research led to the development of transcription activator-like effector nucleases (talens) and clustered regularly interspaced short palindrome repeats (crispr)/crispr-associated protein (cas ). meganucleases (megan) recognize long dna sequences that are greater than nucleotides (nt) up to nt. 
since they have endonuclease activity, they produce double-stranded (ds) breaks at the recognition sites. however, their use in genome editing has been minimal because few megans are available and they cannot be used for every locus (silva et al. ). in contrast, crispr/cas has been more popular because of its ease of use compared to other genome editing technologies (das et al. ). as the name suggests, crispr/cas consists of two components: a customizable single-guide rna and the cas endonuclease. another prime requirement of the system is a protospacer adjacent motif (pam) ( ′ngg ′), which is required for inducing ds breaks at the targeted sites in the genome. the breaks are repaired through either homology directed repair (hodr) or non-homologous end joining (nhej). since nhej is error-prone, repair leads to insertions or deletions at the target site (khatodia et al. ). crispr/cas has shown immense potential for crop improvement, and several traits, ranging from nutritional quality to biotic and abiotic stress resistance, have been enhanced (jaganathan et al. ; das et al. ) (supplementary table ). however, there are certain limitations, like the restrictions due to the requirement of pam, off-target effects, and low efficacy of hodr. another problem is that plant viruses are known to mutate at tremendous rates. reports suggest that the evolution of editing-resistant viruses may lead to viral escapes within a short span of time (mehta et al. ).
(figure caption: application of functional and comparative genomics in marker-assisted breeding and biotechnological approaches for crop improvement. the candidate gene(s) identified from functional genomic studies can be introduced through genetic engineering or modified in a targeted manner through genome editing technology in crop species for improved agronomic traits. the other approach is molecular breeding, which employs molecular markers to identify genomic regions associated with desired traits during the breeding programme.)
however, the technology is continually being supplemented with novel innovations overcoming some of the drawbacks. single nucleotide polymorphisms (snps) are responsible for certain elite traits in crops through genome-wide association studies (gwas) . base editing is a new approach that can be utilized for editing these snps. cas is fused with cytidine deaminase enzyme that has base conversion activity (c → t or g → a), and this modified method does not require dsdna as a repair template (komor et al. ). the problem of low efficiency of hodr is also overcome as it is not involved in the base conversion process. conventional genome editing is associated with the insertion of dna cassettes at random regions within the genome, and this may lead to other undesirable effects. the regulatory concerns for transgenics are another hindrance. to overcome these problems, dna free genome editing was developed, which involves the delivery of preassembled crispr/cas ribonucleoproteins (rnps) to protoplasts by particle bombardment (woo et al. ) . the only problem with this method is that protoplast regeneration systems are not yet available for a majority of crops, but this can be overcome with a research focus in this direction. crispr/cas is also limited due to the requirement of 'ngg ' pam, and cas variants are essential to overcome this problem. crispr from prevotella and francisella (cpf ) endonuclease recognize t-rich pams, thus broadening the target range of the system (zetsche et al. ) . several variants exploiting the crispr-based approach to achieve other functions had also been reported. for example, deactivating the nuclease domain and engineering a transcription enhancer to cas could promote the expression of the target gene. similar engineering of deaminase to the cas protein facilitates the conversion of cytosine residues to thymine. thus, crispr/ cas -based approach has multiple applications in editing the genes and genomes at base-pair level. 
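the pam restriction discussed above can be illustrated with a minimal sketch that scans a dna string for candidate target sites followed by an ngg motif. several things here are assumptions made for illustration and are not taken from the text: the function name find_ngg_targets, the 20-nt protospacer length (a commonly used default for cas nucleases that recognize ngg), and the restriction to the forward strand only. it is not a guide-design tool and ignores off-target effects entirely.

```python
import re
from typing import List, Tuple


def find_ngg_targets(dna: str, protospacer_len: int = 20) -> List[Tuple[int, str, str]]:
    """Return (start, protospacer, pam) for every forward-strand site
    where a protospacer is immediately followed by an NGG motif.

    Assumptions (illustrative only): forward strand only, and a
    20-nt protospacer by default.
    """
    dna = dna.upper()
    # A zero-width lookahead lets overlapping candidate sites be reported.
    pattern = re.compile(f"(?=([ACGT]{{{protospacer_len}}})([ACGT]GG))")
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]


if __name__ == "__main__":
    example = "ACGTACGTACGTACGTACGT" + "TGG" + "CC"
    for start, protospacer, pam in find_ngg_targets(example):
        print(start, protospacer, pam)
```

a variant that recognizes t-rich pams could, in principle, be explored by changing the motif in the pattern, although the position of the pam relative to the target differs between nucleases and would need to be handled separately.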
of note, the method is not considered as gm in several countries, which would encourage the use of crispr/cas in developing lines that could meet the challenges currently faced by agriculture. the foremost challenge for crop scientists is to increase agricultural productivity to meet the demand for food supply for a rapidly expanding global population, which is expected to reach approximately billion by the middle of the twenty-first century (united nations, world population prospects ). on the one hand, global warming, constrained environmental conditions, and biotic factors are limiting crop yield. on the other hand, fertile farmland is also shrinking due to rapid urbanization and soil erosion. besides, pandemic situations like the recent covid- outbreak have introduced significant gaps in the food and nutritional security of the global population. therefore, the rapid release of environmentally sustainable high-yielding varieties is required. molecular breeding and genetic manipulation have emerged as the two most potent technologies with the potential to attain food and energy security for the coming years (fig. ). advances in ngs technology have enabled the incorporation of genomics into various disciplines of crop breeding. large-scale genomic markers and high-throughput genotyping applied in breeding have accelerated cultivar development and improved its cost-efficiency. similarly, functional and comparative genomics have provided the platform for gene discovery and functional characterization. the key gene or genes regulating a molecular pathway are being genetically engineered or edited to develop phenotypically improved crop lines. whatever the approach, either molecular breeding or biotechnological tools, the ultimate intention is to ensure food for all through increased productivity per plant and to minimize yield loss caused by external factors.
the collaborative research investments of various branches of science are paramount to sustainable crop improvement. author contribution statement mp conceived and outlined the review. rks and ap prepared the first draft, tables, and figures; mm and skp improved the manuscript and provided revisions to it. all authors have read and approved the final version of this manuscript. snp-seek database of snps derived from rice genomes qtl mapping of grain quality traits from the interspecific cross oryza sativa × o. glaberrima analysis of the genome sequence of the flowering plant arabidopsis thaliana transcriptome-based prediction of complex traits in maize early flowering in chestnut species induced under high dose light in growth chambers genetic strategies for improving crop yields ecotilling-based association mapping efficiently delineates functionally relevant natural allelic variants of candidate genes governing agronomic traits in chickpea disentangling photoperiod from hop vernalization and dormancy for global production and speed breeding variation in abundance of predicted resistance genes in the brassica oleracea pangenome steady expression of high oleic acid in peanut bred by marker-assisted backcrossing for fatty acid desaturase mutant alleles and its effect on seed germination along with other seedling traits genome wide selection with minimal crossing in self-pollinated crops detection of grain protein content qtls across environments in tetraploid wheats management of pest insects and plant diseases by non-transformative rnai genetic analysis of agronomic traits and grain iron and zinc concentrations in a doubled haploid population of rice next-generation protein-rich potato expressing the seed protein gene am a is a result of proteome rebalancing in transgenic tuber redirection of tryptophan leads to production of low indole glucosinolate canola genomic selection in plant breeding: methods, models, and perspectives hybrid breeding of rice via genomic 
selection crispr/cas : a novel weapon in the arsenal to combat plant diseases genomic selection with allele dosage in panicum maximum health effects of feeding genetically modified (gm) crops to livestock animals: a review molecular markers in a commercial breeding program grain protein content and thousand kernel weight qtls identified in a durum × wild emmer wheat mapping population tested in five environments qtl mapping of starch granule size in common wheat using recombinant inbred lines derived from a ph - /neixiang cross genome wide association mapping for grain shape traits in indica rice drought stress tolerance strategies revealed by rna-seq in two sorghum genotypes with contrasting wue qtl detection for water-soluble oligosaccharide content of grain in common wheat genetic identification of quantitative trait loci for contents of mineral nutrients in rice grain trends in plant research using molecular markers speed breeding in growth chambers and glasshouses for crop breeding and model plant research genetic variation for protein content and yield-related traits in a durum population derived from an inter-specific cross between hexaploid and tetraploid wheat cultivars a draft sequence of the rice genome (oryza sativa l. ssp. japonica) towards plant pangenomics the pangenome of an agronomically important crop plant brassica oleracea comprehensive phenotypic analysis and quantitative trait locus identification for grain mineral concentration, content, and yield in maize identification of unconditional and conditional qtl for oil, protein and starch content in maize rna silencing in plants: mechanisms, technologies and applications in horticultural crops a na + /h + antiporter, k -nhad, improves salt and drought tolerance in cotton (gossypium hirsutum l.) 
marker assisted selection as a component of conventional plant breeding genomic approaches for designing durum wheat ready for climate change with a focus on drought the impact of genetic relationship information on genome-assisted breeding values genome-wide analysis of microsatellite markers based on sequenced database in chinese spring wheat (triticum aestivum l.) functional analysis of starch-synthesis genes in determining rice eating and cooking qualities breeding crops to feed billion insights into the maize pan-genome and pan-transcriptome mapping of quantitative trait loci (qtls) for rice protein and fat content using doubled haploid lines association mapping of yield-related traits and ssr markers in wild soybean snp discovery using a pangenome: has the single reference approach become obsolete? shifting the limits in wheat research and breeding using a fully annotated reference genome. science :eaar international rice genome sequencing project ( ) the map-based sequence of the rice genome crispr for crop improvement: an update review genome-wide association study (gwas) delineates genomic loci for ten nutritional elements in foxtail millet (setaria italica l.) genome-wide association study of major agronomic traits in foxtail millet (setaria italica l.) using ddrad sequencing boosting genetic gain in allogamous crops via speed breeding and genomic selection small rnas in plants: recent development and application for crop improvement accelerated development of rice stripe virus-resistant, near-isogenic rice lines through markerassisted backcrossing the crispr/cas genome-editing tool: application in improvement of crops hybrid restriction enzymes: zinc finger fusions to fok i cleavage domain physicochemical characteristics and qtl mapping associated with the lipid content of high-lipid rice identification of quantitative trait loci for rice grain quality and yield-related traits in two closely related oryza sativa l. subsp. 
japonica cultivars grown near the northernmost limit for rice paddy cultivation programmable editing of a target base in genomic dna without double-stranded dna cleavage molecular mapping of the grain iron and zinc concentration, protein content and thousand kernel weight in wheat genomics interventions in crop breeding for sustainable agriculture genetically modified crops: current status and future prospects maximizing the potential of multi-parental crop populations resequencing of wild and cultivated soybean genomes identifies patterns of genetic diversity and selection role of drebs in regulation of abiotic stress responses in plants analysis of qtls associated with the rice quality related gene by double haploid populations influence of extreme weather disasters on global crop production qtl identification of grain protein concentration and its genetic correlation with starch concentration and grain weight using two populations in maize (zea mays l.) de novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits investigating drought tolerance in chickpea using genome-wide association mapping and genomic selection based on whole-genome resequencing data qtl mapping for maize starch content and candidate gene prediction combined with co-expression network analysis dissecting the genetic basis for the effect of rice chalkiness, amylose content, protein content, and rapid viscosity analyzer profile characteristics on the eating quality of cooked rice using the chromosome segment substitution line population across eight environments qtl analysis of percentage of grains with chalkiness in japonica rice (oryza sativa) temporal transcriptome profiling reveals expression partitioning of homeologous genes contributing to heat and drought acclimation in wheat (triticum aestivum l.) 
identification of quantitative trait loci and candidate genes for maize starch granule size through association mapping development of nutritious rice with high zinc/selenium and low cadmium in grains through qtl pyramiding pan-genome of wild and cultivated soybeans whole-genome resequencing reveals brassica napus origin and genetic loci involved in its improvement super annigeri and improved jg : two fusarium wilt-resistant introgression lines developed using marker-assisted backcrossing approach in chickpea (cicer arietinum l.) development of a high-density snp-based linkage map and detection of qtl for β-glucans, protein content, grain yield per spike and heading time in durum wheat linking crispr-cas interference in cassava to the evolution of editing-resistant geminiviruses integrated genomics, physiology and breeding approaches for improving drought tolerance in crops advances in setaria genomics for genetic improvement of cereals and bioenergy grasses exploiting genome sequence information to develop genomic resources for foxtail millet improvement genetic determinants of drought stress tolerance in setaria development of intron-length polymorphic markers for large-scale genotyping applications in foxtail millet exploration of millet models for developing nutrient rich graminaceous crops multi-omics approaches for strategic improvement of stress tolerance in underutilized crop species: a climate change perspective association mapping: critical considerations shift from genotyping to experimental design will genomic selection be a practical method for plant breeding expression quantitative trait loci: present and future candidate genes and genome-wide association study of grain protein content and protein deviation in durum wheat genome-wide development and use of microsatellite markers for large-scale genotyping applications in foxtail millet linking the plant stress responses with rna helicases improving the glossiness of cooked rice, an important component 
of visual rice grain quality the sorghum bicolor genome and the diversification of grasses quantitative trait loci conferring grain mineral nutrient concentrations in durum wheat × wild emmer wheat ril population green revolution: impacts, limits, and the path ahead qtl analysis for grain protein content using ssr markers and validation studies using nils in bread wheat tomato yellow leaf curl virus: impact, challenges, and management recent advances in small rna mediated plant-virus interactions nac proteins: regulation and role in stress tolerance genome-wide association mapping and identification of candidate genes for fatty acid composition in brassica napus l. using snp markers salt tolerance improvement in rice through efficient snp marker-assisted selection coupled with speed-breeding identification of rice landraces with promising yield and the associated genomic regions under low nitrogen molecular breeding in developing countries: challenges and perspectives identification of drought responsive proteins and related proteomic qtls in barley integrating genomics for chickpea improvement: achievements and opportunities rna interference mechanisms and applications in plant pathology mapping qtls related to zn and fe concentrations in bread wheat (triticum aestivum) grain using microsatellite markers high density mapping of quantitative trait loci conferring gluten strength in canadian durum wheat genomic selection for durable stem rust resistance in wheat rna-seq reveals novel genes and pathways associated with hypoxia duration and tolerance in tomato root whole genome de novo assemblies of three divergent strains of rice, oryza sativa, document novel gene space of aus and indica genome sequence of the palaeopolyploid soybean the b maize genome: complexity, diversity, and dynamics reduction of polygalacturonase activity in tomato fruit by antisense rna identification of two stably expressed qtls for fat content in rice overexpression of argos genes modifies 
plant sensitivity to ethylene, leading to improved drought tolerance in both arabidopsis and maize metabolomics analysis and metaboliteagronomic trait associations using kernels of wheat (triticum aestivum) recombinant inbred lines meganucleases and other tools for targeted genome engineering: perspectives and challenges for gene therapy advances in agrobacterium tumefaciensmediated genetic transformation of graminaceous crops genome wide association studies for improving agronomic traits in foxtail millet genomics-assisted breeding for improving stress tolerance of graminaceous crops to biotic and abiotic stresses: progress and prospects study on aquaporins of setaria italica suggests the involvement of sipip ; and sisip ; in abiotic stress response versatile roles of aquaporin in physiological processes and stress tolerance in plants genomic index selection provides a pragmatic framework for setting and refining multi-objective breeding targets in miscanthus an efficient agrobacterium-mediated genetic transformation method for foxtail millet (setaria italica l.) quantitative trait loci for phytate in rice grain and their relationship with grain micronutrient content evaluation of rice (oryza sativa l.) 
near isogenic lines with root qtls for plant production and root traits in rainfed target populations of environment quantitative trait loci (qtls) for quality traits related to protein and starch in wheat identification and validation of quantitative trait loci for grain protein concentration in adapted canadian durum wheat populations mapping quantitative trait loci underlying the cooking and eating quality of rice using a dh population genomic selection for productive traits in biparental cassava breeding populations environmental impacts of genetically modified plants: a review genetic engineering of cotton plants and lines breeding better cultivars, faster: applications of new technologies for the rapid deployment of superior horticultural tree crops resequencing of chickpea accessions from countries provides insights into genome diversity, domestication and agronomic traits crispr-edited crops free to enter market, skip regulation genetic basis of traits and viscosity parameters characterizing the eating and cooking quality of rice grain conditional qtl mapping of protein content in wheat with respect to grain yield and its components genomic variation in , diverse accessions of asian cultivated rice characterization of polyploid wheat genomic diversity using a high-density , single nucleotide polymorphism array bidirectional cross-kingdom rnai and fungal uptake of external rnas confer plant protection qtl for fatty acid composition of maize kernel oil in illinois high oil × b backcross-derived lines speed breeding is a powerful tool to accelerate crop research and breeding panomics meets germplasm dna-free genome editing in plants with preassembled crispr-cas ribonucleoproteins exogenous application of rnai-inducing double-stranded rna inhibits aphidmediated transmission of a plant virus an integrated analysis of the rice transcriptome and metabolome reveals root growth regulation mechanisms in response to nitrogen availability identification of candidate 
genes for drought tolerance by wholegenome resequencing in maize verification of qtl for grain starch content and its genetic correlation with oil content using two connected ril populations in high-oil maize qtl verification of grain protein content and its correlation with oil content by using connected ril populations of high-oil maize genetic analysis of sugar-related traits in rice grain detection of quantitative trait loci for kernel oil and protein concentration in a b and zheng maize cross improving rice blast resistance of feng s through molecular marker-assisted backcrossing engineering the provitamin a (beta-carotene) biosynthetic pathway into (carotenoid-free) rice endosperm identification of quantitative trait loci for lipid metabolism in rice seeds a draft sequence of the rice genome analysis of rice grain quality-associated quantitative trait loci by using genetic mapping cpf is a single rna-guided endonuclease of a class crispr-cas system genome sequence of foxtail millet (setaria italica) provides insights into grass evolution and biofuel potential qtl mapping for quantities of protein fractions in bread wheat mapping and validation of quantitative trait loci associated with concentrations of elements in unmilled rice grain pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice meta-analysis of genome-wide association studies provides insights into genetic control of tomato flavor development of genome-wide indel markers and their integration with ssr, dart and snp markers in single barley map a platinum standard pan-genome resource that represents the population structure of asian rice from golden rice to astarice: bioengineering astaxanthin biosynthesis in rice endosperm identification of snp loci and candidate genes related to four important fatty acid composition in brassica napus using genome wide association study drought tolerance in maize: indirect selection through secondary traits versus genome-wide 
selection de novo domestication of wild tomato using genome editing

key: cord- -el v a authors: tan, h.s. title: fourier spectral density of the coronavirus genome date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: el v a

we present an analysis of the coronavirus rna genome via a study of its fourier spectral density based on a binary representation of the nucleotide sequence. we find that at low frequencies, the power spectrum presents a small and distinct departure from the behavior expected from an uncorrelated sequence. we provide a couple of simple models to characterize such deviations. away from a small low-frequency domain, the spectrum presents largely stochastic fluctuations about fixed values which vary inversely with the genome size generally. it exhibits no other peaks apart from those associated with triplet codon usage. we uncover an interesting, new scaling law for the coronavirus genome: the complexity of the genome scales linearly with the power-law exponent that characterizes the enveloping curve of the low-frequency domain of the spectral density.

motivated by our search for deeper organizational principles governing genetic information, the study of a dna/rna genome via its fourier spectral density has given us several interesting insights into the code of life. an example of a seminal paper in this subject is that of voss in [ ] where the author found that the spectral density of the genome of many different species follows a power law of the form /k β in the low-frequency domain, with the exponent β potentially related to the organism's evolutionary category. in [ ] , β was found to be close to , a phenomenon shared by a wide variety of physical systems, especially those that carry long-range correlations or are characterized by a myriad of length scales.
it was also found that the power spectra may contain defining peaks or resonances, for example at period for primates, vertebrates and invertebrates, or period - for yeast, bacteria and archaea as shown in [ ] where the peaks were remarkably related to aspects of protein structuring and folding. over the years, these methods and results have been extended in various ways [ ] , such as wavelet-type analysis [ , , ] of the sequences, using features of the spectra to classify and cluster genomes with the aid of neural networks [ ] , prediction of coding regions [ ] and periodic structures [ ] , etc. in this paper, we study the fourier spectral density of the genome of coronaviruses -a positivesense single-stranded rna genome with size ranging from roughly to kilobases, based on the dataset of [ ] which covers all four genera of coronaviruses. in addition, motivated by the recent covid- pandemic, we include the genomes of sars-cov- , a bat coronavirus bat-ratg of close genome identity, and the mers coronavirus. across the different genome sequences, we find that their fourier spectra take on the same form. there is a low frequency domain (k in units of inverse genome length) where a sinc-squared-like oscillatory form is enveloped by a roughly /k decay curve. this is followed by stochastic white-noise type fluctuations about fixed mean values which tend to vary inversely with the genome size. we find that a random, uncorrelated sequence -with the probability of occurrence for each nucleotide being its frequency ratio in the sequence -yields similar behavior in the low-frequency domain. we develop a few models to characterize the typical spectrum, and in the process stumble upon a linear scaling law between a measure of the complexity of each genome and the power-law exponent that describes the enveloping curve of the low-frequency domain. 
the complexity measure that we use here is intimately related to the shannon entropy of the sequence, and thus this relation concretely realizes a way by which information-theoretic content is carried within the genome's spectral density. now, power-law decay of the form /k α have previously been discussed in literature for other types of genomes (see for example [ , ] ). we would like to emphasize that here, we do not employ either the fast fourier transform or non-overlapping averaging procedures to smoothen the data in the low-frequency domain. these are common techniques used for easing computations in past related works, but may compromise the sensitivity by which we characterize the spectral curves. we also perform the spectral density analysis at the level of the coding region (a few thousand nucleotides) for the spike protein, an essential protein that binds to the host cell's receptor. we find that interestingly, the general features of the spectrum persist at the protein level, but not the scaling law mentioned above. our paper is organized as follows. in section , we present some background theory for our work, followed by section where we present the results and a few graphical plots for visualization, before concluding in section . the appendix a collects a table listing all the genbank accession numbers [ ] of the genomes, and another gathers several graphs useful for interpreting our various results. in this section, we present some essential mathematical concepts that form the basis for our study. our analysis of the rna genomes can only begin after transformation of the genome sequence consisting of the four nucleotides (adenine, cytosine, guanine and uracil) into a numerical string. the spectral density of interest here is the absolute square of the discrete fourier transform of a nucleotide indicator function φ(i) defined as follows where m denotes the length of the genome, and α, β denote particular choices of nucleotides. 
in the continuum limit and after averaging over some distribution of genomes, this approaches the fourier transform of the correlation function. now, a basic premise lies in the choice of the indicator function φ(i). while various propositions have been explored in the literature, in this paper, following [ ] , we use a simple binary-valued model where for each nucleotide, φ(i) is equal to if the nucleotide is found at position 'i' and otherwise. for all our genome data, we find that ( ) exhibits a clear specific oscillatory form that resembles a sinc(-squared) function in the low-frequency domain (up to k ∼ ). in the following, we furnish a potential simple explanation of such low-frequency behavior. for simplicity and definiteness, we will mainly focus on the spectral density sum for the rest of the paper, but we have checked that the general features described above pertain to the cross-spectra s αβ (with β ≠ α) as well as the individual autocorrelations s αα for all the four nucleotides. apart from computing various quantities at the level of the entire rna genome, we also examine the spectral density associated with the coding region for the spike protein. for the coronaviruses, apart from the spike protein, the genome encodes several proteins each carrying unique functions, such as the envelope, membrane, nucleocapsid, etc. in particular, the spike protein plays an essential role in host cell receptor binding during the process of viral infection, and is thus a common target for the development of antibodies and vaccines (see for example, [ ] ). now the coding region associated with this protein is only of the order of nucleotides, so a priori it is not clear if the spectral density can be meaningfully analyzed. we find however that the general features of the spectral density persist for the spike protein's coding region too. 
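the binary indicator construction described above can be sketched in a few lines. this is a minimal illustration, not the author's code: the function names, the 1/m^2 normalization, and the mapping of t to u are assumptions of the sketch. because k is measured in units of inverse genome length and may take non-integer values, the transform is evaluated directly rather than with a fast fourier transform.

```python
import numpy as np

NUCLEOTIDES = "ACGU"


def indicator(sequence, nucleotide):
    """Binary indicator: 1.0 where `nucleotide` occurs, 0.0 elsewhere."""
    return np.array([1.0 if base == nucleotide else 0.0 for base in sequence])


def spectral_density(sequence, ks):
    """Sum over the four nucleotides of |direct DFT of the indicator|^2.

    `ks` may contain non-integer values, so no FFT is used.
    The division by M^2 is an assumed normalization for this sketch.
    """
    sequence = sequence.upper().replace("T", "U")
    m = len(sequence)
    positions = np.arange(m)
    ks = np.asarray(ks, dtype=float)
    # phases[a, j] = exp(2*pi*i * ks[a] * j / M)
    phases = np.exp(2j * np.pi * np.outer(ks, positions) / m)
    total = np.zeros(len(ks))
    for nucleotide in NUCLEOTIDES:
        amplitude = phases @ indicator(sequence, nucleotide)
        total += np.abs(amplitude) ** 2
    return total / m**2
```

for a genome of some tens of thousands of nucleotides, evaluating a few hundred k values this way stays cheap; the phase matrix grows as (number of k values) × m, so very dense k grids are better processed in batches.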
consider the case of an uncorrelated numerical sequence, where the probability of a nucleotide of type α occurring at some position is a constant, independent of the other positions and of the position itself. given n α such nucleotides in the sequence, we can estimate this constant to be n α /m , with the expectation value of the spectral density being we would find that up to k ∼ , ( ) models the spectral density rather well. the local maxima of ( ) occur approximately at half-integer values of k, and thus the upper envelope of the oscillation is manifest as a /k decay function in this domain, which follows from expanding ( ) about k = . the decaying behavior of the envelope curve typically stops at about k ∼ , and thereafter the spectral density appears to be characterized by stochastic fluctuations about some fixed mean. although ( ) appears to model observed datasets well, the goodness of fit doesn't extend beyond the k ∼ range, nor is it clear from the data whether deviations from ( ) within the low-frequency domain are unimportant random fluctuations or otherwise. to gain further insights, we present a few simple models which characterize the observed deviations from ( ). the models' parameters can potentially be used for clustering coronavirus genomes if future studies prove that these values persist for larger sets of data, or more interestingly, they could potentially demonstrate correlation with other features of the genome that would help us recognize the presence of long-range correlations. from now on, we refer to ( ) as the 'uncorrelated background'. in the following, we present three models for the observed spectral density that characterize deviations from the uncorrelated background. the first two concern the description of the low-frequency domain (k ) whereas the third involves a more global description. 
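as a rough sketch of what such an uncorrelated background looks like, suppose each indicator value is an independent bernoulli draw with probability p α = n α /m. the expected squared modulus of its transform then splits into a flat variance term, m p α (1 − p α ), plus p α ² times the squared modulus of a plain geometric sum, sin²(πk)/sin²(πk/m). the code below evaluates this expression summed over nucleotides. the bernoulli assumption, the function name, and the 1/m^2 normalization are choices made for this sketch and may differ from the exact form of the original equation.

```python
import numpy as np


def uncorrelated_background(ks, counts):
    """Expected spectral density of an i.i.d. sequence with given counts.

    Assumes independent Bernoulli indicators with p = n_alpha / M and a
    1/M^2 normalization (both assumptions of this sketch).
    Intended for 0 <= k < M.
    """
    ks = np.asarray(ks, dtype=float)
    counts = np.asarray(counts, dtype=float)
    m = counts.sum()
    sum_sq = np.sum(counts**2)
    # |sum_j exp(2*pi*i*k*j/M)|^2 = sin^2(pi k) / sin^2(pi k / M)
    numerator = np.sin(np.pi * ks) ** 2
    denominator = np.sin(np.pi * ks / m) ** 2
    with np.errstate(divide="ignore", invalid="ignore"):
        # at k = 0 the ratio has the limit M^2
        kernel = np.where(denominator > 0, numerator / denominator, m**2)
    flat = m - sum_sq / m  # sum over alpha of M * p * (1 - p)
    return (flat + (sum_sq / m**2) * kernel) / m**2
```

the oscillating term vanishes at integer k and peaks near half-integer k, where sin²(πk) = 1 and the denominator is approximately (πk/m)², which is the origin of the inverse-square envelope mentioned in the text.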
(a) power-law decay of the enveloping curve. motivated by previous works on this subject, we consider fitting a power-law decay via least-square regression to the enveloping curve ( for k ∈ [ , ] ) of the form for some power exponent δ. the power-law description is convenient and has proven to be a widely studied model for the spectral density of genomes in general (see for example [ ] ). it is crucial to bear in mind that it is a coarse-grained description which doesn't extend to the origin, and is valid only for the low-frequency domain. we would later find that this is the parameter that remarkably scales linearly with a measure of the genome complexity. for all our datasets, = δ − ∼ − . it is not a priori clear how large has to be in order for the deviation to be significant, and more sequences corresponding to each type of coronavirus should be studied in order to determine the range of and its statistical distribution. although we leave this for future work, we found evidence that the variation in the correlates with a measure of the complexity of the genome (which at the limit of infinite genome size approaches the shannon entropy) in a way that is distinctly different from a completely random sequence. it is useful to compute the expected δ for the hypothetical uncorrelated background ( ) which is parametrized by the genome size m and the sum of squares of the nucleotide numbers Σ_α n_α² . for the general spectral density s(k), from least-square regression of the log-log relation, we obtain ≡ − δ to be where ⟨. . .⟩ denotes averaging over the nine local maxima points in the domain k ∈ [ , ] . for the uncorrelated case of ( ), we find that the factor Σ_α n_α² cancels away in ( ) and numerically, δ ≈ . for all the datasets at the level of the genome and that of the protein coding region. this defines a background value for the detection of a deviation away from the completely random sequence. 
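the log-log least-square fit described above can be written directly in terms of averages over the chosen maxima. the sketch below uses the standard closed-form slope of a simple linear regression; the function name and the choice of nine half-integer k values in the usage note are illustrative assumptions.

```python
import numpy as np


def fit_power_law_exponent(k_maxima, s_maxima):
    """Fit log S = c - delta * log k by least squares and return delta.

    Uses slope = (<xy> - <x><y>) / (<x^2> - <x>^2) with x = log k,
    y = log S, where <...> is the average over the supplied points.
    """
    x = np.log(np.asarray(k_maxima, dtype=float))
    y = np.log(np.asarray(s_maxima, dtype=float))
    slope = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (
        np.mean(x**2) - np.mean(x) ** 2
    )
    return -slope
```

for example, with `k = np.arange(1.5, 10.0, 1.0)` (nine half-integer points, assumed here for illustration) and envelope values proportional to `k ** -2.0`, the function returns an exponent of 2 up to rounding error.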
in contrast to an empirical power-law fitting of only the enveloping curve, one could adopt a bottom-up approach by postulating certain forms of the correlation function, and then performing the discrete fourier transform. consider the case where the correlation function is a linear function of the nucleotide separation, we can write for some constant κ, and r = α this function is invariant under the reflection k ↔ m − k, which is an exact discrete symmetry for the spectral density s(k) (or the individual s αα (k)) more generally. the parameter κ admits the physical interpretation of the presence of long-range correlation/anti-correlation depending on whether it's positive/negative, and we would find that apart from one exception, all our datasets can be matched to a positive κ of the order − . we find that if the curvefitting is performed taking into account only the first ten local maxima as in the case for δ-parameter, the local minima points at integral k-values are not well captured by the fitted curve, so we also include them in the curve-fitting. beyond the specific linear form of the correlation function postulated in ( ), it is also representative of a large class of correlation functions of the form where τ ≡ l − j,κ is a small constant andτ =κ |τ | m . this first-order truncation is identical to ( ) with f ( )κ = κ. thus, ( ) could approximate correlation functions of the general form f κ |τ | m whereκ is a small dimensionless parameter, and for example, if the correlation function turns out to be an exponentially decaying function of the form e − b|τ | m with b , then to a good approximation we can identify κ ∼ b. the power-law decay in (a) parametrizes the decay of the envelope whereas the model in ( ) could account for non-vanishing local minima in the low-frequency domain. beyond this region, we seek an interpolating curve that extends throughout the spectrum including the origin. 
for this purpose, we consider fitting a lorentzian function of the following form to the spectrum where n = α n α m − m and m is the mean value near the spectrum's midpoint, about which stochastic fluctuations are observed. this is a simple coarse-grained model which averages over the oscillations in the low-frequency domain and describes the overall decay of the spectrum via a smooth curve. like the κ parameter in the model ( ), the curve-fitting is performed with the set of extremal points in the low-frequency domain, with the initial and final conditions taken into account by first fixing n, m with their observed values for each genome sequence. as a useful reference, we also fit the lorentzian function to the uncorrelated background ( ) and finding b ≈ . with m ∼ − at the genome level, and m ∼ − at the protein coding region level. scaling laws manifest in the fourier spectral density have often motivated the study of features of the genome that reflect various properties of it being a complex system, such as the fractal dimension (of a suitably defined matrix representation of the correlation function), etc. a measure of the complexity of the genome considered in the past literature (see for example [ , , ] ) is defined as follows . where n α is the number of the α-nucleotide. the logarithmic argument counts the number of distinguishable permutations given a fixed number of each nucleotide. at large m , this admits a natural interpretation of the shannon entropy of the genome sequence. to see this, we can invoke stirling's formula to express the large-m limit of Ω as which is a function of only the fractional distribution of nucleotides. in this form ( ), the measure of complexity Ω is clearly the shannon entropy which measures the information entropy associated with a genome sequence where the probability of nucleotide-α occurring in any position is nα m . 
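the relation between the permutation-counting measure and the shannon entropy can be checked numerically. the sketch below assumes the complexity is normalized per nucleotide, i.e. (1/m) ln( m! / Π n α ! ), and uses natural logarithms; both are assumptions of this illustration, chosen so that the large-m limit depends only on the fractions n α /m as stated in the text. the log-gamma function avoids overflow in the factorials.

```python
import math


def complexity(counts):
    """(1/M) * ln( M! / prod(n_alpha!) ), computed with log-gamma.

    The per-site normalization and natural logarithm are assumptions.
    """
    m = sum(counts)
    log_multinomial = math.lgamma(m + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_multinomial / m


def shannon_entropy(counts):
    """-sum p ln p with p = n_alpha / M (natural logarithm)."""
    m = sum(counts)
    return -sum((n / m) * math.log(n / m) for n in counts if n > 0)
```

for hypothetical counts such as `[9000, 6000, 6000, 9000]`, the two values differ by less than 0.001, with the finite-size complexity lying slightly below the entropy; the gap shrinks as the sequence length grows.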
we would find later that interestingly, the model parameter δ (but not κ) scales linearly with Ω across the dataset of types of coronavirus genomes. also, when restricted to the spike protein's level, the measure of complexity appears to scale linearly with the overall measure at the genome level. but the model parameter δ that is computed at the level of the spike protein does not correlate with Ω at either the genome/protein level, and neither does κ. our genome dataset consisting of types of coronaviruses spread across four genera mainly follows from reference [ ] plus a few other additions : sars-cov- , mers-cov and bat-ratg . bat-ratg is a bat coronavirus that was most recently found to have % genome identity with sars-cov- and featured in papers discussing a possible bat origin of the latter [ ] . we included it here to see how the model parameters for this genome compare to that of sars-cov- relative to the other coronaviruses. in the following, we outline the essential results, using the example of the sars-cov- reference genome for various graphical illustrations. we find that the fourier spectral density is characterized by the following features: (a) in a small low-frequency regime (k ), the uncorrelated background ( ) is a good approximation (see fig. from visual inspection of the relevant graphical plots, we find no obvious correlation among these model parameters, nor between them and the genome/spike protein sizes. but we find that δ w and Ω w appear to be related. linear regression yields the following best-fit line (see fig. ) with the line parameters being (with the % confidence intervals in brackets) α ≈ . ( . , . ), β ≈ − . (− . , − . ), since we checked that for all the coronaviruses, the assumption of a completely uncorrelated background yields δ ≈ . , this leads to a convenient definition of a reference complexity value Ω u ≈ . , which lies at the intersection between the uncorrelated vertical line and the observed one with finite slope. 
the difference between the observed complexity measure and Ω u in turn serves as a measure of the deviation from complete randomness of the sequence. there is also a similar relation between δ w and Ω s , consistent with the following linear relation that we found: Ω w = cΩ s , c ≈ . ( . , . ). it would be interesting to study this for other coding and non-coding regions as it is suggestive of some level of self-similarity for this complexity measure.
[figure caption fragment: ( ) obtained by fitting to all maxima and minima, with the best-fit value κ = . .]
(b) after k ∼ , the genome displays much more scatter about the uncorrelated background, and the models of deviation are no longer effective descriptions (see fig. ). stochastic fluctuations about a fixed mean appear to set in and there are no isolated peaks apart from two prominent ones at k ∼ m , m which have been seen and interpreted in past literature [ , ] to correspond to the universal triplet codon usage. we applied an (overlapping) moving average (of window size ∼ nucleotides) to smooth out the data, and checked that there is no apparent regime where some non-trivial scaling law holds (see fig. ). at the level of both the genome and protein coding region, the fixed mean parameter m appears to correlate with the genome size. it appears to generally decrease with the size of the sequence, at both levels of the genome and the spike protein (see figures a and b in appendix b). at the genome level, it is of the order ∼ − which is about larger than the value expected for the uncorrelated background, whereas at the spike protein level, m ∼ − which is times larger than the uncorrelated background. the lorentzian function that is fitted to the data with initial and final conditions fixed by r and m is parametrized by the half-width parameter b. we find that this parameter generally increases with κ at both genome and spike protein levels (see figures a and b in appendix b).
[figure caption fragment: showing how after about k = , the data points appear to be noisy and such stochastic fluctuations appear to persist throughout apart from a couple of isolated peaks. neither the envelope curve of /k δ nor equation ( ) continues to be an effective description.]
(c) finally, although for simplicity, we have kept to analyzing the spectral density corresponding to the sum of all the nucleotides, the general qualitative features described in (a) and (b) above apply to the spectral density for each individual nucleotide as well as the cross-spectra.
we have presented a study of the fourier spectral density of the coronavirus genome at the level of the entire genome as well as the coding region for the spike protein. the power spectrum profile can be well-described by considering aspects of deviation from the hypothetical case of a random, uncorrelated sequence (eqn. ( ) ). we summarize the essential general features below: (i) there is a low-frequency domain (k ) which exhibits a clear oscillatory form that is close to ( ) . in this domain, we find that the enveloping curve connecting the local maxima is well-described by a power decay law of the form /k δ . we noted that the power exponent δ shows a correlation with a measure of complexity of the sequence (eqn. ( ) ) which in the limit of large genome size is the sequence's shannon entropy. the deviation from the uncorrelated background can be described by a linear relation between δ and Ω. this behavior does not however persist at the level of the spike protein's coding region. (ii) beyond the low-frequency domain, the spectrum displays stochastic fluctuations about certain fixed values m, and we find no other resonances apart from the peaks at m , m which are associated with the universal triplet codon usage. relative to the uncorrelated case, m is about higher at the genome level and about higher at the spike protein level. it also generally decreases with the size of the genome or the protein coding region. 
(iii) upon fitting the lorentzian function to the spectrum with initial and final conditions determined by r and m respectively, we find that its half-width parameter is correlated with κ, the dimensionless constant that defines the linearized correlation function in the low-frequency domain, and generally increases with it. this is observed at both the genome and spike protein levels. let us conclude by briefly pointing out several future directions and applications. now, it has been noted in the literature for some time that dna viruses and unicellular organisms tend to have mutation rates which vary inversely with the genome size ('drake's rule' [ , , ] ). this correlation has been studied for rna viruses recently (see for example [ ] ), although we are unaware of any evidence for the case of coronaviruses, which are the only rna virus family with an exonuclease proofreading mechanism that enhances replication fidelity. the parameter m that we have introduced here appears to vary inversely with genome size, and thus it may be worthwhile to explore its role in models that attempt to explain viral mutation rates. in [ ] , a negative association between molecular evolution rate and genome size was established for rna viruses. it would be interesting to compute the parameter m for the viral sequences studied in [ , ] . another potential application of our work which has immediate relevance is to study the distribution of m for sars-cov- genomes specifically to explore if they could describe current evolution of the virus (see for example [ ] ). the lorentzian function that we fit broadly to the spectrum as a whole is a coarse-grained description that does not model the transition from the low-frequency spectrum to the other part of the spectrum that appears to be dominated by stochastic fluctuations. 
it would be interesting to develop theoretical models that could possibly account for such a transition and, in the process, construct a clearer understanding of the parameter m or of why the information-theoretic measure ( ) is relevant for the low-frequency domain. a complementary approach towards understanding correlation effects is to study directly the correlation function itself (see for example [ ] ), although this is more computationally intensive. it would be interesting to study what forms of correlation functions could lead to the enveloping curve being of the form /k δ . a few related models were proposed in [ , ] , and it may be worthwhile to revisit them in light of the newfound relation with the measure of complexity. finally, it would be interesting to perform a more extensive study of the models here with a larger set of viral genomes so that we have a fuller understanding of their statistical distribution and whether they can be useful for clustering and classification purposes. motivated by the covid- pandemic, notwithstanding our limited dataset, in table below, we show the viral genome that is the closest neighbor to sars-cov- for each of the four model parameters at both levels of the genome and spike protein coding region. from table , we see that bat-rtg features most frequently and that apart from tgev and hku which infect pigs and humans respectively, the others are bat coronaviruses. collectively, they appear to be broadly compatible with the plausibility of the bat origin of sars-cov- , while to our knowledge, the association of sars-cov- with tgev and hku has never been made in the literature. 
spike protein δ tgev bat-rtg κ bat-rtg hku m bat-rtg bat-cov- , hku b hku hku in this section, we collect several graphs useful for visualizing two particular trends observed: (i) the parameter m tends to vary inversely with size of genome/spike protein coding region, (ii) the linearized correlation function parameter κ and the half-width parameter b appears to be correlated. based on lectures delivered under the auspices of the dublin institute for advanced studies at trinity college evolution of long-range fractal correlations and /f noise in dna base sequences - bp periodicities in complete genomes reflect protein structure and dna folding universal /f noise, crossovers of scaling exponents, and chromosome-specific patterns of guanine-cytosine content in dna sequences of the human genome multi-scale coding of genomic information: from dna sequence to genome structure and function wavelet analysis of dna sequences characterizing long-range correlations in dna sequences from wavelet analysis bacteria classification on power spectrums of complete dna sequences by self-organizing map long-range correlation properties of coding and noncoding dna sequences: genbank analysis periodicity in prokaryotic and eukaryotic genomes identified by power spectrum analysis coronavirus genomics and bioinformatics analysis statistics of dna sequences: a low-frequency analysis understanding long-range correlations in dna sequences national center for biotechnology information a maximum entropy principle for the distribution of local complexity in naturally occurring nucleotide sequences fractals and hidden symmetries in dna visualization and analysis of dna sequences using dna walks a pneumonia outbreak associated with a new coronavirus of probable bat origin periodicity of base correlation in nucleotide sequence a constant rate of spontaneous mutation in dna-based microbes rates of spontaneous mutation evolution of the mutation rate correlation between mutation rate and genome size 
in riboviruses: mutation rate of bacteriophage q moderate mutation rate in the sars coronavirus genome and its implications from molecular genetics to phylodynamics: evolutionary relevance of mutation rates across viruses complexities of viral mutation rates on the origin and continuing evolution of sars-cov- study of statistical correlations in dna sequences spatial /f spectra in open dynamical systems expansion-modification systems: a model for spatial /f spectra a quantitative genomic view of the coronaviruses: sars-cov acknowledgments i thank neal snyderman and rajesh parwani for stimulating discussions. this appendix collects the genbank accession id and names of the coronaviruses used in this work, which largely follows [ ] , our additions being sars-cov- , mers-cov and bat-ratg . these genomes can be freely downloaded from https://www.ncbi.nlm.nih.gov. for each genome, we exclude the poly(a) tail for our analysis. key: cord- -r bbomvk authors: woo, patrick cy; lau, susanna kp; tsang, chi-ching; lau, candy cy; wong, po-chun; chow, franklin wn; fong, jordan yh; yuen, kwok-yung title: coronavirus hku in respiratory tract of pigs and first discovery of coronavirus quasispecies in ′-untranslated region date: - - journal: emerg microbes infect doi: . /emi. . sha: doc_id: cord_uid: r bbomvk coronavirus hku is a deltacoronavirus that was discovered in fecal samples of pigs in hong kong in . over the past three years, coronavirus hku has been widely detected in pigs in east/southeast asia and north america and has been associated with fatal outbreaks. in all such epidemiological studies, the virus was generally only detected in fecal/intestinal samples. in this molecular epidemiology study, we detected coronavirus hku in . % of the nasopharyngeal samples obtained from pigs in hong kong. samples that tested positive were mostly collected during winter. 
complete genome sequencing of the coronavirus hku in two nasopharyngeal samples revealed quasispecies in one of the samples. two of the polymorphic sites involved indels, but the other two involved transition substitutions. phylogenetic analysis showed that the two nasopharyngeal strains in the present study were most closely related to the strains pdcov/chjxni / from jiangxi, china, and ch/sichuan/s / from sichuan, china. the outbreak strains in the united states possessed highly similar genome sequences and were clustered monophyletically, whereas the asian strains were more diverse and paraphyletic. the detection of coronavirus hku in respiratory tracts of pigs implies that in addition to enteric infections, coronavirus hku may be able to cause respiratory infections in pigs and that in addition to fecal-oral transmission, the virus could possibly spread through the respiratory route. the presence of the virus in respiratory samples provides an alternative clinical sample to confirm the diagnosis of coronavirus hku infection. quasispecies were unprecedentedly observed in the ′-untranslated region of coronavirus genomes. coronaviruses (covs) are found in a wide variety of animals, in which they can lead to enteric, hepatic, neurological and respiratory illnesses of differing severity. on the basis of genotypic and serological characterization, covs were traditionally divided into three distinct groups. in , the coronavirus study group of the international committee for taxonomy of viruses replaced the traditional cov groups , and with three genera, alphacoronavirus, betacoronavirus and gammacoronavirus, respectively. in the same year, we discovered three novel covs in avian cloacal swabs. these covs formed a distinct novel cov genus, named deltacoronavirus. subsequently, in a large epidemiological study, we discovered seven additional deltacoronaviruses. 
interestingly, one of these deltacoronaviruses, which was originally named porcine cov hku , was found in fecal samples of pigs in hong kong, and it is the only mammalian deltacoronavirus. in , the coronavirus study group of the international committee for taxonomy of viruses rectified the species name for this virus to coronavirus hku . over the past three years, coronavirus hku has been widely detected in pigs in east/southeast asia and north america and was found to be associated with fatal outbreaks. [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] in these epidemiological studies, coronavirus hku was generally only detected in fecal/intestinal samples. however, some covs, such as transmissible gastroenteritis cov (tgev)/porcine respiratory cov (prcv) of alphacoronavirus and bovine cov of betacoronavirus , could be detected consistently in both fecal and respiratory samples. , therefore, we hypothesized that coronavirus hku may also be present in respiratory samples of pigs, which has implications for its transmission and potential role in respiratory diseases and for the use of nasopharyngeal sampling as an alternative method of identifying infected pigs. to test this hypothesis, we performed a molecular epidemiology study on nasopharyngeal samples collected from pigs in hong kong. two complete genomes of the 'respiratory' coronavirus hku were sequenced, and comparative genomic and phylogenetic studies were performed. the implications of the presence of coronavirus hku in respiratory samples are also discussed. nasopharyngeal samples from pigs were collected in hong kong over a -month period (january -february ). these samples were obtained from slaughterhouses and pig farms in hong kong with the assistance of the veterinary public health section, food and environmental hygiene department; and the agriculture, fisheries and conservation department of the government of hong kong. 
immediately after sample collection, each nasopharyngeal swab was submerged in viral transport medium for viral transport and maintenance. rna extraction and reverse transcription viral rnas were extracted from the nasopharyngeal samples of pigs by using μl of inoculated viral transport medium for each sample and by utilizing the ez advanced xl system (qiagen, hilden, germany) and ez virus mini kit v . (qiagen) according to the manufacturer's protocol with rnase-free water as the eluent. reverse transcription (rt) was performed using the superscript iii reverse transcriptase (invitrogen, carlsbad, ca, usa) according to the manufacturer's protocol by random priming. coronavirus hku screening detection of coronavirus hku was performed by polymerase chain reaction (pcr) amplifying a -bp fragment of the rnadependent rna polymerase (rdrp) gene, using the specific primer pair lpw ( ′-aca cac ttg ctg taa cca aa- ′) and lpw ( ′-atc att aga gtc acc acg at- ′). pcr and dna sequencing were carried out following our previous publications with slight modification. , briefly, each pcr mixture contained pcr buffer ( mm of kcl, mm of tris-hcl at ph . and mm of mgcl ; applied biosystems, foster city, ca, usa), μm of each deoxynucleoside triphosphate (roche diagnostics, basel, switzerland), μm of each primer (invitrogen), . u of amplitag gold dna polymerase (applied biosystems) and cdna. the mixtures were subjected to thermocycles of °c for min, °c for min and °c for min, with an initial denaturation at °c for min and a final extension at °c for min for dna amplification using the geneamp pcr system automated thermal cycler (applied biosystems). standard precautions were taken to avoid contamination, and no falsepositive result was observed for the negative controls. pcr products were agarose gel-purified using the qiaquick gel extraction kit (qiagen). 
both strands of the pcr products were sequenced twice by the abi prism xl genetic analyzer (applied biosystems) using the two pcr primers. the genomes of two coronavirus hku strains detected in the nasopharyngeal samples of two different pigs were sequenced following our previous publications , with modifications. briefly, viral rnas were converted to cdnas using superscript iii reverse transcriptase with a combined random priming and oligo(dt) priming strategy. the cdnas were pcr-amplified by primers (supplementary tables s and s ) that were designed by multiple alignment of the genome sequences of other coronavirus hku strains with complete genomes available or from the results of the first and subsequent rounds of pcr-sequencing. pcrs were performed using the iproof high-fidelity pcr kit (bio-rad laboratories, hercules, ca, usa) according to the manufacturer's protocol, and dna sequencing was performed as mentioned above. when ambiguous peaks were observed consistently in the electropherograms after several attempts of pcr-sequencing for certain genomic regions, cloning followed by plasmid sequencing was performed according to our previous publication, except the zero blunt topo pcr cloning kit (invitrogen) was used to resolve the sequence ambiguities. pcrs using recombinant plasmids as templates were also performed to confirm that indels at mononucleotide polymeric regions were not the result of polymerase slippage. the ′ ends of the viral genomes were amplified and sequenced by the rapid amplification of cdna ends (race) using the smarter race ′/ ′ kit (clontech laboratories, mountain view, ca, usa) according to the manufacturer's protocol with the reverse primers lpw ( ′-tgg gta atg tgt ccg ctg acg ggc ggt g- ′), lpw ( ′-aga agt ggt gga tgg tca gag gaa cgg t- ′) and lpw ( ′-gtg gct ggt ttc cag gta atc ta- ′). the sequences of the pcr products were then assembled manually to obtain the genomes of the two nasopharyngeal strains. 
pairwise alignment was performed using bioedit . . (optimal global alignment) or emboss stretcher (nucleotide alignment); whereas multiple sequence alignment was performed using muscle . , where the aligned sequences were further manually inspected and edited. tests for substitution models and phylogenetic analysis by the maximum likelihood method were performed using mega . . . divergence times for the coronavirus hku strains were calculated based on the complete genome sequence data, utilizing the bayesian markov chain monte carlo method using beast . . with the substitution model gtr (general time-reversible model)+g (gammadistributed rate variation)+i (estimated proportion of invariable sites), a strict molecular clock, and a constant coalescent. fifty million generations were run with trees sampled every th generation to yield trees. convergence was assessed based on the effective sampling size after a % burn-in using tracer . . . the mean time to the most recent common ancestor (tmrca) and the highest posterior density (hpd) regions at % were calculated. the trees, after a % burn-in, were summarized as a single tree using treeannotator . . by choosing the tree with the maximum sum of posterior probabilities (maximum clade credibility) and viewed using figtree . . . the complete genome sequences of the two nasopharyngeal coronavirus hku strains were deposited into the international nucleotide sequence databases with accession numbers lc and lc . a total of nasopharyngeal samples from pigs were tested. rt-pcr for a -bp fragment of the rdrp gene of coronavirus hku was positive in ( . %) of the nasopharyngeal samples. the samples that tested positive were mostly collected during winter (december-march) ( figure ). dna sequencing showed that seven sequence variants were detected among the positive samples, and pairwise alignment showed that these seven sequence variants possessed . 
%- % sequence identity to the corresponding region in the rdrp gene of coronavirus hku strain hku - that we previously found in fecal samples of pigs in hong kong (supplementary figure s ). complete genome sequencing and genome analysis complete genome sequencing was performed for the coronavirus hku found in two of the positive nasopharyngeal samples (s n and s n) . excluding the ′ poly(a) tail, the genomes of s n and s n were - and nucleotides long, respectively. the genome organization of the two strains was the same as that of other coronavirus hku strains. the lengths of the seven open reading frames (orfs) of the two strains s n/ s n were / , , , , , and bp, respectively. the genomes of the two strains possessed . %- . % sequence identity to that of the representative isolate hku - . quasispecies were detected in one of the samples (s n) at the ′ genomic region via two independent nested pcrs targeting the nd- th bases of the genome using two different primer pairs for the first round and the same primer pair for the second round of reaction. for s n, direct sequencing of the pcr products yielded ambiguous peaks in the sequencing electropherograms, which could only be resolved after cloning (figure ). post-cloning dna sequencing revealed that there were six sequence variants, with four polymorphic sites, for this genomic region ( figure ) . two of the polymorphic sites, located at the th and th nucleotide positions, involved indels (Δt and Δa/c, respectively), whereas the other two polymorphic sites, located at the nd and th nucleotide positions, involved transition substitutions (t → c and g → a, respectively). additionally, pcr-dna sequencing using the recombinant plasmids as amplification templates did not generate the same sequence ambiguities observed in the pre-cloning experiment. 
phylogenetic analysis of the complete genomes of the two nasopharyngeal strains and other coronavirus hku strains showed that the outbreak strains in the united states possessed highly similar genome sequences and that they were all clustered together monophyletically, whereas the asian strains were more diverse and paraphyletic, with the lao and thai strains occupying the basal lineage; however, the south korean strain knu - was more similar to the us strains than to the other asian strains (figure and supplementary figure s the estimated mean evolutionary rate of the complete genome sequence data set was . × − ( % hpd: . - . × − ) substitutions per site per year, which is approximately . -fold higher than that estimated in a previous study. the root of the tree was september ( % hpd: june -march ). the tmrca of the diversity of coronavirus hku was dated to june ( % hpd: november -june ); and the tmrca of the thai/laos strains was traced back to september ( % hpd: may -january ). the tmrcas for the clade containing us/korean strains was estimated to be october ( % hpd: june -january ), which is slightly delayed compared with that estimated in a previous study. for the two nasopharyngeal strains characterized in this study (s n and s n) , they were estimated to have diverged from their respective mrcas in december ( % hpd: . six intra-strain quasispecies were found. post-cloning plasmid-dependent pcr-sequencing confirmed that the presence of indels at positions and was not due to polymerase slippage. quasispecies and were detected in both nested pcr using first round primers lpw /lpw and second round primers lpw /lpw as well as nested pcr using first round primers lpw /lpw and second round primers lpw /lpw . however, quasispecies and were only detected in nested pcr using first round primers lpw / lpw and second round primers lpw /lpw , whereas quasispecies and were only detected in nested pcr using first round primers lpw /lpw and second round primers lpw /lpw . 
coronavirus hku was detected in nasopharyngeal samples of pigs. although coronavirus hku has been widely detected in various locations around the pacific ocean, including canada, china, , , , hong kong, laos, , mexico, south korea, , thailand, , vietnam and the united states, [ ] [ ] [ ] [ ] [ ] [ ] , , , the virus has principally been found in fecal or intestinal specimens. there have been a few exceptional circumstances; in one study, the presence of coronavirus hku was reported in the blood, liver, lung and kidney of one pig, and in a few other studies, coronavirus hku was found to exist in the blood (n = ), mesenteric lymph node (n = ) and saliva (n = )/oral fluid (n = ) of pigs, , , implying that coronavirus hku can cause systemic infections in occasional cases. in this study, coronavirus hku was found in . % of the nasopharyngeal samples of pigs, which is similar to the . % positive rate of coronavirus hku in fecal samples of pigs that we reported previously. seasonal variation in the detection rate of coronavirus hku from pigs was noted, where most of the positive samples were collected in winter. this is similar to the pattern of seasonal variation in a surveillance study carried out in the united states, where the detection rate for coronavirus hku was much lower during summer. it has recently been confirmed that coronavirus hku is able to cause swine enteric infections by infecting gnotobiotic and conventional pigs with coronavirus hku . the detection of coronavirus hku in respiratory tracts of pigs has the following implications. first, in addition to enteric infections, coronavirus hku may be able to cause respiratory infections in pigs. second, in addition to fecal-oral transmission, the virus may be able to spread through the respiratory route. third, the presence of the virus in respiratory samples provides an alternative clinical sample to confirm the diagnosis of coronavirus hku infection. 
further studies will determine the full spectrum of clinical diseases and pathologies associated with coronavirus hku . from the data of the present study, both the 'enteric' and 'respiratory' coronavirus hku may possess similar properties. a number of animal covs possess dual or multiple tissue tropisms. for example, tgev, which is another enteropathogenic cov that infects pigs, could also be found in the nasopharynx of pigs as prcv, which is a deletion mutant of tgev. moreover, bovine cov is both an enteric and a respiratory pathogen in cattle. similar to tgev/ prcv, coronavirus hku is recovered from both respiratory and gastrointestinal samples. however, unlike tgev/prcv, in which there is a - nucleotide deletion at the ′ end of the spike (s) gene leading to a loss of - antigenic sites in prcv, comparative genome analysis of coronavirus hku from respiratory and fecal samples did not show any obvious difference in their s proteins or other parts of their genomes. phylogenetic analysis also did not reveal a separate clustering of fecal/intestinal and nasopharyngeal isolates ( figure ). further cell culture experiments are required to confirm whether all strains of this species possess intrinsic tropism to both enteric and respiratory tissues. this is also the first report of cov quasispecies in the ′untranslated region (utr). in one (s n) of the two coronavirus hku genomes that we sequenced in this study, variant sites were observed at four positions; two of them were due to nucleotide substitutions, and the other two were results of indels at mononucleotide polymeric regions ( th and th bases). these two indels were genuine variant sites instead of being due to polymerase slippage during the amplification process because recombinant plasmiddependent pcr-sequencing no longer resulted in sequence ambiguities in the electropherograms. although the existence of quasispecies has been reported in covs, the variant sites were found in coding regions or ′-utr. 
[ ] [ ] [ ] [ ] in the case of severe acute respiratory syndromerelated coronavirus, all of the variant sites observed in the quasispecies were located at the s gene. for bovine cov, one of the two strains with naturally occurring intra-isolate quasispecies had all seven variant sites located at orf a, whereas for the other strain with naturally occurring intra-isolate quasispecies, there were polymorphic sites scattered across orf a (n = ), orf b (n = ), kda-non-structural protein (nsp) gene (n = ), hemagglutinin esterase (he) gene (n = ), s gene (n = ), . kda-nsp gene (n = ), . kda-nsp gene (n = ), membrane (m) gene (n = ), nucleocapsid (n) gene (n = ) and ′-utr (n = ). similar to bovine cov, middle east respiratory syndrome-related coronavirus also possessed all the intra-host single nucleotide variations throughout its genome except the ′-utr. , in this study, all four variant sites ( Δt, t → c, g → a and Δa/c) were present in the ′-utr and were not located in the leader sequence or the transcription regulatory sequence. we speculate that the existence of quasispecies in covs may play a role in cov evolution, in addition to the more well-known high-recombination and mutation rates in cov genomes. 
virus taxonomy: ninth report of the international committee on taxonomy of viruses, international union of microbiological societies, virology division comparative analysis of complete genome sequences of three avian coronaviruses reveals a novel group c coronavirus discovery of seven novel mammalian and avian coronaviruses in the genus deltacoronavirus supports bat coronaviruses as the gene source of alphacoronavirus and betacoronavirus and avian coronaviruses as the gene source of gammacoronavirus and deltacoronavirus create new species in the family coronaviridae complete genome characterization of korean porcine deltacoronavirus strain kor/knu - / full-length genome sequence of porcine deltacoronavirus strain usa/ia/ / rapid detection, complete genome sequencing, and phylogenetic analysis of porcine deltacoronavirus complete genome sequence of strain sdcv/ usa/illinois / , a porcine deltacoronavirus from the united states detection and genetic characterization of deltacoronavirus in pigs porcine coronavirus hku detected in us states complete genome sequence of porcine coronavirus hku strain in from the united states full-length genome characterization of chinese porcine deltacoronavirus strain ch/sxd / porcine deltacoronavirus in mainland china isolation and characterization of porcine deltacoronavirus from pigs with diarrhea in the united states origin, evolution, and virulence of porcine deltacoronaviruses in the united states newly emerged porcine deltacoronavirus associated with diarrhoea in swine in china: identification, prevalence and full-length genome sequence analysis complete genome sequence of porcine deltacoronavirus strain ch/sichuan/s / from mainland china porcine deltacoronavirus: histological lesions and genetic characterization characterization and evolution of porcine deltacoronavirus in the united states detection and phylogenetic analysis of porcine deltacoronavirus in korean swine farms the first detection and full-length genome sequence 
of porcine deltacoronavirus isolated in lao pdr different lineage of porcine deltacoronavirus in thailand, vietnam and lao pdr in studies on the relationship between coronaviruses from the intestinal and respiratory tracts of calves porcine respiratory coronavirus: molecular features and virus-host interactions discovery of a novel coronavirus, china rattus coronavirus hku , from norway rats supports the murine origin of betacoronavirus and has implications for the ancestor of betacoronavirus lineage a discovery of a novel bottlenose dolphin coronavirus reveals a distinct species of marine mammal coronavirus in gammacoronavirus intra-genomic internal transcribed spacer region sequence heterogeneity and molecular diagnosis in clinical microbiology bioedit: a user-friendly biological sequence alignment editor and analysis program for windows / /nt the embl-ebi bioinformatics web and programmatic tools framework muscle: multiple sequence alignment with high accuracy and high throughput mega : molecular evolutionary genetics analysis version . bayesian phylogenetics with beauti and the beast . respiratory and fecal shedding of porcine respiratory coronavirus (prcv) in sentinel weaned pigs and sequence of the partial s-gene of the prcv isolates sars-associated coronavirus quasispecies in individual patients quasispecies of bovine enteric and respiratory coronaviruses based on complete genome sequences and genetic changes after tissue culture adaptation middle east respiratory syndrome coronavirus quasispecies that include homologues of human isolates revealed through whole-genome analysis and virus cultured from dromedary camels in saudi arabia middle east respiratory syndrome coronavirus intrahost populations are characterized by numerous high frequency variants comparative analysis of coronavirus hku genomes reveals a novel genotype and evidence of natural recombination in coronavirus hku kong. 
the funding sources had no role in study design, data collection, analysis, interpretation, or writing of the report. the authors alone are responsible for the content and the writing of the manuscript. the authors thank the staff from the veterinary public health section, food and environmental hygiene department as well as the agriculture, fisheries and conservation department of the hong kong government for their help in collecting the porcine nasopharyngeal samples. key: cord- -qgyzk th authors: edgar, robert c.; taylor, jeff; altman, tomer; barbera, pierre; meleshko, dmitry; lin, victor; lohr, dan; novakovsky, gherman; al-shayeb, basem; banfield, jillian f.; korobeynikov, anton; chikhi, rayan; babaian, artem title: petabase-scale sequence alignment catalyses viral discovery date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: qgyzk th public sequence data represents a major opportunity for viral discovery, but its exploration has been inhibited by a lack of efficient methods for searching this corpus, which is currently at the petabase scale and growing exponentially. to address the ongoing pandemic caused by severe acute respiratory syndrome coronavirus and expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (cov) and other viral families to . petabases of public sequencing data from . million biologically diverse samples. to implement this strategy, we developed a cloud computing architecture, serratus, tailored for ultra-high throughput sequence alignment at the petabase scale. from this search, we identified and assembled thousands of cov and cov-like genomes and genome fragments ranging from known strains to putatively novel genera. we generalise this strategy to other viral families, identifying several novel deltaviruses and huge bacteriophages. to catalyse a new era of viral discovery we made millions of viral alignments and family identifications freely available to the research community. 
expanding the known diversity and zoonotic reservoirs of cov and other emerging pathogens can accelerate vaccine and therapeutic developments for the current pandemic, and help us anticipate and mitigate future ones. viral zoonotic disease has had a major impact on human health over the past century despite dramatic advances in medical science, notably by the spanish flu, aids, sars, ebola and covid- pandemics. there are an estimated , mammalian viruses [ ] from which emerging infectious diseases in humans may arise [ ] . uncovering this viral biodiversity is a prerequisite for predicting and preventing future epidemics and is therefore the focus of consortia such as usaid predict [ ] and the global virome project [ ] as well as hundreds of government and academic research projects worldwide. these efforts can be aided through re-analysis of petabases of high-throughput sequencing data available in public databases such as the sequence read archive (sra) [ ] . this data spans millions of ecologically diverse biological samples, many of which capture viral transcripts that may be incidental to the goals of the original studies [ ] . to expand the known repertoire of viruses and catalyse global virus discovery, in particular for coronaviridae (cov) family, we developed the serratus cloud computing architecture for ultra-high throughput sequence alignment. from a screen of . million libraries comprising . petabases of sequencing reads, we report , assemblies, including sequences from previously uncharacterised or unavailable cov or cov-like operational taxonomic units (otus), defined by clustering amino sequences of the rna dependent rna polymerase (rdrp) gene at % identity. to demonstrate the broader utility of our approach, we also report six novel deltaviruses related to the human pathogen hepatitis δ virus (hdv), and expand the described members of the recently characterised family of huge bacteriophages (phages). 
viral discovery is a first step in preparing for the next pandemic. sequencing reads for thousands of uncharacterised viruses already exist and require careful curation. to accelerate this process, we established a freely available and explorable resource of all vertebrate viral alignment data generated by serratus at https://serratus.io. this work lays the foundation for years of future research by enabling the exploration of viruses which have been captured by more than a decade of high-throughput sequencing studies. serratus is a freely available, open-source cloud-computing platform designed to enable petabase-scale sequence alignment against a set of references. using serratus, we aligned in excess of one million short-read sequencing datasets per day for under us cent per dataset (extended figure ). this was achieved by leveraging commercially available computing infrastructure to employ up to , virtual cpus simultaneously (see methods). we aligned , , public rna-seq, meta-genome, meta-virome and meta-transcriptome datasets (termed a sequencing run [ ] ) against a collection of viral family pangenomes comprising all genbank cov records clustered at % identity plus all non-retroviral refseq records for vertebrate viruses (see methods and extended table ). to uncover more divergent viruses, we re-analysed , runs in a translated nucleotide search against a query comprising panproteomes for cov and other families. we performed de novo assembly on , runs potentially containing cov sequencing reads by combining , sra accessions identified by the serratus search with , identified by an ongoing cataloguing initiative of the sra called stat [ ] . , of the resulting assemblies contained putative cov contigs, of which , aligned to cov rdrp (extended table ). of these, we identified otus from a total of , i.e. not represented by coronaviridae in genbank (figure a and extended figure ). 
the protein domains of these otus are consistent with a cov or cov-like genome organisation (extended figure ). three of the novel cov otus fell within the alphacoronavirus (αcov) genus. the first (exemplar run: err ) was from two desmodus rotundus bat metagenomes yielding . and . kb cov contigs respectively in the nyctacovirus subgenus. these cov were noted by the data collectors [ ], but the sequences were not public and thus novel to our analysis. the second otu (srr ) was from a pipistrellus pipistrellus bat metagenome collected in in china. finally, from five libraries (err ) generated for a study on the metagenomic effects of the burying beetle nicrophorus vespilloides on a mouse carcass, we assembled a luchacovirus related to the rodent lucheng rn rat coronavirus ( % genome nucleotide identity to nc . ). from a rodent virome study which identified several novel cov [ ], a sample from an unknown species contained a βcov embecovirus (srr ), with the closest match being an unclassified βcov from vietnam ( . % to mh ). finally, the δcov otu (srr ) appears to be from a currently unpublished avian virome study in china. we designated the eight remaining otus as group e, noting that all were found in samples from non-mammal aquatic vertebrates falling outside of δcov in the tree (extended figure ).
figure : expanded characterisation of cov and related otus. (a) radial cladogram derived from a maximum likelihood tree of cov and related otus. inset is a phylogram of the same tree annotated with cov genera (greek letters) and group e cov-like nidoviruses. otus were generated by clustering the rdrp gene at % identity. diversity within each such % otu was characterised by counting the number of % identity otus it contained. an otu ( % or %) was considered to be known if it contained a genbank sequence, otherwise to be a novel otu discovered by serratus. hosts were considered novel if the source organism annotated by the sra belonged to a species not annotated as a host in any genbank record, noting that the annotated source may differ from the viral host (e.g., faecal contamination in a plant sample). hosts are classified as primates, fowl (galliformes), bats (chiroptera), aquatic (amphibia and osteichthyes), or other. (b) length distribution for assemblies of sra datasets classified as likely cov-positive, showing a peak around the typical cov genome length knt. (c) triangular matrix showing median rdrp sequence identities between selected nidovirales and group e viruses. (d) phylogram of group e cov-like nidoviruses.
a sister taxon to coronaviridae was recently proposed [ ] following the characterisation of a corona-like virus, microhyla alphaletovirus (mlev), in the frog microhyla fissipes, and soon after a related pacific salmon nidovirus (psnv) was described in the endangered oncorhynchus tshawytscha [ ]. two of our otus were in these host species and the described viruses proved to be near-perfect matches. we expand this recently characterised group with six additional members: five similar to psnv, in takifugu pardalis (fugu fish; tparnv), syngnathus typhle (broad-nosed pipefish; stypnv), hippocampus kuda (seahorse; hkudnv) [ ], puntigrus tetrazona (tiger barb; ptetnv) and ambystoma mexicanum (axolotl; amexnv), and a more distant member in caretta caretta (loggerhead sea turtle; ccarnv). notably, the ambystoma mexicanum (axolotl) nidovirus (amexnv) was assembled in runs, of which yielded kb contigs. relaxing the requirement for an rdrp match, / ( . %) of the runs from the associated studies were amexnv positive. gene structure of the amexnv and related contigs suggests that there is genomic segmentation within this clade (extended figure ), and a homologous assembly gap is present in the published psnv genome [ ]. 
these contigs were obtained from experimental animals from two different research groups [ ] [ ] [ ]. the common factor is the animal stock centre used by these studies, which is therefore likely to be the source of the virus. axolotl are critically endangered in the wild; determining the distribution and pathophysiology of amexnv in these animals can assist with conservation efforts. infectious agents are the leading cause of pyrexia of unknown origin (puo) in children and immunocompromised adults [ ] . in addition to identifying genetic diversity within cov, we cross-referenced cov+ library meta-data to identify possible zoonoses and infer vectors of transmission. discordant libraries, those in which a cov is identified and the expected viral host does not match the sequencing library source taxa, were rare, accounting for only . % of cases (extended table e ). in a virome sequencing study [ ] of children with febrile illness, we identified sequencing runs from two children, one febrile (id: ) and one afebrile (id: ), with reads mapping to the βcov murine hepatitis virus (mhv). we assembled a complete . kb mhv genome from each replicate taken from the febrile child and a partial genome from the afebrile child. mhv can infect human cells in vitro [ ], but infection may be rare in humans. this highlights how rapid and unbiased meta-genomic sequence analysis can not only resolve the etiology of a sub-set of puo; centralisation of these data (stripped of human-identifying reads) can also serve as a public-health surveillance system for zoonosis. an important consideration for these analyses is that the nucleic acid reads do not prove viral infection has occurred in the nominal host species. for example, we identified four libraries in which a porcine or avian coronavirus was found in plant samples. a more likely explanation than cross-kingdom cov transmission is that cov was present in faeces/fertiliser originating from a mammalian or avian host. 
coronaviridae is a well-characterised family ( figure and extended figure ), yet our re-analysis of the sra yielded eleven novel or under-reported otus. there are at least , more high-confidence (score ≥ ) and diverged (≤ % identity) virus-containing datasets. in particular picornaviridae and reoviridae are enriched and numerous within this category ( figure ). serratus exploration of under-characterised viruses can potentially fill these gaps in our knowledge. the global mortality from viral hepatitis exceeds that of hiv/aids, tuberculosis or malaria, due to acute and chronic liver cirrhosis and subsequent hepatocellular carcinoma [ ] . hepatitis delta virus (hdv) is a small ( . knt) rna satellite virus infecting hepatocytes. alone, hdv is unable to produce infectious viral particles, as it requires the envelope protein from its helper virus, hepatitis b (hbv) [ ] . hdv infection aggravates liver cirrhosis caused by hbv and worsens clinical outcomes [ ] . prior to , hdv was the sole known member of its genus; ten members have since been characterised [ - ]. we identified an additional six deltaviruses ( figure a ) and assembled complete circular genomes for five (extended figure ). the evolutionary histories of these deltaviruses are explored further in a companion manuscript [ ] . one of these novel deltaviruses, mmondv, was identified in marmota monax (eastern woodchuck), a model organism used over the last three decades for the study of viral-induced hepatitis and hepatocellular carcinoma following woodchuck hepatitis virus (whv) infection, a hepadnavirus similar to hbv [ ]. from a study of woodchucks born in captivity and experimentally infected with whv [ ], liver biopsy rna-seq from four ( . %) animals contained > mmondv-mapping reads in at least one time-point of the week study (figure c ). 
woodchuck hepatitis virus can support replication of human hdv and is in fact a model for hdv pathogenesis [ , ], so it is probable that whv is also the helper virus for mmondv. inter-animal variation of whv-induced liver cirrhosis can be substantial [ ], and cryptic mmondv infection may have been underlying some of this variability across the past three decades of research using this model system, which warrants further investigation. to explore the utility of broad-scale read archive searches for microbiome research, we sought to locate phages whose genomes encode proteins related to the terminases and major capsid proteins from recently reported huge phages [ ] . to focus on phages whose genomes are substantially larger than normal (the average size is kbp [ ]), we prioritised assembled sequences of ≥ kbp (figure a ). assembly of high-scoring runs returned terminase-containing long contigs, primarily from cats, dogs, cattle and whales. the phylogenetic analysis of these sequences resolves new groups of phages with large genomes, some of which comprise sequences from only one animal genus. however, in a few cases we identified closely related phages in different animal orders, including one case where related phages were found in a human from bangladesh (err ) and groups of cats (prjeb ) and dogs (prjeb ) from england, sampled years apart. this result parallels the finding of kbp lak phage genomes in pigs, baboons and humans [ ] . these newly recovered sequences substantially expand the previously defined clades and reveal members of these clades in new habitats ( figure b ). overall, these findings reinforce that phages with large genomes are prevalent in human and animal microbiomes. since the completion of the initial draft of the human genome, the decline in the cost of dna sequencing has outpaced moore's law, with a corresponding increase in the sizes of sequence databases [ ] . 
serratus offers researchers access to over a decade of data collected by the global research community in a rapid and cost-effective manner. while our first priority was viral discovery in the context of an ongoing global health crisis, we believe that serratus and further extensions of petabase-scale metagenomics will shape a new era in computational biology, and enable radically new approaches to gene discovery, pathogen surveillance and pangenomic evolutionary analysis, amongst other applications. rapid translation of large datasets, such as those generated by serratus, into meaningful biomedical advances requires concerted collaboration between specialists [ ] and underscores a greater need for prompt, free and unrestricted data sharing in the community, not only of raw data (reads) but also of analyses such as assemblies and annotations. to facilitate such progress, we established a data warehouse of the . terabytes of viral alignments containing known, and yet to be characterized, viral species, each requiring domain expertise for curation. these data can be explored via a graphical web interface at https://serratus.io or programmatically through the r package tantalus (https://github.com/serratus-bio/tantalus), which interfaces with a postgresql server hosting high-level data summaries. computational biology is outpacing the rate at which classical isolation- or culture-based validation can be performed. reverse genetics and synthetic nucleic acids offer a path to biological validation when virions are unavailable, such as those predicted from sequence alone [ , ] . innovative fields such as high-throughput functional viromics [ ] leverage these broad and rapidly growing collections of viral sequences, and can inform evidence-based policies responding to emerging pandemics [ , ] . 
human population growth and encroachment on animal habitats is bringing more species into proximity, leading to increased zoonosis [ ] and accelerating the anthropocene mass extinction [ , ] . while the availability of computation and data analysis is increasing, the opportunity to capture the rich genetic diversity of endangered species and their associated microorganism biodiversity is not. the need to invest in field studies for the collection and curation of rare and biologically diverse samples has never been as pressing as it is today. if not for the conservation of endangered species, then to conserve our own. figure ). the processing of each sequencing library is split into three modules dl (download), align, and merge. the dl module acquires compressed data (.sra format) via prefetch, from the aws s mirror of the sra, decompresses to fastq, and splits the data into fq-blocks of million reads or read-pairs into a temporary s cache bucket. to mitigate excessive disk usage caused by a few large datasets, a limit of million reads per dataset was imposed. the align module reads individual fq-blocks and aligns to an indexed database of user-provided query sequences using either bowtie each component is launched from a separate aws autoscaling group with its own launch template, allowing the user to tailor instance requirements per task. this enabled us to minimise the use of costly block storage during compute-bound tasks such as alignment. we used the following spot instance types; dl: gb ssd block storage, vcpus, gb ram (r .xlarge) instances; align: gb ssd block storage, vcpus, gb ram (c .xlarge) , instances; merge: gb ssd block storage, vcpus, gb ram (c .large) instances. users should note that it may be necessary to submit a service ticket to access more than the default ec instance limit. ec instances have higher network bandwidth (up to . gb/s) than block storage bandwidth ( mb/s). 
to exploit this, we used s buckets as a data buffering and streaming system and to transfer data between instances following methods developed in a previous cloud architecture (https://github.com/fredhutch/sra-pipeline). this, combined with splitting of fastq files into individual blocks, effectively eliminated file input/output (i/o) as a bottleneck, since the available i/o is multiplied per running instance (conceptually analogous to a raid configuration). using s as a buffer also allowed us to decouple the input and output of each module. s storage is cheap enough that in the event of unexpected issues (e.g., exceeding ec quotas) we could resolve problems and resume processing; for example, we could shut down the align modules to hotfix a genome indexing problem without having to re-run the dl modules. the serratus scheduler node controls the number of desired instances to be created for each component of the workflow, based on the available work queue. we implemented a pull-based work queue. upon boot-up each instance launches a number of worker threads equal to the number of cpus available. each worker independently manages itself via a boot script, and queries the scheduler for available tasks. upon completion of the task, the worker reports the result (success or fail) to the scheduler and queries for a new task. under ideal conditions, this allows for a response time in the hundreds of milliseconds, worst case, keeping cluster throughput high. each task typically lasts several minutes. the scheduler itself was implemented using postgres (for persistence and concurrency) and flask (to pool connections and translate rest queries into sql). the flask layer allowed us to scale the cluster past the number of simultaneous sessions manageable by a single postgres instance. the work queue can also be managed manually by the user, to perform operations such as re-attempting the download of an sra accession after a failure or pausing an operation while debugging. 
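the pull-based work queue described above can be illustrated with a short python sketch. this is not the serratus scheduler: an in-memory, lock-protected dictionary stands in for the postgres and flask layers, and the names Scheduler, pull, report, reset, worker and run_cluster are hypothetical, chosen only for this example.

```python
import threading
from collections import OrderedDict


class Scheduler:
    """In-memory stand-in for a persistent work queue (hypothetical API)."""

    def __init__(self, accessions):
        self._lock = threading.Lock()
        # Task life cycle: "new" -> "running" -> "done" or "fail".
        self.state = OrderedDict((acc, "new") for acc in accessions)

    def pull(self):
        """Hand out one unclaimed task, or None when nothing is waiting."""
        with self._lock:
            for acc, status in self.state.items():
                if status == "new":
                    self.state[acc] = "running"
                    return acc
        return None

    def report(self, acc, ok):
        """Record the outcome reported by a worker."""
        with self._lock:
            self.state[acc] = "done" if ok else "fail"

    def reset(self, acc):
        """Manual re-queue, e.g. to re-attempt a failed download."""
        with self._lock:
            self.state[acc] = "new"


def worker(scheduler, process):
    """Each worker pulls tasks on its own until the queue is empty."""
    while True:
        acc = scheduler.pull()
        if acc is None:
            return
        try:
            process(acc)
            scheduler.report(acc, True)
        except Exception:
            scheduler.report(acc, False)


def run_cluster(accessions, process, n_workers=4):
    """Start n_workers threads (one per 'cpu') and wait for them to finish."""
    scheduler = Scheduler(accessions)
    threads = [
        threading.Thread(target=worker, args=(scheduler, process))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return scheduler
```

because workers ask for work rather than being assigned it, a worker that vanishes simply stops pulling; a task it held stays in the "running" state until something resets it, which is the role the text assigns to the job cleaner.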
the system is designed to be fully self-scaling. an "autoscaling controller" was implemented which scales in or scales out the desired number of instances per task every five minutes based on the work queue. as a backstop, when all workers on an instance fail to receive work instructions from the scheduler, the instance is shut down. finally, a "job cleaner" component checks the active jobs against currently running instances. if an instance has disappeared due to spot termination or manual shutdown, it resets the job, allowing it to be picked up by the next available instance. to monitor cluster performance in real-time, we used prometheus and node exporter to retrieve cpu, disk, memory, and networking statistics from each instance, postgres exporter to expose performance information about the work queue, and python exporter to export information from the flask server. this allowed us to identify and diagnose performance problems within minutes to avoid costly overruns. we define a viral pangenome as the entire collection of reference sequences belonging to a taxonomic viral family, which may contain both full-length genomes and sequence fragments such as those based on rdrp amplicon sequencing. we developed a summarizer module written in python to provide a compact, human- and machine-readable synopsis of the alignments generated for each sra dataset. the method was implemented in serratus summarizer.py for nucleotide alignment and serratus psummarizer.py for amino acid alignments. reports generated by the summarizer are text files with three sections described in detail online (https://github.com/ababaian/serratus/wiki/.summary-reports). in brief, each contains a header section with alignment meta-data and one-line summaries for each virus family pangenome, reference sequence and gene respectively, with gene summaries provided for protein alignments only. 
for each summary line we include descriptive statistics gathered from the alignment data such as the number of aligned reads, estimated read depth, mean alignment identity, and coverage, i.e. the distribution of reads across each reference sequence or pangenome. coverage is measured by dividing a reference sequence into equal bins and depicted as an ascii text string of symbols, one per bin; for example oaooomouu:owwuuwowamwaauw. each symbol represents log (n + ) where n is the number of reads aligned to a bin in this order: .:uwaomuwaom^. thus, ' ' indicates no reads, '.' exactly one read, ':' two reads, 'u' - reads, 'w' - reads and so on; '^' represents > = , reads in the bin. for a pangenome, alignments to its reference sequences are projected onto a corresponding set of bins. for a complete genome, the projected pangenome bin number , , . . . , is the same as the reference sequence bin number. for a fragment, a bin is projected onto the pangenome bin implied by the alignment of the fragment to a complete genome. for example, if the start of a fragment aligns half way into a complete genome, bin of the fragment is projected to bin = of the pangenome. the introduction of pangenome bins was motivated by the observation that bowtie selects an alignment at random when there are two or more top-scoring alignments, which tends to distribute coverage over several reference sequences when a single viral genome is present in the reads. coverage of a single reference genome may therefore be fragmented, and binning to a pangenome better assesses coverage over a putative viral genome in the reads while retaining pangenome sequence diversity for detection. the summarizer implements a binary classifier predicting the presence or absence of each virus family in the query. for a given family f , the classifier reports a score in the range [ , ] with the goal of assigning a high score to a dataset if it contains f and a low score if it does not. 
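returning to the coverage string described above, the binning and symbol encoding can be sketched as follows. several details are assumptions rather than facts taken from the summarizer: the logarithm is taken as base 2 and rounded up (for non-negative integers, ceil(log2(n + 1)) equals n.bit_length()), the symbol alphabet in ALPHABET is a guess because case and the zero-read symbol are not legible in this text, read positions are 0-based alignment starts, and the default of 25 bins simply matches the length of the example string.

```python
# Assumed alphabet: one symbol per log-scale level, last symbol = "saturated".
ALPHABET = "_.:uwaomUWAOM^"


def bin_counts(ref_len, read_starts, n_bins=25):
    """Count reads per equal-width bin along a reference of length ref_len."""
    counts = [0] * n_bins
    for pos in read_starts:
        # Integer arithmetic keeps bin edges exact; clamp the final position.
        b = min(pos * n_bins // ref_len, n_bins - 1)
        counts[b] += 1
    return counts


def coverage_string(counts, alphabet=ALPHABET):
    """Encode per-bin read counts as one symbol per bin on a log2 scale."""
    top = len(alphabet) - 1
    # n.bit_length() == ceil(log2(n + 1)): 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3
    return "".join(alphabet[min(n.bit_length(), top)] for n in counts)
```

under these assumptions a bin with no reads is shown with the first symbol, one read with '.', two reads with ':', and any very deep bin with the final symbol.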
setting a threshold on the score divides datasets into disjoint subsets representing predicted positive and negative detections of family f. the choice of threshold implies a trade-off between false positives and false negatives. sorting by decreasing score ranks datasets in decreasing order of confidence that f is present in the reads. naively, a natural measure of the presence of a virus family is the number of alignments to its reference sequences. however, alignments may be induced by non-homologous sequence similarity, for example low-complexity sequence. the score for a family was therefore designed to reflect the overall coverage of a pangenome because coverage across all or most of a pangenome is more likely to reflect true homology, i.e. the presence of a related virus. ideally, coverage would be measured individually for each base in the reference sequence, but this could add undesirable overhead in compute time and memory for a process which is executed in the linux alignment pipe (fastq decompression → aligner → summarizer → alignment file compression). coverage was therefore measured by binning as described above, which can be implemented with minimal overhead. a virus that is present in the reads with coverage too low to enable an assembly may have less practical value than an assembled genome. also, genomes with lower identity to previously known sequences will tend to contain more novel biological information than genomes with high identity and will tend to have fewer alignments due to highly diverged segments. with these considerations in mind, the classifier was designed to give higher scores when coverage is high, read depth is high, and/or identity is low. this was accomplished as follows. let h be the number of bins with at least alignments to f , and l be the number of bins with from to alignments. 
let s be the mean alignment percentage identity, and define the identity weight w = ( s ) − , which is designed to give higher weight to lower identities, noting that w is close to one when identity is close to % and increases rapidly at lower identities. the classification score for family f is calculated as z f = max(w( h + l)), ). by construction, z f has a maximum of when coverage is consistently high across a pangenome, and is also high when identity is low and coverage is moderate, which may reflect high read depth but many false negative alignments due to low identity. thus, z f is greater than zero when there is at least one alignment to f and assigns higher scores to sra datasets which are more likely to support successful assembly of a virus belonging to f . )" (date accessed: may th ). retroviruses (n = ) were excluded as preliminary testing yielded excessive numbers of alignments to transcribed endogenous retroviruses. each sequence was annotated with its taxonomic family according to its refseq record; those for which no family was assigned by refseq (n = ) were designated as "unknown". the collection of these pangenomes was termed cov m, and was the sequence reference used for this study. the protein search query was composed of the following sequences: (i) cov proteins (method described under to run serratus, a target list of sra run accessions is required. for this work, we designed target lists broadly classified as human, mouse, mammal, vertebrate, invertebrate, bat (including genome sequencing libraries), virome and metagenome (extended table c ). each list contained accessions of rna-seq, meta-genomic, and metatranscriptome runs for these organisms; some run accessions appeared in more than one list. prior to each serratus run, the lists were depleted for accessions already analyzed. re-processing of a failed dataset was attempted at least twice. in total we were able to generate alignments to the query pangenomes for , , / , , ( . 
%) of the targeted sra accessions. we implemented an on-going, multi-tiered release policy for code and data generated by this study, as follows. all code, electronic notebooks and raw data are immediately available at https://github.com/ababaian/serratus and on the s ://serratus-public/ bucket, respectively. upon completion of a project milestone, a structured data release containing raw data is issued into our viral data warehouse s ://lovelywater/. for example, at the time of writing the .bam alignment files from . million sra runs are stored in s ://lovelywater/bam/x.bam; .summary files are stored in s ://lovelywater/summary/x.summary, where x is an sra run accession. these structured releases enable downstream and third-party programmatic access to the data. summary files for every searched sra dataset are parsed into a postgresql relational database which can be queried remotely via an aws relational database service (rds) server. this enables users and programs to perform complex operations such as retrieving summaries and meta-data for all sra runs matching a given reference sequence above a given classifier score threshold. for example, one can retrieve all records containing at least aligned reads to hepatitis delta virus (nc . ) and the associated host taxonomy for the corresponding sra datasets. for users unfamiliar with sql queries we developed tantalus (https://github.com/serratus-bio/tantalus), an r programming-language package which directly interfaces with the serratus rds server to retrieve summary information as data-frames. tantalus also offers functions to explore and visualize the data. finally, the serratus data can be explored via a graphical web interface by accession, virus, or viral family at https://serratus.io. the website uses javascript to access the rds server and create a graphical report with an overview of viral families found in each sra accession matching a user query. 
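the kind of query described above (runs matching a reference sequence above a read-count and score threshold, joined to host taxonomy) can be sketched with python's built-in sqlite3 module. this is only an illustration: sqlite stands in for the postgresql server, and the table names (summary, run_meta), column names and identifiers such as RUN_A and REF_X are hypothetical and do not reflect the actual serratus schema.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a made-up schema and rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        create table summary (
            run_id  text,     -- sequencing run identifier
            ref_id  text,     -- reference sequence identifier
            n_reads integer,  -- number of aligned reads
            score   integer   -- classifier score
        );
        create table run_meta (
            run_id     text primary key,
            host_taxon text   -- host taxonomy label for the run
        );
        """
    )
    con.executemany(
        "insert into summary values (?, ?, ?, ?)",
        [
            ("RUN_A", "REF_X", 50, 90),
            ("RUN_B", "REF_X", 5, 95),
            ("RUN_C", "REF_X", 20, 40),
            ("RUN_A", "REF_Y", 100, 100),
        ],
    )
    con.executemany(
        "insert into run_meta values (?, ?)",
        [("RUN_A", "host_1"), ("RUN_B", "host_2"), ("RUN_C", "host_3")],
    )
    return con


def runs_matching(con, ref_id, min_reads, min_score):
    """Runs with enough reads and a high enough score for one reference."""
    sql = """
        select s.run_id, s.n_reads, s.score, m.host_taxon
        from summary s
        join run_meta m on m.run_id = s.run_id
        where s.ref_id = ? and s.n_reads >= ? and s.score >= ?
        order by s.score desc, s.n_reads desc
    """
    # Parameters are bound by the driver rather than pasted into the string.
    return con.execute(sql, (ref_id, min_reads, min_score)).fetchall()
```

a call such as runs_matching(build_demo_db(), "REF_X", 10, 50) returns only the rows that pass both thresholds, each with its host label attached.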
all four data access interfaces are under ongoing development, receiving community feedback via their respective github issue trackers to facilitate the translation of this data collection into an effective viral discovery resource. documentation for data access methods is available at https://serratus.io/access. viral assembly and annotation . . coronaspades rna viral genome assembly faces several distinct challenges stemming from technical and biological bias in sequencing data. during library preparation, reverse transcription introduces end coverage bias, and gc-content skew and secondary structures lead to unequal pcr amplification [ ] . technical bias is confounded by biological complexity such as intra-sample sequence variation due to transcript isoforms, as found in cov [ ] and/or the presence of multiple strains. to address the assembly challenges specific to rna viruses, we developed coronaspades, described in detail in a companion manuscript [ ] . in brief, rnaviralspades and the more specialized variant, coronaspades, combine algorithms and methods from several previous approaches based on metaspades [ ], rnaspades [ ] and metaviralspades [ ] with a hmmpathextension step. coronaspades constructs an assembly graph from an rna-sequencing dataset (transcriptome, meta-transcriptome, and meta-virome are supported), removing expected sequencing artifacts such as low-complexity (poly-a / poly-t) tips, edges, single-strand chimeric loops or double-strand hairpins [ ] and subspecies-based variation [ ] . to deal with possible misassemblies and high-coverage sequencing artifacts, a secondary hmmpathextension step is performed to leverage orthogonal information about the expected viral genome. protein domains are identified on all assembly graphs using a set of viral hidden markov models (hmms), and similar to biosyntheticspades [ ], hmmpathextension attempts to find paths on the assembly graph which pass through significant hmm matches in order. 
coronaspades is bundled with the pfam sars-cov- set of hmms [ ], although these may be substituted by the user. this latter feature of coronaspades was utilized for hdv assembly, where the hmm model of hdag, the hepatitis delta antigen, was used instead of the pfam sars-cov- set. note that despite the name, these hmms are quite general, modeling domains found in all coronavirus genera in addition to rdrp, which is found in many rna virus families. hits from these hmms cover most bases in most known coronavirus genomes, enabling the recovery of strain mixtures and splice variants. accurate annotation of cov genomes is challenging due to ribosomal frameshifts and polyproteins which are cleaved into mature proteins [ ] , and thus previously-annotated viral genomes offer a guide to accurate gene-calls and protein functional predictions. however, while many of the viral genomes we were likely to recover would be similar to previously-annotated genomes in refseq or genbank, we anticipated that many of the genomes would be taxonomically distant from any available reference. to address these constraints, we developed an annotation pipeline called darth [ ] which leverages both reference-based and ab initio annotation approaches. in brief, darth consists of the following phases: canonicalize the ordering and orientation of assembly contigs using conserved domain alignments, perform reference-based annotation of the contigs, annotate rna secondary structure, perform ab initio gene-calling, generate files for aiding assembly and annotation diagnostics, and generate a master annotation file. it is important to put the contigs in the "expected" orientation and ordering to facilitate comparative analysis of synteny and as a requirement for genome deposition. to perform this canonicalization, darth generates the six-frame translation of the contigs using transeq [ ] and uses hmmer [ ] to search the translations for pfam domain models specific to cov [ ] . 
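to make the six-frame translation step concrete, the following is a minimal pure-python sketch. it is a stand-in for illustration and not the transeq program used by darth: it assumes the standard genetic code, writes any codon containing an ambiguous base as 'X', and uses its own frame numbering (1 to 3 for the forward strand, -1 to -3 for the reverse complement), which may differ from the convention of other tools.

```python
# Standard genetic code laid out in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string; other letters pass through."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Return a dict mapping frame number to its protein translation."""
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

a contig whose conserved domains are found only in the negative frames is a candidate for reverse-complementing before annotation, which is the purpose of the orientation step in the text.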
darth compares the pfam accessions from the hmmer alignment to the ncbi sars-cov- reference genome (ncbi nucleotide accession nc . ) to determine the correct ordering and orientation, and produces an updated assembly fasta file. darth performs reference-based annotation using vadr [ ] , which provides a set of genome models for all cov refseq genomes [ ] . vadr provides annotations of gene coordinates, polyprotein cleavage sites, and functional annotation of all proteins. darth supplements the vadr annotation by using infernal [ ] to scan the contigs against the sars-cov- rfam release [ ] which provides updated models of cov and untranslated regions (utrs) along with stem-loop structures associated with programmed ribosomal frame-shifts. while vadr provides reference-based gene-calling, darth also provides ab initio gene-calling by using fraggenescan [ ] , a frameshift-aware gene caller. darth also generates auxiliary files which are useful for assembly quality and annotation diagnostics, such as indexed bam files created with samtools [ ] representing self-alignment of the trimmed reads to the canonicalized assembly using bowtie [ ], and variant-calls using bcftools from samtools. darth generates these files so that they can be easily loaded into a genome browser such as jbrowse [ ] or igv [ ] . as the final step darth generates a single generic feature format (gff) . file [ ] containing the combined set of annotation information described above, ready for use in a genome browser, or for submitting the annotation and sequence to a genome repository. the serratus searches described above identified , libraries ( , by nucleotide and , by amino acid) as potentially positive for cov (score ≥ and ≥ reads). to supplement this search we also employed a recently developed index of the sra called stat [ ] , which identified an additional , sra datasets not in the defined sra search space. the stat bigquery was where tax id= and total count > " accessed on june th . 
we used aws batch to launch thousands of assemblies of ncbi accessions simultaneously. the workflow consists of four standard parts: a job queue, a job definition, a compute environment, and finally, the jobs themselves. a cloudformation template was created for building all parts of the cloud infrastructure from the command line. the job definition specifies a docker image, and asks for virtual cpus (vcpus, corresponding to threads) and gb of memory per job, corresponding to a reasonable allocation for coronaspades. the compute environment is the most involved component. we set it to run jobs on cost-effective spot instances (optimal setting) with an additional cost-optimization strategy (spot capacity optimized setting), allowing up to , vcpus total. in addition, the compute environment specifies a launch template which, on each instance, i) automatically mounts an exclusive tb ebs volume, allowing sufficient disk space for several concurrent assemblies, and ii) downloads the . gb checkv database, to avoid bloating the docker image. the peak aws usage of our batch infrastructure was , vcpus, performing , assemblies simultaneously. a total of , accessions out of , were assembled in a single day. they were then analysed by two methods to detect putative cov contigs. the first method is checkv, followed by selecting contigs associated with known cov genomes. the second method is a custom script that parses coronaspades bgc candidates and keeps contigs containing cov domain(s). for each accession, we kept the set of contigs obtained by the first method (checkv) if it was non-empty, and otherwise we kept the set of contigs from the second method (bgc). a majority ( %) of the assemblies were discarded for one of the following reasons: i) no cov contigs were found by either filtering method, ii) reads were too short to be assembled, iii) batch job or sra download failed, or iv) coronaspades ran out of memory. a total of , assemblies were considered for further analysis. 
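the contig selection rule above (prefer the checkv set, fall back to the bgc set, discard the assembly if both are empty) is simple enough to state as code. the sketch below is illustrative only; the function names and the shape of the input (plain lists of contig names per accession) are assumptions made for this example.

```python
def select_cov_contigs(checkv_contigs, bgc_contigs):
    """Pick the contig set for one accession and report which method supplied it."""
    if checkv_contigs:
        return list(checkv_contigs), "checkv"
    if bgc_contigs:
        return list(bgc_contigs), "bgc"
    return [], None


def filter_assemblies(results):
    """Keep accessions with at least one putative contig.

    results maps accession -> (checkv_contigs, bgc_contigs).
    """
    kept = {}
    for accession, (checkv_contigs, bgc_contigs) in results.items():
        contigs, _method = select_cov_contigs(checkv_contigs, bgc_contigs)
        if contigs:
            kept[accession] = contigs
    return kept
```

note that the two sets are never merged: when checkv returns anything at all, the bgc candidates for that accession are ignored.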
with rna-seq metagenomic reads, the number of reads per base may be highly variable at different locations in a viral genome. regions of high coverage may be adjacent to regions with low coverage or no reads, causing breaks between contigs. thus, a given base in a contig may have only one or very few reads as evidence, and as a consequence the reliability of base calls may be low in some regions of the assembly, which could degrade inference of biological variations between genomes. the assemblers used in this work do not provide a per-base quality score, and to address this issue we used two complementary approaches: ( ) reporting contig average coverage as a proxy for quality, and ( ) self-aligning reads to the assembly sequence and calling variants to enable facile visual inspection of per-base coverage levels and significant variants in genome browsers (see section . . ). we developed a module, serratax, to predict taxonomy for cov genomes and assemblies (https://github.com/ababaian/serratus/tree/master/containers/serratax). serratax was designed with the following requirements in mind: provide taxonomy predictions for fragmented and partial assemblies in addition to complete genomes; report best-estimate predictions balancing over-classification and under-classification (too many and too few ranks, respectively); and assign an ncbi taxonomy database [ ] identifier (taxid). assigning a best-fit taxid was not supported by any previously published taxonomy prediction software to the best of our knowledge; this requires assignment to intermediate ranks such as sub-genus and ranks below species (commonly called strains, but these ranks are not named in the taxonomy database), and to unclassified taxa, e.g. taxid , unclassified buldecovirus, in cases where the genome is predicted to fall inside a named clade but outside all named taxa within that clade. serratax uses a reference database containing domain sequences with taxids. 
this database was constructed as follows. records annotated as cov were downloaded from uniprot [ ] , and chain sequences were extracted. each chain name, e.g. helicase, was considered to be a separate domain. to generate an alternate taxonomic annotation of an assembled genome, we created a pipeline based on phylogenetic placement, serraplace. to perform phylogenetic placement, a reference phylogenetic tree is required. to this end, we collected reference amino acid rdrp sequences, spanning all coronaviridae. to this set we added an outgroup rdrp sequence from the torovirus family (nc ). we clustered the sequences to % identity using usearch ([ ] , uclust algorithm, v . . ), resulting in centroid sequences. subsequently we performed multiple sequence alignment on the clustered sequences using muscle ( [ ] , v . . ). we then performed maximum likelihood tree inference using raxml-ng ( [ ] , protgtr+fo+g , v . . ), resulting in our reference tree. to apply serraplace to a given genome, we first use hmmer ([ ], v . ) to generate a reference hmm, based on the reference alignment. we then split each contig into orfs using esl-translate, and use hmmsearch (p-value cutoff . ) to identify those query orfs that align with sufficient quality to the previously generated reference hmm. all orfs that pass this test are considered valid input sequences for phylogenetic placement. subsequently, we use epa-ng ( [ ] , v . . ) to place each sequence on the rdrp reference tree. this produces a set of likely placement locations on the tree, with an associated likelihood weight. we then use gappa ( [ ] , v . . ) to assign taxonomic information to each query, using the taxonomic information for the reference sequences. gappa assigns taxonomy by first labelling the interior nodes of the reference tree by a consensus of the taxonomic labels of all descendant leaves of that node. if % of leaves share the same taxonomic label up to some level, then the internal node is assigned that label. 
then, the likelihood weight associated with each sequence is assigned to the labels of internal nodes of the reference tree, according to where the query was placed. from this result, we select the taxonomic label that accumulated the highest total likelihood weight as the taxonomic label of a sequence. note that multiple orfs of the same genome may result in a taxonomic label, in which case, we select the longest sequence as the source of the taxonomic assignment of the genome. we performed phylogenetic inferences using a custom snakemake pipeline (available at https://github.com/lczech/nidhoggr), using pargenes ( [ ] , v . . ). pargenes is a tree search orchestrator, built on top of modeltest-ng [ ] and raxml-ng, enabling higher levels of parallelisation for a given tree search. to infer the maximum likelihood phylogenetic tree displayed in extended figure , we performed a tree search comprising distinct starting trees ( random, parsimony), as well as bootstrap searches. we used modeltest-ng to automatically select the best evolutionary model, which in this case was lg+iu+g m. the pipeline also automatically produces versions of the best maximum likelihood tree annotated with felsenstein's bootstrap ( [ ] ) support values, and transfer bootstrap expectation ([ ]) values, the latter of which was used in extended figure . archival copies of all code generated for this study are available at https://github.com/serratus-bio. electronic notebooks for experiments are available at https://github.com/ababaian/serratus. all data generated in this study can be accessed at https://serratus.io/access. assembled genome contigs for this study are available at https://serratus.io/access pending deposition into public repositories. extended table : sra run queries and search nucleotide accessions. queries and accessions from this study. a sra queries to retrieve collections of datasets. 
b nucleotide accessions compiled into the cov ma reference query and c the sequence masked applied to those sequences. extended table : assembled coronaviridae in the sra. a run accessions, assembly statistics and select meta-data for the , runs for which coronaviridae, or coronaviridae-like sequences were assembled. b assignment of assembled runs to operational taxonomic units (otus) based on % identity of the rna dependent rna polymerase (rdrp) domain. c assignment of genbank records to rdrp otus. d assignment of expected viral host for genbank records. e taxonomic source for rdrp containing assemblies. f supporting data for figure . extended figure : overview of the serratus architecture. a schematic and data workflow (b) as described in the methods for aligning to the viral pangenome (c). d a nucleotide alignment completion rate for serratus shows stable and linear performance to complete . million sra accessions in a -hour period. e cost breakdown for this run. compute costs between modules are an approximate comparison of cpu requirements of each step. the total average cost per completed sra accession was $ . us dollars or $ . us dollars per terabase processed. extended figure : distribution of dna and other viral families in the sra the total number of datasets matching each dna or other viral pangenome, binned by the average nucleotide identity and colored by score (see methods). an interactive and queryable version of this plot is available at https://serratus.io/family. figure : deltavirus ribozymes evolutionary history a multiple sequence alignment of the genomic and anti-genomic deltavirus ribozymes based on muscle [ ] and refined manually based on secondary structure. the shortening of the j / loop and presence of the lg loop is specific to and conserved within the genomic ribozyme. consensus secondary structure of the b genomic and c anti-genomic ribozymes. 
d maximum-likelihood tree based on concatenated ribozyme sequences supports the topology of the δag amino-acid tree (figure ) a strategy to estimate unknown viral diversity in mammals global trends in emerging infectious diseases. eng global shifts in mammalian population trends reveal key predictors of virus spillover risk the global virome project. en the sequence read archive the sensitivity of massively parallel sequencing for detecting candidate infectious agents associated with human tissue. eng demographic and environmental drivers of metagenomic viral diversity in vampire bats. en comparative analysis of rodent and small mammal viromes to better understand the wildlife origin of emerging infectious diseases description and initial characterization of metatranscriptomic nidovirus-like genomes from the proposed new family abyssoviridae, and from a sister group to the coronavirinae, the proposed genus alphaletovirus endangered wild salmon infected by newly discovered viruses comparative population genomics in animals uncovers the determinants of genetic diversity. en blastemal progenitors modulate immune signaling during early limb regeneration midkine is a dual regulator of wound epidermis development and inflammation during the initiation of limb regeneration ap- cfos/junb /mir- a regulate the pro-regenerative glial cell response during axolotl spinal cord regeneration. en pyrexia of unknown origin sequence analysis of the human virome in febrile and afebrile children mouse hepatitis virus strain jhm infects a human hepatocellular carcinoma cell line. eng the global burden of viral hepatitis from to : findings from the global burden of disease study infection by hepatitis delta virus. en pfam sars-cov- special update (part ) en. library catalog: xfam.wordpress.com vadr: validation and annotation of virus sequence submissions to genbank coronavirus annotation using vadr en. library catalog: github infernal . 
: -fold faster rna homology searches rfam coronavirus special release en. library catalog: xfam.wordpress.com fraggenescan: predicting genes in short and error-prone reads. en the sequence alignment/map format and samtools. eng jbrowse: a dynamic web platform for genome visualization and analysis. eng publisher: american association for cancer research section: focus on computer resources the sequence ontology: a tool for the unification of genome annotations the ncbi taxonomy database uniprot: a worldwide hub of protein knowledge muscle: multiple sequence alignment with high accuracy and high throughput raxml-ng: a fast, scalable and userfriendly tool for maximum likelihood phylogenetic inference. en epa-ng: massively parallel evolutionary placement of genetic sequences. en genesis and gappa: processing, analyzing and visualizing phylogenetic (placement) data pargenes: a tool for massively parallel model selection and phylogenetic tree inference on thousands of genes. en modeltest-ng: a new and scalable tool for the selection of dna and protein evolutionary models. en tobaniviridae (t), and roniviridae (r). b distribution of pair-wise sequence identities for rdrp sequences within and between distinct taxa at species, subgenus and genus rank, respectively. c distribution of pair-wise rdrp identities for coronaviridae genera hidden markov model (hmm) protein domain matches from the rdrp containing contigs or reference sequences for exemplar operational taxonomic units (otus) grouped by genus extended figure : newly characterised deltavirus genomes genome structure and organisation of the five deltaviruses (pmacdv srr ; mmondv srr ; ovirdv srr ; tgutdv srr ; and ichidv srr ) and one deltavirus-like (bgladvl srr ; for which we could not identify a ribozyme sequence) sequence identified in our study. 
each circular rna virus shows characteristic rod-like genome folding and low free-energy (δg), similar to a hepatitis delta virus positive control, and two ribozymes and the serratus project is an initiative of the hackseqrna genomics hackathon (https://www.hackseq.com). we would like to thank the many contributors for code snippets and bioinformatic discussion; e. erhan, j. chu, i. birol, k. wellman, c. xu, m. huss, k. ha, e. nawrocki, r. mclaughlin, c. morgan-lang, c. blumberg, and the j. brister lab. a. rodrigues, s. mcmillan, v. wu, c. kennet, k. chao, and n. pereyaslavsky for aws support. we would also like to thank the j. joy lab, g. mordecai, j. taylor, s. roux, l. bergner, r. orton, and d. streicker for virology discussions. we are grateful to the entire team managing the ncbi sra. ta is grateful for the advanced research computing resource at the university of british columbia. pb was financially supported by the klaus tschira foundation, rc by anr transipedia and inception grants (pia/anr- -conv- , anr- -ce - ), ak and dm were supported by the russian science foundation (grant - - ) and computation was carried out in part by resource centre "computer centre of spbu". ak and dm are grateful to saint petersburg state university for the overall support of this work (project id: ). project support and computing resources were kindly provided by the university of british columbia community health and wellbeing cloud innovation centre, powered by aws. and special thanks to our patient and understanding partners. ab conceived and led the study. ab and jt designed and implemented the serratus architecture. ab and rce constructed the viral pangenomes and panproteomes. rce developed the serratax and summarizer modules. pb developed the serraplace tree placement and taxonomy prediction code and calculated maximum likelihood trees. ta developed the darth annotation pipeline and submitted the annotated genomes to ena. dm and ak developed the coronaspades assembler. 
rc implemented the assembly pipeline, and deployed the assembly and annotation pipeline. ab, vl, and dl designed and developed https://serratus.io and the sql server. ab and gn developed the tantalus r package. ab, rce, ta, pb, dm, ak, and rc analysed the coronavirus and deltavirus data. bas and jb designed the phage panproteome, assembled phage genomes, and conducted phylogenetic analyses. all authors contributed to data interpretation and writing the manuscript. correspondence should be addressed to ab. does not apply. key: cord- -xg eb nm authors: easton, alice; gao, shenghan; lawton, scott p; bennuru, sasisekhar; khan, asis; dahlstrom, eric; oliveira, rita g; kepha, stella; porcella, stephen f; webster, joanne; anderson, roy; grigg, michael e; davis, richard e; wang, jianbin; nutman, thomas b title: molecular evidence of hybridization between pig and human ascaris indicates an interbred species complex infecting humans date: - - journal: elife doi: . /elife. sha: doc_id: cord_uid: xg eb nm human ascariasis is a major neglected tropical disease caused by the nematode ascaris lumbricoides. we report a megabase (mb) reference-quality genome comprised of , protein-coding genes derived from a single, representative ascaris worm. an additional worms were collected from human hosts in kenyan villages where pig husbandry is rare. notably, the majority of these worms ( / ) possessed mitochondrial genomes that clustered closer to the pig parasite ascaris suum than to a. lumbricoides. comparative phylogenomic analyses identified over million nuclear-encoded snps but just two distinct genetic types that had recombined across the genomes analyzed. the nuclear genomes had extensive heterozygosity, and all samples existed as genetic mosaics with either a. suum-like or a. lumbricoides-like inheritance patterns supporting a highly interbred ascaris species genetic complex. 
as no barriers appear to exist for anthroponotic transmission of these ‘hybrid’ worms, a one-health approach to control the spread of human ascariasis will be necessary. approximately million people were estimated to be infected with the intestinal nematode ascaris lumbricoides in , resulting in an estimated deaths and a loss of over , disability-adjusted life years (dalys, global burden of disease study, ; http://ghdx.healthdata.org/ gbd- ). many infections go undiagnosed but, as with other soil-transmitted helminths (sth), ascaris spp. infections contribute significantly to global dalys, perpetuating the cycle of poverty in areas of endemic infection (brooker, ; hotez et al., ; montresor et al., ; pullan et al., ). despite the large global burden of sth, little is known about a. lumbricoides transmission patterns or the true prevalence of infection with the pig parasite a. suum in humans in endemic regions. deworming has become more widespread in areas of endemic sth infection (bundy et al., ). regional health authorities and global health organizations are now looking for strategies to build on these programs by achieving local elimination of sth as a public health problem (becker et al., ). a greater understanding of transmission dynamics (including the frequency of zoonotic transmission), gained using molecular epidemiological methods in settings where a. lumbricoides prevalence is low but persistent, could help move current efforts toward successfully eliminating transmission through more targeted treatment. population genetic studies of a. lumbricoides have drawn varying conclusions about whether zoonotic transmission is frequent (anderson and jaenike, ; dutto and petrosillo, ; nejsum et al., ; nejsum et al., a).
some studies have shown that cross-species transmission occurs between pigs and humans living in close proximity (anderson, ; betson et al., ; miller et al., ; monteiro et al., ; nejsum et al., b; peng and criscione, ; sadaow et al., ; takata, ; zhu et al., ). this is observed most often in non-endemic regions, probably because zoonotic transmission is less likely to be identified in areas where human-to-human transmission is common. the human parasite a. lumbricoides and the pig parasite a. suum have been found to be capable of interbreeding, and - % of worms in guatemala and china were hybrids (criscione et al., ; peng and criscione, ). furthermore, it is unclear whether pigs are an important reservoir of infection in humans worldwide or if a. suum is readily transmitted anthroponotically (betson et al., ; betson and stothard, ; da silva alves et al., ; leles et al., ; nejsum et al., ). studies have generally concluded that the genetic differences between ascaris worms collected from human populations in different parts of the world (betson et al., ; peng et al., ) are the result of geographic reproductive isolation. previous studies using ascaris mitochondrial genomes or genes suggest there are a. lumbricoides-type (human-associated) and a. suum-type (pig-associated) clades (anderson and jaenike, ; cavallero et al., ; zhou et al., ). other work suggests multiple clades of worms, only one of which is unique to pigs (nejsum et al., ). ascaris spp. infections also occur naturally in monkeys and apes, and ascaris spp. eggs are sometimes found in the feces of dogs, but this is likely a result of coprophagy by the dogs rather than infection (https://www.cdc.gov/parasites/ascariasis/biology.html). in the current study, we constructed a reference-quality ascaris genome (alv ) based on sequences from a single female worm collected from a person in kenya. this person was presumed to be infected with a. lumbricoides as there is a lack of local pig husbandry.
draft a. suum genomes have previously been constructed from worms obtained from pigs in australia (jex et al., ) and in the united states (wang et al., ). the ascaris genome alv was found to be highly similar ( % identity) to the a. suum genome from worms collected from pigs in the united states. our mitochondrial and whole-genome analyses from an additional individual worms indicate that a. suum and a. lumbricoides form a genetic complex that is capable of interbreeding. our data support a model for a recent worldwide, multi-species ascaris population expansion caused by the movement of humans and/or livestock globally. ascaris from both pigs and humans may be important in human disease, necessitating a one-health approach to control the spread of human ascariasis.

human ascaris reference genome to promote comparative genomic analyses

to generate a human ascaris spp. germline genome assembly (prior to programmed dna elimination; wang et al., ), ovarian dna was sequenced from a single female worm, collected from a kenyan study participant who was presumed to be infected with a. lumbricoides, using illumina paired-end and mate-pair libraries of various insert sizes with a total sequence coverage of ~ fold (supplementary file ). three different assembly strategies were applied to these data. the de novo and semi-de novo strategies produced poor a. lumbricoides germline draft genomes ( table ). in the semi-de novo assembly, the majority of the > short contigs (making up . mb of sequence) that could not be incorporated into the assembly are sequences that aligned to the genome at multiple positions. comparison of the a. suum gene annotations to this assembly revealed a low a. lumbricoides gene number and high numbers of partial and split genes ( table , see footnote ). these characteristics are typical of highly fragmented genomes or genomes with high levels of mis-assemblies. mapping of the human ascaris reads to the a.
suum reference genome revealed an exceptionally high-sequence similarity (> % identity) between the two species with few human ascaris reads that could not be mapped to a. suum. based on this high-sequence similarity, a third reference-based-only assembly strategy was used to generate the human ascaris germline genome assembly using the a. suum germline genome as a reference (see materials and methods). this approach led to a reference-quality human ascaris genome assembly with many fewer gaps (only . mb of sequence) and no unplaced contigs. the ascaris genome assembled into scaffolds with a combined size of mb. an additional . mb of sequence was present in unscaffolded short contigs. the assembly n value was . mb, with the largest scaffold measuring . mb. the largest scaffolds combined to represent % of the genome. the assembly was further polished using additional illumina reads from the same worm to more accurately reflect single base differences, indels, and any potential local mis-assembled regions. to evaluate the quality of the assembled genome, we mapped the ascaris illumina reads back to the reference-based ascaris genome assembly and found that > % of the illumina reads could be mapped, indicating that the reference-based assembly excluded very few ascaris reads. we then mapped and transferred the extensive set of a. suum transcripts (jex et al., ; wang et al., ) to the human ascaris germline assembly to annotate the genome, identifying and classifying , protein-coding genes ( table , supplementary file ). as this reference-based assembly exhibits the best assembly attributes, including high continuity with a large n , low gaps and unplaced sequences, and high-quality protein-coding genes (see table ), we suggest that this version should be used as a reference germline genome for a human ascaris spp. specimen (available in ncbi genbank with accession number prjna ). the other two assemblies are available online. like a. suum embryos, a. 
lumbricoides embryos undergo programmed dna elimination during the differentiation of the somatic cells from the germline in early development (streit et al., ; wang and davis, ). in a. suum, ~ mb of bp tandem repeats and ~ germline-expressed genes are lost from the germline to form the somatic genome (wang et al., ; wang et al., ). we also sequenced the somatic genome from the intestine of the same female a. lumbricoides worm. comparison of the germline and somatic genomes revealed that dna elimination in the human ascaris sample (including the breaks, sequences, and genes eliminated) was identical to that described for the pig a. suum sample. earlier annotations of protein-coding genes for a. suum draft genomes were produced by jex et al., and wang et al., and were improved with a recently updated genome, although the focus of that study was not on protein annotations. here, we updated, identified, and fully annotated the , protein-coding genes in the reference-based genome assembly (supplementary file and figure -figure supplement ). our aims were to highlight the phylogenetic relationship with other helminths and between ascaris spp., to provide potential targets for future diagnostics to differentiate between nematodes and even between pig and human ascaris, and to detail the potential functions of hypothetical or unknown proteins in the ascaris genome. using a custom pipeline (see materials and methods and cotton et al., ), we classified % of the predicted proteome into functional groups ( figure a). although the remaining % ( ) of the genes were classified as unknown/uncharacterized, ( %) of these appear to encode proteins that have signatures indicative of either being secreted or being membrane-bound (some with gpi anchors). to provide a more comprehensive annotation of the transcriptomes of a. suum and a. lumbricoides, we re-mapped the rna-seq data from a. suum to the current gene models of a. lumbricoides (alv ) (supplementary file ).
we performed multivariate analyses of this revised rnaseq data compilation to generate a comprehensive rna-seq data set for differential gene expression in diverse stages/tissues (supplementary file ). phylogenetic trees derived from orthologue analyses of the predicted proteomes of alv with the predicted proteomes of other nematodes across all clades indicated the similarity among the published genomes of a. suum prjna and prjna in jex et al., ; wang et al., ; wang et al., and a. lumbricoides (international helminth genomes consortium, ) with alv within the ascaris branch ( figure c ). the variation observed within the ascaris spp. (with relatively weak bootstrap values of . - . ) is likely due to the differences in protein coding gene annotations and split genes seen in previous assemblies. we next took advantage of the abundant reads from the mitochondrial genome in our sequencing data (on average x coverage, see supplementary file ) to perform de novo assembly of complete human ascaris spp. mitochondrial genomes from individual worms (supplementary file ). these mitochondrial genomes were then annotated using sequence similarity to well-characterized and annotated mitochondrial genes. the mitochondrial cox- gene has been frequently used to infer evolutionary distances between species as well as between populations (cavallero et al., ; amor et al., ; springer et al., ; wiens et al., ; zardoya and meyer, ; zou et al., ) due to its rapid mutation rate, lack of recombination and relatively constant rate of change over time (brown et al., ; giles et al., ; harrison, ) . existing data suggest that mitochondria are inherited maternally in c. elegans (lim et al., ; zhou et al., ; sato and sato, ; wang et al., ) and ascaris . previous cox- phylogeny studies resolve ascaris spp. 
worms into three distinct clades: clade a is composed predominantly of worms isolated from pigs, clade b predominantly of worms isolated from humans, and clade c of worms isolated only from pigs in europe and asia (cavallero et al., ). interestingly, haplotype network analyses revealed that the majority of worms isolated from humans in the kenyan villages possessed cox- haplotypes that were consistent with infection by parasites from clade a ( / ), whereas only six specimens had cox- haplotypes consistent with infection by worms from clade b (figure -figure supplement and figure a ). when cox- sequences from the present study were compared against those from the ascaris species complex deposited at ncbi (see supplementary file and figure b ; cotton et al., ; criscione et al., ; godel et al., ; goldberg et al., ), seven unique cox- haplotypes from kenya were identified within clade a, which appeared to contain the majority of sequences not only from kenya but also from other localities. these appeared to be shared not only with other haplotypes from africa, but also with those from brazil. in contrast, clade b haplotypes appeared to be even more cosmopolitan, with the three haplotypes from kenya being shared not only with zanzibar, but also with haplotypes from brazil, denmark, china and japan. despite the distinct clustering of haplotypes into the three typical ascaris clades, there was very little genetic diversity among haplotypes within each of the clades, with the majority of haplotypes being separated by - nucleotide differences. there were greater levels of genetic divergence between clades; a and b were closer to each other while c was more distinct. similar findings were seen with nad- , the most variable gene in the mitochondrial genome. forty-seven snps were identified in the human ascaris mitochondrial genomes.
approximately a quarter of these variants were in non-coding portions of the mitochondrial genome and half were synonymous (supplementary file ). as with the cox- haplotype analyses, whole mitochondrial genome analysis distinguished two clades (clade a and clade b), but there were no distinct geographically specific sub-clades seen within either clade a or clade b ( figure b , table ). clade c was represented only by a single published sequence, which was used for comparison. to assess whether clades a and b represent two distinct molecular taxonomic units, and thus potentially different species, birky's x ratio was applied to provide a lineage-specific perspective on potential species delimitation. the ratio failed to differentiate clades a and b as distinct species, with k/θ < at . , indicating that ascaris is one large population and further supporting the lack of differentiation into separate species (supplementary file ). furthermore, there were no significant associations between mitochondrial sequence variations and other factors (e.g. village, household, time of worm collection, host) based on permanova (see methods and table ) after translating the phylogenetic tree into a distance matrix, suggesting not only a lack of differentiation into distinct species but also a potentially large interbreeding population of worms being transmitted between individuals and across villages. to account for a potentially large population of interbreeding worms, analyses to detect signatures of population expansion were performed. when the global mitochondrial genome data were compared, tajima's d was negative and significant (tajima's d − . ; p-value . ), indicating an excess of low-frequency polymorphisms within the global data set, which suggests population size expansion. although fu's fs was not significant, it was positive (fu's fs . ; p- .
) potentially indicating a deficiency in diversity as would be expected in populations that have recently undergone a bottleneck event. the same pattern was also seen in the kenyan sequences, but neither tajima's d nor fu's fs was significant (figure -figure supplement and supplementary file ). although there does appear to be a signature of a recent population expansion event in both the global and kenyan data, the lack of information on the mutation rates of ascaris and other nematodes prevents accurate dating of such an event. to quantify genetic variation in the ascaris worms isolated from infected kenyans, the nuclear genomes of the individual worms were analyzed to assess intraspecific population genetic diversity, heterozygosity, and ploidy. single-nucleotide polymorphisms (snps) and insertion/deletions (indels) across the nuclear genomes were assessed for the first largest scaffolds, which comprised % of the genome (see methods). each ascaris worm was sequenced to a mean coverage depth of ~ fold. a total of . million snp positions were identified in the first scaffolds among the ascaris nuclear genomes. approximately % of these variants were intergenic (supplementary file ). as an example, snps and indels in a single ascaris chromosome were plotted for two worms collected from humans in kenya and one worm from a pig in the united states ( figure -figure supplement ). the profiles and relative frequencies of snps and indels are highly consistent within individual worms, with an indel:snp frequency ratio of ~ : . a comparison of the variations identified between individuals infected with worms that had either a. lumbricoides-like or a. suum-like mitochondrial genomes illustrates that most of the differences appear to be random variations, and there do not appear to be major differences between a. lumbricoides-like and a. suum-like worms. a total of . million snps were unique to individual specimens, presumably representing genetic drift.
of the remaining . million snps, ~ % of these variant positions were present in fewer than five specimens, indicating that the ascaris genomes sequenced are ~ % polymorphic among the major alleles circulating within the species complex. to investigate the evolutionary pressures that account for the high snp diversity found among the sympatric worms, the ploidy, degree of heterozygosity (he) and allelic diversity were determined. worms were disomic, with little to no evidence of aneuploidy (figure -figure supplement ). the vast majority (> %) of snp positions were biallelic, and each worm had, on average, . million variant positions, of which approximately % were heterozygous snps (supplementary file ). snp density was determined in kb windows for each worm against the reference alv and a patchy, mosaic pattern was resolved. snp density was structured within the genome, with scaffolds being either snp poor or snp dense. for example, algv r was snp dense whereas algv r x was snp poor. in other scaffolds, alternating snp poor and snp dense regions were defined within the contig, with distinct transition points; see for example the first half of algv b , the last quarter of algv b , or the middle of algv r x ( figure a ).

figure (continued): mitochondrial genomes assembled from the kenyan worm specimens and all other published reference ascaris mitochondrial genomes; baylisascaris procyonis was used as the outgroup. the three major clades a, b, and c were identified by color hue, and the majority of the kenyan worms clustered in clade a. each village was represented by a distinct shape and unfilled shapes represented worms sequenced from specific villages post-anthelminthic treatment.

in those regions where snp density was low, the tajima's d statistic was net negative, indicating that allele frequencies within these regions were structured and more limited.
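the tajima's d statistic used above compares the mean number of pairwise differences (π) with the number of segregating sites (s). the following is a minimal sketch of the standard formula for a small alignment, not the authors' pipeline; the function name `tajimas_d` and the toy sequences are illustrative assumptions, and the sketch assumes equal-length aligned sequences without missing data.

```python
import math
from collections import Counter


def tajimas_d(seqs):
    """Return Tajima's D for equal-length aligned sequences (no missing data)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to equal length")

    seg_sites = 0      # S: number of segregating (polymorphic) columns
    pair_diffs = 0.0   # pairwise differences summed over all columns
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # number of sequence pairs that differ at this column
            pair_diffs += (n * n - sum(c * c for c in counts.values())) / 2
    if seg_sites == 0:
        return math.nan  # D is undefined without polymorphism

    pi = pair_diffs / (n * (n - 1) / 2)  # mean pairwise differences
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# toy alignments (hypothetical data): every variant is a singleton ...
rare = ["AAAAA", "TAAAA", "ATAAA", "AATAA", "AAATA", "AAAAT"]
# ... versus every variant at intermediate (3 of 6) frequency
balanced = ["AAAAA", "AAAAA", "AAAAA", "TTTTT", "TTTTT", "TTTTT"]
d_rare = tajimas_d(rare)
d_balanced = tajimas_d(balanced)
```

an excess of low-frequency variants (as in `rare`) pulls π below s/a1 and gives a negative d, which is the direction reported for the snp-poor regions; intermediate-frequency variants (as in `balanced`) give a positive d.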
genome-wide, homozygous snp regions were found to be unevenly distributed, with some scaffolds possessing long runs of homozygosity, see for example algv b , algv r x, algv r x, algv r x, algv r x, algv r x, algv r x (depicted by solid blue in figure b ), and these regions were net negative by the tajima's d test. conversely, heterozygous snps were less structured and appeared randomly distributed throughout the genome ( figure b ). overall, three genetic types were resolved by this analysis: in each genome, there existed snp-poor homozygous regions (colored blue) or snp dense regions, which either possessed homozygous alternate snps (also colored blue) or heterozygous snps (colored in 'red' or 'yellow' blocks depending on the density of heterozygous snps resolved in each kb block: one haplotype was similar to alv and the other was different). only one worm specimen ( _ ) was heterozygous genome-wide, and this track is depicted as 'red' across all scaffolds in the circos plot ( figure b ). a phylogenetic tree constructed using genome-wide snps with at least x coverage ( . million phased snps total) from ascaris worm specimens, including the a. suum reference genome, established that the kenyan specimens were more similar to each other than they were to the a. suum reference genome, which had many more unique snps ( figure a). notably, the nuclear genomes from the worms that possessed a. lumbricoides-like mitochondrial genomes did not form a separate clade, indicating that the nuclear genomes were incongruent with the mitochondrial genomes, and likely recombinant. a co-ancestry heatmap was generated among the sympatric ascaris, and this analysis divided the genome into discrete segments and clustered samples along the diagonal based on the greatest number of shared ancestral blocks using the nearest neighbor algorithm from finestructure. the ascaris genomes resolved as clusters that possessed high frequency nearest-neighbor, or shared ancestry, relationships. in contrast, the a. suum reference genome and specimen _ were anomalous, likely the result of their excess heterozygosity due in part to elevated numbers of unique snps.

figure legend (fragment): outside track (red histograms) shows the total snp diversity across the genome (first largest scaffolds) in kb sliding windows. blue bar plot indicates the measured degree of polymorphism (π) (nei and li, ) within the ascaris population in kb sliding windows. the innermost track with black-green histogram plots the tajima's d values, which reflect the difference between the mean number of pairwise differences (π) and the number of segregating sites, using a sliding window of kb. (b) the circos-plot of the genome-wide distribution of heterozygous and homozygous snps in kb blocks identified long stretches of homozygosity among the different ascaris specimens, except _ , which is predominantly heterozygous throughout and was isolated from village . red color = > % of heterozygous snps, blue = > % of homozygous snps, yellow = % heterozygous, % homozygous snps. each track represents a single specimen.

figure legend (fragment): [...] (dunn, ). (e) population genetic structure and admixture clustering analysis of the ascaris genomes obtained by popsicle using k = different color hues in the innermost concentric circle of the circos plot. the middle concentric circle shows the relative percentage of each genetic ancestry within each genome (represented by the color hues for k = ). the outermost concentric circle shows the genome-wide local admixture profile of each worm in kb sliding windows. the following geometric shapes represent villages, and the color for each shape identifies the mitochondrion genome each sample possesses: black = a. suum; red = a. lumbricoides; circle = village ; square = village ; upside triangle = village ; downside triangle = village ; diamond = village .

notably, nine worm specimens did not coalesce into a cluster with shared ancestry. closer examination of these specimens indicated that their phased genomes possessed limited allelic diversity and were highly recombinant ( figure b ). this genetic mosaicism was readily resolved by fluctuating intra-scaffold genealogies established using a sliding-window neighbor-joining topology that identified regions with incongruent tree topologies. see for example the trees generated at the scaffolds algv b , algv b , and algv r . indeed, the pairwise snp and fst estimates for these specimens identified segments where snp density was low, but fst was elevated with respect to neighboring segments (see block in algv b ), and the most parsimonious explanation for these results is that recombination of a limited number of distinct alleles had occurred in the regions of increased fst (figure b and c). to estimate the number of supported ancestries (k) that could be resolved in the ascaris genomes sequenced, we calculated the dunn index, which supported - ancestral populations ( figure d). a gradual increase in the dunn index after k = was observed for an ancestral population size between and ( figure d and figure -figure supplement ). we next used popsicle to calculate the number of clades present within each kb sliding window. local clades were represented with a different color and painted across the genome to resolve ancestry.
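the dunn index mentioned above scores a clustering by dividing the smallest distance between members of different clusters by the largest distance between members of the same cluster, so higher values indicate compact, well separated groups. the sketch below illustrates that definition only; it is not the popsicle implementation, and the function name `dunn_index`, the one-dimensional toy points, and the label vectors are assumptions made for illustration.

```python
import math


def dunn_index(dist, labels):
    """Dunn index: smallest between-cluster distance / largest within-cluster distance."""
    n = len(labels)
    if len(set(labels)) < 2:
        raise ValueError("need at least two clusters")
    min_between = math.inf
    max_within = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                max_within = max(max_within, dist[i][j])
            else:
                min_between = min(min_between, dist[i][j])
    if max_within == 0:
        return math.inf  # all clusters are single points or have identical members
    return min_between / max_within


# hypothetical samples placed on a line; distance = absolute difference
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]
good = dunn_index(dist, [0, 0, 1, 1])  # neighbours grouped together
poor = dunn_index(dist, [0, 1, 0, 1])  # neighbours split apart
```

in practice one would compute the index for each candidate k on a genetic distance matrix and compare the values, which is the kind of comparison the text describes when choosing the number of ancestral populations.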
the snp diversity plots across the specimens identified three major 'parentage blocks' that were resolved as belonging to alv or were genetically distinct with either both haplotypes sharing the alternate parent (homozygous alternate), or were heterozygous between the two parental haplotypes for the majority of the specimens ( figure e , middle circos plot. color hues cyan, orange, aqua). to visualize such shared ancestry across the different ascaris specimens at chromosome resolution, a color hue representing a local genetic 'type' present was assigned and integrated to construct haplotype blocks across each chromosome for the ancestries present. chromosome painting based on shared ancestry revealed a striking mosaic of large haplotype blocks of different admixed color hues, consistent with limited genetic recombination between a low number of parentage haplotypes. these admixture patterns were readily visualized by shared color blocks between different specimens across entire scaffolds including algv r x ( figure a ) and algv r x ( figure b) . in low complexity regions such as the left portion of contig algv r x, only three major haplotypes were resolved ( figure a) . strikingly, within each of the six clades resolved, all worm specimens showed a limited, mosaic fingerprint of introgressed sequence blocks indicating that recombination has shaped the population genetic structure among the ascaris specimens sequenced. examples of both chromosomal segregation and recombination were seen. for example, specimens e_ and f_ shared the same chromosome at algv r x, but entirely different chromosomes at algv r x, whereas specimens _ , _ and f_ were identical except at the subtelomeric end of algv r x. in this region two admixture blocks were resolved; _ and f_ remained similar to each other but _ now possessed a sequence block that was shared with specimen _ . 
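the step of turning per-window ancestry assignments into haplotype blocks can be pictured as merging consecutive windows that carry the same local genetic type. the sketch below shows that idea under simplifying assumptions: windows are treated as adjacent and non-overlapping, `collapse_blocks` is a hypothetical helper rather than part of popsicle, and the window size of 10 kb is a placeholder because the value is not given in this text.

```python
from itertools import groupby


def collapse_blocks(window_labels, window_kb):
    """Merge runs of identical per-window labels into (start_kb, end_kb, label) blocks."""
    blocks = []
    start = 0  # index of the first window in the current run
    for label, run in groupby(window_labels):
        length = sum(1 for _ in run)
        end = start + length
        blocks.append((start * window_kb, end * window_kb, label))
        start = end
    return blocks


# hypothetical local ancestry calls along one scaffold for one specimen
example = collapse_blocks(["A", "A", "B", "B", "B", "A"], window_kb=10)
```

comparing the resulting block lists between two specimens shows where they share a block and where the shared ancestry switches, which is how the admixture patterns are read from the painted chromosomes.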
this extensive chimeric pattern in chromosome painting also closely resembled the genome-wide hierarchy tree ( figure a) . the data support a model in which the specimens are genetic recombinants between a. suum and a. lumbricoides that are predominantly inbreeding. to examine genetic clustering of worms in individual human hosts, host households and villages, and study time-points, we statistically compared genetic variation within groups (such as within a village) versus between groups (such as between villages). we found significant genetic separation between worms in different villages (table , figure ), although worms from kenya clustered with worms from around the world based on cox- , rather than predominantly with each other (figure a ). this suggests genetic diversity is present in the population of ascaris in these kenyan villages, which is similar to the diversity of populations of ascaris around the world. it also suggests that a high proportion of ascaris transmission may occur within villages in this kenyan setting. there was no evidence from this analysis that the worms collected three months after albendazole treatment were any different than the worms collected prior to albendazole treatment ( table ) . to expand on our observations that genetically similar worms are found around the world, but that similar worms cluster within a village, based on our nuclear snps data, we plotted genetic distances against geographic distances. surprisingly, we found no significant correlations between genetic and geographic distance, neither across all five studied villages nor within the two most heavily parasitized villages ( figure -figure supplement ). in this study, we generated a high-quality reference genome from a single worm presumed to be human a. lumbricoides. our comparative phylogenomic analyses of this new ascaris spp. genome against existing draft genomes of a. lumbricoides and a. suum suggest that a. suum and a. 
lumbricoides form a genetic complex that is capable of interbreeding, which has apparently undergone a recent worldwide, multi-species ascaris population expansion. our phylogenetic analysis on the complete mitochondrial genomes (from worms collected from human hosts in kenya and other available sequences) suggests that the worms collected in kenya mirror the separation into clade a (worms from pigs in non-endemic regions and humans in endemic regions) and clade b (worms from humans and pigs from endemic and non-endemic regions) described elsewhere (cavallero et al., ) . it is likely that worms in both these clades are being transmitted from human to human, as pig husbandry is rare in this area of kenya. patterns may differ by locality, and it is possible that some of the pig-associated (a. suum-like) worms circulating in this human population in kenya were acquired, perhaps generations ago, by humans who lived in closer proximity to pigs. it is also possible that these worms were acquired from non-human primates (nejsum et al., ) , or some other ascaris host, rather than from pigs. however, the snps across the whole nuclear ascaris genome provide significantly greater power in understanding ascaris speciation. importantly, our nuclear genome snp analysis suggests that the kenyan ascaris are distributed across multiple clades in a phylogeny based on the nuclear genomes. overall, data from our study and other studies are consistent with a pattern where hybrid our study represents one of the most detailed accounts of mito-nuclear discordance in nematodes echoing patterns seen in another human nematode: onchocerca volvulus (choi et al., ) . the data in our current study show the occurrence of distinct mitochondrial lineages that could be evidence of early stages of species differentiation. the admixture seen within the nuclear genome, however, appears to disrupt the establishment of defined molecular speciation barriers between the different ascaris lineages. 
such patterns have been recorded in other parasites, including o. volvulus (choi et al., ) , the blood fluke schistosoma (lawton et al., ) and the protist leishmania (kato et al., ) . each of these studies has implicated definitive hosts in the movement of parasites between otherwise isolated populations, allowing interbreeding to take place. it is most likely the historical movement of humans and their domesticated livestock that has mediated the transport of ascaris between localities, allowing for extensive interbreeding as shown by the nuclear genomes and resulting in the discordance observed between the mitochondrial and nuclear genomes in our study. at a more local scale, the insights into the human transmission dynamics of ascaris showing clustering both within an individual and in villages suggest that villages are appropriate units for interventions and that people are infected with multiple eggs from a single source. these findings are in line with clustering at the village level found in guatemala and at the sub-village level in nepal (criscione et al., ) , but not in line with the lack of small-scale geographical structuring found in denmark, zanzibar and uganda (betson et al., ; betson et al., ; nejsum et al., a) . differences could be a result of different patterns in human and livestock movement (betson et al., ) . although the current genome is, by far, the most continuous assembly for ascaris, it is not a full chromosome assembly due largely to repetitive sequences, in particular bp tandem repeat clusters and long stretches of subtelomeric repeats. thus, it is possible that mis-assembly in some scaffolds has increased the frequency of mosaicism detected. it is for this reason that the comparative analyses on the nuclear genome was restricted to the largest scaffolds, most of which are at chromosomal resolution, with only minor localized variation due to the repeat clusters. 
in these high confidence scaffolds, large haplotype blocks possessing either a. suum, a. lumbricoides or both parental haplotypes (heterozygous) were readily resolved, indicating that the genetic mosaicism observed could not be solely attributed to genome mis-assembly. ultimately, future studies using ultralong pacbio (rhoads and au, ) or nanopore (branton et al., ) sequencing combined with chromosome conformation capture (hi-c) techniques (belaghzal et al., ) will improve the genome to full chromosome assembly and more accurately resolve the true extent to which recombination has impacted the population genetic structure of the ascaris species genetic complex. the finding that a. suum and a. lumbricoides form a genetic complex has important public health implications. reduced treatment efficacy is not currently a common issue in ascaris infections among humans or pigs (vercruysse et al., ; zuccherato et al., ), although low efficacy of benzimidazoles is an issue for trichuris trichiura in humans (diawara et al., ; furtado et al., ; olsen et al., ) and various intestinal nematodes of veterinary importance (jaeger and carvalho-costa, ; kaplan and vidyashankar, ; wolstenholme et al., ). extensive albendazole use in either human or pig populations could lead to resistance in both populations, if cross-species infections are common and produce fertile offspring. this study suggests that research and public health interventions targeting a. lumbricoides and a. suum should be more closely integrated, and that extensive work done by the veterinary research community may be highly relevant to mass deworming campaigns that seek to improve human health. the similarity between ascaris from different countries and from different vertebrate hosts suggests that ascaris infection has spread rapidly around the world, leaving little time for it to differentiate.
taken together, these findings have very important implications for parasite control and elimination efforts that only focus on mass deworming of humans for ascaris. the ability of pig-associated worms to become endemic in human populations indicates that a one-health approach may be necessary for the control of ascaris. the covid- pandemic has highlighted the importance of one health approaches to zoonotic diseases (global burden of disease, ); we must use a one health approach to ensure that pigs do not serve as a reservoir and potential breeding ground for drug resistance in a parasite that can sustain community transmission in humans. worms were expelled as part of a larger study in rural western kenya described previously (easton et al., ). worms collected from study participants in five villages ( figure -figure supplement ) following treatment with mg albendazole were isolated, washed, labeled and stored frozen (- °c). the villages were near the town of bungoma, located at n . , e . . temperatures ranged from °c to °c and rainfall is mm on average. chicken, sheep and cattle farming are common, as are subsistence agriculture and the growing of sugar cane as a cash crop. the primary spoken language is bukusu, a dialect of luhya. all samples were stored in kisumu, from which they were subsequently transported to the kemri-cdc offices until they were shipped to the nih (bethesda, md, usa) on dry ice. a modified dna extraction method was developed based on phenol/chloroform and qiagen methods (available on request) and used on samples (supplementary file ) . for the five germline samples, dna was extracted from the uterus, oviduct or ovary of the worms. for the remaining samples, dna was extracted from somatic tissue: the body wall or the intestine.
our previous work did not reveal any differences between a variety of somatic samples including the intestine and muscle (wang et al., ) , thus we do not expect any significant variations in the muscle and intestine genomic dna used in this study. paired-end genome libraries -sixty-eight a. lumbricoides dna samples were sequenced using illumina hiseq (www.illumina.com) short-read paired-end sequencing. dna was quantified by uv spec and picogreen. a ng of dna based on picogreen quantification was used as template for ngs library preparation using the truseq nano dna sample library prep kit without modification. primer-dimers in the libraries were removed by additional ampure beads purification. sequencing was performed to obtain a minimum genomic depth of x coverage for each sample. mate-pair genome libraries -two samples were selected for mate-pair sequencing, based on the quality of the dna preparation. three independent dna isolations (corresponding to what region of the worm or what is the sample for dna isolation) from specimen ' _ . ' were combined to obtain one mg dna input. the mate-pair libraries were generated using the nextera mate pair library prep kit, following the gel-free method with the only modification that m- streptavidin binding beads were used instead of m- beads. the libraries were amplified for cycles given the low dna input going into the circularization phase. the mate-pair fragment size averaged kb with a range of - kb fragments. the a. lumbricoides germline genome assembly was constructed using the a. suum genome as a reference. briefly, sequencing reads from a single a. lumbricoides worm (libraries # , # , and # ) were mapped to the a. suum germline genome assembly using bwa (li and durbin, ) to generate bam and mpileup alignment files. the mpileup files were processed with a perl script that replaced all variation sites in the reference genome with the highest allele frequencies in the a. lumbricoides sample. a. 
suum genomic regions that represent < x of a. lumbricoides reads coverage were excluded from the assembly. we further polished the genome with additional illumina sequencing reads using pilon and its default parameters (walker et al., ) . the a. lumbricoides genome was annotated using the gene models built for a. suum, using the annotation transfer tool ratt (otto et al., ) . the protein coding regions were defined using transdecoder (https://github.com/transdecoder/transdecoder/wiki; haas and papanicolaou, ) . to evaluate the gene expression across all stages, we utilized previous rnaseq data from the developmental stages (wang et al., ; wang et al., ) , re-mapped the sra from adult males, females, l and l stages (jex et al., ) to the current gene models, and quantified the expression using tophat and cufflinks. the re-mapped reads, analyzed by jmp genomics (sas) across all the stages and based on the principal component analyses ( figure b) , were grouped as adult male, adult female, l , l , l (egg l , liver l and lung l ), l , carcass, muscle, intestine, embryonic (zygote , zygote , zygote , zygote , hr, hr, hr, hr, d, d), ovaries (female mitotic region, female early pachytene, female late pachytene, female diplotene and oocyte) and testis (male mitotic region, spermatogenesis, post meiotic region, seminal vesicles and spermatids). proteome and comparative genomics analyses were done using an in-house pipeline (karim et al., ) . automated annotation of proteins was done as described earlier (cotton et al., ) and based on a vocabulary of nearly words found in matches to various databases, including swissprot, gene ontology, kog, pfam, and smart, refseq-invertebrates and a subset of the genbank sequences containing nematode protein sequences, as well as the presence or absence of signal peptides and transmembrane domains. 
signal peptide, secretomep, transmembrane domains, furin cleavage sites, and mucin-type glycosylation were determined with software from the center for biological sequence analysis (technical university of denmark, lyngby, denmark) (duckert et al., ; julenius et al., ; sonnhammer et al., ) . classification of kinases was done by kinannote . interproscan (jones et al., ) analyses were done using the standalone version . . allergenicity of proteins was predicted by allerdictor (dang and lawrence, ) , fuzzyapp (saravanan and lakshmi, ) and allertop (dimitrov et al., ) . genes that had blast scores < % of max possible score (self-blast) in other non-ascaris nematodes with an e-value greater than e- were considered as 'unique'. [jex et al., ; wang et al., ; wang et al., ] , brugia malayi [ghedin et al., ] , caenorhabditis elegans [c. elegans sequencing consortium, ], dirofilaria immitis [godel et al., ] , loa loa [desjardins et al., ; tallon et al., ] , necator americanus [tang et al., ] , onchocerca volvulus [cotton et al., ] , strongyloides ratti [nemetschke et al., ] , strongyloides stercoralis [hunt et al., ] , toxocara canis [international helminth genomes consortium, ; x.-q. zhu et al., ] , trichinella spiralis [korhonen et al., ; mitreva et al., ] , trichuris trichiura [foth et al., ] , and wuchereria bancrofti [international helminth genomes consortium, ; small et al., ] were analyzed using orthofinder (emms and kelly, ) . the estimated phylogenetic tree was graphed using figtree v . . further manual annotation was done as required. the data were mapped into a hyperlinked excel spreadsheet as previously described (bennuru et al., ) , available in supplementary file . the illumina paired-end sequence reads of the ascaris whole genomes were trimmed by removing any adapter sequences with cutadapt v . (martin, ) , then low-quality sequences were filtered and trimmed using the fastx toolkit (http://hannonlab.cshl.edu/fastx_toolkit/).
remaining reads were then mapped to the a. lumbricoides alv reference genome (described in this paper) using either bowtie v . . (langmead and salzberg, ), with the very-sensitive, no-discordant, and no-mixed settings, or the burrows-wheeler aligner (bwa, v . . ) (li and durbin, ) mem with default parameters, and then converted into bam files and sorted with samtools (li, ) . sorted reads were soft-clipped and duplicates were marked using picard- . . (http://broadinstitute.github.io/picard; broad institute, ) . single-nucleotide polymorphisms (snps) were obtained using samtools (li, ) and bcftools (narasimhan et al., ) using the mpileup function and -ploidyfile feature, taking chromosomal ploidies into account. snps were also determined using the genome analysis toolkit (gatk) (mckenna et al., ) . snps were called by gatk haplotypecaller with a read coverage ≥ x and a phred-scaled snp quality of ≥ . mapping statistics were generated in perl and awk. the ploidy of each specimen was calculated using ageless software (http://ageless.sourceforge.net/) by dividing the chromosomes into kb sliding windows and averaging the coverage within each window. windows with zero coverage, attributable to sequencing noise or repeat regions, were excluded from further analyses (inbar et al., ) . snps, pi (nei and li, ) , tajima's d (tajima, ) , and fst (dunn, ) values were calculated using vcftools (danecek et al., ) in kb sliding windows and plotted using either circos (krzywinski et al., ) or the ggbio (http://bioconductor.org/packages/release/bioc/html/ggbio.html) and variantannotation (http://bioconductor.org/packages/release/bioc/html/variantannotation.html) r packages (v. . . , url http://www.r-project.org). the proportions of heterozygous and homozygous snps were estimated in kb sliding windows using custom java scripts to generate histogram plots in circos (krzywinski et al., ) .
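the window-based coverage step described above can be sketched as follows. this is a minimal illustration, not the ageless implementation: the function names, the step parameter, and the normalization used in relative_ploidy (chromosome mean divided by genome-wide median, times an assumed baseline ploidy of 2) are assumptions introduced here, and the window size is left to the caller because the value is not given in the text.

```python
from statistics import mean, median


def window_means(depths, window, step=None):
    """Average per-base depth in windows and drop zero-coverage windows.

    depths: list of per-base read depths for one chromosome.
    window: window length in bases (value chosen by the caller).
    step:   distance between window starts; defaults to `window`
            (non-overlapping windows), which is an assumption.
    """
    if not depths:
        return []
    step = step or window
    kept = []
    for start in range(0, max(len(depths) - window + 1, 1), step):
        avg = mean(depths[start:start + window])
        if avg > 0:  # zero-coverage windows are excluded, as in the text
            kept.append((start, avg))
    return kept


def relative_ploidy(chrom_window_means, genome_window_means, baseline=2):
    """Hypothetical normalization: chromosome mean over genome median,
    scaled by an assumed baseline ploidy."""
    return mean(chrom_window_means) / median(genome_window_means) * baseline


if __name__ == "__main__":
    toy_depths = [4, 4, 0, 0, 8, 8]
    print(window_means(toy_depths, window=2))
```

keeping the zero-coverage filter inside the windowing function means every downstream statistic sees the same set of windows.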
red and blue colors indicate the presence of % or more heterozygous and homozygous snps respectively whereas yellow color was assigned otherwise. the snp data (vcf file) was first phased accurately to estimate the haplotypes using shapeit (delaneau et al., ) after keeping only biallelic snps and loci with less than % missing data. co-ancestry heatmaps were generated using the linkage model of chromopainter (lawson et al., ) and finestructure (http://www.paintmychromosomes.com) based on the genome-wide phased haplotype data. for finestructure (version . ) (lawson et al., ) , both the burn-in and markov chain monte carlo (mcmc) after the burn-in were run for iterations with default settings. inference was performed twice at the same parameter values. population genetic structure was constructed using popsicle (shaik et al., ) by comparing specimens against the reference sequence alv in kb sliding windows with the number of cluster k = to and then use the dunn index (dunn, ) to calculate the optimal number of clusters. after calculating the optimal number of clusters, popsicle assigned each block to the existing or new clades depending on population structure of specimens and the ancestral state of each block followed by painting in circos plot (krzywinski et al., ) with color assignment based on number of clusters. in order to determine the phylogenetic relationship between samples, we selected base positions where variants were detected in a representative sample vs the reference (alv ), and where each sample had at least x coverage for each locus. using this list, the base calls for each sample were pooled together to generate a single multi-sequence fasta file. next, both maximum likelihood (ml) trees and bootstrap (bs) trees were generated with a final 'best' tree generated from the best scoring ml and bs trees using raxml v . . (stamatakis, ). the tree was visualized in figtree v . . (http://tree.bio.ed.ac.uk/software/figtree/). 
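the construction of the multi-sequence fasta file used for tree building can be sketched as below. this is a simplified illustration under stated assumptions: the dictionary-based inputs, the function names, and the use of "N" for a missing call are hypothetical choices made here, and the minimum depth is a parameter because the threshold is not given in the text.

```python
def shared_variant_positions(variant_positions, coverage, min_depth):
    """Keep variant positions at which every sample reaches min_depth.

    variant_positions: iterable of positions that differ from the reference
                       in the representative sample.
    coverage: {sample_name: {position: depth}} (hypothetical input layout).
    """
    return [
        pos for pos in sorted(variant_positions)
        if all(depths.get(pos, 0) >= min_depth for depths in coverage.values())
    ]


def build_fasta(base_calls, positions):
    """Concatenate each sample's base calls at the retained positions.

    base_calls: {sample_name: {position: base}}; a missing call becomes "N".
    """
    records = []
    for sample in sorted(base_calls):
        seq = "".join(base_calls[sample].get(pos, "N") for pos in positions)
        records.append(f">{sample}\n{seq}")
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    cov = {"w1": {10: 12, 20: 3, 30: 15}, "w2": {10: 11, 20: 14, 30: 10}}
    calls = {"w1": {10: "A", 30: "G"}, "w2": {10: "T", 30: "G"}}
    keep = shared_variant_positions({10, 20, 30}, cov, min_depth=10)
    print(build_fasta(calls, keep))
```

the resulting file has one equal-length sequence per sample, which is the kind of input a maximum likelihood tree builder such as raxml expects.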
similarity within and between worms from different villages, households, people and time-points was analyzed based on the distance matrix of the patristic distances from the phylogenetic tree described above, using permutational multivariate analysis of variance (adonis vegan in r). the distance matrix underlying the phylogenetic tree was analyzed in order to measure the significance and contribution of different factors to variance between samples. each factor (village, household, host and time-point) was analyzed both separately and sequentially. the sequence chosen was ordered based on significance of each factor when tested individually. since multiple groupings were considered using the same dataset, multiple comparison corrections were applied. sample sizes and descriptions of each group are shown in table . similar methods were used to analyze the mitochondrial phylogeny along the same groupings. we assembled mitochondrial genomes using a de novo approach from individual ascaris genomes. for each individual, the ascaris mitochondrial reads in the total dna sequencing were identified by mapping the ascaris reads to the a. suum reference mitochondrial genome (genbank accession: nc_ ). adaptor sequences were trimmed prior to de novo assembly. to reduce the complexity of the de novo assembly, we randomly sampled x reads from each individual (the use of higher read coverage often resulted in fragmented scaffolds) and assembled these reads using the spades assembler (bankevich et al., ) with continuous k-mer extension from k = to the maximum k-mer allowed (average extended k-mer size = ). the assembled scaffolds were corrected with the built-in tool in spades to reduce potential assembly artifacts. next, the assembled scaffolds were aligned to the a. suum mitochondrial reference genome using blast, the order of the scaffolds was adjusted, and they were joined into a single scaffold. 
finally, the gaps in the scaffold were filled using gapfiller (boetzer and pirovano, ) using mitochondrial reads from the same individual to generate a complete mitochondrial genome. using the same method, we also de novo assembled another five a. suum or a. lumbricoides mitochondrion genomes from previous studies (see supplementary file ). in order to assess overall evolutionary relationships across the complete mitochondrial genomes, we aligned the genomes using clustal w, and phylogenetic trees were constructed using raxml under the general time reversible (gtr) model as described above for the whole genome snp alignment. subsequent tree files were formatted in figtree and mega v . the variation in nucleotide diversity across the mitochondrial genome was measured using sliding window analyses, with a window of bp and a step of bp, using dnasp v (rozas et al., ) . in order to assess the validity of potential species groupings in the ml phylogenetic tree, the birky, x ratio was applied to the alignment of the complete mitochondrial genomes, including both samples from kenya and published mitochondrial reference genomes from tanzania, uganda, china, usa, denmark, and the uk. the x ratio method of species delimitation compares the mean pairwise difference between two distinct clades (k) with the mean pairwise difference within each of the clades being compared (θ). a k/θ ratio > is considered indicative of the two clades representing two distinct species. because two clades are being compared, there are two separate values of θ; as per the recommendations of birky, , the larger θ value is used in the final ratio calculation, as this provides a more conservative result that is less likely to be a false positive.
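the ratio described above can be sketched as follows. this is a simplified reading of the description in the text (uncorrected mean pairwise differences, with the larger within-clade value as the denominator) and not necessarily the exact published procedure, which may apply additional corrections. the function names are hypothetical, and the decision threshold is deliberately left to the caller because its value is not given in the text.

```python
from itertools import combinations, product


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    skipping columns where either sequence has a gap or N."""
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else 0.0


def mean_within(clade):
    """Mean pairwise difference among sequences of one clade (needs >= 2)."""
    pairs = list(combinations(clade, 2))
    if not pairs:
        raise ValueError("a clade needs at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(clade_1, clade_2):
    """Mean pairwise difference between sequences of two clades (K)."""
    pairs = list(product(clade_1, clade_2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def between_within_ratio(clade_1, clade_2):
    """K divided by the larger of the two within-clade values."""
    k = mean_between(clade_1, clade_2)
    within = max(mean_within(clade_1), mean_within(clade_2))
    return float("inf") if within == 0 else k / within


if __name__ == "__main__":
    clade_a = ["AAAAAAAAAA", "AAAAAAAAAT"]
    clade_b = ["TTTTTAAAAA", "TTTTTAAAAT"]
    print(between_within_ratio(clade_a, clade_b))
```

taking the maximum of the two within-clade values makes the ratio smaller, so the test errs toward not splitting the clades.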
due to the extensive use of mitochondrial genome data in population genetic analyses of ascaris, several analyses were performed to identify the effect of any population level processes that may be affecting the diversity of the parasites within kenya. initially, diversity indices were calculated for each of the genes within the mitochondrial genome across the entire kenyan data set as well as considering the mitochondrial genome as a whole. in order to account for the diversity within the genic regions, we removed non-coding and trna sequences for these analyses. to provide a genealogical perspective of population structure of the kenya ascaris worm specimens, we constructed the most parsimonious haplotype network based on the protein coding sequences using the tcs algorithm as implemented in popart (leigh and bryant, ) . further population genetic analyses were also performed to detect the occurrence of selection on the protein coding genes of the mitochondrial genome and if there were any major departures from neutrality. standard dn/ds ratios were performed to identify the presence of positive selection where both measures equate to = neutral, > = positive selection, < = purifying selection. both tajima's d and fu's fs were calculated to identify any substantial departure from neutrality which could be indicative of population expansion events (supplementary file ). all described analyses were performed using dnasp (rozas et al., ) . as both cox- and nad- have been used in the past for epidemiological studies, single gene phylogenies were also constructed as described previously for comparison against the whole mitochondrial genome phylogeny (figure -figure supplement ) . 
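for readers unfamiliar with the neutrality statistic used above, tajima's d can be computed from an alignment as sketched below. the analyses in the text were run in dnasp; this standalone function is only an illustration of the standard formula, and its name and input format (a list of equal-length aligned strings without gaps) are assumptions made here.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D from equal-length aligned sequences (no gap handling)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least four sequences")
    # S: number of segregating (polymorphic) alignment columns
    s = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if s == 0:
        raise ValueError("no segregating sites; D is undefined")
    # pi: mean number of pairwise differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    # constants from the standard definition of the statistic
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    # every variant is a singleton, i.e. an excess of low-frequency alleles
    print(tajimas_d(["AAAA", "AAAT", "AATA", "ATAA"]))
```

a negative value, as in this toy alignment where every variant is carried by a single sequence, reflects an excess of low frequency polymorphisms.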
owing to the extensive use of the cox- gene for epidemiological studies, the gene was extracted from the complete mitochondrial genomes from kenya and compared to all other available ascaris lumbricoides and ascaris suum cox- sequences housed by ncbi, representing populations from across the globe. haplotype network analysis was performed to produce the most parsimonious network using tcs as implemented in popart (leigh and bryant, ) . this provided a genealogical perspective of population structure and allowed genetic connectivity between the kenyan samples and samples from other locations to be assessed. the funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. ethics human subjects: this study was approved by the ethics review committee of the kenya medical research institute (scientific steering committee protocol number ) and the imperial college research ethics committee (icrec_ _ _ ). informed written consent was obtained from all adults and parents or guardians of each child. minor assent was obtained from all children aged - . anyone found to be infected with any sth was treated with mg alb during each phase of the study, and all previously-untreated village residents were offered alb at the end of each study phase. supplementary files . supplementary file . characteristics of genome assemblies. reference a. lumbricoides genomes generated as part of this study ( and ) are compared with reference genomes for a. suum generated previously ( and ). . supplementary file . proteome annotation. while ~ . % of the genes can be transferred to both genomes, over % of the transferred genes are only partial matches and are fragmented, supporting the view that the de novo and semi de novo a. lumbricoides assemblies are highly fragmented. . supplementary file .
description of worm from which each sample was sequenced. the sex of the worm (based on morphological identification) and the part of the worm (germline vs somatic) are listed. some hosts donated multiple worms. . supplementary file . cox- haplotype list. . supplementary file . x ratio analyses of clades a and b using complete mitochondrial genomes used to construct the phylogeny in figure b . . supplementary file . demographic analyses using tajima's d and fu's fs statistics across complete mitochondrial genomes to detect the signature of population expansion events. whether all sequences collected globally or only sequences collected in kenya as part of this study were examined, the tajima's d value was negative and significant (indicating an excess of low frequency polymorphisms) and the fu's fs was positive but not significant (potentially indicating a deficiency in diversity, as would be expected in populations that have recently undergone a bottleneck event). . supplementary file . number of heterozygous and homozygous snps in each of the sequenced worms from kenya. . supplementary file . reference mitochondrion genomes. . supplementary file . supplement to table using alternative measures of phylogenetic distance. . transparent reporting form. data are available under the national center for biotechnology information (ncbi) bioproject numbers prjna for raw sequencing data and prjna for the genomic assembly. links to all genome assemblies are available at: all (https://s .amazonaws.com/proj-bip-prod-publicread/hisomics/alv /genome_assembly/genome_assemblies.tar.gz), de novo (https://s .amazonaws.com/proj-bip-prod-publicread/his-omics/alv /genome_assembly/al-version -genome-assembly.fasta.
gz), semi-de novo (v ) (https://s .amazonaws.com/proj-bip-prod-publicread/his-omics/alv / genome_assembly/al-version -genome-assembly.fasta.gz), v -(https://s .amazonaws.com/projbip-prod-publicread/his-omics/alv /genome_assembly/al-version -genome-assembly.fasta.gz), v -(https://s .amazonaws.com/proj-bip-prod-publicread/his-omics/alv /genome_assembly/al-ver-sion -genome-assembly.fasta.gz), v -(https://s .amazonaws.com/proj-bip-prod-publicread/hisomics/alv /genome_assembly/al-version -genome-assembly.fasta.gz), v -(https://s .amazonaws.com/proj-bip-prod-publicread/his-omics/alv /genome_assembly/al-version -genomeassembly.fasta.gz), mitochondrial -(https://s .amazonaws.com/proj-bip-prod-publicread/his-omics/ alv /genome_assembly/mitochondrial_genomes.tar.gz). the following datasets were generated: high prevalence of strongyloides stercoralis in school-aged children in a rural highland of north-western ethiopia: the role of intensive diagnostic work-up ascaris infections in humans from north america: molecular evidence for cross-infection mitochondrial dna and ascaris microepidemiology: the composition of parasite populations from individual hosts, families and villages host specificity, evolutionary relationships and macrogeographic differentiation among ascaris populations from humans and pigs spades: a new genome assembly algorithm and its applications to single-cell sequencing toward the goal of soil-transmitted helminthiasis control and elimination hi-c . 
: an optimized hi-c procedure for high-resolution genome-wide mapping of chromosome conformation stage-specific proteomic expression patterns of the human filarial parasite brugia malayi and its endosymbiont wolbachia a molecular epidemiological investigation of ascaris on unguja, zanzibar using isoenyzme analysis, dna barcoding and microsatellite dna profiling genetic diversity of ascaris in southwestern uganda from the twig tips to the deeper branches molecular epidemiology of ascariasis: a global perspective on the transmission dynamics of ascaris in people and pigs ascaris lumbricoides or ascaris suum : what s in a name species detection and identification in sexual organisms using population genetic theory and dna sequences toward almost closed genomes with gapfiller nanoscience and technology: a collection of reviews from nature journals broad institute, github repository estimating the global distribution and disease burden of intestinal nematode infections: adding up the numbers-a review rapid evolution of animal mitochondrial dna the international bank for reconstruction and development / the world bank genome sequence of the nematode c. elegans: a platform for investigating biology phylogeographical studies of ascaris spp. based on ribosomal and mitochondrial dna sequences genomic diversity in onchocerca volvulus and its wolbachia endosymbiont the genome of onchocerca volvulus, agent of river blindness disentangling hybridization and host colonization in parasitic roundworms of humans and pigs landscape genetics reveals focal transmission of a human macroparasite ascaris lumbricoides, ascaris suum, or "ascaris lumbrisuum"? multiple exposures to ascaris suum induce tissue injury and mixed th /th immune response in mice genomes project analysis group. . 
the variant call format and vcftools allerdictor: fast allergen prediction using text classification techniques improved whole-chromosome phasing for disease and population genetic studies genomics of loa loa, a wolbachia-free filarial parasite of humans assays to detect beta-tubulin codon polymorphism in trichuris trichiura and ascaris lumbricoides allertop v. -a server for in silico prediction of allergens prediction of proprotein convertase cleavage sites a fuzzy relative of the isodata process and its use in detecting compact well-separated clusters hybrid ascaris suum/lumbricoides (ascarididae) infestation in a pig farmer: a rare case of zoonotic ascariasis multi-parallel qpcr provides increased sensitivity and diagnostic breadth for gastrointestinal parasites of humans: field-based inferences on the impact of mass deworming sources of variability in the measurement of ascaris lumbricoides infection intensity by kato-katz and qpcr orthofinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy whipworm genome and dual-species transcriptome analyses provide molecular insights into an intimate host-parasite interaction benzimidazole resistance in helminths: from problem to diagnosis draft genome of the filarial nematode parasite brugia malayi maternal inheritance of human mitochondrial dna emerging zoonoses: a one health challenge global burden of disease study the genome of the heartworm, dirofilaria immitis, reveals drug and vaccine targets kinannote, a computer program to identify and classify members of the eukaryotic protein kinase superfamily transdecoder (find coding regions within transcripts) animal mitochondrial dna as a genetic marker in population and evolutionary biology helminth infections: the great neglected tropical diseases the genomic basis of parasitism in the strongyloides clade of nematodes whole genome sequencing of experimental hybrids supports meiosis-like sexual recombination in 
we thank the school children, schoolteachers, and bungoma administrators for their support. we extend special thanks to all the members of the study team: bungoma county hospital, siangwe, siaka, sang'alo, nasimbo and ranje village administrators and community health workers. particular thanks to dr. charles s mwandawiro, prof. sammy njenga, and dr. jimmy h kihara (kemri), and dr simon j brooker (bmgf) for making the fieldwork possible in kenya, and for their invaluable scientific and logistical advice.

competing interests roy anderson: rma was a non-executive director of glaxosmithkline (gsk) during the period of worm collection in kenya. gsk played no role in the funding of this research or this publication. the other authors declare that no competing interests exist.

key: cord- -s x grh authors: payne, natalie; kraberger, simona; fontenele, rafaela s; schmidlin, kara; bergeman, melissa h; cassaigne, ivonne; culver, melanie; varsani, arvind; van doorslaer, koenraad title: novel circoviruses detected in feces of sonoran felids date: - - journal: viruses doi: . /v sha: doc_id: cord_uid: s x grh sonoran felids are threatened by drought and habitat fragmentation. vector range expansion and anthropogenic factors such as habitat encroachment and climate change are altering viral evolutionary dynamics and exposure. however, little is known about the diversity of viruses present in these populations.
small felid populations with lower genetic diversity are likely to be most threatened with extinction by emerging diseases, as with other selective pressures, due to having less adaptive potential. we used a metagenomic approach to identify novel circoviruses, which may have a negative impact on the population viability, from confirmed bobcat (lynx rufus) and puma (puma concolor) scats collected in sonora, mexico. given some circoviruses are known to cause disease in their hosts, such as porcine and avian circoviruses, we took a non-invasive approach using scat to identify circoviruses in free-roaming bobcats and puma. three circovirus genomes were determined, and, based on the current species demarcation, they represent two novel species. phylogenetic analyses reveal that one circovirus species is more closely related to rodent associated circoviruses and the other to bat associated circoviruses, sharing highest genome-wide pairwise identity of approximately % and %, respectively. at this time, it is unknown whether these scat-derived circoviruses infect felids, their prey, or another organism that might have had contact with the scat in the environment. further studies should be conducted to elucidate the host of these viruses and assess health impacts in felids. the sonoran desert is a unique ecosystem in which four species of felids are known to coexist: pumas (puma concolor), bobcats (lynx rufus), ocelots (leopardus pardalis), and jaguars (panthera onca) [ ] . these felids play a crucial role in maintaining a functional ecosystem. pumas mainly regulate populations of ungulates, including deer, bighorn sheep, and javelina [ ] [ ] [ ] , while bobcats and ocelots tend to prey upon small mammals, such as lagomorphs and rodents, and reptiles [ , [ ] [ ] [ ] . 
ocelots and jaguars are recognized as endangered in the region [ ] [ ] [ ] ; however, all four felid species are likely threatened by shared environmental pressures, including drought [ ] , habitat fragmentation and encroachment (which can lead to human-wildlife conflict), and emerging diseases. while antibodies to canine distemper virus (cdv) have been detected in sonoran jaguars [ ] and antibodies to cdv, feline panleukopenia virus, feline calicivirus, and feline enteric coronavirus have been detected in pumas from southern arizona [ ] , other viruses circulating in populations of sonoran felids are largely unknown. cataloging the diversity of viruses present in these felids could reveal an abundance of both known and novel viruses; although most viruses are not pathogenic, some may cause disease and be relevant to conservation. high throughput sequencing technologies have allowed for unprecedented advances in identifying known and novel viruses and characterizing viral communities through viral metagenomics. taking advantage of metagenomic approaches to monitor viral communities associated with wildlife could be instrumental for conservation; however, this is not routinely performed. altered viral evolutionary dynamics (largely due to anthropogenic factors such as facilitating viral movement around the world, spillover from domestic animals, increasingly dense populations of wildlife due to habitat encroachment, and climate change) and altered exposure of wildlife to viruses through vector range expansion create conditions for accelerated emergence of viruses, some of which may cause new disease outbreaks in wildlife populations [ , ] . notable examples include the spillover of feline leukemia virus (felv) from domestic cats into the endangered florida panther [ ] and spillover of cdv from domestic dogs into wildlife populations within serengeti national park, tanzania, affecting spotted hyenas, african lions, and other species [ , ] .
this may be especially problematic for already threatened populations, as small populations typically have lower genetic diversity (and possibly stress-induced immunosuppression) and, therefore, decreased adaptive potential to assist survival of a proportion of the population experiencing the effects of a novel viral disease [ , [ ] [ ] [ ] . genomes from several families of circular rep-encoding single-stranded dna viruses (cress-dna viruses) are part of the phylum cressdnaviricota [ ] and have been identified in fecal samples of other mammals, including domestic cats [ , ] , bobcats, african lions [ ] , capybaras [ ] , and tasmanian devils [ ] . circoviridae is one of the families in the cressdnaviricota phylum and is composed of the genera circovirus and cyclovirus. circoviruses have ambisense genomes of approximately . - . kb in length and encode two proteins, rep and the capsid protein (cp) [ ] . circoviruses have implications for wildlife management because they are associated with disease in some vertebrates, including life-threatening hemorrhagic gastroenteritis in dogs [ ] [ ] [ ] , psittacine beak and feather disease in parrots [ ] , and postweaning multisystemic wasting syndrome in pigs [ , ] . importantly, several studies suggest that these life-threatening diseases may be largely due to coinfection with porcine parvovirus or porcine reproductive and respiratory syndrome virus [ , ] , or canine coronavirus, canine parvovirus, or cdv [ ] [ ] [ ] , in pigs and dogs respectively. no circoviruses are known to infect felids, although a cyclovirus (feline associated cyclovirus ) has been identified in the feces of domestic cats [ ] . additionally, a feline stool-associated circular dna cress-dna virus has recently been identified from cats with diarrhea [ ] . endogenous fragments of circoviruses have also been detected in feline genomes, indicating the susceptibility of the ancestors of modern felids to circovirus infection [ , ] . 
here we used a metagenomic approach to identify novel circoviruses in the feces of two species of sonoran felids, the puma and bobcat; although not endangered, knowledge of viral threats facing these species could help prevent future population decline, as well as indicate potential threats to the endangered ocelot and jaguar. for the two novel circoviruses identified, we sought to determine relationships with known circoviruses and characterize their genomes. these novel feline feces associated circoviruses may represent the first known feline circoviruses. scat samples from bobcats and pumas were collected from sonora, mexico, between and . the samples were desiccated at room temperature prior to shipping and long-term storage at − • c. to determine the species, dna was extracted by swabbing the scat surface. the swab was deposited into lysis buffer and dna extracted using qiagen's dneasy blood and tissue kit as previously described by cassaigne et al. [ ] . this dna was used as template for pcr of the mitochondrial cytochrome b gene [ ] with confirmation by sanger sequencing of the amplicon (~ bp region) as previously described [ ] . we randomly selected fecal samples (bobcats (n = ) and pumas (n = )) for this study. of each of the fecal samples, g was homogenized in sm buffer and the homogenate was centrifuged at × g for min. the supernatant was sequentially filtered through . µm and . µm syringe filters and viral particles in the filtrate were precipitated with % (w/v) peg- with overnight incubation at • c followed by centrifugation at , ×g as described in fontenele et al. [ ] . the pellet was resuspended in µl of sm buffer and µl of this was used for viral dna extraction using the high pure viral nucleic acid kit (roche diagnostics, indianapolis, in, usa). circular viral dna was amplified by rolling circle amplification (rca) using the illustra templiphi amplification kit (ge healthcare, chicago, il, usa). 
sequencing libraries were prepared from the rca products using the nextera dna flex library prep kit (illumina, san diego, ca, usa) and sequenced on an illumina hiseq ( × bp). the paired-end raw reads were trimmed using default settings within trimmomatic v . [ ] and the trimmed reads were de novo assembled using k-mer values of , , and within metaspades v . . [ ] . contigs greater than nucleotides were analyzed by blastx [ ] against a local viral protein database constructed from available ncbi refseq viral protein sequences (https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/). based on the de novo assembled contigs (> nts) that had blastx hits to circovirus sequences, two pairs of abutting primers were designed manually to recover and verify the full genomes of circoviruses: uoa _ f -ctatagaacagatatgcaaattatggccgg- and uoa _ r -atatctcaaaaagaggaaccgaaaccttgg- (complementary to the cp gene/stem loop region) and uoa f -gaccgatacccattgaaagtggagactaag- and uoa r -catcactcgaagcaggtcatcatag- (complementary to the rep gene region). each of the fecal samples was screened with these abutting primers to recover the full genomes of the circoviruses, using . µl rca product as template, kapa hifi hotstart dna polymerase (kapa biosystems, wilmington, ma, usa), and the manufacturer's recommended thermal cycling conditions. the pcr amplicons were resolved on a . % agarose gel, recovered with gel purification, cloned into the plasmid pjet . (thermofisher, waltham, ma, usa), and sanger-sequenced at macrogen inc. (seoul, south korea) by primer walking. the sanger sequence contigs were assembled using the "assembly module" in geneious prime v [ ] . open reading frames in the genomes were identified using orffinder (https://www.ncbi.nlm.nih.gov/orffinder/).
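the open reading frame search mentioned above can be illustrated with a short sketch. this is not orffinder itself: it is a simplified python stand-in that assumes a circular genome, atg start codons only, the three standard stop codons, and a hypothetical `min_codons` cutoff. the function and parameter names are illustrative and do not come from the study.

```python
# Simplified ORF scan on a circular genome (illustrative; not ORFfinder).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def circular_orfs(genome, min_codons=100):
    """Return sorted (strand, start, length_nt) tuples for ATG..stop ORFs.

    start is 0-based on the scanned strand ("-" means coordinates on the
    reverse complement). length_nt includes the start and stop codons.
    min_codons is a hypothetical cutoff that also counts start and stop.
    """
    genome = genome.upper()
    n = len(genome)
    found = []
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        doubled = seq + seq  # lets an ORF run across the origin
        best = {}            # stop position -> (start, length); keeps the longest
        for start in range(n):
            if doubled[start:start + 3] != "ATG":
                continue
            # walk codon by codon, at most once around the circle
            for pos in range(start + 3, start + n - 2, 3):
                if doubled[pos:pos + 3] in STOPS:
                    length = pos + 3 - start
                    key = pos % n
                    if length >= 3 * min_codons and length > best.get(key, (0, 0))[1]:
                        best[key] = (start, length)
                    break
        found += [(strand, s, l) for s, l in best.values()]
    return sorted(found)


toy = "ATG" + "GCT" * 5 + "TAA" + "CCCC"   # made-up 25-nt test sequence
print(circular_orfs(toy, min_codons=5))
```

doubling the sequence is one simple way to let an orf span the arbitrary origin of a circular genome; because overlapping starts that share a stop codon are collapsed to the longest one, each reported orf corresponds to a distinct stop.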
the genomes and amino acid sequences of rep and cp of representative circoviruses and those identified in this study were aligned using muscle [ ] , and pairwise percent identities were obtained using sdt v . [ ] (file s ). the optimal substitution model based on akaike information criterion with correction for small sample size (aicc) for the genome alignment was identified as gtr+i+g using jmodeltest [ , ] , and prottest [ ] identified lg+i+g as the optimal model for the rep alignment and vt+i+g+f as the optimal model for the cp alignment. phylogenetic analyses for each alignment were performed with phyml . [ ] . for visualization purposes, all trees were rooted with sequences from the duck associated cyclovirus (genbank: ky ) and horse associated cyclovirus (genbank: kr ) (not shown in the tree). branches with sh-like alrt support less than . [ , ] were collapsed using ips [ ] and ape [ ] packages in r [ ] . the viral genomes described in this manuscript were submitted to genbank (accession numbers: mt -mt ). based on the metagenomic analysis, we assembled a partial viral genome in two of the samples. based on this partial sequence data, we designed abutting primers to screen all the available scat samples. of the samples screened with the two primer pairs, three circovirus genomes were identified and recovered ( figure a ) from three fecal samples of bobcats. two of the genomes (genbank: mt and mt ) share greater than % pairwise identity to each other (file s ) and are nucleotides in length, having a rep coding sequence (cds) of nucleotides ( amino acids) on the virion-sense strand and cp cds of nucleotides ( amino acids) on the complementary strand. based on the species-demarcation threshold for circoviruses which is % genome-wide identity [ ] , both of these belong to a new species which we refer to as sonfela (derived from sonoran felid associated) circovirus . 
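pairwise identity and its comparison against a species demarcation threshold can be sketched as follows. this is a simplified stand-in for sdt, assuming two already aligned sequences of equal length with `-` as the gap character and ignoring columns where either sequence has a gap. the function names are hypothetical, and the threshold is deliberately left as a caller-supplied parameter: the 75.0 in the usage lines is an arbitrary illustrative value, not the threshold used in the study, and treating the threshold as inclusive is also an assumption.

```python
# Pairwise percent identity from an existing alignment (illustrative sketch).

def pairwise_identity(aln_a, aln_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        matches += (a == b)
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def same_species(aln_a, aln_b, threshold_pct):
    """True if identity reaches the caller-supplied demarcation threshold."""
    return pairwise_identity(aln_a, aln_b) >= threshold_pct


# toy aligned sequences, made up for illustration
a = "ACGT-A"
b = "ACGA-A"
print(pairwise_identity(a, b))
print(same_species(a, b, threshold_pct=75.0))  # 75.0 is an arbitrary example value
```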
the third genome (genbank: mt ) of nucleotides, referred to as sonfela circovirus , is more distantly related, sharing approximately % identity with the two sonfela circovirus genomes (file s ), and contains a rep cds of nucleotides ( amino acids) on the virion-sense strand and cp cds of nucleotides ( amino acids) on the complementary strand. the stem loop and nonanucleotide motif "tagtattac" were identified in the genomes and correspond to the origin of replication. conserved motifs within rep (rc endonuclease motifs i, ii, and iii and sf helicase domains walker a, walker b, motif c, and arg finger) [ ] were all detected. the genome ( figure a ) and protein ml phylogenetic trees ( figure b ,c) demonstrate that canine circovirus (genbank: kc ), rodent associated circoviruses (roacv , , , , and ) (genbank: ky , ky , ky , ky , and mf ), bat associated circovirus (genbank: kx ), and the sonfela circoviruses cluster in a separate clade with sh-like alrt support between . - . . sonfela circovirus is most closely related to a group of three rodent-derived viruses (roacv - ; genbank: ky , ky , and ky ), sharing a maximum of approximately % genome-wide identity, % rep identity, and % cp identity with roacv (genbank: ky ) (file s ). the phylogenetic trees reveal sonfela circovirus and bat associated circovirus (genbank: kx ) to be sister taxa, sharing approximately % genome-wide identity, % rep identity, and % cp identity according to sdt; however, pairwise percent identity calculations reveal maximum genome-wide identity with batacv (genbank: kj ) ( . %) and cp identity with roacv (genbank: ky ) ( %) (file s ). sharing less than % genome-wide identity with known circoviruses, both sonfela circoviruses and represent novel species (file s ). based on the circovirus species demarcation threshold of % identity [ ] , the circovirus genomes identified and recovered in this study represent two new species. 
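locating the nonanucleotide motif "tagtattac" in a circular genome can be sketched as below. this is a minimal illustration rather than the procedure used in the study: it assumes a plain string search on both strands, handles matches that span the arbitrary origin of the circular sequence, and uses hypothetical function names. it does not check for the flanking stem loop.

```python
# Search a circular genome for a short motif on both strands (illustrative).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif_circular(genome, motif="TAGTATTAC"):
    """Return (strand, start) hits; start is 0-based on the scanned strand.

    For the "-" strand the coordinate refers to the reverse complement.
    """
    genome = genome.upper()
    motif = motif.upper()
    hits = []
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        # append the first len(motif)-1 bases so matches can wrap the origin
        extended = seq + seq[:len(motif) - 1]
        start = extended.find(motif)
        while start != -1:
            hits.append((strand, start))
            start = extended.find(motif, start + 1)
    return hits


toy = "TTAC" + "GGGGG" + "TAGTA"   # made-up sequence; motif wraps the origin
print(find_motif_circular(toy))
```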
these feline associated viral genomes have a typical circovirus length, contain both circovirus rep and cp cds (in appropriate orientation), and have a well-defined nonanucleotide sequence. the health implications of these circoviruses for these populations are currently unclear given the viruses' true hosts and pathogenicity are unknown. as the viral genomes were derived from scat samples, the circoviruses could have infected the bobcat prey species or the felids themselves or be environmentally derived. the phylogenetic clustering of sonfela circovirus and several rodent circoviruses suggests the virus may be rodent-derived; similarly, sonfela circovirus may be bat-derived. as with these novel feline associated viruses, many of the recently described viruses have not been associated with their mammalian hosts. the lack of formal host association limits our ability to directly interpret the biological relevance of these viruses. however, in the meantime, it is critical to continue to describe the viral diversity associated with unconventional hosts. to our knowledge, the circoviruses described here may represent the first known feline associated circoviruses. detection, or lack thereof, of the circoviruses in other tissues within felids could help discern the viruses' true hosts. screening for the viruses in sympatric populations of rodents, bats, and other prey species could also be utilized to rule out or confirm the sources of these viruses. if felids are the host for these viruses, affected individuals should be monitored for possible symptoms of disease; however, further investigations linking these viruses to their natural host are needed as well as investigations into the prevalence of the viruses within felid populations in the sonoran desert and across the americas. funding: np was supported by funds from the genetics graduate interdisciplinary program and the technology and research initiative fund at university of arizona. 
the high throughput sequencing work and viral molecular work was supported by a startup grant awarded to av by arizona state university. sample collection of bobcat and puma scat was supported by primero conservation, a nonprofit wildlife conservation organization. wildlife survey and monitoring in the sky island region with an emphasis on neotropical felids food habits of pumas in northwestern sonora abundance and food habits of cougars and bobcats in the sierra leopardus pardalis) food habits in a tropical deciduous forest of jalisco diets of sympatric bobcats and coyotes during years of varying rainfall in central arizona food habits of ocelots and potential for competition with bobcats in southern texas interior. endangered and threatened wildlife and plants; final rule to extend endangered status for the jaguar in the united states protección ambiental-especies nativas de méxico de flora y fauna silvestres-categorías de riesgo y especificaciones para su inclusión, exclusión o cambio-lista de especies en riesgo recovery plan for the ocelot (leopardus pardalis) first revision annual and warm season drought intensity-duration-frequency analysis for sonora, mexico serosurvey of mountain lions in soutern arizona human drivers of ecological and evolutionary dynamics in emerging and disappearing infectious disease systems global factors driving emerging infectious diseases multiple introductions of domestic cat feline leukemia virus in endangered florida panthers a canine distemper virus epidemic in serengeti lions (panthera leo) cross-species transmission and evolutionary dynamics of canine distemper virus during a spillover in african lions of serengeti national park interactive influence of infectious disease and genetic diversity in natural populations importance of genetic variation to the viability of mammalian populations the threat of disease increases as species move toward extinction a virus phylum unifying families of rep-encoding viruses with 
single-stranded, circular dna genomes faecal virome of cats in an animal shelter novel single-stranded, circular dna virus identified in cats in japan novel smacoviruses identified in the faeces of two wild felids: north american bobcat and african lion single stranded dna viruses associated with capybara faeces sampled in brazil fecal viral diversity of captive and wild tasmanian devils characterized using virion-enriched metagenomics and metatranscriptomics revisiting the taxonomy of the family circoviridae: establishment of the genus cyclovirus and removal of the genus gyrovirus circovirus in tissues of dogs with vasculitis and hemorrhage genomic characterization of a circovirus associated with fatal hemorrhagic enteritis in dog genomic characterization of canine circovirus associated with fatal disease in dogs in south america characterization of a new virus from cockatoos with psittacine beak and feather disease a review of porcine circovirus -associated syndromes and diseases porcine circovirus diseases experimental reproduction of severe wasting disease by co-infection of pigs with porcine circovirus and porcine parvovirus concurrent infections are important for expression of porcine circovirus associated disease circovirus in domestic and wild carnivores: an important opportunistic agent? 
virology role of canine circovirus in dogs with acute haemorrhagic diarrhoea a molecular survey for selected viral enteropathogens revealed a limited role of canine circovirus in the development of canine acute gastroenteritis endogenous viral elements in animal genomes the evolution, distribution and diversity of endogenous circoviral elements in vertebrate genomes novel universal primers establish identity of an enormous number of animal species for forensic application genetic analysis of scats reveals minimum number and sex of recently documented mountain lions trimmomatic: a flexible trimmer for illumina sequence data spades: a new genome assembly algorithm and its applications to single-cell sequencing basic local alignment search tool geneious basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data multiple sequence alignment with high accuracy and high throughput a virus classification tool based on pairwise sequence alignment and identity calculation fast, and accurate algorithm to estimate large phylogenies by maximum likelihood jmodeltest , more models, new heuristics and parallel computing prottest , fast selection of best-fit models of protein evolution new algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of phyml . approximate likelihood-ratio test for branches: a fast, accurate, and powerful alternative phyloch: r language tree plotting tools and interfaces to diverse phylogenetic software packages ape . , an environment for modern phylogenetics and evolutionary analyses in r r: a language and environment for statistical computing a field guide to eukaryotic circular single-stranded dna viruses: insights gained from metagenomics we would like to thank interns meagan bethel and anna kegan scott and work study student emma froelich for help with dna extraction of bobcat and puma scat samples. 
we acknowledge jana jandova for her assistance with extracting viral dna. we also thank alex erwin, eldridge wisely, karla vargas, conor handley, and hans-werner herrmann for feedback during the early stages of manuscript preparation and robert jackson for critical review of this manuscript. any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the u.s. government. the authors declare that there are no conflicts of interest.

key: cord- -ck lhojz authors: gromeier, matthias; wimmer, eckard; gorbalenya, alexander e. title: genetics, pathogenesis and evolution of picornaviruses date: - - journal: origin and evolution of viruses doi: . /b - - / - sha: doc_id: cord_uid: ck lhojz the discovery of viruses heralded an exciting new era for research in the medical and biological sciences. it has been realized that the cellular receptor guiding a virus to a target cell cannot be the sole determinant of a virus's pathogenic potential. comparative analyses of the structures of genomes and their products have placed the picornaviruses into a large “picorna-like” virus family, in which they occupy a prominent place. most human picornavirus infections are self-limiting, yet the enormously high rate of picornavirus infections in the human population can lead to a significant incidence of disease complications that may be permanently debilitating or even fatal. picornaviruses employ one of the simplest imaginable genetic systems: they consist of single-stranded rna that encodes only a single multidomain polypeptide, the polyprotein. the rna is packaged into a small, rigid, naked, and icosahedral virion whose proteins are unmodified except for a myristate at the n-termini of vp . the rna itself does not contain modified bases. the key to ultimately understanding picornaviruses may be to rationalize the huge amount of information about these viruses from the perspective of evolution.
it is possible that the replicative apparatus of picornaviruses originated in the precellular world and was subsequently refined in the course of thousands of generations in a slowly evolving environment. picornaviruses cultivated the art of adaptation, which has allowed them to “jump” into new niches offered in the biological world.

the discovery of viruses heralded an exciting new era for research in the medical and biological sciences. many contemporary virologists do not know, however, that the first animal virus described was a picornavirus, the etiological agent of the dreaded foot-and-mouth disease in cloven-footed animals. the discovery of foot-and-mouth disease virus (fmdv) by f. loeffler and p. frosch in (loeffler and frosch, ) occurred at the same time as m.w. beijerinck described the amazing "contagium vivum fluidum" in . this liquid was a filtered leaf extract derived from tobacco plants suffering from tobacco mosaic disease. free of bacteria, it was yet able to transmit the disease to uninfected plants. already in , i. ivanovski had made a similar observation with tobacco mosaic virus but apparently he was unable to fully convince his peers of the significance of his discovery (waterson and wilkinson, ). research on viruses, now formally in its hundred-and-first year, has yielded an immense harvest of biochemical and biological information. the studies were driven not only by an urgent need to understand, and possibly prevent, viral disease; they were also fueled by a strong curiosity about the minute biologicals called viruses, which we can view as chemicals, on the one hand and as "living" entities on the other. poliovirus is an exquisite example of a chemical with a known empirical formula (molla et al., ) that can be crystallized (schaffer and schwerdt, ) yet causes a devastating disease in humans. poliovirus was discovered years ago by landsteiner and popper ( ) to be the causative agent of poliomyelitis.
the current knowledge of its chemical and three-dimensional structure and of its life cycle and pathogenesis is second to none. indeed, the intense research efforts on poliovirus over a period of nine decades will lead to its demise in the near future: global eradication of poliovirus is considered possible by the year . following the identification of fmdv and poliovirus, a deluge of other viruses with similar properties were uncovered. these viruses have now been classified as picornaviridae, a large family of small (lat. pico) rna (rna) viruses. currently, picornaviridae consists of six genera: enterovirus, rhinovirus, hepatovirus, parechovirus, cardiovirus and aphthovirus (table . ). the first four genera include predominantly human pathogens, which cause a bewildering array of disease syndromes. although a disease syndrome may be considered characteristic for a specific picornavirus group, the same syndrome can possibly be also produced by other picornaviruses.

[table . , flattened in extraction; column alignment is not recoverable. surviving entries, in reading order: b coxsackieviruses a (see cav above), poliomyelitis, c~v~ ( ) coxsackieviruses b - myocarditis, pleurodynia, meningitis, hcar*, daf ( ) hand-foot-and-mouth disease, respiratory disease, neonatal infections, meningitis, encephalitis, pleurodynia, exanthema echoviruses - , - , - , - vla- (= c ~i) daf (=cd ) enterovirus nd c poliovirus types - poliomyelitis, meningitis cd ( ) coxsackieviruses , , , , , [ ] [ ] [ ] [ ] [ ] common cold, infantile diarrhea icam-1†]
table notes: the following viruses have been recognized as picornaviruses on the basis of their genome sequences and physico-chemical properties as well as the result of comparative sequence analyses (see the section on evolution): equine rhinovirus types i and , aichi virus, porcine enterovirus, avian encephalomyelitis virus, infectious flacherie virus of silkworm. clusters of enteroviruses refer to groups of enteroviruses arranged predominantly according to genotypic kinship (hyypia et al., ). more clusters, including mainly animal enteroviruses, have been proposed. list of human syndromes adapted from melnick, . common syndromes in humans caused predominantly by one and/or other member(s) of the cluster but member viruses of other clusters may cause the same syndrome. receptors may be specific for specific serotypes. for details, see text. references describing the identification of receptors: ( ) roivainen et al., ; ( ) tomko and philipson, ; shafren et al., ; ( ) bergelson et al., ; ( ) bergelson et al., ; ward et al., ; ( ) mendelsohn et al., ; ( ) shafren et al., ; ( ) staunton et al., ; greve et al., ; ( ) hofer et al., ; ( ) feigelstock et al., ; ( ) neff et al., ; berinstein et al., ; ( ) jackson et al., ; ( ) huber, . * shared with adenovirus type . † daf (decay accelerating factor) may function as non-essential (infection-augmenting) coreceptor. coxsackie virus a v is a genetic variant of coxsackie virus a . ** pringle, .

it has been realized that the cellular receptor guiding a virus to a target cell cannot be the sole determinant of a virus's pathogenic potential. indeed, it is a major challenge of the day to decipher the molecular mechanism(s) that determine viral tissue tropism and disease. what is the identity of picornaviruses? it relates to ancestral viruses whose identity we will never know. comparative analyses of the structures of genomes and their products, however, have placed the picornaviruses into a large "picorna-like" virus family, in which they occupy a prominent place (discussed in the section on evolution). these same analyses have led to an evolutionary tree of picornaviruses that reveals the extent of kinship (figure . a). one result of these phylogenetic investigations was a radical reorganization of the taxonomy of enterovirus, a genus of picornaviridae comprising numerous members infecting the gastrointestinal tract. the enteroviruses have now been divided into clusters (table . ; figure . b) grouping the viruses mainly corresponding to genotypes (hyypia et al., ).
earlier classifications were based (1) on specific properties of the virions, (2) on disease patterns, (3) on the apparent absence of pathogenesis (echo is an acronym for "enteric cytopathic human orphan" because no disease was originally correlated with these viruses), or (4) in reference to the site of discovery (e.g. the town of coxsackie in new york state) and pathogenesis in suckling mice. as the number of known enteroviruses increased and the properties of these new isolates were elucidated, the need for a modified classification became apparent. however, even the latest dendrograms are likely to be modified again. principally, viruses that have been classified as belonging to a specific genus may be further divided into serotypes. a serotype is defined by the virus's ability to elicit a set of neutralizing antibodies ("antiserum") in a host animal; this set of neutralizing antibodies will generally not neutralize any other virus, regardless of the origin of the antiserum. neutralizing antibodies, in turn, are elicited by structures specific for a virus's capsid, and they have been referred to as neutralization antigenic determinants (or sites). the poliovirion carries at its surface four distinct neutralization antigenic determinants (minor, ). however, poliovirus expresses only three unique sets of these four determinants; hence poliovirus occurs in three serotypes. hepatitis a virus, on the other hand, expresses only one set of neutralization antigenic determinants; hence, it occurs in only one serotype. in contrast, human rhinoviruses (hrv) can express more than unique sets of four antigenic determinants. rhinoviruses, therefore, occur in more than serotypes. it should be noted that a poliovirus has been constructed that expresses neutralization antigenic determinants of all three serotypes. this virus, which is severely handicapped in proliferation, is trivalent as it can be neutralized by all three serotype-specific antibodies (murdin et al., ).
a genus consisting of viruses that cause the same disease syndrome can be subdivided further on the basis of receptor use. for example, all member viruses of the genus rhinovirus cause the common cold, yet they use two different receptors (icam- and ldl receptor; table . ). on the basis of genotypes, however, this division no longer holds up (figure . ). as mentioned, the enteroviruses have now been subdivided into clusters based on genotypes (table . , figure . b). for example, the c-cluster consists of the three serotypes of poliovirus and of a number of coxsackie a virus serotypes (the c-cavs). originally the c-cavs were not considered related to polioviruses because of the profound difference in pathogenesis (common cold and poliomyelitis, respectively) and the different use of receptor (icam- and cd , respectively). however, their very close kinship was revealed by genome sequence. this proximity has led to the interesting question of whether the c-cavs are genetic variants of poliovirus (harber et al., ) or vice versa (discussed in detail in evolution). an interesting recent variant of cav is cav v, an agent that emerged in the early s and that causes acute hemorrhagic conjunctivitis. this syndrome is also associated with a new variant of enterovirus , a d-cluster enterovirus (table . ; yin-murphy, ). the phenomenon of the sudden appearance of enterovirus strains causing human diseases not previously associated with picornaviruses is of greatest interest with respect to the dynamics of picornavirus diversification, particularly in view of the eradication of poliovirus. what are the mechanisms by which the picornaviruses and other rna virus families have diversified? clearly, the genetic program inscribed into the viral genome is being changed as the viruses acquire new genetic traits. the predominant driving force of the changes in the genotype is largely an adaptation to new opportunities to proliferate.
in the following, we will discuss some mechanisms and rules of genetic diversification and evolution of picornaviruses.

sequences involved in rna replication, and the internal ribosomal entry site (ires), controlling translation. the virus-encoded 5'-terminal protein, vpg (viral protein genome-linked), is covalently linked to the 5'-terminal uridylic acid via an o4-(5'-uridylyl)tyrosine bond (lee et al., ; nomoto et al., b; rothberg et al., ). picornavirus vpgs are - amino acids long; their third amino acid (from the n-terminus) is always a tyrosine, the residue linking vpg to the genome. genome-linked proteins are quite common amongst viruses belonging to the picorna-like superfamily (see figure . ). picornavirus vpgs are attached to 5'-terminal nucleotide sequences that form complex structures typical for entero- and rhinoviruses on the one hand, or cardio-, aphtho- and hepatoviruses on the other. these sequences are important signals in genome replication. entero- and rhinoviruses share a cloverleaf structure (rivera et al., ; andino et al., ) that has been subject to intense studies (see below). relatively little is known about the role of corresponding structural elements (which do not form cloverleaves) of cardio-, aphtho- and hepatovirus genomes. the cloverleaf is followed by the internal ribosomal entry site (ires), arguably the most complex cis-acting element in any rna virus genome known (figures . , . ; wimmer et al., ). picornavirus ires elements, which are approximately nt long, regulate the initiation of polyprotein synthesis. in contrast to cap-dependent "scanning", ireses promote internal ribosomal entry, i.e. they allow initiation of translation independently of a capping group and even of a free 5' end (jang et al., , ; pelletier and sonenberg, ; molla et al., ; chen and sarnow, ). remarkably, ires elements are defined by their function, not by their sequences or apparent higher-order structure(s). this is illustrated in figure . , which depicts the sequence and folding pattern of the ires elements of poliovirus and encephalomyocarditis virus (emcv; pilipenko et al., a,b). in spite of these differences, the poliovirus ires has been exchanged with that of emcv, leading to a novel chimeric virus with excellent growth properties. similarly, the ires of hepatitis c virus (hcv), a flavivirus, was found to functionally substitute for the poliovirus ires, yielding a polio/hcv chimeric virus (lu and wimmer, ; zhao et al., ). finally, a construct in which the ires of human rhinovirus type (hrv ) replaced that of poliovirus yielded a pv/hrv chimeric virus (pv1(ripo)) that is indistinguishable from wt poliovirus with respect to replication in hela cells yet is highly attenuated in poliovirus-receptor-transgenic mice and in monkeys (gromeier et al., , a; discussed in the section on pathogenesis). the properties of this interesting novel virus will be discussed in a later section. the mechanism by which ires elements function is still obscure.

figure . sequences and secondary structures of ires elements of poliovirus and encephalomyocarditis virus. a. poliovirus ires; individual domains have been labeled with roman numerals. b. encephalomyocarditis virus (emcv) ires; domains have been labeled with capital letters. both ireses contain a conserved ynxmaug motif, of which the oligopyrimidine stretch (yn) and the aug triplet are indicated by solid bars. note that in the emcv ires, the aug triplet of the ynxmaug motif is the initiating codon of the polyprotein. in the poliovirus ires, this aug triplet is silent and is separated from an aug codon initiating the synthesis of the polyprotein by a "spacer sequence" of nt (jang et al., ). single attenuating mutations in the poliovirus vaccine strains map to domain v (wimmer et al., ).
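the ynxmaug motif named in the figure legend (an oligopyrimidine stretch, a spacer, then an aug triplet) can be located in a sequence with a simple pattern search. the following python sketch is illustrative only: the function name and the default lengths for the pyrimidine tract and the spacer are assumptions made for the example, not values taken from the text, and a match says nothing about whether the aug is actually used for initiation.

```python
import re


def find_yn_xm_aug(rna, min_y=8, min_x=15, max_x=25):
    """Locate Yn-Xm-AUG candidates in an RNA (or cDNA) sequence.

    min_y, min_x and max_x are hypothetical defaults chosen for
    illustration; they are not taken from the source text.
    """
    seq = rna.upper().replace("T", "U")  # accept cDNA input as well
    pattern = re.compile(
        r"([CU]{%d,})([ACGU]{%d,%d}?)AUG" % (min_y, min_x, max_x)
    )
    hits = []
    for m in pattern.finditer(seq):
        hits.append({
            "tract_start": m.start(1),       # 0-based start of the Yn stretch
            "tract": m.group(1),             # the oligopyrimidine stretch
            "spacer_length": len(m.group(2)),
            "aug_start": m.end(2),           # 0-based position of the A in AUG
        })
    return hits


# toy sequence: 8 pyrimidines, a 15-nt spacer, then AUG
example = "GG" + "UUUCCUUU" + "A" * 15 + "AUG" + "GGC"
hits = find_yn_xm_aug(example)
```

the search is non-overlapping and reports only the first acceptable aug after each pyrimidine tract; a silent aug followed by a downstream initiating aug, as described for poliovirus, would have to be handled by a separate scan.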
translation of picornavirus mrna is initiated downstream of the ires to yield an unstable "polyprotein" that is rapidly cleaved by virus-encoded proteinases to proteins involved in viral proliferation (figure . ; see also evolution and figure . ). it is important to note that the mrna found in viral polyribosomes that encodes the polyprotein differs from virion rna in one important aspect: it is terminated with pup ... (hewlett et al., ; nomoto et al., ). apparently, the terminal protein vpg has been cleaved from incoming or from newly synthesized rna. it has been suggested that the enzyme cleaving the vpg-pup phosphodiester bond is of cellular origin but the reason for the removal of the protein and the nature of the enzyme catalyzing it remain unknown. moreover, it is not clear whether the incoming vpg-linked virion rna will be processed immediately after entry or whether the removal of vpg will occur only after the first round(s) of viral protein synthesis. entero- and rhinoviruses encode the two proteinases 2apro and 3c/3cdpro, aphthoviruses the two proteinases lpro and 3cpro, and cardioviruses only the proteinase 3cpro. interestingly, both cardioviruses and aphthoviruses have evolved a peculiar cleavage mechanism between 2a and 2b that occurs only in cis and is an enzyme-independent reaction (reviewed by ryan and flint, ). a similar as yet unknown mechanism of proteolytic cleavage is that between vp4 and vp2 (figure . d), which occurs only during maturation of the virion (maturation cleavage) and appears also to be proteinase-independent (harber et al., ; see below). the origin of these fascinating enzymes and of specific cleavage events are discussed in the section on evolution. since most details of proteolytic processing have been accumulated for poliovirus, much of the following discussion will center on this viral system.
the two poliovirus proteinases 2apro and 3c/3cdpro cleave at different sites, as determined by the sequences of the scissile bond (figure . b, c). theoretically, the poliovirus polyprotein could give rise to different cleavage products if proteolytic processing by these enzymes and the maturation cleavage were entirely random (wimmer et al., ). in fact, only roughly - cleavage products have been identified in poliovirus-infected cells (nicklin et al., ). it has thus been concluded that processing of the picornavirus polyproteins is not random but follows a pathway that is determined by protein folding (masking of cleavage sites) and by the amino-acid sequences surrounding the scissile bond (figure . b, c; harris et al., ). for example, the precursor 3cdpro can be cleaved into 3cpro and 3dpol by a (cis?) cleavage in which the 3c/3cdpro proteinase is involved. both 3cpro and 3dpol are quite stable end-products of processing. however, in the case of poliovirus type 1 (mahoney) (pv1(m)), 3cdpro can also be efficiently processed in trans by 2apro to 3c' and 3d' (figure . c), two polypeptides with no apparent function in viral proliferation (lee and wimmer, ). just like 3cpro and 3dpol, 3c' and 3d' are quite stable end-products of processing, even though 3dpol harbors a perfect cleavage site for 2apro and 3c' harbors a cleavage site for 3c/3cdpro (figure . ). indeed, in pv1(m)-infected cells, nearly equal amounts of the four cleavage products of precursor 3cdpro are observed. it is assumed that structural constraints mask one or the other cleavage site from recognition and processing once the cleavage product has been formed (lee and wimmer, ). the preferred cleavage sequence for 3c/3cdpro in the poliovirus polyprotein is axxq*g; hence, cleavage sites with this sequence are usually rapidly processed. numerous mutational studies have supported the identity of this 3c/3cdpro cleavage motif (reviewed in dougherty and semler, , and wimmer et al., ).
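the distinction between q*g scissile bonds in general and the preferred axxq*g context can be made concrete with a short scan of a protein sequence. the python sketch below is a minimal illustration under stated assumptions: the function name and the toy peptide are invented for the example, and a sequence match is only a candidate site, since the text stresses that protein folding can mask sites that look perfect on paper.

```python
def find_qg_sites(protein):
    """List candidate Q*G scissile bonds in a protein sequence.

    Each entry records how many residues lie N-terminal to the cut
    (the 1-based position of the Q), the residue three positions
    upstream of the Q, and whether that residue is alanine, i.e.
    whether the site fits the AxxQ*G pattern.
    """
    seq = protein.upper()
    sites = []
    for i in range(len(seq) - 1):
        if seq[i] == "Q" and seq[i + 1] == "G":
            upstream = seq[i - 3] if i >= 3 else None
            sites.append({
                "cut_after": i + 1,
                "p4": upstream,
                "preferred": upstream == "A",
            })
    return sites


# hypothetical toy peptide with one AxxQ*G site and one TQSQ*G-like site
toy = "MAKLQGTTQSQGP"
sites = find_qg_sites(toy)
```

in the toy peptide the first site (akl-q*g) is flagged as preferred, while the second (tqs-q*g) is reported but not flagged; the scan does not model folding, trans versus cis cleavage, or the other proteinases.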
an intriguing genetic analysis has made use of a viral construct that mutated this axxq*g cleavage motif at a specific site in order to avoid proteolytic processing. the amino acids placed by the mutants into the motif confirmed the proposed interaction between substrate and enzyme during cleavage (cao and wimmer, , and references therein). as will be discussed later, poliovirus is a purist with respect to cleavage signals, since the scissile bond in all cleavages catalysed by 3c/3cdpro is q*g (kitamura et al., ; semler et al., a, b).

figure . processing scheme and cleavage sites of the poliovirus polyprotein. a. proteolytic cleavages of the polyprotein. triangles indicate cleavage by 3cpro and/or 3cdpro. note that both enzymatic entities can efficiently cleave the non-structural proteins. in contrast, the p1 capsid precursor can be processed by 3cdpro only. solid triangles represent efficient cleavage sites, whereas open triangles represent slowly cleaved sites resulting in stable precursor proteins. the 2apro cleavages are depicted with circles. only the cleavage between p1 and p2-p3 (solid circle) is essential, whereas the cleavage of 3cdpro to 3c' and 3d' is dispensable (open circle). the maturation cleavage is indicated by the open diamond. the mechanism by which this cleavage occurs is unknown. numbers in brackets indicate the molecular weight in kda. b-d. amino-acid residues at sites cleaved by (b) 3cpro and/or 3cdpro, (c) 2apro and (d) during the maturation cleavage are shown in a single-letter code. the positions of the amino-acid residues are designated p1, p2, p3, ... at the newly generated c-termini, or p1', p2', p3', ... at the newly generated n-termini. the fastest cleavages catalyzed by 3cpro/3cdpro occur at sites in which the p4 position is a small aliphatic amino acid (e.g. axxq*g). cleavage at tqsq*g between 3c and 3d is slow, giving rise to the 3cdpro cleavage intermediate with a long half-life (cao and wimmer, ).
in other picornaviruses, or viruses of the large picorna-like superfamily, the cleavage site may differ from the canonical q*g signal. a most important observation in studies of picornavirus proliferation is that cleavage intermediates may have important functions that in some cases may even be distinct from those of their end-products (e.g. 3cdpro yielding 3cpro and 3dpol). the structure of picornavirus 3cpro enzymes has been accurately predicted by gorbalenya et al. ( ), leading to the genetic analyses alluded to above. the structures were proved to be correct by x-ray crystallographic studies of 3cpro of human hepatitis a virus (allaire et al., ) and human rhinovirus (matthews et al., ). following the orf, there is a heteropolymeric region that may be different with respect to length ( - nt) and structure in different picornavirus genomes (xiang et al., ). however, all picornavirus genomes terminate with poly(a), as was shown first for poliovirus (yogo and wimmer, ). the role these sequences play in replication will be discussed below. the genomic rna of picornaviruses can serve as mrna and, consequently, it is of the same polarity as cellular mrna. by convention, this polarity has been designated plus-strand polarity (baltimore, ). fortunately, the genomic rna of picornaviruses is infectious; that is, upon transfection into suitable host cells, virion rna will initiate a complete infectious cycle (wimmer et al., ). interestingly, poliovirus and its purified genome will replicate even in enucleated cells (morgan-detjen et al., ), an observation suggesting that the nucleus does not contribute factors essential for viral proliferation. using reverse transcriptase, racaniello and baltimore ( ) generated full-length "complementary" dna (cdna) that contained the entire genetic information of the viral genome (currently, cdna refers to double-stranded dna generated from the original complementary dna strands).
transfections into hela cells of the cdna that contained heterologous dna sequences at either end of the virus-specific sequences generated, surprisingly, poliovirus. with this experiment, "reverse genetics" of rna viruses was born as the rna genome was now amenable to manipulations developed for dna. the efficiency with which the original cdna clones induced an infectious cycle in hela cells was very low (about pfu/μg dna; racaniello and baltimore, ). construction of plasmids that could replicate in transfected cells dramatically increased the specific infectivity (to pfu/μg dna; semler et al., ). however, reverse genetics was made more practical when the cdna was cloned downstream of the phage t7 rna transcriptase promoter and, using purified t7 transcriptase, virtually unlimited amounts of highly infectious transcript rna could be produced in a simple test-tube experiment (> pfu/μg of transcript rna; van der werf et al., ). this was important because mutant genomes with highly debilitating replication phenotypes could not be recovered by the inefficient cdna transfection method. it was known before that vpg is not required to be at the 5' end for poliovirus rna to be infectious (nomoto et al., a). the 5' end of the t7 transcripts is pppgguuaaaa..., whereas that of virion rna is vpg-puuaaaa... the extra g residues do not prevent transfection but they reduce the specific infectivity of the transcript. in any event, picornavirus rna is quite tolerant of modifications of the 5' end of its genome and, in all cases, the virion rnas isolated after transfections have the authentic terminus restored (wimmer et al., ). infectious cdnas have now been generated from members of all picornaviruses. the method of choice to generate virus remains transfection of t7 transcripts. recently developed methods of rt/pcr allow researchers to generate infectious cdna clones in less than month (tellier et al., ).
in general terms, genome replication proceeds in two steps: synthesis of a complementary rna strand (-strand) that then serves as template for plus rna strands (+strands; figure . ). the validity of this scheme has been known for almost three decades yet only very few details of the individual steps have been elucidated. because the vast majority of studies have been carried out with poliovirus, this review will concentrate predominantly on this viral system. with the exception of the capsid proteins, all viral non-structural proteins and even processing intermediates have been implicated in genome replication (xiang et al., ).

figure . steps in the replication of the poliovirus genome. parental, positive-stranded virion rna (solid line) is transcribed, yielding -rna (broken line) after protein (vpg)-priming by the viral rna-dependent rna polymerase 3dpol (enzyme or any other proteins involved are not shown). a replicative intermediate (ri) form consisting of a single +strand template and multiple nascent -rna strands (a) has not been detected, so that, more probably, intermediates in -rna synthesis are either mainly single-stranded (b) or double-stranded (c). elongation of the nascent -rna (c) yields the replicative form (rf) double-stranded rna (d). available evidence suggests that the rf is an intermediate in genome replication (discussed in xiang et al., ). accordingly, a cloverleaf/rnp is formed at the end of the rf that promotes vpg-primed synthesis of +rna (e). the structures formed after multiple initiation could either be "closed" (entirely base-paired; f) or "open" (g). available evidence suggests that structure f is the correct intermediate (note that 3dpol is an unwindase). for details, see wimmer et al., , and xiang et al., . modified from wimmer et al., .
the evidence for the involvement of these proteins (2apro, 2b, 2bc, 2c, 3a, 3ab, vpg, 3c/3cdpro, 3dpol) is based largely on genetic data or on biochemical experiments assumed to be indicative of genome replication (wimmer et al., ; xiang et al., ). for example, genetic and biochemical analyses of 3ab strongly suggest that this protein, a non-specific rna binding protein, and the proteinase 3cdpro participate in the formation of an initiation complex for +strands (xiang et al., ). another example is the involvement of 2c in rna replication. briefly, poliovirus rna synthesis is highly sensitive to the presence of mm guanidine hydrochloride (gua hcl); poliovirus mutants resistant to mm gua hcl harbor a single amino-acid exchange (n a/g) in polypeptide 2c. it has recently been established that 2c is an atpase (and not a gtpase) and we now refer to it as 2c atpase (pfister and wimmer, ). the atpase activity of purified 2c atpase is inhibited by mm gua hcl, whereas that of purified 2c atpase with a n /g mutation is resistant to this concentration of the drug (pfister and wimmer, ). on the basis of these considerations, it can be assumed that the atpase activity of 2c atpase is essential for genome replication. just as with 3ab or 3cdpro, however, the step(s) by which 2c atpase is exerting its essential function are still unknown. the only proteins whose role in genome replication has been firmly established are vpg and 3dpol. the crystal structure of 3dpol has recently been solved (hansen and schultz, ), a result that will greatly advance our (limited) understanding of this important enzyme. importantly, 3dpol was established already in as being a primer-dependent and rna-dependent rna polymerase. although a deluge of circumstantial evidence suggested that a uridylylated form of vpg might serve as primer for 3dpol (nomoto et al., b; wimmer, ; takeda et al., ; toyoda et al., ), direct evidence for this mechanism has been obtained only very recently (paul et al., b).
briefly, vpg is uridylylated to vpg-pu(pu) by the viral rna polymerase 3dpol in the presence of template (poly(a)). vpg-pu(pu) then primes the transcription of poly(a), leading to the synthesis of poly(u), which is the 5' terminus of -strands (paul et al., b). in spite of these seemingly simple experiments (paul et al., b), the mechanism of initiation of rna synthesis was a matter of controversy for almost two decades. baltimore's and flanegan's groups presented evidence favoring "hairpin priming", whereas wimmer's group accumulated data suggesting "protein priming" (reviewed by richards and ehrenfeld, ). the controversy has finally been settled in favor of protein priming. at low concentration of enzyme, poliovirus polypeptide 3ab stimulates the transcriptional activity of 3dpol up to -fold (plotch et al., ; paul et al., ). indeed, biochemical and genetic evidence suggests that 3dpol and 3ab form a complex in solution (molla et al., ). the significance of these observations is not yet known. an important additional property of 3dpol is its ability to unwind double-stranded rna. that is, the enzyme, while transcribing a template, can replace a dormant rna strand that is hybridized to the template with the new strand that is just being synthesized (cho et al., ). it should be noted, however, that 3dpol is not a helicase as it will not separate two strands without transcribing one of them (cho et al., ). the participation in picornavirus replication of cellular proteins, referred to by investigators as "host factors", has also had a history of controversies. several polypeptides were proposed to be involved in replication (e.g. a kinase or a uridylic acid transferase) but these proposals did not survive further analysis (richards and ehrenfeld, ).
ehrenfeld's and semler's groups have recently identified a cellular kda rna binding protein, poly(rc) binding protein (pcbp2), that is not only required for the function of the poliovirus ires but also has the propensity to bind, together with 3cdpro, to the poliovirus 5'-terminal cloverleaf (blyn et al., ). pcbp2 (or pcbp1, a protein related to pcbp2; gamarnik and andino, ) is undoubtedly the "host factor p " that was originally proposed by baltimore's group to effect the binding of 3cdpro to the poliovirus cloverleaf (andino et al., , ). andino et al. ( ) provided the first evidence suggesting that the formation of a specific protein/cloverleaf rnp complex consisting of viral protein 3cd, a cellular protein ("p ") and the viral rna is required for the initiation of +strand synthesis. this hypothesis has been further supported by the discovery of pcbp2 (gamarnik and andino, ; parsley et al., ). pcbp2 is therefore a sensible candidate for a "host factor" involved in poliovirus rna replication. however, poliovirus protein 3ab can replace pcbp2 in all biochemical reactions characteristic of the formation of a 5'-terminal rnp. moreover, 3ab and 3cdpro, both cleavage products of the p3 precursor (figure . a), are associated in solution (molla et al., ). finally, the phenotypes of mutants of 3ab in vivo and in vitro support the conjecture that 3ab is involved in the formation of a cloverleaf/3ab/3cdpro complex (xiang et al., a,b). currently, there is no compelling evidence in favor of the cloverleaf/pcbp2/3cdpro complex over that of cloverleaf/3ab/3cdpro with respect to poliovirus genome replication (see a discussion in xiang et al., ). recognition of rna signals located somewhere in the rna genome is a prerequisite for specificity in genome replication. this review will concentrate only on cis-acting elements of entero- and rhinoviruses because, as mentioned earlier, the overwhelming number of experiments deal with these viruses.
currently, only the 5'-terminal cloverleaf has been firmly established as a cis-acting signal in enterovirus genome replication (see previous section), although the mechanism by which it functions is still obscure. clearly, the formation of a specific rnp plays a role, the significance of which will be discussed below. more complicated is the recognition of the +strand template for the initiation of -strands. since replication of picornavirus rnas commences at the 3'-terminal poly(a), a homopolymeric sequence found also in most cellular mrnas, poly(a) alone cannot be a determinant for virus-specific -strand synthesis. vpg-pu(pu) can be synthesized in the presence of poly(a), and vpg-poly(u), the 5' end of -strands, will follow the synthesis of the primer (paul et al., ). this reaction, however, does not reveal the mechanism of specificity. mutational analysis of the heteropolymeric sequence of the 3'ntr of enteroviruses indicated that this region was critically important for replication (pilipenko et al., ; melchers et al., ). however, the poliovirus 3'-terminal heteropolymeric sequence can be replaced with that of hrv , a hairpin with no apparent homology with the poliovirus structure, and the resultant poliovirus/hrv hybrid genome replicated with wt kinetics (rohll et al., ). even more startling was a report from semler's group that presented evidence that the heteropolymeric region could be deleted altogether without loss of viability (todd et al., ). currently, the paradox intrinsic to these findings remains unsolved. it is possible that the 3' heteropolymeric region plays an important role in the efficient formation of an initiation complex for replication but to a much lesser extent in +strand template recognition. the authentic recognition signal may reside in rna-internal sequences, as proposed by mcknight and lemon ( ).
these authors reported that, surprisingly, a stem-loop structure mapping to the coding region of the hrv capsid proteins was absolutely necessary for genome replication. fittingly, a stem-loop rna structure that has been uncovered in poliovirus rna also appears to play a role in genome replication; it maps to the coding region of 2c atpase (goodfellow et al., ). the mechanism by which these new elements influence replication has yet to be resolved. finally, evidence has been presented suggesting that sequences within the ires play a role in genome replication (borman et al., ; shiroki et al., ). this is difficult to comprehend if one considers chimeric ires viruses. as mentioned above, the cognate ires of poliovirus can be replaced with ires elements from different viruses whose ires are merely related (hrv , hrv , cbv , cav , cav , ev ; gromeier et al., , a and unpublished results) or entirely different (emcv; alexander et al., ; hcv, lu and wimmer, ; zhao et al., ) without loss of genome replication. defective interfering particles (di particles; see below) of poliovirus are naturally occurring variants with deletions (of varying sizes) in the p1 region, encoding the capsid proteins. di particles can replicate their rna without helper function but they need wt virus for encapsidation. sequence analyses of genomic rnas of di particles led nomoto and his colleagues to the surprising observation that in all cases the deletions were in-frame of the polyprotein coding sequence. on the other hand, artificial genomes engineered with out-of-frame deletions were unable to replicate their rna, even in the presence of wt helper virus (kuge et al., ; hagino-yamagishi and nomoto, ). it was concluded that translation was necessary for the cognate genome to replicate. that is, translation had a cis effect on replication that could not be complemented in trans by a helper genome.
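the in-frame versus out-of-frame distinction in these di-particle experiments comes down to whether the deletion length is a multiple of three nucleotides. the python sketch below illustrates this with a made-up 21-nt open reading frame; the function names and the toy sequence are assumptions for the example and do not correspond to any real poliovirus coordinates.

```python
def is_in_frame_deletion(start, end):
    """True if deleting nucleotides [start, end) keeps the reading frame."""
    length = end - start
    if length <= 0:
        raise ValueError("deletion must remove at least one nucleotide")
    return length % 3 == 0


def delete_region(orf, start, end):
    """Return the sequence with nucleotides [start, end) removed."""
    return orf[:start] + orf[end:]


def codons(seq):
    """Split a sequence into complete codons, read from position 0."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# hypothetical toy orf: AUG GCU AAA GGG CCC UUU UAA
toy_orf = "AUGGCUAAAGGGCCCUUUUAA"

in_frame = delete_region(toy_orf, 6, 12)      # removes AAAGGG (6 nt)
out_of_frame = delete_region(toy_orf, 6, 10)  # removes AAAG (4 nt)
```

after the 6-nt deletion the downstream codons (ccc uuu uaa) are read exactly as before, so translation can continue to the original stop codon; after the 4-nt deletion the downstream frame is shifted and the original stop codon is no longer read as a codon.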
these observations were later confirmed and extended (wimmer et al., ; novak and kirkegaard, ; agol et al., ). there are several hypotheses that are used to explain the phenomenon. the least likely is that certain replication proteins can only function in cis. if so, only viral mrna could serve as template in rna synthesis. since viral mrnas lack vpg (hewlett et al., ; nomoto et al., ; see above), every +strand rna that functions as template in rna synthesis should also lack vpg. available evidence, however, suggests that all rna templates involved in replication are terminated with vpg (nomoto et al., b; pettersson et al., ; wu et al., ; larsen et al., ). furthermore, rna replication occurs in a tight membranous environment (bienz et al., ). thus, it is unlikely that these genome replicating membranous complexes also harbor viral polysomes (wimmer et al., ). indeed, crude, membranous replication complexes can be isolated from infected cells that can replicate poliovirus rna yet they are free of ribosomes (takegami et al., ; takeda et al., ; toyoda et al., ). an alternative explanation is that the observed cis effect is operating only during the very first round of translation at the onset of infection. clearly, translation of an infecting genome will have to be somehow arrested to allow the template to switch from translation to transcription. it is possible that, once the switch has been made, replication can proceed independently of translation. this does not exclude the possibility that viral proteins, perhaps intermediates with a short half-life or short-lived protein complexes, must be continuously supplied to the rna synthesizing machinery. the question of the switch from translation to rna synthesis of infecting +stranded genomic rna has been the subject of much speculation.
the classical study of kolakofsky and weissmann ( ) on phage qβ replication solved the dilemma by showing that the phage replicase (a complex of four proteins) can repress translation of viral mrna. a similar model has been proposed for poliovirus by gamarnik and andino ( ): the formation of an rnp consisting of cloverleaf/3cdpro at the 5' end of the viral mrna inhibits further translation, thereby switching the template to replication. one problem with this model is that at the peak of poliovirus replication, translation and rna synthesis occur concomitantly in the presence of an excess of 3cdpro molecules (note that for each virus particle, molecules of 3cdpro are synthesized; the ratio of viral +strand rna to unprocessed 3cdpro may be : through most of the replicative cycle). moreover, if the genome has to be translated for replication to occur, how can inhibition of translation promote rna synthesis? a very schematic representation of steps in genome replication is shown in figure . (wimmer et al., ). the possible rna structures involved in replication have been divided into three categories: ( ) we have argued before that the cumulative evidence favors the "closed forms" for rf and ri but this view may not be shared by others (wimmer et al., ; xiang et al., ). since 3dpol is an "unwindase" (cho et al., ; see above), the scheme does not necessarily require a helicase. indeed, so far no picornaviral helicase has been identified, and purified 2c atpase has stubbornly refused to exhibit such activity (pfister and wimmer, ). briefly, vpg will be uridylylated at the 3'-terminal poly(a). vpg-pu(pu), in turn, will then prime synthesis of -strands (figure . c). it is unlikely that multiple initiation of -strands on the same template (prior to completion of the first -strand) occurs, since an ri with multiple -strands (such as in figure . a) has not been found in infected cells (bishop and koch, ).
it is even possible that initiation at the poly(a) tail of poliovirus rna occurs only once. completion of the -strand will thus yield rf (figure . d), which we consider an intermediate in replication and not a byproduct (wimmer et al., ). one compelling argument in favor of this assumption is that in the rf the ' end of +strands is in the close vicinity of the ' end of -strands, a prerequisite first proposed by baltimore and his colleagues (andino et al., ; harris et al., ). destabilization of this end of the rna will lead to the formation of an rnp consisting either of cloverleaf/ cdpr~ or cloverleaf/ cdpr~ which, in turn, will free the ' end of the -strand for vpg-primed +strand synthesis to occur (figure . e). multiple initiation at this end will lead to the multistranded ri (figure . f), the nascent or full-length +strands being replaced during transcription by the d p~ unwindase. initiation of +strands may be more efficient than initiation of -strands; hence the large excess of +strands over -strands in infected cells. note that a reconstituted replication system of purified viral and cellular components capable of synthesizing +strands from input +strands has not been achieved; thus many of the hypotheses put forward in this scheme have not yet been tested. gamarnik and andino ( ) have described a novel system to study poliovirus replication in xenopus oocytes by injecting poliovirus rna into these cells. however, virus will replicate only if a hela cell $ extract is co-injected with the rna. interestingly, the authors have been able to separate the hela supporting activities ($ ) into two factors, one necessary for poliovirus ires-driven translation, the other for poliovirus rna synthesis. this system offers an excellent opportunity to separate and characterize viral and cellular factors involved in virus replication.
viruses, lacking the genetic information as well as the tools to provide most of the essential components to replicate, are obligatory intracellular parasites. the complexity of viral proliferationmacromolecular synthesis of polypeptides and genomic nucleic acid, and encapsidation-has led to the text book wisdom that viruses are obligatory intracellular parasites unable to proliferate outside living cells. however, poliovirus rna (obtained either from virions or by transcription with phage t rna polymerase from plasmid dna), when incubated in an extract of uninfected hela cells void of nuclei, mitochondria and cellular mrna, will direct translation, genome replication and genome encapsidation such that infectious particles are formed. these newly synthesized virions are indistinguishable from poliovirus isolated from tissue cultures. thus, a picornavirus (poliovirus) is the first virus that has been synthesized de novo in a cell-free extract of mammalian cells (molla et al., ) . this experiment has nullified the notion that viruses can proliferate exclusively in living cells. moreover, the novel approach can be used to study individual steps of viral replication in the absence of cell-membrane barriers. several interesting observations regarding protein-protein interactions, the role of membranes, of cellular membranous components or soluble cellular factors, or of inhibitors of viral rna synthesis, have been published (barton and flanegan, ; molla et al., c molla et al., , barton et al., ; parsley et al., ; cuconati et al., ; towner et al., ) . the use of the cell-free cellular extract for studies of poliovirus rna replication, however, is still in the early stages of exploitation. nevertheless, it has been possible to even achieve genetic recombination of poliovirus in cell-free hela extracts (duggal et al., ; tang et al., ; duggal and wimmer, ; see below). 
in the course of transcription, all templatedependent nucleic acid polymerases make errors in incorporating nucleotides with roughly the same frequency ( - - - ). as is discussed in chapter , this phenomenon has profound biological consequences for rna viruses. because rna viruses have chosen not to develop mechanisms by which misincorporations of nucleotides can be recognized and corrected, the average number of "spontaneous" mutations per replication of the genome, referred to as error rate, is around - . the high error rate in the absence of mechanisms of proofreading and editing has several consequences. first, the average genome length of animal rna viruses is small ( nt). notwithstanding the genome of the exceptional coronavirus ( nt), rna viruses with genomes exceeding kb (e.g. the dna viruses, herpes viruses, poxviruses, iridoviruses) are inconceivable because of the high probability that each genome would carry multiple mutations after each round of synthesis. it should be noted that these considerations by no means imply that dna viruses with very small genomes do not exist. in fact, the animal virus with the smallest known genome is hepatitis b virus ( . kb). as to picornaviruses, their average genome length is nt (see also wimmer et al., ) . second, rna viruses replicate near the threshold of error catastrophe (holland et al., ) . that is, the artificial increase of misincorporation of nucleotides (e.g. by chemical mutagens) may lead to a rapid decline of the viability of the entire virus population. third, plaque-purified clones of rna viruses are not homogeneous but populations of many different, albeit very closely related genotypes; hence the term "quasispecies" (eigen, ) . fourth, the genetic heterogeneity allows an rna to rapidly adapt to a changing environment. a simple example should demonstrate the ease with which a drug-resistant mutant of poliovirus can be isolated. 
as mentioned, poliovirus rna replication is highly sensitive to the presence of mm guanidine hydrochloride (gua hcl). after plating a stock of plaque-purified poliovirus on a monolayer of hela cells in the presence of mm gua hcl, a few plaques will arise corresponding to resistant variants (gr) with mutations mapping to c atpase (pincus et al., ; tolskaya et al., ). in the case of the selection of gr poliovirus mutants, it should be noted that the resistant variants already existed in the population of the inoculating virus. if the virus inoculum had been entirely free of gr variants, no selection of gr mutants could have occurred since the drug inhibits rna synthesis; hence, there would have been no misincorporation of nucleotides to generate the gr mutations in c atpase. although it may be a trivial thing to repeat, it is important to remember that genetic variation by misincorporation of nucleotides (just as recombination) requires replication. no replication, no mutants. the genetic plasticity of genotypes and the dynamics of genetic variation can be studied conveniently when transcript rna, produced by transcription of cdna with t rna polymerase, is transfected onto hela cell monolayers and the corresponding plaque phenotype of progeny virus is analysed. in the case of wt virus, the plaques are, by convention, "large". if mutant rnas are analysed in plaque assays, one may observe only "small" plaques with a rare "large" plaque emerging on the plate. this rare large plaque may signal a reversion (either directly or through suppressor mutations) to a fast-replicating genotype. passage of the population of small and large plaque phenotype viruses (at multiplicities of infection of more than ) will rapidly yield populations of only the faster-growing virus because the impaired genomes are eliminated by competition.
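the elimination of impaired genomes by competition can be illustrated with a minimal sketch. it assumes a deliberately simple two-genotype model in which the fast-growing revertant produces a fixed multiple of the progeny of the impaired genome in every passage; the function name, the starting fraction and the relative yield are hypothetical values chosen for illustration, not measurements from the studies cited here.

```python
def fast_fraction(initial_fraction, relative_yield, passages):
    """Return the fraction of the fast-growing genotype after each passage.

    initial_fraction: starting share of the fast genotype (0..1), hypothetical.
    relative_yield:   progeny of the fast genotype per progeny of the
                      impaired genotype in one passage (assumed constant).
    passages:         number of serial passages to simulate.
    """
    fast = initial_fraction
    slow = 1.0 - initial_fraction
    fractions = [fast]
    for _ in range(passages):
        fast *= relative_yield          # fast genotype out-replicates the slow one
        total = fast + slow
        fast, slow = fast / total, slow / total  # renormalise to shares
        fractions.append(fast)
    return fractions


if __name__ == "__main__":
    # a rare revertant (1% of the stock) with a tenfold yield advantage
    for passage, share in enumerate(fast_fraction(0.01, 10.0, 3)):
        print(passage, round(share, 4))
```

under these assumptions the share of the fast genotype after n passages equals f0·w^n / (f0·w^n + 1 − f0), so even a rare large-plaque variant dominates the population within a few passages.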
an example of this phenomenon has been described by lu et al., ( b) , who analysed a hybrid poliovirus in which the cognate a pr~ coding sequence was exchanged to that of coxsackie b virus. a special case of a genetic phenomenon is that of a "quasi-infectious" genome. this term was originally introduced by agol and his colleagues (gmyl et al., ) to describe the following phenomenon. genetically engineered poliovirus variant rna was transfected into hela cells. progeny virus was harvested, sometimes only after prolonged incubations of the transfected tissue cell cultures. analysis of the genotypes of progeny virus genomes (by rt/pcr) revealed only revertant or pseudorevertant rnas. none of the original mutant genotypes were detectable. this phenomenon can be explained if the original mutant genotype was able to replicate its rna, albeit only at levels too low for virus production or even for the development of cpe. nevertheless, the slowly replicating mutant genome allowed for mutation (either misincorporation or deletions, insertions), eventually leading to fast-growing genotypes. by definition, the progeny of quasiinfectious genomes will not yield virus with the parental genotype. if a mutation (point mutation, linker insertion, etc.) engineered into the genome rna is lethal, the lesion may effect complete abrogation of genome replication. hence, reversion to viability cannot be expected. an interesting example of quasi-infectious versus lethal mutations in the poliovirus genome was described when mutations in vpg were studied (kuhn et al., ; reuer et al., ; cao and wimmer, ) . as mentioned, vpg is linked to the genome via a o -( '-uridylyl) tyrosine (the tyrosine in position three of vpg). a mutation of tyrosine to phenylalanine (y f) was originally described as being lethal (reuer et al., ) . this conclusion made sense, since phenylalanine lacks a hydroxyl group for phosphodiester formation. 
however, cao and wimmer ( ) later observed that cells transfected with vpg(y f) variant rna produced viable virus, albeit only at very low frequency and only after prolonged incubation of the cultures. all of the progeny genomes carried a f y reversion. the possibility of contamination of the cultures with wt virus was excluded. the only explanation for this surprising result was that the vpg(y f) variant was quasi-infectious, presumably in that the threonine residue in position four of vpg may have served as a (poor) surrogate acceptor for uridylylation and protein priming of rna synthesis (it should be noted that genome-linked terminal proteins are often attached to serine residues; salas, ). further analyses supported this hypothesis. a vpg(t a) variant was found to be viable, exhibiting good growth kinetics. in contrast, rna of the double variant vpg(y f, t a) never yielded progeny virus and was, therefore, considered unable to replicate. this double mutation can then be considered lethal. genetic analysis of mutant genomes and their revertants has been an invaluable tool to study the structure and function of picornavirus genetic elements and picornavirus proteins (wimmer et al., ). genetic recombination of picornaviruses is the exchange of genetic elements between two viruses that may occur during replication in the same cell. recombination was discovered by hirst ( ) and first used by cooper and his colleagues (cooper, ) to map poliovirus genetic units; lingering skepticism about the phenomenon was dispelled through biochemical analyses of poliovirus recombinant proteins (romanova et al., ; tolskaya et al., ) or fmdv recombinant genomes (king et al., ). (for a detailed review of recombination, the reader is referred to wimmer et al., .)
picornavirus recombinants have been detected because ( ) they acquired genetic traits from the parental strains allowing them to proliferate under conditions restricting the growth of either parent and ( ) they arose in excess over (replication competent) revertants of the parental strains (cooper, ) . restricting conditions for the selection of recombinants can include specific drugs, such as mm gua hc , monoclonal neutralizing antibodies (emini et al., ) or host-cell specificity (duggal et al., ). an elegant method to study poliovirus recombination under normal growth conditions (without selection) has been developed by jarvis and kirkegaard ( ) . a wealth of experimental data has shed light on the most important steps in recombination. details of individual steps in recombination, however, remain to be elucidated. the current knowledge can be summarized as follows. . recombination is homologous and it occurs by copy choice; i.e. an incomplete (nascent) rna strand may switch template strands during genome replication. the probability of crossover depends strongly upon the degree of homology between the two recombining viral genomes. . genetic analyses have indicated that template switching occurs (predominantly?) during-strand synthesis (kirkegaard and baltimore, ) . . recombination is precise: no deletions or insertions at sites of homologous recombination have been observed. this is true even if crossover occurred in the nt long noncoding region (spacer region) between ires and initiating aug of poliovirus (jarvis and kirkegaard, ) even though the sequence of this "spacer" is not conserved amongst the three poliovirus serotypes (toyoda et al., ) . indeed, deletions in the "spacer" could conceivably be tolerated in view of the observations by kuge and nomoto ( ) and others that the spacer can be partially or completely deleted without loss of viability. . 
template switching requires that the replication complex pauses, allowing a heterologous (invading) +strand to offer its service as template. an unsolved question is whether pausing and crossover is random (jarvis and kirkegaard, ) or non-random (romanova et al., ) . the latter is in all probability true. agol and his colleagues have proposed that higher order structures formed on template rnas may favor pausing of rna synthesis and crossover (romanova et al., ) in addition, king ( ) suggested that there are preferred sites of recombination in poliovirus rna, and that crossover may be favored immediately after synthesis of two uridylate residues (uu) in the nascent strand. duggal and wimmer ( ) observed that crossover patterns changed significantly when recombination occurred at different temperatures. specifically, crossover between two genetically marked rna strands at ~ occurred over a wide range of the genome with preference for sequences coding for structural proteins in the '-terminal half of the genome. in contrast, recombination in vivo at ~ and ~ yielded crossover patterns that had shifted dramatically to a region encoding nonstructural proteins (duggal and wimmer, ) . preferential selection of recombinants at ~ and ~ was ruled out by analyses of the growth kinetics of the recombinants. the reason for the temperature effect is unknown. temperature-dependent stability of higher order rna structures seems possible. recombination frequencies are calculated by dividing the yield of the recombinant virus by the sum of the yield of the parental virus. for picornaviruses with linear genomes, the distance between genetic markers used to determine recombination is proportional to the recombination frequency. as mentioned above, the degree of homology between the parental genomes strongly influences the probability for crossover. 
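the definition of recombination frequency given above is a simple ratio, and a short sketch makes the arithmetic explicit. the function name and the titres used in the example are hypothetical and serve only to illustrate the calculation; they are not data from the studies cited here.

```python
def recombination_frequency(recombinant_yield, parental_yields):
    """Yield of recombinant virus divided by the summed yield of the parents.

    recombinant_yield: titre of recombinant progeny (e.g. pfu/ml under
                       conditions that restrict both parents).
    parental_yields:   iterable with the titres of the parental viruses,
                       measured in the same units.
    """
    total_parental = sum(parental_yields)
    if total_parental <= 0:
        raise ValueError("summed parental yield must be positive")
    return recombinant_yield / total_parental


if __name__ == "__main__":
    # hypothetical titres: 2e5 pfu/ml recombinants, 4e8 and 6e8 pfu/ml parents
    print(recombination_frequency(2e5, [4e8, 6e8]))
```

because the frequency is reported relative to the parental yields, values obtained with marker pairs at different distances can be compared directly, which is the basis for the proportionality to marker distance stated above.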
the frequency of recombination between homologous genomes is remarkably high ( x - between markers only nt apart; jarvis and kirkegaard, ) . it has been estimated that - % of the homologous viral genomes may undergo genetic recombination within a single growth cycle (king, ) . this would mean an unprecedented genetic shuffling between genotypes of which the fittest retain the "wt" phenotype. experimental results that support a high frequency of recombination between sibling strands were obtained with engineered, quasiinfectious poliovirus genomes carrying two adjacent vpg sequences (cao and wimmer, ) . after transfection, all progeny viruses had lost the downstream vpg, most probably by homologous recombination during-strand synthesis. on studying recombination using genetically marked genomes, % of the recombination events occurred between sibling strands. finally, jarvis and kirkegaard ( ) have demonstrated that the frequency of recombination increases with the progression of the infectious cycle; i.e. the larger the concentration of intracellular viral rna the higher the probability of recombination. recombination in a cell-free extract of uninfected hela cells (molla et al., ) has recently been reported by two groups. recombination of parental viruses in the cell-free medium was detected either by rt/pcr in the absence of selection (tang et al., ) , or by plating the progeny virus under conditions that were restricted for either parent (duggal et al., ) . the recombination frequencies were found to be roughly the same as that in vivo (in tissue culture cells). the crossover pattern of recombination in vitro and in vivo at ~ was the same, lending credibility to the cell-free system as reflecting an in-vivo environment (duggal and wimmer, ) . the in-vitro approach has the potential to decipher some basic steps in recombination, as, for example, the invasion of heterologous template strands into the replication complex. 
as mentioned before, the pattern of recombination changed significantly when recombination was carried out in vivo at ~ or ~ . unfortunately, this effect at higher temperature cannot be analysed in vitro because cell-free synthesis of poliovirus is highly inefficient or completely absent at temperatures above ~ (molla et al., c). picornavirus genomes are extremely "plastic" in that any change in their genotype can lead to unexpected nucleotide rearrangements. this has been observed, for example, in analyses of ires elements, where deletions or insertions lead to unexpected new genotypes with excellent growth properties (see, for example, dildine and semler, ; gmyl et al., ; pilipenko et al., ; alexander et al., ; charini et al., ; cao and wimmer, ). genetic plasticity is particularly apparent when poliovirus genomes are constructed harboring foreign sequences. the analysis of genetic rearrangements and deletions is especially important when picornavirus genomes are to be used as vectors for the delivery of foreign genes. generally, polioviruses respond to the insertion of foreign sequences by rapidly deleting these sequences either partially or completely. the driving force behind the selection of deletion variants may be: ( ) excessive length of the genome, restricting encapsidation; ( ) interference with efficient processing of the polyprotein; ( ) interference with initiation of translation; ( ) alteration of rna structures necessary for replication; or others. there are examples, however, where poliovirus, or deletion mutants thereof, appear to tolerate a foreign sequence inserted into the genome. an interesting approach to studying ires elements was the insertion of the emcv ires into the orf of poliovirus, thereby making unnecessary the primary cleavage between pi*p catalyzed normally by a pr~ (figure . b; molla et al., ). the resulting rna transcripts proved highly infectious (molla et al., ).
although this dicistronic virus expressed a small plaque phenotype, neither the plaque size nor the genotype surrounding the insertion changed over six passages. these observations suggested that the insertion was stable, at least under the conditions studied. the emcv ires has no apparent sequence homology with any sequence of the poliovirus genome, the poliovirus ires included. this eliminated the possibility that the emcv ires was rapidly removed by homologous recombination. perhaps, illegitimate recombination (see below) to delete the ires without debilitating the virus was a very rare event and not apparent in progeny virus. it should be noted parenthetically that the viability of the dicistronic virus shown in figure . b proved for the first time the function of the emcv ires as a true internal ribosomal entry site (molla et al., ). on the basis of these experiments, a specific version of novel expression vectors was constructed (figure . c,d) that included, in addition to the foreign ires, a foreign orf (lu et al., a). none of these vectors was genetically stable over extended numbers of passages, and in some cases the deletion occurred during first passage (lu et al., a).
[figure . legend, partly recovered: ... in the 'ntr, the foreign gene (dark stippled box) was inserted upstream of a pr~ which now delivers the foreign protein by a cis cleavage. d. dicistronic poliovirus generated by inserting a foreign gene and the emcv ires into the 'ntr. in this case the foreign gene is synthesized independently from the polyprotein. e. generation of an expression vector by fusing the coding sequence of a foreign gene to the n-terminus of the poliovirus polyprotein. the foreign gene product is liberated through trans cleavage, by either dpr~ pr~ or a pr~. f. expression vector based on mengovirus, a cardiovirus that carries a small leader sequence preceding the p region of the polyprotein (see figure . ). in this case, the foreign gene is inserted into the l coding sequence. note that the organization of the genomes in e and f is identical. g. encapsidation-incompetent poliovirus expression vector in which a portion of the p coding sequence has been replaced by a foreign gene. this genome can be encapsidated in trans but, by itself, it can only go through one cellular cycle of replication.]
nevertheless, the insertion of a foreign orf between two ireses, as shown in figure . d, yielded a replicating poliovirus vector that efficiently expressed the cat gene over several passages. no deletion was apparent after the first passage. remarkably, the genome of this construct is % larger than that of the wt genome, an observation indicating that the capsid of naturally occurring polioviruses is not "full". however, attempts failed to encapsidate and express the larger luciferase gene instead of the cat gene in the context of the dicistronic virus. the luciferase activity was clearly detectable in cells transfected with the appropriate dicistronic transcript, but the genome harboring luciferase was not encapsidated. apparently, an increase of genome length to % (luciferase gene plus emcv ires) was not tolerable for encapsidation. a different strategy to convert poliovirus to an expression vector was the fusion of a foreign orf directly to the poliovirus polyprotein (figure . e; andino et al., ). in these experiments, the strategy of altmeyer et al. ( ) was mirrored, which made use of the genetic make-up of cardioviruses (mengovirus or emcv). specifically, the cardiovirus polyprotein is preceded by a small leader protein ( aa) that is cleaved from the capsid region p by the viral c pr~ proteinase, thereby allowing maturation and encapsidation of the virion. altmeyer et al. ( ) inserted a foreign gene into the leader sequence of the mengovirus genome (figure . f) and expressed the product of this fusion protein over several passages in tissue culture (altmeyer et al., , ).
andino et al., ( ) generated a similar "leader" protein in front of the poliovirus polyprotein. in this case, however, it was necessary to engineer a novel c/ cd pr~ cleavage site between the foreign orf (the new "leader") and the viral polyprotein such that the foreign polypeptide can be cleaved from the poliovirus capsid precursor (figure . e). although these poliovirus constructs were originally claimed to express excellent growth properties and, more importantly, were reported to be genetically highly stable (andino et al., ) , the poliovirus-based vectors proved, in fact, impaired in replication and prone to rapid deletions, at least if the insert was more than nt in size (mueller and wimmer, , and references therein; see below). it should be noted that the cardiovirus-based vectors also suffered from loss of the inserts of a foreign gene upon repeated passage, an observation suggesting that even cardioviruses do not tolerate an extended leader protein for the purpose of gene therapy (altmeyer et al., (altmeyer et al., , . in a third strategy of the construction of picornavirus vectors, the p capsid region of the picornavirus genome is partially replaced with a foreign orf (figure . g), yielding proliferation-incompetent replicons that appear to be genetically quite stable (see, for example , porter et al., ) . for a possible application as vectors in gene therapy, the replicons are trans-encapsidated via a vaccinia virus-based p expression vector, with relatively low yields of proliferation-incompetent virions. the apparent genetic stability of these replicons may be due to the fact that the rnas are similar in size when compared to the wt genome, and that the naturally occurring cis cleavage (between pi*p ) catalyzed by a pr~ is highly efficient, placing no restriction on this step of polyprotein processing. 
however, the rapid selection of faster growing variants that lost the foreign gene is unlikely, since the trans-encapsidated replicons can only proceed to a one-step infectious cycle. this is very different from the selection pressure in proliferation-competent vectors, which engage in second-round infections. what is the mechanism by which poliovirus may eliminate foreign sequences? homologous recombination cannot function because there is not enough sequence homology to engage in crossover. pilipenko et al. ( ) have nevertheless proposed that short sequences may serve as parting and anchoring sites for template switching in illegitimate (non-homologous) recombination. an alternative mechanism is "loop-out" deletion, in which the nascent strand skips endogenous sequences, jumping to an upstream sequence that serves as anchoring sequence. given the high frequency by which recombination occurs among sibling strands, a crossover mechanism may be favored, but decisive experiments to decide between these two mechanisms are lacking. in any case, a detailed study of genetic variations of polyprotein fusion vectors (figure . d) strongly supports the model of parting and anchoring sites for template switching or loop-out deletion (mueller and wimmer, ; see below) . briefly, when expression vectors ( figure . e) consisting of a gag gene (encoding p -p ; nt) of human immunodeficiency virus that was fused to the n-terminus of the poliovirus polyprotein (andino et al., ; mueller and wimmer, ) were analysed after transfection into hela cells, the genomes were not only found to be severely impaired in viral replication but they were also genetically unstable (mueller and wimmer, ) . upon replication, the inserted sequences were rapidly deleted as early as the first growth cycle in hela cells. 
interestingly, the vector viruses did not readily revert to wt sequences but rather retained some of the insert plus the artificial c/ cd pr~ cleavage site (to allow processing at the n-terminus of the polyprotein). thus, variants of different genotypes that replicated nearly as well as wt poliovirus had followed an evolutionary pathway towards the genetic organization of cardioviruses (mueller and wimmer, ). that is, the poliovirus polyprotein of these variants was preceded by gag-derived "leader" proteins of different but distinct sizes (predominantly between and aa long), the most prominent leader size reflecting the length of that in cardioviruses ( aa). in the immediate vicinity of the deletion borders of several isolates, short direct sequence repeats were observed that are likely to allow alignment of rna strands for non-homologous (illegitimate) recombination during -strand synthesis (figure . ; mueller and wimmer, ). interestingly, the selection of the leader size occurred during the very first rounds of replication of the transfected rna; in most cases, a sequential shortening of the leader sequence was not observed.
defective interfering particles
an interesting group of naturally occurring deletion mutants of picornaviruses are defective interfering particles (di particles) that can be (rarely) discovered in laboratory stocks of virus or generated (with difficulty) by passage of virus at high multiplicities (reviewed in wimmer et al., ). all naturally occurring di particles carry deletions in the p capsid precursor region (cole et al., ; cole and baltimore, ; lundquist et al., ; nomoto et al., ; kajigaya et al., ; kuge et al., ). as mentioned before, nomoto and his colleagues have found that the deletions in all di particles are in-frame (kuge et al., ).
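the role of short direct repeats at deletion borders, and the significance of in-frame deletions, can be illustrated with a toy sketch. it assumes that a deletion joins two copies of an identical short repeat, removing one copy together with the intervening sequence, and it describes the outcome on the +strand without modelling the strand on which the event takes place. the function names and the example sequence are hypothetical.

```python
def direct_repeat_deletions(seq, repeat_len):
    """List deletions that could arise between non-overlapping direct repeats.

    Returns tuples (start, end, product): seq[start:end] is removed, which
    takes out one repeat copy plus the intervening sequence and leaves a
    single copy of the repeat at the junction.
    """
    results = []
    for i in range(len(seq) - repeat_len + 1):
        unit = seq[i:i + repeat_len]
        j = seq.find(unit, i + repeat_len)   # next copy downstream, no overlap
        while j != -1:
            results.append((i, j, seq[:i] + seq[j:]))
            j = seq.find(unit, j + 1)
    return results


def is_in_frame(deleted_length):
    """A deletion preserves the downstream reading frame if it removes whole codons."""
    return deleted_length % 3 == 0


if __name__ == "__main__":
    # toy +strand sequence with the repeat GCA at positions 3 and 12
    toy = "AUGGCACCCUUUGCAGGG"
    for start, end, product in direct_repeat_deletions(toy, 3):
        print(start, end, product, is_in_frame(end - start))
```

in this toy case the only candidate deletion removes nine nucleotides and therefore leaves the reading frame intact; a deletion whose length is not a multiple of three would shift the frame of everything downstream of the junction.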
since genetically engineered di genomes (replicons) with an out-of-frame deletion in the p region are unable to replicate, nomoto and his colleagues (kuge et al., ; hagino-yamagushi and nomoto, ) correctly concluded that poliovirus rna replication requires, at some stage of the replicative cycle, translation of the replicating rna (cis requirement of translation), and that the di particles with out-of-frame deletion cannot be complemented in trans. novak and kirkegaard ( ) confirmed this hypothesis in that replicons with translation termination codons downstream of the p coding region could not be rescued in trans. for hypotheses to explain this phenomenon, see above. chetverin et al. ( ) have recently made the startling observation that certain rna fragments can join to one another via a molecular pathway determined by the intrinsic chemical properties of the rna molecules (chetverin et al., ). the fragments that formed chemically stable duplexes were selected by the replicase of phage qβ: only those dimers that had acquired signals from two different rna molecules were able to replicate. gmyl et al. ( ) have now reported that viable recombinants could also be generated from non-replicating and non-translatable segments of the poliovirus genome. these fragments by themselves were unable to generate the viral rna-dependent rna polymerase necessary for a replication-dependent recombination event. the "crossovers" were targeted to the highly variable segment of nucleotides located within the 'ntr upstream of the initiating aug codon for the polyprotein. a great number of recombinants have been obtained by transfection of mixtures of rna fragments. analyses of viruses that evolved after the mixture of the rna species strongly suggested that the connection between two fragments was the result of chemical reactions between the fragments rather than of template-switching (gmyl et al., ). the mechanism by which the chemical linkage between two fragments is formed is obscure but it could involve structures reminiscent of ribozyme-like activities in the viral rnas. nevertheless, this observation could have profound implications for the generation of novel genotypes in nature (gmyl et al., ).
figure . two models of illegitimate recombination during -rna synthesis as was observed with an expression vector shown in figure . e (mueller and wimmer, ). both models require a partial dissociation of the nascent -rna from the template +rna, caused presumably by pausing of the rna polymerase. the free ' end of the nascent -rna can re-anneal to a short complementary sequence further upstream on the same template strand, thereby looping out the intervening sequence (a), or it can re-anneal to the same complementary sequence but on a sibling +strand, and complete synthesis on this second template (b; strand switching). in both cases, the resulting strands would have excised the same sequence and could now, in turn, give rise to truncated +rna genomes. note that this deletion event can occur even during the first round of replication of the expression vector leading to partial or complete deletion of the foreign coding sequences. reproduced, with permission, from mueller and wimmer, .
genetic complementation is the compensatory action of gene products of two homologous genetic systems to alleviate defects of mutant genes. genetic complementation has been firmly established in picornavirus replication. however, because of the complexity of diverse function(s) of precursor proteins and their cleavage products, it has not been possible to define complementation groups (wimmer et al., ). complementation groups are indicative of genetic elements that can function independently, and they have been the basis of the definition of cistrons (benzer, ). a cistron, therefore, may be equated with a gene, i.e.
a functional unit of genetic material specifying a single protein. on the basis of these definitions, the picornavirus genome, encoding only the polyprotein whose products function in many cases in overlapping or even opposing fashions, cannot be called multicistronic. in general genetics, mutations affecting the same polypeptide can occasionally complement each other, a phenomenon referred to as intracistronic complementation (schlesinger and levinthal, ). based on these considerations, we have suggested that the picornavirus genome be considered "monocistronic" (wimmer et al., ). it follows that the genome encodes only one gene product, the polyprotein. the polyprotein, in turn, contains multiple genetic units whose products may or may not be capable of intracistronic complementation. if this definition is accepted, one should avoid referring to individual coding regions of the picornavirus genome as "genes". thus, there would be no "d p~ gene". this convention makes good sense if one considers that a "gene for d p~" is for the most part also the gene for cd pr~, a proteinase with properties unrelated to the polymerase d p~. it should be noted that theiler's virus is an exception to the monocistronic nature of picornaviruses in that it encodes a small protein in a separate reading frame, mapping towards the n-terminus of the polyprotein. it is interesting to consider that there is no absolute requirement that picornaviruses must exist as monocistronic (single-polyprotein-producing) entities. for example, the insertion of a second ires into the genome would yield a viable dicistronic virus (figure . b), an entity artificially generated by molla et al. ( , b). apparently, during the evolution of picornaviruses, the elimination of genetic elements regulating the expression of different picornavirus proteins was favored over retaining them. similarly, there was no pressure to generate such regulatory sequences and insert them into the genome.
in other words, proteolytic processing of a single polyprotein evolved not only to be highly efficient but also as a means to regulate the temporal appearance of viral proteins (e.g. precursor proteins versus end-products of proteolytic cleavage). in contrast, in prokaryotic rna phages or in -strand rna viruses, the expression of proteins is regulated by sequence elements located between different cistrons. interestingly, the dicistronic poliovirus depicted in figure . b resembles the genetic composition of cowpea mosaic virus (cpmv), a plant virus (hellen et al., ; molla et al., ). indeed, cpmv and the dicistronic poliovirus shown in figure . b have similar gene order and amino acid sequences, the main difference being that the genome of cpmv is bipartite. that is, rather than inserting a sequence such as an ires into the genome, cpmv preferred to divide the genome into two portions, one coding for the capsid proteins, the other for the replication proteins. such a genetic arrangement, which requires two particles to initiate a complete infectious cycle, may be suitable for a plant virus (where the yield of virus per host can be extremely high and, equally importantly, the local concentration of host organisms can be high) but it would be highly disadvantageous for an animal virus. the fascinating topic of the structure and evolution of polyproteins within the rna-like virus superfamily is discussed later. evidence for genetic complementation in vivo has existed for decades, the best-known involving guanidine-generated mutants. complementation of guanidine mutants seemed unidirectional (wimmer et al., ). bernstein et al. ( ) provided the first conclusive evidence for symmetric complementation, using mutants that were generated in a pr~ and a. these authors, however, also made the unexpected observation that mutants mapping to b or d p~ (which they had also generated by genetic engineering) could not be complemented (bernstein et al., ).
on the other hand, charini et al. ( ) clearly showed that mutations in d p~ mapping to a different region of the coding sequence could be rescued in trans. this example and many others (wimmer et al., ) support the notion that the polyprotein is a single genetic unit that does not consist of non-overlapping genes whose functions can be separated by complementation grouping. a special case of complementation was tested using dicistronic polioviruses. briefly, cao and wimmer ( ) constructed a virus with a genotype shown in figure . b, the extra cistron being the coding region for poliovirus ab. the dicistronic construct yielded a virus expressing a small plaque phenotype but it was genetically unstable, losing its inserts after several passages. nevertheless, if the lethal mutation vpg(y f, t a) was engineered into the second cistron of the dicistronic virus (see above), the first cistron ( ab) could rescue the genome, albeit inefficiently. a much more efficient rescue of lesions in the ab coding sequence was reported by towner et al. ( ), who used the cell-free system of poliovirus replication developed by molla et al. ( ). apparently, the supply in trans of p polypeptides in vitro is more efficient than in vivo, a phenomenon that remains as yet unexplained. interestingly, it appeared as if a mutation in ab could be rescued only if the complementing polypeptide was a precursor of ab, preferably p (see figure . ). perhaps, efficient complex formation between ab/ cd pr~ (see above) in this system depends on the cleavage of the p precursor in situ. picornaviridae combines species that infect animals with exceedingly varied pathogenic features affecting almost every organ system (table . ). in the following, we will concentrate only on human picornavirus infections, which range in severity from protean symptoms associated with the common cold (e.g. rhinoviruses), mild gastroenteritis (e.g. echoviruses), hepatitis (e.g.
hepatitis a virus), to fatal cns manifestations (e.g. pv, enterovirus ) or lethal myocarditis (coxsackievirus b group). despite the enormous variety in organ tropism observed with different species of the picornavirus group, pathogenic features of every single species are recognized in the form of highly distinct disease syndromes (exceptionally, coxsackieviruses can cause fatal disseminated infections in neonates with widespread viral propagation in multiple organs). we will define pathogenic properties of picornaviruses as a combination of different viral traits: ( ) those that affect tropism (determining the target cell type of, and influencing spread in, the host); ( ) those that affect virulence (determining kinetics of particle propagation); and ( ) those determining the progression of a disease syndrome ("pathogenicity proper", the propensity to cause clinical symptoms). there is a fourth parameter, which is strictly related to a condition of the host. an example is injury-provoked ("provocation") poliomyelitis, which will also be discussed below. surprisingly, the disparity in pathogenic properties may be contrasted with a high degree of sequence conservation among certain groups of picornaviruses. this is most evident with the cluster c enteroviruses (figure . b). for example, on the basis of sequences of the d p~, poliovirus serotype (lansing) (pv (l)) shares more than % sequence homology with its close relative coxsackievirus a (cav ). indeed, their sequence similarity exceeds that between pv (l) and pv (mahoney) or pv (leon) (figure . b). yet, whereas all pv serotypes are associated with poliomyelitis, a severe and frequently fatal infection of the cns, cav causes mild upper respiratory tract infections only. since minor sequence variations of picornaviruses can account for drastically different disease syndromes, it may be assumed that the pathogenic phenotype of picornaviruses is encrypted within a few crucial genetic determinants.
these basic determinants of pathogenic features appear to be dynamic, leading occasionally to the emergence of novel virus variants causing clinical syndromes not previously observed with their ancestors. this was the case when widespread epidemics of acute hemorrhagic conjunctivitis ravaged africa and the pacific rim (yin-murphy, ) and quickly expanded worldwide. two picornaviruses were associated with this previously unknown clinical syndrome, coxsackievirus a variant (cav v) and enterovirus (ev ; table . ). the former evolved from its ancestral cav , causing mild upper respiratory tract infections, whereas ev was primarily recognized for its association with a poliomyelitis-like neurological disorder (melnick et al., ). the deviation of tropism toward ocular tissues resulting in acute hemorrhagic conjunctivitis suggests a switch or an expansion in receptor specificity. this hypothesis, however, awaits confirmation, since the cellular receptor(s) of ev is unknown. the circumstances and conditions that may favor a switch in host cell tropism with resulting changes in the pathogenic phenotype are unknown. a detailed discussion of our current view of the evolution of enteroviruses, however, is presented in the following section on evolution. this is particularly relevant in the context of the imminent global eradication of poliovirus. it is known that coxsackieviruses a and a (cav , cav ) as well as enteroviruses and (ev ) occasionally cause a clinical syndrome with striking resemblance to poliomyelitis (melnick et al., ). occurrence of poliomyelitis caused by these virus species has only rarely been reported in epidemic proportions (voroshilova and chumakov, ; melnick et al., ); generally they occur as isolated incidents. fortunately, preliminary evidence (da silva et al., ) suggests that to date no surge in non-pv-caused poliomyelitis has occurred in response to the eradication of poliovirus in latin america.
however, the time elapsed since the eradication of wt polioviruses in the western hemisphere is too short in terms of evolution to conclude that the incidence of non-pv-caused poliomyelitis and poliovirus eradication are unrelated (see section on evolution). the observation of diverse specific clinical syndromes caused by closely related picornaviruses (particularly enteroviruses) has sparked interest amongst virologists in identifying those factors that may determine the clinical outcome of picornaviral infections. generally, signals for pathogenic phenotypes can be found in all parts of the viral genome. however, factors that determine cell and tissue tropism are not necessarily the same as factors determining virulence or attenuation. for example, the capsid mutations of the live attenuated strains of pv (the sabin strains) have an attenuating effect without altering the tropism of the sabin strains for the prime target of pv: spinal anterior horn motor neurons. a different effect of capsid proteins on an extended host tropism has been reported for several poliovirus type strains, e.g. pv (l). a small segment in capsid protein vp of pv (l) (the b-c loop) has been identified as carrying determinants of host range extension from primates to rodents (murray et al., ) . however, whereas pv (l) infection causes poliomyelitis in primates and in cd tg mice, normal mice developed histopathology indicative of panencephalomyelitis, which was radically distinct from poliomyelitis observed in primates and cd tg mice (gromeier et al., ) . this observation suggests pv (l) tropism toward a cell type in mice that is not targeted in primates. it is likely that the mouse-adapted pv (l) acquired additional receptor specificity but the nature of the receptor for pv (l) in normal mice is unknown. it should be noted that although pv (l) causes disease in mice after intracerebral injection, cultivated mouse l cells cannot be infected with this strain (gromeier et al., ) . 
the fact that single determinants of pathogenicity (e.g. the pv capsid) can carry signals that influence either tropism (e.g. pv (l)) and/or virulence (sabin strains of poliovirus) indicates different dimensions of picornaviral pathogenesis. taking multiple determinants of tissue tropism and virulence (shared between the capsid, non-structural viral proteins and non-coding sequences) into account, the enormous complexity of the molecular basis of picornaviral pathogenesis comes into perspective. capsid protein structure determines the interaction of a virus with its cellular receptor. as pointed out for the poliovirus sabin strains and pv (l), small differences in the structure of viral capsids are critical for cell and organ tropism as they ultimately determine the pathognomic features of the resulting infection. similarly, the capsid was determined to harbor sequences critical for cardiotropism (tracy et al., ; cameron-wilson et al., ) as well as diabetogenicity (kang et al., ) of coxsackie b viruses (cbv). moreover, diabetogenicity of emcv in mice mapped to the capsid (jun et al., ) . as pointed out, the mechanism by which the changes in the capsid may affect pathogenesis may be related to differences in the interaction between virus and receptor. apart from direct receptor switching (or extending receptor specificity to more than one cell surface molecule), virus capsid alterations may affect the kinetics of virus/receptor binding or particle stability. this would influence the virulence of that virus without a concurrent change in host cell tropism. reduced particle integrity has been proposed to participate in the attenuation phenotype of the sabin strains of poliovirus (filman et al., ) . however, theories linking capsid mutations within the sabin strains with structural elements important for protomer cohesion and capsid integrity remain inconclusive. 
in contrast to a likely relationship between capsid structure and pathogenic phenotype, the role of non-structural viral gene products in the determination of disease has been less obvious. non-structural viral proteins are cell-internal and, hence, do not influence tropism in a strict sense. cell-type-specific restrictions in viral replication regulated by non-coding regions of the viral genome or by non-structural proteins will not change the spectrum of target cells infected but may critically influence virulence. viruses normally evolve to adapt to host cells offering adequate portals of entry (receptors), thereby exposing the viral particles to an intracellular milieu supportive of particle propagation. thus, favorable cell-internal conditions for viral replication would ideally be matched by a suitable viral receptor to avoid virus entry into cells that do not permit replication. for most viruses invading a host organism, the match is not perfect and, hence, the viruses are restricted to replication in fewer cells or organs than the distribution of receptor molecules would suggest. cell-internal determinants mapping to the viral genome of various picornaviruses have been suggested to influence virulence. these can be divided into loci mapping to viral proteins or to non-coding regions (e.g. the 'ntr). for example, mutations within the coding region for the rna-dependent rna polymerase d p~ of the pv sabin vaccine have been implicated in the attenuation phenotype (toyoda et al., ; tardy-panit et al., ). it was found that these mutations contributed to the ts phenotype of this sabin vaccine strain (toyoda et al., ). mutations in the p region of hepatitis a virus have been correlated with an attenuated phenotype of hav (raychaudhuri et al., ). in many instances the genetic loci of pathogenesis mapping to viral non-structural proteins have been identified through sequence comparison.
this approach, of course, did not reveal mechanisms to account for reduced virulence. the non-coding genetic elements of picornaviruses have also been shown to carry signals determining virulence (evans et al., ; duke et al., ; gromeier et al., ; tracy et al., ) . as has been discussed in the section on genetics, the 'ntr harbors the internal ribosomal entry site (ires) that, on the basis of the early observation by evans et al. ( ) with the sabin type vaccine strain, has been identified as a major determinant of virulence for a number of picornaviruses. sequence comparison of the sabin strains of poliovirus with their wt progenitors revealed point mutations within a confined region of the 'ntr in all three serotypes, known as domain v (figure . ; reviewed in wimmer et al., ) . in analyses of viral strains recovered from patients who acquired paralytic polio after vaccination, a point mutation at position (direct reversion in domain v) of the ires of pv (sabin) was proposed to contribute to the neurovirulence of the isolate (evans et al., ) . recent analyses led to a different hypothesis stressing a co-operative attenuating effect of capsid mutations with mutations in the ires and other locations in the sabin vaccine genomes (mcgoldrick et al., ) . the mechanism responsible for ires-mediated attenuation of neurovirulence remains obscure. analyses of cell-type-specific growth restrictions in cell lines of neuronal origin and biochemical studies of cell-type-specific ires function suggested impairment of initiation of translation in a cell-type-specific manner (haller et al., ; agol et al., ; la monica and racaniello, ) . following the example of poliovirus, ires elements or surrounding sequences of a large number of picornaviruses were found to contain genetic markers with a role in the determination of a pathogenic phenotype. 
this was most impressively demonstrated by the drastic attenuating effect of a deletion within the poly(c) tract of the 'ntr of emcv (duke et al., ) . how do the multiple mechanisms alluded to interlace to produce a specific picornavirus disease syndrome? the intricacies of dual cell external and internal determinants of viral pathogenic features are best illustrated using the example of poliovirus. this most thoroughly studied prototype picornavirus is characterized by pathogenic properties of specificity untypical of viral pathogens of the cns. poliovirus host range is limited to primates only. within the primate organism poliovirus replicates within an unknown site in the gastrointestinal tract and within associated lymphatic structures, leading to viremia (bodian, ) . at this stage, the virus causes hardly any disease symptoms. however, viremia may, in only about % of infections, lead to cns invasion, presumably through passive passage of the blood-brain barrier (yang et al., ) . on muscle injury, the virus may also reach the cns via retrograde axonal transport wimmer, , ) . within the cns, poliovirus uniquely targets spinal cord anterior horn motor neurons. lytic destruction of anterior horn motor neurons results in flaccid paralysis, the hallmark clinical sign of paralytic poliomyelitis (bodian, ) . whereas efficient poliovirus proliferation occurring in the human gastrointestinal tract produces few or no symptoms, this virus's pathological potential is expressed in a small and relatively inaccessible subpopulation of neurons in the cns. this peculiar restriction has been the subject of research interest for many years. available evidence clearly suggests that the restriction of the host range of poliovirus to primates is determined predominantly by the receptor. in humans, this receptor is the immunoglobulin superfamily molecule cd (mendelsohn et al., ) and two proteins closely related to cd function as poliovirus receptor in simians (koike et al., ) . 
clearly, the observed organ and cell tropism are co-determined by the virus's dependence on the cellular receptor. we believe that, at least in part, the expression of cd must also determine the highly restrictive cell tropism of poliovirus within the cns, because mice transgenic for the cd gene develop a neurological condition with pathologic and clinical features identical to those observed in primates (ren et al., ; koike et al., ; gromeier et al., ) . furthermore, support for this hypothesis comes from studies of transcriptional control (solecki et al., ) and developmental expression of the cd gene (gromeier et al., b) . it has been reported that cd expression is restricted to structures in close anatomical and functional relationship with spinal cord anterior horn neurons during embryonic development of the cns (gromeier et al., b) . it is thus likely that the restrictive expression pattern of cd may indeed direct poliovirus tropism toward a specific cellular compartment of the cns. analyses of pathogenesis related to genetic determinants mapping to the capsid, proteins, ires, or d p~ have recently been extended using poliovirus hybrid viruses. specifically, polioviruses have been constructed in which the cognate ires was replaced by that of other picornaviruses (gromeier et al., ) . by exchanging the cognate ires element of pv by that of other picornavirus species, it could be shown that neuropathogenicity of pv can be eliminated without affecting growth properties in non-neuronal cell types normally susceptible to poliovirus (gromeier et al., ) . thus, it was determined that the ires of rhinovirus type , a virus species never associated with neurological disease, confers the attenuation phenotype to poliovirus. significantly, this chimeric virus, called pvi(ripo), did not carry any attenuating mutations in the poliovirus-specific sequence of its genome. 
the neuropathogenic potential of a picornavirus ires cannot be predicted but it is innate, of course, in its sequence. certainly, ires elements of enteroviruses known to cause poliomyelitis (cav , cav , ev , and ev ) are candidates for "neurovirulent ireses" and, indeed, a corresponding pv/cav chimera has proved this hypothesis (gromeier et al., unpublished results). c-cluster enteroviruses, on the other hand, never cause poliomyelitis. however, because of their close genetic kinship, the ireses of c-cluster coxsackieviruses confer a highly neurovirulent phenotype to the pv/cav chimeras (gromeier et al., unpublished). finally, ires elements of the genus rhinovirus ablate neuropathogenesis in poliovirus chimeric viruses (gromeier et al., ). these observations indicate that, indeed, cell-external restriction in cell tropism as well as cell-internal factors exert powerful limitations toward enterovirus pathogenesis. many relatives of poliovirus of the enterovirus genus (particularly the c cluster; table . ) presumably would equal the neuropathogenic properties of pv if their capsid structure allowed interaction with neuronal cells. thus, non-neurovirulent c-cluster enteroviruses with high sequence homology to pv and "neurovirulent" ires elements may gain tropism for neurons in the future (see under evolution). in addition to virus-encoded factors of pv neuropathogenicity, the circumstances within the host organism at the time of infection or shortly thereafter may influence the outcome of poliovirus infection. trivial muscle injury has been shown to increase the probability of neurological complications of concurrent poliovirus infection (mccloskey, ). a proposed pathogenic mechanism for provocation polio identified a deviation of the route of cns invasion toward retrograde axonal transport to account for the increased risk of polio among individuals who received intramuscular injections wimmer, , ) .
picornaviruses have adapted to a wide variety of cellular components of their hosts. the diverse spectrum of disease syndromes associated with human picornaviruses provides an excellent field of study to examine the factors that determine the clinical outcome of a viral infection. the enormous amount of sequence information, combined with a broad knowledge of the molecular biology of many of these agents, have sparked hopes of a rapid elucidation of the molecular basis for their pathogenic properties. initial optimism that sequence comparison of virulent strains with their attenuated variants alone would rapidly identify those elements responsible for a pathogenic phenotype and unravel mechanisms of pathogenesis is, however, unjustified. this is particularly true for poliovirus, the most thoroughly studied picornavirus. progress in the analysis of poliovirus neuropathogenicity has revealed that the interactions of poliovirus with the host are characterized by a degree of complexity not previously appreciated. mechanistic concepts of viral pathogenesis, combined with one-dimensional views of virus replication and its relation to the host organism, have helped little in increasing our understanding of the selective susceptibility to poliovirus of motor neurons. viral infections result in complex clinical syndromes that are difficult to explain in terms of single viral genetic elements. using poliovirus as an example, picornavirus-induced disease is the result of an intricate interplay of numerous factors, of both viral and host origin, that coordinately affect the ability of the virus to propagate in any particular cell type or organ. numerous investigations, particularly those of j.j. holland, e. domingo and their colleagues (see chapter ), have led to the realization that the rapid evolution of rna viruses results from high mutation rates combined with exceedingly large heterogenic populations. 
this is true for picornaviridae that encode variants of conserved protein folds as well as catalytic systems not found in the cellular world and, hence, have explored an enormous evolutionary space. it is less appreciated, although equally true, that it is the host environment that has provided new opportunities for viruses to proliferate and select new variants out of a huge number of mutants. without this "co-operation" between host and parasites, picornavirus evolution would not have resulted in the tremendous diversity of genotypes whose number, now counting in the hundreds, is currently biased towards those infecting mammals. the key role of virus-host interaction is evident in the evolution of rna viruses. indeed, the relatively slow pace by which the cellular environment is changing imposes severe restrictions on the mode of rna virus evolution. a model has been proposed suggesting that, in each moment of protein evolution, mutations could only be accepted in a very limited number of positions of a polypeptide chain; otherwise, protein structure and function would have been severely compromised. these limited places of acceptable variation have been called covarions (fitch, ) . it has been argued that the fast evolution of rna viruses in a constrained environment has to proceed through the exploitation of constantly emerging but vastly overlapping covarions . this model of virus evolution provides a sensible hypothesis as to how picornaviral proteins have managed to accept a heavy load of mutations that are quite frequently unique and rarely seen at sites in cellular proteins, while still not entirely losing some discernible similarity with other viral and cellular homologs. this has important practical consequences, since structural similarities can be used to reconstruct the evolution of rna viruses, an undertaking impossible just years ago. 
apart from the biological properties generally associated with all rna viruses, the current end-product of evolution makes each virus species unique. this is engraved in the virus' genetic plan: the organization of the genome and the mode of gene expression. during the past years we have learned that +strand rna viruses employ variations of surprisingly few basic genetic plans. the genetic organization and genetic expression of picornaviruses has been outlined in detail above. in the following, the genetic organization of different picornaviruses will be put into an evolutionary perspective. in addition, we will discuss in a rather speculative manner some hypothetical implications of evolutionary consequences of the eradication of poliovirus. numerous rna viruses have been combined into a picorna-like supergroup of which the picornaviridae comprise a rather compact domain (figure . ). the viruses of the picorna-like supergroup, a taxon not yet recognized as higher-than-family rank, share a conserved array of replicative proteins (see below). they infect plants, insects, birds and mammals. most of the established members of picornaviridae are mammalian viruses, and they fall principally into six genera (table . ). the newly characterized avian encephalomyelitis virus (aev), which is a member of picornaviridae, is most closely related to hepatitis a virus (marvil et al., ) . the latest revision of the classification of picornaviruses, although closely related to the original version, has a clear evolutionary flavor since it tends to combine viruses in accord with phylogenetic kinship rather than relying on phenotypic properties (see the introduction). the largest number of picornaviruses fall into the most closely related genera, enterovirus and rhinovirus (table . ). cardiovirus and aphthovirus comprise two other genera that have most probably emerged from a common ancestor. 
it appears that the two pairs of picornavirus genera diverged after hepato- and parechoviruses split from the main trunk of the picornavirus tree (figures . , . ). the exact phylogenetic interrelationship between hepato- and parechoviruses remains somewhat uncertain and may be different from that shown in figure . (see also below). in addition, three other picornaviruses, equine rhinovirus types and (erv and respectively) and aichi virus (aiv), remain to be classified. erv was shown to be a distant relative of fmdv (wutz et al., ), while erv (li et al., ; wutz et al., ) and aiv (yamashita et al., ) appear to be related to the cardio- and cardio/aphthovirus branches respectively. they are not recognized, however, as cardio- or aphthoviruses. in addition to mammalian viruses, a number of insect viruses have been previously included in the picornaviridae on the basis of phenotypic criteria. during the last years, however, the complete genome structure of six picorna-like insect viruses has been reported (van der wilk et al., ; isawa et al., ; johnson and christian, ; moon et al., ; sasaki et al., ). some of these viruses feature different genome organizations, and only infectious flacherie virus of silkworm (infv; isawa et al., ) was shown to possess a genotype and gene organization that may justify placing it in the picornaviridae (a.e. gorbalenya, unpublished results). it is likely that an ancestor of infv separated from the main branch of picornaviridae before the radiation of mammalian viruses (figure . ). the rapid accumulation of new virus genotypes has not been matched by an understanding of its evolutionary meaning. therefore, the basis of picornavirus classification may need to be revisited. moreover, the relationship between picornaviridae and other genetic systems may have to be defined within a new classification.
although confusing for virologists on first sight, any reclassification into hierarchically organized taxa will ultimately aid our understanding of evolution, host range of viruses and pathogenesis. it should be kept in mind, however, that any classification is, at best, an approximation of true phylogenetic relationships, and the current classification of picornaviridae should be treated as such. picornaviridae have evolved by speciation from a common ancestor. this plausible statement has been supported by computer analyses of nucleotide and protein sequences as well as by studies of the tertiary structure of capsid proteins and c pr~ proteinases. there is every reason to believe that the putative ancestral viral entity had a genetic organization that has been conserved largely in contemporary picornaviruses. its signature is a long 'ntr-long open reading frame- 'ntr-poly(a) (figure . ). with the sole exception of a strain of theiler's virus (tmev; see below), all picornavirus proteins are generated by autocatalytic processing of the gigantic polyprotein (figure . ). the backbone of the polyprotein is formed by a set of polypeptides conserved in all known picornaviruses. in addition, the backbone may be decorated with a few optional proteins unique to a particular virus or virus group. among the proteins, the conservation increases in the order (figures . , . ): this order can be deduced from analyses of virus groups belonging to different phylogenetic ranks-clusters of closely related viruses, of distinct genera, of an entire family, or of the entire picorna-like supergroup (figure . ; a.e. gorbalenya, unpublished results) . slight deviations can only be seen upon analysis of some small taxonomic groups. as already noted, tmev encodes a small, unique l* polypeptide outside the main reading frame at a locus overlapping the l reading frame. 
additionally, the insect picornavirus infv was predicted to encode unique domains as part of the polyprotein (isawa et al., ; a.e. gorbalenya, unpublished results). a closer look at these proteins reveals the following. the l polypeptide preceding the capsid coding region (figure . ) is encoded by all picornaviruses except the entero- and rhinoviruses. the l protein is the most variable of all picornavirus proteins, and it exists in five different versions. l proteins with proteinase activity are encoded only by fmdv, erv and erv (skern, ). cardioviruses and aiv encode three different versions of l polypeptides containing a putative zn finger (a.e. gorbalenya, unpublished observations), while hepato- and parechoviruses and infv appear to encode unique l proteins (najarian et al., ; hyypia et al., ; isawa et al., ). some evolutionary characteristics of the a polypeptides parallel those of the l proteins. entero- and rhinoviruses encode a of the same protein family, known as cysteine chymotrypsin-like proteinases (bazan and fletterick, ), whereas each of the three groups of hepato- and parechoviruses and infv encodes a unique a with unknown function(s). the other picornaviruses, cardio- and aphthoviruses and aiv, encode a a protein having a characteristic c-terminal motif (or a derivative thereof) that has been implicated in the spontaneous separation of a and b proteins during polyprotein synthesis (reviewed in ryan and flint, ).
figure . primary cleavage in picornavirus polyproteins. open boxes at the left end depict l proteins, of which only that of aphthoviruses is a proteinase. of the a coding sequences, only a pro of entero- and rhinoviruses is a proteinase. in cardio- and aphthoviruses, processing at the c-terminus of a is strictly a cis cleavage event. in hepatoviruses, even this cleavage is catalyzed by c pro. modified from ryan and flint, .
besides l and a, the only other protein of the family not conserved at the level of primary structure is vp (palmenberg, ; a.e. gorbalenya, unpublished observation). it may therefore not come as a surprise that vp has been poorly resolved in x-ray analyses of picornavirions whose structure has been solved (lentz et al., ). it is interesting to note that this small protein, which occupies a position upstream of vp , has moved its position to between vp and vp in insect infv. other picornavirus proteins are conserved, albeit to varying degrees. these polypeptides therefore may play similar role(s) in the life cycle of different picornaviruses. for example, b and a are of variable sizes but they contain hydrophobic regions thought to be involved in the anchoring of these proteins to membranes in rna replication complexes. only the hydrophobic patches of b and a polypeptides, however, have been conserved (a.e. gorbalenya, unpublished results). in capsid proteins, the most pronounced conservation is evident in residues critically important for fold maintenance. finally, in the key replicative enzymes c atpase, c pro and d pol as well as in b vpg, the active site residues are amongst the most highly conserved (gorbalenya and koonin, a).
figure . comparison of the genome organizations of the main groups of the picorna-like supergroup. for each picorna-like family group, excluding apv, the conserved organization of an "averaged" genome typical for this group is shown and compared with that of picornaviruses. the genomes are aligned with respect to the position of the d (-like) locus (the rna polymerase). "averaging" was carried out with respect to genome size so that most conserved genome features could be shown. note that bymoviruses, comprising a genus of potyviridae, have a bipartite genome. it is believed that all picorna-like viruses contain vpg at the '-end, although a vpg has not yet been demonstrated for every group of viruses. in picorna-like viruses, proteins were designated so as to reflect their similarity to the prototype picornavirus enzymes, although other nomenclatures may be in use by other investigators. apart from studies with picornaviruses, enzymatic activities have been ascribed to some proteins of como-, poty-, calici- and sequiviruses, but the complete processing map of polyproteins has been established only for como- and potyviruses. the conserved α/β rossmann-fold and palm-like fold comprise only one of the domains of c and d, respectively, or their homologs. for further details, see legends to figures . and . , and the text.
the majority of proteins of picornaviruses, regardless of how well they have been conserved within their own family, have homologs among cellular and other viral proteins. first, the three capsid proteins, vp , vp and vp , have adopted different versions of an eight-stranded antiparallel beta-barrel fold, dubbed "jelly-roll" (rossmann and johnson, ). amongst rna and dna viruses of different families, this fold is the one most commonly used to build icosahedral capsids (rux and burnett, ). it is also conserved in a number of cellular proteins (rossmann, ; orengo et al., ). second, the core domain of d pol, containing several highly conserved sequence motifs, is related to a number of polynucleotide polymerases including rna-dependent rna polymerases of rna viruses, reverse transcriptases of viral and cellular origins, and dna-dependent dna polymerases (hansen et al., ). an analysis of the crystal structure of pv d pol has also identified a (palm) subdomain adopting an rrm-like fold conserved among a number of functionally different proteins, including ribosomal proteins l /l and s as well as the u1a splicing factor (hansen et al., ).
third, two picornavirus proteinases, the ubiquitous c pro and the entero/rhinovirus-specific a pro, have adopted -stranded antiparallel two beta-stranded barrel folds, conserved in cellular serine proteases with chymotrypsin as the prototype (reviewed in skern, ). these picornavirus proteinases also have relatives that are encoded by (+)rna viruses belonging to dozens of different species (gorbalenya and snijder, ; ryan and flint, ). unlike cellular proteases, the picornavirus c pro and a pro employ cysteine as the principal catalytic nucleophile and, in some lineages, have another unique replacement: instead of the catalytic asp they use a glu. the other small family of picornavirus proteinases, l pro of aphthoviruses, erv and erv , is related to cellular papain-like proteases (gorbalenya et al., ; skern, ) whose homologs have been identified in many animal and plant rna viruses as well (gorbalenya and snijder, ). finally, c atpase, whose structure is yet to be solved, belongs to the so-called helicase superfamily iii. this protein group includes polynucleotide-stimulated atpases, some with helicase activity, which are encoded by (+)rna and small dna viruses as well as proteins of cellular origin (gorbalenya and koonin, , b). the c atpase has been predicted to be a three-domain protein. two α/α domains flank an atp-binding domain adopting a variation of the α/β "rossmann" fold, which is widespread in the protein world (teterina et al., ). with respect to details, our current understanding of the function of picornavirus proteins is rather fragmentary. nevertheless, a preliminary functional profile of picornavirus proteins fits patterns of conservation evident at the structural level. the most conserved non-structural proteins provide the basic enzymatic activities needed for the synthesis and expression of viral rnas inside the cell.
the three conserved capsid proteins form the scaffold of virions shielding virus rna from the detrimental environment outside the cell. all these activities appear to be virus-specific, although they may be modulated by cell-encoded components. in contrast, non-conserved viral proteins seem to sense and modify the host environment in addition to serving basic biosynthetic processes programmed by the viral genomes (for instance, see piccone et al., ; zoll et al., ; svitkin et al., ; ventoso et al., ). virions also have host-dependent functions, such as the recognition of the cellular receptor, entry and, possibly, virion maturation. different lines of evidence have shown that the least conserved regions of three capsid proteins, as well as vp , may mediate these early activities of host cell entry (for recent work, see hadfield et al., , and lentz et al., ). much of what has been said about proteins applies also to the terminal ntrs of the picornavirus genomes. these regions are conserved within related genera but they may diverge when groups are compared (e.g. enterovirus/rhinovirus versus cardiovirus/aphthovirus) even though they play identical roles in viral proliferation (discussed in the section on genetics). variants of two very different conserved secondary structure organizations of ires elements are shown in figure . , prototyped by those of pv and emcv. it is unclear what type of 'ntr was encoded by an ancestor of picornaviruses: one resembling one of the contemporary prototypes or rather a "consensus" one (le and maizel, ). in contrast, the 'ntr region has diverged profoundly amongst picornaviruses and it is not conserved even within the otherwise closely related entero- and rhinoviruses (poyry et al., ). the polyprotein of numerous +strand rna viruses has evolved such that its organization reveals an additional level of conservation: the order of mature proteins in this large precursor (figure .
; see also the order of protein domains in the prototype pv in figure . ). this order is inflexible and none of the picornaviruses violates it, although entero- and rhinoviruses do not encode l proteins (see above). despite a near absolute conservation of the order of protein domains, there is some plasticity (figure . ). upon computer sequence analyses of the picorna-like viruses, it has become evident that the polyprotein can be divided into two parts, one comprising capsid proteins and the other the non-structural proteins. these parts are expressed rather independently. in a group of picorna-like insect viruses, rhopalosiphum padi virus (rhpv), drosophila c virus (dcv), plautia stali intestine virus (psiv) and cricket paralysis virus (crpv), non-structural and capsid proteins are encoded by two orfs, separated by an ntr (koonin and gorbalenya, ; johnson and christian, ; moon et al., ; nakashima et al., ; sasaki et al., ). in comoviridae, a family of plant viruses, the capsid and non-structural proteins are encoded by two distinct rnas, rna and rna (goldbach, ). remarkably, in the dendrogram shown in figure . , comoviridae, sequiviridae (a plant virus family having the same polyprotein organization as picornaviridae; turnbull-ross et al., ) and picorna-like insect viruses form a division immediately adjacent to the picornaviridae. it is important to stress that comparison of the sequences of many viruses of the picorna-like supergroup has revealed a profile of sequence conservation that parallels that observed for picornaviridae. thus, two groups of highly conserved clusters can be distinguished in polyproteins. the first group comprises the capsid proteins vp -vp -vp , the second the non-structural proteins c atpase-(vpg)- c pro- d pol (or equivalents). the functions assigned to the individual members of the group of non-structural proteins remain provisional for the majority of known viruses.
they have been inferred largely on the basis of sequence similarities with proteins of well-characterized viruses like poliovirus. among the positionally highly conserved non-structural proteins, the genome-linked protein vpg has a special standing (highlighted by bracketing) since it is conserved functionally rather than structurally in the picorna-like viruses (figure . ; gorbalenya and koonin, a). the combination of conserved non-structural proteins of picorna-like viruses has been termed the "replicative module" (goldbach, ). a similar module of related proteins has also been recognized: the "capsid module", built of three "jelly-roll" proteins. animal caliciviridae, plant potyviridae and insect acyrthosiphon pisum virus, all of which are distantly related to picornaviridae, encode a distinct variety of the replicative module that is associated with one of three unique sets of capsid protein(s) encoded in the '-region of viral genomes (domier et al., ; meyers et al., ; van der wilk et al., ; figure . ). both the protein order in the picornavirus polyprotein and the patterns of expression (proteolytic processing) have been conserved. pairs of neighboring proteins are separated at scissile bonds cleaved by a virus proteinase or, in the case of the vp /vp junction, by an unknown mechanism. it could be hypothesized that the position of protein domains could be changed as long as the corresponding proteins were released more or less independently from the precursor. this, however, is not the case, as the pathway of proteolytic processing in picornavirus polyproteins is not random. furthermore, at least some intermediate precursors, e.g. bc, ab and cd pro, have essential functions that differ from those of the end-products of processing (see the section on genetics). these considerations provide a biological rationale for the observed conservation of the protein order in polyproteins.
we have already pointed out that the order of two conserved units within the polyprotein, the capsid precursor and replicative modules, is flexible. the two least conserved proteins, l and a, flank the capsid precursor at the n- and c-termini, respectively, and bring additional plasticity to the organization of the polyprotein. this is reflected also in terms of expression, i.e. the mechanism of proteolytic processing. processing of the capsid precursors as well as of the replicative module at junctions separating conserved proteins involves exclusively the conserved c/ cd pro proteinases, a mechanism functioning not only in picornaviruses but also in other picorna-like viruses (figure . ; ryan and flint, ). in contrast, the three cleavages separating the poorly conserved l and a proteins from the neighboring polypeptide chains (l/vp , vp / a and a/ b) are carried out by a range of mechanisms in a genus-specific manner. furthermore, whereas picornaviruses use two general pathways of cleavage ( c pro/ cd pro versus distinct mechanisms involving l and a), this genetic repertoire may be further diversified in some picorna-like viruses. for instance, in comoviridae and insect viruses, capsid precursor and non-structural proteins are encoded by distinct orfs (figure . ), which eliminates the need for cleavages separating these polypeptide chains. the c pro/ cd pro proteinases have emerged as the major enzymatic factors in the regulation of protein expression in all picorna- and related viruses. interestingly, the primary structure of sites recognized by these proteases is virus-specific rather than position-specific. among picornaviruses, entero- and rhinoviruses employ sets of structurally uniform sites while viruses of the other genera use more diversified sets. poliovirus and hav exemplify the two extremes. in poliovirus, all eight cleavage sites have the same ("canonical") q/g structure (figure .
), whereas in hav, six variations of this structure were described in different sites (palmenberg, ). poliovirus proteins produced from its replicative module seem to have been exceptionally strongly constrained not only with respect to the type of the terminal amino acids but also with respect to size. mature poliovirus proteins (except c pro), as well as processing intermediates, have sizes that can be divided by without remainder or with only a small remainder (gorbalenya et al., ). this feature separates poliovirus proteins from the overwhelming majority of cellular and viral proteins. the latter are heterogeneous both in size and sequence, particularly at their termini, because of a relative abundance of mutations, including insertions and deletions. structural regularities documented for poliovirus can be visualized in the form of weak primary structure periodicities with the common denominator of comprising the major portion of the replicative module. on the basis of these observations, it has been proposed that the replicative module of picornaviruses has originated from a primitive self-replicating rna molecule through consecutive multistep duplications (gorbalenya, ; gorbalenya et al., ). how it is likely to evolve in the future? we have briefly described different levels of evolutionary conservation in picornaviruses by using results of comparative sequence analyses. the conservation of different properties is the result of a long evolutionary process, accompanied by numerous radiations. does the history of the polyprotein determine how picornaviruses may evolve in the future? we are unaware that this question has ever been directly addressed in experimental studies, although many results obtained by using genetic engineering seem to be quite relevant. these data can technically be separated into two sets: those obtained in studies using site-directed mutagenesis and those aimed at constructing chimeras.
in numerous studies of the first type, it has been observed that different regions of the picornavirus genome exhibit a differential tolerance to replacements (wimmer et al., ). it can be predicted that a profile of the "accepted" mutability, drawn over the entire genome, would fit the conservation profiles described above. such a result would support the hypothesis that the past of picornaviruses influences their future in terms of evolution. however, mutagenesis saturating the genome has never been systematically carried out. therefore, the available "mutagenesis profile" can only be used as a rough approximation of the yet-to-be-defined "accepted" mutability profile in relation to the conservation of modules. the "resolution" of the mutagenesis studies that remained unresolved is potentially relevant to an understanding of the evolution of contemporary picornaviruses, regardless of whether this relates to recent evolutionary events or to the complete historical past. the second group of data involving genome engineering complements the mutagenesis studies and helps to address the question posed above. in wt genomes of entero- and rhinoviruses, the orf of the capsid precursor is preceded by the cognate 'ntr (figure . ). as was observed in studies of poliovirus expression vectors (figure . ), genetically stable variants of poliovirus have been selected (mueller and wimmer, ) in which an additional leader peptide is encoded that is fused to the n-terminus of the polyprotein, just downstream from the 'ntr. this organization may look unique at first sight, but in fact it resembles that of all other picornaviruses distantly related to entero- and rhinoviruses. these terminal appendices in the poliovirus variants resemble the l proteins and, hence, these poliovirus chimeras have a "cardio-like" organization (figure . e). we can speculate that pv has "accepted" an artificial l peptide because a similar event has already happened in the past history of its ancestors.
in a different set of experiments, several poliovirus chimeras have been generated in which the heterologous emcv ires was placed into the sequences specifying scissile bonds of the polyprotein, thereby dividing the polyprotein into two parts (figure . b). this insertion radically modified the conserved protein expression mechanism of picornaviruses, since it functionally replaced a proteolytic cleavage event by an event of internal initiation of translation directed by the alien ires. in all, poliovirus genomes were constructed in which the emcv ires was placed at the y*g cleavage site of a pro (figure . b) or at all possible q*g cleavage sites involving the c pro/ cd pro proteinase (molla et al., , b; paul et al., a). only two poliovirus-emcv dicistronic chimeras, specifically those carrying the emcv ires between vp and a and between a and b, have given rise to viable and stable virus progeny (molla et al., , b; paul et al., a). although the genome organizations of these chimeras do not match anything found in nature, immediate parallels come to mind with genomes of picorna-like viruses in which capsid and replicative modules are encoded by different orfs (for example comoviridae, see above and the section on genetics). these considerations imply that the conserved and non-conserved features in organization and structure of genomes of picornaviruses and even picorna-like viruses are indicative of an evolutionary plasticity and of possible future changes of a picornavirus. perhaps an "evolutionary space" of a picornavirus can be approximated from the past. mechanistically, this can be seen as if the past evolution of the entire family has been "imprinted" in the organization of the genome of each of the contemporary picornaviruses.
phylogenetic trees that have been built for different picornaviral proteins (most often vp , c atpase and d pol) by employing parsimony and maximum-likelihood methods proved roughly topologically equivalent even though different regions of the polyproteins have definitely evolved at different rates (stanway, ; rodrigo and dopazo, ; hyypia et al., ). these observations strongly favor a concerted evolution of (the majority of) the picornavirus proteins. this conclusion is not compromised by some incongruity in the tree topology of closely related viruses, e.g. the c cluster of the enteroviruses (poyry et al., ), or very distantly related groups, e.g. hepato- and parechoviruses. it is likely that some trees generated for different regions look different as a result of technical limitations related to phylogenetic and biopolymer sequence analyses as well as a biased representation of some groups. also, possible recombination events between closely related viruses may have complicated phylogenetic analyses. we shall analyse sequence alignments of picornavirus proteins and polynucleotides with the aim of deducing the mechanisms functioning in picornavirus evolution. uniform sizes of each of the vp , c atpase, c pro or d pol polypeptides have been maintained in all picornaviruses. the diversity of the proteins is therefore most probably the result of numerous in-frame mutations. for the other proteins, some additional mechanism of diversification may have been functioning in the course of evolution. among the viruses encoding a proteins sharing the npgp motif, the two viruses fmdv and erv encode a a consisting of only amino acids, whereas the cardioviruses, erv and aiv encode a a ranging between and residues. it can be speculated that deletion events in the a coding region of fmdv and erv are the result of "jumping" of d pol, perhaps by loop-out deletion or by illegitimate recombination (figure . ).
on the other hand, the three adjacent coding regions for vpg uniquely found in all strains of fmdv suggest duplication events. in other viruses, e.g. erv or tmev, genetic events such as local duplication and deletions may have occurred, leading to considerable size heterogeneity of the corresponding vpgs and adjacent sequences (wutz et al., ; a.e. gorbalenya, unpublished results) . duplications have also been discovered in the 'ntr of enteroviruses (pilipenko et al., a) . picornavirus genomic redundancy, known as duplications, may have been generated by intragenomic recombination. after duplications, however, the sequences must have undergone some variation so as to avoid elimination by homologous recombination. indeed, the nucleotide sequences (and to a small extent also the amino acid sequences) of the three vpgs of fmdv differ such that homologous recombination at this locus is unlikely (cao and wimmer, ) . in spite of lack of evidence, duplications by intragenomic recombination might have been involved in the production of large differences in size found in capsid proteins vp and vp , or in non-structural a and b proteins of some picornaviruses. the capsid proteins contain long extra loops while the b protein of erv has an enormous size relative to the b proteins of all other picornaviruses ( versus - amino acids; wutz et al., ) . on the other hand, charini et al. ( ) have reported that, surprisingly, a viable poliovirus isolate they selected from a swarm of revertants had captured a short segment of cellular ribosomal rna. thus, capture of entirely foreign rna sequences, although very rare, cannot be excluded from the mechanisms of diversification. at least two different mechanisms could have given rise to the contemporary diversity of a and l protein families. the diversity includes, amongst others, chymotrypsin-like proteinase and npgp motif-containing polypeptides for a and papain-like proteinase and zn-finger proteins for l. 
phylogenetic analyses suggest that "new" unrelated a and l proteins have emerged in the course of evolution of picornaviruses on several occasions, following the split of the major groups of the picornavirus tree. it is logical to assume that, following each split, one of the two descendants has arisen from an ancestral viral source, the other from an "independent" source. as to the latter, the coding sequence of either a or l could have recombined with a gene of either another virus or of the cell, leading to the replacement of the ancestral coding sequence. for example, this replacement mechanism could have resulted in the capture of cellular chymotrypsin-like ( a pro) or papain-like (l pro) activities. this hypothesis is, of course, purely speculative since no potential partners in recombination have been identified as yet. alternatively, the diversity of the a and l families may be the result of frame-shifting events. for example, enteroviruses have a "spacer sequence" between the ires and the orf of the polyprotein. this spacer commences with an unused ("silent") aug at the ' border of the ires. in poliovirus, it is nt long and represents a small out-of-frame orf terminating inside the polyprotein orf. if the silent aug at the ' end of the spacer were to trigger initiation of translation and, in addition, a frame-shift mutation connected the small orf with the main orf, a small "leader" peptide would be created fused to the polyprotein. all that is then necessary is a c pro cleavage site to sever the "leader" from vp , and a genetic arrangement would have been created resembling that of cardio- and aphthoviruses (jang et al., ). indeed, the silent aug of poliovirus can be turned on by changing its kozak context (pestova et al., ), and stable poliovirus variants can be isolated that carry short foreign leaders (see above; mueller and wimmer, ).
thus, the conversion of an enterovirus to a cardiovirus genotype with respect to an l protein can be envisioned by relatively simple genetic changes. similarly, it should be possible to convert a cardiovirus genotype in this region into an enterovirus genotype by silencing its l orf. it is relevant that, as already mentioned, a strain of theiler's virus has been identified that, just like the normal cardioviruses, synthesizes a polyprotein-fused l protein and, in addition, a polypeptide l* in a separate orf. l* synthesis is initiated at its own aug initiation codon (takata et al., ) . apparently, the synthesis of l* may present the virus with an advantage in the natural host, a fact that may have contributed to its selection. by comparison with the l protein region in tmev, two a proteins may have existed in the ancestral picornavirus genome, one active in the polyprotein, the other "silent". in the course of subsequent speciation, each of these a variants may have been used in separate picornavirus lineages. the activation of the "silent" a may have led to a concomitant inactivation of the other a. it should be mentioned that the presence of multiple alternative orfs in ancestral picornavirus genomes may have been the rule rather than the exception, particularly if the polyprotein evolved by amplification of -mers (gorbalenya, ; see also above). ohno ( ) has demonstrated that periodicity-organized polynucleotides with a period that cannot be divided by ( -long periodicity included) have an identical coding capacity in each frame. in other words, if one orf is open the two other frames are open also. in the course of evolution, two out of three reading frames may have deteriorated or may have given rise to genetic variation as speculated for the generation of the diversity in a and l proteins. numerous studies attest to a remarkable stability of the picornavirus genotype if grown under identical conditions (wimmer et al., ) . 
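ohno's observation about periodic polynucleotides, mentioned above, can be checked on a toy sequence. the sketch below is only an illustration and is not taken from ohno's work: the function names and the example repeat units are hypothetical, and the stop codons of the standard genetic code (uaa, uag, uga) are assumed. because codons are triplets, tandem repeats of a unit whose length is not a multiple of three present the same cyclic series of codons in every frame, so the three frames are either all open or all closed.

```python
# Toy illustration (hypothetical names and sequences): reading frames in a
# sequence built from tandem repeats of a short unit.

STOP_CODONS = {"UAA", "UAG", "UGA"}  # stop codons of the standard genetic code


def frame_codons(unit, frame, copies=6):
    """Return the codons read in `frame` (0, 1 or 2) from tandem repeats of `unit`."""
    seq = unit * copies
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def open_frames(unit):
    """For each of the three frames, True if no stop codon is encountered."""
    return [not STOP_CODONS.intersection(frame_codons(unit, f)) for f in range(3)]


if __name__ == "__main__":
    # Units of length 5 and 4 (not multiples of three) and 6 (a multiple of three).
    for unit in ("AUGGC", "AUGA", "AUGUAA"):
        print(unit, "period divisible by 3:", len(unit) % 3 == 0, open_frames(unit))
```

with the 5-nt and 4-nt units, the three frames agree (all open or all closed), whereas the 6-nt unit, whose length is a multiple of three, gives frames that differ from one another.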
on the other hand, if exposed to altered conditions in the environment, a shift to new variants can be readily observed. as with other biological systems, it can be assumed that picornavirus speciation has been driven by a changing environment. circumstances under which a picornavirus may encounter a "new" environment include: (1) horizontal or vertical transfer to a new (different) host; (2) entering a natural host through a non-natural gate; (3) infecting immunized (natural) hosts previously exposed to the same virus. although there is no proof, it is intuitively highly likely that all three scenarios have played a role in picornavirus speciation. in the following, a speculative reconstruction of forces will be presented that may have contributed to the evolution of picornaviruses. picornaviruses belonging to a genus or a cluster may have almost identical phenotypes with respect to growth properties and even in regard to pathogenic potential. a most important characteristic, however, does further divide a group of very closely related picornaviruses (e.g. polioviruses): the susceptibility to inactivation by different neutralizing antibodies and, hence, the separation into serotypes (see the introduction). it is logical to assume that the (negative) pressure of the immune system may be largely accountable for serotype diversification of picornaviruses. that is, the immune response can lead to the selection of viral variants resistant to the neutralizing immune response produced by the surviving host. such variants would form a pool from which a new serotype could be further selected. in fact, such a mechanism of virus evolution seems to dominate in the case of influenza a virus or human immunodeficiency virus (hiv). however, the nearly unlimited degree of serotype diversification observed in influenza viruses or hiv is an exception rather than the rule amongst viruses. indeed, not all picornaviruses seem to be able to easily produce new serotypes.
for example, the genus hepatovirus encompasses only one serotype while others are restricted to a few serotypes (e.g. poliovirus). new viral variants that have escaped the immune surveillance must, of course, interact with multiple host components at virtually every stage of their reproduction in order to survive. this includes virus entry into the host cell, translation and processing, genome replication, encapsidation and maturation, and spread in the host. each of these steps is a checkpoint and every new viral variant must be fit to pass these barriers. the earliest events in the infectious cycle (receptor interaction, uptake, uncoating) and the mechanisms of neutralization are amongst the least understood in the molecular biology of picornaviruses. the crystal structures of some member viruses of four picornavirus genera have been solved; examples are: enterovirus, poliovirus and (hogle et al., ); rhinovirus, human rhinoviruses , , , (rossmann et al., ); cardiovirus, mengovirus (luo et al., ), theiler's virus (luo et al., ); aphthovirus, fmdv (acharya et al., ) (for a complete list, see lentz et al., ). however, the precise localization and structures of different neutralization antigenic sites (the structures interacting with neutralizing antibodies) are known only for polioviruses, rhinoviruses and aphthoviruses. for aphthoviruses and for polioviruses, the available evidence suggests that the same structures that determine in part the serotype identity are also involved in receptor recognition (domingo et al., ; mason et al., ; harber et al., ). thus, immune-escaping viral mutants are likely to be enriched in those variants that have maintained the ability to efficiently interact with the cognate receptor and follow the pathway of uptake and uncoating. this is, of course, only speculative but, if correct, it would explain in part serotype restriction (harber et al., ).
in this respect it may be informative to compare receptor specificities with serotype diversities of human enteroviruses, on the one hand, and rhinoviruses on the other. these two genera encompass viruses that have diverged from an immediate common ancestor and radiated during the same time period (figure . ). in the course of evolution, different serotypes in roughly the same numbers have been generated in these two picornavirus branches: there are about enterovirus and over rhinovirus serotypes. this implies that viruses of the two genera are similarly prone to accumulation of changes in those capsid structures giving rise to new serotypes. but what about receptor specificity of these viruses? at the time of writing, two receptors have been assigned for human rhinoviruses (which is probably all that will be found) and six receptors for human enteroviruses (at least four more are awaiting identification; table . ). thus, in contrast to the quite similar extent of serotype diversification in both genera, adaptation to new receptors is significantly more restricted in rhinoviruses than in the closely related enteroviruses. importantly, there is an overlap between the two receptor patterns and, taken together, the icam- receptor specificity appears to be dominant among entero- and rhinoviruses. this can be interpreted to mean that the immediate common ancestor of both entero- and rhinoviruses may have used a receptor related to icam- . regardless of whether this is true or not, the subsequent evolution of icam- -recognizing picornaviruses has proceeded differently, as seen in the disparity of the current use of this cellular receptor (> for rhinoviruses versus for c-cluster coxsackieviruses). given that the serotype diversification has proceeded at a similar pace in entero- and rhinoviruses, enteroviruses may have had greater opportunities, or a greater need, to adapt to new receptors in order to initiate an infection.
this may be related to the function(s) of receptors in viral docking and uncoating: whereas rhinoviruses may need the receptor only for docking and uptake (because of their inherent sensitivity to the acidic ph inside late endosomes), the exceedingly stable enteroviruses do need the receptor (and possibly a co-receptor) for docking, uptake and uncoating. with poliovirus, a particle stable to detergents, proteases and low ph (ph ), this is exemplified in the formation of a-particles, a labile product of receptor/virion interaction and an intermediate in uncoating. a-particle formation appears to involve also sequences of neutralization antigenic sites (harber et al., ). thus, the intercourse between receptor and enterovirion may be much more complex than that between receptor and rhinovirion. consequently, a change in the serotype may have forced enteroviruses to search for new receptors to retain the uncoating capacity of the cellular receptor.
the unusually large serotype diversity of the major receptor group human rhinoviruses may then be explained as follows. it seems possible that the initiation of an infectious cycle of hrv does not critically require an interaction between structures of the neutralization antigenic sites of the virion and icam- . that is, the n-terminal domain of icam- , by inserting itself into the virion's canyon, can effect docking, uptake and uncoating of the particle. progression through any of these events is not critically dependent on sequences of the neutralization antigenic sites. if correct, it follows that variation of the antigenic sites does not restrict viral proliferation and serotype evolution. consistently, in other picornaviruses the neutralization antigenic sites and the determinants recognizing the receptor would be much more overlapping and mutually dependent. it is likely that an initial immune-driven selection might also finally result in a virus variant with changed or extended tissue tropism.
this might have happened with cav v, a c-cluster human enterovirus. immune pressure might have initiated the selection of the cav v mutant derived from a cav swarm. as mentioned before, cav v is a very recent variant of cav and, unlike its parent and the other members of the c-cluster, it can cause acute hemorrhagic conjunctivitis. apart from the possibility that cav v emerged through immune selection, it could also have been selected from a swarm when the parental cav was accidentally inoculated into the eye. another type of selection might have been responsible for the emergence of swine vesicular disease virus (svdv). phylogenetic analysis of genomes of human enteroviruses identified svdv as being interleaved with human viruses comprising the cbv-like cluster (poyry et al., ). this observation is strongly indicative of selection of svdv from a mutant of a human coxsackie-b virus entering the new host through frequent contacts of these domestic animals with (infected) humans.
we have discussed different aspects relevant to picornavirus evolution, but we did not address one crucial question: are picornaviruses a successful family? we believe that the answer is: yes. in discussing this issue, we will also formulate considerations regarding the worldwide eradication of poliovirus. one of the strongest criteria of biological prosperity is the diversity of a taxonomic group. despite some bias inherent in current analyses, phylogenetic studies of picornaviral genomes suggest that picornaviridae have radiated densely over the course of evolution, at both early and late stages (figure . ). furthermore, picornaviruses are members of a superfamily with numerous distant relatives (figure . ) that infect a wide range of organisms, including both plants and animals. some of these viruses, like sequiviridae, employ a genetic plan that is basically a variation of the genetic plan used by picornaviruses (figure . ).
prosperity of the host is another prerequisite for a virus to be successful. by this criterion also, picornaviruses are successful, since the majority of them, representing different branches of the picornavirus tree, infect humans. humans are arguably one of the most successful species in the biological world. in truth, picornaviruses are relatively harmless even though few humans, if any, can escape picornavirus infections. this too can be viewed as evidence that these viruses have adapted well to their host, as they have not significantly undermined human affairs. this is true even for poliovirus, an agent that is commonly regarded as a deadly virus following epidemics of poliomyelitis. however, prior to this century, poliovirus did not cause epidemics, even though it infected humans at rates approaching %. epidemics emerged because human behavior changed through the invention of modern hygiene. hygiene broke the chain of natural immunization through infant infection combined with infant protection by maternal antibodies. even in this century's devastating epidemics, however, only - % of infected individuals developed poliomyelitis.
the poliovirus-human relationship alluded to above deserves to be discussed in more detail. humans, who occupy a unique niche in the biological world (because they care about each human life), did not accept their potential defeat as poliomyelitis became an epidemic. unprecedented efforts combining medical research with modern technologies led to the development of two highly effective poliovirus vaccines, the inactivated poliovirus vaccine by jonas salk and the live attenuated vaccine by albert sabin (wimmer et al., ). through education of the populace and advanced healthcare measures, mass vaccinations have gradually eliminated wild-type poliovirus, first in the developed countries and later in most of the world.
incredibly, the few cases of poliomyelitis in the western hemisphere now result from vaccination with the live sabin strains. overall, polio vaccination is a success story of greatest consequence. indeed, through worldwide efforts led by the world health organization, it is likely that wild-type polioviruses will be eradicated globally by the turn of the century (who, ) . do these considerations allow us to safely conclude that, after its global eradication, poliovirus will have no chance to re-emerge through enterovirus evolution? for discussion of this issue, we will first summarize hypotheses about the possible origin of polioviruses and their closest relatives, the c-cluster coxsackieviruses. the three serotypes of poliovirus belong to the c-cluster of enteroviruses (table . ; figure . b). the most comprehensive analysis of the c-cluster has been performed with sequences of the vp -vp capsids and with sequences of the d p~ rna polymerase (pulli et al., ) . results of these analyses are consistent with data obtained in a study of the other regions of the viral genome using a less representative set of sequences (poyry et al., ) . therefore, these relationships shown in figure . b can be assumed to be quite reliable. a phylogenetic analysis of the capsid vp -vp region of c-cluster viruses indicated that the tree has split at least twice, perhaps before the emergence of an immediate ancestor of polioviruses. the first split led to the separation of a branch encompassing cav , cav and cav from the main c-cluster trunk, and the second, more recent one resulted in the separation of the ancestor for pv and the ancestor for cav , cav , cav , cav , cav and cav b. the results obtained with sequences of d p~ favor an even more complex evolutionary history of poliovirus, including more than five intermediate steps (pulli et al., ) . 
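the tree comparisons in this passage turn on whether a set of viruses forms a single clade (a monophyletic group) or is interleaved with other viruses. a minimal sketch of that test on a rooted tree is given below; the nested-tuple representation, the function names, the placeholder labels (`pv_x`, `cav_a`, ...) and both toy topologies are hypothetical and are not taken from the published trees or figures.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly the taxa in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical toy topologies: one where the polioviruses cluster on their own,
# one where coxsackieviruses are interleaved with them.
clustered_tree = (("cav_a", "cav_b"), (("pv_x", "pv_y", "pv_z"), ("cav_c", "cav_d")))
interleaved_tree = (("cav_a", "cav_b"), (("pv_x", "cav_c"), ("pv_y", ("pv_z", "cav_d"))))

polioviruses = {"pv_x", "pv_y", "pv_z"}
print(is_monophyletic(clustered_tree, polioviruses))    # True
print(is_monophyletic(interleaved_tree, polioviruses))  # False
```

a group that passes this test in a tree built from one genome region but fails it in a tree built from another is what the text describes as clustering in one tree and interleaving in the other.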
consistent with the results of the analysis of the capsid region, cav and cav were among those viruses that diverged from the main trunk relatively early in evolution while the three poliovirus serotypes clustered together with cav , cav , cav and cav . remarkably, in the tree based on d p~ sequences the latter four coxsackieviruses (as well as several other coxsackieviruses) are interleaved with, rather than separated from, the three poliovirus serotypes (figure . ). this stands in contrast to the tree of the capsid region. assuming the most parsimonious scenario of evolution, the combination of these results strongly implies that coxsackieviruses that recognize the icam- receptor formed a pool from which polioviruses, interacting with the cd receptor, have evolved. this conclusion is compatible with a hypothesis of the immune-driven evolution of entero- and rhinoviruses presented above. furthermore, the analyses do not indicate that the three polioviruses comprise a monophyletic subgroup within the c-cluster enteroviruses and, hence, have emerged from an ancestral virus by speciation, as one could expect from a distinct phenotypic profile of these viruses. we have previously hypothesized that the coxsackieviruses may have derived from polioviruses by switching receptors from cd to icam- (harber et al., ). this possibility may be supported by the fact that the ireses of c-cluster coxsackieviruses are highly "neuropathogenic". on the other hand, the assessment presented above favors an evolutionary relationship in the opposite direction. regardless of the direction in which these viruses emerged, the receptor switch has profound consequences for their pathogenic properties: whereas the c-cluster coxsackieviruses cause respiratory disease, poliovirus can cause deadly neurological disease. these considerations may also have important practical implications.
for the sake of the argument, we will assume that the poliovirus eradication campaign has been successfully completed and no more poliovirus particles, including those of the vaccine strains, are circulating worldwide. furthermore, we will assume that all vaccination against poliovirus (including vaccination by inactivated vaccines) has been terminated, a scenario that has been envisioned to be a reality by the end of the next decade. these measures would mark the beginning of a new era in the history of mankind: there will be no human exposure to polioviruses and their antigens. generations of humans will be born that have not been infected with wild-type or vaccine polioviruses and, gradually, they will replace the older generations who carry anti-poliovirus antibodies. at that point, the world will not only be free of poliovirus, but its human population will also no longer carry anti-poliovirus antibodies. thus, a new environment will emerge for human viruses, in particular for c-cluster coxsackieviruses, which are the closest genetic relatives of poliovirus. these c-cavs are expected to circulate widely in the human population, exploring a new evolutionary space. within the human space populated by the c-cavs, there will then exist also a free space that was previously occupied by the three (extinct) poliovirus serotypes. it is possible that mutations in antigenic sites of the c-cavs may (re)generate affinity to cd . prior to eradication, c-cavs carrying such mutations could conceivably be eliminated by anti-poliovirus antibodies (harber et al., ) but in the poliovirus-free world they may remain unchecked. this means that, once emerged, these new viruses carrying poliovirus-like neutralization antigenic sites with cd receptor affinity are less likely to be eliminated from the human population after eradication than before. 
since all enteroviruses, the variants included, lead to enteric infections, these variants may find a passage to the cns and, mediated by their affinity to cd , may cause neurological disease. it is relevant to point out (gromeier and wimmer, unpublished results) that poliovirus chimeric viruses in which the poliovirus ires has been replaced with c-cluster ires elements have been found to be highly neurovirulent in cd tg mice (see the section on pathogenesis). thus, there is reason to fear that in a poliovirus-free world new coxsackievirus-related, poliovirus-like pathogens that can cause poliomyelitis may emerge in the course of natural viral evolution. the time frame, however, cannot be predicted. it could be one generation or years.
the human condition favors an increasing rate of diversity of human viruses simply because of the increasing size of the human population (estimated to stabilize at - billion during the next century). this population explosion will lead to a dramatic increase of human contacts, either in cities, particularly megacities (harboring more than % of the world's population), through travel or otherwise. clearly, this presents a fertile ground for proliferation and diversification of the highly infectious human picornaviruses. thus, the possibilities of genetic variation of picornaviruses leading to new or renewed human pathogens, such as cav v, must always be kept in mind. at this point, however, our considerations of the possible re-emergence of poliovirus-like pathogens in the post-eradication era pale in the face of mankind's heroic attempt to eradicate an rna virus for the first time. after all, poliovirus has caused, and is still causing, terrible human suffering.
picornaviruses have been discovered because they cause diseases in animals and humans. fortunately, most human picornavirus infections are self-limiting.
yet the enormously high rate of picornavirus infections in the human population can lead to a significant incidence of disease complications that may be permanently debilitating or even fatal. the case of poliovirus has taught us that a change of human behavior, which, paradoxically, was the invention of modern hygiene, has greatly aggravated the impact of infection by this specific agent. clearly, this scenario could repeat itself with other human picornavirus species. the terror of this century's poliomyelitis epidemics has driven picornavirus research forward more than any other factor. this work has led to a wealth of discoveries in biology in general, and to an abundance of data describing the unique biology of picornaviruses and their evolution in particular.
picornaviruses employ one of the simplest imaginable genetic systems: they consist of single-stranded rna that encodes only a single multidomain polypeptide, the polyprotein. the rna is packaged into a small, rigid, naked, icosahedral virion whose proteins are unmodified except for a myristate at the n-termini of vp . the rna itself does not contain modified bases. thus, picornaviruses travel with light baggage. on first sight, the replication of picornaviruses is exceedingly simple. after having chosen a receptor from a large menu of cell-surface proteins, the virion enters the cytoplasm and immediately translates its genome, controlled by its ires element. thereafter, the polyprotein is processed by its own proteinases. rna replication occurs by a unique, protein-primed mechanism catalyzed by the rna-dependent rna polymerase. assembly appears to be linked to rna synthesis, and release of the progeny virions follows a passive mechanism. there is no need for a cellular nucleus. indeed, the entire replication cycle can occur in a cell-free system free of nuclei, mitochondria and perhaps of all other cellular organelles.
yet as of now we understand only a small fraction of these viruses' life cycle, and we are awed by the sophistication with which the viruses express their genetic information. the ires, arguably one of the most complex cis-acting signals known in rna systems, has freed picornaviruses from the cellular constraint of cap-dependent translation. this, in turn, allows the primer-dependent rna polymerase, an enzyme with properties generally ascribed only to dna polymerases or reverse transcriptase, to prime with vpg and leave the rna uncapped. polyprotein processing proceeds in a controlled manner yielding cleavage intermediates and end-products that can be used for different functions. thus, the menu of gene products is expanded through the temporal regulation of proteolytic processing. details of all of these steps in replication are still obscure.
the key to ultimately understanding picornaviruses may be to rationalize the huge amount of information about these viruses from the perspective of evolution. it is possible that the replicative apparatus of picornaviruses originated in the precellular world and was subsequently refined in the course of thousands of generations in a slowly evolving environment. picornaviruses cultivated the art of adaptation, which has allowed them to "jump" into new niches offered in the biological world. also, by having chosen humans as an additional host, they were offered an abundance of opportunities to proliferate in different tissues, which has contributed to their diversification. these opportunities have further increased through the human population explosion and through changes in human behavior. we suggest that, in addition to drastic and expansive measures such as global eradication, strategies should be developed that aim at predicting the possible evolution of new picornavirus pathogens and preparing for their control. the results reviewed in this article may contribute to achieving this tantalizing and desirable goal.
the threedimensional structure of foot-and-mouth disease virus at . a resolution restricted growth of attenuated poliovirus strains in cultured cells of a human neuroblastoma paradoxes of the replication of picornaviral genomes polioviruses containing picornavirus type and/or type internal ribosomal entry site elements: genetic hybrids and the expression of a foreign gene picornaviral c cysteine proteinases have a fold similar to chymotrypsin-like serine proteinases attenuated mengo virus as a vector for immunogenic human immunodeficiency virus type i glycoprotein attenuated mengo virus: a new vector for live recombinant vaccines a functional ribonucleoprotein complex forms around the ' end of poliovirus rna poliovirus rna synthesis utilizes an rnp complex formed around the '-end of viral rna engineering poliovirus as a vaccine vector for the expression of diverse antigens expression of animal viral genomes coupled translation and replication of poliovirus rna in vitro: synthesis of functional d polymerase and infectious virus complete replication of poliovirus in vitro: preinitiation rna replication complexes require soluble cellular factors for the synthesis of vpg-linked rna viral cysteine proteases are homologous to the trypsin-like family of serine proteases: structural and functional implications genetic fine structure decay-accelerating factor (cd ), a glycosylphosphatidyylinsitol-anchored complement regulatory protein, is a receptor for several echoviruses identification of the integrin vla- as a receptor for echovirus antibodies to the vitronectin receptor (integrin alpha v beta ) inhibit binding and infection of foot-andmouth disease virus to cultured cells genetic complementation among poliovirus mutants derived from an infectious cdna clone structural and functional characterization of the poliovirus replication complex infectious replicative intermediate of poliovirus: purification and characterization poly(rc) binding protein binds to stem-loop iv of 
the poliovirus rna ' noncoding region: identification by automated liquid chromatography-tandem mass spectrometry poliomyelitis sequences within the poliovirus internal ribosome entry segment control viral rna synthesis nucleotide sequence of an attenuated mutant of coxsackievirus b compared with the cardiovirulent wildtype: assessment of candidate mutations by analysis of a revertant to cardiovirulence intragenomic complementation of a ab mutant in dicistronic polioviruses genetic variation of the poliovirus genome with two vpg coding units trans rescue of a mutant poliovirus rna polymerase function transduction of a human rna sequence by poliovirus initiation of protein synthesis by the eukaryotic translational apparatus on circular rnas a picornaviral protein synthesized out of frame with the polyprotein plays a key role in a virus-induced immune-mediated demyelinating disease nonhomologous rna recombination in a cell-free system: evidence for a transesterification mechanism guided by secondary structure rna duplex unwinding activity of poliovirus rna-dependent rna polymerase d p~ defective interfering particles of poliovirus defective interfering particles of poliovirus. . isolation and physical properties genetics of picomaviruses brefeldin a inhibits cell-free, de novo synthesis of poliovirus role of enterovirus in acute flaccid paralysis after the eradication of poliovirus in brazil the deletion of proximal nucleotides reverts a poliovirus mutant containing a temperaturesensitive lesion in the ' noncoding region of genomic rna the nucleotide sequence of tobacco vein mottling virus rna new observations on antigenic diversification of rna viruses: antigenic variation is not dependent on immune selection expression of virus-encoded proteinases: functional and structural similarities with cellular enzymes temperaturedependent alteration of cross-over sites in poliovirus recombination. 
virology, submitted attenuation of mengo virus through genetic engineering of the ' noncoding poly(c) tract the origin of genetic information: viruses as models recombinants of mahoney and sabin strain poliovirus type : analysis of in vitro phenotypic markers and evidence that resistance to guanidine maps in the nonstructural proteins increased neurovirulence associated with a single nucleotide change in a noncoding region of the sabin type poliovaccine genome the human homolog of ha vcr- codes for a hepatitis a virus cellular receptor structural factors that control conformational transitions and serotype specificity in type poliovirus rate of change of concomitantly variable codons poliovirus-specific primer-dependent rna polymerase able to copy poly(a) covalent linkage of a protein to a defined nucleotide sequence at the '-terminus of virion and replicative intermediate rnas of poliovirus replication of poliovirus in xenopus oocytes requires two human factors two functional complexes formed by kh domain containing proteins with the ' noncoding region of poliovirus rna switch from translation to replication in a positivestranded rna virus functional and genetic plasticities of the poliovirus genome: quasi-infectious rnas modified in the '-untranslated region yield a variety of pseudorevertants molecular evolution of plant rna viruses abstract. europic' origin of rna viral genomes: approaching the problem by comparative sequence analysis superfamily of uvra-related ntp-binding proteins. implications for rational classification of recombination/repair systems comparative analysis of the amino acid sequences of the key enzymes of the replication and expression of positive-strand rna viruses. 
validity of the approach and functional and evolutionary implications helicases: amino acid sequence comparisons and structure-function relationships viral cysteine proteinases poliovirus induced proteinase c: a possible evolutionary link between cellular serine and cysteine proteinase families cysteine proteases of positive strand rna viruses and chymotrypsin-like serine proteases: a distinct protein super-family with a common structural fold putative papain-related thiol proteases of positive-strand rna viruses. identification of rubi-and aphthovirus proteases and delineation of a novel conserved domain associated with proteases of rubi-, alpha-and coronaviruses interaction of rhinovirus with its receptor, icam- mechanism of injury-provoked poliomyelitis prophylactic injections and the onset of paralytic poliomyelitis mouse neuropathogenic poliovirus strains cause damage in the central nervous system different from poliomyelitis internal ribosomal entry site substitution eliminates neurovirulence in intergeneric poliovirus recombinants dual stem loops within the poliovirus internal ribosomal entry site control neurovirulence the human poliovirus receptor/cd promoter directs reportergene expression in floor plate and optic nerve of transgenic mice in vitro construction of poliovirus defective interfering particles attenuation stem-loop lesions in the ' noncoding region of poliovirus rna: neuronal cell-specific translation defects structure of the rna-dependent rna polymerase of poliovirus the catalysis of the poliovirus vpo maturation cleavage is not mediated by serine of vp serotype polymorphism of poliovirus-cellular receptor interaction: separation of events of viral attachment and uptake proteolytic processing in the replication of picornaviruses interaction of the polioviral polypeptide cd pr~ with the ' and ' termini of the poliovirus genome: identification of viral and cellular cofactors necessary for efficient binding proteolytic processing of viral 
polyproteins in the replication of rna viruses ) '-terminal structure of poliovirus polyribosomal rna is pup genetic recombination with newcastle disease virus, polioviruses and influenza. cold spring harbor symp members of the low density lipoprotein receptor family mediate cell entry of a minor-group common cold virus the three dimensional structure of poliovirus at . a resolution mutation frequencies at defined single codon sites in vesicular stomatitis virus and poliovirus can be increased only slightly by chemical mutagenesis vcam- is a receptor for encephalomyocarditis vitrus on murine vascular endothelial cells a distinct picornavirus group identified by sequence analysis classification of enteroviruses based on molecular and biological properties analysis of genetic information of an insect picorna-like virus, infectious flacherie virus of silkworm: evidence for evolutionary relationships among insect, mammalian and plant picorna(-like) viruses efficient infection of cells in culture by type o foot-and-mouth disease virus requires binding to cell surface heparan sulfate a segment of the ' nontranslated region of encephalomyocarditis virus rna directs internal entry of ribosomes during in vitro translation initiation of protein synthesis by internal entry of ribosomes into the ' nontranslated region of encephalomyocarditis virus rna in vitro cap-independent translation of picornavirus rnas: structure and function of the internal ribosomal entry site the polymerase in its labyrinth: mechanisms and implications of rna recombination poliovirus rna recombination: mechanistic studies in the absence of selection the novel genome organization of the insect picoma-like virus drosophila c virus suggests this virus belongs to a previously undescribed virus family determination of encephalomyocarditis viral diabetogenicity by a putative binding site of the viral capsid protein isolation and characterization of defective-interfering particles of poliovirus sabin i 
we are indebted to leena kinnunen for providing figure . , and to steffen mueller for figure . . we thank astrid wimmer for editing parts of the manuscript. work by m.g. and e.w. described here has been supported in part by grants from the national institutes of health, the national cancer institute, and the centers for disease control.