key: cord- -ampip od
authors: Bagowski, Christoph P.; Bruins, Wouter; te Velthuis, Aartjan J. W.
title: The Nature of Protein Domain Evolution: Shaping the Interaction Network
date: - -
journal: Curr Genomics
doi: . /
sha:
doc_id:
cord_uid: ampip od

The proteomes that make up the collection of proteins in contemporary organisms evolved through recombination and duplication of a limited set of domains. These protein domains are essentially the main components of globular proteins and are the most fundamental level at which protein function and protein interactions can be understood. An important aspect of domain evolution is their atomic structure and biochemical function, which are both specified by the information in the amino acid sequence. Changes in this information may bring about new folds, functions and protein architectures. With the present and still increasing wealth of sequence and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. Such investigations not only help predict the function of newly discovered proteins, but also assist in mapping unforeseen pathways of evolution and reveal crucial, co-evolving inter- and intra-molecular interactions. In turn this will help us describe how protein domains shaped cellular interaction networks and the dynamics with which they are regulated in the cell. Additionally, these studies can be used for the design of new and optimized protein domains for therapy. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks.

The protein universe is the collection of proteins of all biological species that exist or have once existed on Earth [ ].
Our sampling and understanding of it began over half a century ago, when the first peptide and protein sequences were determined by Sanger [ , ] and, subsequently, the sequencing of RNA and DNA was developed [ ] [ ] [ ]. In the meantime, the genome projects of the last decade have uncovered an overwhelming amount of sequence data, and researchers are now starting to address a series of fundamental questions that should shed light on protein evolution processes [ ] [ ] [ ] [ ]. For instance, how many gene-encoding sequences are present in one genome? How many sequences are repetitive, and are these sequences similar in the various organisms on Earth? Which genes were involved in the large-scale genome duplications that we see in animals? A comparison of sequences for evolutionary insight is best achieved by looking at the structural and functional (sub)units of proteins, the protein domains. By convention, domains are defined as conserved, functionally independent protein sequences, which bind or process ligands using a core structural motif [ ] [ ] [ ]. Examples of domain modes of action in signaling cascades, for instance, are to connect different components into a larger complex or to bind signaling molecules [ , ]. Protein domains can usually fold independently, likely due to their relatively limited size, and are well known to behave as independent genetic elements within genomes [ , ]. The sum of these features makes protein domains readily identifiable from raw nucleotide and amino acid sequences, and many protein family resources (e.g., SUPERFAMILY and SMART [see Table ]) indeed fully rely on such sequence similarity and motif identifications [ , ]. The algorithms that are used for domain identification are built around a set of simple assumptions that describe the process of evolution.
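As a toy illustration of the sequence-comparison machinery such algorithms build on, the sketch below scores a global (Needleman-Wunsch) alignment. The match/mismatch/gap values are illustrative assumptions standing in for a real substitution matrix such as BLOSUM or PAM, and the sequences are arbitrary examples, not data from any cited tool.

```python
# Minimal Needleman-Wunsch global alignment scorer. The scoring scheme
# (match=2, mismatch=-1, gap=-2) is a toy stand-in for a real
# substitution matrix; values are assumptions for illustration only.

def nw_score(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return the optimal global alignment score for two sequences."""
    n, m = len(seq1), len(seq2)
    # dp[i][j] = best score aligning seq1[:i] with seq2[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # leading gaps in seq2
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # leading gaps in seq1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # (mis)match
                           dp[i - 1][j] + gap,       # gap in seq2
                           dp[i][j - 1] + gap)       # gap in seq1
    return dp[n][m]

print(nw_score("HEAGAWGHEE", "PAWHEAE"))
```

In real pipelines the `sub` lookup would come from a 20x20 amino acid substitution matrix estimated from observed substitution rates, which is exactly what BLOSUM- and PAM-type matrices provide.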
In general, evolution is believed to form and mold genomes largely via three mechanisms, namely i) chemical changes through the incorporation of base analogs, the effects of radiation or random enzymatic errors by polymerases, ii) cellular repair processes that counter mutations, and iii) selection pressures that manifest themselves as the positive or negative influence that determines whether the mutation will be present in subsequent generations [ , ]. By definition, each of these phenomena varies among organisms, for example with different life styles, reproductive strategies, or the lack of apparent polymerase-dependent proofreading such as in positive-stranded RNA viruses [ ] [ ] [ ] [ ]. Consequently, substitution rates need to be calculated to correctly compare two or more sequences and to hunt uncharted genomes for comparable domains. Particularly this last strategy, using general rate matrices like BLOSUM and PAM, is an elegant example of how new protein functions can be discovered [ ] [ ] [ ] [ ] [ ]. Fast algorithms for pair-wise alignments can be found in the Basic Local Alignment Search Tool (BLAST), whereas multiple sequence alignments (MSAs, Fig. a), in which multiple sequences are compared simultaneously, are commonly created with, for example, ClustalX and MUSCLE (see Table ) [ ] [ ] [ ] [ ]. Close relatives, sharing an overall sequence identity above, for example, % and a set of functional properties, can also be grouped into families and subfamilies. In turn, these families also share evolutionary relationships with other domains and together form so-called domain superfamilies [ , ]. Evolutionary distances between related domain sequences can easily be estimated from sequence alignments, provided that the correct rate assumptions are made. Subsequently, these can be used to compute the phylogenies of the domains that share an evolutionary history. These often tree-like graphs (Fig.
b), depend heavily on rate variation models, such as molecular clocks or relaxed molecular clocks (e.g., maximum likelihood and Bayesian estimation), which are calibrated with additional evidence such as fossils and may therefore also provide valuable information on aspects like divergence times and ancestral sequences [ ] [ ] [ ]. [Figure caption: the tree was computed using Bayesian estimation from the alignment in Fig. (a) and presents the best-supported topology for the alignment. Numbers indicate % support by the two methods used, while # indicates gene duplication events in the common ancestor and * marks a species-specific duplication event. For computational details, please see [ ].] Commonly used phylogenetic analysis strategies are listed in Table . A limitation of all inferred phylogenetic data is that it is directly dependent on the alignment and less so on the programs used to build the phylogenetic tree [ ]. One of the shortcomings of automated alignments may thus derive from the fact that they commonly employ a scoring and penalty procedure to find the best possible alignment, since these parameters vary from species to species [ , ], as mentioned above. Careful inspection of alignments is therefore advisable, even though software has been developed that combines the alignment procedure and phylogenetic analysis iteratively in one single program [ ]. Although sequence and phylogenetic analyses provide a relatively straightforward way of looking at domain divergence, comparison of solved protein structures has shown that protein tertiary organization is much more conserved (> %) than the primary sequence (> %) [ ]. For this reason, protein structures and their models provide significantly more insight into the relationships of protein domains and how domain families diverged [ ].
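The simplest rate assumption for estimating evolutionary distance from an alignment can be made concrete with the Jukes-Cantor model, which corrects the observed fraction p of differing sites for unobserved multiple substitutions via d = -(3/4) ln(1 - 4p/3). The sketch below is a minimal illustration with made-up sequences, not the method used by any specific program named above.

```python
import math

def jukes_cantor_distance(seq1, seq2):
    """Estimate substitutions per site between two aligned DNA
    sequences using the Jukes-Cantor correction
    d = -(3/4) * ln(1 - 4p/3), where p is the observed proportion
    of differing sites."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (equal length)")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    p = diffs / len(seq1)
    if p == 0:
        return 0.0
    if p >= 0.75:  # correction undefined at/beyond saturation
        raise ValueError("sequences too divergent for Jukes-Cantor")
    return -0.75 * math.log(1 - 4 * p / 3)

# Identical sequences have distance 0; observed differences are
# inflated to account for multiple hits at the same site.
print(jukes_cantor_distance("ACGTACGT", "ACGTACGT"))  # 0.0
```

Note that the corrected distance always exceeds the raw proportion p of mismatches, which is why uncorrected identity percentages systematically underestimate divergence between distant domains.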
For example, the inactive guanylate kinase (GK) domain present in the MAGUK family was shown to originate from an active form of the GK domain residing in Ca2+ channel beta-subunits (CACNBs) through both sequence and structural comparison [ ]. Furthermore, identification of functionally or structurally related amino acid sites in a fold sheds light on the complex, co-evolutionary dynamics that took place during selection [ ]. As described above, the evolution of a protein domain is generally the result of a combination of a series of random mutations and a selection constraint imposed on function, i.e., the interaction with a ligand. The interaction between protein and ligand can be imagined as disturbances of the protein's energy landscape, which in turn bring about specific, three-dimensional changes in the protein structure [ , ]. Binding energies, however, need not be smoothly distributed over the protein's binding pocket, as a limited number of amino acids may account for most of the free-energy change that occurs upon binding [ ] [ ] [ ]. In these cases, new binding specificities (including loss of binding) may therefore arise through mutations at these hot spots. An example is a recent study of the PDZ domain in which it was shown that only a selected set of residues, and in particular the first residue of the α-helix (αB), directly confers binding to a set of C-terminal peptides [ ]. The folding of a domain is essentially based on a complex network of sequential intramolecular interactions in time [ ]. This has significant implications for domain integrity, particularly if one assumes that the core of a protein domain is, and has to be, largely structurally conserved. Indeed, even single mutations that arise in this area may easily derail the folding process, either because their free energy contribution influences residues in the direct vicinity or disturbs connections higher up in the intramolecular network [ ].
It is therefore hypothesized that protein evolution took place at the periphery of the protein domain core, and that gradual changes via point mutations, insertions and deletions in surface loops brought about the evolutionary distance we see among proteins to date [ , [ ] [ ] [ ]. However, distant sites also contribute to the thermodynamics of catalytic residues. This is achieved through a mechanism called energetic coupling, which is shaped by a continuous pathway of van der Waals interactions that ultimately influences residues at the binding site with similar efficiency as the thermodynamic hotspots [ , ]. Indeed, in such cases evolutionary constraints are not placed on merely one amino acid in the binding pocket, but on two or more residues that can be shown to be statistically coupled in MSAs [ , ]. In addition to contributions to binding, these principles also explain why the core of a domain structure will remain largely conserved, while at functionally related places residues can (rapidly) co-evolve with an overall neutral effect [ ]. Of course, these aspects of co-evolution are also of practical consequence for structure prediction and rational drug design [ ]. Through selective mutation, protein domains have been the tools of evolution to create an enormous and diverse assembly of proteins from what was likely an initially relatively limited set of domains. The combined data in GenBank and other databases now cover over . species with at least complete genomes, and this greatly facilitates genome comparisons [ ] [ ] [ ]. Following such extensive comparisons, currently > domain superfamilies are recognized in the recent release of the Structural Classification of Proteins (SCOP) [ ], and it has become clear that many proteins consist of more than one domain [ , , ]. Indeed, it has been estimated that at least % of the domains are duplicated in prokaryotes, whereas this number may be even higher in eukaryotes, likely reaching up to % [ ].
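Statistical coupling between alignment columns is in practice often approximated with covariation measures; a simple proxy is the mutual information between two MSA columns. The toy alignment below and the use of plain mutual information are illustrative assumptions, not the specific statistical coupling analysis of the cited studies.

```python
import math
from collections import Counter

def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an
    alignment, given as a list of equal-length sequences. High MI
    flags column pairs whose residues vary together, a simple
    covariation proxy for co-evolving positions."""
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    n = len(msa)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in p_ij.items():
        pab = count / n
        mi += pab * math.log2(pab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

# Toy alignment: columns 0 and 1 always vary together (high MI),
# while column 2 is invariant (zero MI with anything).
msa = ["AKL", "GNL", "AKL", "GNL"]
print(column_mutual_information(msa, 0, 1))  # 1.0
print(column_mutual_information(msa, 0, 2))  # 0.0
```

Real covariation analyses additionally correct for phylogenetic relatedness of the sequences and for small-sample bias, which plain mutual information ignores.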
There are various mechanisms through which protein domains or whole proteins may have been duplicated. On the largest scale, whole-genome duplications such as those seen in the vertebrate genomes duplicated whole gene families, including postsynaptic proteins, hormone receptors and muscle proteins, and thereby dramatically increased the domain content and expanded networks [ , , ]. On the other end of the scale, domains and proteins have been duplicated through genetic mechanisms like exon shuffling, retrotranspositions, recombination and horizontal gene transfer [ ] [ ] [ ]. Since the genetic forces, like exon shuffling and genome duplication, vary among species, the total number of domains and the types of domains present fluctuate per genome. Interestingly, comparative analyses of genomes have shown that the number of unique domains encoded by an organism is generally proportional to its genome size [ , ]. Within genomes, the number of domains per gene, the so-called modularity, is related to genome size via a power law, which is essentially the relation between the frequency f and an occurrence x raised to a scaling exponent k (i.e., f(x) ∝ x^k) [ , ]. A similar correlation is found when the multi-domain architecture is compared to the number of cell types present in an organism, i.e., the organism complexity, or when the number of domains in an abundant superfamily is plotted against genome size (Fig. ) [ , ]. Given the amount of domain duplication and the apparent selection for specific multi-domain encoding genes in, for example, vertebrates, it may come as little surprise that not all domains have had the same tendency to recombine and distribute themselves over the genomes [ , ]. In fact, some are highly abundant and can be found in many different multi-domain architectures, whereas others are abundant yet confined to a small sample of architectures, or not abundant at all [ , ].
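A power-law relation f(x) ∝ x^k of the kind described above can be estimated from data by linear regression in log-log space, since log f = log c + k log x. The sketch below fits synthetic, hypothetical data; it illustrates the functional form only and is not an analysis of real genome statistics.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = c * x**k by least squares in log-log space and return
    (c, k). Assumes all values are strictly positive."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    # slope of the log-log regression line is the scaling exponent k
    k = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
         / sum((a - mx) ** 2 for a in lx))
    c = math.exp(my - k * mx)
    return c, k

# Synthetic "domains per gene vs. genome size" data obeying y = 2 * x**0.5
xs = [1, 4, 9, 16, 25]
ys = [2 * x ** 0.5 for x in xs]
c, k = fit_power_law(xs, ys)
print(round(c, 3), round(k, 3))  # 2.0 0.5
```

For noisy real-world counts, fitting in log-log space weights small and large genomes differently, so dedicated maximum-likelihood power-law estimators are usually preferred; the regression above is the textbook first approximation.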
Is there any significant correlation between the propensity to distribute and the functional roles domains have in cellular pathways? Some of the most abundant domains can be found in association with cellular signaling cascades and have been shown to accumulate non-linearly in relation to the overall number of domains encoded or the genome size [ ]. Additionally, the onset of the exponential expansion of the number of abundant and highly recombining domains has been linked to the appearance of multicellularity [ ]. A recurring theme among these abundant domains is the function of protein-protein interaction, and it appears that these usually globular domains have been particularly selected for in more complex organisms [ ]. This positive relation is underlined by the association of these abundant domains with diseases such as cancer and with gene essentiality, as the highly interacting proteins that they are part of occupy central places in cascades and need to orchestrate a high number of molecular connections [ , ]. Their shape and coding regions, which usually lie within the boundaries of one or two exons, make them ideally suited for such selection, since domains are most frequently gained through insertions at the N- or C-terminus and through exon shuffling [ ] [ ] [ ]. From a mutational point of view, protein-protein interaction domains are different from other domains as well, and this appears to be particularly true for the group of small, relatively promiscuous domains like SH and PDZ. These domains are promiscuous in the sense that they both tend to physically interact with a large number of ligands [ , ] and are prone to move through the genome to recombine with many other domains. It has been found that particularly these domains evolve more slowly than non-promiscuous domains [ ].
This likely stems from the fact that they are required to participate in many different interactions, which makes selection pressures more stringent and the branches on phylogenetic trees relatively short and more difficult to assess when co-evolutionary data, in terms of other domains in the same gene family or expression patterns, are limited [ , ]. Non-promiscuous domains, on the other hand, can quite easily evade the selection pressure by obtaining compensatory mutations either within themselves or within their specific binding partner [ ]. The overall phenomenon that the number of protein domains and their modularity increase as the genome expands has not been linked to a conclusive biological explanation yet. A rationale for the increase in interactions and functional subunits, however, may derive from the paradoxical absence of correlation between the number of genes encoded and organism complexity, the so-called G-value paradox [ ]. There is indeed evidence that domains involved in the same functional pathway tend to converge in a single protein sequence, which would make pathways more controllable and reliable without the need for supplementary genes [ ]. Additionally, the number of different arrangements found in higher eukaryotes is, given the vast scale of unique domains present, relatively limited. This in turn implies that evolutionary constraints have played an important role in selecting the right domain combinations and the right order from N- to C-terminus in multi-domain proteins [ , ]. In fact, the ordering and co-occurrence of domains were demonstrated to hold enough evolutionary information to construct a tree of life similar to those based on canonical sequence data [ ]. Furthermore, the increased use of alternative splicing and exon skipping in higher eukaryotes likely supplied a novel way of proteome diversification by restricting gene duplication and stimulating the formation of multi-domain proteins [ , ].
In plants, however, the latter notion is not supported, since both mono- and dicots show limited alternative splicing and more extensive polyploidy [ ] [ ] [ ]. It is clear that some of the above characteristics are underappreciated in the phylogenetic analysis of linear amino acid sequences. Moreover, the effects of evolution extend even further than these aspects and entail transcriptional and translational regulation, intramolecular domain-domain interactions, gene modifications and post-translational protein modifications [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ]. New methods are thus being developed to take into account that when sequences evolve, their close and distant functional relationships evolve in parallel. Correlations of mutations have already been found between residues of different proteins [ , ], and compensating mutational changes at an interaction interface were shown to recover the instability of a complex [ ]. These observations are evidence for the current evolutionary models for the protein-protein interaction (PPI) networks that are being constructed through large-scale screens [ ] [ ] [ ]. In these, a gene duplication or domain duplication (depending on the resolution of the network) implies the addition of a node, while the deletion of a gene or domain reduces the number of links in the network (Fig. ). In the next step, extensive network rewiring may take place, driven by the effect of node addition or node loss in the network (i.e., the duplicability or essentiality of a domain/protein) and mutations in the domain-interaction interface [ , , [ ] [ ] [ ]. Beyond mutations at the domain and protein level, regulation of protein expression provides another vital mechanism through which protein networks can evolve. Microarray studies are now well under way to map genome-wide expression levels of related and non-related genes under a variety of conditions [ , [ ] [ ] [ ].
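The node addition/deletion picture above is often formalized as a duplication-divergence growth model: a duplicated node first inherits its parent's links and then loses each with some probability, mimicking mutations at interaction interfaces. The following is a minimal sketch under assumed parameters (the link-retention probability and the parent-duplicate link), not any specific published model.

```python
import random

def duplication_divergence(n_final, p_keep=0.4, seed=1):
    """Grow a toy protein-interaction network by node (gene/domain)
    duplication. Each duplicate inherits its parent's links, keeps
    each with probability p_keep (link loss models interface
    mutations), and also interacts with its parent (an assumption of
    this sketch)."""
    random.seed(seed)
    # start from two connected proteins; adjacency as dict of sets
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_final:
        parent = random.choice(list(adj))
        new = len(adj)
        kept = {nb for nb in adj[parent] if random.random() < p_keep}
        kept.add(parent)              # duplicate binds its parent
        adj[new] = kept
        for nb in kept:               # keep the network symmetric
            adj[nb].add(new)
    return adj

net = duplication_divergence(20)
print(len(net))  # 20
```

Even this crude rule already produces the broad, heavy-tailed degree distributions reported for real PPI networks, since early, well-connected nodes keep acquiring links from later duplicates.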
For example, transcriptional comparisons have investigated aging [ ] and pathogenicity [ ]. Unfortunately, given the highly variable nature of gene expression and the fact that different species may respond differently to external stimuli, such comparisons can only be performed under strictly controlled research conditions. To date, most studies have therefore focused on the embryogenesis, metamorphosis, sex-dependency and mutation rates of subspecies [ , [ ] [ ] [ ] [ ]. Other studies have revealed valuable information on promoter types and duplication events [ ] [ ] [ ] [ ]. To overcome the limitations mentioned in the previous paragraph, the analysis of co-expression data has been developed to supplement the direct comparison of individual gene expression changes [ ]. In this procedure, a co-expression analysis of gene pairs within each species precedes the cross-comparison of the different organisms in the study. This approach thus primarily focuses on the similarities and differences of the orthologous genes within a network, is therefore ideally suited for the study of protein domain evolution, and has already revealed that species-specific parts of an expression network resulted from a merge of conserved and newly evolved modules [ , , ]. [Figure caption: evolutionary models for protein-protein interactions. The evolution of protein networks is tightly coupled to the addition or deletion of nodes. Additionally, events that introduce mutations in binding interfaces of proteins may result in the addition or loss of links in the network. Node addition may take place through, e.g., domain duplication or horizontal gene transfer, while rewiring of the network is mediated by point mutations, alternative splice variants and changes in gene expression patterns.] Finding evolutionary relationships between protein domains is mostly based on orthology and thus commonly performed on best sequence matches.
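The within-species step of the co-expression procedure described above can be sketched with Pearson correlation over expression profiles: gene pairs that correlate above a threshold in one species are the units later compared across organisms. The gene names, expression values and threshold below are hypothetical, illustrative assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def coexpression_pairs(profiles, threshold=0.9):
    """Return gene pairs whose expression profiles correlate above
    the threshold within one species; these pairs form the
    co-expression network compared across organisms."""
    genes = sorted(profiles)
    return [(g1, g2)
            for i, g1 in enumerate(genes) for g2 in genes[i + 1:]
            if pearson(profiles[g1], profiles[g2]) >= threshold]

# Hypothetical expression profiles measured over four conditions
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 8.1],   # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0],   # unrelated profile
}
print(coexpression_pairs(profiles))  # [('geneA', 'geneB')]
```

Cross-species comparison then asks whether the orthologs of such a correlated pair are also correlated in the other organism, which is what separates conserved from newly evolved co-expression modules.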
Identifying these and categorizing them depends largely on multiple sequence alignments, and this will in most cases give good indications of function, fold and ultimately evolution. However, this approach usually discards apparent ambiguities that arise from species-specific variations (e.g., due to population size, metabolism or species-specific domain duplications or losses) and may therefore introduce significant biases [ ]. Biases may also derive from the method of alignment, the rate variation model used to infer the phylogeny, and the sample size used to build the alignment [ , , ]. Care should therefore be taken not to regard orthology as a one-to-one relationship, but as a family of homologous relations [ ], to select appropriate analysis methods [ , ] and to extend comparative data to protein interactions and expression profiles [ ]. Indeed, as our wealth of biological information expands, our systems perspective will improve and provide us with an opportunity to reveal protein domain evolution at the level of network organization and dynamics. Large-scale expression studies are beginning to show us evolutionary correlations between gene expression levels and timings [ , , , , ], while others demonstrate spatial differences between paralogs or (partial) overlap between interaction partners [ ] [ ] [ ] [ ]. Indeed, when we are able to map the spatiotemporal aspects of inter- and intra-molecular interactions, we will begin to fully understand the versatile power of evolution that shaped the protein universe and life on Earth [ ].
References:
- Phylogenetic continuum indicates "galaxies" in the protein universe: preliminary results on the natural group structures of proteins
- The chemistry of amino acids and proteins
- Some peptides from insulin
- Nucleotide sequence from the coat protein cistron of R bacteriophage RNA
- Use of DNA polymerase I primed by a synthetic oligonucleotide to determine a nucleotide sequence of phage f1 DNA
- DNA sequencing with chain-terminating inhibitors
- The genome sequence of Drosophila melanogaster
- FlyBase: genomes by the dozen
- Initial sequencing and comparative analysis of the mouse genome
- Insights into social insects from the genome of the honeybee Apis mellifera
- Selectivity and promiscuity in the interaction network mediated by protein recognition modules
- Modular peptide recognition domains in eukaryotic signaling
- The multiplicity of domains in proteins
- The modular nature of apoptotic signaling proteins
- Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains
- Protein families and their evolution: a structural perspective
- The folding and evolution of multidomain proteins
- The SUPERFAMILY database in : families and functions
- SMART: identification and annotation of domains from signalling and extracellular protein sequences
- Comparative genomics: genome-wide analysis in metazoan eukaryotes
- Distribution of indel lengths
- Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference: review of concepts, case studies and implications
- Why do species vary in their rate of molecular evolution
- Infidelity of SARS-CoV nsp -exonuclease mutant virus replication is revealed by complete genome sequencing
- Who's your neighbor? New computational approaches for functional genomics
- Protein function in the post-genomic era
- The role of pattern databases in sequence analysis
- Gene Ontology: tool for the unification of biology
- Unique and conserved features of genome and proteome of SARS-coronavirus, an early split-off from the coronavirus group lineage
- PHYLIP version . , Department of Genetics
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
- The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools
- Comparison of methods for searching protein sequence databases
- An insight into domain combinations
- Evolutionary trees from DNA sequences: a maximum likelihood approach
- MrBayes: Bayesian inference of phylogenetic trees
- Mammalian evolution and biomedicine: new views from phylogeny
- Multiple sequence alignment: in pursuit of homologous DNA positions
- Bayesian coestimation of phylogeny and sequence alignment
- The relation between the divergence of sequence and structure in proteins
- Molecular evolution of the MAGUK family in metazoan genomes
- Why should we care about molecular coevolution
- The propagation of binding interactions to remote sites in proteins: analysis of the binding of the monoclonal antibody D . to lysozyme
- Structural stability of binding sites: consequences for binding affinity and allosteric effects
- Revealing the architecture of a K+ channel pore through mutant cycles with a peptide inhibitor
- Structural plasticity in a remodeled protein-protein interface
- A specificity map for the PDZ domain family
- The linkage between protein folding and functional cooperativity: two sides of the same coin?
- Empirical and structural models for insertions and deletions in the divergent evolution of proteins
- Analysis of insertions/deletions in protein structures
- Structural similarity of loops in protein families: toward the understanding of protein evolution
- The effect of inhibitor binding on the structural stability and cooperativity of the HIV- protease
- Evolutionarily conserved pathways of energetic connectivity in protein families
- How frequent are correlated changes in families of protein sequences?
- An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution
- Evolution of vertebrate genes related to prion and Shadoo proteins: clues from comparative genomic analysis
- Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes
- Data growth and its impact on the SCOP database: new developments
- Estimating the number of protein folds and families from complete genome data
- Insights into the molecular evolution of the PDZ-LIM family and identification of a novel conserved protein motif
- Independent elaboration of steroid hormone signaling pathways in metazoans
- Integration of horizontally transferred genes into regulatory interaction networks takes many million years
- Prokaryotic evolution in light of gene transfer
- How the global structure of protein interaction networks evolves
- The impact of comparative genomics on our understanding of evolution
- Modular genes with metazoan-specific domains have increased tissue specificity
- Evolution of protein domain promiscuity in eukaryotes
- The structure of the protein universe and genome evolution
- Modules, multidomain proteins and organismic complexity
- Detecting protein function and protein-protein interaction from genome sequences
- Lethality and centrality in protein networks
- Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks
- Domain deletions and substitutions in the modular protein evolution
- Genome evolution and the evolution of exon-shuffling: a review
- Significant expansion of exon-bordering protein domains during animal proteome evolution
- Thermodynamic basis for promiscuity and selectivity in protein-protein interactions: PDZ domains, a case study
- Promiscuous binding nature of SH domains to their target proteins
- Expansion of genome coding regions by acquisition of new genes
- The geometry of domain combination in proteins
- Different levels of alternative splicing among eukaryotes
- How did alternative splicing evolve?
- Alternative splicing and gene duplication are inversely correlated evolutionary mechanisms
- Polyploidy and genome evolution in plants
- Comparative analysis indicates that alternative splicing in plants has a limited role in functional expansion of the proteome
- Structural characterization of the intramolecular interaction between the SH and guanylate kinase domains of PSD-
- Identification of an intramolecular interaction between the SH and guanylate kinase domains of PSD-
- Interplay of PDZ and protease domain of DegP ensures efficient elimination of misfolded proteins
- Comparative biology: beyond sequence analysis
- A genetic signature of interspecies variations in gene expression
- Genome-wide scan reveals that genetic variation for transcriptional plasticity in yeast is biased towards multi-copy and dispensable genes
- Identification of tightly regulated groups of genes during Drosophila melanogaster embryogenesis
- A gene-coexpression network for global discovery of conserved genetic modules
- Similarities and differences in genome-wide expression data of six organisms
- Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method
- Correlated mutations contain information about protein-protein interaction
- Mutually compensatory mutations during evolution of the tetramerization domain of tumor suppressor p lead to impaired hetero-oligomerization
- Functional organization of the yeast proteome by systematic analysis of protein complexes
- A human protein-protein interaction network: a resource for annotating the proteome
- Protein function, connectivity, and duplicability in yeast
- Evolution and topology in the yeast protein interaction network
- Modularity and evolutionary constraint on proteins
- Comparing genomic expression patterns across species identifies shared transcriptional profile in aging
- Genome-wide functional analysis of pathogenicity genes in the rice blast fungus
- A mutation accumulation assay reveals a broad capacity for rapid evolution of gene expression
- Evolution of gene expression in the Drosophila melanogaster subgroup
- Sex-dependent gene expression and evolution of the Drosophila transcriptome
- Microarray analysis of Drosophila development during metamorphosis
- Conservation and coevolution in the scale-free human gene coexpression network
- Conservation and evolution of gene coexpression networks in human and chimpanzee brains
- Cross-species sequence comparisons: a review of methods and available resources
- Impact of taxon sampling on the estimation of rates of evolution at sites
- Comparative genomics beyond sequence-based alignments: RNA structures in the ENCODE regions
- Comparative analysis of splice form-specific expression of LIM kinases during zebrafish development
- Towards cellular systems in D
- Gene expression map of the Arabidopsis shoot apical meristem stem cell niche
- A gene expression map of Arabidopsis thaliana development

key: cord- -pyb pt
authors: Newell-McGloughlin, Martina; Re, Edward
title: The Flowering of the Age of Biotechnology –
date:
journal: The Evolution of Biotechnology
doi: . / - - - _
sha:
doc_id:
cord_uid: pyb pt

the significance of developing genetic and physical maps of the genome, and the importance of comparing the human genome with those of other species. It also suggested a preliminary focus on improving current technology.
At the request of the U.S. Congress, the Office of Technology Assessment (OTA) also studied the issue, and issued a document in -- within days of the NRC report -- that was similarly supportive. The OTA report discussed, in addition to scientific issues, social and ethical implications of a genome program, together with problems of managing funding, negotiating policy and coordinating research efforts. Prompted by advisers at a meeting in Reston, Virginia, James Wyngaarden, then director of the National Institutes of Health (NIH), decided that the agency should be a major player in the HGP, effectively seizing the lead from DOE. The start of the joint effort was in May (with an "official" start in October), when a -year plan detailing the goals of the U.S. Human Genome Project was presented to members of congressional appropriations committees in mid-February. This document, co-authored by DOE and NIH and titled "Understanding Our Genetic Inheritance, the U.S. Human Genome Project: The First Five Years," examined the then-current state of genome science. The plan also set forth complementary approaches of the two agencies for attaining scientific goals and presented plans for administering the research agenda; it described collaboration between U.S. and international agencies and presented budget projections for the project. According to the document, "a centrally coordinated project, focused on specific objectives, is believed to be the most efficient and least expensive way" to obtain the -billion base pair map of the human genome. In the course of the project, especially in the early years, the plan stated that "much new technology will be developed that will facilitate biomedical and a broad range of biological research, bring down the cost of many experiments (mapping and sequencing), and finding applications in numerous other fields." The plan built upon the reports of the Office of Technology Assessment and the National Research Council on mapping and sequencing the human genome.
"in the intervening two years," the document said, "improvements in technology for almost every aspect of genomics research have taken place. as a result, more specific goals can now be set for the project." the document describes objectives in the following areas mapping and sequencing the human genome and the genomes of model organisms; data collection and distribution; ethical, legal, and social considerations; research training; technology development; and technology transfer. these goals were to be reviewed each year and updated as further advances occured in the underlying technologies. they identified the overall budget needs to be the same as those identified by ota and nrc, namely about $ million per year for approximately years. this came to $ billion over the entire period of the project. considering that in july , the dna databases contained only seven sequences greater than . mb this was a major leap of faith. this approach was a major departure from the single-investigator-based gene of interest focus that research took hitherto. this sparked much controversy both before and after its inception. critics questioned the usefulness of genomic sequencing, they objected to the high cost and suggested it might divert funds from other, more focused, basic research. the prime argument to support the latter position is that there appeared to be are far less genes than accounted for by the mass of dna which would suggest that the major part of the sequencing effort would be of long stretches of base pairs with no known function, the so-called "junk dna." and that was in the days when the number of genes was presumed to be - , . if, at that stage, the estimated number was guessed to be closer to the actual estimate of - , (later reduced to - , ) this would have made the task seem even more foolhardy and less worthwhile to some. 
however, the ever-powerful incentive of new diagnostics and treatments for human disease, beyond what could be gleaned from the gene-by-gene approach, and the rapidly evolving technologies, especially that of automated sequencing, made it both an attractive and plausible aim. charles cantor ( ), a principal scientist for the department of energy's genome project, contended that doe and nih were cooperating effectively to develop organizational structures and scientific priorities that would keep the project on schedule and within its budget. he noted that there would be small short-term costs to traditional biology, but that the long-term benefits would be immeasurable. genome projects were also discussed and developed in other countries, and sequencing efforts began in japan, france, italy, the united kingdom, and canada. even as the soviet union collapsed, a genome project survived as part of the russian science program. the scale of the venture and the manageable prospect for pooling data via computer made sequencing the human genome a truly international initiative. in an effort to include developing countries in the project, unesco assembled an advisory committee in to examine unesco's role in facilitating international dialogue and cooperation. a privately funded human genome organization (hugo) had been founded in to coordinate international efforts and serve as a clearinghouse for data. in that same year the european commission (ec) introduced a proposal entitled the "predictive medicine programme." a few ec countries, notably germany and denmark, claimed the proposal lacked ethical sensitivity; objections to the possible eugenic implications of the program were especially strong in germany (dickson ). the initial proposal was dropped but later modified and adopted in as the "human genome analysis programme" (dickman and aldhous ). this program committed substantial resources to the study of ethical issues.
the need for an organization to coordinate these multiple international efforts quickly became apparent. thus the human genome organization (hugo), which has been called the "u.n. for the human genome," was born in the spring of . composed of a founding council of scientists from seventeen countries, hugo's goal was to encourage international collaboration through coordination of research, exchange of data and research techniques, training, and debates on the implications of the projects (bodmer ). in august nih began large-scale sequencing trials on four model organisms: the parasitic, cell-wall-lacking pathogenic microbe mycoplasma capricolum, the prokaryotic microbial lab rat escherichia coli, the simplest animal, caenorhabditis elegans, and the eukaryotic microbial lab rat saccharomyces cerevisiae. each research group agreed to sequence megabases (mb) at cents a base within years. a sub-living organism had in fact already been fully sequenced: the complete human cytomegalovirus (hcmv) genome, at . mb. that year also saw the casting of the first salvo in the protracted debate on "ownership" of genetic information, beginning with the more tangible question of ownership of cells. and, as with the debates of the early eighties, which were to be revisited later in the nineties, the respondent was the university of california. moore v. regents of the university of california was the first case in the united states to address the issue of who owns the rights to an individual's cells. diagnosed with leukemia, john moore had blood and bone marrow withdrawn for medical tests. suspicious of repeated requests to give samples because he had already been cured, moore discovered that his doctors had patented a cell line derived from his cells, and so he sued. the california supreme court found that moore's doctor did not obtain proper informed consent; however, it also found that moore could not claim property rights over his body.
the quest for the holy grail of the human genome was both inspired by the rapidly evolving technologies for mapping and sequencing and subsequently spurred on the development of ever more efficient tools and techniques. advances in analytical tools, automation, and chemistries, as well as computational power and algorithms, revolutionized the ability to generate and analyze immense amounts of dna sequence and genotype information. in addition to leading to the determination of the complete sequences of a variety of microorganisms and a rapidly increasing number of model organisms, these technologies have provided insights into the repertoire of genes that are required for life, their allelic diversity, and their organization in the genome. but back in many of these were still nascent technologies. the technologies required to achieve this end could be broadly divided into three categories: equipment, techniques, and computational analysis. these are not truly discrete divisions and there was much overlap in their influence on each other. as noted, lloyd smith, michael and tim hunkapiller, and leroy hood conceived the automated sequencer, and applied biosystems inc. brought it to market in june . there is not much doubt that when applied biosystems inc. put it on the market, what had been a dream became decidedly closer to an achievable reality. in automating sanger's chain-termination sequencing system, hood modified both the chemistry and the data-gathering processes. in the sequencing reaction itself, he replaced the radioactive labels, which were unstable, posed a health hazard, and required separate gels for each of the four bases: hood developed chemistry that used fluorescent dyes of a different color for each of the four dna bases. this system of "color-coding" eliminated the need to run several reactions in overlapping gels. the fluorescent labels also addressed another issue, which contributed to one of the major concerns of sequencing: data gathering.
hood integrated laser and computer technology, eliminating the tedious process of information-gathering by hand. as the fragments of dna passed a laser beam on their way through the gel, the fluorescent labels were stimulated to emit light. the emitted light was transmitted by a lens, and the intensity and spectral characteristics of the fluorescence were measured by a photomultiplier tube and converted to a digital format that could be read directly into a computer. during the next thirteen years, the machine was constantly improved, and by a fully automated instrument could sequence up to , , base pairs per year. in , three groups came up with a variation on this approach, developing what is termed capillary electrophoresis: one team was led by lloyd smith (luckey, ), the second by barry karger, and the third by norman dovichi. in , molecular dynamics introduced the megabace, a capillary sequencing machine. not to be outdone, the following year, in , the original of the species came up with the abi prism sequencing machine, also a capillary-based machine, designed to run about eight sets of sequence reactions per day. on the biology side, one of the biggest challenges was the construction of a physical map, to be compiled from many diverse sources and approaches in such a way as to ensure continuity of physical mapping data over long stretches of dna. the development of dna sequence tagged sites (stss) to correlate diverse types of dna clones aided this standardization of the mapping component by providing mappers with a common language and a system of landmarks for all the libraries from such varied sources as cosmids, yeast artificial chromosomes (yacs) and other rdna clones. in this way each mapped element (individual clone, contig, or sequenced region) would be defined by a unique sts. a crude map of the entire genome, showing the order and spacing of stss, could then be constructed.
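the sts idea described above, a unique sequence landmark detectable by a pcr primer pair, can be sketched as a toy "in-silico pcr" check. this is a minimal illustration only; the sequences and the find_sts helper are invented for this sketch and are not drawn from any real sts resource:

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def find_sts(genome, fwd_primer, rev_primer, max_product=2000):
    """Toy in-silico PCR: an STS is 'present' if the forward primer and the
    reverse complement of the reverse primer both occur, in order, within
    a plausible product size. Returns the simulated product coordinates."""
    start = genome.find(fwd_primer)
    if start == -1:
        return None
    rev_site = genome.find(revcomp(rev_primer), start)
    if rev_site == -1:
        return None
    end = rev_site + len(rev_primer)
    if end - start > max_product:
        return None
    return (start, end)

# invented toy "genome" with an embedded STS between the two primer sites
genome = "TTGACA" + "GGCATTACGT" + "ATATATAT" + "CCGGTTAACC" + "AGGTCA"
print(find_sts(genome, "GGCATTACGT", "GGTTAACCGG"))  # → (6, 34)
```

because the sts is defined purely by sequence, any laboratory (or program) holding the two primer sequences can rediscover the same landmark, which is exactly the "common language" role the text describes.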
the order and spacing of these unique identifier sequences composing an sts map was made possible by the development of mullis' polymerase chain reaction (pcr), which allows rapid production of multiple copies of a specific dna fragment, for example, an sts fragment. sequence information generated in this way could be recalled easily and, once reported to a database, would be available to other investigators. with the sts sequence stored electronically, there would be no need to obtain a probe or any other reagents from the original investigator. no longer would it be necessary to exchange and store hundreds of thousands of clones for full-scale sequencing of the human genome, a significant saving of money, effort, and time. by providing a common language and landmarks for mapping, stss allowed genetic and physical maps to be cross-referenced. with a refinement on this technique to go after actual genes, sydney brenner proposed sequencing human cdnas to provide rapid access to the genes, stating that 'one obvious way of finding at least a large part of the important [fraction] of the human genome is to look at the sequences of the messenger rna's of expressed genes' (brenner, ). the following year, the man who was to play a pivotal role on the world stage that became the human genome project suggested a way to implement brenner's approach. that player, nih biologist j. craig venter, announced a strategy to find expressed genes using ests (expressed sequence tags) (adams, ). these so-called ests represent unique stretches of dna within the coding region of a gene, which, as brenner suggested, would be useful for identifying full-length genes and as landmarks for mapping. using this approach, projects were begun to mark gene sites on chromosome maps as sites of mrna expression. to help with this, a more efficient method of handling large chunks of sequence was needed, and two approaches were developed.
yeast artificial chromosomes, which were developed by david burke, maynard olson, and george carle, increased insert size -fold (david t. burke et al., ). caltech's second major contribution to the genome project was developed by melvin simon and hiroaki shizuya. their approach to handling large dna segments was to develop "bacterial artificial chromosomes" (bacs), which basically allow bacteria to replicate chunks greater than , base pairs in length. this efficient production of more stable, large-insert bacs made the latter an even more attractive option, as they had greater flexibility than yacs. in , in a collaboration that presaged the snp consortium, washington university, st. louis, mo, was funded by the pharmaceutical company merck and the national cancer institute to provide sequence from those ests. more than half a million ests were submitted during the project (murr l et al., ). on the analysis side lay the major challenge of managing and mining the vast amount of dna sequence data being generated. a rate-limiting step was the need to develop semi-intelligent algorithms to achieve this herculean task. this is where the discipline of bioinformatics came into play. it had been evolving as a discipline since margaret oakley dayhoff used her knowledge of chemistry, mathematics, biology and computer science to develop this entirely new field in the early sixties. she is in fact credited today as a founder of the field of bioinformatics, in which biology, computer science, and information technology merge into a single discipline. the ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.
there are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information. paralleling the rapid and very public ascent of recombinant dna technology during the previous two decades, the analytic and management tools of the discipline that was to become bioinformatics evolved at a more subdued but equally impressive pace. some of the key developments included tools such as the needleman-wunsch algorithm for sequence comparison, which appeared as early as , even before recombinant dna technology had been demonstrated; the smith-waterman algorithm for sequence alignment ( ); the fastp algorithm ( ) and the fasta algorithm for sequence comparison by pearson and lipman in ; and perl (practical extraction and report language), released by larry wall in . on the data management side, several databases with ever more effective storage and mining capabilities were developed over the same period. the first bioinformatic/biological databases were constructed a few years after the first protein sequences began to become available. the first protein sequence reported was that of bovine insulin in , consisting of residues. nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine trna, with bases. just one year later, dayhoff gathered all the available sequence data to create the first bioinformatic database. one of the first dedicated databases was the brookhaven protein databank, whose collection consisted of ten x-ray crystallographic protein structures (acta. cryst. b, ).
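the earliest of these tools, the needleman-wunsch algorithm, scores a global alignment of two sequences by dynamic programming. a minimal sketch (score only, no traceback, with illustrative match/mismatch/gap weights chosen here for demonstration) might look like this:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Minimal Needleman-Wunsch global alignment score.
    dp[i][j] holds the best score for aligning a[:i] against b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):        # aligning a[:i] against nothing
        dp[i][0] = i * gap
    for j in range(1, cols):        # aligning nothing against b[:j]
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag,            # substitute / match
                           dp[i-1][j] + gap,  # gap in b
                           dp[i][j-1] + gap)  # gap in a
    return dp[-1][-1]

print(needleman_wunsch("GATTACA", "GATTACA"))  # → 7
print(needleman_wunsch("AC", "AGC"))           # → 0
```

the quadratic table is exactly why such "semi-intelligent algorithms" mattered: the same recurrence that is trivial on two short strings becomes the computational bottleneck when scaled to database-sized comparisons, motivating the faster heuristics (fastp, fasta, blast) that followed.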
the year saw the creation of the genetics computer group (gcg) as a part of the university of wisconsin biotechnology center. the group's primary and much-used product was the wisconsin suite of molecular biology tools. it was spun off as a private company in . the swiss-prot database made its debut in in europe, at the department of medical biochemistry of the university of geneva and the european molecular biology laboratory (embl). the first dedicated "bioinformatics" company, intelligenetics, inc., was founded in california in . its primary product was the intelligenetics suite of programs for dna and protein sequence analysis. the first unified federal effort, the national center for biotechnology information (ncbi), was created at nih/nlm in , and it was to play a crucial part in coordinating public databases, developing software tools for analyzing genome data, and disseminating information. and on the other side of the atlantic, oxford molecular group, ltd. (omg) was founded in oxford, uk by anthony marchington, david ricketts, james hiddleston, anthony rees, and w. graham richards. their primary focus was on rational drug design, and their products, such as anaconda, asp, and chameleon, obviously reflected this, as they were applied in molecular modeling and protein design engineering. within two years ncbi was making its mark when david lipman, eugene myers, and colleagues there published the basic local alignment search tool (blast) algorithm for aligning sequences (altschul et al., ). it is used to compare a novel sequence with those contained in nucleotide and protein databases by aligning the novel sequence with previously characterized genes. the emphasis of this tool is to find regions of sequence similarity, which yield functional and evolutionary clues about the structure and function of the novel sequence.
regions of similarity detected via this type of alignment tool can be either local, where the similarity is confined to one region of each sequence, or global, where the alignment spans the full length of the sequences being compared. the fundamental unit of blast algorithm output is the high-scoring segment pair (hsp). an hsp consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cutoff score. this system has been refined and modified over the years, the two principal variants presently in use being ncbi blast and wu-blast (wu signifying washington university). the same year that blast was launched, two other bioinformatics companies appeared. one was informax, in bethesda, md, whose products addressed sequence analysis, database and data management, searching, publication graphics, clone construction, mapping and primer design. the second, molecular applications group, in california (michael levitt and chris lee), was to play a bigger part on the proteomics end; its primary products, look and segmod, were used for molecular modeling and protein design. the following year, in , the human chromosome mapping data repository, the genome data base (gdb), was established. on a more global level, the development of computational capabilities in general, and the internet in particular, was also to play a considerable part in the sharing of data and access to databases that rendered the rapidity of the forward momentum of the hgp possible. also in , edward uberbacher of oak ridge national laboratory in tennessee developed grail, the first of many gene-finding programs. in , the first two "genomics" companies made their appearance: incyte pharmaceuticals, a genomics company headquartered in palo alto, california, was formed, and myriad genetics, inc. was founded in utah.
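the hsp concept defined above can be illustrated with a toy seed-and-extend search in the spirit of blast. this is a deliberately simplified sketch, not the real blast implementation: it uses exact word seeds, extends only while bases keep matching (real blast tolerates mismatches via an x-drop rule), and the word size and scoring parameters are invented for the example:

```python
def blast_hsps(query, subject, w=3, match=2, min_score=6):
    """Toy seed-and-extend: index exact words of length w in the query,
    scan the subject for shared words, extend each seed ungapped in both
    directions, and keep maximal segment pairs (HSPs) whose score meets
    the cutoff. Returns (score, query_start, subject_start, length)."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    hsps = set()
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            qi, sj, length = i, j, w
            # extend leftwards while the preceding bases match
            while qi > 0 and sj > 0 and query[qi - 1] == subject[sj - 1]:
                qi, sj, length = qi - 1, sj - 1, length + 1
            # extend rightwards while the following bases match
            while (qi + length < len(query) and sj + length < len(subject)
                   and query[qi + length] == subject[sj + length]):
                length += 1
            score = length * match  # ungapped, all-match segment
            if score >= min_score:
                hsps.add((score, qi, sj, length))
    return sorted(hsps, reverse=True)

hits = blast_hsps("ACGTACGTGG", "TTACGTACGTAA")
print(hits[0])  # → (16, 0, 2, 8): the shared "ACGTACGT" segment pair
```

note how each reported tuple is precisely an hsp in the sense quoted above: two equal-length fragments, locally maximal (extension stopped only when it could no longer help), with a score at or above the cutoff.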
incyte's stated goal was to lead in the discovery of major common human disease genes and their related pathways. the company discovered and sequenced, with its academic collaborators (originally synteni, from pat brown's lab at stanford), a number of important genes, including, with mary-claire king, epidemiologist at uc berkeley, brca and brca , the genes linked to breast cancer in families with a high degree of incidence before age . by , a low-resolution genetic linkage map of the entire human genome had been published, and u.s. and french teams completed genetic maps of both mouse and man: the mouse, with an average marker spacing of . cm, as determined by eric lander and colleagues at the whitehead institute, and the human, with an average marker spacing of cm, by jean weissenbach and colleagues at ceph (centre d'étude du polymorphisme humain). the latter institute was the subject of a rather scathing book by paul rabinow ( ) based on what they did with this genome map. in , an american biotechnology company, millennium pharmaceuticals, and the ceph developed plans for a collaborative effort to discover diabetes genes. the results of this collaboration could have been medically significant and financially lucrative. the two parties had agreed that ceph would supply millennium with germplasm collected from a large coterie of french families, and millennium would supply funding and expertise in new technologies to accelerate the identification of the genes, terms to which the french government had agreed. but in early , just as the collaboration was to begin, the french government cried halt! the government explained that the ceph could not be permitted to give the americans that most precious of substances, for which there was no precedent in law: french dna. rabinow's book discusses the tangled relations and conceptions at play, such as whether a country can be said to have its own genetic material, in the first but hardly the last franco-american disavowal of détente (paul rabinow, ).
the latest facilities, such as the joint genome institute (jgi), walnut creek, ca, are now able to sequence up to mb per day, which makes it possible to sequence whole microbial genomes within a day. technologies currently under development will probably increase this capacity yet further through massively parallel sequencing and/or microfluidic processing, making it possible to sequence multiple genotypes from several species. nineteen ninety-two saw one of the first shake-ups in the progress of the hgp. that was the year that the first major outsider entered the race, when britain's wellcome trust plunked down $ million to join the hgp. this caused a mere ripple, while the principal shake-ups occurred stateside. much of the debate, and subsequently the direction all the way through the hgp process, was shaped by the personalities involved. as noted, the application of one of the innovative techniques, namely ests, to do an end run on patenting introduced one of those major players into the fray: craig venter. venter, the high-school drop-out who reached the age of majority in the killing fields of vietnam, was to play a pivotal role in a more "civilized" but no less combative field of human endeavor. he came onto the world stage through his initial work on ests while at the national institute of neurological disorders and stroke (ninds) from to . he noted in an interview with the scientist magazine in that there was a degree of ambiguity at ninds about his venturing into the field of genomics: while they liked the prestige of hosting one of the leaders and innovators in this newly emerging field, they were concerned about him moving outside the ninds purview of the human brain and nervous system. ultimately, while he proclaimed to like the security and service infrastructure the institute afforded him, that same system became too restrictive for his interests and talent.
he wanted the whole canvas of human gene expression to be his universe, not just what was confined to the central nervous system. he was becoming more interested in taking a whole-genome approach to understanding the overall structure of genomes and genome evolution, which was much broader than the mission of ninds. he noted, with some irony, in later years that the then current nih director harold varmus had wished in hindsight that nih had pushed to do a similar database in the public domain; clearly, in venter's opinion, varmus was in need of a refresher course in history! bernadine healy, nih director in , was one of the few in a leadership role who saw the technical and fiscal promise of venter's work, and, like all good administrators, she also saw in it an opportunity to resolve a thorny "personnel" issue. she appointed him head of the ad hoc committee to have an intramural genome program at nih, to give the head of the hgp (that other larger-than-life personality, jim watson) notice that he was not the sole arbiter of the direction of the human genome project. however, venter very soon established himself as an equally non-conformist character, with the tacit consent of his erstwhile benefactor. he initially assumed the mantle of a non-conformist through guilt by association rather than direct action, when it was revealed that nih was filing patent applications on thousands of these partial genes based on his ests, catalyzing the first hgp fight at a congressional hearing. nih's move was widely criticized by the scientific community because, at the time, the function of genes associated with the partial sequences was unknown. critics charged that patent protection for the gene segments would forestall future research on them. the patent office eventually rejected the patents, but the applications sparked an international controversy over patenting genes whose functions were still unknown.
interestingly enough, despite nih's reliance on the est/cdna technique, venter, who was now clearly venturing outside the ninds-mandated rubric, could not obtain government funding to expand his research, prompting him to leave nih in . he moved on to become president and director of the institute for genomic research (tigr), a nonprofit research center based in gaithersburg, md. at the same time, william haseltine formed a sister company, human genome sciences (hgs), to commercialize tigr products. venter continued est work at tigr, but also began thinking about sequencing entire genomes. again, he came up with a quicker and faster method: whole-genome shotgun sequencing. he applied for an nih grant to use the method on haemophilus influenzae, but started the project before the funding decision was returned. when the genome was nearly complete, nih rejected his proposal, saying the method would not work. in a triumphal flurry in late may , and with a metaphorical nose-thumbing at his recently rejected "unworkable" grant, venter announced that tigr and collaborators had fully sequenced the first free-living organism, haemophilus influenzae. in november , controversy surrounding venter's research escalated. access restrictions associated with a cdna database developed by tigr and its rockville, md.-based biotech associate, human genome sciences (hgs) inc., including hgs's right to preview papers on resulting discoveries and first options to license products, prompted merck and co. inc. to fund a rival database project. in that year also, britain "officially" entered the hgp race when the wellcome trust plunked down $ million (as mentioned earlier). the following year, hgs was involved in yet another patenting debacle, forced by the rapid march of technology into uncharted patent-law territory. on june , hgs applied for a patent on a gene that produces a "receptor" protein that was later called ccr . at that time hgs had no idea that ccr was an hiv receptor.
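the whole-genome shotgun idea venter championed, sequence random fragments and reassemble the genome from their overlaps, can be caricatured with a greedy merge of toy reads. real assemblers are vastly more sophisticated (they must cope with sequencing errors, repeats, and uneven coverage), so the following is only a sketch with invented helper names and toy data:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b,
    if at least min_len long; 0 otherwise."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Toy greedy shotgun assembly: repeatedly merge the pair of reads
    with the largest suffix/prefix overlap until no merge is possible."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; remaining reads stay as separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# three overlapping reads "shotgunned" from the toy genome ATGGCGTGCAAT
print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAAT"]))  # → ['ATGGCGTGCAAT']
```

the sketch also hints at why nih reviewers doubted the method for large genomes: with repeats longer than a read, distinct genomic regions share identical overlaps and the greedy merge can collapse them, a problem venter's team countered with deep coverage and heavy computation.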
in december , u.s. researcher robert gallo, the co-discoverer of hiv, and colleagues found three chemicals that inhibit the aids virus, but they did not know how the chemicals worked. in february , edward berger at the nih discovered that gallo's inhibitors work in late-stage aids by blocking a receptor on the surface of t-cells. in june of that year, in a period of just days, five groups of scientists published papers saying ccr is the receptor for virtually all strains of hiv. in january , schering-plough researchers told a san francisco aids conference that they had discovered new inhibitors. they knew that merck researchers had made similar discoveries. as a significant valentine, in the u.s. patent and trademark office (uspto) granted hgs a patent on the gene that makes ccr and on techniques for producing ccr artificially. the decision sent hgs stock flying and dismayed researchers. it also caused the uspto to revise its definition of a "patentable" drug target. in the meantime, haseltine's partner in rewriting patenting history, venter, turned his focus to the human genome. he left tigr and started the for-profit company celera, a division of pe biosystems, the company that at times, thanks to hood and hunkapiller, led the world in the production of sequencing machines. using these machines, and the world's largest civilian supercomputer, venter finished assembling the human genome in just three years. following the debacle with the then nih director bernadine healy over patenting the partial genes that resulted from est analysis, another major personality-driven event occurred in that same year. watson strongly opposed the idea of patenting gene fragments, fearing that it would discourage research, and commented that "the automated sequencing machines 'could be run by monkeys.'" (nature june , ). with this dismissal, watson resigned his nih nchgr post in to devote his full-time effort to directing cold spring harbor laboratory.
his replacement was of a rather more pragmatic, less flamboyant nature. while venter was perhaps best described as an idiosyncratic shogun of the shotgun, francis collins was once described as the king arthur of the holy grail that is the human genome project. collins became the director of the national human genome research institute in . he was considered the right man for the job following his success (along with lap-chee tsui) in identifying the gene for the cystic fibrosis transmembrane conductance regulator (cftr), the chloride channel that, when mutated, can lead to the onset of cystic fibrosis. although now indelibly connected with the preeminent topic in biology, like many great innovators in this field before him, francis collins had little interest in biology as he grew up on a farm in the shenandoah valley of virginia. from his childhood he seemed destined to be at the center of drama: his father was a professor of dramatic arts at mary baldwin college, and the early stage management of his career was performed on a stage he built on the farm. while the physical and mathematical sciences held appeal for him, being possessed of a highly logical mind, collins found the format in which biology was taught in the high school of his day mind-numbingly boring, filled with dissections and rote memorization. he found the contemplation of the infinite outcomes of dividing by zero (done deliberately rather than by accident, as in einstein's case) far more appealing than contemplating the innards of a frog. that biology could be gloriously logical only became clear to collins when, in , he entered yale with a degree in chemistry from the university of virginia and was first exposed to the nascent field of molecular biology. anecdotally, it was the tome what is life?, penned by the theoretical physicist father of molecular biology, erwin schrödinger, while exiled at trinity college dublin in , that was the catalyst for his conversion.
like schrödinger, he wanted to do something more obviously meaningful (for less-than-hardcore physicists at least!) than theoretical physics, so he went to medical school at unc-chapel hill after completing his chemistry doctorate at yale, and returned to the site of his road to damascus for post-doctoral study in the application of his newfound interest in human genetics. during this sojourn at yale, collins began working on developing novel tools to search the genome for genes that cause human disease. he continued this work, which he dubbed "positional cloning," after moving to the university of michigan as a professor in . he placed himself on the genetic map when he succeeded in using this method to put the gene that causes cystic fibrosis on the physical map. while a less colorful, in-your-face character than venter, he has his own personality quirks; for example, he pastes a new sticker onto the back of his motorcycle helmet every time he finds a new disease gene. one imagines that particular piece of real estate is getting rather crowded. interestingly, it was not these four-hundred-pound u.s. gorillas who proposed the eventually prescient timeline for a working draft but two from the old power base. in meetings in the us in , john sulston and bob waterston proposed to produce a 'draft' sequence of the human genome by , a full five years ahead of schedule. while agreed by most to be feasible, it meant a rethinking of strategy, involving focusing resources on larger centers and emphasizing sequence acquisition. just as important, it asserted the value of draft-quality sequence to biomedical research. discussion started with the british-based wellcome trust as possible sponsors (marshall e. ). by , a rough draft of the human genome map had been produced, showing the locations of more than , genes. the map was produced using yeast artificial chromosomes, and some chromosomes, notably the littlest, were mapped in finer detail.
these maps marked an important step toward clone-based sequencing. the importance was illustrated in the devotion of an entire edition of the journal nature to the subject (nature : - ). the duel between the public and private faces of the hgp progressed apace over the next five years. following release of the mapping data, some level of international agreement was reached on sequence data release and databases. it was agreed that primary genomic sequence should be in the public domain to encourage research and development and to maximize its benefit to society; that it be rapidly released on a daily basis, with assemblies of greater than kb; and that finished annotated sequence should be submitted immediately to the public databases. in an international consortium completed the sequence of the genome of the workhorse yeast saccharomyces cerevisiae. data had been released as the individual chromosomes were completed. the saccharomyces genome database (sgd) was created to curate this information. the project collects information and maintains a database of the molecular biology of s. cerevisiae. this database includes a variety of genomic and biological information and is maintained and updated by sgd curators. the sgd also maintains the s. cerevisiae gene name registry, a complete list of all gene names used in s. cerevisiae. in a new and more powerful diagnostic tool, the snp (single nucleotide polymorphism), was developed. snps are changes in single letters of our dna code that can act as markers in the dna landscape. some snps are associated closely with susceptibility to genetic disease, our response to drugs or our ability to remove toxins. the snp consortium, although designated a limited company, is a nonprofit foundation organized for the purpose of providing public genomic data.
it is a collaborative effort between pharmaceutical companies and the wellcome trust with the idea of making available a widely accepted, high-quality, extensive, and publicly accessible snp map. its mission was to develop up to , snps distributed evenly throughout the human genome and to make the information related to these snps available to the public without intellectual property restrictions. the project started in april and was anticipated to continue until the end of . in the end, many more snps, about . million total, were discovered than was originally planned. the complete genome sequence of mycobacterium tuberculosis was published by teams from the uk, france, the us and denmark in june . the abi prism sequencing machine, a capillary-based machine designed to run about eight sets of sequence reactions per day, also reached the market that year. that same year the genome sequence of the first multicellular organism, c. elegans, was completed. c. elegans has a genome of about mb and, as noted, is a primitive animal model organism used in a range of biological disciplines. by november the human genome draft sequence had reached mb, and the first complete human chromosome was sequenced. this first was reached on the east side of the atlantic by the hgp team led by the sanger centre, producing a finished sequence for chromosome , which is about million base-pairs and includes at least genes. according to anecdotal evidence, when visiting his namesake centre, sanger asked: "what does this machine do then?" "dideoxy sequencing" came the reply, to which fred retorted: "haven't they come up with anything better yet?" as will be elaborated in the final chapter, the real highlight of was production of a 'working draft' sequence of the human genome, which was announced simultaneously in the us and the uk. in a joint event, celera genomics announced completion of their 'first assembly' of the genome.
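the snps described above are single-letter differences that serve as markers. a minimal sketch of that idea, calling single-base differences between two aligned sequences (the sequences and the function name are invented for illustration; real snp discovery works on alignments of many sequencing reads, not one pair of strings):

```python
# toy snp caller: report positions where two aligned sequences differ
# by a single base. purely illustrative, not any consortium's pipeline.

def snp_positions(ref: str, sample: str) -> list[tuple[int, str, str]]:
    """return (position, ref_base, sample_base) for each single-base difference."""
    if len(ref) != len(sample):
        raise ValueError("sequences must be aligned to equal length")
    return [(i, r, s) for i, (r, s) in enumerate(zip(ref, sample)) if r != s]

# one letter differs between the two haplotypes: a snp marker at position 2
print(snp_positions("GATTACA", "GACTACA"))  # [(2, 'T', 'C')]
```

in practice such markers are only useful once their population frequencies and linkage to nearby genes are known, which is exactly the map the consortium set out to build.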
in a remarkable special issue, nature included a -page article by the human genome project partners, studies of mapping and variation, as well as analyses of the sequence by experts in different areas of biology. science published the article by celera on their assembly of the hgp and celera data, as well as analyses of the use of the sequence. however, demonstrating the sensitivity of the marketplace to presidential utterances, the joint appearance by bill clinton and tony blair touting this major milestone turned into a major cold shower when clinton's reassurance of the access of the people to their genetic information caused a precipitous drop in celera's share value overnight. clinton's assurance that "the effort to decipher the human genome will be the scientific breakthrough of the century -perhaps of all time. we have a profound responsibility to ensure that the life-saving benefits of any cutting-edge research are available to all human beings" (president bill clinton, wednesday, march , ) stands in sharp contrast to the statement from venter's colleague that "any company that wants to be in the business of using genes, proteins, or antibodies as drugs has a very high probability of running afoul of our patents. from a commercial point of view, they are severely constrained -and far more than they realize." (william a. haseltine, chairman and ceo, human genome sciences). the huge sell-off in stocks ended weeks of biotech buying in which those same stocks had soared to unprecedented highs. by the next day, however, the genomic company spin doctors began to recover ground in a brilliant move which turned the clinton announcement into a public relations coup. all major genomics companies issued press releases applauding president clinton's announcement. the real news, they argued, was that "for the first time a president strongly affirmed the importance of gene based patents."
and the same bill haseltine of human genome sciences positively gushed as he happily pointed out that he "could begin his next annual report with the [president's] monumental statement, and quote today as a monumental day." as distinguished harvard biologist richard lewontin notes: "no prominent molecular biologist of my acquaintance is without a financial stake in the biotechnology business. as a result, serious conflicts of interest have emerged in universities and in government service" (lewontin, ). away from the spin doctors, perhaps eric lander may have best summed up the herculean effort when he opined that for him "the human genome project has been the ultimate fulfilment: the chance to share common purpose with hundreds of wonderful colleagues towards a goal larger than ourselves. in the long run, the human genome project's greatest impact might not be the three billion nucleotides of the human chromosomes, but its model of scientific community." (ridley, ).

gene therapy

the year also marked the passing of another milestone that was intimately connected to one of the fundamental drivers of the hgp. the california hereditary disorders act came into force, and with it one of the potential solutions for human hereditary disorders. w. french anderson in the usa reported the first successful application of gene therapy in humans. the first gene therapy for a human disease was successfully achieved for severe combined immune deficiency (scid) by introducing the missing gene, that for adenosine deaminase (ada), into the peripheral lymphocytes of a -year-old girl and returning the modified lymphocytes to her. although the results are difficult to interpret because of the concurrent use of polyethylene glycol-conjugated ada, commonly referred to as pegylated ada (peg-ada), in all patients, strong evidence for in vivo efficacy was demonstrated.
ada-modified t cells persisted in vivo for up to three years and were associated with increases in t-cell number and ada enzyme levels; t cells derived from the transduced peripheral lymphocytes were progressively replaced by marrow-derived t cells, confirming successful gene transfer into long-lived progenitor cells. ashanthi desilva, the girl who received the first credible gene therapy, continues to do well more than a decade later. cynthia cutshall, the second child to receive gene therapy for the same disorder as desilva, also continues to do well. within years (by january ), more than gene therapy protocols had been approved in the us, and worldwide researchers launched more than clinical trials to test gene therapy against a wide array of illnesses. surprisingly, cancer, a disease not typically heading the charts of heritable disorders, has dominated the research. in cancer patients were treated with the gene for tumor necrosis factor, a natural tumor-fighting protein, which worked to a limited extent. even more surprisingly, after the initial flurry of success little has worked. gene therapy, the promising miracle of , failed to deliver on its early promise over the decade. apart from those examples, there are many diseases whose molecular pathology is, or soon will be, well understood, but for which no satisfactory treatments have yet been developed. at the beginning of the nineties it appeared that gene therapy did offer new opportunities to treat these disorders, both by restoring gene functions that had been lost through mutation and by introducing genes that could inhibit the replication of infectious agents, render cells resistant to cytotoxic drugs, or cause the elimination of aberrant cells.
from this "genomic" viewpoint genes could be viewed as medicines, and their development as therapeutics should embrace the issues facing the development of small-molecule and protein therapeutics, such as bioavailability, specificity, toxicity, potency, and the ability to be manufactured at large scale in a cost-effective manner. of course, for such a radical approach certain baseline criteria needed to be established for selecting disease candidates for human gene therapy. these include such factors as: the disease is an incurable, life-threatening disease; the organ, tissue, and cell types affected by the disease have been identified; the normal counterpart of the defective gene has been isolated and cloned; either the normal gene can be introduced into a substantial subfraction of the cells from the affected tissue, or the introduction of the gene into the available target tissue, such as bone marrow, will somehow alter the disease process in the tissue affected by the disease; the gene can be expressed adequately (it will direct the production of enough normal protein to make a difference); and techniques are available to verify the safety of the procedure. an ideal gene therapeutic should, therefore, be stably formulated at room temperature and amenable to administration either as an injectable or aerosol or by oral delivery in liquid or capsule form. the therapeutic should also be suitable for repeat therapy, and when delivered, it should neither generate an immune response nor be destroyed by tissue-scavenging mechanisms. when delivered to the target cell, the therapeutic gene should then be transported to the nucleus, where it should be maintained as a stable plasmid or chromosomal integrant, and be expressed in a predictable, controlled fashion at the desired potency in a cell-specific or tissue-specific manner.
in addition to the ada gene transfer in children with severe combined immunodeficiency syndrome, a gene-marking study of epstein-barr virus-specific cytotoxic t cells and trials of gene-modified t cells expressing suicide or viral resistance genes in patients infected with hiv were pursued in the early nineties. additional strategies for t-cell gene therapy, pursued later in the decade, involve the engineering of novel t-cell receptors that impart antigen specificity for virally infected or malignant cells. issues which still are not resolved include nuclear transport, integration, regulated gene expression and immune surveillance. this knowledge, when finally understood and applied to the design of delivery vehicles of either viral or non-viral origin, will assist in the realization of gene therapeutics as safe and beneficial medicines that are suited to the routine management of human health. scientists are also working on using gene therapy to generate antibodies directly inside cells to block the production of harmful viruses such as hiv or even cancer-inducing proteins. there is a specific connection with francis collins, as his motivation for pursuing the hgp was his pursuit of defective genes, beginning with the cystic fibrosis gene. this gene, called the cf transmembrane conductance regulator, codes for an ion channel protein that regulates salts in the lung tissue. the faulty gene prevents cells from excreting salt properly, causing a thick sticky mucus to build up and destroy lung tissue. scientists have spliced copies of the normal gene into disabled adenoviruses that target lung tissues and have used bronchoscopes to deliver them to the lungs. the procedure worked well in animal studies; however, clinical trials in humans were not an unmitigated success. because the cells lining the lungs are continuously being replaced, the effect is not permanent and must be repeated.
studies are underway to develop gene therapy techniques to replace other faulty genes: for example, to replace the genes responsible for factor viii and factor ix production, whose malfunctioning causes hemophilia a and b respectively, and to alleviate the effects of the faulty gene in dopamine production that results in parkinson's disease. apart from technical challenges, such a radical therapy also engenders ethical debate. many persons who voice concerns about somatic-cell gene therapy use a "slippery slope" argument: it sounds good in theory, but where does one draw the line? there are many thorny ethical issues yet to be resolved in this field: "good" versus "bad" uses of gene modification, the difficulty of following patients in long-term clinical research, and the like. many gene therapy candidates are children who are too young to understand the ramifications of this treatment, and conflicts of interest pit individuals' reproductive liberties and privacy interests against the interests of insurance companies or society. one issue that is unlikely to ever gain acceptance is germline therapy, the removal of deleterious genes from the population. issues of justice and resource allocation have also been raised: in a time of strain on our health care system, can we afford such expensive therapy? who should receive gene therapy? if it is made available only to those who can afford it, then, a number of civil rights groups claim, the distribution of desirable biological traits among different socioeconomic and ethnic groups would become badly skewed, adding a new and disturbing layer of discriminatory behavior. indeed, a major setback occurred before the end of the decade, in .
jesse gelsinger was the first person to die from gene therapy, on september , , and his death created another unprecedented situation when his family sued not only the research team involved in the experiment (u penn) and the company genovo inc., but also the ethicist who offered moral advice on the controversial project. this inclusion of the ethicist as a defendant alongside the scientists and the school was a surprising legal move that put this specialty on notice that its members could be vulnerable to litigation over the philosophical guidance they provide to researchers, as will no doubt be the case with other evolving technologies such as stem cells and therapeutic cloning. the penn group principal investigator, james wilson, approached ethicist arthur caplan about their plans to test the safety of a genetically engineered virus on babies with a deadly form of the liver disorder ornithine transcarbamylase deficiency. the disorder allows poisonous levels of ammonia to build up in the blood system. caplan steered the researchers away from sick infants, arguing that desperate parents could not provide true informed consent. he said it would be better to experiment on adults with a less lethal form of the disease who were relatively healthy. gelsinger fell into that category. although he had suffered serious bouts of ammonia buildup, he was doing well on a special drug and diet regimen. the decision to use relatively healthy adults was controversial because risky, unproven experimental protocols generally use very ill people who have exhausted more traditional treatments, and so have little to lose. in this case, the virus used to deliver the genes was known to cause liver damage, so some scientists were concerned it might trigger an ammonia crisis in the adults.
wilson underestimated the risk of the experiment, omitted disclosure of possible liver damage in earlier volunteers in the experiment, and failed to mention the deaths of monkeys given a similar treatment during pre-clinical studies. a food and drug administration investigation after gelsinger's death found numerous regulatory violations by wilson's team, including the failure to stop the experiment and inform the fda after four successive volunteers suffered serious liver damage prior to the teen's treatment. in addition, the fda said gelsinger did not qualify for the experiment, because his blood ammonia levels were too high just before he underwent the infusion of genetic material. the fda suspended all human gene experiments by wilson and the university of pennsylvania, subsequently restricting him solely to animal studies. a follow-up fda investigation alleged he improperly tested the experimental treatment on animals. financial conflicts of interest also surrounded james wilson, who stood to profit personally from the experiment through genovo, his biotechnology company. the lawsuit was settled out of court for undisclosed terms in november . the fda also suspended gene therapy trials at st. elizabeth's medical center in boston, a major teaching affiliate of tufts university school of medicine, which sought to use gene therapy to reverse heart disease, because scientists there failed to follow protocols and may have contributed to at least one patient death. in addition, the fda temporarily suspended two liver cancer studies sponsored by the schering-plough corporation because of technical similarities to the university of pennsylvania study. some research groups voluntarily suspended gene therapy studies, including two experiments sponsored by the cystic fibrosis foundation and studies at beth israel deaconess medical center in boston aimed at hemophilia. the scientists paused to make sure they learned from the mistakes.
the nineties also saw the development of another "high-throughput" breakthrough, a derivative of the other high-tech revolution, namely dna chips. in biochips were developed for commercial use under the guidance of affymetrix. dna chips or microarrays represent a "massively parallel" genomic technology. they facilitate high-throughput analysis of thousands of genes simultaneously, and are thus potentially very powerful tools for gaining insight into the complexities of higher organisms, including analysis of gene expression, detecting genetic variation, making new gene discoveries, fingerprinting strains and developing new diagnostic tools. these technologies permit scientists to conduct large-scale surveys of gene expression in organisms, thus adding to our knowledge of how they develop over time or respond to various environmental stimuli. these techniques are especially useful in gaining an integrated view of how multiple genes are expressed in a coordinated manner. these dna chips have broad commercial applications and are now used in many areas of basic and clinical research, including the detection of drug resistance mutations in infectious organisms, direct dna sequence comparison of large segments of the human genome, the monitoring of multiple human genes for disease-associated mutations, the quantitative and parallel measurement of mrna expression for thousands of human genes, and the physical and genetic mapping of genomes. however, the initial technologies, or more accurately the algorithms used to extract information, were far from robust and reproducible. the erstwhile serial entrepreneur al zaffaroni (the rebel who in founded alza when syntex ignored his interest in developing new ways to deliver drugs) founded yet another company, affymetrix, under the stewardship of stephen fodor, which was subject to much abuse for providing final extracted data and not allowing access to raw data.
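the kind of per-gene readout such expression arrays enable can be illustrated with a toy calculation: compare fluorescence intensity between two conditions and report a log2 fold change per gene. the gene names and intensity values below are invented for illustration; real array data need background correction and normalization before any ratio is meaningful.

```python
import math

# hypothetical two-condition intensities per gene: (control, treated)
intensities = {
    "gene_a": (100.0, 800.0),  # induced
    "gene_b": (400.0, 100.0),  # repressed
    "gene_c": (250.0, 250.0),  # unchanged
}

def log2_fold_change(control: float, treated: float) -> float:
    """log2 ratio: +1 means expression doubled, -1 means it halved."""
    return math.log2(treated / control)

for gene, (ctrl, trt) in sorted(intensities.items()):
    print(f"{gene}: {log2_fold_change(ctrl, trt):+.2f}")
```

the log scale is the conventional choice because it makes induction and repression symmetric around zero, which is why disputes over access to the raw intensities, as opposed to such derived ratios, mattered so much.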
as with other personalities of this high-throughput era, seattle-bred steve fodor was also somewhat of a polymath, having contributed to two major technologies, microarrays and combinatorial chemistry; the former has delivered on its promise while the latter, like gene therapy, is still in a somewhat extended gestation. and despite the limitations of being an industrial scientist he has had a rather prolific portfolio of publications. his seminal manuscripts describing this work have been published in all the journals of note, science, nature and pnas, and he was recognized in by the aaas with the newcomb cleveland award for an outstanding paper published in science. fodor began his industrial career in yet another zaffaroni firm. in he was recruited to the affymax research institute in palo alto, where he spearheaded the effort to develop high-density arrays of biological compounds. his initial interest was in the broad area of what came to be called combinatorial chemistry. of the techniques developed, one approach permitted high-resolution chemical synthesis in a light-directed, spatially defined format. in the days before positive selection vectors, a researcher might have screened thousands of clones by hand with an oligonucleotide probe just to find one elusive insert. fodor's (and his successors') dna array technology reverses that approach: instead of screening an array of unknowns with a defined probe -a cloned gene, pcr product, or synthetic oligonucleotide -each position or "probe cell" in the array is occupied by a defined dna fragment, and the array is probed with the unknown sample. fodor used his chemistry and biophysics background to develop very dense arrays of these biomolecules by combining photolithographic methods with traditional chemical techniques. the typical array may contain all possible combinations of oligonucleotides of a given length ( -mers, for example) that occur as a "window" tracked along a dna sequence.
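the "window" idea, every oligonucleotide of a fixed length tracked along a target sequence, can be sketched as follows (the 4-mer length and the short target sequence are arbitrary choices for illustration; commercial chips used far longer probes and far longer targets):

```python
# sketch of probe tiling: every overlapping k-mer window along a sequence,
# each of which would occupy one "probe cell" on the array.

def tiling_probes(sequence: str, k: int) -> list[str]:
    """every overlapping k-mer window along the sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

probes = tiling_probes("ACGTACGA", 4)
print(probes)  # ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGA']
```

the window count grows only linearly with sequence length, but synthesizing all possible k-mers (4 to the power k of them) is what demanded the photolithographic density described above.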
it might contain longer oligonucleotides designed from all the open reading frames identified from a complete genome sequence. or it might contain cdnas -of known or unknown sequence -or pcr products. of course, it is one thing to produce data; it is quite another to extract it in a meaningful manner. fodor's group also developed techniques to read these arrays, employing fluorescent labeling methods and confocal laser scanning to measure each individual binding event on the surface of the chip with extraordinary sensitivity and precision. this general platform of microarray-based analysis coupled to confocal laser scanning has become the standard in industry and academia for large-scale genomics studies. in , fodor co-founded affymetrix, where the chip technology has been used to synthesize many varieties of high-density oligonucleotide arrays containing hundreds of thousands of dna probes. in , steve fodor founded perlegen, inc., a new venture that applied the chip technology towards uncovering the basic patterns of human diversity. his company's stated goals are to analyze more than one million genetic variations in clinical trial participants to explain and predict the efficacy and adverse effect profiles of prescription drugs. in addition, perlegen also applies this expertise to discovering genetic variants associated with disease in order to pave the way for new therapeutics and diagnostics. fodor's former company diversified into plant applications by developing a chip for the archetypal model of plant systems, arabidopsis, and supplied pioneer hi-bred with custom dna chips for monitoring maize gene expression. they (affymetrix) have established programs where academic scientists can use company facilities at a reduced price and set up 'user centers' at selected universities. a related but less complex technology called 'spotted' dna chips involves precisely spotting very small droplets of genomic or cdna clones or pcr samples on a microscope slide.
the process uses a robotic device with a print head bearing fine "repeatograph" tips that work like fountain pens to draw up dna samples from a -well plate and spot tiny amounts on a slide. up to , individual clones can be spotted in a dense array within one square centimeter on a glass slide. after hybridization with a fluorescent target mrna, signals are detected by a custom scanner. this is the basis of the systems used by molecular dynamics and incyte (who acquired this technology when it took over synteni). in , incyte was looking to gather more data for its library and perform experiments for corporate subscribers. the company considered buying affymetrix genechips but opted instead to purchase the smaller synteni, which had sprung out of pat brown's stanford array effort. synteni's contact printing technology resulted in dense -and cheaper -arrays. though incyte used the chips only internally, affymetrix sued, claiming synteni/incyte was infringing on its chip density patents. the suit argued that dense biochips -regardless of whether they use photolithography -cannot be made without a license from affymetrix! and in a litigious conga line endemic to this hi-tech era, incyte countersued and for good measure also filed against genetic database competitor gene logic for infringing incyte's patents on database building. meanwhile, hyseq sued affymetrix, claiming infringement of nucleotide hybridization patents obtained by its cso. affymetrix, in turn, filed a countersuit, claiming hyseq infringed the spotted array patents. hyseq then reached back and found an additional hybridization patent it claimed affymetrix had infringed. and so on into the next millennium! in part to avoid all of this, another california company, nanogen, inc., took a different approach to single nucleotide polymorphism discrimination technology.
in an article in the april edition of nature biotechnology, entitled "single nucleotide polymorphic discrimination by an electronic dot blot assay on semiconductor microchips," nanogen described the use of microchips to identify variants of the mannose binding protein gene that differ from one another by only a single dna base. the mannose binding protein (mbp) is a key component of the innate immune system in children who have not yet developed immunity to a variety of pathogens. to date, four distinct variants (alleles) of this gene have been identified, all differing by only a single nucleotide of dna. mbp was selected for this study because of its potential clinical relevance and its genetic complexity. the samples were assembled at the nci laboratory in conjunction with the national institutes of health and transferred to nanogen for analysis. however, from a high-throughput perspective there is a question mark over microarrays. mark benjamin, senior director of business development at rosetta inpharmatics (kirkland, wa), is skeptical about the long-term prospects for standard dna arrays in high-throughput screening, as the first steps require exposing cells and then isolating rna, which is something that is very hard to do in a high-throughput format. another drawback is that most of the useful targets are likely to be unknown (particularly in the agricultural sciences, where genome sequencing is still in its infancy), and dna arrays that are currently available test only for previously sequenced genes. indeed, some argue that current dna arrays may not be sufficiently sensitive to detect the low expression levels of genes encoding targets of particular interest. and the added complication of the companies' reluctance to provide "raw data" means that derived data sets may be created with less than optimal algorithms, thereby irretrievably losing potentially valuable information from the starting material.
reverse engineering is a possible approach, but this is laborious and time-consuming and, being prohibited by many contracts, may arouse the interest of the ever-vigilant corporate lawyers. over the course of the nineties, outgrowths of functional genomics have been termed proteomics and metabolomics, the global studies of gene expression at the protein and metabolite levels respectively. the study of the integration of information flow within an organism is emerging as the field of systems biology. in the area of proteomics, the methods for global analysis of protein profiles and for cataloging protein-protein interactions on a genome-wide scale are technically more difficult but improving rapidly, especially for microbes. these approaches generate vast amounts of quantitative data. the amount of expression data becoming available in the public and private sectors is already increasing exponentially. gene and protein expression data have rapidly dwarfed the dna sequence data and are considerably more difficult to manage and exploit. in microbes, the small sizes of the genomes and the ease of handling microbial cultures will enable high-throughput, targeted deletion of every gene in a genome, individually and in combinations. this is already available on a moderate-throughput scale in model microbes such as e. coli and yeast. combining targeted gene deletions and modifications with genome-wide assay of mrna and protein levels will enable intricate inter-dependencies among genes to be unraveled. simultaneous measurement of many metabolites, particularly in microbes, is beginning to allow the comprehensive modeling and regulation of fluxes through interdependent pathways. metabolomics can be defined as the quantitative measurement of all low molecular weight metabolites in an organism's cells at a specified time under specific environmental conditions.
combining information from metabolomics, proteomics and genomics will help us to obtain an integrated understanding of cell biology. the next hierarchical level of phenotype considers how the proteome within and among cells cooperates to produce the biochemistry and physiology of individual cells and organisms. several authors have tentatively offered "physiomics" as a descriptor for this approach. the final hierarchical levels of phenotype include anatomy and function for cells and whole organisms. the term "phenomics" has been applied to this level of study, and unquestionably the best-known omics, namely economics, has application across all those fields. and, coming slightly out of left field this time, the spectre of eugenics, needless to say, was raised in the omics era. in the year american and british scientists unveiled a technique which has come to be known as pre-implantation genetic diagnosis (pid) for testing embryos in vitro for genetic abnormalities such as cystic fibrosis, hemophilia, and down's syndrome (wald, ). this might be seen by most as a step forward, but it led ethicist david s. king ( ) to decry pid as a technology that could exacerbate the eugenic features of prenatal testing and make possible an expanded form of free-market eugenics. he further argues that, due to social pressures and eugenic attitudes held by clinical geneticists in most countries, it results in eugenic outcomes even though no state coercion is involved, and that, as abortion is not involved and multiple embryos are available, pid is radically more effective as a tool of genetic selection. the first regulatory approval of a recombinant dna technology in the u.s. food supply was not a plant but an industrial enzyme that has become the hallmark of food biotechnology success. enzymes were important agents in food production long before modern biotechnology was developed.
they were used, for instance, in the clotting of milk to prepare cheese, the production of bread and the production of alcoholic beverages. nowadays, enzymes are indispensable to modern food processing technology and have a great variety of functions. they are used in almost all areas of food production, including grain processing, milk products, beer, juices, wine, sugar and meat. chymosin, known also as rennin, is a proteolytic enzyme whose role in digestion is to curdle or coagulate milk in the stomach, efficiently converting liquid milk to a semisolid like cottage cheese and allowing it to be retained for longer periods in a neonate's stomach. the dairy industry takes advantage of this property to conduct the first step in cheese production. chy-max™, an artificially produced form of the chymosin enzyme for cheese-making, was approved in . in some instances such enzymes replace less acceptable "older" technology, as in the case of chymosin. unlike crops, industrial enzymes have had a relatively easy passage to acceptance for a number of reasons. as noted, they are part of the processing system and theoretically do not appear in the final product. today about % of the hard cheese in the us and uk is made using chymosin from genetically modified microbes. it is easier to purify, more active ( % as compared to %) and less expensive to produce (microbes are more prolific, more productive and cheaper to keep than calves). like all enzymes it is required only in very small quantities, and because it is a relatively unstable protein it breaks down as the cheese matures. indeed, if the enzyme remained active for too long it would adversely affect the development of the cheese, as it would degrade the milk proteins to too great a degree. such enzymes have gained the support of vegetarian organizations and of some religious authorities. 
for plants, the nineties was the era of the first widespread commercialization of what came to be known, in often deprecating and literally inaccurate terms, as gmos (genetically modified organisms). when the nineties dawned, dicotyledonous plants were relatively easily transformed with agrobacterium tumefaciens, but many economically important plants, including the cereals, remained inaccessible to genetic manipulation because of the lack of effective transformation techniques. in , this changed with a technology that overcame the limitation. michael fromm, a molecular biologist at the plant gene expression center, reported the stable transformation of corn using a high-speed gene gun. the method, known as biolistics, uses a "particle gun" to shoot metal particles coated with dna into cells. initially a gunpowder charge, subsequently replaced by helium gas, was used to accelerate the particles in the gun. there is minimal disruption of tissue, and the success rate has been extremely high for applications in several plant species. the technology rights are now owned by dupont. in , some of the first field trials of the crops that would dominate the second half of the nineties began, including bt corn (with the bacillus thuringiensis cry protein discussed in chapter three). in , the fda declared that genetically engineered foods are "not inherently dangerous" and do not require special regulation. since , researchers have pinpointed and cloned several of the genes that make selected plants resistant to certain bacterial and fungal infections; some of these genes have been successfully inserted into crop plants that lack them. many more infection-resistant crops are expected in the near future, as scientists find more plant genes in nature that make plants resistant to pests. plant genes, however, are just a portion of the arsenal; microorganisms other than bt also are being mined for genes that could help plants fend off invaders that cause crop damage. 
the major milestone of the decade in crop biotechnology was approval of the first bioengineered crop plant in . it represented a double first: not just the first approved food crop, but also the first commercial validation of a technology which was to be surpassed later in the decade. that technology, antisense, works because nucleic acids have a natural affinity for each other. when a copy of the gene coding for the target is introduced into the genome in the opposite orientation, the reverse rna strand anneals to the normal transcript and effectively blocks expression of the enzyme. this technology was patented by calgene for plant applications and was the technology behind the famous flavr savr tomatoes. the first success for antisense in medicine came in , when the u.s. food and drug administration gave the go-ahead to the cytomegalovirus (cmv) inhibitor fomivirsen, a phosphorothioate antiviral for the aids-related condition cmv retinitis, making it the first drug belonging to isis, and the first antisense drug ever, to be approved. another technology, although not apparent at the time, was behind both the second approval and also the first and, to date, only successful commercial tree fruit biotech application. the former was a virus-resistant squash, the latter the ringspot virus-resistant papaya. both owed their existence as much to historic experience as to modern technology. genetically engineered virus-resistant strains of squash and cantaloupe, for example, would never have made it to farmers' fields if plant breeders in the 's had not noticed that plants infected with a mild strain of a virus do not succumb to more destructive strains of the same virus. that finding led plant pathologist roger beachy, then at washington university in saint louis, to wonder exactly how such "cross-protection" worked -did part of the virus prompt it? in collaboration with researchers at monsanto, beachy used an a. 
tumefaciens vector to insert into tomato plants a gene that produces one of the proteins that make up the coat of the tobacco mosaic virus. he then inoculated these plants with the virus and was pleased to discover, as reported in , that the vast majority of plants did not succumb to the virus. eight years later, in , virus-resistant squash seeds created with beachy's method reached the market, to be followed soon by bioengineered virus-resistant seeds for cantaloupes, potatoes, and papayas. (breeders had already created virus-resistant tomato seeds by using traditional techniques.) and the method of protection still remained a mystery when the first approvals were given in and . gene silencing was perceived initially as an unpredictable and inconvenient side effect of introducing transgenes into plants. it now seems that it is the consequence of accidentally triggering the plant's adaptive defense mechanism against viruses and transposable elements. this recently discovered mechanism, although mechanistically different, has a number of parallels with the immune system of mammals. how this system worked was not elucidated until later in the decade, by a researcher who was seeking a very different holy grail -the black rose! rick jorgensen, at that time at dna plant technologies in oakland, ca, and subsequently of the university of california, davis, attempted to overexpress the chalcone synthase gene by introducing a modified copy under a strong promoter. surprisingly, he obtained white flowers, and many strange variegated purple and white variations in between. this was the first demonstration of what has come to be known as post-transcriptional gene silencing (ptgs). while initially it was considered a strange phenomenon limited to petunias and a few other plant species, it is now one of the hottest topics in molecular biology. 
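the base-pairing logic that underlies antisense blocking (and, later in the story, dsrna-triggered silencing) can be sketched in a few lines of code. this is an illustrative sketch only, not anything from the source: the sequence is invented, and real silencing additionally depends on secondary structure and cellular machinery that the sketch ignores.

```python
# Illustrative sketch (not from the source): the transcript of a gene
# inserted in reverse orientation is the reverse complement of the
# sense mRNA, so the two strands can anneal base-for-base when aligned
# antiparallel. The sequence below is invented for illustration.

def reverse_complement_rna(seq: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

mrna = "AUGGCUUACGGA"                      # hypothetical sense mRNA fragment
antisense = reverse_complement_rna(mrna)   # transcript of the inverted copy

# aligned antiparallel, every position forms a canonical Watson-Crick pair
pairs = set(zip(mrna, reversed(antisense)))
assert pairs <= {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
print(antisense)  # the strand that anneals to and blocks the mRNA
```

applying the function twice returns the original sequence, which is just another way of saying the sense and antisense strands are complementary to each other.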
rna interference (rnai) in animals and basal eukaryotes, quelling in fungi, and ptgs in plants are examples of a broad family of phenomena collectively called rna silencing (hannon ; plasterk ) . in addition to its occurrence in these species, it has roles in viral defense (as demonstrated by beachy) and transposon silencing mechanisms, among other things. perhaps most exciting, however, is the emerging use of ptgs and, in particular, rnai -ptgs initiated by the introduction of double-stranded rna (dsrna) -as a tool to knock out expression of specific genes in a variety of organisms. nineteen ninety-one also heralded yet another first. the february , issue of science reported the patenting of "molecular scissors": the nobel-prize-winning discovery of enzymatic rna, or "ribozymes," by thomas cech of the university of colorado. it was noted that the u.s. patent and trademark office had awarded an "unusually broad" patent for ribozymes. the patent is u.s. patent no. , , , claim of which reads as follows: "an enzymatic rna molecule not naturally occurring in nature having an endonuclease activity independent of any protein, said endonuclease activity being specific for a nucleotide sequence defining a cleavage site comprising single-stranded rna in a separate rna molecule, and causing cleavage at said cleavage site by a transesterification reaction." although enzymes made of protein are the dominant form of biocatalyst in modern cells, there are at least eight natural rna enzymes, or ribozymes, that catalyze fundamental biological processes. one of these was yet another discovery by plant virologists: in this instance, the hairpin ribozyme, discovered by george bruening at uc davis. the self-cleavage structure was originally called a paperclip by the bruening laboratory, which discovered the reactions. as mentioned in chapter , it is believed that these ribozymes might be the remnants of an ancient form of life that was guided entirely by rna. 
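the patent claim's phrase "specific for a nucleotide sequence defining a cleavage site" can be illustrated with a toy scan for candidate sites in a target rna. the sketch assumes the widely cited rule of thumb that hammerhead-type ribozymes cleave 3' of an NUH triplet (H = A, C or U, with GUC the classic site); the target sequence is invented, and real target selection also depends on accessibility and flanking arm design.

```python
# Toy scan for candidate hammerhead-style cleavage sites in a target
# RNA. Assumes the commonly cited NUH triplet rule (H = A, C or U);
# the target sequence below is invented for illustration.

def candidate_cleavage_sites(rna: str):
    """Return (position, triplet) for each NUH triplet in the RNA."""
    sites = []
    for i in range(len(rna) - 2):
        n, u, h = rna[i], rna[i + 1], rna[i + 2]
        if u == "U" and h in "ACU":        # the NUH rule
            sites.append((i, rna[i:i + 3]))
    return sites

target = "GGAUGUCAAGUACUG"                 # hypothetical target RNA
print(candidate_cleavage_sites(target))    # e.g. a GUC site and a GUA site
```

in practice a designed ribozyme would then carry antisense "arms" complementary to the sequence flanking the chosen triplet, which is what makes the cleavage sequence-specific.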
a ribozyme is a catalytic rna molecule capable of cleaving itself and other target rnas, and it can therefore be useful as a control system for turning off genes or targeting viruses. the possibility of designing ribozymes to cleave any specific target rna has rendered them valuable tools in both basic research and therapeutic applications. in the therapeutics area, they have been exploited to target viral rnas in infectious diseases, dominant oncogenes in cancers and specific somatic mutations in genetic disorders. most notably, several ribozyme gene therapy protocols for hiv patients are already in phase trials. more recently, ribozymes have been used for transgenic animal research, gene target validation and pathway elucidation. however, targeting ribozymes to the cellular compartment containing their target rnas has proved a challenge. at the other bookend of the decade, in , samarsky et al. reported that a family of small rnas in the nucleolus (snornas) can readily transport ribozymes into this subcellular organelle. in addition to the already extensive panoply of rna entities, yet another has potential for mischief. viroids are small, single-stranded, circular rnas containing - nucleotides arranged in a rod-like secondary structure, and they are the smallest pathogenic agents yet described. the smallest viroid characterized to date is rice yellow mottle sobemovirus (rymv), at nucleotides. in comparison, the genome of the smallest known viruses capable of causing an infection by themselves, the single-stranded circular dna of circoviruses, is around kilobases in size. the first viroid to be identified was the potato spindle tuber viroid (pstvd). some species have been identified to date. unlike the many satellite or defective interfering rnas associated with plant viruses, viroids replicate autonomously on inoculation of a susceptible host. 
the absence of a protein capsid and of detectable messenger rna activity implies that the information necessary for replication and pathogenesis resides within the unusual structure of the viroid genome. the replication mechanism actually involves interaction with rna polymerase ii, an enzyme normally associated with synthesis of messenger rna, and "rolling circle" synthesis of new rna. some viroids have ribozyme activity, which allows self-cleavage and ligation of unit-size genomes from larger replication intermediates. it has been proposed that viroids are "escaped introns". viroids are usually transmitted by seed or pollen, and infected plants can show distorted growth. from its earliest years, biotechnology attracted interest outside scientific circles. initially the main focus of public interest was on the safety of recombinant dna technology and on the possible risks of creating uncontrollable and harmful novel organisms (berg , ) . the debate on the deliberate release of genetically modified organisms, and on consumer products containing or comprising them, followed some years later (nas, ) . it is interesting to note that, within the broad ethical tableau of potential issues within the science and products of biotechnology, the seemingly innocuous field of plant modification has been one of the major players of the 's. the success of agricultural biotechnology is heavily dependent on its acceptance by the public, and the regulatory framework in which the industry operates is also influenced by public opinion. as the focus for molecular biology research shifted from the basic pursuit of knowledge to the pursuit of lucrative applications, once again, as in the previous two decades, the specter of risk arose as the potential of new products and applications had to be evaluated outside the confines of a laboratory. 
however, the specter now became far more global, as the implications of commercial applications brought into the loop not just worker safety but also the environment, agricultural and industrial products, and the safety and well-being of all living things. beyond "deliberate" release, the rac guidelines were not designed to address these issues, so the matter moved into the realm of the federal agencies whose regulatory authority could be interpreted to cover biotechnology issues. this adaptation of oversight is very much a dynamic process, as the various agencies wrestle with the task of applying existing regulations, and developing new ones, for oversight of this technology in transition. as the decade progressed, focus shifted from basic biotic stress resistance to more complex modifications. the next generation of plants will focus on value-added traits, in which valuable genes and metabolites will be identified and isolated, with some of the latter compounds being produced in mass quantities for niche markets. two of the more promising markets are nutraceuticals, or so-called "functional foods", and plants developed as bioreactors for the production of valuable proteins and compounds, a field known as plant molecular farming. developing plants with improved quality traits involves overcoming a variety of technical challenges inherent to metabolic engineering programs. both traditional plant breeding and biotechnology techniques are needed to produce plants carrying the desired quality traits. continuing improvements in molecular and genomic technologies are contributing to the acceleration of product development in this space. by the end of the decade, in , applying nutritional genomics, della penna ( ) isolated a gene which converts lower-activity precursors to the highest-activity vitamin e compound, alpha-tocopherol. 
with this technology, the vitamin e content of arabidopsis seed oil has been increased nearly -fold, and progress has been made to move the technology to crops such as soybean, maize, and canola. this has also been done for folates in rice. omega-three fatty acids play a significant role in human health. eicosapentaenoic acid (epa) and docosahexaenoic acid (dha), which are present in the retina of the eye and the cerebral cortex of the brain, respectively, are among the best documented from a clinical perspective. it is believed that epa and dha play an important role in the regulation of inflammatory immune reactions and blood pressure, the treatment of conditions such as cardiovascular disease and cystic fibrosis, brain development in utero and, in early postnatal life, the development of cognitive function. they are mainly found in fish oil, and the supply is limited. by the end of the decade, ursin ( ) had succeeded in engineering canola to produce these fatty acids. from a global perspective, another value-added development had far greater impact, both technologically and socio-economically. a team led by ingo potrykus ( ) engineered rice to produce pro-vitamin a, an essential micronutrient. widespread dietary deficiency of this vitamin in rice-eating asian countries, which predisposes children to diseases such as blindness and measles, has tragic consequences. improved vitamin a nutrition would alleviate serious health problems and, according to unicef, could also prevent up to two million infant deaths due to vitamin a deficiency. adoption of the next stage of gm crops may proceed more slowly, as the market confronts issues of how to determine price, share value, and adjust marketing and handling to accommodate specialized end-use characteristics. furthermore, competition from existing products will not evaporate. 
challenges that have accompanied gm crops with improved agronomic traits, such as the stalled regulatory processes in europe, will also affect adoption of nutritionally improved gm products. beyond all of this, credible scientific research is still needed to confirm the benefits of any particular food or component. for functional foods to deliver their potential public health benefits, consumers must have a clear understanding of, and a strong confidence level in, the scientific criteria that are used to document health effects and claims. because these decisions will require an understanding of plant biochemistry, mammalian physiology, and food chemistry, strong interdisciplinary collaborations will be needed among plant scientists, nutritionists, and food scientists to ensure a safe and healthful food supply. in addition to being a source of nutrition, plants have been a valuable wellspring of therapeutics for centuries. during the nineties, however, intensive research focused on expanding this source through rdna biotechnology, essentially using plants and animals as living factories for the commercial production of vaccines, therapeutics and other valuable products such as industrial enzymes and biosynthetic feedstocks. possibilities in the medical field include a wide variety of compounds, ranging from edible vaccine antigens against hepatitis b and norwalk viruses (arntzen, ) and pseudomonas aeruginosa and staphylococcus aureus to vaccines against cancer and diabetes, enzymes, hormones, cytokines, interleukins, plasma proteins, and human alpha- -antitrypsin. thus, plant cells are capable of expressing a large variety of recombinant proteins and protein complexes. therapeutics produced in this way are termed plant-made pharmaceuticals (pmps), and non-therapeutics are termed plant-made industrial products (pmips) (newell-mcgloughlin, ) . 
the first reported results of successful human clinical trials with transgenic plant-derived pharmaceuticals were published in . they were an edible vaccine against e. coli-induced diarrhea and a secretory monoclonal antibody directed against streptococcus mutans, for preventative immunotherapy to reduce the incidence of dental caries. haq et al. ( ) reported the expression in potato plants of a vaccine against e. coli enterotoxin (etec) that provided an immune response against the toxin in mice. human clinical trials suggest that oral vaccination against either of the closely related enterotoxins of vibrio cholerae and e. coli induces production of antibodies that can neutralize the respective toxins by preventing them from binding to gut cells. similar results were found for norwalk virus oral vaccines in potatoes. for developing countries, the intention is to deliver them in bananas or tomatoes (newell-mcgloughlin, ) . plants are also faster, cheaper, more convenient and more efficient for the production of pharmaceuticals than the principal eukaryotic production system, chinese hamster ovary (cho) cells. hundreds of acres of protein-containing seeds could inexpensively double the production of a cho bioreactor factory. in addition, proteins can be expressed at the highest levels in the harvestable seed, and plant-made proteins and enzymes formulated in seeds have been found to be extremely stable, reducing storage and shipping costs. pharming may also enable research on drugs that cannot currently be produced. for example, croptech in blacksburg, va., is investigating a protein that seems to be a very effective anticancer agent. the problem is that this protein is difficult to produce in mammalian cell culture systems, as it inhibits cell growth. this should not be a problem in plants. furthermore, production size is flexible and easily adjustable to the needs of changing markets. 
making pharmaceuticals from plants is also a sustainable process, because the plants and crops used as raw materials are renewable. the system also has the potential to address problems associated with provision of vaccines to people in developing countries. products from these alternative sources do not require a so-called "cold chain" for refrigerated transport and storage, and those being developed for oral delivery obviate the need for needles and aseptic conditions, which often are a problem in those areas. apart from those specific applications where the plant system is optimal, there are many other advantages to using plant production. many new pharmaceuticals based on recombinant proteins will receive regulatory approval from the united states food and drug administration (fda) in the next few years. as these therapeutics make their way through clinical trials and evaluation, the pharmaceutical industry faces a production capacity challenge. pharmaceutical discovery companies are exploring plant-based production to overcome capacity limitations, enable production of complex therapeutic proteins, and fully realize the commercial potential of their biopharmaceuticals (newell-mcgloughlin, ) . nineteen ninety also marked a major milestone in the animal biotech world, when herman made his appearance on the world's stage. since palmiter's mouse, transgenic technology has been applied to several species, including agricultural species such as sheep, cattle, goats, pigs, rabbits, poultry, and fish. herman, the first transgenic bovine, was created at the early embryo stage by genpharm international, inc., in a laboratory in the netherlands. scientists microinjected recently fertilized eggs with the gene coding for human lactoferrin, then cultured the cells in vitro to the embryo stage and transferred them to recipient cattle. lactoferrin, an iron-containing anti-bacterial protein, is essential for infant growth. 
since cow's milk doesn't contain lactoferrin, infants must be fed from other sources that are rich in iron -formula or mother's milk (newell-mcgloughlin, ) . as herman was a boy, he would be unable to provide the source; that would require the production of daughters, which was not necessarily a straightforward process. the dutch parliament's permission was required. in they finally approved a measure that permitted the world's first genetically engineered bull to reproduce. the leiden-based gene pharming proceeded to artificially inseminate cows with herman's sperm, with a promise that the protein, lactoferrin, would be the first in a new generation of inexpensive, high-tech drugs derived from cows' milk to treat complex diseases like aids and cancer. herman became the father of at least eight female calves in , and each one inherited the gene for lactoferrin production. while their birth was initially greeted as a scientific advancement that could have far-reaching effects for children in developing nations, the levels of expression were too low to be commercially viable. by , herman, who liked to listen to rap music to relax, had sired calves and outlived them all. his offspring were all destroyed after the end of the experiment, in line with dutch health legislation. herman was also slated for the abattoir, but the dutch public -proud of making history with herman -rose up in protest, especially after a television program screened footage showing the amiable bull licking a kitten. herman won a bill of clemency from parliament. however, instead of retirement on a comfortable bed of straw, listening to rap music, herman was pressed into service again. he now stars at a permanent biotech exhibit in naturalis, a natural history museum in the dutch city of leiden. after his death, he will be stuffed and remain in the museum in perpetuity (a fate similar to what awaited an even more famous mammalian first, born later in the decade). 
the applications for transgenic animal research fall broadly into two distinct areas, namely medical and agricultural applications. the recent focus on developing animals as bioreactors to produce valuable proteins in their milk can be catalogued under both areas. underlying each of these, of course, is a more fundamental application, that is, the use of these techniques as tools to ascertain the molecular and physiological bases of gene expression and animal development. this understanding can then lead to the creation of techniques to modify development pathways. in a european decision with rather more far-reaching implications than herman's sex life was made. the first european patent on a transgenic animal was issued for a transgenic mouse sensitive to carcinogens -harvard's "oncomouse". the oncomouse patent application had been refused in europe in , due primarily to an established ban on animal patenting. the application was revised to make narrower claims, and the patent was granted in . this has since been repeatedly challenged, primarily by groups objecting to the judgement that benefits to humans outweigh the suffering of the animal. currently, the patent applicant is awaiting protestors' responses to a series of possible modifications to the application. predictions are that agreement will not likely be forthcoming and that the legal wrangling will continue into the future. bringing animals into the field of controversy starting to swirl around gmos, and preceding the latter's commercialization, was the approval by the fda of bovine somatotropin (bst) for increased milk production in dairy cows. the fda's center for veterinary medicine (cvm) regulates the manufacture and distribution of food additives and drugs that will be given to animals. biotechnology products are a growing proportion of the animal health products and feed components regulated by the cvm. 
the center requires that food products from treated animals be shown to be safe for human consumption. applicants must show that the drug is effective and safe for the animal and that its manufacture will not affect the environment. they must also conduct geographically dispersed clinical trials under an investigational new animal drug application with the fda, through which the agency controls the use of the unapproved compound in food animals. unlike within the eu, possible economic and social issues cannot be taken into consideration by the fda in the premarket drug approval process. under these considerations the safety and efficacy of rbst were determined. it was also determined that special labeling for milk derived from cows that had been treated with rbst is not required under fda food labeling laws, because the use of rbst does not affect the quality or the composition of the milk. work with fish proceeded apace throughout the decade. gene transfer techniques have been applied to a large number of aquatic organisms, both vertebrates and invertebrates. gene transfer experiments have targeted a wide variety of applications, including the study of gene structure and function, aquaculture production, and use in fisheries management programs. because fish have high fecundity and large eggs, and do not require reimplantation of embryos, transgenic fish prove attractive model systems in which to study gene expression. transgenic zebrafish have found utility in studies of embryogenesis, with expression of transgenes marking cell lineages or providing the basis for study of promoter or structural gene function. although not as widely used as zebrafish, transgenic medaka and goldfish have been used for studies of promoter function. this body of research indicates that transgenic fish provide useful models of gene expression, reliably modeling that in "higher" vertebrates. 
perhaps the largest number of gene transfer experiments address the goal of genetic improvement for aquaculture production purposes. the principal area of research has focused on growth performance, and initial transgenic growth hormone (gh) fish models have demonstrated accelerated and beneficial growth phenotypes. dna microinjection methods have propelled the many studies reported and have been most effective due to the relative ease of working with fish embryos. bob devlin's group in vancouver has demonstrated extraordinary growth rates in coho salmon transformed with a growth hormone gene from sockeye salmon. the transgenics achieve up to eleven times the size of their littermates within six months, reaching maturity in about half the time. interestingly, this dramatic effect is only observed in feeding pens, where the transgenics' ferocious appetites demand constant feeding. if the fish are left to their own devices and must forage for themselves, they appear to be out-competed by their smarter siblings. however, most studies, such as those involving transgenic atlantic salmon and channel catfish, report growth rate enhancement on the order of - %. in addition to the species mentioned, gh genes also have been transferred into striped bass, tilapia, rainbow trout, gilthead sea bream, common carp, bluntnose bream, loach, and other fishes. shellfish also are subject to gene transfer toward the goal of intensifying aquaculture production. growth of abalone expressing an introduced gh gene is being evaluated; accelerated growth would prove a boon for culture of the slow-growing mollusk. a marker gene was introduced successfully into the giant prawn, demonstrating the feasibility of gene transfer in crustaceans and opening the possibility of work involving genes affecting economically important traits. in the ornamental fish sector of aquaculture, ongoing work addresses the development of fish with unique coloring or patterning. 
a number of companies have been founded to pursue commercialization of transgenics for aquaculture. as most aquaculture species mature at - years of age, most transgenic lines are still in development and have yet to be tested for performance under culture conditions. extending earlier research that identified methyl farnesoate (mf) as a juvenile hormone in crustaceans and determined its role in reproduction, researchers at the university of connecticut have developed technology to synchronize shrimp egg production and to increase the number and quality of eggs produced. females injected with mf are stimulated to produce eggs ready for fertilization. the procedure produces percent more eggs than the traditional crude method of removing the eyestalk gland. this will increase aquaculture efficiency. a number of experiments utilize gene transfer to develop genetic lines of potential utility in fisheries management. transfers of gh genes into northern pike, walleye, and largemouth bass are aimed at improving the growth rate of sport fishes. gene transfer has been posed as an option for reducing losses of rainbow trout to whirling disease, although suitable candidate genes have yet to be identified. richard winn of the university of georgia is developing transgenic killifish and medaka as biomonitors for environmental mutagens, which carry the bacteriophage phi x as a target for mutation detection. development of transgenic lines for fisheries management applications generally is at an early stage, often at the founder or f generation. broad application of transgenic aquatic organisms in aquaculture and fisheries management will depend on showing that particular gmos can be used in the environment both effectively and safely. although our base of knowledge for assessing the ecological and genetic safety of aquatic gmos is currently limited, some early studies supported by the usda biotechnology risk assessment program have yielded results. 
data from outdoor pond-based studies on transgenic catfish reported by rex dunham of auburn university show that transgenic and non-transgenic individuals interbreed freely, that survival and growth of transgenics in unfed ponds was equal to or less than that of non-transgenics, and that predator avoidance is not affected by expression of the transgene. however, unquestionably the seminal event for animal biotech in the nineties was ian wilmut's landmark work using nuclear transfer technology to generate the lambs morag and megan, reported in (from embryonic cell nuclei), and the truly ground-breaking work of creating dolly from an adult somatic cell nucleus, reported in february (wilmut, ) . wilmut and his colleagues at the roslin institute demonstrated for the first time, with the birth of dolly the sheep, that the nucleus of an adult somatic cell can be transferred to an enucleated egg to create cloned offspring. it had been assumed for some time that only embryonic cells could be used as the cellular source for nuclear transfer. this assumption was shattered with the birth of dolly. this example of cloning an animal using the nucleus of an adult cell was significant because it demonstrated the ability of egg cell cytoplasm to "reprogram" an adult nucleus. when cells differentiate, that is, develop from primitive embryonic cells into functionally defined adult cells, they lose the ability to express most genes and can only express those genes necessary for the cell's differentiated function. for example, skin cells only express genes necessary for skin function, and brain cells only express genes necessary for brain function. the procedure that produced dolly demonstrated that egg cytoplasm is capable of reprogramming an adult differentiated cell (which is only expressing genes related to the function of that cell type). 
this reprogramming enables the differentiated cell nucleus to once again express all the genes required for the full embryonic development of the adult animal. since dolly was cloned, similar techniques have been used to clone a veritable zoo of vertebrates including mice, cattle, rabbits, mules, horses, fish, cats and dogs from donor cells obtained from adult animals. these spectacular examples of cloning normal animals from fully differentiated adult cells demonstrate the universality of nuclear reprogramming, although the next decade called some of these assumptions into question. this technology supports the production of genetically identical and genetically modified animals. thus, the successful "cloning" of dolly has captured the imagination of researchers around the world. this technological breakthrough should play a significant role in the development of new procedures for genetic engineering in a number of mammalian species. it should be noted that nuclear cloning, with nuclei obtained from either mammalian stem cells or differentiated "adult" cells, is an especially important development for transgenic animal research. as the decade reached its end the clones began arriving rapidly, with specific advances made by a japanese group who used cumulus cells rather than fibroblasts to clone calves. they found that the percentage of cultured, reconstructed eggs that developed into blastocysts was % for cumulus cells and % for oviductal cells. these rates are higher than the % previously reported for transfer of nuclei from bovine fetal fibroblasts. following on the heels of dolly, polly and molly became the first genetically engineered transgenic sheep produced through nuclear transfer technology. polly and molly were engineered to produce human factor ix (for hemophiliacs) by transfer of nuclei from transfected fetal fibroblasts. until then, germline-competent transgenics had only been produced in mammalian species other than mice using dna microinjection. 
researchers at the university of massachusetts and advanced cell technology (worcester, ma) teamed up to produce genetically identical calves utilizing a strategy similar to that used to produce transgenic sheep. in contrast to the sheep cloning experiment, the bovine experiment involved the transfer of nuclei from an actively dividing population of cells. previous results from the sheep experiments suggested that induction of quiescence by serum starvation was required to reprogram the donor nuclei for successful nuclear transfer. the current bovine experiments indicate that this step may not be necessary. typically about embryos needed to be microinjected to obtain one transgenic cow, whereas nuclear transfer produced three transgenic calves from reconstructed embryos. this efficiency is comparable to the previous sheep research where six transgenic lambs were produced from reconstructed embryos. the ability to select for genetically modified cells in culture prior to nuclear transfer opens up the possibility of applying the powerful gene targeting techniques that have been developed for mice. one of the limitations of using primary cells, however, is their limited lifespan in culture. primary cell cultures such as the fetal fibroblasts can only undergo about population doublings before they senesce. this limited lifespan would preclude the ability to perform multiple rounds of selection. to overcome this problem of cell senescence, these researchers showed that fibroblast lifespan could be prolonged by nuclear transfer. a fetus, which was developed by nuclear transfer from genetically modified cells, could in turn be used to establish a second generation of fetal fibroblasts. these fetal cells would then be capable of undergoing another population doublings, which would provide sufficient time for selection of a second genetic modification. as noted, there is still some uncertainty over whether quiescent cells are required for successful nuclear transfer. 
induction into quiescence was originally thought to be necessary for successful nuclear reprogramming of the donor nucleus. however, cloned calves have been previously produced using non-quiescent fetal cells. furthermore, transfer of nuclei from sertoli and neuronal cells, which do not normally divide in adults, did not produce a liveborn mouse, whereas nuclei transferred from actively dividing cumulus cells did produce cloned mice. the fetuses used for establishing fetal cell lines in a tufts goat study were generated by mating nontransgenic females to a transgenic male containing a human antithrombin (at) iii transgene. this at transgene directs high-level expression of human at into the milk of lactating transgenic females. as expected, all three offspring derived from female fetal cells were females. one of these cloned goats was hormonally induced to lactate. this goat secreted . - . grams per liter of at in her milk. this level of at expression was comparable to that detected in the milk of transgenic goats from the same line obtained by natural breeding. the successful secretion of at in milk was a key result because it showed that a cloned animal could still synthesize and secrete a foreign protein at the expected level. it will be interesting to see if all three cloned goats secrete human at at the identical level. if so, then the goal of creating a herd of identical transgenic animals, which secrete identical levels of an important pharmaceutical, would become a reality. no longer would variable production levels exist in subsequent generations due to genetically similar but not identical animals. this homogeneity would greatly aid in the production and processing of a uniform product. as nuclear transfer technology continues to be refined and applied to other species, it may eventually replace microinjection as the method of choice for generating transgenic livestock. 
nuclear transfer has a number of advantages: 1) nuclear transfer is more efficient than microinjection at producing a transgenic animal, 2) the fate of the integrated foreign dna can be examined prior to production of the transgenic animal, 3) the sex of the transgenic animal can be predetermined, and 4) the problem of mosaicism in first generation transgenic animals can be eliminated. dna microinjection has not been a very efficient mechanism for producing transgenic mammals. however, in november , a team of wisconsin researchers reported a nearly % efficient method for generating transgenic cattle. the established method of producing transgenic cattle involves injecting dna into the pronuclei of a fertilized egg or zygote. in contrast, the wisconsin team injected a replication-defective retroviral vector into the perivitelline space of an unfertilized oocyte. the perivitelline space is the region between the oocyte membrane and the protective coating surrounding the oocyte known as the zona pellucida. in addition to es (embryonic stem) cells, other sources of donor nuclei for nuclear transfer might be used, such as embryonic cell lines, primordial germ cells, or spermatogonia, to produce offspring. the utility of es cells or related methodologies to provide efficient and targeted in vivo genetic manipulations offers the prospect of profoundly useful animal models for biomedical, biological and agricultural applications. the road to such success has been most challenging, but recent developments in this field are extremely encouraging. with the may announcement of geron buying out ian wilmut's company roslin biomed, they declared it the dawn of a new era in biomedical research. geron's technologies for deriving transplantable cells from human pluripotent stem cells (hpscs) and extending their replicative capacity with telomerase were combined with the roslin institute's nuclear transfer technology, the technology that produced dolly the cloned sheep. 
the goal was to produce transplantable, tissue-matched cells that provide extended therapeutic benefits without triggering immune rejection. such cells could be used to treat numerous major chronic degenerative diseases and conditions such as heart disease, stroke, parkinson's disease, alzheimer's disease, spinal cord injury, diabetes, osteoarthritis, bone marrow failure and burns. the stem cell is a unique and essential cell type found in animals. many kinds of stem cells are found in the body, with some more differentiated, or committed, to a particular function than others. in other words, when stem cells divide, some of the progeny mature into cells of a specific type (heart, muscle, blood, or brain cells), while others remain stem cells, ready to repair some of the everyday wear and tear undergone by our bodies. these stem cells are capable of continually reproducing themselves and serve to renew tissue throughout an individual's life. for example, they continually regenerate the lining of the gut, revitalize skin, and produce a whole range of blood cells. although the term "stem cell" commonly is used to refer to the cells within the adult organism that renew tissue (e.g., hematopoietic stem cells, a type of cell found in the blood), the most fundamental and extraordinary of the stem cells are found in the early-stage embryo. these embryonic stem (es) cells, unlike the more differentiated adult stem cells or other cell types, retain the special ability to develop into nearly any cell type. embryonic germ (eg) cells, which originate from the primordial reproductive cells of the developing fetus, have properties similar to es cells. it is the potentially unique versatility of the es and eg cells derived, respectively, from the early-stage embryo and cadaveric fetal tissue that presents such unusual scientific and therapeutic promise. 
indeed, scientists have long recognized the possibility of using such cells to generate more specialized cells or tissue, which could allow the generation of new cells to be used to treat injuries or diseases, such as alzheimer's disease, parkinson's disease, heart disease, and kidney failure. likewise, scientists regard these cells as an important, perhaps essential, means for understanding the earliest stages of human development and as an important tool in the development of life-saving drugs and cell-replacement therapies to treat disorders caused by early cell death or impairment. geron corporation and its collaborators at the university of wisconsin-madison (dr. james a. thomson) and johns hopkins university (dr. john d. gearhart) announced in november the first successful derivation of hpscs from two sources: (i) human embryonic stem (hes) cells derived from in vitro fertilized blastocysts (thomson ) and (ii) human embryonic germ (heg) cells derived from fetal material obtained from medically terminated pregnancies (shamblott et al. ) . although derived from different sources by different laboratory processes, these two cell types share certain characteristics and are referred to collectively as human pluripotent stem cells (hpscs). because hes cells have been more thoroughly studied, the characteristics of hpscs most closely describe the known properties of hes cells. stem cells represent a tremendous scientific advancement in two ways: first, as a tool to study developmental and cell biology; and second, as the starting point for therapies to develop medications to treat some of the most deadly diseases. the derivation of stem cells is fundamental to scientific research in understanding basic cellular and embryonic development. observing the development of stem cells as they differentiate into a number of cell types will enable scientists to better understand cellular processes and ways to repair cells when they malfunction. 
it also holds great potential to yield revolutionary treatments by transplanting new tissue to treat heart disease, atherosclerosis, blood disorders, diabetes, parkinson's, alzheimer's, stroke, spinal cord injuries, rheumatoid arthritis, and many other diseases. by using stem cells, scientists may be able to grow human skin cells to treat wounds and burns, and it will aid the understanding of fertility disorders. many patient and scientific organizations recognize the vast potential of stem cell research. another possible therapeutic technique is the generation of "customized" stem cells. a researcher or doctor might need to develop a special cell line that contains the dna of a person living with a disease. by using a technique called "somatic cell nuclear transfer," the researcher can transfer a nucleus from the patient into an enucleated human egg cell. this reformed cell can then be activated to form a blastocyst from which customized stem cell lines can be derived to treat the individual from whom the nucleus was extracted. because the stem cell line carries the individual's own dna, it would be fully compatible and not be rejected when the stem cells are transferred back to that person for treatment. preliminary research is occurring on other approaches to produce pluripotent human es cells without the need to use human oocytes, since human oocytes may not be available in quantities that would meet the needs of millions of potential patients. however, no peer-reviewed papers have yet appeared from which to judge whether animal oocytes could be used to manufacture "customized" human es cells or whether they could be developed on a realistic timescale. additional approaches under consideration include early experimental studies on the use of cytoplasmic-like media that might offer a viable approach in laboratory cultures. 
on a much longer timeline, it may be possible to use sophisticated genetic modification techniques to eliminate the major histocompatibility complexes and other cell-surface antigens from foreign cells to prepare master stem cell lines with less likelihood of rejection. this could lead to the development of a bank of universal donor cells or multiple types of compatible donor cells of invaluable benefit to treat all patients. however, the human immune system is sensitive to many minor histocompatibility complexes and immunosuppressive therapy carries life-threatening complications. stem cells also show great potential to aid research and development of new drugs and biologics. now, stem cells can serve as a source for normal human differentiated cells to be used for drug screening and testing, drug toxicology studies and to identify new drug targets. the ability to evaluate drug toxicity in human cell lines grown from stem cells could significantly reduce the need to test a drug's safety in animal models. there are other sources of stem cells, including stem cells that are found in blood. recent reports note the possible isolation of stem cells for the brain from the lining of the spinal cord. other reports indicate that some stem cells that were thought to have differentiated into one type of cell can also become other types of cells, in particular brain stem cells with the potential to become blood cells. however, since these reports reflect very early cellular research about which little is known, we should continue to pursue basic research on all types of stem cells. some religious leaders will advocate that researchers should only use certain types of stem cells. however, because human embryonic stem cells hold the potential to differentiate into any type of cell in the human body, no avenue of research should be foreclosed. 
rather, we must find ways to facilitate the pursuit of all research using stem cells while addressing the ethical concerns that may be raised. another seminal and intimately related event at the end of the nineties occurred in madison, wisconsin. up until november of , isolating es cells in mammals other than mice had proved elusive, but in a milestone paper in the november , issue of science, james a. thomson ( ), a developmental biologist at uw-madison, reported the first successful isolation, derivation and maintenance of a culture of human embryonic stem cells (hes cells). it is interesting to note that this leap was made from mouse to man. as thomson himself put it, these cells are different from all other human stem cells isolated to date, and as the source of all cell types, they hold great promise for use in transplantation medicine, drug discovery and development, and the study of human developmental biology. the new century is rapidly exploiting this vision. when steve fodor was asked in "how do you really take the human genome sequence and transform it into knowledge?" he answered that, from affymetrix's perspective, it is a technology development task. he sees the colloquially named affychips as being the equivalent of a cd-rom of the genome: they take information from the genome and write it down. the company has come a long way from the early days of venter's ests and the less than robust algorithms described earlier. one surprising fact unearthed by the newer, more sophisticated generation of chips is that to percent of the non-repetitive dna is being expressed, whereas accepted knowledge held that only . to percent of the genome would be expressed. since much of that sequence has no protein-coding capacity, it is most likely coding for regulatory functions. in a parallel to astrophysics, this is often referred to in common parlance as the "dark matter of the genome," and, like dark matter, for many it is the most exciting and challenging aspect of uncovering the occult genome. 
it could be, and most probably is, involved in regulatory functions, networks, or development. and like physical dark matter it may change our whole concept of what exactly a gene is or is not! since beadle and tatum's circumspect view of the protein world no longer holds true, it adds a layer of complexity to organizing chip design. depending on which sequences are present in a particular transcript, you can, theoretically, design a set of probes to uniquely distinguish that variant. at the dna level itself there is much potential for looking at variants, either expressed or not, at a very basic level as a diagnostic system, but ultimately the real paydirt is the information that can be gained from looking at the consequence of non-coding sequence variation on the transcriptome itself. fine-tuning when this matters and when it is irrelevant as a predictive model falls under the auspices of the affymetrix spin-off perlegen. perlegen came into being in late to accelerate the development of high-resolution, whole genome scanning, and they have stuck to that purity of purpose. to paraphrase dragnet's sergeant joe friday, they focus on the facts of dna, just the dna. perlegen owes its true genesis to the desire of one of its cofounders to use dna chips to help understand the dynamics underlying genetic diseases. brad margus' two sons have the rare disease ataxia telangiectasia (a-t). a-t is a progressive, neurodegenerative childhood disease that affects the brain and other body systems. the first signs of the disease, which include delayed development of motor skills, poor balance, and slurred speech, usually occur during the first decade of life. telangiectasias (tiny, red "spider" veins), which appear in the corners of the eyes or on the surface of the ears and cheeks, are characteristic of the disease, but are not always present. many individuals with a-t have a weakened immune system, making them susceptible to recurrent respiratory infections. 
about % of those with a-t develop cancer, most frequently acute lymphocytic leukemia or lymphoma, suggesting that the sentinel competence of the immune system is compromised. having a focus so close to home is a powerful driver for any scientist. his co-founder david cox is a polymath pediatrician whose training in the latter informs his application of the former in the development of patient-centered tools. from that perspective, perlegen's stated mission is to collaborate with partners to rescue or improve drugs and to uncover the genetic bases of diseases. they have created a whole genome association approach that enables them to genotype millions of unique snps in thousands of cases and controls in a timeframe of months rather than years. as mentioned previously, snp (single nucleotide polymorphism) markers are preferred over microsatellite markers for association studies because of their abundance along the human genome, their low mutation rate, and their accessibility to high-throughput genotyping. since most diseases, and indeed responses to drug interventions, are the products of multiple genetic and environmental factors, it is a challenge to develop discriminating diagnostics and, even more so, targeted therapeutics. because mutations involved in complex diseases act probabilistically, that is, the clinical outcome depends on many factors in addition to variation in the sequence of a single gene, the effect of any specific mutation is smaller. thus, such effects can only be revealed by searching for variants that differ in frequency among large numbers of patients and controls drawn from the general population. analysis of these snp patterns provides a powerful tool to help achieve this goal. although most bi-allelic snps are rare, it has been estimated that just over million common snps, each with a frequency of between and %, account for the bulk of the dna sequence difference between humans. such snps are present in the human genome once every base pairs or so. 
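the case-versus-control frequency comparison described above can be sketched in a few lines of code. this is a minimal illustration of the statistical idea, not perlegen's actual pipeline: a chi-square statistic computed on a 2x2 table of allele counts, where all the counts below are invented for illustration.

```python
# hedged sketch of a single-snp case/control association test:
# rows of the 2x2 table are cases vs. controls, columns are the
# two alleles of a bi-allelic snp. a large chi-square statistic
# means allele frequencies differ between the two groups.

def chi_square_2x2(a, b, c, d):
    """chi-square statistic for a 2x2 allele-count table
    [[a, b], [c, d]] (cases: a/b, controls: c/d)."""
    n = a + b + c + d
    # expected counts under independence of row and column margins
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# invented counts: risk allele seen 60/140 in cases, 40/160 in controls
stat = chi_square_2x2(60, 140, 40, 160)
print(round(stat, 2))  # prints 5.33
```

in a real study this statistic would be compared against a chi-square distribution with one degree of freedom, with the very stringent significance thresholds the text mentions to correct for testing many snps at once.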
as is to be expected from linkage disequilibrium studies, alleles making up blocks of such snps in close physical proximity are often correlated, resulting in reduced genetic variability and defining a limited number of "snp haplotypes," each of which reflects descent from a single, ancient ancestral chromosome. cox's group, using high-level scanning with some old-fashioned somatic cell genetics, constructed the snp map of chromosome . the surprising finding was blocks of limited haplotype diversity in which more than % of a global human sample can typically be characterized by only three common haplotypes (interestingly enough, the prevalence of each haplotype in the examined population was in the ratio : : ). from this the conclusion could be drawn that, by comparing the frequency of genetic variants in unrelated cases and controls, genetic association studies could potentially identify specific haplotypes in the human genome that play important roles in disease, without need of knowledge of the history or source of the underlying sequence, a hypothesis they subsequently went on to prove. following cox et al.'s pioneering work on "blocking" chromosome into characteristic haplotypes, tien chen came to visit him from the university of southern california, and following the visit his group developed discriminating algorithms which took advantage of the fact that the haplotype block structure can be decomposed into large blocks with high linkage disequilibrium and relatively limited haplotype diversity, separated by short regions of low disequilibrium. one of the practical implications of this observation, as suggested by cox, is that only a small fraction of all the snps, which they refer to as "tag" snps, need be chosen for mapping genes responsible for complex human diseases, which can significantly reduce genotyping effort without much loss of power. they developed algorithms to partition haplotypes into blocks with the minimum number of tag snps for an entire chromosome. 
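the tag-snp idea can be illustrated with a toy greedy selection. this is a hedged sketch of the concept, not the published partitioning algorithms: assuming the distinct common haplotypes of a single block are already known, it picks snp positions one at a time until every pair of haplotypes can be told apart, so only the chosen "tag" positions need genotyping. the haplotype strings are invented.

```python
# toy greedy tag-snp selection within one haplotype block.
# haplotypes: equal-length allele strings, assumed pairwise distinct.
# returns a small set of positions that distinguishes all of them.

from itertools import combinations

def select_tag_snps(haplotypes):
    pairs = set(combinations(range(len(haplotypes)), 2))
    tags = []
    while pairs:
        # pick the snp position that separates the most unresolved pairs
        best = max(
            range(len(haplotypes[0])),
            key=lambda pos: sum(
                haplotypes[i][pos] != haplotypes[j][pos] for i, j in pairs
            ),
        )
        tags.append(best)
        # keep only the pairs this tag snp fails to distinguish
        pairs = {(i, j) for i, j in pairs
                 if haplotypes[i][best] == haplotypes[j][best]}
    return sorted(tags)

# three common haplotypes over five snp positions
print(select_tag_snps(["AACGT", "ATCGA", "GACCT"]))  # prints [0, 1]
```

the real algorithms solve a harder joint problem, partitioning a whole chromosome into blocks while minimizing the total number of tags, but the set-cover flavor of the selection step is the same.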
in they reported that they had developed an optimized suite of programs to analyze these block linkage disequilibrium patterns and to select the corresponding tag snps that will pick the minimum number of tags for the given criteria. in addition, the updated suite allows haplotype data and genotype data from unrelated individuals and from general pedigrees to be analyzed. using an approach similar to richard michelmore's bulk segregant analysis in plants of more than a decade previously, perlegen subsequently made use of these snp haplotype and statistical probability tools to estimate the total genetic variability of a particular complex trait coded for by many genes, with any single gene accounting for no more than a few percent of the overall variability of the trait. cox's group have determined that fewer than total individuals provide adequate power to identify genes accounting for only a few percent of the overall genetic variability of a complex trait, even using the very stringent significance levels required when testing large numbers of dna variants. from this it is possible to identify the set of major genetic risk factors contributing to the variability of a complex disease and/or treatment response. so, while a single genetic risk factor is not a good predictor of treatment outcome, the sum of a large fraction of risk factors contributing to a treatment response or common disease can be used to optimize personalized treatments without requiring knowledge of the underlying mechanisms of the disease. they feel that a saturating level of coverage is required to produce repeatable prediction of response to medication or predisposition to disease, and that taking shortcuts will for the most part lead to incomplete, clinically-irrelevant results. in , hinds et al. in science describe even more dramatic progress. they describe a publicly available, genome-wide data set of . 
million common single-nucleotide polymorphisms (snps) that have been accurately genotyped in each of people from three population samples. a second public data set of more than million snps typed in each of people has been generated by the international haplotype map (hapmap) project. these two public data sets, combined with multiple new technologies for rapid and inexpensive snp genotyping, are paving the way for comprehensive association studies involving common human genetic variations. perlegen basically is taking to the next level fodor's stated reason for the creation of affymetrix: the belief that understanding the correlation between genetic variability and its role in health and disease would be the next step in the genomics revolution. the other interesting aspect of this level of coverage is, of course, that the notion of discrete identifiable groups based on ethnicity, centers of origin and such breaks down, and a spectrum of variation arises across all populations, which makes the perlegen chip, at one level, a true unifier of humanity but at another adds a whole layer of complexity for hmos! at the turn of the century, this personalized chip approach to medicine received some validation at a simpler level, in a disease area closely related to the one to which one fifth of a-t patients ultimately succumb, when researchers at the whitehead institute used dna chips to distinguish different forms of leukemia based on patterns of gene expression in different populations of cells. moving cancer diagnosis away from visually based systems to such molecular based systems is a major goal of the national cancer institute. in the study, scientists used a dna chip to examine gene activity in bone marrow samples from patients with two different types of acute leukemia: acute myeloid leukemia (aml) and acute lymphoblastic leukemia (all). then, using an algorithm developed at the whitehead, they identified signature patterns that could distinguish the two types. 
when they cross-checked the diagnoses made by the chip against known differences in the two types of leukemia, they found that the chip method could automatically make the distinction between aml and all without previous knowledge of these classes. taking it to a level beyond where perlegen are initially aiming, eric lander, leader of the study, said that mapping not only what is in the genome, but also what the things in the genome do, is the real secret to comprehending and ultimately curing cancer and other diseases. chips gained recognition on the world stage in when they played a key role in the search for the cause of severe acute respiratory syndrome (sars) and probably won a macarthur genius award for their creator. ucsf assistant professor joseph derisi, already famous in the scientific community as the wunderkind originator of the online diy chip maker in pat brown's lab at stanford, built a gene microarray containing all known completely sequenced viruses ( , of them) and, using a robot arm that he also customized, in a three-day period used it to classify a pathogen isolated from sars patients as a novel coronavirus. when a whole galaxy of dots lit up across the spectrum of known vertebrate coronaviruses, derisi knew this was a new variant. interestingly, the sequence had the hottest signal with avian infectious bronchitis virus. his work subsequently led epidemiologists to target the masked palm civet, a tree-dwelling animal with a weasel-like face and a catlike body, as the probable primary host. the role that derisi's team at ucsf played in identifying a coronavirus as a suspected cause of sars came to the attention of the national media when cdc director dr. julie gerberding recognized joe in a march press conference, and again when joe was honored with one of the coveted macarthur genius awards. 
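the signature-pattern classification described above can be illustrated with a nearest-centroid sketch: label a new expression profile by whichever class mean it lies closer to. this is a hypothetical toy, not the whitehead group's actual algorithm, and the tiny three-gene profiles are invented; the real study compared thousands of genes per bone marrow sample.

```python
# toy nearest-centroid classifier over gene-expression profiles.
# each profile is a list of expression levels, one value per gene.

def centroid(profiles):
    """per-gene mean of a list of equal-length expression profiles."""
    n = len(profiles)
    return [sum(p[i] for p in profiles) / n for i in range(len(profiles[0]))]

def classify(sample, centroids):
    """return the label whose centroid is nearest in squared distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(sample, centroids[label]))

# invented training profiles for the two leukemia classes
aml_training = [[5.0, 1.0, 0.5], [4.5, 1.2, 0.7]]
all_training = [[1.0, 4.8, 2.0], [0.8, 5.2, 1.8]]
centroids = {"aml": centroid(aml_training), "all": centroid(all_training)}

print(classify([4.2, 1.1, 0.6], centroids))  # prints aml
```

cross-validation of such a classifier against known diagnoses, as the text describes, is what establishes that the signature genes, rather than chance, are driving the separation.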
this and other tools arising from information gathered from the human genome sequence and complementary discoveries in cell and molecular biology, new tools such as gene-expression profiling and proteomics analysis, are converging to finally show that rapid, robust diagnostics and "rational" drug design have a future in disease research. another virus that puts sars deaths in perspective benefitted from rational drug design at the turn of the century. influenza, or flu, is an acute respiratory infection caused by a variety of influenza viruses. each year, up to million americans develop the flu, with an average of about , being hospitalized and , to , people dying from influenza and its complications. the use of current influenza treatments has been limited due to a lack of activity against all influenza strains, adverse side effects, and rapid development of viral resistance. influenza costs the united states an annual $ . billion in physician visits, lost productivity and lost wages. and lest we still dismiss it as a nuisance, we would do well to remember that the "spanish" influenza pandemic killed over million people in and , making it the worst infectious pandemic in history, beating out even the more notorious black death of the middle ages. this fear has been rekindled as the dreaded h n (h for haemagglutinin and n for neuraminidase, as described below) strain of bird flu has the potential to mutate and recognise homo sapiens as a desirable host. since rna viruses are notoriously faulty in their replication, this accelerated evolutionary process gives them a distinct advantage when adapting to new environments and therefore finding more amenable hosts. although inactivated influenza vaccines are available, their efficacy is suboptimal, partly because of their limited ability to elicit local iga and cytotoxic t cell responses. the choices of treatments and preventions for influenza hold much more promise in this millennium. 
clinical trials of cold-adapted live influenza vaccines now under way suggest that such vaccines are optimally attenuated, so that they will not cause influenza symptoms but will still induce protective immunity. aviron (mountain view, ca), biochem pharma (laval, quebec, canada), merck (whitehouse station, nj), chiron (emeryville, ca), and cortecs (london), all had influenza vaccines in the clinic at the turn of the century, with some of them given intra-nasally or orally. meanwhile, the team of gilead sciences (foster city, ca) and hoffmann-la roche (basel, switzerland) and also glaxowellcome (london) put on the market neuraminidase inhibitors that block the replication of the influenza virus. gilead was one of the first biotechnology companies to come out with an anti-flu therapeutic. tamiflu™ (oseltamivir phosphate) was the first flu pill from this new class of drugs called neuraminidase inhibitors (ni) that are designed to be active against all common strains of the influenza virus. neuraminidase inhibitors block viral replication by targeting a site on one of the two main surface structures of the influenza virus, preventing the virus from infecting new cells. neuraminidase is found protruding from the surface of the two main types of influenza virus, type a and type b. it enables newly formed viral particles to travel from one cell to another in the body. tamiflu is designed to prevent all common strains of the influenza virus from replicating. the replication process is what contributes to the worsening of symptoms in a person infected with the influenza virus. by inactivating neuraminidase, viral replication is stopped, halting the influenza virus in its tracks. in marked contrast to the usual protracted process of clinical trials for new therapeutics, the road from conception to application for tamiflu was remarkably expeditious. 
in , gilead and hoffmann-la roche entered into a collaborative agreement to develop and market therapies that treat and prevent viral influenza. in , as gilead's worldwide development and marketing partner, roche led the final development of tamiflu. months after the first patient was dosed in clinical trials in april , roche and gilead announced the submission of a new drug application to the u.s. food and drug administration (fda) for the treatment of influenza. additionally, roche filed a marketing authorisation application (maa) in the european union under the centralized procedure in early may . six months later, in october , gilead and roche announced that the fda approved tamiflu for the treatment of influenza a and b in adults. these accelerated efforts allowed tamiflu to reach the u.s. market in time for the - flu season. one of gilead's studies showed an increase in efficacy from % when the vaccine was used alone to % when the vaccine was used in conjunction with a neuraminidase inhibitor. outside of the u.s., tamiflu also has been approved for the treatment of influenza a and b in argentina, brazil, canada, mexico, peru and switzerland. regulatory review of the tamiflu maa by european authorities is ongoing. with the h n bird flu strain's relentless march (or rather flight) across asia, in through eastern europe to a french farmyard, an unwelcome stowaway on a winged migration, and no vaccine in sight, tamiflu, although untested for this species and seen as the last line of defense, is now being hoarded and its patented production rights fought over like an alchemist's formula. tamiflu's main competitor, zanamivir, marketed as relenza™, was one of a group of molecules developed by glaxowellcome and academic collaborators using structure-based drug design methods targeted, like tamiflu, at a region of the neuraminidase surface glycoprotein of influenza viruses that is highly conserved from strain to strain. 
glaxo filed for marketing approval for relenza in europe and canada. the food and drug administration's accelerated drug-approval timetable began to show results by ; its evaluation of novartis's gleevec took just three months compared with the standard - months. another factor in improving biotherapeutic fortunes in the new century was the staggering profits of early successes. in , $ . billion of the $ . billion in revenue collected by genentech in south san francisco came from oncology products, mostly the monoclonal antibody-based drugs rituxan, used to treat non-hodgkin's lymphoma, and herceptin for breast cancer. in fact, two of the first cancer drugs to use the new tools for 'rational' design, herceptin and gleevec (a small-molecule chemotherapeutic for some forms of leukemia), are proving successful, and others such as avastin (an anti-vascular endothelial growth factor antibody) for colon cancer and erbitux are already following in their footsteps. gleevec led the way in exploiting signal-transduction pathways to treat cancer as it blocks a mutant form of tyrosine kinase (termed the philadelphia translocation, recognized in the 's) that can help to trigger out-of-control cell division. about % of biotech companies raising venture capital during the third quarter of listed cancer as their primary focus, according to online newsletter venturereporter. by , according to the pharmaceutical research and manufacturers of america, medicines were in development for cancer, up from in . another new avenue in cancer research is to combine drugs. wyeth's mylotarg, for instance, links an antibody to a chemotherapeutic and homes in on cd receptors on acute myeloid leukemia cells. expertise in biochemistry, cell biology and immunology is required to develop such a drug. this trend has created some bright spots in cancer research and development, even though drug discovery in general has been adversely affected by mergers, a few high-profile failures and a shaky us economy in the early 's. 
as the millennium approached, observers as diverse as microsoft's bill gates and president bill clinton predicted the st century would be the "biology century". by the many programs and initiatives underway at major research institutions and leading companies were already giving shape to this assertion. these initiatives have ushered in a new era of biological research anticipated to generate technological changes of the magnitude associated with the industrial revolution and the computer-based information revolution. 
key: cord- -tv ntug authors: gautam, ablesh; tiwari, ashish; malik, yashpal singh title: bioinformatics applications in advancing animal virus research date: - - journal: recent advances in animal virology doi: . / - - - - _ sha: doc_id: cord_uid: tv ntug viruses serve as infectious agents for all living entities. 
there have been various research groups that focus on understanding viruses in terms of their host-viral relationships, pathogenesis and immune evasion. however, with current advances in the field of science, the research field has widened to the ‘omics’ level, and generation of viral sequence data has been increasing. there are numerous bioinformatics tools available that not only aid in analysing such sequence data but also aid in deducing useful information that can be exploited in developing preventive and therapeutic measures. this chapter elaborates on bioinformatics tools that are specifically designed for animal viruses as well as other generic tools that can be exploited to study animal viruses. the chapter further provides information on the tools that can be used to study viral epidemiology, phylogenetic analysis, structural modelling of proteins, epitope recognition and open reading frame (orf) recognition and tools that enable analysis of host-viral interactions, gene prediction in the viral genome, etc. various databases that organize information on animal and human viruses have also been described. the chapter will give an overview of the current advances, online and downloadable tools and databases in the field of bioinformatics that will enable researchers to study animal viruses at the gene level. viruses are notorious for infecting all forms of life, ranging from bacteria to chordates. in humans, viruses are known to cause infectious diseases such as influenza, hepatitis, aids, diarrhoea, encephalitis, dengue fever and, more recently, severe acute respiratory syndrome (sars), ebola (singh et al. a) , zika (singh et al. b) , etc. despite the vaccines and treatments for such diseases, morbidity and mortality both occur as a result of viral infections. viral diseases of animals not only affect production but also pose a threat to humans (saminathan et al. ) . 
sequencing methods have become rapidly more available, and a vast amount of viral sequence data has been generated during recent times. thus, it is imperative to decipher these data using more advanced tools such as bioinformatics resources. a large number of bioinformatics tools that can aid in the analysis of viral genomes and the development of preventive and therapeutic strategies have been developed for human as well as animal viruses. this chapter will introduce virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data to elucidate viral dynamics, evolution and preventive therapeutics. analysis of viral sequences involves certain tools that are applicable to any novel sequence, for example, gene identification, orf identification, functional annotation and phylogeny. however, due to their small genome size, viruses have complex methods to maximize the coding potential of their genomes and evolution. many viruses utilize overlapping reading frames or translational frameshifts to code for multiple proteins from limited genome sequences. also, higher rates of mutation and recombination between related viruses pose a challenge in accurate phylogenetic and evolutionary analysis of viruses using general-purpose software. lately, enormous growth in the volume and diversity of viral sequences in the databases has been seen, and it has become imperative to organize these viral sequences in virus family-specific resources tailored for accurate analysis of a specific virus. one of the most common applications of bioinformatics in virology has been phylogenetic analysis of viral isolates to aid in the epidemiological analysis of viral outbreaks. general-purpose phylogeny programs such as phylip (felsenstein ) have been used extensively for the phylogeny and molecular epidemiology of viruses. 
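the distance-based methods such packages implement start from a matrix of pairwise distances between aligned sequences. a minimal sketch of the p-distance computation, assuming pre-aligned sequences of equal length (the function names are illustrative, not taken from phylip):

```python
# sketch: pairwise p-distance matrix for pre-aligned, equal-length sequences,
# the typical input to distance-based phylogeny methods such as neighbor joining
def p_distance(a, b):
    """fraction of aligned positions that differ between two sequences."""
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

def distance_matrix(seqs):
    """symmetric matrix of p-distances over a list of aligned sequences."""
    n = len(seqs)
    return [[p_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]
```

real packages go on to correct these raw distances for multiple substitutions (e.g. jukes-cantor) before tree building.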
a comprehensive list of these packages and web servers is maintained by joe felsenstein at http://evolution.genetics.washington.edu/phylip/software.html. an open reading frame (orf) is the part of a genome that translates into a protein. finding orfs is one of the key steps in viral genome analysis. it forms the basis for further analysis such as homology searching, protein prediction, functional analysis and viral vaccine and antiviral target discovery. if an orf translates into a surface protein that is unique to that virus, it may elicit immune responses and could potentially be a vaccine candidate. orf finder by ncbi is an orf prediction program (rombel et al. ) . the program outputs the range of each orf along with its protein translation in six possible reading frames from the input dna sequence. it can be used to search newly sequenced dna for potential protein-encoding sequences and to verify predicted proteins using smart blast or blastp (altschul et al. ). however, the web version of the program is limited to a query sequence length of kb only. a standalone version has no limitation on length but is available only for the linux operating system. neg , a -codon novel orf in segment of influenza virus, was visualized using orf finder (clifford et al. ). using the orf finder in association with the basic local alignment search tool blast, orfs were found in the hz- virus genome (cheng et al. ) . due to small genome size, viruses employ multiple strategies to maximize the coding potential, including frameshifts and alternative codon usage. thus, virus-specific programs have been developed to overcome these challenges. genemark (http://opal.biology.gatech.edu/genemark/genemarks.cgi) provides gene prediction tools for viruses (besemer and borodovsky ) . viral genome organizer (vgo), a java-based web tool, offers gene and orf identification in viral sequences (upton et al. ) . 
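the six-frame scan at the heart of such orf predictors can be sketched in a few lines of python. this is a simplified illustration, not the ncbi implementation; the minimum-length cut-off and function names are arbitrary choices:

```python
# sketch: six-frame orf scan (illustrative helper, not the ncbi tool)
def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=30):
    """return (strand, frame, start, end, orf) tuples for atg...stop orfs;
    coordinates on the minus strand refer to the reverse complement."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                     # first atg opens the frame
                elif codon in stops and start is not None:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None                  # close and look for the next atg
    return orfs
```

taking the first atg per frame reports the longest orf for each stop, which is the usual default; alternative start codons and frameshifts, common in viruses, are exactly what this naive scan misses.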
identification of immune epitopes is important in designing new vaccine candidates and in diagnostics. an epitope is the part of an antigen that is recognized by the receptors of immune system components such as antibodies, b cells or t cells. epitopes have been generally classified as either linear or conformational epitopes. t cells recognize linear epitopes, short continuous strings of amino acids derived from a protein antigen, presented with mhc class i molecules. b cells and antibodies, on the other hand, recognize conformational epitopes, which are formed by interactions of amino acids with multiple discontinuous segments forming a three-dimensional antigen (barlow et al. ). owing to the simple linear structure of t cell epitopes, their interaction with receptors can be modelled with high accuracy (delisi and berzofsky ) . a large number of prediction databases and servers thus are available for linear epitope prediction. mhcpep (brusic et al. ) , syfpeithi (rammensee et al. ) , fimm (schonbach et al. ) , mhcbn (bhasin et al. ) and epimhc (reche et al. ) are some of the commonly used t cell epitope prediction programs. the immune epitope database and analysis resource (https://www.iedb.org) (vita et al. ) offers the most comprehensive set of tools for epitope prediction, covering hla-a and hla-b for humans as well as chimpanzee, macaque, gorilla, cow, pig and mouse, and is one of the few databases that cover such a variety of organisms. since , iedb has used netmhcpan as its prediction method. the netmhc server uses the artificial neural network method to predict binding of peptides to different alleles from human as well as animals including cattle and pig ( from core). the database also contains curated data for many viruses including influenza and herpesviruses. b cell receptor and epitope interactions are more complex in nature than the linear epitopes for t cells; thus, prediction accuracy for b cell epitopes is relatively low. 
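mhc class i predictors of the kind listed above score short peptides (typically 9-mers) against an allele; the enumeration step that precedes scoring is straightforward to sketch (the function name is illustrative):

```python
# sketch: enumerate overlapping candidate peptides from a protein antigen,
# the unit that mhc class i binding predictors such as netmhc then score
def candidate_peptides(protein, k=9):
    """all overlapping k-mers of a protein sequence (k=9 is the common
    mhc class i peptide length)."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]
```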
furthermore, most of the current databases are centred on linear rather than conformational epitopes. bcipep is a tool developed for predicting linear b cell epitopes (saha et al. ) . epitome is a database of structure-inferred antigenic residues in proteins (schlessinger et al. ) . epitome is especially useful in the prediction of antibody-antigen complex interactions. the database is available at http://www.rostlab.org/services/epitome/. antijen is an intricate database with entries on both t cell and b cell epitopes. it emphasizes the integration of kinetic, thermodynamic, functional and cellular data within the context of immunology and vaccinology (toseland et al. ) (fig. . a ). three-dimensional structure prediction of viral proteins can be used to predict the correlation between actual protein structure and antigenic sites, folding surfaces and functional motifs. such structural modelling tools may be applied to identify and design novel candidates for antiviral inhibitors and vaccine targets. secondary structures may be predicted using the tool predictprotein (http://www.predictprotein.org/) (rost et al. ) . using this online tool, along with secondary structures, solvent accessibility and possible transmembrane helices can be predicted. further, it also provides the expected accuracy of its prediction methods. swiss-model (http://swissmodel.expasy.org/) is a popular tool for the prediction of the -d structure of a protein. -d structure prediction programs usually employ homology searching using similar and known protein structures as templates. one of the most commonly used databases for such templates is the protein data bank (pdb) (reddy et al. ) . output from the swiss-model program includes the template selected, the alignment between the query sequence and the template, and the predicted -d model. results of swiss-model are, however, only sent by email (figs. . b, . c, . d and . e). 
for a long time, bioinformatic analysis of viruses utilized common bioinformatics tools developed for other organisms. however, analysing viral genomes using general bioinformatics tools could compromise the accuracy and sensitivity of the analysis. virus genomes are too small (e.g. < kb) to compute reliable statistics on their codon usage. to maximize the coding potential, viruses work with unusual codon usage patterns comprising overlapping coding and non-coding functional elements. additionally, viruses also rely on other translational mechanisms such as stop codon read-through, frameshifting, leaky scanning and internal ribosome entry sites. comparative genomic analysis of viruses is complicated by the fact that highly conserved sequences may not be coding for anything. conservation in regions where cdss and/or non-coding functional elements overlap may indicate the presence of overlapping pairs. novel virus types comprise new cdss that are different from previously known cdss. there are multiple databases and tools available for analysis of human viruses; however, there are still only a limited number of resources designed specifically for veterinary viruses. in this section, some of the databases and resources useful for the analysis of veterinary viruses are discussed (table . ). viruses are among the most diversified and dynamic microorganisms. with increasing viral genome sequencing, there was a need to develop bioinformatics tools to compare and analyse the voluminous data. to meet this requirement, one such downloadable software package is base-by-base, which aids in analysis of whole viral genome alignments at the single-nucleotide level (brodie et al. ). moreover, with the online resource genome information broker for viruses (gib-v), comparative studies can be made using generic tools such as clustalw, blast and keyword search algorithms (hirahata et al. ). 
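the codon usage statistics mentioned above are simple to compute once a cds is read in frame; a minimal sketch (illustrative, not taken from any of the tools named):

```python
from collections import Counter

# sketch: codon usage fractions for a coding sequence read in frame 0;
# on a short viral cds these fractions are noisy, which is the point made
# in the text about small genomes
def codon_usage(cds):
    """fraction of each codon among the in-frame codons of a cds."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```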
another downloadable web server tool, viroblast, is an exclusive blast tool that can be used for queries against multiple databases (deng et al. ). sequences from a variety of viral strains can be analysed simultaneously using the alvira software, which is a multiple sequence alignment tool that provides graphical representation as well (enault et al. ). furthermore, comparative analysis of genes and genomes of coronaviruses can be carried out using covdb (coronavirus database) (huang et al. ). the digital resource viralzone is designed specifically to comprehend viral diversity and acquire information on viral molecular biology, hosts, taxonomy, epidemiology and structures (hulo et al. ). the simmonics program was upgraded to the simple sequence editor (sse) software package, wherein user-given sequences can be aligned and annotated and further analysed for diversity and phylogeny (simmonds ) . evolutionary changes in a viral genome lead to polymorphisms in its proteins, which in turn result in changes in viral phenotype such as viral virulence, viral-host interactions, etc. the digital database viralorfeome not only stores all variants and mutants of viral orfs, but also provides tools to design orf-specific cloning primers (pellet et al. ). further, degenerate primer pairs can be selected and matched to amplify user-defined viral genomes using the online tool prism (yu et al. ). recent advances in next-generation sequencing and related technologies have made it possible to study viral populations at an advanced level. viral population biodiversity and dynamics can be studied using the first such tool developed, phaccs (phage communities from contig spectrum), which can analyse shotgun sequence data to estimate the structure and diversity of phage communities (angly et al. ) . 
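community-diversity estimates of the kind phaccs produces are typically summarized with indices such as shannon diversity; a minimal sketch from genotype abundance counts (illustrative only, not the phaccs algorithm, which models the contig spectrum itself):

```python
import math

# sketch: shannon diversity index h = -sum(p * ln p) over genotype
# abundances, a standard summary of viral community structure
def shannon_diversity(counts):
    """shannon index from a list of per-genotype abundance counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)
```

a community of two equally abundant genotypes gives h = ln 2; a single genotype gives 0.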
later on, more tools/resources were developed to analyse viral metagenomic sequences, such as the viral informatics resource for metagenomic exploration (virome), the viral metagenome annotation pipeline (vmgap) and metavir (lorenzi et al. ; roux et al. ; wommack et al. ). novel viruses can be identified from a pool of specimen types using a specific computational pipeline, virushunter. the phenomenon of genetic recombination in viruses is responsible for the emergence of new viruses, increased virulence and host range, immune evasion and the development of antiviral resistance. this distinct process of viral recombination can be detected by two bioinformatics tools, viz. jphmm (jumping profile hidden markov model) and virema (schultz et al. ; routh and johnson ) . jphmm, a web server, can be used for predicting recombination in hiv- and hbv, whereas virema, a downloadable software, can be used to analyse next-generation sequencing data. additionally, another software called vipr hmm (viral identification with a probabilistic algorithm incorporating hidden markov model) can detect recombinant and non-recombinant viruses using microbial detection microarrays (allred et al. ). further, viral genome sequences can be searched for degenerate locus of recombination (lox)-like sites by a web server called selox (surendranath et al. ) . a downloadable software, virapops, is a forward simulator that allows simulation of rna virus populations (petitjean and vanet ) . with this software, the drastic changes in rapidly evolving rna viruses such as mutability, recombination, variation, covariation, etc. can be simulated to predict their effects on viral populations. seqmap is a tool capable of identifying viral integration sites (vis) from ligation-mediated pcr (lm-pcr), linear amplification-mediated pcr (lam-pcr) and non-restrictive lam-pcr (nrlam-pcr) reactions and mapping short sequences to the genome (hawkins et al. ) . 
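degenerate-site searches of the kind selox performs can be approximated by expanding iupac ambiguity codes into a regular expression; a minimal sketch (illustrative, not the selox method):

```python
import re

# sketch: search a genome for a degenerate (iupac-coded) motif, e.g. a
# lox-like site, by expanding ambiguity codes into a character-class regex
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def find_degenerate(genome, motif):
    """start positions of all (possibly overlapping) motif matches;
    a lookahead keeps overlapping hits that plain finditer would skip."""
    pattern = "(?=" + "".join(IUPAC[base] for base in motif) + ")"
    return [m.start() for m in re.finditer(pattern, genome)]
```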
further, vis can also be detected by three more distinct tools, virusseq, viralfusionseq and virusfinder (li et al. ). for more precise vis prediction, all four tools can be employed by virologists. mirnas: a microrna (mirna) is a small, regulatory, non-coding rna molecule that regulates the translation or stability of viral and host target mrnas, thereby affecting viral pathogenesis. this host-viral regulatory relationship can be investigated with a database called vita, capable of curating known viral mirna genes and known/putative target sites of host mirnas (hsu et al. ). vita exploits miranda and targetscan to scan viral genomes and determine mirna targets. vita is also capable of annotating the viruses, virus-infected tissues and tissue specificity of host mirnas. subtypes of viruses, for example, influenza viruses, and the conserved regions in various viruses can also be compared using the vita database. viral mirna candidate hairpins can be predicted using the database vir-mir. it serves as a platform to query predicted viral mirna hairpins (based on taxonomic classification) and host target genes (based on the use of the rnahybrid program) in human, mouse, rat, zebrafish, rice and arabidopsis (li et al. ) . sirna: a sirna is similar to a mirna and operates within the rna interference (rnai) pathway. it interferes with the expression of specific genes and, therefore, is used in post-transcriptional gene silencing. virsirnadb is an online curated repository that stores experimentally validated research data on sirnas and short hairpin rnas (shrnas) targeting diverse genes of important human viruses, including influenza virus (tyagi et al. ; thakur et al. ). the current database includes experimental information on sirna sequence, virus subtype, target gene, genbank accession, design algorithm, cell type, test object, method, efficacy, etc. 
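the core heuristic behind mirna target scanners such as miranda and targetscan is seed matching: finding target sites complementary to bases 2-8 of the mirna. a minimal sketch of that idea (real tools add free-energy and conservation scoring; the function names are illustrative):

```python
# sketch of seed matching: a candidate target site contains the reverse
# complement of the mirna seed (bases 2-8, here as a 0-based slice)
def revcomp_rna(seq):
    return seq.translate(str.maketrans("ACGU", "UGCA"))[::-1]

def seed_sites(mirna, target_mrna):
    """start positions in the target mrna of perfect seed-complementary sites."""
    seed = mirna[1:8]              # positions 2-8 of the mirna
    site = revcomp_rna(seed)       # what the target must contain
    hits, start = [], 0
    while True:
        i = target_mrna.find(site, start)
        if i == -1:
            return hits
        hits.append(i)
        start = i + 1
```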
sivirus is a web-based antiviral sirna design software that supports analysis of influenza virus, hiv- , hcv and sars coronavirus (naito et al. ). further, viral sirna sequence data sets can be analysed using the software tools visitor and virome (antoniewski ; watson et al. ) . a perl script called paparazzi enables reconstitution of a viral genome from viral sirnas in a given sample (vodovar et al. ). host-pathogen interactions play an important role in determining the pathogenicity of a pathogen or the immune evasion mechanisms of a host. to comprehend such interactions between viral and host cellular proteins, various databases and software tools are available. one such database is phever, which enables exploration of virus-virus and virus-host lateral gene transfers by providing evolutionary and phylogenetic information (palmeira et al. ). this distinct database catalogues homologous families between different viral sequences and between viral and host sequences. it compiles extensive data from completely sequenced genomes ( non-redundant viral genomes, non-redundant prokaryotic genomes, eukaryotic genomes ranging from plants to vertebrates). thus, it enables compiling of various proteins into homologous families containing at least one viral sequence, with related alignments and phylogenies for each of these families. with the increasing availability of viral genome sequences, data mining, curation and genome annotation have become essential components to better comprehend the structure and function of genome components. this information can further be exploited to develop diagnostics, vaccines and therapeutics. there are a number of tools available capable of annotation and classification of viral sequences, such as the ncbi genotyping tool (rozanov et al. ) , vigor (viral genome orf reader) (wang et al. ), viral genome organizer (vgo) (upton et al. ) , genome annotation transfer utility (gatu) (tcherepanov et al. ) , virus genotyping tools (alcantara et al. 
), zcurve_v (guo and zhang ) and star (subtype analyser) (myers et al. ) . vgo is a web-based genome browser that allows viewing and predicting genes and orfs in one or more viral genomes. it also allows performing searches within viral genomes and acquiring information about a genome, such as locating genes, orfs and start/stop codons. within a genome, sequences can be searched for regular expressions, fuzzy motif patterns, genes with the highest at composition, etc. using vgo, comparative analyses can be made between different viral genomes. vgo uses a graphical user interface (gui) for constructing alignments and displaying orthologues in a set of genomes. it also allows searching the translated genome for matches to mass spec peptides. vigor is an online gene prediction tool that was developed by the j. craig venter institute in . it started with gene prediction in small viral genomes such as coronavirus, influenza, rhinovirus and rotavirus. with the updated version in (https://www.ncbi.nlm.nih.gov/pmc/articles/pmc /), vigor is now capable of gene prediction in more viruses: measles virus, mumps virus, rubella virus, respiratory syncytial virus, alphavirus and venezuelan equine encephalitis virus, norovirus, metapneumovirus, yellow fever virus, japanese encephalitis virus, parainfluenza virus and sendai virus. with vigor, based on sequence similarity searches, users are able to predict protein coding regions, start and stop codons and other complex gene features such as rna editing, stop codon leakage and ribosomal shunting. further, various features such as frameshifts, overlapping genes, embedded genes, etc. can be predicted in the virus genome. additionally, mature peptides can be predicted in a given polypeptide open reading frame. vigor is also capable of genotyping influenza virus and rotavirus. four output files -a gene prediction file, a complementary dna file, an alignment file and a gene feature table file -are produced by vigor. 
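a composition search such as "genes with the highest at content" reduces to a sliding-window scan; a minimal sketch (window and step sizes here are arbitrary choices, not vgo's):

```python
# sketch: sliding-window at-content scan over a genome, the kind of
# composition search a genome browser like vgo exposes
def at_content_windows(seq, window=100, step=10):
    """list of (start, at_fraction) for each window placement."""
    out = []
    for i in range(0, len(seq) - window + 1, step):
        w = seq[i:i + window]
        out.append((i, (w.count("A") + w.count("T")) / window))
    return out
```

sorting the result by the second field surfaces the most at-rich regions.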
genbank submission can be done directly using the gene feature table. genome annotation transfer utility (gatu) facilitates quick and efficient annotation of a similar target genome using reference genomes that have already been annotated. later, users can manually curate the annotated genome. the newly annotated genomes can be saved in genbank, embl or xml file format. although it doesn't provide a complete annotation system, gatu serves as a very useful tool for preliminary work in genome annotation. gatu utilizes the tblastn and blastn algorithms to map genes onto the new target genome by using an annotated reference genome. as a result, the majority of the new genome's genes are annotated in a single step. with gatu, users can also identify open reading frames present in the target genome and absent from the reference genome. these orfs can further be scrutinized by using other bioinformatics tools such as blast and vgo, which can determine if the orfs should be included in the annotation. multiple-exon genes and mature peptides can also be analysed using gatu. a primer design tool, primerhunter, allows the design of highly sensitive and specific primers for virus subtyping by pcr (duitama et al. ). primerhunter predicts specific forward and reverse primers for a given set of dna sequences. phylotype is a web-based as well as downloadable software that uses parsimony to reconstruct ancestral traits and to select phylotypes (chevenet et al. ) . rotac is an automated genotyping tool for group a rotaviruses (maes et al. ). it works by comparing a complete orf of interest to other complete orfs of cognate genes available in the genbank database by performing blast searches. viroligo is a database that acts as a repository of virus-specific oligonucleotides for virus detection (onodera and melcher ) . the database comprises oligo data and common data tables. 
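primer design tools screen candidates on simple properties such as gc content and melting temperature before checking specificity; a minimal sketch using the wallace rule tm = 2(a+t) + 4(g+c), which holds only for short oligos (illustrative, not primerhunter's scoring):

```python
# sketch: basic primer sanity checks -- gc fraction and the wallace-rule
# melting temperature (tm in degrees c, valid for oligos of roughly 14-20 nt)
def primer_stats(primer):
    """return (gc_fraction, wallace_tm) for an uppercase dna primer."""
    a, t = primer.count("A"), primer.count("T")
    g, c = primer.count("G"), primer.count("C")
    gc = (g + c) / len(primer)
    tm = 2 * (a + t) + 4 * (g + c)
    return gc, tm
```

production tools replace the wallace rule with nearest-neighbour thermodynamics and add checks for hairpins and primer-dimers.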
the oligo data table lists pcr primers and hybridization probes that are used for viral nucleic acid detection, while the common data table contains the pcr and hybridization experimental conditions used in their detection. each oligo data entry provides the name of the oligonucleotide, the oligonucleotide sequence, the target region, the type of usage (pcr primer, pcr probe, hybridization or other), a note, and the direction of the pcr oligonucleotide (forward or reverse). each oligonucleotide entry also contains direct links to pubmed, genbank, the ncbi taxonomy databases and blast. as of the september update of viroligo, the database contains a complete listing of oligonucleotides specific to various animal viruses: vaccinia virus; canine parvovirus; porcine parvovirus; rodent parvovirus; tobamovirus; potyvirus; borna virus; bovine herpesvirus types , , and ; bovine viral diarrhoea virus; bovine parainfluenza virus; bovine respiratory syncytial virus; bovine adenovirus; bovine rhinovirus; bovine coronavirus; bovine reovirus; bovine enterovirus; foot-and-mouth disease (fmd) virus; and alcelaphine herpesvirus. virus-ploc is a web server for prediction of the subcellular localization of viral proteins within host and virus-infected cells (shen and chou ) . another web server developed a little later, iloc-virus, is a multi-label learning classifier that predicts the subcellular locations of viral proteins with single and multiple sites (xiao et al. ) . similarly, a more recent web server, ploc-mvirus (cheng et al. ) , is a new predictor that identifies the subcellular localization of viral proteins with both single and multiple location sites. it works by extracting information from the gene ontology (go) database and is claimed to be more successful than the state-of-the-art method, iloc-virus, in predicting the subcellular localization of viral proteins.
avppred is an antiviral peptide prediction algorithm built on peptides with experimentally proven antiviral activity (thakur et al. ) . the prediction is based on peptide sequence features, peptide motifs, sequence alignment, amino acid composition and physicochemical properties. vips is a viral internal ribosomal entry site (ires) prediction system that can predict ires secondary structures (hong et al. ) . vips uses the rna fold program, which predicts local rna secondary structures; the rna align program, which compares predicted structures; and the pknotsrg program (reeder et al. ) , which calculates pseudoknot structures. vazymolo, a database that deals with viral sequences at the protein level, defines and classifies viral protein modularity (ferron et al. ) . it extracts information from the complete genome sequences of various viruses in genbank and refseq and organizes the acquired information about modularity of viral orfs (fig. . f) . web-based tools are also available to predict and analyse structural aspects of viruses. learncoil-vmf is a computational tool for predicting coiled-coil-like regions in viral membrane fusion proteins (singh et al. ) . these membrane fusion proteins are known to be diverse and share no sequence similarity between most pairs of viruses in the same or different families. learncoil-vmf is also capable of characterizing the core structure of these membrane fusion proteins. viperdb (virus particle explorer database) is a web-based database that enables manual curation of icosahedral virus capsid structures (carrillo-tripp et al. ). this database serves as a comprehensive resource for the specific needs of structural virology and for comparative analysis of data derived from structural and computational analyses of capsids.
with the updated version, viperdb ( ), capsid protein residues in the icosahedral asymmetric unit (iau) can be deduced using phi-psi diagrams (azimuthal polar orthographic projections) (ref: https://www.ncbi.nlm.nih.gov/pubmed/ ). these diagrams dynamically depict interface and surface residues, as well as interface and core residues, which can be mapped to the database using a new application programming interface (api). this aids in identifying family-wide conserved residues at the interfaces. additionally, jmol and strap are built into the system to visualize interactive models of viral molecular structures. vida is a database that organizes animal virus genome open reading frames from partial and complete genomic sequences (alba et al. ) . presently, vida includes a complete collection of homologous protein families from genbank for herpesviridae, papillomaviridae, poxviridae, coronaviridae and arteriviridae. the homologous proteins in vida include both orthologous and paralogous sequences. vida retrieves virus sequences from genbank, and the files are parsed into subfields. the parsed fields contain information such as the genbank accession number, genbank identifier (gi number), protein sequence source, sequence length, gene name and gene product. in order to eliminate % redundancy, the retrieved virus protein sequences are filtered and a list of synonymous gis is created for reference. the orfs from complete and partial virus genomes are further organized into homologous protein families on the basis of sequence similarity. furthermore, the structures of known viral proteins, or of proteins homologous to viral proteins, are also mapped onto the homologous protein families. vida also provides functional classification of virus proteins into broad functional classes based on typical virus processes such as dna and rna replication, virus structural proteins, nucleotide and nucleic acid metabolism, transcription, glycoproteins and others.
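Grouping ORFs into homologous protein families by sequence similarity, as VIDA does, can be illustrated with a greedy clustering sketch. This toy uses position-wise identity on equal-length sequences and an arbitrary threshold; VIDA's actual pipeline is BLAST-based:

```python
def identity(a, b):
    # fraction of matching positions (toy metric; real pipelines
    # would align first and use BLAST-style similarity scores)
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.9):
    """Each sequence joins the first cluster whose representative it
    matches at >= threshold identity, otherwise it founds a new
    cluster. Returns a list of (representative, members) pairs."""
    clusters = []
    for s in seqs:
        for rep, members in clusters:
            if identity(s, rep) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters
```

Greedy single-pass clustering is order-dependent, which is why production tools typically cluster from all-against-all similarity searches instead.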
this database also provides alignments of conserved regions based on potential functional importance. apart from functional classification, vida also provides a taxonomical classification of the proteins and protein families. the protein families serve as a tool for functional and evolutionary studies, whereas alignments of conserved sequences provide crucial information on conserved amino acids and for the construction of sequence profiles. the viral bioinformatics resource center (vbrc) is one of eight nih-sponsored bioinformatics resource centers (http://www.oxfordjournals.org/nar/database/summary/ ). it is an online platform that provides informational and analytical tools and resources to the scientific community. the vbrc conducts basic and applied research to better comprehend the viruses included on the nih/niaid list of priority pathogens. these viruses are selected based on their potential as bioterrorism threats or as emerging or re-emerging infectious diseases. the vbrc focuses specifically on large dna viruses. it includes viruses that belong to the arenaviridae, bunyaviridae, filoviridae, flaviviridae, paramyxoviridae, poxviridae and togaviridae families. it serves as a relational database and web application that allows data storage, annotation, analysis and information exchange. the current version (v . ) consists of complete genomic sequences. using the vbrc, each viral gene and genome can be curated, yielding a comprehensive and searchable summary that details the genotype and phenotype of the genes. the role of the genes in host-pathogen relationships is also emphasized in these curations. additionally, the vbrc houses multiple analytical tools, such as tools for genome annotation, comparative analysis, whole genome alignments and phylogenetic analysis.
further, this database is also expected to include high-throughput data derived from other studies, such as microarray gene expression data, proteomic analyses and population genetics data. the poxvirus bioinformatics resource center (pbrc, now merged into vbrc) is an online platform that serves as an informational and analytical resource for better comprehending the poxviridae family of viruses. it allows data storage, annotation, analysis and information exchange. influenza virus is a major global concern; it gained particular attention after the emergence of pandemic influenza a virus (h n , swine flu) in . there are a total of web portals and tools that focus only on influenza virus, including the influenza virus database (ivdb), the influenza research database (ird) and the ncbi influenza virus resource (ncbi-ivr) (chang et al. ; bao et al. ; squires et al. ) . researchers can use all three websites for sequence databases as well as various basic tools such as blast, multiple-sequence alignment and phylogenetic tree construction. ivdb provides access to additional tools such as (i) the sequence distribution tool, which provides the global geographical distribution of a given viral genotype and correlates its genomic data with epidemiological data, and (ii) the quality filter system, which categorizes a given viral nucleotide sequence according to its sequence content (coding sequence [cds], 'utr and 'utr) and integrity (complete [c] or partial [p]) into the seven categories c to c and p to p , respectively. ncbi-ivr is the most widely used and cited online resource. with ncbi-ivr, viral genomic sequences can be annotated using a genome annotation tool, flu annotation (flan). additionally, large phylogenetic trees may be constructed and visualized in aggregated form with sub-scale details (bao et al. ; bao et al. ; zaslavsky et al. ) .
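The idea behind IVDB's quality filter, labelling each sequence record by its annotated content and its completeness, can be sketched as follows. The record fields and labels here are hypothetical and do not reproduce IVDB's actual category codes:

```python
def categorize(record):
    """Toy quality filter: returns (integrity, content) where
    integrity is 'C' (complete) or 'P' (partial) and content is the
    tuple of annotated features present in the record. Field names
    are illustrative, not IVDB's schema."""
    content = tuple(f for f in ("5'UTR", "CDS", "3'UTR") if record.get(f))
    integrity = "C" if record.get("complete") else "P"
    return integrity, content
```

A real filter would derive these labels from the sequence annotation itself rather than from pre-set flags, and would map each (integrity, content) combination to one of the fixed C/P categories.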
ird provides tools for genomic and proteomic analyses, immune epitope prediction and surveillance data for viral nucleotide sequences (squires et al. ) . furthermore, this resource is also equipped with tools that provide insight into host-pathogen interactions, virulence, host range, and correlations between sequence variation and these processes. other repositories are also available: the global initiative on sharing avian influenza data (gisaid) consortium, which mediated the epiflu database, and the flugenome database, which exclusively provides genotyping of influenza a virus and aids in detecting reassortments taking place in divergent lineages (lu et al. ). furthermore, reassortment events in influenza viruses can be identified with giraf (graph-incompatibility-based reassortment finder), a downloadable program (nagarajan and kingsford ) . another distinct repository, the influenza sequence and epitope database (ised), provides viral sequences and epitopes from asian countries; this information can be exploited to understand and study the evolutionary divergence and migration of strains (yang et al. ). the web server ativs (analytical tool for influenza virus surveillance) provides an antigenic map for conducting surveillance and selecting vaccine strains by scrutinizing serological data and haemagglutinin sequence data of influenza a/h n viruses and other influenza subtypes (liao et al. ). there is another online repository, openfludb (an isolate-centred inventory), where information on an isolate, such as virus type, host, date of isolation, geographical distribution, predicted antiviral resistance, enhanced pathogenicity or human adaptation propensity, may be obtained (liechti et al. ) . for influenza viruses, primers and probes can be designed using the influenza primer design resource (ipdr) (bose et al. ). further, prospective influenza seasonal epidemics or pandemics can be predicted using a stochastic model, flute (chao et al. ) (table . ).
the ncbi virus variation resource (ncbi-vvr) is a web-based database covering a set of viruses, viz. influenza virus, dengue virus, rotavirus, west nile virus, ebola virus, zika virus and mers coronavirus (resch et al. ). it enables users to submit their viral sequences along with relevant metadata such as sample collection time, isolation source, geographic location, host and disease severity. it further allows integrating and analysing the viral sequences using generic tools such as multiple sequence alignment and phylogenetic tree construction. rotavirus a (rva) is the most frequent cause of severe diarrhoea in human and animal infants worldwide and remains a major global threat for childhood morbidity and mortality (minakshi et al. ; basera et al. ) . in recent years, extensive research efforts have gone into the development of live, orally administered vaccines. in india, an orally administered vaccine, rotavac, was introduced after successful clinical trials in and became available to clinicians in . such vaccines will have to be scrutinized and updated regularly to accommodate emerging rotavirus genotype variations, making molecular and genetic characterization of new circulating and emerging genotypes of rotavirus strains in humans and animals necessary. recently, a classification system for rvas has been described by the rotavirus classification working group (rcwg), in which each genomic rna segment is assigned a particular letter followed by a genotype number. this classification system helps explain the importance of genetic reassortments among rvas, host range, transfer of gene segments between two different genotypes, and adaptation to different hosts. to differentiate between the gene segments of rvas, the online web-based tool rotac was developed by researchers from the rega institute, ku leuven, belgium, in (table . ).
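The best-hit idea behind genotyping tools such as RotaC, assigning a query ORF the genotype of its closest reference, can be sketched in miniature. RotaC itself runs BLAST against GenBank and applies RCWG-defined thresholds; the references and identity metric below are illustrative:

```python
def percent_identity(a, b):
    # toy metric on equal-length sequences; a real tool would use
    # BLAST alignments and alignment-length-aware identity
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / max(len(a), len(b))

def assign_genotype(query, references):
    """Return (genotype, identity) of the reference ORF closest to
    the query. `references` maps genotype labels to sequences."""
    best = max(references, key=lambda g: percent_identity(query, references[g]))
    return best, percent_identity(query, references[best])
```

In practice, an assignment would only be accepted when the best-hit identity clears the genotype-specific cutoff defined by the RCWG; otherwise the segment is flagged as a potential new genotype.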
rotac is an easy-to-use and reliable classification tool for rvas and works in agreement with the rcwg. it is a platform-independent tool that works in any web browser via its url (http://rotac.regatools.be/) and has been released without any restriction of use by academicians or anyone else. the rotac tool will reportedly be updated regularly to reflect established as well as newly emerging genotypes announced by the rcwg. much research on animal viral diseases is now conducted at the genomic level, and handling the enormous amounts of data obtained from sequencing is often daunting to researchers. this chapter provides a categorized list of bioinformatics approaches that are useful in data mining, with tables listing such bioinformatics programs by application. the tables also list databases that organize information on human and animal viruses, such as genomic data, orfs and oligonucleotides. an illustration has also been provided in the chapter showing the application of the tool predictprotein, which is used for prediction of three-dimensional structures of viral proteins. the major goal of the chapter has been to provide a roadmap to bioinformatics approaches in the field of animal viral diseases. although the chapter elaborates on virus-specific bioinformatics programs, most of these programs are designed for human viruses. bioinformatics tools that are animal-virus specific do exist, but they are limited in number. hence, in many cases, researchers have to switch to either human virus-specific tools or other generic tools. application of such tools to animal viruses or animal diseases may not be as accurate as with specialized tools, so users should take precautions when choosing the settings of such tools, and the results thus obtained need to be scrutinized.
therefore, the development of new bioinformatics programs and tools specifically designed for animal viruses and diseases should be pursued vigorously. specialized tools will provide more accurate results and predictions, thereby accelerating bioinformatics research in the field of animal viral diseases.

references
vida: a virus database system for the organization of animal virus genome open reading frames
a standardized framework for accurate, high-throughput genotyping of recombinant and non-recombinant viral sequences
hmm: a hidden markov model for detecting recombination with microbial detection microarrays
basic local alignment search tool
phaccs, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information
visitor, an informatic pipeline for analysis of viral sirna sequencing datasets
flan: a web server for influenza virus genome annotation
the influenza virus resource at the national center for biotechnology information
continuous and discontinuous protein antigenic determinants
detection of rotavirus infection in bovine calves by rna-page and rt-pcr
genemark: web software for gene finding in prokaryotes, eukaryotes and viruses
mhcbn: a comprehensive database of mhc binding and non-binding peptides
the influenza primer design resource: a new tool for translating influenza sequence data into effective diagnostics
base-by-base: single nucleotide-level analysis of whole viral genome alignments
mhcpep, a database of mhc-binding peptides: update
viperdb : an enhanced and web api enabled relational database for structural virology
influenza virus database (ivdb): an integrated information resource and analysis platform for influenza virus research
flute, a publicly available stochastic influenza epidemic simulation model
virusseq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue
analysis of the complete genome sequence of the hz- virus suggests that it is related to members of the baculoviridae
ploc-mvirus: predict subcellular localization of multi-location virus proteins via incorporating the optimal go information into general pseaac
searching for virus phylotypes
evidence for a novel gene associated with human influenza a viruses
t-cell antigenic sites tend to be amphipathic structures
viroblast: a stand-alone blast web server for flexible queries of multiple databases and user's datasets
primerhunter: a primer design tool for pcr-based virus subtype identification
alvira: comparative genomics of viral strains
mathematics vs. evolution: mathematical evolutionary theory
vazymolo: a tool to define and classify modularity in viral proteins
zcurve_v: a new self-training system for recognizing protein-coding genes in viral and phage genomes
identifying viral integration sites using seqmap
genome information broker for viruses (gib-v): database for comparative analysis of virus genomes
viral ires prediction system - a web server for prediction of the ires secondary structure in silico
vita: prediction of host micrornas targets on viruses
covdb: a comprehensive database for comparative analysis of coronavirus genes and genomes
viralzone: a knowledge resource to understand virus diversity
vir-mir db: prediction of viral microrna candidate hairpins
viralfusionseq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution
ativs: analytical tool for influenza virus surveillance
openfludb, a database for human and animal influenza virus
the viral metagenome annotation pipeline (vmgap): an automated tool for the functional annotation of viral metagenomic shotgun sequencing data
flugenome: a web tool for genotyping influenza a virus
rotac: a web-based tool for the complete genome classification of group a rotaviruses
g and p genotyping of bovine group a rotaviruses in faecal samples of diarrheic calves by dig-labeled probes
a statistical model for hiv- sequence classification using the subtype analyser (star)
giraf: robust, computational identification of influenza reassortments via graph mining
sivirus: web-based antiviral sirna design software for highly divergent viral sequences
viroligo: a database of virus-specific oligonucleotides
phever: a database for the global exploration of virus-host evolutionary relationships
viralorfeome: an integrated database to generate a versatile collection of viral orfs
virapops: a forward simulator dedicated to rapidly evolved viral populations
syfpeithi: database for mhc ligands and peptide motifs
epimhc: a curated database of mhc-binding peptides for customized computational vaccinology
virus particle explorer (viper), a website for virus capsid structures and their computational analyses
pknotsrg: rna pseudoknot folding including near-optimal structures and sliding windows
virus variation resources at the national center for biotechnology information: dengue virus
orf-finder: a vector for high-throughput gene identification
the predictprotein server
discovery of functional genomic motifs in viruses with virema - a virus recombination mapper - for analysis of next-generation sequencing data
metavir: a web server dedicated to virome analysis
a web-based genotyping resource for viral sequences
bcipep: a database of b-cell epitopes
prevalence, diagnosis, management and control of important diseases of ruminants with special reference to indian scenario
epitome: database of structure-inferred antigenic epitopes
an update on the functional molecular immunology (fimm) database
jphmm: improving the reliability of recombination prediction in hiv-
virus-ploc: a fusion classifier for predicting the subcellular localization of viral proteins within host and virus-infected cells
sse: a nucleotide and amino acid sequence analysis platform
learncoil-vmf: computational evidence for coiled-coil-like motifs in many viral membrane-fusion proteins
advances in diagnosis, surveillance, and monitoring of zika virus: an update
ebola virus - epidemiology, diagnosis and control: threat to humans, lessons learnt and preparedness plans - an update on its year's journey
biohealthbase: informatics support in the elucidation of influenza virus host pathogen interactions and virulence
influenza research database: an integrated bioinformatics resource for influenza research and surveillance
selox - a locus of recombination site search tool for the detection and directed evolution of site-specific recombination systems
genome annotation transfer utility (gatu): rapid annotation of viral genomes using a closely related reference genome
virsirnadb: a curated database of experimentally validated viral sirna/shrna
antijen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data
hivsirdb: a database of hiv inhibiting sirnas
viral genome organizer: a system for analyzing complete viral genomes
the immune epitope database (iedb)
in silico reconstruction of viral genomes from small rnas improves virus-derived small interfering rna profiling
vigor, an annotation program for small viral genomes
virusfinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data
virome: an r package for the visualization and analysis of viral small rna sequence datasets
virome: a standard operating procedure for analysis of viral metagenome sequences
iloc-virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites
influenza sequence and epitope database
prism: a primer selection and matching tool for amplification and sequencing of viral genomes
visualization of large influenza virus sequence datasets using adaptively aggregated trees with sampling-based subscale representation
identification of novel viruses using virushunter - an automated data analysis pipeline

acknowledgements: all the authors of the manuscript thank and acknowledge their respective universities and institutes. there is no conflict of interest.

key: cord- -jpow iw
authors: astrovskaya, irina; tork, bassam; mangul, serghei; westbrooks, kelly; măndoiu, ion; balfe, peter; zelikovsky, alex
title: inferring viral quasispecies spectra from pyrosequencing reads
date: - -
journal: bmc bioinformatics
doi: . / - - -s -s
sha: doc_id: cord_uid: jpow iw

background: rna viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. the genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. high-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. results: in this paper, we introduce a new viral spectrum assembler (vispa) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art shorah tool on both simulated and real pyrosequencing shotgun reads from hcv and hiv quasispecies. experimental results show that vispa outperforms shorah on simulated error-free reads, correctly assembling out of quasispecies and sequences out of quasispecies. while shorah has a significant advantage over vispa on reads simulated with sequencing errors due to its advanced error correction algorithm, vispa is better at assembling the simulated reads after they have been corrected by shorah. vispa also outperforms shorah on real reads. indeed, most frequent sequences reconstructed by vispa from a real hcv dataset are viable (do not contain internal stop codons), and the most frequent sequence was within % of the actual open reading frame obtained by cloning and sanger sequencing.
in contrast, only one of the sequences reconstructed by shorah is viable. on a real hiv dataset, shorah correctly inferred only quasispecies sequences with at most mismatches whereas vispa correctly reconstructed quasispecies with at most mismatches, and out of sequences were inferred without any mismatches. vispa source code is available at http://alla.cs.gsu.edu/~software/vispa/vispa.html. conclusions: vispa enables accurate viral quasispecies spectrum reconstruction from pyrosequencing reads. we are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.
many viruses (including sars, influenza, hbv, hcv, and hiv) encode their genome in rna rather than dna. unlike dna viruses, rna viruses lack the ability to detect and repair mistakes during replication [ ] and, as a result, their mutation rate can be as high as mutation per each , - , bases copied per replication cycle [ ] . many of the mutations are well tolerated and passed down to descendants, producing a family of co-existing related variants of the original viral genome referred to as quasispecies, a concept that originally described a mutation-selection balance [ ] [ ] [ ] [ ] [ ] . the diversity of viral sequences in an infected individual can cause the failure of vaccines and virus resistance to existing drug therapies [ ] . therefore, there is a great interest in reconstructing genomic diversity of viral quasispecies. knowing sequences of the most virulent variants can help to design effective drugs [ , ] and vaccines [ , ] targeting particular viral variants in vivo. briefly, the pyrosequencing system shears the source genetic material into fragments of approximately - bases. millions of single-stranded fragments are sequenced by synthesizing their complementary strands. repeatedly, nucleotide reagents are flown over the fragments, one nucleotide (a, c, t, or g) at a time. light is emitted at a fragment location when the flown nucleotide base complements the first unpaired base of the fragment [ , ] . multiple identical nucleotides may be incorporated in a single cycle, in which case the light intensity corresponds to the number of incorporated bases.
however, since the number of incorporated bases (referred to as a homopolymer length) cannot be estimated accurately for long homopolymers, this results in a relatively high percentage of insertion and deletion sequencing errors (which respectively represent %- % and %- % of all sequencing errors [ , ] ). the software provided by instrument manufacturers was originally designed to assemble all reads into a single genome sequence, and cannot be used for reconstructing quasispecies sequences. thus, in this paper we address the following problem: given a collection of pyrosequencing reads generated from a viral sample, reconstruct the quasispecies spectrum, i.e., the set of sequences and the relative frequency of each sequence in the sample population. a major challenge in solving the qsr problem is that the quasispecies sequences differ only slightly from each other. the amount and distribution along the genome of differences between quasispecies vary significantly between virus species, as different species have different mutation rates and genomic architectures. in particular, due to the lower mutation rate and longer conserved regions, hcv quasispecies are harder to reconstruct than quasispecies of hbv and hiv. additionally, the qsr problem is made difficult by the limited read length and relatively high error rate of high-throughput sequencing data generated by current technologies. the qsr problem is related to several well-studied problems: de novo genome assembly [ ] [ ] [ ] , haplotype assembly [ , ] , population phasing [ ] and metagenomics [ ] . as noted above, de novo assembly methods are designed to reconstruct a single genome sequence, and are not well suited for reconstructing a large number of closely related quasispecies sequences. haplotype assembly does seek to reconstruct two closely related haplotype sequences, but existing methods do not easily extend to the reconstruction of a large (and a priori unknown) number of sequences.
computational methods developed for population phasing deal with large numbers of haplotypes, but rely on the availability of genotype data that conflates information about pairs of haplotypes. metagenomic samples do consist of sequencing reads generated from the genomes of a large number of species. however, differences between the genomes of these species are considerably larger than those between viral quasispecies. furthermore, existing tools for metagenomic data analysis focus on species identification, as reconstruction of complete genomic sequences would require much higher sequencing depth than that typically provided by current metagenomic datasets. in contrast, achieving high sequencing depth for viral samples is very inexpensive, owing to the short length of viral genomes. mapping based approaches to qsr are naturally preferred to de novo assembly since reference genomes are available (or easy to obtain) for viruses of interest, and viral genomes do not contain repeats. thus, it is not surprising that such approaches were adopted in the two pioneering works on the qsr problem [ , ] . eriksson et al. [ ] proposed a multi-step approach consisting of sequencing error correction via clustering, haplotype reconstruction via chain decomposition, and haplotype frequency estimation via expectation-maximization, with validation on hiv data. in westbrooks et al. [ ] , the focus is on haplotype reconstruction via transitive reduction, overlap probability estimation and network flows, with application to simulated error-free hcv data. recently, the qsr software tool shorah was developed [ ] and applied to hiv data [ ] . another combinatorial method for qsr was also developed and applied to hiv and hbv data in [ ] , with results similar to those of shorah. 
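The expectation-maximization step used for haplotype frequency estimation in several of these methods can be sketched for a toy setting in which each read is known to be compatible with a subset of candidate sequences; real implementations weight reads by alignment likelihoods and sequencing-error models rather than treating compatible candidates as equally likely:

```python
def em_frequencies(compat, n_candidates, iters=100):
    """Toy EM for quasispecies frequencies.

    `compat` lists, for each read, the indices of candidate sequences
    that could have produced it. E-step: split each read among its
    compatible candidates in proportion to the current frequencies.
    M-step: re-normalize the expected counts into frequencies."""
    freqs = [1.0 / n_candidates] * n_candidates
    for _ in range(iters):
        counts = [0.0] * n_candidates
        for cands in compat:
            total = sum(freqs[c] for c in cands)
            for c in cands:
                counts[c] += freqs[c] / total  # expected fraction of this read
        freqs = [c / len(compat) for c in counts]
    return freqs
```

With three reads compatible only with candidate 0 and one read compatible only with candidate 1, the estimate converges to frequencies of 0.75 and 0.25, matching the read proportions.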
our contributions in this paper are as follows:
• a novel qsr tool called viral spectrum assembler (vispa) taking into account sequencing errors at multiple steps,
• comparison of vispa with shorah on hcv synthetic data both with and without sequencing errors, and
• statistical and experimental validation of the two methods on real pyrosequencing reads from hcv and hiv samples.
our method for inferring the quasispecies spectrum of a virus sample from pyrosequencing reads consists of the following steps (see fig. ):
• constructing the consensus virus genome sequence for the given sample and aligning the reads onto this consensus,
• preprocessing aligned reads to correct sequencing errors,
• constructing a transitively reduced read graph with vertices representing reads and edges representing overlaps between them,
• selecting paths in the read graph that correspond to the most probable quasispecies sequences, and assembling candidate sequences for selected paths by weighted consensus of reads, and
• estimating candidate sequence frequencies by em.
below we describe each step separately. we assume that a reference genome sequence of the particular virus strain is available (e.g., from ncbi [ ] ). since viral genomes do not have sizable repeats and the quasispecies sequences are usually close enough to the reference sequence, the majority of reads can typically be uniquely aligned onto the reference genome. however, a significant number of reads may remain unaligned due to differences between the reference genome and sequences in the viral sample. in order to recover as many of these reads as possible, we iteratively construct a consensus genome sequence from aligned reads. in particular, we first align pyrosequencing reads to the reference sequence using the segemehl software [ ] . then we extend the reference sequence with a placeholder i for each nucleotide inserted by at least one uniquely aligned read. 
similarly, we add a placeholder d to the read sequence for each reference nucleotide missing from the aligned read. then we perform sequential multiple alignment of the previously aligned reads against this extended reference sequence. finally, the consensus genome sequence is obtained by (1) replacing each nucleotide in the extended reference with the nucleotide or placeholder in the majority of the aligned reads and (2) removing all i and d placeholders, respectively corresponding to rare insertions and to deletions found in a majority of reads. reads may contain a small portion of unidentified nucleotides denoted by n's. we treat n as a special allele value matching any of the nucleotides a, c, t, g, as well as the placeholders i and d. iteratively, we replace the reference with the consensus and try to align the reads for which we could not find any acceptable alignment previously. our experiments on a dataset consisting of approximately , pyrosequencing reads generated from a . kb-long hcv fragment (see data description in results and discussions) show that % of reads are uniquely aligned onto the reference sequence and an additional % of the reads are aligned onto the final consensus sequence. reads that cannot be aligned onto the final consensus are removed from further consideration. since aligned reads contain insertions and deletions, we use the placeholders i and d to simplify position referencing among the reads. all placeholders are treated as additional allele values, but they are removed from the final assembled sequences. first, we substitute each deletion in the aligned reads with placeholder d. a deletion supported by a single read is replaced either with the allele value that is present in all other reads overlapping this position, or with n, signifying an unknown value, otherwise. next, we fill with placeholder i each gap in a read corresponding to insertions in the other reads. 
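the majority-vote consensus step above can be sketched as follows. this is a minimal illustration, not the vispa implementation: the function name, the `(start, sequence)` read encoding, and the fallback for uncovered positions are our assumptions.

```python
from collections import Counter

def consensus(aligned_reads, ref_len):
    """Majority-vote consensus over reads aligned to an extended reference.

    aligned_reads: list of (start, sequence) pairs; sequences may contain
    the placeholders 'I' (insertion) and 'D' (deletion), plus 'N' for an
    unknown base that matches any allele (so it never casts a vote).
    """
    columns = [Counter() for _ in range(ref_len)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            if base != 'N':
                columns[start + offset][base] += 1
    picked = [col.most_common(1)[0][0] if col else 'N' for col in columns]
    # Dropping winning placeholders removes rare insertions (columns where
    # 'I' wins) and majority-supported deletions (columns where 'D' wins).
    return ''.join(base for base in picked if base not in ('I', 'D'))
```

for example, three reads voting 'D' twice at one position yield a consensus with that position deleted.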
all insertions supported by a single read are removed from consideration. we begin with the definition of the read graph, introduced in [ ] and independently in [ ] , and then describe the adjustments that need to be made to read graph construction and edge weights to account for sequencing errors as well as the high mutation rate between quasispecies. the read graph g = (v, e) is a directed graph with vertices corresponding to reads aligned with the consensus sequence. for a read u, we denote by b(u), respectively e(u), the genomic coordinate at which the first, respectively the last, base of u gets aligned. a directed edge (u, v) connects read u to read v if a suffix of u overlaps with a prefix of v and they coincide across the overlap. two auxiliary vertices, a source s and a sink t, are added such that s has edges into all reads with zero indegree and t has edges from all reads with zero outdegree. then each st-path corresponds to a possible candidate quasispecies sequence. the read graph is transitively reduced, i.e., each edge e = (u, v) is removed if there is a uv-path not including edge e. note that certain reads can be completely contained inside other reads. let a superread refer to a read that is not contained in any other read, and let the rest of the reads be called subreads. subreads are not used in the construction of the read graph, but are taken into account in the final assembly of candidate sequences and frequency estimation. since the number of different st-paths is exponential, we wish to generate a set of paths that have a high probability of corresponding to real quasispecies sequences. in order to estimate path probability, we independently estimate for each edge e the probability p(e) that it connects two reads from the same quasispecies, and then multiply the estimated probabilities for all edges on the path. 
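the error-free read graph construction and transitive reduction described above can be sketched as follows. this is a minimal sketch under our own assumptions (a dict-of-tuples read encoding, inclusive coordinates, exact-match overlaps); the real construction also handles mismatches and superread/subread separation.

```python
def build_read_graph(reads):
    """Build a transitively reduced read graph.

    reads: dict name -> (b, e, seq) with inclusive alignment coordinates.
    An edge u -> v exists when a suffix of u overlaps a prefix of v and the
    two reads coincide across the overlap (error-free case: exact match).
    """
    edges = set()
    items = sorted(reads.items(), key=lambda kv: kv[1][0])
    for u, (bu, eu, su) in items:
        for v, (bv, ev, sv) in items:
            if u == v or not (bu < bv <= eu < ev):
                continue  # require a proper suffix/prefix overlap
            olen = eu - bv + 1
            if su[-olen:] == sv[:olen]:
                edges.add((u, v))
    # transitive reduction: drop (u, v) if some longer u..v path exists
    adj = {u: {v for (x, v) in edges if x == u} for u in reads}
    reduced = set(edges)
    for (u, v) in edges:
        for w in adj[u]:
            if w != v and v in reachable(adj, w):
                reduced.discard((u, v))
                break
    return reduced

def reachable(adj, start):
    """All vertices reachable from start (including start itself)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adj.get(node, ()))
    return seen
```

with three mutually overlapping reads, the direct first-to-third edge is removed because the path through the middle read covers it.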
under the assumption of independence between edges, if we assign to each edge e a cost equal to −log(p(e)) = log(1/p(e)), then the minimum-cost st-path will have the maximum probability to represent a quasispecies sequence. for reads without errors, [ ] estimated the probability p_Δ that two reads u and v connected by edge (u, v) belong to the same quasispecies as a function of the overhang Δ between reads u and v [ ] , n = #reads, q = #quasispecies, and l = #starting positions. thus, in this case the cost of an edge with overhang Δ can be approximated by Δ, since log(1/p_Δ) ∝ Δ. to account for sequencing errors, we adjust the construction of the read graph to allow for mismatches. we use three parameters: (1) n = #mismatches allowed between a read and a superread, (2) m = #mismatches allowed in the overlap between two adjacent reads, and (3) t = #mismatches expected between a read and a random quasispecies. the probability that two reads u and v with j mismatches within an overlap of length o = e(u) − b(v) + 1 belong to the same quasispecies can be estimated as a function of j, o, and the estimated sequencing error rate ε. as in the case of error-free reads, defining the edge costs as these negative log-probabilities ensures that st-paths with low cost correspond to the most likely quasispecies sequences. to generate a set of high-probability (low-cost) paths that are rich enough to explain the observed reads, we compute for each vertex in the read graph the minimum-cost st-path passing through it. finding these paths is computationally fast. indeed, we only need to compute two shortest-path trees in g, one outgoing from s and one incoming into t; the shortest st-path passing through a vertex v is the concatenation of the shortest sv- and vt-paths. preliminary simulation experiments (see additional file ) show that better candidate sets are generated when the edge costs c defined by ( ) and ( ) are replaced by e^c. in fact, if we use an even faster-growing dependency on c, then we obtain better candidate sets. 
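the two-shortest-path-tree trick above can be sketched as follows; a minimal illustration with an assumed adjacency-list graph representation, and `dijkstra`/`min_cost_path_through` are our names, not vispa's.

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths; adj: node -> list of (neighbor, cost)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale heap entry
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def min_cost_path_through(adj, s, t):
    """Cost of the cheapest s-t path through each vertex v, from one
    shortest-path tree out of s and one into t (on the reversed graph)."""
    fwd = dijkstra(adj, s)
    rev = {}
    for u, nbrs in adj.items():
        for v, w in nbrs:
            rev.setdefault(v, []).append((u, w))
    bwd = dijkstra(rev, t)
    return {v: fwd.get(v, float('inf')) + bwd.get(v, float('inf'))
            for v in set(fwd) | set(bwd)}
```

two dijkstra runs thus price the best st-path through every vertex at once, instead of one search per vertex.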
the fastest-growing cost effectively changes the shortest path into the so-called max-bandwidth path, i.e., the path that minimizes the maximum edge cost over the entire path and over each subpath. so, vispa generates candidate paths using this strategy. when no mismatches are allowed in the construction of the read graph, finding the candidate sequence corresponding to an st-path is trivial, since by definition adjacent superreads coincide across their overlap. when mismatches are allowed, we first assemble a consensus sequence from the superreads used by the st-path. this may not be the best choice, especially when the coverage with superreads is low. hence, we replace each initial candidate sequence with a weighted consensus sequence obtained using both superreads and subreads of the path, as described below. for each read r, we compute the probability that it belongs to a particular initial candidate sequence s as a function of k and the estimated mutation rate t/L, where l and L denote the lengths of the read and the initial candidate sequence, respectively, and k is the number of mismatches between the read and the initial candidate sequence s. the final candidate sequence is then computed as the weighted consensus over all reads, where the weight of a read is the probability that it belongs to the sequence. note that, unlike the case without mismatches, the same candidate sequence can be obtained from different candidate st-paths, so we remove duplicates at the end of this step. we assume that the reads r with observed frequencies were generated from a quasispecies population q as follows. first, a quasispecies sequence q ∈ q is randomly chosen according to its unknown frequency f_q. a read starting position is generated from the uniform distribution, and then a read r is produced from quasispecies q with j sequencing errors. the probability of this event is calculated as h_{q,r} ∝ C(l, j) ε^j (1 − ε)^{l−j}, where l is the read length and ε is the sequencing error rate. 
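the max-bandwidth criterion above can be sketched as a bottleneck variant of dijkstra's algorithm, in which a path's cost is the maximum of its edge costs rather than their sum. this is a minimal sketch with our own names and graph encoding, not vispa's implementation.

```python
import heapq

def max_bandwidth_path(adj, s, t):
    """Bottleneck shortest path: minimize the maximum edge cost on an
    s-t path (the limiting behavior of shortest paths under a
    fast-growing transformation of edge costs)."""
    best = {s: 0.0}
    parent = {s: None}
    heap = [(0.0, s)]
    while heap:
        bottleneck, u = heapq.heappop(heap)
        if bottleneck > best.get(u, float('inf')):
            continue  # stale heap entry
        for v, cost in adj.get(u, ()):
            cand = max(bottleneck, cost)  # path cost = worst edge so far
            if cand < best.get(v, float('inf')):
                best[v], parent[v] = cand, u
                heapq.heappush(heap, (cand, v))
    path, node = [], t
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[t], path[::-1]
```

on a graph where one route has a single expensive edge and another spreads its cost evenly, the evenly priced route wins even if its total cost is higher.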
in our simulation studies we use the following read data sets. in order to perform cross-validation on the assembly method, we simulate read data from a -bp-long fragment from the e e region of hcv sequences [ ] with sequence frequencies generated according to a specific distribution. in our simulation experiments, we use a geometric distribution (the i-th sequence is a constant factor more frequent than the (i + 1)-th sequence) to create sample quasispecies populations with different numbers of randomly selected quasispecies sequences. we first simulate reads without sequencing errors: the length of a read follows a normal distribution with a particular mean value and variance, and the starting position follows the uniform distribution. this simplified model of read generation has two parameters: the number of reads, which varies from k up to k, and the average read length, which varies from bp up to bp. additionally, we simulate pyrosequencing reads from quasispecies sequences (following a geometric distribution of frequencies) out of hcv sequences [ ] using flowsim [ ] . we generated k reads with average length bp. the first data set was received from the hcv research group at the institute of biomedical research, university of birmingham. it contains , reads obtained from a . kb-long fragment of the hcv- a genome (which is more than half of the entire hcv genome). the average aligned read length is bp, but it varies significantly, as does the depth of position coverage (see additional file for details). the variability in coverage depth is due to a strong bias in the sequence start points, reflecting the secondary structure of the template dna or rna used to generate the initial pcr products. as a result, shorter reads are produced by gc-rich sequences. the data is available upon request from the authors. the hiv dataset [ ] contains , reads from a mixture of different . kb-long regions of hiv- quasispecies, including the pol protease and part of the pol reverse transcriptase. the aligned read length varies from bp to bp, with an average of about bp (see additional file for details). in contrast to [ ] , we do not filter out reads with low-quality scores. in all our experimental validations, we compare the proposed algorithm vispa with the state-of-the-art tool shorah, as well as with vispa on shorah-corrected reads (shorah-reads + vispa). we say a quasispecies sequence is captured if one of the candidate sequences exactly matches it. we measure the quality of assembly by the portion of the real quasispecies sequences captured by the candidate sequences (sensitivity) and by the portion of captured sequences among the candidates. here, we see an advantage of vispa over shorah. following [ ] , we measure the prediction quality of the frequency distribution with the kullback-leibler divergence, or relative entropy. given two probability distributions, the relative entropy measures the "distance" between them or, in other words, the quality of approximation of one probability distribution by the other. formally, the relative entropy between the true distribution p and the approximating distribution q is given by kl(p || q) = Σ_i p(i) log(p(i)/q(i)), where the summation is over all reconstructed original sequences i ∈ {i | p(i) > 0, q(i) > 0}, i.e., over all original sequences that have a match (exact or with at most k mismatches) among the assembled sequences. the relative entropy decreases as the average read length increases. this is expected, since sensitivity increases with average read length and em predicts the underlying distribution more accurately. the vispa algorithm considerably outperforms shorah (see fig. (right)). however, shorah has a significant advantage over vispa on read data simulated by flowsim, both in prediction power and in robustness of results (see table ). indeed, shorah correctly infers out of the real quasispecies sequences, whereas vispa reconstructs only one sequence. 
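the relative entropy computation described above can be sketched directly from its definition; a minimal sketch with frequencies held in dicts keyed by sequence, where the function name is ours.

```python
import math

def relative_entropy(p, q):
    """KL(P || Q) restricted to sequences reconstructed by both methods:
    sum over i with p(i) > 0 and q(i) > 0 of p(i) * log(p(i) / q(i))."""
    support = [i for i in p if p[i] > 0 and q.get(i, 0) > 0]
    return sum(p[i] * math.log(p[i] / q[i]) for i in support)
```

identical distributions score zero; the more q under- or over-weights frequent sequences, the larger the divergence.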
additionally, the most frequent assemblies inferred by shorah are more robust, being repeated up to % of the time on %-reduced data versus % of the time for vispa's assemblies. this advantage can be explained by superior read correction in shorah. if vispa is used on shorah-corrected reads, the results drastically improve: quasispecies sequences are inferred and are repeated exactly % of the time on reduced data, confirming that vispa is better at assembling sequences (see table ).
experimental validation on pyrosequencing reads from hcv samples. we first discuss the choice of parameters of the read graph and candidate sequence assembly from st-paths. then we give statistical validation for the obtained most frequent quasispecies sequences. we infer the quasispecies spectrum based on read graphs constructed with various numbers n and m (the numbers of mismatches allowed for superreads and for overlaps corresponding to edges). we sort the estimated frequencies in descending order and count the number of sequences whose cumulative frequency is %, %, and %. fig. reports these numbers as a percentage of the total number of candidate sequences. there is an obvious drop in percentage for all three categories if we allow up to n = mismatches to cluster reads and up to m = mismatches to create edges. in this case, the constructed read graph has no isolated vertices. to refine assembled candidate sequences, we use all reads and a parameter t varying from bp to bp or, in other words, a mutation rate varying from . % up to % per sequence (which is in the range observed in [ ] ). out of the max-bandwidth paths, we obtain as many as distinct sequences (t = ) and as few as sequences (t = ) for different values of t [ ; ]. the neighbor-joining tree for the most frequent candidate sequences obtained by vispa and shorah (see fig. ) resembles a neighbor-joining tree for hcv quasispecies evolution. 
additionally, the most frequent candidate sequence found by vispa is % identical to one of the actual orfs obtained by cloning the quasispecies. a quasispecies sequence is considered found if one of the candidate sequences matches it exactly (k = ) or with at most k ( or ) mismatches. all methods are run times on %-reduced data. for the i-th (i = , .., ) most frequent sequence assembled on the whole data, we record its reproducibility, i.e., the percentage of runs in which there is a match (exact or with at most k mismatches) among the most frequent sequences found on the reduced data. "reproducibility: max" and "reproducibility: average" report respectively the maximum and the average of those percentages.
figure : percentage of candidate sequences whose cumulative frequency is %, %, and %. the values on the x-axis correspond to the number of allowed mismatches during read graph construction. n_m means that up to n mismatches are allowed in superreads and up to m mismatches are allowed in edges.
viral sequences containing internal stop codons are not viable, since the entire hcv genome consists of a single coding region for a large polyprotein. so the number of reconstructed viable sequences can serve as an accuracy measure for quasispecies assembly. out of the most frequent sequences reconstructed by vispa, only are not viable, while shorah is able to reconstruct only one viable sequence. this sequence has . % similarity with vispa's fourth most frequent assembly. both methods returned similar frequency estimations for this sequence: . % (shorah) and . % (vispa). both shorah and vispa (n = , m = ) are run on eight . ghz cpus with m cache. they take around minutes to assemble sequences and estimate their frequencies. a smaller value of n increases vispa's runtime, since its bottleneck (candidate sequence assembly) is proportional to the number of reads times the number of paths. 
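the stop-codon viability check used as an accuracy measure above can be sketched as follows; a minimal sketch under the assumption of a known reading frame, with the function name and the terminal-stop allowance being our choices.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_viable(seq, frame=0):
    """Treat a reconstructed sequence as viable when its single reading
    frame contains no internal stop codon (a terminal stop is allowed)."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return all(c not in STOP_CODONS for c in codons[:-1]) if codons else True
```

counting `is_viable` hits over the most frequent assemblies gives the viability-based accuracy measure described in the text.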
indeed, a smaller value of n results in a larger number of superreads in the built read graph and, thus, in a larger set of candidate paths. for example, vispa runs minutes for n = , m = . the plot in fig. shows validation results for the most frequent quasispecies sequences, with respect to em estimations, assembled on the data by shorah and vispa (n = , m = , and t = ). we repeatedly ( times) deleted a randomly chosen % of the reads and ran both methods on each reduced read instance to reconstruct the quasispecies spectrum. the plot reports the percentage of runs in which each of the most frequent sequences assembled on the full data is reproduced among the most frequent quasispecies inferred on the reduced instances with no mismatches (k = ), or with k = , , mismatches.
figure : the neighbor-joining phylogenetic tree for the most frequent hcv quasispecies variants on a , bp-long fragment obtained by vispa and shorah. sequences are labeled with the software name and rank among the most frequent assembled sequences.
figure : percentage of runs in which the i-th most frequent sequence is reproduced among the most frequent quasispecies assembled on the %-reduced set of reads. the i-th point on the x-axis corresponds to the i-th most frequent sequence assembled on the % of reads. no data are shown for sequences that are reproduced in less than % of runs.
for example, for k = shorah repeatedly ( % of the time) reconstructs only the third most frequent sequence, while vispa reconstructs sequences at least % of the time, and the most frequent sequence is reconstructed % of the time. this plot shows that the sequences found by vispa are quite reproducible. in order to compare vispa and shorah, we run both methods on the hiv dataset used in the first experiment in [ ] . as said above, we do not preprocess reads with respect to their quality scores, and this can explain the poorer performance of shorah. 
indeed, shorah correctly infers only quasispecies sequences with at most mismatches: one assembly has mismatches with the real quasispecies sequence, and the other has mismatches. vispa correctly reconstructs quasispecies with at most mismatches ( of them among the most frequent assemblies): two sequences are inferred without any mismatches (one is among the most frequent assemblies), one assembly has mismatch with the real quasispecies sequence (and it is among the most frequent assemblies), and the remaining sequences have mismatches (one is among the most frequent assemblies). the assemblies correspond to viable protein sequences. if vispa is applied to shorah-corrected reads, it can successfully infer three real quasispecies without any mismatches. in this paper, we have proposed and implemented vispa, a novel software tool for quasispecies spectrum reconstruction from high-throughput sequencing reads. the vispa assembler takes into account sequencing errors at multiple steps, including mapping-based read preprocessing, path selection based on maximum bandwidth, and candidate sequence assembly using probability-weighted consensus techniques. sequencing errors are also taken into account in vispa's em-based estimation of quasispecies sequence frequencies. we have validated our method on simulated error-free reads, flowsim-simulated reads with sequencing errors, and real pyrosequencing reads from hcv and hiv samples. we are currently exploring extensions of vispa to paired-end reads; the main difficulty is the selection of pair-aware candidate paths. we also foresee application of vispa's techniques to the analysis of high-throughput sequencing data from microbial communities [ ] and ecological samples of eukaryote populations [ ] . the vispa source code is available at http://alla.cs.gsu.edu/~software/vispa/vispa.html. additional file : supplementary materials. 
the file contains the derivation of the edge cost formula ( ) and the em algorithm, an example of read graph construction and an analysis of pyrosequencing data.
references:
• rna virus quasispecies: significance for viral disease and epidemiology
• mutation rates among rna viruses
• rna virus mutations and fitness for survival
• the quasispecies (extremely heterogeneous) nature of viral rna genome populations: biological relevance - a review
• the molecular quasi-species
• hepatitis c virus (hcv) circulates as a population of different but closely related genomes: quasispecies nature of hcv genome distribution
• rapid evolution of rna viruses
• rna virus populations as quasispecies. current topics in microbiology and immunology
• computational methods for the design of effective therapies against drug resistant hiv strains
• hiv- subtype b protease and reverse transcriptase amino acid covariation
• the rational design of an aids vaccine
• diversity considerations in hiv- vaccine selection
• pyrosequencing: an accurate detection platform for single nucleotide polymorphisms
• genome sequencing in microfabricated high-density picolitre reactors
• pyrobayes: an improved base caller for snp discovery in pyrosequences
• quality scores and snp detection in sequencing-by-synthesis systems
• short read fragment assembly of bacterial genomes
• building fragment assembly string graphs
• whole-genome sequencing and assembly with high-throughput, short-read technologies
• hapcut: an efficient and accurate algorithm for the haplotype assembly problem
• algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem
• snp: scalable phasing based on -snp haplotypes
• environmental genome shotgun sequencing of the sargasso sea
• beerenwinkel n: viral population estimation using pyrosequencing
• hcv quasispecies assembly using network flows
• deep sequencing of a genetically heterogeneous sample: local haplotype reconstruction and read error correction
• error correction of next-generation sequencing data and reliable estimation of hiv quasispecies
• combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing
• fast mapping of short sequences with mismatches, insertions and deletions using index structures
• maximum likelihood from incomplete data via the em algorithm (with discussions)
• hepatitis c virus continuously escapes from neutralizing antibody and t-cell responses during chronic infection in vivo
• characteristics of pyrosequencing data-enabling realistic simulation with flowsim
• the quasispecies nature and biological implications of the hepatitis c virus. infection
• robust haplotype reconstruction of eukaryotic read data with hapler
• inferring viral quasispecies spectra from pyrosequencing reads
authors' contributions: ia designed algorithms, developed software, performed analysis and experiments, and wrote the paper. bt performed analysis and experiments. sm contributed to developing the software. kw designed algorithms and developed software. im contributed to designing the algorithms and writing the paper. pb supplied the hcv data and contributed to performing the analysis. az designed the algorithms, wrote the paper and supervised the project. all authors have read and approved the final manuscript. the authors declare that they have no competing interests.
key: cord- - vouj pp
authors: latif, seemab; bashir, sarmad; agha, mir muntasar ali; latif, rabia
title: backward-forward sequence generative network for multiple lexical constraints
date: - -
journal: artificial intelligence applications and innovations
doi: . / - - - - _
sha: doc_id: cord_uid: vouj pp
advancements in long short term memory (lstm) networks have shown remarkable success in various natural language generation (nlg) tasks. however, generating sequences from pre-specified lexical constraints is a new, challenging and less researched area in nlg. lexical constraints take the form of words in the language model's output to create fluent and meaningful sequences. 
furthermore, most of the previous approaches cater to this problem by allowing the inclusion of pre-specified lexical constraints during the decoding process, which increases the decoding complexity exponentially or linearly with the number of constraints. moreover, most of the previous approaches can only deal with a single constraint. in this paper, we propose a novel neural probabilistic architecture based on a backward-forward language model and a word embedding substitution method that can cater to multiple lexical constraints for generating quality sequences. experiments show that our proposed architecture outperforms previous methods in terms of intrinsic evaluation. recently, recurrent neural networks (rnns) and their variants such as long short term memory (lstm) and gated recurrent unit (gru) based language models have shown promising results in generating high quality text sequences, especially when the input and output are of variable length. rnn-based language models (lms) have the ability to capture the sequential nature of language, be it for words, characters or whole sentences. this allows them to outperform other language models in sequence prediction and classification tasks. to learn the distributed representation of data efficiently with rnns, multiple methods have been proposed, such as word embeddings. these mainly include the continuous bag-of-words (cbow) and skip-gram (sg) models [ , ] . the cbow model predicts a word vector at the current time step, given the preceding and succeeding context word vectors. the sg model is the opposite in approach, predicting the context from the target word vector, but the same in architecture. existing methods to incorporate constraints in the output sentences or to generate lexically constrained sentences have multiple limitations. 
[ ] proposed variants of the backward-forward generation approach, which cannot handle out-of-vocabulary (oov) words and can only generate sentences with a single lexical constraint. similarly, [ ] proposed a synchronous training approach to generate lexically constrained sequences with generative adversarial networks (gans). moreover, various lexically constrained decoding methods have been proposed for constrained sequence generation through extensions of beam search that allow the inclusion of constraints [ , ] . such lexically constrained decoding methods do not examine which specific words need to be included at the start of generation, but try to force specific words at each time step during the generation process at the cost of high computational complexity [ ] . the remainder of this paper is organized as follows. we review the related work in sect. . section describes our proposed architecture, and sect. explains the dataset, experimental setup, comparison models and evaluation criteria. section gives an in-depth result analysis, findings and discussion about future directions. finally, sect. concludes the paper. in general, the purpose of a lm is to capture the regularities of a language as well as its morphological and distributional properties. a lm aims to compute the probability of a word sequence in order to estimate the maximum likelihood of an upcoming word to be predicted in the sequence. a lm learns the distributed representation of words to interpret semantic and syntactic relations between sequences of words. in the past, rnns have shown progressive success in language modeling over traditional methods based on statistical counts. the ability of rnn language models (rnnlms) to learn long-term contextual dependencies and to capture the inherited sequential nature of language makes them better than other traditional methods [ ] . particularly in the sentence generation task, rnnlms perform well because of their capability of learning highly complicated structures of language. 
rnnlms make maximum a posteriori (map) estimations for predicting words in a sentence [ ] . mou et al. first proposed multiple variants of backward and forward (b/f) language models based on grus for constrained sentence generation [ ] . for training the b/f language models, sentences were split by choosing a word randomly. this resulted in the positional information of words getting smoothed out while generating sentences, and thus the models lose the positional information of the word. this method of choosing a split word badly influences the joint probability estimation of a sentence. liu et al. proposed an algorithmic framework dubbed backward and forward generative adversarial networks (bfgan) for constrained sentence generation [ ] . bfgan consists of three modules: a discriminator, an lstm-based backward generator and a forward generator with attention mechanism. the purpose of the discriminator is to distinguish real sentences from constrained sentences generated by the machine and to guide the joint training of both backward and forward generators by assigning them reward signals. the backward generator takes a lexical constraint as input, which can be a word, phrase or fragment, and generates the first half of the sentence backwards. the forward generator takes the half sentence generated by the backward generator as input to complete the sentence, with the aim of fooling the discriminator. the sentences prepared for training the backward generator rely on random splitting of sentences, and the proposed framework can tackle only single-constraint sentence generation. another line of work tackles the problem of constrained sentence generation by sampling sentences from the search space. su et al. proposed a gibbs sampling method based on the markov chain monte carlo (mcmc) method for decoding constrained sentences [ ] . the proposed approach consists of a discriminator and a pure language model conditioned on a bi-directional rnn. 
introducing the discriminator in the proposed method caters for calculating the probability of a sentence satisfying the constraints. the gibbs method samples a set of random variables x_{1...n} from a joint distribution, which takes the form of the words making up a sentence. the shortcoming of gibbs sampling is that it cannot change the length of sentences and hence is not able to solve complicated tasks like directly generating sentences from constraints established in advance. miao et al. extend gibbs sampling by introducing metropolis-hastings for constrained sentence generation (cgmh) [ ] . the proposed method directly samples from the sentence space by defining local operations in the sentence space such as word replacement, insertion and deletion. hokamp et al. proposed the grid beam search (gbs) algorithm, an extension of beam search, for incorporating specified lexical constraints in the output sequences [ ] . in the neural machine translation (nmt) task, the proposed algorithm ensures that a hypothesis satisfies all specified constraints before it can be considered complete. to generalize image caption generative models to out-of-domain images constituting novel scenes or objects, anderson et al. proposed a constrained beam search (cbs) decoding method, which utilizes a finite-state machine (fsm) [ ] . the proposed search algorithm is capable of forcing certain image tags over the resulting output sequences by recognizing valid sequences with the fsm. table summarizes the techniques for generating constrained sequences. it is evident that many of the architectures are designed for specific scenarios and have high computational complexity. due to these performance gaps and the inability to handle multiple constraints efficiently, a new method needs to be developed. therefore, we have proposed a neural probabilistic backward-forward architecture that can generate high quality sequences, with a word embedding substitution method to satisfy multiple constraints. 
to begin with, we state the problem of constrained sequence generation as follows: given the constraint(s) c as input, the proposed b/f lm needs to generate a fluent sequence s = w , · · ·, w v , · · ·, w m maximizing the conditional probability p(s|c). for this purpose, we need to select a split word in a sequence s to train the proposed b/f lm. since a sequence expresses an action, the part-of-speech (pos) verb plays a vital role in placing the subject of a sequence into motion and offers more clarification about the sequence. in this section, we first discuss the general seq2seq model for the generation of sequences. after that, we discuss our proposed architecture for constrained sequence generation. conventionally, rnnlms for text generation are trained to maximize the likelihood of a word w t or character c t at time step t given the context of previous observations in the sequence. this type of learning technique for generating sequences is known as teacher forcing [ ] . in this learning technique, the input to the recurrent neural probabilistic language model is of fixed size. the training objective is to predict only the next token, given the context of previous observations, until a special stop sign is generated or a specific constraint is satisfied in the sequence. traditional seq2seq models cannot satisfy lexical constraints; the joint probability of an output sentence y = y , y · · ·y m for a given input sentence x = x , x · · ·x n decomposes as p(y|x) = ∏ t p(y t | y <t , x). thus, the output sentence y is predicted from y to y m in sequence, either by a greedy or a beam decoder. such a decomposition follows from natural language's sequential nature. our proposed approach consists of a neural probabilistic architecture that is an ensemble of two lstm-based b/f lms for generating lexically constrained sequences, which captures the statistical properties of text sequences effectively.
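the chain-rule decomposition and greedy decoding described above can be illustrated with a toy model; the conditional distributions below are hypothetical, not learned:

```python
# toy next-token distributions p(y_t | y_<t) for one fixed input x
COND = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.7, "dog": 0.3},
    ("the", "cat"): {"<eos>": 0.9, "sat": 0.1},
}

def joint_prob(tokens):
    """chain rule: p(y) = prod_t p(y_t | y_<t)."""
    p = 1.0
    for t, w in enumerate(tokens):
        p *= COND[tuple(tokens[:t])][w]
    return p

def greedy_decode():
    """pick the argmax token at each step until <eos> is produced."""
    out = []
    while True:
        dist = COND[tuple(out)]
        w = max(dist, key=dist.get)
        if w == "<eos>":
            return out
        out.append(w)

assert greedy_decode() == ["the", "cat"]
assert abs(joint_prob(["the", "cat", "<eos>"]) - 0.6 * 0.7 * 0.9) < 1e-12
```

a beam decoder would keep the k highest-probability prefixes at each step instead of only the single argmax.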
in order to generate coherent sequences from multiple given constraints as input, we first generate the sequence from the verb constraint w v through the b/f lm, and then we satisfy the other given constraints by the word embedding substitution method during the inference process. the predicted verb splits the sequence into two sub-sequences, one extending backward and one extending forward from w v . if m denotes the number of words in a sequence s, i.e. s = w , · · ·, w v , · · ·, w m , then the joint conditional probability of the remaining m words, given the lexical constraint w v and training parameters θ, can be calculated as the product of the backward and forward terms, where p bw θ and p fw θ denote the probabilities of the sub-sequences generated by the backward and forward language models. the sub-sequences are generated asynchronously, i.e. we first generate the backward half sequence and then the forward half conditioned on s :v . therefore, following the spirit of ensemble models that are trained separately, the joint probability factors in eq. become where ≤ j ≤ v − . the backward lm decodes the output in reverse order from w v− , w v− to w , which is reversed again and input to the forward language model for decoding the complete sequence. consequently, as the output order of the sub-sequence generated by the backward lm is reversed again to decode the entire sequence with the forward language model, s :v is equal to w , · · ·, w v . for learning the sequences, we used lstm networks in the proposed architecture. lstm networks have the capability of capturing sequential data effectively: the network transforms a sequence of given input word vectors x = x , · · ·, x n into a sequence of hidden states h = h , · · ·, h t by maintaining a history of inputs at each hidden state. the lstm cell depends on a gating mechanism for information processing. the lstm network's hidden state h t at time step t depends on the previous state h t− and the current input word vector x t .
particularly, in our scenario of generating variable-length text sequences, the probability of an output word w out from both language models is calculated as: where w bw out and w fw out are shared across all time steps in their respective lstm models, and project the hidden state vector h t into a fixed-size vector equal to the target vocabulary size in order to generate a sequence of outputs y t = w v−t , · · ·, w for the backward language model and y t = w v+t , · · ·, w m for the forward language model. the softmax function in the final layer of the lstm network is applied to each output vector to calculate the probability distribution over the vocabulary of distinct word vectors. in order to satisfy the given lexical constraints c other than the verb constraint w v , we use a lexical substitution method based on word embedding substitution. the skip-gram (sg) model embeds both target words and their contexts in the same dimensional space. in this space, the vector representations of words are drawn closer together when they co-occur more frequently in a learning corpus. thus, the cosine distance between them can be viewed as a target-to-target distributional similarity measure. our method relies on the natural assumption that a good lexical constraint substitution for a target word w in a generated sequence s = w , · · ·, w v , · · ·, w m needs to be consistent with the given sequence and lexically similar to the target word w. during inference, we find the cosine similarity [ ] of each given input constraint c with every word w in the sequence s generated by the proposed b/f lm. after that, we replace the closest matching word w (least cosine distance) in the sequence s with the constraint c. step of fig. illustrates the concept. for this purpose, we created word embedding vectors with fasttext.
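the word embedding substitution step can be sketched as follows; the embedding vectors below are made up for illustration, whereas the real system would load fasttext vectors:

```python
import math

def cosine(u, v):
    """cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# hypothetical embedding vectors; a real system would load fasttext embeddings
EMB = {
    "cat": (1.0, 0.1, 0.0), "dog": (0.9, 0.2, 0.1),
    "the": (0.0, 1.0, 0.0), "ran": (0.1, 0.0, 1.0),
}

def substitute(sentence, constraint):
    """replace the word most similar to the constraint (least cosine distance)."""
    best = max((w for w in sentence if w in EMB),
               key=lambda w: cosine(EMB[w], EMB[constraint]))
    return [constraint if w == best else w for w in sentence]

assert substitute(["the", "cat", "ran"], "dog") == ["the", "dog", "ran"]
```

because the substitution happens at inference time, any number of leftover constraints can be injected this way without re-decoding the sequence.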
in this section, we introduce our experimental design, covering the preparation of datasets for training and testing, the experimental configuration, comparison architectures and evaluation criteria. there are many benchmark datasets for evaluating pure lms consisting of seq2seq networks for text classification and generative models, but there is no benchmark corpus specifically for the evaluation of constrained sequence generation based on statistical language models. we therefore used the stanford natural language inference (snli) [ ] dataset for evaluation and training of the proposed architecture. as we target the domain of generating sequences from lexical constraints, we extracted unlabeled sequences within a range of minimum and maximum tokens, resulting in k sequences for training the proposed architecture. the proposed architecture ensembles backward and forward lms; therefore, to prepare training sequences for the backward lm, the following steps were carried out:
- annotate the tokens with their lexical categories using pos tagging.
- split the sentences on the verb category instead of random splitting.
- break sentences with more than one verb into multiple sequences.
- after splitting a sequence on the verb category, invert the half sequences.
for the forward language model, the dataset contains complete sequences for training the network. here, it should be noted that the backward language model requires only the half sequences up to the verb token for training. we follow the work of bojanowski et al. [ ] to create dense representations of the words in the dataset. a word vector is represented by augmenting the character n-grams appearing in the word, where the scoring function s takes into consideration the internal structure information of words, which is ignored by conventional skip-gram models [ ] .
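the data-preparation steps above can be sketched as follows, with a toy pos lookup standing in for a real tagger:

```python
# sketch of the training-data preparation: split each sentence on its verb(s)
# and invert the left half for the backward lm.  the POS dictionary below is a
# hypothetical stand-in for an actual pos tagger.
POS = {"dogs": "NOUN", "bark": "VERB", "loudly": "ADV",
       "cats": "NOUN", "sleep": "VERB"}

def prepare(sentence):
    """one training pair per verb: (inverted half up to the verb, full sequence)."""
    tokens = sentence.split()
    samples = []
    for i, tok in enumerate(tokens):
        if POS.get(tok) == "VERB":
            backward = tokens[: i + 1][::-1]   # verb first, then leftward context
            samples.append((backward, tokens)) # forward lm sees the full sequence
    return samples

pairs = prepare("dogs bark loudly")
assert pairs == [(["bark", "dogs"], ["dogs", "bark", "loudly"])]
```

a sentence containing two verbs yields two training pairs, matching the rule that multi-verb sentences are broken into multiple sequences.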
the model represents each word w as a bag of character n-grams, adding special boundary symbols at the beginning and end of words to distinguish prefixes and suffixes from other character sequences. in addition to the character n-grams of word w, the word w itself is also included in its set of n-grams for learning the representation of each word. for example, taking the word 'apple' with n = , it will be represented by its character n-grams together with the special sequence for the whole word. let the dictionary of n-grams have size g. given a word w, let l w ⊂ , ...g be the set of n-grams appearing in w. a vector z g represents each n-gram g, so a word w is represented by the sum of the vectors of its n-grams. the scoring function of word w with a surrounding set of word indices c is then calculated accordingly. this extension of the skip-gram model for creating word embeddings allows the sharing of word vector representations across all words, thus enabling reliable representation learning of rare or out-of-vocabulary (oov) words. we used fasttext's sg model extension to learn such data representations for both the backward and forward language models given their respective datasets. to train the fasttext model, the word embedding dimension was set to . the min count value was set to , meaning that all words with frequency lower than this were ignored while learning the word representations. the window size was set to , defining the maximum distance between the current and predicted word within a sequence. the workers parameter was set to , specifying the worker threads for faster training of the fasttext sg model. the number of epochs was set to iterations over the whole dataset. we performed different experiments on the test set to find the most optimal hyperparameters and evaluate the change in performance of the model. table shows the different experimental configurations and the change in performance w.r.t. the perplexity metric.
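the character n-gram construction can be sketched as follows; the boundary markers '<' and '>' follow the fasttext convention:

```python
def char_ngrams(word, n=3):
    """bag of character n-grams with boundary symbols, plus the whole word
    (the '<' and '>' markers follow the fasttext convention)."""
    marked = "<" + word + ">"
    grams = {marked[i:i + n] for i in range(len(marked) - n + 1)}
    grams.add(marked)                 # the word itself is one of its n-grams
    return grams

grams = char_ngrams("apple")
assert "<ap" in grams and "le>" in grams and "<apple>" in grams
```

any unseen word can then be embedded as the sum of the vectors of its n-grams, which is what makes the representation robust to rare and oov words.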
in the proposed architecture, we get the best results by employing -layer lstms in both the backward and forward language models. both lstm networks were trained with the adam algorithm [ ] for stochastic optimization. during training, the parameters were adjusted using the adam optimizer to minimize the training loss function, also known as the misclassification rate. for the optimization score, we used the categorical cross-entropy loss between the actual y and predicted ŷ word probability distributions [ ] . to accurately capture the regularities in the data and prevent overfitting, we appended a drop-out layer after every lstm layer in both networks. the idea of the drop-out layer is to randomly drop units with their connections during training, thus preventing units from co-adapting too much. dropping units leads to greater improvements than other regularization methods [ ] . the number of epochs was set to and the mini-batch size was set to in both networks. both the backward and forward models were trained on an nvidia gtx ti gpu. the lstm-based networks were developed in keras. training took approximately h per model with this implementation and the optimal hyper-parameter configuration. we compared our proposed methodology with the state-of-the-art sampling method cgmh [ ] for satisfying multiple constraints in a sequence. we also evaluated our methodology of verb-based split generation against different variants [ ] , which can only handle a single lexical constraint. we used an intrinsic evaluation metric that allows determining the quality of a lm without it being associated with or embedded in a particular application. the most conventional intrinsic evaluation metric is perplexity (ppl).
the ppl of a language model given a test set w = w , w , ...w m is the inverse probability of w, normalized by the number of words: ppl(w) = p(w , w , ...w m ) −1/m . for the intrinsic evaluation of our proposed methodology, we first make comparisons with variants such as the separate b/f and asynchronous b/f language models proposed by [ ] . as mentioned earlier, in our proposed methodology the given word is the verb constraint w v through which we decode the complete sequence, whereas in the b/f variants the complete sequence is decoded from a random split word. we calculated ppl with both verb and random constraints as input to decode the complete sequences. table presents the comparison in terms of ppl, where a higher probability of a sequence results in a lower perplexity, which is better. the separate b/f variant yields worse sequences with a huge perplexity score because both b/f lms were forced to output separately given the input constraint and were concatenated after decoding; the forward lm does not have the context of the half sequence decoded by the backward lm. our proposed approach is more similar to the asynchronous b/f lm, but technically very different, as we satisfy multiple constraints while the asynchronous approach can deal with only a single constraint. the results clearly show that decoding a sequence on a specific verb constraint can make use of the positional information of words in a sequence, which is smoothed out when we generate a sequence with a random constraint. table shows the comparison of our proposed approach for catering to multiple constraints with cgmh [ ] . our proposed approach shows lower perplexity than the cgmh sampling method for sentence generation through keywords/constraints up to , while with constraints as input cgmh shows a slightly better result than our approach of generating a sequence with the verb constraint and, during inference, replacing the words in the sequence with the closest embedding similarity.
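the perplexity computation can be sketched as follows (computed in log space for numerical stability):

```python
import math

def perplexity(probs):
    """ppl(w) = p(w_1..w_m) ** (-1/m), from per-word probabilities."""
    m = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / m)

# per-word probabilities assigned by a (hypothetical) model to a test sequence
assert abs(perplexity([0.25, 0.25, 0.25, 0.25]) - 4.0) < 1e-9
assert perplexity([0.5, 0.5]) < perplexity([0.25, 0.25])  # higher prob, lower ppl
```

a uniform model over a vocabulary of size v has perplexity exactly v, which is a convenient sanity check.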
the decoding complexity of cgmh increases linearly with the number of constraints, while there is no such factor in our approach for catering to multiple constraints. there is always a trade-off between the fluency of a sequence and decoding complexity. in practice, the downside of cgmh sampling methods is that we are not sure which sampling step size is best for the proposal distribution. to validate our proposed architecture for generating sequences, we performed a series of experiments. the results of the intrinsic evaluation confirm that our proposed approach for sequence generation given constraint(s) outperforms previous methods. splitting and generating a sequence on a verb constraint makes use of positional information, which is smoothed out when breaking down a sequence at a random word. we observe that decoding a sequence given a random word as input in the proposed b/f lm even performs better when the backward lm is trained over half sequences up to the verb. moreover, in future work we would like to explore constraint-to-target context similarity, indicating syntagmatic compatibility, for improving the word embedding substitution method. introducing an attention mechanism with context vectors for constraints would be another interesting direction for the proposed architecture. in this paper, we have proposed a novel method, dubbed the neural probabilistic backward-forward language model with word embedding substitution, to address the issue of lexically constrained sequence generation. our proposed system can generate constrained sequences given multiple lexical constraints as input. to the best of our knowledge, this is the first time that multiple constraints have been handled through an lstm-based backward-forward lm and word embedding substitution of the sequences.
the proposed method contains a backward language model based on an lstm network, which learns the half representation of a sentence up to the verb split word, and a forward language model, an lstm network learning the complete representation of a sequence. moreover, the word embedding substitution method satisfies the other constraints by substituting target words in the sequence with the given constraints based on similar context in an embedding space.

references:
- guided open vocabulary image captioning with constrained beam search
- enriching word vectors with subword information
- a large annotated corpus for learning natural language inference
- a comparison of mlp, rnn and esn in determining harmonic contributions from nonlinear loads
- a tutorial on the cross-entropy method
- lexically constrained decoding for sequence generation using grid beam search
- adam: a method for stochastic optimization
- bfgan: backward and forward generative adversarial networks for lexically constrained sentence generation
- cgmh: constrained sentence generation by metropolis-hastings sampling
- efficient estimation of word representations in vector space
- recurrent neural network based language model
- distributed representations of words and phrases and their compositionality
- backward and forward language modeling for constrained sentence generation
- fast lexically constrained decoding with dynamic beam allocation for neural machine translation
- dropout: a simple way to prevent neural networks from overfitting
- incorporating discriminator in sentence generation: a gibbs sampling method
- sequence to sequence learning with neural networks

key: cord- -yv yvy
authors: demers, g. william; matunis, michael j.; hardison, ross c.
title: the l family of long interspersed repetitive dna in rabbits: sequence, copy number, conserved open reading frames, and similarity to keratin
date:
journal: j mol evol
doi: . /bf
sha:
doc_id:
cord_uid: yv yvy

the l family of long interspersed repetitive dna in the rabbit genome (l oc) has been studied by determining the sequence of the five l repeats in the rabbit β-like globin gene cluster and by hybridization analysis of other l repeats in the genome. l oc repeats have a common ′ end that terminates in a poly a addition signal and an a-rich tract, but individual repeats have different ′ ends, indicating a polar truncation from the ′ end during their synthesis or propagation. as a result of the polar truncations, the ′ end of l oc is present in about , copies per haploid genome, whereas the ′ end is present in at least , copies per haploid genome. one type of l oc repeat has internal direct repeats of bp in the ′ untranslated region, whereas other l oc repeats have only one copy of this sequence. the longest repeat sequenced, l oc , is . kb long, and genomic blot-hybridization data using probes from the ′ end of l oc indicate that a full-length l oc repeat is about . kb long, extending about kb ′ to the sequenced region. the l oc sequence has long open reading frames (orfs) that correspond to orf- and orf- described in the mouse l sequence. in contrast to the overlapping reading frames seen for mouse l , orf- and orf- are in the same reading frame in rabbit and human l s, resulting in a dicistronic structure. the region between the likely stop codon for orf- and the proposed start codon for orf- is not conserved in interspecies comparisons, which is further evidence that this short region does not encode part of a protein. orf- appears to be a hybrid of sequences, of which the ′ half is unique to and conserved in mammalian l repeats. the ′ half of orf- is not conserved between mammalian l repeats, but this segment of l oc is significantly related to type ii cytoskeletal keratin. the repeated dna sequences that are dispersed throughout eukaryotic genomes have been divided into two classes (reviewed by weiner et al. ).
both classes appear to transpose via an rna intermediate, and the insertion of either class of repeated dna generates short flanking direct repeats at the target site, hallmarks of transposition first recognized in prokaryotes. one class of repeated dna resembles retroviruses in that members of this class are flanked by long terminal repeats (baltimore ) . this class includes the yeast ty- repeat, the drosophila copia repeat, and the human the repeat (paulson et al. ) . another class of repeated sequences resembles processed pseudogenes and lacks long terminal repeats (ltrs). this second class of repeats has been termed retroposons (rogers ) , nonviral retroposons (weiner et al. ) , and non-ltr retrotransposons (xiong and eickbush ) . in this paper, this second class of rna-transposed repeats will be called retroposons. two groups of retroposons have been identified based on their length: the short interspersed repeats, or sines, that are less than bp long, and the long interspersed repeats, or lines, that are greater than bp long (singer ) . [fig. : interspersed repetitive dna in the rabbit β-like globin gene cluster. the β-like globin genes are shown as boxes along the -kb segment of cloned dna (lacy et al. ); transcription of the active genes is from left to right. the location and orientation of l repeats, named l ocl-l oc (demers et al. ), are shown by filled arrows; the location and orientation of c repeats, a rabbit sine, are shown by open arrows.] although no precise sequence specificity has been observed at the insertion sites, sines and lines do have a regional preference for integration in the human genome, as shown by the enrichment of different chromosome bands for either lines or sines (korenberg and rykowski ) . although several different sequences have been dispersed as sines in mammals (reviewed in weiner et al.
), only one sequence element, called l , has been found to be dispersed as a line in mammals (reviewed in singer and skowronski ) . the l sequence has been identified in a wide variety of species including primates (lerman et al. ) , mice (brown and dover ; fanning ) , rats (economou-pachnis et al. ; soares et al. ; d'ambrosio et al. ), dogs (katzir et al. ) , cats (fanning and singer ) , and rabbits (demers et al. ). genomic blot-hybridization analysis indicates that the l sequence is present in all mammalian species at a frequency of about - copies per haploid genome (burton et al. ) . although the parent genes of sines are transcribed by rna polymerase iii, the l repeats appear to be derived from an rna polymerase ii transcript. the parent gene of l is proposed to be a protein-coding gene (reviewed in singer and skowronski ) . long open reading frames (orfs) are found in the l sequences (manuelidis ; martin et al. ; potter ) , and sequenced members from the mouse genome have two overlapping orfs of bp (orf- ) and bp (orf- ) (shehee et al. ) . the orf- regions of primate and rabbit l are % similar, but the similarity ends abruptly at a conserved stop codon (demers et al. ). in previous studies on the l repeats from rabbits (l oc, for line from oryctolagus cuniculus), the b, e, and d repeats identified by shen and maniatis ( ) were shown to be parts of the l oc repeat. the sequence of one truncated l repeat and part of another repeat were presented as a composite sequence, and the orf (corresponding to orf- ) and ' untranslated region were identified (demers et al. ). in this paper, the rabbit l repeats are characterized more thoroughly, and the similarities and differences of l sequences between species are explored further. interspecies comparisons reinforce the conclusion that the l repeat has two orfs that are conserved for their protein-coding capacity.
however, the region between the two orfs is not conserved among species, and this observation is used to indicate possible start and stop codons for the orfs. orf- encodes a composite protein, and the ' half of orf- from l oc is related to type ii cytoskeletal keratin. subcloning and sequencing of l oc repeats. the sequenced members of the l oc family were from the rabbit β-like globin gene cluster isolated by lacy et al. ( ) . interspersed repetitive dna was identified by shen and maniatis ( ) by hybridization and heteroduplex mapping. the five l members (demers et al. ) were sequenced by dideoxynucleotide chain-termination reactions (sanger et al. ) using subclones in m phages as templates (messing ). analysis of dna sequences. sequence matches were first identified by dot plots generated by the computer program matrix (zweig ) . this provides a graphical display of sequence similarity that plots matches (forward similarity) of out of bases. similar sequences were then aligned by the computer program nucaln (wilbur and lipman ) using the parameters k-tuple = , window size = , gap penalty = . the protein sequence databases at the protein identification resource (national biomedical research foundation) were searched using the fastp program (lipman and pearson ) . the statistical significance of the similarities found by fastp was tested using the program rdf (national biomedical research foundation); this program scrambles the target sequence (revealed by fastp) into shuffled sequences and computes the mean similarity score of the shuffled sequences with the test sequence (in this case, orf- of l oc). the similarity score for the match between the true sequences is compared with the mean score for the shuffled sequences in terms of the number of standard deviations that separate them. hybridization conditions were as in the southern blot analysis.
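the window-match dot-plot idea used by the matrix program can be sketched as follows; the window size and match threshold here are illustrative, since the original values are not preserved in this text:

```python
# sketch of a dot-plot: mark (i, j) when at least k of the next n bases of the
# two sequences match.  n and k are illustrative placeholders.
def dotplot(a, b, n=4, k=3):
    hits = []
    for i in range(len(a) - n + 1):
        for j in range(len(b) - n + 1):
            matches = sum(a[i + t] == b[j + t] for t in range(n))
            if matches >= k:
                hits.append((i, j))
    return hits

hits = dotplot("gattaca", "gattaca")
assert (0, 0) in hits            # the main diagonal shows self-similarity
```

in a plot of one sequence against another, runs of hits along a diagonal correspond to the long segments of similarity described for the l comparisons.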
the ratio of the percentage of plaques that hybridized to the percentage of the rabbit genome in one λ clone gives the approximate copy number of the region. the average size of an insert in this λ library is kb (maniatis et al. ) . thus, the fraction of the rabbit genome per phage is the insert length divided by the genome length, or about . • − %. the fact that % of the phage in the library have rabbit dna (maniatis et al. ) was also taken into account. rodent and human l sequences. the mouse l sequence, l md-a , and the rat l sequence, l rn or line (d'ambrosio et al. ), are randomly isolated l members from their respective genomes. the human l sequence, l hs-tbg , is located . kb ' to the human β-globin gene (hattori et al. ) . a consensus l hs sequence (scott et al. ) was used in the analysis of orf- in fig. . the interspersion of repetitive sequences among the rabbit β-like globin genes is shown in fig. . the genes and (formerly and ) are expressed in embryonic development (rohrbaugh and hardison ) , ( ) is an inactive pseudogene (lacy and maniatis ) , and ( ) is expressed in fetal and adult life (hardison et al. ; rohrbaugh et al. ) . the ' to ' orientations of the proposed rna intermediates of the repetitive elements are indicated by the arrows in fig. ; the a-rich tracts are at the ' ends. the sequences of the five l oc repeats are presented in fig. . l oc is adjacent to l oc (fig. ) , so the last nucleotide in the l oc sequence is followed by the first nucleotide in the l oc sequence (fig. ) in the sequence of the gene cluster (margot et al. ) . the longest member of the rabbit l family in the β-like globin gene cluster is l oc . the next longest member is l oc ; it has an internal deletion of bp (fig. ) . this is clearly a deletion from l oc and not an insertion into l oc because a similar sequence is present in both mouse and human l s (demers et al. ). l oc will be the prototypical rabbit l for further analysis because it is the longest and has no extensive internal deletions.
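the plaque-ratio arithmetic described above can be sketched numerically; the insert size, genome size, recombinant fraction and plaque percentage below are hypothetical placeholders, not the values from the original screen:

```python
# sketch of the copy-number estimate from a plaque hybridization screen;
# all numeric inputs are illustrative, not the original data
def genome_fraction_per_phage(insert_kb, genome_kb, frac_recombinant):
    """percent of the haploid genome carried by one recombinant phage."""
    return 100.0 * insert_kb / genome_kb * frac_recombinant

def copy_number(pct_plaques_hybridizing, pct_genome_per_phage):
    """ratio of hybridizing plaques to the genome fraction per clone."""
    return pct_plaques_hybridizing / pct_genome_per_phage

per_phage = genome_fraction_per_phage(15.0, 3.0e6, 0.9)  # 15-kb insert, 3x10^9-bp genome
approx_copies = copy_number(2.0, per_phage)              # e.g. 2% of plaques hybridize
assert approx_copies > 1000
```

with these placeholder numbers, a probe lighting up a few percent of plaques already implies thousands of copies per haploid genome, which is the scale reported for the ' end of l oc.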
the ' end of l oc is also the end of the cloned region of the rabbit β-like globin gene cluster (see fig. ; only two of the shen and maniatis ( ) repeats are shown at the bottom of the diagram). individual repeats l oc and l oc contain sequences from the orf region (demers et al. ). the other three repeats contain part or all of the ' untranslated region. l oc and l ocl have internal direct repeats of bp in the ' untranslated region. one copy of the repeat is at positions - and the other is at positions - (lower-case letters in fig. ). l oc and l oc have only one copy of this -bp sequence, and they do not contain the sequence between the -bp direct repeats (present in l oc and l ocl). thus, the class of l oc repeats containing one copy of the -bp sequence could be derived from the class containing two copies by a deletion between the two -bp sequences. another example of a sequence rearrangement is the apparent insertion of bp into l e between positions - of l oc . most members of the l oc family are flanked by short direct repeats. l oc and l oc are flanked by direct repeats of bp and bp, respectively (fig. ) . the flanking direct repeats differ for the two individual l repeats, showing that they are not part of the l sequence. such flanking direct repeats are often generated by insertion of transposable elements, presumably by repair of a staggered break at the target site. the flanking direct repeats for l oc and l oc cannot be identified with the available data. the ' end of l oc has not been cloned. because l oc is juxtaposed to l oc , it is possible that l oc may have inserted into l oc , in which case the ' end of l oc is also not available. the only other l member, l oc , does not have obvious flanking direct repeats generated by a duplication of the target site. the sequence gttaaaaaaa found just ' to the polyadenylation site (positions - ) is also found upstream from l oc (margot et al. ).
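the search for a target-site duplication, i.e. a short direct repeat immediately flanking an inserted element, can be sketched as follows; the sequence and coordinates are made up for illustration:

```python
# sketch: detect a short direct repeat flanking an inserted element, as a
# target-site duplication would produce (sequence and positions are made up)
def flanking_direct_repeat(seq, start, end, max_len=10):
    """longest identical stretch immediately 5' of the element [start:end] that
    is repeated immediately 3' of it; returns '' if there is none."""
    for k in range(max_len, 0, -1):
        if k > start:
            continue                      # avoid running off the 5' end
        left, right = seq[start - k:start], seq[end:end + k]
        if left and left == right:
            return left
    return ""

#            target-site duplication 'acgt' around the inserted element 'NNNN'
seq = "ttttacgtNNNNacgtgggg"
assert flanking_direct_repeat(seq, 8, 12) == "acgt"
```

the key observation in the text, that the flanking repeats differ between individual l copies, follows naturally: the duplication is copied from each insertion site, not from the element itself.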
however, because the sequence gtt(a) (or a slight variation of it) is also found in all of the other l sequences just ' to the polyadenylation signal, it is likely not to have been generated by a target-site duplication around l oc . this terminal repetition could instead be generated by insertion of a circular form of l by homologous recombination into a gtt(a) sequence at the target site. the structural features revealed by the alignment and comparison of the l members from the rabbit β-like globin gene cluster are summarized in fig. . the b, e, and d repeats identified by shen and maniatis ( ) are also aligned with their positions in the l oc sequence. the d repeat is confined to the ' untranslated region, whereas the b repeat and most of the e repeat are from the orf region. l ocl begins immediately after the conserved translation stop codon. figure also illustrates the internal sequence rearrangements described above. the diagram of l oc repeats in fig. shows that they are truncated at a variable distance from the ' end of the longest elements. this truncation from the ' end is common in the whole population of l repeats, as demonstrated by using four regions of l oc as probes against the rabbit genomic dna library in a plaque hybridization assay. by counting the number of plaques that hybridized to a given probe, the approximate copy number of each region of the l oc repeat was determined (see materials and methods). as shown in fig. , the '-most region of l oc is represented about , times in the haploid genome of the rabbit, and regions of l located more ' are found more frequently. the largest increase in copy number is seen in the region from positions to that includes the ' untranslated region; this region is represented at least , times. however, the relationship between the length of the repeat and the copy number is not linear; only a gradual decrease in copy number is observed as probes going from position to position are used (fig. ) .
therefore, many of the l repeats detected with the probe from the ' end may be full length, indicating that up to % of the population of l oc repeats could be full length. this difference in copy number at the ' and ' ends of l oc repeats is also observed when uncloned genomic dna is hybridized with the different l oc probes (data not shown). thus, the lower copy number at the ' end is not a result of underrepresentation in the cloned genomic library. because the ' end of l oc is at the end of the cloned portion of the rabbit β-like globin gene cluster, it is likely that the nucleotide sequence obtained from l oc is not that of a full-length l repeat. therefore, cloned subfragments of l oc were used as probes against southern ( ) blots of rabbit genomic dna to determine the average structure of full-length rabbit l repeats. discrete genomic restriction fragments detected with l oc probes were mapped by two strategies. the portion of l oc contained within the genomic restriction fragment was determined by which probes from l oc hybridized to the fragment, and then the genomic restriction fragment was aligned with conserved restriction sites found in the cloned l oc dna. this analysis is presented in detail in demers ( ) , and the portion relevant to the ' end of l oc is summarized in fig. . the longest restriction fragment extending ' to the cloned end of l oc is the psti . -kb fragment that ends kb ' to the cloned region of l oc (fig. ). the scai . -kb, sphi . -kb, and xmni . -kb genomic fragments all have ' ends between the conserved psti site located outside l oc and the ' end of l oc (fig. ) . these data indicate that full-length l oc repeats will extend at least kb further ' than the sequenced portion of l oc . several clones from the rabbit genomic dna library are currently being studied in order to determine the ' end of l oc repeats.
the sequence of the rabbit l repeat was compared with the sequences of the mouse and human l repeats by dot-plots and by sequence alignments. the dot-plot analyses in fig. show that the internal sequence of l oc is very similar to both l md (mouse) and l hs (human) over very long segments, whereas the ' and ' ends are not conserved between species. the internal region of sequence similarity of about . kb is divided into two parts, a short region of similarity of about bp followed by a very long segment of similarity. the long segments of internal similarity are in the portion of l that encodes open reading frames (orfs). the orfs found in the l oc sequence are shown in fig. , along with a comparison of the orfs from l md. the mouse l md sequence contains two orfs, one of nucleotides (top strand, n frame in fig. , bottom panel) and one of nucleotides (top strand, n + frame in fig. ), that overlap by nucleotides. seven open reading blocks are in the rabbit l oc sequence in frames n, n + , and n + ( fig. , top panel). the bar between the stop codon maps of each species shows the regions of similarity ( fig. ) as filled boxes. it is apparent that the regions of l that are similar between species contain extensive orfs, although the orfs at the ' end are not similar between species. rabbit l repeats have only two major orfs. although the data in fig. show that l oc has several orfs, they are probably derived from longer reading frames in the ancestral l sequence. fig. . sequence similarities in the orf- region. the l oc orf- region is shown as a black box, numbered according to the codon positions in fig. . the orf- regions from l md and l hs are displayed as composite boxes. the darkness of the fill in each box is proportional to the extent of similarity to the l oc sequence. the percent identity of the encoded amino acids, compared to the l oc sequence, is given in the boxes.
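A dot-plot of the kind used for these interspecies comparisons can be reproduced with a few lines of code: a dot is placed wherever a window of one sequence matches a window of the other above a threshold, so conserved segments appear as diagonals and nonconserved ends stay blank. A minimal sketch (window size and match threshold are arbitrary choices here, not the parameters used in the paper):

```python
def dotplot(seq_a, seq_b, window=8, min_match=7):
    """Return (i, j) coordinates where seq_a[i:i+window] and
    seq_b[j:j+window] agree at >= min_match positions."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(a == b for a, b in zip(seq_a[i:i + window],
                                                 seq_b[j:j + window]))
            if matches >= min_match:
                dots.append((i, j))
    return dots
```

Plotting the dots with i on one axis and j on the other gives the familiar diagonal for similar regions; breaks in the diagonal correspond to insertions, deletions, or diverged segments such as the untranslated ends discussed above.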
a box representing a portion of the type ii cytoskeletal keratin sequence is aligned with the segment of the l oc sequence that matches it. the percent of amino acids identical to the l oc orf- translated sequence is given in the boxes, and the amino acid positions in the keratin sequence are listed below the boxes. a gap penalty was assessed in calculating the percent identities. the long segment of similarity corresponds to orf- and the short region of similarity corresponds to the ' portion of orf- . the two orfs are overlapping in l md, and it is of interest to determine whether this feature is conserved in l repeats from other species. also, orf- appears to be a hybrid sequence because it is well conserved between species in the ' half but not in the ' half. therefore, the sequence of orf- and the region between the orfs were aligned for the l repeats from rabbit, mouse, rat, and humans. figure shows both the aligned nucleotide sequences and the predicted amino acid sequences. sequences that match well between species are in reverse text, whereas sequences that do not match well are in plain text. inspection of the aligned l sequences allows a tentative identification of the start and stop sites of the orfs. this analysis reveals that no overlap between reading frames is seen in rabbit and human l repeats. the end of orf- in l md is the taa at positions - (boldface in fig. ) . the same sequence is found in the rat l sequence (l rn), and in-phase terminators are found nearby in l oc and l hs (boldface taas in fig. ). orf- in l md begins in a different reading frame at position , and thus it overlaps with orf- for nucleotides. by aligning the sequences of the different l s in the well-conserved orf- region, it is apparent that an atg is conserved in the rabbit and human sequences at positions - . an in-frame atg two codons upstream was previously identified as the start of orfb in the l rn sequence (d'ambrosio et al.
) and an atg is also in frame in the l md sequence seven codons upstream. one can propose that the taa close to position is the end of orf- and the atg at positions - is the start of orf- in rabbit and human l repeats. in an independent analysis of several individual l hs repeats, these same codons were assigned as the end of orf- and the start of orf- in the consensus l hs sequence (scott et al. ) . as shown in fig. , orf- is in the same reading frame as orf- in the l oc and l hs sequences. thus, the overlap in reading frames seen for l md is not observed in l oc and l hs. orf- in l rn is in a different reading frame than orf- , but the l rn sequence does have an atg proposed as the start of orf- . thus, l rn has overlapping reading frames, but the sequence in the overlap may not be used to encode a protein. the region between orf- and orf- is not conserved between mammalian species. the sequence between the taa that ends orf- and the atg proposed to be the start of orf- is in a region that is quite dissimilar between rabbit and mouse and between rabbit and human (plain text region between positions and in fig. ). this is the region of no similarity previously seen in dot-plots (fig. ) . the sequence between the l orfs is also not conserved in comparisons between the human and rodent sequences (scott et al. ). because this region is not conserved, whereas the sequences before and after it are conserved, probably for their capacity to encode a protein, it is unlikely that the inter-orf region encodes a protein. this lack of conservation supports the proposed assignments for the start of orf- in l oc and l hs. the mouse l sequence is ata at positions - ; this same sequence is found in three sequenced members of the l md family (shehee et al. ) . therefore, the overlap between reading frames is conserved in mouse l s, but it is not seen in the rabbit and human l sequences. the orf- sequence is a composite of conserved and nonconserved regions.
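The start/stop assignments above rest on scanning each reading frame for in-frame ATG and terminator codons and then asking whether ORFs in different frames overlap. This is straightforward to automate; the sketch below is a generic ORF scanner for illustration, not the software used in the paper:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, frame):
    """Return (start, end) pairs for ATG..stop open reading frames
    in the given frame (0, 1, or 2) of a DNA string."""
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            orfs.append((start, i + 3))  # end includes the stop codon
            start = None
    return orfs

def frames_overlap(orf_a, orf_b):
    """True if two ORFs (possibly from different frames) overlap,
    as the two mouse L ORFs are reported to do."""
    return orf_a[0] < orf_b[1] and orf_b[0] < orf_a[1]
```

Running the scanner over all three frames of each species' sequence and comparing the resulting intervals is the mechanical core of the rabbit/mouse/human comparison described here.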
as shown diagrammatically in fig. , codons - are highly related between species in different mammalian orders, and a long segment from codons through shows a - % amino acid identity in these comparisons. a short region from codons to is not conserved, nor are the last codons in the sequence, but in general the c-terminal two-thirds of orf- is conserved between orders. a search through the databanks at the protein identification resource (national biomedical research foundation) did not identify any known proteins (besides the l proteins) that are related to the c-terminal half of the orf- sequence. the alignment (lipman and pearson ) is shown starting at amino acid position of orf- from l oc (fig. ) and at position of the sequence of type ii cytoskeletal keratin of humans (johnson et al. ) . the orf- sequence of rabbit l is labeled l , and the type ii keratin sequence is labeled kii. identical amino acids are indicated by colons, and similar amino acids are indicated by periods. the following groups of amino acids are considered similar: p, a, g, s, and t (neutral or weakly hydrophobic); q, n, e, and d (acids and amides); h, k, and r (basic); l, i, v, and m (hydrophobic); f, y, and w (aromatic); and c. in contrast, the n-terminal portion of orf- is not highly conserved between mammalian orders. this region shows almost no similarity between rabbit and human (sequence between nucleotide positions and in fig. ; fig. ), and the comparison between rabbit and mouse shows only a short segment of matching sequence at the ' end (figs. and ) . the dissimilarity of the sequences makes it difficult to assign a start point to orf- . however, an atg is found in the rabbit, mouse, and rat sequences at positions - of fig. (shown in boldface). an atg is found three codons downstream in the human l sequence. other atg codons are either immediately adjacent (mouse and rat) or are codons upstream (rabbit, underlined in fig. ) .
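The colon/period convention and the amino acid similarity groups listed above can be encoded directly. The sketch below marks identities and within-group similarities for an ungapped alignment, using exactly the groups given in the text; it is an illustration, not the figure-generation code:

```python
SIMILARITY_GROUPS = ["PAGST", "QNED", "HKR", "LIVM", "FYW", "C"]
GROUP_OF = {aa: grp for grp in SIMILARITY_GROUPS for aa in grp}

def alignment_marks(a, b):
    """':' for identical residues, '.' for residues in the same
    similarity group, ' ' otherwise (one mark per aligned pair)."""
    marks = []
    for x, y in zip(a, b):
        if x == y:
            marks.append(":")
        elif x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y):
            marks.append(".")
        else:
            marks.append(" ")
    return "".join(marks)

def percent_identity(a, b):
    """Percent of aligned positions with identical residues."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)
```

Printing one sequence, the marks line, and the other sequence reproduces the three-row layout used in the keratin comparison figure.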
the atg at positions - has been tentatively assigned as the start of orf- , and the codons in fig. are numbered starting here. this is codons into orf- as defined by loeb et al. ( ) . although the n-terminal half of orf- differs among rabbits, mouse, and humans, it is similar between the two rodents, mouse and rat. this region surrounds a -bp tandemly repeated sequence in l rn (soares et al. ; d'ambrosio et al. ) and contains several in-frame stop codons in l rn (fig. ) . it is possible that the coding function of this region has been lost in l rn. the n-terminal half of orf- from the rabbit l sequence is related to type ii cytoskeletal keratin. protein sequence databanks were searched using the fastp program (lipman and pearson ) , and a significant match was found with type ii cytoskeletal keratin. the region of l oc orf- that matches with keratin, along with the percent amino acid identity, is shown in fig. , and the alignment with the human kda type ii keratin ) is shown in fig. . the sequences align over a -amino acid region, with an average of . % identity. the segment between amino acid positions and oflioc orf- is most similar to type ii keratin; this segment contains identical amino acids at % of the positions. the similarity between the n-terminal half of orf- from l c and type ii cytoskeletal keratin is statistically significant. the sequence of the type ii keratin was scrambled into different sequences and aligned with the orf- sequence to generate an average match score. the match score with the true keratin sequence is standard deviations above the average match score with the scrambled sequences; a difference of standard deviations in this test is an indicator of a significant evolutionary relationship (lipman and pearson ) . 
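The scrambled-sequence test used here to judge the keratin match can be sketched as follows: score the real pair, score many shuffled versions of one sequence, and express the real score in standard deviations above the shuffled mean. The scoring function below is a toy (ungapped identity count) standing in for the FASTP score, so only the shape of the test, not the numbers, mirrors the paper:

```python
import random

def match_score(a, b):
    """Toy stand-in for an alignment score: ungapped identities."""
    return sum(x == y for x, y in zip(a, b))

def scramble_z_score(query, target, n_shuffles=200, seed=0):
    """Standard deviations by which the real match score exceeds the
    mean score against shuffled targets (Lipman-Pearson-style test)."""
    rng = random.Random(seed)
    real = match_score(query, target)
    letters = list(target)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        scores.append(match_score(query, "".join(letters)))
    mean = sum(scores) / n_shuffles
    sd = (sum((s - mean) ** 2 for s in scores) / n_shuffles) ** 0.5
    return (real - mean) / sd if sd > 0 else float("inf")
```

Shuffling preserves composition while destroying order, so a score many standard deviations above the shuffled mean argues that the similarity reflects shared ancestry rather than shared amino acid composition, which is the logic applied to the keratin match above.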
although statistical significance does not establish biological significance, it is helpful to compare this match with that of a part of orf- with reverse transcriptases, whose similarity has been cited as significant in the past (hattori et al. ; loeb et al. ) . the alignment between the l md orf- sequence and the sequence of reverse transcriptase from moloney murine leukemia virus shows . % amino acid identity, whereas the alignment between l oc orf- and type ii keratin shows . % identity. it is apparent that orf- of the rabbit l contains a region related in sequence to type ii cytoskeletal keratin. the propagation of l repeats probably has occurred independently in different mammalian genomes. although the l repeats from lagomorphs, rodents, and primates are similar in size and sequence organization, the ' and ' ends are distinctive (summarized in fig. ) . also, the l repeats are located in different positions in orthologous regions of chromosomes, specifically the β-like globin gene cluster of rabbits and humans (margot et al. ) and mice (shehee et al. ) . because the contemporary β-like globin gene clusters are descended from a preexisting gene cluster in the last common ancestor, the presence of l repeats at different positions in different species indicates that the l repeats have integrated independently into these gene clusters (and probably the whole genome) in each species. it is noteworthy, therefore, that the structure of the population of l repeats is quite similar in several mammals. most members of the l repeat family in rabbits (this paper), mouse (voliva et al. ) , and monkeys (grimaldi et al. ) are truncated from the ' end, resulting in a higher frequency in the genome of the ' end of l (about , copies) than the ' end (about , copies). this similarity in copy number suggests that the time of onset and the rate of propagation of l repeats is similar in the different species.
the rabbit, mouse, and monkey l repeats also show a similar pattern for the increase in copy number, in which the ' regions increase gradually in copy number before a large increase at the very ' end. this very large increase in copy number in the ' region could indicate a strong stop for reverse transcriptase during the conversion of the l transcript to a dna copy. given this frequency of polar truncations of l in rabbits, humans, and mice, it is striking that most of the l repeats in rats are full length (d'ambrosio et al. ). some aspect of the mechanism for synthesis and propagation of the l s is apparently different in rats, e.g., to allow more full-length reverse transcripts or to select for these in the integration process. full-length l transcripts have been observed in teratocarcinoma cells (skowronski and singer ) . given the assignments of start and stop codons proposed in this paper, transcripts of the l repeat of rabbits and humans have the characteristics of a dicistronic rna. polycistronic mrnas are common in bacteria, and a polycistronic arrangement of genes is found in the genomes of some rna viruses that infect animals and plants, e.g., togaviruses, coronaviruses, and tobacco mosaic virus. in contrast, most mrnas from eukaryotic cellular genes are monocistronic. regardless of whether the orfs are overlapping, as in l md, or are part of a dicistronic rna, as in l oc and l hs, the structure of the l repeats resembles dna copies of viral genomes more than conventional cellular transcription units. this suggests that the ancestor to l repeats may in fact be some type of animal virus rather than a normal cellular gene, as is often proposed (reviewed in weiner et al. ) . a viral ancestor with a wide host range would provide an explanation for the independent, and perhaps simultaneous, entry of the l element into different mammalian genomes. the orfs in the l repeat appear to encode hybrids of different types of proteins (fig. ) .
orf- can be divided into two parts, the n-terminal portion that is not well conserved between species and the c-terminal portion that is well conserved. in the rabbit l repeat, a sequence similar to keratin has been fused to the conserved c-terminal portion of orf- . although orf- is conserved in l s from different orders of mammals, it also seems to be a hybrid of sequences related to several proteins (fig. ) . the middle portion of orf- is related to reverse transcriptase (hattori et al. ; loeb et al. ). different parts of the c-terminal region are related to transferrin (hattori et al. ) and to nucleic acid binding proteins with the cysteine structural motif, such as the binding proteins derived from retroviral gag genes (fanning and singer ) . the cysteine structural motif is related to the zinc fingers characterized in tfiiia and other nucleic acid binding proteins (fanning and singer ) . this pastiche of similarities suggests that the l element is a fusion of several different sequences, some of which are derived from cellular genes, possibly by a viral vector. another fusion event may account for the variation in sizes and sequences of the ' untranslated regions of l repeats in different mammals. the ' untranslated regions of orthologous globin genes in mammals have retained obvious sequence similarities over the course of eutherian evolution (e.g., hardies et al. ; hardison ) , so it is puzzling that no sequence similarity is seen in the ' untranslated region of l repeats in comparisons between mammals (fig. ) . perhaps the conserved coding region was fused to a different ' untranslated sequence in each species. it is noteworthy that the ' end of l ocl begins immediately after the conserved termination codon that ends orf- , suggesting that the sequence corresponding to the ' untranslated region of l oc may exist as a distinct repetitive element in the rabbit genome in addition to its presence in the l sequence.
if so, this would be an additional factor in explaining the large increase in copy number of l repeats in this region. a similar situation has been observed in drosophila melanogaster, in which suffix, an element repeated about times in the genome, is almost identical to the sequence of the ' untranslated region (but not the coding region) of the f element that is present about times in the genome (dinocera and casari ). the mammalian l repeats show a clear similarity to the ingi repeat in the protozoan trypanosoma brucei (kimmel et al. ) , the i factor of the i-r system of hybrid dysgenesis in d. melanogaster (fawcett et al. ), f elements in d. melanogaster (dinocera and casari ) , and the r bm (xiong and eickbush ) and r bm (burke et al. ) insertion sequences in some rrna genes of bombyx mori (fig. ) . the similarity has been recognized only in the region proposed to encode reverse transcriptase, and these sequences are more similar among themselves than to retroviral reverse transcriptases (dinocera and casari ; xiong and eickbush ) . the mammalian l s and these protozoan and insect repeats share other structural features, such as the absence of long terminal repeats, the presence of at least two orfs (orf- containing sequences similar to reverse transcriptase and either orf- or orf- encoding a cysteine motif), a length from to . kb, and a ' untranslated region with a sequence similar to aataaa close to the ' end. the dicistronic structure proposed for l oc and l hs may also be present in the i factor, the f element, and the r bm repeat (fawcett et al. ; dinocera and casari ; xiong and eickbush ) . each type of repeated element also has some distinctive features, e.g., the specific insertion sites for r bm and r bm in the rrna genes and the absence of a-rich tracts at the ' ends of some of the insect repeats. however, at least parts of these repeats in mammals, insects, and a parasitic protozoan appear to be evolutionarily related.
if this type of repeat is restricted to these groups of organisms, it may indicate that the genetic information was transferred among parasites, their mammalian hosts, and insect vectors (kimmel et al. ) . a viral progenitor, suggested by the dicistronic arrangement shown in this paper, would provide a means for the horizontal transmission of the l sequences.

references

retroviruses and retrotransposons: the role of reverse transcription in shaping the eukaryotic genome
screening λgt recombinant clones by hybridization to single plaques in situ
organization and evolutionary progress of a dispersed repetitive family of sequences in widely separated rodent genomes
the site-specific ribosomal insertion element type ii of bombyx mori (r bm) contains the coding sequence for a reverse transcriptase-like enzyme
conservation throughout mammalia and extensive protein encoding capacity of the highly repeated dna l
genomic sequencing
structure of the highly repeated, long interspersed dna family (line or l rn) of the rat
long interspersed l repeats in rabbit dna are homologous to l repeats of rodents and primates in an open-reading-frame region
related polypeptides are encoded by drosophila f elements, i factors, and mammalian l repeats
insertion of long interspersed repeated elements at the igh (immunoglobulin heavy chain) and mlvi- (moloney leukemia virus integration ) loci of rats
characterization of a highly repetitive family of dna sequences in the mouse
the line- dna sequences in four mammalian orders predict proteins that conserve homologies to retrovirus proteins
transposable elements controlling i-r hybrid dysgenesis in d. melanogaster are similar to mammalian lines
defining the beginning and end of kpni family segments
evolution of the mammalian β-globin gene cluster
comparison of the β-like globin gene families of rabbits and humans indicates that the gene cluster '-ε-γ-δ-β- ' predates the mammalian radiation
efstratiadis a ( ) the structure and transcription of four linked rabbit β-like globin genes
sequence analysis of a kpni family member near the ' end of human β-globin gene
l family of repetitive dna sequences in primates may be derived from a sequence encoding a reverse transcriptase-related protein
structure of a gene for the human epidermal -kda keratin
"retroposon" insertion into the cellular oncogene c-myc in canine transmissible venereal tumor
ingi, a . -kb dispersed sequence element from trypanosoma brucei that carries half of a smaller mobile element at either end and has homology with mammalian lines
human genome organization: alu, lines, and the molecular structure of metaphase chromosome bands
the nucleotide sequence of a rabbit β-globin pseudogene
linkage arrangement of four rabbit β-like globin genes
kpni family of long interspersed repeated dna sequences in primates: polymorphism of family members and evidence for transcription
rapid and sensitive protein similarity searches
the sequence of a large l md element reveals a tandemly repeated ' end and several features found in retrotransposons
the isolation of structural genes from libraries of eucaryotic dna
nucleotide sequence definition of a major human repeated dna, the hindiii . -kb family
complete nucleotide sequence of the rabbit β-like globin gene cluster: analysis of intergenic sequences and comparison with human β-like globin gene cluster
a large interspersed repeat found in mouse dna contains a long open reading frame that evolves as if it encodes a protein
new m vectors for cloning
a transposon-like element in human dna
rearranged sequence of a human kpni element
labeling deoxyribonucleic acid to high specific activity in vitro by nick translation with dna polymerase i
retroposons defined
analysis of rabbit β-like globin gene transcripts during development
transcription unit of rabbit β -globin gene
dna sequencing with chain-terminating inhibitors
origin of the human l elements: proposed progenitor genes deduced from a consensus dna sequence
determination of a functional ancestral sequence and definition of the ' end of a-type mouse l elements
the nucleotide sequence of the balb/c mouse β-globin complex
the organization of repetitive sequences in a cluster of rabbit β-like globin genes
sines and lines: highly repeated short and long interspersed sequences in mammalian genomes
making sense out of lines: long interspersed repeat sequences in mammalian genomes
expression of a cytoplasmic line- transcript is regulated in a human teratocarcinoma cell line
rat line : the origin and evolution of a family of long interspersed middle repetitive dna elements
detection of specific sequences among dna fragments separated by gel electrophoresis
the l md long interspersed repeat family in the mouse: almost all examples are truncated at one end
retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information
rapid similarity searches of nucleic acid and protein data banks
the site-specific ribosomal dna insertion element r bm belongs to a class of non-long-terminal-repeat retrotransposons
analysis of large nucleic acid dot matrices on small computers

acknowledgments.
key: cord- - dsx pey
authors: maitra, arindam; sarkar, mamta chawla; raheja, harsha; biswas, nidhan k; chakraborti, sohini; singh, animesh kumar; ghosh, shekhar; sarkar, sumanta; patra, subrata; mondal, rajiv kumar; ghosh, trinath; chatterjee, ananya; banu, hasina; majumdar, agniva; chinnaswamy, sreedhar; srinivasan, narayanaswamy; dutta, shanta; das, saumitra
title: mutations in sars-cov- viral rna identified in eastern india: possible implications for the ongoing outbreak in india and impact on viral structure and host susceptibility
date: - -
journal: j biosci
doi: . /s - - -
sha: doc_id: cord_uid: dsx pey

direct massively parallel sequencing of the sars-cov- genome was undertaken from nasopharyngeal and oropharyngeal swab samples of infected individuals in eastern india. seven of the isolates belonged to the a a clade, while one belonged to the b clade. specific mutations, characteristic of the a a clade, were also detected, which included the p l in rna-dependent rna polymerase and d g in the spike glycoprotein. further, our data revealed the emergence of novel subclones harbouring nonsynonymous mutations, viz. g v in the spike (s) protein and r k and g r in the nucleocapsid (n) protein. the n protein mutations reside in the sr-rich region involved in viral capsid formation, and the s protein mutation is in the s( ) domain, which is involved in triggering viral fusion with the host cell membrane. an interesting correlation was observed between these mutations and the travel or contact history of covid- positive cases. consequent alterations of mirna binding and structure were also predicted for these mutations. more importantly, the possible implications of the mutations d g (in the s(d) domain) and g v (in the s( ) subunit) for the structural stability of the s protein have also been discussed. these results report for the first time a bird's-eye view of the accumulation of mutations in the sars-cov- genome in eastern india. electronic supplementary material: the online version of this article ( .
/s - - - ) contains supplementary material, which is available to authorized users. sars-cov- is the causative agent of current pandemic of novel coronavirus disease which has infected millions of people and is responsible for more than , deaths worldwide in a span of just months. the virus has a positive sense, singlestranded rna genome, which is around kb in length. the genome codes for four structural and multiple non-structural proteins (astuti and ysrafil ) . while the structural proteins form capsid and envelope of the virus, non-structural proteins are involved in various steps of viral life cycle such as replication, translation, packaging and release (lai and cavanagh ) . although at a slower rate, mutations are emerging in the sars-cov- genome which might modulate viral transmission, replication efficiency and virulence in different regions of the world (jia et al. ; pachetti et al. ) . the genome sequence data has revealed that sars-cov- is a member of the genus betacoronavirus and belongs to the subgenus sarbecovirus that includes sars-cov while mers-cov belongs to a separate subgenus, merbecovirus (lu et al. ; wu et al. ; zhu et al. ) . sars-cov- is approximately % similar to sars cov at the nucleotide sequence level. epidemiological data suggests that sars-cov- had spread widely from the city of wuhan in china (chinazzi et al. ) after its zoonotic transmission originating from bats via the malayan pangolins . global sequence and epidemiological data reveals that since its emergence, sars-cov- has spread rapidly to all parts of the globe, facilitated by its ability to use the human ace receptor for cellular entry (hoffmann et al. ) . the accumulating mutations in the sars-cov- genome have resulted in the evolution of clades out of which the ancestral clade o originated in wuhan. 
since the first report of sequence of sars-cov- from india, there have been multiple sequence submissions in global initiative on sharing all influenza data (gisaid, https://www.gisaid.org/). extensive sequencing of the viral genome from different regions in india is required urgently. this will provide information on the prevalence of various viral clades and any regional differences therein, which might lead to improved understanding of the transmission patterns, tracking of the outbreak and formulation of effective containment measures. the mutation data might provide important clues for development of efficient vaccines, antiviral drugs and diagnostic assays. we have initiated a study on sequencing of sars-cov- genome from swab samples obtained from infected individuals from different regions of west bengal in eastern india and report here the first nine sequences and the results of analysis of the sequence data with respect to other sequences reported from the country until date. we have detected unique mutations in the rna-dependent rna polymerase (rdrp), spike (s) and nucleocapsid (n) coding viral genes. it appears that the mutation in nucleocapsid gene might lead to alterations in local structure of the n protein. also the putative sites of mirna binding could be affected, which might have major consequences. the possible implications of the mutations have been discussed, which will provide important insights for functional validation to understand the molecular basis of differential disease severity. the regional virus research & diagnostic laboratory (vrdl) in indian council of medical research-national institute of cholera and enteric diseases (icmr-niced) is a government-designated laboratory for providing laboratory diagnosis for sars-cov- (covid ) in eastern india. 
nasopharyngeal and oropharyngeal swabs in viral transport media (vtm) (himedia labs, india) collected from suspect cases with acute respiratory symptoms/travel history to affected countries or contacts of covid- confirmed cases were referred to the laboratory for diagnosis. the test reports were provided to the health authorities for initiating treatment and quarantine measures. residual deidentified positive samples for sars-cov- were used for rna isolation and sequencing in accordance with ethics guidelines of the govt. of india. extraction of viral rna from the clinical sample ( µl) was performed using the qiaamp viral rna mini kit as per the manufacturer's protocol (qiagen, germany). the extracted rna was tested for sars-cov- (covid- ) by real time reverse transcription pcr (qrt-pcr) (abi , applied biosystems, usa) using the protocol provided by niv-pune, india (https://www.icmr.gov.in/pdf/covid/labs/ _sop_for_first_line_screening_assay_for_ _ncov.pdf; https://www.icmr.gov.in/pdf/covid/labs/ _sop_for_confirmatory_assay_for_ _ncov.pdf). briefly, first-line screening was done for the envelope e gene and rnase p (internal control). clinical samples positive for the e gene (ct ≤ . ) were subjected to a confirmatory test with primers specific for rdrp and hku orf (hku-orf -nsp ). a positive control and a no-template control were run for all genes. a specimen was considered confirmed positive for sars-cov- if the reaction growth curves crossed the threshold line within cycles (ct cutoff ≤ . ) for the e gene, and for both rdrp and orf or either rdrp or orf. rna isolated from nasopharyngeal and oropharyngeal swabs was depleted of ribosomal rna using the ribo-zero rrna removal kit (illumina, usa). the residual rna was then converted to double-stranded cdna and sequencing libraries were prepared using the truseq stranded total rna library preparation kit (illumina inc, usa) according to the manufacturer's instructions.
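The two-step decision rule described above (screen on the E gene, then confirm with RdRp and/or the HKU ORF assay, each judged against a Ct cutoff) can be written out explicitly. The cutoff below is a hypothetical placeholder parameter, since the exact Ct thresholds are specified in the ICMR/NIV protocol rather than reproduced here:

```python
def crossed_threshold(ct, cutoff):
    """True if the growth curve crossed the threshold line within
    the allowed number of cycles (ct is None if it never crossed)."""
    return ct is not None and ct <= cutoff

def classify_specimen(e_ct, rdrp_ct, orf_ct, cutoff=40.0):
    """Screen on the E gene, then confirm with RdRp and/or ORF.
    `cutoff` here is a placeholder, not the protocol's actual value."""
    if not crossed_threshold(e_ct, cutoff):
        return "negative"
    if crossed_threshold(rdrp_ct, cutoff) or crossed_threshold(orf_ct, cutoff):
        return "confirmed positive"
    return "inconclusive"
```

A specimen that screens positive on the E gene but fails both confirmatory targets falls outside the stated rule, so the sketch flags it as inconclusive rather than negative.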
the sequencing libraries were checked using high sensitivity d screentape in the tapestation system (agilent technologies, usa) and quantified by real time pcr using a library quantitation kit (kapa biosystems, usa). the libraries were sequenced using the miseq reagent kit v in the miseq system (illumina inc, usa) to generate x bp paired-end sequencing reads. for viral genome amplification in samples which did not generate sufficient viral reads, the rna samples were converted to double-stranded cdna and amplified using the qiaseq sars-cov- primer panel (qiagen gmbh, germany) according to the manufacturer's instructions. the multiplexed amplicon pools were then converted to sequencing libraries by enzymatic fragmentation, end repair and ligation to adapters. the sequencing libraries were checked and quantified as above and sequenced using the miseq reagent kit v nano in the miseq system (illumina inc, usa) to generate x bp paired-end sequencing reads. the sequencing reads obtained in the shotgun rna-seq experiment were mapped to the reference viral sequence, variants were detected and a consensus sequence for each sample was built using the dragen rna pathogen detection software (version ) in basespace (illumina inc, usa). for amplified whole genome sequencing, the viral sequences were assembled using clc genomics workbench v . . (qiagen gmbh, germany). in both cases, the severe acute respiratory syndrome coronavirus isolate wuhan-hu- (accession nc_ . ) was used as the reference sequence. each variant call generated in either pipeline was manually verified in the integrated genome viewer igv v . . (robinson et al. ) . clustal omega was used to display the mutations in the context of the sequence alignments. bioedit software (v . ) was used to extract the cds from the consensus sequence and to check codon usage. nucleotide to amino acid conversion was done with the emboss transeq online tool (madeira et al. ) .
to generate the clustering patterns of the viral sequences from west bengal, a subset of representative virus sequence data (n = ) were downloaded from gisaid global database (supplementary table ). only high coverage data (where the entries have less than % n's and less than . % amino acid mutations), complete genome (entries with base pair greater than , ) and excluding low coverage entries (entries having more than % n's) were used in the analysis. all of the sequences were aligned using mafft (multiple alignment using fast fourier transform). we used the nextstrain pipeline to process the sequence data. nextstrain with the augur pipeline was used to build phylogenetic tree based on the iqtree method, which is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. the tree building process involves the use of these subtypes 'wuhan-hu- / ', 'wuhan/wh / ' to generate the root of the tree. the tree is refined using raxml (randomized axelerated maximum likelihood). a web-based visualization program, auspice was then used to present and interact with phylogenetic and phylogeographic data. we investigated the potential mirna binding site in the region coding for n protein, found to be mutated in our samples. starmir (http://sfold.wadsworth.org/cgibin/starmirtest .pl) software was used for this purpose. the whole human mature mirna library was obtained from mirbase database. the sequence in query was taken nt upstream and nt downstream from the site of mutation. the mirnas which bind to the mutation site through seed sequence were shortlisted. the change in bases can prevent certain mirna binding and support the binding of others. therefore, mirna binding was checked for both, original and mutated site. we checked the levels of mirnas in the cancer conditions around the upper respiratory tract in the dbdemc database (https://www.picb.ac.cn/ dbdemc/). 
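The "high coverage, complete genome" filter applied to the GISAID download above (bounded fraction of N's, minimum genome length) reduces to a simple per-sequence check. The exact thresholds are elided in the text, so they are parameters here; this is an illustrative sketch, not the GISAID implementation.

```python
# sketch of the sequence-quality filter described above: keep only
# complete genomes above a minimum length whose fraction of ambiguous
# bases (n's) stays below a bound. thresholds are caller-supplied
# because the numeric values are elided in the source text.
def passes_filter(seq, min_len, max_n_frac):
    seq = seq.upper()
    if len(seq) <= min_len:
        return False                      # not a "complete genome"
    return seq.count("N") / len(seq) < max_n_frac

def filter_genomes(genomes, min_len, max_n_frac):
    """genomes: dict of name -> sequence; returns names passing the filter."""
    return [name for name, seq in genomes.items()
            if passes_filter(seq, min_len, max_n_frac)]
```

The surviving sequences would then go to MAFFT alignment and the Nextstrain/augur tree-building steps described in the text.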
the tissueatlas database (https://ccbweb.cs.uni-saarland.de/tissueatlas/) was used to analyse the presence and correlation of mirnas in body fluids. all patients were diagnosed positive for sars-cov- rna by real time pcr as described above. five of the patients suffered from fever, while seven patients exhibited some symptoms of infection like sore throat, cough with sputum, running nose or breathlessness. one patient suffered from acute respiratory distress syndrome (ards). two patients did not exhibit any symptom (table ) . five individuals had contact with covid- patients in particular; both s and s had contact with the same patient (table ). one individual had history of international travel while another had history of domestic travel. the shotgun rna-seq data resulted in high coverage (greater than x median depth of coverage) of complete genome sequences of the sars-cov- in five samples (s , s , s , s and s ) in which greater than % of the viral genome was covered at greater than x and greater than % of the viral genome was covered at greater than x. a negative correlation was found between viral load (represented by the threshold cycle or ct value of the rna samples in the real time pcr based diagnostic assay) and the number of reads mapped to the viral genome in the rna-seq library. even with samples, the pearson correlation coefficient was found to be - . (p value = . ) (table ). in particular, it was observed that samples with ct values greater than mostly resulted in generation of low counts of viral sequence reads leading to less than x median depth of coverage of the viral genome. in the remaining four samples (s , s , s and s ), the median depth of coverage was less than x and hence the viral genome sequencing was achieved after amplification of the viral genome by a multiplex pcr approach. all the nine sequences have been submitted in the global initiative on sharing all influenza data (gisaid) database. 
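The negative relationship reported above (higher Ct, i.e. lower viral load, yields fewer reads mapped to the viral genome) is a plain Pearson correlation, which can be computed with the standard library alone. The sample values in the usage note are illustrative, not the study's data.

```python
import math

# plain pearson correlation coefficient, as used above to relate
# diagnostic ct values to mapped viral read counts.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For example, with made-up values `pearson([18, 22, 26, 30, 34], [900000, 400000, 90000, 8000, 900])` comes out negative, matching the direction of the reported correlation.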
phylogenetic tree analysis of the sequences, along with other complete viral genome sequences submitted from india in gisaid, revealed that seven of these sequences belonged to the a a clade while only one sequence belonged to clade b (figure and table ). we were unable to classify one of the nine sequences, s , into any clade due to low sequence coverage. to understand transmission histories of these nine sars-cov- isolates from west bengal, we aligned these sequences with more than global sequences, including thirty sequences submitted in gisaid from india (at the time of our analysis) to identify specific mutations that occur at the highest level of the tip in a branch leading to the specific subtype. the predicted origin of the transmitted subtype in each case was identified with - % confidence from the branch in which our samples were located in the phylogenetic tree (table ) . the list of mutations detected in the sequences from nine samples are provided (table ) . seven sequences harboured the important signature mutations of a a clade. these consisted of the c/t mutation resulting in a change of p l in the rdrp and the a/g mutation resulting in a change of d g in the spike glycoprotein of the virus. in addition to these, g/t mutation in the gene coding for spike glycoprotein (g v) and triple base mutations of - ggg/aac in the gene coding for nucleocapsid resulting in two consecutive amino acid changes r k and g r were detected in s , s and s , s , s respectively. while the g/t s gene mutation was unique to these samples and could not be found in any other sequence from india or the rest of the world, the nucleocapsid mutations could be detected in only three other sequences from india (figure ). out of these, two sequences were obtained from individuals with contact history of a covid- patient who had travelled from italy. 
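The mutation calls above pair a nucleotide substitution with the amino acid change it causes (nonsynonymous vs synonymous). A minimal sketch of that conversion, given a CDS and a 1-based position within it, looks as follows; the codon table is only the subset of the standard table needed for the example, and all sequences are illustrative.

```python
# sketch of converting a cds point substitution into an amino acid change,
# the kind of call reported above (e.g. nonsynonymous vs synonymous).
# subset of the standard codon table, sufficient for the example only.
CODON_TABLE = {"ATG": "M", "GAT": "D", "GGT": "G", "CCT": "P", "CCC": "P"}

def aa_change(cds, pos, alt):
    """return (ref_aa, alt_aa, codon_number) for base `alt` at 1-based `pos`."""
    i = (pos - 1) // 3                      # 0-based codon index
    codon = cds[3 * i:3 * i + 3]
    off = (pos - 1) % 3                     # offset within the codon
    mutated = codon[:off] + alt + codon[off + 1:]
    return CODON_TABLE[codon], CODON_TABLE[mutated], i + 1
```

For instance `aa_change("ATGGATCCT", 5, "G")` reports an Asp-to-Gly change in codon 2, while a third-position substitution in a proline codon comes back synonymous.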
interestingly, two out of the three sequences harbouring these mutations obtained by us belonged to kolkata and had contact history with one covid- patient who had travelled from london (uk). the third sequence was obtained from a covid- patient from darjeeling, india who had history of travel from chennai, india. these mutations have been found in % of sars-cov- sequences reported world-wide from countries like uk, netherlands, iceland, belgium, portugal, usa, australia, brazil, etc. the rdrp (nsp ) gene of sars-cov- codes for the rna-dependent rna polymerase and is vital for the replication machinery of the virus. we detected a total of six mutations in this gene in the nine samples, out of which four were nonsynonymous, including the a a clade specific c/t (rdrp: p l) mutation. two individuals, s and s , harboured viral genome sequences that shared a unique c/t (a v) mutation which was not found in any other sequence reported from india or the rest of the world. one individual s , whose viral sequence belonged to the b clade, harboured mutations in rdrp which appear to be clade specific, out of which were nonsynonymous. to study the functional relevance of the mutations, we investigated the alteration in mirna binding in the nucleocapsid coding region, predicted to be caused by the - ggg/aac mutations. we found seven mirnas which bind to the original sequence and three which bind the mutated sequence exclusively (table and figure ). the number of bases in the sequence (ggg/aac) which bind the seed sequence of the mirna was also identified. the strength of the mirna binding prediction is reflected by the dg value mentioned in figure ; the lower the value, the stronger the binding. the values are comparable to those of some experimentally validated mirna bindings; for example, mir binding to hcv rna has dg values of - . kcal/mol for the s binding site and - . kcal/mol for the s binding site (data not shown).
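The seed-based shortlisting described above (a miRNA is kept if its seed region, nucleotides 2–8, can base-pair with the target site, checked separately against the original and mutated sequences) can be sketched directly. The miRNA and target sequences in the test are illustrative, and this simplification ignores the ΔG scoring that STarMir additionally performs.

```python
# sketch of seed-match shortlisting: a mirna "binds" a site if the
# reverse complement of its seed (nt 2-8) occurs in the target rna.
_COMP = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site(mirna):
    """reverse complement of the seed region (nt 2-8) of a 5'->3' mirna."""
    return "".join(_COMP[b] for b in reversed(mirna[1:8]))

def seed_matches(mirna, target):
    return seed_site(mirna) in target

def exclusive_binders(mirnas, original, mutated):
    """split mirnas by whether they target only one of the two site variants,
    mirroring the original-vs-mutated comparison described above."""
    only_orig = [m for m in mirnas
                 if seed_matches(m, original) and not seed_matches(m, mutated)]
    only_mut = [m for m in mirnas
                if seed_matches(m, mutated) and not seed_matches(m, original)]
    return only_orig, only_mut
```

A triple-base change in the target, as in the GGG/AAC mutation above, can thus abolish some seed matches while creating others.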
the values of dg obtained for the mirnas binding to the n protein coding region are comparable to these values, suggesting their relevance under in vivo conditions. we checked the levels of these mirnas in cancer conditions around the upper respiratory tract in the dbdemc database. (figure caption fragments: only two samples from west bengal (s and s ) harbour this mutation; (d) c/t, c/a and t/c mutations in the rdrp gene in clustal omega; only one sample from west bengal (s ) harbours these mutations.) we found that mir- - - p and mir- - p were downregulated in most of the cancers. mir- - - p was found to be upregulated in esophageal cancer (esca), head and neck cancer (hnsc) and lung cancer (luca), and downregulated in nasopharyngeal cancer (nsca) (supplementary figure ). assuming that the binding of mirnas would inhibit viral replication/stability, higher abundance of a given mirna would be protective against infection and lower abundance would increase susceptibility to infection. putting these results together: if a patient suffering from esca, hnsc or luca is infected with the original virus containing the ggg sequence, the upregulated mir- - - p would be protective against the infection. but if the same patient is infected with the mutated virus containing the aac sequence, mir- - - p will no longer be functional, and mir- - p, which targets the mutated site, is also downregulated. this could make patients suffering from the described cancers highly susceptible to infection with the mutant virus. we also checked whether these mirnas are associated with other disease conditions and found that mir- - p is downregulated in type diabetes mellitus (t dm) and hence could serve as one of the factors for increased susceptibility of t dm patients to the mutated viral subtype, increasing the risk of comorbidity (huang et al. ). another mirna, mir- - p, targeting the original subtype, is reported to be higher in asthma patients (fang et al. ).
this could be one of the factors limiting propagation of the original virus, but the loss of its targeting site in the mutated viral subtype could increase host susceptibility to viral infection. we further checked whether there are other conditions that could alter the availability of these mirnas at the site of infection. therefore, we used the tissueatlas database to analyse the presence and correlation of these mirnas in body fluids. we found differential expression of certain mirnas in the saliva of patients suffering from pancreatic cancer. mir- b- p, mir- - p and mir- - p were found to be upregulated in the saliva of pancreatic cancer patients, which could provide a similar protective/susceptible effect as described for the mirnas above (supplementary figure ). mirnas have been known to affect viral replication and stability by binding to protein coding regions of the genomes of h n , ev , cvb and many more viruses (bruscella et al. ; trobaugh and klimstra ). in most cases, binding of mirnas leads to translational repression of the targeted protein and hence directly affects viral rna replication. targeting by mirnas could decrease the levels of n protein, which is involved in various steps of the viral life cycle including replication, translation and coating of viral rna to form the nucleocapsid. hence, altered levels of the shortlisted mirnas could regulate various viral processes and the severity of sars-cov- infection. the effect of the mirnas would be the opposite if they assist viral replication/stability; this needs to be experimentally confirmed, but it does not diminish the importance of the mirnas targeting the original and mutated sites. we analysed the - ggg/aac mutations in the nucleocapsid gene, which result in the contiguous amino acid changes r k and g r, for their potential role in altering the structure of the encoded protein.
the sites of these mutations at position are located in the sr-rich region which is known to be intrinsically disordered (chang et al. ). in addition, this region is known to encompass a few phosphorylation sites (surjit et al. ), notably the gsk phosphorylation site at ser and a cdk phosphorylation site at ser , which are in close proximity to these mutations. the sequence motifs and are entirely consistent with gsk and cdk phosphorylation motifs, respectively. when ser is phosphorylated, which tethers a large negative group to the side-chain of ser, as seen in many other kinase substrates, it is likely that charge neutralization takes place involving positively charged side-chains in the sequential and spatial vicinity. arg is part of the gsk phosphorylation motif and its side-chain could potentially contribute to charge neutralization at p-ser . given the sequential, and therefore spatial, proximity of arg to p-ser , the side-chain of arg could potentially also be involved in interaction with the phosphate group at position . this interaction would contribute to a reduction of conformational entropy. similarly, arg , a part of the cdk phosphorylation motif, would contribute to charge neutralization at p-ser . arg and gly are mutated to lys and arg respectively (figure ). the spike protein (s) of coronaviruses is a class i viral fusion protein which is synthesized as a single-chain precursor that trimerizes upon folding. it is composed of two subunits: s (at the amino terminus) containing the receptor binding domain (rbd) and s (at the carboxy terminus) that drives membrane fusion. in all three structures, d lies in a loop at the interface between any two of the three protomers.
the co-ordinates for the d side-chain in chain a and c of vyb are available only up to c b -atom and the orientation of these atoms are similar to that observed in the respective atoms of d in vxx. the co-ordinates of all the side-chain atoms of d in chain b of vyb are available and they are similar to that observed in chain b of vxx. the side-chain of d in all the protomers of vxx and chain b of vyb point outward from the core of the protein toward the solvent. the side-chain orientation of d in all the three chains of vsb is different from the former two structures. this differential orientation of d side-chain in vsb facilitates formation of hydrogen bond between d (present in s subunit) and t (present in s subunit) from the neighbouring chain in two out of the three interfaces found in vsb ( figure ). taken together, these facts suggest that d is highly flexible and support the wobbly nature of the inter-protomeric hydrogen bond observed between d and t . contribution of this transient hydrogen bond toward stability of the pre-fusion state cannot be negated. interestingly, s protein of mouse coronavirus (mhv-a ) which has a similar structural topology as that of the sars-cov- s protein but shares a low overall sequence identity (* %), has a conservative substitution at the position equivalent to d of the latter. the asn (n ) of mouse coronavirus (mhv-a ) is replaced with asp (d ) in sars-cov- ( figure and figure ) . in earlier literature, n has been suggested to offer inter-protomeric interactions that contribute toward maintenance of the s fusion machinery in its metastable state (ac walls et al. ) . given the conservation of asp at this position in closely related coronaviruses (bat coronaviruses: btcov-ratg and btcov-hku ; sars-cov) and its conservative substitution in mouse coronavirus (mhv-a ), it is likely that d is important for structural stability of s protein. 
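The inter-protomeric hydrogen bond discussed above (between the D side-chain of one chain and T of the neighbouring chain) is the kind of polar contact typically flagged by a simple donor–acceptor distance criterion. A minimal geometric sketch, assuming a conventional heavy-atom cut-off of 3.5 Å and entirely made-up coordinates:

```python
import math

# minimal geometric check for a candidate hydrogen bond: donor and
# acceptor heavy atoms within a distance cut-off (3.5 angstrom is a
# common convention, an assumption here, not a value from the text).
def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_hbond_candidate(donor_xyz, acceptor_xyz, cutoff=3.5):
    return dist(donor_xyz, acceptor_xyz) <= cutoff
```

In practice the coordinates would be read from the PDB entries compared above, and angle criteria would be added on top of the distance check.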
as gly lacks a side-chain, the transient hydrogen bond observed in the wild-type s protein would be lost in the variant with the d g mutation. this can potentially compromise the structural stability of the pre-fusion state of the s protein, possibly interfering with conformational transitions. moreover, replacement of asp with gly at this position would come with higher conformational freedom at the backbone (c ramakrishnan and gn ramachandran ) of the polypeptide, resulting in enhancement of local conformational entropy. the gly at this position is solvent exposed and is present at the tip of the c-terminal end of a b-strand. this position is proximal to the region where the s protein attaches itself to the viral membrane (figure ). it is to be noted that the gly at this position is conserved among the closely related coronaviruses (bat coronavirus ratg and hku , sars-cov), hinting at its possible role in maintenance of the structure and function of the s protein (figure ). in general, as explained above, the gly backbone has higher conformational freedom than any other amino acid residue (ramakrishnan and ramachandran ). therefore, substitution of gly with val would impart rigidity to the local region. the possible implication of such rigidity for the association of the s protein with the viral membrane could be understood from a structure of the s protein in association with the viral membrane; however, such a structure is currently unavailable. 
figure . conformation of d in three structures ( vxx, vyb, vsb). (a), (b), (c) overlay of d ( vxx: yellow carbon; vyb: white carbon; vsb: dark pink carbon) from chain a, b and c of the three structures, respectively. to maintain visual clarity, only the backbone of the respective chain of vxx is shown in cartoon representation. (d), (e), (f) orientation of d (green carbon) from chain a (purple cartoon) and t (dark blue carbon) from chain b (teal cartoon) in vxx, vyb and vsb, respectively. hydrogen bonds are depicted as black dashed lines. (g), (h), (i) orientation of d (green carbon) from chain c (orange cartoon) and t (dark blue carbon) from chain a (purple cartoon) in vxx, vyb and vsb, respectively. the side-chain co-ordinates for d in chain a and c of vyb are unavailable. protein rendering has been done using pymol (schrödinger, llc). 
substantial uncertainties surround the trajectory of the recent epidemic of covid- in india. it is extremely important to track the outbreak by analysing the phylogenetic relationships between different sars-cov- genomes prevalent in india and comparing them with genomes reported from the rest of the world. the error-prone replication process of all rna viruses in general results in the introduction of mutations in their genomes, which behave as a molecular clock that can provide insights into the emergence and evolution of the virus. the data till date suggest that sars-cov- emerged not long before the first cases of pneumonia in wuhan occurred. in this study, direct massively parallel sequencing of the viral genome was undertaken on nasopharyngeal and oropharyngeal swab samples collected from infected individuals from different districts of west bengal. we have analysed the first nine sequences in this report. recent analysis of sars-cov- sequences from all over the globe has revealed that the outbreaks in most countries were initially triggered by the original strain from wuhan, clade o, which thereafter diversified into multiple clades (yadav et al. ; biswas and majumder ). temporal sweeps leading to replacement of the ancestral o and other clades by a a have been detected. until our report, initial sequences from samples obtained from individuals with travel history to china reported genetic similarity to clade o, which was obtained at the beginning of the outbreak in wuhan, china. 
rest of the sequences reported from india mostly belonged to either clade a ( %) or a a ( %) (supplementary table ), with evidence of the temporal sweep where the a a is emerging as the predominant clade (biswas and majumder ) . the a a clade is characterized by the signature nonsynonymous mutations leading to amino acid changes of p l in the rdrp which is involved in replication of the viral genome and the change of d g in the spike glycoprotein which is essential for the entry of the virus in the host cell by binding to the ace receptor. notably, the d g mutation is close to the furin recognition site for cleavage of the spike protein, which plays an important role in virus entry. whether both these mutations have resulted in the evolution of a more transmissible viral subtype i.e. the a a clade, is yet to be verified by in vitro and in silico analyses. interestingly, we also found that one of viral sequences in our study belonged to the b clade, which originated in china (gonzalez-reiche et al. ). b clade sequences have not been reported from india earlier and are only less than % of sequences reported worldwide. probably, the individual s was transmitted this subtype by contact with others who had travel history to china although this information was not available in the patient clinical history. emergence of viral subclones in an outbreak can affect the transmission patterns and disease severity, which are immensely important for public health (harvala et al. ; jones et al. ) . given the large size of the infected population in india, with the possibility of regional differences in the population and host-related factors, this can have the potential to affect the course of the outbreak. population surveillance is essential for early detection of emergence of such subclones. we analysed the mutations detected in each sequence that we generated and found preliminary evidence of this. we found that three individuals of this study, viz. 
s , s and s , shared rare set of three contiguous mutations in their genome which resulted in the consecutive alterations of r k and g r. these mutations were also found to be shared with other sequences reported from western india. interestingly, while two out the three sequences harbouring these mutations were from individuals who shared contact history with a covid- patient with history of travel from italy, two out of the three samples from west bengal shared contact history with the same covid- patient with history of travel from uk. the third individual whose sample harboured these mutations, viz. s , was found to have history of travel from chennai, india, but the possibility of the patient having contact in transit with an individual with international travel history cannot be excluded. additionally, origin of the viral subtypes infecting s and s has also been predicted by phylogenetic analysis to be europe (uk). s had been infected in delhi, india where he had contact with an infected individual who travelled from europe. one of the individuals s , harboured a viral subtype which is predicted to have been transmitted in china. s and s , who shared an identical sequence of the virus, also harboured one unique mutation resulting in the amino acid alteration of g v in the spike protein. this correlates with the fact that these two individuals had also been known to have contact with the same covid- patient. viral rna sequences obtained from two samples s and s shared all mutations except a v l mutation at orf harboured by s and not by s . interestingly, both these individuals belonged to the same district of east medinipur, had history of contact with covid- patients and did not exhibit any clinical symptom. thus our findings indicate that the viral subtypes transmitted in the eastern region of india, in particular west bengal, have mostly originated from europe and also china. 
sequencing of large number of samples are being presently undertaken to confirm and elaborate these initial findings. rdrp is essential for replication of viral rna genome and hence this gene is expected to be conserved. interestingly, we detected multiple mutations in this gene, the majority of which were non synonymous and hence result in alteration of protein sequence. in particular, the p l was present in all a a sequences in our samples. this mutation is located adjacent to a hydrophobic cleft in rdrp which is a promising target for potential drugs (pachetti et al. ) . sequences from two samples, s and s , shared a unique rdrp mutation at a v which has not been detected until date in rest of the sequences submitted from india or worldwide. as observed earlier, these two samples harbour viral subtypes whose genomes are strikingly similar. sequence obtained from one of the samples s , which belonged to the clade b , did not possess the p l mutation. instead, it harboured three different mutations resulting in two non-synonymous changes of h y, p t and a synonymous mutation which were not found in any other sequences reported from india until date and are specific for the b clade. it remains to be seen whether these amino acid alterations result in substantial changes in structure or function of rdrp, resulting in emergence of drug resistant subtypes or enhancement in mutation rate in the viral genome. we investigated the potential of the mutations detected in the nucleocapsid region to effect alterations in the viral and host processes. we found that this mutation results in considerable alterations in the predicted binding of mirnas, which might play a role in the establishment and progress of infection in the patient. we also found that some of the mirnas which are predicted to bind to the mutated subtype might be downregulated in multiple cancer types. 
this raises the possibility that cancer patients might have higher susceptibility to the mutated sub-clone due to the reduced ability to contain the virus in vivo, compared to infection by the original virus of the same clade. the leads obtained from this study need to be pursued to develop mirna based novel therapeutic approaches. we also analysed the predicted structural alterations in the viral nucleocapsid protein, which might be caused by consecutive alterations of r k and g r. as a result of these mutations, we have two strong positively charged residues in close sequential positions as opposed to only one positively charged residue in the other genotype. given the structural vicinity of p-ser and p-ser and the long sidechains of lys and arg with high positive charge and significant side-chain conformational freedom in this genotype, both these residues potentially could contribute to charge neutralization of the phosphorylated serine residues. this contributes to further reduction of conformational entropy compared to the other genotype. while lys is likely to offer electrostatic interactions to p-ser , arg (with a greater number of positively charged centres as compared to lys) could potentially simultaneously interact with the phosphate groups at both p-ser and p-ser . together, these two positively charged residues (lys and arg ) have the potential to offer additional interactions to the phosphorylated serine residues at and positions as opposed to only one of them (arg ) in the other genotype. consequently, one can expect a significant difference in conformational entropy as well as in the inter-residue interaction structural network between the two genotypes especially when ser and ser are phosphorylated. further, gly at position in one of the genotypes would confer significantly higher conformational freedom at the backbone (ramakrishnan and ramachandran ) of the polypeptide chain compared to arg in the equivalent position in the other genotype. 
this mutation adds another dimension to the likely structural differences in this local region of the two genotypes. subsequently, phosphorylation-mediated functional events might be different in the two genotypes (surjit and lal ; surjit et al. ). these proposed differences in the inter-residue structural network between the two genotypes are depicted schematically in figure . admittedly, the proposed network of interactions is fraught with uncertainty. however, given two positively charged residues in one genotype compared to only one in the other genotype, the charge neutralization structural interaction networks involving p-ser and p-ser has to be certainly different going by the highly established literature on kinase substrates (kitchen et al. ; krupa et al. ) . interestingly, the mutations d g (in s d domain) is supposed to confer flexibility in the s d domain and the mutation g v might impart partial rigidity in the conformation of s domain. obvious question is whether such structural alterations in local region would have any consequence in receptor binding affinity of spike protein. since the mutation resides in rbd domain-s subunit of spike protein, residue is not directly involved in the interaction with ace . but the mutation might have some effect on the positioning of the residues involved in interaction. now to address the concerns whether these mutations are expected to affect the sensitivity of the existing diagnostic kit, we have again explored the implications of the structural changes. most likely, the presence of mutation should not affect the rapid detection kits because these kits detect the presence of specific igg/igm antibody against viral n protein or viral s protein. the whole protein is coated for the test and therefore polyclonal antibodies would provide the result here. change in just one epitope might not affect the overall result. we have further checked if the mutation sites fall in immunodominant epitopes. 
this data is available for sars proteins and the sites where we have found mutation have been shown to be conserved in sars and sars-cov- . while the mutation site of n protein does not elicit much antibody response, region - of the s protein of sars has been shown to be a major immunodominant epitope in s protein (he et al. ) . change in this epitope by mutation could alter the sensitivity of the igg/igm tests conducted. also, there are certain diagnostic kits being designed to check the presence of viral antigen in the clinical sample. the abundance of antibodies targeting the mutation sites needs to be checked in those kits, to be more effective across the viral strains harbouring different mutations. we also detected interesting relationships between ct value of diagnostic assay as a surrogate of viral copy number and viral sequence reads obtained. we recommend that for future sequencing studies, the shotgun rna-seq approach should be used for high viral copy number represented by low ct values while for rest, a viral genome amplification method should be used. although the sample size of our preliminary report is small, follow up studies are underway to confirm these observations for understanding the impact of the same in the ongoing outbreak of covid- in india. we have not commented on the relationship of the viral sequence alterations with disease severity due to the limited sample size of this analysis. we hope to provide valuable information on this aspect based on the expanded number of samples being sequenced at present. our findings provide leads which might benefit outbreak tracking and development of therapeutic and prophylactic strategies to contain the infection. finally, we conclude that the initial sequences generated by us from first nine samples in west bengal in eastern india indicate a selective sweep of the a a clade of sars-cov- . however, the viral population is not homogenous and other clades like b also exist. 
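The sequencing-strategy recommendation above reduces to a one-line decision rule: shotgun RNA-seq for low-Ct (high viral copy number) samples, amplicon-based whole-genome amplification for the rest. The Ct threshold is elided in the text, so it must be supplied by the caller; this is a sketch, not a stated protocol value.

```python
# the ct-based sequencing-strategy recommendation above as a decision rule.
# the ct threshold is elided in the source text, so it is a parameter.
def pick_sequencing_method(ct, threshold):
    """low ct = high viral load -> shotgun rna-seq; otherwise amplify first."""
    return "shotgun rna-seq" if ct <= threshold else "amplicon whole-genome"
```

In this study the switch corresponds to the samples whose shotgun libraries yielded too few viral reads and were therefore re-sequenced via the multiplex PCR panel.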
we have also detected emergence of mutations in the important regions of the viral genome including spike, rdrp and nucleocapsid coding genes. some of these mutations are predicted to have impact on viral and host factors, which might affect transmission and disease severity. this preliminary evidence of emergence of multiple subclones of sars-cov- , which might have altered phenotypes, can have important consequences on the ongoing outbreak in india. during the ongoing covid- pandemic. we also acknowledge the assistance provided by dr. sillarine kurkalang (nibmg), mr. sumitava roy (nibmg) in reviewing the sequence data, ms. soumi sarkar (nibmg) for assistance in statistical analysis, and mr. anand bhushan and ms. meghna chowdhury for providing assistance in laboratory support and logistics. sd and ns would like to acknowledge support from j c bose fellowship. we also thank dbt-iisc partnership programme at iisc, bengaluru, and the national genomics core at nibmg. hr and sc would like to acknowledge support from csir-spm fellowship and dst-inspire fellowship, respectively. tg would like to acknowledge dbt-ra fellowship. 
references: 
severe acute respiratory syndrome coronavirus (sars-cov- ): an overview of viral structure and host response 
analysis of rna sequences of sars-cov- collected from countries reveals selective sweep of one virus type 
viruses and mirnas: more friends than foes 
modular organization of sars coronavirus nucleocapsid protein 
the sars coronavirus nucleocapsid protein - forms and functions 
the effect of travel restrictions on the spread of the novel coronavirus (covid- ) outbreak 
mir- - p is a novel microrna that exacerbates asthma by regulating b-catenin 
introductions and early spread of sars-cov- in the new york city area 
emergence of a novel subclade of influenza a(h n ) virus in london 
identification of immunodominant sites on the spike protein of severe acute respiratory syndrome (sars) coronavirus: implication for developing sars diagnostics and vaccines 
sars-cov- cell entry depends on ace and tmprss and is blocked by a clinically proven protease inhibitor 
glucolipotoxicity-inhibited mir- - p regulates pancreatic b-cell function and survival 
analysis of the mutation dynamics of sars-cov- reveals the spread history and emergence of rbd mutant with lower ace binding affinity 
evolutionary, genetic, structural characterization and its functional implications for the influenza a (h n ) infection outbreak in india from 
charge environments around phosphorylation sites in proteins 
structural modes of stabilization of permissive phosphorylation sites in protein kinases: distinct strategies in ser/thr and tyr kinases 
the molecular biology of coronaviruses 
structure, function, and evolution of coronavirus spike proteins 
genomic characterisation and epidemiology of novel coronavirus: implications for virus origins and receptor binding 
the embl-ebi search and sequence analysis tools apis in 
emerging sars-cov- mutation hot spots include a novel rna-dependent-rna polymerase variant 
stereochemical criteria for polypeptide and protein chain conformations: ii. allowed conformations for a pair of peptide units 
variant review with the integrative genomics viewer 
the severe acute respiratory syndrome coronavirus nucleocapsid protein is phosphorylated and localizes in the cytoplasm by - - -mediated translocation 
the sars-cov nucleocapsid protein: a protein with multifarious activities 
the nucleocapsid protein of severe acute respiratory syndrome-coronavirus inhibits the activity of cyclin-cyclin-dependent kinase complex and blocks s phase progression in mammalian cells 
microrna regulation of rna virus replication and pathogenesis 
structure, function, and antigenicity of the sars-cov- spike glycoprotein 
cryo-electron microscopy structure of a coronavirus spike glycoprotein trimer 
cryo-em structure of the -ncov spike in the prefusion conformation 
a new coronavirus associated with human respiratory disease in china 
full-genome sequences of the first two sars-cov- viruses from india 
probable pangolin origin of sars-cov- associated with the covid- outbreak 
a novel coronavirus from patients with pneumonia in china 
we acknowledge the financial and overall support provided by the department of biotechnology, ministry of science and technology, india, and the indian council of medical research, and all laboratory staff of the niced-vrdl network for laboratory support. 
since the gentg arose mainly from the national implementation of the eu genetic engineering directives, developments in international and national genetic engineering regulation are of particular interest to the zkbs. in the area of international genetic engineering regulation, it should be highlighted for the reporting year that the "intergovernmental committee for the cartagena protocol" (iccp) was established, which accompanies the preparations for the ratification and proper implementation of the "biosafety protocol". in preparation for the "third act amending the genetic engineering act", which primarily implements directive / /eu at the national level, working groups of zkbs members dealt with additional, technically difficult aspects (including "safety classification", § gentsv). this "third act amending the genetic engineering act" is also expected to take the implementation of the cartagena protocol into account (see above). from december to september , a "discourse on green genetic engineering" took place on the initiative of the federal ministry of consumer protection, food and agriculture (bmvel), in which representatives of societal groups and affected associations participated. over these months, various aspects of the use of genetic engineering in agriculture and food were discussed. at the end of each partial discourse, a summary was drawn. a jointly supported report of results, which also presented minority positions, was presented at a closing event in september.
the zkbs took note that this "discourse on green genetic engineering" was not a continuation of the so-called "chancellor initiative", since the aspects essential for the assessment of biological safety, which were to play a primary role in the "chancellor initiative" (gaining practical experience from already completed or ongoing cultivation projects with gmos), were excluded from the very conception of the discourse. the zkbs could not identify any substantive arguments for conducting this "discourse on green genetic engineering". overall, the zkbs cannot necessarily rate the changes in international and national regulation of "green genetic engineering", or its development and application, as positive. for this reason, the zkbs again expresses, for the year in which political responsibility for genetic engineering will pass from the federal ministry of health and social security (bmgs) to the federal ministry of consumer protection, food and agriculture (bmvel), its expectation of a turnaround which - while maintaining appropriate, scientifically justifiable precautionary measures - also creates "the legal framework for the research, development, use and promotion of the scientific, technical and economic possibilities of genetic engineering" (gentg § , para. ). within the european union, the situation regarding authorization procedures for the placing on the market of products containing genetically modified organisms has now stagnated, unchanged, for the fourth year since . neither the authorization procedures under directive / /eec, some of which have been pending for several years, nor those under the novel foods regulation have been completed (table in [ ]). to fulfill the tasks of the zkbs in examining safety-relevant questions of genetic engineering, the members of the committee are appointed from different disciplines. § of the genetic engineering act governs the composition of the zkbs.
it stipulates that the committee is composed of ◗ experts who have special and, where possible, international experience in the fields of microbiology, cell biology, virology, genetics, hygiene, ecology and safety engineering; of these, at least must work in the field of the recombination of nucleic acids; each of the fields mentioned must be represented by at least one expert, and the field of ecology by at least experts.

annex: "letter to the editor" of the journal 'nature', concerning the publication "transgenic dna introgressed into traditional maize landraces in oaxaca, mexico" [ ]

ladies and gentlemen, in their article quist and chapela ( ) report on the detection of transgenic dna constructs in native maize landraces grown in remote mountains in oaxaca, mexico. they thereby raise concerns about the unintended introgression of transgenic maize traits into landraces ('criollo') in the centre of their origin, resulting in a danger to the natural diversity of this crop plant. by use of molecular methods, including pcr, inverted pcr (ipcr) and sequencing of the amplified dna, they obtained data from which they conclude (i) that the nucleotide sequence of the cauliflower mosaic virus (cmv) s promoter [p- s; contained in various lines of genetically modified maize; ] is present in the maize genomes of several 'criollo' samples, (ii) that in two instances these promoter sequences were flanked by adh sequences, which also neighbour the p- s in the transgenic construct of novartis bt maize, and (iii) that the transgenic p- s sequences were "embedded within various genomic contexts" of the 'criollo' samples.
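the letter's first consistency check - whether ipcr products amplified across a re-ligated ecorv junction contain the restored ecorv recognition site - can be sketched as follows. the product sequences here are hypothetical placeholders, not the deposited af entries; only the ecorv site (gatatc) is a fixed biological fact.

```python
# ecorv recognizes gatatc (cutting gat^atc), so ligation of two ecorv ends
# restores the full site; an ipcr product spanning the junction should contain it.
ECORV_SITE = "gatatc"

def has_restored_ecorv_site(product: str) -> bool:
    """Return True if the amplified product contains an EcoRV site."""
    return ECORV_SITE in product.lower()

# hypothetical ipcr products (placeholders, not the af... genbank entries)
products = {
    "product_with_junction": "ttacgGATATCcgtaa",
    "product_without_junction": "ttacggacatccgtaa",
}

for name, seq in products.items():
    print(f"{name}: restored ecorv site = {has_restored_ecorv_site(seq)}")
```

a product lacking the site, as the letter notes, argues that it did not arise from a circularized ecorv fragment.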
we have closely examined the experimental data and the analyses of the nucleotide sequences presented in the report. we find that, aside from problematic details of the experimental design and some erratic presentations of the data, the results of the study do not provide evidence for the introgression of recombinant dna from transgenic crop plants into the genomes of 'criollo' maize. our detailed analyses of the data, including the nucleotide sequences ( ) which the authors have deposited in the genbank nucleotide sequence database, clearly show that none of the authors' conclusions is justified, and the far-reaching interpretations concerning the endangered diversity of landraces therefore lack any basis. our position with respect to the presented data is detailed in the following. . in order to prove that the p- s sequences that were amplified by pcr from the dna samples prepared from the corn cobs were not derived from contaminating cmv, it is necessary to show that the observed p- s sequences are linked on one side or on both sides to maize dna. to identify the sequences flanking the p- s, the authors applied ipcr. the templates for ipcr were ecorv restriction fragments of the maize dna circularized by ligation. ecorv cuts at a site in the middle of the p- s sequence, and therefore dna amplified by ipcr from ligation products with primer pairs matching the right and left parts of the p- s sequence should contain one restored ecorv cleavage site. eight sequenced ipcr products were presented in fig. of ref. (sequences af to af ). none of them contains the ecorv site (see box in our fig.). this casts doubt on the authors' assumption that restriction by ecorv and ligation had created the circular dna products necessary for ipcr. . next we examined whether the nucleotides directly ahead of the four applied primers, expected to be identical to the p- s sequence, were present in the eight sequenced ipcr amplification products. as shown in our fig.
the primers icmv and icmv were used for ipcr on the left side of the s promoter sequence, the primers icmv and icmv on the right side. [at this point two details of the authors' experimental setup must be criticized. first, the binding sites of icmv and icmv are located outside of the p- s region initially amplified by primers cm and cm from the dna samples, and therefore the presence of these binding sites in the sample dna was not certain. second, icmv has nucleotides at the ' end which do not match the s promoter region (waved line in icmv in our fig.) and is therefore not expected to allow specific amplification of p- s sequences.] in the sequences of the amplification products af , - , - , - , in which the ipcr primer sequences can be identified, the nucleotide sequences ahead of the primers are not from p- s. the expected p- s sequence is only partially present ahead of icmv in af . in five cases the primer sequences were not discernible (af , - , - ). in one case (af ) the expected p- s nucleotides were present. these data indicate that, perhaps with the exception of af , the template of the pcr amplifications was not the s promoter region. three other inconsistencies between the sequences af to af and the fig. are apparent ( ). first, the sequences described as "downstream" relative to the p- s by quist and chapela are in fact "upstream" sequences, and correspondingly the "upstream" sequences of quist and chapela are in fact "downstream" (compare our fig. with fig. of ). second, the vertical lines in the fig., which according to quist and chapela indicate the ends of cmv sequences, are misleading since, as outlined above, they only mark the ' ends of the primers employed (except for the sequence af ; the source of this sequence is termed b in the fig., b in the sequence deposited in genbank, and a in the supplement to ref. ). thirdly, the thin lines in fig. of ref.
supposedly indicating the parts of cmv dna in the amplified sequences are unduly overstretched (they essentially always represent only the pcr primers), and several of them should not be there at all because primer sequences cannot be identified. this is the case at one side of a (af ) and both sides of b (af ). . with the help of blast searches, we characterized those parts of the sequences of the ipcr amplification products that were denoted by quist and chapela in their fig. as regions flanking the cmv p- s sequence. we find that the sequence of af , denoted adh in the k source of the fig., does not match the maize adh gene. rather, it matches a sequence located about . nucleotides away from the adh gene (but still within the database entry of about . bp termed adh ). a corresponding blast search result was also obtained with af from the a sequence. therefore, the conclusion that "sequences adjacent to the p- s dna were diverse" in the maize genome cannot be drawn. . we examined whether the identified regions in the maize genomic dna from which pcr amplification products were obtained by the authors might be flanked by primer binding sites. for this we performed pairwise blast alignments of the ipcr sequences with the five matching maize genomic sequences. by adjusting the alignment parameters, we were able to identify putative primer binding sites at the expected distances in the maize genome target sequences corresponding to af to af , and also one binding site for the single primer sequence identified in af . this indicates that these five assumed ipcr amplification products were most likely obtained by normal pcr amplification directly from continuous sequences of the maize genome that accidentally have flanking regions with similarity to the primers. no primer binding sites were found in sequences af and af . these findings support our conclusion from section that the templates for at least seven of the eight ipcr products were not p- s sequences.
rather, the templates were sequences of the maize genome related to retroviral sequences, which frequently had reasonably matching primer binding sites. evidence for the integration of p- s sequences into the 'criollo' genome was not obtained, because in none of the eight cases studied could a linkage of p- s sequences (aside from the primers used for pcr) to maize dna be demonstrated. in one case p- s sequences were linked to a non-identified sequence. this can be a consequence of the use of ipcr, in which the essential ligation step always bears the risk of fusing (restriction) fragments that were not naturally contiguous in the sample dna. in this case the authors did not perform the necessary pcr control experiment using primers from p- s and the unknown sequence to show that these sequences are in fact contiguously present in the 'criollo' genome. the fact that the p- s-specific primers used by the authors had considerable similarity to retroviral sequences explains the formation of pcr products from 'criollo' dna under conditions where the hybridization stringency is not sufficiently controlled. five of the eight amplified sequences in fact gave matches with retroviral or retrotransposon elements. the claim of the authors that two of their sequences were related to sequences present in the transgenic construct of novartis event bt corn was disproven by careful analysis of the sequence and its target. the low amounts of p- s sequences detected in the 'criollo' dna preparation can easily be explained by contamination of the samples with cmv. if the samples had been tested for other cmv sequences, they would probably have been present in amounts equivalent to the p- s sequence.

achter bericht nach inkrafttreten des gentechnikgesetzes (gentg) für den zeitraum vom . . bis .
gentechnisch veränderte pflanzen
der elfter bericht nach inkrafttreten des gentechnikgesetzes (gentg) für den zeitraum vom . . bis
yellow head complex viruses: transmission cycles and topographical distribution in the asia-pacific region
observation of measles virus cell-to-cell spread in astrocytoma cells by using a green fluorescent protein-expressing recombinant virus
neunter bericht nach inkrafttreten des gentechnikgesetzes (gentg) für den zeitraum vom . . bis
transgenic dna introgressed into traditional maize landraces in oaxaca, mexico
further tests at cimmyt find no presence of promoter associated with transgenes in mexican landraces in gene bank or from recent field collections
no credible scientific evidence is presented to support claims that transgenic dna was introgressed into traditional maize landraces in oaxaca
doubts linger over mexican corn analysis
transgenic dna introgressed into traditional maize landraces in oaxaca, mexico
a method of detecting recombinant dnas from four lines of genetically modified maize

these two sequences, thus incorrectly associated with the maize adh gene (fig. ), were a strong argument for the authors that a transgenic construct was identified in the 'criollo', because adh sequences are in fact present in the transgenic construct of the bt event of novartis. however, in the bt construct the adh -related sequences are the introns ivs and ivs of the adh gene and are located downstream of p- s ( ). different from the bt construct, the so-called adh sequences in af and af are located upstream of p- s (see previous section). thus, the adh hits of two ipcr products presented by the authors as evidence for "the integrity as an unaltered construct" retained in the 'criollo' genomes are wrong in two ways: (i) the sequences are not from the adh gene, and (ii) they are located on the wrong side of p- s. the sequence af (a sample) was denoted as zea mays alpha zein gene, although the matching region in genbank sequence af is not an alpha zein gene.
instead, the target sequence is part of a region denoted as "similar to retrovirus-related pol polyprotein sequence". similarly, our blast search also identified the previously discussed "adh " sequences of af and af as being highly similar to a putative gag-pol precursor, e.g. in the genbank sequence af . in the case of the sequence af (a sample), the similarity (bit score ) with the dull gene could not be reproduced. a match of only identical nucleotides (bit score ) was obtained when using decreased stringency parameters in a pairwise alignment with "blast sequences". in summary, five of the eight ipcr sequences are retroelement sequences; the other three are not (af , - ) or not closely (af ) related to any known maize dna sequence.

key: cord- -isbc r o authors: munjal, geetika; hanmandlu, madasu; srivastava, sangeet title: phylogenetics algorithms and applications date: - - journal: ambient communications and computer systems doi: . / - - - - _ sha: doc_id: cord_uid: isbc r o

phylogenetics is a powerful approach to tracing the evolution of present-day species. by studying phylogenetic trees, scientists gain a better understanding of how species have evolved while explaining the similarities and differences among them. phylogenetic study can help in analysing the evolution of, and the similarities among, diseases and viruses, and can further help in designing vaccines against them. this paper explores computational solutions for building the phylogeny of species and highlights the benefits of alignment-free methods of phylogenetics. the paper also discusses the application of phylogenetic study in disease diagnosis and evolution. phylogenetics can be considered one of the best tools for understanding the spread of contagious disease, for example, the transmission of the human immunodeficiency virus (hiv) and the origin and subsequent evolution of the severe acute respiratory syndrome (sars)-associated coronavirus (scov) [ ].
earlier, morphological traits were used for assessing similarities between species and building phylogenetic trees. presently, phylogenetics relies on information extracted from genetic material such as deoxyribonucleic acid (dna), ribonucleic acid (rna) or protein sequences [ ]. methods used for phylogenetic inference have changed drastically during the past two decades: from alignment-based to alignment-free methods [ ]. this paper reviews various methods of phylogenetic tree construction, from character-based to distance-based methods and from alignment-based to alignment-free methods. a brief review of phylogenetic tree applications in cancer studies is also given. a phylogenetic tree can be unrooted or rooted, the latter implying directions corresponding to evolutionary time, i.e. the species at the leaves of a tree correspond to the current-day species. the species can be expressed as dna strings formed by combining the four nucleotides a, t, c and g (a, adenine; t, thymine; c, cytosine; g, guanine). in the literature, various string-processing algorithms are reported which can quickly analyse these dna and rna sequences and build a phylogeny of sequences or species based on their similarity and dissimilarity. a high similarity between two sequences usually implies significant functional or structural likeness, and such sequences are closely related in the phylogenetic tree. to get more precise information about the extent of similarity to other sequences stored in a database, we must be able to compare a sequence quickly with a set of sequences; for this, we need to perform multiple sequence comparison. dynamic programming facilitates this comparison using alignment methods, but it involves considerable computation. moreover, the iterative computational steps limit its utility for long sequences [ ]. alignment-free methods overcome this limitation, as they use alternative metrics such as word frequency or sequence entropy to find similarity between sequences.
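the dynamic-programming sequence comparison mentioned above can be made concrete with a minimal needleman-wunsch global alignment score. the match, mismatch and gap scores below are arbitrary illustrative parameters, not values from the paper.

```python
# minimal needleman-wunsch global alignment score via dynamic programming.
# match/mismatch/gap values are arbitrary illustrative scoring parameters.
def global_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

print(global_alignment_score("gattaca", "gattaca"))  # prints 7: seven matches
```

the two nested loops over an n-by-m table are the quadratic cost that, as noted above, limits alignment for long sequences and motivates alignment-free metrics.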
phylogenetic tree generation typically begins with sequence alignment, and the resulting tree reveals how the alignment influences tree formation. alignment-based methodologies are probably the most widely used tools in sequence analysis [ ]. they consist of arranging two sequences, one on top of the other, to highlight their common symbols and substrings. an alignment method is based on alignment parameters, including insertions, deletions and gaps, which play a pivotal role in the construction of the phylogenetic tree. a phylogenetic tree is formed as an outcome of sequence analysis performed on dna or rna strings [ ]. sequence comparison reveals patterns of shared history between species, helping in the prediction of ancestral states. the comparison of sequences also helps in understanding the biology of living organisms, which is required to find similarities and relationships among species. for sequence comparison, we can follow alignment-based or alignment-free methods [ , , ]. sequence alignment is a method to identify homologous sequences. it is categorized as pairwise alignment, in which only two sequences are compared at a time, or multiple sequence alignment, in which more than two sequences are compared. alignment-based methods can be global or local [ , ]. these alignment-based algorithms can also be used with distance methods to express the similarity between two sequences, reflecting the number of changes in each sequence. figure gives a hierarchical view of various methods for phylogenetic tree building. the character-based methods compare all sequences simultaneously, considering one character/site at a time; these are maximum parsimony and maximum likelihood. these methods use probability and consider variation in a set of sequences [ ]. both approaches select the tree with the best score, i.e. the one requiring the smallest number of changes to explain the alignment.
the maximum parsimony method suffers badly from long-branch attraction and gives little information about branch lengths [ ]. in such cases, if two external branches are separated by short internal branches, an incorrect tree results. some of the salient features of character-based methods are given in table . distance-based methods use the dissimilarity (the distance) between two sequences to construct trees. they are much less computationally intensive than the character-based methods and are mostly accurate, as they take mutations into account. for tree generation, hierarchical clustering is generally used, in which dendrograms (clusters) are created. table briefly compares various phylogenetic tree construction methods. multiple alignment of related sequences often yields the most helpful information on their phylogeny. however, it can produce incorrect results when applied to more divergent sequence rearrangements [ ]. some computationally intensive multiple alignment methods align sequences strictly in the order in which they receive them. multiple sequence alignment methods emphasize that more closely related sequences should be aligned first; sequences less related to one another, however, may be clustered separately even when they share a common ancestor [ ]. this implies that they can be more accurately aligned, but it may result in an incorrect phylogeny. alignment can provide an optimized tree if a recursive approach is followed; however, this increases the complexity of the problem. if the differences among sequence lengths are very large, alignment performance significantly impacts tree generation. the use of dynamic programming in alignment makes computation more complicated, and iterative steps limit its utility for large datasets. therefore, consistent efforts have been made to develop and improve multiple sequence alignment methods that support variable-length sequences with high accuracy and that align larger numbers of sequences simultaneously.
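the distance-based hierarchical clustering mentioned above can be sketched with a minimal upgma (average-linkage) implementation that merges the closest pair of clusters at each step. the taxon labels and the distance matrix are illustrative placeholders, not data from the paper.

```python
# minimal upgma (average-linkage) clustering: repeatedly merge the closest pair
# of clusters and update distances as leaf-count-weighted averages.
def upgma(labels, pair_dist):
    # pair_dist: {(i, j): distance} over leaf indices i < j
    clusters = {i: (labels[i], 1) for i in range(len(labels))}  # id -> (subtree, n_leaves)
    dist = {frozenset(k): v for k, v in pair_dist.items()}
    next_id = len(labels)
    while len(clusters) > 1:
        a, b = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda p: dist[frozenset(p)])
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:  # average-linkage distance update
            dist[frozenset((next_id, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    return next(iter(clusters.values()))[0]

# illustrative 3-taxon distance matrix: seq1 and seq2 are the closest pair
tree = upgma(["seq1", "seq2", "seq3"], {(0, 1): 2.0, (0, 2): 6.0, (1, 2): 6.0})
print(tree)  # the nested tuples form the dendrogram: seq1 and seq2 join first
```

the nested-tuple output is the dendrogram structure that distance-based tree methods produce.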
because of the problems associated with alignment-based phylogeny, the importance of alignment-free methods is apparent [ ]. hence, based on the considerations discussed above, alignment quality affects the relationships represented in a phylogenetic tree. alignment-free methods proposed in recent years can be classified into various categories, as shown in fig. . these include k-tuple methods based on word frequencies, and methods that represent the sequence without using word frequencies, i.e. compression algorithms, probabilistic methods and information theory-based methods. in the k-tuple method, a genetic sequence is represented by a frequency vector of fixed-length subsequences, and similarity or dissimilarity measures are computed from this frequency vector. the probabilistic methods represent the sequences using the transition matrix of a markov chain [ ] of a pre-specified order, and comparison of two sequences is done by finding the distance between the two transition matrices. graphical representation, comprising d or d or even d methods, provides an easy way to view, sort and compare various sequences. graphical representation further helps in recognizing major characteristics among similar biological sequences. as discussed, the k-tuple method uses k-words to characterize the compositional features of a sequence numerically. a biological sequence is numerically converted into a vector or a matrix composed of word frequencies. the k-word frequency provides fast arithmetic and can be applied to full sequences. the problem with the k-tuple method is that a large value of k poses a challenge in computing time and space, and k-word methods underestimate or even ignore the importance of word location.
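the k-tuple representation just described can be sketched as a k-mer frequency vector plus a euclidean distance between two such vectors; the sequences below are illustrative placeholders.

```python
# alignment-free comparison: represent each sequence by its k-mer frequency
# vector over the acgt alphabet and compare vectors with a euclidean distance.
from itertools import product
from math import sqrt

def kmer_freqs(seq, k=2):
    seq = seq.lower()
    counts = {"".join(p): 0 for p in product("acgt", repeat=k)}
    windows = max(len(seq) - k + 1, 1)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases
            counts[word] += 1
    return [counts[w] / windows for w in sorted(counts)]  # fixed key order

def kmer_distance(s1, s2, k=2):
    return sqrt(sum((x - y) ** 2
                    for x, y in zip(kmer_freqs(s1, k), kmer_freqs(s2, k))))

print(kmer_distance("acgtacgt", "acgtacgt"))  # identical sequences -> 0.0
print(round(kmer_distance("aaaaaaaa", "cccccccc"), 3))
```

such pairwise distances can feed directly into the distance-based tree construction discussed earlier; the loss of positional information mentioned above is visible here, since the vector records only how often each word occurs, not where.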
the string-based distance measure uses substring matches with k mismatches. cancer research is considered one of the most significant areas in the medical community. mutations in genomic sequences are responsible for cancer development and increased aggressiveness in patients [ , ]. the combination of all such gene mutations, or progression pathways, across a population can be summarized in a phylogeny describing the different evolutionary pathways [ ]. application of phylogenetic trees can be explored for finding similarities among breast cancer subtypes based on gene data [ , ]. discovery of genes associated with cancer subtypes helps researchers map different pathways and classify cancer subtypes according to their mutations. methods of phylogenetic tree inference have proliferated in cancer genome studies, such as those of breast cancer [ ]. phylogenetics can capture important mutational events among different cancer types; a network approach can also capture tumour similarities. it has been observed in the literature that in cancer, driver genes change cancer progression and even affect the participation of other genes, thus generating a gene interaction network. phylogenetic methods can solve the problem of class prediction by using a classification tree, and they give us a deeper understanding of the biological heterogeneity among cancer subtypes. this paper has focused on the various methods of sequence analysis used to generate phylogenetic trees. the limitations associated with sequence alignment methods led to the development of alignment-free sequence analysis. however, most of the existing alignment-free methods are unable to build an accurate tree, so more refinement of alignment-free methods is required. phylogenetic study is not limited to species evolution, but extends to disease evolution as well. extending phylogenetics to disease diagnosis can give birth to new treatment options and an understanding of disease progression.
use of phylogenetics in the molecular epidemiology and evolutionary studies of viral infections
reconstructing optimal phylogenetic trees: a challenge in experimental algorithmics
editorial: alignment-free methods in computational biology
analyzing dna strings using information theory concepts
modified k-tuple method for the construction of phylogenetic trees
the evolution of tumour phylogenetics: principles and practice
on maximum entropy principle for sequence comparison
a general method applicable to the search for similarity in the amino acid sequence of two proteins
comparison of biosequences
approximate maximum parsimony and ancestral maximum likelihood
phylogenetic trees in bioinformatics
weighted relative entropy for phylogenetic tree based on -step markov model
phylooncology: understanding cancer through phylogenetic analysis
tumor classification using phylogenetic methods on expression data
novel gene selection method for breast cancer classification
sequence analysis integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification
hidden markov models with applications to dna sequence analysis
is multiple-sequence alignment required for accurate inference of phylogeny?
constructing phylogenetic trees using multiple sequence alignment
phylogenetic trees in bioinformatics
constructing phylogenetic trees using maximum likelihood
applications and algorithms for inference of huge phylogenetic trees: a review
upgma clustering revisited: a weight-driven approach to transitive approximation
an improved phylogenetic tree comparison method
neighbor-net: an agglomerative method for the construction of phylogenetic networks
bioinformatics: a practical guide to the analysis of genes and proteins
sequence similarity using composition method
sequence analysis kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison

key: cord- -ef svn f authors: saitou, naruya title: eukaryote genomes date: - - journal: introduction to evolutionary genomics doi: . / - - - - _ sha: doc_id: cord_uid: ef svn f

general overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk dnas. we then discuss the evolutionary features of eukaryote genomes, such as genome duplication, the c-value paradox, and the relationship between genome size and mutation rates. genomes of multicellular organisms, plants, fungi, and animals are then briefly discussed. duplications sometimes occur in eukaryotes, especially in plants and in vertebrates, but genome duplication is so far not known for prokaryotic genomes. because the gene number of typical eukaryotic genomes is much larger than that of prokaryotes, there are many genes shared among most eukaryote genomes but absent from prokaryote genomes. some examples are listed in table . . for example, myosin is located in animal muscle tissues, and its homologous protein exists in the cytoskeleton of all eukaryotes, but is not found in prokaryotes. recently, kryukov et al. ( ; [ ]) constructed a new database of oligonucleotide sequence frequencies and conducted a series of statistical analyses.
frequencies of all possible - oligonucleotides were counted for each genome, and these observed values were compared with expected values computed from the observed oligonucleotide frequencies of length - . deviations from the expected values were much larger for eukaryotes than for prokaryotes, except for fungal genomes. figure . shows the distribution of the deviation for various organismal groups. the biological reason for this difference is not known. there are two major types of organelles in eukaryotes: mitochondria and plastids. figure . shows schematic views of mitochondria and chloroplasts. these two organelles have their own independent genomes. this suggests that they were initially independent organisms which started intracellular symbiosis with primordial eukaryotic cells. because most eukaryotes have mitochondria, the ancestral eukaryotes, a lineage that emerged from archaea, most probably started intracellular symbiosis with the mitochondrial ancestor. the parasitic rickettsia prowazekii is so far phylogenetically closest to mitochondria [ ], and a rickettsia-like bacterium is the best candidate for the mitochondrial ancestor. however, there is an alternative "hydrogen hypothesis" [ ]. plastids include chloroplasts, leucoplasts, and chromoplasts and exist in land plants, green algae, red algae, glaucophyte algae, and some protists such as euglenoids. mitochondrial genome sizes of some representative eukaryotes are listed in table . . most animal mitochondrial genomes are less than kb, and the protist and fungal mitochondrial genomes are somewhat larger. the mitochondrial genome size of plants is much larger than those of other eukaryotic lineages, yet it is mostly less than kb. an ancestral eukaryotic cell, probably an archaean lineage, hosted a bacterial cell, and intracellular symbiosis started.
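the observed-versus-expected oligonucleotide comparison described at the start of this passage can be sketched in its simplest form: dinucleotide counts compared with expectations derived from mononucleotide frequencies alone (a zero-order background model). the input sequence is an illustrative placeholder, not data from the database.

```python
# compare observed dinucleotide counts with counts expected if adjacent bases
# were independent (expectation from mononucleotide frequencies alone).
from collections import Counter

def dinucleotide_deviation(seq):
    seq = seq.lower()
    n = len(seq)
    mono = Counter(seq)                                  # mononucleotide counts
    di = Counter(seq[i:i + 2] for i in range(n - 1))     # dinucleotide counts
    ratios = {}
    for pair, observed in di.items():
        # expected count over n-1 windows if adjacent bases were independent
        expected = (mono[pair[0]] / n) * (mono[pair[1]] / n) * (n - 1)
        ratios[pair] = observed / expected  # >1 over-, <1 under-represented
    return ratios

dev = dinucleotide_deviation("acgtacgtacgtacgt")  # placeholder sequence
print({pair: round(r, 2) for pair, r in sorted(dev.items())})
```

extending the word length and the background order gives the kind of genome-wide deviation statistics whose distributions separate eukaryotes from prokaryotes in the analysis above.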
initially, archaea and bacteria shared genes responsible for basic metabolism, and the situation was a sort of gene duplication for many genes, though the homologous genes were not identical but had already diverged long ago. in any case, division of labor followed, and only limited metabolic pathways were left in the bacterial system, which eventually became mitochondria. animal mitochondrial genomes contain a very small number of genes: for peptide subunits, for trna, and for rrna [ ] . [table . lists genome sizes (kb) of representative animal species' mitochondrial dna genomes, e.g., homo sapiens (human) and takifugu rubripes (torafugu fish).] although most vertebrate mitochondrial dna genomes have the same gene order as in human ( fig. . a ), gene order may vary from phylum to phylum. yet the gene content and the genome size are more or less constant among animals. it is not clear why animal mitochondrial genomes are so small. one possibility is that animal individuals are highly integrated compared to fungi and plants, and this might have influenced a drastic reduction of the mitochondrial genome size. another interesting feature of animal mitochondrial dna genomes is the heterogeneous rate of gene order change. for example, platyhelminthes exhibit great variability in mitochondrial gene order (sakai and sakaizumi, ; [ ] ). in contrast, plant mitochondrial genomes are much larger (see table . ). figure . shows the genome structure of the tobacco mitochondrial genome (from sugiyama et al. ; [ ] ). horizontal gene transfers are also known to occur in plant mitochondrial dnas, even between remotely related species [ ] . the melon ( cucumis melo ) mitochondrial genome size, ca. . mb, is exceptionally large, and recently its draft genome was determined [ ] . interestingly, the melon mitochondrial genome resembles the vertebrate nuclear genome in its contents, in spite of its genome size being similar to that of bacteria. the protein coding gene region accounted for only .
% of the genome, and about half of the genome is composed of repeats. the remaining part is mostly homologous to melon nuclear dna, and . % is homologous to melon chloroplast dna. most of the protein coding genes of melon mitochondrial dna are highly similar to those of its relatives in the cucurbit family, watermelon and squash, whose mitochondrial genome sizes are kb and kb, respectively. this indicates that the huge expansion of its genome size occurred only recently. interestingly, cucumber ( cucumis sativus ), a congeneric species, also has a ~ . -mb mitochondrial genome with many repeat sequences [ ] . it will be interesting to study whether the increases of the mitochondrial genomes of melon and cucumber were independent or not. chloroplasts exist only in plants, algae, and some protists. they may change to leucoplasts and chromoplasts. because of this, the generic name "plastids" may also be used. the origin of chloroplasts seems to be a cyanobacterium that started intracellular symbiosis, as in the case of mitochondria. a unique but common feature of chloroplast genomes is the existence of inverted repeats [ ] , and they mainly contain rrna genes. chloroplast dna contents may [ ] . chloroplast genomes were determined for more than species as of december [ ] . their genome sizes range from kb ( rhizanthella gardneri ) to kb ( floydiella terrestris ). although the largest chloroplast genome is still much smaller than a typical bacterial genome, its average intergenic length is kb, much longer than that of bacterial genomes. fractions of mitochondrial dna may sometimes be inserted into nuclear genomes, and they are called "numts." an extensive analysis of the human genome found over numts [ ] . their sequence patterns are random in terms of mitochondrial genome locations. this suggests that mitochondrial dnas themselves were inserted, not via cdna reverse-transcribed from mitochondrial mrna.
a possible source is sperm mitochondrial dna that was fragmented after fertilization [ ] . the reverse direction, from nucleus to mitochondria, was observed in melon, as discussed in subsection . . . an intron is a dna region of a gene that is eliminated during splicing after transcription of a long precursor mrna molecule. introns were discovered by phillip a. sharp and richard j. roberts in as "intervening sequences" [ ] , but the name "intron" coined by walter gilbert in [ ] is now widely used. it should be noted that some descriptions of introns by kenmochi [ ] were used in writing this section. there are various types of introns, but they can be classified into two groups: those requiring spliceosomes (spliceosome type) and the self-splicing type. figure . shows the splicing mechanisms of these two major types. most introns in nuclear genomes of eukaryotes are of the spliceosome type, and there are the common gu-ag type and the rare au-ac type, depending on the nucleotide sequences of the intron-exon boundaries [ ] . the spliceosomes involved in these two types differ [ ] . self-splicing introns are divided into three groups: groups i, ii, and iii. group i introns exist in organellar and nuclear rrna genes of eukaryotes and in prokaryotic trna genes. group ii introns are found in organellar and some eubacterial genomes. cavalier-smith [ ] suggested that spliceosome-type introns originated from group ii introns because of their similarity in splicing mechanism and the structural similarity between group ii introns and spliceosomal rna. group iii introns exist in organellar genomes, and their splicing system is similar to that of group ii introns, though they are smaller and have a unique secondary structure. there is yet another type of intron, which exists only in trnas of single-celled eukaryotes and archaea [ ] . these introns do not have self-splicing functions, but an endonuclease and an rna ligase are involved in their splicing.
the location of this type of intron is often at a certain position of the trna anticodon loop. after the discovery of introns, their probable functions and evolutionary origin were long argued (e.g., [ , ] ). because self-splicing introns can occur at any time, even in the very early stage of the origin of life, we consider only spliceosome-type introns. for brevity, we hereafter call this type of intron simply "intron." there are two major hypotheses: introns early and introns late. the former claims that exons existed as functional units from the common ancestor of prokaryotes and eukaryotes, and "exon shuffling" was proposed as a mechanism for creating new protein functions [ ] . introns, which separate exons, should then also be of quite ancient origin [ , ] . in contrast, introns are considered to have emerged only in the eukaryotic lineage according to the introns-late hypothesis [ , ] . the protein "module" hypothesis proposed by go [ ] is related to the introns-early hypothesis. the pattern of intron appearance and loss has been estimated by various methods (e.g., [ , ] ). kenmochi and his colleagues analyzed introns of ribosomal proteins of mitochondrial genomes and eukaryotic nuclear genomes in detail [ - ] . these studies supported the introns-late hypothesis, because introns in mitochondrial and cytosolic ribosomal proteins seem to have independent origins and introns seem to have emerged in many ribosomal protein genes after eukaryotes appeared. introns do not code for amino acid sequences by definition. in this sense, most introns may be classified as junk dna (see the next section). there are, however, evolutionarily conserved regions in introns, suggesting the existence of some functional roles for introns. ohno ( ; [ ] ) proclaimed that most of mammalian genomes are nonfunctional and coined the term "junk dna." with the advent of eukaryotic genome sequence data, it is now clear that he was right. there is in fact a great deal of junk dna in eukaryotic genomes.
junk dna, or nonfunctional dna, can be divided into repeat sequences and unique sequences. repeat sequences are either of dispersed type or tandem type. unique sequences include pseudogenes that keep homology with functional genes. prokaryote genomes sometimes contain insertion sequences; however, this kind of dispersed repeat constitutes the major portion of many eukaryotic genomes. these interspersed elements are divided into two major categories according to their lengths: short ones (sines) and long ones (lines). one well-known example of a sine is the alu element in primate genomes. it is about bp in length and originated from the 7sl rna gene. let us look at a real alu element sequence from the human genome sequence. if we retrieve the ddbj/embl/genbank international sequence database accession number ap (a part of chromosome ), there are alu elements among the kb sequence. the density is . alu elements per kb. if we consider the whole human genome of ~ billion bp, alu repeats are expected to exist in ~ . million copies. one example of an alu sequence from this entry, coordinates to , is shown below: ggcgggagcg atggctcacg cctgtaatgc cagcactttg ggaggccgag gtgggtggat cacaaggtca ggagatagag accatcctgg ctaacacggt gaaacactgt ctctactaaa aacacaaaaa actagccagg cgtggtggcg ggtgcctgta atcccagcta ctcgggaggc tgaggcagga gaatggtgtg aacccaggaa gtggagcttg cagtgagctc agattgcgcc actgcactcc agcctgggtg acagagtgag actccatctc aaaaaaaata aaataaataa aaaaaa if we do a blast homology search (see chap. ) using the ddbj system ( http://blast.ddbj.nig.ac.jp/blast/blastn ) targeted to nonhuman primate sequences (pri division of the ddbj database), the best hit is obtained from chimpanzee chromosome , which is orthologous to human chromosome . i suggest that interested readers try this homology search as practice. alu elements were first classified into j and s subfamilies [ ] .
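the density arithmetic in this passage (elements per kb in a sampled database entry, extrapolated to the whole genome) can be written out explicitly. the numbers below are hypothetical placeholders, since the chapter's actual counts are elided here, and the assumption of uniform alu density across the genome is only a rough first approximation.

```python
def extrapolate_copies(n_elements, window_bp, genome_bp):
    """Extrapolate a repeat-element copy number from a sampled window to the
    whole genome, assuming the window's density is genome-wide typical."""
    density_per_kb = n_elements / (window_bp / 1_000)  # elements per kb
    return density_per_kb * (genome_bp / 1_000)        # expected genome-wide copies

# hypothetical figures: 40 Alu elements in a 160-kb entry, ~3.2-Gb genome
copies = extrapolate_copies(n_elements=40, window_bp=160_000, genome_bp=3_200_000_000)
```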
the reason for the choice of the two characters (j and s) is not clear, but the two authors (jurka and smith) probably used the initials of their surnames. in any case, this division was based on the distance from the alu consensus sequence; alu elements closer to the consensus were classified as s and the others as j. later, a subset of the s subfamily was found to be highly similar to each other, and it was named y after "young," for these elements appeared at a relatively recent age. rough estimates of the divergence times of alu elements are as follows: the j subfamily appeared about million years ago, the s subfamily separated from j at million years ago, followed by the further separation of y at million years ago [ ] . figure . shows the overall pattern of alu element evolution (based on [ ] ). tandemly repeated sequences are also abundant in eukaryotic genomes, and the representative ones are heterochromatin regions. heterochromatins are highly condensed nonfunctional regions in nuclear dna, in contrast to euchromatins, in which many genes are actively transcribed. heterochromatins usually reside at telomeres, the terminal parts of chromosomes, and at centromeres, internal parts of chromosomes that connect to spindle fibers during cell division. telomeric regions of more than mb in arabidopsis thaliana were found to be tandem repeats of a ca. -bp repeat unit [ , ] . the nucleotide sequence below is the arabidopsis thaliana tandemly repeated sequence ar (international sequence database accession number x ): aagcttcttc ttgcttctca atgctttgtt ggtttagccg aagtccatat gagtctttgt ctttgtatct tctaacaagg aaacactact taggctttta ggataagatt gcggtttaag ttcttatact taatcataca catgccatca agtcatattc gtactccaaa acaataacc the human genome also has a similar but nonhomologous sequence in centromeres, called "alphoid dna," with a -bp repeat unit [ ] .
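the j/s division by distance from the consensus can be illustrated with a toy classifier: count mismatches against an aligned consensus and apply a cutoff. this is a sketch only; real alu subfamily assignment uses diagnostic positions and proper alignment, and the threshold below is hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length, aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))

def classify_alu(element, consensus, threshold):
    """Toy J/S split: elements within `threshold` mismatches of the consensus
    are called 'S' (closer to consensus), the rest 'J'."""
    return "S" if hamming(element, consensus) <= threshold else "J"
```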
the following is the sequence (international sequence database accession number m ): catcctcaga aacttctttg tgatgtgtgc attcaagtca cagagttgaa cattcccttt cgtacagcag tttttaaaca ctctttctgt agtatctgga agtgaacatt aggacagctt tcaggtctat ggtgagaaag gaaatatctt caaataaaaa ctagacagaa g if we do a blast homology search (see chap. ) against the human genome sequences of the ncbi database, there is no hit with this alphoid sequence. this clearly shows that the human genome sequences currently available are far from complete, for they do not include most of these tandem repeat sequences. telomeres of the human genome are composed of hundreds of -bp repeats, ttaggg. if we search the human genome using a -bp-long tandem repeat of this repeat unit as query with ncbi blast, many hits are obtained. as we already discussed in chap. , authentic pseudogenes have no function, and they are genuine members of the junk dna. when a gene duplication occurs, one of the two copies often becomes a pseudogene. because gene duplication is prevalent in eukaryote genomes, pseudogenes are also abundant. pseudogenes are, by definition, homologous to functional genes. however, over a long evolutionary time, many selectively neutral mutations accumulate in pseudogenes, and eventually they will lose sequence homology with their functional counterparts. there are many unique sequences in eukaryote genomes, and the majority of them may be this kind of homology-lost pseudogene. a long rna is initially transcribed from a genomic region having an exon-intron structure, and then the rna segments corresponding to introns are spliced out. these leftover rnas may be called "junk" rnas, for they will soon be degraded by rnase. only a limited set of genes is transcribed in each tissue of multicellular organisms, but leaky expression of some genes may happen in tissues in which these genes should not be expressed. again, these transcripts are "junk" rnas, and they are swiftly decomposed.
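the telomere-repeat search described above amounts to looking for tandem runs of the ttaggg unit. a regular expression finds perfect runs, as in the sketch below; real telomeric dna also contains degenerate repeat variants, which this simple version ignores.

```python
import re

def longest_telomeric_run(seq, unit="TTAGGG"):
    """Length, in repeat units, of the longest perfect tandem run of `unit`."""
    runs = re.findall(f"(?:{unit})+", seq.upper())
    return max((len(r) // len(unit) for r in runs), default=0)
```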
a series of studies (e.g., [ , ] ) claimed that many noncoding dna regions are transcribed. however, van bakel et al. [ ] showed that most of them were artifacts of the chip-chip technologies used in these studies. if a nonsense or frameshift mutation occurs in a protein coding sequence, that gene cannot make proteins. yet its mrna may be produced continuously until the promoter or its enhancer becomes nonfunctional. in this case, this sort of mutated gene produces junk rnas. if only a small quantity of an rna is found in cells and it is not evolutionarily conserved, it is probably some kind of junk rna. as junk dnas and junk rnas exist, cells may also have "junk" proteins. if mature mrnas are not produced in the expected way, various aberrant mrna molecules will be produced, and ribosomes will try to translate them to peptides based on this wrong mrna information. proteins produced in this way may be called "junk" proteins, for they often have little or no function. even if a protein is correctly translated and is moved to its expected cellular location, it can still be considered a "junk" protein. one good example is the abcc transporter protein of dry-type cerumen (earwax), for one nonsynonymous substitution in this gene causes that protein to be essentially nonfunctional [ ] . there are various genomic features that are specific to eukaryotes other than the existence of introns and junk dnas, such as genome duplication, rna editing, the c-value paradox, and the relationship between genome size and mutation rates. we will briefly discuss them in this section. the most dramatic and influential change of genome structure is genome duplication. genome duplications are also called polyploidization, but this term is tightly linked to karyotypes or chromosome constellations. prokaryotes are so far not known to experience genome duplications, which are restricted to eukaryotes.
interestingly, genome duplications are quite frequent in plants, while they are relatively rare in the other two multicellular eukaryotic lineages. an ancient genome duplication was found from the genome analysis of baker's yeast [ ] , and rhizopus oryzae , a basal lineage fungus, was also found to have experienced a genome duplication [ ] . among protists, paramecium tetraurelia is known to have experienced at least three genome duplications [ ] . because we humans belong to the vertebrates and two rounds of genome duplication occurred in the common ancestor of vertebrates (see chap. ), we may be inclined to think that genome duplications happen often in many animal species. this is not the case. so far, only vertebrates and some insects are known to have experienced genome duplications. the reason for this scattered distribution of genome duplication occurrences is not known. if we plot the number of synonymous substitutions between duplogs in one genome, it is possible to detect a relatively recent genome duplication. this is because all genes duplicate when a genome duplication occurs, while only a small number of genes duplicate in other modes of gene duplication (see chap. ). figure . shows a schematic view of the two cases: with and without genome duplication. lynch and conery ( ; [ ] ) applied this method to various genome sequences and found that the arabidopsis thaliana genome showed a clear peak indicative of a relatively recent genome duplication, while the genome sequences of the nematode caenorhabditis elegans and the yeast saccharomyces cerevisiae showed curves of exponential decrease. it is interesting to note that before the genome sequence was determined, the genome duplication was not known for arabidopsis thaliana, while the genome of saccharomyces cerevisiae was later shown to have been duplicated a long time ago [ ] . when a genome duplication occurred in some ancient time, the number of synonymous substitutions may become saturated and cannot give an appropriate result.
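the duplog-based test described above can be sketched numerically: bin the pairwise synonymous-substitution (ks) values between duplicate gene pairs and look for a secondary peak away from zero. this is a deliberately crude illustration of a lynch-and-conery-style analysis, not their actual method; real analyses fit mixture models and correct for saturation.

```python
from collections import Counter

def ks_histogram(ks_values, bin_width=0.1):
    """Bin pairwise synonymous-substitution (Ks) values into fixed-width bins."""
    return Counter(round(int(k / bin_width) * bin_width, 10) for k in ks_values)

def has_secondary_peak(hist):
    """Crude test: under steady small-scale duplication, Ks counts decay from
    the first bin; a bin rising above both neighbors away from zero hints at
    a burst of duplicates, e.g. a genome duplication."""
    bins = sorted(hist)
    counts = [hist[b] for b in bins]
    return any(counts[i - 1] < counts[i] > counts[i + 1]
               for i in range(1, len(counts) - 1))
```

a genome-duplication-like sample (a bump of pairs at an old ks value) triggers the detector, while a smoothly decaying sample does not.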
in this case, the number of amino acid substitutions may be used, even though each protein may have a different rate of amino acid substitution. in any case, the accumulation of mutations will eventually cause two homologous genes to become dissimilar to each other. therefore, although the possibility of genome duplications in prokaryotes has so far been rejected [ ] , it is not possible to infer remote past events simply by searching for sequence similarity. we should be careful before reaching a final conclusion. modification of particular rna molecules after they are produced via transcription is called rna editing. all three major rna molecules (mrna, trna, and rrna) may experience editing [ ] . there are various patterns of rna editing; substitutions, in particular between c and u, and insertions and deletions, particularly of u, are mainly found in eukaryote genomes. guide rna molecules exist in one of the main rna editing mechanisms, and they specify the location of editing, but there are some other mechanisms [ ] . it is not clear how the rna editing mechanism evolved. tillich et al. [ ] studied chloroplast rna editing and concluded that many nucleotide sites of the chloroplast dna genome suddenly started to undergo rna editing, but later the sites experiencing rna editing constantly decreased via mutational changes. they claimed that there was no involvement of rna editing in gene expression. this result does not give rna editing a positive significance. because there are many types of rna molecules inside a cell, there also exist many sorts of enzymes that modify rnas. it may be possible that some of them suddenly started to edit rnas via a particular mutation. rna editing which did not cause deleterious effects to the genome may have survived by chance in the initial phase. this view suggests the involvement of a neutral evolutionary process in the evolution of rna editing. organisms with complex metabolic pathways have many genes. multicellular organisms are such examples.
generally speaking, their genome sizes are expected to be large. in contrast, viruses, whose genomes contain only a handful of genes, have small genome sizes. therefore, their possibilities for genome evolution are rather limited. even if amino acid sequences are rapidly changing because of high mutation rates, protein function may not change. unless the gene number and genome size increase, viruses cannot evolve their genome structures. it is thus clear that an increase in genome size is crucial to produce the diversity of organisms. however, genomes often contain dna regions which are not indispensable. organisms with large genome sizes have many such junk dna regions. because of their existence, the genome size and the gene number are not necessarily highly correlated. this phenomenon was historically called the c-value paradox (e.g., [ ] ), after the constancy of the haploid dna amount within one species was found, yet these values were found to vary considerably among species at around (e.g., [ - ] ). "c-value" is the amount of haploid dna, and c probably stands for "constant" or "chromosomes." we now know that the majority of eukaryote genome dna is junk, and there is no longer a paradox in c-values among species. one study ([ ]) found conserved noncoding dna sequences (cnss) in insects, nematodes, and yeasts by comparing closely related species. we will discuss more on conserved noncoding sequences of vertebrates in chap. . as for plants, kaplinsky ([ ]) compared genome sequences of arabidopsis, grape, rice, and brachypodium and found > times more abundant cnss in monocots than in dicots. hettiarachchi and saitou ([ ]) compared genome sequences of plant species and searched for lineage-specific cnss. they found and cnss shared by all vascular plants and angiosperms, respectively, and also confirmed that monocot cnss are much more abundant than those of dicots. what kind of relationship exists between genome size and mutation rate?
if all the genetic information contained in the genome of one organism is necessary for the survival of that organism, an individual will die even if only one gene of its genome loses its function by mutation. an organism with a small genome size, and hence with a small number of genes, such as a virus, can survive even if the mutation rate is high. in contrast, organisms with many genes may not be able to survive if highly deleterious mutations happen often. therefore, such organisms must reduce the mutation rate. however, when the nucleotide substitution type mutation rate per generation was compared with the whole-genome size, lynch ( ; [ ] ) found a positive correlation. more recently, lynch ( ; [ ] ) admitted that for organisms with small-sized genomes, these two values were in fact negatively correlated. however, when large-genome-sized eukaryotes were compared, a positive correlation was observed. we have to be careful when we discuss these two contradictory reports. one considered the rate per physical year, while the other used one generation as the unit. another difference is whether only protein coding gene region dna sizes or whole-genome sizes were used. the relationship between the mutation rate and genome size is not simple. drake et al. ( ; [ ] ) examined this problem and found that the mutation rate per genome per replication was approximately / for bacteria, while mutation rates of multicellular eukaryotes vary between . and per genome per sexual or individual generation. table . shows the list of mutation rates and genome sizes for various organisms. apparently there is no clear tendency. we will discuss genomes of the three multicellular lineages of eukaryotes, plants, fungi, and animals, in this section. unfortunately, there seems to be no common feature of genomes of multicellular organisms, so each lineage is discussed independently. arabidopsis thaliana was the first plant species whose -mb genome was determined, in [ ] . a.
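the correlations discussed above are computed between quantities spanning many orders of magnitude, so they are usually taken on a log scale. the sketch below shows such a computation with plain pearson correlation; the inputs would be genome sizes and mutation rates (per year or per generation, and the choice of unit can flip the sign of the result, as the text notes). this is an illustration, not the computation of lynch or drake et al.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def log_correlation(genome_sizes, mutation_rates):
    """Correlate log10(genome size) with log10(mutation rate)."""
    return pearson([math.log10(g) for g in genome_sizes],
                   [math.log10(u) for u in mutation_rates])
```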
thaliana is a model organism for flowering plants (angiosperms), with a generation time of only months. in spite of its small genome size, only % of the human genome, it has , protein coding genes. the genome sequence of its closely related species, a. lyrata , was also recently determined [ ] . angiosperms are divided into monocots and dicots. a. thaliana is a dicot, and genome sequences of six more species had been determined as of december (see table . ). rice, oryza sativa , is a monocot, and its genome size, ~ mb, is much smaller than that of the wheat genome. its japonica and indica subspecies genomes were determined [ , ] , and the origin of rice domestication is currently in great controversy, particularly regarding single or multiple domestication events (e.g., [ , ] ). the number of protein coding genes in the rice genome is , ~ , [ ] . wheat corresponds to the genus triticum , and there are many species in this genus. the typical bread wheat is triticum aestivum , and it is a hexaploid with ( × ) chromosomes. its genome arrangement is conventionally written as aabbdd [ ] . because it now behaves as a diploid, genomic sequencing of the chromosomes (a -a , b -b , and d -d ) is under way (see http://www.wheatgenome.org/ for the current status). the hexaploid genome structure emerged by hybridization of the tetraploid (aabb) cultivated species t. durum and the diploid (dd) wild species aegilops tauschii [ ] . a genome duplication followed the hybridization. non-seed land plants are ferns, lycophytes, and bryophytes, in order of closeness to seed plants (e.g., [ ] ). a draft genome sequence of a moss, physcomitrella patens , was reported in [ ] , followed by the genome sequencing of a lycophyte, selaginella moellendorffii , in [ ] . these genome sequences of different lineages of plants are deciphering the stepwise evolution of land plants. the genome sequence of baker's yeast ( saccharomyces cerevisiae ) was determined in , the first for a eukaryotic organism [ ] .
there are chromosomes in s. cerevisiae , and its genome size is about mb. there are a total of , genes in its genome: , orfs and , other genes. the genome-wide gc content is %, slightly lower than that of the human genome. the proportion of introns is very small compared to that of the human genome, and the average length of one intron is only bp, in contrast to the , -bp average length of exons [ ] . as we already discussed, the ancestral genome of baker's yeast experienced a genome-wide duplication [ ] . pseudogenes, which are common in vertebrate genomes, are rather rare in the genome of baker's yeast; they constitute only % of the protein coding genes [ ] . baker's yeast is often considered a model organism for all eukaryotes; however, its genome may not be a typical eukaryote genome. as of december , genome sequences of more than fungal species are available (see the ncbi genome list at http://www.ncbi.nlm.nih.gov/genome/browse/ for the present situation). figure . shows the relationship between the genome size and gene number for genomes. there is a clear positive correlation between them. however, there are some outliers. the perigord black truffle ( tuber melanosporum ), shown as a in fig. . , has the largest genome size (~ mb) among the fungal species whose genome sequences have so far been determined, yet its number of genes is only ~ , [ ] . three other outlier species are postia placenta , ajellomyces dermatitidis , and melampsora larici-populina , shown as b, c, and d in fig. . , respectively. interestingly, these four outlier species are not clustered phylogenetically; two belong to pezizomycotina of ascomycota, and the other two to agaricomycotina and pucciniomycotina of basidiomycota. if we exclude these four outlier species, a good linear regression is obtained, as shown in fig. . . this straight line indicates that, on average, one gene corresponds to . kb in a typical fungal genome.
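the linear fit of genome size against gene number described above is an ordinary least-squares regression, with the slope interpretable as kb per gene when x is gene count and y is genome size in kb. the sketch below is illustrative, not the chapter's actual computation, and the numbers in the usage line are hypothetical.

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def predicted_genome_size_kb(n_genes, kb_per_gene):
    """Genome size (kb) predicted from gene count and a fitted kb-per-gene slope."""
    return n_genes * kb_per_gene

# hypothetical: a fitted slope of 3 kb per gene and 7,500 genes predict a 22.5-Mb genome
size_kb = predicted_genome_size_kb(7_500, 3.0)
```

comparing such a prediction with the observed genome size is exactly how the truffle genome's excess of junk dna is inferred in the next paragraph.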
if we apply this average gene size to the truffle genome, its genome size should be ~ mb, but the real size is mb larger. this suggests that there is an unusually large amount of junk dna in this genome. in fact, % of its genome consists of transposable elements [ ] . the truffle genome must still have % more junk dna regions. gains and losses of genes in each branch of the phylogenetic tree for fungal species are shown in fig. . (based on [ ] ). it will be interesting to examine genome sizes of species related to the perigord black truffle, so as to infer the evolutionary period when the genome size expansion occurred. [fig. . legend: the relationship between the genome size and gene numbers among fungi genomes.] a gene system that is responsible for the animal body plan is the hox genes. we thus first discuss this gene system in this subsection. the genome of c. elegans , the first genome determined among animals, will be discussed next, followed by genomes of insects and those of deuterostomes. because genomes of many vertebrate species have been determined, we discuss them in chap. , and in particular the human genome in chap. . hox genes were initially found through studies of homeotic mutations that dramatically change the segmental structure of drosophila, by edward b. lewis [ ] . they code for transcription factors, and a dna-binding peptide, now called the homeobox domain, was later found in almost all animal phyla [ ] . figure . shows the hox gene clusters found in animal groups. there are four hox clusters in mammalian and avian genomes, and they were most probably generated by the two rounds of genome duplication in the common ancestor of vertebrates (see chap. ). interestingly, the physical order of hox genes on chromosomes and the order of gene expression during development correspond, a property called "collinearity" [ ] .
this suggests that some sort of cis-regulation is operating in hox gene clusters, and in fact, many long transcripts are found, and some of their transcription start sites are highly conserved among vertebrates [ ] . figure . shows these highly conserved regions. the hox genes control the expression of different groups of downstream genes, such as transcription factors, elements in signaling pathways, or genes with basic cellular functions. hox gene products interact with other proteins, in particular on signaling pathways, and contribute to the modification of homologous structures and the creation of new morphological structures [ ] . there are other gene families that are thought to be involved in the diversity of animal body plans. one of them is the zic gene family [ ] . the zic gene family exists in many animal phyla with high amino acid sequence homology in a zinc-finger domain called zf, and members of this gene family are involved in neural and neural crest development, skeletal patterning, and left-right axis establishment. this gene family has two additional domains, zoc and zf-bc. interestingly, cnidaria, platyhelminthes, and urochordata lack the zoc domain, and their zf-bc domain sequences are quite diverged compared to arthropoda, mollusca, annelida, echinodermata, and chordata. this distribution suggests that zic family genes with the entire set of the three conserved domains already existed in the common ancestor of bilaterian animals, and some domains may have been lost in parallel in the platyhelminthes, nematodes, and urochordates [ ] . interestingly, the phyla that lost the zoc domain have quite distinct body plans although they are bilaterian. caenorhabditis elegans was the first animal species whose -mb draft genome sequence was determined, in [ ] . this organism belongs to the phylum nematoda, which includes a vast number of species [ ] . brenner ( ; [ ] ) chose this species as a model organism to study the neuronal system, for its short generation time (~ days) and small size (~ mm).
the following description in this section is based on the information given in the online "wormbook" [ ] . there are , protein coding genes in c. elegans , including , alternatively spliced forms, with % confirmed to be transcribed at least partially. the number of trna genes is , and are located on the x chromosome. the three kinds of rrna genes ( s, . s, and s) are located on chromosome i in - tandem repeats, while ~ s rrna genes are also in tandem form but located on chromosome v. the average protein coding gene length is kb, with an average of . coding exons per gene. in total, protein coding exons constitute . % of the whole genome. figure . shows the distribution of protein coding gene lengths, and fig. . the distribution of exon numbers per gene. both distributions have long tails. the median sizes of exons and introns are bp and bp, respectively. intron lengths of c. elegans are quite short compared to those of vertebrate genes (see chap. ). the density of protein coding genes varies among chromosomes, being slightly higher on the five autosomes than on the x chromosome and higher in the central region than at the edges of a chromosome. processed, i.e., intronless, pseudogenes are rare, and a total of pseudogenes were reported in wormbase version ws . about half of them are homologous to functional chemoreceptor genes. genome sequences of four congeneric species of c. elegans ( c. brenneri , c. briggsae , c. japonica , and c. remanei ) have been determined ( http://www.ncbi.nlm.nih.gov/genome/browse/ ). the fruit fly drosophila melanogaster was used by thomas hunt morgan's group in the early twentieth century and has since been used for many genetic studies. because of this importance, its genome sequence was determined first among arthropods, in [ ] . heterochromatin regions of ~ mb were excluded from sequencing [ ] . the genome sizes of the sequenced drosophila species vary from to mb, and their numbers of genes are , - , . interestingly, d .
melanogaster has the largest genome size and the smallest number of genes. A total of insect species other than Drosophila species had been sequenced by the end of [ ]. As of December , their genome sizes are in the range of Mb to Mb, more than a fivefold difference, and the gene numbers range from , to , . Deuterostomes contain five phyla: Echinodermata, Hemichordata, Chaetognatha, Xenoturbellida, and Chordata. The genome of the sea urchin Strongylocentrotus purpuratus [ ] was determined in . Its genome size is Mb with , genes. The genomes of other echinoderms, Lytechinus variegatus and Patiria miniata, are also being sequenced, as is that of the hemichordate Saccoglossus kowalevskii. Chordata is classified into Urochordata (ascidians), Cephalochordata (lancelets or amphioxus), and Vertebrata (vertebrates). Because we will discuss the genomes of vertebrates in Chap. , let us discuss the genomes of ascidians and lancelets only. The genome of the ascidian Ciona intestinalis was determined in [ ], and the genome sequence of its congeneric species, C. savignyi, was determined three years later [ ]. The genome size of C. intestinalis is ~ Mb with ~ , genes. Interestingly, it contains a group of cellulose-synthesizing enzyme genes, which were probably introduced from some bacterial genome via horizontal gene transfer [ , ]. The C. intestinalis genome also contains several genes that are considered to be important for heart development [ ], and this suggests that the hearts of ascidians and vertebrates may be homologous. Through the superimposition of phylogenetic trees (see Chapter A) for five genes coding muscle proteins, Oota and Saitou [ ] estimated that vertebrate heart muscle was phylogenetically closer to vertebrate skeletal muscles. If both results are true, the muscles used in the heart might have been substituted in the vertebrate lineage. The genome sequence of an amphioxus (the cephalochordate Branchiostoma floridae) was determined in by Holland et al.
( ; [ ]), and it provides good outgroup sequence data for vertebrates. Eukaryotic viruses rely on their eukaryotic host species for most metabolic pathways. Therefore, the number of genes in virus genomes is usually very small. For example, influenza A virus has RNA fragments coding for protein genes, and its total genome size is ~ . kb. As in bacteriophages, there are both DNA-type and RNA-type genomes in eukaryotic viruses. Table . shows one example of a classification of eukaryotic viruses based on their genome structure [ ]. The genomes of double-stranded DNA viruses have four types: circular; simple linear; linear with proteins covalently attached to both ends; and linear with both ends closed. The genomes of single-stranded DNA viruses are either circular or linear. RNA genomes are all linear, in both the single- and double-stranded types. Single-stranded RNA genomes are classified into two types: plus strand and minus strand. A subset of the single-plus-strand RNA genome type is experiencing [ ]. Megavirus is phylogenetically close to Mimivirus [ ], a member of the nucleocytoplasmic large DNA viruses, which include pox viruses. Recently, a virus with an even larger genome, Pandoravirus, with a genome of more than . Mb, was discovered [ ]. The phylogenetic status of these large-genome DNA viruses is unknown at this moment.
- Analysis of the genome sequence of the flowering plant Arabidopsis thaliana
- The genome of the cucumber, Cucumis sativus L.
- Draft genome sequence of the oilseed species Ricinus communis
- The genome of black cottonwood, Populus trichocarpa
- The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla
- Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential
- A new database (GCD) on genome composition for eukaryote and prokaryote genome sequences and their initial analyses
- The genome sequence of Rickettsia prowazekii and the origin of mitochondria
- The hydrogen hypothesis for the first eukaryote
- Mitochondrial genome
- The complete mitochondrial genome of Dugesia japonica (Platyhelminthes; order Tricladida)
- The complete nucleotide sequence of the tobacco mitochondrial genome: comparative analysis of mitochondrial genomes in higher plants and multipartite organization
- Widespread horizontal transfer of mitochondrial genes in flowering plants
- Determination of the melon chloroplast and mitochondrial genome sequences reveals that the largest reported mitochondrial genome in plants contains a significant amount of DNA having a nuclear origin
- Small, repetitive DNAs contribute significantly to the expanded mitochondrial genome of cucumber
- The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression
- Changes in the structure of DNA molecules and the amount of DNA per plastid during chloroplast development in maize
- Pattern of organization of human mitochondrial pseudogenes in the nuclear genome
- Why genes in pieces?
- Introns. In Encyclopedia of Evolution. Tokyo: Kyoritsu Shuppan
- Comprehensive splice-site analysis using comparative genomics
- The ever-growing world of small nuclear ribonucleoproteins
- Intron phylogeny: a new hypothesis
- tRNomics: analysis of tRNA genes from genomes of Eukarya, Archaea, and Bacteria reveals anticodon-sparing strategies and domain-specific features
- The origin of introns and their role in eukaryogenesis: a compromise solution to the introns-early versus introns-late debate?
- The evolution of spliceosomal introns: patterns, puzzles and progress
- Genes in pieces: were they ever together?
- Nuclear volume control by nucleoskeletal DNA, selection for cell volume and cell growth rate, and the solution of the DNA C-value paradox
- The recent origins of spliceosomal introns revisited
- Correlation of DNA exonic regions with protein structural units in haemoglobin
- Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution
- New maximum likelihood estimators for eukaryotic intron evolution
- Analysis of ribosomal protein gene structures: implications for intron evolution
- Intron dynamics in ribosomal protein genes
- So much "junk" DNA in our genome
- A fundamental division in the Alu family of repeated sequences
- Whole-genome analysis of Alu repeat elements reveals complex evolutionary history
- Characterization of highly repetitive sequences of Arabidopsis thaliana
- Centromeric repetitive sequences in Arabidopsis thaliana
- Sequence definition and organization of a human repeated DNA
- Empirical analysis of transcriptional activity in the Arabidopsis genome
- Identification and analysis of functional elements in % of the human genome by the ENCODE pilot project
- Most "dark matter" transcripts are associated with known genes
- A SNP in the ABCC gene is the determinant of human earwax type
- Molecular evidence for an ancient duplication of the entire yeast genome
- Genomic analysis of the basal lineage fungus Rhizopus oryzae reveals a whole-genome duplication
- Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia
- Size of the protein-coding genome and rate of molecular evolution
- The evolutionary fate and consequences of duplicated genes
- Comparative genomics in prokaryotes
- Functions and mechanisms of RNA editing
- The evolution of chloroplast RNA editing
- Chromosome structure and the C-value paradox
- La teneur du noyau cellulaire en acide désoxyribonucléique à travers les organes, les individus et les espèces animales (in French)
- Nucleoprotein determination in cytological preparations
- The constancy of deoxyribose nucleic acid in plant nuclei
- Conserved linkage between the puffer fish (Fugu rubripes) and human genes for platelet-derived growth factor receptor and macrophage colony-stimulating factor receptor
- Conserved noncoding sequences are reliable guides to regulatory elements
- Enrichment of regulatory signals in conserved non-coding genomic sequence
- Evolution at two levels: on genes and form
- Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes
- Utility and distribution of conserved noncoding sequences in the grasses
- Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution
- Conserved noncoding sequences in the grasses
- Arabidopsis intragenomic conserved noncoding sequence
- The banana (Musa acuminata) genome and the evolution of monocotyledonous plants
- Computational analysis and characterization of UCE-like elements (ULEs) in plant genomes
- Identification and analysis of conserved noncoding sequences in plants
- Viral mutation rates
- The origins of eukaryotic gene structure
- Evolution of the mutation rate
- Rates of spontaneous mutation
- Analysis of the genome sequence of the flowering plant Arabidopsis thaliana
- The Arabidopsis lyrata genome sequence and the basis of rapid genome size change
- A draft sequence of the rice genome (Oryza sativa L. ssp. japonica)
- A draft sequence of the rice genome
- Phylogeography of Asian wild rice, Oryza rufipogon, reveals multiple independent domestications of cultivated rice, Oryza sativa
- Independent domestication of Asian rice followed by gene flow from japonica to indica
- Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana
- Multigene phylogeny of land plants with special reference to bryophytes and the earliest land plants
- The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants
- The Selaginella genome identifies genetic changes associated with the evolution of vascular plants
- Overview of the yeast genome
- Origin of genome architecture
- Périgord black truffle genome uncovers evolutionary origins and mechanisms of symbiosis
- Master control genes in development and evolution: the homeobox story
- From DNA to diversity
- Evolution of conserved non-coding sequences within the vertebrate Hox clusters through the two-round whole genome duplications revealed by phylogenetic footprinting analysis
- WormBook: the online review of C. elegans biology
- Function and specificity of Hox genes
- A wide-range phylogenetic analysis of Zic proteins: implications for correlations between protein structure conservation and body plan complexity
- Genome sequence of the nematode C. elegans: a platform for investigating biology
- An improved molecular phylogeny of the Nematoda with special emphasis on marine taxa
- The genetics of Caenorhabditis elegans
- The genome sequence of Drosophila melanogaster
- Evolution of genes and genomes on the Drosophila phylogeny
- The genome of the sea urchin Strongylocentrotus purpuratus
- The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins
- Assembly of polymorphic genomes: algorithms and application to Ciona savignyi
- A functional cellulose synthase from ascidian epidermis
- Phylogenetic relationship of muscle tissues deduced from superimposition of gene trees
- Genome science and microorganismal molecular genetics
- Distant Mimivirus relative with a larger genome highlights the fundamental features of Megaviridae
- The . -megabase sequence of Mimivirus
- Ultraconserved elements in the human genome
- Genomu Shinkagaku Nyumon (in Japanese; "Introduction to Evolutionary Genomics")
- The amphioxus genome illuminates vertebrate origins and cephalochordate biology
- Pandoraviruses: amoeba viruses with genomes up to . Mb reaching that of parasitic eukaryotes

key: cord- -u ol fs authors: ogiwara, atsushi; uchiyama, ikuo; seto, yasuhiko; kanehisa, minoru title: construction of a dictionary of sequence motifs that characterize groups of related proteins date: - - journal: protein eng doi: . /protein/ . . sha: doc_id: cord_uid: u ol fs

An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. This procedure is applied to the PIR database, and a dictionary of sequence motifs relating to specific superfamilies is constructed. The motifs are of practical relevance in identifying membership of specific superfamilies, without the need to perform sequence database searches, in % of newly determined sequences.
The sequence motifs identified represent functionally important sites on protein molecules. When multiple blocks exist in a single motif, they are often close together in the 3-D structure. Furthermore, these motif blocks were occasionally found to be split by introns when the correlation with exon structures was examined. When the amino acid sequences of two proteins are similar, the proteins probably belong to the same group of functionally related proteins. Thus, when a new protein sequence is determined, it is customary to perform a database search for similar sequences in the hope of obtaining a clue to its biological function. The search involves pairwise comparisons against individual sequences in the database, which is becoming more time-consuming with the rapid growth in database size. An alternative approach is to search a library of signature patterns, each of which uniquely identifies a group of related proteins. Whether all protein groups can be represented by such diagnostic patterns is arguable, but this approach is certainly more effective because the comparison is made against individual groups rather than individual sequences in the database. It is common knowledge that functionally important sites are well conserved in the amino acid sequences of related proteins. Conserved regions are not necessarily contiguous in the primary structure, because a functional site in the 3-D structure can be composed of separate pieces of conserved segments. The conserved amino acid patterns, often called consensus patterns or sequence motifs (Taylor, ; Hodgman, ), are usually identified by the tedious method of multiply aligning and comparing a group of functionally related sequences. These published motifs are then manually collected, verified and organized in a motif library (Bairoch, ; Seto et al., ).
An additional constraint on the conserved regions is introduced in this study: the uniqueness of the amino acid patterns when compared with all other sequences outside the group. This has enabled the design of an automatic procedure to derive, from the protein sequence database, a collection of signature patterns that uniquely identify specific protein groups. This procedure is applied to the superfamily grouping of the PIR database, and a library of sequence motifs that identifies specific superfamilies is constructed. The amino acid sequences were obtained from the PIR database release . (September ). The PIR database is divided into three sections (two sections before release . ): PIR1, annotated and classified entries; PIR2, preliminary entries; and PIR3, unverified entries. Only the PIR1 section is used when constructing the motif library. Releases . (December ) to . (June ) were also used for comparison purposes. The 3-D coordinates of the protein structures were acquired from the Brookhaven Protein Data Bank (April ).
Functional groups of proteins
Suppose that a protein sequence database is divided into groups, each containing functionally related members, and that diagnostic amino acid patterns are required that uniquely identify membership of each functional group. The PIR superfamily classification is used here to define a protein group, but there may also be other definitions. A superfamily is a group of proteins bearing significant sequence similarity and represents the probable evolutionary relationships of the proteins (Dayhoff, ; Dayhoff et al., ). A protein is not always uniquely assignable to one superfamily, because it can contain multiple domains with different functions. For simplicity, however, the PIR superfamily numbering scheme is used, which assumes that each protein in the database belongs to one, and only one, superfamily.
Dictionary of unique peptide words
A three-step procedure is employed to identify the sequence motifs. The first step involves an exhaustive search for unique peptide words (UPWs), which, in our definition, are short oligopeptide patterns that are well conserved and found exclusively in one protein group. A group is usually a single superfamily, but it can be extended to comprise a few superfamilies. In practice, as illustrated in Figure , we make a tally of all possible tetra-, penta- and hexapeptide patterns in the superfamilies of the PIR database. Let n_s and n_t be the numbers of sequences containing a given pattern in a given superfamily and in the entire database, respectively. The pattern is unique to this superfamily when n_s = n_t. The pattern is conserved when n_s = n_t >= f*m, where m is the number of members belonging to the superfamily and f is the parameter defining the majority. We consider different cases ranging from f = ( % conservation) to f = . ( % conservation). Although the distinction between and % conservation is highly dependent on the superfamily size and the variability of its members, the uniqueness is mostly determined by the size and variability of the entire database. Figure legend: screening of unique peptide words. The figure shows the numbers of sequences containing given tetrapeptide patterns. The superfamily has member sequences, all of which contain the pattern QWYW, while no sequence outside this superfamily possesses this pattern. Thus, this pattern is unique to, and conserved in, the superfamily. The unique pattern WHFV is not % conserved in the superfamily, but this pattern can be detected by setting a lower threshold value for the conservation. In the second step, the order of the unique peptide words in each sequence of a given group is examined and a consensus pattern constructed.
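As a concrete illustration, the screening step can be sketched as a k-mer tally over superfamily groups, applying the two stated criteria (uniqueness, n_s = n_t; conservation, n_s >= f*m). This is a minimal sketch, not the authors' implementation; the function name, the data layout and the use of exact matches only are assumptions.

```python
from collections import defaultdict

def screen_upws(superfamilies, k_range=(4, 5, 6), f=0.9):
    """Find unique peptide words (UPWs): k-mers that occur only within
    one superfamily (n_s == n_t) and are conserved in at least a
    fraction f of its members (n_s >= f * m).

    superfamilies: dict mapping a superfamily id to a list of sequences.
    Returns a dict mapping superfamily id -> list of UPW strings."""
    occurrences = defaultdict(lambda: defaultdict(set))  # word -> sf -> member indices
    for sf, seqs in superfamilies.items():
        for i, seq in enumerate(seqs):
            for k in k_range:
                for pos in range(len(seq) - k + 1):
                    occurrences[seq[pos:pos + k]][sf].add(i)

    upws = defaultdict(list)
    for word, families in occurrences.items():
        if len(families) != 1:
            continue  # found in more than one superfamily: not unique (n_s < n_t)
        (sf, members), = families.items()
        n_s = len(members)            # member sequences containing the word
        m = len(superfamilies[sf])    # superfamily size
        if n_s >= f * m:              # conservation threshold
            upws[sf].append(word)
    return upws
```

With f = 1.0 this reproduces the 100% conservation case of the paper's QWYW example; lowering f admits patterns such as WHFV that some members lack.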
As illustrated in Figure , each amino acid sequence is converted into an abstract structure, which may be called a unique peptide sentence, consisting of the UPW pattern numbers and the numbers of residues separating the first residues of two successive UPWs. One amino acid mutation is allowed when searching for the occurrence of each UPW pattern. When the separation is smaller than the length of the preceding UPW, there is actually an overlap between the two UPWs, as with patterns and in Figure . From a set of these sentences, some of which may lack specific UPWs and some of which may contain duplicates of the same UPW, a consensus sentence is constructed. This is a multiple alignment problem, and an approximate procedure was devised by combining pairwise alignments. The optimal pairwise alignment can be obtained by the following dynamic programming algorithm, which is similar to the RNA secondary structure prediction algorithm (Waterman and Smith, ; Kanehisa and Goad, ): s_ij = max over k and l of [ s_kl + w(p_i, p_j) - g.p.(i,j,k,l) ], where s_ij is the score up to the ith pattern p_i and the jth pattern p_j, g.p. is the gap penalty and w is the weight for a match of two patterns. The resulting consensus pattern is represented by the order of the UPWs with the upper- and lower-bound numbers of residues separating two successive UPWs (Figure ). The consensus pattern obtained in the previous step is represented by blocks of amino acid patterns, which we call motif blocks, separated by the upper- and lower-bound numbers of residues in the spacer region, as follows: <motif block1> [min_spacer, max_spacer] <motif block2>. As shown in Figure , this consensus is used again in the last step to compare against each sequence in the group, to identify substitution patterns and to determine whether each block is conserved in all sequences. In practice, it is first decided whether a particular block exists or not, given the minimum fraction of matched residues, r, that constitute a block. Then, all substitution patterns are recorded.
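The pairwise step of the recurrence above can be illustrated with a simplified alignment over UPW identifiers. This sketch reduces the spacer-dependent gap penalty g.p.(i,j,k,l) of the published recurrence to a constant gap cost; the function name and the weights are assumptions, not the published code.

```python
def align_upw_sentences(a, b, w=2.0, gap=0.5):
    """Align two 'unique peptide sentences' (lists of UPW pattern ids)
    with a Needleman-Wunsch-style recurrence:
        s[i][j] = max(s[i-1][j-1] + (w if a[i] == b[j] else -w),
                      s[i-1][j] - gap,
                      s[i][j-1] - gap)
    and return the highest-scoring common ordering of shared UPWs."""
    n, m = len(a), len(b)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (w if a[i - 1] == b[j - 1] else -w)
            s[i][j] = max(diag, s[i - 1][j] - gap, s[i][j - 1] - gap)
    # trace back to recover the consensus order of shared UPWs
    consensus, i, j = [], n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and s[i][j] == s[i - 1][j - 1] + w:
            consensus.append(a[i - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return consensus[::-1]
```

In the full procedure this pairwise step would be applied repeatedly, combining the member sentences of a superfamily into one consensus ordering.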
In the representation of our motif library, the plus sign designates that the block is conserved in all members of the group, while the minus sign indicates that some members lack the block. Substitution patterns are enclosed in braces. Figure legend: an illustration of how the sequence motif is constructed from unique peptide words. First, the locations of the unique peptide words in a given superfamily are examined for all member sequences. Then the consensus ordering of the unique peptide words is obtained by a dynamic programming algorithm. The PIR1 database release . contains sequences, totalling residues, classified into superfamilies. The relatively large superfamilies containing a set number of member sequences were considered. When the minimum value for the size of a superfamily in release . was defined as three or five members, there were or superfamilies, respectively. As summarized in Table I, our procedure identified sequence motifs that characterized > % of these superfamilies when the degree of conservation, f, was set at or %. The motif library constructed with a minimum superfamily size of five members and f = % contained sequence motifs (Table I). Of the motifs, were characterized by single blocks, while the rest contained multiple blocks, as shown in Table II. A complete listing of the motifs containing < blocks is given in the Appendix. Substitution patterns are obtained when r = %. As each new release of the PIR1 database is produced, the motif library can be reconstructed by this automatic procedure. However, a long computation time is required because of the many hexapeptide patterns in the initial screening of the UPWs. When the libraries shown in Table I were constructed without hexapeptide patterns, - % of the superfamilies could not be identified. This was a relatively small loss compared with the gain in computation time.
Superfamily assignment by sequence motifs
A procedure for superfamily assignment was established utilizing our motif library, as follows: (i) begin the search using the first motif block; the criterion for the existence of a motif block is given by the parameter r, which specifies the minimum fraction of matched residues; (ii) if a motif block is found, check whether the next motif block exists after the specified spacer length; and (iii) if a motif block is not found, skip it and continue searching for the next block. The search fails if no motif block is found. In the above procedure, a sequence is considered assigned to a superfamily if any of the motif blocks match. No distinction is made between the conserved (+) and non-conserved (-) blocks. Table III(a) shows the results of this procedure when applied to the PIR database release . , which is the training data set used for constructing the motif library. When the block detection parameter r = %, no entries were falsely assigned (false positives), but entries could not be detected (false negatives) as belonging to one of the superfamilies. At the level of r = %, there were false positives and false negatives. When the false positives were examined in more detail, all resulted from single motif blocks containing substitution patterns. Sequence motifs with multiple blocks, or sequence motifs with single blocks without substitution patterns, could be used safely for superfamily assignment. Next, a test data set was prepared from release . of the PIR database by identifying new entries added after release . . There were cases where several entries in multiple superfamilies were combined into a single superfamily, or entries in a single superfamily were split into different superfamilies. In such cases, the multiple superfamilies are considered to be related, and assignment to a related superfamily is counted as the correct answer. The results using this new data set are summarized in Table III(b).
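The assignment procedure above (anchor on any block that is found, check subsequent blocks within their spacer windows, and skip blocks that are absent) can be sketched as follows. The motif encoding, function names and the example value of r are assumptions for illustration, not the published implementation.

```python
def block_matches(seq, pos, block, r):
    """A block matches at pos if at least a fraction r of its residues agree."""
    if pos < 0 or pos + len(block) > len(seq):
        return False
    same = sum(a == b for a, b in zip(seq[pos:pos + len(block)], block))
    return same / len(block) >= r

def find_block(seq, block, start, r):
    """Return the first position >= start where the block matches, or -1."""
    for pos in range(start, len(seq) - len(block) + 1):
        if block_matches(seq, pos, block, r):
            return pos
    return -1

def assign_superfamily(seq, motif, r=0.85):
    """motif: list of (block, min_spacer, max_spacer); the spacers bound
    the number of residues between the end of the previously found block
    and the start of this one (ignored for the first anchored block).
    A sequence is assigned if any block is found; absent blocks are
    skipped, mirroring steps (i)-(iii) of the published procedure."""
    found_any, anchor = False, None
    for block, min_sp, max_sp in motif:
        if anchor is None:
            pos = find_block(seq, block, 0, r)  # no anchor yet: search anywhere
        else:
            pos = -1
            last = min(anchor + max_sp, len(seq) - len(block))
            for p in range(anchor + min_sp, last + 1):
                if block_matches(seq, p, block, r):
                    pos = p
                    break
        if pos >= 0:
            found_any, anchor = True, pos + len(block)
    return found_any
```

Lowering r makes single blocks with substitutions match more loosely, which is consistent with the observation that all false positives in the training set came from single-block motifs containing substitution patterns.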
Although the prediction ability (~ %) was not as great as had been expected, the search itself could be performed within a fraction of a second on a small workstation, which is two to three orders of magnitude faster than a FASTA homology search (Pearson and Lipman, ). We modified the above procedure so that the search stops if any of the conserved (+) blocks is not found. The number of false positives could be decreased without affecting the number of false negatives in Table III(a), because this is how the conserved block was defined in the training set. However, this additional constraint has more effect on increasing the number of false negatives than on decreasing the number of false positives in the test set. If the motif library is to be used as an initial step in superfamily assignment, it is desirable to decrease the number of false negatives, because false positives can easily be distinguished by sequence similarity in the subsequent step. There are still - % false negatives in Table III(b), even with low values of r. It is possible to halve this by incorporating amino acid similarity scores, such as the PAM matrix (Dayhoff, ), when comparing motif blocks (data not shown). Because the sequence motifs identified represent well-conserved regions within a group of related proteins, they are likely to correspond to functionally important sites. Table IV summarizes the percentages of biological sites, annotated in the PIR database, that correspond to motif blocks identified by our procedure. Table V is a listing of the single-block sequence motifs that characterize superfamilies, together with any known functional significance. Our procedure identified known consensus patterns, or closely related derivatives, such as the active-site sequence GDSGG, which is known to be exclusive to the serine protease superfamily (Dayhoff et al., ).
The sequence motifs were obtained strictly from one-dimensional sequence information: the superfamily classification based on sequence similarity, and the amino acid pattern searches. Among the superfamilies with identified motifs, superfamilies contained one or more member sequences with known 3-D structures; seven were characterized by single-block motifs and by multiple-block motifs. Using the coordinate data from the Brookhaven Protein Data Bank (Bernstein et al., ), it was determined that multiple motif blocks come close together in the 3-D structure. Typical examples are: L-lactate dehydrogenase (SF ; see the Appendix for the actual motifs), phosphoglycerate kinase (SF ), phospholipase A (SF ), neutral proteinase (SF ), carbonate dehydratase (SF ) and triose-phosphate isomerase (SF ). Figure shows a stereo drawing of phospholipase A with two motif blocks at the active site. The correlation between conserved sequence patterns and exon structures has also been examined. A popular view suggests that introns existed in ancestral genes and have been removed under the exon shuffling mechanism (Holland and Blake, ), in which an exon forms a structural or functional unit of a protein. It was therefore expected that the identified motif blocks might correspond to exon units. As shown in Table IV, however, quite a few introns were found to split functionally important motif blocks. Figure shows typical examples where exon boundaries appear within the motif blocks. It is also noted that the intron positions around the motif block CGSCW of the papain (cysteine protease) superfamily (Ishidoh et al., ) and around the motif block GDSGGP of the trypsin (serine protease) superfamily (Rogers, ) are not fixed within the respective member sequences. These observations appear to support the concept of intron insertion (Rogers, ), although not all the introns examined here may fall into this category.
Information about the functional properties of the expressed protein products is often the main concern when DNA sequences are determined. The method presented in this paper is an attempt towards fully computerized interpretation of sequence data. A collection of sequence motifs with associated biological meanings in evolutionary, functional and structural aspects may be considered a dictionary for such purposes. At the same time, the motif search approach is expected to solve the speed and sensitivity problems of current homology search approaches. Because motifs represent more organized information, concentrated and extracted from primary databases, a search against a motif library is much faster than a search against a sequence database. It is also possible to incorporate various types of motif in the library: not only those that identify membership of a superfamily, but also other sequence patterns that are too weak to be detected by standard database search methods. Until now, sequence motifs have been found by manually examining a set of related sequences, although there have been a few attempts to automate the procedure (Staden, ; Smith and Smith, ; Smith et al., ). The essence of our automatic method is the concept of uniqueness. For a protein of n residues there are 20^n possible amino acid sequences. In nature, however, the repertoire of real amino acid sequences appears to be quite limited in comparison with this theoretical number. The protein sequences determined to date amount to million residues, three times larger than the 20^5 or 3.2 million possible pentapeptide patterns. In reality, - % of the possible pentapeptide patterns are not used in the known sequences. Thus, actual proteins seem to have evolved from a limited set of amino acid sequences, conserving functionally important residues. This has been the working hypothesis of this study.
As expected, motif blocks constructed from unique peptide words were found to be well correlated with functionally important sites of protein molecules. In addition, separate blocks tend to be close together in space to form an active site. For the motif library to be more useful, it is necessary to increase the number of identified superfamilies, i.e., to reduce the number of 'no opinions' ( - %) in Table III. One approach is to use lower levels of conservation, f, as shown in Table I. Another is to relax the condition of uniqueness, which was strictly required in this analysis. A few exceptions could be allowed in other superfamilies, and/or patterns could be identified that are unique to multiple superfamilies. In our preliminary analyses of the latter case, the pattern YGDTDS was found in two superfamilies (the DNA-directed DNA polymerases of adenovirus and herpes virus) which share very little sequence homology. The possibility of combining multiple superfamilies on the basis of short sequence motifs is thus inferred. The pattern HPDKGG was found exclusively in three superfamilies: the large T antigen, middle T antigen and small T antigen of polyoma and related viruses. However, this pattern was actually located in the exon shared by the three antigens.
Appendix: dictionary of sequence motifs characterizing superfamilies.
PROSITE: a dictionary of protein sites and patterns. EMBL, release .
Proc. Natl Acad. Sci. USA. Received on February .
This work was supported by a Grant-in-Aid for Scientific Research on the priority area 'Genome Informatics' from the Ministry of Education, Science and Culture, Japan.

key: cord- -o kiadfm authors: durojaye, olanrewaju ayodeji; mushiana, talifhani; uzoeto, henrietta onyinye; cosmas, samuel; udowo, victor malachy; osotuyi, abayomi gaius; ibiang, glory omini; gonlepa, miapeh kous title: potential therapeutic target identification in the novel coronavirus: insight from homology modeling and blind docking study date: - - journal: egypt j med hum genet doi: .
/s - - - sha: doc_id: cord_uid: o kiadfm background: the 2019-nCoV, regarded as a novel coronavirus, is a positive-sense single-stranded RNA virus. It is infectious to humans and is the cause of the ongoing coronavirus outbreak, which has elicited a public health emergency and a call for immediate international concern. The coronavirus main proteinase, also known as the 3C-like protease (3CLpro), is a very important protein in all coronaviruses for the role it plays in the replication of the virus and the proteolytic processing of the viral polyproteins. The cytotoxic effect that results from persistent viral replication and proteolytic processing of polyproteins can be greatly reduced through inhibition of the viral main proteinase activities. This makes the 3C-like protease of the coronavirus a potential and promising target for therapeutic agents against the viral infection. results: this study describes the detailed computational process by which the 2019-nCoV main proteinase coding sequence was mapped out from the full viral genome, translated, and the resultant amino acid sequence used to model the protein 3D structure. Comparative physicochemical studies were carried out on the resultant target protein and its template, while selected HIV protease inhibitors were docked against the protein binding sites, which contained no co-crystallized ligand. conclusion: in line with the results from this study, which show great consistency with other scientific findings on coronaviruses, we recommend the administration of the selected HIV protease inhibitors as first-line therapeutic agents for the treatment of the current coronavirus epidemic. The first outbreak of pneumonia cases of unknown origin was identified in the early days of December , in the city of Wuhan, Hubei Province, China [ ].
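The mapping-and-translation step described in the abstract can be illustrated with a minimal sketch: cut a coding region out of a genome string and translate it with the standard genetic code. The coordinates, sequence and function name below are illustrative assumptions, not the real 2019-nCoV 3CLpro coordinates; actual work would start from an annotated ORF in the published genome.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): AMINO[i] for i, c in enumerate(product(BASES, repeat=3))}

def extract_and_translate(genome, start, end):
    """Cut the coding region [start, end) out of a genome string and
    translate it codon by codon, stopping at the first stop codon."""
    cds = genome[start:end].upper().replace("U", "T")  # accept RNA input too
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

The resulting amino acid string is what a homology-modeling pipeline would then use for template search and 3D model building.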
the revelation of a novel beta coronavirus, currently regarded as the novel coronavirus [ ], came after high-throughput sequencing of the viral genome, which exhibits a close resemblance to the severe acute respiratory syndrome coronavirus (sars-cov) [ ]. the -ncov is the seventh member of the enveloped rna coronavirus family (subgenus sarbecovirus, orthocoronavirinae) [ ], and there are accumulating facts from family settings and hospitals confirming that the virus is most likely transmitted from person to person [ ]. the -ncov has also recently been declared by the world health organization a public health emergency of international concern [ ], and as of the th of february , over , cases had been confirmed and documented by laboratories around the world [ ], while more than , of such cases were documented in china through laboratory confirmation as of the th of february [ ]. despite the fast rate of global spread of the virus, the clinical characteristics peculiar to the -ncov acute respiratory disease (ard) remain unclear to a very large extent [ ]. over infections and deaths were recorded worldwide in the summer of before successful containment of the severe acute respiratory syndrome wave was achieved; the disease itself was a major public health concern worldwide [ , ]. the infection that led to this large number of deaths was linked to a new coronavirus, also known as the sars coronavirus (sars-cov). coronaviruses are positive-stranded rna viruses and possess the largest known viral rna genomes. the first major step in containing the sars-cov-linked infection was to successfully sequence the viral genome, the organization of which was found to exhibit similarity with the genomes of other coronaviruses [ ].
the main proteinase crystal structure from both the transmissible gastroenteritis virus and the human coronavirus (hcov e) has been determined, with the discovery that the enzyme crystallizes as a dimer in which the individual protomers are oriented perpendicular to each other. each of the protomers is made up of three catalytic domains [ ]. the first and second domains of the protomers have a two-β-barrel fold that can be likened to one of the folds in the chymotrypsin-like serine proteinases. domain iii has five α-helices and is linked to the second domain by a long loop. individual protomers have their own specific region for the binding of substrates, and this region is positioned in the cleft between the first and second domains. dimerization of the protein is thought to be a function of the third domain [ ]. the main proteinase of the sars cov is known to be a cysteine proteinase which has, in its active site, a cysteine-histidine catalytic dyad. conservation of the sars cov main proteinase across the genome sequences of all sars coronaviruses is very high, as is the homology of the protein to the main proteinases of other coronaviruses. on the basis of the high similarity between the different coronavirus main proteinase crystal structures and the conservation of almost all the amino acid residue side chains involved in formation of the dimeric state, it was proposed that the only biologically functional form of the coronavirus main proteinase might be the dimer [ ]. more recently, chen et al., in a study combining molecular dynamics simulations and enzyme activity measurements on a hybrid enzyme, showed that the only active form of the proteinase is its dimeric state [ ].
recent studies based on the sequence homology of the coronavirus main proteinase structural model with tgev, as well as the solved crystal structure, have involved the docking of substrate analogs for the virtual screening of natural products and a collection of synthetic compounds, alongside approved antiviral therapeutic agents, in the evaluation of coronavirus main proteinase inhibition [ ]. some compounds from this study were identified for the inhibitory role they play against the viral proteinase. these compounds include l- , , which is an hiv- protease inhibitor; calanolide a and nevirapine, both of which are reverse transcriptase inhibitors; an inhibitor of the α-glucosidase named glycovir; sabadinine, which is a natural product; and ribavirin, a general antiviral agent [ ]. ribavirin was shown to exhibit antiviral activity in vitro against the sars coronavirus, albeit at cytotoxic concentrations. at the start of the first outbreak of the sars epidemic, ribavirin was administered as a first line of defense, both as a monotherapy and in combination with corticosteroids or the hiv protease inhibitor kaletra [ ]. according to reports from a very recent study conducted by cao et al., a total of laboratory-confirmed sars-cov-infected patients underwent a controlled, randomized, open-labeled trial in which patients were assigned to the standard care group and patients to the lopinavir-ritonavir group. . % of the patients in the lopinavir-ritonavir group ( patients) and . % of the patients in the standard care group ( patients) exhibited serious adverse events between randomization and the th day. the exhibited adverse events include acute respiratory distress syndrome (ards), acute kidney injury, severe anemia, acute gastritis, hemorrhage of the lower digestive tract, pneumothorax, unconsciousness, sepsis, acute heart failure, etc.
in addition, patients in the lopinavir-ritonavir group specifically exhibited gastrointestinal adverse events, including diarrhea, vomiting, and nausea [ ]. our current study took advantage of the availability of the sars cov main proteinase amino acid sequence to map out the nucleotide coding region for the corresponding protein in the -ncov. two selected hiv protease inhibitors (lopinavir and ritonavir) were then targeted at the catalytic site of the protein d structure, which was modeled using already available templates. the predicted activity of the drug candidates was validated by targeting them against a recently crystallized d structure of the enzyme, which has been made available for download in the protein data bank. lopinavir is an antiretroviral protease inhibitor used in combination with ritonavir in the therapy and prevention of human immunodeficiency virus (hiv) infection and the acquired immunodeficiency syndrome (aids). it plays a role as an antiviral drug and an hiv protease inhibitor. it is a member of the amphetamines and a dicarboxylic acid diamide (fig. ). the complete genome of the isolated wuhan seafood market pneumonia virus ( -ncov) was downloaded from the genbank database under the assigned accession number mn . . the nucleotide sequence of the full genome was copied out in fasta format. the genbank sequence database is an annotated, open-access collection of all publicly available nucleotide sequences and their translated protein segments. this database is designed and managed by the national center for biotechnology information (ncbi) in accordance with the international nucleotide sequence database collaboration (insdc) [ ]. the nucleotides between positions and of the -ncov genome were selected as the sequence of interest.
translation of the nucleotide sequence of interest in the -ncov and back-translation of the sars cov main proteinase amino acid sequence were achieved with the emboss transeq and backtranseq tools, respectively [ ]. transeq reads one or more nucleotide sequences and writes the resulting translated protein sequence to file, while backtranseq makes use of a codon usage table which gives the usage frequency of each codon for every amino acid [ ]. for every amino acid of the input sequence, the corresponding most frequently occurring codon is used in the nucleotide sequence that forms the output. the amino acid sequence generated by the transeq translation of the nucleotide sequence of interest contained no stop codons and as such was used directly for protein homology modeling without the need for any deletion. two sets of sequence alignments were carried out in this study. the first was the alignment between the translated nucleotide sequence copy of the -ncov genome, which was used for the reference protein homology modeling, and the amino acid sequence of the sars cov main proteinase; the second was between the back-translated sars cov main proteinase nucleotide sequence and the -ncov full genome. the latter was used in mapping out the protein coding sequence in the -ncov full genome. these alignments were carried out using the clustal omega software package. clustal omega can read inputs of nucleotide and amino acid sequences in formats such as a m/fasta, clustal, msf, phylip, selex, stockholm, and vienna [ ]. template searches with blast and hhblits were performed against the swiss-model template library. the target sequence was searched with blast against the primary amino acid sequences contained in the smtl, and a total of templates were found. an initial hhblits profile was built using the procedure outlined in remmert et al. [ ], followed by iteration of hhblits against nr .
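the transeq translation step described above can be sketched in a few lines of pure python; this is an illustrative stand-in, not the emboss implementation, and the example sequence is invented for demonstration:

```python
# standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

def translate(seq: str) -> str:
    """translate an in-frame dna (or rna) sequence; '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")
    # step through the sequence codon by codon, dropping any trailing partial codon
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(0, len(seq) - len(seq) % 3, 3))

# made-up example input; the real input would be the region of interest
# cut from the -ncov genome fasta record:
print(translate("ATGTCTGGTTTTAGAAAA"))  # -> MSGFRK
```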
the obtained profile was then searched against all profiles of the smtl, and a total of templates were found. models were built based on the target-template alignment using promod . coordinates which are conserved between the target and the template were copied from the template to the model. insertions and deletions were remodeled using a fragment library. side chains were then rebuilt, and finally the geometry of the resulting model was regularized using a force field [ ]. for the estimation of protein structure model quality, we used qmean (qualitative model energy analysis), a composite scoring function that describes the main aspects of protein structural geometry and can derive, on the basis of a single model, both global (i.e., for the entire structure) and local (i.e., per residue) absolute quality estimates [ ]. an appreciable number of alternative models were produced, which formed the basis on which the final model was selected. the qmean score was thus used in the selection of the most reliable model, against which the consensus structural scores were calculated. molprobity (version . ) was used as the structure-validation tool that produced the broad-spectrum evaluation of the quality of the target protein at both the global and local levels. it relies greatly on the sensitivity and power provided by optimized hydrogen placement and all-atom contact analysis, with complementary versions of updated covalent-geometry and torsion-angle criteria [ ]. the torsion angles between individual residues of the target protein were assessed using the ramachandran plot. this is a plot of the torsional angles [phi (φ) and psi (ψ)] of the amino acid residues making up a peptide. in sequence order, φ is the torsion angle defined by c(i-1), n(i), cα(i), c(i), while ψ is the torsion angle defined by n(i), cα(i), c(i), n(i+1). the values of φ were plotted on the x-axis and the values of ψ on the y-axis [ ].
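the φ and ψ angles discussed above are ordinary dihedral angles over four backbone atoms; a minimal sketch of the computation (it takes any four 3d points and is not tied to a particular pdb parser) is:

```python
import math

def dihedral(p0, p1, p2, p3):
    """torsion angle in degrees defined by four points, e.g. the backbone atoms
    C(i-1)-N(i)-CA(i)-C(i) for phi or N(i)-CA(i)-C(i)-N(i+1) for psi."""
    def sub(a, b):   return tuple(x - y for x, y in zip(a, b))
    def dot(a, b):   return sum(x * y for x, y in zip(a, b))
    def cross(a, b): return (a[1]*b[2] - a[2]*b[1],
                             a[2]*b[0] - a[0]*b[2],
                             a[0]*b[1] - a[1]*b[0])
    # bond vectors along the four atoms
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    # normals to the two planes spanned by (b1,b2) and (b2,b3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    # orthonormal frame component for a signed angle
    b2n = math.sqrt(dot(b2, b2))
    m1 = cross(n1, tuple(x / b2n for x in b2))
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))
```

a trans (zig-zag) arrangement of four points gives ±180°, a cis arrangement 0°, matching the conventions used on ramachandran plots.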
plotting the torsional angles in this way graphically shows the possible combinations of angles that are allowed. the quaternary structure annotation of the template is employed to model the target sequence in its oligomeric state. the methodology, as proposed by bertoni et al. [ ], is based on a supervised machine learning algorithm, support vector machines (svm), which combines interface conservation, structural clustering, and other template features to produce a quaternary structure quality estimate (qsqe). the qsqe score is a number that ranges between and , and it reflects the accuracy expected of the inter-chain contacts for a model built on a given template and its alignment. a higher score indicates a more reliable result. this complements the gmqe score, which estimates the accuracy of the d structure of the resulting model. the d structural homology modeling of the -ncov genome translated segment was followed by a structural comparison with the sars cov main proteinase d structure (pdb: uj ). this was achieved using ucsf chimera, a highly extensible tool for interactive analysis and visualization of molecular structures and related data, including docking results, supramolecular assemblies, density maps, sequence alignments, trajectories, and conformational ensembles [ ]. high-quality animation videos were also generated. the amino acid constituents of the target protein secondary structures were colored and visualized in d using the pymol molecular visualizer, which uses the opengl extension wrangler library (glew) and freeglut. the "py" in pymol refers to the programming language that backs the software, which was written in python [ ]. the percentage composition of each component of the secondary structure was calculated using the chou and fasman secondary structure prediction (cfssp) server.
this is a secondary structure predictor that predicts, from an amino acid input sequence, the regions forming the alpha helices, beta sheets, and turns. the secondary structure prediction output is displayed in a linear sequential graphical view according to the occurrence probability of each secondary structure component. the methodology implemented by cfssp is the chou-fasman algorithm, which is based on analyses of the relative frequencies of each amino acid residue in the alpha helices, beta sheets, and loops of known protein structures solved by x-ray crystallography [ ]. the expasy server calculates protein physiochemical parameters as part of its sub-functions, basically for the identification of proteins [ ]. we used the protparam tool to calculate various physiochemical parameters of the model and template proteins for comparison purposes. the calculated parameters include the molecular weight, theoretical isoelectric point, amino acid composition, extinction coefficient, instability index, etc. the inference on evolutionary relationships was made using the maximum likelihood methodology based on the jtt matrix-based model [ ]. the corresponding bootstrap consensus tree was inferred from a thousand replicates and was used to represent the evolutionary history of the analyzed taxa. tree branches forming partitions that were reproduced in fewer than % of the bootstrap replicates were automatically collapsed. next to every branch in the tree is displayed the percentage of tree replicates in which the associated taxa clustered together in the bootstrap test of a thousand replicates. initial trees were derived automatically for the search through the application of the neighbor-joining and bionj algorithms to a matrix of pairwise distances calculated using a jtt model, followed by selection of the topology with the superior log likelihood value.
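as a rough illustration of the chou-fasman idea behind cfssp (windowed propensity averages only; the published algorithm adds nucleation and extension rules, only a handful of residues are tabulated here, and the thresholds below are our own simplification):

```python
# helix/sheet propensities for a few residues, taken from the original
# chou & fasman tables; unknown residues default to a neutral 1.0.
P_HELIX = {"A": 1.42, "L": 1.21, "M": 1.45, "E": 1.51,
           "G": 0.57, "P": 0.57, "V": 1.06, "S": 0.77}
P_SHEET = {"A": 0.83, "L": 1.30, "M": 1.05, "E": 0.37,
           "G": 0.75, "P": 0.55, "V": 1.70, "S": 0.75}

def classify(seq: str, window: int = 5) -> str:
    """label each residue H (helix), E (sheet) or C (coil) by comparing
    windowed average propensities; an illustration, not the full algorithm."""
    out = []
    for i in range(len(seq)):
        w = seq[max(0, i - window // 2): i + window // 2 + 1]
        h = sum(P_HELIX.get(r, 1.0) for r in w) / len(w)
        e = sum(P_SHEET.get(r, 1.0) for r in w) / len(w)
        if h > 1.03 and h >= e:
            out.append("H")
        elif e > 1.05 and e > h:
            out.append("E")
        else:
            out.append("C")
    return "".join(out)
```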
the phylogenetic analysis was carried out on amino acid sequences with close identity. the complete dataset contained a total of positions. the whole analysis was conducted using the molecular evolutionary genetics analysis (mega) software (version ) [ ]. ligand preparation and molecular docking protocol: d structures of the experimental ligands were retrieved from the pubchem repository and sketched using the chemaxon software [ ]. the sketched structures were downloaded and saved as mrv files, which were converted into smiles strings with openbabel. the compounds prepared as ligands were docked against each of the prepared protein receptors using autodock vina [ ]. blind docking analysis was performed at extra precision mode with minimized ligand structures. after a successful docking run, a file was generated consisting of all the poses produced by autodock vina along with their binding affinities and rmsd scores. in the vina output log file, the first pose was considered the best because it has a stronger binding affinity than the other poses and no rmsd value. the polar interactions and binding orientations at the active sites of the proteins were viewed in pymol, and the docking scores for each ligand screened against each receptor protein were recorded. the same docking protocol was performed against the sars-cov main proteinase d structure downloaded from the protein data bank with the pdb identity m n. the obtained outputs were visualized, compared, and documented for validation purposes. the full genome of the -ncov (https://www.ncbi.nlm.nih.gov/nuccore/mn . ?report=fasta) consists of nucleotides, but for the purpose of this study, the nucleotides between positions and were considered to locate the protein of interest. the direct translation of this segment of nucleotides produced a sequence of amino acids (fig. ).
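the ranking in the vina output log described above is easy to recover programmatically; the following sketch parses the result table, with the sample log text being a made-up illustration of vina's table format rather than real output from this study:

```python
def best_pose(log_text: str):
    """return (mode, affinity_kcal_per_mol) of the top-ranked pose from an
    autodock vina result table; vina sorts poses so the first row is best."""
    poses = []
    for line in log_text.splitlines():
        parts = line.split()
        # data rows have four columns: mode, affinity, rmsd l.b., rmsd u.b.
        if len(parts) == 4 and parts[0].isdigit():
            try:
                poses.append((int(parts[0]), float(parts[1])))
            except ValueError:
                continue
    if not poses:
        raise ValueError("no poses found in log")
    # most negative affinity = strongest predicted binding
    return min(poses, key=lambda p: p[1])

# invented example of the table section of a vina log:
SAMPLE = """\
mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
   1       -9.1      0.000      0.000
   2       -8.4      1.922      3.550
   3       -7.9      2.801      6.113
"""
```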
this amino acid count was reached after the direct translation of the nucleotide sequence of interest; since there were no stop codons, no deletion of any kind was needed. as depicted in fig. , few structural differences were noticed. the amino acid sequences making up these non-conserved regions are clearly revealed in fig. . notwithstanding, a % identity was observed between both sequences, showing that the conserved domains were predominant. figure represents the percentage amino acid sequence identity between the target and the template protein, where positions with a single asterisk (*) depict regions of full residue conservation, segments with a colon (:) indicate regions of conservation between amino acid residues with similar properties, and positions with a period (.) show regions of conservation between amino acids with less similar properties. the amino acid sequence of the sars coronavirus main proteinase was back-translated to generate the corresponding nucleotide sequence, which was then aligned with the -ncov full genome. this was carried out for the purpose of mapping out the region of the -ncov full genome where the proteinase coding sequence is located. as depicted in fig. , the target protein coding sequence is located between nucleotides and of the viral genome. the outcome of a qmean analysis is anchored on the composite scoring function, which calculates several features regarding the structure of the target protein. the estimated absolute quality of the model is expressed in terms of how well the score of the model agrees with the values expected of a set of structures resulting from high-resolution experiments. the global score values can come either from qmean or qmean .
qmean is a combination of four statistical potential terms represented in a linear form, while qmean , in addition to the functionality of qmean , uses two agreement terms in the consistency evaluation of structural features with sequence-based predictions. both qmean and qmean scores originally lie in the range of to , with being the good score, and are by default transformed into z-scores (table ) so that they can be related to what would be expected from x-ray structures of high resolution. the local scores are likewise a combination of four linear statistical potential terms, with the agreement terms evaluated on a per-residue basis. the local scores are also estimated in the range of to , where one is the good score (fig. ). when compared to the set of non-redundant protein structures, the qmean z-score of the target protein, as shown in fig. , was . the models located in the dark zone are shown in the graph to have scores less than , while the scores of other regions outside the dark zone can either be < the z-score < or z-score > . good models are often located in the dark zone. whenever such torsion-angle values are found, they result in some strain in the polypeptide chain; in such cases, the stability of the structure depends greatly on additional interactions, but the conformation may be conserved in a protein family for its structural significance. another exception to the α- and β-region clustering principle can be viewed in the right-side plot of fig. , where the torsion angles for glycine are the only angles displayed on the ramachandran plot. glycine has no side chain, and this gives room for flexibility in the polypeptide chain, making the otherwise forbidden rotation angles accessible. glycine is for this reason more concentrated in the regions making up the loops, where sharp bends can occur in the polypeptide.
for this reason, glycine is highly conserved in protein families, as the presence of turns at specific positions is a characteristic feature of particular structural folds. the comparative physiochemical parameters of the template and target proteins computed by protparam were deduced from the amino acid sequences of the individual proteins. no additional information was required about the proteins under consideration, and each of the complete sequences was analyzed. the amino acid sequence of the target protein has not been deposited in the swiss-prot database; for this reason, the standard single-letter amino acid sequence of each protein was entered into the text field to compute the physiochemical properties shown in tables . the two hiv protease inhibitors (lopinavir and ritonavir), when targeted at the modeled -ncov catalytic site, gave significant inhibition attributes; hence, the in silico study was planned through molecular docking analysis with autodock vina. the binding orientation of the drugs at the protein active site, as viewed in the pymol molecular visualizer (fig. ), showed an induced-fit binding conformation. the same compounds were targeted against the active site of the downloaded pdb d structure of the sars-cov main proteinase (pdb m n) for comparison purposes (fig. ). the active site residues as visualized in pymol are shown in fig. . the binding of lopinavir to the target protein, which produced the best binding score, was used as the predictive model. residues at a distance of < angstroms from the bound ligand were assumed to form the binding pocket.
fig. : the combined view of the d structural comparison between the modeled target protein and the downloaded pdb structure of the viral protein (left column) and their primary sequence alignment (right column). the target protein is colored in grey while its protein data bank equivalent is colored in red.
the high structural similarity between the two proteins was validated through their sequence alignment, which produced a . % sequence identity score. homology modeling, a computational method for modeling the d structure of proteins also regarded as comparative modeling, constructs atomic models based on known structures, i.e., structures that have been determined experimentally and share more than % sequence homology with the target. the backing principle is that two proteins with high similarity in their amino acid sequences are likely to share three-dimensional structural similarity. with one of the proteins having an already determined d structure, the structure of the unknown one can be copied with a high degree of confidence. homology modeling gives a higher degree of accuracy for alpha carbon positions than for the side chains and the loop-containing regions, which are mostly inaccurate.
fig. : depicted here are two ramachandran plots. the plot on the left-hand side shows the general torsion angles for all the residues in the target protein, while the plot on the right-hand side is specific for the glycine residues of the protein.
fig. : the target protein secondary structures with bound lopinavir. at the top is the secondary structure visualization in pymol, with regions making up the alpha helices, beta sheets, and loops shown in light blue, purple, and brown, respectively. below is the prediction by cfssp, where the red, green, yellow, and blue lines depict regions of the helices, sheets, turns, and coils (loops), respectively. the predicted secondary structure composition shows a high degree of alpha helices and beta sheets, respectively occupying and % of the total residues, with the percentage loop occupancy at %.
as regards template selection, homologous proteins with determined structures are searched for in the protein data bank (pdb), and templates must have, alongside a minimum of % identity with the target sequence, the highest possible resolution and appropriate cofactors to be considered for selection [ ]. in this study, the target protein was modeled using the sars coronavirus main proteinase as template. this selection was based on the high resolution and its identity with the target protein, which is as high as %. qualitative model energy analysis (qmean) is a composite scoring function that describes protein structures on the basis of major geometrical aspects. the qmean scoring function calculates the global quality of models as a linear combination of six structural descriptors, four of which are statistical potentials of mean force. the analysis of local geometry is carried out by a torsion angle potential over three consecutive amino acids. in predicting the structure of a protein, final models are often selected after the production of a considerable number of alternative models; hence, structure prediction is anchored on a scoring function which identifies the best structural model within a collection of alternatives. two distance-dependent interaction potentials are used to assess long-range interactions, based on c_β atoms and all atoms, respectively. the burial status of amino acid residues describes the solvation potential of the model, while two further terms reflect the agreement between the calculated and predicted solvent accessibility and secondary structure. the resultant target protein can be considered a good model, as the z-scores of the c_β interaction energy, all-atom pairwise energy, solvation energy, and torsion angle energy are − . , − . , − . , and . , respectively, as shown in table .
through the z-score, the quality of a model protein can be compared to high-resolution reference structures that are products of x-ray crystallography, where is the average z-score value for a good model. according to benkert et al., the qmean z-score provides an estimate of the degree of nativeness of the structural features observed in a model, indicating whether the model is of a quality comparable to experimental structures [ ]. our study shows the z-score of the target is " ", as indicated in fig. , and such a score indicates a relatively good model, as it possesses the average z-score for a perfect model. the predicted properties of a model determine its molprobity scores. initial work on all-atom contact analysis has shown that proteins possess exquisitely well-packed structures, with favorable van der waals interactions and overlaps only between atoms that form hydrogen bonds [ ]. unfavorable steric clashes correlate strongly with poor data quality, and a near-zero occurrence of such steric clashes is seen in the ordered regions of high-resolution crystal structures. therefore, low clash score values are indications of a very good model, as has been proven by the clash score value exhibited by the target protein modeled for the purpose of this study (table ). in addition to the clash score, the protein conformation details are remarkably relaxed, with staggered χ angles and even staggered methyl groups [ ]. forces applied to a given local motif in environments predominantly made up of folded protein interior can produce a locally strained conformation, but significant strain is kept near the functionally needed minimum by evolution, on the presumption that the stability of proteins is too marginal for high tolerance.
in updates to traditional validation measures, crystal structures have been rigorously quality-filtered by homology, resolution, and overall validation score at the file level, by b-factor and, sometimes at the residue level, by all-atom steric clashes. the resulting multi-dimensional distributions, after adequate smoothing, are used to score how "protein-like" each local conformation is relative to known structures, either for backbone ramachandran values or for side chain rotamers [ ]. rotamer outliers are equivalent to < % at high resolution, general-case ramachandran outliers to a high-resolution equivalence of < . %, and ramachandran favored to %. in this regard, the molprobity score (mpscore) was defined as mpscore = 0.426 × ln(1 + clashscore) + 0.33 × ln(1 + max(0, rota_out − 1)) + 0.25 × ln(1 + max(0, rama_iffy − 2)) + 0.5, where the clashscore is the number of unfavorable all-atom steric overlaps ≥ . Å per atoms [ ]. rota_out is the percentage of side chain conformations termed rotamer outliers, out of the side chains that could be evaluated, while rama_iffy is the percentage of backbone ramachandran conformations beyond the favored region, out of the residues that could be evaluated. the coefficients derive from a log-linear fit to crystallographic resolution on a filtered set of pdb structures, so that the mpscore of a model is the resolution at which each of its component scores would be the expected values; thus, lower mpscores indicate a better model. with a clash score of . and a . % value for the ramachandran favored region, compared to ramachandran outliers and rotamer outliers with individual values of . % and . %, respectively, we arrived at a molprobity score of . , which is low enough to indicate that our experimental protein is a good model.
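the mpscore definition above can be written out directly; the coefficients below are the ones published by the molprobity authors (chen et al., 2010), restored here because the extracted text dropped the numbers:

```python
import math

def molprobity_score(clashscore: float, rota_out_pct: float,
                     rama_iffy_pct: float) -> float:
    """molprobity score (mpscore) with the coefficients published by the
    molprobity authors; lower is better, and the value roughly corresponds to
    the crystallographic resolution at which the given clashscore, rotamer
    outlier % and ramachandran not-favored % would be typical."""
    return (0.426 * math.log(1 + clashscore)
            + 0.33 * math.log(1 + max(0.0, rota_out_pct - 1))
            + 0.25 * math.log(1 + max(0.0, rama_iffy_pct - 2))
            + 0.5)
```

a perfect model (no clashes, no outliers) scores 0.5, and each kind of violation pushes the score toward worse effective resolution.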
the characteristic repetitive conformation of amino acid residues is the basis for the repetitive nature of the secondary structures, hence the repetitive values of φ and ψ. the ranges of the φ and ψ values can be used to distinguish the different secondary structural elements, as the φ and ψ values of each secondary structure element map to their respective regions on the ramachandran plot. peptides have the average φ and ψ values of their α-helices clustered about φ = − ° and ψ = − °, while average values of φ = − ° and ψ = + ° describe the ramachandran plot clustering of twisted beta sheets [ ]. the core region (green in fig. ) of the plot has the most favorable combinations of φ and ψ values and the highest number of data points. the figure also shows, in the upper right quadrant, a small third core region. this is known as the allowed region, which may or may not be contiguous with a core region, and it has fewer data points than the core regions. the other areas of the plot are regarded as disallowed. since glycine has only a single hydrogen atom as its side chain, steric hindrance is less likely to occur as φ and ψ are rotated through their series of values. the glycine residues, with φ and ψ values of + ° and − °, respectively [ ], do not exhibit steric hindrance and are for that reason positioned in the disallowed region of the ramachandran plot, as shown in the right-hand side plot in fig. . the extinction coefficient indicates the intensity of light absorbed by a protein at a specific wavelength. the importance of estimating this coefficient is to allow a protein undergoing purification to be monitored in a spectrophotometer. woody [ ] has shown the possibility of estimating a protein's molar extinction coefficient from knowledge of its amino acid composition, which is presented in table .
the extinction coefficients of the proteins (both the template and the target proteins) were calculated using the equation: e(prot) = numb(tyr) × ext(tyr) + numb(trp) × ext(trp) + numb(cystine) × ext(cystine). the absorbance (optical density) was calculated from this coefficient; for the calculation to be valid, the following conditions must be met: ph . , . m guanidinium hydrochloride, . m phosphate buffer. the identity of the n-terminal residue of a protein is an important factor in the determination of its stability in vivo and plays a major role in the ubiquitin-mediated proteolytic degradation process [ ]. β-galactosidase proteins with different n-terminal amino acids were designed through site-directed mutagenesis, and the designed β-galactosidase proteins had strikingly different half-lives in vivo, ranging from over a hundred hours to less than minutes, depending on the nature of the amino-terminal residue and on the experimental model (yeast in vivo; mammalian reticulocytes in vitro; e. coli in vivo). individual amino acid residues can thus be ordered with respect to the half-lives they confer when located at a protein's amino terminus [ ]. this is referred to as the "n-end rule", and it is what the estimated half-lives of both the template and target proteins were based on. the instability index provides an estimate of the protein's stability in a test tube. statistical analysis of stable and unstable proteins has shown [ ] that there are specific dipeptides whose occurrence differs significantly between unstable and stable proteins. the authors of this method assigned a weight value of instability to each of the different dipeptides (diwv). the computation of a protein's instability index is thus possible using these weight values.
table : amino acid composition of the template and target proteins (amino acid residues in one-letter codes).
where l is the sequence length and diwv(x[i]x[i + ]) is the instability weight value for the dipeptide starting at position i. a protein with an instability index below the threshold can be predicted to be stable, while an instability index value that exceeds the threshold indicates that the protein may be unstable. the comparative instability index values for the template and target proteins were . and . (table ) , respectively, showing that both are stable proteins. the relative volume occupied by the aliphatic side chains (valine, alanine, leucine and isoleucine) of a protein is known as its aliphatic index. it may be regarded as a positive factor for increased thermostability of globular proteins. the aliphatic index of the experimental proteins was calculated according to the following formula [ ] : where x(ala), x(val), x(ile) and x(leu) are the mole percent ( × mole fraction) of alanine, valine, isoleucine and leucine, and the coefficients "a" and "b" are the relative volumes of the valine side chain (a = . ) and of the leu/ile side chains (b = . ) to the alanine side chain. the calculated aliphatic indices for the experimental proteins show that the thermostability of the target protein is slightly higher than that of the template. the most common secondary structures are alpha helices and beta sheets, although beta turns and omega loops also occur. elements of the secondary structure form spontaneously as intermediates before folding into the corresponding three-dimensional tertiary structure [ ] . previous studies have shown how stable and how robust to mutations α-helices are in natural proteins. they have also been shown to be more designable than beta sheets; thus, designing a functional all-α-helix protein is likely to be easier than designing proteins with both α-helices and strands, and this has recently been confirmed experimentally [ ] . 
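the three sequence-derived parameters discussed above can be sketched as small functions. the 280 nm extinction coefficients for trp, tyr and cystine, and the aliphatic-index coefficients a = 2.9 and b = 3.9, are the commonly used literature values (assumed here, since the digits are elided in the text above); the diwv entries are toy stand-ins, as the full published dipeptide table is not reproduced.

```python
# sketch of the extinction coefficient, instability index and aliphatic
# index calculations. coefficient values below are assumptions (commonly
# used literature values); TOY_DIWV contains hypothetical weights only.
EXT = {"tyr": 1490, "trp": 5500, "cystine": 125}   # M^-1 cm^-1 at 280 nm (assumed)
TOY_DIWV = {"MK": 10.0, "KL": -7.5, "LV": 1.0}     # hypothetical dipeptide weights

def extinction_coefficient(n_tyr, n_trp, n_cystine):
    """e(prot) = numb(tyr)*ext(tyr) + numb(trp)*ext(trp) + numb(cystine)*ext(cystine)."""
    return n_tyr * EXT["tyr"] + n_trp * EXT["trp"] + n_cystine * EXT["cystine"]

def instability_index(seq, diwv):
    """ii = (10 / L) * sum of diwv(x[i]x[i+1]) over all dipeptides in seq."""
    total = sum(diwv.get(seq[i:i + 2], 0.0) for i in range(len(seq) - 1))
    return (10.0 / len(seq)) * total

def aliphatic_index(seq, a=2.9, b=3.9):
    """ai = X(Ala) + a*X(Val) + b*(X(Ile) + X(Leu)), with X in mole percent."""
    x = lambda aa: 100.0 * seq.count(aa) / len(seq)
    return x("A") + a * x("V") + b * (x("I") + x("L"))

print(extinction_coefficient(n_tyr=10, n_trp=3, n_cystine=1))  # 31525
print(instability_index("MKLV", TOY_DIWV))                     # 8.75
print(aliphatic_index("AVIL"))                                 # approx. 292.5
```

with real sequences the same tally would run over the full-length protein and the complete dipeptide weight table.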
the template and target proteins both have a total of amino acid residues (table ) , with the composition of individual residues shown in table . as shown in fig. , the target protein, which shares structural homology with the template (fig. and the animation video), is predominantly occupied by residues forming alpha helices and beta sheets, with a very low percentage of residues forming loops. the stability of these two proteins is reflected in their physicochemical characteristics and can therefore be linked to the high percentage of residues forming alpha helices. the ultimate goal of genome analysis is to understand the biology of organisms in both evolutionary and functional terms, and this involves combining different data from various sources [ ] . for the purpose of this study, we compared our protein of interest to similar proteins in the ncbi database to predict the evolutionary relationships between homologous proteins represented in the genomes of each divergent species. this makes the amino acid sequence alignment the most suitable form of alignment for the phylogenetic tree construction. organisms with common ancestors were positioned in the same monophyletic group in the tree, and the node where the protein of interest (the -ncov main proteinase) is positioned also houses the non-structural polyprotein of the ab bat sars-like coronavirus. this shows that the two viral proteins share a common source with a shorter divergence period. bootstrapping allows the level of confidence in evolutionary predictions to be assessed: a value of one hundred represents a very high level of confidence in the positioning of a node in the topology, whereas lower scores are more likely to have arisen by chance than to reflect the real tree topology [ ] . the bootstrap value of the above-mentioned viral proteins, which is exactly , therefore represents a very high level of statistical support for their positioning at the nodes in the branched part of the tree. 
the length of the branches is a representation of genetic distance. it is also a measure of the time since the viral proteins diverged: the greater the branch length, the longer the period of time since divergence from the most closely related protein [ ] . the tw and tjf strains of the sars coronavirus orf a polyprotein and replicase, respectively, are the most distantly related based on their branch lengths, and as such can be regarded as the out-group in the tree. structure-based drug discovery is the most straightforward molecular docking methodology, as it screens a variety of ligands (compounds) listed in a chemical library by "docking" them against proteins of known structure, which in this study is the modeled 3d structure of the -ncov main proteinase, and reports the binding affinity details alongside the binding conformation of the ligands in the enzyme active site [ ] . ligand docking can be specific, that is, focused only on the predicted binding sites of the protein of interest, or it can be blind docking, where the entire area of the protein is covered. most docking tool applications focus on the predicted primary binding site of ligands; however, in some cases, information about the target protein's binding site is missing. blind docking is known to be an unbiased molecular docking approach, as it scans the protein structure in order to locate the ideal binding site of ligands [ ] . the autodock-based blind docking approach was introduced in this study to search the entire surface of the target and template proteins for binding sites while simultaneously optimizing the conformation of the ligands. for this reason, it was necessary to set up our docking parameters to search the entire surface of the modeled main proteinase of the -ncov. this was achieved by using autogrid to create a very large grid map (center Å × − Å × Å and size Å × Å × Å) with the maximum number of points in each dimension in order to cover the whole protein. 
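the blind-docking grid setup described above amounts to enclosing the whole protein in one large box. a minimal sketch, assuming hypothetical atom coordinates and a margin parameter of our own choosing (the actual autogrid center/size digits are elided in the text):

```python
# sketch: derive a blind-docking grid box as the bounding box of all
# protein atoms plus a margin, so the entire surface is searched.
# the atom coordinates and margin below are hypothetical.
def blind_docking_box(coords, margin=5.0):
    """return (center, size) of a box enclosing all atoms plus a margin in angstrom."""
    mins = [min(c[i] for c in coords) for i in range(3)]
    maxs = [max(c[i] for c in coords) for i in range(3)]
    center = tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
    size = tuple((hi - lo) + 2.0 * margin for lo, hi in zip(mins, maxs))
    return center, size

atoms = [(0.0, 0.0, 0.0), (10.0, 20.0, 30.0), (-4.0, 6.0, 2.0)]  # toy atoms
center, size = blind_docking_box(atoms)
print(center)   # (3.0, 10.0, 15.0)
print(size)     # (24.0, 30.0, 40.0)
```

in practice, the resulting center and size would be converted to grid points at the chosen spacing and written into the autogrid parameter file.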
we observed a partial overlap in the docking pose of lopinavir at the active site of both the template and target protein, in contrast to the conspicuous difference observed in the binding orientation of ritonavir at the protein active sites. these differential poses can be viewed distinctly in the attached animation video. a close view of the binding orientation of the two drug candidates at the -ncov main proteinase active site (fig. ) is also consistent with the proposed induced-fit binding model. in a comparative docking study, the same drug candidates (lopinavir and ritonavir) were docked against the active site of the pdb-downloaded version of the viral main proteinase. the docking grid for this purpose was set with precision, as the solved pdb structure of the virus included a co-crystallized ligand at the enzyme active site (center - Å × − Å × Å and size Å × Å × Å), and the experimental ligands bind to this site with precision and with variation in poses (fig. ) . the binding energy results, showing the docking results of lopinavir and ritonavir against the template and target proteins, are given in table . the binding of ritonavir to the template protein produced the highest number of inter-model hydrogen bonds, while the binding of lopinavir to the target protein formed polar interactions with three residues at the active site, compared to the two formed by the other interactions. table lists the amino acid residues involved in polar interactions, the number of inter-model hydrogen bonds and the docking scores of lopinavir and ritonavir upon binding to the 3d pdb download of the sars-cov main proteinase (pdb m n). the comparison showed a difference of − . kcal/mol between the binding of lopinavir to the template and to the pdb 3d structure of the enzyme (pdb m n), and a difference of − . kcal/mol between the pdb 3d structure of the enzyme and the target protein (tables and ). the same comparative study was repeated for the binding of ritonavir, and a difference of − . and − . 
kcal/mol was observed upon the binding of the drug to the template and target proteins, respectively, in comparison with binding to the 3d structure of the enzyme downloaded from the pdb. the observed consistency in the binding energies of the drug candidates can also serve as a reference for the validity and quality of the modeled protein, which exhibited high sequence and structural similarity to the 3d structure downloaded from the protein data bank (fig. ). in an effort to make available potent therapeutic agents against the fast-rising novel coronavirus epidemic, we identified the coding region from the viral genome and modeled the main proteinase of the virus, coupled with an evaluation of the efficacy of existing hiv protease inhibitors targeting the protein active site using a blind docking approach. our study has shown that lopinavir displays broader-spectrum inhibition against both the sars coronavirus and the -ncov main proteinase compared to the inhibition profile of ritonavir. the modeled 3d structure of the enzyme has also provided interesting insights regarding the binding orientation of the experimental drugs and possible interactions at the protein active site. however, the conclusion from the study of cao et al., as previously discussed, has shown that administration of the lopinavir-ritonavir therapy might elicit additional health concerns as a result of the extreme adverse events experienced by the experimental subjects of that study. it was also observed that the drugs showed no increased benefit when compared with standard supportive care. in view of these findings, we therefore suggest a drug modification approach aimed at avoiding the health concerns posed by the combined lopinavir-ritonavir therapy while retaining its proteinase inhibitory activity. supplementary information accompanies this paper at https://doi.org/ . /s - - - . additional file . 
supplementary information to this article can be found online at https://www.rcsb.org/structure/ m n
- clinical features of patients with novel coronavirus in wuhan
- genomic characterization and epidemiology of novel coronavirus: implications of virus origins and receptor binding
- a novel coronavirus from patients with pneumonia in china
- a familial cluster of pneumonia associated with the novel coronavirus indicating person-to-person transmission: a study of a family cluster
- importation and human-to-human transmission of a novel coronavirus in vietnam
- national health commission of the people's republic of china
- transmission of -ncov infection from an asymptomatic contact in germany
- alert, verification and public health management of sars in the post-outbreak period
- coronavirus in severe acute respiratory syndrome (sars)
- a novel coronavirus and sars
- crystal structures of the main peptidase from the sars coronavirus inhibited by a substrate-like aza-peptide epoxide
- dissection study on the sars c-like protease reveals the critical role of the extra domain in dimerization of the enzyme: defining the extra domain as a new target for design of highly-specific protease inhibitors
- c-like proteinase from sars coronavirus catalyzes substrate hydrolysis by a general base mechanism
- only one protomer is active in the dimer of sars c-like proteinase
- biosynthesis, purification, and substrate specificity of severe acute respiratory syndrome coronavirus c-like proteinase
- a trial of lopinavir-ritonavir in adults hospitalized with severe covid-
- emboss: the european molecular biology open software suite
- srs, an indexing and retrieval tool for flat file data libraries
- issues in bioinformatics benchmarking: the case study of multiple sequence alignment
- hhblits: lightning-fast iterative protein sequence searching by hmm-hmm alignment
- the swiss-prot protein knowledgebase and its supplement trembl in
- toward the estimation of the absolute quality of individual protein structure models
- molprobity: more and better reference data for improved all-atom structure validation
- chapter : protein composition and structure
- modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology
- ucsf chimera - a visualization system for exploratory research and analysis
- fasman gd ( ) prediction of protein conformation
- protein identification and analysis tools on the expasy server
- the rapid generation of mutation data matrices from protein sequences
- mega : molecular evolutionary genetics analysis version . for bigger datasets
- chemoinformatics: theory, practice, & products
- critical assessment of the automated autodock as a new docking tool for virtual screening
- critical assessment of methods of protein structure prediction (casp) round
- visualizing and quantifying molecular goodness-of-fit: small-probe contact dots with explicit hydrogen atoms
- a test of enhancing model accuracy in high-throughput crystallography
- the penultimate rotamer library
- protein geometry database: a flexible engine to explore backbone conformations and their relationships to covalent geometry
- circular dichroism spectrum of peptides in the poly(pro)ii conformation
- calculation of protein extinction coefficients from amino acid sequence data
- universality and structure of the n-end rule
- the n-end rule in bacteria
- correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence
- thermostability and aliphatic index of globular proteins
- alpha helices are more robust to mutations than beta strands
- global analysis of protein folding using massively parallel design, synthesis, and testing
- time of the deepest root for polymorphism in human mitochondrial dna
- intraspecific nucleotide sequence differences in the major noncoding region of human mitochondrial dna
- limitation of the evolutionary parsimony method of phylogenetic analysis
- efficient docking of peptides to proteins without prior knowledge of the binding site
- molecular recognition and docking algorithms

we appreciate the leadership of the laboratory of cellular dynamics (lcd), university of science and technology of china, for the all-around support and academic advisory role. we also acknowledge the strong support from the ustc office of international cooperation all through the challenging period of the coronavirus epidemic. the authors received no funding for this project from any organization. ethics approval and consent to participate: not applicable. the authors declare that they have no competing interests.

key: cord- -ajzk rq
authors: van weezep, erik; kooi, engbert a.; van rijn, piet a.
title: pcr diagnostics: in silico validation by an automated tool using freely available software programs
date: - -
journal: j virol methods
doi: . /j.jviromet. . .
sha:
doc_id:
cord_uid: ajzk rq

pcr diagnostics are often the first line of laboratory diagnostics and are regularly designed to either differentiate between or detect all pathogen variants of a family, genus or species. the ideal pcr test detects all variants of the target pathogen, including newly discovered and emerging variants, while closely related pathogens and their variants should not be detected. this is challenging, as pathogens show a high degree of genetic variation due to genetic drift, adaptation and evolution. therefore, frequent re-evaluation of pcr diagnostics is needed to monitor their usefulness. validation of pcr diagnostics recognizes three stages: in silico, in vitro and in vivo validation. in vitro and in vivo testing are usually costly, labour-intensive and imply a risk of handling dangerous pathogens. in silico validation reduces this burden. in silico validation checks primers and probes by comparing their sequences with available nucleotide sequences. in recent years the amount of available sequences has dramatically increased by high-throughput and deep sequencing projects.
this makes in silico validation more informative, but also more computing-intensive. to facilitate validation of pcr tests, a software tool named pcrv was developed. pcrv consists of a user-friendly graphical user interface and coordinates the use of the software programs clustalw and ssearch in order to perform in silico validation of pcr tests of different formats. the use of internal control sequences makes the analysis compliant with laboratory quality control systems. finally, pcrv generates a validation report that includes an overview as well as a list of detailed results. in-house developed, published and oie-recommended pcr tests were easily (re-)evaluated using pcrv. to demonstrate the power of pcrv, the in silico validation of several pcr tests is shown and discussed. pathogens exhibit genetic variation as a result of genetic drift, adaptation and evolution, but also by random variation. since the late nineties of the th century, owing to improved sequencing techniques and high-throughput sequencing machines, the number of sequences submitted to databases like genbank ® has increased exponentially. this has resulted in an enormous increase in identified variants and quasi-species, as well as in sequences of newly discovered pathogens from all over the world. a few examples are the discovery of coronaviruses causing severe acute respiratory syndrome (sars) and middle east respiratory syndrome (mers), nipah and hendra viruses, atypical pestiviruses, atypical and new serotypes of bluetongue virus, schmallenberg virus and new variants of avian influenza viruses (chua et al., ; demmler and ligon, ; drosten et al., ; hoffmann et al., ; hofmann et al., ; maan et al., ; marcacci et al., ; schirrmeier et al., ; van boheemen et al., ; wang, ; zientara et al., ) . currently, in many countries, the first line of pathogen detection is real-time pcr diagnostics. 
favourably, pcr tests can be highly sensitive and specific, and they are often designed to detect all variants of a defined family, genus or species while not detecting closely related pathogens. in addition, pcrv can also be used to validate in silico pcr assays that differentiate between lineages, serotypes or variants. pcr targets must therefore be unique and highly conserved. nonetheless, false negative results can arise through genetic drift or the emergence of new variants, while false positive results can be caused by new variants of closely related pathogens. it is therefore important to frequently re-evaluate and, if necessary, redesign pcr tests, taking sequences of newly discovered pathogen variants into account. validation of pcr diagnostics should be organized in three stages: in silico, in vitro and in vivo validation. in silico validation covers the inventory of sequences in a nucleotide database that match or do not match the pcr target sequence. matching sequences determine the in silico sensitivity (detection of all variants), while non-matching sequences support the in silico specificity (selective detection of variants of the respective group of pathogens). in vitro and in vivo validation include testing of cultured pathogens and of field samples of defined positive and negative status. in vitro and in vivo validation for all virus variants is practically impossible and extremely costly. moreover, not every pathogen variant has been cultured or isolated, and transport and handling of pathogens can raise safety issues. in contrast, sequences rapidly become available through high-throughput and deep sequencing, even without culturing of pathogens. therefore, in silico re-evaluation of validated pcr diagnostics is, and will remain, an attractive alternative for obtaining detailed insight into the detection of circulating and (re-)emerging virus variants, and it should be executed frequently. 
it will, however, become an increasingly demanding task due to the rapid growth in available sequences and full genome sequences of numerous species. we developed a software tool named pcrv to facilitate in silico validation of pcr tests, entirely based on freely available software programs. pcrv links these freely available software programs to automate the whole process, reduces labour, and generates a validation report that includes a brief summary as well as a list of detailed results. the software tool pcrv is written in the python programming language. pcrv consists of a user-friendly graphical user interface and coordinates the use of the software programs clustalw . (larkin et al., ; thompson et al., ) and ssearch (brenner et al., ; pearson, ; pearson et al., ) to perform in silico validation. pcrv is suitable for determining the in silico sensitivity (conservation of sequences) and in silico specificity (selectivity) of different pcr formats. to monitor the performance of pcrv, a set of flagged internal control sequences (fics) is randomly added to the sequence database. pcrv processes the data, analyses the results, and generates a validation report that includes a summarizing table as well as a list of detailed results for an easy check of potential false positives and false negatives. an overview of all actions executed by pcrv is shown in fig. . the sequences of a target organism are downloaded from the national center for biotechnology information (ncbi) database (https://www.ncbi.nlm.nih.gov/nuccore/) using the respective taxonomy id number as search query. this guarantees that all available sequences of the defined taxon in the database are downloaded. to generate a multiple sequence alignment (msa) of these sequences, a full genome sequence was selected as a reference sequence. genome segments of pathogens with a segmented genome were concatenated to serve as an artificial full-length genome. 
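the concatenation step for segmented genomes can be sketched as follows; the segment identifiers and sequences are hypothetical:

```python
# sketch: concatenate genome segments into one artificial full-length
# reference, remembering where each segment starts so positions in the
# combined reference can be mapped back to the original segments.
def concatenate_segments(segments):
    """segments: list of (segment_id, sequence); returns (combined_seq, offsets)."""
    offsets, pos, parts = {}, 0, []
    for seg_id, seq in segments:
        offsets[seg_id] = pos          # start position of this segment in the reference
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), offsets

segs = [("seg1", "ATGC"), ("seg2", "GGAA"), ("seg3", "TT")]  # toy segments
ref, offsets = concatenate_segments(segs)
print(ref)      # 'ATGCGGAATT'
print(offsets)  # {'seg1': 0, 'seg2': 4, 'seg3': 8}
```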
if a full genome sequence was not available, a representative large sequence of the taxon was selected as a reference sequence. a prerequisite is that this partial sequence contains the full target of the pcr test being validated. in order to drastically reduce computing time, pairwise alignments were calculated for each downloaded sequence against the reference sequence using the software program clustalw . (larkin et al., ) . to correct for orientation errors in the database sequences, alignment in the reverse complement orientation was also attempted. a score was calculated using the following scoring scheme: match (+ ), mismatch (− ), point deletion or gap (− ), and every next adjacent point deletion (− ). the aligned orientation with the highest score was selected. to enable efficient alignment of large sequences, these were segmented into fragments of , nucleotides in length, individually aligned to the reference sequence and subsequently combined into one pairwise alignment. pcrv combined all individual pairwise alignments into one multiple sequence alignment (msa), including the pairwise alignments of the primers and probes. the calculation of the msa was performed on a computer with an intel ® xeon(r) cpu e - v @ . ghz processor and gb of internal memory. the regions corresponding to primers and probes were selected from the msa to construct a conservation plot sorted by decreasing total number of mismatches. the in silico sensitivity was expressed as the percentage of hits with a cut-off value of a maximum of one mismatch per primer or probe. the entire nucleotide sequence database (compressed gzip file: nt.gz) was downloaded from the ncbi ftp-website (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) using pcrv. the integrity of the download was confirmed by calculating the md checksum and comparing it with the checksum published on the ftp-website (file nt.gz.md ). 
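the orientation check described above can be sketched as a small scorer: score both orientations and keep the better one. the penalty values used here are illustrative assumptions, since the actual digits of the scoring scheme are elided in the text:

```python
# sketch of the alignment scoring scheme: match bonus, mismatch penalty,
# and a larger penalty for opening a gap than for extending it. values
# below are illustrative assumptions, not the tool's actual parameters.
MATCH, MISMATCH, GAP_OPEN, GAP_EXT = 1, -1, -2, -1

def score_alignment(a, b):
    """score two equal-length aligned strings, where '-' marks a gap."""
    score, in_gap = 0, False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += GAP_EXT if in_gap else GAP_OPEN
            in_gap = True
        else:
            score += MATCH if x == y else MISMATCH
            in_gap = False
    return score

def revcomp(seq):
    """reverse complement, used to test the opposite orientation."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[n] for n in reversed(seq))

ref, query = "ACGT-ACGT", "ACGTTACGA"
print(score_alignment(ref, query))  # 7 matches, 1 gap opening, 1 mismatch -> 4
print(revcomp("AACG"))              # 'CGTT'
```

in the tool itself, the query would be aligned in both orientations and the orientation with the higher score retained.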
pcrv processed the data stream during download with several optimizations to improve the analysis. nucleotide code 'n' was replaced by the meaningless code 'z', which prevents an infinite number of hits in the alignment search. the data stream was unpacked and subdivided into multiple fasta-formatted text files. fasta files with a maximum size of mb were sequentially numbered and stored, because the ncbi nucleotide database is too large to be analysed all at once. to increase the accuracy of the alignment search (see discussion), large sequences were fragmented into sequences of maximally nucleotides with an overlap of nucleotides, to prevent the loss of hits of primer or probe sequences spanning the split site. fragmented sequences were tagged with a unique code allowing reconstruction of the original sequence. any nucleotide database in fasta format is compatible and could be added. flagged internal control sequences (fics) were added to enable validation of the alignment search. fics consisted of randomly generated sequences of nucleotides in length containing the primer and probe sequences of the pcr test being validated. primer and probe sequences were inserted in all possible combinations and orientations potentially initiating amplification (fig. ). multiple copies of each combination were inserted with an increasing number of randomly introduced mismatches, from - in each primer and probe sequence (fig. ). in total, ten copies of each control sequence per number of mismatches were linearly spread through each mb fasta file. an alignment search was performed with the default expectancy threshold value on all fasta files, using the primers and probes of the pcr test as search queries and the program ssearch available in the fasta sequence analysis package (brenner et al., ; pearson, ; pearson et al., ) . pcrv produced a list of hits of the alignment search for all possible primer/probe combinations potentially leading to detectable amplicons. hits of fics were stored separately. 
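the fragmentation step described above can be sketched as an overlapping sliding window with a reconstruction tag per piece; the fragment size, overlap and tag format below are illustrative, since the actual digits are elided in the text:

```python
# sketch: split a large database sequence into fixed-size pieces with an
# overlap, so a primer/probe hit spanning a split site is not lost, and
# tag each piece with its source id and start position for reconstruction.
def fragment(seq_id, seq, size=1000, overlap=50):
    """yield (tag, subsequence) pieces covering seq with overlapping windows."""
    step = size - overlap
    pieces, start, part = [], 0, 0
    while start < len(seq):
        pieces.append((f"{seq_id}|frag{part}|{start}", seq[start:start + size]))
        if start + size >= len(seq):
            break                      # last window already reaches the end
        start += step
        part += 1
    return pieces

pieces = fragment("ACC123", "A" * 2500, size=1000, overlap=50)
print(len(pieces))    # 3 overlapping fragments
print(pieces[1][0])   # 'ACC123|frag1|950'
```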
the percentage of returned hits of control sequences with an increasing number of mismatches was indicative of the sensitivity and accuracy of the alignment search per mb fasta file. the maximum number of returned mismatches in the control sequences was determined using the spearman-kärber method and demonstrated the validity of the computing process (wulff et al., ). an aborted search caused by an unknown error was visible from the incompleteness of the returned fics. if the accuracy of the alignment search was not acceptable, the search was repeated with a higher expectancy threshold value, which usually resulted in a longer analysis time. the specificity check was limited to a maximum of nucleotides in amplicon length and up to four mismatches per primer or probe. this limitation was, however, not applied to the fics, in order to fully ascertain the validity of the executed alignment search. (fig. : a) fics consist of randomly generated sequences of nucleotides in length containing the primer and probe sequences of the pcr test being validated. multiple copies were inserted with an increasing number of randomly introduced mismatches, from - in each primer and probe sequence. ten copies of each fics per number of mismatches were linearly spread through each mb fasta file. b) overview of all eight possible combinations of positional orientations of forward primer (fwd), reverse primer (rev) and probe used as fics, all of which are capable of initiating a (nonspecific) amplification reaction in combination with a detectable probe signal. combinations of primers and probes according to other pcr formats (e.g. nested pcr, pcr using hybridisation probes or a hydrolysis probe) are also supported by pcrv but are not shown.) hits were interpreted as specific or nonspecific according to the taxonomy-classified sequences used to generate the msa. the in silico specificity is expressed as the percentage of specific hits of taxonomy-classified sequences with a maximum of one mismatch per primer or probe, as these are considered to be detected by the respective pcr test. to demonstrate the suitability of our in-house developed software tool pcrv, we determined the in silico sensitivity and specificity of three pcr tests for west nile virus (wnv) recommended by the world organisation for animal health (oie) (eiden et al., ; johnson et al., ) . these wnv pcr tests represent three different formats: a real-time pcr test, a conventional pcr test and a nested pcr test (table ) . available west nile virus nucleotide sequences were downloaded from the ncbi website using taxonomy id , (search query ncbi:txid on january th , ). in total, the download contained , wnv sequences. a msa was calculated using the full genome sequence with accession number nc_ as a reference sequence (borisevich et al., ) . primer and probe sequences were included in the alignment. the calculation of the msa with pcrv was completed in about . h. a limited number of - % of the aligned sequences encompassed the locations of the primers or probes of the selected oie-recommended wnv pcr tests. the regions corresponding to primers and probes were taken from the alignment in order to construct a conservation plot. detailed results were sorted according to the number of mismatches to easily select individual sequences with > mismatch in order to check their origin (supplemented data a). note that sequences incorrectly classified as wnv, as well as synthetically derived sequences, should be discarded as irrelevant. results of the conservation plot were summarized according to the number of mismatches, up to a maximum of four mismatches per primer or probe (table ). the overall in silico sensitivity of each pcr test was calculated and expressed as the percentage of sequences with a maximum of one mismatch per primer or probe. 
the real-time pcr test for wnv showed the highest in silico sensitivity of . % ( . % + . %). the conventional and nested pcr tests showed an in silico sensitivity of . % and . %, respectively. the entire nucleotide sequence database was downloaded from the ncbi ftp-website as a compressed gzip file (nt.gz) of gb on january th , . the download was valid according to the calculated md checksum. the conventional and nested pcr tests have been described previously (johnson et al., ) , as has the real-time pcr test (eiden et al., ) . an alignment search with primer and probe sequences was performed with a cut-off expectation value e of . the search per pcr test was completed in less than two hours. about . - . million individual primer and probe alignment hits were found and processed by pcrv as described (fig. ) . fics were found homogeneously in all database files, indicating that the alignment search was completed properly. fics for each pcr test were returned with a mean of . - . mismatches per primer or probe, demonstrating the completeness and acceptable accuracy of the alignment search (table ) . potential amplicons were interpreted as specific or non-specific according to the presence of their ncbi accession numbers in the list of sequences used for the in silico sensitivity check (table ) . we noticed that the number of specific hits differed from the numbers scored by the in silico sensitivity check (table ) ; however, several reasons for this apparent inconsistency can be considered (see discussion). in summary, using the wnv pcr tests as an example, pcrv easily determined the in silico sensitivity and specificity of pcr tests of different formats in a highly automated manner. all results are included in the validation report generated by pcrv, such as a summarizing table of results, the conservation plot and a list of nonspecific hits. the summarizing table clearly demonstrates the differences in in silico sensitivity and specificity between these pcr tests (table ). 
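the in silico sensitivity figures reported above are, in essence, a tally over the conservation plot: a sequence counts as detected when no primer or probe carries more than one mismatch. a minimal sketch, with hypothetical mismatch counts:

```python
# sketch of the in silico sensitivity calculation: percentage of
# sequences in which every primer and probe has at most max_mm
# mismatches. the per-sequence mismatch counts below are hypothetical.
def in_silico_sensitivity(mismatch_table, max_mm=1):
    """mismatch_table: list of per-sequence dicts {oligo_name: mismatches}."""
    detected = sum(
        1 for seq in mismatch_table if all(mm <= max_mm for mm in seq.values())
    )
    return 100.0 * detected / len(mismatch_table)

table = [
    {"fwd": 0, "rev": 0, "probe": 0},   # perfect match -> detected
    {"fwd": 1, "rev": 0, "probe": 1},   # one mismatch per oligo -> still detected
    {"fwd": 2, "rev": 0, "probe": 0},   # too many mismatches in fwd primer
    {"fwd": 0, "rev": 4, "probe": 0},   # too many mismatches in rev primer
]
print(in_silico_sensitivity(table))     # 50.0
```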
in addition, the detailed conservation plot (supplemented data a) and the detailed list of nonspecific hits with up to mismatches per primer or probe (supplemented data b) support a manual check of individual sequences for correctness, background, submission details and other information. validation of diagnostics by testing all variants of a target pathogen in cultured or field samples, named in vitro and in vivo validation, respectively, is hardly feasible. because of the availability of pathogen sequences in databases, checking the conservation and uniqueness of primer and probe sequences, so-called in silico validation, has become an attractive and reliable alternative for (re-)evaluating the specificity and sensitivity of molecular diagnostics. the exponential expansion of available sequences, genetic drift of pathogens and discovery of new pathogens drive the need to frequently validate established pcr tests. this, however, also becomes an increasingly significant effort. we automated the in silico validation process by integrating freely available software programs into a single tool named pcrv. public databases, such as ncbi, as well as other available databases and sequences formatted in single-sequence fasta files, are compatible with pcrv. pcrv generates a multiple sequence alignment (msa) using a selected reference sequence, which is preferably a full-length genome but at least a large partial sequence encompassing the pcr target. the software program clustalw . (larkin et al., ) is used to calculate pairwise alignments of each sequence to the reference sequence, and subsequently an msa is generated from these pairwise alignments. this strategy drastically reduces calculation time, in particular for large numbers of sequences. additionally, more than one reference sequence could be used to improve the generation of an msa in the case of extreme variability among a group of pathogens. 
the msa is used to determine the in silico sensitivity, since this approach is less prone to missing sequences with mismatches in primers or probes (not shown). for example, sequences with numerous mismatches in one of the primers or probes will not be found by an alignment search using these primer or probe sequences as search queries; however, such sequences will be present in the msa (see the conservation plots of the wnv pcr tests). supplemented data a shows the summarised conservation plots (without accession numbers) of the three wnv pcr tests. pcrv generates a conservation plot listing all hits in order of decreasing number of mismatches. hits with the most mismatches need attention, as these could lead to false negative pcr results. we calculated and defined the in silico sensitivity as the percentage of hits with a maximum of one mismatch per primer or probe, as these are assumed to be detected by the respective pcr test. the software program ssearch, available in the fasta sequence analysis package from the university of virginia (pearson, ) , uses a calculated expectation value e in combination with a supplied threshold value to determine whether a hit is returned. the expectation value e depends on the number and length of sequences in the database. consequently, the e value of a search hit depends on the location of the found sequence in the database. large sequences are therefore segmented into fragments of at most nucleotides in length. this reduces the variability in sequence length, leading to a more homogeneous sensitivity of ssearch across the database and improving the overall sensitivity of ssearch. the sensitivity of the well-known and commonly used blastn alignment search program was compared to that of ssearch (fig. ) . clearly, ssearch returns % of the primers with up to six mismatches. in contrast, the percentage of returns with blastn is slightly less than % for three mismatches and rapidly declines with an increasing number of mismatches. 
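the sensitivity definition above translates directly into code. a minimal sketch, assuming mismatch counts per oligo have already been extracted from the alignment hits (all names are hypothetical):

```python
def in_silico_sensitivity(hits, max_mismatches=1):
    """hits: one dict per database sequence, mapping oligo name -> mismatch
    count. A sequence counts as 'detected' only if every primer and probe
    has at most `max_mismatches` mismatches; the sensitivity is the
    percentage of detected sequences."""
    detected = sum(
        1 for hit in hits
        if all(mm <= max_mismatches for mm in hit.values())
    )
    return 100.0 * detected / len(hits)

hits = [
    {"fwd": 0, "rev": 1, "probe": 0},  # detected
    {"fwd": 0, "rev": 2, "probe": 0},  # missed: 2 mismatches in rev primer
    {"fwd": 1, "rev": 0, "probe": 1},  # detected
]
# in_silico_sensitivity(hits) -> 66.66...
```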
we conclude that ssearch is much more accurate, and thus more suitable than blastn, for determining the in silico specificity. (fig. . comparison of the accuracy of an alignment search performed by the blastn and the ssearch software programs. a test database of randomly generated nucleotide sequences was generated, containing , sequences of nucleotides in length. each sequence contained a primer sequence of nucleotides in length, and each primer contained - randomly introduced mismatches. the cut-off expectation value e used in both programs was . the inserted primer with up to mismatches was completely returned with blastn, whereas ssearch completely returned the primer with up to mismatches.) we also noticed that blastn tends to find partial/fractional nucleotide alignment hits, which is not desirable for primers and probes. in addition, pcrv using ssearch is suitable for use in a laboratory quality control system, since the search process is monitored per mb fasta file for completeness and accuracy/sensitivity by means of returned hits of flagged internal control sequences (fics). an overview of this monitoring is added to the validation report. examples of incomplete, inaccurate or low-sensitivity alignment searches are presented (supplemented data c). in case alignment search results are insufficient, the threshold value can be changed to increase the sensitivity, but the calculation time will also increase. here, we showed in silico validation results of wnv pcr tests of different formats as an example. pcrv was also used to validate real time pcr tests at wbvr (fig. ) . ssearch quantifies hits for any combination of primers and probes potentially leading to detectable amplicons (fig. ) . this can result in more hits for the in silico specificity check by ssearch than for the in silico sensitivity check by clustalw . . 
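a benchmark database like the one described in the figure legend can be generated along these lines. this is a hedged sketch (function names are illustrative, and the paper's exact sequence counts and lengths are elided in the text, so small example values are used instead):

```python
import random

BASES = "ACGT"

def mutate(primer: str, n_mismatches: int, rng: random.Random) -> str:
    """Return the primer with exactly n_mismatches random substitutions."""
    seq = list(primer)
    for pos in rng.sample(range(len(seq)), n_mismatches):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)

def make_test_db(primer, n_seqs, seq_len, max_mm, rng=None):
    """Random background sequences, each carrying one mutated primer copy
    embedded at a random position, as used to benchmark blastn vs ssearch."""
    rng = rng or random.Random(0)
    records = []
    for _ in range(n_seqs):
        background = "".join(rng.choice(BASES) for _ in range(seq_len))
        variant = mutate(primer, rng.randint(0, max_mm), rng)
        pos = rng.randrange(seq_len - len(primer))
        records.append(background[:pos] + variant + background[pos + len(primer):])
    return records

db = make_test_db("ACGTACGTACGTACGTACGT", n_seqs=100, seq_len=500, max_mm=6)
```

running both search programs over such a database and counting how many of the planted primers are returned per mismatch class reproduces the kind of comparison shown in the figure.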
for example, sequences partially overlapping with the pcr target sequence will not be found by the in silico specificity check, since this check only finds complete amplicons. further, ncbi only stores unique nucleotide sequences in its downloadable database export file "nt.gz". identical sequences are combined into one sequence, with the sequence name being a concatenation of all individual sequence names separated by the ascii code . pcrv does not recognize merged names as multiple sequences, resulting in fewer hits by ssearch. detailed analysis of in silico validation results enables a focus on specific test problems, as shown for the wbvr pcr test for peste des petits ruminants virus (pprv), which presumably does not detect pprv strain ghana because of three mismatches in the probe sequence. indeed, the pcr target of this pprv strain was amplified but was not detected by the taqman probe (van rijn et al., a) . we used pcrv to analyse oie-recommended and published pcr tests for other pathogens in order to select the best option for implementation in laboratory diagnostics. as part of preparedness for incursions, frequent in silico (re-)validation could also show the need to adapt operational pcr tests to emerging epidemics caused by new variants in other parts of the world. pcrv depends on compatible and reliable nucleotide databases. for example, in silico validation by pcrv depends on the submission of accurately determined sequences coded with the correct taxonomy id number. classical swine fever virus (csfv) sequences that are taxonomically classified as bovine viral diarrhoea virus type (bvdv ii) were consequently interpreted as false positives in the csfv pcr test and as false negatives in the bvdv pcr test. further, in our example of wnv pcr tests, five nonspecific hits appeared to be sequences without a taxonomy id. 
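expanding merged deflines back into individual sequence names could look like the sketch below. note that the separator byte used here is an assumption for illustration; the specific ascii code is elided in the text above.

```python
# The separator byte is an assumption here; the specific ASCII code is not
# given in the text above.
SEPARATOR = "\x01"

def expand_merged_defline(defline: str):
    """Split a concatenated defline back into its individual sequence names,
    so one merged record can be counted as multiple sequences."""
    return [name for name in defline.split(SEPARATOR) if name]

names = expand_merged_defline("gi|1|ref|A\x01gi|2|ref|B")
# two underlying records behind one stored sequence
```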
still, these sequences are definitely wnv sequences, although out of nonspecific hits have been synthetically derived (supplemented data b). on the other hand, a more specific taxonomy classification or labelling of sequences in databases could be used for the development of pcr tests specific for subspecies, serotypes or lineages. considering the expected rapid expansion of available sequences, pcrv will be further improved by allowing incremental analyses, in which only sequences newly submitted since the previous analysis are processed. this will keep the required analysis time manageable for in silico re-validation of pcr tests. the numbers of hits for the in silico sensitivity and specificity are not representative of the field situation but represent that of the sequences in the database. in other words, the percentages could be skewed by a small number of sequences in the database, or by a large number of very closely related sequences resulting from intensive sequencing during a single epidemic. submitted sequences are sometimes not trimmed for synthetic adaptors such as pcr primers, causing misleading positive analysis results. synthetic or optimized genes of pathogens can lead to misleading negative pcrv results. synthetic and genetically modified sequences should therefore be labelled as 'non-natural' in databases to prevent misleading results of in silico validation efforts. finally, negative pcrv results can be created on purpose by the development of diva (differentiating infected from vaccinated) vaccine viruses with a deleted or mutated diva target, like the ge deletion mutants of bovine herpes virus type and pseudorabies virus (kaashoek et al., ; van oirschot et al., ) , the ns deletion mutants of bluetongue virus and african horse sickness virus (feenstra et al., ; van rijn et al., b; van rijn et al., ) , and the live-attenuated lumpy skin disease (lsd) vaccine (agianniotaki et al., ) . 
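the incremental analysis proposed here reduces, at its core, to a set difference over accession numbers between two database exports. a minimal sketch (names and example accessions are hypothetical):

```python
def new_accessions(previous: set, current: set) -> set:
    """Accessions present in the current database export but absent from the
    previously analysed one; only these need to be re-aligned."""
    return current - previous

old = {"MN123", "MN124", "MN125"}
new = {"MN123", "MN124", "MN125", "MN900", "MN901"}
todo = new_accessions(old, new)
# todo == {"MN900", "MN901"}
```

in practice, previously computed alignment results for unchanged accessions would be cached and merged with the results for the new subset.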
viral pathogens belonging to the same taxon but showing extreme variation in their sequence cannot be aggregated in one msa using a single reference sequence. further, large-scale genomic rearrangements, such as duplications, deletions, insertions, inversions, and translocations, are very common in the genomes of bacterial pathogens and will undoubtedly challenge the calculation of a msa, if one is possible at all. currently, we are investigating alignment-free analysis methods to address these challenges. moreover, we foresee the development of a next generation in silico tool, partially based on pcrv, to find highly conserved targets for new or confirmatory pcr tests. fig. . overview of the in silico sensitivity and specificity of several real time pcr tests at wbvr as determined by pcrv. the in silico sensitivity of pcr tests is expressed as the percentage of hits with a maximum of one mismatch per primer or probe (squares, line). the in silico specificity is expressed as the percentage of specific hits with mismatches (black) and mismatch per primer or probe (grey). real time pcr tests are indicated: wnv, west-nile virus (eiden et al., ; johnson et al., ) ; btv, bluetongue virus (van rijn et al., ) ; pprv, peste des petits ruminants virus (van rijn et al., a) ; ahsv_s , african horse sickness virus segment (van rijn et al., b) ; ahsv_s , african horse sickness virus segment (van rijn et al., b) . in-house developed assays: rvfv, rift valley fever virus; sgpv, sheep-and-goat pox virus; ehdv-a, epizootic haemorrhagic disease virus test a; ehdv-b, epizootic haemorrhagic disease virus test b; eav, equine arteritis virus; eblv- , european bat lyssavirus type ; csfv, classical swine fever virus; asfv, african swine fever virus; prv-gb, pseudorabies virus glycoprotein gene gb; prv-ge, pseudorabies virus glycoprotein gene ge. results of pcrv could demonstrate the need to optimize or redesign a pcr test, as for ehdv-a and ahsv_s . 
note: hits of non-natural sequences were not discarded. (for interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
references:
development and validation of a taqman probe-based real-time pcr method for the differentiation of wild type lumpy skin disease virus from vaccine virus strains
biological properties of chimeric west nile viruses
assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships
nipah virus: a recently emergent deadly paramyxovirus
severe acute respiratory syndrome (sars): a review of the history, epidemiology, prevention, and concerns for the future
identification of a novel coronavirus in patients with severe acute respiratory syndrome
two new real-time quantitative reverse transcription polymerase chain reaction assays with unique target sites for the specific and sensitive detection of lineages and west nile virus strains
vp -serotyped live-attenuated bluetongue virus without ns /ns a expression provides serotype-specific protection and enables diva
novel orthobunyavirus in cattle
genetic characterization of toggenburg orbivirus, a new bluetongue virus, from goats
detection of north american west nile virus in animal tissue by a reverse transcription-nested polymerase chain reaction assay
a conventionally attenuated glycoprotein e-negative strain of bovine herpesvirus type is an efficacious and safe vaccine
clustal w and clustal x version .
novel bluetongue virus serotype from kuwait
one after the other: a novel bluetongue virus strain related to toggenburg virus detected in the piedmont region
searching protein sequence libraries: comparison of the sensitivity and selectivity of the smith-waterman and fasta algorithms
query-seeded iterative sequence similarity searching improves selectivity - -fold
genetic and antigenic characterization of an atypical pestivirus isolate, a putative member of a novel pestivirus species
identification of common molecular subsequences
comparative biosequence metrics
multiple sequence alignment using clustalw and clustalx
genomic characterization of a newly discovered coronavirus associated with acute respiratory distress syndrome in humans
marker vaccines, virus protein-specific antibody assays and the control of aujeszky's disease
sustained high-throughput polymerase chain reaction diagnostics during the european epidemic of bluetongue virus serotype
bluetongue virus with mutated genome segment to differentiate infected from vaccinated animals: a genetic diva approach
recombinant newcastle disease viruses with targets for pcr diagnostics for rinderpest and peste des petits ruminants
diagnostic diva tests accompanying the disabled infectious single animal (disa) vaccine platform for african horse sickness
discovering novel zoonotic viruses
monte carlo simulation of the spearman-kaerber tcid
novel bluetongue virus in goats
the authors are grateful to colleagues of wbvr, in particular to jan boonstra and rené van gennip, for fruitful discussions and suggestions. this research was financially supported by project wot- - - of the dutch ministry of agriculture, nature and food quality (lnv) (wbvr-project number - ). all authors declare no conflict of interest. supplementary material related to this article can be found, in the online version, at doi:https://doi.org/ . /j.jviromet. . . . 
key: cord- -nc yf x
authors: wichmann, stefan; scherer, siegfried; ardern, zachary
title: computational design of genes encoding completely overlapping protein domains: influence of genetic code and taxonomic rank
date: - -
journal: biorxiv
doi: . / . . .
sha:
doc_id:
cord_uid: nc yf x
overlapping genes (olgs) with long protein-coding overlapping sequences are often excluded by genome annotation programs, with the exception of virus genomes. a recent study used a novel algorithm to construct olgs from arbitrary protein domain pairs and concluded that virus genes are best suited for creating olgs, a result which fitted with common assumptions. however, improving the sequence evaluation using hidden markov models shows that the previous result is an artifact originating from dataset-database biases. when parameters for olg design and evaluation are optimized, we find that . % of the constructed olg pairs score at least as highly as naturally occurring sequences, while . % of the artificial olgs cannot be distinguished from typical sequences in their protein family. constructed olg sequences are also indistinguishable from natural sequences in terms of amino acid identity and secondary structure, while the minimum nucleotide change required for overprinting an overlapping sequence can be as low as . % of the sequence. separate analysis of datasets containing only sequences from either archaea, bacteria, eukaryotes or viruses showed that, surprisingly, virus genes are much less suitable for designing olgs than bacterial or eukaryotic genes. an important factor influencing olg design is the structure of the standard genetic code. success rates in different reading frames strongly correlate with their respective code-determined amino acid constraints. there is a tendency indicating that the structure of the standard genetic code could be optimized in its ability to create olgs while conserving mutational robustness. 
the findings reported here add to the growing evidence that olgs should no longer be excluded in prokaryotic genome annotations. determining the factors facilitating the computational design of artificial overlapping genes may improve our understanding of the origin of these remarkable genetic constructs and may also open up exciting possibilities for synthetic biology. the triplet nature of the standard genetic code and double-stranded configuration of dna together enable more than one protein to be encoded within the same nucleotide sequence in different reading frames. this property of the code has long been known to be utilised in viruses [ , ] and there is increasing evidence for overlapping encoding in other organisms [ , , ] , including many genes fully embedded within other coding sequences in alternate reading frames [ ] . while a mutation in a stop codon can easily create a short, trivial overlap in neighbouring genes as a chance event, longer, non-trivial overlaps should only be maintained in a genome if the overlapping region encodes a part of the protein essential for its function for both genes. there are a few hypothetical reasons why genes might overlap, and the evidence for functional antisense overlaps in prokaryotes has been discussed in a recent review [ ] . while the reduction of genome size is particularly relevant only for some viruses [ , ] , it has also been studied in bacteria [ ] . effects on gene regulation [ ] conceivably could affect all organisms, for instance there is the possibility of co-expression of same-strand overlapping genes (olgs) with the mother gene, given that they are potentially expressed from the same mrna. genes within an antisense overlapping pair could also influence each other, for instance in a way similar to what has recently been termed a "noncontiguous operon", where genes in antisense to each other are nonetheless co-expressed as an operon [ ] . 
other proposed benefits of overlapping genes relate to templating structure on the existing 'mother gene': for genes directly in antisense ("- frame"), the creation of proteins with a polarity structure complementary to the gene on the antisense strand [ , , ] or, in the case of sense overlaps, a similar hydrophobicity profile [ ] . overlapping open reading frames may play an important role in the origin of de novo genes, exploring new territory in the total space of sequences and functions [ , , , ] . while most currently extant olgs are not taxonomically conserved and therefore appear to be evolutionarily young [ ] , one claimed example of an ancient olg pair comprises the two classes of aminoacyl-trna synthetases, which can be encoded in an overlapping manner [ , , ] . despite the many possible effects of overlapping genes (olgs), they are generally not considered a significant phenomenon outside of viruses, due perhaps to perceived difficulties in their evolution for some or all reading frames [ , ] , although the idea that they are more widespread has long been theorized [ , , ] . as a consequence, most gene prediction algorithms still exclude non-trivially overlapping genes [ ] , especially outside of bacteriophages and other viruses. the ncbi rules for the annotation of prokaryotic genes do not allow genes completely embedded in another gene in a different frame without individual justification [ ] . even in viruses, relatively few overlapping genes have been annotated, particularly antisense gene pairs, although more are regularly being discovered, including in the pandemic viruses hiv and sars-cov- [ , , ] . a recent study [ ] quantified the difficulty of constructing olgs by picking random pairs of protein domains and rewriting them so as to overlap, with an algorithm minimizing the amino acid changes in each domain. 
this is a new approach, as previous studies tried to create overlaps without changing the amino acid sequence of the two genes, which either resulted in a very limited overlap length [ ] or could only be done for very specific genes [ ] . they found that, remarkably, % of arbitrary protein domain pairs were able to overlap successfully in at least one of the reading frames investigated and at one of the two positions tested. virus domains were much more likely to create putatively functional overlaps than domains from prokaryotes or eukaryotes, as determined by blast searches of the swiss-prot database. this result suggests that creating overlaps is not as difficult as might be expected, implying that an abnormally high threshold of evidence, as compared to other gene types, should not be required for verifying their existence. this high success rate also opens up many possibilities for synthetic biology. for instance, mutations in overlapping regions are expected to be more deleterious on average, so an artificial genome with many olgs is not only smaller but also expected to be more stable over time on a population level, as mutations are more likely to be strongly selected against. a recent method for stabilizing synthetic genes [ ] , where an arbitrary orf was constructed to overlap a gene of interest and was concatenated with an essential gene downstream, could be taken a large step forward by overlapping whole genes, thereby creating a system where not only 'polar' mutations are selected against but also more minor mutations, if they also affect the mother gene. genome size has become a significant limiting factor for biomolecular computing, in which genetic programs are inserted into cells [ ] . existing compression methods [ ] could be greatly improved by using olgs, making more complex systems possible. in this context a well designed stable synthetic genome could include fail-safe measures, such that faulty genetic programs would shut down. 
here the algorithm provided in [ ] is used, but with an improved evaluation of the constructed sequences, as the analysis in the previous study has some weaknesses resulting in incorrect claims. determining whether an artificial sequence has a specific function from its amino acid sequence alone is a very hard problem and is not possible today. progress is being made in predicting protein structure from amino acid sequence [ ] , but protein structure does not determine function, as essential binding sites can be rendered useless if a key amino acid is changed without changing the overall protein structure. ultimately only experiment can definitively determine the function of a given amino acid sequence. in order to aid the design of expensive experimental setups, however, it can at least be determined bioinformatically how similar an artificial sequence is to sequences with known functions. in this study the artificially designed sequences are compared to their original sequences in terms of amino acid identity, amino acid similarity, hidden markov model profile and secondary structure in order to determine the impact of olg construction and which sequences are potentially functional. firstly, the details of how some technical artifacts arose are explained, along with how to avoid them. in order to further improve the analysis, hidden markov models rather than blast are used in this study. while the previous study [ ] tried to estimate an upper limit on how many domains can be successfully overlapped in at least one reading frame and position, here the average success rate for olg construction is determined instead, which is more relevant both for understanding constraints on the formation rate of naturally occurring olgs and for assessing the likelihood of successful synthetic creation of olgs. 
these results in one sense give an upper estimate of the ease of creating overlaps as the difficulty of obtaining an overlapping gene pair naturally is not directly addressed here. on the other hand, overlapping functional domains directly is a "worst case scenario" as there is some evidence that the critical functional domains of one protein in an olg pair tend to overlap less constrained regions of the other protein [ ] , and this segregation is also intuitively plausible. in order to estimate the difficulty of achieving overprinting naturally, the minimal number of nucleotide changes needed to create the olg sequence is determined. whether functional domains do in fact overlap in nature, however, deserves further attention. by expanding the analysis of the previous study [ ] from the reading frames '+ ','- ' and '- ' to all reading frames (see fig. for reading frame definitions), the observed differences between reading frames can be related to the structure of the standard genetic code. through constructing olgs using randomly generated genetic codes it can be studied whether the standard code shows evidence of optimisation regarding olgs. using the improved evaluation of the designed olgs it can be shown that virus genes, surprisingly, are less suited than bacterial and eukaryotic genes to design olgs. figure : illustration of the alternative reading frames. the '+ ' frame is the standard or reference reading frame and '+ '/'+ ' the sense overlaps, while frames '- ' to '- ' are on the antisense strand. in [ ] constructed sequences were evaluated with a blast search against the swiss-prot database. if both overlapping sequences had a match to the best hit with at most an e-value of ^(- ) and a match length of %, the overlap was considered successful. however, the initial sequences were picked from the pfam seed database and it can be shown that most of the chosen sequences are not well represented in the swiss-prot database (see left panel in fig. 
) , with the exception of virus genes. in a search against the swiss-prot database, identities of over % were only found for % of the non-virus genes, while % of the virus genes could be found in this category. a curated set in which all sequences have a % match in the swiss-prot database but otherwise the same properties has a remarkable % success rate for overlaps, and the virus vs non-virus difference vanishes (see right panel in fig. ). the advantage reported for virus genes is thus fully explained by dataset-database biases. in any case, the extremely high overall success rate obtained should be investigated: either creating overlaps is indeed unexpectedly easy, or the evaluation of functionality used in [ ] is not conservative enough. it can be shown that both factors appear to contribute to the surprising result. fig. . left: sequences from the dataset of [ ] grouped by match identity in swiss-prot; virus genes from this dataset have a higher average identity to a swiss-prot entry than non-virus genes. right: percentage of functional olgs for the original dataset used in [ ] and the average of curated datasets, grouped into virus and non-virus genes. in the curated datasets all original sequences have an exact match in swiss-prot. each curated dataset has sequences with - amino acids. the virus versus non-virus difference observed in the dataset of [ ] vanishes for the curated datasets. when introducing the minimal number of changes required for two random sequences to fully overlap each other, a similar percentage of each sequence is expected to change. in such a case the e-values of the constructed sequences would be strongly length dependent, as a longer sequence with the same similarity has a lower probability of being found by chance in a database of a given size. when picking datasets with different sequence lengths, such a length-dependence can indeed be found in the blast evaluation (supplementary fig. s ). 
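the reading frame convention used throughout (see the frame illustration above) can be sketched as follows. the digits of the frame labels are elided in the text, so the conventional labels +1 to -3 are assumed here; the offset convention for antisense frames also varies between studies, so '-1' is simply taken to denote the unshifted reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reading_frames(seq: str):
    """Return the six reading frames of a DNA sequence: '+1' is the
    reference frame, '+2'/'+3' the sense shifts, '-1' to '-3' shifts of the
    reverse complement (offset convention assumed for illustration)."""
    rc = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for shift in range(3):
        frames[f"+{shift + 1}"] = seq[shift:]
        frames[f"-{shift + 1}"] = rc[shift:]
    return frames

frames = reading_frames("ATGGCC")
# frames["+1"] == "ATGGCC"; frames["-1"] == "GGCCAT"
```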
a fixed e-value cutoff cannot adequately evaluate sequences in such a situation, as the cutoff value fully determines the result and is chosen arbitrarily. the sequences used in [ ] have a length of - amino acids, and the high success rate for the curated set can be explained by a combination of the sequence length and the choice of the cutoff value. in order to find a reasonable alternative to the fixed e-value cutoff, hidden markov models (hmms) can be used to score the constructed sequences. here hmmer (v . . ) [ ] is used to create profiles for each protein domain family in the pfam database [ ] in order to score the constructed sequences. the pfam database consists of a 'seed' database, containing trusted sequences for each family which are used to create hmm profiles, and a 'full' database, containing all the sequences of the uniprot database sorted into the different families according to the previously constructed profiles. here the hmm profiles are also constructed from the 'seed' sequences, and in order to find the sequence most closely representing each profile, all 'full' sequences are tested against the profiles. the highest scoring sequences are used to construct olgs. the rest of the 'full' sequences are used as a comparison for the overlapping region of the constructed sequences. a constructed sequence is judged successful if it has a higher score than a sequence at a defined threshold percentile of the 'full' sequences, thereby creating a threshold value which is individual for each protein family. here results for different threshold percentiles are discussed, while highlighting two particular percentile values. firstly, the th percentile (median), which marks the score of a typical sequence in the protein family. in this analysis, sequences meeting this threshold cannot be distinguished from the naturally occurring protein domains and will be categorised as typical proteins. 
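the family-specific threshold construction can be sketched as follows: length-normalise the scores of the 'full' sequences, then take the score at a chosen percentile. the exact percentile values used are elided in the text, so 50 (the 'typical sequence' case) is shown; all function names are illustrative.

```python
def length_normalised(scores_and_lengths):
    """Divide each raw HMM score by its sequence length, as described above."""
    return [score / length for score, length in scores_and_lengths]

def family_threshold(full_scores, percentile):
    """Score at the given percentile of a family's 'full' sequences, with
    linear interpolation between ranked scores (50 -> median, i.e. the
    'typical sequence' threshold)."""
    ranked = sorted(full_scores)
    rank = (len(ranked) - 1) * percentile / 100.0
    low, frac = int(rank), rank - int(rank)
    if low + 1 == len(ranked):
        return ranked[low]
    return ranked[low] * (1 - frac) + ranked[low + 1] * frac

scores = length_normalised([(120.0, 100), (90.0, 100), (150.0, 100), (60.0, 100)])
median = family_threshold(scores, 50)   # 1.05
```

a constructed sequence whose length-normalised score exceeds the family's threshold would then be counted as a success for that percentile.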
since all sequences in the 'full' group are naturally occurring sequences, scoring at least as highly as any of these sequences renders a sequence biologically relevant. in order to avoid extreme outliers which may be misclassified, the th percentile is used as the biologically relevant threshold. a relative threshold could alternatively be established with, e.g., blast, by first picking a single sequence as a starting point for construction and also comparing the rest of the protein family to it in order to find the threshold score as described above. in this case, however, it is not clear which sequence to choose as a starting point: a randomly picked sequence could be an outlier of the protein family, resulting in unreliable comparison scores and a higher chance of losing function after constructing olgs. hmms, on the other hand, provide a profile reflecting the 'average' sequence, which is a better representative of the whole protein family. choosing a family-specific threshold value takes care of most of the length dependencies, but to be sure, and to be able to compare sequences of different lengths, each score resulting from a comparison between a sequence and a hmm profile is divided by the sequence length. here scores are used instead of e-values, as the latter also depend on the database size, an arbitrary factor in this analysis. by aligning the best sequence with the 'seed' sequences using mafft (v . ) [ ] , the weights used for sequence construction can be determined just as in [ ] . a more detailed description of the calculation of the weights and their influence can be found below. when studying the influence of a protein family's taxonomic classification on the construction of olgs, the 'seed' and the 'full' databases are first filtered by the four major taxonomic groups (archaea, bacteria, eukaryotes and viruses) before creating the profiles and the thresholds. muscle (v . . ) [ ] was used for realigning the 'seed' sequences after taxonomic filtering. 
for subsequent analyses, random sets from the ~ pfam families were chosen, with the condition that each family must have at least 'seed' sequences and 'full' sequences in order for the weights and the thresholds to be reasonably defined. each dataset consists of families, since the variance of the resulting olg success rate barely declines for larger sets (see supplementary fig. s ). fig. summarizes the workflow: hmm profiles are constructed from the seed sequences; the sequence with the highest score from the full group is used for olg design; the remaining sequences in the full group are used to construct the threshold scores used to evaluate the designed olgs. in order to estimate the expected success rate of an individual overlap attempt, the domains are overlapped at random positions such that one domain is fully embedded in the other. just as in [ ] , the sequence with the lower quality of the two constructed olgs is used as a conservative representative of the pair. after determining the success for each position, the percentage of successful positions for each olg pair, the average success rate in each reading frame, and the overall success rate averaged across reading frames are calculated. the number of possible positions for each olg pair is equal to their difference in length plus one, so using more than one overlap position in each pair is only possible for genes of different lengths. increasing the number of positions for each gene does not change the expected success rate but reduces its variation between different sets (see supplementary fig. s ). comparing the variation caused by choosing random positions and the variation caused by choosing random pfam families, the former turns out to be negligible, and consequently only a single randomly chosen position for each olg pair is used for subsequent analyses. the distribution of the percentage of successful positions in each olg pair is calculated from up to different positions (see fig. ). 
% of all olg pairs form biologically relevant sequences at all positions in every reading frame, while only . % cannot form a biologically relevant sequence at any position (see fig. ). . % of the pairs even form typical proteins, as determined by the th percentile threshold, at every position in every reading frame (see right panel in fig. ). this result is strongly dependent on the threshold percentile chosen, but given the wide range of possible results it can still be concluded that the chance of success of a constructed olg pair depends strongly on the particular genes used, as might be expected. in each olg pair, sets of up to random positions were tested against the pfam group hmms, using the 'biologically relevant' threshold ( th percentile) and the 'typical sequence' threshold ( th percentile) for a successful overlap. while . % of the pairs can be overlapped at every position and . % at no position using the biological threshold, only . % can be overlapped at every position and . % at no position using the threshold of typical sequences. the sequence threshold strongly influences the result. in order to determine whether the relative evaluation of olgs really removed the length dependency, the average quality 'q' of an olg pair is determined and compared for olg pairs of different lengths. q is defined as the ratio of the score of the constructed sequence (s) over that of the original sequence (s_max), times . the quality therefore indicates the percentage of the original score retained after the overlap. supplementary fig. s shows the mean quality for datasets with different sequence lengths. starting from around amino acids, q is indeed mostly independent of sequence length. the low q values of shorter sequences arise because these sequences more often fail to be matched to their respective hmm profile at all, which results in a score of zero. the reason is probably that shorter sequences fall below internal detection thresholds of hmmer more easily.
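two quantities from the workflow above can be sketched directly: the number of possible embedding positions for an olg pair (their length difference plus one) and the quality q of a constructed sequence (its score relative to the original, as a percentage, with the pair represented conservatively by its lower-quality member). all numeric values below are hypothetical.

```python
import random

def embedding_positions(len_a, len_b):
    """All start offsets at which the shorter gene fits fully inside the
    longer one; their count equals the length difference plus one."""
    long_len, short_len = max(len_a, len_b), min(len_a, len_b)
    return list(range(long_len - short_len + 1))

def random_position(len_a, len_b, seed=None):
    """One randomly chosen embedding position, as used for the main analyses."""
    return random.Random(seed).choice(embedding_positions(len_a, len_b))

def quality(s, s_max):
    """q = 100 * s / s_max; zero if the sequence is no longer matched at all."""
    return 0.0 if s_max <= 0 or s <= 0 else 100.0 * s / s_max

def pair_quality(q_a, q_b):
    """Conservative representative of a pair: the lower of the two qualities."""
    return min(q_a, q_b)

positions = embedding_positions(300, 240)   # 300 - 240 + 1 = 61 offsets
q_pair = pair_quality(quality(80.0, 100.0), quality(95.0, 100.0))
```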
changing a single amino acid in a short gene changes its quality to a greater extent than in a long gene, resulting in larger fluctuations, which can push the sequence below detection thresholds. lowering internal thresholds of hmmer did not lead to more sequences being recognized by their respective profile. in further analyses a minimum sequence length of amino acids is used so that the percentage of olg pairs in which at least one sequence is not recognised is below % (see supplementary fig. s ). when taking both sequences of each pair, and not only the one with the lower quality, the quality distribution converges to a broad peak at around % with increasing sequence length (see supplementary fig. s ). since the quality also depends on the flexibility of the hmm profiles used to score the sequences, the peak is not expected to become any narrower with increasing sequence length, i.e. the variation in sequence similarity between the constructed and the original sequences will persist. the algorithm to construct olg sequences from [ ] uses an exchange matrix (blosum [ ] ) to find the overlapping sequences closest to the original ones. at each position it determines the codon with the highest sum of the exchange scores for both sequences. sequence weights can prioritise the score of either one or the other sequence at different positions in order to increase the chance of obtaining functional sequences. in [ ] , the weight w_i at position i of the sequence is w_i=e^(-s_i), where s_i is the entropy calculated at position i in the alignment. the weights could be defined differently such that their influence on olg construction is stronger or weaker. in order to optimize the weight strength, a factor k is applied to the entropy in this calculation such that w_i=e^(-k·s_i). by varying k> , the optimal weight strength for constructing olgs can be determined, while k= means no weights are being used. in the hmm evaluation the influence of k is very weak. a value of k= .
is used in order to maximise the quality, q (see supplementary fig. s ). for very high values of k, q goes to zero: in that case, at each position the sequence with the higher conservation maintains its amino acid. this indicates that it is crucial for both sequences to change at each position in order to create functional olgs. in the blast evaluation k= is optimal (see supplementary fig. s ), i.e. no better value can be found for k> . blast does not take special account of conserved regions of a sequence, so weights can improve one sequence but at the same time will reduce the score of the other. since the lower-scoring of the two sequences is taken to represent the olg pair, introducing weights has a high chance of reducing the success rate in an evaluation using blast, despite increasing biological relevance. this makes an evaluation using hmms, or any other method that takes sequence conservation into account, clearly preferable for judging constructed olg pairs. the five alternative reading frames differ strongly in the combinatorial constraints imposed by the reference gene (mother gene) via the standard genetic code [ ] ; e.g. the sequence n|gcn|, with n being any nucleotide, always translates to alanine in the + and the - frame. it is interesting to ask whether this difference in constraint transfers to the success rate for designing olgs. for olgs resembling typical proteins of their respective families, the success rates for olg construction vary from . % in the '- ' frame to . % in the '- ' frame, with an average value of . % across all reading frames (see fig. ). calculating the e-value just as in [ ] as a reference, the constructed olgs have a median e-value of ^(- ) to ^(- ), decreasing with increasing threshold percentile. the result is strongly threshold-dependent: . % of the constructed sequences score at least as highly as the worst sequence in the full group, while only . % score better than % of the full group.
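the entropy-based positional weights w_i = e^(-k·s_i) described above can be sketched as follows; the toy alignment columns are hypothetical, and natural-log entropy is assumed (the study's exact convention is not stated here).

```python
# Sketch of positional weights from alignment-column entropy: conserved
# columns (low entropy) receive weights near 1, variable columns near 0;
# the factor k tunes the strength, and k = 0 disables weighting.
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (natural log) of one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def weights(alignment_columns, k=1.0):
    """One weight per column: w_i = exp(-k * s_i)."""
    return [math.exp(-k * column_entropy(col)) for col in alignment_columns]

# hypothetical columns: fully conserved / fully variable / two-state
columns = ["AAAA", "ACGT", "AACC"]
w = weights(columns, k=1.0)
```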
considering the combinatorial restrictions of the different reading frames [ ] , the ranking of frames by success rate is exactly as expected, insofar as the success rate of each reading frame is inversely related to the extent of combinatorial restrictions found in [ ] (see fig. ): the '- ' frame is the least successful reading frame and has the highest restrictions, followed by the '- ' frame, which is the second most restricted frame. next are reading frames '+ ' and '+ ', which have exactly the same restrictions and, surprisingly, almost the same success rates, not only in their average value but also in every single dataset (data not shown), despite expected stochastic fluctuations due to some genes simply fitting each other better. last is the '- ' frame, which has no combinatorial restrictions and the highest success rate. plotting the success rates in the different reading frames as a function of the number of combinatorial constraints found in [ ] results in a linear relation for the lowest possible threshold, namely when all sequences that are at least as good as the worst in the comparison group are judged successful. as the threshold is increased, the linear relation is gradually lost (see supplementary fig. s ). for higher thresholds most of the sequences fall below the threshold and very little data is left, which might explain the observed behaviour. in summary, the structure of the standard genetic code appears to strongly influence the construction of olgs. whether the observed relationship between predicted constraints in different frames and the difficulty of constructing olgs is borne out by the proportion of natural olgs found across frames deserves attention across diverse taxa. the threshold chosen within the pfam group has a very strong influence on success rates.
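the alternative reading frames discussed above can be illustrated by translating a mother gene under the standard genetic code. this is a sketch only: the frame-labelling convention below (positive offsets on the sense strand, negative values for the reverse strand) is an assumption, and the mother gene is hypothetical.

```python
# Build the standard genetic code from the canonical TCAG ordering and read
# a "mother" gene in sense and antisense frames.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def translate(seq):
    """Translate complete codons of a nucleotide string."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def read_frame(seq, frame):
    """frame > 0: shifted sense strand; frame < 0: reverse-complement strand."""
    if frame > 0:
        return translate(seq[frame - 1:])
    rc = "".join(COMPLEMENT[b] for b in reversed(seq))
    return translate(rc[-frame - 1:])

mother = "ATGGCTAAACCC"   # hypothetical mother gene
```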
the ordering of the reading frames by success rate, namely '- ', '+ '/'+ ', '- ' and '- ', matches the ordering by combinatorial restrictions in the standard genetic code, beginning with the least restricted frame [ ] . the impact of olg construction on amino acid sequence identity is another indicator of functionality. it has been argued that a % amino acid identity between naturally occurring sequences ensures that both sequences have the same structure [ ] . comparing the part altered by olg construction with the original sequence, in . % of cases both olg sequences share at least % of their amino acids with their original sequences. in some olg pairs both sequences have an amino acid identity of up to % compared to their original sequence. in the biologically more relevant property of amino acid 'similarity', the worse-scoring of the two olgs can even be up to % similar to its respective original sequence (cf. left panel of fig. ). by averaging the amino acid identity and similarity of the two olg sequences, the average impact of olg design can be determined. the average amino acid identity is % in most cases (right panel of fig. ), showing that in almost all olg pairs one sequence is above and one is below % amino acid identity. the average amino acid similarity is % in most cases (right panel of fig. ), which again shows that in almost all cases one of the two olg sequences is above and one below % similarity. the double-peak structure of both panels in fig. can be explained by differences between olg pairs in different relative reading frames, which are pooled here (cf. supplementary fig. s ). it follows that in an average olg design, at % of all overlap positions the amino acids of both sequences can be maintained, at % one sequence maintains its amino acid while the amino acid in the other sequence is changed to a similar one, and at % one sequence maintains its amino acid while the other sequence cannot maintain a similar amino acid.
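amino acid identity and similarity between a constructed olg and its original can be sketched as below. the physicochemical grouping used for 'similarity' here is a common textbook grouping chosen for illustration (the study used substitution-matrix-based similarity), and both sequences are hypothetical.

```python
# Sketch of per-position identity and similarity between two aligned
# amino acid sequences, reported as percentages.
SIMILAR_GROUPS = [
    set("AGILVMP"),   # small / hydrophobic (illustrative grouping)
    set("FWY"),       # aromatic
    set("ST"), set("NQ"), set("DE"), set("KRH"),
]

def identity(a, b):
    """Percentage of aligned positions with identical amino acids."""
    assert len(a) == len(b)
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def similar(x, y):
    return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)

def similarity(a, b):
    """Percentage of aligned positions with identical or similar residues."""
    assert len(a) == len(b)
    return 100.0 * sum(similar(x, y) for x, y in zip(a, b)) / len(a)

original = "MKLIVDEF"
olg      = "MKLLVDDY"   # hypothetical constructed sequence
```

in this toy example three positions change, but every change is to a similar residue, so identity drops while similarity stays at 100%.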
how well the two sequences can be maintained after the overlap is determined by the standard genetic code and by features of the two specific sequences: the overlap position, their amino acid composition and their amino acid order. while the standard genetic code is a constant factor across all overlaps, all other factors are specific to each case and create the variability in the results. figure : probability density for different amino acid identities and similarities in constructed olg pairs. the data is calculated from , olg pairs. left: the sequence with the lower identity is representative of the pair. the black line indicates the % amino acid identity threshold. right: the mean similarity of both olg sequences represents the pair. the impact of olg design on secondary structure is the last factor studied here. by comparing the secondary structure of the olg sequence with that of its original sequence, a secondary structure similarity is determined. secondary structure is predicted using porter [ ] with the "--fast" flag. it can distinguish between the eight secondary structure states of the dictionary of protein secondary structure (dssp) [ , , ] , which are _ -, alpha- and pi-helices, hydrogen-bonded turns, beta sheets, beta bridges, bends and coils. determining the same secondary structure similarity for all sequences in the seed group of the pfam database yields a control group. this way, the typical deviations between domains with the same function can be determined. comparing the probability densities for different secondary structure identities in both groups, it can be seen that the constructed olg sequences barely deviate from the seed sequences (cf. fig. ). in conclusion, with regard to secondary structure the change inflicted on a sequence to create olgs is no greater than the differences within naturally occurring protein domain families. it is noteworthy that only amino acid identity and similarity have a strong correlation (r= .
), so combined with the other parameters, namely the relative hmm score and the secondary structure identity, there is a set of three more or less independent properties for evaluating constructed olgs, and probably protein homologs in general. the relative hmm score is the hmm score of the olg sequence divided by the hmm score corresponding to a given threshold percentile, as discussed above. between each pair of parameters the pearson correlation is below . , with the exception of the correlation between secondary structure identity and hmm score, which is r= . or r= . for thresholds of % and %, respectively. olgs are as similar to their original sequences in secondary structure as seed sequences of naturally occurring protein domains are to the sequence best representing the respective domain family. by comparing olg sequences constructed with the standard genetic code (sgc) to sequences constructed with artificial codes, the level of optimality of the sgc can be inferred. since such an approach depends strongly on the code set used [ ] , four different versions with increasing restrictions will be tested. two factors define a genetic code, namely its amino acid composition and the arrangement of amino acids on the codons. the first code set is the random code set and does not constrain either factor. each code can have any of the amino acids used in the sgc at any codon. the second set only restricts the composition of its codes and is called the degeneracy code set. all codes in this set contain the same number of codons for each amino acid as the sgc, thus conserving its amino acid composition. the third set is the blocks code set, whose codes have a structure very similar to the sgc; while it also restricts the composition of the codes to some degree, it mostly determines their arrangement.
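two of the three evaluation properties discussed above can be sketched directly: secondary structure identity (the fraction of positions with the same 8-state label) and pearson's correlation between per-olg property vectors. the label strings and property vectors below are hypothetical, not porter output or study data.

```python
import math

def ss_identity(ss_a, ss_b):
    """Percentage of positions with the same 8-state secondary structure label."""
    assert len(ss_a) == len(ss_b)
    return 100.0 * sum(x == y for x, y in zip(ss_a, ss_b)) / len(ss_a)

def pearson(xs, ys):
    """Pearson's correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical predictions for an OLG and its original (H helix, E strand, ...)
ss_original = "CCHHHHHHCCEEEECC"
ss_olg      = "CCHHHHHTCCEEEECC"   # one helix position predicted as a turn
identity_pct = ss_identity(ss_original, ss_olg)

# hypothetical per-OLG values of two evaluation properties
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```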
this code set is created by grouping all codons of the sgc that code for the same amino acid into blocks and then shuffling the amino acids assigned to the blocks; it thus conserves the degeneracy structure of the sgc at the third nucleotide. lastly, a code set that maintains the mutational robustness of the sgc as calculated in [ ] is tested. in short, the mutational robustness is the average change of amino acids due to point mutations, and has been shown to be extremely optimal in the sgc relative to similar codes [ ] . this set contains block codes as in the blocks set, but only codes whose mutational robustness is at least as high as that of the sgc are kept. since these codes are fundamentally block codes they are partly restricted in their amino acid composition, but the arrangement of amino acids in these codes is even more restricted, as point mutations from any codon should result in similar amino acids. this code set reflects the fact that different properties of the sgc have a different impact on the fitness or biological optimality of the sgc, with mutational robustness most likely being one of the most important features. here this code set is called the mutational robustness blocks set (mr-blocks set), and it tests the optimality of constructing olgs as an additional property of the sgc after taking the mutational robustness into account. comparing the degeneracy, blocks and mr-blocks code sets to the random set, the influence of code composition and arrangement can be determined (see left panel of fig. ). the degeneracy code set, which reflects the composition of the sgc, has the codes with the highest average success rates, indicating that the composition of the sgc is a major factor for this property; the sgc itself, however, has a very low success rate in comparison, indicating that the amino acid arrangement is an even stronger (in this case negative) factor, as the sgc is worse than both the random codes and the degeneracy codes.
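a member of the blocks code set described above can be generated as sketched below: synonymous codons of the sgc form blocks, and the amino acid (and stop) labels are shuffled among the blocks, so the degeneracy structure of the sgc is conserved while the arrangement changes. this is an illustrative sketch of that construction, not the study's code.

```python
# Sketch of the 'blocks' code set: shuffle amino acid labels among the SGC's
# synonymous-codon blocks, conserving its degeneracy structure.
import random
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SGC = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def random_blocks_code(rng):
    """One random member of the blocks code set."""
    blocks = {}
    for codon, aa in SGC.items():
        blocks.setdefault(aa, []).append(codon)
    labels = list(blocks)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    code = {}
    for old, new in zip(labels, shuffled):
        for codon in blocks[old]:
            code[codon] = new
    return code

code = random_blocks_code(random.Random(0))
```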
the block structure of the sgc has a strong negative impact on successful olg design, and the sgc is a typical member of this set. enforcing even more structure on the artificial codes in order to maintain the mutational robustness of the sgc further reduces the average ability of the codes to create successful olgs. studying the optimality of the sgc relative to each of the four code sets with respect to flexibility in olg design, it is apparent that the more restricted the code set, the more optimal the sgc is relative to the set (see right panel of fig. ). in the mr-blocks code set especially, only a few codes are better than the sgc; however, no code set or reading frame has fewer than % of codes doing better (see figs. s - s ), which has been a recommended threshold for inferring optimality [ ] . this is an expected result even if the code has been optimised for olgs, as the success rate for constructing olgs reflects merely the 'flexibility' of a code system; olg sequences also need to be conserved, an almost directly opposing property that has likewise not been found to be strongly optimal by itself [ ] . it might indeed be expected that overall optimality involves a trade-off between the two. if the sgc has been optimized in this way, this could indicate a turning point at which a further increase in mutational robustness results in a smaller fitness increase than an increase in the flexibility to create olgs (how to measure fitness for a genetic code is, however, not clear). while the code composition of the sgc is beneficial for both the ability to create successful olgs and the mutational robustness, the code arrangement of the sgc is only beneficial for mutational error robustness (see fig. of [ ] ), indicating that the mutational robustness is the more important property. only in the set of codes with the same mutational robustness does the optimality for olg design become stronger, supporting the turning-point hypothesis.
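a simple proxy for the mutational robustness discussed above can be sketched as the fraction of all single-nucleotide substitutions that leave the encoded amino acid unchanged. the study used an amino-acid-change-based measure; this synonymous-fraction proxy is a deliberate simplification.

```python
# Sketch of a mutational-robustness proxy: over all 64 codons x 9 possible
# point mutations each (576 total), count the fraction that are synonymous.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SGC = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def synonymous_fraction(code):
    """Fraction of all point mutations that conserve the encoded amino acid."""
    same = total = 0
    for codon, aa in code.items():
        for pos in range(3):
            for b in BASES:
                if b == codon[pos]:
                    continue
                mutant = codon[:pos] + b + codon[pos + 1:]
                total += 1
                same += code[mutant] == aa
    return same / total

robustness_sgc = synonymous_fraction(SGC)
```

an alternative code's robustness can then be compared against `robustness_sgc` to decide membership in an mr-filtered set.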
figure : olg design success rates for different alternative code sets. the average is calculated from sets of alternative codes, except for the mr-blocks set with sets of codes. the error bars indicate the standard deviation. left: the average success rates compared to the sgc. while the composition of the sgc is a positive factor, the arrangement of the sgc is a negative factor. right: the optimality of different code sets. the black line indicates the % threshold. the more restricted the code set, the more optimal the sgc appears, indicating that the ability to successfully create olgs has only been optimized while maintaining other properties. besides the four basic taxonomic groups (the three domains of cellular life - archaea, bacteria and eukaryotes - plus viruses), old genes can also be studied by picking only families that have at least one sequence in all four taxonomic groups, since it is expected that these families were already present in luca or another ancient ancestor (although this high-level categorisation is not perfect due to widespread horizontal gene transfer). surprisingly, bacterial and eukaryotic genes are generally significantly better suited to olg construction than viral and archaeal genes, with only minimal dependence on the threshold percentile (cf. fig. s ). the largest dependence on the threshold percentile is found for the "found in all" genes, for which only a total of sequences can be found in the pfam database, so higher stochastic fluctuations are to be expected. using the 'biologically relevant' threshold, the biggest difference is between eukaryotic and archaeal genes, which differ by % in their success rate (see left panel of fig. ). for olgs which are typical proteins of their respective family, eukaryotic genes are almost twice as likely to be successful as virus genes (see right panel of fig. ).
eukaryotic and "found in all" genes are typically the easiest to overlap, which is somewhat unexpected, as eukaryotic genes would perhaps be expected to include the youngest protein families and so to appear less 'flexible' due to having sampled less of the functional space through mutations. more understandable, however, is that due to being closer to mutational saturation (if more ancient on average), and therefore having explored a larger proportion of functional sequence space, "found in all" genes might appear more 'flexible', resulting in lower weights and thresholds. in order to estimate the difficulty of naturally forming olg sequences, the minimum number of nucleotide changes needed to reach the olg sequence from either of the two original sequences is determined (see fig. ). by only taking olgs in which both sequences are above a certain hmm threshold, extreme outliers are gradually removed with increasing threshold, but the rest of the distribution stays the same. this indicates that this property is independent of the threshold value, just as for the amino acid identity and similarity: fewer and fewer designed olgs pass a higher threshold, which makes extreme outliers less likely to occur. on average a designed olg sequence differs from its original by % of its nucleotides, with half of the constructed sequences in the range of - % change. most interesting are outliers at the lower end of the distribution, as they indicate whether olgs exist that are potentially reachable by naturally occurring mutations. the lowest nucleotide difference observed is . %, for an olg pair that scores better than % of the domains in the comparison group. . % of olgs required less than % nucleotide change while still scoring at least as highly as the worst sequence in the comparison group.
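the nucleotide distance used above is a per-position (hamming) difference between a constructed olg sequence and its original, expressed as a percentage. the sequences below are hypothetical.

```python
# Sketch of the percentage nucleotide change between an original gene and
# its constructed OLG variant (aligned, equal-length sequences assumed).
def nt_change_percent(original, constructed):
    """Percentage of nucleotide positions that differ between two sequences."""
    assert len(original) == len(constructed)
    diffs = sum(a != b for a, b in zip(original, constructed))
    return 100.0 * diffs / len(original)

orig_nt = "ATGGCTAAACCCGGGTTT"
olg_nt  = "ATGGCTAAACCAGGGTTT"   # a single substitution
```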
this suggests intuitively that creating overlaps of the sort constructed here could be possible naturally through the accumulation of random mutations. the population genetics of such a hypothetical process is a potential topic for further study, as is an experimental evaluation of functionality. figure : percentage nucleotide change of olgs as a function of hmm threshold %. the minimal nucleotide distance of each of the olg sequences (two per pair), with a minimal length of nucleotides, to their respective original sequence is determined. there are many aspects of the synthetic construction of olg pairs which can be studied. here, factors such as sequence length and the influence of sequence conservation are taken into account. the analysis shows that an evaluation with blast and a fixed e-value cutoff cannot accurately assess the potential functionality of the designed olgs. while the combination of sequence length and an e-value cutoff completely determines the success rate of the constructed olgs, adding positional weights can only negatively influence the sequences constructed with this method. both problems can, however, be solved by instead using hmm profiles to determine sequence similarity and then using these to define a threshold for successful olgs derived from sequences in the same protein family. the hmm profiles and the thresholds are, however, both derived from the pfam database [ ] , which makes these results strongly dependent on database quality. for example, if in one taxonomic group sequences are very similar due to being mostly from the same species or genus, thresholds would appear to be higher and it would be harder for designed olgs to pass them. further optimization of the construction algorithm can be achieved by determining the optimal weight strength (influence of sequence conservation), which is k= . . . % of the constructed olg sequences score at least as highly as the worst-scoring biological sequences in pfam groups, while .
% of the sequences cannot be distinguished from naturally occurring domains in their respective protein family. this indicates that the typical variation inside protein families is of the same order of magnitude as the change needed to construct artificial olgs by arbitrary pairing of protein domains. this result also holds for other bioinformatic factors such as amino acid identity and secondary structure, since the constructed olgs are typically very similar to naturally occurring domains in these properties. studying artificial olg design success from the perspective of an even more constraining biological parameter such as tertiary structure would be an important next step; however, besides the amino acid sequence, codon usage can also impact protein structure [ ] , along with environmental factors such as the presence of chaperone proteins, which together make this a much harder problem. ultimately, proof of the functionality of artificial sequences cannot yet be obtained bioinformatically, and experimental verification is essential. to this end, all known independent protein properties available from the sequence should be tested in order to create a gold standard for possibly functional sequences. from this study it is clear that sequence similarity (or identity), hmm scores and secondary structure should be among the judged properties. determining relative hmm scores for high thresholds could be used to prefilter sequences for secondary structure prediction, as the latter is the most computationally intensive part of this analysis. considering that domain-domain overlaps are expected to be much harder than overlapping a domain with a less conserved region of another gene, it appears that de novo origin of genes from overlapping orfs may be much less difficult than widely assumed. some constructed olg sequences differed by only . % from their original sequence, and there might be other natural sequences from the same domain family that are even closer to the olg sequence.
this result could be a starting point for estimating the difficulty of evolving olgs from different starting sequences in natural systems, which is still relatively unexplored despite some early work [ ] . the structure of the standard genetic code explains the differences between reading frames and is a strong factor in the overall success rate of olg construction. olgs can maintain an average % amino acid identity and an average % amino acid similarity, which is mostly due to the genetic code. the structure of the standard genetic code is defined by its composition, namely how many codons code for each amino acid, and its arrangement, namely which codons code for each amino acid. it is known that composition alone cannot explain the strong optimality of the standard genetic code for mutational robustness, as the code stands out even among codes with the same composition [ , ] . considering that the arrangement of the standard genetic code creates such high mutational robustness values [ ] , it is remarkable that designing olgs also works so well. another factor which deserves further exploration is the age of a protein family, i.e. the time since gene birth. this may correlate with apparent 'sequence flexibility', which is the strongest influence on the result via the threshold values, due to increasing mutational saturation in older protein families. being able to distinguish genuine sequence flexibility from mutational saturation, even in broad terms, could be very useful here. the analysis presented here depends primarily on the reliability of the hmm profiles of pfam groups as a guide to biological functionality in constructed sequences. reliability in classifying biological protein sequences into ortholog families, the main use of these hmms, may not correlate well with reliability in scoring artificially constructed sequences for functionality.
in other words, it may well be that these profiles fail to capture important requirements for protein tertiary structure and/or functionality. future research ought to test the best candidates experimentally, and if the best candidates from the methods developed here are not successful, additional factors could be considered in comparing constructed sequences with their natural precursors. for instance, many protein characteristics can be assessed using servers or packages incorporating multiple bioinformatic tools, such as predictprotein for various secondary structural elements [ ] , and many sequence properties, such as hydrophobicity profiles, can be computed using the volpes server [ ] , which has been applied to the related case of frame-shifted sequences compared to their mother genes [ ] . other properties required for functional protein sequences can be inferred from the evolutionary information contained in sequence alignments of protein families. for instance, it has been calculated, based on a study of residue-residue co-evolution in ten well-characterized protein families, that the proportion of all sequences which fold to the family's structure ranges from approximately ^(- ) to ^(- ) [ ] . these principles have recently been used successfully in the design of functional proteins [ ] , and could conceivably also be applied to olg construction. factors facilitating the existence of olgs may help in predicting olgs in sequenced genomes and should be explored further. for instance, a careful study of relatively 'flexible' sequence regions in taxonomically widespread genes may help find more overlapping genes. most interestingly, bacterial and eukaryotic genes can be overlapped more easily than virus genes, contrary to the findings in [ ] .
these earlier results can be explained entirely with dataset-database biases, so this algorithm gives no support for the common assumption of a higher intrinsic olg formation capacity of viruses compared with bacteria or eukaryotes. two of the main differences between the taxonomic groups are the expected mutation rates and the average length of a protein. while genomes with higher mutation rates explore sequence space faster and therefore their proteins should appear to be more flexible, virus domains do not appear to be very flexible, despite having the highest mutation rate. the length of the sequences on the other hand has been removed as a factor in this analysis. an artificial factor not considered could be database biases or an exchange matrix (blosum ) biased towards certain kinds of proteins. the latter could be tested by using different matrices created from sequences from different taxonomic groups. it would be important to use the new matrix not only in the construction of the olgs but also in the evaluation by the hmms. so far it is not clear why protein families from different taxonomic groups are so different in their calculated ability to create olgs. a better theoretical understanding of overlapping genes will be extremely useful in microbial genome annotation methods, the study of evolution, and in synthetic biology, and therefore deserves renewed attention. overlapping genes in bacteriophage φx concomitant emergence of the antisense protein gene of hiv- and of the pandemic the novel ehec gene asa overlaps the tegt transporter gene in antisense and is regulated by nacl and growth phase overlapping genes in parasitic protist giardia lamblia overlapping genes in vertebrate genomes are antisense proteins in prokaryotes functional? 
key: cord- -kqcx lrq authors: ladner, jason t.; beitzel, brett; chain, patrick s. g.; davenport, matthew g.; donaldson, eric; frieman, matthew; kugelman, jeffrey; kuhn, jens h.; o’rear, jules; sabeti, pardis c.; wentworth, david e.; wiley, michael r.; yu, guo-yun; sozhamannan, shanmuga; bradburne, christopher; palacios, gustavo title: standards for sequencing viral genomes in the era of high-throughput sequencing date: - - journal: mbio doi: . /mbio. - sha: doc_id: cord_uid: kqcx lrq thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. however, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences. here, we propose five “standard” categories that encompass all stages of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. we also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade-offs associated with different levels of finishing. our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques.
viruses represent the greatest source of biological diversity on earth, and with the help of high-throughput (ht) sequencing technologies, great strides are being made toward the genomic characterization of this diversity ( ) ( ) ( ). genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. here, we outline a set of viral genome quality standards, similar in concept to those proposed for large dna genomes ( ) but focused on the particular challenges of and needs for research on small rna/dna viruses, including characterization of the genomic diversity inherent in all viral samples/populations. our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques. despite the small sizes of viral genomes, complications related to limited rna quantities, host "contamination," and secondary structure mean that it is often not time- or cost-effective to finish every genome, and given the intended use, finishing may be unnecessary ( ). therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. each viral family/species comes with its own challenges (e.g., secondary structure and gc content); therefore, we provide only loose guidance on the depth of sequence coverage likely required to obtain different levels of finishing. in reality, a similar amount of data will generate genomes with different levels of finishing for different viruses. to alleviate any reliance on particular aspects of the different sequencing technologies, we have made two assumptions that should be valid in most viral sequencing projects.
the first assumption is a basic understanding of the genomic structure of the virus being sequenced, including the expected size of the genome, the number of segments, and the number and distribution of major open reading frames (orfs). fortunately, genome structure is highly conserved within viral groups ( ), and although new viruses are constantly being uncovered, the discovery of a novel family or even genus remains relatively uncommon ( ). in the absence of such information, the defined standards can still be applied following further analysis to determine genome structure. the second assumption is that the genetic material of the virus being described can be accurately separated from the genomes of the host and/or other microbes, either physically or bioinformatically. depending on the technology used, it is critical that the potential for cross-contamination of samples during the sample indexing/bar coding process and sequencing procedure be addressed with appropriate internal controls and procedural methods ( ). for a summary of the proposed categories for whole-genome sequencing of viruses, see fig. and table . the "standard draft" category is for whole shotgun genome assemblies with coverage that is low and/or uneven enough to prevent the assembly of a single contig for ≥ genome segments. genomes in this category are likely to result from samples with low viral titers, such as clinical and environmental samples, or to be those containing regions that are difficult to sequence across (e.g., intergenic hairpin regions) ( ). to distinguish standard drafts from targeted amplification of partial viral sequences, standard drafts should contain at least one contig for each genomic segment and should be prepared in a manner that allows the possibility of sequencing the vast majority of a virus's genome. to avoid the inclusion of small pieces of genomes as "drafts," there needs to be some type of minimum cutoff for breadth of coverage.
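the five finishing categories defined in this section reduce to a small decision rule. the sketch below is illustrative only: the breadth cutoff (whose exact percentage is elided in the text) and the boolean summaries of orf and terminus completeness are assumptions, not part of the proposed standard.

```python
# Illustrative decision rule for the five finishing categories described in
# this section. The breadth cutoff default (0.5, i.e., "a majority") is an
# assumption, since the exact percentage is elided in the text.
def finishing_category(contigs_per_segment, breadth, all_orfs_complete,
                       termini_resolved, population_characterized,
                       min_breadth=0.5):
    """Classify a viral genome assembly by finishing level.

    contigs_per_segment -- list: number of contigs assembled per segment
    breadth             -- fraction of the genome covered by the assembly
    """
    if breadth < min_breadth or min(contigs_per_segment) < 1:
        return "partial (below draft cutoff)"
    if max(contigs_per_segment) > 1:
        return "standard draft"   # contig(s) for every segment, gaps remain
    if not all_orfs_complete:
        return "high quality"     # single contig/segment, ORF ends may be missing
    if not termini_resolved:
        return "coding complete"  # all ORFs complete, segment termini unresolved
    if not population_characterized:
        return "complete"         # consensus fully resolved, including termini
    return "finished"             # plus population-level diversity data

# A gapless single-segment assembly with complete ORFs but unresolved ends:
print(finishing_category([1], 0.98, True, False, False))  # coding complete
```

note that the rule is ordered from least to most finished, mirroring the way the categories build on one another in the text.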
therefore, we suggest that at least a majority (≥ %) of the genome be present for a set of sequences to be considered a draft genome. high quality (hq). genomes should be considered high quality if no gaps remain (i.e., a single contig per genome/segment), even if one or more orfs remain incomplete due to missing sequence at the ends of segments. an hq genome can often be achieved with modest levels of ht sequencing coverage (~ to ×) or through sanger-mediated gap resolution of an sd. coding complete (cc). the "coding complete" category indicates that in addition to the lack of gaps, all orfs are complete. this level of completion is typically possible with high levels of ht sequencing coverage (> ×) or may require the use of conserved pcr primers targeting the ends of the segments. complete. a genome is complete when the genome sequence has been fully resolved, including all non-protein-coding sequences at the ends of the segment(s). this is typically achieved through rapid amplification of cdna ends (race) or similar procedures. finished. this final category represents a special instance in which, in addition to having a completed consensus genome sequence, there has been a population-level characterization of genomic diversity. typically this requires ~ to , × coverage (see below). this provides the most complete picture of a viral population; however, this designation will apply only for a single stock. additional characterizations will be necessary for future passages. population-level characterization. ht sequencing technologies provide powerful platforms for investigating the genetic diversity within viral populations, which is integral to our understanding of viral evolution and pathogenesis ( , ). population-level characterization requires very high levels of ht sequencing coverage ( , ); however, the exact level will depend on the background error profiles of the sequencing technology and the desired level of sensitivity. as an example, wang et al.
( ) determined that for pyrosequencing data, ~ × coverage is necessary to identify minor variants present at % frequency with . % confidence, and ~ , × coverage is needed for variants with a frequency of . %. targeted amplification of the viral genome is often necessary to achieve these coverage requirements. due to the modest sequence lengths of most ht technologies, the state of the art for population-level analysis has been the characterization of unphased polymorphisms. however, single-molecule technologies, with maximum read lengths of > kb, are opening the door for complete genome haplotype phasing ( ). identification of contaminants or adventitious agents. after isolation, viruses are often maintained as stocks, which are propagated within host cells in tissue culture and thus amplified and preserved for future use. despite careful laboratory practices, it is possible for these stocks to become contaminated with additional microbes. contaminating microbes are often detrimental to subsequent applications such as vaccine development or the testing of therapeutics, making it imperative to monitor the purity of viral stocks. ht sequencing provides a powerful method not only for detecting the presence of contaminants within a sample but also for identifying and characterizing them. the level of sequencing required for contamination analysis is dependent on the desired sensitivity, with more sequencing required to ensure detection of contaminants present at very low levels. for most approaches, hq-level sequencing should be sufficient. depending on the intended applications, analysis may need to be repeated after further passaging to ensure that no additional contaminants have been introduced. description of novel viruses. despite the rapidly growing collection of viral sequences, the description of novel viruses is likely to remain an important aspect of viral genome sequencing ( , , ).
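the relationship between variant frequency and required depth can be approximated with a simple binomial model. this is not the calculation from the cited pyrosequencing study; the minimum-read threshold and confidence level below are illustrative assumptions, and sequencing error is ignored.

```python
import math

# Back-of-the-envelope estimate of the coverage needed to call a minor
# variant: require at least `min_var_reads` variant-supporting reads, with
# read counts modeled as Binomial(n, var_freq). All defaults are
# illustrative assumptions, not values from the cited study.
def min_coverage(var_freq, min_var_reads=5, confidence=0.99):
    """Smallest depth n such that P(X >= min_var_reads) >= confidence,
    where X ~ Binomial(n, var_freq)."""
    n = min_var_reads
    while True:
        # Probability of seeing fewer than min_var_reads variant reads
        p_miss = sum(math.comb(n, k) * var_freq**k * (1 - var_freq)**(n - k)
                     for k in range(min_var_reads))
        if 1 - p_miss >= confidence:
            return n
        n += 1

# Rarer variants demand disproportionately deeper sequencing:
print(min_coverage(0.05))   # depth needed for a 5% variant
print(min_coverage(0.01))   # a 1% variant needs roughly 5x more depth
```

the model makes the qualitative point in the text concrete: required depth scales roughly inversely with the variant frequency one hopes to detect.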
this is true in part because viruses evolve rapidly and are capable of recombining to form novel genotypes ( , ). it is also true that most of the viruses that are currently circulating remain uncharacterized ( ). particularly lacking are representatives from groups that are not currently known to infect humans or organisms of economic importance. it would be imprudent, however, to continue to ignore these uncharacterized reservoirs of diversity, because it is difficult to predict the source of future emerging diseases ( ) ( ) ( ). additionally, with the current suite of primarily sequence similarity-based pathogen identification tools, the ability to detect novel pathogens is wholly dependent on high-quality reference databases ( ). there is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last to % of a viral genome is often cost- and time-prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization. therefore, for the majority of viral characterization projects, we recommend, at a minimum, a cc genome. this will ensure a complete description of the viral proteome and will allow accurate phylogenetic placement. molecular epidemiology. one of the most common and important applications for viral genomes is in the study of viral epidemiology, which encompasses our understanding of the patterns, causes, and effects of disease. early studies of molecular epidemiology targeted small pieces of viral genomes; however, this type of analysis is likely to miss important changes elsewhere in the genome. therefore, there has been a strong focus in recent years toward the sequencing of "full" viral genomes. institutes such as the broad institute and the j.
craig venter institute (jcvi) have been instrumental in breaking ground in the collection of large numbers of good-quality viral sequences. their newly identified genomes typically fall within our cc category. this is likely to remain the gold standard for studies involving a large number of genome sequences, especially when some samples come from low-titer clinical samples, often necessitating amplicon-based sequencing methods. cc genomes allow for interrogation of changes throughout the coding portion of the viral genome and often include partial noncoding regions. in the absence of high-throughput race alternatives, the time and resources required to complete hundreds or thousands of genomes are likely to continue to outweigh the potential information gained from completing the terminal sequences. countermeasure development. advancements in our capabilities to sequence viral genomes are changing the way we counteract global pandemics and acts of bioterrorism. there are two important aspects of countermeasure development that can benefit strongly from the availability of genome sequences and ht sequencing data: the detection of the infectious agent and the treatment of the disease caused by the agent. taxonomic classification and detection through dna/rna-based inclusivity assays (i.e., using techniques such as pcr to detect the presence of a pathogen) can be designed using fragmented and incomplete genomes (e.g., sd and hq sequences). fully resolved orfs (cc) further enable the development of immunological assays, such as enzyme-linked immunosorbent assays (elisa) and immunofluorescence assays (ifa), for protein-based detection, and obtaining a complete genome opens the door to a plethora of additional downstream applications, including the design of exclusivity tests, the establishment of reverse genetics systems, and the design of robust forensics protocols.
however, for effective development and testing of animal models, therapeutics, vaccines, and prophylactics, it is necessary to obtain a complete picture of the variability present within both the challenge stock and postinfection populations, thereby necessitating finished genomes. in these medical applications, it is also important to demonstrate the absence of adventitious agents. in addition to standardizing the vocabulary of viral genome assemblies, it is also critical for researchers to routinely provide raw sequencing reads. without these, it is impossible for others to independently verify the quality of an assembly. data repositories such as genbank already provide a platform for depositing ht sequencing reads, but this is not a requirement for the submission of a genome, nor is this option typically utilized. wider analysis of data will ultimately result in higher-quality assemblies. it is worth considering broader implementation of a wiki-like, crowdsourcing strategy to genome assembly, similar to the annotation strategies that have been adopted for specific genomes of high interest ( , ) . this approach would allow multiple parties to work on genome assembly and annotation at the same time and would provide instant updates for the entire community to evaluate and utilize in their own research. our primary goal here is to initiate a conversation. the rate at which viral genomes are being sequenced is only going to increase in the coming years, and without some standardization, it will be impossible for these valuable resources to be utilized to their full potential. we present these categories as a starting point, with the goal of adjusting and refining them over time as our capabilities and needs continue to change. crystal ball. 
the viriosphere: the greatest biological diversity on earth and driver of global processes
metagenomic analysis of coastal rna virus communities
the search for meaning in virus discovery
genome project standards in a new era of sequencing
next generation sequencing of viral rna genomes
virus taxonomy. ninth report of the international committee on taxonomy of viruses
human viruses: discovery and emergence
double indexing overcomes inaccuracies in multiplex sequencing on the illumina platform
rescue of the prototypic arenavirus lcmv entirely from plasmid
viruses as quasispecies: biological implications
quasispecies diversity determines pathogenesis through cooperative interactions in a viral population
characterization of mutation spectra with ultra-deep pyrosequencing: application to hiv- drug resistance
highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data
the advantages of smrt sequencing
a strategy to estimate unknown viral diversity in mammals
the changing face of pathogen discovery and surveillance
the evolution of epidemic influenza
characterization of the candiru antigenic complex (bunyaviridae: phlebovirus), a highly diverse and reassorting group of viruses affecting humans in tropical america
isolation and characterization of viruses related to the sars coronavirus from animals in southern china
the emerging novel middle east respiratory syndrome coronavirus: the "knowns" and "unknowns"
relationship between domestic and wild birds in live poultry market and a novel human h n virus in china
computational tools for viral metagenomics and their application in clinical research
web apollo: a web-based genomic annotation editing platform
pseudomonas genome database: improved comparative analysis and population genomics capability for pseudomonas genomes
key: cord- - lnpujip authors: anthonsen, henrik w.; baptista, antónio; drabløs, finn; martel, paulo; petersen, steffen b.
title: the blind watchmaker and rational protein engineering date: - - journal: j biotechnol doi: . / - ( ) -x sha: doc_id: cord_uid: lnpujip in the present review some scientific areas of key importance for protein engineering are discussed, such as problems involved in deducing protein sequence from dna sequence (due to posttranscriptional editing, splicing and posttranslational modifications), modelling of protein structures by homology, nmr of large proteins (including probing the molecular surface with relaxation agents), simulation of protein structures by molecular dynamics and simulation of electrostatic effects in proteins (including ph-dependent effects). it is argued that all of these areas could be of key importance in most protein engineering projects, because they give access to increased and often unique information. in the last part of the review some potential areas for future applications of protein engineering approaches are discussed, such as non-conventional media, de novo design and nanotechnology. nature has evolved using several types of random mutations in the genetic material as a fundamental mechanism, thereby creating new versions of existing proteins. by natural selection nature has given a preference to organisms with proteins which directly or indirectly made them better adapted to their environment. thus nature works like a blind watchmaker, trying out an endless number of combinations. this may seem to be an inefficient approach by industrial standards, but nevertheless nature has been able to develop some highly complex and sophisticated designs, simply by the power of natural selection over millions of years, occurring in a large number of parallel processes. by virtue of reproduction several copies of each organism have been able to test the effect of different mutations in parallel.
it is quite probable that the mutation frequency was higher in ancient species (doolittle, ), although it is still possible to find highly mutable loci in genes involved in adaptation to the environment (moxon et al., ). enzymes have been used by man for thousands of years for modification of biological molecules. the use of rennin (chymosin) in rennet for cheese production is a relevant example. and with increased knowledge about proteins, genes and other biological macromolecules, scientists started to look at methods for making modified proteins with new or improved properties. [fig. : starting with a protein with known sequence and properties, a -d model of the protein is made from experimental structure data or by homology. by modelling and simulation, mutations that will modify selected properties of the protein are identified (the design part of the process); these mutations are implemented at the dna level and expressed in a suitable organism (the production part of the process), and the success of the design is verified by experimental methods.] at first this was done by speeding up nature's own approach, by increasing the number of mutations (e.g., by using chemicals or radiation) and by using a very strong selection based on tests for specific properties. with the introduction of new and powerful techniques for structure determination and site directed mutagenesis, it is now possible to do rational protein modification. rather than testing out a large number of random mutations, it has become feasible to identify key residues within the protein structure, to predict the effect of changing these residues, to implement these changes in the genetic material, and finally to produce large amounts of modified proteins. this is protein engineering. there are several reviews describing the fundamental ideas in protein engineering, see fersht and winter ( ) for a recent one. the basic protein engineering process is shown in fig. (see also petersen and martel ( )). in most cases it starts out with an unmodified protein with well-characterised properties. for some reason we want to modify this protein. in the case of an enzyme we may want to make it more stable, alter the specificity or increase the catalytic activity. first we enter the design part of the protein engineering process. based on structural data we create a computer model of the protein. by a combination of molecular modelling and experimental methods the correlation between relevant properties and structural features is established, and changes affecting these properties can be identified and evaluated for implementation. in more and more cases the effect of these changes can be simulated, and the modifications can be optimised with respect to these simulations. as soon as a new design has been established we may enter the production part of the process. the necessary mutations must be implemented in the genetic material, this genetic material is introduced into a production organism, and the resulting modified protein can (in most cases) be extracted from a bioprocess. this protein can be tested with respect to relevant properties, and if necessary it may be used as a basis for re-entering the design part of the protein engineering process. after a few iterations we may reach an optimal design. there are several examples of successful protein engineering projects. protein engineering may be used to improve protein stability (kaarsholm et al., ), enhance or modify specificity (getzoff et al., ; witkowski et al., ), adapt proteins to new environments (arnold, ; gupta, ), or to engineer novel regulation into enzymes (higaki et al., ). in some cases even de novo design of new proteins may be relevant, using knowledge gained from existing structures (kamtekar et al., ; shakhnovich and gutin, ; ghadiri et al., ; ball, ).
in a truly multidisciplinary project, chymosin mutants with optimal activity at increased ph values compared to wild-type chymosin were designed and produced (pitts et al., ). point mutations changing the charge distribution of superoxide dismutase have been used to increase reaction rate by improved electrostatic guidance (getzoff et al., ). a project on converting trypsin into chymotrypsin has been important for understanding the role of chymotrypsin surface loops (hedstrom et al., ), a serine active site hydrolase has been converted into a transferase by point mutations (witkowski et al., ), and mutations in insulin aiming at increased folding stability have given an insulin with enhanced biological activity (kaarsholm et al., ). an example of a rational de novo project (as opposed to the random approach used, e.g., in generation of catalytic antibodies) is the design of an enzymatic peptide catalysing the decarboxylation of oxaloacetate via an imine intermediate, in which a very simple design gave a three to four orders of magnitude faster formation of imine compared to simple amine catalysts. in some cases it may also be an interesting approach to incorporate nonpeptidic residues into otherwise normal proteins (baca et al., ), or to build de novo proteins by assembling peptidic building blocks onto a nonpeptidic template (tuchscherer et al., ). it has been shown that incorporation of nonpeptidic residues into β-turns of hiv- protease gives a more stable enzyme (baca et al., ). the main problem with this approach is how to incorporate the non-standard residues. in the hiv- protease case solid-phase peptide synthesis combined with traditional organic synthesis was used; others have suggested that the degeneracy of the genetic code may be used to incorporate novel residues via the standard protein synthesis machinery of the cell (fahy, ).
in the present review we will look at the design part of the protein engineering process, with emphasis on some of the more difficult steps, especially homology-based modelling in cases with very low sequence similarity, nuclear magnetic resonance (nmr) of very large proteins and modelling of electrostatic interactions. in the last part of the review we will discuss some possible future directions for protein engineering and protein design. any protein engineering project is based on information about the protein sequence. this information may stem from either direct protein sequencing or a deduced translation of the dna/rna sequence. the amount of information on protein and nucleic acid sequences, as well as on relevant data like -d structures and disease-related mutations, is growing at a very rapid pace, and novel databases and computer tools give increased access to these data (coulson, ). it is very reasonable to expect that projects like the human genome project will succeed in providing us with sequence information about every single gene in our chromosomes within the next decade. this information will be of key importance for our understanding of the biology, development and evolution of man. [fig. : after transcription, the mrna may be edited, a process that now has been reported in man, plants and primitive organisms (trypanosoma brucei). the mrna is then translated into a protein sequence. this protein sequence can subsequently be modified, leading to n- or o-glycosylation, phosphorylation, sulfation or the covalent attachment of fatty acid moieties to the protein. at this stage the protein is ready for transport to its final destination - which may be right where it is at the time of synthesis, but the destination may also be extracellular or in secluded compartments such as the mitochondria or lysosomes. in this case the protein is equipped with a signal sequence. after arrival at its destination the protein is processed, often involving proteolytic cleavage of the signal sequence. shorter routes to the functional protein with fewer steps undoubtedly exist, as well as routes with interchanged steps of processing. finally, the catabolism of the protein is also part of the process, but has been left out in the figure.] it should, however, be kept in mind that the sequence itself may give us little information about regulation of gene expression, i.e., under what conditions genes are expressed, if they are expressed at all. most protein sequences have been deduced from gene sequences. it is in most cases a priori assumed that a trivial mapping exists between the two sets of information. however, this may not necessarily be the case. in fig. , the various steps currently recognised as being of importance for the production of the mature enzyme are shown, and several of these steps may affect the mapping from gene to protein. posttranscriptional editing is modifications at the mrna level affecting the mapping of information from gene to protein, often involving modification, insertion or deletion of individual nucleotides at specific positions (cattaneo, ). currently only speculative models exist for the underlying molecular mechanism(s) of posttranscriptional editing. in the case of mammalian apolipoprotein b two forms exist, both originating from a single gene. the shorter form, apo b , arises by posttranscriptional mrna editing whereby cytidine deamination produces a uaa termination codon (teng et al., ). in the ampa receptor subunit glur-b, mrna editing is responsible for changing a glutamine codon (cag) into an arginine codon (cgg) (higuchi et al., ). this editing has a pronounced effect on the ca + permeability of the ampa receptor channel, and it seems to be controlled by the intron-exon structure of the rna.
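the two editing events described above (the glur-b q/r site and the apolipoprotein b stop codon) can be illustrated with a toy codon lookup; only the handful of codons needed for the examples is included.

```python
# Minimal codon table covering just the editing examples from the text.
CODON = {"CAG": "Gln", "CGG": "Arg", "CAA": "Gln", "UAA": "stop"}

def edit(codon, pos, base):
    """Return the codon with the base at 0-based position pos replaced."""
    return codon[:pos] + base + codon[pos + 1:]

# GluR-B: editing at the second codon position, CAG -> CGG, swaps
# glutamine for arginine (the Q/R site).
assert CODON[edit("CAG", 1, "G")] == "Arg"

# Apolipoprotein B: C-to-U deamination converts a glutamine codon (CAA)
# into a UAA termination codon, truncating the protein.
assert CODON[edit("CAA", 0, "U")] == "stop"
print("both editing events change the encoded product")
```

a single-nucleotide change at the mrna level thus alters either the identity of one residue or the length of the whole protein, which is why editing breaks the trivial gene-to-protein mapping.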
similar mrna editing has been reported in the related kainate receptor subunits glur- and glur- , where two additional codons in the first trans-membrane region are altered (sommer et al., ; köhler et al., ). it is also interesting in this context that certain human genetic diseases have been related to reiteration of the codon cag (green, ). mrna editing in plant mitochondria and chloroplasts has also been reported (gray and covello, ). here the posttranscriptional mrna editing consists almost exclusively of c to u substitutions. editing occurs predominantly inside coding regions, mostly at isolated c residues, and usually at the first or second position of the codons, thus almost always changing the amino acid compared to that specified by the unedited codon. in trypanosoma brucei some extensive and well-documented cases of posttranscriptional editing have been reported (read et al., ; harris et al., ; adler et al., ). the editing takes place at the mitochondrial transcript level, where a large number of uridine nucleic acid bases are added to or deleted from the mrna, which is then subsequently translated. several non-editing processes affecting the transcription/translation steps are also known. although the ribosomes translate the message provided by the mrna in an almost perfect manner (with error rates less than x - per amino acid incorporated), it appears as if the mrna in certain cases contains information that forces the ribosome to read the nucleic acid information in a non-canonical fashion (farabaugh, ). a special case that may deserve some attention as well is the selenoproteins, where selenocysteine is introduced into the protein by an alternative interpretation of selected codons (böck et al., ; yoshimura et al., ; farabaugh, ). translational frameshifting has been found in retroviruses, coronaviruses, transposons and a prokaryotic gene, leading to different translations of the same gene.
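programmed -1 frameshifting of the kind found in retroviruses and coronaviruses can be sketched as the ribosome re-reading one nucleotide at a slippery site, after which all downstream codons differ from the zero-frame reading. the sequence and slip position below are invented for illustration.

```python
# Toy illustration of -1 programmed frameshifting: the ribosome slips back
# one nucleotide at `site`, so downstream codons differ from the zero-frame
# reading. Sequence and slip position are invented for the example.
def codons(seq):
    """Split a nucleotide string into consecutive triplets."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def minus_one_frameshift(seq, site):
    """Codons read if the ribosome slips back one nucleotide at `site`."""
    return codons(seq[:site]) + codons(seq[site - 1:])

seq = "AAATTTGGGCCC"
print(codons(seq))                   # ['AAA', 'TTT', 'GGG', 'CCC']
print(minus_one_frameshift(seq, 6))  # ['AAA', 'TTT', 'TGG', 'GCC']
```

the same nucleotide string thus yields two different codon series, which is how a single gene can encode two translation products.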
two cases of translational 'hops' have been reported, where a segment of the mrna is being skipped by all ribosomes; in the two cases and nucleotides were skipped, respectively (farabaugh, ). to our knowledge posttranscriptional editing and related processes are uncommon but definitely present in humans. it is, therefore, important to understand precisely how these mechanisms work, in order to correctly deduce the protein sequence from the gene sequence. the most common posttranslational modifications are side chain modifications like phosphorylations, glycosylations and farnesylations, as well as others. however, some modifications may also affect the (apparent) gene to protein mapping. posttranslational processing may involve removal of both terminal and internal protein sequence fragments. in the latter case an internal protein region is removed from a protein precursor, and the external domains are joined to form a mature protein (hodges et al., ; xu et al., ). interestingly, all intervening protein sequences reported so far have sequence similarity to homing endonucleases (doolittle, ), which can also be found in coding regions of group i introns (grivell, ). posttranslational modifications like phosphorylation, glycosylation, sulfation, methylation, farnesylation, prenylation, myristylation and hydroxylation should also be considered in this context. they modify properties of individual residues and of the protein, and may thus make surface prediction, dynamics simulations and structural modelling in general more complex. the residues that are specifically prone to such modifications are tyrosines (phosphorylation and sulfation), serine and threonine (o-glycosylation), asparagine (n-glycosylation), proline and lysine (hydroxylation) and lysine (methylation). in addition glutamic acid residues can become γ-carboxylated, leading to high affinity towards calcium ions (alberts, ).
specific transferases are involved in the modification, e.g., tyrosylprotein sulfotransferases (suiko et al., ) and farnesyl-protein transferases (omer et al., ). phosphorylation of amino acid residues is an important way of controlling the enzymatic function of key enzymes in the metabolic and signalling pathways. tyrosine kinases phosphorylate tyrosine residues, thus introducing an electrostatic charge at a residue which is uncharged at normal physiological ph. phosphorylation is central to the function of many receptors, such as the insulin and insulin-like growth factor i receptors. given the possibility that several modifications may be introduced in the sequence when we move from gene to mature protein, the task of deducing a protein sequence from the gene sequence may be more complex than we normally assume. although the protein sequence itself is a valuable starting point, the optimal basis for a rational protein engineering project will be a full structure determination of the protein. in many cases this turns out to be an expensive and time-consuming part of the project. most structure determinations are based on x-ray crystallography. this approach may give structures of atomic resolution, but is limited by the fact that stable high quality crystals are needed. many proteins are very difficult to crystallise, in particular many structural and membrane-associated proteins. a large number of important x-ray structures have been published over the last few years, and the structures of the hhai methylase (klimasauskas et al., ), the tbp/tata-box complex (kim et al., a; kim et al., b) and the porcine ribonuclease inhibitor (kobe and deisenhofer, ) are mentioned as examples only. nmr may be an alternative in many cases, as the proteins can be studied in solution, and for some experiments they can even be membrane associated.
however, nmr is limited to relatively small molecules, and even with incorporation of labelling in the protein the upper limit for a full structure determination using current state of the art methods seems to be close to kda. some novel techniques for studying structural aspects of larger proteins will be discussed (vide infra). representative examples of important nmr structures may be interleukin / (clore et al., a), the glucose permease iia domain (fairbrother et al., ) and the human retinoic acid receptor- dna-binding domain (knegtel et al., ). cryo electron microscopy (cem) is a relatively new approach to protein structure determination. the resolution of the structures is still lower than that of the corresponding x-ray structures, and a -dimensional crystal is a prerequisite. however, despite this cem appears to be a very promising approach to structure determination of membrane associated proteins that can form -dimensional crystals. cem has been used to study the nicotinic acetylcholine receptor at å resolution (unwin, ) and the atp-driven calcium pump at å resolution (toyoshima et al., ), and in a combined approach using high resolution x-ray data superimposed on cem data the structure of the actin-myosin complex (rayment et al., ) and of the adenovirus capsid (stewart et al., ) has been studied. the recent structure by kühlbrandt et al. ( ) of the chlorophyll a/b-protein complex at å resolution shows that the resolution of cem is rapidly approaching the resolution of most x-ray protein data. scanning tunnelling microscopy (stm) is another new approach for studying protein structures (amrein and gross, ; lewerenz et al., ; haggerty and lenhoff, ). the method is interesting because of a very high sensitivity, as individual molecules may be examined. the method will give a representation of the surface of the molecule, rather than a full structure determination.
however, it is possible that both cem and stm can be used for identification of protein similarity. if data from these methods show that the overall shape of a protein is similar to some other known high resolution protein structure, then the known structure may be evaluated as a potential template for homology based modelling. we believe that such a model can either be used as an improved starting point for a full structure determination (i.e., for doing molecular replacement on x-ray data), or as a low resolution structure determination by itself. in homology based modelling a known structure is used as a template for modelling the structure of a homologous sequence, based on the assumption that the structures are similar. this is a very simple and rapid process, compared to a full structure determination. the sequences may be homologous in the strict sense, meaning that there is an evolutionary relationship between the sequences. the same approach may obviously also be used for sequences that are similar, but not necessarily evolutionarily related, and in that case we probably should talk about similarity based modelling. however, in this paper we will use homology based modelling as a general term, especially since the distinction between homology and similarity may be difficult in many cases. homology based modelling may turn out to be essential for the future of protein engineering. in fig. , the number of entries in the swissprot protein sequence database (bairoch and boeckmann, ) and the brookhaven protein structure database (bernstein et al., ; abola et al., ) are shown as a function of time. as we can see, there is a very significant gap between the number of sequences and the number of structures. this gap is in fact even larger than shown in fig. , as not all entries in the brookhaven database are unique structures. a large number of entries are mutants of other structures or identical proteins with different substrates or inhibitors.
there has been an exceptional growth in the number of protein structures over the last - years. however, it is unrealistic to assume that we will be able to get high resolution experimental structures of all known proteins. the structure determination process is too time consuming, and the sequence databases are growing at a far faster pace, as shown in fig. , especially as a consequence of several large-scale genome sequencing projects. on the other hand, it may not really be necessary to do experimental structure determination of all proteins (ring and cohen, ). the assumption that similar sequences have similar structures (see fig. ) has been proved valid several times, and it seems to be true even for short peptide sequences as long as they come from proteins within the same general folding class. an interesting case which is to some degree an exception to this rule is the structure of hiv- reverse transcriptase (kohlstaedt et al., ). two units with identical sequence have similar secondary structure, but very different tertiary structure. however, this seems to be a rather exceptional case. new approaches to general structure alignment (orengo et al., ; holm et al., ; alexandrov and go, ; lessel and schomburg, ) have made it possible to search for structurally conserved domains in proteins with very low sequence similarity (swindells, ). this is an important approach, as structure normally is better conserved than sequence (doolittle, ). fig. . sequence and structure similarity (axes: sequence distance vs. structure distance). in most cases similar sequences have similar structures (region ), and dissimilar sequences (i.e., measured by a standard mutation matrix) have dissimilar structures (region ). in several cases quite dissimilar sequences have been shown to have very similar structures, at least with respect to individual domains (region ). in very special cases we may have similar sequences with different structures (region ), at least with respect to tertiary structure, showing that environment and binding to other proteins may be essential for the final conformation in some cases. however, in most cases it seems to be safe to assume that structures can be found in the lower grey triangle of this graph, indicating that structure is better conserved than sequence. several cases have been identified where the sequences are very different (at least by traditional similarity measures), whereas the three-dimensional structures are surprisingly similar. the identification of a globin fold in a bacterial toxin (holm and sander, ), and the similarity between the dsba protein and thioredoxin (martin et al., ) are relevant examples. recently the structure of the human serum amyloid p component was shown to be similar to concanavalin a and pea lectin, despite only % sequence identity (emsley et al., ), and the similarity between hen egg-white lysozyme and a lysozyme-like domain in bacterial muramidase "is remarkable in view of the absence of any significant sequence homology", as noted by thunnissen et al. ( ). this shows that there probably is a limited number of protein folds, and this number must be lower than the number of sequence classes, defined as groups of similar protein sequences. recent estimates show that this number probably is close to different protein folds (chothia, ), and approx. of these folds are known so far (burley, ; orengo et al., ). this means that rather than full structure determination of a very large number of proteins, it may be sufficient to do structure determination of only a few selected examples of each protein fold, and use this as a basis for homology based modelling of other proteins shown to have the same fold. homology based modelling of the 3-d structure of a novel sequence can be divided into several steps.
first, one or more templates must be identified, defined as known protein structures assumed to have the same fold as the trial sequence. then a sequence alignment between trial sequence and template is defined, and based upon this alignment an initial trial model can be built. this initial model must be refined in several steps, taking care of gap splicing, loops, side chain packing etc. the final model can be evaluated by several quality criteria for protein structures. an example of homology based modelling is the modelling of cinnamyl alcohol dehydrogenase based on the structure of alcohol dehydrogenase (mckie et al., ). the protein folding problem is a fundamental problem in structural biology. this problem can be defined as the ab initio computation of a protein's tertiary structure starting from the protein sequence. this problem has not been solved and appears to be extremely difficult. if we want to solve the problem by computing an energy term for all conformations of a protein, defined by rotation around the φ and ψ backbone angles of n residues in degree steps, we have to evaluate on the order of k^(2(n-1)) alternatives, where k is the number of steps per angle, even without considering the side chains. for a peptide with residues this corresponds to conformations. a hypothetical computer with processors, each processor running at hz (the frequency of uv light) and completing the energy evaluation of one conformation per cycle would need x years in order to test all conformations. the estimated age of the universe is x years. a more realistic approach is the use of molecular dynamics or monte carlo methods for simulation of protein folding. however, it is still very difficult to use this as an ab initio approach, both because folding is a very slow process compared to a realistic simulation time scale, and also because it is very difficult to distinguish between correctly and incorrectly folded structures using standard molecular mechanics force fields (novotny et al., ).
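the combinatorial argument above can be made concrete in a short calculation; the peptide length, the number of steps per angle and the machine specification below are assumptions chosen for illustration only:

```python
# Levinthal-style back-of-envelope sketch of the folding search space.
# Assumptions (not from the text): k discrete steps per backbone angle and
# two angles (phi, psi) per residue junction.
def conformation_count(n_residues: int, steps_per_angle: int) -> int:
    """Number of backbone conformations for n residues, ignoring side chains."""
    return steps_per_angle ** (2 * (n_residues - 1))

def years_to_enumerate(n_conformations: float, evals_per_second: float) -> float:
    """Wall-clock years to evaluate every conformation once."""
    return n_conformations / evals_per_second / (3600 * 24 * 365)

n = 100        # assumed peptide length
k = 3          # assumed steps per angle (120-degree steps)
total = conformation_count(n, k)
# assume 10^15 processors at 10^15 Hz, one energy evaluation per cycle
years = years_to_enumerate(total, 1e15 * 1e15)
print(f"{total:.3e} conformations, {years:.3e} years")
```

even with these generous assumptions the enumeration time dwarfs the age of the universe, which is the point of the argument in the text.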
a possible alternative approach may be to generate potential folds on a simplified lattice representation of possible residue positions (covell and jernigan, ; crippen, ). however, this approach is still very experimental. some progress has been achieved in the area of secondary (rather than tertiary) structure prediction (benner and gerloff, ). studies of local information content indicate that % match may be an upper limit for single-sequence prediction methods (rao et al., ), whereas methods taking homology data into account may probably raise this limit to approx. %. methods based on neural networks and combinations of several prediction schemes seem to give good predictions, and especially methods using homology data from multiple alignments may give predictions at % match or better in many cases (salzberg and cost, ; boscott et al., ; rost and sander, a; rost et al., ; levin et al., ). also methods taking potential residue-residue interactions into account, like the hydrophobic cluster analysis (hca), may be used for identification of potential secondary structure elements (woodcock et al., ). it has been shown that by restricting the prediction to a consensus region with stable conformation it is possible to make very reliable predictions (rooman and wodak, ). in one case, neural networks were shown to be capable of returning a limited amount of information on the tertiary structure (bohr et al., ). fig. . structure retrieval by secondary structure. a flow chart for structure retrieval by secondary structure (right side) compared to retrieval by sequence (left side). please see the text for details. in this example the secondary structure library was generated using dssp (kabsch and sander, ), the secondary structure was predicted with the phd program (rost et al., ), and fasta (pearson, ; pearson and lipman, ) was used to search the secondary structure library and the nrl-3d databases (namboodiri et al., ; george et al., ).
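the idea of combining several prediction schemes can be sketched as a simple majority vote per residue position; the three predictions below are invented toy data, and real combiners weight the methods by their reliability:

```python
from collections import Counter

# Toy consensus of secondary structure predictions: H = helix, E = strand,
# C = coil. Each string is the hypothetical output of one prediction method.
predictions = [
    "CCHHHHCCEEEC",   # method 1
    "CCHHHHHCEEEC",   # method 2
    "CCHHHHCCEEEC",   # method 3
]

def consensus(preds):
    """Majority vote over aligned prediction strings, column by column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*preds))

print(consensus(predictions))   # CCHHHHCCEEEC
```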
only the secondary structure based method was able to identify the hla class i structure as similar to the class ii structure. the ribbon representation of the hla class i antigen binding region used in this figure was generated with molscript (kraulis, ). automatic assignment of secondary structure may be inconsistent, compared to the more sophisticated classification which can be achieved by a trained expert. recent studies show that the average agreement between alternative assignment methods used on identical structures is close to % for three methods (colloc'h et al., ), or % if only two methods are compared (woodcock et al., ). vadar is a new classification method which is aiming at a better agreement between manual and automatic assignment (wishart et al., ); to what degree this may influence prediction systems remains to be seen. over the last few years it has been realised that the inverse folding problem is much easier to solve (bowie et al., ; blundell and johnson, ; bowie and eisenberg, ). the inverse folding problem can be defined as follows: given a known protein structure, identify all protein sequences which can be assumed to fold in the same way. a large number of protein structures must be available in order to use this as a general approach, as the relevant protein fold has to be represented in the database in order to be identified. however, with a limited number of possible folds actually used by nature, a complete database of all folds appears to be possible. some information about possible folding classes can be derived from experimental data. circular dichroism can be used as a crude way of measuring the relative amounts of secondary structure in a protein. classification methods based on amino acid composition can be used for classification of proteins into broad structural classes (zhou et al., ; chou and zhang, ; dubchak et al., ). this information may limit the number of different folds which have to be evaluated.
it is also possible that such information may be used to improve the performance of other methods, although data on secondary structure prediction of all-helical proteins seems to indicate that the gain may be small (rost and sander, b). however, for a unique identification of folding class more sensitive methods are needed, and the most useful one is probably some kind of protein sequence library search. in order to identify the folding class we have to search a database of known protein structures with our trial sequence. the problem is that standard methods for sequence retrieval may not be sensitive enough in all cases. if the sequences are similar, then retrieval is trivial. however, we know that there are cases where structures are known to be similar despite very different sequences. how can these cases be identified in a reliable way? the most promising approaches are based on methods for describing the environment of each residue (bowie et al., ; eisenberg et al., ; overington et al., ; ouzounis et al., ; wilmanns and eisenberg, ; lüthy et al., ). this description can be used for generating a profile, showing to what degree each residue is found in a similar environment in other structures, and this profile can be used as a basis for sequence alignment and library searches. similar property profiles can also be used for searching database systems of protein structures (vriend et al., ). a very simple approach can be used if we accept the hypothesis that protein sequences representing structures with a similar linear distribution of secondary structure elements may fold in a similar way. we can then create a sequence-type library of known structures where the residues are coded by secondary structure codes rather than residue codes (see fig. ). given the sequence of a protein with unknown 3-d structure, we can use a secondary structure prediction method and translate the sequence into a secondary structure description.
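a minimal sketch of this search-by-secondary-structure idea, with an invented state-string library and a simple match score standing in for a real 'mutation' matrix and a fasta search:

```python
# Known structures encoded as strings of secondary structure states
# (H = helix, E = strand, C = coil). All entries are invented toy data.
LIBRARY = {
    "struct_A": "CCHHHHHHCCEEEECC",
    "struct_B": "CEEEECCEEEECCHHH",
    "struct_C": "CCHHHHHCCCEEEECC",
}

# Assumed state-match scores; mismatches score -1. A real implementation
# would use a full state 'mutation' matrix and allow gaps.
SCORE = {("H", "H"): 2, ("E", "E"): 2, ("C", "C"): 1}

def state_score(query: str, target: str) -> int:
    """Ungapped score between two equal-length secondary structure strings."""
    return sum(SCORE.get((q, t), -1) for q, t in zip(query, target))

def best_template(query: str):
    """Return the (name, states) library entry scoring highest for the query."""
    return max(LIBRARY.items(), key=lambda kv: state_score(query, kv[1]))

predicted = "CCHHHHHHCCEEEECC"   # assumed prediction for the trial sequence
name, _ = best_template(predicted)
print(name)                      # struct_A scores highest here
```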
if we define a suitable 'mutation' matrix describing the probability of interconversion between different secondary structure elements, then a standard library search program like fasta (pearson and lipman, ; pearson, ) can be used in order to identify potential template structures. the example shown in fig. is the identification of hla class i as a suitable candidate for homology modelling of hla class ii. the sequence similarity is very low, % sequence identity in the antigen binding region (based on alignment of the structures), and especially for this region most sequence based methods will retrieve a large number of alternative sequences before any of the class i molecules. however, for the secondary structure based approach the hla class i sequences are retrieved as top candidates. the structure prediction did not include any information about the hla class ii structure, which recently has been published (brown et al., ). it should be mentioned that the % sequence identity score is not significantly higher than the score from a random alignment of sequences. if we, for each sequence in the swissprot protein sequence library, align it against a sequence selected at random from the same library (alignment without gaps, using the full length of the shortest sequence, and starting the alignment at a random position within the longest sequence), then the average percentage of identical residues is ( _+ )% at standard deviations. the identification method using secondary structure is based on an assumption which has to be examined more closely, and the implementation of it is very crude. much work can be done on the secondary structure prediction, the 'mutation' matrix and the search method. it will probably improve the performance to use a position-dependent gap penalty, where most gaps are placed in loop regions rather than in helices or strands. however, the method is very simple to implement and test, as necessary tools and data already are available in most labs. sequence alignment. as described in the introduction, a crucial feature in molecular evolution has been the parallel exploration of several different mutations. and although mechanisms like horizontal gene transfer and intragenic recombination may have been important as key steps in the evolution of new proteins, the most common mechanism seems to have been gene duplication followed by mutational modification (doolittle, ). this means that especially multiple sequence alignment can give essential information about the mutation studies already performed by nature. conserved residues are normally conserved because they have an important structural or functional role in the protein, and identification of such residues will thus give vital information about structure and activity of a protein. fig. . multim alignment. alignment of inositol triphosphate specific phospholipase c / from rat (pip rat) against three other pip sequences from rat. each horizontal bar represents a sequence, marked in residue intervals. black lines connecting the bars represent well conserved motifs found in all sequences, in this case subsequences of residues where at least residues are completely conserved in all sequences. it is very easy to identify two well conserved regions, annotated as region x and y in the swissprot entries, despite a residue insertion in two of the sequences. this insertion contains sh and sh domains (pawson, ). it is an interesting observation that the extra c-terminal domain of the pip _rat sequence shows a weak similarity to myosin and tropomyosin sequences. several tools have been developed for multiple alignment. a very attractive one is macaw (schuler et al., ), which will generate several alternative alignments of a given set of regions, and in a very visual way help the user to identify a reasonable combination of (sub)alignments. an even more general tool is multim (drabløs and ).
here all possible alignments, based on short motifs, are shown simultaneously, and the user is free to identify potential similarities even in cases with low sequence identity and very disperse motifs. this is possible because of the superior classification potential of the human brain compared to most automatic approaches. the method includes an option for probability based filtering of motifs, and an example of a multim alignment is shown in fig. . however, it is important to realise that in standard sequence alignment we are trying to solve a three-dimensional problem (residue interactions) by using an essentially one-dimensional method (alignment of linear protein sequences). as a consequence important conserved through-space interactions may not be evident from a standard sequence alignment. a good example can be found in the alignment of lipases (schrag et al., ). in fig. , the sequence alignment of residues in a structurally conserved core of three lipases (rhizomucor miehei lipase (derewenda et al., ), candida antarctica b lipase (a. jones, personal communication) and human pancreatic lipase (winkler et al., )) is shown. the active site residues, ser (s), asp (d) and his (h), are shown as black boxes. the ser and his residues are at identical positions. however, the asp residue of the pancreas lipase is at a very different sequence position compared to the other two lipases. it would be very difficult to identify this as the active site asp from a sequence alignment. if we look at the structural alignment in fig. , we see that the positions are structurally equivalent; it is possible for all three lipases to have highly similar relative orientation of the active site atoms, despite the fact that the alternative asp positions are located at the end of two different β-strands. an improved alignment may be generated if we can incorporate 3-d data for at least one of the sequences in the linear alignment (gracy et al., ).
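the point that conserved alignment columns flag structurally or functionally important residues can be sketched as follows; the aligned fragments are invented toy data, not the lipase or pip sequences from the figures:

```python
# Toy multiple alignment: columns where every sequence agrees are candidate
# structurally or functionally important residues.
alignment = [
    "GHSLGAAT",
    "GHSMGGAT",
    "GHSQGAVT",
]

def conserved_columns(seqs):
    """Return (position, residue) for columns where all sequences agree."""
    return [(i, col[0]) for i, col in enumerate(zip(*seqs))
            if len(set(col)) == 1]

print(conserved_columns(alignment))
```

here the invariant g-h-s-g-t positions stand out immediately, while the variable columns carry little signal on their own.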
however, in order to get a reliable alignment of sequences with low sequence similarity, we have to take true three-dimensional effects into account. this means that if we are able to identify a known 3-d structure as a potential basis for modelling, then the sequence alignment should be done in 3-d using this structure as a template. this can be done by threading the sequence through the structure and calculating pairwise interactions (jones et al., ; bryant and lawrence, ). as soon as a template has been identified, and an alignment between this template and a sequence has been defined, a 3-d model of the protein can be generated. we can either use the template coordinates directly, combined with different modelling approaches for the ill-defined regions, or the template can be used as a more general basis for folding the protein by distance geometry (srinivasan et al., ) or general molecular dynamics methods. fig. . sequence alignment of lipases. alignment of structurally conserved regions of three lipases. for each lipase the solvent accessible surface in % compared to the gxg standard state (grey scale, white is buried and black is exposed), the secondary structure as defined by the dssp program, and the sequence is shown. the position of each subsequence in the full sequence is also shown. the active site residues are shown in white on black. please observe the shift of the active site asp (d) between two very different positions. the alignment was generated using alscript (barton, ). loop regions are often highly variable, and must be treated with special approaches (topham et al., ). it is also necessary to consider the orientation of side chains. although the backbone may be well conserved, many residues, especially at the protein surface, will be mutated, as shown in fig. .
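the threading step mentioned above can be caricatured in a few lines: a sequence is scored against a template's residue-residue contact map with a pairwise potential; the contact map and potential values below are invented toy data:

```python
# Toy threading score: sum pairwise contact preferences over the contacts of
# a hypothetical template structure. Real threading uses statistically
# derived potentials and optimises the sequence-to-structure alignment.
CONTACTS = [(0, 4), (1, 5), (2, 6)]   # residue pairs in contact in the template

def pair_potential(a: str, b: str) -> float:
    """Toy potential: reward hydrophobic-hydrophobic contacts, penalise others."""
    hydrophobic = set("AVLIMFWC")
    return -1.0 if a in hydrophobic and b in hydrophobic else 0.5

def thread_score(seq: str, contacts=CONTACTS) -> float:
    """Lower score = better fit of the sequence onto the template contacts."""
    return sum(pair_potential(seq[i], seq[j]) for i, j in contacts)

print(thread_score("LAVKGLV"))
```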
the stability of a protein depends upon an optimal packing of residues, and it is important to optimise side chain conformation if we want to study protein stability and complex formation. a very common approach is the use of rotamer libraries combined with molecular dynamics refinement. recent studies show that this step of the modelling in fact may be less difficult than has been assumed. mutation studies may give valuable information about active site residues and important interactions, and exposed regions may to some degree be identified by using antibodies. however, in many cases the rationale for modelling by homology is the very lack of experimental data related to structure, and we have to use other more general methods for evaluation of models. some of the approaches we already have described for sequence alignment can obviously also be used for evaluation of models. in general, model evaluation can be based on 3-d profiles (lüthy et al., ), contact profiles (ouzounis et al., ) or more general energy potentials (hendlich et al., ; jones et al., ; nishikawa and matsuo, ). some of these approaches have been implemented as programs for evaluation of structures or models, like procheck (laskowski et al., ) and prosa (sippl, a, b). however, in general no model (or even experimental structure) should be trusted beyond what can be verified by experimental methods. a protein model based on homology (or similarity) has to be verified in as many ways as possible, and experimental methods should always be preferred. in fig. , which includes parts of the sequences connecting the core regions, the active site asp is able to maintain a similar relative orientation, despite very different sequence positions. the alignment was generated using insight (biosym technologies). a prerequisite for rational protein engineering is 3-d structure information about the protein. in addition to x-ray crystallography, nmr is the most important method for protein structure determination.
x-ray crystallography has several advantages when compared to nmr. solving the crystal structure by x-ray crystallography is usually fast as soon as good crystals of the protein are obtained (even if it may not be so easy to obtain these crystals). it is also possible to determine the structure of very big proteins. the major disadvantage of x-ray crystallography is that it is the crystal structure that is determined. this implies that crystal contacts may distort the structure (chazin et al., ; wagner et al., ). since active sites and other binding sites usually are located on the surface of the proteins, very important regions of the protein may be distorted. some structures even show large differences between nmr and x-ray structure (frey et al., ; klevit and waygood, ). the advantage of nmr is that it is dealing with protein molecules in solution, usually in an environment not too different from its natural one. it is possible to study the protein and the dynamical aspects of its interaction with other molecules like substrates, inhibitors, etc. it is also possible to obtain information about apparent pka values, hydrogen exchange rates, hydrogen binding and conformational changes. all nuclei contain protons, and therefore they carry charge. some nuclei also possess a nuclear spin. this creates a magnetic dipole, and the nuclei will be oriented with respect to an external magnetic field. the most commonly studied nuclei in protein nmr (1h, 13c and 15n) have two possible orientations, representing high and low energy states. the frequency of the transition between the two orientations is proportional to the magnetic field. at a magnetic field of . tesla the energy difference corresponds to about mhz for protons. in an undisturbed system there will be an equilibrium population of the possible orientations, with a small difference in spin population between the high and low energy orientation.
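the proportionality between transition frequency and field strength is the larmor relation, ν = γb/2π; the field value below is an example chosen here, and the gyromagnetic ratio is the standard literature value for the proton, not a number taken from the text:

```python
import math

# Larmor frequency of a nucleus in an external field: nu = gamma * B / (2*pi).
GAMMA_1H = 267.522e6   # rad s^-1 T^-1, gyromagnetic ratio of the proton (1H)

def larmor_mhz(b_tesla: float, gamma: float = GAMMA_1H) -> float:
    """Resonance frequency in MHz for a given field strength in tesla."""
    return gamma * b_tesla / (2 * math.pi) / 1e6

# An 11.74 T magnet (a common spectrometer field) gives ~500 MHz for protons,
# which is why such instruments are called "500 MHz" spectrometers.
print(f"{larmor_mhz(11.74):.0f} MHz")
```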
the equilibrium population can be perturbed by a radio frequency pulse of a frequency at or close to the transition frequency. in addition, the spins will be brought into phase coherence (concerted motion) and a detectable magnetisation will be created. the intensity of the nmr signal is proportional to the population difference between the levels the nuclei can possess. nuclei of the same type in different chemical and structural environments will experience different magnetic fields due to shielding from electrons. the shielding effect leads to different resonance frequencies for nuclei of the same type. the effect is measured as a difference in resonance frequency (in parts per million, ppm) between the nuclei of interest and a reference substance, and this is called the chemical shift. in molecules with low internal symmetry most atoms will experience different amounts of shielding, the resonance signals will be distributed over a well-defined range, and we get a typical nmr spectrum. the process that brings the magnetisation back to equilibrium may be divided into two parts, longitudinal and transverse relaxation. the longitudinal or t1 relaxation describes the time it takes to reach the equilibrium population. the transverse or t2 relaxation describes the time it takes before the induced phase coherence is lost. for macromolecules the t2 relaxation is always shorter than the t1 relaxation. short t2 relaxation leads to broad signals because of poor definition of the chemical shift. most molecules have dipoles with magnetic moment, and the most important cause of relaxation is fluctuation of the magnetic field caused by the brownian motion of molecular dipoles in the solution. how effectively a dipole may relax the signal depends upon the size of the magnetic moment, the distance to the dipole, and the frequency distribution of the fluctuating dipoles.
A nucleus can also detect the presence of nearby nuclei (fewer than three bonds away), which splits its NMR signal into several components. Several nuclei in a coupling network are called a spin system. By applying radio-frequency pulses it is possible to create magnetisation at one nucleus and transfer it through bonds to other nuclei, where it may be detected. The pulses are applied in a so-called pulse sequence (Ernst; Kessler et al.). The methodology for determining protein structures by two-dimensional NMR is described in several textbooks and review papers (Wagner; Wüthrich; Wider et al.). The standard method consists of two steps: sequential assignment (assignment of resonances from individual amino acids) and distance information (assignment of distance-correlated peaks between different amino acids). The first step involves acquiring coupling-correlated spectra (COSY, TOCSY) in deuterium oxide to determine the spin systems of correlated resonances. Some amino acids have spin systems that in most cases make them easy to identify (Gly, Ala, Thr, Ile, Val, Leu). The other amino acids must be grouped into several classes because they have identical spin systems even though they are chemically different. The spin systems can be correlated with the NH proton by acquiring COSY and TOCSY spectra in water. The assigned NH resonance is then used in distance-correlated spectra (NOESY) to assign correlations to the protons (NH, Hα, Hβ) of the preceding amino acid residue (Fig. ). By combining knowledge of the primary sequence (which gives the order of the spin systems) with the collected NMR data it is possible to complete the sequential assignment. Once the sequential assignment is done, assignment of short-range NOEs (up to four residues apart) gives information about secondary structure (α-helix, β-strand). 
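The anchoring idea behind sequential assignment can be sketched as a toy model: residues with distinctive spin systems (Gly, Ala, Thr, Ile, Val, Leu, per the text) serve as unambiguous entry points into the known primary sequence, from which NOE connectivities walk outward. The class grouping and the example sequence below are illustrative only:

```python
# One-letter codes of residues whose spin systems usually identify them
# uniquely (from the text); all other residues fall into ambiguous classes.
UNIQUE_SPIN_SYSTEMS = set("GATIVL")

def anchor_positions(sequence: str) -> list:
    """Positions of residues identifiable from their spin system alone.

    These act as starting points for the sequential walk through
    NH/H-alpha/H-beta NOE correlations to neighbouring residues."""
    return [i for i, aa in enumerate(sequence) if aa in UNIQUE_SPIN_SYSTEMS]

anchor_positions("MKVLGAT")  # -> [2, 3, 4, 5, 6]
```

In a real assignment the anchors are matched against the primary sequence, and runs of consecutive anchors (as in the example) pin down the register of the sequential walk particularly well.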
Long-range correlations serve as constraints (together with scalar couplings) for determining the tertiary structure of the protein. Excellent procedures describing these steps are available (Roberts; Wüthrich). With large proteins the resonance lines overlap. The problem is partially solved by labelling the protein with ¹³C and ¹⁵N isotopes, after which triple-resonance multidimensional NMR methods (Griesinger et al.; Kay et al.) can be applied. The resonances are then spread out in two further dimensions (¹³C and ¹⁵N) and the overlap problem is reduced. Because these methods rely on scalar couplings to perform the sequential assignment, the procedure becomes less error-prone. The NOESY spectra of such large proteins are often very crowded, but four-dimensional experiments such as the ¹³C-¹³C-edited NOESY spectrum (Clore et al.) have been designed. Such experiments spread the proton-proton distance-correlated peaks by the chemical shift of the corresponding ¹³C neighbour, reducing the spectral overlap. Secondary structure elements may also be predicted from ¹H and ¹³C chemical shifts (Spera and Bax; Williamson and Asakura; Wishart et al.). Obtaining NMR spectra of proteins raises several issues. Spectral overlap: as we move to larger proteins, the probability that resonance lines overlap increases, and at some point sequential assignment becomes impossible. Application of 3-D and 4-D multiresonance NMR has made it possible to assign proteins well into the tens-of-kDa range (Foght et al.; Stockman et al.). Fast relaxation: as the size of the protein increases, its rate of tumbling in solution is reduced. This shortens the transverse relaxation time (T2) and broadens the resonance lines in the NMR spectra; peak intensities are reduced and peaks may become difficult to detect. 
The short transverse relaxation time also limits the length of the pulse sequences that can be applied (because no phase coherence is left), making multidimensional methods difficult. The proteins for which a 3-D structure can be determined by NMR or X-ray crystallography are probably a subset of all proteins (Wagner). Proteins may have mobile regions that give few cross-peaks. The effective size of a protein is often increased by aggregation, which can usually be reduced by lowering the protein concentration. Thus the degree of aggregation, by limiting the maximum concentration that may be used, often determines whether a protein structure can be assigned and solved by NMR. The stability of the protein is also a major issue: a sample may be left in solution for days, often at elevated temperature, so denaturation may become a problem. Photo-CIDNP (photochemically induced dynamic nuclear polarisation) is an interesting technique for studying surface-exposed aromatic residues in proteins (Broadhurst et al.; Cassels et al.; Hore and Kaptein; Scheffler et al.). By introducing a dye and exciting it with a laser, it is possible to transfer magnetisation to aromatic residues, where it can be observed. In addition to high-resolution NMR, solid-state NMR has also been applied to proteins. Studies of active sites and of the conformation of bound inhibitors yield interesting information. The stability of proteins may be monitored under different conditions by detecting signals from transition intermediates bound to the active site (Burke et al.; Gregory et al.). Structural constraints on the transition-state conformation of bound inhibitors can be obtained (Auger et al.; Christensen and Schaefer), and constraints on the fold and conformation of the amino acid sequence can be gathered by setting upper and lower bounds on the distances between specific amino acids (McDowell et al.). 
Using solid-state NMR it is also possible to study membrane proteins and their orientation with respect to the membrane (Killian et al.; Ulrich et al.). We expect such studies to give insight into ion channels in membranes (Woolley and Wallace). An important mechanism for relaxation in high-resolution NMR is dipolar relaxation. Usually this is induced by the spin of nuclei in the immediate vicinity, and its strength is a function of the size of the dipole. The electron is also a magnetic dipole, with a magnetic moment several hundred times that of a proton. Paramagnetic compounds possess an unpaired electron that interacts with nearby protons and increases their relaxation rate. (Fig.: outline of the paramagnetic relaxation method. Protons located at the protein surface lie closer to the dissolved paramagnetic relaxation agent than protons in the protein core; hence resonance lines from surface protons are broadened more than those from protons inside the protein.) The widest use of paramagnetic compounds has been Gd³⁺ bound to specific sites in a protein, but other compounds have also been used (Chang et al.; Hernandez et al.). This makes it possible to identify resonance lines from residues in the vicinity of the binding site, and since the relaxation effect is distance dependent, distances from the paramagnetic atom can be calculated. The paramagnetic broadening effect can also be exploited with a compound moving freely in solution (Drayney and Kingsbury; Esposito et al.; Petros et al.); residues on or close to the protein surface then give broadened resonance lines compared with residues in the interior. 
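The distance dependence that makes this surface/interior discrimination work can be sketched simply: dipolar relaxation enhancement falls off as r⁻⁶ (the Solomon-Bloembergen distance dependence), so a proton twice as far from the paramagnetic centre is broadened 64-fold less. The distances below are illustrative, not measured values:

```python
def relative_broadening(r_angstrom: float, r_ref_angstrom: float = 4.0) -> float:
    """Paramagnetic line broadening relative to a proton at the reference
    distance, using the r^-6 dipolar distance dependence."""
    return (r_ref_angstrom / r_angstrom) ** 6

surface = relative_broadening(4.0)   # 1.0   -> strongly broadened
buried = relative_broadening(10.0)   # ~0.004 -> essentially unaffected
```

The steepness of r⁻⁶ is the whole point of the method: even a modest difference in distance to the relaxation agent separates surface residues cleanly from buried ones.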
This method can be used to measure important NOEs and chemical shifts inside the protein directly, or as a difference method to identify resonances at the surface by comparing spectra acquired with and without the paramagnetic relaxation agent (Fig. ). We have used the paramagnetic compound gadolinium diethylenetriamine pentaacetic acid (Gd-DTPA) as a relaxation agent. Gd-DTPA increases both the longitudinal and the transverse relaxation rates of protons within its sphere of influence. Suitable NMR experiments for highlighting the relaxation effect are NOESY, ROESY and TOCSY (Bax and Davis; Braunschweiler and Ernst). Gd-DTPA is widely used in magnetic resonance imaging (MRI) to enhance tissue contrast; it is assumed to be non-toxic and we do not expect it to bind to proteins. As a test protein we used hen egg-white lysozyme, whose structure and NMR spectra are both known (Diamond; Redfield and Dobson) and which is extremely well suited for NMR experiments. In Fig. , the 1-D ¹H-NMR spectrum recorded in the presence and absence of Gd-DTPA is shown. Although a selective broadening is evident in the 1-D spectrum, there are clearly problems with overlapping spectral lines. We therefore applied two-dimensional NMR methods; the low-field region of a NOESY spectrum of lysozyme, corresponding to the same region as the 1-D spectrum, is also shown. The signals from three of the tryptophan indole NH protons disappear upon addition of Gd-DTPA, while the signals from the other three, among them W111, remain observable. Examination of the solvent-accessible surface of lysozyme shows that the indole NH protons of the first group are exposed to solvent, whereas those of the second group, including W111, are not. The changes in the spectrum are thus as expected from the structural data. The appearance of the NH-NH region of the spectrum (Fig. 
) also shows the reduction in the number of signals in the Gd-DTPA-exposed spectrum. This demonstrates that the paramagnetic broadening effect can be used for the selective identification of signals from solvent-exposed residues in a protein. One of the fundamental steps in the protein engineering process shown in Fig. is the design step, where a correlation between structure and properties is established in order to select structural candidates that match new functional profiles. Understanding this correlation requires realistic modelling of the physicochemical properties underlying the functional features to be engineered. These features are basically of two types: diffusional and catalytic. The binding of any ligand to a protein, whether ligand-receptor or substrate-enzyme, is essentially a diffusional encounter of two molecules. Electrostatic interactions are the strongest long-range forces at the molecular scale, so it is not surprising that they are a determinant effect in the final part of the encounter (Berg and von Hippel). In substrate-enzyme interactions, the catalytic step that follows substrate binding appears to be possible mainly because electrostatic forces stabilise the reaction intermediates in the binding site (Warshel et al.), from which product formation may proceed. Another, much more basic condition for a successfully engineered protein is that a functional folded conformation is maintained. Solvation of charged groups is one of the determinants of protein folding (Dill), so even the conformation of the protein is electrostatically driven. Given the ubiquitous role of electrostatic interactions, their accurate modelling is clearly an essential prerequisite for the design of engineered proteins. Several good reviews exist on protein electrostatics (Warshel and Russel; Matthew; Rogers; Harvey; Davies and McCammon; Sharp and Honig). 
This section gives a brief overview of the subject. We start by presenting the methods one can use to model electrostatic interactions. The most familiar methodology in biomolecular modelling is certainly molecular mechanics (MM), either through energy minimisation or molecular dynamics (MD). We point out some of the limitations of MM in the treatment of electrostatic interactions and the need for alternative descriptions of the system, such as continuum methods. The computation of pH-dependent properties and some potential extensions of MM are also discussed. Finally, we describe some applications of electrostatic methods relevant to protein engineering. In MM simulations, electrostatic interactions are usually described with a pairwise Coulombic term of the form q₁q₂/(Dr), where q₁ and q₂ are the charges of the pair of atoms, r their distance, and D the dielectric constant. D is usually set equal to 1 when the solvent is included explicitly. A complete simulation in a sufficiently big box of water molecules should, in principle, give a realistic description of the protein molecule (Harvey). This would be especially true if a force field including electronic polarizability effects (see below) were available for biomolecular systems, which unfortunately is not the case (Harvey; Davis and McCammon). We use the term force field here as comprising both the functional form and the parameters describing the energetics of the system, from which the forces are derived. Simulations in which solvent molecules are not treated explicitly are naturally appealing, since computation time increases with the square of the number of atoms. Several methods have been proposed to account for solvent effects implicitly. The most popular approach is an ad hoc dielectric 'constant' proportional to the distance (e.g., McCammon and Harvey), but other distance dependencies can be used (e.g., Solmajer and Mehler). 
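The pairwise Coulombic term and the ad hoc distance-dependent dielectric can be sketched in a few lines. The unit system (charges in electron charges, distances in Å, energies in kcal/mol via the conversion factor ≈332.06) is a common convention; the charges and distance in the usage lines are made-up values:

```python
COULOMB_KCAL = 332.06  # kcal mol^-1 A e^-2: Coulomb constant in e/Angstrom units

def coulomb_energy(q1: float, q2: float, r: float, dielectric: float = 1.0) -> float:
    """Pairwise Coulombic term q1*q2/(D*r) from the text."""
    return COULOMB_KCAL * q1 * q2 / (dielectric * r)

def coulomb_energy_ddd(q1: float, q2: float, r: float) -> float:
    """Same term with the ad hoc distance-dependent dielectric D = r,
    a crude stand-in for solvent screening."""
    return coulomb_energy(q1, q2, r, dielectric=r)

e_vac = coulomb_energy(1.0, -1.0, 5.0)      # ~-66.4 kcal/mol, unscreened
e_scr = coulomb_energy_ddd(1.0, -1.0, 5.0)  # ~-13.3 kcal/mol, 'screened'
```

Comparing the two results shows what the distance-dependent dielectric does: it damps interactions increasingly with separation, mimicking solvent screening without any explicit water, at the cost of the physical effects (such as hydrogen bonding) the text warns about.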
A variety of more elaborate methods have also been suggested (Northrup et al.; Still et al.; Gilson and Honig). All these methods should be viewed as attempts to include solvent screening effects in a simplified way. They can be useful when inclusion of water is computationally prohibitive, but they cannot substitute for explicit solvent since, for example, hydrogen bonding with the solvent is not properly described by these approaches. MM of biomolecules is, in general, computationally heavy. The number of water molecules that must be included to simulate a typical protein realistically is quite large, especially for MD. Moreover, each pair of atoms has its own electrostatic interaction, and the number of pairs cannot be reduced by a short cut-off distance as it can for van der Waals interactions, since electrostatic interactions are very long-ranged. MM simulations also have some limitations in the description of the system, since pH and ionic strength effects are usually difficult or impossible to include. The only way to include pH effects is through the protonation state of the residues. Each titrable group (in Asp, Glu, His, Tyr, Lys, Arg, and the C- and N-termini) of the protein has two states, protonated or unprotonated; thus, a protein with N titrable groups has 2^N possible protonation charge sets. The best we can do is choose the set corresponding to the protonation states of model compounds at the desired pH. Free ions can be included in MD simulations of proteins (Levitt; Mark et al.), but it is not clear whether the simulated time intervals are long enough to reflect ionic strength effects realistically. Another problem with MM is that the understanding it provides of the system (through energy minimisation or MD) does not include entropic aspects explicitly, i.e., it does not give free energies directly. 
Methods exist to calculate free energies based on MM potentials (Beveridge and DiCapua), but even though several applications have been made to biomolecular systems (for a review see Beveridge and DiCapua), they are still too demanding for routine use on systems of this size. Hence, when the properties under study are related to free energies rather than energies (which is often the case), MM by itself can only be regarded as a first approach. In summary, although MM simulations can provide unique information on the structural and dynamical behaviour of biomolecular systems, limitations exist for both conceptual and practical reasons, particularly in the treatment of electrostatic interactions. Fortunately, other methods can provide insight into aspects whose modelling is poor or absent in MM simulations, albeit at the cost of atomic detail in the description. There is no single 'best' modelling method, and we should resort to the several methods available in order to gain as complete an understanding of the system as possible. The so-called continuum or macroscopic models assume that the laws of electrostatics are valid at the protein molecular level and that macroscopic concepts such as dielectric properties are applicable. Protein and solvent are treated as dielectric materials in which charges are located. These charges may be titrable groups (whose protonation state may vary), permanent ions (structural and bound ions, etc.) or, more recently, the permanent partial charges of polar groups. Given the dielectric description of the system and the placement of the charges, the problem reduces to the solution of the Poisson equation (or an equivalent formulation), as does any problem in electrostatics (e.g., Jackson). The electrostatic potential thus obtained can be used to study diffusional processes or to compare different molecules visually (see below). 
The simplest continuum model assumes the same dielectric constant inside and outside the protein. Typically, a value somewhere between the protein and solvent dielectric constants has been used (Sheridan and Allen; Koppenol and Margoliash; Hol). This approach completely ignores the effects of having two very different dielectric regions, but can serve for a first qualitative computation. The more common continuum models treat the protein as a low-dielectric cavity immersed in a high-dielectric medium, the solvent. How the charges are placed in this cavity and how the electrostatic problem is solved vary with the particular method. Analytical solutions can be obtained for the simplest shapes, such as spheres, but more complex shapes generally require numerical techniques. In the first cavity model the protein was assumed to be a sphere with its charge uniformly distributed over the surface (Linderstrøm-Lang). Tanford and Kirkwood proposed a more detailed model in which each charge has a fixed position below the surface. Assuming spherical geometry allows a simple solution of the electrostatic problem; it is even possible to include an ionic atmosphere that accounts for ionic strength effects (leading to the Poisson-Boltzmann equation), and the effect of pH arises naturally in the formalism. The energy cost of burying a charge inside the low-dielectric protein (the self-energy) is taken to be the same as in small model compounds, since when this method was developed (before protein crystallography) charges were believed to be restricted to the protein surface. This limits the method to proteins without buried charges, unless some estimate of the self-energy is available. There are, obviously, problems in fitting real, irregularly shaped proteins to a spherical model. 
Some solutions to this problem have been proposed, including an ad hoc scaling of interactions based on solvent accessibility (Shire et al.) and the placing of the more exposed charges in the solvent region (States and Karplus). The inclusion of non-spherical geometries implies the use of numerical techniques, as noted above. Warwicker and Watson, and others after them, used the finite-difference technique to solve the Poisson and Poisson-Boltzmann equations. Self-energies can be included (Gilson and Honig), so that the method is fully applicable when buried charges exist. The intrinsic discretization of the system in the finite-difference technique makes these methods readily applicable to any kind of spatial dependency of any of the properties involved. The inclusion of a spatially dependent dielectric constant, for instance, is relatively simple, and other extensions such as additional dielectric regions (ligands, membranes, etc.), possibly with charges, should also be possible. Alternative numerical techniques for solving the Poisson or Poisson-Boltzmann equations have also been used, including finite elements (Orttung) and boundary elements (Zauhar and Morgan). The dielectric constant of a region arises from the dipoles in that region, permanent or induced. Permanent dipoles are due to atomic partial charges (e.g., the water dipole, the peptide bond dipole); induced dipoles are due to the polarizability of electron clouds. Warshel and Levitt represented this electronic polarizability by point dipoles on the atoms. As pointed out by Davies and McCammon, this representation is roughly equivalent to a spatially dependent dielectric constant. The approach is usually combined with a simplified representation of water by a grid of dipoles (Warshel and Russel); ionic strength and pH effects are not considered. All the above methods deal with a particular charge set (see above), even when pH effects are considered. 
However, a protein in solution does not exist in a single charge set. We are usually interested in the properties of a protein at a given pH and ionic strength, not in a particular charge set. Moreover, if we want to test the available methods, we have to test them against experimental results, which usually do not correspond to a specific charge set. A common test of the accuracy of electrostatic models is their ability to predict the pKa values of titrable groups in a protein (see below), obtained via titrations, NMR, etc. These values can differ considerably from those of model compounds, owing to the environment of the groups in the protein; this difference (the pKa shift) can amount to several pK units. The experimentally determined apparent pKa (pKapp) is the pH value at which half of the groups of that residue are protonated in the protein solution, i.e., at which the group is on average half protonated (hence the equivalent notation pK1/2). Thus, if we can devise a method to compute the mean charge of the titrable groups at several pH values, we can predict their pKapp values. As mentioned above, there are 2^N possible charge sets. Any structural property can, in principle, be computed through a Boltzmann sum over all those sets, with each one contributing according to its free energy (taken as the electrostatic energy) (Tanford and Kirkwood; Bashford and Karplus). The property thus computed is characteristic of the chosen pH value (and ionic strength, if considered) rather than of a specific charge set. We are particularly interested in computing the mean charges at a given pH (see the last paragraph). A sum with 2^N terms is, however, not a trivial calculation in terms of computer time. Tanford and Roxby avoided the Boltzmann sum by placing the mean charges directly on the titrable groups, instead of using one of the integer sets. This corresponds to treating the titrations of the different groups as independent (a mean-field approximation; Bashford and Karplus). 
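The exact Boltzmann sum over all 2^N protonation states can be sketched for small N. In this minimal model the free energy of a state (in units of kT) is ln(10)·(pH − pKa_i) summed over its protonated sites, plus pairwise penalties w[i][j] for simultaneously protonated sites; the interaction values would in practice come from a continuum calculation, and the numbers below are illustrative:

```python
import itertools
import math

LN10 = math.log(10.0)

def mean_protonation(pkas, interactions, ph):
    """Boltzmann-averaged protonation of each titrable site over all 2^N states."""
    n = len(pkas)
    weights = [0.0] * n
    z = 0.0  # partition function
    for state in itertools.product((0, 1), repeat=n):
        # Free energy (in kT) of this protonation microstate.
        g = sum(LN10 * (ph - pkas[i]) for i in range(n) if state[i])
        g += sum(interactions[i][j]
                 for i in range(n) for j in range(i + 1, n)
                 if state[i] and state[j])
        w = math.exp(-g)
        z += w
        for i in range(n):
            if state[i]:
                weights[i] += w
    return [wi / z for wi in weights]

# A single isolated site reduces to Henderson-Hasselbalch:
# half protonated exactly at pH = pKa (the pK1/2 of the text).
theta = mean_protonation([4.0], [[0.0]], ph=4.0)  # -> [0.5]
```

Scanning the pH and locating where a site's mean protonation crosses 1/2 reproduces its pKapp; the exponential growth of the state list with N is exactly the cost that motivates the approximations discussed next.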
Other alternatives to the Boltzmann sum are the Monte Carlo method (Beroza et al.), less drastic mean-field approximations (Yang et al.; Gilson), the 'reduced site' approximation (Bashford and Karplus), or simply the assumption that the predominant charge set is enough to describe the system (Gilson). Since electrostatic interactions in proteins are typically dominated by titrable groups whose charge is affected by pH, no electrostatic treatment can be complete without taking this effect into account. A simple but effective way of doing so is to: (i) compute the electrostatic free energies (e.g., by a continuum method); (ii) compute the mean charge of each titrable group at a given pH (e.g., by a mean-field approximation); (iii) use those charges to compute the electrostatic potential (e.g., by a continuum method), which can be displayed together with the protein structure (see the human pancreatic lipase example below). In this way a pH-dependent electrostatic model of the protein is obtained, which is not possible with the usual MM-based modelling techniques. As stated above, electronic polarizability is not explicitly considered in common force fields. Van Belle et al. included the induced-dipole formalism (Warshel and Levitt) in MM calculations; the electrostatic interactions in the applied force field were simply 'corrected' with additional terms due to inducible dipoles. It should be noted, however, that a force field fitted to experimental data without polarizability terms should be refitted if those terms are included. The protein conformation used in molecular modelling is usually an experimentally based (X-ray, NMR) mean conformation, characteristic of the particular experimental conditions. That conformation may, however, be inadequate for modelling the protein's properties under different conditions; in particular, proteins are known to denature at extreme pH. 
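The Tanford-Roxby-style mean-field shortcut mentioned above can be sketched as well: each site sees its neighbours only through their mean protonation, and the coupled equations are iterated to self-consistency instead of enumerating 2^N states. The interaction energies (in kT, penalising simultaneous protonation) are again illustrative stand-ins for continuum-model values:

```python
import math

LN10 = math.log(10.0)

def mean_field_protonation(pkas, interactions, ph, iterations=200):
    """Mean-field (Tanford-Roxby-like) estimate of site protonation.

    Each site i titrates with an effective pKa shifted by the mean
    protonation of the other sites, iterated until self-consistent."""
    n = len(pkas)
    theta = [0.5] * n  # initial guess: all sites half protonated
    for _ in range(iterations):
        for i in range(n):
            shift = sum(interactions[i][j] * theta[j]
                        for j in range(n) if j != i) / LN10
            theta[i] = 1.0 / (1.0 + 10.0 ** (ph - pkas[i] + shift))
    return theta

# With no interactions each site follows Henderson-Hasselbalch independently.
theta = mean_field_protonation([4.0, 9.0], [[0.0, 0.0], [0.0, 0.0]], ph=4.0)
```

The cost here is linear per iteration rather than exponential in N, which is why mean-field schemes (and the Monte Carlo and reduced-site methods cited above) remain the practical choice for proteins with many titrable groups.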
Thus, pH-dependent methods such as the continuum methods may give incorrect results when a single conformation is used over the whole pH range. Indeed, MD simulations have shown that the results can depend strongly on side-chain conformation (Wendoloski and Matthew): although overall properties such as titration curves did not seem very sensitive, individual pKa's showed substantial variations. As mentioned above, MM poses the problem of which charge set to use in simulations. Instead of using a charge set corresponding to model compounds at the intended pH, one may use the predominant charge set of the protein, determined, e.g., by a continuum method, as suggested by Gilson. A different approach to this problem is to devise a way of including the averaged effect of all charge sets in the MM simulation. We have recently developed a method in which a force field is derived that includes the properly averaged effect of all charge sets (a potential of mean force) (to be published). The method depends on the calculation of electrostatic free energies obtained from, e.g., a continuum method. The electrostatic potential, computed by some of the methods referred to above, can help in understanding the contribution of electrostatic interactions to the diffusional encounters of proteins with ligands (substrates or otherwise). The diffusional process driven by the electrostatic field can be simulated by Brownian dynamics (BD) and diffusion rates can be computed (for references see, e.g., Davies and McCammon). The effect of mutations on the diffusion of the superoxide ion into the active site of superoxide dismutase has been studied by this technique (Sines et al.), and faster mutants showing severalfold increases in reaction rate could be designed (Getzoff et al.), even though this enzyme is usually considered to be 'perfect'. 
Electrostatically driven BD simulations can help to reveal steric 'bottlenecks' (Reynolds et al.) and orientational effects (Luty et al.), and the method can also be applied to the encounter of two proteins (Northrup et al.). Visual comparison of electrostatic fields can also provide useful information. Soman et al. showed that rat and cow trypsins have similar electrostatic potentials near the active site, despite a substantial difference in total charge. As an illustration of this type of comparison using pH-dependent electrostatics, we have applied the solvent-accessibility-modified Tanford-Kirkwood method (see above) to the human pancreatic lipase structures with both closed (van Tilbeurgh et al.) and open lid (van Tilbeurgh et al.), as shown in Fig. a and b. Fig. c-f shows surfaces of constant positive and negative electrostatic potential, in units of kT/e (where k is the Boltzmann constant, T the absolute temperature and e the proton charge). These surfaces correspond to regions where the electrostatic interactions on a charge are roughly of the same magnitude as the thermal effects of the surrounding solvent, i.e., where charged molecules in solution begin to feel electrostatic steering or repulsion. At the higher of the two pH values examined, clear differences exist between the closed and open forms, the latter showing a dipolar groove in the presumed binding-site region. At the lower pH the molecule is strongly positively charged and most electrostatically differentiated regions have disappeared. Given the role of electrostatic interactions in molecular orientation and association (see the beginning of this section), this is expected to affect the interaction with the lipid-water interface markedly. For enzymes whose catalytic activity involves a charged residue, activity can be modulated by shifting the pKa of that residue; the pKa shifts of the active-site histidine have been successfully predicted for a number of subtilisin mutants (Loewenthal et al.). 
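The choice of kT/e as the contour unit can be made concrete: at room temperature kT/e is about 25.7 mV, so an isopotential surface of order kT/e marks where electrostatic steering on a unit charge and thermal agitation are comparable. A two-line check (298 K is an assumed room temperature):

```python
KB = 1.380649e-23       # Boltzmann constant, J K^-1
E_CHARGE = 1.602177e-19  # elementary charge, C

def kt_over_e_mv(temp_k: float = 298.0) -> float:
    """The thermal voltage kT/e in millivolts."""
    return KB * temp_k / E_CHARGE * 1e3

kt_over_e_mv()  # ~25.7 mV at 298 K
```

Potentials well above this value steer or repel charged ligands decisively; well below it, Brownian motion dominates and the field is effectively invisible to a diffusing charge.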
One of the main reasons why enzymes are good catalysts is that they stabilise the transition-state intermediate (Fersht). For enzymatic reactions that are not diffusion limited, engineering that enhances stabilisation of the intermediate will increase activity. The induced-dipole method was used to compute the activation free energy for different mutants of trypsin and subtilisin (Warshel et al.), with some qualitative agreement with experiment. The prediction of mutation-induced changes in redox potentials could also be of interest to protein engineering. Redox potentials have been predicted with some success (Rogers et al.; Durell et al.); in plastocyanin the effect of chemically modifying charged groups was also considered (Durell et al.), and the effect of mutations could be analysed in the same way as for pKa-shift calculations (see above). These examples clearly show that, whatever the particular method used, the modelling of electrostatic interactions in proteins has an important role to play in protein engineering. A highly relevant example is the design of a faster 'perfect' enzyme (Getzoff et al.), which also illustrates how the combination of different methods (BD and continuum electrostatic methods) can sometimes be decisive in a modelling study. The science of protein engineering is advancing rapidly and is emerging in many new contexts, such as metabolic engineering. Rational protein engineering is a complex undertaking, and only groups with sufficient understanding of sequences and 3-D structures can handle the underlying problems. Predicting protein structure may be difficult, but predicting future developments in a very active branch of science is hazardous at best. Nevertheless, we review below a few of the more recent research directions that we are convinced will be of key importance in the future development of protein engineering. 
Often the substrates or products of an enzymatic process are poorly soluble in aqueous media, which may lead to poor yields and difficult or expensive purification steps. The potential of using other solvents, pure or in mixtures, in which substrates and/or products are soluble has therefore attracted a great deal of attention (Tramper et al.; Arnold). Dissolving the protein in an organic solvent alters the macroscopic dielectric constant and leads to a much less pronounced difference between interior and exterior static dielectric behaviour. Protein function in such media may be altered and is poorly understood; we can expect significant developments here in the future. Despite the often dramatic change in dielectric constant when the solvent is changed from, e.g., water to an organic substance, the protein 3-D structure can remain virtually intact, as has been documented for subtilisin Carlsberg dissolved in anhydrous acetonitrile (Fitzpatrick et al.). The hydrogen-bonding pattern of the active-site environment is unchanged, most of the enzyme-bound structural water molecules are still in place, and one-third of the enzyme-bound acetonitrile molecules reside in the active site. Many enzymes remain active in organic solvents, and for enzyme reactions in which the substrate has very poor water solubility a change to an organic solvent can be of major importance (Gupta). An extreme case of a non-conventional medium for enzymatic action is the gas phase. Certain enzymes, immobilised on a solid bed, have been shown to be active at elevated temperatures towards selected substrates in the gas phase (Lamare and Legoy). Obviously the range of usable substrates is limited to those that can actually be brought into the gas phase under conditions where the enzyme is still active. Enzymes for which such reactions have been studied include hydrogenase, alcohol oxidase and lipases. 
The fact that even interfacially activated lipases (such as the porcine pancreatic and the Candida rugosa lipases) function with gas-phase carried substrate molecules opens up the interesting possibility of studying the role of water in this reaction. Protein engineering may be used to enhance enzyme activity in organic solvents (Arnold, ; Chen and Arnold, ). When subtilisin E is dissolved in % dimethylformamide (DMF), the kcat/Km for the model substrate suc-Ala-Ala-Pro-Met-p-nitroanilide drops -fold. After ten mutations were introduced, the activity in DMF was restored almost to the level of the native enzyme in water. All metabolic conversions in micro-organisms are carried out directly or indirectly by proteins. Our ability to manipulate single genes has opened up the possibility of actual control of such processes. We may alter the efficacy of a certain pathway or we may introduce totally new pathways. Thus, Escherichia coli can be modified in such a way that one can use D-glucose in the E. coli based manufacture of hydroquinone, benzoquinone, catechol and adipic acid (Dell and Frost, ; Draths and Frost, ; Frost, ).

[Fig. : Electrostatic maps of HPL with closed and open lid. Ribbon models of human pancreatic lipase with colipase are shown with closed (left: a, c, e) and open (right: b, d, f) lid. The colipase is shown in blue and the mainly alpha-helical 'lid' region is highlighted in cyan. The residues of the active site are shown in green. Access to the active site pocket seems to be controlled by the conformational state of the lid. Electrostatic isopotential contours of + . kT/e are shown at pH (c, d) and pH (e, f). The negative surfaces are represented in red and the positive surfaces in blue. The models and isopotential contours were produced with Insight II and DelPhi (Biosym Technologies, San Diego). The pH-dependent charge sets were computed with TITRA (to be published).]
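Reported activity changes of this kind are usually expressed through the specificity constant kcat/Km, which folds together changes in turnover and in apparent substrate binding. A toy calculation (the numbers are hypothetical, not those measured for subtilisin E):

```python
def specificity_ratio(kcat1, km1, kcat2, km2):
    """Fold change in the specificity constant kcat/Km between two conditions
    (condition 1 relative to condition 2)."""
    return (kcat1 / km1) / (kcat2 / km2)

# hypothetical numbers: a 40-fold drop in kcat combined with a 4-fold
# rise in Km amounts to a 160-fold loss in kcat/Km overall
loss = specificity_ratio(10.0, 1.0, 0.25, 4.0)
print(loss)  # -> 160.0
```

Because the two effects multiply, restoring activity by mutagenesis can proceed by improving either turnover or binding, or both, in each round.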
Presently such compounds are produced through organic chemical synthesis using aromatics as one of the reactants. The prospect of producing the same compounds using only microbes and glucose thus has some obvious environmental benefits. We expect to see a virtual surge in the engineering of micro-organisms towards the production of rare chemical or biochemical compounds, or compounds for which the current synthetic route is costly, either economically or from an environmental perspective. The prospect of designing and producing functional protein molecules from scratch is extremely attractive to many visionary scientists. Some central questions arise: do we know enough to undertake such tasks, and what goals can we define? Screening mutation studies of protein interfaces show that the majority of mutations reduce activity or binding affinity (Cunningham and Wells, ), indicating that most proteins already represent highly optimised designs. The groups active in this area have aimed at constructing certain 3-dimensional folds, such as the four-helix bundle (Felix) (Hecht et al., ) and histidine-based metal binding sites (Arnold, ), and even the observation of limited enzymatic activity is regarded as a successful result. Protein de novo design of helix bundles may even follow a very simple binary pattern of polar and nonpolar amino acids, as was concluded in a study of four-helix bundle proteins (Kamtekar et al., ). The helix-helix contact surfaces are mainly hydrophobic, whereas the solvent-exposed regions are hydrophilic. Many variants conforming to this hydrophobic pattern were generated, and two of these proteins were stabilised by . and . kcal mol⁻¹ relative to the unfolded form, thus approaching what is found for many natural proteins. The authors suggest that such a binary pattern may have been important in the early stages of evolution.
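The binary-patterning idea reduces a sequence to a string over {polar, nonpolar} and asks only whether that string fits the target fold. A sketch (the residue classification and example repeat are illustrative, not the actual Kamtekar et al. design):

```python
# crude polar/nonpolar split; assignments for borderline residues
# (e.g. Gly, Pro, Cys) vary between schemes and are illustrative here
NONPOLAR = set("AVLIMFWC")

def binary_pattern(seq):
    """Reduce a protein sequence to its binary nonpolar (N) / polar (P) code."""
    return "".join("N" if aa in NONPOLAR else "P" for aa in seq)

# an idealised amphipathic repeat: Leu on one helix face, Glu/Lys on the other
helix = "LEELLEK" * 2
print(binary_pattern(helix))  # -> NPPNNPPNPPNNPP
```

Many different residue choices map to the same binary code, which is exactly why large libraries of variants conforming to one hydrophobic pattern can be generated and screened.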
In our laboratory we have results supporting this conclusion for the trypsin family of proteins, which is predominantly a beta-strand based fold. Fusion and hybrid proteins may be produced by fusing the genes or gene fragments, including a proper linking region between the two genes (Argos, ). In principle this may allow for combining properties from two different proteins. Thus artificial bifunctional enzymes have been produced by fusing the genes for proteins such as beta-galactosidase and galactokinase (Bulow, ). In a recent paper an elegant hybrid protein concept is described. A hybrid antibody fragment was designed to consist of a heavy-chain variable domain from one antibody connected through a linker region of - residues to a short light-chain variable domain from another antibody (Holliger et al., ). The antibody fragments displayed similar binding characteristics to the parent antibodies. The prospect of engineering multifunctional antibodies for medical applications is imminent. A hybrid protein between the glucose transporter and the N-acetylglucosamine transporter of E. coli has been produced. The two proteins displayed % residue identity. The hybrid protein consisted of the putative transmembrane domain from the glucose transporter and the two hydrophilic domains from the N-acetylglucosamine transporter. The hybrid protein was, somewhat surprisingly, still specific for glucose (Hummel et al., ). Interestingly, several naturally occurring proteins themselves seem to have originated through gene fusion. In the case of hexokinase it is proposed that it originated from a duplication of the glucokinase gene, maintaining even the gene organisation (Kogure et al., ). Several other proteins, such as receptor proteins of the insulin family, can best be understood as gene fusion products of a kinase domain onto the rest of the receptor (which in itself may consist of several fragments).
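At the sequence level such constructs are simply two domains concatenated around a linker. A toy sketch (the domain fragments and the Gly/Ser linker are hypothetical placeholders, not the constructs cited above):

```python
def fuse(domain_a, domain_b, linker="GGGGS"):
    """Concatenate two protein domains around a flexible Gly/Ser linker,
    as in engineered bifunctional or hybrid-antibody constructs."""
    return domain_a + linker + domain_b

# hypothetical N-terminal fragments of heavy- and light-chain variable domains
vh = "EVQLVESGG"
vl = "DIQMTQSPS"
hybrid = fuse(vh, vl)
print(hybrid)  # -> EVQLVESGGGGGGSDIQMTQSPS
```

The design work lies almost entirely in choosing the linker length and composition so that both domains can fold and function; the concatenation itself is trivial.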
With potential medical applications in mind, protein-nucleic acid hybrids have been constructed, in which the nucleic acid fragment complemented the sequence of a fragment of the mRNA that the RNase should be targeted towards. The results obtained confirmed that this approach indeed worked (Kanaya et al., ). The potential for generating anti-viral agents against, e.g., HIV is obvious. As a consequence of the enormous growth in our understanding of molecular biology and material technology, a new technological sector is emerging which aims at exploring the possible advantages of creating micro-machines and switchable molecular entities. This concept is currently known as nanotechnology (Birge, ). Two concepts that we find particularly interesting are described briefly below. Rhodopsin is a very ancient molecular construct: we find rhodopsin-like molecules in a range of roles, all of them associated with its membrane location. Proton transport and receptor functions are particularly interesting. Bacteriorhodopsin from Halobacterium halobium maintains a large pH gradient across the bacterial membrane. This protein complex is coloured, and its colour can be changed by exposing the protein to light of an appropriate frequency. The lifetime of the excited state can be adjusted by tuning the physical-chemical parameters of the medium the rhodopsin is embedded in (Birge, ). This protein can be used as a molecular switch in a very broad sense, e.g., as part of a high-density memory device. However, changing the colour of a protein molecule is just one example that could be considered. Another molecular-based switch concept involves the transfer of a molecular ring (a paraquat-derived rotaxane ring) between two binding sites (Bradley, ). Currently the transfer is induced by a solvent change, but it is believed that an electrochemical transfer mechanism can be developed as well. Similar concepts can probably also be developed for proteins.
The present paper reviews some of the many new developments in protein engineering. The review is not exhaustive; it is simply not possible to do this properly within the limits of this paper. We have tried to review some selected scientific areas of key importance for protein engineering, such as the validity of protein sequence information as well as structural information. Sometimes the translation of a gene sequence to an amino acid sequence is not trivial: a range of post-transcriptional editing and splicing events may occur, leading to a functional protein whose amino acid sequence cannot be directly deduced from the gene sequence. In addition, post-translational modification may provide triggers for other parts of the cell's molecular machinery. We are thus in a situation where the full benefits and profits from projects such as the human genome project may escape us for a while. We have covered some of the recent developments in the modelling of protein structure by homology, which we regard as one of the most strategic areas of development. We will be flooded with sequence information deduced from gene sequences, and in the cases where the deduced amino acid sequences are assumed valid, we have to use homology-based structure prediction in most cases. Given that the number of protein structure families is expected to be limited, the task is doable. Here we should again caution the reader. We have no a priori reason to assume that non-soluble proteins, such as structural proteins, have structures that can be predicted from our limited library of mostly globular, soluble proteins. Some structural proteins are gigantic; the cuticle collagen of the Riftia worms from deep-sea hydrothermal vents has a molecular mass of . kDa (Gaill et al., ). It is extremely unlikely that a 3-D structure at atomic resolution of such a protein will ever be determined using the methods we have available today. NMR has emerged with surprising speed as a structure determination tool.
Many excellent reviews have been written on this topic. We have decided to direct the reader's attention to some recent developments that we believe will be of significant importance to the use of NMR in protein engineering projects. The potential of using NMR to study the solvent-exposed outer shell of larger proteins, far exceeding the kDa limit mentioned earlier, is intriguing. This is particularly so since most functionality of a protein is a feature of exactly those residues in the outer shell. Thus, we can 'peel' the protein, and thereby isolate the spectral information that pertains to the surface only. This simplifies the spectra, and in some cases even allows for a partial assignment of specific residues. Recent developments in pH-dependent protein electrostatics have been given special attention here. The similarities and differences within a family of structurally related proteins can only be understood if we are capable of interpreting the consequences of the substitutions, insertions and deletions that mostly occur at the surface of the proteins. When such changes are found and they involve charged residues, this will affect the extent or polarity of the electrostatic fields in which the protein molecule is embedded. We believe that the consequences of charge mutations can to a large extent be predicted through the use of pH-dependent electrostatics, although practical examples are still lacking. To our knowledge the results on the electrostatic consequences of the lid motion in the human pancreatic lipase (vide supra) are among the first such reported. The story of molecular biology is continuously unfolding, and our understanding of our own biology, development and evolution is becoming ever deeper and more detailed. But we are also, once again, discovering that one of the many qualities of nature is endless complexity. 
protein data bank. crystallographic databases -information content, software systems, scientific applications.
bonn/cambridge/chester, data commission of the international union of crystallography modification of trypanosoma brucei mitochondrial rrna by posttranscriptional ' polyuridine tail formation significance of similarities in protein structures (in abstracts of the th annual meeting of the protein engineering society of japan) scanning tunneling microscopy of biological macromolecular structures coated with a conducting film an investigation of oligopeptides linking domains in protein tertiary structures and possible candidates for general gene fusion engineering proteins for nonnatural environments solid-state c nmr study of a transglutaminaseinhibitor adduct structural engineering of the hiv- protease molecule with a/ -turn mimic of fixed geometry the swlss-prot protein sequence data bank polymers made to measure alscript: a tool to format multiple sequence alignments pka's of ionizable groups in proteins: atomic detail from a continuum electrostatic model multiple-site titration curves of proteins: an analysis of exact and approximate methods for their calculation mlev- -based two-dimensional homonuclear magnetization transfer spectroscopy predicting the conformation of proteins. man versus machine diffusion-controlled macromolecular interactions the protein data bank: a computer-based archival file for macromolecular structures protonation of interacting residues in a protein by a monte carlo method: application to lysozyme and the photosynthetic reaction center of rhodobacter sphaeroides free energy via molecular simulation: application to chemical and biomolecular systems research and perspectives catching a common fold seleno protein synthesis: an expansion of the genetic code protein structures from distance inequalities secondary structure prediction for modelling by homology inverted protein structure prediction a method to identify protein sequences that fold into a known three-dimensional structure will future computers be all wet? 
coherence transfer by isotropic mixing: application to proton correlation spectroscopy a photochemically induced dynamic nuclear polarization study of denatured states of lysozyme three-dimensional structure of the human class ii histocompatibility antigen hla-dr an empirical energy function for threading protein sequence through the folding motif preparation of artificial bifunctional enzymes by gene fusion solidstate nmr assessment of enzyme active center structure under nonaqueous conditions forward to the fundamentals study of the tryptophan residues of lysozyme using h nuclear magnetic resonance rna duplexes guide base conversions ph dependence of relaxivities and hydration numbers of gadolinium(ill) complexes of linear amino carboxylates ih nmr studies of human c a anaphylatoxin in solution: sequential resonance assignments, secondary structure, and global fold tuning the activity of an enzyme for unusual environments: sequential random mutagenesis of subtilisin e for catalysis in dimethylformamide proteins. 
one thousand families for the molecular biologist a correlation-coefficient method to predicting protein-structural classes from amino acid compositions solid-state nmr determination of intra-and intermolecular p- c distances for shikimate -phosphate and [ -i c]glyphosate bound to enolpyruvylshikimate- -phosphate synthase four-dimensional c/ c-edited nuclear overhauser enhancement spectroscopy of a protein in solution: application to interleukin / high-resolution three-dimensional structure of interleukin / in solution by three-and four-dimensional nuclear magnetic resonance spectroscopy origins of structural diversity within sequentially identical hexapeptides comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment extracting the information -sequence analysis software design evolves conformations of folded proteins in restricted spaces prediction of protein folding from amino acid sequence over discrete conformation spaces comparison of a structural and a functional epitope electrostatics in biomolecular structure and dynamics identification and removal of impediments to biocatalytic synthesis of aromatics from d-glucose: rate-limiting enzymes in the common pathway of aromatic amino acid biosynthesis the crystal and molecular structure of the rhizomucor miehei triacylglyceride lipase at . a resolution real-space refinement of the structure of hen egg white lysozyme dominant forces in protein folding complete assignment of aromatic h nuclear magnetic resonances of the tyrosine residues of hen lysozyme stein and moore award address. 
reconstructing history with amino acid sequences the comings and goings of homing endonucleases and mobile introns multim -tools for multiple sequence analysis genomic direction of synthesis during plasmid-based biocatalysis free radical induced nuclear magnetic resonance shifts: comments on contact shift mechanism prediction of protein folding class from amino acid composition modeling of the electrostatic potential field of plastocyanin three-dimensional profiles for analysing protein sequence -structure relationships a method to configure protein side-chains from the main-chain trace in homology modelling structure of pentameric human serum amyloid p component nuclear magnetic resonance fourier transform spectroscopy (nobel lecture) probing protein structure by solvent pertubation of nuclear magnetic resonance spectra molecular nanotechnology low resolution solution structure of the bacillus subtilis glucose permease iia domain derived from heteronuclear three-dimensional nmr spectroscopy alternative readings of the genetic code enzyme structure and mechanism. freeman protein engineering enzyme crystal structure in a neat organic solvent ih, c and lsn nmr backbone assignments of the -residue serine protease pb from bacillus alcalophilus polypeptide -metal cluster connectivities in metallothionein by novel i h- cd heteronuclear two-dimensional nmr experiments design and use of heterologous microbes for conversion of d-glucose into aromatic chemicals. 
enzyme engineering xii molecular characterization of the cuticle and interstitial collagens from worms collected at deep sea hydrothermal vents the protein identification resource (pir) faster superoxide dismutase mutants designed by enhancing electrostatic guidance self-assembling organic nanotubes based on a cyclic peptide architecture multiple-site titration and molecular modeling: two rapid methods for computing energies and forces for ionizable groups in proteins calculation of the total electrostatic energy of a macromolecular system: solvation energies, binding energies, and conformational analysis the inclusion of electrostatic hydration energies in molecular mechanics calculations calculations of electrostatic potentials in an enzyme active site calculating the electrostatic potential of molecules in solution: method and error assessment improved alignment of weakly homologous protein sequences using structural information rna editing in plant mitochondria and chloroplasts human genetic diseases due to codon reiteration: relationship to an evolutionary mechanism the influence of hydration on the conformation of lysozyme studied by solid-state c-nmr spectroscopy three-dimensional fourier spectroscopy. application to high-resolution nmr invasive introns enzyme function in organic solvents analysis of ordered arrays of adsorbed lysozyme by scanning tunneling microscopy specific cleavage of pre-edited mrnas in trypanosome mitochondrial extracts treatment of electrostatic effects in macromolecular modeling de novo design, expression and characterization of felix: a four-helix bundle protein of native like sequence converting trypsin to chymotrypsin: the role of surface loops identification of native protein folds amongst a large number of incorrect models. 
the calculation of low energy conformations from potentials of mean force nuclear magnetic relaxation in aqueous solutions of the gd(hedta) complex proton magnetic relaxation dispersion in aqueous glycerol solutions of gd(dtpa) -and gd(dota) engineered metalloregulation in enzymes rna editing of ampa receptor subunit giur-b: a base-paired intron-exon structure determines position and efficiency protein splicing removes intervening sequences in an archaea dna polymerase the role of the a-helix dipole in protein function and structure diabodies': small bivalent and bispecific antibody fragments globin fold in a bacterial toxin a database of protein structure families with common folding motifs proton nuclear magnetic resonance assignment and surface accessibility of tryptophan residues in lysozyme using photochemically induced dynamic nuclear polarization spectroscopy a functional protein hybrid between the glucose transporter and the n-acetylglucosamine transporter of escherichia coli classical electrodynamics synthesis, structure and activity of artificial, rationally designed catalytic polypeptides a new approach to protein fold recognition engineering stability of the insulin monomer fold with application to structure-activity relationships dictionary of protein secondary structure: pattern recognition of hydrogenbonded and geometrical features protein design by binary patterning of polar and nonpolar amino acids a hybrid ribonuclease h. a novel rna cleaving enzyme with sequence-specific recognition four-dimensional heteronuclear triple-resonance nmr spectroscopy of interleukin- / in solution two-dimensional spectroscopy: background and overview of the experiments orientation of the valine- side chain of the gramicidin transmembrane channel and implications for channel functioning. 
a h nmr study co-crystal structure of tbp recognizing the minor groove of a tata element crystal structure of a yeast tbp/tata-box complex two-dimensional h nmr studies of histidine-containing protein from escherichia coli. secondary and tertiary structure as determined by nmr hhaimethyltransferase flips its target base out of the dna helix the solution structure of the human retinoic acid receptor-/ dna-binding domain crystal structure of porcine ribonuclease inhibitor, a protein with leucine-rich repeats evolution of the type ii hexokinase gene by duplication and fusion of the glucokinase gene with conservation of its organization determinants of ca + permeability in both tm and tm of high affinity kainate receptor channels: diversity by rna editing crystal structure at . ,~ resolution of hiv- reverse transcriptase complexed with an inhibitor the asymmetric distribution of charges on the surface of horse cytochrome c molscript: a program to produce both detailed and schematic plots of protein structures atomic model of plant light-harvesting complex by electron crystallography biocatalysis in the gas phase procheck: a program to check the stereochemical quality of protein structures a new procedure for the detection and evaluation of similar substructures in proteins quantification of secondary structure prediction improvement using multiple alignments molecular dynamics of macromolecules in water direct observation of reverse transcriptases by scanning tunneling microscopy on the ionization of proteins long-range surface charge-charge interactions in proteins assessment of protein models with three-dimensional profiles improving the sensitivity of the sequence profile method brownian dynamics simulations of diffusional encounters between triosephosphate isomerase and glyceraldehyde phosphate: electrostatic steering of glyceraldehyde phosphate conformational flexibility of aqueous monomeric and dimeric insulin: a molecular dynamics study crystal structure of the 
dsba protein required for disulphide bond formation in vivo electrostatic effects in proteins dynamics of proteins and nucleic acids inter-tryptophan distances in rat cellular retinol binding protein ii by solid-state nmr a molecular model for cinnamyl alcohol dehydrogenase, a plant aromatic alcohol dehydrogenase involved in lignification adaptive evolution of highly mutable loci in pathogenic bacteria automated protein structure data bank similarity searches and their use in molecular modeling with development of pseudoenergy potentials for assessing protein -d- -d compatability and detecting weak homologies brownian dynamics of cytochrome c and cytochrome c peroxidase electron transfer proteins molecular dynamics of ferrocytochrome c. magnitude and anisotropy of atomic displacements an analysis of incorrectly folded protein models. implications for structure predictions characterization of recombinant human farnesyl-protein transferase: cloning, expression, farnesyl diphosphate binding, and functional homology with yeast prenyl-protein transferases fast structure alignment for protein databank searching identification and classification of protein fold families direct solution of the poisson equation for biomolecules of arbitrary shape, polarizability density, and charge distribution prediction of protein structure by evaluation of sequencestructure fitness. aligning sequences to contact profiles derived from three-dimensional structures environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds sh and sh domains rapid and sensitive sequence comparison with fastp and fasta improved tools for biological sequence comparison gene duplication and the origin of trypsin protein engineering -new or improved proteins for mankind nmr identification of protein surfaces using paramagnetic probes multidisciplinary cycles for protein engineering: site-directed mutagenesis and x-ray structural studies of aspartic proteinases. 
scand the local information content of the protein structural database structure of the actin -myosin complex and its implications for muscle contraction extensive editin~ of both processed and preprocessed maxicircle cr transcripts in trypanosoma brucei sequential h-nmr assignments and secondary structure of hen egg white lysozyme in solution electrostatics and diffusional dynamics in the carbonic anhydrase active site channel identification of structural motifs from protein coordinate data: secondary structure and first-level supersecondary structure modeling protein structures: construction and their applications nmr of macromolecules. a practical approach the modelling of electrostatic interactions in the function of globular proteins electrostatic interactions in globular proteins: calculation of the ph dependence of the redox potential of cytochrome c i extracting information on folding from the amino acid sequence: consensus regions with preferred conformation in homologous proteins prediction of protein secondary structure at better than % accuracy secondary structure prediction of all-helical proteins in two states phd -an automatic mail server for protein secondary structure prediction progress in protein structure prediction? predicting protein secondary structure with a nearest-neighbor algorithm database of homologyderived protein structures and the structural meaning of sequence alignment an winexpensive, versatile sample illuminator for photo-cidnp on any nmr spectrometer pancreatic lipases: evolutionary intermediates in a positional change of catalytic carboxylates? a workbench for multiple alignment construction and analysis a new approach to the design of stable proteins electrostatic interactions in macromolecules: theory and applications the electrostatic potential of the alpha helix electrostatic effects in myoglobin. 
hydrogen ion equilibria in sperm whale ferrimyoglobin point charge distributions and electrostatic steering in enzyme/substrate encounter: brownian dynamics of modified copper/zinc superoxide dismutases boltzmann's principle, knowledge based mean fields and protein folding recognition of errors in three-dimensional structures of proteins describing protein structure: a general algorithm yielding complete helicoidal parameters and a unique overall axis electrostatic screening in molecular dynamics simulations electrical potentials in trypsin isozymes rna editing in brain controls a determinant of ion flow in glutamate-gated channels empirical correlation between protein backbone conformation and c a and ct c nuclear magnetic resonance chemical shifts an automated method for modeling proteins on known templates using distance geometry a model for electrostatic effects in proteins difference imaging of adenovirus: bridging the resolution gap between x-ray crystallography and electron microscopy semianalytical treatment of solvation for molecular mechanics and dynamics sequencespecific h and n resonance assignment for human dihydrofolate reductase in solution posttranslational modification of protein by tyrosine sulfation: active sulfate paps is the essential substrate for this modification finding your fold (commentary) theory of protein titration curves. i. general equations for impenetrable spheres interpretation of protein titration curves. application to lysozyme molecular cloning of an apolipoprotein b messenger rna editing protein fragment ranking in modelling of protein structure. 
conformationally constrained environmental amino acid substitution tables three-dimensional cryo-electron microscopy of the calcium ion pump in the sarcoplasmic reticulum membrane biocatalysis in non-conventional media total chemical synthesis, characterization, and immunological properties of an mhc class i model using the tasp concept for protein de novo design doughnut-shaped structure of a bacterial muramidase revealed by x-ray crystallography structure determination of the cyclohexene ring of retinal in bacteriorhodopsin by solid-state deuterium nmr nicotinic acetylcholine receptor at ~, resolution calculations of electrostatic properties in proteins interfacial activation of the lipase -procolipase complex by mixed micelles revealed by x-ray crystallography structure of the pancreatic lipase -colipase complex a novel search method for protein sequence -structure relations using property profiles nmr investigations of protein structure prospects for nmr of large proteins protein structures in solution by nuclear magnetic resonance and distance geometry theoretical studies of enzymic reactions calculation of electrostatic interactions in biological systems and in solution how do serine proteases really work? calculation of the electric potential in the active site cleft due to a-helix dipoles molecular dynamics effects on protein electrostatics homonuclear two-dimensional h nmr of proteins. experimental procedures calculation of chemical shifts of protons on alpha carbons in proteins three-dimensional profiles from residue-pair preferences: identification of sequences with beta/alpha-barrel fold structure of human pancreatic lipase the chemical shift index: a fast and simple method for the assignment of protein secondary structure through nmr spectroscopy reengineering the specificity of a serine active-site enzyme. 
two active-site mutations convert a hydrolase to a transferase detection of secondary structure elements in proteins by hydrophobic cluster analysis model ion channels: gramicidin and alamethicin nmr of proteins and nucleic acids in vitro protein splicing of purified precursor and the identification of a branched intermediate on the calculation of pka's in proteins molecular cloning of cdna coding for rat plasma glutathione peroxidase a new method for computing the macromolecular electric potential an optimization approach to predicting protein structural class from amino acid composition a weighting method for predicting protein structural class from amino acid composition we want to thank christian cambillau, cnrs, marseille, for kindly providing us with pre-release -d data of human pancreatic lipase, jerry h. brown, harvard university, for sending us a prerelease dataset for the hla ii structure, alwyn jones, uppsala university, for pre-release -d data of candida antarctica b lipase, and johnmccarthy, brookhaven national laboratory, for helping us with data on previous pdb releases. the french norwegian foundation (fns ) and the norwegian research council (bp ) have contributed with financial support to some of the research activities described in this paper. a.b. and p.m. thank junta nacional de investi-ga~o cientlfica, portugal, for their grants. key: cord- - kpxhzbe authors: das, jayanta kumar; pal choudhury, pabitra title: chemical property based sequence characterization of ppca and its homolog proteins ppcb-e: a mathematical approach date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: kpxhzbe periplasmic c type cytochrome a (ppca) protein is determined in geobacter sulfurreducens along with its other four homologs (ppcb-e). from the crystal structure viewpoint the observation emerges that ppca protein can bind with deoxycholate (dxca), while its other homologs do not. 
but the reason behind this is yet to be established with certainty from primary protein sequence information. this study is primarily based on primary protein sequence analysis through the chemical basis of the embedded amino acids. firstly, we look for the chemical group specific score of amino acids. along with this, we have developed a new methodology for phylogenetic analysis based on chemical group dissimilarities of amino acids. this new methodology is applied to the cytochrome c family members and pinpoints how a particular sequence differs from the others. secondly, we build a graph theoretic model using amino acid sequences, which is also applied to the cytochrome c family members, and some unique characteristics and their domains are highlighted. thirdly, we search for unique patterns as subsequences which are common among the group or specific to an individual member. in all the cases, we are able to show some distinct features of ppca that mark it as an outstanding protein compared to its other homologs, contributing towards its binding with deoxycholate. similarly, some notable features of the structurally dissimilar protein ppcd compared to the other homologs are also brought out. further, the five members of the cytochrome family being homolog proteins, they must have some common significant features, which are also enumerated in this study. amino acids play the vital role in determining protein structure and function. but it is informative to know how the functionality of a group of proteins changes as amino acid patterns change from one protein to another. it is hard and mostly time consuming to identify the uniqueness of proteins and their functionality from wet lab experiments when working with the complete sequence.
in this regard, several techniques have been developed for the analysis of the primary protein sequence that help the biochemist to work with only a specific domain instead of the whole sequence, which reduces the experiment time. geobacter sulfurreducens is one of the predominant metal and sulphur reducing bacteria [ ] . the organism geobacter sulfurreducens is known to act as an electron donor and participate in redox reactions [ ] . the periplasmic c type cytochrome a (ppca) protein along with its four additional homologs (ppcb-e: ppcb, ppcc, ppcd, ppce) is identified in the geobacter sulfurreducens genome [ ] [ ] [ ] [ ] . altogether, the five proteins are highly conserved around "heme iv" but are not identical, and mostly differ in two hemes, "heme i" and "heme iii" [ ] . these two regions are known to interact with their own redox partners. deoxycholic acid (conjugate base deoxycholate), also known as cholanoic acid, is one of the secondary bile acids, which are metabolic byproducts of intestinal bacteria used in the medicinal field and for the isolation of membrane associated proteins [ , ] . among the five members of the cytochrome c family, only ppca can interact with deoxycholate (dxca) while its other homologs cannot. while interacting with dxca, it is observed that few residues are utilized [ , , ] . it would be worthwhile if the reason for such an amazing difference in recognizing a single compound could be found from the amino acid sequence viewpoint. further, one could also see the reason for the structural dissimilarity of ppcd compared to the other homologs [ ] . in the literature, in-silico techniques have been used to tackle various problems through the analysis of dna, rna and protein sequences in the bioinformatics field. especially, authors search for protein blocks which are highly similar and conserved among a sub-group or the entire family members [ ] [ ] [ ] [ ] .
there are twenty standard naturally occurring amino acids which are diverse, give rise to complexity in the sequences, and have some group specific susceptibility. various reduced alphabet methods have been established which can perform much better under certain conditions [ ] [ ] [ ] [ ] . sequence similarity is the most widely reliable strategy that has been used for characterizing newly determined sequences [ ] [ ] [ ] [ ] . finding functional/structural similarity from homolog sequences with low sequence similarity is a challenging task in bioinformatics. to tackle this problem, several methods have been introduced that can identify homolog proteins which are distantly distributed in their evolutionary relationships [ ] [ ] [ ] [ ] . again, in the microrna field, authors have developed a new identification technique for microrna precursors emphasizing different data distributions of negative samples [ ] . further, phylogenetic analyses are also studied from different viewpoints to find the evolutionary relationship among various species [ ] [ ] [ ] . some authors have used statistical tools for sequence alignment, alignment-free sequence comparison and phylogenetic trees [ ] [ ] [ ] [ ] . although every amino acid has individual activity, group specific function of amino acids is also evident. methods have been introduced for the d graphical representation of dna/rna or protein sequences [ ] [ ] [ ] [ ] [ ] [ ] [ ] where the methods are based on individual scores and position wise graphical representation. so, in this field the establishment of a new methodology with distinct findings is always welcome. combining various features of dna, rna and protein sequences, a web server called pse-in-one (http://bioinformatics.hitsz.edu.cn/pse-in-one/home/) has been developed [ ] which is user friendly and can be modified by users themselves.
recently, the authors have classified the twenty standard amino acids into eight chemical groups and have found some group and/or family specific conserved patterns which are involved in some functional role, especially in motor protein family members [ ] . in this study, the previously defined method [ ] of reduced alphabets is applied to the cytochrome c family protein members. we introduce a new method of phylogenetic analysis based on the chemical group dissimilarity of amino acids. in addition, we build a graph from the primary protein sequence. in designing the graph, we have designated the various chemical groups of amino acids as the vertices. the primary protein sequence is read as consecutive order pairs, serially from the first amino acid to the end of the sequence, and each order pair is nothing but a connected edge between two nodes, where the nodes in the graph correspond to the different chemical groups of amino acids. the graph is drawn for every individual protein sequence and we look for various unique edges/cycles among the entire family members. any unique finding from the graph may thus be hypothesized as having a significant functional role in the primary protein sequence, because the variation in the graph is directly affected by the amino acid residues in some specific domain where a change of chemical group has taken place. we highlight all the significant points which differ from one sequence to the other. further, working with reduced alphabets and designing the graph requires less complexity and allows easy visualization even when working with larger sequences. order pair directed graph: a directed graph g = (v, e) is a graph which consists of a set of vertices denoted by v = {v_1, v_2, . . ., v_i}, and a set of connected edges denoted by e = {e_{1,2}, e_{2,3}, . .
., e_{i,j}}, where an edge e_{i,j} exists if the corresponding two vertices v_i and v_j are connected and the direction of the edge is from the vertex v_i to the vertex v_j. from the graph, various graph theoretic properties like edge connectivity, cycles, graph isomorphism etc. can be investigated to differentiate the graphs. given an arbitrary amino acid sequence, it is first transformed into a numerical sequence as described previously, where the amino acids are categorized into eight chemical groups according to the side chain/chemical nature of the amino acids [ ] . the transformation is done using the following rules (eq ) as per the classification. if a particular amino acid is read as a_i, then the corresponding transformed group is g_k and the numerical value k is defined by the following eq ( ), e.g. k = 1 if a_i ∈ {d, e}. here, g_1, g_2, . . ., g_8 are the acidic, basic, aliphatic, aromatic, cyclic, sulfur containing, hydroxyl containing and acidic amide groups respectively [ ] . the eight numerical values are considered as the vertices of the graph g, i.e. v_i ∈ {1, 2, . . ., 8}. the algorithm is used to generate the directed graph from the primary protein sequence using matlab b software. here, we obtain a graph which is the order pair digraph because an edge is constructed through the pair (source node, target node) obtained from the consecutive order pair list of amino acids in the primary protein sequence. so, given an arbitrary amino acid sequence, we can find an order pair directed graph having at most eight vertices/nodes. output: an adjacency matrix and the corresponding order pair directed graph. define a null matrix (m) of size 8 by 8; define a 1-d array (t) of size l; find x as the chemical group number of a_i using eq ( ); the phylogenetic tree is an acyclic graph showing the evolutionary relationship among the various biological species based on their genetic closeness.
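the graph construction described above can be sketched as follows. this is a minimal python sketch, not the authors' matlab code; the residue-to-group assignment reproduces the eight-group classification listed in the text (acidic, basic, aliphatic, aromatic, cyclic, sulfur containing, hydroxyl containing, acidic amide) and is an assumption, as is every function name.

```python
# Assumed reproduction of the eight-group classification: group numbers 1-8
# correspond to acidic, basic, aliphatic, aromatic, cyclic, sulfur containing,
# hydroxyl containing and acidic amide groups respectively.
GROUP = {
    'D': 1, 'E': 1,                          # acidic
    'H': 2, 'K': 2, 'R': 2,                  # basic
    'A': 3, 'G': 3, 'I': 3, 'L': 3, 'V': 3,  # aliphatic
    'F': 4, 'W': 4, 'Y': 4,                  # aromatic
    'P': 5,                                  # cyclic
    'C': 6, 'M': 6,                          # sulfur containing
    'S': 7, 'T': 7,                          # hydroxyl containing
    'N': 8, 'Q': 8,                          # acidic amide
}

def order_pair_digraph(seq):
    """Return an 8x8 adjacency matrix: m[i][j] = 1 if some consecutive
    residue pair maps chemical group i+1 to chemical group j+1."""
    m = [[0] * 8 for _ in range(8)]
    t = [GROUP[a] for a in seq]          # transformed numerical sequence
    for src, dst in zip(t, t[1:]):       # consecutive ordered pairs
        m[src - 1][dst - 1] = 1
    return m

# toy sequence, not a real cytochrome fragment
adj = order_pair_digraph("MKKLLV")
```

with "MKKLLV" the numerical sequence is 6-2-2-3-3-3, so the digraph has the edges (6,2), (2,2), (2,3) and (3,3), and repeated pairs such as l-v add no new edge.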
although various phylogenetic tree methods have already been studied, methods based on the chemical nature of amino acids have not yet been explored in the literature, to our knowledge. our method of phylogenetic tree formation uses a dissimilarity matrix which is obtained for every pair of sequences on the basis of the chemical group specific score of amino acids. this method is thus completely alignment free and requires less computational complexity. firstly, we calculate the percentage of occurrence of amino acids from each chemical group using the following equation eq ( ). if there are n sequences, denoted s_1, s_2, . . ., s_n, then the corresponding lengths of the sequences are denoted l_1, l_2, . . ., l_n, and a particular sequence s_i is read residue by residue: for the sequence s_1, the first amino acid is read first, the second amino acid next, and so on. for each group g_k and a particular sequence s_i, we count the total number of amino acids s_i(t_k) and the score per hundred s_i(g_k) using the following eqs ( ) and ( ) respectively. for example, if the primary protein sequence length is aa, out of which aa are from the acidic group, i.e. g_1, then the score per hundred of the acidic group is (count/length) × 100 %. secondly, we measure the dissimilarity for every possible pair of sequences. the dissimilarity of two sequences s_i and s_j is denoted d(s_i, s_j). for each group g_k, we count the percentage of amino acid differences of the two sequences, taking the absolute value of the difference of the scores obtained using eq ( ). this is done for all of the respective eight chemical groups and all the values are added. finally, we get the dissimilarity matrix d of size n by n with entries d(s_i, s_j). to draw the phylogenetic tree, we use the nearest distance (single linkage) method. the pair wise distances are the entries of the obtained dissimilarity matrix and the whole procedure is written in matlab b software. five homologous triheme cytochromes (ppca-e) are identified in g.
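the dissimilarity measure above (per-group percentage composition, then the sum of absolute per-group differences over the eight groups) can be sketched in python as follows. the eight-group mapping is an assumed reproduction of the classification cited in the text, and the function names are hypothetical.

```python
# Assumed eight-group classification (group numbers 1-8 as in the text).
GROUP = {
    'D': 1, 'E': 1, 'H': 2, 'K': 2, 'R': 2,
    'A': 3, 'G': 3, 'I': 3, 'L': 3, 'V': 3,
    'F': 4, 'W': 4, 'Y': 4, 'P': 5, 'C': 6, 'M': 6,
    'S': 7, 'T': 7, 'N': 8, 'Q': 8,
}

def group_profile(seq):
    """Score per hundred for each of the eight chemical groups."""
    counts = [0] * 8
    for a in seq:
        counts[GROUP[a] - 1] += 1
    return [100.0 * c / len(seq) for c in counts]

def dissimilarity(s1, s2):
    """Sum over the eight groups of |score(s1) - score(s2)|."""
    return sum(abs(a - b) for a, b in zip(group_profile(s1), group_profile(s2)))

def dissimilarity_matrix(seqs):
    n = len(seqs)
    return [[dissimilarity(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]
```

the resulting symmetric matrix can then be fed to any single-linkage clustering routine to draw the tree; two sequences with identical group composition (e.g. "ddkk" vs. "kkdd") get dissimilarity zero, which is exactly the alignment-free behaviour intended.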
sulfurreducens periplasm and gene knockout studies revealed their involvement in fe(iii) and u(vi) extracellular reduction [ , ] . cytochromes have been thoroughly studied in laboratory experiments because of their small size (about amino acids). table shows the gene name, accession number, protein name and length (#amino acids). the primary protein sequences are collected from http://www.uniprot.org/. sequence identity and the phylogenetic tree: firstly, our analysis is directed to measure the primary protein sequence for every member. we obtain the percentage identity matrix of every pair of sequences exported from clustalw. it is observed that the sequences are at least % similar. the maximum similarity is %, which is found between ppca and ppcb. if we consider the ppca sequence, which shows a minimum of % similarity with ppce and a maximum of % similarity with ppcb, we are not able to differentiate ppca from the other homologs using the similarity percentage. secondly, we count the rate of occurrence (frequency of amino acids) of every individual amino acid of the respective five sequences, which are shown in table . then, we look for the chemical group specific frequency for every sequence, shown in table , using eq ( ). now, we obtain the dissimilarity score of all possible pairs of sequences (using eq ( )). say, for example, we compare seq. no. and seq. no. ; we get the difference for the acidic group as . ( . - . ), the basic group as . ( . - . ) and so on (from table ). the total score after summing the eight groups is . , which measures the dissimilarity percentage of the said two sequences. we get similar results for all other pairs, which are shown in table . this table shows the biological distances between each pair of sequences. from this pair wise distance matrix, the phylogenetic tree is constructed as shown in fig , as also discussed in the method section.
based on the phylogenetic tree of the five members, we find that ppca and ppcd, and ppcb and ppce, are mostly close with regard to the frequency of amino acids of the respective eight chemical groups. from fig it is not obvious that ppca differs from the other homologs, but if we go through the dissimilarity matrix (table ) , we find some variations. here, it is observed that ppca differs by a minimum of . % with ppcd, whereas for the other homologs the minimum dissimilarity is found for ppcd with ppcc, which is . %. therefore, among all the pairs, the high dissimilarity of ppca shows its uniqueness compared to its homologs. if we have a closer look into the list of amino acids, it is observed that the amino acids d, e, h, k, f, i, l, v, a, g, p, m, c, t are present among all the sequences. the other amino acids are not common to all the member sequences. therefore, on the basis of chemical groups, all the amino acids from the acidic, aliphatic, cyclic and hydroxyl containing groups are present. it is observed that the acidic, basic and hydroxyl containing group percentages distinctly differ when ppca is compared with the other homologs. further, it is observed that only one proline (p) from the cyclic group is present in ppcd while in the other homologs, proline (p) is present at least times. another important observation is that the amino acid tryptophan (w) from the aromatic group is present only in the ppcd sequence. for every member of the cytochrome c family, we draw an order pair directed graph using the algorithm, as shown in fig . there is a maximum of eight possible nodes and various directed edges among the nodes. we try to highlight the connected edges that show uniqueness, especially between ppca and its homolog members and between ppcd and the other members separately, as well as commonality to all members. details of the edge connectivity information for ppca and its homologs are shown in table .
we say two nodes (direction is from row to column) are connected or present if the cell symbol is , not present if the cell symbol is , and common to all the members if the cell symbol is *. an edge between two nodes (in order) is basically a pattern (two distinct nodes or two distinct amino acids from two different chemical groups) of length . table . existence of unique edges: comparison between ppca and the ppcb-e group obtained from the directed graph (fig ). we find two particular edges: one edge ( ) is present only in the ppca sequence (approx. residues - , s table) and is not found in the other member sequences, and one edge ( ) is present in the ppcb-e sequences (approx. residues - , s table) but not in the ppca sequence. considering all the members, we find many edges which are common to all. further, ppcd is structurally dissimilar among the homologs [ ] . looking into the order pair directed graph, we find only one variation, i.e. there is an edge ( ) from node to node among the ppca-c and ppce sequences which is not observed in ppcd (table ). in this node transition the amino acid changes from proline (p) to glycine (g) for ppca-c and ppce, while for ppcd the transition is from glycine (g) to glycine (g), located in approximately residues - (s table). again, the existence of edges between any two nodes, either common to all or specific to an individual member, has some significant role in the primary protein sequences, because node to node connectivity marks the points of change from one chemical group to the other along the primary protein sequence and this could be an effective characteristic for the structural or functional variation of proteins. although only a few residues are responsible for interacting with dxca, the neighbouring residues must also have a role in their unique characteristics.
so the subdomain identification involving different unique cycles is worth mentioning in this regard. here, we have calculated the various cycles of length c_l for the group specific and individual member specific cases, which are shown in s table. say, for example, the cycle of length whose directed edges are → → → → → → . for completing this cycle a particular subdomain is responsible. interestingly, we find various unique cycles for ppca, ppcd and ppcb-e. so there are some unique cycles which are distinctly present for ppca and its homolog proteins and vice versa. there are some unique cycles which are present in ppcd, but no unique cycle is present for ppca-c and ppce. the sub-domains for some of the unique cycles of length , and are highlighted in fig (a) for ppca and fig (b) for ppcd. from fig , the cycle ( ) of length has sub-domain residues within to , that is, the numerical sequence is . . . from fig (a). one can see the corresponding amino acid residues in s table. for some cycles, there is a possibility of different sub-domains because some edges repeat more than once in different positions of the sequence and can be counted for the same cycle. similarly, on varying the cycle length, we get different sub-domains or amino acid residues. these sub-domain findings might be of immense help to biochemists for understanding the physicochemical nature and the unique activity of various proteins. table . existence of unique edges: comparison between ppcd and the ppca-c, ppce group obtained from the directed graph (fig ). we take all five sequences of the ppca-e members and obtain the alignment from clustalw. the alignment is shown in fig . we mark the various conserved blocks as r , r , . . ., r . rectangular highlighted regions are chemically conserved, and only-highlighted regions are conserved based on individual amino acids.
we find two highly conserved regions r and r which have some variations. the first region (r ) is a residue block (hkk/rh or ) among the members ppcb-e where all the amino acids are from the basic group, but in ppca this block is hkah or , i.e. the rd position k/r is replaced by the aliphatic amino acid alanine (a). the second region (r ) is gche/k or / , where the th position amino acid is either from the acidic or the basic group, i.e. both fall under the charged group. if we look into the ppca sequence, some dissimilarities are found in the "heme i" region [ ] [ ] [ ] . the two consecutive amino acids between regions r and r in ppca are kk (from the basic group), but for ppcb-e only one amino acid is from the basic group. previously, it was observed that ppcd is structurally dissimilar [ ] and the authors showed that there is an addition of the amino acid threonine (t) in the ppcd sequence after the r region in fig . but from the figure we can see that one more amino acid, valine (v), is inserted in the region of r and r . besides, various patterns which are common to ppca but not to ppcb-e, and vice versa, are shown in table in bold. for the pattern " ", which is located within the combined regions of r and r (the "heme iii" region), there is a change of amino acid: threonine (t) for ppcd and lysine (k) for the others. apart from these, we find an amino acid deletion for both ppcd and ppce before the "heme iii" region. further, on combining the regions r , r and r (pattern " "), the change for the ppca sequence is phenylalanine (f), which is from the aromatic group, whereas the other sequences have amino acids from the aliphatic group; the change for the ppcd sequence is histidine (h), which is from the basic group, whereas the other sequences are again from the aliphatic group. again, in the region between r and r , ppcd contains the amino acid methionine (m) from the sulfur containing group while the other homologs contain phenylalanine (f) from the aromatic group.
altogether, group specific changes have a significant role in the binding with dxca for ppca and in the structural dissimilarity of ppcd. in this work, we have presented a sequence based characterization of the cytochrome c family members. we specifically emphasize the distinguishing features of ppca and ppcd compared to the other homologs. although the study suggests that the percent identity among the five members varies between % and %, on the basis of chemical groups these are shown between % and %. we highlight some of the chemical groups and their percentages that can distinguish ppca and ppcd. the dissimilarity features of ppca may play a significant role in its binding with dxca; the same may hold for ppcd and its structural dissimilarity. our proposed graph theoretic model can easily show the instant change of amino acids from one group to the other in the sequences. further, the unique cycles for ppca and ppcd may expose their outstanding nature. finally, from the alignment graph, chemically conserved regions are highlighted. we observe some special patterns where amino acid(s) in some of the sequences are abruptly changed. all these cases provide features for ppca and ppcd that would explain their unique functionality and/or structural dissimilarity. it may be noted that there are some existing methodologies [ , , , , , , ] which would reflect the sequence pattern information or key features of the observed sequence. many characteristics of dna, rna and protein sequences can be found using web servers and standalone existing tools; one of the important web servers in this regard is described in [ ] . we look at the problem in a different manner, dealing with the embedded chemical properties of amino acids and various mathematical structures. in general, the methodology defined in this article is very easy to implement to get the unique features of the observed sequences.
so, collectively, our methodology can be combined with machine learning algorithms to develop refined computational predictors. hence, the use of the reduced alphabet (amino acid) technique, involving a mathematical basis with the embedded chemical properties of amino acids, will be very useful for protein homology detection. supporting information s table. amino acids and transformed numerical sequence based on eight chemical groups for the c five members. (pdf) s table. unique cycles for ppca-e, ppca, ppcb-e, ppcd. these cycles are involved in various sub-domains, some of which are shown in fig . (pdf)
electricity production by geobacter sulfurreducens attached to electrodes
geobacter sulfurreducens sp. nov., a hydrogen- and acetate-oxidizing dissimilatory metal-reducing microorganism. applied and environmental microbiology
thermodynamic characterization of a triheme cytochrome family from geobacter sulfurreducens reveals mechanistic and functional diversity
family of cytochrome c type proteins from geobacter sulfurreducens: structure of one cytochrome c at . Å resolution
structural characterization of a family of cytochromes c involved in fe (iii) respiration by geobacter sulfurreducens
structure of a novel dodecaheme cytochrome c from geobacter sulfurreducens reveals an extended nm protein with interacting hemes
lipomas treated with subcutaneous deoxycholate injections
guide to protein purification
dissecting the functional role of key residues in triheme cytochrome ppca: a path to rational design of g.
sulfurreducens strains with enhanced electron transfer capabilities
conservation within the myosin motor domain: implications for structure and function
identification of common molecular subsequences
selection of conserved blocks from multiple alignments for their use in phylogenetic analysis
amino acid substitution matrices from protein blocks
reduction of protein sequence complexity by residue grouping
reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment
protein sequence analysis based on hydropathy profile of amino acids
mathematical characterization of protein sequences using patterns as chemical group combinations of amino acids
an introduction to sequence similarity ("homology") searching. current protocols in bioinformatics
similarity/dissimilarity studies of protein sequences based on a new d graphical representation
improved tools for biological sequence comparison
analysis of similarity/dissimilarity of protein sequences
combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection
application of learning to rank to protein remote homology detection. bioinformatics
protein remote homology detection by combining chou's pseudo amino acid composition and profile-based protein representation
a comprehensive review and comparison of different computational methods for protein remote homology detection
imirna-ssf: improving the identification of microrna precursors by combining negative sets with different distributions
phylogenetic analysis of protein sequence data using the randomized axelerated maximum likelihood (raxml) program.
current protocols in molecular biology
phylogenetic analysis of protein sequences based on conditional lz complexity
analyzing and synthesizing phylogenies using tree alignment graphs
a probabilistic measure for alignment-free sequence comparison
simplification of protein sequence and alignment-free sequence analysis
phylogenies and the comparative method
progressive sequence alignment as a prerequisite to correct phylogenetic trees
graph theory with applications to engineering and computer science
protein flexibility predictions using graph theory
dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features
use of information discrepancy measure to compare protein secondary structures
-d graphical representation of protein sequences and its application to coronavirus phylogeny
a d graphical representation of protein sequence and its numerical characterization
similarity/dissimilarity analysis of protein sequences based on a new spectrum-like graphical representation
pse-in-one: a web server for generating various modes of pseudo components of dna, rna, and protein sequences
we thank dr. pokkuluri, phani raj (argonne lab, usa) for the initial discussions of the problem. key: cord- - rx jlw authors: kim, kwangsoo; ryoo, hong seo title: selecting genotyping oligo probes via logical analysis of data date: journal: advances in artificial intelligence doi: . / - - - - _ sha: doc_id: cord_uid: rx jlw based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper.
when extensively tested on genomic sequences downloaded from the los alamos national laboratory and the national center for biotechnology information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length or nucleotides that perfectly classified all unseen testing sequences. these results well illustrate the utility of the proposed method in genotyping applications. a microarray or a dna chip is a small glass or silica surface bearing dna probes. probes are single stranded reverse transcribed mrnas, each located at a specific spot of the chip for hybridization with its watson-crick complementary sequence in a target to form the double helix [ ] . microarrays currently use two forms of probes, namely, oligonucleotide (shortly, oligo) and cdna, and have prevalently been used in the analysis of gene expression levels, which measures the amount of gene expression in a cell by observing the hybridization of mrna to different probes, each targeting a specific gene. with the ability to identify a specific target in a biological sample, microarrays are also well-suited for detecting biological agents for genetic and chronic disease [ , , , ] . furthermore, as viral pathogens can be detected at the molecular and genomic level much before the onset of physical symptoms in a patient, the microarray technology can be used for an early detection of patients infected with viral pathogens [ , , ] . the success of microarrays depends on the quality of the probes that are tethered on the chip. having an optimized set of probes is beneficial for two reasons. one, the background hybridization is minimized, hence true gene expression levels can be more accurately determined [ ] .
the other, as the number of oligos needed per gene is minimized, the cost of each microarray is minimized or the number of genes on each chip is increased, making oligo fingerprinting a much faster and more cost-efficient technique [ , ] . short probes consisting of - nucleotides (nt) are used in genotyping applications [ ] . having short optimal probes means a high genotyping accuracy in terms of both sensitivity and specificity [ , ] and can play a key role in genotyping applications. from the perspective of numerical optimization, genomic data present an unprecedented challenge for supervised learning approaches for a number of reasons. first, genomic data are long sequences over the nucleic acid alphabet Σ = {a,c,g,t}. second, for example, the complexity of viral flora, owing to constantly evolving viral serotypes, requires a supervised learning method to be trained on a large collection of target and non-target samples. that is, a typical training set contains a large number of large-scale samples. third, a supervised learning framework usually requires a systematic pairing or differencing between each target and non-target sample during the course of training a decision rule [ , , , ] . adding to these, the nature of data classification is itself difficult [ ] . based on the general framework of logical analysis of data (lad) from [ ] , we develop in this paper a probe design method for selecting short oligo probes of length l nt, where l ∈ [ , ] . to list some advantages of selecting oligo probes by the proposed method: first, the method selects probes via the sequential solution of a small number of compact set covering (sc) instances, which offers a great advantage from a computational point of view. to be more specific, consider the classification of two types of data and suppose that a training set is comprised of m+ target and m− non-target sequences.
the size of the sc training instances solved by the proposed method is min{m+, m−} orders of magnitude smaller than the optimization learning models used in [ , , ] . second, the method uses the sequence information only and selects probes via optimization based on principles of probability and statistics. that is, the probability of an l−mer (oligo of length l) appearing in a single sequence by chance is (0.25)^l, hence the probability of an l−mer appearing in multiple samples of one type but in none or only a few of the sequences of the other type by chance alone is extremely small. third, the proposed method does not rely on any extra tool, such as blastn [ ] , a local sequence alignment search tool that is commonly used for probe selection [ , , ] , or the existence of pre-selected representative probes [ ] . this makes the method truly stand-alone and free of problems that may be caused by limitations associated with external factors. last, with an array of efficient (meta-)heuristic solution procedures for sc, the proposed method is readily implementable for an efficient selection of oligo probes. as for the organization of this paper, we develop an effective method for selecting short oligo probes in section (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in section , using viral genomic sequences from the los alamos national laboratory and the national center for biotechnology information websites. the task of classifying more than two types of data can be accomplished by sequential classifications of two types of + and − data (see [ , , ] and section below). without loss of generality, therefore, we present the material below in the context of binary classification. the backbone of the proposed procedure is lad.
a typical implementation of lad analyzes the data on hand via four sequential stages: data binarization, support feature selection, pattern generation and classification rule formation. being boolean logic-based, lad first converts all non-binary data into equivalent binary observations. a + (−) 'pattern' in lad is defined as a conjunction of one or more binary attributes or their negations that distinguishes one or more + (−) type observations from all − (+) observations. the number of attributes used in a pattern is called the 'degree' of the pattern. as seen from the definition, patterns hold the structural information hidden in data. after patterns are generated, they are aggregated into a partially-defined boolean discriminant function/rule to generalize the discovered knowledge for classifying new observations. referring readers to [ , , ] for more background on lad, we design a lad-based method below for efficiently analyzing large-scale genomic data. let there be m + and m − sample observations of type + (target) and − (non-target), respectively. a dna sequence is a sequence of the nucleic acids a, c, g and t, and the training sequences need to be converted into boolean sequences of and before lad can be applied. toward this end, we first choose an integer value for l, usually l ∈ [ , ] (see section ), generate all possible l−mers over the four nucleic acid letters and then number them consecutively from to l by a mapping scheme. next, each l−mer is selected in turn and every training sample is fingerprinted with the oligo for its presence or absence. that is, with oligo j, we scan each sequence p i , i ∈ s + ∪ s − , from the beginning of the sequence, shifting to the right one base at a time, and stamp p ij = if oligo j is present in sequence i, and otherwise. after this, the oligos that appear in all or in none of the training sequences can be deleted from further consideration.
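the fingerprinting and binarization stage just described can be sketched compactly. this is a minimal illustration under assumed representations (the function name and return format are mine, not the paper's): every possible l-mer is stamped for presence/absence in each training sequence, and oligos present in all or in none of the sequences are dropped as uninformative.

```python
from itertools import product

def binarize(sequences, l):
    """Fingerprint each training sequence for presence/absence of
    every l-mer over {a, c, g, t}, then drop oligos that appear in
    all or in none of the sequences (they cannot discriminate)."""
    oligos = [''.join(p) for p in product('acgt', repeat=l)]
    rows = []
    for seq in sequences:
        present = {seq[i:i + l] for i in range(len(seq) - l + 1)}
        rows.append([1 if o in present else 0 for o in oligos])
    n = len(sequences)
    keep = [j for j in range(len(oligos))
            if 0 < sum(row[j] for row in rows) < n]
    surviving = [oligos[j] for j in keep]
    matrix = [[row[j] for j in keep] for row in rows]
    return surviving, matrix
```

in practice one would enumerate only the l-mers actually observed in the data rather than all 4-letter combinations, but the logic is the same.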
we re-number the surviving l−mers consecutively from to n and replace the original training sequences described in the nucleic acid alphabet by their boolean representations. let n = { , . . . , n}. the data are now described by n attributes a j ∈ { , }, j ∈ n . for observation p i , i ∈ s • , • ∈ {+, −}, let p ij denote the binary value the j−th attribute takes in this observation. denote by l j the literal of binary attribute a j . then, l j = a j instructs to take the value of a j in all sequences, and its negation instructs to negate it. a term t is a conjunction of literals. given a term t, let n t ⊆ n denote the index set of the literals included in the term. note here that n t of a • pattern identifies probes that collectively distinguish one or more • sequences from the sequences of the other type. let us introduce n additional features a n+j , j ∈ n , and use a n+j to negate a j . let n = { , . . . , n} and let us introduce a binary decision variable x j for a j , j ∈ n , to determine whether to include l j in a pattern. the authors of [ ] formulated a compact mixed integer and linear programming (milp) model with respect to a reference sample; consider the following. we note here that genomic data are large-scale in nature. furthermore, owing to constantly evolving viral serotypes, the complexity of viral flora is high, and this requires large numbers of target and non-target viral samples to be used for selecting optimal genotyping probes. adding to these the difficulties associated with the numerical solution of milp, we see that (milp- .i • ) above presents no practical way of selecting genotyping probes. with the need to develop a more efficient pattern generation scheme, we select a reference sequence for k ∈ s• and j ∈ n. next, we set for l ∈ s • and j ∈ n. now, consider the set covering model where c j (j ∈ n ) are positive real numbers. let (x, y) denote a feasible solution of (sc • i ). then, the term generated from (x, y) forms a • lad pattern.
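the notions of literal, term and pattern above can be made concrete with a small sketch. the dict representation of a term (attribute index mapped to the required 0/1 value, so a negated literal is simply a required 0) is an assumption for illustration, not the paper's notation:

```python
def covers(term, row):
    """A term is a conjunction of literals, represented here as a
    dict {attribute index: required 0/1 value}. The term covers an
    observation iff every literal in it is satisfied."""
    return all(row[j] == v for j, v in term.items())

def is_pos_pattern(term, pos_rows, neg_rows):
    """A + pattern must cover at least one + observation
    and none of the - observations."""
    return (any(covers(term, r) for r in pos_rows)
            and not any(covers(term, r) for r in neg_rows))
```

the number of literals in the dict is the pattern's degree, and the set of attribute indices used corresponds to n t, i.e., the probes that collectively distinguish the covered sequences from the other type.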
although smaller than the milp counterpart by only one constraint and one integer variable, (sc • i ) has a much simpler structure and is defined only in terms of - variables. in addition, it can exploit any of the sc heuristic procedures developed so far (see, for example, [ ] and references therein) for its efficient solution, hence is much preferred. note that (sc • i ) is defined by m + + m − − cover inequalities and n + m • − binary variables. also, recall that n is large for genomic sequences and the analysis of viral sequences requires large numbers of target and non-target sequences, that is, m + and m − are also large numbers. to develop a more compact sc-based probe selection model, we select a reference sequence p i , i ∈ s • , • ∈ {+, −}, and set the values of a (i,k) j for k ∈ s• and j ∈ n via ( ). consider the following sc model: where c j 's are positive reals. theorem . let x denote a feasible solution of (sc-pg • i ). then, p generated on x via ( ) forms a • lad pattern. below, we use (sc-pg • i ) to design one simple oligo probe selection procedure. let p • denote the set of • patterns generated so far. in this section, we extensively test the proposed probe design for the classification of viral disease-agents in an in silico setting, using genomic sequences obtained from the los alamos national laboratory (lanl) and the national center for biotechnology information (ncbi). table summarizes the number and the length (the minimum, average ± standard deviation and maximum lengths) of each type of the genomic data that were used in our experiments. in analyzing data in an experiment, we first decided on a length of oligos to use by calculating the smallest integer value l such that l became larger than or equal to the average of the lengths of target and non-target sequences of the experiment. then, l candidate oligos were generated to fingerprint and binarize the data.
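the sc heuristics mentioned above include the textbook greedy heuristic that is later applied to the unicost instances. a minimal sketch of that greedy procedure, with an assumed representation (each candidate column given as the set of cover inequalities it satisfies):

```python
def greedy_set_cover(universe, subsets):
    """Textbook greedy heuristic for unicost set covering: repeatedly
    pick the subset covering the most still-uncovered elements.
    subsets: dict mapping a subset name to the set of elements it covers."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(subsets[s] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError('instance infeasible')
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen
```

the greedy heuristic carries the classical logarithmic approximation guarantee for set covering, which is one reason it is a common default for large unicost instances.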
note here that if a constraint in (sc-pg • i ) has all zero coefficients, then the sc instance has no feasible solution; this case arises when the reference sequence p i , i ∈ s • , and the sequence p j , j ∈ s• have identical - fingerprints, which is a contradiction. supervised learning methodologies, including lad, presume the existence of a classification function, that is, that each unique sequence in the training set belongs to exactly one of the two classes. when the data under analysis are indeed contradiction-free, contradiction-free - clones of the data can always be obtained by using oligos of longer length for data fingerprinting and binarization. therefore, when we generated identical fingerprints for data of different types, we incremented the value of l and repeated the data binarization stage until the binary representations of the data became contradiction-free. next, procedure sc-pg was applied to generate patterns, hence probes. in applying procedure sc-pg in these in silico experiments, we selected a minimal set of oligo probes by setting c j = for all j ∈ n . for solving the unicost (sc-pg • i )'s generated, we used the textbook greedy heuristic [ ] for ease of implementation. denote by p + , . . . , p + n+ and p − , . . . , p − n− the positive and negative patterns, respectively, generated via procedure sc-pg. in classifying unseen + (target) and − (non-target) sequences, we use three decision rules. specifically, for the polyspecific genotyping experiments (in section . and experiments and in section . ), we form the standard lad classification rule [ ] Δ := where ω • i denotes the number of • training sequences covered by p • i . we assign class + (−) to a new sequence p if Δ(p) > (Δ(p) < ). we fail to classify sequence p if Δ(p) = . for monospecific genotyping in experiment in section . , we form a decision rule by where p k , . . .
, p k n k are the probe(s) selected for virus (sub-)type k, and assign p to class k if Δ k (p) > while Δ i (p) = for all i = , . . . , m, i = k. when Δ(p) > for more than one virus type or Δ k (p) = for all k, we fail to assign a class to sequence p. in each of the experiments in this section, we tested the proposed oligo probe selection method in independent hold-out experiments, each with a randomly selected % of the target and of the non-target data forming a training set of sequences and the remaining % of the target and of the non-target sequences forming the testing data. more specifically, after a training set of data was formed, we binarized the training data and selected optimal oligo probes on them via procedure sc-pg. next, a classification rule was formed by one of ( ), ( ) and ( ) above and then used for classifying the corresponding testing sequences. these steps were repeated times to obtain the average testing performance and other relevant information of the experiment. the computational platform used for these experiments was an intel . ghz pentium linux pc with mb of memory. the infection with hpv is the main cause of cervical cancer, the second most common cancer in women worldwide [ , ] . there are more than identified types of hpv and the genital hpv types are subdivided into high and low risk types: low risk hpv types are responsible for most common sexually transmitted viral infections while high risk hpv types are a crucial etiological factor for the development of cervical cancer [ ] . we applied the proposed probe design method on the hpv sequences downloaded from lanl with their classification found in table of [ ] . the selected probes were used to form a decision rule by ( ) and tested for their classification capability. results from this polyspecific probe selection experiment are provided in table .
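the standard lad classification rule described above, a weighted vote of the covering patterns with the sign of Δ deciding the class and a zero value left unclassified, can be sketched as follows. the representation of a pattern as a (literals, weight) pair is an assumption for illustration:

```python
def lad_decision(row, pos_patterns, neg_patterns):
    """Standard LAD discriminant: weighted vote of covering patterns.
    Each pattern is a (literals_dict, weight) pair, the weight being
    the number of training sequences of its class that it covers.
    Returns '+', '-', or '?' (unclassified) when the vote ties."""
    def covers(lits, r):
        return all(r[j] == v for j, v in lits.items())
    delta = (sum(w for lits, w in pos_patterns if covers(lits, row))
             - sum(w for lits, w in neg_patterns if covers(lits, row)))
    if delta > 0:
        return '+'
    if delta < 0:
        return '-'
    return '?'
```

weighting by training coverage lets frequently supported patterns dominate rarely supported ones, which is the usual normalization in lad discriminants.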
in this table and also in the table found in the following subsection, the target (+) and the non-target (−) virus types of the experiments are first specified. the tables then provide two pieces of information on the candidate oligos, namely, the length l and the average and standard deviation of the number of features generated and used in the runs of each experiment for data binarization and for pattern generation. provided next in the tables is information on the number of probes selected, in the format 'average ± standard deviation', and information on the lad patterns generated. finally, the testing performance of the selected probes is provided in the last column of the tables, summarized in the format 'average ± standard deviation' of the percentage of correct classifications of the unseen sequences. briefly summarizing, the proposed probe design method selected probes on the hpv data in a few cpu seconds that tested . % accurate in classifying the unseen hpv samples. for comparison, the same hpv dataset was used in [ ] and [ ] for the classification of hpv by high and low risk types. in brief, the probe design methods of [ ] and [ ] required several cpu hours of computation and selected probes that obtained . % and . % correct classification rates, respectively. before moving on, we note that the sequences belonging to the target and the non-target groups in this experiment all have different hpv subtypes (see table in [ ] ). the combination of all target and non-target sequences being different from one another and the presence of noise in the data (the classification errors) led to the selection of a relatively large number of polyspecific probes in this experiment. the proposed probe design method was tested on genomic viral sequences from ncbi for selecting monospecific and polyspecific probes for screening for sars and ai in a number of different binary and multicategory experimental settings and performed superbly on all counts.
we describe individual experiments below and summarize results from these experiments in table . (table footnotes: *: in format average ± standard deviation; †: percentage of correct classifications of testing/unseen data.) experiment . sars virus vs. coronavirus. sars virus is phylogenetically most closely related to group coronavirus [ ] . sars sequences and coronavirus samples were used to select a monospecific probe for screening for sars. used in a classification rule ( ), the sars probe and one probe selected for coronavirus together perfectly classified all testing sequences. this experiment simulates a sars pandemic where suspected patients with sars-like symptoms are screened for the disease. we used the sars virus sequences and samples of other influenza virus types (the 'other virus' in table ) in this experiment and selected polyspecific probes. used in a classification rule ( ), these probes collectively gave a perfect classification of all testing sequences. experiment . classification of lethal ai virus h & h and other influenza virus h subtypes. ai virus h and h subtypes cause a most fatal form of the disease [ ] , and they were separated from the other h subtypes of influenza virus in this experiment. h and h target sequences and other h subtype sequences were used to select polyspecific probes for detecting ai virus h and h subtypes from the rest. in a classification rule ( ), the selected probes collectively classified all testing sequences correctly. the statement "monospecific neuraminidase (na) subtype probes were insufficiently diverse to allow confident na subtype assignment" from [ ] motivated us to design this experiment on multicategory and monospecific classification of influenza virus by n subtypes. we used the three influenza virus n subtypes with or more samples in table and selected monospecific probes for their classification. tested in a classification rule ( ), the selected probes performed perfectly in classifying all testing sequences.
note that only a small number of monospecific probes were selected and proved 'needed' in this experiment.

references:
trends in microarray analysis
genetic mining of dna sequence structures for effective classification of the risk types of human papillomavirus (hpv)
discovery and analysis of inflammatory disease-related genes using cdna microarrays
possibility of using dna chip technology for diagnosis of human papillomavirus
classification of multiple cancer types by multicategory support vector machines using gene expression data
molecular detection and identification of influenza viruses by oligonucleotide microarray hybridization
dna-chip technology and infectious diseases
microarray-based detection and genotyping of viral pathogens
selection of optimal dna oligos for gene expression arrays
probe selection algorithms with applications in the analysis of microbial communities
fast large scale oligonucleotide selection using the longest common factor approach
optimal robust non-unique probe selection using integer linear programming
an implementation of logical analysis of data
on the complexity of polyhedral separability
milp approach to pattern generation in logical analysis of data
basic local alignment search tool
selection of oligonucleotide probes for protein coding sequences
support vector networks
pattern recognition techniques. crane
statistical learning theory
partially defined boolean functions and cause-effect relationships
a heuristic method for the set covering problem
integer and combinatorial optimization
for the international agency for research on cancer multicenter cervical cancer study group: epidemiologic classification of human papillomavirus types associated with cervical cancer
the causal relation between human papillomavirus and cervical cancer
the role of human papillomavirus in screening for cervical cancer
classification of the risk types of human papillomavirus by decision trees
unique and conserved features of genome and proteome of sars-coronavirus, an early split-off from the coronavirus group lineage
transmission of h n avian influenza a virus to human beings during a large outbreak in commercial poultry farms in the netherlands

key: cord- -bcuec fz
authors: matson, david o.
title: iv, . calicivirus rna recombination
date: - -
journal: perspect med virol
doi: . /s - ( ) -
sha: doc_id: cord_uid: bcuec fz
rna recombination apparently contributed to the evolution of cvs. nucleic acid sequence homology or identity and similar rna secondary structure of cvs and non-cvs may provide a locus for recombination within cvs or with non-cvs should co-infections of the same cell occur. natural recombinants have been demonstrated among other enteric viruses, including picornaviridae (kirkegaard and baltimore, ; furione et al., ), astroviridae (walter et al., ), and possibly rotaviruses (e.g., desselberger, ; suzuki et al., ), augmenting the natural diversity of these pathogens and complicating viral gastroenteritis prevention strategies based upon traditional vaccines. such is the case for cvs and astroviridae, whose recombinant strains may be a common portion of naturally circulating strains.
the taxonomic — and perhaps biologic — limits of recombination are defined by the suggested recombination of nanovirus and cv, viruses from hosts of different biologic orders; the relationship of picornaviruses and cvs, viruses in different families, as recombination partners; and the intra-generic recombination between different clades of nlvs. in this review, i will discuss evidence for the occurrence of rna recombination in caliciviridae, both within and outside the family. constraints on recombination provided by the genomic diversity of caliciviruses (cvs), as well as implications of recombination on the natural diversity of cv strains and the clinical and biologic significance of rna recombination, also will be considered. first, i will review some features of cvs that affect understanding of recombination. the cv genome is a positive-sense, single-stranded, polyadenylated rna molecule of about nucleotides in length. cvs fall into four genera that differ in their genomic organization (green et al., a) (fig. ) . norwalk-like viruses (nlvs) have three open reading frames (orfs). orf encodes a polyprotein cleaved during replication into a set of nonstructural proteins, orf encodes the capsid protein, and orf encodes a protein that appears to be a minor structural protein . where studied, cvs have been shown to synthesize a positive-sense subgenomic rna that begins at the ' of the capsid gene and that is co-terminus with the genome (meyers et al., ; neill and mengeling, ; sosnovtser and green, section iv, chapter of this book). vesiviruses differ from nlvs in having a longer genome that in some vesiviruses (e.g., pan-i ; rinehart-kim et al., ) , but not others (e.g, feline cv; carter et al., ) , includes a longer orf with an additional predicted protein at its n-terminus. orf of vesiviruses is longer than that of nlvs, with the extra nucleotides at the ' end of orf . 
this extra sequence encodes a protein fragment that must be post-translationally cleaved to agree with experimental data on vesivirus virion structure (prasad et al., ) . orf of vesiviruses is about one-half the size of that of nlvs (~ amino acids vs. - amino acids, respectively). in lagoviruses and sapporo-like viruses (slvs), the genes that are in orf and orf of nlvs and vesiviruses are fused into one longer orf . a gene comparable to orf of nlvs also is present. an orf in another frame at the ' end of the capsid gene occurs among slvs, but not in all slv strains (liu et al., ; jiang et al., ) . the antigenic determinants (neutralization epitopes) that induce immunity against cvs presumably are located on the surface of the virion capsid. this capsid is composed of copies of the capsid gene product, paired into dimers (prasad et al., ; prasad et al., ) . despite the existence of just one capsid protein, cvs exhibit extensive antigenic diversity. in the best-characterized genus, vesivirus, at least distinct serotypes (neutralization types) exist, not including feline cvs and closely related strains, which among themselves are so diverse antigenically that definition of serotypes has been problematic (lauritzen et al., ; hohdatsu et al., ; smith, ) . the distinct vesivirus serotypes are certainly determined by differences in the nucleotide sequence of the capsid gene, resulting in differences in surface epitopes. the nucleotide differences sufficient to change the serotype are unknown, but are likely to occur in a few distinct regions of the capsid gene (neill, ; rinehart-kim et al., ; neill et al., ).
it is clear that such clades are related to differences in capsid gene sequences; sequence differences are less marked in the rna polymerase gene: when rna polymerase region sequences are analyzed in phylogenetic analyses, statistically significant differences similar to those observed among capsid gene sequences do not occur . it is possible that separate capsid sequence clades within a genus indicate separate serotypes, but, even for vesivirus capsid sequences, an insufficient number of strains have been analyzed to associate specific sequence differences with differences in serotype. with the description of statistically significant phylogenetic clades within cv genera, data were available to recognize strains that might be natural recombinants within cvs. two examples are the well-characterized argentine strain (arg ) and snow mountain virus (smv), one of the prototype cvs, recognized to be recombinants when the rna polymerase and capsid regions of these strains were characterized (hardy et al., ; jiang et al., ) (fig. ) . at the time of publication, recombination was more certain for arg , because the sequence was derived from a single cdna insert spanning the ~ . kb at the ' end of the genome, including the end of orf and all of orf , orf , and the ' non-coding region. in arg , the change of relative sequence identity occurred at the orf /capsid gene junction, indicating that the recombination occurred there. this site also was suggested (see below) to be the break-and-rejoin site for recombination between cvs and picornaviruses. for arg , the orf sequence was closest to that of lordsdale virus, among sequenced nlvs, and the capsid and orf sequences were closest to those of mexico virus. a similar change of relative sequence identity also occurred in smv, when partial polymerase and capsid sequences were compared to reference mexico and melksham viruses. 
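the change of relative sequence identity that revealed arg as a recombinant can be visualized with a sliding-window identity scan, in the style of similarity-plot analyses. this sketch is illustrative, not from the chapter; it assumes the query and each candidate parent are pre-aligned and of equal length:

```python
def windowed_identity(query, reference, window=200, step=50):
    """Per-window percent identity between a query and a reference
    (sequences assumed pre-aligned, same length). A recombinant shows
    identity to one parent falling, and identity to the other rising,
    at the crossover point (here, the orf/capsid gene junction)."""
    out = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = reference[start:start + window]
        matches = sum(a == b for a, b in zip(q, r))
        out.append((start, 100.0 * matches / window))
    return out
```

running the scan of a suspected recombinant against each putative parent and plotting the two identity curves makes the breakpoint visible as the position where the curves cross.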
while smv was likely also to be a recombinant virus, the capsid and rna polymerase region amplicons of smv were generated separately, a fact that did not exclude the possibility of different source strains. xi jiang has confirmed the recombinant status of smv by sequencing a single cdna derived from a single rt-pcr amplicon (x jiang, personal communication). generation of recombinants within cvs requires biologic and molecular attributes of cvs. outbreaks caused by multiple cv strains and co-infection by different hucv strains occur (matson et al., ; gray et al., ; reuter et al., ) . infection of single cells simultaneously by two cvs implies absence of immune or molecular interference.

fig. legend (displaced fragment): ...and of nt near the ' end of that strain's capsid gene (id="b" sequence for this fig.). each a or b sequence is from a cv with a known complete genome sequence, is on a single line, and is repeated in two columns. in the left-hand column, each a or b sequence is compared with the first nt of the norwalk virus genome, i.e., the norwalk virus a sequence (jiang et al., ). in the second column, the a or b sequence is compared with the first nt (a sequence) of a prototype sequence for that genus. within a genus, the a sequences are listed first and the b sequences given next. "-" indicates the nt at that site is identical to that in the comparison sequence. for ebhsv, the "*" indicates a residue i inserted into the ' noncoding region.

cv rna also must have an attribute that permits/favors recombination, such as a site where errors in procession of rna polymerase can occur. the subgenomic rna is the most likely molecule to participate in recombination as noted above. the highly conserved ' end sequence of the genome and at the ' end of the capsid gene in nlvs is an obvious common target for cv rna polymerase, for genomic and subgenomic rna synthesis. the sequence data indicated that recombination in strain arg occurred at the orf /capsid gene junction where high sequence identity exists between the putative parent clades. the genomic sequence of nlv strains has been determined. a comparison of the sequence at the ' end of the genome with sequence near the ' end of orf shows a high degree of sequence identity (fig. ) . no other region of an nlv strain genome shares this degree of identity, even closely, with the ' end of the genome. in addition, sequence identity comparable to that shown between the nlv ' end and near the ' end of the nlv capsid may occur among cvs of a single genus from a single host, once enough strains are sequenced. the "copy choice" model has been preferred for recombination of single-stranded rna viruses, including picornaviruses and coronaviruses (kirkegaard and baltimore, ; makino et al., ; lai and cavanagh, ; nagy and simon, ) . in the copy-choice model, recombination occurs during rna replication when the viral rna polymerase switches templates from the rna derived from one strain (donor template) to the rna derived from a second strain (acceptor template), at a highly conserved genome region, without releasing the nascent strand (lai and cavanagh, ) . models for rna virus recombination have utilized two terminologies to describe the degree that features of the donor and acceptor templates are shared: homologous, aberrant homologous, and non-homologous types (lai and cavanagh, ) or sequence similarity-essential, similarity-assisted, and similarity-nonessential (nagy and simon, ) . the putative parent clades of intrageneric cv recombinants have a long region of identical sequence and predicted stable hairpin structures at the proposed recombination site, which supports the classification of these recombinants as homologous or (at least) similarity-assisted.
interaction like that between genomic and subgenomic cv rnas could occur by the same mechanisms for two genomic ' ends, but the outcome of such recombination events would be hard to predict. furthermore, if a virion contains genomic and subgenomic rnas, recombination could occur in a generation after initial co-infection. evidence for recombination of cvs depends upon sequence comparisons. upon sequencing a portion of a feline cv strain f , neill ( ) observed that (what later was designated) orf contained significant sequence identity with picornaviruses. this significant identity was concentrated around certain amino acid motifs within orf that are homologous to those within the non-structural region of picornaviruses, encoding, in order, c, c, and d genes. the order of these motifs and the approximate number of nucleotides between them were the same in both virus families (fig. ) . the capsid gene of cvs also is homologous to the vp to vp capsid proteins of picornaviruses to the extent of a shared ppg amino acid motif in a relatively conserved ' portion of the capsid gene(s), formation of capsomeres having polypeptide β-pleated sheets as a core structural element, and formation of a spherical virion capsid by the protein(s) (prasad et al., ) . these findings led to the hypothesis that at some point in time cvs and picornaviruses were/are "recombination partners" (dinulos and matson, ) .

fig. legend (displaced fragment): ...), which lies ' to the nonstructural genes, and is marked by a "ppg" motif that signals a relatively conserved region between the families. in picornaviruses, this order of nonstructural-structural "gene cassettes" is reversed, with the order of motifs within the nonstructural peptide the same, and about the same number of nucleotides from each other in the genome. (box shadings as in fig. .)
in a recent report, gibbs and weiller ( ) suggested from sequence analyses that cvs (rna genome, mostly in vertebrates) may have recombined with a nanovirus (ss dna genome, plant virus) to generate (a) circovirus(es) (fig. ; gibbs and weiller, ) . that this set of steps occurred is suggested by significant sequence identities of two regions of circoviruses, one including the rep (replication initiation) gene of nanoviruses and the other c-like sequences closest to those of cvs. however, a reverse-transcriptase initiation site is not known in the cv genome. the possibility that recombination occurred in invertebrates is not excluded, given the existence of viruses in insects with close sequence identity to cvs (govan et al., ) . the differences in genome organization among cv genera imply different constraints on how rna recombination might have occurred. for example, if the different genera are derived from a single "parental" genomic structure, different events must have occurred to generate the diversity of genome structures exhibited by the different cv genera--even within genera--for some genes are absent and others present. alternatively, if, as discussed above, cvs are "recombination partners" with (an)other virus family(ies), then "convergent evolution" might explain the shared genomic features of cvs, despite multiple "parental" genomic structures. recombinants extend our knowledge of the genetic diversity within cvs. they also place constraints on methods that "genotype" cvs. if the rna recombination of arg and smv is a common phenomenon, genotyping would be more difficult. for example, many reports of cv genotyping have been based upon the sequence of the rna polymerase region, due to its relatively high sequence conservation and the relative ease of designing rt-pcr primers. in contrast, fewer capsid genes have been characterized (see also jiang, section iv, chapter of this book).
the viral capsid protein is responsible for virion antigenicity and probably for inducing immunity. genotyping of cvs based upon rna polymerase sequences clearly is not the best choice if recombination at the orf /capsid gene junction is common. in addition, it remains unclear whether additional recombination sites exist. recombination in nlvs at the orf -orf junction has been described upon the characterization of this genomic region for relatively few strains. thus, one might discover other types of recombinants as more strains are characterized. one recombinant nlv, arg , was first recovered from ill children and adults in argentina, the united states and the netherlands (jiang et al., ; m koopmans, personal communication) . smv (morens et al., ) and many very similar strains have been recovered from outbreaks of gastroenteritis worldwide. many smv-like nlvs have been characterized at the genomic level only in the rna polymerase genome region. each of these strains is a potential smv-like recombinant, like the prototype, awaiting sufficient characterization of the capsid sequence to draw this conclusion. the possible widespread occurrence of recombinants in symptomatic persons suggests their ready infectivity in the host(s) and their easy transmissibility, and furthermore that recombination does not necessarily ablate virulence and that recombinants are genetically and ecologically stable. perhaps the most striking feature of arg and smv is that they and their associated illness were otherwise unremarkable. their recombinant status was recognized only because their genomes were initially characterized in both the rna polymerase and capsid regions. also, the two potential parental strains for each of arg and smv are within the range of genetic diversity of strains currently co-circulating. therefore, the recombination event could have occurred recently, but not necessarily during the infection of the child from whom arg was recovered.
on the other hand, it would not be difficult to imagine that many cv strains currently co-circulating could have derived from remote recombination events in the past. recombination may permit cvs to escape host immunity quickly, analogous to antigenic shifts in influenza viruses, but by a different molecular mechanism. recombination during calicivirus replication may be common or rare; it is possible to envision the generation of many non-viable or attenuated recombinants. the orf polyprotein genes could persist in a virus with a new capsid selected for by the host's immunity. viable recombinants could be a model for laboratory manipulation of capsids (e.g., neill et al., ) , including study of packaging constraints and antigenicity. in the two natural recombinants described above, orf , encoding a minor structural protein, segregated with the capsid gene in the recombinants. whether recombination can occur with an orf derived from another strain is unknown. rna recombination apparently contributed to the evolution of cvs. nucleic acid sequence homology or identity and similar rna secondary structure of cvs and non-cvs may provide a locus for recombination within cvs or with non-cvs should co-infections of the same cell occur. natural recombinants have been demonstrated among other enteric viruses, including picornaviridae (kirkegaard and baltimore, ; furione et al., ) , astroviridae (walter et al., ) , and possibly rotaviruses (e.g., desselberger, ; suzuki et al., ) , augmenting the natural diversity of these pathogens and complicating viral gastroenteritis prevention strategies based upon traditional vaccines. such is the case for cvs and astroviridae, whose recombinant strains may be a common portion of naturally circulating strains.
the taxonomic (and perhaps biologic) limits of recombination are defined by the suggested recombination of nanovirus and cv, viruses from hosts of different biologic orders; the relationship of picornaviruses and cvs, viruses in different families, as recombination partners; and the intra-generic recombination between different clades of nlvs.

references:
- a phylogenetic analysis of the caliciviruses
- the complete nucleotide sequence of a feline calicivirus
- genome rearrangements of rotaviruses
- recent developments with human caliciviruses
- polioviruses with natural recombinant genomes isolated from vaccine-associated paralytic poliomyelitis
- evidence that a plant virus switched hosts to infect a vertebrate and then recombined with a vertebrate-infecting virus
- analysis of the complete genome sequence of acute bee paralysis virus shows that it belongs to the novel group of insect-infecting rna viruses
- mixed genogroup srsv infection among a party of canoeists exposed to contaminated recreational water
- capsid protein diversity among norwalk-like viruses
- caliciviridae
- human calicivirus genogroup ii capsid sequence diversity revealed by analyses of the prototype snow mountain agent
- neutralizing features of commercially available feline calicivirus (fcv) vaccine immune sera against fcv field isolates
- sapporo-like human caliciviruses are genetically and antigenically diverse
- sequence and genomic organization of norwalk virus
- characterization of a novel human calicivirus that may be a naturally occurring recombinant
- diagnosis of human caliciviruses by use of enzyme immunoassays
- the mechanism of rna recombination in poliovirus
- the molecular biology of coronaviruses
- serological analysis of feline calicivirus isolates from the united states and united kingdom
- molecular characterization of a bovine enteric calicivirus: relationship to the norwalk-like viruses
- high-frequency rna recombination of murine coronaviruses
- enteric viral pathogens as causes of outbreaks of diarrhea among children attending day care centers during one year of observation
- a waterborne outbreak of gastroenteritis with secondary person-to-person spread: association with a viral agent
- genomic and subgenomic rnas of rabbit hemorrhagic disease virus are both protein-linked and packaged into particles
- new insights into the mechanisms of rna recombination
- nucleotide sequence of a region of the feline calicivirus genome which encodes picornavirus-like rna-dependent rna polymerase, cysteine protease and c polypeptides
- nucleotide sequence of the capsid protein gene of two serotypes of san miguel sea lion virus: identification of conserved and non-conserved amino acid sequences among calicivirus capsid proteins
- further characterization of the virus-specific rnas in feline calicivirus infected cells
- recovery and altered neutralization specificities of chimeric viruses containing capsid protein domain exchanges from antigenically distinct strains of feline calicivirus
- x-ray crystallographic structure of the norwalk virus capsid
- three-dimensional structure of the primate calicivirus
- molecular epidemiology of human calicivirus gastroenteritis outbreaks in hungary
- complete nucleotide sequence and genomic organization of a primate calicivirus
- virus cycles in aquatic mammals, poikilotherms, and invertebrates
- identification and genomic mapping of the orf and vpg proteins in feline calicivirus virions
- intragenic recombinations in rotaviruses
- molecular characterization of a novel recombinant strain of human astrovirus associated with gastroenteritis in children
- molecular mechanisms of variation in influenza viruses

acknowledgements: thank you to my friend xi jiang for a special -year collaboration. i thank tamas berke for continued insights.

key: cord- -n y d authors: zhang, feiyun; toriyama, shigemitsu; takahashi, mami title: complete nucleotide sequence of ryegrass mottle virus: a new species of the genus sobemovirus date: journal: j doi: .
/pl sha: doc_id: cord_uid: n y d
the genome of ryegrass mottle virus (rgmov) comprises nucleotides. the genomic rna contains four open reading frames (orfs). the largest orf encodes a polyprotein of amino acids ( . kda), which codes for a serine protease and an rna-dependent rna polymerase. the viral coat protein is encoded on the orf present at the 3′-proximal region. the other orfs encode the predicted . kda and . kda proteins of unknown function. the consensus signal for frameshifting, the heptanucleotide uuuaaac and a stem-loop structure just downstream, is in front of the aug codon of the orf . analysis of the in vitro translation products of rgmov rna suggests that the kda protein may represent a fusion protein of orf -orf produced by frameshifting. the protease region of the polyprotein and the coat protein have a low similarity with those of the sobemoviruses (approximately % amino acid identity), while the rna-dependent rna polymerase region has a particularly strong similarity ( to % identity over more than amino acid residues). the sequence similarities of rgmov to the sobemoviruses, together with the characteristic genome organization, indicate that rgmov is a new species of the genus sobemovirus. ryegrass mottle virus (rgmov) was first isolated from stunted italian ryegrass (lolium multiflorum) and cocksfoot (dactylis glomerata) having mottling and necrotic symptoms on leaves ). the isometric particle, nm in diameter, contains a species of single-stranded rna with a molecular weight of . × 10 . the physical properties and some biological ones of the virus are similar to cocksfoot mottle virus (cfmv), which is prevalent in cocksfoot pastures in japan ). however, rgmov is serologically distinct from cfmv, cocksfoot mild mosaic virus, cynosurus mottle virus and phleum mottle virus, which occur in european countries ).
last year, an isometric virus isolated from italian ryegrass in germany was found to be serologically related to rgmov; in agar gel double diffusion tests, a spur formed between rgmov and the german isolate (frank rabenstein, germany; personal communication). in spite of the serological differences between rgmov and sobemoviruses ), the general properties of rgmov are similar to those of grass viruses that belong to the sobemoviruses ). the genome sequence of sobemoviruses has been determined for southern bean mosaic virus (sbmv) , ), cfmv ), rice yellow mottle virus (rymv) ) and lucerne transient streak virus (ltsv, accession number u ). the genomic rna of sobemoviruses is a single-stranded molecule, approximately to nucleotides (nt) in size. the 5′ terminus has a genome-linked viral protein (vpg) and the 3′ end does not have a poly(a) tail. the genome encodes four orfs: the largest orf encodes the polyprotein of approximately kda, which contains protease and rna polymerase motifs. only the polyprotein of cfmv is encoded by two smaller overlapping orfs, by −1 frameshifting ). recently, we determined the complete nucleotide sequence of the japanese isolate of cfmv (cfmv/jp) (zhang and toriyama, unpublished data; accession number ab ). the nucleotide sequence is . % identical to the norwegian isolate of cfmv (cfmv/no) ) and . % identical to the russian isolate ). its genome organization is identical to that of cfmv. so far, the genome sequences of rgmov and the german isolate have not been determined, so the genus is still unknown. in this paper, we report the complete nucleotide sequence of rgmov and compare it to that of the sobemoviruses. ryegrass mottle virus (rgmov) was propagated in barley plants (cv. shunsei) and purified as described previously ). a purified preparation of cfmv/jp was stored at − °c ) and used for the in vitro translation experiment. in a preliminary experiment, we found that rgmov rna does not have a poly(a) tail at the 3′-terminus.
thus, we determined the 3′-terminal sequence by two-dimensional mobility shift analysis as described previously ). in this experiment, the homomix (alkaline-digested yeast rna mixture) was prepared by using rna from torula utilis, a product of fluka (riedel-de haen; seelze, germany). the 5′-terminal sequence of rgmov rna was identified by sequencing the pcr clones amplified by using the 5′ race abridged anchor primer system (gibco brl, gaithersburg, usa). cdna synthesis was done as described previously ) using m-mlv reverse transcriptase (gibco brl), a random hexanucleotide primer and a synthetic oligonucleotide primer (pl), 5′-actagtcgacacgaaaacccc-3′; the underlined sequence at the 5′ end was analyzed by two-dimensional sequence analysis. the synthesized second-strand cdna was blunt-ended with t dna polymerase and ligated into smai-digested puc . recombinant plasmids were transformed into competent escherichia coli dh a (toyobo, osaka, japan). the cdna clones shown in fig. were made by primer extension and pcr amplification and used for sequencing of rgmov rna. the ambiguous nucleotide sequence was confirmed by using pcr clones prepared independently (not shown in fig. ). nucleotide sequences were determined using the pharmacia dna sequencing kit and an alfred dna sequencer (pharmacia, uppsala, sweden). the sequence data were assembled and analyzed using the dnasis (macintosh) program (hitachi software engineering co., yokohama, japan). the genbank/embl, nbrf and pir databases were searched for nucleic acid and amino acid sequence identity.
cell-free translation
in vitro translation using wheat germ extract (promega, madison, usa) was performed as described in the manufacturer's manual in a final volume of µl in the presence of redivue l-[35s]methionine (amersham pharmacia biotech, buckinghamshire, uk) for hr at °c.
translation products were separated by sds-page ( % polyacrylamide) and detected using a molecular imager system (biorad, richmond, usa). a set of prestained sds-page standards (biorad) was used as protein size markers. purified rgmov was electrophoresed on % polyacrylamide-sds gels and electro-blotted onto a pvdf membrane (immobilon-psq; millipore, middlesex, uk). the portion corresponding to the coat protein on the pvdf membrane was excised, and the n-terminal sequence of the coat protein was analyzed using a gas-phase protein sequencer (model a/ a, applied biosystems, foster city, usa).
table footnotes: a) references of sequence data: sbmv (m ), ltsv (u ), rymv (l ) and cfmv ( ). b) the percentage values indicate the identity over the stretch of amino acid residues indicated in parentheses. c) this similarity was found between the n-terminal region of the . k orf of cfmv (refer to fig. ).
nucleotide sequence and genome organization
the complete nucleotide sequence of rgmov comprises nt with a base composition of . % a, . % u, . % c and . % g. the g+c content is . %. the sequence contains four major orfs flanked by 5′- and 3′-untranslated sequences of and nt, respectively. database searches indicated that the genome sequence of rgmov is significantly similar to that of the sobemoviruses, for which the genome organization is summarized in fig. . as shown in fig. , the largest orf extends from nucleotides to . the predicted . kda protein consists of amino acids. database searches revealed a significant similarity to the polyproteins of sobemoviruses: sbmv (accession number m ) ), rymv (l ) ), cfmv ( ) ) and ltsv (u ). the polyprotein of rgmov contains serine and p c proteases , ) and an rna-dependent rna polymerase ) (fig. ). a conserved sequence, gxpxfdpxyg ), is found in the n-terminal region (amino acids to ) of the . kda polyprotein.
the protease motif appears immediately downstream of the conserved sequence: the serine protease, in amino acids to from the n-terminus, and the p c protease, in amino acids to (fig. ). the serine protease motif is well conserved between rgmov, sobemoviruses and poliovirus , ). in addition, the p c protease motif, ...xgxs*/c*gxxxxxxxxgxxxxgxh*... (the catalytic amino acid residues are marked with asterisks), is present just downstream. however, instead of serine (s*) or cysteine (c*), alanine is found in rgmov. thus, it is uncertain whether the p c protease domain is catalytic in rgmov or not. the rna-dependent rna polymerase is encoded near the c-terminal region of the polyprotein. this region showed very strong similarity, to % identity over a amino acid stretch (table ). the rna polymerase motifs ) are distributed between amino acids to . the sequence of this domain is particularly conserved, with approximately % identity between rgmov and the sobemoviruses. database searches also showed that the sequence of the rgmov polymerase is highly conserved relative to the rna polymerases of beet mild yellowing virus (s ), cucurbit aphid-borne yellowing virus (x ), potato leaf roll virus (x ) and barley yellow dwarf virus (l ) of the family luteoviridae. the similarity is approximately % identity over a amino acid stretch, suggesting an evolutionarily close relationship between rgmov, sobemovirus and luteovirus (subgroup ) ). van der wilk et al. ) found that the vpg of sbmv is encoded by orf , downstream of the protease domain and in front of the rna polymerase. we compared the amino acid sequence similarity between the vpg region of sbmv orf and the corresponding region of rgmov orf . the search revealed no significant similarity. sequence diversities in the vpg region ) are also seen between sbmv, cfmv and rymv.
however, the conserved sequence, wag plus an e/d-rich sequence, is detected in the region, and putative e/s cleavage sites are present on both sides of the region: proteolytic cleavage would result in a protein of kda. possibly, the vpg of rgmov is located between the protease and the rna-dependent rna polymerase domains in the same order as in the sbmv orf ) (fig. ). rgmov orf is completely within orf . the predicted . kda protein has a distinct similarity, % identity, to the corresponding orfs of sbmv and ltsv. however, it is unknown whether the . kda protein is independently translated in vivo, because orf may be expressed as a fusion protein, as will be discussed. orf comprises amino acids encoding a . kda coat protein. the amino acid sequence of the n-terminus of the viral coat protein was identical to that deduced from the orf nucleotide sequence (data not shown). sequence similarity searches indicated that the rgmov coat protein has a weak but significant similarity, to % identity, with that of sbmv, ltsv and rymv, but only % identity with cfmv (table ). in the wheat germ extract system, rgmov rna directs the synthesis of two products of kda and kda, but no other distinct product was detected. in contrast, the translational products synthesized in vitro with cfmv/jp rna are four major proteins with sizes almost identical to those previously reported for cfmv/no ) (fig. ). the translational activity of cfmv/jp rna was low in our present system, as reported for other sobemoviruses ). rgmov rna is a poorer message in our wheat germ extract system. the largest product of rgmov rna was kda and seems to be derived from the largest orf for the polyprotein. in the rgmov rna sequence, no orf corresponds to the second largest product of kda. the putative replicase of cfmv is translated as part of a single polyprotein by −1 ribosomal frameshifting between two overlapping orfs having coding capacities for . kda and . kda proteins ).
translational frameshifts are known in the coronavirus ibv ), the polymerase genes of retroviruses , ) and plant viruses ). as consensus signals for frameshifting, the heptanucleotide sequence (e.g., the uuuaaac sequence) and a stem-loop structure immediately downstream have been proposed by jacks et al. ). as found in cfmv, sbmv and rymv ), identical signals are found in rgmov rna just preceding the initiation codon of orf (fig. ). tamm et al. ) proposed a possible mechanism whereby the kda in vitro translation product of sbmv and rymv rnas may represent the orf -orf transframe fusion protein. thus, the kda translational product of rgmov rna is probably derived from −1 ribosomal frameshifting (fig. ), not from proteolytic cleavage of the polyprotein ). in this experiment, we tried to detect the rgmov coat protein in the in vitro translation products by immunoprecipitation. however, we could not detect any signal for the coat protein. the coat protein of sbmv is translated only from a smaller, subgenomic rna, which is detected in virus-infected tissues as well as virus particles ). as smaller rnas were not detectable in our rgmov rna preparation, the amount of subgenomic rna, if any, may have been insufficient for the detection of the in vitro translated coat protein. we conclude that rgmov is a member of the genus sobemovirus based on sequence similarities. the similarity level of nucleic acid (approximately % identity) and protein (table ) is low enough for virus species demarcation between any species of sobemoviruses, whereas the genome organization of rgmov is closely related among the sobemoviruses. the biological and serological properties of rgmov are distinct from those of other characterized grass viruses ). thus, rgmov is a unique species of the genus sobemovirus , ).
the polyprotein gene organization of rgmov is the same as that of sbmv, rymv and ltsv, but different from that of cfmv, for which the polyprotein is produced as a single fusion protein by the frameshifting of two orfs ).

references:
- expression of rice yellow mottle virus p protein in vitro and in vivo and its involvement in virus spread
- an efficient ribosomal frame-shifting signal in the polymerase-encoding region of the coronavirus ibv
- sobemovirus genome appears to encode a serine protease related to cysteine proteases of picornaviruses
- genus sobemovirus
- signals for ribosomal frameshifting in the rous sarcoma virus gag-pol region
- characterization of ribosomal frameshift in hiv- gag-pol expression
- the putative replicase of the cocksfoot mottle sobemovirus is translated as a part of the polyprotein by −1 ribosomal frameshift
- sequence and organization of barley yellow dwarf virus genomic rna
- luteovirus gene expression
- genome characterization of rice yellow mottle virus rna
- nucleotide sequence of the bean strain of southern bean mosaic virus
- identification of four conserved motifs among the rna-dependent polymerases encoding elements
- messenger rna for the coat protein of southern bean mosaic virus
- nucleotide sequence of rna from the sobemovirus found in infected cocksfoot shows a luteovirus-like arrangement of the putative replicase and protease genes
- translation of southern bean mosaic virus rna in wheat embryo and rabbit reticulocyte extracts
- complementarity between the '- and '-terminal sequences of rice stripe virus rnas
- identification of genes encoding for the cocksfoot mottle virus proteins
- cocksfoot mottle virus in japan
- ryegrass mottle virus, a new virus from lolium multiflorum in japan
- nucleotide sequence of rna , the largest genomic segment of rice stripe virus, the prototype of the tenuivirus
- the genome-linked protein (vpg) of southern bean mosaic virus is encoded by orf
- guidelines to the demarcation of virus species
- sequence and organization of southern bean mosaic virus genomic rna
- evolution of rna viruses

the nucleotide sequence data reported in this paper have been submitted to ddbj, embl and genbank under accession number ab . national institute of agro-environmental sciences, tsukuba - , japan. present address: tokyo university of agriculture and technology, united graduate school of agriculture, fuchu - , japan. present address: tokyo university of agriculture, sakuragaoka , setagaya-ku, tokyo - , japan. we wish to thank the late professor dr. d. hosokawa, tokyo university of agriculture and technology, for his encouragement and dr. t. teraoka for his help with the amino acid sequence analysis.

key: cord- - ncgldaq authors: elworth, r a leo; wang, qi; kota, pavan k; barberan, c j; coleman, benjamin; balaji, advait; gupta, gaurav; baraniuk, richard g; shrivastava, anshumali; treangen, todd j title: to petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics date: - - journal: nucleic acids res doi: . /nar/gkaa sha: doc_id: cord_uid: ncgldaq
as computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. in recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. for instance, sketching algorithms such as minhash have seen rapid and widespread adoption. these techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. we also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach.
we then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions. thanks to advances in sequencing technology, the amount of next-generation sequencing data for genomics has increased at an exponential pace over the last decade. while this explosion of data has yielded unprecedented opportunities to answer previously unanswered questions in biology, it also creates new challenges. for instance, a key challenge is designing new algorithms and data structures that are capable of handling analyses on such large and numerous datasets (table ). one approach for solving this big data problem is the development and adoption of probabilistic algorithms and data structures. when applying probabilistic methods to genomic analyses, input sequences are frequently decomposed into sets of overlapping subsequences of length k, referred to as k-mers. this large set of k-mers is then compressed into matrices using techniques from compressed sensing and sketching. genomic analyses such as clustering and taxonomic classification can be performed directly on the compact matrices (figure ). in this paper, we review the great strides that have already been made in these areas and look forward to future possibilities. many novel probabilistic and signal processing approaches for handling these massive amounts of genetic data have been previously reviewed ( ) ( ) ( ) ( ) ( ). for instance, in ( ) a comprehensive review was performed covering probabilistic algorithms and data structures such as minhash ( ) and locality sensitive hashing (lsh) ( ), count-min sketch (cms) ( ), hyperloglog ( ) and bloom filters ( ). this review includes extensive details of how these data structures work, the supporting theory behind each of them, as well as a brief discussion of their applications. however, the genomics applications for each approach were not thoroughly covered.
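the k-mer decomposition step described above is simple enough to sketch directly. the following is a minimal illustration in python; the toy sequence and the choice of k are arbitrary examples, not values from any particular tool:

```python
def kmers(seq, k):
    """Decompose a sequence into its overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# a short toy sequence; real reads are hundreds to thousands of bases
print(kmers("ATGGCGT", 3))  # ['ATG', 'TGG', 'GGC', 'GCG', 'CGT']
```

a sequence of length L yields L − k + 1 overlapping k-mers, so this collection grows linearly with the input; the sketching techniques reviewed below exist precisely to avoid storing it in full.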
other more biologically motivated reviews include a review of compressive algorithms in ( , ) and sketching approaches in ( ).

table . examples of large metagenomic sequencing projects and archives: metahit ( ); tara oceans ( ); terragenome ( ); jgi img ( ); human microbiome project ( ); the european nucleotide archive (ena) ( ); ncbi sequence read archive ( ).

in ( ), techniques are covered such as the burrows-wheeler transform (bwt) ( ), the fm-index ( ), and other techniques based around exploiting redundancy in large datasets. a more in-depth discussion of many of these topics can also be found in ( , ). another review ( ) includes a thorough treatment of compressed string indexes, lsh via sketches, cms, bloom filters, and minimizers ( ), with accompanying applications in genomics for each. while many techniques focus on efficient ways to represent a dataset, the compressed sensing (cs) technique from signal processing exploits the sparsity of signals for their efficient acquisition and interpretation. cs's measurement efficiency often translates to significant reductions in cost and time. cs has previously found biomedical applications in microscopy ( ) and rapid mri acquisition ( ). in this review, we summarize the essentials of cs, relate the technique to the other probabilistic data structures and algorithms, discuss relevant recent advances, and highlight corresponding applications in metagenomics. we direct interested readers to ( ) for further discussion of the core concepts of cs and to the seminal works of ( ) and ( ) for more thorough analyses. most recently, a comprehensive review of sketching algorithms in genomics was performed in ( ). this review covers approaches like minhash, bloom filters, cms, hyperloglog, the biological applications and implementations of each, and even includes a set of live, interactive notebooks with code examples of each approach.
given the wealth of previously performed reviews on these topics, we refer readers to the works above for more in-depth explanations of these approaches along with their applications, implementations, and theory. instead, we include only a brief review of these fundamental methodologies, followed by more recent advances in these areas, and finally their applications to metagenomics. previous studies have often neglected more novel applications in metagenomic data given the new challenges it poses. metagenome sequencing and analysis not only complicates established fundamental problems in comparative genomics but also adds entirely new problems. therefore, we focus on how the aforementioned techniques can overcome unique hurdles in metagenomics. recently, more attention has been given to the study of probabilistic algorithms ( ) as a means to circumvent the widening gap between the explosion of data and our computing capabilities. algorithms based on hashing and sketching ( ) ( ) ( ) ( ) ( ) ( ) have been extensively used in the theoretical computer science and database literature for reducing the computations associated with processing massive web-scale datasets ( ) ( ) ( ) ( ) ( ). hashing algorithms are typically associated with a random hash function that takes the input (usually the data vector) and outputs a discrete value. usually, this output serves as a (small-memory) fingerprint which, being discrete, can be used for 'smart' indexing. these indices are most notably used for sub-linear time near-neighbor searches ( , ). sketching algorithms work by creating a dynamic probabilistic data structure popularly known as a sketch ( ). the sketch is a small-memory summary of a given set of items, which typically requires logarithmic memory for summarizing them ( ). these sketches can support dynamic updates ( ) and the dynamic query operation, which returns an approximate estimate for a quantity of interest.
to begin, we perform a concise overview of core probabilistic data structures and algorithms (figure ). we then include a review of a wide array of more recent variations, extensions, and advancements of these fundamental methodologies. finally, we include a more in-depth discussion of promising applications to genomic and metagenomic data. ( ) locality sensitive hashing (lsh) was first introduced to solve the nearest neighbor search (nns) problem in high dimensions ( ). lsh functions are a subset of hash functions that seek to hash similar input values to the same hash values. essentially, for an lsh function f, if two input items x1 and x2 are very similar to each other, then applying the lsh function to both should cause them to collide (f(x1) = f(x2)) with high probability. the main idea behind efficient retrieval is to use f to structure the data as an efficient dictionary or hash table by indexing data point xi with key f(xi). given any query q, f(q) naturally becomes a favorable key for lookup. this is because any xj with the same key will have f(q) = f(xj), and hence, is likely to have high similarity with query q. ( ) minhash is arguably one of the most popular lsh functions for genomic and metagenomic data. minhash takes a set as input and outputs a set of integer hash values. specifically, minhash applies p different hash functions to each element in a set and returns the minimal hash value from each of the p hash functions as the sketch of the set. the probability that two sets have the same minimal hash value is equal to the percentage of common elements in the union of both sets. as a consequence, we can quickly approximate the similarity between two sets by simply computing the ratio of the number of minhash collisions between the sets to the total number of minhashes.
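the collision-counting estimate just described can be sketched in a few lines of python. this is an illustrative toy, not a production implementation like mash's: salted blake2b digests stand in for the p independent hash functions, and the set contents and number of hashes are arbitrary choices.

```python
import hashlib

def _h(item, salt):
    # deterministic 64-bit hash of `item` under an integer `salt`
    d = hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def minhash_sketch(items, num_hashes=256):
    # keep one minimum per (salted) hash function
    return [min(_h(x, salt) for x in items) for salt in range(num_hashes)]

def minhash_similarity(sk1, sk2):
    # fraction of colliding minima estimates the Jaccard index
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

a = {"ATG", "TGG", "GGC", "GCG", "CGT", "GTA"}
b = {"ATG", "TGG", "GGC", "GCA", "CAT", "ATA"}
true_jaccard = len(a & b) / len(a | b)   # 3 shared of 9 total = 1/3
estimate = minhash_similarity(minhash_sketch(a), minhash_sketch(b))
```

with 256 hash functions the estimate typically lands within a few percent of the true jaccard index, while each set, no matter how large, is summarized by only 256 integers.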
with minhash we can compute a small approximate summary of each set, referred to as a sketch, and then calculate the similarity of any two sets as the distance between their sketches. sequencing data are often conveniently represented as sets of tokens (or k-mers). as a result, minhash is frequently used to quickly compare the similarity between two large sequencing datasets by applying the p hash functions to their k-mers. ( ) minimizers are another widely used technique within the family of lsh algorithms to reduce the total number of k-mers for sequence comparison applications. a minimizer is a representative sequence of a group of adjacent k-mers in a string and can help memory efficiency by storing a single minimizer in lieu of a large number of highly similar k-mers. minimizers sample the sequence by choosing the smallest (lexicographically, for instance) k-mer within a sliding window. in figure , the minimizer portion demonstrates the sliding window that moves across the sequence, creating the set of minimizer k-mers for the sequence by taking the smallest k-mer within the window as it slides. the choice of the window length w and k-mer size k of the minimizers are parameters that can be adjusted for the application.

figure . overview of applying probabilistic data structures and compressed sensing in metagenomic sequence analysis. given a set of sequences, each sequence is usually first decomposed into a series of consecutive k-mers. then the probabilistic algorithm compresses the k-mers into sketches. the sketches can be analyzed to evaluate characteristics of the input sequences, such as sequence similarity. in compressed sensing (cs), the aggregate k-mer frequencies for the whole sample are treated as measurements. elements of a database (e.g. microbial genomes) have individual k-mer frequency distributions that are stored in columns of a matrix. cs finds the elements of the database that comprise the sample measurements.
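the window-sliding selection of minimizers described above can be sketched as follows. lexicographic ordering and the particular w and k values are arbitrary illustrative choices; practical tools usually rank k-mers by a hash value instead of alphabetically:

```python
def minimizers(seq, k, w):
    """Collect the lexicographically smallest k-mer from every
    window of w consecutive k-mers."""
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for i in range(len(kmer_list) - w + 1):
        selected.add(min(kmer_list[i:i + w]))
    return selected

seq = "ATTGCAAT"
print(sorted(minimizers(seq, k=3, w=4)))  # ['AAT', 'ATT', 'CAA']
```

here the eight-base toy sequence contains six 3-mers but only three minimizers; on long reads the reduction is far larger, since adjacent windows usually share the same minimum.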
several techniques employ hashing to compress the representation of a dataset. from these new representations, information can be rapidly queried. ( ) a bloom filter (bf) is a data structure that compresses a set while still being able to query if an element exists in the set. the sketch for a bf is a bit array of w bits. the bits are given an initial value of 0. to record an element into the sketch, p different hash functions are used to map the input element to p different positions in the array. after evaluating the hash functions, the bf sets the bits to 1 at all mapped positions. to search for an element, the query element is hashed by the same p hash functions. then, every bit that the hash values map to in the bf is checked. if any bit at the mapped locations is not equal to 1, the input element is definitely not in the set. if all the mapped bits are 1, the element is likely in the set; however, this result can also be caused by random hash collisions while inserting other elements. thus, the bf can have false positives. ultimately, bfs can quickly evaluate the presence of a given element using very little memory. ( ) hyperloglog is designed to estimate the number of distinct elements in a set using minimal memory. the essence of hyperloglog is to keep track of the maximum number of leading zeros in the binary representation of each element in the set. if the maximum number of leading zeros observed is n, a crude estimate for the number of distinct elements in the set is 2^n. this style of cardinality estimation only works for data distributed uniformly at random, so each element passes through a hash function before being evaluated and incorporated into an extremely compact sketch for the set.
The process of cardinality estimation based on leading zeroes can have a high variance, so the HyperLogLog sketch distributes the hashed elements into multiple counters, whose harmonic mean yields a final cardinality estimation (after correcting for using multiple counters and hash collisions). This memory is still logarithmic in the total number of distinct elements. On the other hand, calculating the exact cardinality requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Alternatively, condensed representations may summarize the structure of the dataset by analyzing the frequency of components of the set. New datapoints that are assumed to exhibit the same structure can be efficiently acquired.

Figure . Count-min sketch: three pairwise independent hash functions are applied to each k-mer. Each hash function is responsible for a row in the sketch and maps the hash values to the bins in its row. To encode an element into the sketch, the count-min sketch increments the numeric value in the mapped bins. To return the number of occurrences of a given k-mer, it hashes the k-mer using the same hash functions and returns the smallest value. Bloom filter: all the values in the array are initialized to 0. To record the presence of a k-mer in the dataset, the k-mer is mapped to the bits in the Bloom filter using three pairwise independent hash functions, and the mapped bits are changed from 0 to 1. Minimizer: a given sequence can be compressed into a list of minimizers. To do that, a window slides across the sequence. In each window, the sequence inside the window is decomposed into k-mers, and a minimizer is selected among the list of k-mers for the window at each position. HyperLogLog: each k-mer is represented by a hash value of fixed length. The first three bits of a hash value are used to locate a register and the remaining bits are saved in the corresponding register. The maximum number of leading zeros among all the values stored in a register is used to estimate the cardinality of that register.

( ) Compressed sensing is a signal processing technique that enables the acquisition of high-dimensional signals from low-dimensional measurements by leveraging the sparsity of many natural signals ( ) ( ) ( ) . Sparse signals have only a few nonzero elements. In metagenomics, a signal of interest may be the relative abundance of microbes in a sample. These signals are sparse because only a small fraction of all known species are present (i.e. have nonzero abundance) in any given sample. Figure illustrates the process of CS in this context. The CS problem can be represented concisely with linear algebra: y = Φx, where an m × n sensing matrix Φ captures an n-dimensional signal x with m linear measurements that are stored in y. Sparse recovery algorithms find the sparsest x that obeys y = Φx, either through a convex relaxation (e.g. a lasso regression ( )) or a greedy algorithm (e.g. matching pursuit ( ) ( ) ( ) ( ) ). Theory shows that CS can make very efficient use of linear measurements; m scales logarithmically with n ( , ). ( ) Count-min sketch (CMS) is a specialized CS algorithm where the projection matrix is a structured (0-1) random matrix derived from cheap universal hash functions. Due to this carefully designed matrix, it is possible to compute the projection y = Φx, as well as perform recovery of x from y, without materializing the matrix in memory, instead using only a few universal hash functions, each of which needs only two integers. As a result, we get a provably logarithmic-memory algorithm for compressing x and recovering its heavy elements. The CMS is popular for estimating the frequencies of different elements in a data set or stream. The CMS algorithm is remarkably simple and has a striking similarity with the Bloom filter. The CMS is a matrix with w columns and d rows.
It can be thought of as a collection of d Bloom filters, one for each row, each using a single hash function. The only difference is that we use counters in the CMS instead of bits in Bloom filters. Given an input data element x, the CMS hashes it with d independent hash functions. The i-th hash function generates a hash value hash_i(x) within range w and increments the numeric value stored at column hash_i(x), row i. Querying the count of an element consists of simply taking the minimum of the counters that the element hashes to in the CMS. A tremendous amount of study and follow-up work has been performed by the scientific community to improve the fundamental probabilistic data structures and algorithms. Here, we give a brief overview of relevant variations, extensions, and recent advancements to the methodologies described above. There has been significant advancement in improving the computing cost of MinHash, which became a central tool in bioinformatics after the introduction of Mash ( ) and other toolkits that then followed ( , ) . MinHash requires p hash functions, and p passes over the data, to compute p signatures. Recently, using a novel idea of densification ( ) ( ) ( ) , densified-MinHash was developed. Densified-MinHash requires only one hash function and one pass over the set to generate all p signatures of the data, with statistical properties identical to p independent MinHashes, for any given p. Several improvements have also been made for efficiently computing weighted MinHash ( ) , where the elements of sets are allowed to have an importance weight. These recent advances have made it possible to convert data into MinHashes at the same cost as reading the data, which otherwise was the main bottleneck step. Genomic applications also use many LSH functions beyond MinHash. SimHash ( ) was invented by Google to find near-duplicates over large string inputs using cosine similarity.
It was shown in ( ) that for sequence and string datasets MinHash is provably and empirically superior to SimHash, even for cosine similarity. b-bit minwise hashing is a variation of MinHash that saves only the lowest b bits of each hashed value ( ) . It requires less memory to store each hash code and can be used to accurately estimate the similarities among high-dimensional binary data. Sectional MinHash (S-MinHash) ( ) includes information about the location of k-mers or tokens in a string to improve duplicate detection performance. Universal (or random) hash functions seek to quickly and uniformly map inputs to hash codes. Universal hash functions are important building blocks for the CMS, Bloom filter, hash table, and other fundamental data structures. MurmurHash (https://sites.google.com/site/murmurhash, accessed March ) is a very well-known universal hash that has been widely used in many bioinformatic software packages, including Mash ( ) . Although previous MurmurHash versions were vulnerable to hash collision, MurmurHash (https://github.com/aappleby/smhasher/wiki/murmurhash, accessed March ) is a good general-purpose function that is particularly well-suited to large binary inputs. However, there are other options such as xxHash (https://github.com/Cyan4973/xxHash, accessed March ), which can be faster than MurmurHash, and CityHash (https://opensource.googleblog.com/ / /introducing-cityhash.html, accessed March ). CityHash is relevant to genomics because it is optimized for strings. It outperforms MurmurHash for short string inputs but is appropriate for any length input. FarmHash is the successor to CityHash and also focuses on improved string hashing performance (https://opensource.googleblog.com/ / /introducing-farmhash.html, accessed March ). ntHash ( ) is a specialized DNA hashing function. It recursively calculates the hash values for the consecutive k-mers in a given sequence.
While ntHash can be faster than xxHash, CityHash and MurmurHash, it is only appropriate for sequence data. Minimal perfect hash functions (MPHF) and perfect hash functions (PHF) map inputs to a set of hash codes without any collisions. A PHF maps n inputs, or keys, to a set of >n hash codes, some of which are unused. An MPHF maps n inputs to exactly n codes. Although MPHFs have been used to improve many bioinformatics applications, such as the quasi-dictionary ( ) , the MPHF construction process is often resource-intensive. Critically, all of the inputs must be known in advance to construct an MPHF, and many construction methods based on hypergraph peeling fail to scale. BBHash is an MPHF construction method that was introduced to scale to massive key sets ( ) . BBHash is constructed by a simple procedure that maps each key to a fixed-size bit array using a universal hash. If two keys collide in the bit array, the corresponding location is set to 0; otherwise, the bit remains 1. This recursive process is repeated with all of the colliding keys until there are no more collisions. Due to the simplicity of the algorithm, BBHash construction is much faster at the scale typically encountered in genomics. MPHFs are usually used to implement fast, read-only hash tables with constant-time lookups. However, clever open addressing schemes can also be used to achieve similar query performance without knowing the key set in advance. Rather than avoid hash collisions, open addressing attempts to rearrange elements in the hash table for optimal performance. For instance, hopscotch hashing ( ) ensures that a key-value pair is always found within a small neighborhood of its hash code. Since only a small collection of consecutive buckets needs to be searched when a query is issued, hopscotch hashing has very strong query-time performance. Robin Hood hashing ( ) is another open addressing method.
The key feature of this algorithm is that it minimizes the distance between the hash code location and the actual key-value pair, reducing worst-case query time. Cuckoo hashing ( ) uses two hash functions and guarantees that an element will always be found at one of its two hash indices. Some fundamental advances in LSH have also been seen with minimizers. Traditionally, minimizer selection is executed according to lexicographic order. However, this procedure may cause 'over-selection', where more k-mers than necessary become minimizers. Instead, researchers recently proposed to select minimizers from a set of k-mers based on a universal hitting set or a randomized ordering ( ) . If minimizers are picked from the universal hitting sets, which are the minimum sets of k-mers that cover every possible l-long sequence ( ) , the expected number of minimizers in a given sequence decreases. There is also recent progress in techniques to rapidly characterize datasets. HyperLogLog has risen to prominence recently thanks to its ability to efficiently count distinct elements in large data sets and databases. Many new algorithms have since been developed based on HyperLogLog to adapt to different scenarios. For instance, HyperLogLog++ ( ) was introduced to reduce the memory usage and increase the estimation accuracy for an important cardinality range. Sliding HyperLogLog ( ) adds a sliding window to the original algorithm for more flexible queries, but it requires more memory storage. Bloom filters are attractive because they can substantially compress a dataset, but this approach can return false positive answers. Cascading Bloom filters ( , ) improve the accuracy of the standard Bloom filter. A cascading Bloom filter recursively creates child Bloom filters to store the false positives from a parent Bloom filter. This reduces the false positive rate (FPR) of the overall system at a small memory cost.
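These variants all build on the plain Bloom filter described earlier in this section. A minimal version is easy to write down; the parameter choices here (w = 1024 bits, p = 3 salted SHA-1 hash functions) are our own illustrative assumptions, not any tool's defaults.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a w-bit array written by p hash functions."""
    def __init__(self, w=1024, p=3):
        self.w, self.p = w, p
        self.bits = [0] * w  # all positions start at 0

    def _positions(self, item):
        # Derive p array positions from salted SHA-1 digests.
        for i in range(self.p):
            d = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.w

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Any unset position -> definitely absent; all set -> probably present.
        return all(self.bits[pos] for pos in self._positions(item))
```

Membership answers are one-sided: `in` can return a false positive, but an element that was added is always reported as present.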
An alternative FPR reduction strategy is the k-mer Bloom filter (kBF) ( ) . Each k-mer in a sequence overlaps with its adjacent k-mers by k − 1 base pairs. Therefore, the existence of two adjacent k-mers in a sequence is not independent, and the presence of a particular k-mer in the Bloom filter can be verified by the co-occurrence of its neighbors. Based on this information, kBF lowers the FPR by checking, for instance, the query's eight possible neighboring k-mers (four to the left and four to the right). If none of the query's neighbors exist in the Bloom filter, kBF rejects the query as a false positive. There are also many algorithms built around the generalized Bloom filter data structure. These methods give the Bloom filter different functions, but maintain its simplicity and memory-efficiency. The counting Bloom filter (CBF), for instance, was developed to detect whether the count of an element is below a certain threshold ( ) . The only difference between the BF and the CBF is that when adding an element, all the counters for that element increase by 1. The spectral Bloom filter (SBF) ( ) functions similarly to a CBF, but the SBF only increases the minimum value in the table when inserting an element. This modification causes the SBF to have a lower error rate when compared to the CBF.

Nucleic Acids Research, Vol. , No.

In addition to extensions and variations of fundamental methods, recent advances have been developed by combining several core data structures and techniques. For instance, RACE ( ) is an algorithm to downsample sets of genetic sequences while preserving metagenomic diversity. RACE replaces the universal hash function in the CMS with an LSH function. Using MinHash, RACE can identify frequent clusters of sequences rather than frequent elements. Since RACE is robust to sequence perturbations, it can be used to implement diversity sampling.
By adjusting the LSH collision properties, RACE can create a sampled set of sequences that retains metagenomic diversity while substantially downsampling a data stream. The RACE diversity sampling algorithm is attractive because it can downsample accurately with high throughput, low memory overhead, and only one online pass through the dataset. For each sequence in an input stream, RACE checks whether the sequence belongs to a frequent cluster. This is done by replacing the minimum operation in the CMS with an average over the count values. Due to a deep connection between RACE and kernel density estimation, the average is a measure of the number of nearby sequences in the dataset, otherwise known as a density estimate. If the density is low, then RACE has not seen many similar sequences and the sequence is kept. Otherwise, the sequence is discarded. In theory and practice, RACE attempts to select a constant number of sequences from each cluster. When MinHash is properly tuned to differentiate between species, the clusters in the RACE algorithm correspond to different species in the dataset. As a result, RACE provides a fast, online and robust way to downsample sequence datasets while retaining important metagenomic properties. Another important development comes from the CMS and Bloom filters. RAMBO (repeated and merged Bloom filter) ( ) is a recent development in multiple-set compression for fast k-mer and genetic sequence search. The RAMBO data structure is inspired by the CMS, but the goal is to report the sequence containment status rather than the sequence frequency. RAMBO consists of a set of b × r Bloom filters. Rather than maintain one Bloom filter for each set of k-mers, RAMBO uses a 2-universal hash function to randomly merge k datasets into b groups (1 ≤ b ≤ k) so that each group has approximately k/b datasets. Each partition is compressed using a Bloom filter. This process is independently repeated r times with different partitions.
To determine which sets contain a query sequence, RAMBO queries each Bloom filter. Because the groupings are random, each repetition reduces the number of candidates by a factor of 1/b until only the correct datasets are reported at the end of the algorithm. The key insight is that with this approach, RAMBO can determine which datasets contain a given k-mer or sequence using far fewer Bloom filter queries, yielding a very fast sublinear-time sequence search algorithm ( ) . RAMBO also inherits many desirable features from the CMS and the Bloom filter. This includes a low false positive rate, a zero false negative rate, a cheap update process for streaming inputs, fast query time, and a simple systems-friendly data structure that is straightforward to parallelize. In addition to methods that enable the scalable processing of high-dimensional data, there are fundamental extensions of and considerations for CS that enable its efficient acquisition. While applications of CS are constrained to those where the sparsity assumption is appropriate, seemingly irrelevant signals may have a hidden sparse representation in some basis. For example, JPEG image compression exploits the fact that natural images can be sparsely represented (or at least approximated) in a discrete cosine basis (a cousin of the Fourier transform). When the sparsity basis Ψ is known in advance, the canonical CS problem can be reformulated from y = Φx to y = ΦΨs, where s is the sparse representation of x in the basis defined by the columns of Ψ. This transformation was recently demonstrated in transcriptomics ( ) and may soon find an analogous application in metagenomics. Aside from signal sparsity, CS also imposes constraints on the sensing matrix. Specifically, Φ must adequately preserve signals' separation distances; highly distinct n-dimensional signals should not be forced into close proximity in m-dimensional space once projected by Φ ( , ).
While Gaussian and other classes of random matrices have been shown to work well in the general case, recent techniques indicate that Φ can be iteratively optimized for a given task by simulating measurements and sparse recovery of signals ( ) . However, as we discuss below, practitioners generally do not have full control of Φ in most applications. In metagenomics, the values in Φ are constrained by the nucleic acid content of natural organisms. Because each chosen sensor makes up a row of Φ, a new algorithm can select m sensors (e.g. k-mers or probes) from a set of options to optimize the properties of Φ for CS ( ) . Very recent techniques in CS are also exploring how to merge machine learning with CS. Given a dataset, recent work indicates that both the sensing matrix Φ and the procedure that recovers x from y = Φx can be learned with specially designed deep neural networks ( ) ( ) ( ) ( ) , even in cases where the signal's sparsity structure is nonlinear. Datasets in metagenomics are known to be highly structured and could thus be positively impacted by these recent advances in CS in the near future. Most, if not all, of the approaches described above have found their way into previously published bioinformatics methods. However, method development to date has been primarily focused on genome sequencing for a single individual or isolate genome. Findings suggesting links between microbiomes, such as the human gut microbiome, and human disease ( , ) have led to increased metagenomic sequencing. The rapid growth of this type of sequencing, where the set of reads is from a complex community of organisms, adds additional complexity and new challenges to fundamental comparative genomics problems. Here we list a core set of the fundamental problems faced when performing metagenomic sequence analysis: (i) sequence resemblance, (ii) sequence containment, (iii) sequence classification, (iv) sequence downsampling, (v) sequence profiling and (vi) sequence probe design.
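Before turning to the tools themselves, the sparse-recovery step at the heart of the CS discussion above can be made concrete with a tiny greedy matching pursuit in pure Python. The 2-measurement, 3-dimensional toy problem and the fixed iteration count are our own illustrative assumptions; practical solvers use lasso or orthogonal matching pursuit on far larger, carefully chosen Φ.

```python
def matching_pursuit(Phi, y, iters=2):
    """Greedy sparse recovery: repeatedly pick the column of Phi most
    correlated with the residual and add its contribution to x.
    Phi is a list of m rows (lists of n floats); y is a length-m list."""
    m, n = len(Phi), len(Phi[0])
    x = [0.0] * n
    r = list(y)  # residual measurements
    for _ in range(iters):
        best_j, best_score = 0, 0.0
        for j in range(n):
            col = [Phi[i][j] for i in range(m)]
            norm = sum(c * c for c in col) ** 0.5
            if norm == 0:
                continue  # skip degenerate all-zero columns
            score = abs(sum(c * ri for c, ri in zip(col, r))) / norm
            if score > best_score:
                best_j, best_score = j, score
        col = [Phi[i][best_j] for i in range(m)]
        coef = sum(c * ri for c, ri in zip(col, r)) / sum(c * c for c in col)
        x[best_j] += coef  # update the recovered signal
        r = [ri - coef * c for ri, c in zip(r, col)]  # shrink the residual
    return x
```

With Φ = [[1,0,1],[0,1,1]] and measurements y = [0, 2], the greedy step selects the second column and recovers the 1-sparse signal x = (0, 2, 0) exactly.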
For each problem, we discuss the role of the previously described approaches and newer tools incorporating recent advances (Table ) . One of the recent breakthroughs in the area of large-scale biological sequence comparison is the use of locality-sensitive hashing, or specifically MinHash and minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses. Mash. In response to the high computational expense of large-scale sequence similarity calculations, researchers have begun to apply probabilistic approaches such as MinHash to approximate the similarity between sequences ( ) . In the seminal work of Mash ( ) , it was shown that MinHash could be used as an extremely efficient estimator of genome similarity in both speed and resource use. It was also shown how Mash could be applied to similarity estimates between entire metagenomes. In addition, Mashtree has experimented with building phylogenetic trees based on the genomic similarity estimated using Mash ( ) . These and other applications led to a quick and widespread adoption of Mash throughout the research community for rapid sequence similarity calculations. Despite representing a paradigm shift, one of the shortcomings of MinHash is that its similarity estimation is most accurate when the two sets have similar sizes and their intersection region is large ( ) . In the paper ( ) , the authors also point out that the genomic similarity estimated via Jaccard distance is sensitive to the data set size. Another limitation of MinHash applied to metagenomics is that large amounts of rare k-mers can dominate the sample sketches. These k-mers, which only occur a few times, could be the result of sequencing errors as well as actual rare species present in a metagenome. We will now review several other recent bioinformatic tools that have accelerated sequence similarity estimation in the era of terabyte-scale datasets.
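The core MinHash estimate that Mash builds on can be sketched as follows. This is our own toy code: p salted SHA-1 hashes stand in for the p permutations, and 4-mers are used for brevity; real tools use fast non-cryptographic hashes and much larger k and p.

```python
import hashlib

def kmers(seq, k=4):
    """Decompose a sequence into its set of k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(s, p=64):
    """p-permutation MinHash: for each salted hash function, keep the
    minimum hash value observed over the whole set."""
    sig = []
    for i in range(p):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{i}:{x}".encode()).digest()[:8], "big")
            for x in s))
    return sig

def jaccard_estimate(sig_a, sig_b):
    """The fraction of agreeing minima estimates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Each signature position agrees between two sets with probability equal to their Jaccard similarity, so the fraction of agreeing positions is an unbiased estimate of it, at a cost that depends on p rather than on the set sizes.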
BinDash ( ) , like Mash, takes in sequences, compresses them into sketches and then compares the sketches to estimate genome similarity. Specifically, BinDash focuses on accelerating the sketch construction and sketch comparison time. To do this, BinDash uses the b-bit one-permutation MinHash algorithm to compress sequences. Given a sequence, BinDash first decomposes the sequence into k-mers. Each k-mer of the sequence is hashed by one predefined hash function. The hash values of the k-mers are then pooled into B buckets. After all the k-mers are hashed and grouped into B buckets, BinDash selects the smallest hash value from each bucket and stores the lowest b bits of each selected hash value as the sketch of the sequence. To account for potentially empty buckets, the sketch process is optimized by the densification operation mentioned in the previous section. The sketch similarities are then estimated using Jaccard indices based on the B · b bit sketch. The experiments show that, compared to Mash, BinDash can characterize the same data set with less error, less memory and faster speed. Dashing. The recently introduced work of Dashing uses HyperLogLog (HLL) sketching to approximate genomic distances ( ) . One main motivation behind Dashing is to improve the similarity estimation accuracy across input sequence datasets with different sizes. Dashing represents the first time that HLL has been applied to estimate the overall similarity between sequence samples. Given that HLL is used to estimate set cardinality, using HLL to estimate genomic sequence similarities requires estimating the intersection of the two sequence data sets' k-mers, and then the cardinality of this intersection set. Dashing first sketches the k-mers of each given sequence data set using HLL. It then creates a union sketch using basic register maximum operations between the two HLL sketches.
Now, having access to the set cardinality of both independent sets, as well as the union set size, the inclusion-exclusion principle yields the set cardinality of the intersection between the two sequence datasets. The HLL set cardinality calculations of Dashing are estimated using a maximum-likelihood-based approach, which has higher accuracy than the traditional corrected harmonic mean estimation approach. Dashing is able to sketch metagenomes faster than previous approaches, but it requires more CPU time to calculate the genomic distances. In the end, compared to Mash, Dashing has faster speed, higher accuracy and a lower memory footprint. Finch. Rare k-mers can distort the estimation of sequence comparisons and inter-metagenomic distances. To solve this problem, Finch ( ) uses MinHash with a larger sketch size in order to evaluate the abundance of each k-mer. It then decides thresholds based on the estimated abundances to filter out low-abundance k-mers. It also removes k-mers with unequal frequencies of forward and reverse sequences. By deleting erroneous or rare k-mers, Finch can estimate the distances between metagenomic samples robustly. Finch also reports including a correction for sequencing depth biases. HULK estimates the similarities among metagenomic samples while taking k-mer frequencies into account ( ) . In HULK, a metagenomic sample is sketched via histogram sketching ( ) into a final histosketch, which preserves k-mer frequency information. To build a histosketch for a given metagenome, reads are first decomposed into k-mers and then streamed in a distributed fashion into independent count-min sketch counters. Once a large number of reads have been counted, HULK sends the CMS data to be histosketched and resets the CMS counts to their initial values.
In order to create the final histosketch, HULK first summarizes the count-min sketch counters into a k-mer spectrum and then applies consistent weighted sampling (https://www.microsoft.com/en-us/research/publication/consistent-weighted-sampling/, accessed March ) methods. HULK can successfully cluster metagenome samples based on the similarity between histosketches, and is a faster approach than naive k-mer counting. kWIP is yet another recent approach that tries to improve the accuracy of estimating sequence dataset similarity, via a k-mer weighted inner product (kWIP) ( ) . kWIP first uses khmer ( ) , a k-mer counting software package relying on the count-min sketch, to compress each metagenomic read sample into a sketch. Each sketch is an array consisting of m bins. Each bin is responsible for counting the number of occurrences of some of the k-mers (with collisions) in the sample. To calculate the distance between two samples, each of the m bins is assigned a weight to be used in a weighted inner product. In order to assign weights to individual bins, kWIP first counts the number of non-zero bins across all of the n samples. An m-length vector containing these frequencies is then used by kWIP to create another m-length vector, converting the frequency values to new values based on Shannon entropy. This entropy conversion causes bins that have k-mers present in roughly half of the samples to be heavily weighted, versus bins that have k-mers present in all or none of the samples (which get a weight of zero). Genetic similarity is then approximated by the kWIP distance. The kWIP distance is calculated using the inner product between two sample sketches, with each bin weighted by the Shannon entropy for that bin. The authors show that kWIP can produce more accurate results than Mash, especially for metagenomic samples with low divergence. Of note, kWIP is specifically designed to create a distance matrix from multiple samples, using all samples in the sketching process, as opposed to comparing individual sketches for individual samples like most other methods discussed here. Order Min Hash (OMH) introduces a new way of sketching a sequence that estimates the edit distance between sequences ( ) . Unlike most other hashing-based techniques for similarity calculations, which treat all the k-mers without respect to the order in which they occur, OMH preserves the k-mer ordering in its sketching process. The sketch for a given sequence consists of n vectors of length l. Each of the n vectors contains l representative k-mers, which are selected according to a pre-defined permutation function, and whose relative ordering is maintained from the original sequence.

Table . Metagenomics software based on probabilistic and signal processing algorithms. Six main application areas are highlighted: containment, downsampling, probe design, profiling, resemblance and taxonomic classification. Speed indicates the relative computational speed of CPU operations, Memory the relative maximum RAM used during index construction/query steps, and Year the publication year. More stars indicate better time and memory efficiency; fewer stars indicate more resource-intensive tools. Performance estimates using only literature-based comparison are marked in gray. The stars correspond roughly to time (days, hours, minutes, seconds and milliseconds) and memory (> GB (server), > GB (workstation), > GB, > MB and < MB). Datasets used were from Shakya et al. ( ) . BioBloom Tools and Opal were indexed using the training data provided by Opal, which is much smaller than the databases other tools use. MetaMaps is a classifier specifically for long-read sequences, as compared to the other tools in the category. The datasets and results for each tool can be found at https://gitlab.com/treangenlab/hashreview
The distance calculation uses the weighted Jaccard distance, where the number of appearances of a k-mer is taken into account. sourmash ( ) is closely related to Mash and based on MinHash. It modifies the sketching procedure such that the sketch size can be of variable length for different sequences. In this approach, the size of the sketch is based on the number of unique k-mers, unlike the fixed-size MinHash sketch. Additionally, sourmash includes functionalities such as k-mer frequency calculations, as well as a sequence containment method that combines the sequence Bloom tree and MinHash methodologies. Searching for the containment of a read, gene fragment, gene, operon, or genome within a metagenomic sample or sequence database is a frequent computational task in bioinformatics. This is an open challenge for two key reasons. First, the sizes of metagenomic and sequence repositories are on the scale of terabytes to petabytes. Thus, methods able to quickly eliminate all the non-matching sequences in the database are crucial. Second, sequences evolve over time and will rarely, if ever, be an exact match, especially as metagenomes and sequence databases contain a huge amount of sequence diversity. Methods that tolerate mismatches and indels have much improved sensitivity compared to methods that require strict exact sequence matches to satisfy containment. Despite the breakthroughs made via Bloom tree-inspired structures in sequence search, these approaches are not without drawbacks. First, they have to make a trade-off between false positives and the filter size, due to the inherent limitations of the Bloom filter. Second, they commonly lack flexibility; once the filter size is determined, it cannot be changed based on the size of the input sequences. No matter how many k-mers a sequence has, they all have to be sketched into a fixed-size array.
Finally, as the size of the input data increases, the precision of Bloom filter-based sequence search typically declines. We will now review a few recent approaches that have tackled this important task in computational biology. Sequence Bloom tree (SBT) ( ) is a binary tree where each node in the tree is a Bloom filter. An SBT is used to index large sequence databases for efficient containment checks of a query sequence within the database sequences or datasets. To construct an SBT, each sequence or dataset is added one by one, beginning with adding the first dataset as the root of the SBT. For each additional sequence or dataset, one first computes the Bloom filter for the contained k-mers, and then scans from the root of the SBT to the leaves, inserting the dataset's representative Bloom filter at the bottom of the tree. At each bifurcation, the insertion traversal follows the path of the child with the closest Hamming distance similarity to the Bloom filter of the current dataset. After insertion is finished, the new dataset's Bloom filter is added as a leaf node, and each node in the SBT contains the union of the Bloom filters of its children. To be specific, if a k-mer is present in node u, it should also exist in the Bloom filters of all the nodes directly ascending from u to the root. Therefore, as a Bloom filter gets closer to the root, it becomes more populated and its false-positive rate is higher (a process known as saturation). Querying for sequence containment proceeds by querying each node's Bloom filter, starting with the root, and determining whether enough of the query's k-mers are contained. If the Bloom filter contains enough of the query's k-mers, then each child node's Bloom filter is queried for containment. The process proceeds until each sequence or dataset containing the query has been determined at the leaves of the SBT.
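The pruning logic of this traversal can be illustrated with a toy tree in which exact Python sets stand in for Bloom filters (so there are no false positives). The node layout, leaf names and the containment threshold θ = 0.8 are our own illustrative choices.

```python
class SBTNode:
    """Toy sequence-Bloom-tree node; exact k-mer sets replace Bloom filters
    so that the traversal logic is easy to follow."""
    def __init__(self, kmer_set, name=None):
        self.kmers = set(kmer_set)  # union of all descendant k-mer sets
        self.name = name            # leaf label (None for internal nodes)
        self.children = []

def query(node, qkmers, theta=0.8):
    """Return leaf names whose set contains >= theta of the query k-mers."""
    hits = sum(k in node.kmers for k in qkmers)
    if hits < theta * len(qkmers):
        return []  # prune this entire subtree
    if not node.children:
        return [node.name]
    return [leaf for c in node.children for leaf in query(c, qkmers, theta)]

# Build a tiny two-leaf tree: the root holds the union of its children.
leaf_a = SBTNode({"ACG", "CGT", "GTA"}, name="A")
leaf_b = SBTNode({"TTT", "TTA", "TAC"}, name="B")
root = SBTNode(leaf_a.kmers | leaf_b.kmers)
root.children = [leaf_a, leaf_b]
```

A query whose k-mers all come from dataset A passes the root check and descends only into the matching leaf, while a query with no matching k-mers is rejected at the root without touching the leaves.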
the split sequence bloom tree (ssbt) ( ) was developed to quickly search for short transcripts within a large database. although the ssbt was originally designed for rna-seq data, it can be adapted to other sequence containment problems just like the sbt. the ssbt is an improvement over the sequence bloom tree (sbt) data structure ( ) . similar to the sbt, each sequence or dataset in the database is inserted into the ssbt by traversing from the root of the tree to the bottom. the ssbt is also a binary tree, but each node has two bloom filters instead of one. the first filter, called the similarity filter, stores k-mers shared by all the datasets in the subtree under a particular node. the second filter, named the remainder filter, stores the k-mers that are not universally shared among all the datasets but are specific to at least one dataset in the node's subtree. the union of the similarity filter and the remainder filter is a single bloom filter for the node, similar to the nodes of an sbt. the ssbt is a clever re-organization of the sbt, with similar accuracy but reduced space occupancy and search time. bigsi represents a significant advance in sequence containment search; it was introduced to allow efficient search for a query sequence among a large bacterial and viral genome database ( ) . it also relies on bloom filters to solve this problem, but instead of using a tree-like structure (e.g. the sbt), bigsi employs a flat bloom filter-based data structure. bigsi first indexes the reference datasets, where these datasets are raw fastq read datasets or assemblies in which to search for the presence of a query sequence. to index the reference datasets, bigsi first extracts a set of non-redundant k-mers from each dataset and then builds a corresponding bloom filter. after this initial step, bigsi concatenates all the bloom filters together, compressing the whole database into a matrix in which each column is the bloom filter for a given dataset. 
to conduct an exact search for a sequence, bigsi must find the index of all the k-mers of the query sequence inside the matrix. for an inexact search, as referenced above, bigsi only needs to find the index for a subset of the k-mers present in the sequence of interest. bigsi can also dynamically update the size of the sketch based on the number of input datasets: when new datasets arrive, bigsi adds a new column to the matrix for each new dataset. rambo ( ) is a very recent method that also allows indexing new sequences and new datasets in a streaming fashion. contrary to bigsi, which has o(k) query time (where k is the number of datasets), rambo achieves sublinear query time with a slight increase in memory. mash screen ( ) was developed to determine which reference sequences are contained within a metagenomic sample using minhash, though the methodology is also presented as a method for sequence similarity. similar to metapalette (described below), it uses references found to be contained in a metagenome to describe the metagenome's taxonomic composition, but does not classify individual reads. mash screen first converts a reference sequence and a given metagenomic sample into two sets of k-mers, a and b. following that, mash screen compresses the set of reference k-mers a into a minhash sketch. the fraction of k-mers in the sketch of a that are contained in b is referred to as the containment index. finally, the containment index is converted to a score that approximates sequence similarity. this final score is referred to as the mash containment score. the presence or absence of one or more reference sequences in a metagenomic sample is then determined by this mash containment score. an example is given, for instance, of searching for a set of reference viral sequences in hundreds of metagenomes by calculating the mash containment score between each reference and each metagenome. 
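bigsi's flat layout can be mimicked with one bloom-filter column per dataset. this is a toy with a single hash function, invented dataset names, and a made-up filter size; the real tool uses several hash functions over a raw bit matrix:

```python
import hashlib

M = 64  # bloom filter length, i.e. number of matrix rows (toy value)

def row(kmer):
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:4], "big") % M

def column(kmer_set):
    # one bloom filter per dataset, stored as an integer bit column
    bits = 0
    for m in kmer_set:
        bits |= 1 << row(m)
    return bits

datasets = {"sample1": {"ACG", "CGT"}, "sample2": {"TTT"}}
cols = {name: column(s) for name, s in datasets.items()}

def search(query_kmers, theta=1.0):
    # report datasets whose column contains at least theta of the query k-mers
    hits = []
    for name, bits in cols.items():
        found = sum(bits >> row(m) & 1 for m in query_kmers)
        if found >= theta * len(query_kmers):
            hits.append(name)
    return hits
```

setting theta below 1.0 gives the inexact search described above, where only a fraction of the query's k-mers must be present; adding a dataset is just appending one more column.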
metagenomic sequence classification software typically uses reads to search against known genomes and performs lowest common ancestor (lca) based taxonomic classification. as the size of the reference databases (terabytes to petabytes) and the number of reads (many millions to billions) in metagenomic samples increase, it becomes computationally intractable to perform an exhaustive comparison of all k-mers in the reads against all k-mers within the reference databases, opening the door for efficient new tools. kraken ( ) and diamond ( ) were two of the first ultra-efficient tools for fast metagenomic classification. we now review a few recently developed approaches for metagenomic sequence classification. krakenuniq builds on kraken, and its main goal is to decrease the false-positive read classification rate ( ) . compared to kraken, one of the additional features of krakenuniq is that the number of unique k-mers of each taxon is recorded while processing all reads of a metagenomic dataset. krakenuniq uses hyperloglog to efficiently estimate these unique k-mer counts. by tracking the number of unique k-mers for a taxon alongside the coverage for that taxon across all the reads in a metagenome, krakenuniq can identify likely false-positive read classifications caused by events such as sample contamination, low-complexity regions, and contaminated database sequences. kraken 2 substantially reduces memory usage, while simultaneously gaining a significant boost in classification speed, when compared with kraken ( ) . this advancement in memory use and speed comes from using a compacted hash table that stores lca assignments for hashed minimizers of k-mers instead of a table storing lca assignments for all k-mers as in the original kraken. while this hash table saves significant memory, it comes at a small specificity and accuracy cost given that it only stores pairs of minimizers and lcas, which are further subsampled through hashing. 
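a toy of k-mer-to-taxon lookup with lca resolution can make the idea tangible. note this is a simplification of kraken's actual classification, which weights root-to-leaf paths rather than taking the lca of all hits; the taxonomy and k-mer table here are invented:

```python
# toy taxonomy and k-mer table (both invented) illustrating lca-based
# read classification in the spirit of kraken
parent = {"E.coli": "Bacteria", "S.aureus": "Bacteria", "Bacteria": "root"}
kmer2taxon = {"ACG": "E.coli", "CGT": "E.coli", "GTT": "Bacteria"}

def lca(a, b):
    # walk a's ancestors, then lift b until it hits one of them
    seen = set()
    while a != "root":
        seen.add(a)
        a = parent.get(a, "root")
    seen.add("root")
    while b not in seen:
        b = parent.get(b, "root")
    return b

def classify(read, k=3):
    taxon = None
    for i in range(len(read) - k + 1):
        t = kmer2taxon.get(read[i:i + k])
        if t:
            taxon = t if taxon is None else lca(taxon, t)
    return taxon or "unclassified"
```

a read whose k-mers all vote for one species stays at that species, while conflicting votes are lifted to their common ancestor, which is exactly why lca-style classifiers degrade gracefully to higher taxonomic ranks instead of guessing.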
this hashing process includes applying a spaced seed mask to the minimizer before hashing. the size of this new compact hash table can be specified by the user, with smaller sizes reducing the memory footprint and increasing speed but lowering classification accuracy. when compared with other state-of-the-art tools, kraken 2 ultimately provides similar or better classification accuracy alongside its memory and speed improvements. biobloom tools (bbt) ( ) is novel in that it applies a multi-index bloom filter (mibf) to the sequence classification problem. the mibf is a bloom filter-like data structure that consists of three arrays. the first array serves as a traditional bloom filter, recording the existence of hashed items in a set. the second array, named the rank array, tracks the number of nonzero bits stored in the first array at fixed intervals (by default, a running count of nonzero bits is stored for each fixed-length block of the bloom filter). to reduce memory usage, the rank array is interleaved with the first bloom filter array. the third array, also referred to as the id array, stores the integer identifiers (ids) of the reference sequences inserted into the mibf. these ids allow the mibf to additionally store associated taxonomic classification information for entries, so that it can be used as a classifier. for each reference sequence, bbt hashes spaced seeds into the mibf rather than contiguous k-mers. spaced seeds, unlike k-mers, allow mismatches between the references and the queries, which can increase the sensitivity of approximate sequence search ( ) . to classify a given read, spaced seeds from the read are looked up in the bloom filter. the rank array is then used to help retrieve ids from the id array. ultimately, the retrieved ids lead to a final taxonomic classification. to reduce the false-positive rate, bbt makes use of nearby spaced seeds within adjacent sliding windows, referred to as frames, when performing its classifications. 
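the mismatch tolerance of spaced seeds is easy to see in a toy example. the mask below is hypothetical; bbt derives its seeds differently:

```python
MASK = "1101"  # '1' = match position, '0' = don't-care (hypothetical seed)

def spaced_seeds(seq, mask=MASK):
    # extract the seed at every window: only characters under a '1' survive,
    # so a mismatch at a don't-care position leaves the seed unchanged
    w = len(mask)
    return ["".join(c for c, m in zip(seq[i:i + w], mask) if m == "1")
            for i in range(len(seq) - w + 1)]
```

for instance, "ACGT" and "ACTT" differ at the third base, which falls under the '0' in the mask, so both produce the same seed and still match each other.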
bbt also intelligently populates the id array in multiple passes such that the effects of data loss from hash collisions are minimized. ganon ( ) focuses on quick database indexing in order to ensure that the most up-to-date sequence database can be used to accurately classify reads. many existing tools rely on static, out-of-date versions of databases to assign reads. this approach can miss, for instance, classifications for species that have been newly sequenced and only recently added to existing databases. to overcome this problem, ganon employs the interleaved bloom filter (ibf) ( ) to index up-to-date reference genomes efficiently. an ibf is an array of length b · n that encompasses b bloom filters of length n. to index the references, ganon first groups the sequences into clusters. these clusters should roughly mirror different groups for a given taxonomic rank, such as different species or strains. it then sketches each cluster into a single bloom filter. lastly, all the bloom filters are interleaved into one ibf. a read is classified if it passes a minimum threshold for the number of matches between the read and the references. if a given read maps to multiple references, an optional lowest common ancestor (lca) approach can be applied. metamaps was designed to perform classification on noisy long-read data, making both classifications and abundance estimates down to the strain level ( ) . metamaps classifies long reads by mapping them to reference genomes. given that reads could map to many closely related references, metamaps simultaneously performs mapping and estimates the community composition of a metagenomic sample. thus, when determining the probability of mapping a read to a reference, the probability combines a probabilistic mapping quality to the reference with the estimated abundance of the reference's taxonomic unit in the sample. 
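the interleaving itself can be sketched in a few lines. this toy uses one hash function and made-up sizes; ganon uses several hash functions and larger filters:

```python
import hashlib

B, N = 3, 32  # b = 3 cluster filters, each of length n = 32, interleaved
ibf = [0] * (B * N)

def pos(kmer):
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:4], "big") % N

def add(cluster, kmer):
    # the bits of all B filters for one position sit next to each other
    ibf[pos(kmer) * B + cluster] = 1

def query(kmer):
    # a single contiguous slice answers "which clusters contain this k-mer?"
    i = pos(kmer) * B
    return [c for c in range(B) if ibf[i + c]]
```

the point of the layout is cache friendliness: one k-mer lookup reads one contiguous run of b bits instead of probing b separate filters.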
to quickly find mapping locations for reads across all reference genomes, an efficient probabilistic approach is used that generates initial candidate mappings using minimizers, followed by a winnowed-minhash statistical modelling approach for further ani estimation ( ) . the read mappings and metagenome abundance estimates are then iteratively updated through an expectation-maximization (em) algorithm. metaothello ( ) is one of the latest efforts to improve the speed of metagenomic classification. similar to kraken 2, metaothello reports significant improvements in both memory use and speed when compared to, for instance, kraken. metaothello applies the recently developed l-othello data structure, a hashing-based classifier, to speed up the process. metaothello uses k-mers that act as signatures for taxa to make its classifications. a k-mer is a signature for a taxon if it is present only in that taxon or that taxon's subtree, and nowhere else in the tree of life (i.e., it is taxon specific). metaothello indexes all reference sequences, finds all taxon signature k-mers and their taxonomic mappings, and populates an l-othello data structure that efficiently maps from signature k-mers to taxa. the l-othello, once built, maintains two arrays a and b populated with binary values. when looking up a k-mer's taxon mapping in the l-othello, the k-mer is hashed by two hash functions h_a and h_b that map to the corresponding positions in a and b. the final taxon value t for the k-mer is calculated through a bit-wise xor of the two values found in a and b. the classification step of metaothello thus operates similarly to other approaches: a query sequence is decomposed into its constituent k-mers and the corresponding taxon for each k-mer is looked up using the l-othello data structure. then, differing from other approaches, metaothello uses a windowed approach to make the final classification. 
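the xor lookup at the heart of l-othello reduces to two array reads. the toy below uses a naive insertion rule that handles one key at a time; the real structure solves a constraint graph so that all signature k-mers coexist without conflicts:

```python
import hashlib

SIZE = 8
A = [0] * SIZE
B = [0] * SIZE

def ha(kmer):
    return hashlib.sha1(kmer.encode()).digest()[0] % SIZE

def hb(kmer):
    return hashlib.sha1(kmer.encode()).digest()[1] % SIZE

def insert(kmer, taxon_id):
    # naive: fix a, solve b so that lookup returns taxon_id; the real
    # construction resolves collisions between keys, this toy does not
    B[hb(kmer)] = taxon_id ^ A[ha(kmer)]

def lookup(kmer):
    # taxon id = A[h_a(kmer)] xor B[h_b(kmer)], two reads and one xor
    return A[ha(kmer)] ^ B[hb(kmer)]

insert("ACGTACG", 3)
```

the appeal is that lookups are constant time with no probing or chaining: whatever taxon id was encoded for a key is recovered by a single xor of the two array cells its hashes select.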
for a given taxonomic rank, the classification takes into account the longest run of consecutive positions in the query sequence that are assigned to the same taxon. opal ( ) is an lsh-based metagenomic classifier that uses low-density parity check (ldpc) codes. the rationale for using an ldpc lsh approach is to ensure even coverage of all positions in the k-mer while using as few hash functions as possible; the authors highlight that this is the first application of low-density lsh in bioinformatics, and that it avoids coverage bias issues and offers increased accuracy when using long k-mers. in addition to newer, more efficient methods for analyzing large metagenomic datasets, a parallel effort has emerged that instead reduces the dataset size before running further downstream analyses. intelligently downsampling a read dataset, for instance, can dramatically speed up any further computations, while ideally preserving the important characteristics of the metagenome. another alternative approach to analyzing less data than a full metagenome is to restrict sequencing to a small subset of regions in the metagenome, such as rrna marker genes. this sequencing approach, referred to as metabarcoding ( ) or amplicon sequencing, can help simplify downstream tasks such as community profiling and taxonomic assignment of reads. here, however, we consider only the recent computational approaches that shrink large metagenomic datasets, either previously generated or in an online streaming fashion. diginorm ( ) is a cms-based method for downsampling shotgun sequencing data. diginorm is a streaming algorithm that can select a small set of reads from a large dataset using relatively few computational resources and without substantial information loss. this improves the speed of downstream tasks. diginorm begins by finding the frequencies of all k-mers in a sequence using a cms. 
if the median frequency value is larger than a fixed threshold, the sequence is discarded. this process discards reads whose k-mers have already been observed in other reads. since rare reads have many rare k-mers, they will have a lower median count than common reads and will be kept. an easy-to-use python implementation is provided in the khmer package. bignorm ( ) is an extension of the ideas behind diginorm. bignorm obtains better downsampling performance by including additional information, such as quality scores and common error modalities, when deciding whether to accept a read. while bignorm is still based on k-mer abundance counts and the cms, the decision threshold is based on a weighted summary of k-mer counts rather than simply the median. the decision process attempts to remove biases in diginorm that may lead it to incorrectly accept a read. for instance, bignorm attempts to differentiate between rare k-mers caused by single substitution errors and authentic uncommon reads. while diginorm and bignorm are both efficient streaming algorithms, bignorm is implemented in c++ and uses parallelism to achieve faster processing times. race ( ) is a recent downsampling method based on lsh and the cms. rather than considering explicit k-mer abundance statistics, race is based on jaccard similarity. diginorm and bignorm both discard reads that contain many previously observed k-mers; race discards reads that have a high jaccard similarity with many observed reads. while these decision criteria are similar, density estimation with jaccard similarity is extremely efficient using the race algorithm. quikr/wgsquikr ( , ) are cs-based approaches that leverage differences in bacterial k-mer frequencies to recover the relative abundances of bacteria in complex samples. the setup of the cs problem is similar to our depiction in figure . in quikr, each column of the sensing matrix is populated with the k-mer frequency profile of a bacterial species' marker gene. 
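the core diginorm decision rule fits in a few lines. here an exact counter stands in for the count-min sketch, and the cutoff is a toy value rather than diginorm's default:

```python
from collections import Counter

def diginorm(reads, k=3, cutoff=2):
    # streaming digital normalization sketch: keep a read only if the median
    # abundance of its k-mers (as seen so far) is below the cutoff; real
    # implementations use a count-min sketch instead of an exact Counter
    counts = Counter()
    kept = []
    for read in reads:
        ks = [read[i:i + k] for i in range(len(read) - k + 1)]
        median = sorted(counts[m] for m in ks)[len(ks) // 2]
        if median < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept
```

repeated copies of the same read stop being kept once their k-mers reach the cutoff, while a read made of unseen k-mers is always retained, which is exactly the behavior described above.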
sequence measurements across a whole sample are converted to raw k-mer frequencies (y), from which the sparse combination of species can be recovered using cs with sparsity-based optimization. quikr was soon followed by wgsquikr ( ) , which leveraged the same core method but applied k-mer analysis to whole-genome shotgun sequencing data. at the time of publication, these techniques achieved competitive accuracy with orders-of-magnitude improvements in speed over state-of-the-art read-by-read classifiers. however, they were limited to genus-level taxonomic depth and exhibited difficulty in recovering rare organisms. metapalette ( ) takes a cs-inspired approach similar to wgsquikr for metagenomic community reconstruction, with a few subtle but significant differences. the authors define a matrix a created from k-mers of database reference genomes, known as the common k-mer training matrix. this matrix is analogous to the sensing matrix in cs, but a stores pairwise similarities of reference genomes based on shared k-mers. a can be constructed efficiently for long k-mers by using bloom count filters. ultimately, the relative taxa abundances x are recovered from the aggregate sample k-mer counts y by solving ax = y for a sparse x. while we only discuss a single a, x and y here, metapalette in fact creates multiple a and x for different k-mer lengths. the authors also augment a with artificial 'hypothetical organisms' of similar k-mer profiles. the use of long k-mers and the mathematical representation of unknown organisms enables metapalette to classify even novel organisms at the strain level. mission ( ) is a hybrid compressed sensing and hashing-based approach. specifically, mission uses a count-sketch data structure, acquires the heavy hitters from the data, and applies stochastic gradient descent to update the data structure. the sparsity constraint on the features keeps the top heavy hitters while setting the rest to zero. 
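the ax = y setup behind these methods can be illustrated with a brute-force 2-sparse solver. real cs methods use convex optimization or greedy pursuit; this exhaustive toy only works for tiny problems and is purely for intuition:

```python
def recover_pair(A, y):
    # try every pair of columns, solve the 2x2 normal equations, and keep
    # the pair of "species" whose mixture best explains the observation y
    n = len(A[0])
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            a = [row[i] for row in A]
            b = [row[j] for row in A]
            aa = sum(x * x for x in a)
            bb = sum(x * x for x in b)
            ab = sum(x * z for x, z in zip(a, b))
            ay = sum(x * z for x, z in zip(a, y))
            by = sum(x * z for x, z in zip(b, y))
            det = aa * bb - ab * ab
            if det == 0:
                continue  # parallel columns: no unique solution
            ci = (ay * bb - by * ab) / det
            cj = (aa * by - ab * ay) / det
            res = sum((ci * x + cj * z - w) ** 2
                      for x, z, w in zip(a, b, y))
            if best is None or res < best[0]:
                best = (res, i, j, ci, cj)
    return best
```

with columns playing the role of per-species k-mer profiles, a sample mixed from two species is recovered exactly, support and abundances together, which is the essence of what the sparsity-based optimization does at scale.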
this algorithm was used for metagenomic classification on the dataset from ( ) and showed how many features of the data were adequate for a given level of performance. metagenomic sequencing has opened the door for biologists to detect novel or rare organisms in different environments. however, detection with high sensitivity can demand extensive sequencing runtimes to capture novel fragments among the innumerable background sequences in a metagenome ( ) . to circumvent these challenges, single-stranded nucleic acid probes can enrich or sense dna fragments by binding to intended target strands. many software packages have been developed for designing probes for a specific target genome, but generating probes for metagenomic analysis is difficult because of the uneven and diverse sequences in metagenomic samples: capturing rare sequences while excluding highly similar sequences is challenging. therefore, metagenomics requires probe design techniques that scale well with the number of organisms found in samples. catch is a newly developed method to design optimal probes for targeted microbial enrichment to facilitate downstream detection in sequencing ( ) . this approach is particularly important for viral detection in samples with low titers; without probe-based enrichment, low-abundance viruses may evade detection. moreover, catch pursues a set of probes that can scalably capture the full diversity of viruses. catch first generates a set of candidate probes from the input sequences and then collapses the probes with high similarity using lsh. specifically, it detects nearly identical probes through either hamming distance or minhash, and then removes the similar candidate probes. to make sure that the final set of probes encapsulates the diversity of the input sequences, catch computes the smallest set of probes needed to cover the whole set of target sequences. catch treats this as a set cover problem and solves it using the canonical greedy solution ( ) . 
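the canonical greedy set cover heuristic referenced above is short. the probe-to-target mapping here is fabricated for illustration:

```python
def greedy_set_cover(targets, candidate_probes):
    # greedy set cover: repeatedly pick the probe covering the most
    # not-yet-covered targets (candidate_probes maps probe -> target set)
    uncovered = set(targets)
    chosen = []
    while uncovered:
        probe = max(candidate_probes,
                    key=lambda p: len(candidate_probes[p] & uncovered))
        gained = candidate_probes[probe] & uncovered
        if not gained:
            break  # remaining targets cannot be covered by any probe
        chosen.append(probe)
        uncovered -= gained
    return chosen
```

the greedy rule gives the classic ln(n)-factor approximation to the optimal cover, which is why it is the standard choice for probe selection at this scale.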
ultimately, thousands of probes are chosen to cover the targets based upon the optimization criteria. while catch focuses on probe design for enrichment of target sequences in a complex sample before metagenomic sequencing, applying cs permits another workflow with orders of magnitude fewer probes at the cost of some taxonomic depth. if a sample is known to be v-sparse, i.e. to contain a subset of v or fewer of the n possible microbes, cs can be applied with m = o(v log(n/v)) mismatch-tolerant dna probes. the sensing matrix is populated by the expected number of binding events between each probe (in rows) and each target organism (in columns). these nonspecific probes can be thought of as directly measuring the abundance of soft-matching k-mers. proof-of-concept work was first explored in a cs microarray (csm) format ( ) . the same principle has also been demonstrated for sensing bacterial pathogen genomes at species resolution in bulk solution with fewer than a dozen fluorescent, random dna probes ( ) . although fewer probes can be resolved in bulk solution compared to a microarray (m is limited), such an approach may find applications in rapid infection diagnostics, where the species library is constrained to pathogens (n is much smaller) and patient samples are very sparse, with at most a few unique species ( ) . given a set of possible microbes (the library), a set of probes, and the simulated hybridization behavior between them, a subset of probes can be selected with the insense algorithm ( ) . insense optimizes the incoherence of the sensing matrix, a common quality metric for cs sensing matrices, via a convex relaxation. this cs approach bypasses sequencing by capturing information directly from probe-target hybridization events, and it will be exciting to see how it performs in real patient and environmental samples. 
if the sensing matrix can be accurately predicted from probe and target sequences, it is plausible that future applications can synergize with sequencing databases by automatically updating it based on known trends in microbial evolution. however, nonspecific hybridization mandates a thorough understanding of the library of possible species and perhaps careful sample processing; out-of-library, unexpected nucleic acids that interact with nonspecific probes would corrupt the measurements and downstream sparse recovery. despite the nascent state of metagenomic sequencing and analysis, its accelerated adoption has led to both an explosion in available data and an ever-increasing demand for new data analysis methodologies. in this survey, we have covered what we believe to be a core set of fundamental probabilistic data structures and algorithms that are uniquely positioned to tackle the burgeoning growth of metagenomic data, as well as the added nuances of analyses involving the diverse community contained inside a metagenome. despite the relative youth of the field of metagenomics, many fast methods have already emerged that can be used or were designed for this area. for instance, as seen in table , methods like bindash and dashing are being developed in an effort to further accelerate sequence similarity estimation beyond the speed of the seminal mash tool. similarly, recent advances like bigsi, rambo, and ssbt are opening the door to petabyte-scale sequence searches among vast sequencing datasets. however, continued breakthroughs are still needed to better handle metagenomic-specific intricacies such as sequencing error, low-abundance community members, and uneven coverage. in addition, the probabilistic approaches discussed in this paper generally come with an accompanying set of pros and cons. for instance, most bloom filter algorithms involve a fundamental trade-off between memory, query cost, and quality. 
standard bloom filters balance the size of the bit array against the possibility of false positives; this trade-off is implicit for any algorithm using the data structure. the fpr can be reduced by choosing the right number of hash functions, which may increase query time, or by making assumptions about the input data, as with k-mer bloom filters. cascading bloom filters provide an alternative way to trade query time and memory for fpr, at the expense of a more complex hierarchical structure. additionally, cs approaches come with their own set of tradeoffs. while cs confers measurement efficiency for cost and time savings, it is inherently database-dependent. for instance, in some of the applications we discussed, the sensing matrix was precomputed by leveraging a sequence database (sequences at a specified position, k-mer frequencies, response to a set of probes, etc.). similarly, the discovery of sparse representations requires a training set of signals. this requirement for a dataset becomes limiting in chaotic applications such as the identification of organisms that evolve rapidly through vertical or horizontal gene transfer. such novel differences in real-world samples would likely be treated as noise in sparse recovery and ignored until the database is updated. cs is therefore likely limited to applications whose datasets exhibit an acceptable level of stability. more generally, while the cs technique is provably robust to errors (noise) in the low-dimensional measurements y, any errors in the signal x are amplified by the factor n/m ( ) . in metagenomics, measurement noise may be attributed to whether an expected nucleic acid fragment in the sample generates a read during sequencing, and signal noise could be the result of unforeseen mutations or contamination. in applications featuring significant signal noise, the ratio n/m controls the tradeoff between the efficiency of the measurement process and signal-to-noise ratio degradation. 
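the memory/fpr trade-off follows from the standard approximation p ≈ (1 − e^(−kn/m))^k, minimized by k ≈ (m/n) ln 2 hash functions, which is quick to evaluate:

```python
from math import exp, log

def bloom_fpr(n, m, k):
    # standard false-positive approximation for n items, m bits, k hashes
    return (1 - exp(-k * n / m)) ** k

def optimal_k(n, m):
    # number of hash functions minimizing the false-positive rate
    return max(1, round(m / n * log(2)))
```

for example, at 10 bits per item the optimal hash count is 7 and the false-positive rate lands around 0.8%, illustrating why every extra bit per item buys a multiplicative reduction in fpr but costs memory across billions of k-mers.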
in addition to the considerations directly involved in the inner workings of the discussed methods, there are many surrounding considerations that can also greatly affect both their accuracy and scale. while we have discussed various tradeoffs involved in probabilistic approaches, many of these tradeoffs involve carefully selected hyperparameters. to a non-expert user of the methods, it may not be obvious how to set the various parameters for each method, and even advanced users may struggle to find the truly optimal parameter settings derived from the underlying theory. another consideration is the modeling of processes such as natural genome evolution. many k-mer based approaches and hashing techniques are initially developed in a way that is blind to underlying biological processes such as evolutionary drift, which gradually introduces point mutations, insertions, and deletions into closely related genomes that otherwise might be identical. conversely, phylogenetic methods that explicitly model events like drift and recombination have been slow to incorporate the recent advances discussed in this survey. consideration can also be given to the actual data collection procedures, such as how the dna sequencing is performed. one new advance on the sequencing side of metagenomics is the concept of genome skimming ( ) , a technique to lightly sequence metagenomic samples. similarly, metabarcoding ( ) or amplicon sequencing can reduce metagenomic data by only sequencing a small set of amplified regions, potentially speeding up and simplifying downstream analyses. a final consideration surrounding newer methodologies is that of the sequence databases that nearly all metagenomics tools rely on for sequence classification. 
while recent advances in probabilistic data structures and algorithms may drastically shrink computational requirements, these speedups can easily be offset, and even outpaced, by the exponential growth of the sequence databases that these algorithms must interact with. new methods should also seek to overcome database quality issues such as misassembled or mislabelled genomes or sets of reads. following methodologies such as simple uniform random downsampling and more intelligent downsampling like diginorm ( ) , recent advances like the race method ( ) attempt to address the need to shrink databases and remove contaminants and errors, while preserving biologically important characteristics like diversity. 

references 
- probabilistic data structures for big data analytics: a comprehensive review 
- computational biology in the 21st century: scaling with compressive algorithms 
- sketching and sublinear data structures in genomics 
- computational solutions for omics data 
- when the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data 
- on the resemblance and containment of documents 
- approximate nearest neighbors: towards removing the curse of dimensionality 
- an improved data stream summary: the count-min sketch and its applications 
- hyperloglog: the analysis of a near-optimal cardinality estimation algorithm 
- space/time trade-offs in hash coding with allowable errors 
- fast and accurate short read alignment with burrows-wheeler transform 
- opportunistic data structures with applications 
- reducing storage requirements for biological sequence comparison 
- compressive fluorescence microscopy for biological and hyperspectral imaging 
- sparse mri: the application of compressed sensing for rapid mr imaging 
- compressive sensing 
- decoding by linear programming 
- compressed sensing 
- randomized algorithms 
- the random projection method 
- sampling techniques for kernel methods 
- a random sampling based algorithm for learning the intersection of half-spaces 
- adaptive sampling methods for scaling up knowledge discovery algorithms 
- randnla: randomized numerical linear algebra 
- finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions 
- an algorithmic theory of learning: robust concepts and random projection 
- dimensionality reduction by random projection and latent semantic indexing 
- random projection trees and low dimensional manifolds 
- experiments with random projection 
- linear regression with random projections 
- the space complexity of approximating the frequency moments 
- data streams: models and algorithms 
- mining data streams: a review 
- signal recovery from random measurements via orthogonal matching pursuit 
- iterative thresholding for sparse approximations 
- cosamp: iterative signal recovery from incomplete and inaccurate samples 
- from denoising to compressed sensing 
- mash: fast genome and metagenome distance estimation using minhash 
- viral coinfection analysis using a minhash toolkit 
- large-scale sequence comparisons with sourmash 
- optimal densification for fast and accurate minwise hashing 
- densifying one permutation hashing via rotation for fast near neighbor search 
- improved asymmetric locality sensitive hashing (alsh) for maximum inner product search (mips) 
- simple and efficient weighted minwise hashing 
- similarity estimation techniques from rounding algorithms 
- in defense of minhash over simhash 
- hashing algorithms for large-scale learning 
- sectional minhash for near-duplicate detection 
- nthash: recursive nucleotide hashing 
- a resource-frugal probabilistic dictionary and applications in bioinformatics 
- fast and scalable minimal perfect hashing for massive key sets 
- hopscotch hashing 
- robin hood hashing 
- improving the performance of minimizers and winnowing schemes 
- designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing 
- hyperloglog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm 
- sliding hyperloglog: estimating cardinality in a data stream over a sliding window 
- using cascading bloom filters to improve the memory usage for de brujin graphs 
- fast lossless compression via cascading bloom filters 
- improving bloom filter performance on sequence data using k-mer bloom filters 
- an improved construction for counting bloom filters 
- spectral bloom filters 
- diversified race sampling on data streams applied to metagenomic sequence analysis 
- repeated and merged bloom filter for multiple set membership testing (msmt) in sub-linear time 
- sub-linear sequence search via a repeated and merged bloom filter (rambo): indexing tb data in hours 
- efficient generation of transcriptomic profiles by random composite measurements 
- the restricted isometry property and its implications for compressed sensing 
- a simple proof of the restricted isometry property for random matrices 
- adaptive compressed sensing mri with unsupervised learning 
- insense: incoherent sensor selection for sparse signals 
- a data-driven and distributed approach to sparse signal representation and recovery 
- the sparse recovery autoencoder 
- learned d-amp: principled neural network based compressive image recovery 
- deepcodec: adaptive sensing and recovery via deep convolutional neural networks 
- nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection 
- clinical metagenomics 
- generating wgs trees with mashtree 
- variant tolerant read mapping using min-hashing 
- beware the jaccard: the choice of similarity measure is important and non-trivial in genomic colocalisation analysis 
- bindash, software for fast genome distance estimation on a typical personal laptop 
- dashing: fast and accurate genomic distances with hyperloglog 
- finch: a tool adding dynamic abundance filtering to genomic minhashing 
- streaming histogram sketching for rapid microbiome analytics 
- histosketch: fast similarity-preserving sketching of streaming histograms with concept drift 
- kwip: the k-mer weighted inner product, a de novo estimator of genetic similarity 
- the khmer software package: enabling efficient nucleotide sequence analysis 
- locality-sensitive hashing for the edit distance 
- fast search of thousands of short-read sequencing experiments 
- improved search of large transcriptomic sequencing databases using split sequence bloom trees 
- ultrafast search of all deposited bacterial and viral genomic data 
- mash screen: high-throughput sequence containment estimation for genome discovery 
- kraken: ultrafast metagenomic sequence classification using exact alignments 
- fast and sensitive protein alignment using diamond 
- krakenuniq: confident and fast metagenomics classification using unique k-mer counts 
- improved metagenomic analysis with kraken 
- improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index bloom filters 
- efficient computation of spaced seeds 
- ganon: precise metagenomics classification against large and up-to-date sets of reference sequences 
- dream-yara: an exact read mapper for very large databases with short update time 
- strain-level metagenomic assignment and compositional estimation for long reads with metamaps 
- a fast approximate algorithm for mapping long reads to large reference databases 
- a novel data structure to support ultra-fast taxonomic classification of metagenomic sequences with k-mer signatures 
- metagenomic binning through low-density hashing 
- the ecologist's field guide to sequence-based identification of biodiversity 
- a reference-free algorithm for computational normalization of shotgun sequencing data 
- an improved filtering algorithm for big read datasets and its application to single-cell assembly 
- wgsquikr: fast whole-genome shotgun metagenomic classification 
- quikr: a method for rapid reconstruction of bacterial communities via compressive sensing 
- metapalette: a k-mer painting approach for metagenomic taxonomic profiling and quantification of novel strain 
variation mission: ultra large-scale feature selection using count-sketches large-scale machine learning for metagenomics sequence classification how much metagenomic sequencing is enough to achieve a given goal? capturing sequence diversity in metagenomes with comprehensive and scalable probe design a greedy heuristic for the set-covering problem compressive sensing dna microarrays universal microbial diagnostics using random dna probes polymicrobial interactions: impact on pathogenesis and human disease the pros and cons of compressive sensing for wideband signal acquisition: noise folding versus dynamic range genome skimming: a rapid approach to gaining diverse biological insights into multicellular pathogens tackling soil diversity with the assembly of large, complex metagenomes oceanic metagenomics: the sorcerer ii global ocean sampling expedition: northwest atlantic through eastern tropical pacific the ocean sampling day consortium. gigascience, a human gut microbial gene catalogue established by metagenomic sequencing ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses terragenome: a consortium for the sequencing of a soil metagenome img/m v. . : an integrated data management and comparative analysis system for microbial genomes and microbiomes the human microbiome project the european nucleotide archive in the sequence read archive comparative metagenomic and rrna microbial diversity characterization using archaeal and bacterial synthetic communities ncbi reference sequence (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the odni, iarpa, aro or the us government. 
key: cord- -i vmfpp authors: lima, francisco esmaile de sales; cibulski, samuel paulo; dos santos, helton fernandes; teixeira, thais fumaco; varela, ana paula muterle; roehe, paulo michel; delwart, eric; franco, ana cláudia title: genomic characterization of novel circular ssdna viruses from insectivorous bats in southern brazil date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: i vmfpp circoviruses are highly prevalent porcine and avian pathogens. in recent years, novel circular ssdna genomes have been detected in a variety of fecal and environmental samples using deep sequencing approaches. in this study, the identification of genomes of novel circoviruses and cycloviruses in feces of insectivorous bats is reported. pan-reactive primers targeting the conserved rep region of circoviruses and cycloviruses were used to screen bat fecal dna samples. using this approach, partial rep sequences were detected, which formed five phylogenetic groups distributed among the circovirus and the recently proposed cyclovirus genera of the circoviridae. further analysis using inverse pcr and sanger sequencing led to the characterization of four new putative members of the family circoviridae with genome sizes ranging from , to , nt, two inversely arranged orfs, and canonical nonamer sequences atop a stem loop. viruses of the circoviridae family are known to infect a wide range of vertebrates. the virions consist of naked nucleocapsids of about nm in diameter, with a circular single-stranded dna (ssdna) genome of approximately . kb [ ] . they have an ambisense genome organization containing two major open reading frames (orfs), inversely arranged, which encode the replicase (rep) and capsid (cap) proteins and are separated by a ' intergenic region (igr) between the stop codons and a ' igr between the start codons [ ] . some circoviruses are major pathogens of pigs [ ] [ ] [ ] , e.g.
porcine circovirus (pcv ), which causes either asymptomatic infections or clinically apparent disease that may be responsible for significant economic losses [ ] [ ] [ ] [ ] [ ] . in birds, avian circoviruses, within the genus gyrovirus, have been identified in a broad range of avian species; one of them, chicken anemia virus (cav), is a major cause of disease, associated with lymphoid depletion, immunosuppression and developmental abnormalities [ ] [ ] [ ] [ ] [ ] . according to ictv document . a-gv, there is a proposal to remove the genus gyrovirus from the family circoviridae and place it in the anelloviridae, because recent metagenomic studies on gyroviruses have shown very high sequence divergence compared with other members of the circoviridae. recent metagenomic approaches, high-throughput sequencing techniques and degenerate pcrs have led to the identification of small circular dna genomes in fecal samples of wild mammals, in insects, and in environmental samples [ , [ ] [ ] [ ] . some of the newly described circular genomes are similar to those of circoviruses, but phylogenetically different from the previously known avian and porcine circoviruses [ ] . their distinct nucleotide/amino acid composition and genome organization allowed authors to propose the creation of a new genus within the circoviridae, which was named cyclovirus. in comparison to members of the genus circovirus, both the rep and cap genes of cycloviruses are smaller, with a shorter or no ' igr between the stop codons of the two major orfs and a longer ' igr between the start codons of the two major orfs [ ] . sequences related to circoviruses have been identified based on the detection of the conserved rep region involved in rolling circle replication (rcr) [ ] . cyclovirus genomes have been detected in wild animal samples, human feces and cerebrospinal fluid, and in muscle tissues of farm animals such as chickens, cows, sheep, goats, and camels [ , ] .
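the conserved nonanucleotide motif atop the stem loop, which initiates rolling circle replication, is easy to locate programmatically. a minimal python sketch, assuming the commonly cited degenerate pattern nantattac ("n" = any base) and an invented toy genome rather than a real circovirus sequence:

```python
# Sketch: scan a circular ssDNA genome for the conserved nonanucleotide
# motif (pattern "nantattac", where "n" matches any base) that marks the
# origin of rolling-circle replication. The genome below is a made-up
# toy example, not a real circovirus genome.
import re

MOTIF = re.compile("[acgt]a[acgt]tattac")

def find_nonamer(genome: str):
    """Return 0-based positions of the nonamer in a circular genome.

    The genome is circular, so the search wraps across the junction by
    appending the first 8 bases to the end of the sequence.
    """
    genome = genome.lower()
    extended = genome + genome[:8]   # allow matches spanning the junction
    return [m.start() % len(genome) for m in MOTIF.finditer(extended)]

toy_genome = "ggcc" + "tagtattac" + "ttttacgt"  # motif embedded at position 4
print(find_nonamer(toy_genome))                 # -> [4]
```

the wrap-around step matters because a circular genome has no natural start coordinate; a motif that straddles the arbitrary linearization point would otherwise be missed.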
currently, eight different species of cycloviruses have been detected in winged-insect populations, highlighting that they circulate in a wide host range and possess high genetic diversity as well [ ] [ ] [ ] [ ] [ ] [ ] . so far, classification for the genus circovirus considers circoviruses sharing > % genome-wide nucleotide identity and > % amino acid identity in the capsid (cap) protein as belonging to the same species. although there are no species demarcation criteria for the genus cyclovirus, the taxonomic classification for the family circoviridae considers viruses sharing > % in their cap amino acid identity level as belonging to distinct genera [ ] . in the present article, the detection of ssdna genomes from bat fecal samples is reported. genome segments were amplified by consensus/degenerate pcr. whole genome sequencing and phylogenetic analyses of the sequences obtained revealed that four of the sequences represent viral genomes of new members of the family circoviridae.

permission for this work on protected bats was granted by health monitoring (cevs-centro estadual de vigilância em saúde) of the brazilian federal state of rio grande do sul. the study did not involve any direct manipulations of bats and relied entirely on the collection of fecal samples from the attic floor. all experiments were performed in compliance with the european convention for the protection of vertebrate animals used for experimental and other scientific purposes (european treaty series-no. revised ) and the procedures of the brazilian college of animal experimentation (cobea). it must be highlighted that we had the owner's permission to access the attic for the purposes of this study. in case of future surveys in porto alegre, the health monitoring (cevs) will be contacted to obtain the permissions.

a maternity roost of bats was identified in the summer of in the attic of a private residence in the central area of the municipality of porto alegre, rio grande do sul, southern brazil.
the colony was estimated to harbor about insectivorous bats of two species, velvety free-tailed bats (molossus molossus) and brazilian free-tailed bats (tadarida brasiliensis) [ ] . speciation was confirmed by dna extraction from fecal pellets, amplification and sequencing of the mitochondrial cytochrome b (cytb) gene as described [ ] . one hundred fecal samples were collected from the attic floor as follows: a plastic film was spread on the ground of the attic compartment and fresh droppings were collected with clean disposable forks on the following night. each sample consisted of a pool of fecal droppings, which were immediately sent to the laboratory and stored at - °c. the samples were then thawed, resuspended in ml of hank's balanced salt solution (hbss), vortexed and centrifuged at . x g for min. the supernatants were then transferred to fresh tubes, filtered through . μm pore-size syringe filters (fisher scientific, pittsburgh, pa) and submitted to dna extraction. total fecal dna was extracted from μl of the supernatants (above) with phenol-chloroform (invitrogen) [ ] . the extracted dna was eluted in μl of te (tris-hydrochloride buffer, ph . , containing . mm edta), treated with μg/ml of rnase a (invitrogen) and stored at - °c. subsequently, samples were submitted to amplification in a nested pcr targeting the rep gene of circoviruses/cycloviruses with the following degenerate primers: cv-f ( ´-ggiayiccicayyticargg- ´), cv-r ( `-awccaiccrtaraartcrtc- `), cv-f ( ´-ggiayiccicayyticarggitt- ´), and cv-r ( ´-tgytgytcrtaiccrtccc acca- ´) [ ] . briefly, the nested pcr was performed as follows: the first reaction was performed in a μl volume containing to ng of sample dna, mm mgcl , . μm of each primer (cv-f and cv-r ), . u taq dna polymerase (invitrogen), % pcr buffer and . mm dntps. the cycling conditions were: min at °c; cycles of min at °c, min at °c, min at °c and a final incubation at °c for min.
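the degenerate primers above are written with iupac ambiguity codes ("r" = a/g, "y" = c/t, "w" = a/t) plus inosine ("i"), which base-pairs promiscuously. a minimal python sketch of how such a primer can be expanded into a regular expression to find candidate binding sites; the primer and template below are invented toy examples, not the published cv primers:

```python
# Sketch: turn a degenerate primer (IUPAC ambiguity codes; "i" = inosine,
# treated here as matching any base) into a regular expression and locate
# its binding sites on a template strand. Primer and template are toys.
import re

IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "w": "[at]", "s": "[cg]",
    "k": "[gt]", "m": "[ac]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": "[acgt]", "i": "[acgt]",
}

def primer_regex(primer: str) -> re.Pattern:
    return re.compile("".join(IUPAC[b] for b in primer.lower()))

def binding_sites(primer: str, template: str):
    return [m.start() for m in primer_regex(primer).finditer(template.lower())]

# hypothetical short degenerate primer, not one of the published cv primers
print(binding_sites("ggiayicc", "ttggcataccgg"))  # -> [2]
```

note that `finditer` reports non-overlapping matches only, which is adequate for a primer-site sketch but not for exhaustive overlapping-site enumeration.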
for the second (nested) reaction, the μl mix components were: μl of the first reaction product, mm mgcl , . μm of each primer (cv-f and cv-r ), . u taq dna polymerase (invitrogen), % pcr buffer and . mm dntps. the cycling conditions were: min at °c; cycles of min at °c, min at °c, min at °c, and a final incubation at °c for min. products with a size of approximately bp were purified and directly sequenced using primer cv-r . to confirm the sequences, each product was sequenced three times. samples were sequenced with the big dye terminator cycle sequencing ready reaction (applied biosystems, uk) in an abi-prism genetic analyzer (abi, foster city, ca), according to the protocol of the manufacturer. sequences similar to the rep gene sequences of circovirus-like genomes were aligned to design new sets of primers for inverse pcr (ipcr). the ipcr was carried out in a μl reaction mixture optimized with platinum taq hi-fi (invitrogen™) (cycling conditions can be provided upon request) and the primer sequences as follows: . standard precautions were taken to avoid contamination and negative controls were added to each batch of reactions. five microliters of the pcr products were electrophoresed in . % agarose gels and the products visualized under uv light after staining with ethidium bromide. the amplicons corresponding to sizes ranging from - kb were purified and cloned using the pcr . -topo cloning kit (invitrogen™). three insert-containing plasmids of each clone were sequenced with m forward and reverse primers as described above. the full-length genome sequences were assembled by "genome walking" using the geneious software (version . . ). identification of putative orfs was made with the aid of orf finder (ncbi; http://www.ncbi.nlm.nih.gov/gorf/gorf.html). sequence analyses were performed with the blastx software (http://www.ncbi.nlm.nih.gov/blast/).
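as a rough sketch of what an orf finder does for a genome with two inversely arranged orfs, the python below scans both strands of a dna sequence for atg-to-stop reading frames. it is a deliberate simplification (linear scan only, atg starts only, no circular wrap-around) run on a made-up palindromic toy sequence that carries one orf per strand:

```python
# Sketch: naive ORF scan over both strands of a DNA sequence, loosely
# mimicking an ORF finder. Toy sequence; not a real circovirus genome.
COMP = str.maketrans("acgt", "tgca")
STOPS = {"taa", "tag", "tga"}

def revcomp(seq: str) -> str:
    return seq.translate(COMP)[::-1]

def orfs(seq: str, min_codons: int = 3):
    """Yield (strand, start, codon_length) for atg..stop open reading frames."""
    seq = seq.lower()
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "atg" and start is None:
                    start = i                      # first start codon in frame
                elif codon in STOPS and start is not None:
                    n = (i - start) // 3           # codons excluding the stop
                    if n >= min_codons:
                        yield strand, start, n
                    start = None

toy = "atgaaacccgggtaa" + "ttacccgggtttcat"  # one ORF per strand
print(sorted(orfs(toy)))                     # -> [('+', 0, 4), ('-', 0, 4)]
```

coordinates on the minus strand are reported relative to the reverse complement, which is the usual convention for simple scanners; a full tool would map them back to plus-strand coordinates.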
nucleotide sequences were aligned and compared to sequences of human, animal and sewage-associated members of the circoviridae available in the genbank database using clustalw [ ] . the alignments were optimized with the bioedit sequence alignment editor program version . . [ ] . the hairpin and stem-loop structures were identified in mfold [ ] . phylogenetic analysis was carried out in mega [ ] . the confidence of each branch in the phylogeny was estimated with bootstrap values calculated from replicates. for the purpose of this work, the samples were named bat circovirus porto alegre (batcv poa), followed by the cluster number to which each one was assigned.

amplicons with the expected size (about bp) were obtained from out of the ( %) fecal samples screened. the amplified dna was directly sequenced. the nucleotide sequences corresponding to part of the rep gene were determined and submitted to genbank (km -km ). blastx analysis showed that these partial rep sequences have an amino acid identity of - % with those of known circoviruses and - % among themselves. a phylogenetic tree was constructed based on the alignment of the deduced amino acid sequences detected herein with those of the representative circovirus and cyclovirus sequences (fig. ) . as shown in the tree, five main groups were observed, with clusters ii ( sequences), vi ( sequences) and vii ( sequences) falling into the clade of cycloviruses, in contrast to clusters i ( sequences) and v ( sequences), which formed groups distinct and distant from those formed by circoviruses and cycloviruses. the arbitrary division of these sequences into clusters was carried out to analyze their genomic features, assuming that, according to the criteria used for circovirus diversity analysis, sequences with more than % divergence are classified as individual viral species [ ] .
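the divergence-based grouping into clusters described above can be sketched as a simple single-linkage clustering by percent identity. the threshold and the equal-length toy sequences below are illustrative assumptions, not the study's values or data:

```python
# Sketch: group sequences into clusters such that each sequence joins a
# cluster if it exceeds a percent-identity threshold with any member
# (single linkage). Toy equal-length strings stand in for an alignment.
def identity(a: str, b: str) -> float:
    """Percent identity between two equal-length aligned sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def cluster(seqs, threshold=60.0):
    clusters = []
    for s in seqs:
        for c in clusters:
            if any(identity(s, t) > threshold for t in c):
                c.append(s)
                break
        else:
            clusters.append([s])   # no cluster close enough: start a new one
    return clusters

seqs = ["AAAAAAAAAA", "AAAAAAAATT", "GGGGGGGGGG", "GGGGGGGGTA"]
print(cluster(seqs))
# -> [['AAAAAAAAAA', 'AAAAAAAATT'], ['GGGGGGGGGG', 'GGGGGGGGTA']]
```

single linkage can chain distant sequences together through intermediates, which is one reason species demarcation criteria are usually applied to pairwise identities rather than to greedy clusters like this one.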
according to this, we could infer the detection of five potential new species from bat samples ( cycloviruses and circoviruses). complete sequencing of the viral dna from clusters i and v could not be achieved, probably due to the high gc content of the ' igr region; attempts at pcr amplification before sequencing, varying the concentrations of dmso and/or adding % -deaza-gtp and % dgtp (new england biolabs) as performed by rijsewijk et al. [ ] , were made without much success. the two predicted orfs, rep and cap, are present and inversely arranged in all sequences, as shown in fig. . the predicted cap protein sequences consist of - amino acids and share an amino acid identity of - % with the known cycloviruses/circoviruses and . - . % among themselves (tables and ) . the predicted rep protein sequences ranged from to amino acids and have an amino acid identity ranging from . - . % among themselves (tables and ) . stem-loop structures were found in all bat circular genomes. they have a conserved nonanucleotide motif located at the ' igr (nantattac) and are considered to be responsible for initiating the rolling-circle replication of circoviruses [ , ] . as shown in table , all four batcv poa also contain a conserved nonamer sequence in the loop region of the ' igr, different from the conserved cyclovirus and circovirus nonanucleotide motif sequence, but similar to the loop motif of cycloviruses found in bat, human and chimpanzee feces (batcv poa ii, v, vi) and slightly modified from those of cyclovirus and circovirus (batcv poa i) [ , , , ] . the predicted protein sequences encoded by orf (cap) and orf (rep) of batcv i-vi genomes were used for phylogenetic analysis with representative and recently discovered circoviruses/cycloviruses; pepper golden mosaic virus was used as an outgroup, as it is somewhat related to other members of the circoviridae family (fig. a, b and c ) .
as shown in the trees, batcv poa/ /ii and vi fell into the cyclovirus clade already identified in chickens, chimps, bats, goats, humans and dragonflies [ , , , , ] . when analyzing the cap-encoding region (fig. a) , batcv poa/ /ii was related to a cyclovirus detected in muscle tissues of a goat from pakistan through degenerate/consensus pcr [ ] , and batcv poa/ /vi was more closely related to a dragonfly cyclovirus detected through viral metagenomics [ ] . however, when analyzing both genomes according to the conserved rep-encoding region, it was observed that they formed a monophyletic clade (fig. b) . on the other hand, batcv poa/ /i and v fell outside the circovirus and cyclovirus clades and are not yet assigned to any genus of the circoviridae family, clustering along with bat circovirus-like virus tm and batcv-sc [ , ] . this situation was confirmed based on the alignments of the whole genomes, producing a similar tree topology (see fig. c ). these sequences are closer to sequences detected in guano and fecal samples collected from bats in the united states and china through metagenomic approaches, suggesting that these viruses have the same host origin, likely bats [ , ] . however, no classification has yet been formally assigned to these sequences. in this work we report the discovery of novel circular ssdna genomic sequences from the feces of insectivorous bats from brazil. in recent years, many genomes of circoviruses, cycloviruses and rep-containing circular dna viruses have been characterized in mammals, birds, insects and environmental samples [ ] , bringing to light a high level of genetic diversity among these viruses [ , ] . according to our results, two genomes belong to the genus cyclovirus (batcv poa ii and vi). these genomes contain two major orfs in opposite directions and present, in the ' igr of the rep orf, the cyclovirus-conserved nonanucleotide motif ( '-taatactat- ') in their loop region (table ) .
batcv poa i and v present their cap located on the positive strand and the larger rep located on the minus strand, as expected for circoviruses, but this pattern was not present in batcv poa ii and vi, as shown in table . the phylogenetic analyses constructed from the alignments of the complete rep and cap proteins confirm that batcv poa/ii and vi cluster into the genus cyclovirus along with the clade of chinese cyclovirus sequences detected in bat feces [ ] , sharing less than % identity at the cap/rep amino acid level. batcv poa i and v, the two other sequences detected in bat feces in this study, had low amino acid identity of cap (< %) and rep (< %) with known circoviruses/cycloviruses (table ) . consequently, they formed a distinct clade along with other bat-sourced sequences, expanding the view of the diversity of these new ssdna viruses, which are divergent enough at the sequence level that they could very likely be part of a different genus. in our study, we detected cyclovirus- and circovirus-related sequences at a frequency of % in the examined samples. however, due to methodological limitations and restrictions in the location and variety of bat species, we were not able to extrapolate our results to epidemiological data (such as incidence and prevalence) or to determine to which bat species the ssdna-positive samples belonged. as performed by ge et al. in china [ ] , further investigation is needed to determine the prevalence of circoviruses in other brazilian bat species. nevertheless, such studies are clearly worthwhile for understanding the great diversity of circoviruses found worldwide. our study was based on phylogenetic analysis and comparison of the recovered sequences. the finding of known insect viruses in bat feces simply reflects the diet of these insectivorous bats, which play an important role in preying on insects.
viral dna detection in bat feces does not allow one to differentiate between viral replication in bats and simple passage through the digestive tract from ingested food [ , ] . to date, few members of the circovirus genus can be related to severe clinical conditions in animals, with the exception of pcv and some of the avian circoviruses [ ] . even with the recent discovery of many cycloviruses, circovirus-like or rep-like sequences in a variety of mammalian tissues and feces, including human fecal samples [ , , ] , there is no syndrome yet associated with these viruses. nevertheless, the recent identification of a new cyclovirus from vietnamese and malawian patients with acute central nervous system infection of unknown etiology raises the possibility of disease association, yet to be proven [ , ] , although possibly with limited geographic distribution [ ] . in this work, two more circular dna genomes were characterized which did not fall within the circovirus/cyclovirus clades, grouping instead distantly with tm and batcv-sc [ , ] , both also from bat feces. these new genomes have in common the presence, in the rep n-terminus, of the same motifs associated with rolling circle replication (ftlnn, tphlqgy) and dntp binding (gxgks), as well as the conserved amino acid motifs identified in the carboxy-terminal half of rep associated with c helicase function (wwdgy and dryp) [ ] . the n-terminal regions of the cap proteins of batcv poa i and v are highly basic and arginine-rich, as is typical for circovirus capsid proteins, with arginine residues ranging from %- % (genomes i and v, respectively) within the first aa, in contrast to tm ( %) and sc ( %). they are also distinguishable based on their cap and rep sizes (data not shown), as well as on the low amino acid identity for both proteins: batcv poa i and v show a rep identity of < % and a cap identity of < % in relation to tm and sc .
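the motif checks and arginine-content computation described above can be sketched in a few lines of python; the motif spellings follow the text ("x" = any residue) and the protein strings are invented placeholders, not the actual batcv sequences:

```python
# Sketch: check a candidate rep protein for rolling-circle-replication,
# dNTP-binding and helicase motifs ("x" = any residue), and compute the
# arginine content of a cap N-terminus. Toy protein strings only.
import re

MOTIFS = {"rcr": "FTLNN", "dntp_binding": "GxGKS", "helicase": "WWDGY"}

def has_motif(protein: str, motif: str) -> bool:
    pattern = motif.replace("x", "[A-Z]")   # wildcard position
    return re.search(pattern, protein.upper()) is not None

def arginine_fraction(cap_n_terminus: str) -> float:
    """Percent arginine residues in the given N-terminal stretch."""
    return 100.0 * cap_n_terminus.upper().count("R") / len(cap_n_terminus)

toy_rep = "MAFTLNNAGSGKSLLWWDGY"
print([name for name, m in MOTIFS.items() if has_motif(toy_rep, m)])
# -> ['rcr', 'dntp_binding', 'helicase']
print(arginine_fraction("MRRSRRNRRR"))  # -> 70.0
```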
based on these genome characteristics, even though they cluster in a separate, not yet characterized clade, they are new viral species. should other sequences be discovered that group with these genomes, it will be of interest to propose to the international committee on taxonomy of viruses (ictv) the creation of a new genus within the circoviridae. here we report the detection of four novel circular ssdna viruses from bat feces after whole-genome characterization within the family circoviridae. so far, it is not clear whether these newly detected ssdna viruses play an important role in pathogenesis. in addition to bioinformatics analysis, future investigations must include attempts at virus isolation to confirm host origin, which will shed light on the relationships between these circular dna viruses and bats. conceived and designed the experiments: fesl spc pmr. performed the experiments: fesl spc hfs tft apmv. analyzed the data: spc ed. contributed reagents/materials/analysis tools: pmr acf. wrote the paper: fesl spc pmr acf ed.
virus taxonomy: classification and nomenclature of viruses
multiple diverse circoviruses infect farm animals and are commonly found in human and chimpanzee feces
pathogenesis of postweaning multisystemic wasting syndrome caused by porcine circovirus : an immune riddle
isolation of circovirus from lesions of pigs with postweaning multisystemic wasting syndrome
insights into the evolutionary history of an emerging livestock pathogen: porcine circovirus
porcine circoviruses: a review
quantification of porcine circovirus type (pcv ) dna in serum and tonsillar, nasal, tracheo-bronchial, urinary and faecal swabs of pigs with and without postweaning multisystemic wasting syndrome (pmws)
a review of porcine circovirus -associated syndromes and diseases
porcine circovirus type associated disease: update on current terminology, clinical manifestations, pathogenesis, diagnosis, and intervention strategies
recent advances in the epidemiology, diagnosis and control of diseases caused by porcine circovirus type
psittacine beak and feather disease virus nucleotide sequence analysis and its relationship to porcine circovirus, plant circoviruses, and chicken anaemia virus
cloning and sequencing of duck circovirus (ducv)
genome sequence determinations and analyses of novel circoviruses from goose and pigeon
circoviruses: immunosuppressive threats to avian species: a review
identification of a novel circovirus in australian ravens (corvus coronoides) with feather disease
frequent detection of highly diverse variants of cardiovirus, cosavirus, bocavirus, and circovirus in sewage samples collected in the united states
bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses
genetic diversity of novel circular ssdna viruses in bats in china
a field guide to eukaryotic circular single-stranded dna viruses: insights gained from metagenomics
possible cross-species transmission of circoviruses and cycloviruses among farm animals
identification of a new cyclovirus in cerebrospinal fluid of patients with acute central nervous system infections
dragonfly cyclovirus, a novel single-stranded dna virus discovered in dragonflies (odonata: anisoptera)
novel cyclovirus discovered in the florida woods cockroach eurycotis floridana (walker)
high global diversity of cycloviruses amongst dragonflies
detection of alphacoronavirus in velvety free-tailed bats (molossus molossus) and brazilian free-tailed bats (tadarida brasiliensis) from urban area of southern brazil
genomic characterization of severe acute respiratory syndrome-related coronavirus in european bats and classification of coronaviruses based on partial rna-dependent rna polymerase gene sequences
molecular cloning: a laboratory manual: cold spring harbor laboratory press
clustal w and clustal x version .
bioedit: a user-friendly biological sequence alignment editor and analysis program for windows / /nt
"well-determined" regions in rna secondary structure prediction: analysis of small subunit ribosomal rna
mega : molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods
virus taxonomy, ixth report of the international committee for the taxonomy of viruses
discovery of a genome of a distant relative of chicken anemia virus reveals a new member of the genus gyrovirus
rolling-circle replication of an animal circovirus genome in a theta-replicating bacterial plasmid in escherichia coli
rapidly expanding genetic diversity and host range of the circoviridae viral family and other rep encoding small circular ssdna genomes
evaluation of the in vivo radiosensitizing activity of etanidazole using tumor-bearing chick embryo
host effect on the genetic diversification of beet necrotic yellow vein virus single-plant populations
limited geographic distribution of the novel cyclovirus
novel cyclovirus in human cerebrospinal fluid

key: cord- - stnx dw authors: widrich, michael; schäfl, bernhard; pavlović, milena;
ramsauer, hubert; gruber, lukas; holzleitner, markus; brandstetter, johannes; sandve, geir kjetil; greiff, victor; hochreiter, sepp; klambauer, günter title: modern hopfield networks and attention for immune repertoire classification date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: stnx dw a central mechanism in machine learning is to identify, store, and recognize patterns. how to learn, access, and retrieve such patterns is crucial in hopfield networks and the more recent transformer architectures. we show that the attention mechanism of transformer architectures is actually the update rule of modern hopfield networks that can store exponentially many patterns. we exploit this high storage capacity of modern hopfield networks to solve a challenging multiple instance learning (mil) problem in computational biology: immune repertoire classification. accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the covid- crisis. immune repertoire classification based on the vast number of immunosequences of an individual is a mil problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. in this work, we present our novel method deeprc, which integrates transformer-like attention, or equivalently modern hopfield networks, into deep learning architectures for massive mil such as immune repertoire classification. we demonstrate that deeprc outperforms all other methods with respect to predictive performance in large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class.
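the stated equivalence can be made concrete: for a matrix X of stored patterns (one per column) and a state or query xi, one update of a modern hopfield network with continuous states has the form xi_new = X softmax(beta * X^T xi), which is transformer attention with a single query. a minimal numpy sketch under these assumptions; the dimensions, beta and data are toy choices:

```python
# Sketch of the modern Hopfield update rule referenced in the text:
# one retrieval step is a softmax-weighted combination of stored patterns,
# i.e. attention with a single query. Toy values throughout.
import numpy as np

def hopfield_update(X: np.ndarray, xi: np.ndarray, beta: float = 8.0) -> np.ndarray:
    """One retrieval step; X holds one stored pattern per column."""
    scores = beta * (X.T @ xi)        # similarity to each stored pattern
    p = np.exp(scores - scores.max())
    p /= p.sum()                      # softmax over stored patterns
    return X @ p                      # convex combination of stored patterns

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 5))                # 5 stored patterns in 16-d
xi = X[:, 2] + 0.1 * rng.standard_normal(16)    # noisy version of pattern 2
retrieved = hopfield_update(X, xi)
print(np.argmax(X.T @ retrieved))               # index of closest stored pattern
```

with a sufficiently large beta, a single update snaps a noisy query onto the nearest stored pattern; this one-step retrieval is what lets the mechanism double as an attention layer.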
source code and datasets: https://github.com/ml-jku/deeprc

transformer architectures (vaswani et al., ) and their attention mechanisms are currently used in many applications, such as natural language processing (nlp), imaging, and also in multiple instance learning (mil) problems. in mil, a set or bag of objects is labelled rather than objects themselves as in standard supervised learning tasks (dietterich et al., ) . examples for mil problems are medical images, in which each sub-region of the image represents an instance, video classification, in which each frame is an instance, text classification, where words or sentences are instances of a text, point sets, where each point is an instance of a d object, and remote sensing data, where each sensor is an instance (carbonneau et al., ; uriot, ) .

[figure caption: a pooling function f is used to obtain a repertoire-representation z for the input object. finally, an output network o predicts the class label ŷ. b) deeprc uses stacked d convolutions for a parameterized function h due to their computational efficiency. potentially, millions of sequences have to be processed for each input object. in principle, also recurrent neural networks (rnns), such as lstms (hochreiter et al., ) , or transformer networks (vaswani et al., ) may be used but are currently computationally too costly. c) attention-pooling is used to obtain a repertoire-representation z for each input object, where deeprc uses weighted averages of sequence-representations. the weights are determined by an update rule of modern hopfield networks that allows to retrieve exponentially many patterns.]
However, in the MIL problems considered by machine learning methods up to now, the number of instances per bag is in the range of hundreds or a few thousands (Carbonneau et al., ; Lee et al., ) (see also Tab. A). At the same time, the witness rate (WR), the rate of discriminating instances per bag, is already considered low at % - %. We tackle the problem of immune repertoire classification with hundreds of thousands of instances per bag, without instance-level labels, and with extremely low witness rates down to . % using an attention mechanism. We show that the attention mechanism of transformers is the update rule of modern Hopfield networks (Krotov & Hopfield, ; Demircigil et al., ), generalized to continuous states in contrast to classical Hopfield networks (Hopfield, ). A detailed derivation and analysis of modern Hopfield networks is given in our companion paper (Ramsauer et al., ). These novel continuous-state Hopfield networks can store and retrieve exponentially many patterns (exponentially in the dimension of the space; see next section). Thus, modern Hopfield networks with their update rule, which is used as the attention mechanism in the transformer, enable immune repertoire classification in computational biology. Immune repertoire classification, i.e. classifying the immune status based on the immune repertoire sequences, is essentially a textbook example of a multiple instance learning problem (Dietterich et al., ; Maron & Lozano-Pérez, ; Wang et al., ). Briefly, the immune repertoire of an individual consists of an immensely large bag of immune receptors, represented as amino acid sequences. Usually, the presence of only a small fraction of particular receptors determines the immune status with respect to a particular disease (Christophersen et al., ; Emerson et al., ). This is because the immune system has already acquired resistance if one or a few particular immune receptors that can bind to the disease agent are present.
Therefore, classification of immune repertoires is highly difficult, since each immune repertoire can contain millions of sequences as instances, with only a few indicating the class. Further properties of the data that complicate the problem are: (a) the overlap of immune repertoires of different individuals is low (in most cases, maximally low single-digit percentage values) (Greiff et al., ; Elhanati et al., ), (b) multiple different sequences can bind to the same pathogen (Wucherpfennig et al., ), and (c) only subsequences within the sequences determine whether binding to a pathogen is possible (Dash et al., ; Glanville et al., ; Akbar et al., ; Springer et al., ; Fischer et al., ). In summary, immune repertoire classification can be formulated as multiple instance learning with an extremely low witness rate and large numbers of instances, which represents a challenge for currently available machine learning methods. Furthermore, the methods should ideally be interpretable, since the extraction of class-associated sequence motifs is desired to gain crucial biological insights. The acquisition of human immune repertoires has been enabled by immunosequencing technology (Georgiou et al., ; Brown et al., ), which makes it possible to obtain the immune receptor sequences and immune repertoires of individuals. Each individual is uniquely characterized by their immune repertoire, which is acquired and changed during life. This repertoire may be influenced by all diseases that an individual is exposed to during their life and hence contains highly valuable information about those diseases and the individual's immune status. Immune receptors enable the immune system to specifically recognize disease agents or pathogens. Each immune encounter is recorded as an immune event in immune memory by preserving and amplifying the immune receptors in the repertoire that were used to fight a given disease. This is, for example, the working principle of vaccination.
Each human has about - unique immune receptors, with low overlap across individuals, sampled from a potential diversity of > receptors (Mora & Walczak, ). The ability to sequence and analyze human immune receptors at large scale has led to fundamental and mechanistic insights into the adaptive immune system and has also opened up opportunities for the development of novel diagnostics and therapy approaches (Georgiou et al., ; Brown et al., ). Immunosequencing data have been analyzed with computational methods for a variety of different tasks (Greiff et al., ; Shugay et al., ; Miho et al., ; Yaari & Kleinstein, ; Wardemann & Busse, ). A large part of the available machine learning methods for immune receptor data has focused on the individual immune receptors in a repertoire, with the aim to, for example, predict the antigen or antigen portion (epitope) to which these sequences bind, or to predict the sharing of receptors across individuals (Gielis et al., ; Springer et al., ; Jurtz et al., ; Moris et al., ; Fischer et al., ; Greiff et al., ; Sidhom et al., ; Elhanati et al., ). Recently, Jurtz et al. ( ) used 1D convolutional neural networks (CNNs) to predict antigen binding of T-cell receptor (TCR) sequences (specifically, binding of TCR sequences to peptide-MHC complexes) and demonstrated that motifs can be extracted from these models. Similarly, Konishi et al. ( ) used CNNs, gradient boosting, and other machine learning techniques on B-cell receptor (BCR) sequences to distinguish tumor tissue from normal tissue. However, the methods presented so far predict a particular class, the epitope, based on a single input sequence. Immune repertoire classification has been considered as a MIL problem in the following publications. A deep learning framework called DeepTCR (Sidhom et al., ) implements several deep learning approaches for immunosequencing data.
The computational framework, inter alia, allows for attention-based MIL repertoire classifiers and implements a basic form of attention-based averaging. Ostmeyer et al. ( ) already suggested a MIL method for immune repertoire classification. This method considers k-mers, fixed sub-sequences of length k, as instances of an input object and trains a logistic regression model with these k-mers as input. The predictions of the logistic regression model for each k-mer are max-pooled to obtain one prediction per input object. This approach is characterized by (a) the rigidity of the k-mer features as compared to convolutional kernels (Alipanahi et al., ; Zhou & Troyanskaya, ; Zeng et al., ), (b) the max-pooling operation, which constrains the network to learn from only the single top-ranked k-mer in each iteration over the input object, and (c) the pooling of prediction scores rather than of representations (Wang et al., ). Our experiments also support that these design choices can limit the predictive performance (see Table ). Our proposed method, DeepRC, also uses a MIL approach but considers sequences rather than k-mers as instances within an input object, together with a transformer-like attention mechanism. DeepRC sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire-representation rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1D convolutions or LSTMs. In this work, we contribute the following: we demonstrate that continuous generalizations of binary modern Hopfield networks (Krotov & Hopfield, ; Demircigil et al., ) have an update rule that is known as the attention mechanism in the transformer.
We show that these modern Hopfield networks have exponential storage capacity, which allows them to extract patterns among a large set of instances (next section). Based on this result, we propose DeepRC, a novel deep MIL method based on modern Hopfield networks for large bags of complex sequences, as they occur in immune repertoire classification (section "Deep Repertoire Classification"). We evaluate the predictive performance of DeepRC and other machine learning approaches for the classification of immune repertoires in a large comparative study (section "Experimental Results").

Exponential storage capacity of continuous-state modern Hopfield networks with transformer attention as update rule. In this section, we show that modern Hopfield networks have exponential storage capacity, which will later allow us to approach massive multiple instance learning problems, such as immune repertoire classification. See our companion paper (Ramsauer et al., ) for a detailed derivation and analysis of modern Hopfield networks. We assume patterns $x_1, \ldots, x_N \in \mathbb{R}^d$ that are stacked as columns of the matrix $X = (x_1, \ldots, x_N)$, and a query pattern $\xi$ that also represents the current state. The largest norm of a pattern is $M = \max_i \|x_i\|$. The separation $\Delta_i$ of a pattern $x_i$ is defined as its minimal dot-product difference to any of the other patterns: $\Delta_i = \min_{j, j \neq i} \left( x_i^T x_i - x_i^T x_j \right)$. We consider a modern Hopfield network with current state $\xi$ and the energy function $E = -\beta^{-1} \log \left( \sum_{i=1}^N \exp(\beta\, x_i^T \xi) \right) + \frac{1}{2} \xi^T \xi + \beta^{-1} \log N + \frac{1}{2} M^2$. For energy $E$ and state $\xi$, the update rule $\xi^{\mathrm{new}} = X\, \mathrm{softmax}(\beta X^T \xi)$ is proven to converge globally to stationary points of the energy $E$, which are local minima or saddle points (see Ramsauer et al. ( ), appendix, Theorem A). Surprisingly, this update rule is also the formula of the well-known transformer attention mechanism. To see this more clearly, we simultaneously update several queries $\xi_i$. Furthermore, the queries $\xi_i$ and the patterns $x_i$ are linear mappings of vectors $y_i$ into the space $\mathbb{R}^d$.
For matrix notation, we set $x_i = W_K^T y_i$ and $\xi_i = W_Q^T y_i$, and multiply the result of our update rule by $W_V$. Using $Y = (y_1, \ldots, y_N)^T$, we define the matrices $Q = Y W_Q$ (queries), $K = Y W_K$ (keys), and $V = Y W_K W_V$ (values); the patterns are now mapped to the Hopfield space of dimension $d = d_k$. We set $\beta = 1/\sqrt{d_k}$ and apply the softmax row-wise. The update rule, multiplied by $W_V$ and performed for all queries simultaneously, becomes in row-vector notation $\mathrm{softmax}\!\left(\frac{1}{\sqrt{d_k}}\, Q K^T\right) V$. This formula is the transformer attention. If the patterns $x_i$ are well separated, the iterate converges to a fixed point close to a pattern to which the initial $\xi$ is similar. If the patterns are not well separated, the iterate converges to a fixed point close to the arithmetic mean of the patterns. If some patterns are similar to each other but well separated from all other vectors, then a metastable state between the similar patterns exists, and iterates that start near a metastable state converge to it. For details, see Ramsauer et al. ( ), appendix, Sect. A. Typically, the update converges after one update step (see Ramsauer et al. ( ), appendix, Theorem A) and has an exponentially small retrieval error (see Ramsauer et al. ( ), appendix, Theorem A). Our main concern for the application to immune repertoire classification is the number of patterns that can be stored and retrieved by the modern Hopfield network, or equivalently by the transformer attention head. The storage capacity of an attention mechanism is critical for massive MIL problems. We first define what we mean by storing and retrieving patterns from the modern Hopfield network. Definition (pattern stored and retrieved). We assume that around every pattern $x_i$ a sphere $S_i$ is given. We say $x_i$ is stored if there is a single fixed point $x_i^* \in S_i$ to which all points $\xi \in S_i$ converge. For randomly chosen patterns, the number of patterns that can be stored is exponential in the dimension $d$ of the space of the patterns ($x_i \in \mathbb{R}^d$). Theorem.
We assume a failure probability $0 < p$ and randomly chosen patterns on the sphere with radius $M = K\sqrt{d-1}$. We define $a := \frac{2}{d-1}\left(1 + \ln\!\left(2 \beta K^2 p (d-1)\right)\right)$, $b := \frac{2 K^2 \beta}{5}$, and $c = \frac{b}{W_0(\exp(a + \ln(b)))}$, where $W_0$ is the upper branch of the Lambert $W$ function, and ensure $c \geq \left(\frac{2}{\sqrt{p}}\right)^{\frac{4}{d-1}}$. Then, with probability $1 - p$, the number of random patterns that can be stored is $N \geq \sqrt{p}\, c^{\frac{d-1}{4}}$. Examples are c ≥ . for β = , K = , d = and p = . (a + ln(b) > . ), and c ≥ . for β = K = , d = , and p = . (a + ln(b) < - . ). See Ramsauer et al. ( ), appendix, Theorem A for a proof. We have established that a modern Hopfield network, or equivalently a transformer attention mechanism, can store and retrieve exponentially many patterns. This allows us to approach MIL problems with massive numbers of instances, from which a few have to be retrieved with an attention mechanism.

Deep repertoire classification. Problem setting and notation. We consider a MIL problem in which an input object $X$ is a bag of $N$ instances $X = \{s_1, \ldots, s_N\}$. The instances have no dependencies or orderings between them, and $N$ can be different for every object. We assume that each instance $s_i$ is associated with a label $y_i \in \{0, 1\}$ (assuming a binary classification task), to which we do not have access. We only have access to the label $y = \max_i y_i$ of an input object or bag. Note that this poses a credit assignment problem, since the sequences that are responsible for the label $y$ have to be identified, and the relation between instance-label and bag-label can be more complex (Foulds & Frank, ). A model $\hat{y} = g(X)$ should be (a) invariant to permutations of the instances and (b) able to cope with the fact that $N$ varies across input objects (Ilse et al., ), a problem also posed by point sets (Qi et al., ). Two principled approaches exist. The first approach is to learn an instance-level scoring function $h: \mathcal{S} \rightarrow [0, 1]$, which is then pooled across instances with a pooling function $f$, for example by average-pooling or max-pooling (see below).
The second approach is to construct an instance-representation $z_i$ of each instance by $h: \mathcal{S} \rightarrow \mathbb{R}^{d_v}$ and then encode the bag, or input object, by pooling these instance-representations (Wang et al., ) via a function $f$. An output function $o: \mathbb{R}^{d_v} \rightarrow [0, 1]$ subsequently classifies the bag. The second approach, pooling representations rather than scores, currently performs best (Wang et al., ). In the problem at hand, the input object $X$ is the immune repertoire of an individual, which consists of a large set of immune receptor sequences (T-cell receptors or antibodies). Immune receptors are primarily represented as sequences $s_i$ from a space $s_i \in \mathcal{S}$. These sequences act as the instances in the MIL problem. Although immune repertoire classification can readily be formulated as a MIL problem, it is as yet unclear how well machine learning methods can solve the above-described problem with a large number of instances $N$ and with instances $s_i$ being complex sequences. Next, we describe the pooling functions currently used for MIL problems.

Pooling functions for MIL problems. Different pooling functions equip a model $g$ with the property of being invariant to permutations of instances and with the ability to process different numbers of instances. Typically, a neural network $h_\theta$ with parameters $\theta$ is trained to obtain a function that maps each instance onto a representation, $z_i = h_\theta(s_i)$, and a pooling function $z = f(\{z_1, \ldots, z_N\})$ then supplies a representation $z$ of the input object $X = \{s_1, \ldots, s_N\}$. The following pooling functions are typically used. Average-pooling: $z = \frac{1}{N} \sum_{i=1}^N z_i$. Max-pooling: $z = \sum_{m=1}^{d_v} e_m \max_i \{z_{i,m}\}$, where $e_m$ is the standard basis vector for dimension $m$. Attention-pooling: $z = \sum_{i=1}^N a_i z_i$, where the weights $a_i$ are non-negative ($a_i \geq 0$), sum to one ($\sum_{i=1}^N a_i = 1$), and are determined by an attention mechanism. These pooling functions are invariant to permutations of $\{1, \ldots, N\}$ and are differentiable. Therefore, they are suited as building blocks for deep learning architectures.
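The three pooling functions can be sketched in a few lines of NumPy (a schematic illustration of the definitions above, not the paper's code; the bag contents and weights are arbitrary stand-ins):

```python
import numpy as np

def average_pool(Z):
    """z = (1/N) * sum_i z_i over the instance axis."""
    return Z.mean(axis=0)

def max_pool(Z):
    """Dimension-wise maximum: z_m = max_i z_{i,m}."""
    return Z.max(axis=0)

def attention_pool(Z, a):
    """z = sum_i a_i z_i with non-negative weights a that sum to one."""
    assert np.all(a >= 0) and np.isclose(a.sum(), 1.0)
    return a @ Z

# Toy bag of N = 5 instance-representations with d_v = 3.
rng = np.random.default_rng(1)
Z = rng.normal(size=(5, 3))
a = np.array([0.7, 0.1, 0.1, 0.05, 0.05])  # e.g. produced by an attention mechanism
z_avg, z_max, z_att = average_pool(Z), max_pool(Z), attention_pool(Z, a)
```

Note that uniform weights recover average-pooling, so attention-pooling strictly generalizes it. All three reductions are invariant to permuting the instances; for attention-pooling this holds because the weights are computed from the instances and therefore permute along with them.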
We employ attention-pooling in our DeepRC model, as detailed in the following.

Modern Hopfield networks viewed as transformer-like attention mechanisms. The modern Hopfield networks introduced above have a storage capacity that is exponential in the dimension of the vector space and converge after just one update (see Ramsauer et al. ( ), appendix). Additionally, the update rule of modern Hopfield networks is known as the key-value attention mechanism, which has been highly successful through the transformer (Vaswani et al., ) and BERT (Devlin et al., ) models in natural language processing. Therefore, using modern Hopfield networks with the key-value attention mechanism as update rule is the natural choice for our task. In particular, modern Hopfield networks are theoretically justified for storing and retrieving the large number of vectors (sequence patterns) that appear in the immune repertoire classification task. Instead of using the terminology of modern Hopfield networks, we explain our DeepRC architecture in terms of key-value attention (the update rule of the modern Hopfield network), since it is well known in the deep learning community. The attention mechanism assumes a space of dimension $d_k$ in which keys and queries are compared. A set of $N$ key vectors is combined into the matrix $K$. A set of $d_q$ query vectors is combined into the matrix $Q$. Similarities between queries and keys are computed by inner products; therefore, queries can search for similar keys that are stored. Another set of $N$ value vectors is combined into the matrix $V$. The output of the attention mechanism is a weighted average of the value vectors for each query $q$. The $i$-th vector $v_i$ is weighted by the similarity between the $i$-th key $k_i$ and the query $q$. The similarity is given by the softmax of the inner products of the query $q$ with the keys $k_i$. All queries are computed in parallel via matrix operations.
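The key-value attention just described, and its Hopfield fixed-point behavior, can be illustrated with a minimal NumPy sketch (our own illustration, not the DeepRC reference implementation; dimensions and data are arbitrary, and orthonormal keys stand in for well-separated patterns):

```python
import numpy as np

def att(Q, K, V, beta):
    """Key-value attention att(Q, K, V; beta) = softmax(beta * Q K^T) V,
    with the softmax applied row-wise (one weight distribution per query)."""
    S = beta * Q @ K.T
    S = S - S.max(axis=1, keepdims=True)  # subtract row-max for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=1, keepdims=True)
    return P @ V

rng = np.random.default_rng(0)
d_k, d_v = 8, 4
K, _ = np.linalg.qr(rng.normal(size=(d_k, d_k)))  # orthonormal keys: well separated
V = rng.normal(size=(d_k, d_v))                   # one value vector per key
Q = rng.normal(size=(3, d_k))                     # three queries, processed in parallel

out = att(Q, K, V, beta=1.0 / np.sqrt(d_k))       # transformer scaling beta = 1/sqrt(d_k)

# Hopfield view: a query equal to a stored key retrieves that key's value
# when beta is large (well-separated patterns) ...
retrieved = att(K[:1], K, V, beta=50.0)
# ... while beta -> 0 returns the arithmetic mean of all values.
mean_out = att(Q, K, V, beta=0.0)
```

Each output row is a convex combination of the value vectors, which is what makes the mechanism differentiable and permutation-invariant over the stored patterns.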
Consequently, the attention function $\mathrm{att}(Q, K, V; \beta)$ maps queries $Q$, keys $K$, and values $V$ to $d_v$-dimensional outputs: $\mathrm{att}(Q, K, V; \beta) = \mathrm{softmax}(\beta Q K^T)\, V$ (see also Eq. ( )). While this attention mechanism was originally developed for sequence tasks (Vaswani et al., ), it can readily be transferred to sets (Ye et al., ). This type of attention mechanism will be employed in DeepRC.

The DeepRC method. We propose a novel method, Deep Repertoire Classification (DeepRC), for immune repertoire classification with attention-based deep massive multiple instance learning, and compare it against other machine learning approaches. For DeepRC, we consider immune repertoires as input objects, which are represented as bags of instances. In a bag, each instance is an immune receptor sequence, and each bag can contain a large number of sequences. Note that we will use $z_i$ to denote the sequence-representation of the $i$-th sequence and $z$ to denote the repertoire-representation. At its core, DeepRC consists of a transformer-like attention mechanism that extracts the most important information from each repertoire. We first give an overview of the attention mechanism and then provide details on each of the sub-networks $h_1$, $h_2$, and $o$ of DeepRC.

Attention mechanism in DeepRC. This mechanism is based on the three matrices $K$ (the keys), $Q$ (the queries), and $V$ (the values), together with a parameter $\beta$. Values. DeepRC uses a 1D convolutional network $h_1$ (LeCun et al., ; Hu et al., ; Kelley et al., ) that supplies a sequence-representation $z_i = h_1(s_i)$, which acts as the values $V = Z = (z_1, \ldots, z_N)$ in the attention mechanism (see Figure ). Keys. A second neural network $h_2$, which shares its first layers with $h_1$, is used to obtain the keys $K \in \mathbb{R}^{N \times d_k}$ for each sequence in the repertoire. This network uses self-normalizing layers (Klambauer et al., ) with units per layer (see Figure ). Query. We use a fixed $d_k$-dimensional query vector $\xi$, which is learned via backpropagation.
For multiple attention heads, each head has its own fixed query vector. With the quantities introduced above, the transformer attention mechanism (Eq. ( )) of DeepRC is implemented as $z = \mathrm{softmax}\!\left(\frac{\xi^T K^T}{\sqrt{d_k}}\right) Z$, where $Z \in \mathbb{R}^{N \times d_v}$ are the sequence-representations stacked row-wise, $K$ are the keys, and $z$ is the repertoire-representation, which is at the same time a weighted mean of the sequence-representations $z_i$. The attention mechanism can readily be extended to multiple queries; however, the computational demand could constrain this, depending on the application and dataset. Theorem demonstrates that this mechanism is able to retrieve a single pattern out of several hundreds of thousands.

Attention-pooling and interpretability. Each input object, i.e. repertoire, consists of a large number $N$ of sequences, which are reduced to a single fixed-size feature vector of length $d_v$ representing the whole input object by an attention-pooling function. To this end, a transformer-like attention mechanism adapted to sets is realized in DeepRC, which supplies $a_i$, the importance of sequence $s_i$. This importance value is an interpretable quantity, which is highly desired for the immunological problem at hand. Thus, DeepRC allows for two forms of interpretability. (a) A trained DeepRC model can compute the attention weights $a_i$, which directly indicate the importance of a sequence. (b) DeepRC furthermore allows for the use of contribution analysis methods, such as Integrated Gradients (IG) (Sundararajan et al., ) or layer-wise relevance propagation (Montavon et al., ; Arras et al., ). See Sect. A for details.

Classification layer and network parameters. The repertoire-representation $z$ is then used as input for a fully connected output network $\hat{y} = o(z)$ that predicts the immune status, where we found it sufficient to train single-layer networks. In the simplest case, DeepRC predicts a single target, the class label $y$, e.g. the immune status of an immune repertoire, using one output value.
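The fixed-query attention-pooling described above can be sketched as follows (a minimal NumPy illustration, not the reference implementation; random matrices stand in for the outputs of the sub-networks $h_1$ and $h_2$, and all dimensions are arbitrary):

```python
import numpy as np

def deeprc_attention_pool(Z, K, xi):
    """Attention-pooling with a single fixed query.
    Z  : (N, d_v) sequence-representations (the values, from h1)
    K  : (N, d_k) keys (from h2)
    xi : (d_k,)   fixed query vector, learned via backpropagation
    Returns the repertoire-representation z and the attention weights a."""
    d_k = K.shape[1]
    logits = K @ xi / np.sqrt(d_k)
    a = np.exp(logits - logits.max())
    a = a / a.sum()                  # softmax over the N sequences
    z = a @ Z                        # weighted mean of sequence-representations
    return z, a

# Random stand-ins for a repertoire of 1000 sequences.
rng = np.random.default_rng(3)
N, d_v, d_k = 1000, 32, 16
Z = rng.normal(size=(N, d_v))
K = rng.normal(size=(N, d_k))
xi = rng.normal(size=d_k)
z, a = deeprc_attention_pool(Z, K, xi)  # a_i is the importance of sequence i
```

The weight vector `a` is exactly the interpretable quantity discussed above: the sequence whose key is most similar to the query receives the largest weight.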
However, since DeepRC is an end-to-end deep learning model, multiple targets may be predicted simultaneously, in classification or regression settings or a mix of both. This allows the introduction of additional information into the system via auxiliary targets such as age, sex, or other metadata.

[Table with sub-networks $h_1$, $h_2$, and $o$; $d_l$ indicates the sequence length.]

Network parameters, training, and inference. DeepRC is trained using standard gradient descent methods to minimize a cross-entropy loss. The network parameters are $\theta_1$, $\theta_2$, and $\theta_o$ for the sub-networks $h_1$, $h_2$, and $o$, respectively, and additionally $\xi$. In more detail, we train DeepRC using Adam (Kingma & Ba, ) with a batch size of and dropout of input sequences.

Implementation. To reduce computation time, the attention network first computes the attention weights $a_i$ for each sequence $s_i$ in a repertoire. Subsequently, only the top % of sequences with the highest $a_i$ per repertoire are used to compute the weight updates and the prediction. Furthermore, the computation of the $z_i$ is performed in -bit precision, all other computations in -bit precision, to ensure numerical stability in the softmax. See Sect. A for details.

Experimental results. In this section, we report and analyze the predictive power of DeepRC and the compared methods on several immunosequencing datasets. The ROC-AUC is used as the main metric for predictive power.

Methods compared. We compared previous methods for immune repertoire classification, namely the logistic MIL approach of Ostmeyer et al. ( ) ("Log. MIL (kmer)", "Log. MIL (TCRb)") and a burden test (Emerson et al., ), as well as the baseline methods logistic regression ("Log. regr."), k-nearest neighbour ("KNN"), and support vector machines ("SVM") with kernels designed for sets, such as the Jaccard kernel ("J") and the MinMax ("MM") kernel (Ralaivola et al., ).
For the simulated data, we also added baseline methods that search for the implanted motif, either in a binary or a continuous fashion ("Known motif b.", "Known motif c."), assuming that this motif is known (for details, see Sect. A).

Datasets. We aimed at constructing immune repertoire classification scenarios with varying degrees of difficulty and realism in order to compare and analyze the suggested machine learning methods. To this end, we either use simulated or experimentally observed immune receptor sequences, and we implant signals, specifically sequence motifs or sets thereof (Weber et al., ), at different frequencies into the sequences of repertoires of the positive class. These frequencies represent the witness rates and range from . % to %. Overall, we compiled four categories of datasets: (a) simulated immunosequencing data with implanted signals, (b) LSTM-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data with known immune status, the CMV dataset (Emerson et al., ). The average number of instances per bag, which is the number of sequences per immune repertoire, is ≈ , except for category (c), in which we consider the scenario of low-coverage data with only , sequences per repertoire. The number of repertoires per dataset ranges from to , . In total, all datasets comprise ≈ billion sequences or instances, which represents the largest comparative study on immune repertoire classification to date (see Sect. A).

Hyperparameter selection. We used a nested -fold cross-validation (CV) procedure to estimate the performance of each of the methods. All methods could adjust their most important hyperparameters on a validation set in the inner loop of the procedure. See Sect. A for details.

Table : Results in terms of AUC of the competing methods on all datasets.
The reported errors are standard deviations across cross-validation (CV) folds (except for the column "simulated"). Real-world CMV: average performance over CV folds on the CMV dataset (Emerson et al., ). Real-world data with implanted signals: average performance over CV folds for each of the four datasets; a signal was implanted with a frequency (= witness rate) of % or . %, and either a single motif ("OM") or multiple motifs ("MM") were implanted. LSTM-generated data: average performance over CV folds for each of the datasets; in each dataset, a signal was implanted with a frequency of %, %, . %, . %, or . %, respectively. Simulated: here we report the mean over the simulated datasets with implanted signals and varying difficulties (see Tab. A for details); the error reported is the standard deviation of the AUC values across the datasets.

Results. In each of the four categories, "real-world data", "real-world data with implanted signals", "LSTM-generated data", and "simulated immunosequencing data", DeepRC outperforms all competing methods with respect to average AUC. Across categories, the runner-up methods are either the SVM for MIL problems with MinMax kernel or the burden test (see Table and Sect. A).

Results on simulated immunosequencing data. In this setting, the complexity of the implanted signal is in focus and varies across the simulated datasets (see Sect. A). Some datasets are challenging for the methods because the implanted motif is hidden by noise, and others because only a small fraction of sequences carries the motif, yielding a low witness rate. These difficulties become evident for the method called "known motif binary", which assumes the implanted motif is known. The performance of this method ranges from a perfect AUC of . in several datasets to an AUC of . in dataset ' ' (see Sect. A). DeepRC outperforms all other methods with an average AUC of . ± . , followed by the SVM with MinMax kernel with an average AUC of . ± . (see Sect. A).
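To make the "known motif binary" baseline and the signal-implanting setup concrete, here is a toy sketch (hypothetical alphabet of length-15 sequences, a hypothetical motif "LDRWQ", and an illustrative witness rate; not the actual simulated datasets, which additionally add noise to the motif):

```python
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 amino acid one-letter codes
MOTIF = "LDRWQ"                # hypothetical implanted motif

def make_repertoire(n_seqs, witness_rate, rng, seq_len=15):
    """Random AA sequences; a witness_rate fraction carries the implanted motif."""
    reps = []
    for i in range(n_seqs):
        seq = "".join(rng.choice(AAS) for _ in range(seq_len))
        if i < witness_rate * n_seqs:              # implant the signal
            pos = rng.randrange(seq_len - len(MOTIF))
            seq = seq[:pos] + MOTIF + seq[pos + len(MOTIF):]
        reps.append(seq)
    return reps

def known_motif_binary(repertoire):
    """Baseline: predict positive iff any sequence contains the known motif."""
    return int(any(MOTIF in s for s in repertoire))

rng = random.Random(42)
positive = make_repertoire(2000, witness_rate=0.01, rng=rng)   # 1% witness rate
negative = make_repertoire(2000, witness_rate=0.0, rng=rng)
```

With a noise-free motif, this binary check is nearly perfect; the simulated datasets above become hard precisely when the motif is noisy or the witness rate drops far below this toy value.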
The predictive performance of all methods suffers if the signal occurs in only an extremely small fraction of the sequences. In datasets in which only . % of the sequences carry the motif, all AUC values are below . .

Results on LSTM-generated data. On the LSTM-generated data, in which we implanted noisy motifs with frequencies of %, %, . %, . %, and . %, DeepRC yields almost perfect predictive performance with an average AUC of . ± . (see Sect. A and A). The second-best method, the SVM with MinMax kernel, has a predictive performance similar to DeepRC on all datasets, but the other competing methods have lower predictive performance on datasets with a low frequency of the signal ( . %).

Results on real-world data with implanted motifs. In this dataset category, we used real immunosequences and implanted single or multiple noisy motifs. Again, DeepRC outperforms all other methods with an average AUC of . ± . , the second-best method being the burden test with an average AUC of . ± . . Notably, all methods except DeepRC have difficulties with noisy motifs at a frequency of . % (see Tab. A).

Results on real-world data. On the real-world dataset, in which the immune status of persons affected by the cytomegalovirus (CMV) has to be predicted, the competing methods yield predictive AUCs between . and . (see Table ). We note that this dataset is not the exact dataset that was used in Emerson et al. ( ): it differs in pre-processing, comprises a different set of samples, and has a smaller training set due to the nested -fold cross-validation procedure, which leads to a more challenging dataset. The best-performing method is DeepRC with an AUC of . ± . , followed by the SVM with MinMax kernel (AUC . ± . ) and the burden test with an AUC of . ± . . The top-ranked sequences by DeepRC significantly correspond to those detected by Emerson et al.
( ), which we tested with a Mann-Whitney U-test under the null hypothesis that the attention values of the sequences detected by Emerson et al. ( ) are equal to the attention values of the remaining sequences (p-value of . · − ). The sequence attention values are displayed in Tab. A.

We have demonstrated how modern Hopfield networks and attention mechanisms enable the successful classification of the immune status of immune repertoires. For this task, methods have to identify the discriminating sequences amongst a large set of sequences in an immune repertoire. Specifically, even motifs within those sequences have to be identified. We have shown that DeepRC, a modern Hopfield network and an attention mechanism with a fixed query, can solve this difficult task despite the massive number of instances. DeepRC furthermore outperforms the compared methods across a range of different experimental conditions.

Impact on machine learning and related scientific fields. We envision that with (a) the increasing availability of large immunosequencing datasets (Kovaltsuk et al., ; Corrie et al., ; Christley et al., ; Zhang et al., ; Rosenfeld et al., ; Shugay et al., ), (b) further fine-tuning of ground-truth benchmarking immune receptor datasets (Weber et al., ; Olson et al., ; Marcou et al., ), (c) accounting for repertoire-impacting factors such as age, sex, ethnicity, and environment (potential confounding factors), and (d) increased GPU memory and computing power, it will be possible to identify discriminating immune receptor motifs for many diseases, potentially even for the current SARS-CoV- (COVID- ) pandemic (Minervina et al., ; Galson et al., ). Such results would greatly benefit ongoing research on antibody- and TCR-driven immunotherapies and immunodiagnostics, as well as rational vaccine design (Brown et al., ).
In the course of this development, the experimental verification and interpretation of machine-learning-identified motifs could receive additional focus, as for most of the sequences within a repertoire the corresponding antigen is unknown. Nevertheless, recent technological breakthroughs in high-throughput antigen-labeled immunosequencing are beginning to generate large-scale antigen-labeled single-immune-receptor-sequence data, thus resolving this longstanding problem (Setliff et al., ). From a machine learning perspective, the successful application of DeepRC to immune repertoires, with their large number of instances per bag, might encourage the application of modern Hopfield networks and attention mechanisms to new, previously unsolved or unconsidered datasets and problems.

Impact on society. If the approach proves successful, it could lead to faster testing of individuals for their immune status w.r.t. a range of diseases based on blood samples. This might motivate changes in the pipeline of diagnostics and tracking of diseases, e.g. automated testing of the immune status at regular intervals. It would furthermore make the collection and screening of blood samples for larger databases more attractive. In consequence, improved testing of immune statuses might reveal to governments or insurance companies individuals who do not have a working immune response towards certain diseases, which could then push for targeted immunisation of those individuals. Similarly to compulsory vaccination, such testing for the immune status could be made compulsory by governments, possibly violating privacy or personal self-determination in exchange for increased overall health of a population.
ultimately, if the approach proves itself successful, the insights gained from the screening of individuals that have successfully developed resistances against specific diseases could lead to faster targeted immunisation, once a certain number of individuals with resistances can be found. this might strongly decrease the harm done by e.g. pandemics and lead to a change in the societal perception of such diseases. consequences of failures of the method. as is common with methods in machine learning, potential danger lies in the possibility that users rely too much on our new approach and use it without reflecting on the outcomes. however, the full pipeline in which our method would be used includes wet lab tests after its application, to verify and investigate the results, gain insights, and possibly derive treatments. failures of the proposed method would lead to unsuccessful wet lab validation and negative wet lab tests. since the proposed algorithm does not directly suggest treatment or therapy, human beings are not directly at risk of being treated with a harmful therapy. substantial wet lab and in-vitro testing would indicate wrong decisions by the system. leveraging of biases in the data and potential discrimination. as for almost all machine learning methods, confounding factors, such as age or sex, could be used for classification. this might lead to biases in predictions or uneven predictive performance across subgroups. as a result, failures in the wet lab would occur (see paragraph above). moreover, insights into the relevance of the confounding factors could be gained, leading to possible therapies or counter-measures concerning said factors. furthermore, the amount of data available with respect to relevant confounding factors could lead to better or worse performance of our method. e.g. 
a dataset consisting mostly of data from individuals within a specific age group might yield better performance for that age group, possibly resulting in better or exclusive treatment methods for that specific group. here again, the application of deeprc would be followed by in-vitro testing and development of a treatment, where all target groups for the treatment have to be considered accordingly. all datasets and code are available at https://github.com/ml-jku/deeprc. the cmv dataset is publicly available at https://clients.adaptivebiotech.com/pub/emerson- -natgen. in section a we provide details on the architecture of deeprc, in section a we present details on the datasets, in section a we explain the methods that we compared, in section a we elaborate on the hyperparameters and their selection process. then, in section a we present detailed results for each dataset category in tabular form, in section a we provide information on the lstm model that was used to generate antibody sequences, in section a we show how deeprc can be interpreted, in section a we show the correspondence of previously identified tcr sequences for cmv immune status with attention values by deeprc, and finally we present variations and an ablation study of deeprc in section a . input layer. for the input layer of the cnn, the characters in the input sequence, i.e. the amino acids (aas), are encoded in a one-hot vector of length . to also provide information about the position of an aa in the sequence, we add additional input features with values in range [ , ] to encode the position of an aa relative to the sequence. these positional features encode whether the aa is located at the beginning, the center, or the end of the sequence, respectively, as shown in figure a . we concatenate these positional features with the one-hot vector of aas, which results in a feature vector of size per sequence position. 
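the input encoding described above can be sketched as follows, assuming the standard amino acid alphabet and the three positional features (beginning, center, end); the exact feature construction used by deeprc may differ in detail:

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
AA_IDX = {aa: i for i, aa in enumerate(AAS)}

def encode_sequence(seq):
    """One-hot encode a sequence and append three positional features
    (near the beginning / center / end of the sequence)."""
    n = len(seq)
    onehot = np.zeros((n, len(AAS)))
    onehot[np.arange(n), [AA_IDX[a] for a in seq]] = 1.0
    pos = np.linspace(0.0, 1.0, n)  # relative position in [0, 1]
    positional = np.stack(
        [1.0 - pos,                       # high near the beginning
         1.0 - np.abs(pos - 0.5) * 2.0,   # high near the center
         pos],                            # high near the end
        axis=1)
    return np.concatenate([onehot, positional], axis=1)
```

for a sequence of length L this yields an (L, 23) feature matrix: 20 one-hot dimensions plus 3 positional dimensions per position.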
each repertoire, now represented as a bag of feature vectors, is then normalized to unit variance. since the cytomegalovirus dataset (cmv dataset) provides sequences with an associated abundance value per sequence, which is the number of occurrences of a sequence in a repertoire, we incorporate this information into the input of deeprc. to this end, the one-hot aa features of a sequence are multiplied by a scaling factor of log(c a ) before normalization, where c a is the abundance of a sequence. we feed the sequences with features per position into the cnn. sequences of different lengths were zero-padded to the maximum sequence length per batch at the sequence ends. d cnn for motif recognition. in the following, we describe how deeprc identifies patterns in the individual sequences and reduces each sequence in the input object to a fixed-size feature vector. deeprc employs d convolution layers to extract patterns, where trainable weight kernels are convolved over the sequence positions. in principle, recurrent neural networks (rnns) or transformer networks could also be used instead of d cnns; however, (a) the computational complexity of the network must be low to be able to process millions of sequences for a single update. additionally, (b) the learned network should be able to provide insight into the recognized patterns in the form of motifs. both properties (a) and (b) are fulfilled by d convolution operations that are used by deeprc. we use one d cnn layer (hu et al., ) with selu activation functions (klambauer et al., ) to identify the relevant patterns in the input sequences with a computationally light-weight operation. the larger the kernel size, the more surrounding sequence positions are taken into account, which influences the length of the motifs that can be extracted. we therefore adjust the kernel size during hyperparameter search. 
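the abundance scaling and zero-padding described above can be sketched as follows; natural log is assumed for log(c_a), and the feature layout is illustrative:

```python
import numpy as np

def scale_by_abundance(onehot, count):
    """Multiply one-hot AA features by log(count) to inject the
    per-sequence abundance (as done for the CMV dataset)."""
    return onehot * np.log(count)

def pad_batch(seq_features):
    """Zero-pad variable-length (L_i, d) feature matrices at the
    sequence end to the maximum length in the batch."""
    max_len = max(f.shape[0] for f in seq_features)
    d = seq_features[0].shape[1]
    out = np.zeros((len(seq_features), max_len, d))
    for i, f in enumerate(seq_features):
        out[i, :f.shape[0], :] = f
    return out
```

padding at the sequence end (rather than symmetrically) matches the description in the text; a sequence with abundance 1 is scaled by log(1) = 0, so a pseudo-count offset may be needed in practice.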
in prior works (ostmeyer et al., ), a k-mer size of yielded good predictive performance, which could indicate that a kernel size in the range of may be a suitable choice. for d_v trainable kernels, this produces a feature vector of length d_v at each sequence position. subsequently, global max-pooling over all sequence positions of a sequence reduces the sequence-representations z_i to vectors of the fixed length d_v. given the challenging size of the input data per repertoire, the computation of the cnn activations and weight updates is performed using -bit floating point values. a list of hyperparameters evaluated for deeprc is given in table a . a comparison of rnn-based and cnn-based sequence embedding for motif recognition in a smaller experimental setting is given in sec. a . regularization. we apply random and attention-based subsampling of repertoire sequences to reduce over-fitting and decrease computational effort. during training, each repertoire is subsampled to , input sequences, which are randomly drawn from the respective repertoire. this can also be interpreted as random drop-out (hinton et al., ) on the input sequences or attention weights. during training and evaluation, the attention weights computed by the attention network are furthermore used to rank the input sequences. based on this ranking, the repertoire is reduced to the % of sequences with the highest attention weights. these top % of sequences are then used to compute the weight updates and the prediction for the repertoire. additionally, one might employ further regularization techniques, which we only partly investigated in a smaller experimental setting in sec. a due to high computational demands. such regularization techniques include l and l weight decay, noise in the form of random aa permutations in the input sequences, noise on the attention weights, or random shuffling of sequences between repertoires that belong to the negative class. 
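the sequence embedding described above (1d convolution, selu activation, global max-pooling) can be sketched in plain numpy; deeprc's actual implementation uses pytorch, so this is only a minimal illustration of the operation:

```python
import numpy as np

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """SELU activation (Klambauer et al.)."""
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def embed_sequence(feats, kernels):
    """1D convolution over sequence positions followed by SELU and
    global max-pooling, yielding a fixed-size vector z_i.
    feats: (L, d_in); kernels: (d_v, k, d_in) -> returns (d_v,)."""
    L, d_in = feats.shape
    d_v, k, _ = kernels.shape
    n_pos = L - k + 1
    acts = np.empty((n_pos, d_v))
    for p in range(n_pos):
        window = feats[p:p + k]  # (k, d_in) slice under the kernel
        acts[p] = selu(np.tensordot(kernels, window, axes=([1, 2], [0, 1])))
    return acts.max(axis=0)  # global max-pool over positions
```

the max-pooling makes the representation independent of sequence length, which is what allows sequences of different lengths to share one fixed-size embedding.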
the last regularization technique assumes that the sequences in positive-class repertoires carry a signal, such as an aa motif corresponding to an immune response, whereas the sequences in negative-class repertoires do not. hence, the sequences can be shuffled randomly between negative class repertoires without obscuring the signal in the positive class repertoires. hyperparameters. for the hyperparameter search of deeprc for the category "simulated immunosequencing data", we only conducted a full hyperparameter search on the more difficult datasets with motif implantation probabilities below %, as described in table a . this process was repeated for all folds of the -fold cross-validation (cv) and the average score on the test sets constitutes the final score of a method. table a provides an overview of the hyperparameter search, which was conducted as a grid search for each of the datasets in a nested -fold cv procedure, as described in section a . computation time and optimization. we took measures on the implementation level to address the high computational demands, especially gpu memory consumption, in order to make the large number of experiments feasible. we train the deeprc model with a small batch size of samples and perform computation of inference and updates of the d cnn using -bit floating point values. the rest of the network is trained using -bit floating point values. the adam parameter for numerical stability was therefore increased from the default value of = − to = − . training was performed on various gpu types, mainly nvidia rtx ti. computation times were highly dependent on the number of sequences in the repertoires and the number and sizes of cnn kernels. a single update on an nvidia rtx ti gpu took approximately . to . seconds, while requiring approximately to gb gpu memory. 
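the negative-class shuffling regularization described above can be sketched as follows; repertoires are represented here simply as lists of sequence strings:

```python
import random

def shuffle_negative_repertoires(repertoires, labels, seed=0):
    """Randomly redistribute sequences among negative-class repertoires
    while keeping each repertoire's size fixed; positive-class
    repertoires (assumed to carry the signal) stay untouched."""
    rng = random.Random(seed)
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    pool = [s for i in neg_idx for s in repertoires[i]]
    rng.shuffle(pool)
    out = [list(r) for r in repertoires]
    start = 0
    for i in neg_idx:
        n = len(repertoires[i])
        out[i] = pool[start:start + n]
        start += n
    return out
```

this relies exactly on the assumption stated in the text: negative repertoires carry no class-specific signal, so exchanging their sequences leaves the labels valid.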
taking these optimizations and gpus with larger memory (≥ gb) into account, it is already possible to train deeprc, possibly with multi-head attention and a larger network architecture, on larger datasets (see sec. a ). our network implementation is based on pytorch . . (paszke et al., ). incorporation of additional inputs and metadata. additional metadata in the form of sequence-level or repertoire-level features could be incorporated into the input via concatenation with the feature vectors that result from taking the maximum of the d cnn outputs w.r.t. the sequence positions. this has the benefit that the attention mechanism and output network can utilize the sequence-level or repertoire-level features for their predictions. sparse metadata or metadata that is only available during training could be used as auxiliary targets to incorporate the information via gradients into the deeprc model. limitations. the current methods are mostly limited by computational complexity, since both hyperparameter and model selection are computationally demanding. for hyperparameter selection, a large number of hyperparameter settings have to be evaluated. for model selection, a single repertoire requires the propagation of many thousands of sequences through a neural network and keeping those quantities in gpu memory in order to perform the attention mechanism and weight update. thus, increased gpu memory would significantly boost our approach. increased computational power would also allow for more advanced architectures and attention mechanisms, which may further improve predictive performance. another limiting factor is over-fitting of the model due to the currently relatively small number of samples (bags) in real-world immunosequencing datasets in comparison to the large number of instances per bag and features per instance. 
we aimed at constructing immune repertoire classification scenarios with varying degrees of realism and difficulty in order to compare and analyze the suggested machine learning methods. to this end, we either use simulated or experimentally-observed immune receptor sequences and we implant signals, which are sequence motifs (weber et al., ), into sequences of repertoires of the positive class. it has been shown previously that interaction of immune receptors with antigens occurs via short sequence stretches. thus, implantation of short motif sequences simulating an immune signal is biologically meaningful. our benchmarking study comprises four different categories of datasets: (a) simulated immunosequencing data with implanted signals (where the signal is defined as sets of motifs), (b) lstm-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data. each of the first three categories consists of multiple datasets with varying difficulty depending on the type of the implanted signal and the ratio of sequences with the implanted signal. the ratio of sequences with the implanted signal, where each sequence carries at most one implanted signal, corresponds to the witness rate (wr). we consider binary classification tasks to simulate the immune status of healthy and diseased individuals. we randomly generate immune repertoires with varying numbers of sequences, where we implant sequence motifs in the repertoires of the diseased individuals, i.e. the positive class. the sequences of a repertoire are also randomly generated by different procedures (detailed below). each sequence is composed of different characters, corresponding to amino acids, and has an average length of . aas. in the first category, we aim at investigating the impact of the signal frequency, i.e. the wr, and the signal complexity on the performance of the different methods. 
to this end, we created datasets, where each dataset contains a large number of repertoires with a large number of random aa sequences per repertoire. we then implanted signals in the aa sequences of the positive class repertoires, where the datasets differ in frequency and complexity of the implanted signals. in detail, the aas were sampled randomly, independently of their respective position in the sequence, while the frequencies of aas, distribution of sequence lengths, and distribution of the number of sequences per repertoire, i.e. the number of instances per bag, follow the respective distributions observed in the real-world cmv dataset (emerson et al., ). for this, we first sampled the number of sequences for a repertoire from a gaussian n (µ = k, σ = k) distribution and rounded to the nearest positive integer. we re-sampled if the size was below k. we then generated random sequences of aas with a length of n (µ = . , σ = . ), again rounded to the nearest positive integers. each simulated repertoire was then randomly assigned to either the positive or negative class, with , repertoires per class. in the repertoires assigned to the positive class, we implanted motifs with an average length of aas, following the results of the experimental analysis of antigen-binding motifs in antibodies and t-cell receptor sequences by . we varied the characteristics of the implanted motifs for each of the datasets with respect to the following parameters: (a) ρ, the probability of a motif being implanted in a sequence of a positive repertoire, i.e. the average ratio of sequences containing the motif, which is the witness rate. in this way, we generated different datasets of variable difficulty containing in total roughly . billion sequences. see table a for an overview of the properties of the implanted motifs in the datasets. 
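the sampling procedure described above (gaussian repertoire size with re-sampling below a threshold, gaussian sequence lengths, position-independent random aas) can be sketched as follows; the parameter values passed in are illustrative, not the paper's actual settings:

```python
import numpy as np

AAS = np.array(list("ACDEFGHIKLMNPQRSTVWY"))

def simulate_repertoire(rng, mu_n, sigma_n, min_n, mu_len, sigma_len):
    """Sample a repertoire size from N(mu_n, sigma_n), re-sampling while
    it falls below min_n, then generate that many random AA sequences
    with Gaussian lengths rounded to positive integers."""
    n = 0
    while n < min_n:
        n = int(round(rng.normal(mu_n, sigma_n)))
    lengths = np.maximum(1, np.round(
        rng.normal(mu_len, sigma_len, n)).astype(int))
    return ["".join(rng.choice(AAS, size=l)) for l in lengths]
```

matching the length and size distributions of the real cmv dataset is what makes the resulting bags realistic in scale even though the aa content is random.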
in the second dataset category, we investigate the impact of the signal frequency and complexity in combination with more plausible immune receptor sequences by taking into account the positional aa distributions and other sequence properties. to this end, we trained an lstm (hochreiter & schmidhuber, ) in a standard next character prediction (graves, ) setting to create aa sequences with properties similar to experimentally observed immune receptor sequences. in the first step, the lstm model was trained on all immuno-sequences in the cmv dataset (emerson et al., ) that contain valid information about sequence abundance and have a known cmv label. such an lstm model is able to capture various properties of the sequences, including position-dependent probability distributions and combinations, relationships, and order of aas. we then used the trained lstm model to generate , repertoires in an autoregressive fashion, starting with a start sequence that was randomly sampled from the trained-on dataset. based on a visual inspection of the frequencies of -mers (see section a ), the similarity of lstm generated sequences and real sequences was deemed sufficient for the purpose of generating the aa sequences for the datasets in this category. further details on lstm training and repertoire generation are given in section a . after generation, each repertoire was assigned to either the positive or negative class, with repertoires per class. we implanted motifs of length with varying properties in the center of the sequences of the positive class to obtain different datasets. each sequence in the positive repertoires has a probability ρ to carry the motif, which was varied throughout datasets and corresponds to the wr (see table a ). each position in the motif has a probability of . to be implanted and consequently a probability of . that the original aa in the sequence remains, which can be seen as noise on the motif. 
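the noisy center implantation described above can be sketched as follows; per-position implantation with probability p_pos leaves the original character with probability 1 − p_pos, which is the "noise on the motif":

```python
import random

def implant_motif(seq, motif, p_pos, seed=0):
    """Implant `motif` in the center of `seq`: each motif character
    replaces the original AA with probability p_pos, otherwise the
    original character remains (noise on the motif)."""
    rng = random.Random(seed)
    start = (len(seq) - len(motif)) // 2
    chars = list(seq)
    for i, m in enumerate(motif):
        if rng.random() < p_pos:
            chars[start + i] = m
    return "".join(chars)
```

with p_pos = 1.0 the full motif is implanted; with p_pos = 0.0 the sequence is unchanged, so the effective witness rate also depends on this per-position noise.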
in the third category, we implanted signals into experimentally obtained immuno-sequences, where we considered dataset variations. each dataset consists of repertoires for each of the two classes, where each repertoire consists of k sequences. in this way, we aim to simulate datasets with a low sequencing coverage, which means that only relatively few sequences per repertoire are available. the sequences were randomly sampled from healthy (cmv negative) individuals from the cmv dataset (see the paragraph below for an explanation). two signal types were considered: (a) one signal with one motif. the aa motif ldr was implanted in a certain fraction of sequences. the pattern is randomly altered at one of the three positions with probabilities . , . , and . , respectively. (b) one signal with multiple motifs. one of the three possible motifs ldr, cas, and gl-n was implanted with equal probability. table a : properties of simulated repertoires, variations of motifs, and motif frequencies, i.e. the witness rate, for the datasets in categories "simulated immunosequencing data", "lstm-generated data", and "real-world data with implanted signals". noise types for * are explained in the paragraph "real-world data with implanted signals". again, the motifs were randomly altered before implantation. the aa motif ldr changed as described above. the aa motif cas was altered at the second position with probability . and with probability . at the first position. the pattern gl-n, where - denotes a gap location, is randomly altered at the first position with probability . and the gap has a length of , , or aas with equal probability. additionally, the datasets differ in the values for ρ, the average ratio of sequences carrying a signal, which were chosen as % or . %. the motifs were implanted at positions , , and according to the imgt numbering scheme for immune receptor sequences (lefranc et al., ) with probabilities . , . and . , respectively. with the remaining . 
chance, the motif is implanted at any other sequence position. this means that the motif occurrence in the simulated sequences is biased towards the middle of the sequence. we used a real-world dataset of repertoires, each of which contains between , and , (avg. , ) tcr sequences with a length of to (avg. . ) aas, originally collected and provided by emerson et al. ( ). out of repertoires were labelled as positive for cytomegalovirus (cmv) serostatus, which we consider as the positive class, repertoires had a negative cmv serostatus, considered as the negative class, and repertoires had an unknown status. we changed the number of sequence counts per repertoire from − to for sequences. furthermore, we excluded a total of repertoires with unknown cmv status or unknown information about the sequence abundance within a repertoire, reducing the dataset for our analysis to repertoires, of which with positive and with negative cmv status. we give a non-exhaustive overview of previously considered mil datasets and problems in table a . to our knowledge, the datasets considered in this work pose the most challenging mil problems with respect to the number of instances per bag (column ). table a : mil datasets with their numbers of bags and numbers of instances. "total number of instances" refers to the total number of instances in the dataset. the simulated and real-world immunosequencing datasets considered in this work contain a number of instances per bag that is orders of magnitude larger than in mil datasets previously considered by machine learning methods. we evaluate and compare the performance of deeprc against a set of machine learning methods that serve as baselines, have previously been suggested, or can readily be adapted to immune repertoire classification. in this section, we describe these compared methods. known motif. this method serves as an estimate for the achievable classification performance using prior knowledge about which motif was implanted. 
note that this does not necessarily lead to perfect predictive performance, since motifs are implanted with a certain amount of noise and could also be present in the negative class by chance. the known motif method counts how often the known implanted motif occurs per sequence for each repertoire and uses this count to rank the repertoires. from this ranking, the area under the receiver operating characteristic curve (auc) is computed as performance measure. probabilistic aa changes in the known motif are not considered for this count, with the exception of gap positions. we consider two versions of this method: (a) known motif binary: counts the occurrence of the known motif in a sequence and (b) known motif continuous: counts the maximum number of overlapping aas between the known motif and all sequence positions, which corresponds to a convolution operation with a binary kernel followed by max-pooling. since the implanted signal is not known in the experimentally obtained cmv dataset, this method cannot be applied to this dataset. the support vector machine (svm) approach uses a fixed mapping from a bag of sequences to the corresponding k-mer counts. the function h_kmer maps each sequence s_i to a vector representing the occurrence of k-mers in the sequence. to avoid confusion with the sequence-representation obtained from the cnn layers of deeprc, we denote u_i = h_kmer(s_i), which is analogous to z_i. specifically, (u_i)_m = #{p_m ∈ s_i}, where #{p_m ∈ s_i} denotes how often the k-mer pattern p_m occurs in sequence s_i. afterwards, average-pooling is applied to obtain u = 1/N Σ_{i=1}^{N} u_i, the k-mer representation of the input object x. for two input objects x^(n) and x^(l) with representations u^(n) and u^(l), respectively, we implement the minmax kernel (ralaivola et al., ) as follows: k_minmax(u^(n), u^(l)) = Σ_m min(u^(n)_m, u^(l)_m) / Σ_m max(u^(n)_m, u^(l)_m), where u^(n)_m is the m-th element of the vector u^(n). the jaccard kernel (levandowsky & winter, ) is identical to the minmax kernel except that it operates on binarized u^(n). 
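the k-mer representation and minmax kernel described above can be sketched as follows, using dictionaries as sparse count vectors:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count the k-mers of a single sequence s_i."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def repertoire_representation(seqs, k):
    """Average-pool the per-sequence k-mer count vectors u_i
    to obtain the repertoire representation u."""
    total = Counter()
    for s in seqs:
        total.update(kmer_counts(s, k))
    n = len(seqs)
    return {p: c / n for p, c in total.items()}

def minmax_kernel(u, v):
    """MinMax kernel: sum of element-wise minima over sum of maxima."""
    keys = set(u) | set(v)
    num = sum(min(u.get(p, 0.0), v.get(p, 0.0)) for p in keys)
    den = sum(max(u.get(p, 0.0), v.get(p, 0.0)) for p in keys)
    return num / den if den > 0 else 0.0
```

binarizing the count vectors before applying the same formula yields the jaccard kernel; the kernel equals 1 for identical representations and 0 for disjoint k-mer sets.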
we used a standard c-svm, as introduced by cortes & vapnik ( ). the corresponding hyperparameter c is optimized by random search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a a . the same k-mer representation of a repertoire, as introduced above for the svm baseline, is used for the k-nearest neighbor (knn) approach. as this method clusters samples according to distances between them, the previous kernel definitions cannot be applied directly. it is therefore necessary to transform the minmax as well as the jaccard kernel from similarities into distances by constructing the following (levandowsky & winter, ): d_minmax(u^(n), u^(l)) = 1 − k_minmax(u^(n), u^(l)) and d_jaccard(u^(n), u^(l)) = 1 − k_jaccard(u^(n), u^(l)). (a ) the number of neighbors is treated as a hyperparameter and optimized by an exhaustive grid search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a . we implemented logistic regression on the k-mer representation u of an immune repertoire. the model is trained by gradient descent using the adam optimizer (kingma & ba, ). the learning rate is treated as a hyperparameter and optimized by grid search. furthermore, we explored two regularization settings using combinations of l and l weight decay. the settings of the full hyperparameter search as well as the respective value ranges are given in table a . we implemented a burden test (emerson et al., ; li & leal, ; wu et al., ) in a machine learning setting. the burden test first identifies sequences or k-mers that are associated with the individual's class, i.e., immune status, and then calculates a burden score per individual. concretely, for each k-mer or sequence, the phi coefficient of the contingency table for absence or presence and positive or negative immune status is calculated. then, the j k-mers or sequences with the highest phi coefficients are selected as the set of associated k-mers or sequences. 
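the phi-coefficient feature selection and burden scoring described above can be sketched as follows; feature presence is encoded as 0/1 per individual:

```python
import math

def phi_coefficient(present, status):
    """Phi coefficient of the 2x2 contingency table of feature
    presence/absence vs. positive/negative immune status."""
    n11 = sum(1 for p, s in zip(present, status) if p and s)
    n10 = sum(1 for p, s in zip(present, status) if p and not s)
    n01 = sum(1 for p, s in zip(present, status) if not p and s)
    n00 = sum(1 for p, s in zip(present, status) if not p and not s)
    den = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / den if den > 0 else 0.0

def burden_scores(feature_presence, status, j):
    """Select the j features with the highest phi coefficient and score
    each individual by how many of the selected features it carries."""
    phis = [(phi_coefficient(col, status), f)
            for f, col in feature_presence.items()]
    selected = [f for _, f in sorted(phis, reverse=True)[:j]]
    n = len(status)
    return [sum(feature_presence[f][i] for f in selected) for i in range(n)]
```

the burden score is used directly as the raw prediction to rank individuals; in the extended version described in the text, j is tuned on a validation set.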
j is a hyperparameter that is selected on a validation set. additionally, we consider the type of input features, sequences or k-mers, as a hyperparameter. for inference, a burden score per individual is calculated as the sum of associated k-mers or sequences it carries. this score is used as raw prediction and to rank the individuals. hence, we have extended the burden test by emerson et al. ( ) to k-mers and to adaptive thresholds that are adjusted on a validation set. the logistic multiple instance learning (mil) approach for immune repertoire classification (ostmeyer et al., ) applies a logistic regression model to each k-mer representation in a bag. the resulting scores are then summarized by max-pooling to obtain a prediction for the bag. each amino acid of each k-mer is represented by features, the so-called atchley factors (atchley et al., ) . as k-mers of length are used, this gives a total of × = features. one additional feature per -mer is added, which represents the relative frequency of this -mer with respect to its containing bag, resulting in features per -mer. two options for the relative frequency feature exist, which are (a) whether the frequency of the -mer (" mer") or (b) the frequency of the sequence in which the -mer appeared ("tcrβ") is used. we optimized the learning rate, batch size, and early stopping parameter on the validation set. the settings of the full hyperparameter search as well as the respective value ranges are given in table a . for all competing methods a hyperparameter search was performed, for which we split each of the training sets into an inner training set and inner validation set. the models were trained on the inner training set and evaluated on the inner validation set. the model with the highest auc score on the inner validation set is then used to calculate the score on the respective test set. here we report the hyperparameter sets and search strategy that is used for all methods. deeprc. 
the set of hyperparameters of deeprc is shown in table a . these hyperparameter combinations are adjusted via a grid search procedure. table a : deeprc hyperparameter search space. every · updates, the current model was evaluated against the validation fold. the early stopping hyperparameter was determined by selecting the model with the best loss on the validation fold after updates. * : experiments for { ; ; } kernels were omitted for datasets with motif implantation probabilities ≥ % in the category "simulated immunosequencing data". known motif. this method does not have hyperparameters and has been applied to all datasets except for the cmv dataset. svm. the corresponding hyperparameter c of the svm is optimized by randomly drawing values in the range of [− ; ] according to a uniform distribution. these values act as the exponents of a power of and are applied for each of the two kernel types (see table a a ). knn. the number of neighbors is treated as the hyperparameter and optimized by grid search operating in the discrete range of [ ; max{n, }] with a step size of . the corresponding tight upper bound is automatically defined by the total number of samples n ∈ n > in the training set, capped at (see table a ). number of neighbors { ; max{n, }} type of kernel {minmax; jaccard} table a : settings used in the hyperparameter search of the knn baseline approach. the number of trials (per type of kernel) is automatically defined by the total number of samples n ∈ n > in the training set, capped at . logistic regression. the hyperparameter optimization strategy that was used was grid search across the hyperparameters given in table a . learning rate −{ ; ; } batch size max. updates coefficient β (adam) . coefficient β (adam) . weight decay weightings {(l = − , l = − ); (l = − , l = − )} table a : settings used in the hyperparameter search of the logistic regression baseline approach. burden test. 
the burden test selects two hyperparameters: the number of features in the burden set and the type of features, see table a . number of features in burden set { , , , } type of features { mer; sequence} table a : settings used in the hyperparameter search of the burden test approach. logistic mil. for this method, we adjusted the learning rate as well as the batch size as hyperparameters by randomly drawing different hyperparameter combinations from a uniform distribution. the corresponding range of the learning rate is [− . ; − . ], which acts as the exponent of a power of . the batch size lies within the range of [ ; ]. for each hyperparameter combination, a model is optimized by gradient descent using adam, while the early stopping parameter is adjusted according to the corresponding validation set (see table a ). learning rate {− . ;− . } batch size { ; } relative abundance term { mer; tcrβ} number of trials max. epochs coefficient β (adam) . coefficient β (adam) . table a : settings used in the hyperparameter search of the logistic mil baseline approach. the number of trials (per type of relative abundance) defines the quantity of combinations of random values of the learning rate as well as the batch size. in this section, we report the detailed results on all four categories of datasets (a) simulated immunosequencing data (table a ), (b) lstm-generated data (table a ), (c) real-world data with implanted signals (table a ), and (d) real-world data on the cmv dataset (table a ), as discussed in the main paper. 
table a : auc estimates based on -fold cv for all datasets in category "simulated immunosequencing data". the reported errors are standard deviations across the cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. wildcard characters in motifs are indicated by z, characters with % probability of being removed by d . table a : auc estimates based on -fold cv for all datasets in category "lstm-generated data". the reported errors are standard deviations across the cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. characters affected by noise, as described in a , paragraph "lstm-generated data", are indicated by r . table a : results on the cmv dataset (real-world data) in terms of auc, f score, balanced accuracy, and accuracy. for f score, balanced accuracy, and accuracy, all methods use their default thresholds. each entry shows mean and standard deviation across cross-validation folds. we trained a conventional next-character lstm model (graves, ) based on the implementation in https://github.com/spro/practical-pytorch (access date st of may, ) using pytorch . . (paszke et al., ). for this, we applied an lstm model with lstm blocks in layers, which was trained for , epochs using the adam optimizer (kingma & ba, ) with learning rate . , an input batch size of character chunks, and a character chunk length of . as input we used the immuno-sequences in the cdr column of the cmv dataset, where we repeated sequences according to their counts in the repertoires, as specified in the templates column of the cmv dataset. we excluded repertoires with unknown cmv status and unknown sequence abundance from training. after training, we generated , repertoires using a temperature value of . . 
The number of sequences per repertoire was sampled from a Gaussian N(µ = k, σ = k) distribution, and the whole repertoire was generated by the LSTM in one pass. That is, the LSTM can condition the generation of each individual AA sequence in a repertoire, including its AAs and its length, on the previously generated part of the repertoire. A random immuno-sequence from the trained-on repertoires was used as initialization for the generation process; this immuno-sequence was not included in the generated repertoire. Finally, we randomly assigned of the generated repertoires to the positive (diseased) class and to the negative (healthy) class. We then implanted motifs in the positive-class repertoires as described in Section A. As illustrated by the comparison of histograms in Fig. A, the generated immuno-sequences exhibit a distribution of -mers and AAs very similar to that of the original CMV dataset.

DeepRC allows for two forms of interpretability. (a) Due to its attention-based design, a trained model can be used to compute the attention weight of a sequence, which directly indicates the sequence's importance. (b) DeepRC furthermore allows the use of contribution-analysis methods such as Integrated Gradients (IG) (Sundararajan et al., ) or layer-wise relevance propagation (Montavon et al., ; Arras et al., ; Montavon et al., ; Preuer et al., ). We apply IG to identify the input patterns that are relevant for the classification. To identify AA patterns with high contributions in the input sequences, we apply IG to the AAs of the input sequences. Additionally, we apply IG to the kernels of the D CNN, which allows us to identify AA motifs with high contributions. In detail, we compute the IG contributions for the AAs and positional features in the kernels for every repertoire in the validation and test set, so as to exclude potential artifacts caused by over-fitting. Averaging the IG values over these repertoires then results in concise AA motifs.
We include qualitative visual analyses of the IG method on different datasets below. Here, we provide examples for the interpretation of trained DeepRC models using Integrated Gradients (IG) (Sundararajan et al., ) as contribution-analysis method. The following illustrations were created using IG steps, which we found sufficient to achieve stable IG results. A visual analysis of DeepRC models on the simulated datasets, as illustrated in Tab. A and Fig. A, shows that the implanted motifs can be successfully extracted from the trained model and are straightforward to interpret. In the real-world CMV dataset, DeepRC finds complex patterns with high variability in the center regions of the immuno-sequences, as illustrated in Figure A.

Real-world data with implanted signals:
Extracted motif: g r s r a r f r l r d r r r
Implanted motif(s): {l r d r r r ; c r a r s; g r l-n}
Motif freq. ρ: . % . % . %

Table A: Visualization of motifs extracted from trained DeepRC models for datasets from the categories "simulated immunosequencing data", "LSTM-generated data", and "real-world data with implanted signals". Motif extraction was performed using Integrated Gradients on the D CNN kernels over the validation-set and test-set repertoires of one CV fold. Wildcard characters are indicated by z, random noise on characters by r, characters with % probability of being removed by d, and gap locations of random lengths of { ; ; } by -. Larger characters in the extracted motifs indicate higher contribution, with blue indicating positive and red indicating negative contribution towards the prediction of the diseased class. Contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence). Only kernels with relatively high contributions are shown, i.e. with contributions roughly greater than the average contribution of all kernels.
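The Integrated Gradients method used above can be sketched generically: average the gradient of the model output along the straight path from a baseline to the input, and scale by the input difference. The sketch below works on any differentiable scalar function and is not the DeepRC implementation; the example function and gradient are our own.

```python
def integrated_gradients(grad_f, x, baseline, steps=50):
    """Approximate Integrated Gradients attributions for a scalar function.

    grad_f maps a point (list of floats) to the gradient at that point.
    IG attributes f(x) - f(baseline) to the input features by averaging the
    gradient along the straight path from the baseline to x and scaling by
    (x - baseline), satisfying the completeness axiom up to the Riemann
    approximation error.
    """
    n = len(x)
    grad_sums = [0.0] * n
    for step in range(1, steps + 1):
        alpha = step / steps
        # point on the straight path between baseline and x
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(n)]
        g = grad_f(point)
        for i in range(n):
            grad_sums[i] += g[i]
    return [(x[i] - baseline[i]) * grad_sums[i] / steps for i in range(n)]
```

With enough steps, the attributions approximately sum to the difference between the model output at the input and at the baseline, which is what makes them interpretable as per-feature contributions.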
Figure A: Integrated Gradients applied to input sequences of positive-class repertoires. Three sequences with the highest contributions to the predictions of their respective repertoires are shown. a) Input sequence taken from "simulated immunosequencing data" with implanted motif sz d z d n and motif-implantation probability . %. The DeepRC model reacts to the s and n at the th and th sequence positions, thereby identifying the implanted motif in this sequence. b) and c) Input sequences taken from "real-world data with implanted signals" with implanted motifs {l r d r r r ; c r a r s; g r l-n} and motif-implantation probability . %. The DeepRC model reacts to the fully implanted motif cas (b) and to the partly implanted motif AAs c and a at the th and th sequence positions (c), thereby identifying the implanted motif in the sequences. Wildcard characters in implanted motifs are indicated by z, characters with % probability of being removed by d, and gap locations of random lengths of { ; ; } by -. Larger characters in the sequences indicate higher contribution, with blue indicating positive and red indicating negative contribution towards the prediction of the diseased class.

Figure A: Visualization of the contributions of characters within a sequence via IG. Each sequence was selected from a different repertoire and showed the highest contribution in its repertoire. The model was trained on the CMV dataset, using a kernel size of , kernels, and repertoires for early stopping. Larger characters indicate higher contribution, with blue indicating positive and red indicating negative contribution towards the prediction of the diseased class.

Table A: TCRβ sequences that had been discovered by Emerson et al. ( ) with their associated attention values from DeepRC. These sequences have significantly (p-value . e- ) higher attention values than other sequences.
The column "quantile" provides the quantile values of the empirical distribution of attention values across all sequences in the dataset.

In this section we investigate the impact of different variations of DeepRC on the performance on the CMV dataset. We consider both a CNN-based sequence embedding, as used in the main paper, and an LSTM-based sequence embedding. In both cases we vary the number of attention heads and the β parameter of the softmax function in the attention mechanism (see Eq. in the main paper). For the CNN-based sequence embedding we also vary the number of CNN kernels and the kernel sizes used in the D CNN. For the LSTM-based sequence embedding we use a single one-directional LSTM layer, whose output values at the last sequence position (without padding) are taken as the embedding of the sequence. Here we vary the number of LSTM blocks in the LSTM layer. To counter over-fitting due to the increased complexity of these DeepRC variations, we added an l weight penalty to the training loss. The factor with which the l weight penalty contributes to the training loss is varied over orders of magnitude, with suitable value ranges determined manually on one of the training folds beforehand. To reduce the computational effort, we do not consider all numbers of kernels that were considered in the main paper. Furthermore, we only compute the AUC scores on of the cross-validation folds. The hyperparameters used in the grid-search procedure are listed in Tab. A for the CNN-based sequence embedding and Tab. A for the LSTM-based sequence embedding.

Results. We report performance in terms of AUC score with single hyperparameters set to fixed values, so as to investigate their influence, in Tab. A for the CNN-based sequence embedding and Tab. A for the LSTM-based sequence embedding.
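The β-scaled attention softmax varied in this study can be illustrated with a minimal pooling sketch. This is a generic attention-pooling toy, not the DeepRC code; the scores and embeddings are arbitrary examples.

```python
import math

def attention_pool(scores, embeddings, beta=1.0):
    """Pool per-sequence embeddings into one repertoire embedding.

    Attention weights are a softmax over per-sequence attention scores,
    scaled by a beta parameter: larger beta concentrates weight on the
    highest-scoring sequences, smaller beta approaches uniform averaging.
    """
    m = max(beta * s for s in scores)  # subtract max for numerical stability
    exps = [math.exp(beta * s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    pooled = [sum(w * emb[d] for w, emb in zip(weights, embeddings))
              for d in range(dim)]
    return pooled, weights
```

Sweeping β, as in the grid search above, thus interpolates between mean-pooling and max-like pooling over the sequences of a repertoire.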
We note that, due to restricted computational resources, this study was conducted with fewer different numbers of CNN kernels and with the AUC estimated from only of the cross-validation folds, which leads to a slight decrease of performance in comparison to the full hyperparameter search and cross-validation procedure used in the main paper. As can be seen in Tab. A and A, the LSTM-based sequence embedding generalizes slightly better than the CNN-based sequence embedding.

Table A: Impact of hyperparameters on DeepRC with LSTM for sequence encoding. Mean ("mean") and standard deviation ("std") of the area under the ROC curve over the first folds of a -fold nested cross-validation are shown for different sub-sets of hyperparameters ("sub-set"). The following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search restricted to a specific value * of the beta value of the attention softmax; "heads=*": grid search restricted to a specific number * of attention heads; "lstms=*": grid search restricted to a specific number * of LSTM blocks for sequence embedding.

Table A: Impact of hyperparameters on DeepRC with D CNN for sequence encoding. Mean ("mean") and standard deviation ("std") of the area under the ROC curve over the first folds of a -fold nested cross-validation are shown for different sub-sets of hyperparameters ("sub-set").
The following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search restricted to a specific value * of the beta value of the attention softmax; "heads=*": grid search restricted to a specific number * of attention heads; "ksize=*": grid search restricted to a specific kernel size * of the D CNN kernels for sequence embedding; "kernels=*": grid search restricted to a specific number * of D CNN kernels for sequence embedding.

References:
a compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding
predicting the sequence specificities of dna- and rna-binding proteins by deep learning
explaining and interpreting lstms
solving the protein sequence metric problem
rank-loss support instance machines for miml instance annotation
augmenting adaptive immunity: progress and challenges in the quantitative engineering and analysis of adaptive immune receptor repertoires
multiple instance learning: a survey of problem characteristics and applications
vdjserver: a cloud-based analysis portal and data commons for immune repertoire sequences and rearrangements
tetramer-visualized gluten-specific cd + t cells in blood as a potential diagnostic marker for coeliac disease without oral gluten challenge
ireceptor: a platform for querying and analyzing antibody/b-cell and t-cell receptor repertoire data across federated repositories
support-vector networks
quantifiable predictive features define epitope-specific t cell receptor repertoires
on a model of associative memory with huge storage capacity
bert: pre-training of deep bidirectional transformers for language understanding
solving the multiple instance problem with axis-parallel rectangles
predicting the spectrum of tcr repertoire sharing with a data-driven model of recombination
immunosequencing identifies signatures of cytomegalovirus exposure history and hla-mediated effects on the t cell repertoire
predicting antigen-specificity of single t-cells based on tcr cdr regions (biorxiv)
a review of multi-instance learning assumptions
deep sequencing of b cell receptor repertoires from covid-
evaluation and benchmark for biological image segmentation
the promise and challenge of high-throughput sequencing of the antibody repertoire
tcrex: detection of enriched t cell epitope specificity in full t cell receptor sequence repertoires (biorxiv)
identifying specificity groups in the t cell receptor repertoire
generating sequences with recurrent neural networks (arxiv)
a bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status
learning the high-dimensional immunogenomic features that predict public and private antibody repertoires
improving neural networks by preventing co-adaptation of feature detectors
long short-term memory
fast model-based protein homology detection without alignment
neural networks and physical systems with emergent collective computational abilities
convolutional neural network architectures for matching natural language sentences
attention-based deep multiple instance learning
nettcr: sequence-based prediction of tcr binding to peptide-mhc complexes using convolutional neural networks
basset: learning the regulatory code of the accessible genome with deep convolutional neural networks
detecting cutaneous basal cell carcinomas in ultra-high resolution and weakly labelled histopathological images
self-normalizing neural networks
capturing the differences between humoral immunity in the normal and tumor environments from repertoire-seq of b-cell receptors using supervised machine learning
observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires
dense associative memory for pattern recognition
dense associative memory is robust to adversarial inputs
gradient-based learning applied to document recognition
set transformer: a framework for attention-based permutation-invariant neural networks
imgt unique numbering for immunoglobulin and t cell receptor variable domains and ig superfamily v-like domains
distance between sets
methods for detecting associations with rare variants for common diseases: application to analysis of sequence data
the extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression
high-throughput immune repertoire analysis with igor
a framework for multiple-instance learning
computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires
longitudinal high-throughput tcr repertoire profiling reveals the dynamics of t cell memory formation after mild covid- infection (biorxiv)
methods for interpreting and understanding deep neural networks
layer-wise relevance propagation: an overview
how many different clonotypes do immune repertoires contain? (current opinion in systems biology)
treating biomolecular interaction as an image classification problem - a case study on t-cell receptor-epitope recognition prediction (biorxiv)
sumrep: a summary statistic framework for immune receptor repertoire comparison and model validation
biophysicochemical motifs in t-cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocyte and adjacent healthy tissue
pytorch: an imperative style, high-performance deep learning library
needles in haystacks: on classifying tiny objects in large images
interpretable deep learning in drug discovery
pointnet: deep learning on point sets for d classification and segmentation
graph kernels for chemical informatics
cov-abdab: the coronavirus antibody database (biorxiv)
immunedb, a novel tool for the analysis, storage, and dissemination of immune repertoire sequencing data
a k-nearest neighbor based algorithm for multi-instance multi-label active learning
machine learning in automated text categorization
high-throughput mapping of b cell receptor sequences to antigen specificity
vdjtools: unifying post-analysis of t cell receptor repertoires
vdjdb: a curated database of t-cell receptor sequences with known antigen specificity
deeptcr: a deep learning framework for understanding t-cell receptor sequence signatures within complex t-cell repertoires
prediction of specific tcr-peptide binding from large dictionaries of tcr-peptide pairs (biorxiv)
axiomatic attribution for deep networks
attention-based deep neural networks for detection of cancerous and precancerous esophagus tissue on histopathological slides
learning with sets in multiple instance regression applied to remote sensing
attention is all you need
revisiting multiple instance neural networks
novel approaches to analyze immunoglobulin repertoires
immunesim: tunable multi-feature simulation of b- and t-cell receptor repertoires for immunoinformatics benchmarking
genome-wide protein function prediction through multi-instance multi-label learning
rare-variant association testing for sequencing data with the sequence kernel association test
polyspecificity of t cell and b cell receptor recognition
practical guidelines for b-cell receptor repertoire sequencing analysis
learning embedding adaptation for few-shot learning
convolutional neural network architectures for predicting dna-protein binding
pird: pan immune repertoire database
multi-instance multi-label learning with application to scene classification
predicting effects of noncoding variants with deep learning-based sequence model

The ELLIS Unit Linz, the LIT AI Lab and the Institute for Machine Learning are supported by the Land Oberösterreich, LIT grants DeepToxGen (

In the following, the appendix to the paper
"modern hopfield networks and attention for immune

key: cord- -jvfjf aw
authors: feng, jie; hu, yong; wan, ping; zhang, aibing; zhao, weizhong
title: new method for comparing dna primary sequences based on a discrimination measure
date: - -
journal: journal of theoretical biology
doi: . /j.jtbi. . .
sha:
doc_id:
cord_uid: jvfjf aw

Abstract: We introduce a new approach to compare DNA primary sequences. The core of our method is a new measure of pairwise distances among sequences. Using the primitive discrimination substrings of sequences S and Q, a discrimination measure DM(S, Q) is defined for their similarity analysis. The proposed method does not require multiple alignments and is fully automatic. To illustrate its utility, we construct phylogenetic trees on two independent data sets. The results indicate that the method is efficient and powerful.

With the completion of the sequencing of the genomes of human and other species, the analysis of genomic sequences has become a very important task in bioinformatics. Comparison of the primary sequences of different DNA strands remains the most important aspect of sequence analysis. So far, most comparison methods are based on string alignment (Pearson and Lipman, ; Lake, ): a distance function is used to represent insertion, deletion, and substitution of letters in the compared strings. Using the distance function, one can compare DNA primary sequences and resolve questions of the homology of macromolecules. However, alignment is not easy to use for long sequences, since it is realized with the aid of dynamic programming, which is slow due to the large number of computational steps. In the past two decades, alignment-free sequence comparison (Vinga and Almeida, ) has been actively pursued, and some new methods have been derived with a variety of theoretical foundations.
One category of these methods is based on the statistics of word frequencies within a DNA sequence (Sitnikova and Zharkikh, ; Karlin and Burge, ; Wu et al., ; Stuart et al., ; Qi et al., ). The core idea is that the more similar two sequences are, the greater the number of factors (substrings) shared by the two sequences. The earliest publication using frequency statistics of k-words for sequence comparison dates from (Blaisdell, ). Three years later, Blaisdell ( ) proved that the dissimilarity values obtained with distance measures based on word frequencies are directly related to those requiring sequence alignment. In recent years, many researchers have employed k-words and Markov models to obtain information about biological sequences (Pham and Zuegg, ; Pham, ; Kantorovitz et al., ; Helden, ; Dai et al., ). Another category does not require resolving the sequence into fixed-word-length segments; it can be further divided into three groups. In the first group, researchers represent a DNA sequence by curves (Hamori and Ruskin, ; Nandy, ; Randic et al., a; Zhang et al., ; Liao, ; Li et al., ; Qi et al., ; Yu et al., ), numerical sequences (He and Wang, ), or matrices (Randic, ; Randic et al., ). According to the representation, numerical characterizations are selected as invariants of a sequence for comparisons of DNA primary sequences. The advantage of these methods is that they provide a simple way of viewing, sorting and comparing various gene structures, but how to obtain suitable invariants to characterize and compare DNA sequences is still a question that needs attention. The second group corresponds to iterated maps. Jeffrey ( ) proposed the chaos game representation (CGR) as a scale-independent representation for genomic sequences. The algorithm exploits iterated function systems to map nucleotide sequences into a continuous space.
Since then, alignment-free methods based on CGR have aroused much interest in the field of computational biology. Further studies by Almeida et al. ( ) showed that CGR is a generalized Markov chain probability table that can accommodate non-integer orders, and that CGR is a powerful sequence-modelling tool because of its computational efficiency and scale-independence (Almeida and Vinga, ). Such alignment-free methods have been successfully applied to sequence comparison, phylogeny, detection of horizontal transfers, detection of oligonucleotides of interest, and meta-genomic studies (Deschavanne et al., ; Pride et al., ; Sandberg et al., ; Teeling et al., ; Chapus et al., ; Wang et al., ; Dufraigne et al., ; Joseph and Sasikumar, ). The third group is based on text compression techniques (Chen et al., ; Cilibrasi et al., ): if one sequence is significantly compressible given the information contained in the other sequence, the two sequences are considered to be close. There are also some important methods that are based on compression algorithms but do not actually apply the compression, such as Lempel-Ziv complexity and the Burrows-Wheeler transform (Otu and Sayood, ; Mantaci et al., ; Mantaci et al., ; Yang et al., ). In this paper, we propose a new sequence distance for the similarity analysis of DNA sequences. Based on the properties of primitive discrimination substrings, we construct a discrimination measure (DM) between every two sequences. Furthermore, as an application, two data sets (β-globin genes and coronavirus genomes) are prepared and tested to validate the method. The results demonstrate that the new method is powerful and efficient.

DNA sequences consist of four nucleotides: A (adenine), G (guanine), C (cytosine), and T (thymine). A DNA sequence of length n can be viewed as a linear sequence of n symbols from the finite alphabet A = {A, C, G, T}.
Let S and Q be sequences defined over A, let l(S) be the length of S, let S(i) denote the i-th element of S, and let S(i, j) be the substring of S composed of the elements of S between positions i and j (inclusive).

Definition. S(i, j) is called a discrimination substring (DS) that distinguishes S from Q if S(i, j) does not occur in Q. In particular, if S(i, j) does not include any other DS distinguishing S from Q, we call S(i, j) a primitive discrimination substring (PDS) that distinguishes S from Q. The set of PDSs that distinguish S from Q is denoted by D(S, Q); similarly, D(Q, S) denotes the set of PDSs that distinguish Q from S. Note that every sequence has its own identity, hence D(S, Q) is usually different from D(Q, S). For example, for S = ACCTAC and Q = GTGACT, we obtain D(S, Q) = {CC, TA} and D(Q, S) = {GT, TG, GA, ACT}.

Suppose u ∈ D(S, Q) and l(u) = k. Then u(1, k−1) must occur in Q (otherwise u(1, k−1) would itself be a DS distinguishing S from Q, which conflicts with the minimality of u). Therefore, the larger k is, the more elements S and Q share, and correspondingly the smaller the degree of discrimination by which S is distinguished from Q. On the other hand, if the number of appearances of u in sequence S is t, then clearly the smaller t is, the smaller the degree of discrimination by which S is distinguished from Q. From the above, we construct the following discrimination measure by which one sequence is distinguished from another.

Definition. DM(S₁→S₂) denotes the discrimination measure by which S₁ is distinguished from S₂, in which v ∈ D(Q, S), l(v) = k_v, and the number of appearances of v in sequence Q is t_v.

Definition. The discrimination measure of sequences S and Q is DM(S, Q). For the function DM to be a distance, it must satisfy: (a) DM(x, y) > 0 for x ≠ y; (b) DM(x, x) = 0; (c) DM(x, y) = DM(y, x) (symmetry); and (d) DM(x, y) ≤ DM(x, z) + DM(z, y) (triangle inequality). Apparently, DM satisfies distance conditions (a)-(c); it is not obvious that it also satisfies (d). The following proposition answers this.
Proposition. DM(x, y) satisfies the triangle inequality, that is, DM(x, z) ≤ DM(x, y) + DM(y, z).

Proof. Suppose s is an arbitrary element of D(x, z). If s is also contained in D(x, y), clearly DM(x→z) ≤ DM(x→y) + DM(y→z). If there exists an element t ∈ D(x, z) that is not contained in D(x, y), then t ∈ D(y, z), and the inequality DM(x→z) ≤ DM(x→y) + DM(y→z) still holds. Similarly, we can prove DM(z→x) ≤ DM(y→x) + DM(z→y). It is then sufficient to prove the inequality √((c + e)² + (d + f)²) ≤ √(c² + d²) + √(e² + f²), which, by squaring both sides, is equivalent to ce + df ≤ √((c² + d²)(e² + f²)). To prove this inequality, we just need to prove (ce + df)² ≤ (c² + d²)(e² + f²), i.e. 2cedf ≤ e²d² + c²f². This holds because (ed − cf)² ≥ 0. Hence DM(x, y) satisfies the triangle inequality. □

In this section, we apply the discrimination measure to analyze two sets of DNA primary sequences. The similarities among the species are computed by calculating the discrimination measure between every two sequences. The smaller the discrimination measure is, the more similar the species are; that is, the discrimination measures of evolutionarily closely related species are smaller, while those of evolutionarily disparate species are larger. Fig. illustrates the basic steps of the DM algorithm. The first set we selected comprises β-globin genes, whose similarity has been studied by many researchers using their first exon sequences (Randic et al., b; Liu and Wang, ). Here we analyze these species using their complete β-globin genes. Table presents their names, EMBL accession numbers, locations and lengths. In Table , we present the similarity/dissimilarity matrix for the full DNA sequences of the β-globin gene from the species listed in Table , obtained by our new method.
Observing Table , we note that the most similar species pairs are human-gorilla, human-chimpanzee and gorilla-chimpanzee, which is expected from their evolutionary relationship. At the same time, we find that gallus and opossum are the most remote from the other species, which coincides with the fact that gallus is the only non-mammalian species among them and opossum is the most remote species from the remaining mammals. By further study of the values in the table, we can gain more information about their similarity. Another use of the similarity/dissimilarity matrix is to construct a phylogenetic tree. The quality of the constructed tree may show whether the matrix is good and therefore whether the method of abstracting information from DNA sequences is efficient. Once a distance matrix has been calculated, it is straightforward to generate a phylogenetic tree using the NJ method or the UPGMA method in the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html). In Fig. , we show the phylogenetic tree of β-globin gene sequences based on the distance matrix DM, using the NJ method. The tree is drawn using the drawgram program in the PHYLIP package. From this figure, we observe that ( ) gallus is clearly separated from the rest, which coincides with real biological phenomena; ( ) human, gorilla, chimpanzee and lemur are placed closer to bovine and goat than to mouse and rat, in complete agreement with Cao et al. ( ), confirming the outgroup status of rodents relative to ferungulates and primates. Next, we consider inferring the phylogenetic relationships of coronaviruses from complete coronavirus genomes. The complete coronavirus genomes used in this paper were downloaded from GenBank, of which are SARS-CoVs and are from other groups of coronaviruses. The name, accession number, abbreviation, and genome length of the genomes are listed in Table .
According to the existing taxonomic groups, sequences - form group I, sequences - belong to group II, and sequence is the only member of group III. Previous work showed that the SARS-CoVs (sequences - ) are not closely related to any of the previously characterized coronaviruses and form a distinct group IV. In Fig. , we present the phylogenetic tree of the species based on the distance matrix DM, using the UPGMA method. The tree is viewed using the drawgram program. As shown in Fig. , four groups of coronaviruses can be seen: ( ) the group I coronaviruses, including TGEV, PEDV and HCoV- E, tend to cluster together; ( ) BCoV, BCoVL, BCoVM, BCoVQ, MHV, MHV , MHVM, and MHVP, which belong to group II, are grouped in a monophyletic clade; ( ) IBV, belonging to group III, is situated on an independent branch; ( ) the SARS-CoVs from group IV are grouped in a separate branch, which can be distinguished easily from the other three groups of coronaviruses. The tree constructed with the DM algorithm is quite consistent with the results obtained by other researchers (Zheng et al., ; Song et al., ; Liu et al., ; Li et al., ). The emphasis of the present work is to provide a new method to analyze DNA sequences. From the above applications, we can see that our method is feasible for comparing DNA sequences and deducing their similarity relationships. In this paper, we have proposed a new method for the similarity analysis of DNA sequences. It is simple and yields results reasonably and rapidly. Our algorithm is not necessarily an improvement over some existing methods, but an alternative for the similarity analysis of DNA sequences. The new approach does not require sequence alignment or graphical representation, and, moreover, it is fully automatic. The whole procedure utilizes the entire information contained in the DNA sequences and does not require any human intervention.
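The UPGMA clustering used to build the coronavirus tree can be sketched with a small toy implementation; the paper itself uses the PHYLIP package, so this is only an illustration of the clustering idea on hypothetical distances.

```python
from itertools import combinations

def upgma(dist):
    """Build a UPGMA tree (as nested tuples) from a pairwise distance dict.

    dist maps (name_a, name_b) pairs to distances. At each step the two
    closest clusters merge, and distances to the merged cluster are the
    size-weighted averages of the member distances.
    """
    d = {frozenset(pair): v for pair, v in dist.items()}
    size = {name: 1 for pair in dist for name in pair}
    nodes = set(size)
    while len(nodes) > 1:
        # pick the closest pair of clusters
        a, b = min(combinations(sorted(nodes, key=str), 2),
                   key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size[merged] = size[a] + size[b]
        # average-linkage update of distances to the new cluster
        for c in nodes - {a, b}:
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])
        nodes = (nodes - {a, b}) | {merged}
    return nodes.pop()
```

On a toy three-species matrix where two taxa are much closer to each other than to the third, the closest pair merges first, mirroring how the SARS-CoVs cluster into their own branch when their pairwise DM values are small.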
the application of the dm algorithm to the sets of β-globin genes and coronavirus genomes demonstrates its utility. this method will also be useful to researchers who are interested in evolutionary analysis.
analysis of genomic sequences by chaos game representation
universal sequence map (usm) of arbitrary discrete sequences
computing distribution of scale independent motifs in biological sequences
biological sequences as pictures: a generic two dimensional solution for iterated maps
a measure of similarity of sets of sequences not requiring sequence alignment
effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarities of natural sequences
conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders
exploration of phylogenetic data using a global sequence analysis method
shared information and program plagiarism detection
algorithmic clustering of music based on string compression
markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison
genomic signature: characterization and classification of species assessed by chaos game representation of sequences
detection and characterization of horizontal transfers in prokaryotes using genomic signature
h curves, a novel method of representation of nucleotides series especially suited for long dna sequences
characteristic sequences for dna primary sequence
metrics for comparing regulatory sequences on the basis of pattern counts
chaos game representation of gene structure
chaos game representation for comparison of whole genomes
a statistical method for alignment free comparison of regulatory sequences
dinucleotide relative abundance extremes: a genomic signature
reconstructing evolutionary trees from dna and protein sequences: paralinear distances
directed graphs of dna sequences and their numerical characterization
-d graphical representation of protein sequences and its application to coronavirus phylogeny
an information based sequence distance and its application to whole mitochondrial genome phylogeny
a d graphical representation of dna sequence
a relative similarity measure for the similarity analysis of dna sequences
characteristic distribution of l-tuple for dna primary sequence
an extension of the burrows-wheeler transform
distance measures for biological sequences: some recent approaches
a new graphical representation and analysis of dna sequence structure
a new sequence distance measure for phylogenetic tree construction
improved tools for biological sequence comparison
spectral distortion measures for biological sequence comparisons and database searching
a probabilistic measure for alignment-free sequence comparison
evolutionary implications of microbial genome tetranucleotide frequency biases
whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach
new d graphical representation of dna sequence based on dual nucleotides
on the similarity of dna primary sequences
on the characterization of dna primary sequences by triplet of nucleic acid bases
novel -d graphical representation of dna sequences and their numerical characterization
analysis of similarity/dissimilarity of dna sequences based on novel -d graphical representation
quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and g+c content
statistical analysis of l-tuple frequencies in eubacteria and organelles
cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human
integrated gene and species phylogenies from unaligned whole genome protein sequences
application of tetranucleotide frequencies for the assignment of genomic fragments
alignment-free sequence comparison: a review
the spectrum of genomic signatures: from dinucleotides to chaos game representation
a measure of dna sequence dissimilarity based on mahalanobis distance between frequencies of words
statistical measures of dna dissimilarity under markov chain models of base composition
the burrows-wheeler similarity distribution between biological sequences based on burrows-wheeler transform
tn curve: a novel d graphical representation of dna sequence based on trinucleotides and its applications
the z curve database: a graphic representation of genome sequences
coronavirus phylogeny based on a geometric approach
we thank all the anonymous referees for their valuable suggestions and support.

key: cord- -mkwpuav authors: moreira, rebeca; balseiro, pablo; planas, josep v.; fuste, berta; beltran, sergi; novoa, beatriz; figueras, antonio title: transcriptomics of in vitro immune-stimulated hemocytes from the manila clam ruditapes philippinarum using high-throughput sequencing date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: mkwpuav

background: the manila clam (ruditapes philippinarum) is a worldwide cultured bivalve species with important commercial value. diseases affecting this species can result in large economic losses. because knowledge of the molecular mechanisms of the immune response in bivalves, especially clams, is scarce and fragmentary, we sequenced rna from immune-stimulated r. philippinarum hemocytes by -pyrosequencing to identify genes involved in their immune defense against infectious diseases. methodology and principal findings: high-throughput deep sequencing of r. philippinarum using pyrosequencing technology yielded , high-quality reads with an average read length of bp. the reads were assembled into , contigs, and . % of the nucleotide sequences translated into protein were annotated successfully. the most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes, such as apoptosis, the toll-like receptor signaling pathway and the complement cascade.
we have found sequences from molecules never described in bivalves before, especially in the complement pathway, where almost all of the components are present. conclusions: this study represents the first transcriptome analysis using -pyrosequencing conducted on r. philippinarum focused on its immune system. our results will provide a rich source of data to discover and identify new genes, which will serve as a basis for microarray construction and the study of gene expression, as well as for the identification of genetic markers. the discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of ruditapes philippinarum. the manila clam (ruditapes philippinarum) is a cultured bivalve species with important commercial value in europe and asia, and its culture has expanded in recent years. nevertheless, diseases produced by a wide range of microorganisms, from viruses to metazoan parasites, can result in large economic losses. among clam diseases, the majority of pathologies are associated with the vibrio and perkinsus genera [ ] [ ] [ ] . although molluscs lack a specific immune system, the innate response, involving circulating hemocytes and a large variety of molecular effectors, seems to be an efficient defense method for responding to external aggressions by detecting the molecular signatures of infection [ ] [ ] [ ] [ ] [ ] ; however, not many immune pathways have been identified in these animals. although knowledge of bivalve immune-related genes has increased in the last few years, the available information is still scarce and fragmentary. most of the data concern mussels and eastern and pacific oysters [ ] [ ] [ ] [ ] [ ] [ ] , and very limited information is available on the expressed immune genes of r. philippinarum.
recently, the expression of immune-related genes of ruditapes philippinarum and ruditapes decussatus was characterized in response to a vibrio alginolyticus challenge [ ] . also, a recent pyrosequencing study was carried out by milan et al. [ ] , who sequenced two normalized cdna libraries representing a mixture of adult tissues and larvae from r. philippinarum. even more recently, ghiselli et al. [ ] de novo assembled the r. philippinarum gonad transcriptome with the illumina technology. moreover, a few transcripts encoded by genes putatively involved in the clam immune response against perkinsus olseni have been reported by cdna library sequencing [ ] . currently ( / / ), there are , ests belonging to r. philippinarum in the genbank database. the european marine genomics network has increased the number of ests for marine mollusc species, particularly for ecologically and commercially important groups that are less studied, such as mussels and clams [ ] . unfortunately, most of the available resources are not annotated or well described, limiting the identification of important genes and genetic markers for future aquaculture applications. the use of -pyrosequencing is a fast and efficient approach for gene discovery and enrichment of transcriptomes in non-model organisms [ ] . this relatively low-cost technology facilitates the rapid production of a large volume of data, which is its main advantage over conventional sequencing methods [ ] . in the present work, we undertook an important effort to significantly increase the number of r. philippinarum ests in the public databases. specifically, the aim of this work was to discover new immune-related genes using pyrosequencing on the gs flx (roche- life sciences) platform with the titanium reagents. to achieve this goal, we sequenced the transcriptome of r.
philippinarum hemocytes previously stimulated with different pathogen-associated molecular patterns (pamps) to obtain as many immune-related transcripts as possible. the raw data are accessible in the ncbi short read archive (accession number: sra . ). the r. philippinarum normalized cdna library was sequenced with gs flx technology as shown in figure . sequencing and assembly statistics are summarized in table . briefly, a total of , raw nucleotide reads averaging . bp in length were obtained. of these, , exceeded our minimum quality standards and were used in the mira assembly. a total of , quality reads were assembled into , contigs, corresponding to . megabases (mb). the length of the contigs varied from to bp, with an average length of . bp and an average coverage of . reads. singletons were discarded, resulting in , contigs formed by at least ests, and , of these contigs were longer than bp. clustering the contigs resulted in , clusters with more than one contig. the distribution of contig length and the number of ests per contig, as well as the contig distribution by cluster, are shown in figure . even though knowledge of expressed genes in bivalves has increased in the last few years, it is still limited. indeed, only , nucleotide sequences, , ests, , proteins and genes from the class bivalvia have been deposited in the genbank public database ( / / ), and the top entries are for the mytilus and crassostrea genera. for ruditapes philippinarum, these numbers are reduced to , ests, proteins and genes. this evidences the lack of information that prompted the recent efforts to increase the number of annotated bivalve sequences in the databases. for non-model species, functional and comparative genomics becomes possible once good est databases are obtained. these studies seem to be the best resource for deciphering the putative function of novel genes, which would otherwise remain "unknown".
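Summary statistics of the kind quoted above (contig count, total bases, average contig length) are straightforward to compute from the assembled contigs. The sketch below uses toy contig lengths, not the study's data, and adds N50 as a common companion metric (the paper itself reports average coverage instead).

```python
# Illustrative sketch: basic assembly statistics from a list of contig lengths.
# The lengths below are toy values, not the study's actual assembly.

def assembly_stats(contig_lengths):
    """Return (n_contigs, total_bp, mean_len, n50) for a list of contig lengths."""
    n = len(contig_lengths)
    total = sum(contig_lengths)
    mean_len = total / n
    # N50: the length L such that contigs of length >= L cover half the assembly
    half, running, n50 = total / 2, 0, 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            n50 = length
            break
    return n, total, mean_len, n50

lengths = [100, 150, 200, 250, 300, 800, 1200]  # toy contig lengths (bp)
n, total, mean_len, n50 = assembly_stats(lengths)
print(f"{n} contigs, {total} bp, mean {mean_len:.1f} bp, N50 {n50} bp")
```

In practice the lengths would be read from the assembler's FASTA output rather than hard-coded.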
ncbi swissprot, ncbi metazoan refseq, the ncbi non-redundant and the uniprotkb/trembl protein databases were chosen to annotate the contigs that were at least bp long ( , ). the percentage of contigs annotated with a cut-off e-value of e- was . %. contig sequences and annotations are included in table s . of these contigs, . % matched sequences from bivalve species, and the remainder matched non-bivalve mollusc classes ( . %), other animals ( . %), plants ( . %), fungi ( . %), protozoa ( . %), bacteria ( . %), archaea ( . %), viruses ( . %) and undefined sequences ( . %). as shown in figure a , the species with the most sequence matches was homo sapiens with , occurrences. the first mollusc on the top list was lymnaea stagnalis at position . the first bivalve, meretrix lusoria, appeared at position . r. philippinarum was at position with occurrences. notably, a high percentage of the sequences had homology with chordates, arthropods and gastropods ( figure b and c), and only contigs matched sequences from the veneroida order ( figure d ). these values can be explained by the higher representation of those groups in the databases as compared to bivalves, and by the quality of the annotation in the databases, as has been reported in another bivalve transcriptomic study [ ] . these data highlight, once again, the necessity of enriching the databases with bivalve sequences. a detailed classification of predicted protein function is shown for the top blastx hits ( figure a ). the list is headed by actin with occurrences, followed by ferritin, an angiopoietin-like protein and lysozyme. an abundance of proteins directly involved in the immune response was predicted in this run; ferritin, lysozyme, c q domain containing protein, galectin- and hemagglutinin/amebocyte aggregation factor precursor are immune-related proteins present on the top list. ferritin has an important role in the immune response.
it captures circulating iron to overcome an infection and also functions as a proinflammatory cytokine via the iron-independent nuclear factor kappa b (nf-kb) pathway [ ] . lysozyme is a key protein in the innate immune responses of invertebrates against gram-negative bacterial infections and could also have antifungal properties. in addition, it provides nutrition through its digestive properties, as it is a hydrolytic protein that can break the glycosidic bonds of the peptidoglycan of the bacterial cell wall [ ] . the c q domain containing proteins are a family of proteins that form part of the complement system. the c q superfamily members have been found to be involved in pathogen recognition, inflammation, apoptosis, autoimmunity and cell differentiation. in fact, c q can be produced in response to infection, and it can promote cell survival through the nf-kb pathway [ ] . galectin- is a central regulator of acute and chronic inflammatory responses through its effects on cell activation, cell migration and the regulation of apoptosis in immune cells [ ] . the hemagglutinin/amebocyte aggregation factor is a single-chain polypeptide involved in blood coagulation and adhesion processes such as self-nonself recognition, agglutination and aggregation. the hemagglutinin/amebocyte aggregation factor and lectins play important roles in defense, specifically in the recognition and destruction of invading microorganisms [ ] . other proteins that are not specifically related to the immune response but could play a role in defense mechanisms include the following: angiopoietin-like proteins, apolipoprotein d and the integral membrane protein b. in other animals, angiopoietin-like proteins (angptl) potently regulate angiogenesis, but a subset also functions in energy metabolism. specifically, angptl , the most represented angptl, promotes vascular inflammation rather than angiogenesis in skin and adipose tissues.
inflammation occurs via the a b integrin/rac /nf-kb pathway, as evidenced by an increase in leukocyte infiltration, blood vessel permeability and the expression of inflammatory cytokines (tumor necrosis factor-a, interleukin- and interleukin- b) [ ] . apolipoprotein d (apod) has been associated with inflammation. pathological and stressful situations involving inflammation or growth arrest can increase its expression. this effect seems to be triggered by lps, interleukin- , interleukin- and glucocorticoids, and is likely mediated by the nf-kb pathway, as there are several conserved nf-kb binding sites in the apod promoter (apre- and ap- binding sites are also present). the highest-affinity ligand of apod is arachidonic acid, which apod traps when it is released from the cellular membrane after inflammatory stimuli, thus preventing its subsequent conversion into pro-inflammatory eicosanoids. within the cell, apod could modulate signal transduction pathways and nuclear processes such as transcription activation, cell cycling and apoptosis. in summary, apod induction is specific to ongoing cellular stress and could be part of the protective components of mild inflammation [ ] [ ] [ ] . finally, the short form of the integral membrane protein b (itm bs) can induce apoptosis via a caspase-dependent mitochondrial pathway [ ] . to avoid redundancy, the longest contig of each cluster was used for gene ontology term assignment. a total of . % of the representative clusters matched at least one go term. concerning cellular components ( figure b ), the highest percentages of go terms were in the groups cell and cell part, with . % each; organelle and organelle part represented . % and . %, respectively. within the molecular function classification ( figure c ), the most represented group was binding with . % of the terms, followed by catalytic activity ( . %) and structural molecular activity ( . %).
with regard to biological process ( figure d ), cellular and metabolic processes were the highest represented groups with . % and . % of the terms, respectively, which was followed by biological regulation ( . %). similarities between the r. philippinarum transcriptome and another four bivalve species sequences were analyzed by comparative genomics (crassostrea gigas of the family ostreidae, bathymodiolus azoricus and mytilus galloprovincialis of the family mytilidae and laternula elliptica of the family laternulidae). this analysis could identify specific transcripts that are conserved in these five species. a venn diagram was constructed using unique sequences from these databases according to the gene identifier (gi id number) of each sequence in its respective database: , from c. gigas, , from b. azoricus, , from m. galloprovincialis and , , from l. elliptica. c. gigas was chosen because is the most represented bivalve species in the public databases. the other three species are bivalves that have been studied in transcriptomic assays. figure shows that of the total , clusters, % were found exclusively in the r. philippinarum group, while only . % shared significant similarity with all five species. the number of coincidences among other groups was very low ( . % to . % of sequences), suggesting that , new sequences were discovered within the bivalve group. the percentage of new sequences is very high compared to previous transcriptomic studies [ ] [ ] , in which the fraction of new transcripts was approximately %. one possible explanation for this discrepancy is the low number of nucleotide and est sequences currently available in public databases for r. philippinarum, but these transcripts could also be regions in which homology is not reached, such as and untranslated regions or genes with a high mutation rate. on the other hand, a comparison between our results and the milan et al. 
[ ] transcriptome using a blastn approach is summarized in table .
immune-related sequences
r. philippinarum hemocytes were subjected to immune stimulation using several different pamps to enrich the est collection with immune-related sequences. the objective was to obtain a more complete view of clam responses to pathogens. a keyword list and go immune-related terms were used to find proteins putatively involved in the immune system. after this selection step, we found that more than % of the proteins predicted from the contig sequences had a possible immune function. some sequences were found to cluster in common, well-recognized immune pathways, such as the complement, apoptosis and toll-like receptor pathways, indicating conserved ancient mechanisms in bivalves ( figures , , ). the complement system is composed of over plasma proteins that collaborate to distinguish and eliminate pathogens. c is the central component in this system. in vertebrates, it is proteolytically activated by a c convertase through the classic, lectin-induced and alternative routes [ ] . although the complement pathway has not been extensively described in bivalves, there is evidence that supports the presence of this defense mechanism. ests with homology to the c q domain have been detected in the american oyster, c. virginica [ ] , the tropical clam codakia orbicularis [ ] , the zhikong scallop chlamys farreri [ ] and the mussel m. galloprovincialis [ ] [ ] . more recently, novel c q adiponectin-like, c and factor b-like proteins have been identified in the carpet shell clam r. decussatus [ ] [ ] . these data support the putative presence of the complement system in bivalves. our pyrosequencing results, using the blastx similarity approach, showed that the complement pathway in r. philippinarum was almost complete as compared to the kegg reference pathway ( figure ). only the complement components c r, c s, c , c and c were not detected.
i. lectins.
lectins are a family of carbohydrate-recognition proteins that play crucial self- and non-self-recognition roles in innate immunity and can be found in soluble or membrane-associated forms. they may initiate effector mechanisms against pathogens, such as agglutination, immobilization and complement-mediated opsonization and lysis [ ] . several types of lectins have been cloned or purified from the manila clam, r. philippinarum [ ] [ ] [ ] , and their function and expression have also been studied [ , ] . also, a manila clam tandem-repeat galectin, which is induced upon infection with perkinsus olseni, has been characterized [ ] . lectin sequences have been found in the stimulated hemocytes studied in our work: of the contigs are homologous to c-type lectins (calcium-dependent carbohydrate-binding lectins that have characteristic carbohydrate-recognition domains), are homologous to galectins (characterized by a conserved sequence motif in their carbohydrate-recognition domain and a specific affinity for β-galactosides), contigs have homology with ficolin a and b (a group of oligomeric lectins with subunits consisting of both collagen-like and fibrinogen-like domains) and contigs have homology with other groups of lectins such as lactose-, mannose- or sialic acid-binding lectins.
ii. β-glucan recognition proteins.
β-glucan recognition proteins are involved in the recognition of invading fungal organisms. they bind specifically to β- , -glucan, stimulating short-term immune responses. although these receptors have been partially sequenced in several bivalves, there is only one complete description of them, in the scallop chlamys farreri [ ] . two contigs with homology to the beta- , -glucan-binding protein were found in our study.
iii. peptidoglycan recognition proteins.
peptidoglycan recognition proteins (pgrps) specifically bind peptidoglycan, a major component of the bacterial cell wall.
this family of proteins influences host-pathogen interactions through its pro- and anti-inflammatory properties, which are independent of its hydrolytic and antibacterial activities. in bivalves, they were first identified in the scallops c. farreri and a. irradians [ , ] and the pacific oyster c. gigas; from the latter, four different types of pgrps were identified [ ] . peptidoglycan-recognition proteins and a peptidoglycan-binding domain containing protein have been found for the first time in r. philippinarum in our results and were present and times, respectively.
iv. toll-like receptors.
toll-like receptors (tlrs) are an ancient family of pattern recognition receptors that play key roles in detecting non-self substances and activating the immune system. the unique bivalve tlr was identified and characterized in the zhikong scallop, c. farreri [ ] . tlr , and were present among the pyrosequencing results. tlr and tlr form a heterodimer, which senses and recognizes various components from bacteria, mycoplasma, fungi and viruses [ ] . tlr is a novel and poorly characterized member of the toll-like receptor family. although the exact role of tlr is currently unknown, phylogenetic analysis indicates that tlr is a member of the tlr subfamily [ ] , suggesting that it could recognize uropathogenic e. coli [ ] . it has been demonstrated that tlr colocalizes and interacts with unc b , a molecule located in the endoplasmic reticulum, which strongly suggests that tlr might be found inside cells and might play a role in recognizing viral infections [ ] . figure summarizes the tlr signaling pathway with the corresponding molecules found in the r. philippinarum transcriptome. pathogen proteases are important virulence factors that facilitate infection, diminish the activity of lysozymes and quench the agglutination capacity of hemocytes.
because protease inhibitors play important roles in invertebrate immunity by protecting hosts through the direct inactivation of pathogen proteases, many bivalves have developed protease inhibitors to regulate the activities of pathogen proteases [ ] . some genes encoding protease inhibitors were identified in c. gigas [ ] , a. irradians [ ] , c. farreri [ ] and c. virginica; in the latter, a novel family of serine protease inhibitors was also characterized [ ] [ ] [ ] . a total of contigs with homology to serine, cysteine, kunitz- and kazal-type protease inhibitors and metalloprotease inhibitors were found among our results. lysozyme was one of the most represented groups of immune genes in this transcriptome study, with contigs present. it is an antibacterial molecule present in numerous animals, including bivalves. although lysozyme activity was first reported in molluscs over years ago, complete sequences were published only recently, including those of r. philippinarum [ ] . antimicrobial peptides (amps) are small, gene-encoded, cationic peptides that constitute important innate immune effectors in organisms spanning most of the phylogenetic spectrum. amps alter the permeability of the pathogen membrane and cause cellular lysis [ ] . in bivalves, they were first purified from mussel hemocyte granules [ , ] . in mussels, the amp myticin c was found to have high polymorphic variability as well as chemotactic and immunoregulatory roles [ , ] . in clams, two amps with similarity to mussel myticin and mytilin [ ] and a big defensin [ ] are known. we were able to detect contigs with homology to different defensins: defensin- (american oyster defensin), defensin mgd- (mediterranean mussel defensin) and the big defensin previously mentioned. four contigs were similar to an unpublished defensin sequence from venerupis (= ruditapes) philippinarum. the primary role of heat shock proteins (hsps) is to function as molecular chaperones.
their up-regulation also represents an important mechanism in the stress response [ ] , and their activity is closely linked to the innate immune system. hsps mediate the mitochondrial apoptosis pathway and affect the regulation of nf-kb [ ] . hsps are well studied in bivalves. for r. philippinarum, several assays have been developed to better understand the hsp profile in response to heavy metal and pathogen stresses [ ] [ ] [ ] . the most important and well-studied groups of hsps were present in our r. philippinarum transcriptome (hsp , hsp / dnaj, hsp and hsp ), but other, less common hsps were also represented (hsp , hsp , hsp and some members of the hsp family). recently, several genes related to the inflammatory response against lps stimulation have been detected in bivalves. such is the case of the lps-induced tnf-a factor (litaf), a novel transcription factor that critically regulates the expression of tnf-a and various inflammatory cytokines in response to lps stimulation. it has been described in three bivalve species: pinctada fucata [ ] , c. gigas [ ] and c. farreri [ ] . other tnf-related genes have been identified in the zhikong scallop, such as a tnfr homologue [ ] and a tumor necrosis factor receptor-associated factor (traf ), a key signaling adaptor molecule common to the tnfr superfamily and the il- r/tlr family [ ] . figure shows the components of the tlr signaling pathway that are present in our transcriptomic sequences (myd , irak , traf- and - , tram, btk, rac- , pi k, akt, btk and tank). a total of , contigs, . % of those annotated, had homology with the main groups of putatively pathogenic organisms, namely viruses ( hits), bacteria ( , hits), protozoa ( hits) and fungi ( hits). figure displays the taxonomic classification of these sequences, and table summarizes the known bivalve pathogens found in our results. bacteria constitute the main group found among the sequences not belonging to the clam.
as filter-feeding animals, bivalves can concentrate a large amount of bacteria, which could be one of their sources of food [ ] . because vibrio spp. are ubiquitous in aquatic ecosystems, it was expected that the vibrionales order, with hits, would be the most predominant. several species of the vibrio genus are among the main causes of disease in bivalves, specifically causing bacillary necrosis in larval stages [ ] . it is noticeable that sequences belonging to vibrio tapetis, the causative agent of brown ring disease in adult manila clams, were not found. perkinsus marinus, with matches, is the only bivalve pathogen found within the protozoa (alveolata) group. perkinsosis is produced by species of the genus perkinsus. both p. marinus and p. olseni have been associated with mortalities in populations of various groups of molluscs around the world and are catalogued as notifiable pathogens by the oie. viruses were the least represented among the pathogens. the baculoviridae family was the most predominant, with matches, but the corresponding sequences were inhibitors of apoptosis (iaps) [ ] that could also be part of the clam's transcriptome. five viral families were found in our transcriptome study: iridoviridae, herpesviridae, malacoherpesviridae, picornaviridae and retroviridae. a well-known bivalve pathogen was also identified, the ostreid herpesvirus , which has previously been found to infect clams [ ] . fungi had matches in our results. it is known that bivalves are sensitive to fungal diseases, which can degrade the shell or affect larval bivalve stages [ , ] . this study represents the first r. philippinarum transcriptome analysis focused on its immune system using a -pyrosequencing approach and complements the recent pyrosequencing assay carried out by milan et al. [ ] . the discovery of new immune sequences was effective, resulting in an enormous variety of contigs corresponding to molecules that could play a role in the defense mechanisms.
more than % of our results were related to immunity. this new resource is now gathered in the ncbi short read archive (accession number: sra . ). our results will provide a rich source of data to discover and identify new genes, which will serve as a basis for microarray construction and gene expression studies, as well as for the identification of genetic markers for various applications, including the selection of families in the aquaculture sector. we have found sequences from molecules never before described in bivalves, such as c , c , c , c , aif, bax, akt, tlr and tlr , among others. as part of this work, three immune pathways of r. philippinarum have been characterized: apoptosis, the toll-like receptor signaling pathway and the complement cascade. these could help us to better understand the resistance mechanisms of this economically important aquaculture clam species.
animal sampling and in vitro stimulation of hemocytes
r. philippinarum clams were obtained from a commercial shellfish farm (vigo, galicia, spain). clams were maintained in open-circuit filtered seawater tanks at uc with aeration and were fed. a total of clams were notched in the shell in the area adjacent to the anterior adductor muscle. a sample of ul of hemolymph was withdrawn from the adductor muscle of each clam with an insulin syringe, pooled and then distributed in -well plates, ml per well, in a total of wells, one for each treatment. hemocytes were allowed to settle to the base of the wells for min at uc in the dark. then, the hemocytes were stimulated with mg/ml of polyinosinic:polycytidylic acid (poly i:c), peptidoglycans, β-glucan, vibrio anguillarum dna (cpg), lipopolysaccharide (lps), lipoteichoic acid (lta) or cfu/ml of heat-inactivated vibrio anguillarum (one stimulus per well) for h at uc. all stimuli were purchased from sigma.
pyrosequencing.
after stimulation, hemolymph was centrifuged at g at uc for minutes, the pellet was resuspended in ml of trizol (invitrogen), and rna was extracted following the manufacturer's protocol. after rna extraction, samples were treated with turbo dnase free (ambion) to eliminate dna. next, the concentration and purity of the rna samples were measured using a nanodrop nd spectrophotometer, and the rna quality was assessed in a bioanalyzer (agilent technologies). from each sample, mg of rna was pooled and used for the production of normalized cdna for sequencing in the unitat de genòmica (sct-ub, barcelona, spain). full-length-enriched, double-stranded cdna was synthesized from , mg of pooled total rna using the mint cdna synthesis kit (evrogen, moscow, russia) according to the manufacturer's protocol and was subsequently purified using the qiaquick pcr purification kit (qiagen usa, valencia, ca). the amplified cdna was normalized using the trimmer kit (evrogen, moscow, russia) to minimize differences in the representation of transcripts. the method involves denaturation-reassociation of the cdna, followed by digestion with a duplex-specific nuclease (dsn) enzyme [ , ] . the enzymatic degradation occurs primarily on the highly abundant cdna fraction. the single-stranded cdna fraction was then amplified twice by sequential pcr reactions according to the manufacturer's protocol. normalized cdna was purified using the qiaquick pcr purification kit (qiagen usa, valencia, ca). to generate the library, ng of normalized cdna was used. the cdna was fractionated into small, -to -basepair fragments, and specific a and b adaptors were ligated to both ends of the fragments.
[figure legend abbreviations: ... domain; pkc: protein kinase c; pten: phosphatidylinositol- , , -trisphosphate -phosphatase and dual-specificity protein phosphatase pten; raidd: caspase and rip adapter with death domain; tnf r : tumor necrosis factor receptor ; tnf-a: tumor necrosis factor alpha; tradd: tnf receptor type -associated death domain protein; traf : tnf receptor-associated factor ; trail: tnf-related apoptosis-inducing ligand; trail decoy: decoy trail receptor without death domain; trail-r: trail receptor. doi: . /journal.pone. .g ]
the a and b adaptors were used for the purification, amplification and sequencing steps. one sequencing run was performed on the gs-flx using titanium chemistry. sequencing is based on sequencing-by-synthesis: the addition of one or more nucleotides complementary to the template strand results in a chemiluminescent signal recorded by the ccd camera within the instrument. the signal strength is proportional to the number of nucleotides incorporated in a single nucleotide flow. all reagents and protocols used were from roche life sciences, usa. pyrosequencing raw data, comprising , reads, were processed with the roche quality-control pipeline using the default settings. the seqclean software (http://compbio.dfci.harvard.edu/tgi/software/) was used to screen for and remove normalization adaptor sequences, homopolymers and reads shorter than bp prior to assembly. a total of , quality reads were subjected to mira, version . . [ ], to assemble the transcriptome. by default, mira takes into account only contigs with at least reads; the other reads go into the debris, which might include singletons, repeats, low-complexity sequences and sequences shorter than bp. ncbi blastclust was used to group similar contigs into clusters (groups of transcripts from the same gene); two sequences were grouped if at least % of the positions had at least % identity. the , contigs were grouped into a total of , clusters. an iterative blast workflow was used to annotate the r.
philippinarum contigs with at least bp ( , contigs out of , ). blastx [ ], with a cut-off value of e- , was then used to compare the r. philippinarum contigs with the ncbi swissprot, ncbi metazoan refseq, ncbi nr and uniprotkb/trembl protein databases. after annotation, the blast go software [ ] was used to assign gene ontology terms [ ] to the largest contig of each representative cluster (minimum of bp). this strategy was used to avoid redundant results. default values in blast go were used to perform the analysis, and ontology level was selected to construct the level pie charts. to compare r. philippinarum with other bivalve species, the nucleotide sequences and ests from c. gigas, m. galloprovincialis, l. elliptica and b. azoricus were obtained from genbank and from dedicated databases, when available [ ]. unique sequences (based on gi number) were used from each of the databases. these sequences were compared by blastn against the longest contig from each of the , r. philippinarum clusters with a cut-off e-value of e- . hits to r. philippinarum sequences were represented in a venn diagram. the comparison between our results (the longest contig from each of , clusters) and the milan et al. [ ] transcriptome ( contigs downloaded from ruphibase, http://compgen.bio.unipd.it/ruphibase/query/) was made by blastn with a cut-off e-value of e- . another analysis was carried out to compare only the longest contig from each of the , clusters identified as immune-related against the milan et al. contigs. the results are summarized in table . the percentage of coverage is the average % of query coverage by the best blast hit, and the percentage of hits is the % of queries with at least one hit in the database; the total number of hits is given in parentheses.
identification of immune-related genes
all the contig annotations were revised based on an immunity- and inflammation-related keyword list (e.g.
apoptosis, bactericidal, c , lectin, socs…) developed in our laboratory, in order to select candidate sequences putatively involved in the immune response. the presence or absence of these keywords in the blastx hit descriptions was checked to identify putative immune-related contigs. the remaining, non-selected contigs were revised using the level- go terms assigned to each sequence after the annotation step, retaining those with a direct relationship to immunity. the selected contigs were then checked again to eliminate non-immune ones and were distributed into functional categories. immune-related genes were grouped into three reference immune pathways (complement cascade, tlr signaling pathway and apoptosis) to describe each route indicated by our pyrosequencing results. to identify and classify the groups of organisms with high similarity to our clam sequences, the uniprot taxonomy [ ] was used, except for the protozoa; because protozoa are a highly complex group, a specific taxonomy [ ] was followed for them. briefly, after the blastx annotation step, all the hit descriptions included a species name (e.g., homo sapiens) or a code (e.g., human), meaning that the protein had previously been identified as belonging to that species. with this information, sequences were classified into taxonomical groups and represented in pie charts.
table s : list of contigs (e-value - ) of ruditapes philippinarum, including sequence, length, description (hit description), accession number of description (hit acc), e-value obtained and database used for annotation (blast).
references
- study of diseases and the immune system of bivalves using molecular biology and genomics
- bacterial disease in marine bivalves, review of recent studies: trends and evolution
- perkinsosis in molluscs: a review
- bacteria-hemocyte interactions and phagocytosis in bivalves
- role of lectins (c-reactive protein) in defense of marine bivalves against bacteria
- modulation of the chemiluminescence response of mediterranean mussel (mytilus galloprovincialis) haemocytes
- immune parameters in carpet shell clams naturally infected with perkinsus atlanticus
- nitric oxide production by carpet shell clam (ruditapes decussatus) hemocytes
- generation and analysis of a , unique expressed sequence tags from the pacific oyster (crassostrea gigas) assembled into a publicly accessible database, the gigasdatabase
- immune gene discovery by expressed sequence tags generated from hemocytes of the bacteria-challenged oyster, crassostrea gigas
- sequence variability of myticins identified in haemocytes from mussels suggests ancient host-pathogen interactions
- mytibase, a knowledgebase of mussel (m. galloprovincialis) transcribed sequences
- insights into the innate immunity of the mediterranean mussel mytilus galloprovincialis
- development of expressed sequence tags from the pearl oyster, pinctada martensii dunker
- gene expression analysis of clams ruditapes philippinarum and ruditapes decussatus following bacterial infection yields molecular insights into pathogen resistance and immunity
- transcriptome sequencing and microarray development for the manila clam, ruditapes philippinarum: genomic tools for environmental monitoring
- de novo assembly of the manila clam ruditapes philippinarum transcriptome provides new insights into expression bias, mitochondrial doubly uniparental inheritance and sex determination
- analysis of est and lectin expression in hemocytes of manila clams (ruditapes phylippinarum) (bivalvia, mollusca) infected with perkinsus olseni
- increasing genomic information in bivalves through new est collections in four species, development of new genetic markers for environmental studies and genome evolution
- rapid transcriptome characterization for a nonmodel organism using pyrosequencing
- sequencing technologies - the next generation
- transcriptomic analysis of the clam meretrix meretrix on different larval stages
- ferritin functions as a proinflammatory cytokine via iron-independent protein kinase c zeta/nuclear factor kappab-regulated signaling in rat hepatic stellate cells
- cloning and characterization of an invertebrate type lysozyme from venerupis philippinarum
- c q and tumor necrosis factor superfamily: modularity and versatility
- the regulation of inflammation by galectin-
- isolation, cdna cloning, and characterization of an -kda hemagglutinin and amebocyte aggregation factor from limulus polyphemus
- angiopoietin-like proteins: emerging targets for treatment of obesity and related metabolic diseases
- modulation of apolipoprotein d expression and translocation under specific stress conditions
- neuroprotective effect of apolipoprotein d against human coronavirus oc -induced encephalitis in mice
- apolipoprotein d
- itm bs regulates apoptosis by inducing loss of mitochondrial membrane potential
- transcriptomic signatures of ash (fraxinus spp.) phloem
- transcriptomics of the bed bug (cimex lectularius)
- complement and its role in innate and adaptive immune responses
- potential indicators of stress response identified by expressed sequence tag analysis of hemocytes and embryos from the american oyster, crassostrea virginica
- analysis of a cdna-derived sequence of a novel mannose-binding lectin, codakine, from the tropical clam codakia orbicularis
- a novel c q-domain-containing protein from zhikong scallop chlamys farreri with lipopolysaccharide binding activity
- the c q domain containing proteins of the mediterranean mussel mytilus galloprovincialis: a widespread and diverse family of immune-related molecules
- mgc q, a novel c q-domain-containing protein involved in the immune response of mytilus galloprovincialis
- differentially expressed genes of the carpet shell clam ruditapes decussatus against perkinsus olseni
- characterization of a c and a factor b-like in the carpet-shell clam, ruditapes decussatus
- structural and functional diversity of lectin repertoires in invertebrates, protochordates and ectothermic vertebrates
- purification and characterisation of a lectin isolated from the manila clam ruditapes philippinarum in korea
- characterization, tissue expression, and immunohistochemical localization of mcl , a c-type lectin produced by perkinsus olseni-infected manila clams (ruditapes philippinarum)
- noble tandem-repeat galectin of manila clam ruditapes philippinarum is induced upon infection with the protozoan parasite perkinsus olseni
- lectin from the manila clam ruditapes philippinarum is induced upon infection with the protozoan parasite perkinsus olseni
- cdna cloning and mrna expression of the lipopolysaccharide- and beta- , -glucan-binding protein gene from scallop chlamys farreri
- molecular cloning and characterization of a short type peptidoglycan recognition protein (cfpgrp-s ) cdna from zhikong scallop chlamys farreri
- molecular cloning and mrna expression of peptidoglycan recognition protein (pgrp) gene in bay scallop (argopecten irradians, lamarck )
- distribution of multiple peptidoglycan recognition proteins in the tissues of pacific oyster, crassostrea gigas
- molecular cloning and expression of a toll receptor gene homologue from zhikong scallop, chlamys farreri
- pattern recognition receptors and inflammation
- the evolution of vertebrate toll-like receptors
- a toll-like receptor that prevents infection by uropathogenic bacteria
- unc b delivers nucleotide-sensing toll-like receptors to endolysosomes
- cg-timp, an inducible tissue inhibitor of metalloproteinase from the pacific oyster crassostrea gigas with a potential role in wound healing and defense mechanisms
- molecular cloning, characterization and expression of a novel serine proteinase inhibitor gene in bay scallops (argopecten irradians, lamarck )
- molecular cloning and expression of a novel kazal-type serine proteinase inhibitor gene from zhikong scallop chlamys farreri, and the inhibitory activity of its recombinant domain
- a novel slow-tight binding serine protease inhibitor from eastern oyster (crassostrea virginica) plasma inhibits perkinsin, the major extracellular protease of the oyster protozoan parasite perkinsus marinus
- evidence indicating the existence of a novel family of serine protease inhibitors that may be involved in marine invertebrate immunity
- serine protease inhibitor cvsi- : potential role in the eastern oyster host defense against the protozoan parasite perkinsus marinus
- antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? innate immunity.
- isolation of several cysteine-rich antimicrobial peptides from the blood of a mollusc, mytilus edulis
- a member of the arthropod defensin family from edible mediterranean mussels (mytilus galloprovincialis)
- evidence of high individual diversity on myticin c in mussel (mytilus galloprovincialis)
- mytilus galloprovincialis myticin c: a chemotactic molecule with antiviral activity and immunoregulatory properties
- analysis of differentially expressed genes in response to bacterial stimulation in hemocytes of the carpet shell clam ruditapes decussatus: identification of new antimicrobial peptides
- molecular characterization of a novel big defensin from clam venerupis philippinarum
- heat shock proteins: facts, thoughts, and dreams
- heat shock proteins, cellular chaperones that modulate mitochondrial cell death pathways
- djla, a membrane-anchored dnaj-like protein, is required for cytotoxicity of clam pathogen vibrio tapetis to hemocytes
- alteration of venerupis philippinarum hsp gene expression in response to pathogen challenge and heavy metal exposure
- identification of two small heat shock proteins with different response profile to cadmium and pathogen stresses in venerupis philippinarum
- molecular characterization and expression analysis of a putative lps-induced tnf-alpha factor (litaf) from pearl oyster pinctada fucata
- cloning, characterization and expression analysis of the gene for a putative lipopolysaccharide-induced tnf-alpha factor of the pacific oyster
- molecular cloning and characterization of a putative lipopolysaccharide-induced tnf-alpha factor (litaf) gene homologue from zhikong scallop chlamys farreri
- first molluscan tnfr homologue in zhikong scallop: molecular characterization and expression analysis
- identification and expression of traf (tnf receptor-associated factor ) gene in zhikong scallop chlamys farreri
- diversity and pathogenicity of vibrio species in cultured bivalve molluscs
- an apoptosis-inhibiting baculovirus gene with a zinc finger-like motif
- detection of ostreid herpesvirus dna by pcr in bivalve molluscs: a critical review
- synopsis of infectious diseases and parasites of commercially exploited shellfish
- a fungus disease in clam and oyster larvae
- a novel method for snp detection using a new duplex-specific nuclease from crab hepatopancreas
- simple cdna normalization using kamchatka crab duplex-specific nuclease
- using the miraest assembler for reliable and automated mrna transcript assembly and snp detection in sequenced ests
- basic local alignment search tool
- blast go, a universal tool for annotation, visualization and analysis in functional genomics research
- gene ontology, tool for the unification of biology. the gene ontology consortium
- pyrosequencing of mytilus galloprovincialis cdnas: tissue-specific expression patterns
- insights into shell deposition in the antarctic bivalve laternula elliptica: gene discovery in the mantle transcriptome using pyrosequencing
- high-throughput sequencing and analysis of the gill tissue transcriptome from the deep-sea hydrothermal vent mussel bathymodiolus azoricus
- newt, a new taxonomy portal
- the new higher level classification of eukaryotes with emphasis on the taxonomy of protists
key: cord- - i fc r
title: viral genome sequencing by random priming methods
authors: djikeng, appolinaire; halpin, rebecca; kuzmickas, ryan; depasse, jay; feldblyum, jeremy; sengamalay, naomi; afonso, claudio; zhang, xinsheng; anderson, norman g; ghedin, elodie; spiro, david j
date: - -
journal: bmc genomics
doi: . / - - -
sha:
doc_id: cord_uid: i fc r
background: most emerging health threats are of zoonotic origin. for the overwhelming majority, their causative agents are rna viruses, which include but are not limited to hiv, influenza, sars, ebola, dengue, and hantavirus.
of increasing importance, therefore, is a better understanding of global viral diversity to enable better surveillance and prediction of pandemic threats; this will require rapid and flexible methods for complete viral genome sequencing. results: we have adapted the sispa methodology [ - ] to the genome sequencing of rna and dna viruses. we have demonstrated the utility of the method on various types and sources of viruses, obtaining near-complete genome sequence of viruses ranging in size from , - , kb with a median depth of coverage of . . we used this technique to generate full viral genome sequence in the presence of host contaminants, using viral preparations from cell culture supernatant, allantoic fluid and fecal matter. conclusion: the method described is of great utility in generating whole-genome assemblies for viruses with little or no available sequence information, viruses from greatly divergent families, previously uncharacterized viruses, or to more fully describe mixed viral infections. the emergence of highly pathogenic viral agents from zoonotic reservoirs has energized a wave of research into viral ecology and viral discovery [ ] [ ] [ ] [ ] and a parallel drive to develop large datasets of complete viral genomes for the study of viral evolution and pandemic prediction [ , ]. viral discovery has been aided by the development of sequence-independent methodologies for the generation of genomic data [ ]. the most prominent of these methodologies are representational difference analysis (rda) and sequence-independent single primer amplification (sispa), with several variations. the sispa method, first developed by reyes and kim [ ], entails the directional ligation of an asymmetric primer at either end of a blunt-ended dna molecule. following several cycles of denaturation, annealing and amplification, minute amounts of the initial dna are enriched and then cloned, sequenced and analyzed.
several modifications of the sispa method have so far been implemented including random-pcr (rpcr) [ ] . the rpcr method combines reverse transcription primed with an oligonucleotide made up of random hexamers tagged with a known sequence which is subsequently used as a primer-binding extension sequence. this initial modification was first used to construct a whole cdna library from low amounts of viral rna. a more recent modification, the dnase-sispa technique [ , , ] , includes steps to detect both rna and dna sequences. combining sample filtration through a . micrometer column and a dnase i digestion step led to the identification of viruses from clinical samples. the dnase-sispa technique has been used for the detection of novel bovine and human viruses from screens of clinical samples [ , , ] . other groups have used the protocol for the characterization of common epitopes in enterovirus [ ] , for the identification of a novel human coronavirus [ ] and for viral discovery in the plasma of hiv infected patients [ ] . in addition to its utility for viral discovery and viral surveillance, the dnase-sispa method has utility in obtaining full genome sequence from uncharacterized viral isolates or viral isolates from highly divergent families. in this study, we demonstrate the utility of the sispa method and its use as a rapid and cost effective method for generating full genome coverage of a wide range of viral types from several sources. optimization of the sispa method for whole genome sequencing given the success of earlier efforts for the identification of novel viral nucleic acids using sispa, we sought to adapt and optimize this method as a general and cost effective technique for large scale de novo viral genome sequencing (figures and ). an rnase treatment step was added to the sispa protocol to reduce contaminating exogenous rnas such as ribosomal rnas. 
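the tagged random priming at the heart of the rpcr/sispa scheme can be illustrated with a toy simulation; a minimal python sketch, using the tag sequence of the fr rv-n primer listed in the methods (spaces removed), with the hexamer draw and primer pool sizes as arbitrary assumptions:

```python
import random

# tag sequence taken from the fr rv-n primer given in the Methods (spaces removed)
TAG = "GCCGGAGCTCTGCAGATATC"

def make_tagged_hexamer():
    """Return one tagged random-hexamer primer: known tag + NNNNNN."""
    hexamer = "".join(random.choice("ACGT") for _ in range(6))
    return TAG + hexamer

# Every first-strand cDNA primed this way starts with the same known tag,
# so a single primer matching the tag can amplify all products in the PCR step.
primers = [make_tagged_hexamer() for _ in range(5)]
assert all(p.startswith(TAG) and len(p) == len(TAG) + 6 for p in primers)
```

the design choice here is the point of the method: randomness at the 3' end gives sequence-independent priming, while the shared 5' tag converts an unknown template population into amplicons with a universal primer-binding site.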
in the case of polya-tailed viruses, we perform reverse transcription using a combination of random (fr rv-n) and polyt-tagged (fr rv-t) primers in order to increase the coverage of the 3' end (figure ). additionally, in order to capture the 5' ends of viral rna, a random hexamer primer tagged with a sequence conserved at the 5' end was added to the klenow reaction (figure shows a 5' oligo specific for rhinoviruses). we have successfully used the sispa method on viral samples from different viral types. in this paper we discuss seven representative samples (table ). we have found that the method works consistently on dsdna, ssdna, positive-sense ssrna and negative-sense ssrna viruses. we have also found that the method can yield complete genome sequence of viruses ranging in size from , - , kb in a single experimental procedure. figure shows the sequence coverage obtained for three viruses: the positive-sense ssrna phage ms , the positive-sense ssrna rhinovirus and the negative-sense ssrna newcastle disease virus (ndv). figure a shows an analysis of sequence coverage for the viruses examined in this study. on average, four contigs were generated per experiment, ranging in size from nt to nt with a median contig size of nt. the contigs had high sequence redundancy, with a median depth of coverage of . , varying from . for turkey astrovirus (ta) to a high of . for ms . one parameter to take into consideration when designing an efficient protocol for construction of a sequence library is the number of independent colonies needed to obtain sequence coverage of a given reference genome. experiments were conducted using m (a kb genome), ndv (a kb genome), and lambda phage ( . kb) to compare the level of coverage obtained by bidirectional sequencing of , , and clones (figure b). for m , % genome coverage was achieved from sequencing one well block of clones, and % genome coverage was obtained from two well blocks. for ndv, . %, . % and .
% sequence coverage were obtained from one, two or three well blocks, respectively. in contrast to m and ndv, the coverage for lambda was . %, . %, and . % after one, two and three well blocks were sequenced. the efficiency of the sispa method as a tool for obtaining full genome coverage was analyzed using the lander-waterman model [ ], which estimates the number of gaps present as a function of sequence number and genome size. table compares the expected coverage and redundancy (depth of coverage) predicted by the lander-waterman model with the observed genome coverage and redundancy. with the exception of lambda phage, the observed coverage and redundancy approach the expected values. however, when taking into account the scaled difference, as described by wendl [ ], we see a dramatically increased "shortfall" between actual and expected coverage as more clones are sequenced. for example, in the case of ndv, which has a genome size of kb, the scaled difference d between the expected and observed coverage (see the equation description in the methods section) at the different levels of sequence redundancy is . for the sequencing of one plate of clones, . for two plates and . for three plates. the sispa method works efficiently on viruses purified from a number of sources and by several methods. enterobacteriophages m , ms and lambda were isolated from bacterial growth media and plasma after concentration by density gradient centrifugation. woodchuck hepatitis virus was purified from plasma by cesium chloride gradient centrifugation. human rhinovirus , purchased as a cell culture supernatant from atcc, was subjected to a low-speed spin to remove cellular debris. turkey astrovirus was isolated from fecal material collected from turkey poults showing clinical signs of diarrhea. the intestinal fecal content was diluted in pbs and centrifuged at , k before filtration and nuclease treatment.
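as a worked illustration of the lander-waterman comparison above, the ideal redundancy and coverage can be computed directly; a minimal python sketch (the genome size, read length and clone counts below are hypothetical examples, not the values elided from the text):

```python
import math

def lander_waterman(genome_bp, n_reads, read_bp):
    """Ideal redundancy r = L*n/G and ideal coverage = 1 - exp(-r)."""
    r = read_bp * n_reads / genome_bp
    return r, 1.0 - math.exp(-r)

# hypothetical example: a 15 kb genome, 96-clone blocks sequenced bidirectionally
for blocks in (1, 2, 3):
    n_reads = blocks * 96 * 2          # 96 clones per block, read from both ends
    r, cov = lander_waterman(15_000, n_reads, 700)
    print(f"{blocks} block(s): redundancy {r:.2f}, expected coverage {cov:.4f}")
```

this makes the diminishing-returns effect discussed above concrete: expected coverage saturates quickly with added blocks, so the remaining shortfall from 100% reflects biases rather than read number.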
newcastle disease virus rna was purified from allantoic fluid derived from inoculated eggs. to determine the number of viral particles necessary to generate full genome sequences, we conducted dilution series with viruses whose titers were determined by plaque assays. the results of these experiments demonstrate that the sispa method is very efficient as a genome sequencing method for samples with greater than viral particles per rt-pcr reaction (figure ). below particles, the specific viral signal is overwhelmed by competition with non-specific or host sequences and is rarely detected from sequencing two blocks ( ) of colonies. our initial results indicated low sequence coverage at the 5' and 3' ends of most viral genomes.
[figure legend: overview of the strategy. viral particles are separated from host contaminants using centrifugation and filtration. viral particles are treated with dnase i to remove contaminating nucleic acids. random priming is used to generate - bp amplicons, which are size-selected and cloned. colonies are picked and sequenced. sequence is trimmed and assembled. contigs are closed using sequence-specific primers. construction of a library of amplicons and sequencing of randomly selected clones; sequence analysis and reconstruction of full or partial genome sequences.]
in order to address this problem in viruses with polya tails, the fr rv-t primer (figure ) is added to the rt reaction. this increases the number of cdnas produced at the 3' end of the genome, and results in a much greater depth of coverage at the 3' end. the polyt-containing primer is added to the rt reaction at a concentration -fold lower than that of the random primer in order to reduce competition with the random primer. we used human rhinoviruses to develop the methodology for improving the coverage of the 5' end. we took advantage of a region conserved from nucleotide to nucleotide in the 5' untranslated region.
the conserved primer was used in the klenow step of the sispa protocol to enrich for amplicons from the 5' end. when used in combination with the 3' primer, we have been able to obtain full rhinovirus genome coverage in a -clone experiment (data not shown). one inherent difficulty of a method that relies on random reverse transcription and pcr to generate amplicons for sequencing is the likelihood of detecting contaminant sequences as well as sequences of interest. although filtration and nuclease treatment reduce the presence of nucleic acids from whole cells and host chromosomes, contaminating rna species will inevitably remain and thus be amplified (table ). to determine the presence of contaminant sequences in the clone population, all generated sequences were subjected to a blastn search against the ncbi non-redundant database. a cutoff e-value of - was used to identify viral sequences that matched the reference genome. non-specific sequences (i.e., those that did not match the input viral isolate) were identified as mammalian, avian, bacterial, etc., if their best hit was below a cut-off value of - . if no blast results were found below the - cut-off value, the sequences were not given a specific designation. in experiments resulting in nearly complete genome sequences, contaminant sequences ranged from - %. the nature of the contaminant sequence depended on the initial viral host and included mammalian, avian, bacterial, fungal, viral and unknown sequences. in the case of rhinoviruses, which were purified from hela cell culture, the majority of contaminants derived from human or mycobacterial nucleic acids.
[figure legend: outline of the sispa method.]
newcastle disease virus (ndv) and turkey astrovirus (ta), which were purified from chicken egg allantoic fluid and turkey feces, respectively, were contaminated primarily with nucleic acids of avian origin. table shows the results of blast analyses of two samples, ta and hrv .
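the e-value triage just described can be sketched as a small classifier; a minimal python sketch under assumed cutoffs (the exponents are elided from this excerpt, and the input format here is hypothetical):

```python
def classify_hits(hits, viral_cut=1e-10, contaminant_cut=1e-5):
    """Assign each query to 'viral', a contaminant taxon, or 'unknown'.

    `hits` maps query id -> (best-hit taxon, best e-value). The cutoff
    values are placeholders, not the ones used in the original study.
    """
    calls = {}
    for query, (taxon, evalue) in hits.items():
        if taxon == "reference" and evalue <= viral_cut:
            calls[query] = "viral"                 # matches the input isolate
        elif evalue <= contaminant_cut:
            calls[query] = taxon                   # e.g. mammalian, avian, bacterial
        else:
            calls[query] = "unknown"               # no hit below the cutoff
    return calls

example = {
    "read1": ("reference", 1e-40),
    "read2": ("avian", 1e-12),
    "read3": ("bacterial", 0.5),
}
print(classify_hits(example))
```

in practice this three-way split (viral / named contaminant / no designation) is what allows the contaminant percentages reported above to be tabulated per sample.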
the work presented here demonstrates the utility of the random genome sequencing method for the generation of viral sequence from positive-strand ssrna viruses (human rhinovirus, turkey astrovirus), a negative-strand ssrna virus (newcastle disease virus), an ssdna virus (enterobacteriophage m ) and dsdna viruses (woodchuck hepatitis virus and lambda phage). in addition, using the dnase i-sispa technique we were able to amplify sufficient target material for sequencing from various sources, including cell culture isolates and field isolates that had not been purified by ultracentrifugation. although ultracentrifugation is an efficient procedure for purifying viruses, it is not practical for processing samples of relatively low viral titer in a small volume or for high-throughput processing of viral samples for genomic sequencing. genome coverage and redundancy for viral samples from - kb approach the ideal values predicted by the lander-waterman model [ ]. however, as the sequence number increases, the efficiency of the method as measured by the scaled difference [ ] decreases dramatically. thus, while the number of gaps declines as more clones are sequenced, the efficiency is reduced (i.e., there is more 'loss'). remaining gaps and areas of × coverage may be due to regions of secondary structure, hydrolysis of the rna template or cloning bias. additionally, at-rich regions may inhibit the annealing of random primers during the rt, klenow or pcr steps. we routinely pick a total of clones (or two well blocks) per viral sample for bidirectional sequencing, as this represents the most affordable sequence coverage-to-efficiency ratio. while significant coverage is obtained from a single experiment, final genome assembly requires varying levels of targeted rt-pcrs to close the genome (figure ). the 3' end of the virus generally has the lowest coverage in any use of this protocol.
in theory, given the directionality of the reverse transcriptase (which reads the template 3' to 5') and assuming an equal distribution of binding sites for the random primer, the 5' end of any viral genome will get a higher depth of sequence coverage than the 3' end. we have found that the addition of a tagged oligo-dt primer significantly reduces this problem for viruses with polya tails (most positive-sense ssrna viruses), but this remains a limitation for other virus genome types. the 5' end of most viruses has also proved difficult to complete, and we have found that adding degenerate oligos based on conserved 5' sequences to the rt reaction can increase coverage. however, we have not been able to develop a universally applicable method for obtaining complete 5' coverage. we strongly anticipate that specific adaptations of the sispa method to conserved regions of different viruses will demonstrate its versatility in a wide range of viral genome sequencing initiatives. limitations of the method include the need for samples containing a minimum of particles (in the original ml or . ml samples). moreover, because the capsid structure renders viral genomes nuclease-resistant, this protocol requires encapsidated viral genomes to allow the removal of most extra-viral contaminants. the viral nucleic acids in samples whose capsid structures have been disrupted cannot be separated from contaminants, and therefore cannot be efficiently amplified by sispa. in the experiments discussed in this paper, dnase i was used to reduce host contaminants. for samples with high levels of host nucleic acid contamination, we have used µg of rnase a to treat µl of filtered virus for h. we have found that rnase a treatment eliminates the majority of host rna-derived sequence contaminants in these cases. the sispa method is particularly useful for obtaining genome sequence from rna viruses.
because most sequencing methods for rna viruses depend on rt-pcr with primers designed from pre-existing sequence data, the utility of this protocol is particularly evident for highly variable or degenerate viral families, or for viruses with little available sequence information.
[figure legend: relationship between initial virus particle number, genome coverage and percent non-specific sequences generated by sispa. ms viruses were diluted to , , , and particles per sispa dnase i reaction. the sum of the total lengths of edited contigs for each dilution was calculated as a percent of the total reference genome length. non-specific sequences were determined as those sequences that did not match the reference genome with a cutoff value less than - . observed coverage and redundancy were compared with the expected coverage and redundancy as predicted by the lander-waterman model for the total number of sequences in each assembly.]
in addition, the sispa method will be useful for uncharacterized viruses, as no prior sequence information is required. viral rna and dna were prepared following the guidelines provided in [ , ]. sequences were analyzed against a non-redundant database using the blastn algorithm. viral-specific sequences were identified as matching the reference genome with a blastn cut-off below - . non-specific (non-viral contaminant) sequences were identified if they had a cut-off value below - ; 'none' means that no blastn results were found below the - cut-off value. the extracted rna was processed for random reverse transcription as previously described [ , ] using the fr rv-n primer (5'-gcc gga gct ctg cag ata tcn nnn nn-3') at a concentration of µm. in addition, fr rv-t (5'-gcc gga gct ctg cag ata tc(t) -3') was added at a concentration of nm to specifically amplify the 3' end of positive-strand viruses.
after the first cdna synthesis, the double-stranded cdna was synthesized by klenow reaction in the presence of random primers. in order to amplify ' ends of rhinoviruses, the following primer was added to the klenow reaction at a concentration of - nm ( 'gcc gga gct ctg cag ata tc tta aaa ctg g '). pcr amplification used high fidelity taq gold dna polymerase (abi) with the fr rv primer ( ' gcc gga gct ctg cag ata tc '). pcr amplicons were a-tailed with datp and units of low fidelity dna polymerase (invitrogen) at °c for minutes. a-tailed pcr amplicons were analyzed in a % agarose gel and fragments between and nt were gel purified. amplicons were ligated en masse into the topo ta cloning vector (invitrogen) and transformed into competent one shot topo top bacterial cells (invitrogen). for dna viruses, the purified viral dna was denatured and complementary strands were synthesized by klenow reaction as indicated for ds-cdna from first strand cdna. clones were plated on lb/amp/xgal agar, and individual colonies were picked for sequencing. the clones were sequenced bidirectionally using the m primers from the topo ta vector. we routinely sequenced a total of clones or more per library. sequencing reactions were performed at the joint technology center (an affiliate of the j craig venter institute: jcvi) on an abi xl sequencing system using big dye terminator chemistry (applied biosystems). in the lander and waterman analysis of genome coverage [ ] , g = size (bp) of reference genome, l = sequence length (bp), and n = # sequences; redundancy represents the depth of sequence coverage and coverage represents the fraction of the genome covered by sequence data. the ideal redundancy r = ln/g and the ideal coverage = 1 - e^(-r) [ ] . observed coverage = sum of the length of all contigs/g. observed redundancy = the average of total sequence length (length of all sequence reads in a contig including gaps)/contig length. both observed coverage and observed redundancy are experimentally derived values.
the average sequence read size for the experiments described was . +/- . bp. the loss of coverage due to various biases is represented as the difference between the ideal coverage and the actual coverage. to allow quantitative comparison, this 'shortfall' difference is scaled by the standard deviation of the coverage probability distribution as given by wendl [ ] . following wendl, we use the moments of the vacancy (which is the complement of the coverage) to calculate the standard deviation. in the first moment, α is the ratio of the read length to the genome length and n is the number of reads; in the second moment, ρ, the redundancy, is defined to be equal to nα. from these two moments follow the expression for the variance and, in turn, the standard deviation. using the standard deviation for the vacancy in place of that for the coverage, the correctly scaled difference d between the ideal coverage and the actual coverage a can then be computed. note that for large n the mean vacancy converges to exp(-ρ), allowing a simplified approximation of the standard deviation s. sequence reads were trimmed to remove amplicon primer sequence as well as low-quality sequence, and assembled.
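the lander-waterman quantities used throughout this section (ideal redundancy r = ln/g and ideal coverage 1 - e^(-r)) can be computed directly. the numeric values in the usage note below are illustrative only, not figures from the study.

```python
import math

# Lander-Waterman expectations as defined in the text:
# g = reference genome size (bp), l = read length (bp), n = number of reads.

def ideal_redundancy(g, l, n):
    """Ideal redundancy r = ln/g (expected depth of coverage)."""
    return l * n / g

def ideal_coverage(g, l, n):
    """Expected fraction of the genome covered, 1 - e^(-r).
    Equivalently, the expected vacancy (uncovered fraction) is e^(-r)."""
    return 1.0 - math.exp(-ideal_redundancy(g, l, n))
```

for example, with an illustrative 10 kb reference, 500 bp reads and 100 reads, r = 5 and the expected coverage is about 99.3%; the observed coverage computed from assembled contigs can then be compared against this expectation, as in the shortfall analysis above.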
references:
a virus discovery method incorporating dnase treatment and its application to the identification of two bovine parvovirus species
cloning of a human parvovirus by molecular screening of respiratory tract samples
rna viral community in human feces: prevalence of plant pathogenic viruses
the marine viromes of four oceanic regions
method for discovering novel dna viruses in blood using viral particle selection and shotgun sequencing
environmental genome shotgun sequencing of the sargasso sea
metagenomic analysis of coastal rna virus communities
whole-genome analysis of human influenza a virus reveals multiple persistent lineages and reassortment among recent h n viruses
large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution
virus discovery by sequence-independent genome amplification. reviews in medical virology
sequence-independent, single-primer amplification (sispa) of complex dna populations. molecular and cellular probes
a random-pcr method (rpcr) to construct whole cdna library from low amounts of rna
identification of a third human polyomavirus
identification of enteroviruses by using monoclonal antibodies against a putative common epitope
identification of a new human coronavirus
new dna viruses identified in patients with acute viral infection syndrome
genomic mapping by fingerprinting random clones: a mathematical analysis
occupancy modeling of coverage distribution for whole genome shotgun dna sequencing

financial support for this project was provided by the institute for genomic research/j. craig venter institute. our thanks to claire fraser, eric eisenstadt and stephen liggett for their advice and support. the following complete genomes were used as reference genomes for the viruses discussed in this study: the author(s) declares that there are no competing interests. ad participated in drafting the article, experimental design, and data analysis and carried out molecular studies. jd participated in data analysis.
rh, rk, jf and ns participated in experimental design and carried out molecular studies. ca, xz and ng provided materials used in the study. eg participated in experimental planning and drafting the manuscript. ds conceived of and coordinated the study, and participated in data analysis and drafting the manuscript.

key: cord- -lj us dq authors: flower, darren r.; davies, matthew n.; doytchinova, irini a. title: identification of candidate vaccine antigens in silico date: - - journal: immunomic discovery of adjuvants and candidate subunit vaccines doi: . / - - - - _ sha: doc_id: cord_uid: lj us dq the identification of immunogenic whole-protein antigens is fundamental to the successful discovery of candidate subunit vaccines and their rapid, effective, and efficient transformation into clinically useful, commercially successful vaccine formulations. in the wider context of the experimental discovery of vaccine antigens, with particular reference to reverse vaccinology, this chapter adumbrates the principal computational approaches currently deployed in the hunt for novel antigens: genome-level prediction of antigens, antigen identification through the use of protein sequence alignment-based approaches, antigen detection through the use of subcellular location prediction, and the use of alignment-independent approaches to antigen discovery. reference is also made to the recent emergence of various expert systems for protein antigen identification. of smallpox were reported annually from across the globe, leading to about million deaths a year. yet, today, the disease has been completely eradicated. in the last years, there have been no known cases. poliomyelitis, or polio, is the other large-scale disease which has come closest to eradication.
its success too has been formidable: in , the pan american health organization effectively eradicated polio from the western hemisphere, since when the global polio eradication programme has significantly decreased the overall incidence of poliomyelitis through the rest of the world. in , there were approximately , cases spread through countries; in the past years, global figures amounted to less than , annually. yet, in spite of such remarkable success, death from vaccine-preventable diseases remains unacceptably high [ ] . there are over common infectious diseases responsible for one in four deaths globally. rotavirus and pneumococcus are pathogens causing diarrhoea and pneumonia, the leading causes of infant deaths in underdeveloped countries. in the next decade, effective, widespread vaccination programs against such pathogenic microbes could save the lives of . million children under years of age. hepatitis b causes , deaths in adults and children aged over . seasonal, non-pandemic influenza kills upwards of half a million globally each year. for those aged under in particular, a series of diseases causes an extraordinary and largely preventable death toll. for example, tetanus accounts every year for , deaths, pertussis is responsible for over , deaths, hib gives rise to in excess of , deaths, diphtheria accounts for , deaths, and yellow fever over , deaths. arguably, the most regrettable, the most lamentable situation is that of measles. measles accounts for the unneeded deaths of , under-fives and over , adults and older children. despite this, the situation is by no means bleak. by the close of , approximately million had been vaccinated against hib and million children against hepatitis b. during its first decade, vaccinations against polio, hep b, hib, measles, pertussis, and yellow fever funded by gavi had prevented the unnecessary loss of over million lives. 
there are approximately vaccines licensed for use in humans; around half of these are widely prescribed. yet, most of these vaccines target the prevention of common childhood infections, with the remainder addressing tropical diseases encountered by travellers to the tropics; only a relatively minor proportion combat endemic disease in under-developed countries. balancing the persisting need against the proven success and anticipated potential, vaccines remain an area of remarkable opportunity for medical advance, leading directly to unprecedented levels of saved and improved lives. from a commercial perspective, the vaccine arena has long been neglected, in part because of the quite astonishing success limned above; today, and in comparative terms at least, activity within vaccine discovery is feverish [ , ] . during the last years, tens of vaccines and vaccine candidates have moved successfully through clinical trials, and vaccines in late development number in the hundreds. in stark contrast to antibiotics, vaccine resistance is negligible and nugatory. despite the egregious and outrageous success enjoyed by vaccines, many major issues persist. the world health organisation long ago identified tuberculosis (tb), hiv, and malaria as the three most significant life-threatening infectious diseases globally. no vaccine has been licensed for malaria or hiv, and there seems little realistic hope for such vaccines appearing in the immediate future. bacille calmette guérin (bcg), the key anti-tb vaccine, is of limited efficacy [ ] . levels of morbidity and mortality generated by diseases already targeted by vaccines remain high. influenza is the key example, with a global annual estimated death toll in the region of half a million.
in the twenty-first century, the world continues to be threatened by infectious and contagious diseases of many kinds: visceral leishmaniasis, marburg's disease, west nile, dengue, as well as sars, potentially pandemic h n influenza, and over human and emerging zoonotic infections, as well as the persisting threat from hiv, tb, and malaria mentioned above. all this is further compounded by the additional risk arising from antibiotic-resistant bacteria and bioterrorism, not to mention major quasi-incidental issues, such as climate change, an accelerating growth in the world's population, increased travel, and the overcrowding seen within the burgeoning populations concentrated into major cities [ ] . for reasons we shall touch on below, the discovery of vaccines is both more urgent and more difficult than it has ever been. in an era where conventional drug discovery has been seen to fail-or at least as seen by cupiditous investors, for whom the current model of pharmaceutical drug discovery is broken-vaccines are one of a number of biologically derived therapies upon which the future economic health of the pharmaceutical industry is thought to rest. the medical need, as stated above, is clear. set against this is the unfortunate realisation that vaccines exist for most easily targeted diseases, those mediated by neutralising antibodies, and so outstanding vaccine targets are those of more intractable diseases mediated primarily by cellular immunity. to address those properly requires what all discoveries require: hard work and investment; but they also need new ideas, new thinking, and new vaccine discovery technology. amongst these are computational techniques, the most promising of which are those targeting the discovery of novel vaccine antigens: the candidate subunit vaccines of tomorrow (see fig. . ).
vaccines are agents-either molecular (epitope- or antigen-based vaccines) or supramolecular (attenuated or inactivated whole-pathogen vaccines)-which are able to create protective immunity against specific pathogenic infectious microorganisms and any diseases to which they might give rise. protective immunity can be characterised as an enhanced but highly specific response to consequent re-infection (or infection by an evolutionarily closely related micro-organism) made by the adaptive immune system. such increased or enhanced immunity is facilitated by the quantitative and qualitative augmentation of immune memory, which is able to militate against the pernicious effects of infectious disease. vaccines synergise with the herd immunity they help engender, leading to reduced transmission rates as well as prophylaxis against infection. the term "vaccine" derives from vacca (latin for cow). the words vaccine and vaccination were coined specifically for anti-smallpox immunization by the discoverer of the technique, edward jenner ( - ). these terms were later extended by louis pasteur ( - ) to include a far more extensive orbit or remit, including the entire notion of immunisation against any disease [ , , ] . several fundamentally distinct varieties of vaccine exist. these include, inter alia: inactivated or attenuated whole-pathogen-based vaccines; subunit vaccines based on one or more protein antigens; vaccines based upon one or more individual epitopes; carbohydrate-based vaccines; and combinations thereof. hitherto, the best-used and, thus, the most successful types of vaccine were built from attenuated-"weakened" or non-infective or otherwise inactivated-pathogenic whole organisms, be they bacterial or viral in nature. well-known examples include the following: the bcg vaccine, which acts prophylactically against tuberculosis, and albert sabin's anti-poliomyelitis vaccine based on attenuated poliovirus.
the vast majority of subunit vaccines are immunogenic protein molecules, and are typically discovered using a somewhat haphazard search process. concerns over the safety of whole-organism vaccines long ago prompted the development of other kinds of vaccine strategy, including those based upon antigens as the innate or immanent active biological constituent of either single or composite vaccines. the vaccine which targets hepatitis b is a good exemplar of a so-called subunit vaccine, as it is based on a protein antigen: the viral envelope hepatitis b surface antigen. other types of as-yet-unproven vaccines include those based on epitopes and others based on antigen-presenting cells; many have entered clinical trials, but none have fulfilled their medical or commercial potential. whole antigen discovery: in a reverse vaccinology process, the discovery of candidate subunit vaccines begins with a microbial genome, perhaps newly sequenced, progresses through an extensive computational stage, and ultimately delivers a shortlist of antigens which can be validated through subsequent laboratory examination. the computational stage can be empirical in nature; this is typified by the statistical approach embodied in vaxijen [ ] . or this stage can be bioinformatic; this involves predicting subcellular location and expression levels and the like. or, this stage can take the form of a complex mathematical model which uses immunoinformatic models combined with mathematical methods, such as metabolic control theory [ ] , to predict cell-surface epitope populations. it is often difficult to capture the proper scientific meaning and use of recondite terms, often borrowed from common usage or archaic language. so, let us be more specific. an immunogen-a molecular moiety exhibiting the property of immunogenicity-is any material or substance capable of eliciting a specific immune response.
an antigen, on the other hand, is a molecular moiety exhibiting the property of antigenicity. it is a substance or material recognised by a primed immune system. such a persisting state of immune readiness may be mediated by humoral immunity (principally via the action of soluble antibodies) or by cellular immunity (as mediated by t-cells, antigen presenting cells (apcs), or other phagocytic cells), or a combination of both, in what is often referred to as a "recall" response. immunogenicity is vital: it is the signature characteristic or property that prompts a certain molecular moiety to evoke a significant immune response. here, we shall strictly limit use of "immunogen" and "antigen" to a sole meaning. here, an "antigen" or an "immunogen" will mean a protein that is capable of educing some kind of discernible response from the host immune system. specifically, and for practical reasons, we will almost exclusively be referring to proteins derived from a pathogenic micro-organism. at present, the prophylaxis engendered by all current effective vaccines-all except bcg-is primarily mediated by the humoral immune system, via soluble antibodies. however, the disease mechanisms of most serious diseases for which vaccines are not available are usually mediated by cellular immunity. thus, for untreated disease, we seek to identify immunogenicity generated principally by cellular responses or by a combination of cellular and humoral responses, rather than by humoral immunity alone. to some extent, subunit vaccines can be thought to represent something of a compromise between vaccines based on attenuated or otherwise inactivated whole organisms and the many more recent and more innovative vaccine strategies typified by epitope or poly-epitope vaccines. vaccines based around whole pathogens have long engendered safety concerns [ ] [ ] [ ] .
from the lubeck disaster and the cutter incident [ ] [ ] [ ] to the recent mmr debacle, issues over safety, real or imagined, have always dogged the development of vaccines [ , ] . indeed, during the eighteenth century the pre-vaccination practice of variolation against smallpox prefigured much of the current debate over the perceived danger of vaccines [ ] . while the case for vaccines is unanswerable, we should not be complacent. any live vaccine, however extensively attenuated, can revert to a pathogenic, disease-inducing form. this is currently an on-going issue for polio vaccination [ ] . other issues, particularly the chemical or biological contamination of vaccines during manufacture, remain enduring and persistent problems. undesired immunogenicity, the type leading to severe and pathological immune responses, rather than enduring immune memory, is a concern for both whole-organism and subunit-based vaccines, as well as putative biologics [ ] . immunologists and vaccinologists have thus long sought alternatives to the use of whole organisms as vaccines. subunit vaccines and conjugate vaccines are one such alternative. vaccines based on epitopes, singly or in combination, are another. the diversity of innovations in vaccine design holds much potential for success but, thus far at least, has proved spectacularly unsuccessful in a clinical context. logically, a vaccine that relies solely on, at most, a few well-chosen epitopes should be effective, efficacious, and, above all, safe. epitopes, as peptides, may be cytotoxic and might possibly prompt some kind of inopportune immune response, but cannot be infective or revert to infectivity. in many ways, epitopes are closer in size to, and share many properties with, synthetic small molecules; possibly dealing with their pharmacokinetics as such may be better than thinking of them as biologic drugs.
in practice, of course, epitope-based vaccines, like subunit vaccines, suffer from poor immunogenicity, necessitating the use of a complex combination of adjuvants and complicated delivery systems. for diverse reasons, including immunogenicity, stimulating protective immune responses against intracellular pathogens remains problematic when using non-replicating vaccines. why should this be? first, the immune response is very complex, involving both innate and adaptive immunity, and significant interaction between them. in all probability, and particularly when viewed in the context of the whole population, many epitopes and danger signals are involved; likewise, the many different immune actors, be they acting at the cellular or molecular levels, interact with each other and are subject to complex mechanisms of genetic, epigenetic, and system-level control and regulation. it may be that only large and complex organism-sized vaccines can induce the range of immune responses necessary across the population to induce protection, since they comprise a potential host of immunogenic molecular moieties, not just a single immunodominant epitope (see fig. ). in that which follows, we shall seek to explore the availability and accessibility of informatic techniques and informatic tools used to identify candidate subunit vaccines of microbial origin. yet, we shall start by adding context with an examination of experimental approaches to antigen discovery: so-called reverse vaccinology. reverse vaccinology already relies on informatics but, in a sense at least, what we would like to do using informatics is to reproduce as much as is possible the steps inherent in successful reverse vaccinology in silico rather than in vitro. reverse vaccinology, and the necessary computational support, is a much more prevalent means of identifying subunit vaccines [ ] ; see fig. . . even today, many experimentalists retain a deep and atavistic distrust of all computation.
experimentalists seldom trust the reliability and dependability of computational methodology, choosing to trust instead in what they believe to be the infallible, if actually rather elusive, empirical reliability of observations, experiments, and the whole paraphernalia of laboratory experimentation. yet, things are in the process of changing, and this change is likely to accelerate as we move forward into a future that looks more parsimonious and uncertain by the day. vaccines have come a long way from the days when they were prepared directly from the fluids of smallpox pustules or extracts of infected spinal cords. yet vaccine discovery and development remains firmly empirical. many modern vaccines still comprise entire inactivated pathogens. while vaccines targeting papillomavirus, tetanus, hepatitis b, and diphtheria are subunit vaccines, few are recombinant proteins devoid of contaminants. some would argue that the only molecular vaccines are glycoconjugates: oligosaccharides conjugated to immunogenic carrier proteins. conventional empirical, experimental, laboratory-based microbiological ways to identify putative candidate antigens require cultivation of target pathogenic micro-organisms, followed by teasing out their component proteins and analysing them in a series of in-vitro and in-vivo assays and animal models, with the ultimate objective of isolating one or two proteins displaying protective immunity. unfortunately, in reality, the process is more complex, more confusing, and much more confounding than this brief synopsis might suggest. cultivating pathogens outside the environment offered by their host organism can be difficult, even impossible. not every protein is readily expressed in adequate quantities in vitro, and many proteins are only expressed on an intermittent basis during the time course of infection. thus, a considerable number of potential, putative, and possible vaccine candidate antigens could be missed by conventional experimental approaches.
reverse vaccinology [ ] [ ] [ ] [ ] has the potential to analyse genomes for potential antigens, initially scanning "open reading frames" (orfs), then selecting proteins that are open to surveillance by the host immune system. this usually involves some complex combination of informatic-based prediction methodologies. recombinant expression of the resulting set of identified molecules can overcome their reduced natural abundance, which has often prevented us from recognising their true potential. by enlarging the repertoire of native antigens, this technology can help to foster the development of a new cohort of vaccines. reverse vaccinology was originally established by studying neisseria meningitidis, which is responsible for meningococcal meningitis and sepsis. vaccines are currently available for all serotypes, except serogroup b. n. meningitidis orfs were found initially [ , ] ; proteins were then identified, expressed in vitro and found to be surface exposed. seven proteins elicited immunity over many strains. the culmination of this work was a "universal" vaccine for serogroup b based on five antigens [ ] . this protovaccine, when used with alum as adjuvant, induced murine bactericidal antibodies versus % of meningococcal strains drawn from the world population of n. meningitidis. strain coverage increases to over % when used with cpg or mf as adjuvant. another key illustration is porphyromonas gingivalis, an anaerobic gram-negative bacterium found in the chronic adult inflammatory gum disease periodontitis. initially, orfs were identified [ ] ; of these, protein sequences were open to immune surveillance and were positive for several sera. two antigens were found to be protective in mice. yet another fascinating instance is provided by streptococcus pneumoniae, a prime cause of meningitis, pneumonia, and sepsis [ , ] . in this study, potential orfs were initially identified, with of these proteins being readily expressed.
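the orf-scanning first step of reverse vaccinology described above can be illustrated with a deliberately minimal sketch. real annotation pipelines are far more elaborate (both strands, alternative start codons, statistical gene models); here the forward-strand atg-to-stop scan and the min_aa length threshold are simplifying assumptions for illustration only.

```python
# Toy ORF scanner: report ATG..stop stretches of at least min_aa codons on
# the forward strand, in all three reading frames. The min_aa=100 default is
# an arbitrary illustrative threshold, not a value from any cited study.

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_aa=100):
    """Return (start, end) coordinates of forward-strand ORFs running from
    an ATG to the next in-frame stop codon, inclusive of the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i  # open a candidate ORF at the first ATG
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_aa:
                    orfs.append((start, i + 3))
                start = None  # close the frame and look for the next ATG
    return orfs
```

in a reverse vaccinology setting, the resulting orf list would then be filtered by downstream predictors (subcellular location, surface exposure, and the like) to produce the antigen shortlist.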
finally, six proteins were seen to induce protection against the pathogen. more recently, other and more advanced experimental techniques, such as microarrays, are beginning to come on-stream, opening up a gallimaufry of possible technologies to the new but maturing field of reverse vaccinology. the following gives but a taste of what is to come. using ribosome display to undertake in-vitro protein selection, weichart et al. [ ] identified, within the methicillin-resistant col strain of the virulent human pathogen staphylococcus aureus, genes the majority of which encoded secreted or surface-localized proteins; of these, % had cell envelope function, % were transporter proteins, and % were virulence factors or toxins. using an ingenious combination of advanced proteomics techniques and in-vitro assays, giefing et al. [ ] identified novel vaccine candidates which prevented infections, in children and in the elderly, caused by a variety of pneumococcus serotypes; four demonstrated major protection against sepsis in animals. two leads-stkp (a serine/threonine protein kinase) and pcsb (a structural protein with a role in cell wall separation of group b streptococcus)-showed clear cross-protection as potential candidate vaccines against four separate pneumococcal serotypes. using a whole proteome microarray, and in order to identify protein antigens, eyles et al. [ ] probed serum from balb/c mice previously immunized with a vaccine comprising killed francisella tularensis and two immunomodulatory adjuvants. eleven out of the top twelve immunogenic antigens were already known to be immunoreactive, although further proteins were discovered using this experimental approach. in further work from this consortium, titball and co-workers [ ] constructed a protein microarray of , burkholderia pseudomallei proteins, treated it with patient samples, identifying antigens.
this smaller set was treated with further, distinct sera from groups of patients, identifying putative candidate antigens. this survey, brief though it is, helps to highlight the potential power of reverse vaccinology for vaccine discovery. however, since the number of antigens is high, and given all the potential difficulties in characterising and expressing them, it is important to note that both computational and experimental techniques and methodologies will doubtlessly omit important and interesting proteins from further analysis, though not necessarily for the same or similar reasons. thus, within the burgeoning discipline of reverse vaccinology, both computational and experimental techniques are in need of constant development and improvement. compared to its role in drug discovery, genomics, and a host of other bioscience sub-disciplines, bioinformatics support for the preclinical discovery and development of vaccines is in its infancy; yet, as interest in vaccine discovery increases, the situation changes. there are two key types of bioinformatics support for vaccine design, discovery, and development. at the technical level, the first of these cannot be properly or meaningfully distinguished from general support for target discovery. it includes the annotation of pathogen genomes, more conventional host genome annotation, and the statistical analysis of immunological microarray experiments. the second form of support concentrates on immunoinformatics, that is, the informatics analysis of immunological problems, principally epitope prediction. b-cell epitope prediction remains defiantly basic, or is largely dependent on a sometimes unavailable knowledge of three-dimensional protein structure. both structure- [ ] and data-driven [ ] prediction of antibody-mediated epitopes evince poor results. however, methods developed to predict t-cell epitopes now possess considerable algorithmic sophistication.
moreover, they continue to develop and evolve, as well as extend their scope and remit to address new and ever larger and more challenging epitope prediction problems. presently, accurate and reliable t-cell epitope prediction is restricted to predicting the binding of peptides to the major histocompatibility complex (mhc). class i peptide-mhc prediction can be reasonably accurate, or is for properly characterised, well-understood alleles [ ] . yet a number of key studies have demonstrated that class ii mhc binding prediction is almost universally inaccurate, and is thus erratic and unreliable [ ] [ ] [ ] . a similar situation persists for structure-driven prediction of mhc epitopes [ , ] . irrespective of poor predictive performance, several other problems exist for epitope prediction. for t-cell prediction in particular, a prime concern is the availability, or rather the lack of availability, of relevant data. it is now known that immunogenic t-cell epitopes, previously thought to be peptides no more than amino acids in length, can be or more residues long. longmer epitopes now greatly expand the number of possible peptides open to inspection by t cells [ ] [ ] [ ] [ ] . the inadequate results generated by b-cell epitope prediction algorithms may indicate that a fundamental reinterpretation of extant b-cell epitope data is necessary before improved methods become feasible. these factors, when taken together, are consistent with the notion that methods relying only on the possession of certain epitopes will not be fully effective when tasked with antigen or immunogen identification. this is supported by information indicating a lack of correspondence between selected antigens and experimentally verified protective proteins. there are many means of identifying antigenic proteins.
most focus on the properties of protein sequence and structure, but arguably one of the most insightful approaches is instead to examine properties, both local and global, of the underlying nucleic acid. one notable way is to look for evidence of the horizontal or lateral transfer of so-called pathogenicity islands or pais. horizontal transfer, such as transformation, conjugation, or transduction, is distinct from the vertical transfer of genetic material from an ancestor within its lineage. it typically involves an organism incorporating genetic material from an evolutionarily distant organism without being its offspring. pais are a specific type of genomic island; that is, part of a genome acquired through direct transfer between microbes. a genomic island can occur in distantly related species and may be mono- or multi-functional; there are many sub-classes classified by function. other examples include antibiotic resistance islands, metal resistance islands, and secretion system islands. the gene products of pais are crucial to the propagation of disease pathogenesis, much as the pais themselves are key to the evolution of pathogenesis. pathogen-associated type iii and type iv secretion systems are, for example, often found together in the same pai. detecting such large (> kb) and discrete clusters of genes, habitually possessing a characteristically atypical g/c content, at least when compared with the remainder of the genome, leads, in turn, to the identification within clusters of individual virulence-associated protein antigens. prokaryotic pais are frequently associated with trna-encoding genes, many are flanked by repeat structures, and many contain fragments of mobile genetic elements such as plasmids and phages. pais can be identified by combining analysis of nucleotide composition and phylogeny, amongst others. composition-based approaches rely on the natural variation between genome sequences from different species.
regions of the genome with abnormal composition, as demonstrated by nucleotide or codon bias, may have been transferred horizontally. such methods are prone to inaccuracies; these result from inherent genomic sequence variation, such as is seen in highly expressed genes, and from the observation that over time the sequences of genomic islands alter to mirror the composition of their host genomes. evolution-based approaches seek regions that may have been transferred horizontally by comparing related species. put at its simplest: a putative genomic island present in one species, but absent from several related species, is consistent with horizontal transfer. of course, the island may have been present in the last common ancestor shared by the species compared and subsequently been lost from the other species. a less likely explanation would be that the island arose by mutation and selection in this species and no other. to decide, a body of extra evidence would need to be explored, such as the size of the pai, the mechanistic ease of deletion, the consistent presence of the island in more distantly related species, the relative pathogenicity of island-less species, and the divergence of the genome relative to that of other related species. many methods, which seek to quantify and leverage these somewhat vague notions, are now available [ ] [ ] [ ]. such analysis at the nucleic acid level shares many features in common with approaches used to identify cpg islands in eukaryotic genomes [ ] [ ] [ ] [ ]. recently, langille et al. tested six sequence-composition genomic island prediction methods and found that islandpath-dimob and sigi-hmm had the greatest overall accuracy [ ]. islandpath was designed to help identify prokaryotic pais through the visualisation of common pai characteristics, such as mobile element-associated genes or atypical sequence composition [ ]. 
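the composition-based idea above can be sketched in a few lines: flag genome windows whose g+c content deviates strongly from the genome-wide mean. this is an illustrative toy, not any of the published tools; the window size, step, and two-standard-deviation cutoff are arbitrary choices.

```python
# Toy composition-based island scan: windows whose G+C content deviates
# by more than 2 standard deviations from the genome-wide window mean are
# flagged as candidate horizontally transferred regions.
from statistics import mean, stdev

def gc_content(seq: str) -> float:
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def atypical_windows(genome: str, window: int = 5000, step: int = 1000):
    gcs = [(i, gc_content(genome[i:i + window]))
           for i in range(0, len(genome) - window + 1, step)]
    mu = mean(gc for _, gc in gcs)
    sd = stdev(gc for _, gc in gcs)
    # Windows more than 2 SD from the genome mean are candidate islands.
    return [(i, gc) for i, gc in gcs if abs(gc - mu) > 2 * sd]
```

as the review notes, such a scan is prone to false positives (highly expressed genes) and false negatives (ancient islands that have ameliorated towards the host composition).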
sigi-hmm is a very accurate sequence composition-based genomic island predictor, which combines a hidden markov model (hmm) and codon usage measurement to identify genomic islands [ ]. in other work, yoon et al. coupled heuristic sequence searching methods, which aimed simultaneously to identify pais and individual virulence genes, with composition and codon-usage bias [ ]. exploiting a machine learning approach, vernikos and parkhill sampled the structural features of genomic islands using a hypothesis-free, bottom-up search, with the objective of explicitly quantifying the contribution made by each feature to the overall structure of different genomic islands [ ]. arvey et al. sought to identify large chromosomal regions with atypical features using a general divergence measure able to quantify the compositional difference between genomic segments [ ]. islandpick is a comparative genomic island predictor, rather than a composition-based approach, that can identify very probable genomic islands and very probable non-genomic islands within investigated genomes, but does require that several phylogenetically related genomes are available [ ]. observing that pais have a g + c composition closer to that of their host genome, wang et al. used so-called genomic barcodes to identify pais. these barcodes are based on the fact that the frequencies of -mers to -mers, and their reverse complements, are very stable across a whole genome when using a window size of over , bps, and that this constitutes a characteristic signature for genomes [ ]. the ready detection of pais, as a tool in computational reverse vaccinology, has been greatly aided by the deployment of several web-based resources. a key example of a server that successfully integrates several accurate genomic island predictors is islandviewer [ ], which combines the methods islandpick [ ], islandpath [ ], and sigi-hmm [ ] and is available at the url: http://www.pathogenomics.sfu.ca/islandviewer/query.php. 
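the barcode idea rests on strand-symmetric k-mer frequencies being stable across a genome. a minimal sketch follows; the function names are our own, k = 4 is an illustrative choice, and real barcodes combine several k values over long windows.

```python
# Sketch of a strand-symmetric k-mer "barcode": each k-mer is pooled with its
# reverse complement, so a sequence and its reverse complement share one
# signature; signatures from different windows can then be compared.
from collections import Counter

def canonical_kmers(seq: str, k: int):
    comp = str.maketrans("ACGT", "TGCA")
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(comp)[::-1]
        yield min(kmer, rc)  # pool each k-mer with its reverse complement

def signature(seq: str, k: int = 4):
    counts = Counter(canonical_kmers(seq, k))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def distance(sig_a, sig_b) -> float:
    # L1 distance between two signatures; large values suggest atypical windows.
    keys = set(sig_a) | set(sig_b)
    return sum(abs(sig_a.get(x, 0.0) - sig_b.get(x, 0.0)) for x in keys)
```

by construction a sequence and its reverse complement give identical signatures, which is what makes the barcode stable regardless of which strand a gene sits on.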
the gui facilitates the visualisation of genomic islands and the downloading of data at the gene and chromosome levels in a variety of formats. another important, web-accessible resource is paidb or the pai database. this is a wide-ranging database of pais, containing distinct pais and genbank accessions present in strains of pathogenic bacteria [ ]. paidb may be accessed via the url: http://www.gem.re.kr/paidb. thus, alternative techniques and methodologies are required in order to select and to rank proteins likely to be protective antigens and thus candidate vaccines. below, we shall explore three key approaches: subcellular location prediction, alignment-dependent sequence similarity searching, and alignment-independent empirical statistical approaches. in this section, we consider perhaps the clearest and cleanest way to identify potential new antigens in any microbial genome: alignment-dependent sequence similarity searching. there are two complementary but distinct ways of identifying the immunogenicity of a protein from its sequence. one is to look for significant similarity to proteins of known immunogenicity. this idea seems so straightforward as to be almost facile. the other approach is somewhat less obvious conceptually but almost as straightforward logistically and involves seeking to identify antigens as proteins without discernible sequence similarity to any host protein. let us turn to the first of these two alternatives. let us begin by stating, or rather reiterating, the obvious. if we know the sequence of an existing antigen or antigens, we can use sequence searching to find similar sequences in the target genome [ , ]. any candidate antigens identified by this process can then be selected for further verification and validation. the same old, familiar caveats apply here: are chosen thresholds appropriate? are high-scoring matches an artefact or are they real and meaningful? 
the litany of such conditions is all too familiar to anyone well versed in sequence similarity searching. clearly, when a sequence search is run, using blast or fasta, for example, an enormously long list of nearly identical proteins might ensue, or one that does not get any hits at all, or almost any intervening result might be obtained. as reflective practitioners, we must judge which result can be classified as useful and which cannot, and in so doing, identify sets of suitable thresholds, above which we expect usefulness and below which we might anticipate little or no utility. thresholds are contingent upon the sequence family studied, as well as being dependent on the problem investigated. thus heuristically identified cut-offs are desirable, but much thinking and empirical investigation are required to select appropriate values. of course, the process adumbrated above presupposes that sufficient antigenic protein sequences are known. compilation of this data is the role of the database. recently, extensive literature mining, coupled with factory-scale experimentation, has created many functional immunology databases, although databases such as syfpeithi [ , ], focussing on cellular immunology, primarily mhc processing, presentation, and t cell recognition, have existed for - years. arguably, the best extant database is the hiv molecular immunology database [ ], although clearly the depth of this database is at the expense of generality and breadth. other recent databases include mhcbn [ , ] and epimhc [ ], amongst many others. two databases warrant particular attention: antijen [ ], formerly known as jenpep [ , ]; and iedb [ ]. implemented as a relational postgresql database, antijen integrates a wide-ranging set of data items, much of which is not stored by other databases. 
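the threshold-setting caveats above can be made concrete with a toy triage function over hypothetical hit records; the e-value and percent-identity cut-offs shown are placeholders that would need tuning for each sequence family, exactly as the text warns.

```python
# Toy triage of similarity-search hits. Each hit is a hypothetical record
# (subject_id, e_value, percent_identity); the thresholds are illustrative
# defaults, not recommendations.
def triage_hits(hits, e_max=1e-5, id_min=30.0):
    """Keep hits clearing both an E-value and a percent-identity cutoff,
    returned best (lowest E-value) first."""
    return sorted((h for h in hits if h[1] <= e_max and h[2] >= id_min),
                  key=lambda h: h[1])
```

a hit at 90% identity but e-value 0.5 is rejected (likely a short spurious match), as is a marginal 20% identity hit; only hits clearing both bars survive.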
in addition to the kind of cellular immunological information familiar from syfpeithi, such as mhc binding and t cell data, antijen additionally archives b cell epitopes and also includes a significant stockpile of quantitative data: kinetic, thermodynamic, as well as functional, including measurements of immunological peptide-protein and protein-protein interactions. the iedb database is considerably more extensive than other equivalent database systems, benefiting from the input of dedicated epitope sequencing projects. iedb has come to eclipse other work in this area. although both antijen and iedb are full of epitope-focussed information of many flavours, they remain incomplete concerning immunogenic antigens. fortuitously, specific antigen-orientated, rather than epitope-focussed, databases are starting to become available. arguably, the most obvious and most unambiguous example of an antigen is the virulence factor (vf): proteins, such as toxins, able to induce disease directly by attacking a host. analysis of known pathogens has allowed recurring vf systems of + distinct proteins to be identified. often, sets of vfs exist as discrete, distinct genome-encoded pais, as well as being more widely spread through the genome. clearly, antigens do not need to be vfs in order to be immunogenic and thus candidates for subunit vaccines. instead, they need only be accessible to the immune system. they do not need to directly or indirectly mediate infection. thus, other databases are needed which capture, collate, and archive the burgeoning plethora of antigen-orientated data. recently, we have helped develop a very different database: antigendb [ ]. it contains over antigens collated from the primary scientific literature, as well as other sources. 
another related database system has been christened violin (vaccine investigation and online information network) [ ], which allows straightforward curation and the analysis and comparison of research data across diverse pathogens in the context of human medicine, animal models, laboratory model systems, and natural hosts. as we outline above, in addition to identifying sequence similarity to known antigens, another idea gaining ground is that the immunogenicity of an antigen is solely determined by the absence of similarity to host proteins. some think this is the prime determinant of potential protein immunogenicity [ , ]. such ideas are supported by the belief that immune systems are actively educated to lack reactivity to self-proteins [ ], a process, often termed "immune tolerance", which is generated via epitope-specific mechanisms [ , ]. what we really want is a meaningful measure of the "foreignness" of a protein correlating with its immunogenicity. usually, "evolutionary distance" substitutes for "foreignness." clearly, such an evolutionary distance must be specified in terms of biomacromolecular structures or sequences. but is this practically useful for selecting candidate vaccines? another way to formulate this idea is to say that the probability that a protein is immunogenic is exclusively a product of its dissimilarity, at the whole-sequence or sequence-fragment level, to each and every protein contained within the host proteome. most search software is well matched to this problem. in terms of fragment length, the typical length of an epitope might seem logical, since the epitope is the molecular moiety typically recognised during the initial phase of an immune response. yet, even at the epitope level, say a peptide of - amino acid residues, a single conservative mutation or mismatch in an otherwise identical match might prove significant. 
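the fragment-level dissimilarity idea can be sketched as follows. the 9-residue fragment length mimics a typical class i epitope and is an illustrative assumption, as is the exact-match criterion; a real analysis would also tolerate mismatches, which is precisely the complication the text raises.

```python
# Toy "foreignness" score: the fraction of k-mer fragments of a candidate
# antigen that occur exactly within the host proteome. Lower = more foreign.
def fragments(seq: str, k: int):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def host_like_fraction(antigen: str, host_proteins, k: int = 9) -> float:
    host_kmers = set()
    for p in host_proteins:
        host_kmers |= fragments(p, k)
    frags = fragments(antigen, k)
    return sum(f in host_kmers for f in frags) / len(frags)
```

as the benchmarking described below found, a clean threshold on such a score is hard to defend in practice; the sketch only makes the underlying computation explicit.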
single sequence alterations may totally abrogate or significantly enhance neutralising antibody binding or recognition by the machinery of cellular immunology. we have attempted to benchmark sequence similarity and correlate it with immunogenicity in order to explore the potential of this idea in a quantitative fashion. to that end, we examined the differences between sets of antigens and non-antigens using sequence similarity scores. we looked specifically at sets of known non-antigenic and antigenic protein sequences from six sources: bacteria, viruses, fungi, and parasites, as well as allergens and tumours [ ] [ ] [ ], comparing pathogen sequences to those from humans and mice using blast [ ]. most non-antigenic and antigenic sequences were non-redundant, implying a lack of homologues between pathogen and host proteomes, although certain parasite antigens, such as catalases and heat shock proteins, had a much greater level of similarity. we were not able to determine a suitable and appropriate threshold based on the hypothesis of non-redundancy to the host's proteome, suggesting that this is not a viable solution to vaccine antigen identification. however, rather than looking at nucleic acid sequences, or at protein sequences using an alignment-based approach, a new set of techniques, based upon alignment-free methods, has been and is being developed; as this approach begins to show significant potential, we shall examine it next. proteins accessible to immune system surveillance are assumed to lie external to the microbial organism or be attached to its surface rather than being sequestered within the cell. for bacteria, this means being located on or in the outer membrane surface or being secreted. thus, being able to accurately predict the physical location of a putative antigen can provide considerable insight into the likelihood that a particular protein will prove to be immunogenic and possibly protective. 
there are two basic kinds of prediction method for identifying subcellular location: manual rule construction and the application of data-driven machine learning methods. data used to discriminate between compartments include sequence-derived features of the protein, such as hydrophobic regions; the amino acid composition of the whole protein; the presence of certain specific motifs; or a combination thereof. accuracy differs significantly between different methods and different compartments, mostly resulting from the deficiency and inconsistency of the data used to derive models. gross overall sequence similarity is unable to predict protein sub-cellular location reliably or accurately. even nearly identical protein sequences may be found in distinct locations, while there are many proteins which exist simultaneously at several distinct locations within the cell, often having equally distinct functions at these different sites [ ]. eukaryotes and prokaryotes have quite distinct subcellular compartments. the number of such compartments used in prediction studies varies. a common schema reduces prokaryotic cells to three compartments (cytoplasmic, periplasmic, and extracellular) and eukaryotic cells to four compartments (nuclear, cytoplasmic, mitochondrial, and extracellular). other structural classifications evince in excess of ten eukaryotic compartments. ten compartments may be a conservative estimate, such is the complex richness of sub-cellular structure. any prediction method must account for permanent, transient, and multiple locations, and, in addition, multi-protein complexes and membrane-bound organelles as possible sites. numerous signal sequences exist. several methods predict lipoproteins. the prediction of proteins translocated via the tat-dependent pathway is important but has yet to be addressed properly. however, amongst binary, single-outcome approaches, signalp is probably the most accurate and reliable method available. 
it uses neural networks to predict the presence and probable cleavage sites of type ii or n-terminal spase-i-cleaved secretion signal peptides [ ] [ ] [ ]. this signal is common to both prokaryotic and eukaryotic organisms. signalp has recently been enhanced with an hmm intended to discriminate cleaved from uncleaved signal anchors. a limitation of signalp is its proclivity to over-predict: it cannot reliably discriminate between a number of very similar yet functionally different signal sequences, regularly predicting lipoproteins and integral membrane proteins as type ii signals. many methods have been devised capable of dividing a genome or virtual proteome between the various subcellular locations of a eukaryotic or prokaryotic cell. psort is a good example; it is a multicategory prediction procedure, comprising many different programmes [ ] [ ] [ ] [ ]. psort i predicts subcellular compartments, while psort ii predicts ten different locations. ipsort deals with several compartments: chloroplast, mitochondrial, and proteins secreted from the cell, while psort-b focuses solely on predicting bacterial sub-cellular locations. another effective programme is hensbc [ ]. hensbc can assign gene products to one of four different types (nuclear, mitochondrial, cytoplasmic, or extracellular) with an accuracy of about eight out of ten for gram-negative bacteria. another programme, subloc [ ], predicts prokaryotic subcellular location divided between three compartments. another programme is gpos-ploc [ ], which integrates several basic classifiers. other methods include phobius [ ], lipop . [ ], and tatp . [ ]. a comparison of several such programmes, using mycobacterial proteins as a gold standard [ ], showed that subcellular localisation prediction possessed high predictive specificity. we have developed a set of methods which predict bacterial subcellular location. 
using a set of methods for lipoprotein, tat secretion, and membrane protein prediction [ ] [ ] [ ] [ ] [ ] [ ] [ ], three different bayesian network architectures were implemented as software pipelines able to predict specific subcellular locations: two serial implementations using a hierarchical decision structure, and a parallel implementation with a confidence-level-based decision engine [ ]. the soluble-rooted serial pipeline performed better than the membrane-rooted predictor. the parallel pipeline outperformed the serial pipeline but was significantly less efficient. genomic test sets proved more ambiguous: the serial implementation identified more of the proteins of known location, yet the parallel implementation made more accurate predictions overall. the implications of this work are clear. the complexity of subcellular structures must be integrated fully into sub-cellular location prediction. in extant studies, many important cellular organelles are not considered; different routes by which proteins can reach the same compartment are ignored; and proteins existing simultaneously at several locations are likewise discounted. clearly, combining high-specificity predictors for each compartment appropriately must be the way forward [ ]. many difficulties, problems, and quandaries persist; the most keenly felt is the lack of high-quality, verified, and validated datasets which unambiguously establish the location of well-characterised proteins. this dearth is particularly serious for certain types of secreted protein, such as type iii secretion. in a similar manner, considerably more work is required to accurately predict the locations of proteins of viral origin; while certain studies are encouraging [ , ], the complexity of viral interaction with host organisms continues to confound attempts at analysis. approaches to predicting antigens in silico typically utilise bioinformatics tools. 
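the parallel, confidence-based decision engine described above can be caricatured as follows. the predictors here are crude stand-ins for real per-compartment tools (such as signalp or lipop), and the 0.5 confidence floor is an arbitrary illustrative choice.

```python
# Sketch of a parallel pipeline: each compartment has its own predictor
# returning a confidence in [0, 1]; the decision engine picks the compartment
# with the highest confidence, or returns "unknown" below a floor.
def parallel_pipeline(protein, predictors, floor=0.5):
    scores = {name: pred(protein) for name, pred in predictors.items()}
    best, conf = max(scores.items(), key=lambda kv: kv[1])
    return (best if conf >= floor else "unknown"), scores

# Toy stand-in predictors keyed on a crude hydrophobic-leader proxy;
# real pipelines would wrap dedicated tools here.
toy_predictors = {
    "secreted":    lambda s: 0.9 if s[:20].count("L") >= 8 else 0.1,
    "cytoplasmic": lambda s: 0.6 if s[:20].count("L") < 8 else 0.2,
}
```

combining high-specificity per-compartment predictors this way mirrors the "way forward" the text advocates, at the cost of having to calibrate the confidences against one another.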
such tools can identify signal peptides or membrane proteins or lipoproteins successfully, yet the majority of algorithms tend to depend on motifs characteristic of antigens or, more generally, on sequence alignment as the principal arbiter of definitive and meaningful sequence relationships. this is potentially a problem of some magnitude, particularly given the wide range of evolutionary rates and mechanisms amongst microbial proteins. certain protein families do not, however, show obvious or significant sequence similarity, despite having common biological properties, functions, and three-dimensional structures [ , ]. thus alignment-based approaches may not always produce useful and unequivocal results, since they assume a direct sequence relationship that can be identified by simple sequence search techniques. immunogenicity, as a signature characteristic, may instead be encrypted within the structure and/or sequence. this may be encoded so cryptically or so subtly as to completely confound, or at least mislead, conventional sequence alignment protocols. discovery of utterly novel and previously unknown antigens will be totally stymied by the absence of similarity to known antigenic proteins. alignment-dependent methods tend to dominate bioinformatics and, by extension, immunoinformatics. several authors have chosen to look at alternative strategies, implementing so-called alignment-independent or alignment-free techniques. the first authors to do so were mayer et al., who reported that protective antigens had a different amino acid composition compared to control groups of non-antigens [ ]. such a result is unsurprising, since it has long been known that the structure and sequence composition of proteins are adapted to the different redox environments of different sub-cellular compartments [ ]. mayer's analysis was formulated primarily in terms of univariate comparisons of antigens versus controls for different properties. 
subsequently, we explored bivariate comparison in terms of easily comprehensible scatter-plots. see fig. . for representative examples. what these results ably demonstrate is the potential for the discrimination of antigens and non-antigens by the appropriate selection of orthogonal descriptors. the challenge, of course, is to identify a robust choice of descriptors which are capable of extrapolating as well as interpolating when used predictively. progressing beyond this type of analysis, and synergising with our other work on alignment-independent representation [ ] [ ] [ ] [ ] [ ], we have initiated the development of new methods to differentiate antigens, and thus potential vaccine candidates, from non-antigens, using a more sophisticated alignment-free approach to sequence representation [ , ]. rather than focus on epitope versus non-epitope, our approach utilises data on protective antigens derived from diverse pathogens to create statistical models capable of predicting whole-protein antigenicity. our alignment-independent method for antigen identification uses the auto cross covariance (acc) transformation originally devised by wold et al. [ , ] to transform protein sequences into uniform vectors. the acc transform has found much application in peptide prediction and protein classification [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ]. in our method, amino acid residues are represented by the well-known and well-used z descriptors [ ] [ ] [ ], which characterise the hydrophobicity, molecular size, and polarity of residues. our method also accounts for the absence of complete independence between distinct sequence positions. we initially applied our approach to groups of known viral, bacterial, and tumour antigens, developing models capable of identifying antigens. extra models were subsequently added for fungal and parasite antigens. for bacterial, viral, and tumour antigens, models had prediction accuracies in the - % range [ , , ]. 
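the acc transform itself is compact enough to sketch. the z-scale values below are rounded published z1/z2/z3 descriptors for a handful of residues (a full implementation needs all 20 amino acids), the maximum lag of 3 is an illustrative choice, and this sketch omits the mean-centring that some formulations apply.

```python
# Sketch of the ACC (auto cross covariance) transform over z-scale
# descriptors: a variable-length sequence becomes a fixed-length vector,
# one entry per descriptor pair (j, k) and lag l.
Z = {  # rounded z1, z2, z3 values for a few residues (illustrative subset)
    "A": (0.07, -1.73, 0.09), "G": (2.23, -5.36, 0.30),
    "K": (2.84, 1.41, -3.14), "L": (-4.19, -1.03, -0.98),
    "S": (1.96, -1.63, 0.57), "V": (-2.69, -2.53, -1.29),
}

def acc_transform(seq: str, max_lag: int = 3):
    z = [Z[a] for a in seq]
    n = len(z)
    vec = []
    for j in range(3):          # descriptor of position i
        for k in range(3):      # descriptor of position i + lag
            for lag in range(1, max_lag + 1):
                vec.append(sum(z[i][j] * z[i + lag][k]
                               for i in range(n - lag)) / (n - lag))
    return vec
```

the point of the transform is that sequences of different lengths map to vectors of identical length (here 3 x 3 x 3 = 27), so standard statistical models can be trained on whole proteins without any alignment, while the lag terms capture the coupling between nearby sequence positions mentioned in the text.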
for the parasite and fungal antigens, models had good predictive ability, with - % accuracy. these models were incorporated into a server for protective antigen prediction called vaxijen [ ] (url: http://www.darrenflower.info/vaxijen). vaxijen is an imperfect but encouraging start; future research will yield significantly more insight as well-characterised protective antigens increase significantly in number [ ]. as we have said, a number of bioinformatics problems are unique to the discipline of immunology: the greatest of these is the accurate quantitative prediction of immunogenicity. this chapter has in its totality been suffused and pervaded by the idea of immunogenicity and the challenge of predicting this property in silico. such an endeavour is confounding, yet exciting, and, as a key instrument in developing better, safer, more effective vaccines, is also of undisputed practical utility. successful immunogenicity prediction is at its simplest made manifest through the identification of b cell or t cell epitopes. epitope recognition, when seen as a chemical event, may be understood in terms of the relationships between apparent biological function or activity and basic physicochemical properties. delineating structure-activity or property-activity relationships of this kind is a key concern of immunoinformatics. at the other end of the spectrum, immunogenicity can be viewed as a cohesive, integrated system property: a property of the entire and complete immune system and not a series of individual and isolated molecular recognition events. thus, the task of predicting systems-level immunogenicity is in all likelihood manifold more demanding than predicting, say, peptide binding. the clinical manifestation of vaccine immunogenicity arises from the complex amalgam of many contributing extrinsic and intrinsic factors, which include pathogen-side and host-side properties, as well as those coming directly from the proteins themselves. see fig. . . 
protein-side properties include the aggregation state of candidate vaccines and the possession of pamps. pathogen-side properties are clearly properties intrinsic to the pathogen, including expression levels of the antigen, the time-course of this expression, as well as its subcellular location. so-called host-side properties are innate recognition properties of host immunity, and most obviously include t cell epitopes or b cell epitopes. a bona fide candidate antigen should be available for immune surveillance and thus highly expressed, constitutively or transiently, as well as having several epitopes. a protein without immunogenicity would logically lack all or some of these characteristics. as a prediction problem, this is, to say the least, not uncomplicated, clearly consisting of a great variety of difficult-to-compute stages. in terms of mechanism, many of these stages are poorly understood. yet, each can be addressed using standard computational and statistical tools. they can all be predicted, presupposing, of course, the presence of relevant data in sufficient quantity. one of the strongest messages to emerge from this review is that immunogenicity is a strongly multi-factorial property: some protein antigens are immunogenic for one reason, or set of reasons, and other immunogenic proteins will be so for another, possibly tangential, reason or set of reasons. each such causal manifold is itself complex and potentially confusing. thus, the prediction of immunogenicity is a problem in multi-factorial prediction, and the search for new antigens is a search through a multi-factorial landscape of contingent causes and discombobulating decoys. some of the evidence will be highly precise and quantitative: the kind provided by predictive immunoinformatics, for example. 
this typically yields exact values for, say, the binding affinity of a peptide to a protein component of the immune system, or an unequivocal yes or no answer to the question: is this peptide sequence an epitope? however, for each such exact prediction, we have some notional associated probability concerning how reliable we regard this result. different methods evince a range of accuracies, which, in practice, equate to probabilities of reliability: we naturally have more confidence in, and assume a greater reliability for, a highly accurate prediction versus one of average accuracy, though even an accurate method can still give wrong predictions, and generally inaccurate predictors may work well for a specific subset of the data. other forms of evidence will have a distinctly more anecdotal flavour. take, for example, the case of bacterial exotoxins. together with endotoxins, such as lps, and so-called superantigens, exotoxins form the principal varieties of toxin secreted by pathogenic bacteria. exotoxins have evolved to be the most toxic substances known to science: in terms of the median lethal dose, botulinum toxin, the active ingredient of botox and causative agent of botulism, amongst others, is about ten times as lethal as the radioactive isotope polonium- and a million times more deadly than mainline poisons, such as arsenic or potassium cyanide. virtually all such potent bacterial exotoxins comprise two functionally distinct subunits, either separate proteins or distinct domains, usually denoted a and b. the a subunit is habitually an enzyme, such as a protease, which modifies specific protein targets, thus disrupting key cellular processes within host cells. the b subunit is a protein which binds to host cell surface lipids or proteins, enabling the toxin to be internalised efficiently. the high specificity of this dual action lends exotoxins much of their remarkable lethality. 
exotoxins are also extremely immunogenic, inducing the immune system to produce high-affinity neutralising antibodies against them, and thus make excellent targets for vaccinology. a toxoid, a toxin which has been treated or inactivated, often by formaldehyde, is in essence a form of subunit vaccine and, as such, requires adjuvant to induce adequate immune responses. vaccines targeting tetanus and diphtheria, which usually need boosting every decade, are based on toxoids, albeit typically combined with pertussis toxin acting as an adjuvant. poisoning by exotoxins, on the other hand, requires treatment with antitoxin comprising preformed antibodies. however, say that we were offered a newly sequenced pathogen genome: is such a classification for ab toxins helpful when trying to identify potential exotoxins? the answer is neither yes nor no, but lies somewhere between these extremes. assuming we had extant knowledge, or a reliable method for predicting the presence of structurally and functionally distinct domains, this very simple rule-of-thumb would become a useful tool for eliminating large numbers of possible toxin molecules. it would not directly identify an antigen but would enormously reduce the workload inherent in their discovery. as well as needing more, and more reliable, predictors, we also need a way of combining the information we gather from any set of reliable predictors to which we have access. thus, when analysing a pathogen genome, what we seem to need, at least in order to identify immunogenic proteins, is both a set of reliable and robust tools and a cohesive expert system within which to embed them. such systems, albeit still at a relatively crude and faltering level, do exist. because there is an implicit hierarchy of one prediction being based on others, there is a need to balance and judge different pieces of probabilistic evidence. an effective expert system should be capable of such a feat. 
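the ab rule-of-thumb can be expressed as a trivial filter over hypothetical upstream domain annotations; the domain vocabularies below are invented for illustration, and a real filter would draw labels from an actual domain predictor.

```python
# Hedged sketch of the AB rule-of-thumb: keep candidates annotated with both
# an enzymatic (A-like) and a receptor-binding (B-like) domain. The label
# sets are hypothetical stand-ins for real domain-prediction output.
ENZYMATIC = {"protease", "glycosidase", "adp-ribosyltransferase"}
BINDING = {"lectin", "receptor-binding"}

def possible_ab_toxins(proteins):
    """proteins: dict mapping name -> set of predicted domain labels."""
    return [name for name, doms in proteins.items()
            if doms & ENZYMATIC and doms & BINDING]
```

as the text says, such a filter would not identify an antigen directly; it simply discards the bulk of a proteome that cannot plausibly be an ab toxin, shrinking the candidate list for downstream analysis.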
to a first approximation, an expert system is a computer programme that undertakes tasks that might otherwise be prosecuted by a human expert, ostensibly by simulating the apparent judgement and behaviour of an individual or organization with expertise and experience within a particular discipline. an expert system might make financial forecasts, or play chess; it might diagnose human illnesses or schedule the routes of delivery vehicles. to create an expert system, one first needs to analyse human experts and how they make decisions, before translating this into rules that a computer can follow. such a system leverages both a knowledge base of accumulated expertise and a set of rules for applying such distilled knowledge to particular situations in order to solve problems. sophisticated expert systems can be updated with new knowledge and rules and can also learn from the success of their predictions, again mirroring the behaviour of properly performing experts. at the heart, then, of an expert system is the need to combine evidence in order to reach decisions. combining evidence, and reaching a decision based on that combined evidence, is no easier in the laboratory, be that virtual or actual, than it is in the court room. the problem of combining evidence is encountered across the disciplines, and various solutions have arisen in these different areas. within bioinformatic prediction, a particular variety of evidence combination, so-called meta-prediction, is now a well-established strategy [ , ]. this approach seeks to amalgamate the output of various predictors, typically internet servers, in an intelligent way, so that the combined result is more accurate than any of those coming from a single predictor. indeed, combining results from multiple prediction tools does often increase overall accuracy. 
A consensus strategy was first proposed by Mallios [ ], who combined SYFPEITHI [ , , ], ProPred [ , ], and the iterative stepwise discriminant analysis meta-algorithm [ ] [ ] [ ]. MULTIPRED [ ] integrates HMMs and artificial neural networks (ANNs). Six MHC class II predictors were combined by Dai and co-workers [ ] [ ] [ ], basing the overall prediction on the probability distributions of the different scores. Trost et al. have used a heuristic method to address class I peptide-MHC binding [ ]. Wang et al. [ ] applied a consensus method, calculating the median rank of the top three predictive methods for each MHC class II protein initially evaluated so as to rank all possible -, -, and -mers from one protein; this rank was used to identify the top % of peptides from each protein. In probabilistic reasoning, or reasoning with uncertainty, there are many ways to represent espoused beliefs (or, in our domain, predictions) that effectively encode the uncertainty of propositions. These include fuzzy logic and the evidential method, among many others. For quantitative data, information fusion, in its various guises [ ], is one robust route to effective combination. Another requires us to enter the world of Bayesian statistics, or at least a special thread within it. Bayes theory, and the ever-expanding strand of statistics devolving from it, is concerned primarily with updating or revising belief in the light of new evidence, while so-called Dempster-Shafer theory [ ] is concerned not with the conditional probabilities of Bayesian statistics but with the direct combination of evidence. It extends the Bayesian theory of subjective probability by replacing Bayesian probabilities with belief functions, which describe degrees of belief for one question in terms of probabilities for another, and then combines these using Dempster's rule for merging degrees of belief based on independent lines of evidence.
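A median-rank consensus of the kind attributed to Wang et al. can be sketched simply: each predictor ranks the same set of peptides, and a peptide's consensus score is the median of its ranks across predictors. The predictor names, peptides, and ranks below are invented; this is an illustration of the scheme, not the published implementation.

```python
# Hedged sketch of median-rank consensus scoring across several peptide
# predictors. Predictor names, peptides, and ranks are invented examples.

import statistics

def consensus_median_rank(rankings):
    """rankings: dict predictor -> dict peptide -> rank (1 = best).
    Returns dict peptide -> median rank across all predictors."""
    peptides = next(iter(rankings.values())).keys()
    return {
        pep: statistics.median(r[pep] for r in rankings.values())
        for pep in peptides
    }

rankings = {
    "predictorA": {"PEP1": 1, "PEP2": 3, "PEP3": 2},
    "predictorB": {"PEP1": 2, "PEP2": 1, "PEP3": 3},
    "predictorC": {"PEP1": 1, "PEP2": 2, "PEP3": 3},
}
scores = consensus_median_rank(rankings)
best = min(scores, key=scores.get)
print(best)  # PEP1
```

The appeal of rank-based fusion is that it sidesteps the incomparable raw scoring scales of the individual servers, which is one reason consensus approaches often outperform their components.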
Such belief functions may or may not have the mathematical properties of probabilities, but they are seemingly able to combine the rigour of probability theory with the flexibility of rule-based approaches. Several expert systems of different flavours and hues have now become available within the vaccinology arena. Sundaresh et al. developed a specialist software package for the analysis of microarray experiments that could easily be classified as an expert system and used it in the area of reverse vaccinology. This package, written in the open-source statistical package R, was used to help analyse a variety of complex microarray experiments on the bacterium F. tularensis, a category A bio-defense pathogen [ ]. The programme implements a two-stage process for diagnostic analysis: selection of antigens based on significant immune responses coupled with differential expression analysis, followed by classification of measured antigen responses using a combination of k-means clustering, support vector machines, and k-nearest neighbours. We have already discussed VaxiJen [ , , ], and the related server EpiJen [ ], which combines various methods for identifying epitopes within extant proteins. These two servers can also be classified as vaccine-related expert systems. NERVE is another expert system, which has been developed to help automate aspects of reverse vaccinology [ ]. Using NERVE, the prioritisation of potential candidate antigens consists of several stages: prediction of subcellular localisation; assessment of whether the antigen is an adhesin; identification of membrane-crossing domains; and comparison to pathogen and human proteomes. Candidates are filtered then ranked, with putative antigens graded by provenance and their predicted immunogenicity. The web-based expert system DynaVacS [ ] was developed to facilitate the efficient design of DNA vaccines and is available at the URL http://miracle.igib.res.in/dynavac.
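The filter-then-rank prioritisation described for NERVE can be sketched as a simple pipeline. This is an assumption-laden caricature: the stage thresholds, field names, and stub predictions below are invented, and the real tool's criteria and ordering differ.

```python
# Illustrative filter-then-rank pipeline in the style of the NERVE stages
# described in the text (localisation, membrane-crossing domains, human
# proteome comparison, adhesin probability). Thresholds and field names
# are invented for this sketch, not taken from NERVE itself.

def prioritise(proteins):
    """proteins: list of dicts holding precomputed per-protein predictions."""
    candidates = [
        p for p in proteins
        if p["localisation"] in {"extracellular", "outer_membrane"}
        and p["tm_helices"] <= 1            # drop multi-spanning membrane proteins
        and p["human_similarity"] < 0.3     # avoid human-like (autoimmunity risk)
    ]
    # Rank survivors by predicted adhesin probability, best first.
    return sorted(candidates, key=lambda p: p["adhesin_prob"], reverse=True)

proteins = [
    {"id": "a", "localisation": "cytoplasm", "tm_helices": 0,
     "human_similarity": 0.1, "adhesin_prob": 0.9},
    {"id": "b", "localisation": "outer_membrane", "tm_helices": 1,
     "human_similarity": 0.1, "adhesin_prob": 0.7},
    {"id": "c", "localisation": "extracellular", "tm_helices": 0,
     "human_similarity": 0.2, "adhesin_prob": 0.8},
]
print([p["id"] for p in prioritise(proteins)])  # ['c', 'b']
```

Note that each upstream prediction feeds the next stage, which is exactly the implicit hierarchy of predictions, and the attendant need to weigh uncertain evidence, raised earlier.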
It takes a structured approach to vaccine design, leveraging various key design parameters, including the choice of appropriate expression vectors, safeguarding efficient expression through codon optimisation, ensuring high levels of translation by adding specific sequence signals, and engineering CpG motifs as adjuvant mechanisms enhancing immune responses. It also allows restriction enzyme mapping and the design of primers, and lists vectors in use for known DNA vaccines. Vaxign is another expert system developed to help facilitate vaccine design [ ]. Vaxign undertakes dynamic vaccine target prediction from sequence. Methodologically, it combines protein subcellular location prediction with prediction of transmembrane helices and adhesins, analysis of conservation relative to human and/or mouse proteins, sequence exclusion against the genomes of non-pathogenic strains, and prediction of peptide binding to class I and class II MHC. As a test, Vaxign has been used to predict vaccine candidates against uropathogenic Escherichia coli. However, NERVE and its various and varied siblings are tasked with such a confounding and difficult undertaking that they are obliged to fall somewhat short of what is required. An obvious first step in tackling the greater problem is to address subcellular location prediction. Then we can look at antigen presentation, modelling each component step, before building these into a fully functional model. We can also develop empirical approaches, such as VaxiJen [ , , ]. We must also factor in antibody-mediated issues, properly address PAMPs, post-translational danger signals, expression levels, the role of aggregation, and the capacity of molecular adjuvants to enhance innate immunogenicity to usable levels. See Fig. . The value of vaccines is not unchallenged. However, most reasonable people would, in all probability, agree that they are a good thing, albeit with a few minor provisos.
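One of the design parameters mentioned above, codon optimisation, has a classic simple form: back-translate the protein using, for each residue, the codon most frequently used by the expression host. The tiny preference table below is illustrative only, not real codon-usage data, and real tools also weigh factors such as GC content and secondary structure.

```python
# Minimal sketch of most-frequent-codon back-translation, one common
# codon-optimisation strategy. The host-preference table is a toy example,
# not a real organism's codon-usage data.

PREFERRED_CODON = {  # toy table: amino acid -> host-preferred codon
    "M": "ATG",
    "K": "AAA",
    "L": "CTG",
    "*": "TAA",  # stop
}

def codon_optimise(protein_seq: str) -> str:
    """Back-translate a protein using the preferred codon per residue."""
    return "".join(PREFERRED_CODON[aa] for aa in protein_seq)

print(codon_optimise("MKL*"))  # ATGAAACTGTAA
```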
The idea underlying all vaccines is a strong and robust one: it is in the reification (that is, the realisation, manifestation, and instantiation) of this abstract concept that the trouble lies, if indeed trouble there is. Existing vaccines are by no means perfect; again, most sensible and well-informed people would no doubt acknowledge this also. One might argue that their intrinsic complexity, the highly empirical nature of their discovery over decades, and the fraught nature of their manufacture have much to answer for in this regard. Why should this be? In part, it is due to the extreme complexity of the immune response to an administered vaccine, which is largely specific to each individual, or at least differs between sub-groups within the totality of the vaccinated population. The immune response comprises, at least for whole-pathogen vaccines, the adaptive responses to multiple B cell and T cell epitopes as well as the responses made by the innate immune system to diverse molecular structures, principally PAMPs. Consider also the degree to which such a repertoire of responses is augmented and modified by the action of additives, be they designed to increase the durability and stability of vaccines or be they adjuvants intended to raise the level of immune reactions. Add in stochastic and coincidental phenomena, such as reversion to pathogenicity, and we can see immediately that navigating our way through the vaccine minefield is no easy task. All the problems engendered by this intrinsic complexity are themselves compounded by our comparatively weak understanding of immunological mechanisms: if we understood the mechanisms of response well enough, we could and would have designed our vaccines to circumvent these issues. Part of the answer to this cacophony of conflicting and confounding quandaries is the newly emergent discipline of vaccinomics.
A proper understanding of the relationships between gene variants and vaccine-specific immune responses may help us to design the next generation of personalised vaccines. Vaccinomics addresses this issue directly. It seeks to identify genetic factors mediating or moderating vaccine-induced immune responses, which are known to be extremely variable within populations. Much data indicate that host genetic polymorphisms are key determinants of innate and adaptive responses to vaccination. HLA genes, non-HLA genes, and genes of innate immunity all contribute, and do so in many ways, to the variation observed between individuals in immune responses to microbial vaccines. Vaccinomics offers many techniques that can help illuminate these diverse phenomena. Principal amongst these are population-based gene/SNP association studies between allele or SNP variation and specific responses, supplemented by the application of next-generation sequencing technology and microarray approaches. Yet, for all this nay-saying and gainsaying, vaccines and vaccination have demonstrated their worth time after time; to justify the continuing faith we invest in them, however, new and better ways of making safer and more focussed vaccines must be found. Most current vaccines work via antibody-mediated mechanisms, and most target viruses and the diseases they cause. Unfortunately, the stock of such disease targets is dwindling: the low-hanging fruit has long since been cut down, and only fruit that is well out of reach remains. Vaccines based on APCs and peptides are new but unproven strategies; most modern vaccine development relies instead on effective searches for vaccine antigens. One of the clearest points to emerge from such work is that there are many competing concepts, thoughts, and ideas that may confound or help efficient identification of immune reactive proteins. Certain such ideas we have outlined.
Some are indisputably persuasive, even compelling, yet many strategies, and the technical approaches upon which they are based, have signally failed to deliver on their promise. Long ago, and based on his lifetime's experience of all things immunological, Professor Peter CL Beverley sketched out a paradigm for protein-focussed vaccine development, which we have formalised further and whose schema is summarised in Fig. . Some of his factors overlap with the factors from Fig. . He identified many of the factors that potentially contribute to the immunogenicity of proteins, be they of pathogen origin or another source entirely, and also other features which might make proteins particularly suitable for becoming candidate vaccines. Of these, some are as yet beyond prediction, such as attractiveness for APCs or the inability to down-regulate immune responses. The status of proteins as evasins is currently addressable, if at all, only through sequence similarity-based approaches; likewise attractiveness for uptake by APCs, although it is possible that there exist motifs, structural or sequence-based, which could be identified. Currently, the dearth of relevant data precludes prediction of such properties; and, while it is possible to predict some of these properties with some assurance of success, and others are predictable but only incidentally, overall we are still some way from realising the dream embodied in Fig. . Failure occurs for simple reasons: we deal with simplified abstractions and cannot hope to capture all that is required for prediction by looking superficially at a single factor. Protein immunogenicity comes instead from the dynamic combination of innumerable contributing factors. This is by no means a facile or easily solved informatics conundrum. A vaccine candidate should have epitopes that the host recognises, be available for immune surveillance, and be highly expressed.
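The idea that immunogenicity emerges from the dynamic combination of many contributing factors can be caricatured as a weighted score squashed through a logistic function. The features, weights, and bias below are invented for illustration; this is not a validated immunogenicity model, merely the shape such a combination might take.

```python
# Caricature of combining heterogeneous immunogenicity factors into one
# score: a weighted sum passed through a logistic squashing function.
# Features, weights, and bias are invented for illustration only.

import math

WEIGHTS = {"epitope_density": 2.0, "surface_exposure": 1.5, "expression_level": 1.0}
BIAS = -2.0

def immunogenicity_score(features):
    """Map factor values in [0, 1] to a pseudo-probability in (0, 1)."""
    z = BIAS + sum(WEIGHTS[k] * features[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

strong = {"epitope_density": 0.9, "surface_exposure": 0.8, "expression_level": 0.9}
weak = {"epitope_density": 0.1, "surface_exposure": 0.2, "expression_level": 0.3}
print(immunogenicity_score(strong) > immunogenicity_score(weak))  # True
```

In practice the weights would have to be learnt from data, and, as the surrounding text stresses, no single linear recipe is likely to capture the interplay of all the factors involved.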
Factors mediating protein immunogenicity are many: possession of B or T cell epitopes, post-translational danger signals, sub-cellular location, protein expression levels, and aggregation state amongst them. Predicting such diverse, complex, confounding properties is, and remains, a challenge. Vaccine antigens, once discovered, should ultimately, with appropriate manipulation, together with an apt, apposite, and appropriate delivery system and the right choice of adjuvant, become first a candidate for clinical trials before, hopefully, progressing to regulatory approval. We require an integrative, systems-biology approach to solve this problem. No single approach can be applied universally and with success; what we crave is the full integration of numerous equally

References
- Wakefield's article linking MMR vaccine and autism was fraudulent
- Computer-aided biotechnology: from immuno-informatics to reverse vaccinology
- Harnessing bioinformatics to discover new vaccines
- New vaccines against tuberculosis
- Bioinformatics for vaccinology
- Lessons learned concerning vaccine safety
- Vaccines: the real issues in vaccine safety
- Target the fence-sitters
- 'An American tragedy'. The Cutter incident and its implications for the Salk polio vaccine in New Zealand
- The Cutter incident, years later
- Poliomyelitis following formaldehyde-inactivated poliovirus vaccination in the United States during the spring of . II. Relationship of poliomyelitis to Cutter vaccine
- Vaccine-derived poliovirus (VDPV): impact on poliomyelitis eradication
- Advances in predicting and manipulating the immunogenicity of biotherapeutics and vaccines
- The use of genomics in microbial vaccine development
- Post-genomic vaccine development
- Microbial genomes and vaccine design: refinements to the classical reverse vaccinology approach
- Biotechnology and vaccines: application of functional genomics to Neisseria meningitidis and other bacterial pathogens
- Complete genome sequence of Neisseria meningitidis serogroup B strain MC
- Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing
- A universal vaccine for serogroup B meningococcus
- Identification of vaccine candidate antigens from a genomic analysis of Porphyromonas gingivalis
- Use of a whole genome approach to identify vaccine molecules affording protection against Streptococcus pneumoniae infection
- Identification of a universal group B streptococcus vaccine by multiple genome screen
- Functional selection of vaccine candidate peptides from Staphylococcus aureus whole-genome expression libraries in vitro
- Discovery of a novel class of highly conserved vaccine antigens using genomic scale antigenic fingerprinting of pneumococcus with human antibodies
- Immunodominant Francisella tularensis antigens identified using proteome microarray
- A Burkholderia pseudomallei protein microarray reveals serodiagnostic and cross-reactive antigens
- Antibody-protein interactions: benchmark datasets and prediction tools evaluation
- Benchmarking B cell epitope prediction: underperformance of existing methods
- Prediction of MHC-peptide binding: a systematic and comprehensive overview
- In silico tools for predicting peptides binding to HLA class II molecules: more confusion than conclusion
- On evaluating MHC-II binding peptide prediction methods
- Evaluation of MHC-II peptide binding prediction servers: applications for vaccine research
- A critical cross-validation of high throughput structural binding prediction methods for pMHC
- Limitations of ab initio predictions of peptide binding to MHC class II molecules
- T cell receptor recognition of a 'super-bulged' major histocompatibility complex class I-bound peptide
- High resolution structures of highly bulged viral epitopes bound to major histocompatibility complex class I. Implications for T-cell receptor engagement and T-cell immunodominance
- Have we cut ourselves too short in mapping CTL epitopes?
- A long, naturally presented immunodominant epitope from NY-ESO- tumor antigen: implications for cancer vaccine design
- Identification and characterization of pathogenicity and other genomic islands using base composition analyses
- A novel strategy for the identification of genomic islands by comparative analysis of the contents and contexts of tRNA sites in closely related bacteria
- MobilomeFINDER: web-based tools for in silico and experimental discovery of bacterial genomic islands
- CpGcluster: a distance-based algorithm for CpG-island detection
- CpGIF: an algorithm for the identification of CpG islands
- Identifying CpG islands by different computational techniques
- CpG_MI: a novel approach for identifying functional CpG islands in mammalian genomes
- Evaluation of genomic island predictors using a comparative genomics approach
- IslandPath: aiding detection of genomic islands in prokaryotes
- Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models
- A computational approach for identifying pathogenicity islands in prokaryotic genomes
- Resolving the structural features of genomic islands: a machine learning approach
- Detection of genomic islands via segmental genome heterogeneity
- Prediction of pathogenicity islands in enterohemorrhagic Escherichia coli O :H using genomic barcodes
- IslandViewer: an integrated interface for computational identification and visualization of genomic islands
- Towards pathogenomics: a web-based resource for pathogenicity islands
- Identification and characterization of a novel family of pneumococcal proteins that are protective against sepsis
- Functional genomics of pathogenic bacteria
- SYFPEITHI: database for searching and T-cell epitope prediction
- SYFPEITHI: database for MHC ligands and peptide motifs
- HIV sequence databases
- MHCBN . : a database of MHC/TAP binding peptides and T-cell epitopes
- MHCBN: a comprehensive database of MHC binding and non-binding peptides
- EPIMHC: a curated database of MHC-binding peptides for customized computational vaccinology
- AntiJen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data
- JenPep: a novel computational information resource for immunobiology and vaccinology
- JenPep: a database of quantitative functional peptide data for immunology
- The immune epitope database .
- AntigenDB: an immunoinformatics database of pathogen antigens
- VIOLIN: vaccine investigation and online information network
- Epitopic peptides with low similarity to the host proteome: towards biological therapies without side effects
- Peptimmunology: immunogenic peptides and sequence redundancy
- Primer: mechanisms of immunologic tolerance
- Recent advances in immune modulation
- Cutting edge: contributions of apoptosis and anergy to systemic T cell tolerance
- Discriminating antigen and non-antigen using proteome dissimilarity III: tumour and parasite antigens
- Discriminating antigen and non-antigen using proteome dissimilarity II: viral and fungal antigens
- Discriminating antigen and non-antigen using proteome dissimilarity: bacterial antigens
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
- Single proteins might have dual but related functions in intracellular and extracellular microenvironments
- Locating proteins in the cell using TargetP, SignalP and related tools
- Improved prediction of signal peptides: SignalP .
- A comprehensive assessment of N-terminal signal peptides prediction methods
- WoLF PSORT: protein localization predictor
- Secreted protein prediction system combining CJ-SPHMM, TMHMM, and PSORT
- PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria
- PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization
- Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains
- SubLoc: a server/client suite for protein subcellular location based on SOAP
- Gpos-PLoc: an ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins
- Advantages of combined transmembrane topology and signal peptide prediction: the Phobius web server
- Prediction of lipoprotein signal peptides in Gram-negative bacteria
- Prediction of twin-arginine signal peptides
- Validating subcellular localization prediction tools with mycobacterial proteins
- Toward bacterial protein sub-cellular location prediction: single-class discriminant models for all Gram- and Gram+ compartments
- Multi-class subcellular location prediction for bacterial proteins
- Alpha helical trans-membrane proteins: enhanced prediction using a Bayesian approach
- Beta barrel trans-membrane proteins: enhanced prediction using a Bayesian approach
- A predictor of membrane class: discriminating alpha-helical and beta-barrel membrane proteins from non-membranous proteins
- TATPred: a Bayesian method for the identification of twin arginine translocation pathway signal sequences
- LipPred: a web server for accurate prediction of lipoprotein signal sequences and cleavage sites
- Combining algorithms to predict bacterial protein sub-cellular location: parallel versus concurrent implementations
- Predicting the subcellular localization of viral proteins within a mammalian host cell
- Virus-PLoc: a fusion classifier for predicting the subcellular localization of viral proteins within host and virus-infected cells
- Structure and sequence relationships in the lipocalins and related proteins
- Structural relationship of streptavidin to the calycin protein superfamily
- Analysis of known bacterial protein vaccine antigens reveals biased physical properties and amino acid composition
- Adaptation of protein surfaces to subcellular location
- Hierarchical classification of G-protein-coupled receptors with data-driven selection of attributes and classifiers
- GPCRTree: online hierarchical classification of GPCR function
- Optimizing amino acid groupings for GPCR classification
- On the hierarchical classification of G protein-coupled receptors
- Proteomic applications of automated GPCR classification
- VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines
- Identifying candidate subunit vaccines using an alignment-independent method based on principal amino acid properties
- DNA and peptide sequences and chemical processes multivariately modeled by principal component analysis and partial least-squares projections to latent structures
- Principal property-values for non-natural amino-acids and their application to a structure activity relationship for oxytocin peptide analogs
- Peptide binding to the HLA-DRB supertype: a proteochemometrics analysis
- Proteochemometrics mapping of the interaction space for retroviral proteases and their substrates
- Proteochemometrics analysis of substrate interactions with dengue virus NS proteases
- Generalized modeling of enzyme-ligand interactions using proteochemometrics and local protein substructures
- Rough set-based proteochemometrics modeling of G-protein-coupled receptor-ligand interactions
- Improved approach for proteochemometrics modeling: application to organic compound-amine G protein-coupled receptor interactions
- Melanocortin receptors: ligands and proteochemometrics modeling
- Proteochemometrics modeling of the interaction of amine G-protein coupled receptors with a diverse set of ligands
- Peptide quantitative structure-activity-relationships, a multivariate approach
- Multivariate parametrization of coded and non-coded amino-acids
- New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of amino acids
- Bioinformatic approach for identifying parasite and fungal candidate subunit vaccines
- JAFA: a protein function annotation meta-server
- MetaMQAP: a meta-server for the quality assessment of protein models
- A consensus strategy for combining HLA-DR binding algorithms
- Prediction of HLA-A -restricted CTL epitope specific to HCC by SYFPEITHI combined with polynomial method
- ProPred analysis and experimental evaluation of promiscuous T-cell epitopes of three major secreted antigens of Mycobacterium tuberculosis
- ProPred: prediction of HLA-DR binding sites
- Predicting class II MHC/peptide multi-level binding with an iterative stepwise discriminant analysis meta-algorithm
- Class II MHC quantitative binding motifs derived from a large molecular database with a versatile iterative stepwise discriminant analysis meta-algorithm
- Iterative stepwise discriminant analysis: a meta-algorithm for detecting quantitative sequence motifs
- Neural models for predicting viral vaccine targets
- Building a meta-predictor for MHC class II-binding peptides
- A probabilistic meta-predictor for the MHC class II binding peptides
- A meta-predictor for MHC class II binding peptides based on naive Bayesian approach
- Strength in numbers: achieving greater accuracy in MHC-I binding prediction by combining the results from multiple prediction tools
- A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach
- Combination of fingerprint-based similarity coefficients using data fusion
- Connectionist-based Dempster-Shafer evidential reasoning for data fusion
- From protein microarrays to diagnostic antigen discovery: a study of the pathogen Francisella tularensis
- EpiJen: a server for multistep T cell epitope prediction
- NERVE: new enhanced reverse vaccinology environment
- DynaVacS: an integrative tool for optimized DNA vaccine design
- Vaxign: the first web-based vaccine design program for reverse vaccinology and applications for vaccine development
- Enzymes, metabolites and fluxes
j b/o/~em chicken liver evolutionary rehttinnships and impflcations for the resulation of phoophohpsse-a from snake venom to human secreted forms identification of a locality in snake venom a-ncurotoxins with a slsnlficant comlm*itinmd similarity to marine smdl ct-conotoxins: implications for evolution and structure activity al~ph[biml~ albmtm|nm ~s members of the albumin, alpha-l~toprotein. vitamin-d-binding protein mul~ flmily ~ni~on of the hnm~n llpoprotein lllmse gene and evolution of the llpase gene family e~'t~ion of cloned human reticulocyte - ipoxygenase and immunological evidence that -hpoxygetmses of different cell types are related identification of a protein alt~ inttaspecific evolution of a gene family coding for urinary proteins conservation between yeast and man of a protein a~ociated with i small nuclear rlbonucleoprotein stl~ctute and partial amino acid sequence of calf thymus dna topobmmaertt~-ii -coml~on with other type-h emmyme~ ol~nudeotide correlations between infector and hem genomes hint at evolutiotmry relationships. nu /e~ scot/~ik p& carotenoid desmurases fi, om ~ ~and nmoowo~craua are stru~ and l~n~'tinnally comerved and eonmin domains homolosons to flavoprotein dimdflde oxldoreductm~ deininger pi stt'uc~uee and vsrisbihty of recently inserted alu family members a novel neutrolphfl chemmtttactant generated duan an ln~ammmtory reaction in the l~mt peritoneal cal~lt~ tt~ t~t~o -l~tl'~t~tloil~ ~ amino acid seque~tce and structural relmtmmhip to interkukin-& b~ffx~m j the multlfimctinna -methylmllcyllc acid syn~ ge~e of ~~ ~ its ge~e structmm ieimive to tl~t of other po~lyketide symhase~. f.urj b/odaem mammalkm ublquitin carrier prmmtmh but not i~:i~k, ame ltdated to the -kda yeast , rad . bk~chem b/qohys res commun chambers gk: sequence. 
structure and evolution of the c.ene codin b for ~t-gi~erol- -phe~plmte ~rdrotfm~ in om,qt~ the cotaplete sequence of bogu/ktmm nenrotoxin type-#, and com~ with other clostrldhtl neugoto~hm if: a pamlly of cxam~fltutive c/bbp-llkc dna blndln~ proteins attenuate the il-l~t induced, ni~b mediated trans-activation of the ansiotemflnogen gene acute-phase response element different fort~ of ultmhithomx proteim generated by alternative spttcim~ are functionally equivalent evolution of collagen-iv genes from a -batm pair faton --a role for lntrmm ht gem~ evolution evolution of the insulin superfamlly tcetins are structoraily related sertoli cell proteim who~ ~on is tightly coupled to the iprtsence of germ cells ivarie r~ a bovine homolo s to the human myolletti c determination factor myf~ sequence conservation and ' proce~ing of transcripts proteiu sertne threonine phoephatmes -an expanding family coppes zl divergence of duplicate genes in three sciaenid species (perciformes) from the south co~t of uruguay coasfaneda m: rrs~j~o~a (mu-~--a~) repetitive dna seqmmce l~vointion in ~hically mstinct isolates. cor~ bnz~n physiol repetitive seq~ce involvement in the duplication and divergence of mouse lysozyme genes the structure of a subtermlnal nut/e/ a /ds res schoofs i~ h~ between amino acid sequenc~ of ~ v~'lt~tm'stte peptide hormones and peptides ~mlated fi-on~ invertebrate sources. corn# bm&.n mg~ol bun'nng s, ~us r& lqatelet gtycoprotetn nb-ma protein antssonim from snake venoms ---evidence for s fumlly of p~telet-~sgqpttlon lnhll~tol~ hikher plant orilgins and the whylogeny of gt~en allpte simihtrity between the t~ ~ sindln s proteins abf how big is the univet~ of e~otm worklwide diffegences in the ~ncideace of type ! diabetes are ammciated with amino acid variation at pos/tion of the hi~-dq ~ chain yeast general trtnscelptimt l~ctor gf! --sequence requirements for binding to dna mad evointhmky commrvttion. nudeg m/ds res concerted ]rv~ution of primate mplm smelllte dna. 
e'~kmce foe tm an~mt~ sequence sbm'ed by goal~ md human x ~e alpha ~ttdllte the nuchl~m~ sequence of etve ribommaal protein genea from the o/anene. of ~~ impacattom concem~ the mtytosene~ relationship bet~-en cyanelles and chloropluts wmslanoer l~ a new member of a secretory protein gene family in the dipteran c~t~onomot~ tentaus ~ a variant repeat stracture the ~r sequence ~ --die.inn on the x-chromosome and y-chromosome of a large set of closely related sequence~, most of wmda are i~eudogene~ ba~ttmo~e l~ cloning of the pso dna binding subutdt of nf-kapi~-b -homolo~" to gel and dortml l-~te two-monooxr~muse from m~ --clon~ nucleotide sequence, and primary structu~ homology within an enzyme family genetic hot~o~n~ty ~ acute and chronic acute forms of spinal muscular atrophy genetic variants of bovine ~-lactogiobulin --a novel wild.type ~-lacto#obulin w and ~ts primary sequence. b/or (~rn h tt e sey/er l~ltogh~ dna evolution in the olmcm species subgroup of drooophll~ f mot evot lovell-badge l~ a gene mapldng to the sex-determining gegion of the mouse y chromommae ~ a member of a novel ~ of zmbryonk~ly genes ~titmte , -dioxy~mm~ from p~.udomotm~ pustfi~mtion, characterization, ~md compm'tson of the f.mtymes from psemffmmm~m ta~o~k-ron/and aaammms~ spec~clties of the peptidyl prolyl cis-tratm isomeric activities of cydophmn and fk- bindh~ protein --evidence for the existence of a family of distinct enzymes. 
b~x/aem/ary mltochondrl~ dna evolution in primates -tt-atmltion gate has been extremely low in the lemug homeobox containing genes in the nematode ~enorbabd/f~ elk.gamin nucleic ac shdic add fateesses of ~ • voluttomu.y origins have serine active sites f~entlal arginlne residues dewact-rrer l~ the ltilm omal rna ~-quence of the s~t anemone anemom~s ssdcmta and its evolutionary intuition amomqg other eukaryotes inferred b'om s~l,.m.~ comlmrttmas of a heat shock g~ae in two nematorl~ the l~'/o multtgene family of ok~hag of cdna ~ for the ~ omin of human complement component ca~bi~una protein, seqaenoe homolo~ with thc a c~t~:~a~h proc natl acad s¢t usa highly conserved core domain and unique n terminus with presumptive regulatory moti~ in a hmman tata factor (l'lql~) [letter] identification cimractertzaflon of a novel member of the nerve growth fmctor/besln.dertved neurotrophic factor family ~ bind to s~dlfmme [eal(~-so )l~l-lcer ] and has a sequence homology with other pt'otelns that bind sulfated glycoconjut~tes anllllo acid seqmmce of clnnamomin, a new member of the elicitin family, and its comparison to cryptogein and capsicetn soluble and mtmo[~tle~ioc~ta~l h~ low-ml~n|ty adenomne binding protein (adenotin) --properties and homology with mtmmall~la and avian stress protelus. b~-/~om/stry edolatlon of complementary dna$ f~lcoding a cerebellum-enriched nuclear factor-i family that activates tt'anscription from the mouse m~.lin basic protein promoter ye~mt mltochondrlal dna polymet'ase is related to the family a dna polymerases nudeotide and deduced amino add sequence of a human cdna (nqo ) corresponding to a second membeg of the nad(p)h --quinone oxldoreductase gene family --extensive polymorphism at the nqo gene locus on chgomo~ome- . 
b/oc.heraistry ult~ sltnlltt'leles a~llolltll enzyme pterin binding sites as demonstrated by a monoeinnal amiidiotypic antibody blundell tl molecular anatomy: phylogenetic relationship* derived from three~limenslonal structure~ of proteins subfamily structure and evolution of the hnmtn . family of repetitive scquence~. f mot evo selmt~te mltochondrlal dna sequences are contiguous in htlmsa~ genol~ic dna l~t~lit~ within mmmm~lla~ sogl~tol deh~ --the prlmm'y structure of the human liver enzyme heterogeneous modifications of the l /alo ltrote~a of ibtegleuldn-~t cells are concentrated in a/,ti~hly r~qg~.titlv ~ amino-t~ vaults.ell rebofmcleoprotein structures are msl~ conserved among higher and lower e~tes rnas le~d support to the monophyletic nature of the ~erla lmmunoloslcal ~lmllmtties ~etween cytosolic and partictdate tissue trans#utamilsc. febs lat mans~ti x#tope m~w~zed by a protective m~aodonm antibody is identical to the sta~e-specific embryonic antlgen-l. proc naa acad sa o~ the murg gene of t-brucei contains multiple dom.l.m of extensive editinil and is hofaoin~m~ to a subultit of nadh dehy~ neparm-bindl~ nenrotrophtc x~tor (hbnf) and mk, member's of z new i~mily of homolosous~ developmentally l~ted proteitm pugmattion and strucrmml ~on of pttcentel nad + .mtked -hydroxyproma#andm dehydtoffmase ~ the primary structure reveals the enzyme to belon to the short-alcohol l)ehydrogena~ l~mlly. b/ochemistry structores and homologies of carbohydrate ~pho~ system ep~l~[ln, a ~o~a-gmjoclated mudn, is generated by a polymorphlc gene encodin splice variants with alternative amino termini a new member of the leucine zipper class of proteins that binds to the hia drct promoter. 
sc/ence attalysi~ of cdna for human ~ ajudgyrin i~dicltes a repeated structure with homology to tissue-differentiation a~td cell-cycle control protein the b subunlt of a rat hetefomeric ocaat-binding transcription factor shoes a striking sequence identity with the yeast hap transcription factor homology to mouse s-if and sequence similarity to yeast pt~ stgucttu'e and evolution of the small nuclear rna multigene family in primates: gene amplification under nat-¢wal selectinn? ident~catinn of an additional member of the proteln.tyrushle-phosp~ family --l*vidence f~ alternative spliclog in the tyrmine phosphzmme domain a ~le am~o acid difference dis~ishes the human and the rat sequences of statlmaln, a ubiquitous intracehular pho~phoproteln ~ with cell item comp~ison of the seve~le~ gene* of drosop~ffa t~'ff~ end ma ~ muty, an adenine ~ active on g-a mislmirs, has homology to ~t evolution of largesubunit iutna structuge --the ~cation of imvetbe~t d dommin amon mmjor phyiolpmetic groups discrepancy in diveqlenoe of the mltodtondrlal and nuclear genomes of m sensor/and y~ j mot evot ~ adenylate deamll~t~. a mt~flige~e fam~ in p..m~,n, and rats isolmion and structure of ceerol#m, itna,~le hat~ peptmes, from the smm~m, ~ mo~ comp a~a rmm~ i~ vmotocin ge~ of the teleom f.,xott intro~ botany. ~ hot~ ot'l~mization. b~hemioy the adb gene areal share features of sequence structure and nudeast~protected sites. m /cell bto/ the amino-acid sequence of multip/e lectins of the #.corn barnacle m~us-lgo~ and its homology with .animal ]~'tllls. bioclx'm btqobys acta amino add ~.-quence of mtmkey erythrocyte glycophorba mk. its amino acid ~'qu~'~icc ]f][~ a stri~tl~ homology with that of human glycophorin a flsp~r p& drtmophila proliferating cell nuclear antigen. structural and functional homology with its mammalian coonterpart phylogeny of n|trogen*me s~queac~ in ][~mnkla and other nlteogen-fixing ml~m$ vertebrate prot~mlne c~ne evolution. . 
sequence alignments and gene structure florin l~ a major styl~ matrix polypeptid~ (sp ) is a member of the f~thogenesia-reiated proteins superciass complete amino acid sequence of rat kidney ornithine aminoteat~fet-~e --identity with ijver omithine aminotransferme. l bnxl;em (tokyo) rlbonuclease p --function and variation. j b/o/~bem the primary strum of glycoprotein-m from bovine adrenal medullary granules --sequence similarity with bnmmn serum protein- , and rat sertoli cell giycoprotein- compm'ative ~quence/umlysis of m~mmantan f'a~or ix protaotegs the amino acid sequence of the b nman l~ia polymet'a~-h -kda subunit hrpb is highly cotmerved among eukaryotes phylogenetic conservation of atylsulfatases --cdna cloturing and expre~ion of hnman aryisul~t~e-b. j b/o/cbem c.oll/l~'vlltion and diversity in fatnllies of coated vemcle adaptlns cllaracterizaflon of petel porcine bone sialoproteins, soca'~ted phosphopgotein ! (sppi, osteopontin), bone siaioprotein, and a .kda glycoprotetn ~ demonstration that the -kda glycoprotein is derived from the carboxyl terminus of sppi characterization of matteuccin, the . s storag~ prote~ of the ostcich fern -evolutionary iteiatinnshlp to angiosperm seed storage ~ a new mmber of the glutamine-rlch protein gene family is characterized by the absence of internal lgepe~ts and the androgen control of its expression in the subm*ndlbuiar gland of pad novel insect n~ with homology to peptides of the vea'te~ tachykinin family identircation of a novel platelet-derived neutrophli-chcmaotgctic po~ with structural homology to piatelet-factor- a novel repeated dna sequoncc located in the intergenic regions of ba~tceial chromosomes. 
nuc eic.,k:ids res the proianlin storage protellx¢ of cere~ seeds ~ structure and evolution functional analysis of the '-terminal part of the balbiani ring gene by hlterspecies sequence comparison dr= mammaban ~yl phosphate symhetase (cp*) --cdna sequence and evolution of the cl m domain of the syrian hamster multifunctional protein cad mammalian dihydroorotase --nudeotide sequence, peptide sequences, and evolution of the imhydroorotsse domain of the multifunctinnal protein cad a receptor for tumor necrosis factor defines an unusual family of cellular and viral proteins the control of flower morphogenesis in a~..ffd~um majusthe protein shows homoinff~ to transcription factors an element of symmetry in ytmst tata-box binding protein transcription factor-lid --consequence of an ancestra/ duplication? c-type natciuretic peptide (cnp): a new member of nateinretic peptide family identified in porcine brain evolution of antioxidant m~: ediol-dependent petoxidm~.s and thiol~ ~umong ptocaryotes towards the evolution of ribozymes alkyl hydroperoxide reductase from sa/mone/ta ~ur/um --sequence and homology to thinredoxin reductase and other fiavoprotein disuliide oxidoreducmses fc: nonuniform evolution of duplicated, developmentally controlled c~azrion genes in a sillumoth the fission yeast cutl + gene regulates spindle pole body duplication and has homolosy to the buddin structural homology b~ween the hnmmn fur gene product mad the sub---like protea~ encoded by ye~t/~x . nuc~ a¢/ds res nudeotide sequences and novel steuctut~ features of hnm=. and cimm~ lighter ~# primary stt~t~ and expression of a nuclear-coded subunit of complex-n n~ to protetm specified by the chtoropiast genome. 
b/ chera bnfhys r~ commun a novel gene member of the human giycophorin-a and glycophorin-b genc fatuily -molecular cloning and expression the x-chromosome of monotremes shares a highly conserved region with the eutherlan and marsupial x-o~romosomes despite the absence of x-chromosome ittactt~tion c~lract~tion and or~= nl~tion of dna sequences adjacent to the evidence for a new fmily of evolutionarily conserved homeobox genes elellatltlll and albolabrin purified peptides from viper venoms --homologies with the rgds domain of flbrinogen and yon willebrand pactor measurement of $~tiv~-site homology between potato and l~bbit muscle alpha-glum phosphoryiases through use of a iane~r free energy relationship white ~ weiss ~ the neuroflbromatosis typed gene encodes a protein related to gap the dna damage-inducible gcne-dinl of saocbarom q~ewcet~#.s/ae encodes a regulatory subunit of elbonucleotide reductase and is identical to gnr fhlgegprinting of ne~lr-homogeneous dna hgase-i and ligase-h from eh,m~n cells --similarity of their amp-binding domains control of m na st~mlity in • chnoc~qg.~um, by 'inverted ltepeats: effects of stem and loop mutations on degradation ofxtmba mlna/n vt~ nuc/e~ ac alternative messenger rna structures of the ciil-gene of bacteriophage ~. determine the rate of its tt'ansbttion initiation alternative mrna structures of the cm genc of bacta~ophage ~ detc:'mine the rate of its translation initiation. j mo/b~ / a model fog iina editing in klnetopiastid mltochondrla --guide rna molecules transcribed from max/circle dna provide the edited information elements and coding sequences. j mol bio , : - . chang c-y, ~ d-a, mohandas til chung b-c: stt~ctut~e, ~-quence, chromo~maal location, and evolution of the human fercedoxin gene family. dna cell b/o/ , : - key: cord- -w z wir authors: sola, monica; wain-hobson, simon title: drift and conservatism in rna virus evolution: are they adapting or merely changing? 
date: - - journal: origin and evolution of viruses doi: . /b - - / - sha: doc_id: cord_uid: w z wir this chapter argues that the vast majority of genetic changes or mutations fixed by rna viruses are essentially neutral or nearly neutral in character. in molecular evolution one of the remarkable observations has been the uniformity of the molecular clock. an analysis of proteins derived from complete potyvirus genomes, positive-stranded rna viruses, yielded highly significant linear relationships. these analyses indicate that viral protein diversification is essentially a smooth process, the major parameter being the nature of the protein more than the ecological niche it finds itself in. synonymous changes are invariably more frequent than nonsynonymous changes. positive selection exploits a small proportion of genetic variants, while functional sequence space is sufficiently dense, allowing viable solutions to be found. although evolution has connotations of change, what has always counted is natural selection or adaptation. it is the only force for the genesis of a novel replicon. there is no such thing as a perfect machine. accordingly, nucleic acid polymerization is inevitably error-prone. yet the notoriety and abundance of rna viruses attests to their great success as intracellular parasites. indeed some estimates suggest that % of viruses have rna genomes. it follows that replication without proofreading can be a successful strategy. there is a price to pay, however. manfred eigen was the first to point out that without proofreading there is a limit on the size of rna genomes. obviously, if the mutation rate is too high, any rna virus will collapse under mutation pressure. as it happens, rna viral genomes are up to kb long while mutation rates are - per genome per cycle or less. possibly, rna viruses and retroviruses have simply not invested in proofreading, in which case mutations represent an inevitable genetic noise, to be tolerated or eliminated. 
hence there would be no loss of fitness, fixed mutations being neutral. a corollary of this would be that the intrinsic life style of a virus is set in its genes. the alternative is to suppose that most fixed mutations are beneficial to the virus in allowing it to keep ahead of the host and/or host population. by this token variation is an integral part of the viral modus vivendi. the twin requirements of a successful virus are replication and transmission. under the rubric replication, a virus could vary to increase its fitness, exploit different target cells or evade adaptive immune responses. in terms of transmission, variation might allow a virus to overcome herd immunity. these two scenarios emphasize the two sides of the molecular evolution debate; one highlights neutrality while the other puts a premium on positive selection. purifying or negative selection is ever operative: a poor replicon invariably goes asunder. through rounds of error and trial, positive selection is the only means of creating a novel replicon. so long as the ecological niche occupied doesn't change, the virus doesn't need to change, purifying selection being sufficient to ensure existence. this raises an important issue: we know that, over the time that we are living and loving, as well as doing experiments, writing papers and reviewing, humans are not evolving. ernst mayr noted that "the brain of years ago is the same brain that is now able to design computers" (mayr, ). positive fitness selection among mammals is effectively inoperative over our lifetimes, and certainly since we have known about hiv and aids. how is it that vertebrates, invertebrates, plants, fungi and bacteria, all species with a low genomic mutation rate, can control viruses which mutate so much faster, sometimes by a factor of (holland et al., ; gojobori and yokoyama, ; domingo et al., )? yet they do. we come to the basic question: to what extent is genetic variation exploited by an rna virus, if at all?
and if so, what is the virus adapting to? the answer invariably given to the second question is "the adaptive immune system" (seibert et al., ). yet apart from the vertebrates none of the other groups mentioned above mounts antigen-specific immune responses. this chapter will argue that most fixed mutations are neutral. in molecular evolution one of the remarkable observations has been the uniformity of the molecular clock. although there has been intense debate as to what molecular clocks mean and quite how far they deviate from null hypotheses, fibronectin fixes mutations faster than alpha- or beta-globin, which do so faster than cytochrome c, etc. rates of amino acid fixation are intrinsic to different proteins. yet some viruses give rise to persistent infections, others to sequential acute infections. all succumb to the vagaries of transmission bottlenecks. how many rounds of infection are necessary to fix mutations? for example, the tremendous dynamics of viral replication have been described. whether it be hiv, hbv or hcv, plasma viral turnover is of the order of - virions per day (ho et al., ; wei et al., ; nowak et al., ; zeuzem et al., ). between % and % of plasma virus is cleared. in the case of hiv this can involve more than rounds of sequential replication per year (wain-hobson, a,b; ho et al., ; pelletier et al., ; wei et al., ). many of these variables and unknowns can be removed by comparing the fixation of amino acid substitutions in pairs of viral proteins from two genomes. if one assumes that the two gene fragments remain linked, through the hellfire of immune responses and the bottlenecking inherent in transmission, relative degrees of fixation should be attainable. note that, so long as frequent recombination between highly divergent genomes is not in evidence, this assumption should be valid. this procedure is outlined in figure . . the first example is taken from the vast primate immunodeficiency virus database (lanl, ).
when normalized to the p reverse transcriptase product designated rt, amino acid sequence divergence for p gag, p gag, integrase, vif, gp , the ectodomain of gp and nef all reveal highly significant linear relationships ( figure . , table . ). the relative rates vary by a factor of two or more. why the hypervariable gp protein shows a relatively low degree of change with respect to the reverse transcriptase (rt) can be explained by gap stripping, which eliminates the hypervariable regions. consequently the gp data effectively reflects the conserved regions. the linearity, even out to considerable differences, indicates that multiple substitutions and back mutations, which must be occurring, do so to comparable degrees. although these data were derived from completely sequenced primate immunodeficiency viral genomes, analyses on larger data sets, such as p gag/p gag or gp /gp , yielded relative values that differed from those given in table . by at most %. the absence of points far from the linear regression substantiates the assumption that recombination between highly divergent genomes is rare. this does not preclude recombination between closely related genomes. the linear regressions passed close to the origin in nearly all cases. only for nef was there some deviation, suggesting that nef was saturating to a different extent from all other proteins. however, as linear correlations involving nef data were always statistically significant, this trend may be fortuitous. note that the data cover the earliest phase, intrapatient variation (generally < %), continuing smoothly to cover interclade, intertype and finally interspecies comparisons. yet this is in spite of different environments: that of an individual's immune system, different immune systems stigmatized by highly polymorphic hla, and finally differences between humans, chimpanzees, mandrills and african green monkeys accumulated over million years. the same forces were uppermost during all stages of diversification. it is remarkable that the very different proteins, such as gp and the gp ectodomain (surface glycoproteins), p gag and p gag (structural), rt and integrase (enzymes) and nef and vif (cytoplasmic), all yield linear relationships ( figure . ), as though fixation was an intrinsic property of the protein.

[legend to table . :] it is well established that protein sequence comparisons are more informative when weighted for genetic and structural biases in amino acid replacements (henikoff and henikoff, ). in the blosum weight matrices series, the actual matrix that was used depends on how similar the sequences to be aligned are; different matrices work differently at each evolutionary distance. for a given virus, different protein sequence sets were compared to a given reference such as rt in the case of hiv/siv. n indicates the number of independent two-by-two comparisons. the data were checked for the possibility that a rogue genome strongly influenced the data. only in the case of the inoviridae were there insufficient complete sequences, six in fact, to yield satisfying analyses; instead all pairwise comparisons were made, hence the data points reflect dependent data (#). the forms of the linear regressions are given, where y and x refer to the first and second protein listed in the column "paired proteins". the correlation coefficients r were highly significant in all cases, the corresponding probabilities being + < . ; " < . ; * < . .

[legend to figure . :] graphical representation of paired divergence for orthologous proteins taken from complete hiv- , hiv- and siv genome sequences; y = different proteins, x = p sequence of the reverse transcriptase (rt). x and y values correspond to blosum-corrected fractional divergence. only non-overlapping regions were taken into account. the straight lines were obtained by linear regression analysis; their characteristics are given in table . .
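the paired-comparison procedure described above can be sketched in a few lines (an illustrative toy, not the authors' pipeline: divergence here is a plain mismatch fraction on invented sequences, whereas the chapter uses blosum-corrected distances on real alignments, and the genome dictionaries and protein names below are made up for the example):

```python
from itertools import combinations

def fractional_divergence(seq_a, seq_b):
    """Fraction of aligned, ungapped positions that differ between two sequences."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs) if pairs else 0.0

def paired_divergences(genomes, protein, reference):
    """For every genome pair, (x, y) = (reference divergence, protein divergence)."""
    return [(fractional_divergence(g1[reference], g2[reference]),
             fractional_divergence(g1[protein], g2[protein]))
            for g1, g2 in combinations(genomes, 2)]

def regression_slope(points):
    """Least-squares slope; approximates the protein's fixation rate relative to the reference."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den
```

with real data the slope plays the role of the relative rate reported in the tables, and points falling far off the fitted line would flag recombinant or otherwise rogue genomes.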
applying the same analysis to complete rhinoviral genomes yielded comparable results, i.e. highly significant linear relationships for vp , vp and vp (capsid proteins), p a and c (proteases), p c (cytoplasmic proteins involved in membrane reorganization) compared with the rna-dependent rna polymerase ( d) as reference ( figure . , table . ). hence figure . does not represent some quirk of primate lentiviruses. of course, vertebrate viruses have a redoubtable adversary in the host adaptive immune system. the swiftness of secondary responses is reminder enough. an analysis of proteins derived from complete potyvirus genomes, positive-stranded rna viruses, yielded highly significant linear relationships (table . ). a number of revealing points can be made. firstly, the linear relationships hold out to very large blosum distances ( . ). secondly, potyviruses infect a wide variety of different plants, as their florid names betray. finally, the linear relationships cannot result from adaptive immune pressure because plants are devoid of adaptive immune systems. they only have powerful innate immune responses. unfortunately there are insufficient insect rna viral sequences to allow a comparable study. however, a glance at a few beetle nodavirus capsid sequences (dasgupta et al., ; dasgupta and sgro, ) shows extensive genetic variation with a majority of synonymous base substitutions, typical of most comparisons of mammalian viral sequences (see below). for the time being there doesn't seem to be anything obviously different about insect virus sequence variation. although insects do not mount adaptive immune responses, the breadth and complexity of their innate immune systems is salutary (brey and hultmark, ) . a final example is afforded by the inoviruses, bacteriophages of the fd group, which includes m . although dna viruses fix mutations at a slower rate than rna viruses, they too show linear relationships among comparisons of their i, ii, iii and iv proteins (table . 
). and of course bacteria are devoid of adaptive immunity as well. whether the comparisons were between capsid proteins versus enzymes, or secretory versus cytoplasmic molecules, significant linear relationships were obtained for pairwise comparisons in amino acid variation in all cases. such proteins are vastly different in their three-dimensional folds and functions. some are "seen" by humoral immunity, others are not. for the plant viruses and bacteriophages, only innate immunity is operative. it is as though the rate of amino acid sequence accumulation is an intrinsic feature of the protein, reminiscent of the differing slopes for the accumulation of substitutions by alpha-globin and cytochrome c already alluded to. of course pairwise comparisons of these two proteins from differing organisms would yield a straight line going through the origin in a manner typical of figures . and . . hence it is fairly safe to assume that, for viral proteins too, amino acid substitutions are accumulated smoothly over time. indeed, this has been shown explicitly for a number of proteins from a varied group of viruses, including influenza a, coronaviruses, hiv and herpes viruses (hayashida et al., ; gojobori et al., ; querat et al., ; villaverde et al., ; elena et al., ; sanchez et al., ; mcgeoch et al., ; yang et al., ; leitner et al., ; plikat et al., ). the above analyses indicate that viral protein diversification is essentially a smooth process, the major parameter being the nature of the protein more than the ecological niche it finds itself in. the simplest hypothesis to explain the smoothness of protein sequence diversification is that the majority of fixed amino acid substitutions are neutral, being accumulated at rates intrinsic to each protein.
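the hypothesis that substitutions accumulate neutrally at protein-intrinsic rates can be illustrated with a toy simulation (all parameters below are invented for illustration and no selection is modeled): two proteins in the same genome drift at different per-site rates, and pairwise comparisons of their descendants fall near a line through the origin whose slope reflects the rate ratio.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

def drift(seq, rate, generations, rng):
    """Neutral substitution: each site mutates with probability `rate` per generation."""
    s = list(seq)
    for _ in range(generations):
        for i in range(len(s)):
            if rng.random() < rate:
                s[i] = rng.choice(ALPHABET)
    return "".join(s)

def divergence(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

rng = random.Random(42)
ancestor = "".join(rng.choice(ALPHABET) for _ in range(400))
points = []
for gens in (50, 100, 200):
    # two lineages descend independently from the same ancestor for `gens` generations;
    # the "slow" and "fast" proteins ride the same lineages at rates differing twofold
    slow = divergence(drift(ancestor, 0.001, gens, rng), drift(ancestor, 0.001, gens, rng))
    fast = divergence(drift(ancestor, 0.002, gens, rng), drift(ancestor, 0.002, gens, rng))
    points.append((slow, fast))
# the (slow, fast) pairs lie near a line through the origin with slope close to
# the rate ratio, mimicking the paired-divergence plots, with no selection at all
```

at larger depths the points bend away from the line as sites become multiply substituted, which is the saturation effect the text raises for nef and for the blosum correction.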
this is not to say that positive selection is inoperative, merely that the majority of fixed substitutions are essentially neutral, so much so that positive selection does not strongly distort the data from the linear relationship expected for genetic drift. in other words, neither the impact of different environments nor the ferocity of the adaptive immune response has much to do with the fixation of most substitutions. this is important for the one-dimensional man in all of us sequencers, who see all mutations and ask questions about genotype and phenotype, usually about genotype. a short aside is necessary here. it is interesting that in a few areas of rna virology much has been made of escape from the adaptive immune response, particularly cytotoxic t lymphocytes, so leading to persistence (nowak and mcmichael, ; mcmichael and phillips, ). however, it is not at all obvious that this is the case (wain-hobson, ). it must not be forgotten that it is possible to vaccinate against a number of rna viruses such as measles, polio and yellow fever. be that as it may, many dna viruses, intracellular bacteria and parasites persist. in these cases de novo genetic variation arising from point mutations is too slow a means to thwart an adaptive immune response. for example, after generations under experimental conditions whereby muller's ratchet was operative, s. typhimurium accumulated mutations such that only % of the lineages tested had suffered an obvious loss of fitness (andersson and hughes, ). that this number of generations could be achieved within as little as days gives an idea of the time necessary to generate a mutation affecting fitness. this is more than enough time to mount a vigorous immune response. some inklings of immune system escape for the herpes virus ebv (de campos-lima et al., ) came to nought (burrows et al., ; khanna et al., ).
when antigenic variation is in evidence among dna-based microbes, it invariably results from the use of cassettes and multicopy genes rather than point mutations arising during dna replication. and of course such complex systems could only have come about by natural selection. finally, de novo genetic variation of an rna virus has never been suggested or shown to be necessary for the course of an acute infection. for a virus to persist thanks to genetic variation, the phenomenon of epitope escape must be strongly in evidence by the time of seroconversion, generally - weeks. yet such data are not forthcoming, and not for want of trying. when viruses do play tricks with the immune system it is invariably by way of specific viral gene products that interfere with the mechanics of adaptive and innate immunity (ploegh, ). in the clear cases where genetic variation is exploited by rna viruses, it is used to overcome barriers to transmission set up by the host population, e.g. herd immunity. the obvious example is influenza a virus antigenic variation in mammals. another way of assessing the contribution of positive selection to sequence variation is to compare the relative proportions of synonymous (ks) and non-synonymous (ka) base substitutions per site. a ka/ks ratio of less than 1 indicates that purifying selection is uppermost, while a ratio of more than 1 is taken as evidence of an excess of positive selection. comparisons for hiv proteins from different isolates have yielded the same result (myers and korber, ). some mileage was made out of the fact that this ratio increased with increasing distance of sivs with respect to hiv- , which in turn led to a discussion of siv pathogenesis (shpaer and mullins, ). however, this may reflect a lack of adequate correction for multiple hits. this effect is illustrated by a comparison of the set of orthologous proteins encoded by herpes simplex viruses 1 and 2 (hsv-1 and hsv-2; figure . a).
the more divergent the protein sequence, the greater the ka/ks ratio. that some proteins fix substitutions faster than others is no surprise. yet as figure . b shows, the ks values change little as they are near to saturation. when ka is small, ks >> ka. this suggests that reliable interpretation of ka/ks ratios is possible only when the degree of nucleic acid divergence is small. now this is the realm of viral quasispecies rendered accessible by pcr. hiv studies abound, reflecting both the phenomenal degree of sequence variation and its importance as a pathogen, so we'll stick to some such examples that are illustrative. [figure legend (data from dolan et al., ): a. ka/ks ratio as a function of uncorrected percentage amino acid sequence divergence (linear regression was ka/ks = . divergence + . , r = . (p < . )). b. individual ks and ka variation with percentage divergence (ks = . divergence + . and ka = . divergence - . , with correlation coefficients of . and . respectively, p < . for both). note how at small degrees of divergence ks >> ka; the gap narrows as divergence increases, basically because ks is approaching saturation, being uncorrected for multiple and/or back mutations.] concerning ka/ks ratios for hiv gene segments, widely varying conclusions have been published supporting all sides (meyerhans et al., ; pelletier et al., ; wolinsky et al., ; leigh brown, ; price et al., ), so much so that three comments are in order. firstly, many studies have used small numbers of sequences and substitutions, and even regions as small as nonameric hla class i-restricted epitopes. in such cases statistical analyses are essential to test the significance of the distribution of synonymous and non-synonymous substitutions. this is particularly important as the point substitution matrix is highly biased (pelletier et al., ; plikat et al., ). it turns out that when the proportions are so analysed the distributions are rarely significantly different from the neutral hypothesis (leigh brown, ).
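counting synonymous versus non-synonymous codon differences, the raw material of a ka/ks estimate, can be sketched as follows. this is a deliberately minimal illustration with a hand-picked subset of the genetic code and invented toy sequences; a real analysis (e.g. nei-gojobori) also normalizes by the numbers of synonymous and non-synonymous sites and corrects for the multiple hits that cause the saturation discussed above:

```python
# minimal genetic-code entries covering only the toy sequences below
CODE = {"TTT": "F", "TTC": "F", "AAA": "K", "AAG": "K",
        "GAT": "D", "GAA": "E"}

def classify(c1, c2):
    """Classify a one-nucleotide codon difference as synonymous or not."""
    assert sum(a != b for a, b in zip(c1, c2)) == 1
    return "syn" if CODE[c1] == CODE[c2] else "nonsyn"

seq1 = "TTTAAAGAT"   # encodes F K D
seq2 = "TTCAAGGAA"   # encodes F K E
counts = {"syn": 0, "nonsyn": 0}
for i in range(0, len(seq1), 3):
    c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
    if c1 != c2:
        counts[classify(c1, c2)] += 1
print(counts)   # -> {'syn': 2, 'nonsyn': 1}
```

even this caricature shows the typical pattern discussed in the text: synonymous differences outnumber non-synonymous ones.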
secondly, the method for counting substitutions is highly variable, ranging from two-by-two comparisons, scoring the number of altered sites in a data set, to phylogenetic reconstruction. this latter method reflects more closely the process of genetic diversification. when so analysed, almost all of the data sets indicated proportions of synonymous to non-synonymous substitutions indistinguishable from that suggested by genetic drift and/or purifying selection (pelletier et al., ; plikat et al., ). thirdly, prudence is called for. the fact that obviously defective sequences can be identified, occasionally accounting for large fractions of the sample (martins et al., ; gao et al., ), indicates that not all genomes have undergone the rigours of selection (nietfield et al., ). indeed, in peripheral blood, hiv is invariably lurking as a silent provirus within a resting memory t-cell. such t-cells have half-lives of months or more (michie et al., ). hence it would be erroneous to interpret findings based on single or clustered samples (price et al., ). only when the above caveats are borne in mind is there any hope of discerning how hiv accumulates mutations. when these issues are attended to, purifying selection is dominant (pelletier et al., ; leigh brown, ; plikat et al., ). one must not deny that positive selection is operative, merely that it is hard to pinpoint when looking at full-length sequences. indeed it is like looking for the proverbial needle in a haystack. in the context of ka/ks-type analyses, the two classic cases in the literature are the hla class i and ii molecules and influenza a haemagglutinin (hughes, ; hughes and nei, ; ina and gojobori, ). the peptide contact residues of both class i and ii molecules have been under tremendous positive selection. changes in the five antigenic sites on the flu a haemagglutinin help the virus overcome herd immunity set up during previous flu epidemics.
consequently, finding ka/ks > 1 in these regions was, in some ways, a pyrrhic victory because the papers needed experimental data to identify the positively selected segments in the first place. more recently endo et al. ( ) screened the sequence databases for proteins in which ka/ks > 1. of the homologous gene groups screened, covering about sequences, only a few groups came up positive, of which two were encoded by rna viruses: the equine infectious anaemia virus envelope proteins and the reovirus outer capsid proteins. the former case is intriguing as there is no obvious correlation between sequence changes and neutralizing antibodies (carpenter et al., ). the authors noted that, when a comparable ka/ks analysis was restricted to small segments, the number of protein groups scoring positive rose to % (endo et al., ). despite the explanatory power of these ratios, the number of identifiable cases of positively selected segments is small indeed. these numbers would probably shrink were phylogenetic reconstruction used. to summarize the section: synonymous changes are invariably more frequent than non-synonymous changes. positive selection may be operative in the evolution of viral protein sequences, but when it is, it apparently exploits only a small fraction of mutants. the two rates touted by evolutionary-minded virologists are the mutation rate and the mutation fixation rate. the first describes the rate of genesis of mutations; the second attempts to describe their fixation within the population sampled over a period of time. in the case where all substitutions are neutral, the mutation rate (m) equals the fixation rate (f) per round of replication. it appears that such a situation applies to the evolution of parts of the siv and hiv-1 genomes over - years (pelletier et al., ; plikat et al., ). if fixation rates are measured over one year, then f = n·m, where n is the annual number of consecutive rounds of replication.
it is simple to show that several hundred rounds of sequential replication are required (wain-hobson, b; pelletier et al., ). given that the proviral load of an hiv-1-positive patient (~ - ) changes by less than a factor of over years or more, and given the assumption that an infected cell produces sufficient virus to generate two productively infected cells, then annual production would be something akin to , or ~ , which is impossible. clearly even a productive burst size of is too large (wain-hobson, a,b). this must be reduced to . to achieve a realistic proviral load ( . ~ ). note that the real value for the effective burst size must be even lower, as proviral load is turning over more slowly than once a day. yet to explain the temporal increase in proviral load, the productive burst size must be or more. thus the calculation reveals massive destruction of infected cells, precisely what was to be expected from immensely powerful innate and adaptive immune responses. when purifying selection is in evidence, some additional factor must be introduced to couple the fixation and mutation rates. as the accumulation of most substitutions proceeds in a protein-specific linear manner for small degrees of divergence, the above equation can be modified to f = p·n·m, where 1 > p > 0 is a constant indicating the degree of negative selection. note immediately that, as p < 1, more rounds of replication are needed to produce the same percentage amino acid fixation. a corollary is an even greater degree of destruction of infected cells. consider the example of a virus that is fixing substitutions only slowly, about - per site per year, something like the ebola virus glycoprotein. the mutation rate for ebola is not known but is probably around - per site per cycle (drake, ). hence p·n ~ - . what is the value of n? most mammalian viruses replicate within h, while obviously outside of a body they do not replicate. consequently a value of n = - is probably not unreasonable.
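the relation f = p·n·m can be turned around to estimate the degree of purifying selection p from measured rates. the numbers below are illustrative placeholders chosen for the sketch, not values from the text:

```python
# f = p * n * m  =>  p = f / (n * m)
f = 1e-4   # assumed fixation rate, substitutions per site per year
m = 1e-5   # assumed mutation rate, substitutions per site per replication cycle
n = 300    # assumed replication cycles per year (~one cycle per day in vivo)

p = f / (n * m)
print(round(p, 3))   # -> 0.033
```

with these placeholder rates only a few per cent of the mutational input is ever fixed, i.e. most mutations generated are purged, which is the conclusion the text draws.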
accordingly p is minute. this means that most mutations generated are deleterious. of those that are fixed, most are neutral, as has been discussed above. the last two sentences describe a profoundly conservative strategy: rna viruses are seen merely to replicate, far more than giving rise to genetically distinct, even exotic, siblings. what a stultifying picture, in contrast to the shock-horror of tabloid newspaper virology and that atmospheric, yet profoundly ambiguous, term "emerging viruses". conservative perhaps, but is there any suggestion that viruses are more or less so than other replicons? like extrapolation, choosing examples can be problematic. however, let's consider one example, the eukaryotic and retroviral aspartic proteases (doolittle et al., ). the former exist as a monomer with two homologous domains, while the retroviral counterpart functions as a homodimer. despite these differences the folding patterns are almost identical, meaning that the enzymes may be considered orthologous. between humans and chickens there is approximately % amino acid divergence among typical aspartic proteases ( figure . ). the hiv-1 and hiv-2 proteases differ by a little more, %. no one would doubt the considerable differences in design, metabolism and lifestyle separating us and chickens. on either side of the hiv protease coding region one finds differences: hiv-1 is vpx- vpu+ while hiv-2 is the opposite, i.e. vpx+ vpu-; there are differences in the size and activities of the tat gene product; the ltrs are subtly different. yet both replicate in the same cells in vivo and produce the same disease, albeit with different kinetics: hiv-2 infection progresses more slowly. if these differences are esteemed too substantial, consider the % divergence between the hiv and chimpanzee siv proteases. these two viruses are isogenic. pig and human chromosomal aspartic proteases may differ by around %, the differences between these two species being, george orwell apart, obvious to all.
even by this crude example, the aids viruses would seem to be more conservative than mammals in their evolution. the same argument pertains to the rhinoviral 2a and 3c proteases (figure . ). this conclusion is even more surprising when it is realized that hiv is fixing mutations at a rate of - - - per base per year. by contrast, mammals are fixing mutations approximately one million times less rapidly, i.e. approximately - - - per base per year (gojobori and yokoyama, ). however, the generation times of the two are vastly different, about 1 day for hiv and about - years for humans. normalizing for this yields a -fold higher fixation rate per generation for hiv than for humans. amalgamating this with the preceding paragraph, we see that hiv is not only evolving qualitatively in a conservative manner, but it is doing so despite a -fold greater propensity to accommodate change. the same arguments go for almost all rna viruses and retroviruses. why is this? although they mutate rapidly, their hosts are effectively invariant in an evolutionary sense. probably sticking to the niche is all that matters, which is no mean task given the strength of innate and adaptive antiviral immune responses. john maynard smith's argument was simply put. for organisms with a base substitution rate of less than 1 per genome per cycle, he reasoned that all intermediates linking any two sequences must be viable, otherwise the lineage would go extinct. the example used was self-explanatory: word → wore → gore → gone → gene (maynard smith, ). the same is true for viruses, even though their mutation rates are orders of magnitude higher; the rate for a given protein is still less than 1 substitution per cycle. even for rather stable viruses like ebola/marburg and human t-cell leukaemia virus types 1 and 2 (htlv-1/-2), the number of intermediates is huge. while the enormity of sequence space is basically impossible to comprehend, the amount accessible to a virus remains vast.
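maynard smith's word game is exactly a breadth-first search through a space where every intermediate must be "viable" (here, a dictionary word; for a virus, a replicating genome). a small sketch with a toy word list:

```python
from collections import deque

WORDS = {"WORD", "WORE", "GORE", "GONE", "GENE", "WARD", "BORE"}

def neighbours(w):
    """Words one letter substitution away, i.e. viable single mutants."""
    return {c for c in WORDS
            if len(c) == len(w) and sum(a != b for a, b in zip(w, c)) == 1}

def path(start, goal):
    """Breadth-first search for a chain of viable one-letter changes."""
    queue, seen = deque([[start]]), {start}
    while queue:
        p = queue.popleft()
        if p[-1] == goal:
            return p
        for nxt in neighbours(p[-1]) - seen:
            seen.add(nxt)
            queue.append(p + [nxt])
    return None

print(path("WORD", "GENE"))   # -> ['WORD', 'WORE', 'GORE', 'GONE', 'GENE']
```

the point carries over directly: a lineage can cross sequence space only along chains whose every node is viable.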
for the lineage to exist, the probability of finding a viable mutant must be at least 1/(population size) within the host. imagine a stem-loop structure. any replacement of a g:c base pair must proceed by single substitutions, given that the probability of a double mutation is approximately - that of a single mutation. let substitution of a g:c pair pass by a g:u intermediate, finishing up as a:u. although g:u mismatches are the most stable of all mismatches, they are less stable than either a g:c or an a:u pair. there are two scenarios: either the g:u substitution is of so little consequence that it is fixed per se, in which case there would be no selection pressure to complete the process to a:u; alternatively, the g:u substitution is sufficiently deleterious for selection of a secondary mutation to occur from a pool of variants, so completing the process. yet the g:u intermediate cannot be so debilitating, otherwise the process would have little chance of going to completion. note also that if the fitness difference is small with respect to the g:c or a:u forms, more rounds of replication are necessary to achieve fixation of g:u to a:u. a corollary is that there must be a range within which fitness variation is tolerated. this is reminiscent of nearly neutral theories of evolution and their extension to rna viruses (chao, ; ohta, ). note also that from a theoretical perspective the same secondary structure can be found in all parts of sequence space with easy connectivity (schuster, ; schuster et al., ). figure . shows a number of variations on an hiv stem-loop structure crucial for ribosomal frameshifting between the gag and pol open reading frames. there have been substitutions at positions , , , and , and even an opening up of the loop. all come from viable strains, yet the environment in which these structures are operative, the human ribosome, is invariant.
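the constraint that every intermediate of a stem must still base-pair can be made concrete by enumerating single-substitution neighbours of one helix arm and keeping those in which all positions still pair (allowing the g:u wobble). a toy sketch with an invented four-base-pair stem:

```python
# allowed base pairs, including the g:u wobble
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"),
         ("G", "U"), ("U", "G")}

def pairing_preserved(arm5, arm3):
    """True if every position of the helix still forms an allowed pair."""
    return all((a, b) in PAIRS for a, b in zip(arm5, reversed(arm3)))

# invented helix: 5'-GGCA-3' paired with the arm written 5'->3' as UGCC
arm5, arm3 = "GGCA", "UGCC"
viable = []
for i in range(len(arm5)):
    for base in "ACGU":
        if base != arm5[i]:
            mutant = arm5[:i] + base + arm5[i + 1:]
            if pairing_preserved(mutant, arm3):
                viable.append(mutant)
print(viable)   # -> ['GGUA', 'GGCG']
```

of the twelve possible single substitutions in this arm, only the two that create g:u wobbles survive the pairing constraint; a g:c pair can thus only begin its journey to a:u via the wobble intermediate, exactly as argued above.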
if the changes are all neutral, the situation is formally comparable to the steady accumulation of amino acid substitutions. however, if the intermediates are less fit, it has to be understood how they can survive long enough in the face of a plethora of competitors, approximately 1/(mutation rate), or about for hiv. the latter is probably the case, as there are hiv-1 genomes with c:g to u:a substitutions at positions and ( figure . ). extensions of nearly neutral theory would fit these findings well (chao, ; ohta, ). that there are many solutions to this stem-loop problem is clear. if hiv-2 is brought into the picture, the remarkable plurality of solutions is further emphasized ( figure . ). degeneracy in the solutions found by viruses is revealed by some interesting experiments on viral revertants. the initial lesions substantially inactivated the virus. yet with a bit of patience, sometimes more than months, replication-competent variants that were not back mutations were identified (klaver and berkhout, ; olsthoorn et al., ; berkhout et al., ; willey et al., ; escarmis et al., ). as the frequencies of mutation and back mutation are not equivalent, such findings are, perhaps, not surprising. what they show is the range of possible solutions adjacent to that created by the experimentalist. loss of fitness can be achieved by sequential plaquing of rna viruses, the so-called muller's ratchet experiment, which has been analysed at the genetic level for fmdv. [figure legend: "shifty" rna stem-loop structures from the hiv-1 gag-pol frameshift region of m, n and o group strains as well as from hiv-2 rod. this structure is part of the information that instructs the ribosome to shift from the gag open reading frame to that of pol. in addition to the hairpin there is a heptameric sequence (underlined). frameshifting occurs within the gag uua codon within the heptamer and continues agg.gaa etc. * highlights differences in nucleotide sequences compared with the m group reference strain lai.]
different lesions characterized different lineages. recent work was aimed at characterizing the molecular basis of fitness recovery following large population passage. not one solution was found but a variety, even in parallel experiments (escarmis et al., ). this reveals the impact of chance in fitness selection on a finite population of variants, which is trivially small given the immensity of sequence space. another example of degeneracy in viable solutions is the isolation of functional ribozymes from randomly synthesized rna (bartel and szostak, ; ekland et al., ). from a pool of approximately variants, through repeated rounds of positive selection, it was estimated that the frequency of the ribozyme was of the order of - , which is small indeed. yet even erring by four orders of magnitude, distinct ribozymes could well have been present in the initial pool. although the sequence space occupied may well represent a tiny proportion of that possible for an rna molecule of length n, the space is so large that the number of viable solutions is large, large enough to permit a plethora of parallel solutions to the same problem. these experiments, ribozyme from dust, are cases in plurality. further evidence of the large proportion of viable solutions in protein sequence space comes from in vitro mutagenesis. for example, bacteriophage t4 lysozyme can absorb large numbers of substitutions (rennell et al., ), with very few sites resisting replacement (figure . ). other examples include the lymphokine interleukin , in which some forms with enhanced characteristics were noted (olins et al., ; klein et al., ). with modern mutagenic methods allowing mutation rates of . per base per site or less, hypermutants of the e. coli r67 dihydrofolate reductase (dhfr) were found by random sequencing of as little as clones (martinez et al., ). whatever the mutation bias, mutants with - amino acid replacements within the -residue protein could be attained (figure . ).
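the back-of-the-envelope argument, pool size times ribozyme frequency, is easy to reproduce. the exponents below are assumptions chosen only for illustration, since the original figures are garbled in this copy:

```python
import math

n = 100            # RNA length (illustrative)
space = 4 ** n     # total number of sequences of length n
pool = 10 ** 15    # assumed starting pool size in a selection experiment
freq = 10 ** -13   # assumed per-molecule frequency of active ribozymes

expected = pool * freq
print(f"space ~ 10^{int(math.log10(space))}, "
      f"expected ribozymes ~ {expected:.0f}")
```

even though the pool samples a vanishing fraction of a space of ~10^60 sequences, the expected count of active molecules is comfortably above one, which is the plurality argument of the text.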
other mutagenesis studies sought enzymes with enhanced catalytic constants or chemical stability. for subtilisin e, variants with enhanced features for two parameters could be identified from a relatively small population of randomly mutagenized molecules (kuchner and arnold, ). these data indicate that functional sequence space is probably far more dense than hitherto thought. most of the above examples concern maintenance or enhancement of function. an interesting example was recently afforded by engineering cyclophilin into a proline-specific endopeptidase (quéméneur et al., ). the proline binding pocket of cyclophilin was modified such that a single amino acid change (a s) generated a novel serine endopeptidase with a ~ proficiency with respect to cyclophilin. addition of two further substitutions (f h and n d) generated a serine-aspartate-histidine catalytic triad, the hallmark of serine proteases. the final enzyme proficiency was . x mol/ , typical of many natural enzymes. this shows the interconnectedness of sequence spaces for two functionally very different proteins. if sequence space were sparsely populated, the probability of observing such phenomena would be small. many viruses recombine, and via molecular biology more can be made, some of which are tremendously useful research tools, such as the shivs, chimeras between siv and hiv ( figure . ). although many groups have tried to recombine hiv-1 naturally with hiv-2 or siv, none has succeeded. natural and artificial recombination represent major jumps in sequence space. that one can observe such genomes means that the new site in functional sequence space must be only a few mutations away from a reasonably viable solution, otherwise it would take too long to generate large numbers of cycles and, along with them, mutants. the ferocity of innate and adaptive immunity must never be forgotten. [figure legend: systematic amino acid replacement of bacteriophage t4 lysozyme residues. amber stop codons were engineered singly into each residue apart from the initiator methionine. the plasmids were used to transform suppressor strains. a minority of the resulting single amino acid substitutions were found to be sufficiently deleterious to inhibit plaque formation. more than half ( %) of the positions in the protein tolerated all substitutions examined. the side chains of residues that were refractory to substitution were generally inaccessible to solvent. the catalytic residues are glu11 and asp20. adapted from rennell et al., .] [figure legend: hypermutants of dhfr encoded by the e. coli r67 plasmid. all were trimethoprim resistant. only differences with respect to the parent sequence are shown. a representation of the three-dimensional structure is shown above. adapted from martinez et al., , with permission.] off on an apparent tangent, the phylogeny of geoffrey chaucer's the canterbury tales was recently analysed by programs tried and tested for nucleic acid sequences. the authors used lines from fifteenth-century manuscripts (barbrook et al., ). apart from the fact that it appears that chaucer did not leave a final version but some annotated working copy, the radiation in medieval english space is fascinating. all the versions are viable and "phenotypically" equivalent even though the "genotypes" are not. it is ironic that william caxton's first printed edition was far removed from the original. (n.b., printers merely make fewer errors than scribes, tantamount to adding a 3' exonuclease domain to an rna polymerase.) given the inevitability of mutation, is it possible that over the aeons natural selection has selected for proteins that are robust, those that are capable of absorbing endless substitutions? for if amino acid substitutions were very difficult to fix, huge populations would need to be explored before change could be accommodated. recently the unstructured n-terminal segment of the e.
coli r67 dhfr was shown to stabilize amino acid substitutions in a non-functional miniprotein devoid of this segment (figure . ; martinez et al., ). while the mechanism by which this occurs is unknown, it suggests that there may be parts of proteins, even multiple or discontinuous segments, that help the protein accommodate inevitable change. formally it can be seen that such proteins would have both short- and long-term selective advantages, for they would permit the generation of larger populations of relatively viable variants as well as buffering the lineage against the effects of bottlenecking. [figure legend: genetic organization of naturally occurring hiv-1 and siv recombinants and unnatural, genetically engineered, siv-hiv-1 chimeras called shivs. segments are hatched according to strain origin. references are hiv-1 mal and hiv-1 ibng (gao et al., ), hiv-1 rw . (gao et al., ), sivagm sab-1 (jin et al., ) and shiv sbg (dunn et al., ).] what fraction of amino acid residues is necessary for function? answer: very few. a few examples taken from among the primate immunodeficiency viruses are typical. almost all these viruses infect the same target cell using the membrane proteins cd4 and ccr5. primary hiv-1 isolates use the chemokine receptor ccr5 and rarely the homologous molecule cxcr4, which differs by % in its extracellular domains. yet two substitutions in the viral envelope protein gp120 are sufficient to allow use of the cxcr4 molecule (hwang et al., ). curiously, the ccr5 chemokine receptor homologue us28, encoded by human cytomegalovirus, can be used by hiv-1 despite the fact that us28 and ccr5 differ by % in the same extracellular regions. clearly only a small set of residues is necessary for docking. another example is afforded by the vpu protein, which is unique to hiv-1 and the chimpanzee virus sivcpz (huet et al., ). vpu is a small protein inserted into the endoplasmic reticulum, tucked well away from humoral immunity.
despite an average amino acid sequence difference of . % among orthologous human and chimpanzee proteins, hiv-1/sivcpz vpu divergence is almost beyond reliable sequence alignment (figure . ): an n-terminal hydrophobic membrane anchor and a couple of perfectly conserved serine residues, which are phosphorylated, and that's about it. among hiv-1 strains, or between sivcpz sequences, the situation is a little better. yet the necessity of keeping vpu is beyond doubt. a fine final example concerns the hiv/siv rev proteins. these small nuclear proteins are crucial to viral replication. despite this, only residues are perfectly conserved. the situation has been taken beyond the limit, at least ex vivo, in that the htlv-1 rex protein can functionally complement hiv-1 rev (rimsky et al., ), despite the fact that they are completely different proteins. the above is reminiscent of what is known about enzymes and surface recognition. provided the protein fold is maintained, only a small fraction of residues actually contribute to function, a point made recently in two reviews on rna viral proteases (ryan and flint, ; ryan et al., ). insertions and deletions are generally less than - residues in length and confined to turns, loops and coils (pascarella and argos, ). if globular proteins, or at least domains, are to a first approximation taken as spheres, then the surface area is the least for any volume. if amino acids are equally viewed as smaller, closely packed spheres, then a minimum number will be exposed on the surface, ready to partake in recognition and function. the molecular biologist frequently thinks like an engineer who can redesign from scratch. yet replicons have been constrained by a series of historical events representing variations on a founding theme. while they are fit enough to survive, are they the best possible?
this question is salutary, for we live in a society that is more and more competitive and, thanks to global communications, knows about the most successful athletes or businessmen worldwide. yet who can remember the name of any olympic athlete who came in fourth? is not fourth best in any large population remarkable? how good are viruses as machines? once again let us look at some examples from hiv-1. reverse transcription feeds on cytoplasmic dntps. yet supplementing the culture milieu with deoxycytidine, which is scavenged and phosphorylated to the triphosphate, substantially increased viral replication (meyerhans et al., ). it is known that good expression of a foreign protein is frequently compromised by inappropriate codon usage. by redesigning codon usage of the jellyfish (aequorea victoria) green fluorescent protein gene to correspond to that typical of mammalian genes, greatly improved expression was achieved in mammalian cells (haas et al., ). the same group engineered codon usage of the hiv-1 gp120 glycoprotein gene segment to correspond to that of the abundantly expressed human thy-1 surface antigen. again expression was greatly improved (haas et al., ). the coup de grâce came with the reciprocal experiment: engineering thy-1 gene codon usage to correspond to that of gp120. thy-1 surface expression was greatly reduced (haas et al., ). since hiv-1 was first sequenced, it has been known that its codon usage is highly biased (wain-hobson et al., ; bronson and anderson, ). something is clearly overriding maximal envelope expression. furthermore, gp120 codon usage is similar to that of all other hiv-1 genes, whether they be structural or regulatory. for that matter, codon usage is comparable for most lentiviruses (bronson and anderson, ). it was possible to show via dna vaccination that codon-engineered gp120 elicited stronger immune responses in mice than the normal counterpart (andre et al., ). might this finding suggest that the optimum is actually away from mass production?
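codon-usage engineering of the kind described above amounts to rewriting each codon as the preferred synonym of the target genome. a toy sketch, with a hand-made preference table covering only the residues in the example (not a real human codon-usage table):

```python
# assumed "preferred codon" table, illustrative only
PREFERRED = {"F": "TTC", "K": "AAG", "D": "GAC", "E": "GAG", "L": "CTG"}
# translation entries for the codons appearing in the toy sequence
TRANSLATE = {"TTT": "F", "TTC": "F", "AAA": "K", "AAG": "K",
             "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
             "TTA": "L", "CTG": "L"}

def recode(cds):
    """Replace every codon by the preferred synonym for its amino acid."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(PREFERRED[TRANSLATE[c]] for c in codons)

print(recode("TTTAAAGATTTA"))   # -> TTCAAGGACCTG
```

the protein sequence is untouched; only the synonymous channel is rewritten, which is why the experiments above could dissociate expression level from protein identity.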
yet if there is a shadow of reality in this thesis, it indicates that fitness optima in vivo may not necessarily parallel the expectations of fitness based on ex vivo models. in this context note also that htlv-1 infects exactly the same cell as hiv, yet its codon usage is very different from that of hiv and the thy-1 gene (seiki et al., ). if fitness optimization were ever operative in vivo, then one would predict steady increases in virulence for those viruses that do not set up herd immunity. at some point a plateau would be reached. yet the higgledy-piggledy way by which virulent strains come and go suggests that this is not so. some might use the word stochastic. whatever. if fitness selection can be overridden and we don't have a good theory for it, then we're in a sorry state. there is abundant evidence that, as a good first approximation, rna viruses ex vivo perform as expected from the quasispecies model (holland et al., ; eigen and biebricher, ; duarte et al., ; clarke et al., ; eigen, ; novella et al., ; domingo et al., ; quer et al., ; domingo and holland, ), which is fitness dominated. problems arise transposing it to the in vivo situation, notably: first and foremost: how does one determine fitness in vivo? should such measurements score intra-host viral titres or transmission probabilities from an index case? (if a virus doesn't spread it's dead.) for outbred populations, is it in fact virulence? second: host innate immunity is hugely powerful, a fact leading rolf zinkernagel to remark with typical aplomb that in terms of immunity "an interferon receptor knock-out mouse is a % mouse" (huang et al., ; van den broek et al., a,b). yet the enhanced susceptibility of scid humans or various knock-out mice to infections indicates the part played by adaptive immunity. for example, influenza a can persist in scid children (rocha et al., ). how are innate and adaptive immune responses coupled, and how are they influenced by genetic polymorphisms?
third: with acquired immunity rising by the day in an acute infection, the virus is replicating in the face of a predator whose amplitude is increasing. fourth: immune responses are density-dependent. that is, the more the virus replicates, the stronger the immune response. if the relationship were simply linear one could see how a virus might be able to keep just ahead, given a short lag in the immune response time. but if it were non-linear? indeed it must be so, otherwise it would not be possible to resolve an acute infection. it is not easy to discern where optimal viral fitness would lie. fifth: the wrath of combined immune responses is such that there is massive viral turnover. for the three best known cases, hiv, hbv and hcv, between and virions are turning over daily, representing between % and % of the whole (ho et al., ; wei et al., ; nowak et al., ; zeuzem et al., ). indeed, these are probably underestimates, given beautiful data from the late s and s showing that, for a variety of rna viruses, plasma titres decay with a half-life of - min, whether the animal be immunologically naive or primed (mims, ; nathanson and harrington, ). from this one may conclude that any viral population is unlikely to be in equilibrium. and if a population is not in equilibrium, fitness selection is compromised. sixth: a glance at any histology slide or textbook is a salient reminder of spatial discontinuities over distances of one or two cell diameters. for example, the hugely delocalized immune system is characterized by a multitude of different lymphoid organs, a myriad of subtly different susceptible cell types, and a mix of membrane molecules. the exquisite spatial heterogeneity of hiv within the epidermis and splenic white pulps has been described (cheynier et al., ; sala et al., ). the same seems to be true for hcv-infected liver (martell et al., ). for hpv infiltration of skin, spatial discontinuities and gradients are also apparent.
discontinuities reduce the possibilities for competition and hence selection of the fitter forms. indeed, the muller's ratchet experiment and clonal heterogeneity are the most vivid expressions of this. seventh: much has been made of privileged sites and viral reservoirs. basically this is reminding us of the fact that immune surveillance is modulated in some organs like the brain. there are some suggestions that cytotoxic t-cells have difficulty infiltrating the kidney. viral reservoirs undermine fitness selection. eighth: in the case of the immunodeficiency viruses, antigenic stimulation of infected yet resting memory t-cells means that variants may become amplified for reasons that have nothing to do with the fitness of the variant (cheynier et al., ). mayr again: "wherever one looks in nature, one finds uniqueness" (mayr, ). as mentioned, the cardinal difference between the behaviour of rna viruses ex vivo and in vivo is the existence of spatial discontinuities. for replicons, cloning is the ultimate separation. it allows a variant to break away from dominating competitors, and disrupts or uncouples a fitter variant locked in competitive exclusion (de la torre and holland, ). the effects of bottlenecking on fitness, as well as the muller's ratchet experiments, have been described (chao, ; duarte et al., ; novella et al., ; escarmis et al., ). transmission frequently involves massive bottlenecking, and is very much an exercise in cloning. all this should not surprise, because allopatric speciation is omnipresent in the origin of species, darwin's galapagos finches being an obvious example. given the non-equilibrium structure of viral variants, and vastly restricted population sizes with respect to sequence space, founder effects in vivo take on great importance. while answers for some of these issues seem far away, constraints on fitness selection cannot be so strong that a chain of infections becomes a muller's ratchet experiment. yet is that correct?
in the experiments with phage φ6, vsv and fmdv, most of the lineages resulted in decreased fitness. yet for some there were no changes, while for a few there were even increases in the fitness vectors (chao, ; duarte et al., ; escarmis et al., ). could symptomatic infections reflect bottleneck transmission of those fitter clones, with asymptomatic (subclinical) infection representing fitness-compromised clones? analysis of rna viruses ex vivo is analogous to the study of bacteria in chemostats. fitness selection dominates. yet there is a world of difference between bacterial strains so selected and natural isolates. one of the observations frequently made upon isolation of pathogenic bacteria is the loss of bacterial virulence determinants (miller et al., ). indeed, ex-vivo passage of rna viruses has been used to select for attenuated strains used in vaccination. a virus must replicate sufficiently within a host to permit infection of another susceptible host. if the new host is of the same species, differences between the two are minimal, a small degree of polymorphism being inevitable in outbred populations. given that viruses with a small coding capacity interact particularly intimately with the host-cell machinery, it follows that infection of a host from a related species has a greater probability of succeeding if the cellular machinery is comparable. indeed, the closer the two species, the greater the probability. in turn, if the virus gets a toehold and can generate a quasispecies, then only a few mutations would probably be necessary to adapt to the new niche. yet species is a difficult word. what might a viral species be? martin ( ) wrote a fascinating review on the number of extinct primate species estimated from the fossil record. depending on the emergence time of primates of modern aspect, he was able to estimate the total number that existed as - . the present number of primate species would thus represent about . - . %.
more importantly from our viewpoint was his calculation of the average survival time of fossil primate species as a mere million years (martin, ). given that rna viruses are fixing mutations approximately a million times faster than mammals (holland et al., ; gojobori and yokoyama, ), a viral species would become extinct after approximately a year! immediately the annual influenza a strain comes to mind. yet rabies, polio and htlv- have arguably been around for millennia. clearly the word "species", when taken from primatology, cannot apply to the viral world. frogs provide a more interesting example. they have been around for several hundred million years, and members of some lineages can interbreed despite million years of separation. naturally, their protein sequences have not stood still during that time (wilson et al., ). enough is conserved to allow breeding. maybe the primate picture has undue weight in our appreciation of virology. phenotype can be maintained despite changes in genotype, which is obvious to a biologist. as usual, holland wasn't far from the mark when he wrote: as human populations continue to grow exponentially, the number of ecological niches for human rna virus evolution grows apace and new human virus outbreaks will likely increase apace. most new human viruses will be unremarkable, that is, they will generally resemble old ones. inevitably, some will be quite remarkable, and quite undesirable. when discussing rna virus evolution, to call an outbreak (such as aids) remarkable is merely to state that it is of lower probability than an unremarkable outbreak. new viruses can and do emerge, but on a scale that is probably - logs less than the number of viral mutants generated up to that defining moment (wain-hobson, ). they will result from a small number of mutations and a dose of reproductive isolation.
the above has attempted to show that the vast majority of genetic changes fixed by rna viruses are essentially neutral or nearly neutral in character. positive selection exploits a small proportion of genetic variants, while functional sequence space is sufficiently dense, allowing viable solutions to be found. although evolution has connotations of change, what has always counted is natural selection or adaptation. it is the only force for the genesis of a novel replicon. once adapted to its niche, there is no need to change. in such circumstances an rna virus would no longer be adapting, even though it could be changing. why is the evolution of rna viruses so conservative? why do they mutate rapidly yet remain phenotypically stable? the lack of proofreading proscribes the genesis of large genomes, restricting their genome sizes to a log range (figure . ). among the smallest rna and retroviruses are ms and hepatitis b virus, both about kb, while the largest are the coronaviruses at kb or more. most of their proteins are structural or regulatory and take up the largest part of the coding capacity of the virus.

[figure . latitude in microbial genome sizes. rna viruses and retroviruses are confined to one log variation in size ( to ~ kb). by contrast, dna viruses span more than . logs, going from the single-stranded porcine circovirus ( . kb) to chlorella virus (~ kb, encoding at least dna endonuclease/methyltransferase genes; zhang et al., ) and bacteriophage g (~ kb). the distinction between phage dna and a plasmid has often proven difficult (waldor and mekalanos, ). as can be seen, the genome size of the largest dna viruses overlaps that of the smallest intracellular bacteria such as mycoplasmas ( and kb) and is not too far short of autonomous bacteria such as haemophilus influenzae ( . mb).]
additional proteins broadening the range of interactions with the host cell, or rendering the replicon more autonomous, are relatively few. large, gene-sized duplications that may contribute to diversification and novel phenotypes are rare, reducing the exploration of new horizons. thus, evolution of rna viruses is probably conservative because they cannot shuffle domains and so generate new combinations. that the information capacity of rna viral genomes is limited by a lack of proofreading is neither here nor there, for they are remarkably successful parasites. rna viruses change far more than they adapt.

references (titles only, as recovered):
muller's ratchet decreases fitness of a dna-based microbe
increased immune response elicited by dna vaccination with a synthetic gp sequence with optimized codon usage
the phylogeny of the canterbury tales
isolation of new ribozymes from a large pool of random sequences
forced evolution of a regulatory rna helix in the hiv- genome
role of the first and third extracellular domains of cxcr- in human immunodeficiency virus coreceptor activity
molecular mechanisms of immune responses in insects
nucleotide composition as a driving force in the evolution of retroviruses
unusually high frequency of epstein-barr virus genetic variants in papua new guinea that can escape cytotoxic t-cell recognition: implications for virus evolution
role of host immune response in selection of equine infectious anemia virus variants
fitness of rna virus decreased by muller's ratchet
evolution of sex and the molecular clock in rna viruses
hiv and t-cell expansion in splenic white pulps is accompanied by infiltration of hiv-specific cytotoxic t-lymphocytes
antigenic stimulation by bcg as an in vivo driving force for siv replication and dissemination
genetic bottlenecks and population passages cause profound fitness differences in rna viruses
nucleotide sequences of three nodavirus rna 's: the messengers for their coat protein precursors
primary and secondary structure of black beetle virus rna , the genomic messenger for bbv coat protein precursor
hla-a epitope loss isolates of epstein-barr virus from a highly a + population
t cell responses and virus evolution: loss of hla a -restricted ctl epitopes in epstein-barr virus isolates from highly a -positive populations by selective mutation of anchor residues
rna virus quasispecies populations can suppress vastly superior mutant progeny
the genome sequence of herpes simplex virus type
rna viral mutations and fitness for survival
basic concepts in rna virus evolution
origins and evolutionary relationships of retroviruses
rates of spontaneous mutations among rna viruses
rapid fitness losses in mammalian rna virus clones due to muller's ratchet
high viral load and cd lymphopenia in rhesus and cynomolgus macaques infected by a chimeric primate lentivirus constructed using the env, rev, tat, and vpu genes from hiv- lai
the viral quasispecies
sequence space and quasispecies distribution
structurally complex and highly active rna ligases derived from random rna sequences
does the vp gene of foot-and-mouth disease virus behave as a molecular clock?
large-scale search for genes on which positive selection may operate
genetic lesions associated with muller's ratchet in an rna virus
multiple molecular pathways for fitness recovery of an rna virus debilitated by operation of muller's ratchet
determining divergence times with a protein clock: update and reevaluation
human infection by genetically diverse siv-sm related hiv- in west africa
the heterosexual human immunodeficiency virus type epidemic in thailand is caused by an intersubtype (a/e) recombinant of african origin
a comprehensive panel of near-full-length clones and reference sequences for non-subtype b isolates of human immunodeficiency virus type
rates of evolution of the retroviral oncogene of moloney murine sarcoma virus and of its cellular homologues
molecular evolutionary rates of oncogenes
molecular clock of viral evolution, and the neutral theory
codon usage limitation in the expression of hiv- envelope glycoprotein
evolution of influenza virus genes
performance evaluation of amino acid substitution matrices
rapid turnover of plasma virions and cd lymphocytes in hiv- infection
rapid evolution of rna genomes
rna virus populations as quasispecies
immune response in mice that lack the interferon-gamma receptor
genetic organization of a chimpanzee lentivirus related to hiv-
protein phylogenies provide evidence of a radical discontinuity between arthropod and vertebrate immune systems
pattern of nucleotide substitution at major histocompatibility complex class i loci reveals overdominant selection
identification of the envelope v loop as the primary determinant of cell tropism in hiv-
statistical analysis of nucleotide sequences of the hemagglutinin gene of human influenza a viruses
mosaic genome structure of simian immunodeficiency virus from west african green monkeys
the role of cytotoxic t-lymphocytes in the evolution of genetically stable viruses
evolution of a disrupted tar rna hairpin structure in the hiv- virus
the receptor binding site of human interleukin- defined by mutagenesis and molecular modeling
directed evolution of enzyme catalysts
analysis of hiv- env gene sequences reveals evidence for a low effective number in the viral population
tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type populations with a known transmission history
molecular phylogeny and evolutionary timescale for the family of mammalian herpesviruses
escape of human immunodeficiency virus from immune control
hepatitis c virus (hcv) circulates as a population of different but closely related genomes: quasispecies nature of hcv genome distribution
primate origins: plugging the gaps
exploring the functional robustness of an enzyme by in vitro evolution
independent fluctuation of human immunodeficiency virus type rev and gp quasispecies in vivo
natural selection and the concept of a protein space
this is biology
temporal fluctuations in hiv quasispecies in vivo are not reflected by sequential hiv isolations
in vivo persistence of a hiv- -encoded hla-b -restricted cytotoxic t-lymphocyte epitope despite specific in vitro reactivity
restriction and enhancement of human immunodeficiency virus type replication by modulation of intracellular deoxynucleoside triphosphate pools
lifespan of human lymphocyte subsets defined by cd isoforms
coordinate regulation and sensory transduction in the control of bacterial virulence
the response of mice to large intravenous injections of ectromelia virus. i. the fate of injected virus
experimental infection of monkeys with langat virus
sequence constraints and recognition by ctl of an hla-b -restricted hiv- gag epitope
size of genetic bottlenecks leading to virus fitness loss is determined by mean initial population fitness
how hiv defeats the immune system
viral dynamics in hepatitis b virus infection
the meaning of near-neutrality at coding and non-coding regions
saturation mutagenesis of human interleukin-
leeway and constraints in the forced evolution of a regulatory rna helix
analysis of insertions/deletions in protein structures
the tempo and mode of siv quasispecies development in vivo calls for massive viral replication and clearance
identification of a chemokine receptor encoded by human cytomegalovirus as a cofactor for hiv- entry
genetic drift can dominate short-term human immunodeficiency virus type nef quasispecies evolution in vivo
viral strategies of immune evasion
positive selection of hiv- cytotoxic t lymphocyte escape variants during primary infection
antigen-specific release of beta-chemokines by anti-hiv- cytotoxic t lymphocytes
engineering cyclophilin into a proline-specific endopeptidase
reproducible nonlinear population dynamics and critical points during replicative competitions of rna virus quasispecies
nucleotide sequence analysis of sa-omvv, a visna-related ovine lentivirus: phylogenetic history of lentiviruses
systematic mutation of bacteriophage t lysozyme
trans-dominant inactivation of htlv-i and hiv- gene expression by mutation of the htlv-i rex transactivator
antigenic and genetic variation in influenza a (h n ) virus isolates recovered from a persistently infected immunodeficient child
virus-encoded proteinases of the picornavirus super-group
virus-encoded proteinases of the flaviviridae
spatial discontinuities in human immunodeficiency virus type quasispecies derived from epidermal langerhans cells of a patient with aids and evidence for double infection
genetic evolution and tropism of transmissible gastroenteritis coronaviruses
how to search for rna structures. theoretical concepts in evolutionary biotechnology
rna structures and folding. from conventional to new issues in structure predictions
natural selection on the gag, pol, and env genes of human immunodeficiency virus (hiv- )
human adult t-cell leukemia virus: complete nucleotide sequence of the provirus genome integrated in leukemia cell dna
rates of amino acid change in the envelope protein correlate with pathogenicity of primate lentiviruses
nucleotide sequence of the visna lentivirus: relationship to the aids virus
antiviral defense in mice lacking both alpha/beta and gamma interferon receptors
immune defence in mice lacking type i and/or type ii interferon receptors
fixation of mutations at the vp gene of foot-and-mouth disease virus. can quasispecies define a transient molecular clock?
the fastest genome evolution ever described: hiv variation in situ
viral burden in aids
running the gamut of retroviral variation
nucleotide sequence of the aids virus
lysogenic conversion by a filamentous phage encoding cholera toxin
viral dynamics in human immunodeficiency virus type i infection
in vitro mutagenesis identifies a region within the envelope gene of the human immunodeficiency virus that is critical for infectivity
biochemical evolution
adaptive evolution of human immunodeficiency virus-type during the natural course of infection
molecular evolution of the hepatitis b virus genome
quantification of the initial decline of serum hepatitis c virus rna and response to interferon alfa
chlorella virus ny- a encodes at least dna endonuclease/methyltransferase genes

we would like to thank past and present members of the laboratory and numerous colleagues for endless discussions over the years. mark mascolini needs a special word of thanks for painstakingly going through the manuscript.
this laboratory is supported by grants from the institut pasteur and the agence nationale pour la recherche sur le sida.

key: cord- - x yubt
title: analyzing hcov genome sequences: applying machine intelligence and beyond
authors: sawmya, shashata; saha, arpita; tasnim, sadia; anjum, naser; toufikuzzaman, md.; rafid, ali haisam muhammad; rahman, mohammad saifur; rahman, m. sohel
date: - -
journal: biorxiv
doi: . / . . .
sha:
doc_id:
cord_uid: x yubt

abstract: the covid- pandemic, caused by the sars-cov- strain of coronavirus, has affected millions of people all over the world and taken thousands of lives. it is of utmost importance that the character of this deadly virus be studied and its nature be analysed. we present here an analysis pipeline comprising a phylogenetic analysis on strains of this novel virus to track its evolutionary history among the countries, uncovering several interesting relationships; followed by a classification exercise to identify the virulence of the strains; and extraction of important features from its genetic material that are subsequently used to predict mutation at those interesting sites using deep learning techniques. in a nutshell, we have prepared an analysis pipeline for hcov genome sequences leveraging the power of machine intelligence and uncovered what remained apparently shrouded by raw data.

covid- was declared a global health pandemic on march , [ ]. it is the biggest public health concern of this century [ ]. it has already surpassed the previous two outbreaks due to coronaviruses, namely, severe acute respiratory syndrome coronavirus (sars-cov) and middle east respiratory syndrome coronavirus (mers-cov). the virus behind this epidemic is known as severe acute respiratory syndrome coronavirus or, in short, the sars-cov- virus. it is a single-stranded rna virus which is , to , bases long on average [ ]. the novel coronavirus is spherical in shape and has spike proteins protruding from its surface.
these spikes attach to human cells, then undergo a structural change that allows the viral membrane to fuse with the cell membrane. the viral genome then enters the host cell and copies itself, producing multiple new viruses [ ]. as of mid-april, , about , high-quality complete genome sequences were present in the gisaid initiative database [ ], collected from clinicians and researchers around the world. to understand the viral evolution and its nature of spread among different countries, we present an analysis pipeline for the genome sequences, leveraging the power of machine intelligence. this paper makes the following key contributions. a. an alignment-free phylogenetic analysis is carried out with a goal to uncover the evolutionary history of sars-cov- . the resulting phylogenetic tree is able to highlight evolutionary relationships that can be explained by facts and figures, and has further identified some mysterious relationships. b. several machine learning and deep learning models are used to identify the virulence of the strains (i.e., to classify a virus strain as either severe or mild). additionally, from the classification pipeline, important features are identified as sites of interest (sois) in the virus strains for further analysis. c. several cnn-rnn based models are used to predict mutations at specific sites of interest (sois) of the sars-cov- genome sequence, followed by further analyses of the same on several south-asian countries. d. overall, we present an analysis pipeline that can be further utilized as well as extended and revised (a) to study where a newly discovered genome sequence lies in relation to its predecessors in different regions of the world; (b) to analyse its virulence with respect to the number of deaths its predecessors have caused in their respective countries; and (c) to analyse the mutation at specific important sites of the viral genome.
figure : the whole analysis pipeline consists of three phases. in the first phase, the genome sequences are divided into subsets based on country, and a phylogenetic tree is constructed considering only the "representative" sequences of each such subset using an alignment-free sequence comparison approach. in the second phase, we employ state-of-the-art classification algorithms, leveraging both traditional and deep learning pipelines, to learn to discriminate the viral strains of many countries as either mild or severe. we also identify the features that contributed the most as discriminant factors in the classification pipeline. finally, we use the identified features from the previous stage to predict the mutation of the interesting sites in the viral strain using a deep learning model. figure presents our overall analysis pipeline. below we present the details of the pipeline. we have collected hcov genome sequences up to april, (cut-off date) from the gisaid initiative dataset [ ]. these are high-quality complete viral genome sequences submitted by the scientists and scientific institutes of individual countries. we have also collected country-wise death statistics (up to the cut-off date) from the official site of the who [ ]. the label was assigned based on a threshold of deaths, which is the estimated median of the number of deaths over the data points. any genome sequence of a country having deaths below (above) the threshold was considered a mild (severe) strain, i.e., assigned a label ( ). a sample labelling is shown in supplementary table . informatively, we have also considered some other metrics for labelling purposes, albeit with unsatisfactory results (please see the supplementary file for details).
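the mild/severe labelling described above reduces to a simple threshold rule. a minimal sketch follows; the threshold value, country names and death counts here are placeholders for illustration, not the paper's actual who figures:

```python
# Minimal sketch of the mild/severe labelling step.
# DEATH_THRESHOLD stands in for the paper's median-based cut-off;
# the country names and counts below are illustrative only.
DEATH_THRESHOLD = 500

def label_strain(country_deaths, country):
    """Return 1 (severe) if the strain's country is at or above the
    death threshold, else 0 (mild)."""
    return 1 if country_deaths[country] >= DEATH_THRESHOLD else 0

deaths = {"countryA": 120, "countryB": 4300}
labels = {c: label_strain(deaths, c) for c in deaths}
```

every strain sequenced in a country then inherits that country's label, which is what makes the labelling step independent of the sequence content itself.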
we divided the whole dataset into training and testing subsets in a / ratio with a balanced number of data points per class for the traditional machine learning pipeline; for the deep learning classification routine, we created training/validation/testing subsets in a / / ratio. figure : the viral genome sequences are divided into subsets of sequences based on country. for each subset, each viral genome sequence is converted into a vector representation, and pairwise euclidean distances are calculated among the vectors to create the distance matrix. as the matrix is very high-dimensional, we use principal component analysis to find the principal component matrix from the distance matrix. representative sequences are identified through k-means clustering on the pca matrix, and a phylogenetic tree is constructed from the representative sequence of each country. we aim to identify and interpret the evolutionary relationships among the hcov genome sequences uploaded at gisaid from different regions around the globe (figure ). to do that, we have used an alignment-free genome sequence comparison method as proposed in [ ], briefly described below. notably, we do not consider any alignment-based method, since it is not computationally feasible for us to align thousands of viral sequences for analysis and clustering purposes [ ]. at first, the sequence set is divided into subsets of sequences based on location. all sequences are converted into a representative ℝ vector. pairwise distances among the vectors derived from the fast vector method [ ] are computed using euclidean distance. due to the high dimensionality of the resulting distance matrix, we resort to the principal component analysis (pca) technique [ ] to reduce the dimension of the matrix. subsequently, we use k-means clustering [ ] to identify the corresponding cluster centers.
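the first two steps above (sequence-to-vector mapping and pairwise euclidean distances) can be sketched with a plain k-mer count vector; note this is a simplified stand-in for the fast vector method of [ ], whose exact representation is described in that reference:

```python
from itertools import product
import math

def kmer_vector(seq, k=3):
    """Map a genome sequence to a fixed-length k-mer count vector
    (a simplified stand-in for the Fast Vector representation)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in index:          # skip windows with ambiguous bases such as 'N'
            vec[index[km]] += 1
    return vec

def euclidean(u, v):
    """Pairwise distance used to build the distance matrix."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

v1 = kmer_vector("ACGTACGT")
v2 = kmer_vector("ACGTACGA")
d = euclidean(v1, v2)
```

stacking one such vector per sequence and computing all pairwise distances yields the (high-dimensional) distance matrix that pca and k-means then operate on.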
for the k-means clustering algorithm, we have used the implementation of [ ] with the default parameters, except for the number of clusters, which was set to for determining the cluster center of each subset. for each location-based cluster, the representative sequence (i.e., the "centroid" of the cluster) is then identified and used in the subsequent step of the pipeline. the evolutionary relationship among the representative sequences of the different clusters (from section . ) has been estimated by constructing a phylogenetic tree. we have used the neighbor joining algorithm [ ] for phylogenetic tree construction, since it is more reliable [ ]. we have used the euclidean distances among the vectors, as described in section . , to prepare the distance matrix. while we have predominantly used the alignment-free method of [ ], at this stage we have only representative sequences, and hence we have also attempted a few other alignment-free and alignment-based methods to estimate the phylogenetic tree; however, these did not produce satisfactory results (more details are in the supplementary file). for traditional machine learning, we use a pipeline similar to [ ] (see figure in the supplementary file). we extracted three types of features from the genomic sequence of the novel sars-cov- . inspired by recent works [ ] [ ] [ ] [ ] that focus only on sequences, we also extract only sequence-based features. these features are: position-independent features, n-gapped dinucleotides and position-specific features (see details in section of the supplementary file). we use the gini value of the extremely randomized tree (extra tree) classifier [ ] to rank the features. subsequently, only the features with gini value greater than the mean of the gini values are selected for training a lightgbm classifier model [ ] (with default parameters), and we performed -fold cross validation. lightgbm is a highly efficient and fast gradient boosting framework which uses tree-based algorithms.
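of the three feature families named above, the n-gapped dinucleotides are the easiest to sketch: a pair of bases separated by a fixed gap, counted over the sequence. the function below is an illustrative sketch (the feature naming and normalisation are assumptions; the paper's exact definitions are in its supplementary file):

```python
def gapped_dinucleotide_features(seq, gap):
    """Count n-gapped dinucleotides: pairs of bases separated by `gap`
    positions, normalised by the number of windows. A sketch of one
    feature family used in the traditional ML pipeline."""
    counts = {a + b: 0 for a in "ACGT" for b in "ACGT"}
    windows = len(seq) - gap - 1
    for i in range(windows):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:   # skip pairs involving ambiguous bases
            counts[pair] += 1
    # Hypothetical feature names of the form "<pair>_gap<gap>"
    return {f"{k}_gap{gap}": c / windows for k, c in counts.items()}

feats = gapped_dinucleotide_features("ACGTACGT", 1)
```

each gap value contributes 16 features; position-independent mono/dinucleotide frequencies and per-position (position-specific) indicators are built analogously before the extra-tree gini ranking prunes them.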
we use shap values and univariate feature selection to compare the importance of the features. shap (shapley additive explanations) is a game-theoretic approach which is used to explain the output of a model [ ]. univariate feature selection works by selecting the best features based on univariate statistical tests [ ]. we use selectkbest univariate feature selection to get the top k highest-scoring features according to the anova f_classif scoring function [ ]. we leverage the power of different deep learning (dl) classification models, namely, vanilla cnn [ ], alexnet [ ] and inceptionnet [ ]. we transform the raw viral genome sequences into two different representations, namely, k-mers spectral representation [ ] and one-hot vectorization [ ], to feed them into the dl networks in a seamless manner. details of these representations are given in section . of the supplementary file. for the k-mers spectral representation we experimented with different values of k (k = , , for vanilla cnn and k = & only for the rest due to resource limitations). for one-hot vectorization, we have trained inceptionnet for epochs for both -and -mers and trained alexnet for , and epochs for -, -and -mers respectively. we design a pipeline to predict mutation at specific sites (chosen at an earlier stage of the pipeline) in the sars-cov- genome (figure ). we follow a protocol similar to that of [ ] and adapt it to fit our setting as follows. we divide all the available countries and the states of the usa into different time-steps by the date of the first reported incidence of sars-cov- -infected patients at that location. thus, every resulting time-step represents a date (tk for cluster k) and contains the clusters of genome sequences of the corresponding countries/states. then the time series samples are generated by concatenating sites from different time-steps one by one, representing the evolutionary path of the sars-cov- viral strain.
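the one-hot vectorization used to feed sequences into the cnns can be sketched as follows. the a/c/g/t channel ordering and the all-zero handling of ambiguous bases are assumptions for illustration; the cited references give the exact scheme:

```python
def one_hot(seq):
    """One-hot encode a genome sequence into an L x 4 matrix
    (channels A, C, G, T); ambiguous bases such as 'N' map to an
    all-zero row."""
    table = {"A": 0, "C": 1, "G": 2, "T": 3}
    mat = [[0] * 4 for _ in seq]
    for i, base in enumerate(seq):
        j = table.get(base)
        if j is not None:
            mat[i][j] = 1
    return mat

m = one_hot("ACGN")
```

the k-mers spectral representation is the complementary option: instead of an L x 4 matrix, each sequence becomes a 4^k-length count vector (as in the k-mer sketch earlier in the pipeline), trading positional detail for a fixed, compact input size.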
for example, t is the very first date when the virus was discovered in china. so, that time-step contains only one country, china. likewise, time-step t contains clusters for those countries where the virus was discovered on date t , and so on (see table in the supplementary file for more details). we generate time series sequences by concatenating genome sites from t ,t ,....,tn (in our case, n = ) and then feed the samples to the model, which consists of a one-dimensional convolutional layer and a recurrent neural network layer [ ]. we experiment with both pure lstm and bidirectional lstm as our rnn layer (see section . of the supplementary file). the model has a dense layer of neurons at the end which predicts the probability of the next base pair of the next time-step. so, in a nutshell, the model takes concatenated genome sequences from t ,t ,....,tn- as input and predicts the mutation for time tn. we further use our mutation prediction pipeline to identify and analyze possible parents of a mutated strain. for this particular analysis, we trained the models specifically for some south-asian countries, namely, bangladesh, india and pakistan. we only used the best-performing model for this analysis and generated five time series samples. at the time of generating these samples, the country/location having the minimal euclidean distance was taken for each time-step. we have implemented our experiments mostly in python. we have used the scikit-learn library [ ] for clustering and plotting the graphs. for the deep learning models, the scikit-learn, tensorflow and keras neural network libraries are used, and for the lightgbm classifier, the python lightgbm framework has been used. the phylogenetic trees are constructed using the dendropy library of python [ ] with default parameters. we use the tree visualization tools dendroscope [ ] and evolview [ ] for tree visualization and annotation.
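at the level of a single site of interest, the time-series sample construction described above can be sketched as follows. the toy sequences and the site index are illustrative; the real pipeline concatenates whole sois across the country/state clusters of each time-step:

```python
# Sketch of building one mutation-prediction sample: the base observed
# at a chosen site of interest (SOI) across time-steps t1..t(n-1) forms
# the input, and the base at tn is the prediction target.
def build_sample(sequences_by_timestep, soi):
    """sequences_by_timestep: one genome string per time-step, ordered
    t1..tn. Returns (input_history, target_base)."""
    history = [seq[soi] for seq in sequences_by_timestep[:-1]]
    target = sequences_by_timestep[-1][soi]
    return "".join(history), target

# Toy strains over 4 time-steps; a G->A mutation appears at site 2.
seqs = ["ACGT", "ACGT", "ACAT", "ACAT"]
x, y = build_sample(seqs, soi=2)
```

the history string is then one-hot encoded and fed through the conv1d + lstm stack, whose final dense layer outputs a probability over the four bases for the target time-step.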
the experiments were conducted on the following machines: a) clustering and phylogenetic analyses were carried out on a machine with an intel(r) core(tm) i - u cpu @ . ghz, ubuntu . os and gb ram; b) experiments involving the deep learning pipelines (i.e., both classification and mutation prediction) were conducted on the work-stations of the galileo cloud computing platform [ ] and the default gpu provided by the google colaboratory cloud computing platform [ ] ; c) the lightgbm classifier model was trained on a machine with an intel core i - u cpu @ . ghz x , windows os and gb ram. all the code and data (except for the genome sequences) of our pipeline can be found at the following link: https://github.com/pythonloader/analyzing-hcov-genome-sequence. the genome sequence data have been extracted from and are publicly available at gisaid [ ] .

we identify the representative sequence of each of the countries present in the gisaid dataset (up to the cut-off date). the phylogenetic tree estimated from the representative sequences is shown in figure . in what follows, we will refer to this tree as the sc (sars-cov- ) tree. the estimated phylogenetic tree is expected to reveal the evolutionary relationships of the viral strains. however, careful scrutiny yields some apparently unusual but interesting observations. for example, it is generally expected that countries sharing (open) borders (e.g., countries in europe) should be either neighbours or at least in the same clade in the tree. surprisingly, however, we do not see geographically adjacent countries in europe as neighbors in the tree; rather we see, for example, that china and italy are immediate neighbors. notably, these two countries were also the first to be hit by the first pandemic wave.
in addition, although the usa and canada share the longest un-militarized international border in the world, their representative strains do not appear as sister branches, as they should have. we also notice that the usa, uk, canada, turkey and russia are in the same clade; these countries have a higher number of deaths than most of the other countries.

all our classifiers are trained to learn whether a given strain is mild or severe. the classification accuracy of the lightgbm classifier (~ %) is superior to that of the deep learning classifiers (~ - %), which, while somewhat surprising, is in line with the recent findings of [ ] . it should be noted that lightgbm produced better results in significantly less time than the deep learning models for this dataset. the results of the classifier models are shown in figure . quantitative results aside, we have also applied our classifiers to the sequences deposited at gisaid after the cut-off date (i.e., april , ). since the cut-off date, the country-wise death statistics [ ] have changed significantly, and this has pushed a few countries, particularly from asian regions, and several states of the united states of america to transition from the mild to the severe state (based on our predefined threshold). interestingly, our classifiers were able to correctly predict the severity of the new strains submitted from these countries/states. table in the supplementary file shows a snapshot of a few such countries/states with the relevant information.

we first identify the top features from shap and selectkbest feature selection (with k= ). from these features, we have selected as sois the features that are also biologically significant, i.e., cover different significant gene expression regions (figure ). in particular, we have selected the position-specific features pos_ _ , pos_ _ , pos_ _ and pos_ _ as the sois for the mutation prediction analyses down the pipeline.
here, pos_x_y denotes the site from position x to position y of the virus strains. the reasons for selecting these features as sois are outlined below. according to gene expression studies [ ] [ ] , two of our sois, namely pos_ _ and pos_ _ , encode two non-structural proteins, nsp and nsp , respectively, and our other two sois, namely pos_ _ and pos_ _ , correspond to the spike protein of sars-cov- . nsp binds to viral rna, the nucleocapsid protein, as well as other viral proteins, and participates in polyprotein processing; it is an essential component of the replication/transcription complex [ ] . a mutation in this protein is therefore expected to affect the replication process of sars-cov- in host bodies. the spike protein, on the other hand, sticks out from the envelope of the virion and plays a pivotal role in receptor host selectivity and cellular attachment. according to wan et al., there exists strong scientific evidence that the sars and sars-cov- spike proteins interact with angiotensin-converting enzyme (ace ) [ ] . a mutation in this protein is expected to have a significant impact on human-to-human transmission [ ] . therefore, it is certainly interesting and useful to predict mutations at such sois.

cnn-lstm and cnn-bidirectional lstm performed similarly for the different sois of the genome, registering . % and % accuracy, respectively, considering all sois together. for detailed results please check table and table of the supplementary material. for the model involving only bangladesh, we applied the cnn-bidirectional lstm model (the better performer of the two) and achieved almost % accuracy. we then analyzed the ancestors in the time series test samples and noticed that some of the states of the usa are present in these samples: california, massachusetts, texas, new jersey and maryland.
for india and pakistan, we obtained similar results for some sites, but for other sites the accuracy was not as high as for bangladesh (check table of the supplementary file for details). our analyses reveal a very close (evolutionary) relationship between the genome sequences of china and italy. similarity was also found among the virus strains of the usa, germany, qatar and poland. these countries have similar numbers of deaths and, although not geographically adjacent (except for germany and poland), they have strong air connectivity among them. in fact, a number of interesting relationships can be inferred from the estimated phylogenetic tree, as follows.

italy's first confirmed cases were chinese tourists [ ] . this relationship is clearly portrayed in the sc tree, where the two strains appear as immediate siblings. poland's strain is in the same clade as that of germany, which can be explained by the fact that its strain (through poland's patient zero) came from germany [ ] . taiwan is geographically very close to china; the virus was confirmed to have spread to taiwan on january , , through a -year-old woman who had been teaching in wuhan, china [ ] . the virus strains from these regions are also close together in the sc tree, about branches apart. a similar relationship can be inferred from the tree between china and south korea: the strain of the virus in south korea is believed to have been transmitted from china, firstly through a -year old chinese woman and secondly by a -year old south korean national [ ] . interestingly, from the sc tree it can also be deduced that the south korean strain is very close to that of taiwan and also near to the strain from china. the incident of a taiwanese woman being deported from south korea after refusing to stay at a quarantine facility is a probable explanation of how the south korean strain might have found its path to taiwan [ ] .
on march , , the virus was confirmed to have reached portugal, when it was reported that a portuguese -year-old man working in spain had tested positive for covid- after returning home [ ] . subsequently, within a span of days, more cases were reported, all originating from spain [ ] [ ] . the fact that the first cases of covid- in portugal originated from spain is clearly captured in our sc tree. the sc tree also suggests that india's strain is closely related to those from china and italy (around branches) and that it is also connected to that from saudi arabia.

turkey's first identified case was a man who had been travelling in europe [ ] . turkey also announced a huge number of cases and subsequent deaths originating from europe [ ] . in our inferred relationships, we can see that the turkish representative strain is quite close to several european countries like russia, iceland and ireland, which can be backed up by the two facts stated above. the sc tree also shows that the strain of germany is very close to the strains of both poland and the usa. it might be the case that community transmission occurred concurrently in both the usa and poland from germany, which hit its pandemic peak before both the usa and poland [ ] .

qatar has the second highest number of covid- patients in the middle east [ ] . the first case in qatar, reported on february , , was a man working in iran [ ] . qatar introduced a travel ban to and from germany and the usa as a precautionary measure in mid-march, quite a while after the first occurrence. qatar has air routes with germany and the usa, with more than airlines operating on those routes [ ] [ ] . though the first case originated from iran, it might be the case that subsequent patients were found to have been travelling from the aforementioned countries, as a result of which the travel ban was introduced.
our estimated sc tree places qatar very close to both the usa and germany. while we can certainly explain many of the relationships identified by the estimated sc tree above, some relationships are not that apparent. one such example is the direct relationship between vietnam and greece. while apparently there exists no direct relationship, on further investigation we identified something interesting. patient zero of greece is believed to have been contaminated during her trip to the milan fashion week, which took place during february - , [ ] . interestingly, the first covid- patient in hanoi [ ] left hanoi on february to visit family members living in london, england, and three days later she traveled from london to milan. could she have been in contact with patient zero of greece, or any other person who had been contaminated by the latter, before returning to london on february ? we cannot be certain, but our inferred relationship between vietnam and greece certainly lends legitimacy to that question.

finally, we are unable to find any apparent explanation in the reported news sources for a few other strong relationships inferred by the tree (e.g., congo-iran, panama-malaysia, sweden-singapore, japan-australia, etc.). this could be because of the inherent inaccuracies of the distance matrices as well as the limitations of the tree estimation algorithms: none of these algorithms is % accurate. from another angle, perhaps the tree did identify these relationships correctly, but the relevant incidents were not accurately identified or not documented.

in recent times, the number of deaths has been increasing rapidly in india. we have been closely following the change in the virus strains of india before and after the cut-off date. a genome sequence (epi_isl_ ) was collected on april , (before our cut-off date) from a patient in ahmedabad, gujarat, india.
it was predicted to be a severe strain (with low confidence), even though at that time we had trained the classifier to consider the indian sequences as mild. according to our inferred evolutionary relationships, india is very close to both italy and china, so we calculated the distance between this strain and the representative sequences of both italy and china. we then considered another strain (epi_isl_ ), collected from another patient from the same place in india on april , (after our cut-off date), and predicted its severity. the classifiers declared this isolate to be severe with very high confidence (about %). we repeated the distance calculation as before. interestingly, this isolate turned out to be closer to the representative sequences of both italy and china than the previous, less severe one. this strongly suggests that some mutations turned the indian sequences from mild or less severe to severe or highly severe. also, the sequences from the us states of pennsylvania, maryland, indiana, illinois and florida that were collected on may , (about one month after our cut-off date) were analyzed, and our classifiers could correctly capture the severity of the genome sequences (see table in the supplementary file).

we conduct an analysis to predict possible parents of the (mutated) virus strains of the south asian region (bangladesh, india and pakistan). our mutation prediction pipeline suggests that the strains of some states of the usa, namely california, massachusetts, texas, new jersey and maryland, could be the parents/ancestors of these south asian strains. the total deaths in these states up to june , are , , , and respectively [ ] , and their strains are also classified as severe by our classification pipeline. it thus seems quite likely that the sars-cov- situation in these south asian countries will worsen in the near future.
bangladesh, india and pakistan are ranked th, th and nd in global health performance, compared to the united states of america at the th position [ ] . in the majority of lower middle-income countries such as bangladesh, india and pakistan, available hospital beds number < bed per population and icu beds < bed per , population [ ] . additionally, an uncontrolled epidemic is predicted to cause , , deaths over a duration of nearly days in the majority of these countries [ ] . these predictions, coupled with our findings, call for stern actions (i.e., interventions) on the part of these countries.

bibliography:
covid- outbreak situation
genomic characterisation and epidemiology of novel coronavirus: implications for virus origins and receptor binding
cryo-em structure of the -ncov spike in the prefusion conformation
alignment-free sequence comparison: benefits, applications, and tools
a novel fast vector method for genetic sequence comparison
who coronavirus disease (covid- ) dashboard
a deep learning approach to dna sequence classification
dna sequence classification by convolutional neural network
principal component analysis and factor analysis (n.d.)
principal component analysis, springer series in statistics
tempel: time-series mutation prediction of influenza a viruses via attention-based recurrent neural networks
dendroscope: an interactive tool for rooted phylogenetic trees and networks
crisprpred(seq): a sequence-based method for sgrna on-target activity prediction using traditional machine learning
extra tree forests for sub-acute ischemic stroke lesion segmentation in mr sequences
isgpt: an optimized model to identify sub-golgi protein types using svm and random forest based feature selection
lightgbm: a highly efficient gradient boosting decision tree
vietnam confirms th covid- patient - vnexpress international
india confirms its first coronavirus case
kerala defeats coronavirus; india's three covid-
the weather channel
india's first coronavirus death is confirmed in karnataka
coronavirus: india 'super spreader' quarantines , people
, indians quarantined after 'super spreader' ignores government advice
responding to covid- - a once-in-a-century pandemic?
data, disease and diplomacy: gisaid's innovative contribution to global health
evolview, an online tool for visualizing, annotating and managing phylogenetic trees
why neighbor-joining works
coronavirus, primi due casi in italia: sono due turisti cinesi [coronavirus, first two cases in italy: they are two chinese tourists]
koronawirus w lubuskiem. godziny, dwa razy za wolno. daleko do laboratorium [coronavirus in lubuskie. hours, twice too slow. a long way to the laboratory]
taiwan confirms st wuhan coronavirus case (update)
austria's coronavirus cases are italian citizens
greece confirms first coronavirus case, a woman back from milan
as coronavirus takes hold, greece worries about migrant camps
turkey remains firm, calm as first coronavirus case confirmed
human mitochondrial genome compression using machine learning techniques
google colaboratory
the neighbor-joining method: a new method for reconstructing phylogenetic trees
scikit-learn, scikitlearn.org/stable/modules/generated/sklearn.cluster.kmeans.html
dynamic interventions to control covid- pandemic: a multivariate prediction modelling study comparing worldwide countries
imagenet classification with deep convolutional neural networks
going deeper with convolutions
europe's coronavirus numbers offer hope as us enters 'peak of terrible pandemic'
algorithm as : a k-means clustering algorithm
consistent individualized feature attribution for tree ensembles
greece's 'patient zero' shares coronavirus experience (lead)
taiwanese woman deported for refusing to stay at quarantine facility
sağlık bakanı fahrettin koca: pozitif çıkan yeni vakalarımız var - türkiye haberleri [health minister fahrettin koca: we have new positive cases - turkey news]
flights from qatar, www.qatar.to/united-states/qatar-to-united-states
ministra confirma primeiro caso positivo de coronavírus em portugal [minister confirms first positive coronavirus case in portugal]
scikit-learn, scikitlearn.org/stable/modules/feature_selection.html#univariate-feature-selection
nsp of coronaviruses: structures and functions of a large multi-domain protein
receptor recognition by the novel coronavirus from wuhan: an analysis based on decade-long structural studies of sars coronavirus
role of changes in sars-cov- spike protein in the interaction with the human ace receptor: an in silico analysis
measuring overall health system performance for countries. global programme on evidence for health policy discussion paper no.
qatar reports first case of coronavirus
sklearn.feature_selection.f_classif
dendropy: a python library for phylogenetic computing
flights from qatar, www.qatar.to/germany/qatar-to-germany
single-stranded rna genome of sars-cov
sars-cov- (severe acute respiratory syndrome coronavirus ) sequences
antigenic: an improved prediction model of protective antigens
dpp-pseaac: a dna-binding protein prediction model using chou's general pseaac

key: cord- -s u pvk authors: patel, amrutlal k.; pandit, ramesh j.; thakkar, jalpa r.; hinsu, ankit t.; pandey, vinod c.; pal, joy k.; prajapati, kantilal s.; jakhesara, subhash j.; joshi, chaitanya g. title: complete genome sequence analysis of chicken astrovirus isolate from india date: - - journal: vet res commun doi: . /s - - - sha: doc_id: cord_uid: s u pvk

objective: chicken astroviruses have been known to cause severe disease in chickens, leading to increased mortality and the "white chicks" condition. here we aim to characterize the causative agent of visceral gout, suspected to be astrovirus infection, in broiler breeder chickens. methods: total rna isolated from the allantoic fluid of spf embryos passaged with the infected chicken sample was sequenced by whole genome shotgun sequencing using the ion torrent pgm platform. the sequence was analysed for the presence of coding and non-coding features, its similarity with reported isolates, and epitopes of the capsid structural protein. results: a consensus genome sequence of bp for the indian isolate of chicken astrovirus was obtained after assembly of , high quality reads. the genome comprised a bp ′-utr; three open reading frames (orfs), including orf a encoding a serine protease, orf b encoding the rna-dependent rna polymerase (rdrp) and orf encoding the capsid protein; and a bp ′-utr, which harboured two coronavirus stem-loop ii-like "s m" motifs and a poly-a stretch of nucleotides.
the genetic analysis of castv/india/anand/ suggested the highest sequence similarity of . % with the chicken astrovirus isolate castv/ga , followed by . % with castv/ and . % with the castv/poland/g / isolate. the capsid structural protein of castv/india/anand/ showed . % similarity with the chicken astrovirus isolate castv/ga , . % with castv/ and . % with the castv/poland/g / isolate. however, the capsid protein sequence showed a high degree of sequence identity at the nucleotide level ( . - . %) and at the amino acid level ( . – . %) with the reported sequences of indian isolates, suggesting their common origin and limited sequence divergence. epitope analysis by svmtrip identified two unique epitopes in our isolate, seven epitopes shared among the indian isolates and two epitopes shared among all isolates except the poland isolate, which carried all distinct epitopes. electronic supplementary material: the online version of this article (doi: . /s - - - ) contains supplementary material, which is available to authorized users.

poultry meat production is increasing globally day by day, as poultry is an easily manageable animal protein source for human consumption compared to others. however, viral diseases incur heavy economic losses to the poultry industry. among the different viruses infecting birds, astroviruses are small round viruses, characterized on the basis of their morphology (caul and appleton ) . astrovirus was first observed in humans during , in the faeces of infants suffering from gastroenteritis (madeley and cosgrove ) . astroviruses are broadly categorized into two genera, mamastrovirus infecting mammals and avastrovirus infecting avian species; both genera belong to the family astroviridae. astroviruses can infect a variety of hosts including human (finkbeiner et al. ; madeley and cosgrove ) , cattle (bouzalas et al. ; li et al.
; schlottau et al. ) , sheep (jonassen et al. ; reuter et al. ) , cat (hoshino et al. ; lau et al. ) , dog (martella et al. ; takano et al. ) and deer (smits et al. ) . mamastroviruses are mostly associated with gastroenteritis in the host, whereas avastroviruses can cause different diseases in different hosts. the members of the genus avastrovirus mainly infect turkey, chicken and duck. turkey astrovirus (tastv) causes poultry enteritis mortality syndrome (pems) or poultry enteritis syndrome (pes) (mcnulty et al. ; mor et al. ) , while in duck, astroviruses (dastv) are associated with hepatitis (asplin ; gough et al. , ) . in chickens, two astrovirus species, avian nephritis virus (anv) (imada et al. ; shirai et al. ) and chicken astrovirus (castv) (baxendale and mebatsion ) , have been reported. initially, castv was regarded as an enterovirus causing growth retardation (schultz-cherry et al. ; todd et al. ) . later on, it was also found to be associated with gout (bulbule et al. ) and hatchability problems (smyth et al. ) in broiler chickens. recently, castv has been linked with 'white chicks', a disease characterized by weakness and white plumage of hatched chicks (smyth et al. ) , and increased mortality in chicks and embryos has also been reported in poland and brazil (nunez et al. ) .

chicken astroviruses are non-enveloped, - nm in diameter, and contain a non-segmented positive-sense ssrna genome of . to . kb in length (matsui and greenberg ; mendez et al. ) . whole genome sequencing and other studies have revealed that the basic structure and molecular mechanisms are broadly similar for all the astroviruses sequenced to date (koci et al. ) . initially there was no specific diagnostic method available for astroviruses, and detection relied mostly on electron microscopy (madeley and cosgrove ; mcnulty et al. ) or immunoassay (baxendale and mebatsion ) .
however, advances in molecular biology have led to the development of easier techniques for the detection of viruses; astroviruses can also be diagnosed using rt-pcr (pantin-jackwood et al. ; smyth et al. ; todd et al. ) . lee et al. ( ) suggested that a recombinant capsid can be used for diagnosis and vaccination. to date, the genomes of only three chicken astroviruses, castv/ (directly submitted to ncbi), castv/ga (kang et al. ) and castv/poland/g / (sajewicz-krukowska and domanska-blicharz ) , have been reported. in this study, we sequenced and characterized the whole genome of a chicken astrovirus isolated from infected broiler chicks from the western part of india.

a broiler breeder farm at anand, gujarat, india was facing the problem of visceral gout in cobb- commercial chicks from the last three batches of the parents. the outbreaks were noticed only in the initial hatches of parents aged to weeks. these parents were vaccinated with infectious bronchitis nephropathic vaccine strains. fertility and hatchability were normal, but the chicks started showing lameness with spiking mortality from the th day onward. mortality continued for - days and ranged from to %. hatches falling during the winter season, i.e. november to february, had high mortality. dead chicks showed an increased amount of abdominal fluid and pale, greyish kidneys with dilated tubules filled with urates. chalky white deposits of urates were found on the serosal surfaces of the pericardium, liver capsule, air sacs and joint capsules and the mucosal surfaces of the proventriculus, trachea, etc. an affected chick sample was submitted to hester biosciences limited for diagnosis of infection during february . the spleen, kidney and lung tissue samples from freshly dead birds were collected, triturated in pbs to make a % suspension, passed through a . μm syringe filter and inoculated into day embryonated spf eggs through the allantoic cavity route.
after three blind passages, the embryos started dying to days post inoculation and showed haemorrhagic lesions on the body surface. the allantoic fluid from the dead embryos was collected and used for total rna isolation and subsequent analysis by whole genome sequencing. total rna was extracted using trizol reagent (invitrogen, carlsbad, ca, usa) and treated with rnase-free dnase i (qiagen, hilden, germany) to remove any dna contamination. the total rna thus obtained was subjected to rnaseq library preparation as per the ion total rnaseq kit v (life technologies, carlsbad, ca, usa). sequencing was carried out on an ion torrent pgm using a chip (life technologies, carlsbad, ca, usa).

the sequencing reads were mapped to the gallus gallus genome (wgs: insdc: aadn . ) to remove host sequences. the remaining sequences were subjected to quality filtering (q > ) using prinseq v . . , and good quality sequences were de novo assembled using the gs de novo assembler. the assembled genome sequence was searched for blastn similarity against the nr/nt database. the genome sequence was further analysed for prediction of putative open reading frames (orfs) by the orf finder tool (stothard ) and by manual curation for analysis of the ribosomal frameshift signal (rfs), as reported for other astroviruses (koci et al. ; ) . the prediction of the stem-loop structure of the rfs was performed with the rnafold web server (http://rna.tbi.univie.ac.at/cgi-bin/rnafold.cgi). non-coding rna sequences were inferred using a similarity search against the rfam database (nawrocki et al. ) .

nearly complete genome sequences of astroviruses (table ) were downloaded from ncbi and predicted for the presence of orfs using the orffinder tool and manual curation of the rfs start site. multiple sequence alignment was performed using the clustal omega webserver (http://www.ebi.ac.uk/tools/msa/clustalo/) to analyse the percent identity with other genomes at the nucleotide and amino acid levels.
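as a simplified illustration of the orf prediction step (the real analysis used the orf finder tool and manual curation; this naive scan only finds non-overlapping in-frame atg...stop stretches on the forward strand of an invented fragment):

```python
# naive forward-strand ORF scan on a made-up fragment; real analyses
# also consider the reverse strand and much longer minimum lengths
import re

def find_orfs(seq, min_len=9):
    # in-frame ATG...stop matches (TAA/TAG/TGA), non-overlapping
    pat = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")
    return [(m.start(), m.end()) for m in pat.finditer(seq)
            if m.end() - m.start() >= min_len]

demo = "CCATGAAATTTGGGTAACCATGCCCTGA"
print(find_orfs(demo))  # → [(2, 17), (19, 28)]
```

the non-greedy quantifier makes each match stop at the first in-frame stop codon, mimicking how an orf is delimited.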
phylogenetic analysis of the genomes and predicted proteins was performed in mega (tamura et al. ) after building alignments with the clustalw algorithm and subsequent tree generation using the neighbour-joining method (saitou and nei ) with bootstrap replicates. nucleotide sequences of orf encoding the capsid protein of reported indian isolates were retrieved from the ncbi nr database and analysed for nucleotide and amino acid similarity and phylogeny as described earlier. for epitope prediction we used svmtrip (yao et al. ) , which predicts linear antigenic epitopes with a support vector machine integrating tri-peptide similarity and propensity, applied to the capsid protein sequence. we compared the epitopes of three castv isolates from india [castv/india/anand/ , vrdc|castv|nz|vhinp- (accession no. agb . ) and south region (accession no. aic . )], two from the usa (castv/ga and castv/ ), one from the uk (accession no. afk . ) and one from poland (castv/poland/g / ).

allantoic fluid collected from spf embryos inoculated for three blind passages of the field sample was subjected to total rna isolation, and next generation sequencing on the ion torrent platform resulted in a total of , reads with an average read length of bases. after removing the host-specific and low quality reads, a total of , reads were used for assembly. the genome was assembled into a consensus length of bp, which was identified as chicken astrovirus upon nucleotide blast analysis. the full length genome sequence was deposited in genbank under the accession number ky . there was an untranslated region (utr) of bp at the ′ end and of bp at the ′ end, with a poly-a tail stretch of nucleotides. the genome was composed of a ( %, nt), t ( %, nt), g ( %, nt) and c ( %, nt), with a gc content of . %.
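the percent-identity comparisons above reduce to a simple column-matching calculation over aligned sequences; a minimal sketch on two invented, gap-free fragments:

```python
# toy percent-identity calculation over pre-aligned, gap-free sequences;
# the two short fragments are invented for illustration
def percent_identity(a: str, b: str) -> float:
    assert len(a) == len(b), "sequences must be aligned to equal length"
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

print(percent_identity("ATGCATGC", "ATGCTTGC"))  # → 87.5
```

real pairwise identities from clustal omega alignments also have to account for gap columns, which this sketch deliberately omits.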
genome sequence analysis of the identified chicken astrovirus isolate for the presence of coding and non-coding features

the genome encoded three orfs (orf a, orf b and orf ), each encoding a single protein-coding gene, with a partial overlap of orf b with orf a (fig. a) . manual curation of the ribosomal frameshift signal (position nt ) revealed the presence of an atg start codon preceding the slippery heptameric sequence (aaaaaac) (fig. b ) followed by sequences forming the stem-loop structure, as predicted by rnafold analysis (fig. c) . among the analysed genomes, all four chicken astroviruses and the duck astrovirus sl isolate were found to possess their own atg start codon for orf b, whereas it was absent in the other astroviruses (fig. b) . rfam analysis of the ′-utr revealed the presence of two non-coding rnas similar to the "s m" rna family (accession number: rf ) at positions - (e value: . e − ) and - (e value: . e − ).

analysis of nucleotide similarity with other full-length astrovirus genomes and phylogenetic analysis of the astrovirus genomes suggested the formation of a separate cluster of chicken astroviruses and placed castv/india/anand/ nearest to the castv/ isolate (fig. ) . similarly, phylogeny based on the amino acid sequences of the serine protease (fig. a) , rdrp (fig. b ) and capsid protein (fig. c) showed close clustering among the chicken astroviruses, except for the capsid protein of the castv/poland/g / isolate, which clustered with the dastv/sl isolate. among the indian isolates, the orf nucleotide sequence of castv/india/anand/ was placed nearest to the vrdc/castv/nz/vhinp- isolate (fig. a) , whereas based on the amino acid sequence it was placed nearest to the vrdc/castv/nz/vhinp- isolate (fig. b) .

b-cell epitope analysis of the capsid structural protein of the identified chicken astrovirus isolate

a total of - epitopes were predicted using svmtrip from the capsid protein sequences of the astroviruses.
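the base-composition summary and the slippery-heptamer check described above can be sketched in a few lines; the fragment below is invented and not the actual castv genome:

```python
# sketch of base-composition and slippery-heptamer (AAAAAAC) checks on an
# invented fragment, not the actual CAstV genome sequence
from collections import Counter

def composition(seq: str):
    counts = Counter(seq.upper())
    gc = 100.0 * (counts["G"] + counts["C"]) / len(seq)
    return counts, round(gc, 2)

demo = "ATGCGCATATTTGGCCAAAAAAC"
counts, gc = composition(demo)
print(gc)                       # GC percentage of the fragment
print(demo.find("AAAAAAC"))    # position of the slippery heptamer
```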
epitope analysis revealed two unique epitopes in the case of castv/india/anand/ . the epitopes present at positions - and - were common to all of the viruses analysed. based on the predicted epitopes, castv/poland/g / was found to be unique, as it shared no epitopes with the other viruses. comparison of the three indian castvs revealed that seven epitopes were common to all three, and one epitope ( - ) differed by a single amino acid substitution in castv/india/anand/ (supplementary table ) .

viral diseases of poultry have a great economic impact on poultry industries worldwide, as they lead to high mortality. chicken astrovirus causes severe disease, especially in young chickens (bulbule et al. ; li et al. ; nunez et al. ; schultz-cherry et al. ; smyth et al. ; todd et al. ) . though culturing methods have been described for astroviruses (baxendale and mebatsion ; nunez et al. ) , isolation of the virus is somewhat difficult due to its poor growth in culture (smyth et al. ) . next generation sequencing is advantageous here, as there is no need to isolate or culture the organism. hence, in the present study we directly isolated rna from the allantoic fluid of chicken embryos inoculated with the clinical sample and sequenced it on the ion torrent pgm platform. the viral genome of castv/india/anand/ was assembled into bp, which is comparable with the sizes of other published astrovirus genomes (chen et al. ; finkbeiner et al. ; strain et al. ) . the genome showed the presence of three orfs encoding a serine protease, rdrp and capsid protein, as reported for other astroviruses. the rdrp of most astroviruses does not have its own start codon, although kang et al. ( ) reported that the rdrp of chicken astrovirus ga does. we observed the presence of an atg start codon for rdrp among all the reported chicken astroviruses and the dastv/sl isolate, whereas the other duck astroviruses and avian nephritis viruses were found to lack the atg start codon.
the rna family analysis by rfam suggested the presence of two motifs matching the coronavirus stem-loop ii (s m) motif in the ′-utr, as reported for other astroviruses (jonassen et al. ; ). although their exact function has not been uncovered, the presence of these "s m" motifs is believed to influence gene expression through an rna-interference mechanism (tengs et al. ). based on nucleotide similarity, the virus was found to be closest to the chicken astrovirus isolate ga (kang et al. ), followed by castv/ (direct submission). the lower identity of the castv/poland/ g / capsid protein with other chicken astrovirus capsid proteins, but its higher identity with the duck astroviruses, may be due to a recombination event and shared ancestry with the duck astroviruses. among the avian astroviruses, chicken astroviruses were found to share higher identity with the duck astroviruses (chen et al. ; fu et al. ; liu et al. ) than with the turkey astroviruses (koci et al. ; strain et al. ) and the avian nephritis virus isolates (imada et al. ; zhao et al. a, b). we next analysed the sequence similarity of the capsid structural protein of castv/india/anand/ with the capsid protein coding sequences and amino acid sequences of the reported indian astrovirus isolates, which revealed about - % sequence identity at the nucleotide level and about - % at the amino acid level, suggesting limited structural divergence and a common origin of the indian isolates reported to date. phylogenetic analysis of the genome sequences as well as the protein sequences placed castv/india/anand/ nearest to castv/ and castv/ga, and all four chicken astroviruses formed a separate cluster, except for the capsid protein of the castv/poland/g / isolate, which clustered with the duck astroviruses. the clustering of castv/poland/g / with the duck astrovirus isolate sl suggests possible recombination between these isolates (sajewicz-krukowska and domanska-blicharz ).
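pairwise identity figures like those above can be computed, in simplified form, as the fraction of matching positions between two aligned sequences; real analyses first perform a pairwise alignment with gaps. the sequences below are hypothetical fragments for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two equal-length aligned sequences.
    A simplification: production comparisons use gapped alignments."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

print(percent_identity("MKVLAA", "MKVLSA"))  # 5 of 6 positions match
```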
based on the nucleotide sequences of the genomes and the amino acid sequences of the serine protease and rdrp, chicken astroviruses were placed closer to the duck astroviruses than to the turkey astroviruses or avian nephritis viruses. however, based on the capsid protein, the turkey astroviruses were phylogenetically placed between different isolates of the duck astroviruses as well as near the castv/poland/g / isolate. these observations suggest that the capsid proteins of turkey, duck and chicken astroviruses evolved through possible recombination between the astroviruses of different avian species, and that turkeys and ducks may play an important role in the epidemiology of avian astroviruses, similar to their role for influenza viruses (alexander ). among the indian isolates, phylogenetic analysis of the capsid protein placed castv/india/anand/ between two north zone isolates; however, very limited sequence divergence was seen among the reported indian isolates, suggesting their recent emergence and common origin. epitope analysis of the capsid protein sequence revealed two unique epitopes in our isolate, whereas epitopes were found to be shared among the indian astrovirus isolates. except for the castv/poland/g / isolate, which contained all unique epitopes, the other isolates shared two common epitopes. our analysis suggests that a vaccine designed using the indian astrovirus isolate may provide cross-protection against the isolates prevailing in india. further, epitope mapping would be useful for designing safe and effective vaccines against divergent astroviruses (ahmad et al. ; soria-guerra et al. ). in summary, whole-genome analysis of an indian astrovirus isolate by next-generation sequencing determined the full-length genome of a chicken astrovirus isolate for the first time in india. the present study established the genetic relatedness of the circulating indian isolate to the other reported nearly complete genome sequences of avian astroviruses.
the analysis of the capsid protein sequences of chicken astroviruses reported from india revealed limited structural divergence, suggesting their common ancestral origin and recent emergence. considering the high sequence identity of the capsid structural protein among strains prevailing in india, the castv/india/anand/ isolate could serve as a potential source for further development as a vaccine candidate. the identification of unique and shared epitopes among different astroviruses will be helpful in designing effective epitope-based vaccine formulations.
conflict of interest the authors declare that they have no conflict of interest.
fig. phylogenetic relatedness of chicken astrovirus isolate castv/india/anand/ orf coding sequences (a) and orf -encoded capsid protein (b) with reported indian isolates, based on the neighbour-joining method with
references
b-cell epitope mapping for the design of vaccines and effective diagnostics
a review of avian influenza in different bird species
duck hepatitis: vaccination against two serological types
the isolation and characterisation of astroviruses from chickens
neurotropic astrovirus in cattle with nonsuppurative encephalitis in europe
role of chicken astrovirus as a causative agent of gout in commercial broilers in india
the electron microscopical and physical characteristics of small round human fecal viruses: an interim scheme for classification
complete genome sequence of a duck astrovirus discovered in eastern china
complete genome sequence of a highly divergent astrovirus isolated from a child with acute diarrhea
complete sequence of a duck astrovirus associated with fatal hepatitis in ducklings
astrovirus-like particles associated with hepatitis in ducklings
an outbreak of duck hepatitis type ii in commercial ducks
detection of astroviruses in feces of a cat with diarrhea
avian nephritis virus (anv) as a new member of the family astroviridae and construction of infectious anv cdna
a common rna motif in the ′ end of the genomes of astroviruses, avian infectious bronchitis virus and an equine rhinovirus
complete genomic sequences of astroviruses from sheep and turkey: comparison with related viruses
determination of the full length sequence of a chicken astrovirus suggests a different replication mechanism
molecular characterization of an avian astrovirus
complete genome sequence of a novel feline astrovirus from a domestic cat in hong kong
chicken astrovirus capsid proteins produced by recombinant baculoviruses: potential use for diagnosis and vaccination
divergent astrovirus associated with neurologic disease in cattle
complete sequence of a novel duck astrovirus
viruses in infantile gastroenteritis
enteric disease in dogs naturally infected by a novel canine astrovirus
detection of astroviruses in turkey faeces by direct electron microscopy
association of the astrovirus structural protein vp with membranes plays a role in virus morphogenesis
the role of type- turkey astrovirus in poult enteritis syndrome
rfam . : updates to the rna families database
isolation of chicken astrovirus from specific pathogen-free chicken embryonated eggs
detection and molecular characterization of chicken astrovirus associated with chicks that have an unusual condition known as "white chicks" in brazil
enteric viruses detected by molecular methods in commercial chicken and turkey flocks in the united states between
identification of a novel astrovirus in domestic sheep in hungary
the neighbor-joining method: a new method for reconstructing phylogenetic trees
nearly full-length genome sequence of a novel astrovirus isolated from chickens with 'white chicks' condition
astrovirus-induced "white chicks" condition: field observation, virus detection and preliminary characterization
detection of a novel bovine astrovirus in a cow with encephalitis
inactivation of an astrovirus associated with poult enteritis mortality syndrome
pathogenicity and antigenicity of avian nephritis isolates
identification and characterization of deer astroviruses
detection of chicken astrovirus by reverse transcriptase-polymerase chain reaction
development and evaluation of real-time taqman(r) rt-pcr assays for the detection of avian nephritis virus and chicken astrovirus in chickens
chicken astrovirus detected in hatchability problems associated with 'white chicks'
an overview of bioinformatics tools for epitope prediction: implications on vaccine development
the sequence manipulation suite: javascript programs for analyzing and formatting protein and dna sequences
genomic analysis of closely related astroviruses
detection of canine astrovirus in dogs with diarrhea in japan
mega : molecular evolutionary genetics analysis version .
a mobile genetic element with unknown function found in distantly related viruses
a seroprevalence investigation of chicken astrovirus infections
svmtrip: a method to predict antigenic epitopes using support vector machine to integrate tripeptide similarity and propensity
sequence analyses of the representative chinese prevalent strain of avian nephritis virus in healthy chicken flocks
complete sequence and genetic characterization of pigeon avian nephritis virus, a member of the family astroviridae
acknowledgements the authors are thankful to mr. rajiv gandhi, managing director and ceo, hester biosciences limited, ahmedabad, india and anand agricultural university, anand, india for providing the facilities to carry out the research work.
compliance with ethical standards
key: cord- - nnqx g authors: canturk, semih; singh, aman; st-amant, patrick; behrmann, jason title: machine-learning driven drug repurposing for covid- date: - - journal: nan doi: nan sha: doc_id: cord_uid: nnqx g
the integration of machine learning methods into bioinformatics provides particular benefits in identifying how therapeutics effective in one context might have utility in an unknown clinical context or against a novel pathology.
we aim to discover the underlying associations between viral proteins and antiviral therapeutics that are effective against them by employing neural network models. using the national center for biotechnology information virus protein database and the drugvirus database, which provides a comprehensive report of broad-spectrum antiviral agents (bsaas) and viruses they inhibit, we trained ann models with virus protein sequences as inputs and antiviral agents deemed safe-in-humans as outputs. model training excluded sars-cov- proteins and included only phases ii, iii, iv and approved level drugs. using sequences for sars-cov- (the coronavirus that causes covid- ) as inputs to the trained models produces outputs of tentative safe-in-human antiviral candidates for treating covid- . our results suggest multiple drug candidates, some of which complement recent findings from noteworthy clinical studies. our in-silico approach to drug repurposing has promise in identifying new drug candidates and treatments for other viruses. artificial intelligence (ai) technology is a recent addition to bioinformatics that shows much promise in streamlining the discovery of pharmacologically active compounds [ ] . machine learning (ml) provides particular benefits in identifying how drugs effective in one context might have utility in an unknown clinical context or against a novel pathology [ ] . the application of ml in biomedical research provides new means to conduct exploratory studies and high-throughput analyses using information already available. in addition to deriving more value from past research, researchers can develop ml tools in relatively short periods of time. past research now provides a sizable bank of information concerning drug-biomolecule interactions. using drug repurposing as an example, we can now train predictive algorithms to identify patterns in how antiviral compounds bind to proteins from diverse virus species. 
we aim to train an ml model so that when presented with the proteome of a novel virus, it will suggest antivirals based on the protein segments present in the proteome. the final output from the model is a best-fit prediction as to which known antivirals are likely to associate with those familiar protein segments. these benefits are of particular interest for the current covid- health crisis. the novelty of sars-cov- requires that we execute health interventions based on past observations. as we grapple with an unforeseen pandemic that has no known treatments or vaccines, the potential for rapid innovation from ml is of utmost significance. the ability to conduct complex analyses with ml enables us to generate insights quickly that can help steer future studies in directions likely to produce fruitful results. we present here multiple models that produced a number of antiviral candidates for treating covid- . of our top predicted drugs, have shown positive results in recent findings based on cell culture results and clinical trials. these promising antivirals are lopinavir, ritonavir, ribavirin [ ], cyclosporine [ ], rapamycin [ ], and nitazoxanide [ ]. for the other predicted drugs, further research is needed to evaluate their effectiveness against sars-cov- . we used two main data sources for this study. the first was the drugvirus database [ ], which catalogues broad-spectrum antiviral agents (bsaas) and the viruses they inhibit. the database covers viruses and compounds, and provides the antiviral status of each compound-virus pair. these statuses fall into eight categories representing the progressive drug trial phases: cell cultures/co-cultures, primary cells/organoids, animal model, phases i-iv and approved. see appendix a for a more intuitive pivot table view of the database.
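the pivot-table view referenced for appendix a can be reconstructed from the pair-per-row format with pandas; the three rows below are a hypothetical excerpt, not the actual database contents.

```python
import pandas as pd

# hypothetical excerpt: one row per drug-virus pair with its trial status
pairs = pd.DataFrame({
    "drug":   ["ribavirin", "ribavirin", "tilorone"],
    "virus":  ["Hepatitis C virus", "RSV", "MERS-CoV"],
    "status": ["Approved", "Approved", "Approved"],
})

# pivot to a drug x virus matrix, as in the appendix-style view
pivot = pairs.pivot_table(index="drug", columns="virus",
                          values="status", aggfunc="first")
print(pivot)
```

cells for untested pairs come out as NaN, which later maps naturally to the "not a viable candidate" label.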
the second database is the national center for biotechnology information (ncbi) virus portal [ ]; as of april , this database provides approximately . million amino-acid and million nucleotide sequences from viruses with humans as hosts. each row of this database contains an amino acid sequence specimen from a study, as well as metadata that includes the associated virus species. in our work, we considered sequences only from the virus species in the drugvirus database or their subspecies in order to be able to merge the two data sources successfully. we also constrained ourselves to amino-acid sequences only in the current iteration. the main reasons for this are two-fold: . amino-acid sequences are essentially derived from the dna sequences, which may encode overlapping information on different levels. in somewhat simplified terms, amino-acid sequences are the outputs of a layer of preprocessing on genetic material (in the form of dna/rna). . nucleotide triplets (codons) map to amino-acids, making amino-acid sequences much shorter and easier to extract features from, both in preprocessing and in the machine learning methods themselves. shorter sequences also mean the ml pipeline will be more resource-efficient, i.e. easier to train. the amino-acids were downloaded as three datasets: hiv types & ( , , sequences), influenza types a, b & c ( , sequences), and the "main" dataset for all other types including sars-cov- ( , sequences). each dataset came with two components. the "sequence" component is composed of accession ids and the amino-acid sequence itself, while the "metadata" component includes all other data (e.g. virus species, the date the specimen was taken, an identifier of the related study) as well as the accession id to enable merging the two components. the amount of research with a focus on influenza and hiv naturally led to these viruses comprising most of the samples.
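the codon-to-amino-acid compression behind reason (2) can be illustrated with a minimal translation sketch; only four codon-table entries are included here for brevity (the full table has 64), and the orf string is hypothetical.

```python
# minimal codon-table fragment (the full genetic code has 64 entries)
CODON_TABLE = {"ATG": "M", "GCC": "A", "AAA": "K", "TAA": "*"}

def translate(nt: str) -> str:
    """Translate a nucleotide ORF into an amino-acid sequence,
    stopping at the first stop codon; unknown codons become 'X'."""
    aa = []
    for i in range(0, len(nt) - 2, 3):
        residue = CODON_TABLE.get(nt[i:i + 3], "X")
        if residue == "*":
            break
        aa.append(residue)
    return "".join(aa)

orf = "ATGGCCAAATAA"            # 12 nucleotides
print(translate(orf))           # -> "MAK": roughly a 3x shorter representation
```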
in our experiments, we have excluded these viruses, and have worked only with dataset # , though the other datasets can be integrated into the main one during the class balancing process, an idea we will discuss in section , future work. the first step of the preparation phase was to merge the "sequence" and "metadata" components into a single ncbi dataset based on sequence ids. afterwards, we mapped the "species" column in this main dataset to the virus name column in the drugvirus database. this step was required as these two columns that denote the virus species in the respective datasets did not match due to subspecies present in the sequence dataset and alternative naming of some viruses. afterwards, we processed the drugvirus dataset to a format suitable for merging with the ncbi data frame. every row of the drugvirus dataset consists of a single drug-virus pairing and their respective interaction/drug trial phase, meaning any given drug and virus appeared in multiple rows of the dataset. we derived a new drugvirus dataset that functioned as a dictionary where each unique virus was a key, and the interactions with antivirals encoded as a multi-label binary vector ( if viable antiviral according to the original dataset, if not) of length (the number of antivirals) which corresponded to the value. we came up with three "versions" depending on how we decided an antiviral was a viable candidate to inhibit a virus. the criteria depended on drug trial phases: . in the first version, any interaction between a drug-virus pair is designated by a . this means drugs that did not go past cell cultures/co-cultures or primary cells/organoids testing are still considered viable candidates. . this second version expands upon the first stemming from our discovery that an attained trial phase in the database does not necessarily mean previous phases were also listed in the database. 
for example, we found that for a given virus, a given drug had undergone phase iii testing, designated by a , but phases i & ii were listed as s. this undermined our assumption that drug trials are hierarchical, even though in reality this is usually the case. this can be caused by missing data reporting or possibly skipped phases. we proceeded with the hierarchy assumption, and extended the database in ( ) to account for the previous phases. this meant that in this second version, an approved drug would have all phases designated with s, for example. keeping track of the phases meant that the size of the database also grew by . . in the third version, we considered a drug-virus pair as viable only if it had attained phase ii or further drug trials, signifying that some success in human trials had been observed. in the results presented in section , our training database was based on this third version of the drugvirus database. the full dataset was then generated by merging this "new" version of the drugvirus dataset with the ncbi dataset. we then generated two versions of this full dataset: one that consists of all sars-cov- sequences and one that consists of all other viruses available. this enabled us to compare how successful our models are in the case where they have not been trained on the virus species at all and have to detect peptide substructures in the sequences to suggest antivirals. a sample of this final database (with some columns excluded for brevity) is available in appendix b. upon inspection of the data, we found that the data were replete with duplicate or extremely similar virus sequences. to reduce this exploitability and pose a more challenging problem, we removed the duplicate sequences that belonged to the same species and had the exact same length. this reduced the size of the dataset by approximately %. the counts for each virus species before and after dropping duplicate viruses are available in appendix c and c .
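the species-plus-length de-duplication criterion can be sketched with pandas as follows; the sequences are toy examples, and (as noted in the future-work section) string-similarity measures would catch near-duplicates this rule misses.

```python
import pandas as pd

# hypothetical slice of the merged dataset
df = pd.DataFrame({
    "species":  ["Hepatitis C virus", "Hepatitis C virus", "West Nile virus"],
    "sequence": ["MSTNP", "MSTNA", "MVKRA"],
})
df["seq_len"] = df["sequence"].str.len()

# drop likely duplicates: same species and same sequence length
deduped = df.drop_duplicates(subset=["species", "seq_len"])
print(len(df), "->", len(deduped))  # 3 -> 2
```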
our main database also contained a class imbalance in the number of times certain virus species appeared in the database. we oversampled rare viruses (e.g., west nile virus: sequences), excluded the very rare species which compose less than . % of the available unique samples in the dataset (e.g., andes virus: sequences), and undersampled the common viruses (e.g., hepatitis c: , sequences). this produced a more modest database of , amino acid sequences, with each virus having samples in the - range (see appendix c ). we kept the size of the dataset small both to enable easier model training and validation in early iterations and to handle data imbalance more smoothly. the class imbalance problem also presented itself in the antiviral compounds. even with balanced virus classes, the number of times each drug occurred within the dataset varied, simply because some drugs apply to more viruses than others. to alleviate this, we computed class weights for each drug, which we then provided to the models in training. this enabled a fairer assessment and a more varied distribution of antivirals in predicted outputs. the final step of data processing involved generating the training and validation sets. we split the data in two different ways, resulting in two different experiments (see section . , experiment setup for the full experiment pipeline). experiment i is based on a standard randomized % training/ % validation split on the main dataset. for experiment ii, we split the data on virus species, meaning the models were forced to predict drugs for species they were not trained on, and had to detect peptide substructures in the amino-acid sequences to suggest drugs. in this setup we also guaranteed that the sars-cov- sequences were always in the test set, in addition to three other viruses randomly picked from the dataset.
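the per-drug class weighting can be sketched as below. the paper does not state its exact weighting formula, so the common inverse-frequency ("balanced") heuristic, n_samples / (n_classes * count), is assumed here, and the label matrix is a toy example.

```python
import numpy as np

# hypothetical multi-label drug matrix: rows = sequences, columns = drugs
Y = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])

counts = Y.sum(axis=0)                       # occurrences per drug: [4, 2, 1]
weights = Y.shape[0] / (Y.shape[1] * counts)  # rare drugs get larger weights
print(weights)
```

these weights would then be passed to the loss so that rarely applicable drugs are not drowned out by common ones.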
we used a variant of this setup that trains on all virus sequences except sars-cov- and is validated on sars-cov- only to generate the results presented in section . a growing number of studies demonstrate the success of using artificial neural networks (ann) in evaluating biological sequences in drug repositioning and repurposing [ ] [ ]. previous work on training neural networks on nucleotide or amino-acid sequences has been successful with recurrent models such as gated recurrent units (gru), long short-term memory networks (lstm) and bidirectional lstms (bilstm), as well as d convolutions and d convolutional neural networks (cnn) [ ] [ ]. we have therefore focused on these network architectures, and conducted our experiments with an lstm with d convolutions and bidirectional layers as well as a cnn. the network architectures are explained briefly below.
lstm and d convolutions
for the lstm, a character-level tokenizer was used to encode the fasta sequences into vectors consumable by the network. the sequences were then padded with zeros or cut off to a fixed length to maintain a fixed input size. the network architecture consisted of an embedding layer, followed by d convolution and bidirectional lstm layers (each followed by maxpooling), and two fully connected layers. a more detailed architecture diagram is available in appendix d.
convolutional neural network (cnn)
for the cnn, the input features were one-hot encoded based on the fasta alphabet/charset, which assisted in interpretability when examining the d input arrays as images. the inputs are also fixed at a length of , resulting in x images, where is the number of elements in the fasta charset. the network architecture consists of four d convolutions with filter sizes of x , x , x and x respectively, which are maxpooled, concatenated and passed through a fully connected layer. a more detailed architecture diagram is available in appendix e. the experiments were run on a computer with an .
ghz intel broadwell cpu ( gb ram) and nvidia k gpu ( gb). both models completed a -epoch experiment in - minutes. one to three training and evaluation runs were made for each setup during model and hyperparameter selection, and ten training and evaluation runs were done to produce the average metrics in section . the experiments start by determining the model to use and applying the appropriate preprocessing steps mentioned in section . . we then proceed with determining the dataset to train and validate on. this part of the experiment setup is more extensively covered in section . . , train/test splitting. we used binary cross-entropy (bce) loss and the adam optimizer, with precision, recall and f -score as metrics, since accuracy tends to be unreliable given the class imbalance and the sparse nature of our outputs. after training and validation, predictions were made on the validation set and the results were post-processed for interpretability. in post-processing, we applied a threshold to the sigmoid outputs of the neural network, which assign each drug a probability of being a potential antiviral for a given amino acid sequence. after experimenting with different values, we settled on a threshold value of . . post-processing outputs the list of selected drugs along with their respective probabilities of being "effective" against the virus with the given amino acid sequence. for other hyperparameters involved as well as information on hyperparameter tuning, see appendix f. here we present the results for the two experiments described in section . . , train/test splitting. the figures and tables presented in this section are based on the lstm and cnn architectures described in section . , which were trained with batch size and . and . learning rates respectively for epochs with an adam optimizer. in the regular setup, we performed an %/ % train-test split on our data of , sequences.
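the sigmoid-plus-threshold post-processing described above can be sketched as follows; the drug names are an illustrative subset, and the 0.5 cutoff stands in for the paper's (elided) threshold value.

```python
import numpy as np

DRUGS = ["lopinavir", "ritonavir", "tilorone"]  # illustrative subset only

def select_drugs(logits, threshold=0.5):
    """Map raw network outputs to (drug, probability) pairs via a
    sigmoid and a cutoff; the threshold here is an assumed value."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return [(drug, float(p)) for drug, p in zip(DRUGS, probs) if p >= threshold]

print(select_drugs([2.0, -1.0, 0.1]))  # keeps lopinavir and tilorone
```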
the metrics for the best set of hyperparameters (based on validation set f -score) for both the cnn and lstm architectures respectively are presented in table . similarly, plots for the same set of models and hyperparameters over epochs are presented in figures and . our models handled the task successfully, achieving . f -score in a multi-label multi-class problem setting. this means that the models were able to match the virus species with the sequence substructures and appropriately assign the inhibiting antivirals with accuracy. these satisfactory results led to us implementing experiment ii. in experiment ii, the models predicted antiviral drugs for virus species they haven't been trained on. this meant the models were not able to recommend drugs by "recognizing" the virus from the sequence and therefore had to rely only on peptide substructures in the sequences to assign drugs. in the results presented below, the test set consists of sars-cov- , herpes simplex virus , human astrovirus and ebola virus, whose sequences were removed from the training set. we see here that the cnn (and the lstm) had issues with convergence, and the accuracies are clearly below their counterparts in the regular setup, though this is certainly expected. we now turn to the actual predictions on the sequences and attempt to interpret them. upon examination of drug predictions for herpes simplex virus (hsv- ), however, we see that our cnn was in fact quite successful. in table and table , count represents how many times each drug was flagged as potentially effective for hsv- sequences, and mean probability denotes the average confidence predicted over all instances of the drug. a sample of the outputs where these metrics are derived from is available in appendix g. antivirals used for phase ii and further trials for hsv- are highlighted in bold, meaning all six drugs in the database that are used for phase ii and further trials are predicted by our model. 
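the count and mean-probability summaries used in the hsv- tables can be reproduced from per-sequence outputs as below; the drugs and probabilities shown are hypothetical placeholders, not values from the tables.

```python
from collections import defaultdict

# hypothetical per-sequence predictions: lists of (drug, probability) pairs
predictions = [
    [("acyclovir", 0.91), ("foscarnet", 0.62)],
    [("acyclovir", 0.87)],
    [("acyclovir", 0.95), ("foscarnet", 0.58)],
]

stats = defaultdict(list)
for per_seq in predictions:
    for drug, p in per_seq:
        stats[drug].append(p)

# count = times the drug was flagged; mean = average predicted confidence
summary = {d: (len(ps), sum(ps) / len(ps)) for d, ps in stats.items()}
print(summary)
```

ranking `summary` by count (then mean probability) yields tables of the same shape as those reported for hsv- and sars-cov- .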
three of the top five predictions are approved antivirals for hsv- , and the only remaining one is predicted th among antivirals. this high level of accuracy is remarkable given that our model has not been trained on hsv- sequences.
predictions for sars-cov-
with some variation between the two, both the lstm (table a) and the cnn (table b) seem to converge on a number of drugs: ritonavir, lopinavir (both phase iii for mers-cov), tilorone (approved for mers-cov) and brincidofovir are in the top five candidates in both, while valacyclovir, ganciclovir, rapamycin and cidofovir rank high up in both lists. most of the remaining drugs are present in both lists as well. the lstm is more conservative in its predictions than the cnn, and the overall counts for sars-cov- are significantly lower than for herpes simplex virus for both, pointing to a comparable lack of confidence on the models' part in predicting sars-cov- sequences. a further step we took for the sars-cov- sequences was visualizing the layer activations in the zetane engine to validate that the model was processing the data at a fine-grained level. this was done in a similar fashion to a study where integrated gradients were used to generate attributions on a neural network performing molecule classification [ ]. the layer activations in both models showed that different antivirals activated different subsequences of a given sequence at the amino acid level, thus validating our approach. the filter activations are available in appendix h. the preliminary results of our experiments show promise and merit further investigation. we note that our ml models predict that some antivirals that show promise as treatments against mers-cov may also be effective against sars-cov- . these include the broad-spectrum antiviral tilorone [ ] and the drug lopinavir [ ], the latter of which is now in phase iv clinical trials to determine its efficacy against covid- [ ].
such observations suggest with confidence that our models can recognize reliable patterns between particular antivirals and species of viruses containing homologous amino acid sequences in their proteome. additional observations that support our findings have come to light from a study in the lancet published shortly before this article [ ] . this open-label, randomized, phase ii trial observed that the combined administration of the drugs interferon beta- b, lopinavir, ritonavir and ribavirin provides an effective treatment of covid- in patients with mild to moderate symptoms. both of our models flagged three of the drugs in that trial (note that interferon was not part of our datasets). in terms of number of occurrences aka count, ritonavir, lopinavir and ribavirin were ranked th, th and th by the lstm, while the cnn model ranked them rd, th and th, respectively. other studies have also focused on the treatment of sars-cov- by drugs predicted in our experiments. wang et al. discovered that nitazoxanide (lstm rank th, cnn rank th) inhibited sars-cov- at a low-micromolar concentration [ ] . gordon th) is known to be effective against diverse coronaviruses [ ] . such observations are encouraging. they demonstrate that predictive models may have value in identifying potential therapeutics that merit priority for advanced clinical trials. they also add to growing observations that support using ml to streamline drug discovery. from that perspective, our models suggest that the broad spectrum antiviral tilorone, for instance, may be a top candidate for covid- clinical trials in the near future. other candidates highlighted by our results and may merit further studies are brincidofovir, foscarnet, artesunate, cidofovir, valacyclovir and ganciclovir. the antivirals identified here have some discrepancies with emerging research findings as well. for instance, our models did not highlight the widely available anti-parasitic ivermectin. 
one research study observed that ivermectin could inhibit the replication of sars-cov- in vitro [ ] . another large-scale drug repositioning survey screened a library of nearly , drugs and identified six candidate antivirals for sars-cov- : pikfyve kinase inhibitor apilimod, cysteine protease inhibitors mdl- , z lvg chn , vby- , and ono , and the ccr antagonist mln- [ ] . it comes as no surprise that our models did not identify these compounds as our data sources did not contain them. future efforts to strengthen our ml models will thus require us to integrate a growing bank of novel data from emerging research findings into our ml pipeline. in terms of our machine learning models, better feature extraction can improve predictions drastically. this step involves improvements through better data engineering and working with domain experts who are familiar with applied bioinformatics to better understand the nature of our data and find ways to improve our data processing pipeline. some proposals for future work that could strengthen the performance of our machine learning process are as follows: . deeper interaction with domain experts and further lab testing would lead to a better understanding of the antivirals and the amino-acid sequences they target, leading to building better ml pipelines for drug repurposing. . better handling of duplicates can improve the quality of data available. the current approach (which is based on species and sequence length) can be improved through using string similarity measures such as dice coefficient, cosine similarity, levenshtein distance etc. . influenza and hiv datasets should be integrated into the data generation and processing pipeline to enhance available data. . vectorizers can be used to extract features as n-grams (small sequences of chars), which has attained success in similar problems [ ] . other unsupervised learning methods such as singular value decomposition also may be applicable to our study [ ] . 
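the n-gram feature extraction proposed in the future-work list can be sketched in pure python; the sequences are hypothetical, and in practice a vectorizer such as scikit-learn's CountVectorizer with analyzer="char" would perform this at scale over the whole corpus.

```python
def char_ngrams(seq: str, n: int = 3) -> list:
    """Extract overlapping character n-grams (tripeptide-like
    features for n=3) from an amino-acid sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

# two hypothetical fragments differing only in the final residue
print(char_ngrams("MKVLA"))  # -> ['MKV', 'KVL', 'VLA']
print(char_ngrams("MKVLS"))  # -> ['MKV', 'KVL', 'VLS']
```

counting these n-grams per sequence yields a sparse feature matrix on which simpler models, or decompositions such as svd, can be applied.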
We hope that the machine learning approaches and pipelines developed here may provide long-term benefit to public health. The promise our results show in streamlining drug discovery for SARS-CoV-2 motivates us to adapt our current models to conduct identical drug repurposing assessments for other known viruses. Moreover, experimental data suggest that our approaches generalize to other viruses (see the HSV example in Experiment II); we are therefore confident that we could adapt our models to conduct equivalent studies during the next outbreak of a novel virus. This also means our methods can be used to repurpose existing drugs to find more potent treatments for known viruses. The direct beneficiaries of our findings are members of the clinical research community. Using relatively few resources, ML-guided drug repurposing technology can help prioritize clinical investigations and streamline drug discovery. In addition to reducing costs and expediting clinical innovation, such efficiency gains may reduce the number of clinical trials (and thus human subjects exposed to risky research) needed to find effective treatments, which speaks to the ethical imperative to avoid harm where possible. Also of importance is that in silico analyses using machine learning provide yet another means to employ past research findings in new investigations. ML-guided drug repurposing thus offers a way to obtain further value from knowledge on hand; maximizing value in this case is laudable on many fronts, especially in terms of deriving maximum benefit from publicly funded research. The negative consequences that could arise should our models fail appear limited, but are noteworthy nonetheless. Note that our models aim only to indicate possible therapeutics that merit further clinical investigation in order to prove any antiviral activity against SARS-CoV-2.
Should our models fail by recommending spurious treatments, these incorrect predictions may divert limited time and resources towards fruitless investigations. It should also be noted that our methods are meant primarily as guidance for medical experts, not as a be-all-end-all solution, and any incorrect inferences made by our models are likely to be detected early by those experts. Communicating any machine-learning predictions of tentative antiviral drugs from this study requires much caution. The current pandemic continues to demonstrate how fear, misinformation and a lack of knowledge about a novel communicable disease can encourage counterproductive health-seeking behaviour amongst the public. Soon after the coronavirus became a widely understood threat, the internet was awash in false, sometimes downright harmful, information about preventing and treating COVID-19. Included within this misleading health information were premature claims by some prominent government officials that therapeutics like chloroquine and hydroxychloroquine might hold promise as repurposed drugs for COVID-19. Such unfounded advice caused avoidable poisonings from people self-medicating with chloroquine. Subsequent clinical investigations demonstrated no notable benefit, and potential adverse reactions, when chloroquine was used to treat COVID-19. Such unfortunate events remind us that preliminary findings may be misinterpreted as conclusive treatments or as evidence to support inconclusive health claims. The hyperparameters tested in our experiments are presented in Section F. It is certainly possible to improve the accuracies of our experiments by covering more of the loss landscape through more extensive training (e.g., running longer experiments with smaller learning rates on more complex network architectures), especially for the results of Experiment II.
However, due to performance constraints, the scope of hyperparameter tuning and the ANN architectures we experimented with are relatively constrained, as we focused on methodology rather than optimal performance in this study. Much improvement is possible on this front, as pointed out in the Discussion and Future Work. Additional notes from our observations during hyperparameter tuning are presented below.
• For the threshold, we wanted to predict eagerly, i.e., we considered false negatives more costly than false positives. A high threshold would mean the outputs contain only the antivirals our models are very confident about for each amino acid sequence. We deem this undesirable: while we do hope these outputs narrow the scope of antivirals to focus on, over-restricting could prevent antivirals that are predicted frequently yet with low probability from being detected. A low threshold such as . filtered the number of antivirals sufficiently, but also left enough breathing room for domain experts to draw their own conclusions on a per-drug basis.
• While a larger sequence-length cutoff was possible and not detrimental to the results, we deemed our choice a suitable trade-off between performance and accuracy, as many sequences do not reach lengths in the thousands to begin with.
• As mentioned, the number of training epochs could be increased, as we did not see dramatic signs of overfitting at the chosen number of epochs or beyond. However, a flattening of the metrics was evident around that point with the hyperparameters listed, which was therefore selected as a suitable stopping point.
Table : a selection of sample outputs for amino acid sequences and their associated antivirals. Post-processing outputs a list of selected drugs along with the respective probabilities of each drug being "effective" against the virus with the given amino acid sequence.
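The thresholded post-processing step described above can be sketched as follows. The drug names, probabilities and the 0.1 threshold are invented for illustration (the study's exact threshold value did not survive extraction).

```python
# Post-processing sketch: keep every antiviral whose predicted probability
# clears a low threshold, so frequent-but-uncertain candidates are not
# discarded. All values below are hypothetical.

def select_candidates(probs: dict, threshold: float = 0.1):
    """Return (drug, probability) pairs above the threshold, best first."""
    hits = [(drug, p) for drug, p in probs.items() if p >= threshold]
    return sorted(hits, key=lambda t: t[1], reverse=True)

model_output = {"tilorone": 0.62, "ribavirin": 0.35,
                "drug_x": 0.12, "drug_y": 0.04}
candidates = select_candidates(model_output, threshold=0.1)
```

With a high threshold only "tilorone" would survive; the low threshold also retains "ribavirin" and "drug_x" for expert review, which is the behaviour the bullet above argues for.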
References:
- Survey of machine learning techniques in drug discovery
- Drug repositioning: a machine-learning approach through data integration
- Triple combination of interferon beta-1b, lopinavir-ritonavir, and ribavirin in the treatment of patients admitted to hospital with COVID-19: an open-label, randomised, phase 2 trial
- Cyclosporin A inhibits the replication of diverse coronaviruses
- A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
- Remdesivir and chloroquine effectively inhibit the recently emerged novel coronavirus (2019-nCoV) in vitro
- Discovery and development of safe-in-man broad-spectrum antiviral agents
- NCBI viral genomes resource
- Drug repurposing using deep embeddings of gene expression profiles
- deepDR: a network-based deep learning approach to in silico drug repositioning
- Protein family classification with neural networks
- DeepSF: deep convolutional neural network for mapping protein sequences to folds
- Using attribution to decode binding mechanism in neural network models for chemistry
- Tilorone: a broad-spectrum antiviral invented in the USA and commercialized in Russia and beyond
- A systematic review of lopinavir therapy for SARS coronavirus and MERS coronavirus: a possible reference for coronavirus disease 2019 (COVID-19) treatment option
- Corona virus drugs: a brief overview of past, present and future
- The FDA-approved drug ivermectin inhibits the replication of SARS-CoV-2 in vitro
- A large-scale drug repositioning survey for SARS-CoV-2 antivirals
- Near-perfect protein multi-label classification with deep neural networks
- Neural networks for full-scale protein sequence classification: sequence encoding with singular value decomposition
We would like to thank the administrators of the DrugVirus and NCBI Virus portals for providing the datasets that are central to this study.
We appreciate comments on preliminary drafts of this manuscript from Dr Tariq Daouda of the Massachusetts General Hospital, Broad Institute and Harvard Medical School. The authors declare they will not obtain any direct financial benefit from investigating and reporting on any given pharmaceutical compound. This study is funded by the authors' employer, Zetane Systems, which produces software for AI technologies implemented in industrial and enterprise contexts.
C Database profile
C. Virus counts before dropping duplicate sequences
key: cord- -buc dd y
authors: Dong, Rui; He, Lily; He, Rong Lucy; Yau, Stephen S.-T.
title: A novel approach to clustering genome sequences using inter-nucleotide covariance
date: - -
journal: Front Genet
doi: . /fgene. .
cord_uid: buc dd y
Classification of DNA sequences is an important issue in bioinformatics, yet most existing methods for phylogenetic analysis, including multiple sequence alignment (MSA), are time-consuming and computationally expensive. Alignment-free methods are popular nowadays, although the manual intervention they require usually decreases accuracy, and the interactions among nucleotides are neglected in most methods. Here we propose a new accumulated natural vector (ANV) method which represents each DNA sequence by a point in ℝ^18. By calculating the accumulated indicator functions of the nucleotides, we can find an accumulated natural vector for each sequence. This new accumulated natural vector not only captures the distribution of each nucleotide, but also provides the covariances among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in ℝ^18. Tests of ANV on datasets of different sizes and types have demonstrated the accuracy and time-efficiency of the newly proposed method. With the rapid development of next-generation sequencing technology, more and more genome sequence information is available.
Studying sequence similarity is a crucial question in research and can explain phylogenetic relationships through the construction of trees. One of the most commonly used methods, multiple sequence alignment (MSA), uses dynamic programming, a technique that finds an optimal alignment by assigning scores to different possible alignments and taking the one with the highest score (Yu et al., a). However, the computational cost of MSA is extremely high, and MSA may not produce accurate phylogenies for diverse systems spanning different families of RNA viruses (Yu et al., b). Alignment-free approaches have been developed to overcome those limitations. Published alignment-free methods include Markov chain models (Apostolico and Denas, ), chaos theory (Hatje and Kollmar, ), and other methods based on the statistics of oligomer frequencies for a fixed segment length, known as k-mers (Sims et al., ). Yau and his team proposed the natural vector method, which takes the position of each nucleotide into consideration. The natural vector method performs well on many datasets (Deng et al., ; Yu et al., b; Hoang et al., ; Li et al., ); however, it only considers the number, average position and positional dispersion of each nucleotide. Relationships between nucleotides are also important, especially when function may depend on nucleotide interactions, such as in the folding of a chromosome. In this paper, we propose a new accumulated natural vector (ANV) method which considers not only the basic properties of each nucleotide but also the covariances between them. In the traditional natural vector (NV) method, each sequence is uniquely represented by a single point in ℝ^12. The traditional natural vector approach was first introduced in Deng et al. ( ): for a sequence of length $n$, $n_\alpha$ ($\alpha \in \{A, C, G, T\}$) denotes the number of nucleotide $\alpha$ in the sequence.
$s[\alpha][v]$ is the distance from the first nucleotide (regarded as the origin) to the $v$-th nucleotide $\alpha$ in the DNA sequence. $T_\alpha = \sum_{v=1}^{n_\alpha} s[\alpha][v]$ denotes the total distance of each set of A, C, G, T from the origin, $\alpha \in \{A, C, G, T\}$. $\mu_\alpha = T_\alpha / n_\alpha$ is the mean distance of nucleotide $\alpha$ from the origin. $D_\alpha = \sum_{v=1}^{n_\alpha} (s[\alpha][v] - \mu_\alpha)^2 / (n_\alpha n)$ is the normalized central moment of order 2, which can also be seen as the variance of the positions of nucleotide $\alpha$. Therefore, a DNA sequence can be represented by the 12-dimensional vector $(n_A, n_C, n_G, n_T, \mu_A, \mu_C, \mu_G, \mu_T, D_A, D_C, D_G, D_T)$. In this paper, we propose an accumulated natural vector approach, which projects each sequence to a point in ℝ^18, where the additional six dimensions describe the covariances between nucleotides. The ANV therefore provides more information than the traditional NV method, and it does not require human intervention such as choosing the optimal value of k in the k-mer method. It can therefore distinguish different sequences and classify species into the correct clusters with higher accuracy and lower time cost. The following six datasets were used to validate the method. The coronaviruses dataset includes viral genomes, of which are from the exact same dataset as (Woo et al., ; Yu et al., ; Hoang et al., ) and the other two are new members of coronavirus. The second dataset consists of the genomes of influenza A viruses, a classic dataset for testing whether a newly proposed method performs well. The third dataset includes viruses from Zheng et al. ( ), which focuses on the classification of ebolaviruses. The fourth is from our colleagues' previous paper (Li et al., ) and includes viruses chosen randomly under some criteria. The fifth comprises the mitochondrial genomes of mammals, which can be clustered into seven well-known categories. All sequence materials can be found on NCBI, with the reference numbers provided in the Appendices.
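The traditional 12-dimensional natural vector just described can be transcribed directly, assuming 1-based positions. This is an illustrative sketch of the quoted definitions, not the authors' code.

```python
# Traditional natural vector (count, mean position, normalized second
# central moment for each of A, C, G, T), per the definitions above.

def natural_vector(seq: str):
    n = len(seq)
    per_base = []
    for alpha in "ACGT":
        pos = [i + 1 for i, b in enumerate(seq) if b == alpha]
        n_a = len(pos)                                 # n_alpha
        mu = sum(pos) / n_a                            # mean position
        d2 = sum((p - mu) ** 2 for p in pos) / (n_a * n)  # D_alpha
        per_base.append((n_a, mu, d2))
    counts, means, moments = zip(*per_base)
    return list(counts) + list(means) + list(moments)

v = natural_vector("ATCTAGCT")   # the worked example from the text
```

For "ATCTAGCT" this yields counts (2, 2, 1, 3) for A, C, G, T, with A at positions 1 and 5, hence mean position 3 and second moment 8/16 = 0.5.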
We also generated different mutations by simulation in a DNA sequence and constructed phylogenetic trees of the simulated sequences to test our ANV method. All computations in this paper were done on a Dell laptop with an Intel i processor under Windows Home Premium with GB of RAM, together with MATLAB (version R a) and MEGA X. For a given genomic sequence, we first define four indicator functions $u_\alpha$ for adenine, cytosine, guanine and thymine, respectively: $u_\alpha(i) = 1$ if $\alpha$ appears at the $i$-th position of the sequence, and $u_\alpha(i) = 0$ if $\alpha$ does not appear at the $i$-th position, where $\alpha \in \{A, C, G, T\}$ and $i = 1, 2, \ldots, n$. Here $n$ is the length of the whole sequence. For example, if the genomic sequence is "ATCTAGCT", then the four indicator functions are shown in Table . Two simple properties of the indicator functions: (1) each column sums to 1; (2) each row sums to the count of the corresponding nucleotide. We now define four accumulated indicator functions $\tilde{u}_\alpha(i) = \sum_{j=1}^{i} u_\alpha(j)$. The four accumulated indicator functions for the example above ("ATCTAGCT") are shown in Table . Properties of the accumulated indicator functions: (1) the $i$-th column sums to $i$, i.e. $\sum_{\alpha \in \{A,C,G,T\}} \tilde{u}_\alpha(i) = i$; (2) the last entry $\tilde{u}_\alpha(n)$ is the total number of nucleotide $\alpha$ in the sequence; (3) the sum of the entries of $\tilde{u}_\alpha$ recovers the average position $\mu_\alpha$ of the natural vector in Deng et al. ( ). Properties (1) and (2) follow easily from the definitions of the indicator function $u_\alpha$ and the accumulated indicator function $\tilde{u}_\alpha$. We now prove Property (3), which establishes the relationship between the accumulated indicator function and the average position of a specific nucleotide. If we assume that the positions of nucleotide $\alpha$ are $t_1, t_2, \ldots, t_{n_\alpha}$, where $n_\alpha$ is the number of nucleotide $\alpha$ in the sequence, then the accumulated indicator function takes the value $k$ on the positions from $t_k$ up to (but not including) $t_{k+1}$, with $t_1 < t_2 < \ldots < t_{n_\alpha} \le n$.
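The accumulated indicator function and Property (3) can be checked numerically on the "ATCTAGCT" example. The identity verified below ($\tilde{\mu}_\alpha = n + 1 - \mu_\alpha$, where $\tilde{\mu}_\alpha$ is the column sum divided by $n_\alpha$) is our reading of the garbled derivation, so treat it as a reconstruction.

```python
from itertools import accumulate

# Accumulated indicator function: running count of nucleotide `alpha`.
def accumulated(seq: str, alpha: str):
    return list(accumulate(1 if b == alpha else 0 for b in seq))

seq = "ATCTAGCT"
n = len(seq)
ua = accumulated(seq, "A")                  # running count of A

# Average position of A (1-based), as in the traditional natural vector.
positions = [i + 1 for i, b in enumerate(seq) if b == "A"]
mu = sum(positions) / len(positions)

# Property (3), reconstructed: sum of the accumulated function divided by
# the count of A gives the average position measured from the sequence end.
mu_from_end = sum(ua) / ua[-1]
```

Here A occurs at positions 1 and 5, so mu = 3, and the accumulated array [1, 1, 1, 1, 2, 2, 2, 2] sums to 12, giving mu_from_end = 6 = n + 1 - mu.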
Adding up those $n$ values and writing $t_0 = 0$, we have $\sum_{i=1}^{n} \tilde{u}_\alpha(i) = \sum_{k=1}^{n_\alpha} (n - t_k + 1) = n_\alpha (n + 1) - n_\alpha \mu_\alpha$. Therefore, we use $\tilde{\mu}_\alpha = \frac{1}{n_\alpha} \sum_{i=1}^{n} \tilde{u}_\alpha(i) = (n + 1) - \mu_\alpha$ to describe the average position of nucleotide $\alpha$, which indicates the distance of the average position from the end of the sequence. For two finite point sets with an equal number of elements, $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$ in ℝ, satisfying $a_1 < a_2 < \ldots < a_n$ and $b_1 < b_2 < \ldots < b_n$, the covariance of the two sets can be defined as $\mathrm{cov}(A, B) = \frac{1}{n} \sum_{i=1}^{n} (a_i - u_A)(b_i - u_B)$, where $u_A = \sum_{i=1}^{n} a_i / n$ and $u_B = \sum_{i=1}^{n} b_i / n$. We now apply this covariance formula to the accumulated indicator functions. A set is a collection of definite, distinct objects, known as the elements or members of the set. For each nucleotide we have an array of $n$ elements, the accumulated indicator function for nucleotide $\alpha \in \{A, C, G, T\}$: $[0, \ldots, 0, 1, \ldots, 1, 2, \ldots, 2, \ldots, n_\alpha, \ldots, n_\alpha]$. However, those $n$ elements cannot form a set of $n$ elements, since many of them are replicated. Hence we extend the definition of a set to a generalized concept in which elements may repeat. With this generalized definition, each nucleotide has a set of $n$ elements which can be arranged in ascending order, i.e., from smallest to largest, and the covariance formula above applies. As an example, for the sequence "ATCTAGCT", the covariance of nucleotides A and C is computed as follows: the generalized set of A is {1, 1, 1, 1, 2, 2, 2, 2} and that of C is {0, 0, 1, 1, 1, 1, 1, 2}. Each generalized set has $n = 8$ elements, and the generalized covariance follows from the formula above. Similarly, we can obtain cov(A, G), cov(A, T), cov(C, G), cov(C, T) and cov(G, T). For two nucleotides $\alpha$ and $\beta$, the covariance formula is $\mathrm{cov}(\alpha, \beta) = \frac{1}{n} \sum_{i=1}^{n} (\tilde{u}_\alpha(i) - \bar{u}_\alpha)(\tilde{u}_\beta(i) - \bar{u}_\beta)$, where $\bar{u}_\alpha = \sum_{i=1}^{n} \tilde{u}_\alpha(i) / n$. It is then obvious that when $\alpha = \beta$, the corresponding formula is $D_\alpha = \frac{1}{n} \sum_{i=1}^{n} (\tilde{u}_\alpha(i) - \bar{u}_\alpha)^2$, which defines the variance of the positions of nucleotide $\alpha$. For a given nucleotide sequence, we can now build up its accumulated natural vector.
The first four dimensions describe the number of each nucleotide, denoted $n_A, n_C, n_G, n_T$, which are the last entries of the accumulated indicator functions. The second four dimensions describe the average distance of each nucleotide to the end of the sequence, denoted $\tilde{\mu}_A, \tilde{\mu}_C, \tilde{\mu}_G, \tilde{\mu}_T$. The third four dimensions describe the divergence of each nucleotide, denoted $D_A, D_C, D_G, D_T$. Please note that this $D_\alpha$ differs slightly from the $D_\alpha$ of the traditional natural vector method, since the previous definition of variance cannot be extended to a reliable definition of covariance. The last six dimensions describe the covariances between each pair of nucleotides, denoted cov(A, C), cov(A, G), cov(A, T), cov(C, G), cov(C, T), cov(G, T). The universal form of the accumulated natural vector is therefore an 18-dimensional vector. The preceding sections describe how a DNA sequence is represented by a vector in ℝ^18; the distance between two sequences can then be measured by the Euclidean distance between the two vectors. Suppose we have two sequences represented in $\mathbb{R}^w$ (in our case, $w = 18$), denoted $x = (x_1, \ldots, x_w)$ and $y = (y_1, \ldots, y_w)$; the Euclidean distance between them is $d(x, y) = \sqrt{\sum_{i=1}^{w} (x_i - y_i)^2}$. For a dataset of $m$ different sequences, we can construct a distance matrix $D = (d_{ij})_{m \times m}$, where $d_{ij} (\ge 0)$ represents the Euclidean distance between sequence $i$ and sequence $j$; $D$ is symmetric and its diagonal elements are zero. In this research, we use MEGA X to build the phylogenetic trees. To eliminate the influence of different tree-construction algorithms, we apply the unweighted pair group method with arithmetic mean (UPGMA) algorithm (Sneath and Sokal, ) for the analysis of the four datasets. For comparison with other common alignment and alignment-free methods, we also run k-mer and MSA (ClustalW or MUSCLE) on the same datasets.
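Putting the pieces together, the full 18-dimensional ANV and the Euclidean distance between two vectors can be sketched as below. The 1/n covariance normalization is an assumption, and the average-from-end term assumes every base occurs at least once.

```python
from itertools import accumulate
from math import sqrt

def anv(seq: str):
    """18-dim accumulated natural vector: 4 counts, 4 average distances
    to the sequence end, 4 variances, 6 pairwise covariances (sketch)."""
    order = "ACGT"
    acc = {a: list(accumulate(1 if b == a else 0 for b in seq)) for a in order}
    n = len(seq)
    mean = {a: sum(acc[a]) / n for a in order}
    def cov(a, b):
        return sum((x - mean[a]) * (y - mean[b])
                   for x, y in zip(acc[a], acc[b])) / n
    counts = [acc[a][-1] for a in order]
    avg_end = [sum(acc[a]) / acc[a][-1] for a in order]  # assumes count > 0
    variances = [cov(a, a) for a in order]
    pairs = [("A", "C"), ("A", "G"), ("A", "T"),
             ("C", "G"), ("C", "T"), ("G", "T")]
    return counts + avg_end + variances + [cov(a, b) for a, b in pairs]

def euclidean(x, y):
    """Distance between two ANV points; feeds the m-by-m distance matrix."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

v = anv("ATCTAGCT")
```

The resulting matrix of pairwise `euclidean` values is exactly the symmetric, zero-diagonal D that UPGMA consumes.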
The feature frequency profile (FFP) method (Woo et al., ), which is based on k-mer frequencies, calculates the frequency of each k-mer in the sequence and turns a DNA sequence into a vector in a 4^k-dimensional space. The Euclidean distance between two k-mer vectors can be computed by the same distance formula. We apply the MSA method ClustalW to several datasets as well, with the default parameters in MEGA X. ClustalW is much slower than another MSA algorithm, MUSCLE, but ClustalW can give better results. MUSCLE is applied to the fourth dataset of viruses; after obtaining the alignment, the distance matrix is calculated using the Hamming distance to find the nearest neighbor of each virus. The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. It measures the minimum number of substitutions required to change one string into the other, or equivalently the minimum number of errors that could have transformed one string into the other. Since alignment approaches arrange the sequences to identify regions of similarity, the alignment places every sequence on a fixed number of positions; the Hamming distance can therefore be calculated by simply counting the number of pairwise differences in character states. For the simulated dataset, we use the pairwise alignment distance from the "seqpdist" function in the MATLAB Bioinformatics Toolbox, which uses the Jukes-Cantor algorithm, as the reference tree, since the sequences were simulated from a base sequence. The resulting distance matrices are then compared using Robinson-Foulds distances, which measure congruence with the reference topology. We apply the accumulated natural vector method to five datasets and compare the results with common methods such as MSA, k-mer (FFP) and the traditional natural vector method.
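The Hamming distance on aligned, equal-length sequences described above is a one-liner:

```python
# Hamming distance: number of positions at which two equal-length
# (e.g. aligned) sequences differ.

def hamming(s1: str, s2: str) -> int:
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))
```

For example, "ACGTACGT" and "ACGAACGA" differ at two positions. Applied to every pair of aligned sequences, this fills the same kind of symmetric distance matrix used for the alignment-free methods.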
From this comparison, the results of the accumulated natural vector are more accurate, and its calculation cost is very small compared to the others. A dataset of viruses was also tested; a laptop cannot bear the heavy computational burden of aligning them, but the alignment-free analysis can still be done in a reasonable time. We also used a server to align segments of the sequences and compared the results with ANV and the other methods; ANV again gave the best performance on this dataset. Besides, we simulated another dataset of sequences from a randomly generated base sequence of length , bp and compared the phylogenetic trees from this and other methods. We chose datasets of different sizes (numbers of sequences and lengths of sequences) to test whether ANV is suitable in all cases. Most of the datasets have been analyzed in previous research, so we can compare our results with others to evaluate performance. Four datasets consist of viruses closely related to human health, while the mammal dataset and the simulated dataset show that this method performs on other types of sequences as well. Coronavirus belongs to the subfamily Coronavirinae in the family Coronaviridae, in the order Nidovirales. In this paper, we construct a dataset of coronaviruses, of which are from the exact same dataset as (Woo et al., ; Yu et al., ; Hoang et al., ). The other two viruses are new members of coronavirus. Details of the coronaviruses can be found in Table S . The new ChinaGD virus (Lu et al., ) was identified in Guangdong Province (China) and is an imported Middle East respiratory syndrome coronavirus; the other, MERS-CoV/KOR, is from South Korea (Kim et al., ). As of June 2015, MERS-CoV was spreading in South Korea, and the ChinaGD case was a South Korean national who had traveled to Guangdong in May 2015. Therefore, these two members were considered highly correlated with each other.
The genome size of coronaviruses ranges from about to kbp, with an average of , nucleotides. Using our accumulated natural vector and the UPGMA method (Sneath and Sokal, ), we can build a phylogenetic tree as shown in Figure . Figure shows that the two new members cluster together with group , which is well known as SARS (severe acute respiratory syndrome). Between November 2002 and July 2003, an outbreak of SARS in southern China caused an eventual , cases, resulting in deaths reported across countries. Both MERS-CoV and SARS viruses are betacoronaviruses; however, they belong to different lineages (for more details see Drexler et al., ; Hilgenfeld and Peiris, ). The phylogenetic tree indicates that ChinaGD and MERS-CoV/KOR form a monophyletic clade, sister to the SARS clade, and may possibly be variants of some SARS viruses. We also performed the same procedure with the k-mer method on the coronaviruses dataset. However, choosing an optimal k-value is a question that requires manual intervention. Sims et al. showed in Woo et al. ( ) that the location of the peak in the distribution of k-mers, i.e., the k with the largest vocabulary, is related to the sequence length n. The k with maximum information is determined empirically but may be closely approximated as a function of n and the alphabet size. They have shown in Sims et al. ( ) that reliable tree topologies are typically obtained with k-mer resolutions k > k_hmax, whereas lengths below k_hmax yield unreliable trees. The upper limit of resolution can be determined empirically by the criterion that the tree topology for feature length k equals that for k + 1, i.e., the tree topologies converge. According to this principle, we have ≤ k ≤ . We show the result for k = in Figure A; the results for the other values of k are in Figures S and S . The four outgroup viruses could not be clustered together as a separate branch of the coronavirus tree, and group was divided into smaller groups.
The traditional ClustalW algorithm for multiple sequence alignment (MSA) was also applied to the same dataset, and the result is shown in Figure B. MSA could not cluster viruses from the same groups together either. From this example, we can see that our ANV method outperforms the k-mer and MSA methods. Influenza A viruses are single-stranded RNA viruses that have been a major health threat to both human society and animals. Influenza A nomenclature is based on the surface glycoproteins hemagglutinin (HA) and neuraminidase (NA) (Obenauer et al., ). HA has subtypes and NA has subtypes, which form different combinations. The NCBI numbers of the analyzed influenza A viruses can be found in Table S . Our result agrees with previous work by Hoang et al. ( ). Furthermore, we find that all influenza A viruses with the same H and N type are clustered together in Figure , with the single exception of A/turkey/Minnesota/ / . There is no specific research on this virus, and we infer that it may be an intermediate between the two subtypes discussed here. The first subtype had an outbreak in July , infecting millions of poultry, though there is no report yet of human-to-human transmission. The second subtype was identified in Shanghai, China at the end of March . Considering that the HA glycoproteins of the two subtypes are the same, and that the outbreak dates are close, we suggest that the later virus might be a variant of the earlier one, with A/turkey/Minnesota/ / playing a key role in this variation. We reached the same conclusion in other work as well (Dong et al., ). More biological research on this virus should be done to deepen our understanding of influenza A viruses, accelerate the invention of an effective vaccine, and prevent more dangerous variants.
In Figure A, the viruses from the two subtypes are mixed up with each other, while MSA gives an even worse result in Figure B. The k-value was determined by the same procedure as for the coronaviruses dataset. These results indicate that k-mer and MSA cannot reveal the real relationships among the viruses. To obtain a direct image of the relationships between influenza A viruses, we draw their natural graph. The natural graph was first introduced by Zheng et al. ( ). In Figure , the blue lines represent the first-level connected components and the red ones the second level. Classes are marked in different colors, and it is evident that after the construction of two levels, the influenza A viruses with the same H and N types are clustered together, including A/turkey/Minnesota/ / . The two related subtypes are clustered together at the second level, indicating their closer relationship, which accords with our previous conjecture. To illustrate that the newly proposed ANV method is an important improvement over the traditional natural vector method, an ebolaviruses dataset is tested, a subset of the viruses used in Zheng et al. ( ). It consists of Ebola virus (EBOV), Sudan virus (SUDV), Reston virus (RESTV), Taï Forest virus (TAFV), Bundibugyo virus (BDBV), Marburg virus (MARV) and Lloviu virus (LLOV). Details of this dataset are shown in Table S . In Figure A, the phylogenetic tree from the novel accumulated natural vector method classifies all viruses into the right groups; however, in Figure B, the traditional natural vector method divides the EBOV class into two clusters, and SUDV is misclassified with some EBOV viruses. This indicates that including the covariances between nucleotides helps improve classification accuracy, and is thus an important improvement over the traditional natural vector and other alignment-free methods. We also test a large dataset of viruses from Li et al. ( ); the details of this dataset can be found in Table S .
The average sequence length is , nucleotides, which makes alignment methods infeasible on a laptop; only a server or cloud computing could finish such a task. Here we use the -nearest-neighbor ( -NN) method (Li et al., ) to assess prediction accuracy. This evaluation is inspired by the high rate of missing labels in many virus databases. For example, if a virus with a missing family label is added to a database, it should share the family label of the virus (already stored in the database) that is closest to it, so we can predict the missing family label from its nearest neighbor. Therefore, for a dataset with no missing labels, we can count how many viruses share the same label as their nearest neighbor. The "nearest neighbor" of a specific virus is defined, for the alignment-free methods, as the virus in the dataset with the smallest Euclidean distance to it; for alignment results, we use the Hamming distance to measure the distance between two sequences. If a virus shares its label with its neighbor, we count it as "correct", since even if its label were missing we could still predict it from its nearest neighbor. The accuracy is computed by dividing the number of correct viruses by the total number of viruses. We compare the ANV results with the k-mer method, since both are alignment-free, and the results are shown in Table . The optimal choice of k was made by the same procedure as for the other datasets. From Table , it is evident that ANV achieves much higher accuracy than the k-mer method while using much less time. Thus, ANV is suitable for practical use with high time-efficiency and high accuracy. For the alignment in this part, we tried to align all the sequences at full length on our server, but this failed to give a reliable result. We therefore extracted , bp from the beginning of each sequence and aligned these segments, all of length , bp.
The results are shown in Table as well, and the accuracy is still not as good as that of ANV. Our accumulated natural vector performs well not only on virus datasets, but also on other common species. We extracted mammalian mitochondrial genomes with an average length of , nucleotides; their NCBI numbers can be found in Table S . The genomes come from seven known clusters: Primates, Carnivora, Cetartiodactyla, Perissodactyla, Eulipotyphla, Lagomorpha and Rodentia. The accumulated natural vector method can still distinguish the seven clusters, as shown in Figure A. The FFP (k-mer) method was also tested (the optimal k-value for this dataset is ), as shown in Figure B. Since the species included differ between papers, it is hard to compare whole tree topologies; nevertheless, our result shows only small differences from the previous work of Murphy et al. ( ) and Tarver et al. ( ). The differences can be attributed to the fact that mitochondrial genomes in mammals may not always reflect the organismal evolutionary history (Morgan et al., ). Our tree still retains more information than the k-mer tree in Figure B: since the distances within each group are smaller than the distances among groups, we can distinguish the clusters in the current dataset. Ladoukakis and Zouros ( ) point out that most of the information researchers have gained about the tree of life through mtDNA remains valid, while more attention should be paid to its role in the function of the organism and its value as a tool in the study of major evolutionary novelties in the history of life. Therefore, the result implies that our ANV method can capture the key information hidden inside DNA sequences and gives a reliable topology among mammals.
to verify whether the similarity distance given by our method can be used to cluster dna sequences effectively, we also generated different mutations in dna sequences and constructed phylogenetic trees by various methods. we simulated a sequence of length , bp as a base sequence, and generated two new sequences named "a_original" and "b_original" using point mutations. both a and b have nucleotides different from the original sequence. we then similarly evolved a and b into different mutants using four kinds of mutations (substitution, deletion, insertion, and transposition), as was done in yin et al. ( ) . table gives a detailed description of the simulated dna sequences with different mutations. since the sequences are only slightly mutated versions of an exon sequence, we take the alignment result as the "correct" relationships among the sequences; the alignment is done by the "seqpdist" function in the matlab bioinformatics toolbox. this function uses the classical jukes-cantor algorithm, and we calculate the pairwise alignment distance. for comparison, we use the anv method and the ffp method (we test k = , , in this case, since the lengths of the sequences are about , bp). the upgma trees of the alignment, anv, and ffp (k= ) methods are shown in figures , a, b , respectively. among these trees, it is not obvious which one is more similar to the alignment result, so we calculate the robinson-foulds distances between each distance matrix and the "correct" matrix; the results are shown in table . here we apply the program named "robinson-foulds" (robinson and foulds, ) when computing table . the simulated dataset is in table s . the differences among the trees mainly lie in the branch of sequences generated from b, and anv gives the more similar result: its ordering is only slightly disorganized by b and the transpositional sequences, whereas in figure b the whole branch of b differs from the alignment result. 
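the four kinds of simulated mutations used above can be sketched as simple string operations. this is a minimal sketch of the mutation model, with our own function names and parameters; the authors' exact mutation counts and positions are not specified here.

```python
import random

BASES = "ACGT"

def substitute(seq, n, rng):
    """point-mutate exactly n positions to a different base."""
    s = list(seq)
    for i in rng.sample(range(len(s)), n):
        s[i] = rng.choice([b for b in BASES if b != s[i]])
    return "".join(s)

def delete(seq, start, length):
    """remove a segment of the given length."""
    return seq[:start] + seq[start + length:]

def insert(seq, pos, fragment):
    """insert a fragment at the given position."""
    return seq[:pos] + fragment + seq[pos:]

def transpose(seq, start, length, dest):
    """cut a segment out and reinsert it at a new position."""
    frag = seq[start:start + length]
    rest = seq[:start] + seq[start + length:]
    return rest[:dest] + frag + rest[dest:]
```

applying these operators to "a_original" and "b_original" yields the families of mutants whose pairwise distances the trees are built from.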
in this paper, we propose a novel vector, the accumulated natural vector, to analyze sequences, genomes, and their phylogenetic relationships. the results of our analysis largely agree with earlier studies, which indicates that our approach can detect the similarities and differences among sequences. therefore, phylogenetic trees can be constructed accurately from sequence data alone in a very reasonable time, without large computing platforms or costly biological experiments. our method can be applied to a global comparison of all genomes and provides a powerful new tool by including the correlations of nucleotides. we are working on extending the anv method to protein sequences; however, for a protein sequence it would produce an , -dim vector for each sequence, and the computational cost of this is too large under current technology. computing the covariance for three amino acids at a time may be more reasonable, since three consecutive nucleotides also form a codon in the coding region of a sequence. 
fast algorithms for computing sequence distances by exhaustive substring composition
a novel method of characterizing genetic sequences: genome space with biological distance and applications
a new method to cluster genomes based on cumulative fourier power spectrum
ecology, evolution and classification of bat coronaviruses in the aftermath of sars
a phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method
from sars to mers: years of research on highly pathogenic human coronaviruses
numerical encoding of dna sequences by chaos game representation with application in similarity comparison
a new method to cluster dna sequences using fourier power spectrum
complete genome sequence of middle east respiratory syndrome coronavirus kor/knih/ _ _ , isolated in south korea
evolution and inheritance of animal mitochondrial dna: rules and exceptions
virus classification in -dimensional protein space
complete genome sequence of middle east respiratory syndrome coronavirus (mers-cov) from the first imported mers-cov case in china
mitochondrial data are not suitable for resolving placental mammal phylogeny
molecular phylogenetics and the origins of placental mammals
large-scale sequence analysis of avian influenza isolates
comparison of phylogenetic trees
alignment-free genome comparison with feature frequency profiles (ffp) and optimal resolutions
numerical taxonomy
the interrelationships of placental mammals and the limits of phylogenetic inference
characterization and complete genome sequence of a novel coronavirus, coronavirus hku , from patients with pneumonia
a measure of dna sequence similarity by fourier transform with applications on hierarchical clustering
protein sequence comparison based on k-string dictionary
real time classification of viruses in dimensions
a novel construction of genome space with biological geometry
ebolavirus classification based on natural vectors

ss-ty and rh conceived the idea of covariance. 
rd implemented the idea and wrote the first draft of the manuscript. lh discussed and revised the first draft. rd, lh, rh, and ss-ty all contributed to the writing of the manuscript and agreed with its results and conclusions. they jointly developed the structure and arguments of the paper, made critical revisions, and reviewed and approved the final version of the manuscript.

this study was supported by the national natural science foundation of china ( ) (to ss-ty) and a tsinghua university start-up fund (to ss-ty). the corresponding author would like to thank the national center for theoretical sciences (ncts) for providing an excellent research environment while part of this research was done.

the supplementary material for this article can be found online at: https://www.frontiersin.org/articles/ . /fgene.

conflict of interest statement: the authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

copyright © dong, he, he and yau. this is an open-access article distributed under the terms of the creative commons attribution license (cc by). the use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. no use, distribution or reproduction is permitted which does not comply with these terms.

key: cord- -osjy rc
authors: aydin, berkay; boubrahimi, soukaina filali; kucuk, ahmet; nezamdoust, bita; angryk, rafal a.
title: spatiotemporal event sequence discovery without thresholds
date: - -
journal: geoinformatica
doi: . /s - - -
cord_uid: osjy rc

spatiotemporal event sequences (stess) are the ordered series of event types whose instances frequently follow each other in time and are located close by. 
an stes is a type of spatiotemporal frequent pattern discovered from moving region objects whose polygon-based locations continuously evolve over time. previous studies on stes mining require significance and prevalence thresholds for the discovery, which are usually unknown to domain experts. the quality of the discovered sequences is of great importance to the domain experts who use these algorithms. we introduce a novel algorithm to find the most relevant stess without threshold values. we tested the relevance and performance of our threshold-free algorithm in a case study on solar event metadata, and compared the results with previous stes mining algorithms.
(berkay aydin, baydin @cs.gsu.edu, georgia state university, atlanta, ga, usa)
in traditional itemset mining, frequent sequence (or sequential pattern) mining refers to discovering a set of attributes persistently appearing over time among a large number of objects [ ] . a major category of sequences are event sequences, which represent the implicit relations among the categories of objects [ ] . classical event sequence mining can be useful for understanding user behavior (by mining sequences from weblogs or system traces) [ ] , shopping routines of customers (by mining transaction sequences) [ ] , or the efficiency of business processes (by mining time-ordered managerial and operational activities) [ ] . sequential pattern mining from spatiotemporal data has received much attention in recent years due to its broad application domains such as targeted advertising, prediction for taxi services, and urban planning [ , , , ] . the characteristics of spatiotemporal sequences vary widely depending on the discovered knowledge type. most of the recent approaches focus on point-based spatiotemporal data, presumably due to its availability. however, region-based spatiotemporal data, primarily obtained from scientific resources, has not received much attention. 
in this work, we focus on spatiotemporal event sequences (stess) from event datasets that contain instances with region-based geometric representations. stess are the ordered series of event types whose instances frequently demonstrate sequence-generating behavior. the sequence-generating behavior is characterized by the spatiotemporal follow relationship among instances, which refers to a temporal follow relationship with spatial proximity constraints. spatiotemporal sequences can be categorized into three classes based on the fundamental data type: sequences of trajectories from uniform groups, sequences of spatiotemporal points from mixed groups, and sequences of trajectories from mixed groups. our work considers the last category, and we mine stess from event instances formed by evolving region trajectories. the discovery of stess is potentially critical for the large-scale verification and prediction of scientific phenomena in a broad range of scientific fields including meteorology, geophysics, epidemiology, and astronomy [ ] . scientific phenomena such as tornadoes, the propagation of epidemics, clouds, and solar events can be modeled as trajectories of continuously evolving regions. stes mining can be used for modeling the spatial and temporal relationships among different types of phenomena. the discovered sequence patterns can later be utilized for large-scale verification of current knowledge, as well as the prediction of unknown spatiotemporal relationships among different event types. one application area for stes mining is solar physics and space weather forecasting. studies from government agencies [ , ] and independent institutions [ , ] show that extreme space weather events can impact radiation in space, reduce the safety of space and air travel, disrupt intercontinental communication and gps, and even damage power grids. 
while much work has been done on the prediction of solar flares using physical characteristics of source active regions, the mixed impact of different types of solar events on eruptive activity (such as flares or coronal mass ejections) has not been fully explored. for example, in fig. , we show an active region event with a large sunspot followed by a flare. such relationships are known to exist among solar events [ , , , ] , although, to our knowledge, many studies are confined to a limited number of examples. one way to understand the exhaustive set of factors and conditions leading to an extreme space weather event is to determine the frequently occurring sequences of events which lead to substantially large flares, smaller flares, and non-flaring instances. the discriminating event sequences among these can shed light on the typical conditions leading to eruptions, or alternatively help forecasters when issuing all-clear forecasts. public health researchers, particularly epidemiologists, can also benefit from stes mining for understanding the frequently occurring activities leading to the spread of infectious diseases. contact tracing applications [ ] are one of the very few tools we can deploy to stop the spread of viral diseases with great epidemic potential, such as the novel coronavirus (sars-cov- ) [ ] . such applications can be used to trace the movements of individuals, and the paths of individuals can be split into activity event types. mining stess occurring among different activities in outbreak zones can help epidemiologists understand which activities, or sequences of activities, lead to outbreaks and which are relatively safer. identifying these sequences can provide crucial information for prevention, such as the relative contributions of different activities and modes of transmission. 
previous efforts have been devoted to mining the most prevalent spatiotemporal event sequences using apriori-based [ ] , pattern growth-based [ ] , or top-k algorithms [ ] . while these three approaches achieve the expected results, they heavily rely on user-defined significance and prevalence threshold parameters, which define the cut-off points for sequences. the previous algorithms assume that the user has prior knowledge of the optimal threshold parameters or of a k value, which in some cases should itself be discovered from the dataset. another issue that surfaces with the previous stes mining approaches is that the prevalence threshold is highly dependent on the events taking part in a sequence. for example, algorithms should be more tolerant toward stess whose event types occur rarely, yet they should also be informative about that rareness. as a matter of fact, defining the same threshold for all sequences may not be appropriate in this context, as the threshold parameter should take into account the event types participating in a given event sequence. in contrast to threshold-based approaches, we focus on overcoming the limitations of providing a user-defined threshold when discovering stess and on improving the relevance of our results. here, we introduce a novel algorithm, rand-esminer, which, by repeating the mining process on random subsets of instances and follow relationships, finds an estimated participation index for event sequences. rand-esminer uses our pattern growth-based esgrowth algorithm [ ] as its backbone, where the follow relationships are translated into a directed acyclic graph structure, and randomly permutes the edges of this graph to mine the event sequences. 
mining random subsets multiple times does not necessarily improve the overall running time, as shown in our experiments; however, it increases the robustness of the results and lets us understand the distribution of participation indices. in our experiments, we compare rand-esminer with earlier approaches and evaluate the efficiency and robustness of the algorithm using datasets from the field of solar astronomy. in the spatiotemporal frequent pattern mining literature, the term sequence (and its derivatives, such as sequence patterns or sequential patterns) is used to identify different types of knowledge from spatiotemporal data. these include sequences of locations frequently visited by spatiotemporal objects [ , ] , partially- or totally-ordered sequences of event types whose instances follow each other [ , , , , , ] (also referred to as couplings in [ ] ), sequences of semantic annotations from semantic trajectories [ ] , temporal sequences of ordered spatial itemsets (called spatio-sequences) [ ] , and sequences of spatiotemporal association rules [ ] . cao et al. described spatiotemporal sequential patterns as routes frequently followed by objects in [ ] : a set of frequently visited locations is discovered from a dataset of spatiotemporal trajectory segments. spatiotemporal sequential patterns relate to the movement patterns of spatiotemporal objects in the form of trajectory segments. similarly, giannotti et al. introduced trajectory patterns and presented an algorithm for mining them [ ] . trajectory patterns represent a set of trajectories frequently visiting similar locations with similar visiting times. while trajectory patterns are concerned with the behavioral aspect of spatiotemporal objects, the term sequence here refers to visited locations. verhein introduced complex spatiotemporal sequence pattern mining [ ] , which focuses on sequences of spatiotemporal association rules. 
spatiotemporal association rules represent frequent movements of objects appearing between two regions during a time interval. apart from those, zhang et al. proposed the splitter algorithm, which discovers fine-grained sequential patterns from semantic trajectories [ ] . splitter first retrieves spatially coarse patterns and later refines them into fine-grained patterns. the discovered patterns are sequences of categorized locations (deduced from semantic trajectories). another example of spatiotemporal sequences, called spatio-sequences, is presented by salas et al. [ ] . spatio-sequences are temporal sequences of ordered spatial itemsets used for coupling geographically neighboring phenomena. huang et al. presented a framework for mining sequential patterns from spatiotemporal event datasets in [ ] . the sequential patterns in [ ] refer to a sequence of event types from point-based event instances. they defined a follow relation between the point-based event instances of two different event types, presented significance measures for sequences, and introduced two pattern growth-based algorithms for the mining task. both algorithms create a pattern tree and expand its nodes by recursively calling tree expansion procedures (i.e., follow joins). moreover, mohan et al. [ ] introduced cascading spatiotemporal pattern mining, where the patterns are partially ordered subsets of spatiotemporal event types whose instances are located together and occur in stages. in [ , ] , aydin et al. introduced stes mining from evolving regions. stes mining is also concerned with sequences of event types, but the instances are trajectories of evolving regions; hence, the follow relationship is more complex than in [ ] . the earlier event sequence mining algorithms [ , , ] operate using a set of user-defined thresholds. 
here, we concentrate on discovering the sequences without these threshold values, which are often not available to domain experts. in stes mining, our focus is on mining patterns from evolving region trajectories; we are not particularly interested in point trajectory data or stationary spatiotemporal data. our scope is mining the most relevant sequences. our primary objective is to improve the quality of the results and alleviate the issues associated with using preset, and usually arbitrary, thresholds. the rest of this paper is organized as follows. in section , we present background information on stes mining. in section , we introduce our novel stes mining algorithm. in section , we present our experimental evaluation. lastly, in section , we present our conclusions and possible future work. spatiotemporal event instances (ins i ) are chronologically ordered lists of timestamp-geometry pairs (tg i k ). the geometries are region-based and represented with polygons. the instances are evolving region trajectories; each is identified by a unique identifier and has an associated event type. an event type signifies the class of its associated instances. a timestamp-geometry pair is a pair of a timestamp value (t i ) and a region geometry (g i ). the event type of an instance is represented as ins i .e. an event type is denoted by e j , and the set of instances of type e j is represented as i e j . in the upper portion of fig. , we show the organizational structure of instances and events. note here that the set of all instances (i) is essentially the union of instances from all event types in the dataset (∪ i e j for all e j ). let e = {e , e , . . . , e k } be the set of all event types, and let i be a database of all event instances. the problem of stes mining is finding frequently occurring sequences of event types (i.e., event sequences) of the form (e i e i . . . e i k ), such that the instances of e i are followed by the instances of e i , . . . 
, and the instances of e i k− are followed by the instances of e i k . event sequences are denoted as esq i and are derived from instance sequences. an instance sequence (denoted as i sq i in eq. ) represents a chain of spatiotemporal follow relationships (denoted as " ") occurring between its participating instances. the number of participating instances in an i sq is the length of the sequence; a length-k instance sequence is alternatively called a k-sequence. an i sq i is of-type an event sequence es j (as shown in eq. ) if and only if the event types of the participating instances of i sq i are identical to, and in the same order as, the event types in es j . in the lower portion of fig. , we schematically depict the follow relation between instances and show an example of a length- instance sequence and its associated spatiotemporal event sequence. given this, the task of stes mining is, in general, to discover spatiotemporal event sequences whose instance sequences are frequently repeated. instance sequences are discovered by finding the significant follow relationships outlined in section . , and event sequences are derived from the types of these instance sequences. the prevalence of event sequences is measured by the participation index, described in section . . in this paper, we focus on mining stess using a randomization approach, which takes a set of spatiotemporal event instances as input and returns all discovered stess together with a list of estimated participation index values for each stes, obtained from randomized trials. instance sequences are formed by two or more instances; between each two consecutive instances in an instance sequence, there exists a spatiotemporal follow relationship. the simplest form of follow relationship occurs between two event instances and is denoted by the ' ' symbol. 
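the of-type relation between instance sequences and event sequences can be sketched directly from the definitions above. this is a minimal illustration under our own data representation (instances as dicts with an "event_type" field); it is not the authors' code.

```python
def event_sequence_of(instance_seq):
    """derive the event sequence that an instance sequence is of-type:
    the tuple of event types of its participating instances, in order."""
    return tuple(ins["event_type"] for ins in instance_seq)

def is_of_type(instance_seq, event_seq):
    """an instance sequence is of-type an event sequence iff the event
    types match one-to-one and in the same order."""
    return event_sequence_of(instance_seq) == tuple(event_seq)

# a length-2 instance sequence: an active-region instance followed by a flare
# (the ids and type codes here are hypothetical examples)
isq = [{"id": "ar1", "event_type": "AR"}, {"id": "fl7", "event_type": "FL"}]
```

grouping all instance sequences by their derived event sequence is the first step toward measuring prevalence.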
the relationship is characterized by two predicates: temporal continuity and spatial proximity. to actualize these predicates, we present two concepts: the head and tail windows of instances. we determine the head and tail with the head and tail ratio parameters (denoted hr and tr). hr is the ratio between the head segment's lifespan and the instance's lifespan; tr is the ratio between the tail segment's lifespan and the instance's lifespan. the tail window is derived from the tail of the instance. the first operation for obtaining the tail window is a spatial buffer; secondly, the spatially buffered geometries are propagated temporally. the amount of buffering is determined by the buffer distance parameter (denoted bd), while the period of temporal propagation is called the tail validity period (denoted tv). head windows are created similarly: the head segment is first buffered using the buffer distance parameter, and the head is then propagated using the head validity (hv) parameter. the difference is that the tail window is propagated forward (i.e., towards a later time step) while the head window is propagated backward (i.e., towards an earlier time step). we show an example of head and tail segment generation in fig. a and the creation of a tail window in fig. b. formalization: given two instances ins i and ins j , there exists a spatiotemporal follow relationship between ins i and ins j (ins i ins j ) if and only if ( ) the start time of ins i is less than the start time of ins j , and ( ) there exists a spatiotemporal co-occurrence between the tail window of ins i and the head window of ins j . under these conditions, ins i is the followee and ins j is the follower in the relationship. to form a -sequence, there must be one spatiotemporal follow relationship between two instances. more generally, to form a k-sequence, there must be k − spatiotemporal follow relationships between consecutive participating instances. 
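the head/tail window construction can be sketched as follows. this is a simplified sketch under stated assumptions: geometries are reduced to axis-aligned bounding boxes (the paper uses polygons), the spatial buffer is a box expansion, and temporal propagation is modeled by extending the boundary geometry by tv (or hv) time units; function names are ours.

```python
def buffer_box(box, bd):
    """expand an axis-aligned bounding box (xmin, ymin, xmax, ymax) by bd."""
    xmin, ymin, xmax, ymax = box
    return (xmin - bd, ymin - bd, xmax + bd, ymax + bd)

def tail_window(instance, tr, bd, tv):
    """tail window of an instance given as a time-ordered list of
    (timestamp, bbox) pairs: take the last tr fraction of the lifespan,
    buffer each geometry by bd, and propagate the last buffered geometry
    forward in time by the tail validity period tv."""
    t0, t1 = instance[0][0], instance[-1][0]
    cut = t1 - tr * (t1 - t0)
    tail = [(t, buffer_box(g, bd)) for t, g in instance if t >= cut]
    tail.append((t1 + tv, tail[-1][1]))  # forward temporal propagation
    return tail

def head_window(instance, hr, bd, hv):
    """mirror image: first hr fraction, buffered, propagated backward by hv."""
    t0, t1 = instance[0][0], instance[-1][0]
    cut = t0 + hr * (t1 - t0)
    head = [(t, buffer_box(g, bd)) for t, g in instance if t <= cut]
    head.insert(0, (t0 - hv, head[0][1]))  # backward temporal propagation
    return head
```

a follow relationship would then be tested by intersecting one instance's tail window with another's head window in both space and time.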
to measure how frequent an stes is in a given dataset, we use the participation index. the participation index is defined in [ ] and shows the minimum relative frequency of the participating event types. for an event sequence es j = (e j .. e j k ), the participation index of es j is the minimum of the participation ratios (pr) of its event types. the participation ratio of an event type (e i ) in an stes (es j ) is the ratio of the number of unique participating instances of e i to the total number of event instances of e i in the dataset, where | · | denotes set size. another aspect of significance is the strength of the follow relationships. the significance assessment is important, as the accuracy and reliability of the resulting event sequences depend on the discovered significance of the instance sequences. for assessing the strength of follow relationships, we use the chain index (denoted ci). for a -sequence, i sq r = (ins r ins r ), the ci is defined as the significance of the spatiotemporal co-occurrence between the tail window (tw) of the followee and the head window (hw) of the follower; in this work, we measure the significance of spatiotemporal co-occurrence using the j * measure [ , ] , where t s represents the starting time of an instance. for a k-sequence where k > , the significance is assessed as the minimum chain index over the follow relationships in the instance sequence (eq. ). in threshold-based approaches, instance sequences are considered significant if their chain index (ci) value is greater than a user-defined chain index threshold (ci th ). similarly, event sequences are considered frequent if their participation index (pi) value is greater than a user-defined participation index threshold (pi th ). the pattern growth-based esgrowth algorithm was introduced in [ ] . 
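the participation index defined above can be computed directly from its definition. this is a minimal sketch, not the authors' implementation: `instance_seqs` is assumed to be the list of instance sequences already known to be of-type `event_seq` (as id tuples), and `instances_by_type` maps each event type to the set of all its instance ids.

```python
def participation_index(event_seq, instance_seqs, instances_by_type):
    """participation index of an event sequence: the minimum, over its event
    types, of the participation ratio -- the fraction of that type's instances
    that participate (at the corresponding position) in at least one instance
    sequence of-type event_seq."""
    ratios = []
    for pos, etype in enumerate(event_seq):
        participants = {seq[pos] for seq in instance_seqs}  # unique participators
        ratios.append(len(participants) / len(instances_by_type[etype]))
    return min(ratios)
```

for example, if one of two AR instances and two of three FL instances participate in (AR FL) sequences, the participation index is min(1/2, 2/3) = 0.5.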
the algorithm first identifies the follow relationships and creates the event sequence graph (esg); it then recursively discovers the stess from the graph structure. the esgrowth algorithm is outlined in algorithm . in the initialization steps, as explained in section , the algorithm creates the head and tail windows of instances and identifies the follow relationships between those instances (algorithm , lines to ). the follow relationships are discovered by a spatiotemporal join procedure, where the head and tail window trajectories of the instances are joined and filtered based on ci th for significance testing. these significant follow relationships are then inserted into a directed acyclic graph structure, the esg (line ). for any discovered follow relationship (ins i ins j ), the transform procedure adds two vertices representing ins i and ins j (if they are not already present) and a directed edge from ins i to ins j . the graph stores only the identifiers (i and j ) and event types (ins i .e and ins j .e) of the instances in its vertices. after creating the esg, esgrowth recursively discovers the stess. the algorithm starts by iterating over the event types (e i ) in the set of all event types (e). for each event type e i , it identifies the non-leaf vertices corresponding to the instances of e i (step ). then a participation ratio (pr) is calculated from those vertices to check prevalence against pi th . if pr is greater than pi th , the recursive growsequence procedure is called. the growsequence procedure is shown in the second part of algorithm . the procedure takes a prefix event sequence (prefix), the current minimum pr for the prefix sequence, and a set of pointers to vertices (v pre ) as parameters. the vertices in v pre correspond to the last discovered vertices on the paths that virtually form the instance sequences of-type prefix in the esg. the procedure proceeds as follows. 
first, the successors (immediate neighbors) of the vertices in v pre are found and added to the successor vertex set (sucv ). then, for each event type e j , a subset of the successor vertex set containing the instances of e j is created (line of the growsequence procedure in algorithm , denoted sucv e j ). after identifying the successors, a temporary participation ratio value (pr ) is calculated for the extended event sequence (line of growsequence). if pr is greater than pi th , the prefix is extended with e j , and a new prefix event sequence (denoted prefix') is created (line of the growsequence procedure). at this point, prefix' is guaranteed to be prevalent and is therefore inserted into the set of prevalent event sequences, es. lastly, the growsequence procedure is called with the newly created event sequence, prefix'. along with prefix', the minimum of the old and new participation ratios (min(pr, pr )) and the vertex subset formed by the successors (sucv e j ) are passed as parameters. note that the base case of the recursion occurs when the temporary participation ratio is less than the participation index threshold. in this case, the prefix', created by appending the new event type to prefix, is not prevalent; therefore, there is no need to check the longer event sequences generated from prefix'. previous approaches [ , , ] use threshold-based stes mining algorithms that rely heavily on the domain experts' knowledge to choose appropriate threshold parameters, which is usually not available. 
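the recursive growth over the esg described above can be sketched compactly. this is a simplified sketch of the pattern-growth idea, not the paper's algorithm verbatim: the esg is reduced to plain dicts, the chain-index filtering is assumed to have already happened when the edges were built, and participation ratios are approximated from frontier sizes as in the prose.

```python
from collections import defaultdict

def esgrowth(vertices, edges, type_counts, pi_th):
    """pattern-growth search over an event sequence graph.
    vertices: instance id -> event type; edges: id -> iterable of successor ids;
    type_counts: event type -> total number of instances of that type.
    returns {event sequence (tuple of types): estimated participation index}."""
    results = {}

    def grow(prefix, min_pr, frontier):
        # group successors of the current frontier by event type
        succ = defaultdict(set)
        for v in frontier:
            for w in edges.get(v, ()):
                succ[vertices[w]].add(w)
        for etype, vs in succ.items():
            pr = len(vs) / type_counts[etype]   # temporary participation ratio
            if pr >= pi_th:                     # base case: prune if below threshold
                ext = prefix + (etype,)
                results[ext] = min(min_pr, pr)
                grow(ext, min(min_pr, pr), vs)

    for etype in set(vertices.values()):
        # non-leaf vertices of this type seed the length-1 prefixes
        starters = {v for v, t in vertices.items() if t == etype and edges.get(v)}
        pr = len(starters) / type_counts[etype]
        if pr >= pi_th:
            grow((etype,), pr, starters)
    return results
```

on a toy graph with two "A" instances both followed by one "B" instance, the only grown sequence is (A B) with participation index 1.0.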
to tackle these issues, we propose a novel algorithm, rand-esminer (randomized spatiotemporal event sequence miner), which can help domain experts better understand the intrinsic characteristics of spatiotemporal event sequences. the threshold-based approach is essentially a constraint-based data mining approach, where the instance sequences are filtered based on ci th and the event sequences are filtered based on pi th . the usefulness of these thresholds for the efficiency of the mining algorithms is indisputable, primarily because of the exponential time and space complexity of these algorithms. however, the imbalance of the spatiotemporal instances in these datasets and their characteristics, such as lifespan, total area, or areal evolution, make it very difficult for domain or data mining experts to choose meaningful threshold values [ ] . because of this, we created an algorithm that does not take participation index or chain index thresholds as input, but outputs stess along with a distribution of pi values. threshold-based algorithms output a set of prevalent stess (often coupled with a single pi value); from a practical point of view, event sequences which include rarely occurring events or instances with significantly different spatiotemporal characteristics are thus often overlooked. with the randomization approach, we do not consider any significance or prevalence threshold and instead perform the mining on resampled subsets of the follow relationships. the randomized algorithm is inspired by permutation tests (random resamples without replacement) and outputs the participation index values of all the discovered patterns. the primary task of statistical descriptors is to summarize a characteristic of the given data and generalize the finding to the larger population. 
basic sample statistics such as the sample mean or median give information about that particular sample; however, their values fluctuate from sample to sample, and the magnitude of these fluctuations around the corresponding population parameter also matters for the relevance of the results. statistical bootstrapping is an alternative to the traditional statistical methodology of assuming a particular probability distribution for a sample. the bootstrap is a random resampling technique that estimates the distribution of a statistic and provides measures of accuracy for sample estimates [ ] ; it is especially useful when there is no analytical form to help estimate the distribution of the statistic of interest, such as the mean or variance. an increasingly common statistical tool for constructing sampling distributions is the randomization test (also referred to as the permutation test). similar to bootstrapping, a permutation test builds a sampling distribution, the permutation distribution, by resampling the observed data. unlike bootstrapping, permutation tests resample the data without replacement, which is more appropriate for our task. in our application area of solar data mining, the lack of accurate data is a common problem, and one way to tackle such noisy data is to randomize the mining process and obtain uncertainties with confidence intervals. our primary data source for stes mining is the solar event metadata obtained from the feature finding team (fft) of the solar dynamics observatory [ ] . firstly, most of the solar event instances (representing the regions of solar events) have become available only after the launch of the solar dynamics observatory (sdo); hence, we only have slices of the data. 
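the distinction the paragraph draws between the two resampling schemes is easy to see in code. this is a generic illustration using the standard library, unrelated to any specific dataset in the paper.

```python
import random

population = list(range(10))
rng = random.Random(42)

# permutation-style resample: without replacement, no duplicates possible
perm_sample = rng.sample(population, 7)

# bootstrap-style resample: with replacement, duplicates allowed
boot_sample = [rng.choice(population) for _ in range(7)]
```

rand-esminer uses the first scheme: each trial draws a subset of follow relationships without replacement.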
additionally, the sdo only captures images of the one side of the sun that is visible to us, and we currently do not have reports of the solar events occurring on the opposite side of our line of sight. in our randomization approach, we treat the follow relationships as the sample dataset, and the participation index (pi) values of stess as a complex statistic to be obtained from the esg structure. by applying random resampling without replacement (i.e., a set of permutation tests) to the follow relationships (i.e., edges), we have the opportunity to express the prevalence of stess as a distribution. note here that we do not perform a traditional permutation test, which would require a null model and a statistical hypothesis test. given the characteristics of solar event datasets, this approach is very promising due to its power of estimating confidence intervals of the participation index for stess. that is to say, for each randomized experiment, we can obtain a participation index value for a discovered pattern, estimate its likelihood, and see the variance of these values. in this part, we will explain the details of the rand-esminer algorithm. the algorithm makes use of random resampling of follow relationships and outputs all the discovered stess along with their participation index values, which show the prevalence of a particular stes, in each random trial. in algorithm , we show the overview of the randomized stes mining algorithm. the algorithm initially discovers all the follow relationships as in the threshold-based esgrowth algorithm and creates the esg (lines to ). in a nutshell, it first randomly resamples the edges in the esg, then discovers the stess from the resampled event sequence graphs, and eventually collects the results. the algorithm takes the set of instances (i), the resampling ratio (rr), and the number of random trials to be performed (ν) as input.
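the percentile-style confidence interval for a pattern's participation index, described above, could be computed from its per-trial pi values along these lines. this is a minimal sketch; the function name and the simple index-based percentile rule are assumptions, not the paper's code.

```python
def pi_confidence_interval(pi_values, level=0.95):
    """Percentile confidence interval for a pattern's participation
    index, estimated from its per-trial values (zeros included for
    trials in which the pattern was not discovered)."""
    xs = sorted(pi_values)
    alpha = (1.0 - level) / 2.0
    lo = xs[int(alpha * (len(xs) - 1))]        # lower percentile bound
    hi = xs[int((1.0 - alpha) * (len(xs) - 1))]  # upper percentile bound
    return lo, hi
```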
the resampling ratio (rr) is the ratio between the number of edges to be resampled and the total number of edges in the esg. note here that neither the rr nor the ν parameter requires expert knowledge; they are used for regulating the randomized trials and are not necessarily dependent on the intrinsic characteristics of the data. the esg structure allows us to create random resamples of the follow relationships. the rand-esminer algorithm performs a randomized run ν times to estimate the pi value for all the discovered stess (lines to ). the resampling is applied on the edges of the graph, which correspond to the follow relationships among instances. edge resampling creates a subgraph of the esg, that is resg, by randomly sampling from the edge set of the esg without replacement (line ). note that our graph structure is not a multigraph; therefore, we opt for a permutation test (resampling without replacement) instead of bootstrapping (resampling with replacement). next, we discover the stess from the resampled subgraphs of the esg. for each resampled subgraph, we perform a recursive procedure similar to the esgrowth algorithm. for each event type e i , we find the non-leaf vertices of e i and grow the sequences (see the growrandsequences procedure in the second part of algorithm ). this can be considered as running the esgrowth algorithm with pi th = . . lastly, we append the results to the map structure (es rand ) for every iteration, and return es rand , which contains the discovered stess and a size-(ν) list of pi values for each sequence. if an stes is not discovered during a trial, the pi value for that stes is recorded as . if an stes is recorded for the first time during the kth trial, we create a new list of pi values (length-k) backfilled with zeroes for each previous trial in which it was not discovered.
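the trial loop just described — edge resampling, mining, and zero-backfilled collection of pi values — can be sketched as follows. this is a simplified stand-in rather than the paper's implementation: `mine_sequences` abstracts away the esgrowth-style recursive growth (run with a pi threshold of 0), and all names are hypothetical.

```python
import random

def rand_esminer(esg_edges, mine_sequences, rr, num_trials, seed=0):
    """Randomized trial loop: resample ESG edges without replacement,
    mine STESs from each subgraph, and collect per-trial participation
    index values. `mine_sequences` maps an edge subset to
    {pattern: pi_value}."""
    rng = random.Random(seed)
    k = int(len(esg_edges) * rr)
    es_rand = {}  # pattern -> list of pi values, one per trial
    for trial in range(num_trials):
        subgraph = rng.sample(esg_edges, k)  # permutation-style resample
        for pattern, pi in mine_sequences(subgraph).items():
            # backfill zeros for earlier trials where the pattern was absent
            es_rand.setdefault(pattern, [0.0] * trial).append(pi)
        # record 0 for known patterns not discovered in this trial
        for values in es_rand.values():
            if len(values) == trial:
                values.append(0.0)
    return es_rand
```

after the loop, every pattern's list has exactly `num_trials` entries, which is what makes the per-pattern distributions directly comparable.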
it is worth mentioning that, unlike the threshold-based approaches which return a list of prevalent stess, the output of the rand-esminer algorithm is a map structure whose keys are discovered patterns and whose values are lists of calculated pi values, one for each of the (ν) random trials. in this section, we present our experimental evaluations of the randomization-based stes mining approach. we used real-life datasets from the solar astronomy field. we evaluated the runtime performance of the graph transformation procedures and compared the esgrowth and rand-esminer algorithms. our algorithms are implemented in the java programming language, datasets were stored in text files, and experiments were conducted on an ubuntu virtual machine with tb ram and an intel xeon processor. all the event instances and graph elements are stored in memory for fair comparison. to analyze the performance of our proposed algorithm, we used real-life solar event datasets. these are monthly datasets from , which include the spatiotemporal instances of seven different solar event types: active regions (ar), coronal holes (ch), emerging flux (ef ), filaments (f i), flares (f l), sigmoids (sg), and sunspots (ss). each instance consists of region polygons, downloaded from the heliophysics event knowledgebase (hek) [ ] , and the regions are tracked and interpolated using the algorithms presented in [ , ] . the characteristics of our real-life datasets can be seen in table . the datasets are in tabular format, where the instances of a particular event type are stored as a table. each row shows a particular time-geometry pair with four attributes: instance identifier, start time of the time-geometry pair, end time of the time-geometry pair, and spatial geometry. the spatial geometry is a polygon object formatted in the well-known text (wkt) format.
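the tabular layout described above can be modelled with a small sketch that groups the rows of one event-type table back into evolving region trajectories. the names and the example values below are invented for illustration; real rows carry full timestamps and wkt polygon strings.

```python
from collections import defaultdict
from typing import NamedTuple

class TimeGeometry(NamedTuple):
    instance_id: str
    start: str  # start time of the time-geometry pair
    end: str    # end time of the time-geometry pair
    wkt: str    # polygon in well-known text, e.g. "POLYGON((...))"

def group_trajectories(rows):
    """Group the rows of one event-type table into region trajectories:
    one time-ordered list of time-geometry pairs per instance."""
    traj = defaultdict(list)
    for row in rows:
        traj[row.instance_id].append(row)
    for pairs in traj.values():
        pairs.sort(key=lambda r: r.start)
    return dict(traj)
```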
the goal of the experiments is to examine the performance of our randomized algorithm, both in terms of relevance and efficiency, on these datasets and compare it with the threshold-based esgrowth algorithm. we will compare the discovered stess in our relevance analysis and later evaluate the running time efficiency. a preliminary step of the algorithm is the initialization (head and tail window generation) and the graph transformation. for that reason, we kept the global parameters that are used in the esg creation constant throughout all the experiments. these parameters represent independent variables that should not alter the performance of the algorithms within this set of experiments. the head and tail ratio parameters were selected as . and . , respectively. the value used for the tail validity (tv) is hours. head validity (hv) was set to zero for consistency with our earlier works. we chose arcsec as the buffer distance (bd) parameter. for the case of the threshold-based algorithm, we conducted experiments for each dataset with varying ci th and pi th values. the ci th value was set to . , . , . , and . , while the pi th value was set to . , . , . , and . . we ran the esgrowth algorithm with the above-mentioned combinations of ci th and pi th values to discover the frequent sequences. eventually, a total of experiments were performed on the datasets for the threshold-based approach. on the other hand, for the randomization approach, we resampled the data times (ν = ) for every dataset and estimated the distribution of the pi values of all the event sequences. the size of each sample is % the size of its respective original dataset (rr = . ). thus, we generated pi values for all the discovered sequences to estimate the prevalence of event sequences. in this part, we will discuss the relevance of the mining results from the rand-esminer algorithm.
for brevity, we chose to illustrate the length- , length- , and length- stess with the top- mean pi values from the jan, feb, mar, and apr datasets. the comprehensive results for length- , , and stess from all datasets can be found in the appendix of this paper. figure illustrates the most prevalent length- (top row), length- (middle row), and length- (bottom row) stess from the four datasets. the distributions of pi values from rand-esminer are shown with box plots. we also show the discovered pi values from the threshold-based approach with varying-size scatter points. each point represents a different experimental run, and when the event sequence is not found to be prevalent in an esgrowth experiment (meaning pi was less than pi th ), the result from that experiment is omitted. from fig. , we can see that the discovered top- stess are consistent throughout the four datasets. we also present the number of top- occurrences for length- stess from all datasets in fig. . the results for the length- and length- stess are available in the appendix for completeness. eight of the top- length- stess are discovered in all twelve datasets and of them were discovered at least ten times. we further analyzed these stess. table shows the number of times each stes was discovered by esgrowth, as well as the averaged (across monthly datasets) median pi values and the averaged percentage of randomized trials in which each stes was discovered. we can observe that the averaged median pi values for these stess are generally over . (the pi th for the comparable esgrowth experiment) and they are discovered in almost all of the randomized trials (that is to say, the average percentage of randomized trials in which these stess were discovered is above % for the aforementioned stess). this is not the case for the threshold-based runs, where some of these well-known stess were not discovered, even for relatively low threshold values.
(see a non-exhaustive list of observations found in the literature for some of these patterns: 'ss sg' [ , ] ; 'ef ef' [ ] ; 'ar ar' [ , ] ; 'ef fl' [ , , ] ; or 'ss ss' [ , ] ). another aspect of our evaluation is the relevance comparison with the threshold-based approach in terms of varying frequencies of stess. one observation we can make is the variation of the pi values when using different ci th values in the threshold-based approach. the variation is two-fold: ( ) the variation of the pi values for a particular stes and ( ) the variation of the pi values across different stess. the latter is much expected, as the natural phenomena may or may not be spatiotemporally following each other. however, the former variation poses a challenge that is difficult to solve with trial and error. for example, for (f l f l) sequences, ci th = . and pi th = . can be accurate cut-off points for the thresholds. however, if we set the ci th to . and pi th to . for the entire dataset, we miss practically all (ar f l) sequences, as well as the sequences including the (ar f l) subsequence. it is well-known to solar physics experts that flares can occur anywhere on the sun's surface, from active regions (ar) to the boundaries of the magnetic network of the quiet sun [ ] . however, large area flares (f l) have preferred locations. they originate from large active regions showing a complex geometry of the d magnetic field [ ] .

[table caption: the stess are selected based on their top- occurrences (see fig. ). averaged median participation index (pi) values and the average percentage of randomized trials in which an stes was discovered are reported for rand-esminer. the number of months an stes was discovered by esgrowth (with ci th = . and pi th = . ) is also reported. rand-esminer discovered these stess for all monthly datasets.]

to capture the (ar f l) event sequence, we can use ci th = . and pi th = . (see the results in table ). however, this time the majority of (ar ar), (ef ef ), and (ss sg) would be missed.
these examples can be extended to include those three sequences, leading to a never-ending cycle of discussion about pattern importance. even for the simple cases of (f l f l) and (ar f l), or (ar f l) and (ss sg), creating user-defined thresholds is difficult, primarily because of the unbalanced spatiotemporal characteristics of the natural phenomena. therefore, we can suggest that mining a distribution of pi values using random edge resampling from the sample esg is a better approach, because outputting a single pi value based on set thresholds cannot properly represent the characteristics of the population. in this part of our evaluation, we will show the runtime behavior of the initialization phase of the rand-esminer algorithm. in fig. , we show the running times of the initialization phase of the rand-esminer algorithm for all the datasets. we split the running times of the initialization phase into two categories: ( ) the head and tail window generation time, which is denoted as ht generation time in fig. and corresponds to lines and in algorithm , and ( ) the follow relation and graph transformation time, which is denoted as follow time in fig. and corresponds to lines and in algorithm . the head and tail window generation requires complex spatial buffer and union operations. similarly, the follow relationships are discovered with a computationally expensive spatiotemporal join operation on evolving region trajectories. creating the esg is significantly less complex in terms of computation time. along with the running times, in fig. , we illustrate the vertex and edge counts in the created esg for each dataset. the number of vertices corresponds to the number of instances in the dataset, while the number of edges shows the number of follow relationships. from the results, we can see that the head and tail window generation time varies significantly for each monthly dataset.
we can observe that part of this stems from the number of instances (vertices) in the dataset, and another factor is the number of individual region polygons in the datasets. the highest head and tail window generation times are recorded for the may and june datasets, where we have the highest number of region polygons. similarly, the lowest ht generation times are recorded for the february and april datasets, where we have the lowest number of region polygons. the follow time also varies greatly across our datasets. the follow time depends on the number of spatiotemporal follow relationships among the instances in the dataset. while we cannot assert a total correlation, the number of edges in the generated graph is a good indicator of the follow time. another factor that impacts the follow time is the number of instances and region polygons, because we take % and % of the instance trajectories as heads and tails (as hr = . and tr = . ). for the head and tail window generation, our algorithm iterates through all the instances in the dataset and computes the time-propagated and spatially buffered time-geometry pairs (representing the region trajectories). this process is done in linear time, which explains the relation between the running time and the number of instances and region polygons. on the other hand, the esg generation algorithm iterates through the tail windows and performs a spatiotemporal join on an overlap predicate with the head windows of instances. this makes the complexity of the follow relationship identification quadratic; however, since we apply a two-step filter based on time-overlap and spatial-overlap predicates, the complexity becomes subquadratic (and very close to linear) with respect to the number of region polygons.
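the two-step filtered join described above — a cheap time-overlap test gating an expensive spatial test — can be sketched as below. this is illustrative only: real polygon intersection is replaced by axis-aligned bounding-box overlap, the names are invented, and a production version would use a temporal index rather than the nested loop shown here.

```python
def time_overlaps(a, b):
    """Cheap first filter: closed-interval time overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def bbox_overlaps(a, b):
    """Stand-in for the expensive polygon test: boxes are
    (min_x, min_y, max_x, max_y)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def follow_relationships(tails, heads):
    """Two-step filtered join over {instance: (time_span, bbox)} maps:
    only pairs passing the time-overlap predicate reach the spatial
    test, which is what keeps the join subquadratic in practice."""
    follows = []
    for ti, (t_span, t_box) in tails.items():
        for hi, (h_span, h_box) in heads.items():
            if ti == hi:
                continue  # an instance does not follow itself
            if time_overlaps(t_span, h_span) and bbox_overlaps(t_box, h_box):
                follows.append((ti, hi))
    return follows
```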
it should be noted that, in a situation where there is a time requirement constraint, the user can shrink the size of the head and tail windows to decrease the amount of overlap, thus reducing the number of follow relationships. in this part of our experiments, the running time requirements of our rand-esminer algorithm will be compared to those of the esgrowth algorithm. in fig. , we show the total running times of our algorithms for each dataset (in (a)), as well as the average time spent on mining stess from the esg (in (b)). in fig. a , the blue bars show the average running time of the esgrowth algorithm with different ci th values. the red bars show the total running time of the rand-esminer algorithm, which consists of randomized runs on the esg. in fig. b , we show the running times required for mining stess from the esg. the initialization steps (ht generation and follow times shown in fig. ) are omitted here for a better comparison, and we report the average running times of the threshold-based runs (with different ci th and pi th combinations) and the average running time of the randomized runs for each dataset. from the results shown in fig. a , we can notice that the total running times required for mining the stess follow a very similar pattern to the initialization, and it can be observed that, for the threshold-based approach, the total running time is dominated by the initialization. higher ci th values used for filtering the insignificant follow relationships (or edges) extensively prune the esg, leading to very low graph mining times. nevertheless, it is difficult to draw conclusions about the trustworthiness of the stess or the overall generality of the stes mining process with high ci th values. when we analyze the performance of the rand-esminer algorithm, we see that for sparse esgs (such as the feb, apr, and oct datasets) the total running time of rand-esminer is similar to that of esgrowth.
on the other hand, for denser esgs (such as the may, jun, and jul datasets), we observe greater differences between the average runs of the randomized and threshold-based approaches. this can be explained by the algorithmic setup of the randomization approach and the observations from fig. b . the average esg mining times of the randomization approach for the may, jun, and jul datasets are relatively higher than for the other datasets. in our experiments, the esg is resampled times, and the total running time of rand-esminer includes all the randomized runs, whereas for the threshold-based approach, the esg is mined only once. in summary, the total running time required for rand-esminer is approximately % more than for esgrowth. the running time required for rand-esminer is primarily dependent on the resampling ratio and the number of trials. to increase the trustworthiness of the results, one can increase the number of trials and the resampling ratio. in addition, the trustworthiness of the results can be traded off against the running time: choosing a lower resampling ratio or number of trials decreases the running time, as well as the trustworthiness of the results. in this work, we have introduced a novel spatiotemporal event sequence mining algorithm, rand-esminer, specifically designed for discovering stess without user-defined thresholds. our method differs from the conventional threshold-based methods, which can be inaccurate and thus inapplicable for large-scale data analysis. our novel randomized algorithm relies on applying permutation tests to the edges in the event sequence graph generated from spatiotemporal follow relationships. unlike the traditional techniques which discover stess with one pi value [ , ] , our algorithm discovers a distribution of pi values and estimates a confidence interval for stess without any thresholds. mining stess without thresholds is significant for scientific fields, as it can be easily applied to explorative tasks.
our future work lies in the parallelization of the rand-esminer algorithm. as the number of random resamplings and the resampling ratio increase, rand-esminer can become less efficient, and exploiting parallel computation can alleviate the efficiency issues and provide us with highly robust outcomes.

references
a survey of covid- contact tracing apps
magnetic flux emergence and associated dynamic phenomena in the sun
spatiotemporal frequent pattern mining on solar data: current algorithms and future directions
a graph-based approach to spatiotemporal event sequence mining
spatiotemporal event sequence mining from evolving regions
time-efficient significance measure for discovering spatiotemporal co-occurrences from data with unbalanced characteristics
spatiotemporal frequent pattern discovery from solar event metadata
measuring the significance of spatiotemporal cooccurrences
top-(r%, k) spatiotemporal event sequence mining
severe space weather events-understanding societal and economic impacts. a workshop report
spatio-temporal interpolation methods for solar events metadata
observations of rotating sunspots from trace
discovering tight space-time sequences
mining frequent spatio-temporal sequential patterns
analysing spatiotemporal sequences in bluetooth tracking data
sequence data mining
nonparametric estimates of standard error: the jackknife, the bootstrap and other methods
toward spatio-temporal patterns
quantifying sars-cov- transmission suggests epidemic control with digital contact tracing
trajectory pattern mining
properties and emergence patterns of bipolar active regions
measurement of kodaikanal white-light images-v. tilt-angle and size variations of sunspot groups
discovering colocation patterns from spatial data sets: a general approach
a framework for mining sequential patterns from spatio-temporal event data sets
on the relation between filament eruptions, flares, and coronal mass ejections
tracking solar events through iterative refinement
x-ray network flares of the quiet sun
highlights of the space weather risks and society workshop
mining probabilistic frequent spatio-temporal sequential patterns with gap constraints from uncertain databases
solar storm risk to the north american electric grid
heliophysics event knowledgebase
triggering an eruptive flare by emerging flux in a solar active-region complex
computer vision for the solar dynamics observatory (sdo)
cascading spatio-temporal pattern discovery: a summary of results
cascading spatio-temporal pattern discovery
onset of the magnetic explosion in solar flares and coronal mass ejections
mining access patterns efficiently from web logs
evolution of magnetic fields and energetics of flares in active region
on the probability of occurrence of extreme space weather events
the pattern next door: towards spatio-sequential pattern discovery
magnetic flux emergence along the solar cycle
spatiotemporal data mining: a computational perspective
sunspots: an overview
mining sequential patterns: generalizations and performance improvements
role of sunspot and sunspot-group rotation in driving sigmoidal active region eruptions
evolution of active regions
mining complex spatio-temporal sequence patterns
flares associated with efr's (emerging flux regions)
normalized-mutual-information-based mining method for cascading patterns
spade: an efficient algorithm for mining frequent sequences
splitter: mining fine-grained sequential patterns in semantic trajectories
data mining applications in social security
spatiotemporal event forecasting in social media

acknowledgements
this project has been supported in part by funding from the
division of advanced cyberinfrastructure within the directorate for computer and information science and engineering, the division of astronomical sciences within the directorate for mathematical and physical sciences, and the division of atmospheric and geospace sciences within the directorate for geosciences, under nsf award # . it was also supported in part by funding from the heliophysics living with a star science

publisher's note: springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

berkay aydin, ph.d., is a research assistant professor in the department of computer science at georgia state university (gsu), as part of the next generation astroinformatics research cluster program. he was a postdoctoral research associate at gsu prior to his current position, where his research was sponsored by nsf and nasa grants. dr. aydin's research is interdisciplinary and focused on the management, retrieval, and analysis of solar astronomy big data. currently, he works on creating algorithms and data models for multivariate time series prediction and spatiotemporal frequent pattern discovery, which is helpful for understanding the implicit temporal and spatial relationships appearing among objects with spatial and temporal extents, as well as forecasting extreme space weather events. soukaina filali boubrahimi is a ph.d. student at the computer science department, georgia state university. her research interests are in the problem of ensemble learning applied to time series data, a machine learning method commonly used to maximize learning accuracy and popular in many domains. she is also involved in projects in the data-mining lab consisting of: ( ) interpolation of spatio-temporal objects representing solar event trajectories, ( ) clustering and visualization of decision trees of coronal mass ejection data, and ( ) mining discriminative patterns from fmri-based networks.
, and information retrieval (text and image data). he has published over journal articles, book chapters, and peer-reviewed conference papers in these areas. his research has been sponsored by several federal agencies: nasa (major contributor), nsf, and nga, as well as industry partners: intergraph corporation and rightnow technologies (now oracle), with a successful grant history exceeding $ m.

key: cord- -kyx xhvj
title: real-time audio and visual display of the coronavirus genome
date: - -
journal: bmc bioinformatics
doi: . /s - - -
sha:
doc_id:
cord_uid: kyx xhvj

background: this paper describes a web-based tool that uses a combination of sonification and an animated display to inquire into the sars-cov- genome. the audio data is generated in real time from a variety of rna motifs that are known to be important in the functioning of rna. additionally, metadata relating to rna translation and transcription has been used to shape the auditory and visual displays. together these tools provide a unique approach to further understand the metabolism of the viral rna genome. this audio provides a further means to represent the function of the rna, in addition to traditional written and visual approaches. results: sonification of the sars-cov- genomic rna sequence results in a complex auditory stream composed of up to individual audio tracks. each auditory motive is derived from the actual rna sequence or from metadata. this approach has been used to represent transcription or translation of the viral rna genome. the display highlights the real-time interaction of functional rna elements. the sonification of codons derived from all three reading frames of the viral rna sequence, in combination with sonified metadata, provides the framework for this display. functional rna motifs such as transcription regulatory sequences and stem loop regions have also been sonified.
using the tool, audio can be generated in real time from either genomic or sub-genomic representations of the rna. given the large size of the viral genome, a collection of interactive buttons has been provided to navigate to regions of interest, such as cleavage regions in the polyprotein, untranslated regions, or each gene. these tools are available through an internet browser and the user can interact with the data display in real time. conclusion: the auditory display, in combination with real-time animation of the processes of translation and transcription, provides a unique insight into the large body of evidence describing the metabolism of the rna genome. furthermore, the tool has been used as an algorithm-based audio generator. these audio tracks can be listened to by the general community without reference to the visual display, to encourage further inquiry into the science. modern computers have had a great impact on biological experimentation and data analyses to reveal otherwise hidden patterns in complex data. this is apparent in the field of genomic data analyses. the viral genome of the first patient suffering from covid- was submitted to genbank [ ] on january , some weeks after the first patient had been hospitalised in december [ ] . within months, over . million people worldwide had tested positive for the virus and the disease referred to as covid- , with approximately , deaths reported by johns hopkins university [ ] . during this time a large body of evidence has arisen regarding rna sequence homology to other sars-like virus strains [ , ] , and these studies may help identify targets for immune recognition. this paper demonstrates that sonification of rna sequence data may also be useful to understand how the genome functions. the audio is generated using two approaches. the rules governing gene expression have been applied to the process of generating a linear audio stream, similar to the expression of a linear sequence of amino acids.
these methods are based on our previous approach to sonifying dna sequences [ ] . these methods have been improved upon to include multi-layering of related audio tracks and the inclusion of audio that is representative of sequence metadata. additionally, a real-time animated display (as shown in fig. ) of both the biological process and the notes being generated has been implemented. these displays are important since the ability to sequence dna has vastly outpaced tools for their visualisation [ ] . the real-time visual animation is an important addition since, with sonification alone, it is difficult to relate the auditory display to the underlying sequence information. the combination of the auditory and visual displays is more informative than either display in isolation. in these displays the auditory and visual output are produced by the same events, since the sequence is processed in a linear fashion, and it is thought that the multisensory integration improves the perception of each [ ] . the systematic and reproducible representation of data as sound is increasingly becoming an adjunct to the traditional visualization techniques of data inspection and analysis [ , ] . in recent years auditory displays have become more popular to represent complex biological phenomena. a systematic review of over sonification projects highlighted the importance of pitch and spatial auditory dimensions in the auditory display [ ] . within the domain of molecular biology, the properties of amino acid residues [ ] and protein folding [ ] have been sonified by a combination of musical techniques and sound effects. more recently, researchers have generated musical scores representing amino acid sequences of protein structures and note sequences from short amino acid sequences [ ] . recently these authors applied their approach to sonifying the amino acid sequence and structure of the spike protein of sars-cov- . genomic data has also provided a good candidate for sonification.
these studies include sonification of the spectral properties of dna, molecular analyses [ , ] and a preliminary investigation into rna structures [ , ] . gene expression data has been sonified to discriminate between differentially expressed genes [ , ] and chip-seq data [ ] . in the realm of cancer progression, epigenomic data has been sonified to investigate the importance of methylation [ ] . it has also been suggested that audio may be useful to interpret tomography of human adipose and tumor tissue samples [ ] . microbial ecology data has been sonified into a musical style by mapping rows of numerical data to chords [ , ] , towards the end of communicating complex results to people not specialised in the field. previous studies into dna sonification for sequence analyses [ ] demonstrated that mutations in repetitive dna sequences such as telomere or alphoid dna could be detected by ear alone and that coding regions could be distinguished from noncoding regions. the sars-cov- rna genome does not contain extensive repetitive sequences except for the ′ poly-a tail, hence this sequence provided more of a challenge for display. given that the rna genome is almost , kb in length, it would be abrasive and fatiguing to the ear to use harsh or dissonant tones for the entire auditory display. hence the decision was made to use more musical tones to generate the audio.

[fig. caption: the animated display. panel a shows the sliding window of the animated display in translation mode. key features of the animated display are labelled, such as the translated peptide sequences and the frame in which they occur; start and stop codons are highlighted in green and red, respectively. the location of the audio play-head is represented to coincide with the peptidyl-transferase centre of the ribosome. the sonified audio is generated as the sars-cov- genome sequence passes through the play-head. the direction in which the ribosome moves relative to the rna sequence is indicated. panel b shows the animated display in transcription mode. the newly synthesised minus rna strand is shown below the genome sequence with the ′ extended nucleotide shown in the play-head. the direction in which the replicase protein complex moves in relation to the genome sequence is indicated.]

the web tool described in this paper [ ] operates in two modes that broadly represent translation or transcription. the audio display is generated using algorithms based on biological rules to generate sound at the play-head. the play-head substitutes for a ribosome during translation mode or the rna replicase/polymerase during transcription mode. a complex auditory stream was generated by overlaying up to layers of audio (as summarised in table ). each layer of audio is derived directly from an rna motif, or metadata was used to flag the region of sequence to be sonified. additionally, prior to the start of each gene sequence an ascending run of notes is triggered. this scale pattern is independent of the raw sequence data and is based solely on metadata relating to sequence position. this provides an audio cue to anticipate the upcoming gene coding sequence. the most fundamental building block of rna is the individual nucleotide, and these were sonified as one of four individual notes, whereas di-nucleotides were sonified as one of notes, and together these were panned left and right in the auditory display. another characteristic of nucleic acid sequences which is often used as a metric of genome status is the gc content, which is often represented as a ratio. typically in coronaviruses the count of u is above average and c is below average, whereas a is preferred over g [ ] , leading to a relatively low gc ratio. in our approach two gc ratios were determined within sliding windows of or nucleotides, respectively, across the entire genome. each time the gc ratio changed by an increment or decrement of .
a note was generated and these were panned against each other in stereo. when there is no change between two adjacent features in an audio stream, the first instance of the feature was allowed to play for a longer period of time rather than generating another instance of the same note. this approach provides a brief pause in the audio layer and provides an opportunity for another layer to be distinguished in the auditory display. together these four audio tracks create an audio landscape that can be heard across the entire auditory display of the genome. these rna features are not specific to either transcription or translation, nor are they specific to a particular region of the genome. other sonified genome features were layered over this sonified landscape. in the translation mode, codons represent an important feature of rna and these were sonified as notes representing translation into amino acids. no distinction was made between the start methionine and those which occur in the body of the peptide sequence. additionally, stop codons were sonified as an additional note since these are highly significant in the function of the genome. overlapping codons in each of the three reading frames were sonified during translation to detect orfs in each frame. an important consideration in the modelling of translation was to use the start and stop codons in each reading frame to trigger or halt the audio derived from other codons. additionally, in the visual display the audible codons were shown using the one letter amino acid representation. using this simple method all gene sequences reported in the genbank metadata were accurately represented in both the audio and visual displays. additionally, all open reading frames throughout the rna genome are shown and sonified; however, only open reading frames that correlate with the known metadata (gene sequences) were labelled in the visual display. 
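as an illustration, the sliding-window gc trigger described above can be sketched as follows. the window size and quantisation step are illustrative assumptions rather than the tool's actual parameters, and `gcRatio` and `gcNoteEvents` are hypothetical helper names, not functions from the published code.

```javascript
// Compute the fraction of G and C bases in a sequence fragment.
function gcRatio(seq) {
  let gc = 0;
  for (const base of seq) {
    if (base === "G" || base === "C") gc += 1;
  }
  return gc / seq.length;
}

// Slide a window along the sequence and emit a note event only when the
// gc ratio crosses into a new increment bucket, as the gc layer does.
function gcNoteEvents(seq, windowSize, step) {
  const events = [];
  let lastBucket = null;
  for (let i = 0; i + windowSize <= seq.length; i++) {
    const ratio = gcRatio(seq.slice(i, i + windowSize));
    const bucket = Math.round(ratio / step); // quantise to increments of `step`
    if (lastBucket !== null && bucket !== lastBucket) {
      events.push({ position: i, ratio }); // a note would be triggered here
    }
    lastBucket = bucket;
  }
  return events;
}
```

running two instances of this with different window sizes, panned against each other, would give the paired stereo behaviour described in the text.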
this is consistent with prior approaches of mapping either individual bases [ ] , codons [ ] or amino acids [ , ] to musical notes in a manner inspired by the genetic code or codon usage during translation. in the display representing transcription, codons per se were not considered. instead, tri-nucleotide features were considered for sonification; however, these were considered to be positioned adjacent in the sequence rather than overlapping. given that there are 64 different tri-nucleotides it is not possible to use a traditional scale. a traditional piano consists of seven octaves plus a minor third (88 notes). given that there are seven scale notes in an octave, it would require over nine octaves to accommodate all 64 tri-nucleotides. using synthesised notes could overcome this limitation but this would entail playing shrill high pitched notes that would be grating to the ear. therefore, linear mapping of codons to individual notes was avoided. in the transcription display, tri-nucleotides were instead mapped to one of 16 notes since only the first and third position in each was considered. since tri-nucleotides play no functional role in the process of transcription there was no loss of information content using this approach, and the audio could be designed to complement the single nucleotides and di-nucleotides in the audio stream and avoid the mapping to shrill notes. additionally, tri-nucleotides were not mapped to start or stop functionality; they are audible throughout the entire genome and their occurrence had no further effect in the auditory display. metadata specific to the coronavirus sars-cov-2 sequence was used to supplement the audio generated from the intrinsic characteristics of the rna sequence. audio from un-translated sequences between the open reading frames was mapped to an audio stream at a reduced tempo so that they were more clearly distinguished from the coding regions. 
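the tri-nucleotide mapping described above (first and third base only, adjacent rather than overlapping) can be sketched as below; the resulting 0–15 index into a 16-note table follows from 4 × 4 combinations, though the actual pitch table of the tool is not reproduced here.

```javascript
// Index bases so a (first, third) pair maps onto one of 16 values.
const BASE_INDEX = { A: 0, U: 1, G: 2, C: 3 };

// Only the first and third base of a triplet are used, so all 64
// tri-nucleotides collapse onto 16 note indices.
function triNucleotideNote(triplet) {
  const first = BASE_INDEX[triplet[0]];
  const third = BASE_INDEX[triplet[2]];
  return first * 4 + third; // an index 0..15 into a 16-note table
}

// Walk the sequence in adjacent (non-overlapping) triplets, as in
// transcription mode, and collect the note index for each.
function transcriptionNotes(seq) {
  const notes = [];
  for (let i = 0; i + 3 <= seq.length; i += 3) {
    notes.push(triNucleotideNote(seq.slice(i, i + 3)));
  }
  return notes;
}
```

note that `triNucleotideNote("AUG")` and `triNucleotideNote("ACG")` give the same index, since the middle base is deliberately ignored.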
additionally, the viral genome is known to contain transcriptional regulatory sequences (trs) and five known stem loop (sl) structures known to play a role in the function of the genome [ ] , and their occurrence was sonified. since these conserved motifs often occur in the untranslated regions, the audio from these two features was panned in stereo. the genome codes for a large polyprotein from a large open reading frame. this polyprotein is thought to be cleaved into individual polypeptides (often referred to as nsp proteins) and the occurrence of the known cleavage sites was sonified. in addition to generating a short burst of notes, cleavage regions were also used to pause the progression of the play-head for a second or so by slowing the tempo to one tenth of the coding tempo. this effectively highlights the transition from one nsp sequence to the next. the occurrence of three or more identical nucleotides was also sonified since these are easy to detect by eye and their sound may help the user to keep track of where they are in the display. audio generated from each of these sequence motifs and metadata was combined to create a complex auditory display to represent either transcription or translation. as the audio is played, a sliding window of nucleotides is shown on the screen. at any point in time the first nucleotide in the visual play-head can be heard in the auditory display. other sequence features are determined relative to the position of this nucleotide. playing the entire genome in translation mode takes approximately 100 min, which corresponds to approximately five nucleotides per second. this is much slower than cellular translation [ ] ; however, playing any faster makes the display difficult to interpret due to the shortened duration between each note, and a different algorithm would need to be devised. 
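the homopolymer layer mentioned above, runs of three or more identical nucleotides, reduces to a simple run-length scan. this is a minimal sketch; `homopolymerRuns` is a hypothetical name and the minimum run length is parameterised for illustration.

```javascript
// Find runs of `minLength` or more identical nucleotides; each run is a
// candidate audio landmark in the display.
function homopolymerRuns(seq, minLength = 3) {
  const runs = [];
  let start = 0;
  for (let i = 1; i <= seq.length; i++) {
    // A run ends at the end of the sequence or when the base changes.
    if (i === seq.length || seq[i] !== seq[start]) {
      if (i - start >= minLength) {
        runs.push({ base: seq[start], start, length: i - start });
      }
      start = i;
    }
  }
  return runs;
}
```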
in transcription mode the full display lasts longer since the number of nucleotides played per second is a little lower; this approach was taken to clearly distinguish it from translation mode. three sets of interactive buttons (summarised in table ) have been provided for each sonified feature so that each can be selected directly; for example, a gene sequence or trs can be selected and played directly without having to play through the preceding sequence data. these buttons change to a red colour as the respective feature is displayed. in this study, auditory streams were paired and played as stereo layers. audio that plays consistently throughout the entire genome was played in a low frequency register, and transient data was highlighted in a higher frequency register to make it more prominent. in addition to simply considering the basic construction of pitch and separation, the data was harmonised to make it more listenable. the root tone and third note of the scale were played across two octaves with the limited-note mono-nucleotide sonification to establish a strong harmonic landscape throughout the playback. the drone generated from the gc content (which is sometimes invariant for periods of time) was also used to reinforce the foundation of the basic scale harmony. the g or c bases, as nucleotides, di-nucleotides or tri-nucleotides, were each matched to higher octaves, and a and u were mapped to lower octaves. this was done consistently between these audio streams in an attempt to harmonise the otherwise random note selection based on sequence information. an exception to this principle was made for start and stop codons, which were mapped to higher pitches than gc-rich codons so that their occurrence was easily perceived in the auditory display (since higher pitched notes are perceived to be louder). given that these codons are used to trigger and halt individual audio streams, this approach further emphasises the occurrence of an open reading frame. 
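the register rule described above (g/c features in higher octaves, a/u in lower ones) combined with a diatonic note table can be sketched as below. the bb aeolian table matches the mode the tool uses for translation, but the scale-degree assignments and octave numbers here are illustrative assumptions, not the published mapping.

```javascript
// The seven notes of Bb aeolian, the mode used for translation.
const BB_AEOLIAN = ["Bb", "C", "Db", "Eb", "F", "Gb", "Ab"];

// Map a base to a named note: a hypothetical scale-degree choice per base,
// with G/C placed an octave above A/U to harmonise the layers.
function noteForBase(base) {
  const degree = { A: 0, U: 2, G: 4, C: 6 }[base]; // assumed degrees
  const octave = base === "G" || base === "C" ? 4 : 3; // gc higher, au lower
  return BB_AEOLIAN[degree] + octave;
}
```

the same scheme would apply to di- and tri-nucleotide layers, keeping gc-derived notes consistently above au-derived ones across streams.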
the wider note range of the codons was used to introduce leading tones that often sound more dissonant and add complexity to the harmonic spectrum. this allows them to be easily discerned above the landscape tones of the simpler motifs. lastly, less frequent audio from dispersed regions of the genome, e.g. trs or stem-loop (sl) motifs, was pitched at the highest octave ranges or at more dissonant notes within the diatonic scale to highlight their occurrence. all of this was done within a mode of the diatonic major scale. translation was played in bb aeolian (bb, c, db, eb, f, gb, ab) whereas transcription was played in c lydian (c, d, e, f#, g, a, b). the parameters for mapping each rna feature into an audio stream are summarised in table . these choices are arbitrary and in later iterations of the tool it may be possible to choose the scale modes and key of choice. the ionian mode of the major scale was avoided since this is generally considered to be happy sounding and inappropriate for the data. each nucleotide generates a note on every beat whereas each di-nucleotide generates a note every second beat. each codon (in an orf) generates a note every third beat. together these notes are syncopated to create a characteristic sound during peptide translation that is distinct from the surrounding untranslated region. audio from the gc tracks is only triggered when the gc ratio changes by the set increment. if a note sequence has identical adjacent notes then the length of the note is extended rather than being repeated. this creates space and clarity for other notes layered in the auditory display. translation of the genomic rna leads to expression of a large polyprotein following ribosome binding to the 5′ untranslated region. however, from this genomic template the subsequent genes downstream from the polyprotein cannot be directly expressed, presumably due to the stop codon at the end of the gene. 
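the repeated-note rule described above, extending an identical adjacent note instead of re-triggering it, can be sketched as a small merging pass over a note list; `mergeRepeatedNotes` is a hypothetical name for illustration.

```javascript
// Merge identical adjacent notes into one longer note rather than
// re-triggering, which leaves space for other layers to be heard.
function mergeRepeatedNotes(notes) {
  const merged = [];
  for (const note of notes) {
    const last = merged[merged.length - 1];
    if (last && last.pitch === note) {
      last.beats += 1; // extend the previous note instead of repeating it
    } else {
      merged.push({ pitch: note, beats: 1 });
    }
  }
  return merged;
}
```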
in the display the sonification also stops at this point; however, play can be resumed to inspect the downstream sequence. additionally, the tempo of the untranslated regions is slower than that of the coding regions, so that the tempo increases as the play-head (in place of the ribosome) reads into a gene sequence. this was implemented to help the user distinguish between different sequence types during the display of translation. one of the more interesting characteristics of the viral genome is the phenomenon of discontinuous transcription, whereby a template switch occurs during the synthesis of sub-genomic negative-strand rnas [ ] . various mechanisms have been proposed to explain how the transcription regulatory sequences (trs) are involved in the synthesis of positive strand sub-genomic rna from various negative strand intermediates [ ] . trs sequences are located in the untranslated regions between the genes, and one model suggests that these facilitate transcription skipping to the trs sequence located in the 5′ untranslated region. this process is driven by complementary interactions between trs regions to add a copy of the leader sequence to form sub-genomic rna species. in these sub-genomic rnas the polyprotein sequence has been omitted and ribosome binding at the 5′ end can read through and express the contiguous downstream gene sequence [ ] . this functional behaviour of the rna has been built into the auditory and visual display. by default, the process of auditory translation runs from the 5′ end through to the stop codon at the end of the polyprotein, whereas transcription runs the full length of the rna beginning at the 3′ end. a toggle switch, labelled 'translate as sub-genomic rna', has been implemented to change these behaviours. when the toggle switch is selected during transcription mode, the play-head will skip from any upcoming trs region to the leader trs located in the 5′ region (mimicking the behaviour of the rna replicase). 
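the transcription-mode trs skip described above reduces to a simple play-head position update. this is a minimal sketch under assumed genome coordinates; the function name and positions are hypothetical and direction of travel is abstracted away.

```javascript
// Advance the play-head by one nucleotide, except that in sub-genomic mode
// reaching a body TRS makes it jump to the leader TRS, mimicking the
// template switch of the replicase during discontinuous transcription.
function nextPosition(pos, leaderTrsPos, bodyTrsPositions, subGenomicMode) {
  if (subGenomicMode && bodyTrsPositions.includes(pos)) {
    return leaderTrsPos; // jump to the leader TRS in the 5' region
  }
  return pos + 1; // otherwise continue along the template
}
```

with the toggle off, the same function simply walks the full genome, which corresponds to the default genome-replication behaviour.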
subsequently, in translation mode with the toggle activated the play-head will, by way of example, skip from the leader trs (omitting the polyprotein) through to the trs region adjacent to the start of the s protein. whilst the metadata used to drive this behaviour does not change the characteristics of the sound, it does change the selection of which regions are sonified. the website does not rely on a server; instead, the entire rna sequence is downloaded into the client browser when the page is loaded. all code is written in javascript and runs within the client browser. the react framework was used to create the environment state, whereby each iteration of state represents a sliding window advanced to the next base. redux was also used to help manage state. audio is generated in real time within the client browser using tone.js. the reactronica framework [ ] was used to further manage audio within the environment state. translation of the viral polyprotein is known to be subject to a frameshift, and since this does not follow the normal rules of gene expression a conditional expression was used to change the display for that instance so that the translated protein shifts from one frame to the next in both the visual and auditory display. to understand the function of the viral plus rna strand, the information needs to be processed in the 5′ to 3′ direction during translation and in the reverse 3′ to 5′ direction for transcription (whereby nucleotides are added to the newly synthesised minus strand at its 3′ end). in this study an auditory display of the sequence was generated with a sliding window moving in either direction. processing of information within the sliding window was used to generate a synchronised auditory and visual display. this is advantageous since it mimics the behaviour of biological processes within the cell. to further emulate translation, the generation of audio was triggered by start codons and silenced by stop codons. 
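the start/stop gating rule described above, applied independently in each of the three reading frames, can be sketched as below; `audibleCodonMask` is a hypothetical name, and for brevity the sketch gates all codon audio on an in-frame start (the tool additionally sonifies isolated stop codons).

```javascript
// RNA stop codons.
const STOPS = new Set(["UAA", "UAG", "UGA"]);

// For each of the three reading frames, open the codon audio stream at an
// AUG and close it at the next in-frame stop codon. Returns a boolean per
// sequence position marking codon starts that would be sonified.
function audibleCodonMask(seq) {
  const audible = new Array(seq.length).fill(false);
  for (let frame = 0; frame < 3; frame++) {
    let open = false;
    for (let i = frame; i + 3 <= seq.length; i += 3) {
      const codon = seq.slice(i, i + 3);
      if (!open && codon === "AUG") open = true; // start codon opens the stream
      if (open) {
        audible[i] = true; // this codon would be sonified
        if (STOPS.has(codon)) open = false; // stop codon closes the stream
      }
    }
  }
  return audible;
}
```

panning each frame's sub-track left, centre or right, as the figure caption describes, would then separate the three frames in stereo.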
furthermore, the visual display shows all possible peptide sequences and these are aligned with the rna sequence being processed. from the sequence data alone the tool was able to detect and display all known open reading frames, and metadata was used solely to label these in the display. other open reading frames were detected throughout the genome in the displays; however, since these are not downstream of an in-frame ribosome binding site no claim is made that these are actually translated. high resolution analysis of gene expression in coronavirus genomes has detected ribosome protected fragments which map to non-canonical orfs; these may be novel protein-coding orfs and short regulatory uorfs. the tool highlights the occurrence of one such uorf (including its stop codon) in the 5′ untranslated region downstream of the trs [ ] that is not documented in the genbank metadata. an image of the raw wave files and their relationship to the sequence information for this region is shown in fig. . non-standard uorfs such as this have been detected as translation products in rna sequencing and ribosome profiling experiments, which alludes to the complexity of gene regulation [ ] . for this reason, all open reading frames are included in the display. this uorf is clearly represented in the additional file 'sonification untranslated ends' (mp file), whereby early in the auditory display of the 5′ untranslated region a high-pitched start codon introduces a short sequence of layered audio that is punctuated a few seconds later by another high-pitched note as the layered audio ends. this can also be observed in the accompanying example (mp file) as a nine amino acid residue sequence in one of the reading frames. the 5′ untranslated region is also characterised by the distinctive sound of the trs sequence early in the audio display.

fig. multitrack wave files representing a portion of an auditory display. these tracks play in unison to generate the auditory display and each represents the same stretch of nucleotides. this sequence is located in the 5′ untranslated region and includes a trs region and a uorf. each audio stream was generated from a different algorithm; only nucleotides that gave rise to audio are shown (the entire nucleotide sequence is shown in the single-nucleotide track). in the single-nucleotide track, each nucleotide generates a note on every beat unless it is a repeat of the previous one, in which case the length of the note is extended. in the di-nucleotide track, each di-nucleotide generates a note every second beat. in the two gc tracks, audio is only triggered when the gc ratio changes by the set increment; each change in the gc ratio is indicated by a plus (+) or minus (−) symbol on the wave files. in the codon track, only codon sequences beginning with a start codon (aug) are shown through to the next stop codon (e.g. uaa); isolated stop codons also give rise to a note. this track is a compilation of audio from three sub-tracks, each representing a different reading frame, and notes in this track are panned left, centre or right, respectively. the trs track represents the audio generated from metadata that indicates the location of a trs region; additionally, the consensus sequence within this region is coloured purple in the visual display. the final track represents audio generated by the occurrence of three nucleotides of the same type. other data tracks are not represented since no audio was generated in these during processing of this sequence of the genome. additionally, the amino acid sequence of the orf is shown in the codon track.

similarly, three short orfs are apparent in the 3′ untranslated region of the example, following the high-pitched repetitive pattern of the sl region. these two untranslated regions were manually played one after the other during the same auditory display using the navigation buttons. 
since they are both characterised by the absence of long open reading frames they provide a good introduction to the basic sound of the auditory display, over which the highlighted notes from other rna features will be layered. the additional file 'sonification utr to surface glycoprotein' represents the sonification of a sub-genomic rna. for this run the 'translate as sub-genomic rna' checkbox was selected to mimic translation from one of the products of discontinuous transcription, a process upon which viral gene expression is reliant. sonification of the entire genome in either direction results in an auditory display lasting well over an hour. selecting the 'translate as sub-genomic rna' checkbox results in a shorter auditory display since shorter regions of rna are processed. this example again plays from the beginning of the plus strand sequence (as does the previous example); however, in this display the play-head skips from the leader trs to the body trs and immediately into the orf of the surface glycoprotein (skipping a portion of the untranslated region and all of the polyprotein sequence). the display highlights that the previously discussed uorf is skipped in the 5′ leader of the sub-genomic rna from which the genes downstream of the polyprotein are translated. the audio diverges from the previous example after a few seconds, once the layered sound of the surface glycoprotein (an open reading frame) begins, and it continues to play for the rest of the display. portions of the two stereo waveforms of the two displays are shown in fig. . to the left of the cursor both stereo waveforms are essentially the same, whereas to the right of the cursor the audio displays have clearly diverged as different sequences were processed beyond the trs. the visual display shows that this pattern continues for a substantial number of amino acid residues. 
whilst this may only be an artefact of the analyses rather than an undocumented protein sequence, it does demonstrate that the auditory display is capable of detecting unusual features in the genome. it is also worth noting that frame shifts do occur in the polyprotein sequence through a process that is not fully understood, giving rise to a protein sequence that does not follow canonical gene expression patterns. the tool highlights the position of other relatively long open reading frames within the display so that they can be considered in the analysis of genome function. the nucleocapsid phosphoprotein sequence is followed by another orf sequence, which is about one third the length of this parallel orf. this analysis also highlights one of the properties of the auditory display, which is that data in the three possible reading frames gives rise to a triplet note pattern whereas data in two reading frames gives rise to a duplet note pattern. these note patterns make it easier to distinguish the features in the auditory display. towards the end of the display the genome alternates between gene sequences, transcription regulatory sequences, orfs, stem loop structures and untranslated regions. these features have been further annotated in the video file with circles and arrows to emphasise their occurrence in the combined visual and auditory display. in the supplementary example 'sonification sub-genomic rna' the auditory display represents the process of transcription. the tempo and scale patterns used for these displays are distinct from those used to represent translation. additionally, no attention was paid to the occurrence of open reading frames or codon usage patterns since these play no role in genome replication or transcription. metadata relating to sl and trs elements was sonified; however, cleavage information relating to the polyprotein modification did not seem relevant. 
the resulting auditory display is therefore simpler than that arising from translation. this can be heard in the example which begins with sonification of the poly-a tail. in this example the play-head skips from trs to trs, which models the behaviour of discontinuous transcription. there is a check box on the page to switch between normal genome replication (whereby the entire genome would be sonified) and discontinuous transcription. the additional example files in the supplementary material include regions already described in the previous auditory displays. however, in these examples various streams of audio that contribute to the auditory display have been toggled on and off. checkboxes are provided on the web page to facilitate this, on the left-hand side of the note display table. the reason for this is two-fold. first, it provides a method to delineate how each feature of the rna genome contributes to the auditory display. for instance, the sound of a trs element or open reading frames can be highlighted (soloed) or excluded (muted) from the overall sound of the translation display. this provides a better understanding of how the auditory displays are constructed. secondly, for those who are less interested in the science of coronaviruses and who are more interested in algorithmic music generation, these tools can be used to compose and modify the inherent audio stream. the first of these, example 'remix utr through to polyprotein', highlights the contribution that gc content makes to the audio stream since these features are soloed at the beginning of the display. at one minute into the display, audio from the translated amino acids is also toggled on and off to highlight its contribution. example 'remix orf to the poly-a tail' highlights the off-beat syncopation between the di-nucleotides playing every second beat against codons playing every third beat. 
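the solo/mute checkboxes described above can be sketched with the usual mixer convention: if any layer is soloed, only soloed layers sound; otherwise every unmuted layer sounds. this convention is an assumption borrowed from audio mixers, not a detail confirmed by the text, and `audibleLayers` is a hypothetical name.

```javascript
// Given layer states, return the names of the layers that should sound.
function audibleLayers(layers) {
  const anySolo = layers.some((l) => l.solo);
  return layers
    .filter((l) => (anySolo ? l.solo : !l.mute))
    .map((l) => l.name);
}
```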
lastly, the final example highlights how important it is to continually sonify individual nucleotides across the sequence, since this provides a sonic landscape over which to overlay the other features. to emphasise this, the individual bases were soloed at the beginning and excluded at the end of the display. all previously mentioned example files have been uploaded as supplementary files. in addition to using the tool to navigate and inspect the function of the genome, it has been used to generate isolated audio in both translation and transcription modes. a playlist of four tracks has been uploaded to soundcloud. these audio tracks are to be listened to without reference to the visual display. the intention of this is to engage the non-specialised community with the concept of 'the sound of the coronavirus genome' and hopefully encourage people to delve a little deeper into the ideas behind the concept. without the context of the display and without a clear understanding of the molecular biology of rna viruses, the audio has to engage purely on its own sonic qualities, as an example of algorithmic music. in translation mode two auditory displays were prepared: the first (covid-19 translation polyprotein) plays through to the end of the polyprotein. the second audio track, from a sub-genomic rna (covid-19 translation discontinuous), skips the polyprotein entirely, to the untranslated region prior to the surface glycoprotein, and then plays through to the 3′ poly-a tail; it covers far fewer nucleotides and is correspondingly shorter. in addition, two audio tracks were generated representing transcription/rna synthesis. the track representing genome replication (covid-19 transcription) is the longest of the four, whereas the track representing discontinuous transcription (covid-19 transcription discontinuous), which skips between trs regions, is by far the shortest. 
this paper extends prior work whereby dna was sonified using the rules of gene expression to generate an auditory display. previously, an individual algorithm was used to produce an individual stream of audio from nucleotides, di-nucleotides or codons, and it was concluded that the sonification of codons was the most useful to identify mutations in repetitive dna or to distinguish coding regions from non-coding regions [ ] . here we layer multiple streams of audio, each relating to an rna feature of interest. these include metadata to layer rna features such as consensus sequences in trs regions, sl regions, cleavage sites in the polyprotein and interspersed untranslated regions between characterised orfs. this approach produces a more detailed and rich auditory display which acts as a viable complement to an animated visual display. metadata was also used to affect the behaviour of the display to mimic what is known to occur during the coronavirus life cycle. the polyprotein is the only product to be translated from the genomic rna since this is thought to be the only orf that has access to the ribosomal binding site in the 5′ untranslated region. to mimic this, the default behaviour of the tool is to stop at the in-frame stop codon at the end of the polyprotein. the tool can be restarted at the adjacent untranslated region or elsewhere using the navigation buttons. 
the default behaviour in transcription mode is to read through the entire genome sequence from end to end to mimic genome replication to produce the complementary minus strand. a toggle switch has been implemented to mimic discontinuous transcription, and in the first instance the polymerase will jump from a body trs to the trs in the 5′ leader region. this can be overridden using the navigation buttons, but if another trs region is encountered by the play-head it will also jump to the trs in the leader region. in translation mode the same toggle causes the ribosome play-head to skip the polyprotein, skipping from the leader trs to the body trs and into the surface glycoprotein sequence. from this point the play-head will continue to the 3′ end, reading through the remainder of the genome. all other stop codons will be sonified but they will not influence the progression of the play-head. the auditory display in combination with real-time animation provides a unique insight into the large body of evidence describing the metabolism of the rna genome.

acknowledgements aidan temple set up the react and redux frameworks and helped with technical aspects of javascript programming. kaho chung developed the reactronica framework used to implement tone.js within react.

references
a new coronavirus associated with human respiratory disease in china
a sequence homology and bioinformatic approach can predict candidate targets for immune responses to sars-cov-2
the molecular biology of coronaviruses
an auditory display tool for dna sequence analysis
org: a serverless web tool for dna sequence visualization
biases in visual, auditory, and audiovisual perception of space
polyphonic sonification of electrocardiography signals for diagnosis of cardiac pathologies
a novel sonification approach to support the diagnosis of alzheimer's dementia
a systematic review of mapping strategies for the sonification of physical quantities
using non-speech sounds to convey molecular properties in a virtual environment
melody discrimination and protein fold classification
sonification based de novo protein design using artificial intelligence, structure prediction, and analysis using molecular modeling
autoregressive modeling and feature analysis of dna sequences
molecular music: the acoustic conversion of molecular vibrational spectra
supplementary material for "browsing rna structures by interactive sonification"
browsing rna structures by interactive sonification
a short treatise concerning a musical approach for the interpretation of gene expression data
gene expression music algorithm-based characterization of the ewing sarcoma stem cell signature
chromas from chromatin: sonification of the epigenome
musical patterns for comparative epigenomics
sonification of optical coherence tomography data and images
microbial bebop: creating music from complex dynamics in microbial ecology
more of an art than a science: using microbial dna sequences to compose music
real-time audio and visual display of the covid-19 genome
on the biased nucleotide composition of the human coronavirus rna genome
basically musical
the sound of the dna language
snare dance: a musical interpretation of atg transport to the tubulovesicular cluster
conversion of amino-acid sequence in proteins to classical music: search for auditory patterns
the structure and functions of coronavirus genomic 5′ and 3′ ends
translatomics: the global view of translation
continuous and discontinuous rna synthesis in coronaviruses
viral and cellular mrna translation in coronavirus-infected cells
high-resolution analysis of coronavirus gene expression by rna sequencing and ribosome profiling
this provides another useful tool in the domain of genome browsers to further understand the complex function of the viral genome. project name: real-time audio and visual display of the covid- genome. project home page: https://coronavirus.dnasonification.org/. operating system: platform independent. programming language: javascript (ecmascript ). other requirements: none. license: gnu gplv . any restrictions to use by non-academics: no restrictions. supplementary information accompanies this paper at https://doi.org/ . /s - - - . additional file : example sonification untranslated ends. additional file : example sonification of the nucleocapsid phosphoprotein. additional file : example remix utr through to polyprotein. additional file : example remix orf to the poly-a tail. abbreviations: sl: stem-loop; trs: transcription regulatory sequences; orf: open reading frame. mdt devised the project and algorithms, wrote the code for the sonification website and put together all aspects of this manuscript. this work was made possible through funding from the school of science, western sydney university. the funding body played no role in the design or conclusion of this study. source code is available on https://github.com/marktemple/coronavirus-sonification. audio tracks generated by the tool for a non-specialised audience are available on a soundcloud playlist. this playlist includes four tracks: covid- translation polyprotein, covid- translation discontinuous, covid- transcription, and covid- transcription discontinuous. this playlist is available on https://soundcloud.com/templemark/sets/the-sound-of-the-coronavirus. not applicable. the authors declare that they have no competing interests. received: may accepted: september key: cord- - otxft authors: altman, russ b.; mooney, sean d. title: bioinformatics date: journal: biomedical informatics doi: .
/ - - - _ sha: doc_id: cord_uid: otxft why is sequence, structure, and biological pathway information relevant to medicine? where on the internet should you look for a dna sequence, a protein sequence, or a protein structure? what are two problems encountered in analyzing biological sequence, structure, and function? how has the age of genomics changed the landscape of bioinformatics? what two changes should we anticipate in the medical record as a result of these new information sources? what are two computational challenges in bioinformatics for the future? molecular biology and genomics have increased dramatically in the past decade. history has shown that scientific developments within the basic sciences tend to lag about a decade before their influence on clinical medicine is fully appreciated. the types of information being gathered by biologists today will drastically alter the types of information and technologies available to the health care workers of tomorrow. there are three sources of information that are revolutionizing our understanding of human biology and that are creating significant challenges for computational processing. the most dominant new type of information is the sequence information produced by the human genome project, an international undertaking intended to determine the complete sequence of human dna as it is encoded in each of the chromosomes. the first draft of the sequence was published in (lander et al., ) and a final version was announced in coincident with the th anniversary of the solving of the watson and crick structure of the dna double helix. now efforts are under way to finish the sequence and to determine the variations that occur between the genomes of different individuals. essentially, the entire set of events from conception through embryonic development, childhood, adulthood, and aging are encoded by the dna blueprints within most human cells.
given a complete knowledge of these dna sequences, we are in a position to understand these processes at a fundamental level and to consider the possible use of dna sequences for diagnosing and treating disease. while we are studying the human genome, a second set of concurrent projects is studying the genomes of numerous other biological organisms, including important experimental animal systems (such as mouse, rat, and yeast) as well as important human pathogens (such as mycobacterium tuberculosis or haemophilus influenzae). many of these genomes have recently been completely determined by sequencing experiments. these allow two important types of analysis: the analysis of mechanisms of pathogenicity and the analysis of animal models for human disease. in both cases, the functions encoded by genomes can be studied, classified, and categorized, allowing us to decipher how genomes affect human health and disease. these ambitious scientific projects not only are proceeding at a furious pace, but also are accompanied in many cases by a new approach to biology, which produces a third new source of biomedical information: proteomics. in addition to small, relatively focused experimental studies aimed at particular molecules thought to be important for disease, large-scale experimental methodologies are used to collect data on thousands or millions of molecules simultaneously. scientists apply these methodologies longitudinally over time and across a wide variety of organisms or (within an organism) organs to watch the evolution of various physiological phenomena. new technologies give us the abilities to follow the production and degradation of molecules on dna arrays (lashkari et al., ), to study the interactions of large numbers of proteins with one another (bai and elledge, ), and to create multiple variations on a genetic theme to explore the implications of various mutations on biological function (spee et al., ).
all these technologies, along with the genome-sequencing projects, are conspiring to produce a volume of biological information that at once contains secrets to age-old questions about health and disease and threatens to overwhelm our current capabilities of data analysis. thus, bioinformatics is becoming critical for medicine in the twenty-first century. the effects of this new biological information on clinical medicine and clinical informatics are difficult to predict precisely. it is already clear, however, that some major changes to medicine will have to be accommodated. with the first set of human genomes now available, it will soon become cost-effective to consider sequencing or genotyping at least sections of many other genomes. the sequence of a gene involved in disease may provide the critical information that we need to select appropriate treatments. for example, the set of genes that produces essential hypertension may be understood at a level sufficient to allow us to target antihypertensive medications based on the precise configuration of these genes. it is possible that clinical trials may use information about genetic sequence to define precisely the population of patients who would benefit from a new therapeutic agent. finally, clinicians may learn the sequences of infectious agents (such as of the escherichia coli strain that causes recurrent urinary tract infections) and store them in a patient's record to record the precise pathogenicity and drug susceptibility observed during an episode of illness. in any case, it is likely that genetic information will need to be included in the medical record and will introduce special problems. raw sequence information, whether from the patient or the pathogen, is meaningless without context and thus is not well suited to a printed medical record. like images, it can come in high information density and must be presented to the clinician in novel ways.
as there are for laboratory tests, there may be a set of nondisease (or normal) values to use as comparisons, and there may be difficulties in interpreting abnormal values. fortunately, most of the human genome is shared and identical among individuals; less than percent of the genome seems to be unique to individuals. nonetheless, the effects of sequence information on clinical databases will be significant. . new diagnostic and prognostic information sources. one of the main contributions of the genome-sequencing projects (and of the associated biological innovations) is that we are likely to have unprecedented access to new diagnostic and prognostic tools. single nucleotide polymorphisms (snps) and other genetic markers are used to identify how a patient's genome differs from the draft genome. diagnostically, the genetic markers from a patient with an autoimmune disease, or of an infectious pathogen within a patient, will be highly specific and sensitive indicators of the subtype of disease and of that subtype's probable responsiveness to different therapeutic agents. for example, the severe acute respiratory syndrome (sars) virus was determined to be a corona virus using a gene expression array containing the genetic information from several common pathogenic viruses. in general, diagnostic tools based on the gene sequences within a patient are likely to increase greatly the number and variety of tests available to the physician. physicians will not be able to manage these tests without significant computational assistance. moreover, genetic information will be available to provide more accurate prognostic information to patients. what is the standard course for this disease? how does it respond to these medications? over time, we will be able to answer these questions with increasing precision, and will develop computational systems to manage this information. 
several genotype-based databases have been developed to identify markers that are associated with specific phenotypes and identify how genotype affects a patient's response to therapeutics. the human gene mutation database (hgmd) annotates mutations with disease phenotype. this resource has become invaluable for genetic counselors, basic researchers, and clinicians. additionally, the pharmacogenomics knowledge base (pharmgkb) collects genetic information that is known to affect a patient's response to a drug. as these data sets, and others like them, continue to improve, the first clinical benefits from the genome projects will be realized. . ethical considerations. one of the critical questions facing the genome-sequencing projects is "can genetic information be misused?" the answer is certainly yes. with knowledge of a complete genome for an individual, it may be possible in the future to predict the types of disease for which that individual is at risk years before the disease actually develops. if this information fell into the hands of unscrupulous employers or insurance companies, the individual might be denied employment or coverage due to the likelihood of future disease, however distant. there is even debate about whether such information should be released to a patient even if it could be kept confidential. should a patient be informed that he or she is likely to get a disease for which there is no treatment? this is a matter of intense debate, and such questions have significant implications for what information is collected and for how and to whom that information is disclosed (durfy, ; see chapter ). a brief review of the biological basis of medicine will bring into focus the magnitude of the revolution in molecular biology and the tasks that are created for the discipline of bioinformatics.
the genetic material that we inherit from our parents, that we use for the structures and processes of life, and that we pass to our children is contained in a sequence of chemicals known as deoxyribonucleic acid (dna). the total collection of dna for a single person or organism is referred to as the genome. dna is a long polymer chemical made of four basic subunits. the sequence in which these subunits occur in the polymer distinguishes one dna molecule from another, and the sequence of dna subunits in turn directs a cell's production of proteins and all other basic cellular processes. genes are discrete units encoded in dna and they are transcribed into ribonucleic acid (rna), which has a composition very similar to dna. genes are transcribed into messenger rna (mrna) and a majority of mrna sequences are translated by ribosomes into protein. not all rnas are messengers for the translation of proteins. ribosomal rna, for example, is used in the construction of the ribosome, the huge molecular engine that translates mrna sequences into protein sequences. understanding the basic building blocks of life requires understanding the function of genomic sequences, genes, and proteins. when are genes turned on? once genes are transcribed and translated into proteins, into what cellular compartment are the proteins directed? how do the proteins function once there? equally important, how are the proteins turned off? experimentation and bioinformatics have divided the research into several areas, and the largest are: (1) genome and protein sequence analysis, (2) macromolecular structure-function analysis, (3) gene expression analysis, and (4) proteomics. practitioners of bioinformatics have come from many backgrounds, including medicine, molecular biology, chemistry, physics, mathematics, engineering, and computer science. it is difficult to define precisely the ways in which this discipline emerged.
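The flow from dna to mrna to protein described above can be illustrated with a minimal sketch. The gene sequence is invented and the codon table is deliberately truncated to the few codons the example uses; the real genetic code has 64 codons.

```python
# Minimal sketch of the central dogma: DNA -> mRNA -> protein.
# The codon table is truncated to the codons used in the example;
# a complete implementation would include all 64 codons.

CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop
}

def transcribe(dna: str) -> str:
    """Transcribe a DNA coding strand into mRNA (T -> U)."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate mRNA codons into a protein string, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

gene = "ATGGCTAAATGGTAA"            # hypothetical DNA coding strand
print(translate(transcribe(gene)))  # MAKW
```

Ribosomal rna and the other non-messenger rnas mentioned above would, of course, never pass through the `translate` step.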
there are, however, two main developments that have created opportunities for the use of information technologies in biology. the first is the progress in our understanding of how biological molecules are constructed and how they perform their functions. this dates back as far as the s with the invention of electrophoresis, and then in the s with the elucidation of the structure of dna and the subsequent sequence of discoveries in the relationships among dna, rna, and protein structure. the second development has been the parallel increase in the availability of computing power. starting with mainframe computer applications in the s and moving to modern workstations, there have been hosts of biological problems addressed with computational methods. the human genome project was completed and a nearly finished sequence was published in . the benefit of the human genome sequence to medicine is both in the short and in the long term. the short-term benefits lie principally in diagnosis: the availability of sequences of normal and variant human genes will allow for the rapid identification of these genes in any patient (e.g., babior and matzner, ) . the long-term benefits will include a greater understanding of the proteins produced from the genome: how the proteins interact with drugs; how they malfunction in disease states; and how they participate in the control of development, aging, and responses to disease. the effects of genomics on biology and medicine cannot be understated. we now have the ability to measure the activity and function of genes within living cells. genomics data and experiments have changed the way biologists think about questions fundamental to life. where in the past, reductionist experiments probed the detailed workings of specific genes, we can now assemble those data together to build an accurate understanding of how cells work. this has led to a change in thinking about the role of computers in biology. 
before, they were optional tools that could help provide insight to experienced and dedicated enthusiasts. today, they are required by most investigators, and experimental approaches rely on them as critical elements. twenty years ago, the use of computers was proving to be useful to the laboratory researcher. today, computers are an essential component of modern research. this is because advances in research methods such as microarray chips, drug screening robots, x-ray crystallography, nuclear magnetic resonance spectroscopy, and dna sequencing experiments have resulted in massive amounts of data. these data need to be properly stored, analyzed, and disseminated. the volume of data being produced by genomics projects is staggering. there are now more than . million sequences in genbank comprising more than billion digits. but these data do not stop with sequence data: pubmed contains over million literature citations, the pdb contains three-dimensional structural data for over , protein sequences, and the stanford microarray database (smd) contains over , experiments ( million data points). these data are of incredible importance to biology, and in the following sections we introduce and summarize the importance of sequences, structures, gene expression experiments, systems biology, and their computational components to medicine. sequence information (including dna sequences, rna sequences, and protein sequences) is critical in biology: dna, rna, and protein can be represented as a set of sequences of basic building blocks (bases for dna and rna, amino acids for proteins). computer systems within bioinformatics thus must be able to handle biological sequence information effectively and efficiently. one major difficulty within bioinformatics is that standard database models, such as relational database systems, are not well suited to sequence information. 
the basic problem is that sequences are important both as a set of elements grouped together and treated in a uniform manner and as individual elements, with their relative locations and functions. any given position in a sequence can be important because of its own identity, because it is part of a larger subsequence, or perhaps because it is part of a large set of overlapping subsequences, all of which have different significance. it is necessary to support queries such as, "what sequence motifs are present in this sequence?" it is often difficult to represent these multiple, nested relationships within standard relational database schema. in addition, the neighbors of a sequence element are also critical, and it is important to be able to perform queries such as, "what sequence elements are seen elements to the left of this element?" for these reasons, researchers in bioinformatics are developing object-oriented databases (see chapter ) in which a sequence can be queried in different ways, depending on the needs of the user (altman, ) . the sequence information mentioned in section . . is rapidly becoming inexpensive to obtain and easy to store. on the other hand, the three-dimensional structure information about the proteins that are produced from the dna sequences is much more difficult and expensive to obtain, and presents a separate set of analysis challenges. currently, only about , three-dimensional structures of biological macromolecules are known. these models are incredibly valuable resources, however, because an understanding of structure often yields detailed insights about biological function. as an example, the structure of the ribosome has been determined for several species and contains more atoms than any other to date. this structure, because of its size, took two decades to solve, and presents a formidable challenge for functional annotation (cech, ) . 
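A query such as "what sequence motifs are present in this sequence?" can be sketched with a simple pattern scan. The pattern below is the classic PROSITE-style n-glycosylation motif N-{P}-[ST]-{P} written as a regular expression; the protein sequence is an invented example.

```python
import re

# Minimal sketch of a motif query over a protein sequence. The pattern is
# the N-glycosylation motif N-{P}-[ST]-{P} (asparagine, anything but
# proline, serine or threonine, anything but proline) as a regex.
N_GLYC = re.compile(r"N[^P][ST][^P]")

def find_motifs(seq: str, pattern: re.Pattern) -> list[tuple[int, str]]:
    """Return (position, matched subsequence) for every motif occurrence."""
    return [(m.start(), m.group()) for m in pattern.finditer(seq)]

seq = "MKNVSAANGSAPLK"  # invented example sequence
print(find_motifs(seq, N_GLYC))
```

Note that this treats the sequence purely as a string; the relational-modeling difficulty discussed above arises when each matched position must also be linked to annotations, neighboring elements, and overlapping features.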
yet, the functional information for a single structure is vastly outsized by the potential for comparative genomics analysis between the structures from several organisms and from varied forms of the functional complex, since the ribosome is ubiquitously required for all forms of life. thus a wealth of information comes from relatively few structures. to address the problem of limited structure information, the publicly funded structural genomics initiative aims to identify all of the common structural scaffolds found in nature and grow the number of known structures considerably. in the end, it is the physical forces between molecules that determine what happens within a cell; thus the more complete the picture, the better the functional understanding. in particular, understanding the physical properties of therapeutic agents is the key to understanding how agents interact with their targets within the cell (or within an invading organism). these are the key questions for structural biology within bioinformatics: (1) how can we analyze the structures of molecules to learn their associated function? approaches range from detailed molecular simulations (levitt, ) to statistical analyses of the structural features that may be important for function (wei and altman, ) (for more information see http://www.rcsb.org/pdb/). (2) how can we extend the limited structural data by using information in the sequence databases about closely related proteins from different organisms (or within the same organism, but performing a slightly different function)? there are significant unanswered questions about how to extract maximal value from a relatively small set of examples. (3) how should structures be grouped for the purposes of classification? the choices range from purely functional criteria ("these proteins all digest proteins") to purely structural criteria ("these proteins all have a toroidal shape"), with mixed criteria in between.
one interesting resource available today is the structural classification of proteins (scop), which classifies proteins based on shape and function. the development of dna microarrays has led to a wealth of data and unprecedented insight into the fundamental biological machine. the premise is relatively simple; up to , gene sequences derived from genomic data are fixed onto a glass slide or filter. an experiment is performed where two groups of cells are grown in different conditions, one group is a control group and the other is the experimental group. the control group is grown normally, while the experimental group is grown under experimental conditions. for example, a researcher may be trying to understand how a cell compensates for a lack of sugar. the experimental cells will be grown with limited amounts of sugar. as the sugar depletes, some of the cells are removed at specific intervals of time. when the cells are removed, all of the mrna from the cells is separated and converted back to dna, using special enzymes. this leaves a pool of dna molecules that are only from the genes that were turned on (expressed) in that group of cells. using a chemical reaction, the experimental dna sample is attached to a red fluorescent molecule and the control group is attached to a green fluorescent molecule. these two samples are mixed and then washed over the glass slide. the two samples contain only genes that were turned on in the cells, and they are labeled either red or green depending on whether they came from the experimental group or the control group. the labeled dna in the pool sticks or hybridizes to the same gene on the glass slide. this leaves the glass slide with up to , spots and genes that were turned on in the cells are now bound with a label to the appropriate spot on the slide. using a scanning confocal microscope and a laser to fluoresce the linkers, the amount of red and green fluorescence in each spot can be measured. 
the ratio of red to green determines whether that gene is being turned off (downregulated) in the experimental group or whether the gene is being turned on (upregulated). the experiment has now measured the activity of genes in an entire cell due to some experimental change. figure . illustrates a typical gene expression experiment from smd. computers are critical for analyzing these data, because it is impossible for a researcher to comprehend the significance of those red and green spots. currently scientists are using gene expression experiments to study how cells from different organisms compensate for environmental changes, how pathogens fight antibiotics, and how cells grow uncontrollably (as is found in cancer). a new challenge for biological computing is to develop methods to analyze these data, tools to store these data, and computer systems to collect the data automatically. with the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. the basic algorithms for analyzing sequence and structure are now leading to opportunities for more integrated analysis of the pathways in which these molecules participate and ways in which molecules can be manipulated for the purpose of combating disease. a detailed understanding of the role of a particular molecule in the cell requires knowledge of the context, that is, of the other molecules with which it interacts, and of the sequence of chemical transformations that take place in the cell.
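The red-to-green ratio interpretation described above can be sketched as a small function. The two-fold cutoff (one unit in log2 space) and the intensity values are illustrative choices, not fixed conventions of microarray analysis.

```python
import math

# Minimal sketch of calling a gene up- or down-regulated from the red
# (experimental) and green (control) fluorescence of one microarray spot.
# The cutoff of 1.0 (a two-fold change in log2 space) is illustrative.

def call_regulation(red: float, green: float, cutoff: float = 1.0) -> str:
    """Classify a spot by its log2(red/green) intensity ratio."""
    log_ratio = math.log2(red / green)
    if log_ratio >= cutoff:
        return "upregulated"
    if log_ratio <= -cutoff:
        return "downregulated"
    return "unchanged"

print(call_regulation(red=4000, green=1000))  # upregulated
print(call_regulation(red=500, green=2100))   # downregulated
print(call_regulation(red=1200, green=1000))  # unchanged
```

A real pipeline would first normalize the two channels and handle near-background intensities before taking ratios; this sketch shows only the ratio logic itself.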
thus, major research areas in bioinformatics are elucidating the key pathways by which chemicals are transformed, defining the molecules that catalyze these transformations, identifying the input compounds and the output compounds, and linking these pathways into networks that we can then represent computationally and analyze to understand the significance of a particular molecule. the alliance for cell signaling is generating large amounts of data related to how signal molecules interact and affect the concentration of small molecules within the cell. there are a number of common computations that are performed in many contexts within bioinformatics. in general, these computations can be classified as sequence alignment, structure alignment, pattern analysis of sequence/structure, gene expression analysis, and pattern analysis of biochemical function. as it became clear that the information from dna and protein sequences would be voluminous and difficult to analyze manually, algorithms began to appear for automating the analysis of sequence information. the first requirement was to have a reliable way to align sequences so that their detailed similarities and distances could be examined directly. needleman and wunsch ( ) published an elegant method for using dynamic programming techniques to align sequences in time related to the cube of the number of elements in the sequences. smith and waterman ( ) published refinements of these algorithms that allowed for searching both the best global alignment of two sequences (aligning all the elements of the two sequences) and the best local alignment (searching for areas in which there are segments of high similarity surrounded by regions of low similarity).
a key input for these algorithms is a matrix that encodes the similarity or substitutability of sequence elements: when there is an inexact match between two elements in an alignment of sequences, it specifies how much "partial credit" we should give the overall alignment based on the similarity of the elements, even though they may not be identical. looking at a set of evolutionarily related proteins, dayhoff et al. ( ) published one of the first matrices derived from a detailed analysis of which amino acids (elements) tend to substitute for others. within structural biology, the vast computational requirements of the experimental methods (such as x-ray crystallography and nuclear magnetic resonance) for determining the structure of biological molecules drove the development of powerful structural analysis tools. in addition to software for analyzing experimental data, graphical display algorithms allowed biologists to visualize these molecules in great detail and facilitated the manual analysis of structural principles (langridge, ; richardson, ) . at the same time, methods were developed for simulating the forces within these molecules as they rotate and vibrate (gibson and scheraga, ; karplus and weaver, ; levitt, ) . the most important development to support the emergence of bioinformatics, however, has been the creation of databases with biological information. in the s, structural biologists, using the techniques of x-ray crystallography, set up the protein data bank (pdb) of the cartesian coordinates of the structures that they elucidated (as well as associated experimental details) and made pdb publicly available. the first release, in , contained structures. the growth of the database is chronicled on the web: the pdb now has over , detailed atomic structures and is the primary source of information about the relationship between protein sequence and protein structure. 
similarly, as the ability to obtain the sequence of dna molecules became widespread, the need for a database of these sequences arose. in the mid- s, the genbank database was formed as a repository of sequence information. starting with sequences and , bases in , the genbank has grown by much more than million sequences and billion bases. the genbank database of dna sequence information supports the experimental reconstruction of genomes and acts as a focal point for experimental groups. numerous other databases store the sequences of protein molecules and information about human genetic diseases. included among the databases that have accelerated the development of bioinformatics is the medline database of the biomedical literature and its paper-based companion index medicus (see chapter ). including articles as far back as and brought online free on the web in , medline provides the glue that relates many high-level biomedical concepts to the low-level molecule, disease, and experimental methods. in fact, this "glue" role was the basis for creating the entrez and pubmed systems for integrating access to literature references and the associated databases. perhaps the most basic activity in computational biology is comparing two biological sequences to determine ( ) whether they are similar and ( ) how to align them. the problem of alignment is not trivial but is based on a simple idea. sequences that perform a similar function should, in general, be descendants of a common ancestral sequence, with mutations over time. these mutations can be replacements of one amino acid with another, deletions of amino acids, or insertions of amino acids. the goal of sequence alignment is to align two sequences so that the evolutionary relationship between the sequences becomes clear. 
if two sequences are descended from the same ancestor and have not mutated too much, then it is often possible to find corresponding locations in each sequence that play the same role in the evolved proteins. the problem of solving correct biological alignments is difficult because it requires knowledge about the evolution of the molecules that we typically do not have. there are now, however, well-established algorithms for finding the mathematically optimal alignment of two sequences. these algorithms require the two sequences and a scoring system based on ( ) exact matches between amino acids that have not mutated in the two sequences and can be aligned perfectly; ( ) partial matches between amino acids that have mutated in ways that have preserved their overall biophysical properties; and ( ) gaps in the alignment signifying places where one sequence or the other has undergone a deletion or insertion of amino acids. the algorithms for determining optimal sequence alignments are based on a technique in computer science known as dynamic programming and are at the heart of many computational biology applications (gusfield, ) . figure . shows an example of a smith-waterman matrix. unfortunately, the dynamic programming algorithms are computationally expensive to apply, so a number of faster, more heuristic methods have been developed. the most popular algorithm is the basic local alignment search tool (blast) (altschul et al., ) . blast is based on the observations that sections of proteins are often conserved without gaps (so the gaps can be ignored-a critical simplification for speed) and that there are statistical analyses of the occurrence of small subsequences within larger sequences that can be used to prune the search for matching sequences in a large database. another tool that has found wide use in mining genome sequences is blat (kent, ) . blat is often used to search long genomic sequences with significant performance increases over blast. 
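The dynamic programming recurrence underlying these optimal alignment algorithms can be sketched with a minimal global (Needleman-Wunsch-style) scorer. The flat match/mismatch/gap scores here are illustrative stand-ins for a real substitution matrix such as Dayhoff's PAM or BLOSUM, and the sketch returns only the optimal score, omitting the traceback that recovers the alignment itself.

```python
# Minimal global alignment score by dynamic programming
# (Needleman-Wunsch style). Scores are illustrative; real aligners
# use a substitution matrix (e.g. PAM or BLOSUM) for partial credit.

MATCH, MISMATCH, GAP = 2, -1, -2

def global_align_score(a: str, b: str) -> int:
    """Fill the DP matrix and return the optimal global alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * GAP                    # prefix of a against gaps
    for j in range(1, cols):
        dp[0][j] = j * GAP                    # prefix of b against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i-1][j-1] + (MATCH if a[i-1] == b[j-1] else MISMATCH)
            dp[i][j] = max(diag,              # match or substitution
                           dp[i-1][j] + GAP,  # gap in b
                           dp[i][j-1] + GAP)  # gap in a
    return dp[-1][-1]

print(global_align_score("GATTACA", "GCATGCU"))
```

The Smith-Waterman local variant differs mainly in clamping each cell at zero and taking the maximum over the whole matrix, so that a high-scoring segment is not penalized by poorly matching flanks.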
blat achieves its -fold increase in speed over other tools by storing and indexing long sequences as nonoverlapping k-mers, allowing efficient storage, searching, and alignment on modest hardware. one of the primary challenges in bioinformatics is taking a newly determined dna sequence (as well as its translation into a protein sequence) and predicting the structure of the associated molecules, as well as their function. both problems are difficult, being fraught with all the dangers associated with making predictions without hard experimental data. nonetheless, the available sequence data are starting to be sufficient to allow good predictions in a few cases. for example, there is a web site devoted to the assessment of biological macromolecular structure prediction methods. recent results suggest that when two protein molecules have a high degree (more than percent) of sequence similarity and one of the structures is known, a reliable model of the other can be built by analogy. in the case that sequence similarity is less than percent, however, performance of these methods is much less reliable. when scientists investigate biological structure, they commonly perform a task analogous to sequence alignment, called structural alignment. given two sets of three-dimensional coordinates for a set of atoms, what is the best way to superimpose them so that the similarities and differences between the two structures are clear? such computations are useful for determining whether two structures share a common ancestry and for understanding how the structures' functions have subsequently been refined during evolution. there are numerous published algorithms for finding good structural alignments. we can apply these algorithms in an automated fashion whenever a new structure is determined, thereby classifying the new structure into one of the protein families (such as those that scop maintains). one of these algorithms is minrms (jewett et al., ).
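blat's indexing strategy described above, nonoverlapping k-mers used as exact-match seeds, can be sketched in a few lines. this is a toy illustration with made-up function names; real blat also handles near-matches and translated searches, which this sketch does not.

```python
def kmer_index(seq, k=4):
    """Index a sequence by nonoverlapping k-mers: k-mer -> list of start offsets."""
    index = {}
    for pos in range(0, len(seq) - k + 1, k):  # step of k makes the k-mers nonoverlapping
        index.setdefault(seq[pos:pos + k], []).append(pos)
    return index

def find_seeds(index, query, k=4):
    """Slide over the query and report exact k-mer hits into the indexed sequence."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, tpos))  # (position in query, position in target)
    return hits
```

seed hits like these are then extended into full alignments; because only exact seeds are stored, the index is small enough to keep in memory for very long sequences.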
minrms works by finding the minimal root-mean-squared-distance (rmsd) alignments of two protein structures as a function of matching residue pairs. minrms generates a family of alignments, each with a different number of residue position matches. this is useful for identifying local regions of similarity in a protein with multiple domains. minrms solves two problems. first, it determines which structural superpositions, or alignments, to evaluate. then, given a superposition, it determines which residues should be considered "aligned" or matched. computationally, this is a very difficult problem. minrms reduces the search space by limiting superpositions to the best superposition between four atoms. it then exhaustively determines all potential four-atom-matched superpositions and evaluates each alignment. given a superposition, the number of aligned residues is determined: any two residues whose c-alpha carbons (the central atom in all amino acids) are less than a certain threshold apart are considered matched. the minimum average rmsd for all matched atoms is the overall score for the alignment. in figure . , an example of such a comparison is shown. a related problem is that of using the structure of a large biomolecule and the structure of a small organic molecule (such as a drug or cofactor) to try to predict the ways in which the molecules will interact. an understanding of the structural interaction between a drug and its target molecule often provides critical insight into the drug's mechanism of action. the most reliable way to assess this interaction is to use experimental methods to solve the structure of a drug-target complex. once again, these experimental approaches are expensive, so computational methods play an important role. typically, we can assess the physical and chemical features of the drug molecule and can use them to find complementary regions of the target.
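the rmsd at the heart of the minrms score is simple to compute once a superposition is fixed. a sketch follows; finding the optimal rigid-body superposition itself (e.g., with the kabsch algorithm) is the hard part and is not shown here.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-squared distance between paired 3-D points (already superposed)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two equal-length, non-empty coordinate lists")
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))
```

in a minrms-style scan, this function would be evaluated over the matched c-alpha pairs for each candidate superposition, and the alignment with the minimum average rmsd kept.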
for example, a highly electronegative drug molecule will be most likely to bind in a pocket of the target that has electropositive features. prediction of function often relies on use of sequence or structural similarity metrics and subsequent assignment of function based on similarities to molecules of known function. these methods can guess at general function for roughly to percent of all genes, but leave considerable uncertainty about the precise functional details even for those genes for which there are predictions, and have little to say about the remaining genes. analysis of gene expression data often begins by clustering the expression data. a typical experiment is represented as a large table, where the rows are the genes on each chip and the columns represent the different experiments, whether they be time points or different experimental conditions. within each cell is the red-to-green ratio of that gene's experimental results. each row is then a vector of values that represent the results of the experiment with respect to a specific gene. clustering can then be performed to determine which genes are being expressed similarly. genes that are associated with similar expression profiles are often functionally associated. for example, when a cell is subjected to starvation (fasting), ribosomal genes are often downregulated in anticipation of lower protein production by the cell. it has similarly been shown that genes associated with neoplastic progression can be identified relatively easily with this method, making gene expression experiments a powerful assay in cancer research (see guo, , for review). in order to cluster expression data, a distance metric must be determined to compare one gene's profile with another gene's profile. if the vector data are a list of values, euclidean distance or correlation distance can be used. if the data are more complicated, more sophisticated distance metrics may be employed.
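the two distance metrics mentioned above can be written down directly from their definitions. a sketch; "correlation distance" is taken here as one minus the pearson correlation of the two profiles, which is one common convention, and the profiles are assumed to be non-constant.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def correlation_distance(p, q):
    """1 - Pearson correlation: 0 for perfectly correlated, 2 for anti-correlated."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return 1.0 - cov / (sp * sq)
```

note the two metrics answer different questions: euclidean distance is sensitive to absolute expression levels, while correlation distance groups genes whose profiles rise and fall together regardless of magnitude.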
clustering methods fall into two categories: supervised and unsupervised. supervised learning methods require some preconceived knowledge of the data at hand. usually, the method begins by selecting profiles that represent the different groups of data, and then the clustering method associates each of the genes with the representative profile to which they are most similar. unsupervised methods are more commonly applied because these methods require no knowledge of the data, and can be performed automatically. two such unsupervised learning methods are the hierarchical and k-means clustering methods. hierarchical methods build a dendrogram, or a tree, of the genes based on their expression profiles. these methods are agglomerative and work by iteratively joining close neighbors into a cluster. the first step often involves connecting the closest profiles, building an average profile of the joined profiles, and repeating until the entire tree is built. k-means clustering builds k clusters or groups automatically. the algorithm begins by picking k representative profiles randomly. then each gene is associated with the representative to which it is closest, as defined by the distance metric being employed. then the center of mass of each cluster is determined using all of the member genes' profiles. depending on the implementation, either the center of mass or the nearest member to it becomes the new representative for that cluster. the algorithm then iterates until the new center of mass and the previous center of mass are within some threshold. the result is k groups of genes that are regulated similarly. one drawback of k-means is that one must choose the value for k. if k is too large, logical "true" clusters may be split into pieces, and if k is too small, clusters will be merged. one way to determine whether the chosen k is correct is to estimate the average distance from any member profile to the center of mass.
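the k-means procedure just described — random representatives, assignment by distance, recomputed centers of mass, iteration to convergence — can be sketched directly. a toy version assuming euclidean distance; multiple random restarts are included because different initial conditions can give different results.

```python
import math
import random

def kmeans(profiles, k, iters=100, restarts=5, seed=0):
    """Toy k-means over expression profiles (tuples); returns (centers, labels)."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centers = rng.sample(profiles, k)  # pick k representative profiles at random
        for _ in range(iters):
            # assign each profile to its closest representative
            labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in profiles]
            new_centers = []
            for c in range(k):
                members = [p for p, l in zip(profiles, labels) if l == c]
                if members:  # center of mass of the cluster's member profiles
                    new_centers.append(tuple(sum(d) / len(members)
                                             for d in zip(*members)))
                else:
                    new_centers.append(centers[c])
            if new_centers == centers:  # converged
                break
            centers = new_centers
        # keep the restart with the lowest total member-to-center distance
        score = sum(math.dist(p, centers[l]) for p, l in zip(profiles, labels))
        if best is None or score < best[0]:
            best = (score, centers, labels)
    return best[1], best[2]
```

the total member-to-center distance computed for each restart is also the quantity one would track while varying k to judge whether the chosen k is reasonable.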
by varying k, it is best to choose the lowest k for which this average member-to-center distance is minimized for each cluster. another drawback of k-means is that different initial conditions can give different results; therefore it is often prudent to test the robustness of the results by performing multiple runs with different starting configurations (figure . ). the future clinical usefulness of these algorithms cannot be overstated. in , van't veer et al. found that a gene expression profile could predict the clinical outcome of breast cancer. the global analysis of gene expression showed that some cancers were associated with different prognoses, not detectable using traditional means. another exciting advancement in this field is the potential use of microarray expression data to profile the molecular effects of known and potential therapeutic agents. this molecular understanding of a disease and its treatment will soon help clinicians make more informed and accurate treatment choices. biologists have embraced the web in a remarkable way and have made internet access to data a normal and expected mode for doing business. hundreds of databases curated by individual biologists create a valuable resource for the developers of computational methods who can use these data to test and refine their analysis algorithms. with standard internet search engines, most biological databases can be found and accessed within moments. the large number of databases has led to the development of meta-databases that combine information from individual databases to shield the user from the complex array that exists. there are various approaches to this task.
the entrez system from the national center for biotechnology information (ncbi) gives integrated access to the biomedical literature, protein and nucleic acid sequences, macromolecular and small-molecule structures, and genome project links (including both the human genome project and sequencing projects that are attempting to determine the genome sequences of organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantage of either explicit or computed links between these data resources. the sequence retrieval system (srs) from the european molecular biology laboratory allows queries from one database to another to be linked and sequenced, thus allowing relatively complicated queries to be evaluated. newer technologies are being developed that will allow multiple heterogeneous databases to be accessed by search engines that can combine information automatically, thereby processing even more intricate queries requiring knowledge from numerous data sources. the main types of sequence information that must be stored are dna and protein. one of the largest dna sequence databases is genbank, which is managed by ncbi. genbank is growing rapidly as genome-sequencing projects feed their data (often in an automated procedure) directly into the database. figure . shows the logarithmic growth of data in genbank since . entrez gene curates some of the many genes within genbank and presents the data in a way that is easy for the researcher to use (figure . ). figure . . the exponential growth of genbank. this plot shows that the number of bases in genbank has grown by five full orders of magnitude and continues to grow by a large factor every few years. in addition to genbank, there are numerous special-purpose dna databases for which the curators have taken special care to clean, validate, and annotate the data.
the work required of such curators indicates the degree to which raw sequence data must be interpreted cautiously. genbank can be searched efficiently with a number of algorithms and is usually the first stop for a scientist with a new sequence who wonders "has a sequence like this ever been observed before? if one has, what is known about it?" there are increasing numbers of stories about scientists using genbank to discover unanticipated relationships between dna sequences, allowing their research programs to leap ahead while taking advantage of information collected on similar sequences. a database that has become very useful recently is the university of california santa cruz genome assembly browser (figure . ). this resource allows users to search for specific sequences in the ucsc version of the human genome. powered by the similarity search tool blat, users can quickly find annotations on the human genome that contain their sequence of interest. these annotations include known variations (mutations and snps), genes, comparative maps with other organisms, and many other important data. although sequence information is obtained relatively easily, structural information remains expensive on a per-entry basis. the experimental protocols used to determine precise molecular structural coordinates are expensive in time, materials, and human power. therefore, we have only a small number of structures for all the molecules characterized in the sequence databases. the two main sources of structural information are the cambridge structural database for small molecules (usually less than atoms) and the pdb for macromolecules (see section . . ), including proteins and nucleic acids, and combinations of these macromolecules with small molecules (such as drugs, cofactors, and vitamins). the pdb has approximately , high-resolution structures, but this number is misleading because many of them are small variants on the same structural architecture (figure . ).
if an algorithm is applied to the database to filter out redundant structures, less than , structures remain. there are approximately , proteins in humans; therefore many structures remain unsolved (e.g., burley and bonanno, ; gerstein et al., ). figure . shows a stylized diagram of the structure of chymotrypsin, here shown with two identical subunits interacting. the red portion of the protein backbone shows α-helical regions, the blue portion shows β-strands, and the white denotes connecting coils, while the molecular surface is overlaid in gray. a detailed rendering of all the atoms in chymotrypsin would make this view difficult to visualize because of the complexity of the spatial relationships between thousands of atoms. in the pdb, each structure is reported with its biological source, reference information, manual annotations of interesting features, and the cartesian coordinates of each atom within the molecule. given knowledge of the three-dimensional structure of molecules, the function sometimes becomes clear. for example, the ways in which the medication methotrexate interacts with its biological target have been studied in detail for two decades. methotrexate is used to treat cancer and rheumatologic diseases, and it is an inhibitor of the protein dihydrofolate reductase, an important molecule for cellular reproduction. the three-dimensional structure of dihydrofolate reductase has been known for many years and has thus allowed detailed studies of the ways in which small molecules, such as methotrexate, interact at an atomic level. as the pdb increases in size, it becomes important to have organizing principles for thinking about biological structure. scop provides a classification based on the overall structural features of proteins. it is a useful method for accessing the entries of the pdb. the ecocyc project is an example of a computational resource that has comprehensive information about biochemical pathways.
ecocyc is a knowledge base of the metabolic capabilities of e. coli; it has a representation of all the enzymes in the e. coli genome and of the compounds on which they work. it also links these enzymes to their position on the genome to provide a useful interface into this information. the network of pathways within ecocyc provides an excellent substrate on which useful applications can be built. for example, they could provide: (1) the ability to guess the function of a new protein by assessing its similarity to e. coli genes with a similar sequence, (2) the ability to ask what the effect on an organism would be if a critical component of a pathway were removed (would other pathways be used to create the desired function, or would the organism lose a vital function and die?), and (3) the ability to provide a rich user interface to the literature on e. coli metabolism. similarly, the kyoto encyclopedia of genes and genomes (kegg) provides pathway data sets for organism genomes. a postgenomic database bridges the gap between molecular biological databases and those of clinical importance. one excellent example of a postgenomic database is the online mendelian inheritance in man (omim) database, which is a compilation of known human genes and genetic diseases, along with manual annotations describing the state of our understanding of individual genetic disorders. each entry contains links to special-purpose databases and thus provides links between clinical syndromes and basic molecular mechanisms (figure . ). the smd is another example of a postgenomic database that has proven extremely useful, but has also addressed some formidable challenges. as discussed previously in several sections, expression data are often represented as vectors of data values. in addition to the ratio values, the smd stores images of individual chips, complete with annotated gene spots (see figure . ).
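returning to the ecocyc example above: asking what happens when a critical pathway component is removed amounts to a reachability question over the reaction network. a toy sketch with a hypothetical three-enzyme pathway; all enzyme and compound names here are made up for illustration.

```python
def producible(reactions, start_compounds, target):
    """Can `target` be made from `start_compounds`, given enzyme reactions?

    `reactions` maps an enzyme name to a (substrates, products) pair."""
    have = set(start_compounds)
    changed = True
    while changed:  # keep firing reactions until no new compound appears
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= have and not set(products) <= have:
                have |= set(products)
                changed = True
    return target in have

# hypothetical pathway: A -e1-> B -e2-> C, plus a bypass A -e3-> C
pathway = {
    "e1": (["A"], ["B"]),
    "e2": (["B"], ["C"]),
    "e3": (["A"], ["C"]),
}
```

deleting "e2" from the dictionary simulates a knockout; here the organism still makes "C" through the bypass, whereas removing both routes would answer "it loses the function."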
further, the smd must store experimental conditions, the type and protocol of the experiment, and other data associated with the experiment. arbitrary analyses can be performed on different experiments stored in this unique resource. a critical technical challenge within bioinformatics is the interconnection of databases. as biological databases have proliferated, researchers have been increasingly interested in linking them to support more complicated requests for information. some of these links are natural because of the close connection of dna sequence to protein structure (a straightforward translation). other links are much more difficult because the semantics of the data items within the databases are fuzzy or because good methods for linking certain types of data simply do not exist. for example, in an ideal world, a protein sequence would be linked to a database containing information about that sequence's function. unfortunately, although there are databases about protein function, it is not always easy to assign a function to a protein based on sequence information alone, and so the databases are limited by gaps in our understanding of biology. some excellent recent work in the integration of diverse biological databases has been done in connection with the ncbi entrez/pubmed systems, the srs resource, discoverylink, and the biokleisli project. the human genome sequencing projects will be complete within a decade, and if the only raison d'être for bioinformatics is to support these projects, then the discipline is not well founded. if, on the other hand, we can identify a set of challenges for the next generations of investigators, then we can more comfortably claim disciplinary status for the field. fortunately, there is a series of challenges for which the completion of the first human genome sequence is only the beginning. with the first human genome in hand, the possibilities for studying the role of genetics in human disease multiply.
a new challenge immediately emerges, however: collecting individual sequence data from patients who have disease. researchers estimate that more than percent of the dna sequences within humans are identical, but the remaining sequences are different and account for our variability in susceptibility to and development of disease states. it is not unreasonable to expect that for particular disease syndromes, the detailed genetic information for individual patients will provide valuable information that will allow us to tailor treatment protocols and perhaps let us make more accurate prognoses. there are significant problems associated with obtaining, organizing, analyzing, and using this information. there is currently a gap in our understanding of disease processes. although we have a good understanding of the principles by which small groups of molecules interact, we are not able to fully explain how thousands of molecules interact within a cell to create both normal and abnormal physiological states. as the databases continue to accumulate information ranging from patient-specific data to fundamental genetic information, a major challenge is creating the conceptual links between these databases to create an audit trail from molecular-level information to macroscopic phenomena, as manifested in disease. the availability of these links will facilitate the identification of important targets for future research and will provide a scaffold for biomedical knowledge, ensuring that important literature is not lost within the increasing volume of published data. an important opportunity within bioinformatics is the linkage of biological experimental data with the published papers that report them. electronic publication of the biological literature provides exciting opportunities for making data easily available to scientists. 
already, certain types of simple data that are produced in large volumes are expected to be included in manuscripts submitted for publication, including new sequences that are required to be deposited in genbank and new structure coordinates that are deposited in the pdb. however, there are many other experimental data sources that are currently difficult to provide in a standardized way, because the data either are more intricate than those stored in genbank or the pdb or are not produced in a volume sufficient to fill a database devoted entirely to the relevant area. knowledge base technology can be used, however, to represent multiple types of highly interrelated data. knowledge bases can be defined in many ways (see chapter ); for our purposes, we can think of them as databases in which (1) the ratio of the number of tables to the number of entries per table is high compared with usual databases, (2) the individual entries (or records) have unique names, and (3) the values of many fields for one record in the database are the names of other records, thus creating a highly interlinked network of concepts. the structure of knowledge bases often leads to unique strategies for storage and retrieval of their content. to build a knowledge base for storing information from biological experiments, there are some requirements. first, the set of experiments to be modeled must be defined. second, the key attributes of each experiment that should be recorded in the knowledge base must be specified. third, the set of legal values for each attribute must be specified, usually by creating a controlled terminology for basic data or by specifying the types of knowledge-based entries that can serve as values within the knowledge base. the development of such schemes necessitates the creation of terminology standards, just as in clinical informatics. the riboweb project is undertaking this task in the domain of rna biology (chen et al., ).
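the three defining properties listed above — many record types, uniquely named records, and field values that name other records — can be mimicked in a few lines. a toy sketch only; the record names and fields below are made up and are not riboweb's actual schema.

```python
class KnowledgeBase:
    """Records with unique names; field values may be the names of other records."""

    def __init__(self):
        self.records = {}

    def add(self, name, **fields):
        if name in self.records:  # property (2): entry names must be unique
            raise ValueError(f"record name not unique: {name}")
        self.records[name] = fields

    def resolve(self, name, field):
        """Follow a field whose value names another record (property (3))."""
        return self.records[self.records[name][field]]
```

a usage sketch: `kb.add("experiment:1", target="16S rRNA")` then `kb.add("paper:1", reports="experiment:1")` links a bibliographic record to an experimental finding, so `kb.resolve("paper:1", "reports")` walks the concept network.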
riboweb is a collaborative tool for ribosomal modeling that has at its center a knowledge base of the ribosomal structural literature. riboweb links standard bibliographic references to knowledge-based entries that summarize the key experimental findings reported in each paper. for each type of experiment that can be performed, the key attributes must be specified. thus, for example, a cross-linking experiment is one in which a small molecule with two highly reactive chemical groups is added to an ensemble of other molecules. the reactive groups attach themselves to two vulnerable parts of the ensemble. because the molecule is small, the two vulnerable areas cannot be any further from each other than the maximum stretched-out length of the small molecule. thus, an analysis of the resulting reaction gives information that one part of the ensemble is "close" to another part. this experiment can be summarized formally with a few features, for example: target of experiment, cross-linked parts, and cross-linking agent. the task of creating connections between published literature and basic data is a difficult one because of the need to create formal structures and then to create the necessary content for each published article. the most likely scenario is that biologists will write and submit their papers along with the entries that they propose to add to the knowledge base. thus, the knowledge base will become an ever-growing communal store of scientific knowledge. reviewers of the work will examine the knowledge-based elements, perhaps run a set of automated consistency checks, and allow the knowledge base to be modified if they deem the paper to be of sufficient scientific merit. riboweb can be accessed on the web in prototype form. one of the most exciting goals for computational biology and bioinformatics is the creation of a unified computational model of physiology. imagine a computer program that provides a comprehensive simulation of a human body.
the simulation would be a complex mathematical model in which all the molecular details of each organ system would be represented in sufficient detail to allow complex "what if?" questions to be asked. for example, a new therapeutic agent could be introduced into the system, and its effects on each of the organ subsystems and on their cellular apparatus could be assessed. the side-effect profile, possible toxicities, and perhaps even the efficacy of the agent could be assessed computationally before trials are begun on laboratory animals or human subjects. the model could be linked to visualizations to allow the teaching of medicine at all grade levels to benefit from our detailed understanding of physiological processes: visualizations would be both anatomic (where things are) and functional (what things do). finally, the model would provide an interface to human genetic and biological knowledge. what more natural user interface could there be for exploring physiology, anatomy, genetics, and biochemistry than the universally recognizable structure of a human that could be browsed at both macroscopic and microscopic levels of detail? as components of interest were found, they could be selected, and the relevant literature could be presented to the user. the complete computational model of a human is not close to completion. first, all the participants in the system (the molecules and the ways in which they associate to form higher-level aggregates) must be identified. second, the quantitative equations and symbolic relationships that summarize how the systems interact have not been elucidated fully. third, the computational representations and computer power to run such a simulation are not in place. researchers are, however, working in each of these areas. the genome projects will soon define all the molecules that constitute each organism.
research in simulation and the new experimental technologies being developed will give us an understanding of how these molecules associate and perform their functions. finally, research in both clinical informatics and bioinformatics will provide the computational infrastructure required to deliver such technologies. bioinformatics is closely allied to clinical informatics. it differs in its emphasis on a reductionist view of biological systems, starting with sequence information and moving to structural and functional information. the emergence of the genome sequencing projects and the new technologies for measuring metabolic processes within cells is beginning to allow bioinformaticians to construct a more synthetic view of biological processes, which will complement the whole-organism, top-down approach of clinical informatics. more importantly, there are technologies that can be shared between bioinformatics and clinical informatics because they both focus on representing, storing, and analyzing biological data. these technologies include the creation and management of standard terminologies and data representations, the integration of heterogeneous databases, the organization and searching of the biomedical literature, the use of machine learning techniques to extract new knowledge, the simulation of biological processes, and the creation of knowledge-based systems to support advanced practitioners in the two fields. the proceedings of one of the principal meetings in bioinformatics are an excellent source for up-to-date research reports.
other important meetings include those sponsored by related professional societies. several readings are suggested.
this introduction to the field of bioinformatics focuses on the use of statistical and artificial intelligence techniques in machine learning.
introduces the different microarray technologies and how they are analyzed.
dna and protein sequence analysis: a practical approach. this book provides an introduction to sequence analysis for the interested biologist with limited computing experience.
this edited volume provides an excellent introduction to the use of probabilistic representations of sequences for the purposes of alignment and multiple alignment.
this primer provides a good introduction to the basic algorithms used in sequence analysis, including dynamic programming for sequence alignment.
algorithms on strings, trees and sequences: computer science and computational biology. gusfield's text provides an excellent introduction to the algorithmics of sequence and string analysis, with special attention paid to biological sequence analysis problems.
artificial intelligence and molecular biology. this volume shows a variety of ways in which artificial intelligence techniques have been used to solve problems in biology.
genotype to phenotype. this volume offers a useful collection of recent work in bioinformatics.
another introduction to bioinformatics, this text was written for computer scientists.
the textbook by stryer is well written, and is illustrated and updated on a regular basis. it provides an excellent introduction to basic molecular biology and biochemistry.
in what ways will bioinformatics and medical informatics interact in the future? will the research agendas of the two fields merge? will the introduction of dna and protein sequence information change the way that medical records are managed in the future?
which types of systems will be most affected (laboratory, radiology, admission and discharge, financial)? it has been postulated that clinical informatics and bioinformatics are working on the same problems, but in some areas one field has made more progress than the other. why should an awareness of bioinformatics be expected of clinical informatics professionals? should a chapter on bioinformatics appear in a clinical informatics textbook? explain your answers. one major problem with introducing computers into clinical medicine is the extreme time and resource pressure placed on physicians and other health care workers. will the same problems arise in basic biomedical research? why have biologists and bioinformaticians embraced the web as a vehicle for disseminating data so quickly, whereas clinicians and clinical informaticians have been more hesitant to put their primary data online?

key: cord- -ybd hi y authors: dutilh, bas e title: metagenomic ventures into outer sequence space date: - - journal: bacteriophage doi: . / . . sha: doc_id: cord_uid: ybd hi y

sequencing dna or rna directly from the environment often results in many sequencing reads that have no homologs in the database. these are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." however, unknowns also exist because metagenomic datasets are not optimally mined. there is pressure on researchers to publish and move on, and the unknown sequences are often left for what they are, with conclusions drawn based on the reads with annotated homologs. this can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crassphage. the unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. however, it remains an open question: what is the actual size of biological sequence space?
the de novo assembly of shotgun metagenomes is the most powerful tool to address this question. metagenomics is the untargeted sequencing of genetic material isolated from communities of micro-organisms and viruses. these communities may be derived from bioreactors, environmental, clinical, or industrial samples; in short, from anywhere in our unsterile biosphere. the classical questions in metagenomics that are asked about the sampled microbial community are "who is there?" and "what are they doing?". originally an approach to answer these classical questions, metagenomics as a field has made great progress in the past decade. applications include the use of metagenomics for the discovery of novel genetic functionality, for describing microbial ecosystems and tracking their variation, in untargeted medical diagnostics and forensics, and as a powerful tool to determine the genome sequences of rare, uncultivable microbes.
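the open question above — the actual size of biological sequence space — can be made concrete with a quick calculation of how fast the space grows with sequence length. a minimal sketch; the only assumptions are the standard 4-letter nucleotide and 20-letter amino acid alphabets:

```python
# size of sequence space: s^n distinct sequences of length n over an
# alphabet of s letters (s = 4 for nucleotides, s = 20 for proteins)
def space_size(n, s=4):
    """number of distinct sequences of exactly length n."""
    return s ** n

# cumulative volume over all lengths 1..n, i.e. the sum of s^i
def cumulative_space(n, s=4):
    return sum(s ** i for i in range(1, n + 1))

print(space_size(10))        # 1048576 distinct 10-nucleotide sequences
print(space_size(10, s=20))  # 10240000000000 distinct 10-residue peptides
print(cumulative_space(3))   # 4 + 16 + 64 = 84
```

even at modest lengths the space dwarfs what any sequencing effort could sample, which is why the "outer sequence space" metaphor is apt.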
powered by advances in next-generation sequencing technology, metagenomics has the potential to venture beyond the limits of currently explored sequence space by sampling environmental microbes and viruses at an unprecedented scale and resolution. quite literally, sequence space is defined as the multi-dimensional space of all possible nucleotide (or protein) sequences. sequence space contains n dimensions; one dimension per residue that can take one of four (or twenty, for proteins) states, with a total volume of s^n sequences when summed over all possible sequence lengths n. evolution may have largely explored this space, but it remains an open question how large the current biological sequence space is, i.e., the fraction occupied by extant life. figuratively, and within the context of this paper, "outer sequence space" is the remainder of this biological sequence space waiting to be explored by science. metagenomics has traditionally addressed the classical questions listed above by aligning the sequencing reads in metagenomic data sets to a reference database containing known, annotated sequences. this allows the taxonomic and functional diversity of the sampled microbes to be described in terms of existing knowledge, allowing for straightforward interpretation of the results. however, a persistent concern in the analysis of metagenomes has been the unknown fraction, consisting of the reads that cannot be annotated by using database searches (keywords: biological dark matter, crassphage, human gut, human virome, metagenomics, metagenome assembly, unknowns). the level of unknowns can range up to % of the metagenomic reads, depending on the sampled environment, the protocols used for nucleotide isolation and sequencing, the homology search algorithm, and the reference database. unknowns exist for reasons that are not unrelated. the first reason is technical.
due to limitations of some next-generation sequencing platforms and library preparation protocols, spurious sequences may be generated that do not reflect true biological molecules. these artificial sequences include artifacts due to the sequencing technology and chimeras, i.e., sequences generated from separate genetic molecules derived from different organisms. since chimeras frequently arise during pcr amplification, they are expected to be more abundant in environmental amplicon sequencing than in shotgun metagenomics, and can be detected using bioinformatic tools. the second reason that unknowns exist is biological, as they reflect the enormous natural diversity of microorganisms that we are only beginning to unveil with metagenomics. this is both overwhelming and exciting, highlighting how much remains to be discovered in biology. this genetic diversity has been referred to as biological "dark matter," and is especially pronounced in viral metagenomes. this issue can only be resolved by expanding reference databases, as exemplified by recent studies of one of the most studied microbial ecosystems: the human gut. the first metagenomic snapshots of the microbiota in the human gut were taken from healthy adults, and revealed a high interindividual diversity and many unknowns. to a large extent, these unknowns were resolved when a reference catalog was created based on the sequences in the gut metagenomes themselves, decreasing the percentage of unknowns from ≈ % to ≈ %. moreover, subsequent large-scale sequencing efforts revealed that, in fact, many people share a similar intestinal flora, regardless of whether these similarities are viewed as discrete enterotypes or as gradients. these results illustrate how unknowns can be depleted by expanding the databases with appropriate reference sequences.
this not only requires increased sequencing effort of phylogenetically diverse isolates or single cells, but also mining of draft genomes from metagenomes, sampled from microbial environments around the globe. thus, by mapping the global sequence space, we can provide reassurance that at least some level of sampling saturation can be achieved. for viruses, and particularly for bacteriophages, efforts to provide a denser sampling of sequence space are still lacking. the third reason that unknowns exist is methodological. because the advances in dna sequencing technology have greatly outpaced improvements in computer power, bioinformatic approaches to analyze metagenomes often cut corners. for example, reference databases may be reduced to include only those references that are expected in the sample a priori. moreover, read annotation may be limited to identifying almost exact sequence matches, as this can be computed much faster than if sequence variation needs to be taken into consideration in a permissive homology search. these issues lead to an inherent blind spot for discovering true novelty, such as sequences that are not expected in the sample, or organisms that have not been observed before. one way to, at least partially, resolve this issue is by de novo assembly of the metagenome. depending on the diversity of the sample, assembly can combine many short sequences (individual reads) into fewer, longer ones (assembled contigs). reducing the number, and increasing the length, of the sequences allows homology searches to be performed with more sensitive, computationally more expensive algorithms such as translated homology searches or profile searches, leading to more specific annotation and improved biological interpretation. moreover, larger and more comprehensive reference databases can be used, allowing unexpected hits to be found. the fourth reason that unknowns exist is logistical.
most research projects that generate metagenomic sequencing datasets deposit the read files in large repositories, provide an accession number in the associated publication, and move on. it is not unlikely that many of these data sets, consisting of files sometimes gigabytes in size, are never looked at again. thus, while a certain sequence may have been "seen" in a metagenome and is thus strictly no longer "dark matter," it will still not be recognized when it is observed again. reidentification of this sequence would only be possible if the publishing researcher identified it as an interesting sequence in his or her (assembled) metagenome, and submitted it to a searchable database like genbank. because genbank maintains very high standards for the sequences it accepts, submission can be a tedious process that is rarely worthwhile for unknown metagenomic contigs. an in depth investigation of the unknowns is rarely within the scope of a research project, and those sequences are thus first ignored and later forgotten. this is a waste of valuable resources: time, money, and work. the metagenomes available in public databases should be better exploited and mined for common sequences. to facilitate this, it is critical that metadata annotations of the metagenomes include a detailed description of the samples and sequencing protocol. exploiting these datasets will allow us to create more comprehensive maps of sequence space, and greatly improve our understanding and interpretation of metagenomes. in the short term, ignoring the unknowns can facilitate the interpretation of a metagenome. because a taxonomic or functional description cannot be provided, the classical questions in metagenomics are left unanswered for the unknown fraction of the metagenome, and concentrating on the annotated sequences leads to a more straightforward answer. however, unexpected or novel sequences are quickly overlooked, even if they represent highly abundant or widespread organisms. 
thus, in the long term, stockpiling the unknown sequencing reads in poorly accessible bulk sequence repositories can severely slow down research, the discovery of novel species, and the charting of biological sequence space. one striking example of a novel genome discovered among the unknown sequences is crassphage, a bacteriophage whose genome uniquely aligned sequencing reads from % of the analyzed human gut metagenomes, and constituted a total of . % of those metagenomic reads. like many bacteriophages, its genome sequence is highly divergent from everything that was present in the annotated part of the genbank database, which is why it was not observed before. it has been suggested that the unknown fraction of metagenomes is enriched for viral sequences, because viral genomes are thought to evolve more rapidly than the genomes of cellular organisms, allowing them to explore a larger region of sequence space in the same amount of time. to summarize, unknowns are genetic sequences that are difficult to identify using standard methods, such as by alignment to an annotated reference database. unknowns remain a persistent elephant in the room in most metagenomics research projects, and exist for technical, biological, methodological, and logistical reasons. the most promising option to resolve the unknowns is by creating improved reference databases that chart biological sequence space, including the outer realms that remain unexplored by science (also known as dark matter). besides sequencing reference strains or single cells, it may be expected that metagenomic sequencing, assembly, and binning will greatly add to improving these reference databases, for example by identifying common sequences in many metagenomes, and prioritizing them for targeted characterization. characterizing unknowns will be vital to fully exploit the increasingly available metagenomic data sets from all ecosystems, toward understanding the roles of microbes and viruses in the biosphere.
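the role that de novo assembly plays in resolving unknowns — combining many short reads into fewer, longer contigs — can be illustrated with a toy greedy overlap merger. this is only a sketch with invented reads; real assemblers use de bruijn graphs and must handle sequencing errors, repeats, and strandedness:

```python
# toy greedy-overlap assembly: repeatedly merge the pair of sequences
# with the longest exact suffix/prefix overlap until none remains.

def overlap(a, b, min_len):
    """length of the longest suffix of a that is a prefix of b (>= min_len), else 0."""
    for l in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:l]):
            return l
    return 0

def greedy_assemble(reads, min_overlap=4):
    contigs = list(reads)
    while True:
        best = None  # (overlap_len, i, j)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    l = overlap(a, b, min_overlap)
                    if l and (best is None or l > best[0]):
                        best = (l, i, j)
        if best is None:
            return contigs
        l, i, j = best
        merged = contigs[i] + contigs[j][l:]  # join, keeping the overlap once
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

reads = ["ATTGCGTA", "GCGTACCT", "TACCTGGA"]  # invented toy reads
print(greedy_assemble(reads))  # ['ATTGCGTACCTGGA']
```

three 8-base reads collapse into a single 14-base contig, which can then be annotated with more sensitive (and more expensive) homology searches than the individual reads could justify.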
it remains an open question what is the actual size of biological sequence space, but the untargeted, shotgun nature of metagenomics makes it the most powerful tool to address this question.

references:
metagenomics: application of genomics to uncultured microorganisms
fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla
human gut microbiome viewed across age and geography
isolation of a novel coronavirus from a man with pneumonia in saudi arabia
genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes
natural selection and the concept of a protein space
how much of protein sequence space has been explored by life on earth
metagenomics and future perspectives in virus discovery
tagdust - a program to eliminate artifacts from next generation sequencing data
uchime improves sensitivity and speed of chimera detection
insights into the phylogeny and coding potential of microbial dark matter
scratching the surface of biology's dark matter
metagenomic analysis of the human distal gut microbiome
a human gut microbial gene catalogue established by metagenomic sequencing
enterotypes of the human gut microbiome
a guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets
a phylogeny-driven genomic encyclopaedia of bacteria and archaea
genomes from metagenomics
meeting report: the terabase metagenomics workshop and the vision of an earth microbiome project
the pace and proliferation of biological technologies
the minimum information about a genome sequence (migs) specification
a highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes
umars: un-mappable reads solution

acknowledgments: i thank my collaborators for their contributions in the crassphage project, and the anonymous reviewers of this manuscript for valuable suggestions.
key: cord- -r te xob authors: balloux, francois; brønstad brynildsrud, ola; van dorp, lucy; shaw, liam p.; chen, hongbin; harris, kathryn a.; wang, hui; eldholm, vegard title: from theory to practice: translating whole-genome sequencing (wgs) into the clinic date: - - journal: trends microbiol doi: . /j.tim. . . sha: doc_id: cord_uid: r te xob hospitals worldwide are facing an increasing incidence of hard-to-treat infections. limiting infections and providing patients with optimal drug regimens require timely strain identification as well as virulence and drug-resistance profiling. additionally, prophylactic interventions based on the identification of environmental sources of recurrent infections (e.g., contaminated sinks) and reconstruction of transmission chains (i.e., who infected whom) could help to reduce the incidence of nosocomial infections. wgs could hold the key to solving these issues. however, uptake in the clinic has been slow. some major scientific and logistical challenges need to be solved before wgs fulfils its potential in clinical microbial diagnostics. in this review we identify major bottlenecks that need to be resolved for wgs to routinely inform clinical intervention and discuss possible solutions.
thanks to progress in high-throughput sequencing technologies over the last two decades, generating microbial genomes is now considered neither particularly challenging nor expensive. as a result, whole-genome sequencing (wgs) (see glossary) has been championed as the obvious and inevitable future of diagnostics in multiple reviews and opinion pieces dating back to [ ] [ ] [ ] [ ]. despite enthusiasm in the community, wgs diagnostics has not yet been widely adopted in clinical microbiology, which may seem at odds with the current suite of applications for which wgs has huge potential, and which are already widely used in the academic literature. common applications of wgs in diagnostic microbiology include isolate characterization, antimicrobial resistance (amr) profiling, and establishing the sources of recurrent infections and between-patient transmissions. all of these have obvious clinical relevance and provide case studies where wgs could, in principle, provide additional information and even replace the knowledge obtained through standard clinical microbiology techniques. this review reiterates the potential of wgs for clinical microbiology, but also its current limitations, and suggests possible solutions to some of the main bottlenecks to routine implementation. in particular, we argue that applying existing wgs pipelines developed for fundamental research is unlikely to produce the fast and robust tools required, and that new dedicated approaches are needed for wgs in the clinic. at the most basic level, wgs can be used to characterize a clinical isolate, informing on the likely species and/or subtype and allowing phylogenetic placement of a given sequence relative to an existing set of isolates.
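isolate characterization of the kind just described can be caricatured as a nearest-reference search over k-mer profiles. a minimal sketch: the reference fragments and species labels below are invented, and real identification tools use much larger k and curated genome databases:

```python
# compare a query genome to references by jaccard similarity of k-mer sets
def kmer_set(seq, k=4):
    """all overlapping k-mers of seq as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """similarity between two k-mer sets (1.0 = identical profiles)."""
    return len(a & b) / len(a | b)

def closest_reference(query, references, k=4):
    """name of the reference whose k-mer profile best matches the query."""
    q = kmer_set(query, k)
    return max(references, key=lambda name: jaccard(q, kmer_set(references[name], k)))

# hypothetical reference genome fragments
refs = {
    "species_x": "ATGGCGATTTTGCACGGT",
    "species_y": "TTACGGCATCCGATACGA",
}
print(closest_reference("ATGGCGATTTTGCACGAT", refs))  # species_x
```

the query differs from species_x by a couple of bases but shares almost no k-mers with species_y, so the profile comparison places it correctly; this alignment-free idea underlies several rapid wgs typing tools.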
highlights
in principle, wgs can provide highly relevant information for clinical microbiology in near-real-time, from phenotype testing to tracking outbreaks. however, despite this promise, the uptake of wgs in the clinic has been limited to date, and future implementation is likely to be a slow process.
the increasing information provided by wgs can cause conflict with traditional microbiological concepts and typing schemes.
decreasing raw sequencing costs have not translated into decreasing total costs for bacterial genomes, which have stabilised.
existing research pipelines are not suitable for the clinic, and bespoke clinical pipelines should be developed.

wgs-based strain identification gives a far superior resolution
for strictly clonal species, which undergo no horizontal gene transfer (hgt), such as m. tuberculosis, it is possible to devise a 'natural' robust phylogenetically based classification [ ] . unfortunately, organisms undergoing regular hgt, and with a significant accessory genome, do not fall neatly into existing classification schemes. in fact, it is even questionable whether a completely satisfactory classification scheme could be devised for such organisms, as classifications based on the core genome, accessory genome, housekeeping genes (mlst), genotypic markers, plasmid sequence, virulence factors or amr profile may all produce incompatible categories ( figure ). beyond species identification and characterization, genome sequences provide a rich resource that can be exploited to predict the pathogen's phenotype. the main microbial traits of clinical relevance are amr and virulence, but may also include other traits such as the ability to form biofilms or survival in the environment. sequence-based drug profiling is one of the pillars of hiv treatment and has to be credited for the remarkable success of antiretroviral therapy (art) regimes. prediction of amr from sequence data has also received considerable attention for bacterial pathogens but has not led to comparable success at this stage. resistance against single drugs can be relatively straightforward to predict in some instances. for example, the presence of the sccmec cassette is a reliable predictor for broad-spectrum beta-lactam resistance in staphylococcus aureus, with strains carrying this element referred to as methicillin-resistant s. aureus (mrsa). in principle, wgs offers the possibility to predict the full resistance profile to multiple drugs (the 'resistome'). possibly the first real attempt to predict the resistome from wgs data was a study by holden et al. in , showing that, for a large dataset of s. aureus st isolates, . 
% of all phenotypic resistances could be explained by at least one previously documented amr element or mutation in the sequence data [ ]. since then, several tools have been developed for the prediction of resistance profiles from wgs. these include those designed for prediction of resistance phenotype from acquired amr genes, including resfinder [ ] and abricate (https://github.com/tseemann/abricate), together with those also taking into account point mutations in chromosome-borne genes such as arg-annot [ ], the sequence search tool for antimicrobial resistance (sstar) [ ], and the comprehensive antibiotic resistance database (card) [ ]. of these, resfinder and card can be implemented as online methods that, dependent on user traffic, can be considerably slower than most other tools that only use the command-line. they are, however, superior in terms of broad usability and are more intuitive than, for example, the graphical user interface of sstar.

glossary
accessory genome: the variable genome consisting of genes that are present only in some strains of a given species. many of the organisms representing the most severe amr threats are characterised by large accessory genomes containing important components of clinically relevant phenotypic diversity.
antimicrobial resistance (amr): the ability of a microorganism to reproduce in the presence of a specific antimicrobial compound. also referred to as antibiotic resistance (abr or ar). the sum of the detected amr genes in a sequenced isolate is sometimes referred to as the resistome.
horizontal gene transfer (hgt): the transmission of genetic material laterally between organisms outside 'vertical' parent-to-offspring inheritance, including across species boundaries. genetic elements related to clinically relevant phenotypes such as amr and virulence are often transmitted via hgt.
k-mer: a string of length k contained within a larger sequence. for example, the sequence 'attgt' contains two 4-mers: 'attg' and 'ttgt'. the analysis of the k-mer content of raw sequencing reads allows for rapid characterization of the genetic difference between isolates without the need for genome assembly.
multilocus sequence typing (mlst): a scheme used to assign types to bacteria based on the alleles present at a defined set of chromosome-borne housekeeping genes. also referred to as sequence typing (st).
phylogenetic tree: a representation of inferred evolutionary relationships based on the genetic differences between a set of sequences. also referred to as a phylogeny.
transmission chain: the route of transmission of a pathogen between hosts during an outbreak. this can often be characterized using wgs compared to traditional epidemiological inference based on, for example, tracing contacts between patients.
virulence: broadly, a pathogen's ability to cause damage to its host, for example through invasion, adhesion, immune evasion, and toxin production. however, virulence is currently loosely defined by indirect proxies either phenotypically (e.g., through serum-killing assays) or genetically (e.g., by the presence of genes involved in capsule synthesis or hypermucoviscosity).
whole-genome sequencing (wgs): the process of determining the complete nucleotide sequence of an organism's genome. this is generally achieved by 'shotgun' sequencing of short reads that are either assembled de novo or mapped onto a high-quality reference genome.

other tools exist for richer species-specific characterization, such as phyresse [ ] and patric-rast [ ]. further tools have been developed to predict phenotype directly from unassembled sequencing reads, bypassing genome assembly [ , ]. it has been proposed that wgs-based phenotyping might, in some instances, be equally, if not more, accurate than traditional phenotyping [ ] [ ] [ ] [ ]. however, it is probably no coincidence that the most successful applications to date have primarily been on m. tuberculosis and s.
aureus, which are characterised by essentially no, or very limited, accessory genomes, respectively. other successful examples include streptococcal pathogens, where wgs-based predictions and measured phenotypic resistance show good agreement even in large and diverse samples of isolates [ , ]. on the whole, however, predicting comprehensive amr profiles in organisms with open genomes, such as escherichia coli, where only % of genes are found in every single strain [ ], is challenging and requires extremely extensive and well-curated reference databases. the transition to wgs might appear relatively straightforward if viewed as merely replacing pcr panels, which are already used when traditional phenotyping can be cumbersome and unreliable. however, to put the problem in context, there are over described β-lactamase gene sequences responsible for multiresistance to β-lactam antibiotics such as penicillins, cephalosporins, and carbapenems [ ]. whilst β-lactam resistance in some pathogens, including s. pneumoniae, can be predicted through, for example, penicillin-binding protein (pbp) typing and machine-learning-based approaches [ ], the general problem of reliably assigning resistance phenotype based on many described gene sequences is commonplace. at this stage, many of the amr reference databases are not well integrated or curated and have no minimum clinical standard. they often have varying predictive ranges and biases and produce fairly inaccessible output files with little guidance on how to interpret or utilise this information for clinical intervention. perhaps because of these limitations, although of obvious benefit as part of a diagnostics platform, both awareness and uptake in the clinic have been limited.
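the acquired-gene screening that tools like resfinder and abricate perform can be caricatured in a few lines: flag a resistance gene as present when most of its k-mers occur in the sequencing reads. a naive sketch only; the gene and reads below are invented toy sequences, and real tools use alignment, coverage, and identity thresholds:

```python
# toy k-mer-based detection of an acquired amr gene in raw reads
def kmers(seq, k):
    """all overlapping k-mers of seq (cf. the glossary example 'attgt' -> 'attg', 'ttgt')."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def gene_detected(gene, reads, k=8, min_fraction=0.9):
    """True if at least min_fraction of the gene's k-mers appear in the reads."""
    gene_kmers = kmers(gene, k)
    read_kmers = set()
    for r in reads:
        read_kmers |= kmers(r, k)
    return len(gene_kmers & read_kmers) / len(gene_kmers) >= min_fraction

# hypothetical toy data: a short "resistance gene" and two reads covering it
amr_gene = "ATGGCGATTTTGCACGGTAAACTG"
reads = ["ATGGCGATTTTGCA", "TTTGCACGGTAAACTG"]
print(gene_detected(amr_gene, reads))  # True
```

the min_fraction cut-off stands in for the identity/coverage thresholds of real tools; set too strictly it misses divergent alleles, which is exactly the database-curation problem discussed above.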
additionally, with some notable exceptions, such as the pneumococci [ ], most amr profile predictions from wgs data are qualitative, simply predicting whether an isolate is expected to be resistant or susceptible against a compound, despite amr generally being a continuous and often complex trait. the level of resistance of a strain to a drug can be affected by multiple epistatic amr elements or mutations [ ], the copy number variation of these elements [ ], the function of the genetic background of the strain [ - ], and modulating effects by the environment [ ]. the level of resistance is generally well captured by the semiquantitative phenotypic measurement minimum inhibitory concentration (mic), even if clinicians often use a discrete interpretation of mics into resistant/susceptible based on fairly arbitrary cut-off values. quantitative resistance predictions are not just of academic interest. in the clinic, low-level resistance strains can still be treated with a given antibiotic, but the standard dose should be increased, which can be the best option at hand, especially for drugs with low toxicity. the majority of efforts to predict phenotypes from bacterial genomes have been on amr profiling. yet, some tools have also been developed for multispecies virulence profiling: the virulence factors database (vfdb) [ ] or virulencefinder [ ], as well as the bespoke virulence prediction tool for klebsiella pneumoniae, kleborate [ ]. one major challenge is that virulence is often a context-dependent trait. for example, in k. pneumoniae various imperfect proxies for virulence are used. these include capsule type, hypermucoviscosity, biofilm and siderophore production, or survival in serum-killing assays. while all of these traits are quantifiable and reproducible, and could thus in principle be predicted using wgs, it remains unclear how well they correlate with virulence in the patient.
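returning to the discrete interpretation of mics mentioned above: in code it is little more than a breakpoint lookup, which is why the underlying continuity of resistance is so easily lost. the breakpoints below are invented placeholders, not clinical values; real interpretive criteria come from eucast or clsi tables:

```python
# interpret a minimum inhibitory concentration (mic, mg/L) against per-drug
# breakpoints. values are hypothetical placeholders, not real eucast/clsi data.
BREAKPOINTS = {"drug_a": (1.0, 4.0)}  # (susceptible if <= first, resistant if > second)

def interpret_mic(drug, mic):
    s, r = BREAKPOINTS[drug]
    if mic <= s:
        return "S"  # susceptible at standard dose
    if mic > r:
        return "R"  # resistant
    return "I"      # intermediate: may respond to an increased dose

print(interpret_mic("drug_a", 0.5))  # S
print(interpret_mic("drug_a", 2.0))  # I
print(interpret_mic("drug_a", 8.0))  # R
```

the "I" category is where the quantitative information matters clinically: a low-level resistant strain may still be treatable at a higher dose, exactly the nuance a binary resistant/susceptible call throws away.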
given that virulence is one of the most commonly studied phenotypes, yet lacks a clear definition, the general problem of predicting bacterial phenotype from genotype may be substantially more complex than the special case of amr, which is itself far from solved for all clinically relevant species. beyond phenotype prediction for individual isolates, wgs has allowed reconstructing outbreaks within hospitals and the community across a diversity of taxa, ranging from carbapenem-resistant k. pneumoniae [ ] [ ] [ ] and acinetobacter baumannii [ ] to mrsa [ , ], streptococcal disease [ ], and neisseria gonorrhoeae [ ], amongst others. wgs can reveal which isolates are part of an outbreak lineage and, by integrating epidemiological data with phylogenetic information, detect direct probable transmission events [ ] [ ] [ ] [ ]. timed phylogenies, for example generated through beast [ , ], can provide likely time-windows on inferred transmissions, as well as dating when an outbreak lineage may have started to expand. approaches based on transmission chains can also be used to identify sources of recurrent infections (so-called 'super-spreaders'), and do not necessarily rely on all isolates within the outbreak having been sequenced, allowing for partial sampling and analyses of ongoing outbreaks [ ]. in this way, wgs-based inference can elucidate patterns of infection which are impossible to recapitulate from standard sequence typing alone [ ]. however, wgs-informed outbreak tracking is usually performed only retrospectively. typically, the publication dates of academic literature relating to outbreak reconstruction lag greatly, often in the order of at least years since the initial identification of an outbreak [ , ]. even analyses published more rapidly are generally still too slow to inform on real-time interventions [ ].
some attempts have been made to show that near-real-time hospital outbreak reconstruction is feasible retrospectively [ , ] or have performed analyses for ongoing outbreaks in close to real-time [ , ] , but these studies are still in a minority and remain largely within the academic literature. some of this time-lag probably relates to the difficulty of transmission-chain reconstruction at actionable time-scales. this can be relatively straightforward for viruses with high mutation rates, small genomes, and fast and constant transmission times, such as ebola [ ] and zika virus [ ] , but conversely, reconstructing outbreaks for bacteria and fungi poses a series of challenges. available tools tend to be sophisticated and complex to implement, and the sequence data needs extremely careful quality control and curation. unfortunately, in some cases insufficient genetic variation will have accumulated over the course of an outbreak, and a transmission chain simply cannot be inferred without this signal [ , ] . furthermore, extensive within-host genetic diversity (typical in chronic infections) can render the inference of transmission chains intractable [ ] . these complexities mean that a 'one-size fits all' bioinformatics approach to outbreak analyses simply does not exist. one of the key promises of wgs is in molecular surveillance and real-time tracking of infectious disease. this relies on transparent and standardized data sharing of the millions of genomes sequenced each year, together with accompanying metadata on isolation host, date of sampling, and geographic location. with enough data, surveillance initiatives have the potential to identify the likely geographic origin of emerging pathogens and amr genes, group seemingly unrelated cases into outbreaks, and clearly identify when sequences are divergent from other circulating strains. 
In a hospital setting, surveillance can help to detect transmission within the hospital and inflow from the community, optimize antimicrobial stewardship, and inform treatment decisions; at national and global scales, it can highlight emerging worldwide trends for which collated evidence can direct both retrospective and anticipatory policy decisions. Amongst the most successful global surveillance initiatives and analytical frameworks are those relating specifically to the spread of viruses. Influenza surveillance is arguably the most developed, with large sequencing repositories such as the GISAID database (gisaid.org) and online data exploration and phylodynamics available through web tools such as nextflu [ ] and Nextstrain (http://nextstrain.org), which also allows examination of other significant viruses including Zika, Ebola, and avian influenza. Another popular tool for the sharing of data and visualization of phylogenetic trees and their accompanying metadata is Microreact (microreact.org) [ ], which also allows for interactive data querying and includes bacteria and fungi. A further tool, predominantly for bacterial data, is WGSA (www.wgsa.net). WGSA allows the upload of genome assemblies through a drag-and-drop web browser, providing a quick characterization of species, MLST type, resistance profile, and phylogenetic placement in the context of the existing species database based on core genes. At the time of writing, WGSA comprises genomes predominantly from S. aureus, N. gonorrhoeae, and Salmonella enterica serovar Typhi, together with Ebola and Zika viruses, all with some associated metadata. Although an exciting initiative, WGSA and associated platforms are still a reasonably long way off characterizing all clinically relevant isolates and often rely entirely on the uploaded sequences already being assembled. More generally, the success of any WGS surveillance is dependent on the timely and open sharing of information from around the globe.
While sequence data from academic publications are near-systematically deposited in public sequence databases (at least upon publication), such data are near useless if the accompanying metadata (see above) are not also released, as remains the case far too often. Additionally, as more genomes are routinely sequenced in clinical settings as part of standard procedures, ensuring that the culture of sharing sequence data persists beyond academic research will become increasingly important. For WGS to be routinely adopted in clinical microbiology, it needs to be cost-effective. It is commonly accepted that sequencing costs are plummeting, with the National Human Genome Research Institute (NHGRI) estimating the cost per raw megabase (Mb) of DNA sequence at a fraction of a US dollar (www.genome.gov/sequencingcostsdata). This has led to claims that a draft bacterial genome can now be generated extremely cheaply [ ]. This is a misunderstanding, as one cannot simply extrapolate the cost of a bacterial genome by multiplying a high-throughput per-megabase (Mb) sequencing cost by the size of its genome. For microbial sequencing, multiple samples must be multiplexed for cost efficiency, which is easier to achieve in large reference laboratories with high sample turnover. Excluding indirect costs such as salaries for personnel, preparation of sequencing libraries now makes up the major fraction of microbial sequencing costs (Figure ). The precipitous drop in the cost of producing raw DNA sequence in recent years (Figure A) mostly reflects a massive increase in output with new iterations of Illumina production machines. These numbers ignore all other costs and simply reflect output relative to the cost of the sequencing kits/cartridges. Realistic cost estimates for a microbial genome including library preparation on the best available platforms give a different picture (Figure B).
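The arithmetic behind this misunderstanding is easy to sketch. The figures below are purely hypothetical placeholders (the article's own numbers were lost in extraction); the point is only that a fixed per-library cost dominates once realistic read depth is accounted for:

```python
def naive_genome_cost(genome_mb, cost_per_mb):
    """Naive extrapolation: per-Mb sequencing cost times genome size."""
    return genome_mb * cost_per_mb

def realistic_genome_cost(genome_mb, cost_per_mb, coverage, library_prep):
    """Cost including the read depth actually needed plus library preparation."""
    return genome_mb * coverage * cost_per_mb + library_prep

# Hypothetical figures: a 5 Mb genome, 0.01 USD per raw Mb,
# 50x coverage, 40 USD library preparation per sample.
naive = naive_genome_cost(5, 0.01)                    # ~0.05 USD
realistic = realistic_genome_cost(5, 0.01, 50, 40.0)  # 42.5 USD, dominated by library prep
```

Under these (invented) numbers the realistic figure is almost three orders of magnitude above the naive extrapolation, and lowering the per-Mb cost barely moves it.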
Since the introduction of the Illumina MiSeq platform, new sequencing kits generating higher output have only marginally affected true microbial genome sequencing costs, as library preparation makes up a significant portion of the total cost of a typical bacterial genome. These costs have remained stable over time and are unlikely to go down significantly in the near future. Indeed, the market seems to be consolidating into fewer hands (e.g., as represented by the acquisition of Kapa by Roche), which economic theory predicts will not favor price decreases. It is also important to keep in mind that these costs are massive underestimates which do not include indirect costs such as salaries for laboratory personnel and downstream bioinformatics. Such indirect costs are difficult to estimate precisely in an academic setting but are far from trivial. Per-genome sequencing and analysis costs are likely to be even higher in a clinical diagnostics environment due to the need for highly standardised and accredited procedures. However, a micro-costing analysis covering laboratory and personnel costs estimated the cost of clinical WGS per M. tuberculosis isolate to be lower than that of standard methods, representing relatively marginal cost savings but significant time savings [ ]. WGS does indeed represent a potentially cost-effective and highly informative tool for clinical diagnostics, but for microbiology-scale sequencing we seem to be in a post-plummeting-costs age. One key feature of useful diagnostic tools is their ability to rapidly inform treatment. Most applications of WGS so far have been for lab-cultured organisms (bacteria and fungi). Traditional culture methods require long turnaround times, with bacterial cultures typically taking days, fungal cultures longer, and mycobacterial cultures the longest.
In this scenario, WGS is used as an adjunct technology primarily to provide information on the presence of AMR and virulence genes, which is particularly useful for mechanisms that are difficult to determine phenotypically (e.g., carbapenem resistance). This use of WGS, whilst solving some current clinical problems, does not speed up the diagnosis of infection; it is more the case that new technology is replacing some of the more cumbersome laboratory techniques whilst providing additional information. WGS is more appealing as a fast microbiological diagnostics solution when combined with procedures that circumvent (or shorten) the traditional culture step. This can be achieved through direct sampling of clinical material (Box ) or by using a protocol enriching for sequences of specific organism(s). Such enrichment methods, generally based on the capture of known sequences through hybridization, are a particularly tractable approach for viruses due to their small genome size. For example, the VirCap virome capture method targets all known viruses and can even enrich for novel sequences [ ]. Similar methods targeting specific organisms have been developed and successfully deployed, representing an attractive option for unculturable organisms [ , [ ] [ ] [ ] [ ]. Relative to the time required for culture and downstream analysis of the data, variation in the speed of different sequencing technologies is relatively modest. There is considerable enthusiasm for Oxford Nanopore Technology (ONT), which outputs data in real time, although the ONT requires a comparable amount of time to the popular Illumina MiSeq sequencer to generate the same volume of sequence data. Sequencing on the MiSeq takes on the order of hours, but as run time correlates with sequence output and read length, researchers tend to systematically favour runs of longer duration.
In the context of this review, genetic material from the human patient present in clinical samples represents contamination, a major obstacle to obtaining a high yield of microbial DNA. Protocols exist to deplete human DNA prior to sequencing [ , ], but these are not completely problem-free, as the depletion protocol is likely to bias estimates of the microbial community, and some human reads will likely remain. In particular, levels of human DNA are significantly higher in faecal samples from hospitalized patients compared with healthy controls [ ], suggesting that the problem is exacerbated in clinical settings. Therefore, the ethical and legal issues raised by introducing human WGS into routine healthcare [ ] cannot be avoided by microbially focused clinical metagenomics. Dismissing these concerns as minor may be an option for academic researchers uninterested in these human data, but it is naive to think that hospital ethics committees will share this view. Even in the absence of human DNA, metagenomic samples from multiple body sites can be used to identify individuals in datasets of hundreds of people [ ]. Managing clinical metagenomics data in light of these concerns should be taken seriously, not only as a barrier to implementation but because of the real risks to patient privacy.

Box . WGS beyond single genomes. WGS in the strict sense usually refers to sequencing the genome of a single organism, and it is common to distinguish between the sample (the material that has actually been taken from the patient) and the isolate (an organism that has been cultured and isolated from that sample). WGS methods traditionally sequence a cultured isolate to reduce contamination from other organisms, or sometimes rely on enrichment strategies targeting sequences from a specific organism [ , ]. However, this represents only a small fraction of the total microbial diversity present in a clinical sample. In contrast, metagenomic approaches sequence samples in an untargeted way. This approach is particularly relevant for clinical scenarios where the pathogen of interest cannot be predicted and/or is fastidious (i.e., has complex culturing requirements). Example applications of clinical metagenomics include: when the disease-causing agent is unexpected [ , ]; investigating the spread of AMR-carrying plasmids across species [ ]; and characterizing the natural history of the microbiome [ ]. The removal of the culture requirement can drastically decrease turnaround time from sample to data and enable identification of both rare and novel pathogens. Different samples, however, present different challenges. Easy-to-collect sample sites (e.g., faeces and sputum) typically also have a resident microbiota, so it can be challenging to distinguish the etiological agent of disease from colonizing microbes. Conversely, sites that are usually sterile (e.g., cerebrospinal fluid, pleural fluid) present a much better opportunity for metagenomics to contribute to clinical care. Metagenomic data are more complex to analyze than single-species WGS data and tend to rely on sophisticated computational tools, such as the DESMAN software allowing inference of strain-level variation in a metagenomic sample [ ]. Such approaches can be difficult to implement, are computationally very demanding, and are unlikely to be deployable in clinical microbiology in the near future, although cloud-based platforms may circumvent the need for computational resources in diagnostic laboratories. Furthermore, some faster approaches for rapid strain characterization from raw sequence reads, such as Mash [ ] and KmerFinder [ , ], could find a use in diagnostic microbiology, with the latter having been shown to identify the presence of pathogenic strains even in culture-negative samples [ ]. However, the differences between these methods should not obscure their fundamental similarities. Obtaining single-species genomes from culture is one end of a continuum of methods that stretches all the way to full-blown metagenomics of a sample. In principle, all methods produce the same kind of data: strings of bases. Furthermore, in all cases what is clinically relevant represents only a small fraction of these data. Integrating sequencing data from different methods into a single diagnostics pipeline is therefore an attractive prospect to quickly identify the genomic needles in the metagenomic haystack in a species-agnostic manner. For example, the presence of a particular antibiotic-resistance gene in sequencing data may recommend against the use of that antibiotic; whether the gene is present in data from a single-species isolate or from metagenomes is irrelevant. As an example, Leggett et al. used MinION metagenomic profiling to identify pathogen-specific AMR genes present in a faecal sample from a critically ill infant, all within hours of taking the initial sample [ ].

A major problem in the analysis of WGS data is that there are currently very few (if any) accepted gold standards. The fundamental steps of WGS analyses in microbial genomics tend to be similar across applications and usually consist of the following: sequence data quality control; identification/confirmation of the sequenced biological material; characterization of the sequenced isolate (including typing efforts as well as characterization of virulence factors and putative AMR elements/mutations); epidemiological analysis; and finally, storage of the results (Figure ). However, how these analyses are implemented varies widely, both between microbial species and between laboratories.
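Rapid k-mer-based characterization, as performed at scale by tools such as Mash and KmerFinder discussed above, reduces to comparing sets of short subsequences against reference sketches. A toy sketch follows; the reference sequences and species names are invented for illustration, and real tools use large curated databases and much longer k-mers:

```python
def kmers(seq, k=4):
    """The set of all overlapping k-mers in a DNA string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Invented reference 'sketches' standing in for a curated database.
references = {
    "species_A": kmers("ACGTACGTGGCATTGACCGT"),
    "species_B": kmers("TTGGCCAATTGGCCAATTGG"),
}

def identify(sample_kmers, refs):
    """Assign the reference sharing the largest fraction of the sample's k-mers."""
    return max(refs, key=lambda name: len(sample_kmers & refs[name]) / len(sample_kmers))

call = identify(kmers("ACGTACGTGGCATTGA"), references)  # -> "species_A"
```

Because no alignment or assembly is involved, this kind of lookup scales to raw reads and is the reason such methods can run in seconds rather than hours.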
Despite some commercial attempts at one-stop analysis suites such as Ridom SeqSphere+ (http://www.ridom.com/seqsphere/), most laboratories use a collection of open-source tools to perform particular sub-analyses. Typically, these tools are then woven together into a patchwork of software (a 'pipeline'). The idea of a pipeline is to allow within-laboratory standardized analysis of batches of isolates with relatively little manual bioinformatics work. Such pipelines can be highly customized for a wide range of questions. There are also some communal efforts at streamlining workflows across laboratories. As an example, Galaxy (https://usegalaxy.org) is a framework that allows non-bioinformaticians to use a wide array of bioinformatics tools through a web interface. One major limitation to rapidly attaining useful information in a clinical setting is that analysis pipelines for microbial genomics have generally been developed for fundamental research or public health epidemiology [ ]. This usually means that the pipeline permits a very thorough and sophisticated workflow with a large number of options and moving parts. For example, at the time of writing, the 'QC and manipulation' step in Galaxy alone consists of a long list of different tools, tests, and workflows that can be applied to an input sequence. While this is desirable from a researcher's perspective, it is clearly prohibitive for real-time analysis in a clinical setting. A user requires in-depth knowledge about the purpose each tool serves, the relative strengths and weaknesses of each approach, and a functional understanding of the important parameters. Furthermore, most analysis pipelines require proficiency in Linux systems and navigating the command line, something clinical microbiologists are rarely trained for. The road to stringent, exhaustive analysis of WGS data is long and paved with good intentions.
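The canonical step sequence described earlier (quality control, identification, characterization, storage) can be expressed as a deliberately minimal pipeline skeleton. Every function below is a placeholder stub, not a real tool; the point is only the shape of the chaining that lets a batch of isolates run without manual intervention:

```python
def quality_control(reads):
    """Stub QC: drop reads below a minimum length."""
    return [r for r in reads if len(r) >= 4]

def identify_species(reads):
    """Stub identification: a real pipeline would use k-mer matching here."""
    return "species_A" if reads else "unknown"

def characterize(reads, species):
    """Stub characterization: typing, virulence factors, AMR determinants."""
    return {"species": species, "n_reads": len(reads)}

def run_pipeline(reads):
    """Chain the steps into a single automated run per isolate."""
    passed = quality_control(reads)
    return characterize(passed, identify_species(passed))

report = run_pipeline(["ACGTACGT", "ACG", "TTGGCCAA"])
```

Each stub corresponds to a slot where a research pipeline would offer dozens of interchangeable tools; a clinical pipeline would instead fix one validated choice per slot.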
In order to move towards real-time interpretable results for clinics, it will be necessary to take certain shortcuts. The focus should be on rapid, automated analysis and clear, unambiguous results. Some steps in the pipeline can simply be omitted for clinical purposes. As an example, genome assembly might appear to be a bottleneck for real-time WGS diagnostics, but it is probably rarely required; sufficient characterization of an isolate can be made by analysis of the k-mers in the raw sequence data, which is orders of magnitude faster. Accurate identification of an isolate can be made rapidly with MinHash-based k-mer matching methods such as Mash [ ], and AMR elements can be identified from k-mers alone [ ]. Another example of a computationally intensive step that could be omitted from a default pipeline is sophisticated phylogenetic inference. Best practice for the creation of phylogenetic trees may involve evaluating the individual likelihood of a very wide range of possible trees given a sequence alignment or other distance metric, repeated for thousands of bootstrapped replicates, giving a tree with high confidence but at extreme computational cost. A clinical pipeline could use much faster approaches and still provide an informative phylogenetic tree [ ]. In Figure we outline our schematic vision of a computational pipeline specific to diagnostics in clinical microbiology. The clinical pipeline would encompass only a small subset of the research pipeline, aimed at generating rapid and interpretable output. For epidemiological inference, pairwise distances between strains would be computed as a matrix of Jaccard distances on the shared proportion of k-mers as output by Mash [ ]. This matrix could be used to generate a phylogenetic tree using a computationally inexpensive method (e.g., neighbor-joining).
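The distance-matrix step can be illustrated with exact k-mer sets; Mash approximates the same Jaccard index using MinHash sketches so that whole genomes can be compared in memory. The isolate names and sequences below are invented:

```python
from itertools import combinations

def kmer_set(seq, k=4):
    """All overlapping k-mers of a DNA string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b):
    """1 minus the shared k-mer fraction of two genomes."""
    return 1.0 - len(a & b) / len(a | b)

# Invented isolates: two near-identical outbreak strains plus an outgroup.
isolates = {
    "outbreak_1": kmer_set("ACGTACGTGGCATTGA"),
    "outbreak_2": kmer_set("ACGTACGTGGCATTGC"),
    "unrelated":  kmer_set("TTGGCCAATTGGCCAA"),
}

dist = {(i, j): jaccard_distance(isolates[i], isolates[j])
        for i, j in combinations(isolates, 2)}

closest = min(dist, key=dist.get)  # the pair a neighbor-joining step would merge first
```

This pairwise matrix is exactly the input a fast tree-building method such as neighbor-joining consumes, with no alignment or assembly required.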
[Figure caption: steps on the right marked with an asterisk represent simplified versions optimised for speed. cgMLST, core genome multilocus sequence typing; SNP, single-nucleotide polymorphism; wgMLST, whole genome multilocus sequence typing.]

Additionally, a correlation between pairwise genetic distance and sampling date could be performed to test for evidence of temporal signal in the data (i.e., accumulation of a sufficient number of mutations over the sampling period). In the presence of temporal signal, the user would be provided with a transmission chain based on a fast algorithm such as SeqTrack [ ]. Any bespoke pipeline for clinical diagnostics would need to be linked with regularly updated multi-species databases containing information about the latest developments in typing schemes, as well as clinically important factors such as AMR determinants. Results would have to be continuously validated, and international accreditation standards met at regular intervals. At a national level, accreditation bodies (e.g., UKAS in the UK) may lack the required expertise. In our experience, many promising databases have collapsed after funding expired or the responsible postdoc left for another job. If WGS is ever to make it into the clinic, it will be necessary to secure indefinite funding of both infrastructure and personnel for such databases. The lack of uptake of WGS-based diagnostics may also be due in part to an understandable desire to maintain the status quo in a busy hospital environment with already established treatment and intervention systems. Additionally, and perhaps significantly, it also highlights the difficulty of communicating the potential benefits of WGS to the day-to-day life of a clinic. The main proponents of WGS tend to be based in the public health/research environment and are rarely actively involved in clinical decision-making.
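The temporal-signal test described above is, at its simplest, a regression of accumulated genetic distance against sampling date; a clearly positive slope suggests that enough mutations accrue over the sampling period for dating to be meaningful. A stdlib-only sketch with invented SNP counts:

```python
def regression_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Invented data: SNP distance to the inferred root by sampling year.
years = [0, 1, 2, 3, 4]
snps = [0, 2, 4, 5, 9]
rate = regression_slope(years, snps)  # positive slope: temporal signal present
```

A flat slope (the same distance regardless of sampling date) would indicate that no usable temporal signal has accumulated, one of the failure modes for outbreak reconstruction noted earlier.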
Such a gap can present something of a language barrier, challenging meaningful dialogue over how the adoption of new approaches can lead to quantifiable improvements in existing systems. Further, the physical planning, implementation, and integration of WGS diagnostics may be unlikely to succeed without carefully planned introduction and continued training of its user base. This is of course challenged by the already resource-limited infrastructure of many clinical settings. Despite its immense promise and some early successes, it is difficult to predict if and when WGS will completely supersede current standards in clinical microbiology. There are several major bottlenecks to its implementation as a routine approach to diagnose and characterise microbial infections (see Outstanding Questions). These include, among others: the current costs of WGS, which remain far from negligible despite a common belief that sequencing costs have plummeted; a lack of training in, and possible cultural resistance to, bioinformatics among clinical microbiologists; a lack of the necessary computational infrastructure in most hospitals; the inadequacy of existing reference microbial genomics databases necessary for reliable AMR and virulence profiling; and the difficulty of setting up effective, standardized, and accredited bioinformatics protocols. Focusing in the near future on WGS applications that fulfil unmet diagnostic needs and demonstrate clear benefits to patients and healthcare professionals will help to drive the cultural changes required for the transition to WGS in clinical microbiology. However, irrespective of how this transition occurs and how complete it is, it is likely to feel highly disruptive for many clinical microbiologists. There is also a genuine risk that precious knowledge in basic microbiology will be lost after the transition to WGS, particularly if investment prioritises new technology at the expense of older expertise.
More positively, irrespective of the future implementation of WGS in clinical microbiology, we should not forget that the availability of extensive genomic data has been instrumental in the development of a multitude of routine non-WGS typing schemes. Efforts to develop WGS-based microbial diagnostics have unsurprisingly focused on high-resource settings. However, we see an opportunity for low-/medium-income countries to get up to speed with the latest WGS-based developments in real-time clinical diagnostics, rather than adopting classical microbiological phenotyping, which might eventually be largely phased out in high-income countries. One precedent for the successful adoption of a technology without transitions through its acknowledged historical predecessors is the widespread use of mobile phones in Africa. This has greatly increased communication and allowed access to e-banking, despite the fact that many people previously had no traditional bank account and only limited access to landlines.

Outstanding Questions
- Can WGS be used to develop robust classification schemes that account for the genetic diversity of organisms with open genomes?
- Which clinically relevant phenotypes can be reliably predicted using WGS, and for which organisms?
- How can phylogenetic analyses of outbreaks be sped up to meaningfully contribute to infection control at actionable time scales?
- How can publicly available databases be reliably maintained to the required clinical accreditation standards over long time periods?
- Will the true cost of generating a bacterial genome remain stable as the sequencing market consolidates into fewer hands?
- How can clinical metagenomic data be managed safely in line with the ethical considerations applicable to identifiable human DNA?
- How can unwieldy bioinformatics pipelines developed with academic research in mind be adapted for a clinical setting?
- Can current expertise in traditional clinical microbiology be maintained in the transition to WGS?
Most hospitals in the developing world do not currently benefit from a clinical microbiology laboratory. The installation of a molecular laboratory based around a standard sequencer, such as a benchtop MiSeq, might constitute an ideal investment, as it is neither far more expensive nor more complex than setting up a standard clinical microbiology laboratory.

References
- High-throughput sequencing and clinical microbiology: progress, opportunities and challenges
- Transforming clinical microbiology with bacterial genome sequencing
- Routine use of microbial whole genome sequencing in diagnostic and public health microbiology
- Bacterial genome sequencing in the clinic: bioinformatic challenges and solutions
- Utility of matrix-assisted laser desorption ionization-time of flight mass spectrometry following introduction for routine laboratory bacterial identification
- Armed conflict and population displacement as drivers of the evolution and dispersal of Mycobacterium tuberculosis
- Multilocus sequence typing as a replacement for serotyping in Salmonella enterica
- A robust SNP barcode for typing Mycobacterium tuberculosis complex strains
- A genomic portrait of the emergence, evolution, and global spread of a methicillin-resistant Staphylococcus aureus pandemic
- Benchmarking of methods for genomic taxonomy
- ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes (Antimicrob.)
- SSTAR, a stand-alone easy-to-use antimicrobial resistance gene predictor
- PhyResSE: a web tool delineating Mycobacterium tuberculosis antibiotic resistance and lineage from whole-genome sequencing data
- Antimicrobial resistance prediction in PATRIC and RAST
- Rapid determination of anti-tuberculosis drug resistance from whole-genome sequences
- Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis
- WGS accurately predicts antimicrobial resistance in Escherichia coli
- Prediction of Staphylococcus aureus antimicrobial resistance by whole-genome sequencing
- Whole-genome sequencing and epidemiological analysis do not provide evidence for cross-transmission of Mycobacterium abscessus in a cohort of pediatric cystic fibrosis patients
- Short-read whole genome sequencing for determination of antimicrobial resistance mechanisms and capsular serotypes of current invasive Streptococcus agalactiae recovered in the USA
- Using whole genome sequencing to identify resistance determinants and predict antimicrobial resistance phenotypes for year invasive pneumococcal disease isolates recovered in the United States
- Comparison of sequenced Escherichia coli genomes
- In silico serine beta-lactamases analysis reveals a huge potential resistome in environmental and pathogenic species
- Validation of beta-lactam minimum inhibitory concentration predictions for pneumococcal isolates with newly encountered penicillin binding protein (PBP) sequences
- Evolutionary mechanisms shaping the maintenance of antibiotic resistance
- Multicopy plasmids potentiate the evolution of antibiotic resistance in bacteria
- Spatiotemporal microbial evolution on antibiotic landscapes
- VFDB: hierarchical and refined dataset for big data analysis - years on
- Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli
- Genetic diversity, mobilisation and spread of the yersiniabactin-encoding mobile element ICEKp in Klebsiella pneumoniae populations
- Tracking a hospital outbreak of KPC-producing ST Klebsiella pneumoniae with whole genome sequencing
- Nested Russian doll-like genetic mobility drives rapid dissemination of the carbapenem resistance gene bla(KPC)
- Evolution and transmission of carbapenem-resistant Klebsiella pneumoniae expressing the bla(OXA-) gene during an institutional outbreak associated with endoscopic retrograde cholangiopancreatography
- Utility of whole-genome sequencing in characterizing Acinetobacter epidemiology and analyzing hospital outbreaks
- Rapid whole-genome sequencing for investigation of a neonatal MRSA outbreak
- Whole-genome sequencing for the investigation of a hospital outbreak of MRSA in China
- Prolonged and large outbreak of invasive group A Streptococcus disease within a nursing home: repeated intrafacility transmission of a single strain
- Genomic analysis and comparison of two gonorrhea outbreaks
- Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks
- Impact of HIV co-infection on the evolution and transmission of multidrug-resistant tuberculosis
- Bayesian inference of infectious disease transmission from whole-genome sequence data
- Microevolutionary analysis of Clostridium difficile genomes to investigate transmission
- BEAST: a software platform for Bayesian evolutionary analysis
- Bayesian phylogenetics with BEAUti and the BEAST
- Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks
- Transmission of Staphylococcus aureus between health-care workers, the environment, and patients in an intensive care unit: a longitudinal cohort study based on whole-genome sequencing
- Whole-genome sequencing to determine transmission of Neisseria gonorrhoeae: an observational study
- A pilot study of rapid benchtop sequencing of Staphylococcus aureus and Clostridium difficile for outbreak detection and surveillance
- Whole-genome sequencing for analysis of an outbreak of methicillin-resistant Staphylococcus aureus: a descriptive study
- Real time application of whole genome sequencing for outbreak investigation - what is an achievable turnaround time?
- Translating genomics into practice for real-time surveillance and response to carbapenemase-producing Enterobacteriaceae: evidence from a complex multi-institutional KPC outbreak
- Real-time, portable genome sequencing for Ebola surveillance
- Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples
- Inferences from tip-calibrated phylogenies: a review and a practical guide
- When are pathogen genome sequences informative of transmission events?
- Within-host bacterial diversity hinders accurate reconstruction of transmission networks from genomic distance data
- nextflu: real-time tracking of seasonal influenza virus evolution in humans
- Microreact: visualizing and sharing data for genomic epidemiology and phylogeography
- Insights from years of bacterial genome sequencing
- Rapid, comprehensive, and affordable mycobacterial diagnosis with whole-genome sequencing: a prospective study
- Virome capture sequencing enables sensitive viral diagnosis and comprehensive virome analysis
- Deep sequencing of viral genomes provides insight into the evolution and pathogenesis of varicella zoster virus and its vaccine in humans
- Specific capture and whole-genome sequencing of viruses from clinical samples
- Same-day diagnostic and surveillance data for tuberculosis via whole-genome sequencing of direct respiratory samples
- Rapid whole genome sequencing of M. tuberculosis directly from clinical samples
- Depletion of human DNA in spiked clinical specimens for improvement of sensitivity of pathogen detection by next-generation sequencing
- A method for selectively enriching microbial DNA from contaminating vertebrate host DNA
- Excretion of host DNA in feces is associated with risk of Clostridium difficile infection
- The ethical introduction of genome-based information and technologies into public health
- Identifying personal microbiomes using metagenomic codes
- Astrovirus VA/HMO-C: an increasingly recognized neurotropic pathogen in immunocompromised patients
- Human coronavirus OC associated with fatal encephalitis
- Natural history of the infant gut microbiome and impact of antibiotic treatment on bacterial strain diversity and stability
- DESMAN: a new tool for de novo extraction of strains from metagenomes
- Mash: fast genome and metagenome distance estimation using MinHash
- Rapid whole-genome sequencing for detection and characterization of microorganisms directly from clinical samples
- Rapid MinION metagenomic profiling of the preterm infant gut microbiota to aid in pathogen diagnostics
- Whole genome sequencing in clinical and public health microbiology
- Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study
- Reconstructing disease outbreaks from genetic data: a graph approach

Acknowledgments: We are grateful to Nadia Debech and Jan Oksens for their help with digging up historic pricing information for sequencing.

key: cord- -rmjv ia
title: the signal sequence of the p protein of semliki forest virus is involved in initiation but not in completing chain translocation
date: - -
journal: j cell biol
doi: nan
sha:
doc_id: cord_uid: rmjv ia
a most important finding has been that all proteins made at the er carry a signal sequence (also called signal peptide), a hydrophobic peptide which is usually located at the nh -terminal region of the polypeptide chain. one function of the signal peptide is to achieve targeting of the polysome to the er membrane (rapoport, ) . when the signal sequence emerges from the ribosome it binds to the signal recognition particle, which mediates binding of the polysome to the docking protein in the er. after this another function of the signal sequence is expressed, that is to interact with some components of the er membrane and thereby initiate translocation of the polypeptide chain into the lumen of the er (gilmore and blobel, ; robinson et al., ; wiedmann et al., ) . further synthesis of the polypeptide then continues with concomitant chain translocation. an important but as yet unresolved question is whether the signal sequence has any role in the translocation process per se or whether its functions are limited to the targeting and translocation-initiation steps. for instance, singer and co-workers danny huylebroeck's present address is innogenetics, industriepark , box , b- , ghent, belgium. ( a) have suggested a translocator protein model in which the signal sequence helps to keep the machinery open for chain transfer. it is specifically this last question we have addressed in the present work. we describe the characteristics and behavior of the uncleaved signal sequence of the p protein of semliki forest virus (sfv) l upon translocation across the er membrane in vitro. the p protein is one subunit of the heterodimeric spike protein of the sfv membrane (reviewed in garoff et al., ) . it is made as a precursor protein together with the other structural proteins of sfv, i.e., the nucleocapsid protein, c, and the other spike subunit, el. the three proteins are synthesized from a . 
-kb long mrna in the order c, p , and el, and separated by cleavage of the growing precursor chain. during synthesis of the p polypeptide at the er all but a residue cooh-terminal portion and the membrane anchor is translocated across the membrane. the p signal sequence has so far been only roughly localized to the nh -terminal third of the polypeptide chain (garoffet al., ; bonatti et al., ) . we show here that the signal sequence of p consists of a residue peptide at its nh -terminal region. this region includes one out of four glycosylation sites (asn~ ) for n-linked oligosaccharide on the p chain. we also demonstrate that the glycosylation of the p signal sequence occurs early during chain translocation. as this modification of the signal region most likely correlates with its release into the lumen of er it follows that the signal sequence of p is probably only needed for an initial step in chain translocation and not to small scale plasrnid dna preparations were done using the alkali-sds method essentially as described by birnboim and dnly ( ) . large quantities of plasmids to be used for in vitro transcription were prepared by lysozyme-triton lysis of the bacteria, followed by csc -etbr banding (kahn et al., ) . etbr was removed by several extractions with isopropanol and, after fivefold dilution, the dna was precipitated twice with ethanol and further purified over a biorad a- m column. restriction endonucleases and dna-modifying enzymes were used according to the suppliers instructions. removal of the ' sticky end from the sac i site in pgem -alphag (zerial et al., ) with t dna polymerase was done at °c ( h), dntps were added (end concentration #m each), and the dna was subsequently filled in at °c for h. all ligations were done at °c for h except for linker ligations ( °c, h). all other molecular biological manipulations were done using slightly modified standard protocols (maniatis et al., ) . in vitro transcription ( . 
µg supercoiled template dna per µl vol) in the presence of sp rna polymerase ( - u) and the cap structure was carried out as previously described (zerial et al., ) . in vitro translation reactions using a rabbit reticulocyte lysate were performed at °c essentially as described . µl of the in vitro synthesized rna was translated in a total volume of µl . potassium, magnesium, and spermidine concentrations were , . , and . mm, respectively. when indicated, µl of er membranes was included. in some translocations the membranes were pretreated with µm peptide for min on ice. the final peptide concentration in the total translation mixture here was, after addition of the pretreated membranes, adjusted to µm. to obtain partial synchronization of translation, ata was added after a preincubation of . - . min (borgese et al., ) . a final ata concentration of . mm was found to be sufficient for inhibiting initiation of chain synthesis (see control in fig. , lane ). higher concentrations of ata inhibited first translocation and then also chain elongation. for protease protection experiments, proteinase k was added to a final concentration of . mg/ml and the samples were incubated at °c for min in the presence or absence of % triton x- . proteolysis was stopped by the addition of pmsf (final concentration mg/ml) and samples were kept at °c for min before further processing for electrophoresis (cutler and garoff, ) . bands containing labeled protein were visualized by fluorography. quantitation of proteins was done by cutting the bands out of the dried gel, solubilizing them with protosol (from dupont de nemours, nen) according to the instructions of the manufacturer, and finally counting the s radioactivity in a liquid scintillator (wallac lkb, turku, finland). the localization of the bands on the dried gel was done with the aid of the fluorograph in transillumination. -µl translation mixtures were adjusted to ph - . by adding an appropriate volume (pretitrated) of . n naoh. 
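the band quantitation described above (counting radioactivity of excised gel bands with and without protease) reduces to a simple ratio. as a hedged sketch of that arithmetic — the function name and every count value below are invented for illustration, not reported in the text:

```python
# hypothetical illustration: estimating the protease-protected fraction of a
# translocated chain from gel-band radioactivity counts (cpm). all band names
# and numbers here are invented; the text only states that excised bands were
# counted in a liquid scintillator.

def protected_fraction(cpm_protease: float, cpm_untreated: float,
                       cpm_background: float = 0.0) -> float:
    """fraction of chains resistant to proteinase k, background-corrected."""
    signal_protease = cpm_protease - cpm_background
    signal_untreated = cpm_untreated - cpm_background
    if signal_untreated <= 0:
        raise ValueError("untreated-sample signal must be positive")
    return signal_protease / signal_untreated

# example: roughly half the hybrid is protected by intact microsomes,
# essentially none once the membranes are solubilized with detergent.
with_membranes = protected_fraction(cpm_protease=5200, cpm_untreated=10100, cpm_background=150)
with_detergent = protected_fraction(cpm_protease=220, cpm_untreated=10100, cpm_background=150)
print(round(with_membranes, 2))  # 0.51
print(round(with_detergent, 2))  # 0.01
```

the same ratio, computed per construct, is what underlies statements such as "about half of this material remains protected" later in the results.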
after a -min incubation on ice the samples were separated into a pellet fraction and a supernatant fraction by centrifugation through a -µl alkaline sucrose cushion (gilmore and blobel, ) for min at psi in an airfuge (beckman instruments, inc., palo alto, ca) using the a- / rotor and cellulose propionate tubes precoated with bsa ( % solution). the entire supernatant was removed, neutralized with n hcl, diluted . times with water, and then precipitated by adding . vol of acetone. these precipitated proteins and pelleted membranes (obtained from the airfuge tube) were taken up in % sds by incubating at °c for min and then processed for immunoprecipitation reactions as described below. total translation mixtures were adjusted to % sds, then boiled for min and diluted : with water. vol of immunoprecipitation buffer ( . % triton x- , mm nacl, mm tris-hcl, ph . , mm edta, and µg pmsf/ml) and µl of antibody were added for h at °c. the mixture was briefly centrifuged ( - min in an eppendorf minifuge) and to the supernatant one fifth volume of a : slurry of protein a beads was added and incubated at °c for h under constant agitation. the beads were collected and washed four times with ml ripa buffer (gielkens et al., ) by centrifugation, followed by a single wash with a buffer containing mm nacl, mm tris-hcl ph . , and µg pmsf/ml. the beads were then taken up in excess gel loading buffer (cutler and garoff, ) , heated at °c for min, and cleared by centrifugation before loading the immunoprecipitate on the gel. constructions of pgem alphagx and pgem dhfrx. for the construction of the final fusion protein-coding plasmids used in this study we first had to make plasmid pgem alphagx, which is derived from pgem alphag, and plasmid pgem dhfrx, which is derived from pgem dhfr (zerial et al., ) . 
plasmid pgem alphag contains a bp-long nco i-pst i fragment encompassing the entire chimpanzee alpha-globin coding region between the hinc ii and pst i sites of the polylinker of the plasmid pgem (promega biotech). the nco i site contains the translation initiation codon from alpha-globin (zerial et al., ). an xho i site, allowing subsequent in-frame ligations of sfv sequences, had to be introduced in pgem alphag. therefore, this plasmid was cut (upstream of the nco i site) with sac i, the ' sticky ends removed with t dna polymerase, an xho i octamer linker introduced and, after cutting with xho i, the plasmid was religated at low dna concentration ( µg/ml). plasmid pgem alphagx then contains the , bp-long xho i-pvu i fragment needed for the construction of the fusion protein-coding plasmid pc alphag. an intermediate construct, analogous to pgem alphagx, and also containing a unique xho i site, was needed for the constructions of dhfr-containing plasmids. for this purpose we inserted the xho i linker into partially xmn i cut pgem dhfr (zerial et al., ) . after cutting the linkers, linear plasmid was purified on agarose gel and religated. since the second xmn i site in pgem dhfr is located in the beta-lactamase coding region of the vector (sutcliffe, ) and insertion of an xho i site by an octamer linker will result in an ampicillin-sensitive e. coli phenotype after transformation, only the desired pgem dhfrx construct was obtained. from this plasmid, an xho i-pvu i fragment of at least , bp (the precise length of the cdna insert, i.e., the length of the ' untranslated region of dhfr, is not known in pgem dhfr) was used for the construction of pc dhfr. construction of the fusion protein-coding plasmids pc alphag and pc dhfr. 
plasmid pgemi-sfv (also called pg-sfv- / ; melancon and garoff, ) contains a reengineered cdna copy of the sfv s mrna sequences cloned as a bam hi fragment in the bam hi site of the polylinker downstream of the sp promoter in the plasmid pgem (promega biotech). from the sfv plasmid, a bp-long pvu i-xho i fragment, containing the coding sequences for the capsid protein and the nh -terminal region of the p protein, was isolated. the xho i-pvu i fragments from pgemi-sfv, pgem alphagx and pgem dhfrx were isolated and ligated at a : molar ratio to obtain pc alphag and pc dhfr, respectively. plasmid dnas from ampicillin-resistant colonies were screened and compared to the starting vectors by restriction analysis. altogether, the sfv-alpha-globin cdna fusion results in a complete c region and codons from the ' end of the p region fused to the whole of the alpha-globin coding sequence (see fig. ). eight new codons have been introduced at the point of cdna fusion. in the sfv-dhfr construction the c region and the first codons of p are fused to the dhfr coding sequence such that one new codon is introduced and the first codons of dhfr are lost. construction of plasmid p dhfr. for engineering of a p protein signal sequence-dhfr fusion protein which is not derived from a c protein-containing precursor we synthesized the whole p signal sequence region. two overlapping oligonucleotides were made (dna-synthesizer; applied biosystems, foster city, ca) :( ) ' atacacagaattcagcaccatgtccgccccgctgattac tgccatgtgtgtcctiv~caatc_~taccttcccgtc~ttccagcccccgtgtgtacc~, ( ) ' gttatcctcgagcatccgtagtgtggcctctgcgttgttttcatagcagcaaggtacacacgggggc tggaagcac gggaaggtagcattgcjcaaggac. they correspond to both strands of the p signal sequence region of the sfv cdna. together they span the coding region of amino acid residues - of p . oligo (the coding strand) includes, in addition, the region coding for the initiator methionine of the c protein plus its ' flanking sequences ( ' agcaccatg). 
at the extreme ' end of this oligo we have added the recognition sequence for eco ri and its flanking sequences from the ' end of the structural part of the sfv cdna ( ' atacacagattc). oligo ends at its ' end with the xho i site which follows the signal sequence region on the p gene. the two oligonucleotides were hybridized ( complementary bases), filled in using sequenase (united states biochemical co., cleveland, oh) and restricted with eco ri and xho i. the resulting dna fragment was then purified and inserted into pcp dhfr instead of the c and p sequences. for this purpose the pcp dhfr plasmid was eco ri and xho i restricted and the plasmid part with the dhfr sequences isolated. the resulting plasmid p dhfr contains thus the coding sequences for the initiator methionine of c and the first residues of the p protein, including the signal sequence, in front of the dhfr gene (see fig. ). construction of pgem sfv d- . this plasmid was constructed by ligatiag three fragments together. the first one was the major part of pgem , cut just after the promoter region with hind iii and barn hi. the second fragment (hind i~-xho i) was isolated from the plasmid psvs-sfv . this fragment contains the sequences encoding the capsid and the nh -terminal part of the protein of sfv. the third fragment was obtained by cleaving plasmid pl sfv d- (see below) with xho i and barn hi and isolating the fragment containing the ' part of the coding sequence for the p protein. however, it should be noted that in the d- version there is an exchange of codons at the ' end of the gene for six aberrant ones. the corresponding p protein variant is called d- (see fig. ). it should also be mentioned that pl sfv d- has been derived from pl sfv d- , (cutler and garoff, ) by exchanging the xho i-cla i region containing the ' part of the p coding region with the similar fragment from psv sfv d- . this latter plasmid is described in garoff et al. ( ) . 
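the two-oligo strategy above (one oligo per strand, annealing through complementary ' ends and then filled in to a duplex) can be sketched as a small complementarity check. the duplex and oligos below are short invented stand-ins, not the (partly garbled) p protein signal-sequence oligonucleotides quoted in the text:

```python
# sketch of the overlap check implied by a two-oligo fill-in design: the 3'
# end of the top-strand oligo must be complementary to the 3' end of the
# bottom-strand oligo. all sequences here are hypothetical examples.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap_length(top_oligo: str, bottom_oligo: str) -> int:
    """longest 3' stretch of top_oligo complementary to the 3' end of bottom_oligo."""
    top_of_bottom = revcomp(bottom_oligo)  # bottom oligo re-read on the top strand, 5'->3'
    best = 0
    for n in range(1, min(len(top_oligo), len(top_of_bottom)) + 1):
        if top_oligo[-n:] == top_of_bottom[:n]:
            best = n
    return best

# hypothetical 36-nt duplex flanked by eco ri (GAATTC) and xho i (CTCGAG) sites,
# echoing the restriction sites used in the construction above.
duplex_top = "GAATTCAGCACCATGTCCGCCCCGCTGATTCTCGAG"
top_oligo = duplex_top[:24]              # covers the 5' two thirds
bottom_oligo = revcomp(duplex_top[12:])  # covers the 3' two thirds, bottom strand
print(overlap_length(top_oligo, bottom_oligo))  # 12-nt complementary overlap
```

after annealing, a polymerase fill-in (sequenase in the protocol above) extends each oligo across the partner strand to give the full duplex.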
to define the p signal sequence we have studied the translocation phenotype of two reporter molecules, the rabbit alpha-globin and the mouse dihydrofolate reductase (dhfr), both of which have been extended at their nh -termini with an nh -terminal residue peptide from p . the hybrid molecules were tested in a microsome-supplemented in vitro translation system. the alpha-globin and the dhfr have earlier been shown to be translocation incompetent if not extended with a heterologous signal sequence at their nh -termini (zerial et al., ) . we first tested the expression of in vitro-made rna from the construction pcp dhfr in an in vitro translation system. this would be expected to yield free c protein and p -reporter hybrid (p -dhfr) through c-catalyzed autoproteolytic cleavage of the nascent c-p -reporter precursor ( fig. ) (aliperti and schlesinger, ; hahn et al., ; melancon and garoff, ) . furthermore, the p -reporter hybrid should be translocated across microsomal membranes and possibly glycosylated at asn~ of the p sequence if the residue-long nh -terminal p peptide carries a signal sequence. is shown to be linked to asn residue in the p part (garoff et al., ) . additional amino acids resulting from in-frame translation of the multicloning region of pgem and the added xho i linker as well as the initiator met of p dhfr are also indicated. analysis showing the translocation activity of the p -dhfr protein. in the absence of membranes (lane ) two major protein species were translated from the sp -directed transcript. one of these had the expected size of c ( kd) and the other one that of the p -dhfr hybrid molecule ( kd). the coding region has apparently been translated faithfully and the precursor protein cleaved efficiently. the identity of the p -dhfr was directly proven by immunoprecipitation with a dhfr-specific antiserum (see fig. ). the two weaker bands migrating faster than the capsid in fig. 
, lane were most likely derived from c coding sequences because they are found in all protein analyses of in vitro transcription/translation mixtures involving cdnas with c regions (compare fig. ). figure . immunological identification of the p -dhfr hybrid protein and analysis of its association with membranes. rna transcribed from pc dhfr was translated in vitro in the presence of membranes which in some cases had been treated with an acceptor (acc) or nonacceptor (non) peptide. the samples were treated, after translation, at ph - . , and the proteins then separated into a membrane-bound pellet fraction (p) and a supernatant fraction (s) by centrifugation. in all samples the p -dhfr polypeptides were isolated using an anti-dhfr antibody. the proteins were then analyzed by sds-page ( %) and subsequent autoradiography. the slower migrating band corresponds to glycosylated and the faster one to nonglycosylated forms of p -dhfr (compare fig. ) . when microsomes were added to the c-p -dhfr in vitro translation system a new band appeared which migrated somewhat slower than the p -dhfr band seen in the analysis of the mixture lacking membranes (fig. , lane ) . it almost comigrated with one of the two weak c-derived bands. the new band apparently corresponds to p -dhfr hybrids that have been translocated into the lumen of the added microsomes and have become glycosylated. the immunoprecipitation analysis shown in fig. confirmed the identity of this material. the protease digestions in the absence (fig. , lane ) and presence of triton x- (lane ) clearly demonstrated that the slower migrating p -dhfr molecules were indeed translocated. about half of this material remains protected in the presence of intact microsomes whereas all is digested when the membranes are solubilized with detergent. in contrast, the other translated material did not show such a pronounced membrane-dependent protease resistance. 
note that protease treatment of all samples yielded a resistant protein of a small size. this most likely represents a protease-resistant c fragment. the glycosylation of the translocated p -hybrid and its effect on the apparent size of the protein was shown in an experiment where a short peptide (asn-leu-thr), which competes for n-linked glycosylation, was included during translation. apparently only unglycosylated faster migrating p -dhfr hybrids were formed in these conditions although chain translocation took place conferring protease resistance (fig. , lanes - ) . additional analyses (lanes - ) illustrate that a control peptide (asn-leu-athr) which cannot serve as an acceptor site for n-linked glycosylation, had no effect on the glycosylation of the p -reporter hybrids when tested in an analogous way. similar studies as with pcp dhfr were also performed with the pcp globin coded proteins in vitro. the results (not shown) were analogous to those described above for the pcp dhfr construct. c protein and p -globin hybrid were synthesized in the absence of membranes. when membranes were added, a protease-protected form of the hybrid appeared. this hybrid was also glycosylated as deduced from an experiment involving the acceptor peptide for glycosylation. fig. (lanes - ) shows the results of analyses in which we have tested whether the p signal sequence region confers stable membrane attachment to the p -dhfr hybrid. microsome-supplemented translations were adjusted to ph - . with naoh, incubated on ice for min, and then separated into a membrane pellet and supernatant fraction by ultracentrifugation. in all samples the p -dhfr polypeptides were isolated using an anti-dhfr antibody, sds-page shows that the hybrid protein segregates almost quantitatively into the supernatant fraction (compare lane i with lane ). 
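the acceptor-peptide competition above rests on the standard n-linked glycosylation consensus ("sequon"), asn-x-ser/thr with x any residue except pro — background knowledge rather than something the text spells out, which is why an asn-leu-thr tripeptide can serve as an acceptor substrate while a variant lacking a recognizable third residue cannot. a minimal sequon scanner, with an invented example string:

```python
import re

# the n-x-s/t consensus (x != pro) for n-linked glycosylation acceptor sites.
# this is the generic textbook rule, applied here to an invented toy sequence,
# not to the actual p protein sequence from the paper.
SEQUON = r"N[^P][ST]"

def find_sequons(protein: str) -> list[int]:
    """return 0-based positions of asn residues that start an n-x-s/t sequon."""
    # a zero-width lookahead also catches overlapping sequons
    return [m.start() for m in re.finditer(f"(?={SEQUON})", protein)]

print(find_sequons("MNLTQPLNPTAGNAS"))  # [1, 12] -- NLT and NAS match; NPT is excluded by pro
```

note that matching the sequon is necessary but not sufficient in vivo: as discussed later in the text, a site must also be presented to the lumenal glycosylation machinery at a workable distance from the membrane.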
in similar conditions an integral membrane protein, the human transferrin receptor, was found to sediment with the membranes into the pellet fraction and a secretory protein, ig light chain, was only recovered in the supernatant (not shown). if the acceptor peptide for glycosylation was included in the in vitro translation and the mixture then analyzed we found that the now unglycosylated but still translocated p -dhfr hybrids were again mostly found in the supernatant fraction (lanes and ) . lanes and show the analyses with the control peptide. to see whether the c protein exerts an influence on the translocation phenotype of the p -dhfr protein the p dhfr plasmid (see fig. ), lacking the c gene, was tested. the results shown in fig. show clearly that the p -dhfr hybrid is translocated and glycosylated in the same way as when expressed from pcp dhfr. thus, apart from providing a free nh -terminal end to the p -dhfr protein by autoproteolysis of the c-p -dhfr precursor the c protein has no role in the translocation process. we conclude that the residue peptide from the p nh -terminal region confers a translocation-positive phenotype to the p -globin and p -dhfr polypeptides and therefore must contain a functional signal sequence. the translocated fusion proteins were also shown to be glycosylated. this must involve asn~ of the p peptide as it is part of the only potential glycosylation site on the hybrid polypeptides (garoff et al., ; references on dhfr sequence in legend to fig. ) . finally, we can also conclude that the p signal sequence does not provide a stable membrane anchor to the translocated chain. to define at what time point during p -dhfr chain synthesis the asn becomes glycosylated we performed a time-course experiment essentially as described by rothman and lodish ( ) (fig. ) . in this experiment a -µl translation was initiated. after . min ata was added ( . mm) to block additional starting of chain synthesis. then, at . -min intervals, two . 
µl aliquots were removed; one for mixing with µl of hot page sample buffer ( % sds) and the other one for further incubation after mixing with . µl of % tx- . the first sample from each time point was used for the determination of the time needed for chain completion, which is a function of the translation rate, and the other one allowed determination of the time course of glycosylation of the translocated chain. triton x- solubilizes the microsomal membranes and thereby inactivates glycosylation (but not chain elongation). therefore, only those p -dhfr chains that have presented asn to the glycosylation machinery before tx- addition have had the possibility to become glycosylated. in fig. , lanes - , one can see that completed p dhfr chains ( residues with initiator met) appear after a -min incubation from the time point of ata addition. if one assumes constant chain initiation during the figure . time course of p dhfr glycosylation. a -µl translation was initiated. after a preincubation time of . min ata was added to inhibit further initiation of chain synthesis. then, at intervals of . min (indicated by . ', . ', . ', . ', . ', . ', . ', . ', . ', and . ') two . -µl samples were removed, one for mixing with page sample buffer and another one for mixing with tx- (final concentration %) and further incubation at °c (for a total time of min after ata addition as indicated by the lower row of time points in the figure) . lanes - show the samples removed for mixing with the page sample buffer. from these results the approximate rate of translation can be derived. completed chains appear in the -min sample. lanes - show the samples in which the membranes have been solubilized with triton x- for inactivation of the glycosylation machinery. from these analyses it is possible to estimate when asn is modified during p -dhfr synthesis. the first glycosylated forms are clearly visible in the . 
-min sample, a time point where only about half of the p -dhfr chain has been synthesized. the nature of the material in the two weak bands seen in lanes - is unclear. their transient appearance before the completion of the p -dhfr chain suggests that they represent complexes of nascent p -dhfr chains. figure . time course of p d- glycosylation. seven translations in the presence of microsomal membranes were started in parallel. after a -min initial incubation at °c, ata was added in order to inhibit further initiation of chain synthesis. incubation was then continued for min. at the indicated time points ( , , , , , , and min) tx- (tx) was added to stop further chain glycosylation. at the same time one half of each sample was removed and put on ice in order to measure the extent of chain elongation at each time point. all samples were analyzed by sds-page ( %) and autoradiography. lanes - show the analysis of the samples incubated with tx- and lanes - the analysis of the portions put on ice at the different time points. the complete sequence of treatments for each sample is indicated by the labeling in each lane (upper row of time points indicates tx- addition and cooling on ice, respectively; lower row of time points indicates incubation in the presence of tx- ). lanes and represent controls. in the experiment shown in lane , ata was added before starting a -min membrane-supplemented translation. in the experiment shown in lane a translation with membranes was allowed to proceed for min. ata was added as in the time course samples but tx- was omitted. the c protein, the unglycosylated (p ) and the glycosylated (gp ) forms of p d- are labeled at right in the figure. arrowheads at left indicate (from above) the migration of the -kd igg heavy chain, the -kd ovalbumin, and the -kd carbonic anhydrase. note that somewhat different amounts of translation mixtures have been analyzed in the various lanes (compare intensities of c and c-derived bands). . 
-min preincubation without ata then the total time for chain synthesis is ~ . min ( + . min). this corresponds to a mean translation rate of . peptide bonds per min. lanes - show that glycosylated chains appear in all those samples that have had the membranes intact for . min or more after ata addition. this means that p dhfr chains that have been elongated for ~ . min ( . min incubation after and . min before ata addition), to the length of ~ residues already carry a sugar unit at asn . as ~ residues of the nascent chain are required to span the ribosome and the lipid bilayer we conclude that glycosylation occurs when the first - residues of p -dhfr appear within the lumen of the er (malkin and rich, ; blobel and sabatini, ; bergman and kuehl, ; smith et al., ; glabe et al., ; randall, ) . we also studied the timing of the glycosylation of asn~ in its normal background, i.e., during p chain synthesis. for this experiment we used the pgem sfvd- construct. this encodes the c and the p membrane protein variant, p d- , in which a few residues of the cytoplasmic protein domain have been exchanged as compared to the wild-type sequence (see materials and methods and fig. ) . fig. , lane , shows that rna, which has been transcribed from this construct, directs the synthesis of c and p d- chains. the c protein has catalyzed correct c-p cleavage and the p signal sequence has catalyzed the insertion of about half of the p d- chains across the added microsomal membranes. these migrate as glycosylated - kd proteins in contrast to the noninserted molecules which have an apparent molecular mass of ~ kd. the glycosylated and translocated nature of the - kd material was clearly demonstrated in experiments similar to those described above for the p -dhfr hybrid molecule (not shown). altogether there are four glycosylation sites within the p d- sequence. these correspond to asn residues at positions , , , and (see fig. ). fig. 
(lanes - ) shows the time course of the four glycosylation events during c-p d- translation. a slightly different protocol was followed in this experiment as compared to that with p -dhfr. seven translations were initiated in parallel and after a -min incubation these were put on ice and ata was added. elongation of the already initiated chains was then continued for a total of min, however, so that triton x- was added to individual samples at , , , , , , and min. at these time points half of each sample was also removed and translation stopped by cooling on ice. lanes - show the sds-page of the samples that had received triton x- at different time points. we found the sequential appearance of p d- polypeptides with no carbohydrate (lane ), with one and two units added (seen as two new bands with slower migration in lanes and ), with three units (lane ), and all four sugar units (lanes , , and ) attached to the protein backbone as the translation proceeded coordinately with time. note that the four glycosylation events result in different degrees of increase of the size of p d- . the second event causes the largest increase and the third one the smallest. as the sugar unit added at each step should be the same we think that these differences reflect some conformational changes in the p folding which occur coordinately with glycosylation. in lanes - we have analyzed the samples that were withdrawn at the different times but were kept on ice. as expected, we see a sequential appearance of first the capsid protein (in the -min sample) and then the p d- protein (barely visible in the -min sample). the p d- protein is partly present in its glycosylated and partly in its unglycosylated form. using . 
min as a rough estimate for the translation time of the residue-long c-p d- chain (time point of p d- detection, min, plus half of the -min preincubation time without ata) we have calculated the translation rate and derived the approximate earliest time points when the four glycosylation sites of p d- should be available for modification. according to these, asn~ and asn should be the only sites available for glycosylation in the -min sample, shown in lane , and the most abundant ones presented for modification in the -min sample, shown in lane . therefore, it appears reasonable to assume that those chains of these two samples which have obtained two sugar units carry these on the aforementioned two sites. thus, the peptide region with asn seems to be a target for rapid modification also when present in its normal background, that is with the p protein. the fact that the residue fragment of the nh -terminal region of the p protein is able to translocate two different reporter molecules into microsomes constitutes in our mind convincing evidence for signal sequence activity in this protein fragment. a more precise location of the p signal sequence within the residue p fragment can be done with the aid of the known consensus features of a signal sequence. the most typical characteristic of a signal sequence is a stretch of - uncharged residues, mostly hydrophobic ones (von heijne, ) . this part of the signal sequence probably forms an alpha-helix in the er membrane (emr and silhavy, ; briggs et al., , ; kendall et al., ; batenburg et al., ) . the only possible candidate region within the residue p fragment having these features is the residue segment between pro and pro (see box in uppermost sequence in fig. ) . the pro-rich region in the middle of the residue fragment would not form an alpha-helix, and the cooh-terminal part of the p segment contains a high number of charged residues. as shown in fig. 
these features are conserved in all those alphaviruses where the p protein has been sequenced. thus, we find the experimental results, together with the structural considerations discussed above, highly indicative that the nh -terminal residues of p constitute its signal sequence. eventually, the signal sequence of the p protein becomes translocated across the membrane of the er into its lumenal space. there it is found as a glycosylated peptide which is part of a residue-long "pro-piece" of the p protein. this pro-peptide, called e , is cleaved at a late stage during virus assembly (de curtis and simons, ) and is then either released into the extracellular medium as a soluble protein (sindbis virus) or remains as a peripheral protein subunit on the virus spike (sfv) (garoff et al., ; mayne et al., ) . our present tests of the p -globin and p dhfr hybrids in the high ph wash assay of membrane-supplemented in vitro mixtures also support the notion that the p signal sequence does not remain bound to the membrane where it has exerted its function as a translocation signal. in this work we use the glycosylation event at asn of the signal sequence to mark the time point when the latter becomes released into the lumen of the er. the crucial question then becomes whether it is reasonable to assume that the signal peptide has to be released from the er membrane before it can become glycosylated. to answer this question we have to consider what is known about the topology of glycosylation as well as the way by which the p signal might interact with the er membrane. today there is no exact information about how a signal sequence might be inserted into the er membrane when exerting its function in chain translocation. 
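the candidate-region reasoning above — look for a stretch of uncharged, mostly hydrophobic residues — is exactly what a sliding-window hydropathy scan formalizes. a minimal sketch, using the published kyte-doolittle scale values but an invented test sequence rather than the p protein fragment:

```python
# sliding-window hydropathy scan for locating a candidate hydrophobic core,
# e.g. of a signal sequence. the KD values are the standard kyte-doolittle
# scale; the example sequence below is invented for illustration.

KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def max_hydropathy_window(seq: str, window: int = 9):
    """return (start, mean kyte-doolittle score) of the most hydrophobic window."""
    best_start, best_score = 0, float("-inf")
    for i in range(len(seq) - window + 1):
        score = sum(KD[aa] for aa in seq[i:i + window]) / window
        if score > best_score:
            best_start, best_score = i, score
    return best_start, best_score

start, score = max_hydropathy_window("MSAPKDNETLLLVVLAILACVFSQRKDE")
print(start, round(score, 2))  # the hydrophobic run starting at residue 9 wins
```

a pro-rich or charge-rich region scores poorly in every window (pro, lys, arg, asp and glu all carry negative values on this scale), which mirrors the argument above for excluding those parts of the p fragment.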
however, the typical cytoplasmic orientation of the nh-termini of membrane protein chains carrying a combined signal sequence-anchoring peptide suggests that signal sequences in general might exert their function in translocation through the insertion of their hydrophobic and uncharged stretch of amino acid residues into the membrane in such an orientation that the nh-terminus of the signal remains on the outside of the er membrane (bos et al., ; lipp and dobberstein, ; spiess and lodish, ; zerial et al., ; see also shaw et al., ). [figure legend: sequences from (garoff et al., ; rice and strauss, ; dalgarno et al., ; kinney et al., ; chang and trent, ). amino acid residues are given using the one-letter code and are numbered from the nh- towards the cooh-terminus. the boxes indicate the region in each sequence which best fulfills the consensus features of a signal sequence (the uncharged and hydrophobic region). the * symbols represent attachment sites for oligosaccharide, and the (+) and (-) the presence of a charged amino acid side chain. proline residues are labeled with a dot. the sequences are aligned according to maximum amino acid sequence homology.] in addition, it is known from physical studies using synthetic signal peptides and artificial lipid membranes that signal peptides readily insert into the membrane and there adopt an alpha-helical conformation (briggs et al., ; batenburg et al., ; cornell et al., ). if the p signal sequence adopts such an orientation and conformation in the er membrane, it would mean that the glycosylation site at asn would be located inside the membrane (von heijne, ). in this location the site can hardly be accessible to the glycosylation machinery. (note that in the related ross river virus, the venezuelan equine encephalitis virus, and eastern equine encephalitis virus the corresponding glycosylation site is even closer to the nh-terminus, that is at asn ; see fig. .)
according to several recent studies, glycosylation requires the exposure of the glycosylation site in the lumen of the er. firstly, it has been shown that the binding protein for the glycosylation site of n-linked oligosaccharides is a lumenal -kd protein of the er (geetha-habib et al., ). secondly, one study with the asialoglycoprotein receptor and another with the coronavirus e membrane protein demonstrate that lumenally oriented glycosylation sites are not used on transmembrane polypeptides if they are located very close to the membrane-binding segments of the chains (mayer et al., ; wessels and spiess, ). in the case of the asialoglycoprotein receptor, a site was not used if located residues apart from the membrane anchor; however, if moved more residues apart from the anchor, it became glycosylated. in the case of the coronavirus protein, a site just adjacent to the combined signal sequence-anchor peptide remained unglycosylated, whereas an engineered site residues further away was used for glycosylation. such restrictions in glycosylation are most likely explained by steric problems in attaching the very spacious sugar unit (lee et al., ; see also wier and edidin, ) onto acceptor sites that are fixed in a position close to the membrane plane. therefore, we assume that the p signal sequence, with its glycosylation site at asn , cannot become glycosylated before it has been released into the er lumen. as this glycosylation event was shown to occur at an early stage of chain translocation, it follows that this signal sequence can only interact with the er membrane during the beginning of chain translocation. in other words, the signal sequence of p can only function at the initiation stage of chain translocation and has no role in completing this transfer process.
if the latter were true, we would have expected signal sequence glycosylation to occur only after all of the lumenal domain of the p d- chain had been translated and translocated. the importance of our results in this work lies in the fact that they rule out translocation models in which the signal sequence would have a role throughout the whole process of chain translocation. for instance, if the translocation site is represented by a multisubunit protein complex forming an aqueous channel across the membrane for chain transfer (see signal hypothesis, blobel and dobberstein, ; amphipathic tunnel hypothesis, rapoport, ; translocator protein hypothesis, singer et al., a,b), then the signal sequence could be involved in its assembly or "opening" but apparently not in keeping it together or open until chain transfer is completed (as suggested in singer et al., a). similarly, when considering models in which the chain transfer occurs directly through a lipid membrane (see the helical hairpin hypothesis, engelman and steitz, ; direct transfer model, von heijne and blomberg, ; phospholipid channel hypothesis, nesmayanova, ), the interaction of the signal sequence with the lipid bilayer could be of importance only at the stage of translocation initiation but not at the actual chain transfer step. we find it most unlikely that our results on p protein translocation are unique to the viral system and differ from the general translocation process in the er. several results from this and earlier works suggest that the signal sequence of the p protein functions much in the same way as cleavable ones do. firstly, studies with a temperature-sensitive mutant of sfv, ts , have shown that the signal sequence of p requires a free nh-terminal end for function (hashimoto et al., ).
at the nonpermissive temperature the ts mutant is defective in cleavage between the c and the p protein regions of the protein precursor because of a mutation that inactivates the autoproteolytic activity of c. this defect results in a translocation-negative phenotype for the p protein. secondly, the p signal sequence has been shown to be srp dependent. if the mrna for the structural proteins of sfv is translated in vitro in a wheat germ-derived system that is supplemented with salt-washed (and srp-deprived) membranes, then p translocation is observed only in the presence of exogenous srp (bonatti et al., ). if srp is supplemented without membranes, then p translation is arrested. thirdly, our time course study of p synthesis and glycosylation in this work clearly demonstrates that the p chain is translocated cotranslationally across the er membrane. this was also suggested by earlier studies in vitro (garoff et al., ; bonatti et al., ). in these studies it was shown that both microsomal membranes and srp have to be added to the synthesis mixture before extensive lengths ( amino acid residues) of the p chains have been translated. it is also possible to speculate on a mechanism in which the p signal sequence would be released from a putative translocation site by being replaced by another signal sequence-like structure in the p polypeptide. however, such a "rescue" mechanism appears improbable, as the p signal sequence was found to be glycosylated early during translation of both the p polypeptide and the signal sequence-dhfr hybrid chain.

references:

evidence for an autoprotease activity of sindbis virus capsid protein
characterization of the interfacial behavior and structure of the signal sequence of escherichia coli outer membrane pore protein phoe
addition of glucosamine and mannose to nascent immunoglobulin heavy chains
a rapid alkaline extraction procedure for screening recombinant plasmid dna
transfer of proteins across membranes. i. presence of proteolytically processed and unprocessed nascent immunoglobulin light chains on membrane-bound ribosomes of murine myeloma
controlled proteolysis of nascent polypeptides in rat liver cell fractions. i. location of the polypeptides within ribosomes
role of signal recognition particle in the membrane assembly of sindbis viral glycoproteins
ribosome-membrane interaction: in vitro binding of ribosomes to microsomal membranes
nh-terminal hydrophobic region of influenza virus neuraminidase provides the signal function in translocation
in vivo function and membrane binding properties are correlated for escherichia coli lamb signal peptides
conformations of signal peptides induced by lipids suggest initial steps in protein export
phenotypic expression in e. coli of a dna sequence coding for mouse dihydrofolate reductase
nucleotide sequence of the genome region encoding the s mrna of eastern equine encephalomyelitis virus and the deduced amino acid sequence of the viral structural proteins
conformations and orientations of a signal peptide interacting with phospholipid monolayers
structure of amplified normal and variant dihydrofolate reductase genes in mouse sarcoma sis cells
mutants of the membrane-binding region of semliki forest virus e protein. i. cell surface transport and fusogenic activity
ross river virus s rna: complete nucleotide sequence and decoded sequence of the encoded structural proteins
dissection of semliki forest virus glycoprotein delivery from the trans-golgi network to the cell surface in permeabilized bhk cells
importance of secondary structure in the signal sequence for protein secretion
the spontaneous insertion of proteins into and across membranes: the helical hairpin hypothesis
solid phase peptide synthesis
assembly of the semliki forest virus membrane glycoproteins in the membrane of the endoplasmic reticulum in vitro
nucleotide sequence of cdna coding for semliki forest virus membrane glycoproteins
structure and assembly of alphaviruses
expression of semliki forest virus proteins from cloned complementary dna. ii. the membrane-spanning glycoprotein e is transported to the cell surface without its normal cytoplasmic domain
glycosylation site binding protein, a component of oligosaccharyl transferase, is highly similar to three other kd luminal proteins of the er
synthesis of rauscher murine leukemia virus-specific polypeptides in vitro
translocation of secretory proteins across the microsomal membrane occurs through an environment accessible to aqueous perturbants
glycosylation of ovalbumin nascent chains: the spatial relationship between translation and glycosylation
sequence analysis of three sindbis virus mutants temperature-sensitive in the capsid protein autoprotease
evidence for a separate signal sequence for the carboxy-terminal envelope glycoprotein e of semliki forest virus
preparation and use of nuclease-treated rabbit reticulocyte lysates for the translation of eucaryotic messenger rna
dog pancreatic microsomal-membrane polypeptides analysed by two-dimensional gel electrophoresis
plasmid cloning vehicles derived from plasmids cole , f, r k, and rk
idealization of the hydrophobic segment of the alkaline phosphatase signal peptide
nucleotide sequence of the s mrna of the virulent trinidad donkey strain of venezuelan equine encephalitis virus and deduced sequence of the encoded structural proteins
expression of semliki forest virus proteins from cloned complementary dna. i. the fusion activity of the spike glycoprotein
assembly of asparagine-linked oligosaccharides
binding of synthetic clustered ligands to the gal/galnac lectin on isolated rabbit hepatocytes
structural and evolutionary analysis of the two chimpanzee alpha-globin mrnas
signal recognition particle-dependent membrane insertion of mouse invariant chain: a membrane-spanning protein with a cytoplasmically exposed amino terminus
transport of secretory and membrane glycoproteins from the rough endoplasmic reticulum to the golgi
partial resistance of nascent polypeptide chains to proteolytic digestion due to ribosomal shielding
molecular cloning: a laboratory manual
membrane integration and intracellular transport of the coronavirus glycoprotein e , a class iii membrane glycoprotein
biochemical studies of the maturation of the small sindbis virus glycoprotein e
reinitiation of translocation in the semliki forest virus structural polyprotein: identification of the signal for the e glycoprotein
processing of the semliki forest structural polyprotein: role of the capsid protease
on the possible participation of acid phospholipids in the translocation of secreted proteins through the bacterial cytoplasmic membrane
structure and genomic organization of the mouse dihydrofolate reductase gene
translocation of domains of nascent periplasmic proteins across the cytoplasmic membrane is independent of elongation
extensions of the signal hypothesis: sequential insertion model versus amphipathic tunnel hypothesis
protein translocation across and integration into membranes
improved plasmid vectors with a thermoinducible expression and temperature-regulated runaway replication
nucleotide sequence of the s mrna of sindbis virus and deduced sequence of the encoded virus structural proteins
identification of signal sequence binding proteins integrated into the rough endoplasmic reticulum membrane
polypeptide chain binding proteins: catalysts of protein folding and related processes in cells
synchronized transmembrane insertion and glycosylation of a nascent membrane protein
evidence for the loop model of signal-sequence insertion into the endoplasmic reticulum
on the translocation of proteins across membranes
on the transfer of integral proteins into membranes
nascent peptide as sole attachment of polysomes to membranes in bacteria
an internal signal sequence: the asialoglycoprotein receptor membrane anchor
complete nucleotide sequence of the escherichia coli plasmid pbr
subcellular location of enzymes involved in the n-glycosylation and processing of asparagine-linked oligosaccharides in saccharomyces cerevisiae
structural and thermodynamic aspects of the transfer of proteins into and across membranes
trans-membrane translocation of proteins: the direct transfer model
insertion of a multispanning membrane protein occurs sequentially and requires only one signal sequence
multiple mechanisms of protein insertion into and across membranes
a signal sequence receptor in the endoplasmic reticulum membrane
constraint of the translational diffusion of a membrane glycoprotein by its external domains
the transmembrane segment of the human transferrin receptor functions as a signal peptide

we thank ernst bause for constructive discussion; gunnar von heijne and michael baron for critical reading of the manuscript; johanna wahlberg for help with the figures; margareta berg, tuula marminen, and elisabeth servin for technical assistance; and ingrid sigurdson for typing.
this work was supported by grants from the swedish medical research council (b - x- - a), swedish natural science research council (b-bu - ), and swedish national board for technical development ( - p). received for publication march and in revised form may .

key: cord- -vmtjc ct authors: georgiev, vassil st. title: genomic and postgenomic research date: journal: national institute of allergy and infectious diseases, nih doi: . / - - - - _ sha: doc_id: cord_uid: vmtjc ct

the word genomics was first coined by t. roderick from the jackson laboratories in as the name for the new field of science focused on the analysis and comparison of complete genome sequences of organisms and related high-throughput technologies. two basic computational methods are used for genome analysis: gene finding and whole genome comparison ( ). gene finding. using a computational method that can scan the genome and analyze the statistical features of the sequence is a fast and remarkably accurate way to find the genes in the genome of prokaryotic organisms (bacteria, archaea, viruses), compared with the still difficult problem of finding genes in higher eukaryotes. by using modern bioinformatics software, finding the genes in a bacterial genome will result in a highly accurate, rich set of annotations that provide the basis for further research into the functions of those genes. the absence of introns (those portions of the dna that lie between two exons and are transcribed into rna but do not appear in the mature rna, and are therefore not expressed as proteins in protein synthesis) removes one of the major barriers to computational analysis of the genome sequence, allowing gene finding to identify more than % of the genes of most genomes without any human intervention. next, these gene predictions can be further refined by searching for nearby regulatory sites such as the ribosome-binding sites, as well as by aligning protein sequences to those of other species.
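As an illustration of the prokaryotic gene-finding idea described above, here is a minimal open-reading-frame scanner in Python. Real gene finders such as Glimmer add statistical models of coding bias on top of this skeleton, so this is only a hedged sketch of the first step, not any published tool's algorithm.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (uppercase A/C/G/T assumed)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def find_orfs(seq, min_len=6):
    """Return (start, end, strand) tuples for open reading frames
    (ATG ... stop) in all six frames. Coordinates refer to the strand
    on which the ORF was found."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            i = frame
            while i + 3 <= len(s):
                if s[i:i+3] == "ATG":
                    j = i + 3
                    while j + 3 <= len(s) and s[j:j+3] not in stops:
                        j += 3
                    if j + 3 <= len(s) and j + 3 - i >= min_len:
                        orfs.append((i, j + 3, strand))
                        i = j  # resume scanning after this ORF's stop
                i += 3
    return orfs
```

A statistical gene finder would then score each candidate ORF (e.g., with an interpolated Markov model of codon usage) rather than reporting every one.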
these steps can be automated using freely available software and databases ( ). gene finding in single-cell eukaryotes is of intermediate difficulty, with some organisms, such as trypanosoma brucei, having so few introns that a bacterial gene finder is sufficient to find their genes. other eukaryotic organisms (e.g., plasmodium falciparum) have numerous introns and require the use of a special-purpose gene finder, such as glimmerm ( , ). whole genome comparison. this computational method refers to the problem of aligning the entire deoxyribonucleic acid (dna) sequence of one organism to that of another, with the goal of detecting all similarities as well as rearrangements, insertions, deletions, and polymorphisms ( ). with the increasing availability of complete genome sequences from multiple, closely related species, such comparisons are providing a powerful tool for genomic analysis. using suffix trees (data structures that contain all of the suffixes of a particular sequence and can be built and searched in linear time), this computational task can be accomplished in minimal time and space. because the suffix tree algorithm is both time and space efficient, it is able to align large eukaryotic chromosomes with only slightly greater requirements than those for bacterial genomes ( ). bacterial genome annotation. the major goal of bacterial genome annotation is to identify the functions of all genes in a genome as accurately and consistently as possible, using initially automated annotation methods for preliminary assignment of functions to genes, followed by a second stage of manual curation by teams of scientists. the family enterobacteriaceae encompasses a diverse group of bacteria including many of the most important human pathogens (salmonella, yersinia, klebsiella, shigella), as well as one of the most enduring laboratory research organisms, the nonpathogenic escherichia coli k .
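The whole-genome comparison step described above can be caricatured with exact-match anchors. Production aligners (e.g., suffix-tree-based tools such as MUMmer) find maximal unique matches and chain them; the simple k-mer index below only conveys the anchoring idea, and the function name and parameters are illustrative, not from any real tool.

```python
def anchor_matches(a, b, k=12):
    """Find exact k-mer matches between sequences a and b, returned as
    (pos_a, pos_b) anchor pairs. A real aligner would extend these
    anchors and chain collinear ones into alignments."""
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i+k], []).append(i)
    matches = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j+k], []):
            matches.append((i, j))
    return matches
```

Runs of anchors with a constant offset (pos_a - pos_b) indicate collinear backbone; breaks in the offset pattern correspond to the rearrangements, insertions, and deletions the text mentions.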
many of these pathogens have been subject to genome sequencing or are under study. genome comparisons among these organisms have revealed the presence of a core set of genes and functions along a generally collinear genomic backbone. however, there are also many regions and points of difference, such as large insertions and deletions (including pathogenicity islands), integrated bacteriophages, small insertions and deletions, point mutations, and chromosomal rearrangements ( ). the first genome sequence of escherichia coli k (reference strain mg ) was completed and published in ( ). later, the genome sequences of two other genotypes of e. coli, the enterohemorrhagic e. coli o :h (ehec; strains edl and rimd -sakai) ( , ) and the uropathogenic e. coli (upec; strain cft ) ( ), were determined and the information published. currently, it is accepted that shigellae are part of the e. coli species complex, and information on the genome of shigella flexneri strain a has been published ( ). a comparison of all three pathogenic e. coli with the archetypal nonpathogenic e. coli k revealed that the genomes were essentially collinear, displaying both conservation in sequence and gene order ( ). the genes that were predicted to be encoded within the conserved sequence displayed more than % sequence identity and have been termed the core genes. similar observations were made for the shigella flexneri genome, which also shares . mb of common sequence with e. coli ( ). a comparison of the three e. coli genomes revealed that genes shared by all genomes amounted to , ( ) from a total of , , and about , and , predicted protein-coding sequences for e. coli k , ehec, and upec, respectively ( ). the region encoding these core genes is known as the backbone sequence. it was also apparent from these comparisons that interspersed throughout this backbone sequence were large regions unique to the different genotypes.
moreover, several studies had shown that some of these unique loci were present in clinical disease-causing isolates but were apparently absent from their comparatively benign relatives ( ). one such well-characterized region is the locus of enterocyte effacement (lee) in the enteropathogenic e. coli (epec). thus, an epec infection results in effacement of the intestinal microvilli and the intimate adherence of bacterial cells to enterocytes. furthermore, epec also subverts the structural integrity of the cell and forces the polymerization of actin, which accumulates below the adhered epec cells, forming cup-like pedestals ( ). this is called an attaching and effacing (ae) lesion. subsequently, lee was found in all bacteria known to be able to elicit an ae lesion ( ). many regions similar to lee have been characterized in the backbone sequences of both gram-negative and gram-positive bacteria ( ). this led to the concept of pathogenicity islands (pais) and the formulation of a definition to describe their features ( ). typically, pais are inserted adjacent to stable rna genes and have an atypical g+c content. in addition to virulence-related functions, the pathogenicity islands often carry genes encoding transposase or integrase-like proteins and are unstable and self-mobilizable ( , ). it was also noted that pais possess a high proportion of gene fragments or disrupted genes when compared with the backbone regions ( ). it is generally accepted that the pathogenic e. coli genotypes have evolved from a much smaller nonpathogenic relative by the acquisition of foreign dna. this laterally acquired dna has been credited with conferring on the different genotypes the ability to colonize alternative niches in the host and to cause a range of different disease outcomes ( ). although sharing some of the features of pais and considered to be parts of the pais, some genomic loci are unlikely to impinge on pathogenicity.
to take account of this, the concept of pais has been extended to include islands or strain-specific loops, which represent discrete genetic loci that are lineage-specific but are as yet not known to be involved in virulence ( , ). currently, there are more than , salmonella serovars in two species, s. enterica and s. bongori. all salmonellae are closely related, sharing a median dna identity for the reciprocal best match of between % and % ( , ). despite their homogeneity, there are still significant differences in the pathogenesis and host range of the different salmonella serovars. thus, whereas s. enterica subspecies enterica serovar typhi (s. typhi) is pathogenic only to humans, causing severe typhoid fever, s. typhimurium causes gastroenteritis in humans but also a systemic infection in mice and has a broad host range ( ). like e. coli, the salmonellae are also known to possess pais, known as salmonella pathogenicity islands (spis), which are thought to have been acquired laterally. for example, the gene products encoded by spi- ( , ) and spi- ( , ) have been shown to play important roles in the different stages of the infection process. both of these islands possess type iii secretion systems and their associated secreted protein effectors. spi- is known to confer on all salmonellae the ability to invade epithelial cells. spi- is important in various aspects of the systemic infection, allowing salmonella to spread from the intestinal tissue into the blood and eventually to infect, and survive within, the macrophages of the liver and spleen ( ). spi- , like lee and pai- of upec, is inserted alongside the selc trna gene and carries the gene mgtc, which is required for intramacrophage survival and growth in the low-magnesium environment thought to be encountered in the phagosome ( ).
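Since an atypical G+C content is one of the PAI signatures noted above, a sliding-window scan can flag candidate island regions. The sketch below is illustrative only: window size, step, and the z-score cutoff are arbitrary choices, not values from any published PAI-detection method.

```python
def gc_content(seq):
    """Fraction of G+C bases in a DNA string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def atypical_windows(genome, window=5000, step=1000, z=2.0):
    """Return start positions of windows whose G+C content deviates
    from the mean of all windows by more than z standard deviations;
    such regions are candidate laterally acquired islands."""
    vals = [(i, gc_content(genome[i:i+window]))
            for i in range(0, len(genome) - window + 1, step)]
    mean = sum(v for _, v in vals) / len(vals)
    var = sum((v - mean) ** 2 for _, v in vals) / len(vals)
    sd = var ** 0.5
    return [i for i, v in vals if sd > 0 and abs(v - mean) > z * sd]
```

In practice one would combine such a scan with the other signatures the text lists (flanking tRNA genes, integrase remnants, disrupted genes) rather than rely on base composition alone.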
other salmonella spis encode type iii-secreted effector proteins, chaperone-usher fimbrial operons, the vi antigen biosynthetic genes, a type ivb pilus operon, and many other determinants associated with the salmonellae enteropathogenicity ( ). although the mobile nature of pais is frequently discussed in the literature, there is little direct experimental evidence to support these observations. one possible explanation for this may be that on integration, the mobility genes of the pais subsequently become degraded, thereby fixing their position ( ). certainly, there is evidence to support this hypothesis, as many proposed pais carry integrase or transposase pseudogenes or remnants. one excellent example of this is the high-pathogenicity island (hpi) first characterized in yersinia ( ). the yersinia hpis can be split into two lineages based on the integrity of the phage integrase gene (int) carried in the island: (i) y. enterocolitica biotype b and (ii) y. pestis and y. pseudotuberculosis. the y. enterocolitica hpi int gene carries a point mutation, whereas the analogous gene is intact in the y. pestis and y. pseudotuberculosis hpis. the yersinia hpi is a -to -kb island that possesses genes for the production and uptake of the siderophore yersiniabactin, as well as genes, such as int, thought to be involved in the mobility of the island. hpi-like elements are widely distributed in enterobacteria, including e. coli, klebsiella, enterobacter, and citrobacter spp., and like many prophages, these hpis are found adjacent to asn-trna genes ( ). trna genes are common sites for bacteriophage integration into the genome ( ). integration at these sites typically involves site-specific recombination between short stretches of identical dna located on the phage (attp) and at the integration site on the bacterial genome (attb).
the trna genes represent common sites for the integration of many other pais and bacteriophages, with the selc trna locus being the most heavily used integration site in the enterics ( ). integrated bacteriophages, also known as prophages, are also commonly found in bacterial genomes ( ). for example, of the s-loops (unique regions) of the e. coli o :h strain edl (ehec), nearly % were phage related. in addition to the prophage sequences detected in the genome of ehec strain sakai ( ), the genomes of e. coli k , upec, and s. flexneri have all been shown to carry multiple prophage or prophage-like elements ( , , , ). moreover, comparison of the genome sequences of ehec o :h strain edl and strain sakai revealed marked variations in the complement and integration sites of the prophages, as well as in internal regions within highly related phages ( , ). in addition to genes essential for their own replication, phages often carry genes that, for example, prevent superinfection by other bacteriophages, such as old and tin ( , ). however, other genes carried in prophages appear to be of nonphage origin and can encode determinants that enhance the virulence of the bacterial host by a process known as lysogenic conversion ( ). in addition to the presence of the lee pai and the ability to elicit the ae lesion, another defining characteristic of the enterohemorrhagic e. coli (ehec) is the production of shiga toxins (stx). the shiga toxins represent a family of potent cytotoxins that, on entry into the eukaryotic cell, act as glycosylases by cleaving the s ribosomal rna (rrna), thereby inactivating the ribosome and consequently preventing protein synthesis ( ). other enteric pathogens such as s. typhi, s. typhimurium, and y. pestis are also known to possess significant numbers of prophages ( , , ). thus, the principal virulence determinants of the salmonellae are the type iii secretion systems, carried by spi- and spi- , and their associated protein effectors ( , ).
a significant number of these type iii secreted effector proteins are encoded in the genomes of prophages and have a dramatic influence on the ability of their bacterial hosts to cause disease ( ). small insertions and deletions. even though the large pais play a major role in defining the phenotypes of different strains of the enteric bacteria, there are many other differences resulting from small insertions and deletions, which must be taken into account when considering the overall genomic picture of enterobacteriaceae ( ). thus, the comparisons between e. coli k and e. coli o :h and between s. typhi and s. typhimurium have indicated that many small differences exist aside from the large pathogenicity islands. for example, counting the separate insertion and deletion events has shown that there are events of genes or fewer compared with events of genes or more for the s. typhi and s. typhimurium comparison. furthermore, comparison between s. typhi and e. coli revealed events of genes or fewer compared with just events of genes or more. even taking into account that the larger islands contain many more genes per insertion or deletion event, it becomes clear that nearly equivalent numbers of species-specific genes are attributable to insertion or deletion events involving genes or fewer as are due to events involving genes or more. these data lend credence to the assertion that the acquisition and exchange of small islands is important in defining the overall phenotype of the organism ( ). in the majority of cases studied to date, there is no evidence to suggest the presence of genes that may allow these small islands to be self-mobile. it is far more likely that small islands of this type are exchanged between members of a species and constitute part of the species gene pool.
once acquired by one member of the species, they can be easily exchanged by generalized transduction mechanisms, followed by homologous recombination between the nearly identical flanking genes to allow integration into the chromosome ( ). this sort of mechanism of genetic exchange would also make possible nonorthologous gene replacement, involving the exchange of related genes at identical regions in the backbone. a specific example to illustrate such a possibility is the observed capsular switching of neisseria meningitidis ( ) and streptococcus pneumoniae ( , ), for which different sets of genes responsible for the biosynthesis of different capsular polysaccharides are found at identical regions in the chromosome, flanked by conserved genes. the implied mechanism for capsular switching involves replacement of the polysaccharide-specific gene sites by homologous recombination between the chromosome and exogenous dna in the flanking genes ( ). point mutations and pseudogenes. one of the most surprising observations to come from enterobacterial genome research has been the discovery of a large number of pseudogenes. the pseudogenes appeared to be untranslatable owing to the presence of stop codons, frameshifts, internal deletions, or insertion sequence (is) element insertions. the presence of pseudogenes seems to run contrary to the general assumption that the bacterial genome is a highly "streamlined" system that does not carry "junk dna" ( ). for example, salmonella typhi, the etiologic agent of typhoid fever, is host restricted and appears capable of infecting only a human host, whereas s. typhimurium, which causes a milder disease in humans, has a much broader host range. upon analysis, the genome of s. typhi contained more than pseudogenes ( ), whereas it was predicted that the number of pseudogenes in the genome of s. typhimurium would be around ( ). from this observation, it becomes clear that the pseudogenes in s.
typhi were not randomly spread throughout its genome; in fact, they were overrepresented in genes that were unique to s. typhi when compared with e. coli, and many of the pseudogenes in s. typhi have intact counterparts in s. typhimurium that have been shown to be involved in aspects of virulence and host interaction. given this distribution of pseudogenes, it has been suggested that the host specificity of s. typhi may be the result of the loss of its ability to interact with a broader range of hosts, caused by functional inactivation of the necessary genes ( ). in contrast with other microorganisms containing multiple pseudogenes, such as mycobacterium leprae ( ), most of the pseudogenes in s. typhi were caused by a single mutation, suggesting that they have been inactivated relatively recently. taken together, these observations suggest an evolutionary scenario in which the recent ancestor of s. typhi changed its niche in the human host, evolving from an ancestor (similar to s. typhimurium) limited to localized infection and invasion around the gut epithelium into one capable of invading the deeper tissues of the human host ( ). a similar evolutionary scenario has been suggested for another recently evolved enteric pathogen, yersinia pestis. this bacterium has also recently changed from a gut bacterium (y. pseudotuberculosis), transmitted via the fecal-oral route, to an organism capable of using a flea vector for systemic infection ( , ). again, this change in niche was accompanied by pseudogene formation, and genes involved in virulence and host interaction are overrepresented in the set of genes inactivated ( ). yet another example of such an evolutionary scenario is shigella flexneri a, a member of the species e. coli (which is predicted to have more than pseudogenes) and again restricted to the human body ( ).
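Among the pseudogene-creating lesions described above, premature stop codons are the easiest to screen for computationally once a reading frame is known. A minimal sketch, assuming the coding sequence is already in frame (frameshifts and IS insertions, also mentioned in the text, need alignment-based methods instead):

```python
def premature_stops(cds):
    """Return codon indices of in-frame stop codons occurring before the
    final codon of a coding sequence. A non-empty result suggests
    pseudogenization, under the assumption that the sequence is in
    frame and free of sequencing errors."""
    stops = {"TAA", "TAG", "TGA"}
    codons = [cds[i:i+3] for i in range(0, len(cds) - 2, 3)]
    return [idx for idx, c in enumerate(codons[:-1]) if c in stops]
```

Comparing such calls against an intact ortholog (as done for S. typhi versus S. typhimurium) is what distinguishes a recent inactivating mutation from an annotation artifact.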
all of these organisms demonstrate that enterobacterial evolution has been a process involving both gene loss and gene gain, and that the remnants of the genes lost in the evolutionary process can be readily detected ( ). the focus in the postgenomic era is on functional genomics, in which proteomics plays an essential role. the living cell is a dynamic and complex system whose behavior cannot be predicted from the genome sequence. whereas a genome discloses important information on the biology of the organism, it is static and will not reveal information on the expression of a particular gene, on posttranslational modifications, or on how a protein is regulated in a specific biological situation ( ). thus, whereas the complete genome sequence provides the basis for experimental identification of expressed proteins at the cellular level, very little has been accomplished to identify all expressed and potentially modified proteins. direct investigation of the total content of proteins in a cell is the task of proteomics. the proteome is defined as the complete set of posttranslationally modified and processed proteins in a well-defined biological environment under specific circumstances, such as growth conditions and time of investigation ( , ). proteomics follows two separate steps: separation of the proteins in a sample, followed by identification of the proteins. the common methodology used for separating proteins is two-dimensional polyacrylamide gel electrophoresis (2d page). the principal method for large-scale identification is mass spectrometry (ms), but other identification methods, such as n-terminal sequencing, immunoblotting, overexpression, spot colocalization, and gene knockouts, can also be used. because of its high resolving power, 2d page is currently the best methodology to achieve global visualization of the proteins of a microorganism.
in the first dimension, isoelectric focusing is carried out to separate the proteins in a ph gradient according to their isoelectric point (pi). in the second dimension, the proteins are separated according to their molecular weight by sds-page (sodium dodecyl sulfate-page). the resulting gel image presents itself as a pattern of spots in which pi and the relative molecular weight (mr) can be read off as in a coordinate system ( ). a critical step in the 2d page procedure is sample preparation, as there is no single method that can be universally applied; different reagents are superior for different samples. to this end, chaotropes such as urea, which act by changing the parameters of the solvent, are used in most 2d page procedures. major problems to overcome in 2d page sample preparation arise from the limited entry into the gel of high-molecular-weight proteins and from the presence of highly hydrophobic and/or basic proteins ( , ). for protein separation, the protein mixture is loaded onto an acrylamide gel strip in which a ph gradient is established. when a high voltage is applied over the strip, the proteins focus at the ph at which they carry zero net charge. the ph gradient is established during focusing using either carrier ampholytes in a slab gel ( ) or a precast polyacrylamide gel with an immobilized ph gradient (ipg) ( ). the latter method is advantageous because of improved reproducibility. samples can be applied to ipg dry strips, preferably by rehydration. rehydration of dried ipgs under application of a low voltage ( to v) has significantly improved the recovery, especially of high-molecular-weight proteins. mass spectrometry is the method of choice for identifying proteins in proteomics. the proteins are converted into gas-phase ions that can be measured with an accuracy better than ppm ( ).
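the spot coordinates on a 2d gel (pi on one axis, molecular weight on the other) can be estimated directly from a protein's sequence. the sketch below uses textbook pka values and average residue masses, both approximations rather than any calibrated standard, and bisects for the ph of zero net charge.

```python
# Hypothetical sketch of where a protein would land on a 2D gel: the x axis is
# the isoelectric point (pI), the y axis the molecular weight. The pKa values
# and average residue masses below are textbook approximations, not any
# particular calibrated tool.

PKA_POS = {"K": 10.5, "R": 12.5, "H": 6.0}           # positively charged groups
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}  # negatively charged groups
PKA_NTERM, PKA_CTERM = 9.0, 3.1

# Average residue masses (Da); one water (18.02) is added per chain.
MASS = {"A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14,
        "E": 129.12, "Q": 128.13, "G": 57.05, "H": 137.14, "I": 113.16,
        "L": 113.16, "K": 128.17, "M": 131.19, "F": 147.18, "P": 97.12,
        "S": 87.08, "T": 101.10, "W": 186.21, "Y": 163.18, "V": 99.13}

def net_charge(seq: str, ph: float) -> float:
    """Henderson-Hasselbalch net charge of the protein at a given pH."""
    pos = sum(1 / (1 + 10 ** (ph - PKA_POS[a])) for a in seq if a in PKA_POS)
    pos += 1 / (1 + 10 ** (ph - PKA_NTERM))
    neg = sum(1 / (1 + 10 ** (PKA_NEG[a] - ph)) for a in seq if a in PKA_NEG)
    neg += 1 / (1 + 10 ** (PKA_CTERM - ph))
    return pos - neg

def gel_spot(seq: str) -> tuple[float, float]:
    """Return (pI, molecular weight in Da) — the 2D-PAGE coordinates."""
    lo, hi = 0.0, 14.0
    for _ in range(60):              # bisect for the pH of zero net charge
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    mw = sum(MASS[a] for a in seq) + 18.02
    return round((lo + hi) / 2, 2), round(mw, 1)

print(gel_spot("MKKLLED"))  # toy peptide, made up for illustration
```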
two widely used techniques for ionization are matrix-assisted laser desorption ionization (maldi) ( ) and electrospray ionization ( ). maldi is usually coupled with a tof (time of flight) device for measuring the masses. the ionized peptides are accelerated by an electric field and travel through the flight tube until they reach a detector, allowing their mass/charge ratio to be calculated from the flight time ( ). in electrospray ionization, the peptides are sprayed into the spectrometer ( ). ionization is achieved when the charged droplets evaporate. an alternative procedure for measuring masses is the ion trap ( ), which selects ions with certain mass/charge ratios by keeping them in sinusoidal motion between two electrodes. in , the first microbe sequencing project, haemophilus influenzae (a bacterium causing upper respiratory infection), was completed with a speed that stunned scientists (http://www.niaid.nih.gov/research/topics/pathogen/introduction.htm). encouraged by the success of that initial effort, researchers have continued to sequence an astonishing array of other medically important microorganisms. to this end, niaid has made significant investments in large-scale sequencing projects, including projects to sequence the complete genomes of many pathogens, such as the bacteria that cause tuberculosis, gonorrhea, chlamydia, and cholera, as well as organisms that are considered agents of bioterrorism. in addition, niaid is collaborating with other funding agencies to sequence larger genomes of protozoan pathogens such as the organism causing malaria. the availability of microbial and human dna sequences opens up new opportunities and allows scientists to perform functional analyses of genes and proteins in whole genomes and cells, as well as of the host's immune response and an individual's genetic susceptibility to pathogens. when scientists identify microbial genes that play a role in disease, drugs can be designed to block the activities controlled by those genes.
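the tof measurement described above rests on a simple relation: an ion of mass m and charge z accelerated through a potential u reaches speed v = sqrt(2zeU/m), so its flight time over a drift tube of length l encodes m/z. the instrument parameters below (20 kv, 1.5 m) are assumed values for illustration, not taken from the text.

```python
# Hypothetical linear MALDI-TOF geometry: singly charged ions accelerated
# through a potential U, then drifting over length L. From the flight time t,
# m/z = 2*e*U*(t/L)**2. Voltage and tube length are assumed example values.
import math

E_CHARGE = 1.602176634e-19    # elementary charge, C
DALTON   = 1.66053906660e-27  # atomic mass unit, kg

def flight_time(mz_da: float, voltage: float = 20_000.0, length: float = 1.5) -> float:
    """Time of flight (s) for a singly charged ion of given m/z (Da)."""
    m = mz_da * DALTON
    v = math.sqrt(2 * E_CHARGE * voltage / m)   # speed after acceleration
    return length / v

def mz_from_time(t: float, voltage: float = 20_000.0, length: float = 1.5) -> float:
    """Invert the relation: recover m/z (Da) from a measured flight time."""
    return 2 * E_CHARGE * voltage * (t / length) ** 2 / DALTON

t = flight_time(1296.7)  # a tryptic-peptide-sized ion (mass chosen for illustration)
print(f"{t * 1e6:.1f} us ->", round(mz_from_time(t), 1))  # round trip: 1296.7
```

heavier ions arrive later, which is why a raw tof spectrum is a series of arrival times that the instrument converts back to m/z with exactly this kind of calibration.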
because most genes contain the instructions for making proteins, drugs can be designed to inhibit specific proteins or to use those proteins as candidates for vaccine testing. genetic variations can also be used to study the spread of a virulent or drug-resistant form of a pathogen. niaid has launched initiatives to provide comprehensive genomic, proteomic, and bioinformatic resources. these resources, listed below, are available to scientists conducting basic and applied research on a broad array of pathogenic microorganisms (http://www.niaid.nih.gov/research/topics/pathogen/initiatives.htm): r niaid's microbial sequencing centers (mscs). the niaid's microbial sequencing centers are state-of-the-art high-throughput dna sequencing centers that can sequence genomes of microbes and invertebrate vectors of infectious diseases. genomes that can be sequenced include microorganisms considered agents of bioterrorism and those responsible for emerging and re-emerging infectious diseases. r niaid's pathogen functional genomics resource center (pfgrc). the pfgrc is a centralized facility that provides scientists with the resources and reagents necessary to conduct functional genomics research on human pathogens and invertebrate vectors at no cost. the pfgrc provides scientists with genomic resources and reagents such as microarrays, protein expression clones, genotyping, and bioinformatics services. the pfgrc supports the training of scientists in the latest techniques in functional genomics and emerging genomic technologies. r niaid's proteomics centers. the primary goal of these centers is to characterize the pathogen and/or host cell proteome by identifying proteins associated with the biology of the microorganisms, mechanisms of microbial pathogenesis, innate and adaptive immune responses to infectious agents, and/or non-immune-mediated host responses that contribute to microbial pathogenesis.
it is anticipated that the research programs will discover targets for the next generation of vaccines, therapeutics, and diagnostics. this will be accomplished by using existing proteomics technologies, augmenting existing technologies, and creating novel proteomics approaches, as well as performing early-stage validation of these targets. r administrative resource for biodefense proteomic centers (arbpcs). the arbpcs consolidate data generated by each proteomics research center and make it available to the scientific community through a publicly accessible web site. this database (www.proteomicsresource.org) serves as a central information source for reagents and validated protein targets and has recently been populated with the first data released. r niaid's bioinformatics resource centers. the niaid's bioinformatics resource centers will design, develop, maintain, and continuously update multiorganism databases, especially those related to biodefense. organisms of particular interest are the niaid category a to c priority pathogens and those causing emerging and re-emerging diseases. the ultimate goal is to establish databases that will allow scientists to access a large amount of genomic and related data. this will facilitate the identification of potential targets for the development of vaccines, therapeutics, and diagnostics. each contract will include establishing and maintaining an analysis resource that will serve as a companion to the databases to provide, develop, and enhance standard and advanced analytical tools to help researchers access and analyze data. r tb structural genomics consortium. a collaboration of scientists in six countries formed to determine and analyze the structures of about proteins from mycobacterium tuberculosis. the group seeks to optimize the technical and management aspects of high-throughput structure determination and will develop a database of structures and functions.
niaid, which is co-funding this project with nigms, anticipates that this information will also lead to the design of new and improved drugs and vaccines for tuberculosis. r structural genomics of pathogenic protozoa consortium. this consortium is aiming to develop new ways to solve protein structures from organisms known as protozoans, many species of which cause deadly diseases such as sleeping sickness, malaria, and chagas' disease. the national institute of allergy and infectious diseases is providing support to the microbial genome sequencing centers (mscs) at the j. craig venter institute [formerly, the institute for genomic research (tigr)] and the broad institute of the massachusetts institute of technology (mit) and harvard university for rapid and cost-efficient production of high-quality microbial genome sequences and primary annotations. niaid's mscs (http://www.niaid.nih.gov/dmid/genomes/mscs/) are responding to the scientific community's and national and federal agencies' priorities for genome sequencing, filling in sequence gaps, and therefore providing genome sequencing data for multiple uses, including understanding the biology of microorganisms, forensic strain identification, and identifying targets for drugs, vaccines, and diagnostics. in addition, the niaid's mscs have developed web sites that provide descriptive information about the sequencing projects and their progress (http://www.broad.mit.edu/seq/msc/ and http://msc.tigr.org/status.shtml). genomes to be sequenced include microorganisms considered to be potential agents of bioterrorism (niaid category a, b, and c), related organisms, clinical isolates, closely related species, and invertebrate vectors of infectious diseases and microorganisms responsible for emerging and re-emerging infectious diseases.
in addition, in response to a recommendation from a niaid-sponsored blue ribbon panel on bioterrorism and its implications for biomedical research to support genomic sequencing of microorganisms considered agents of bioterrorism and related organisms, the mscs will address the institute's need for additional sequencing of such microorganisms and invertebrate vectors of disease and/or those that are responsible for emerging and re-emerging diseases (http://www.niaid.nih.gov/dmid/genomes/mscs/overview.htm). the panel's recommendation included careful selection of species, strains, and clinical isolates to generate genomic data for different uses such as identification of strains and targets for diagnostics, vaccines, antimicrobials, and other drug developments. the mscs have the capacity to rapidly and cost-effectively sequence genomic dna and provide preliminary identification of open reading frames and annotation of gene function for a wide variety of microorganisms, including viruses, bacteria, protozoa, parasites, and fungi. sequencing projects will be considered for both complete, finished genome sequencing and other levels of sequence coverage. the choice and justification of complete versus draft sequence is likely to depend on the nature and scope of the proposed project. large-scale prepublication information on genome sequences is a unique research resource for the scientific community, and rapid and unrestricted sharing of microbial genome sequence data is essential for advancing research on infectious agents responsible for human disease. therefore, it is anticipated that prepublication data on genome sequences produced at the niaid microbial sequencing centers will be made freely and publicly available via an appropriate publicly searchable database as rapidly as possible.
niaid-supported investigators have completed genome sequencing projects for bacteria, fungi, parasitic protozoa, invertebrate vectors of infectious diseases, and one plant (http://www.niaid.nih.gov/dmid/genomes/mscs/req process.htm). in addition, niaid completed the sequence for , influenza genomes. in , genome sequencing projects were completed for pathogens as described in section . . . genome sequencing data are publicly available through web sites such as genbank, and data for the influenza genome sequences were published in . furthermore, through the niaid's microbial sequencing centers, the niaid has funded the sequence, assembly, and annotation of three invertebrate vectors of infectious diseases. in , the final sequence, assembly, and annotation of aedes aegypti were released, as well as the preliminary sequence and assembly of the genomes of ixodes scapularis and culex pipiens; the final results for i. scapularis and c. pipiens will be released in . in , niaid supported nearly large-scale genome sequencing projects for additional strains of viruses, bacteria, fungi, parasites, and invertebrate vectors. new projects included additional strains of borrelia, clostridium, escherichia coli, salmonella, streptococcus pneumoniae, ureaplasma, coccidioides, penicillium marneffei, talaromyces stipitatus, lacazia loboi, histoplasma capsulatum, blastomyces dermatitidis, cryptosporidium muris, and dengue viruses, as well as additional sequencing and annotation of aedes aegypti. in , niaid launched the influenza genome sequencing project (igsp) (http://www.niaid.nih.gov/dmid/genomes/mscs/influenza.htm), which has provided the scientific community with complete genome sequence data for thousands of human and animal influenza viruses. the influenza sequence data have been rapidly placed in the public domain, through genbank, an international searchable database, and the niaid-funded bioinformatics resource center, with accompanying data analysis tools.
all of this information will enable scientists to further study how influenza viruses evolve, spread, and cause disease and may ultimately lead to improved methods of treatment and prevention. this sequence information is now providing a larger and more representative sample of influenza than was previously publicly available. the influenza genome sequencing project has the capacity to sequence more than genomes per month and is a collaborative effort among niaid (including the niaid's division of intramural research) and the national center for biotechnology information. niaid is continuing its support for the pathogen functional genomics resource center (pfgrc) (http://www.niaid.nih.gov/dmid/genomes/pfgrc/default.htm) at the institute for genomic research (tigr) (currently part of the j. craig venter institute). the pfgrc was established in to provide and distribute to the broader research community a wide range of genomic resources, reagents, data, and technologies for the functional analysis of microbial pathogens and invertebrate vectors of infectious diseases. in addition, the pfgrc was expanded to provide the research community with the resources and reagents needed to conduct both basic and applied research on microorganisms responsible for emerging and re-emerging infectious diseases and those considered agents of bioterrorism. one of the priorities for the pfgrc has been to provide the scientific community with access to the reagents and genomic and proteomic data that the pfgrc generated. a new software tool, called the snp filtering tool, was developed for affymetrix resequencing arrays to analyze single nucleotide polymorphism (snp) data. enhancements have been made to other tools for microarray data analysis, including a tool for analyzing slide images.
a new layout for the tigr-pfgrc web site (http://pfgrc.tigr.org/) has been developed and launched and should make it more user-friendly for the scientific community to access the pfgrc research and development projects, poster presentations, publications, reagents, and their descriptions and data. the number of organism-specific microarrays produced and distributed to the scientific community has increased. pfgrc has continued to collaborate with the national institute of dental and craniofacial research (nidcr/nih) in producing and distributing five organism-specific microarrays, including arrays for actinobacillus actinomycetemcomitans, fusobacterium nucleatum, porphyromonas gingivalis, streptococcus mutans, and treponema denticola. pfgrc has also developed the methods and pipeline for generating organism-specific clones for protein expression. seven complete clone sets are now available for human severe acute respiratory syndrome coronavirus (sars-cov), bacillus anthracis, yersinia pestis, francisella tularensis, streptococcus pneumoniae, staphylococcus aureus, and mycobacterium tuberculosis. in addition, individual custom clone sets are available for more than organisms upon request. comparative genomics analysis using the available bacillus anthracis sequence data and the discovery of snps were used to develop a new bacterial typing system for screening anthrax strains. this system allowed niaid-funded scientists to define detailed phylogenetic lineages of bacillus anthracis and to identify three major lineages (a, b, c), with the ancestral root located between the a+b and c branches. in addition, a genotyping genechip, which has been developed and validated for bacillus anthracis, will be used to genotype about different strains of bacillus anthracis.
pfgrc has developed additional comparative genomic platforms, both to facilitate resequencing of a bacterial genome on a chip to identify sequence variation among strains and to discover novel genes. a pilot project has been completed with streptococcus pneumoniae for sequencing different strains using resequencing chip technology. in collaboration with the department of homeland security (dhs), a resequencing chip has been developed and is now being used to screen a number of francisella tularensis strains to identify snps and genetic polymorphisms. sixteen francisella tularensis strains are being genotyped using the newly developed resequencing chip. additional collaboration with dhs led to the development of a gene discovery platform aimed at discovering novel genes among different strains of yersinia pestis. to this end, nine strains are being analyzed using this platform to discover novel gene sets. pfgrc is developing proteomics technologies for protein arrays and comparative profiling of microbial proteins. a protein expression platform is under development, and a pilot comparative protein profiling project using staphylococcus aureus has already been completed and published. a protein profiling project using yersinia pestis to compare the proteomes of different strains is now under way, complementing ongoing proteomics projects supported by niaid; numerous proteins have been identified that are differentially abundant under different growth conditions. a new project was added in for comparative profiling of the proteomes of e. coli and shigella dysenteriae to provide the scientific community with reference data on differential protein expression in animal models versus cultured systems infected with the pathogen. in , niaid continued to support the population genetics analysis program: immunity to vaccines/infections.
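snp identification between strains, whether from resequencing chips or assembled genomes, ultimately reduces to position-by-position comparison of aligned sequences. the sketch below uses invented toy sequences; real resequencing-array pipelines add base-quality filtering on top of this core comparison.

```python
# A minimal sketch of SNP discovery between aligned strain sequences: compare
# each strain against a reference position by position and report
# substitutions. Toy sequences invented for illustration.

def call_snps(reference: str, strain: str) -> list[tuple[int, str, str]]:
    """Return (position, ref_base, strain_base) for every mismatch."""
    assert len(reference) == len(strain), "sequences must be pre-aligned"
    return [(i, r, s)
            for i, (r, s) in enumerate(zip(reference, strain))
            if r != s and "-" not in (r, s)]   # skip alignment gaps

ref     = "ATGGCGTTAACC"
strain1 = "ATGACGTTAACC"   # one substitution at position 3
strain2 = "ATGGCGTCAACC"   # one substitution at position 7
for name, seq in [("strain1", strain1), ("strain2", strain2)]:
    print(name, call_snps(ref, seq))  # strain1 [(3, 'G', 'A')], strain2 [(7, 'T', 'C')]
```

applied across many strains, the union of such calls is exactly the kind of polymorphism table a typing system or genotyping chip is built from.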
a joint project between niaid's division of allergy, immunity, and transplantation (dait) and the division of microbiology and infectious diseases (dmid), this program aims to identify associations between specific genetic variations or polymorphisms in immune response genes and the susceptibility to infection or response to vaccination, with a focus on one or more niaid category a to c pathogens and influenza. niaid awarded six centers to study the genetic basis for the variable human response to immunization (smallpox, typhoid fever, cholera, and anthrax) and susceptibility to disease (tuberculosis, influenza, encapsulated bacterial diseases, and west nile virus infection). the centers are comparing genetic variants in specific immune response genes as well as more generally associated genetic variants across the whole genome in affected and nonaffected individuals. the physiologic differences associated with these genome variations will also be studied. in , these centers focused on recruiting the samples needed for genotyping. for example, more than , smallpox-vaccinated individuals and controls were recruited, and blood and peripheral blood mononuclear cell (pbmc) samples were obtained for whole genome association studies, which were conducted in . in another example, one of the centers used genome-wide linkage approaches to map, isolate, and validate human host genes that confer susceptibility to influenza infection. nearly , individuals with susceptibility to influenza and , control individuals were recruited using an iceland genealogy database. by late , the center had recruited more than individuals and had genotyped more than in this subproject of the study. during , niaid continued its support of the eight bioinformatics resource centers (brcs) (http://www.
niaid.nih.gov/dmid/genomes/brc/default.htm) with the goal of providing the scientific community with a publicly accessible resource that allows easy access to genomic and related data for the niaid category a to c priority pathogens, invertebrate vectors of infectious diseases, and pathogens causing emerging and re-emerging infectious diseases. the brcs are supported by multidisciplinary teams of scientists to develop new and improved computational tools and interfaces that can facilitate the analysis and interpretation of genomic-related data by the scientific community. in , each publicly accessible brc web site continued to be developed, the user interfaces were improved, and a variety of genomics data types were integrated, including gene expression and proteomics information, host/pathogen interactions, and signaling/metabolic pathways data. a public portal of information, data, and open-source software tools generated by all the brcs is available at http://www.brccentral.org/. in , many genomes of microbial species were sequenced by the niaid's microbial sequencing centers as well as by other national and international sequencing efforts, and the brcs provided either long-term maintenance of the genome sequence data and annotation or the initial annotation for a number of particular microbial genomes. for example, niaid's brc vectorbase collaborated with niaid's mscs to annotate the genome of aedes aegypti for the scientific community and will continue the curation of this genome. in , niaid continued to support contracts for seven biodefense proteomics research centers (bprcs) to characterize the proteome of niaid category a to c bioweapon agents and to develop and enhance innovative proteomic technologies and apply them to the understanding of the pathogen and/or host cell proteome (http://www.niaid.nih.gov/dmid/genomes/prc/default.htm).
these centers conducted a range of proteomics studies, including six category a pathogens, six category b pathogens, and one category c emerging disease organism. data, reagents, and protocols developed in the research centers are released to the niaid-funded administrative resource for biodefense proteomics research centers (www.proteomicsresource.org) web site within months of validation. the administrative resource web site was created to integrate the diverse data generated by the bprcs. in , more than potential targets for vaccines, therapeutics, and diagnostics were generated. examples of progress include: in , more than , potential new pathogen targets for vaccines, therapeutics, and diagnostics were identified, and more than , new corresponding host targets were generated. in addition: (i) two more sars-cov structures were solved. (ii) ninety-six percent of the orfs for b. anthracis were cloned with % sequence validated. (iii) a custom b. anthracis affymetrix genechip was developed. (iv) fifty-three polyclonal sera generated against novel toxoplasma gondii and cryptosporidium parvum proteins were characterized, and accurate time and mass tag databases were populated for salmonella typhi, monkeypox, and vaccinia virus. r niaid staff are participating in two related nih-wide genomic initiatives that focus on examining and identifying genetic variations across the human genome (genes) that may be linked to or influence susceptibility or risk to a common human disease, such as asthma, autoimmunity, cancer, eye diseases, mental illness, and infectious diseases, or response to a treatment such as a vaccine.
the approach is to conduct genome-wide association studies in which a dense set of snps across the human genome is genotyped in a large, defined group of control and disease samples to identify genetic variations that may contribute to or have a role in the disease, with the hope of identifying an association between a genetic variant in a gene or group of genes and the disease. r niaid has continued to participate in a coordinated federal effort in biodefense genomics and is a major participant in the national inter-agency genomics sciences coordinating committee (nigscc), which includes many federal agencies. this committee was formed in to address the most serious gaps in the comprehensive genomic analysis of microorganisms considered agents of bioterrorism. a comprehensive list of microorganisms considered agents of bioterrorism was developed that identifies species, strains, and clinical and environmental isolates that have been sequenced, that are currently being sequenced, and that should be sequenced. in , the committee focused on category a agents and provided the cdc with new technological approaches for sequencing additional smallpox viral strains. affymetrix-based microarray technology for genome sequencing was established, as well as additional bioinformatics expertise for analyzing the genomic sequencing data. in , as a result of this continuing coordination of federal agencies in genome sequencing efforts for biodefense, niaid developed a formal interagency agreement with the department of homeland security (dhs) to perform comparative genomics analysis to characterize biothreat agents at the genetic level and to examine polymorphisms for identifying genetic variations and relatedness within and between species.
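the core statistical step of such a genome-wide association study can be illustrated with a single snp: compare risk-allele counts in disease versus control samples with a 2x2 chi-square test, repeated for every genotyped snp with a multiple-testing correction. the allele counts below are invented purely for illustration.

```python
# Hypothetical sketch of the per-SNP test in an association study: a 2x2
# chi-square on allele counts in cases vs controls. Counts are made up.

def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Risk-allele vs other-allele counts (assumed numbers, for illustration only).
cases    = (620, 380)   # 620 risk alleles, 380 other alleles among cases
controls = (480, 520)
stat = chi_square_2x2(*cases, *controls)
print(round(stat, 1))   # 39.6 — well above the df=1 critical value of 3.84
```

a statistic this large at a single locus would survive even a stringent genome-wide significance threshold, which is the situation these studies hope to find for genes influencing susceptibility or vaccine response.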
r niaid continues to participate in the microbe project interagency working group (iwg), which has developed a coordinated, interagency, -year action plan on microbial genomics, including functional genomics and bioinformatics, in (http://www.ostp.gov/html/microbial/start.htm). in , the microbe project interagency working group developed guidelines for sharing prepublication genomic sequencing data that serve as guiding principles, so that federal agencies have consistent policies for sharing sequencing data with the scientific community and can then implement their own detailed version of the data release plan. in , the microbe project iwg supported a workshop on "an experimental approach to genome annotation," which was coordinated by the american society for microbiology and discussed issues faced in annotating microbial genome sequences that have been completed or will be completed in the next few years. in , the microbe project iwg developed a strategic plan and implementation steps as an updated action plan for coordinating microbial genomics among federal agencies, and the plan was finalized in . r niaid continues to participate with other federal agencies in coordinating medical diagnostics for biodefense and influenza across the federal government and in facilitating the development of a set of contracts to support advanced development toward the approval of new or improved point-of-care diagnostic tests for the influenza virus and early manufacturing and commercialization. r niaid continues to participate in the nih roadmap initiatives, including lead science officers for one of the national centers for biomedical computation and one of the national technology centers for networks and pathways.
seven biomedical computing centers are developing a universal computing infrastructure and creating innovative software programs and other tools that would enable the biomedical community to integrate, analyze, model, simulate, and share data on human health and disease. five technology centers were created in and to cooperate in a u.s. national effort to develop new technologies for proteomics and the study of dynamic biological systems. r supramolecular architecture of severe acute respiratory syndrome coronavirus (sars-cov). coronaviruses derive their name from their protruding oligomers of the spike glycoprotein (s), which forms a coronal ridge around the virion. the understanding of the virion and its organization has previously been limited to x-ray crystallography of homogenous symmetric virions, whereas coronaviruses are neither homogenous nor symmetric. in this study, a novel methodology of single-particle image analysis was applied to selected coronavirus features to obtain a detailed model of the oligomeric state and spatial relationships among viral structural proteins. the two-dimensional structures of s, m, and n structural proteins of sars-cov and two other coronaviruses were determined and refined to a resolution of approximately nm. these results demonstrated a higher level of supramolecular organization than was previously known for coronaviruses and provided the first detailed view of the coronavirus ultrastructure. understanding the architecture of the virion is a necessary first step to defining the assembly pathway of sars-cov and may aid in developing new or improved therapeutics ( ). r large-scale sequence analysis of avian influenza isolates. avian influenza is a significant global human health threat because of its potential to infect humans and result in a global influenza pandemic. however, very little sequence information for avian influenza virus (aiv) has been in the public domain. 
a more comprehensive collection of publicly available sequence data for aiv is necessary for research on influenza to understand how flu evolves, spreads, and causes disease, to shed light on the emergence of influenza epidemics and pandemics, and to uncover new targets for drugs, vaccines, and diagnostics. in this study, the investigators released genomic data from the first large-scale sequencing of aiv isolates, doubling the amount of aiv sequence data in the public domain. these sequence data include , aiv genes and complete genomes from a diverse sample of birds. the preliminary analysis of these sequences, along with other aiv data from the public domain, revealed new information about aiv, including the identification of a genome sequence that may be a determinant of virulence. this study provides valuable sequencing data to the scientific community and demonstrates how informative large-scale sequence analysis can be in identifying potential markers of disease ( ). r influenza genome sequencing project. the analysis of the first full genome sequences from human influenza strains, deposited in genbank through the niaid influenza genome sequencing project, was published in ( ). influenza isolates were chosen in a relatively unbiased manner, allowing a comprehensive look at the influenza virus population circulating within the same geographic region over several seasons, which provided a real picture of the dynamics of influenza virus mutation and evolution. analysis demonstrated that the circulating strains of influenza included alternative minor lineages that could provide genetic variation for the dominant strain. this may allow a novel strain to emerge within a human host and would explain the unexpected emergence of the fujian influenza strain in - that resulted in a vaccine mismatch.
these findings demonstrate the usefulness of full genomic sequences for providing new information on influenza viruses and lend further support for the need for large-scale influenza sequencing and the availability of sequence data in the public domain. within the influenza community, public availability of influenza sequence data and sharing of strains has been an important issue. the niaid has been instrumental in promoting the sharing of influenza sequence information, notably by sequencing more than , complete influenza genome sequences and depositing the sequences in the public domain through genbank as soon as sequencing has been completed.

history of microbial genomics
tools for gene finding and whole genome comparison
interpolated markov models for eukaryotic gene finding
computational gene finding in plants
the genomes of pathogenic enterobacteria
the complete genome sequence of escherichia coli k-
genome sequence of enterohemorrhagic escherichia coli o :h
complete genome sequence of enterohemorrhagic escherichia coli o :h and genomic comparison with a laboratory strain k-
extensive mosaic structure revealed by the complete genome sequence of uropathogenic escherichia coli
genome sequence of shigella flexneri a: insights into pathogenicity through comparison with genomes of escherichia coli k and o
large, unstable inserts in the chromosome affect virulence properties of uropathogenic escherichia coli o strain
escherichia coli that cause diarrhea: enterotoxigenic, enteropathogenic, enteroinvasive, enterohemorrhagic, and enteroadherent
pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution
excision of large dna regions termed pathogenicity islands from trna-specific loci in the chromosome of an escherichia coli wild-type pathogen
complete genome sequence of multiple drug resistant salmonella enterica serovar typhi ct
complete genome sequence of salmonella enterica serovar typhimurium lt
cloning and nucleotide sequence of the salmonella typhimurium lt gnd gene and its homology with the corresponding sequence of escherichia coli k
a kb chromosomal fragment encoding salmonella typhimurium invasion genes is absent from the corresponding region of the escherichia coli k- chromosome
molecular genetic bases of salmonella entry into host cells
identification of a virulence locus encoding a second type iii secretion system in salmonella typhimurium
identification of a pathogenicity island required for salmonella survival in host cells
pathogenicity islands and host adaptation of salmonella serovars
the salmonella selc locus contains a pathogenicity island mediating intramacrophage survival
the -kb unstable region of yersinia pestis comprises a high-pathogenicity island linked to a pigmentation segment which undergoes internal rearrangement
transfer rna genes frequently serve as integration sites for prokaryotic genetic elements
complete nucleotide sequence of the prophage vt -sakai carrying the verotoxin genes of the enterohemorrhagic escherichia coli o :h derived from the sakai outbreak
a novel mechanism of virus-virus interactions: bacteriophage p tin protein inhibits phage t dna synthesis by poisoning the t single-stranded dna binding protein, gp
the old exonuclease of bacteriophage p
filamentous phages linked to virulence of vibrio cholerae
shiga toxin: purification, structure, and function
genome sequence of yersinia pestis, the causative agent of plague
salmonella pathogenicity islands encoding type iii secretion systems
the salmonella pathogenicity island- type iii secretion system
capsule switching of neisseria meningitidis
capsules and cassettes: genetic organization of the capsule locus of streptococcus pneumoniae
genetic and molecular characterization of capsular polysaccharide biosynthesis in streptococcus pneumoniae type
massive gene decay in the leprosy bacillus
yersinia pestis - etiologic agent of plague
yersinia pestis, the cause of plague, is a recently emerged clone of yersinia pseudotuberculosis
microbial proteomics
from proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis
membrane proteins and proteomics: un amour impossible?
two-dimensional electrophoresis of membrane proteins: a current challenge for immobilized ph gradients
new developments in isoelectric focusing
isoelectric focusing in immobilized ph gradients: principle, methodology and some applications
laser desorption ionization of proteins with molecular masses exceeding , daltons
electrospray ionization for mass spectrometry of large biomolecules
ion trap mass spectrometry
supramolecular architecture of severe acute respiratory syndrome coronavirus revealed by electron cryomicroscopy
large-scale sequence analysis of avian influenza isolates
large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution

key: cord- -fs dj dp authors: liu, yu-tsueng title: infectious disease genomics date: - - journal: genetics and evolution of infectious disease doi: . /b - - - - . - sha: doc_id: cord_uid: fs dj dp

the history and development of infectious disease genomics are discussed in this chapter. hgp must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. the completed or ongoing genome projects will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. the polysaccharide capsule is important for meningococci to escape from complement-mediated killing. with the completion of the genome sequence of a virulent menb strain, a "reverse vaccinology" approach was applied for the development of a universal menb vaccine by novartis.
the indispensable fatty acid synthase (fas) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents. through a systematic screening of , natural product extracts, a merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from streptomyces platensis. the vector biology network was formed to achieve three goals: (1) to develop basic tools for the stable transformation of anopheline mosquitoes by the year ; (2) to engineer a mosquito incapable of carrying the malaria parasite by ; and (3) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by . the most immediate impact of a completely sequenced pathogen genome is for infectious disease diagnosis. the history and development of infectious disease genomics are tightly associated with the human genome project (hgp) (watson, ). a series of important discussions about the hgp were made in and (dulbecco, ; watson, ), which led to the appointment of a special national research council (nrc) committee by the national academy of sciences to address the needs and concerns, such as its impact, leadership, and funding sources. the committee recommended that the united states begin the hgp in (nrc, ). they emphasized the need for technological improvements in the efficiency of gene mapping, sequencing, and data analysis capabilities. in order to understand potential functions of human genes through comparative sequence analyses, they also advised that the hgp must not be restricted to the human genome and should include model organisms including mouse, bacteria, yeast, fruit fly, and worm. in the meantime, the office of technology assessment (ota) of the us congress also issued a similar report to support the hgp (ota, ). in , the department of energy (doe) and the national institutes of health (nih) jointly presented an initial -year plan for the hgp (dhhs and doe, ).
in october , the sanger center/institute (hinxton, uk) was officially opened to join the hgp. the cost of dna sequencing was about $ - per base in , and the initial aim was to reduce the costs to less than $ . per base before large-scale sequencing (dhhs and doe, ). the sequencing cost gradually declined during the subsequent years. in , the national human genome research institute (nhgri) challenged scientists to achieve a $ , human genome ( gb/haploid genome) by and a $ genome by to meet the need of genomic medicine. the first complete genome to be sequenced was the phix bacteriophage ( . kb) by sanger's group in (sanger et al., ). the complete genome sequence of sv polyomavirus ( . kb) was published in (fiers et al., ; reddy et al., ). the human epstein-barr virus ( kb) genome was determined in (baer et al., ). the first completed free-living organism genome was haemophilus influenzae ( . mb), sequenced through a whole-genome shotgun approach in (fleischmann et al., ). the second sequenced bacterial genome, mycoplasma genitalium ( kb), was completed in less than a month in the same year using the same approach (smith, ). the doe was the first to start a microbial genome program (mgp) as a companion to its hgp in (doe, ). the initial focus was on nonpathogenic microbes. along with the development of the hgp, there was exponential growth of the number of completely sequenced free-living organism genomes. the fungal genome initiative (fgi) (fgi, ) was established in to accelerate the slow pace of fungal genome sequencing since the report of the genome of saccharomyces cerevisiae in (goffeau et al., ). one of the major interests was to sequence organisms that are important in human health and commercial activities. as of september , completed genome projects, a . -fold increase from years ago, were documented (liolios et al., ). these include bacterial, archaeal, and eukaryotic genomes.
in addition, more than other ongoing sequencing projects were reported. the genomes of the human malaria parasite plasmodium falciparum and its major mosquito vector anopheles gambiae were published in (gardner et al., ; holt et al., ). the effort to sequence the malaria genome began in by taking advantage of a clone derived from a laboratory-adapted strain (hoffman et al., ). many parasites have complex life cycles that involve both vertebrate and invertebrate hosts and are difficult to maintain in the laboratory. currently, a few other important human pathogenic parasites, such as trypanosomes (el-sayed et al., ), leishmania (ivens et al., ), and schistosomes (berriman et al., ; consortium, ), have been either completely or partially sequenced (brindley et al., ; aurrecoechea et al., ). in the meantime, the genome sequence of aedes aegypti, the primary vector for yellow fever and dengue fever, was published in . the genome size ( mb) of this mosquito vector is about times larger than the previously sequenced genome of the malaria vector anopheles gambiae. approximately % of the genome consists of transposable elements. in , the genome sequence of the body louse (pediculus humanus humanus), an obligatory parasite of humans and the main vector of epidemic typhus (rickettsia prowazekii), relapsing fever (borrelia recurrentis), and trench fever (bartonella quintana), was reported (kirkness et al., ). its mb genome is the smallest among the known insect genomes. genome sequencing projects for other important human disease vectors are in progress (megy et al., ). these include culex pipiens (mosquito vector of west nile virus), ixodes scapularis (tick vector of lyme disease, babesia, and anaplasma), and glossina morsitans (tsetse fly vector of african trypanosomiasis). the challenge to sequence the genome of an insect vector is much greater than for a microbe.
for example, the genomes of ticks were estimated to be between and gb and may have a significant proportion of repetitive dna sequences, which may be a problem for genome assembly (pagel van zee et al., ) . furthermore, the evolutionary distances among insect species may also affect homology-based gene predictions. it is as important to understand the sequence diversity within a species as to perform a de novo sequencing of a reference genome from the perspective of human health. this is true for both hosts and pathogens (feero et al., ; alcais et al., ) . the goal of the genomes project is to find most genetic variants that have frequencies of at least % in the human populations studied (kaiser, ) . one of the similar efforts for human pathogens is the nih influenza genome sequencing project. when this project began in november , only seven human influenza h n isolates had been completely sequenced and deposited in the genbank database (fauci, ; ghedin et al., ) . as of may , more than human and avian isolates have been completely sequenced, including the "spanish" influenza virus (taubenberger et al., ) . databases for human immunodeficiency virus (hiv) and hepatitis c virus have also been established. while most human studies of microbes have focused on the disease-causing organisms, interest in resident microorganisms has also been growing. in fact, it has been estimated that the human body is colonized by at least times more prokaryotic and eukaryotic microorganisms than the number of human cells (savage, ) . it was suggested to have "the second human genome project" to sequence human microbiome (relman and falkow, ) . highly variable intestinal microbial flora among normal individuals has been well documented (eckburg et al., ; costello et al., ; turnbaugh et al., ) . 
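The assembly difficulty that repetitive dna poses, noted above for the large tick genomes, can be illustrated with a toy k-mer tally: reads drawn from k-mers that occur more than once cannot be placed unambiguously. This is a minimal sketch, not a real assembly metric; the function name `repeat_fraction` and the choice of k are illustrative.

```python
from collections import Counter

def repeat_fraction(genome: str, k: int = 8) -> float:
    """Fraction of k-mer positions whose k-mer occurs more than once.

    A crude proxy for repetitive content: the higher this fraction,
    the more ambiguous short-read placement becomes during assembly.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    repeated = sum(1 for km in kmers if counts[km] > 1)
    return repeated / len(kmers)

# a genome dominated by a tandem repeat scores far higher than a
# random-looking one of the same length
repetitive = "ACGTACGTACGTACGT" + "GATTACA"
print(round(repeat_fraction(repetitive, k=4), 2))
```

In practice repeat content is estimated with dedicated tools over real k-mer spectra; the point here is only that repeat-rich genomes, such as those of ticks, leave a large fraction of positions unresolvable by short reads alone.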
therefore, the human microbiome project (hmp) was initiated by the nih to study samples from multiple body sites from each of at least "normal" volunteers to determine whether there are associations between changes in the microbiome and several different medical conditions, and to provide both standardized data resources and new technological approaches (peterson et al., ). the completed or ongoing genome projects (table . ) will provide enormous opportunities for the discovery of novel vaccines and drug targets against human pathogens as well as the improvement of diagnosis and discovery of infectious agents and the development of new strategies for invertebrate vector control. specific examples will be provided to illustrate how the information provided by various genome projects may help achieve the goal of promoting human health. meningococcal isolates produce a number of antigenically distinct capsular polysaccharides, but only five (a, b, c, w , and y) are commonly associated with disease (lo et al., ). the polysaccharide capsule is important for meningococci to escape from complement-mediated killing. while conventional vaccines consisting of the conjugation of capsular polysaccharides to carrier proteins for meningococcus serogroups a, c, y, and w- have been clinically successful, the same approach failed to produce a clinically useful vaccine for serogroup b (menb). the capsule polysaccharide (α - n-acetylneuraminic acid) of menb is identical to human polysialic acid and therefore is poorly immunogenic (finne et al., ). alternatively, vaccines consisting of outer membrane vesicles (omv) have been successfully developed to control menb outbreaks in areas where epidemics are dominated by one particular strain (bjune et al., ; sierra et al., ; boslego et al., ; jackson et al., ).
the most significant limitation of this type of vaccine is that the immune response is strain-specific, mostly directed against the porin protein, pora, which varies substantially in both expression level and sequence across strains (martin et al., ; pizza et al., ). with the completion of the genome sequence of a virulent menb strain, a "reverse vaccinology" approach was applied for the development of a universal menb vaccine by novartis (pizza et al., ; tettelin et al., ; giuliani et al., ). through bioinformatic searching for surface-exposed antigens, which may be the most suitable vaccine candidates due to their potential to be readily recognized by the immune system, open reading frames (orfs) were selected from a total of orfs of the mc genome. eventually, five antigens were chosen as the vaccine components based on a series of criteria including the ability of candidates to be expressed in escherichia coli as recombinant proteins ( candidates), the confirmation of surface exposure by immunological analyses, the ability to induce protective antibodies in experimental animals ( candidates), and the conservation of antigens within a panel of diverse meningococcal strains, primarily the disease-associated menb strains (pizza et al., ; giuliani et al., ; rinaudo et al., ). the vaccine formulation consists of an fhbp-gna fusion protein, a gna -gna fusion protein, nada, and omv from the new zealand menzb vaccine strain, which contains the immunogenic pora. initial phase ii clinical results in adults and infants showed that this vaccine could induce a protective immune response against three diverse menb strains in - % of subjects following three vaccinations and - % after four vaccinations (rinaudo et al., ). in , a phase iii trial for this vaccine ( cmenb) met its primary endpoint. targeting an essential pathway is a necessary but not sufficient requirement for an effective antimicrobial agent (brinster et al., ).
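The staged candidate filtering behind reverse vaccinology described above can be sketched as successive filters applied to annotated ORFs. The record fields and pass/fail flags below are hypothetical stand-ins for the experimental criteria (predicted surface exposure, expression in e. coli, protective antibodies, cross-strain conservation), not data from the actual study.

```python
# hypothetical candidate records; field names are illustrative only
candidates = [
    {"orf": "orf1", "surface_exposed": True,  "expressed_in_ecoli": True,
     "protective_in_mice": True,  "conserved_across_strains": True},
    {"orf": "orf2", "surface_exposed": True,  "expressed_in_ecoli": True,
     "protective_in_mice": False, "conserved_across_strains": True},
    {"orf": "orf3", "surface_exposed": False, "expressed_in_ecoli": True,
     "protective_in_mice": True,  "conserved_across_strains": True},
]

FILTERS = ["surface_exposed", "expressed_in_ecoli",
           "protective_in_mice", "conserved_across_strains"]

def reverse_vaccinology(records, filters=FILTERS):
    """Apply the staged filters in order; keep ORFs passing every stage."""
    surviving = records
    for criterion in filters:
        surviving = [r for r in surviving if r[criterion]]
    return [r["orf"] for r in surviving]

print(reverse_vaccinology(candidates))  # only orf1 survives all stages
```

The funnel structure is the point: each stage discards candidates, so an initial genome-wide set of ORFs is narrowed to a handful of antigens worth formulating.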
identification of essential genes in a completely sequenced genome has been actively pursued with various approaches (hutchison et al., ; ji et al., ) . the indispensable fatty acid synthase (fas) pathway in bacteria has been regarded as a promising target for the development of antimicrobial agents (wright and reynolds, ) . the subcellular organization of the fatty acid biosynthesis components is different between mammals (type i fas) and bacteria (dissociated type ii fas), which raises the likelihood of host specificity of the targeting drugs. comparison of the available genome sequences of various species of prokaryotes reveals highly conserved fas ii systems suggesting that the antimicrobial agent can be broad spectrum (zhang et al., ) . in addition, through computational analyses, new members of the fas ii system have been discovered in different bacterial species (heath and rock, ; marrakchi et al., ) . one of the protein components in this system, fabi, is the target of an anti-tuberculosis drug isoniazid and a general antibacterial and antifungal agent, triclosan (banerjee et al., ; levy et al., ; zhang et al., ) . through a systematic screening of , natural product extracts, a merck team identified a potent and broad-spectrum antibiotic, platensimycin, which is derived from streptomyces platensis and a selective fabf/b inhibitor in fas ii system (wang et al., ) . treatment with platensimycin eradicated staphylococcus aureus infection in mice. platensimycin did not have cross-resistance to other antibiotic-resistant strains in vitro, including methicillin-resistant s. aureus, vancomycin-intermediate s. aureus, and vancomycin-resistant enterococci. no toxicity was observed using a cultured human cell line. the activity of platensimycin was not affected by the presence of human serum in this study. 
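The comparative-genomics argument sketched above, that a fas ii inhibitor can be broad spectrum because the pathway is conserved across bacteria yet organized differently in the host, reduces to set operations over gene inventories. The gene names and sets below are toy values; real analyses compare orthologous groups, not bare identifiers.

```python
def broad_spectrum_targets(pathogen_gene_sets, host_genes):
    """Genes conserved in every pathogen but absent from the host proteome."""
    conserved = set.intersection(*pathogen_gene_sets)
    return sorted(conserved - host_genes)

# toy gene inventories (illustrative, not curated annotations)
s_aureus   = {"fabF", "fabI", "gyrA", "recA"}
e_faecalis = {"fabF", "fabI", "gyrA"}
m_tb       = {"fabI", "gyrA", "fabF"}
human      = {"FASN", "gyrA_like", "recA"}  # type I FAS instead of FabF/FabI

print(broad_spectrum_targets([s_aureus, e_faecalis, m_tb], human))
```

Under these assumed inventories the dissociated fas ii genes survive the intersection while host-shared genes drop out, mirroring the reasoning that made fabf/b and fabi attractive targets.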
however, the fas ii system appears to be dispensable for another gram-positive bacterium, streptococcus agalactiae, when exogenous fatty acids are available, such as in human serum (brinster et al., ; balemans et al., ). the susceptibility to inhibitors targeting the fas ii system indicates heterogeneity in fatty acid synthesis or in acquiring exogenous fatty acids among gram-positive pathogens (balemans et al., ). comparative genomic approaches may be useful to identify and develop a strategy to target the salvage pathway for streptococcus agalactiae. alternatively, similar approaches as described earlier for the menb vaccine may also be applied for streptococcus agalactiae (group b streptococcus) (maione et al., ). an early mathematical model for malaria control suggested that the most vulnerable element in the malaria cycle was survivorship of adult female mosquitoes (macdonald, ; enayati and hemingway, ). therefore, insect control is an important part of reducing transmission. the use of ddt as an indoor residual spray in the global malaria eradication program from to reduced the population at risk of malaria to about % by , compared with % in (hay et al., ; enayati and hemingway, ). engineering genetically modified mosquitoes refractory to malaria infection appeared to be an alternative approach (curtis, ) given the environmental impact of ddt and the emergence of insecticide-resistant insects. the vector biology network (vbn) was formed in and proposed a -year plan with the world health organization (who) in to achieve three major goals: (1) to develop basic tools for the stable transformation of anopheline mosquitoes by the year ; (2) to engineer a mosquito incapable of carrying the malaria parasite by ; and (3) to run controlled experiments to test how to drive the engineered genotype into wild mosquito populations by (alphey et al., ; morel et al., ; beaty et al., ).
while some proof-of-concept experiments were achieved for the first two aims in when the anopheles gambiae genome was completely sequenced (catteruccia et al., ; ito et al., ) , the progress has been relatively slow (marshall and taylor, ) . genomic loci of the anopheles gambiae responsible for plasmodium falciparum resistance have been identified through surveying a mosquito population in a west african malaria transmission zone (riehle et al., ) . a candidate gene, anopheles plasmodium-responsive leucine-rich repeat (apl ), was discovered. subsequently, other resistant genes have also been identified (blandin et al., ; povelones et al., ) . studying the genetic basis of resistance to malaria parasites and immunity of the mosquito vector will be important to control malaria transmission. perhaps the most immediate impact of a completely sequenced pathogen genome is for infectious disease diagnosis. the information may be of great importance to the public health when a newly emerged or re-emerged pathogen is discovered. the swine-origin influenza a virus (s-oiv) (dawood et al., ) and sars (severe acute respiratory syndrome) coronavirus rota et al., ) are the two most recent examples. s-oiv emerged in the spring of in mexico and was also discovered in specimens from two unrelated children in the san diego area in april (cdc, ; dawood et al., ) . those samples were positive for influenza a but negative for both human h and h subtypes. the complete genome sequence and a real-time pcr-based diagnostic assay were released to the public in late april. the outbreak evolved rapidly and the who declared the highest phase worldwide pandemic alert on june , . s-oiv has three genome segments (ha, np, ns) from the classic north american swine (h n ) lineage, two segments (pb , pa) from the north american avian lineage, one segment (pb ) from the seasonal h n , and most notably, two segments (na, m) from the eurasian swine (h n ) lineage (dawood et al., ) . 
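Assigning each genome segment of a reassortant virus such as s-oiv to a parental lineage is done phylogenetically in practice; a toy nearest-reference classifier using k-mer jaccard similarity conveys the underlying idea. The reference sequences below are fabricated placeholders, not real lineage sequences.

```python
def kmer_set(seq, k=5):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def assign_lineage(segment_seq, references, k=5):
    """Assign a genome segment to the reference lineage whose sequence
    shares the highest k-mer jaccard similarity with it; a toy stand-in
    for phylogenetic placement against an influenza genome database."""
    query = kmer_set(segment_seq, k)

    def jaccard(name):
        ref = kmer_set(references[name], k)
        return len(query & ref) / len(query | ref)

    return max(references, key=jaccard)

# fabricated stand-ins for two parental lineages
references = {"eurasian_swine": "ACGT" * 10,
              "north_american_avian": "TTGGCCAA" * 5}
print(assign_lineage("ACGTACGTACGTACGTACGA", references))  # eurasian_swine
```

Run segment by segment, this kind of nearest-reference assignment reproduces the mosaic picture described above, with different segments tracing back to different parental lineages.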
with the available influenza genome database, diagnostic assays to distinguish previous seasonal h n , h n , and s-oiv can be easily accomplished (lu et al., ) . a comprehensive pathogen genome database is not only useful for infectious disease diagnosis but also for novel pathogen discovery (liu, ) . homologous sequences within the same family or among different family members are important for new pathogen identification even with the advent of third-generation sequencing technology (munroe and harris, ) . de novo pathogen discovery may be also complicated by coexisting microorganisms, such as commensal bacteria in the human body. without prior knowledge of these microorganisms, one may be misled. in , a microarray-based assay, designated virochip, was used to help discover the sars coronavirus (wang et al., ) . the virochip contained the most highly conserved mer sequences from every fully sequenced reference viral genome in genbank. the computational search for conservation was performed across all known viral families. a microarray hybridized with a reaction derived from a viral isolate cultivated from a sars patient revealed that the strongest hybridizing array elements belong to families astroviridae and coronaviridae. alignment of the oligonucleotide probes having the highest signals showed that all four hybridizing oligonucleotides from the astroviridae and one oligonucleotide from avian infectious bronchitis virus, an avian coronavirus, shared a core consensus motif spanning nucleotides. interestingly, it had been known previously through bioinformatic analyses that this sequence is present in the utr of all astroviruses, avian infectious bronchitis virus, and an equine rhinovirus (jonassen et al., ) . therefore, a new member of the coronavirus was identified through the unique hybridizing pattern and subsequent confirmations. 
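The virochip's design principle described above, choosing probes from the most conserved sequences across all fully sequenced viral genomes, can be sketched by ranking k-mers by how many genomes contain them. Probe length and sequences here are toy values; the actual array used much longer conserved oligonucleotides selected across viral families.

```python
from collections import Counter

def most_conserved_kmers(genomes, k=8, top=3):
    """Rank k-mers by the number of genomes that contain them: the
    selection principle behind conserved-probe microarrays."""
    presence = Counter()
    for g in genomes:
        # a set per genome so each k-mer is counted once per genome
        presence.update({g[i:i + k] for i in range(len(g) - k + 1)})
    return [km for km, _ in presence.most_common(top)]

# three toy genomes sharing one conserved motif
genomes = ["GGTCTATCAAAA", "CCCCGGTCTATC", "TTGGTCTATCGG"]
print(most_conserved_kmers(genomes, k=8, top=1))  # ['GGTCTATC']
```

Probes chosen this way hybridize even to an unsequenced relative of the genomes they were drawn from, which is exactly what allowed a new coronavirus to light up probes designed for known family members.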
the finding of the seventh human oncogenic virus, merkel cell polyomavirus (mcv) (feng et al., ) in is another example of why conserved sequences are important for novel pathogen discovery. mcv is the etiological agent of merkel cell carcinoma (mcc), which is a rare but aggressive skin cancer of neuroendocrine origin. two cdna libraries derived from mcc tumors were subjected to high-throughput sequencing by a next-generation roche/ sequencer. nearly , sequence reads were generated. the majority ( . %) of the sequences derived from human origin were removed from further analyses. only one of the remaining cdna was homologous to the t antigen of two known polyomaviruses. one additional cdna was subsequently identified to be part of the mcv sequence when the complete viral sequence was known. later analyses showed that % ( / ) of the mcc had integrated mcv in the human genome. monoclonal viral integration was revealed by the patterns of southern blot analysis. only - % of control tissues had low copy number of mcv infection. while we can expect that the efforts of a variety of genome projects may improve human health, the socioeconomic issues that are not discussed in this chapter may be substantial. in addition, the tremendous amount of information derived from these projects will also be a challenge for scientists as well as nonscientists to follow and understand.
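The digital subtraction step used in the mcv discovery, discarding reads of human origin and inspecting what remains, can be sketched as a simple filter. The exact-substring match below is a placeholder for real read alignment against the human reference, and all sequences are invented.

```python
def subtract_host(reads, host_reference):
    """Digital subtraction: discard reads found in the host reference,
    leaving candidate non-human (possibly viral) sequences."""
    return [r for r in reads if r not in host_reference]

host = "ATGGCGTACCTTGACCAT"                  # stand-in for the human transcriptome
reads = ["GCGTACC", "TTGACCA", "CAGTTAGGC"]  # last read has no host match
print(subtract_host(reads, host))            # ['CAGTTAGGC']
```

The surviving reads are then searched for homology to known pathogen sequences, which is how the single t-antigen-like cdna stood out among tens of thousands of tumor-derived reads.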
human genetics of infectious diseases: between proof of principle and paradigm
malaria control with genetically manipulated insect vectors
eupathdb: a portal to eukaryotic pathogen databases
dna sequence and expression of the b - epstein-barr virus genome
essentiality of fasii pathway for staphylococcus aureus
inha, a gene encoding a target for isoniazid and ethionamide in mycobacterium tuberculosis
the influenza virus resource at the national center for biotechnology information
from tucson to genomics and transgenics: the vector biology network and the emergence of modern vector biology
the genome of the african trypanosome trypanosoma brucei
the genome of the blood fluke schistosoma mansoni
effect of outer membrane vesicle vaccine against group b meningococcal disease in norway
dissecting the genetic basis of resistance to malaria parasites in anopheles gambiae
efficacy, safety, and immunogenicity of a meningococcal group b ( :p . ) outer membrane protein vaccine in iquique, chile. chilean national committee for meningococcal disease
helminth genomics: the implications for human health
type ii fatty acid synthesis is not a suitable antibiotic target for gram-positive pathogens
stable germline transformation of the malaria mosquito anopheles stephensi
swine influenza a (h n ) infection in two children-southern california, march-april
the schistosoma japonicum genome reveals features of host-parasite interplay
bacterial community variation in human body habitats across space and time
possible use of translocations to fix desirable genes in insect pest populations
the comprehensive microbial resource
understanding our genetic inheritance, the u.s. human genome project: the first five years: fiscal years
microbial genome program
a turning point in cancer research: sequencing the human genome
diversity of the human intestinal microbial flora
the microbial rosetta stone database: a compilation of global and emerging infectious microorganisms and bioterrorist threat agents
the genome sequence of trypanosoma cruzi, etiologic agent of chagas disease
malaria management: past, present, and future
the genome gets personal-almost
clonal integration of a polyomavirus in human merkel cell carcinoma
fungal genome initiative
complete nucleotide sequence of sv dna
an igg monoclonal antibody to group b meningococci cross-reacts with developmentally regulated polysialic acid units of glycoproteins in neural and extraneural tissues
whole-genome random sequencing and assembly of haemophilus influenzae rd
genome sequence of the human malaria parasite plasmodium falciparum
large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution
a universal vaccine for serogroup b meningococcus
life with genes
the global distribution and population at risk of malaria: past, present, and future
funding for malaria genome sequencing
the genome sequence of the malaria mosquito anopheles gambiae
global transposon mutagenesis and a minimal mycoplasma genome
transgenic anopheline mosquitoes impaired in transmission of a malaria parasite
the genome of the kinetoplastid parasite, leishmania major
phase ii meningococcal b vesicle vaccine trial in new zealand infants
identification of critical staphylococcal genes using conditional phenotypes generated by antisense rna
a common rna motif in the end of the genomes of astroviruses, avian infectious bronchitis virus and an equine rhinovirus
dna sequencing: a plan to capture human diversity in genomes
ensembl genomes: extending ensembl across the taxonomic space
genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle
a novel coronavirus associated with severe acute respiratory syndrome
vectorbase: a data resource for invertebrate vector genomics
molecular basis of triclosan activity
the genomes online database (gold) in : status of genomic and metagenomic projects and their associated metadata
a technological update of molecular diagnostics for infectious diseases
mechanisms of avoidance of host immunity by neisseria meningitidis and its effect on vaccine development
detection in of the swine origin influenza a (h n ) virus by a subtyping microarray
the epidemiology and control of malaria
identification of a universal group b streptococcus vaccine by multiple genome screen
a new mechanism for anaerobic unsaturated fatty acid formation in streptococcus pneumoniae
malaria control with transgenic mosquitoes
effect of sequence variation in meningococcal pora outer membrane protein on the effectiveness of a hexavalent pora outer membrane vesicle vaccine
genomic resources for invertebrate vectors of human pathogens, and the role of vectorbase
the mosquito genome-a breakthrough for public health
third-generation sequencing fireworks at marco island
a catalog of reference genomes from the human microbiome
genome sequence of aedes aegypti, a major arbovirus vector
mapping and sequencing the human genome
mapping our genes-genome projects: how big? how fast?
tick genomics: the ixodes genome project and beyond
the nih human microbiome project
identification of vaccine candidates against serogroup b meningococcus by whole-genome sequencing
leucine-rich repeat protein complex activates mosquito complement in defense against plasmodium parasites
the genome of simian virus
the meaning and impact of the human genome sequence for microbiology
natural malaria infection in anopheles gambiae is regulated by a single genomic control region
vaccinology in the genome era
characterization of a novel coronavirus associated with severe acute respiratory syndrome
nucleotide sequence of bacteriophage phi x dna
microbial ecology of the gastrointestinal tract
database resources of the national center for biotechnology information
gemina, genomic metadata for infectious agents, a geospatial surveillance pathogen database
vaccine against group b neisseria meningitidis: protection trial and mass vaccination results in cuba
history of microbial genomics
characterization of the influenza virus polymerase genes
complete genome sequence of neisseria meningitidis serogroup b strain mc
a core gut microbiome in obese and lean twins
viral discovery and sequence recovery using dna microarrays
platensimycin is a selective fabf inhibitor with potent antibiotic properties
the human genome project: past, present, and future
antibacterial targets in fatty acid biosynthesis
the application of computational methods to explore the diversity and structure of bacterial fatty acid synthase
inhibiting bacterial fatty acid synthesis

key: cord- - c x f authors: simmonds, peter title: virology of hepatitis c virus date: - - journal: clin ther doi: . /s - ( ) - sha: doc_id: cord_uid: c x f

hepatitis c virus (hcv) has been identified as the main causative agent of post-transfusion non-a, non-b hepatitis. through recently developed diagnostic assays, routine serologic screening of blood donors has prevented most cases of post-transfusion hepatitis.
the purpose of this paper is to comprehensively review current information regarding the virology of hcv. recent findings on the genome organization, its relationship to other viruses, the replication of hcv ribonucleic acid, hcv translation, and hcv polyprotein expression and processing are discussed. also reviewed are virus assembly and release, the variability of hcv and its classification into genotypes, the geographic distribution of hcv genotypes, and the biologic differences between hcv genotypes. the assays used in hcv genotyping are discussed in terms of reliability and consistency of results, and the molecular epidemiology of hcv infection is reviewed. these approaches to hcv epidemiology will prove valuable in documenting the spread of hcv in different risk groups, evaluating alternative (nonparenteral) routes of transmission, and in understanding more about the origins and evolution of hcv. hepatitis c virus (hcv) has been identified as the main causative agent of posttransfusion non-a, non-b hepatitis. the identification of hcv led to the development of diagnostic assays for infection, based either on detection of antibody to recombinant polypeptides expressed from cloned hcv sequences or direct detection of virus ribonucleic acid (rna) sequences by polymerase chain reaction (pcr) using primers complementary to the hcv genome. routine serologic screening of blood donors now prevents most or all cases of posttransfusion hepatitis. assays for antibody also are important diagnostic tools and have been used to investigate the prevalence of hcv in different risk groups, such as intravenous drug users, patients with hemophilia, and other recipients of blood products, and to conduct epidemiologic studies of hcv transmission. the complete genomic sequence of hcv has been determined for several isolates, revealing both its overall genome organization and its relationship to other rna viruses.
deducing possible methods of replication by analogy with related viruses is possible, although such studies currently are hampered by the absence of a satisfactory in vitro culture method for hcv. as a consequence, most conventional virologic studies are difficult and artificial. hcv contains a positive-sense rna genome approximately bases in length. in overall genome organization and presumed method of replication, it is most similar to members of the family flaviviridae, particularly in coding for a single polyprotein that is then cleaved into a series of presumed structural and nonstructural proteins (figure ). the roles for these different proteins have been inferred by comparison with related viruses and by in vitro expression of cloned hcv sequences in prokaryotic and eukaryotic systems. these artificial systems allowed the investigation of protein expression, cleavage, and posttranslational modifications. there are numerous positive-stranded rna virus families whose coding capacity is contained within a single open reading frame (orf) as is found in hcv, and with which it may be usefully compared (table i). among human viruses, these include both picornaviridae (eg, poliovirus, coxsackievirus a and b, and hepatitis a virus) and flaviviridae (eg, dengue fever and yellow fever virus). the genomes of those viruses have a similar organization with structural proteins at the 5' end and nonstructural proteins at the 3' end. however, virus families differ in genome size, the number of proteins produced, the mechanism by which the polyprotein is cleaved, and the detailed mechanism of genome replication. for example, the genome of the picornaviridae is shorter than that of hcv (approximately to bases), contains four nucleocapsid proteins (compared with the single protein of hcv), is nonenveloped (and therefore contains no homologues of the two hcv-encoded glycoproteins e1 and e2), and uses exclusively virus-encoded proteases to cleave its polyprotein.
this is different from both hcv and the flaviviridae, in which cleavage of the structural proteins is thought to be carried out by the host cell-derived signalase. members of the flaviviridae have many features in common with hcv. they have a similar genome size (yellow fever virus has , bases compared with for hcv) and package a viral-encoded glycoprotein into the virus envelope (e1). the homologue of e2 in flaviviruses (a membrane-bound glycoprotein called ns ; "ns" stands for nonstructural) is expressed only on the infected cell surface. like hcv, the polyprotein is cleaved by a combination of viral and host cell proteases. although there is no close sequence similarity between hcv and other known viruses, at least two regions with conserved amino acid residues point to a distant common ancestry. another fundamental aspect of genome organization that differs between the flavivirus and picornavirus families is the structure of the 5' and 3' untranslated regions (utrs). these parts of the genome are involved in hcv replication and initiation of translation by cellular ribosomes of the virus-encoded polyprotein. pestiviruses and hcv show evidence for a highly structured 5'utr and 3'utr, in which internal base-pairing produces a complex set of stem-loop structures that are thought to interact with various host cell and virus proteins during replication. in particular, studies have shown that for the picornaviridae and, more recently, for hcv and pestiviruses, such structures are involved in internal initiation of translation, in which binding to the host cell ribosome directs translation to an internal methionine (aug) codon. this contrasts strongly with translation of flavivirus genomes, which act much like cellular messenger rna in which ribosomal binding initially occurs to the capped 5' end of the rna, followed by scanning of the sequence in the 5' to 3' direction with translation commencing from the first aug codon.
structurally, hcv is also more similar to the pestiviruses than the flaviviruses, with an exceptionally low buoyant density in sucrose ( . to . g/cm ), similar to that reported for pestiviruses and attributable in both cases to heavily glycosylated external membrane glycoproteins in the virus envelope. by contrast, flavivirus envelope glycoproteins contain few sites for n-linked glycosylation, and the virion itself is relatively dense ( . g/cm ). the arrangement and number of cleavage sites of the hcv polyprotein are more similar to pestiviruses, particularly in the further cleavage of both ns and ns proteins into two subunits, in both cases with ns b corresponding to the rna polymerase. recently, two distinct rna viruses have been discovered in new world tamarins (genus saguinus). this monkey species had previously been shown to harbor an infectious agent causing chronic hepatitis originally derived from inoculation with plasma from a surgeon (gb) in whom chronic hepatitis of unknown etiology had developed. parts of the genome of the two viruses (provisionally called gbv-a and gbv-b) show measurable sequence similarity to certain regions of hcv. for example, a -amino-acid sequence of part of ns of gbv-a and gbv-b shows % and % sequence similarity with the homologous region in hcv (positions to in the hcv polyprotein) and . % sequence similarity to each other. similarly, in ns , the region around the active site of the rna polymerase (including the gdd motif and positions to in hcv) shows % and % sequence similarities and % between gbv-a and gbv-b. in these nonstructural regions, these similarity values are greater than those between hcv and pestiviruses or flaviviruses, although little homology can be found on comparison of the regions of the genome encoding structural proteins (ie, the core and envelope), nor with the normally highly conserved 5'utr.
the degree of relatedness between hcv and other positive-stranded rna viruses can be more formally analyzed by phylogenetic analysis of highly conserved parts of the genome, such as the ns region (and homologues in other viruses) encoding the rna-dependent rna polymerase, which invariably contains the canonical gdd motif necessary for the enzymatic activity of the protein. comparisons of a -amino-acid sequence surrounding this motif indicate a close relationship between hcv and gbv-a and gbv-b, an intermediate degree of relatedness with the pestivirus bovine viral diarrhea virus, and a much more distant relatedness to flaviviruses (figure ). remarkably, a series of plant viruses that are structurally distinct from each of the mammalian virus groups, and with different genome organizations, have rna-dependent rna polymerase amino acid sequences that are perhaps more similar to those of hcv than are the flaviviruses. hcv replication has been studied using a variety of experimental techniques. however, little progress has been made toward the development of a practical hcv culture. hcv does not produce obvious cytopathology, and the amount of hcv released from cells infected in vitro often is low. this might be because the cells used for culture are not representative of those infected in vivo, or because productive infection requires a combination of cytokines and growth factors that might be present in the liver but which cannot be recreated in cell culture. the observation that low levels of hcv replication might be detected in lymphocyte and hepatocyte cell lines indicates that either the tropism of hcv for different cell types may be greater than first imagined or that the virus replication detected so far does not represent the full replicative cycle of hcv that occurs in vivo.
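the pairwise similarity comparisons described above reduce to counting identical residues in an aligned window around a conserved motif such as gdd. a minimal sketch of that calculation, using invented toy sequences rather than real viral data:

```python
# Percent amino acid identity between two aligned sequence windows.
# The sequences below are invented placeholders, not real viral sequences.

def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

# Toy aligned windows surrounding a shared GDD motif (hypothetical data).
seq1 = "MLVCGDDLVVI"
seq2 = "MLICGDDLVAI"
print(round(percent_identity(seq1, seq2), 1))  # 9 of 11 positions match
```

in practice such windows would first be aligned with a program such as clustal, as in the phylogenetic comparisons cited in the text.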
transfection of full-length dna sequences of the hcv genome might be expected to initiate the full replicative cycle of hcv, as it does when similar experiments are done in picornavirus sequences. however, only a low level of expression of virus proteins was observed when a complete hcv sequence was transfected into a transformed hepatocyte (hepatoma) cell line (huh ). despite this, there was evidence of replication of the hcv genome and the production of low concentrations of progeny virus particles. such models provide an important experimental system for future investigations of hcv replication. in common with other positive-strand rna viruses, hcv is presumed to replicate its rna genome through the production of a replication intermediate (ie, an rna copy of the complete genome) and is synthesized by the activity of a virally encoded rna-dependent rna polymerase. the minus-strand copy would then be used to generate positive-stranded copies. because templates can be reused, several minus-strand copies can be synthesized from the infecting positive strand, and each of these transcripts can be used several times to produce positive-strand progeny sequences. in this way, a single input sequence may be amplified several thousandfold. although initiation of transcription is well understood for some positive-strand rna viruses (such as the picornaviridae), no information currently is available on how rna synthesis of hcv or other flaviviruses is primed. [figure legend: see koonin for sources of non-hcv sequences. gbv-a, gbv-b, and hcv (genotypes 1a, 1b, a, b, and a shown) were aligned using the program clustal, and phylogenetic analysis was done using the programs protdist (pam matrix), neighbor, and drawtree in the phylip package.] hcv lacks homopolymeric tracts (such as poly(u) in the picornaviruses) at the 5' end of the genome, whereas the 3' end is variable, containing either poly(u) or poly(a) tracts, or possibly neither, as now appears to be the case with the related pestiviruses.
furthermore, there appears to be no homologue of the vpg protein of picornaviridae. for these reasons, it is likely that the mechanism of transcription initiation for hcv is different. using a strand-specific pcr method, antisense hcv rna sequences have been detected in the liver of hcv-infected patients, confirming the presumed method of replication of hcv via a replication intermediate. such assays provide a valuable technique for detecting hcv replication, as both a sensitive method of monitoring hcv replication in virus culture experiments and a way of investigating the range of cell types and distribution of hcv infection in hcv-infected patients. in particular, the possibility of replication at extrahepatic sites has been proposed on the basis of such assays; these studies have been reviewed by lau et al. the 5'utr is thought to play a significant role in initiating and regulating translation of the large orf of hcv. this region is approximately to bases long, and a combination of computer analysis, nuclease mapping experiments, and studies of covariance has led to a proposed secondary structure model for this part of the genome (figure ). using the same methods, researchers have predicted a remarkably similar structure for pestiviruses, despite the virtual absence of nucleotide sequence similarities with hcv, indicating the importance of the overall structure of this region in interactions with viral and cellular proteins or other rna sequences. direct evidence for internal initiation of translation has been obtained from in vitro translation of reporter genes downstream from the 5'utr sequence placed in mono- or dicistronic vectors. the nonpaired tip of the stem-loop structure is partially complementary to the s subunit of ribosomal rna and may, therefore, be the site of binding during internal initiation.
the internal ribosomal entry site activity of the 5'utr is consistent with the hypothesis that translation is initiated from the aug methionine codon at position . there is no evidence for translation from any of the variable number of aug triplets upstream from position , although production of the small proteins from these upstream potential orfs may play some role in regulating expression of the large orf. in the absence of a cell culture system for hcv, most information available on the expression and processing of hcv proteins has been obtained from transfection experiments with cloned dna sequences corresponding to the different proteins, and more recently by direct observations of the cellular distributions and properties of hcv proteins detected in liver or plasma in vivo. transfection of prokaryotic or eukaryotic cells with dna copies of different parts of the hcv genome under the control of artificial promoters allows expression of the encoded proteins, and provides a useful technique for studying their synthesis, biochemical properties, and processing (table ii). expression of this part of the genome in cells or in reticulocyte lysates containing microsomal membranes leads to the synthesis of a polyprotein and its cleavage into a series of proteins. the protein identified as the capsid protein on the basis of comparisons with related viruses is expressed as a protein of approximate size to kd. the assignment of this protein as the nucleocapsid protein is supported by the presence of regions within the protein containing numerous basic (positively charged) amino acids that may have rna-binding properties associated with the encapsidation of hcv rna during virus assembly. binding of core protein to ribosomal rna has recently been reported. using similar techniques, expression of the putative envelope proteins of hcv (e1 and e2) leads to the synthesis in mammalian cells of two heterogeneous proteins with sizes ranging from to kd and to kd, respectively.
cleavage between the capsid protein and e1, e1 and e2, and e2 and ns depends on the addition of microsomal membranes, implying that the host cell signalase has a role in these processing steps. the sizes of e1 and e2 are greater than could be explained by their amino acid sequences alone and support biochemical evidence for extensive glycosylation of both proteins after translation. both e1 and e2 have a large number of potential n-linked glycosylation sites, although the details of which sites are used, the extent to which the glycoprotein moieties are modified, and whether there is also o-linked glycosylation await further biochemical analysis. two cleavage sites between e2 and ns (both microsome dependent) have recently been identified, leading to the production of e2 proteins differing in size by amino acid residues. evidence for intermolecular associations between e1 and e2 has been obtained through immunoprecipitation experiments, in which antibody to e1 or e2 could precipitate both proteins under nondenaturing conditions. the nature or significance of this association is unclear, although current evidence suggests that the association is predominantly noncovalent and does not occur simply through hydrophobic interactions between the membrane anchors of the two proteins. recently, monoclonal antibodies to either e1 or e2 were shown to coprecipitate ns and ns , and there also is evidence for associations between e2, ns , and ns b. in vitro translation of the rest of the genome leads to the production of proteins of sizes , to , , , to , and kd, corresponding to ns , ns , ns a, ns b, ns a, and ns b, respectively (figure ; table ii). proteolytic cleavage pathways that generate the nonstructural proteins are mediated by ns and ns and have been extensively studied by several groups, as they represent possible targets of antiviral treatment. ns is a serine protease that catalyzes cleavage reactions between ns /ns a, ns a/ns b, ns b/ns a, and ns a/ns b.
ns a is a metalloproteinase that cleaves the ns /ns junction. the ns /ns a cleavage reaction mediated by ns and the ns /ns cleavage mediated by ns occur in cis, whereas other reactions can occur through intermolecular associations between ns and the rest of the polyprotein. accounts of the complex sequence of events and the interactions between nonstructural proteins involved in cleavage reactions differ in detail depending on the experimental methods used. however, cleavage may be a sequential process modulated by the activities of other proteins, such as ns a. ns protease activity is zinc dependent and contains an active site dependent on residues in ns . therefore, after the cis cleavage of the ns /ns junction, the protease is inactivated and will not act in trans on other substrates. this cleavage reaction has been shown to be essential for activating ns protease, and natural variation in the efficiency of the reaction may modulate the pathogenicity of hcv in vivo. when released, ns cleaves other sites with varying efficiencies. the active site of ns has been mapped by deletion experiments to lie at the amino terminus of the protein (residues to ). the substrate specificity of the serine protease activity has been defined by sequence comparisons and mutagenesis experiments and generally conforms to the consensus sequence d/e-x-x-x-x-c/t↓s/a in the target protein. there is some evidence for a less stringent requirement for specific amino acids around the cis cleavage site (ns /ns a) than for those cleaved in trans. several investigators have described the requirement for other protein cofactors for the activity of ns . in particular, it appears that binding of ns a to ns is necessary at least for the cleavage of ns b/ns a and may modulate the activity of ns in other ways.
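the consensus sequence above can be treated as a simple pattern match: an acidic residue (d/e), four unconstrained positions, then c/t at the cleavage position followed by s/a. a minimal sketch of scanning a polyprotein for such candidate sites, using an invented toy sequence rather than a real hcv polyprotein:

```python
import re

# Consensus described in the text: D/E - X - X - X - X - C/T | S/A,
# with cleavage between the C/T (P1) and S/A (P1') positions.
# The lookahead keeps the S/A residue available for overlapping scans.
CONSENSUS = re.compile(r"[DE]....[CT](?=[SA])")

def candidate_cleavage_sites(polyprotein: str):
    """Return 0-based positions of the P1 residue at each consensus match."""
    return [m.end() - 1 for m in CONSENSUS.finditer(polyprotein)]

toy = "AAADVVVACSGGGEMMMMTSKK"  # hypothetical sequence with two matching motifs
print(candidate_cleavage_sites(toy))
```

a match is only a candidate: as the text notes, cleavage efficiency in vivo also depends on context and on cofactors such as ns a.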
although there is now some information on the proteolytic cleavage steps used to process the hcv polyprotein, the difficulty associated with in vitro culture of hcv and production of infectious molecular clones of hcv so far has prevented a more detailed understanding of the sites of hcv replication in cells and the processes of virus assembly and release from the cell. future research should reveal the nature of the interaction between the capsid protein and virus rna and how this is packaged into the assembled provirion, the posttranslational modifications to the envelope proteins and where these occur in the cell, and the sites of budding of hcv through cellular membranes. to understand replication more fully, we must also identify the mechanism of priming of rna synthesis from the ends of the genome, the nature of the primers, or whether circularization is necessary for transcription. because a cell culture system to investigate differences in neutralization and cytopathic properties of hcv is not available, nucleotide sequence comparisons and typing assays developed from sequence data have become the principal techniques for characterizing different variants of hcv. this type of analysis is fairly easy to perform, especially since virus sequences can be amplified by pcr directly from clinical specimens. in common with other rna viruses, variants of hcv show considerable sequence variability, many differing considerably from the prototype hcv (hcv-pt). differences of up to % have been found between the complete genomic sequences of the most extremely divergent variants analyzed to date, comparable to those observed between serotypes of other human positive-strand rna viruses such as poliovirus, coxsackievirus, and coronaviruses.
sequence variability is evenly distributed throughout all virus genes (table ), apart from the highly conserved nucleotide (and amino acid) sequence of the core (nucleocapsid) protein and 5'utr and the greater variability of the envelope gene (table iii). nucleotide sequence comparison of complete genomes or subgenomic fragments between variants has shown that variants of hcv obtained from japan are substantially different from the hcv-pt variant obtained in the united states. comparison of the complete genome sequence of hcv-j and hcv-bk from japan showed % sequence similarity to each other but only % with hcv-pt. at that time, the former variants were classified as the "japanese" type (or type ii), while those from the united states (hcv-pt and hcv-h) were classified as type i. comparisons of subgenomic regions of hcv, such as e1, core, and ns , provide evidence of at least six major groupings of hcv sequences, each of which contains a series of more closely related clusters of sequences (figure ). [table iii note: sources of sequences: 1a = hcv-h; 1b = hcv-j; 1c = hc-j; a = hc-j; b = hc-j; a = nzli; b = tr. in all comparisons, the 5'ncr is the most conserved subgenomic region (maximum % nucleotide sequence divergence), whereas highly variable regions are found in parts of the genome encoding e1 and ns ( % to % nucleotide sequence and % to % amino acid sequence differences).] the current widely used nomenclature for hcv variants reflects this hierarchy of sequence relationships between different isolates. based on previous suggestions, the major branches in the phylogenetic tree are referred to as "types," while "subtypes" correspond to the more closely related sequences within most of the major groups (figure ). although ns sequences are analyzed in figure , equivalent sequence relationships exist in other parts of the genome. the types have been numbered to and the subtypes a, b, and c, in both cases in order of discovery.
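the type/subtype hierarchy described above rests on pairwise nucleotide distance: isolates within a subtype are closer to one another than isolates of different subtypes, which in turn are closer than isolates of different types. a minimal sketch of that logic; the cutoff values and sequences here are arbitrary assumptions for illustration, not the published classification thresholds:

```python
# Pairwise percent divergence between aligned nucleotide sequences, and an
# illustrative hierarchical classification. Thresholds are invented.

def percent_divergence(a: str, b: str) -> float:
    """Percent of differing positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * diffs / len(a)

def relatedness(a: str, b: str, subtype_cut=10.0, type_cut=25.0) -> str:
    """Classify a pair of isolates by divergence against assumed cutoffs."""
    d = percent_divergence(a, b)
    if d < subtype_cut:
        return "same subtype"
    if d < type_cut:
        return "same type, different subtype"
    return "different types"

print(relatedness("AAAAAAAAAAAAAAAAAAAA", "CCCAAAAAAAAAAAAAAAAA"))
```

real analyses would of course use long alignments of e1, core, or ns regions and a proper evolutionary distance model rather than raw mismatch counts.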
therefore, the sequence cloned by chiron is assigned type 1a, hcv-j and hcv-bk are type 1b, hc-j is type a, and hc-j is type b. this nomenclature closely follows the schemes originally described by enomoto (type a) on the basis of phylogenetic analysis of sequences in the ns , ns , core, and 5'utr noncoding regions. this approach avoids the inconsistencies of earlier systems and should be easier to extend when new genotypes are discovered. some genotypes of hcv (types 1a, a, and b) show a broad worldwide distribution, whereas others, such as types a and a, are found only in specific geographic regions. blood donors and patients with chronic hepatitis from countries in western europe and the united states frequently are infected with genotypes 1a, 1b, a, b, and a, although the relative frequencies of each may vary. there is a trend for more frequent infection with type 1b in southern and eastern europe. in many european countries, genotype distributions vary with the age of the patients, reflecting rapid changes in genotype distribution with time within a single geographic area. a striking geographic change in genotype distribution is apparent between southeast europe and turkey (both mainly type 1b) and several countries in the middle east and parts of north and central africa where other genotypes predominate. for example, a high frequency of hcv infection is found in egypt ( % to %), of which almost all corresponds to type a. hcv type also is the principal genotype in countries such as yemen, kuwait, iraq, and saudi arabia in the middle east and in zaire, burundi, and gabon in central africa. hcv genotype a is frequently found among patients with non-a, non-b hepatitis and blood donors in south africa, but is found only rarely in europe and elsewhere. in japan, taiwan, and some parts of china, genotypes 1b, a, and b are the most frequently found.
infection with type 1a in japan appears to be confined to patients with hemophilia who received commercial (us-produced) blood products, such as factor viii and ix clotting concentrates. the geographic distribution of type varies; it is only rarely found in japan and is also infrequent in taiwan, hong kong, and macau. however, this genotype is found with increasing frequency in countries to the west, frequently occurring in singapore and accounting for most hepatitis infections in thailand. in a small sample, it was the only genotype found in bangladesh and eastern india. as with type in africa, there is now evidence of considerable sequence diversity within the type genotype, with at least different subtypes of type identified in nepal, india, and bangladesh. a genotype with a highly restricted geographic range is type a. this type was originally found in hong kong and was shown to be a new major genotype by sequence comparisons in the ns and e regions. approximately one third of anti-hcv-positive blood donors in hong kong are infected with this genotype, as are an equivalent proportion in neighboring macau and vietnam. a series of novel genotypes has been found in vietnam and thailand; these genotypes are distinct from types to classified to date but are more closely related to type than to other genotypes, consistent with their overlapping geographic range with type in southeast asia. numerous investigations are being conducted into possible differences in the course of disease associated with different hcv genotypes, such as the rate of development of cirrhosis and hepatocellular carcinoma, and whether certain genotypes are more or less likely to respond to interferon treatment.
a large number of clinical investigations have documented severe and progressive liver disease in patients infected with each of the well-characterized genotypes (types 1a, 1b, a, b, a, and a), so there is little evidence thus far of variants of hcv that are completely nonpathogenic. however, possible variation in the rate of disease progression, differences between genotypes in routes and frequency of person-to-person transmission, or differences in the probability of achieving a sustained response to antiviral treatment would indicate the potential usefulness of identifying the infecting genotype in certain clinical situations. several clinical studies have catalogued a variety of factors (including genotype) that correlate with the severity of liver disease and show predictive value for response to antiviral treatment. factors that frequently have been shown to influence response to interferon treatment include age and duration of infection, presence of cirrhosis before treatment, genotype, and pretreatment level of circulating viral rna in plasma. a consistent finding reported by several different groups that used a variety of typing assays has been the greatly increased rate of long-term response found when treating patients infected with genotypes a, b, and a compared with type 1b. for example, chemello et al found that long-term (> months) normalization of alanine aminotransferase levels was achieved in only % of patients infected with type variants, compared with % of those infected with type and % of those infected with type . in a study by tsubota et al, infection with type 1b, the presence of cirrhosis, and a high pretreatment virus load were each independently associated with a reduced chance of response (relative risks of , , and , respectively). the mechanism by which different genotypes differ in response to treatment remains obscure.
for treatments such as interferon, we do not know whether the effect of the drug is directly antiviral or whether the inhibition of virus replication is secondary to increased expression of major histocompatibility complex class i antigens on the surface of hepatocytes and greater cytotoxic t-cell activity against virus-infected cells. elucidating the mechanism of action of interferon and whether there are virologic differences between genotypes in sensitivity to antiviral agents awaits a cell culture model for hcv infection. although determination of the nucleotide sequences is the most reliable method of identifying different genotypes of hcv, this method is not practical for large clinical studies. many of the published methods for "genotyping" are based on amplification of viral sequences in clinical specimens, either by using type-specific primers that selectively amplify different genotypes, by analyzing the pcr product by hybridization with genotype-specific probes, or by using restriction fragment length polymorphisms (rflp). the assays have different strengths and weaknesses. for example, methods based on amplification and analysis of 5'ncr sequences have advantages of sensitivity, because this region is highly conserved and can be more frequently amplified from hcv-infected patients than other parts of the genome. however, few nucleotide differences are found between different genotypes. although reliably differentiating six major genotypes by using rflp or by type-specific probes is possible, it is not always possible to reliably identify virus subtypes. types a and b consistently differ at position - , allowing them to be differentiated by the restriction enzyme scrfi or by probes in the inno-lipa (innogenetics, zwijnaarde, belgium). however, sequences of type c are indistinguishable from some of those of type a.
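the rflp approach above amounts to cutting a pcr amplicon wherever a recognition site occurs and comparing the resulting fragment-length patterns. a minimal in-silico sketch; the degenerate cc ngg site and the cut offset used here, like the synthetic amplicon, are assumptions for illustration rather than the validated assay parameters:

```python
import re

# In-silico RFLP sketch: find occurrences of a degenerate recognition site
# in an amplicon and report fragment lengths after cutting. The CCNGG site
# and the cut position within it are assumed here; the amplicon is synthetic.

def rflp_fragments(seq: str, site: str = r"CC[ACGT]GG", cut_offset: int = 2):
    """Return fragment lengths after cutting at cut_offset within each site match."""
    cuts = [m.start() + cut_offset for m in re.finditer(site, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

amplicon = "AAAACCTGGTTTTTTCCAGGAA"  # hypothetical amplicon with two sites
print(rflp_fragments(amplicon))
```

two genotypes that differ at a single position inside (or outside) a recognition site then yield distinguishable fragment patterns on a gel, which is the basis of the discrimination described in the text.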
similarly, some of the novel subtypes of type often show sequences identical to those of type 1a or 1b, and a small proportion of type 1a variants are identical to type 1b and vice versa. typing methods based on coding regions, such as core and ns , can reliably identify subtypes as well as major genotypes because the degree of sequence divergence is much greater (table iii). however, amplifying sequences in coding regions of the genome generally is difficult because sequence variability in the primer-binding sites may reduce the effectiveness of sequence amplification by pcr. nevertheless, the variation is exploited in a genotyping assay that uses type-specific primers complementary to variable regions in the core gene. currently, this assay can identify and differentiate types 1a, 1b, a, b, and a, although the method is technically complicated to perform reliably and may be difficult to extend to the great range of hcv genotypes now described. serologic typing methods have advantages over pcr-based methods in terms of the speed and simplicity of sample preparation and the use of simple equipment found in any diagnostic virology laboratory. by careful optimization of reagents, such assays may show high sensitivity and reproducibility. for example, type-specific antibody to ns peptides can be detected in approximately % of patients with non-a, non-b hepatitis. furthermore, the assays can be readily extended to detect new genotypes. one ns -based assay can reliably identify type-specific antibody to six major genotypes, although the antigenic similarity between subtypes currently precludes the separate identification of types 1a and 1b and a and b using the ns peptides alone.
In contrast to the highly restricted sequence diversity of the 5′ NCR and adjacent core region, the two putative envelope genes are highly divergent between different variants of HCV (Table III) and show a three- to four-times higher rate of sequence change with time in persistently infected patients. Because these proteins are likely to lie on the outside of the virus, they would be the principal targets of the humoral immune response to HCV elicited on infection. Changes in the E1 and E2 genes may alter the antigenicity of the virus to allow "immune escape" from neutralizing antibodies, thereby accounting for both the high degree of envelope sequence variability and the observed persistent nature of HCV infection. Supporting this model is the observation that much of the variability in the E1 and E2 genes is concentrated in discrete "hypervariable" regions, possibly reflecting pressure on the virus to evade immune recognition at specific sites where HCV may be neutralized. Experimental evidence supporting this theory includes the observations that variants of HCV with changes in the E1 and E2 genes are antigenically distinct and that, in many cases, the in vivo appearance of variants with different sequences in the hypervariable region is followed by the development of antibodies that specifically recognize the new variants. In one report, persistent HCV infection developed in a patient with deficient antibody responses (agammaglobulinemia) but without the development of sequence variability in E2, consistent with the role of antibody in driving variation in immunocompetent persons.
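The clustering of variability into discrete hypervariable regions can be located computationally by scoring each column of an alignment for diversity, for example by Shannon entropy. The sketch below uses an invented seven-residue toy alignment and an arbitrary entropy threshold; it is meant only to show the principle, not real envelope data.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def variable_sites(alignment, threshold=1.0):
    """Indices of columns whose entropy exceeds `threshold` bits."""
    cols = zip(*alignment)
    return [i for i, col in enumerate(cols) if column_entropy(col) > threshold]

# Toy alignment: columns 2 and 5 vary, mimicking a 'hypervariable' patch
aln = ["MAQLTGV",
       "MARLTAV",
       "MAKLTSV",
       "MATLTGV"]
print(variable_sites(aln))   # → [2, 5]
```

Applied to aligned envelope sequences, a sliding window over such per-site scores would pick out the concentrated patches of variability that the text describes.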
On the other hand, envelope sequences obtained sequentially from persistently infected patients sometimes show no significant change, whereas in others, variants coexist with antibodies that recognize the corresponding hypervariable-region peptides. Cytotoxic T-cell responses also may play a protective role in HCV infection, as they do in other virus infections in which they are more important in virus clearance than the antibody response. Although circumstantial evidence supports the theory of immune escape, additional studies are needed to confirm this as a plausible model of virus persistence. Many of the current uncertainties may be resolved when a satisfactory in vitro neutralization assay is developed for HCV that enables the effect of amino acid changes in the envelope genes to be investigated directly. Additional information also is needed on the relative importance of humoral and cell-mediated immunity to HCV and on which is more important in virus clearance and protection from reinfection. Persistent infection with HCV entails continuous replication of the virus over years or decades in HCV carriers. The large number of replication cycles, combined with the relatively error-prone RNA-dependent RNA polymerase, leads to measurable sequence drift of HCV over time. For example, over an -year interval of persistent infection in a chimpanzee, the rate of sequence change for the genome as a whole was . % per site per year, similar to the rate calculated for sequence change in one half of the genome over years of observation in a human carrier ( . %) and in a cross-sectional study. Using this "molecular clock," it is possible in principle to calculate times of divergence between HCV variants and therefore to establish their degree of epidemiologic relatedness. For example, the finding of relatively few sequence differences between variants infecting two individuals would provide evidence of recent HCV transmission between them.
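Under a strict molecular-clock assumption, the divergence time for a pair of variants follows directly from their pairwise distance and the per-lineage substitution rate: both lineages accumulate changes independently, so t = d / (2r). A minimal sketch, with the distance and rate values chosen hypothetically rather than taken from the studies cited above:

```python
def divergence_time(distance, rate_per_site_per_year):
    """Years since two lineages shared an ancestor, under a strict clock.

    Both lineages accumulate substitutions, so the pairwise distance
    grows at twice the per-lineage rate: t = d / (2 * r).
    """
    return distance / (2 * rate_per_site_per_year)

# Hypothetical example: 1.5% pairwise distance and a clock rate of
# 0.15% per site per year give a divergence time of 5 years.
print(divergence_time(0.015, 0.0015))   # → 5.0
```

In practice a corrected distance (accounting for multiple substitutions at the same site) would replace the raw proportion, but the clock logic is the same.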
Sequence comparisons in variable regions of the HCV genome, such as the envelope genes and NS5, have been used to document transmission between persons, either from mother to child, within families, by iatrogenic routes, or by sexual contact. In these studies, the possibility of transmission by different risk behaviors was assessed by measuring the degree of relatedness of HCV recovered from the implicated persons. Phylogenetic analysis of nucleotide sequences provides a more formal method of investigating relationships between sequences. Phylogenetic trees produced by such methods indicate the degree of relatedness between sequences, while the branching order of the different lineages shows the most likely evolutionary history of the sampled population. For example, clustering of HCV sequences into a single phylogenetic group among recipients of an HCV-contaminated blood product (anti-D immunoglobulin) was still apparent years after infection (Figure). Figure legend: phylogenetic relationships between sequences from the NS5 region of patients exposed to an implicated batch of anti-D immunoglobulin (Ig) and those of epidemiologically unrelated type 1b variants from Japan (J), the United States (U), and Europe (E). B = NS5 sequence of hepatitis C virus recovered from the implicated batch of anti-D Ig; donor = sequence of the variant infecting the suspected donor to the plasma pool used to manufacture that batch. Phylogenetic analysis was done on a segment of the NS5 gene that was amplified, sequenced, and analyzed as previously described; sequence distances were calculated using the program DNAML in a data set containing the prototype hepatitis C virus (type 1a) as an outgroup. Sequences were obtained from published sources. These approaches to HCV epidemiology will prove valuable in documenting the spread of HCV in different risk groups, evaluating alternative (nonparenteral) routes of transmission, and understanding more about the origins and evolution of HCV. This paper attempts to review a rapidly expanding area of research. It is hoped that a combination of basic science and clinical studies may eventually lead to a greater understanding of the ways in which HCV infection may be prevented or cured by the use of antiviral vaccines. The information provided here will clearly form the basis of many of these developments.
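The relatedness measurement underlying such transmission studies can be reduced to computing pairwise distances among the recovered sequences and asking which isolates cluster most tightly. A toy sketch follows, with invented short fragments standing in for the sequenced region; the labels and sequences are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def closest_pair(seqs):
    """Return the pair of labels with the smallest pairwise distance."""
    return min(combinations(seqs, 2),
               key=lambda pair: p_distance(seqs[pair[0]], seqs[pair[1]]))

# Invented fragments: the two recipients share a recent source,
# so their sequences are closer to each other than to the outgroup.
seqs = {
    "recipient1": "ACGTACGTACGT",
    "recipient2": "ACGTACGTACGA",
    "unrelated":  "ACGTTCGAACTT",
}
print(closest_pair(seqs))   # → ('recipient1', 'recipient2')
```

A full phylogenetic analysis, as in the figure, goes further by inferring branching order from such a distance matrix (or by likelihood methods such as DNAML), but the clustering signal starts from these pairwise comparisons.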
key: cord- -yefxrj
authors: yelverton, elizabeth; lindsley, dale; yamauchi, phil; gallant, jonathan a.
title: the function of a ribosomal frameshifting signal from human immunodeficiency virus-1 in escherichia coli
journal: mol microbiol
cord_uid: yefxrj
A -nucleotide sequence from the gag-pol ribosome frameshift site of HIV-1 directs analogous ribosomal frameshifting in Escherichia coli. Limitation for leucine, which is encoded precisely at the frameshift site, dramatically increased the frequency of leftward frameshifting. Limitation for phenylalanine or arginine, which are encoded just before and just after the frameshift, did not significantly affect frameshifting. Protein sequence analysis demonstrated the occurrence of two closely related frameshift mechanisms. In the first, ribosomes appear to bind leucyl-tRNA at the frameshift site and then slip leftward. This is the 'simultaneous slippage' mechanism. In the second, ribosomes appear to slip before binding aminoacyl-tRNA, and then bind phenylalanyl-tRNA, which is encoded in the left-shifted reading frame. This mechanism is identical to the 'overlapping reading' we have demonstrated at other bacterial frameshift sites. The HIV-1 sequence is prone to frameshifting by both mechanisms in E. coli.
Ribosomes normally maintain a constant reading frame from the AUG to the finish, but they are capable of slipping into an alternative reading frame at a low average frequency (Atkins et al.; J. A. Gallant et al., unpublished). In certain special cases, much higher frequencies of ribosome frameshifting occur. These cases include production of polypeptide release factor 2 of Escherichia coli, which depends upon a rightward frameshift within the coding sequence (Craigen et al.; Craigen and Caskey; Weiss et al.; Curran and Yarus); translation of the reverse transcriptase of the yeast Ty element, which also depends upon a rightward frameshift (Clare et al.); and translation of the RNA of several retroviruses, which express gag-pol and gag-pro-pol polyproteins by means of leftward frameshifts (reviewed by Hatfield and Oroszlan; Cattaneo). Ribosomal frameshifting in both rightward and leftward directions has also been shown to occur at certain 'hungry' codons whose cognate aminoacyl-tRNAs are in short supply (Gallant and Foley; Weiss and Gallant; Gallant et al.; Kurland and Gallant). Not all hungry codons are equally prone to shift: in a survey of frameshift mutations of the rIIB gene of phage T4, Weiss and Gallant found that only a minority were phenotypically suppressible when challenged by limitation for any of several aminoacyl-tRNAs. The context rules governing ribosome frameshifting at hungry sites are under investigation, and have been defined in a few cases (Weiss et al.; Gallant and Lindsley; Peter et al.; Kolor et al.; Lindsley and Gallant). So far these sequences do not resemble any of the naturally occurring shifty sites summarized in the first paragraph above. In order to find out whether these two categories of ribosome frameshifting are mechanistically related, we have tested the susceptibility of a well-studied retroviral frameshift site to manipulation by aminoacyl-tRNA limitation in E. coli. We have directed our analysis to the shifty site at the gag-pol junction of HIV-1 both because of its clinical interest and because certain features render it convenient for analysis. In some viral systems, baroque secondary structures in the mRNA downstream of the frameshift site are required to augment frameshifting levels (Jacks et al.; Brierley et al.).
In the case of HIV-1, however, although a stem-loop structure might exist downstream of the frameshift site (Jacks et al.), direct modification or elimination of the stem-loop sequence has little effect on the rate of frameshifting (Madhani et al.; Weiss et al.). Moreover, Wilson et al. demonstrated that a short nucleotide sequence of HIV-1 without the stem-loop was sufficient to direct a high level of frameshifting in heterologous in vitro systems. The site of ribosomal frameshifting at the slippery sequence U-UUU-UUA has been directly established by amino acid sequencing of frameshifted proteins (Jacks et al.), and the participation of certain aminoacyl-tRNAs has been clearly implicated by mutagenesis of the monotonous tract of uridines (Jacks et al.; Wilson et al.). Our purpose was to discover whether the ribosomal frameshifting directed by a very short sequence in HIV-1 could be reproduced by E. coli ribosomes in vivo, and, if so, whether we could alter the rate of frameshifting by regimens that change the relative abundance of key aminoacyl-tRNAs encoded at or near the frameshift site. Weiss et al. have also reported that a nucleotide fragment from HIV-1 is sufficient to direct ribosomal frameshifting in an E. coli system. In this report we present evidence that a much shorter nucleotide sequence derived from HIV-1 is sufficient to direct the same ribosomal frameshift event in E. coli as in eukaryotes. We also show that in E. coli the rate of ribosomal frameshifting on that sequence can be increased by limitation for leucine, the amino acid encoded at the frameshift site. Protein sequence analysis of the product indicates the occurrence of two slightly different mechanisms of shifting. The strategy behind the construction of our assay system for ribosomal frameshifting may be understood with reference to the figure. When eukaryotic ribosomes decode the HIV mRNA sequence ...UUUUUUAGGG... (panel A), the adenine at the slippery site appears to be read twice: first, as the third position of a leucine codon (UUA) and then again as the first position of the overlapping arginine codon (AGG) (Ratner et al.). In a heterologous mammalian in vitro translation system, most of the frameshift product has the amino acid sequence ...Asn-Phe-Leu-Arg... (Jacks et al.), where Leu is encoded by the UUA codon and Arg by the overlapping AGG codon. Some mutations that result in increased or decreased expression of frameshift products in a heterologous test system are shown above and below the nucleotide sequence, respectively (Wilson et al.); 'N' signifies a mutation to any non-U base. Double underlines mark the boundaries of a fragment that directs the synthesis of a frameshift protein product in a heterologous yeast system (Wilson et al.). The singly underlined G marks the boundary of a fragment that directs the synthesis of a frameshift protein product in a mammalian in vitro system (Jacks et al.). Panels B and C depict a portion of the mRNA sequences expressed from lacZ frameshift alleles HIV , HIV -A , HIV , and HIV -U . Numbers above the nucleotide sequence correspond to analogous positions of the HIV-1 gag-pol junction. Doubly underlined nucleotides mark the boundaries of sequence that is conserved with respect to HIV-1. Synthesis of β-galactosidase from the alleles requires a leftward frameshift. Amino acids are shown for the mature protein, after in vivo cleavage of the initiating N-terminal formyl-Met residue. Constructs pHIV and pHIV , and their variants pHIV -A and pHIV -U , are described in the figure. (The sequence of the critical heptanucleotide at the frameshift site is shown in parentheses after each construct's designation.) All constructs were transformed into a derivative of CP (relA thr⁻ leu⁻ his⁻ arg⁻ thi⁻) carrying a complete deletion of lacZ.
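The dual reading of the adenine described above can be demonstrated by translating the slippery region in the zero frame and again after a simulated −1 slip, so that one base is decoded twice. The sketch below uses a minimal codon table covering only this example; the fragment and slip position follow the description in the text.

```python
# Minimal RNA codon table covering this example only
CODONS = {"AAU": "Asn", "UUU": "Phe", "UUA": "Leu",
          "AGG": "Arg", "GGG": "Gly"}

def translate(rna, start=0):
    """Translate codon by codon from `start`, stopping at the last full codon."""
    return [CODONS.get(rna[i:i + 3], "Xaa")
            for i in range(start, len(rna) - 2, 3)]

def translate_with_minus1_slip(rna, slip_after_codon):
    """Translate, then shift the reading frame one base leftward (-1)
    after `slip_after_codon` codons, so one base is read twice."""
    before = translate(rna[:3 * slip_after_codon])
    resume = 3 * slip_after_codon - 1          # step back one base
    return before + translate(rna, start=resume)

# Slippery region in its Asn-Phe-Leu... context
rna = "AAUUUUUUAGGG"
print(translate(rna))                       # → ['Asn', 'Phe', 'Leu', 'Gly']
print(translate_with_minus1_slip(rna, 3))   # → ['Asn', 'Phe', 'Leu', 'Arg']
```

The in-frame product continues in the gag frame (...Leu-Gly...), while the −1 product yields the Asn-Phe-Leu-Arg junction reported for the gag-pol fusion, with the adenine serving as both the third base of UUA and the first base of AGG.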
Methods of cultivation and of enzyme and protein assay were as described previously (Peter et al.). Cells were grown into exponential phase in M -glucose medium supplemented with all required amino acids plus Ile and Val. The lac promoter was induced ( mM IPTG and . mM cAMP) for about one doubling. Data are reported ± standard error of the mean, with the number of replicate induced cultures in parentheses; these values include all the unstarved control cultures from the various starvation experiments. The shift is estimated to take place in about % of ribosomal transits (Jacks et al.). In HIV-1, the outcome of the leftward ribosomal frameshift is the successful production of the gag-pol fusion protein. In the assay system we have devised, the outcome of an analogous leftward frameshift by E. coli ribosomes will be the successful production of the enzyme β-galactosidase from genetically frameshifted alleles of the lacZ gene. We have previously used an assay system of similar design to demonstrate that lysyl-tRNA starvation can amplify ribosomal frameshifting in either direction at lysine codons, given certain context rules (Gallant and Lindsley; Peter et al.; Lindsley and Gallant). Alleles to be tested were constructed by the ligation of paired complementary oligonucleotides into the HindIII–BamHI site of pBWHOO, as described in Gallant and Lindsley. The figure shows the sequence of the translated strand from the region of our constructs that reproduces the gag-pol frameshift signal from HIV-1. The lacZ frameshift alleles carried on plasmids pHIV and pHIV are constructed so that a shift to the left by one base, as in the expression of the gag-pol fusion of HIV, is required to generate active enzyme. The two constructs both carry a short sequence identical to the region around the frameshift site in the gag-pol overlap of HIV-1 ( nucleotides in one and nucleotides in the other; see the figure); they differ slightly from one another several bases downstream of the frameshift site.
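The frameshifting level reported in the tables is simply the β-galactosidase activity of a frameshift reporter expressed as a percentage of an in-frame lacZ⁺ control, averaged over replicate induced cultures with a standard error. A sketch of that bookkeeping, with invented Miller-unit readings (all numbers hypothetical):

```python
import statistics

def percent_frameshifting(shift_units, control_units):
    """Frameshift frequency: activity of the frameshift reporter relative
    to an in-frame lacZ+ control, in percent."""
    return 100.0 * shift_units / control_units

def mean_sem(values):
    """Mean and standard error of the mean over replicate cultures."""
    m = statistics.mean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5
    return m, sem

# Invented readings from three replicate induced cultures
control_units = 12000.0
replicates = [percent_frameshifting(u, control_units)
              for u in (250.0, 270.0, 230.0)]
m, sem = mean_sem(replicates)
print(f"{m:.2f}% ± {sem:.2f}")
```

A starvation experiment would then compare this ratio between starved and unstarved cultures of the same strain, so that changes in overall expression cancel out.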
Host cells carrying either of these plasmids produce active enzyme at about % of the efficiency of cells carrying a control lacZ⁺ plasmid (Table ). This basal value is close to the value ( . %) observed by Weiss et al. for frameshifting on a much longer HIV-derived sequence in a similar β-galactosidase reporter. It is also much higher than the frequency of leftward frameshifting ( . - . %) we observed previously at sequences unrelated to HIV (Gallant and Lindsley). The presence of the HIV sequence in our reporter thus leads to an unusually high frequency of leftward frameshifting. Modification of the critical heptanucleotide sequence from U UUU UUA to U UAU UUA in plasmid pHIV -A decreased frameshifting about fivefold, while modification of the heptanucleotide to U UUU UUU in pHIV -U increased frameshifting by two- to threefold (Table ). These genetic results are analogous to earlier findings in other reporter systems (Jacks