key: cord-0036566-2y146igt
authors: Haynes, Matthew; Rohwer, Forest
title: The Human Virome
date: 2010-10-11
journal: Metagenomics of the Human Body
DOI: 10.1007/978-1-4419-7089-3_4
sha: 038dedff24f518a8d3996caab6364cce103849d1
doc_id: 36566
cord_uid: 2y146igt
In this chapter we discuss changing approaches to viral discovery and human health, summarize the current understanding of the human-associated viral community, and review contemporary methods in viral metagenomics. The virome is the community of viruses that populate an organism or ecosystem at any given time. This includes the "core" set of commensal viruses that do not give rise to clinical symptoms or viremia, combined with any acute or persistent infections that may be present. Recent technological advances enable us to sequence viral genomes without culturing or cloning. These methods permit not only the discovery of a wider range of viral pathogens, but also a broader assessment of the human virome in the absence of clinically recognized disease. A new focus in contemporary virology is the natural viral community of the human body. Knowledge of this community will provide a background for recognition of emerging and previously unrecognized viruses. It should be possible to detect viral infection before the emergence of symptoms, which will have significant implications for health-care delivery.
Until fairly recently, it has been customary, in the absence of clinically significant infection, to view the human organism as an isolated entity. In fact, the healthy human body always contains a large number of foreign cells and viruses (Virgin et al., 2009; Dethlefsen et al., 2007; Relman, 2002). There are more viral particles in the human body than microbial cells, which in turn are ten times more numerous than eukaryotic (human) cells. Similarly, only about 1.5% of the human genome encodes recognizable "human proteins," whereas approximately 45% of our genome consists of retrotransposons, DNA transposons, and viral sequences.
Most of the human-associated microbes and viruses, often found on "external" surfaces lining the lumens of organs such as the gut and oral/nasal cavities, participate in complex commensal or mutualistic relationships with their human host (Dethlefsen et al., 2007; Relman, 2002). Therefore, it is not advantageous to attempt to eradicate every virus and microbial cell from the body in response to infection. A new medical paradigm is emerging: an illness may be defined by a disruption of the normal "healthy" microbiome and/or virome, and restoration of this state, not elimination of all nonhuman organisms, should be the goal of medical treatment (Harrison, 2007). Current interest in the human microbiome reflects the increasing acceptance of the view that the microbiota per se should not be seen merely as invasive disease vectors but are in fact an intrinsic part of the human supra-organism (Dethlefsen et al., 2007).
The classical method of viral isolation is culturing. Koch's postulates (Rivers, 1937) dictate the conditions under which a virus cultured in vitro should be regarded as the cause of an infectious disease; human viruses are usually cultured only in this context. In addition, culturing will be successful only for the small fraction of viruses for which appropriate culture conditions can be determined. To break from the limited view that all viruses are intrinsically harmful requires new methodologies that enable us to characterize entire uncultured viral communities. A culture-independent metagenomics approach to viral community analysis will yield a broader view of the human virome, just as metagenomic sequencing has revealed a wider range of bacteria in the human microbiome than culture-based methods (Harris et al., 2007; Rogers et al., 2004). A viral metagenome or virome is the total genetic (DNA and RNA) sequence derived from a viral community.
Mathematically, the structure of a community may be represented by a graph whose functional form (lognormal, power function, etc.) reflects the relative abundance distribution of its members. The evenness of the distribution (fractional contribution of each genotype) and the richness (the total number of genotypes) are often combined to denote the diversity of a community, as in the Shannon-Wiener index (H'), $H' = -\sum_{i=1}^{S} r_i \ln r_i$, where S is the sample richness and r_i is the relative abundance of genotype i. Viral communities tend to be unevenly distributed, with a small number of species or genotypes dominating in abundance (Fig. 4.1).
Fig. 4.1 Example of a human viral community rank-abundance curve using sequences from a human oropharyngeal metagenome. Here the relative frequencies of BLASTn hits to a viral sequence database follow a relationship that can be approximated by a power-law equation of the type $y = a x^{-b}$.
Metagenomics has been greatly facilitated by recent advances in sequencing technology. Pyrosequencing (Roche/454 Life Sciences), as well as other technologies (e.g., Solexa, SOLiD), enable routine DNA sequencing on the scale of 10^8 bp. All of these new high-throughput methods replace traditional cloning in bacteria with mechanical separation of DNA molecules by some means (e.g., emPCR with DNA immobilized on beads for 454). This requires the creation of a minimally biased DNA library that better reflects the viral community in the sample. At present, the original DNA sample must often be amplified before sequencing, increasing the opportunity for artificial over- or underrepresentation of particular sequences. Despite this limitation, these methods appear to avoid most of the problems associated with conventional cloning, which is subject to strong sequence bias against some "unclonable" sequences. This phenomenon appears to be particularly pronounced in attempts to clone viral sequences.
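The diversity measures and the power-law rank-abundance relationship described above can be illustrated with a short Python sketch. It is not taken from the chapter: the natural logarithm in the Shannon-Wiener index, the evenness formula H'/ln(S), and the log-log least-squares fit are assumptions chosen for illustration, and the abundance values are made up.

```python
import math

def shannon_index(abundances):
    """Shannon-Wiener index H' = -sum(r_i * ln r_i) over non-zero genotypes."""
    total = sum(abundances)
    rel = [a / total for a in abundances if a > 0]
    return -sum(r * math.log(r) for r in rel)

def evenness(abundances):
    """Assumed evenness measure H' / ln(S); 1 means every genotype is equally abundant."""
    richness = sum(1 for a in abundances if a > 0)
    if richness < 2:
        return 0.0
    return shannon_index(abundances) / math.log(richness)

def fit_power_law(abundances):
    """Fit y = a * x**(-b) to a rank-abundance curve by least squares in log-log space.

    Needs at least two non-zero abundances. Returns (a, b).
    """
    ranked = sorted((a for a in abundances if a > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(y) for y in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

if __name__ == "__main__":
    # Hypothetical hit counts for 20 genotypes following an exact power law.
    counts = [1000 * rank ** -1.5 for rank in range(1, 21)]
    a, b = fit_power_law(counts)
    print(f"H' = {shannon_index(counts):.3f}, evenness = {evenness(counts):.3f}")
    print(f"power-law fit: a = {a:.1f}, b = {b:.2f}")
```

A community dominated by a few genotypes gives a steep curve (large b) and an evenness well below 1, which is the pattern the text describes for viral communities.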
Microarrays, as well, remain semiquantitative detection methods because it is impossible to simultaneously optimize the hybridization of thousands of individual sequences (Table 4.1).
Table 4.1 Sample (reference): viruses identified
Random RT-PCR
Plasma (Jones et al., 2005): PARV4, SAV-1, SAV-2
Nasopharynx (Allander et al., 2005): HBoV
Stool (Victoria et al., 2009): Cosavirus HCoSV
Nasopharynx (Nakamura et al., 2009): Influenza A, polyomavirus
Stool (Zhang et al., 2005): PMMV
Stool (adult) (Breitbart et al., 2003): phages
Blood: TTV, HHV3, SEN virus, phages
Stool (infant): phages
Respiratory tract (Willner et al., 2009a): HHV-1, HHV-2, phages
Oropharynx: Epstein-Barr virus
Stool (Reyes et al., 2009): phages
For further details see "Viral Metagenomics Methods".
The number of viruses in the human body can be estimated in two ways: from the number of phages expected given the size of the human microbiome and the typical viral (phage):host ratio, or from direct counts of viruses in samples from healthy individuals. The human body is composed of about 10^13 cells (Savage, 1977). There are about 10 times this number of microbial cells associated with the healthy human body (Savage, 1977). The observed ratio of 7-10 viral-like particles per microbial cell in environmental (Rohwer, 2003) and human samples (Furlan, 2009) means that we could expect to find about 10^15 phages in the body. It is possible to compare this prediction with results from recent studies. The data in Table 4.2 are from direct counts of viruses using epifluorescence microscopy. These data indicate the presence of approximately 3 × 10^12 viruses in the body.
Wherever microbes (bacteria and archaea) are present, their viruses will be found. Thus in the human body, the regions of high microbial levels, in particular the gut, also have the highest abundance of viruses. Other organ systems with mucous membranes, such as the nasal and oral cavities and vagina, harbor a smaller but significant viral community.
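The predicted phage count above is a product of three figures quoted in the text, and the arithmetic can be checked with a few lines of Python. The function and variable names are illustrative only; the input values are those cited in the chapter.

```python
def expected_phages(human_cells=1e13, microbes_per_human_cell=10,
                    vlp_per_microbe=(7, 10)):
    """Return (low, high) bounds on the predicted number of phages in the body."""
    microbial_cells = human_cells * microbes_per_human_cell  # about 10^14
    low, high = vlp_per_microbe
    return microbial_cells * low, microbial_cells * high

if __name__ == "__main__":
    low, high = expected_phages()
    observed = 3e12  # direct epifluorescence counts quoted in the text
    print(f"predicted: {low:.1e} to {high:.1e} phages")
    print(f"prediction exceeds direct counts by {low / observed:.0f}x to {high / observed:.0f}x")
```

The prediction (7 × 10^14 to 10^15) is more than two orders of magnitude above the direct-count estimate of 3 × 10^12, which is the comparison the text invites.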
Compared with environmental viral communities, the diversity of the human virome is low. We estimate that there are 1,500 viral genotypes in a typical healthy human virome. By contrast, 1 kg of marine sediment will contain at least ten thousand, and perhaps a million, viral genotypes. The human-associated viruses are unevenly distributed, with the bulk of the virome composed of a handful of dominant species (Table 4.3). In the limited data available to date, it appears that a disease state is correlated with an increase in the diversity of the virome (Willner et al., 2009a).
Table 4.3 (surviving row): Angly et al., 2005, 2006; 6.0-10.8; >3000; 0.85-1.00; 2.3-13
Most of the viruses are phages. There are also certain eukaryotic viruses, such as herpesviruses, anelloviruses, and papillomaviruses, that are ubiquitous in the human virome and tend to cause few problems considering their abundance (Virgin et al., 2009). See also Fig. 4.3.
Commensal microbes are ubiquitous in the healthy human body (Dethlefsen et al., 2007; Wilson, 2005), occupying niches on skin (Grice et al., 2008, 2009), distal gut (Gill et al., 2006; Turnbaugh et al., 2009), and vagina (Hyman et al., 2005). As a result, viruses that infect microbes (phages) are numerous (Letarov and Kulikov, 2009) and have been found in the gut (Reyes et al., 2009), nasopharynx (Allander et al., 2005), oropharynx, oral cavity (Hitch et al., 2004), blood, and lung secretions (Willner et al., 2009a). Phages comprise by far the majority of the human virome (Willner et al., 2009a) and can be expected to exert an influence on the human microbial community (Gill et al., 2006; Hendrix, 2005) that parallels the interactions observed in a variety of environmental samples (Letarov and Kulikov, 2009; Weinbauer, 2006; Rodriguez-Mueller et al., 2010). By killing specific host organisms, phages regulate the absolute and relative abundance of microbial species.
Genetic variation in the hosts is therefore favored as a means of escaping phage predation (Kunin et al., 2008). In addition, phages are major vehicles of DNA transfer to and from host cells (horizontal gene transfer) through both lytic and lysogenic pathways (Little, 2005), potentially conferring new phenotypes that can increase the pathogenicity or the fitness (Sharon et al., 2009; Wagner and Waldor, 2002) of the host. Analysis of the phage metagenome can thus not only provide information about potential host taxonomy, but also reveal potential metabolic pathways available to the microbial community (Willner et al., 2009a; Sharon et al., 2009). Box 4.1 shows the "core" phage metagenome found in the human lower respiratory tract: 19 phage types that were all present in five normal control subjects and five cystic fibrosis patients (Willner et al., 2009a) (Fig. 4.2).
Fig. 4.2 Viruses were purified and concentrated by CsCl density gradient centrifugation as described in Breitbart et al. (2003). The VLPs were visualized by capturing on a 0.02-μm Anodisc filter, SYBR Gold staining, and viewing by epifluorescence microscopy.
Viruses capable of infecting the human host ("eukaryotic viruses"), while obviously present in diseased individuals, can also be found in healthy subjects (Virgin et al., 2009; Willner et al., 2009a, 2010). In asymptomatic subjects, the abundance of these viruses is far lower than that of phages in the healthy human body (Willner et al., 2009a). Depending on the area of the body under examination, the presence of eukaryotic viruses will be due to either transient environmental exposure of accessible regions (e.g., the lungs) or chronic infections that do not give rise to recognizable clinical symptoms.
The lack of symptoms might reflect a low-level viral infection that is successfully suppressed by the immune system at an early stage, or perhaps a commensal virus that causes no apparent harm (Virgin et al., 2009; Stapleton et al., 2004; Okamoto, 2009; Antonsson et al., 2000). An example of the latter is Torque Teno Virus (TTV), which was originally thought to be associated with a form of hepatitis, but now seems likely to be a ubiquitous but benign commensal virus (Okamoto, 2009). Instances of true viral-human mutualism in this context are not yet well understood, but it has been suggested that co-infection with GB Virus Type C (originally termed Hepatitis G virus) reduces mortality in HIV-infected individuals (Stapleton et al., 2004). Box 4.2 shows the "core" eukaryotic viral metagenome found in the human lower respiratory tract: 20 viruses that were all present in five normal control subjects and five cystic fibrosis patients (Willner et al., 2009a).
We can characterize viruses by their persistence (residence time in the body) and the degree of mutualism they exhibit (Fig. 4.3). The viruses that comprise the core human virome are relatively persistent (never cleared from the body). This distinguishes them from pathogenic viruses causing acute and short-lived infections. There are, however, a number of pathogenic viruses such as herpesviruses that may persist in the body in an intracellular form, only to cause sporadic shedding of viral particles. Still other viruses are transient but common members of the human virome. Plant viruses such as PMMV are taken in with food and pass directly through the digestive tract (Zhang et al., 2005).
Investigation of the human virome has recently been accelerated by technological and methodological developments. The methods fall into three categories: viral nucleic acid isolation, DNA sequencing, and data analysis. For a review of methods in viral metagenomics see Delwart (2007).
The SARS coronavirus was discovered by hybridizing nucleic acids to an array (Virochip) that contained sequences representing all fully sequenced viruses, physically removing the annealed DNA from the array, and PCR amplifying this DNA using primers complementary to linkers that had been added (Kistler et al., 2007; Wang et al., 2002; Chiu et al., 2008). The prime example of this approach is the cloning and sequencing of the SARS coronavirus (Ksiazek et al., 2003). Limitations of the method are that it will only succeed with viruses that share significant homology with previously known viruses and that simultaneous optimization of multiple hybridizations on an array may be impossible.
There are several variations of randomly primed reverse-transcription PCR (RT-PCR) for amplification of RNA viral sequences. Viral RNA is converted to cDNA using primers containing random octamers for both first- and second-strand synthesis, followed by PCR amplification. These methods have been successful in identifying many RNA viruses from human samples. Examples can be found in Victoria et al. (2009), Nakamura et al. (2009), and Jones et al. (2005). The method may be limited by PCR amplification bias, but it is highly sensitive.
DNA viral metagenomes, including many phages, have been sequenced by purification of viral particles by CsCl density gradient centrifugation, DNase treatment, DNA isolation, and random amplification with Phi29 DNA polymerase. Examples are respiratory tract metagenomes (mostly phages) from CF and non-CF subjects (Willner et al., 2009a) and an oropharyngeal metagenome from pooled samples from 19 healthy individuals. A limitation is potential amplification bias (Phi29 polymerase favors small circular and large linear genomes). This method has proved more successful for DNA than for RNA viruses.
Due to the "untargeted" nature of metagenomics, and the often unavoidable contamination of viral nucleic acids with large amounts of human DNA, high-throughput sequencing has been essential. To date, the Roche/454 Life Sciences GS-FLX platform has been at the forefront of this technology, particularly because long sequence reads are necessary for shotgun sequencing. Sequencing technology is currently experiencing an unprecedented expansion, however, and it would not be surprising to see a series of further significant changes in sequencing methodology in the near future.
Data analysis is often the most challenging aspect of metagenomics research because the results are not pre-filtered by culturing or another selection process. The desired information must be extracted from a very large data set. Bioinformatics methods can be divided overall into two categories: similarity-based and similarity-independent approaches.
The original and more conventional means of sequence data analysis is to find segments of similarity to known sequences by searching databases. The most common tools are the various versions of BLAST (McGinnis and Madden, 2004), which will find local similarities based on the nucleic acid sequence or the deduced amino acid sequence. Microarray hybridization patterns have also been used to characterize novel viral nucleic acids (Urisman et al., 2005). These approaches are limited when the sample contains novel viruses that share little similarity with known viruses. Viruses in particular are subject to great variations in sequence composition. A large percentage of the sequences in a typical viral metagenome will not resemble any known sequences with any significance. A metagenome can be characterized not only by taxonomy, but also by the cumulative metabolic potential encoded by the metagenome (Meyer et al., 2008).
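As an illustration of the similarity-based approach, the sketch below tallies database hits per read and turns them into relative frequencies of the kind plotted in a rank-abundance curve. It assumes BLAST results saved in the standard 12-column tab-separated format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The E-value cutoff of 1e-5, the function names, and the sample records are hypothetical and are not taken from the chapter.

```python
from collections import Counter

def best_hits(blast_lines, max_evalue=1e-5):
    """Keep the highest-bit-score hit per read from 12-column tabular BLAST output."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed records
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit not significant under the assumed cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best

def rank_abundance(best):
    """Return [(subject, relative frequency)] sorted from most to least abundant."""
    counts = Counter(subject for subject, _ in best.values())
    total = sum(counts.values())
    return [(subject, n / total) for subject, n in counts.most_common()]

if __name__ == "__main__":
    # Made-up records: read, hit, %id, len, mism, gaps, qs, qe, ss, se, evalue, bits
    sample = [
        "read1\tphageA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-30\t180",
        "read2\tphageA\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-25\t160",
        "read3\tvirusB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-40\t190",
        "read3\tphageA\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t90",
        "read4\tvirusB\t70.0\t50\t15\t0\t1\t50\t1\t50\t0.5\t30",
    ]
    for subject, freq in rank_abundance(best_hits(sample)):
        print(f"{subject}\t{freq:.2f}")
```

Reads with no significant hit simply drop out of the tally, which is why this kind of summary describes only the known fraction of a viral metagenome.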
In the case of viral sequence data derived from lung sputum from CF patients and healthy subjects, the disease state of individuals correlated more strongly with the metabolic potential of viral metagenomes than with the taxonomic analysis (Willner et al., 2009a). In many cases the phage community appears to carry genes that complement the functions of the microbial community. In particular, phages often seem to use genes for proteins that will increase the short-term energy output of the host cells, either to increase viability (lysogeny) or to boost the production of viral particles (lytic). Some bacteria, such as Vibrio cholerae, are dependent on phage infection to achieve their virulence.
More recently, similarity-independent methods have been developed that do not require database searches. For example, PHACCS (Angly et al., 2005) uses contig spectra derived from the sequence data to infer the diversity of genotypes present in the original sample. Other methods enable the comparison of one metagenome to another on the basis of relative abundance of shared sequences. These methods will not identify the unknown viruses, but they can help to characterize the sample by defining the overall complexity of the community. Other methods involve analysis based on the percent G/C content of genomes or the relative frequency of various dinucleotide combinations (Karlin et al., 1997; Burge et al., 1992; Karlin, 1998; Willner et al., 2009b), which in some cases is diagnostic of particular taxa.
When viruses are purified from any human or environmental sample, the extracted DNA inevitably yields a large number of sequences (usually 70-99%) that show no significant similarity to any known sequences (Fig. 4.4) (Willner et al., 2009a; Jones et al., 2005). Provided that adequate precautions have been taken to avoid contamination with nonviral nucleic acids, this suggests that a very large fraction of the existing viral diversity remains uncharacterized.
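The compositional measures mentioned above (percent G/C content and dinucleotide relative frequencies) need no database search and can be sketched in a few lines of Python. This is an illustrative simplification rather than the procedure of the cited studies: it uses the odds ratio rho_xy = f_xy / (f_x f_y) computed on a single strand, without the reverse-complement symmetrization used in published dinucleotide signature work, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def gc_content(seq):
    """Fraction of unambiguous bases that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in BASES)
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def dinucleotide_odds(seq):
    """Odds ratio rho_xy = f_xy / (f_x * f_y) for all 16 dinucleotides."""
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    pairs = Counter(
        seq[i:i + 2] for i in range(len(seq) - 1)
        if set(seq[i:i + 2]) <= set(BASES)  # ignore pairs with ambiguous bases
    )
    n_mono = sum(mono.values())
    n_pairs = sum(pairs.values())
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = pairs[x + y] / n_pairs if n_pairs else 0.0
        rho[x + y] = observed / expected if expected else 0.0
    return rho

def signature_distance(seq_a, seq_b):
    """Mean absolute difference between the two 16-value odds-ratio profiles."""
    rho_a, rho_b = dinucleotide_odds(seq_a), dinucleotide_odds(seq_b)
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16

if __name__ == "__main__":
    a = "ACACACACGTGTGTGT"
    b = "AATTAATTGGCCGGCC"
    print(f"GC content: {gc_content(a):.2f} vs {gc_content(b):.2f}")
    print(f"signature distance: {signature_distance(a, b):.3f}")
```

A small distance between two profiles suggests similar composition; on real data the profiles would be computed over long sequences or whole metagenomes rather than short strings like these.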
One of the strengths of the "untargeted" approach to viral metagenomics is that these sequences are obtained, but understanding the origin and significance of the "unknown" viral sequences is a substantial bioinformatic challenge that has yet to be solved. If a sequence has no similarity to the DNA of known organisms as defined by BLAST (McGinnis and Madden, 2004) or similar search algorithms, other methods must be developed for this purpose. For example, genome organization patterns such as large-scale arrangements of open reading frames or regulatory elements (promoters, enhancers, and origins of replication) may be signatures that would identify sequences as being of viral origin. This approach would likely require long sequences or even complete genomes to be successful.
An accurate assessment of the normal human virome provides a reference point from which to detect any novel viruses. This will serve as a background against which an emerging pathogen or bioterrorism agent would appear in the human population through suitable screening programs. The health of the human subject should be judged by variation from the true "community" that it is, not by the assumption that no nonhuman entities should be present. This is analogous to restoration of a disturbed ecosystem. Knowledge of the normal viral community and assessment of any perturbations found in patients may enable physicians to diagnose disturbances of the microbiome.
Glossary
amplification bias: Inaccurate representation of the true relative abundances of genotypes in a DNA sample that has been subjected to nonspecific amplification methods such as MDA or PCR.
BLAST (Basic Local Alignment Search Tool): An algorithm used to search nucleic acid and protein databases for sequences similar to a query sequence (McGinnis and Madden, 2004).
commensalism: A form of symbiosis that benefits one partner while providing no apparent benefit to the other.
community: A set of interacting populations in an ecosystem.
diversity: A measure of the range of variation in a community, frequently represented as a combination of richness (number of variants) and evenness (skewness of the distribution).
emPCR: PCR performed in a water-in-oil emulsion, so that each micelle functions as a microreactor containing a single amplicon.
evenness: An index of the skewness of variation: an evenness value close to 0 implies that a community is dominated by one or very few members; a value of 1 implies equal abundance of every member.
genome: The nucleic acid (DNA or RNA) that constitutes genetic information from a single organism.
genotype: A genetic subtype that can be distinguished in a sample. In practical terms, two sequences will often be considered to represent the same genotype if they overlap by at least 35 base pairs with 98% identity.
hybridization (molecular biology): The annealing of complementary single-stranded DNA or RNA.
MDA (multiple displacement amplification): DNA amplification using random primers in an isothermal reaction with a polymerase with helicase activity (Phi29 DNA polymerase), capable of nonspecific replication of double-stranded DNA.
metagenome: The total genomic nucleic acid (DNA and/or RNA) derived from a community.
mutualism: A form of symbiosis that benefits both partners.
population: The total set of members of a genetically distinguishable species or genotype in a defined biome.
richness: The total number of distinct species or genotypes that can be distinguished in a community.
sequence read: A term frequently used to describe a sequence obtained by high-throughput methods.
Shannon-Wiener index (H'): One of the several measures of community diversity. A high value is associated with high richness and evenness values.
species: A genomic subtype that constitutes a genetic lineage or population that exists in a sample or biome.
Due to the genomic plasticity of viruses and microbes it can be challenging to define a species, hence the use of the term genotype in a DNA sample when species definition or identification is problematic.
symbiosis: Any association between two organisms.
viremia: The presence of viruses in the blood.
virome: The cumulative viral community in an ecosystem.
References
Cloning of a human parvovirus by molecular screening of respiratory tract samples
PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information
The marine viromes of four oceanic regions
The ubiquity and impressive genomic diversity of human skin papillomaviruses suggest a commensalic nature of these viruses
Method for discovering novel DNA viruses in blood using viral particle selection and shotgun sequencing
Phage ecology and bacterial pathogenesis
Viral diversity and dynamics in an infant gut
Metagenomic analyses of an uncultured viral community from human feces
Over- and under-representation of short oligonucleotides in DNA sequences
Identification of cardioviruses related to Theiler's murine encephalomyelitis virus in human infections
Viral metagenomics
An ecological and evolutionary perspective on human-microbe mutualism and disease
Viral and microbial dynamics in the human respiratory tract
Metagenomic analysis of the human distal gut microbiome
A diversity profile of the human skin microbiota
Topographical and temporal diversity of the human skin microbiome
Molecular identification of bacteria in bronchoalveolar lavage fluid from children with cystic fibrosis
Microbial ecology of the cystic fibrosis lung
Bacteriophage evolution and the role of phages in host evolution
Isolation of bacteriophages from the oral cavity
Microbes on the human vaginal epithelium
New DNA viruses identified in patients with acute viral infection syndrome
Global dinucleotide signatures and analysis of genomic heterogeneity
Compositional biases of bacterial genomes and evolutionary implications
Pan-viral screening of respiratory tract infections in adults with and without asthma reveals unexpected human coronavirus and human rhinovirus diversity
A novel coronavirus associated with severe acute respiratory syndrome
A bacterial metapopulation adapts locally to phage predation despite global dispersal
The bacteriophages in human- and animal body-associated microbial communities
Lysogeny, prophage induction, and lysogenic conversion
BLAST: at the core of a powerful and diverse set of sequence analysis tools
The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes
Direct metagenomic detection of viral pathogens in nasal and fecal specimens using an unbiased high-throughput sequencing approach
History of discoveries and pathogenicity of TT viruses
The human body as microbial observatory
Phages in the distal human gut
Viruses and Koch's postulates
Viral and microbial community dynamics in four aquatic environments
Characterization of bacterial community diversity in cystic fibrosis lung infections by use of 16S ribosomal DNA terminal restriction fragment length polymorphism profiling
Global phage diversity
Microbial ecology of the gastrointestinal tract
Photosystem I gene cassettes are present in marine virus genomes
GB virus type C: a beneficial infection?
A core gut microbiome in obese and lean twins
E-Predict: a computational strategy for species identification based on observed DNA microarray hybridization patterns
Metagenomic analyses of viruses in stool samples from children with acute flaccid paralysis
Redefining chronic viral infection
Bacteriophage control of bacterial virulence
Microarray-based detection and genotyping of viral pathogens
Ecology of prokaryotic viruses
Metagenomic analysis of respiratory tract DNA viral communities in cystic fibrosis and non-cystic fibrosis individuals
Metagenomic detection of phage-encoded platelet-binding factors in the human oral cavity
Metagenomic signatures of 86 microbial and viral metagenomes
Microbial inhabitants of humans: their ecology and role in health and disease
RNA viral community in human feces: prevalence of plant pathogenic viruses