key: cord-325750-x7jpsnxg
authors: Mokili, John L; Rohwer, Forest; Dutilh, Bas E
title: Metagenomics and future perspectives in virus discovery
date: 2012-01-20
journal: Curr Opin Virol
DOI: 10.1016/j.coviro.2011.12.004
sha: 
doc_id: 325750
cord_uid: x7jpsnxg

Monitoring the emergence and re-emergence of viral diseases with the goal of containing the spread of viral agents requires both adequate preparedness and quick response. Identifying the causative agent of a new epidemic is one of the most important steps for effective response to disease outbreaks. Traditionally, virus discovery required propagation of the virus in cell culture, a proven technique responsible for the identification of the vast majority of viruses known to date. However, many viruses cannot be easily propagated in cell culture, thus limiting our knowledge of viruses. Viral metagenomic analyses of environmental samples suggest that the field of virology has explored less than 1% of the extant viral diversity. In the last decade, the culture-independent and sequence-independent metagenomic approach has permitted the discovery of many viruses in a wide range of samples. Phylogenetically, some of these viruses are distantly related to previously discovered viruses. In addition, 60–99% of the sequences generated in different viral metagenomic studies are not homologous to known viruses. In this review, we discuss the advances in the area of viral metagenomics during the last decade and their relevance to virus discovery, clinical microbiology and public health. We discuss the potential of metagenomics for characterization of the normal viral population in a healthy community and identification of viruses that could pose a threat to humans through zoonosis. In addition, we propose a new model of the Koch's postulates named the ‘Metagenomic Koch's Postulates’. Unlike the original Koch's postulates and the Molecular Koch's postulates as formulated by Falkow, the metagenomic Koch's postulates focus on the identification of metagenomic traits in disease cases. These are traits that can be traced after healthy individuals have been exposed to the source of the suspected pathogen.

Direct-count epifluorescence and transmission electron microscopy have shown that viruses are highly abundant in most environments. Bergh et al. demonstrated that 1 l of seawater can contain as many as 10^10 virus-like particles (VLPs) [1], approximately 10 times more than the number of prokaryotes. Terrestrial environments often have 10^9 VLPs per gram. By extrapolation from the estimated number of prokaryotes in different environments [2], viruses are the most abundant entities in the biosphere, totaling an estimated 1.2 × 10^30, 2.6 × 10^30, 3.5 × 10^31, and 0.25-2.5 × 10^31 in the open ocean, in soil, and in oceanic and terrestrial subsurfaces, respectively.
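The order of magnitude of the global total follows from summing the four environment-specific estimates quoted above. The following is a minimal Python sketch of that arithmetic; the dictionary labels are illustrative names, and the terrestrial subsurface is kept as a (low, high) range because the text gives it as one.

```python
# Estimated virus counts per environment, as quoted in the text.
# Each entry is a (low, high) pair; only the terrestrial subsurface
# is reported as a range.
estimates = {
    "open ocean": (1.2e30, 1.2e30),
    "soil": (2.6e30, 2.6e30),
    "oceanic subsurface": (3.5e31, 3.5e31),
    "terrestrial subsurface": (0.25e31, 2.5e31),
}

low_total = sum(low for low, _ in estimates.values())
high_total = sum(high for _, high in estimates.values())

print(f"estimated total: {low_total:.2e} to {high_total:.2e} viruses")
```

Under these figures the total is on the order of 10^31, dominated by the subsurface environments.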

In the human holobiont, the 10^13 human cells are outnumbered 10-fold by bacteria and 100-fold by viruses. Viral acquisition starts early in life, in utero or perinatally during the first few weeks after birth, as demonstrated by studies of the gut viral communities in infants. While no VLPs could be detected in the earliest infant stool samples, there were ~10^8 virus particles per gram wet weight of feces by the end of the first week [2]. The majority of these VLPs appear to be bacteriophages, the bacteria-infecting viruses [2-4].

Culture techniques have been the gold standard for the detection of viruses for over a century. Despite the knowledge gained using the cultivation of viruses in cell culture, the consensus is that we have barely begun to chart the viral world, which is the 'dark matter' of the biological universe and a rich source of future discoveries [3] . Since the vast majority of viruses are not easily cultivatable, exploration of this dark matter requires culture-independent methods with larger detection coverage than culture.

While the sequencing of the 16S fragment of the small subunit of the ribosomal RNA (rRNA) gene has a proven track record for the detection of known and novel cellular organisms [4-10], this technique is not applicable to viruses because they lack the gene. Indeed, viruses do not share any common gene that could similarly qualify as a unified phylogenetic marker [11].

Metagenomics is an alternative culture-independent and sequence-independent approach that does not rely on the presence of any particular gene in all the subject entities. This approach was originally developed as a tool for 'functional and sequence-based analysis of collective microbial genomes contained in environmental samples' [12, 13]. Early metagenomic studies analyzing the genetic content of environmental samples yielded the identification of metabolic traits, the characterization of organisms and the discovery of new antibiotics and enzymes [12-16].

Metagenomic studies now encompass a wide scope of research fields including marine environmental research, plant and agricultural biotechnology, human genetics and diagnostics of human diseases. Accordingly, the number of metagenomics papers in peer-reviewed journals has increased greatly since 2002 (Figure 1a). The scope of applications for metagenomics will likely widen from environmental microbiome studies to routine clinical diagnostics for palliative care of patients, public health, industry and beyond.

The first application of metagenomics to the field of virology was in the analysis of the viral communities sampled at two near-shore marine locations in San Diego [17]. Since then, it has been used to survey viruses in numerous environments including freshwater, marine sediment, soil and the human gut. Figure 1b shows an overview of diverse areas where the metagenomic approach has been applied for virus discovery since 2002. The success of these studies relied upon the advances observed in the past decade in the area of sequencing technology and in bioinformatics. Although the fundamental concept of metagenomics has not changed, several technical advances have proven valuable for the discovery of previously unidentified, uncultured viruses. While metagenomics originally depended upon cloning for the analysis of double-stranded DNA genomes [17-20], high-throughput sequencing technologies can now be applied to all types of genomes, including single-stranded DNA and RNA [21].

Figure 1 (legend, partial). References: [... 62, 66, 70, 71, 74, 84-86, 88, 104-123, 148, 161, 162]. M, main characterization method used: 454NGS, 454 high-throughput sequencing using the GS FLX or GS Titanium platform; sg-Sanger, shotgun library with Sanger sequencing method. S, sample type: ip, insect pool; sb, skunk brain; int, intestine; panc, pancreas; hf, human feces; se, sewer effluent; ms, marine sediment; NPA, nasopharyngeal aspirates.

Historically, diseases caused by viruses have been known before the discovery of their causative agents. The acquired immunodeficiency syndrome (AIDS), poliomyelitis, cervical cancers, and Burkitt's lymphoma were identified before their causative agents. Whereas poliomyelitis was documented in ancient Egyptian literature as early as approximately 3700 BC [22], poliomyelitis virus was not discovered by Landsteiner and Popper until 1909 [23]. Descriptions of clinical conditions likely to have been smallpox have been found in ancient literature from Egypt (1100-1580 BC), China (1122 BC) and India (1500 BC), long before both Jenner's discovery of smallpox vaccination and the later isolation of variola virus [24-26].

Looking ahead, it appears that the metagenomic approach will generate a plethora of genetic information from unknown and potentially infectious agents, some of which could be associated with human diseases. The discovery of viruses will start to precede the characterization of the diseases they cause, well before the pathogenicity of these agents is defined.

At this turning point in history, important questions need to be answered. For example, how far has this new viral metagenomics discipline evolved in its first decade? What has been learned so far that can be applied to viral discovery and the forecasting of future viral outbreaks? In this article, we review virus discovery techniques with a focus on metagenomic approaches that employ high-throughput sequencing technologies to characterize novel viruses.

Before the advent of molecular methods, many techniques including filtration, tissue culture, electron microscopy (EM), serology and vaccination were used for the detection of viruses. In 1892, Ivanovski demonstrated the presence of infectious agents, coined 'virus' by Beijerinck in 1898, in the filtrate of infected leaves passed through a Chamberland filter. This marks the discovery of the tobacco mosaic virus [27] and the birth of a new era in virology. Until then, the field of virology was not clearly defined. The instrumentation, from the discovery of tissue culture to modern molecular biology methods, has shaped the field and helped to discover many viruses. Since the invention of the technique of tissue culture in 1907 and the propagation of poliovirus in animal cells in 1909, cultivation of viruses has remained the gold standard for virus discovery for over a century [28-30]. Despite the achievements made by the culture technique, several limitations have hindered the discovery and detection of viruses in routine laboratory settings. Virus propagation requires the development of controlled conditions that mimic the natural ecosystem shared between viruses and their hosts [31].

The invention of the electron microscope in 1933 provided the first visual proof of a virus. However, this technique is relatively expensive, tedious and lacks both sensitivity and specificity. Alternatively, serology can provide a hint of the acquisition of novel viruses, as was the case for hepatitis C virus [32, 33], before the viral agents have been cultured or viewed by electron microscopy. The immune sera method has shown little value for virus discovery. The inoculation method, however, not only helped to identify novel viruses, but also was used as an immunization method to confer cross-protection against closely related viruses. Indeed, the cowpox-based inoculation developed by Jenner in 1796 was the first effective vaccine against an infectious disease. Nearly two centuries later, this strategy was used to eradicate smallpox. However, it is unlikely that Jenner's method would pass the scrutiny of modern ethical review boards for vaccine or virus discovery [34].

The trends in clinical virology practices show gradual substitution of the traditional virus discovery methods with novel molecular biology technology. Nevertheless, traditional techniques and the newer molecular biology techniques to isolate, identify, and characterize viruses play complementary roles in the viral discovery effort. For a comprehensive list and detailed description of molecular methods used for virus discovery, readers are referred to reviews by Delwart [31] and Tang [35]. Here, we focus on the viruses discovered using these methods and their future applications in clinical microbiology and public health settings.

Two types of molecular methods have been used for the virus discovery effort: sequence-dependent and sequence-independent methods.

Sequence-dependent methods, including PCR using consensus primers and hybridization methods such as microarrays, require prior knowledge of the nucleic acid sequence for the detection of novel viruses. Indeed, consensus sequences of previously known viruses have been used to identify novel viruses including highly divergent clades of human immunodeficiency virus [36], simian retroviruses [37-40], and hepatitis E virus [41]. However, PCR using consensus primers based on previously characterized viruses has little or no value in detecting completely novel viruses. The microarray techniques were first introduced in 1995 to monitor the expression of multiple genes simultaneously [42]. For virus discovery, microarrays can be prepared with probes that hybridize to known viral sequences and to potentially novel viruses with sufficient sequence similarity. The method has been applied to detect a wide range of known viruses as well as novel highly divergent viral taxa [43]. Microarray screening has led to the identification and characterization of a novel gammaretrovirus, xenotropic murine leukemia virus-related virus (XMRV), in prostate tumors [43, 44]. Subsequent studies did not confirm these initial findings [45, 46], which points to potential limitations of the method. Another example of a well-known virus discovered with microarrays is SARS-CoV, a highly divergent coronavirus discovered amid a worldwide outbreak of the severe acute respiratory syndrome (SARS) in 2003 [43]. Reproducibility of results between microarray tests is frequently poor [47].

Unlike PCR and microarrays, the sequence-independent approaches do not rely on prior knowledge of viruses in the samples. Suppression subtractive hybridization (SSH) and representational difference analysis (RDA) are examples of sequence-independent virus discovery methods. SSH was used first to study gene expression [48] and was later applied to investigate the etiology of diseases of unknown origin [49]. By hybridizing DNA obtained from patients and control subjects, nucleic acid from an unknown pathogen(s) can be detected [49-51]. Use of RDA led to the discovery of human herpesvirus 8 (HHV8) [52], Torque Teno virus (TTV) [53], the GBV-A and GBV-B viruses [54] and a novel highly divergent murine norovirus [55]. This method lacks sufficient sensitivity to detect viruses when the viral burden is low or when the DNA sequence of the suspected etiological agent is not clearly distinguishable from the control sample [31].

Sequence-independent single-primer amplification (SISPA) circumvents the viral load limitation of SSH. Although there are several variations to the original protocol published by Reyes et al. [56], the main strategy of SISPA is to exploit the sensitivity and the specificity of PCR amplification using primers that bind oligonucleotide fragments ligated to any putative viral DNA materials in the sample. SISPA has been modified to allow the detection of both DNA and RNA viruses after the removal of genomic and contaminating nucleic acids [57]. The SISPA method was used successfully for the discovery of Hepatitis E virus [58, 59], Norwalk virus [60], Human astrovirus [61, 62], and Parvoviruses 2 and 3 [63]. Another sequence-independent technique, viral metagenomics (described in detail below), provides a greater capability to detect known and unknown viruses than the traditional methods and the molecular sequence-dependent and sequence-independent methods described above.

Compared to virus discovery approaches outlined above, viral metagenomics is less biased. Potentially, any viruses in the samples, culturable or unculturable, known or novel can be readily detected with the viral metagenomic approach.

Viral metagenomic methods have evolved significantly since they were first developed. In early studies [17-20], preliminary sample preparation involved shearing of DNA and cloning. These steps were required in order to obtain sufficient DNA given the low amount of viral DNA in environmental samples (~10 mg/100 l of sea water). Because viral DNA often contains modified nucleotides and because some viral genes (e.g. holins and lysozymes) are toxic to cells, the DNA was randomly sheared to produce small fragments before cloning [17-20]. The process of sample preparation has since been streamlined and the sequencing speed increased with the advent of high-throughput sequencing technologies. The replacement of cloning with high-throughput methods has revolutionized metagenomics.

There are several high-throughput sequencing platforms commercially available that vary by the sequencing principle, the sequencing speed, the cost and read length. An overview of a typical viral metagenomic protocol that can be used in a virus discovery study is provided in Figure 2. Essentially, a metagenomic analysis involves three main steps: (1) sample preparation, (2) high-throughput sequencing and (3) bioinformatic analysis. Below we provide an outline of each of these steps. More detailed descriptions have been previously published [64].

Sample preparation. Theoretically, any type of sample can be analyzed using the metagenomic approach, including seawater [65], blood [66], horse feces [67], stool [20, 68-71], marine sediments [18], coral tissues [72, 73], and hot springs [74]. Because viral genomes are relatively short, bacterial or eukaryotic nucleic acids can severely interfere with the isolation and detection of viral DNA or RNA that typically represents only a small fraction. Thus, removal of non-viral nucleic acid is necessary [64, 75]. Homogenization, filtration and ultracentrifugation are often necessary to concentrate the viral particles present in the sample (Figure 2). To ensure that viruses are not lost during the virus preparation, epifluorescence microscopy with SYBR-gold staining is used on aliquots of samples obtained after the homogenization, filtration, and chloroform treatments to monitor the presence of VLPs [64].

Figure 2. Flow chart for the generation of a viral metagenome using high-throughput sequencing.

Chloroform treatment followed by DNase digestion is used to remove contaminating DNA. The chloroform disrupts mitochondrial, bacterial and eukaryotic membranes, thereby exposing non-viral DNA to the subsequent nuclease treatment [76, 77]. Unfortunately, chloroform treatment may also cause enveloped viruses to lose their protective lipid membrane, thereby rendering their DNA subject to DNase digestion [66]. Moreover, DNase treatment does not always completely eliminate non-viral DNA in the sample [63, 64]. After extraction, DNA may need to be amplified with random primers [78, 79]. The Whole Transcriptome Amplification (WTA) kit can be used for the synthesis of cDNA from viral RNA [80].

Single virus genomics (SVG) was introduced by Allen and collaborators to selectively isolate viruses before sequencing [81]. SVG uses flow cytometry to sort viruses based on a method originally described by Brussaard et al. [82]. Following the sorting, DNA of different sizes is immobilized in agarose gel, and then amplified using the multiple displacement amplification (MDA) method. The SVG approach can also be applied to RNA viruses provided a reverse-transcription step is inserted between the flow cytometry and MDA.

High-throughput sequencing. Early metagenomic applications involved the generation of shotgun libraries and direct sequencing of the total DNA content using the Sanger enzymatic dideoxy-sequencing method. This approach permitted the discovery of novel phages in marine environments [61, 66]. The Sanger technique had been the standard method for sequencing since it was first described in 1977 [83]. Development of the 'next-generation' sequencing platforms offered the combined advantages of speed, automation and high throughput, thereby increasing sequencing capabilities by a factor of 100 to a million relative to the Sanger technology.

The Illumina/Solexa and Roche 454 next-generation sequencing platforms have been used most often in virus discovery (Figure 1). The Illumina/Solexa method is based on sequencing-by-synthesis chemistry using fragments of the sample DNA ligated to oligonucleotide adapters. The adapters on a solid support act as primers for DNA polymerase to incorporate reversible terminator nucleotides, each labeled with a different fluorescent dye. A typical sequencing run can generate up to 18 gigabases of data with an average read length of 75-100 nucleotides [21]. The Sweetpotato badnavirus and the Sweetpotato mastrevirus are examples of viruses discovered using the Illumina/Solexa sequencing platform [84].
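The yield and read-length figures quoted above imply an approximate number of reads per run. A minimal Python sketch of that arithmetic follows; the function name is a hypothetical helper, and the calculation assumes that every read has exactly the stated length.

```python
def reads_per_run(total_bases: float, read_length: int) -> float:
    """Approximate number of reads implied by a run yield and a read length."""
    return total_bases / read_length

illumina_yield = 18e9  # up to 18 gigabases per run, as quoted in the text

for read_length in (75, 100):
    n_reads = reads_per_run(illumina_yield, read_length)
    print(f"{read_length} nt reads: about {n_reads:.2e} reads per run")
```

Under these assumptions a full run corresponds to roughly 180 to 240 million reads.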

The 454 FLX titanium pyrosequencer commercialized by Roche has been the most used for the discovery and characterization of novel viruses (http://www.454.com/publications-and-resources/publications.asp?postback=true). This platform was used for the identification of an uncharacterized mycovirus [85], Solenopsis invicta virus 3 [86], Merino Walk virus and a new arenavirus [87, 88], among others (Figure 1b). For sequencing, DNA is fragmented and ligated to specific biotinylated linkers. The DNA/linker complex is attached to streptavidin-coated beads that anchor the DNA inside a droplet of water and PCR reagents in an oil emulsion. Each fragment is first amplified to produce the template for the sequencing reaction. Sequencing is carried out by annealing primers to the linker portion of the template complex, followed by the incorporation of nucleotides by DNA polymerase, which extends the complementary DNA strand. The pyrophosphate released by each incorporation is measured through the production of light [89, 90]. The intensity of the light signal, which is captured by a charge-coupled device (CCD) camera and converted into digital data, is proportional to the number of nucleotides incorporated [91, 92]. A typical optimum run using a 454 pyrosequencer yields about one million reads with an average length of 350-450 nucleotides, totaling about 0.4 gigabases.
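Because the light signal scales with the number of nucleotides incorporated in each flow, a series of light intensities can be translated back into a base sequence. The Python sketch below illustrates the idea under simplifying assumptions that are not taken from the source: a cyclic flow order of T, A, C, G, signals already normalized so that 1.0 corresponds to one incorporated base, and simple rounding as the base-calling rule. Real base callers are considerably more elaborate.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Turn per-flow light intensities into a base sequence.

    Each flow offers a single nucleotide; the rounded signal is taken as
    the number of identical bases incorporated in that flow, so a signal
    near 2 is read as a homopolymer of length 2.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]  # cyclic flow order (assumed)
        count = int(round(signal))                    # homopolymer length
        bases.append(nucleotide * count)
    return "".join(bases)

# Hypothetical normalized signals for eight consecutive flows.
signals = [1.02, 0.05, 0.0, 2.1, 0.0, 0.97, 1.1, 0.0]
print(decode_flowgram(signals))
```

Rounding is the weak point of this scheme: as homopolymers get longer, neighboring signal levels become harder to tell apart, which is one reason why such runs are a known source of error in pyrosequencing reads.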

Bioinformatic analyses. The analysis of the copious data generated by high-throughput sequencing is the most challenging aspect of metagenomics. An inherent difficulty in assigning taxonomic designations to viral sequences is that there is no universally homologous nucleic acid component present in all viruses that can be used to build phylogenetic trees, a factor that also fuels the debate over whether or not viruses belong in the tree of life [11, 93-96]. In most metagenomic studies, sequences generated by high-throughput sequencing are compared by homology search tools to previously documented sequences stored either in a local database or in public databases such as GenBank. Unfortunately, homology searches against known sequences in GenBank cannot characterize unknown viruses (Figure 3).

The analysis of metagenomic libraries requires fast computation and the right algorithms to characterize sequences as belonging to putative viruses. To ensure that bioinformatic analyses are performed only on high quality data, the reads are typically processed through a software pipeline to remove any background sequences, including host and bacterial DNA that was not removed by the filtration, chloroform, and DNase I treatments [97-99]. The resulting sequence reads are assembled with strict parameters to generate contigs, each made of sequences derived from the same organism or quasi-species. Using stringent assembly parameters is critical to avoid sequence chimerization. The contig sequences are then compared to the GenBank non-redundant nucleotide database using BLAST [100] or USEARCH [101]. Note that a database containing only viral sequences cannot identify bacterial, archaeal or eukaryotic sequences and therefore leads to an overestimation of the fraction of unknowns (see below).
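The effect of the reference database on the apparent fraction of unknowns can be illustrated with a toy example. The Python sketch below assumes that each contig has at most one best hit, recorded as a (domain, e-value) pair; the contig names, the hits and the e-value cutoff of 1e-5 are all hypothetical and serve only to show the bookkeeping.

```python
from collections import Counter

def classify(best_hits, contigs, max_evalue=1e-5):
    """Label each contig with the domain of its best hit, or 'unknown'.

    best_hits maps a contig name to (domain, e-value); contigs without a
    hit, or with a hit weaker than max_evalue, are labeled 'unknown'.
    """
    labels = {}
    for contig in contigs:
        hit = best_hits.get(contig)
        if hit is not None and hit[1] <= max_evalue:
            labels[contig] = hit[0]
        else:
            labels[contig] = "unknown"
    return labels

def unknown_fraction(labels):
    """Fraction of contigs labeled 'unknown'."""
    return Counter(labels.values())["unknown"] / len(labels)

contigs = ["c1", "c2", "c3", "c4", "c5"]

# Hypothetical best hits against a database covering all domains of life.
full_db_hits = {
    "c1": ("virus", 1e-30),
    "c2": ("bacteria", 1e-50),
    "c3": ("eukaryote", 1e-12),
    "c4": ("virus", 0.5),  # too weak to count as a hit
}
# The same search restricted to a viral-only database loses the non-viral hits.
viral_db_hits = {c: h for c, h in full_db_hits.items() if h[0] == "virus"}

print("full database:", unknown_fraction(classify(full_db_hits, contigs)))
print("viral-only database:", unknown_fraction(classify(viral_db_hits, contigs)))
```

In this toy set the bacterial and eukaryotic contigs are counted as unknowns when only viral references are available, which inflates the unknown fraction without revealing anything new about viruses.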

With the increasing amount of data generated from different studies, there is a need for cross-metagenome meta-analysis [102, 103]. This is particularly important because of the diversity of viral metagenomic protocols and the lack of a standard algorithm for downstream data analysis. The following items should be included in any report on viral metagenomic studies: firstly, the sequencing platform and its version number; secondly, raw sequence data accession numbers in a public database; thirdly, details about the bioinformatic analysis, including the homology search tool and the database being used to assign the taxonomy, and their versions; fourthly, a list of known and previously unknown viruses found, clearly showing if the 'novel' viruses are new strains of a previously described species or completely different viruses; and fifthly, causality evidence if any.
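The reporting items listed above lend themselves to a structured record that can be checked for completeness before submission. The Python sketch below is one possible encoding; the class name, the field names and the choice of which fields may be left empty are illustrative assumptions rather than an established standard.

```python
from dataclasses import dataclass, field, fields

@dataclass
class MetagenomeReport:
    """Hypothetical record of the items a viral metagenomics report should state."""
    sequencing_platform: str = ""
    platform_version: str = ""
    raw_data_accessions: list = field(default_factory=list)
    homology_search_tool: str = ""
    tool_version: str = ""
    reference_database: str = ""
    database_version: str = ""
    known_viruses: list = field(default_factory=list)
    # Each entry should say whether it is a new strain or a distinct virus.
    novel_viruses: list = field(default_factory=list)
    causality_evidence: str = ""  # only "if any", so it may stay empty

    def missing_items(self):
        """Names of required fields that are still empty."""
        # Virus lists may legitimately be empty; causality evidence is optional.
        optional = {"known_viruses", "novel_viruses", "causality_evidence"}
        return [
            f.name
            for f in fields(self)
            if f.name not in optional and not getattr(self, f.name)
        ]

report = MetagenomeReport(
    sequencing_platform="Roche 454 GS FLX Titanium",
    homology_search_tool="BLAST",
)
print(report.missing_items())
```

A record of this kind would also make cross-metagenome meta-analysis easier, since the platform, database and tool versions can be compared directly between studies.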

The most intriguing aspect of viral metagenomics is the fact that a large number of sequences, usually the majority, have no significant similarity to anything known. In this review, we refer to these sequences as the 'unknown' (Figure 3a). A typical human or environmental viral metagenome can contain between 60% and 99% unknown sequences (Figure 3), as judged by homology searches (Figure 3b). Depending on how they are viewed, the unknowns can represent either a formidable challenge or a treasure trove for virus discovery. Although researchers often tend to consider the unknowns as 'junk,' these sequences could be a valuable blueprint for the discovery of novel viruses [112, 124, 125]. Thus far, there is a lack of suitable bioinformatic methods to characterize the unknown sequences.

A tentative solution is to compare the sequences between samples in order to at least gain some insight about the viral entities that are shared between them. A program such as PHACCS (PHAge Communities from Contig Spectra) can be used to assess the biodiversity of uncultured viral communities by mathematically modeling the community structure using the contig spectrum of metagenome assemblies [126]. This method was extended to assess cross-assemblies of reads from different samples [65], providing a homology-independent tool for the comparison of metagenomes with a high proportion of unknown sequences. Although PHACCS may provide a glimpse of the composition of metagenomes and the differences between them, it has limited value for the characterization of novel viruses. Two tools can be used to predict whether unknown sequences are from bacteriophages with a lytic or a lysogenic lifestyle. One such tool, described by Deschavanne et al. [127], compares the genome signatures of query sequences against those of their host genome in order to identify host-phage relationships and information about the phage lifestyle. The second method, PHACTS, depends on residual homology between the putative unknown sequence and sets of randomly selected viral proteins from known viruses (K McNair et al., PHACTS: a computational approach to classifying the lifestyle of phages, unpublished data). Alternatively, viruses may be classified by basic sequence properties. For instance, the circularity of the contig, its oligonucleotide profile [128], and the open reading frame (ORF) structure (S Akhter et al., PhiSpy: A novel algorithm for finding prophages in bacterial genomes that combines similarity-based and composition-based strategies, under review) may all provide clues as to whether the unknown sequence could be from a potential novel virus. These properties can be combined into a prediction network used to classify viruses into lifestyle groups or taxonomic clades.
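Two of the basic sequence properties mentioned above, the oligonucleotide profile and the circularity of a contig, can be computed without any homology search. The Python sketch below is illustrative only: it uses tetranucleotides as the default word length and treats an identical stretch at both ends of a contig as a hint of circularity, a common heuristic that is assumed here rather than taken from the cited tools. The function names and the minimum overlap of 10 nucleotides are hypothetical.

```python
from collections import Counter
from itertools import product

def kmer_profile(seq, k=4):
    """Relative frequency of every k-mer over A, C, G, T (oligonucleotide profile)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # ignores k-mers with ambiguous bases
    return {m: (counts[m] / total if total else 0.0) for m in kmers}

def terminal_overlap(seq, min_overlap=10):
    """Length of the longest prefix that is repeated as a suffix.

    Returns 0 if no overlap of at least min_overlap is found. A long
    overlap suggests that a circular genome was assembled as a linear contig.
    """
    for n in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[:n] == seq[-n:]:
            return n
    return 0

# Hypothetical contig whose first 12 nucleotides recur at its end.
contig = "ACGTACGGTTCA" + "GGGCCCTTTAAA" * 2 + "ACGTACGGTTCA"
profile = kmer_profile(contig)
print("terminal overlap:", terminal_overlap(contig))
print("most frequent tetramer:", max(profile, key=profile.get))
```

Profiles computed in this way can be compared between a contig and candidate host genomes, or fed into a classifier together with the circularity flag and ORF statistics.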

Although newly discovered viruses are often labeled 'novel,' the question remains whether these sequences represent truly novel viruses or ancient viruses that simply have never been observed before. The age of a sequence has traditionally been determined by multiple alignments of query sequences with their homologs and by calculating the divergence times from a common ancestral node on a phylogenetic tree. Dates can be estimated using either a molecular clock [129] or by assigning a calibration date to a specific node in the tree based on fossil or other evidence [130] [131] [132] . For viral metagenomic sequences, however, building a phylogenetic tree is itself problematic because often the sequenced reads may represent non-overlapping subregions of an unknown viral genome. Moreover, there is no fossil data available to calibrate the age of nodes in the tree. A promising approach might be to estimate divergence times from assembled viral contigs. De novo assembly allows non-overlapping regions to be combined into a single consensus sequence. For a given molecular clock, SNP analysis of the contributing reads could provide an estimation of how long ago the sequenced reads diverged. Such estimates may be critical when addressing the question of the origin of a newly identified infectious agent.
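The molecular clock estimate mentioned above reduces to a short calculation once a substitution rate is assumed. The Python sketch below uses the simplest form, in which two lineages that differ at a fraction d of sites and evolve at a constant rate r per site per year diverged d / (2r) years ago. The function name, the SNP count and the rate are hypothetical, and no correction for multiple substitutions at the same site is applied.

```python
def divergence_time(num_differences, aligned_length, rate_per_site_per_year):
    """Years since two sequences shared an ancestor under a strict molecular clock.

    The pairwise distance is the uncorrected fraction of differing sites;
    the factor 2 accounts for change accumulating along both lineages.
    """
    distance = num_differences / aligned_length
    return distance / (2 * rate_per_site_per_year)

# Hypothetical example: 20 SNPs across 10,000 aligned sites of a contig,
# with an assumed rate of 1e-4 substitutions per site per year.
years = divergence_time(20, 10_000, 1e-4)
print(f"estimated divergence: about {years:.0f} years ago")
```

The estimate is only as good as the assumed rate, which can differ by orders of magnitude between RNA and DNA viruses, so in practice a range of rates would be explored.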

Until recently, virus discoveries were made in the context of disease etiology. Thus, virus discovery studies were biased, mainly because of the use of convenient samples available from patients. Because of the difficulties involved, the investment of efforts and resources required to isolate viruses often could not be justified outside the disease context. It is likely that the context of the diseases has also led to the misconception that all viruses are pathogenic. This dogma was challenged by the discovery of viruses such as Torque Teno virus (TTV) and hepatitis G virus (GBV-C), which were originally associated with post-transfusion hepatitis [53, 133-135] but were subsequently shown to be classical examples of viral commensals [136, 137]. The widely accepted notion that viruses act as obligatory pathogens is beginning to give way to the concept that viruses can be part of the normal flora of the human body. Considering their high abundance in the gastrointestinal tract, on skin and even in blood and lungs [138], it is unlikely that viruses are only pathogenic, without any benefit for their hosts. The abundance of viruses, particularly phages, in the lung, an environment previously thought to be sterile, may reflect their beneficial role in keeping bacterial populations in check [138]. The view of GBV-C has shifted even further, to its designation as a 'good' virus in cases of coinfection with HIV. Indeed, GBV-C has been associated with a more favorable prognosis for patients with HIV infection by slowing the progression to AIDS [139, 140]. Similarly, dengue virus, a known pathogen, has been shown to limit HIV-1 replication and to reduce the viral load [141]. These examples need to be taken into account when the metagenomic approach is applied to virus discovery. The characterization of a novel virus can be easily achieved in silico with limited bioinformatics tools, but the determination of causation may not always be trivial.

Causality is not always conclusive even when the suspect virus is found at the scene of the crime. In other words, finding a virus in a sample from a patient with an illness of unknown etiology, and even demonstrating an association, does not always prove causation. For this reason, strict guidelines proposed by Robert Koch and later modified by Rivers [142] have been used to assign causality to infectious agents. One of Koch's postulates requires that the candidate etiological agent be isolated from a diseased organism and grown in pure culture. However, many viruses cannot be propagated by culture techniques [143].

New molecular biology techniques have been used for virus discovery, bypassing the prerequisites of Koch's postulates. For instance, the Merkel cell polyomavirus (MCV) was identified as the causative agent of Merkel cell carcinoma without satisfying all of the requisites of Koch's postulates [144]. Similarly, the sea turtle tornovirus 1 was associated with fibropapillomatosis using a culture-independent metagenomic approach [118].

The methodological shift from culture to metagenomics will likely create a paradigm shift in the demonstration of disease causation. In many instances, Koch's postulates cannot be satisfied if culture techniques are required to prove causality. Falkow [145] proposed the modified (molecular) Koch's postulates, which use molecular methods to monitor the role played by specific genes in bacterial virulence. To satisfy the molecular Koch's postulates, a strong association must be established between the phenotype or property under investigation and the pathogenic members of a genus or pathogenic strains of a species. The gene of interest should be found in all pathogenic members of the genus or species but be absent in nonpathogenic strains. At most, the nonpathogenic strains could carry the gene with critical mutations that render the strain non-virulent. However, new molecular methods do not always clearly identify virulence genes or link them unambiguously to a disease of unknown etiology. This could be because genes can be expressed at different time-points during infection. Genes can be turned on and off and may require intrinsic factors in order to trigger the disease process.

Alternatively, we propose the metagenomic Koch's postulates, which focus on the identification of metagenomic traits in diseased subjects. The metagenomic traits are molecular markers such as sequence reads, assembled contigs, genes or full genomes that can uniquely distinguish diseased metagenomes from those obtained from matched healthy control subjects (Figure 4). The metagenomic traits found in diseased patients can be monitored in healthy individuals exposed to the suspected infectious agent. Although this novel approach requires separation or isolation of remaining co-occurring disease candidates (Figure 4.3), unlike the original Koch's postulates it does not necessarily require the isolation of the pathogen in tissue culture or pure culture media. Therefore, the genetic make-up of the agent responsible for a disease can provide early clues before its isolation by tissue culture.

The modified metagenomic Koch's postulates proposed in this paper require that: (1) the diseased metagenome be significantly different from the metagenome constructed with the same sample type obtained from a healthy matched control subject. The suspected metagenomic traits must be present and more abundant in the diseased subject than in the matched control (Figure 4.1); (2) inoculating a healthy individual with a sample from a diseased subject must result in the disease state (Figure 4.2). Differential metagenomic traits from step (1) that are recovered in the newly induced diseased subject may be the biomarker of the candidate etiological agent; and (3) selective inoculation of samples from the diseased subject (in step 2) must induce disease in another healthy control subject if the metagenome contains the trait associated with the etiological agent of the disease, or phenotype under investigation (Figure 4.3). Assuming that the metagenomic trait 'E' (Figure 4.3) is a contig sequence from a previously unknown and unculturable virus, its early identification using the metagenomic approach could spearhead the effort to generate diagnostic assays such as ELISA and PCR, well before the isolation and characterization of the virus by culture techniques.
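The logic of these three postulates can be sketched as set operations on trait abundance tables. The Python example below is a minimal sketch, not a validated pipeline: the function names, the two-fold abundance threshold, the pseudocount and the toy abundances for traits A to F are all hypothetical, and a real analysis would replace the simple fold-change rule with a proper statistical test on replicated samples.

```python
def differential_traits(diseased, control, min_fold=2.0, pseudocount=1e-6):
    """Postulate 1: traits present in the diseased metagenome and at least
    `min_fold` times more abundant than in the matched control."""
    return {
        trait
        for trait, abundance in diseased.items()
        if abundance > 0
        and abundance / (control.get(trait, 0.0) + pseudocount) >= min_fold
    }


def candidate_traits(diseased, control, induced_cases, min_fold=2.0):
    """Postulate 2: keep differential traits that are also recovered from
    every newly induced disease case."""
    candidates = differential_traits(diseased, control, min_fold)
    for case in induced_cases:
        candidates &= {trait for trait, abundance in case.items() if abundance > 0}
    return candidates


def confirmed_traits(candidates, fraction_outcomes):
    """Postulate 3: after selective inoculation of separated fractions, keep
    traits found in every fraction that induced disease and in no fraction
    that failed to induce it."""
    confirmed = set(candidates)
    for traits_in_fraction, induced_disease in fraction_outcomes:
        if induced_disease:
            confirmed &= traits_in_fraction
        else:
            confirmed -= traits_in_fraction
    return confirmed


# Hypothetical relative abundances of metagenomic traits A-F.
diseased = {"A": 0.30, "B": 0.25, "C": 0.05, "D": 0.10, "E": 0.20, "F": 0.10}
control = {"A": 0.32, "B": 0.28, "C": 0.30, "D": 0.01, "E": 0.00, "F": 0.09}
induced = [{"A": 0.30, "D": 0.05, "E": 0.25}]
# Each entry: (traits in the inoculated fraction, did disease follow?)
fractions = [({"D"}, False), ({"E"}, True), ({"A", "B"}, False)]

step1 = differential_traits(diseased, control)
step2 = candidate_traits(diseased, control, induced)
step3 = confirmed_traits(step2, fractions)
print(sorted(step1), sorted(step2), sorted(step3))
```

In this toy data set, traits D and E pass the first two steps, and only E survives selective inoculation, mirroring the role of trait 'E' in Figure 4.3.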

Fulfilling this metagenomic model of the Koch's postulates is possible when one or multiple viral agents are involved in disease causation. With the original Koch's postulates or the modified molecular Koch's postulates, it is difficult enough to prove causality with one suspected agent using the culturing prerequisite. The complexity is even greater when multiple viruses are involved in the causation of a disease.

A similar approach, the siRNA-ome used previously by Kreuze et al. [84], led to the detection of etiological viruses causing diseases in plants despite the low copy number of the suspected traits. The modified metagenomic Koch's postulates could also be tested in diseases that involve more than one agent, such as the murine leukemia caused by a C-type retrovirus named the mink cell focus-inducing virus (MCFIV) [146]. MCFIV requires cooperative interaction with other viruses to increase its propensity to cause leukemia [146]. Another example is Burkitt's lymphoma, caused by Epstein-Barr virus (EBV) in regions holoendemic for Plasmodium falciparum, the etiological agent of malaria [147]. Metagenomics could become the future method of choice, enabling the simultaneous analysis of multiple agents in a sample and the assessment of association and disease causality without the limitations imposed by culture techniques [138, 148, 149].

The etiology of many diseases remains unknown. These ailments are collectively defined as diseases of unknown etiology when all conventional laboratory testing techniques are unsuccessful. Yet diseases of unknown origin have high rates of morbidity and mortality. For example, as many as 40% of cases of infantile diarrhea, which alone claims approximately 1.8 million lives annually, have no known specific causative agent [112]. Infantile diarrhea, pyrexia of unknown origin, influenza-like illnesses, chronic fatigue syndrome, Alzheimer's disease, various forms of tumors such as diffuse large B-cell lymphoma and many other diseases of unknown origin can benefit directly from metagenomic technology.

The success of metagenomics in identifying novel viruses in a wide variety of samples opens doors to new application areas, particularly in public health and the prevention of infectious diseases. Although metagenomic technology is not yet part of routine diagnostics, results from clinical virology research provide valuable proof of concept for a new era in clinical virology practice. For example, Finkbeiner et al. analyzed samples from 12 children using metagenomics and identified a large number of known eukaryotic viruses as well as sequences from putatively novel viruses [112]. Another study identified a novel picornavirus related to cosaviruses, the Human Cosavirus E1 (HCoSV-E1), in a child with acute diarrhea [150]. These initial studies identified promising viral candidates for establishing the etiology in these cases of diarrhea. The 2009 pandemic of influenza A (2009 H1N1) provided proof of concept that metagenomics can rapidly characterize the full genome of the influenza virus [151]. Using the metagenomic approach, Palacios et al. discovered an arenavirus in samples that had tested negative by culture, PCR, serology and a microarray assay using oligonucleotide probes from a wide range of infectious agents [87], suggesting a potential causative agent for unexplained cases of post-transplantation death. In another study, Towner et al. described a new Ebola virus responsible for an outbreak of hemorrhagic fever in the Bundibugyo District, Uganda [152]. Rapid identification of such agents would provide the blueprint for the development of therapeutic regimens or preventive vaccines.

Prevention is better than cure. A single jump, or multiple jumps, of an animal virus to humans can have serious consequences. One way to prevent infectious diseases is through vaccine development, but the development of a vaccine takes time and demands a huge amount of resources. Preventing the introduction of an unknown virus into human populations remains a far-reaching goal unless methods of virus identification and characterization are put in place. A simple and practical strategy would be to assess the danger posed by viruses that thrive in animals and could cross to humans through zoonosis.

Zoonosis is the source of up to 75% of emerging infectious diseases in humans [153]. As such, cross-species transfer from animals to humans has serious repercussions not only for public health but also for socioeconomic and political stability [68, 154–158]. The detection and characterization of novel viruses are of paramount importance in forecasting future outbreaks of viral diseases in humans. Surveying natural reservoirs for potential zoonotic infections [69], and human populations such as bush meat hunters who are exposed to animals, could help prevent major outbreaks before viruses spread widely in the human population. Data obtained from the early identification of viruses are valuable for forecasting emerging and re-emerging viral epidemics.

The experience gained from studying marine environments and hostile mine environments can be applied in public health programs that seek to determine the normal viral population and monitor changes in different geographical settings. We term this approach Public Health Viral Metagenomics Surveillance (PHVMS). Viral metagenomics surveillance is defined as the survey of the functional and taxonomic signatures representing the viruses normally circulating within a population in the absence of noticeable epidemics. In the event of a zoonotic outbreak, these functional and taxonomic signatures of the virome will likely show detectable shifts. Figure 5 shows a hypothetical rank abundance curve for six viruses (a-f). The introduction of a highly pathogenic species (g) can be expected to result in a disruption of the normal virome, including the appearance of opportunistic viral infections (h).
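The comparison of a baseline virome with an outbreak virome can be sketched as a comparison of two rank abundance lists. The Python example below is a minimal sketch under stated assumptions: the species labels follow the hypothetical viruses a-h described here, but the read counts and function names are invented for illustration and do not come from any real data set.

```python
def rank_abundance(counts):
    """Return (species, relative abundance) pairs, most abundant first.

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("virome contains no reads")
    pairs = [(species, n / total) for species, n in counts.items() if n > 0]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


def virome_shift(baseline, outbreak):
    """Report species absent from the baseline and the rank change of
    species present in both viromes (negative = dropped in rank)."""
    base_rank = {s: i + 1 for i, (s, _) in enumerate(rank_abundance(baseline))}
    out_rank = {s: i + 1 for i, (s, _) in enumerate(rank_abundance(outbreak))}
    new_species = sorted(set(out_rank) - set(base_rank))
    rank_change = {
        s: base_rank[s] - out_rank[s] for s in base_rank if s in out_rank
    }
    return new_species, rank_change


# Hypothetical read counts before (I) and during (II) an epidemic.
baseline = {"a": 500, "b": 250, "c": 120, "d": 70, "e": 40, "f": 20}
outbreak = {"g": 600, "a": 200, "b": 90, "h": 60,
            "c": 30, "d": 12, "e": 6, "f": 2}

new_species, rank_change = virome_shift(baseline, outbreak)
print("newly detected:", new_species)
for species, relative in rank_abundance(outbreak):
    print(f"{species}\t{relative:.3f}")
```

In practice a surveillance program would need repeated baseline samples to separate such a shift from ordinary temporal variation.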

Using PHACCS analysis [126], several parameters can be compared between the normal and disturbed viromes, including the total number of viral species (richness) and their relative abundance (evenness). Another approach would be to determine the normal virome, a background viral metagenome to refer to in case of an outbreak. Lessons learned from studies of microbial metagenomes suggest that different environments often have different microbial signatures [159], including functional metabolic information, nucleotide usage and the proportions of different species. Disrupting key metabolic processes of an environment can disturb the balance of that ecosystem. Similarly, the viromes of human populations in different locations may display functional profiles characteristic of their respective environment, lifestyle and the viruses circulating in each region. The magnitude of disturbance of the virome profile will depend on the fitness and virulence of the newly introduced pathogens and the immune fitness of the host. The viral communities in two different metagenomes can be compared using XIPE [160]. This statistical approach was developed for comparing metagenomic sequences derived from samples collected from the Sargasso Sea and from acid mine drainage, and it was able to accurately predict the physiology, metabolic potential and ecology of each ecosystem [160].
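Richness and evenness can be made concrete with a short calculation. The Python sketch below is not PHACCS, which estimates community structure from contig spectra; it simply computes richness, the Shannon index and Pielou's evenness from species counts that are assumed to be already known. The counts and function names are hypothetical.

```python
import math


def richness(counts):
    """Number of species with at least one observation."""
    return sum(1 for n in counts.values() if n > 0)


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over observed species."""
    total = sum(n for n in counts.values() if n > 0)
    if total == 0:
        raise ValueError("virome contains no reads")
    return -sum(
        (n / total) * math.log(n / total) for n in counts.values() if n > 0
    )


def pielou_evenness(counts):
    """Pielou's J = H' / ln(S); returned as 0.0 when fewer than two
    species are present, where the ratio is undefined."""
    species = richness(counts)
    if species < 2:
        return 0.0
    return shannon_index(counts) / math.log(species)


# Hypothetical read counts for a normal and a disturbed virome.
normal = {"a": 500, "b": 250, "c": 120, "d": 70, "e": 40, "f": 20}
disturbed = {"g": 600, "a": 200, "b": 90, "h": 60,
             "c": 30, "d": 12, "e": 6, "f": 2}

for label, virome in (("normal", normal), ("disturbed", disturbed)):
    print(
        f"{label}: richness={richness(virome)}, "
        f"evenness={pielou_evenness(virome):.3f}"
    )
```

In this toy example the disturbed virome gains two species but becomes less even, because a single newly introduced species dominates the community.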

During the last decade, we have witnessed the emergence of metagenomics as a powerful novel tool with endless areas of application in virology. Epidemiological data suggest that novel viruses are likely to be introduced into the human population through zoonosis [153, 158]. Also, the danger of intentional introduction of viruses through bioterrorism cannot be ignored. Viral metagenomics is a powerful, fast and sensitive technique for identifying viruses, including those that cannot be detected by conventional culture and sequence-dependent methods.

Figure 5. Monitoring of emerging infectious diseases using a metagenomic approach. A hypothetical example of the potential use of the Public Health Viral Metagenomics Surveillance (PHVMS) approach for virus discovery based on comparison of viromes sampled before (I) and during (II) an epidemic. Depicted here are the rank abundance curves for viral species (a-h), where g represents a newly introduced, highly pathogenic species and h a less virulent virus.

Papers of particular interest, published within the period of review, have been highlighted as: • of special interest; •• of outstanding interest

High abundance of viruses found in aquatic environments

Prokaryotes: the unseen majority

Consider something viral in your research

Analysis of a marine picoplankton community by 16S rRNA gene cloning and sequencing

Comparing bacterial communities inferred from 16S rRNA gene sequencing and shotgun metagenomics

Bacterial 16S sequence analysis of severe caries in young permanent teeth

A renaissance for the pioneering 16S rRNA gene

A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library

Metagenomics - the key to the uncultured microbes

A census of rRNA genes and linked genomic sequences within a soil metagenomic library

The Phage Proteomic Tree: a genome-based taxonomy for phage

Metagenomics: genomic analysis of microbial communities

Biotechnological prospects from metagenomics

Opportunities to improve fiber degradation in the rumen: microbiology, ecology, and genomics

Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms

Next-generation DNA sequencing techniques

A history of poliomyelitis. Yale Studies in the History of Science and Medicine

The discovery of the poliovirus

Smallpox and Its Eradication

The Greatest Killer: Smallpox in History

Discovery of the first virus, the tobacco mosaic virus: 1892 or 1898?

Methods to detect infectious human enteric viruses in environmental water samples

Role of cell culture for virus detection in the age of technology

Rapid viral diagnostic techniques

Viral metagenomics

A very comprehensive description of metagenomic methods and important benchmarks achieved in the virus discovery effort

Isolation of a cDNA clone derived from a blood-borne non-A, non-B viral hepatitis genome

An assay for circulating antibodies to a major etiologic virus of human non-A, non-B hepatitis

Ethical reflections on Edward Jenner's experimental treatment

Metagenomics for the discovery of novel human viruses

A comprehensive review describing metagenomic methods and important benchmarks achieved

Identification of a novel clade of human immunodeficiency virus type 1 in Democratic Republic of Congo

A novel simian immunodeficiency virus from black mangabey (Lophocebus aterrimus) in the Democratic Republic of Congo

Characterization of a novel simian immunodeficiency virus (SIVmonNG1) genome sequence from a mona monkey (Cercopithecus mona)

Isolation and partial characterization of a lentivirus from talapoin monkeys (Myopithecus talapoin)

A novel simian immunodeficiency virus (SIVdrl) pol sequence from the drill monkey, Mandrillus leucophaeus

Isolation of a cDNA from the virus responsible for enterically transmitted non-A, non-B hepatitis

Quantitative monitoring of gene expression patterns with a complementary DNA microarray

Viral discovery and sequence recovery using DNA microarrays

Identification of a novel Gammaretrovirus in prostate tumors of patients homozygous for R462Q RNASEL variant

Prostate cancer: XMRV - contaminant, not cause?

No association of xenotropic murine leukemia virus-related viruses with prostate cancer

Reliability and reproducibility issues in DNA microarray measurements

Efficient isolation of genes differentially expressed on cellulose by suppression subtractive hybridization in Agaricus bisporus

Virus discovery by sequence-independent genome amplification

Suppression subtraction hybridization (SSH) and macroarray techniques reveal differential gene expression profiles in brain of sea bream infected with nodavirus

Suppression subtractive hybridization: a versatile method for identifying differentially expressed genes

Identification of herpesvirus-like DNA sequences in AIDS-associated Kaposi's sarcoma

A novel DNA virus (TTV) associated with elevated transaminase levels in posttransfusion hepatitis of unknown etiology

Identification of two flavivirus-like genomes in the GB hepatitis agent

STAT1-dependent innate immunity to a Norwalk-like virus

Sequence-independent, single-primer amplification (SISPA) of complex DNA populations

Metagenomics and the molecular identification of novel viruses

Viruses in the faecal microbiota of monozygotic twins and their mothers

Hepatitis E virus (HEV): the novel agent responsible for enterically transmitted non-A, non-B hepatitis

The isolation and characterization of a Norwalk virus-specific cDNA

Identification of a novel astrovirus (astrovirus VA1) associated with an outbreak of acute gastroenteritis

Detection of a novel astrovirus in brain tissue of mink suffering from shaking mink syndrome by use of viral metagenomics

A virus discovery method incorporating DNase treatment and its application to the identification of two bovine parvovirus species

Laboratory procedures to generate viral metagenomes

An excellent compilation of standard operating procedures to perform metagenomic analysis on different types of samples

The marine viromes of four oceanic regions

Method for discovering novel DNA viruses in blood using viral particle selection and shotgun sequencing

Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes

Multiple diverse circoviruses infect farm animals and are commonly found in human and chimpanzee feces

Bat guano virome: predominance of dietary viruses from insects and plants plus novel mammalian viruses

Viral diversity and dynamics in an infant gut

RNA viral community in human feces: prevalence of plant pathogenic viruses

Viral communities associated with healthy and bleaching corals

Metagenomic analysis of stressed coral holobionts

Assembly of viral metagenomes from Yellowstone hot springs

Using pyrosequencing to shed light on deep mine microbial ecology

Microbes and Health Sackler Colloquium: metagenomic detection of phage-encoded platelet-binding factors in the human oral cavity

Extraction of high molecular weight genomic DNA from soils and sediments

Rapid amplification of plasmid and phage DNA using Phi 29 DNA polymerase and multiply-primed rolling circle amplification

Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing

Whole transcriptome amplification for gene expression profiling and development of molecular archives

Single virus genomics: a new tool for virus discovery

Flow cytometric detection of viruses

DNA sequencing with chain-terminating inhibitors

Complete viral genome sequence and discovery of novel viruses by deep sequencing of small RNAs: a generic method for diagnosis, discovery and sequencing of viruses

Arbovirus detection in insect vectors by rapid, high-throughput pyrosequencing

Isolation and characterization of Solenopsis invicta virus 3, a new positive-strand RNA virus infecting the red imported fire ant, Solenopsis invicta

A new arenavirus in a cluster of fatal transplant-associated diseases

Genomic and phylogenetic characterization of Merino Walk virus, a novel arenavirus isolated in South Africa

Parallel tagged sequencing on the 454 platform

Targeted high-throughput sequencing of tagged nucleic acid samples

The history of pyrosequencing

A new method of sequencing DNA

The not so universal tree of life or the place of viruses in the living world

Reasons to include viruses in the tree of life

Viral genomes are part of the phylogenetic tree of life

There is no such thing as a tree of life (and of course viruses are out!)

Quality control and preprocessing of metagenomic datasets

Fast identification and removal of sequence contamination from genomic and metagenomic datasets

TagCleaner: identification and removal of tag sequences from genomic and metagenomic datasets

Basic local alignment search tool

The minimum information about a genome sequence (MIGS) specification

Get the most out of your metagenome: computational analysis of environmental sequence data

Cloning of a human parvovirus by molecular screening of respiratory tract samples

Metagenomic analysis of coastal RNA virus communities

Identification of a third human polyomavirus

Metagenomic characterization of Chesapeake Bay virioplankton

A metagenomic survey of microbes in honey bee colony collapse disorder

Metagenomic and small-subunit rRNA analyses reveal the genetic diversity of bacteria, archaea, fungi, and viruses in soil

Biodiversity and biogeography of phages in modern stromatolites and thrombolites

Viral genome sequencing by random priming methods

Metagenomic analysis of human diarrhea: viral detection and discovery

Novel borna virus in psittacine birds with proventricular dilatation disease

Rapid identification of known and new RNA viruses from animal tissues

Next-generation sequencing and metagenomic analysis: a universal diagnostic tool in plant virology

Genetic detection and characterization of Lujo virus, a new hemorrhagic fever-associated arenavirus from southern Africa

The complete genome of klassevirus - a novel picornavirus in pediatric stool

Discovery of a novel single-stranded DNA virus from a sea turtle fibropapilloma by using viral metagenomics

Novel anellovirus discovered from a mortality event of captive California sea lions

Metagenomic analysis of viruses in reclaimed water

Novel circular DNA viruses in stool samples of wild-living chimpanzees

Novel picornavirus in turkey poults with hepatitis

The fecal virome of pigs on a high-density farm

Systematic artifacts in metagenomes from complex microbial communities

Metagenomics: facts and artifacts, and computational challenges

PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information

The use of genomic signature distance between bacteriophages and their hosts displays evolutionary relationships and phage growth cycle determination

Metagenomic signatures of 86 microbial and viral metagenomes

Molecular dating in the evolution of vertebrate poxviruses

Genomic fossils calibrate the long-term evolution of hepadnaviruses

Fossil record of an archaeal HK97-like provirus

r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock

Detection of a novel DNA virus (TTV) in blood donors and blood products

Prevalence of GBV-C and hepatitis G virus variants in patients with fulminant hepatic failure in Japan

A prospective study of transfusion-transmitted GB virus C infection: similar frequency but different clinical presentation compared with hepatitis C virus

Transfusion transmission of highly prevalent commensal human viruses

Chronic viral hepatitis in hemodialysis patients

Metagenomic analysis of respiratory tract DNA viral communities in cystic fibrosis and non-cystic fibrosis individuals

Persistent GB virus C infection is associated with decreased HIV-1 disease progression in the Amsterdam Cohort Study

GBV-C/hepatitis G virus (HGV) RNA load in immunodeficient individuals and in immunocompetent individuals

Decrease in human immunodeficiency virus type 1 load during acute dengue fever

Viruses and Koch's postulates

Sequence-based identification of microbial pathogens: a reconsideration of Koch's postulates

Clonal integration of a polyomavirus in human Merkel cell carcinoma

Significant paradigm change and a challenge to the existing Koch's postulates and the proposal to use molecular methods to assign etiology to infectious agents

A virus-virus interaction circumvents the virus receptor requirement for infection by pathogenic retroviruses

Etiology of Endemic Burkitt's Lymphoma

Deep sequencing analysis of RNAs from a grapevine showing Syrah decline symptoms reveals a multiple virus infection that includes a novel virus

The prostate cancer-associated human retrovirus XMRV lacks direct transforming activity but can induce low rates of transformation in cultured cells

Identification of a novel picornavirus related to cosaviruses in a child with acute diarrhea

A metagenomic analysis of pandemic influenza A (2009 H1N1) infection in patients from North America

Newly discovered ebola virus associated with hemorrhagic fever outbreak in Uganda

Risk factors for human disease emergence

Emerging disease: looking for trouble

Bushmeat hunting, deforestation, and prediction of zoonoses emergence

Emergence of unique primate T-lymphotropic viruses among central African bushmeat hunters

Naturally acquired simian retrovirus infections in central African hunters

Applying the theory of island biogeography to emerging pathogens: toward predicting the sources of future emerging zoonotic and vector-borne diseases

Functional metagenomic profiling of nine biomes

An application of statistics to comparative metagenomics

New DNA viruses identified in patients with acute viral infection syndrome

Novel, divergent simian hemorrhagic fever viruses in a wild Ugandan red colobus monkey discovered using direct pyrosequencing

We are grateful to Merry Youle for helpful suggestions and editing of the manuscript. JM was supported by a grant from the UCSD Center for AIDS Research (NIAID 5 P30 AI36214) and Moores UCSD Cancer Center (NCI 5P30 CA23100). BED was supported by the Dutch Science Foundation (NWO) Veni grant (016.111.075).