Challenges in the analysis of viral metagenomes

Rebecca Rose, Bede Constantinides, Avraam Tapinos, David L. Robertson and Mattia Prosperi

Virus Evolution (2016), DOI: 10.1093/ve/vew022

Genome sequencing technologies continue to develop with remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling that is accessible to researchers without specialist computing expertise and applicable in diverse experimental circumstances. Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes, demanding the use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared with the estimated number of distinct viral taxa render classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, limiting their usefulness for many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.

In the last decade, at least seven separate viral outbreaks have caused tens of thousands of human deaths (Woolhouse, Rambaut, and Kellam 2015), and the ever-increasing density of livestock, rate of habitat destruction, and extent of human global travel provide a fertile environment for new pandemics to emerge from host switching events (Delwart 2007; Fancello, Raoult, and Desnues 2012), as was the case for SARS, Ebola, Middle East Respiratory Syndrome (MERS), and influenza A (H1N1) (Castillo-Chavez et al. 2015). At present we have a limited grasp of the extent of viral diversity present in the environment: the 2014 database release from the International Committee on Taxonomy of Viruses classified just 7 orders, 104 families, 505 genera, and 3286 species (http://www.ictvonline.org/virustaxonomy.asp); yet one study estimated that there are at least 320,000 virus species infecting mammals alone (Anthony et al. 2013).
High throughput (or so-called 'next generation') sequencing of viruses during the most recent outbreaks of MERS in Saudi Arabia and Ebola in West Africa (Gire et al. 2014; Carroll et al. 2015; Park et al. 2015; Quick et al. 2016) has facilitated rapid identification of transmission chains, rates of viral evolution, and evidence of the zoonotic origin of these outbreaks. Access to such information during the initial stages of an outbreak would offer invaluable insight into when, where, and how an epidemic might emerge, informing intervention and mitigation measures, or even halting an epidemic altogether. A major step towards this goal is therefore to identify existing zoonotic and environmental pathogens with pandemic potential. This is a significant undertaking, demanding considerable investment and close collaboration between government, NGOs, and academia (for example, the USAID program PREDICT, http://www.vetmed.ucdavis.edu/ohi/predict/index.cfm), as well as on-the-ground surveillance by local authorities and scientists in the areas of the world most at risk.

The characterization of unknown viral entities in the environment is now possible with modern sequencing; however, current tooling for exploiting these data represents a practical and methodological bottleneck for effective data analysis. Practically, most available software tools are inaccessible to the majority of potential users, demanding expertise and computing resources often lacked by the researchers from diverse backgrounds involved in sample collection, sequencing, and analysis. There is a need for robust and intuitive analytical tools without requirements for fast internet connectivity, which may be unavailable in remote or developing regions. More fundamentally, the intended scope of published analytical tools and workflows is often less than clear, and given the diverse applications of viral sequencing, it can be difficult to gauge the relevance of newly published tools without first testing them. For example, a fast sequence classifier might fail entirely to detect a novel strain of a well-characterized virus, or might perform well with Illumina sequences yet deliver poor results for data generated with the Ion Torrent platform. Furthermore, results arising from these analyses should be replicable, intelligible, and useful to the end user, with provision for quality control and error management. Software tools that target expert users should be tested, documented, and robustly distributed as packages or containers so as to streamline the processes of installation and generating results.

Methodologically, most genomic sequence analysis software is not well suited to viral genomes. Generic tools that are able to address the challenges posed by viral sequences are often applicable only in limited circumstances. Choosing between approaches is made difficult by an abundance of disparate yet functionally equivalent methodologies and, in general, a lack of rigorous benchmarks for viral datasets. While there is much ongoing research in this area, both the sensitive detection of previously characterized viruses and viral discovery remain key challenges open for innovation. Here we survey the landscape of available approaches for analyzing both known and unknown viruses within genomic and metagenomic samples, with a focus on their practical and methodological suitability for use by the broad spectrum of researchers seeking to characterize viral metagenomes.

Within metagenomes, the proportion of viral nucleic acids is typically far lower than that of host or other microbes, limiting the amount of signal available for analysis after sequencing. To mitigate this, the viral fraction may be enriched during sample preparation (Ruby, Bellare, and Derisi 2013).
Alternatively, PCR amplification may be used to generate an abundance of specific viral sequences present in a sample, a widely used strategy that was employed in the identification and analysis of the MERS coronavirus (Zaki et al. 2012; Cotten et al. 2013, 2014), although effective primer design can be challenging in the presence of high genomic diversity in the target viral species. Conversely, an excess of sequencing coverage can lead to the construction of overly complex and unwieldy de novo assembly graphs in the presence of high genomic diversity, reducing assembly quality. Using in silico normalisation (Crusoe et al. 2015), excess coverage may be reduced by discarding sequences containing redundant information. This approach increases analytical efficiency when dealing with high coverage sequence data, and we have shown that it can benefit de novo assembly of viral consensus sequences. Another in silico strategy for increasing analytical efficiency by discarding unneeded data is to filter out sequences from known abundant organisms through alignment with one or more reference genomes using an aligner or specialist tool (approaches reviewed in Daly et al. 2015).

There are several sequencing technologies in widespread use that are capable of reading hundreds of thousands to billions of DNA sequences per run (Reuter, Spacek, and Snyder 2015). The current market leader, Illumina, manufactures instruments capable of generating billions of 150 base pair (bp) paired-end reads (see 'Glossary') per run, with read lengths of up to 300 bp. The Illumina short read platform is widely used for analyses of viral genomes and metagenomes and, given sufficient sequencing coverage, enables sensitive characterization of low-frequency variation within viral populations (e.g. HIV resistance mutations at frequencies as low as 0.1%; Li et al. 2014). Ion Torrent (ThermoFisher) is capable of generating longer reads than Illumina at the expense of reduced throughput and a higher rate of insertion and deletion (indel) error. Single molecule real-time sequencing, commercialized by Pacific Biosciences (PacBio), produces much longer (>10 kbp) reads from a single molecule without clonal amplification, which eliminates the errors introduced in this step (Eid et al. 2009). However, this platform has a high (10%) intrinsic error rate and remains much more expensive than Illumina sequencing for equivalent throughput. The Nanopore platform from Oxford Nanopore Technologies, which includes the pocket-sized MinION sequencer, also implements long read single molecule sequencing, and permits truly real-time analysis of individual sequences as they are generated. Although more affordable than PacBio single molecule sequencing, the Nanopore platform also suffers from high error rates in comparison with Illumina (Reuter, Spacek, and Snyder 2015). However, the technology is maturing rapidly and has already demonstrated potential to revolutionize pathogen surveillance and discovery in the field, as well as enabling contiguous assembly of entire bacterial genomes at relatively low cost (Feng et al. 2015; Quick et al. 2015; Hoenen et al. 2016). Hybrid sequencing strategies using both long and short reads leverage the ability of long reads to resolve repetitive DNA regions while benefiting from the high accuracy of short reads, at the expense of additional sequencing, library preparation, and data analysis (Madoui et al. 2015).
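Returning to in silico normalisation, the principle can be sketched in a few lines of Python. This is a minimal illustration of the idea only, not the khmer implementation of Crusoe et al. (2015), and the values of k and the coverage cutoff are arbitrary:

```python
from collections import defaultdict
from statistics import median

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def normalize_by_median(reads, k=20, cutoff=20):
    """Discard a read once the median count of its k-mers has reached
    the cutoff: such a read carries mostly redundant information."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if kms and median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            for km in kms:  # only retained reads contribute to the counts
                counts[km] += 1
    return kept

reads = ["ACGTACGTACGTACGTACGTACGT"] * 50 + ["TTGCATTGCATTGCATTGCATTGC"]
print(len(normalize_by_median(reads, k=8, cutoff=5)))  # 6: redundant copies dropped
```

Production implementations such as khmer count k-mers in probabilistic data structures so that memory use remains bounded on large datasets.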
The reconstruction of sequencing reads into full length genes and genomes can be performed by means of either reference-based alignment or de novo assembly, a decision dependent on experimental objectives, read length, quality, and data complexity. In reference-based approaches, reads are mapped to similar regions of a supplied template genome, a well-studied and computationally efficient process typically implemented with a suffix array index of the reference genome. In contrast, de novo assembly is computationally demanding, but important in cases where either a target genome is poorly characterized or the reconstruction of genomes of a priori unknown entities in metagenomes is sought, such as in surveillance studies. For short read data, the increased sequence length afforded by assembly can be necessary to distinguish members of highly conserved gene families from one another. Assembly is also widely used for generating whole genome consensus sequences to facilitate analyses of viral variation, and is a typical starting point for analyses of diverse populations of well-characterized viruses. Even where long reads are available, assembly plays an important role in mitigating the high error rates associated with single molecule sequencing technologies, yielding accurate consensus sequences from inaccurate individual reads. Modern de novo assemblers generally leverage either de Bruijn graphs or read overlap graphs, the latter as part of the approach known as overlap layout consensus (OLC). Figure 1 illustrates the differences between the two methods.

Glossary
Contigs: Contiguous nucleotide sequences assembled from multiple overlapping reads.
Coverage: The number of times a genome (or part thereof) has been sequenced.
de Bruijn graph: A network of nodes and edges, where each edge represents a k-mer found in the collection of reads, and each node represents either the prefix or the suffix of a k-mer.
De novo assembly: Reconstruction of short sequences into longer sequences (contigs) without the use of a reference sequence.
Digital signal processing data transformation: Analytical techniques for transforming sequential data into a domain representative of data features.
Discrete Fourier transform: A spectral analysis technique for identifying sine and cosine frequency components in numerical signal data.
Discrete wavelet transform: A spectral analysis technique for decomposing data into its frequency and spatial components.
k-mer: A subsequence of length k. Many genomic analyses involve decomposition of sequences into all possible subsequences of a specified length k.
Numerical sequence representation: Numerical mapping of nucleotide sequences, permitting the application of signal processing transformation approaches.
Paired-end reads: Reads generated from both the 5′ and 3′ ends of the same DNA molecule. Depending on the length of the molecule and that of the reads, these pairs may or may not overlap in the middle.
Read overlap graphs: A network of nodes and edges, where each node represents a read and each edge represents an overlap between two reads.
Reference-based alignment: Orientation/alignment of reads with respect to a specified reference sequence.
Scaffolds: DNA sequences comprising contigs with gaps between them, often generated using read pairing information.
Suffix array: A sorted array of all suffixes of a string, such as a DNA sequence, enabling efficient sequence comparison.
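To make the glossary's 'Suffix array' entry concrete, the sketch below builds a naive suffix array over a short reference and uses binary search to find exact matches, the basic operation underlying reference-based read mapping. Real aligners use far more compact indexes, tolerate mismatches, and avoid materializing the suffixes; everything here is illustrative:

```python
from bisect import bisect_left

def build_suffix_array(text):
    # Starting positions of all suffixes, sorted lexicographically.
    # O(n^2 log n) construction; production indexes are built in O(n).
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    # All suffixes beginning with `pattern` are adjacent in sorted order,
    # so binary search locates them in O(log n) comparisons.
    suffixes = [text[i:] for i in sa]  # materialized for clarity only
    idx = bisect_left(suffixes, pattern)
    hits = []
    while idx < len(sa) and suffixes[idx].startswith(pattern):
        hits.append(sa[idx])
        idx += 1
    return sorted(hits)

reference = "ACGTACGTGACGT"
sa = build_suffix_array(reference)
print(find_occurrences(reference, sa, "ACGT"))  # [0, 4, 9]
```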
OLC assemblers use the similarity of whole reads in order to construct a graph wherein each read is represented by a node, and subsequently merge overlapping reads into consensus contigs (Deng et al. 2015). OLC is relatively time and memory intensive, scaling poorly to millions of reads and beyond. However, the fewer, longer reads generated by emerging single molecule sequencing technologies tend to be well suited to OLC assembly, which can be easily implemented to tolerate long and noisy sequences (Compeau, Pevzner, and Tesler 2011). Older, notable de novo assemblers implementing OLC include CAP3 (Huang and Madan 1999) and Celera (http://www.jcvi.org/cms/research/projects/cabog/overview/), while MHAP (Berlin et al. 2015), Canu (Berlin et al. 2015), and Miniasm (Li 2016) represent the current state of the art. There also exist a number of OLC assemblers intended for use with viral sequences: VICUNA was designed for short, nonrepetitive, and highly variable reads from a single population (Yang et al. 2012), and PRICE (Ruby, Bellare, and Derisi 2013) iteratively assembles low to moderate complexity metagenomes (e.g. Runckel et al. 2011; Grard et al. 2012) using a similar algorithm to the actively developed consensus assembler IVA (Hunt et al. 2015), which like VICUNA is designed for single virus populations rather than metagenomes (see Table 1 for additional details on programs).

A de Bruijn or k-mer graph represents a set of reads in terms of its k-mer composition, where k-mers are subsequences of a user-specified length k. Each k-mer is assigned to an edge in a graph, whose nodes are the (k-1)-mer prefixes and suffixes of the k-mers. The assembler identifies a path through the graph in which each edge is visited exactly once, known as an Eulerian path (reviewed in Compeau, Pevzner, and Tesler 2011). De Bruijn graphs are much more efficient to construct than overlap graphs and are suited to large numbers of short reads, especially where coverage is high, since redundant k-mers occupy negligible random access memory (RAM). However, with this efficiency comes a lack of error tolerance in identifying overlaps, less tolerance of repeated sequences in comparison with overlap graphs, and a loss of read coherence, meaning that k-mers originating from different reads may be co-assembled. Examples of assemblers using de Bruijn graphs include SOAPdenovo (Luo et al. 2012), ALLPATHS (Butler et al. 2008), SPAdes (Bankevich et al. 2012), and ABySS (Simpson et al. 2009).

Figure 1. Two widely used methodologies in de novo assembly of short reads. Reads are not represented explicitly within a de Bruijn graph; they are instead decomposed into distinct subsequence 'words' of length k, or k-mers, which can be linked together via overlapping k-mers to create an assembly graph. In OLC, a pairwise comparison of all reads is performed, identifying reads with overlapping regions. These overlaps are used to construct a read graph. Next, overlapping reads are bundled into aligned contigs in what is referred to as the layout step, before finally the most likely nucleotide at each position is determined through consensus. This figure is simplified to demonstrate the theory for the assembly of single genomes; note that the process has additional complexities for the reconstruction of metagenomes.
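The following sketch shows the de Bruijn construction just described, under the simplifying assumptions of error-free reads from a single short genome: each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and walking the edges spells out a contig. A real assembler must find Eulerian paths and contend with sequencing errors, repeats, and branching; this toy greedy walk does not:

```python
from collections import defaultdict

def build_debruijn(reads, k):
    """Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer
    suffix; shared (k-1)-mers link k-mers drawn from different reads."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Greedily follow unused edges until stuck, appending the last
    base of each node visited so as to grow the contig."""
    contig, node = start, start
    while graph[node]:
        node = graph[node].pop(0)
        contig += node[-1]
    return contig

reads = ["ACGTGC", "GTGCAA", "GCAATT"]  # overlapping, error-free reads
graph = build_debruijn(reads, k=4)
print(walk(graph, "ACG"))  # ACGTGCAATT
```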
Typical de novo assemblers are designed to reconstruct genomes with uniform sequencing coverage across their length. This is problematic for metagenomes (including viromes), where coverage typically varies considerably both among different genomes and within individual genomes. To address this problem, dedicated metagenome assemblers have been developed. Omega (Haider et al. 2014) is an OLC-based method that uses a minimum cost flow analysis of the OLC graph to generate initial contigs, merging these to create longer contigs and scaffolds using mate-pair information. Genovo (Laserson, Jojic, and Koller 2011) is another OLC-based method, which generates a probabilistic model for the dataset and subsequently uses an iterative approach to reconstruct the most likely genome contigs. MEGAHIT (Li et al. 2015) prioritizes speed, leveraging a succinct de Bruijn graph to rapidly reconstruct high complexity metagenomes, such as those of soil or seawater, on a single computer. Noteworthy is the iterative de Bruijn graph assembler SPAdes, which, although not initially intended for metagenome assembly, has been widely adopted for its effectiveness in assembling variable coverage metagenomes of limited complexity. MetaSPAdes (Nurk et al. 2016) is a metagenome-specific release of the SPAdes pipeline with refinements to its graph simplification and repeat resolution algorithms, and is, counterintuitively, capable of leveraging rare strain information to improve its consensus reconstruction capabilities. Other de Bruijn graph metagenome assemblers based on their genomic counterparts include Ray Meta (Boisvert et al. 2012), MetAMOS (Treangen et al. 2013), MetaVelvet (Namiki et al. 2012; Afiahayati, Sato, and Sakakibara 2015), and IDBA-UD (Peng et al. 2012). For example, unlike the genome assembler Velvet, MetaVelvet decomposes its de Bruijn graph into many subgraphs (using coverage differences and graph connectivity), and scaffolds are built independently for each subgraph. MetaVelvet-SL addresses limitations of MetaVelvet, using supervised learning to detect and classify chimeric nodes within the de Bruijn graph. IDBA-UD partitions a de Bruijn graph into isolated components, constructs a multiple alignment, and subsequently identifies variation within these partitions, using multiple depth-relative thresholds to remove erroneous k-mers. Ray Meta extends the massively distributed assembly model of Ray to variable coverage metagenomes, while MetAMOS is both a metagenomic extension of and successor to the AMOS genome assembler.

We recently proposed a method based on numerical sequence representations and digital signal processing data transformation (SPDT) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (Tapinos et al. 2015). SPDT methods, such as the discrete Fourier transform (DFT) (Agrawal, Faloutsos, and Swami 1993) and the discrete wavelet transform (DWT) (Percival and Walden 2006) (Fig. 2), are used to reduce sequences into a lower dimensional space, preserving only prominent data characteristics. Analysis is subsequently performed on these lower dimensionality transformations, enabling faster data comparison. Since SPDT methodologies such as the Fourier and wavelet transforms are applicable only to numerical sequences, nucleotide sequences must first be numerically transformed with one of several techniques, including real number representations (Chakravarthy et al. 2004), complex number representations (Anastassiou 2001), the DNA walk (Lobry 1996), and the Voss method (Voss 1992).
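As a toy illustration of this approach (a sketch of the general idea, not the method of Tapinos et al. 2015), the snippet below maps sequences into the Voss representation, keeps only a handful of low-frequency DFT coefficients per nucleotide track, and compares sequences in that reduced space. The number of retained coefficients is an arbitrary choice:

```python
import numpy as np

def voss(seq):
    """Voss representation: one binary indicator track per nucleotide,
    turning a DNA string into a 4 x n numerical signal."""
    return np.array([[1.0 if base == nt else 0.0 for base in seq]
                     for nt in "ACGT"])

def compress(seq, n_coeff=8):
    """Keep only the lowest-frequency DFT coefficients of each track,
    a crude dimensionality reduction preserving coarse structure."""
    return np.fft.rfft(voss(seq), axis=1)[:, :n_coeff]

def distance(a, b):
    return float(np.linalg.norm(a - b))  # compare in the transform domain

s1 = "ACGTGCAATTACGTGCAATTACGTGCAA"
s2 = "ACGTGCAATTACGTGCAATTACGTGCTA"  # near-identical variant of s1
s3 = "TTTTGGGGCCCCAAAATTTTGGGGCCCC"  # unrelated composition
c1, c2, c3 = (compress(s) for s in (s1, s2, s3))
print(distance(c1, c2) < distance(c1, c3))  # True: the variant is closer
```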
Although metagenome assemblers generally outperform single genome assemblers in reconstructing different genomes simultaneously, the complexity of this task means they tend to collapse variation at or beneath the strain level into consensus sequences. Even then, their effectiveness may be limited by extreme variation within particular RNA virus populations, arising through mutation and recombination, and by low and/or uneven sequencing coverage across a given genome. Furthermore, it should be noted that de novo assembly is particularly sensitive to the quality of input sequences, meaning that problems during sample extraction, enrichment, and library preparation can be highly detrimental to downstream analyses. Of key importance, therefore, are quality control methods for detecting, and where appropriate correcting, problems associated with contamination (Darling et al. 2014; Orton et al. 2015), primer read-through, and low quality reads (reviewed in Leggett et al. 2013).

Viral genomes and metagenomes comprising high intraspecific variation can be challenging targets for assembly, giving rise to complex assembly graphs and fragmented assemblies. This is often the case for clinical samples from HIV and hepatitis C patients, in which high rates of mutation and long durations of infection can contribute to extreme population divergence, but can also be observed in environmental samples. Where such diversity exists, alignment-based probabilistic population reconstruction approaches can be effective, permitting the reconstruction of individual viral variants into 'haplotypes' exceeding read length. This problem has been well studied, and tools such as ShoRAH, QuRe, and PredictHaplo (Giallonardo et al. 2014) are designed for haplotyping viral populations. ShoRAH (Zagordi et al. 2011) extracts local alignments of a specified window length, reconstructs haplotypes for each 'cluster' in that window using a model-based probabilistic clustering algorithm, and removes mutations from sequences in a cluster that do not match the reconstructed haplotype. QuRe (Prosperi and Salemi 2012; Prosperi et al. 2013) removes nucleotide substitutions and indels with a Poisson model and reconstructs haplotypes using a heuristic algorithm based on a multinomial distribution. Both approaches have the advantage of reporting probabilities for the reconstructed haplotypes. PredictHaplo is notable for taking into account the read pairing information in Illumina data. A limitation of all of these approaches, however, is their reliance upon a single reference sequence with which to perform the initial alignment, a process that assumes a degree of sequence similarity that may not always be observed in diverse regions of RNA virus genomes, such as those encoding envelope proteins. This can be mitigated through the construction of a data-specific template using iterative reference mapping and consensus refinement strategies (Archer et al. 2010; Břinda, Boeva, and Kucherov 2016). Other possibilities for broader utility of these approaches include the use of multiple viral reference sequences, either through consideration of multiple linear sequences or by direct alignment of sequences to a variation graph (https://github.com/vgteam/vg), an emerging approach for modeling genomic variation.
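To convey the windowed haplotype reconstruction idea in miniature, the sketch below greedily clusters equal-length reads from a single alignment window by Hamming distance and reports each cluster's founding read and frequency. This is a toy stand-in only: ShoRAH and its peers use model-based probabilistic clustering, correct sequencing errors, and assemble global haplotypes from overlapping windows:

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def local_haplotypes(window_reads, max_dist=1):
    """Greedy clustering: a read joins the first cluster centre within
    max_dist mismatches, otherwise it founds a new cluster. Cluster
    sizes provide crude estimates of local variant frequencies."""
    centres, sizes = [], Counter()
    for read in window_reads:
        for centre in centres:
            if hamming(read, centre) <= max_dist:
                sizes[centre] += 1
                break
        else:
            centres.append(read)
            sizes[read] += 1
    total = sum(sizes.values())
    return [(c, sizes[c] / total) for c in centres]

window = ["ACGTACGTAC"] * 6 + ["ACGTACCTCC"] * 3 + ["TGCATGCATG"]
for haplotype, freq in local_haplotypes(window):
    print(haplotype, round(freq, 2))  # three local haplotypes: 0.6, 0.3, 0.1
```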
Sequence classification is one of the most studied problems in computational biology, and taxonomic assignment is a key objective of metagenome analysis. All classification methods, to some extent, depend upon detecting similarity between a query sequence and a collection of annotated sequences. Classification may be undertaken using either unassembled reads or the reconstructed contigs arising from the assembly process. The computational requirements of available approaches vary dramatically according to their ability to detect homology in divergent sequences; for example, exact k-mer matching approaches permit rapid sequence classification, yet typically struggle to identify divergent sequences of viral origin, while high-sensitivity protein alignment searches may be prohibitively slow, especially in application to entire sequencing datasets. Some of the more contemporary and speed-optimized taxonomic assignment approaches also have high RAM requirements, limiting scope for their use with readily available computer hardware. The output of sequence homology search tools is not itself easily interpreted, requiring post-processing in order to yield meaningful classifications. Retroactive taxonomic assignment using these results is non-trivial, requiring additional database lookups, for example, to determine a conservative 'lowest common ancestor' (LCA) taxon shared by all matches for each query sequence. This kind of complexity necessitates the integration of different tools within application-specific 'pipelines'.

Viral identification approaches typically depend on similarity searches against a database using an aligner such as BLAST (Altschul et al. 1990). Comprehensive databases (e.g. GenBank) or smaller custom databases containing, for example, only viral sequences of interest may be used, although the latter can generate misleading results. ProViDE (Ghosh et al. 2011) uses virus-specific alignment parameters and thresholds to assign viruses at different taxonomic levels from BLAST matches to a protein database. VIROME (Wommack et al. 2012) is a multifaceted tool integrating results from searches of several sequence and function databases. MEGAN (Huson et al. 2011) is a generally applicable metagenomic classifier, which uses BLAST results to infer the LCA for a given sequence and provides functional analyses through a graphical interface. Automatic pipelines that combine various homology search strategies to identify a final set of viral reads include VirusHunter (Zhao et al. 2013), a Perl script that automates viral identification using BLAST prior to assembly; MetaVir (Roux et al. 2011), a web application that compares users' datasets to published viral sequences; and VirSorter (Roux et al. 2015), which identifies prophages and viruses by comparison with custom datasets. With the exception of the web applications, however, these are not intuitive tools for the majority of users, requiring manual configuration and installation of software dependencies. Furthermore, similarity search approaches are in general extremely resource-intensive, and performing sensitive BLAST-like database searches with millions of reads is intractable without the use of specialist computational resources.
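A minimal sketch of the conservative LCA assignment described above: given root-to-leaf lineages for every database match of a query read, the deepest rank shared by all matches becomes the classification. The lineages below are illustrative example data, assumed to be of equal depth:

```python
def lowest_common_ancestor(lineages):
    """Walk the root-to-leaf lineages in parallel, keeping ranks on which
    all matches agree; the last shared rank is the conservative call."""
    lca = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca.append(ranks[0])
    return lca[-1] if lca else "unclassified"

# Hypothetical lineages for three database matches of a single read
matches = [
    ["Viruses", "Ortervirales", "Retroviridae", "Lentivirus", "HIV-1"],
    ["Viruses", "Ortervirales", "Retroviridae", "Lentivirus", "HIV-2"],
    ["Viruses", "Ortervirales", "Retroviridae", "Deltaretrovirus", "HTLV-1"],
]
print(lowest_common_ancestor(matches))  # Retroviridae
```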
To address this problem, tools have emerged that leverage optimized search algorithms and prebuilt databases so as to increase the tractability of classifying millions of reads. For example, Kraken (Wood and Salzberg 2014) and CLARK (Ounit et al. 2015) are fast exact k-mer matching approaches that use prebuilt databases of viruses, bacteria, human, and fungi, although custom databases may also be built. One Codex is a proprietary web-based metagenome analysis platform with an integrated fast k-mer matching engine (similar to that of Kraken), which is fast, very easy to use, and free for academic use (Minot, Krumm, and Greenfield). Lambda (Hauswedell, Singer, and Reinert 2014) and DIAMOND (Buchfink, Xie, and Huson 2015) are sensitive and heavily optimized BLAST-like aligners that leverage alphabet reduction to permit protein searches three to five orders of magnitude faster than BLAST, offering prebuilt database indexes for common applications.

Although exhaustive BLAST-like methods can detect homology in divergent sequences, these methods are in general limited by the relatively few validated viral sequences deposited in public databases, the high diversity within viral families, which can obscure relatedness, and the lack of a defined set of core genes common to all viruses that can be used to distinguish species (such as the 16S rRNA gene for bacteria) (Fancello, Raoult, and Desnues 2012). These features make it difficult to assign similarity thresholds for classification that are applicable to all potential viruses in a sample (Simmonds 2015). Comparison methods that do not rely on sequence similarity include PhyloPythia (McHardy et al. 2007), which uses nucleotide frequencies to classify reads, and Phymm (Brady and Salzberg 2009), which uses interpolated Markov models to find variable length oligonucleotides that characterize species in the NCBI RefSeq database. Although these approaches are less accurate than BLAST searches, PhymmBL (Brady and Salzberg 2011) combines Phymm and BLAST and outperforms either one on its own. Alignment-free comparison approaches, for example those based on dinucleotide frequencies, codon usage patterns, or small but conserved regions of family-wide ubiquitous genes, may be more robust to the limitations of the database than sequence similarity searches. These features may also reduce the computation required and highlight evolutionary relationships otherwise obscured by high sequence variability.

A fundamental challenge in the classification of viral sequences with any of these methods remains their limited representation within curated sequence databases. While the rate at which new viruses are being added to NCBI's RefSeq collection has increased considerably, from a yearly average of 0.34 species/day in 2010 to 2.5 species/day in 2015 (Fig. 3), our documented understanding of the extent of viral diversity remains superficial (Anthony et al. 2013). Reads of true viral origin are therefore liable to be missed in many cases. The rate of database growth also highlights the need to maintain frequently updated search indexes for sequence classification, the construction of which often demands specialist servers equipped with hundreds of gigabytes of RAM. Even if up-to-date indexes are maintained inside a public repository, their file sizes are substantial, demanding that users have access to a fast internet connection. Consequently, complete outsourcing of sequence classification to remote web services is a compelling prospect for those with adequate internet connections but without powerful computing hardware, increasing scope for conducting analyses with portable computers.
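The exact k-mer matching strategy used by tools such as Kraken and CLARK can be caricatured in a few lines: index every k-mer of each reference genome, then classify a read by tallying which taxa its k-mers hit. The tiny k and 'genomes' below are illustrative only; real classifiers use k of roughly 21 to 31 over gigabases of reference sequence, and resolve k-mers shared between taxa with an LCA walk over the taxonomy rather than a simple vote:

```python
from collections import Counter, defaultdict

K = 8  # toy value chosen for these short example sequences

def index_references(references):
    """Map every k-mer to the set of taxa whose genomes contain it."""
    index = defaultdict(set)
    for taxon, genome in references.items():
        for i in range(len(genome) - K + 1):
            index[genome[i:i + K]].add(taxon)
    return index

def classify(read, index):
    """Tally taxa across the read's matching k-mers and report the
    best-supported taxon, or 'unclassified' if nothing matches."""
    votes = Counter()
    for i in range(len(read) - K + 1):
        for taxon in index.get(read[i:i + K], ()):
            votes[taxon] += 1
    return votes.most_common(1)[0][0] if votes else "unclassified"

# Hypothetical reference 'genomes'
references = {"virus_A": "ATGGCGTACGTTAGCCGATAACGGTT",
              "virus_B": "ATGCCCTTTGGGAAACCCGGGTTTAA"}
index = index_references(references)
print(classify("GCGTACGTTAGCCGA", index))  # virus_A
print(classify("ACGTACGTACGTACG", index))  # unclassified
```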
We see several barriers to realizing the goal of active, on-the-ground surveillance and early detection of viruses with epidemic potential:

1. The emergence of virus-specific assembly and metagenomic tools is a relatively recent phenomenon, with many of the methodologies in use today repurposing one or more existing algorithms. These tools mostly target a small audience of expert users and, as with most research software, decay after initial release due to a lack of ongoing funding, poor software development practices, and/or authors' changes of circumstances (Duck et al. 2016). There is a need for a better balance between research software presenting novel methodologies and sustainably developed, documented, and tested software distributed through robust and user-friendly channels such as package managers, so as to increase the useful life of viral informatics software. Researchers and granting agencies should consider the importance of this step and allocate resources accordingly.

2. Democratisation of routine analyses through the development of user-friendly, locally installable software and remote web services is critical. Preconfigured cloud virtual machines offer a convenient, low cost way to run analyses, yet must permit straightforward sequence database and software version updates so as to remain relevant after their initial release.

3. Maintaining up-to-date indexes of large sequence databases is a problem all classification tools must address, requiring either access to powerful computers for index construction or the ability to download prebuilt indexes over a fast connection. Furthermore, classification of viral sequences is critically dependent upon the quality of curated viral databases such as RefSeq, to which submitting newly discovered sequences can be prohibitively time consuming. A solution might involve the creation of a central database containing, for any given sequencing project, both raw reads and filtered, assembled, and/or annotated reads, analysed using a single central pipeline. On a regular basis, the database could report sequences and corresponding metadata for unclassified 'dark matter', which is often discarded and yet is likely to contain sequences belonging to novel pathogens. By combining the dark matter from multiple studies, trends within these unclassified reads may be identified that could lead to greater power to identify new biological entities.

4. Benchmarking of software also remains an open problem within the field, which lacks standardized test datasets that are used across multiple studies. Often benchmarking datasets are chosen to highlight the advantages of the method under study, and therefore may be quite specific to a given application. Thus the field needs to agree upon a set of standard, well-characterized reference datasets for virus-focused studies.

The future of the field is promising, with emerging technologies showing potential to eliminate certain challenges. Single molecule sequencing, for example, permits the sequencing of whole viral genomes as single reads, with forthcoming portable and smartphone-operated sequencers promising potentially revolutionary analyses in the field. Innovative analytical approaches are constantly being published, and it is evident that the motivation, creativity, and expertise needed to meet these challenges exist within the community. Broader communication among developers and end users is essential, and in conjunction with well-funded international initiatives directed at this goal, intelligent viral surveillance could soon be realized.
The Virogenesis project receives funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 634650. Bede Constantinides receives funding through a Biotechnology and Biological Sciences Research Council (BBSRC) Doctoral Training Program and Avraam Tapinos receives funding from a BBSRC project grant, BB/M001121/1. We thank Katrina Lithgoe and two anonymous reviewers for their helpful edits and suggestions. Conflict of interest: None declared.

References
MetaVelvet-SL: An Extension of the Velvet Assembler to a De Novo Metagenomic Assembler Utilizing Supervised Learning
Efficient Similarity Search in Sequence Databases
Basic Local Alignment Search Tool
Genomic Signal Processing
A Strategy to Estimate Unknown Viral Diversity in Mammals
The Evolutionary Analysis of Emerging Low Frequency HIV-1 CXCR4 Using Variants Through Time: An Ultra-Deep Approach
SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing
Assembling Large Genomes with Single-Molecule Sequencing and Locality-Sensitive Hashing
Ray Meta: Scalable De Novo Metagenome Assembly and Profiling
Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models
Dynamic Read Mapping and Online Consensus Calling for Better Variant Detection
Fast and Sensitive Protein Alignment Using DIAMOND
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Temporal and Spatial Analysis of the 2014-2015 Ebola Virus Outbreak in West Africa
Beyond Ebola: Lessons to Mitigate Future Pandemics
Autoregressive Modeling and Feature Analysis of DNA Sequences
How to Apply de Bruijn Graphs to Genome Assembly
Full-Genome Deep Sequencing and Phylogenetic Analysis of Novel Human Betacoronavirus, Emerging Infectious Diseases
The khmer Software Package: Enabling Efficient Nucleotide Sequence Analysis, F1000Research
Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data
PhyloSift: Phylogenetic Analysis of Genomes and Metagenomes
Viral Metagenomics, Reviews in Medical Virology
An Ensemble Strategy that Significantly Improves De Novo Assembly of Microbial Genomes from Metagenomic Next-Generation Sequencing Data
A Survey of Bioinformatics Database and Software Usage through Mining the Literature
Real-Time DNA Sequencing from Single Polymerase Molecules
Computational Tools for Viral Metagenomics and Their Application in Clinical Research
Nanopore-Based Fourth-Generation DNA Sequencing Technology
ProViDE: A Software Tool for Accurate Estimation of Viral Diversity in Metagenomic Samples
Full-Length Haplotype Reconstruction to Infer the Structure of Heterogeneous Virus Populations
Genomic Surveillance Elucidates Ebola Virus Origin and Transmission During the 2014 Outbreak
A Novel Rhabdovirus Associated with Acute Hemorrhagic Fever in Central Africa
Omega: An Overlap-Graph De Novo Assembler for Metagenomics
Lambda: The Local Aligner for Massive Biological Data
Nanopore Sequencing as a Rapidly Deployable Ebola Outbreak Tool, Emerging Infectious Diseases
CAP3: A DNA Sequence Assembly Program
IVA: Accurate De Novo Assembly of RNA Virus Genomes
Integrative Analysis of Environmental Sequences Using MEGAN4
Genovo: De Novo Assembly for Metagenomes
Sequencing Quality Assessment Tools to Enable Data-Driven Informatics for High Throughput Genomics
MEGAHIT: An Ultra-Fast Single-Node Solution for Large and Complex Metagenomics Assembly via Succinct de Bruijn Graph
Minimap and Miniasm: Fast Mapping and De Novo Assembly for Noisy Long Sequences
Comparison of Illumina and 454 Deep Sequencing in Participants Failing Raltegravir-Based Antiretroviral Therapy
A Simple Vectorial Representation of DNA Sequences for the Detection of Replication Origins in Bacteria
SOAPdenovo2: An Empirically Improved Memory-Efficient Short-Read De Novo Assembler
Genome Assembly Using Nanopore-Guided Long and Error-Free DNA Reads
Accurate Phylogenetic Classification of Variable-Length DNA Fragments
Fast and Sensitive Taxonomic Classification for Metagenomics with Kaiju
One Codex: A Sensitive and Accurate Data Platform for Genomic Microbial Identification
MetaVelvet: An Extension of Velvet Assembler to De Novo Metagenome Assembly from Short Sequence Reads
metaSPAdes: A New Versatile De Novo Metagenomics Assembler
Distinguishing Low Frequency Mutations from RT-PCR and Sequence Errors in Viral Deep Sequencing Data
CLARK: Fast and Accurate Classification of Metagenomic and Genomic Sequences Using Discriminative k-mers
Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone
IDBA-UD: A De Novo Assembler for Single-Cell and Metagenomic Sequencing Data with Highly Uneven Depth
HIV Haplotype Inference Using a Propagating Dirichlet Process Mixture Model
Empirical Validation of Viral Quasispecies Assembly Algorithms: State-of-the-Art and Challenges
Rapid Draft Sequencing and Real-Time Nanopore Sequencing in a Hospital Outbreak of Salmonella
VirSorter: Mining Viral Signal from Microbial Genomic Data
PRICE: Software for the Targeted Assembly of Components of (Meta)Genomic Sequence Data
Temporal Analysis of the Honey Bee Microbiome Reveals Four Novel Viruses and Seasonal Prevalence of Known Viruses, Nosema, and Crithidia
Methods for Virus Classification and the Challenge of Incorporating Metagenomic Sequence Data
ABySS: A Parallel Assembler for Short Read Sequence Data
Alignment by Numbers: Sequence Assembly Using Compressed Numerical Representation
MetAMOS: A Modular and Open Source Metagenomic Assembly and Analysis Pipeline
Evolution of Long-Range Fractal Correlations and 1/f Noise in DNA Base Sequences
VIROME: A Standard Operating Procedure for Analysis of Viral Metagenome Sequences
Kraken: Ultrafast Metagenomic Sequence Classification Using Exact Alignments
Lessons from Ebola: Improving Infectious Disease Surveillance to Inform Outbreak Management
De Novo Assembly of Highly Diverse Viral Populations
ShoRAH: Estimating the Genetic Diversity of a Mixed Sample from Next-Generation Sequencing Data
Isolation of a Novel Coronavirus from a Man with Pneumonia in Saudi Arabia
Identification of Novel Viruses Using VirusHunter: An Automated Data Analysis Pipeline