key: cord-0739586-7w26ah7j
authors: Hölzer, Martin; Marz, Manja
title: Chapter Nine Software Dedicated to Virus Sequence Analysis “Bioinformatics Goes Viral”
date: 2017-12-31
journal: Advances in Virus Research
DOI: 10.1016/bs.aivir.2017.08.004
sha: 22c4bc4168a7e987509e4743b476d385394a2d59
doc_id: 739586
cord_uid: 7w26ah7j

Abstract Computer-assisted technologies of the genomic structure, biological function, and evolution of viruses remain a largely neglected area of research. The attention of bioinformaticians to this challenging field is currently unsatisfying in respect to its medical and biological importance. The power of new genome sequencing technologies, associated with new tools to handle “big data”, provides unprecedented opportunities to address fundamental questions in virology. Here, we present an overview of the current technologies, challenges, and advantages of Next-Generation Sequencing (NGS) in relation to the field of virology. We present how viral sequences can be detected de novo out of current short-read NGS data. Furthermore, we discuss the challenges and applications of viral quasispecies and how secondary structures, commonly shaped by RNA viruses, can be computationally predicted. The phylogenetic analysis of viruses, as another ubiquitous field in virology, forms an essential element of describing viral epidemics and challenges current algorithms. Recently, the first specialized virus-bioinformatic organizations have been established. We need to bring together virologists and bioinformaticians and provide a platform for the implementation of interdisciplinary collaborative projects at local and international scales. Above all, there is an urgent need for dedicated software tools to tackle various challenges in virology.

"Big data" has been awarded to be the second-best Anglicism in 2014. a Although microorganisms and particularly viruses are tiny, the standard properties of big data apply: volume, variety, velocity, and veracity. The biodiversity of viruses with its coverage of multiple scales and its high complexity is a big challenge for algorithm and software development in the big data field (Beckstein et al., 2014) .

Recently, we have started to explore the genomes, transcriptomes, metabolomes, proteomes, and metagenomes of viruses and their hosts, but also their phenotype, occurrence, and environment. Linking such raw heterogeneous data with current data, e.g., data collected from social networks on cumulative occurrences of disease-carrying mosquitoes, is a challenging task. Such a task might be solved, for example, by combining georeferenced photos from mobile phones with automatic determination software, allowing better decisions on overarching questions (Graham et al., 2011).

The storage of such data is essential and currently a computationally unsolved problem. Additionally, the annual electricity costs of calculations on compute clusters amount to one third of the clusters' acquisition costs. Medical data are usually only semianonymous and therefore cannot be stored and processed in clouds (see footnote b). In the future, we will need novel, qualitatively different computational methods and paradigms. We will witness the rapid extension of computational pan-genomics, a new subarea of research in bioinformatics. A prominent example of a computational paradigm shift is the transition from the representation of single reference genomes as strings to cloud-like representations as graphs (Marschall et al., 2016). Viruses in particular are notorious mutation machines: a viral quasispecies is a cloud of viral haplotypes that surround a given master virus (Qin et al., 2012).

Interestingly, even the storage of simple linear viral genomes is complicated. For instance, although most viral genomes are stored at the NCBI, many virologists refuse to submit their data because the database is so general: one of the first questions during the upload process is "What chromosome is this?" Virus-specific databases are therefore necessary; however, only a few exist so far (Table 1), and a general database for all viruses urgently needs to be developed.

a http://www.anglizismusdesjahres.de/anglizismen-des-jahres/adj-2014/.
b Cloud storage or cloud computing refers to shared computer processing resources and data on demand.

Table 1 Virus-specific databases
EpiFlu™: GISAID EpiFlu™ is the world's most complete collection of genetic sequence data of influenza viruses and related clinical and epidemiological data. EpiFlu™ is tailored to the needs of influenza researchers from both the human and the veterinary fields. The data are publicly accessible but not public domain (GISAID does not remove nor waive any preexisting rights). Reference: McCauley (2017)
HIV: The HIV database contains data on HIV genetic sequences and immunological epitopes. The website also provides access to several tools that can be used for analysis and visualization. Reference: Druce et al.
HCV: HCV is a comprehensive database of the hepatitis C virus (HCV). Reference: Kuiken et al. (2005)
ViralZone: ViralZone is a web resource from the Swiss Institute of Bioinformatics for all viral genera and families, providing general molecular and epidemiological information, along with virion and genome figures. Each virus or family page gives easy access to UniProtKB/Swiss-Prot viral protein entries. Reference: Hulo et al.
VVR: The virus variation resource (VVR) is a selection of web retrieval interfaces, analysis, and visualization tools for virus sequence datasets. Reference: Hatcher et al. (2017)

Table 2 Generally, NGS technologies can be divided into short-read and long-read approaches, depending on the length of the produced reads. SNA, single-nucleotide addition; CRT, cyclic reversible termination; SMRT, single-molecule real-time sequencing; indel, nucleotide insertion-deletion; subst., nucleotide substitution. This table is mainly based on recent reviews (Goodwin et al., 2016; Mardis, 2017).

Next-Generation Sequencing (NGS) has dramatically increased the accessibility of genetic information, generating massive amounts of genome and transcriptome data in only a few hours and rapidly changing the landscape of many life science disciplines (Goodwin et al., 2016). In April 2003, completion of the human genome was announced; the project delivered a high-quality human reference genome after spending $3 billion (Schmutz et al., 2004). Although the assembly of such a huge genome is still a very challenging task, nowadays the sequencing can be done in just a few days and for only a few thousand dollars (Goodwin et al., 2016) by utilizing the still emerging NGS technologies. In recent years, DNA sequencing (DNA-Seq) based on novel NGS technologies (Table 2) has become the most sophisticated method for sequencing full genomes. A general DNA-Seq workflow starts with the library preparation, including the fragmentation (chemical or physical) of the DNA molecules. After amplification and sequencing, millions of short subsequences, so-called reads, are produced. In general, methods like Illumina and Ion Torrent produce reads with a length between 50 and 500 bp, depending on the setup and machine used (Goodwin et al., 2016). Alongside these short-read technologies, more and more long-read NGS approaches are emerging. Very popular is single-molecule real-time (SMRT) sequencing, introduced by Pacific Biosciences (PacBio) (Rhoads and Au, 2015), which produces reads with an average length of 15,000 bp and a maximum of >40,000 bp. However, PacBio produces only 50,000 reads per SMRT cell, whereas Illumina yields 180 million reads on one HiSeq2500 lane (Goodwin et al., 2016). Longer reads are clearly important for improving the results of various analyses, such as the de novo assembly of highly repetitive, large, or fast-mutating genomes.

Nanopore sequencing is another recent entrant in the SMRT area: nanopore-based sequencing works by pulling a nucleotide strand (DNA or RNA) through a kind of molecular channel isolated from a bacterium. While passing through the pore, the nucleotide sequence produces small changes in the electrical signal measured under the applied voltage, which can be decoded into the familiar sequence of the bases A, C, T/U, and G, including modifications such as methylation (Jain et al., 2016). Because each pore produces its own signal, this technology can be highly parallelized. For example, with the current USB-sized MinION sequencer, 2048 pores are situated on a membrane the size of a fingernail. The sequencer itself costs a fraction of the aforementioned ones. Furthermore, each pore's signal can be detected in real time (Gardy et al., 2015), allowing unprecedented speed and mobility in sequence-based diagnostics, as demonstrated, for example, in field trials during the 2014 Ebola outbreak (Quick et al., 2016).
Furthermore, nanopore sequencing is currently the only technique that does (in theory) not technologically limit the potential read length, which means an entire viral genome can be sequenced in one part at an intact pore. No additional assembly step would be required. The current read length maximum is >900 Kbp (personal communication with N. Loman). The MinION's throughput has been shown to provide up to 15 Gb in 48 h with a protocol-dependent error rate of 5%-15%.
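
The read counts and read lengths quoted for the different platforms translate into raw yield and expected genome coverage by simple arithmetic. The following Python sketch illustrates this; the 100 bp Illumina read length (the text only gives a 50-500 bp range), the 30 kb genome size, and the function names are illustrative assumptions, and the coverage figure presumes that every read stems from the viral genome, which is rarely the case in practice.

```python
def total_yield_bp(n_reads, mean_read_length_bp):
    """Raw sequencing yield in base pairs."""
    return n_reads * mean_read_length_bp


def expected_coverage(n_reads, mean_read_length_bp, genome_length_bp):
    """Average depth of coverage if all reads originate from one genome."""
    return total_yield_bp(n_reads, mean_read_length_bp) / genome_length_bp


# Figures from the text: 50,000 PacBio reads of ~15,000 bp per SMRT cell.
pacbio_yield = total_yield_bp(50_000, 15_000)

# 180 million Illumina reads per lane; 100 bp is an assumed read length.
illumina_yield = total_yield_bp(180_000_000, 100)

# Hypothetical 30 kb viral genome sequenced on one SMRT cell.
pacbio_coverage = expected_coverage(50_000, 15_000, 30_000)

print(pacbio_yield, illumina_yield, pacbio_coverage)
```

Under these assumptions a single Illumina lane still delivers far more bases than a SMRT cell, which is why read length and yield have to be weighed against each other for a given project.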

Besides the sequencing of genomic DNA, RNA sequencing (RNA-Seq) has emerged as a powerful method for discovering, profiling, and quantifying RNA transcripts or viral RNA genomes (Mortazavi et al., 2008). However, currently available short-read NGS techniques such as Illumina cannot sequence RNA molecules directly: the RNA must first be reverse transcribed into complementary DNA (cDNA). Strikingly, Oxford Nanopore Technologies has just recently announced a sequencing kit that should allow the direct sequencing of RNA molecules (and therefore also of RNA viruses).

Importantly, each NGS project should consider the need for and number of replicates, the different protocols for molecule selection and library preparation, the achieved throughput and read length, and further specific parameters such as strand specificity and the insert size between paired-end reads.

Within the last decade, numerous genomes of previously unknown viruses have been identified. However, it is still challenging to distinguish the small minority of viral sequences from the vast majority of host reads. Genome assemblers specifically designed for viral genomes are rare (Table 3) and cannot overcome an uneven or incomplete coverage of viral genomes.

Many assembly tools and software suites have been developed for genome assembly in general, such as Velvet (Zerbino and Birney, 2008), ABySS (Simpson et al., 2009), or Geneious (Kearse et al., 2012) (Fig. 1). These common tools often fail to assemble full viral genomes because of low and uneven read coverage as well as repetitive elements in the viral UTRs. However, algorithms developed for single-cell sequencing, such as SPAdes (Bankevich et al., 2012) or IDBA-UD, perform very well on the tested samples and outperform assembly tools such as VICUNA that were designed especially for viral data (Fig. 1).

For an efficient viral de novo assembly, we suggest enriching the viruses, e.g., by ultracentrifugation or FACS, prior to the library preparation step. After sequencing, a standard read quality control should be conducted, followed by a host genome filter step, if possible. Finally, the assembly step can be performed based on de Bruijn graphs or overlap-layout-consensus (OLC) approaches. If possible, the use of multiple k-mer values is recommended. The final assembly can be used for annotation and for the identification of contigs of viral origin. Fig. 2 shows the viral assembly workflow as used in the VrAP assembly pipeline (Fricke et al., 2017).
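
To make the de Bruijn graph idea concrete, the following Python sketch builds a graph from error-free toy reads for one k-mer value and walks a single unambiguous path. It is only an illustration under strong assumptions (no sequencing errors, no repeats, one strand); the function names and the reads are hypothetical and do not reflect the internals of SPAdes, VrAP, or any other assembler mentioned here.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the distinct (k-1)-mers that follow it in a read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            if suffix not in graph[prefix]:
                graph[prefix].append(suffix)
    return dict(graph)


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while node in graph and len(graph[node]) == 1:
        node = graph[node][0]
        if node in visited:  # stop on cycles instead of looping forever
            break
        visited.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads (hypothetical data).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

Repeating the construction with several values of k, as recommended above, trades specificity (large k resolves repeats) against sensitivity (small k tolerates low coverage).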

The de novo assembly methods described above can reconstruct viral genomes. However, to yield a small number of contigs, the algorithms usually include a step that calls a consensus at each sequence position. This consensus reduces the noise in the raw assembly. In the context of viral haplotype variants, however, this step is misleading, because it effectively discards low-frequency variants together with technical errors.
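
The effect of consensus calling on minority variants can be seen in a small Python sketch. It assumes reads that are already aligned to the same coordinates (with "-" marking gaps), uses hypothetical function names and toy data, and applies a plain frequency cutoff; a cutoff alone cannot separate true variants from sequencing errors, which is exactly the difficulty described above.

```python
from collections import Counter


def column_frequencies(aligned_reads, position):
    """Relative base frequencies at one alignment column, ignoring gaps."""
    bases = [read[position] for read in aligned_reads if read[position] != "-"]
    counts = Counter(bases)
    total = sum(counts.values())
    return {base: count / total for base, count in counts.items()}


def consensus_base(freqs):
    """The most frequent base, i.e., what a consensus step would report."""
    return max(freqs, key=freqs.get)


def minor_variants(freqs, min_frequency):
    """Non-consensus bases whose frequency reaches the chosen cutoff."""
    major = consensus_base(freqs)
    return {b: f for b, f in freqs.items() if b != major and f >= min_frequency}


# Toy pileup: 8 reads carry G at position 2, 2 reads carry T.
aligned_reads = ["ACGT"] * 8 + ["ACTT"] * 2
freqs = column_frequencies(aligned_reads, 2)
print(consensus_base(freqs), minor_variants(freqs, min_frequency=0.05))
```

The consensus reports only G, although one fifth of the toy reads support T at this site.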

Table 3 Software for the assembly of viral genomes
AV454: AV454 is a de novo consensus assembler designed for small and nonrepetitive genomes sequenced at high depth.
RIEMS: RIEMS is a software for the sensitive and reliable analysis of metagenomic datasets.
V-FAT: V-FAT is a tool to perform automated computational finishing and annotation of de novo viral assemblies.
VICUNA: VICUNA is a de novo assembly tool targeting populations with high mutation rates.
VrAP: The VrAP (Viral Assembly Pipeline) is based on the genome assembler SPAdes (Bankevich et al., 2012) combined with an additional read correction and several filter steps. The pipeline classifies the contigs (contiguous sequences constructed from short reads) to distinguish host from viral sequences. VrAP can identify viruses without any sequence homology to known references.

To gain insights into viral haplotypes, the reads should be mapped either to a known reference genome or to the contigs that were generated during assembly. This "classification" can be used to infer the viral population structure of each individual species in the sample, thereby increasing the resolution of the diversity estimate. (Intrahost) viral populations consist of many related virions, generated by mutation, recombination, and selection. The resulting diversity is especially large for RNA viruses (Holmes, 2009) and can, for example, enable immune escape (Luciani et al., 2012) or affect virulence (Töpfer et al., 2013). Estimating intrahost viral genetic diversity and reconstructing the individual haplotype sequences relies on both error correction and read assembly (Pulido-Tamayo et al., 2015). It can be performed on different spatial scales, including single sites of the genome (single-nucleotide-variant calling), small sliding windows (local reconstruction), or complete genomes (global reconstruction). Viral haplotype reconstruction tools can quantify viral diversity from NGS data (e.g., Beerenwinkel et al., 2012). It was shown that haplotypes differ sufficiently, that current NGS reads are long enough, and that the coverage is high enough to assemble accurate viral haplotype genomes (Zagordi et al., 2012). A common prerequisite for these tools is a high-quality alignment of the reads (e.g., Töpfer et al., 2014). However, tools exist that allow haplotype calling without a reference genome, as presented in Gregor et al. (2016). Nevertheless, the short-read-based discovery of viral sequences in mixed samples remains challenging (Marschall et al., 2016) because most analysis steps are not easily automated and various technical or biological limitations exist (Fricke et al., 2017).
There is a need for an integrated workflow that combines the different processing steps of viral diversity studies to uncover the underlying virus populations and that can be used on a daily basis by clinicians and virologists. The advent of SMRT sequencing provides new opportunities. One of the main limitations in the past was the limited length of the sequenced nucleotide fragments. Currently, it is not possible to synthesize cDNA longer than a few thousand nucleotides (e.g., 2000 nucleotides for the wheat stripe rust pathogen (Ling et al., 2007)). However, even if cDNA synthesis were not a limiting factor, current short-read sequencing technologies such as Illumina can only sequence small fragments of several hundred nucleotides. Nanopore sequencing lifts these two constraints: it is now possible to sequence much longer fragments (as described above) and to sequence RNA directly, without the need for a cDNA intermediate, advancing the detection of viral quasispecies.
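
Local haplotype reconstruction on a small window, one of the spatial scales named above, can be approximated by counting the distinct sequences that reads show within that window. The Python sketch below is a deliberately naive illustration: it assumes gap-padded reads on common coordinates, performs no error correction, and uses a hypothetical function name; it is not the method of any of the cited haplotype tools.

```python
from collections import Counter


def local_haplotypes(aligned_reads, start, width):
    """Count distinct read sequences in a window, using fully covering reads only."""
    counts = Counter()
    for read in aligned_reads:
        segment = read[start:start + width]
        # Skip reads that end inside the window or carry padding ("-") there.
        if len(segment) == width and "-" not in segment:
            counts[segment] += 1
    return counts


# Toy alignment; "-" marks positions a read does not cover.
reads = ["ACGTAC", "ACGTAC", "ACTTAC", "--TTAC", "ACG---"]
window_counts = local_haplotypes(reads, start=1, width=3)
print(window_counts.most_common())
```

Longer reads increase the window width over which such counts link variants, which is why long-read sequencing is expected to help quasispecies detection.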

The genomes of RNA viruses are flanked by highly structured 5′- and 3′-untranslated regions (UTRs), which are indispensable for translation and replication of the viral genome (Liu et al., 2009; Lohmann, 2013).

Standard RNA secondary structure prediction tools such as mfold and RNAfold (Table 4) are based on the calculation of the minimum free energy (MFE) and fold reliably on small local windows of up to 300 nt. Secondary structures of larger genomic segments or interactions spanning larger regions, including pseudogenes, are still bioinformatically challenging. Foldings based on multiple sequences rather than a single one are generally more reliable, because compensatory mutations that preserve base pairs trace the footsteps of evolution. Viruses usually have a high mutation rate and therefore provide many similar sequences, which are ideal for building a large alignment and predicting secondary structures.
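
MFE folding relies on dynamic programming over a thermodynamic energy model. As a much simpler stand-in for illustration, the following Python sketch implements Nussinov-style base-pair maximization, which shares the recursive structure but replaces free energies by a plain pair count; it is not the algorithm used by mfold or RNAfold, and the minimum hairpin loop size of 3 and the function name are assumptions of this sketch.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style recursion)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: position j stays unpaired
            # case: j pairs with k, leaving at least min_loop bases in between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


# A small hairpin: three G-C pairs enclosing the loop AAAU.
print(nussinov_max_pairs("GGGAAAUCCC"))
```

The cubic running time of this kind of recursion is one reason why folding is usually restricted to local windows rather than entire genomes.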

For example, LocARNA creates a multiple alignment based on sequence and structure simultaneously. With this tool, larger genomic regions of up to 800 nt can be reliably predicted, as shown for coronaviruses (Fig. 3) (Madhugiri et al., 2014) and HCV (Fig. 4) (Fricke et al., 2015). Nowadays, long-range interactions (LRIs) are computationally predictable by tools such as LRIscan, suggesting circularization of viral genomes during replication.
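
The support that an alignment lends to a candidate base pair can be summarized by two numbers: the fraction of sequences able to form the pair and the number of different pair types observed, where several pair types hint at compensatory mutations. The Python sketch below computes both for two alignment columns; it is a simplified illustration with a hypothetical function name and toy sequences, not the scoring scheme of LocARNA or RNAalifold.

```python
CANONICAL_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_support(alignment, i, j):
    """Return (fraction of sequences pairing at columns i and j, pair types seen)."""
    pair_types = set()
    n_pairing = 0
    for seq in alignment:
        pair = seq[i] + seq[j]
        if pair in CANONICAL_PAIRS:
            n_pairing += 1
            pair_types.add(pair)
    return n_pairing / len(alignment), pair_types


# Toy alignment: columns 0 and 6 co-vary in the first three sequences.
alignment = ["GAAAAAC", "AAAAAAU", "CAAAAAG", "GAAAAAA"]
fraction, pair_types = pair_support(alignment, 0, 6)
print(fraction, sorted(pair_types))
```

Here three of four sequences can pair, and they do so with three different pair types, which is stronger evidence for a conserved structure than an identical pair in every sequence.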

LRIscan LRIscan is a tool for the prediction of long-range interactions in full viral genomes based on a multiple genome alignment. LRIscan is able to find interactions spanning thousands of nucleotides.

Yes Fricke and Marz (2016) 

Fig. 3 Alignment-based secondary structure prediction of 5′ genome regions of alphacoronaviruses. The viruses included in this analysis represent all currently recognized species in the genus Alphacoronavirus. The alignment (not shown) was calculated by LocARNA and the structure by RNAalifold (Hofacker, 2007). The consensus sequence is represented using the IUPAC code. Colors are used to indicate conserved base pairs: from red (conservation of only one base pair type) to purple (all six base pair types are found); from dark (all sequences contain this base pair) to light colors (one or two sequences are unable to form this base pair). To refine the alignment, an anchor at the highly conserved core TRS-L was used.

The general workflow of a short-read RNA-Seq experiment involves: (1) the extraction of total RNA from a biological sample of interest, (2) the purification of the sample to enrich a certain type of RNA such as mRNAs or microRNAs, and (3) the preparation of a library ready for short-read NGS. The generation of the library may involve steps like the fragmentation of longer RNA molecules, followed by the reverse transcription of the RNA to cDNA, ligation of adapters to the 5′- and/or 3′-ends of the cDNA fragments, and PCR amplification to enrich the library for correctly ligated cDNA fragments (Corney, 2013). The resulting reads from an RNA-Seq experiment can be used to estimate the abundances of certain transcripts within each sequenced sample. If different conditions are sequenced, the obtained transcript abundances can be further used to identify differentially expressed genes. Before the advent of RNA-Seq, gene expression studies were performed with hybridization-based microarrays. In contrast to microarray technology, RNA-Seq allows for the identification of novel transcripts and does not necessarily need a sequenced reference genome. Furthermore, RNA-Seq allows for the genome-wide analysis of transcripts at single-nucleotide resolution and therefore includes the identification of single-nucleotide variants, gene fusions, allele-specific expression, and alternative splicing events (Corney, 2013). However, despite all its advantages, RNA-Seq is still an expensive technology. Therefore, in most RNA-Seq studies the number of biological replicates is limited (only 3-5 replicates per condition are quite common), in contrast to the comparatively high number of genes that are tested simultaneously.

A typical RNA-Seq experiment involving a eukaryotic cell line, two different conditions (untreated, infected), three time points, and four biological replicates already results in the sequencing of 24 samples. The current Ensembl annotation of the human genome (v85) consists of 58,051 genes, of which 19,961 code for proteins. In a differential gene expression study, all expressed genes can be compared between different conditions and time points, resulting in an overwhelming amount of data. Genes can be further analyzed for differentially expressed isoforms and clustered according to their function. With de novo gene prediction, one of the huge advantages of RNA-Seq over microarrays, an incomplete annotation can be extended, so that even more genes may be involved. The use of different library preparation protocols can extend the complexity of such an RNA-Seq study even further.
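
The sample count follows directly from the full factorial design, as the short Python sketch below shows. The time point labels and the sample naming scheme are hypothetical placeholders; only the numbers of conditions, time points, and replicates and the two gene counts are taken from the text.

```python
from itertools import product

conditions = ["untreated", "infected"]
time_points = ["t1", "t2", "t3"]   # placeholder labels
replicates = [1, 2, 3, 4]

# One sample per combination of condition, time point, and replicate.
samples = [
    f"{condition}_{time_point}_rep{replicate}"
    for condition, time_point, replicate in product(conditions, time_points, replicates)
]

# Share of protein-coding genes in the Ensembl v85 numbers quoted above.
protein_coding_fraction = 19_961 / 58_051

print(len(samples), round(protein_coding_fraction, 3))
```

Every additional factor, such as a second library preparation protocol, multiplies the number of samples again.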

Therefore, the statistical analysis of RNA-Seq data with the final goal of defining significantly differentially expressed genes is a challenging task, especially if a high number of reads originates from viral transcripts and outshines the expression of host genes. Furthermore, generating a sensible number of biological replicates can be difficult when working with deadly viruses such as Ebola. The analysis can become even more complicated when no reference genome for mapping and quantification of the RNA-Seq reads is available. In this case, a de novo transcriptome assembly can be constructed and annotated from scratch.
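
How a flood of viral reads can distort host expression estimates is easy to demonstrate with a library-size normalization such as counts per million (CPM). The Python sketch below uses invented counts and a hypothetical function name; it is meant to show the composition effect only and is not the normalization performed by DESeq or any other package cited here.

```python
def counts_per_million(counts, exclude=()):
    """Scale raw counts to counts per million, optionally ignoring some features."""
    kept = {gene: c for gene, c in counts.items() if gene not in exclude}
    total = sum(kept.values())
    return {gene: 1e6 * c / total for gene, c in kept.items()}


# Invented counts: host genes are unchanged, but viral reads dominate after infection.
mock = {"hostA": 500, "hostB": 500, "virus": 0}
infected = {"hostA": 500, "hostB": 500, "virus": 9000}

naive_mock = counts_per_million(mock)
naive_infected = counts_per_million(infected)
host_only_infected = counts_per_million(infected, exclude={"virus"})

print(naive_mock["hostA"], naive_infected["hostA"], host_only_infected["hostA"])
```

With the naive normalization, hostA appears tenfold downregulated although its raw count did not change; excluding the viral reads from the library size removes this artifact in the toy example.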

To tackle these difficulties, which occur frequently when working with RNA-Seq data from virus-infected samples, different tools and parameter settings should be applied and combined to obtain a comprehensive picture of the host's transcriptional response to a viral infection. An exemplary pipeline that combines different tools for mapping and assembly and works in both a genomic and a transcriptomic context is given in Fig. 5. The overall goal of the underlying study was to understand why bats can live with the Ebola virus, while humans suffer so much from this deadly infection.

In this study, performed by Hölzer et al. (2016), (1) total RNA from a human HuH7 cell line and a fruit bat cell line (R06E-J; Rousettus aegyptiacus) infected with either the Ebola or Marburg virus (EBOV, MARV) was harvested 3, 7, and 23 h postinfection, depleted of ribosomal RNA, and sequenced on an Illumina HiSeq2500. The bat RNA was further pooled and additionally sequenced on an Illumina MiSeq system. Initial quality control and trimming of the raw data were conducted with FastQC (Andrews, 2010) and PRINSEQ (Schmieder and Edwards, 2011). (2) For bat RNA, a de novo transcriptome assembly was constructed by combining MiSeq and HiSeq data using Velvet/Oases (Schulz et al., 2012; Zerbino and Birney, 2008), ABySS/Trans-ABySS (Simpson et al., 2009), SOAPdenovo-Trans (Luo et al., 2012), Trinity (Grabherr et al., 2011), and Mira (Chevreux et al., 2004) with default parameters and, where possible, multiple k-mer values. (3) The RNA-Seq short reads of Mock-, EBOV-, and MARV-treated cells were mapped onto the human/bat genomes and the bat transcriptome with Segemehl and TopHat (Kim et al., 2013). (4) A differential gene expression analysis was performed by counting uniquely mapped reads with HTSeq-count (Anders et al., 2015) and applying a DESeq (Love et al., 2014) analysis in R. The results were further used for clustering and scatter/group plot analyses. (5) A homology search in bats was performed for all significantly differentially expressed genes from (4) and for the genes assumed to be involved in the response to infection based on an enriched pathway analysis and the literature. The Rousettus aegyptiacus genome and coding sequences from Pteropus vampyrus, a closely related bat species, were used to validate and also to detect homologous sequences in the bat transcriptome. Detected homologs were employed for the differential gene expression analysis. (6) One huge advantage of this comprehensive study was the manual inspection of 7.5% of the human genes. Each candidate gene was manually investigated in the IGV (Thorvaldsdóttir et al., 2013) and UCSC (Dreszer et al., 2012) browsers for the human and bat samples from all time points. Single-nucleotide modifications (differential SNPs, posttranscriptional modifications), intronic transcripts and regulators, alternative splicing and isoforms, as well as upstream and downstream transcript characteristics were described.

Phylogenetic analysis is a common method in virology, forming a crucial element of investigations describing viruses or viral epidemiology. Nevertheless, many characteristics of viruses pose distinct challenges for phylogenetics: (1) strong differences in evolution rates, (2) great potential for recombination and gene transfer, (3) evolutionary relationships between viruses and their hosts, (4) lack of physical "fossil records" of viruses, and (5) the abundance of genomic viral fossils as parts of ancient viral genomes that occur within the genomes of extant species.

Today, various phylogenetic tree-building methods such as MrBayes (Ronquist and Huelsenbeck, 2003), BEAST, PhyloBayes (Lartillot et al., 2009), and RAxML (Stamatakis et al., 2008) exist. However, trees cannot represent complex evolutionary relations relevant for viruses such as horizontal gene transfer, interspecific recombination, or virus-host coevolution. Different types of phylogenetic networks have been developed to represent such relations (e.g., Huson et al., 2011). Nevertheless, there is still a great need for research on how to reconstruct such aspects of virus phylogeny.

Because the short-term evolutionary rates of many viruses are so high, genomic evolution can be observed over the course of years or even days. It is therefore important that phylogenetic methods can include the sampling dates of the sequences when analyzing short-term evolution, as implemented in TipDate (Rambaut, 2000). Furthermore, spatial dispersal processes play an essential role, for example, the spatial distribution of a virus within the host's body (Bloomquist et al., 2010). Moreover, the evolutionary substitution rates of viruses can differ even in short-term evolutionary scenarios. One reason is that substitution rates reflect a complex product of mutation rate, generation time, effective population size, and fitness (Jenkins et al., 2002; Sanjuan et al., 2010). Particularly in viruses, substitutions might be an artifact generated by polymerase errors and nucleotide modifications (Domingo and Holland, 1997). Thus, the classical assumption of a time-homogeneous substitution process used by different phylogeographic statistical inference methods does not hold, and new approaches that can include varying evolutionary rates have already been introduced (e.g., Bielejec et al., 2014).
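
A simple way to see how sampling dates carry information about the substitution rate is a root-to-tip regression: the genetic distance of each sequence from the root is regressed on its sampling date, the slope estimates the rate, and the x-intercept estimates the date of the root. The Python sketch below implements this with ordinary least squares on invented data; it assumes a single constant rate and is not the maximum-likelihood approach of TipDate.

```python
def root_to_tip_regression(sampling_years, root_to_tip_distances):
    """Least-squares fit; returns (rate per site per year, estimated root date)."""
    n = len(sampling_years)
    mean_t = sum(sampling_years) / n
    mean_d = sum(root_to_tip_distances) / n
    sxx = sum((t - mean_t) ** 2 for t in sampling_years)
    sxy = sum(
        (t - mean_t) * (d - mean_d)
        for t, d in zip(sampling_years, root_to_tip_distances)
    )
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    root_date = -intercept / slope  # date at which the fitted distance is zero
    return slope, root_date


# Invented example: distances grow by 0.002 substitutions per site per year.
years = [2010, 2012, 2014, 2016]
distances = [0.010, 0.014, 0.018, 0.022]
rate, root_date = root_to_tip_regression(years, distances)
print(rate, root_date)
```

Strong scatter around the regression line is a practical warning sign that the time-homogeneous assumption discussed above does not hold for a data set.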

Another problem for viral "deep phylogeny" reconstruction is the genetic distance between viruses. The distance can be so large that reasonable alignments become impossible to calculate. The development of advanced approaches would help to achieve biologically correct alignments, but can only marginally alleviate the problem of saturated substitution processes. Including aspects such as genome organization or protein structure as phylogenetic characters could further improve viral alignments and phylogenies (Holmes, 2011).
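
Saturation can be illustrated with the Jukes-Cantor (JC69) correction, which converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = -3/4 ln(1 - 4p/3). The Python sketch below is a minimal illustration assuming two aligned nucleotide sequences of equal length with "-" as gap symbol; the JC69 model is chosen here only for its simplicity, and the function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among positions without gaps."""
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in compared) / len(compared)


def jukes_cantor_distance(p):
    """JC69-corrected distance; infinite once the sequences are saturated."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jukes_cantor_distance(p), jukes_cantor_distance(0.75))
```

As p approaches 0.75, the value expected for two unrelated random sequences, the corrected distance diverges, so small changes in the alignment produce arbitrarily large changes in the estimated distance.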

Several ancient viruses have left parts of their genome (or other traces) in the germ line genomes of their hosts. Such parts, called endogenous viral elements (EVEs), have survived as nonfunctional, neutrally evolving pseudogenes or have even become fixed as functional elements. Most EVEs stem from retroviruses because they integrate into host genomes as part of their life cycle. For example, 8% of the human genome is derived from >100,000 retroviral fossils (Lander et al., 2001). However, in recent years, EVEs from many other viruses have been found (Horie and Tomonaga, 2011). Different programs have been developed to detect EVEs in complete genome sequences, such as RepeatMasker (Smit et al., n.d.), LTR_STRUC (McCarthy and McDonald, 2003), and RetroTector (Sperber et al., 2009). Moreover, a combination of several of these programs seems very promising for the calculation of viral phylogenies (Lerat, 2010).

In addition, associations between viruses and their hosts can influence the phylogeny of both partners. A divergence of the host can also lead to a divergence of the virus (codivergence) and thus to a (local) congruence of both phylogenies. A match of the virus phylogeny with host evolutionary events at known dates can be used to adjust the virus phylogeny or corresponding molecular clocks (Sharp and Simmonds, 2011). The ability of viruses to switch their hosts can enable viruses to replicate and spread more efficiently. This process is commonly known as an epidemic and is observed in pathogenic viruses (Weiss, 2003). Owing to the advantages conferred by the conquest of new host territory, several researchers presume host switching to be an elementary component of virus evolution that might initiate viral speciation (Kitchen et al., 2011). Because virologists are highly interested in reconstructing the common history of viruses and their hosts, several bioinformatic tools have been developed for this purpose (de Vienne et al., 2013).

However, a large number of research questions still need to be answered with new computational methods. For example, the inclusion of biogeographic information, ecological traits, or preferential host switching are crucial tasks (Cuthill and Charleston, 2013). A better knowledge of the timing and underlying conditions of those processes might enable projections into the future and thereby contribute to tackling one of the major issues in today's infectious diseases research: the prediction and prevention of future pandemics and outbreaks.

It is essential to bundle the expertise in virus bioinformatics in order to follow up with larger steps on the small steps that have already been taken. There is an urgent need for novel and specialized tools that allow the efficient detection, assembly, and classification of already known and completely new viruses in a fast and reliable way.

One big step in this direction involves the establishment of research networks between experienced scientists to facilitate the exchange of knowledge and to speed up the development of powerful tools. DiaMETA-net is a German network which focuses on metagenomics in infection medicine. The research groups within the network devote themselves to the very broad detection and characterization of pathogens (viruses, bacteria, parasites) by means of NGS. However, the first specialized virology-bioinformatics organization, the EVBC (European Virus Bioinformatics Center), was established only recently, in March 2017, and so far comprises 100 members from over 50 research institutions distributed across 13 European countries.

The future of virus bioinformatics clearly depends on how fast we develop specific bioinformatical tools, take first steps to establish a useful virus-specific database, and help to establish joint research projects. We must initiate and coordinate ring trials, undergraduate courses, graduate summer schools, and courses for principal investigators.

Although the list of bioinformatical tools presented in this section is necessarily incomplete, it should provide a good overview and a starting point to dive even deeper into the computational analysis of viral sequences.

HTSeq - a Python framework to work with high-throughput sequencing data

FastQC: a quality control tool for high throughput sequence data

SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing

Applications of next-generation sequencing technologies to diagnostic virology

Explorative analysis of heterogeneous, unstructured, and uncertain data: a computer science perspective on biodiversity research

Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front. Microbiol. 3, 329

Inferring heterogeneous evolutionary processes through time: from sequence substitution to phylogeography

De novo transcriptome assembly with ABySS

Three roads diverged? Routes to phylogeographic inference

Vfat: A post-assembly pipeline for the finishing and annotation of viral genomesPlease provide year of publication for this reference

Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs

RNA-seq using next generation sequencing

A simple model explains the dynamics of preferential host switching among mammal RNA viruses

Cospeciation vs host-shift speciation: methods for testing, evidence from natural associations and relation to coevolution

RNA virus mutations and fitness for survival

The UCSC genome browser database: extensions and updates

Improving HIV proteome annotation: new features of BioAfrica HIV Proteomics Resource. Database (Oxford), baw045. ISSN: 1758-0463

Bayesian phylogenetics with BEAUti and the BEAST 1.7

Prediction of conserved long-range RNA-RNA interactions in full viral genomes

Conserved RNA secondary structures and long-range interactions in hepatitis C viruses

VrAP: full length de novo genome assembly of unknown RNA viruses

Real-time digital pathogen surveillance-the time is now

Coming of age: ten years of nextgeneration sequencing technologies

Full-length transcriptome assembly from RNA-Seq data without a reference genome

Using mobile phones to engage citizen scientists in research

Snowball: strain aware gene assembly of metagenomes

The Vienna RNA websuite

QUAST: quality assessment tool for genome assemblies

Virus Variation Resource -improved response to emergent viral outbreaks

Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection

RNA consensus structure prediction with RNAalifold

A multi-split mapping algorithm for circular RNA, splicing, trans-splicing and fusion detection

The Evolution and Emergence of RNA Viruses

What does virus evolution tell us about virus origins?

Differential transcriptional responses to Ebola and Marburg virus infection in bat and human cells

Non-retroviral fossils in vertebrate genomes

ViralZone: a knowledge resource to understand virus diversity

Phylogenetic Networks: Concepts, Algorithms and Applications

The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community

Rates of molecular evolution in RNA viruses: a quantitative phylogenetic analysis

Geneious basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

Family level phylogenies reveal modes of macroevolution in RNA viruses

The Los Alamos hepatitis C sequence database

Initial sequencing and analysis of the human genome

PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating

Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs

Construction and characterization of a full-length cDNA library for the wheat stripe rust pathogen (Puccinia striiformis f. sp. tritici). BMC Genomics. 8, 145

Cis-acting RNA elements in human and animal plusstrand RNA viruses

Hepatitis C virus RNA replication

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Next generation deep sequencing and vaccine design: today and tomorrow

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

RNA structure analysis of alphacoronavirus terminal genome regions

DNA sequencing technologies

Computational pan-genomics: status, promises and challenges

Challenges in RNA virus bioinformatics

LTR_STRUC: a novel search and identification program for LTR retrotransposons

Mapping and quantifying mammalian transcriptomes by RNA-Seq

IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth

ViPR: an open bioinformatics database and analysis resource for virology research

Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations

A metagenome-wide association study of gut microbiota in type 2 diabetes

Estimating the rate of molecular evolution: incorporating noncontemporaneous sequences into maximum likelihood phylogenies

PacBio sequencing and its applications

MrBayes 3: Bayesian phylogenetic inference under mixed models

Viral mutation rates

RIEMS: a software pipeline for sensitive and comprehensive taxonomic classification of reads from metagenomics datasets. BMC Bioinf. 16, 69

Quality control and preprocessing of metagenomic datasets

Quality assessment of the human genome sequence

Oases: robust de novo RNA-Seq assembly across the dynamic range of expression levels

Evaluating the evidence for virus/host co-evolution

GISAID: global initiative on sharing all influenza data -from vision to reality

ABySS: a parallel assembler for short read sequence data

Retrotector online, a rational tool for analysis of retroviral elements in small and medium size vertebrate genomic sequences

A rapid bootstrap algorithm for the RAxML web servers

Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration

Sequencing approach to analyze the role of quasispecies for classical swine fever

Viral quasispecies assembly via maximal clique enumeration

Cross-species infections

Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering

De novo assembly of highly diverse viral populations

Read length versus depth of coverage for viral quasispecies reconstruction

Velvet: algorithms for de novo short read assembly using de Bruijn graphs

Mfold web server for nucleic acid folding and hybridization prediction