key: cord-0324991-z92ica1z
authors: Plyusnin, Ilya; Kant, Ravi; Jääskeläinen, Anne J.; Sironen, Tarja; Holm, Liisa; Vapalahti, Olli; Smura, Teemu
title: Novel NGS Pipeline for Virus Discovery from a Wide Spectrum of Hosts and Sample Types
date: 2020-05-08
journal: bioRxiv
DOI: 10.1101/2020.05.07.082107
sha: 2a13d7555c9b704a6603a715c112229dbf03a2e8
doc_id: 324991
cord_uid: z92ica1z

The study of the microbiome data holds great potential for elucidating the biological and metabolic functioning of living organisms and their role in the environment. Metagenomic analyses have shown that humans, along with e.g. domestic animals, wildlife and arthropods, are colonized by an immense community of viruses. The current Coronavirus pandemic (COVID-19) heightens the need to rapidly detect previously unknown viruses in an unbiased way. The increasing availability of metagenomic data in this era of next-generation sequencing (NGS), along with increasingly affordable sequencing technologies, highlight the need for reliable and comprehensive methods to manage such data. In this article, we present a novel stand-alone pipeline called LAZYPIPE for identifying both previously known and novel viruses in host-associated or environmental samples and give examples of virus discovery based on it. LAZYPIPE is a Unix-based pipeline for automated assembling and taxonomic profiling of NGS libraries implemented as a collection of C++, Perl, and R scripts.

Our ability to produce sequence data in the rapidly growing field of genomics has surpassed our ability to extract meaningful information from it. Analyzing viral data is particularly challenging given the considerable variability in viruses and the low coverage of viral diversity in current databases. It is estimated that a vast majority of virus taxa are yet to be described and classified (Geoghegan and Holmes, 2017) . This challenge is further complicated by the high evolutionary rate of viruses leading to emergence of new virus lineages and the relative scarcity of viral genetic material in metagenomic samples (Rose et al., 2016) . Next-generation sequencing (NGS) is a high-throughput, impartial technology with numerous attractive features compared to established diagnostic methods for virus detection (Mokili et al., 2012) . NGS-based studies have improved our understanding of viral diversity (Cantalupo et al., 2011) . There is considerable interest within virology to explore the use of metagenomics techniques, specifically in the detection of viruses that cannot be cultured (Smits et al., 2015; Graf et al., 2016) . Metagenomics can also be used to diagnose patients with rare or unknown disease aetiologies that would otherwise require multiple targeted tests (Pallen, 2014) or emerging infections for which tests are yet to be developed.

In recent years, the role of the bacterial microbiome in health and disease has been acknowledged and studied extensively (Kataoka, 2016; Biedermann and Rogler, 2015) . Nonetheless, the influence of the viral constituent of the microbiome (i.e., virome) has received considerably less attention.

Recent research has indicated that both pathogenic and commensal viral species can modulate host immune responses and thereby either prevent or induce diseases (Lim et al., 2015; Neil and Cadwell, 2018) . Additionally, recent research has revealed modifications in the virome that are related to diseases such as acquired immunodeficiency syndrome and inflammatory bowel disease (Norman et al., 2015) . Accordingly, there is a need to identify novel viruses that may be established pathogens and to define wider links of the virome with health and disease. Beyond humans, the veterinary, wildlife, arthropod and environmental viromes have large implications in e.g. animal health, zoonotic emergence, and ecosystem research that require new tools to understand and study the virosphere.

Bioinformatics pipelines and algorithms designed for the analysis of NGS microbiome data can be separated into three groups. The first group includes pipelines for virome composition analysis. These pipelines mine the relative abundance and types of viruses present in a given sample. These pipelines include VirusSeeker (Zhao et al., 2017) , Viral Informatics Resource for Metagenome Exploration (VIROME) (Wommack et al., 2012) , viGEN (Bhuvaneshwar et al., 2018) , the Viral MetaGenome Annotation Pipeline (VMGAP) (Lorenzi et al., 2011) and MetaVir (Roux et al., 2014) .

The second group includes pipelines that are designed for bacterial composition analysis, such as MG-RAST (Meyer et al., 2008) . Pipelines in the third group, such as MetaPhlan2 (Truong et al., 2015) , Kraken2 (Wood et al., 2019) and Centrifuge (Kim et al., 2016a) , can perform composition analysis for all known taxa. There are also a number of tools, pipelines, and algorithms for virus discovery, including Genome Detective (Vilsker et al., 2019) , VIP (Li et al., 2016) , PathSeq (Kostic et al., 2011) , SURPI (Ho and Tzanetakis, 2014) , READSCAN (Naeem et al., 2013) , VirusFinder (Wang et al., 2013) and MetaShot (Fosso et al., 2017) . Most of these tools depend exclusively on nucleotide-level sequence alignments and can detect viruses with highly similar sequences to a known virus.

These limitations make it impossible to detect extremely divergent novel viral sequences that do not share nucleotide similarity to any known or existing viral sequence (Wang et al., 2013; Takeuchi et al., 2014) . For these divergent novel sequences, it is critical to utilize amino acid-based comparison.

Although amino acid-based comparison is computationally more challenging and is used by very few published methods, this approach can facilitate the detection of novel viruses.

The availability of robust bioinformatics pipelines for virome detection and annotation from NGS data continues to be one of the critical steps in many research projects. Pipelines are needed to efficiently detect viral sequences present in a complex mixture of host, bacterial, and other microbial sequences. The discovery of viral sequences depends on sequence alignment with other viral sequences in databases, as, in contrast to bacteria where 16S RNA is present in all taxa, viruses lack 'explanatory genes' found in all taxa.

Lazypipe offers several advantages to the existing methods for taxonomic profiling of viral NGS data. Lazypipe outsources homology search to a separate server eliminating the need to install and update local sequence databases. This is helpful in both reducing the workload on the user and ensuring that all the latest viral sequences are covered by the homology search. Additionally, this feature can significantly reduce the threshold for employing Lazypipe by the less technically savvy researchers. Lazypipe uses SANSparallel (Somervuo and Holm, 2015) to search for amino acid homologs in the UniProtKB database. Searching for homologs in the protein space is expected to retrieve more distant viral homologs than searches with nucleotide sequences (Zhao et al., 2017) . Furthermore, SANSparallel is approximately 100 times faster compared to the blastp search (Somervuo and Holm, 2015) , which is the default search engine employed by nearly all other annotation pipelines that search the protein space. Unlike most taxonomic profilers, Lazypipe assembles and annotates viral contigs, thus reducing the workload on the downstream analysis.

Lazypipe implements a flexible stepwise architecture that allows re-execution of individual steps or parts of the analysis. This architecture addresses the increased risk of execution failure that is inherent to the analysis of large NGS libraries. Lazypipe supports data formats that can be used both by human researchers and automated tools. Results are output in the form of intuitive excel tables and interactive graphs, but also, in the form of standardized taxonomic profiles that can be integrated with automated workflows.

Lazypipe does not perform direct taxonomic binning of reads, but instead links these to database sequences via the assembled contigs. This approach results in a very high accuracy of taxon retrieval (see Results), however, this may come at the cost of lower accuracy for read binning. The accuracy of read binning was not accessed in this work since the main objective was to construct a highly accurate taxonomic profiler. Still, we provide the option to retrieve reads linked to any reported taxon or contig. Lazypipe implements a simple and robust model for abundance estimation. In this model reads aligned to a given contig are equally distributed among the taxa found by the homology search.

Although more elaborate models are certainly possible, our benchmarking suggest that this simple model is sufficient for adequate estimation of viral abundancies.

The faecal sample from diarrheic American mink (Neovison vison) was collected in September 2015 from a fur production farm in Finland, as described previously (Smura et al.) . The processing and sequencing of the sample were conducted using a protocol described in (Conceição-Neto et al., 2015) .

The human patient samples were derived from the diagnostic unit of Helsinki University Hospital Laboratory (HUSLAB) and stored in -80° C. In this study, RNA was extracted using either QIAamp Viral RNA kit (Qiaqen Inc., Valencia, CA, USA) or EasyMag (bioMerieux) according to the manufacturer's instructions, followed by real time PCR detection described in (Kuivanen et al., 2019) for TBEV, in (Haveri et al., 2020) for SARS-CoV-2, and in Smura et al (manuscript submitted) for entero-and parechoviruses. This study was done according to research permits HUS/32/2018 § 16 for project TYH2018322 and HUS/44/2019 § 13 for projects TYH2018322 and M1023TK001.

Prior to sequencing, samples were treated with DNase I (Thermo Fisher, Waltham, USA) and purified with Agencourt RNA Clean XP magnetic beads (Beckman Life sciences, Indianapolis, USA).

Ribosomal RNA was removed using a NEBNext rRNA depletion kit (New England BioLabs, Ipswich, USA) according to the manufacturers protocol. The sequencing library was prepared using a NEBNext Ultra II RNA library prep kit (New England BioLabs).

Libraries were quantified using NEBNext Library Quant kit for Illumina (New England BioLabs).

Pooled libraries were sequenced on an Illumina MiSeq platform, using either MiSeq v2 reagent kit with 150 base pair (bp) paired-end reads or a MiSeq v3 reagent kit with 300 bp paired-end reads.

We implemented a UNIX pipeline for automated assembly and taxonomic profiling of NGS libraries.

The pipeline also performs taxonomic binning of the assembled contigs. The workflow of our pipeline is illustrated in Fig 1. Our pipeline was implemented as a collection of Perl, C++, and R programs with a command-line user interface written in Perl. Our implementation allows for execution of the whole pipeline with a single command or performing each analysis step separately. This allows for great flexibility when working with large NGS libraries. Each pipeline step is described in more detail below. Paired-end libraries in FASTQ format (Cock et al., 2009 ) serve as input. First, primers, short reads and low-quality reads are removed with Trimmomatic (Bolger et al., 2014) . Then host reads are filtered by aligning reads against the host genome with BWA-MEM (Li, 2013) and removing reads with high scoring alignments with SAMtools (Li et al., 2009) . A threshold of 50 was selected by comparing the pre-assembly mapping of reads to the host genome to the post-assembly taxonomic binning of reads (without the host genome filtering). Comparing these for several samples showed that a threshold of 50 removes between 90% to 94% of reads that are assigned to Eukaryota in postassembly binning while removing only 1% to 10% of reads that are assigned to Viruses (data omitted). Decreasing this threshold to 30 resulted in removal of only 63% to 64% of reads assigned to Eukaryota and increased removal of reads assigned to Viruses (18-29%). Increasing threshold to 100 again decreased the number of filtered eukaryotic reads (to 74-86%) with only slight improvement on the number of retained virus reads (0-9%). For the simulated metagenome (Fosso et al., 2017) , setting the threshold to 50 results in filtering 99.93% of host reads and only 0.05% of viral reads (excluding the endogenous retroviral reads, which are filtered to a large extent). Thus, we selected 50 as a working threshold although we recognize that a more robust optimization can be performed.

In the next step reads are assembled with MEGAHIT (Li et al., 2015) or Velvet (Zerbino and Birney, 2008) . MEGAHIT is used by default as this was the overall best assembler in the CAMI competition (Sczyrba et al., 2017) . The pipeline then scans for gene-like regions in the assembled contigs with MetaGeneAnnotator (Noguchi et al., 2008) (default) or MetaGeneMark (Zhu et al., 2010) and translates these to amino acid sequences using BioPerl (Stajich et al., 2002) . Extracted amino acid sequences are queried against UniProtKB using the SANSparallel (Somervuo and Holm, 2015) server. Top hits that pass a bitscore threshold value are used to assign contigs to the NCBI taxonomy ids. Note that contigs with several genes can be assigned to several taxonomy ids. We also support an alternative strategy of mapping contigs directly against the NCBI nr database. This is done by querying contigs with Centrifuge against NCBI nr and using alignments that pass a threshold value for the alignment score to assign contigs to taxonomy ids. We refer to this alternative version of our pipeline as the Lazypipe-nt.

Reads that passed host genome filtering are realigned to contigs using BWA-MEM top hits that pass a pre-set threshold on the alignment scores. Read distribution tables are generated using SAMtools (Li et al., 2009 ).

Next, taxonomy links generated by SANSparallel and read distribution tables are processed into an abundance table, which summarizes the number of contigs and reads binned to each taxon.

Contigs that are mapped to two or more taxa contribute a corresponding fraction (such as 50%, 33%, 25% etc.) of mapped reads to the corresponding taxa. The raw abundance table is converted to an Excel file (using R) with several spreadsheets, each providing a different view of the acquired data.

These views include the abundance of virus taxa (excluding bacteriophages), bacteriophages, bacteria, eukaryotes and, optionally, other high-level domains. For each of these groups, abundances are reported at three taxonomic levels (family, genus, and species). This arrangement allows for a rapid overview of NGS results and convenient "zooming in" on the taxa of interest. Taxonomic abundances are also presented as an interactive Krona graph (Ondov et al., 2011) , which supports dynamic exploration of abundancies across different taxa. We also convert taxonomic abundances to CAMI Profiling Output Format (Sczyrba et al., 2017) . By providing standardised output we support benchmarking of our pipeline by unbiased third-party evaluation initiatives such as CAMI (Sczyrba et al., 2017) . Standardised output also aims to support simple and stable integration in automated workflows. To simplify accessibility, contigs for different taxa are sorted into a directory structure that follows the taxonomic hierarchy. A summary table is printed that lists all contigs for viruses, bacteriophages, bacteria and eukaryotes along with hits from SANSparallel or Centrifuge search.

In taxonomic profiling reads and contigs from the least abundant taxa have the highest risk of being misclassified. The organizers of the first CAMI competition addressed this problem by removing the last percentile of read distributions assigned by the compared taxonomic profilers (Sczyrba et al., 2017) . We implement a similar strategy by assigning each taxon a cumulative frequency distribution value (csum), which sums read frequencies mapped to that taxon and the more abundant taxa. We also assign confidence scores based on the csum score: the [0,95%] interval is assigned confidence 1, the [95%,99%] confidence 2 and the tail values [99%,100%] are assigned confidence 3. For a typical NGS library taxa with confidence score 1 will be true positives, those with score 3 (i.e. the last percentile) will be false positives and those with score 2 will represent borderline cases.

As an additional feature, Lazypipe offers an option to create interactive graphical reports that display the location and variation in viral contigs relative to reference viral genomes. This requires installation of a local database of viral reference genomes, which is then searched for taxa matching virus taxa found by the homology search. Contigs are aligned against the matching reference genomes with BWA-MEM and the resulting alignments are displayed with Integrative Genomics Viewer (Thorvaldsdóttir et al., 2013) in an internet browser.

In the last step we turn to quality control by generating graphical reports. The quality of the original library and assembly are monitored with histograms and key statistics. We also present the number of reads retained at consecutive pipeline steps: after quality filtering with Trimmomatic, after host genome filtering, after assembling and after gene detection. These are summarized as the survivalrate-plots.

We evaluated our pipeline on the following two sets of data: a simulated metagenome from the MetaShot project (Fosso et al., 2017) and a mock-virome and bacterial mock-community data (SRA reference SRR3458569) (Conceição-Neto et al., 2015) .

The MetaShot metagenome is a 20.5M PE 2x150 Illumina library simulated with ART [13]. We mapped reads in this library using accession numbers in read id-fields to 107 viral taxids, 99 prokaryote taxids and the human genome (94.5% of all reads). Strain taxids were further mapped using NCBI taxonomy to species, genus, family, order and superkingdom taxids resulting in 84 species and 46 genera of viruses, 71 species and 42 genera of bacteria. Based on this mapping we constructed a CAMI taxonomic profile (Sczyrba et al., 2017) , which was then used as the gold standard in pipeline evaluation.

The mock-virome and bacterial mock-community is composed from 9 virus cultures (Porcine We compared the performance of Lazypipe on the MetaShot benchmark against Kraken2 (Wood and Salzberg, 2014) , MetaPhlan2 (Truong et al., 2015) , and Centrifuge (Kim et al., 2016b) . Lazypipe was run with SANSparallel (referred to as Lazypipe) and Centrifuge (Lazipipe-nt) search engines.

Kraken2, MetaPhlan2, and Centrifuge were run with default settings. For Centrifuge we used the NCBI nucleotide non-redundant sequences database; alignments with <60 nt match were removed to improve precision. Classification results were converted to CAMI taxonomic profiles and evaluated against the golden standard using OPAL: a CAMI (Sczyrba et al., 2017) spinoff project implementing CAMI metrics for metagenomic profilers (Meyer et al., 2019) . For the simulated metagenome, we separately evaluated the entire taxonomic profile output by each of the pipelines and subprofiles limited to virus taxa.

Results for OPAL evaluation on the MetaShot benchmark are available from the project's website (https://www.helsinki.fi/en/projects/lazypipe). Precision, recall (syn. sensitivity) and F1-score (harmonic mean of precision and recall) for predicted virus taxa and for all predictions are listed in Tables 1 and 2, respectively. For predicted virus taxa both Lazypipe variants have very high precision and recall at both genus and species level (Table 1) . Lazypipe-nt has clearly the best balance between precision and recall, which is reflected in the highest F1-scores among the compared tools (Table 1 ). In the comparison of all predictions Lazypipe has the highest F1-score at the family and genus levels. Note that in this evaluation all methods have mediocre performance below the genus level. Lazypipe and Centrifuge are challenged with false positives and MetaPhlan2 and Kraken2 with false negatives (Table 2) . To evaluate the performance of Lazypipe on real data, we ran the Lazypipe analysis with default settings on the mock-community data (for results please see project's webpage). Recovery of the 9 mock-community viral taxa was evaluated by manual inspection of Lazypipe summary tables.

Lazypipe recovered all 7 eukaryotic viruses included in the mock-virome. Moreover, the correct eukaryotic viruses were the only eukaryotic viruses predicted for this data with acceptable confidence scores (scores 1 and 2; excluding score 3, which has a high risk of being false positive). Thus, we had 100% sensitivity and 100% precision for the eukaryotic viruses at the species level. Lazypipe also reported the Dickeya LIMEstone virus, but did not report the Acanthamoeba polyphaga mimivirus.

We compared the execution time of Lazypipe, Kraken2, MetaPhlan2 and Centrifuge on the MetaShot simulated metagenome on a GNU/Linux machine with 64 2300 MHz CPUs. All programs were run with 16 threads. The wall clock time in the order from the fastest to the slowest was: Kraken2 (2min 30sec), Centrifuge (21min 43sec), MetaPhlan2 (2h 21min 21sec) and Lazypipe (4h 31min 51sec).

Comparing this order to Tables 1 and 2 we see a trade-off between accuracy and speed. The fastest tools (Kraken2 and Centrifuge) are the least accurate, and the most accurate tools (Lazypipe and MetaPhlan2) are the slowest. Although Lazypipe is about twice as slow as MetaPhlan2, it is more accurate and creates annotated assembly, which is not done by any of the compared tools. We also note that key subprograms employed by Lazypipe (i.e. BWA, Megahit, SANSparallel and SAMtools)

have parallel implementation and are expected to have good scalability.

In addition to the mock-community data we tested the performance of Lazypipe using real data from different sample types (cerebrospinal fluid, serum, faeces, tissue samples) derived from various host species.

Since the pipeline is designed also for the detection of unknown viruses, we explored various sources for virus discovery with a by default unknown viral diversity. As an example of searching for the causative agents of veterinary disease, we analyzed sequence data derived from a faecal sample of a mink with gastroenteritis manifesting as diarrhea. Altogether, Lazypipe detected multiple contigs that indicated the presence of virus genomes (see Table 3 ). Notably, one contig contained a large ORF that most likely represents a novel picorna-like virus (order Picornavirales) with only 30% amino acid identity to the closest match. In addition, partial genomes of a toti-like virus with 29-32% amino acid identity to Beihai toti-like virus 4 (contig length 3792) and with 38-48% aa identity to Hubei unio douglasiae virus 1 (contig length 3219) were detected together with smaller fragments of other yet unclassified viruses (see Table 3 ). Sapovirus genotype XII Displaying contigs exceeding 500 nucleotides in length. Length (nt), contig nt length, Identity (%), amino acid identity to the closest match.

In addition to the above, virus groups with well-known association to the gastrointestinal system were detected. These included members of family Caliciviridae and Parvoviridae. Of the family Caliciviridae, six norovirus and six sapovirus contigs were detected. More thorough examination suggested that the norovirus contigs constitute a complete genome of a new representative of noroviruses with 89% amino acid identity in ORF1 (non-structural polyprotein) to norovirus genotypes IV and VI found in cats and dogs (Ford-Siltz et al., 2019), 63% amino acid identity in ORF2 (VP1) protein to genotype II found in pigs and 57% amino acid identity to genotype II in ORF3 (VP2).

The sapovirus contigs constituted a complete genome with 80-81% amino acid identity in ORF1

(including VP1 72-73% aa identity) and 45% amino acid identity in ORF2 (minor capsid protein VP2)

to sapovirus GXII previously detected in minks (Oka et al., 2016; Guo et al., 2001) .

In addition to these, short low coverage contigs matching to Atlantic salmon calicivirus (78-100% amino acid identity) were detected. Most likely, these are derived from the feed.

Of the family Parvoviridae, the largest contig (3069 nucleotides) matched to chicken chapparvovirus 2 spanning from 3'end of the 5'end of VP1, whereas another large contig (2448 nucleotides) contained 3'end of NS protein with 44% aa similarity to Chiropterian protoparvovirus and the 5' end of VP1 protein with 39% aa identity to Aleutian mink disease virus (amdoparvovirus). In addition to these, small fragments of mink bocaparvovirus were detected.

As an example of testing the suitability of Lazypipe for human clinical samples and exploring its use for detection of viral pathogens in humans, we used sequence data derived from human CSF, serum, brain tissue and nasopharyngeal swab samples that were previously tested positive for entero-, entero/parecho-, tick-borne encephalitis and SARS-coronavirus-2 viruses, respectively (Table 4) . (Haveri et al., 2020) , a nearly complete SARS-coronavirus-2 (SARS-CoV2) genome and fragments of Human mastadenovirus C sequences were retrieved (see Table 4 ). 

We also analysed samples of arthropod vectors. From an Ixodes ricinus tick sample collected from the Kotka archipelago in 2011, complete genomes of both Siberian subtype TBEV and a novel Alongshan virus (Kuivanen et al., 2019) were obtained (Table 4 ).

We analyzed public Illumina HiSeq/MiSeq libraries sequenced from bronchoalveolar lavage fluid from five patients with pneumonia at the early stage of the COVID-19 outbreak in Wuhan, China. Nine public NGS libraries were collected from NCBI SRA database (BioProject PRJNA605983) and analysed with Lazypipe. By applying default settings, we intentionally recreated a scenario, in which NGS data from SARS patients would be analysed prior to identifying the causative agent. SARS-CoV was identified by Lazypipe in all patients and in eight out of nine NGS libraries (Table 5) . Lazypipe also identified co-infection with Influenza A in two out of five patients (Table 5) . 

The availability of robust bioinformatics pipelines for viral metagenomics continues to be one of the critical steps in many research projects. Many of the existing pipelines are hindered by one or several limitations including large locally installed reference databases, slow homology search engines employed, low sensitivity for novel divergent sequences, low precision/recall performance for viral taxa or the lack of benchmarking for viral taxon retrieval, and the lack of assembling and contig annotation steps in the analysis. These limitations slow down the use of the unbiased sequencing approaches for rapid detection of novel emerging viruses.

In this publication we present Lazypipe, a novel bioinformatics pipeline that addresses the limitations typically encountered in viral metagenomics. Lazypipe avoids installation of large reference databases by delegating homology search to an external server. This frees the user from the need to install, index and update local reference databases, which can pose serious technical and resource constrains due to the sheer size of the modern sequence databases. By using SANSparallel (Somervuo and Holm, 2015) we also make Lazypipe considerably faster than pipelines based on BLASTP, and, simultaneously, render Lazypipe sensitive to highly divergent sequences, because viral peptides tend to be more conservative than the nucleotide sequences.

Taxonomic profiling by Lazypipe is done by querying assembled contigs instead of the reads, which translates into highly accurate taxonomic profiling of viral taxa. Benchmarking on simulated data showed that Lazypipe was clearly the most accurate taxonomic profiler for viral taxa among the four software packages compared. Testing on real mock community data demonstrated precision and recall nearing 100% for eukaryotic viruses. The detection of multiple novel viruses from various environmental and clinical samples reported here and in previous studies that used Lazypipe analysis (Kuivanen et al., 2019; Forbes et al., 2019) demonstrates that Lazypipe is also well suited for the detection and characterisation of novel and highly divergent viral genomes. Reflecting on the SARS- These examples demonstrate the efficacy of Lazypipe data analysis for NGS libraries with very different DNA/RNA backgrounds, ranging from mammalian tissues to pooled and crushed arthropods.

The current pandemic highlights the need for an efficient and unbiased way to screen 1) for previously unknown viruses from either wildlife and arthropods for potential viral diversity that may emerge as human or animal pathogens, or 2) from individuals or human populations, production animals or companion animals manifesting with a disease of unknown aetiology for previously unknown or atypical causative agents. We showed here that Lazypipe can contribute to both of these important efforts and that it was able to detect the causative agent of the current pandemic without prior information.

Lazypipe user manual and other resources are hosted at the project's website (https://www.helsinki.fi/en/projects/lazypipe). Lazypipe source code is freely available from the Bitbucket repository (https://bitbucket.org/plyusnin/lazypipe/). 

viGEN: An Open Source Pipeline for the Detection and Quantification of Viral RNA in Human Tumors

The intestinal microbiota: its role in health and disease

Trimmomatic: a flexible trimmer for Illumina sequence data

Raw sewage harbors diverse viral populations. mBio, 2

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

Modular approach to customise sample preparation procedures for viral metagenomics: a reproducible protocol for virome analysis

Bombali Virus in Mops condylurus Bat

Genomics Analyses of GIV and GVI Noroviruses Reveal the Distinct Clustering of Human and Animal Viruses

MetaShot: an accurate workflow for taxon classification of host-associated microbiome from shotgun metagenomic data

Predicting virus emergence amid evolutionary noise

Unbiased Detection of Respiratory Viruses by Use of RNA Sequencing-Based Metagenomics: a Systematic Comparison to a Commercial PCR Panel

Detection and molecular characterization of cultivable caliciviruses from clinically normal mink and enteric caliciviruses associated with diarrhea in mink

Serological and molecular findings during SARS-CoV-2 infection: the first case study in Finland

Development of a virus detection and discovery pipeline using next generation sequencing

Genome Sequences of Coxsackievirus B5 Isolates from Two Children with Meningitis in Australia

The intestinal microbiota and its role in human health and disease

Centrifuge: rapid and sensitive classification of metagenomic sequences

Centrifuge: rapid and sensitive classification of metagenomic sequences

PathSeq: software to identify or discover microbes by deep sequencing of human tissue

Detection of novel tick-borne pathogen, Alongshan virus, in Ixodes ricinus ticks, south-eastern Finland

MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv Prepr

The sequence alignment/map format and SAMtools

VIP: an integrated pipeline for metagenomics of virus identification and discovery

Early life dynamics of the human gut virome and bacterial microbiome in infants

TheViral MetaGenome Annotation Pipeline(VMGAP):an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data

Assessing taxonomic metagenome profilers with OPAL

The metagenomics RAST server -a public resource for the automatic phylogenetic and functional analysis of metagenomes

Metagenomics and future perspectives in virus discovery

READSCAN: a fast and scalable pathogen discovery program with accurate genome relative abundance estimation

The Intestinal Virome and Immunity

MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes

Disease-specific alterations in the enteric virome in inflammatory bowel disease

Genetic Characterization and Classification of Human and Animal Sapoviruses

Interactive metagenomic visualization in a Web browser

Diagnostic metagenomics: potential applications to bacterial, viral and parasitic infections

Challenges in the analysis of viral metagenomes

Metavir 2: new tools for viral metagenome comparison and assembled virome analysis

Critical assessment of metagenome interpretation-a benchmark of metagenomics software

Recovering full-length viral genomes from metagenomes

Fecal microbiota of healthy and diarrheic farmed arctic foxes (Vulpes lagopus) and American mink (Neovison vison)-a case-control study

SANSparallel: interactive homology search against Uniprot

The Bioperl toolkit: Perl modules for the life sciences

MePIC, metagenomic pathogen identification for clinical specimens

Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration

MetaPhlAn2 for enhanced metagenomic taxonomic profiling

Genome Detective: an automated system for virus identification from highthroughput sequencing data

VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data

VIROME: a standard operating procedure for analysis of viral metagenome sequences

Improved metagenomic analysis with Kraken 2

Kraken: ultrafast metagenomic sequence classification using exact alignments

Velvet: algorithms for de novo short read assembly using de Bruijn graphs

VirusSeeker, a computational pipeline for virus discovery and virome composition analysis

Ab initio gene identification in metagenomic sequences