key: cord-0002149-wjf3c7a7 authors: Friis-Nielsen, Jens; Kjartansdóttir, Kristín Rós; Mollerup, Sarah; Asplund, Maria; Mourier, Tobias; Jensen, Randi Holm; Hansen, Thomas Arn; Rey-Iglesia, Alba; Richter, Stine Raith; Nielsen, Ida Broman; Alquezar-Planas, David E.; Olsen, Pernille V. S.; Vinner, Lasse; Fridholm, Helena; Nielsen, Lars Peter; Willerslev, Eske; Sicheritz-Pontén, Thomas; Lund, Ole; Hansen, Anders Johannes; Izarzugaza, Jose M. G.; Brunak, Søren title: Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers date: 2016-02-19 journal: Viruses DOI: 10.3390/v8020053 sha: 9c8a4fff809abaa0b5ede6a28db62da5edd082e1 doc_id: 2149 cord_uid: wjf3c7a7 Virus discovery from high throughput sequencing data often follows a bottom-up approach where taxonomic annotation takes place prior to association to disease. Albeit effective in some cases, the approach fails to detect novel pathogens and remote variants not present in reference databases. We have developed a species independent pipeline that utilises sequence clustering for the identification of nucleotide sequences that co-occur across multiple sequencing data instances. We applied the workflow to 686 sequencing libraries from 252 cancer samples of different cancer and tissue types, 32 non-template controls, and 24 test samples. Recurrent sequences were statistically associated to biological, methodological or technical features with the aim to identify novel pathogens or plausible contaminants that may associate to a particular kit or method. We provide examples of identified inhabitants of the healthy tissue flora as well as experimental contaminants. Unmapped sequences that co-occur with high statistical significance potentially represent the unknown sequence space where novel pathogens can be identified. The International Agency for Research on Cancer (IARC) lists several biological species with carcinogenic potential in humans [1] . This list comprises a bacterium (species Helicobacter pylori), three parasitic flukes (Clonorchis sinensis, Opisthorchis viverrini and Schistosoma haematobium), and seven viruses: human papillomaviruses (HPV), human immunodeficiency virus-1 (HIV-1), Epstein-Barr virus (EBV), hepatitis B and C virus (HBV and HCV), Kaposi's sarcoma-associated herpesvirus (KSHV), and human T-cell lymphotropic virus type 1 (HTLV-1). With the advent and spread of low-cost sequencing technologies, many viruses were discovered in the last decade [2] [3] [4] [5] [6] [7] [8] . One interesting discovery that fuelled the search for oncoviruses was Merkel cell polyomavirus (MCPyV) found to be clonally integrated in Merkel cell carcinomas [9, 10] . The computational biology community has promptly responded to the growing need for specialised algorithms and pipelines to analyse the wealth of data [9, [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] . Table S1 summarises the main features of some of the common approaches. In spite of particularities in the implementation, these methodologies share key conceptual similarities: First, sequencing reads or assembled contigs that originate from the host are identified and discarded, a process termed computational subtraction [9, 13] . When the genomes or the concentrations of foreign species are small compared to host genomes, this step eliminates a substantial fraction of the total sequencing reads. Second, the remaining non-host sequences are compared to a library of known reference sequences for taxonomic characterisation. The aforementioned methods identify species present across multiple samples, and the recurrence of a given viral entity may indicate an association to disease [10, 26] . Albeit conceptually valid, this bottom-up approach is inherently limited to the pre-existence of the organism in the reference databases, whereas novel oncoviruses showing partial or no similarity to known sequences will be missed. Current efforts aiming at estimating and characterising metagenomic diversity are far from a complete mapping of the (viral) sequence-space [27] . In fact, it is common to observe that a small but significant amount of unknown sequences, the so-called dark matter [28] , goes through the current analysis pipelines without proper characterisation and is discarded from further analyses [24, 29, 30] . Here, we propose a method capable of identifying the recurrence of sequences across related samples independently of their existence in reference databases. Our top-down approach compares samples and establishes recurrence prior to the taxonomic characterisation of the sequences. Thus, enabling the identification of both known and novel biological entities. Our method has conceptual similarities to the work of Andreatta et al. [31] where clustering of genes is used to find families that are predominantly found in pathogenic bacteria. Attending to Koch's postulates as modified by Fredericks and Relman [32] , sequences from biological entities with a causative or facilitator role would be present in diseased samples and absent in healthy controls. In addition, recent studies documented the presence of contaminating and/or artefactual sequences that source from the laboratory kits and reagents used for sample processing and library preparation [14, [33] [34] [35] [36] [37] . If not properly addressed, these confounding observations may lead to erroneous conclusions [38, 39] . Our method ascertains the statistical associations between recurrent sequences and a collection of features that describe the samples with respect to tissue, disease type, laboratory method, etc. Additionally, the presence of other known technical problems, such as cluster invasion on the sequencing flow cells [40] , might be detected. The study was conducted in accordance with the Declaration of Helsinki. Two ethical boards reviewed the protocol of this study: The Regional Committee on Health Research Ethics (Case No. H-2-2012-FSP2) and the National Committee on Health Research Ethics (Case No. 1304226). Because the study used only samples that were anonymised at collection both boards waived the need for informed consent in compliance with the national legislation in Denmark. Two hundred and fifty-two cancer samples of 17 different types were collected from various locations in Denmark and Hungary. Cancer samples of malignant melanoma, acute myeloid leukaemia (AML), B-cell chronic lymphocytic leukaemia (B-CLL), chronic myelogenous leukaemia (CML), and T-lineage acute lymphoblastic leukaemia (T-ALL; n = 9) were obtained from Aarhus University Hospital, Denmark. B-cell precursor acute lymphoblastic leukaemia (BCP-ALL), oropharyngeal head and neck cancer, testicular cancer, and T-ALL (n = 2) were obtained from Rigshospitalet, Denmark (Copenhagen University Hospital). Basal cell carcinoma, and mycosis fungoides (cutaneous T-cell lymphoma) were obtained from Bispebjerg Hospital (Copenhagen University Hospital). Samples of bladder cancer, breast cancer, colon cancer, as well as ascites fluid of breast cancer, colon cancer, ovarian cancer, and pancreatic cancer were obtained from the Danish Cancer Biobank, Herlev Hospital, Denmark. B-cell lymphoma cell lines were obtained from Aalborg University Hospital, Denmark. Vulva cancer samples were obtained from the National Institute of Oncology, Budapest, Hungary. Libraries were prepared at the Center for GeoGenetics (CGG), University of Copenhagen, Denmark based on seven different methods for sample processing comprising five different enrichment methods and shotgun sequencing targeting total DNA or RNA (Table S3 ). The enrichment methods used in the current work were circular genome amplification, sequence capture with retrovirus probes, virion enrichment (DNA and RNA), and mRNA enrichment. Further details on sample processing and library preparation have been published elsewhere [37, 41, 42] , except for mRNA enrichment which was performed using Dynabeads mRNA direct extraction kit (Thermo Fisher Scientific, Waltham, MA, USA) followed by ScriptSeq v2 RNA-Seq Library Preparation kit as for total RNA analysis [41] . Ultimately, the data set consisted of 686 DNA and RNA libraries, for which 2ˆ100 bp paired end sequencing was performed using the Illumina HiSeq 2000 platform at BGI-Europe, Copenhagen, Denmark. The 686 sequencing libraries thus originated from 252 different cancer samples, 32 non-template controls, and 24 exogenous controls. The distribution of methods, libraries and controls for each sample type is provided in Table S2 . Samples were preferably analysed with multiple methods, thus 165 out of 252 samples were analysed with more than one laboratory method (Table S3 ). The datasets went through a sequential pipeline with modules (in order) of preprocessing, computational subtraction of host sequences, low-complexity sequence removal, sequence assembly, clustering, association to metadata features, and taxonomical annotation. Figure 1 provides a schematic representation of the pipeline used to identify recurrent sequences across related samples. Demultiplexing was performed using a local python script to partition the reads based on exact matches in the FASTQ header lines to the multiplexed indices provided. Preprocessing of reads was performed for all datasets in parallel using AdapterRemoval [43] with the following parameters {-trimns, -trimqualities, -minquality 2, -minlength 30, -collapse, -outputcollapsed, -outputcollapsedtruncated, -singleton}. Read ends were trimmed for low quality base calls. Reads were discarded if the length after trimming fell below 30 bp. In these cases, the other read in a pair was kept as a singleton. Overlapping paired reads from short inserts were collapsed into a single read if the overlap was longer than 11 bp, according to the default behaviour of AdapterRemoval. Preprocessed reads were filtered if they showed homology to the human reference genome, which included decoys and alternative sequences from version GCA_000001405.15 (GRCh38) of the Genome Reference Consortium (downloaded August 20, 2014) . Mapping to the human genome was done using BWA [44] version 0.7.10-r789 with the MEM alignment algorithm and default parameters. All mapped reads without Sequence Alignment/Map (SAM) [45] flag 4 were discarded. Single-unmapped reads from read pairs were kept. Human depleted reads were filtered for low-complexity regions using the NCBI-BLAST associated module DustMasker [46] and default parameters. Reads containing low-complexity stretches of 25 bp or longer were discarded. Assembly of the remaining (non-human, high complexity) reads was performed with IDBA-UD [47] and parameters {-precorrection}. Contigs shorter than 200 bp were discarded. A total of 1,387,377 contigs, originating from the 686 data sets, went through the entire pipeline. Contigs ranged from 200 bp to 418,807 bp with an overall N50 of 817 bp. Contigs from all data sets were pooled and clustered based on pairwise sequence homology using CD-HIT [48] , in fast mode. We chose parameters for clustering that maximised grouping of similar sequences while minimising inclusion of unrelated sequences. We Table S4 from where we chose the final settings {-c 0.90 -aS 0.90 -G 1} The datasets were described with a panel of 404 different binary metadata features, for example tissue or disease characteristics (Table S5) . Features logically assessed whether they related to a particular dataset or not. Features describing less than five datasets were removed. Additionally, features that correlated perfectly in terms of Matthew's correlation coefficient (MCC =˘1) were merged. These filters resulted in 143 unique features (Table S5) . Biological features (n = 25) defined sample type, for instance tissue or disease category. Methodological features (n = 49) described specifics for sample preparation such as extraction kits, enrichment methods, polymerases, primers, buffer, filters used, or the laboratory where the work was performed, etc. Technical features (n = 69) defined the flow cell lane identifiers and whether resequencing was performed. The distributions of datasets and samples across the features are provided in Table S5 . Associations in the clustered contigs and metadata features were evaluated with Fisher's exact test using a one-tailed alternative hypothesis (greater) and calculated in R using the function fisher.test [49] . Annotation of taxonomy was performed in two rounds. First aligning contigs with BLASTn [50] with parameters {-evalue 0.001} using default {-task megablast} to a frozen version of the NCBI nucleotide database nt (downloaded 3 February 2015). Secondly, using BLASTx with parameters {-evalue 0.001} of all unmapped contig stretches to a frozen version of the NCBI non-redundant protein database nr (downloaded 3 February 2015). The best hit by highest bit-score was kept for each contig. The taxonomy database (downloaded 3 February 2015) was used to translate all GenBank identifiers from hits to taxonomy identifiers. The taxonomy identifiers were then used to obtain the complete taxonomical lineage and extract scientific names of species. The abundances of all species in each cluster were used to calculate the species evenness index as defined by Mulder et al. [51] . Clusters were annotated as the most abundant species in each cluster. The software to use after the assembly step has been uploaded at https://github.com/ jensfriisnielsen/sequence_recurrence. Sequence clusters that have been described in detail throughout the manuscript have been included as supplementary files. Clustering performance depends on the adequate selection of parameters. We experimented with a variety of configurations described by c0xaY0yGz where x,Y,y,z varied. The variables denote minimum percentage of sequence identity x (c0x), minimum percentage of alignment length y (aY0y) based on mode Y of shortest (aS) or longest (aL) contig in alignment, as well as using local (G0) or global (G1) alignment mode z (Gz). For example, a configuration encoded C090aS090G1 would represent a clustering that requires global alignments with a 90% minimum sequence identity over 90% of the length of the shortest contig. The full list of investigated parameter combinations can be found in Table S4 . We chose the parameters based on the performance of the clustering of expected contaminant sequences from avian leukosis virus (accession id AY350569) [37] and other related avian retroviruses (ARs) such as avian myeloblastosis virus [52] . ARs are used in the manufacture of the reverse transcriptase FailSafe PCR enzyme (Epicentre, Madison, WI, USA) included in the utilized ScriptSeq v2 RNA-Seq Library Preparation kit (Illumina, San Diego, CA, USA). This kit is commonly used for preparation of RNA libraries [52] . We identified clusters containing contigs that aligned to species of the Alpharetrovirus genus (NCBI taxa-id: 153057) according to BLASTn and BLASTx hereafter referred to as AR clusters. All contigs in AR clusters were resolved with BLASTn and BLASTx and two metrics were considered for AR clusters. As the first performance metric, we computed the odds ratios (ORs) of the associations between the presence of AR in the clusters and the use of the ScriptSeq kit. We used a 2ˆ2 contingency table defining the sets of libraries: AR positive and ScriptSeq positive (ARpSSp); AR positive and ScriptSeq negative (ARpSSn); AR negative and ScriptSeq positive (ARnSSp); AR negative and ScriptSeq negative (ARnSSn). OR is then defined as the ratio ARpSSpˆARnSSn / (ARpSSnˆARnSSp) and describes the strength of the association between clusters and features. ORs above 1 indicate association between the presence of the AR virus and the use of the ScriptSeq kit. ORs for all AR clusters were inspected in different parameter settings ( Figure S1 ). The ORs varied mostly block-wise with the parameters. The largest differences observed were between usages of the shortest or longest sequence in alignments with the alignment length filter. Associations from the shortest mode tended to have higher dispersion in the range of ORs. Furthermore, one block of clustering results using global alignment mode, alignment length based on the shortest contig, and a minimum sequence identity of 90% (c09ˆaSyG1), had an overall high range of ORs as well as the highest minimum values. This suggested that the clustering was able to reproduce the association between AR clusters and the ScriptSeq kit. In contrast, the clustering with parameter settings c080aS030G0 had a very broad range of ORs corresponding to a skewed clustering where some clusters had incorporated most sequences and left other clusters with only a few contigs. As a second performance metric we computed the species evenness [51] indices of the AR clusters represented in Figure S1 . The species evenness index is a score that derives from the Shannon's diversity index [53] and compares the abundance of each species within a cluster. An index of zero is assigned to clusters that are constituted uniquely by contigs mapping to a single species. Contrarily, scores closer to 1 would indicate that the cluster points to several species and that these are equally abundant. In our experiment, we favoured lower evenness indices as they indicate that clusters were able to single out species correctly. For example, parameter settings c080aL030G1 generally had a high level of species evenness (median 0.73) in clusters, suggesting an incorrect separation of species. In stark contrast, a block of parameters using global alignment mode, alignment length based on shortest sequence, 90% minimal sequence identity, and a minimum alignment length of 80/85/90/95/99% of the shortest sequence (c090aSyG1) all had a median species evenness of 0. This group of parameter settings also showed desirable performance in terms of OR, as mentioned before. Generally it seemed that global mode (G1) had better ORs than local mode (G0) when keeping other parameters fixed. Additionally keeping 90% minimal sequence identity (c090) and varying minimal length of alignment in shortest mode (aS) seemed stable in both ORs and species evenness indices indicating that these close parameter settings were generally good. We chose to proceed with a clustering based on global alignments with a 90% minimum sequence identity over 90% of the length of the shortest contig (c090aS090G1). This configuration resulted in a total of 681,858 clusters. Of these, 23,205 clusters contained contigs from at least five different data sets and represented 546,735 different contigs. The full distribution of cluster sizes can be found in Table S6 . The associations between the clusters and the binary metadata features were assessed using a Fisher's one tailed exact test. There were 16,567 significant associations having p-value < 3.01e-10, corresponding to a 0.001 significance level when using Bonferroni's correction for multiple testing [54] . The significant associations were arranged in 6165 unique clusters and with 73 unique features. The distribution of the significant associations showed that recurrent sequences originated from diverse sources and that individual clusters often associated to more than one feature (Figure 2 ). Furthermore it is evident that the clusters tend to group in their associations. Likely, these groupings represent one or more organisms. We investigated the nature of the clusters accounting only for the associated feature with the smallest p-value; hereafter described as the strongest associations. There were 50 unique features involved in all the strongest associations. The distribution of p-values for each feature is represented in Figure 3 . The 6165 strongest associations were distributed according to 602 biological, 5045 methodological and 518 technical associations. These unique features were arranged in 3 biological, 24 methodological and 23 technical features. Most p-values were above 1e-24 and associations with lower p-values were to a few methodological features annotated as extraction kits: QIAamp DNA mini kit (f056) (Qiagen, Hilden, Germany), DNase/RNase: Promega DNase (f068) (Promega, Madison, WI, USA), and DNase/RNase: Promega DNase stop solution (f069), purification kit: RNeasy MinElute, Qiagen (f076), library build: NEBNext, New England BioLabs (f079) (New England Biolabs, Ipswitch, MA, USA), and ScriptSeq v2 RNA-Seq, Illumina (f084); the latter with a minimum p-value of 3.04e-89. The boxes span the first and third quartiles. The dark band inside each box represents the median. The whiskers of the boxes extend to the lowest and highest values within a distance of 1.5 times the interquartile range. As can be seen, most p-values were above 1e-24, but a few methodological features have associated clusters with very low p-values, such as f056, f068, f069, f076, f079, and f084. The library preparation kit ScriptSeq v2 RNA-Seq, Illumina (f084) displays strongly associated clusters with p-values as low as 3.04e-89 that mapped as species Avian myeloblastosis-associated virus. Clusters that were annotated as NCBI species Parvovirus NIH/CQV were associated to laboratory-kit RNeasy MinElute, Qiagen (f076) with minimal p-value 5.48e-38. Finally, a cluster annotated as Acanthocystis turfacea chlorella virus MN0810.1 (ATCV) was associated to DNase/RNase: Promega DNase stop solution (f069) with p-value = 4.19e-12. Using BLAST and the NCBI taxonomy database a taxonomic characterisation was attempted for the 546,735 contigs in the 6165 clusters. This resulted in a taxonomical annotation of 3553 clusters using BLASTn and an additional 1630 clusters when using BLASTx. For 982 clusters, neither BLASTn nor BLASTx found significant species in the database. These clusters remained uncharacterised (Table 1) . We found that almost all clusters significantly associated to biological features could be annotated (598 of 602) in contrast to non-biologically associated clusters (4584 of 5563). A total of 1524 unique species were annotated corresponding to 5183 clusters. The Human Microbiome Project (HMP) [55] defines a collection of reference genomes built from metagenomic samples and associates these to specific sites and tissues across human body sites. We used this data set of 1317 associations as a confirmation that our pipeline was able to correctly detect and taxonomically characterise recurrent biologically relevant sequences. HMP provides a list of commensal organisms commonly found in the three sites that relate to our samples: the gastrointestinal tract, oral cavity and urogenital tract. We observed the strongest, significant associations between the expected organisms and biopsies from colon cancer, oral cavity cancer, and vulva cancer. The taxonomical characterisation of these clusters is described in Table 2 . Seven clusters significantly associated to colon cancer biopsies describing four different organisms that inhabit the gastrointestinal tract according to HMP, and 342 clusters significantly associated to oral cavity cancer describing 13 different organisms present in the oral cavity in HMP. Finally, we also discovered a cluster significantly associated to vulva cancer annotated as species Campylobacter ureolyticus (p-value = 1.03e-12), an inhabitant of the urogenital tract as described by HMP. Table 2 . Taxonomical characterisation of certain biologically associated clusters. The clusters are significantly associated with lowest p-values to biological features and the species annotations are described by HMP. In cases where several clusters shared the annotated species, the lowest p-value of the associations is given #sig: number of significant clusters. Cluster In the methodological associations, we correctly detected the strong known association (p-value: 3.04e-89) of avian myeloblastosis-associated virus (accession L10922.1) used in the manufacture of the ScriptSeq v2 RNA-Seq library preparation kit (f089). As the clustering parameters were evaluated with this known contaminant, this is expected. Furthermore, we annotated 19 clusters as NCBI taxonomy species Parvovirus NIH-CQV (accession KC617868.1; NCBI taxa-id 1341019), an established contaminant [34, 35] . The associated feature with lowest p-value to the parvovirus clusters suggested a contamination from the RNeasy MinElute purification kit (f076) manufactured by Qiagen (p-value: 5.48e-38). In addition, a single cluster annotated as NCBI taxonomy species Acanthocystis turfacea Chlorella virus MN0810.1 (accession JX997174.1, taxa-id 1278272) with lowest associated p-value (p-value = 4.19e-12) to laboratory kit DNase/RNase: Promega DNase stop solution (f069). ATCV-1 was previously reported as a contaminant [36] . In addition to the sequences that were characterised in the previous step, we found 982 examples of uncharacterised clusters. The contigs in these clusters varied substantially in length ranging from a minimum of 200 bp to a maximum of 33.6 kb (N50 = 617 bp). Our approach provides the capability to discover these recurrent novel sequences, but also permits the investigation of their plausible origin. Most associations were methodological (Table 1 ), probably sourcing from nucleotide sequences contained in various laboratory kits (Figure 4) . For instance, out of the 868 methodologically associated clusters, there were 648 associated clusters to the laboratory reagent DNase/RNase: Promega DNase stop solution (minimum p-value: 2.40e-36). Additionally, 110 recurring sequences were attributed to technical issues of the flow cell lanes (minimum p-value: 1.85e-21 in feature 383). In total, 4 unmapped clusters were associated to a biological feature, namely oral cavity cancer, with the longest contig of each cluster at 1789, 3247, 4661, and 4720 bp and with respective p-values of 1.01e-10, 1.01e-10, 1.17e-14, and 1.01e-10. To further clarify the unresolved biologically associated sequences, we manually investigated the cluster representatives using the newest databases (December 2015) at the NCBI web-interfaces for BLASTn, BLASTx and CCD v. 3.14 (conserved domains) [56] with default parameters and an e-value <0.001 (Table 3) . All cluster representatives could be explained as commensal bacteria related to the oral cavity as described by HMP. In order of increasing length, the cluster representatives were identified as: Prevotella veroralis, Prevotella veroralis, Prevotella fusca JCM 17724, and Peptostreptococcus anaerobius as the best hits with percent sequence identity: 92%, 90%, 91%, and 72%, respectively. Cluster representatives 3 and 4 contained both bacterial and phage-like conserved domains. The super family DUF4280 is of unknown function but related to bacteria and the ND2 super family is the nicotine adenine dinucleotide (NADH) dehydrogenase subunit 2 involved in electron transport. Conversely, Phage_base_V is related to the tail of phages and rve is an integrase domain that could also be explained as part of a transposon. Likely these sequences derived from less well-described parts of the microbiome. Usually, virus discovery in shotgun sequencing studies involves processing millions of reads in a viroinformatics pipeline. Existing tools typically offer a comprehensive taxonomical description of a single sample that is compared to the taxonomy of other samples to determine their relevance. A downside of this bottom-up methodology is that novel sequences that cannot be sufficiently well characterised in the first round are often discarded in the process. Another disadvantage is that potential contaminants will have to be controlled for in the post-processing of the data, an effort that is often omitted [38] . In the present study, we have presented a methodology to categorise recurring sequences according to experimental origin and metadata features. Additionally, using this methodology we could replicate both biological and methodological sequence associations known from the literature as well as pinpoint new unannotated recurring sequences. In this study, we had no datasets and features of healthy biological controls. We included a comparison to published reference genomes from HMP to validate that biologically co-occurring sequences can be found with the presented methodology. In this case, we are most likely observing normal biological inhabitants of the tissue samples, something our metadata scheme does not account for. The disease association of many of these organisms is obviously not fully known, and some of them could be related to disease features outside the cancer domain, features that we did not include in the present study. Optimising clustering parameters for one virus family might not result in the optimal separation of other families. Here, we optimised clustering parameters to rediscover the association of sequences to a known laboratory kit. Using these clustering parameters may result in a non-optimal separation of clusters that biologically belonged together, or the reverse problem-merged clusters that reflected different biological units. Optimal separation is likely problem-specific. Different taxonomic units would require the use of different clustering parameters to separate. However, choosing taxonomy-specific parameters requires a working hypothesis of the most likely findings. Here, we focused on the general problem of associating sequences to features using a known association to guide the choice of clustering parameters. A combination of several features may be the true foundation of particular sequences but this was not explored in this work. There may also be situations where a combination of clusters is the correct association to a particular feature. For instance, a virus that is present with a low titre may be sequenced sporadically resulting in less than full coverage and several non-coherent contigs from different viral genome regions. Each cluster may include an incomplete amount of data sets and thereby artificially show a weaker association. Merged and viewed as one, the incomplete clusters will have the correct strength of association. A grouping based on taxonomy, or a more data-driven approach that cluster sequence groups based on the associated datasets as seen in Figure 2 , could be included as another iteration to properly strengthen the statistical associations. Furthermore, forming clusters only by internal sequence identity may also miss pathogenic scenarios such as an oncovirus and any necessary helper viruses that do not share homology to the oncovirus. In the present study, we used a majority vote to assign taxonomy. There could be other ways to assign taxonomy, for instance, using a lowest common ancestor (LCA) strategy. A majority vote will likely introduce some false assignments if there are distant taxa involved in the sequence group present in nearly equal fractions. A LCA strategy can handle this but may reduce the taxonomic resolution to a level where there is no real gain of information. After determining what the significantly co-occurring sequence groups are, more effort might resolve interesting unmapped contigs. For instance, use of more sensitive alignment algorithms, profile Hidden Markov Models (HMMs), gene predictors, artificial neural networks trained on specific signals such as viral capsid sequences [57] , or PCR extraction followed by Sanger sequencing might provide the relevant clues. However, that was not within the scope of this study. The major advantage of the top-down approach is that it works without prior knowledge of the sequences. It is not dependent on reference sequence databases to single out the promising candidates for further analysis. The top-down method can determine the relevance of unknown sequences upfront while also systematically controlling for contamination by design. Most of the annotated sequences found in this study were sequenced from cancer specimens. However, it is apparent from the association analysis that several viral sequences detected are possibly contaminants or technical artefacts. Furthermore, the unmapped clusters are retained and easily arranged by relevance according to the nature of their associated features. Having this information helps precipitate a prioritised list of sequence candidates The quality of the associations will depend on the experimental design, sampling, available metadata, as well as the rigorousness and standardisation of both working routines and annotations. We stress the point that care must still be taken when formulating hypotheses and in the interpretation of associations. Virus discovery using high-throughput sequencing and especially characterising clinical samples is a challenge. Many viral discovery pipelines rely on similarity to reference databases as the most compelling argument for identifying putative sequences of medical or biological importance. Although a necessary step in the analysis, it has the downside of not considering novel sequences not included in reference sets as well as not considering the origins of the discoveries. There are many examples of contamination and technical artefacts; therefore, potential discoveries should be accompanied by convincing evidence that the sequences are not instead associated with the methodology or technology in use. We suggest a different approach that has complementary advantages inherent in the design. We show that we can differentiate between biological and non-biological associations, replicate known associations and potentially add new associations of cancer-associated viruses. Supplementary Materials: The following are available online at http://www.mdpi.com/1999-4915/8/2/53, Table S1 : Virus discovery pipelines, Table S2 : Distribution of library types, Table S3 : Methods in cancer samples, Table S4 : Clustering parameters, Table S5 : Feature descriptions, Table S6 : Datasets in clusters, Figure S1 : Clustering performance. Identification of a New Human Coronavirus Cloning of a Human Parvovirus by Molecular Screening of Respiratory Tract Samples New DNA Viruses Identified in Patients with Acute Viral Infection Syndrome Characterization and Complete Genome Sequence of a Novel Coronavirus, Coronavirus HKU1, from Patients with Pneumonia Identification of a Third Human Polyomavirus Identification of a Novel Polyomavirus from Patients with Acute Respiratory Tract Infections A Cornucopia of Human Polyomaviruses Human Transcriptome Subtraction by Using Short Sequence Tags To Search for Tumor Viruses in Conjunctival Carcinoma Clonal Integration of a Polyomavirus in Human Merkel Cell Carcinoma Identification of Novel Viruses Using VirusHunter -an Automated Data Analysis Pipeline CaPSID: A Bioinformatics Platform for Computational Pathogen Sequence Identification in Human Genomes and Transcriptomes Software to Identify or Discover Microbes by Deep Sequencing of Human Tissue Comprehensive Human Virus Screening Using High-Throughput Sequencing with a User-Friendly Representation of Bioinformatics Analysis: A Pilot Study Rapid Identification of Non-Human Sequences in High-Throughput Sequencing Datasets VirusFinder: Software for Efficient and Accurate Detection of Viruses and Their Integration Sites in Host Genomes through Next Generation Sequencing Data Characterization of the Viral Microbiome in Patients with Severe Lower Respiratory Tract Infections, Using Metagenomic Sequencing CREST Maps Somatic Structural Variation in Cancer Genomes with Base-Pair Resolution SVDetect: A Tool to Identify Genomic Structural Variations from Paired-End and Mate-Pair Sequencing Data A Cloud-Compatible Bioinformatics Pipeline for Ultrarapid Pathogen Identification from next-Generation Sequencing of Clinical Samples Faster and More Accurate Sequence Alignment with SNAP RAPSearch2: A Fast and Memory-Efficient Protein Similarity Search Tool for next-Generation Sequencing Data Mapping Short DNA Sequencing Reads and Calling Variants Using Mapping Quality Scores Full Genome Virus Detection in Fecal Samples Using Sensitive Nucleic Acid Preparation, Deep Sequencing, and a Novel Iterative Sequence Classification Algorithm Integrative Analysis of Environmental Sequences Using MEGAN4 A New Arenavirus in a Cluster of Fatal Transplant-Associated Diseases Metagenomics and Future Perspectives in Virus Discovery A Highly Abundant Bacteriophage Discovered in the Unknown Sequences of Human Faecal Metagenomes Complete Viral RNA Genome Sequencing of Ultra-Low Copy Samples by Sequence-Independent Amplification What's in Your next-Generation Sequence Data? An Exploration of Unmapped DNA and RNA Sequence Reads from the Bovine Reference Individual Silico Prediction of Human Pathogenicity in the γ-Proteobacteria Sequence-Based Identification of Microbial Pathogens: A Reconsideration of Koch's Postulates Failure to Confirm XMRV/MLVs in the Blood of Patients with Chronic Fatigue Syndrome: A Multi-Laboratory Study The Perils of Pathogen Discovery: Origin of a Novel Parvovirus-Like Hybrid Genome Traced to Nucleic Acid Extraction Spin Columns Novel Hybrid Parvovirus-Like Virus, NIH-CQV/PHV, Contaminants in Silica Column-Based Nucleic Acid Extraction Kits Traces of ATCV-1 Associated with Laboratory Component Contamination Investigation of Human Cancers for Retrovirus by Low-Stringency Target Enrichment and High-Throughput Sequencing False-Positive Results in Metagenomic Virus Discovery: A Strong Case for Follow-Up Diagnosis Hybrid DNA Virus in Chinese Patients with Seronegative Hepatitis Discovered by Deep Sequencing. Proc. Natl. Acad. Sci High-Throughput DNA Sequencing -Concepts and Limitations Target-Dependent Enrichment of Virions Determines the Reduction of High-Throughput Sequencing in Virus Discovery New Type of Papillomavirus and Novel Circular Single Stranded DNA Virus Discovered in Urban Rattus Norvegicus Using Circular DNA Enrichment and Metagenomics AdapterRemoval: Easy Cleaning of Next Generation Sequencing Reads Aligning Sequence Reads, Clone Sequences and Assembly Contigs with BWA-MEM The Sequence Alignment/Map Format and SAM tools A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences IDBA-UD: A de Novo Assembler for Single-Cell and Metagenomic Sequencing Data with Highly Uneven Depth Cd-Hit: A Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing Basic Local Alignment Search Tool Species evenness and productivity in experimental plant communities Avian Myeoloblastosis Virus (AMV): Only One Side of the Coin A Mathematical Theory of Communication How Does Multiple Testing Correction Work? A Framework for Human Microbiome Research CDD: NCBI's Conserved Domain Database Artificial Neural Networks Trained to Detect Viral and Phage Structural Proteins The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.