key: cord-0328893-nv3o4pzd
authors: Porter, Ashleigh F.; Cobbin, Joanna; Li, Cixiu; Eden, John-Sebastian; Holmes, Edward C.
title: Metagenomic identification of viral sequences in laboratory reagents
date: 2021-09-11
journal: bioRxiv
DOI: 10.1101/2021.09.10.459871
sha: c2feb614d5ae2ce3abce8d4b05928bb15e57b4ea
doc_id: 328893
cord_uid: nv3o4pzd

Metagenomic next-generation sequencing has transformed the discovery and diagnosis of infectious disease, with the power to characterize the complete ‘infectome’ (bacteria, viruses, fungi, parasites) of an individual host organism. However, the identification of novel pathogens has been complicated by widespread microbial contamination in commonly used laboratory reagents. Using total RNA sequencing (“metatranscriptomics”) we documented the presence of contaminant viral sequences in multiple ‘blank’ negative control sequencing libraries comprising a sterile water and reagent mix. In total, we identified 14 viral sequences in 7 negative control sequencing libraries. As in previous studies, several circular replication-associated protein encoding (CRESS) DNA virus-like sequences were recovered in the blank libraries, as well as contaminating sequences from the RNA virus families Totiviridae, Tombusviridae and Lentiviridae. These data suggest that the contamination of common laboratory reagents is likely widespread and can comprise a wide variety of viruses.

3. Data summary

The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.

1.5 Repositories

The viral genome sequence data generated in this study have been deposited in the NCBI database under accession numbers MZ824225-MZ824237. Sequence reads are available at the public Sequence Read Archive (SRA) database under accession SRX6803604, and under BioProject accession PRJNA735051 (reference numbers SRR14737466-71, BioSample numbers SAMN20355437-40).
Culture-independent methods, particularly metagenomic next-generation sequencing (mNGS), have revolutionised pathogen discovery, streamlined pathways of clinical diagnosis, and enhanced our ability to track infectious disease outbreaks [1], including the current [...]

[...] combination of e-value, hit length, and percentage similarity to determine the potential of a contig to be a viral sequence. The abundance of reagent-associated reads was calculated by [...]

In total, we identified 14 reagent-associated viral sequences in the negative (blank) control samples tested, including seven CRESS-like viral sequences, four novel Tombusviridae-like viral sequences, and single Lentivirus-like and Totiviridae-like viral sequences.

The abundance of reads in each library was calculated to compare the percentage of reads associated with viruses (Figure 1). This revealed that the virus-associated contigs identified were predominantly CRESS-like (Figure 1b-e). The L5 library only contained one virus- [...]

[...] (Table 2), containing what we hypothesise are bona fide viruses, reveals a pattern of host-based clustering (Figure 3). In particular, this phylogeny was characterised by two distinct clades: circoviruses, associated with vertebrate hosts, and cycloviruses, associated with invertebrates.

Finally, the remaining novel sequence was related to the totiviruses, a family of double-stranded RNA viruses commonly associated with fungi. The novel totivirus-like sequence was termed Reagent-associated toti-like virus. It was used in an alignment of the RdRp protein (Table 2), from which a phylogenetic tree was estimated (Figure 6). This revealed that the sequence appears to be related to Scheffersomyces segobiensis virus (83% amino acid identity), which is associated with the fungus Scheffersomyces segobiensis.

Circoviridae.
Accordingly, we divided the family into sub-groups, termed here "host-associated circoviruses" (Figure 3) and "CRESS and CRESS-like viruses", and performed phylogenetic analyses on each (Figure 2). Notably, in the "host-associated circovirus" phylogeny, viruses clustered based on broad host species of origin. In contrast, within the "CRESS and CRESS-like" phylogeny, clades could not be defined based on specific hosts or environments, and while many samples were originally derived from marine- or faeces-associated environments, these sequences did not cluster together.

Interestingly, however, one reagent-associated totivirus identified in this study is distantly related to known totiviruses. We recommend that caution be taken when identifying novel totiviruses, especially if they are related to reagent-associated toti-like virus.

The authors declare that there are no conflicts of interest.

We acknowledge the University of Sydney high performance computing cluster Artemis and the Sydney Informatics Hub, which were used for the analyses in this study.
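
The two calculations mentioned in the text (screening contigs by e-value, hit length, and percentage similarity, then expressing virus-associated reads as a percentage of each library) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The cut-off values, the `Hit` record, and the function names are hypothetical, since the thresholds and the read-counting method are not stated in the surviving text. Per-contig read counts are assumed to be available from a separate read-mapping step.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    contig: str     # contig identifier
    evalue: float   # e-value of the best protein hit
    length: int     # alignment length of the hit (amino acids)
    pident: float   # percentage identity to the best hit
    group: str      # viral group of the best hit, e.g. "CRESS-like"


# Hypothetical cut-offs chosen for illustration only.
MAX_EVALUE = 1e-5
MIN_LENGTH = 100
MIN_PIDENT = 30.0


def is_candidate(hit: Hit) -> bool:
    """Return True if the hit passes all three illustrative cut-offs."""
    return (
        hit.evalue <= MAX_EVALUE
        and hit.length >= MIN_LENGTH
        and hit.pident >= MIN_PIDENT
    )


def percent_reads_by_group(hits, mapped_reads, total_reads):
    """Sum, per viral group, the percentage of library reads mapped to
    candidate contigs.

    mapped_reads maps contig identifier -> number of reads mapped to it;
    total_reads is the total number of reads in the library.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    percentages = {}
    for hit in hits:
        if not is_candidate(hit):
            continue
        share = 100.0 * mapped_reads.get(hit.contig, 0) / total_reads
        percentages[hit.group] = percentages.get(hit.group, 0.0) + share
    return percentages


# Toy data for one library (invented values, not from the study).
example_hits = [
    Hit("c1", 1e-50, 300, 60.0, "CRESS-like"),
    Hit("c2", 1e-20, 250, 45.0, "Tombusviridae-like"),
    Hit("c3", 0.5, 40, 25.0, "Totiviridae-like"),  # fails every cut-off
]
example_counts = {"c1": 500, "c2": 250, "c3": 900}
print(percent_reads_by_group(example_hits, example_counts, 100_000))
```

In this toy library only the first two contigs pass the screen, so the weakly supported third contig contributes nothing to the per-group percentages even though many reads map to it.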
Recognizing the reagent microbiome
Reagent and laboratory contamination can critically impact sequence-based microbiome analyses
Towards precision quantification of contamination in metagenomic sequencing experiments
Identification and removal of contaminating microbial DNA from PCR reagents: impact on low-biomass microbiome analyses
Full-length transcriptome assembly from RNA-Seq data without a reference genome
MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
Fast and sensitive protein alignment using DIAMOND
BLAST+: architecture and applications
MAFFT Multiple Sequence Alignment Software Version 7: Improvements in performance and usability
Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
Fast gapped-read alignment with Bowtie 2
Identification of diverse mycoviruses through metatranscriptomics characterization of the viromes of five major fungal plant pathogens
Virome characterization of a collection of S. sclerotiorum from Australia
Diverse small circular DNA viruses circulating amongst estuarine molluscs
Diversity and evolution of novel invertebrate DNA viruses revealed by meta-transcriptomics
Diverse circular ssDNA viruses discovered in dragonflies (Odonata: Epiprocta)
Complete genome sequences of three novel cycloviruses identified in a dragonfly (Odonata: Anisoptera) from China
Metagenomic analysis of coastal RNA virus communities
Trichomonasvirus: a new genus of protozoan viruses in the family Totiviridae
Giardiavirus double-stranded RNA genome encodes a capsid polypeptide and a gag-pol-like fusion protein by a translation frameshift
Viruses of the protozoa
The immunopathogenesis of equine infectious anemia virus
Transmission of equine infectious anemia virus from horses without clinical signs of disease
Transmission of equine infectious anemia virus by Tabanus fuscicostatus
Application of next generation sequencing technology on contamination monitoring in microbiology laboratory