key: cord-293890-thfros7x authors: Carbo, Ellen C.; Sidorov, Igor A.; Zevenhoven-Dobbe, Jessica C.; Snijder, Eric J.; Claas, Eric C.; Laros, Jeroen F.J.; Kroes, Aloys C.M.; de Vries, Jutte J.C. title: Coronavirus discovery by metagenomic sequencing: a tool for pandemic preparedness date: 2020-08-21 journal: J Clin Virol DOI: 10.1016/j.jcv.2020.104594 sha: doc_id: 293890 cord_uid: thfros7x INTRODUCTION: The SARS-CoV-2 pandemic of 2020 is a prime example of the omnipresent threat of emerging viruses that can infect humans. A protocol for the identification of novel coronaviruses by viral metagenomic sequencing in diagnostic laboratories may contribute to pandemic preparedness. AIM: The aim of this study is to validate a metagenomic virus discovery protocol as a tool for coronavirus pandemic preparedness. METHODS: The performance of a viral metagenomic protocol in a clinical setting for the identification of novel coronaviruses was tested using clinical samples containing SARS-CoV-2, SARS-CoV, and MERS-CoV, in combination with databases generated to contain only viruses known before the discovery dates of these coronaviruses, to mimic virus discovery. RESULTS: Classification of NGS reads using Centrifuge and Genome Detective resulted in assignment of the reads to the closest relatives of the emerging coronaviruses. Low nucleotide and amino acid identity (81% and 84%, respectively, for SARS-CoV-2) in combination with up to 98% genome coverage were indicative of a related, novel coronavirus. Capture probes targeting vertebrate viruses, designed in 2015, enhanced both sequencing depth and coverage of the SARS-CoV-2 genome, the latter increasing from 71 to 98%. CONCLUSION: The model used for simulation of virus discovery enabled validation of the metagenomic sequencing protocol. The metagenomic protocol with virus probes designed before the pandemic can assist the detection and identification of novel coronaviruses directly in clinical samples.
The Severe Acute Respiratory Syndrome Coronavirus type 2 (SARS-CoV-2) pandemic of 2020 demonstrates the devastating effect an emerging virus can have. Although previous pandemics such as the Spanish Flu (1918) and Asian Flu (1957) resulted in a multitude of fatal cases, the SARS-CoV-2 pandemic exhibits an unprecedented impact on public health, the economy and society as a whole. Metagenomic Next-Generation Sequencing (mNGS) enables hypothesis-free sequencing of all nucleic acids in a given sample, including genomes of pathogens. All sequences are amplified, followed by classification of sequences based on a reference database. While research applications are more common, mNGS is being introduced in clinical diagnostic laboratories as indicated by recently diagnosed cases of encephalitis [6]. Implementation of mNGS in clinical diagnostics requires validation of metagenomic protocols. Metagenomic protocols and pipelines have been successfully used for detection of known pathogens [6] [7] [8]. However, detection and identification of novel, previously unknown emerging viruses presents a challenge due to the absence of their genome sequences in reference databases. In this study, we validated the identification of emerging coronaviruses by a viral metagenomic protocol, using clinical samples with SARS-CoV-2, and samples spiked with cultivated isolates SARS-CoV Frankfurt-1 (SARS-CoV) and MERS-CoV EMC/2012 (MERS-CoV). The validation included analysis of the performance of both an in-house and a commercially available data analysis pipeline, Genome Detective [9]. Identification of coronaviruses was tested using modified databases lacking SARS-CoV-2, SARS-CoV, and MERS-CoV, mimicking the situation at the time of virus discovery. Additionally, the efficacy of detection of novel coronaviruses using capture probes targeting vertebrate viruses [10] [11] known before the current pandemic was analyzed using a SARS-CoV-2 clinical sample.
Nasopharyngeal swabs were obtained from two patients who tested positive for SARS-CoV-2 by real-time PCR targeting the SARS-CoV-2 E-gene [12] with Cq values of 20 and 30, respectively. These PCRs [...] Library preparation and sequencing were performed using a previously validated protocol [15] [16]. Briefly, 200 μL of patient samples were spiked with equine arteritis virus (EAV) and phocid herpesvirus-1 (PhHV-1) prior to nucleic acid (NA) extraction using the MagNA Pure 96 DNA and Viral NA Small Volume extraction kit on the MagNA Pure 96 system (Roche, Basel, Switzerland), resulting in 100 μL of nucleic acid-containing eluate. Of this eluate, 50 μL per sample was used as input for the library prep, utilizing the NEBNext Ultra II Directional RNA Library prep kit for Illumina (New England Biolabs, Ipswich, MA, USA), dual indexed NEBNext Multiplex Oligos for Illumina (1.5 µM), and a protocol optimized for processing RNA and DNA simultaneously in a single tube [15]. Library preps of the samples were processed both with and without enrichment for viruses using sequence capture probes (see below). Subsequent sequence analysis was performed using a [...] After quality pre-processing using an in-house QC pipeline, Biopet version 0.9.0 [17], and removal of human reads after mapping them to human reference genome GRCh38 [18] with Bowtie2 version 2.3.4 [19], the remaining sequencing reads were taxonomically classified using Centrifuge 1.0.2-beta [20]. Pre-processed short reads were de novo assembled into contigs using SPAdes version 3.10.1 [22]. All contigs were analyzed using the NCBI Basic Local Alignment Search Tool (BLAST 2.8.1) [23] against the NCBI nucleotide (nt) database (accessed April 2018). Only viral hits for contigs with a length of ≥500 bp were selected to identify the best shared homology to viruses. A length of 500 bp was taken to ensure coverage of the built contigs by at least 3 reads, to rule out any possible contamination.
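The contig length filter described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: it assumes the assembler output is available as FASTA-formatted text, and the names `parse_fasta`, `filter_contigs` and `MIN_CONTIG_LENGTH` are hypothetical. Only the 500 bp threshold is taken from the text.

```python
from typing import Dict, Iterable

# Threshold stated in the text; everything else in this sketch is an assumption.
MIN_CONTIG_LENGTH = 500  # bp


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {contig_id: sequence} mapping."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Use the first header token as the identifier.
            current_id = fields[0] if fields else f"unnamed_{len(records)}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current_id] += line
    return records


def filter_contigs(records: Dict[str, str],
                   min_length: int = MIN_CONTIG_LENGTH) -> Dict[str, str]:
    """Keep only contigs whose length is at least min_length bases."""
    return {cid: seq for cid, seq in records.items() if len(seq) >= min_length}
```

Contigs that pass the filter would then be the ones submitted to the BLAST search; that step is not shown here.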
Only hits to genomes dated before 2020 were considered, to mimic the virus discovery setting for SARS-CoV-2. After removal of human reads, FASTQ files generated for SARS-CoV-2 samples (with and without viral enrichment) were uploaded for classification and de novo assembly by the commercial web-based tool Genome Detective v1.120 (www.genomedetective.com, accessed 2020-05-11) [9], using a reference database (generated 2019-09-21). In brief, after removal of low-quality reads and trimming by Trimmomatic [24], candidate viral reads were identified using the protein-based alignment method DIAMOND [25] in combination with the Swissprot UniRef90 protein database, followed by de novo assembly using metaSPAdes [26]. Blastx and Blastn [23] were used to search for candidate reference sequences using the NCBI RefSeq virus database (accessed 2019-09-21). Consensus sequences were produced by joining de novo contigs using Advanced Genome Aligner [27]. Classification results of viral reads are shown in Figure 1 and [...]. Results of de novo assembly of all samples for contigs longer than 500 bp are shown in Table 2. BLASTn was used to search for hits with sequence homology. For each contig, only the viral hit with the lowest E-value among the matches submitted before the publication of SARS-CoV-2 genomes was considered. BLASTn search results of the contigs with Coronaviridae hits are listed in Table 2, including the length of the longest contig for each sample. Identity data of the hits with the lowest E-value are listed in Supplementary Table 1. Additional BLAST alignment figures of the longest contigs of both the SARS-CoV and MERS-CoV samples can be found in Supplementary Figures 1 and 2, respectively. Genome Detective results of identification of SARS-CoV-2 sequences using a database created before the emergence of SARS-CoV-2 are shown in Figure 2.
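The date-restricted hit selection described above (lowest E-value among hits submitted before the emergence of the virus) can be illustrated as follows. This is a sketch under stated assumptions: the `BlastHit` record, its fields and the function `best_hit_before` are hypothetical, and it presumes that a submission or release date has already been attached to each hit, which standard BLAST tabular output does not provide on its own.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class BlastHit:
    """Hypothetical container for one BLAST hit with an attached release date."""
    subject: str
    evalue: float
    percent_identity: float
    release_date: date


def best_hit_before(hits: Iterable[BlastHit], cutoff: date) -> Optional[BlastHit]:
    """Return the lowest E-value hit released strictly before cutoff, or None."""
    eligible = [hit for hit in hits if hit.release_date < cutoff]
    if not eligible:
        return None
    # min() keeps the first of several hits that tie on E-value.
    return min(eligible, key=lambda hit: hit.evalue)
```

With a cutoff of `date(2020, 1, 1)`, a perfect match released in 2020 would be ignored and the closest older relative would be reported instead, which is the behaviour the simulated discovery setting relies on.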
SARS-CoV-2 sequences were identified as SARS-CoV, with nucleotide and amino acid identity of 80-81% and 83-85%, respectively, in combination with up to 98% genome coverage, which is indicative of a novel finding. The efficacy of a metagenomic sequencing protocol using capture probes targeting vertebrate virus sequences, designed before the emergence of SARS-CoV-2, was studied in the context of virus discovery. We analyzed metagenomic data from the two SARS-CoV-2 positive samples prepared both with and without viral enrichment. The total number of contigs and the number of contigs matching genomes of viruses from Coronaviridae are shown in Table 2 and [...]. Reads mapping to the SARS-CoV-2 reference genome were used to visualize the effect of capture probes, as depicted in Figure 3, where the SARS-CoV-2 genome is almost completely covered. The two largest contigs built by SPAdes that had a hit with the lowest E-value when BLASTed against genomes from Coronaviridae were 4866 bp and 5811 bp in length for the two SARS-CoV-2 samples enriched using probes. In this study, we evaluated the performance of a metagenomic sequencing protocol for the identification of emerging viruses using clinical samples in combination with a simulated reference database. High and low loads of SARS-CoV-2, SARS-CoV, and MERS-CoV in clinical samples could be detected as 'novel' viruses, using only reference sequences created before these viruses emerged. Sequence reads were assigned to the closest relatives of these viruses available at that time and assembled with heterologous sequences to 'novel' consensus genomes. Low identity of these consensus genomes with the genomes of closely related viruses indicated a novel virus. Additionally, probes targeting sequences of vertebrate viruses, available prior to the coronavirus pandemic of 2020, succeeded in the capture of nearly the full genome of SARS-CoV-2.
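The genome coverage percentages reported above can be reproduced in principle from read alignments. The sketch below assumes that "genome coverage" means breadth of coverage, that is, the fraction of reference positions covered by at least one read, and that alignments are given as 0-based half-open `(start, end)` intervals. The function name `coverage_breadth` and the input format are hypothetical and are not taken from the authors' pipeline.

```python
from typing import Iterable, Tuple


def coverage_breadth(intervals: Iterable[Tuple[int, int]],
                     genome_length: int) -> float:
    """Fraction of reference positions covered by at least one interval.

    Intervals are 0-based and half-open: (start, end) covers start..end-1.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    furthest_end = 0  # right edge of the region counted so far
    for start, end in sorted(intervals):
        # Clip to the reference and skip the part that was already counted.
        start = max(start, furthest_end)
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            furthest_end = end
    return covered / genome_length
```

Overlapping and nested reads are counted once, so the result never exceeds 1.0. Multiplying by 100 gives a percentage comparable to the coverage values quoted in the text.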
It must be noted that the validation was performed using emerging viruses with nucleotide identity of over 76% to their closest known relatives, and conclusions cannot be extended to novel viruses which are less closely related. Nucleotide (and amino acid) identities reported in literature with regard to novel human pathogenic viruses vary, for example 50% for older viruses like SARS-CoV [1], 80% for MERS-CoV [14], 88% for parts of the Human Metapneumovirus [28] and up to 97.2% for parts of SARS-CoV-2 [29]. Several reports have shown an increase of 100- to 10,000-fold in sensitivity for detection of known viruses when using capture probes [10], [30]; here we report the potential of using capture probes in the detection of novel viruses. Sequence variation was addressed in the probe design by retaining mutant or variant sequences if sequences diverged by more than 90% [10]. Lipkin and colleagues describe the capture of conserved regions of a rodent hepacivirus isolate with 75% identity using VirSeqCap VERT, and even 40% for detection rather than whole genome sequencing is suggested [10]. The capture probes used in this study targeted sequences of several isolates of alpha-, beta-, gamma-, and deltacoronaviruses. In this study the whole genome of SARS-CoV-2, with 76-100% overall nucleotide identity to the probe targets, was detected using these probes. Metagenomic sequencing is increasingly being used in diagnostic laboratories as a hypothesis-free approach for suspected infectious diseases in undiagnosed cases. Metagenomic sequencing in diagnostic laboratories has resulted in the detection of pathogens present in the reference database but either not tested for by routine methods due to rare or unknown associations with a specific disease, or for which routine testing failed (e.g., due to primer mismatches). Additionally, mNGS enables the detection of novel pathogens not (yet) present in the databases.
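The nucleotide identity percentages discussed above depend on how identity is defined. As a hedged illustration, the sketch below computes identity as the number of identical columns divided by the alignment length including gap columns, which is one common convention. The function name `percent_identity` is hypothetical, the input is assumed to be two already aligned sequences of equal length, and other tools may use a different denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Identical, non-gap columns are counted as matches. Columns in which
    both sequences carry a gap are ignored; all other columns count toward
    the alignment length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == gap and b == gap)
    ]
    if not columns:
        raise ValueError("alignment contains no informative columns")
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)
```

Because gap columns count against identity in this convention, a consensus genome with many insertions or deletions relative to its closest relative scores lower than a mismatch-only comparison would suggest.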
Common bioinformatic classifiers are usually not designed for discovery purposes, so additional algorithms, including a separate validation to assess the performance in a discovery setting, are needed. Reports on specific bioinformatic discovery tools typically describe the algorithm and an in silico analysis; here we present validation studies on the performance of virus discovery tools using clinical samples. Implementation of virus discovery protocols in diagnostic laboratories may contribute to increased vigilance for emerging viruses and thereby aid in surveillance and pandemic preparedness.
The authors have no conflicts of interest.
Table showing the total number of built contigs with a length ≥500 bp, the number of these contigs where the hit with the lowest E-value would be a hit to viruses, the number of contigs where the hit with the lowest E-value would be a hit to Coronaviridae and, of this last group, the length of the longest contig, the alignment length, identity match, taxonomic name of BLAST result and the release years of sequences belonging to the species and subjects found by BLAST.
Identification of a Novel Coronavirus in Patients with Severe Acute Respiratory Syndrome
A Novel Coronavirus Associated with Severe Acute Respiratory Syndrome
Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid
VIP: an integrated pipeline for metagenomics of virus identification and discovery
Metagenomics for pathogen detection in public health
Genome Detective: an automated system for virus identification from high-throughput sequencing data
Virome Capture Sequencing Enables Sensitive Viral Diagnosis and Comprehensive Virome Analysis, mBio
Enhanced virome sequencing using targeted sequence capture
Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
Mechanisms and enzymes involved in SARS coronavirus genome expression
Isolation of a Novel Coronavirus from a Man with Pneumonia in Saudi Arabia
Retrospective Validation of a Metagenomic Sequencing Protocol for Combined Detection of RNA and DNA Viruses Using Respiratory Samples from Pediatric Patients
The respiratory virome and exacerbations in patients with chronic obstructive pulmonary disease
Fast gapped-read alignment with Bowtie 2
Centrifuge: rapid and sensitive classification of metagenomic sequences
Interactive metagenomic visualization in a Web browser
SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing
Basic local alignment search tool
Trimmomatic: a flexible trimmer for Illumina sequence data
Fast and sensitive protein alignment using DIAMOND
metaSPAdes: a new versatile metagenomic assembler
An alignment method for nucleic acid sequences against annotated genomes
Analysis of the Genomic Sequence of a Human Metapneumovirus
A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein
Improved diagnosis of viral encephalitis in adult and pediatric hematological patients using viral metagenomics

We would like to thank Joost van Harinxma thoe Slooten and Alhena Reyes for the library preparations and viral probe enrichments. Additionally, we would like to thank Lopje Höcker, Margriet Kraakman and Tom Vreeswijk for all their technical assistance in the lab.