key: cord-1025271-vtc6d2wr authors: van Boheemen, Sander; van Rijn-Klink, Anneloes L.; Pappas, Nikos; Carbo, Ellen C.; Vorderman, Ruben H.P.; van `t Hof, Peter J.; Mei, Hailiang; Claas, Eric C.J.; Kroes, Aloys C.M.; de Vries, Jutte J.C. title: Metagenomic Sequencing for Combined Detection of RNA and DNA Viruses in Respiratory Samples from Paediatric Patients date: 2018-12-10 journal: bioRxiv DOI: 10.1101/492835 sha: 92eaf4a156d3f39cc820260dadba37f1af280dc5 doc_id: 1025271 cord_uid: vtc6d2wr Introduction Viruses are the main cause of respiratory tract infections. Metagenomic next-generation sequencing (mNGS) enables the unbiased detection of all potential pathogens in a clinical sample, including variants and even unknown pathogens. To apply mNGS in viral diagnostics, there is a need for sensitive and simultaneous detection of RNA and DNA viruses. In this study, the performance of an in-house mNGS protocol for routine diagnostics of viral respiratory infections, with single tube DNA and RNA sample-pre-treatment and potential for automated pan-pathogen detection was studied. Materials and Methods The sequencing protocol and bioinformatics analysis was designed and optimized including the optimal concentration of the spike-in internal controls equine arteritis virus (EAV) and phocine-herpes virus-1 (PhHV-1).The whole genome of PhHV-1 was sequenced and added to the NCBI database. Subsequently, the protocol was retrospectively validated using a selection of 25 respiratory samples with in total 29 positive and 346 negative PCR results, previously sent to the lab for routine diagnostics. Results The results demonstrated that our protocol using Illumina Nextseq 500 sequencing with 10 million reads showed high repeatability. The NCBI RefSeq database as opposed to the NCBI nucleotide database led to enhanced specificity of virus classification. A correlation was established between read counts and PCR cycle threshold value, demonstrating the semi-quantitative nature of viral detection by mNGS. The results as obtained by mNGS appeared condordant with PCR based diagnostics in 25 out of the 29 (86%) respiratory viruses positive by PCR and in 315 of 346 (91%) PCR-negative results. Viral pathogens only detected by mNGS, not present in the routine diagnostic workflow were influenza C, KI polyomavirus, and cytomegalovirus. Conclusions Sensitivity and analytical specificity of this mNGS protocol was comparable with PCR and higher when considering off-PCR target viral pathogens. All potential viral pathogens were detected in one single test, while it simultaneously obtained detailed information on detected viruses. Viruses are the main cause of respiratory tract infections. Metagenomic next-generation 24 sequencing (mNGS) enables the unbiased detection of all potential pathogens in a clinical 25 sample, including variants and even unknown pathogens. To apply mNGS in viral 26 diagnostics, there is a need for sensitive and simultaneous detection of RNA and DNA 27 viruses. In this study, the performance of an in-house mNGS protocol for routine diagnostics 28 of viral respiratory infections, with single tube DNA and RNA sample-pre-treatment and 29 potential for automated pan-pathogen detection was studied. The sample types represented the routine diagnostic samples from paediatric patients sent to 119 our laboratory: predominantly nasopharyngeal washings (n=17), sputum (n=2) and broncho-120 alveolar lavages (BAL, n=2). The patient selection (age range 1.2 months -15 years) 121 represented the paediatric population with respiratory diagnostics in our university hospital in 122 terms of (underlying) illness. 123 124 Sample pre-treatment 125 A sample pre-treatment protocol was designed with 1) potential for automation, 2) potential 126 (PhHV1, kindly provided by prof. dr. H.G.M. Niesters, the Netherlands) as internal controls 135 for RNA and DNA virus detection respectively. To determine the optimal concentration of 136 the internal controls a dilution series was added to a mix of two pooled influenza A positive 137 throat swabs (Cq 25) (PhHV1/EAV 1:100,000 1:10,000 1:1,000 1:100 To determine the detection limit of mNGS, serial dilutions (undiluted, 10 -1 , 10 -2 , 10 -3 , 10 -4 ) of 172 an influenza A positive sample was tested with both lab developed real-time PCR and 173 mNGS. 174 175 To determine the reproducibility of metagenomic sequencing an influenza A positive clinical 177 sample (throat swab) was tested in quadruple. This sample was divided into separate aliquots, 178 nucleic acids were extracted, library preparation and subsequent sequence analysis on the 9 Bioinformatics: taxonomic classification 182 All FASTQ files were processed using the BIOPET Gears pipeline version 0.9.0 developed at 183 the LUMC [22] . This pipeline performs FASTQ pre-processing (including quality control, 184 quality trimming and adapter clipping) and taxonomic classification of sequencing reads. Centrifuge-download utility and were used as input for centrifuge-build. 197 Centrifuge settings were evaluated to increase the sensitivity and specificity. The default 198 setting, with which a read can be assigned to up to five different taxonomic categories, was 199 compared to one unique assignment per read [26] where a read is assigned to a single 200 taxonomic category, corresponding to the lowest common ancestor of all matching species. 201 Kraken-style reports with taxonomical information were produced by the Centrifuge-kreport 202 utility for all (default) options. Both unique and non-unique assignments can be reported, and 203 these settings were compared. The resulting tree-like structured, Kraken-style reports were 204 visualized with Krona [28] version 2.0. 205 In silico simulated EAV reads were analysed in different databases (NCBI nucleotide vs 206 RefSeq), classification algorithms (max 5 labels per sequence, vs unique (common ancestor)) 207 and reporting (non-unique vs unique) to determine the most sensitive an specific 208 bioinformatic analyses using Centrifuge. 209 To determine the amount of reads needed, results of 1 and 10 million reads were compared. 1 210 million reads were randomly selected of the 10 million reads of one FASTQ file and 211 analysed. To retrieve longer assembly contigs a reiterative assembly approach was used by processing 220 the proposed contigs by the biowdl reAssembly pipeline 0.1. This preassembly pipeline 221 aligns reads to contigs of a previous assembly, then selects the aligned reads, downsamples 222 them and runs a new assembly using SPADES. Subtools used for this consisted of BWA 223 0.7.17 [32] for indexing and mapping, SAMtools 1.6 [33] for creating bam files, SAMtools 224 view (version 1.7) for filtering out unmapped reads using the setting "-G 12", Picard 225 SamToFastq (version 2.18.4) and seqtk for creating fastq files with 250 000 reads. The 226 contigs from the reAssembly pipeline were then processed for a second using SPADES, with 227 setting the 'cov-cutoff' to 5. The resulting contigs were then processed with the reAssembly 228 pipeline for the third and last time setting the 'cov-cutoff' in SPADES to 20. The contigs from the last reAssembly step were then run against the blast NT database using 230 Serial dilution 1:10,000 detected EAV with a substantial read count in the presence of a viral 259 infection and without a significant decline in target virus family reads (Table 1 ). Based on 260 these results we determined the concentration of internal controls for further experiments. 261 The EAV Cq value of the dilutions correlated with the number of EAV reads from both 262 BLAST alignment and the Centrifuge analysis ( Figure 1) . 263 Since the NCBI database was lacking a complete PhHV1 genome sequence, PhHV1 was 264 sequenced and based on the gained sequence reads the genome was build using SPAdes [31] . 265 The proposed almost complete genome of PhHV1 was added to the NCBI genbank database 266 with respiratory complaints. The specificity could be increased by using the NCBI RefSeq 297 database instead of the nucleotide database. The classification was further improved by 298 changing the Centrifuge tool settings to limit the assignment of homologous reads to the 299 lowest common ancestor (maximum 1 label per sequence). Classification with maximum 5 300 labels per read resulted in two different outcomes using the report with all mappings and the 301 report with unique mappings, with the latter missing the reads assigned to multiple 302 organisms. 303 Comparison of classification using these different settings shows the highest sensitivity and 304 specificity using the RefSeq database with one label (lowest common ancestor) assignment, 305 both with in silico prepared datasets containing solely EAV sequence fragments ( Figure 5 ) 306 and with clinical datasets (with highly abundant background) ( Figure 6 ). 307 To determine the effect of the total number of sequencing reads obtained per sample on 308 sensitivity, one million and 10 million reads were compared by means of in silico analysis 309 (Table 2) . (Table 320 3), resulting in a sensitivity of 86% for PCR targets. If a cut-off of 15 reads was handled, 321 sensitivity declined to 67% (Table 4) . 322 mNGS target read count showed a correlation with the Cq values of the qPCR (Figure 7) . The addition of an internal control to a PCR reaction is widely used as an quality control of a 376 qPCR [38] . While the addition of internal controls in mNGS is not yet an accepted standard 377 procedure, we employed EAV and PhHV1 as an RNA and DNA controls, respectively, as for 378 diagnostic application such precautions are required. The amount of internal control reads 379 and target virus reads have been reported to be dependent of the amount of background reads 380 (negative correlation) [39] . In our protocol, the internal controls were used as qualitative 381 controls but may be used as indicator of the amount of background. Since the NCBI database 382 was lacking a complete PhHV1 genome, the Centrifuge index building and classification was 383 limited to classification on a higher taxonomic rank. In order to achieve classification of 384 RNA viruses, that were not recovered by mNGS had high Cq values, over 31, i.e. a relatively 389 low viral load. This may be a drawback of the retrospective nature of this clinical evaluation 390 as RNA viruses may be degraded due to storage and freeze-thaw steps, resulting in lower 391 sensitivity of mNGS. A correlation was found between read counts and PCR Cq value, 392 demonstrating the quantitative nature of viral detection by mNGS. Discrepancies between the 393 Cq values and the number of mNGS reads may be explained by 1) unrepresentative Cq 394 values, e.g. by primer mismatch for highly divergent viruses like rhino/enteroviruses and 2) 395 differences in sensitivity of mNGS for several groups of viruses, as has been reported by 396 others [42] . Additionally, several viral pathogens were detected that were not targeted by the 397 routine PCR assays, including influenza C virus, which is typical of the unbiased nature of 398 the method. In addition, though not within the scope of this study, bacterial pathogens, 399 including Bordetella pertussis (qPCR positive), were also detected. In the current study only 400 viruses were targeted since these could be well compared to qPCR results, bacterial targets 401 remain to be studied in clinical sample types more suitable for bacterial detection than 402 nasopharyngeal washings. The analytical specificity of mNGS appeared to be high, especially 403 with a cut-off of 15 reads. However, the clinical specificity, the relevance of the lower read 404 numbers, still needs further investigation in clinical studies. 405 Sequencing using Illumina HiSeq 4000 chemistry resulted in frequent false positive 406 rhinovirus reads in numerous samples. This problem could be attributed to 'index hopping' 407 (index miss-assignment) as earlier described [34] . Although the percentage of reads which 408 contributed to the index hopping was very low, this is critical for clinical viral diagnostics, as 409 this is aimed specifically at low abundancy targets [34, 40] . 410 Bioinformatics classification of metagenomic sequence data with the pipeline Centrifuge 411 required identification of the optimal parameters in order to minimize miss-and unclassified 412 reads. Default settings of this pipeline resulted in higher rates of both false positive and false 413 negative results. The nucleotide database includes a wide variety of unannotated viral 414 sequences, such as partial sequences and (chimeric) constructs, in contrast to the curated and 415 well-annotated sequences in the NCBI Refseq database, which resulted in a higher 416 specificity. In addition to the database, settings for the assignment algorithm were adapted as 417 well. The assignment settings were adjusted to unique assignment in the case of homology to 418 the lowest common ancestor. This modification resulted in higher sensitivity and specificity 419 than the default settings, however the ability to further subtyping diminished. This is likely to 420 be attributed to the limited representation/availability of strain types within the RefSeq 421 database. In consequence, this leads to a more accurate estimation of the common ancestor 422 for particular viruses, but limited typing results in case of highly variable ones. To obtain 423 optimal typing results, additional annotated sequences may be added or a new database 424 should be build, with a high variety of well-defined and frequently updated virus strain types. 425 To conclude, this study contributes to the increasing evidence that metagenomic NGS can 426 effectively be used for a wide variety of diagnostic assays in virology, such as unbiased virus 427 detection, resistance mutations, virulence markers, and epidemiology, as shown by the ability Zinc-mediated RNA