key: cord-0309923-1h7t4wg0 authors: Carbo, E. C.; Sidorov, I. A.; van Rijn-Klink, A. L.; Pappas, N.; van Boheemen, S.; Mei, H.; Hiemstra, P. S.; Eagan, T. M.; Claas, E. C. J.; Kroes, A. C. M.; de Vries, J. J. C. title: Performance of five metagenomic classifiers for virus pathogen detection using respiratory samples from a clinical cohort date: 2022-01-21 journal: nan DOI: 10.1101/2022.01.21.22269647 sha: b655cbc090f05f7dcc3cf16619ea8c212c37179e doc_id: 309923 cord_uid: 1h7t4wg0

Viral metagenomics is increasingly being applied in clinical diagnostic settings for detection of pathogenic viruses. While a number of benchmarking studies have been published on the use of metagenomic classifiers for abundance and diversity profiling of bacterial populations, studies on the comparative performance of the classifiers for virus pathogen detection are scarce. In this study, metagenomic data sets (N=88) from a clinical cohort of patients with respiratory complaints were used to compare the performance of five taxonomic classifiers: Centrifuge, Clark, Kaiju, Kraken2, and Genome Detective. A total of 1,144 positive and negative PCR results for 13 respiratory viruses were used as gold standard. Sensitivity and specificity of these classifiers ranged from 83-100% and 90-99%, respectively, and were dependent on the classification level and data pre-processing. Exclusion of human reads generally resulted in increased specificity. Normalization of read counts for genome length had only a minor effect on overall performance, but negatively affected the detection of targets with read counts around the detection level. Correlation of sequence read counts with PCR Ct-values varied per classifier, per data pre-processing (R² range 15.1-63.4%), and per virus, with outliers up to 3 log10 reads magnitude beyond the predicted read count for viruses with high sequence diversity.
In this benchmarking study, sensitivity and specificity were within the ranges applicable to diagnostic practice when the cut-off for defining a positive result was considered per classifier.

In the era of next-generation sequencing (NGS), clinical metagenomics, the analysis of all microbial genetic material in clinical samples, is being introduced in diagnostic laboratories, revolutionizing diagnostic PCR assays to identify suspected pathogens; one single metagenomic run enables the [...] pathogen detection are scarce. Publications on the performance of the computational analysis of viral metagenomics are usually limited to in silico analysis of artificial sequence data [14] [20] [21] or mock samples [22] [23]. Though both sensitivity and specificity can be deduced when using simulated datasets, these usually do not represent the complexity of data sets from clinical samples, which typically contain sequences from wet lab reagents that have been referred to as the 'kitome'.

medRxiv preprint doi: https://doi.org/10.1101/2022.01.21.22269647; this version posted January 21, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Bioinformatic metagenomics tools designed for taxonomic classification were selected for benchmarking based on the following criteria: applicable to viral metagenomics for pathogen detection; available either as a download or as a webserver; and either widely used or showing potential for adoption in diagnostics in the future. Some of the tools considered were excluded due to lack of support, lack of details on how to use the tool, or non-functioning webservers. An overview of the characteristics of the selected classifiers can be found in Table 2.
In this study, we compared the performance of five taxonomic classification tools for virus pathogen detection, using datasets from well-characterized clinical samples. In contrast to previously reported comparisons with datasets from real samples, both sensitivity and specificity could be assessed using a unique set of 1,144 PCR results as gold standard. A uniform database was created to exclude variability based on differences in availability of genomes in databases provided with the classifiers. In general, sensitivity and specificity were within ranges applicable to diagnostic practice.
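As an illustration of how sensitivity and specificity follow from PCR results used as gold standard, the calculation can be sketched as below. This is a minimal sketch, not the authors' pipeline: the rule "a target is called positive when the classifier assigns at least `cutoff` reads", the names `Observation` and `sensitivity_specificity`, and all read counts are assumptions made for illustration only and are not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One sample/virus pair: the PCR result and the reads a classifier assigned."""
    pcr_positive: bool
    read_count: int


def sensitivity_specificity(observations, cutoff):
    """Score classifier calls against PCR.

    Assumption: a target is called positive when read_count >= cutoff.
    Returns (sensitivity, specificity); a value is NaN if its denominator is zero.
    """
    tp = fp = tn = fn = 0
    for obs in observations:
        called_positive = obs.read_count >= cutoff
        if obs.pcr_positive and called_positive:
            tp += 1
        elif obs.pcr_positive:
            fn += 1
        elif called_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented toy values (not study data): three PCR-positive and three
# PCR-negative targets with hypothetical read counts.
toy_observations = [
    Observation(pcr_positive=True, read_count=1200),
    Observation(pcr_positive=True, read_count=15),
    Observation(pcr_positive=True, read_count=0),
    Observation(pcr_positive=False, read_count=0),
    Observation(pcr_positive=False, read_count=3),
    Observation(pcr_positive=False, read_count=0),
]

for cutoff in (1, 5, 50):
    sens, spec = sensitivity_specificity(toy_observations, cutoff)
    print(f"cutoff={cutoff:>3}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Raising the cut-off in this toy example removes the low-count false positive at the cost of missing low-count true positives, which is one way to see why a cut-off may need to be chosen per classifier.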
Exclusion [...]

Figure 3. Correlation between the number of sequence reads assigned (species level) and Ct-values.

UniRef: comprehensive and non-redundant UniProt reference clusters
SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-
metaSPAdes: a new versatile metagenomic assembler
Basic local alignment search tool
An alignment method for nucleic acid sequences against annotated genomes
Use of simulated data sets to evaluate the fidelity of metagenomic processing methods
Assessing taxonomic metagenome profilers with OPAL
Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software
Comprehensive benchmarking and ensemble approaches for metagenomic classifiers
Challenges in benchmarking metagenomic profilers
KrakenUniq: confident and fast metagenomics