key: cord-0747471-pznblc2h
authors: Balaji, Advait; Kille, Bryce; Kappell, Anthony D.; Godbold, Gene D.; Diep, Madeline; Elworth, R. A. Leo; Qian, Zhiqin; Albin, Dreycey; Nasko, Daniel J.; Shah, Nidhi; Pop, Mihai; Segarra, Santiago; Ternus, Krista L.; Treangen, Todd J.
title: SeqScreen: Accurate and Sensitive Functional Screening of Pathogenic Sequences via Ensemble Learning
date: 2021-08-08
journal: bioRxiv
DOI: 10.1101/2021.05.02.442344
sha: 3bb11d9c85ed24b37ada48cf5ae5b5f78b82fa4a
doc_id: 747471
cord_uid: pznblc2h

The COVID-19 pandemic has emphasized the importance of detecting known and emerging pathogens from clinical and environmental samples. However, robust characterization of pathogenic sequences remains an open challenge. To this end, we developed SeqScreen, which can accurately characterize short nucleotide sequences using taxonomic and functional labels, and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed pathogen characterization and is available for download at: www.gitlab.com/treangenlab/seqscreen

Rapid advancements in synthesis and sequencing of genomic sequences and nucleic acids have ushered in a new era of synthetic biology and large-scale genomics. While the democratization of reading and writing DNA has greatly enhanced our understanding of large-scale biological processes [1], it has also introduced new challenges [2]. Robust characterization of genetically engineered or de novo synthesized pathogens has never been more relevant, and the importance of detecting and tracking naturally evolving and emerging pathogenic sequences from the environment cannot be overstated. Open challenges that represent barriers to accurate detection include, but are not limited to, (i) the role of abiotic and environmental stress response genes in virulence, (ii) the presence of seemingly pathogenic sequences in commensals, (iii) host-specific pathogen virulence, and (iv) interplay of different genes to generate pathology [3]. Accurate and sensitive detection of pathogenic markers has also been confounded by the difficulty of characterizing multifactorial microbial virulence factors in the context of the biology of the host [4]. The limited number of publicly available databases to

had annotation scores above 3 with a median score of 4, indicating a higher degree of confidence in its functional

alignment-based tools to classify sequences. SeqScreen obtains alignments to both DNA and amino acid databases. While aligning to amino acid databases provides taxonomic information as well as functional information, aligning to nucleotide database.

DIAMOND is open-source software designed for aligning short sequence reads and performs at approximately 20,000 times the speed of BLASTX with similar sensitivity. Our reduced version of the UniRef100 database [38] only contains proteins with a high annotation score.
Not including poorly annotated proteins both decreases the runtime and increases the specificity of SeqScreen functional annotations. SeqScreen then runs Centrifuge, a novel tool for quick and accurate taxonomic classification of large metagenomic datasets. Centrifuge classifications are given higher weights and are always assigned a confidence score of 1.0. SeqScreen always picks the taxonomic rank with the highest score for Centrifuge and assigns it to the sequence. In the case where Centrifuge fails to assign a taxonomic rank to a particular sequence, we assign DIAMOND's predictions to it. To incorporate DIAMOND's predictions, we consider all taxonomic ids that are within 1% of the highest bit-score as the taxonomy labels for a sequence (Supplementary Figure SF1).

widely used but is limited due to many of the sequences not having clearly available annotations or justification for their

base describes the pathogen-host interactions but does not focus on pathogenic effects on the host. SeqScreen was

leverages functional information combined with curations to identify FunSoCs

proteins to identify pathogens. SeqScreen provides an advantage in that it also reports the most likely strain-level assignments and protein-specific functional information for each sequence, including GO terms and FunSoCs, to accurately identify pathogenic markers in each sequence without relying solely on taxonomic markers. We also observed through inspecting the FunSoC lookup table that SeqScreen preserves FunSoC labels even when the proteins are distantly related (up to 40% sequence similarity). Hence, the FunSoC abstraction represents a robust framework for detecting novel pathogens, as it does not rely on specific taxonomic labels in the database but on learning latent features that connect similar pathogenic markers. SeqScreen also provides a more detailed framework beyond species or strain-level taxonomic classifications to aid the user in interpreting the pathogenicity potential of a query sequence, including exact protein hits, GO terms, multiple likely taxonomic labels with confidence scores, and FunSoC assignments.
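The taxonomic assignment rule described earlier (Centrifuge first, with DIAMOND as a fallback) can be illustrated with a minimal Python sketch. This is not the SeqScreen implementation: the names `TaxonomyCall` and `assign_taxonomy`, the tuple-based hit representation, and the reading of "within 1% of the highest bit-score" as `score >= 0.99 * top` are all assumptions made for illustration. The text does not state what confidence is attached to DIAMOND-derived labels, so the sketch leaves it unset.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaxonomyCall:
    """Hypothetical container for one sequence's taxonomic assignment."""
    source: str                  # "centrifuge", "diamond", or "none"
    taxids: List[int]
    confidence: Optional[float]  # 1.0 for Centrifuge; left unset otherwise


def assign_taxonomy(
    centrifuge_hits: List[Tuple[int, float]],
    diamond_hits: List[Tuple[int, float]],
    tolerance: float = 0.01,
) -> TaxonomyCall:
    """Prefer Centrifuge; fall back to DIAMOND hits near the top bit-score.

    Each hit is a (taxid, score) pair. For DIAMOND the score is a bit-score.
    """
    if centrifuge_hits:
        # Keep only the highest-scoring Centrifuge assignment.
        best_taxid, _ = max(centrifuge_hits, key=lambda hit: hit[1])
        return TaxonomyCall("centrifuge", [best_taxid], 1.0)

    if not diamond_hits:
        return TaxonomyCall("none", [], None)

    top = max(score for _, score in diamond_hits)
    cutoff = top * (1.0 - tolerance)  # assumed reading of "within 1%"
    taxids: List[int] = []
    for taxid, score in diamond_hits:
        # Collect each qualifying taxid once, in input order.
        if score >= cutoff and taxid not in taxids:
            taxids.append(taxid)
    return TaxonomyCall("diamond", taxids, None)
```

Under these assumptions, a sequence with any Centrifuge hit receives a single label with confidence 1.0, while a sequence with only DIAMOND hits may receive several labels when their bit-scores are nearly tied.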

The task of mapping biological (e.g., functional annotations) and textual features (e.g., keywords and abstract metadata) to

concern. However, coordinated community efforts are needed to further extend and improve the annotation quality of proteins in these key databases. We also note that while we have shown SeqScreen to be an accurate pathogen detection tool, explicitly identifying and labelling pathogens is not possible with only FunSoC information, as seen in Fig. 4

Cost function:

(2)

The third best performing model deviates from the two-stage detection and classification pipeline and instead 

The LinearSVCs for all the models were directly incorporated using their scikit-learn

Fig. 6. Pathogen identification of hard-to-classify pathogens: FunSoCs Assigned to Genes by SeqScreen. Abbreviated gene names are listed in pink cells if at least one read from the gene had a UniProt e-value < 0.0001, was assigned a FunSoC, and was from the expected genus (i.e., Escherichia or Shigella, Clostridium, Streptococcus, Lactobacillus). FunSoCs with at least one gene that met the criteria for detection in at least one isolate were included in the table. Excluding genes from genera that were not expected in these bacterial isolates removed genes that were likely derived from contaminating organisms (e.g., PhiX Illumina sequencing control). An expanded table for cells denoted by (*) and complete gene names are listed within each cell in Supplementary Table ST3.
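The detection criteria stated in the Fig. 6 caption (e-value below 0.0001, at least one FunSoC assigned, expected genus) amount to a simple per-read filter. The sketch below is only an illustration of that filter: the function name `detected_genes_by_funsoc`, the dictionary field names (`gene`, `evalue`, `funsocs`, `genus`), and the placeholder gene and FunSoC labels in the usage example are hypothetical and do not reflect SeqScreen's actual output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

E_VALUE_CUTOFF = 1e-4  # "UniProt e-value < 0.0001" from the caption


def detected_genes_by_funsoc(
    reads: Iterable[dict],
    expected_genera: Set[str],
) -> Dict[str, List[str]]:
    """Map each FunSoC to the genes with at least one qualifying read.

    A read qualifies if its e-value is below the cutoff, it carries at
    least one FunSoC, and its genus is among the expected genera.
    """
    detected: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        if read["evalue"] >= E_VALUE_CUTOFF:
            continue  # alignment not significant enough
        if read["genus"] not in expected_genera:
            continue  # likely contaminant, e.g. a sequencing control
        for funsoc in read["funsocs"]:  # empty list: read contributes nothing
            detected[funsoc].add(read["gene"])
    return {funsoc: sorted(genes) for funsoc, genes in detected.items()}


# Usage with placeholder data (gene and FunSoC names are made up).
example_reads = [
    {"gene": "geneA", "evalue": 1e-10, "funsocs": ["funsoc_x"], "genus": "Escherichia"},
    {"gene": "geneB", "evalue": 1e-3, "funsocs": ["funsoc_x"], "genus": "Escherichia"},
    {"gene": "geneC", "evalue": 1e-20, "funsocs": ["funsoc_x"], "genus": "PhiX"},
    {"gene": "geneD", "evalue": 1e-8, "funsocs": [], "genus": "Shigella"},
    {"gene": "geneE", "evalue": 1e-6, "funsocs": ["funsoc_x", "funsoc_y"], "genus": "Shigella"},
]
example_table = detected_genes_by_funsoc(example_reads, {"Escherichia", "Shigella"})
```

Only FunSoCs with at least one qualifying gene appear as keys, which mirrors the caption's rule for which FunSoCs were included in the table.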

Synthetic DNA synthesis and assembly: Putting the synthetic in synthetic biology

Biodefense in the Age of Synthetic Biology

Sequence-Based Classification of Select Agents

Next Steps for Access to Safe, Secure DNA Synthesis. Frontiers in Bioengineering and Biotechnology

Next-generation genome annotation: We still struggle to get it right

Predicting bacterial resistance from whole-genome sequences using k-mers and stability selection

Pathoscope: Species identification and strain attribution with unassembled sequencing data

PathoScope 2.0: A complete computational framework for strain identification in environmental or clinical sequencing samples

A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples

Clinical PathoScope: Rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data

Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid

Evaluation of the CosmosID bioinformatics platform for prosthetic joint-associated sonicate fluid shotgun metagenomic data analysis

SeqScreen: A biocuration platform for robust taxonomic and biological process characterization of nucleic acid sequences of interest

PathoFact: a pipeline for the prediction of virulence factors and antimicrobial resistance genes in metagenomic data. Microbiome. BioMed Central Ltd

VFDB 2019: A comparative pathogenomic platform with an interactive web interface

release: An enhanced web-based resource for comparative pathogenomics

VFDB 2012 update: Toward the genetic diversity and molecular evolution of bacterial virulence factors

Hierarchical and refined dataset for big data analysis - 10 years on

Basic local alignment search tool

Outlier detection in BLAST hits

Fast and sensitive protein alignment using DIAMOND

Centrifuge: Rapid and sensitive classification of metagenomic sequences

MEGARes 2.0: a database for classification of antimicrobial drug, biocide and metal resistance determinants in metagenomic sequence data. Nucleic Acids Research

Profile hidden Markov models

Pfam: The protein families database

UniRef: comprehensive and non-redundant UniProt reference clusters

EDGAR 2.0: an enhanced software platform for comparative gene content analyses. Nucleic Acids Research

Escherichia coli O157:H7 Shiga toxin-encoding bacteriophages: Integrations, excisions, truncations, and evolutionary implications

Mash: Fast genome and metagenome distance estimation using MinHash

sourmash: a library for MinHash sketching of DNA. The Journal of Open Source Software

Ultrafast and accurate 16S rRNA microbial community analysis using Kraken 2

Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife. eLife Sciences Publications Ltd

KrakenUniq: Confident and fast metagenomics classification using unique k-mer counts

Fast and sensitive taxonomic classification for metagenomics with Kaiju

Transcriptomic characteristics of bronchoalveolar lavage fluid and peripheral blood mononuclear cells in COVID-19 patients. Emerging Microbes and Infections

Species-level functional profiling of metagenomes and metatranscriptomes

Safety assessment of two probiotic strains, Lactobacillus coryniformis CECT5711 and Lactobacillus gasseri CECT5714

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

PANNZER2: A rapid functional annotation web server. Nucleic Acids Research

Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper

DeepGOPlus: improved protein function prediction from sequence. Cowen L, editor

Geospatial Resolution of Human and Bacterial Diversity with City-

ART: a next-generation sequencing read simulator

MEGAN analysis of metagenomic data. Genome Research. Cold Spring Harbor Laboratory Press

Explainable AI reveals changes in skin microbiome composition linked to phenotypic differences. bioRxiv. Cold Spring Harbor Laboratory

Adam: A method for stochastic optimization

Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research

Trimmomatic: a flexible trimmer for Illumina sequence data

Parke for their efforts in assisting with the development of the UniProt queries, internal organization of the curation data, and assistance in software development and testing at Signature Science, LLC. We would also like to acknowledge project team contributions of Jason Hauzel, Kristófer Thorláksson, Manoj Deshpande

Porter at Fraunhofer CMA USA for their support in software quality assurance and development of the HTML report generator, Jeremy Selengut of the University of Maryland for sharing his insights into viral pathogenesis, Jim Gibson of Signature Science, LLC for graphics development, and Letao Qi, Jacob Lu, and Chris Jermaine of Rice University for insightful discussions and work specific to the machine learning algorithms

for their helpful discussions about pathogenesis ontologies and assistance with deployment and testing of our software on their servers

We are thankful for all the time and effort provided by end users to test early versions of our software, recommend improvements, and guide its application to current challenges in pathogen detection, synthetic biology, and genome engineering. Finally, we would like to thank IARPA and all our SeqScreen end users for their helpful feedback

Intelligence Advanced Research Projects Activity (IARPA), via the Army Research Office (ARO) under Federal Award No. W911NF-17-2-0089. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, ARO, or the US Government

O157:H7 were classified as E. coli O16:H48 and E. coli 2009C-3554, respectively. PathoScope only classified two pathogens, C. sporogenes and C. botulinum, as their nearest neighbor counterparts

Xuzhou21 SRR8758382 C. sporogenes C. botulinum C. botulinum SRR8981313 C. botulinum C. sporogenes C