key: cord-0962762-yxf37huq authors: Li, Po-E; Myers y Gutiérrez, Adán; Davenport, Karen; Flynn, Mark; Hu, Bin; Lo, Chien-Chi; Player Jackson, Elais; Shakya, Migun; Xu, Yan; Gans, Jason; Chain, Patrick S G title: A Public Website for the Automated Assessment and Validation of SARS-CoV-2 Diagnostic PCR Assays date: 2020-08-10 journal: Bioinformatics DOI: 10.1093/bioinformatics/btaa710 sha: 271e02989b3236a4305c70462cecfada7d4bd98b doc_id: 962762 cord_uid: yxf37huq SUMMARY: Polymerase chain reaction-based assays are the current gold standard for detecting and diagnosing SARS-CoV-2. However, as SARS-CoV-2 mutates, we need to constantly assess whether existing PCR-based assays will continue to detect all known viral strains. To enable the continuous monitoring of SARS-CoV-2 assays, we have developed a web-based assay validation algorithm that checks existing PCR-based assays against the ever-expanding genome databases for SARS-CoV-2 using both thermodynamic and edit-distance metrics. The assay screening results are displayed as a heatmap, showing the number of mismatches between each detection and each SARS-CoV-2 genome sequence. Using a mismatch threshold to define detection failure, assay performance is summarized with the true positive rate (recall) to simplify assay comparisons. AVAILABILITY AND IMPLEMENTATION: The assay evaluation website and supporting software are Open Source and freely available at https://covid19.edgebioinformatics.org/#/assayValidation, https://github.com/jgans/thermonucleotideBLAST, and https://github.com/LANL-Bioinformatics/assay_validation. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Many aspects of the control, management and treatment responses to the global COVID-19 pandemic require accurate detection of its causative agent, SARS-CoV-2. 
To address this challenge, research groups around the world have developed Polymerase Chain Reaction (PCR)-based assays to detect SARS-CoV-2 genomic RNA (Supplementary Table S1). The impact of SARS-CoV-2 genetic drift on the ability of PCR-based assays to successfully detect target sequences is a concern. To address this concern, we have developed a web-based application that monitors existing SARS-CoV-2 PCR-based assays that are in use around the world and provides a visual summary of assay performance. Both the acquisition of new genomes and the assay validation process are automated, so that assays are checked and displayed daily to give near real-time results. The core of the validation algorithm is the ThermonucleotideBLAST (Gans and Wolinsky, 2008) in silico PCR screening tool. Publicly available assays are used as queries in ThermonucleotideBLAST and searched against a target database of SARS-CoV-2 genomes from the Global Initiative on Sharing All Influenza Data (GISAID) (Shu and McCauley, 2017) and GenBank (Clark et al., 2016). Sequences are downloaded daily from these databases and filtered to exclude any that are shorter than 29 kilobases or that are pangolin-SARS or bat-SARS. For sequences found in both databases, only the GISAID version is retained. Predicted false negatives are defined as assay/target combinations that meet at least one of the following criteria: (a) one or more oligo/target pairwise alignments with 3 or more mismatches, (b) one or more predicted oligo/target melting temperatures less than 40°C, or (c) one or more mismatches in the last two 3' positions of a primer, which are reported by Li et al. (2004) to inhibit detection by increasing detection Ct  2. True positives are defined as any assay/target combination not predicted to be a false negative.
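The false-negative criteria above, together with the recall summary used to compare assays, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used by the website: the `OligoAlignment` record, its field names, and both function names are hypothetical, and the sketch assumes that each oligo of an assay already has a reported alignment against the target with a mismatch count and a predicted melting temperature (for example, parsed from in silico PCR output). Criterion (c) is applied to primers only, following the wording of the definition.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Thresholds taken from the definition of a predicted false negative.
MISMATCH_LIMIT = 3        # (a) 3 or more mismatches in an oligo/target alignment
MIN_TM_CELSIUS = 40.0     # (b) predicted melting temperature below 40 °C


@dataclass
class OligoAlignment:
    """Hypothetical record for one oligo/target pairwise alignment."""
    is_primer: bool               # True for primers, False for probes
    mismatches: int               # total mismatches in the alignment
    tm_celsius: float             # predicted oligo/target melting temperature
    three_prime_mismatches: int   # mismatches within the last two 3' positions


def is_predicted_false_negative(alignments: Iterable[OligoAlignment]) -> bool:
    """Return True if any oligo of the assay triggers criterion (a), (b) or (c)."""
    for aln in alignments:
        if aln.mismatches >= MISMATCH_LIMIT:                 # criterion (a)
            return True
        if aln.tm_celsius < MIN_TM_CELSIUS:                  # criterion (b)
            return True
        if aln.is_primer and aln.three_prime_mismatches > 0:  # criterion (c)
            return True
    return False


def recall(false_negative_flags: Iterable[bool]) -> float:
    """Recall = TP / (TP + predicted FN) over all targets screened for one assay."""
    flags: List[bool] = list(false_negative_flags)
    if not flags:
        raise ValueError("recall is undefined when no targets were screened")
    false_negatives = sum(flags)
    true_positives = len(flags) - false_negatives
    return true_positives / (true_positives + false_negatives)
```

For example, an assay screened against four genomes, with a 3' primer mismatch predicted for one of them, would have `recall([False, False, False, True])` equal to 0.75. Keeping the thresholds as named constants makes it easy to see how the recall of an assay shifts when a stricter or looser definition of detection failure is used.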
Since all of the included assays are intended to detect SARS-CoV-2 and false positives are not predicted, assay performance is quantified by the recall (defined as the number of true positives divided by the sum of true positives and predicted false negatives). Per-assay recall values are summarized (Fig. 1A). The assays with the best recall rates are shown in a bar chart, which also displays detailed mismatch counts. The total mismatch and failure results are summarized in the per-assay table of aggregated data. Selecting any bar in the chart or assay in the table will display additional information on the distribution of targets with mismatches (Fig. 1D). The phylogenetic tree (Fig. 1B) is created using PhaME (Shakya et al., 2020) and "high-quality" GISAID genomes (<1% Ns and <0.05% unique mutations). The leaves on the tree are represented by the genome labels and color-coded by geographic location. Mousing over the genome labels displays metadata associated with the sample. Identical SARS-CoV-2 sequences are clustered and represented as collapsed branches in the tree. The heatmap (Fig. 1C), color-coded to indicate the number of mismatches, shows analysis of every combination of assay and SARS-CoV-2 genome sequence. Selecting an individual cell of the heatmap displays detailed pairwise alignment information (Fig. 1E). This visualization is rendered using a custom PhyD3 phylogenetic tree viewer (Kreft et al., 2017). Few other public resources exist for assessing the performance of PCR-based SARS-CoV-2 assays. GISAID, one of the primary repositories for SARS-CoV-2 genomes, provides a high-level summary of PCR-based assay performance for registered users. However, this information is provided in the form of a static image with only a limited amount of information. The virological.org website provides static tables summarizing the high-level performance of PCR assays that have been periodically uploaded (Holland et al., 2020).
Unlike these resources, the web-based application presented here provides a more detailed and interactive view of molecular assay performance that is updated regularly with recently deposited genomes (>66K as of July 15, 2020). The heatmap-phylogeny view reveals patterns in predicted assay performance, including mismatches for the Charité RdRP assays (Vogels et al., 2020; Corman et al., 2020) that were originally developed for testing SARS and/or SARS-related bat coronaviruses (Fig. 1C, Supplementary Figure S1). A different pattern, previously noted by Vogels et al. (2020), is seen within a subset of phylogenetically related strains due to a mismatch in the USA CDC N3 assay (CDC, 2020) (Supplementary Figure S1). As genomics continues to be used for understanding pathogen outbreaks, resources such as this website may help identify potential assay concerns early and provide guidance on alternate assay designs, mitigating the erosion of current assays.
-nCoV) Real-Time RT-PCR Diagnostic Panel For Emergency Use Only: Instructions for Use
Diagnostic detection of 2019-nCoV by real-time RT-PCR
Improved assay-dependent searching of nucleic acid sequence databases
BioLaboro: A bioinformatics system for detecting molecular assay signature erosion and designing new assays in response to emerging and reemerging pathogens. bioRxiv
PhyD3: a phylogenetic tree viewer with extended phyloXML support for functional genomics data visualization
Standardized phylogenetic and molecular evolutionary analysis applied to species across the microbial tree of life
GISAID: Global initiative on sharing all influenza data - from vision to reality
Analytical sensitivity and efficiency comparisons of SARS-COV-2 qRT-PCR assays. medRxiv
The authors declare no conflict of interest.
Hosting of edgebioinformatics.org is provided by CyVerse, which is supported by the National Science Foundation under Award Numbers DBI-0735191, DBI-1265383, and DBI-1743442. We acknowledge the authors and originators of sequences from the submitting laboratories who have contributed to the GISAID database. This research was supported by LANL (20200732ER), by DTRA (CB10152 and CB10623) and by the DOE Office of Science (KP160101), through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act.