key: cord-1033451-eifrg2fe
authors: Holland, Mitchell; Negrón, Daniel; Mitchell, Shane; Dellinger, Nate; Ivancich, Mychal; Barrus, Tyler; Thomas, Sterling; Jennings, Katharine W.; Goodwin, Bruce; Sozhamannan, Shanmuga
title: BioLaboro: A bioinformatics system for detecting molecular assay signature erosion and designing new assays in response to emerging and reemerging pathogens
date: 2020-04-10
journal: bioRxiv
DOI: 10.1101/2020.04.08.031963
sha: ef8d7fc2fbc21c567502d5dbedc78d2e1b80d990
doc_id: 1033451
cord_uid: eifrg2fe

Background Emerging and reemerging infectious diseases such as the novel Coronavirus disease, COVID-19 and Ebola pose a significant threat to global society and test the public health community’s preparedness to rapidly respond to an outbreak with effective diagnostics and therapeutics. Recent advances in next generation sequencing technologies enable rapid generation of pathogen genome sequence data, within 24 hours of obtaining a sample in some instances. With these data, one can quickly evaluate the effectiveness of existing diagnostics and therapeutics using in silico approaches. The propensity of some viruses to rapidly accumulate mutations can lead to the failure of molecular detection assays creating the need for redesigned or newly designed assays. Results Here we describe a bioinformatics system named BioLaboro to identify signature regions in a given pathogen genome, design PCR assays targeting those regions, and then test the PCR assays in silico to determine their sensitivity and specificity. We demonstrate BioLaboro with two use cases: Bombali Ebolavirus (BOMV) and the novel Coronavirus 2019 (SARS-CoV-2). For the BOMV, we analyzed 30 currently available real-time reverse transcription-PCR assays against the three available complete genome sequences of BOMV. Only two met our in silico criteria for successful detection and neither had perfect matches to the primer/probe sequences. We designed five new primer sets against BOMV signatures and all had true positive hits to the three BOMV genomes and no false positive hits to any other sequence. Four assays are closely clustered in the nucleoprotein gene and one is located in the glycoprotein gene. Similarly, for the SARS-CoV-2, we designed five highly specific primer sets that hit all 145 whole genomes (available as of February 28, 2020) and none of the near neighbors. 
Conclusions Here we applied BioLaboro in two real-world use cases to demonstrate its capabilities: 1) to identify signature regions, 2) to assess the efficacy of existing PCR assays to detect pathogens as they evolve over time, and 3) to design new assays with perfect in silico detection accuracy, all within hours, for further development and deployment. BioLaboro is designed with a user-friendly graphical user interface for biologists with limited bioinformatics experience.

Using next generation sequencing, whole genome sequences (WGS) of SARS-CoV-2 are continuously being released and shared (306 complete genomes as of March 09, 2020) with the entire research community through the Global Initiative on Sharing All Influenza Data (GISAID) [31]. The release of WGS allowed us to test the BioLaboro pipeline (described in this study) to evaluate currently used diagnostic assays and to rapidly design new assays.

In a previous study, we described a bioinformatics tool called PSET (PCR signature erosion tool) and used it to show in silico, confirmed with wet lab work, the effectiveness of existing Ebolavirus diagnostic assays against a large number of sequences available at that time [32]. The phrase "signature erosion" used here signifies potential false-positive or false-negative results in PCR assays due to mutations in the primers, probe, or amplicon target sequences (PCR signatures). Signature erosion could also mean failure of medical countermeasures; for example, a change in the genomic sequence resulting in an amino acid change that could potentially alter the efficacy of sequence-based therapeutics [33, 34].

In this study, we describe an expanded bioinformatics pipeline called BioLaboro in which we have integrated several tools: BioVelocity®, Primer3 and PSET for end-to-end analysis of outbreak pathogen genome sequences to evaluate existing PCR assay efficacy against the new sequences, and to identify unique signature regions (BioVelocity), design PCR assays to these

BioLaboro architecture

BioLaboro is composed of three algorithms (BioVelocity, Primer3, and PSET), which are built into a pipeline for user-friendly applications. The user has the option to launch one of four different job types: Signature Discovery, Score Assay Targets, Validate Assay, or New Assay Discovery. Each of the three algorithms can be run individually or together as a complete end-to-end pipeline (Figure 1). For the BOMV use case, in the first phase of the pipeline BioVelocity was used to analyze a set of genome sequences for unique regions that are both conserved and signature to the target sequences selected. This was achieved by splitting a chosen representative whole genome sequence into sliding k-mers of 50 base pairs (bps). Each k-mer was then scanned against all target sequences to determine conservation. Conserved k-mers were then elongated based on overlaps and formed into contigs. These contigs were then split into k-mers ≤ 250 bps and scanned against all non-target sequences to determine specificity. All passing sequences were then elongated based on overlaps and the signature contigs were passed to the next step in the pipeline. Primer3 was then used to evaluate the signature contigs to identify suitable primers and probes for assay development. Primer3 was run in parallel against all signatures and the output was ranked by penalty score in ascending order. The top five primer sets were passed along to the final step in the pipeline, PSET. In this step the primer sets were run through a bioinformatics pipeline which aligned the sequences against large public sequence databases from NCBI using BLAST and GLSEARCH [36] to determine how well each assay correctly aligned to all target sequences while excluding off-target hits.
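The elongation of overlapping k-mers into contigs, and the later splitting of contigs into pieces of at most 250 bps, can be illustrated with a minimal Python sketch. It is not BioVelocity's implementation, which is not described here. The function names, the half-open coordinate convention, and the non-overlapping chunking used for the split are all assumptions made for illustration.

```python
def merge_kmers_into_contigs(starts, k):
    """Merge overlapping k-mer windows into contigs.

    `starts` are 0-based start positions of conserved k-mers on the
    reference; contigs are returned as half-open (start, end) intervals.
    Hypothetical helper, not part of BioVelocity.
    """
    contigs = []
    for s in sorted(starts):
        if contigs and s < contigs[-1][1]:
            # Window overlaps the current contig: extend its end.
            contigs[-1][1] = max(contigs[-1][1], s + k)
        else:
            # No overlap: start a new contig.
            contigs.append([s, s + k])
    return [tuple(c) for c in contigs]


def split_contig(start, end, max_len=250):
    """Split one contig into consecutive pieces of at most `max_len` bps.

    Non-overlapping chunking is an assumption; the source only states
    the 250 bps maximum.
    """
    return [(i, min(i + max_len, end)) for i in range(start, end, max_len)]


if __name__ == "__main__":
    print(merge_kmers_into_contigs([0, 1, 2, 10], k=5))
    print(split_contig(0, 600))
```

Working with coordinates rather than strings keeps the merge linear after sorting, and the contig sequence can be recovered by slicing the reference with each interval.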
Even the two assays that passed in silico criteria did not have perfect matches, raising the possibility that these assays may fail in wet lab testing due to mismatches against currently available BOMV genomic sequences. Hence, as described below, we designed new assays using the BioLaboro platform.
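The idea of an imperfect match can be made concrete with a small Python sketch that counts mismatches between a primer and an equally long binding site and reports percent identity. This is a simplified illustration only: PSET relies on BLAST and GLSEARCH alignments, whereas the sketch assumes an ungapped comparison with no degenerate bases. The function names and the sequences are hypothetical.

```python
def count_mismatches(primer, site):
    """Count positions where an ungapped primer and binding site differ."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(a != b for a, b in zip(primer.upper(), site.upper()))


def percent_identity(primer, site):
    """Percent of primer positions that match the binding site exactly."""
    matches = len(primer) - count_mismatches(primer, site)
    return 100.0 * matches / len(primer)


if __name__ == "__main__":
    primer = "ACGTACGTAC"  # hypothetical primer
    site = "ACGTTCGTAC"    # hypothetical genome site with one substitution
    print(count_mismatches(primer, site), percent_identity(primer, site))
```

In practice the position of a mismatch matters as well as the count, since mismatches near the 3' end of a primer are generally more disruptive; the sketch does not model this.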

Discovery of potential new BOMV assays using BioLaboro end-to-end pipeline

Using BioLaboro we ran a New Assay Discovery job to discover new BOMV signatures and determine their potential for accurate detection using PSET. In the first phase, BioVelocity was used to search for conserved and signature regions within the selected genomes. We selected the organism of interest by searching for "Bombali ebolavirus" from the database and selecting the three available complete genomes. The MF319185.1 genome was used as the algorithmic reference sequence as it is the same one that NCBI selected for the RefSeq database (GenBank ID: NC_039345.1). The algorithmic reference sequence was first split into k-mers of 50 bps each using a sliding window of 1 bp, which amounts to 18,994 k-mers to be evaluated with BioVelocity's conserved sequence detection algorithm. BioVelocity found 27% (5,237) of these k-mers to be conserved in all three of the BOMV genomes. The conserved k-mers were then evaluated to determine overlapping segments and were combined into 120 conserved contigs.
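The k-mer count follows from the window arithmetic: a sequence of length L yields L - k + 1 windows of size k at a step of 1 bp, so 18,994 k-mers of 50 bps imply a reference of 18,994 + 50 - 1 = 19,043 bps (an inference from the numbers above, not a figure stated in the text). The Python sketch below shows the splitting and a naive conservation check. Exact substring matching is an assumption, since BioVelocity's matching criteria are not described here, and the function names are hypothetical.

```python
def sliding_kmers(seq, k=50, step=1):
    """Return all k-mers of `seq` taken with a sliding window."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def conserved_kmers(reference, targets, k=50):
    """Return (position, k-mer) pairs found exactly in every target.

    Exact matching is an illustrative assumption, not BioVelocity's method.
    """
    return [
        (i, kmer)
        for i, kmer in enumerate(sliding_kmers(reference, k))
        if all(kmer in target for target in targets)
    ]


if __name__ == "__main__":
    # A placeholder sequence of the implied reference length.
    print(len(sliding_kmers("A" * 19043, k=50)))
    # Toy example with k=4 and two hypothetical target sequences.
    hits = conserved_kmers("ACGTACGT", ["ACGTACGT", "TTACGTAA"], k=4)
    print([pos for pos, _ in hits])
```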

These contigs were next evaluated with BioVelocity's signature sequence detection algorithm.

The contigs were split into signatures with a maximum size of 250 bps (longer contigs were split into [...] 563,843 complete genomes and plasmids from the NCBI GenBank repository. There were 291 k-mer sequences found to be signatures of BOMV. The signatures were then evaluated to determine overlapping reads and combined back into 119 signature contigs. Metrics for the BioVelocity run in phase one are shown in Table 2.

Table 3 Legend: The five new assays identified by Primer3 ranked by lowest penalty score. The Identifier column is an automated ID generated from the pipeline, the Targets column is the [...] (Table 4).

Table 4 Legend: True positive (TP: all assay components hit with >=90% identity over >=90% [...] Table 5.

In the second phase, Primer3 was used to identify potential primer pairs and probes for generating new PCR detection assays as described above for BOMV. There were 330 primer sets created from the signatures, each of which was assigned a penalty score to facilitate comparison of the results. Primer sets were sorted by lowest penalty score and five potential assays were chosen manually in order to distribute potential candidates across the genome. These assays were then formatted and sent to the final step for validation using PSET. Primer sets sent to PSET are shown in Table 6.

Table 6 Legend: The five new assays identified by Primer3 ranked by lowest penalty score. The Identifier column is an automated ID generated from the pipeline, the Targets column is the Taxonomy ID for SARS-CoV-2, the Definition column contains the amplicon sequence with the primers in brackets (orange) and the probe in parentheses (blue), and the Penalty Points column contains the score generated after taking into account primer design parameters.
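The ranking by penalty score and the spreading of candidates across the genome can be sketched in Python. In this study the five SARS-CoV-2 assays were chosen manually; the greedy selection below is only one hypothetical way to automate a similar choice. The `PrimerSet` fields, the `min_gap` threshold, and the example values are assumptions, not pipeline output.

```python
from typing import NamedTuple


class PrimerSet(NamedTuple):
    identifier: str  # hypothetical ID
    start: int       # amplicon start position on the genome
    penalty: float   # penalty score; lower is better


def pick_spread(primer_sets, n=5, min_gap=1000):
    """Greedily pick up to `n` low-penalty sets at least `min_gap` bps apart."""
    chosen = []
    for ps in sorted(primer_sets, key=lambda p: p.penalty):
        if len(chosen) >= n:
            break
        # Keep the candidate only if it is far from every earlier pick.
        if all(abs(ps.start - c.start) >= min_gap for c in chosen):
            chosen.append(ps)
    return chosen


if __name__ == "__main__":
    candidates = [
        PrimerSet("a", 100, 0.5),
        PrimerSet("b", 150, 0.6),
        PrimerSet("c", 5000, 0.9),
        PrimerSet("d", 9000, 0.7),
    ]
    print([ps.identifier for ps in pick_spread(candidates, n=3)])
```

Sorting first and filtering second means the best-scoring set is always kept, and a nearby runner-up is skipped in favor of a slightly worse set elsewhere on the genome.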

The five new assays were mapped to the SARS-CoV-2 genome presented below (Figure 3). This

In the third phase, PSET was used to test the five newly designed assays identified by Primer3 in silico against publicly available sequences as described above for BOMV signatures.

The results were then validated by comparing the hits to the target NCBI Taxonomy identifier which is likely due to missing sequences.
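Comparing each hit with the target Taxonomy identifier can be sketched as a four-way classification. Only the true positive rule (all assay components hit with >=90% identity over >=90%) appears in this text, and it is truncated; the FP, FN, and TN rules below are assumptions that mirror it, reading the second threshold as coverage. The function name and the non-target ID are hypothetical; 2697049 is the NCBI Taxonomy ID for SARS-CoV-2.

```python
def classify_hit(hit_taxid, target_taxid, components,
                 min_identity=90.0, min_coverage=90.0):
    """Classify one subject sequence as TP, FP, FN, or TN.

    `components` holds one (identity %, coverage %) pair per assay
    component (forward primer, reverse primer, probe). Thresholds follow
    the TP rule quoted above; the other outcomes are assumed.
    """
    detected = bool(components) and all(
        identity >= min_identity and coverage >= min_coverage
        for identity, coverage in components
    )
    on_target = hit_taxid == target_taxid
    if detected:
        return "TP" if on_target else "FP"
    return "FN" if on_target else "TN"


if __name__ == "__main__":
    target = 2697049       # NCBI Taxonomy ID for SARS-CoV-2
    near_neighbor = 12345  # hypothetical non-target ID for illustration
    good = [(100.0, 100.0), (95.0, 100.0), (100.0, 100.0)]
    weak = [(100.0, 100.0), (85.0, 100.0), (100.0, 100.0)]
    print(classify_hit(target, target, good))
    print(classify_hit(target, target, weak))
    print(classify_hit(near_neighbor, target, good))
    print(classify_hit(near_neighbor, target, weak))
```

A strict equality test on the identifier ignores child taxa; a fuller version would walk the taxonomy tree so that strains filed under a descendant node still count as on-target.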

We also tested the SARS-CoV-2 assays using PSET on near neighbor sequences that were generated during this outbreak, such as the bat and pangolin sequences. As expected, the analyses showed a range of TP results from pan assays and FN results due to sequence divergence (Table 8).

Table 8. PSET results of SARS-CoV-2 PCR assays against bat and pangolin SARS-CoV sequences.

Since its discovery in 2016, BOMV RNA has been detected in oral and rectal swabs as

The queue includes three sections: 1) the currently running job is identified with information on the current sub-process, 2) the job queue is shown, which can be re-arranged as needed, and 3) the finished jobs section shows completed jobs with timestamp metadata and an option to re-run the same job with identical parameters.

Weekly updates of SARS-CoV assay performance using PSET are posted at Virological.org (http://virological.org/t/preliminary-in-silico-assessment-of-the-specificity-of-published-molecular-assays-and-design-of-new-assays-using-the-available-whole-genome-sequences-of-2019-ncov/343/16).

The datasets analyzed during the current study are available in the following: 

Emerging infectious diseases: a 10-year perspective from the National Institute of Allergy and Infectious Diseases. Emerg Infect Dis

Emerging infectious diseases: threats to human health and global stability

What Recent History Has Taught Us About Responding to Emerging Infectious Disease Threats

Pricing infectious disease. The economic and health implications of infectious diseases

Emerging infectious diseases: A proactive approach

Coronavirus disease 2019

An interactive web-based dashboard to track COVID-19 in real time

Ebola epidemic

World Health Organization: Situation Report-Ebola Virus Disease-10

World Health Organization: Ebola in the Democratic Republic of Congo-Health

Controlled Trial of Ebola Virus Disease Therapeutics

European Medicines Agency: First Vaccine to protect against Ebola

Ebola virus disease, marking a critical milestone in public health preparedness and response

General introduction into the Ebola virus biology and disease

The Pathogenesis

Risks Posed by Reston, the Forgotten Ebolavirus

Reston ebolavirus in humans and animals in the Philippines: a review

The discovery of Bombali virus adds further support for bats as hosts of ebolaviruses

Fruit bats as reservoirs of Ebola virus

Ebola: Hidden reservoirs

Assessing the Evidence Supporting Fruit Bats as the Primary Reservoirs for Ebola Viruses. Ecohealth

Filoviruses in bats: current knowledge and future directions. Viruses

Conserved differences in protein sequence determine the human pathogenicity of Ebolaviruses

Is the Bombali virus pathogenic in humans?

World Health Organization: WHO Director-General's opening remarks at the media briefing on COVID-19 -11

Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet

Huang CL et al: A pneumonia outbreak associated with a new coronavirus of probable bat origin

Coronavirus disease 2019 (COVID-19) Situation Report -

Data, disease and diplomacy: GISAID's innovative contribution to global health

Evaluation of Signature Erosion in Ebola Virus Due to Genomic Drift and Its Impact on the Performance of Diagnostic Assays

Draft versus finished sequence data for DNA and protein diagnostic signature development. Nucleic Acids Res

Basic local alignment search tool

The EMBL-EBI search and sequence analysis tools APIs in 2019

Implementation of Objective PASC-Derived Taxon Demarcation Criteria for Official Classification of Filoviruses

DNA Features Viewer, a sequence annotations formatting and plotting library for Python

Initial Public Health Response and Interim Clinical Guidance for the 2019 Novel Coronavirus Outbreak -United States

-nCoV) by real-time RT-PCR

Bombali Virus in Mops condylurus Bat

Bombali Virus in Mops condylurus Bats, Guinea. Emerg Infect Dis

Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak

Ebola virus disease outbreak in

Democratic Republic of the Congo: a retrospective genomic characterisation

MAFFT multiple sequence alignment software version 7: improvements in performance and usability

SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments

Statistical Computing

The authors declare that they have no competing interests.