key: cord-0283830-co0u8qqh
authors: Kuchinski, Kevin S.; Duan, Jun; Himsworth, Chelsea; Hsiao, William; Prystajecky, Natalie A.
title: ProbeTools: Designing hybridization probes for targeted genomic sequencing of diverse and hypervariable viral taxa
date: 2022-02-26
journal: bioRxiv
DOI: 10.1101/2022.02.24.481870
sha: 38531efc256e8c335cc37b9e9e7a44f275be5ee0
doc_id: 283830
cord_uid: co0u8qqh

Background Sequencing viruses in many specimens is hindered by excessive background material from hosts, microbiota, and environmental organisms. Consequently, enrichment of target genomic material is necessary for practical high-throughput viral genome sequencing. Hybridization probes are widely used for enrichment in many fields, but their application to viral sequencing faces a major obstacle: it is difficult to design panels of probe oligo sequences that broadly target many viral taxa due to their rapid evolution, extensive diversity, and genetic hypervariability. To address this challenge, we created ProbeTools, a package of bioinformatic tools for generating effective viral capture panels, and for assessing coverage of target sequences by probe panel designs in silico. In this study, we validated ProbeTools by designing a panel of 3,600 probes for subtyping the hypervariable haemagglutinin (HA) and neuraminidase (NA) genome segments of avian-origin influenza A viruses (AIVs). Using in silico assessment of AIV reference sequences and in vitro capture on egg-cultured viral isolates, we demonstrated effective performance by our custom AIV panel and ProbeTools’ suitability for challenging viral probe design applications. Results Based on ProbeTool’s in silico analysis, our panel provided broadly inclusive coverage of 14,772 HA and 11,967 NA reference sequences. 90% of these HA and NA references sequences had 90.8% and 95.1% of their nucleotide positions covered in silico by the panel respectively. We also observed effective in vitro capture on a representative collection of 23 egg-cultured AIVs that included isolates from wild birds, poultry, and humans and representatives from all HA and NA subtypes. 42 of 46 HA and NA segments had over 98.3% of their nucleotide positions significantly enriched by our custom panel. These in vitro results were further used to validate ProbeTools’ in silico coverage assessment algorithm; 89.2% of in silico predictions were concordant with in vitro results. Conclusions ProbeTools generated an effective panel for subtyping AIVs that can be deployed for genomic surveillance, outbreak prevention, and pandemic preparedness. Effective probe design against hypervariable AIV targets also validated ProbeTools’ design and coverage assessment algorithms, demonstrating their suitability for other challenging viral capture applications.

constraining the design space to a limited number of representative reference genomes, selected 66 either by manual curation or clustering [9] [10] [11] [12] . Some of these strategies have been supplemented 67 with multiple sequence alignments over hypervariable loci or entire genomes so that probes are 68 designed from consensus and degenerate sequences [9] [10] . 69

Spacing between probe sequences is another complicated design consideration. Regular 70 spacing (tiling) is the most common approach because it is easy to implement, but it does not 71 ensure optimal positioning of probes. Reducing the spacing increases the likelihood that some 72 enumerated probes are optimally positioned, but it also increases the number of probe candidates 73 and any associated computation to collapse redundancy among them. Creating the smallest 74 possible panel of probes that optimally covers the entire target space quickly becomes an 75 intractable computational problem, one that had led to increasingly complicated approaches 76 including sophisticated minimization of loss functions [13] . 77

Efforts to address viral hypervariability have resulted in several elaborate probe design 78 algorithms. Unfortunately, these have largely been implemented on a study-by-study basis and 79

have not resulted in general-purpose software tools that can be easily used by others. Meanwhile, 80 among the handful of published probe design packages, there is only one option that specifically 81 limitations of the basic k-mer clustering approach: HA and NA segments remained undetected 128 despite designing additional probes, and additional probes provided only modest and diminishing 129 improvements to the distribution of target coverage. increases, but incrementally designed panels achieved more extensive coverage at larger panel sizes. Incrementally 139 designed panels also provided better coverage of the worst-covered reference sequence using fewer probes. B) Incrementally designed panels shifted coverage distributions upwards for the worst-covered reference sequences.

Each reference sequence in the target space is represented as a dot, plotted according to its probe coverage.

Coverage of the worst-covered reference sequence and 10 th percentile of all reference sequences are indicated above 143 the axis. C) Incrementally designed panels improved reference sequence coverage by re-distributing probes from 144 regions with deep coverage (4 or more probes) to regions with shallow coverage (2 or fewer probes).

Improving target coverage with incremental panel design focused on poorly covered 147 targets: To address the limitations we observed with basic k-mer clustering, we devised an 148 incremental design strategy to improve marginal coverage increases, especially for poorly 149 covered targets. In this strategy, basic k-mer clustering was used to design panels in smaller 150 batches of 100 probes. After adding each batch to the growing panel, target space regions 151 without probe coverage were identified using the capture module. These low coverage regions 152 were then extracted with another ProbeTools module called getlowcov and used as a new target 153 space for designing the next batch. In this way, each subsequent batch of probes was focused on 154 regions not already covered by the panel. 155

We compared target space coverage for panels designed with this incremental strategy 156 against panels designed above using basic k-mer clustering ( Figure 1 ). The incremental strategy 157 provided higher 10 th percentiles of coverage, especially for HA panels larger than 2000 probes 158 and NA panels larger than 1200 probes ( Figure 1A) . Furthermore, the incremental strategy 159 provided better coverage for the worst-covered reference sequences ( Figure 1AB ). We also 160 compared depth of probe coverage, i.e. the number of probes covering each nucleotide position 161 in target sequences ( Figure 1C ). This comparison suggested that the incremental strategy space. The minimum, maximum, and 10 th percentile of reference sequence coverage was 187 calculated for each HA and NA subtype and the M segment ( Figure 3A ). 188

We observed that M segments had the best coverage followed by NA subtypes then HA 189 subtypes, reflecting the comparative levels of genomic diversity within these genome segments. and 92.4% probe coverage respectively. We also noted a significant positive monotonic 195 association between a subtype's target coverage distribution and number of reference sequences 196 from that subtype in the target space ( Figure 3B ). This indicated that over-representing subtypes 197 in the target space resulted in preferential design and better probe coverage for these targets, e.g. 198 the high priority subtypes H5, H7, and H9. between the number of sequences from a subtype in the target space and that subtype's 10 th percentile of coverage.

Each dot represents an HA or NA subtype, and the results of Spearman's rank correlation test are indicated on the 210 plots.

In vitro capture of diverse egg-cultured influenza isolates: After assessing the AIV_v1 panel 213 in silico, we had it synthesized and used it to perform in vitro captures on a collection of diverse 214 egg-cultured AIV isolates (Table 1) . We ensured that each avian-origin HA and NA subtype was 215 represented in the collection, and we included isolates from wild birds, poultry, and humans. The Table 1 : Representative collection of egg-cultured avian influenza virus isolates. Isolates were selected to 220 provide representation of each avian-origin haemagglutinin (HA) and neuraminidase (NA) subtype as well as 221 infections from poultry, wild bird, and human hosts. Each specimen was given a name based on an abbreviation of 222 its host type and a sequential number (P for poultry, WB for wild bird, and H for human). Poultry and wild bird 223 isolates were obtained from the Canadian Food Inspection Agency's National Centre for Foreign Animal Disease 224 (CFIA NCFAD), and human isolates were obtained from the Public Health Agency of Canada's National We also used these in vitro results to assess breadth of enrichment, i.e. the percentage of 253 nucleotide positions in each HA, NA, and M segment that had been significantly enriched by 254 probe capture ( Figure 4C , Table S1 ). Breadth of enrichment was greater than 96.3% for 64 of 69 255 segments in the collection, and it was not less than 46.5% for any segment, which is sufficient 256 for segment and subtype identification. Nine isolates contained high priority H5, H7, and H9 257 segments, all of which had greater than 98.7% breadth of enrichment. This included two isolates 258 from zoonotic human infections (H5N1 and H7N9), which were extensively enriched despite the 259 absence of reference sequences from human infections in the target space used for probe design. 260

We further examined the five segments with less than 96.3% breadth of enrichment to 261 understand why they were apparently not captured in full. First, we used the ProbeTools capture 262 module to assess if the AIV_v1 panel lacked probes targeting their particular genome segment 263 sequences. We observed that most positions without significant enriched were nonetheless 264 extensively covered by the probe panel ( Figure 5A ). This indicated that insufficient design by 265 ProbeTools was not a major explanation for the lack of significant capture of these segments. 266 in the five segments discussed above that were impacted by variability between replicates 316 ( Figure S1 ). We also noted that 7.7% of nucleotide positions were significantly enriched despite 317 not being targeted by the AIV_v1 panel, a phenomenon that was observed in most segments 318 across all isolates (Figure 6 and Figure S1 ). We attribute this to the capture of larger fragments 319 containing untargeted sequences adjacent to the location annealed by the probe. It might also 320 indicate that local alignment parameters used by ProbeTools capture are more conservative than 321 actual annealing thermodynamics. Either way, these results showed that ProbeTools predictions 322 generally reflected actual capture of target genomic material, and in silico predictions more often 323 underestimated panel performance when predictions were incorrect. 324

This study highlighted some important considerations when designing panels using ProbeTools. 327

Foremost among these was the effect of target space composition on panel inclusivity. In this 328 AIV case study, we noted a significant positive monotonic association between panel coverage 329 and the number of reference sequences representing a particular subtype in the target space. 330

Based on how the ProbeTools algorithm ranks probe candidates by the number of k-mers in the 331 cluster they represent, it stands to reason that over-representing similar taxa (which would 332 contain many similar k-mers) would bias the resulting panel towards these taxa. 333

Consequently, ProbeTools users should have a thorough knowledge of the contents of 334 their target space and the possible sources of sampling bias in the databases from which they 335 obtain their reference sequences. In the case of AIVs, the agricultural impacts and public health 336 threats of certain HA subtypes have led to more frequent sequencing of these subtypes and 337

accessioning of their genome sequences in popular databases. For our panel, this contributed to 338 bias towards subtypes like H5, H7 and H9. Whether this is a benefit or limitation will depend on 339 the intended application. In the context of outbreak prevention and pandemic preparedness, a 340 panel biased towards taxa that are known for their agricultural impact and zoonotic potential is 341 beneficial. If the objective is to characterize viral diversity and ecology in wildlife, however, this 342 could be a limitation. 343

To obtain the best results, ProbeTools users should purposefully curate their target space 344 to serve their probe capture objectives. Users may want to identify taxa whose detection is a 345 priority and over-represent them in the target space. Conversely, users may want to 'flatten' their 346 target space to ensure no particular taxa, clades, subtypes, etc dominate. This could be done 347 manually, by selecting specific sequences to represent relevant groups, or it could be attempted 348 bioinformatically by pre-clustering target sequences, providing the number and length of target 349 sequences do not make this computationally prohibitive. 350

Another strategy could be to use the various ProbeTools modules to extract low coverage 351 sequences from specific groups whose target sequences have poor probe coverage after a core 352 panel is designed. For instance, had H15 subtype AIVs been a surveillance priority in this study, The clusterkmers module enumerates and clusters probe-length k-mers from user-388 provided target sequences. 1) K-mers are enumerated using a sliding window that advances by a 389 specified number of bases. 2) K-mers are clustered based on nucleotide sequence similarity using 390 VSEARCH cluster_fast [34] . 3) Centroid sequences from each cluster are ranked by the size of 391 the cluster they represent. Centroids from larger clusters are assumed to be better probe 392 candidates by virtue of having similarity to more k-mers in the target space. By default, 393 clusterkmers enumerates 120-mers, advancing the window one base at a time, and it clusters 394 using a nucleotide sequence identity threshold of 90%. Previous studies have observed effective 395 hybridization between targets and probes with this degree of sequence similarity [9, 11] . 396

The capture module predicts how well user-provided probe sequences cover user-397 provided target sequences. 1) Each probe sequence is locally aligned against each target 398 sequence using BLASTn [35] . 2) Alignments are filtered, retaining those with a minimum 399 sequence identity over a minimum alignment length. 3) Subject alignment start and end 400 coordinates are extracted from the BLASTn results to determine which nucleotide positions in 401 the target sequences are covered by probes. By default, capture requires 90% sequence identity 402 over at least 60 bases to assign probe coverage to the aligned positions. 403

The getlowcov module uses the output of capture to extract genomic regions with low 404 coverage from the provided targets. This allows for additional probe design focused on poorly 405 covered regions of the target space. This module returns all sub-sequences where a minimum 406 number of consecutive bases were covered by fewer than a specified number of probes. By 407 default, getlowcov returns all sub-sequences over 40 bases in length where all bases in the sub-408 sequence were covered by zero probes. 409

The stats module uses the output of capture to calculate coverage statistics. For each 410 provided target, it calculates the percentage of nucleotide positions covered by varying numbers 411 of probes ("target coverage" and "probe depth"). 412

The makeprobes module chains the previous modules together to implement a 413 [34] with a 100% sequence identity threshold to remove redundant entries. The remaining entries 427 were used as our final AIV target space (described in Table 2 ). 428 429 were normalized by dividing raw pre-and post-capture read depths by the total reads in the 479 corresponding pre-and post-capture pools (Table S2) Table S2 . 

Sequencing for the Detection and Characterization of RNA Viruses. Front Microbiol

Multiple approaches for massively parallel sequencing of SARS-CoV-2 genomes directly 546 from clinical samples

Clinical and biological insights from viral genome 548 sequencing

Specific capture and whole-genome sequencing of viruses from clinical 551 samples

Design and validation of a universal influenza virus 576 enrichment probe set and its utility in deep sequence analysis of primary cloacal swab 577 surveillance samples of wild birds

Capturing sequence diversity in metagenomes with 585 comprehensive and scalable probe design

MrBait: universal identification and design of 587 targeted-enrichment capture probes

OligoMiner 589 provides a rapid, flexible environment for the design of genome-scale oligonucleotide in situ 590 hybridization probes

Software Package for Multispecies Target DNA Enrichment Probe Design

MetCap: a bioinformatics 596 probe design pipeline for large-scale targeted metagenomics

The evolutionary genetics and emergence of avian 599 influenza viruses in wild birds

Frequency and patterns of reassortment in natural influenza A virus infection in a reservoir 602 host. Virology

Highly Pathogenic Avian Influenza Viruses at the 605

Bird Interface in Europe: Future Directions for Research and 606

The Global Threat of Animal Influenza Viruses of 608

Zoonotic Concern: Then and Now

Zoonotic Potential of Influenza A 610 Viruses: A Comprehensive Overview

The Pandemic Threat of Emerging H5 and H7 Avian Influenza Viruses

Avian influenza virus (H5N1): a threat to human health

Pandemic potential of avian 616 influenza A (H7N9) viruses

A review of H5Nx avian influenza viruses

The disease burden of influenza beyond respiratory illness

Global burden of 629 influenza-associated lower respiratory tract infections and hospitalizations among adults: A 630 systematic review and meta-analysis

The Burden of Influenza: a Complex Problem

The hidden burden of influenza: A 634 review of the extra-pulmonary complications of influenza infection. Influenza Other Respir 635 Viruses

Influenza Collaborators. Mortality, morbidity, and hospitalisations due to 637 influenza lower respiratory tract infections, 2017: an analysis for the Global Burden of 638

Disease Study

Global Consortium for H5N8 and Related Influenza Viruses. Role for migratory wild birds in 640 the global spread of avian influenza H5N8

Connecting the study of wild 642 influenza with the potential for pandemic disease

VSEARCH: a versatile open source tool 644 for metagenomics

BLAST+: architecture and applications

Influenza Research Database: An integrated bioinformatics 651 resource for influenza virus research

Single-reaction genomic amplification accelerates sequencing and vaccine production for 654 classical and Swine origin human influenza a viruses

Development of a real-time reverse transcriptase 656 PCR assay for type A influenza virus and the avian H5 and H7 hemagglutinin subtypes

Global 659 trends in emerging infectious diseases

Global 661 rise in human infectious disease outbreaks

The Global Virome Project. Science

Viral surveillance and discovery

Towards a genomics-informed, real-time, global pathogen surveillance 669 system

Opinion: Intercepting pandemics through genomics

The impact of genomics on precision public health: beyond the 674 pandemic

Epidemiologic data and pathogen genome sequences: a powerful 677 synergy for public health

The role of pathogen genomics in assessing disease transmission

Pathogen Genomics in Public Health

The authors declare that they have no competing interests. JD performed preliminary studies with k-mer clustering, assisted with the design and 516implementation of the ProbeTools algorithms, and provided guidance on bioinformatic data 517 analysis. CH helped assemble the validation collection of egg-cultured AIV isolates, ensured 518 relevant strains were included, and provided direction for AIV probe panel design to ensure its 519 suitability for agricultural surveillance applications. WH provided guidance on implementing 520ProbeTools algorithms, best practices for constructing and distributing bioinformatics tools and 521 packages, and bioinformatic data analysis. NP provided guidance on experiment design for in 522 vitro captures, troubleshooting for library preparation, probe capture, and sequencing of egg-523 cultured AIV isolates, and provided direction for AIV probe panel design to ensure its suitability 524 for public health surveillance applications. All authors reviewed and contributed comments on 525 the manuscript. 526