key: cord-0309690-7sgathuq authors: Kuchinski, Kevin S.; Loos, Kara D.; Suchan, Danae M.; Russell, Jennifer N.; Sies, Ashton N.; Kumakamba, Charles; Muyembe, Francisca; Kingebeni, Placide Mbala; Lukusa, Ipos Ngay; N’Kawa, Frida; Losoma, Joseph Atibu; Makuwa, Maria; Gillis, Amethyst; LeBreton, Matthew; Ayukekbong, James A.; Monagin, Corina; Joly, Damien O.; Saylors, Karen; Wolfe, Nathan D.; Rubin, Edward M.; Tamfum, Jean J. Muyembe; Prystajecky, Natalie A.; McIver, David J.; Lange, Christian E.; Cameron, Andrew D.S. title: Targeted genomic sequencing with probe capture for discovery and surveillance of coronaviruses in bats date: 2022-04-26 journal: bioRxiv DOI: 10.1101/2022.04.25.489472 sha: bc2a281ce50765dac253a36da906f357f7e7b2d4 doc_id: 309690 cord_uid: 7sgathuq Public health emergencies like SARS, MERS, and COVID-19 have prioritized surveillance of zoonotic coronaviruses, resulting in extensive genomic characterization of coronavirus diversity in bats. Sequencing viral genomes directly from animal specimens remains a laboratory challenge, however, and most bat coronaviruses have been characterized solely by PCR amplification of small regions from the best-conserved gene. This has resulted in limited phylogenetic resolution and left viral genetic factors relevant to threat assessment undescribed. In this study, we evaluated whether a technique called hybridization probe capture can achieve more extensive genome recovery from surveillance specimens. Using a custom panel of 20,000 probes, we captured and sequenced coronavirus genomic material in 21 swab specimens collected from bats in the Democratic Republic of the Congo. For 15 of these specimens, probe capture recovered more genome sequence than had been previously generated with standard amplicon sequencing protocols, providing a median 6.1-fold improvement (ranging up to 69.1-fold). Probe capture data also identified five novel alpha- and betacoronaviruses in these specimens, and their full genomes were recovered with additional deep sequencing. Based on these experiences, we discuss how probe capture could be effectively operationalized alongside other sequencing technologies for high-throughput, genomics-based discovery and surveillance of bat coronaviruses. other sequencing technologies for high-throughput, genomics-based discovery and surveillance 48 of bat coronaviruses. 49 Orthocoronavirinae, commonly known as coronaviruses (CoVs), are a diverse subfamily of 52 mapped to draft genomes using bwa mem (v0.7.17-r1188), then alignments were filtered, sorted, 175 and indexed using samtools (v1.11) [Li 2009a , Li 2009b . Variants were called with bcftools 176 mpileup and call (v1.9), then variants were applied to draft genomes with bcftools consensus 177 were prepared from archived RNA that had been previously extracted from these specimens, 208 although some libraries (n=6) were prepared from RNA that was freshly extracted from archived 209 primary specimens (Table 1) . CoVs had been previously detected in these specimens with PCR which had been used to assign these specimens to four novel phylogenetic groups of alpha-and 213 betacoronaviruses (Table 1) . 214 We captured CoV genomic material in these metagenomic bat swab libraries with our 215 8 surveillance settings: sequencing reads from probe captured libraries were assembled de novo 218 into contigs, then CoV sequences were identified by locally aligning contigs against a database 219 of CoV reference sequences. In total, 113 CoV contigs were recovered from 17 of 25 libraries. 220 We compared contig lengths to the partial RdRp amplicons that been previously generated for 221 these specimens (Figure 2A) Next, we used assembly size metrics to assess the extent to which these contigs 231 represented complete genomes. The median total assembly size was 1,724 nucleotides (IQR: 0 to 232 5,834 nucleotides), while median assembly N50 size was 533 nucleotides (IQR: 0 to 908 233 nucleotides) ( Figure 2B ). This assembly size-based assessment of genome completeness had 234 limitations, however. Some assembly sizes may have been understated by genome regions with 235 comparatively low read coverage that failed to assemble. Conversely, other assembly sizes may 236 have been overstated by redundant contigs resulting from forked assembly graphs, either due to 237 genetic variation within the intrahost viral population or due to polymerase errors introduced 238 during library construction and probe capture. For instance, the total assembly size for library 239 CDAB0217R-PRE was 33,195 nucleotides, exceeding the length of the longest known CoV 240 genome ( Figure 2C ). Another limitation of this analysis was that these assembly metrics 241 provided no indication of which regions of the genome had been recovered. 242 To address these limitations, we also applied a reference sequence-based strategy. We 243 used the contigs to identify the best available CoV reference sequences for each of the four novel 244 phylogenetic groups to which these specimens had been assigned. Sequencing reads from 245 captured libraries were directly mapped to these reference sequences and the contigs we had 246 assembled de novo were also locally aligned to them (Fig 3 and S1-S4 ). Based on these read mapped sequencing reads or contigs ( Figure 4A ). 250 The median breadth of reference sequence recovery for all libraries was 2,376 251 nucleotides (IQR: 306 to 9,446 nucleotides). Most libraries (48%) represented specimens from 252 phylogenetic group Q-Alpha-4, which had a median reference sequence recovery of 6,497 253 nucleotides (IQR: 733 to 9,802 nucleotides, max: 12,673 nucleotides). Phylogenetic group W-254 Beta-3 also accounted for a substantial fraction of libraries (32%), and although median 255 reference sequence recovery was lower than for Q-Alpha-4 (2,427 nucleotides), W-Beta-3 256 provided the libraries with the most extensive reference sequence recoveries (IQR: 780 to 19,286 257 nucleotides, max: 26,755 nucleotides). As a simple way to quantify differences in recovery of 258 CoV genome sequence between probe capture and amplicon sequencing, we calculated the ratio 259 between the breadth of reference sequence recovery and the length of the previously generated 260 partial RdRp amplicon sequence for each library ( Figure 4B ). The median ratio was 6.1-fold 261 Since their sequences were known, we could assess probe coverage in silico and demonstrate 277 whether these targets were covered by the panel. All partial RdRp amplicons had at least 95.3% into extensive recovery. For 12 of 25 libraries, no part of the partial RdRp sequence was 280 recovered, and full/nearly-full recovery (>95%) of the partial RdRp sequence was achieved for 281 only 7 of 25 libraries ( Figure 5A ). These results demonstrated that genome recovery had been 282 limited by factors other than probe panel inclusivity. 283 Next, we examined nucleic acid concentration and integrity, two specimen characteristics 284 associated with successful library preparation. Median RNA Integrity Number (RIN) values and 285 RNA concentrations for these specimens were low: 1.1 and 14 ng/μl respectively, as was 286 expected from archived material ( Figure 5B ). To assess the impact of RIN and RNA 287 concentration on probe capture recovery, we compared these specimen characteristics against 288 breadth of reference sequence recovery from the corresponding libraries ( Figure 5B ). Weak 289 monotonic relationships were observed, with lower RNA concentration and lower RIN values 290 generally leading to worse genome recovery. This relationship was significant for RNA 291 concentration (p=0.045, Spearman's rank correlation), but not for RNA integrity despite trending 292 towards significance (p=0.053, Spearman's rank correlation). These weak associations suggested 293 additional factors hindered recovery, e.g. low prevalence of viral material or missing probe 294 coverage for genomic regions outside the partial RdRp target. Missing probe coverage is 295 considered in the next section. Prevalence of viral material was not practical to consider as there 296 are no established pan-CoV methods for quantifying genome copies in RNA specimens, a 297 limitation that would also preclude attempts to triage surveillance specimens based on viral 298 abundance in high-throughput settings. 299 300 Inclusivity of custom probe panel against CoV taxa in study specimens: Next, we considered 301 if blind spots in the probe panel had contributed to incomplete genome recovery from these 302 specimens. This inquiry suffered a counterfactual problem: to assess whether the CoV taxa in our 303 specimens were fully covered by our probe panel, we would need their complete genome 304 sequences. We did not have their full genome sequences, however, because the probes did not 305 recover them. Instead, we evaluated probe coverage of the reference sequences assigned to each 306 phylogenetic group, assuming they were the available CoV sequences most similar to those in also suggested the CoVs in these specimens contained spike genes that were highly divergent 322 from any others that have been previously described. This led us to perform deep metagenomic 323 sequencing on select specimens to attempt recovery of complete CoV genomes. We selected the 324 following nine specimens, either due to extensive recovery by probe capture (indicating 325 comparatively abundant and intact viral genomic material) or to ensure representation of the four 326 novel phylogenetic groups: CDAB0017RSV, CDAB0040RSV, CDAB0174R, CDAB0203R, 327 CDAB0217R, CDAB0113RSV, CDAB0491R, and CDAB0492R. 328 Complete genomes were only recovered from 5 specimens: CDAB0017RSV, 329 CDAB0040RSV, CDAB0203R, CDAB0217R, and CDAB0492R. The abundance of CoV 330 genomic material in these 5 specimens was estimated by mapping reads from uncaptured 331 libraries to the complete genome sequence that we recovered. On-target rates, i.e. the percentage 332 of total reads mapping to the CoV genome, were calculated ( Figure 7A ). These ranged from 333 0.003% to 0.064%, revealing the extremely low abundance of viral genomic material present in 334 these swabs. Considering these were the most successful libraries, these results highlighted that 335 low prevalence of viral genomic material is one challenging characteristic of swab specimens. 336 We also used the complete genome sequences that we recovered to assess how 337 effectively probe capture enriched target genomic material in these specimens. Valid reads from 338 probe captured libraries were mapped to the complete genomes from their corresponding 339 specimens. On-target rates for captured libraries ranged from 11.3% to 45.1% of valid reads these means, with the probe captured on-target rates significantly higher (p<0.001, t-test on 2 347 independent means). These results confirmed effective enrichment by probe capture of CoV 348 material present in these libraries. Pairwise global alignments of amino acid sequences were conducted between these novel 361 spike genes and the spike genes from GenBank with which they grouped phylogenetically. 362 Alignments completely covered all novel spike sequences, but they were all less than 76.5% 363 identical and less than 85.7% positive (Table 2) . We compared host species and geographic 364 collection locations for our study specimens and the phylogenetically related spike sequences. 365 Only specimens CDAB0203R and CDAB0217R were collected from the same bat species as 366 their closest spike protein matches in GenBank (Eidolon helvum). Other specimens were 367 detected in bat genera different from their closest GenBank match. All study specimens were 368 collected from the DRC, but their closest GenBank matches were collected from diverse locales, 369 including neighbouring Kenya, Cameroon in West Africa, and Yunnan province in China. Taken 370 together, these low alignment scores, disparate host species, and dispersed collection locations 371 suggested these viruses belong to extensive but hitherto poorly characterized taxa of CoV. confirm that probe capture had been hindered by divergence of these novel spike genes from 374 their closest matches in GenBank, which we had used to design our custom panel. For specimen 375 CDAB0017RSV, sequence similarity was so low that no alignment was generated for the spike 376 gene. Nucleotide alignments for the other specimens were all incomplete (18% to 83% coverage 377 of the novel spike sequence) with low nucleotide identities (71.5% to 84.6%). 378 This study highlights the potential for probe capture to recover greater extents of CoV genome 381 compared to standard amplicon sequencing methods. In discovery and surveillance applications, This study also showed the usefulness of probe capture for identifying specimens that 392 warrant the expense of deep metagenomic sequencing for more extensive characterization. The 393 genomic regions missed by the probe panel can provide as much insight into viral novelty as the 394 sequences that are recovered. In this study, failure to capture complete spike gene sequences, 395 even from libraries with otherwise extensive coverage, was successfully used to predict the 396 presence of novel spike genes. Furthermore, contiguity across recovered regions can be used to 397 evaluate abundance and intactness of viral genomic material, identifying specimens where deep 398 metagenomic sequencing is likeliest to succeed. This is valuable when targeting higher 399 taxonomic levels where methods for directly quantifying viral genome copies are hindered by the 400 same genomic variability that constrains amplicon sequencing. 401 This study also revealed two important limitations for probe capture in CoV discovery established track records of success. Screening by amplicon sequencing would enable direct 450 phylogenetic comparisons between specimens across consistent genomic loci and enable a 451 preliminary assessment of threat and novelty. This screening would also identify CoV-positive 452 specimens warranting further study, limiting the number of specimens to be transported to more 453 specialized laboratories with probe capture and deep sequencing capacity. 454 Probe capture on select CoV-positive specimens would be valuable for potentially 455 acquiring additional sequence information which could refine assessments of threat and novelty. Isolation and Characterization of a Novel Bat Coronavirus 665 Closely Related to the Direct Progenitor of Severe Acute Respiratory Syndrome Coronavirus Zoonotic origins of human 669 coronaviruses Phylogenetic tree of translated spike gene sequences from betacoronaviruses Only the 25 closest-matching spike 793 sequences from GenBank were included, as determined by blastp bitscores RefSeq accession numbers are provided in parentheses. The scale bar measures amino acid 795 substitutions per site