key: cord-0893840-0mriwzf8 authors: Varabyou, Ales; Pockrandt, Christopher; Salzberg, Steven L; Pertea, Mihaela title: Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie date: 2021-05-13 journal: Genetics DOI: 10.1093/genetics/iyab074 sha: 479652440f284c54c1874cb39391d54654cc1893 doc_id: 893840 cord_uid: 0mriwzf8 The ability to detect recombination in pathogen genomes is crucial to the accuracy of phylogenetic analysis and consequently to forecasting the spread of infectious diseases and to developing therapeutics and public health policies. However, in case of the SARS-CoV-2, the low divergence of near-identical genomes sequenced over a short period of time makes conventional analysis infeasible. Using a novel method, we identified 225 anomalous SARS-CoV-2 genomes of likely recombinant origins out of the first 87,695 genomes to be released, several of which have persisted in the population. Bolotie is specifically designed to perform a rapid search for inter-clade recombination events over extremely large datasets, facilitating analysis of novel isolates in seconds. In cases where raw sequencing data was available, we were able to rule out the possibility that these samples represented co-infections by analyzing the underlying sequence reads. The Bolotie software and other data from our study are available at https://github.com/salzberg-lab/bolotie. Since the beginning of 2020, the COVID-19 pandemic caused by a newly emerged strain of a 29 betacoronavirus, SARS-CoV-2, has been responsible for over 950,000 deaths and over 30 million 30 infections to date (Dong et al. 2020 ). The strain has been hypothesized to have emerged as a 31 result of a recombination event between strains of betacoronavirus endemic to certain species 32 of bats and pangolins (Zhang et al. 2020) , although its precise origin is not yet known. To date, 33 the genetic diversity of the SARS-CoV-2 has been increasing slowly compared to other RNA 34 viruses, with 5 to 7 major circulating clades being identified based on multiple variants common 35 to large numbers of isolates in the GISAID database (Shu and McCauley 2017; Hadfield et al. 2018) . 36 This relative stability of the genetic content of the circulating forms of the virus is promising for 37 the development of vaccines and therapeutics, as well as general understanding of the biology 38 and pathology of SARS-CoV-2. 39 However, as with other RNA viruses, coronaviruses are known to undergo mutations at high 40 rates (Drake and Holland 1999) . Inter-and intra-host recombinations are also well-studied and 41 occur frequently (Su et Multiple computational methods have been developed to detect recombination in microbial 50 genomes and have been used in studies of HIV-1 mutagenesis, bacterial evolution, and other 51 applications (Posada 2002) . Some popular methods, such as 3seq, analyze every possible triplet 52 of a set of genomes and statistically evaluate the triplets (Lam et al. 2018 ). Other methods, like 53 PhiPack, are designed to work for low-divergence genomes but still require a significant number 54 of variants (1-5%) to perform statistical analysis (Bruen et al. 2006 ). Another limitation of some of 55 3 these methods is that the algorithms are computationally intensive and require significant 56 resources and time to perform analysis, particularly for the amount of data (nearly 300,000 57 complete high-coverage genomes to date) that has been generated for SARS-CoV-2. There exist 58 over 4 quadrillion unique triplets of sequences for the currently available SARS-CoV-2 genomes, 59 and a similarity analysis for each triplet would be computationally infeasible. A more efficient 60 approach is necessary if we want to be able to detect recombinants in a realistic amount of 61 time. 62 In this work we present Bolotie, a new algorithm designed to conduct mutational analysis and 63 to detect recombinant forms and other anomalies among a very large set of viral sequences. 64 The methods presented are also designed such that novel sequences can be analyzed efficiently 65 without the need to rerun the entire protocol. 66 We applied Bolotie to search for recombination events in 87,695 complete genomes of SARS-67 CoV-2 currently available in the GISAID database (Shu and McCauley 2017) . In our analysis we 68 identified multiple unique cases of recombination between 4 prominent clades of the virus. Lastly, we propose a methodology for distinguishing true recombination events from cases of 73 mis-assembly of isolates from a host co-infected with several distinct lineages of a pathogen. 74 The proposed method can be applied by to verify future SARS-CoV-2 genomes prior to database 75 submission. In this work we present Bolotie, a collection of methods that enables rapid alignment, variant 78 calling, inter-clade recombination detection, and parent sequence search for large sets of 79 assembled viral genomes. Our method is robust even when the divergence of collected 80 genomes is very small, and it is designed for ultra-fast detection of anomalies. Because Bolotie 81 utilizes a probability index that can be used for a one-against-all analysis, alignments and 82 4 indices do not have to be recomputed when evaluating novel sequences as recombinants, 83 which greatly increases the efficiency of single-sequence analysis. 2018), and based on the close distances observed in our independent phylogenetic analyses. 94 Mappings between clade IDs in our analysis, those defined by GISAID, and those defined by 95 NextStrain are provided in Supplementary Table 1 . 96 To test genomes for the possibility that the putative recombinant isolates might represent a co-97 infection with two or more distinct SARS-CoV-2 isolates, we searched the SRA database of raw 98 read data using all GISAID and lab-assigned identifiers. After the pairwise alignment step, we construct for every genome in the input set a consensus 109 sequence by substituting reference genome alleles for the high-frequency variants called by the 110 aligner. Not only does this approach allow us to filter variants used to search for recombinants, 111 but also produces a set of genomes with a standardized set of coordinates which can be used as 112 a multiple sequence alignment (MSA) for phylogenetic analysis. Next, since Bolotie is designed to work for genomes with very few variants it is particularly 118 sensitive to ambiguous nucleotides. To avoid biases caused by uncalled bases, any such 119 instances were treated as an unknown base (N). Furthermore, to avoid bias in our predictions 120 for all sequences in the dataset we replace nucleotides at a position of low-frequency variants 121 with the reference allele making such positions equally probable for any clade (clade neutral). 122 We define a low-frequency SNV as one that has fewer than 100 genome sequences that differ 123 from the reference sequence at that position. 124 Because we did not use structural variants when constructing consensus sequences, the final 125 collection of filtered genomes represents a MSA. This not only allows us to use it directly for 126 phylogenetic analysis but also to compute distances among sequences efficiently. automatically choose the precise model using its ModelFinder package (Kalyaanamoorthy et al. 192 2017). The final tree was generated under the GTR model with 1000 rounds of bootstrapping. 193 The same approach was taken to build the phylogenetic tree that included all identified 194 anomalous sequences. We first aligned all genomes to the Wuhan-Hu-1 reference genome (GenBank accession 197 MN908947), from which we detected 84,322 single-nucleotide variants (SNVs) at 29,503 sites. 198 After removing all variants that appear in fewer than 100 sequences, we retained a set of 659 Table 2) . 217 Of the 225 identified recombinants, most of the recombinant signatures had 1 or 2 breakpoints 218 like the ones shown in Figure 3A , 3B and 3C. However, at least 6 genomes including the one 219 depicted in Figure 3D To evaluate this hypothesis, we obtained sets of raw sequence reads deposited at NCBI/SRA or 262 ENA for some of the recombinant genomes identified with Bolotie. Unfortunately, GISAID does 263 not require authors to submit raw data, and only a limited number of submitters have placed 264 their data in public archives with corresponding GISAID identifiers. Thus, we were only able to 265 recover a limited number of datasets for our analysis. 266 Reads for the EPI_ISL_439137 isolate ( Figure 3A) EPI_ISL_489588 also from Scotland, dated one week earlier, which contained the same variants. 273 Another isolate, EPI_ISL_510303 from Spain, had all but one variant (at position 28,143) 274 matching the recombinant signature of the isolates from Scotland. Given the rapid mutation 275 rate in RNA viruses, it is possible that an independent mutation occurred at that position, or 276 that the reference allele is an assembly artifact, however we could not find raw data The method and experiments presented in this work demonstrate that recombination has 294 occurred between the four existing major clades of SARS-CoV-2. While some of the inferred 295 events may be homoplasies or technical artifacts, our analysis shows that at least some of the 296 genomes likely represent true cases of recombination. 297 Of the 225 recombination events identified in our analysis, the majority were represented by 298 single isolates, suggesting that the event was not established in the population. However, 299 because two-thirds of the available genomes were sequenced between late March and early 300 May, it is possible that more data will reveal additional recombinant lineages. 301 The 225 inferred recombinant genomes comprise less than 1% of all sequences analyzed. It is 302 possible that many more anomalous genomes could be detected by lowering the variant 303 frequency threshold. For example, if we require 50 sequences to confirm a variant rather than 304 100, the number of informative sites increases more than two-fold from 411 to 996, possibly 305 allowing detection of events that are rarer, such as those which involve smaller emerging 306 lineages within the 4 clades. However, due to decreased specificity such an approach might 307 require stricter manual inspection as the false positive rate is expected to increase 308 substantially. 309 While overall all events detected by Bolotie passed manual verification, a possibility of mis-310 assembly in cases of co-infection by particles from different clades could also explain the 311 presence of multiple clade-specific variants within a single genome. Although we were unable 312 to obtain raw read data for all recombinants, our analysis of the EPI_ISL_439137 isolate ( Figure 313 3A, Table 1) Due to the differences in library preparation, sequencing technology and assembly protocols, 318 the need for raw data and independent validation is very high. We urge researchers to submit 319 raw sequencing data so that any future studies can verify their findings, not only when studying 320 recombination events, but also individual rare variants, transmission patterns, clade prevalence Table 1 ). However, even though 348 discrepancies in clade assignment may present a challenge to Bolotie, the results will still be of 349 use for the refinement of phylogenetic trees and ultimately clade assignments of the genomes. Our findings suggest that recombination in SARS-CoV-2 is much more common than previously 392 reported and that several recombinant lineages may have become established in the 393 population. 394 We hope that the software presented here along with provided pre-built indices will help to 395 detect future recombination events quickly and reliably, and aid in efforts to track the spread of The evolutionary genomics of pathogen recombination A simple and robust statistical test for detecting 425 the presence of recombination Identification 427 of the nucleotide substitutions in 62 SARS-CoV-2 sequences from Turkey An interactive web-based dashboard to track COVID-19 in 430 real time No evidence for 432 17 increased transmissibility from recurrent mutations in SARS-CoV-2 Mutation rates among RNA viruses The viterbi algorithm Nextstrain: real-time 437 tracking of pathogen evolution ModelFinder: fast model selection for accurate phylogenetic estimates A simple method for estimating evolutionary rates of base substitutions 442 through comparative studies of nucleotide sequences Improved algorithmic complexity for the 3SEQ 444 recombination detection algorithm Fast gapped-read alignment with Bowtie 2 The sequence alignment/map 448 format and SAMtools Minimap2: pairwise alignment for nucleotide sequences IQ-TREE 2: 452 New models and efficient methods for phylogenetic inference in the genomic era Evaluation of methods for detecting recombination from DNA sequences: 455 empirical data The SeqAn C++ 457 template library for efficient sequence analysis: A resource for programmers GISAID: Global initiative on sharing all influenza data-from vision 460 to reality Epidemiology, genetic recombination, and 462 pathogenesis of coronaviruses Introducing difference recurrence relations for faster semi-464 global alignment of long sequences Identification of SARS-CoV-2 466 recombinant genomes 2019 novel coronavirus is undergoing active recombination Probable pangolin origin of SARS-CoV-2 associated with the 469 COVID-19 outbreak 408 We acknowledge the work of all the authors, originating and submitting laboratories who