key: cord-0930922-1uto8vrp
authors: Jackson, B.; Boni, M. F.; Bull, M. J.; Colleran, A.; Colquhoun, R. M.; Darby, A.; Haldenby, S.; Hill, V.; Lucaci, A.; McCrone, J. T.; Nicholls, S.; O'Toole, A.; Pacchiarini, N.; Poplawski, R.; Scher, E.; Todd, F.; Webster, H.; Whitehead, M.; Wierzbicki, C.; The COVID-19 Genomics UK consortium; Loman, N. J.; Connor, T. R.; Robertson, D. L.; Pybus, O. L.; Rambaut, A.
title: Generation and transmission of inter-lineage recombinants in the SARS-CoV-2 pandemic
date: 2021-06-18
journal: nan
DOI: 10.1101/2021.06.18.21258689
sha: 70a739322e54a928810b32a32541aee70a452c50
doc_id: 930922
cord_uid: 1uto8vrp

We present evidence for multiple independent origins of recombinant SARS-CoV-2 viruses sampled from late 2020 and early 2021 in the United Kingdom. Their genomes carry single nucleotide polymorphisms and deletions that are characteristic of the B.1.1.7 variant of concern, but lack the full complement of lineage-defining mutations. Instead, the remainder of their genomes share contiguous genetic variation with non-B.1.1.7 viruses circulating in the same geographic area at the same time as the recombinants. In four instances there was evidence for onward transmission of a recombinant-origin virus, including one transmission cluster of 45 sequenced cases over the course of two months. The inferred genomic locations of recombination breakpoints suggest that every community-transmitted recombinant virus inherited its spike region from a B.1.1.7 parental virus, consistent with a transmission advantage for B.1.1.7's set of mutations.

Recombination, the transfer of genetic information between molecules derived from different organisms, is a fundamental process in evolution because it can generate novel genetic variation upon which selection can act (reviewed in Felsenstein 1974). Genetic analysis indicates that recombination occurs frequently in betacoronaviruses (Lai et al. 1985; Keck et al.
1988; Lai and Cavanagh 1997),

The molecular mechanism of homologous recombination in unsegmented positive-sense RNA viruses such as SARS-CoV-2 is generally by copy-choice replication, a model first suggested in poliovirus (Cooper et al. 1974). In this process a hybrid or mosaic RNA is formed when the RNA-polymerase complex switches from one RNA template molecule to another during replication (Worobey and Holmes 1999). In order for homologous recombination to occur, and be subsequently detected, there

medRxiv preprint doi: https://doi.org/10.1101/2021.06.18.21258689; this version posted June 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

We identified a total of 16 recombinant sequences from the whole UK dataset of 279,000 sequences up to the 7th March 2021, using our bioinformatic and evolutionary analysis pipeline (see Methods). Twelve genome sequences that clustered into four groups (labeled A-D) and four additional singletons showed evidence of being mosaic in structure (Table 1; Supplementary Table S1). For each group A-D, each of the constituent genomes was sampled from the same geographic locality within the UK (Table 1)

To rule out the possibility that any of the sixteen recombinants could have resulted from artefacts as a result of assembling sequence reads from a co-infected sample (generated through either natural co-infection or laboratory contamination), we examined the read coverage and minor allele frequencies and assessed the likelihood of a mixed sample. Several lines of evidence suggested the recombinant sequences were not the products of sequencing a mixture of genomes. Firstly, the sequencing protocol used in the UK (Tyson et al. 2020) generates 98 short (~350bp) amplicons, such that long tracts that match just one lineage would be unlikely. Secondly, the read data do not support a mixture for any of the putative recombinant genomes. All the recombinants were sequenced to high coverage (lowest mean read depth per site per genome: 686; highest mean read depth: 2903). The mean minor allele frequency (MAF) for the putative recombinants was 0.008, which is 6 standard deviations below the mean of the MAF (0.34) for a set of 20 sequences that we suspected to be mixtures (Supplementary Figure S1). Finally, for all groups A-D, multiple genomes with the same mosaic structure were sequenced independently from different samples, and by different sequencing centres in the case of Group A, implying that the original assembly was correct and, additionally, that transmission of the recombinant had occurred. All of the read data are available on the European Nucleotide Archive. Accession numbers are given in Supplementary Table S2.

genetic similarity for each of the two non-recombining genome regions were the same sequences for every putative recombinant within a group. For most of the recombinants, there were several equidistant putative parental sequences for each region of the genome; whenever this was true, they all belonged to the same lineage, except for Group C, whose putative parental lineages for the non-B.1.1.7-like region of the genome were a mixture of two closely related lineages (B.1.211.1 and B.1.211.2). The putative parental sequence for the non-B.1.1.7 region of the genome varied by group (Table 1). Importantly, in each case, the sequence and epidemiological data demonstrate that the non-B.1.1.7 parental sequence was circulating in the same geographic area as the recombinant in the time immediately before the sampling date of the recombinant.
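The search for the closest sequences by genetic similarity within each genome region can be illustrated with a minimal sketch. It assumes sequences that are already aligned to a common coordinate system, counts simple mismatches within a 1-based inclusive region, and skips sites where either sequence has an N. The function names, the toy sequences, and the choice to skip N sites are illustrative assumptions, not the pipeline used in the study.

```python
def region_distance(a, b, start, end):
    """Count mismatches between two aligned sequences over the
    1-based inclusive region [start, end], skipping sites with N."""
    mismatches = 0
    for x, y in zip(a[start - 1:end], b[start - 1:end]):
        if x == "N" or y == "N":
            continue  # assumption: ambiguous sites carry no information
        if x != y:
            mismatches += 1
    return mismatches


def closest_neighbours(query, candidates, start, end):
    """Return (sorted names of all equidistant closest candidates, distance)
    for one genome region. `candidates` maps name -> aligned sequence."""
    dists = {name: region_distance(query, seq, start, end)
             for name, seq in candidates.items()}
    best = min(dists.values())
    return sorted(n for n, d in dists.items() if d == best), best


# Toy example: the query matches p1/p3 in region 1-5 and p2 in region 6-10.
query = "ACGTACGTAC"
candidates = {"p1": "ACGTACTTTT", "p2": "TTGTACGTAC", "p3": "ACGTACGTAA"}
left = closest_neighbours(query, candidates, 1, 5)
right = closest_neighbours(query, candidates, 6, 10)
```

Returning every equidistant candidate, rather than a single best hit, mirrors the observation above that several equidistant putative parental sequences were often found for a region.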
For Group A and the four singletons, the second parental sequence was assigned lineage B.1.177 or one of its descendants. B.1.177, which likely arose in Spain in the summer of 2020 and was exported to multiple European countries (Hodcroft et al. 2020), rose to high relative frequency in the UK through Autumn 2020, and was widespread by December (Figure 1)

Table 2). The lineages identified as the putative parentals assigned by 3SEQ agreed with the lineages for putative parentals assigned by genetic similarity (Tables 1 and 2) even though of the 16 closest neighbours by genetic similarity described above, none were present in the background sequence set of candidate parentals used in the 3SEQ analysis. The breakpoints reported by 3SEQ also agreed with breakpoints inferred from the distribution of Single Nucleotide Polymorphisms (SNPs) and deletions in the putative recombinants and their neighbours by genetic similarity (Tables 1 and 2). The two sequences that belong to Group B did not show a statistically significant mosaic signal of non-reticulate evolution, but 3SEQ's Δm,n,2 statistic for these two candidate recombinants showed the greatest support for mosaicism possible among the ancestry-informative polymorphic sites with their closest neighbours by genetic similarity as parentals: n = 6, m = 42, k = 42. The associated uncorrected p-value of 5.7e-7 does not survive a multiple comparisons correction due to the number of putative parental lineages and descendants that were tested (Table 2)

independent since many candidate parental sequences are a small number of nucleotide differences apart from each other.
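To see why an uncorrected p-value of 5.7e-7 does not survive correction, the following sketch applies two standard family-wise corrections. The figure of 64 million comparisons is the one quoted in the Methods; the use of Bonferroni and Dunn-Šidák here is an assumption for illustration, and the text does not state which correction was applied. Both corrections assume nothing better than independence, which is conservative when tests are correlated.

```python
import math


def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p * n_tests)


def sidak(p, n_tests):
    """Dunn-Sidak corrected p-value, 1 - (1 - p)^n, computed with
    log1p/expm1 to stay accurate for very small p and large n."""
    return -math.expm1(n_tests * math.log1p(-p))


p_uncorrected = 5.7e-7       # from the text
n_comparisons = 64_000_000   # from the Methods
p_bonf = bonferroni(p_uncorrected, n_comparisons)
p_sidak = sidak(p_uncorrected, n_comparisons)
```

With these inputs the expected number of false positives under the null is about 36, so both corrected values are effectively 1.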
When corrected p-values are borderline, the recommended approach to infer non-reticulate evolution is to build separate phylogenetic trees for the non-recombining regions of the genome to confirm that the recombinant in question has different phylogenetic placements in different genomic regions (Boni et al. 2010). With the exception of the inner region for CAMC-CB7AB3, whose placement within B.1.177 was not well supported, each recombinant's two phylogenetic placements were with the lineages that we identified as parental by genetic similarity and by using 3SEQ, with high bootstrap support (Supplementary Table S3). The placement of the two parental genome regions for each recombinant in the context of the whole epidemic in the UK is shown in

genetic diversity at the time these analyses were carried out, there was no strong statistical support for recombination (as opposed to non-reticulate diversification) for any particular candidate recombinant. When the number of mutations in a virus sequence is low (e.g. Figure 3 in VanInsberghe et al. (2021);

looped-out region of the template RNA, which contains at least orf1ab in the case of SARS-CoV-2 (Finkel et al. 2021). This provides an environment that is highly conducive to homologous recombination: a polymerase that engages in template switching during its normal transcriptional
coronaviruses, this can account for the shared pattern of recombination-prone regions observed here (Figure 4). However, to be detected, recombinant genomes must lead to viable viruses, so the distribution of breakpoints observed from genomic surveillance may not represent the distribution of breakpoints that occur in situ (Banner and Lai 1991).

The recombination event that generated each must have occurred before this date.

exhibiting the same mosaic genome structures (see Table 1

Identification of putative recombinants

A national SARS-CoV-2 sequencing effort in the UK, the COG-UK consortium (COVID-19 Genomics UK (COG-UK) consortium 2020), has undertaken systematic genomic surveillance of SARS-CoV-2 in the country and generated over 440,000 genomes to date. As
To identify candidate parental genome sequences in a computationally-tractable manner we created a set of all UK SARS-CoV-2 sequences that (i) contained no N nucleotide ambiguity codes after masking the 3' and 5' UTRs, (ii) spanned the dates 01/12/2020 to 28/02/2021, which represents two weeks before the date of the earliest putative recombinant, to one week after the date of the latest, and

between the putative recombinant and the closely related reference sequences were visualised using snipit (https://github.com/aineniamh/snipit). The genomic coordinates of the boundaries between each mosaic genome region were then refined by taking into account observed lineage-defining nucleotide and deletion variation. Specifically, we set the boundary coordinates to the ends of sequential tracts of mutations specific to the putative parental sequences. This is a conservative approach to assigning parental lineages and consequently no parental lineage is assigned to those genome regions that do not contain unambiguous lineage-defining mutations or deletions. Lastly, using these refined region boundaries, we reiterated the genetic distance calculation above to identify a final set of most genetically similar sequences for each putative recombinant.

Almost all sequencing sites in the COG-UK consortium use the ARTIC PCR protocol to produce tiled PCR amplicons, which are then sequenced (Tyson et al. 2020). The generated sequence reads are then processed using sequence mapping, rather than sequence assembly, to produce a consensus genome for each sample.
This approach, which was designed to support epidemiological investigations, creates a single consensus sequence for each sample. Beyond representing sites with high minor allele frequencies using the appropriate IUPAC nucleotide alphabet ambiguity code, this consensus does not reflect the natural genetic variation of SARS-CoV-2 genomes observed within an infected individual (Lythgoe et al. 2021). Mapping is particularly suited to tiled amplicons generated from samples that contain limited genomic diversity. Further, mapping is typically less prone to introducing errors/artefacts than sequence assembly and enables effective primer sequence removal and identification of non-reference mutations. Genomic sites that exhibit intra-sample nucleotide variation could be consistent with a range of processes, including co-infection, within-patient diversity, contamination, or PCR error. The identification of such sites forms part of the consensus-generating pipeline, and we exploit that information here in order to rule out the possibility that our mosaic consensus sequence represents a mixture of virus genomes, rather than representing true recombinant

quality, to extract allele calls from the read data using its mpileup subroutine, and to calculate mean read depth per genome using its depth subroutine.
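As an illustration of the mean read depth calculation, the sketch below parses the three-column, tab-separated text that `samtools depth` writes (reference name, 1-based position, depth) and averages the depth column. The function name and the optional `genome_length` argument are assumptions made for this example. By default `samtools depth` can omit positions with zero coverage, so dividing by the number of reported lines may overstate the per-site mean unless all positions are reported or a genome length is supplied.

```python
def mean_depth(depth_lines, genome_length=None):
    """Mean read depth from `samtools depth`-style lines
    ("ref<TAB>pos<TAB>depth"). If genome_length is given, divide by it
    so that unreported (zero-depth) positions count as zero."""
    total = 0
    reported = 0
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        _ref, _pos, depth = line.split("\t")[:3]
        total += int(depth)
        reported += 1
    denominator = genome_length if genome_length is not None else reported
    if denominator == 0:
        raise ValueError("no positions to average over")
    return total / denominator


# Toy input: position 3 is missing, as it would be for a zero-depth site.
example = ["MN908947.3\t1\t10", "MN908947.3\t2\t20", "MN908947.3\t4\t30"]
depth_reported_sites = mean_depth(example)
depth_all_sites = mean_depth(example, genome_length=4)
```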
were not found to be significantly the mosaic product of any of the sequences in the representative background as children, and their closest neighbours by genetic similarity as parentals. P-values for this test were reported without correction and after correction for multiple testing assuming that this test was in addition to the 64 million comparisons that we had already performed.

For each of the eight sets of recombinants (Groups A-D and the four singletons) we carried out the following procedure to test for incongruence between the phylogenetic placements of the two regions of their genomes. We independently added each set's genome(s) to the representative background of 2000 sequences, along with the reference sequence, to create eight alignments in total. We masked the resulting alignments according to the breakpoints defined by the closest neighbours by genetic similarity, so that for each set, we produced two sub-alignments: one consisting of the region that was inherited from the B.1.1.7 parental in the recombinant(s), and one consisting of the region that was inherited from the other parental. This resulted in 16 alignments in total. We reconstructed the phylogenetic relationships for each with IQTREE v2.1 (Minh et al. 2020), using the HKY model of

we also built a phylogenetic tree of the representative background's complete genomes, to which we added the masked recombinant genomes, so that each recombinant was present in the alignment twice, once with the B.1.1.7 region of its genome unmasked, and once with the opposing region unmasked. We ran IQTREE as above.
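The masking step that produces two sub-alignments per recombinant set can be sketched as follows. This is a simplified illustration that assumes a single contiguous region bounded by two 1-based inclusive coordinates, with everything outside it forming the complementary region. The function names and the use of N as the mask character are assumptions, and the coordinates in the example are toy values rather than breakpoints from the study.

```python
def mask_outside(seq, start, end, mask_char="N"):
    """Keep the 1-based inclusive region [start, end]; mask the rest."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("region must lie within the sequence")
    return (mask_char * (start - 1)
            + seq[start - 1:end]
            + mask_char * (len(seq) - end))


def mask_inside(seq, start, end, mask_char="N"):
    """Mask the 1-based inclusive region [start, end]; keep the rest."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("region must lie within the sequence")
    return seq[:start - 1] + mask_char * (end - start + 1) + seq[end:]


def split_by_breakpoints(alignment, start, end):
    """Turn one alignment (name -> sequence) into two sub-alignments:
    one with only [start, end] unmasked, one with only its complement."""
    inner = {n: mask_outside(s, start, end) for n, s in alignment.items()}
    outer = {n: mask_inside(s, start, end) for n, s in alignment.items()}
    return inner, outer


inner, outer = split_by_breakpoints({"recombinant": "ACGTACGTAC"}, 4, 6)
```

Masking rather than trimming keeps both sub-alignments in the same coordinate system, which makes it straightforward to place both copies of a recombinant into one whole-genome alignment, as described for the final tree.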
To test for onward community transmission of the putative recombinants, we searched the whole UK dataset as of the 5th May 2021 for additional sequences whose genetic variation matched the variation of the recombinants. For each of the eight sets of recombinants, we defined a set of SNPs and deletions by which all the recombinants within that set differed from the reference sequence (MN908947.3). Then we used type_variants to scan the UK dataset for genomes whose SNP and deletion variation was compatible with being a descendant or sibling of the putative recombinants. Group A represented the only recombination event with evidence for further transmission according to the results of this procedure. We carried out the following additional analyses to further investigate transmission of Group A genomes. Firstly, we visualised the nucleotide variation of the additional matching genomes using snipit and extracted their sampling locations and dates. Secondly, to explore the phylogenetic context of Group A and its derivatives, we reconstructed their (whole-genome) phylogenetic relationships using IQTREE. We also extracted the 100 closest sequences by genetic similarity for each alternate region of the genome (B.1.1.7-like and non-B.1.1.7-like) for each of the four original members of Group A to provide phylogenetic context to the parental sequences. This resulted in a dataset of 216 sequences in total when the two groups of neighbours were combined, and duplicates removed. We reconstructed their (whole-genome) phylogenetic relationships with IQTREE, as above.
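The idea of scanning for genomes whose variation is compatible with a recombinant's defining SNPs and deletions can be sketched in a few lines. This is not the type_variants tool or its interface; it is a hypothetical stand-in that assumes each genome is aligned to reference coordinates, with deleted bases written as "-". The function name, the tuple formats, and the option to treat N as compatible are all assumptions for illustration.

```python
def is_compatible(genome, snps, deletions=(), allow_missing=True):
    """Return True if an aligned genome carries every defining variant.

    snps:      iterable of (position, alt_base), positions 1-based
    deletions: iterable of (start, length), start 1-based
    allow_missing: treat N at a SNP site as compatible (no evidence
                   against membership) rather than as a mismatch
    """
    for pos, alt in snps:
        base = genome[pos - 1]
        if base == alt:
            continue
        if allow_missing and base == "N":
            continue
        return False
    for start, length in deletions:
        if genome[start - 1:start - 1 + length] != "-" * length:
            return False
    return True


# Toy defining set: two SNPs and one 2-base deletion.
defining_snps = [(2, "T"), (8, "A")]
defining_dels = [(4, 2)]
genomes = {
    "match": "ATG--CGAAC",
    "no_deletion": "ATGTACGAAC",
    "missing_site": "ANG--CGAAC",
}
hits = sorted(name for name, seq in genomes.items()
              if is_compatible(seq, defining_snps, defining_dels))
```

A genome that passes may carry additional mutations of its own, which is what would be expected of a descendant or sibling of the original recombinant.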
We labelled the phylogenetic tree of recombinants and the phylogenetic tree of parental sequences with the sampling date in number of epidemiological weeks (epiweeks) since the first epiweek of 2020 to assess the temporal context of the recombination event and subsequent transmission. We carried out a second follow-up on the 1st June using the same procedure as above.
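Converting a sampling date to a count of epiweeks since the first epiweek of 2020 can be sketched as below. The text does not say which epiweek convention was used, so this example assumes the CDC MMWR definition: weeks run Sunday to Saturday, and week 1 of a year is the first week with at least four days in that year (equivalently, the week containing 4 January). The function names are illustrative.

```python
from datetime import date, timedelta


def epiweek1_start(year):
    """Sunday that begins MMWR week 1 of `year` (the Sunday-to-Saturday
    week containing 4 January)."""
    jan4 = date(year, 1, 4)
    days_since_sunday = (jan4.weekday() + 1) % 7  # weekday(): Monday = 0
    return jan4 - timedelta(days=days_since_sunday)


def epiweeks_since_2020(sample_date):
    """Epiweek number counted continuously from the first epiweek of
    2020, so that week 1 is the week starting 2019-12-29."""
    return (sample_date - epiweek1_start(2020)).days // 7 + 1


week_of_cutoff = epiweeks_since_2020(date(2021, 3, 7))
```

Counting continuously from 2020 avoids the reset to week 1 at the start of 2021, which keeps labels ordered across the winter in which the recombinants were sampled.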
References

Random Nature of Coronavirus RNA Recombination in the Absence of Selection Pressure
Guidelines for Identifying Homologous Recombination Events in Influenza A Virus
Evolutionary Origins of the SARS-CoV-2 Sarbecovirus Lineage Responsible for the COVID-19 Pandemic
An Exact Nonparametric Method for Inferring Mosaic Structure in Sequence Triplets
On the Nature of Poliovirus Genetic Recombinants
Rooting the Phylogenetic Tree of Middle East Respiratory Syndrome Coronavirus by Characterization of a Conspecific Virus from an African Bat
An Integrated National Scale SARS-CoV-2 Genomic Surveillance Network
MERS-CoV Recombination: Implications about the Reservoir and Potential for Adaptation
The Evolutionary Advantage of Recombination
The Coding Capacity of SARS-CoV-2
UFBoot2: Improving the Ultrafast Bootstrap Approximation
Emergence and Spread of a SARS-CoV-2 Variant through Europe in the Summer of 2020
Evidence of the Recombinant Origin of a Bat Severe Acute Respiratory Syndrome (SARS)-Like Coronavirus and Its Implications on the Direct Ancestor of SARS Coronavirus
In Vivo RNA-RNA Recombination of Coronavirus in Mouse Brain
The Architecture of SARS-CoV-2 Transcriptome
The Recent Ancestry of Middle East Respiratory Syndrome Coronavirus in Korea Has Been Shaped by Recombination
Recombination between Nonsegmented RNA Genomes of Murine Coronaviruses
The Molecular Biology of Coronaviruses
Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm
Minimap2: Pairwise Alignment for Nucleotide Sequences
The Sequence Alignment/Map Format and SAMtools
SARS-CoV-2 within-Host Diversity and Transmission
The Molecular Biology of Coronaviruses
Ultrafast Approximation for Phylogenetic Bootstrap
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era
Sensitivity of Infectious SARS-CoV-2 B.1.1.7 and B.1.351 Variants to Neutralizing Antibodies
Pango Lineage Nomenclature: Provisional Rules for Naming Recombinant Lineages
Preliminary Genomic Characterisation of an Emergent SARS-CoV-2 Lineage in the UK Defined by a Novel Set of Spike Mutations
A Dynamic Nomenclature Proposal for SARS-CoV-2 Lineages to Assist Genomic Epidemiology
Coronaviruses Use Discontinuous Extension for Synthesis of Subgenome-Length Negative Strands
Coronavirus (COVID-19) Infection Survey: England
Improvements to the ARTIC Multiplex PCR Method for SARS-CoV-2. bioRxiv: The Preprint Server for Biology
Recombinant SARS-CoV-2 Genomes Are Currently Circulating at Low Levels
Rapid Detection of Inter-Clade Recombination in SARS-CoV-2 with Bolotie
Assessing Transmissibility of SARS-CoV-2 Lineage B.1.1.7 in England
Evolutionary Aspects of Recombination in RNA Viruses