key: cord-0995029-chyib6l4 authors: Jackson, Ben; Boni, Maciej F.; Bull, Matthew J.; Colleran, Amy; Colquhoun, Rachel M.; Darby, Alistair C.; Haldenby, Sam; Hill, Verity; Lucaci, Anita; McCrone, John T.; Nicholls, Samuel M.; O’Toole, Áine; Pacchiarini, Nicole; Poplawski, Radoslaw; Scher, Emily; Todd, Flora; Webster, Hermione J.; Whitehead, Mark; Wierzbicki, Claudia; Loman, Nicholas J.; Connor, Thomas R.; Robertson, David L.; Pybus, Oliver G.; Rambaut, Andrew title: Generation and transmission of inter-lineage recombinants in the SARS-CoV-2 pandemic date: 2021-08-17 journal: Cell DOI: 10.1016/j.cell.2021.08.014 sha: 72037aa4c3d50fe8d19e1b1f0181827d60a732ff doc_id: 995029 cord_uid: chyib6l4 We present evidence for multiple independent origins of recombinant SARS-CoV-2 viruses sampled from late 2020 and early 2021 in the United Kingdom. Their genomes carry single nucleotide polymorphisms and deletions that are characteristic of the B.1.1.7 variant of concern, but lack the full complement of lineage-defining mutations. Instead, the remainder of their genomes share contiguous genetic variation with non-B.1.1.7 viruses circulating in the same geographic area at the same time as the recombinants. In four instances there was evidence for onward transmission of a recombinant-origin virus, including one transmission cluster of 45 sequenced cases over the course of two months. The inferred genomic locations of recombination breakpoints suggest that every community-transmitted recombinant virus inherited its spike region from a B.1.1.7 parental virus, consistent with a transmission advantage for B.1.1.7’s set of mutations. We present evidence for multiple independent origins of recombinant SARS-CoV-2 viruses sampled 19 from late 2020 and early 2021 in the United Kingdom. Their genomes carry single nucleotide 20 polymorphisms and deletions that are characteristic of the B.1.1.7 variant of concern, but lack the full 21 complement of lineage-defining mutations. Instead, the remainder of their genomes share contiguous 22 genetic variation with non-B.1.1.7 viruses circulating in the same geographic area at the same time as 23 the recombinants. In four instances there was evidence for onward transmission of a recombinant-24 origin virus, including one transmission cluster of 45 sequenced cases over the course of two months. 25 The inferred genomic locations of recombination breakpoints suggest that every community-26 transmitted recombinant virus inherited its spike region from a B.1.1.7 parental virus, consistent with 27 a transmission advantage for B.1.1.7's set of mutations. 28 Introduction deleterious mutations or provide the opportunity to create novel phenotypes by bringing genetic 48 variation from different backgrounds onto a single genome. A concerning scenario from an 49 epidemiological perspective is the potential for recombination to combine, in the same genome, 50 mutations that may confer immune-escape properties with those that may enhance transmissibility. 51 Enhanced transmissibility (Volz et al., 2021) and immune-escape (Planas et al., 2021) phenotypes 52 have already been observed in SARS-CoV-2. Consequently, the characterization of recombination in 53 SARS-CoV-2 is important for surveillance purposes. 54 The molecular mechanism of homologous recombination in unsegmented positive-sense RNA viruses 56 such as SARS-CoV-2 is generally by copy-choice replication, a model first suggested in poliovirus 57 (Cooper et al., 1974) . In this process a hybrid or mosaic RNA is formed when the RNA-polymerase 58 complex switches from one RNA template molecule to another during replication (Worobey and 59 Holmes, 1999) . In order for homologous recombination to occur, and be subsequently detected, there 60 must be co-infection of the same cell within an individual by genetically-distinct viruses (termed the 61 'parental' lineages of the recombinant virus). Co-infection of an individual requires there to be co-62 circulation of multiple viral lineages within a population and, given the short duration of most SARS-63 CoV-2 infections, is most likely to be observed when virus prevalence is high. 64 recombinants (Groups A-D and the four singletons) for the four weeks immediately preceding the 156 date of the earliest sampled genome in each case are shown in Supplementary Figure S3 . 157 We rejected the null hypothesis of non-reticulate evolution for 14 out of the 16 putative recombinant 160 sequences by testing these sequences for mosaicism (3SEQ (Lam, Ratmann and Boni, 2018) with 161 Dunn-Sidak correction for multiple comparisons) against a background set of 2000 sequences 162 randomly drawn from the course of the UK epidemic ( Table 2 ). The lineages identified as the putative 163 parentals assigned by 3SEQ agreed with the lineages for putative parentals assigned by genetic 164 similarity (Tables 1 and 2) even though of the 16 closest neighbours by genetic similarity described 165 above, none were present in the background sequence set of candidate parentals used in the 3SEQ 166 analysis. The breakpoints reported by 3SEQ also agreed with breakpoints inferred from the 167 distribution of Single Nucleotide Polymorphisms (SNPs) and deletions in the putative recombinants 168 and their neighbours by genetic similarity (Tables 1 and 2 ). The two sequences that belong to Group 169 B did not show a statistically significant mosaic signal of non-reticulate evolution, but 3SEQ's Δ m,n,2 170 statistic for these two candidate recombinants showed the greatest support for mosaicism possible 171 among the ancestry-informative polymorphic sites with their closest neighbours by genetic similarity 172 as parentals: n = 6, m = 42, k = 42. The associated uncorrected p-value of 5.7e-7 does not survive a 173 multiple comparisons correction due to the number of putative parental lineages and descendants that 174 were tested (Table 2 ) 175 whose placement within B.1.177 was not well supported, each recombinant's two phylogenetic 184 placements were with the lineages that we identified as parental by genetic similarity and by using 185 3SEQ, with high bootstrap support (Supplementary Table S3 ). The placement of the two parental 186 genome regions for each recombinant in the context of the whole epidemic in the UK is shown in 187 Here we report the first unambiguous detection and characterisation of the arisal and subsequent 215 community-transmission of recombinant SARS-CoV-2 viruses. Comparison of intra-genomic 216 variation, supported by geographic and epidemiological data, demonstrates the occurrence of multiple 217 independent recombination events involving UK virus lineages in late 2020. Recombinant genomes 218 that share genetic identity were sampled from the same geographic location and time period, 219 indicating they represent successful onward transmission after the occurrence of a single ancestral 220 recombination event. In one instance this resulted in a significant transmission cluster comprising 45 221 observed cases, which has been given the Pango lineage name XA. While no obvious biological 222 advantage can be attributed to this cluster (or to any of the observed recombinants) beyond the 223 acquisition of B.1.1.7's set of spike mutations, these recombinants are sentinel events for continued 224 monitoring for new variants. With the increasing co-circulation of variants of concern in the same 225 geographic areas careful monitoring is warranted. 226 227 Large-scale bioinformatic approaches have identified statistical signals of recombination among 228 SARS-CoV-2 sequences, using clade assignment and its changes along the genome as the primary 229 characteristic under investigation (Varabyou et al., 2020; VanInsberghe et al., 2021) . Due to the 230 limited genetic diversity at the time these analyses were carried out, there was no strong statistical 231 support for recombination (as opposed to non-reticulate diversification) for any particular candidate 232 recombinant. When the number of mutations in a virus sequence is low (e.g. Figure The discontinuous nature of these sgRNAs is understood to be the product of template switching by 256 viral polymerase during normal transcription, where the polymerase pauses at a transcription-257 regulatory sequence (TRS) after transcribing the last open reading frame (ORF) of the sgRNA, and 258 switches to a similar TRS upstream of the leader sequence (Sawicki and Sawicki, 1995), omitting a 259 looped-out region of the template RNA, which contains at least orf1ab in the case of SARS-CoV-2 260 (Finkel et al., 2021) . This provides an environment that is highly conducive to homologous 261 recombination: a polymerase that engages in template switching during its normal transcriptional 262 activity, as well as the availability of alternative template RNA molecules, in the form of sgRNAs, 263 which incorporate sequence motifs that mediate template switching, and which might be derived from 264 different genomes in the case of coinfection. As TRSes, which occur between the ORFs (Kim et al., coronaviruses, this can account for the shared pattern of recombination-prone regions observed here 267 ( Figure 4 ). However, to be detected recombinant genomes must lead to viable viruses, so the 268 distribution of breakpoints observed from genomic surveillance may not represent the distribution of 269 breakpoints that occur in situ (Banner and Lai, 1991) . In the Results we discuss the possibility that the mosaic patterns of genetic variation observed in the 296 putative recombinant genome sequences might be the product of some process other than 297 recombination, for example the sequencing of a natural co-infection or of a mixed sample due to 298 laboratory contamination. However, neither the distribution of the raw read allele frequencies for the 299 putative recombinants (which are unlike those we observe in samples that we suspect to be real 300 mixtures), nor the spatial distribution of genetic variation along their genomes (consisting of long 301 contiguous tracts compatible with a single lineage; which is not the pattern expected for a mixture of 302 samples given the sequencing protocol employed in the UK) suggests that this is the case. In the case 303 of the groups A-D, the sequencing of multiple distinct samples with nearly identical genetic variation 304 provides strong a posteriori evidence against any of these genomes being the product of sequencing a 305 mixed sample. This strength of evidence does not exist for the four singleton genomes, so they should 306 be viewed as less certainly recombinant. Table 1 (Related to Figure 1) . 567 The distribution of the most frequent SARS-CoV-2 lineages in the NUTS1 location of each set 568 (Rambaut et al., 2020) . From the sequence alignment, we extracted all sequences that had been 606 assigned to lineage B.1.1.7, up to the 2021/03/07. We genotyped these sequences at the set of 22 sites 607 that discriminate B.1.1.7 from its parental lineage (B.1.1) using a custom script in Python 608 (https://github.com/cov-ert/type_variants), then discarded sequences with missing data at any of the 609 22 sites. We visualised the resulting table of genotype calls in order to identify sequences that showed 610 evidence of a potential mosaic genome structure (i.e. runs of contiguous sites that were not compatible 611 with the B.1.1.7 lineage designation). 612 613 614 To identify candidate parental genome sequences in a computationally-tractable manner we created a 616 set of all UK SARS-CoV-2 sequences that (i) contained no N nucleotide ambiguity codes after 617 masking the 3' and 5' UTRs, (ii) spanned the dates 2020/12/01 to 2021/02/28, which represents two 618 weeks before the date of the earliest putative recombinant, to one week after the date of the latest, and 619 (iii) excluded the putative recombinant genomes identified above. This set consisted of 98859 620 sequences in total. For each putative recombinant, we split its genome sequence into B.1.1.7-like 621 regions and non-B.1.1.7 regions at the junction of genetic regions according to the mosaic structure 622 detected by the custom Python script described above (https://github.com/cov-ert/type_variants; 623 Supplementary Table S1 ). Then for each component region of each mosaic genome, we first masked 624 the remainder of the genome with Ns (in both the focal mosaic sequence and all background 625 sequences) then found the most-genetically similar non-focal sequences by computing pairwise 626 genetic distances (number of nucleotide differences per site) using gofasta (https://github.com/cov-627 ert/gofasta). Subsequently, an alignment was compiled for each putative recombinant, which 628 contained the putative recombinant as well as the most-genetically similar background sequences (as 629 identified above) for each component region of that mosaic genome. The single nucleotide differences 630 between the putative recombinant and the closely related reference sequences were visualised using 631 snipit (https://github.com/aineniamh/snipit). The genomic coordinates of the boundaries between each mosaic genome region were then refined by taking into account observed lineage-defining nucleotide 633 and deletion variation. Specifically, we set the boundary coordinates to the ends of sequential tracts of 634 mutations specific to the putative parental sequences. This is a conservative approach to assigning 635 parental lineages and consequently no parental lineage is assigned to those genome regions that do not 636 contain unambiguous lineage-defining mutations or deletions. Lastly, using these refined region 637 boundaries, we reiterated the genetic distance calculation above to identify a final set of most-638 genetically similar sequences for each putative recombinant. To generate a limited set of genomes that are suitable for computationally-expensive analysis yet are 647 also representative of the genetic diversity of the SARS-CoV-2 epidemic in the UK, we randomly 648 sampled 2000 sequences from 21st March 2020, when sequence data first became available, to 1st 649 March 2021, weighting the probability of choosing a sequence accounting for the sequencing 650 coverage and covid19 prevalence in individual geographic regions of the UK over time, using the 651 same method as in Volz et al. (2021) , which is available at 652 (https://github.com/robj411/sequencing_coverage). We use this dataset to investigate the phylogenetic 653 placement of the alternate regions of recombinant genomes, and as a dataset of putative parental 654 sequences to statistically test for recombination using 3SEQ. 655 656 then processed using sequence mapping, rather than sequence assembly, to produce a consensus 661 genome for each sample. This approach, which was designed to support epidemiological 662 investigations, creates a single consensus sequence for each sample. Beyond representing sites with 663 high minor allele frequencies using the appropriate IUPAC nucleotide alphabet ambiguity code, this 664 consensus does not reflect the natural genetic variation of SARS-CoV-2 genomes observed within an 665 infected individual (Lythgoe et al., 2021) . Mapping is particularly suited to tiled amplicons generated 666 from samples that contain limited genomic diversity. Further, mapping is typically less prone to 667 introducing errors/artefacts than sequence assembly and enables effective primer sequence removal 668 and identification of non-reference mutations. Genomic sites that exhibit intra-sample nucleotide 669 variation could be consistent with a range of processes, including co-infection, within-patient 670 diversity, contamination, or PCR error. The identification of such sites forms part of the consensus-671 generating pipeline, and we exploit that information here in order to rule out the possibility that our 672 mosaic consensus sequence represents a mixture of virus genomes, rather than representing true 673 recombinant genomes. 674 For each putative recombinant sequence we analysed the original read data from virus genome 676 sequencing in order to rule out the possibility that the generated consensus sequence represents a 677 mixture of virus genomes (due to laboratory contamination or coinfection, for example), rather than 678 representing a true recombinant genome. To do this we calculated minor allele frequencies (MAFs) 679 from the read data and compared their distribution between the 16 recombinant genomes and 20 680 samples that we suspected of being the product of sequencing a mixture of genomes, potentially 681 because of coinfection or laboratory contamination. To define sequences that we suspected of being 682 mixtures, we scanned the dataset for consensus sequences that possessed an IUPAC ambiguity code at 683 the 27 genomic positions that differ from the SARS-CoV-2 reference genome (Genbank accession: 684 changes that were inherited from the ancestor of B.1.1.7). We define the MAF at a single site as the 686 number of sequencing reads not containing the most frequently observed single nucleotide allele that 687 mapped to that site, divided by the total number of sequencing reads that include any nucleotide allele calculate MAF as follows. For each recombinant, we considered every site that differed from 690 MN908947.3 by a nucleotide in its own consensus genome, or in the consensus genome of either of 691 its parentals by genetic similarity. For the sequences that we suspected of being mixtures we 692 considered the 27 genomic positions where sequences belonging to B.1.1.7 differ from MN908947.3 693 by a nucleotide change. We used samtools (Li et al., 2009) , with default filters for mapping and base 694 quality, to extract allele calls from the read data using its mpileup subroutine, and to calculate mean 695 read depth per genome using its depth subroutine. We used 3SEQ (Lam, Ratmann and Boni, 2018) as a statistical test for recombination in the UK 702 SARS-CoV-2 data. 3SEQ interrogates triplets of sequences for a signal of mosaicism in one "child" 703 sequence, given the genotypes of the other two "parental" sequences, using an exact non-parametric 704 test for clustering in a sequence of binary outcomes (Boni, Posada and Feldman, 2007) . The test 705 statistic Δ m,n,2 used in 3SEQ simply tests if a putative recombinant's ancestry in parental A clusters in 706 the middle of the genome, while ancestry in parental B clusters in the outer regions of the genome. 707 We manually adjusted two-breakpoint recombinants to be single-breakpoint recombinants if one of 708 the breakpoints according to 3SEQ abutted the beginning or end of the genome. We tested all 709 potential pairs of sequences from the representative parental dataset from the course of the UK 710 pandemic (n = 2000) against each putative recombinant in the child dataset (n = 16), and report p-711 values that are uncorrected and that are Dunn-Sidak corrected for multiple comparisons (n = 64.0 712 million). We performed a single additional run of 3SEQ with two putative recombinant sequences that 713 were not found to be significantly the mosaic product of any of the sequences in the representative 714 background as children, and their closest neighbours by genetic similarity as parentals. P-values for 715 this test were reported without correction and after correction for multiple testing assuming that this test was in addition to the 64 million comparisons that we had already performed. The input and 717 output files for the 3SEQ analysis are available at https://github.com/COG-UK/UK-recombination-718 To determine the placement of the different regions of each recombinant genome in a single context, 737 we also built a phylogenetic tree of the representative background's complete genomes, to which we 738 added the masked recombinant genomes, so that each recombinant was present in the alignment 739 twice, once with the B.1.1.7 region of its genome unmasked, and once with the opposing region 740 unmasked. We ran IQTREE as above. 741  We find evidence for recombination in SARS-CoV-2  We identify 8 clear recombination events 4 of which lead to onward transmission  Estimated breakpoints are consistent with coronavirus cellular replication dynamics  Transmitted recombinants inherited the more-transmissible B.1.1.7 Spike gene Sampling from late 2020 to early 2021 provides evidence for SARS-CoV-2 recombination and onward community transmission of recombinant viruses that inherited the spike region from the B.1.1.7 (alpha) variant. We independently added each set's genome(s) to the representative background of 725 2000 sequences, along with the reference sequence, to create eight alignments in total. We masked the 726 resulting alignments according to the breakpoints defined by the closest neighbours by genetic 727 similarity, so that for each set, we produced two sub-alignments: one consisting of the region that was 728 and rooting the tree on the reference sequence all B lineage sequences. The phylogenetic trees produced by this analysis are Random nature of coronavirus RNA recombination in 772 the absence of selection pressure Guidelines for 775 identifying homologous recombination events in Influenza A virus To test for onward community transmission of the putative recombinants, we searched the whole UK 745 dataset as of the 5th May 2021 for additional sequences whose genetic variation matched the variation 746 of the recombinants. For each of the eight set of recombinants, we defined a set of SNPs and deletions 747 by which all the recombinants within that set differed from the reference sequence (MN908947.3). 748Then we used type_variants to scan the UK dataset for genomes whose SNP and deletion variation 749 was compatible with being a descendant or sibling of the putative recombinants. Group A represented 750 the only recombination event with evidence for further transmission according to the results of this 751 procedure. We carried out the following additional analyses to further investigate transmission of 752Group A genomes. Firstly, we visualised the nucleotide variation of the additional matching genomes 753 using snipit and extracted their sampling locations and dates. Secondly, to explore the phylogenetic 754 context of Group A and its derivatives, we reconstructed their (whole-genome) phylogenetic 755 relationships using IQTREE. We also extracted the 100 closest sequences by genetic similarity for 756 each alternate region of the genome (B.1.1.7-like and non-B.1.1.7-like) for each of the four original 757 members of Group A to provide phylogenetic context to the parental sequences. This resulted in a 758 dataset of 216 sequences in total when the two groups of neighbours were combined, and duplicates 759 removed. We reconstructed their (whole-genome) phylogenetic relationships with the IQTREE, as 760 above. We generated a time-scaled phylogenetic tree from the divergence tree of parental sequences 761 using TreeTime (Sagulenko et al. 2018) , setting the --clock-rate parameter to 0.001. We labelled the 762 phylogenetic tree of recombinants and the phylogenetic tree of parental sequences with the sampling 763 date in number of epidemiological weeks (epiweeks) since the first epiweek of 2020 to assess the 764 temporal context of the recombination event and subsequent transmission. We carried out a second 765 follow up on 14 th July 2021 using the same procedure as above.