key: cord-0963229-72g9f72b authors: Garvin, Michael R.; Prates, Erica T.; Romero, Jonathon; Cliff, Ashley; Machado Gazolla, Joao Gabriel Felipe; Pickholz, Monica; Pavicic, Mirko; Jacobson, Daniel title: The emergence of highly fit SARS-CoV-2 variants accelerated by epistasis and caused by recombination date: 2021-08-24 journal: bioRxiv DOI: 10.1101/2021.08.03.454981 sha: fd714031d64b1629bd5891c03936dc6dde0fcb97 doc_id: 963229 cord_uid: 72g9f72b The SARS-CoV-2 pandemic has entered an alarming new phase with the emergence of the variants of concern (VOC), P.1, B.1.351, and B.1.1.7, in late 2020, and B.1.427, B.1.429, and B.1.617, in 2021. Substitutions in the spike glycoprotein (S), such as Asn501Tyr and Glu484Lys, are likely key in several VOC. However, Asn501Tyr circulated for months in earlier strains and Glu484Lys is not found in B.1.1.7, indicating that they do not fully explain those fast-spreading variants. Here we use a computational systems biology approach to process more than 900,000 SARS-CoV-2 genomes, map their spatiotemporal relationships, and identify lineage-defining mutations followed by structural analyses that reveal their critical attributes. Comparisons to earlier dominant mutations and protein structural analyses indicate that increased transmission is promoted by epistasis, i.e., the combination of functionally complementary mutations in S and in other regions of the SARS-CoV-2 proteome. We report that the VOC have in common mutations in non-S proteins involved in immune-antagonism and replication performance, such as the nonstructural proteins 6 and 13, suggesting convergent evolution of the virus. Critically, we propose that recombination events among divergent coinfecting haplotypes greatly accelerates the emergence of VOC by bringing together cooperative mutations and explaining the remarkably high mutation load of B.1.1.7. Therefore, extensive community distribution of SARS-CoV-2 increases the probability of future recombination events, further accelerating the evolution of the virus. This study reinforces the need for a global response to stop COVID-19 and future pandemics. Notice: This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). ”Nothing in Biology Makes Sense Except in the Light of Evolution” -Theodosius Dobzhansky In late 2020, three SARS-CoV-2 variants of concern, VOC; B.1.1.7, B.1.351, and P.1 (also 55 called alpha, beta, and gamma respectively) rapidly became the predominant source of infections the United States (USA), and has now been observed in more than 70 countries worldwide. 63 Notably, several of these VOC have rapidly spread even in regions such as the UK that depend 64 on robust sampling efforts for early detection. There is therefore a critical need to identify provide substantial evidence that the genome-wide mutation load of the late 2020 VOC results 111 from recombination between divergent strains. Using structural analysis and molecular dynamics 112 simulations, we explore the individual effects of key mutations in S and other proteins of SARS- 113 CoV-2 that are shared among different VOC. Additionally, we identify a signature of co-114 evolution between the residue 501 in S and ACE2, and propose the molecular basis for that. 115 Overall, our results indicate that linked mutations in VOC generated from recombination act in 116 an epistatic manner to enhance viral spread. This work emphasizes the important role of 117 community spread in generating future VOC (Sheikh et al. 2021 ). 122 trees do not 123 The COVID-19 pandemic is both an unprecedented tragedy and an opportunity to study 124 molecular evolution given the abundant and global sampling of the mutational space of SARS-125 CoV-2. The MJN is a valuable method of integrating these data to understand viral evolution 126 because the model assumes single mutational steps in which each node represents a haplotype 127 and the edge between nodes is a mutation leading to a new one. Typically, a subsample of extant 128 haplotypes for a taxon is obtained and unsampled, or extinct lineages are inferred. In contrast, 129 SARS-CoV-2 sequence data repositories provide extensive sampling of haplotypes and 130 collection dates (the calendar date of the sample). Given that in an MJN, the temporal 131 distribution of haplotypes is inherent (the model assumes time-ordered sets of mutations), the 132 mutational history of the virus can be traced as a genealogy that can incorporate both the relative 133 6 and absolute times of emergence of SARS-CoV-2 variants. Importantly, when the single-134 mutational step MJN model fails, it produces features such as loops or clusters of inferred 135 haplotypes that can indicate biologically important processes such as recombination events, back 136 mutations, or repeat mutations at a site that may be under positive selection. 137 In order to make a direct comparison, we generated a network and a phylogenetic tree of 138 SARS-CoV-2 haplotypes that were identified from sequences sampled during the first four . Diamond shape nodes denote haplotypes that harbor a 159 3-bp mutation in the nucleocapsid gene (N) that is highly conserved and directly affects viral replication in vitro 160 (Thorne et al. 2021; Tylor et al. 2009 ). b. The phylogenetic tree is unable to convey the same information. For 161 example, rapidly expanding populations often display polytomies, i.e., single mutations from a common central 162 haplotype. Those events are readily identified on the MJN but difficult to interpret on a tree because they are usually 163 visualized as a multi-pronged fork (outlined in the dashed-line box) rather than a star pattern (compare H04 in (a) 164 and (b)). These true biological processes also cause tree algorithms to perform poorly because they violate their 165 assumptions, slowing convergence. Additionally, MJN are able to indicate reticulations (i.e., loops) that could 166 denote recombination, reverse mutations, or other biologically important events whereas the forced bifurcation of 167 phylogenetic tree algorithms is unable to display these. Reference sequence: NC_045512, Wuhan, December 24, 174 We processed more than 900,000 SARS-CoV-2 genomes from human and mink, built a MJN 175 network using the 640,211 genomes that survived our quality control workflow, and annotated 176 with PANGO lineages defined in the GISAID database ( Figure 2 , see Methods). This 177 genealogy-based approach to molecular evolution identifies the mutations that define a given 178 haplotype based on the edge between nodes. Here, they define the variants of concern and 179 variants of interest (VOI) based on the edge that initiates their corresponding clusters of nodes 180 ( His 69 _Val 70 del that is also found in a clade of haplotypes from mink (blue dashed-line box in (a)). c. Accumulation 217 rate for common GISAID lineages including VOC represented by the ratio between the accumulated number of 218 reported sequences of a given lineage per day since the appearance of that haplotype (Nacc) divided by the 219 11 corresponding total number (Ntot) at the final sample date for this study. Colors of curves correspond to node colors 220 in (a). All VOC display accumulation rates of at least 1% of the total for that variant per day. The remaining are 221 less than 1% except for the VOI B.1.526 (not displayed in MJN) that is the highest with 3% per day, indicating 222 further scrutiny of this variant is warranted. We also plotted the accumulation rate for lineages that carry the widely 223 reported S Asp 614 Gly mutation but without the nsp12 Pro 323 Leu commonly found with it, supporting our previous 224 hypothesis ( ) that mutations in S alone are not responsible for the rapid transmission of 225 these VOC/VOI but is a function of epistasis among S and non-S mutations. Reference sequence: NC_045512, likely variants of concern, that is, those that we propose to have strong potential to become VOC. Non-VOC (N- Haploid, clonally replicating organisms such as SARS-CoV-2 are predicted to eventually 243 become extinct due to the accumulation of numerous slightly deleterious mutations over time, 244 i.e., Muller's ratchet (Muller 1964 ). Recombination is not only a rescue from Muller's ratchet, it 245 can also accelerate evolution by allowing for the union of advantageous mutations from 246 divergent haplotypes (Bentley & Evans 2018) . In SARS-CoV-2, recombination manifests as a 247 template switch during replication when more than one haplotype is present in the host cell, i.e. Furthermore, all 28 differences between the Wuhan reference sequence and B.1.1.7 appeared in 274 earlier haplotypes (Table S1 ) and therefore, if rapid evolution were the cause, it would require 275 the extremely unlikely process of 28 independent, repeat mutations to the same nucleotide state. In order to test this, we plotted the population-level mutations per day (including repeat 277 mutations at variable sites), which did not reveal any increase in mutation rate at the time of the we also identified a mutation in nsp3 that is one of the lineage-defining mutations for B.1.1.7 and 295 appears in disjoint nodes, but is unlikely to be under selection because it is synonymous. (Table S2) . 363 364 Given the results from the MJN analysis and our previous hypothesis 365 that the cooperative effects of mutations in S and non-S proteins (i.e., epistasis) define and are 366 responsible for the increased transmission of prevalent SARS-CoV-2 variants (Lauring & 367 Hodcroft 2021), we performed protein structural analyses and discuss below the functional 368 effects of these individual and combined mutations in SARS-CoV-2 VOC. We analyze ten likely 369 key mutation sites (red font, Table 1 ) in S and non-S proteins. (Table S4) . To investigate further, we analyzed structural features in the interaction between ACE2 found in more than ten individuals and haplotypes found in five or more individuals as we had in 717 previous work To calculate the chance of accumulating several mutations in a certain period, the probability 773 density function for a normal distribution is used: where is the expected number of mutations for that date, x is the measured value, and is the 776 standard deviation of error calculated from the data shown in Fig. 1b The period of interest to our discussion (June-October 2020) corresponds to 122 days, for which, 780 the integral of PDF(x=13) gives the probability of 1*10 -15 to accumulate 13 mutational events. Wuhan reference using CLC Genomics Workbench using the default parameters except for 788 length fraction and similarity fraction were set to 0.9. Three sites specific to UK B.1.1.7 were 789 analyzed for possible heterozygosity. Of the 100 we sampled, two appeared to be cases of 790 coinfection. This supports the hypothesis that the large expansion in overall mutations seen in 791 UK B.1.1.7 are likely due to recombination. In addition, it also supports the case that coinfection 792 is occurring at a baseline sufficient to allow for occasional recombination. suggested protocol for nonbonded interactions with the CHARMM36 force field when used in 825 the GROMACS suite was followed. The Hbonds plugin in VMD was used to identify hydrogen bond interactions along the 827 simulations (Humphrey et al. 1996) . The geometric criteria adopted are a cutoff of 3.5 Å for 828 donor-acceptor distance and 30° for acceptor-donor-H angle. The Timeline plugin was used to 829 count contacts formed by a given amino acid residue. We defined the distance of 4 Å between 830 any atom pairs as the cutoff for contact. the United States Effect of recombination on the accuracy of the 881 likelihood method for detecting positive selection at amino acid sites Median-joining networks for inferring 884 intraspecific phylogenies', Molecular biology and evolution Mechanisms and consequences of positive-strand RNA 886 virus recombination 888 (2020). 'Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the 889 COVID-19 pandemic Infection sustained by lineage B.1.1.7 of SARS-CoV-2 is characterised by longer 892 persistence and higher viral RNA loads in nasopharyngeal swabs Beyond Shielding: The Roles of Glycans in the SARS-CoV-2 Risk of mortality in patients infected with SARS-CoV-2 variant of concern 900 202012/1: matched cohort study Structural Basis for Helicase-Polymerase Coupling in the SARS-CoV-2 903 Replication-Transcription Complex Particle mesh Ewald: An N⋅log(N) method for 905 Ewald sums in large systems & 908 CMMID COVID-19 Working Group. (n.d.). 'Increased mortality in community-tested cases 909 of SARS-CoV-2 lineage B.1.1.7 Transmission, infectivity, and neutralization of a 912 spike L452R SARS-CoV-2 variant The Nucleocapsid Protein of SARS-CoV-914 2: a Target for Vaccine Development' The Nose-Hoover thermostat', The Journal of chemical 916 40 physics Genomics and epidemiology of the P.1 SARS-CoV-2 lineage in 919 The N501Y and K417N mutations in the spike protein of SARS-CoV-2 alter 921 the interactions with both hACE2 and human derived antibody: A Free energy of 922 perturbation study Characteristics of SARS-CoV-2 variants of concern B.1.1.7, B.1.351 or P.1: data from 925 seven EU/EEA countries, weeks 38/2020 to 10/2021', Euro surveillance: bulletin Europeen 926 sur les maladies transmissibles = European communicable disease bulletin Potentially adaptive SARS-CoV-2 mutations discovered with novel spatiotemporal 930 and explainable-AI models Potentially adaptive SARS-CoV-2 mutations discovered with novel spatiotemporal 933 and explainable AI models d.). 'Complete mapping of mutations to the SARS-CoV-2 spike receptor-binding 936 domain that escape antibody recognition The coronavirus proofreading exoribonuclease mediates 939 extensive viral recombination SARS-CoV-2 in BALB/c mice for testing vaccine efficacy SARS-CoV-2 943 non-structural protein 13 (nsp13) hijacks host deubiquitinase USP13 and counteracts host 944 antiviral immune response SARS-CoV-2 (COVID-19) structural and evolutionary dynamicome: Insights into 947 functional evolution and human genomics CHARMM additive all-atom force field for 951 carbohydrate derivatives and its utility in polysaccharide and carbohydrate-protein 952 modeling LINCS: A linear 954 constraint solver for molecular simulations A Multibasic Cleavage Site in the 957 Spike Protein of SARS-CoV-2 Is Essential for Infection of Human Lung Cells', Molecular 958 cell CHARMM36 all-atom additive protein force field: 960 validation based on comparison to NMR data RosettaRemodel: a generalized framework for flexible backbone protein design VMD: visual molecular dynamics Application of phylogenetic networks in evolutionary 968 studies', Molecular biology and evolution Generation and transmission of inter-lineage recombinants in the SARS-CoV-2 971 pandemic A high ATP 973 concentration enhances the cooperative translocation of the SARS coronavirus helicase 974 nsP13 in the unwinding of duplex RNA Delicate structural 976 coordination of the Severe Acute Respiratory Syndrome coronavirus Nsp13 upon ATP 977 hydrolysis Comparison of simple potential functions for simulating liquid water', The Journal of 980 chemical physics CHARMM-GUI: a web-based graphical user 982 interface for CHARMM Computational predictions of protein structures associated with COVID-19 MAFFT: a novel method for rapid 988 multiple sequence alignment based on fast Fourier transform The Architecture 991 of SARS-CoV-2 Transcriptome Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases 994 Infectivity of the COVID-19 Virus Genetic Variants of SARS-CoV-2-What Do They 996 Mean? popart: full-feature software for haplotype network 998 construction GROMACS 2020 Source code The N501Y 1003 spike substitution enhances SARS-CoV-2 transmission', bioRxiv : the preprint server for 1004 biology Linear 1006 epitope landscape of the SARS-CoV-2 Spike protein constructed from 1,051 COVID-19 1007 patients Molecular Mechanism of the N501Y Mutation for 1009 Enhanced Binding between SARS-CoV-2's Spike Protein and Human ACE2 Receptor Viral infection and 1012 transmission in a large, well-traced outbreak caused by the SARS-CoV-2 Delta variant The emergence and ongoing convergent evolution of the N501Y lineages 1016 coincides with a major global shift in the SARS-CoV-2 selective landscape', medRxiv : the 1017 preprint server for health sciences THE RELATION OF RECOMBINATION TO MUTATIONAL 1019 ADVANCE', Mutation research Recombination patterns in coronaviruses', 1021 bioRxiv : the preprint server for biology Transmission of SARS-CoV-2 on mink 1024 farms between humans and mink and back to humans 1026 (n.d.). 'Furin cleavage of SARS-CoV-2 Spike promotes but is not essential for infection and 1027 cell-cell fusion Polymorphic transitions in single crystals: A new 1029 molecular dynamics method Antiviral Strategies Against SARS-CoV-2 -For a Bioinformatics Approach'. Hann 1033 Potential pathogenicity determinants identified from structural proteomics of 1036 SARS-CoV and SARS-CoV-2', Molecular biology and evolution Mapping DNA methylation with high-throughput nanopore sequencing', Nature 1040 methods SARS-CoV-2 is transmitted via contact and via 1043 the air between ferrets DynaMut2: Assessing changes in 1045 stability and flexibility upon single and multiple point missense mutations', Protein science: 1046 a publication of the Protein Society MrBayes 3: Bayesian phylogenetic inference under 1048 mixed models Resurgence of COVID-19 in Manaus, Brazil, despite high 1051 seroprevalence Why do 1053 SARS-CoV-2 NSPs rush to the ER? 1056 'Host barriers to SARS-CoV-2 demonstrated by ferrets in a high-exposure domestic 1057 setting Structural basis 1060 of receptor recognition by SARS-CoV-2 Cytoscape: a software environment for integrated models of biomolecular 1063 interaction networks Scotland: demographics, risk of hospital admission, and vaccine effectiveness'. The Lancet Why do RNA viruses recombine?', Nature reviews Detecting DNA cytosine methylation using nanopore sequencing Serine 477 plays a 1073 crucial role in the interaction of the SARS-CoV-2 spike protein with the human receptor 1074 ACE2 SARS-CoV-2 1076 variants of concern are emerging in India', Nature medicine Prospective mapping of viral mutations that escape antibodies used to 1080 treat COVID-19 Deep mutational scanning of SARS-CoV-2 receptor 1083 binding domain reveals constraints on folding and ACE2 binding Evolution of enhanced innate immune evasion by the SARS-CoV-2 B.1.1.7 UK variant', bioRxiv : the preprint server for biology The SR-rich motif in SARS-CoV nucleocapsid protein is important for virus replication Phylogeny as population history'. Philosophy and Theory in Biology CryoEM and AI reveal a structure of SARS-CoV-2 Nsp2, a multifunctional protein 1098 involved in key host processes Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking 1101 epidemiological and genetic data 'mRNA vaccine-elicited antibodies to SARS-CoV-2 and 1104 circulating variants' Emergence and rapid transmission of SARS-CoV-2 B.1.1.7 1107 in the United States 1112 'Evasion of Type I Interferon by SARS-CoV-2 Architecture of a 1114 SARS-CoV-2 mini replication and transcription complex Structural basis for the 1117 recognition of SARS-CoV-2 by full-length human ACE2 Mining of epitopes on spike protein of SARS-CoV-2 from COVID-19 patients', Cell 1120 research SARS-CoV-2 spike-protein D614G mutation increases virion spike density and 1123 infectivity The emergence of highly fit SARS-CoV-2 variants accelerated by recombination 1135 1136 Michael R. Garvin 1,2+* , Erica T. Prates 1,2+ , Jonathon Romero 3 , Ashley Cliff 3 , Joao Gabriel Felipe Tables 1183 1184 1185 Table S4 . Average number of contacts formed between Asn 501 in the receptor-binding domains of SARS-CoV-2 S and 1187 residues in ACE2 from human (hACE2) and ferret (fACE2). A distance of 4 Å between any atom pairs was 1188 defined as the cut-off for contact statistics.