key: cord-0901157-xsrqjk3m
authors: Tagliamonte, Massimiliano S.; Abid, Nabil; Ostrov, David A.; Chillemi, Giovanni; Pond, Sergei L. Kosakovsky; Salemi, Marco; Mavian, Carla
title: Recombination and purifying selection preserves covariant movements of mosaic SARS-CoV-2 protein S
date: 2020-06-10
journal: bioRxiv
DOI: 10.1101/2020.03.30.015685
sha: 15a9ca07806da084bf89d1cf5a6c9a2683463383
doc_id: 901157
cord_uid: xsrqjk3m

In-depth evolutionary and structural analyses of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) isolated from bats, pangolins, and humans are necessary to assess the role of natural selection and recombination in the emergence of the current pandemic strain. The unique features of the SARS-CoV-2 S glycoprotein have been associated with efficient viral spread in the human population. Phylogeny-based and genetic algorithm methods clearly show that recombination events between viral progenitors infecting animal hosts led to a mosaic structure in the S gene. We identified recombination coldspots in the S glycoprotein and strong purifying selection. Moreover, although there is little evidence of diversifying positive selection during host-switching, structural analysis suggests that some of the residues that emerged along the ancestral lineage of current pandemic strains may contribute to an enhanced ability to infect human cells. Interestingly, recombination did not affect the long-range covariant movements of the SARS-CoV-2 S glycoprotein monomer in the pre-fusion conformation but, on the contrary, could contribute to the observed overall viral efficiency. Our dynamic simulations revealed that the movements between the host cell receptor binding domain (RBD) and the novel furin-like cleavage site are correlated. We identified threonine 333 (under purifying selection), at the beginning of the RBD, as the hinge of the opening/closing mechanism of the SARS-CoV-2 S glycoprotein monomer functional to hACE2 binding.
Our findings support a scenario where ancestral recombination and fixation of amino acid residues in the RBD of the S glycoprotein generated a virus with unique features, capable of extremely efficient infection of the human host.

Coronaviruses (CoVs) are positive-sense single-stranded RNA (+ssRNA) viruses with diverse tropism, able to infect respiratory, enteric, and hepatic tissues of several species (Fehr and Perlman 2015). Within the CoV family, Beta CoVs have repeatedly proven able to shift from their natural reservoir and adapt to human hosts (Su, et al. 2016). Several CoVs, such as HCoV-229E, HCoV-OC43, HCoV-HKU1, and HCoV-NL63, are mostly associated with mild symptoms (Bucknall, et al. 1972; Woo,

Methods section) (Kosakovsky Pond, et al. 2006), with the purpose of achieving higher sensitivity (Jia, et al. 2018). Our analysis in fact revealed that the recombination pattern of the S glycoprotein is far more complex than previously thought. Because of the lack of sensitive recombination methods (Li,

(Bosch, et al. 2008) and activating cleavage on S2' site (Ou, et al. 2020). An important peculiarity of SARS-CoV-2 as compared to other CoVs is the presence of a furin-like cleavage sequence (Madu, et al. 2009; Lai, et al. 2017). Exposure of this cleavage site to the soluble furin protease is an important step in viral infection (Wrapp, et al. 2020) and explains, at least in part, the enhanced infectivity and pathogenicity of the virus (Bosch, et al. 2003).

scanning subsets of the alignment including only SARS-CoV-2 and its ancestors, recombination signal was highly significant (q < 1 × 10^-6). A signal for recombination was also detected for SARS-CoV-2 with pangolin-SARS-CoV-2 ancestral isolates (q = 0.03), and when testing SARS-CoV-2 bat (EPI_ISL_402131) and pangolin isolates (q = 0.002).
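The significance levels above are reported as q-values. As a rough illustration of how such values can be obtained, the sketch below applies a Benjamini-Hochberg false discovery rate adjustment to a list of p-values. That this specific procedure produced the q-values reported here is an assumption (the reference list includes the Benjamini-Hochberg method, but the passage does not name it), and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values from five recombination tests (not from the paper).
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
example_q = benjamini_hochberg(example_p)
print(example_q)
```

A test is then called significant when its q-value falls below the chosen false discovery rate, so that the threshold accounts for the number of alignment subsets scanned.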
We next scanned the SARS-CoV-2 genomes for putative recombination breakpoints using RDP, GENECONV, MAXCHI, CHIMAERA, SISCAN, and 3SEQ, implemented in RDP4. This pipeline allows for the identification of recombinants as well as their potential parental isolates, if present in the sampled sequences. Sixty-eight potential recombination events distributed along the genome were found in bat-SL-CoV, SARS-CoV, and SARS-like CoV isolates (Table S1). We found that human SARS-CoV-2 isolates were potentially either parental or recombinant sequences in recombination events with pangolin-SARS-CoV-2 isolated in 2019 and bat-SARS-CoV-2 genomes. Uncertainty in the estimation (the parental may be the recombinant sequences, or vice versa) suggests missing links in SARS-CoV-2 lineages, which is likely given the sparse sampling of wildlife coronavirus diversity, or low sensitivity of the recombination algorithms. Because the recombination event between CoVs of pangolin "b" and human origin involved the region of the S glycoprotein that binds to the cell receptor (Figure 1b-c and Table S1), exactly matching the ACE2 binding region where residues are shared between pangolin and human, we conclude that the SARS-CoV-2 human lineage is the result of recombination between a strain belonging to the pangolin "b" lineage, the potential minor parent, and a strain belonging to the bat lineage close to bat-SARS-CoV-2 RaTG13.

We corroborated the potential recombination event by phylogenetic inference based on the recombinant and non-recombinant genome fragments from human, bat, and pangolin CoV-2 sequences, after assessing the presence of phylogenetic signal (Table S2 and S3).
Trees inferred using the recombinant region (part of the S glycoprotein) supported a pangolin and CoV-2 ancestral relationship, while the segment derived from the major parent kept the CoV-2 clade clustering with the bat sequence (Figure S1). We investigated further recombination patterns with a very sensitive genetic algorithm and found statistically significant evidence of further intra-segment recombination (Table S4). The mosaic is the result of multiple independent events involving different ancestral

We identified nine potential recombination hotspots based on an analysis with sliding windows of size 100 bp. Seven of them were located in the polyprotein segment (Figure 2), which spans three quarters of the viral RNA; the remaining two involved protein E, (Table S5). The overall structure of the SARS-CoV-2 has significant structural homology to its SARS-CoV-1 counterpart (Wang, et al. 2020). In comparison with the latter available structure (Gui, et al. 2017), and a recently published study (Lan, et al. 2020), the SARS-CoV-2 S glycoprotein is composed of the S1 subunit, which binds to hACE2 (Li 2016), and the S2 subunit, which plays a role in membrane fusion (Bosch, et al. 2003). In particular, S1 is composed of an N-terminal

For simplicity, we show canonical hACE2 binding residues F486, Q493, S494 and N501, as reported by Andersen et al. (Andersen, et al. 2020). The S2 subunit is located downstream of S1 after the S1/S2 cleavage site (residue 667 for SARS-CoV-1 and residue 684 for SARS-CoV-2). The majority of the residues that differentiate the human SARS-CoV-2 S glycoprotein from bat or pangolin ones were found within the S1 subunit (Figure 3a, Table S6).
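The windowed hotspot scan mentioned above (windows of 100 bp) can be sketched as a per-window permutation test, with thresholds in the spirit of the Methods (99% of 1,000 permutations). This is a minimal illustration and not the RDP4 implementation used in the study: the null model here simply scatters the same number of breakpoints uniformly along the genome, which is a simplifying assumption, and the function names, parameters, and example coordinates are all hypothetical.

```python
import random


def count_in_windows(positions, genome_len, window, step):
    """Count breakpoints falling in each window [start, start + window)."""
    starts = range(0, genome_len - window + 1, step)
    return [sum(s <= p < s + window for p in positions) for s in starts]


def scan_breakpoints(positions, genome_len, window=100, step=100,
                     n_perm=1000, level=0.99, seed=1):
    """Return (hotspot_starts, coldspot_starts) from a permutation test."""
    rng = random.Random(seed)
    observed = count_in_windows(positions, genome_len, window, step)
    below = [0] * len(observed)  # permutations with fewer breakpoints than observed
    above = [0] * len(observed)  # permutations with more breakpoints than observed
    for _ in range(n_perm):
        # Assumed null: same number of breakpoints, placed uniformly at random.
        shuffled = [rng.randrange(genome_len) for _ in positions]
        permuted = count_in_windows(shuffled, genome_len, window, step)
        for i, (obs, perm) in enumerate(zip(observed, permuted)):
            below[i] += perm < obs
            above[i] += perm > obs
    starts = list(range(0, genome_len - window + 1, step))
    hot = [s for s, b in zip(starts, below) if b >= level * n_perm]
    cold = [s for s, a in zip(starts, above) if a >= level * n_perm]
    return hot, cold


# Hypothetical breakpoints: 22 in every 100 bp window except the first.
example = [w * 100 + (k * 7) % 100 for w in range(1, 10) for k in range(22)]
hot, cold = scan_breakpoints(example, genome_len=1000)
print("hotspots:", hot, "coldspots:", cold)
```

A window is flagged as a hotspot when it holds more breakpoints than the same window does in at least 99% of permutations, and as a coldspot when it holds fewer. A realistic null model would also need to account for local sequence diversity, which this sketch ignores.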
While the S glycoproteins of MERS-CoV, HCoV-OC43, and HKU1 display the canonical (R/K)-(2X)n-(R/K)* (or RXXR*SA) motif for the S1/S2 cleavage (Li

We identified threonine 333 (corresponding to T333 and T331 in bat and pangolin, respectively) as the hinge of this movement. Notably, we verified that this motion is conserved within the SARS-CoV-2 lineage, and therefore is observed in human, bat, and

Publicly available genomic CoV sequences were obtained from GenBank and GISAID (Table S5 and Table S1. BOOTSCAN (Salminen, et al. 1995) plots were generated using RDP4. Phylogenetic signal for the subset alignment including bat, human and pangolin CoV-2 sequences was confirmed with TREE-PUZZLE (Schmidt, et al. 2002), and weighted SH and AU tests (Shimodaira 2002)

with different sensitivity to hotspots and coldspots. Segments that had more observed breakpoints than the local maximum number of breakpoints of 99% of 1,000 permutations were considered hotspots (p = 0.01); in the same fashion, segments that had fewer breakpoints than the local minimum of 99% of permutations were considered coldspots (Heath, et al. 2006). To identify the recombinant structure used to correct selection analyses, we used a

(Table S9). For selection analyses, we used GARD results to partition the 119 unique sequences subset. We inferred ML trees separately for each segment using raxml-ng and the GTR+G+I model with 5 random parsimony starting trees. In these trees, we further identified three branches that included host-switching events and the branch separating 2017 and 2019 pangolin isolates (see Figure 4). We used BUSTED to assess the presence of selection on the gene S partitions. We ran FEL (Kosakovsky Pond and Frost 2005) and MEME (Murrell, et al.

separately on each partition (since it cannot be applied to multi-partition data). Finally, to look for co-evolution between sites, we applied BGM (Poon, et al.
2007) to each partition separately, focusing only on internal branches and sites that accumulated at least two substitutions on internal branches. All selection analyses were carried out in HyPhy v2.5.14.

with Chimera v1.12 (Pettersen, et al. 2004) and VMD v1.9.2 (Humphrey, et al. 1996). Counter ions were added to neutralize the overall charge with the GROMACS genion tool. After energy minimizations, the systems were slowly relaxed for 5 ns by applying

The N-terminal domain is colored tan. The residues unique to human SARS-CoV-2, hACE2 binding site residues, and S1/S2 cleavage sites are shown in cyan,

the fusion peptide visits a more solvent exposed conformation (see Figure S4 and

Essential dynamics of proteins
The proximal origin of SARS-CoV-2
Further Evidence for Bats as the Evolutionary Source of Middle East Respiratory Syndrome Coronavirus
Emerging viruses: why they are not jacks of all trades?
Activation of the SARS coronavirus spike protein via sequential proteolytic cleavage at two distinct sites
Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
Toward the estimation of the absolute quality of individual protein structure models
QMEAN server for protein model quality estimation
The evolutionary dynamics of influenza A virus adaptation to mammalian hosts
Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic
An exact nonparametric method for inferring mosaic structure in sequence triplets
Cathepsin L functionally cleaves the severe acute respiratory syndrome coronavirus class I fusion protein upstream of rather than adjacent to the fusion peptide
The coronavirus spike protein is a class I virus fusion protein: structural and functional characterization of the fusion core complex
A simple and robust statistical test for detecting the presence of recombination
Neighbor-net: an agglomerative method for the construction of phylogenetic networks
Studies with human coronaviruses. II. Some properties of strains 229E and OC43
Canonical sampling through velocity rescaling
Particle mesh Ewald: An N log(N) method for Ewald sums in large systems
Recombinant canine coronaviruses related to transmissible gastroenteritis virus of Swine are circulating in dogs
Analysis of murine hepatitis virus strain A59 temperature-sensitive mutant TS-LA6 suggests that nsp10 plays a critical role in polyprotein processing
MERS-CoV recombination: implications about the reservoir and potential for adaptation
Analysis of cathepsin and furin proteolytic enzymes involved in viral fusion protein activation in cells of the bat reservoir host
Comparative protein structure modeling using Modeller
Coronaviruses: an overview of their replication and pathogenesis
Structural and molecular basis of mismatch correction and ribavirin excision from coronavirus RNA
Extensive Positive Selection Drives the Evolution of Nonstructural Proteins in
Coronavirus spike proteins in viral entry and pathogenesis
Sister-scanning: a Monte Carlo procedure for assessing signals in recombinant sequences
Recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission
Cryo-electron microscopy structures of the SARS-CoV spike glycoprotein reveal a prerequisite conformational state for receptor binding
Using Parsimony (and Other Methods)). In. Dictionary of Bioinformatics and Computational Biology: In Dictionary of
Structure and expression of mouse furin, a yeast Kex2-related protease. Lack of processing of coexpressed prorenin in GH4C1 cells
Recombination patterns in aphthoviruses mirror those found in other picornaviruses
A Multibasic Cleavage Site in the Spike Protein of SARS-CoV-2 Is Essential for Infection of Human Lung Cells
VMD: visual molecular dynamics
Application of phylogenetic networks in evolutionary studies
The Proteolytic Regulation of Virus Cell Entry by Furin and Other Proprotein Convertases
Homologous recombination within the spike glycoprotein of the newly identified coronavirus may boost cross-species transmission from snake to human
Characterization of small genomic regions of the hepatitis B virus should be performed with more caution
Comparison of simple potential functions for simulating liquid water
ModelFinder: fast model selection for accurate phylogenetic estimates
Geographical tracking and mapping of coronavirus disease COVID-19/severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) epidemic and associated events around the world: how 21st century GIS technologies are supporting the global fight against outbreaks and epidemics
A simple method to control over-alignment in the MAFFT multiple sequence alignment program
The Architecture
Host cell proteases controlling virus pathogenicity
Signature pattern analysis: a method for assessing viral sequence relatedness
Not so different after all: a comparison of methods for detecting amino acid sites under selection
GARD: a genetic algorithm for recombination detection
Peptide Forms an Extended Bipartite Fusion Platform that Perturbs Membrane Order in a Calcium-Dependent Manner
Recombination between nonsegmented RNA genomes of murine coronaviruses
Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor
Severe Acute Respiratory Syndrome (SARS) Coronavirus ORF8 Protein Is Acquired from SARS-Related Coronavirus from Greater Horseshoe Bats through Recombination
The epidemiology of severe acute respiratory syndrome in the 2003 Hong Kong epidemic: an analysis of all 1755 patients
Structure, Function, and Evolution of Coronavirus Spike Proteins
Structure of SARS coronavirus spike receptor-binding domain complexed with receptor
Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection
Emergence of SARS-CoV-2 through recombination and strong purifying selection
Transmission dynamics and evolutionary history of 2019-nCoV
Sequence dependency of canonical base pair opening in the DNA double helix
Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)?
The evolution and genetics of virus host shifts
The EMBL-EBI search and sequence analysis tools APIs in 2019
Characterization of a highly conserved domain within the severe acute respiratory syndrome coronavirus spike protein S2 domain with characteristics of a viral fusion peptide
RDP: detection of recombination amongst aligned sequences
RDP4: Detection and analysis of recombination patterns in virus genomes
Emergence of recombinant Mayaro virus strains from the Amazon basin
Host cell entry of Middle East respiratory syndrome coronavirus after two-step, furin-mediated activation of the spike protein
Gene-wide identification of episodic selection
Detecting individual sites subject to episodic diversifying selection
Attenuation of replication by a 29 nucleotide deletion in SARS-coronavirus acquired during the early stages of human-to-human transmission
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV
Possible emergence of new geminiviruses by frequent recombination
Tackling exascale software challenges in molecular dynamics simulations with GROMACS
Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event
Molecular Evidence Suggesting Coronavirus-driven Evolution of Mouse Receptor
UCSF Chimera--a visualization system for exploratory research and analysis
HyPhy: hypothesis testing using phylogenies
An evolutionary-network model reveals stratified interactions in the V3 loop of the HIV-1 envelope
The effect of recombination on the accuracy of phylogeny estimation
Evaluation of methods for detecting recombination from DNA sequences: computer simulations
MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space
Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning
fur gene expression as a discriminating marker for small cell and nonsmall cell lung carcinomas
Consequences of recombination on traditional phylogenetic analysis
TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing
Statistical potential for assessment and prediction of protein structures
An approximately unbiased test of phylogenetic tree selection
Potential impact of recombination on sitewise approaches for detecting positive natural selection
Proteolytic activation of the SARS-coronavirus spike protein: cutting enzymes at the cutting edge of antiviral research
Why do RNA viruses recombine?
Recognition of errors in three-dimensional structures of proteins
Analyzing the mosaic structure of genes
Less is more: an adaptive branch-site random effects model for efficient detection of episodic diversifying selection
Cryo-EM structure of the SARS coronavirus spike glycoprotein in complex with its host cell receptor ACE2
Temperature-sensitive mutants and revertants in the coronavirus nonstructural protein 5 protease (3CLpro) define residues involved in long-distance communication and regulation of protease activity
Genetic Recombination, and Pathogenesis of Coronaviruses
Maeda K. 2014. Emergence of pathogenic coronaviruses in cats by homologous recombination between feline and canine coronaviruses
Evidence of recombinant strains of porcine epidemic diarrhea virus, United States
Bats, civets and the emergence of SARS
Structural and Functional Basis of SARS-CoV-2 Entry by Using Human
SWISS-MODEL: homology modelling of protein structures and complexes
Protein structure modeling with MODELLER
Rapid evolutionary escape by large populations from local fitness peaks is likely in nature
ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins
Evidence of recombination in coronaviruses implicating pangolin origins of
Clinical and molecular epidemiological features of coronavirus HKU1-associated community-acquired pneumonia
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
A new coronavirus associated with human respiratory disease in China
Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2
A highly conserved cryptic epitope in the receptor-binding domains of SARS-CoV-2 and SARS-CoV
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
A novel bat coronavirus reveals natural insertions at the S1/S2 cleavage site of the Spike protein and a possible recombinant origin of HCoV-19