key: cord-0743892-j1wvajjz authors: Brintnell, Erin; Gupta, Mehul; Anderson, Dave W title: Detailed phylogenetic analysis of SARS-CoV-2 reveals latent capacity to bind human ACE2 receptor date: 2020-09-28 journal: bioRxiv DOI: 10.1101/2020.06.22.165787 sha: 392fb285ed9be4453d938ec1c0685e73be7d03b6 doc_id: 743892 cord_uid: j1wvajjz SARS-CoV-2 is a unique event, having emerged suddenly as a highly infectious viral pathogen for human populations. Previous phylogenetic analyses show its closest known evolutionary relative to be a virus detected in bats (RaTG13), with a common assumption that SARS-CoV-2 evolved from a zoonotic ancestor via recent genetic changes (likely in the Spike protein receptor binding domain – or RBD) that enabled it to infect humans. We used detailed phylogenetic analysis, ancestral sequence reconstruction, and in situ molecular dynamics simulations to examine the Spike-RBD’s functional evolution, finding that the common ancestral virus with RaTG13, dating to at least 2013, possessed high binding affinity to the human ACE2 receptor. This suggests that SARS-CoV-2 likely possessed a latent capacity to bind to human cellular targets (though this may not have been sufficient for successful infection) and emphasizes the importance to expand the cataloging and monitoring of viruses circulating in both human and non-human populations. sequence identity) (Supplementary Figure 1) . 21, 29 Next, we sought to investigate the evolution of SARS-CoV-2 infectivity by performing ancestral sequence reconstruction for the Spike-RBD amino acid sequence ( Figure 1A) . While cross-species protein sequence comparisons have been used to investigate critical amino acid changes in the SARS-CoV-2 Spike protein, 37 leveraging the phylogenetic relationships allows us to infer the most likely ancestral sequences for this protein allows us to focus on the subset of genetic changes that are specific to its recent evolution. 38 We inferred statistically well-supported reconstructions of the Spike-RBD sequence for both the common ancestor of all human SARS-CoV-2 cases (labelled "N1", Figure 1A ,C) and the its common ancestor with the closest animal virus (labelled "N0", Figure 1A ,C). N1 is identical to the Spike-RBD in the SARS-CoV-2 reference sequence, as expected, while the N0 Spike-RBD sequence is, to our knowledge, unique, reflecting the uniqueness of SARS-CoV-2's viral origin. 21 ,39 N0 differs from N1 at 4 positions (346, 372, 498, and 519 - Figure 1B) . The reconstruction of N1 for each of those positions is statistically well-supported, with a posterior probability (P.P.) of 1 obtained from two independent calculations (Supplemental Table 1 ; Supplementary Methods). The reconstruction for N0 has high statistical support for positions 346, 372, and 519 (P.P. > 0.94), while position 498 was ambiguously reconstructed, with two alternate states comparably probable (Supplemental Table 1 ). All other positions were reconstructed with high confidence (P.P.>0.85). Together, these four changes (t346R, t372A, h/y498Q, and n519H) differentiate the evolved SARS-Cov-2 Spike protein from the most recent common ancestor with animal viruses (Figure 1 ). As such, this ancestral virus must have existed at least as early as 2013 (as one of its descendants -RaTG13 -was isolated in that year), meaning that the branch between the N0 and N1 ancestors covers at least 7 years of molecular evolution ( Figure 1A) . To quantify functional differences between the N0 ancestor and the Spike-RBD sequences, we conducted 10 ns molecular dynamics simulations (see Supplementary Methods) of the Spike Receptor Binding Domain (RBD) in complex with hACE2 (starting point for each simulation was modelled off crystal structures of the SARS-CoV-2 Spike-RBD/hACE2 complex), 27 which we used to calculate the free energy contributions from electrostatics, polar solvation, van Der Waals interactions, and solvent-accessible surface area (SASA) to infer the free energy of binding between those two proteins. 40, 41 We quantified the root-mean-squared deviation (RMSD) of the portion of the RBD closest to the hACE2 receptor (residues 397 to 512) for each of our replicates to confirm complex stability (Supplementary Figure 2) . Contrary to our expectations, the free energy of binding between the Spike-RBD and the hACE2 appears to have decreased between N0 and N1. In fact, each of the 4 changes either reduced or did not significantly change the free energy of binding (Figure 2A ) (this is true for both alternate reconstructions of position 498 in N0). While overall these four historical changes reduced binding affinity to hACE2, they did not do so equally: t346R and h/y498Q showed the largest decreases in affinity ( Figure 2B) . Additionally, there was no evidence of an epistatic relationship among the four substitutions (see Supplementary Methods). Somewhat surprised by these reductions in binding energy between the N0 and N1 state, we set out to confirm that our observations were not an outcome of oversimplifications in the MD forcefield. We compared standardized changes in binding energy in our MD data and recently released in vitro data 42 and saw remarkably similar changes in binding energy (p = 0.042, Equivalence Test for Means) (Supplemental Table 3 ). In particular, both alternative reconstructed states for position 498 in N0 clearly improve hACE2 binding affinity in both our simulations and in in vitro functional measurements (Figure 2 ; Supplemental Table 3) . 42 There was a discrepancy between the two datasets at position 372 and future in vitro experiments are required to determine the combined effects of this mutation with the other substitutions (Figure 2 ; Supplemental Table 3) . Nonetheless, these results demonstrate that, contrary to expectations, evolutionary changes since 2013 did not improve the SARS-CoV-2 Spike-RBD's binding with hACE2. While there are other animal coronaviruses known to bind to the hACE2 receptor, 43 to our knowledge, this is the first demonstration that the SARS-CoV-2's common ancestor with the RaTG13 lineage may have been capable of binding to hACE2. This has important implications for our understanding of how SARS-CoV-2 evolved to infect humans. First, it suggests that the binding affinity between the Spike-RBD and hACE2 may not be a critical driver of the high degree of infectivity that has been observed during its recent outbreak. Instead, it suggests that tight hACE2 binding may be a latent property of this virus, and that high infectivity may instead have emerged via a distinct set of molecular changes in the SARS-CoV-2 genome. Second, it calls into question the presumption of a recent zoonotic origin for this disease; while other molecular components of the current SARS-CoV-2 virus may have acquired recent evolutionary changes that promoted its infectivity in humans, it appears that the high affinity for hACE2 was not among them. If this is the case -that this viral lineage possessed the ability to bind hACE2 with high affinity for at least the past 7 years ( Figure 1B ) -then why did it not emerge as a public health issue until recently? One possibility is that binding hACE2 by the Spike-RBD is not sufficient, on its own, to infect humans, and that other molecular components first needed to acquire new functions to do so. A second possibility is that this virus may have been capable of infecting human cells for a longer period of time in the past, but that its ancestral form either presented with far fewer symptoms (making it less disruptive and/or noticeable to those infected), or that it was far less infectious (thereby impacting only a small number of people), in either case escaping the notice of public health monitoring (Figure 3 ). To test this, a broad and concerted effort to sequence the range of coronaviruses across human populations would need to be conducted, in order to test whether a closely related virus may also be circulating. [44] [45] [46] Naturally, as an in silico study, these results should be interpreted with some caution. Insofar as they can be validated, however, they are largely consistent with direct functional measurements in the lab. 42 Ideally, combinatorial libraries could be constructed 47,48 and functionally screened 48 in order to glean more detailed insights into the molecular mechanisms underlying the recent evolution of this virus. Predicting the emergence of highly infectious and virulent diseases, while difficult, is vital for human population health. 50 To do so, however, we must take steps to understand how pandemic diseases -such as SARS-CoV-2 -emerged as they did, and to understand if and when they acquired the novel molecular functions that enabled their infectivity. In this case, it appears that the SARS-CoV-2 Spike-RBD did not recently evolve binding affinity to a human-specific protein. Instead, that function appears to have been latent, making it clear that the evolution of this disease -along with so many other aspects of its etiology -is more complex than expected. To confirm SARS-CoV-2 etiology we constructed a phylogenetic tree using PhyML 3.0 of 26 known enzootic and endemic viruses aligned using MAFFT version 7. We then obtained the amino acid spike glycoprotein sequences from the coronaviruses in our phylogeny and 479 SARS-CoV-2 samples, by either downloading the sequences from NCBI -if available, or through extraction using nBLASTx. Spike sequences were then aligned, and ancestral sequence reconstruction was performed using two software packages (RaxML and GRASP). Next, we performed in silico mutagenesis of x-ray crystallography structures for the SARS-CoV-2 spike protein RBD-hACE2 complex to create a library of mutational differences between the current SARS-CoV-2 spike RBD and our inferred ancestor. Molecular interactions for each structure in the library were then characterized through 10 ns MD simulations run in GROMACS. Binding energy between residues 397 to 512 of the SARS-CoV-2 spike RBD and the hACE2 receptor in the MD simulations were calculated using g_mmpbsa and genetic effects were measured for these energies. Finally, MD energy outputs were compared to existing in vitro data using standard statistical methods. A phylogenetic analysis of 26 viral genomes was performed to confirm known SARS-CoV-2 ancestors. 24 known enzootic and endemic viruses and the SARS-CoV-2 reference genome and the Pangolin-CoV genome were downloaded from the National Center for Biotechnology Information (NCBI) 51 and Lam et al. 21 respectfully. Selected sequences were aligned using the Multiple Alignment using Fast Fourier Transform Version 7 (MAFFT) FFT-NS-2 algorithm. 52,53 MAFFT default parameters were used in our alignment, meaning gap penalties were assigned a value of 1.53. PhyML 3.0 was employed to construct a phylogeny of aligned genomes. 54,55 Bayes values ≥ 0.90 were considered statistically significant. The output tree was visualized using the online tool, Interactive Tree of Life (iTOL), and statistically significant clades were examined to validate current knowledge surrounding SARS-CoV-2 evolution. 56 nBLASTx, run using a BLOSUM 62 matrix, a gap opening penalty of 11 and a gap extension penalty of 1, was employed to extract the Spike glycoprotein from the 479 SARS-CoV-2 sequences obtained from GISAID 58,59 selecting for one sequence per day per country from December 30, 2019 to March 25, 2020, and the Pangolin-CoV genome. 57 Additional, Spike sequences, including the RaTG13 Spike protein, were obtained directly from NCBI. 51 Protein sequences were initially aligned using the Multiple Sequence Alignment by Log-Expectation (MUSCLE) program. 60 The optimal parameters for phylogenetic reconstruction analysis were taken from the best-fit evolutionary model selected using the Akaike Information Criterion (AIC) implemented in the PROTTEST3 software, 61 and were inferred to be the Jones-Taylor-Thornton (JTT) model 62 with gamma-distributed among-site rate variation and empirical state frequencies. Phylogeny was inferred from these alignments using the RaXML v8.2.9 software 63 and results were visualized using FigTree v1.4.4 (https://github.com/rambaut/figtree/releases). Ancestral sequence reconstruction was performed with the FastML software 64 and further validated independently using the Graphical Representation of Ancestral Sequence Predictions (GRASP) software. 65 Statistical confidence in each position's reconstructed state for each ancestor determined from posterior probability; any reconstructed positions with less than 95% posterior probability was considered ambiguous, and alternate states were also tested. To understand the evolutionary importance of sequence changes observed between ancestral, zoonotic, and SARS-CoV-2 spike protein sequences, in silico mutagenesis and binding energy studies were performed. A previously constructed x-ray crystallography structure for the complex between the receptor binding domain (RBD) of the SARS-CoV-2 spike protein and the human hACE2 receptor were obtained from RCSB (accession number 6M0J). Utilizing PyMOL mutagenesis wizard, 66 the four missense mutations (R346t, A372t, Q498h or Q498y, H519n) identified between the N0 and N1 sequences were introduced into the SARS-CoV-2 RBD sequence, replicating the sequence of the putative ancestral zoonotic (N0) sequence. In addition to the N1 and N0 structures, additional structures were developed in a similar fashion, selectively including each of the 4 mutations to represent all of the possible combinations that these mutations may have existed throughout evolutionary time Simulation of ACE2 interactions using molecular docking: Molecular interactions were characterized with molecular dynamics simulations using GROMACS, TIP3P waters and CHARM07 force-field parameters for proteins. For each condition, three replicate 10 ns simulations were run, starting from crystal structures or structural models. Historical mutations were introduced and energy-minimized before MD simulation. Each system was solvated in a cubic box with a 10 Å margin, then neutralized and brought to 150 mM ionic strength with sodium and chloride ions. This was followed by energy minimization to remove clashes, assignment of initial velocities from a Maxwell distribution, and 1 ns of solvent equilibration in which the positions of heavy protein and DNA atoms were restrained. Production runs were 50 ns, with the initial 10 ns excluded as burn-in. The trajectory time step was 2 fs, and final analyses were performed on frames taken every 12.5 ps. We used TIP3P waters and the CHARM07 FF03 parameters for proteins, as implemented in GROMACS 4.5.5. 67 Analyses were performed using VMD 1.9.1. 68 GROMACS output was uploaded into Visual Molecular Dynamics (VMD) for Root-Mean Squared Deviation (RMSD) Analysis using the RMSD trajectory tool (ref). After discovering large deviations in RMSD values for the full RBD, which we attributed to noise at the ends of the RBD, we isolated our analysis to residues 397 to 512 of the RBD. Next, we measured the binding energies between residues 397 to 512 and the ACE2 receptor using g_mmpbsa, a program which employs Molecular mechanics Poisson-Boltzmann surface area (MMPBSA) calculations to determine binding energy. Van der Waal forces, polar solvation energy, apolar solvation energy and SASA energy were calculated every 0.25 ns using a gridspace of 0.5 and a solute dielectric constant of 2. The output of the three replicates was amalgamated and binding energy was calculated using the bootstrap analysis (n = 2000 bootstraps) published by Kumari et al. 40, 41 We then characterize the genetic effect of each mutation (on average) and assessed whether there were any statistically significant epistatic interactions using established methods. 46, 47 Comparison to in vitro data: In vitro changes in binding energy for the four mutation were obtained from Starr et al. 42 . These data and our binding energies for the N1, 346_372_498, 346_372_519, 346_498_519 and 372_498_519 were each standardized using Z-scores. Changes in binding energy to the N1 state for each standardized score was calculated by subtracting the N1 energy from the mutant energy. An equivalence test for means (TOST) test was performed on the standardized changes in binding energy. Comparison of molecular dynamics and in vitro z-score normalized changes in binding energy for each mutation from N0 to N1. Changes are shown relative to the z-score normalized current (N1) binding energy. D. Box plot illustrating the range of z-score normalized changes in binding energy for the in vitro and molecular dynamics data. An equivalence test for means (p = 0.042) suggests that the molecular dynamics simulation reflects in vitro data. Figure 3 . Schematic of two possible evolutionary scenarios stemming from the observed evolutionary SARS-CoV-2 Spike-RBD function. In Scenario 1, it is postulated that a zoonotic ancestral SARS-CoV-2 strain possessed the ability to effectively bind hACE2 but was unable to effectively enter human cells, requiring the presence of subsequent mutations to infect humans. In Scenario 2, an ancestral SARS-CoV-2 strain was actively infecting humans prior to the outbreak at low levels, thus escaping public health detection until subsequent mutations lead to increased infectivity and/or severity. Phylogeny was constructed using 25 whole genome sequencing data from a series of known coronaviruses. The two closest relatives to SARS-CoV-2 are highlighted in red and sequence identities are specified. Rootmean-square deviation (RMSD) is shown for simulation window that was used to calculate complex binding energy. Note that the Spike-RBD maintains a consistent stable configuration at the interface with hACE2, suggesting our energy calculations can be safely compared across simulations and that higher random stochasticity should not be a confounding factor. While weak evidence on its own, a reconstruction by maximum parsimony of Spike-RBD ability to bind hACE2 for the closest relatives of SARS-CoV-2 is consistent with our ancestral reconstruction here in suggesting that the common ancestor with RaTG13 possessed high binding affinity for hACE2. Supplemental Table 1 : Statistical confidence of ancestral sequence reconstructions for positions that vary between N0 and N1. Ancestral sequence reconstruction was assessed via computed posterior probability for each reconstructed state at each position in the sequence. The posterior probability for each reconstructed state at the four key positions that vary between N0 and N1 is shown, as calculated by two independent software packages (FastML and GRASP). Supplemental Table 2 : Sources of all SARS-CoV-2 genome sequences. Each genome sequence analyzed from SARS-CoV-2 infection cases are detailed according to geographic region of origin, as well as potentially relevant patient meta-data. Evolution and Emergence of Pathogenic Viruses: Past, Present, and Future Five challenges in evolution and infectious diseases Virus-host coevolution with a focus on animal and human DNA viruses Statistical tests of host-parasite cospeciation Cross-species virus transmission and the emergence of new epidemic diseases. Microbiol Accelerated viral dynamics in bat cell lines, with implications for zoonotic emergence Investigating the zoonotic origin of the West African Ebola epidemic Pathogenesis of the 1918 pandemic influenza virus The proximal origin of SARS-CoV-2 Crossing the species barrier-one small step to man, one giant leap to mankind Outbreak of pneumonia of unknown etiology in Wuhan, China: The mystery and the miracle World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19) Nidovirales: evolving the largest RNA virus genome Coronavirus envelope protein: current knowledge Evolutionary history, potential intermediate animal host, and cross-species analyses of SARS-CoV-2 A pneumonia outbreak associated with a new coronavirus of probable bat origin Genetic evolution analysis of 2019 novel coronavirus and coronavirus from other species Possible Bat Origin of Severe Acute Respiratory Syndrome Coronavirus 2 Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins Identifying and engineering ancient variants of enzymes using Graphical Representation of Ancestral Sequence Predictions (GRASP) A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations VMD: visual molecular dynamics