key: cord-0878727-zoa8drkw authors: Dwivedy, Abhisek; Murmu, Krushna Chandra; Ahmad, Mohammed; Prasad, Punit; Biswal, Bichitra Kumar; Aich, Palok title: Molecular basis of the logical evolution of the novel coronavirus SARS-CoV-2: A comparative analysis date: 2020-12-03 journal: bioRxiv DOI: 10.1101/2020.12.03.409458 sha: 7363b0d10595c593257c88b61e07b39769282586 doc_id: 878727 cord_uid: zoa8drkw

A novel disease, COVID-19, has been sweeping the world since the end of 2019. While the first wave is over in many countries, the pandemic is going through its next phase with a significantly higher infectability. COVID-19 is caused by the novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), which seems to be more infectious than any previous human coronavirus. To understand any unique traits of the virus that facilitate its entry into the host, we compared the published structures of the viral spike protein of SARS-CoV-2 with those of other known coronaviruses to determine the possible evolutionary pathway leading to the higher infectivity. The current report presents unique information regarding the amino acid residues that were a) conserved to maintain the binding with ACE2 (Angiotensin-converting enzyme 2), and b) substituted to confer an enhanced binding affinity and conformational flexibility to the SARS-CoV-2 spike protein. The present study provides novel insights into the evolutionary nature and molecular basis of the higher infectability and perhaps the virulence of SARS-CoV-2.

and affinity of its spike protein for ACE2 of the human host, significantly higher (Walls et al., 2020). The higher affinity, dynamic rearrangement, and specificity of the SARS-CoV-2 spike protein for ACE2 are among the key factors that might have made the virus more virulent. The pertinent question is how it acquired such potent and precise machinery within a short span, following the SARS-CoV and MERS-CoV outbreaks that took place in 2003 and 2012, respectively.
We therefore attempted to understand the evolution of coronaviruses of various kinds, with special emphasis on SARS-CoV, MERS-CoV and SARS-CoV-2, to find cues on the evolutionary dynamics that have enhanced the virulence and infectivity of SARS-CoV-2. We analyzed the amino acid sequences of the Spike proteins of 45 relevant coronaviruses and the structural features of select ones to understand the major differences that might explain the increased binding efficiency of the SARS-CoV-2 spike protein to human ACE2. The Spike proteins from coronaviruses are divided into two distinct fragments, S1 and S2. Fragment S1 is involved in recognition of host cell surface receptors, and fragment S2 is involved in generation of the pre-fusion complex. Fragment S1 comprises two major domains, the N-terminal domain (NTD) and the C-terminal domain (CTD) (Li, 2016). Collectively, the NTD and CTD are also known as the receptor binding domain (RBD). The CTD interacts with molecules such as ACE2 and CD26, in the case of SARS-CoV/SARS-CoV-2 and MERS-CoV/Bat-CoV, respectively. The NTD is known to recognize sugar-containing molecules and cell adhesion molecules (Li, 2016; Sun et al., 2020). In its physiological state, the Spike protein is a homo-trimer with a central three-fold symmetry, with the three S1 fragments sitting atop the respective membrane-anchored S2 fragments (Figure 1A) (Li, 2016). We validated the protein structural data by analyzing the differences in the coding nucleotide sequences. The results showed the plausible mutations that act as the driving force in the natural selection of SARS-CoV-2.

2. Materials and Methods

2.1 Protein Sequence and Structure analysis

The sequences of 45 Coronavirus Spike proteins were retrieved from the SwissProt database. The details of the sequences are presented in Table S1.
The sequences were aligned using Clustal Omega (Sievers and Higgins, 2014) and a maximum likelihood phylogenetic tree was generated using the NEXUS algorithm (Giribet, 2005)

we estimated the best model using the "modelTest" function from the "phangorn" package (Posada and Crandall, 1998). GTR+G+I was selected as the model to build the maximum likelihood phylogenetic tree with 100 iterations. The sequence alignments were represented using ESPript 3.0. The ancestry and substitution analyses were performed using MEGA X.

Aligned amino acid and nucleotide sequences were assigned the same names in both alignments for comparison. The phylogenetic distances were calculated for UPGMA using the "phangorn" library (Schliep, 2011). The best model was calculated for both nucleotide and amino acid sequences using the "modelTest" function, where the Akaike Information Criterion (Ingram and Mahler, 2013) was applied to determine the best model for both trees, and each tree was then converted into a dendrogram using the "as.dendrogram" function. These dendrograms were then compared using the "tanglegram" function of "dendextend" (Galili, 2015). Tree distance was calculated using the "treedist" function (Smith, 2020).

For synonymous and non-synonymous mutation analysis, the 45 nucleotide sequence files, in which the headers were labelled the same as the respective amino acid sequences, were used. Using the reverse align function, the nucleotide sequences were reverse aligned by the seqinr library (Charif and Lobry, 2007)

2020). These two features of SARS-CoV-2 prompted us to ask an important and obvious question: how and when did the novel coronavirus emerge as a distinct lineage in terms of its enhanced infectability? What molecular markers can be analyzed to understand the molecular basis of the stronger affinity and higher infectability?
In the current report, we investigated these questions by a) comparing the structures of spike proteins from SARS-CoV, MERS-CoV, SARS-CoV-2 and other related Bat-CoVs; b) establishing the similarities and differences in major amino acid residues to understand the higher affinity of SARS-CoV-2 for ACE2 compared to other coronaviruses; and c) comparing the nucleotide and amino acid sequences of spike proteins to estimate the most probable evolutionary trend.

In order to investigate the evolutionary divergence of the Spike protein, we generated an evolutionary tree (Figure S2). Interestingly, of the seven known human coronaviruses, the three species associated with higher infectivity and morbidity, i.e., SARS-CoV, MERS-CoV and SARS-CoV-2, formed a distinct evolutionary cluster. Notably, the other members of this cluster were overtly bat coronavirus species. Such specific clustering suggests a possible co-evolution of the Spike proteins of human and bat coronaviruses that led to the association with more severe infections.

The scenario is, however, much different in other coronaviruses. In particular, the CTDs of Bat-CoV and MERS-CoV are larger, with an anti-parallel β-sheet replacing the loop-like structures of the ACE2-recognizing region of the CTD of SARS-CoV-2 (Figure S4). This comparative structural analysis hinted at a divergent evolution of the CTDs into two independent lineages: a) MERS-CoV and b) SARS-CoV-2. We also found that the core structures of the NTD of the SARS-CoV-2 spike protein and the lectin-binding NTD of the Bov-CoV (Bovine Coronavirus) spike protein are largely similar (Figure S5). Taken together, the results suggest that the evolution of the host receptor recognizing domain in the coronavirus spike proteins is more local in nature, while the global architecture demonstrates significant conservation (Figure 1B).
To explore these subtle evolutionary changes, we probed the local architecture of the interfaces of the SARS-CoV-2 CTD/hACE2 and SARS-CoV CTD/hACE2 complexes reported in the cryo-EM determined structures (Figure 1C) (Wrapp et al., 2020). Our analyses led to important findings that could help not only in understanding the evolution and origin of SARS-CoV-2 but also in developing potential interventions. Apart from the two small β-strands present in the SARS-CoV-2 CTD, the interfaces in the complexes were primarily lined with loop-like structures from the CTD of the spike protein and the N-terminal helix of ACE2 (Figure 1D). It is important to note that the SARS-CoV-2 CTD has 21 residues that interact with the ACE2 N-terminal helix, while the SARS-CoV CTD has only 17 interacting residues. A closer inspection of the amino acid residues involved in the interactions suggested that residues Y453, Y473, G476 and F486 of the SARS-CoV-2 CTD were crucial for a stronger interaction with ACE2, with no identical residues from SARS-CoV in the respective molecular environment (Figure 2A). In order to determine any evolutionary correlation, the MERS-CoV and Bat-CoV CTDs were docked onto the N-terminal helix region of ACE2, followed by in silico energy minimization of the complexes. The MERS-CoV and Bat-CoV CTDs exhibited 18 and 19 residues interacting with the ACE2 N-terminal helix, respectively (Figure 2B & 2C). Notably, Y453 of SARS-CoV-2 superimposed with the identical interacting residues Y499 and Y503 from the MERS-CoV and Bat-CoV CTDs, respectively, in the molecular microenvironment (Figure 2B & 2C). In order to understand the contribution of each interacting residue of the CTDs to ACE2 binding, in silico alanine scanning mutagenesis analysis was performed.
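
As a rough guide to how alanine-scanning energies relate to dissociation constants, the sketch below converts a loss of binding free energy (ΔΔG) into a fold change in Kd using the standard thermodynamic relation ΔΔG = RT ln(Kd,mutant / Kd,wild type). It is an illustration only: the temperature of 298.15 K and the function name are assumptions, and the web servers used in this study may estimate Kd with their own models.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1

def kd_fold_change(ddg_kcal_per_mol, temperature_k=298.15):
    """Return Kd(mutant) / Kd(wild type) for a binding free-energy loss ddG.

    Assumes ddG = RT * ln(Kd_mut / Kd_wt); the default temperature (25 C)
    is an assumption, not a value taken from the study.
    """
    return math.exp(ddg_kcal_per_mol / (R_KCAL * temperature_k))

# Per-residue contributions reported in this section (kcal/mol)
for residue, ddg in [("Y453", 2.018), ("F486", 3.01)]:
    print(f"{residue}: ~{kd_fold_change(ddg):.0f}-fold rise in Kd on Ala mutation")
```

Under these assumptions, a contribution of about 2 kcal mol⁻¹ corresponds to roughly a 30-fold weaker binding when the residue is replaced by alanine, which is why a single well-placed tyrosine can matter for affinity.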
While Y453 of SARS-CoV-2 contributed 2.018 kcal mol⁻¹, F486 contributed 3.01 kcal mol⁻¹ to the interaction. Interestingly, Y499 and Y503 of the MERS-CoV and Bat-CoV CTDs contributed significantly more to their respective interactions: 2.877 and 3.017 kcal mol⁻¹, respectively (Figure 2D). Also, a significant rise was observed in the dissociation constants (Kd) of binding with ACE2 following alanine mutations of the aforementioned residues (Figure 2D). These energy values suggest that the spatial conservation of this tyrosine residue in the CTD of SARS-CoV-2, which is completely absent in the CTD of SARS-CoV, is key to stronger ACE2 binding.

A comparison of the in silico binding properties of the four aforementioned CTDs with ACE2 revealed that despite a higher Kd for SARS-CoV-2, there was a significant decrease in the surface area of interaction, suggesting a higher specificity of interaction between residues of the CTD and the ACE2 N-terminal helix (Table 1). It is also worth noting that the interacting residues are widely spread across the interacting surface of the SARS-CoV CTD, whereas for the SARS-CoV-2 CTD the interactions localize at the far ends of the interacting surface (Figure S6 A-D). This phenomenon is crucial as the central region of the interacting surface is primarily comprised of uncharged residues that arch away from the N-terminal helix

Concurrently, a higher dS is indicative of a purifying selection, which removes deleterious mutations that reduce fitness. We estimated the dN and dS values for the set of 45 Spike protein polypeptide sequences. While the dN values varied between -6 and 2, a significant proportion of the clusters showed values greater than 1, suggesting a higher number of non-synonymous substitutions across the spike protein sequences (Figure S8).
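
To make the distinction between synonymous (dS) and non-synonymous (dN) changes concrete, here is a minimal sketch that classifies codon differences between two aligned coding sequences. It is a naive per-codon count, not the normalized dN and dS estimation used in the study; the function name and the toy sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def count_codon_changes(seq_a, seq_b):
    """Count (synonymous, non_synonymous) codon differences in two aligned CDS."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    syn = nonsyn = 0
    for i in range(0, len(seq_a) - 2, 3):
        a, b = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue  # identical codons, gaps, and ambiguous bases are skipped
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1      # same amino acid encoded by a different codon
        else:
            nonsyn += 1   # amino acid replaced
    return syn, nonsyn

# Toy example: TAT->TAC keeps Tyr (synonymous); TAT->CTT gives Leu (non-synonymous)
print(count_codon_changes("TATTAT", "TACCTT"))
```

A proper estimate would also normalize these counts by the number of synonymous and non-synonymous sites and correct for multiple substitutions, which this sketch deliberately leaves out.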
More importantly, the dS values varied from -4 to 4 (Figure S9)

Spike proteins

To better understand the aforementioned evolutionary conundrum, we closely examined the protein (Figure S10) and nucleotide (Figure S11) sequences of the CTDs of five Spike protein sequences: SARS-CoV-2, SARS-CoV, MERS-CoV, and Bat-CoV (all belonging to the sub-genus Sarbecovirus) (Table S2). The CTDs of these Spike proteins exhibited 25 conserved residues. The CTDs of MERS-CoV and Bat-CoV are evidently longer and shorter, respectively, in comparison to the SARS-CoV and SARS-CoV-2 CTDs. The theoretical pI for the CD26-binding CTDs is around 5, suggesting an abundance of negatively charged amino acids. In contrast, the ACE2-binding CTDs have a theoretical pI greater than 8, suggesting a higher proportion of positively charged residues. Moreover, despite containing equivalent proportions of aromatic amino acids, the MERS-CoV CTD is significantly more hydrophobic than the others. Notably, ACE2 is localized strictly on cell membranes, whereas DPP4 (CD26) localizes on the cell membrane as well as in the cytoplasmic and extracellular fluids. The differential location of the targeted receptors might be the reason for the lower infectivity of MERS-CoV despite its significantly higher mortality. A comparison of the Spike protein CTD coding sequences revealed that SARS-CoV, MERS-CoV, and Bat-CoV had a higher GC content (~39%) than SARS-CoV-2 (~34%). This in turn was evident from the comparison of the codon usage of the SARS-CoV-2 Spike protein, wherein a significantly higher proportion of amino acids were encoded by AT-rich codons (Figure S12).

Further, we closely examined the binding regions of the four CTDs, with specific emphasis on the evolution of Y453 (Figures 5A & 5B). Residues 449-456 of the Spike protein from SARS-CoV-2 are Asn-Tyr-Leu-Tyr-Arg-Leu-Phe-Arg.
The aligned residues for this stretch SARS-CoV generate a strong steric hindrance causing both the tyrosine residues to remain buried within the CTD (Figure 2A, middle panel). Interestingly, mutating the second tyrosine to a similar yet smaller amino acid, leucine (Tyr-Arg-Leu), in SARS-CoV-2 reduces the steric hindrance (Figure 2A, top panel). This allows the otherwise buried Tyr453 to interact with amino acids from ACE2, resulting in enhanced binding. However, in MERS-CoV and Bat-CoV, these triads are present as Tyr-Ile-Asn and Tyr-Arg-Ser, respectively. This decrease in hydropathicity significantly reduces the binding affinity of the MERS-CoV and Bat-CoV spike proteins for ACE2 under physiological conditions. However, this in turn enables them to bind CD26 with a stronger affinity, suggesting a positive selection. Taken together, the replacement of the second tyrosine of the triad with a weakly hydrophobic and smaller amino acid suggests a purifying selection in the SARS-CoV-2 Spike protein.

Tyr453 to explain the higher affinity of SARS-CoV-2 for the ACE2 receptor. We propose that the evolution of SARS-CoV-2 occurred in parallel to, yet independently of, that of MERS-CoV, following a set of recombination and mutational events involving the genomes of the Bat-CoV and the SARS-CoV.

It is important to mention that the current study is the first of its kind to establish a comparative molecular basis for how SARS-CoV-2 evolved to acquire its virulence and infectivity over other related coronaviruses. The work has the potential to merge with other published data and extend the knowledge base on SARS-CoV-2, helping to predict upcoming outcomes and aid in the development of novel and effective interventions.

Supplementary Materials: The supplementary data are given in a separate compiled file. A brief description of the supplementary data follows.
Figure S1: amino acid sequence, Figure S2: A maximum parsimony phylogeny tree, Figure S3: spike protein comparison, Figure S4: Superimposition of CTDs, Figure S5: Superimposition of NTDs, Figure S6: Surface representation of protein complex, Figure S7: Comparison of maximum parsimony and maximum likelihood, Figure S8: Heatmap of hierarchical clustering, Figure S9: Heatmap of dS values, Figure S10: multiple protein sequence alignment, Figure S11: multiple nucleotide sequence alignment, Figure S12: Comparison of codon usage,

References

Evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) as coronavirus disease 2019 (COVID-19) pandemic: A global health emergency
Stability of two hierarchical grouping techniques case I: Sensitivity to data errors
A PDB-wide, evolution-based assessment of protein-protein interfaces
The 2019-new coronavirus epidemic: Evidence for virus evolution
SeqinR 1.0-2: A Contributed Package to the R Project for Statistical Computing Devoted to Biological Sequences Retrieval and Analysis
The PyMOL Molecular Graphics System, Version 2.3. Schrödinger LLC
The GC content as a main factor shaping the amino acid usage during bacterial evolution process
MUSCLE: Multiple sequence alignment with high accuracy and high throughput
dendextend: An R package for visualizing, adjusting and comparing trees of hierarchical clustering
TNT: Tree Analysis Using New Technology
ESPript/ENDscript: sequence and 3D information from protein structures
GalaxyRefineComplex: Refinement of protein-protein complex model structures driven by interface repacking
DrugScorePPI webserver: Fast and accurate in silico alanine scanning for scoring protein-protein interactions
MEGA X: Molecular evolutionary genetics analysis across computing platforms
Comparison tests for dendrograms: A comparative evaluation
Structure, Function, and Evolution of Coronavirus Spike Proteins
POSA: A user-driven, interactive multiple protein structure alignment server
Molecular epidemiology, evolution and phylogeny of SARS coronavirus
Recombination drives the evolution of GC-content in the human genome
SpotOn: High Accuracy Identification of Protein-Protein Interface Hot-Spots
InterProSurf: A web server for predicting interacting sites on protein surfaces
ZDOCK server: Interactive docking prediction of protein-protein complexes and symmetric multimers
MODELTEST: Testing the model of DNA substitution
Comparisons of dN/dS are time dependent for closely related bacterial genomes
MCSM-PPI2: predicting the effects of mutations on protein-protein interactions
The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
phangorn: Phylogenetic analysis in R
Evolution of SARS Coronavirus and the relevance of modern Molecular Epidemiology, in: Genetics and Evolution of Infectious Diseases
Information theoretic Generalized Robinson-Foulds metrics for comparing phylogenetic trees
COVID-19: Epidemiology, Evolution, and Cross-Disciplinary Perspectives
Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
TOPS++FATCAT: Fast flexible structural alignment using constraints derived from TOPS+ Strings Model
Antigenicity of the SARS-CoV-2 Spike Glycoprotein
Structural and Functional Basis of SARS-CoV-2 Entry by Using Human ACE2
RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences
WHO, 2020. Coronavirus disease 2019 (COVID-19) Situation Report - 174
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science
PRODIGY: A web server for predicting the binding affinity of protein
Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2. Science
Ggtree: an R Package for Visualization and Annotation of Phylogenetic Trees With Their Covariates and Other Associated Data
Relationship between the ABO Blood Group and the COVID-19 Susceptibility

S2 and the membrane anchor (MA) region on the viral membrane (VM). (B) A superimposition of the Cα chains for the complete spike protein from SARS-CoV2 (lime-green); complete spike protein from SARS-CoV (tv-red). Of note, the CTD of the SARS-CoV2 is more compact as compared to SARS-CoV, MERS-CoV and BAT-CoV. (C) A superimposition of the NTD of the SARS-CoV2 and SARS-CoV bound to the human ACE2: Cα chain in left panel and secondary structures in right panel (SARS-CoV in tv-green, CoV bound ACE2 in yellow). (D) A superimposition of the interacting regions of the NTD of the SARS-CoV2 and SARS-CoV and the human ACE2 (SARS-CoV2 in raspberry, CoV2 bound ACE2 receptor in cyan; SARS-CoV in tv-green, CoV bound ACE2 in yellow).

Figure 2: Comparison of the amino acid residues of the Spike proteins of SARS-CoV2, MERS-CoV and Bat-CoV involved in interactions with human ACE2. (A) Superimposition of the ACE2 binding residues of the CTD of SARS-CoV2 (raspberry) and (Crucial interacting residues from CoV2 marked in red arrows).
(B) Superimposition of the ACE2 binding residues of the CTD of SARS-CoV2 (raspberry) and MERS-CoV (pale green), with residues from each highlighted in sticks. (Crucial interacting residue from CoV2 with identical superimposed residue from MERS-CoV marked