key: cord-0957500-t8q99tlq authors: Jia, Yong; Shen, Gangxu; Nguyen, Stephanie; Zhang, Yujuan; Huang, Keng-Shiang; Ho, Hsing-Ying; Hor, Wei-Shio; Yang, Chih-Hui; Bruning, John B; Li, Chengdao; Wang, Wei-Lung title: Analysis of the mutation dynamics of SARS-CoV-2 reveals the spread history and emergence of RBD mutant with lower ACE2 binding affinity date: 2021-01-19 journal: bioRxiv DOI: 10.1101/2020.04.09.034942 sha: 67bcc7dc61426b06d2738d038a9747c386a91eb5 doc_id: 957500 cord_uid: t8q99tlq Monitoring the mutation dynamics of SARS-CoV-2 is critical for the development of effective approaches to contain the pathogen. By analyzing 106 SARS-CoV-2 and 39 SARS genome sequences, we provided direct genetic evidence that SARS-CoV-2 has a much lower mutation rate than SARS. Minimum Evolution phylogeny analysis revealed the putative original status of SARS-CoV-2 and the early-stage spread history. The discrepant phylogenies for the spike protein and its receptor binding domain proved a previously reported structural rearrangement prior to the emergence of SARS-CoV-2. Despite that we found the spike glycoprotein of SARS-CoV-2 is particularly more conserved, we identified a receptor binding domain mutation that leads to weaker ACE2 binding capability based on in silico simulation, which concerns a SARS-CoV-2 sample collected on 27th January 2020 from India. This represents the first report of a significant SARS-CoV-2 mutant, and requires attention from researchers working on vaccine development around the world. Highlights Based on the currently available genome sequence data, we provided direct genetic evidence that the SARS-COV-2 genome has a much lower mutation rate and genetic diversity than SARS during the 2002-2003 outbreak. The spike (S) protein encoding gene of SARS-COV-2 is found relatively more conserved than other protein-encoding genes, which is a good indication for the ongoing antiviral drug and vaccine development. Minimum Evolution phylogeny analysis revealed the putative original status of SARS-CoV-2 and the early-stage spread history. We confirmed a previously reported rearrangement in the S protein arrangement of SARS-COV-2, and propose that this rearrangement should have occurred between human SARS-CoV and a bat SARS-CoV, at a time point much earlier before SARS-COV-2 transmission to human. We provided first evidence that a mutated SARS-COV-2 with reduced human ACE2 receptor binding affinity have emerged in India based on a sample collected on 27th January 2020. Monitoring the mutation dynamics of SARS-CoV-2 is critical for the development of effective approaches to contain the 22 pathogen. By analyzing 106 SARS-CoV-2 and 39 SARS genome sequences, we provided direct genetic evidence that 23 SARS-CoV-2 has a much lower mutation rate than SARS. Minimum Evolution phylogeny analysis revealed the putative 24 original status of SARS-CoV-2 and the early-stage spread history. The discrepant phylogenies for the spike protein and its 25 receptor binding domain proved a previously reported structural rearrangement prior to the emergence of SARS-CoV-2. 26 Despite that we found the spike glycoprotein of SARS-CoV-2 is particularly more conserved, we identified a receptor 27 binding domain mutation that leads to weaker ACE2 binding capability based on in silico simulation, which concerns a 28 SARS-CoV-2 sample collected on 27 th January 2020 from India. This represents the first report of a significant SARS-29 CoV-2 mutant, and requires attention from researchers working on vaccine development around the world. As of 24 th March 2020, there are a total of 174 nucleotide sequences for SARS-CoV-2 in the NCBI database. By restricting 124 to the complete or near-complete genomes, 106 sequences from 12 countries were obtained and used for further analyses. 125 This encompasses 54 records from USA, 35 from China, and the rest from other countries: Australia (1), Brazil (2), Finland 126 (1), India (2), Italy (1), Japan (3), Nepal (1), Spain (3), South Korea (1), and Sweden (1) . 127 Based on the gene model of the reference SARS-CoV-2 genome (GeneBank: NC_045512.2), a total of 12 protein-encoding 128 open reading frames (ORFs), plus the 5'UTR and 3'UTR were annotated ( Figure 1A) . Overall, the gene sequences from 129 different samples are highly homologous, sharing > 99.1% identity, apart from the 5'UTR (96.7%) and 3'UTR (98%) 130 (Table 1) , which are relatively more divergent. Sequence alignment showed that there is no mutation in ORF6, ORF7a, 131 and ORF7b. The genetic diversity profile across the 106 genomes was displayed in Figure 1A . A few nucleotide sites 132 within ORF1a, ORF1b, ORF3a, and ORF8 exhibiting high genetic diversity were identified ( Figure 1A) . 133 The S protein is critical for virus infection and vaccine development. As shown in Figure 1B , 12 single amino acid 134 substitutions in 12 genomes were identified for the spike glycoprotein, only one (R408I) of which occurs in the receptor 135 binding domain (RBD). This mutation concerns an accession collected from Kerala State, India on 27 th Jan 2020. 136 To track the occurrence of the R408I mutant, we checked the latest GISAID database (5 th Oct 2020) and confirmed that 137 there are a total of 17 SARS-CoV-2 strains containing the R408I mutation ( Table 1) : England (11), Egypt (2), Portugal (1), 138 Switzerland (1), and India (2). We believe that these numbers are still underestimated by the limited sequencing capacity 139 in respective countries. For example, there are only a total of 152 spike protein records for Egypt in the GISAID database. 140 Noteworthy, the latest R408I SARS-CoV-2 samples were collected on 10 th Sep 2020 and 1 st Sep 2020 from Switzerland 141 and England (Table 1) , respectively, indicating that this mutant is still actively spreading. To assess the mutation rate and genetic diversity of SARS-CoV-2, the ratio of nonsynonymous mutations (dN) and 154 synonymous mutations (dS) was calculated for each protein-encoding ORF based on the 106 SARS-CoV-2 and 39 SARS 155 genomes. For SARS-CoV-2, the highest dN was observed for ORF8 (0.0111), followed by ORF1a (0.0081), ORF9 (0.0079), 156 and ORF4 (0.0063) ( Table 2) , indicating these genes may be more likely to accumulate nonsynonymous mutations. In 157 contrast, ORF1b (0.0029), S gene (0.0040) encoding the spike protein, and ORF5 (0.0023) are relatively more conserved 158 in terms of nonsynonymous mutation. Noteworthy, ORF6, ORF7ab and ORF10 are strictly conserved with no 159 nonsynonymous mutation. Compared to SARS-CoV-2, SARS displayed higher mutation rates for all of the ORFs in the 160 virus genome (Table 1 ), suggesting an overall higher level of genetic diversity and mutation rate. In particular, the dN and 161 dS values for the S gene in SARS-CoV is approximately 12 and 7 times higher than that for SARS-CoV-2, respectively. In 162 contrast, the mutation rate differences for ORF1a and ORF1b between SARS-CoV-2 and SARS are relatively milder, 163 varying from 1.5 times to 4.8 times only. In contrast to SARS-CoV-2, which has strictly conserved ORF6, ORF7a, and 164 ORF7b, SARS displayed mutation rates at different levels. Notably, the dS for ORF10 are comparable between the two 165 genomes at 0.0326 and 0.0341, respectively. Table 2 . Mutation rate analysis on SARS-CoV-2 genes. Gene model is according to the SARS-CoV-2 reference genome (GeneBank: NC_045512.2). 167 S: spike glycoprotein. "Pair-wise identity" indicate the minimum pair-wise sequence identity among the 106 genomes. dN : nonsynonymous mutation; dS : 168 synonymous mutations. "--": not applicable. To trace the potential spread history of SARS-CoV-2 across the world, an unrooted Minimum Evolution (ME) tree of the 172 106 genomes was developed based on whole-genome sequence alignment. The clustering pattern of the ME phylogeny 173 identified. The only Brazil accession (MT126808.1) in this study is found to be clustered with one accession from USA 187 The spike glycoprotein is critical for virus infection. Recent study suggested that the S protein in SARS-CoV-2 may have 204 undergone a structural rearrangement 13 . To investigate this hypothesis, two separate phylogenies were developed based on 205 the full-S and RBD sequences, respectively. Human SARS-CoV-2, MERS, and SARS-CoV reference sequences and their 206 close coronavirus homologues identified from various animal hosts were included for the phylogenetic analyses. Overall, 207 the two phylogenies displayed similar clustering patterns, separating into three major clades (Figure 3) . SARS-CoV-2 was 208 identified in the same major clade and was clustered most closely with two bat SARS CoVs (highlighted in purple and 209 green colors, Figure 3 ) and the human SARS-CoV (orange color, Figure 3) The RBD of virus S protein binds to a receptor in host cells, and is responsible for the first step of CoV infection 3 . Thus, 225 amino acid mutation to RBD may have significant impact on receptor binding and vaccine development. The 3D structure 226 of the spike protein RBD of SARS-CoV-2 (PDB: 6VW1) has recently been determined in complex with human ACE2 227 receptor 6 . One of the 12 spike protein mutations identified above (Figure 1B) was located in the RBD region (R408I). 228 Further data screening against the latest GISAID and NCBI database (5 th Oct) revealed a total of 17 strains from five 229 countries containing the R408I mutation ( Table 1) . Sequence alignment showed that 408R is strictly conserved in SARS-230 CoV-2, SARS-CoV and bat SARS-like CoV (Figure 4A) . Based on the determined CoV2_RBD-ACE2 complex structure, 231 408R is located at the interface between RBD and ACE2, but is positioned relatively far away from ACE2 and thus does 232 not have direct interaction with ACE2 ( Figure 4B) . However, the determined RBD0-ACE2 structure showed that 408R 233 forms a hydrogen bond (3.3 Å in length) with the glycan attached to 90N from ACE2 (Figure 4C ) 6 . The hydrogen bond 234 may have contributed to the exceptionally higher ACE2 binding affinity. The arginine residue is also conserved in human 235 SARS-CoV (corresponding to 395R in PDB: 2AJF), but is positioned relatively distant (6.1 Å) from the glycan bound to 236 90N from ACE2 ( Figure S1) . Interestingly, the 408R-glycan hydrogen bond appears to be disrupted by the R408I mutation 237 in one SARS-CoV-2 accession (GeneBank ID: MT012098.1) (Figure 4D) , collected from India on 27 th Jan 2020. In silico 238 calculations indicatethat the R408I mutation increased the binding free energy by 0.93 kcal/mol, predicting a modest 239 decrease in ACE2-binding affinity. In contrast to the electrically charged and highly hydrophilic arginine residue, the 240 mutated isoleucine residue has a highly hydrophobic side chain with no hydrogen-bond potential (Figure 4E) we found that the spike protein (S) is more conserved that other genetic regions such as ORF1, ORF8, and N, which encode 256 nonstructural polyprotein, virus accessory protein, and the nucleocapsid protein, respectively. A relatively stable spike protein region of SARS-CoV-2 is a good indication for the epidemic control, as less mutation raises the hope of the rapid 258 development of a vaccine and antiviral drugs. Our results are consistent with several recent genetic variance analyses on 259 SARS-CoV-2 15,23-25 , which suggested the SARS-CoV-2 genomes are highly homogeneous. Furthermore, based on the 260 latest genomic data for SARS-CoV-2, molecular geneticists monitoring the virus development also suggested that the 261 mutation rate of SARS-CoV-2 maintains at a low level 17,26,27 . Whilst it is generally safe to say that SARS-CoV-2 tends to 262 mutate at a low rate, as the virus continues to spread rapidly around the world, and more genomic data is accumulated, the 263 evolution and mutation dynamics of SARS-CoV-2 still need to be monitored closely. 264 One critical aim of our study is to identify the original status of SARS-CoV-2 before its wide transmission across different 266 countries. Due to the short time space of sample collection and a relatively low mutation rate for SARS-CoV-2, we believe 267 that Minimum Evolution phylogeny (a parsimony method) may outperform other phylogenetic methods to achieve this 268 aim. Similarly, Peter et al. 28 also adopted a parsimony phylogeny (Maximum Parsimony) to trace the spread history of 269 SARS-CoV-2 in the early stage of the pandemic. Minimum Evolution and Maximum Parsimony are similar phylogeny 270 methods (both using the parsimony sites detected in the sequence alignment) trying to minimize the total number of 271 substitutions in the phylogenetic tree. In our analysis, the earliest few reported SARS-CoV-2 accessions collected from 272 Wuhan China were identified at the center of the phylogenetic tree with the shortest branch. Interestingly, several virus 273 genomes from USA were found to be identical to these putative original versions of the virus from Wuhan. According to 274 public media, the outbreak of SARS-CoV-2 in USA occurred relatively later than other countries. One possible explanation 275 for this observation is that, the spread of SARS-CoV-2 in USA might start much earlier than previously thought or reported. 276 Since a dominant proportion of the samples in this study were collected from China and USA, we observed a significantly 277 higher level of genetic diversity from these two countries. Most SARS-CoV-2 accessions from the other countries can find 278 their closely related sisters from either China or USA. This data bias, on the other hand, may give us an advantage to trace 279 the spread history of SARS-CoV-2 in different countries. This suggestion is reliable because all samples investigated in 280 this study were collected at the early stage of the pandemic, which may avoid the potential data noise caused by recent 281 published genomes of complex spread background. One notable finding in our phylogenetic tree is that, the singleton 282 SARS-CoV-2 accessions collected from Australia, Brazil, South Korea, Italy and Sweden were clustered together with two USA samples but without a Chinese version, suggesting that these infection cases may be somehow related. In addition, 284 one of the three samples collected from the cruise ship stranded in Japan was found to be closely related to a sample 285 collected from Guangzhou, China, whilst the other two were grouped with several cases from USA. Noteworthy, our 286 phylogeny seems to support the presence of two major types of SARS-CoV-2 in the target samples, suggesting the potential 287 existence of two spread sources. Interestingly, this speculation is corroborated by an independent clustering analyses using 288 a different phylogeny method 23 . 289 290 Until now, the origin of SARS-CoV-2, and how it has been transmitted to humans remains largely a mystery. Early genomic 291 data indicated that human SARS-CoV-2 is an enveloped, positive-sense, and single-stranded RNA virus in the subgenus 292 Sarbecovirus of the genus Betacoronavirus 13,14 . Evolutionarily, SARS-CoV-2 is most closely related to bat SARS-like 293 CoV (88% genome sequence identity) and human SARS CoV (79%), the latter of which caused a global pandemic in 2003 294 13 . Based on the strong genome sequence identity between SARS-CoV-2 and bat SARS-like COVs, it was initially 295 speculated that SARS-CoV-2 may have originated from bat 14,29 . However, a more recent study proposed that pangolin may 296 be the most likely reservoir hosts due to the identification of closely related SARS-COVs from this species as well 30 . Both 297 animals can harbor coronaviruses related to SARS-CoV-2. However, direct evidence of the transmission of SARS-CoV-2 298 from either bat or pangolin to human is still missing. 299 300 Prior to this study, several publications have suggested that SARS-CoV-2 may have originated from the genome 301 recombination of SARS-like CoVs from different animal hosts, as evidenced by the discrepant clustering patterns for the 302 phylogenies using different genetic regions. Lu 13 first observed that the RBD of S protein in SARS-CoV-2 is more closely 303 related to human SARS-CoV, whilst the other part of its genome is more similar to bat SARS-CoV. Later Lam 30 identified 304 a bat CoV_RaTG13 and several pangolin SARS-CoVs that are consistently closer to SARS-CoV-2 than human SARS-305 CoV in either full-S protein or RBD. By combining the data from these two studies, our study confirmed the observations 306 reported in both studies, and further determined that the S protein recombination actually happened between human SARS-307 CoV and a bat SARS-CoV, much earlier before its transmission to human, with the newly identified bat SARS-CoV-308 RaTG13 as an intermediate. 309 The RBD of S protein binds to a receptor in host cells and is responsible for the first step of CoV infection. The receptor 311 binding affinity of RBD directly affects virus transmission rate. Thus, it has been the major target for antiviral vaccine and 312 therapeutic development such as SARS 8 . At the time of first completion in late March 2020, this study was the first to 313 report the identification of the R408I mutation in the RBD of S protein in SARS-CoV-2. Since then, the R408I mutant has 314 attracted research attention from a significant number of researchers. Both computational and experimental studies have 315 been performed to further investigate its molecular characteristics and potential immune effects 31-38 . In addition, 316 commercial synthesis of the R408I recombinant RBD protein has been offered by serval companies (Acro Biosystems, 317 Creative Dianostics, SinoBiological, and Creative Biolabs) for immuno-binding and diagnostic testing. Noteworthy, Yan 318 et al. 31 showed that three of the four RBD neutralizing antibodies tested could not bind the R408I mutant, whereas other 319 mutants displayed strong binding interaction with all the neutralizing antibodies tested. The authors stated that 408R played 320 an essential role for SARS-CoV-2 RBD antibody binding and the R408I could abolish the antibody binding interaction 31 . 321 In addition, Zhe et al. 39 also suggested that R408I constitute the RBD epitope residues. These observations contrast an 322 early stage study 17 which did not notice the R408I mutation and predicted that a single vaccine may be sufficient for all 323 circulating SARS-CoV-2 variant. Based on the determined RBD-hACE2 protein structure (PDB: 6VW1) 6 , we found that 324 408R residue can establish an indirect receptor interaction via a glycan attached to human ACE2. This residue was found 325 to be conserved in SARS and MERS as well. Interestingly, the arginine residue (corresponding to 395R in SARS) has also 326 been shown to be involved in receptor interaction in SARS. In this study, we were the first to show that the R408I mutation 327 in SARS-CoV-2 is likely to cause a reduced binding affinity to human ACE2 receptor. Our result has been corroborated in 328 several independent studies later on 32-34 . Although the S protein gene seems to be more conserved than the other protein-329 encoding genes in the SARS-CoV-2 genome, our study provides direct evidence that a mutated version of SARS-CoV-2 S 330 protein with varied transmission rate may have already emerged. Furthermore, we confirmed that, as of 5 th Oct 2020, a 331 total of 17 SARS-CoV-2 strains containing the R408I mutation were present in the GISAID and NCBI databases, with the 332 latest R408I mutants collected on 10 th Sep 2020 and 1 st Sep 2020 from Switzerland and England, respectively. These results 333 suggest that the R408I mutant may spread across to different countries since its first emergence from India and is still 334 actively spreading in different regions. Benson et al. 40 recently reported that R408I accounts for ~2% of infection cases in Potential interventions for novel coronavirus in China: A systematic review Timely development of vaccines against SARS-CoV-2 Structure of SARS coronavirus spike receptor-binding domain 368 complexed with receptor Evidence for a Common Evolutionary Origin of Coronavirus Spike Protein Receptor-Binding Subunits Cryo-EM structure of the 2019-nCoV spike in the prefusion 372 conformation Structural basis of receptor recognition by SARS-CoV-2 Angiotensin receptor blockers as tentative SARS-CoV-2 therapeutics The spike protein of SARS-CoV -a target for vaccine and 377 therapeutic development Structural basis for the recognition of SARS-CoV-2 by full-length 379 human ACE2 Proof of principle for epitope-focused vaccine design Influenza Virosomes in Vaccine Development Coronaviruses An RNA proofreading machine 385 regulates replication fidelity and diversity Genomic characterisation and epidemiology of 2019 novel coronavirus: 387 implications for virus origins and receptor binding Evolution of the novel coronavirus from the ongoing Wuhan outbreak and 389 modeling of its spike protein for risk of human transmission Tang X. On the origin and continuing evolution of SARS-CoV-2. Microbiology A SARS-CoV-2 vaccine candidate would likely match all currently 393 circulating variants AnnotationSketch: a genome annotation drawing 395 library MUSCLE: multiple sequence alignment with high accuracy and high throughput MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger 399 datasets ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R Phylogenetic analysis by maximum likelihood On the origin and continuing evolution of SARS-CoV-2 Emerging SARS-CoV-2 mutation hot spots include a novel RNA-406 dependent-RNA polymerase variant Genome during the Beginning Months of the Outbreak in USA Geographic and Genomic Distribution of SARS-CoV-2 Mutations The interplay of SARS-CoV-2 evolution and 412 constraints imposed by the structure and functionality of its proteins Phylogenetic network analysis of SARS-CoV-2 genomes A pneumonia outbreak associated with a new coronavirus of probable 416 bat origin Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins Cross-neutralization antibodies against SARS-CoV-2 and RBD mutations from 420 convalescent patient antibody libraries Comparative genome analysis of novel coronavirus (SARS-CoV-2) from 422 different geographical locations and the effect of mutations on major target proteins: Anin silicoinsight Exploring the genomic and proteomic variations 425 of SARS-CoV-2 spike glycoprotein: A computational biology approach Insights into the structural and dynamical changes of spike 428 glycoprotein mutations associated with SARS-CoV-2 host receptor binding The Impact of Mutations in SARS-CoV-2 Spike on Viral Infectivity Characterizations of SARS-CoV-2 432 mutational profile, spike protein stability and viral transmission Identification of twenty-five mutations in surface glycoprotein (Spike) of SARS-CoV-2 among Indian isolates and their impact on protein dynamics Stereotypic Neutralizing VH Clonotypes Against SARS-CoV-2 RBD Patients and the Healthy Population Structural basis for neutralization of SARS-CoV-2 and SARS-CoV by a potent 438 therapeutic antibody Analysis of SARS-CoV-2 genomes from across Africa reveals 440 potentially clinically relevant mutations Preliminary identification of potential vaccine targets for the COVID-442 19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies capability in respective countries. Based on the close relationship of SARS-CoV-2 to SARS, current vaccine and drug 337 development for SARS-CoV-2 has also focused on the S protein and its human binding receptor ACE2 7,41 . Considering 338 the significantly varied antibody binding profile for R408I, we propose that this mutant still requires significant attention 339 from doctors and scientists around the world during the development of SARS-CoV-2 therapeutic solutions. One suggestion 340 for the next step of therapeutic development is to focus on the identification of potential human ACE2 receptor blocker, as 341 suggested in a recent commentary 7 . This approach will avoid the above-mentioned challenge faced by vaccine 342 development. 343 The authors would like to thanks the relevant research community for making the genomic data available to the public. We