key: cord-0883905-7pi08r3d
authors: Kubik, Slawomir; Arrigo, Nils; Bonet, Jaume; Xu, Zhenyu
title: Mutational hotspot in the SARS-CoV-2 Spike protein N-terminal domain conferring immune escape potential
date: 2021-05-28
journal: bioRxiv
DOI: 10.1101/2021.05.28.446137
sha: acf58703968c6fac0c5f4ab0cd2e144efc6e312a
doc_id: 883905
cord_uid: 7pi08r3d

Global efforts are under way to monitor the evolution of SARS-CoV-2, aiming at early identification of mutations with the potential to increase viral infectivity or virulence. We report a striking increase in the frequency of recruitment of diverse substitutions at a critical residue (W152), positioned in the N-terminal domain (NTD) of the Spike protein, observed repeatedly across independent phylogenetic and geographical contexts. We investigate the impact these mutations might have on the evasion of neutralizing antibodies. Finally, we find that the NTD is a region exhibiting a particularly high frequency of mutation recruitments, suggesting an evolutionary path on which the virus maintains optimal efficiency of ACE2 binding combined with the flexibility that facilitates immune escape.

RNA viruses display particularly high mutation rates (1), with SARS-CoV-2 undergoing approximately 10^-3 substitutions/site/year (2). Globally, selective pressure imposes conservation of adaptive mutations that facilitate viral spread. The overall success of viral transmission depends on the mutation rate, the extent of the immune response, and the population size (3). During the pandemic, when the population size is large, a rapid increase in the frequency of alterations is observed at critical positions of the viral genome. Two commonly reported forces shaping natural selection in SARS-CoV-2 are adaptation to the host (4) and evasion of the immune response (5), including immunity triggered by vaccines (6).
Consequently, the evolutionary rate is particularly high for the S gene encoding the Spike protein (7), the main contact point with the ACE2 receptor of the host cell (8). Importantly, Spike also serves as the immunizing agent in the majority of COVID-19 vaccines (9). It is expected that mutations improving viral fitness emerge independently across unrelated viral clades. An example of an adaptive mutation that emerged relatively early during the pandemic is the D614G substitution in Spike, which by the end of 2020 was present in almost every SARS-CoV-2 genome in the world (10) and is believed to improve the interaction of the Spike trimer with ACE2 (4, 11). Since the last months of 2020, an increase in the frequency of other mutations has been observed, with N501Y and E484K being two prominent examples. The mechanisms by which they confer an evolutionary advantage to SARS-CoV-2 vary. In particular, N501Y increases adaptation to the host by enhancing the interaction with the ACE2 receptor (12-14), resulting in more efficient transmission (15). In contrast, E484K appears to be selectively advantageous by decreasing the strength of interaction with neutralizing antibodies (5, 16, 17), which facilitates evasion of the immune response. More recently, the L452R substitution was reported to have properties similar to those of E484K (5, 16, 18, 19). Importantly, these mutations have arisen independently within diverse, unrelated genomic contexts and at distant geographical locations, making them examples of convergent evolution. Moreover, it may be expected that certain genomic positions under strong negative frequency-dependent selection, as expected in the context of immunity-escaping processes (20), will display a diverse spectrum of mutations. Adaptive traits require close monitoring, particularly because they are likely to become increasingly prominent within SARS-CoV-2 strains under the current global vaccination efforts aimed at establishing herd immunity.
Several studies focused on evaluating the potential impact of mutations on viral spread and antibody evasion (16, 21-27). Most investigations focused on the receptor binding domain (RBD) of the Spike, the immunodominant part of the protein (28) containing the ACE2-interacting interface. However, mutations at sites outside of the RBD, such as D614, might also have a strong impact on both infectivity and immune escape. For example, the N-terminal domain (NTD) of the Spike was shown to be a potent target for neutralizing antibodies (6, 29, 30). By screening SARS-CoV-2 genome sequences for residues undergoing frequent and diverse mutations, we pinpointed W152, a residue present in the NTD, whose alterations have the potential to be advantageous for viral transmission. We identified that several substitutions, leading to a limited set of amino acid changes at position W152, were independently recruited numerous times across many distantly related phylogenetic contexts and diverse geographical locations, suggesting their adaptive character. Insights from structural studies confirm that the identified W152 substitutions remove an important interaction point for multiple potent neutralizing antibodies. Furthermore, we demonstrate that mutations in the NTD were recruited more frequently than in other regions of Spike during the second wave of the pandemic, likely because they improve viral fitness through immune escape. Our work highlights the importance of monitoring individual mutations occurring outside of the Spike RBD.

We extended our analysis to publicly available data included in the Audacity global COVID phylogeny along with all Spike protein sequences (1,028,876 entries; spikeprot0406.fasta) deposited in GISAID as of 2021/04/11.
Protein sequences were aligned against the Spike reference (YP_009724390.1 as obtained from NCBI, as of 2021/04/11) using MUSCLE v3.8.31 (34) with default parameters for protein analysis and converted into VCF files using custom R scripts (35). The Audacity phylogeny and VCF files were then merged to obtain a phylogeny of 566,422 tips (391,504 internal nodes) with Spike protein information available for all tips.

Our analysis aimed at inventorying independent recruitments of Spike mutations. From a phylogenetic standpoint, the task required grouping SARS-CoV-2 genomes holding a Spike mutation of interest into sets of genomes that shared a common ancestor (i.e. "clades"). Assuming rare recombination, the most recent common ancestor (i.e. "mrca") of a clade marked the "recruitment event" at which the mutation of interest arose in the tree. The remainder of the clade then represented transmission of the mutant to new hosts and the creation of a contagion cluster. Because the size of the Audacity tree rendered most ancestral character reconstructions intractable, we opted for an ad hoc heuristic that iteratively delineated clades and identified the respective mrca of mutation-carrying sequences. To this end, we applied a tree walk algorithm that identified clades given a tree topology and a set of tip states. Our heuristic proceeded as follows (see

We assessed the effect of W152 mutants on neutralizing antibody (nAb) recognition by generating single point mutants (W152C, W152L and W152R) and evaluating their changes in binding free energy (ddG) against 5 different NTD-targeting antibodies (1-87 and 5-24 (36), 4A8 (29), FC05 (37) and S2X333 (38); structures obtained from the Protein Data Bank (39)) using Rosetta (40). nAbs were selected based on availability and interaction angle (36) to provide a broad view of the possible scenarios.
For each experiment (mutant-antibody pair), side chain minimization was performed after the mutation and before the ddG analysis. As minimization in Rosetta is a stochastic process, a total of 100 decoys were generated for each experiment to define a distribution of ddG values. Finally, all ddG decoys (regardless of the mutation) for a given antibody were normalized to the distribution obtained with wild-type (WT) Spike for that antibody.

We screened SARS-CoV-2 genomes present in the DDM database (see Methods section) in order to identify novel, potentially concerning mutations within the S gene, defined as (i) multiple nonsynonymous substitutions present at a single position, (ii) displaying increased frequency in comparison with the global frequency, (iii) recruited independently across multiple lineages and (iv) across multiple geographical locations. This approach identified distinct mutations at position W152 of the Spike NTD resulting in substitution of tryptophan with leucine (W152L) or arginine (W152R) (

We investigated the recruitment dynamics of W152 mutations in global datasets deposited in the GISAID database. Due to a relatively low number of depositions in weeks 76-80 of the pandemic compared with the preceding weeks (Figure S1B), we only took into account depositions with a collection date up to week 75. Rapid and steady growth was observed in the number of submitted sequences bearing three W152 substitutions (W152C, W152L and W152R) during the second wave of the pandemic, in the period between December 2020 and April 2021 (weeks 56-75 of the pandemic) (Figure 1A and 1B). These substitutions were associated with 171 independent recruitments since the beginning of the pandemic (Figure 1C), with only sporadic cases (14.6%, 25/171) reported during the first wave (until week 55; Figure 1D).
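As an illustration of what counting independent recruitments involves, the following minimal Python sketch delineates maximal clades in which every tip carries a mutation of interest and treats the root of each such clade as one recruitment event. It is a simplified stand-in, not the heuristic used in this study: the `Node` class, the function name `recruitment_clades`, the strict all-tips-mutant criterion, and the toy tree are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    """Hypothetical rooted-tree node; a node without children is a tip."""
    name: str
    children: List["Node"] = field(default_factory=list)


def recruitment_clades(root: Node, mutant_tips: Set[str]) -> List[List[str]]:
    """Return one list of tip names per maximal clade whose tips are all mutant.

    Assumption: a clade counts as a single recruitment only if every tip in it
    carries the mutation. Recursion is fine for toy trees; very large trees
    would need an iterative traversal.
    """
    clades: List[List[str]] = []

    def walk(node: Node) -> Tuple[bool, List[str]]:
        if not node.children:
            return node.name in mutant_tips, [node.name]
        results = [walk(child) for child in node.children]
        tips = [t for _, child_tips in results for t in child_tips]
        if all(is_mutant for is_mutant, _ in results):
            return True, tips  # still all-mutant; the clade may extend upward
        # Mixed node: each all-mutant child subtree is a maximal clade.
        for is_mutant, child_tips in results:
            if is_mutant:
                clades.append(child_tips)
        return False, tips

    all_mutant, tips = walk(root)
    if all_mutant:
        clades.append(tips)
    return clades


# Toy tree ((A,B),(C,(D,E))) with the mutation observed in tips A, B and D.
tree = Node("root", [
    Node("n1", [Node("A"), Node("B")]),
    Node("n2", [Node("C"), Node("n3", [Node("D"), Node("E")])]),
])
clades = recruitment_clades(tree, {"A", "B", "D"})
print(len(clades), "recruitment events:", clades)
```

On the toy tree, tips A and B form one clade and tip D forms a singleton, so two recruitment events are counted. Cluster size in this sketch is simply the number of tips in each returned clade.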
The largest cluster was reported for W152C, with over 13,000 occurrences in 30 countries (including the CAL.20C lineage, referred to as the "California variant"), and the second-largest contained almost 1,500 sequences bearing W152L and was present in 20 countries. However, most clusters were relatively small (≤5 sequences reported for 86% of clades [147/171]) and present in only 1 country (Figure S1C). This observation pointed to frequent and independent recruitments rather than to the spread of viruses bearing W152 mutations through cross-border transmission. During weeks 65-75, the number of independent W152 mutation recruitments was in the range between the 90th and 99th percentile among independent mutation recruitments reported for all Spike positions (Figure 1E). These observations placed W152 among the most dynamic Spike positions in terms of mutation recruitments, just behind N501 and E484, and ahead of L452. W152 was also one of only two Spike residues with 3 independent substitutions each having at least 500 occurrences in GISAID, the other being another NTD residue, D80 (Figure S1D).

In order to investigate whether W152 substitutions might confer a direct evolutionary advantage to SARS-CoV-2, we investigated the Spike mutations most frequently co-occurring with each W152 substitution (Figure 2A) W152 substitution did not co-occur with any of the prominent RBD mutations suspected of being advantageous (Figures 2B and 2C). However, the majority of sequences in each of the largest clusters for individual substitutions contained at least one of these: L452R (for W152C), E484K (W152L) or N501Y (W152R) (Figure 2C). After excluding the single largest cluster for each substitution, the fraction of sequences without an adaptive RBD mutation was 64%. Importantly, our phylogenetic analysis indicated Figure 3A).
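The normalization of ddG decoys to the wild-type distribution, described for the binding free energy analysis summarized in Figure 3A, can be illustrated with a short Python sketch. The text does not state the exact normalization, so the z-score against the WT mean and standard deviation used here is an assumption, as are the function name `normalize_to_wt` and the toy numbers, which are not results from this study.

```python
from statistics import mean, stdev
from typing import Dict, List


def normalize_to_wt(ddg: Dict[str, List[float]], wt_key: str = "WT") -> Dict[str, List[float]]:
    """Express each decoy ddG as a z-score against the WT decoy distribution.

    Assumed scheme: subtract the WT mean and divide by the WT sample standard
    deviation, separately for each antibody (one call per antibody).
    """
    wt = ddg[wt_key]
    mu, sigma = mean(wt), stdev(wt)
    if sigma == 0:
        raise ValueError("WT decoys have zero spread; z-score is undefined")
    return {
        variant: [(x - mu) / sigma for x in values]
        for variant, values in ddg.items()
    }


# Made-up decoy values for a single antibody (arbitrary units, illustration only).
toy = {"WT": [1.0, 2.0, 3.0], "W152L": [4.0, 5.0]}
normalized = normalize_to_wt(toy)
print(normalized)
```

With this scheme, the WT decoys are centered on zero, and positive values for a mutant indicate binding energies shifted above the WT distribution for that antibody.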
In all cases we observed a residue in the Ab chain engaged in a pi stacking interaction with W152 in the wild-type Spike protein (Figure 3B). The 1-87 and 4A8 nAbs wrap W152 inside one of the CDR loops, making it a key position contributing to the interaction. In 5-24 the position is located just outside of the interface, but close enough for minimal conformational changes to allow it to participate in the binding. FC05 and S2X333 use W152 for secondary interactions; thus the contribution of the residue to the interface is negligible and its mutation can be easily compensated by a side chain movement. The stacking interactions were lost with any of the considered mutations, effectively decreasing the affinity of the nAb for the NTD (Figure 3A). As expected, the effect varied depending on the extent of W152 participation in the main interacting surface. Thus, in the case of the nAbs 1-87 (Figure 3C) and 4A8 (Figure 3D), a significant drop in affinity was observed, while in peripherally interacting nAbs such as 5-24, FC05 and S2X333 the effect was small or negligible (Figure 3A). In the case of 1-87 and 4A8, the amino acid substitution not only results in the loss of the pi stacking interaction but also affects the pocket generated by the antibody's CDR loop, which leads to a substantial loss of binding affinity. The effects are especially drastic for W152L due to the exposure of the hydrophobic side chain.

The W152 residue is located in the vicinity of NTD mutations that have received increasing attention owing to their recent emergence in the variants of concern (Figure 4A; L18, H69 or Y144). The NTD constitutes an exposed part of the Spike protomer, making it a prominent target for antibodies, yet, contrary to the RBD, mutations within the NTD have potentially little impact on receptor binding. We investigated the possibility that residues present in the NTD undergo extensive mutagenesis facilitating immune escape without hampering the interaction with ACE2.
During the first wave of the pandemic (until week 55), mutation recruitments were distributed relatively uniformly across Spike domains (Figure 4B). In contrast, during the second wave (i.e. week 56 and onwards) the NTD displayed an elevated number of mutation recruitments compared with other parts of the protein (Figure 4C). This localized bias in diversity was statistically significant (Figure 4D) and could result from adaptive changes in response to global immunity. The number of mutation recruitments per position correlated significantly with the evolutionary lability of the Spike protein observed across Coronaviridae (Figure S2; p < 10^-5 for each tested region). Nevertheless, recruitment events were significantly more common for the NTD in comparison to the RBD and the remainder of the Spike protein residues, confirming a similar evolutionary pattern for the related viruses. These observations suggest that, globally, NTD mutations have a higher propensity to be advantageous than alterations in other regions of the protein.

The RBD and NTD constitute the most exposed parts of the Spike, making them the most likely targets of the immune response. As demonstrated by us and others, mutations in these domains often facilitate immune evasion (5, 6, 16, 17, 21, 24, 25, 36). However, RBD mutations are more evolutionarily constrained due to their role in the interaction with ACE2. Given the relatively small contribution of the NTD to ACE2 binding, alterations in this domain might constitute the evolutionary 'disguise' the virus uses to avoid antibody neutralization. The variability is not restricted to amino acid substitutions, as a significant number of deletions were also reported in the NTD and linked to immune escape (42). Identification of potential nAbs that can be used against different variants of the RBD and NTD is of great importance, considering that antibody cocktails might act collaboratively to impede the progression of viral infection (37).
We identified W152 as an NTD residue undergoing particularly extensive evolutionary dynamics, highlighted by multiple individual substitutions emerging across many phylogenetic and geographical contexts. Remarkably frequent mutation recruitment events were reported at this position globally, with a clear increase in intensity since the end of 2020 (week 55 of the pandemic). The largest clusters reported for each of the three frequent substitutions (W152L, W152R or W152C) were characterized by the co-recruitment of one of the prominent, adaptive RBD mutations (E484K, N501Y or L452R, respectively). Although the contribution of W152 mutations to those three particular events could not be decoupled from that of their co-occurring alterations in the RBD, our results suggest an adaptive role of the W152 substitutions, as most of their recruitment events did not occur in parallel with RBD mutations. In line with this finding, a recent study demonstrated that W152C further increases B.1.429 infectivity in comparison to L452R alone (18). It is generally appreciated that advantageous mutations initially arise as quasispecies, present only in a fraction of viral genomes within a given host. Providing a competitive edge, they progressively increase in prevalence and are eventually transmitted to new hosts, giving rise to new clades responsible for infection clusters. In this regard, the advantage conferred by W152 mutations might be exemplified by a reported increase in the intra-host fraction of genomes bearing the W152L substitution during the infection (43). Significant efforts are spent on tracking the spread of specific variants of concern, such as the B.
[Figure residue: two sets of Spike mutation labels]
L5F S12F S13I P26S 52del92 57del82 61del73 61del83 HV69del D80A 109del35 116del28 143del2 Y144del 189del D215G D215Y A222V LLA241del S255F W258L A262S P272L R346S N439K L452R E484K N501Y A570D D614G H655Y Q677H P681H P681R A701V T716I G769V A899S S982A 1072del D1118H G1167R D1184H S1252F
L5F S13I 13del1 L18F T20I P26S 69del2 114del2 D138Y 144del1 A222V G257S 261del61 262del8 282del34 312del1 L452R E484K N501Y 531del1 A570D D614G Q677H P681H A701V T716I T732A S982A T1100A

1. Why are RNA virus mutation rates so damn high?
2. No evidence for increased transmissibility from recurrent mutations in SARS-CoV-2
3. Periodic versus Intermittent Adaptive Cycles in Quasispecies Coevolution
4. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
5. Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition
6. SARS-CoV-2 immune evasion by variant B.1.427/B.1.429. Immunology
7. One Year of SARS-CoV-2: How Much Has the Virus Changed? Biology
8. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet
9. SARS-CoV-2 vaccines in development
10. SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo
11. Structural and Functional Analysis of the D614G SARS-CoV-2 Spike Protein Variant
12. Experimental evidence for enhanced receptor binding by rapidly spreading SARS-CoV-2 variants
13. Enhanced binding of the N501Y mutated SARS CoV 2 spike protein to the human ACE2 receptor: insights from molecular dynamics simulations
14. Mutation N501Y in RBD of Spike Protein Strengthens the Interaction between COVID-19 and its Receptor ACE2
15. The N501Y spike substitution enhances SARS-CoV-2 transmission
16. Complete map of SARS-CoV-2 RBD mutations that escape the monoclonal antibody LY-CoV555 and its cocktail with LY-CoV016
17. Evidence of escape of SARS-CoV-2 variant B.1.351 from natural and vaccine-induced sera
18. Transmission, infectivity, and neutralization of a spike L452R SARS-CoV-2 variant
19. Identification of SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization
20. Negative Frequency-Dependent Selection Is Frequently Confounding
21. Prediction and mitigation of mutation threats to COVID-19 vaccines and antibody therapies
22. Multiple SARS-CoV-2 variants escape neutralization by vaccine-induced humoral immunity
23. Effect of natural mutations of SARS-CoV-2 on spike structure, conformation and antigenicity
24. Landscape analysis of escape variants identifies SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization. Microbiology
25. Prospective mapping of viral mutations that escape antibodies used to treat COVID-19
26. Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7. Nature
27. Escape from neutralizing antibodies by SARS-CoV-2 spike protein variants. eLife
28. Mapping Neutralizing and Immunodominant Sites on the SARS-CoV-2 Spike Receptor-Binding Domain by Structure-Guided High-Resolution Serology
29. A neutralizing human antibody binds to the N-terminal domain of the Spike protein of SARS-CoV-2. Science
30. Potent neutralizing antibodies against multiple epitopes on SARS-CoV-2 spike
31. Recommendations for accurate genotyping of SARS-CoV-2 using amplicon-based sequencing of clinical samples
32. Nextstrain: real-time tracking of pathogen evolution
33. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
34. MUSCLE: multiple sequence alignment with high accuracy and high throughput
35. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Austria
36. Potent SARS-CoV-2 Neutralizing Antibodies Directed Against Spike N-Terminal Domain Target a Single Supersite
37. Structure-based development of human antibody cocktails against SARS-CoV-2
38. N-terminal domain antigenic mapping reveals a site of vulnerability for SARS-CoV-2
39. The Protein Data Bank
40. RosettaScripts: A Scripting Language Interface to the Rosetta Macromolecular Modeling Suite
41. Emergence of a Novel SARS-CoV-2 Variant in Southern California
42. Recurrent deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape
43. Intra-host non-synonymous diversity at a neutralizing antibody epitope of SARS-CoV-2 spike protein N-terminal domain
44. Structural analysis of full-length SARS-CoV-2 spike protein from an advanced vaccine candidate
45. Household transmission of SARS-CoV-2 R.1 lineage with spike E484K mutation in Japan
46. COVID-19 Outbreak Associated with a SARS-CoV-2 R.1 Lineage Variant in a Skilled Nursing Facility After Vaccination Program - Kentucky