key: cord-0712972-hngu8fr2
authors: Ferrareze, Patrícia Aline Gröhs; Franceschi, Vinícius Bonetti; de Menezes Mayer, Amanda; Caldana, Gabriel Dickin; Zimerman, Ricardo Ariel; Thompson, Claudia Elizabeth
title: E484K as an innovative phylogenetic event for viral evolution: Genomic analysis of the E484K spike mutation in SARS-CoV-2 lineages from Brazil
date: 2021-03-24
journal: bioRxiv
DOI: 10.1101/2021.01.27.426895
sha: 8e39f9905b6882a44779baefb9262608048e1ec8
doc_id: 712972
cord_uid: hngu8fr2

The COVID-19 pandemic caused by SARS-CoV-2 has affected millions of people since its beginning in 2019. The propagation of new lineages and the discovery of key mechanisms adopted by the virus to overlap the immune system are central topics for the entire public health policies, research and disease management. Since the second semester of 2020, the mutation E484K has been progressively found in the Brazilian territory, composing different lineages over time. It brought multiple concerns related to the risk of reinfection and the effectiveness of new preventive and treatment strategies due to the possibility of escaping from neutralizing antibodies. To better characterize the current scenario we performed genomic and phylogenetic analyses of the E484K mutated genomes sequenced from Brazilian samples in 2020. From October, 2020, more than 40% of the sequenced genomes present the E484K mutation, which was identified in three different lineages (P1, P2 and B.1.1.33) in four Brazilian regions. We also evaluated the presence of E484K associated mutations and identified selective pressures acting on the spike protein, leading us to some insights about adaptive and purifying selection driving the virus evolution.

The etiological agent of COVID-19, named SARS-CoV-2, belongs to the Coronaviridae family composed of enveloped positively oriented single-stranded RNA viruses . The first cases of the disease were reported in Wuhan province, China, in December 2019. Since then the number of infected people has increased exponentially worldwide and the World Health Organization (WHO) declared it a pandemic on 11 March 2020. As of 24 January, 2021, the number of confirmed cases globally has scaled up to 99 million, with over 2 million deaths. Brazil is the third country in total number of confirmed cases, already exceeding 8.8 million infected people and 216,445 deaths (Johns Hopkins Coronavirus Resource Center, 2021) .

Nearly 410,000 genomes sequenced by researchers from several countries are available on the Global Initiative on Sharing All Influenza Data (GISAID) (Shu and McCauley, 2017) , which is crucial to investigate the SARS-CoV-2 genomic epidemiology. A standardized nomenclature was proposed to reflect genetic characteristics and the viral geographical spread patterns. Two ancestral lineages, A and B, have been described containing sublineages that are distinguished by different recurrent mutations and phylogenetic profiles (Rambaut et al., 2020a) . More recently special attention has been directed to the understanding of aspects that might impact the virulence and transmissibility of SARS-CoV-2 (Korber et al., 2020; Li, 2020; Toyoshima et al., 2020; Volz et al., 2020) and the influence of different mutations in the effectiveness of COVID-19 vaccines and immunotherapies (Andreano et al., 2020; Pinto et al., 2020; Yu et al., 2020) . Some viral lineages have been studied in more detail, mainly those carrying mutations in the spike (S) glycoprotein, since it is the binding site to the human ACE2 (hACE2) receptor, an essential step to invade the host cell. Currently, there are three lineages of major worldwide concern: B.1.1.7, B.1.351, and P.1. The former emerged in England in mid-September 2020 and it is characterized by 14 lineage-specific amino acid substitutions and has been rapidly spread across the UK and Europe ever since its first appearance (Rambaut et al., 2020b) . Two substitutions present in this lineage deserve special attention: N501Y in the Receptor Binding Domain (RBD) of S1 and P681H near the polybasic RRAR sequence in the furin-like cleavage region. N501Y is one of the key contact residues interacting with hACE2 and P681H is one of four residues comprising the insertion that creates a furin-like cleavage site between S1 and S2, which is not found in closely-related coronaviruses (Xia et al., 2020) . The second lineage probably emerged in South Africa in August 2020 and harbors three mutations in RBD: K417N, E484K, and N501Y (Tegally et al., 2020) . The most recent lineage is P.1, derived from B.1.1.28, which was recently identified in returning travelers from Manaus (Amazonas state, Brazil) who arrived in Japan. It has almost the same three mutations present in RBD as the South African lineage, except for the substitution in amino acid site 417, where the original lysine (K) is substituted for a threonine (T), instead of an asparagine (N). Its appearance is likely to have arisen independently (Faria et al., 2021) . In Brazil, the E484K mutation (glutamic acid to lysine substitution at amino acid 484) also arose independently and was identified in Rio de Janeiro state (Southeast Brazil) in early-October carried by the P.2 lineage (Voloch et al., 2020) .

Currently, the E484K mutation is found in several viral genomes from Brazil. Due to its rapid spread, the independent origins and the potential implications in vaccination and passive immune therapies, E484K has received particular attention ever since. Importantly, B.1.351, P.1, and P.2 carry this substitution associated with escape from neutralizing antibodies (Baum et al., 2020; Greaney et al., 2020; Weisblum et al., 2020) . Recently, this mutation was also identified in a sample from a reinfected patient (Nonaka et al., 2021) .

These two events are evidence of the virus' potential for immune escape.

Moreover, B.1.1.7, B.1.351, and P.1 also harbor N501Y mutation, which is associated with enhanced receptor affinity (Starr et al., 2020) and increased infectivity and virulence in a mouse model (Gu et al., 2020) . The presence of E484K and N501Y substitutions in the same SARS-CoV-2 genome may be particularly relevant for viral evolution. This combination was shown to induce more conformational changes than the N501Y mutant alone, potentially altering antibodies' complementarity to this region resulting in the above mentioned immune escape phenomena (Nelson et al., 2021) .

Structural analysis points to E484K as potentially the most crucial mutation so far. It creates a new site for the amino acid 75 hACE-2 binding. This interaction seems even stronger than the binding between hACE-2 and the original main site located at position 501 (at RBD and hACE-2 interface) (Nelson et al., 2021) . We speculate that the consequent neutralization escape due to E484K alone or as a part of a larger array of distinct mutations might act as a common evolutionary solution for several different viral lineages. Due to the independent origin of two current known lineages harboring E484K in Brazil, we aimed to describe the mutation patterns of these lineages, investigate phylogenetic relationships and perform positive selection tests to identify if adaptive evolution has acted as a major evolutionary force leading to the increase of amino acid variability in the RBD sites.

Firstly detected in the South African B.1.351 lineage, the S1 protein mutation E484K is now present in new emerging variants from Brazil. The analysis of genomes containing E484K, downloaded from the GISAID database, showed a distribution of 169 amino acid residues corresponding to nonsynonymous mutations and four amino acid residue deletions in 134 Brazilian samples (Table S1 ) collected between October and December 2020 ( Figure   S1 ). Regarding synonymous mutations, 217 genome positions had transitions ( Figure 1 and Table S2 ) and 113 genome positions had associated transversions ( Figure 1 and Table S2 ).

The arising of E484K mutated genomes with at least four different associated mutation patterns for lineages P.1, P.2, and B.1.1.33 can be seen over time in the aligned genomes ( Figure S1 and Table S1 ). The B.1.1.33 lineage carrying the E484K mutation were found in São Paulo (n=3) and Amazonas (n=1), while P.1 sequences were found only in the Amazonas state (n=19).

All other sequences (n=111) belong to P.2 lineage, the most widespread lineage considering sequenced data available so far. The Northeast region was represented by sequences from Alagoas (n=2), Paraiba (n=3), and Bahia (n=1). The North region is represented by Amazonas sequences (n=6). Southeast sequenced 44 P.2 genomes, 36 from Rio de Janeiro and 8 from São Paulo. Southern Brazil classified 55 genomes as P.2, 5 from Parana and 50 from Rio Grande do Sul, reinforcing its probable dissemination to all regions of Brazil ( Figure   3 ).

The worldwide emergence of E484K began in March 2020, with three sequences represented first. A significant increase was observed in October (n=86), followed by successive increases in November (n=366) and December (n=374) ( Figure 4A ). A similar trend was observed in Brazil, as E484K was observed in sequences obtained from October (n=31), November (n=87), and December (n=40). Most importantly, the proportion of genomes carrying this replacement was 39.7%, 43.9%, and 43.5% from October to December in comparison to all genomes sequenced from Brazil ( Figure 4B ). We believe this apparent slow-growing pattern in Brazil is due to heterogeneous and limited initiatives of sequencing in the country, and probably this substitution is already widespread through Brazilian states, as its harbored by three distinct and apparently independent evolving lineages.

In order to obtain a reliable detection of sites submitted to positive/adaptive or negative/purifying selection, a random set of Brazilian genomes was tested with different approaches. Regarding individual site models, the Bayesian inference FUBAR (Fast, Unconstrained Bayesian AppRoximation) identified eight sites evolving under adaptive selection (positive selection) in the spike sequence (Table 2) , with calculated synonymous and nonsynonymous average rates of 1.227 and 0.857, respectively. For these residues under adaptive pressure, six are included in known mutation sites of spike protein, including E484K (L5F, S12F, P26S, D138Y, A688V).

The analysis with the codeml program from the PAML package confirmed all the positively selected sites predicted by FUBAR, also indicating a highly variable selective pressure among sites by the M3 vs. M0 (p < 0.001) comparison and positive selection by the M2 vs. M1 and M8 vs. M8a model comparison (p < 0.05) ( Table 3, Table 3 ). The ω estimative

is a common method to detect positive selection, since it assumes that synonymous substitutions are neutral and the nonsynonymous are subject to selection.

Consequently, a ω statistically higher than 1 would indicate the action of positive selection or a relaxed selective constraint, whereas low d N /d S values would mean conservation of the gene product due to purifying selection (Tennessen, 2008) . Additionally to the FUBAR results, three sites were also considered with ω > 1 (Table 3 and 

The permanence of a pathogen inside a population host depends on its efficiency in key processes such as the replication, the scaping of the immune system and the binding to the cell receptor. Mutations that create an advantageous scenario towards the host response to infection enhance the pathogen fitness under the natural selection pressure.

Consequently, the host-pathogen coevolution represents an important mechanism to understand the establishment and the prognostics of pathogenesis, since the infections are possibly the major selective pressure acting on humans (Sironi et al., 2015) . Therefore, the circulation of low and moderate pathogens provides time for the pathogen adaptations, in such a way that the modulation of the immunity may possibly promote molecular convergence in different lineages over the time (Longdon et al., 2014) .

The presence of spike S:D614G, N:R203K, N:G204R, and Nsp12:P323L in all sequenced E484 mutated samples, reaching three different lineages, might suggest that these lineages show increased viral replication. Considering that E484K enhances escape from immune system antibodies, these may potentially lead to a viral advantage. The occurrence of simultaneous mutations as N:R203K and N:G204R is already known in the SARS-CoV-2 literature. However, the fixation of these mutations, as well as Nsp12:P323L and D614G in all the E484K evaluated genomes may indicate a novel adaptive relationship among these modifications resulting in viral evolutionary success.

Even as an independent evolutionary event, the potential fixation of mutations such as E484K across lineages may indicate active mechanisms of adaptive selection and are very relevant in planning future therapeutic strategies (for example, newer vaccines and immunotherapy platforms). The deletion of three amino acids in the second helix of the transmembrane protein Nsp6 in P.1 genomes may affect the virus-induced cellular autophagy and the formation of double-membrane vesicles for the viral RNA synthesis (Benvenuto et al., 2020) .

The important role of subsequent stabilization of the flexible NTD by mutations has been speculated (Laha et al., 2020) . Typically, NTD can harbor a larger number of evolutionary events than RBD, including mutations, insertions and deletions that could act allosterically altering the binding affinities between RBD and hACE-2 and inducing immune evasion. It is known that distinct positions may have a linked relationship in the final protein structure and may display some advantage by acting together to achieve increased stability, adaptability, viability and/or transmission efficiency (Laha et al., 2020) . The potential causality or influence of the E484K and others substitutions for the effectiveness of neutralizing antibodies that bind the N-terminal domain of the spike protein (Chi et al., 2020) remains uncertain.

Concerning the selection analysis, the FUBAR method was the first one to be tested.

According to Murrell (2013) , the FUBAR method may have more detection power than methods like FEL, in particular when positive selection is present but relatively weak. The analysis of the South African clade V501.V2

(https://observablehq.com/@spond/zav501v2#table1) found some similar results for FUBAR evaluation, with the detection of adaptive selection at the sites 5 (L5F), 12 (S12F), and 484 (E484K). The residues 155 and 677 are not associated with the E484K-presenting lineages, however, recent data found evidence of evolutionary convergence of this mutation in at least six distinct sub-lineages, which could improve proteolytic processing, cell tropism, and transmissibility . The amino acid residue 677 is next to the furin-like cleavage site (678 to 688). Interestingly, some studies suggest that the emergence of the SARS-CoV-2 in the human species resulted from a five amino acid change in the critical S glycoprotein binding site (Fam et al., 2020) .

To the best of our knowledge, the impact of E484K in different lineages have not been deeply explored. It was structurally demonstrated that, at least in combination with K417N and N501Y, the substitution has profound impact in shifting the main site of contact between viral RBD and hACE-2 residues (Nelson et al., 2021) . However, we still do not know if this holds true for other sets of mutations associated with E484K.

In summary, we have demonstrated widespread dissemination of mutants harboring E484K replacement in geographically diverse regions in Brazil. This substitution was disseminated in our country as early as October 2020, although phylodynamics inferences placed its emergence in July (Voloch et al., 2020) . The fact that E484K was found in the context of different mutations and lineages is suggestive that this particular substitution may act as a common solution for viral evolution in different genotypes. This hypothesis may be related to the profound impact of the mutation, which changes a negatively charged amino acid (glutamic acid) for a positively charged amino acid (lysine). Since this position is present in a highly flexible loop, it has been proposed that the presence of such mutation could create a strong ionic interaction between lysine (K) in RBD and amino acid 75 of the hACE-2 receptor shifting the major sites of binding in S1 from positions 497-502 to 484 (Nelson et al., 2021) .

Mutations in RBD could theoretically decrease neutralizing activity of serum of patients receiving vaccines against SARS-CoV-2 with unknown clinical consequences.

Arguably the effect of E484K could be particularly relevant. In a recent publication, this mutation was associated with complete abolishment of all neutralizing activity in a high proportion of convalescent serum tested (Wibmer et al., 2021) . When taken together these growing body of evidence suggest that E484K should be the target of intense virologic surveillance. Studies testing the activity of serum from vaccinated patients against viruses or pseudoviruses with the aforementioned substitution should be considered a high public health priority. Second generation immune therapies and vaccines focusing on more conserved domains (for instance, in S2 fusion domain) may deserve special attention to assure continuous therapeutic efficacy.

For the graphic evaluation of Brazilian lineages, all genomes available on the GISAID database until January 18 th , 2021, and presenting the E484K mutation were included in the analysis. Genomes from Brazil and other countries submitted until January 24 th , 2021, and with collected dates between January 1 st and December 31 st , 2020, were selected. The counting of genomes containing E484K was performed by the specification of the mutation in the search fields. The monthly count was defined by the values between the first and the last day of each month.

The comparative genomic analysis of the E484K mutated genomes was performed with all the 134 genomes (Table S1 ) presenting the E484K mutation for Brazil in GISAID until January 18 th , 2021. For these, a genomic multiple sequence alignment was performed on the MAFFT web server (Katoh and Standley, 2013) (https://github.com/laduplessis/SARS-CoV-2_Guangdong_genomic_epidemiology/).

The previously aligned genomic sequences were used as input for the phylogenetic analysis. The reference genome NC_045512.2 was added as an outgroup. The inference of the best evolutionary model was performed by ModelTest-NG (Darriba et al., 2020) , which identified GTR+I (Generalised time-reversible model + proportion of invariant sites). The tree was built by Bayesian inference in MrBayes v3.2.7 (Huelsenbeck and Ronquist, 2001 ) with the GTR+I model and 5 million generations. The data convergence was evaluated with Tracer v1.7.1 (Rambaut et al., 2018) and tree visualization was created by FigTree software (http://tree.bio.ed.ac.uk/software/figtree/).

For the phylogenetic analysis using the spike protein, the multiple sequence alignment previously obtained was edited and only the spike coding sequences were maintained. The inference of the evolutionary model and tree construction were performed according to the described model: ModelTest-NG model selection, MrBayes with GTR+I and 10 million generations.

In total, 780 SARS-CoV-2 genomes, collected from Brazil samples between February 1 st and December 31 st , 2020, available on GISAID, were used for this analysis.

After an initial filtering step of undefined nucleotides in the region of spike protein, with deletion of genomes with a N ratio >0.01, 589 genomes were selected. The second step of subsampling, with Augur toolkit kept 498 genomes (approximately 119 representative genomes per lineage) for the multiple sequence alignment (Table S3) .

The genomic multiple sequence alignment was performed by MAFFT (Katoh and Standley, 2013) with default options and 1PAM = k/2 scoring matrix. The selection of the genome location for the spike protein was performed with the software UGENE (Okonechnikov et al., 2012) , using the ORF coordinates of the NC_045512 reference genome as parameter.

GTR+I was inferred by ModelTest-NG (Darriba et al., 2020) as the best evolutionary model for the spike sequences. ModelTest-NG also indicated identical sequences, which were removed resulting in 161 sequences. The Bayesian phylogenetic inference was performed by MrBayes v3.2.7 using those 161 unique spike selected sequences and 5 million generations. The data convergence was evaluated with Tracer v1.7.1.

The analysis of sites under positive and negative selection was performed by HyPhy v2.5.23 according to different approaches: (i) FUBAR (Unconstrained Bayesian AppRoximation) (Murrell et al., 2013) , (ii) FEL (Fixed Effects Likelihood) (Kosakovsky Pond and Frost, 2005) , and (iii) SLAC (Single-Likelihood Ancestor Counting) (Kosakovsky Pond and Frost, 2005) . Finally, the PAML (Phylogenetic Analysis by Maximum Likelihood) (Yang, 2007) 

The authors declare no competing interests.

Scholarships and Fellowships were supplied by the Coordenação de 

ORF1ab mutations are represented by its amino acid positions relative to ORF1a (Nsp1-Nsp11) and ORF1b (Nsp12-Nsp16). ins: insertion. Figure 1. Histogram of frequent mutations observed in the Brazilian SARS-CoV-2 genomes harboring E484K mutation. Red labels above the bars indicate absolute nucleotide position and the blue labels indicate effects of these mutations in the corresponding proteins. As P.1 has only 19 genomes represented and multiple mutations, only main mutations of concern were highlighted. UTR: Untranslated region; Syn: Synonymous substitution; del: deletion; ORF: Open Reading Frame; Nsp: Non-structural protein; S: Spike; E: Envelope; M: Membrane; N: Nucleocapsid. Tables   Table S1 . Brazilian sequenced genomes containing the E484K mutation. Table S2 . Single Nucleotide Polymorphisms identified in the multiple genomes comparison. Table S3 . Brazilian sequenced genomes used for selection analysis (499 sequences).

Figure S1. Multiple sequence alignment (time-ordered) of the 134 sequenced brazilian genomes with the E484K mutation. Genomes retrieved from GISAID until January 18 th , 2020. 

SARS-CoV-2 escape in vitro from a highly neutralizing COVID-19 convalescent plasma

Antibody cocktail to SARS-CoV-2 spike protein prevents rapid mutational escape seen with individual antibodies

Evolutionary analysis of SARS-CoV-2: how mutation of Non-Structural Protein 6 (NSP6) could affect viral autophagy

A neutralizing human antibody binds to the N-terminal domain of the Spike protein of SARS-CoV-2

A New and Scalable Tool for the Selection of DNA and Protein Evolutionary Models

ACE2 diversity in placental mammals reveals the evolutionary strategy of SARS-CoV-2

Genomic characterisation of an emergent SARS-CoV-2

Complete mapping of mutations to the SARS-CoV-2 spike receptor-binding domain that escape antibody recognition

Adaptation of SARS-CoV-2 in BALB/c mice for testing vaccine efficacy

Emergence in late 2020 of multiple lineages of SARS-CoV-2 Spike protein variants affecting amino acid position 677

Augur: a bioinformatics toolkit for phylogenetic analyses of human pathogens

MRBAYES: Bayesian inference of phylogenetic trees

Johns Hopkins Coronavirus Resource Center

MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability

Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection

Characterizations of SARS-CoV-2 mutational profile, spike protein stability and viral transmission

Modifiable lifestyle factors and severe COVID-19 risk: Evidence from Mendelian randomization analysis

The Evolution and Genetics of Virus Host Shifts

Genomic Epidemiology of SARS-CoV-2 in

FUBAR: A Fast, Unconstrained Bayesian AppRoximation for Inferring Selection

Molecular dynamic simulation reveals E484K mutation enhances spike RBD-ACE2 affinity and the combination of E484K, K417N and N501Y mutations (501Y.V2 variant) induces conformational change greater than N501Y mutant alone, potentially resulting in an escape mutant

Genomic Evidence of a Sars-Cov-2 Reinfection Case With E484K

Spike Mutation in Brazil

Unipro UGENE: a unified bioinformatics toolkit

Cross-neutralization of SARS-CoV-2 by a human monoclonal SARS-CoV antibody

HyPhy: hypothesis testing using phylogenies

Posterior Summarization in

A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology

Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations

GISAID: Global initiative on sharing all influenza data -from vision to reality

Evolutionary insights into host-pathogen interactions from mammalian sequence data

Less Is More: An Adaptive Branch-Site Random Effects Model for Efficient Detection of Episodic Diversifying Selection

Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding

Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa

Positive selection drives a correlation between non-synonymous/synonymous divergence and functional divergence

SARS-CoV-2 genomic variations associated with mortality rate of COVID-19

Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection, genetics and evolution : journal of molecular epidemiology and evolutionary genetics in infectious diseases 83

Genomic characterization of a novel SARS-CoV-2 lineage from

Evaluating the effects of SARS-CoV-2 Spike mutation D614G on transmissibility and pathogenicity

Escape from neutralizing antibodies by SARS-CoV-2 spike protein variants

The role of furin cleavage site in SARS-CoV-2 spike protein-mediated membrane fusion in the presence or absence of trypsin

PAML 4: phylogenetic analysis by maximum likelihood

Receptor-binding domain-specific human neutralizing monoclonal antibodies against SARS-CoV and SARS-CoV-2

A Novel Coronavirus from Patients with Pneumonia in China

indicate the proportions of groups 0, 1 and 2 in each model, respectively; ω 0 , ω 1 and ω 2 indicate the ω values of groups 0, 1 and 2 in each model, respectively

We thank the administrators and curators of the GISAID database and research groups across the globe for supporting the rapid and transparent sharing of genomic data during the COVID-19 pandemic. We also thank the Mayor's Office, Health Department and