key: cord-0813625-6gi18vka
authors: Singh, Praveen Kumar; Kulsum, Umay; Rufai, Syed Beenish; Mudliar, S. Rashmi; Singh, Sarman
title: Mutations in SARS-CoV-2 Leading to Antigenic Variations in Spike Protein: A Challenge in Vaccine Development
date: 2020-09-01
journal: J Lab Physicians
DOI: 10.1055/s-0040-1715790
sha: 90c585e6b57470813260bad408c121ac465d4189
doc_id: 813625
cord_uid: 6gi18vka

Objectives The spread of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) has been unprecedentedly fast, reaching more than 180 countries within 3 months with variable severity. One of the major reasons attributed to this variation is genetic mutation. Therefore, we aimed to predict the mutations in the spike protein (S) of the SARS-CoV-2 genomes available worldwide and analyze their impact on antigenicity. Materials and Methods Several research groups have generated whole genome sequencing data, which are available in public repositories. A total of 1,604 spike proteins were extracted from 1,325 complete genomes and 279 partial spike coding sequences of SARS-CoV-2 available in NCBI until May 1, 2020 and subjected to multiple sequence alignment to find the mutations corresponding to the reported single nucleotide polymorphisms (SNPs) in the genomic study. Further, the antigenicity of the predicted mutations was inferred, and the epitopes were superimposed on the structure of the spike protein. Results The sequence analysis revealed a high SNP frequency. The significant variations in the predicted epitopes showing high antigenicity were A348V, V367F, and A419S in the receptor binding domain (RBD). Other mutations observed within the RBD exhibiting low antigenicity were T323I, A344S, R408I, G476S, V483A, H519Q, A520S, A522S, and K529E. The RBD mutations T323I, A344S, V367F, A419S, A522S, and K529E are novel and reported for the first time in this study. Moreover, the A930V and D936Y mutations were observed in heptad repeat domain 1 and one mutation, D1168H, was noted in heptad repeat domain 2. Conclusion The S protein is the major target for vaccine development, but several mutations were predicted in the antigenic epitopes of the S protein across all genomes available globally. The emergence of various mutations within a short period might result in conformational changes of the protein structure, which suggests that developing a universal vaccine may be a challenging task.

Following the rapid outbreak of the 2019 novel coronavirus (2019-nCoV, later named SARS-CoV-2 or severe acute respiratory syndrome coronavirus 2) in Wuhan, China, the World Health Organization declared the SARS-CoV-2 epidemic a public health emergency of international concern on January 30, 2020. The ongoing pandemic has caused nearly 5 million detected cases of coronavirus disease 2019 (COVID-19) and claimed over 325,563 lives worldwide as of May 20, 2020, according to the Johns Hopkins COVID-19 Resource Center. 1 However, so far, no proven therapeutic or effective vaccine candidate has been found. 2 For developing a drug or vaccine, the protein profile and/or genomic information of the pathogen is crucial. To understand the genetic landscape of SARS-CoV-2, scientists have worked tirelessly and the complete genome sequences of virus isolates have been published. Many isolates have now been sequenced completely or partially and are available in public databases for the scientific community. 3 The genome size of SARS-CoV-2 varies from 29.8 kb to 29.9 kb, and specific genetic characteristics of its genome have also been described. 4 The genome encodes four structural proteins: the spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins. 3 Of these four proteins, the M protein is reported to have a role in determining the shape of the virus envelope and stabilizing the nucleocapsids, while the N protein is involved in processes related to the viral genome, the viral replication cycle, and the cellular response of host cells to viral infections. The E protein, the smallest protein in the SARS-CoV-2 structure, plays a role in the production and maturation of the virus. 4 The glycoprotein S, in contrast, is the core transmembrane monomer of approximately 180 kDa with two subunits, S1 and S2. This glycoprotein mediates membrane fusion and facilitates virus entry (receptor binding and entry of the virion into the target cells). 5 The receptor binding domain (RBD) (residues 319-541) of the S1 subunit is known to interact with angiotensin-converting enzyme 2 (ACE-2), binding tightly to the peptidase domain of ACE-2. 6 This suggests that the RBD is an important element of virus-receptor interaction and has an essential role in virus-host range, tropism, and infectivity.

The RBD sequences of the different SARS-CoV-2 strains circulating globally were initially thought to be conserved; however, with the availability of sequencing data, several mutations have been reported. 7, 8 These findings support the argument that the mutated strains circulating in a particular geographic setting may correlate with higher mortality rates and transmission patterns, in addition to other factors. 9 The mutation rate in ribonucleic acid (RNA) viruses is extremely high, up to a million times higher than that of their hosts, and this results in virulence modulation and an evolutionary capability for better viral adaptation. 7 Genetic characterization of virus mutations can thus offer valuable insights for assessing drug resistance, immune escape, and pathogenesis. Because of its receptor binding property, the S protein is also thought to be immunogenic and a putative target for developing neutralizing antibodies and vaccines. Single-point mutations in conserved amino acid residues of the RBD have been reported to completely abolish the capacity of the full-length S protein to induce neutralizing antibodies. 10 Thus, virus mutation studies can be crucial for designing new vaccines and antiviral drugs.

In this study, we aimed to predict the mutations in the spike protein (S) of SARS-CoV-2 genomes available in the database (whole genome sequences as well as partial coding sequences of spike protein) and analyze the effect of each mutation on the antigenicity of the predicted epitopes. This information may be helpful in predicting the transmission and infectivity of various SARS-CoV-2 strains circulating worldwide.

Entrez Direct (EDirect) utilities were used to access the NCBI nucleotide database, with in-house bash scripts for batch download. The query phrases "severe acute respiratory syndrome coronavirus 2" and "spike coding sequences (CDS)" were submitted through the ESearch and EFetch utilities to retrieve the genome and spike protein datasets available until May 1, 2020. A total of 1,325 complete draft genomic sequences of SARS-CoV-2 and an additional 279 partial genome sequences containing the spike CDS (1,604 spike CDS in total), submitted from around the world, were downloaded from the NCBI database in FASTA format (►Fig. 1).
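The authors' bash scripts are not given in the text. As a rough illustration of the same two-step retrieval (search, then fetch), the sketch below only builds request URLs for the public NCBI E-utilities endpoints that ESearch and EFetch wrap; it makes no network calls. The `[Organism]` field tag, the `retmax` value, and the function names are assumptions chosen for illustration and are not the study's actual query.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=2000):
    """Build an ESearch URL that would return record IDs matching `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids, db="nucleotide", rettype="fasta"):
    """Build an EFetch URL that would return the given records as FASTA text."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hypothetical query string; the exact query used in the study is not stated.
search = esearch_url('"severe acute respiratory syndrome coronavirus 2"[Organism]')
# MT012098.1 is the accession of the Indian strain mentioned in the Results.
fetch = efetch_url(["MT012098.1"])
```

In practice the IDs returned by the search step would be passed to the fetch step in batches, which is the role the bash scripts play in the workflow described above.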

Multiple sequence alignment (MSA) of all the 1,325 complete genome sequences as well as 1,604 CDS was performed using ClustalW-MPI with default parameters. 11 Generated SNPs were identified by in-house developed bash scripts to batch process the data using Blade Server (Dell PowerEdge FC640 Server) with 256 GB RAM and 40 Core processor with 2.30 GHz. After MSA, each genome and spike protein were marked based on the location and clustering was done based on 100% similarity for ease of visualization and analysis. Visualization was performed by using JalView 2.11.1.0. 12
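The SNP identification and the 100% similarity clustering described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the in-house scripts: the alignment is assumed to be already loaded as a dictionary of equal-length strings, the reference is assumed to contain no gaps (so that alignment columns equal reference coordinates), and the function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def call_snps(aligned, ref_name):
    """List (sequence name, column, 'posREF>ALT') for every base that differs
    from the reference. Gaps and ambiguous bases are skipped."""
    ref = aligned[ref_name]
    snps = []
    for name, seq in aligned.items():
        if name == ref_name:
            continue
        if len(seq) != len(ref):
            raise ValueError("sequences must be aligned to equal length")
        for col, (r, a) in enumerate(zip(ref, seq), start=1):
            if r != a and r in "ACGT" and a in "ACGT":
                snps.append((name, col, f"{col}{r}>{a}"))
    return snps


def cluster_identical(aligned):
    """Group sequence names whose aligned sequences are 100% identical."""
    clusters = defaultdict(list)
    for name, seq in aligned.items():
        clusters[seq].append(name)
    return list(clusters.values())


# Toy alignment (hypothetical data, not from the study).
toy = {"ref": "ACGTAC", "s1": "ACGTAC", "s2": "ACATAC", "s3": "AC-TAN"}
snps = call_snps(toy, "ref")
clusters = cluster_identical(toy)
```

Grouping on the full aligned string is the simplest reading of "100% similarity"; a real pipeline would also need a policy for gaps and ambiguous bases before sequences are compared.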

The output SNP alignment generated from MSA was used to assemble a maximum likelihood phylogenetic tree using RAxML (Randomized Axelerated Maximum Likelihood RAxML 8.2.12). 13 Phylogenetic trees were visualized using the Interactive Tree of Life (iTOL) V5 with their respective metadata. 14

EMBOSS antigenic software was used to predict the antigenic regions and the epitopes in the 88 unique spike proteins based on antigenic scores using the formula:

$$A_p = \frac{f(Ag)}{f(s)}$$

where A_p = antigenic propensity; f(Ag) = antigenic frequency; f(s) = surface frequency; an antigenic score ≥ 1.0 is considered potentially antigenic. 15-17 The data for different epitopes were analyzed and the epitopes with high antigenicity were superimposed on the structure of the spike protein.
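To make the scoring concrete, the sketch below assumes the score is the ratio f(Ag)/f(s), as in the semi-empirical method cited above, and averages per-residue propensities over a sliding window. It does not reproduce the internals of EMBOSS antigenic. The window length of 7, the two-residue propensity table, and the peptide are assumptions for illustration only.

```python
def antigenic_propensity(f_ag, f_s):
    """Ratio of a residue's frequency in antigenic determinants to its
    frequency on the protein surface."""
    if f_s <= 0:
        raise ValueError("surface frequency must be positive")
    return f_ag / f_s


def window_scores(sequence, propensity, window=7):
    """Average per-residue propensity over a sliding window, assigned to the
    central residue (1-based position)."""
    half = window // 2
    scores = {}
    for i in range(half, len(sequence) - half):
        segment = sequence[i - half : i + half + 1]
        scores[i + 1] = sum(propensity[aa] for aa in segment) / window
    return scores


def potentially_antigenic(scores, threshold=1.0):
    """Positions whose averaged score meets the threshold of 1.0 used in the text."""
    return sorted(pos for pos, s in scores.items() if s >= threshold)


# Hypothetical propensity values for two residues only (not published values).
table = {"A": 1.2, "D": 0.8}
scores = window_scores("AAADD", table, window=3)
hits = potentially_antigenic(scores)
```

With this toy table, a window is flagged only while high-propensity residues dominate it, which is why a single substitution inside an epitope can move a region above or below the 1.0 cut-off.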

Overall, 1,197 SNPs were found in the 1,325 complete genome dataset. On the basis of similarity, the genomes were classified into 782 clusters. Among the CDS of spike proteins (from 1,325 complete genomes and 279 partial CDS), however, a total of 140 SNPs in 88 clusters were found. Further SNP analysis identified SNPs in the gene stretch 21,563-25,384 bp, that is, the S gene encoding the spike (S) protein. The most predominant SNP predicted in the gene encoding the S protein was 23403A>G, found in 48.2% of all genomes under study. In addition to several single mutations in the S gene of all available genomes, we also predicted double mutations such as 22436G>T, 22439C>G, 22444C>A, 22445C>T (corresponding to a four amino acid antigenic drift ALDP -> SVES at positions 292-295) and 21723T>G (L54W); 21726T>A (F55I) in two different genomes from the United States. The list of all SNPs in the RBD, antigenic sites, and double mutations among strains is shown in ►Table 1. One deletion (21994-21996delTTA) was also found in an Indian strain (MT012098.1). No copy number variants were observed in this virus.
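The link between a genomic SNP coordinate and the affected residue follows from simple arithmetic on the S gene coordinates given above (21,563-25,384). The sketch below is an illustration under the assumption that the coordinates are ungapped reference positions; the function name is hypothetical.

```python
S_GENE_START = 21563  # first nucleotide of the S gene, as stated in the text
S_GENE_END = 25384    # last nucleotide of the S gene, as stated in the text


def genome_to_codon(pos, cds_start=S_GENE_START, cds_end=S_GENE_END):
    """Map a genomic position to (codon number, position within codon),
    both 1-based, assuming no insertions or deletions shift the frame."""
    if not cds_start <= pos <= cds_end:
        raise ValueError("position lies outside the coding sequence")
    offset = pos - cds_start
    return offset // 3 + 1, offset % 3 + 1


d614g = genome_to_codon(23403)  # the p.D614G variant
a292s = genome_to_codon(22436)  # first change of the ALDP -> SVES stretch
l54w = genome_to_codon(21723)   # L54W
```

Position 23403 falls on the second base of codon 614, position 22436 on the first base of codon 292, and position 21723 on the second base of codon 54, in agreement with the residue numbers reported for these variants.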

The alignment of 1,604 spike proteins extracted from 1,325 complete genomes and 279 partial CDS was performed and after clustering based on 100% identity, we identified 88 unique sequences with 88 hypervariable sites within these protein sequences. Based on the variable sites the phylogeny was inferred showing two major clades A and B with many subclades in the S protein of SARS-CoV-2 circulating worldwide (►Fig. 2).

Furthermore, the evaluation of the antigenicity of the spike protein predicted the 14 highest-scoring antigenic epitopes (antigenic scores ≥ 1.0), each carrying a variation (L54F, L54W, F55I, S71F, D111N, F157L, L293M, L293V, D294E, D294I, A419S, V367F, A348V, and A653V). Of these, the amino acid changes A348V, V367F, and A419S lie in the RBD, with V367F and A419S being novel. Other predicted mutations in putative epitopes within the RBD that showed lower antigenicity were T323I, A344S, R408I, G476S, V483A, H519Q, A520S, A522S, and K529E, of which T323I, A344S, A522S, and K529E are also novel. In addition, regions outside the RBD (residues 4-19, 1,215-1,256, 1,039-1,071, 123-146, 607-629, 1,123-1,136, 857-867, and 535-541) also showed high antigenicity (predicted antigenic scores in the range 1.167-1.261), with variations in the S protein of different genomes from similar locations, as shown in ►Fig. 3. The antigenic epitopes are depicted on the structure of the spike protein (►Fig. 4). Two mutations, A930V and D936Y, were observed in heptad repeat domain 1 (HR1) and one mutation, D1168H, in heptad repeat domain 2 (HR2).

Several studies have shown that mutations within the spike protein influence virus-host interaction. 18 Among the four proteins, viz., M, S, N, and E, the M protein is known to play a significant role in virus assembly, the E protein is involved in the production and maturation of the virus, the N protein is involved in processes related to the viral replication cycle and the cellular response of host cells to viral infections, and the S protein is the major target for neutralizing antibodies, as it mediates fusion and facilitates viral entry into the host. 4, 5 In the present study, we found that although multiple genetic variants were identified within the same country, some unique mutations were found only in a particular country, which suggests that the diversity of S protein mutations might have a significant role in the pathogenicity of this virus in countries with high or low mortality rates, as also proposed by others. 19 We predicted 140 SNPs in the S protein, with A>G and vice versa in SARS-CoV-2 genomes submitted from India, T>A and C>T in genomes from China, and the interchange of all four nucleotides (C>T, T>A, A>G, G>T, C>G, C>A, G>C, T>C, A>T, G>A) in genomes submitted from the United States. The diversity in the data from the United States is notable; however, this might be because the largest number of genomes was submitted from the United States. The SNP profile revealed that the S protein mutations were predominant at specific positions only. These mutations are expected to make the virus more capable of escaping the host immune response and might contribute to the natural selection and evolution of SARS-CoV-2, as reported by Andersen et al. 20 It is important to mention that double mutations in the S protein were found only in strains from the United States and not in genomes from other regions. These double mutations could have contributed to the increased virulence of the virus. 21 It has also been noticed that the death toll is comparatively higher in the United States than in the other regions included in the study. 1 This might indicate that the prevalence of several mutated strains within the provinces either reduced or increased the severity of the disease. It may also help in understanding the antigenic and immunogenic changes, but the correlation of mutation with regional virulence could not be established because of the statistical imbalance of the genomes available in the database at the time the study was done. Moreover, extensive research is required to correlate the mutations with the severity of the disease and mortality.

Out of 88 SNP clusters, D614G was found in 34 (38.6%) SARS-CoV-2 genomes. The amino acid change in the 23403A>G variant (p.D614G) replaces the large acidic residue D (aspartic acid) with the small hydrophobic residue G (glycine). This observation is important, as this large difference in both size and charge may compromise the binding affinity of antibodies against the S protein through altered electrostatic interactions in the tertiary structure of the protein. This may hinder the development of vaccines and might predispose the virus to antigenic drift. 22 The effect of the deletion variant (found in one SARS-CoV-2 genome from India) on the viral phenotype needs further investigation. The high frequency of genetic mutations in RNA viruses is well known, but in the genomes of SARS-CoV-2 we found a series of single amino acid variations. This can affect the evolution of the virus and the emergence of new strains. 21

The mutations in the RBD found in our study predicted conformational changes in the S1 domain of the spike protein. Mutations in the RBD are an important consideration when designing new drugs, as suggested in a recent study. 23 These mutations might affect the interaction of the viral RBD with the host receptor. Our study revealed 12 mutations, of which six were novel (►Table 1). Of the six novel mutations, two exhibited high antigenicity while the others were in less antigenic regions. The amino acid changes observed in the antigenic epitopes were from positively charged to uncharged amino acids (R->I, H->Q) and from negatively charged to uncharged amino acids (D->N, D->G, D->Y). We also found a change from a negatively to a positively charged amino acid (D->H). These replacements might influence the tertiary structure of the protein and facilitate increased virulence by escaping the host immune response. 24

The sequences in the HR1 (residues 902-952) and HR2 (residues 1,145-1,184) regions tend to form dimeric or trimeric helix bundles. 25 As the S proteins of coronaviruses are homodimers or homotrimers, these HR regions may undergo oligomerization and result in a conformational change of the S protein during virus-host cell fusion. 26 These regions show different conformations in different fusion states and are known to be the most conserved regions in the S protein of SARS-CoV. 25, 27 However, a previous study showed variations in the HR1 domain, which forms helical bundles with HR2 to facilitate fusion and entry of the virus into the host, and hypothesized that the mutation A1168V in the HR2 domain together with the A930V mutation in the HR1 domain confers resistance to peptide entry inhibitors in mouse hepatitis coronaviruses. 28 Hence, the mutation A930V in the HR1 domain and D1168H in the HR2 domain found in our study might be relevant in explaining the pathogenesis of SARS-CoV-2.
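The charge-class changes discussed above can be tabulated mechanically from the mutation names. The sketch below is a minimal illustration: it follows the text in treating histidine (H) as positively charged, and the function names are hypothetical.

```python
import re

POSITIVE = set("KRH")  # histidine counted as positive, following the text
NEGATIVE = set("DE")


def charge_class(aa):
    """Classify a one-letter amino acid code by side-chain charge."""
    if aa in POSITIVE:
        return "positive"
    if aa in NEGATIVE:
        return "negative"
    return "uncharged"


def charge_change(mutation):
    """Describe the charge change for a mutation written like 'D614G'."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation format: {mutation!r}")
    ref, _position, alt = match.groups()
    return f"{charge_class(ref)} -> {charge_class(alt)}"


# Mutations named in the text.
changes = {m: charge_change(m) for m in ["R408I", "H519Q", "D614G", "D936Y", "D1168H", "A930V"]}
```

Applied to the mutations named in the text, this reproduces the three categories described: positive to uncharged (R408I, H519Q), negative to uncharged (D614G, D936Y), and negative to positive (D1168H).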

With the rapid spread of this virus and limitation of specific therapy, studies are being focused on exploring the potential of neutralizing antibodies (as in plasma therapy) against vulnerable epitopes of S protein. 29 Our study predicts 

1. An interactive web-based dashboard to track COVID-19 in real time
2. A review of SARS-CoV-2 and the ongoing clinical trials
3. Genomic characterization of a novel SARS-CoV-2
4. Coronavirus envelope protein: current knowledge
5. Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
6. Structure of mouse coronavirus spike protein complexed with receptor reveals mechanism for viral entry
7. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
8. Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies
9. Real estimates of mortality following COVID-19 infection
10. Single amino acid substitutions in the severe acute respiratory syndrome coronavirus spike glycoprotein determine viral entry and immunogenicity of a major neutralizing domain
11. ClustalW-MPI: ClustalW analysis using distributed and parallel computing
12. Jalview Version 2-a multiple sequence alignment editor and analysis workbench
13. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models
14. Interactive Tree of Life (iTOL) v4: recent updates and new developments
15. EMBOSS: the European Molecular Biology Open Software Suite
16. A semi-empirical method for prediction of antigenic determinants on protein antigens
17. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites
18. Potential therapeutic targeting of coronavirus spike glycoprotein priming
19. SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate
20. The proximal origin of SARS-CoV-2
21. Why are RNA virus mutation rates so damn high?
22. Electrostatic interactions in protein structure, folding, binding, and condensation
23. Exploring the genomic and proteomic variations of SARS-CoV-2 spike glycoprotein: a computational biology approach
24. Lineage-specific differences in the amino acid substitution process
25. Interaction between heptad repeat 1 and 2 regions in spike protein of SARS-associated coronavirus: implications for virus fusogenic mechanism and identification of fusion inhibitors
26. Tectonic conformational changes of a coronavirus spike glycoprotein promote membrane fusion
27. Mechanisms of viral membrane fusion and its inhibition
28. Coronavirus escape from heptad repeat 2 (HR2)-derived peptide entry inhibition as a result of mutations in the HR1 domain of the spike fusion protein
29. A highly conserved cryptic epitope in the receptor binding domains of SARS-CoV-2 and SARS-CoV

Most of the vaccine strategies against COVID-19 focus on the predicted epitopes of the SARS-CoV-2 spike protein. This protein is also proposed to be the most potent and specific drug target and the basis for designing neutralizing antibodies. Our findings indicate that vaccine design against SARS-CoV-2 could be a challenging task. Although both RNA-based and peptide-based vaccines are being developed in more than seven laboratories, our observations may be useful in the efficacy analysis of these vaccine candidates.

None.

Conflict of Interest P.K.S. and U.K. are research officers in a Department of Biotechnology funded unrelated project (BT/PR23016/NER/95/581/2017).