key: cord-0848621-0uysvn6y
authors: Guruprasad, Lalitha
title: Human SARS CoV‐2 spike protein mutations
date: 2021-01-17
journal: Proteins
DOI: 10.1002/prot.26042
sha: 837e216b7e4136541bf637c419205c8a3da4c8ab
doc_id: 848621
cord_uid: 0uysvn6y

The human spike protein sequences from Asia, Africa, Europe, North America, South America, and Oceania were analyzed by comparison with the reference severe acute respiratory syndrome coronavirus‐2 (SARS‐CoV‐2) protein sequence from Wuhan‐Hu‐1, China. Out of 10333 spike protein sequences analyzed, 8155 proteins contained one or more mutations. A total of 9654 mutations were observed, corresponding to 400 distinct mutation sites. The receptor binding domain (RBD), which interacts with the human angiotensin‐converting enzyme‐2 (ACE‐2) receptor to cause the infection leading to COVID‐19 disease, contained 44 mutations, including mutations at residues within 3.2 Å interacting distance of the ACE‐2 receptor. The mutations observed in the spike proteins are discussed in the context of their distribution according to geographical location, mutation sites, mutation types, distribution of the number of mutations at the mutation sites, and mutations at the glycosylation sites. The density of mutations in different regions of the spike protein sequence and the location of the mutations in the protein three‐dimensional structure corresponding to the RBD are discussed. The mutations identified in the present work are important considerations for antibody, vaccine, and drug development.

The epicenter of the ongoing COVID-19 pandemic caused by the human severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) was first identified in the city of Wuhan, Hubei province, China, during mid-December 2019. [1] Since then, the disease has spread rapidly and has affected millions of people worldwide, leading to more than 778102 deaths to date (https://www.worldometers.info/coronavirus/).
SARS-CoV-2 belongs to the family Coronaviridae, subfamily Orthocoronavirinae, and genus β-CoV (https://www.ncbi.nlm.nih.gov/taxonomy/694009). The spread of the disease is attributed to contact via respiratory droplets from coughing or sneezing, or to surface contact. To date, no vaccines or specific drugs are available to treat the disease; however, there are enormous ongoing efforts worldwide in this direction.

SARS-CoV-2 is a spherical virion with a positive-stranded RNA genome of about 30 kb that is translated into structural and non-structural proteins. The spike glycoprotein is a homotrimer present on the surface of the coronavirus that plays a vital role in recognition of the human host cell surface receptor angiotensin-converting enzyme-2 (ACE-2). [2] This recognition is required for fusion of the viral and host cellular membranes and for transfer of the viral nucleocapsid into the host cell. SARS-CoV-2 is reported to have originated in bats [3] and to have been subsequently transmitted to humans via pangolins as intermediate host species. [4] [5] [6] In order to jump species and infect a new mammalian host, the virus accumulates several mutations in the spike protein.

The spike protein comprises an N-terminal S1 subunit and a C-terminal, membrane-proximal S2 subunit. The S1 subunit consists of the S1A, S1B, S1C, and S1D domains. The S1A domain, referred to as the N-terminal domain (NTD), recognizes carbohydrates such as sialic acid that are required for attachment of the virus to the host cell surface. The S1B domain, referred to as the receptor-binding domain (RBD) of the SARS-CoV-2 spike protein, interacts with the human ACE-2 receptor. [2, 7] The structural elements within the S2 subunit comprise three long α-helices, multiple α-helical segments, extended twisted β-sheets, a membrane-spanning α-helix, and an intracellular cysteine-rich segment. The PRRA sequence motif located between the S1 and S2 subunits in SARS-CoV-2 presents a furin-cleavage site.
[8] In the S2 subunit, a second proteolytic cleavage site, S2′, is present upstream of the fusion peptide. Both cleavage sites participate in viral entry into host cells.

In a study on the infectivity and reactivity to a panel of neutralizing antibodies and sera from convalescent patients, [9] mutations and glycosylation site modifications have been reported in human SARS-CoV-2 spike proteins. A few mutations have been reported in the spike glycoprotein. [10] The D614G mutation is reported to be relatively more common [11] [12] [13] [14] and is known to increase the efficiency of infection. [2] Mutation sites for spike proteins from some of the SARS-CoV-2 Indian isolates have been mapped on to the protein three-dimensional structure. [15] [16] [17]

In light of the large number of SARS-CoV-2 spike protein sequences currently available in the NCBI virus database, I set out to carry out an exhaustive analysis in order to understand the current scenario of mutations in the spike proteins. This study reports all the mutations present in the human SARS-CoV-2 spike proteins relative to the Wuhan-Hu-1 reference sequence from China, according to their geographical locations, positions of the mutation sites, distribution of the number of mutations at the mutation sites, the different mutation types observed so far, mutations at glycosylation sites, occurrence of multiple mutations in a single spike protein, and mutations within the RBD close to the host-cell ACE-2 receptor interactions. This study has implications for vaccine, antibody, and drug design.

The SARS-CoV-2 spike protein sequences were obtained in FASTA format from the NCBI virus database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/). The multiple sequence alignment [18] of the proteins was achieved using the NGPhylogeny server [19].

Nearly one-third of the spike protein sequence is associated with mutations.
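The comparison of each aligned spike sequence against the Wuhan-Hu-1 reference can be illustrated with a minimal Python sketch. It is not the program used in this work. The function name `call_mutations`, the choice to skip unknown residues (`X`), and the choice to ignore insertions relative to the reference are all assumptions made for illustration, and the sequences in the usage example are toy strings. Labels follow the notation used in the text: reference residue, reference position, observed residue, with `-` for a deletion.

```python
from collections import Counter


def call_mutations(ref_aln: str, query_aln: str) -> list[str]:
    """Compare two rows of one alignment and return labels such as 'D614G' or 'Y144-'.

    Positions are numbered along the ungapped reference (1-based).
    Assumptions of this sketch: 'X' (unknown residue) is not counted as a
    mutation, and insertions relative to the reference are ignored.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    mutations = []
    pos = 0  # position in the ungapped reference
    for ref_res, query_res in zip(ref_aln, query_aln):
        if ref_res == "-":
            continue  # alignment column absent from the reference
        pos += 1
        if query_res == "X":
            continue  # ambiguous residue, skipped by assumption
        if query_res != ref_res:
            mutations.append(f"{ref_res}{pos}{query_res}")
    return mutations


def count_sites(mutation_lists: list[list[str]]) -> Counter:
    """Tally how many times each reference position is mutated across proteins."""
    counts = Counter()
    for labels in mutation_lists:
        for label in labels:
            counts[int(label[1:-1])] += 1
    return counts


# Toy usage (invented sequences, not real spike data)
reference = "MFVFLVLLPL"
queries = ["MLVFLVLLPL", "MF-FLVLLPL", "MLVFLVLLPL"]
per_protein = [call_mutations(reference, q) for q in queries]
site_counts = count_sites(per_protein)
```

In this toy run `per_protein` holds one list of labels per query, and `site_counts` maps each mutated reference position to the number of proteins carrying a change there, which mirrors the "number of mutations at a mutation site" tally discussed in the text.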
The list of the mutation sites, along with the total number of mutations observed at each mutation site, is shown in Table S1. The distribution of the mutations within different regions of the spike protein is shown in Table 2 proteins or the more common occurrence as reported previously. [11] [12] [13] [14]

The mutation density, evaluated as the number of mutations observed over the sequence length of each region of the spike protein, is shown in Figure 1. The protease cleavage site (between residues 675 and 692) in the spike protein is associated with the maximum mutation density. Mutations at this site may be advantageous for the virus by allowing proteolytic cleavage by a larger number of host enzymes as it evolves. Further, the NTD (S1A domain) is another region where relatively more mutations have accumulated than in the rest of the spike protein.

More than one mutation type can be found at the same position in the spike protein sequence. For instance, at position 88 the amino acid residue D is observed to be mutated to N, E, Y, or A. At position 675, the amino acid residue Q is mutated to R, H, or K, or is deleted, among the spike proteins.

The geographical location-wise distribution of the mutation sites and mutation types is shown in Table 3. Accordingly, the total numbers of mutation sites observed were: North America (300), South America (4) The spike protein plays a vital role for the attachment to host cell- Table 3. Accordingly, the total numbers of distinct mutation sites in the RBD observed were: North America (27), South America (0), Europe (7), Africa (1), Asia (15), and Oceania (9), and the total numbers of distinct mutation types observed were: North America (28), South America (0), Europe (7), Africa (1), Asia (16), and Oceania (9).
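The mutation density described above (mutations observed in a region divided by the length of that region) and the grouping of several mutation types at one position can be sketched in Python as follows. This is only an illustration under stated assumptions. The function names are hypothetical, the per-site counts in the usage example are invented, and the only region boundary taken from the text is the protease cleavage site (residues 675 to 692).

```python
from collections import defaultdict


def mutation_density(site_counts: dict[int, int],
                     regions: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Return mutations per residue for each region.

    site_counts maps a sequence position to the number of mutations seen there;
    regions maps a region name to inclusive (start, end) positions.
    """
    density = {}
    for name, (start, end) in regions.items():
        length = end - start + 1
        total = sum(n for pos, n in site_counts.items() if start <= pos <= end)
        density[name] = total / length
    return density


def types_by_site(labels: list[str]) -> dict[int, list[str]]:
    """Group labels such as 'D88N' by position; '-' marks a deletion."""
    grouped = defaultdict(set)
    for label in labels:
        position, observed = int(label[1:-1]), label[-1]
        grouped[position].add(observed)
    return {pos: sorted(obs) for pos, obs in grouped.items()}


# Toy counts (invented for illustration, not the values in Table S1)
toy_counts = {675: 3, 677: 6, 614: 100}
toy_regions = {"protease cleavage site": (675, 692)}  # boundaries from the text
densities = mutation_density(toy_counts, toy_regions)

# Mutation types quoted in the text for positions 88 and 675
observed = types_by_site(
    ["D88N", "D88E", "D88Y", "D88A", "Q675R", "Q675H", "Q675K", "Q675-"]
)
```

With these toy counts the cleavage-site region carries 9 mutations over 18 residues, and `observed` lists four distinct outcomes at each of positions 88 and 675, as stated in the text.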
The mutations occurring in relatively large numbers in RBD is close to His34 in ACE-2 receptor and T500 and N501 residues in spike protein RBD are close to Tyr41 in ACE-2. [22] A few residues adjacent to residues lying ≤3.2 Å from the ACE-2 receptor are also associated with mutations. These residues are G476 (next to A475), F486 (next to N487), and G502 (next to N501). Mutations were observed for the residues Y453, T500, and N501. The Y453F mutation is present in five spike proteins from Europe (NCBI

The 8155 spike proteins comprise anywhere from 1 to 16 mutations.

North America: F2L, L5F, L5I, V6F, L7V, P9L, S12C, Q14H, C15F, N17K, L18F, T20I, T22N, T22A, T22I, Q23K, P25S, A27S, A27V, T29I, F32L, R34C, H49Y, S50L, T51I, Q52L, Q52H, L54F, L54W, F55I, P57L, H69Y, S71F, G72V, T73I, G75V, T76I, F79L, D80N, D80Y, N87Y, D88N, D88E, D88Y, D88A, V90F, T95A, T95I, E96D, K97T, S98F, R102I, I105L, D111N, K113R, L118F, V130A, E132D, C136R, D138H, L141-, L141F, G142V, G142-, V143F, V143-, Y144-, Y144V, Y145H, H146Y, K147E, N148S, S151I, M153T, M153V, M153I, E154V, F157L, R158S, L176F, M177I, D178N, G181V, L189F, R190K, I203M, I210-, R214L, D215Y, D215G, L216F, Q218L, F220L, S221L, A222V, A222P, D228H, L229F, Q239R, T240I, L242F, A243S, A243V, H245R, H245Y, R246K, D253Y, D253G, S254F, S256L, W258L, G261V, G261R, A262S, Y265C, V267L, R273S, E281Q, A288S, L293V, D294E, P295S, E298G, T307I, V308L, E309Q, Q314K, Q314L, Q314H, T315I, Q321L, T323I, P330S, A344S, T345S, A348T, A348S, N354K, R357K, V367F, V382L, P384L, V395I, R403K, V407I, A411S, G413R, L441I, R457K, K458Q, G476S, S477N, P479L, V483A, E484Q, Q493L, S494P, Y508H, H519Q, A520S, A522V, K529E, G545S, T547I, L552F, T553I, E554D, K558N, A570V, T572I, D574Y, E583D, I584V, S596I, I598V, N603H, Q613H, D614G, V615F, T618A, P621S, V622F, V622I, V622A, A623S, H625Y, A626V, P631S, W633R, G639V, S640F, A647S, A647V, E654Z, E654K, H655Y, N658Y, A668S, A672V, Q675R, Q675H, T676S, T676I, Q677H, Q677R, T678I, P681L, P681H, R682W, A684S, A684T, V687L, A688V, A688S, S689I, S691F, S698L, N703D, S704L, V705F, A706V, I714M, T716I, I720V, T724A, M731I, T732A, T732I

Europe: L5F, T22I, H49Y, Q115R, M153I, L176I, L176F, F186S, N188D, I197V, V213L, T240I, S254F, G261D, V367F, C379F, V382E, T393P, Y453F, F486L, N501T, T553N, K558R, T572I, L611F, D614G, T676I, S686G, M740I, G769V, Y789D, F797C, D839Y, A845S, A1020V, H1101Y, V1122L, P1162L, K1191N, M1229I, D1260N, D1260H, P1263L

Africa: L5F, S12F, T29I, H49Y, V70F, Y144-, L242F, A288T, Q314R, R408I, A570S, D614G, S640A, A653V, Q677H, P812L

Asia: F2L, L5F, L8V, S12F, S13I, Q14H, T22I, P25L, Y28H, Y28N, T29A, G35V, Y38C, H49Y, S50L, L54F, A67V, A67S, I68-, I68R, H69-, V70-, S71-, G72-, T73-, N74-, N74K, G75V, G75-, T76I, T76-, R78M, F86S, T95I, E96G, K97Q, S98F, V127F, D138Y, D138H, F140L, L141-, G142-, V143-, Y144-, H146Y, H146R, N148Y, S151G, W152L, M153I, S155I, E156D, S162I, Q173H, M177I, G181A, N185K, R190S, N211Y, V213L, S221W, Y248H, S255F, W258L, G261R, G261S, A262S, V267L, G268D, Q271R, A292V, L293M, D294I, P295H, L296F, S297W, C301F, P337R, V367F, L368P, V382L, R408I, A411D, E471Q, S477N, E484Q, P491L, Q506H, P507H, P507S, Y508N, L518I, H519Q, A520S, A570V, T572I, D574Y, E583D, G594S, Q607L, Q613H, D614G, V622F, A653V, E654Q, H655Y, Y660F, A672D, Q675-, Q675H, Q675R, T676-, Q677H, Q677-, T678-, N679-, S680-, P681-, R682Q, R682W, R682-, A684V, A684-, R685-, S686-, V687-, A688V, A688-, Q690H, A701V, A706S, M731I, L754F, R765L, A771S, V772I, Q774R, E780D, A783S, K786N, T791I, K795Q, G798A, P809S, T827I, A829T, I834V, A879S, S884F, A892V, M900I, A930V, D936Y, L938F, S939F, S939Y, S974P, Q1002E, L1063F, T1077I, H1083Q, D1084Y, H1088R, F1089V, V1104L, F1109L, D1139Y, S1147L, D1153Y, G1167S, K1181R, R1185H, N1187K, K1191N, E1195Q, Q1201K, V1230E, C1243F, G1246A, D1259Y

Oceania: L5F, T22I, T29I, H49Y, S50L, T76I, S98F, I128F, D138H, M153I, L176F, D178N, E180K, I210-, S221L, S247R, D253G, W258L, A262T, G283V, I468T, E471Z, S477N, V483F, G485R, Q498Z, T500I, N501Y, H519Q, P561L, E583D, D614G, P621S, A626V, Q675K, Q675H, S704L, M731I, T791I, D808B, P812S, D839N, A846V, I931V, D936Y, K1073N, 1079S, G1124V, D1163G, C1254F, D1260N

QMT57644, QMT57656, QMT57692, QMT94108, QMT94756, QMT94780, QMT95200, QMT95308, QMT95356, QMT95368, QMT95452, QMT95488, QMT95560), five from Asia (QLL26046, QLI49781, QLF98260, QKY60061, QKV26077) and two from Africa (QKR84420, QKS66940).

NxT/S represents the glycosylation site sequence motif. Deletions of the glycosylation sites N331 and N343 in the spike protein are known to reduce infectivity, revealing the importance of glycosylation for viral infectivity. [9] The examination of three-dimensional electron microscopy structure of human SARS-CoV-2 spike protein (PDB code: (NCBI accession code: QKU31901.1) is associated with the N17K mutation and (QKV40463.1) is associated with the T1136I mutation, which lies in the motif glycosylated at N1134. Among the spike proteins from Asia, QLG99547.1 is associated with the S151I mutation and QLA46612.1 with the S151G mutation; these mutations lie in the motif glycosylated at N149.
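Whether a substitution removes an NxT/S motif, as in the S151I and S151G cases at the N149 site, can be checked with a short Python sketch. It is a hedged illustration only. The helper names are hypothetical, the example peptide is an invented toy string rather than the spike sequence, only substitutions are handled (deletions would shift the numbering), and the pattern encodes exactly the NxT/S motif as written in the text, with no further constraint on x.

```python
import re


def sequon_positions(seq: str) -> list[int]:
    """Return 1-based positions of N in every N-x-T/S motif (overlaps included)."""
    return [m.start() + 1 for m in re.finditer(r"(?=N.[ST])", seq)]


def apply_substitution(seq: str, label: str) -> str:
    """Apply a substitution label such as 'S5I' to seq (1-based position)."""
    ref, pos, alt = label[0], int(label[1:-1]), label[-1]
    if seq[pos - 1] != ref:
        raise ValueError(f"{label}: sequence has {seq[pos - 1]} at position {pos}")
    return seq[:pos - 1] + alt + seq[pos:]


def lost_sequons(seq: str, label: str) -> list[int]:
    """Return positions of N-x-T/S motifs present before but absent after the substitution."""
    before = set(sequon_positions(seq))
    after = set(sequon_positions(apply_substitution(seq, label)))
    return sorted(before - after)


# Toy peptide (invented): N at position 3 starts the motif N-K-S
toy = "AANKSAA"
lost_by_third_residue = lost_sequons(toy, "S5I")   # motif destroyed at its S/T position
lost_by_asparagine = lost_sequons(toy, "N3K")      # motif destroyed at the N itself
lost_by_middle = lost_sequons(toy, "K4R")          # middle residue change keeps the motif
```

The first two toy substitutions correspond to the two ways a motif is lost in the text: a change at the third position (as with S151I relative to N149, or T1136I relative to N1134) and a change of the asparagine itself (as with N17K).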
The human SARS-CoV-2 spike proteins comprised 400 distinct muta-

Interactions of spike protein residues (cyan) with ACE-2 (green) side-chain residues (yellow) that are within 3.2 Å in the crystal structure of the human SARS-CoV-2 spike protein RBD complexed with the ACE-2 receptor (PDB code: 6LZG). The spike protein mutated residues are shown in red. RBD, receptor binding domain.

A new coronavirus associated with human respiratory disease in China
Angiotensin-converting enzyme 2 (ACE2) as a SARS-CoV-2 receptor: molecular mechanisms and potential therapeutic target
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Pangolins harbor SARS-CoV-2-related coronaviruses
Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins
Human coronavirus spike protein-host receptor recognition
Structural and functional basis of SARS-CoV-2 entry by using human ACE2
Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV
The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity
Mutational frequencies of SARS-CoV-2 genome during the beginning months of the outbreak in USA
Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus
Controlling the SARS-CoV-2 outbreak, insights from large scale whole genome sequences generated across the world
Geographic and genomic distribution of SARS-CoV-2 mutations
Amino acid mutations in the protein sequences of human SARS CoV-2 Indian isolates compared to Wuhan reference isolate from China
Mapping mutations in proteins of SARS CoV-2 Indian isolates on to the three-dimensional structures
Full-genome sequences of the first two SARS-CoV-2 viruses from India
A virus that has gone viral: amino acid mutation in S protein of Indian isolate of coronavirus COVID-19 might impact receptor binding, and thus
MAFFT multiple sequence alignment software version 7: improvements in performance and usability
NGPhylogeny.fr: new generation phylogenetic services for non-specialists
The protein data bank
Pymol: an open-source molecular graphics tool
Evolutionary relationships and sequence-structure determinants in human SARS coronavirus-2 spike proteins for host receptor recognition

How to cite this article: Guruprasad L. Human SARS CoV-2 spike protein mutations

LGP thanks School of Chemistry, University of Hyderabad for research facilities and ABREAST (https://www.abreast.in) for making available the computer programs used in this work for the identification and analyses of the mutations.

The author declares no conflict of interest.

The peer review history for this article is available at https://publons.com/publon/10.1002/prot.26042.

All data is available in the manuscript and Table S1.

https://orcid.org/0000-0003-1878-6446