key: cord-0846888-4us1qr3p
authors: Alai, Shweta; Gujar, Nidhi; Joshi, Manali; Gautam, Manish; Gairola, Sunil
title: Pan-India novel coronavirus SARS-CoV-2 genomics and global diversity analysis in spike protein
date: 2021-03-19
journal: Heliyon
DOI: 10.1016/j.heliyon.2021.e06564
sha: 71f8c3fb10997da63b57c38817802db30cd0f967
doc_id: 846888
cord_uid: 4us1qr3p

The mortality rates due to COVID-19 have been found disproportionate globally and are currently being researched. India mortality rate with a population of 1.3 billion people is relatively lowest to other countries with high infection rates. Genetic composition of circulating isolates continues to be a key determinant of virulence and pathogenesis. This study aimed to analyse the extent of divergence between genomes of Indian isolates (n=2525 as compared to reference Wuhan-1 strain and isolates from countries showing higher fatality rates including France, Italy, Belgium, and the USA. The study also analyses the impact of key mutations on interactions with angiotensin converting enzyme 2 (ACE2) and panel of neutralizing monoclonal antibodies. Using 1,44,605 spike protein sequences, global prevalence of mutations in spike protein was observed. The study suggests that SARS-CoV-2 genomes from India share consensus with global trends with respect to D614G as most prevalent mutational event (81.66% among 2525 Indian isolates). Indian isolates did not reported prevalence of N439K mutation in receptor binding motif (RBM) as compared to global isolates (0.54%). Computational docking and molecular dynamics simulation analysis of N439K mutation with respect to ACE 2 binding and reactivity with RBM targeted antibodies viz., B38, BD23, CB6, P2B-F26 and EY6A suggests that variant have relatively higher affinity with ACE 2 receptor which may support higher infectivity. The study warrants large scale monitoring of Indian isolates as SARS-CoV-2 virus is expected to evolve and mutations may appear in unpredictable way.

The current 2019 coronavirus pandemic is caused by a positive RNA virus, referred to as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1] . Despite all the global efforts, the virus continues to spread and infect a large population and has affected 218 countries, with more than 101,053,721 confirmed cases, 2,182,867 deaths globally (WHO, 29 Jan 2021) [2] . One of the most striking features of COVID-19 is the variable mortality rate across countries. Global case fatality rates (CFR) varied from 0.06 to 18.94%. France recorded the highest CFR of 18.65%. Belgium and Italy had double-digit CFR rates, while the United States, which experienced the largest outbreak, had a CFR of 5.76%. India, the second most populated country, reported 7,761,312 cases in all with 117,306 deaths (WHO, October 23) . Trends in mortality rates in India are intriguing, despite the second largest reported cases of infection ("https://covid19.who.int/table"), second after the USA (more than 76 lakh) in terms of total confirmed cases. Although numerous hypotheses are proposed, one hypothesis includes the variability of circulating viral strains. Monitoring this variability in key viral genomes and immune-dominant antigens may explain the infectivity, pathogenesis, and transmission of SARS-CoV-2 [10] .

Genome structure of SARS-CoV-2 is characterized by a size of around 30 kB with a large gene region (ORF) encoding non-structural proteins (NSP) and the genes encoding spike (S) glycoprotein, envelope protein, membrane (M) glycoprotein and nucleocapsid (N) proteins [7] . SARS-CoV-2 transient glycoprotein contains a region, the receptor binding domain (RBD), which specifically identifies the human angiotensin conversion enzyme (ACE2) as its receptor [8] . Antibodies targeting spike proteins, especially against the RBD are neutralizing in function [9] . Mutations that lead to variations in these hot spots may affect infectivity, pathogenesis, and transmission of SARS-CoV-2 [10, 72, 79] .

Rapid sequencing of the SARS-CoV-2 genome globally has greatly facilitated efforts to understand global epidemiology [5, 71] . Since the first SARS-CoV-2 reference genomic sequence was reported in January 2020, more than 1,58, 776 global sequences have been deposited into GISAID (23/10/2020, https://www.gisaid.org).

We report here comparative genomics of Pan India isolates (n= 2525) vis a vis reference genome and isolates from countries showing high fatality rates including France, Italy, Belgium, and the USA. The study also reports deep scanning of mutational events and their global frequency at each amino acid residue in spike protein using 1,44,605 global sequences submitted at GISAID. In-silco studies on impact of mutations on structure and binding affinity to critical human ACE2 receptor [73] and RBM targeted antibodies B38, BD23, CB6, P2B-F26 and EY6A were also carried out using molecular docking and simulation analysis [ 11] .

A total of 2525, SARS-CoV-2 complete, and high coverage viral genome sequences were downloaded from Global Initiative on Sharing All Influenza Database (GISAID) platform (23/10/2020). A total of 11,302 genome sequences from countries representing France, Belgium, Italy, India, and the USA (North America) were downloaded from GISAID (https://www.gisaid.org/) (collected till 6/0/2020) and compared for dominant clade analysis [16] . Viral genomes with human hosts were selected, excluding low coverage and incomplete (<29,000 nucleotides) genomes. For phylogenetic analysis 863 complete and high coverage sequences from India were used and compared with sequence from GISAID as hCoV-19/Wuhan/WIV04/2019/EPI_ISL_402124 reference genome. The reported and novel mutations were catalogued. (Additional File1). Lineages were predicted using Pangolin COVID-19 lineage assigner (https://pangolin.cog-uk.io/) [17] . The 1,44,605 spike protein sequences were retrieved from GISAID and compared with reference spike protein sequence (SARS-CoV-2 spike glycoprotein (EPI_ISL_402124).

Sequences from GISAID were downloaded and consensus sequences were aligned using MAFFT [18] . A maximum-likelihood phylogenetic tree was constructed using IQ-TREE and visualized with iTOL [19, 20] 3. Structure Prediction and Docking analysis Impact of mutations on binding affinity towards human ACE2 receptor was performed using docking and MD simulations analysis. The structures of wild type SARS-CoV-2 Spike RBD domain complexed with host ACE2 receptor was retrieved from Protein Data Bank (PDB ID: 6LZG) [21] . Spike protein was mutated using UCSF Chimera 1.10 software [22] . Mutant structures were retrieved from Chimera and energy was minimized for further analysis. All the wild and mutant structures were docked using ZDOCK docking server [23] . Docked structures were subjected to energy minimization and the binding energy was calculated using PDBePISA [24] . The structures of wild type SARS-CoV-2 Spike RBD domain complexed with antibodies B38,BD23,CB6, P2B-2F6, EY6A was retrieved from Protein Data Bank (PDB ID: 7BZ5, 7BYR, 7CO1, 7BWJ and 6ZER) respectively [12] [13] [14] 25] .

4. Modelling of the wild type and N439K variant with respect to ACE-2 binding

The structures of wild type SARS-CoV-2 Spike RBD domain complexed with host ACE2 receptor was retrieved from the Protein Data Bank (PDB ID: 6LZG) [21] . Spike protein was mutated using UCSF Chimera 1.10 software [22] . These structural models were used for molecular dynamics simulations. All molecular dynamics (MD) simulations were performed with GROMACS 2019 [74] software using CHARMM36 forcefield [75] . Explicit TIP3P water model was used to represent the water molecules. To neutralize the systems Na + and Clions were added. Energy minimization was performed using the Steepest Descent algorithm for 50000 steps. The system was equilibrated under the NVT conditions for 100 ps followed by NPT equilibration of 1000 ps. Temperature coupling was applied to maintain the system temperature at 300 K using the velocity rescaling algorithm. Semi-isotropic pressure was maintained using Parrinello Rahman pressure coupling with a pressure of 1 bar. Atomistic simulations were run for 100 ns for both the systems. Trajectories were viewed and analysed using Visual Molecular Dynamics (VMD) [75] tool. Standard GROMACS tools were used to plot the Root Mean Square Deviation (RMSD) and Root Mean Square Fluctuation (RMSF) of the system. Hydrogen bond analysis was performed in VMD using Hydrogen bonds plugin. Binding free energies were calculated using MM-PBSA program in GROMACS [76] .

1.1 Diversity within Indian SARS-CoV-2 genomes compared with reference strain.

To understand the diversity in the SARS-CoV-2 isolates from India as compared to reference strain (hCoV-19/ Wuhan/ WIV04/ 2019/ EPI_ISL_402124), we analysed a total of 2525 SARS-CoV-2 genomes retrieved from global dataset GISAID (23/10/2020) [26] . The samples were isolated between February -October 2020. A total of 2525 complete or near complete genome sequences with high coverage deposited from India were used in this study. The complete characteristics of 2525 SARS-CoV-2 genomes sequences are provided in additional file 1.

The prevalent markers of SARS-CoV-2 diversity observed in Indian genomes are shown in Figure 1 . A total of 954 variants in Indian genomes were observed, 659 were observed as a single event. A23403G and C14408T were observed at higher frequencies (>50%) in all the genomes, while G25563T, C13730T, G11083T C6312A, C241T, C3037T, G11083T, C13730T, C28311T, C6312A and C23929T mutations were predominated (>24% frequency) in Indian genomes (Additional File1).

Higher frequency mutations were mapped for key SARS-CoV-2 protein structures, including nucleocapsid, nsp3, nsp12, nsp14, nsp2 and spike protein ( Figure 2 ). Out of 954 events, 585 events were found in open reading frame1ab (ORF1ab), which is the longest ORF occupying two thirds of the entire genome. ORF1ab is transcribed into a multiprotein and subsequently cleaved into 16 non-structural proteins (NSPs). Of these proteins, NSP12 has the largest number of variants including, P323L as dominant (n=1998) followed by A97V (n=328). D614G mutation in spike protein, which is considered as a prevalent global mutation, was present in 2062 of the 2525 (81.66%) sequenced genomes [27, 28] .

The SARS-CoV-2 collection from India submitted at GISAID consisted of ~2525 isolates after excluding low coverage and incomplete genomes sequences (< 29,600 bases). The isolates belonged to 17 different states: 369 (30.75%) isolates from Gujrat (GJ), 169 (14.08%) isolates from Telangana (TS), Odisha 166 (13.83%) and 140 (11.66%) isolates from Maharashtra (MH) (Additional File 1). The mutational patterns prevalent in each state are presented in Figure 3 . Maharashtra recorded highest number of cases (n=2, 92,589) followed by Tamil Nadu (n=16, 0907) and Delhi (n=12,0107) as of 17 July 2020. The dominant mutation observed in Maharashtra, Tamil Nadu and Delhi was D614G, NSP12 (P323L) and NSP12 (A97V) respectively. Prevalent mutation observed in Maharashtra was in spike protein D614G (75%), followed by non-structural protein NSP12 (P323L) and in nucleoprotein (G204R, R203K). Prevalence of D614G observed in Tamil Nadu and Delhi was less as compared to Maharashtra. The statewise list of mutations observed in country is provided in additional file 2.

Regardless of fourth highest in the number of reported cases, Gujrat has highest death rate in the country as 6.2% which is double than the global mortality rate evaluated by WHO which 3.2% [2] To understand the global positioning, we compared SARS-CoV-2 genomes from India along with the 50,500 global sequences from GISAID [26] . Global characterization of the SARS-CoV-2 variants from all ~50,500 viral genomes sequences, suggested three major variants groups as 1,771 isolate Group 1 "C241T, C3037T, C14408T, A23403G, G28881A, G28882A, G28883C", the 1,458-isolate Group 2 "C241T, C1059T, C3037T, C14408T, A23403G, G25563T", and the 727-isolate Group 3 "C241T, C3037T, -C14408T, A23403G". The four most common mutations were (C241T/5UTR in orf1ab, C3037T in orf1ab (F924F, C14408T P4715L), and D614G in spike protein [29] . Mutations C241T, C3037T, A23403G and C14408T were observed at higher frequencies (>50%) in Indian genomes consistent with global trends. However, co-evolving mutations observed in developed countries associated with clades G,GH and GR such as G25563T (ORF3a), C26735T (NSP14) and C18877T (M protein) were observed less frequently (<15%) in Indian genomes. The most common variant observed globally, 3037C > T, ORF1ab: P4715L, RdRp: P323L; and D614G mostly reported from Europe and the USA were also observed in Indian population [30] . Other key variants including ORF3a: Q57H, ORF1ab: T265I (NSP3: T85I), ORF8: L84S, N203 (204del-insKR), ORF1ab: L3606F (NSP6, L37F) were also observed in genomes retrieved from India [31] .

To study, SARS-CoV-2 classification for their multitude of distribution across geographical locations of India, vis-à-vis states of India, we explored the lineages spread across country. We further observed the prevalence of different clades within high fatality rate populations.

The global isolates were assigned to major GISAID clades according to the marker variant and existing GISAID clades designations [32] [33] [34] . These clades were characterized in the context of marker variant relative to the reference strain (hCoV-19/Wuhan/WIV04/2019/ EPI_ISL_402124) namely, clade S, L, V, G, GH, GR, and other clades. Clades are characterized as S (C8782T, T28144C, NS8-L84S), L (C241, C3037, A23403, C8782, G11083, G25563, G26144, T28144, G228882), V( NSP6-L37F, NS3-G251V), G (S-D614S), GH (S-D614S, NS3-Q57H), S (S-D614S, NG204R). The most represented clades were clade GR observed in 11,298 complete genomes (GISAID, 18/06/2020) [35] . The clade was more prevalent in genomes sequenced from South America and Europe. The second most frequent clade represented was GH which was observed frequently from the genome sequences in North America and Africa continents [36] . The most represented clade from the Asia region was categorized as ancestral type O, which is similar with originating strain from Wuhan, China [6] . The clades are colour coded in squares and shown above chart in the diagram. As transmission across country is increasing trends in clade predominance was observed during these six months of the pandemic in the country. Ever since the emergence of outbreaks clade "O" is still prevalent, but the clades G, GH and GR were rapidly increasing since ease in social restrictions such as lockdowns and migration of workers across country. The wild type of clade "O" showed gradual decrease in the number of incidences over a time ( Figure 5 ).

Further to study association of fatality rate and predominant clade, we compared a total of 10,188 complete and high coverage genome sequences from countries reported higher fatality rates across the globe such as France, Italy, Belgium, and the USA , showed a dominant clade G in France, Italy, Belgium, and clade GH in the USA (Figure 4 ) (Additional file 3). Clade G and GH identified by the signature marker D614G variant in spike protein which is increasing its frequency across globe and specially developed countries. Recently reported as per Korber et al, D614G mutation increased infectivity of the virus [27] . India although reportedly third highest country in the world in terms of infected cases of COVID-19, fatality rate is still less as compared to the many developed world (https://www.mohfw.gov.in/). We observed a dominant clade ancestral type "O" in India was dominant during early phase of pandemic. We observed clade G, GH and GR as frequent clade in Gujrat, West Bengal, and Madhya Pradesh respectively (Figure 4 ). Enrichment and diminution of specific clades was observed in certain states of India [37] . Overall, within genome sequences from high fatality rates of COVID-19 regions, we observed a mutation D614G as prevalent with its resembling clades G, GH and GR.

Phylogenetic analysis based on whole genome alignment of Indian genomes was performed using maximum likelihood tree rooted against reference hCoV-19/Wuhan/ WIV04/ 2019/ EPI_ISL_402124 ( Figure 6 ). Phylogenetic analysis of 864 genomes was done as per the definitions of the PANGOLIN lineage and GISAID clades [32] . The overall 17 different lineages were observed (A, A.1, A.2, A.3, B, B.1, B.1.1, B.1.18, B.1.2, B.1.5, B.136, B.2 

Spike glycoprotein (S protein) of SARS-CoV-2 mediates receptor recognition and membrane fusion with the host cell [38, 39] . During viral infection, the trimeric S protein is cleaved into S1 and S2 subunits and S1 subunits are released which contains the receptor binding domain (RBD), which directly binds to the peptidase domain (PD) of angiotensin-converting enzyme 2 (ACE2) [40] . Whereas S2 is responsible for membrane fusion. When S1 binds to the host receptor ACE2, another cleavage site on S2 is exposed and is cleaved by host proteases, a process that is critical for viral infection [39] [40] [41] . Many preprint studies reported different mutations at protein level and predicted its possible effect on binding affinity with host receptor [42] [43] [44] . Here, we scanned mutation at each single residue in 1273aa long protein sequence and explored their global prevalence. We also mapped mutation at different sites and regions on spike protein to understand its impact on its role in pathogenicity and disease transmission.

A total 1,44,604 spike glycoprotein sequences of SARS-CoV-2 genomes reported globally were retrieved from the GISAID consortium (GISAID,23/10/2020) and aligned using MAFFT with reference SARS-CoV-2 spike glycoprotein (EPI_ISL_402124) (Additional file 5) [45] . Multiple sequence alignment was studied to observe amino acid sequence variation at each residue and complete list of all amino acid variations reported (1273 aa) with number of events observed globally till were listed in additional file 5. Global analysis suggested a total 3897 mutational events reported in spike protein sequence where 1935 were observed at only a single incidence (n=1). Among all the variations, twelve (L5F, L8V, L18F, R21I, L54F, N439K, D614G, A829T, A879S, D936Y, G1124V, P1263L) were dominant (greater than 1000 genomes) ( Figure 6 ). Only 2 (R21I, L54F) were located at N-terminal domain (NTD), 3 variations were found in signal peptide (L5F, L8V, L18F). Single variations (N439K) were found at the receptor-binding domain (RBD) while three variations (A 829T, A879SV, and D936Y) were found at heptad repeat 1 (HR1) domain. Single variations were found in signal sub-domain-2 (D614G), sub-domain-3 and heptad repeat 2 domain (G1124V) (D1168H), and cytoplasmic tail domain (P1263L) each. Only a single variant D614G reported in 85.34% of the genome sequenced globally in 87 countries [46] (Figure 7 ).

Total 29 mutations were reported in receptor binding domain of spike protein of SARS-CoV-2. Complete list of mutation, events and its origin are listed in Table 1 . RBD domain showed only 0.54% of total mutational events (including n=1) where RBM showed 0.25% frequency of mutations. The receptor-binding motif (RBM) is the main functional motif in RBD and is composed of two regions (region 1 and 2) that form the interface between the S protein and hACE2 [40] .

Computer modelling interactions have predicted the receptor binding motif (438-506) which act as main functional motif within RBD domain contains essential seventeen key residues that are potentially involved in binding with host cell receptor ACE2 (Table 2) [39, 40] . We scanned mutation at these seventeen key residues Table 2 which may impact binding of virus and host receptor [47] . Mutations at contacting residues will be important to consider for the J o u r n a l P r e -p r o o f vaccine and therapeutic targeted against S protein (RBD region) [9, 48] . Among these seventeen residues, twelve residues reported unique variations but at exceptionally low incidence (<2%) which can be due to sequencing errors or error prone replication of viruses within hosts. We observed key contacting residues were conserved amongst 99% of genomes of SARS-CoV-2 sequences.

To estimate the functional changes suggested by the RBM mutations, we used the prototype SARS-CoV-2 RBD domain and ACE2 and compared RBM mutants to assess their binding energy to human ACE2. The complex structure of the SARS-CoV-2 S-protein RBD domain and human ACE2 was obtained protein data bank (PDB ID: 6LZG) [49] . Mutant amino acids of the SARS-CoV-2 RBD mutants were directly replaced in the model, was used as a template modelled using SWISS-MODE [50] . We screened and evaluated interaction analysis of mutant N349K as it was most prevalent mutation observed in RBD-RBM region. RBM mutant types (N439K) exhibited significantly lowered ΔG, suggesting a slight increased affinity to human ACE2; compared to the prototype. The ΔG of these mutant types were around -11.2 kJ/mol, lower than the prototype strain (-11 kJ/mol), where 0.093 kcal.mol-1.K-1 (increase of molecule flexibility) was observed with decrease in two hydrogen bonds.

3.5. Simulations predict that N439K variant binds tighter to human ACE2.

As described in the Methods section, the structure of wild type Spike Receptor Binding Domain (RBD) complexed with host ACE2 receptor was retrieved from the PDB (PDB ID: 6LZG) [21] . The crystal structure of the Spike RBD complexed with ACE2 receptor reveal that the interactions are dominated by polar contacts mediated by hydrophilic residues [21] . Even a single mutation introduced in these contacts, for example, K353A could abrogate the interactions between the two proteins, highlighting the importance of the polar contacts. Thus, to understand the effect of the N439K variation, the mutant structure was built using Chimera software. Further, we performed 100 ns atomistic molecular dynamics simulations of the wild type and variant complexes. To quantify the structural variations in the protein complexes over time, the RMSD and RMSF was plotted. Both analyses indicated that the wild type of protein complex was more stable than the variant (Figure 8 A and B) . On the contrary, a differential plot of the decomposed residue-wise binding free energies (Binding Free Energy of Residue Variant -Binding Free Energy of Residue Wild Type ) indicated a large change at position 439 of the Spike RBD, indicating that the residue contribution to free energy at this position was much more favorable in the variant than the wild type (Figure 8  C) . A closer analysis indicates that in the presence of Lysine at position 439 of Spike protein forms water mediated hydrogen bonds with residues Gln 325 and Glu 329 of the ACE2 protein which are more distant in the presence of Asn at position 439 ( Figure 8 D and E) . Thus, simulations coupled with binding free energy calculations suggest that the N439K variant binds ACE2 tighter than the wildtype.

The RBD is the dominant target of neutralizing antibodies to SARS-CoV-1 also, SARS-CoV-recent studies [53] [54] [55] . It is unclear to what extent the RBM will evolve to escape neutralizing antibodies in a manner evocative of vaccine escape mutants. Many antibodies have epitopes that overlap the RBD ACE2 contact interface and are therefore strongly constrained by mutation effects RBM region [ Figure 8 ], such as B38, BD23, CB6,P2B-F26, EY6A and REGN 10933 [12] [13] [14] [15] 51, 54, 56] . To better define the RBM's mutations N439K for antibody escape, we examined structural and binding constraint in the epitopes of antibodies with prototype structures of SARS-CoV-2 RBD.

Structural binding predictions suggests that prevalent mutation in spike RBD region, N439K did not overlap with currently characterized epitopes of neutralizing antibodies recovered from human convalescent patient [47] [48] [49] . The importance of tracking such mutations within RBM region is demonstrated by a recent study that identified mutations enabling escape from RBD-directed neutralizing antibodies [60] . Our data indicate that the none of the currently characterized mutations within RBM region escape the five RBM targeted neutralizing antibodies used in this study.

COVID-19 pandemic since its emergence in China has been showing varied patterns of transmission and infectivity globally [61, 62] . Studies on global diversity and real time tracking of mutations will be important in the disease control strategies [63] [64] [65] . India represents an excellent opportunity for such studies for its size, observing a strictest complete lockdown and varied environmental conditions. Many studies have reported the genomic characterization of Indian isolates till date with limited sample size. We present here a complete comprehensive genomic characterization of pan India SARS-CoV-2 genomes till 23/10/2020 with more than eight months of pandemic, with an objective of understanding its global positioning. The study suggests similarity of Indian genomes with predominant types found across different parts of the globe. It was noted that D614G mutant strains in India increased rapidly during eight months of the pandemic and is present in more than 81% of the genome sequences reported from India. The recent studies by Korber et al [27] and WHO collaborating study in China demonstrated D614G strains are 10-fold more infectious than the original Wuhan-1 strain [69] . There are reports suggesting associations of increased prevalence of G614 variant to the higher case fatality rates in 12 different countries [70] . In India, 81.66 % of genomes exhibited G614; however, its association with increased CFR could not be discovered in India. Distinctively, Indian isolates also did not displayed emergence of co-evolving mutations associated with mutant strain G25563T (ORF3a), C26735T (NSP14) and C18877T (M protein), which are observed globally. The study also attempted to look for such correlations in Indian states which exhibited highest mortality. Yet even in such states, the association with D614G could not be found.

1.351, famously known as South African Variant have been reported. The variant has spread to 31 countries and is shown to be responsible for higher fatality rates. Recently, four cases with the South Africa variant of SARS-CoV-2 were reported in India. Also, one case of Brazil variant known as 20J/501Y.V3 or P.1 lineage was also reported recently. India is reporting a spurt in the number of cases in some states. However, it needs to be ascertained whether this increase is related to emergence of UK, South African, Brazilian variants, or new Indian variants. To this effect, Indian Govt have launched Indian SARS-CoV-2 Genomic Consortia (INSACOG) which will accelerate virus surveillance, genome sequencing and characterization through a multi-laboratory network.

Many therapeutics are currently being pursued with the goal of developing protective immunity against the SARS CoV-2 virus [55, 72] . Notably, immunogens and diagnostics are under development targeting spike protein sequence from the Wuhan reference strain. Several spike protein mutant strains are being reported so far. This study analysed 1, 44,605 spike protein sequences reported globally. Despite of a total of 3895 mutational events observed in spike protein sequences, frequency of events and its occurrence is still low in total available sequences except for D641G. RBD being an important region in host-virus interaction and thus for development of therapeutics, frequency of mutational events observed was very low. The prevalence of mutations in RBD was found low (~ 0.57%) as compared to genome sequences reported globally. Single variant N439K at RBD region was observed in 0.52% (n=293) of genomes. Notably, N439K mutant is reported to show increased affinity binding affinity with hACE2. Interestingly, Indian isolates did not show presence of this variant. It is further noted that this variant is reported majorly few countries, including Scotland, England, and Romania. This led us to study the impact of observed N439K mutation within RBM region using computational docking and MD simulations approaches. Using a panel of RBM directed monoclonal antibodies viz. B38, BD23, CB6, P2B-F2F, and EY6A [12-15, 51, 54,56] , study predicts slightly higher affinity to ACE receptor. This needs further confirmatory studies using in-vivo and in-vitro studies.

One of the reasons for lack of association of genomic signatures of Indian isolates with fatality rates could be low sample size. Large scale monitoring of Indian isolates is required for conclusive correlations. The data from this study along with serosurveillance data from India and clinical meta data will be useful to decipher associations between viral clades and severity. The approach and global positioning of Indian isolates presented in this study will be helpful for monitoring the pandemic in India. Several other factors for low susceptibility of Indians to COVID-19 are being also being proposed. Such factors include host innate immunity status, genetic diversity in immune responses, epigenetic, ABO blood group association and universal BCG immunization need to be investigated with the clinical studies.

The study suggests that SARS-CoV-2 virus diversity in India is consistent with global trends with respect to most prevalent mutations. No major mutation event was reported in RBD region of studied Indian isolates. Global spike protein mutation prevalence analysis suggests impact of some mutations on RBD and ACE receptor interaction which may affect virus infectivity. The study findings are cautiously optimistic that viral diversity will not be hindrance to current vaccine development strategies and will be broadly effective against current isolates.

Supplementary Material: Additional File 1: The complete characteristics of 2525 SARS-CoV-2 genomes sequences used in the study, Additional file 2: The state wise list of mutations observed in country, Additional file 3: The clade and lineage of SARS-CoV2 genomes from India, Additional file 4:Lineages observed India are clustering with sequences from Asia, Europe, the USA, and other Asian countries indicating multiple introductions of multiple lineages of the virus into the country, Additional file 5:A total 1,44,605 Spike glycoprotein sequences of SARS-CoV-2 genomes reported globally were retrieved from the GISAID consortium.

-Acknowledgments: The authors gratefully acknowledge Serum Institute of India Pvt Ltd, Symbiosis International University (SIU) and Pune University for providing the assistance for carrying out this study. We also acknowledge the authors and originating and submitting laboratories of the genome sequences from GISAID database. The authors are also thankful to PARAM Brahma supercomputing resources at IISER, Pune. The findings and conclusions in this study are those of authors and do not necessarily represent the official position of the company and or institutes. 

SARS-CoV-2 is an appropriate name for the new coronavirus. The Lancet

World Health Organization. WHO Coronavirus Disease (COVID-19) Dashboard

Geneva: World Health Organization

The consequences of human actions on risks for infectious diseases: a review. Infection ecology & epidemiology

Genomic characterization, and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet

The establishment of reference sequence for SARS-CoV-2 and variation analysis

Coronavirus genomics and bioinformatics analysis. viruses

The SARS coronavirus S glycoprotein receptor binding domain: fine mapping and functional characterization. Virology journal

The receptor binding domain of the viral spike protein is an immunodominant and highly specific target of antibodies in SARS-CoV-2 patients

Human coronaviruses: a review of virus-host interactions

A noncompeting pair of human neutralizing antibodies block COVID-19 virus binding to its receptor ACE2

Human neutralizing antibodies elicited by SARS-CoV-2 infection

Potent neutralizing antibodies against SARS-CoV-2 identified by high-throughput singlecell sequencing of convalescent patients' B cells

Structures of Human Antibodies Bound to SARS-CoV-2 Spike Reveal Common Epitopes and Recurrent Features of Antibodies. bioRxiv: the preprint server for biology

Nextstrain: real-time tracking of pathogen evolution

A dynamic nomenclature proposal for SARS-CoV-2 to assist genomic epidemiology. bioRxiv

MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization

IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies

Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation

Structural and functional basis of SARS-CoV-2 entry by using human ACE2

UCSF Chimera-a visualization system for exploratory research and analysis

ZDOCK server: interactive docking prediction of protein-protein complexes and symmetric multimers

PDB-wide identification of biological assemblies from conserved quaternary structure geometry

A human neutralizing antibody targets the receptor binding site of SARS-CoV-2

Global initiative on sharing all influenza data-from vision to reality

Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus

Excessive G-U transversions in novel allele variants in SARS-CoV-2 genomes

Comprehensive variant and haplotype landscapes of 50,500 global SARS-CoV-2 isolates and accelerating accumulation of countryprivate variant profiles. bioRxiv

Variant analysis of SARS-CoV-2 genomes. Bulletin of the World Health Organization

Controlling the SARS-CoV-2 outbreak, insights from large scale whole genome sequences generated across the world

On the origin and continuing evolution of SARS-CoV-2

Phylogenetic Clustering by Linear Integer Programming (PhyCLIP)

Distinct Viral Clades of SARS-CoV-2: Implications for Modeling of Viral Spread

Geographic and Genomic Distribution of SARS-CoV-2 Mutations

Genomic analysis of SARS-CoV-2 strains among Indians returning from Italy, Iran & China, & Italian tourists in India

Receptor recognition by the novel coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS coronavirus

Structural basis of receptor recognition by SARS-CoV-2

Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor

Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2

Analysis of the mutation dynamics of SARS-CoV-2 reveals the spread history and emergence of RBD mutant with lower ACE2 binding affinity

Structural and Functional Implications of Spike Protein Mutational Landscape in SARS-CoV-2. bioRxiv

The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity. bioRxiv

Making sense of mutation: what D614G means for the COVID-19 pandemic remains unclear

Key residues of the receptor binding motif in the spike protein of SARS-CoV-2 that interact with ACE2 and neutralizing antibodies

Protein structure analysis of the interactions between SARS-CoV-2 spike protein and the human ACE2 receptor: from conformational changes to novel neutralizing antibodies. bioRxiv

SWISS-MODEL: an automated protein homology-modeling server

A highly conserved cryptic epitope in the receptor binding domains of SARS-CoV-2 and SARS-CoV

Perspectives on therapeutic neutralizing antibodies against the Novel Coronavirus SARS-CoV-2

Potent binding of 2019 novel coronavirus spike protein by a SARS coronavirus-specific human monoclonal antibody. Emerging microbes & infections

Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies

Structure of severe acute respiratory syndrome coronavirus receptor-binding domain complexed with neutralizing antibody

A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARS-CoV-2. Cell host & microbe

Bioinformatic prediction of potential T cell epitopes for SARS-Cov-2

Identification of SARS-CoV-2

Vaccine Epitopes Predicted to Induce Long-term Population-Scale Immunity

Antibody cocktail to SARS-CoV-2 spike protein prevents rapid mutational escape seen with individual antibodies

The SARS-CoV-2 outbreak: what we know

Characterizations of SARS-CoV-2 mutational profile, spike protein stability and viral transmission. Infection, Genetics and Evolution

Betacoronavirus genomes: How genomic information has been used to deal with past outbreaks and the COVID-19 pandemic

Analysis of viral diversity for vaccine target discovery. BMC medical genomics

Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California

Antiviral mechanisms of candidate chemical medicines and traditional Chinese medicines for SARS-CoV-2 infection. Virus research

Comprehensive analysis of drugs to treat SARS-CoV-2 infection: Mechanistic insights into current COVID-19 therapies

Remdesivir is a direct-acting antiviral that inhibits RNA-dependent RNA polymerase from severe acute respiratory syndrome coronavirus 2 with high potency

COVID 19 in INDIA: Strategies to combat from combination threat of life and livelihood

Epidemiological studies on COVID-19 pandemic in India: Too little and too late

SARS-CoV-2 pandemic in India: what might we expect?

Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune crossreactivity with SARS-CoV

The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity

SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate

Design of a peptide-based subunit vaccine against novel coronavirus SARS-CoV-2

Can SARS-CoV-2 Accumulate Mutations in the S-Protein to Increase Pathogenicity

High throughput designing and mutational mapping of RBD-ACE2 interface guide non-conventional therapeutic strategies for

GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers

CHARMM36 all-atom additive protein force field: Validation based on comparison to NMR data

VMD: visual molecular dynamics

Open-Source Drug Discovery Consortium, & Lynn, A.A GROMACS tool for high-throughput MM-PBSA calculations

Prime Minister's statement on coronavirus

RNA virus mutations and fitness for survival. Annual review of microbiology

 Tables   Table 1. List of mutations observed in RBD domain, events and region observed among global spike protein sequences.

Events Origin and initial occurrence of mutations