key: cord-274409-4ugdxbmy authors: Laskar, Rezwanuzzaman; Ali, Safdar title: Mutational analysis and assessment of its impact on proteins of SARS-CoV-2 genomes from India date: 2020-10-19 journal: bioRxiv DOI: 10.1101/2020.10.19.345066 sha: doc_id: 274409 cord_uid: 4ugdxbmy The ongoing global pandemic of SARS-CoV-2 implies a corresponding accumulation of mutations. Herein the mutational status of 611 genomes from India along with their impact on proteins was ascertained. After excluding gaps and ambiguous sequences, a total of 493 variable sites (152 parsimony informative and 341 singleton) were observed. The most prevalent reference nucleotide was C (209) and substituted one was T (293). NSP3 had the highest incidence of 101 sites followed by S protein (74 sites), NSP12b (43 sites) and ORF3a (31 sites). The average number of mutations per sample for males and females was 2.56 and 2.88 respectively suggesting a higher contribution of mutations from females. Non-uniform geographical distribution of mutations implied by Odisha (30 samples, 109 mutations) and Tamil Nadu (31 samples, 40 mutations) suggests that sequences in some regions are mutating faster than others. There were 281 mutations (198 ‘Neutral’ and 83 ‘Disease’) affecting amino acid sequence. NSP13 has a maximum of 14 ‘Disease’ variants followed by S protein and ORF3a with 13 each. Further, constitution of ‘Disease’ mutations in genomes from asymptomatic people was mere 11% but those from deceased patients was over three folds higher at 38% indicating contribution of these mutations to the pathophysiology of the SARS-CoV-2. The ongoing COVID-19 global pandemic began from Wuhan, China and has devastated millions of lives, economies and even nations as a whole. The first reported case was in 2012 created a scare as they had a relatively higher mortality rate. However, SARS-CoV-2 is by far the most contagious one [1] [2] [3] [4] . The higher incidence of viral infections would imply a faster evolution process for SARS-CoV-2 [5] . This is so because more the virus replicates, higher are the chances of it accumulating mutations with the possibility of it leading to altered dynamics of its virulence, pathogenesis and interactions with host. The changes may not be necessarily favoring the virus; however, the unpredictability demands caution. The SARS-CoV-2 genome encodes for 16 non-structural proteins in addition to the replicase polyprotein, the spike (S) glycoprotein, envelope (E), membrane (M), nucleocapsid (N) and other accessory proteins [6] . The impact of mutations in all the regions of the genome needs to be assessed to understand viral evolution. With a definitive possibility of India becoming the most affected country by SARS-CoV-2 in near future and the demographic burden involved, its pertinent to be analyze the accumulating variations in the genome accounting for possible changes in protein and their potential to alter the virus in any manner. On 6th June, 2020 we retrieved 611 FASTA sequence congregations from India along their rational meta data from GISAID EpiCoV server to construct the phylo-geo-network and analyze the haplogroups along with their geographical distribution across different states of India [7] . Herein we extend our study using the same congregation of sequences to analyze the nature and composition of the observed mutations and their impact on proteins of SARS-CoV-2. GISAID EpiCoV is an open access repository of genomic and epidemiologic information about novel corona viruses from across the world from wherein sequences were extracted and alignment performed as previously reported [7] . Briefly, 611 FASTA sequence congregations along their rational meta data from GISAID EpiCoV (www.epicov.org) server. For mutational profile analysis with clinical correlation, we selected 15 genomes of deceased patients from existing congregation. However, there were just two genomes for asymptomatic patients in the congregations. So, on 09.12.2020, we downloaded 775 FASTA sequences with patient status from the same server and selected 30 genomes from asymptomatic patients. As the data filter for genome extraction, we used hCoV-19 as a virus name, human as a host, India as a location and complete sequence with high coverage. Details of the asymptomatic 5 samples are given in Supplementary file 1. The sequences thus extracted were analyzed with NC_045512.2 from Wuhan, China as reference. MEGA(v.10) is a multithreaded tool for molecular and evolutionary analysis. Multiple Sequence Alignment (MSA) of the extracted sequences [7] was initially visualized by this software then the variable sites are exported into spreadsheets with or without missing/ambiguous and gap sites along their respective positions [8] . Using this software, we estimate the MCL (Maximum Composite Likelihood) nucleotide substitution pattern and Tajima's Neutrality test to understand transition transversion bias and nucleotide diversity [9, 10] . PIRO IGLSF is a MATLAB-based simulation software, we used this for the identify the location of mutated nucleotide position on specific gene [11] . Coronavirus Typing Tool of Genome Detective (v.1.13) and COVID-19 Genome Annotator of Coronapp are webtools for analysis of protein and nucleotide mutation [12, 13] . We used these tools for annotation, identification and classification of mutated protein followed by verification and validation of the positions with the mutated nucleotide sites by the output of MEGA. The nucleotide similarity percentage was validated by NCBI BLAST (blast.ncbi.nlm.nih.gov) to investigate the sequence diversity. SIFT, PROVEAN and ws-SNPs&GO are the prediction tools which report positive or negative impact of variants on protein phenotype. The assessments are focused upon scores using several algorithms. It is expected that a SIFT score of < 0.05 is diseased ("affect protein function"), and that > 0.05 is neutral ("tolerated"). This is stated that a PROVEAN score of < −2.5 is diseased ("deleterious"), and > −2.5 is neutral. ws-SNPs&GO 's PHD-SNP method is estimated to be > 0.5 mutation in the probability of disease, and < 0.5 is neutral [14] [15] [16] . Composition and distribution of variable sites (Table 1 ) and its negative value indicated the significance of these variable sites. However, excluding the gaps and ambiguous sequences reduced this percentage cover to 1.65% encompassing 493 variable sites which we have used for subsequent analyses reported in this study. This included 152 parsimony informative (PI) sites and 341 singleton sites (SNP: Single nucleotide Polymorphism). The PI sites are those whose incidence was observed in multiple samples whereas singleton sites had a restricted single sample incidence. The distribution of these sites according to various substitutions, protein localizations and impact therein has been summarized in Figure 1 , Supplementary file 2. As evident therein, C→T (181 sites) forms the most prevalent mutation in both PI and singleton sites and G→T (95 sites) comes a distant second. The common aspect of two most prevalent mutations is "T" being the substituted nucleotide. Further, there were two multi-variable (MV) sites each in PI and singleton category wherein two separate mutations were observed at the same site in different samples. The details of observed MV sites have been summarized in Table 2 . The distribution of the variable sites across proteins of SARS-CoV-2 in a non-uniform manner is reflective of the differential contributions of proteins in evolution. As per our data, NSP3 had the maximum of 101 variable sites followed by S protein (74 sites), NSP12b (43 sites) and ORF3a (31 sites) ( Figure 1 ; Supplementary file 2). These four proteins account for over half of the total variable sites of the genome and may be considered as drivers of genomic evolution for SARS-CoV-2. The mutations of S protein have been the focus for multiple research groups owing to its plausible impact on viral entry to the host cell but the mutations elsewhere may be equally relevant as the viral genome is known to harbor only what's essential [17] [18] [19] . We believe a holistic approach is required to understand the evolution as more often than not the selection advantage being offered by any mutation is a chance event and can be from any part of the genome. In terms of the impact of these variable sites on amino acid sequence of the viral proteins we classified them into four categories. First, the sites located in the extragenic region and hence no influence on the coding proteins. There were 10 such variable sites localized to the UTR regions (2 in 3'UTR and 8 in 5'UTR). Secondly, SNP-silent included those variable sites wherein the nucleotide change was leaving the amino acid sequenced unaltered. A total of 186 such sites were distributed across the genome. Thirdly, the variable sites which were leading to the introduction of a stop codon were referred to as SNP-stop and there were 8 such sites in our study. Lastly, the variable sites which were affecting the protein sequence are referred as SNP in the study and there were 281 such sites (Supplementary file 3). The prevalence and distribution of these sites has been summarized in Figure 1 and results of the prediction of their impact on protein has been discussed later. In order to understand the underlying dynamics of substitutions, we performed the maximum composite likelihood estimate of nucleotide substitution as shown in Table 3 We thereon looked at these variations in combination with their prevalence across samples. The most prevalent nucleotide at the variable sites in reference sequence was C (209) followed by G (137) whereas T was by far the predominantly substituted nucleotide (293, 60%). Also, the other three nucleotides had an almost equal representation in substitutions (A-68, G-68, T-64). This biased prevalence was not restricted to the alignment but was also getting translated to population incidence. There was a total of 723 mutations with C as reference nucleotide and 1032 mutations with T as substituted nucleotide across 611 studied genomes. The composition of 493 variable sites, their substitutions and prevalence across samples has been summarized in Figure 2 and Supplementary file 2. Evidently, any particular mutation may be incident across multiple samples and a single sample can harbor multiple mutations. A cumulative number for the same has been referred to as "Sum of Mutation Incidence" herein and thereafter in this study. We subsequently analyzed the patient's dataset with reference to age and gender for the incidence of mutations. However, since patients' data wasn't cumulatively available, the data for this aspect isn't exhaustive but representative for 255 samples (104 females and 151 males). The patients whose genomes were used in the study and age was known were classified into seven categories from infancy to over 75 years. The maximum number of patients for both males and females belonged to mature adulthood category of 50 to 74 years with 57 and 40 samples respectively (Figure 3, Supplementary file 4) . This adheres to the fact that the older population is at a greater risk for infection owing to a possibly weaker immune system and other physiological conditions. The simple question of whether or not age and gender are associated with accumulation of genome variations has a not so simple answer. The overall average number of mutations per sample was 2.69 and the corresponding values for males and females separately was 2.56 and 2.88 respectively. Thus, women were contributing more to the mutational accumulation as compared to males. The individual mutational load for different age groups in males and females has been represented in Figure 4 . Evidently, women are contributing more to the mutational load except for three age groups; 5-9years, 10-14years and 20-34 years. The highest difference on the basis of gender is for 15-19 years (2.67) but since there was just one female sample in that age group, it can't be emphasized much in isolation but the overall pattern does seem relevant. This is more so because, in terms of incidence, males are almost 1.5 times of the females but in terms of variations, fewer females are contributing more to the mutational load. Possibly, the virus is behaving differently depending on gender. The mutational distribution across different states of India was subsequently ascertained. Generally speaking, more the virus replicates more should be the accumulated variations. The fact that the samples used in the study aren't uniformly distributed across states provides for an intriguing template for analysis. The number of samples and the mutations therein for different states has been summarized in Figure 5 However, we can surely say that some sequences are mutating more than the others but whether the geographical location is playing a role needs to be ascertained. A total of 281 SNPs which were present which were altering the amino acid sequence. Their details and positions have been summarized in Table 4 and Supplementary file 5. We also ascertained the prevalence of these variants across samples. The most incident variant Q57H localized to ORF3a was present in 127 samples followed by A994D in NSP3 present in 29 samples. Amongst the silent SNPs, Y71Y in M protein was present in 117 samples followed by D294D in S protein with 69 incidences. The overall data for variants present in 10 genomes or more has been summarized in Figure 6a . Conversely, we also assessed the accumulation of variations in a given genome as summarized in Figure 6b . Interestingly, one sample (Genome ID 461495) had highest incidence of 11 mutations while 114 samples harbored just a single mutation. There were 156 samples with no mutations and 340 with more than one mutation. To account for these, the Sum of Mutation Incidence has been used in this study as explained above. The impact of mutations on proteins was predicted through three different tools; SIFT, PROVEAN, ws-SNPs&GO; which classified the mutations as "Neutral/Tolerated" or "Disease/Affect Protein Function/Deleterious". For the sake of simplicity, we have referred the results from all sites as Neutral and Disease. Though the prediction outcomes of the three tools were not in sync for all sites but since the classification of outcomes were on similar lines, the results can be represented in a binary manner with four categories. First two categories represent wherein the three tools have the same prediction; either all predicting a site to be "Neutral" or "Disease". The other two categories represent deviation between prediction outcomes. They are "Disease by One, Neutral by Two" and "Disease by Two and Neutral by One". For comparison between variants, any mutation predicted as Disease by Two or Three tools are considered as Disease and mutations predicted as Neutral by Two or Three tools are considered as Neutral. The distribution of Disease and Neutral variants across the different genes of SARS-CoV-2 has been shown in Table 4 and Supplementary file 5. These could be analyzed in three aspects. First, in terms of overall incidence. The maximum variants affecting protein sequence were present in NSP3 (63) followed by Spike (S) protein with 49 variants. Secondly, if we focus only on variants with predicted outcome as "Disease" then NSP13 has a maximum of 14 such variants followed by S protein and ORF3a with 13 variants each. Thirdly, we looked at those proteins which had more Disease variants as compared to Neutral. There were five such proteins namely: NSP5, NSP10, NSP13, ORF3a, E, ORF7a. Of these NSP10 had just two variants and both of them were predicted as Disease by all three tools. Others had differential bias towards Disease variants. Thus, we can say that though some regions of the genome have more variations but mostly Neutral while others with fewer variations are more impactful in terms of their predicted impact due to more Disease variants. Conversely, mutations in some proteins can be relatively better tolerated by the viral genome. The overall protein prediction outcomes of the 611 genomes have been summarized in Figure 7 . There were total of 198 mutations (70%) and 83 mutations (30%) which are predicted to be Neutral and Disease respectively by at least two tools. These predictions suggest that even though mutations are accumulating in SARS-CoV-2, they are predominantly neutral. This is the possible reason that no major virulence or physiological deviations have been observed so far. In order to further assess impact of these variations we compared their prevalence across samples which were asymptomatic with those wherein the patient died. The idea was that if predictions are true, then asymptomatic samples should have more of neutral mutations whereas deceased ones should have more of disease mutations. The present congregation of samples in the study had just 2 asymptomatic samples and 15 deceased. Thereon, we included 13 mutations with those of 15 deceased samples. Their comparative data has been shown in Table 5 . The p value therein represents the probability that a given variant chosen at random to be Neutral or Disease. Taking the threshold as common prediction by at least two tools the data gives interesting insights. As shown and previously mentioned, for the original congregation of 611 samples, 70% mutations were Neutral (p value 0.7) and 30% were Disease The mutational accumulation in SARS-CoV-2 genomes is a multifactorial event with some areas of genome more prone to mutations, selective mutations being more prevalent, nonlinear assimilation of mutations across various states and differential correlation between mutational impact on proteins and physiological state. Age and gender specific bias in incidence of mutations was observed. The asymptomatic samples had higher occurrence of Neutral variants while deceased samples had relatively higher incidence of Disease variants. A cross-linking of mutational dynamics and patient history will provide for better correlation and understanding of the variations in SARS-CoV-2 genomes. ; D=9) A872T, T882I, Y925C, E940D, P971S, G989V, S1029I, P1054L, M1083I, H1141Y, A1268T, S1534I, T1543K, T1567I, M1588I, V1629A, S1733G, T1761I, G1861S, T1822I, T1854A, T1854I, N1871T, K1973R, S2103F, K2029E, G2035E, P2144S, L2146F, A2249V, V2372I, T2274I, T2300I, L2323V, P2480L, H2520Y, A2593V, S2625F, ; D=13) L54F, N148Y, E156D, A243S, S255F, G261S, Q271R, T299I, T323I, E471Q, A520S, T572I, E583D, T602I, V622I, Q677H, A706S, T761S, G769V, T827I, A831S, I434K, S494P, D574Y, A892V H1083Q, P1263L T723I, F797C, L828P, T941K, V1068F, D1153Y, C1243F G857C A930T A930V S1021F I1179N C1250F A879S, T1027I, H1101Y, V1104L, G1124V, K1181R, K1191N, G1251V, Q1201K 5 ORF3a 31 178 22 (N=9; D=13) V13L, G18V, S74A, V77F, T175I L41F, S74F, S171L, T190I I62T, L83F, T176I I35T, L46F, L53F, Q57H Severe acute respiratory syndrome Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia A Major Outbreak of Severe Acute Respiratory Syndrome in Hong Kong A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster Epitope-based chimeric peptide vaccine design against S, M and E proteins of SARS-CoV-2 etiologic agent of global pandemic COVID-19: an in silico approach Identification of a novel coronavirus causing severe pneumonia in human: a descriptive study Phylo-geo-network and haplogroup analysis of 611 novel Coronavirus (nCov-2019) genomes from India Molecular Evolutionary Genetics Analysis across Computing Platforms Statistical method for testing the neutral mutation hypothesis by DNA polymorphism Prospects for inferring very large phylogenies by using the neighbor-joining method Microsatellite Diversity, Complexity, and Host Range of Mycobacteriophage Genomes of the Siphoviridae Family identification from high-throughput sequencing data coronapp : A Web Application to Annotate and Monitor SARS-CoV-2 Mutations SIFT web server: predicting effects of amino acid substitutions on proteins Predicting the functional effect of amino acid substitutions and indels 0: predicting stability changes upon mutation from the protein sequence or structure Missense mutations in SARS-CoV2 genomes from Indian patients Geographic and Genomic Distribution of SARS-CoV-2 Mutations Mechanisms of viral mutation The authors thank the Department of Biological Sciences, Aliah University, Kolkata, India for all the financial and infrastructural support provided. Authors acknowledge all the authors associated with originating and submitting laboratories of the sequences from GISAID's EpiFlu™ (www.gisaid.org) Database on which this research is based. The authors declare they have no competing interests. Not Applicable. All data pertaining to the study has been provided as Supplementary material of the manuscript. RL: Methodology, Investigation, Formal Analysis and Validation SA: Conceptualization, Supervision and Writing.