key: cord-0949013-psux4fuj authors: Laskar, Rezwanuzzaman; Ali, Safdar title: Mutational analysis and assessment of its impact on proteins of SARS-CoV-2 genomes from India date: 2021-02-04 journal: Gene DOI: 10.1016/j.gene.2021.145470 sha: cd6da74e04d1db9e3ea059df30184af03b5a74ff doc_id: 949013 cord_uid: psux4fuj

The mutational status of SARS-CoV-2 genomes from India, along with the impact of the mutations on proteins, was ascertained using multiple tools including MEGA, Genome Detective, SIFT, PROVEAN and ws-SNPs&GO. Excluding gaps and ambiguous sequences, 493 variable sites (152 parsimony informative and 341 singleton) were observed. NSP3 had the highest incidence with 101 sites, followed by S protein (74), NSP12b (43) and ORF3a (31). The average number of mutations per sample for males and females was 2.56 and 2.88, respectively. The non-uniform geographical distribution of mutations suggests that sequences in some regions are mutating faster than others. There were 281 mutations (198 Neutral and 83 Disease) affecting the amino acid sequence. NSP13 had the maximum of 14 Disease variants, followed by S protein and ORF3a with 13 each. The proportion of Disease mutations in genomes from asymptomatic people was a mere 11%, whereas in those from deceased patients it was 38%, indicating a contribution of these mutations to the pathophysiology of SARS-CoV-2.

In order to perform mutational profile analysis with clinical correlation, we selected 15 genomes of deceased patients from the existing collection. However, there were just two genomes from asymptomatic patients in the collection. So, on 09.12.2020, we downloaded [...] from asymptomatic patients. As the data filters for genome extraction, we used hCoV-19 as the virus name, human as the host, India as the location, and complete sequences with high coverage.

(SNP: Single Nucleotide Polymorphism). The parsimony informative (PI) sites are those whose incidence was observed in multiple samples, whereas singleton sites had a restricted, single-sample incidence.
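The distinction between parsimony informative and singleton sites can be illustrated with a minimal sketch. It assumes the conventional definition (a variable site is parsimony informative when at least two nucleotide states each occur in at least two sequences, and a singleton otherwise) and it skips columns containing gaps or ambiguous bases. The function name `classify_sites` and the toy alignment are hypothetical and are not taken from the study, which relied on MEGA for this step.

```python
from collections import Counter


def classify_sites(aligned_seqs):
    """Label variable alignment columns as singleton or parsimony informative.

    Returns a dict mapping 1-based column positions to a label.
    Conserved columns and columns with gaps or ambiguous bases are omitted.
    """
    if not aligned_seqs:
        return {}
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to equal length")

    valid_bases = set("ACGT")
    labels = {}
    for pos in range(length):
        column = [seq[pos].upper() for seq in aligned_seqs]
        # Assumption: any gap or ambiguity code excludes the whole column.
        if any(base not in valid_bases for base in column):
            continue
        counts = Counter(column)
        if len(counts) < 2:
            continue  # conserved site, not variable
        # Count how many nucleotide states are seen in two or more sequences.
        states_seen_twice = sum(1 for n in counts.values() if n >= 2)
        if states_seen_twice >= 2:
            labels[pos + 1] = "parsimony_informative"
        else:
            labels[pos + 1] = "singleton"
    return labels


# Toy alignment (hypothetical data, five sequences of six columns).
toy = ["ACGTAC", "ACGTAC", "ATGTAC", "ATGTCC", "ACGTA-"]
print(classify_sites(toy))
```

In the toy alignment, column 2 carries two states that each recur, column 5 has a change seen in a single sequence, and column 6 is dropped because of the gap.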
The distribution of these sites according to various substitutions, protein localizations and [...] Table 2. [...] Calap, 2016). We believe a holistic approach is required to understand the evolution, as more often than not the selection advantage offered by any mutation is a chance event and can come from any part of the genome.

[...] prevalence and distribution of these sites have been summarized in Figure 1, and the results of the prediction of their impact on proteins are discussed later. We then looked at these variations in combination with their prevalence across samples. The most prevalent nucleotide at the variable sites in the reference sequence was C (209) [...] Incidence" herein and thereafter in this study.

Age and Gender wise distribution of samples and mutations therein

We subsequently analyzed the patients' dataset with reference to age and gender for the incidence of mutations. However, since patient data was not available for all samples, the data for this aspect is not exhaustive but representative, covering 255 samples (104 females and 151 males). The patients whose genomes were used in the study and whose age was known were [...] of infection, which has not been feasible for the present dataset due to paucity of information. However, we can surely say that some sequences are mutating more than others, but whether the geographical location plays a role needs to be ascertained.

[...] variations are more impactful in terms of their predicted impact due to more Disease variants. Conversely, mutations in some proteins can be relatively better tolerated by the viral genome.

The overall protein prediction outcomes of the 611 genomes have been summarized in Figure 7. There were a total of 198 mutations (70%) and 83 mutations (30%) predicted to be Neutral and Disease, respectively, by at least two tools. These predictions suggest that even though mutations are accumulating in SARS-CoV-2, they are predominantly neutral.
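The "at least two tools" threshold can be sketched as a simple majority vote. The snippet below is an illustration only: it assumes each tool's output has already been normalized to the labels "Neutral" or "Disease", and the names `consensus_call` and `percentage_share` are hypothetical helpers, not functions of SIFT, PROVEAN or SNPs&GO. The second helper reproduces the reported 70% and 30% shares from the counts 198 and 83.

```python
from collections import Counter


def consensus_call(predictions, min_agreement=2):
    """Return the label shared by at least `min_agreement` tools, else None.

    `predictions` maps a tool name to "Neutral", "Disease", or None
    (None meaning the tool gave no prediction for this mutation).
    """
    labels = [label for label in predictions.values() if label is not None]
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_agreement else None


def percentage_share(count, total):
    """Percentage of `count` in `total`, rounded to the nearest integer."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total)


# Hypothetical per-tool calls for a single amino acid substitution.
example = {"SIFT": "Disease", "PROVEAN": "Neutral", "SNPs&GO": "Disease"}
print(consensus_call(example))

# Shares implied by the reported counts (198 Neutral, 83 Disease).
total = 198 + 83
print(percentage_share(198, total), percentage_share(83, total))
```

With three tools and two possible labels a majority always exists when every tool returns a call; the `None` branch only matters when a tool gives no prediction.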
This is [...] Taking the threshold as common prediction by at least two tools, the data gives interesting [...]

[Table fragment: per-protein lists of amino acid substitutions as extracted; N = Neutral, D = Disease. Column boundaries and most row labels were lost in extraction.]

[...]; D=9) A872T, T882I, Y925C, E940D, P971S, G989V, S1029I, P1054L, M1083I, H1141Y, A1268T, S1534I, T1543K, T1567I, M1588I, V1629A, S1733G, T1761I, G1861S, T1822I, T1854A, T1854I, N1871T, K1973R, S2103F, K2029E, G2035E, P2144S, L2146F, A2249V, V2372I, T2274I, T2300I, L2323V, P2480L, H2520Y, A2593V, S2625F, [...]

[...]; D=13) L54F, N148Y, E156D, A243S, S255F, G261S, Q271R, T299I, T323I, E471Q, A520S, T572I, E583D, T602I, V622I, Q677H, A706S, T761S, G769V, T827I, A831S, I434K, S494P, D574Y, A892V H1083Q, P1263L T723I, F797C, L828P, T941K, V1068F, D1153Y, C1243F G857C A930T A930V S1021F I1179N C1250F A879S, T1027I, H1101Y, V1104L, G1124V, K1181R, K1191N, G1251V, Q1201K

5 ORF3a 31 178 22 (N=9; D=13) V13L, G18V, S74A, V77F, T175I L41F, S74F, S171L, T190I I62T, L83F, T176I I35T, L46F, L53F, Q57H, C81F, L85F

Excluding gaps and ambiguous sequences, 493 variable sites (152 parsimony informative and 341 singleton) were observed. NSP3 had the highest incidence with 101 sites, followed by S protein (74), NSP12b (43) and ORF3a (31). The average number of mutations per sample for males and females was 2.56 and 2.88, respectively, indicating a higher incidence of mutations in females. The non-uniform geographical distribution of mutations, implied by Odisha (30 samples, 109 mutations) and Tamil Nadu (31 samples, 40 mutations), suggests that sequences in some regions are mutating faster than others. There were 281 mutations (198 Neutral and 83 Disease) affecting the amino acid sequence. NSP13 had the maximum of 14 Disease variants, followed by S protein and ORF3a with 13 each. This clearly indicates that mutations in some proteins can be relatively better tolerated. The proportion of Disease mutations in genomes from asymptomatic people was a mere 11%, whereas in those from deceased patients it was over three-fold higher at 38%.

Running Title: Mutational analysis of SARS-CoV-2 genomes in India
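The regional and clinical comparisons reported above rest on simple ratios, which can be checked with a short sketch. The figures are taken from the text (Odisha: 30 samples, 109 mutations; Tamil Nadu: 31 samples, 40 mutations; Disease share 11% versus 38%). The helper names `mutations_per_sample` and `fold_difference` are hypothetical, and the per-sample rate is only a crude summary that ignores sampling dates and lineage composition.

```python
def mutations_per_sample(mutations, samples):
    """Average number of mutations per sequenced sample."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return mutations / samples


def fold_difference(higher, lower):
    """Ratio of two positive quantities, e.g. two percentages."""
    if lower <= 0:
        raise ValueError("lower must be positive")
    return higher / lower


# (samples, mutations) per state, as reported in the text.
regions = {"Odisha": (30, 109), "Tamil Nadu": (31, 40)}
for name, (samples, mutations) in regions.items():
    rate = mutations_per_sample(mutations, samples)
    print(f"{name}: {rate:.2f} mutations per sample")

# Disease-variant share: deceased (38%) versus asymptomatic (11%).
print(f"fold difference: {fold_difference(38, 11):.2f}")
```

The Odisha rate comes out at roughly 3.6 mutations per sample against roughly 1.3 for Tamil Nadu, and the ratio 38/11 is about 3.5, consistent with the "over three-fold" statement.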