key: cord-0820287-67d0h0lr authors: Laskar, Rezwanuzzaman; Ali, Safdar title: Differential mutational profile of SARS-CoV-2 proteins across deceased and asymptomatic patients date: 2021-03-31 journal: bioRxiv DOI: 10.1101/2021.03.31.437815 sha: 432e0aac94486a7f2bddf8a0ef5883187e6d5116 doc_id: 820287 cord_uid: 67d0h0lr The SARS-CoV-2 infection spread at an alarming rate with many places showed multiple peaks in incidence. Present study involves a total of 332 SARS-CoV-2 sequences from 114 Asymptomatic and 218 Deceased patients from twenty-one different countries. The mining of mutations was done using the GISAID CoVSurver (www.gisaid.org/epiflu-applications/covsurver-mutations-app) with the reference sequence ‘hCoV-19/Wuhan/WIV04/2019’ present in NCBI with Accession number NC-045512.2. The impact of the mutations on SARS-CoV-2 proteins mutation was predicted using PredictSNP1(loschmidt.chemi.muni.cz/predictsnp1) which is a meta-server integrating six predictor tools: SIFT, PhD-SNP, PolyPhen-1, PolyPhen-2, MAPP and SNAP. The iStable integrated server (predictor.nchu.edu.tw/iStable) was used to predict shifts in the protein stability due to mutations. A total of 372 variants were observed in the 332 SARS-CoV-2 sequences with several variants incident in multiple patients accounting for a total of 1596 incidences. Asymptomatic and Deceased specific mutants constituted 32% and 62% of the repertoire respectively indicating their exclusivity. However, the most prevalent mutations were those present in both. Though some parts of the genome are more variable than others but there was clear difference between incidence and prevalence. NSP3 with 68 variants had total occurrence of only 105 whereas Spike protein had 346 occurrences with just 66 variants. For Deleterious variants, NSP3 had the highest incidence of 25 followed by NSP2 (16), ORF3a (14) and N (14). Spike protein had just 7 Deleterious variants out of 66. Deceased patients have more Deleterious than Neutral variants as compared to the symptomatic ones. Further, it appears that the Deleterious variants which decrease protein stability are more significant in pathogenicity of SARS-CoV-2. The causative agent of ongoing COVID-19 global pandemic is Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) which belongs to family Coronaviridae characterized by single strand positive sense RNA genome . As of 31 st March 2021, there were 5,52,566, active cases, 1,14,34,301 discharged cases and 1,62,468 deaths in India due to SARS-CoV-2 (https://www.mohfw.gov.in/). The same time, as per WHO there have been 27,349,248 confirmed cases of COVID-19, including 2,787,593 deaths worldwide due to COVID-19 (covid19.who.int). As the SARS-CoV-2 infection spread at an alarming rate and many places showed multiple peaks in incidence, the virus has been replicating and accruing mutations in the process. There have been multiple reports to assess and understand the impact of these mutations Li et al., 2020; van Dorp et al., 2020; Wang et al., 2020) . These reports have focused on novel and recurrent variations and their impact on infectivity and antigenicity. However, the assessment of impact of mutations on SARS-CoV-2 is an ongoing process. In our earlier study, we performed the mutational analysis of 611 genomes from India extracted n June 2020 and analyzed its impact on proteins. Therein we observed a differential mutation profile in viral genomes across deceased and asymptomatic patients (Laskar and Ali, 2021) . However, one lacuna in the study was very limited number of samples as we had focused on India only but the indications therein formed the basis of present study from samples all over the world to ascertain the differential mutational profile of deceased and asymptomatic samples in order to understand the possible clinical implications of protein variants. On 10 th November 2020, we retrieved 6853 sequence from GISAID (www.gisaid.org) using the data filter ~ virus name: hCoV-19 -Host: Human -Complete -High Coverage. Subsequently, they were screened for availability of clinical status which revealed 122 and 542 sequences from Asymptomatic and Deceased patients respectively. The final screening parameter applied was that of age wherein all samples of over 60 years were excluded. This was done because in deaths of people over 60 years, co-morbidities and other physiological aspects are involved (Omori et al., 2020) . Since our aim is to ascertain the impact of mutations, if any, on the deaths due to SARS-CoV-2, thus we excluded the samples where there are high chances of other factors contributing to death. Thereon, `a total of 332 SARS-CoV-2 sequences from 114 Asymptomatic and 218 Deceased patients were included in the study from twenty-one different countries. Details of the sequences used have been provided in Table 1 and Supplementary file 1. There were ten countries which had no asymptomatic patients and five countries with no deceased patients. These included 200 males, 72 females and 60 patients for which gender information was not available. The mining of non-synonymous mutations from the selected 332 sequences was done utilizing the GISAID CoVSurver (www.gisaid.org/epiflu-applications/covsurver-mutationsapp) web resources for sequence variance analysis. Sequence datasets were aligned with the reference sequence 'hCoV-19/Wuhan/WIV04/2019'. The reference amino acid sequence of each protein for further analysis was downloaded from the same server. The reference sequence is hundred percent identical to the NCBI sequence with Accession number NC-045512.2 from Wuhan, China. It has been used as reference for our earlier study on SARS-CoV-2 genomes from India Ali, 2021, 2020) . The impact of the mutations on SARS-CoV-2 proteins mutation was estimated using PredictSNP1(loschmidt.chemi.muni.cz/predictsnp1) which is a meta-server integrating six predictor tools: SIFT, PhD-SNP, PolyPhen-1, PolyPhen-2, MAPP and SNAP. The SNAP and PhD-SNP tools are based on supervised machine learning algorithms that have been trained in massive datasets to "learn" to distinguish between pathogenic and benign variants. The protein sequence and structure method predict how mutations influence the protein phenotype on the basis of the SNP position in the protein structure behind the PolyPhen-1 and PolyPhen-2 predictors. The MAPP and SIFT tools, on the other hand, determine the pathogenicity of mutations based on the conservation of specific amino acids by sequence and evolutionary conservation methods across various species. PredictSNP extracted the individual score of these tools to homogenize the assessment and generate its own confidence score as a percentage of expected accuracy, ranging from 0 to 100 per cent. It has also designed three distinct datasets like PMD, MMP and OVERFIT in order to eliminate bias, duplicity and inconsistency (Bendl et al., 2014) . Subsequently, the variants are categorized as 'Neutral' or 'Deleterious', and these predictions are more robust and accurate than the prediction provided by any individual predictor. The iStable integrated server (predictor.nchu.edu.tw/iStable) encompasses two machines learning based predictor iMutant and MUpro along with thermodynamics parameters in order to predict shifts in the protein stability related to mutations. i-Mutant, an SVM-based tool, predicts protein stability attributable to mutant form in free energy change value [DDG=DG(NewProtein)-DG(WildType) in Kcal/mol]. It classifies the variant in terms of decreasing (DDG<0) or increasing (DDG>0) protein stability. MUpro, an SVM and neural network-based tool, evaluate the mutational impact of protein stability on predictive confidence score (Conf. Score). The score varies from -1 to 1, with Conf. Score <0 signifies a decrease in stability and Conf. Score >0 as an increase in protein stability. iStable utilizes sequence information and predicts the meta-result as an increase or decrease in stability in terms of confident score where a higher score indicates more confidence in the prediction (Chen et al., 2013) . A total of 372 variants were observed in 332 SARS-CoV-2 sequences with several variants being incident in multiple patients accounting for a total of 1596 variants. The distribution of variants across different proteins of SARS-CoV-2 have been shown in Figure 1 , Table 2 and Supplementary file 2. The distribution and incidence of variants can be analyzed through several aspects. First, the incidence of mutations with respect to gender as summarized in Fig 2A. The patients for whom gender information wasn't available have been mentioned as unknown. There were 45 mutations (12%) present in both males and females but these variants were the most prevalent ones as they accounted for around 69% of the total incidences. Also, the male and female specific 221 and 75 mutations (59% and 20%) accounted for a meagre 17% and 5% respectively of total incidence. Partly this difference can be attributed to the skewed sample set in favor of the males. Klein and Flanagan have reviewed the studies assessing the variable nature of immune responses in males and females (Klein and Flanagan, 2016) but since the variants most incident are present in both genders in our study and assuming the variations can potentially alter the immune response, the chances of it in a gender-dependent manner seems rare. Though there 60 patients with unknown gender, the fact that overall distribution of mutants and incidence in the studied population is around 90% for males, females or both means the data from unknown will not have much of an impact on gender wise distribution and incidence of mutants. Almost all of these 60 patients are asymptomatic and belong to the Diamond Princess cruise ship from Japan which has become benchmark study about how wearing masks reduces viral load and hence even if the person gets infected chances of it being mild or asymptomatic are enhanced substantially (Batista et al., 2020; Gandhi et al., 2020 ). Subsequently, we moved to the second aspect of analysis: patient status. The question we wanted to address was are any mutations specific to the deceased or asymptomatic patients? The distribution and incidence of mutations as per patient status has been shown in Figure 2B and Supplementary file 2. Interestingly, 22 mutations present in both deceased and asymptomatic accounted for 952 (60%) of the total occurrence of mutations. Asymptomatic specific mutants encompassing 119 variants (32%) accounted for only 246 (15%) of the total observed mutations whereas the corresponding values for Deceased specific mutants was 231 (62%) variants with 398 (24%) occurrences respectively. The details of these mutations have been provided in Table 2 and Supplementary file 2 and their impact on proteins are discussed later. Evidently, some mutations are getting represented at a much higher rate as compared to others. The fifteen most prevalent mutations along with their incident frequency and localization have been shown in Figure 2C . P323L (NSP12) is the most prevalent variant present followed by D614G (Spike protein) with over two hundred incidences each. Most of these prevalent variants are common to both males and females, which is expected as per gender wise distribution data. The timeline for accumulation of mutants has been shown in Figure 2D and evidently the maximum number of mutations accrued in April 2020 wherein the incidence of COVID-19 was at its peak. The age wise distribution of mutants has been shown in Figure 2E . An absolute correlation between age and incidence can't be drawn due to uneven representation of all age groups. Furthermore, since the samples are from different countries and the virus is behaving differently across geographical locations, the data has to be interpreted with caution. Thereby, looking at the mutation incidence with reference to countries seemed rational. We analyzed the data in terms of number of mutants per sample for each country as represented in Figure 3 and Supplementary file 4. Saudi Arabia with maximal representation of 70 samples in the study was contributing 3.93 variants per sample which is amongst the lowest in the group. Contrastingly, Bangladesh with highest value of 10 variants per sample is being represented by just 5 samples in the study. Thus, its explicit that some nations are exhibiting more variations in SARS-CoV-2 as compared to others. Differential health and gene profile of individuals might be factors contributing to it. We also looked at the variant per sample data for deceased and asymptomatic samples respectively for each country as shown in Supplementary file 4. Bangladesh had the highest variants per sample of 10 followed by India (6.36) for asymptomatic patients whereas it was highest of 9.5 for Indonesia followed by Brazil (6.75) for deceased patients. The fact that some countries had no representation in asymptomatic or deceased samples makes any direct inference plausible but a geographical evolution of the virus is surely supported by the observed data. The impact of mutations on SARS-CoV-2 proteins was assessed using three aspects as highlighted in Figure 4 . These include distribution of variants across SARS-CoV-2 proteins; pathogenicity of the variants in term of deleterious or neutral and assessing the stability of proteins in presence of variants. The distribution of variants across different SARS-CoV-2 proteins revealed some interesting observations. First, variants incidence in some proteins was relatively higher than others as evident in Figure 4a and Supplementary file 2. NSP3 housed a maximum of 68 variants closely followed by Spike protein with 66 variants. N protein and NSP2 with 29 and 27 variants respectively were a distant third and fourth. Thus, some parts of the genome are more variable whereas others are more conserved. These have led to emergence of novel variants around the world which have been so far without clinical manifestations but the situation might change any moment. Secondly, there was clear difference between incidence and prevalence. NSP3 with 68 variants had total occurrence of only 105 whereas Spike protein had 346 occurrences (66 variants). Comparatively, N protein and NSP12 had 328 (29 variants) and 288 (21 variants) occurrences in the study. This clearly implies that clustering of mutations doesn't imply higher prevalence. In order to ascertain the significance of differential incidence and prevalence, we analyzed the pathogenicity of the variants as Deleterious or Neutral which has been shown in Figure 4b , Table 2 and Supplementary file 3. Their presence across Asymptomatic and Deceased samples across different countries has been discussed in the next section. The basic premise for the study was that a protein having more variants will be contributing to the viral evolution only if they are Deleterious. Neutral mutations wouldn't be affecting the protein per se. In terms of incidence of Deleterious variants, NSP3 had the highest of 25 followed by NSP2 (16), ORF3a (14) and N (14). Spike protein had just 7 Deleterious variants out of 66 which partly explains that despite of so many variants it hasn't much impacted the viral pathogenesis yet. It is also interesting to note that only four proteins NSP1, NSP2, ORF3a and ORF7a had more Deleterious than Neutral mutants. Further, NSP7 and ORF6 had only Deleterious mutants (Figure 4b , Table 2 ). Thus, we can say that even a protein with very few mutations can be a more determining factor in viral evolution. Assessment of the protein stability in lieu of mutations was thereby ascertained and has been represented in Figure 4c . Majority of variants resulted in a decrease in protein stability across all proteins. The three proteins which had maximum variants which increased protein stability included NSP3 (19), Spike (18) and N (12). Also, NSP3 (49) and Spike (48) had highest number of variants which decreased protein stability. The impact of these variants discussed individually is significant but their occurrence in population isn't in isolation and hence the cumulative impact of mutations incident together is also required. The distribution of Deleterious and Neutral mutations across asymptomatic and deceased patients has been shown in Table 2 and Supplementary file 3. There are 12 proteins in which no variant was present in both asymptomatic and deceased samples suggesting a correlation between mutational status and disease profile. N protein had the highest of 4 variants across both sample sets. Moreover, except for N protein, NSP1, NSP3, NSP9 and NSP14, all other proteins had more variants associated with deceased patients as compared to asymptomatic ones. NSP5 and NSP7 had no mutations from asymptomatic patients whereas only NSP9 had nor variants coming from deceased patients. Subsequently we analyzed the ratio of Deleterious is to Neutral mutations across asymptomatic and deceased samples as shown in Table 2 Table 3 and Supplementary file 3. When we analyzed mutations only from deceased patients and with deleterious pathogenicity, we observed that there were nine proteins in which all such mutations had decreasing stability suggesting the possibility of the decreased stability contributing to enhanced pathogenicity. Moreover, there were ten proteins which had mutations with increasing as well as decreasing stability. Amongst these there was on N protein which had four variants with increasing stability as compared to three with decreasing stability prediction. All other proteins had more stability decreasing variants as compared to increasing ones. Thus, we can say that the variants with decreasing stability are more significant in pathogenicity of SARS-CoV-2. A total of 372 variants were observed in 332 SARS-CoV-2 sequences with several variants being incident in multiple patients accounting for a total of 1596 variants. The 22 mutations present in both deceased and asymptomatic accounted for 60% of the total occurrence of mutations. Asymptomatic specific 119 variants accounted for 15% whereas Deceased specific 231 variants represented 24% of total occurrences respectively. Since, some countries had no representation in asymptomatic or deceased samples, inference about geographical correlation isn't plausible from present dataset. Four proteins NSP1, NSP2, ORF3a and ORF7a had more Deleterious than Neutral mutants. When looking at deceased patients with deleterious pathogenicity prediction mutants, there were nine proteins in which all such mutations had decreasing stability suggesting the possibility of the decreased stability contributing to enhanced pathogenicity. Minimizing disease spread on a quarantined cruise ship: A model of COVID-19 with asymptomatic infections PredictSNP: Robust and Accurate Consensus Classifier for Prediction of Disease-Related Mutations iStable: off-the-shelf predictor integration for predicting protein stability changes Mutations Strengthened SARS-CoV-2 Infectivity Masks Do More Than Protect Others During COVID-19: Reducing the Inoculum of SARS-CoV-2 to Protect the Wearer Sex differences in immune responses Mutational analysis and assessment of its impact on proteins of SARS-CoV-2 genomes from India Phylo-geo-network and haplogroup analysis of 611 novel Coronavirus (nCov-2019) genomes from India The Impact of Mutations in SARS-CoV-2 Spike on Viral Infectivity and Antigenicity The age distribution of mortality from novel coronavirus disease (COVID-19) suggests no large difference of susceptibility by age Understanding of COVID 19 based on current evidence Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection The establishment of reference sequence for SARS CoV 2 and variation analysis The authors thank the Department of Biological Sciences, Aliah University, Kolkata, India for all the financial and infrastructural support provided. Authors acknowledge all the authors associated with originating and submitting laboratories of the sequences from GISAID's EpiFlu™ (www.gisaid.org) Database on which this research is based. The authors declare they have no competing interests. Not Applicable. All data pertaining to the study has been provided as Supplementary material of the manuscript. RL: Methodology, Investigation, Formal Analysis and Validation SA: Conceptualization, Supervision, Formal Analysis and Writing.