key: cord-0724716-kiin0964 authors: Laha, Sayantan; Chakraborty, Joyeeta; Das, Shantanab; Manna, Soumen Kanti; Biswas, Sampa; Chatterjee, Raghunath title: Characterizations of SARS-CoV-2 mutational profile, spike protein stability and viral transmission date: 2020-06-30 journal: Infect Genet Evol DOI: 10.1016/j.meegid.2020.104445 sha: 42b17b426066a1eadbace30eed61358695ac753f doc_id: 724716 cord_uid: kiin0964 The recent pandemic of SARS-CoV-2 infection has affected more than 3.0 million people worldwide with more than 200 thousand reported deaths. The SARS-CoV-2 genome has the capability of gaining rapid mutations as the virus spreads. Whole-genome sequencing data offers a wide range of opportunities to study mutation dynamics. The advantage of an increasing amount of whole-genome sequence data of SARS-CoV-2 intrigued us to explore the mutation profile across the genome, to check the genome diversity, and to investigate the implications of those mutations in protein stability and viral transmission. We have identified frequently mutated residues by aligning ~660 SARS-CoV-2 genomes and validated in 10,000 datasets available in GISAID Nextstrain. We further evaluated the potential of these frequently mutated residues in protein structure stability of spike glycoprotein and their possible functional consequences in other proteins. Among the 11 genes, surface glycoprotein, nucleocapsid, ORF1ab, and ORF8 showed frequent mutations, while envelop, membrane, ORF6, ORF7a and ORF7b showed conservation in terms of amino acid substitutions. Combined analysis with the frequently mutated residues identified 20 viral variants, among which 12 specific combinations comprised more than 97% of the isolates considered for the analysis. Some of the mutations across different proteins showed co-occurrences, suggesting their structural and/or functional interaction among different SARS-COV-2 proteins, and their involvement in adaptability and viral transmission. Analysis of protein structure stability of surface glycoprotein mutants indicated the viability of specific variants and are more prone to be temporally and spatially distributed across the globe. A similar empirical analysis of other proteins indicated the existence of important functional implications of several variants. Identification of frequently mutated variants among COVID-19 patients might be useful for better clinical management, contact tracing, and containment of the disease. dependent RNA polymerase activity. In the intermediate stage, negative-strand RNA intermediates are produced to serve as a template for positive-sense genomic RNA and subgenomic RNA synthesis. These shorter sub-genomic RNAs encode the structural proteins i.e. Spike, Envelope, Membrane and Nucleocapsid protein, and several other accessory proteins [12] [13] [14] . The mutation rate for RNA viruses is drastically high and this higher mutation rate is correlated with a virulence which is beneficial for viral adaptation [15] . The SARS-CoV-2 genome has the capability of gaining rapid mutations as the virus spreads [16] . The advantage of the increasing amount of whole-genome sequence data of SARS-CoV-2 intrigued us to explore the mutation profile across the genome, to check the genome diversity and to investigate the consequences of those mutations on stability and transmission. In this present study, we used ~ 660 complete SARS-CoV-2 genome data from NCBI virus database (16 th April 2020) for in-silico analysis. We performed gene and protein sequence alignment and characterized the mutation status of all genes. The most conserved and variable regions were recognized for all genes along with synonymous and nonsynonymous changes. As non-synonymous changes dictate the altered amino acid composition, a collection of all mutations for each protein has been determined. We cataloged these substitutions for all proteins and identified different variants that are prevalent in nature. We also evaluated the impact of mutating spike glycoprotein in protein stability, viral transmission, adaptability, and diversification. This brief characterization of SARS-CoV-2 variants and functional impact analysis of those variants could lead to better clinical management of the COVID-19 pandemic. Total 664 SARS-CoV-2whole genome sequences were downloaded from NCBI Virus repository (https://www.ncbi.nlm.nih.gov/labs/virus) as of April 16, 2020 . The repository structures [23] . The differences in total energy, electrostatics, solvation energy etc.were calculated for all the closed, open state as well as prefusion spike protein mutant 3D models with FoldX empirical force field [23, 24] . All three cryo-EM structures have some missing residues in the available PDBs, and are not considered for stability calculation or further structural analysis. However missing residues and loops were generated for each conformation to have a proper environment of the mutated residues to increase the accuracy of energy calculations. For this purpose, we have used SWISS-MODEL (https://swissmodel.expasy.org/) homology modelling online server with Spike glycoprotein sequence (uniprt_ID: P0DTC2) and three individual PDBs were used as a template to generate three conformations. The models were refined by OpenMM7 [25] with CHARMM27 force-field [26] implemented in the SWISS-MODEL server. The models were further validated by RAMPAGE [27] which shows 91-92% residues are in the favoured region and 6-7% are in the allowed region in the phi-psi map of all the three models. Structural analyses and figures have been generated in Discovery studio 2020 (Dassault System BIOVIA Corp) [28] . To study the impact of non-synonymous mutations in the transmission and viability of the newly emerging SARS-CoV-2 sequences across the world, we looked into Nextstrain tool, a powerful visualization tool comprising >10,000 SARS-CoV-2 samples to study the evolution of various pathogens (https://nextstrain.org/) [29] . The frequency of occurrence, the date of collection, and the corresponding geographical location of the mutant strains was noted on 27 th April 2020. Please note that the genotype information on Nextstrain are not consistent on different dates, at least for some residues. Major amino acid substitutions were mapped and visualized in the phylogeny from Nextstrain Next hCoV-19 App. Minor (less frequently) mutated residues that showed variations in Nextstrain data compared to NCBI virus database were excluded from the analysis. Amino acid residues of ORF1ab were split as ORF1a from 1-4400 and the rest as ORF1b in Nextstrain Next hCoV-19 App. To understand the implications of the amino acid substitutions in the mutants, charge state (at neutral pH), Kyte-Doolittle hydropathy indices [30] , and Grantham's mean chemical J o u r n a l P r e -p r o o f difference indices, which takes into account side-chain chemical structure, volume, and polarity [31] , were compared. Stabilities of their side-chains between exposed and buried forms were compared using apparent partition energies as reported by Wertz and Scheraga [32] . The typical contributions of a putative H-bond and salt-bridge towards protein stabilities were assumed to be in the range of 0.5-1.5 kcal/mole [33] [34] [35] and 3 kcal/mole [36] based on previous studies on protein folding. After aligning the individual ORFs of the SARS-CoV-2 genome, we identified mutations both at gene and protein levels (Table 1) . At the gene level, the number of mutations per 100 bases was found to be relatively high in N, ORF10, ORF6, ORF7a, ORF8, and ORF3a, suggesting that these genes may be more prone to mutations as compared to others. Among the structural proteins, M and E proteins contained the least variability, which indicated that these proteins may be associated with housekeeping functions and consequently have a greater resistance to mutations. Looking into the changes per 100 amino acids for each of the proteins (Table 1) , we observed that ORF3a exhibited the highest mutability, closely followed by N and ORF8. While looking into the synonymous and nonsynonymous changes, we have found that the ORF1ab and spike protein contained the largest number of non-synonymous mutations (Table 1) . When we normalized with respect to the length, ORF3a, ORF8, and N exhibited a relatively high number of non-synonymous changes. Considering the altered protein sequences due to non-synonymous changes, we next focused on amino acid substitutions among all proteins of the SARS-CoV-2 genome (Supplementary Table 1 ). An abridged version of the table containing only those substitutions that have been observed in a minimum of 10 isolates (~1.5% of the total isolates) is provided in Figure 1 . Among the structural proteins, E and M showed most conserved structures across all the viral genomes under consideration, with substitutions at two sites of each E and M proteins only in 1and5isolatesrespectively. Upon examining the S proteins, we have found several mutations; nevertheless, most of these substitutions are perceived only in a single isolate, with notable exceptions being the D614G. There have been 264 (41%) instances of J o u r n a l P r e -p r o o f D614G, suggesting its pivotal role in regards to the protein stability and other key characteristics. Among the other changes, V483A, L5F, Q675H, H655Y, and S939F occurred in 6, 5, 3, 2, and 2 isolates respectively. The N protein also depicted substitutions R203K and G204R. Among the non-structural proteins, ORFs 6, 7a, and 10 shared similar behaviour to E and M proteins with them being mostly conserved. In contrast, ORF3a exhibited nonsynonymous mutations, the majority of which had mostly been distributed at 2residues (Q57H and G251V). Mutations in ORF8 showed a major substitution at L84S and an accompanied change of V62L. Another substitution, S24L was observed in 25 samples out of the 660 sequences analysed. We moved on to the largest encoded SARS-CoV-2 protein, ORF1ab that encodes replicase polyproteins required for viral RNA replication and transcription [37] . ORF1a and ORF1b encode two polypeptides, pp1a and pp1ab, and finally processed into 16 nsps ( Figure 1 ) [37, 38] . Majority of the non-synonymous mutations at ORF1ab led to amino acid changes at the 265 th , 4725 th , 5828 th, and 5865 th residues (T265I, P4715L, P5828L, and Y5865C). Notably, L5828P and C5865Y occurred simultaneously in all strains, suggesting a possible functional relationship between these two residues. To identify the variants that are prevalent over time, we next determined the frequently mutated residues that occurred at least 1.5% of the samples. The combined analysis of all proteins with these frequently mutated residues identified 20 possible SARS-CoV-2 variants, among which 15variants comprised more than 97% of the analysed sequences having frequency >1% ( and L84S occurred mostly with the D614 of S protein, while S24L occurred with the G variants, also observed in our analysis with ~660 samples. Overall, the mutation profile that we identified with 664 samples showed excellent concordance with the Nextstrain data comprising ~10,000 samples. We compared these frequently mutated residues with the corresponding protein sequences of Bat coronavirus RaTG13, Pangolin coronavirus, SARS coronavirus TW11, and GD01 ( Table 2 ). All these frequently mutated residues completely matched with Bat coronavirus RaTG13. The notable exceptions in here being mismatches at positions 265, 971, and 3606 of ORF1a in case of pangolin coronavirus and mismatches in all three residues of ORF8 in both the SARS-CoV isolates. Parameters for 17 frequently mutated residues and V483A substitution at the receptorbinding domain of S protein that may affect the protein structures are presented in Table 3 . The relative abundance of a mutation can be taken as a surrogate of viral viability, which would be dictated by the effect of the mutation on protein stability and its function with respect to specific biomolecular events during host-pathogen interaction. showed only 2.7% abundance. Interestingly, both F6158L and L3606F with very low differences in Grantham's index and reversal in the difference in apparent partitioning energy showed low abundance. We have encountered several different variants pertaining to S protein of SARS-CoV-2. Apart from D614G and some co-occurring mutations, other changes have been observed in a few cases, e.g., L5F occurring in 5 strains, V483A in 6 strains, while among others, most of these substitutions were observed in a single strain. By performing the stability analysis of spike glycoprotein for mutating residues that are available in all three pdb file, we found that some of the variants are stable in nature corresponding to negative total energies, calculated for both open, closed as well as prefusion models ( To further understand the implication of these mutants, we have analysed all three structures of S [21, 22, 39] . The spike glycoprotein is a homo-trimeric protein [21, 22, 39] having two subunits S1 and S2 in each monomer protruding from the viral surface. S1 subunit forms a budding head responsible for host-receptor binding, while S2 is mainly a stalk-like structure that helps in the fusion of viral and host membranes ( Figure 3A ). S proteins are cleaved at the S1/S2 interface but remain non-covalently linked with each other J o u r n a l P r e -p r o o f in the prefusion state [39] . S1 subunit can further be divided into sub-domains namely Nterminal domain (NTD: residues 15-261), C-terminal domains 1, 2 and 3 (CTD1: residues 320-516; CTD2: residues 517-579; CTD3: residues 580-663) ( Figure 3A) . CTD1, which is the main region of S protein for host-receptor interaction, is also termed as the receptor-binding domain (RBD) [21, 22, 39] . RBD undergoes conformational changes during receptor binding (human ACE2) that leads to the blossom of the S1 bud in an open or 'UP' conformation conducive for S-ACE2 interaction. Comparing the inert 'DOWN' and active 'UP' conformations (PDB_ID s: 6VXX and 6VYB respectively), it is found that RBD moves as a rigid-body in a hinge bending motion around its linker region with NTD and CTD2 with all-atom rmsd for 198 residues is around 2.8 Å ( Figure 3B) . A similar change of conformation is also observed in prefusion state [22] . The miss-sense mutations in S protein are mainly single point mutations with few double mutations. All these mutations can be classified as stabilizing and destabilizing based on the free-energy changes ( (Figure 1 ). Our study depicts that there are no stabilizing mutations in the receptor-binding domain RBD ( Figure 4A, B) . This observation indicates that the mechanism of S protein for a high affinity human ACE2 binding is unique in nature and any mutation (found to date) leads to an unstable structure and this could be correlated with lower viability of these mutations containing isolates. There are 42 miss-sense mutations found in S protein, we have considered 21 of them that are available in every monomer structure of S protein as well as in three pdbs. Out of these, 32 are in the S1 subunit and only 10 are found in the S2 subunit. The cryo-EM structure of the protein (6VXX) shows a high thermal parameter for the NTD and RBD ( Figure 4D ). The high-temperature factor of RBD could be correlated with its dynamic nature (Table 4 ). The structural comparison of wild-type and in-silico generated D614G mutant shows that a change from Aspartic acid to Glycine alters the electro-static potential of the surface of the protein ( Figure 5A ). This change creates a favourable environment in a hydrophobic pocket of the S protein ( Figure 5B ). Moreover, we have also observed that D614 is at the proximity of the hinge bending region (CTD2/NTD linker) of RBD (Figure5C), therefore mutation of D to a small residue G without any side-chain might increase the flexibility for a smooth switch over from inactive DOWN state to the active UP state, makes the mutant containing variants more virulent in terms of its smoother binding with ACE2. Among these multiple variants, the ones that are occurring in a large fraction of the samples can be said to have adapted, while those strains which only existed with very few samples were likely to get eliminated in the way of selective process and are not generally perceived among the emerging variants. This implies that the favourable variants should be associated with greater stability and/or higher transmission rates of the SARS-CoV-2 proteins, while a decreased stability or transmission rate is expected in the case of the minor variants. Looking into the spatial and temporal distribution of these variants of S protein in Nextstrain and noting down the number of occurrences of each variant along with the country it originated with the corresponding date ( There is no conflict of interest in this manuscript. containing double mutants are given. J o u r n a l P r e -p r o o f * Mutated residues are presented in red J o u r n a l P r e -p r o o f *Mutants that show reduction in total free energy in all three conformations are presented in bold. A novel coronavirus outbreak of global health concern The clinical characteristics of pneumonia patients coinfected with 2019 novel coronavirus and influenza virus in Wuhan Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in A Novel Coronavirus from Patients with Pneumonia in China A new coronavirus associated with human respiratory disease in China WHO Declares COVID-19 a Pandemic Identification of a novel coronavirus in patients with severe acute respiratory syndrome Isolation and propagation of a human enteric coronavirus Identification of Coronavirus Isolated from a Patient in Korea with COVID-19. Osong Public Health Res Perspect Jumping species-a mechanism for coronavirus persistence and survival. Curr Opin Virol Comparative analysis of RNA genomes of mouse hepatitis viruses The Nonstructural Proteins Directing Coronavirus RNA Synthesis and Processing Continuous and Discontinuous RNA Synthesis in Coronaviruses The architecture of SARS-CoV-2 transcriptome Why are RNA virus mutation rates so damn high? Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Molecular Evolutionary Genetics Analysis across Computing Platforms A pneumonia outbreak associated with a new coronavirus of probable bat origin Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19 Outbreak The Protein Data Bank Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation The FoldX web server: an online force field A detailed thermodynamic analysis of ras/effector complex interfaces OpenMM 7: Rapid development of high performance algorithms for molecular dynamics Extending the treatment of backbone energetics in protein force fields: limitations of gas-phase quantum mechanics in reproducing protein conformational distributions in molecular dynamics simulations Structure validation by Calpha geometry: phi,psi and Cbeta deviation Nextstrain: real-time tracking of pathogen evolution A simple method for displaying the hydropathic character of a protein Amino acid difference formula to help explain protein evolution Influence of water on protein structure. An analysis of the preferences of amino acid residues for the inside or outside and for specific conformations in a protein molecule Contribution of hydrogen bonds to protein stability Energetics of hydrogen bonds in peptides Evaluating contribution of hydrogen bonding and hydrophobic bonding to protein folding Salt-bridge energetics in halophilic proteins The coronavirus replicase Emerging coronaviruses: Genome structure, replication, and pathogenesis Cryo-electron microscopy structures of the SARS-CoV spike glycoprotein reveal a prerequisite conformational state for receptor binding We would like to acknowledge Prof. Nitai P. Bhattacharya, retired professor, SINP Kolkata for his critical comments and evaluation of the manuscript. J o u r n a l P r e -p r o o f