key: cord-0967678-4evgbl5g authors: Cheng, Liang; Han, Xudong; Zhu, Zijun; Qi, Changlu; Wang, Ping; Zhang, Xue title: Functional alterations caused by mutations reflect evolutionary trends of SARS-CoV-2 date: 2021-02-13 journal: Brief Bioinform DOI: 10.1093/bib/bbab042 sha: 7161c510163c9526bc69be6d392135666b6e4136 doc_id: 967678 cord_uid: 4evgbl5g Since the first report of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in December 2019, the COVID-19 pandemic has spread rapidly worldwide. Due to the limited virus strains, few key mutations that would be very important with the evolutionary trends of virus genome were observed in early studies. Here, we downloaded 1809 sequence data of SARS-CoV-2 strains from GISAID before April 2020 to identify mutations and functional alterations caused by these mutations. Totally, we identified 1017 nonsynonymous and 512 synonymous mutations with alignment to reference genome NC_045512, none of which were observed in the receptor-binding domain (RBD) of the spike protein. On average, each of the strains could have about 1.75 new mutations each month. The current mutations may have few impacts on antibodies. Although it shows the purifying selection in whole-genome, ORF3a, ORF8 and ORF10 were under positive selection. Only 36 mutations occurred in 1% and more virus strains were further analyzed to reveal linkage disequilibrium (LD) variants and dominant mutations. As a result, we observed five dominant mutations involving three nonsynonymous mutations C28144T, C14408T and A23403G and two synonymous mutations T8782C, and C3037T. These five mutations occurred in almost all strains in April 2020. Besides, we also observed two potential dominant nonsynonymous mutations C1059T and G25563T, which occurred in most of the strains in April 2020. Further functional analysis shows that these mutations decreased protein stability largely, which could lead to a significant reduction of virus virulence. In addition, the A23403G mutation increases the spike-ACE2 interaction and finally leads to the enhancement of its infectivity. All of these proved that the evolution of SARS-CoV-2 is toward the enhancement of infectivity and reduction of virulence. In December 2019, the respiratory disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first reported in Wuhan, China [1, 2] . Since then, it has rapidly spread across the world, leading to an unprecedented global public health emergency. As of 19 August 2020, SARS-CoV-2 has infected over 20 million individuals, and caused over 700 thousand individuals death worldwide. Like the other two coronaviridae family known to infect humans, Middle East respiratory syndrome coronavirus (MERS-CoV) and severe acute respiratory syndrome (SARS), SARS-CoV-2 is also associated with high case fatality rates (CFR) [3, 4] . According to the reference genome of SARS-CoV-2 (NC_045512), the virus genome contains 29 903 nucleotides and consists of 12 major open-reading frames (ORFs) involving ORF1a, ORF1b, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N and ORF10. Analysis of the nucleotide and protein sequence of these ORFs can help to expose derivation and high CFR of SARS-CoV-2 [4] [5] [6] . In January 2020, Zhou et al. identified that SARS-CoV-2 share 79.6% sequence identity to SARS-CoV and 96% sequence identical to a bat coronavirus RaTG13 at the whole-genome level, suggesting that the virus is probable bat origin [5] . Furthermore, Zhou et al. found that SARS-CoV-2 and SARS-CoV share 94.4% identical at CoV species classification domains in ORF1ab, which shows that the two viruses belong to the same species [5] . In May 2020, Ayal et al. conducted an in-depth molecular analysis of 3001 coronavirus genomes to differentiating high CFR strains including SARS-CoV-2, SARS and MERS-CoV from low CFR strains [4] . And they identified 11 regions of nucleotide alignments in four ORFs ORF1ab, S, M and E for predicting high CFR of coronaviruses, of which GAAL insertion in the spike protein of coronavirus strains appears to be associated with high CFR [4] . Since the first report of SARS-CoV-2 strain in December 2019, the virus evolves constantly through mutation in genome, which were identified in recent researches [7, 8] . In March 2020, Tang et al. analyzed 103 SARS-CoV-2 genomes and identified two complete linkage SNPs T8782C and C28144T [7] . This indicates that the virus was evolved into L and S types. In the same time, Peter et al. analyzed 160 complete SARS-CoV-2 genomes sequenced in 28 February 2020 and before [8] . They divided SARS-CoV-2 into three types according to the three central variants. Type B is derived from type A with T8782C and C28144T, and type C is derived from type B with one nonsynonymous mutation G26144T. In total, both of these two researches support that T8782C and C28144T play important roles in the evolution of SARS-CoV-2. Though current discoveries about high CFR associated GAAL insertion in SARS-CoV-2 and evolution associated SNPs T8782C and C28144T, researchers did not investigate the functional alterations caused by these mutations, which could explain high CFR and reflect the evolutionary trends of SARS-CoV-2. Since early researches are limited by the small number of SARS-CoV-2 strains, more mutations need to be investigated further with the increase of SARS-CoV-2 strains. Herein, we analyzed SNPs and functional alterations caused by mutations in 1809 sequences of SARS-CoV-2 strains. It provides new insights into understanding evolutionary trends of SARS-CoV-2. Here the sequence data of SARS-CoV-2 strains were downloaded from GISAID (https://www.gisaid.org/). As shown in Table 1 , it America 10 73 290 328 Europe 12 47 433 158 Asia 144 149 153 12 Totally, 1809 SARS-CoV-2 strains were downloaded from GISAID. contains 648, 703 and 458 sequence data isolated from America, Europe and Asia, respectively. All of these viral genomes were aligned to the reference genome of SARS-CoV-2 (NC_045512) using MAFFT [9] . Figure 1 shows the flow chart of our work. We analyzed all the SNPs in 1809 SARS-CoV-2 strains to evaluate the tendency of mutations, and the significance of synonymous and nonsynonymous mutation rates in each ORF. Here the significance of mutations in ORF was evaluated using fisher's exact test based on 2 × 2 tables. For example, the following table was used to evaluate the significance of the number of mutations in each ORF: the number of mutations in ORF, the number of nucleobase in ORF and the number of mutations in CDS, the number of nucleobase in CDS. To calculate synonymous and nonsynonymous mutation rates, we calculated the number of synonymous and nonsynonymous nucleotide substitutions based on a classical method [10] . Mutations with high frequency were analyzed to detect dominant mutations. First, Haploview was used to detect the patterns of linkage disequilibrium (LD) between SNPs [11] . Then, the SARS-CoV-2 genomes were divided into four groups by month to detect the change of the mutation frequency. We further evaluated functional alteration of genes caused by those key mutations through bioinformatics tools ProtScale [12] , I-Mutant [13] and PPA-Pred [14] . ProtScale and PPA-Pred are used for evaluating the hydrophobicity and binding affinity of protein [12, 14] . I-Mutant is used for prediction of protein stability [13] . To analysis of associations between the 1809 SARS-CoV-2 strains, we performed hierarchical clustering using R package factoextra (https://cran.r-project.org/web/packages/factoextra/i ndex.html). Then, we constructed a maximum likelihood phylogenetic tree for SARS-CoV-2 strains and a bat coronavirus (BatCov RaTG13), which is a probable origin of SARS-CoV-2 based on sequence similarity [5, 7] . Here jModelTest (version 2.1.10) [15] and PhyML (version 3.1) [16] was used to complete the construction of maximum likelihood phylogenetic tree. Based on the ORF alignments of reference genome NC_045512, we identified 1529 SNPs involving 1017 nonsynonymous and 512 synonymous mutations in 1809 SARS-CoV-2 strains. None of these mutations were located in the receptor-binding domain (RBD) of Spike protein. Although the number of nonsynonymous mutations is more than synonymous mutations, the nonsynonymous substitution rate (0.0444) is lower than synonymous substation rate (0.0796). According to the frequency of derived mutations in these virus strains in Figure 2A , the proportion of singleton nonsynonymous mutations (726/1017 = 0.7139) is higher than that of synonymous mutations (325/512 = 0.6348). For those mutations that occurred in 1% and more virus strains, the proportion of nonsynonymous mutations (21/1017 = 0.0206) is also lower than that of synonymous mutation (15/512 = 0.0293). All of these provide the evidence of purifying selection. Whereas, more nonsynonymous substitution rate (11/1017 = 0.0108) than synonymous substitution rate (3/512 = 0.0059) derived over 100 virus strains. These mean the derived nonsynonymous mutations are expected to spread more widely. Figure 2B shows the average of accumulative mutations grows correspondingly with the time. In general, the number of nonsynonymous mutation is more than the number of synonymous mutations in each month. Although it has only 166 virus strains (Table 1) in January, it has the largest number of average mutation (2.48). Each of virus strains in the fourth month contains 6.99 mutations. It indicates that SARS-CoV-2 has about 1.75 new mutations each month on average. We further calculated the average of accumulative mutations in different locations by month. Figure 2C -E shows the number of mutations in America, Asia and Europe, respectively. Asian synonymous mutations in February and European nonsynonymous mutations in April decrease a little. Overall, mutation rate is almost same in America, Asia and Europe. In order to determine whether the distribution of these mutations has a tendency in the ORFs, the fisher's exact test is used to evaluate the significance of the number of individual mutation sites in each of the ORFs (section 'Materials and methods'). As shown in Figure 3A , the number of individual mutation sites has a significant tendency in ORF1b, ORF3a and N (P value < 0.01). We then calculated the ratio of mutation sites based on the number of individual mutation sites and the sequence length of each ORF in Figure 3D . It shows that the ratio of mutation sites in ORF1b is smaller than that in other ORFs and the ratio of mutation sites in ORF3a and N is larger than that in other ORFs. All of these indicate that ORF1b is a conserved region and ORF3a and N is the divergent region. We then investigated the diversity of synonymous and nonsynonymous mutations in these ORFs. Figure 3B shows that the number of synonymous mutations are evenly distributed in different ORFs. This means the mutations in synonymous sites are random in ORFs, which may be because synonymous sites are affected by small pressure of natural selection. In comparison with the synonymous site, nonsynonymous sites are under the greater pressure of natural selection, thus the distribution of their mutations should be different for each of ORFs. In fact, the number of nonsynonymous mutation sites has a significant tendency in ORF1b, ORF3a, N and ORF8 (P value < 0.01) according to Figure 3C , which is almost consistent with Figure 3A . And Figure 3F shows that the ratio of nonsynonymous mutation sites in ORF1b is smaller than that in other ORFs and the ratio of nonsynonymous mutation sites in ORF3a, ORF8 and N is larger than that in other ORFs. In order to determine the tendency of natural selection, we calculated the nonsynonymous substation rate and synonymous substation rate in each of these ORFs in Figure 3E and F. Result shows that ORF3a, ORF8 and ORF10 were under positive selection and other ORFs were under purifying selection. Thirty-six mutations occurred in 1% and more virus strains were analyzed to reveal LD variants and dominant mutations. As shown in Figure 4A , r2 and LOD values for each pair-wise The ratio of these 36 mutations occurs in each month were further analyzed to detect advantage mutations (section 'Materials and methods'), which was shown in Figure 4B . As a result, five locations 8782 (orf1a, synonymous), 28 144 (orf8, nonsynonymous), 3037 (orf1a, synonymous), 14 408 (orf1b, nonsynonymous) and 23 403 (s, nonsynonymous) were identified as dominant mutations in Table 3 . In January, about 38% strains had nucleotides of C and T at LD locations 8782 and 28 144, respectively. Then the number of strains with C (8782) and T (28144) reduces gradually with the growth of the month. In April, almost all the strains had T (8782) and C (28 144). In January, almost all of the strains had nucleotides of C, C and A at LD locations 3037, 14 408 and 23 403, respectively. Whereas in April, most of the strains (93%) had T (3037), T (14408) and G (23403). The details of these five advantage mutations were described in Table 3 . In total, we detected five dominant mutations T8782C, C28144T, C3037T, C14408T and A23403G, the origin nucleotides of which were almost substituted by the mutation in the latest virus strains. Besides, there are two important nonsynonymous mutations C1059T (ORF1a), G25563T (ORF3a). Although no strains in January and few strains in February occurred these two mutations, about 50% strains in April have the mutations. Thus, C1059T and G25563T could be potential dominant mutations. To explore the trend of mutations in different locations, we calculated the ratio of these 36 mutations for virus strains in America, Asia and Europe. Figure 4C -E shows that the virus strains in different locations have the same dominant mutations. Three dominant nonsynonymous mutations C28144T, C14408T and A23403G and two potential dominant mutations C1059T and G25563T were evaluated by bioinformatics tools for investigating the functional alterations caused by these mutations. We predicted the potential changes of protein stability due to the nonsynonymous mutations using I-Mutant [13] , which is a widely used online tool based on support vector machine. I-Mutant directly estimates the relative stability changes upon [13] . Here we got G values with −0.67, −0.83, −0.93, −0.9 for C1059T, C14408T, A23403G and G25563T, respectively. The very low G values (<−0.5) show these mutations decreased protein stability largely, which could lead to the significant reduction of virus virulence [17] . By comparison, C28144T could reduce virus virulence a little, since the G values for the mutation is near zero. Since spike-ACE2 interaction can affect virus infectivity [18, 19] , we analyzed the alteration of the interaction caused by spike-ACE2 binding affinity due to A23403G using PPA-Pred [14] . The tool can evaluate the alterations of binding affinity through two aspects: dissociation free energy G and dissociation constant Kd. Both of these two aspects are inversely proportional to protein-protein binding affinity and interactions [14] . In Table 4 , G and Kd in SARS-CoV-2 spike is decreased by the A23403G, which means that the mutation in SARS-CoV-2 increases the spike-ACE2 interaction, and finally leads to the enhancement of its infectivity [18, 19] . According to the continent where the patients with SARS-CoV-2 are located, SARS-CoV-2 genomes are marked as America, Asia and Europe. Because the patient's area does not represent the difference of SARS-CoV-2 strains, we encoded the virus genome to perform hierarchical clustering (section 'Materials and methods'). As shown in Figure 5A , the results of hierarchical clustering show five distinct groups. According to the continent where the main sample of each group is located, the five groups of SARS-CoV-2 strains were named America1, Asia, Europe1, America2 and Europe2 in turn. The sample sizes of these five regional groups are shown in Table 5 . We then performed hierarchical clustering of virus genomes based on nonsynonymous mutation, the results of which ( Figure 5B ) is consistent with that base on all mutations. As shown in Figure 5C , we identified each regional group SARS-CoV-2 CDS mutations sites and the five groups SARS-CoV-2 strains have few intersections in these mutations, which indicates that our grouping method can effectively distinguish SARS-CoV-2 strains. Further, we used the complete genomes of Bat RaTG13 from the GeneBank and five regional groups SARS-CoV-2 strains specific mutation sites that totaled 147 to construct a maximum likelihood phylogenetic tree ( Figure 5D ). In the phylogenetic tree, the five regional groups SARS-CoV-2 strains are also clearly separated. Compared to Asia, Amer-ica1 and America2 SARS-CoV-2 strains are closer to bat-derived coronavirus. All the codes for conducting this study could be downloaded from the website: https://github.com/liangcheng-hrbmu/FE-SA RS-CoV-2. In addition, we will download and update the latest data each month. Totally, there were 1529 SNPs in 1809 virus strains, none of which were located in RBD of the spike protein. In addition, each of the strains could has about 1.75 new mutations each month in average. Since RBD are targeted by many known neutralizing antibodies and the number of mutation in each strains is very few, the accumulated mutations may have few impacts on antibodies. The nonsynonymous substation rate is lower than synonymous substation rate. It provides evidence of purifying selection. The further analysis in each of ORFs shows that ORF3a, ORF8 and ORF10 were under positive selection, ORF1b is a conserved region and ORF3a, and N is the divergent region. Like ORF1b, even if not significant, mutations in ORF1a also tend to be less. ORF1a and ORF1b can encode two nonstructural proteins of the SARS-CoV-2. The two nonstructural proteins are essential for the basic function (like viral replication, viral assembly) of the SARS-CoV-2 [20] . The stability of ORF1a and ORF1b ensures the basic needs of SARS-CoV-2 survival. The gene region encoding the N protein has the highest tendency to mutation. The 168-208 amino acid region of N protein can directly bind to M protein through ionic interaction [21] . The M protein plays an important role in the assembly, germination and release of the SARS-CoV-2 and O-Glycosylation of M protein is related to the interaction between coronavirus and host [22, 23] . ORF3a, ORF8 and ORF10 are all accessory proteins of the SARS-CoV-2. Some accessory proteins can regulate interferon signaling pathways and the production of proinflammatory cytokines, which makes it play an important role in the host response to coronavirus infection and thereby [24, 25] . A significant body of evidence has found SARS-CoV-2 ORF3a could coordinate attack the heme on the 1-beta chain of hemoglobin and could efficiently induce apoptosis in cells [26, 27] . SARS-CoV-2 ORF8 stands out by structural plasticity and high diversity and its gene transcripts are expressed in higher amounts [28, 29] . Furthermore, SARS-CoV-2 ORF8 protein may inhibit the type I interferon signaling pathway, an important role of antiviral infection [30] . Although the function of ORF10 remains to be elucidated, we infer that ORF10 with positive selection may have an important role in SARS-CoV-2 infection and spread. We further identified dominant mutations in 1809 virus strains, and analyzed the functional alterations caused by these dominant mutations. Totally, we identified five dominant mutations T8782C, C28144T, C3037T, C14408T and A23403G and two potential dominant mutations C1059T and G25563T. There T8782C, and C28144T were also identified by Peter et al. for distinguishing the subtype of SARS-CoV-2 [8] . Viruses with 3037T-14408T-23403G have a fitness gain, which was reported by Yang et al. in their latest discovery [31] . A23403G were deemed as important mutations in spike protein. And this mutation has become the most prevalent form in the global pandemic [32] . Mutations C1059T and G25563T are first highlighted here. We analyzed the alteration of protein stability due to the dominant mutations and using I-Mutant and the alteration of spike-ACE2 binding affinity due to A23403G using PPA-Pred. Results show that mutations decreased protein stability largely, which could lead to a significant reduction of virus virulence. The A23403G mutation increases the spike-ACE2 interaction, and finally leads to the enhancement of its infectivity [18, 19] . This were further validated recently in clinical trials by Plante et al. [33] . All of these proved that the evolution of SARS-CoV-2 is toward enhancement of infectivity and reduction of virulence as other viruses [34, 35] . Up to now, seven types of coronavirus have been known to infect humans, which includes low CFR and high CFR named SARS-CoV-2, SARS and Middle East respiratory syndrome (MERS). In previous studies, Bethany et al. has highlighted an important insert from location 32 029 to 32 040 that encodes GAAL of spike protein [4] . Whereas, the significance of these positions was not pointed out. Here we analyzed changes in binding affinity due to GAAL insertion in SARS-CoV-2 reference genome NC_045512 using PPA-Pred [14] . Dissociation free energy G and dissociation constant Kd were discussed since both of them are inversely proportional to protein-protein binding affinity [14] . Due to the GAAL insertion, G is decreased from −19.19 to −20.08, Kd is decreased from 8.40e-15 to 1.89e-15. It means that the insertion increases the spike-ACE2 binding affinity, and finally leads to the enhancement of its infectivity and virulence [18, 19] . Outbreak of pneumonia of unknown etiology in Wuhan, China: the mystery and the miracle An emerging coronavirus causing pneumonia outbreak in Wuhan, China: calling for developing therapeutic and prophylactic strategies Epidemiology, genetic recombination, and pathogenesis of coronaviruses Genomic determinants of pathogenicity in SARS-CoV-2 and other human coronaviruses A pneumonia outbreak associated with a new coronavirus of probable bat origin Evidence of recombination in coronaviruses implicating pangolin origins of nCoV-2019 On the origin and continuing evolution of SARS-CoV-2 Phylogenetic network analysis of SARS-CoV-2 genomes MAFFT multiple sequence alignment software version 7: improvements in performance and usability Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions Haploview: analysis and visualization of LD and haplotype maps A simple method for displaying the hydropathic character of a protein 0: predicting stability changes upon mutation from the protein sequence or structure Protein-protein binding affinity prediction from amino acid sequence jModelTest 2: more models, new heuristics and parallel computing New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0 Modification of nonstructural protein 1 of influenza A virus by SUMO1 The SARS-CoV-2 exerts a distinctive strategy for interacting with the ACE2 human receptor Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor A new coronavirus associated with human respiratory disease in China Characterization of protein-protein interactions between the nucleocapsid protein and membrane protein of the SARS coronavirus Assembly of the coronavirus envelope: homotypic interactions between the M proteins O-glycosylation of the mouse hepatitis coronavirus membrane protein Accessory proteins of SARS-CoV and other coronaviruses SARS coronavirus accessory proteins COVID-19: Attacks the 1-Beta Chain of Hemoglobin and Captures the Porphyrin to Inhibit Human Heme Metabolism The ORF3a protein of SARS-CoV-2 induces apoptosis in cells Transcriptome & viral growth analysis of SARS-CoV-2-infected Vero CCL-81 cells Evolutionary dynamics of the SARS-CoV-2 ORF8 accessory gene The ORF6, ORF8 and nucleocapsid proteins of SARS-CoV-2 inhibit type I interferon signaling pathway Analysis of genomic distributions of SARS-CoV-2 reveals a dominant strain type with strong allelic associations Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus Spike mutation D614G alters SARS-CoV-2 fitness Evolutionary enhancement of Zika virus infectivity in Aedes aegypti mosquitoes Is HIV-1 evolving to a less virulent form in humans? We thank researchers for sharing their sequencing data from GISAID (https://www.gisaid.org/), and thank GISAID for providing instructions and advice on the database. We also thank researchers and NCBI for providing sequencing data and instructions for SARS-CoV-2 reference sequence (https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2). This work was supported by the Tou-Yan Innovation Team LC and XZ conceived and designed the experiments. XH and ZZ analyzed data. CQ and PW wrote and revised this manuscript. All authors read and approved the final manuscript.