key: cord-0005307-7floluv1
authors: Yan, Shaomin; Wu, Guang
title: Mutation patterns in human α-galactosidase A
date: 2009-05-26
journal: Mol Divers
DOI: 10.1007/s11030-009-9158-4
sha: b1e02833bf7b2d8db0c1c82d9f5186392f14ae24
doc_id: 5307
cord_uid: 7floluv1

One way to study mutation patterns is to convert a 20-letter protein sequence into a scalar protein sequence, because a 20-letter protein sequence is neither vector nor scalar data, whereas patterns are most readily studied in the numerical domain. In this study, we use amino-acid pair predictability to convert α-galactosidase A and its 137 mutants into scalar sequences, and analyse which amino-acid pairs are more sensitive to mutation. Our results show that unpredictable amino-acid pairs are more sensitive to mutation, and that the mutation trend is to narrow the difference between the predicted and actual frequencies of amino-acid pairs.

failure [12], and central neurological defects as a consequence of cerebrovascular disease [13-16]. Clearly, mutations in α-galactosidase A not only lead to various clinical outcomes, but also provide a model for analysing mutation patterns in order to understand the consequent diseases and improve clinical management. We can analyse mutation patterns at the protein level in several ways, the most straightforward being to analyse them directly in terms of differences in amino acids. For example, we can record a mutation at position 231 of α-galactosidase A, which changes aspartic acid "D" to asparagine "N" [17]. Although such records could reveal some pattern as the documentation grows, it is hard to derive from them the numeric features that are generally obtained through mathematical deduction. This is because symbolized amino acids are neither vector data nor scalar data, while most patterns found with mathematical tools are found in numerical data.
This means that we need to convert symbolized protein sequences into scalar protein sequences before we can fully analyse the mutation patterns. There are several ways to transform symbolized protein sequences into scalar data, of which the most common is to replace each amino acid in a protein sequence with one of its physicochemical properties, for example, molecular weight, melting point, or optical rotation [18]. On the other hand, our group has developed three approaches that convert a symbolized protein sequence into a scalar protein sequence based on a random mechanism (for review, see [19-22]).

Moreover, many studies have indicated that mathematical and computational approaches can provide timely and useful information and insights for both basic research and drug design, and hence are widely welcomed by the scientific community. These approaches include diffusion-controlled reaction simulation [23], graph/diagram approaches [24-31], bio-macromolecular internal collective motion simulation [32-34], structural bioinformatics [18, 35], molecular docking [36], molecular packing [37, 38], pharmacophore modelling [39, 40], Monte Carlo simulated annealing [41-44], QSAR [45, 46], protein subcellular location prediction [47-52], protein structural class prediction [53-56], identification of membrane proteins and their types [57, 58], identification of enzymes and their functional classes [59], identification of GPCRs and their types [60-62], identification of proteases and their types [63, 64], protein cleavage site prediction [65-67], and signal peptide prediction [68, 69]. In this study, we apply our approach to α-galactosidase A in the hope that it can throw some light on mutation patterns.
The amino-acid sequences of human α-galactosidase A and its 137 missense point mutants were obtained from the UniProtKB/Swiss-Prot entry [70].

There are 20 types of naturally occurring amino acids in proteins. Although we could, for example, replace the 20 types of amino acids with their physicochemical properties, the resulting 20 numbers might not be sensitive to mutation, length of the protein sequence, composition of the protein, neighboring amino acids, or amino-acid position in the protein. Thus, this type of conversion might not be suited to studying mutation patterns.

The approach we use is to apply the permutation of amino-acid pairs in human α-galactosidase A to determine whether an amino-acid pair is predictable or unpredictable in terms of its appearance in the protein [19-22, 71-75]. Human α-galactosidase A consists of 429 amino acids. The first and second amino acids are counted as an amino-acid pair, the second and third as another pair, the third and fourth as another, and so forth until the 428th and 429th, giving a total of 428 amino-acid pairs. For example, there are 30 aspartic acids "D" and 48 leucines "L" in human α-galactosidase A, so the predicted frequency of the amino-acid pair DL is 3 (30/429 × 48/428 × 428 = 3.357). We do find three DLs in α-galactosidase A, so DL is predictable by permutation. By contrast, there are 22 arginines "R" and 23 glutamines "Q" in human α-galactosidase A, so the predicted frequency of RQ is 1 (22/429 × 23/428 × 428 = 1.179), i.e., there would be one RQ in α-galactosidase A. However, the RQ pair appears four times, indicating that its appearance is unpredictable by permutation.

Mutations at predictable and unpredictable amino-acid pairs

A point mutation results in two amino-acid pairs being replaced by another two pairs. For example, there is a mutation at position 231 changing aspartic acid "D" to asparagine "N" [17].
This mutation results in the two amino-acid pairs AD and DI changing to AN and NI, because the amino acid at position 230 is alanine "A" and that at position 232 is isoleucine "I". The actual and predicted frequencies of these amino-acid pairs are shown in Table 1, from which we can determine whether the substituted amino-acid pairs (AD and DI) and the substituting amino-acid pairs (AN and NI) are predictable or unpredictable. In this way, we can analyse all of the amino-acid pairs affected by the other mutations [76].

For the numerical analysis, we calculate the difference between the predicted frequency (PF) and actual frequency (AF) of the affected amino-acid pairs, i.e., (PF − AF). As seen in Table 1, before mutation the difference between predicted and actual frequency is (2 − 5) + (2 − 3) = −4 for the substituted amino-acid pairs, and (2 − 1) + (1 − 0) = 2 for the substituting amino-acid pairs. After mutation, the differences are (2 − 4) + (1 − 2) = −3 and (2 − 2) + (1 − 1) = 0, respectively. Thus, we can compare the effects of mutation on the frequency difference.

The Chi-square test was used to compare the occurrence of mutations in predictable and unpredictable kinds/pairs, and the Mann-Whitney U test was used to compare two groups. p < 0.05 is considered significant.

Table notes: The Chi-square test indicates a highly significant difference in the occurrence of mutations between predictable and unpredictable kinds/pairs. AF, actual frequency; PF, predicted frequency.

Theoretically, 20 types of amino acids can construct 400 kinds of possible amino-acid pairs. As human α-galactosidase A has 428 amino-acid pairs, which is more than the 400 kinds of theoretical amino-acid pairs, some of the 400 kinds should appear more than once. Meanwhile, we may expect that some of the 400 kinds of theoretical amino-acid pairs are absent from human α-galactosidase A.
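The predicted-frequency arithmetic and the (PF − AF) comparison described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, predicted values are assumed to be rounded to the nearest integer (as the DL and RQ examples suggest), and the handling of pairs made of two identical residues is an assumption because the text does not spell it out.

```python
from collections import Counter


def predicted_frequency(n_first, n_second, length):
    """Expected count of an ordered pair under random permutation.

    Mirrors the arithmetic in the text:
    n_first/length * n_second/(length - 1) * (length - 1) pairs.
    """
    n_pairs = length - 1
    return n_first / length * n_second / n_pairs * n_pairs


def round_half_up(x):
    # Assumption: predicted values are rounded to the nearest integer
    # (3.357 -> 3, 1.179 -> 1), as the DL and RQ examples suggest.
    return int(x + 0.5)


def af_pf(seq, pair):
    """Return (actual, predicted) frequency of a two-letter pair in seq."""
    comp = Counter(seq)
    first, second = pair
    # Assumption: for two identical residues, the second draw excludes
    # the residue already used; the text does not state this case.
    n_second = comp[second] - 1 if first == second else comp[second]
    actual = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == pair)
    predicted = round_half_up(
        predicted_frequency(comp[first], max(n_second, 0), len(seq))
    )
    return actual, predicted


def frequency_difference(seq, pairs):
    """Sum of (PF - AF) over the given pairs, evaluated in seq."""
    return sum(pf - af for af, pf in (af_pf(seq, p) for p in pairs))


def mutation_effect(seq, position, new_residue):
    """Summed (PF - AF) for substituted and substituting pairs.

    position is 1-based and must have a neighbour on each side.
    Returns {"before": (substituted, substituting), "after": (...)}.
    """
    if not 2 <= position <= len(seq) - 1:
        raise ValueError("position needs a neighbour on both sides")
    i = position - 1
    old = seq[i]
    mutant = seq[:i] + new_residue + seq[i + 1:]
    substituted = [seq[i - 1] + old, old + seq[i + 1]]
    substituting = [seq[i - 1] + new_residue, new_residue + seq[i + 1]]
    return {
        "before": (frequency_difference(seq, substituted),
                   frequency_difference(seq, substituting)),
        "after": (frequency_difference(mutant, substituted),
                  frequency_difference(mutant, substituting)),
    }
```

With the residue counts quoted in the text, `round_half_up(predicted_frequency(30, 48, 429))` gives 3 for DL and `round_half_up(predicted_frequency(22, 23, 429))` gives 1 for RQ. For the made-up toy sequence `ADIADIKNGL` (not the real enzyme), `mutation_effect("ADIADIKNGL", 5, "N")` returns `{"before": (-4, 0), "after": (-2, -2)}`.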
Out of the 400 kinds of theoretical amino-acid pairs, 161 are absent from human α-galactosidase A, so the 428 amino-acid pairs in human α-galactosidase A include only 239 kinds of theoretical amino-acid pairs (400 − 161 = 239), which in turn means that some amino-acid pairs must appear more than once. Indeed, of the kinds of amino-acid pairs present in human α-galactosidase A, 119 kinds appear once, 77 kinds twice, 28 kinds three times, 8 kinds four times, 5 kinds five times, and 2 kinds seven times.

Naturally, a further classification is necessary, namely predictable/unpredictable kinds and predictable/unpredictable pairs. Out of the 239 kinds of theoretical amino-acid pairs in human α-galactosidase A, 111 kinds are predictable and 128 are unpredictable. Out of the 428 amino-acid pairs in human α-galactosidase A, 148 pairs are predictable and 280 pairs are unpredictable. The mutation pattern with respect to this classification is shown in Table 2.

If a kind of amino-acid pair that is directly targeted by mutation appears only once before mutation, this kind of amino-acid pair will disappear after mutation. However, if a kind of amino-acid pair appears more than once before mutation, it will still appear after mutation. Moreover, a point mutation generally affects two pairs, which makes it more likely that a kind of amino-acid pair remains after mutation.

Table 3 lists the grouped amino-acid pairs that are targeted by mutations, before and after mutation. This table can be read as follows. The first three columns group the substituted amino-acid pairs according to predictable/unpredictable status as well as actual and predicted frequency. The three columns under "before mutation" give the grouped amino-acid pairs before mutation, and the last three columns under "after mutation" give the grouped amino-acid pairs after mutation.
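The counts reported earlier in this section can be cross-checked with a few lines of Python. This is only an arithmetic sanity check on the published numbers; the function and variable names are hypothetical.

```python
def tally(kinds_by_frequency, theoretical_kinds=400):
    """Return (kinds present, kinds absent, total pairs).

    kinds_by_frequency maps "times a kind appears" to "number of kinds".
    """
    present = sum(kinds_by_frequency.values())
    total_pairs = sum(freq * n for freq, n in kinds_by_frequency.items())
    return present, theoretical_kinds - present, total_pairs


# Figures quoted in the text: 119 kinds appear once, 77 twice, and so on.
reported = {1: 119, 2: 77, 3: 28, 4: 8, 5: 5, 7: 2}
present, absent, total_pairs = tally(reported)

# Predictable + unpredictable should give the same totals.
kinds_check = 111 + 128   # kinds
pairs_check = 148 + 280   # pairs
```

The tally reproduces the 239 kinds present, 161 kinds absent, and 428 pairs in total, and the predictable/unpredictable splits sum to the same 239 kinds and 428 pairs.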
By comparing the appearance before and after mutation, we can see the aim of mutation in this regard. For example, the 137 mutations dramatically reduced the appearance of amino-acid pairs whose actual frequency is larger than the predicted frequency in both pairs, from 54 to 11 (the third line in Table 3). Also, in rows 4 to 6 under "before mutation", 86.86% of these pairs are characterised by one or both substituted pairs whose actual frequency is larger than their predicted one. These results suggest that the impact of mutations is to narrow the difference between actual and predicted frequency by reducing the actual frequency. No mutation occurs in amino-acid pairs whose actual frequency is smaller than the predicted frequency in both pairs. This interesting phenomenon suggests that it is difficult for mutations to narrow the difference between actual and predicted frequency by increasing the actual frequency.

Fig. 1 Frequency difference between substituted and substituting amino-acid pairs before and after mutation in human α-galactosidase A

Amino-acid pairs appeared through mutations

Table 4 lists the grouped amino-acid pairs that appeared through mutation, before and after mutation. The format of the results and the underlying implication in Table 4 are very similar to those of Table 3; for example, in 59.85% of mutations, one or both substituting amino-acid pairs are absent before mutation.

Frequency difference of amino-acid pairs affected by mutations

Figure 1 illustrates the difference between predicted and actual frequency in the amino-acid pairs that are influenced by the 137 mutations, and Fig. 2 shows their statistical comparison. Before mutation, the median difference between predicted and actual frequency is −2 in substituted amino-acid pairs. This means that the mutations occur in amino-acid pairs that appear more often than their predicted frequency.
Meanwhile, the corresponding value is 0 in substituting amino-acid pairs, indicating that the mutations lead to amino-acid pairs being constructed randomly. After mutation, the median difference between predicted and actual frequency is 0 in substituted amino-acid pairs, and the corresponding value is −2 in substituting amino-acid pairs. This implies that these amino-acid pairs are more randomly constructed in the mutants, as their predicted and actual frequencies are about the same.

The gene encoding α-galactosidase A has been sequenced and more than 300 different mutations have been identified in affected individuals [77, 78], and the genetic heterogeneity of α-galactosidase A contributes to the different phenotypes of Fabry disease [79, 80]. However, only 137 mutations have been documented at the protein level; with more, we would have a more comprehensive view.

Currently, two explanations are commonly proposed for why some amino acids are mutated more frequently than others. The first is targeted mutagenesis, which defines "hotspot" sites that are sensitive to endogenous and exogenous mutagens [81-83]. The second is functional selection, which holds that the disruption of protein function may depend upon the position of the mutation in the protein [84-86]. However, these explanations still do not fully answer why some amino acids are more sensitive to mutation.

This study explains why some amino acids are more sensitive to mutation from the viewpoint of randomness. This is plausible, not only because pure chance is now considered to lie at the very heart of nature [87], but also because a randomly predictable amino-acid pair has the maximal probability of occurrence, and so requires the least time and energy to construct, which is consistent with the parsimony of nature.
Needless to say, the functional sites in a protein are more likely to have been deliberately evolved, so their actual frequency should differ from the predicted frequency: an amino-acid pair that can be explained by randomness may not be explained by its function. Our results suggest that mutation tends to bring the actual frequency closer to the predicted frequency to some degree. It is as if nature were uncomfortable with pairs whose actual frequency differs from the predicted frequency, and required the protein to mutate to narrow the difference between predicted and actual frequency at the expense of losing a certain function. However, the amino-acid pairs that appear through mutation might create a new difference between predicted and actual frequency, which offers a new opportunity for mutation, and thus evolution continues.

It does not really matter which method is used to convert the symbolized protein sequence into a scalar protein sequence, as long as we can find something interesting using the scalar protein sequence. However, it is very important that the scalar protein sequence is sensitive to mutation, composition of the protein, length of the protein, neighboring amino acids, position in the protein sequence, etc. Our approaches meet these requirements [19-22], hence we use amino-acid pair predictability in this study.

In this study, we demonstrate methodologically how to study mutation patterns in proteins using an approach that converts a protein sequence into a numeric sequence. By analysing the numeric sequence, we find that the mutation pattern in human α-galactosidase A is to narrow the difference between the predicted and actual frequencies of amino-acid pairs.
Fabry disease: guidelines for the evaluation and management of multi-organ system involvement Fabry disease Narrative review: Fabry disease Neuropathy and Fabry disease: pathogenesis and enzyme replacement therapy The diagnostic workup of patients with neuropathic pain Fabry disease during childhood: clinical manifestations and treatment with agalsidase alfa X-chromosome inactivation: role in skin disease expression Fabry disease and the heart: an overview of the natural history and the effect of enzyme replacement therapy Alpha-galactosidase A in vascular disease Myofilament degradation and dysfunction of human cardiomyocytes in Fabry disease Gastrointestinal symptoms in Fabry disease: everything is possible, including treatment Update on Fabry disease: kidney involvement, renal progression and enzyme replacement therapy Monogenic vessel diseases related to ischemic stroke: a clinical approach The role of genetics in stroke Neurological manifestations in Fabry's disease The cerebral vasculopathy of Fabry disease Uneven X inactivation in a female monozygotic twin pair with Fabry disease and discordant expression of a novel mutation in the alpha-galactosidase A gene Handbook of biochemistry: section D physical chemical data Randomness in the primary structure of protein: methods and implications Fate of influenza A virus proteins Mutation trend of hemagglutinin of influenza A virus: a review from computational mutation viewpoint Lecture notes on computational mutation Role of the protein outside active site on the diffusion-controlled reaction of enzyme Graphical rules for enzyme-catalyzed rate laws An extension of Chou's graphical rules for deriving enzyme kinetic equations to system involving parallel reaction pathways Microcomputer tools for steadystate enzyme kinetics Graphical rules in steady and non-steady enzyme kinetics Review: applications of graph theory to enzyme kinetics and protein folding kinetics. 
Steady and nonsteady state systems Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E Review: steady-state inhibition kinetics of processive nucleic acid polymerases and nucleases Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws: new methods based on directed graphs Review: low-frequency collective motion in biomacromolecules and its biological functions Low-frequency resonance and cooperativity of hemoglobin Biological functions of soliton and extra electron motion in DNA structure Modelling extracellular domains of GABA-A receptors: subtypes 1, 2, 3, and 5 Binding mechanism of coronavirus main proteinase with ligands and its implication to drug design against SARS Energetic approach to packing of a-helices: 2. General treatment of nonequivalent and nonregular helices Energetics of the structure of the four-alpha-helix bundle in proteins Virtual Screening for SARS-CoV Protease Based on KZ7088 Pharmacophore Points Review: progress in computational approach to drug development against SARS Energy-optimized structure of antifreeze protein and its binding mechanism Application of the queueing theory with Monte Carlo simulation to inhalation toxicology Application of queueing theory with Monte Carlo simulation to the study of the intake and adverse effects of ethanol Estimation of the rate of arrivals of ions at a single-channel Review: recent advances in QSAR and their applications in predicting the activities of chemical molecules, peptides and proteins for drug design Unified QSAR approach to antimicrobials. 
Part 3: First multi-tasking QSAR model for input-coded prediction, structural back-projection, and complex networks clustering of antiprotozoal compounds Cell-PLoc: a package of web-servers for predicting subcellular localization of proteins in various organisms Review: recent progresses in protein subcellular location prediction Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers Subcellular location prediction of apoptosis proteins A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space Review: prediction of protein structural classes An intriguing controversy over protein structural class prediction Some insights into protein structural class prediction Prediction of protein cellular attributes using pseudo amino acid composition MemType-2L: a Web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM EzyPred: a top-down approach for predicting enzyme functional classes and subclasses Bioinformatical analysis of G-protein-coupled receptors Prediction of G-protein-coupled receptor classes GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes ProtIdent: a web server for identifying proteases and their types by fusing functional domain and sequential evolution information Identification of proteases and their types A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins Review: prediction of HIV protease cleavage sites in proteins HIVcleave: a web-server for predicting HIV protease cleavage sites in proteins Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides 
Signal-3L: a 3-layer approach for predicting signal peptide Fate of 130 hemagglutinins from different influenza A viruses Timing of mutation in hemagglutinins from influenza A virus by means of unpredictable portion of amino-acid pair and fast Fourier transform Prediction of mutations in H5N1 hemagglutinins from influenza A virus Prediction of mutations in H1 neuraminidases from North America influenza A virus engineered by internal randomness Prediction of mutations engineered by randomness in H5N1 hemagglutinins of influenza A virus Fabry disease: twenty novel alpha-galactosidase A mutations causing the classical phenotype Genetics of Fabry disease: diagnostic and therapeutic implications Natural history of Fabry renal disease: influence of α-galactosidase A activity and genetic mutations on clinical course Genotype and phenotype in Fabry disease: analysis of the Fabry outcome survey Use of mutation spectra analysis software Theoretical analysis of mutation hotspots and their DNA sequence context specificity Frameshift mutations produced by 9-aminoacridine in wildtype, uvrA and recA strains of Escherichia coli; specificity within a hotspot The importance of making ends meet: mutations in genes and altered expression of proteins of the MRN complex and cancer Advances in understanding molecular determinants in FeLV pathology HIV-1 reverse transcriptase inhibitor resistance mutations and fitness: a view from the clinic and ex vivo Chance rules: an informal guide to probability, risk, and statistics

This study was partly supported by Guangxi Science Foundation No. 0991080, and 0630003A2.