key: cord-0005203-7bh8222u authors: Zhang, S.-W.; Zhang, Y.-L.; Pan, Q.; Cheng, Y.-M.; Chou, K.-C. title: Estimating residue evolutionary conservation by introducing von Neumann entropy and a novel gap-treating approach date: 2007-08-21 journal: Amino Acids DOI: 10.1007/s00726-007-0586-0 sha: d62432d389bb14a9ec748202e5d7f6dc82da4922 doc_id: 5203 cord_uid: 7bh8222u Evolutionary conservation derived from a multiple sequence alignment is a powerful indicator of the functional significance of a residue, and it can help to predict active sites, ligand-binding sites, and protein interaction interfaces. The results of the existing algorithms in identifying the residue’s conservation strongly depend on the sequence alignment, making the results highly variable. Here, by introducing the amino acid similarity matrix, we propose a novel gap-treating approach by combining the evolutionary information and von Neumann entropies to compute the residue conservation scores. It is indicated through a series of tested results that the new approach is quite encouraging and promising and may become a useful tool in complementing the existing methods. Determining which amino acid residues in a protein are responsible for its function is very important in order to understand the molecular mechanism of protein and for drug discovery as well (Chou, 2004e; Kesel, 2005; Lubec et al., 2005; Clercq, 2006) . The experimental characterization of the constituent residues in their function and role is very expensive, time-consuming, and difficult to automate. This kind of difficulties has challenged us to develop computational approaches. It is assumed in all the existing alignment methods or the evolutionary residue-scoring methods that the importance of a residue is reflected by its evolutionary conservation, meaning that the more important the residue, the sooner it becomes fixed in different evolutionary branches and the more divergent are the branches between which it does vary (Mihalek et al., 2004) . The evolutionary con-servation varies among amino acid sites due to differing degrees of functional constraints on them (Holmquist et al., 1983; Troy et al., 1993; Zhou and Troy, 2005a) . Sites that are important for the protein's tertiary structure and folding, enzymatic activity, ligand binding, or interaction with other proteins are generally more conserved (Zhou and Troy, 1995 , 2003 Brocchieri et al., 2002; Ran et al., 2004; Zhou et al., 2004; Schnell et al., 2005) . Actually, knowledge of the conserved and functionally important residues, such as those involved in forming the binding pocket of a protein to its ligand, has been widely used to help the structure-based drug design and to guide the mutagenesis studies (see, e.g., Kang and Liang, 1997; Chou et al., 1999 Chou et al., , 2000 Chou et al., , 2003 Chou et al., , 2006 Hu et al., 2003; Yu, 2003; Chou, 2004a-d; Du et al., 2004 Du et al., , 2005a Zhang and Yap, 2004; Fan et al., 2005; Gan et al., 2006; Wu et al., 2006; Zhang et al., 2006; Liang and Li, 2007; Wang et al., 2007a, b; Wei et al., 2005 Wei et al., , 2006a Wei et al., , 2006b Wei et al., , 2007 , indicating the importance of finding conservative residues. One of the representative methods in identifying the conservative residues is ConSurf (Armon et al., 2001) , which used a maximum parsimony tree to calculate a site conservation score as the number of substitutions weighted by their physicochemical distance. Another algorithm was named evolutionary trace (ET) (Lichtarge and Sowa, 2002) , which utilizes a phylogenetic tree to identify residues that are identically conserved in a subtree. The maximum tree depth at which a residue remains unchanged is used to rank the degree of conservation. This analysis was later modified to incorporate a quantitative model of residue substitutions. The ET method had been shown to be capable of detecting protein interaction sites and directing protein mutation studies. In order to make the ET method more robust against deviations from the ideal family tree picture occurring in the actual protein evolution, Lichtarge et al. (Lichtarge and Sowa, 2002) developed a class of hybrid methods (real-valued evolutionary trace method and zoom method) that combine evolutionary and Shannon entropies from multiple sequence alignments (Mihalek et al., 2006a, b) . However, the hybrid methods do not account for the physicochemical similarities found between the different amino acids, and the gaps in the multi-sequences alignment are treated as the 21 st amino acid. In the present study, we are to combine evolutionary and von Neumann entropies and propose a different gaptreating approach for estimating the residue's evolutionary conservation. It is demonstrated by using the insulin receptor kinase domains as an example to show that the current hybrid approach can enhance the prediction quality in comparison with the existing methods. It is difficult to give a comprehensive and accurate definition for ''functionally important residue'', although everyone seems to have an intuitive concept of what these residues mean. A widely accepted definition is that the functionally important residues are those indispensable for the protein to perform its molecular function or to play its biological role in the sense that these residues cannot be freely changed (except for change to some compatible amino acids) without directly affecting its native function. In this study we call these ''functionally important residues'' the key residues, which are in a broad sense directly related to the active=catalytic sites, protein binding sites, small ligand binding sites, nucleic acids binding sites, and so on. For example, the key residues taken from the insulin receptor (1irk.pdb) are listed in Table 1 . Half of proteins in the testing dataset were taken from (Mihalek et al., 2004) . Some proteins whose computational results could not be obtained from Lichtarge Computational Biology Lab web service at http:==mammoth.bcm.tmc.edu=report_maker=index.html or Ben-Tal's ConSurf web service at http:==consurf-hssp.tau.ac.il=cgi-bin=consurf-hsspNew.cgi were deleted. Other five proteins were from (Zhu et al., 2006) . To construct a reasonable independent testing dataset, the key residues were defined according to the PDBsum database (http:==www.ebi. ac.uk=thornton-srv=databases=pdbsum=). In general, the key residues are the constituents of the following: active sites, catalytic residues, interaction with ligand, interaction with metal, and PROSITE pattern residues (mainly refer to red, red orange, and orange residues). The protein PDB code and the number of key residues in the testing dataset are given in Table 2 . Three sets of sequences from three different sources are considered. The raw sequence sets were created by using three iterations of PsiBlast (Altschul et al., 1997) , with the 0.001 E-value cutoffs on the UniProt database of proteins. The PsiBlast resulting sets were aligned by a standard alignment method such as ClustalW 1.8 (Thompson et al., 1994) . The pruned sequence sets were Lichtarge_HSSP and Consurf_HSSP. HSSP is a standard database of sequence selections=alignments obtained by carefully rethinking the similarity cutoffs for sequences of different lengths. The Lichtarge_HSSP alignments have the sequences that are selected according to Litcharge's criterion (Mihalek et al., 2004 (Mihalek et al., , 2006b . The Consurf_HSSP alignments have the sequences that are selected according to Ben-Tal's criterion (Glaser et al., 2003) . The residue ranking function assigns a score to each of the residues concerned, and according to the scores they can be sorted in the order of the presumably decreasing evolutionary pressure they experience. By combining the evolutionary and von Neumann entropies to estimate the residue evolutionary conservation, we propose a new approach to calculate the a For a full discussion about these residues, see Hubbard (1997) and Mihalek et al. (2004) residue ranking score as formulated below. The first step is to calculate the Shannon entropy and von Neumann entropy of each alignment column. The Shannon entropy for a residue belonging to column i in an MSA (multiple sequence alignment) is given by where f i;a is the relative occurrence frequency of amino acid a at the alignment position i. The base of 20 ensures that all values are bounded between zero and one. The symbol f i;gap represents the number of nonstandard amino acids (such as ''-'', ''X'', ''Z'', ''B'') at the alignment position i divided by the number of alignment sequences. The von Neumann entropy is given by where ! i is a density matrix with trace ¼ 1. Apart from normalization by the trace, the density matrix is given by the product of the relative frequencies of the amino acids in each alignment position f i;a and an appropriate similarity matrix, that is, . . . f i;a ; . . . ; f i;Y Â similarity matrix. The calculation of Eq. (2) is facilitated by first calculating the eigenvalues i;m of ! i , and hence it follows that When the similarity matrix is the identity matrix, the von Neumann entropy (Eq. (3)) becomes identical to the Shannon entropy of Eq. (1). The second step is to divide an MSA into sub-alignments (that is n groups) that correspond to nodes in the tree. This subdivision of an MSA into smaller alignments reflects the tree topology, and hence the evolutionary variation information within it. The evolutionary score for a residue belong to column i in an MSA is given by the following series of equations where w node ðnÞ, w group ðgÞ are weights assigned to a node n and a group g, respectively. w node ðnÞ ¼ 1; if n on the path to the query protein 0; otherwise ð5Þ w group ðgÞ ¼ 1; if g on the path to the query protein 0; otherwise ð6Þ f g i;a is the frequency of amino acid of type a within a sub-alignment corresponding to group g at the level in which the sequence similarity tree is divided into n groups. Namely, the nodes (labeled by n) are assumed to be numbered in the order of increasing distance from the root, and each one of them is associated with a division of the tree into n groups (subtrees). N is the number of alignment sequences, f g i;gap the number of non-standard amino acids of g group in the alignment position i divided by the number of g group alignment sequences. Further details about division of tree nodes and groups can be found in literature (Mihalek et al., 2004) . We call the scoring function by Eq. (4) the improved zoom (IZ) method. Considering the physicochemical similarities between the different amino acids, Eq. (4) can be further formulated as follows: where g i;m is the eigenvalues of the density matrix ! g i of g group in the alignment position i. Equation (7) is called the physicochemical similarity zoom (PSZ) method. If we take w node ðnÞ ¼ 1=n, w group ðgÞ ¼ 1, Eqs. (4) and (7) can be, respectively, expressed by Equations (8) and (9) are called the improved real-valued (IRV) method and the physicochemical similarity real-valued (PSRV) method, respectively. Through Eqs. (1), (3)-(4) and (7)- (9), each residue of the protein concerned can be assigned a score. The smaller the score is, the more conservation the residue is. To compare the performance of different methods, the parameters of sensitivity (S n ) and specificity (S p ) are introduced to estimate the quality of different approaches. where TP is the number of important residues found correctly, TN is the number of unimportant residues predicted correctly, and FN and FP are numbers of unimportant and important residues predicted incorrectly, respectively. In fact, the sensitivity is the ratio of the number of important residues found correctly to the known total number of important residues (true identified positive=actual positive), while specificity is the number of unimportant residues predicted by the method divided by the number of residues known not to be important (true identified negative=actual negative). The insulin receptor is a transmembrane protein that binds to the insulin hormone. The binding leads to autophosphorylation of tyrosine residues in the activation loop of the protein, resulting in an enhancement of the catalytic activity and creation of binding sites for downstream signaling proteins. Its structure and residue enumeration can be found in the Protein Data Bank (http:==www.rcsb. org=pdb) under the code 1irk. The four key parts of the 1irk machinery are (i) the residues involved in ATP binding, (ii) active site (peptide-binding site), (iii) rotational pivot points and other residues involved in lobe closure, which are important in conformational change between inactive and phosphorylated state, and (iv) activation loop, which occupies the ATP-binding site in the inactive form with the three key tyrosine residues involved in autophosphorylation highlighted. Sensitivity is the percentage of key residues that is found among the top n residues on the list (the score was aligned from small to large, i.e., the top n residues have small scores), and specificity is the percentage of residues that are not singled out as important and can be found below the nth position. Each point on the graph is the specificity-sensitivity pair for one particular choice of n. The sensitivities and specificities of six methods (Shannon, von Neumann, IZ, PSZ, IRV, and PSRV) with raw and Lichtarge_HSSP alignment sequence sets are shown as Figs. 1 and 2 , from which we can see that IZ and PSZ top the other four methods in both sensitivity and specificity for almost any choice of n (the exception being of marginal size) indicating that the performances of IZ and PSZ are better than those of the other four methods. Furthermore, it can also be seen from Figs. 3 to 4, as well as Table 3 , that the performance of our gap-treating approach is better than that of Mihalek gap handling approach (Mihalek et al., 2004) . In addition to the insulin receptor 1irk, the comparison has been extended to cover more proteins. Shown in Table 3 are the predicted results by the four different methods on the 10 proteins listed in Table 2 . As shown in Table 4 , performances of IZ and PSZ are better than that of Lichtarge' s hybrid method and Ben-Tal's ConSurf method. Introducing von Neumann entropy and a novel gap-treating approach in the sequence alignment is quite promising for estimating residue evolutionary conservation. The novel approach can at least play a complementary role in the existing methods in this area. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information Allostery and induced fit: NMR and molecular modeling study of the trp repressor-mtr DNA complex Insights from modelling the 3D structure of the extracellular domain of alpha7 nicotinic acetylcholine receptor Insights from modelling the tertiary structure of BACE2 Modelling extracellular domains of GABA-A receptors: subtypes 1, 2, 3, 5 Molecular therapeutic target for type-2 diabetes Review: structural bioinformatics and its impact to biomedical science Prediction of the tertiary structure of a caspase-9=inhibitor complex A model of the complex between cyclin-dependent kinase 5(Cdk5) and the activation domain of neuronal Cdk5 activator Binding mechanism of coronavirus main proteinase with ligands and its implication to drug design against SARS Review: progress in computational approach to drug development against SARS Potential antivirals and antiviral strategies against SARS coronavirus infections Application of bioinformatics in search for cleavable peptides of SARS-CoV Mpro and chemical modification of octapeptides Polyprotein cleavage mechanism of SARS CoV Mpro and chemical modification of octapeptide Molecular modelling and chemical modification for finding peptide inhibitor against SARS CoV Mpro The substrate specificity of SARS coronavirus 3C-like proteinase Synthesis and activity assess of an octapeptide inhibitor designed for SARS coronavirus main proteinase ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information The spatial distribution of fixed mutations within genes coding for proteins Mutation analysis of 20 SARS virus genome sequences: evidence for negative selection in replicase ORF1b and spike gene Crystal structure of the activated insulin receptor tyrosine kinase in complex with peptide substrate and ATP analog Studies on the inhibitory effects of quercetin on the growth of HL-60 leukemia cells Synthesis of novel test compounds for antiviral chemotherapy of severe acute respiratory syndrome (SARS) A new sequence representation (FASGAI) as applied in better specificity elucidation for human immunodeficiency virus type 1 protease Evolutionary predictions of binding surfaces and interactions Searching for hypothetical proteins: theory and practice based upon original data and literature A family of evolution-entropy hybrid methods for ranking protein residues by importance Evolutionary trace report_maker: a new type of service for comparative analysis of proteins A structure and evolution-guided Monte Carlo sequence selection strategy for multiple alignment-based analysis of proteins Hsp70 mutant proteins modulate additional apoptotic pathways and improve cell survival Rapid and accurate structure determination of coiled-coil domains using NMR dipolar couplings:application to cGMP-dependent protein kinase Ia CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice Polysialic acid: from microbes to man 3D structure modeling of cytochrome P450 2C19 and its implication for personalized drug design Study of drug resistance of chicken influenza a virus (H5N1) from homology-modeled 3D structures of neuraminidases Insights from modeling the 3D structure of H5N1 influenza virus neuraminidase and its binding interactions with ligands Theoretical studies of Alzheimer's disease drug candidate [(2,4-dimethoxy) benzylidene]-anabaseine dihydrochloride (GTS-21) and its derivatives Anti-SARS drug screening by molecular docking Molecular insights of SAH enzyme catalysis and their implication for inhibitor design Antiviral drug discovery against SARS-CoV Putative hAPN receptor binding sites in SARS-CoV spike protein Molecular modeling studies of peptide drug candidates against SARS Exploring the binding mechanism of the main proteinase in SARS-associated coronavirus and its implication to anti-SARS drug design 2-D NMR analyses reveals a specific interaction between polyisoprenols (PIs) and the polyisoprenol recognition sequences (PIRS) in model membranes Characterization by NMR and molecular modeling of the binding of polyisoprenols (PI) and polyisoprenyl recognition sequence (PIRS) peptides: three-dimensional structure of the complexes reveals sites of specific interactions Invited review: NMR studies on how the binding complex of polyisoprenol recognition sequence peptides and polyisoprenols can modulate membrane structure NMR study of the preferred membrane orientation of polyisoprenols (dolichol) and the impact of their complex with polyisoprenyl recognition sequence peptides on membrane structure The three-dimensional structure of the cGMP-dependent protein kinase I-a leucine zipper domain and its interaction with the myosin binding subunit NOXclass: prediction of protein-protein interaction types This paper was supported in part by the National Natural Science Foundation of China (No. 60634030, No. 60372085), the Technological Table 3 . The top 25% (76 residues) of all residues in the ranking sequence according to the conservation score from small to big for the insulin receptor (lirk.pdb) by different methods a Authors' address: Shao-Wu Zhang, College of Automation, Northwestern Polytechnical University, Xi'an, 710072, China, Fax: þ86-29-88494352, E-mail: zhangsw@nwpu.edu.cn