key: cord-0867805-twmrilk8
authors: Liu, Xin; Zhao, Ya-Pu
title: Donut-shaped fingerprint in homologous polypeptide relationships—A topological feature related to pathogenic structural changes in conformational disease
date: 2009-05-21
journal: J Theor Biol
DOI: 10.1016/j.jtbi.2009.02.009
sha: 54c3a8137f74c923d699e4bd55344b88264778ca
doc_id: 867805
cord_uid: twmrilk8

Features of the homologous relationships of proteins can provide a general picture of the protein universe, assist protein design and analysis, and further our comprehension of the evolution of organisms. Here we studied the evolution of protein molecules by investigating homologous relationships among residue segments. The motivation was to identify detailed topological features of homologous relationships for short residue segments in the whole protein universe. Based on data for a large number of non-redundant proteins, the universe of non-membrane polypeptides was analyzed by considering both residue mutations and structural conservation. By connecting homologous segments with edges, we obtained a homologous relationship network of the whole universe of short residue segments, which we named the graph of polypeptide relationships (GPR). Since the network is extremely complicated for topological transitions, only subgraphs composed of vital nodes of the GPR were analyzed to obtain an in-depth understanding. Analysis of these vital subgraphs revealed a donut-shaped fingerprint. Using this topological feature, we identified the switch sites (residues 188-202, where exposure of previously hidden "hot spots" of fibril formation begins, providing a further opportunity for protein aggregation) of the conformational conversion of the normal α-helix-rich prion protein PrP^C to the β-sheet-rich PrP^Sc, which is thought to be responsible for a group of fatal neurodegenerative diseases, the transmissible spongiform encephalopathies. Efforts in analyzing other proteins related to various conformational diseases are also introduced.

© 2009 Elsevier Ltd. All rights reserved.

Computational approaches, such as homology modeling (Chou, 2004a), structural bioinformatics (Chou, 2004b; Liu et al., 2008b), pharmacophore modeling (Sirois et al., 2004), Monte Carlo simulated annealing (Chou, 1992), protein subcellular location prediction (Chou and Shen, 2007b, 2008), and signal peptide prediction (Chou and Shen, 2007a; Shen and Chou, 2007), can provide very useful information on and insight into basic research and drug design. Since our ability to characterize the biological properties of a protein is almost exclusively based on properties conserved through evolutionary time, the study of protein evolution using computational approaches has been the focus of many researchers (Socolich et al., 2005; Russ et al., 2005; Zhang and Liu, 2008; Liu et al., 2008a). Characterization of the protein universe can assist in comprehending the evolution of proteins, i.e., their formation, past, and future, in designing artificial proteins, and in providing information useful in other biological fields. A well-known feature of the protein universe is that some folds are abundantly represented by proteins with sequence identity as low as that of random sequences (Rost, 1997; Sander, 1993, 1997), whereas other folds are represented by a single sequence (Teichmann et al., 1999; Orengo et al., 1999; Holm and Sander, 1996). To explain this variability in fold representation, it has been suggested that a premise in convergent evolution is that folds with higher designability can be encoded and represented by more sequences (Finkelstein et al., 1995; Govindarajan and Goldstein, 1996; Li et al., 1996). This phenomenological notion was identified from exhaustive sequence enumeration in a lattice protein model. 
Application of this notion led to remarkable progress in research in areas such as folding mechanisms (Li et al., 1998; Wolynes, 1996; England et al., 2003), plotting of the distribution of protein populations (Taverna and Goldstein, 2000; Shakhnovich et al., 2005), and hereditary diseases (Wong and Frishman, 2006). However, the designability principle often provides features of a lattice model rather than of actual proteins. Several attempts have been made to define a more realistic picture of homologous relationships. Dokholyan et al. (2002) offered a general picture of the universe of protein structure. Based on structural alignments provided by the FSSP database, the authors claimed that the graph formed by the proteins (vertices) of a non-redundant set and the connections (edges) between any two structurally similar protein domains was a scale-free network. In such a network, the probability density $P(K)$ of a domain with $K$ related structures (its connection number) follows a power law $P(K) = K^{-\alpha}$. Several similar studies have been performed based on sequence, structure, or both (Huynen and van Nimwegen, 1998; Yanai et al., 2000; Qian et al., 2001; Koonin et al., 2002). All the networks obtained have the same scale-free feature. In fact, as a feasible method for obtaining useful insights, graphical approaches have been used in the study of many biological systems, such as enzyme-catalyzed reactions (Andraos, 2008; Chou, 1989; Chou and Forsen, 1980; Kuzmic et al., 1992; Myers and Palmer, 1985; Zhou and Deng, 1984), protein folding kinetics (Chou, 1990), inhibition kinetics of processive nucleic acid polymerases and nucleases (Althaus et al., 1993; Chou and Kezdy, 1994), analysis of codon usage (Chou and Zhang, 1992; Zhang and Chou, 1994), and analysis of DNA sequences (Qi et al., 2007), among others. 
Moreover, in recent years, graphical methods have also been used to deal with many complicated biosystems, e.g., QSAR studies (Prado-Prado et al., 2008), hard bionetwork systems (Diao et al., 2007), hepatitis B viral infection (Xiao et al., 2006), HBV gene missense mutations (Xiao et al., 2005b), visual analysis of SARS-CoV (Wang et al., 2005), representation of complicated biological sequences (Xiao et al., 2005a), and identification of protein attributes (Xiao and Chou, 2007). Graphical approaches are thus an active topic in biological and medical science. With improvements in graphical analysis capabilities, we can obtain deeper insight than before. For instance, in a graph of homologous relationships, the aforementioned scale-free feature only describes the distribution of vertex connectivity. Even if two networks have identical connectivity distributions, each may have specific characteristics that distinguish it from the other. Our interest is in identifying in-depth features specific to homologous relationships.

To obtain in-depth and detailed features, a reasonable standard for the definition of a homologous relationship is a prerequisite. Structure and sequence are two significant characteristics of a protein. Since remote homologous proteins can share the representative fold of a family, structure is a more robust characteristic than sequence. On the other hand, sequence similarity is vital in identifying homologous relationships. When drawing a network of homologous relationships, if only structural similarity is considered, proteins without similar biological properties might be mistakenly connected by an edge, which results in a false detail in the graph. Similarly, when only sequence information is considered, severe differences in biological properties (e.g., structure) might be tolerated. Thus, joint consideration of sequence and structural similarities is most appropriate for plotting a graph of homologous relationships (Qian et al., 2001).

Biological systems have evolved from simple to complex and from small to large. It has been proposed that short segments of polypeptides may have collapsed together to form folded protodomains in the early evolution of proteins (Trifonov and Berezovsky, 2003; Riechmann and Winter, 2006). Domains evolved to their modern size through the assembly and/or exchange of smaller gene segments encoding polypeptide segments of sub-domain size (Blake, 1978), for example, by exon shuffling (Gilbert, 1978) or non-homologous recombination (Bogarad and Deem, 1999). Thus, homologous relationships among short polypeptide segments offer an ideal perspective from which to investigate protein evolution. In addition, in protein evolution, insertions and deletions often occur in variable regions, but to a lesser degree in conserved regions that are important for biological properties. Consequently, much progress has been achieved by matching homologous proteins with ungapped residue segments on a site-by-site basis (Smith et al., 1990; Henikoff, 1991, 1992). Since the alignment of ungapped residue segments retains most of the information significant for the corresponding homologs, it also provides a suitable representation for characterizing homologous relationships among short polypeptide segments.

In the present study we propose a novel approach to investigating homologous relationships for proteins that provides useful information on various conformational diseases. We used information on ungapped aligned residue segments to plot a general graph of polypeptide relationships (GPR) by jointly considering sequence and structural similarities. Detailed analysis of the graph revealed a donut-shaped fingerprint in a vital subnetwork of the GPR. Using the information provided by this fingerprint, we identified switch sites for the conformational conversion of the prion protein and of other conformational disease-related proteins.

We investigated homologous relationships between pairs of residue segments. In total, 1612 non-membrane proteins from PDB_SELECT25 (issued on 25 September 2001; Hobohm and Sander, 1994) were used in the analysis. In this non-redundant data set, no pair of sequences shares sequence identity of more than 25%. The solvent-accessible area of each residue was calculated for every protein using the DSSP algorithm (Kabsch and Sander, 1983). A protein sequence is treated as a succession of residue segments. As the residue-residue correlation is notable in 15-residue segments (Liu et al., 2003), we used a window width of 15. By sliding a 15-residue window along the protein sequence, each segment of the data set serves as a query segment.
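
The window-sliding step can be illustrated with a minimal sketch. The function name `sliding_segments` and the toy sequence are assumptions made for illustration only; they are not taken from the original implementation.

```python
def sliding_segments(sequence, width=15):
    """Yield (start, segment) for every ungapped window of `width` residues."""
    for start in range(len(sequence) - width + 1):
        yield start, sequence[start:start + width]


# Hypothetical 20-residue toy sequence: it yields 20 - 15 + 1 = 6 query segments.
toy_sequence = "MKTAYIAKQRQISFVKSHFS"
segments = list(sliding_segments(toy_sequence))
```

A sequence shorter than the window simply produces no segments, so short chains drop out without special handling.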

In the universe of residue segments, samples are biased, i.e., some segments are closely related to others. To reduce this bias, filter out redundant samples, and obtain a non-redundant GPR, we constructed a non-redundant target set $\{CR_{\le 4}\}_m$ for the homolog search of each query segment $m$. This target set is the subset of our database in which each segment shares no more than four common residues with the query segment ($CR \le 4$, i.e., a sequence identity of at most 26.7%). For each query, homologs are searched in the corresponding target set. In this way, all the segments obtained are remote homologs of the query polypeptide.
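
As a rough sketch of this filter, the snippet below assumes that "common residues" are counted position by position in the ungapped alignment of two 15-residue segments (4/15 = 26.7%). The names `common_residues` and `target_set` and the toy segments are hypothetical.

```python
def common_residues(seg_a, seg_b):
    """Count aligned positions where two equal-length segments carry the same residue."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length")
    return sum(a == b for a, b in zip(seg_a, seg_b))


def target_set(query, database, max_common=4):
    """Keep database segments sharing at most `max_common` residues with the query."""
    return [seg for seg in database if common_residues(query, seg) <= max_common]


# Toy 15-residue segments (hypothetical, not real protein data).
query = "A" * 15
database = ["A" * 4 + "G" * 11, "A" * 5 + "G" * 10, "G" * 15]
remote_pool = target_set(query, database)
```

Under this reading the query itself (15 shared residues) is excluded automatically, so no separate self-match check is needed.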

For each query segment $m$, we searched the corresponding target set $\{CR_{\le 4}\}_m$ for ungapped segments that are similar to the query in terms of both sequence and structure, i.e., remote homologs. This was carried out in two steps. First, multi-aligned remote homolog candidates of the query segment were initialized using a center-star approach. Then the remote homolog candidates were optimized using a position-specific matrix, an updated scoring scheme used in evaluating homologous relationships.

For each query segment $m$, a segment $n \in \{CR_{\le 4}\}_m$ is called a non-redundant structural analog of $m$ if the following two conditions are satisfied.

1. Structural similarity: $\mathrm{drms}(m, n) < 4$ Å, where the distance root mean squared deviation (drms; Park and Levitt, 1995) for structures $m$ and $n$ is defined from the average difference between corresponding intra-segment distances,

$$\mathrm{drms}(m, n) = \sqrt{\frac{2}{N(N-1)} \sum_{i<j} \left( \left|\mathbf{r}_{mi} - \mathbf{r}_{mj}\right| - \left|\mathbf{r}_{ni} - \mathbf{r}_{nj}\right| \right)^2},$$

where $\mathbf{r}_{ai}$ is the coordinate of the C$_\alpha$ atom $i$ in structure $a$ and $N = 15$ is the segment length.

2. Difference in surface residues: $Z = \sum_{i=1}^{15} \delta(s_{mi}, s_{ni})$ is at most 2 between $m$ and $n$, where $s_{ai} = 1$ for a surface residue and $s_{ai} = 0$ otherwise, and $\delta(x, y)$ is a step function with $\delta(x, y) = 0$ for $x = y$ and $\delta(x, y) = 1$ otherwise. The contribution of a residue to the folding mechanism and protein function depends on whether or not it is exposed to solvent. To identify highly related samples, we investigated segments with exposed/buried residues similar to those of $m$. A surface residue is defined as one with an accessible area greater than 10% of the maximum accessible surface area (Chothia, 1975) for that type of residue.
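
The two conditions can be checked with a short sketch. It assumes the standard pairwise-distance form of drms from Park and Levitt (1995) and takes the surface flags (1 = surface, 0 = buried) as precomputed input; the function names and toy coordinates are illustrative assumptions.

```python
import math
from itertools import combinations


def drms(coords_m, coords_n):
    """Distance RMS deviation between two equal-length lists of (x, y, z) C-alpha coordinates."""
    if len(coords_m) != len(coords_n):
        raise ValueError("structures must have the same number of atoms")
    pairs = list(combinations(range(len(coords_m)), 2))
    total = 0.0
    for i, j in pairs:
        d_m = math.dist(coords_m[i], coords_m[j])  # intra-segment distance in m
        d_n = math.dist(coords_n[i], coords_n[j])  # corresponding distance in n
        total += (d_m - d_n) ** 2
    return math.sqrt(total / len(pairs))


def surface_difference(surface_m, surface_n):
    """Z: number of positions whose surface/buried flags differ."""
    return sum(a != b for a, b in zip(surface_m, surface_n))


def is_structural_analog(coords_m, coords_n, surface_m, surface_n,
                         max_drms=4.0, max_z=2):
    """Apply both conditions: drms below 4 angstroms and Z of at most 2."""
    return (drms(coords_m, coords_n) < max_drms
            and surface_difference(surface_m, surface_n) <= max_z)


# Toy data: an extended 15-atom chain and a rigidly translated copy (drms = 0).
chain = [(3.8 * i, 0.0, 0.0) for i in range(15)]
moved = [(x + 10.0, y + 1.0, z) for x, y, z in chain]
```

Because drms compares internal distances, it needs no superposition of the two structures, which is convenient when screening many candidate segments.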

Non-redundant structural analogs of the query segment $m$ were searched for in the set $\{CR_{\le 4}\}_m$. In addition to structural similarity, sequence similarity can be evaluated from knowledge of the propensity of residues to substitute for each other in homologous proteins (Henikoff and Henikoff, 1992). Each non-redundant structural analog $n$ of the query segment $m$ has a pairwise sequence alignment score

$$T(m, n) = \sum_{i=1}^{15} B(a_i, b_i),$$

where $a_i$ and $b_i$ are the residues at position $i$ of $m$ and $n$, and $B(a_i, b_i)$ is the element of the BLOSUM30 matrix for the substitution of residues $a_i$ and $b_i$. $T$ is a measure of the degree of sequence similarity, and approximately corresponds to the biological homophyly between the non-redundant structural analog and the query segment. We ranked the non-redundant structural analogs of $m$ in descending order of score $T$, and used the top 30 samples as the initial remote homolog candidates in the multiple sequence alignment of the subsequent optimization step.
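
The initial ranking can be sketched as follows, assuming the score is the position-wise sum of substitution-matrix entries. The toy two-letter matrix stands in for BLOSUM30 and is purely hypothetical, as are the function names.

```python
def pairwise_score(seg_m, seg_n, matrix):
    """Sum substitution scores over aligned positions of two ungapped segments."""
    return sum(matrix[(a, b)] for a, b in zip(seg_m, seg_n))


def top_candidates(query, analogs, matrix, keep=30):
    """Rank structural analogs by descending score and keep the best `keep`."""
    ranked = sorted(analogs,
                    key=lambda seg: pairwise_score(query, seg, matrix),
                    reverse=True)
    return ranked[:keep]


# Hypothetical toy matrix over a two-letter alphabet (NOT BLOSUM30 values).
toy_matrix = {("A", "A"): 4, ("G", "G"): 4, ("A", "G"): -1, ("G", "A"): -1}
best_two = top_candidates("AAA", ["AGG", "AAG", "GGG"], toy_matrix, keep=2)
```

In practice the matrix would be loaded from a BLOSUM30 table keyed by residue pairs; only the lookup structure matters for this sketch.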

In this step, a position-specific matrix of multiple sequence alignments was calculated for the top 30 samples. (The first matrix was calculated using sequences provided by step 1. Updated matrices were calculated using samples produced by step 2.) The scores for a specific position $i$ are of the form

$$S_{ij} = \log \frac{Q_{ij}}{p_j}.$$

The probability of finding residue $j$ in column $i$ is estimated as

$$Q_{ij} = \frac{\alpha f_{ij} + \beta g_j}{\alpha + \beta},$$

where $f_{ij}$ is the observed frequency for residue $j$, $\alpha$ and $\beta$ are the relative weights for the observed and pseudocount residue frequencies, $\alpha = N$ is the number of aligned segments, and $\beta$ is set to $\beta = 10$. The pseudocount frequencies for column $i$ are $g_j = \sum_k (f_{ik}/p_k)\, q_{jk}$, where $q_{jk}$ are the target frequencies according to data from the BLOSUM30 matrix and $p_l$ is the background probability of the occurrence of residue $l$ implicit in the BLOSUM30 matrix. For a segment $t$ with residues $t_1, \dots, t_{15}$, the alignment score is calculated as

$$T'(t) = \sum_{i=1}^{15} S_{i t_i}.$$

We searched the target set $\{CR_{\le 4}\}_m$ for non-redundant structural analogs of the query segment $m$ (so that structural similarity is retained) and ranked them in decreasing order of score $T'$ (so that sequence similarity is retained). The top 50 segments were identified as updated remote homolog candidates. After reranking these 50 segments using the HFnet (Hydrophobic Force network) algorithm, which boosts the quality of the sequence alignment (see supporting information), the position-specific matrix was updated with the top 30 samples. The contribution from HFnet had nearly converged after seven iterations. Thus, to maximize the alignment quality, a total of seven iterations were performed. The final top 30 segments were then selected as remote homologs and formed the remote homolog set $\{R_m\}$ for polypeptide $m$.
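
The position-specific scoring used in this optimization step can be sketched as below. The sketch assumes a log-odds form for the scores (the usual choice for pseudocount schemes with observed-frequency weight N and pseudocount weight 10) and uses a hypothetical two-letter alphabet in place of BLOSUM30 target and background frequencies. The HFnet reranking is not reproduced.

```python
import math


def position_specific_scores(aligned, background, target, beta=10.0):
    """Build per-column log-odds scores from aligned segments.

    aligned    : list of equal-length segments
    background : {residue: background probability p}
    target     : {(j, k): target frequency q_jk}
    """
    n = len(aligned)
    alpha = float(n)  # weight of observed frequencies = number of segments
    residues = list(background)
    scores = []
    for i in range(len(aligned[0])):
        column = [seg[i] for seg in aligned]
        f = {r: column.count(r) / n for r in residues}  # observed frequencies
        row = {}
        for j in residues:
            # Pseudocount frequency g_j = sum_k (f_k / p_k) * q_jk
            g = sum(f[k] / background[k] * target[(j, k)] for k in residues)
            q = (alpha * f[j] + beta * g) / (alpha + beta)
            # q > 0 as long as all target frequencies are positive
            row[j] = math.log(q / background[j])
        scores.append(row)
    return scores


def alignment_score(segment, scores):
    """Sum the column scores of the residues found in `segment`."""
    return sum(scores[i][res] for i, res in enumerate(segment))


# Hypothetical toy frequencies (NOT BLOSUM30 data).
toy_background = {"A": 0.5, "G": 0.5}
toy_target = {("A", "A"): 0.4, ("G", "G"): 0.4, ("A", "G"): 0.1, ("G", "A"): 0.1}
toy_scores = position_specific_scores(["AA", "AA"], toy_background, toy_target)
```

Segments that match the aligned columns score higher (here "AA" beats "GG"), which is what lets the matrix rerank candidates after each iteration.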

In general, three factors are considered in scanning remote homologs:

1. Sequence identity is limited to 26.7%, so that a non-redundant graph is obtained.

2. Structural similarity is required during the initialization and optimization processes, so that structural information is not lost.

3. Sequence similarity is retained by adopting the top-ranked samples according to the BLOSUM30 homolog database and the updated position-specific matrix.

We attempted to plot a homologous relationship network for the whole universe of polypeptides. Each query segment of our database was defined as a node of the polypeptide relationship network. Using the above method, remote homologs were found for each of these nodes (queries). Two nodes $A$ and $B$ are deemed to be related if $B \in \{R_A\}$ or $A \in \{R_B\}$, where $\{R_A\}$ and $\{R_B\}$ are the remote homolog sets for $A$ and $B$, respectively. If $\{R_A\}$ and $\{R_B\}$ share no fewer than five segments, we say that nodes $A$ and $B$ are connected by an edge $(A, B)$. By this definition, each pair of connected nodes (polypeptides) has similar biological properties but a low level of sequence identity. Consequently, we constructed a non-redundant, unweighted GPR in which all edges are treated equally. For each node, the connectivity $K$ is defined as the number of edges connected to the node.
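
A minimal sketch of the edge rule follows. It implements only the shared-segment threshold (at least five common remote homologs); whether the "related" condition must also hold for an edge is not spelled out in the text, so that is left out here as an explicit assumption. The names `build_gpr` and `connectivity` are hypothetical.

```python
from itertools import combinations


def build_gpr(homolog_sets, min_shared=5):
    """Return an adjacency map {node: set of neighbours} for an unweighted graph.

    homolog_sets : {node: set of remote-homolog identifiers}
    """
    adjacency = {node: set() for node in homolog_sets}
    for a, b in combinations(homolog_sets, 2):
        if len(homolog_sets[a] & homolog_sets[b]) >= min_shared:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connectivity(adjacency):
    """Connectivity K of each node: the number of edges attached to it."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}


# Toy homolog sets (hypothetical identifiers).
toy_sets = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {2, 3, 4, 5, 6, 7},      # shares five segments with A
    "C": {6, 7, 8, 9, 10, 11},    # shares too few with A or B: an orphan
}
toy_graph = build_gpr(toy_sets)
toy_k = connectivity(toy_graph)
```

The all-pairs loop is quadratic in the number of nodes; for a database-scale graph an inverted index from homolog to nodes would be the more practical design.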

To decrease the probability of false connection, we introduced an optimization approach by counting the shared segments between two remote homolog sets. The homologous relationship is credible if A shares enough remote homologs with B. A low threshold for the number of shared segments results in a high level of false connections, whereas a high threshold results in more orphans. Empirically, we recommend 5 as a threshold.

As shown in Fig. 1, the polypeptide population fits a power law. Beyond this approximate feature, little further knowledge has been reported in the literature. In fact, the network of homologous relationships in the polypeptide universe is so vast and complicated that many researchers have avoided a detailed analysis. Consequently, to the best of our knowledge, detailed features of the network of polypeptide relationships are still unknown. Here we investigated such details by analyzing the character of the GPR from a vital subgraph formed by significant vertices. Using the PAJEK software (http://vlado.fmf.uni-lj.si/pub/networks/pajek/), the representative features of a network can be analyzed by topological transformation. In this algorithm, nodes and edges are placed in a plane. Related nodes are drawn close to each other by introducing a virtual attractive force between vertices connected by an edge. A virtual repulsive force is also introduced, so that all vertices repel each other and no pair of vertices can get too close. The topological structure of the network is transformed by minimizing the system energy, and the coordinates of node clusters converge after energy minimization in PAJEK. As a result of such topological transformations, tightly correlated nodes, i.e., homologous polypeptides, are bunched into node clusters.
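
The power-law behaviour of the node population mentioned at the start of this paragraph can be checked with a simple sketch. The text does not say how the fit was made; a least-squares line in log-log space is assumed here as a crude estimator, and the function names and toy degree list are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Empirical P(K) over nodes with K > 0."""
    counts = Counter(k for k in degrees if k > 0)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def power_law_exponent(distribution):
    """Estimate alpha in P(K) ~ K^(-alpha) from a log-log least-squares slope.

    Needs at least two distinct K values.
    """
    xs = [math.log(k) for k in distribution]
    ys = [math.log(p) for p in distribution.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Toy degrees whose counts follow K^-2 exactly (400, 100, 25, 4, 1 nodes).
toy_degrees = [1] * 400 + [2] * 100 + [4] * 25 + [10] * 4 + [20] * 1
toy_alpha = power_law_exponent(degree_distribution(toy_degrees))
```

Log-log regression is sensitive to noisy tails, so for real data a maximum-likelihood estimate would be a more careful choice.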

We plotted subnetworks of the polypeptide relationships for nodes with connectivity $K > 60$. The vertices of the network were colored according to the protein secondary structure taken from the DSSP database. As in most methods, we considered three types of conformation $\{h, e, c\}$, generated from the eight possible DSSP states by the coarse graining $h, g, i \to h$; $e \to e$; and $x, t, s, b \to c$. In a polypeptide, a subsegment $a_i a_{i+1} a_{i+2} a_{i+3} a_{i+4} a_{i+5} a_{i+6}$ ($i = 0$ or 8) is defined to be H if more than three of its residues are in helix conformation, E if more than three of its residues are in strand conformation, and C otherwise. Thus, nine non-overlapping polypeptide states are defined: HH, HC, CH, EE, EC, CE, HE, EH, and CC. The graphs obtained for these subnetworks are shown in Fig. 2 in decreasing order of connectivity. A clear donut-shaped fingerprint is evident. The PAJEK software includes several options for clustering that differ in force model and distance measure. We tested many different options, and the resulting donut-shaped topological feature is robust.
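
The state assignment can be written down as a short sketch. It takes a 15-character DSSP string as input, maps H, G, I to helix and E to strand, and treats every other symbol as coil (an assumption that also covers blanks). The function names are illustrative.

```python
HELIX_OR_STRAND = {"H": "h", "G": "h", "I": "h", "E": "e"}  # all else -> "c"


def coarse_grain(dssp):
    """Reduce DSSP states to the three classes h, e and c."""
    return "".join(HELIX_OR_STRAND.get(s.upper(), "c") for s in dssp)


def subsegment_label(sub):
    """H if more than three helix residues, E if more than three strand, else C."""
    if sub.count("h") > 3:
        return "H"
    if sub.count("e") > 3:
        return "E"
    return "C"


def segment_state(dssp15):
    """Two-letter state of a 15-residue segment from its subsegments at i = 0 and i = 8."""
    if len(dssp15) != 15:
        raise ValueError("expected a 15-residue DSSP string")
    reduced = coarse_grain(dssp15)
    return subsegment_label(reduced[0:7]) + subsegment_label(reduced[8:15])
```

For example, `segment_state("HHHHHHHTTSSEEEE")` gives `"HE"`: the first seven residues are all helix, and the last seven contain four strand residues. A seven-residue subsegment cannot hold more than three helix and more than three strand residues at once, so the order of the two tests does not matter.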

Helix segments and their N- and C-terminal caps (HH + HC + CH) make up the main body of the donut shape. Significant groups (types) of strand segments and their caps are connected neither to the ring nor to each other in the subnetworks with $K > 100$. When nodes with connectivity as low as 80 are included (Fig. 2E,F), such connections emerge with decreasing $K$. As shown in Fig. 2E, nodes that connect across the diameter of the donut shape appear at approximately $K = 60$. With a further decrease in $K$, crossings between different parts of the donut ring appear. In other words, the ring in the GPR is connected by nodes with low connectivity.

Fig. 1. Node population as a function of connectivity.

Fig. 2. Donut-shaped fingerprint of the polypeptide relationship network. Nodes with a connection number $K$ greater than 60 are plotted. Orphans in these subnetworks are omitted. Tightly related nodes are bunched up by PAJEK. The donut is rich in HH + HC + CH samples. The arc in E is rich in EE + EC + CE samples. A connected strand-arc is evident in F. In F, there is only one edge (colored in black) that connects the helix-donut to the strand-arc.

We find that the subgraphs shown in Fig. 2 are not trivial node-limited profiles of polypeptide relationships, but characterize the topological feature of the whole GPR. As shown in Table 1, for nodes with connectivity $K > 80$, 7316 non-orphan nodes ($K \neq 0$) exist in the corresponding subnetwork (Fig. 2F). Although these nodes (segments) are contributed by only 7.9% of the amino acids in our database, the segments related to them (according to the definition in Subsection 2.2) or directly connected to them (two nodes joined by one edge) cover the residues of nearly the whole data set. 
As these segments have biological properties similar to those of the corresponding nodes, the aforementioned simple topological feature represents the nature of the whole polypeptide universe, i.e., there are two nearly separate regions in the phase space of polypeptide segments: a helix-donut zone and a strand-arc zone. The two parts are only weakly connected by sparse edges. Although we cannot draw a picture of the whole GPR because of its extreme complexity, and only vital subgraphs can be depicted, the position of a segment in the polypeptide phase space can be deduced from its secondary structure. We assumed that HH + HC + CH samples belong to the helix-donut zone, whereas EE + EC + CE samples are in the strand-arc zone. In this way, a picture of the whole graph of the polypeptide universe is constructed. Moreover, the origin of the complicated protein universe might be very simple. As shown by the first two rows of Table 1, the nodes shown in Fig. 2A,B, comprising approximately 2% of the residues in our database, 'determine' the properties of nearly 80-90% of the sites in the database.

To reveal the reason for the donut shape, we selected the graph shown in Fig. 2C, a network of moderate complexity, for detailed analysis. In this graph, the coordinates of node clusters represent the approximate position of a specific group of homologous polypeptides. By introducing a virtual center and a clockwise angle $\varphi$, samples of the donut shape in successive $\pi/6$ slices were investigated. Polypeptides in each slice were matched site by site. The probability densities of buried and hydrophobic residues were calculated for each site (residue classification: hydrophobic h = {M, F, I, L, V, A, W}, polar p = {C, Y, Q, H, P, G, T, S, N, R, K, D, E}; Liu et al., 2002). As shown in Fig. 3, a successive shift in buried/hydrophobic residues with varying $\varphi$ was observed for the polypeptides making up the donut shape. Thus, the distribution of buried/hydrophobic residues may be closely related to the donut shape. Since helix forms are abundant in this shape, the period of buried/hydrophobic residues is approximately 4.
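
The slice analysis can be sketched as follows. The zero direction of the clockwise angle is not stated in the text, so the positive x axis is assumed here, with the y axis pointing up. Only the hydrophobic profile is shown (the buried profile would be computed the same way from surface flags), and the function names are hypothetical.

```python
import math

HYDROPHOBIC = set("MFILVAW")  # classification quoted in the text


def slice_index(x, y, center=(0.0, 0.0), n_slices=12):
    """Index of the pi/6 slice containing (x, y), using a clockwise angle."""
    # atan2(-dy, dx) turns the usual counter-clockwise angle into a clockwise one.
    angle = math.atan2(-(y - center[1]), x - center[0]) % (2 * math.pi)
    return min(int(angle / (2 * math.pi / n_slices)), n_slices - 1)


def hydrophobic_profile(segments):
    """Fraction of hydrophobic residues at each site of site-matched segments."""
    n = len(segments)
    width = len(segments[0])
    return [sum(seg[i] in HYDROPHOBIC for seg in segments) / n
            for i in range(width)]
```

Grouping node clusters by `slice_index` and calling `hydrophobic_profile` on the segments of each slice gives one profile per slice, so a shift of the hydrophobic pattern with the angle shows up as a displacement between consecutive profiles.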

Moderate conversion of structure is vital for the biological properties of protein molecules. Thus, moderate changes in the structure of homologous proteins are allowable, whereas a significant conversion may not be possible. As illustrated by the vital subgraphs (Fig. 2E,F), only a limited number of nodes in the whole GPR form a 'bridge' between the helix-donut zone and the strand-arc zone. This means that, in terms of protein evolution, significant structural conversion, e.g., a change from a helix to a sheet, is difficult. Consequently, the protein universe is in a relatively steady state, with infrequent exceptions. One well-known exception is the prion protein (PrP), which exhibits a change in structure under pathological conditions (Prusiner, 1982, 1998).

PrP is deemed to be responsible for transmissible spongiform encephalopathies (TSEs), a group of fatal neurodegenerative diseases that are associated with conformational conversion of the normally monomeric and α-helical protein molecule, PrP^C, to the β-sheet-rich PrP^Sc. TSEs arise in several mammalian species by genetic, infectious, or sporadic means, and include bovine spongiform encephalopathy in cattle, scrapie in sheep, chronic wasting disease in cervids, and Creutzfeldt-Jakob disease and kuru in humans (Prusiner, 1982, 1998; Caughey and Baron, 2006; Collinge, 2001; Aguzzi and Polymenidou, 2004; Weissmann, 2004). It is now widely accepted that in these protein-only diseases (Prusiner, 1982), TSE transmission does not require nucleic acids.

As a marginally stable form between the α-helix-rich and β-sheet-rich states, PrP must be a protein with inbuilt polypeptides related to some 'bridge' nodes of the GPR (nodes connecting the helix-donut zone to the strand-arc zone). It is also reasonable that such inbuilt polypeptides should correlate with the origin of the conformational conversion. Although identifying a detailed mechanism for this structural conversion is beyond the scope of the present study, and the structure of PrP^Sc is largely unknown, we can apply our view of protein evolution to identify the segment in which the conformational conversion arises.

Table 1 notes: Count_NON, number of non-orphan nodes in a subnetwork; Coverage-NON, coverage given by the non-orphan nodes in a subnetwork; Coverage-SRNON, coverage given by the non-orphan nodes of a subnetwork together with their related nodes ($A$ and $B$ are related if $B \in \{R_A\}$ or $A \in \{R_B\}$); Coverage-SCNON, coverage given by the non-orphan nodes of a subnetwork together with their directly connected nodes.

By sliding a window along the residue sequence, each 15-residue segment of human PrP (hPrP, residues 121-230, PDB ID: 1QM2) was analyzed in terms of the GPR. The vertices and edges of the whole GPR were used as a framework for defining polypeptide relationships. For a given segment of hPrP, the top 30 remote homologs were searched for in the whole GPR using the method described in Subsection 2.1. By definition, these remote homologs are highly similar to the query hPrP segment in sequence and structure, i.e., they represent agents of the query segment. The nodes of the GPR directly connected to these agents were identified. The states (HH, HC, etc.) of these collected nodes indicate the probability that an agent belongs to the helix-donut zone (corresponding to HH + HC + CH) or the strand-arc zone (EE + EC + CE). We assigned the identified nodes to the site of the central residue of the query hPrP segment. The frequencies of the types of nodes identified are shown site by site in Fig. 4.
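
The site-by-site tally can be sketched roughly as below. Several points are assumptions: `find_homologs` is a hypothetical callable standing in for the remote-homolog search, neighbours reached through several agents are counted once per agent, and counts are normalized to fractions.

```python
from collections import Counter

HELIX_STATES = {"HH", "HC", "CH"}   # helix-donut zone
STRAND_STATES = {"EE", "EC", "CE"}  # strand-arc zone


def zone_frequencies(agents, adjacency, node_state):
    """Fractions of helix-donut and strand-arc states among neighbours of the agents."""
    counts = Counter()
    for agent in agents:
        for neighbour in adjacency.get(agent, ()):
            counts[node_state[neighbour]] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    helix = sum(counts[s] for s in HELIX_STATES) / total
    strand = sum(counts[s] for s in STRAND_STATES) / total
    return helix, strand


def site_profile(sequence, first_residue, find_homologs, adjacency, node_state,
                 width=15):
    """Map each window's central residue number to (helix, strand) fractions."""
    result = {}
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        centre = first_residue + start + width // 2
        result[centre] = zone_frequencies(find_homologs(segment),
                                          adjacency, node_state)
    return result
```

With `first_residue=121` the first window covers residues 121-135 and is assigned to site 128, so the profile starts seven residues into the chain and ends seven residues before its end.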

Usually the aggregation-prone regions tend to be blocked in the native state of globular proteins, because their side chains are hidden in the inner hydrophobic core or because the cellular environment prevents the conditions under which aggregates form (Dobson, 1999). Pathogenic misfolding starts in a region where the unlocking begins, which acts like a switch for the exposure of sensitive regions. Then, following exposure or partial exposure (Claudio, 2001), the aggregation-prone regions may have a further chance to form amyloid along the subsequent folding pathway. Here we attempted to predict such switch sites in hPrP using the GPR feature of sparse connections between the helix-donut zone and the strand-arc zone. We assumed that conformational conversion is due to a transition between the two regions of polypeptide phase space. If polypeptides change their structures only near their native conformations, there will be no conformational disease. We should therefore pay attention to segments that are prone to fold into structures other than their native conformations. In Fig. 4, apart from the sites of the two native strands at approximately 130 and 160, there is a peak for EE + EC + CE at site 195. With a high frequency for EE + EC + CE and a low frequency for HH + HC + CH, the two native strands can easily extend according to the GPR. Owing to thermal motion, a protein molecule can moderately change its conformation at room temperature. Such facile extension of the native strands should be tolerated by PrP^C; otherwise, if it could cause disease, the corresponding life-form would have been lost during evolution. Therefore, it is likely that such sites are not responsible for the conformational changes related to disease. The sites around position 195, on the other hand, are different. As shown in Fig. 4, the native conformation of this region is in HH + HC + CH. As these sites have a high probability of being in their native helix-donut state, it is normally difficult for them to change state to the strand-arc. 
In this special case, however, there is a reasonable probability that the polypeptide will transform to the strand-arc region, i.e., induce a conformational conversion. Consequently, residues around position 195 (≈188-202) should be responsible for the disease-related conformational change. This conclusion contrasts with earlier, largely theoretical models, and is consistent with the experimental observations of Kuwata et al. (2007) that intercalation of the anti-prion compound GN8 at residues N159, V189, T192, K194, and E196 hampers the pathogenic conversion process. Moreover, as non-redundant polypeptides were used throughout our approach and analysis, this conclusion should hold for all members of the PrP family.
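
The selection logic of this argument can be condensed into a small sketch: sites whose native state lies in the helix-donut zone yet show a notable strand-arc frequency are flagged, while native strands are ignored. The threshold `min_strand` and all numbers in the toy example are invented for illustration and are not read from Fig. 4; the text identifies the peak by inspection rather than by a fixed cutoff.

```python
def candidate_switch_sites(profile, native_state, min_strand=0.3):
    """Return sorted sites that are natively helix-type but strand-prone in the GPR.

    profile      : {site: (helix_fraction, strand_fraction)}
    native_state : {site: two-letter state such as "HH" or "EE"}
    min_strand   : hypothetical cutoff on the strand-arc fraction
    """
    helix_states = {"HH", "HC", "CH"}
    return sorted(site for site, (_helix, strand) in profile.items()
                  if native_state[site] in helix_states and strand >= min_strand)


# Invented toy values that only mimic the qualitative picture described above.
toy_profile = {130: (0.1, 0.8), 160: (0.2, 0.7), 195: (0.5, 0.4), 210: (0.9, 0.05)}
toy_native = {130: "EE", 160: "EC", 195: "HH", 210: "HH"}
flagged = candidate_switch_sites(toy_profile, toy_native)
```

Filtering on the native state first is the key design choice: it removes native strands, whose high strand frequency reflects harmless extension rather than a switch between zones.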

Here we identified a simple feature of the evolution of protein molecules and presented a general picture of the non-membrane polypeptide universe. In the GPR there are few shortcuts across the diameter of the donut and few 'bridges' between the helix-donut zone and the strand-arc zone. Such crossing nodes are of low connectivity. This indicates that homologous relationships generally evolve gradually. Most polypeptides evolved strictly along a helix-donut or a strand-arc track, with very few samples exhibiting a drastic shift in biological properties during evolution. For example, as shown in Fig. 3, such evolution induces a gradual change in the distributions of buried and hydrophobic residues and thus in biochemical properties. Interestingly, this evolution can eventually link up and form a ring. Since the present work focuses on divergent evolution, this suggests that divergent evolution can result in convergent evolution at the sub-domain level, but in a gradual way that induces a donut-shaped topological feature.

It is worth considering the formation of the donut ring further. The shift in the distributions of buried and hydrophobic residues is highly correlated with the donut. As polypeptide segments evolve, there should be opportunities to form different groups of polypeptide structures, each with its own donut-shaped fingerprint showing a shift in buried/hydrophobic residues but a different path of change in three-dimensional structure. This would result in several connected rings, but that is not what happens. As the GPR is a network with moderate structural deviation, a two-step evolution may cause a structural change as large as that between different groups. This is illustrated by the insets in Fig. 3, where the distribution of protein secondary structure shows no obvious pattern. Consequently, the candidate donut rings finally join together and only one ring results. On this view, the graph for longer polypeptide segments will only correspond to a further shift in the distribution of buried/hydrophobic residues, and the single-ring donut-shaped fingerprint will recur. Indeed, we have drawn the graphs for segments of length 17 and 19 and found similar fingerprints.

A marked difference between this study and others is that the picture obtained not only provides details of topological features, but also has direct and important applications. According to the GPR, a sparse connection between the helix-donut and the strand-arc is an indicator of conformational changes related to disease. As shown by the analysis of PrP, we can use conformational information on one state to deduce the switch sites for structural conversion related to pathological conditions. This study can be extended to other conformational diseases, such as sickle cell anemia, antithrombin deficiency thromboembolic disease, and familial amyloid neuropathy (see supporting information; details to be published elsewhere). Identifying the site of origin of such conformational conversions is extremely important in designing suitable therapeutic approaches. By revealing switch sites for structural conversion, we can design drugs to hamper this pathogenic conversion process, or even improve species by mutation. Such switch sites are usually determined from cases in which the switch role is evident, such as disease-related point mutations reported clinically and experiments that hamper the pathogenic conversion process. The cost of such research is considerable. More significantly, the disease conformation was believed to be a prerequisite in previous research, which limited the number of proteins that could be investigated. A systematic study of conformational diseases in organisms was thus beyond the scope of previous approaches. The knowledge provided by the GPR can be used to remove the unnecessary requirement for disease structures, to predict target sites for clinical treatment, and to investigate suitable therapy schemes based on normal proteins. This new approach could lead to great progress in curing conformational diseases.
Moreover, as demonstrated by the example described here, GPR considers both structural information and sequence identity, and thus represents a suitable strategy for meeting challenges in the design of conformational protein switches (Ambroggio and Kuhlman, 2006) .

Connections in the GPR are highly exact. As shown in Fig. 2F, none of the 109,045 edges of the subgraph makes a false connection crossing the diameter. As our aim was to provide a general picture of the whole universe of polypeptide segments, nodes and connections should be both representative and properly weighted so that the resulting feature is universal and unbiased. Consequently, remote homologous relationships were selected as a feasible representation. In this representation, however, if the criterion for structural similarity is too strict, there will be a drastic decrease in the number of suitable candidates. We therefore set the cut-off at drms < 4 Å, which is a moderate level. This leaves room for false connections. However, such false connections are ultimately kept under control, largely owing to the HFnet algorithm. In 2008 we suggested that the family-representative intramolecular hydrophobic force network makes a crucial contribution to the biological properties conserved throughout protein evolution (Liu et al., 2008a), a view that sheds considerable light on protein evolution. Based on this theory, we developed a model called HFnet to evaluate the significance of each sequence in a given multiple sequence alignment. The power of HFnet has been demonstrated not only in silico but also in wet-lab experiments. Using the HFnet algorithm, we designed five artificial remote proteins of the WW domain. As all of them have low pairwise sequence identity (<30%) with each other and with each protein in the learning set, it is usually difficult to write out such sequences, let alone a family sharing specific biological properties. Nevertheless, in biological experiments four of them exhibited detectable ligand-binding affinity. These experimental data demonstrate that our theory and the HFnet algorithm are very robust, and identify not only protein structure but also biological properties.
In the present case, the HFnet algorithm contributed at least a 50% increase in the accuracy of remote homolog identification. However, as only two letters are used in HFnet, such a simple algorithm may miss signals for segments that are too short. This was another consideration when selecting the 15-residue window. Fortunately, as structural information was also considered in this work, a 15-residue polypeptide was long enough for HFnet. With a decrease in residue-residue correlation (Liu et al., 2003), a greater window length would cover more secondary factors. However, as there are only two major conformations in protein molecules, the helix and the strand, the donut-arc topological feature should not be remarkably modified.
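To make the window and alphabet choices concrete, the following Python sketch cuts a sequence into overlapping 15-residue segments and reduces each to a two-letter code. It is an illustration only and is not the HFnet implementation: this section states that two letters are used but not which residues fall into each class, so the hydrophobic set below, the letters `h` and `p`, and both function names are assumptions.

```python
from typing import Iterator, Tuple

# ASSUMPTION: an illustrative hydrophobic class. The actual two-letter
# alphabet used by HFnet is not specified in this section.
HYDROPHOBIC = frozenset("AVLIMFWC")


def to_two_letters(seq: str) -> str:
    """Map each residue to 'h' (assumed hydrophobic) or 'p' (everything else)."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in seq.upper())


def windows(seq: str, length: int = 15) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start position, segment) for every window of `length`."""
    for start in range(len(seq) - length + 1):
        yield start + 1, seq[start:start + length]


def encoded_windows(seq: str, length: int = 15) -> Iterator[Tuple[int, str]]:
    """Yield two-letter encodings of all overlapping windows of a sequence."""
    for position, segment in windows(seq, length):
        yield position, to_two_letters(segment)
```

With only two symbols, a segment of length n can take at most 2^n distinct encodings, which gives one way to see why very short windows carry little signal on their own.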

As we have minimized the false signals during network construction, the vertices and connections in the GPR can be used as a framework that reliably represents the universe of polypeptide relationships. The biological properties of a protein can be credibly predicted from such a representation. This will facilitate studies of complex proteins and allow noise-free analysis. Further improvements and applications of this representation are currently being investigated.
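As an illustration of the structural criterion applied during network construction, the Python sketch below computes a distance root-mean-square deviation between two equal-length segments and compares it with a 4 Å cut-off. It assumes that "drms" denotes the usual distance RMS over all residue pairs and that each residue is represented by a single coordinate (for example Cα); both points, and the function names, are assumptions rather than details taken from this section.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def drms(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Distance RMS between two segments of equal length.

    Intra-segment distances are compared pair by pair, so no
    superposition of the two structures is needed.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("segments must have equal length of at least 2")

    squared_sum = 0.0
    n_pairs = 0
    for i, j in combinations(range(len(a)), 2):
        diff = math.dist(a[i], a[j]) - math.dist(b[i], b[j])
        squared_sum += diff * diff
        n_pairs += 1
    return math.sqrt(squared_sum / n_pairs)


def structurally_similar(
    a: Sequence[Point], b: Sequence[Point], cutoff: float = 4.0
) -> bool:
    """True if the two segments fall under the cut-off (in Å)."""
    return drms(a, b) < cutoff
```

Because only internal distances enter the measure, it is unchanged by translating or rotating either segment, which makes it convenient for comparing large numbers of short segments.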

Mammalian prion biology: one century of evolving concepts

Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E

Design of protein conformational switches

Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws: new methods based on directed graphs

A hierarchical approach to protein molecular evolution

Prions and their partners in crime

The nature of the accessible and buried surfaces in proteins

Graphical rules in steady and non-steady enzyme kinetics

Applications of graph theory to enzyme kinetics and protein folding kinetics. Steady and non-steady state systems

Energy-optimized structure of antifreeze protein and its binding mechanism

Modelling extracellular domains of GABA-A receptors: subtypes 1, 2, 3, and 5

Structural bioinformatics and its impact to biomedical science

Graphical rules for enzyme-catalyzed rate laws

Diagrammatization of codon usage in 339 HIV proteins and its biological implication

Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides

Recent progresses in protein subcellular location prediction

Cell-PLoc: a package of web-servers for predicting subcellular localization of proteins in various organisms

Steady-state inhibition kinetics of processive nucleic acid polymerases and nucleases

Protein misfolding and disease; protein refolding and therapy

Prion disease of human and animals, their cause and molecular basis

The community structure of human cellular signaling network

Protein misfolding, evolution and disease

Expanding protein universe and its origin from the biological Big Bang

Natural selection of more designable folds: a mechanism for thermophilic adaptation

Why do protein architectures have Boltzmann-like statistics?

Why genes in pieces?

Why are some protein structures so common?

Automated assembly of protein blocks for database searching

Amino acid substitution matrices from protein blocks

Enlarged representative set of protein structures

Protein-structure comparison by alignment of distance matrices

Mapping the protein universe

An evolutionary treasure, unification of a broad set of amidohydrolases related to urease

The frequency distribution of gene family sizes in complete genomes

Dictionary of protein secondary structure. Pattern recognition of hydrogen-bonded and geometrical features

The structure of the protein universe and genome evolution

Hot spots in prion protein for pathogenic conversion

Kinetic analysis by a recursive rate equation

Emergence of preferred structures in a simple model of protein folding

Are protein folds atypical?

Simplified amino acid alphabets based on deviation of conditional probability from random background

Distances and classification of amino acids for different protein secondary structures

Major factors of protein evolution revealed by eigenvalue decomposition analysis

CLEMAPS: multiple alignment of protein structures based on conformational letters

Microcomputer tools for steady-state enzyme kinetics

From protein structure to function

The complexity and accuracy of discrete state models of protein structure

Unified QSAR approach to antimicrobials. Part 3: First multi-tasking QSAR model for input-coded prediction, structural back-projection, and complex networks clustering of antiprotozoal compounds

Novel proteinaceous infectious particles cause scrapie

New 3D graphical representation of DNA sequence based on dual nucleotides

Protein family and fold occurrence in genomes, power-law behavior and evolutionary model

Early protein evolution: building domains from ligand-binding polypeptide segments

Protein structures sustain evolutionary drift

Natural-like function in artificial WW domains

Protein structure and evolutionary history determine sequence space topology

Signal-3L: a 3-layer approach for predicting signal peptide

Virtual screening for SARS-CoV protease based on KZ7088 pharmacophore points

Finding sequence motifs in groups of functionally related proteins

Evolutionary information for specifying a protein fold

The distribution of structures in evolving protein populations

Advances in structural genomics

Evolutionary aspects of protein structure and folding

A new nucleotide-composition based fingerprint of SARS-CoV with visualization analysis

The state of the prion

Symmetry and the energy landscapes of biomolecules

Fold designability, distribution, and disease

Digital coding of amino acids based on hydrophobic index

A probability cellular automaton model for hepatitis B viral infections

Using cellular automata to generate image representation for biological sequences

An application of gene comparative image for predicting the effect on replication ratio by HBV virus gene missense mutation

Predictions of gene family distributions in microbial genomes, evolution by gene duplication and modification

Analysis of codon usage in 1562 E. coli protein coding sequences

Significant residue features revealed by eigenvalue decomposition analysis of BLOSUM matrices

An extension of Chou's graphical rules for deriving enzyme kinetic equations to system involving parallel reaction pathways

We are grateful to Professors Zeng-Ru Di, Zuo-Bing Wu, and Yuan-Kai Hong, and Drs. Fang-Ting Li and Ming Li for their helpful discussions. To ensure the healthy development of modern biology, a patent has been applied for on the corresponding method. We encourage pure scientific research; please contact the authors if the method is to be used. This work was jointly supported by the National High-tech R&D Program of China (