key: cord-0037108-6u3a9nez
authors: Sun, Zhirong
title: Foundations for the Study of Structure and Function of Proteins
date: 2013-07-01
journal: Basics of Bioinformatics
DOI: 10.1007/978-3-642-38951-1_10
sha: 0466eb738d6198252c87ce53225ccdc8cedbdec1
doc_id: 37108
cord_uid: 6u3a9nez

Proteins are the most abundant biological macromolecules, occurring in all cells and in all parts of cells. They exhibit an enormous diversity of biological function and are the final products of the information pathways. Protein is a major component of protoplasm, which is the basis of life. It is translated from RNA and is composed of amino acids connected by peptide bonds. It participates in a series of complicated chemical reactions that ultimately give rise to the phenomena of life. Proteins are therefore the workhorse molecules and major players in life activity. Biologists focus on the prediction of protein structure and function through the study of primary, secondary, tertiary, and quaternary structures, posttranslational modifications, protein-protein interactions, DNA-protein interactions, and so on.

Two examples illustrate the importance of proteins.

The first one is about the SARS. One protein is found to increase the self-copy efficiency for 100 

In terms of chemical composition, proteins fall into two classes. Proteins composed entirely of amino acids are called simple proteins (for example, insulin); proteins that contain other components are called conjugated proteins (for example, hemoglobin).

According to their symmetry, proteins can be divided into globular proteins and fibrous proteins. Globular proteins are more symmetric and resemble balls or ovals in shape. They dissolve easily and can crystallize. Most proteins are globular. By comparison, fibrous proteins are less symmetric and look like thin rods or fibers. They can be divided into soluble and insoluble fibrous proteins.

Simple proteins can be subdivided into seven subclasses: albumin, globulin, glutelin, prolamine, histone, protamine, and scleroprotein. Conjugated proteins can also be subdivided into nucleoprotein, lipoprotein, glycoprotein and mucoprotein, phosphoprotein, hemoprotein, flavoprotein, and metalloprotein. Different classes of proteins have various functions. These include serving as:

1. Catalysts of metabolism (enzymes)
2. Structural components of organisms
3. Storage components for amino acids
4. Transporters
5. Movement proteins
6. Hormonal proteins
7. Immunological proteins
8. Receptors for the transfer of information
9. Regulatory or control mechanisms for growth, division, and the expression of genetic information

Biological function and biological character are two different concepts. A character can be shown in a single chemical reaction, while the function of a molecule is shown by the whole system in several cooperating reactions. Functions are related to molecular interactions.

The repeating structure on the polypeptide backbone is called the peptide unit, or planar peptide unit. The peptide bond cannot rotate freely because of its partial double-bond character. The bonds on either side of the peptide unit can rotate freely, and their rotations are described by the dihedral angles φ and ψ (Fig. 10.5).

1. An amino acid unit in a peptide chain is called a residue.
2. The end having a free α-amino group is called the amino-terminal or N-terminal.
3. The end having a free α-carboxyl group is called the carboxyl-terminal or C-terminal.
4. By convention, the N-terminal is taken as the beginning of the peptide chain and put at the left (C-terminal at the right). Biosynthesis starts from the N-terminal.


The Electronic Interaction of Biological Molecules

The electronic interaction includes charge-charge interaction, charge-dipole interaction, dipole-dipole interaction, and induced dipole interaction. The dipole moment is $\mu = q\,l$, and the energy of a dipole in an electric field is $u = -\mu \cdot E$.

Charge-dipole interaction (Fig. 10.6)

When the radius vector between the centers of two dipoles is far larger than the length of the dipoles, namely $r \gg l$, the interaction of these two dipoles is:

Neutral molecules or groups whose positive and negative charges overlap will be polarized by an electric field and become induced dipoles. The induced dipole moment is:

$\mu_{ind} = \alpha_{ind} E$

Hydration is the process of the subject interacting or combining with water.

The forces that sustain protein structure are the so-called weak interactions, also known as non-covalent bonds or secondary bonds. They include hydrogen bonds, hydrophobic interactions, electrostatic interactions, and van der Waals forces. Individually each of these interactions is weak, but added together they form a strong force that sustains the spatial structure of the protein.

Under physiological conditions, the side chains of acidic amino acids ionize to form negative ions, while the side chains of basic amino acids ionize to form positive ions. Some atoms form dipoles because of polarization. The interaction forces between these charges or dipoles are called electrostatic forces, and they obey Coulomb's law.
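As a rough illustration of Coulomb's law for two point charges, the sketch below computes an interaction energy in kcal/mol. The conversion constant (about 332 kcal·Å/(mol·e²)), the `dielectric` parameter, the function name `coulomb_energy`, and the example charges and distance are assumptions introduced here for illustration; they are not values given in this chapter.

```python
# Approximate Coulomb constant in kcal*angstrom/(mol*e^2).
# Assumed value of the kind used in molecular mechanics, not from this chapter.
COULOMB_CONSTANT = 332.06


def coulomb_energy(q1, q2, distance_angstrom, dielectric=1.0):
    """Energy (kcal/mol) between two point charges given in units of e."""
    if distance_angstrom <= 0:
        raise ValueError("distance must be positive")
    if dielectric <= 0:
        raise ValueError("dielectric constant must be positive")
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * distance_angstrom)


# Hypothetical ion pair: charges of -1 and +1 separated by 4 angstroms,
# first in vacuum and then in a medium with a water-like dielectric constant.
energy_vacuum = coulomb_energy(-1.0, 1.0, 4.0)
energy_water = coulomb_energy(-1.0, 1.0, 4.0, dielectric=80.0)
```

Opposite charges give a negative (attractive) energy, and a high dielectric constant weakens the interaction by the same factor.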

The van der Waals force can also be called the van der Waals bond. It includes an attractive force and a repulsive force. The van der Waals attractive force is inversely proportional to the sixth power of the distance between atoms or groups. When atoms or groups come too close to each other, they repel each other. The van der Waals bond length is 0.3-0.5 nm, and the bond energy is 1-3 kcal/mol. Although the van der Waals force is weak, it becomes very important when the surfaces of two large molecules are close enough to each other. It helps sustain tertiary and quaternary structure.
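The balance between sixth-power attraction and short-range repulsion can be sketched with a 12-6 Lennard-Jones potential. This is a conventional modeling choice swapped in here for illustration: the twelfth-power repulsion term, the function name `lennard_jones`, and the default well depth (1.0 kcal/mol) and optimal distance (0.4 nm) are assumptions chosen to fall inside the ranges quoted above, not parameters from this chapter.

```python
def lennard_jones(distance_nm, well_depth=1.0, r_min_nm=0.4):
    """12-6 Lennard-Jones energy, in the same units as well_depth.

    The attractive term scales as distance**-6; the repulsive term
    (distance**-12) is an assumed, conventional form.
    """
    if distance_nm <= 0:
        raise ValueError("distance must be positive")
    ratio6 = (r_min_nm / distance_nm) ** 6
    return well_depth * (ratio6 * ratio6 - 2.0 * ratio6)


# Energy at the optimal distance, at a too-close contact, and far apart.
at_minimum = lennard_jones(0.4)
too_close = lennard_jones(0.3)
far_apart = lennard_jones(0.8)
```

The energy is most favorable at `r_min_nm`, becomes repulsive (positive) at closer contact, and decays toward zero at larger separations.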

Protein structures have conventionally been understood at four different levels (Fig. 10.7):

1. The primary structure is the amino acid sequence (including the locations of disulfide bonds).
2. The secondary structure refers to the regular, recurring arrangements of adjacent residues resulting mainly from hydrogen bonding between backbone groups, with α-helices and β-pleated sheets as the two most common ones.
3. The tertiary structure refers to the spatial relationship among all amino acid residues in a polypeptide chain, that is, the complete three-dimensional structure.
4. The quaternary structure refers to the spatial arrangement of the subunits in a multi-subunit protein, including the nature of their contacts.

In a protein solution, if the environment changes (for example, the pH, ionic strength, or temperature), the native structure of the protein may disintegrate. This process is called protein denaturation. If a denatured protein recovers its native structure and properties when normal conditions are restored, the protein is said to renature. Making bean curd by heating a solution of bean protein and adding a little salt is an example of using denaturation to precipitate protein.

According to the classical view, the primary structure of a protein determines its higher-level structure, so the higher-level structure can be inferred from the primary structure. We can align multiple protein sequences.

In the 1980s, when sequences started to accumulate, several labs saw advantages in establishing central repositories. The trouble is that many labs thought the same and made their own. The proliferation of databases causes problems. For example, do they have the same format? Which one is the most accurate, up-to-date, and comprehensive? Which one should we use?

The local organization of the protein backbone consists of the α-helix, the β-strand (which assembles into β-sheets), the turn, and the interconnecting loop.

The α-helix has the shape of a rod. The tightly coiled polypeptide backbone forms the inner part of the rod, and the side chains extend outward in a helical array. The α-helix is stable because the hydrogen of each N-H group forms a hydrogen bond with the C=O oxygen of the fourth residue along the chain. Each turn of the helix contains 3.6 residues, and the pitch of the helix is 0.54 nm.

The β-sheet is another frequently occurring structure. Two or more fully extended polypeptide segments pack together laterally, and hydrogen bonds form between the N-H and C=O groups on neighboring peptide backbones. In β-sheets, all peptide groups take part in this hydrogen-bond cross-linking. The hydrogen bonds are almost perpendicular to the long axis of the peptide chains. Along the long axis of the peptide chain, there are repeating units.

β-Sheets are of two types. One is the parallel sheet, in which the polarity (N to C) of the peptide chains is unidirectional: the N-terminal ends of all the chains point in the same direction. The other is the antiparallel sheet, in which neighboring chains have opposite polarity.

In the backbone of a polypeptide chain, structures that differ from the α-helix and β-sheet are called random coil, meaning irregular peptide chain. Most globular proteins contain a large amount of random coil besides α-helix and β-sheet. Within random coils, the β-turn is a very important structure.

The β-turn can also be called reverse turn, β-bend, or hairpin structure. It is composed of four successive amino acids. In this structure, the backbone folds back by 180°. The oxygen on C=O of the first residue and hydrogen on the N-H

Two or several secondary structure units connected by connecting peptides can form special space structures. They are called protein supersecondary structures.

Protein databases (PDB)

1. Analysis of main-chain conformations in known protein structures.
2. Of 12,318 residues from 84 proteins, 5,712 fell outside the regions of regular structure.
3. Torsion angles (φ, ψ) for the 5,712 residues were calculated and allocated to seven classes in the Ramachandran plot: a, b, e, g, l, p, t → a, b, e, l, t.
4. H, E. Sequence and conformational data were stored for three successive residues in each of the two elements of secondary structure on either side of the connecting peptide, for example, HHH abl EEE.
5. Classification of the pattern and conformation for supersecondary structure motifs.

The α-α-hairpin is made up of two nearly antiparallel α-helices connected by a short peptide, usually composed of 1-5 amino acids. The β-β-hairpin is made up of two antiparallel β-strands connected by a short peptide, also usually composed of 1-5 amino acids.

The α-α-corner is made up of α-helices on two different planes connected by a connecting peptide. The angle between the axes of these two α-helices is nearly a right angle.

The α-β-arch is made up of an α-helix and a β-strand connected by a short peptide. The most frequently occurring α-β-structure is composed of three parallel β-strands and two α-helices. This structure is called the Rossmann fold.

There are mainly three characteristic descriptions of supersecondary structures: sequence pattern, hydrophobic pattern, and H-bond pattern.

In protein structures, many basic supersecondary structure motifs combine into more complicated motifs, which are called complicated supersecondary structures.

The commonly occurring complicated supersecondary structures include the Rossmann fold (Fig. 10.13a), the Greek key topology (Fig. 10.13b), and the four-helix bundle (Fig. 10.13c).

Polypeptide chains fold further through non-covalent interactions and curl into a more complicated configuration, which is called tertiary structure.

In larger protein molecules, polypeptide chains are often composed of two or more independent three-dimensional entities. These entities are called domains.

According to the amount of α-helix and β-sheet, proteins can be divided into four types: α-protein, β-protein, α + β-protein, and α/β-protein (Fig. 10.16).

α-Protein contains more than 40 % α-helix and less than 10 % β-sheet (Fig. 10.17a). β-Protein contains more than 40 % β-sheet and less than 10 % α-helix (Fig. 10.17b). α + β-Protein contains more than 10 % of both α-helix and β-sheet, with the α-helices and β-sheets clustered in different regions. α/β-Protein (Fig. 10.17c) also contains more than 10 % of both α-helix and β-sheet, but the two configurations alternate along the peptide chain. In different α/β-proteins (Fig. 10.17d), the two configurations are arranged face to face. The shape of the whole molecule varies a lot.
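The percentage thresholds above can be written as a small decision rule. This is only a sketch: the function name `structural_class` and the returned labels are hypothetical, and because α + β and α/β proteins differ in how the elements are arranged rather than in their content, the rule cannot separate those two classes from percentages alone.

```python
def structural_class(helix_percent, sheet_percent):
    """Assign a structural class from secondary structure content (percent)."""
    if not (0 <= helix_percent <= 100 and 0 <= sheet_percent <= 100):
        raise ValueError("percentages must lie between 0 and 100")
    if helix_percent > 40 and sheet_percent < 10:
        return "alpha"
    if sheet_percent > 40 and helix_percent < 10:
        return "beta"
    if helix_percent > 10 and sheet_percent > 10:
        # Content alone cannot tell segregated (alpha+beta) from
        # alternating (alpha/beta) arrangements.
        return "alpha+beta or alpha/beta"
    return "unclassified"


# Hypothetical content values for three proteins.
examples = [structural_class(60, 5), structural_class(5, 50), structural_class(30, 25)]
```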

The spatial arrangement of subunits in a protein that contains two or more polypeptide chains is called quaternary structure. It often involves symmetry, but does not have to. Subunits form quaternary structure through hydrophobic interactions, hydrogen bonds, and van der Waals forces. In most oligomeric proteins the number of subunits is even, there are usually one or two types of subunits, and the arrangement is symmetric. Some globular proteins contain two or more polypeptide chains that interact with each other, each with its own tertiary structure. These polypeptide chains are the subunits of the protein. From the structural point of view, the subunit is the smallest covalent unit of a protein. Proteins assembled from subunits are called oligomeric proteins. The subunit is the functional unit of oligomeric proteins.

The hierarchy of structural classification (Fig. 10.18):

• Class: similar secondary structure content (all α, all β, α + β, α/β, etc.)

• Homologous family: evolutionarily related, with a significant sequence identity
• Superfamily: different families whose structural and functional features suggest a common evolutionary origin
• Folds: different superfamilies having the same major secondary structures in the same arrangement and with the same topological connections (energetics favoring certain packing arrangements)
• Class: secondary structure composition

Proteins have a variety of movements. Movement and structure are the basic elements of protein function. Protein movement includes short-time, small-amplitude movement; medium-time, medium-amplitude movement; and long-time, large-amplitude movement (Fig. 10.19).

Five schemes of protein three-dimensional structures:

1. The three-dimensional structure of a protein is determined by its amino acid sequence.
2. The function of a protein depends on its structure.
3. An isolated protein has a unique or nearly unique structure.
4. The most important forces stabilizing the specific structure of a protein are non-covalent interactions.
5. Amid the huge number of unique protein structures, we can recognize some common structural patterns to improve our understanding of protein architecture.

In the following part, we are going to talk about the comparative modeling, inverse folding, ab initio, secondary structure prediction, supersecondary structure prediction, structure-type prediction, and tertiary structure prediction.

The development of life science research shows that the folding mechanism of the protein peptide chain is one of the most important problems to be solved. How a protein folds from its primary structure into the active native tertiary structure remains to be answered. Elucidating the peptide chain folding mechanism is called decoding the second biological code.

As genome sequencing projects for human and other species start and finish, the capacity of databases collecting protein sequences (e.g., SWISS-PROT) increases exponentially. Meanwhile, the capacity of databases collecting protein tertiary crystal structures (e.g., PDB) increases slowly. The number of protein sequences grows much faster than the number of known protein structures, so we need computational predictive tools to narrow the widening gap.

In the post-genome era, one of the biggest challenges we face is to discover the structure and function of every protein in the genome projects. Predicting protein structure theoretically therefore becomes one way to decrease the disparity between protein structure and sequence.

Why should we predict secondary structure? Because it is an easier problem than 3D structure prediction (more than 40 years of history), and accurate secondary structure prediction provides important information for tertiary structure prediction. It has been 30 years since the first secondary structure prediction work by Chou and Fasman, whose accuracy is around 60 %. Since the 1990s, several machine learning algorithms have been successfully applied to the prediction of protein secondary structure, and the accuracy has reached 70 %. This shows that a good method can improve the prediction result significantly. Prediction methods include statistical methods (Chou-Fasman method, GOR I-IV), nearest-neighbor methods (NNSSP, SSPAL, fuzzy-logic-based method), neural networks (PHD (Fig. 10.20), Psi-Pred, J-Pred), support vector machines (SVM), and HMM.

There is much research in this field. V. Vapnik [1] developed a promising learning theory, Statistical Learning Theory (SLT), based on an analysis of the nature of machine learning. The support vector machine (SVM) is an efficient implementation of SLT. SVM has been successfully applied to a wide range of pattern recognition problems, including isolated handwritten digit recognition, object recognition, speaker identification, and text categorization.

Fig. 10.21 The linearly separable case

For the linearly separable case (Fig. 10.21), the SVM looks for the unique separating hyperplane that maximizes the margin between the vectors of the two classes. This hyperplane is called the Optimal Separating Hyperplane (OSH) (Fig. 10.22).

Introducing Lagrange multipliers and using the Karush-Kuhn-Tucker (KKT) conditions and the Wolfe dual theorem of optimization theory, the SVM training procedure amounts to solving a convex quadratic programming problem. The solution is a unique, globally optimal result, which can be shown to have an expansion (Fig. 10.23):

When an SVM is trained, the decision function can be written as:

For the linearly non-separable case, the SVM performs a nonlinear mapping of the input vectors from the input space $R^d$ into a high-dimensional feature space H, and the mapping is determined by a kernel function. Then, as in the linearly separable case, it finds the OSH in the higher-dimensional feature space H.

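The convex quadratic programming problem:

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j)$$

subject to

$$0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N}\alpha_i y_i = 0, \qquad i = 1, 2, \ldots, N$$

The decision function:

$$f(\vec{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N} y_i \alpha_i K(\vec{x}, \vec{x}_i) + b\right)$$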
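A minimal sketch of how a kernel decision function of the standard form $f(\vec{x}) = \mathrm{sgn}(\sum_i y_i \alpha_i K(\vec{x}, \vec{x}_i) + b)$ could be evaluated once training has produced the multipliers. The radial basis function kernel, the default `gamma`, the function names, and the two-point "trained model" are all assumptions made for illustration; no quadratic programming solver is included.

```python
import math


def rbf_kernel(x, z, gamma=0.10):
    """Radial basis function kernel exp(-gamma * |x - z|^2) (assumed choice)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Signed score: sum_i y_i * alpha_i * K(x, x_i) + b."""
    score = sum(
        y * alpha * kernel(x, sv)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return score + bias


def classify(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Return +1 or -1 according to the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, bias, kernel)
    return 1 if value >= 0 else -1


# Hypothetical trained model with one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0

near_positive = classify((2.0, 2.0), support_vectors, labels, alphas, bias)
near_negative = classify((0.0, 0.0), support_vectors, labels, alphas, bias)
```

Only the support vectors (training points with nonzero multipliers) enter the sum, which is why the trained model can be stored compactly.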

The problem of risk minimization: given a set of functions $f(\vec{x}, \alpha)$, the goal is to find an optimal function $f(\vec{x}, \alpha)$ which minimizes the expected risk (or the actual risk) (Fig. 10.24).

Here $L(f(\vec{x}, \alpha), y)$ is the loss function. For this case one simple form is

The risk functional $R(\alpha)$ is replaced by the so-called empirical risk function constructed on the basis of the training set:

The bound of generalization ability of learning machine (Vapnik & Chervonenkis):

Here, N is the size of the training set; h, the VC dimension, is the measure of the capacity of the learning machine; and $\Phi(N/h)$ is the confidence interval. When N/h is larger, the confidence interval is smaller (Fig. 10.25).

$$Q_3 = \frac{\text{Number of residues correctly predicted}}{\text{Number of all residues}} \times 100$$
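The three-state accuracy $Q_3$ is a per-residue percentage, which can be sketched as follows. The function name `q3_accuracy` and the example strings are hypothetical; the states are written with the letters H, E, and C.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(observed)


# Hypothetical prediction with one wrong residue out of eight.
accuracy = q3_accuracy("HHHEECCC", "HHHEECCH")
```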

For the case of a single sequence, each residue is coded by an orthogonal binary vector (1, 0, ..., 0) or (0, 1, ..., 0). The vector is 21-dimensional. If the window length is l, the dimensionality of the feature vector (or the sample space) is 21 × l.
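The orthogonal coding with a sliding window can be sketched as below. The chapter does not say what the 21st vector component stands for; this sketch assumes it marks window positions that fall beyond the chain ends (and any unrecognized residue symbol). The names `ALPHABET`, `one_hot`, and `encode_window` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: the 21st symbol "-" marks positions outside the sequence.
ALPHABET = AMINO_ACIDS + "-"


def one_hot(symbol):
    """Return a 21-dimensional orthogonal binary vector for one symbol."""
    vector = [0] * len(ALPHABET)
    index = ALPHABET.find(symbol)
    if index < 0:
        index = len(ALPHABET) - 1  # unknown symbols share the padding slot
    vector[index] = 1
    return vector


def encode_window(sequence, center, window_length):
    """Concatenate one-hot vectors for a window centered on `center`."""
    if window_length % 2 == 0:
        raise ValueError("window length must be odd")
    half = window_length // 2
    features = []
    for position in range(center - half, center + half + 1):
        inside = 0 <= position < len(sequence)
        features.extend(one_hot(sequence[position] if inside else "-"))
    return features


# Window of length 5 centered on the first residue of a hypothetical sequence.
features = encode_window("ACDEFG", 0, 5)
```

With a window length of 5 the feature vector has 21 × 5 = 105 components, exactly one of which is set per window position.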

When we include the evolutionary information, for each residue the frequency of occurrence of each of the 20 amino acids at one position in the alignment is computed.
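A minimal sketch of the per-position frequency computation, assuming the alignment is supplied as equal-length strings and that gap or unknown characters are simply skipped (the chapter does not say how gaps are treated). The function name `column_frequencies` and the toy alignment are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, column):
    """Frequency of each of the 20 amino acids at one alignment column."""
    counts = {residue: 0 for residue in AMINO_ACIDS}
    total = 0
    for sequence in alignment:
        residue = sequence[column]
        if residue in counts:  # gaps and unknown symbols are ignored
            counts[residue] += 1
            total += 1
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[residue] / total for residue in AMINO_ACIDS]


# Hypothetical four-sequence alignment with one gap.
alignment = ["ACD", "ACE", "GCD", "A-D"]
first_column = column_frequencies(alignment, 0)
second_column = column_frequencies(alignment, 1)
```

Each residue position is then described by 20 frequencies instead of a single binary indicator, which is how evolutionary information enters the feature vector.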

We design six binary classifiers (SVMs) as follows:


The selection of the optimal kernel function and the parameters:

We set the optimal $\gamma = 0.10$.

Homology modeling is a knowledge-based approach to protein structure prediction. Methods of this kind rely on the evolutionary conservation of protein structure and sequence: they use the structures of known proteins to build the structures of unknown homologous proteins. They are the most mature protein structure prediction methods so far. When the homology is high, the prediction results are reliable. In a whole genome, only about 20-30 % of sequences can be predicted using these methods. One difficulty in homology modeling is predicting the loop regions on the protein surface, because these regions are very flexible. Because the loop region is usually the active part of the protein, predicting its structure is quite important for protein structure modeling. Protein homology modeling includes:

1. Matching the target protein sequence to the template sequence
2. Building a model of the target protein structure according to the template structure
3. Modeling the conserved regions in the target protein
4. Modeling the SCR backbone
5. Predicting the side chain structure
6. Optimizing and evaluating the modeled structure

The threading (or inverse folding) method can be used to predict structure without homology information. The basic assumption is that the number of folding types of natural proteins is limited. We can therefore align the sequence of a protein of unknown structure against proteins whose structures are known and base the prediction on the best alignment. This method cannot correctly predict new types of proteins. Threading works by summarizing known independent protein structure patterns as models for the unknown structure, and then learning from databases of known structures an average potential function that can distinguish correct from incorrect alignments. In this way we can find the best alignment.

Protein sequence threading:

1. Based on empirical methods. Build various potential functions by analyzing proteins of known structure, and use the criterion of the lowest folding configuration to guide the threading of the target protein sequence onto known structures.
2. Based on the 3D profile. Predict the spatial structure of a sequence by building a 3D profile, using dynamic programming to compare the new sequence with those in profile databases, and seeking the optimal alignment.

Research on protein secondary structure prediction has developed for more than three decades. In terms of the progression of methods, there are three periods: first, statistical prediction based on single residues; second, statistical prediction based on sequence segments; and third, statistical prediction combined with evolutionary information.

Rost and Sander (1993) proposed a prediction method based on neural networks, PHD (Profile fed neural network systems from HeiDelberg). It was the first method with a prediction accuracy over 70 %, the first efficient method to bring in evolutionary information, and one of the most accurate methods so far.

PHD is a complicated method based on neural networks that incorporates multiple-sequence information. More recently, Cuff and Barton combined several good secondary structure prediction methods, such as DSC, NNSSP, PREDATOR, and PHD. There are also other artificial intelligence methods for predicting secondary structure, such as expert systems and the nearest-neighbor method.

The present is a good opportunity for protein secondary structure prediction. For one thing, structural genomics projects are being carried out throughout the world to speed up the determination of protein structures and fold types. For another, the field of machine learning is developing fast. For example, in the past 2 years, the establishment and refinement of V. Vapnik's well-known statistical learning theory has made it possible to use the latest machine learning methods to improve the prediction accuracy of secondary structure.

Our paper published in JMB (J. Mol. Biol.) applied SVM to predict protein secondary structure and achieved an accuracy of 76.2 %.

$P_{ij} = f_{ij} / f_j$, where

• j: configuration
• i: one of the twenty amino acids
• $f_j$: fraction of the jth configuration, $f_j = N_j / N_t$
• $f_{ij}$: fraction of the jth configuration for the ith amino acid residue, $f_{ij} = n_{ij} / N_i$
• $n_{ij}$: the total appearance of a residue in a certain configuration
• $N_i$: the total number of a residue in the statistical samples
• $N_t$: the total number of residues in the statistical samples

2. The propensity of folding-type related secondary structure
(a) Protein folding type: all α, all β, α + β, and α/β
(b) Analysis of secondary structure propensity: α-helix propensity factor $P_\alpha$, β-sheet propensity factor $P_\beta$, and irregular coil propensity factor $P_C$

3. Chou-Fasman method
(a) α-Helix rule: In a protein sequence, an α-helix kernel forms where at least four of six neighboring residues tend to form α-helix. The kernel extends in both directions until the average α-helix propensity factor of the polypeptide segment falls to $P_\alpha < 1.0$. Lastly, three residues are dropped at each end of the α-helix. If the remaining part is longer than six residues and $P_\alpha > 1.03$, it is predicted as helix.
(b) β-Sheet folding rule: If three of five residues tend to form β-sheet, the segment is taken as the folding kernel. The kernel extends in both directions until the average propensity of the polypeptide segment falls to $P_\beta < 1.0$. Lastly, two residues are discarded from each end; if the remaining part is longer than four residues and $P_\beta > 1.05$, it is predicted as β-sheet.
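The propensity factor and the helix-kernel search can be sketched as follows. This is an illustration under stated assumptions, not the published Chou-Fasman implementation: the function names, the toy training data, and the toy propensity table are hypothetical, and a residue is assumed to "tend to form α-helix" when its helix propensity exceeds 1.0. Extension and end trimming of kernels are not implemented.

```python
from collections import Counter


def propensities(sequences, structures):
    """P_ij = f_ij / f_j with f_ij = n_ij / N_i and f_j = N_j / N_t."""
    n_ij, n_i, n_j = Counter(), Counter(), Counter()
    n_t = 0
    for sequence, states in zip(sequences, structures):
        if len(sequence) != len(states):
            raise ValueError("sequence and structure lengths differ")
        for residue, state in zip(sequence, states):
            n_ij[(residue, state)] += 1
            n_i[residue] += 1
            n_j[state] += 1
            n_t += 1
    return {
        (residue, state): (count / n_i[residue]) / (n_j[state] / n_t)
        for (residue, state), count in n_ij.items()
    }


def find_helix_kernels(sequence, helix_propensity, window=6, min_formers=4):
    """Start indices of windows with at least min_formers helix formers."""
    starts = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        # Assumption: a helix former is a residue with propensity above 1.0.
        formers = sum(1 for r in segment if helix_propensity.get(r, 0.0) > 1.0)
        if formers >= min_formers:
            starts.append(start)
    return starts


# Toy statistics from one hypothetical chain (H = helix, C = coil).
table = propensities(["AAGG"], ["HHHC"])

# Toy helix propensities, invented for this example.
toy_helix_propensity = {"A": 1.4, "E": 1.5, "G": 0.5, "P": 0.5}
kernels = find_helix_kernels("GGAEAEGG", toy_helix_propensity)
```

A propensity above 1 means the residue occurs in that configuration more often than an average residue does.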

The Nature of Statistical Learning Theory

The data sets

Two nonhomologous data sets:
1. The RS126 set (percentage identity, 25 %)
2. The CB513 set (the SD (or Z) score, 5)

We exclude entries if:
1. They are not determined by X-ray diffraction.
2. The program DSSP could not produce an output.
3. The protein had physical chain breaks.
4. They had a resolution worse than 0.19 nm.

Automatic assignment of secondary structure to experimentally determined 3D structures is now usually performed by DSSP, STRIDE, or DEFINE. Here we concentrate exclusively on the DSSP assignments, which distinguish eight secondary structure classes: H (α-helix), G (3₁₀-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and blank (the rest).

We reduce the eight classes to three states, helix (H), sheet (E), and coil (C), according to two different methods:
1. DSSP: H, G, and I to H; E to E; and all other states to C
2. DSSP: H and G to H; E and B to E; and all other states to C
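The two reduction schemes amount to lookup tables from the eight DSSP classes to three states. In this sketch the names `REDUCTION_1`, `REDUCTION_2`, and `reduce_states` are hypothetical, and any symbol not listed in a table (including a blank) is treated as coil.

```python
# Method 1: H, G, and I to H; E to E; everything else to C.
REDUCTION_1 = {"H": "H", "G": "H", "I": "H", "E": "E"}

# Method 2: H and G to H; E and B to E; everything else to C.
REDUCTION_2 = {"H": "H", "G": "H", "E": "E", "B": "E"}


def reduce_states(dssp_string, mapping):
    """Map a string of DSSP classes to the three states H, E, and C."""
    return "".join(mapping.get(state, "C") for state in dssp_string)


# One residue of each DSSP class, ending with a blank for "the rest".
eight_states = "HGIEBTS "
reduced_1 = reduce_states(eight_states, REDUCTION_1)
reduced_2 = reduce_states(eight_states, REDUCTION_2)
```

The two methods differ only in how they treat I (π-helix) and B (isolated β-bridge), which can shift the reported three-state accuracy for the same predictor.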

Cross-validation trials are necessary to minimize variation in results caused by a particular choice of training or test sets. A full jackknife test is not feasible, especially on the CB513 set, because of limited computation power. We use sevenfold cross-validation on both sets.

Assembly of the binary classifiers:
1. SVM_MAX_D: We combined the three one-versus-rest classifiers (H/~H, E/~E, and C/~C) to handle the multiclass case. The class (H, E, or C) for a testing sample was assigned as that corresponding to the largest positive distance to the OSH.
2. SVM_TREE (Fig. 10.26)
3. SVM_NN (Fig. 10.27)

The tertiary classifiers we designed: