key: cord-319291-6l688krc
authors: Hung, Chun-Min; Huang, Yueh-Min; Chang, Ming-Shi
title: Alignment using genetic programming with causal trees for identification of protein functions
date: 2006-09-01
journal: Nonlinear Anal Theory Methods Appl
DOI: 10.1016/j.na.2005.09.048
sha: 
doc_id: 319291
cord_uid: 6l688krc

A hybrid evolutionary model is used to propose a hierarchical homology of protein sequences to identify protein functions systematically. The proposed model offers considerable potentials, considering the inconsistency of existing methods for predicting novel proteins. Because some novel proteins might align without meaningful conserved domains, maximizing the score of sequence alignment is not the best criterion for predicting protein functions. This work presents a decision model that can minimize the cost of making a decision for predicting protein functions using the hierarchical homologies. Particularly, the model has three characteristics: (i) it is a hybrid evolutionary model with multiple fitness functions that uses genetic programming to predict protein functions on a distantly related protein family, (ii) it incorporates modified robust point matching to accurately compare all feature points using the moment invariant and thin-plate spline theorems, and (iii) the hierarchical homologies holding up a novel protein sequence in the form of a causal tree can effectively demonstrate the relationship between proteins. This work describes the comparisons of nucleocapsid proteins from the putative polyprotein SARS virus and other coronaviruses in other hosts using the model.

Identification of protein function is the main theme of the post-genome era. Recently, biomedical scientists have been striving to study proteomics and explore the therapeutic potential of genes for curing disease. Thus, a powerful and integrated methodology is required for predicting novel protein function. During the past decade, many algorithms applied to computational biology have been used to solve several of the above problems, with the most frequently used methods including dynamic programming [1] , the hidden Markov model [2, 3] , the Bayesian theorem, and probabilistic modeling [4] [5] [6] . The Journal of Molecular Biology contains an introduction to hidden Markov models for biological sequences written by Krogh [7] . As a statistical model, the hidden Markov model (HMM) is appropriate for many tasks in molecular biology. A profile-like HMM architecture can be used for searching against sequence databases for new family members and also can make multiple alignments of protein families. Moreover, the latest review by Tsakonas and Dounias [8] indicates that artificial intelligence computational methodologies have been widely applied in bioinformatics, for example genetic algorithms [9] , genetic programming [10] , neural networks [11, 12] , fuzzy logic [13, 14] , and some classification and clustering techniques used in data mining [15] . Design of automatic models for comparing protein sequences that are homologous to a known family or super-family has become increasingly important. One standard approach to solving this problem uses 'profiles' of position-specific scoring measure estimated using multiple sequence alignment (MSA) [16] . In fact, a potential member of a distantly related protein family cannot be identified with the other members in the same family using only the primary protein structures. Consequently, a new protein, such as the case of the SARS virus, with numerous unknown functions cannot be accurately predicted by modeling a learning model. This study applies a hybrid methodology based on genetic programming with a causal tree [4, 28, 31] model to predicting protein function. The causal tree [28, 31] with a cause-effect form is proposed for locally comparing protein sequences across each pairwise alignment of MSA. The clues to this proposal are that comparisons of distantly related sequences may carry more sensitive and accurate messages regarding protein function if given a segmented alignment using the more comprehensive depiction of the protein family. Additionally, the probabilistic scoring measures, similar to profile-to-profile comparisons [17] , can be used to achieve one objective of genetic programming, namely aligning the hierarchical homologies based on profile comparisons. The resulting output is capable of function-related representation for further linkages of protein function. Furthermore, the proposed model can generate a tree structure output with a special aligned format rather than a traditional local or global alignment format such as CLUSTALW [23] . Traditionally, a local alignment based on the Smith-Waterman algorithm [7] has been used to compare sequences with gaps using various scoring systems. Using scoring systems, many methods are used to evaluate the similarity of two profile columns, for example the FFAS method [18] , and prof sim [19] . The FFAS method can calculate the correlation coefficients and thus score the 'dot-product' scores of the amino acid frequencies in the two columns. Meanwhile, the prof sim method utilizes a single similarity score combined with the divergence score and the significance score of the probability distributions in two columns. Recently, the method COMPASS [17, 20] generated the occurrence probability of target residue in one profile column based on the other column of given the target residue frequencies to construct the local profile-to-profile alignment. However, the proposed model was designed to detect the potential functions of a novel protein with a hierarchical homology structure. The hybrid model, namely Alignment using Genetic programming with Causal Tree (AGCT), is a heuristic evolutionary method that contains three basic components: (i) genetic programming with innerexchanged individual strategy, (ii) causal trees [4, 28, 31] with probabilistic reasoning, and (iii) construction of hierarchical homologies with local block-to-block alignment using the methods of moment invariant and robust points matching (RPM) [24] . Locally, a standard construction of an alignment requires at least a two-step design: the first step is a scoring evaluation for similarity between two given positions, while the second step is an alignment extension that considers the gap penalties and an extension algorithm. The AGCT model applies two modified steps to an alignment with a standard construction. The first step is to change the compared objects from positions to blocks that initialize extents based on a particular signal wave, while the second step is to iteratively adjust the boundaries using trimming local fragments rather than the extension algorithm. Actually, in BLAST [21] and its successors [22] , the effective local extension is displayed for the cases of sequence-to-sequence and sequence-to-profile comparison. Similarly, in AGCT, the approach to a local block-to-block comparison modified from profile-to-profile alignment described in [17] shifts the leftmost and rightmost boundaries in relation to the parent segments when high local alignment scores are achieved. Notably, a detailed and refined procedure of alignment such as the Smith-Waterman algorithm [7] is also used in a part of the proposed model for confirming a meaningful fragment detected, because a fragment with a short length reduces the computation time and only some individuals in the population have to be compared making the combination of heuristic and exhaustive search models feasible.

The proposed AGCT model is an evolutionary method based on a probabilistic inference model. The proposed model also incorporates a modified RPM [24] method and a local sequence alignment into the probabilistic model. Meanwhile, the model uses a moment invariant theorem [32] for pattern recognitions to extract the feature-nodes from some fragmented signal curves. The RPM compares feature points to precisely construct a soft match for two sets of points. Moreover, the modification of RPM changes the original transformation between matrices to the current transformation between a matrix and a tree. For achieving this matrix-to-tree transformation, the proposed constraints, namely rooted one-way winner-take-all (RO-WTA), resembling the two-way winner-take-all (WTA) constraints [27] , are also used. The proposed RO-WTA method is a WTA method that initially takes a unique root node by applying WTA to the slacked matrix column that is involved for the null point correspondences. Then WTA is applied to all rows to ensure each row only produces a corresponding feature-node in a causal tree, and finally at most three nodes are taken and used as the branches of the parent node using the first three high values in the same column. Moreover, the tree-to-matrix transformation simply visits a tree in preorder to derive a set of pairwise points. Finally, an equation, Eq. (28), is applied to yield a correspondence matrix for optimizing the RPM.

A traditional MSA method for protein classification compares more than two sequences to provide the aligned sequences for the observation of conserved domains. However, in a distantly related protein family, the data provides neither meaningful knowledge of biochemical function nor a consistent outcome when using different MSA algorithms. Clearly, the MSA comparisons would be worst if it is applied with the distantly related protein family members. In this model, a small protein fragment transformed into a feature-node would be a compared target point with spatially localized features. Individual feature-nodes simplified by six moment invariants [32] are matched with each other for the signal registration of 2-D curves [25, 30] . The 2-D curve is depicted using a block-to-block method that can transform a fragmenting sequence of protein residues into a signal curve by iteratively accumulating a block of amino acid properties. The signal curves are plotted by accumulating hydrophobicity against protein sequence here. Because the problem has been transformed into the problem of comparing feature-nodes (points), the comparison of spatially localized features for protein function can be considered an optimization problem of point match for the 2-D shape. In fact, the optimization problem is a difficult problem. Especially, the mapping problem must also consider the rigid and non-rigid deformations. A hybrid technique, combining genetic programming and the causal tree, is ideal for solving such difficult problems. Genetic programming has emerged from the evolutionary based systems [26, [40] [41] [42] , and the causal tree is derived from probabilistic reasoning in intelligent systems [4] . Since the genetic programming can avoid local minima of an energy function, the model is appropriate for recognizing a potential pattern which is never found via direct comparisons.

This model uses a popular technique for synthesizing a special signal indicating specific protein functions by accumulating the properties of a group of protein residues. Comparing these special signals with each other can provide a wider window for viewing the world of the protein.

Essentially, the signals must be decomposed for comparison using a method that can prevent a suitable waveform from destroying the protein signals. In the implementation, a technique of wavelet reconstruction for signals [29] is used to depict a noiseless signal curve skeleton. Next, the reconstructed curve is differentiated with respect to the positions of each protein residue, and let its results equal zero. The technique thus can easily identify some positions with local minimum accumulating properties. Subsequently, to preventing the generation of less than 40 residues, the method would discard some of the neighbor positions around the minima. Next, all of the fragmented signals are used to provide an initial dataset for the input of evolutionary computation. Although the local minimum values can not determine the true boundaries of the signals at the beginning step, the local alignment afterwards for fragmented proteins using the Smith-Waterman algorithm [7] would iteratively regularize the boundaries until the suboptimal boundaries are found. Next, the model must further extract signal features from the fragmented input signals after decomposing the original signals. Technically, the vector features can be associated with the signal features. However, the vector value representing a signal must be invariant when the physical linear-transforms deform the signal curve (shape), such as move, rescale, stretching, and rotation. Once the problem is transformed into the image and signal processing domains, the RPM method is easily applied to protein function identification. Conceptually, this proposed model includes two stages for feature matching. The first stage is designed to obtain the initial signal features for the signal registration [25, 30] . During the second stage, an inner-exchanged genetic programming is used to repeatedly refine the RPM parameters.

This work utilizes the theorem of two-dimensional moment invariants [32] for planar geometric figures from the theorem of moment spaces [33] to extract features of the protein fragments. Accumulating hydrophobicity of surrounding residues for each position of protein enables the protein fragments to generate signal curves. Next, the signal curve further reduces to a fixed length vector, denoting a feature node, using moment invariants for matching the nodes. Next, one of the energy functions in the model introduces the energy function of a correspondence algorithm of non-rigid mapping [24] derived from the TPS theorem [34] . Finally, genetic programming with the inner-exchanged subpopulation strategy is used to refine the comparison results.

The theorem of moment invariants, which is extensively used in computer science, was first proposed in [32] for recognizing visual patterns. In the visual field, an essential step must extract patterns for recognizing visual objects from the original objects. The extracted patterns must also be independent of position, size and orientation. Likewise, the model uses the moment invariants of the signal curves to substantially reduce the computational complexity. Because the flexibility of the moment invariants used for recognizing protein function was such that the model could find more domains of conservation, the domains should be able to attain more biological meaning by this moment invariant based comparison. The following subsections detail how the model obtains a feature vector denoting a feature-node in a causal tree by using the moment invariant theorem for signal recognition.

Riemann integrals [32] define two-dimensional (k + l)th order moments for a density distribution f (x, y) as follows:

where f (x, y) denotes a piecewise continuous bounded function, and has nonzero values if and only if it is on the finite part of the x y plane. Vice versa, the f (x, y) may derive the unique double moment sequence {ṁ k,l } in the moment of any order.

By directly integrating the double moment sequence, Eq. (2) can express a central moment µ k,l derived from the ordinary moments.

where x =ṁ 1,0 /ṁ 0,0 and y =ṁ 0,1 /ṁ 0,0 is an expected value of x and y, respectively. The point (x, y) is termed a center of gravity or center of centroid for x and y. For example, the first four orders have µ 0,0 =ṁ 0,0 ≡ µ, µ 1,0 = µ 0,1 = 0, µ 2,0 =ṁ 2,0 − µx 2 , µ 1,1 =ṁ 1,1 − µx y, µ 0,2 =ṁ 0,2 − µy 2 , µ 3,0 =ṁ 3,0 − 3ṁ 2,0 x + 2µx 3 , µ 2,1 =ṁ 2,1 −ṁ 2,0 y − 2ṁ 1,1 x + 2µx 2 y, µ 1,2 =ṁ 1,2 −ṁ 0,2 x − 2ṁ 1,1 y + 2µx y 2 , µ 0,3 =ṁ 0,3 − 3ṁ 0,2 y + 2µy 3 .

In a similar mathematical sense, a measure of area for a two-value image b(m, n) with (k+l)th order moment can approximately obtain Eq. (4) from Eq. (1) .

Moreover, the corresponding central moment is expressed as Eq. (5):

M 0,0 ,M andN are the values for the area length and width, respectively.

A pattern of the two-value image which must be independent of position, size, and rotation is derived from similitude moment invariants. Likewise, the function b(m, n) with respect to a pair of fixed axes can display the signal curve of protein sequences. Moreover, the two-dimensional central moments µ k,l expresses image patterns. Additionally, numerous properties of the second central moment are equivalent to a covariance matrix in probability fields. Therefore, the covariance matrix for the second central moment is expressed as follows:

By the diagonalization, Eq. (7) is obtained as follows:

where E = e 1,1 e 1,2 e 2,1 e 2,2 , a column vector of matrix E is an eigenvector of matrix U , and the eigenvalues of U determine Λ = λ 1 0 0 λ 2 . The values of λ max and λ min corresponding to (λ 1 , λ 2 ) and (λ 2 , λ 1 ) are shown in Eqs. (8) and (9), respectively.

Moreover, Eq. (10) denotes a direction angle of the area of the image. From the above theoretical deviations, Eqs. (1)-(10) can easily formularize some basic properties of the pattern.

Notably, an added restriction such as µ 2,0 > µ 0,2 can determine the unique angle of θ . Clearly, a discrimination property of the patterns increases when the higher order moments are used as the pattern. The higher order moments with respect to the principal axes can also be easily determined using the method of the principal axes described in [32] . Because µ 0,0 = M 0,0 can be used to measure the area, a factor √ α of proportionality, a divisor, should be used to further normalize the central moment in Eq. (5) by dividing the variables of the function of Eq. (2) by the factor. Eq. (2) then can adopt the form f x/ √ α, y/ √ α . As a result the normalized central moment ν k,l equals the original central moment of f (x, y) divided by √ α k+l+2 . Because of µ 0,0 being able to measure area, α = µ 0,0 . Eq. (11) then can be used to define the normalized central moment, as follows:

An identification of the pattern independent of position, size and orientation then can be suitably employed to formularize the moment invariants for any signal curve using Eq. (11).

To improve pattern recognition capability, an absolute moment invariant should further be achieved by using multiple moment invariants. The normalized central ν k,l in the second and third orders can generate the following six absolute and orthogonal moment invariants listed in Eq. (12) .

While the skewed invariant listed in the original study [32] is useful for distinguishing mirror images, the model discards this invariant since one skewed orthogonal invariant is not necessary for signal recognition here. The six moment invariants in Eq. (12) are not only independent of position, size and orientation, but also can be used to present a feature-node for a fragmented protein sequence. Finally, a six-valued vector ω = ω 1 ω 2 ω 3 ω 4 ω 5 ω 6 T can denote a feature-node to enable further convenient processing.

Suppose that a feature-node ω a = (ω a1 , ω a2 , ω a3 , ω a4 , ω a5 , ω a6 ) can represent a signal curve from the fragmented protein corresponding to one point in the 2-D space. Assume a set of vectors Ψ = { ω a | a = 1, 2, . . . , K } and the other set of X = { x a | a = 1, 2, . . . , N}. Since vector ω a with six variables would increase the problem complexity, a simple linear-transform is used to obtain a scalar for solving the point-matching problem using the TPS of two variables. One problem of the fragmented sequence-mapping then can simply be transformed into the other new problem. The new problem involving the two feature-node sets of X and Ψ fits a mapping function f ( ω a ) using TPS, and simultaneously minimizes the TPS energy function defined as follows:

Typically, two types of deformations of a compared object exist, rigid (affine) and non-rigid (non-affine). The rigid deformation includes parallel translation and rotation. Meanwhile, the non-rigid deformation includes shrink, dilation, and distortion. By minimizing the first error measurement term of (13), the feature-node set X maps as closely as possible to the other featurenode set Ψ . Generally, an infinite number of mappings f can minimize the first term since the mapping is non-rigid. The energy function of Eq. (13) can uniquely determine a minimizing function f specified by two parameter matricesÃ andW in Eq. (14) if and only if it has a fixed value λ.

whereÃ represents a (D + 1)

with c being a constant for any ω [34] , is related to the TPS kernel. The kernel comprises the information regarding the internal structural relationships of the feature-node set and a non-rigid warp comes into play when the warping coefficient matrixW is considered in Eq. (14) . The second term of Eq. (13), which is responsible for regularizing the mapping function between X and Ψ , is essentially a smoothness constraint. The Lagrange parameter λ in Eq. (13) regularizes the warping of the matches of feature-nodes. The precise matches appear if λ approaches zero. When a solution for f in Eq. (14) is substituted into the TPS energy function in Eq. (13) , and using Mercer-Hilbert-Schmidt theorems of Riesz and Sz-Nagy [35] to prove Eq. (15), Eq. (16) is obtained:

whereλ 1 ≥λ 2 ≥ · · ·λ K ≥ 0 denote eigenvalues of the continuous eigenfunction φ 1 , φ 2 , . . . , φ K , andW t is the transposed matrix ofW .

where X and Ψ are merely concatenated versions of the vector values of the nodes

vectors. Each row of the matrix derives from the original vectors. In the model, an energy function with the affineÃ and warpingW parameters, which is derived from Eqs. (13)- (16), is used as a fitness function with genetic programming. Furthermore, the local border between two regions of protein fragments should be considered an important feature for understanding protein functions. Therefore, a distinctive characteristic of the TPS can always decompose a geometrical transformation into a global affine transformation and a local non-affine warping component.

Using this characteristic, a functional signal generally can be detected based on the same essential protein family even if it is undergoing various environment parameters. The parameters include the setting of primary forming signal of the sequences, for example the fragment length in relation to signal response. Alternatively, the rotation, translation, and global shear of the signals should slightly influence the detection of the functional signals based on the feature geometrical characteristics (pattern). Consequently, the smoothness termW in Eq. (16) is specifically applied to be the warping components, and this term should be penalized when signal deformation is ineffective to their essential properties. Visually, two compared objects with similar patterns should increase the influence of TPS to interpolate a fitting spline. Correspondingly, the similar protein fragments compared with one another can be identified by TPS matching.

The proposed model uses TPS methods to compare with two feature-node sets for multiple optimization problems. The TPS method raises three issues of optimizations. Traditionally, the main issues include optimizations of mapping and correspondence. Generally, the work of correspondence must permute a one-to-one relationship between two feature-node sets, and the work of mapping must fit the two feature-node sets to an optimal function f . However, this model defines a new issue that the problem solved must simultaneously optimize the hierarchical heaping structure by constructing a hierarchical relationship among feature-nodes. Thus, the work of correspondence in the model must achieve the hierarchical relationship of homologies rather than the one-to-one relationship. Alternatively, the model must also optimize heaping. Heaping optimization should meet the following minimum requirements: (i) only generating the node of a unique root, (ii) a parent node can compare with multiple children nodes, (iii) using the limitation of the number of branches, the model must only choose the first certain children nodes depending on homology likelihood to attach to a parent node, and (iv) the model must place the predicted feature-nodes and the predicting feature-nodes to the external and internal nods of a causal tree, respectively. These requirements describe the one-to-many query strategy in the homology search model. More precisely, heaping also accomplishes a descending order of homology probability for children nodes. Nevertheless, the computations have increased difficulty in simultaneously optimizing the hierarchical correspondence, non-rigid mapping, and heaping.

The subsection formally defines a correspondence matrix for a causal tree via the strategy of a one-to-many query. Assume that a sequence {Q 1 } with unknown functions is used as a query sequence to compare with the fragmented member sequences {Q x } Z x=2 for constructing an optimal hierarchical structure. The optimal structure should comprise the maximum of homology relationships among these protein fragments of both 'query' and 'member'. Moreover, Z denotes the number of the input sequence classified to the same distantly related protein family. Furthermore, letM denote a correspondence matrix representing the correspondence between two node-sets in a tree. Assume the matrix contains (N + 1) × (K + 1) elements:

where each m ia is an element of the matrix, with a value that is a real number between zero and one. One query sequence Q 1 is divided into S 1 fragments. The summation of m ia in Eq. (17) for the constraints of each row inM must equal one. The node of the same label thus appears in the tree a maximum of once. The WTA method and Sinkhorn balancing theorem [36] can be used to normalize m ia to satisfy the row and column constraints in Refs. [24, 27, 36] . However, this model only uses the row constraint to modify WTA as RO-WTA. Alternatively, the correspondence matrix for a causal tree must satisfy the constraint N+1 a=S 1 +1,i =a m ia = 1 in Eq. (17) to implement a one-to-many relationship. Simultaneously, a constraint N+1 a=S 1 +1 m aa = 0 is used to prevent a self-correspondence. The self-correspondence that appears when one feature-node points to itself is invalid. Furthermore, a constraint N i=S 1 +1 m i(N+1) = 1 and ∀i ≤ S 1 m i(N+1) can decide the node of the unique root. From the definition of the correspondence matrix in Eq. (17), let:

where K represents the number of parent nodes with the starting number one and ending at number K . However, the alternative representation is beginning at number S 1 + 1 and ending at number N. In fact, the two numbering representations are identical for the convenience of representation and implementation. From Eqs. (18), (19) and (20) define the set of parent and children nodes in a tree, respectively.

where j and (l x , l y ) denote different coordinates of the one and two dimensions, respectively. Moreover, the variable S p denotes the number of fragments in each protein sequence. Next, Eqs. (21)-(25) further define a causal tree denoted CT using the correspondence matrix M. A set of causal trees , similarly described in [35] , can be expressed in terms of i as follows:

where denotes the set of causal trees i . A causal tree includes the components R i , I i and E i , which denote a root node, internal node-set, and external node-set, respectively. The causal tree is a tree network that is constructed from the binary random variables labeled, and is also illustrated in Fig. 1. Fig. 2 shows an example of internal and external node assignment. Of nine fragments, five corresponding fragments of internal nodes 5, 6, 7, 8, represent a set of fragmenting members from each member sequence {Q x } Z x=2 . The external nodes numbered 1, 2, 3, and 4 in Fig. 1 are for a query sequence {Q 1 } in accordance with the query fragments of Fig. 2 in the numbering representation. In fact, the internal nodes can be used for inferring the probabilistic reasoning [4] for predicting protein function. The reasoning process infers along a cost path in the causal tree by assessing orderly conditional probabilities combined together from a root-node to an external feature-node. Each internal feature-node ω ∈ I indicates a given protein fragment of different properties. Likewise, each external feature-node ω ∈ E indicates an unknown protein fragment function. For heaping the causal tree, the constraintsH 0 ,H 1 andH 2 are used to allocate a correct location of feature-nodes in the causal tree. TheH 0 in Eq. (22) defines a root-node R by applying the WTA method to column (N + 1) of correspondence matrixM. Alternatively, a feature-node is a root-node if and only if no parent node points to it.

Additionally, Eqs. (23) and (24) define the correct locations of internal and external nodes, respectively. The node-pairs −−→ (a, i ) and − −−→ (a, a ) represent a direction from a parent node to a child node. For clarity, Fig. 3 illustrates an example of the correspondence matrix for a causal tree. The proposed RO-WTA selects the maximum value from the real numbers within the parenthesis only for each row and the N + 1 column. Besides RO-WTA, the diagonal elements must all be zero to eliminate self-correspondence.

Eventually, Eq. (25) defines a causal tree with the entities of a correspondence matrix CT as follows: Fig. 3 . Correspondence matrix for a causal tree.

where ∩ and ∪ denote set intersection and union, respectively. Because the causal tree CT may heap some feature-nodes to a cause-effect form, a set of given causes can also be inferred to obtain an unknown effect. The model has transformed the TPS method of two point-sets correspondence into an optimization of the causal tree. Next, the model must control the tree growth by adding an equation in Eq. (26) into the resulting energy function. The value N in Eq. (26) denotes the total number of all input fragments as the tree size. By adding Eq. (26) to the energy function, the model can attain an approximated tree in accordance with the correspondence matrix.

where i denotes the labeled identification of a feature-node. In the presence of this feature-node added to a causal tree j , the value n i returns 1, and otherwise it returns 0. Finally, Eq. (27) combines Eqs. (16), (17) and (26) to create the final energy functions, as follows:

By minimizing Eq. (27), the model can use genetic programming to obtain a globally suboptimal solution with respect to mapping, correspondence, and heaping. The first term in Eq. (27) is an error measure term for matching the similarity between two feature-nodes. Meanwhile, the second term in Eq. (27) can guard against all matches, and a controlling parameter ζ determines a degree of influence for the term. Next, the third term of Eq. (27) , which refers to a barrier function, is an entropy term with an annealing temperature T in statistical physics. By controlling the entropy barrier function, the distribution of random variables m ia in CT gradually stabilizes. Alternatively, the model can generate the minimum number of states required to push the objective minimum away from the discrete points such that some rules emerge from m ia . A gradually reduced temperature T can suitably control thermal fluctuation to yield an optimal solution. Regarding related applications, the application of this entropy term to statistical physics is also briefly described in [37] [38] [39] . Subsequently, the fourth term in Eq. (27) is used to penalize the differences of an affine mapping parameterÃ and an identity matrix I . The differences stimulate some unphysical reflections that result from flipping through the whole plane. The fifth term in Eq. (27) , which is a standard TPS regularization term, penalizes the local warping coefficientsW [24] . Using an approximated least-squares approach and QR-decomposition technique, the model can solve the parameters (Ã,W ) given fixed CT. Additionally, two parameters λ 1 and λ 2 are used to control the influence of the fourth and fifth terms, respectively. Finally, keeping a matrix consistent with a tree during the evolutionary process of genetic programming is very difficult. Thus, the final term in Eq. (27) 

where m ia must conform to RO-WTA. Currently, a parent node can only connect a maximum of three children nodes for this model. Consequently, the summation of all elements in each column must equal three. When the model attains an optimal causal tree, a probability theorem is further used to model a training model for identifying protein function.

Consideration of a given protein family can be requested and answers obtained to a query with an unknown protein sequence of what the interrelationship among the particular protein fragments is. Unfortunately, the calculation of a probability distribution in Bayes networks is computationally intractable (NP-complete). However, a causal tree of fixed structure may efficiently compute the result of the probabilistic distribution [31] . Using the basic probability theorem, then assuming that P(x i , x a ) denotes the joint probability distribution of two random variables x i and x a , and that P(x i | x a ) denotes the conditional probability of x i given x a ; this conditional probability is defined as follows:

Using the Bayes inference described in [4] , Eqs. (30) and (31) are easily derived by Eq. (29) .

where x i and x a are conditionally independent given x a . Subsequently, a well defined Bayes model can be obtained by further generalizing beyond simple Markov models and considering the probability distribution in the form of general trees. By the expansions of Eqs. (30) and (31) , a probability distribution can be applied to learning and inferring for biological models when the tree network has a specific predetermined structure. During a learning phase, the variables within internal nodes and external nodes represent a set of member states and query states, respectively. These variables are both observable during the learning phase. Instead, these variables are hidden during an inferring phase when the structure and probability distributions of a causal tree are temporally fixed. This process fully corresponds to the operation of hidden states in a hidden Markov model. Iteratively, the model eventually produces a refined prediction result based on fitness function estimation. Recalling the example of a causal tree, Fig. 1 defines nine random binary variables on the tree. The case produces a joint probability distribution of the nine variables using a product of the following terms:

where each internal node x i separates variables above and below it. For example, using Eq. (31) can show that the node x 7 gains the conditional probability P(

Initially, the root variable x 7 in Eq. (32) first generates a value using the result from Eq. (28) . Given x 7 , the probabilistic model can determine the conditional probabilities P(x 6 | x 7 ), P(x 9 | x 7 ), and P(x 5 | x 7 ). This model then can generate the values of x 6 , x 9 , and x 5 , respectively. This model thus can calculate the remaining variables recursively. Finally, this model obtains an estimated value as one of the multiple fitness functions described in the following section. By rearranging and observing the terms in Eq. (32) , an ordered product of terms is also very easily implemented by a preorder tree search. Besides the fitness function of the probabilistic and RPM models mentioned above, the model needs another fitness function to determine the boundaries of protein fragments. Eq. (33) shows the normalized measures of a local alignment score α divided by the sequence length and obtained by the Smith-Waterman algorithm [7] .

where δ ia denotes local alignment score and ρ ia represents sequence length. Furthermore, Eqs. (34) and (35) are used to determine a value of parameters ζ and λ 2 for adjusting the reasonable parameters in Eq. (27) . Since temperature T in Eq. (27) is gradually decreased during the evolution, these m ia values are rapidly and accordingly decreased. From Eq. (28), it is best for m ia to be in the interval of [0.0 1.0] to protect it from an infinite drop in value that cannot be computed due to the reasonable computation capacity limitation. Thus, Eq. (34) can be easily obtained from the inequality equation and normal distribution.

where e 1 is the natural logarithm that provides a rough scope for estimating the resolution of 256 states that sufficiently fit this computing issue. Moreover, µ, σ represent the mean and standard deviation of all feature data, respectively. Furthermore, Hall and Titterington [47] proposed a solution for estimating the smoothing parameter λ 2 , by Eq. (35) .

where F(λ 2 ) denotes an influence matrix, I represents an identity matrix, and σ is a standard deviation of all feature data. A method of QR-decomposition can be used to determine the value of (I − F(

The method decomposes the data of featurenodes Ψ = (Q 1 : Q 2 )( R 0 ) into the product of orthonormal matrix and upper triangular matrix. Finally, a process of iteratively estimating Eq. (35) is used to find the optimal value λ 2 with respect to minimizing the error in Eq. (35).

This section focuses on an implementation design of genetic programming that evolves three related subpopulations with inner-exchanged individuals.

In 1992, Koza et al. were early pioneers of the concept of genetic programming [26, [40] [41] [42] which was extended from the genetic algorithm. Genetic programming can automatically create programs for solving general problems. Genetic programming is based on a grammar-based methodology of an evolutionary theorem using the Darwinian survival principle. Therefore, genetic programming can describe highly complicated models for any domain problem without sufficient domain knowledge. First, problem solving requires randomly yielding an initial population of individuals. Subsequently, a fitness function can be used to evaluate individuals in the population and select excellent individuals for further performing specific genetic operations. The genetic operations typically include crossover, replication, and mutation operations. The operations enable the parent population at a given generation to form a new offspring population at the next generation. Iteratively, genetic programming continuously performs the above evolutionary steps until obtaining an optimal individual or satisfying certain conditions. A representation of genetic programming for a domain problem is composed of several functional programs. The programs form a rooted tree structure as a solution for solving the problem. Because genetic programming uses a tree form, the model can easily provide a flexible hierarchical presentation to estimate the relationship of homologies for predicting protein function. Characteristically, genetic programming provides a mechanism for searching for the fittest individual. The individual is the problem solution provided by using the computer program in the search space of all possible computer programs. The search space comprises terminalnodes (external nodes) and function-nodes (internal nodes) that are appropriate to represent the specific problem [26] . When a causal tree following genetic operations generates an infeasible individual, the model must regularize this infeasible individual by using a tournament pick (TP) method. The TP method using the depth-first search (DFS) strategy is proposed to accelerate the performance of regularizing this infeasible individual to a feasible one. The TP+DFS method recursively cures the infeasible nodes residing in this individual. The method replaces an infeasible node with a feasible node, in a preorder visiting for trees, randomly selected from the tournament that contains numerous feasible nodes. This advantage of the TP+DFS method results from the mutation appearing at a suitable time, and estimates the fitness function simultaneously following the crossover. When all individuals conform to all constraints of a causal tree, a fitness function can be used for individual estimation. To improve prediction speed and accuracy, this model uses a multiple fitness function strategy. This strategy implements an exchanged individual mechanism involving three subpopulations. The model next considers the following parameters for developing a co-evolution with the strategy of inner-exchanged individuals in a population. The model exchanges the individuals in the three subpopulations 0, 1, and 2. There are nN individuals in subpopulation 0, nM individuals in subpopulation 1, and nK individuals in subpopulation 2, respectively. The size of the subpopulations should have nN nM, nK for considering a trade-off between speed and accuracy. The selection method used in the model always uses the tournament selection strategy rather than the fitnessproportionate selection strategy. Additionally, the tournament selection (TS) differs from TP in the target level. The former selects the individuals in the population as its targets on the size 7 tournament. Meanwhile, the latter selects the programs (nodes) of an individual (tree) to do the same. Regarding the probabilities of genetic operations, the genetic operation of crossover and reproduction gives values of 0.9 and 0.1, respectively. To control tree shape, the tree depth must be 20 or less after performing the crossover operation. Furthermore, each internal node has between one and three children connected to it. Related applications employing genetic programming can be found in [43] [44] [45] . Fig. 4 illustrates the inner-exchange strategy of individuals in three subpopulations of a population for achieving the cooperation of the multiple fitness function. Initially, a set of initial feature nodes is fed into the subpopulation 0. Subsequently, two feature-node sets produced by random orders of all feature-nodes match each other with TPS methods to find the best correspondence between them. The subpopulation 0 is responsible for comparing the functional signals via heuristic matching. For effective heuristic match, the model assumes that the identical conserved protein domains demonstrate genetically similar functions given analogous functional signal shapes. Consequently, a physical deformation of the signal shapes should be preserved in somewhat ancestral contours. Subsequently, in Fig. 4 , the n best individuals of subpopulation 0 must emigrate into subpopulation 1 for modifying the correct boundaries of the protein fragments. In subpopulation 1, an exhausted local alignment of two fragments (nodes) is used to adjust the boundaries of two protein fragments. Meanwhile, a moment invariant method calculates a new set of feature data for updating the signal data. Moreover, the model must realign the protein fragments once the boundaries of the related fragments have been changed. Next, the m best individuals of subpopulation 1 must emigrate into subpopulation 2 to confirm the significant protein fragments by training of real cases. Moreover, the n best individuals of subpopulation 0 also must emigrate into subpopulation 2 to do the same work. Finally, the k best individuals of subpopulation 2 must emigrate into subpopulation 0 to resume the individuals having specific biological evidence. These resumed individuals can improve the quality of the individuals of subpopulation 0 using the guidelines from real cases. Simultaneously, subpopulation 2 eliminates the worst individuals from subpopulation 2 to prevent them from backing to subpopulation 0 or 2. When all of the individuals in the three subpopulations have been coevolved by genetic programming, a final solution should select the optimum individual from subpopulation 0. Specifically, the solution takes the form of a causal tree. Each path from the root node to any external node represents the functional heredity process. This path can provide an auxiliary guide to biologists for comprehensively understanding the unknown function of query sequences. The strategies used by exchanged individuals are feasible and essential for improving the final solution quality. More specifically, the model cooperatively evolves the causal trees via the cooperation of multiple fitness functions. The multiple fitness functions use Eq. (27) as the fitness function of subpopulation 0, Eq. (33) as the fitness function of subpopulation 1, and the product of Eq. (30) for all nodes as the fitness function of subpopulation 2. Fig. 5 describes the statements of the whole algorithm constructing this AGCT model.

The protein nucleocapsid (nsp1) of a coronavirus family from various species was tested to clarify the protein function of nsp1 of SARS coronavirus. 

For seven input sequences, one sequence is a query sequence and six are member sequences. The model simultaneously inputs the sequences to analyze its performance. For convenience, Table 1 lists the information on sequence number, abbreviated names, accessions, definitions, and sources obtained from the NCBI web site for the proteins, as follows:

These coronaviruses are members of a family of enveloped viruses that replicate in the cytoplasm of animal host cells [46] . SARS coronavirus TW1 is a severe acute respiratory syndrome associated coronavirus known as the TW1 isolate. For simplicity, the model only inputs the sequence of putative nsp1 protein of TW1 for analysis with the known protein nsp1 in other hosts. However, this analysis is useful for understanding the novel protein based on inference using other proteins with known functions. Although the traditional MSA can provide accurate knowledge regarding the conserved domains for these proteins, it cannot clearly express homology relationships over these conserved domains. In fact, the proposed model can provide a hierarchical contour of protein fragments. The contour can also help biologists to increase understanding of the functional heredity relationships only depending on the primary structure of proteins. Due to the space limitation, this matching causal tree for the problem using this model is not shown.

This section analyzes the performance of convergence of energy at different evolutionary generations using the AGCT model. Table 2 lists some of test parameters used for the AGCT model to obtain a suboptimal solution to the problems. The parameters which are appropriate for the present case are obtained using Eqs. (33)- (35) . Moreover, a reasonable value of λ 2 = 1.8 is used for Eq. (27) through analysis using Eq. (35) . The analysis attains the reasonable value using a general cross validation. To analyze the AGCT model, cases (I) to (V) confirm these typically varied situations. First, case (I) designs for illustrating the ideal solution can be found prematurely. Next, case (II) designs a situation involving a sufficiently large subpopulation size but related small migration size for the subpopulations. Next, case (III) designs a situation involving a small population size and small migration size. Case (IV) then designs a situation involving sufficient population size and large migration size. Finally, case (V) attempts to understand the real circumstances of evolution at subpopulations 0 and 1 when the energy of subpopulation 2 demonstrates convergence. Experimentally, the key parameters are the rates of II  100  10  50  10  50  10  100  III  20  10  20  10  20  10  100  IV  100  20  50  30  50  30  200  V  1 0  2  1 0  2  8  1  200   Table 3 Test results of AGCT model Table 3 lists the results of the five cases. The sensitivity column and specificity column of Table 3 display the prediction and discrimination capacities, respectively. Since the capacity estimation is based on averaging all generations, the ratios of comparison between the heuristic signal match and exhaustive sequence alignment are quite low. In fact, the high prediction capacity can improve the final solution quality. The high discrimination capacity can increase the execution speed. Although the model obtains low sensitivity results, the model attains relatively high specificity results. Actually, the results of the later generations have higher sensitivity than previous generations (data not shown). Thus, the sensitivity results, which decrease the execution performance during the co-evolution in this model, do not influence the prediction accuracy. In fact, the rises and falls depend on a threshold (cut-off) value, because of a trade-off between sensitivity and specificity. Thus, Eq. (33) specifies a threshold value of 1.5 to determine whether or not two protein fragments are homologous. Moreover, Eq. (27) specifies a threshold of value of 0.3 to decide whether or not two signals are analogous.

From the observation of the data in Table 3 , case II has better sensitivity than the other cases. Furthermore, case II is also competitive with other reported results. Although fewer individuals are involved in case III than case II, the resulting reports in case III are the best except for this sensitivity result. The following discussion can further explain the above confusion. Although cases II and IV involve the same numbers of individuals, case IV involves more immigrated or emigrated individuals than case II. Consequently, case IV obtains worse results than case II. The above demonstrates that, appropriate individual migration is essential for improving solution quality. However, the amount of migration should not be too large to perform the normal evolution by genetic programming. Meanwhile, population size is not particularly important in the present cases since the RPM method and Smith-Waterman algorithm can undoubtedly obtain the local optimal result for each generation of genetic programming. Therefore, even through case III involves fewer individuals than case II, case III can still achieve the performance results approaching case II. Finally, the mechanism of premature convergence in cases I and IV attained the worst results. Clearly, case V improves the results of case V through a long-term evolution.

Accordingly, Fig. 6 displays the performance of convergence in each subpopulation on fixed temperature. The energy of subpopulations 0 converges at 0.488 and has a different rate of convergence to the other two subpopulations. Clearly, subpopulation 2 evolves faster than the other two subpopulations. Fig. 7 illustrates the improvement of the energy convergence of subpopulation 0 by using the five temperature rates (Tr) for decreasing annealing temperature, including 1, 0.95, 0.90, 0.85, and 0.80.

Eventually, the suboptimal solution to this present experiment is yielded using the genetic programming via a cooperation of the multiple fitness functions. This cooperation can continuously change the positions appropriate for splitting the protein sequence. Due to the space limitation in this paper, the starting and ending positions of each protein fragment are not shown.

Subsequently, because of case II containing more individuals than the others, Fig. 8 compares the energy convergences based on case II with each other for obtaining an appropriate Tr value. More precisely, the rate of value 0.85 is the best choice for situations involving rapid and steady convergence. Additionally, Fig. 9 compares four different Tr values 0.97, 0.95, 0.9, and 0.85 for analyzing the energy of subpopulation 0. Clearly, case IV really is the worst case. Next, Fig. 10 simultaneously illustrates the energy convergence of three subpopulations based on case V by varying the rates of temperature decrease. Clearly, the model proportionally and consistently reduces the energies of each subpopulation. From the observation of generations in Fig. 10 , subpopulations 1 and 2 already begin to converge at around the fortieth generation. Finally, the model gradually performs the RPM work in subpopulation 0 between the fortieth and eightieth generations.

The AGCT model utilizes a novel representation of tree structure to illustrate the concept of applying functional homologies to predict protein function. This model differs in numerous respects from the other traditional model of MSA. Nevertheless, this model still includes various traditionally popular methods used for the high level application, predicting protein function. These methods include genetic programming, wavelet transform, profile comparison, local alignment with the Smith-Waterman algorithm, moment invariant theorem, RPM with TPS, Bayes inference, and the causal tree model. Besides the above methods, this model introduces various strategies, including the inner-exchanged individuals in subpopulations, cooperation of multiple fitness functions, guided evolution involving real cases, and TP+DFS regularization. The prediction accuracy thus can be improved. This hybrid model is developed to exploit global search capabilities in genetic programming for predicting protein functions of a distantly related protein family that have difficulties in the conserved domain identification. Moreover, the RPM involved can identify more sequence homologies through softening comparisons. Essentially, this work contributed a complex and integrated methodology that enables new advances in predicting protein function in the post-genome era. We believe that the model robustness can be increased in the future if biologists generate more experimental data in laboratories. Future works will apply this model to investigate issues related to protein interaction.

Sequence comparison by dynamic programming

A hidden Markov model that finds genes in e. coli DNA

Hidden Markov models in computational biology: Applications to protein modeling

Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference

Modeling splice sites with Bayes networks

Prediction of the secondary structure of proteins from their amino acid sequence

Identification of common molecular subsequences

Hybrid computational intelligence schemes in complex domains: An extended review

Genetic algorithms in molecular recognition and design

Searching for discrimination rules in protease proteolytic cleavage activity using genetic programming with a min-max scoring function

Logistic regression and artificial neural network classification models: a methodology review

Statistical mechanics beyond the Hopfield model: solvable problems in neural network theory

Applying fuzzy logic to medical decision making in the intensive care unit

Increasing the efficiency of fuzzy logic-based gene expression data analysis

EPPS: mining the COG database by an extended phylogenetic patterns search

Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids

Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments

Comparison of sequence profiles. Strategies for structural predictions using sequence information

Within the twilight zone: a sensitive profile-profile comparison tool based on information theory

COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance

Basic local alignment search tool

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice

A new algorithm for non-rigid point matching

Ant colony system with extremal dynamics for point matching and pose estimation

Genetic Programming On the Programming of Computers by Means of Natural Selection

New algorithms for 2-D and 3-D point matching: pose estimation and correspondence

Probabilistic prediction of protein secondary structure using causal networks

Perfect sampling for wavelet reconstruction of signals

A computational vision approach to image registration

Visual pattern recognition by moment invariants

Moment spaces and inequalities

Spline Models for Observational Data

Functional Analysis

A relationship between arbitrary positive matrices and doubly stochastic matrices

A new method for mapping optimization problems onto neural networks

Statistical physics algorithms that converge

A novel optimizing network architecture with applications

Genetic and evolutionary algorithms come of age

Genetic programming III: Darwinian invention and problem solving: Book Review

Genetic programming and evolutionary generalization

Discovering knowledge from medical databases using evolutionary algorithms

Genetic programming for knowledge discovery in chest-pain diagnosis

Classifying proteins as extracellular using programmatic motifs and genetic programming

Fields Virology

Common structure of techniques for choosing smoothing parameters in regression problems