key: cord-327744-5k8np850
authors: Munteanu, Cristian Robert; Magalhães, Alexandre L.; Uriarte, Eugenio; González-Díaz, Humberto
title: Multi-target QPDR classification model for human breast and colon cancer-related proteins using star graph topological indices
date: 2009-03-21
journal: Journal of Theoretical Biology
DOI: 10.1016/j.jtbi.2008.11.017
sha:
doc_id: 327744
cord_uid: 5k8np850

Abstract

Cancer diagnosis is a complex process and, sometimes, the specific markers can interfere or produce negative results. Thus, new simple and fast theoretical models are required. One option is the theory of complex network graphs, which permits us to describe any real system, from small molecules to complex genetic, neural or social networks, by transforming real properties into topological indices. This work converts protein primary structure data into specific Randic's star network topological indices using the new sequence to star networks (S2SNet) application. A set of 1054 proteins was selected from previous works; it contains proteins related or not to two types of cancer, human breast cancer (HBC) and human colon cancer (HCC). The general discriminant analysis method generates an input-coded multi-target classification model with training/predicting set accuracies of 90.0% for the forward stepwise model type. In addition, a protein subset was modified by single amino acid mutations with higher log-odds PAM250 values and tested with the new classification model to see whether the mutants can be related to HBC or HCC. In conclusion, we have shown that, using simple input data such as the primary protein sequence and the simplest linear analysis, it is possible to obtain accurate classification models that can predict whether a new protein is related to two types of cancer. These results promote the use of S2SNet in clinical proteomics.

Cancer is a leading cause of death worldwide, accounting for around 13% of all deaths in 2007 (WHO, 2008).
Two of the leading types of cancer are human breast cancer (HBC) and human colon cancer (HCC). The estimated new cancer cases and deaths in the US for 2008 show that HBC will affect 26% of women (15% will die) and HCC will involve 10% of men/women (8% of men and 9% of women will die) (Jemal et al., 2008). Therefore, simple and fast theoretical methods can be very useful in the detection of cancer diseases. The present work uses the protein quantitative proteome-disease relationship (QPDR) (Ferino et al., 2008), which is similar to the quantitative structure-activity relationship (QSAR) (Devillers and Balaban, 1999). QPDR is one of the widely used analyses for predicting protein properties and, in the present study, it uses macromolecular descriptors, named topological indices (TIs), obtained with graph theory. The branch of mathematical chemistry dedicated to encoding DNA/protein information in graph representations by the use of TIs has become an intense research area (Agüero-Chapin et al., 2006; Bielinska-Waz et al., 2007; Liao and Wang, 2004; Liao and Ding, 2005; Randic, 2000; Randic and Basak, 2001; Randic and Balaban, 2003; Randic et al., 2000). Graphic approaches to the study of biological systems can provide useful insights in QSAR studies (González-Díaz et al., 2006, 2007c; Prado-Prado et al., 2008), protein folding kinetics (Chou, 1990), enzyme-catalyzed reactions (Chou, 1989; Chou and Forsen, 1980; Chou and Liu, 1981; Kuzmic et al., 1992), inhibition kinetics of processive nucleic acid polymerases and nucleases (Althaus et al., 1993a, b; Althaus et al., 1994, 1996; Chou et al., 1994), DNA sequence analysis (Qi et al., 2007), anti-sense strand base frequencies, analysis of codon usage (Chou and Zhang, 1992; Zhang and Chou, 1994) and in investigations of complicated network systems (Diao et al., 2007; González-Díaz et al., 2007a, 2008).
Recently, the "cellular automaton image" (Wolfram, 1984, 2002) has also been applied to study hepatitis B viral infections (Xiao et al., 2006a), HBV virus gene missense mutation (Xiao et al., 2005b), and visual analysis of SARS-CoV (Gao et al., 2006; Wang et al., 2005), as well as to represent complicated biological sequences (Xiao et al., 2005a) and to help identify protein attributes (Xiao and Chou, 2007; Xiao et al., 2006b). We have chosen the TIs for these QPDR models based on the results of previous work with similar QSAR/QPDR models. Even if the TIs cannot always be interpreted, they have been shown to encode the information needed to create accurate QSAR/QPDR models. Other interesting fields of application for graph theory are oncology and clinical proteomics. A classification model for discriminating prostate cancer patients from a control group with connectivity indices was constructed by González-Díaz et al. (2007b). Vilar's group designed a QSAR model for alignment-free prediction of HBC biomarkers based on electrostatic potentials of protein pseudofolding HP-lattice networks (Vilar et al., 2008). The present work proposes a new cancer/non-cancer classification model based on protein embedded/non-embedded star graph TIs such as the trace of connectivity matrices, Harary number, Wiener index, Gutman index, Schultz index, Moreau-Broto indices, Balaban distance connectivity index, Kier-Hall connectivity indices and Randic connectivity index. This classification can predict two types of cancer: HBC and HCC. The primary protein sequence is transformed into connectivity star graph TIs that are used by a statistical linear method in order to construct an input-coded multi-target classification model.

Two sets of protein primary sequences are used: a set of 189 HBC/HCC cancer proteins (Sjoblom et al., 2006) and 865 non-cancer proteins (Dobson and Doig, 2005; Dobson et al., 2004).
The list of cancer-related proteins in our work is the same as the list obtained by the Sjoblom group after the experimental analysis of 13,023 genes in 11 breast and 11 colorectal cancers. Each protein sequence was transformed into a star graph, where the amino acids are the vertices (nodes), connected in a specific sequence by the peptide bonds. The star graph is a special case of trees with N vertices where one has got N−1 degrees of freedom and the remaining N−1 vertices have got one single degree of freedom (Harary, 1969). Each of the 20 possible branches ("rays") of the star contains the same amino acid type and the star centre is a non-amino acid vertex. A protein can be represented by diverse forms of graphs, which can be associated with distinct distance matrices. The best method to construct a standard star graph is the following: each amino acid/vertex holds the position in the original sequence and the branches are labelled by alphabetical order of the 3-letter amino acid code (Randic et al., 2007). The graph is embedded if the initial sequence connectivity in the protein chain is included. Figs. 1A and B present the non-embedded/embedded star graphs of PRPS1 using the alphabetical order of the one-letter amino acid code. Thus, the primary structure of protein chains is transformed into the corresponding star graph invariant TIs. The resulting graphs do not depend on the three-dimensional structure or the shape of the protein. The comparison of the graphs is made by using the corresponding connectivity matrix, distance matrix and degree matrix. The matrices of the connectivity in the sequence and in the star graph are combined in the case of the embedded graph. These matrices and the normalized ones are the basis of the TIs calculation. The conversion of the amino acid sequences into star graph TIs was made by using the sequence to star networks (S2SNet) application, developed by our group (Munteanu and González-Díaz, 2008).
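The star graph construction described above can be illustrated with a minimal Python sketch. It is not the S2SNet implementation: the function name `star_graph_edges`, the vertex numbering (0 for the non-amino-acid centre, k for the k-th residue) and the rule that each residue attaches to the previous residue of the same type on its ray are assumptions made for illustration only.

```python
def star_graph_edges(sequence, embedded=False):
    """Return the undirected edges (i, j), i < j, of a star graph.

    Vertex 0 is the non-amino-acid centre; vertex k (k >= 1) is the
    k-th residue of the sequence (hypothetical numbering).
    """
    edges = set()
    last_on_ray = {}  # amino acid letter -> last vertex placed on its ray
    for pos, aa in enumerate(sequence, start=1):
        # Attach the residue to the end of the ray of its own type,
        # or to the centre if the ray is still empty.
        prev = last_on_ray.get(aa, 0)
        edges.add((prev, pos))
        last_on_ray[aa] = pos
    if embedded:
        # Embedded graph: also keep the peptide-bond connectivity
        # between consecutive residues of the original chain.
        for pos in range(1, len(sequence)):
            edges.add((pos, pos + 1))
    return edges


def adjacency_matrix(edges, n_vertices):
    """Build a symmetric 0/1 connectivity matrix from an edge set."""
    adj = [[0] * n_vertices for _ in range(n_vertices)]
    for i, j in edges:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj
```

For a toy sequence such as `"ACAC"`, the non-embedded graph has two rays (A and C) of two residues each, and the embedded variant adds the three chain bonds on top of them.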
S2SNet is based on wxPython (Rappin and Dunn, 2006) for the GUI application and has Graphviz (Koutsofios and North, 1993) as a graphics back-end. The present calculations are characterized by embedded and non-embedded TIs, no weights, Markov normalization and power of matrices/indices (n) up to 5. The results file contains TIs (Todeschini and Consonni, 2002) ranging from the trace of the n-th power of the connectivity matrix (tr_n), or the spectral moments,

$$ \mathrm{tr}_n = \sum_i \left(M^n\right)_{ii} \qquad (1) $$

where n = 0 to the power limit and M denotes the connectivity matrix, to the Randic connectivity index ($^1X$). These TIs and others derived from them are used in the next step to construct a cancer/non-cancer classification model by linear statistical methods.

An input-coded multi-target classification model was created with the general discriminant analysis (GDA) method (Kowalski and Wold, 1982; Van Waterbeemd, 1995) in the STATISTICA 6.0 package (StatSoft.Inc., 2002). This model can predict whether a protein is HBC- or HCC-related using a single equation. For this reason, in addition to the 30 star graph embedded and non-embedded TIs, two other types of continuous predictors (attributes) were introduced that encode specific information about each cancer type: 30 products of the HBC/HCC cancer probability with the embedded/non-embedded TIs (pTI = prob_HBC/HCC × TI) and 30 differences between the same TIs and the average of the TIs for each type of cancer [dTI = TI − average(TI)_HBC/HCC]. The cancer probabilities represent the fractions of HBC/HCC-related proteins among all of Sjöblom's proteins (cancer proteins) and have values of 0.639 (HBC) and 0.361 (HCC). For each protein there are two cases, corresponding to the two types of cancer. The dependent variable (CancerOrNot) takes the value 1 for cancer and 0 for non-cancer, and the cross-validation (CV) variable has two values (train and val). The best CV methods to examine a predictor are the following: independent dataset test, subsampling test, and jackknife test (Chou and Zhang, 1995).
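As an illustration of the spectral moments (traces of successive powers of the connectivity matrix), the following Python sketch computes tr_0 to tr_n for a small graph. It works on a raw 0/1 adjacency matrix; the Markov normalization applied in the present calculations is not reproduced here, and the helper names `mat_mul` and `spectral_moments` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(adj, max_power):
    """Return [tr(M^0), tr(M^1), ..., tr(M^max_power)] for matrix adj."""
    n = len(adj)
    # M^0 is the identity matrix.
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    moments = []
    for _ in range(max_power + 1):
        moments.append(sum(power[i][i] for i in range(n)))
        power = mat_mul(power, adj)
    return moments


# Three-vertex path graph 0 - 1 - 2 (toy example, not a protein graph).
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(spectral_moments(path, 4))
```

For an unweighted graph, tr_0 equals the number of vertices and tr_2 equals twice the number of edges, which gives a quick sanity check on any connectivity matrix passed in.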
Chou and Shen (2007, 2008) have shown that, of these, the jackknife test has the least arbitrariness. Thus, the jackknife test has been increasingly used by investigators to examine the accuracy of various predictors (Chen and Li, 2007a, b; Diao et al., 2007; Ding et al., 2007; Jiang et al., 2008; Li and Li, 2008; Lin, 2008; Niu et al., 2006; Xiao and Chou, 2007; Zhang et al., 2008; Zhou et al., 2007). In the present work, the independent dataset test is used by splitting the data at random into a training series (train, 75%) used for model construction and a prediction series (val, 25%) for model validation (the CV column is filled by repeating 6 train and 2 val). All independent variables are standardized prior to model construction. The general QPDR formula contains embedded and non-embedded TIs, pTIs and dTIs:

$$ \mathrm{C/nC\text{-}score} = \sum_{i=1}^{n} c_i\,\mathrm{TI}_i + \sum_{i=n+1}^{m} c_i\,\mathrm{pTI}_{i-n} + \sum_{i=m+1}^{o} c_i\,\mathrm{dTI}_{i-m} + c_0 $$

where C/nC-score is the continuous score value for the cancer/non-cancer classification (HBC or HCC), c_1 to c_n are the TI coefficients (n = number of TIs), c_{n+1} to c_m are the pTI coefficients (n < m; m − n = number of pTIs), c_{m+1} to c_o are the dTI coefficients (m < o; o − m = number of dTIs) and c_0 is the independent term. We inspected the percentage of good classification and the number of variables to be explored in order to avoid over-fitting or chance correlation. The forward model type was tested for the embedded, non-embedded and both data, including TIs, pTIs, dTIs and all indices.

In addition, Dobson's set is used to select a subset of 61 non-cancer proteins with cancer probability between 0.3 and 0.5, to which 17 single amino acid mutations with log-odds PAM250 (Dayhoff, 1978) greater than or equal to 2 are applied (see Table 1). The best classification model predicted the probability of presence in HBC/HCC cancer for each of these mutated proteins, and the results were analysed with the two-way joining clustering analysis method (tw-JCA) from STATISTICA (StatSoft.Inc., 2002).
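The input coding (pTIs and dTIs) and the filling of the CV column can be sketched in Python as follows. This is only a sketch under stated assumptions: the function names, the dictionary layout and the TI averages used in the example are hypothetical, the cancer probabilities are the values 0.639 and 0.361 given in the text, and the random ordering of the cases before labelling is assumed to have been done elsewhere.

```python
CANCER_PROB = {"HBC": 0.639, "HCC": 0.361}  # fractions given in the text


def input_coded_rows(ti, ti_means):
    """Build one predictor row per cancer type for a single protein.

    ti       : dict, TI name -> value for this protein
    ti_means : dict, cancer type -> dict of average TI values
               over the proteins of that cancer type
    """
    rows = {}
    for cancer, prob in CANCER_PROB.items():
        row = dict(ti)  # the original TIs are kept unchanged
        for name, value in ti.items():
            row["p" + name] = prob * value                    # pTI
            row["d" + name] = value - ti_means[cancer][name]  # dTI
        rows[cancer] = row
    return rows


def cv_labels(n_cases):
    """Fill the CV column by repeating 6 'train' and 2 'val' labels."""
    pattern = ["train"] * 6 + ["val"] * 2
    return [pattern[i % len(pattern)] for i in range(n_cases)]


# Hypothetical values, for illustration only.
rows = input_coded_rows({"Je": 2.0},
                        {"HBC": {"Je": 1.5}, "HCC": {"Je": 2.5}})
print(rows["HBC"], rows["HCC"])
print(cv_labels(8))
```

Because each protein yields one row per cancer type, the same protein appears twice in the data table, which is what allows a single equation to address both targets.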
Fifteen classification models were tested with the aim of finding the best GDA equation able to discriminate between proteins related to HBC and HCC. The attributes include 30 embedded/non-embedded star graph TIs obtained with the S2SNet application and 60 other composed predictors, pTIs and dTIs. The values obtained for the training/predicting accuracies with the forward stepwise method are presented in Table 2. The forward stepwise variable selection method combined with the embedded TIs and dTIs provides the best results for our data set, with 89.9%, 90.3% and 90.0% of cases correctly classified for the training, CV and full sets, respectively, using only six/five parameters/variables (Eq. (14)). The embedded TIs have the name of the non-embedded ones plus "e" as suffix. The simple linear mathematical form of the model has been chosen in the absence of prior information.

$$ \mathrm{C/nC\text{-}score} = -4.4 + 1.7\,\mathrm{tr}_{3e} + 124.8\,\mathrm{Se} - 126.5\,\mathrm{dJe} + 48.6\,\mathrm{dX2e} - 45.9\,\mathrm{X5e} \qquad (14) $$

where N is the number of cases (C and nC), R_c is the canonical regression coefficient, U is Wilks' statistic, F is Fisher's statistic and p is the p-level (probability of error). The above results are typically considered excellent in the literature for LDA-QPDR/QSAR models (Castillo-Garit et al., 2008; Estrada and Molina, 2001; Marrero-Ponce et al., 2004; Morales et al., 2006; Vilar et al., 2008). In order to check the variation of this model with the training/CV sets, we carried out a CV study using ten totally random sets, including the initial one from the present model (with the same 75% training and 25% CV). The classification values are presented in Table S1 of the supplementary material and show an average of 90.2% for training and 89.2% for CV. These values demonstrate the stability of the model with respect to the selection of the classification sets.
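As a small worked illustration, Eq. (14) can be evaluated in Python as below. The coefficients are those reported in Eq. (14); the function name `eq14_score` and the dictionary of predictor values are hypothetical, and the inputs are assumed to be already standardized in the same way as during model construction. How a score is turned into a class label or probability is left to the GDA procedure and is not reproduced here.

```python
# Coefficients of Eq. (14); predictor names follow the text
# (suffix "e" = embedded, prefix "d" = difference to the cancer average).
EQ14_COEFFS = {
    "tr3e": 1.7,
    "Se": 124.8,
    "dJe": -126.5,
    "dX2e": 48.6,
    "X5e": -45.9,
}
EQ14_INTERCEPT = -4.4


def eq14_score(predictors):
    """Return the C/nC-score of Eq. (14) for one input-coded case.

    predictors: dict with (assumed standardized) values for
    tr3e, Se, dJe, dX2e and X5e.
    """
    return EQ14_INTERCEPT + sum(
        coeff * predictors[name] for name, coeff in EQ14_COEFFS.items()
    )


# Hypothetical input: every predictor at its mean (zero after standardization),
# so the score reduces to the independent term.
print(eq14_score({"tr3e": 0.0, "Se": 0.0, "dJe": 0.0, "dX2e": 0.0, "X5e": 0.0}))
```

Since dJe and dX2e depend on the cancer type through the subtracted averages, the same protein receives two scores, one for its HBC row and one for its HCC row.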
In order to illustrate the performance of the approach when applied to a single set of cancer-related proteins (e.g. either breast or colon), we obtained two equations, one for HBC and the other for HCC. Eq. (14) represents an input-coded multi-target classification model that can evaluate whether a protein is HBC- or HCC-related by using the HBC or HCC average Je and X2e values (contained in the dJe and dX2e differences). Eq. (14) can therefore be reduced to two different equations, one for each type of cancer (HBC and HCC):

$$ \mathrm{HCC/nHCC\text{-}score} = -20.8 + 1.7\,\mathrm{tr}_{3e} + 124.8\,\mathrm{Se} - \mathrm{Je} + 0.2\,\mathrm{X2e} - 45.9\,\mathrm{X5e} $$

The detailed classification results for each type of cancer obtained with Eqs. (14a), (14b) are presented in Table 3. A similar input-coded multi-target classification model was obtained by using the forward stepwise method and the embedded pTIs; it correctly classifies 90.3%, 91.0% and 90.5% of cases for the training, CV and full sets, respectively (using seven/six parameters/variables) (Eq. (15)).

$$ \mathrm{C/nC\text{-}score} = -4.1 - 118.6\,\mathrm{ptr}_{0e} + 80.7\,\mathrm{ptr}_{2e} + 1.4\,\mathrm{ptr}_{3e} + 100.3\,\mathrm{pSe} - 101.4\,\mathrm{pJe} + 39.7\,\mathrm{pX2e} \qquad (15) $$

In order to evaluate whether a protein is HBC- or HCC-related, it is necessary to use the HBC or HCC probability inside the pTI products. The classification values obtained for the individual equations are presented in Table 3. The equations obtained are the following:

$$ \mathrm{HBC/nHBC\text{-}score} = -5.6 - 0.3\,\mathrm{tr}_{0e} + 0.8\,\mathrm{tr}_{2e} + 0.6\,\mathrm{tr}_{3e} + 0.2\,\mathrm{X2e} \qquad (15a) $$

$$ \mathrm{HCC/nHCC\text{-}score} = -5.6 - 0.2\,\mathrm{tr}_{0e} + 0.5\,\mathrm{tr}_{2e} $$

Table 1. Single amino acid mutations and the corresponding log-odd PAM250 values (columns: original AA, mutated AA, log-odd PAM250, notation).
Table 2. Training/predicting accuracies of cancer (C)/non-cancer (nC) models using embedded (E) and non-embedded (nE) star graph TIs, pTIs and dTIs (columns include cross-validation, total and number of equation variables).

Eqs.
(14), (15) show similar results when the input data contain the probability of cancer (products with TIs) or the TI averages for each type of cancer (differences with TIs). In general, in the case of embedded, non-embedded and both indices, we obtained better results with dTIs than with pTIs (not mixed with the original TIs). This difference can be explained by a better recovery of the cancer-related protein sequence information in the case of the differences between the original TIs and their average for each type of cancer (dTIs), compared with the products of the original TIs and the cancer type probability (pTIs). Thus, we can conclude that the average star graph structure for each type of cancer (dTIs) describes the present QPDR model better than the composition of the data sets for each type of cancer, which generates the cancer probabilities. In addition, Table 2 shows that better results are obtained using the original TIs together with the derived ones (pTIs and dTIs) than with the isolated TIs/pTIs/dTIs. This difference can be explained by the fact that each set of indices can contain different parts of the cancer-related protein information. Therefore, using all these indices sums all this information into a better QPDR model. Another interesting aspect is the type of indices (original or derived from the original) that are more frequent in all models presented in Table S2 of the supplementary material. Thus, we can observe the importance of the Wiener index (W) and the Kier-Hall connectivity index X5 for the models based on the non-embedded TIs. The embedded TI models more frequently contain the trace of the graph/sequence connectivity matrices tr 3 and the non-trivial part of the Schultz TI S (W is based on the distance matrix, X5 and S on the degree matrix, and tr 3 on the connectivity matrix).
The most important type of index that is present in both embedded and non-embedded TI equations is J, the Balaban distance connectivity index, based on the node distance degree information. In order to compare two equations with the same number of TIs, we have chosen the embedded models with pTIe and {TIe, pTIe} that contain six variables and removed the common terms (based on tr 3, S and J). Thus, we can observe that the addition of the TIe to the pTIe shifts the preference from the low-order traces (p tr 0e, p tr 2e) and the Kier-Hall index (pX2e) to the high-order trace (tr 5e), the Harary number (He) and the Gutman TI (S6e).

Table 3. Accuracy of input-coded multi-target and individual HBC and HCC classification models based on the embedded TIs (TIe+dTIe and pTIe).

The first embedded TIs & dTIs model was chosen to estimate the cancer probability for mutants of non-cancer-related proteins. These values were analysed with tw-JCA using 61 mutated proteins and 17 types of single amino acid mutations. In the case of HBC, we obtained 215 data groups, called input blocks. To detect the regions of larger variability (mutants), we computed a tw-JCA partition of the input blocks (rearrangement of blocks), setting the threshold value of variability at StDv/2 (see Fig. 2). The value obtained was 0.059. The 215 input blocks are regrouped, by similarity, into 11 output blocks (see Tables S3 and S4 in the supplementary material). We can observe that the proteins numbered 24 to 48 are very susceptible to becoming HBC-related proteins for all studied mutations. The plot corresponding to the reduced values of the reordered data matrix (Table S4) is presented in Fig. 3. On the other hand, we carried out the same study for the HCC mutated proteins and found different susceptible proteins, with a visibly lower probability of being HCC-related (Fig. 4).
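The generation of mutants for a given type of single amino acid mutation can be sketched in Python as follows. The text does not state whether a mutation type is applied at one position or at every occurrence of the original residue; the sketch assumes one mutant per occurrence, and the function name `single_point_mutants` and the toy sequence are hypothetical.

```python
def single_point_mutants(sequence, original, mutated):
    """Return all sequences that differ from `sequence` by replacing
    exactly one occurrence of `original` with `mutated`
    (one-letter amino acid codes)."""
    return [
        sequence[:i] + mutated + sequence[i + 1:]
        for i, aa in enumerate(sequence)
        if aa == original
    ]


# Toy example with the valine-to-isoleucine (V-I) mutation type.
print(single_point_mutants("AVGV", "V", "I"))
```

Each mutant sequence would then be converted into star graph TIs and scored with the chosen classification model, giving the probabilities that enter the tw-JCA analysis.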
The 184 input blocks were regrouped, by similarity, into 11 output blocks (StDv/2 = 0.050) (see Tables S5 and S6 in the supplementary material). The reduced data from Table S6 are presented as a plot in Fig. 5. The tw-JCA partition obtained in this way is statistically significant, as reported by other authors who used this method to reach similar goals (Ferino et al., 2008).

Fig. 6. Graphical representation of two-way joining cluster analysis of the probability of the mutated HBC-related proteins to turn into non-cancer proteins.
Fig. 7. Graphical representation of two-way joining cluster analysis of the probability of the mutated HCC-related proteins to turn into non-cancer proteins.

One interesting non-cancer protein chain is 1QRK B, the human coagulation factor XIII with strontium bound in the ion site (Fox et al., 1999), with eight single amino acid mutations that present an HBC probability of up to 71%, as follows: 70.8% for V-L, 68.8% for V-I, 62.0% for L-I, 59.3% for D-N, 58.3% for E-Q, 55.9% for F-L, 54.8% for E-D and 51.0% for V-M. The most persistent mutation (log-odd PAM250 = 4), valine (V) to isoleucine (I), can be considered the most dangerous one. The main calcium/strontium binding site within each monomer involves the main chain oxygen atom of Ala-457, and also the side chains of several other residues. The mutations of Glu (E) to Q and D can affect the capacity to bind metals and the normal biological activity. Coagulation factor XIII is a transglutaminase which stabilizes blood clots by covalently crosslinking fibrin, and is essential for normal haemostasis. FXIII deficiency due to genetic mutations results in a life-long bleeding disorder with added complications in wound healing and tissue repair (Anwar et al., 1998).
In addition, the abundant fibrinogen present in the tumor connective tissue might contribute to the structural integrity of breast or colon tumor tissues (Costantini et al., 1991; Takahashi et al., 2000; Yee et al., 1994). We can observe that, in general, the natural mutations with higher PAM250 values are less frequent even for 1QRK B (Y-F, with a PAM250 of 7, is absent), because we cannot establish a direct relation between the PAM250 natural amino acid mutation frequency and the influence of the mutations in these types of cancer.

The probability for a cancer-related protein to turn into a non-cancer one was studied too. For each type of cancer, ten HBC/HCC-related proteins were mutated using the same PAM250 values. The tw-JCA plots are presented in Fig. 6 (for HBC) and Fig. 7 (for HCC), and correspond to the data in Tables S7 and S8 of the supplementary material. The results did not show an important probability of obtaining an HBC/HCC-related protein by using single PAM250 natural mutations. Activin beta E (INHBE, C_5) has the highest probability (around 50%) of turning into an HBC-related protein after almost all the mutations (Fig. 6 and Table S7).

This study proposes two cancer/non-cancer input-coded multi-target classification models for HBC and HCC using the star network TIs of the protein amino acid sequences. The results show the excellent predictive ability (90.0%) of the simple and fast star network TIs and GDA linear statistical models for the protein set studied here. In addition, the cancer probability was predicted for mutated proteins. The human coagulation factor XIII (1QRK B), which normally is not HBC-related, can become an HBC-related protein if it undergoes several mutations. This work can help in oncology proteomics or serve as a model for other studies. In addition, the S2SNet application demonstrates its capacity to transform simple protein sequences into TIs and to serve as the basis of numerous protein studies.
Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L
Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E
Kinetic studies with the nonnucleoside HIV-1 reverse transcriptase inhibitor U-88204E
Steady-state kinetic studies with the polysulfonate U-9843, an HIV reverse transcriptase inhibitor
The benzylthio-pyrimidine U-31,355, a potent inhibitor of HIV-1 reverse transcriptase
New splicing mutations in the human factor XIIIA gene, each producing multiple mutant transcripts of varying abundance
Distribution moments of 2D-graphs as descriptors of DNA sequences
Bond-based 3D-chiral linear indices: theory and QSAR applications to central chirality codification
Fibrinogen deposition without thrombin generation in primary human breast cancer tissue
Prediction of the subcellular location of apoptosis proteins
Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo amino acid composition
Graphical rules in steady and non-steady enzyme kinetics
Review: applications of graph theory to enzyme kinetics and protein folding kinetics. Steady and non-steady state systems
Graphical rules for enzyme-catalyzed rate laws
Graphical rules for non-steady state enzyme kinetics
Recent progress in protein subcellular location prediction
Cell-PLoc: a package of web-servers for predicting subcellular localization of proteins in various organisms
Diagrammatization of codon usage in 339 HIV proteins and its biological implication
Prediction of protein structural classes
Review: steady-state inhibition kinetics of processive nucleic acid polymerases and nucleases
Do antisense proteins exist?
A model of evolutionary change
Topological Indices and Related Descriptors in QSAR and QSPR. Gordon and Breach
The community structure of human cellular signaling network
Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network
Predicting enzyme class from protein structure without alignments
Prediction of protein function in the absence of significant sequence similarity
3D connectivity indices in QSPR/QSAR studies
Using spectral moments of spiral networks based on PSA/mass spectra outcomes to derive quantitative proteome-disease relationships (QPDRs) and predicting prostate cancer
Identification of the calcium binding site and a novel ytterbium site in blood coagulation factor XIII by X-ray crystallography
A novel fingerprint map for detecting SARS-CoV
3D-QSAR study for DNA cleavage proteins with a potential anti-tumor ATCUN-like motif
Medicinal chemistry and bioinformatics: current trends in drugs discovery with networks topological indices
Discriminating prostate cancer patients from control group with connectivity indices
ANN-QSAR model for selection of anticancer leads from structurally heterogeneous series of compounds
Cancer statistics
Using the concept of Chou's pseudo amino acid composition to predict apoptosis proteins subcellular location: an approach by approximate entropy
Drawing Graphs with Dot
Pattern recognition in chemistry
Kinetic analysis by a recursive rate equation
Using pseudo amino acid composition to predict protein subnuclear location with improved hybrid approach
Graphical approach to analyzing DNA sequences
Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping triplets of nucleotide bases
The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition
3D-chiral quadratic indices of the 'molecular pseudograph's atom adjacency matrix' and their application to central chirality codification: classification of ACE inhibitors and prediction of sigma-receptor antagonist activities
A radial-distribution-function approach for predicting rodent carcinogenicity
S2SNet-Sequence to Star Network
Predicting protein structural class with AdaBoost Learner
Unified QSAR approach to antimicrobials. Part 3: first multi-tasking QSAR model for input-coded prediction, structural back-projection, and complex networks clustering of antiprotozoal compounds
New 3D graphical representation of DNA sequence based on dual nucleotides
Condensed representation of DNA primary sequences
On a four-dimensional representation of DNA primary sequences
Characterization of DNA primary sequences based on the average distances between bases
On 3-D graphical representation of DNA primary sequences and their numerical characterization
On representation of proteins by starlike graphs
wxPython in Action
The consensus coding sequences of human breast and colorectal cancers
STATISTICA (data analysis software system), version 6
Tissue transglutaminase, coagulation factor XIII, and the pro-polypeptide of von Willebrand factor are all ligands for the integrins alpha 9beta 1 and alpha 4beta 1
Handbook of Molecular Descriptors
Discriminant Analysis for Activity Prediction
QSAR model for alignment-free prediction of human breast cancer biomarkers based on electrostatic potentials of protein pseudofolding HP-lattice networks
A new nucleotide-composition based fingerprint of SARS-CoV with visualization analysis
Cancer, World Health Organization
Cellular automata as models of complexity
Digital coding of amino acids based on hydrophobic index
Using cellular automata to generate image representation for biological sequences
An application of gene comparative image for predicting the effect on replication ratio by HBV virus gene missense mutation
A probability cellular automaton model for hepatitis B viral infections
Using cellular automata images and pseudo amino acid composition to predict protein subcellular location
Three-dimensional structure of a transglutaminase: human blood coagulation factor XIII
Analysis of codon usage in 1562 E. coli protein coding sequences
Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern
Using Chou's amphiphilic pseudoamino acid composition and support vector machine for prediction of enzyme subfamily classes

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.jtbi.2008.11.017.