authors: Lee, Bum Ju; Shin, Moon Sun; Oh, Young Joon; Oh, Hae Seok; Ryu, Keun Ho
title: Identification of protein functions using a machine-learning approach based on sequence-derived properties
date: 2009-08-09
journal: Proteome Sci
DOI: 10.1186/1477-5956-7-27

BACKGROUND: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.

RESULTS: A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.

CONCLUSION: We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties.
The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our features and traditional features may support the creation of a discriminative feature set for specific protein functions.

The need to analyse the massive accumulation of biological data generated by high-throughput human genome projects has stimulated the development of new and rapid computational methods. Computational approaches for predicting and classifying protein functions are essential in determining the functions of unknown proteins in a faster and more cost-effective manner, because experimentally determining protein function is both costly and time-consuming. Approaches based on sequence and structure comparisons play an important role in predicting and classifying the function of unknown proteins. Generally, if an unknown gene or protein sequence is identified, researchers may carry out a sequence similarity search using BLAST [1], PSI-BLAST [2], or FASTA [3] to find similar proteins or annotation information in public databases. However, proteins that have diverged from a common ancestral gene may have the same function but different sequences [4, 5]. As a result, sequence similarity-based approaches are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak (called the "twilight zone" or "midnight zone") [6-12].
Thus, researchers should be cautious when using this approach, because its efficiency is limited by the availability of annotated sequences in public databases and a high-similarity BLAST search result does not always imply homology [13, 14]. Structure similarity-based approaches such as Deli [15] or MATRAS [16], both of which use structure comparisons, have routinely been used to identify proteins with similar structures, because protein structural information is better conserved than sequence information [17, 18]. Nevertheless, proteins with the same function can have different structures, and a structural comparison through structure determination is more difficult than a sequence comparison [9, 19, 20].

Recently, several researchers have developed methods for classifying and predicting protein function independent of sequence or structural alignment [5-9]. Rather than making predictions based on direct sequence or structural comparisons, these approaches use various features to predict protein function, such as protein length, molecular weight, number of atoms, grand average of hydropathicity (GRAVY), amino acid composition, periodicity, physicochemical properties, predicted secondary structures, subcellular location, sequence motifs or highly conserved regions, and annotations in protein databases. These features include statistics extracted from the protein sequence, structure, and annotation. In addition, to obtain good predictive power, various machine-learning algorithms such as support vector machines (SVMs), neural networks, naïve Bayes classifiers, and ensemble classifiers have been used to build classification and prediction models. Of these, the most widely used machine-learning algorithm for classification and prediction of protein function is SVM [5-7, 22, 26, 27, 29, 31, 36, 39-43, 45].
A method for classifying the functions of homodimeric, drug absorption, drug delivery, drug excretion, and RNA-binding proteins, among others, has been proposed by Cai et al. [29]. Classification of protein function was performed using an SVM statistical learning algorithm, based on features such as amino acid composition, hydrophobicity, solvent accessibility, secondary structure, surface tension, charge, polarisability, polarity, and normalised van der Waals volume. Cai et al. [29] found that the testing accuracy of protein classification was in the range 84-96% and suggested that amino acid composition, hydrophobicity, polarity, and charge play more critical roles than other features. Recently, Tung et al. [40] proposed a prediction method for ubiquitylation sites using three datasets (amino acid identity, physicochemical properties, and evolutionary information) and three machine-learning algorithms (k-nearest neighbour, SVM, and naïve Bayes). The greatest accuracy (72.19%) was obtained using SVM with 531 physicochemical properties as features. Moreover, accuracy improved from 72.19% to 84.44% when 31 physicochemical properties were selected and used based on feature selection by an informative physicochemical property mining algorithm (IPMA). In addition, Li et al. [45] demonstrated the ability of the SVM prediction method to identify potential drug targets. On the basis of amino acid composition, hydrophobicity, polarity, polarisability, charge, solvent accessibility, and normalised van der Waals volume, they obtained an accuracy of 84% in predicting known drug targets versus putative nondrug targets. In that study, the performance of the SVM model did not change significantly with a greater number of negative targets, as determined from experiments using various ratios of negative to positive samples.

Protein function has been predicted using naïve Bayes classifiers in several studies [21, 49, 50].
The FEATURE framework for recognition of functional sites in macromolecular structures was developed by Halperin et al. [49]. They used naïve Bayes classification to find and weigh the most informative properties that distinguish sites from nonsites, using a large number of physicochemical properties. In another study, Gustafson et al. [50] suggested a classification method for identification of genes essential to survival by using naïve Bayes classification. They focused on easily obtainable features such as open reading frame (ORF) size, upstream size, phyletic retention, amino acid composition, codon bias, and hydrophobicity, and they concluded that the best performing feature was phyletic retention or the presence of an orthologue in other organisms.

Neural networks are also frequently used for function prediction [23-25, 36]. For example, six enzyme classes and enzymes/nonenzymes were predicted by Jensen et al. [24], based solely on a few meaningful features such as O-β-GlcNAc sites, N-linked glycosylation, secondary structure, and physicochemical properties. Function prediction was carried out using a neural network, and certain meaningful features, such as differences in secondary structures between enzymes and nonenzymes, were analysed. The discriminative ability of each feature was represented by its correlation coefficient.

Ensemble classifiers have recently become popular for protein function classification [25, 35, 46, 47, 51-55]. Zhao et al. [35] suggested that no single-classifier method can always outperform other methods and that ensemble classifier methods outperform other classifier methods because they use various types of complementary information. In comparing the performance of classifiers for predicting glycosylation sites, Caragea et al. [52] demonstrated that ensembles of SVM classifiers outperformed single SVM classifiers. In addition, Guan et al.
[51] illustrated the benefits of using an SVM-based ensemble framework by analysing the performance of ensembles of three classifiers: a single SVM, a hierarchically corrected combination of SVMs, and naïve Bayes classifiers. Ge et al. [53] provided evidence that ensemble classifiers outperform single decision tree classifiers by comparing C4.5 with several ensemble classifiers (i.e. random forest, stacked generalisation, bagging, AdaBoost, LogitBoost, and MultiBoost) for classification of premalignant pancreatic cancer mass-spectrometry data. A prediction method using a domain-based random forest of decision trees to infer protein-protein interactions (PPIs) was proposed by Chen et al. [46]; in experiments using a Saccharomyces cerevisiae dataset, they showed that the random-forest method achieved higher sensitivity (79.78%) and specificity (64.38%) than maximum likelihood estimation (MLE).

These previous studies exploit direct relationships between basic protein properties and their functions to predict protein function without consideration of sequence or structural similarities. Among these studies, a wide variety of features or only a few meaningful features were selected to increase the performance of function prediction based on experience or a few heuristics. However, in most of the previous studies, features that represent subtle distinctions in small portions of protein sequences have not been sufficiently represented. Although proteins share similar structural organisations, biological properties, and sequences, small changes in amino acids of a protein sequence can result in different functions [19, 20, 56]. Although local information such as the presence of motifs or highly conserved regions is useful for function prediction, motif detection is itself an arduous task. Therefore, in this study, a method was developed to detect small changes in amino acids within a sequence.
The method described here is characterised by three primary features designed to address specific problems inherent in protein function prediction. First, this approach was designed to accurately predict various protein functions over a broad range of cellular components, molecular functions, and biological processes without using sequence or structural similarity information. Second, this study was designed to determine whether the use of feature selection improves prediction performance for various protein functions. In other words, does the use of more features related to the protein sequence increase the accuracy of prediction? Third, this study was designed to determine whether local information for the protein sequence is meaningful in predicting protein function, and if so, to determine which features are correlated with protein function.

In summary, a highly accurate prediction method capable of identifying protein function is proposed. One of the advantages of this method is that it requires only the protein sequence for feature extraction; information on predicted features or structural properties is not required. In addition, four features that represent subtle differences in local regions of the protein sequence (differences due to positively and negatively charged residues) are introduced. A total of 484 features, including 451 traditional features and 33 features introduced in this study, were used to predict 11 protein functions. We applied two machine-learning algorithms (i.e. SVM and random forests) with and without feature selection to the data set to predict protein function. The prediction performance for each protein function was evaluated, and the features most relevant to prediction of specific protein functions were determined.
To predict the functions of a variety of proteins from a broad range of cellular components, molecular functions, and biological processes, 16,618 positive sample sequences and 35,619 negative sample sequences were collected to form the dataset. The dataset included positive and negative samples for 11 protein functions selected from the Swiss-Prot database [57] using the SRS program [58]. Positive samples are sample sequences associated with a specific protein function and were labelled as belonging to that class of proteins. Negative samples for each protein class were selected from proteins that do not belong to that class and from enzyme families such as oxidoreductases, hydrolases, lyases, and isomerases. For example, the composition of the negative sample set for fatty acid metabolism is shown in Table 1. Proteins that consisted of fewer than 30 amino acids were excluded from the dataset. In the feature extraction step, proteins that had missing values were also excluded. For hypothetically or automatically annotated sequences in protein databases such as GenBank and Swiss-Prot, the percentage of incorrect annotations is not known, because the annotations do not include a description of the specific methodology used for sequence analysis; this sometimes yields incorrect search results [14]. However, Swiss-Prot incorporates corrections provided by user forums and communities [13]; therefore, protein sequences from the Swiss-Prot database were used in the present study.

A total of 484 features were extracted solely from the protein sequences described in this study. These features included traditional features adopted from previous studies [7, 30-32] and new features extracted using the novel method developed in this study. Among the traditional features, 34 were extracted from the Swiss-Prot protein sequences [57] using the ProtParam tool [59].
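The two exclusion rules stated for the dataset (sequences shorter than 30 amino acids, and sequences whose extracted features contain missing values) can be illustrated with a minimal Python sketch. The function names `filter_samples` and `is_missing`, the record layout of `(identifier, sequence)` pairs, and the convention that a missing value appears as `None` or NaN are all assumptions made for illustration; the text does not describe how the filtering was implemented.

```python
import math

MIN_LENGTH = 30  # threshold stated in the text: shorter proteins are excluded


def is_missing(value):
    """Assumed convention: a missing feature value is None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def filter_samples(records, feature_fn):
    """Apply both exclusion rules to an iterable of (identifier, sequence) pairs.

    feature_fn is any hypothetical function mapping a sequence to a list of
    feature values. Returns a list of (identifier, sequence, features).
    """
    kept = []
    for seq_id, seq in records:
        if len(seq) < MIN_LENGTH:
            continue  # rule 1: fewer than 30 amino acids
        features = feature_fn(seq)
        if any(is_missing(v) for v in features):
            continue  # rule 2: a feature could not be computed
        kept.append((seq_id, seq, features))
    return kept
```

Applying the length rule before feature extraction avoids computing features for sequences that would be discarded anyway; the order is a design choice of this sketch, not something the text specifies.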
Traditional features consisted of amino acid composition, protein length, number of atoms, molecular weight, GRAVY, and theoretical isoelectric point (pI), among others. In addition, two ways of using positively charged residues (i.e. lysine and arginine or histidine, lysine, and arginine), the percent composition of each amino acid pair, and 17 features based on physicochemical properties (i.e. 16 properties adopted from Syed et al. [20] and one additional property) were calculated. The importance of negatively/positively charged residues in protein function has been described in several studies [20, 23, 24, 38, 60] . The 20 standard amino acids are divided into negatively charged residues, positively charged residues, and neutral residues according to their pI. Negatively charged residues (aspartic acid and glutamic acid) have lower pIs, while positively charged residues (arginine and lysine) have higher pIs. Oppositely charged residues attract, while similarly charged residues repel each other. To account for subtle differences that occur in small regions of the protein sequences, features representing the percentage change in charged residues as well as the distribution of charged residues were designed and computed. The method used to identify these new features is simple. PPR was calculated using the following equation: where #AA is the total number of amino acids in a sequence and #PP is the total number of continuous changes from one positively charged residue to the next positively charged residue in each protein sequence. Similar to PPR, NNR was calculated using the following equation: where #NN is the total number of continuous changes from one negatively charged residue to the next negatively charged residue. PNPR was calculated using the following equation: where #PNP is the total number of continuous changes from a positively charged residue to the next negatively charged residue or vice versa. 
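The equations for PPR, NNR, and PNPR do not appear in this text, so the following Python sketch shows only one reading that is consistent with the definitions of #AA, #PP, #NN, and #PNP. It rests on two assumptions that the text does not confirm: a "change" is counted between successive charged residues after skipping any neutral residues between them, and each ratio is the corresponding count divided by #AA and expressed as a percentage. The function name `charge_transition_ratios` and its parameters are hypothetical.

```python
def charge_transition_ratios(seq, positive="KR", negative="DE"):
    """Return (PPR, NNR, PNPR) as percentages of the sequence length.

    Assumption 1: transitions are counted between successive charged
    residues, ignoring neutral residues in between.
    Assumption 2: ratio = count / #AA * 100.
    Passing positive="HKR" gives the alternative definition of positively
    charged residues mentioned in the text.
    """
    seq = seq.upper()
    n_aa = len(seq)
    if n_aa == 0:
        raise ValueError("empty sequence")

    # Reduce the sequence to its charge pattern, e.g. ['P', 'P', 'N', 'N', 'P'].
    charges = []
    for aa in seq:
        if aa in positive:
            charges.append("P")
        elif aa in negative:
            charges.append("N")

    pp = nn = pnp = 0
    for current, following in zip(charges, charges[1:]):
        if current == following == "P":
            pp += 1   # positive -> positive
        elif current == following == "N":
            nn += 1   # negative -> negative
        else:
            pnp += 1  # positive -> negative or negative -> positive

    return tuple(100.0 * count / n_aa for count in (pp, nn, pnp))
```

For the ten-residue sequence `KRADEAKAAA`, the charge pattern is P, P, N, N, P, which gives one PP, one NN, and two PNP transitions under these assumptions. If "continuous" is instead read as strictly adjacent positions, the counting loop would run over neighbouring residues of the full sequence rather than over the reduced charge pattern.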
Finally, Dist (x, y) is the distribution function for PP, NN, or PNP in the interval from x to y in the sequence, with the stipulation that x