key: cord-0024475-mjrfz9n1
authors: Jia, Yuran; Huang, Shan; Zhang, Tianjiao
title: KK-DBP: A Multi-Feature Fusion Method for DNA-Binding Protein Identification Based on Random Forest
date: 2021-11-29
journal: Front Genet
DOI: 10.3389/fgene.2021.811158
sha: 9b7eaf6760abbbc017f69d2d97eb14ac441bf46d
doc_id: 24475
cord_uid: mjrfz9n1

A DNA-binding protein (DBP) is a protein with a special DNA-binding domain that is associated with many important molecular biological mechanisms. The rapid development of computational methods has made it possible to predict DBP on a large scale; however, existing methods do not fully integrate DBP-related features, which limits their prediction accuracy. In this article, we develop a DNA-binding protein identification method called KK-DBP. To improve prediction accuracy, we propose a feature extraction method that fuses multiple PSSM features. The experimental results show a prediction accuracy of 81.22% on the independent test dataset PDB186, which is the highest among the existing methods compared.

Proteins are spatially structured substances formed when amino acids are joined into polypeptide chains by dehydration condensation and the chains then fold into complex structures. Proteins are the material basis of life, and they are required for every vital activity. Given the vast number of proteins and their roles, protein classification has always been central to the study of proteomics. DNA-binding proteins (DBP) are a very specific class of proteins whose specific binding to DNA guarantees the accuracy of biological processes and whose nonspecific binding to DNA guarantees the high efficiency of biological processes (Gao et al., 2008). DNA-protein interactions, which underlie processes such as gene expression and transcriptional regulation, occur ubiquitously throughout the biological activities of living organisms (Shen and Zou, 2020; Xu et al., 2021a). All of these interactions are tightly linked to DBP, and approximately 6-7% of eukaryotic genes encode DNA-binding proteins.
The role of DBP in biological activities has gained a lot of attention in recent years, as various large genome projects and research on DBP identification have rapidly progressed. However, identifying DBP using traditional biochemical analyses is inefficient and expensive (Li and Li, 2012; Xu et al., 2021b). In recent years, machine learning methods have been widely used in the field of bioinformatics (Jiang et al., 2013; Geete and Pandey, 2020; Tao et al., 2020; Wang et al., 2021a; Long et al., 2021). Using machine learning methods for DNA-binding protein identification enables rapid and accurate prediction of DBP from a large number of proteins while drastically reducing prediction costs (Fu et al., 2018). Because proteins are numerous and diverse, solving every classification and prediction problem with a single method is difficult, if not impossible (Wang et al., 2021b). Therefore, we must continue to propose effective methods for high-quality DBP prediction and identification in order to understand more vital activities and to promote further progress within the bioinformatics field.

Feature extraction methods can be broadly classified into two categories: those based on structural information and those based on sequence information (Kim et al., 2004; Meng and Kurgan, 2016; Qu et al., 2019; Ao et al., 2021a; Lv et al., 2021a; Liu et al., 2021; Tang et al., 2021; Wu and Yu, 2021). Stawiski et al. (2003) proposed a model based on protein structure that uses a neural network incorporating information such as residue and hydrogen bond potential. Liu et al. (2014) developed a model called iDNA-Prot|dis, based on the pseudo amino acid composition (PseAAC) of protein sequence information. iDNAPro-PseAAC (Liu et al., 2015), which uses a similar feature extraction method, adopts a prediction model based on a support vector machine to predict DBP.
iDNA-Prot (Lin et al., 2011) was constructed based on physicochemical properties and random forest (RF) classification. In addition, a support vector machine model based on k-mer and autocovariance transformation was proposed by Liu et al. (2016). Local-DPP (Wei et al., 2017a) used random forests based on PSE-PSSM features to predict DBP. MK-FSVM-SVDD is a multiple kernel SVM prediction tool, based on heuristic kernel alignment, developed to identify DBP. In addition, two models for predicting DBP were developed: DNA-Prot (Kumar et al., 2009) and DNAbinder (Kumar et al., 2007). Lu et al. (2020) developed a prediction model for DBP based on support vector machines using Chou's five-step rule.

Currently, a number of DNA-binding protein prediction methods based on different strategies exist. Unfortunately, most of these DBP prediction methods fail to extract features based on evolutionary information, so their robustness and prediction accuracy have much room for improvement. To address these issues, more research is needed on feature extraction and the selection of classifiers (Zuo et al., 2017; Zheng et al., 2019).

In this paper, we propose a new DNA-binding protein prediction method called KK-DBP. We first obtained the position-specific scoring matrix (PSSM) of the protein sequence for each sample used to train the model. PSSM information was then used to extract three features of each sample: PSSM-COMPOSITION (Zou et al., 2013), RPSSM (Ding et al., 2014) and AADP-PSSM (Liu et al., 2010), which were combined to form the initial feature set of each sample. The initial feature set of each sample has 930 dimensions (400 + 110 + 420). To avoid feature redundancy and improve prediction accuracy, KK-DBP used the max relevance max distance (MRMD) (Zou et al., 2016) feature ranking method to establish the optimal feature subset for model training.
Finally, a new DBP prediction model was constructed using the random forest learning method. The complete method framework is shown in Figure 1.

Figure 1. Method framework of KK-DBP. Step A: Construction of position-specific scoring matrices for protein sequences. Step B: Extraction of three features: AADP-PSSM, PSSM-COMPOSITION, and RPSSM as the initial feature set for a single sample. Step C: Feature ranking and selection using the MRMD algorithm. Step D: Identification of DBP using random forests.

The dataset is one of the key factors determining the quality of a predictive model and is the cornerstone of machine learning; because it directly affects the final performance of the model, dataset construction must be meticulous (Liang et al., 2017; Su et al., 2021). Many prediction models for DNA-binding proteins have been proposed by other researchers, so we use established benchmark datasets that allow an objective comparison with existing methods. In the present study, we used protein sequences from the PDB database as our training dataset and test dataset. Table 1 shows the contents of the dataset: the training set PDB1075 contains 525 DNA-binding proteins and 550 non-DNA-binding proteins, and the test set PDB186 contains 93 DNA-binding proteins and 93 non-DNA-binding proteins. The dataset construction rules are as follows:

$$S = S^{+} \cup S^{-}$$

where $S^{+}$ is the positive subset containing only DNA-binding proteins, and $S^{-}$ is the negative subset containing only non-DNA-binding proteins.

Feature extraction is critical when modelling sequence classification because it directly affects the accuracy of predictive models (Zhang et al., 2020a; Lv et al., 2021b). Evolutionary information is among the most important information we have regarding protein function and genetics (Zuo et al., 2014). Position-specific scoring matrices (PSSM) can intuitively display protein evolutionary information. Thus, feature extraction methods based on PSSM are widely used in protein classification. In 1990, Altschul et al.
(Altschul et al., 1990) proposed the BLAST algorithm. Given a protein sequence, BLAST and its iterative variant PSI-BLAST align it against a sequence database, and the resulting alignment yields a position-specific scoring matrix (PSSM) that represents the evolutionary information of the protein. To improve prediction accuracy, our method predominantly utilises protein evolutionary information to extract features. For the training and test sets used in our method, the PSSM of each sequence was generated by three PSI-BLAST iterations with an E-value of 0.001. The PSSM is a matrix of size L × 20, where L is the length of the protein sequence and 20 is the number of amino acid types. The entry at coordinates (i, j), denoted $p_{i,j}$, is the log-odds score for the residue at position i of the sequence being replaced by amino acid type j. A positive value indicates that the substitution occurs in the alignment more often than expected by chance, whereas a negative value indicates that it occurs less often; the more negative the value, the less likely the substitution. This numerical pattern reflects the mutation propensity of each residue in a given protein sequence. Its matrix form is as follows:

$$\mathrm{PSSM} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{L,1} & p_{L,2} & \cdots & p_{L,20} \end{bmatrix}$$

Reduced Position-Specific Scoring Matrices and Position-Specific Scoring Matrices-Composition

PSSM-COMPOSITION is generated by summing the rows of the original PSSM that correspond to the same amino acid type, dividing by the sequence length and scaling to [-1, 1]. For the PSSM of each protein sequence, a 400-dimensional feature vector $\{d_1, d_2, d_3, \ldots, d_{400}\}$ is generated.

Li et al. (2003) first proposed that 10 might be the minimum number of residue types (letters) needed to construct a reasonably folded model. Reduced PSSM (RPSSM) borrowed this idea and simplified the original PSSM of form L × 20 to one of form L × 10. Let $a_1 a_2 \ldots a_L$ be a protein in the dataset. Assuming that $a_i$ is mutated to the reduced amino acid type $s$, $p_{i,s}$ represents the pseudo composition component of amino acid $a_i$. The pseudo composition of all amino acids in protein $a_1 a_2 \ldots a_L$ is defined as:

$$D_s = \frac{1}{L} \sum_{i=1}^{L} \left(p_{i,s} - \bar{p}_s\right)^2, \qquad \bar{p}_s = \frac{1}{L} \sum_{i=1}^{L} p_{i,s}, \qquad s = 1, 2, \ldots, 10;\ i = 1, 2, \ldots, L$$

The dipeptide composition was later incorporated into the RPSSM method in order to overcome its inability to extract full sequence information. Assuming that $a_{i+1}$ is replaced by $t$, the dipeptide pseudo composition of $a_i a_{i+1}$ is defined as:

$$x_{i,i+1} = \left(p_{i,s} - \frac{p_{i,s} + p_{i+1,t}}{2}\right)^2 + \left(p_{i+1,t} - \frac{p_{i,s} + p_{i+1,t}}{2}\right)^2$$

where $x_{i,i+1}$ represents the difference of $p_{i,s}$ and $p_{i+1,t}$ from their mean value. Finally, because each protein sequence in the dataset is characterised by the pseudo composition of all of its dipeptides, we can generate a 110-dimensional RPSSM feature vector (10 values $D_s$ and 100 values $D_{s,t}$), with the dipeptide part defined as follows:

$$D_{s,t} = \frac{1}{L-1} \sum_{i=1}^{L-1} x_{i,i+1}, \qquad s, t = 1, 2, \ldots, 10$$

A protein's structure is closely related to its amino acid composition. For every amino acid sequence in the dataset, AADP-PSSM produces a vector with 20 + 400 = 420 dimensions. AADP-PSSM is divided into two parts. The amino acid composition is first extracted from the PSSM: the vector of the 20 column averages of the PSSM is called AAC-PSSM, where $x_j$ corresponds to amino acid type $j$ in the PSSM and represents the average score of residues mutating to that amino acid type during evolution. It is defined as follows:

$$x_j = \frac{1}{L} \sum_{i=1}^{L} p_{i,j}, \qquad j = 1, 2, \ldots, 20$$

The traditional dipeptide composition was later extended to PSSM and represented with DPC-PSSM to avoid the loss of information due to an X in the protein. It is defined as a vector of 400 dimensions:

$$y_{i,j} = \frac{1}{L-1} \sum_{k=1}^{L-1} p_{k,i} \, p_{k+1,j}, \qquad i, j = 1, 2, \ldots, 20$$

Feature redundancy and the curse of dimensionality often arise during feature extraction. Feature selection not only reduces the risk of overfitting but also improves the model's generalization ability and computational efficiency (Yang et al., 2021a; Ao et al., 2021b; Zhao et al., 2021).
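As a rough illustration of how the three PSSM-derived descriptors could be computed, the sketch below builds PSSM-COMPOSITION (400 values), RPSSM (110 values) and AADP-PSSM (420 values) from a toy PSSM in plain Python. The column order `AMINO_ACIDS`, the ten-group reduced alphabet `REDUCED_GROUPS`, the variance-style RPSSM formulas and all function names are assumptions made for illustration, and the final scaling of PSSM-COMPOSITION to [-1, 1] is omitted because the text does not specify the rule. It is not the authors' implementation.

```python
# Assumed PSSM column order (the usual PSI-BLAST ordering).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Assumed ten-group reduced alphabet for RPSSM (illustrative grouping).
REDUCED_GROUPS = ["FYW", "ML", "IV", "ATS", "NH", "QED", "RK", "C", "G", "P"]


def pssm_composition(seq, pssm):
    """Sum PSSM rows by residue type, divide by L, flatten to 400 values."""
    length = len(seq)
    sums = {aa: [0.0] * 20 for aa in AMINO_ACIDS}
    for residue, row in zip(seq, pssm):
        if residue in sums:  # non-standard residues such as X are skipped
            for j, value in enumerate(row):
                sums[residue][j] += value
    return [v / length for aa in AMINO_ACIDS for v in sums[aa]]


def reduce_pssm(pssm):
    """Collapse L x 20 to L x 10 by averaging the columns of each group."""
    index = {aa: j for j, aa in enumerate(AMINO_ACIDS)}
    return [[sum(row[index[aa]] for aa in group) / len(group)
             for group in REDUCED_GROUPS] for row in pssm]


def rpssm(pssm):
    """10 pseudo-composition values followed by 100 dipeptide values."""
    rp = reduce_pssm(pssm)
    length = len(rp)
    means = [sum(row[s] for row in rp) / length for s in range(10)]
    single = [sum((row[s] - means[s]) ** 2 for row in rp) / length
              for s in range(10)]
    # (a - m)^2 + (b - m)^2 with m = (a + b) / 2 simplifies to (a - b)^2 / 2.
    pair = [sum((rp[i][s] - rp[i + 1][t]) ** 2 / 2
                for i in range(length - 1)) / (length - 1)
            for s in range(10) for t in range(10)]
    return single + pair


def aadp_pssm(pssm):
    """AAC-PSSM (20 column means) followed by DPC-PSSM (400 values)."""
    length = len(pssm)
    aac = [sum(row[j] for row in pssm) / length for j in range(20)]
    dpc = [sum(pssm[k][i] * pssm[k + 1][j]
               for k in range(length - 1)) / (length - 1)
           for i in range(20) for j in range(20)]
    return aac + dpc


def fused_features(seq, pssm):
    """Concatenate the three descriptors: 400 + 110 + 420 = 930 values."""
    return pssm_composition(seq, pssm) + rpssm(pssm) + aadp_pssm(pssm)


# Toy usage: a three-residue sequence with a made-up PSSM.
toy_seq = "ACD"
toy_pssm = [[float(i + j) for j in range(20)] for i in range(3)]
toy_vector = fused_features(toy_seq, toy_pssm)
```

Concatenating the descriptors in a fixed order keeps every feature index stable across samples, which matters later when features are ranked and selected by index.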
In the present paper, we use the max relevance max distance (MRMD) feature selection method to reduce the dimensions of the initial feature set (He et al., 2020). In MRMD, feature selection is based primarily on the correlation between the subset and the target vector and on the redundancy of the subset. To measure correlation, MRMD uses the Pearson correlation coefficient, which is defined as:

$$\mathrm{PCC}(X, Y) = \frac{\sum_{k=1}^{N} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{N} (x_k - \bar{x})^2} \sqrt{\sum_{k=1}^{N} (y_k - \bar{y})^2}}$$

where X and Y are two vectors, $x_k$ and $y_k$ are the kth elements in X and Y, $\bar{x}$ and $\bar{y}$ are their means, and N is the total sample number. The initial feature set constructed using this method is $F = \{f_1, f_2, f_3, \ldots, f_{930}\}$. The maximum correlation value $\max MR_i$ between feature $f_i$ and target class vector C is defined as:

$$\max MR_i = \left| \mathrm{PCC}\left(\vec{f_i}, \vec{C_i}\right) \right|, \qquad 1 \le i \le M$$

where M is the initial feature set dimension, $\vec{f_i}$ is the vector composed of the ith feature of each instance, and $\vec{C_i}$ is the vector composed of the target category of each instance. To evaluate the similarity between two vectors, MRMD uses three measures, the Euclidean distance (ED), the cosine similarity (COS) and the Tanimoto coefficient (TC):

$$\mathrm{ED}(X, Y) = \sqrt{\sum_{k=1}^{N} (x_k - y_k)^2}$$

$$\mathrm{COS}(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$

$$\mathrm{TC}(X, Y) = \frac{X \cdot Y}{\lVert X \rVert^2 + \lVert Y \rVert^2 - X \cdot Y}$$

We use the mean of the three measures above as the maximum distance $\max MD_i$ for feature i:

$$\max MD_i = \frac{ED_i + COS_i + TC_i}{3}$$

where $ED_i$, $COS_i$ and $TC_i$ are the values of the three measures for feature i. The MRMD values of all the features are calculated under the above two constraints. The PageRank algorithm is then used to rank the initial feature set from high to low importance. One feature at a time is added to the feature subset, which is used to train the model, in order to determine which subset is the best.

Protein prediction is usually described as a binary classification problem (Zhai et al., 2020; Zhang et al., 2021; Zulfiqar et al., 2021). We selected the random forest learning method for prediction modelling in the present study. Because the random forest method randomly samples features and instances when constructing its set of decision trees, it is well suited to high-dimensional feature spaces.
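The MRMD idea described above, relevance to the class label plus distance from the other features, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: each feature's ED, COS and TC values are taken as averages over all other features, the final score is simply relevance plus mean distance, and features are ranked by sorting on that score rather than by PageRank. The function names are hypothetical and this is not the MRMD tool itself.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0


def mrmd_scores(columns, target):
    """Score each feature column; needs at least two feature columns.

    columns: list of feature vectors, one vector per feature.
    target:  class labels, one per instance.
    """
    m = len(columns)
    scores = []
    for i, f in enumerate(columns):
        relevance = abs(pearson(f, target))
        others = [g for j, g in enumerate(columns) if j != i]
        # Assumption: average each measure over all other features.
        ed = sum(euclidean(f, g) for g in others) / (m - 1)
        cos = sum(cosine(f, g) for g in others) / (m - 1)
        tc = sum(tanimoto(f, g) for g in others) / (m - 1)
        scores.append(relevance + (ed + cos + tc) / 3)
    return scores


def rank_features(columns, target):
    """Feature indices ordered from highest to lowest score."""
    scores = mrmd_scores(columns, target)
    return sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)


# Toy usage: feature 0 tracks the label, feature 1 does not.
toy_columns = [[0.1, 0.2, 0.9, 1.0], [0.5, 0.4, 0.5, 0.4]]
toy_target = [0, 0, 1, 1]
toy_ranking = rank_features(toy_columns, toy_target)
```

Sorting on a single combined score is the simplest way to obtain an ordering; a link-analysis ranking such as PageRank would replace only the last step.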
Using RandomizedSearchCV and GridSearchCV for parameter selection, the final random forest model comprises 800 trees; each decision tree is allowed to use all features and has a maximum depth of 50, with no other limit imposed on tree growth.

We selected four performance measures, accuracy (ACC), specificity (SP), sensitivity (SN) and Matthews correlation coefficient (MCC), to evaluate the predictive ability of the model used in this study (Wei et al., 2014; Wei et al., 2017b; Manavalan et al., 2019a; Manavalan et al., 2019b; Jin et al., 2019; Su et al., 2019; Li et al., 2020a; Liu et al., 2020a; Ao et al., 2020; Li et al., 2020b; Zhang et al., 2020b; Yu et al., 2020; Zhao et al., 2020; Wang et al., 2021c; Zhu et al., 2021). The equations for these four measures are shown below:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{SN} = \frac{TP}{TP + FN}$$

$$\mathrm{SP} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP represents positive samples predicted to be positive by the model, FP represents negative samples predicted to be positive by the model, TN represents negative samples predicted to be negative by the model, and FN represents positive samples predicted to be negative by the model. In addition to the above four performance measures, the ROC curve is also used to assess the quality of our predictions.

A large amount of information on homologous proteins is contained in evolutionarily informative features based on the PSSM. In our method, we selected the evolutionary information-based features PSSM-COMPOSITION, RPSSM, and AADP-PSSM for experimentation. To better show the efficiency of prediction models under different combinations of features, the receiver operating characteristic (ROC) curve was used for analysis. The closer the curve is to the upper-left corner of the plot, the better the classification results. The area under the curve (AUC) is defined as the area enclosed by the ROC curve and the coordinate axes.
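The four threshold-based measures can be computed directly from the confusion-matrix counts. The sketch below is a minimal illustration; the function name `classification_metrics` and the example counts (a balanced test set of 93 positives and 93 negatives) are hypothetical and are not taken from the reported experiments.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return ACC, SN, SP and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)  # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)  # specificity: fraction of negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal total is zero; 0.0 is returned then.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Hypothetical counts for a balanced test set (93 positives, 93 negatives).
example = classification_metrics(tp=91, fp=33, tn=60, fn=2)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

A case like this one shows why MCC is reported alongside ACC: when sensitivity is very high but specificity is modest, MCC is noticeably lower than accuracy because it accounts for all four cells of the confusion matrix.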
The closer the AUC is to 1, the better the prediction model.

Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 811158

Random forests can achieve better prediction performance when dealing with high-dimensional features. In this section, we use random forests with default hyperparameters on the training set PDB1075 for 10-fold cross-validation of different feature fusion schemes, in order to find the scheme that maximizes the AUC. As shown in Figure 2, the prediction performance of RF was best after fusing the three features, with an AUC of 0.963. In addition, we also tested the predictive performance of SVM and KNN under different feature fusion schemes; their optimal feature fusion schemes had AUCs of 0.828 and 0.790, respectively. The ROC curve details of SVM and KNN are given in Figures 1 and 2 of the Supplementary Material, respectively.

For the 930-dimensional initial feature set, we ranked all features from high to low based on MRMD scores. After obtaining the final feature ranking, we took the first feature as the feature subset and used random forest to check the performance of the selected feature subset in 10-fold cross-validation on PDB1075. Subsequently, we added features to the subset one at a time, following the ranking order, and repeated the above process until all the features in the initial feature set were included in the feature subset. Finally, we determined the best predictive accuracy and the optimal feature subset. The results are shown in Figure 3. The feature subset achieves the best accuracy when it contains 267 features, so the optimal feature subset we used for training models is 267-dimensional. The optimal feature subset contains 98 AADP-PSSM features, 142 PSSM-COMPOSITION features, and 27 RPSSM features.
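The incremental search described above, growing the subset one ranked feature at a time and keeping the size with the best score, can be sketched as follows. The function `incremental_feature_selection` and its `evaluate` callback are hypothetical names; in the setting of this paper the callback would return the 10-fold cross-validated accuracy of a random forest on PDB1075, whereas the toy scorer below only stands in for it.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the subset along the ranking and keep the best-scoring prefix.

    ranked_features: feature indices, most important first.
    evaluate:        callable taking a list of indices, returning a score.
    """
    best_score = float("-inf")
    best_size = 0
    history = []
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(list(subset))
        history.append(score)
        if score > best_score:  # strict '>' keeps the smallest best subset
            best_score, best_size = score, len(subset)
    return ranked_features[:best_size], best_score, history


# Toy usage: a stand-in scorer that peaks when three features are used.
toy_ranking = [7, 2, 9, 4, 1]
toy_subset, toy_score, toy_history = incremental_feature_selection(
    toy_ranking, lambda subset: -abs(len(subset) - 3)
)
```

Because only prefixes of the ranking are evaluated, the search needs one model evaluation per feature (930 here) instead of an exhaustive search over all subsets.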
The details of the optimal feature subset are given in the Supplementary Material. The composition of the optimal feature subset suggests that differences in the distribution of amino acid pairs are key to distinguishing DBP from the large pool of other proteins.

To determine the prediction model with the best performance, we fed the optimal feature subset into four widely used classification algorithms with default hyperparameters (KNN, SVM, RF and naïve Bayes) and compared their performance by 10-fold cross-validation, using ACC, SN, SP, MCC and AUC as evaluation measures. The experimental results show that the random forest method achieves the best classification performance (Figure 4).

To evaluate the generalization ability of the prediction model proposed in this paper, we tested the model independently using dataset PDB186. Table 2 compares the performance of this study with that of other prediction methods on PDB186. From Table 2, we can see that on the independent test set PDB186, the ACC, SN and SP of KK-DBP reach 81.2, 97.8 and 64.5%, respectively. In terms of prediction accuracy, KK-DBP is higher than the other existing methods. Compared with Local-DPP, the existing method with the highest accuracy, KK-DBP improves ACC and SN by 2.2 and 5.3 percentage points, respectively. The SP of KK-DBP is slightly lower than that of Local-DPP and iDNA-Prot. The results of the independent test confirm that KK-DBP has reliable predictive performance and can recognize DBP among a large number of unknown proteins more accurately than existing DBP recognition methods.

A large number of studies have shown that the classification of DNA-binding proteins has important theoretical and practical significance for future genomics and proteomics research. This paper proposes a DNA-binding protein prediction method, called KK-DBP, that is based on multi-feature fusion and improves the feature extraction method in DNA-binding protein prediction.
This method uses PSSM features that contain dipeptide composition information for multi-feature fusion to construct the initial feature set, and it obtains the optimal feature subset for modeling with the max relevance max distance (MRMD) method. Finally, PDB186 was used as an independent test set to further evaluate the effectiveness of our method. On the independent test set, the prediction accuracy, sensitivity and specificity of the model reached 81.2, 97.8 and 64.5%, respectively. KK-DBP surpasses the existing methods in prediction accuracy, indicating that our method can identify DBP more accurately than those methods. Although our method improves the prediction accuracy of DNA-binding proteins, we still do not know how to construct a better feature extraction algorithm based on sequence and structure information. Therefore, our future research will focus on finding more discriminative feature extraction algorithms (Ding et al., 2016; Zeng et al., 2020a; Yang et al., 2021b; Wang et al., 2021d; Jin et al., 2021), more suitable classifiers (Ding et al., 2019; Ding et al., 2020a; Ding et al., 2020b; Yang et al., 2021c; Guo et al., 2021) and prediction models (Liu et al., 2020b; Zeng et al., 2020b; Chen et al., 2021; Xu et al., 2021c; Song et al., 2021; Xiong et al., 2021) to better recognise DNA-binding proteins.

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

YJ conceived the algorithm, performed the experiments, analyzed the data, and drafted the manuscript. TZ designed the experiments and revised the manuscript. YJ, SH, and TZ provided suggestions for the study design and the writing of the manuscript. All authors approved the final manuscript.

This work was supported by the Fundamental Research Funds for the Central Universities (2572021BH01) and the National Natural Science Foundation of China (62172087, 62172129).
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.811158/full#supplementary-material

Basic Local Alignment Search Tool
Prediction of Antioxidant Proteins Using Hybrid Feature Representation Method and Random forest
RFhy-m2G: Identification of RNA N2-Methylguanosine Modification Sites Based on Random forest and Hybrid features. Methods
Prediction of Bio-Sequence Modifications and the Associations with Diseases
MUFFIN: Multi-Scale Feature Fusion for Drug-Drug Interaction Prediction
A Protein Structural Classes Prediction Method Based on Predicted Secondary Structure and PSI-BLAST Profile
Identification of Drug-Side Effect Association via Multiple Information Integration with Centered Kernel Alignment
Identification of Drug-Target Interactions via Dual Laplacian Regularized Least Squares with Multiple Kernel Fusion. Knowledge-Based Syst
Identification of Drug-Target Interactions via Fuzzy Bipartite Local Model
Predicting Protein-Protein Interactions via Multivariate Mutual Information of Protein Sequences
Improved DNA-Binding Protein Identification by Incorporating Evolutionary Information into the Chou's PseAAC
DBD-Hunter: a Knowledge-Based Method for the Prediction of DNA-Protein Interactions
Robust Transcription Factor Binding Site Prediction Using Deep Neural Networks
An Efficient Multiple Kernel Support Vector Regression Model for Assessing Dry Weight of Hemodialysis Patients
Discrimination of Thermophilic Proteins and Non-thermophilic Proteins Using Feature Dimension Reduction
MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction
Predicting Human microRNA-Disease Associations Based on Support Vector Machine
DUNet: A Deformable Network for Retinal Vessel Segmentation. Knowledge-Based Syst
Application of Deep Learning Methods in Biological Networks
Protein Structure Prediction and Analysis Using the Robetta Server
DNA-prot: Identification of DNA Binding Proteins from Protein Sequence Information Using Random forest
Identification of DNA-Binding Proteins Using Support Vector Machines and Evolutionary Profiles
DeepATT: a Hybrid Category Attention Neural Network for Identifying Functional Effects of DNA Sequences
DeepAVP: A Dual-Channel Deep Neural Network for Identifying Variable-Length Antiviral Peptides
Reduction of Protein Sequence Complexity by Residue Grouping
Annotating the Protein-RNA Interaction Sites in Proteins Using Evolutionary Information and Protein Backbone Structure
Pro54DB: a Database for Experimentally Verified Sigma-54 Promoters
iDNA-Prot: Identification of DNA Binding Proteins Using Random forest with Grey Model
Identification of DNA-Binding Proteins by Combining Auto-Cross Covariance Transformation and Ensemble Learning
DNA Binding Protein Identification by Combining Pseudo Amino Acid Composition and Profile-Based Protein Representation
iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition
An Improved Anticancer Drug-Response Prediction Based on an Ensemble Method Integrating Matrix Completion and Ridge Regression
Function Determinants of TET Proteins: the Arrangements of Sequence Motifs with Specific Codes
Evaluating DNA Methylation, Gene Expression, Somatic Mutation, and Their Combinations in Inferring Tumor Tissue-Of-Origin
Identification of Novel Key Targets and Candidate Drugs in Oral Squamous Cell Carcinoma
Prediction of Protein Structural Class for Low-Similarity Sequences Using Support Vector Machine and PSI-BLAST Profile
Integrated Biomarker Profiling of the Metabolome Associated with Impaired Fasting Glucose and Type 2 Diabetes Mellitus in Large-Scale Chinese Patients
Use Chou's 5-Step Rule to Predict DNA-Binding Proteins with Evolutionary Information
DeepIPs: Comprehensive Assessment and Computational Identification of Phosphorylation Sites of SARS-CoV-2 Infection Using a Deep Learning-Based Approach
A Sequence-Based Deep Learning Approach to Predict CTCF-Mediated Chromatin Loop
mAHTPred: a Sequence-Based Meta-Predictor for Improving the Prediction of Antihypertensive Peptides Using Effective Feature Representation
Meta-4mCpred: A Sequence-Based Meta-Predictor for Accurate DNA 4mC Site Prediction Using Effective Feature Representation
DFLpred: High-Throughput Prediction of Disordered Flexible Linker Regions in Protein Sequences
A Review of DNA-Binding Proteins Prediction Methods
Basic Polar and Hydrophobic Properties Are the Main Characteristics that Affect the Binding of Transcription Factors to Methylation Sites
The Computational Power of Monodirectional Tissue P Systems with Symport Rules
Annotating Nucleic Acid-Binding Function Based on Protein Structure
Deep-Resp-Forest: A Deep forest Model to Predict Anti-cancer Drug Response
PPD: A Manually Curated Database for Experimentally Verified Prokaryotic Promoters
A Novel Hybrid Feature Selection and Ensemble Learning Framework for Unbalanced Cancer Data Diagnosis with Transcriptome and Functional Proteomic
A Method for Identifying Vesicle Transport Proteins Based on LibSVM and MRMD
DM3Loc: Multi-Label mRNA Subcellular Localization Prediction and Analysis Based on Multi-Head Self-Attention Mechanism
Identify RNA-Associated Subcellular Localizations Based on Multi-Label Learning Using Chou's 5-steps Rule
The Stacking Strategy-Based Hybrid Framework for Identifying Non-coding RNAs
Modular Arrangements of Sequence Motifs Determine the Functional Diversity of KDM Proteins
Improved and Promising Identification of Human MicroRNAs by Incorporating a High-Quality Negative Set
Local-DPP: An Improved DNA-Binding Protein Prediction Method by Exploring Local Evolutionary Information
Improved Prediction of Protein-Protein Interactions Using Novel Negative Samples, Features, and an Ensemble Classifier
EPSOL: Sequence-Based Protein Solubility Prediction Using Multidimensional Embedding
ADMETlab 2.0: an Integrated Online Platform for Accurate and Comprehensive Predictions of ADMET Properties
Multi-substrate Selectivity Based on Key Loops and Non-homologous Domains: New Insight into ALKBH Family
A Polar-Metric-Based Evolutionary Algorithm
An In Silico Approach to Identification, Categorization and Prediction of Nucleic Acid Binding Proteins
Granular Multiple Kernel Learning for Identifying RNA-Binding Protein Residues via Integrating Sequence and Structure Information
Drug-disease Associations Prediction via Multiple Kernel-Based Dual Graph Regularized Least Squares
Risk Prediction of Diabetes: Big Data Mining with Fusion of Multifarious Physical Examination Indicators
Predict New Therapeutic Drugs for Hepatocellular Carcinoma Based on Gene Mutation and Expression
A Consensus Community-Based Particle Swarm Optimization for Dynamic Community Detection
Network-based Prediction of Drug-Target Interactions Using an Arbitrary-Order Proximity Embedded Deep forest
Identifying Antioxidant Proteins by Using Amino Acid Composition and Protein-Protein Interactions
iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins
iCarPS: a Computational Tool for Identifying Protein Carbonylation Sites by Novel Encoded Features
AIEpred: an Ensemble Predictive Model of Classifier Chain to Identify Anti-inflammatory Peptides
ECFS-DEA: an Ensemble Classifier-Based Feature Selection for Differential Expression Analysis on Expression Profiles
RAACBook: a Web Server of Reduced Amino Acid Alphabet for Sequence-dependent Inference by Using Chou's Five-step Rule. Database (Oxford)
Computational Identification of Eukaryotic Promoters Based on Cascaded Deep Capsule Neural Networks
Accurate Prediction of Bacterial Type IV Secreted Effectors Using Amino Acid Composition and PSSM Profiles
A Novel Features Ranking Metric with Application to Scalable Visual and Bioinformatics Data Classification
MK-FSVM-SVDD: A Multiple Kernel-Based Fuzzy SVM Model for Predicting DNA-Binding Proteins via Support Vector Data Description
Identification of Cyclin Protein Using Gradient Boost Decision Tree Algorithm
Predicting Peroxidase Subcellular Location by Hybridizing Different Descriptors of Chou's Pseudo Amino Acid Patterns
PseKRAAC: a Flexible Web Server for Generating Pseudo K-Tuple Reduced Amino Acids Composition