key: cord-321386-u1imic5l
authors: Li, Chun; Zhao, Jialing; Wang, Changzhong; Yao, Yuhua
title: Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation
date: 2018-02-17
journal: Comb Chem High Throughput Screen
DOI: 10.2174/1386207321666180130100838
sha: 
doc_id: 321386
cord_uid: u1imic5l

AIM AND OBJECTIVE: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for timely encoding protein sequences and extracting the hidden information. METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. RESULTS: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experiment results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. CONCLUSION: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.

DNA-binding proteins (DNA-BPs) are very important functional proteins in a cell. These proteins play vital roles in various cellular processes, including DNA replication, transcription, regulation of gene expression, packaging, and other activities associated with DNA [1] [2] [3] [4] [5]. It is therefore important to distinguish DNA-BPs from non-DNA-binding proteins (NBPs). In the past, many experimental and computational techniques have been developed for identifying DNA-BPs. Experimental techniques can provide a clear-cut answer for a query protein. However, they are cost-intensive and time-consuming, and thus impractical for large datasets [3] [4] [5] [6] [7]. Computational methods can be broadly divided into two categories: structure-based methods and sequence-based methods. The former can discriminate DNA-binding and non-binding proteins with high accuracy, but they cannot be employed in high-throughput annotation, as they require the structure information of a query protein [1]. Though tremendous progress has been achieved in experimental determination of protein structures in the past five decades, it cannot keep pace with the explosive growth of sequence information resulting from modern sequencing technology [8]. Yet, as suggested by Anfinsen [9], proteins contain within their amino acid sequences enough information to determine their native conformation. Therefore, it is more promising to use sequence-based methods to identify DNA-BPs.

*Address correspondence to this author at the School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China; Tel: +86-898-65883210; E-mail: lichwun@163.com

One of the core issues for sequence-based methods is how to characterize protein sequences and extract the information hidden in them. The most typical approach is to use the amino acid composition (AAC) to formulate a protein sequence. Owing to its simplicity, the AAC model was widely applied in a number of earlier statistics-based methods. However, as pointed out in Ref. [6], if we denote by $n_1, n_2, \ldots, n_{20}$ the counts of the 20 standard amino acids in a protein sequence, then there are a total of $\frac{(n_1+n_2+\cdots+n_{20})!}{n_1!\,n_2!\cdots n_{20}!}$ different sequences/strings possessing the same AAC. The reason is that the AAC model neglects the order relation among the elements of a sequence. To overcome this drawback, the concept of pseudo amino acid composition (PseAAC, or Chou's PseAAC) was proposed [10] [11] [12] [13] [14] [15] [16] [17] [18]. The essence of PseAAC is that it not only covers AAC, but also contains additional order-correlated factors along a protein sequence. Another popular way of analyzing sequences is to convert the protein primary sequence over 20 amino acids into a reduced one. The earliest and simplest reduction was the well-known HP model, in which the 20 standard amino acids are divided into two types, hydrophobic (H) (or non-polar) and polar (P) (or hydrophilic). On the basis of this classic model, a detailed HP model was introduced by dividing the polar class into three subclasses: positive polar, uncharged polar and negative polar [19]. In addition, a few five-group classifications of amino acids were presented for practical purposes [20] [21] [22] [23]. By considering property-based triples, Li et al. [6] put forward a six-letter model of amino acids. Also based on three physicochemical properties of amino acids, Yao et al. [24] mapped the 20 standard amino acids to the eight vertices of a cube centered at the origin, and thus obtained an eight-group model of amino acids.
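The counting argument for AAC can be made concrete with a small sketch. The function names below are illustrative and not part of the original method; the code only counts the rearrangements of a string and shows that two different orderings share one composition.

```python
from collections import Counter
from math import factorial


def sequences_with_same_aac(sequence: str) -> int:
    """Number of distinct strings that share the residue counts of `sequence`."""
    total = factorial(len(sequence))
    for n in Counter(sequence).values():
        total //= factorial(n)  # exact: this is a multinomial coefficient
    return total


def aac(sequence: str) -> dict:
    """Amino acid composition: relative frequency of each residue present."""
    counts = Counter(sequence)
    return {aa: n / len(sequence) for aa, n in counts.items()}


# Two different orderings, identical composition.
print(aac("AACH") == aac("HCAA"))       # True
print(sequences_with_same_aac("AACH"))  # 12
```

Even for this four-residue toy string, twelve distinct sequences are indistinguishable by AAC alone, which is the motivation for adding order-dependent terms.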

Motivated by the work mentioned above, we propose a generalized PseAAC which is grounded on a three-letter model and a 2-D graphical representation of a protein sequence. The main work of this paper is organized as follows. In section 2, we briefly introduce the five datasets used in this study. In section 3, on the basis of two important physicochemical properties of amino acids, we cluster the 20 standard amino acids into three groups. By assigning to each group a representative symbol, we transform a protein sequence into a three-letter sequence. Then a 2-D graph without loops and multiple edges and its geometric line adjacency matrix are obtained. A sequence-derived feature vector of dimension (25+λ) is thus constructed to characterize a protein sequence. Our scheme is similar to, but clearly different from, that of PseAAC. In section 4, we apply the presented feature vector to compare β-globin proteins of 17 species and 72 spike proteins of coronaviruses, respectively. Also, we develop an SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. Experimental results show that the presented method outperforms existing methods including DNAbinder [1], DNA-Prot [2], iDNA-Prot [3] and enDNA-Prot [4]. Finally, conclusions are given in section 5.

In this study, the following five datasets are used. For convenience, they are denoted by BetaSet, CoVSet, DNASet, DNAeSet and DNAiSet, respectively.

The dataset called BetaSet is composed of the β-globin proteins of 17 species: Human (ALU64020), Gorilla (P02024), Chimpanzee (P68873), Cattle (CAA25111), Banteng (BAJ05126), Goat (AAA30913), Sheep (ABC86525), European hare (CAA68429), Rabbit (CAA24251), House mouse (ADD52660), Western wild mouse (ACY03394), Spiny mouse (ACY03377), Norway rat (CAA29887), Opossum (AAA30976), Guttata (ACH46399), Gallus (CAA23700), Muscovy duck (CAA33756). This dataset is used to determine the adjustable parameters in a feature vector.

The CoVSet dataset consists of 72 spike proteins of coronaviruses (CoVs), 23 of which are MERS-CoVs and 30 of which are SARS-CoVs. CoVs can be divided into three groups according to serotypes. Group alpha (formerly known as CoV-1) and group beta (formerly CoV-2) contain mammalian viruses, while group gamma (formerly CoV-3) contains only avian viruses. The name, accession number, and abbreviation of the 72 sequences are listed in Table 1.

According to the existing taxonomic groups, sequences 1-5 belong to the first group, sequences 6-8 belong to the third group, and the remaining sequences belong to the second group.

DNASet is a benchmark dataset created in 2007 by Kumar et al. [1]. It contains 396 sequences, 146 of which are DNA-BPs (positive samples) and 250 of which are NBPs (negative samples). In both the positive and the negative sets, the sequence similarity between any two proteins is not more than 25%.

DNAiSet was also generated by Kumar et al. [1], based on the work of Wang and Brown [25]. It originally contained 92 DNA-BPs and 100 NBPs. In order to avoid overestimating a given method, those sequences having sequence similarity with DNASet were removed by Xu et al. [4], and the final dataset is composed of 82 DNA-BPs and 100 NBPs.

As an expanded benchmark dataset, DNAeSet was constructed in 2014 by Xu et al. [4]. Using a sequence filter criterion identical to that of DNASet, they added a number of NBPs to DNASet, bringing the total number of NBPs to 2125. After removing the sequences that have sequence identity with DNAiSet, the current version of DNAeSet has 146 DNA-BPs and 1710 NBPs.

Isoelectric point (pI) and relative distance (RD) are two important physicochemical properties of the 20 standard amino acids [26] [27] [28]. Their original numerical values are listed in Table 2. The two properties take values on different scales; the relative distance, for example, varies between 1469 and 3355. Therefore, normalization of these values is needed. Here, we scale them into the interval [0,1] by the formula below:

The corresponding values are listed in Table 3 . The last row in this table gives the average values.
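The scaling formula itself is not reproduced in this copy of the text. Min-max rescaling is one common way of mapping values into [0,1], and the sketch below assumes that form; it should not be read as the authors' exact formula. The function name and the intermediate sample value are illustrative, and only the endpoints 1469 and 3355 come from the text.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1.

    Assumption: min-max scaling; the paper's own formula is not shown here.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


# 1469 and 3355 are the RD extremes quoted in the text; 2412 is a made-up midpoint.
scaled = min_max_scale([1469, 2412, 3355])
print(scaled)  # [0.0, 0.5, 1.0]
```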

For the i-th amino acid, the first property yields a label: "+" if the labeling condition is met, and "-" otherwise. Similarly, when the second property is considered, a second label for the amino acid is obtained. In this way, each of the 20 standard amino acids has a label pair. The corresponding labels are also listed in Table 3. Amino acids with the same label pair are viewed as members of the same group. Thus, the 20 standard amino acids are distributed into the following groups:

For each group, the first amino acid is used to stand for the group. Thus the three groups have three representative letters: A, C and H, respectively. The value of a property for a group is defined as the average value of that property over all members of the group. On the left-hand side of Table 4, we list the corresponding values for the three groups. Each group can thus be viewed as a 2-D vector. We further normalize these vectors to unit length, and list the normalized values on the right-hand side of Table 4.

In Fig. (1) , we show the 2-D map of the 20 standard amino acids according to the classification above.

By substituting each amino acid with its representative letter, a protein primary sequence is reduced into a three-letter sequence. For example, the three-letter sequence of the sequence segment EKAAVTGFWGKVKVDEVGAEA is AHAAAHCCCCHAHACAACAAA.
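The substitution step can be sketched as a dictionary lookup. The full group table is not reproduced in this copy, so the mapping below is deliberately partial: it contains only the residues whose group can be read off the worked example above, plus the representatives C and H themselves. The names `PARTIAL_GROUP_OF` and `reduce_sequence` are illustrative.

```python
# Partial residue-to-group map, read off the worked example in the text.
# Residues that do not occur in that example are intentionally absent.
PARTIAL_GROUP_OF = {
    "A": "A", "E": "A", "V": "A",
    "C": "C", "D": "C", "F": "C", "G": "C", "W": "C",
    "H": "H", "K": "H", "T": "H",
}


def reduce_sequence(protein: str, group_of=PARTIAL_GROUP_OF) -> str:
    """Replace every residue by the representative letter of its group."""
    try:
        return "".join(group_of[aa] for aa in protein)
    except KeyError as missing:
        raise ValueError(f"no group known for residue {missing}") from None


print(reduce_sequence("EKAAVTGFWGKVKVDEVGAEA"))
# AHAAAHCCCCHAHACAACAAA
```

A complete implementation would fill in the remaining residues from the group table before use; the explicit error avoids silently dropping unmapped residues.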

To obtain the graphical representation of a reduced sequence, we start from the origin (0,0) and move in the xoy-plane in the direction dictated by Fig. (1). Formally, let $S = s_1 s_2 \cdots s_N$ be a given three-letter sequence. Then one has a map $\varphi$ that maps S into a plot set. Explicitly, $\varphi(s_i) = (x_i, y_i)$, where $(x_i, y_i)$ is given by

$$(x_i, y_i)^T = \sum_{k=1}^{i} \left(u_1(s_k),\, u_2(s_k)\right)^T,$$

where T represents the transpose of a matrix and $u_j(s_k)$ (j=1,2) represents the j-th component of the unit vector corresponding to $s_k$ (cf. Fig. 1 and Table 4). Connecting all points of the plot set in turn, a 2-D curve is drawn. In Fig. (2), we show the 2-D graphical representation of the sequence AHAAAHCCCCHAHACAACAAA. It is not difficult to see that the 2-D graphical representation has no degeneracy, and thus is a simple graph, that is, a graph without loops and multiple edges.
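A minimal sketch of the walk follows, assuming each letter contributes one step along its group's unit vector and that the origin is kept as the first point of the curve. The Table 4 values are not available in this copy, so the three directions below are placeholders chosen only to be distinct; `raw`, `UNIT` and `walk` are illustrative names.

```python
from math import hypot

# Placeholder directions (NOT the Table 4 values): three distinct unit vectors.
raw = {"A": (1.0, 3.0), "C": (1.0, 1.0), "H": (3.0, 1.0)}
UNIT = {g: (x / hypot(x, y), y / hypot(x, y)) for g, (x, y) in raw.items()}


def walk(reduced: str, unit=UNIT):
    """Return the curve points, starting at the origin, one step per letter."""
    points = [(0.0, 0.0)]
    for letter in reduced:
        dx, dy = unit[letter]
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points


pts = walk("AHAAAHCCCCHAHACAACAAA")
print(len(pts))  # 22: the origin plus one point per letter
```

Because every placeholder direction points into the first quadrant, the walk never revisits a point, which mirrors the no-degeneracy property claimed for the real representation.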

In this section, we give a numerical characterization of a protein sequence that will facilitate quantitative comparisons of protein sequences. As is known, once a graphical representation is given, it can be transformed into some structural matrices, such as the matrices ED, GD, M/M, and L/L [6, 24, 29-37]. Here we employ the L/L matrix. L/L is a nonnegative symmetric matrix whose off-diagonal entries are defined as the quotient of the Euclidean distance between two vertices of the graph and the sum of the geometrical lengths of the edges between the two vertices. By definition, all diagonal elements are zero. Obviously, the entries of an L/L matrix are less than or equal to one. The higher-order matrix ${}^kL/{}^kL$ is the matrix whose (i,j)-entry is $([L/L]_{ij})^k$. As the exponent k approaches positive infinity, ${}^kL/{}^kL$ converges to a (0,1) matrix (denoted by ${}^bL/{}^bL$). With respect to the proposed 2-D graph, $[{}^bL/{}^bL]_{ij} = 1$ if and only if the two corresponding vertices lie on a straight line in the curve, including the cases of adjacency and non-adjacency. In this sense, we call such a matrix a geometric line adjacency matrix (GLAM), or simply a generalized adjacency matrix (GAM), generated by a graph.
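The definition of the L/L matrix and its (0,1) limit can be illustrated with a short sketch in plain Python. It follows the verbal definition given here (straight-line distance divided by path length along the curve); the function names and the toy curve are illustrative, and consecutive points are assumed to be distinct so that no path length is zero.

```python
from math import dist, isclose


def ll_matrix(points):
    """L/L matrix: straight-line distance divided by path length along the curve."""
    n = len(points)
    # cumulative[i] = length of the curve from points[0] to points[i]
    cumulative = [0.0]
    for p, q in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + dist(p, q))
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            path = cumulative[j] - cumulative[i]
            m[i][j] = m[j][i] = dist(points[i], points[j]) / path
    return m


def glam(points):
    """Limit of the entrywise k-th power as k grows: 1 where the L/L entry is 1."""
    return [[1 if isclose(v, 1.0, abs_tol=1e-9) else 0 for v in row]
            for row in ll_matrix(points)]


# Toy curve: two collinear unit steps, then a turn.
curve = [(0, 0), (1, 0), (2, 0), (2, 1)]
for row in glam(curve):
    print(row)
```

In the toy curve, the first and third points are not adjacent, yet their entry is 1 because the curve between them is straight; this is the sense in which the matrix generalizes ordinary adjacency.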

The first Zagreb index is a well-known vertex-degree-based molecular structure descriptor. This index was first considered by Gutman and Trinajstić about 45 years ago, and has since been discussed and used in numerous studies (see [38] [39] [40] and the references cited therein). The first Zagreb index is defined as

$$Z_{g1}(G) = \sum_{u \in V(G)} d_u^{2}, \qquad (2)$$

where V(G) is the vertex set of graph G and $d_u$ denotes the degree (= number of first neighbors) of the vertex u in G. If G is a simple graph (i.e. without loops and multiple edges), $Z_{g1}$ can also be obtained directly from its adjacency matrix, since the row sums of this matrix are equal to the degrees of the corresponding vertices.

It should be mentioned that the Zagreb index gives greater weights to inner vertices and edges than to outer vertices and edges of a graph [38]. One way to amend this is to insert inverse values of the vertex degrees into Eq (2), and thus the modified Zagreb index has been proposed [38]:

$${}^{m}Z_{g1}(G) = \sum_{u} \frac{1}{d_u^{2}},$$

where the sum runs over all vertices u of G.

Clearly, ${}^{m}Z_{g1}$ gives greater weights to outer vertices/edges than to inner ones in a graph.
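Both indices can be read directly off an adjacency matrix, as the text notes. The sketch below uses a 0/1 adjacency matrix given as nested lists; the function names are illustrative, and skipping isolated vertices in the modified index is an assumption made only to avoid division by zero.

```python
def degrees(adjacency):
    """Vertex degrees of a simple graph = row sums of its 0/1 adjacency matrix."""
    return [sum(row) for row in adjacency]


def first_zagreb(adjacency):
    """Sum of squared vertex degrees."""
    return sum(d * d for d in degrees(adjacency))


def modified_first_zagreb(adjacency):
    """Sum of inverse squared degrees (isolated vertices skipped by assumption)."""
    return sum(1 / (d * d) for d in degrees(adjacency) if d > 0)


# Path graph on four vertices: degrees 1, 2, 2, 1.
path4 = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(first_zagreb(path4))           # 10
print(modified_first_zagreb(path4))  # 2.5
```

In the path graph the two end vertices contribute 2 of the 10 units to the first index but 2 of the 2.5 units to the modified one, which illustrates the shift of weight from inner to outer vertices.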

At the same time, on the basis of our geometric line adjacency matrix, we can count the vertex pairs that have a generalized adjacency relationship. It should be noted that, in our case, the 'neighbors' include not only the conventional neighbors, i.e. the first neighbors, but also the second neighbors, the third neighbors, and so on. We call the corresponding number for graph G the line-adjacency index, and denote it by La(G). Then we have a graph-based index:

For a symmetric matrix, eigenvalue-based indices, such as the leading eigenvalue [29-33, 35] and the graph energy [17], are often used as matrix invariants. Moreover, in our previous paper [41], an alternative invariant called the 'ALE-index' was proposed. The ALE-index of a matrix M is defined by the following formula:

$$\mathrm{ALE}(M) = \frac{1}{2}\left(\frac{1}{L}\,\|M\|_{m_1} + \sqrt{\frac{L-1}{L}}\,\|M\|_{F}\right), \qquad (4)$$

where L is the order of the matrix, and $\|\cdot\|_{m_1}$ and $\|\cdot\|_{F}$ are the m1- and F-norms of a matrix, respectively. In order to reduce variations caused by the comparison of matrices with different sizes, we consider a normalized ALE-index instead of the raw one, and use it as our matrix-based index.

In addition, with respect to the three-letter sequence S, we define a coupling mode function in terms of $P_1$ and $P_2$, the values of the two properties (n = 1, 2) of the corresponding representative letter (group); the integer k represents the counted rank (or tier) of the coupling mode. Then, following procedures similar to those in [10, 11], we can extract global sequence-order information of the three-letter sequence S through a series of tier correlation factors.

The k-th of these quantities is called the k-th tier correlation factor. Clearly, the first-tier factor reflects the coupling mode between the most contiguous elements along the three-letter sequence S, the second-tier factor that between the second most contiguous elements, the third-tier factor that between the third most contiguous elements, and so forth.

Furthermore, from the respective counts of the three representative letters (A, C and H) in sequence S, we can obtain a so-called group composition (GC), whose definition also involves the size of each group (set).

Consequently, (5+λ) elements are derived, which reflect the information about the reduced sequence and, particularly, the 2-D graphical representation. By combining these elements with the conventional amino acid composition (AAC), a (25+λ)-dimensional feature vector can be constructed to numerically characterize a protein sequence; it is given by Eqs (7) and (8).

These equations involve the frequencies of occurrence of the 20 standard amino acids in a protein sequence and the weight factors $w_1$, $w_2$ and $w_3$. As will be described later in detail, the four adjustable parameters in Eqs (7) and (8) can be determined by a set of known samples. Roughly speaking, the vector contains the features of AAC as well as information beyond AAC, which is similar in form to Chou's PseAAC. Therefore, we call such a vector formulated by Eqs (7) and (8) the generalized PseAAC of a protein sequence.

In this section, we discuss the use of the generalized PseAAC. As can be seen from Eqs (7) and (8), the present mathematical descriptor contains four undetermined parameters: λ, $w_1$, $w_2$ and $w_3$. Here λ represents the total number of correlation ranks counted (cf. Eq (6)), which is an integer. Generally speaking, the greater the value of λ, the more sequence-order effects will be incorporated. However, if the value is too large, it might cause overfitting or the 'high dimension disaster' [15]; therefore, we endeavour to limit λ to a small integer. In this study, the five datasets (BetaSet, CoVSet, DNASet, DNAeSet and DNAiSet) are arranged into two groups: one contains BetaSet, the other includes the rest. The first group is used for determining the four adjustable parameters, and the second group for testing purposes.

According to the method described above, we first associate each of the 17 protein sequences in BetaSet with a (25+λ)-dimensional vector (cf. Eqs (7) and (8)), and then calculate the pairwise Euclidean distance between any two of the 17 protein sequences via their vectors. Thus a real symmetric distance matrix is obtained. On the basis of this distance matrix, a UPGMA tree is constructed using the MEGA4 package. The result depends on the values of the rank λ and the three weight factors. It is found that, for one setting of λ and the three weight factors, the three non-mammals (Muscovy duck, Gallus and Guttata) form a separate branch and stay outside of the mammals. Moreover, in the subtree of mammals, the primate species (Human, Chimpanzee, Gorilla) are grouped closely. Also, the rodent species (Norway rat, Spiny mouse, House mouse, Western wild mouse) and the lagomorph species (Rabbit, European hare) are situated on independent branches, respectively, while Goat, Sheep, Cattle and Banteng cluster together (Fig. 3). This result is analogous to that reported in the literature [6, 29, 30, 35, 36]. Accordingly, these four numerical values are used for the four undetermined parameters, and a 31-D feature vector (that is, λ = 6) is thus obtained.

Fig. (3). The relationship tree of 17 species.
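The distance-matrix step that precedes tree building can be sketched as follows. The vectors here are made-up 3-D stand-ins for the 31-D descriptors, and the labels `sp1` to `sp3` are hypothetical; the tree itself was built with MEGA4 in the original work and is not reproduced here.

```python
from math import dist


def distance_matrix(vectors):
    """Symmetric matrix of pairwise Euclidean distances between feature vectors."""
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(vectors[i], vectors[j])
    return d


# Three made-up 3-D vectors standing in for the 31-D descriptors.
toy = {"sp1": (0.0, 0.0, 0.0), "sp2": (3.0, 4.0, 0.0), "sp3": (3.0, 4.0, 12.0)}
d = distance_matrix(list(toy.values()))
print(d[0][1], d[0][2], d[1][2])  # 5.0 13.0 12.0
```

The resulting matrix is what a UPGMA implementation takes as input; smaller entries correspond to sequences that end up closer together in the tree.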

In order to evaluate the effectiveness of our method, we test it by phylogenetic analysis on the CoVSet dataset. Coronaviruses (CoVs) belong to the genus Coronavirus of family Coronaviridae [42] . The first coronavirus (HCoV-229E) was isolated from humans in 1965. Until 2003, coronaviruses attracted little interest beyond causing mild upper respiratory tract infections. However, this phenomenon changed dramatically with the emergence of SARS-CoV and MERS-CoV. As of July 2017, 2040 laboratory-confirmed cases of MERS-CoV infection were reported in over 27 countries, and at least 710 individuals have died (crude CFR 34.8%) [43] .

Using the above-determined values for the parameters λ, $w_1$, $w_2$, and $w_3$, we calculate the 31-D feature vectors of the 72 coronavirus spike proteins and their Euclidean distance matrix; then the corresponding phylogenetic tree (Fig. 4) is constructed. Observing Fig. (4), we find that the 72 coronavirus spike proteins are clustered into three groups: one contains the five alpha coronaviruses (PEDVC, PEDV, TGEVG, TGEV, and HCoV-229E), the second includes the three gamma coronaviruses (IBV, IBVBJ, IBVC), and the third corresponds to group beta. A closer look at the subtree of beta coronaviruses shows that the MERS-CoVs are clearly clustered together, as are the SARS-CoVs, while MHV, MHVA, MHVM, MHVP, MHVJHM, BCoV, BCoVE, BCoVL, BCoVM, BCoVQ and HCoV-OC43 are situated on an independent branch. The resulting clusters agree well with the established taxonomic groups.

To further assess the effectiveness of the proposed method, we conduct a series of experiments on the identification of DNA-binding proteins using three datasets: DNASet, DNAeSet and DNAiSet. Among them, DNASet and DNAeSet serve as training datasets, while DNAiSet serves as an independent testing dataset. A support vector machine (SVM) is employed as the classifier, and the R package 'e1071' v1.6-8 [44] is used to implement it. For a given set of binary-labeled training examples, SVM maps the input space into a higher-dimensional space and seeks a hyperplane to separate the positive samples from the negative ones [25]. The optimal hyperplane maximizes the separation margin between the two classes of training data. The distance measurement between the data points in the high-dimensional space is defined by the kernel function. In this study, we use the radial basis function (RBF) kernel $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$. This model involves two tunable parameters: the kernel width γ and the penalty parameter C. Prediction performance can be assessed using quality indices including Accuracy (ACC), Sensitivity (Se), Specificity (Sp), F-measure (F1M) and Matthews correlation coefficient (MCC) [2, 4, 5, 25, 37, 45]:

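$$\mathrm{Se} = \frac{TP}{TP+FN}, \quad \mathrm{Sp} = \frac{TN}{TN+FP}, \quad \mathrm{ACC} = \frac{TP+TN}{TP+TN+FP+FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \quad \mathrm{F1M} = \frac{2 \times P \times R}{P+R}, \qquad (9)$$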

where TP, TN, FP, and FN are defined as the numbers of true positive, true negative, false positive, and false negative samples obtained from the prediction respectively, while P and R denote Precision value and Recall value, respectively. One can also use the alternative definition by a series of studies published recently [15, [46] [47] [48] . The higher the values of these measurements, the better the quality of prediction.

This experiment is made on DNASet itself. To obtain a reliable result with little error, the SVM model on DNASet is established by 5-fold cross-validation (5CV) with 3 runs. Here the 31-D feature vector of a protein sequence serves as the input for the SVM. In a 5CV, the positive and negative samples are randomly distributed into five subsets, the so-called folds, and the test is repeated five times. In each of the five iterations, one subset is used as the testing set, while the remaining four subsets are combined and used to build a classifier (training). The predictions made for the test data instances in all five iterations yield the final result. The sensitivity, specificity, ACC, MCC and F1M are calculated for each run, and the corresponding results and their average values are listed in Table 5. As can be seen from this table, we achieve an accuracy (ACC) of 89.65%, with an MCC of 0.776 and an F1M of 84.91%. This result shows that our SVM model performs well on the benchmark dataset DNASet.

Fig. (4). The relationship tree of 72 coronavirus spike proteins.
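The fold construction can be sketched in a few lines. The original experiments were run in R with e1071; the Python below only illustrates how samples might be split, and it assumes a stratified split (each class spread evenly over the folds), which the text does not state explicitly. The function name and seed are illustrative.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# DNASet has 146 positives and 250 negatives.
labels = [1] * 146 + [0] * 250
folds = stratified_folds(labels, k=5, seed=1)
for test_fold in folds:
    train = [i for f in folds if f is not test_fold for i in f]
    # fit a classifier on `train`, predict on `test_fold` (omitted here)
    assert len(train) + len(test_fold) == len(labels)
print([len(f) for f in folds])  # [80, 79, 79, 79, 79]
```

Repeating the split with three different seeds and averaging the metrics would correspond to the "5CV with 3 runs" protocol described above.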

It is important to examine the performance of the newly developed method on an independent dataset. In this experiment, we establish the classifier with the benchmark dataset DNASet and then test it on the independent dataset DNAiSet. To decide the parameter pair (γ, C), we utilize a systematic grid search over γ and C, indexed by integers i and j in the ranges [-3, 3] and [0, 3], respectively, and select the pair that is optimal for DNASet. With the best pair (γ, C), DNAiSet is fed to the SVM. As a result, our model correctly predicts 68 out of 82 DNA-BPs and 92 out of 100 NBPs. The ACC reaches 87.91%, with the MCC, sensitivity, specificity, and F1M of 0.756, 82.93%, 92.00% and 86.07%, respectively (see Table 6). This demonstrates that our SVM model performs equally well on an independent dataset.
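The reported figures can be rechecked from the confusion counts quoted above, using the standard textbook definitions of the five indices. The function name is illustrative; the counts (68 of 82 DNA-BPs and 92 of 100 NBPs correct) are taken from the text.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics for a two-class predictor."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to sensitivity
    return {
        "Se": recall,
        "Sp": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "MCC": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "F1M": 2 * precision * recall / (precision + recall),
    }


# 68 true positives and 14 misses among 82 DNA-BPs; 92 correct and 8 wrong among 100 NBPs.
m = binary_metrics(tp=68, tn=92, fp=8, fn=14)
print({k: round(v, 4) for k, v in m.items()})
```

With these counts the sketch gives Se, Sp, ACC and MCC values that round to the 82.93%, 92.00%, 87.91% and 0.756 quoted above, and an F1M of about 0.861, which agrees with the reported 86.07% to within rounding.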

For convenience of comparison, results of some existing methods including DNAbinder [1], DNA-Prot [2], iDNA-Prot [3] and enDNA-Prot [4] are also listed in Table 6. DNAbinder, developed by Kumar et al. [1], extracts evolutionary information in the form of a position-specific scoring matrix (PSSM) from the corresponding protein sequence. PSSM-21 and PSSM-400 are two feature vectors generated by means of PSSM, whose dimensions are 21 and 400, respectively. In [1], the PSSM-400-based SVM model was mainly used for predicting DNA-BPs. DNA-Prot [2] is a Random Forest based method, in which the feature vector includes sequence information and structure information, such as the composition of the 20 standard amino acids, the composition of 10 amino acid groups, and secondary structure information predicted from a protein sequence. iDNA-Prot [3] constructs the feature vector via the grey model, and Random Forest is also used as the operation engine. enDNA-Prot [4] is a predictor which encodes a protein sequence into a feature vector of dimension 188 and adopts an ensemble classifier constructed with four types of machine learning classifiers. All these methods are tested on the same datasets to make an unbiased comparison with our method. Observing Table 6, we can see that the current approach outperforms the other methods by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. This result indicates that our method achieves highly competitive performance.

When the number of positive samples is comparable to that of negative samples, many machine learning algorithms perform better. However, in real life, the number of non-binding proteins is much greater than that of DNA-BPs (Eq. (10)).

In this case, the frequency of NBPs is generally much greater than that of the binding ones in the predictions (Eq. (11)).

Eqs (10) and (11) imply that the value of ACC defined by Eq (9) tends towards 1. To solve this problem, instead of using the definition of ACC in Eq (9), here we use the alternative definition [49, 50]:

$$\mathrm{acc} = \frac{\mathrm{Se} + \mathrm{Sp}}{2}. \qquad (12)$$

In order to analyze the influence of the number of negative samples in a benchmark dataset on the predictive performance of the current method, we construct a series of subsets of DNAeSet and use them as training sets in turn, while DNAiSet is always used as the testing set. Each subset contains all 146 DNA-BPs and a part of the NBPs in DNAeSet. In detail, the first subset contains 250 NBPs randomly selected from DNAeSet, and each subsequent subset is obtained by adding 50 NBPs to the previous one, until 1700 NBPs are contained; this yields 30 subsets. For each subset, we develop the SVM model by 5CV with 3 runs. The results averaged over the three runs are given in Fig. (5). From Fig. (5) we can see that the curves of ACC and acc visibly diverge as n, the number of NBPs in the subset, grows. With increasing n, ACC increases rapidly, while acc tends to be steady. The value of ACC appears to grow higher and higher, but this is a false appearance and does not correctly reflect the performance.
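The divergence between the two accuracy measures is largely arithmetic, which a toy calculation can show. The sketch below assumes that acc is the mean of sensitivity and specificity, as in the cited Z-curve studies [49, 50], and it holds Se and Sp fixed at made-up values; in the real experiment both rates change with the training set, so this is an illustration rather than a reproduction of Fig. (5).

```python
def acc_pair(se, sp, n_pos, n_neg):
    """Return (ACC, acc) for a predictor with fixed sensitivity and specificity."""
    tp, tn = se * n_pos, sp * n_neg
    overall = (tp + tn) / (n_pos + n_neg)  # ACC: fraction of correct predictions
    balanced = (se + sp) / 2               # acc: assumed mean of Se and Sp
    return overall, balanced


# Illustrative rates only; 146 positives as in DNAeSet, negatives from 250 to 1700.
for n_neg in (250, 750, 1700):
    overall, balanced = acc_pair(se=0.80, sp=0.95, n_pos=146, n_neg=n_neg)
    print(n_neg, round(overall, 4), round(balanced, 4))
```

As the negative class grows, ACC drifts towards the specificity even though the classifier itself has not changed, whereas the class-balanced measure stays put.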

In order to show the advantage of their method, Xu et al. [4] created a dataset called expanded benchmark dataset1100 with all the 146 positive samples and 1100 negative samples in DNAeSet, which is employed as another training dataset to evaluate the predictive performance on the independent dataset DNAiSet. For convenience of comparison, we also select the expanded benchmark dataset to establish the classifier and test it on DNAiSet. Repeating this procedure five times, the average results are given in Table 7 (the first row). Results obtained by the other four methods (DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot) trained on the expanded benchmark dataset with n=1100 are also listed in Table 7 . From this table we see that the overall accuracy of our method is about 92%, with MCC of 0.84 and F1M of 91.24%, which outperforms other methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. This suggests that our method performs well on unbalanced datasets. 

Based on two important physicochemical properties, the 20 standard amino acids were distributed into three groups, and a representative symbol was assigned to each group. By replacing each amino acid with its representative letter, a protein primary sequence was converted into a three-letter sequence, which can be viewed as a coarse-grained description of the protein primary sequence. On the basis of the three-letter sequence, a graph without loops and multiple edges was obtained. Taking advantage of this 2-D graph, we constructed a geometric line adjacency matrix (GLAM), and then the corresponding ALE-index, the line-adjacency index, the first Zagreb index and its modification were calculated. In addition, order-correlated factors were extracted from the reduced sequence. By combining these elements with the frequencies of occurrence of the 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. The proposed method was tested on five popular datasets by phylogenetic analysis and identification of DNA-binding proteins. The results illustrated the better performance of our method.

Identification of DNA-binding proteins using support vector machines and evolutionary profiles

DNA-prot: identification of DNA binding proteins from protein sequence information using random forest

iDNA-prot: identification of DNA binding proteins using random forest with grey model

enDNA-Prot: identification of DNA-binding proteins by applying ensemble learning

gDNA-Prot: predict DNA-binding proteins by employing support vector machine and a novel numerical characterization of Protein sequence

Numerical characterization of protein sequences based on the generalized Chou's pseudo amino acid composition

Light-directed synthesis of peptide nucleic acids (PNAs) chips

Protein structure prediction from sequence variation

Principles that govern the folding of protein chains

Prediction of protein cellular attributes using pseudoamino acid composition

Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes

Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology

iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou's PseAAC

PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions

Identify recombination spots with pseudo dinucleotide composition

Sequence-based identification of recombination spots using pseudo nucleic acid representation and recursive feature extraction by linear kernel SVM

Protein sequence comparison based on physicochemical properties and the position-feature energy matrix

A Novel protein characterization based on pseudo amino acids composition and star-like graph topological indices

Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses

A computational approach to simplifying the protein folding problem

Modeling study on the validity of a possibly simplified representation of proteins

2-D graphical representation of protein sequences and its application to coronavirus phylogeny

Clustering of the protein design alphabets by using hierarchical self-organizing map

A novel descriptor of protein sequences and its application

BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences

Amino acid difference formula to help explain protein

Correlation analysis of some physical chemistry properties among genetic codons and amino acids

Similarity analysis of protein sequences based on the normalized relative entropy

On 3-D graphical representation of DNA primary sequences and their numerical characterization

Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation

Milestones in graphical bioinformatics

Graphical representation of proteins

Representation of proteins as walks in 20-D space

Phylogenetic analysis of DNA sequences based on k-word and rough set theory

On the characterization of DNA primary sequences by triplet of nucleic acid bases

DV-Curve: A novel intuitive tool for visualizing and analyzing DNA sequences

A Novel method for similarity analysis and protein sub-cellular localization prediction

The Zagreb indices 30 years after

On vertex-degree-based molecular structure descriptors

Graphs with fixed number of pendent vertices and minimal Zeroth-order general Randic index

New invariant of DNA sequences

Genetic drift of human coronavirus OC43 spike gene during adaptive evolution

WHO MERS-CoV global summary and risk assessment

Assessing the accuracy of prediction algorithms for classification: an overview

iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition

Using deformation energy to analyze nucleosome positioning in genomes

iRNA-PseU: identifying RNA pseudouridine sites

Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve

Using a Euclid distance discriminant method to find protein coding genes in the yeast genome

The authors' greatest gratitude goes to the anonymous referees for their insightful suggestions and generous support. The authors are also indebted to the following programs: the Natural Science Foundation of Liaoning Province (201602005), the Program for Liaoning Innovative Research Team in University (LT2014024), and the National Natural Science Foundation of China (61762035).

Not applicable.

The authors declare no conflict of interest, financial or otherwise.