key: cord-0021762-m1ussyhe
authors: Allehaibi, Khalid; Daanial Khan, Yaser; Khan, Sher Afzal
title: iTAGPred: A Two-Level Prediction Model for Identification of Angiogenesis and Tumor Angiogenesis Biomarkers
date: 2021-09-27
journal: Appl Bionics Biomech
DOI: 10.1155/2021/2803147
sha: 8bd2ce60a3ef3ca5eac4777a16830c84665462c3
doc_id: 21762
cord_uid: m1ussyhe

A crucial biological process called angiogenesis plays a vital role in migration, growth, and wound healing of endothelial cells and other processes that are controlled by chemical signals. Angiogenesis is the process that controls the growth of blood vessels within tissues while angiogenesis proteins play a significant role in the proper working of this process. The balancing of these signals is necessary for the proper working of angiogenesis. Unbalancing of these signals increases blood vessel formation, which causes abnormal growth or several diseases including cancer. The proposed work focuses on developing a two-layered prediction model using different classifiers like random forest (RF), neural network, and support vector machine. The first level performs in silico identification of angiogenesis proteins based on the primary structure. In the case the protein is an angiogenesis protein, then the second level predicts whether the protein is linked with tumor angiogenesis or not. The performance of the model is evaluated through various validation techniques. The model was evaluated using k-fold cross-validation, independent, self-consistency, and jackknife testing. The overall accuracy using an RF classifier for angiogenesis at the first level was 97.8% and for tumor angiogenesis at the second level was 99.5%, ANN showed 94.1% accuracy for angiogenesis and 79.9% for tumor angiogenesis, and the accuracy of SVM for angiogenesis was 78.8% and for tumor angiogenesis was 65.19%.

The biological process in which new blood vessels develop from preexisting blood vessels is called angiogenesis [1] . It is a normal process that plays a vital role in the migration, growth, and healing of endothelial cells. Angiogenesis itself is controlled by chemical signals. Usually, the consequences of these chemical signals remain balanced which means that new blood vessels only develop on a need basis. But sometimes these signals can be unbalanced and may increase blood vessel formation, which in return causes abnormal growth or diseases [2, 3] . Angiogenesis plays a vital role in the development and growth of cancer cells [4, 5] . Just like normal cell growth, tumor cells also need oxygen and other nutrients to grow and expand. These elements are present in the blood. Tumor cells send chemical signals that stimulate the growth of new blood vessels. Without the angiogenesis process, abnormal or tumor cells cannot grow beyond 1-2 mm in size [6, 7] . But this abnormal angiogenesis process not only causes cancer but also is a precursor of several diseases like leukemia, hematologic diseases, muscular degeneration, and eye diseases [8] [9] [10] .

Cancer is ranked as a leading cause of death around the world in the 21st century. According to a report published in 2015 by the World Health Organization (WHO), cancer is the first or second leading cause of death before the age of 70 in 91 countries around the globe [7]. Furthermore, according to the 2018 cancer statistics reported by the International Agency for Research on Cancer and Cancer Research UK, cancer accounts for 9.6 million deaths around the world [7, 11]. This number is predicted to increase in the coming years.

Researchers, scientists, and biologists all around the world are searching for techniques to develop drugs and systems to fight this deadly disease [12]. Many researchers have contributed to systems for tumor prediction at different stages of its life cycle. Different strategies have been proposed to control this disease, such as chemotherapy [13, 14], radiation therapy [15, 16], surgery, bone marrow (cord blood) transplant, and vaccines [17]. Cancer can attack the brain, which is among the most crucial parts of the human body. The brain has a delicate and complex structure, so delivering drugs to it is difficult. Nevertheless, approaches such as high-dose chemotherapy and blood-brain barrier disruption can deliver drugs to it [18]. Many therapies for tumors revolve around the attempt to suppress the tumor angiogenesis process. Scientists have discovered many ligands that can bind to tumor angiogenesis proteins such that their function is inhibited. Hence, identification of angiogenesis and tumor angiogenesis proteins is crucial in finding novel and effective tumor therapies.

Formerly, several mathematical [3] and computational models have been developed for the classification or identification of various proteomic and genomic attributes [19]. The proposed work establishes a computational model based on position and composition information of a primary sequence that attempts to accurately identify angiogenesis and tumor angiogenesis proteins. Since tumor angiogenesis proteins are also characterized as angiogenesis proteins, the similarity of their obscure features can often lead to an ambiguous outcome. This ambiguity among seemingly similar angiogenesis and tumor angiogenesis proteins is resolved by a two-layer classification model. The initial layer distinguishes between angiogenesis and nonangiogenesis proteins, while the second layer determines whether a protein identified as an angiogenesis protein is tumor causing or not. The two-layered model helps alleviate ambiguity and yields more accurate and reliable results.

The rest of the paper is organized as follows. Section 2 illuminates the importance of angiogenesis uncovered in the previous research and also discusses the state-of-the-art models used for in silico identification of proteomic attributes. Section 3 discusses the methodology adopted for the proposed in silico identification model. Section 4 illustrates the accuracy of the model obtained through well-defined rigorous testing methodologies. Section 5 provides a general discussion regarding the performance of the proposed model.

State of the Art. The crucial role of angiogenesis in tumor progression was first discovered by Judah Folkman in 1971 [20]. Angiogenesis is a crucial process of vascular system growth through the sprouting and splitting of blood vessels [21]. Tumor cells also require a constant flow of blood for their growth, for which they stimulate the growth of blood vessels through secretion of various tumor angiogenesis proteins or growth factors. Cancer treatment therapies are aimed at finding inhibitors for such growth factors. Identification of angiogenesis and tumor angiogenesis proteins bears enormous significance in cancer research as they are targets of such inhibitors [22]. Most cancer research revolves around finding ligands and substances that bind to tumor angiogenesis proteins and inhibit their role [23]. Scientists use various methodologies for the identification of protein attributes [24] [25] [26] [27] [28]. In silico identification techniques have evolved and received acclaim over the past few years as they provide robust and fast results and are cost-effective [29, 30]. Scientists have used various mathematical and computational models to identify attributes of proteins based on the composition and positioning of amino acid residues [31]. A position-based mathematical model, namely, the position-specific scoring matrix (PSSM), was introduced in 1982 [32]. Numerous prediction models have been designed that incorporate the use of PSSM for the identification of proteomic attributes. However, since PSSM did not incorporate composition-relevant information into the model, it lacked a major aspect that determines proteomic attributes. In 2001, Chou introduced the pseudo amino acid composition (PseAAC) model, which encompassed position as well as composition information and hence provided better results [33]. Many generalizations and variants have since been proposed to provide even better results [31].
The choice of the most appropriate classifier plays a pivotal role in the design of such methodologies. A multitude of classifiers have been engaged for the prediction of posttranslational modification sites including random forest, support vector machine, neural networks, and deep learning. In [34], the authors incorporate an adapted normal distribution bi-profile Bayes with PseAAC to formulate a prediction model. The accuracy is further improved using kernel sparse representation classification and the minimum redundancy and maximum relevance algorithm [35]. Subsequently, an improved approach uses a deep learning algorithm formulated by [36]. Deep learning has emerged as an encouraging model for the resolution of a multitude of problems [37] [38] [39]. The proposed work presents a two-layered model based on position and composition relative features and statistical moments [31] for the identification of angiogenesis and tumor angiogenesis proteins, which are tested on various classifiers to obtain the best results.

Angiogenesis has been identified as a critical process that needs to be subjugated to disrupt the progression of cancer. Angiogenesis proteins especially the ones that lead to tumor angiogenesis have a crucial significance in this process. Since they promote the development of new blood vessels within the cancerous tissue, therefore they are considered an important biomarker for early detection of cancer.

Tumors also use the same process for their growth; however, it is possible to uniquely identify the growth factors that are responsible for tumor growth. In terms of proteomic features, angiogenesis and tumor angiogenesis proteins have mutual properties. Therefore, to meet the arduous challenge of distinctly identifying tumor angiogenesis proteins, a two-layered approach is adopted as shown in Figure 1.

The first layer of the model detects whether or not a protein is an angiogenesis protein, using the primary structure of that protein. If it is an angiogenesis protein, the second layer of the model is invoked to decide whether the angiogenesis protein can potentially cause cancer or not. The proposed workflow, shown in Figure 2, consists of the following five steps. First, a well-reviewed and experimentally tested dataset of angiogenesis proteins is collected and preprocessed to remove redundancies. Second, feature extraction is performed to transform the biological data into an equivalent mathematical matrix. In the third step, the obtained feature matrix is used to train the model for further prediction. In the fourth step, the model is evaluated for its accuracy, sensitivity, specificity, and MCC. In the fifth step, a webserver is developed.

2.1. Dataset Collection. The dataset was collected from the UniProt database using meticulously designed search parameters. UniProt is the Universal Protein Resource, which contains extensive information about protein sequences and their biological functions [22]. A dataset containing positive samples was composed for both angiogenesis and tumor angiogenesis using the UniProt keyword "Angiogenesis." Similarly, negative samples were also collected. UniProt has no keyword for "Tumor Angiogenesis" proteins. Nonetheless, they are contained within the set of angiogenesis proteins; therefore, tumor angiogenesis proteins were manually curated from the acquired dataset. Each sample within the dataset was manually analyzed for annotated proteomic properties and published evidence within the database to form a set of tumor angiogenesis proteins. Ambiguous samples were left out. After the collection of data from UniProt, the CD-HIT suite (http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi) was used to reduce the homology of data samples. Clustering of the angiogenesis and tumor angiogenesis datasets was performed by setting the sequence identity parameter at 60%. Ultimately, 761 positive and 2776 negative clusters were formed for the angiogenesis dataset. Similarly, 256 positive and 448 negative clusters were formed for the tumor angiogenesis dataset. A representative sequence was selected from each cluster to form the final dataset.

The benchmark dataset used in this work is expressed as

$$A = A^{+} \cup A^{-}$$

where $A^{+}$ represents the positive data samples of angiogenesis proteins and $A^{-}$ represents the negative data.

Also, the positive tumor angiogenesis samples are represented as $T^{+}$, and negative tumor angiogenesis proteins are represented as $T^{-}$, as shown in the equation below:

$$T = T^{+} \cup T^{-}$$

2.2. Feature Extraction. A robust and efficient methodology for transforming biological sequences into a numerical notation that a machine learning algorithm can consume is the most pivotal concept in the design of such predictive models [31, 40]. This conversion must keep the original information or features of the sequence intact in some numerical form. For this purpose, each primary sequence within the collected data is converted into a fixed-size vector. A feature vector of static length is formed which represents a primary sequence and remains essentially invariant to the length of the sequence [41]. Incorporation of such a transformation model is ideal as most of the state-of-the-art classifiers work with vectors [22, 42, 43]. A vector described in such a model may, however, lose the sequence-pattern information [44]. To address this problem, Chou's PseAAC was proposed, which is used by many scientists for the construction of genomic and proteomic prediction models and their applications [45, 46]. Later, this model was improved to provide a better correlation perspective among residues that is reflected in the feature coefficients. Let $P$ be a protein sequence of length $L$, which is represented as

$$P = R_1 R_2 R_3 \ldots R_L$$

where $R_i$ is an arbitrary residue of a polypeptide chain with length $L$. Feature extraction yields a vector with numerous numerical coefficients. This transformation from a variable-length polypeptide chain into a fixed-length feature vector is illustrated in the following equation:

$$\Delta(P) = [\Psi_1, \Psi_2, \ldots, \Psi_\Omega]$$

where $\Delta$ is the transformation function, $\Psi_i$ is an arbitrary coefficient, and $\Omega$ is the constant length of the feature vector [22, 31].

The proposed methodology builds on the use of statistical moments to form a numerical representation such that the obscured information within the primary structure of proteins stays intact. These moments form a succinct numerical form such that the original data can be reconstructed without any significant loss of information. Moments can be obtained up to several orders; each provides a deeper perspective into specific aspects of data like positioning, eccentricity, skewness, and peculiarity [31]. Mathematicians and statisticians have devised many moment-generating coefficients based on well-defined distribution functions and polynomials [35, 44].

In the proposed work, Hahn moments, raw moments, and central moments are organized to form a feature set. The Hahn moment bears location- and scale-oriented variance and is calculated based on the Hahn polynomial. Central moments carry information regarding asymmetry, mean, and variance. The central moments are derived about the centroid of the collective data, making these moments scale variant and location invariant. Raw moments, in turn, are scale and location variant and represent properties like asymmetry, variance, and mean.

A matrix $P'$ with $m \times m$ dimensions is formulated for a two-dimensional representation of the protein residues, where $m = \lceil \sqrt{L} \rceil$.
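The construction of the square matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alphabetical residue ordering, the integer coding (1 to 20), the row-by-row fill, and the zero padding are choices made here and are not taken from the mapping function of [47]; the helper name `sequence_to_matrix` is hypothetical.

```python
import math

# Assumed ordering of the 20 native residues; the paper does not specify one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_to_matrix(sequence):
    """Fill an m x m matrix row by row with integer residue codes (0 = padding)."""
    # Raises ValueError for residues outside the 20-letter alphabet.
    codes = [AMINO_ACIDS.index(residue) + 1 for residue in sequence]
    m = math.ceil(math.sqrt(len(codes)))  # m = ceil(sqrt(L))
    padded = codes + [0] * (m * m - len(codes))
    return [padded[row * m:(row + 1) * m] for row in range(m)]


# A 10-residue toy sequence gives m = 4, so six cells are padded with zeros.
matrix = sequence_to_matrix("MKTAYIAKQR")
for row in matrix:
    print(row)
```

Padding with zeros keeps the unused cells from contributing to any moment that weights cells by their value.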

The vector P is easily transformed into matrix P ′ by using a simple mapping function explained in [47] . The primary sequence is fitted into a two-dimensional matrix so that it could be formulated into the Hahn polynomial which is orthogonal. The same two-dimensional notation was used for deriving raw and central moments. The Hahn moment is computed using the Hahn polynomial as given below.

Central moments are computed using the equation given below.

The following equation is used to compute the raw moments.

In equations (7) and (8), s and t represent the order of the moments. The orthogonality of the Hahn moments makes them particularly useful, as their inverse functions can be used to reconstruct data. Detailed explanation and use of these notations can be found in [48].
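As a rough illustration of the raw and central moments described above, the sketch below uses the standard two-dimensional definitions, which are assumed here because the equations themselves are not reproduced: a raw moment of order (s, t) sums $p^s q^t$ times the cell value, and a central moment does the same about the centroid. The zero-based indices and the function names are assumptions; the Hahn moments are not covered.

```python
def raw_moment(matrix, s, t):
    """Raw moment of order (s, t): sum of p**s * q**t * value over all cells."""
    return sum(
        (p ** s) * (q ** t) * value
        for p, row in enumerate(matrix)
        for q, value in enumerate(row)
    )


def central_moment(matrix, s, t):
    """Central moment of order (s, t), taken about the centroid of the data."""
    total = raw_moment(matrix, 0, 0)  # must be non-zero
    p_bar = raw_moment(matrix, 1, 0) / total
    q_bar = raw_moment(matrix, 0, 1) / total
    return sum(
        ((p - p_bar) ** s) * ((q - q_bar) ** t) * value
        for p, row in enumerate(matrix)
        for q, value in enumerate(row)
    )


example = [[1, 2], [3, 4]]
print(raw_moment(example, 0, 0))      # total mass of the matrix
print(central_moment(example, 2, 0))  # spread along the row index
```

By construction, the first-order central moments vanish, which is what makes central moments insensitive to a shift in location.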

Determination. The cumulative frequency of occurrence of each specific amino acid residue is furnished into a frequency vector. Information about the distribution of amino acid residues within the primary sequence is summarized into this frequency vector, which is represented as

$$\{f_1, f_2, \ldots, f_{20}\}$$

where f i refers to the frequency of occurrence of an arbitrary distinct amino acid residue.
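A frequency vector of this kind can be sketched as follows. The alphabetical ordering of the 20 residues and the function name are assumptions made for illustration only.

```python
from collections import Counter

# Assumed ordering of the 20 native residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_vector(sequence):
    """Count how often each of the 20 residues occurs in the sequence."""
    counts = Counter(sequence)
    return [counts[residue] for residue in AMINO_ACIDS]


freq = frequency_vector("MKTAYIAKQR")
print(freq)
```

The vector always has 20 entries regardless of sequence length, and its entries sum to the length of the sequence when only native residues are present.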

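The primary sequence of the proteins forms the basis for formulating feature vectors that capture properties of the primary structure that are otherwise obscure. Information pertaining to the position relative incidence of arbitrary protein residues is formulated as a matrix of size (20 × 20). The Position Relative Incidence Matrix (PRIM) is illustrated as

$$X_{PRIM} = \begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,20} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ X_{20,1} & X_{20,2} & \cdots & X_{20,20} \end{bmatrix}$$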

The sum of the relative position of the $j$th protein residue corresponding to the first occurrence of the $i$th residue is computed in the above matrix and given as $X_{i,j}$. The matrix contains all the possible permutations for such occurrences as explained in [48].

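Matrix (RPRIM). More obscure features of the primary sequence are uncovered with the help of the Reverse Position Relative Incidence Matrix (RPRIM). The RPRIM is obtained by forming the PRIM of the reversed primary sequence. $X_{RPRIM}$ is illustrated as

$$X_{RPRIM} = \begin{bmatrix} R_{1,1} & R_{1,2} & \cdots & R_{1,20} \\ R_{2,1} & R_{2,2} & \cdots & R_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ R_{20,1} & R_{20,2} & \cdots & R_{20,20} \end{bmatrix}$$

where $R_{i,j}$ is an arbitrary element of $X_{RPRIM}$.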
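Both incidence matrices can be sketched with a single helper, since the RPRIM is the PRIM of the reversed sequence. The sketch below follows one possible reading of the definition: for each residue type $i$, offsets are measured from its first occurrence, and only residues that appear after that position contribute. That reading, the residue ordering, and the function name `prim` are assumptions; the exact convention is the one given in [48].

```python
# Assumed ordering of the 20 native residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {residue: k for k, residue in enumerate(AMINO_ACIDS)}


def prim(sequence):
    """20 x 20 matrix of summed offsets from the first occurrence of each residue."""
    size = len(AMINO_ACIDS)
    matrix = [[0] * size for _ in range(size)]
    first_seen = {}
    for position, residue in enumerate(sequence):
        first_seen.setdefault(residue, position)
    for residue_i, start in first_seen.items():
        row = matrix[INDEX[residue_i]]
        # Assumption: only residues after the first occurrence contribute.
        for position in range(start + 1, len(sequence)):
            row[INDEX[sequence[position]]] += position - start
    return matrix


sequence = "ACAC"
x_prim = prim(sequence)
x_rprim = prim(sequence[::-1])  # RPRIM: PRIM of the reversed sequence
print(x_prim[0][:2], x_prim[1][:2])
```

Rows for residue types that never occur in the sequence stay at zero, so the matrix shape is fixed at 20 × 20 for any input length.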

2.7. Accumulative Absolute Position Incidence Vector (AAPIV) Calculation. The AAPIV holds, for each native amino acid, the sum of all the positions at which it occurs within the primary sequence; hence, it bears a length of 20 and is denoted as

Any $i$th element of the above vector is computed as

$$\sum_{k=1}^{n} P_k$$

where $P_k$ is the position of the $k$th occurrence of a native amino acid while $n$ is its frequency of occurrence. All the above-defined features are aggregated to form a feature vector. The dimensionality of $P'$, $X_{PRIM}$, and $X_{RPRIM}$ is reduced by computing their Hahn, central, and raw moments. Ultimately, a fixed-size feature vector is formed to represent primary structures of varied lengths.
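The accumulative absolute position incidence vector described above can be sketched as follows. Positions are counted from 1 here, and the residue ordering and function name are assumptions made for illustration.

```python
# Assumed ordering of the 20 native residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aapiv(sequence):
    """For each residue type, sum the 1-based positions at which it occurs."""
    totals = dict.fromkeys(AMINO_ACIDS, 0)
    for position, residue in enumerate(sequence, start=1):
        totals[residue] += position
    return [totals[residue] for residue in AMINO_ACIDS]


vector = aapiv("ACAC")
print(vector[:2])
```

In this toy sequence, A occurs at positions 1 and 3 and C at positions 2 and 4, so the first two entries are the sums of those positions.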

After extraction of feature vectors from positive as well as negative sequences, the data is used to train classifiers. A diverse set of widely used classifiers was employed for this purpose, including random forest, neural network, and support vector machine. Comparing the results yielded by each classifier enables the identification of the most suitable classifier with the highest accuracy.

3.1. Random Forest. The random forest (RF) classifier was trained at two levels for the prediction of angiogenesis and tumor angiogenesis proteins. At the first level, the classifier was used to distinguish angiogenesis from nonangiogenesis proteins, while at the second level the angiogenesis protein was passed through another classifier to identify whether the protein is tumor causing or not. The random forest is a very powerful classifier used for classification and regression problems [49, 50]. It first builds a collection of decision trees from the data [23, 51]. Each tree then predicts a class, and the class with the most votes becomes the model's prediction [41], as illustrated in Figure 3.

3.2. Artificial Neural Network (ANN). Subsequently, the artificial neural network (ANN) was also similarly employed at two levels. ANN has interconnected layers of neurons [52]. The connectionist architecture of the backpropagation network is illustrated in Figure 4. The ANN mechanism used is based on a feedforward network and uses the backpropagation algorithm to reduce error. An input layer is clamped to the input feature vectors. A hidden layer receives the outputs of the input-layer neurons and forms the main processing unit of the whole network. The activation unit of ANN sums all preceding weighted inputs in addition to bias values [23, 31]. The output of the 3-layer feedforward network with error backpropagation is represented by

where the input layer has $k$ neurons and the hidden layer has $h$ neurons. Partial output calculated by the $m$th neuron in the network is denoted by $O_m$. Supposing that an arbitrary node receives an input $I_a$, then $W_{xy}$ represents the weight of the edge connecting node $x$ to node $y$. Similarly, $W_{ym}$ represents the weight of the $y$th node connected to an arbitrary output layer neuron $m$. The classical sigma function which determines the activation of neurons is denoted as $f$ in

Actual activated levels in the output units are compared with the target output for every training iteration. The error rate hence observed is denoted by ∈ and is calculated by the difference between the expected output and actual activated output given as

where O i is the target output, P i is the actual calculated output by the network, and o is the number of neurons in the output layer. The gradient descent method is used to minimize the error rate. The error generated at the output layer is sent back to the input layer. The set of all the weights is represented by a vector V. The backpropagation procedure selects a differential ΔV such that it lessens the error. This is continued iteratively until convergence is achieved as shown below: where

This equation shows a change in weight at time t + 1, and a positive constant η signifies the learning rate usually set between 0 and 1. The change in weights is expressed as

Here, ΔV u,v shows the minimal ∈ weight among the u th and v th neurons in the i th iteration. This procedure is followed in both backward and forward passes of input signals. It is a lightweight procedure that consumes less memory space, and it is extensively used for the training of ANN. Patterns are repetitively offered to the network to train it and to make it capable of minimizing the mean square error (MSE) as shown in

The actual output received at the i th neuron of the output layer is represented as O o i , and P o i represents the expected value where the total number of input samples is n and there are k output neurons.
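The training loop described above can be sketched in plain Python. This is a minimal illustration, not the network used in the study: it assumes a logistic sigmoid activation, a single output neuron, per-sample weight updates, and a toy logical-AND dataset; the hidden-layer size, learning rate, and epoch count are arbitrary, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(sample, w_hidden, w_output):
    """Return the hidden activations and the single output activation."""
    inputs = sample + [1.0]  # the constant 1.0 acts as the bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, inputs))) for ws in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_output, hidden + [1.0])))
    return hidden, output


def mse(samples, targets, w_hidden, w_output):
    """Mean square error between targets and network outputs."""
    errors = [(t - forward(x, w_hidden, w_output)[1]) ** 2
              for x, t in zip(samples, targets)]
    return sum(errors) / len(errors)


def train(samples, targets, n_hidden=3, eta=0.5, epochs=2000, seed=0):
    """Gradient descent with error backpropagation; eta is the learning rate."""
    rng = random.Random(seed)
    n_inputs = len(samples[0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            hidden, output = forward(x, w_hidden, w_output)
            # Error signal at the output, scaled by the sigmoid derivative.
            delta_out = (target - output) * output * (1.0 - output)
            # Propagate the error back to each hidden neuron.
            delta_hidden = [delta_out * w_output[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            for j, value in enumerate(hidden + [1.0]):
                w_output[j] += eta * delta_out * value
            for j in range(n_hidden):
                for i, value in enumerate(x + [1.0]):
                    w_hidden[j][i] += eta * delta_hidden[j] * value
    return w_hidden, w_output


samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [0.0, 0.0, 0.0, 1.0]  # toy logical AND
before = mse(samples, targets, *train(samples, targets, epochs=0))
after = mse(samples, targets, *train(samples, targets))
print(before, after)
```

Calling `train` with `epochs=0` returns the untrained weights for the same seed, which gives a baseline error to compare against the trained network.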

A support vector machine (SVM) is a machine learning classifier that is used in classification and regression problems. SVM works by attempting to fit a hyperplane in an N-dimensional space, where N represents the number of feature elements that distinctly represent the samples. Hyperplanes are simple decision boundaries that classify the data points; the data points lie on both sides of the hyperplane, which ideally partitions the different classes. The hyperplane is optimally adjusted by means of support vectors. Figure 5 illustrates points on either side of the hyperplane belonging to different classes, namely, class A and class B.

In the current study, the dataset was constructed on two levels. The first level uses 785 positive and 2776 negative samples regarding angiogenesis proteins whereas the second level encompasses 256 positive and 448 negative samples for tumor angiogenesis proteins. A feature vector input matrix (FIM) was formed for both angiogenesis and tumor angiogenesis datasets separately. Every single row of FIM is a feature vector that represents a single data sample. Also, an Expected Output Matrix (EOM) was formed corresponding to FIM. All the classifiers were trained using both FIM and EOM. FIM was given as an input for training the model where EOM was used to compute errors and retrain until convergence is achieved [23, 31, 43, 45] .

All the classifiers were implemented in Python version 3.6 using the scikit-learn API. Subsequently, the results gathered using this framework were rigorously analyzed in terms of their performance parameters.
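The two-level decision logic can be sketched independently of any particular classifier. In the sketch below, `two_level_predict` and `ThresholdStub` are hypothetical names, and the stub classifiers with their cut-offs are invented for illustration; any fitted estimator that exposes a `predict` method over a list of rows, such as scikit-learn's `RandomForestClassifier`, could presumably be substituted.

```python
def two_level_predict(level_one, level_two, feature_vector):
    """Run the second classifier only when the first one reports a positive."""
    if level_one.predict([feature_vector])[0] != 1:
        return "non-angiogenesis"
    if level_two.predict([feature_vector])[0] == 1:
        return "tumor angiogenesis"
    return "angiogenesis"


class ThresholdStub:
    """Hypothetical stand-in classifier: positive when one feature exceeds a cut-off."""

    def __init__(self, feature_index, cutoff):
        self.feature_index = feature_index
        self.cutoff = cutoff

    def predict(self, rows):
        return [1 if row[self.feature_index] > self.cutoff else 0 for row in rows]


level_one = ThresholdStub(feature_index=0, cutoff=0.5)  # angiogenesis or not
level_two = ThresholdStub(feature_index=1, cutoff=0.5)  # tumor related or not
rows = ([0.1, 0.9], [0.8, 0.2], [0.9, 0.7])
labels = [two_level_predict(level_one, level_two, row) for row in rows]
print(labels)
```

Keeping the two levels as separate objects means each can be trained on its own dataset, mirroring the separate angiogenesis and tumor angiogenesis datasets.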

A major design issue regarding the design of a new prediction model is to set up some parameters to measure its accuracy. Researchers have predominantly used four descriptive metrics for performance analysis. These metrics are as follows:

(1) Sp measures the specificity, which quantifies the ability of the model to identify negative samples accurately [46]

(2) Sn measures the sensitivity, which represents the accuracy in predicting positive data samples

(3) Acc is used to measure the overall accuracy of the model

(4) MCC is for measuring the stability of the model

The following formulation is used to quantify these metrics.

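$$Sn = \frac{TP}{TP + FN}$$

$$Sp = \frac{TN}{TN + FP}$$

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$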

where true negatives are represented by TN, true positives are represented by TP, false positives are represented by FP, and false negatives are represented by FN [43, 53, 54]. Unfortunately, the form of equations (21), (22), (23), and (24) is somewhat cryptic for biologists [55]. A more intuitive format has been suggested by scientists in [56, 57], and their modifiers were introduced in [47]. The symbols used to represent these equations are $N^{+}$, $N^{-}$, $N^{+}_{-}$, and $N^{-}_{+}$. An explanation of these representations is given in Table 1.
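The four metrics, expressed in terms of TP, TN, FP, and FN, can be computed with a short helper. The standard definitions are assumed, the function name is hypothetical, and the counts in the usage line are invented for illustration and are not results from this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion counts."""
    sn = tp / (tp + fn)  # fraction of positives recovered
    sp = tn / (tn + fp)  # fraction of negatives recovered
    acc = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Invented counts, for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```

Returning 0.0 for a zero denominator is a common convention rather than part of the definition; the helper still raises an error if there are no positive or no negative samples at all.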

Hence, these metrics are also calculated as

Testing is another important factor for the validation of the predicting models [22, 31, 42, 45] . The validation phase encompasses four most commonly used tests discussed below.

4.2.1. Self-Consistency. The self-consistency test is the most trivial and intuitive of the tests. A trained model is simply tested on the dataset that was used to train it. The capability of a model to learn from a given dataset is underscored by this basic but useful evaluating benchmark. Good results merely indicate that the classifier has the ability to find obscure patterns within the training data. Self-consistency testing was performed on the angiogenesis and tumor angiogenesis datasets upon which the proposed model was trained. Results obtained from self-consistency tests are illustrated in Table 2, showing the overall performance of the proposed model. The results indicate that the random forest classifier has the best capability to learn and decipher obscure patterns that peculiarly characterize each sample.

The cross-validation technique is used when unknown data for testing is not readily available [45, 58] . The dataset is randomly divided into multiple partitions or folds spanning over a comprehensive sample space hence rendering cross-validation as a rigorous test. Partitions are devised in a manner such that they are disjointed from each other and are comparable in size. A partition is left out while the model is trained on the rest of the data. Once the model is fully trained, the left-out partition is used as unknown data to test the model. These steps are recapitulated for each fold. The overall accuracy of the model for the cross-validation test is reported by taking the mean of accuracy yielded against each fold.

Cross-validation tests were performed by partitioning the benchmark dataset into 5-folds and 10-folds. Table 3 depicts the results of the test.
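The fold-wise procedure described above can be sketched without any library. The sketch assumes caller-supplied `fit` and `predict` callables; the midpoint-threshold classifier and the ten-sample dataset are toy stand-ins invented for illustration, and all function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds of comparable size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def cross_validate(features, labels, k, fit, predict):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(features), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(features)) if i not in held_out]
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


def fit_midpoint(xs, ys):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    positives = [x for x, y in zip(xs, ys) if y == 1]
    negatives = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def predict_midpoint(threshold, x):
    return 1 if x > threshold else 0


features = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
mean_accuracy = cross_validate(features, labels, k=5,
                               fit=fit_midpoint, predict=predict_midpoint)
print(mean_accuracy)
```

Dealing the shuffled indices round-robin keeps fold sizes within one sample of each other, which matches the requirement that partitions be disjoint and comparable in size.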

The random forest exhibits the best results at both levels with an accuracy of 99.7% for the identification of angiogenesis proteins and an accuracy of 99.5% for the identification of tumor angiogenesis proteins.

Testing. Jackknife testing is the most rigorous testing methodology. In each iteration, it leaves out a single sample while the model is trained on the rest. After sufficient training, the model is tested with the left-out sample. This process exhaustively proceeds for all data samples. Hence, this test is repeated N times, where N represents the size of the overall dataset. In every iteration, the testing data sample is different, so all samples are tested exactly once. This technique is the most rigorous which also makes it slower [59] [60] [61] [62] [63] . After successfully training and testing, the number of true positive, false positive, true negative, and false negative was obtained [55] .

Since the sample is tested exactly once, therefore the overall accuracy obtained for this test remains unique [31, 40, 45, 46] . RF results illustrated in Table 4 for angiogenesis and tumor angiogenesis proteins portray higher accuracies and are reported as 99.3% and 99.7%, respectively, in comparison with other classifiers.
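Jackknife testing amounts to leaving each sample out once and tallying the outcome. The sketch below assumes a toy 1-nearest-neighbour classifier and a six-sample dataset invented for illustration; the function names are hypothetical, and the classifier merely stands in for RF, ANN, or SVM.

```python
def nearest_neighbor_label(train_x, train_y, query):
    """Toy classifier: label of the closest training sample (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in train_x]
    return train_y[distances.index(min(distances))]


def jackknife_counts(features, labels):
    """Leave each sample out once and tally TP, TN, FP, FN over all N iterations."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = nearest_neighbor_label(train_x, train_y, features[i])
        if predicted == 1:
            counts["TP" if labels[i] == 1 else "FP"] += 1
        else:
            counts["FN" if labels[i] == 1 else "TN"] += 1
    return counts


features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
counts = jackknife_counts(features, labels)
print(counts)
```

Because every sample is tested exactly once, the four counts always sum to the dataset size, and the procedure has no random component that could change the outcome between runs.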

Testing. The independent test evaluates how well a model performs on unknown data. Initially, the data is partitioned such that the larger partition is used for training and the left-out partition is used as unknown data for testing. Once the model is completely trained, independent set testing is performed using the left-out data. An independent set needs to be formulated carefully such that the training data encompasses comprehensive obscure patterns and the test data thoroughly queries the ability of the model to decipher these patterns. Otherwise, testing results may be ambiguous. The overall accuracies of the RF, ANN, and SVM classifiers obtained from independent testing are presented in Table 5.

The random forest shows the best results as compared to ANN and SVM classifiers at both levels for the identification of angiogenesis as well as tumor angiogenesis proteins while the performance of the ANN classifier is better than that of the SVM classifier.

Working with classification models makes performance measurement an essential task, which is usually quantified using classification scores. However, such scores alone are not suitable when dealing with datasets that have heavy class imbalance. In such cases, Receiver Operating Characteristic (ROC) curves provide a graphical view along with a quantitative analysis of the overall scenario. ROC analysis is a prevalent method for evaluating classification models. The ROC curve is plotted by mapping the True Positive Rate (TPR) against the False Positive Rate (FPR). It depicts how well the model is capable of distinguishing among classes. TPR is plotted along the y-axis while FPR is plotted along the x-axis. Various testing techniques were applied to gauge the effectiveness of the classifiers as discussed earlier. To rank the classifiers based on efficiency, a comparison is depicted through ROC curves. Figure 6 represents the comparison based on testing performed in the previous section. Figures 6-10 show that RF yields the best results in comparison with ANN and SVM. The RF curve encompasses an area close to 1, implying that the model has the best measure of separability. The graphical representations show that RF and ANN both exhibit better results than SVM. However, in the case of jackknife testing, the accuracy of the SVM classifier is higher than that of ANN, as illustrated in Figure 10.
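The construction of a ROC curve and its area can be sketched as follows. The scores and labels are invented for illustration, the function names are hypothetical, and the area is computed with the trapezoidal rule, which is one common choice.

```python
def roc_curve_points(scores, labels):
    """Sweep the decision threshold from high to low and collect (FPR, TPR) pairs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for position, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample of a group of tied scores.
        if position + 1 == len(ranked) or ranked[position + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented classifier scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
auc = area_under_curve(roc_curve_points(scores, labels))
print(auc)
```

In this toy example one positive is ranked below one negative, so eight of the nine positive-negative pairs are ordered correctly and the area is 8/9.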

A similar comparison is performed for classifiers at the second level which predicts tumor angiogenesis proteins. 

Formulation of a robust dataset and feature extraction methodology forms the foundation of a computationally intelligent model for efficient prediction of uncategorized proteomic sequences. However, the availability of such a tool is also of extreme importance so that the research community can benefit from it [45]. To make a novel predictor available to all users and biologists around the globe, there is a need for a user-friendly and publicly accessible webserver. In the final step of Chou's 5-step rule, a webserver is devised for this purpose [48]. The webserver enables scientists and biologists to easily access and utilize such prediction applications without getting into the complex mathematical details. The webserver for the proposed work will soon be made available. Meanwhile, its code has been made available along with a readme file at https://github.com/RabiaKhan-94/Thesis_WebServer.git, which can be easily set up by an intermediate-level Python developer.

This study proposes a prediction model for the classification of angiogenesis and tumor angiogenesis proteins. A robust, well-defined methodology was adopted for dataset collection. Duplicate and redundant data were removed, and homologous sequences up to 60% were excluded. Variable-length proteomic sequences were transformed into fixed-length feature vectors using a position- and composition-based technique. Position-relative information was further condensed into a succinct form using statistical moments. Three classifiers, random forest (RF), artificial neural network (ANN), and support vector machine (SVM), were used to find the best results. All of these algorithms are powerful, robust, and well understood. RF and ANN can deal with linear as well as complicated nonlinear problems. The current study reveals that RF showed the best results among these classification approaches. As a result of cross-validation, RF exhibited an accuracy of 97.8% for angiogenesis proteins and an accuracy of 99.5% for tumor angiogenesis, whereas ANN showed an accuracy of 99.1% for angiogenesis and 79.9% for tumor angiogenesis. Additionally, the accuracy of SVM for angiogenesis was 78.8%, and for tumor angiogenesis, it was 65.19%. The approaches thus performed differently, and the results exhibited by RF are better than those of ANN and SVM. In addition, the random forest takes less time to train than the neural network. Another important strength of RF is that it is less susceptible to overfitting, which is not the case with a neural network. The robustness of the feature extraction technique plays a significant role in the overall accuracy of the model. Feature extraction uncovers obscure features pertinent to the composition and sequence of the primary structures. The meticulously collected data helps the model to produce better results.
The in silico nature of the model makes it an alluring opportunity as it is timely and cost-effective. Biologists and scientists can greatly benefit from the proposed tool for the characterization of proteins and understand their role in angiogenesis and tumor angiogenesis processes. Furthermore, the model can prove to be effective in identifying the biomarkers that cause a tumor. Additionally, it augments the work of biologists and scientists in research aimed at finding new treatments and discovering new drugs.
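
The transformation of variable-length sequences into fixed-length feature vectors mentioned in this discussion can be illustrated with a deliberately simplified Python sketch. It is not the feature extraction used in the study: it only combines the amino acid composition with the mean relative position of each residue (a first-order raw moment), and the function name `feature_vector` and the 40-element layout are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def feature_vector(sequence):
    """Map a protein sequence of any length to a 40-element vector.

    First 20 values: relative frequency of each residue (composition).
    Last 20 values: mean relative position of each residue, i.e. the
    first raw moment of its positions scaled to (0, 1]; 0.0 if absent.
    """
    sequence = sequence.upper()
    n = len(sequence)
    if n == 0:
        raise ValueError("empty sequence")
    counts = {aa: 0 for aa in AMINO_ACIDS}
    position_sums = {aa: 0.0 for aa in AMINO_ACIDS}
    for index, residue in enumerate(sequence, start=1):
        if residue in counts:  # skip non-standard symbols such as X
            counts[residue] += 1
            position_sums[residue] += index / n
    composition = [counts[aa] / n for aa in AMINO_ACIDS]
    mean_position = [
        position_sums[aa] / counts[aa] if counts[aa] else 0.0
        for aa in AMINO_ACIDS
    ]
    return composition + mean_position


# Sequences of different lengths yield vectors of the same length.
short_vec = feature_vector("ACCA")
long_vec = feature_vector("MKVLAAGIVGLLLAQ" * 10)
print(len(short_vec), len(long_vec))
```

Because every sequence maps to the same number of values regardless of its length, the resulting vectors can be fed directly to classifiers such as RF, ANN, or SVM.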

Tumor-causing angiogenesis proteins are important biomarkers for the onset of cancer. Timely identification of these proteins can help in the treatment and possible cure of the disease. This study proposes a robust in silico technique for the identification of tumor angiogenesis using a two-level predictor. The first level indicates whether a protein is an angiogenesis protein or not while the second level identifies whether the given protein is responsible for tumor angiogenesis or not. A mature feature extraction technique was used to gather features for the benchmark dataset. Classifiers like RF, SVM, and ANN were trained using the resultant feature vectors. Once the models are thoroughly trained, they are rigorously tested using test methods like k-fold cross-validation, self-consistency, independent set testing, and jackknife testing. The random forest classifier showed 99.3% accuracy for angiogenesis and 99.7% for tumor angiogenesis, and ANN showed an overall 96.23% accuracy for angiogenesis and 95% for tumor angiogenesis. On the other hand, SVM showed 78.65% accuracy for angiogenesis and 65.19% for tumor angiogenesis.
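
The two-level decision flow summarized above can be sketched in a few lines of Python. The function `predict_two_level` and the threshold-based stand-in classifiers are hypothetical and serve only to show the control flow; in the proposed model, trained classifiers such as RF would take their place.

```python
def predict_two_level(features, level_one, level_two):
    """Cascade two binary classifiers.

    level_one(features) -> True if the protein is an angiogenesis protein.
    level_two(features) -> True if it is linked with tumor angiogenesis.
    The second classifier is consulted only when the first returns True.
    """
    if not level_one(features):
        return "non-angiogenesis"
    if level_two(features):
        return "tumor angiogenesis"
    return "angiogenesis (not tumor-linked)"


# Hypothetical stand-in classifiers: simple thresholds on a
# two-value feature vector, used here instead of trained models.
def toy_level_one(features):
    return features[0] > 0.5


def toy_level_two(features):
    return features[1] > 0.5


for sample in ([0.2, 0.9], [0.8, 0.1], [0.8, 0.9]):
    print(sample, "->", predict_two_level(sample, toy_level_one, toy_level_two))
```

Keeping the two levels as separate classifiers means that the second model only has to separate tumor-linked from other angiogenesis proteins and never sees sequences rejected at the first level.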

Advanced drug therapies and treatments integrate the use of ligands that target tumor angiogenesis proteins to inhibit them. Inhibition of these tumor growth factors disrupts the growth of the tumor, and in some cases, the tumor even dies out. Tools that support the discovery and identification of tumor angiogenesis proteins greatly help cancer researchers to identify these growth factors in a timely and cost-effective manner. Once such tumor growth factors have been uncovered, there is a continuing need to identify ligands that can inhibit them. In silico models that simulate ligand binding with tumor growth factors can also greatly enhance tumor research. Further, in the future, the proposed model can be made more adaptive by incorporating updated data and using deep learning features.

Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection

Les petites brûlures

A qualitative analysis of a free boundary problem modeling tumor growth with angiogenesis

Angiogenesis inhibitors

TargetAntiAngio: a sequence-based tool for the prediction and analysis of anti-angiogenic peptides

Application of nanotechnology to target tumor angiogenesis in cancer therapeutics

Multiscale modeling reveals angiogenesis-induced drug resistance in brain tumors and predicts a synergistic drug combination targeting EGFR and VEGFR pathways

Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries

Angiogenesis

Association between angiogenesis and cytotoxic signatures in the tumor microenvironment of gastric cancer

Angiogenesis in brain tumours

Cancer Research UK

Tumor-targeted nanomedicines: enhanced antitumor efficacy in vivo of doxorubicin-loaded, long-circulating liposomes modified with cancer-specific monoclonal antibody

A review on the effects of current chemotherapy drugs and natural agents in treating non-small cell lung cancer

Chemotherapeutic drugs sensitize cancer cells to TRAIL-mediated apoptosis: up-regulation of DR5 and inhibition of Yin Yang 1

Cancer and radiation therapy: current advances and future directions

Mebendazole potentiates radiation therapy in triple-negative breast cancer

New approaches to treat cancer - what they can and cannot do

Drug delivery to brain tumors

Predicting cancer outcomes from histology and genomics using convolutional networks

Cancer hallmark text classification using ConvNets

SPrenylC-PseAAC: A sequence-based model developed via Chou's 5-steps rule and general PseAAC for identifying S-prenylation sites in proteins

IPhosY-PseAAC: identify phosphotyrosine sites by incorporating sequence statistical moments into PseAAC

iGluK-deep: computational identification of lysine glutarylation sites using deep neural networks with general pseudo amino acid compositions

iHyd-LysSite (EPSV): identifying hydroxylysine sites in protein using statistical formulation by extracting enhanced position and sequence variant feature technique

Optimization of serine phosphorylation prediction in proteins by comparing human engineered features and deep representations

NPalmitoylDeep-PseAAC: a predictor of N-palmitoylation sites in proteins using deep representations of proteins and PseAAC via modified 5-steps rule

Identification of 4-carboxyglutamate residue sites based on position based statistical feature and multiple classification

DPP-PseAAC: a DNA-binding protein prediction model using Chou's general PseAAC

Propy: a tool to generate various modes of Chou's PseAAC

A novel alignment-free method to classify protein folding types by combining spectral graph clustering with Chou's pseudo amino acid composition

Predicting subcellular localization of multi-label proteins by incorporating the sequence features into Chou's PseAAC

IRSpot-ADPM: identify recombination spots by incorporating the associated dinucleotide product model into Chou's pseudo components

Predicting protein subchloroplast locations with both single and multiple sites via three different modes of Chou's pseudo amino acid compositions

Some remarks on protein attribute prediction and pseudo amino acid composition

Prediction of protein cellular attributes using pseudo-amino acid composition

Improved DNA-binding protein identification by incorporating evolutionary information into the Chou's PseAAC

PSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach

Situation recognition using image moments and recurrent neural networks

Prediction of N-linked glycosylation sites using position relative features and statistical moments

Random forest dissimilarity based multi-view learning for radiomics application

Predicting the protein structure using random forest approach

Evaluating multiple classifiers for stock price direction prediction

Pectin extraction from Helianthus annuus (sunflower) heads using RSM and ANN modelling by a genetic algorithm approach

BP neural network could help improve pre-miRNA identification in various species

AntiAngioPred: a server for prediction of anti-angiogenic peptides

Enhanced artificial neural network for protein fold recognition and structural class prediction

MFSC: multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou's PseAAC components

Predicting membrane proteins and their types by extracting various sequence features into Chou's general PseAAC

ISNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins

Naïve Bayes classifier with feature selection to identify phage virion proteins

A prediction model for membrane proteins using moments based features

UbiSitePred: a novel method for improving the accuracy of ubiquitination sites prediction by using LASSO to select the optimal Chou's pseudo components

Using Chou's 5-steps rule to predict O-linked serine glycosylation sites by blending position relative features and statistical moment

iPhosD-PseAAC: identification of phosphoaspartate sites in proteins using statistical moments and PseAAC

iTSP-PseAAC: identifying tumor suppressor proteins by using fully connected neural network and PseAAC

A sequence-based predictor of Zika virus proteins developed by integration of PseAAC and statistical moments

Sequence-based identification of allergen proteins developed by integration of PseAAC and statistical moments via 5-step rule

iSUMOK-PseAAC: prediction of lysine sumoylation sites using statistical moments and Chou's PseAAC

ProtoPred: advancing oncological research through identification of protooncogene proteins

Evaluating machine learning methodologies for identification of cancer driver genes

Prediction of Saudi Arabia SARS-COV 2 diversifications in protein strain against China strain

Identification of antimicrobial peptides using Chou's 5 step rule

This project was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University (https://www.kau.edu.sa/), Jeddah (under grant no. G:160-611-1441). The authors, therefore, acknowledge with thanks DSR technical and financial support.

The authors declare that they have no conflicts of interest to report regarding the present study.