key: cord-0887474-k935v0z9 authors: Beltrán Lissabet, Jorge Félix; Belén, Lisandra Herrera; Farias, Jorge G. title: AntiVPP 1.0: A portable tool for prediction of antiviral peptides date: 2019-02-19 journal: Comput Biol Med DOI: 10.1016/j.compbiomed.2019.02.011 sha: 8d8b127167f5e5a374b47da5bf596fae8e7c59d1 doc_id: 887474 cord_uid: k935v0z9

Viruses are worldwide pathogens with a high impact on the human population. Despite constant efforts to fight viral infections, there is a need to discover and design new drug candidates. Antiviral peptides are molecules with confirmed activity and constitute excellent alternatives for the treatment of viral infections. In the present study, we developed AntiVPP 1.0, an accurate bioinformatic tool that uses the Random Forest algorithm for antiviral peptide prediction. The AntiVPP 1.0 model uses several features of 1088 peptides for training and validation. During validation, the model achieved TPR = 0.87, SPC = 0.97, ACC = 0.93 and MCC = 0.87, which is indicative of a robust model. AntiVPP 1.0 is a fast, accurate and intuitive software tool for the assessment of antiviral peptide candidates. AntiVPP 1.0 is available at https://github.com/bio-coding/AntiVPP.

Viruses are ancient and ubiquitous pathogens that cause high rates of infection and mortality in the human population [1]. The evolutionary success of viruses rests on three general attributes: genetic variation, the variety of their transmission routes, and their efficient replication and persistence within host cells [2, 3]. Because of these attributes, controlling viral diseases has never been an easy task [4]. Although antiviral drugs exist, novel antiviral compounds must be explored in order to control emerging viral pathogens [4, 5].
In recent decades, peptides have become increasingly important in the design and delivery of drugs. Research in this area focuses on developing and refining techniques to design and identify synthetic and natural peptides as drug candidates [1, 6]. Antiviral peptides (AVPs) are known to act against various types of viruses and can come from synthetic combinatorial libraries or from segments of natural proteins [5, 6]. AVPs have shown activity in different scenarios, e.g. Enfuvirtide (also known as T20), the first peptide inhibitor approved by the FDA against HIV-1 [7]. Antiviral activity has also been reported against other viruses, e.g. rabies virus [8], HCV [9], influenza A virus subtypes H1N1, H3N2, H5N1, H7N7 and H7N9, SARS-CoV and MERS-CoV [10], among others. Several databases now contain collections of AVPs, among them AVPpred [11], APD3 [12], CAMPR3 [13] and HIPdb [14], which constitute excellent opportunities for the development of computational tools focused on the prediction of these molecules. However, unlike the development of bioinformatics tools for the prediction of antimicrobial peptides (bacteria, fungi, animal cells) [15], the development of in silico tools for the prediction of AVPs has remained scarcely explored [11]. Currently, there are only three methods for predicting AVPs. The first is the AVPpred server, which uses a support vector machine (SVM) for its predictions [11]. The second is based on the Random Forest (RF) algorithm, and the resulting model showed better performance in the prediction of AVPs than AVPpred [16]. However, no software implementing this model is available, so researchers outside the field of machine learning cannot use it to carry out prediction tasks. The third method, AVP-IC50Pred, was developed by Qureshi and coworkers.
AVP-IC50Pred is a regression-based method that applies multiple machine learning algorithms to experimentally validated datasets [17].

In this work, we have developed a user-friendly and portable software tool based on the RF algorithm for the prediction of AVPs, with excellent performance measures.

To carry out this study, the data set reported by Thakur et al. was selected [11]. For training of the model, the data set T544p+544n* was used (a total of 1088 peptides). 544p corresponds to a collection of 544 antiviral peptides with experimentally validated activity, while 544n* corresponds to 544 non-experimental negative peptides, which have been used in the development of prediction models of antiviral peptides [11, 16]. For validation of the model, the independent data set V60p+60n* was selected, composed of 60 peptides with experimentally validated activity (V60p) and 60 non-experimental negative peptides (60n*) (a total of 120 peptides). The workflow for training and validating the model is shown in Fig. 1.

For this study, the following features were evaluated: net charge [18], number of hydrogen bond donors [19], molecular weight [20] and hydropathy index [21]. The composition of charged (DEKHR), aliphatic (ILV), aromatic (FHWY), polar (DERKQN), neutral (AGHPSTY), hydrophobic (CVLIMFW), positively charged (HKR), negatively charged (DE), tiny (ACDGST), small (EHILKMNPQV) and large (FRWY) residues, as well as the relative frequency of all 20 natural amino acids, were also assessed. All features were computed using the Python 3.6 programming language (available at https://www.python.org/). Here, Rfre[a.a] is the relative frequency of a natural amino acid of type i, N is the total number of natural amino acids in the peptide (peptide length), and PEP[comp] is the sum of all Rfre[a.a] in a peptide.

For the construction of the prediction models, the Random Forest (RF) algorithm was evaluated.
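The composition features described above can be illustrated with a minimal sketch. It assumes that Rfre[a.a] is the count of residue type i divided by the peptide length N, and that a group composition is the sum of Rfre[a.a] over the residues of that group; the exact equations are not reproduced in this text, so this reading is an assumption. The names `relative_frequencies`, `group_composition` and `RESIDUE_GROUPS` are hypothetical and are not taken from the AntiVPP 1.0 source code.

```python
from collections import Counter

# The 20 natural amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Residue groups exactly as listed in the text.
RESIDUE_GROUPS = {
    "charged": "DEKHR",
    "aliphatic": "ILV",
    "aromatic": "FHWY",
    "polar": "DERKQN",
    "neutral": "AGHPSTY",
    "hydrophobic": "CVLIMFW",
    "positive": "HKR",
    "negative": "DE",
    "tiny": "ACDGST",
    "small": "EHILKMNPQV",
    "large": "FRWY",
}


def relative_frequencies(peptide):
    """Return Rfre for each of the 20 amino acids, assumed to be n_i / N."""
    seq = peptide.strip().upper()
    if not seq:
        raise ValueError("empty peptide sequence")
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-natural residue codes: {sorted(unknown)}")
    counts = Counter(seq)  # missing residues count as 0
    length = len(seq)      # N, the peptide length
    return {aa: counts[aa] / length for aa in AMINO_ACIDS}


def group_composition(peptide):
    """Return the summed relative frequency of each residue group."""
    rfre = relative_frequencies(peptide)
    return {
        name: sum(rfre[aa] for aa in members)
        for name, members in RESIDUE_GROUPS.items()
    }


if __name__ == "__main__":
    # Arbitrary illustrative sequence, not a peptide from the data set.
    for name, value in group_composition("KKDE").items():
        print(f"{name}: {value:.2f}")
```

Because some residues belong to several groups (for example, H is listed as charged, aromatic, neutral, positively charged and small), the group values are not expected to sum to 1.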
The training of the models was carried out in the Python 3.6 programming language. The Anaconda 3 distribution (available at https://www.anaconda.com) was used to run the following libraries and functions: 'sklearn.ensemble', 'RandomForestClassifier', 'pandas', 'sklearn.externals', 'joblib' and 'score'. The 'score' function (accuracy) was used to select models with scores > 0.95 as the cut-off for subsequent validation. The score function measures the accuracy of the predictions and ranges from 0 to 1. For model validation, the TPR, SPC, ACC and MCC performance measures were used. MCC is used to evaluate the performance of the predictor. Its value ranges from −1 to 1, and a larger MCC means a better prediction [22].

For the development of our application, we used the Python 3.6 programming language and WinPython, a free, open-source, portable distribution of Python. AntiVPP 1.0 has a user-friendly interface that, in addition to discriminating between antiviral and non-antiviral peptides, can also be used to calculate different physicochemical characteristics of the peptides. The software and the instructions to run it are available at https://github.com/bio-coding/AntiVPP.1.0.

During training with the data set T544p+544n*, we obtained several RF-based prediction models with scores > 0.95; each of these models was then validated on the independent data set V60p+60n*. After evaluating each model on the validation data, we selected the model with the best balance in the performance measures: TPR = 0.87, SPC = 0.97, ACC = 0.93 and MCC = 0.87. This model had a score of 0.993 during the training phase. Previously, we had performed an analysis using the support vector machine (SVM), artificial neural network (ANN) and k-nearest neighbor (kNN) algorithms for the prediction of antiviral peptides, and observed a better balance in the performance measures with the RF algorithm (Table 1).
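As an illustration of how TPR, SPC, ACC and MCC relate to one another, the sketch below computes them from the four cells of a confusion matrix using their standard definitions. The function name `performance_measures` and the example counts are hypothetical; the counts are chosen only to match the size of a balanced 60 + 60 validation set and are not the confusion matrix of the AntiVPP 1.0 model.

```python
import math


def performance_measures(tp, fn, tn, fp):
    """Standard binary classification measures from confusion-matrix counts.

    tp: positives (AVPs) predicted as positive
    fn: positives predicted as negative
    tn: negatives (non-AVPs) predicted as negative
    fp: negatives predicted as positive
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must contain at least one peptide")

    tpr = tp / (tp + fn)                   # sensitivity
    spc = tn / (tn + fp)                   # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy

    # Matthews correlation coefficient; defined here as 0 when the
    # denominator vanishes (all predictions fall in a single class).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"TPR": tpr, "SPC": spc, "ACC": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts for 60 positive and 60 negative peptides.
    for name, value in performance_measures(tp=50, fn=10, tn=55, fp=5).items():
        print(f"{name} = {value:.2f}")
```

Unlike ACC, MCC uses all four cells of the confusion matrix in both numerator and denominator, so a model that scores well on only one class is penalized.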
Our software was developed with the programming language Python 3.6. AntiVPP 1.0 is an application with a simple and intuitive interface, making it ideal for researchers who are involved in the search for and design of AVPs but lack knowledge of machine learning (Fig. 2). AntiVPP 1.0 returns two types of predictions: 'True' for positive cases and 'False' for negative cases. In addition, the software computes several peptide features, which are the characteristics the program uses for AVP classification.

Viral infections are one of the most important risks to global health [23, 24]. Over the last 50 years, extensive efforts have been dedicated to the development of antiviral drugs, and great success has been achieved for some viruses. Nevertheless, other viral infections, such as epidemic influenza, continue to spread worldwide, and new viral threats, as well as drug-resistant viruses, are continuously emerging [23]. Peptide-based drugs have been of great interest to the scientific community from the past decade to the present, given that the modern pharmaceutical industry has come to appreciate the role of these molecules in addressing unmet medical needs. This is because peptides can be an excellent complement, or even a more suitable alternative, to small molecules and biological therapeutics [25]. Despite the potential of AVPs, there is a considerable lack of algorithms for AVP prediction compared to other areas such as the investigation of antimicrobial peptides. To date, the RF-based algorithm has shown better performance in the prediction of AVPs than the other methods reported in the literature [11, 16, 17]. The comparison of the performance measures obtained in our study with the different algorithms supports the previous results on the robustness of RF for AVP prediction [16], as shown in Table 1.
In this study, we evaluated the RF algorithm using new combinations of physicochemical characteristics of the AVPs, obtaining an excellent model with the following performance measures during the validation phase: TPR = 0.87, SPC = 0.97, ACC = 0.93, and MCC = 0.87. In addition, we confirmed the need to include the relative frequency for the improvement of AVP predictions, as previously reported [16]. A comparison among the existing methods for the prediction of AVPs shows that AntiVPP 1.0 has the highest SPC (Table 2). Specificity is one of the most relevant measures in the construction of predictive models and is the proportion of negative cases (non-AVPs) correctly identified [26]. Furthermore, we report for the first time the number of hydrogen bond donors as another important characteristic to be considered in the development of future AVP prediction algorithms, because it improved the performance measures during the testing of our prediction models. H-bond pairing has been shown to have a great influence on ligand-binding affinity, improving the strength of ligand-receptor interactions [27]. For this reason, hydrogen bonds have played an important role in the design and discovery of new peptide-based drugs [28]. Our use of this feature is novel, since it had not previously been used for the prediction of antiviral peptides.

AntiVPP 1.0 is a fast, accurate and intuitive tool for the prediction of antiviral peptides and an alternative to the current tools for this purpose. The hydrogen bond is an important feature to consider in future algorithms aimed at the design and discovery of antiviral peptides. This software should be helpful for researchers working on the development of peptide-based antiviral therapies, owing to its high success rates and user-friendliness.

There is no conflict of interest to declare.

AntiVPP 1.0 is protected by copyright.
This software is free for academic users. For commercial purposes, please contact: jorge.farias@ufrontera.cl.

Table fragment: Not reported | Not reported | Not reported | Not reported | [17]
TPR: sensitivity, SPC: specificity, ACC: accuracy, MCC: Matthews correlation coefficient, RF: Random Forest, SVM: Support vector machine, ANN: Artificial neural network, kNN: k-nearest neighbor, *: current study.

Peptide entry inhibitors of enveloped viruses: the importance of interfacial hydrophobicity
Evolutionary history and phylogeography of human viruses
Mechanisms of viral emergence
Current scenario of peptide-based drugs: the key roles of cationic antitumor and antiviral peptides
Mining the Tree of Life: Host Defense Peptides as Antiviral Therapeutics
Synthetic therapeutic peptides: science and market
Enfuvirtide: the first therapy to inhibit the entry of HIV-1 into host CD4 lymphocytes
Antiviral drug discovery strategy using combinatorial libraries of structurally constrained peptides
A novel peptide with potent and broad-spectrum antiviral activities against multiple respiratory viruses
AVPpred: collection and prediction of highly effective antiviral peptides
APD3: the antimicrobial peptide database as a tool for research and education
CAMPR3: a database on sequences, structures and signatures of antimicrobial peptides
HIPdb: a database of experimentally validated HIV inhibiting peptides
Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou's general PseAAC
Analysis and prediction of highly effective antiviral peptides based on random Forest
AVP-IC50Pred: multiple machine learning techniques-based prediction of peptide antiviral activity in terms of half maximal inhibitory concentration (IC50)
Prediction of protein function from sequence properties: discriminant analysis of a data base
Amino acid side chain parameters for correlation studies in biology and pharmacology
AAindex: amino acid index database
A simple method for displaying the hydropathic character of a protein
Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric
Current landscape and future prospects of antiviral drugs derived from microbial products
Emerging infectious diseases and pandemic potential: status quo and reducing risk of global spread
The current state of peptide drug discovery: back to the future?
Diagnostic tests. 1: sensitivity and specificity
Regulation of protein-ligand binding affinity by hydrogen bond pairing
The future of peptide-based drugs

This work was supported by the projects: DI12-PEO1 (EXE12-0004) DIUFRO and DIUFRO DIE14-0001 of the Universidad de La Frontera, Chile.