key: cord-0911265-b3iqrqli authors: Timmons, Patrick Brendan; Hewage, Chandralal M. title: ENNAVIA is an innovative new method which employs neural networks for antiviral and anti-coronavirus activity prediction for therapeutic peptides date: 2021-03-25 journal: bioRxiv DOI: 10.1101/2021.03.25.436982 sha: c8e132d70a53860fc9d739e76b5bb7fb3f3e0f8c doc_id: 911265 cord_uid: b3iqrqli Viruses represent one of the greatest threats to human health, necessitating the development of new antiviral drug candidates. Antiviral peptides often possess excellent biological activity and a favourable toxicity profile, and therefore represent a promising field of novel antiviral drugs. As the quantity of sequencing data grows annually, the development of an accurate in silico method for the prediction of peptide antiviral activities is important. This study leverages advances in deep learning and cheminformatics to produce a novel sequence-based deep neural network classifier for the prediction of antiviral peptide activity. The method out-performs the existent best-in-class, with an external test accuracy of 93.9%, Matthews correlation coefficient of 0.87 and an Area Under the Curve of 0.93 on the dataset of experimentally validated peptide activities. This cutting-edge classifier is available as an online web server at https://research.timmons.eu/ennavia, facilitating in silico screening and design of peptide antiviral drugs by the wider research community. against hepatitis-C [25] . A number of databases exist which detail the antiviral activities of AVPs, such as AVPdb [26] , DBAASP [27, 28] , CAMP [29] and APD3 [30] . In silico methods offer a fast, efficient way of exploring the large chemical space that AVPs inhabit, by minimizing the quantity of peptides that need to be synthesized and experimentally assayed for antiviral activity. 
A few methods for the prediction of peptide antiviral activity exist, namely AVPpred [31] , AntiVPP 1.0 [32] , Meta-iAVP [33] , Firm-AVP [34] and the method of Chang et al. [35] . Antiviral peptide prediction methods have been comprehensively reviewed by Charoenkwan et al. [36] . Furthermore, Pang et al. recently developed a novel method for the prediction of peptides with specifically anti-coronavirus activity [37] . The most popular machine learning methods employed are support vector machines or random forests, although a number of others have also been trialled. Many areas of bioinformatics have benefited from the predictive power of deep learning; neural network-based methods exist for many tasks, such as DeepPPISP for the prediction of protein-protein interaction sites [38] , SCLpred and SCLpred-EMS for protein subcellular localization prediction [39, 40] , CPPpred for the prediction of cell-penetrating peptides [41] , HAPPENN for the prediction of peptide hemolytic activity [42] , ENNAACT for the prediction of peptide anticancer activity, [43] and APPTEST for the prediction of peptide tertiary structure [44] . As the quantity of antiviral peptide sequence data continuously increases, we have exploited the available data to create a deep neural network method for the identification of antiviral peptides from the primary sequence. Herein, we describe ENNAVIA, a novel neural network peptide antiviral and anti-coronavirus activity predictor. ENNAVIA is available as a free-to-use online webserver for the benefit of the academic community at https://research.timmons.eu/ennavia . To facilitate easy comparison with existing peptide antiviral activity predictors, the two AVPpred datasets of Thakur et al. were used in this work [31] . The first dataset consists of 604 peptides with experimentally validated antiviral activities, and 452 peptides that were experimentally found to have poor or no antiviral activity. 
This dataset is divided into training and external validation subsets, termed T 544p+407n and V 60p+45n respectively, where p and n denote the number of positive and negative samples. For brevity, these are collectively referred to as ENNAVIA-A. The second dataset consists of 604 peptides with experimentally validated antiviral activities, and 604 negative peptides from the AntiBP2 negative dataset, which were randomly extracted from non-secretory proteins [45]. This second dataset is similarly divided into training and external validation subsets, termed T 544p+544n and V 60p+60n respectively, where p and n again denote the number of positive and negative samples. These are collectively referred to as ENNAVIA-B. Peptide sequences in the datasets consist only of natural amino acids; peptides that contain residues not included in the canonical 20 amino acids are excluded, as are peptides with a sequence length shorter than 7 or longer than 40. Information about the peptides' secondary structure is not included in the dataset. The datasets are available for download from the webserver website and as supplementary material to this article. In order to develop a classifier specific to the prediction of peptides with anti-coronavirus activity, two additional datasets were created, ENNAVIA-C and ENNAVIA-D. The positive samples of both datasets are peptide sequences with anti-coronavirus activity, taken from the dataset created by Pang et al. [37]. The original dataset included 139 peptide sequences with anti-coronavirus activity. Once peptide sequences with a sequence length shorter than 7 or longer than 40 were excluded, 109 peptide sequences remained. The negative samples of ENNAVIA-C and ENNAVIA-D are the same as the negative samples of ENNAVIA-A and ENNAVIA-B, respectively. It is imperative to thoroughly validate classifier models created by machine learning.
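The inclusion criteria stated for the datasets (canonical residues only, sequence length from 7 to 40) can be expressed as a small filter. The following is a minimal sketch, not the authors' code; the function names `is_eligible` and `filter_peptides` are hypothetical, and upper-casing the input is an assumption made for convenience.

```python
# The 20 canonical amino acids in one-letter code.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def is_eligible(seq, min_len=7, max_len=40):
    """Return True if seq uses only canonical residues and has 7 to 40 residues."""
    seq = seq.strip().upper()  # assumption: input case and whitespace are not significant
    return min_len <= len(seq) <= max_len and set(seq) <= CANONICAL


def filter_peptides(seqs):
    """Keep only the sequences that satisfy the inclusion criteria."""
    return [s.strip().upper() for s in seqs if is_eligible(s)]


if __name__ == "__main__":
    candidates = ["GIGKFLHSAKKFGKAFVGEIMNS", "ACDEFG", "ACDEFGX", "ACDEFGH"]
    print(filter_peptides(candidates))
```

Both length bounds are treated as inclusive, since the text excludes only peptides shorter than 7 or longer than 40.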
Tenfold cross-validation and validation by an external test set were employed for the performance evaluation of all models presented herein. The models trained under cross-validation were ensembled and evaluated with the external test sets. For ENNAVIA-A and ENNAVIA-B, the peptides used in the external test sets are those from the V 60p+45n and V 60p+60n datasets of Thakur et al. [31], in order to facilitate a direct comparison with existing methods. Peptides with anti-coronavirus activity which are also present in the ENNAVIA-A and ENNAVIA-B datasets are assigned to the same fold as in ENNAVIA-A and ENNAVIA-B. In order to prevent overfitting, the CD-HIT-2D program [46, 47] was used to identify anti-coronavirus peptides that can be matched to antiviral peptides using a sequence identity cut-off value of 0.9. Anti-coronavirus peptides which had high sequence identity to antiviral peptides in the ENNAVIA-A and ENNAVIA-B datasets were assigned to the same fold as those peptides. The negative peptides of the ENNAVIA-C and ENNAVIA-D datasets maintained the same fold-assignment as in the ENNAVIA-A and ENNAVIA-B datasets. The amino acid composition of the experimentally verified antiviral peptides was analysed and compared to that of the experimentally verified non-antiviral peptide sequences and the random non-secretory peptide sequences extracted from UniProt. The composition analysis includes the peptides' full sequences, the 10 N-terminal residues, and the 10 C-terminal residues. Enrichment depletion logos (EDLogo) [48] were created for the antiviral peptides' sequences to identify any position-specific amino acid preferences that may exist. The experimentally validated non-antiviral peptide sequences were used as the baseline in the construction of the logo plots. A variety of features was extracted from the peptides' primary sequences. These features can be divided into two subcategories, amino acid-based descriptors and physicochemical descriptors.
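The fold-assignment rule described above for the anti-coronavirus peptides can be sketched as follows. This is an illustrative sketch only: the function `assign_folds` and the `matches` mapping are hypothetical (the mapping stands in for pairs parsed from CD-HIT-2D output at the 0.9 identity cut-off). The text does not state how unmatched peptides are distributed, so placing them in the least-populated fold is an assumption.

```python
def assign_folds(new_peptides, existing_folds, matches, n_folds=10):
    """Assign cross-validation folds to anti-coronavirus peptides.

    existing_folds: dict mapping antiviral peptide sequence -> fold index.
    matches: dict mapping a new peptide -> a highly similar antiviral peptide
             (hypothetical stand-in for parsed CD-HIT-2D results).
    """
    folds = {}
    unassigned = []
    for pep in new_peptides:
        if pep in existing_folds:
            # Peptide already present in the antiviral dataset: keep its fold.
            folds[pep] = existing_folds[pep]
        elif matches.get(pep) in existing_folds:
            # High-identity match: inherit the fold of the matched peptide.
            folds[pep] = existing_folds[matches[pep]]
        else:
            unassigned.append(pep)

    # Assumption (not stated in the text): remaining peptides go to the
    # currently least-populated fold to keep fold sizes balanced.
    counts = [0] * n_folds
    for fold in folds.values():
        counts[fold] += 1
    for pep in unassigned:
        fold = counts.index(min(counts))
        folds[pep] = fold
        counts[fold] += 1
    return folds
```

Keeping similar sequences in the same fold means that a model is never validated on a near-duplicate of a peptide it was trained on.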
Only features that were nonzero for at least 20 samples were retained in the final feature vector. The peptides' compositional descriptors were calculated based on the peptides' amino acid, dipeptide, and tripeptide compositions for the conventional 20-amino acid alphabet. Additionally, descriptors were also calculated based on the reduced amino acid alphabets of Veltri et al. [49], Thomas and Dill [50], and the conjoint alphabet [51]. g-gap dipeptide and tripeptide compositions were calculated to account for the three-dimensional structure of the peptides [52], with the values of the parameter g being 1, 2 and 3 for the dipeptide compositions, and 3 and 4 for the tripeptide compositions. Furthermore, conjoint triad, composition, transition and distribution [53] and pseudo amino acid composition [54] descriptors were also calculated. The modlAMP package was employed for the calculation of global physicochemical descriptors and amino acid scale-based descriptors [55]. Global physicochemical features include molecular formula, sequence length, molecular weight, sequence charge, charge density, isoelectric point, instability index, aliphatic index [56], aromaticity index [57], hydrophobic ratio and the Boman index [58]. Amino acid scale-based descriptors include hydrophobicity [59] [60] [61] [62] [63], side-chain bulkiness [64], refractivity [65], side-chain flexibility [66], α-helix propensity [67], transmembrane propensity [68], polarity [64, 69], amino acid charges, AASI [70], ABHPRK [55], COUGAR [55], Ez [71], ISAECI [72], MSS [73], MSW [74], PPCALI [75], t scale [76], z3 [77], z5 [78] and pepArc [55]. Additional physicochemical features were calculated based on amino acid properties detailed in the AAindex [79]. The peptides' hydrophobicities were quantified using the amino acids' hydrophobicities [80, 81], hydropathies [82], retention coefficients in HPLC [83] and partition energies [84, 85].
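Two of the steps above lend themselves to a short illustration: the amino acid composition descriptor and the rule that a feature is retained only if it is nonzero for at least 20 samples. The sketch below uses plain Python lists; the function names are hypothetical and the code is not taken from the ENNAVIA implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Fraction of each of the 20 canonical amino acids in seq."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def retain_features(matrix, min_nonzero=20):
    """Keep columns that are nonzero in at least min_nonzero rows.

    matrix: list of feature vectors (one per peptide), all of equal length.
    Returns the retained column indices and the reduced matrix.
    """
    n_features = len(matrix[0])
    keep = [
        j for j in range(n_features)
        if sum(1 for row in matrix if row[j] != 0) >= min_nonzero
    ]
    reduced = [[row[j] for j in keep] for row in matrix]
    return keep, reduced
```

Returning the retained indices alongside the reduced matrix allows the same columns to be selected later for validation or query peptides.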
Similarly, the peptide sequences' hydrophilicities were characterised using descriptors based on the amino acid hydrophilicity scale [86], the amino acids' net charges [87], polar requirements [88] and fractions of site occupied by water [89]. Descriptors pertaining to sterics were obtained from the residues' steric hindrance [90] and bulkiness [64] properties, while secondary structure features were calculated based on helical [91] propensities. Furthermore, descriptors were also calculated from the side-chain interaction parameters [92] and membrane-buried preference parameters [93]. Unsupervised and supervised machine learning approaches are employed in the current study. The former includes principal component analysis (PCA) [94] and t-distributed Stochastic Neighbour Embedding (t-SNE) [95] for visualising the data. The latter includes support vector machine (SVM) [96], random forest (RF) [97], and dense fully connected neural networks [98] for creating supervised classifiers. The scikit-learn Python module is used for its PCA, t-SNE, SVM and RF implementations [99]. SVMs were trialled using both a linear kernel and a non-linear radial basis function (RBF) kernel. A grid search was employed for the tuning of the RF number of estimators, maximum number of features and maximum depth hyperparameters, and of the SVM regularization parameter C and kernel width parameter γ. The Keras deep learning framework with a TensorFlow backend was used to build and train the deep fully connected neural networks [100]. The neural network's input features are scaled to have minimum and maximum values of 0 and 1, respectively. The optimal combination of neural network architecture and hyperparameters was selected using a randomized grid search strategy. The first hidden layer has 1024 nodes, and is followed by two layers of 256 nodes each. Batch normalization [101] is applied before the ReLU activation function for each hidden layer.
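The scaling of input features to the range 0 to 1 mentioned above can be illustrated with a short sketch. It assumes, as is common practice but not stated in the text, that the minimum and maximum of each feature are estimated on the training data only and then reused for validation data. The function names are hypothetical, and mapping constant features to 0 is likewise an assumption.

```python
def fit_min_max(train_rows):
    """Compute per-feature minima and maxima from the training rows."""
    columns = list(zip(*train_rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return mins, maxs


def apply_min_max(rows, mins, maxs):
    """Scale each feature to [0, 1] using previously fitted minima and maxima."""
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant feature (hi == lo) is mapped to 0.0.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled
```

Values outside the training range will fall outside [0, 1] when the fitted parameters are applied to unseen peptides; this sketch does not clip them.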
To prevent overfitting to the training data, each hidden layer is followed by a Dropout regularization layer, with a rate of 0.30 [102]. The output layer is a single node activated by the sigmoid function. As is common in binary classification neural networks, the binary cross-entropy loss function is employed. It is defined as:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right)\right] \tag{1}$$

where $N$ is the number of samples, $y_i$ is the true value of the $i$-th sample, and $\hat{y}_i$ is the predicted value of the $i$-th sample. As the predicted labels of all training data approach their respective true values, the value of the function approaches zero. The best-performing optimizer was found to be Adaptive Moment Estimation (Adam), with an optimal initial learning rate of 0.05 and a decay of 0.0001. Adam utilises the following formula to update the neural network weights [103]:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

where $\theta_t$ denotes the weights at step $t$, $\eta$ is the learning rate, $\epsilon$ is a small constant that ensures numerical stability, and $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates of the mean and the variance of the gradients, respectively. The neural networks were trained for 600 epochs, without stopping criteria. The model with the highest validation accuracy encountered during training was retained for each of the cross-validation splits. As the dataset of peptides with anti-coronavirus activity is small, numbering only 109 peptides, transfer learning was used to train the models for the ENNAVIA-C and ENNAVIA-D datasets. Models originally trained for each cross-validation fold for ENNAVIA-A and ENNAVIA-B were used to initialize the weights for the neural network models of the corresponding cross-validation folds for ENNAVIA-C and ENNAVIA-D, respectively. The neural network models were then trained for 600 epochs, without stopping criteria. The model with the highest validation accuracy encountered during training was retained for each of the cross-validation splits.
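The binary cross-entropy loss and the Adam weight update referred to above can be checked numerically with a small standard-library sketch. It takes both in their standard textbook form. The defaults `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are the values suggested in the Adam paper and are assumptions here, since the text reports only the learning rate (0.05) and the decay. The learning-rate decay is omitted for brevity, and the function names are hypothetical.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of weights; t is the 1-based step counter."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v_i / (1 - beta2 ** t)             # bias-corrected second moment
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v
```

On the first step the bias correction makes the update magnitude approximately equal to the learning rate regardless of the gradient scale, which is one reason the choice of initial learning rate matters for Adam.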
A number of standard metrics are employed for the evaluation of the presented models' performance, specifically accuracy (Acc), sensitivity (Sn), specificity (Sp), the Matthews correlation coefficient (MCC), and the receiver operating characteristic (ROC) curve. Confidence intervals are provided at the 95% confidence level. The first four metrics are defined by the following equations:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where
• TP = True positives: the number of correctly predicted positive (antiviral) peptides.
• FP = False positives: the number of non-antiviral peptides incorrectly predicted as being antiviral.
• TN = True negatives: the number of correctly predicted negative (non-antiviral) peptides.
• FN = False negatives: the number of antiviral peptides incorrectly predicted as being non-antiviral.

The dataset of peptide sequences was subjected to an amino acid composi-
To identify if particular amino acid residues are more prevalent in antiviral
To complement the PCA analysis, a t-distributed Stochastic Neighbour Embedding (t-SNE) analysis was conducted for the experimentally verified antiviral and non-antiviral peptides, again for all computed descriptors, only the physicochemical descriptors, and only the compositional descriptors subsets (Figure 4). As with the results of the PCA analysis, the interclass separation is incomplete, although it is clearly greater. The principal aim of this study was to train and evaluate a selection of machine learning classifiers for the prediction of peptide antiviral activity. Table 1. As the neural networks' performance was superior to that of the SVM and RF approaches, the neural network was deemed the best model for the prediction of peptide antiviral activity and was studied further. To establish the utility of ENNAVIA in the context of prediction methods already described in the literature, ENNAVIA was benchmarked against five existing antiviral peptide prediction methods, specifically AVPpred [31], the method of Chang et al.
[35], AntiVPP [32], Meta-iAVP [104] and FIRM-AVP [34]. Detailed results are given in Table 2. We attempted to contact the corresponding authors of these articles prior to publication, but received no response to our queries. A recent study by Pang et al. described a machine learning method for the identification of anti-coronavirus peptides through imbalanced learning strategies [37]. This study utilises the datasets created by Pang et al. validation is limited to ten-fold cross-validation. Detailed results are given in Table 3. To ascertain the extent to which a given set of features can contribute to the correct prediction of peptide antiviral activity, neural networks were trained on subsets of the feature space. The validation results obtained by these neural networks trained on the peptides' physicochemical features, dipeptide composition, dipeptide g-gap composition and tripeptide composition are detailed in Table 4. None of the models trained on a reduced feature subset outperforms the hybrid model trained on both compositional and physicochemical descriptors, validating the choice of the hybrid model as the principal approach. Information about local sequence order can be relayed to a machine learning method through the use of dipeptide and tripeptide composition descriptors. A peptide's dipeptide and tripeptide composition can be defined as the percentage of a given dipeptide or tripeptide in the sequence. The g-gap dipeptide compositions, defined as the proportion of a pair of amino acids separated by 1, 2 or 3 residues, are a useful descriptor as they correspond to residues that may be proximate to one another in three-dimensional space. As peptides often possess secondary structure upon interaction with their targets, this information allows for the capturing of the chemical environment that the peptide presents to its target.
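The g-gap dipeptide composition described above can be made concrete with a short sketch. Here g is taken to be the number of residues between the two members of a pair, so that g = 0 reproduces the conventional dipeptide composition; this reading follows the definition given in the text, but the function name and the dictionary return format are illustrative assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(seq, g):
    """Proportion of each residue pair separated by exactly g residues.

    Returns a dict with all 400 canonical pairs as keys.
    """
    n_pairs = len(seq) - g - 1
    if n_pairs <= 0:
        raise ValueError("sequence too short for the requested gap")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(n_pairs):
        # Pair the residue at position i with the one g + 1 positions later.
        counts[seq[i] + seq[i + g + 1]] += 1
    return {pair: count / n_pairs for pair, count in counts.items()}
```

For the sequence ACDAC with g = 1, the pairs are A..D, C..A and D..C, each with a proportion of one third. Since peptides in the datasets have at least 7 residues, every gap value used in the study (1, 2 and 3) yields at least three pairs.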
Models trained on g-gap dipeptide composition do not perform better than those trained on conventional dipeptide composition, achieving an accuracy and MCC of 90.0% and 0.80, respectively. Models trained on physicochemical features, such as charge and amphiphilicity, achieve an accuracy and MCC of 88.3% and 0.76, respectively. Although this performance is poorer than that achieved by the models trained on compositional features, it is only marginally so, and still demonstrates predictive capability.

To conclude, the limited quantity of available experimentally validated data and the incomplete understanding of the mechanism of peptide antiviral activity continue to pose challenges for the research community. In an effort to overcome these challenges, this study described ENNAVIA, a collection of novel in silico peptide antiviral and anti-coronavirus activity classifiers. The classifiers, which employ a deep neural network architecture and benefit from a rich feature-space, achieve predictive power that surpasses the state-of-the-art. This work complements a suite of existing in silico classifiers developed by the authors, which includes methods for the prediction of peptide anticancer and hemolytic activity, and peptide tertiary structure. The authors believe that the results of this work, in combination with the aforementioned methods, will enable better in silico design of novel peptide-based antiviral and anti-coronavirus therapeutics, thereby reducing the cost and time required for the design phase, helping to drive medicinal chemistry into an unprecedented revolution. For the benefit of the scientific community, the ENNAVIA classifier is available as a user-friendly, publicly accessible web server online at https://research.timmons.eu/ennavia . The web server is capable of predicting peptides' antiviral activity based on the primary sequence. Input peptide sequences are restricted to only the 20 natural amino acids; non-natural amino acids are not supported.
The web server includes many features, and models trained on both the ENNAVIA-A (T 504p+406n ) and ENNAVIA-B dataset (T 504p+504n ) are available for prediction. Peptide antiviral activities can be predicted for both a single sequence and a batch of sequences. Peptide sequences should be provided in the standard FASTA format. The maximum batch size is variable depending on the length of the sequences; longer sequences necessitate smaller batch sizes. The prediction will be carried out by the ensemble of trained neural networks, and the average score will be returned, which corresponds to the probability of the peptide sequence possessing antiviral activity. Probabilities are given on a scale of 0-1, where 0 indicates that the peptide is most probably non-antiviral and 1 that it is most probably antiviral. Mutation analysis may be carried out on single peptide sequences, by selecting the mutation analysis option and inputting the residue number to be mutated. Mutant sequences will be created by substituting the residue at the specified position with each of the other 19 natural amino acids. The probability of each of the mutant sequences possessing antiviral activity will be returned by the chosen neural network model. Residue scans, such as an alanine scan, are available for single peptide sequences, by choosing the residue scan option and selecting the amino acid residue to be scanned with. Mutant sequences are obtained by substituting successive residues with the selected amino acid residue. The probability of the native and mutant sequences possessing antiviral activity will be returned by the selected neural network model. All data generated or analysed during this study are available for download at https://research.timmons.eu/ennavia .

Mechanisms of viral emergence Genetic diversity and evolution of SARS-CoV-2.
Infection Control of Viral Infections and Diseases Antimicrobial peptides: An emerging category of therapeutic agents The role of cationic antimicrobial peptides in innate host defences The Potential of Antiviral Peptides as A novel peptide with potent and broad-spectrum antiviral activities against multiple respiratory viruses Virucidal activity of a scorpion venom peptide variant mucroporin-M1 against measles, SARS-CoV and influenza H5N1 viruses Structure-based discovery of Middle East respiratory syndrome coronavirus fusion inhibitor Peptide-based drug design: Here and now Therapeutic peptides: Historical perspectives, current development trends, and future directions General method for rapid synthesis of multicomponent peptide mixtures Methods for generating and screening libraries of genetically encoded cyclic peptides in drug discovery Evolving a peptide: Library platforms and diversification strategies Rationally Designed ACE2-Derived Peptides Inhibit SARS-CoV-2 Current progress in antiviral strategies Human Immunodeficiency Virus Type 1 Protease Inhibitors Direct-acting antiviral agents for hepatitis c virus infection Approaches for Identification of HIV-1 Entry Inhibitors Targeting gp41 Pocket The effect of peginterferon alpha-2a vs. 
peginterferon alpha-2b in treatment of naive chronic HCV genotype-4 patients: A single centre Egyptian study Success in anti-viral immunotherapy Antiviral peptides as promising therapeutic drugs Antiviral Peptides: Identification and Validation AVPdb: A database of experimentally validated antiviral peptides targeting medically important viruses Erratum: DBAASP v.2: An enhanced database of structure and antimicrobial/cytotoxic activity of natural and synthetic peptides DBAASP v3: Database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics CAMP: Collection of sequences and structures of antimicrobial peptides APD3: The antimicrobial peptide database as a tool for research and education AVPpred: Collection and prediction of highly effective antiviral peptides AntiVPP 1.0: A portable tool for prediction of antiviral peptides Meta-iAVP: A sequence-based meta-predictor for improving the prediction of antiviral peptides using effective feature representation Better understanding and prediction of antiviral peptides through primary and secondary structure feature importance Analysis and Prediction of Highly Effective Antiviral Peptides Based on Random Forests In silico approaches for the prediction and analysis of antiviral peptides: a review Identifying anticoronavirus peptides by incorporating different negative datasets and imbalanced learning strategies Protein-protein interaction site prediction through combining local and global features with deep neural networks SCLpred: Protein subcellular localization prediction by N-to-1 neural networks SCLpred-EMS: Subcellular localization prediction of endomembrane system and secretory pathway proteins by Deep N-to-1 Convolutional Neural Networks CPPpred: Prediction of cell penetrating peptides HAPPENN is a novel tool for hemolytic activity prediction for therapeutic peptides which employs neural networks ENNAACT is a novel tool which employs neural networks for anticancer
activity classification for therapeutic peptides APPTEST is an innovative new method for the automatic prediction of peptide tertiary structures AntiBP2: Improved version of antibacterial peptide prediction CD-HIT Suite: A web server for clustering and comparing biological sequences Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences A new sequence logo plot to highlight enrichment and depletion Deep learning improves antimicrobial peptide recognition An iterative method for extracting energy-like quantities from protein structures Predicting protein-protein interactions based only on sequences information Identification of bacteriophage virion proteins by the ANOVA feature selection and analysis PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions PyDPI: Freely available python package for chemoinformatics, bioinformatics, and chemogenomics studies modlAMP: Python for antimicrobial peptides Thermostability and Aliphatic Index of Globular Proteins Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes Antibacterial and antimalarial properties of peptides that are cecropin-melittin hybrids Structural Prediction of Membrane-Bound Proteins Hydrophobic moments and protein structure A simple method for displaying the hydropathic character of a protein Prediction of protein antigenic determinants from amino acid sequences Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins The characterization of amino acid sequences in proteins by statistical methods Refractive indices of proteins in relation to amino acid composition and specific volume Positional flexibilities of amino acid residues in globular proteins Conformational Preferences of Amino Acids in Globular Proteins An amino acid "transmembrane tendency" scale that approaches
the theoretical limit to accuracy for prediction of transmembrane helices: Relationship to biological hydrophobicity Amino acid difference formula to help explain protein evolution Computational design of highly selective antimicrobial peptides Depth-dependent Potential for Assessing the Energies of Insertion of Amino Acid Side-chains into Membranes: Derivation and Applications to Determining the Orientation of Transmembrane and Interfacial Helices Amino Acid Side Chain Descriptors for Quantitative Structure-Activity Relationship Studies of Peptide Analogues Topological shape and size of peptides: Identification of potential allele specific helper T cell antigenic sites MS-WHIM scores for amino acids: A new 3D-description for peptide QSAR and QSPR studies Scrutinizing MHC-I Binding Peptides and Their Limits of Variation Amino Acids Characterization by GRID and Multivariate Data Analysis Peptide Quantitative Structure-Activity Relationships, a Multivariate Approach New chemical descriptors relevant for the design of biologically active peptides. 
A multivariate characterization of 87 amino acids AAindex: Amino acid index database, progress report Hydrophobic parameters pi of aminoacid side chains from the partitioning of N-acetyl-amino-acid amides Physicochemical Basis of Amino Acid Hydrophobicity Scales: Evaluation of Four New Scales of Amino Acid Hydrophobicity Coefficients Derived from RP-HPLC of Prediction of protein surface accessibility with information theory New Hydrophilicity Scale Derived from High-Performance Liquid Chromatography Peptide Retention Data: Correlation of Predicted Surface Residues with Antigenicity and X-ray-Derived Accessible Sites Partition coefficients of amino acids and hydrophobic parameters π of their side-chains as measured by thin-layer chromatography Amino acid side-chain partition energies and distribution of residues in soluble proteins Atomic and residue hydrophilicity in the context of folded protein structures Prediction of protein function from sequence properties. Discriminant analysis of a data base Evolution of the genetic code Local interactions as a structure determinant for protein molecules: II. 
BBA -Protein Structure 576 Protein folding and the genetic code: An alternative quantitative model Helix capping Optimization of Amino Acid Parameters for Correspondence of Sequence to Tertiary Structures of Proteins Quantifying the Effect of Burial of Amino Acid Residues on Protein Stability On lines and planes of closest fit to systems of points in space Visualizing data using t-SNE Support-Vector Networks Random decision forests Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms Scikit-learn: Machine Learning in Python Large-Scale Machine Learning on Heterogeneous Distributed Systems Batch normalization: Accelerating deep network training by reducing internal covariate shift Dropout: A Simple Way to Prevent Neural Networks from Overfitting Adam: A method for stochastic optimization ACPred: A computational tool for the prediction and analysis of anticancer peptides NMR model structure of the antimicrobial peptide maximin 3 Structural and positional studies of the antimicrobial peptide brevinin-1BYa in membrane-mimetic environments Insights into conformation and membrane interactions of the acyclic and dicarba-bridged brevinin-1BYa antimicrobial peptides The authors would also like to thank University College Dublin for the Research Scholarship granted to P.B.T. The authors have no competing interests to declare.