key: cord-0959768-ka2x3tzr
authors: Timmons, Patrick Brendan; Hewage, Chandralal M
title: ENNAVIA is a novel method which employs neural networks for antiviral and anti-coronavirus activity prediction for therapeutic peptides
date: 2021-07-23
journal: Brief Bioinform
DOI: 10.1093/bib/bbab258
sha: b4a6db4f32619cfe3dc30371f471989eea63724a
doc_id: 959768
cord_uid: ka2x3tzr

Viruses represent one of the greatest threats to human health, necessitating the development of new antiviral drug candidates. Antiviral peptides often possess excellent biological activity and a favourable toxicity profile, and therefore represent a promising field of novel antiviral drugs. As the quantity of sequencing data grows annually, the development of an accurate in silico method for the prediction of peptide antiviral activities is important. This study leverages advances in deep learning and cheminformatics to produce a novel sequence-based deep neural network classifier for the prediction of antiviral peptide activity. The method outperforms the existent best-in-class, with an external test accuracy of 93.9%, Matthews correlation coefficient of 0.87 and an Area Under the Curve of 0.93 on the dataset of experimentally validated peptide activities. This cutting-edge classifier is available as an online web server at https://research.timmons.eu/ennavia, facilitating in silico screening and design of peptide antiviral drugs by the wider research community.

Viruses are ancient infectious agents that replicate inside the cells of living organisms. They are ubiquitous, affecting all species, from bacteria to plants and animals [1], and are incredibly successful due to their genetic diversity, nonuniformity of mode of transmission, efficient replication and capacity for persistence in their hosts [2] [3] [4]. Viral diseases are difficult to control due to their potential for high pathogenicity, increased resistance to antiviral drugs, continuous evolution of existing viruses and the emergence of novel viruses [5]. Viruses are responsible for many human diseases and cause many deaths annually. Cold sores, influenza, AIDS and the current coronavirus disease 2019 (COVID-19) pandemic are all caused by viral infection. Zoonotic viruses, such as the Ebola, Zika, West Nile, HIV, SARS-CoV and SARS-CoV-2 negative charge and other secondary structures have also been identified [8]. Most importantly, AVPs are a promising resource for the development of novel antiviral drugs for the prevention or treatment of viral diseases [8], including those caused by coronaviruses. For example, a subsequence derived from β-defensin, P9, possesses potent inhibitory activity against the SARS-CoV and MERS-CoV viruses [9]. Other anti-coronavirus peptides include Mucroporin-M1 and HR2P, which inhibit the SARS-CoV and MERS-CoV viruses, respectively [10, 11].

This class of potential antiviral agents possesses a number of advantages over conventional non-peptide drugs, as they are highly specific, cost-effective to produce while remaining easy to modify and synthesize, and possess a limited susceptibility to drug resistance [12]. Although AVPs were initially isolated from plant and animal secretions, where they formed part of the host defence mechanism [13], AVPs have also been derived from chemical [14], genetic [15] and recombinant [16] libraries, as well as from rational design [17]. AVPs can be divided into two classes based on their mechanism of action: virus-targeting and host-targeting [18]. AVPs belonging to the former class focus on the inhibition of viral enzymes involved in transcription and replication [19, 20], or the inactivation of viral structural proteins [21]. AVPs of the latter class act as immunomodulators, like interferons [22, 23], or target cyclophilins, which are important cellular factors that are hijacked by viruses during the replication cycle [18, 24]. The currently identified AVPs, however, represent only a small subset of a largely unexplored chemical space, and only a few peptide-based antiviral drugs are available on the market. These drugs include Enfuvirtide, the first peptide inhibitor of HIV-1, and Boceprevir and Telaprevir, both of which act against hepatitis C [25]. A number of databases exist which detail the antiviral activities of AVPs, such as AVPdb [26], DBAASP [27, 28], CAMP [29] and APD3 [30].

In silico methods offer a fast, efficient way of exploring the large chemical space that AVPs inhabit, by minimizing the quantity of peptides that need to be synthesized and experimentally assayed for antiviral activity. A few methods for the prediction of peptide antiviral activity exist, namely AVPpred [31], AntiVPP 1.0 [32], Meta-iAVP [33], Firm-AVP [34] and the method of Chang et al. [35]. Antiviral peptide prediction methods have been comprehensively reviewed by Charoenkwan et al. [36]. Furthermore, Pang et al. recently developed a novel method for the prediction of peptides with specifically anti-coronavirus activity [37]. The most popular machine learning methods employed are support vector machines (SVMs) or random forests, although a number of others have also been trialled. Many areas of bioinformatics have benefited from the predictive power of deep learning; neural network-based methods exist for many tasks, such as DeepPPISP for the prediction of protein-protein interaction sites [38], SCLpred and SCLpred-EMS for protein subcellular localization prediction [39, 40], CPPpred for the prediction of cell-penetrating peptides [41], HAPPENN for the prediction of peptide hemolytic activity [42], ENNAACT for the prediction of peptide anticancer activity [43] and APPTEST for the prediction of peptide tertiary structure [44]. As the quantity of antiviral peptide sequence data continuously increases, we have exploited the available data to create a deep neural network method for the identification of AVPs from the primary sequence. Herein, we describe ENNAVIA, a novel neural network peptide antiviral and anti-coronavirus activity predictor. ENNAVIA (Employing Neural Networks for Antiviral Activity Prediction for Therapeutic Peptides) is available as a free-to-use online webserver for the benefit of the academic community at https://research.timmons.eu/ennavia.

To facilitate easy comparison with existing peptide antiviral activity predictors, the two AVPpred datasets of Thakur et al. were used in this work [31] . The first dataset consists of 604 peptides with experimentally validated antiviral activities, and 452 peptides that were experimentally found to have poor or no antiviral activity. This dataset is divided into training and external validation subsets, termed T 544p+407n and V 60p+45n , respectively, where p and n denote the number of positive and negative samples. For brevity, these are collectively referred to as ENNAVIA-A. The second dataset consists of 604 peptides with experimentally validated antiviral activities, and 604 negative peptides from the AntiBP2 negative dataset, which were randomly extracted from non-secretory proteins [45] . This second dataset is similarly divided into training and external validation subsets, termed T 544p+544n and V 60p+60n , respectively, where p and n again denote the number of positive and negative samples. These are collectively referred to as ENNAVIA-B. Peptide sequences in the datasets consist only of natural amino acids; peptides that contain residues not included in the canonical 20 amino acids are excluded, as are peptides with a sequence length shorter than 7 or longer than 40. Information about the peptides' secondary structure is not included in the dataset. The datasets are available for download from the webserver website and as supplementary material to this article.
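The sequence filtering rule described above (canonical residues only, length between 7 and 40) can be illustrated with a minimal Python sketch. The function name, the constant and the example sequences are hypothetical and are not taken from the ENNAVIA code base.

```python
# Illustrative sketch only: names and example sequences are assumptions.
CANONICAL = frozenset("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids


def keep_peptide(seq, min_len=7, max_len=40):
    """Return True if seq has an allowed length and only canonical residues."""
    seq = seq.strip()
    return min_len <= len(seq) <= max_len and set(seq) <= CANONICAL


peptides = [
    "GIGKFLHSAKKFGKAFVGEIMNS",  # 23 canonical residues: kept
    "ACDX",                     # too short and contains X: dropped
    "A" * 41,                   # longer than 40 residues: dropped
    "AXKLLKKWAKLLKK",           # contains the non-canonical symbol X: dropped
]
filtered = [p for p in peptides if keep_peptide(p)]
```

Lower-case letters are deliberately rejected here, since some databases use them for non-natural residues; that choice is also an assumption.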

In order to develop a classifier specific to the prediction of peptides with anti-coronavirus activity, two additional datasets were created, ENNAVIA-C and ENNAVIA-D. The positive samples of both datasets are peptide sequences with anti-coronavirus activity, taken from the dataset created by Pang et al. [37] . The original dataset included 139 peptide sequences with anti-coronavirus activity. Once peptide sequences with a sequence length shorter than 7 or longer than 40 were excluded, 109 peptide sequences remained. The negative samples of ENNAVIA-C and ENNAVIA-D are the same as the negative samples of ENNAVIA-A and ENNAVIA-B, respectively.

It is imperative to thoroughly validate classifier models created by machine learning. Tenfold cross-validations and validation by an external test set were employed for the performance evaluation of all models presented herein. The models trained under cross-validation were ensembled and evaluated with the external test sets. For ENNAVIA-A and ENNAVIA-B, the peptides used in the external test sets are those from the V 60p+45n and V 60p+60n datasets of Thakur et al. [31] , in order to facilitate a direct comparison with existing methods.
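The text does not state how the cross-validation models were combined into an ensemble. As one plausible reading, the sketch below averages the predicted probabilities of the fold models and applies a 0.5 threshold; both the averaging rule and the threshold are assumptions, and the function name is hypothetical.

```python
# Illustrative sketch only: averaging and the 0.5 threshold are assumptions,
# not the documented ENNAVIA ensembling procedure.
def ensemble_predict(fold_probabilities, threshold=0.5):
    """Average per-model probabilities and threshold them into class labels.

    fold_probabilities: one list per model, each holding the predicted
    probability of the positive class for every external test peptide.
    """
    n_models = len(fold_probabilities)
    averaged = [sum(column) / n_models for column in zip(*fold_probabilities)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Three hypothetical fold models scoring two test peptides.
averaged, labels = ensemble_predict([[0.9, 0.2], [0.7, 0.4], [0.8, 0.6]])
```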

Peptides with anti-coronavirus activity which are also present in the ENNAVIA-A and ENNAVIA-B datasets are assigned to the same fold as in ENNAVIA-A and ENNAVIA-B. In order to prevent overfitting, the CD-HIT-2D program [46, 47] was used to identify anti-coronavirus peptides that can be matched to antiviral peptides using a sequence identity cut-off value of 0.9. Anti-coronavirus peptides which had high sequence identity to antiviral peptides in the ENNAVIA-A and ENNAVIA-B datasets were assigned to the same fold as those peptides. The negative peptides of the ENNAVIA-C and ENNAVIA-D datasets maintained the same fold assignment as in the ENNAVIA-A and ENNAVIA-B datasets.

The amino acid composition of the experimentally verified AVPs was analysed and compared to that of the experimentally verified non-antiviral peptide sequences and the random nonsecretory peptide sequences extracted from UniProt. The composition analysis includes the peptides' full sequences, the 10 N-terminal residues and the C-terminal 10 residues.
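A composition analysis of this kind can be sketched in a few lines of Python. The sketch pools residues over all sequences before computing percentages; whether the study pooled residues or averaged per-peptide compositions is not stated, so pooling is an assumption, as are the function names.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Percentage of each canonical residue, pooled over all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: 100.0 * counts[a] / total for a in AMINO_ACIDS}


def terminal_compositions(sequences, n=10):
    """Compositions of full sequences, the n N-terminal and n C-terminal residues."""
    full = composition(sequences)
    n_term = composition([s[:n] for s in sequences])
    c_term = composition([s[-n:] for s in sequences])
    return full, n_term, c_term


# Hypothetical 22-residue peptide: ten A, ten K, then two C.
full, n_term, c_term = terminal_compositions(["A" * 10 + "K" * 10 + "CC"])
```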

Enrichment depletion logos (EDLogo) [48] were created for the AVPs' sequences to identify any position-specific amino acid preferences that may exist. The experimentally validated non-antiviral peptide sequences were used as the baseline in the construction of the logo plots.

A variety of features was extracted from the peptides' primary sequences. These features can be divided into two subcategories, amino acid-based descriptors and physicochemical descriptors. Only features that were non-zero for at least 20 samples were retained in the final feature vector, which has a dimensionality of 6397.
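The retention rule (keep a feature only if it is non-zero for at least 20 samples) can be sketched as follows. The function names are hypothetical, and the toy example uses a threshold of 2 rather than 20 so that it fits in a few rows. Whether the counts were taken over the training data only is not stated in the text.

```python
# Illustrative sketch only: function names are assumptions.
def retained_columns(X, min_nonzero=20):
    """Indices of feature columns that are non-zero in at least min_nonzero rows."""
    counts = [0] * len(X[0])
    for row in X:
        for j, value in enumerate(row):
            if value != 0:
                counts[j] += 1
    return [j for j, c in enumerate(counts) if c >= min_nonzero]


def select_columns(X, columns):
    """Restrict every feature vector to the retained columns."""
    return [[row[j] for j in columns] for row in X]


X = [[1, 0, 2], [0, 0, 3], [4, 0, 0]]       # 3 samples, 3 features
columns = retained_columns(X, min_nonzero=2)  # the all-zero column is dropped
X_reduced = select_columns(X, columns)
```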

The peptides' compositional descriptors were calculated based on the peptides' amino acid, dipeptide and tripeptide compositions for the conventional 20-amino acid alphabet. Additionally, descriptors were also calculated based on the reduced amino acid alphabets of Veltri et al. [49] , Thomas and Dill [50] and the conjoint alphabet [51] . g-gap dipeptide and tripeptide compositions were calculated to account for the three-dimensional structure of the peptides [52] , with the values of the parameter g being 1, 2 and 3 for the dipeptide compositions, and 3 and 4 for the tripeptide compositions. Furthermore, conjoint triad, composition, transition and distribution [53] and pseudo amino acid composition [54] descriptors were also calculated.
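As an illustration of the g-gap dipeptide composition, the sketch below counts residue pairs that are separated by exactly g intervening residues and normalizes by the number of such pairs. The function name is hypothetical, and the normalization by the number of pairs is an assumption about the exact definition used. With g = 0 the function reduces to the conventional dipeptide composition.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(seq, g):
    """Fraction of residue pairs (a, b) with exactly g residues between a and b.

    Assumes seq contains only the 20 canonical residues.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 keys
    counts = dict.fromkeys(pairs, 0)
    n_pairs = len(seq) - g - 1
    if n_pairs <= 0:  # sequence too short for this gap
        return {p: 0.0 for p in pairs}
    for i in range(n_pairs):
        counts[seq[i] + seq[i + g + 1]] += 1
    return {p: c / n_pairs for p, c in counts.items()}


# For "ACDACDA" and g = 1 the pairs are AD, CA, DC, AD, CA.
comp = g_gap_dipeptide_composition("ACDACDA", g=1)
```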

The modlAMP package was employed for the calculation of global physicochemical descriptors and amino acid scale-based descriptors [55] . Global physicochemical features include molecular formula, sequence length, molecular weight, sequence charge, charge density, isoelectric point, instability index, aliphatic index [56] , aromaticity index [57] , hydrophobic ratio and the Boman index [58] . Amino acid scale-based descriptors include hydrophobicity [59] [60] [61] [62] [63] , side-chain bulkiness [64] , refractivity [65] , side-chain flexibility [66] , α-helix propensity [67] , transmembrane propensity [68] , polarity [64, 69] , amino acid charges, AASI [70] , ABHPRK [55] , COUGAR [55] , Ez [71] , ISAECI [72] , MSS [73] , MSW [74] , PPCALI [75] , t_scale [76] , z3 [77] , z5 [78] and pepArc [55] .

Additional physicochemical features were calculated based on amino acid properties detailed in the AAindex [79] . The peptides' hydrophobicities were quantified using the amino acids' hydrophobicities [80, 81] , hydropathies [82] , retention coefficients in HPLC [83] and partition energies [84, 85] . Similarly, the peptide sequences' hydrophilicities were characterized using descriptors based on the amino acid hydrophilicity scale [86] , the amino acids' net charges [87] , polar requirements [88] and fractions of site occupied by water [89] . Descriptors pertaining to sterics were obtained from the residues' steric hindrance [90] and bulkiness [64] properties, while secondary structure features were calculated based on helical [91] propensities. Furthermore, descriptors were also calculated from the side-chain interaction parameters [92] and membrane-buried preference parameters [93] .

Unsupervised and supervised machine learning approaches are employed in the current study. The former includes principal component analysis (PCA) [94] and t-distributed Stochastic Neighbour Embedding (t-SNE) [95] for visualizing the data. The latter includes SVM [96] , random forest (RF) [97] and dense fully connected neural networks [98] for creating supervised classifiers. The scikit-learn Python module is used for its PCA, t-SNE, SVM and RF implementations [99] .

SVMs were trialled using both a linear kernel and a non-linear radial basis function (RBF) kernel. A grid search was employed for the tuning of the RF number of estimators, the maximum number of features and the maximum depth hyperparameters, and the SVM regularization parameter C and kernel width parameter γ.

The Keras deep learning framework with a TensorFlow backend was used to build and train the deep fully connected neural networks [100].

The neural network's input features are scaled to have minimum and maximum values of 0 and 1, respectively.
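Min-max scaling of this kind can be sketched as below. The text does not say on which samples the minima and maxima are computed; fitting them on the training data and reusing them for validation data is an assumption, as are the function names. Constant features are mapped to 0 to avoid division by zero.

```python
# Illustrative sketch only: fitting on the training split is an assumption.
def fit_min_max(X_train):
    """Per-feature minima and maxima of the training feature vectors."""
    columns = list(zip(*X_train))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(X, mins, maxs):
    """Scale each feature to [0, 1] using previously fitted minima and maxima."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in X
    ]


X_train = [[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]]
mins, maxs = fit_min_max(X_train)
X_scaled = apply_min_max(X_train, mins, maxs)
```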

The optimal combination of neural network architecture and hyperparameters was selected using a randomized grid search strategy.

The first hidden layer has 1024 nodes, and is followed by two layers of 256 nodes each. Batch normalization [101] is applied before the ReLU activation function for each hidden layer. To prevent overfitting to the training data, each hidden layer is followed by a Dropout regularization layer, with a rate of 0.30 [102] . The output layer is a single node activated by the sigmoid function. As is common in binary classification neural networks, the binary cross-entropy loss function is employed.

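It is defined as:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

where $N$ is the number of samples, $y_i$ is the true value of the $i$th sample, and $\hat{y}_i$ is the predicted value of the $i$th sample. As the predicted labels of all training data approach their respective true values, the value of the function approaches zero. The best-performing optimizer was found to be Adam (adaptive moment estimation), with an initial learning rate of 0.05 and a decay of 0.0001. Adam utilizes the following formula to update the neural network weights [103]:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

where $\theta_t$ denotes the weights at step $t$, $\eta$ is the learning rate, $\epsilon$ is a small constant for numerical stability, and $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates of the mean and the variance of the gradients, respectively.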
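The loss and the weight update can be made concrete with a small worked sketch for a single scalar weight. The learning rate of 0.05 is the value reported in the text; the decay rates beta1 = 0.9 and beta2 = 0.999 and the constant eps = 1e-8 are the defaults proposed in the Adam paper and are assumed here, since the text does not report them. The learning-rate decay of 0.0001 is not modelled.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of a scalar weight; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


loss = binary_cross_entropy([1, 0], [0.5, 0.5])  # equals log(2)
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly one learning rate regardless of the gradient's magnitude.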

The neural networks were trained for 600 epochs, without stopping criteria. The model with the highest validation accuracy encountered during training was retained for each of the cross-validation splits.

As the dataset of peptides with anti-coronavirus activity is small, numbering only 109 peptides, transfer learning was used to train the models for the ENNAVIA-C and ENNAVIA-D datasets. Models originally trained for each cross-validation fold for ENNAVIA-A and ENNAVIA-B, respectively, were used to initialize the weights for the neural network models of the corresponding cross-validation folds for ENNAVIA-C and ENNAVIA-D, respectively. The neural network models were then trained for 600 epochs, without stopping criteria. The model with the highest validation accuracy encountered during training was retained for each of the cross-validation splits.

A number of standard metrics are employed for the evaluation of the presented models' performance, specifically accuracy (Acc), sensitivity (Sn), specificity (Sp), the Matthews correlation coefficient (MCC) and the receiver operating characteristic (ROC) curve. Confidence intervals are provided at the 95% confidence level.

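The first four metrics are defined by the following equations:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.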
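These four metrics follow directly from the confusion-matrix counts, as the sketch below shows. The counts in the example (54 true positives, 38 true negatives, 3 false positives, 3 false negatives) are not reported in this article; they are hypothetical values chosen because they are consistent with the accuracy, sensitivity, specificity and MCC reported for the ENNAVIA-A external test set.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts, inferred from the reported percentages (an assumption).
m = classification_metrics(tp=54, tn=38, fp=3, fn=3)
```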

The dataset of peptide sequences was subjected to an amino acid composition analysis and residue position preference analysis. Feature vectors comprising the peptides' physicochemical descriptors, compositional descriptors and all descriptors were constructed and visualized in two-dimensional space using PCA and t-SNE plots. Plots created using both methods show an incomplete separation of the positive and negative classes. Finally, three machine learning classifiers, namely SVMs, random forests and neural networks, are trained on the dataset's feature vectors, and the antiviral activity prediction results are evaluated.

To identify if particular amino acid residues are more prevalent in antiviral and anti-coronavirus peptides, an amino acid residue composition analysis was performed. The amino acid compositions of anti-coronavirus peptides, AVPs, experimentally validated non-antiviral peptides and random non-antiviral peptide sequences are illustrated in Figure 1 . Statistical analysis was carried out using a Chi-squared test; all results are significant at the P < 0.01 significance level.

Interestingly, antiviral and anti-coronavirus peptides are enriched in cysteine and the hydrophobic residue isoleucine, and depleted in proline and histidine. While AVPs in general exhibit enrichment in lysine and tryptophan, this is not observed for the specifically anti-coronavirus peptides. Similarly, AVPs are depleted in glycine and valine, while anti-coronavirus peptides are enriched in these residues. While the amino acid composition for anti-coronavirus peptides is based on a limited sample size, it does suggest that the composition requirements for peptides to possess activity against coronaviruses differ from the composition requirements for activity against viruses in general.

Furthermore, an amino acid composition analysis was carried out for AVPs on the basis of their mode of action ( Figure 2 ). Interestingly, while AVPs are generally not enriched in aspartic acid or tryptophan, AVPs that act at the viral membrane are rich in aspartic acid.

To assess the possibility of a preference existing for certain amino acid residues at certain positions in the peptides' primary sequence, an enrichment-depletion logo plot was produced ( Figure 3 ) for the experimentally validated AVPs. The experimentally validated non-antiviral peptides were used to establish a baseline for the plot.

The first inspection of the logo plot suggests that AVPs are enriched in tryptophan at most positions. This is consistent with the aforementioned amino acid composition analysis. More specifically, however, AVPs appear to be enriched in glycine at position 1, and have a preference for a positively charged residue at position 4. Conversely, they are enriched in aspartic acid at positions 5 and 8, and at the third-last residue. Enrichment is also observed in phenylalanine at the three C-terminal positions. Again, in agreement with the amino acid composition analysis, AVPs are depleted in proline and histidine at all positions.

PCA was carried out on the ENNAVIA-A dataset for all computed descriptors, only the physicochemical descriptors and only the compositional descriptors subsets (Figure 4). While a separation does exist between the experimentally verified antiviral and non-antiviral peptides, it is incomplete, and the two classes overlap substantially.

To complement the PCA, a t-SNE analysis was conducted for the experimentally verified antiviral and non-antiviral peptides, again for all computed descriptors, only the physicochemical descriptors and only the compositional descriptors subsets (Figure 5). As with the results of the PCA, the inter-class separation is incomplete, although it is clearly greater than that obtained with PCA.

The principal aim of this study was to train and evaluate a selection of machine learning classifiers for the prediction of peptide antiviral activity. Tenfold cross-validation was employed for the evaluation of the classifiers' robustness and predictive power. Additionally, the 10 models trained for each classifier under 10-fold cross-validation were ensembled and further evaluated through the use of the external, independent test set. The results are given in Table 1.

A grid search strategy was employed for the optimization of SVM and RF hyperparameters.

The SVM classifier achieved its best performance with the regularization parameters C = 1 and C = 10 for the linear and RBF kernels, respectively, and the kernel coefficient γ = 1.5 × 10⁻⁴ for the non-linear kernel. The SVM classifiers, both with a linear and non-linear kernel, perform worse than the RF and NN approaches, with cross-validation accuracies of 84.2% and 82.8%, and MCCs of 0.68 and 0.65, respectively, on the ENNAVIA-A dataset.

The optimal RF hyperparameters differed depending on the dataset used. For the ENNAVIA-A dataset, optimal performance was observed with 124 estimators, a maximum tree depth of 10 and a maximum of 80 features, achieving a cross-validation accuracy and MCC of 84.9% and 0.69, respectively. For the ENNAVIA-B dataset, meanwhile, optimal performance was observed with 512 estimators, unrestricted tree depth and a maximum of 13 features.

The neural network approach, however, achieves the best predictive performance of all machine learning approaches trialled, with accuracy and MCC scores of 93.88% and 0.87 on the ENNAVIA-A external test set, and 95.65% and 0.91 on the ENNAVIA-B external test set. Furthermore, the neural network achieves a very good balance between sensitivity and specificity, 94.74% and 92.68%, respectively, for ENNAVIA-A. ROC curves (Figure 6) were produced to further evaluate the neural networks' robustness, and the corresponding AUC values were calculated as 0.93 and 0.98 for the ENNAVIA-A and ENNAVIA-B models, respectively.

As the neural network's performance was superior to that of the SVM and RF approaches, it was deemed the best model for the prediction of peptide antiviral activity and was studied further.

To establish the utility of ENNAVIA in the context of prediction methods already described in the literature, ENNAVIA was benchmarked against five existing antiviral peptide prediction methods, specifically AVPpred [31], the method of Chang et al. [35], AntiVPP [32], Meta-iAVP [104] and FIRM-AVP [34]. Detailed results are given in Table 2.

The results presented in Table 2 are reproduced from the respective articles describing the methods. It must be noted, however, that the results for Meta-iAVP and AntiVPP 1.0 could not be reproduced. Independent evaluation of Meta-iAVP via its webserver on the V 60p+45n dataset resulted in Acc, Sn and Sp values of 81.0%, 83.3% and 77.8%, respectively. Similarly, evaluation of the AntiVPP 1.0 software on the V 60p+60n dataset resulted in Acc, Sn and Sp values of 81.6%, 76.6% and 86.6%, respectively. Contact with the corresponding authors of these articles was attempted; however, we had not received a response to our queries prior to publication.

A recent study by Pang et al. described a machine learning method for the identification of anti-coronavirus peptides through imbalanced learning strategies [37] . This study utilizes the datasets created by Pang et al. and employs transfer learning to adapt the ENNAVIA-A and ENNAVIA-B models to the task of anti-coronavirus peptide prediction. For both ENNAVIA-A and ENNAVIA-B, the neural network weights of each of the 10 models trained under cross-validation are transferred to their corresponding models for anti-coronavirus peptide prediction, which are then trained on their respective datasets. The accuracy, MCC, sensitivity and specificity parameters, together with their respective confidence intervals, are reported for each model. ROC curves with the calculated AUC values are also given for both final neural network models ( Figure 6 ). The anti-coronavirus peptide prediction performance obtained by each model is compared with the results obtained by Pang et al. As the size of the anti-coronavirus peptide dataset is extremely limited, and neural network performance typically increases with the amount of data available, validation is limited to 10-fold cross-validation. Detailed results are given in Table 3 .

To ascertain the extent to which a given set of features can contribute to the correct prediction of peptide antiviral activity, neural networks were trained on subsets of the feature space. The validation results obtained by these neural networks trained on the peptides' physicochemical features, dipeptide composition, dipeptide g-gap composition and tripeptide composition are detailed in Table 4 . None of the reduced subset models trained achieve performance better than the hybrid model trained on both compositional and physicochemical descriptors, validating the choice of the hybrid model as the principal approach.

Information about local sequence order can be relayed to a machine learning method through the use of dipeptide and tripeptide composition descriptors. A peptide's dipeptide g-gap compositions, defined as the proportion of a pair of amino acids separated by 1, 2 or 3 residues, are useful descriptors as they correspond to residues that may be proximate to one another in three-dimensional space. As peptides often possess secondary structure upon interaction with their targets, this information allows for the capturing of the chemical environment that the peptide presents to its target. Models trained on g-gap dipeptide composition do not perform better than those trained on conventional dipeptide composition, achieving an accuracy and MCC of 90.0% and 0.80, respectively.

Models trained on physicochemical features, such as charge and amphiphilicity, achieve an accuracy and MCC of 88.3% and 0.76, respectively. Although this performance is poorer than that achieved by the models trained on compositional features, it is only marginally so, and still demonstrates predictive capability.

Feature selection was performed for each validation split using SVMs and random forests; the 500 features with the largest absolute SVM weights, and the 500 features with the largest RF feature importance were selected. Neural network models were constructed and trained on the sets of selected features; the results are presented in Table 5.
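The ranking step of this selection can be sketched independently of the underlying model. The function below takes one score per feature (for example, linear SVM weights or RF importances obtained elsewhere) and returns the indices of the k highest-ranked features. The function name and the toy scores are hypothetical.

```python
# Illustrative sketch only: the scores would come from a fitted SVM or RF.
def top_k_features(scores, k=500, use_abs=True):
    """Indices of the k features with the largest (absolute) scores."""
    if use_abs:
        ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    else:
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


svm_weights = [0.1, -2.0, 0.5, 1.5]                   # hypothetical linear SVM weights
selected_svm = top_k_features(svm_weights, k=2)       # ranks by absolute weight
selected_rf = top_k_features(svm_weights, k=2, use_abs=False)  # ranks by raw score
```

Absolute values are used for SVM weights because a large negative weight is as informative as a large positive one; RF importances are non-negative, so either setting gives the same ranking for them.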

The prediction results obtained in all cases are inferior to those obtained by the models trained on the full feature sets, most notably in the case of the models trained on the ENNAVIA-D dataset. This is not unexpected, considering that the feature selection is performed on the ENNAVIA-B dataset prior to transfer learning, which appears to result in the exclusion of features important for anti-coronavirus activity prediction.

The need for novel antiviral drugs, especially in the context of the COVID-19 pandemic, is great. Interest in the development of novel peptide-based therapeutics has increased in recent years, even as the number of new drugs approved each year declines and the cost of drug research and development grows. More specifically, AVPs represent a promising class of novel drug candidates. Despite extensive research having been conducted on the relationship between the conformations of various bioactive peptides and their biological activities [105] [106] [107], understanding of this relationship remains insufficient for the accurate de novo design of novel peptide drugs, especially antiviral peptide drugs, which, compared to antimicrobial peptides, are less numerous in the literature and consequently less studied. Molecular dynamics simulations can reveal insights into activity, but are time-consuming and largely unsuitable for bulk screening of peptide sequences.

An accurate computational method for the prediction of peptide antiviral activity from the primary sequence alone would facilitate a more rapid exploration of the peptide chemical space, and lower the cost of research and development by reducing the need for chemical synthesis and laboratory evaluation of peptide antiviral activity. With a view to accelerating the screening and design of new antiviral peptide drugs, the present study focuses on the combination of compositional and physicochemical descriptors with a deep neural network architecture to create an in silico method for a more accurate classification of peptides as either antiviral or non-antiviral, and additionally the prediction of peptide anti-coronavirus activity specifically, solely on the basis of their primary sequence.

To facilitate as direct a comparison as possible with existing antiviral peptide prediction methods, the dataset of Thakur et al. [31] was adapted for use in this study. Peptide sequences comprising non-natural amino acids or with a length outside the 7-40 amino acid range were excluded. A total of 577 of the original 604 AVPs remain in the ENNAVIA datasets. Two negative datasets are used in this study: the ENNAVIA-A dataset includes 420 experimentally evaluated non-antiviral peptides, while the ENNAVIA-B dataset includes 597 random peptide sequences as the negative samples.

Compositional and physicochemical descriptors were employed for the construction of feature vectors from the peptides' primary sequences, and a selection of machine learning methods were evaluated for the peptide antiviral activity prediction task through both 10-fold cross-validation and validation on an external test set. Deep neural networks proved most promising, and their architecture was, therefore, further optimized and evaluated.

The neural network model with five hidden layers was found to achieve optimal performance. On the ENNAVIA-A dataset, a 10-fold cross-validated accuracy, sensitivity and specificity of 91.3%, 90.6% and 91.9%, respectively, were achieved, clearly demonstrating that the neural network model is capable of accurately identifying AVPs among non-antiviral peptides. ENNAVIA's predictive performance was compared to existing methods, especially the existent state-of-the-art, Meta-iAVP, which exhibited a cross-validated accuracy, sensitivity and specificity of 88.2%, 89.2% and 86.9%, respectively, on the T 544p+407n dataset. ENNAVIA's performance surpasses that of Meta-iAVP and other existent models on all metrics, designating it a new state-of-the-art model for antiviral peptide prediction.

Similarly, neural network models were trained and evaluated on the ENNAVIA-B dataset, achieving cross-validated accuracy, sensitivity and specificity of 95.9%, 93.4% and 98.6%, respectively, demonstrating that ENNAVIA can distinguish between AVPs and random peptide sequences. A comparison of performance on this dataset to existing methods again establishes ENNAVIA as the best-in-class method for antiviral peptide prediction, surpassing the previously best accuracy, sensitivity and specificity of 93.2%, 89.0% and 97.4% achieved by Meta-iAVP on the T 544p+544n dataset.

Recently, Pang et al. published a study that employed random forests with imbalanced learning strategies for the identification of anti-coronavirus peptides. Notably, the anti-coronavirus peptide dataset is small, with a total of only 139 peptide sequences. Despite the small number of positive samples available for training, respectable validation statistics were achieved, with a sensitivity, specificity and MCC of 85.7%, 85.3% and 0.31 with non-antiviral peptides as the negative dataset, and 100%, 97.7% and 0.73 with random peptide sequences as the negative dataset.

To extend the scope of the current study to the rapid screening of peptides for anti-coronavirus activity specifically, two additional datasets were constructed that use the anti-coronavirus peptides from the dataset of Pang et al. as the positive samples: ENNAVIA-C and ENNAVIA-D, which use the negative peptides from the ENNAVIA-A and ENNAVIA-B datasets, respectively. As the number of positive samples is too small to accurately train neural network models, transfer learning was employed, whereby the already-trained weights of the ENNAVIA-A and ENNAVIA-B models were transferred to the ENNAVIA-C and ENNAVIA-D models, respectively, and further fine-tuned to the anti-coronavirus peptide prediction task. The ENNAVIA-C model achieved a sensitivity, specificity and MCC of 91.6%, 96.0% and 0.87, respectively, representing a significant improvement on the work of Pang et al. The ENNAVIA-D model, similarly, achieved good performance, with a sensitivity, specificity and MCC of 89.8%, 98.8% and 0.91, respectively, outperforming the method of Pang et al. in specificity, although not sensitivity.

The ENNAVIA model does possess drawbacks, some of which it shares with the other existing algorithms. Since the publication of the dataset of Thakur et al. [31], the literature on AVPs has expanded, and continues to expand as new AVPs continue to be identified. Consequently, the number of peptide sequences available for training increases. As neural networks' predictive power scales with the quantity of data available for training, further improvements in predictive performance for both the antiviral and anti-coronavirus predictive models could be achieved through the development of an updated, expanded dataset. Neural networks are generally known as non-interpretable black box models, which precludes rigorous analysis of the basis for the model's predictions. Furthermore, as mentioned previously, AVPs can exert their biological activity through a variety of host-targeting and virus-targeting mechanisms of action, which can include the prevention of virus cell-entry, blocking cell receptors, viral lysis or enhancement of host immune response. While it stands to reason that the mechanism of action a given peptide utilizes to exert antiviral activity depends on the peptide's amino acid composition and physicochemical properties, unfortunately the number of known antiviral peptide sequences still cannot be considered plentiful, much less the number of AVPs that utilize a given mechanism of action. For instance, while inhibition of virus entry is the most prevalent mechanism by which AVPs exert their action, accounting for 30% of entries in the AVPdb, only seven peptides are listed in the AVPdb as exerting their antiviral activity through immunostimulation [26]. Consequently, it is not always feasible to analyse the relationships between peptides' properties and their mode of action, nor is it currently feasible to construct machine learning models that are specific to a mode of action. Instead, antiviral activity predictors remain limited to the prediction of the presence or absence of antiviral activity.

To conclude, the limited quantity of available experimentally validated data and the incomplete understanding of the mechanism of peptide antiviral activity continue to pose challenges for the research community. In an effort to overcome these challenges, this study described ENNAVIA, a collection of novel in silico peptide antiviral and anti-coronavirus activity classifiers. The classifiers, which employ a deep neural network architecture and benefit from a rich feature-space, achieve predictive power that surpasses the state-of-the-art. This work complements a suite of existing in silico classifiers developed by the authors, which includes methods for the prediction of peptide anticancer and hemolytic activity, and peptide tertiary structure. The authors believe that the results of this work, in combination with the aforementioned methods, will enable better in silico design of novel peptide-based antiviral and anti-coronavirus therapeutics, thereby reducing the cost and time required for the design phase, helping to drive medicinal chemistry into an unprecedented revolution.

For the benefit of the scientific community, the ENNAVIA classifier is available as a user-friendly, publicly accessible web server online at https://research.timmons.eu/ennavia. The web server is capable of predicting peptides' antiviral activity based on the primary sequence. Input peptide sequences are restricted to only the 20 natural amino acids; non-natural amino acids are not supported. The web server includes many features, and models trained on the ENNAVIA-A (T 504p+406n), ENNAVIA-B (T 504p+504n), ENNAVIA-C and ENNAVIA-D datasets are available for prediction.

Peptide antiviral activities can be predicted for both a single sequence and a batch of sequences. Peptide sequences should be provided in the standard FASTA format. The maximum batch size varies with the length of the sequences; longer sequences necessitate smaller batch sizes. The prediction is carried out by the ensemble of trained neural networks, and the average score is returned, which corresponds to the probability of the peptide sequence possessing antiviral activity. Probabilities are given on a scale of 0-1, where 0 indicates that a peptide is most probably non-antiviral and 1 that it is most probably antiviral.
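The input handling and score averaging described above can be sketched as follows. This is a hypothetical illustration and not the web server's code: `parse_fasta` and `ensemble_probability` are assumed names, and each "model" is represented by a plain callable that returns a probability between 0 and 1, standing in for a trained neural network.

```python
from statistics import mean
from typing import Callable, Dict, Iterable

NATURAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, rejecting non-natural residues."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""  # note: a repeated header overwrites the earlier record
        elif header is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            records[header] += line.upper()
    for name, seq in records.items():
        unsupported = set(seq) - NATURAL_AMINO_ACIDS
        if unsupported:
            raise ValueError(f"{name}: unsupported residues {sorted(unsupported)}")
    return records


def ensemble_probability(sequence: str, models: Iterable[Callable[[str], float]]) -> float:
    """Average the probabilities returned by each model of the ensemble."""
    return mean(model(sequence) for model in models)


# Stand-in models (assumptions): real models would be trained neural networks
dummy_models = [lambda s: 0.8, lambda s: 0.6]
peptides = parse_fasta(">pep1\nGIGKFLHS\nAKKF\n>pep2\nACDE\n")
scores = {name: ensemble_probability(seq, dummy_models) for name, seq in peptides.items()}
```

Averaging over the ensemble reduces the influence of any single network's idiosyncrasies on the returned probability.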

Mutation analysis may be carried out on single peptide sequences by selecting the mutation analysis option and inputting the number of the residue to be mutated. Mutant sequences are created by substituting the residue at the specified position with each of the other 19 natural amino acids. The probability of each of the mutant sequences possessing antiviral activity is returned by the chosen neural network model.
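The generation of mutant sequences for a single position can be sketched as below. This is an illustrative sketch, not the server's implementation: the function name `point_mutants`, the 1-based residue numbering and the mutation labels (for example `I2A`) are assumptions. Each natural amino acid other than the native residue is substituted in turn, giving 19 mutants.

```python
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids


def point_mutants(sequence: str, position: int) -> Dict[str, str]:
    """Return {label: mutant sequence} for every substitution at `position` (1-based)."""
    seq = sequence.upper()
    if not 1 <= position <= len(seq):
        raise ValueError("position is outside the sequence")
    native = seq[position - 1]
    mutants = {}
    for aa in AMINO_ACIDS:
        if aa == native:
            continue  # substituting the native residue would reproduce the input
        label = f"{native}{position}{aa}"
        mutants[label] = seq[:position - 1] + aa + seq[position:]
    return mutants


# Example: all substitutions at residue 2 of a short, made-up peptide
mutants = point_mutants("GIGK", 2)
```

Each mutant sequence would then be scored by the selected model, so that the effect of every substitution at that position can be compared against the native peptide.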

Residue scans, such as an alanine scan, are available for single peptide sequences by choosing the residue scan option and selecting the amino acid residue to scan with. Mutant sequences are obtained by substituting successive residues with the selected amino acid residue. The probability of the native and mutant sequences possessing antiviral activity is returned by the selected neural network model.
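A residue scan of this kind can be sketched as follows. The function name `residue_scan` and the mutation labels are hypothetical, and skipping positions that already carry the scan residue is an assumption; the server may handle such positions differently.

```python
from typing import List, Tuple


def residue_scan(sequence: str, scan_residue: str = "A") -> List[Tuple[str, str]]:
    """Return (label, mutant) pairs, replacing one position at a time with `scan_residue`."""
    seq = sequence.upper()
    scan_residue = scan_residue.upper()
    mutants = []
    for i, native in enumerate(seq):
        if native == scan_residue:
            continue  # assumption: positions already holding the scan residue are skipped
        label = f"{native}{i + 1}{scan_residue}"  # 1-based position in the label
        mutants.append((label, seq[:i] + scan_residue + seq[i + 1:]))
    return mutants


# Example: alanine scan of a short, made-up peptide
alanine_scan = residue_scan("GAK", "A")
```

Scoring the native sequence alongside each mutant highlights the positions at which substitution changes the predicted probability the most.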

• An artificial neural network model, ENNAVIA, was constructed for the prediction of antiviral and anti-coronavirus peptides
• Feature extraction was used to obtain compositional and physicochemical descriptors from the peptide sequences
• Transfer learning was employed to adapt neural networks for anti-coronavirus activity prediction
• ENNAVIA was evaluated by 10-fold cross-validation and an external test set
• ENNAVIA outperforms the current best-in-class methods for antiviral peptide prediction

All data generated or analysed during this study are available for download at https://research.timmons.eu/ennavia.

The ancient Virus World and evolution of cells

Emerging viral diseases

Mechanisms of viral emergence

Genetic diversity and evolution of SARS-CoV-2

Control of Viral Infections and Diseases

Antimicrobial peptides: An emerging category of therapeutic agents

The role of cationic antimicrobial peptides in innate host defences

The Potential of Antiviral Peptides as COVID-19 Therapeutics

A novel peptide with potent and broad-spectrum antiviral activities against multiple respiratory viruses

Virucidal activity of a scorpion venom peptide variant mucroporin-M1 against measles, SARS-CoV and influenza H5N1 viruses

Structure-based discovery of Middle East respiratory syndrome coronavirus fusion inhibitor

Peptide-based drug design: Here and now

Therapeutic peptides: Historical perspectives, current development trends, and future directions

General method for rapid synthesis of multicomponent peptide mixtures

Methods for generating and screening libraries of genetically encoded cyclic peptides in drug discovery

Evolving a peptide: Library platforms and diversification strategies

Rationally Designed ACE2-Derived Peptides Inhibit SARS-CoV-2

Current progress in antiviral strategies

Direct-acting antiviral agents for hepatitis C virus infection

Approaches for Identification of HIV-1 Entry Inhibitors Targeting gp41 Pocket

The effect of peginterferon alpha-2a vs. peginterferon alpha-2b in treatment of naive chronic HCV genotype-4 patients: A single centre Egyptian study

Interferons: Success in anti-viral immunotherapy

Antiviral peptides as promising therapeutic drugs

Antiviral Peptides: Identification and Validation

AVPdb: A database of experimentally validated antiviral peptides targeting medically important viruses

Erratum: DBAASP v.2: An enhanced database of structure and antimicrobial/cytotoxic activity of natural and synthetic peptides

DBAASP v3: Database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics

CAMP: Collection of sequences and structures of antimicrobial peptides

APD3: The antimicrobial peptide database as a tool for research and education

AVPpred: Collection and prediction of highly effective antiviral peptides

AntiVPP 1.0: A portable tool for prediction of antiviral peptides

Meta-iAVP: A sequence-based meta-predictor for improving the prediction of antiviral peptides using effective feature representation

Better understanding and prediction of antiviral peptides through primary and secondary structure feature importance

Analysis and Prediction of Highly Effective Antiviral Peptides Based on Random Forests

In silico approaches for the prediction and analysis of antiviral peptides: a review

Identifying anticoronavirus peptides by incorporating different negative datasets and imbalanced learning strategies

Protein-protein interaction site prediction through combining local and global features with deep neural networks

SCLpred: Protein subcellular localization prediction by N-to-1 neural networks

SCLpred-EMS: Subcellular localization prediction of endomembrane system and secretory pathway proteins by Deep N-to-1 Convolutional Neural Networks

CPPpred: Prediction of cell penetrating peptides

HAPPENN is a novel tool for hemolytic activity prediction for therapeutic peptides which employs neural networks

ENNAACT is a novel tool which employs neural networks for anticancer activity classification for therapeutic peptides

APPTEST is an innovative new method for the automatic prediction of peptide tertiary structures

AntiBP2: Improved version of antibacterial peptide prediction

CD-HIT Suite: A web server for clustering and comparing biological sequences

Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences

A new sequence logo plot to highlight enrichment and depletion

Deep learning improves antimicrobial peptide recognition

An iterative method for extracting energy-like quantities from protein structures

Predicting protein-protein interactions based only on sequences information

Identification of bacteriophage virion proteins by the ANOVA feature selection and analysis

PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions

PyDPI: Freely available python package for chemoinformatics, bioinformatics, and chemogenomics studies

Python for antimicrobial peptides

Thermostability and Aliphatic Index of Globular Proteins

expressivity and aromaticity are the major trends of amino-acid usage in 999 escherichia coli chromosome-encoded genes

Antibacterial and antimalarial properties of peptides that are cecropin-melittin hybrids

Structural Prediction of Membrane-Bound Proteins

Hydrophobic moments and protein structure

A simple method for displaying the hydropathic character of a protein

Prediction of protein antigenic determinants from amino acid sequences

Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins

The characterization of amino acid sequences in proteins by statistical methods

Refractive indices of proteins in relation to amino acid composition and specific volume

Positional flexibilities of amino acid residues in globular proteins

Conformational Preferences of Amino Acids in Globular Proteins

An amino acid "transmembrane tendency" scale that approaches the theoretical limit to accuracy for prediction of transmembrane helices: Relationship to biological hydrophobicity

Amino acid difference formula to help explain protein evolution

Computational design of highly selective antimicrobial peptides

Depth-dependent Potential for Assessing the Energies of Insertion of Amino Acid Side-chains into Membranes: Derivation and Applications to Determining the Orientation of Transmembrane and Interfacial Helices

Amino Acid Side Chain Descriptors for Quantitative Structure-Activity Relationship Studies of Peptide Analogues

Topological shape and size of peptides: Identification of potential allele specific helper T cell antigenic sites

MS-WHIM scores for amino acids: A new 3D-description for peptide QSAR and QSPR studies

Scrutinizing MHC-I Binding Peptides and Their Limits of Variation

Amino Acids Characterization by GRID and Multivariate Data Analysis

New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids

Amino acid index database

Hydrophobic parameters π of amino-acid side chains from the partitioning of N-acetyl-amino-acid amides

Physicochemical Basis of Amino Acid Hydrophobicity Scales: Evaluation of Four New Scales of Amino Acid Hydrophobicity Coefficients Derived from RP-HPLC of Peptides

Prediction of protein surface accessibility with information theory

New Hydrophilicity Scale Derived from High-Performance Liquid Chromatography Peptide Retention Data: Correlation of Predicted Surface Residues with Antigenicity and X-ray-Derived Accessible Sites

Partition coefficients of amino acids and hydrophobic parameters π of their side-chains as measured by thin-layer chromatography

Amino acid side-chain partition energies and distribution of residues in soluble proteins

Atomic and residue hydrophilicity in the context of folded protein structures

Prediction of protein function from sequence properties

Evolution of the genetic code

Local interactions as a structure determinant for protein molecules: II

Protein folding and the genetic code: An alternative quantitative model

Helix capping

Optimization of Amino Acid Parameters for Correspondence of Sequence to Tertiary Structures of Proteins

Quantifying the Effect of Burial of Amino Acid Residues on Protein Stability

On lines and planes of closest fit to systems of points in space

Visualizing data using t-SNE

Support-Vector Networks

Random decision forests

Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms

Scikit-learn: Machine Learning in Python

Large-Scale Machine Learning on Heterogeneous Distributed Systems

Batch normalization: Accelerating deep network training by reducing internal covariate shift

A Simple Way to Prevent Neural Networks from Overfitting

A method for stochastic optimization

ACPred: A computational tool for the prediction and analysis of anticancer peptides

NMR model structure of the antimicrobial peptide maximin 3

Structural and positional studies of the antimicrobial peptide brevinin-1BYa in membrane-mimetic environments

Insights into conformation and membrane interactions of the acyclic and dicarba-bridged brevinin-1BYa antimicrobial peptides

The authors would also like to thank University College Dublin for the Research Scholarship granted to P.B.T.