Submitted 24 June 2016 Accepted 8 September 2016 Published 10 October 2016 Corresponding author Paul F. Long, paul.long@kcl.ac.uk Academic editor Jaume Bacardit Additional Information and Declarations can be found on page 16 DOI 10.7717/peerj-cs.90 Copyright 2016 Gacesa et al. Distributed under Creative Commons CC-BY 4.0 OPEN ACCESS Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions Ranko Gacesa1, David J. Barlow1 and Paul F. Long1,2,3,4 1 Institute of Pharmaceutical Science, King’s College London, London, United Kingdom 2 Department of Chemistry, King’s College London, London, United Kingdom 3 Brazil Institute, King’s College London, London, United Kingdom 4 Faculdade de Ciências Farmacêuticas, Universidade de São Paulo, São Paulo, Brazil ABSTRACT Ascribing function to sequence in the absence of biological data is an ongoing challenge in bioinformatics. Differentiating the toxins of venomous animals from homologues having other physiological functions is particularly problematic as there are no universally accepted methods by which to attribute toxin function using sequence data alone. Bioinformatics tools that do exist are difficult to implement for researchers with little bioinformatics training. Here we announce a machine learning tool called ‘ToxClassifier’ that enables simple and consistent discrimination of toxins from non-toxin sequences with >99% accuracy and compare it to commonly used toxin annotation methods. ‘ToxClassifer’ also reports the best-hit annotation allowing placement of a toxin into the most appropriate toxin protein family, or relates it to a non-toxic protein having the closest homology, giving enhanced curation of existing biological databases and new venomics projects. ‘ToxClassifier’ is available for free, either to download (https://github.com/rgacesa/ToxClassifier) or to use on a web-based server (http://bioserv7.bioinfo.pbf.hr/ToxClassifier/). Subjects Bioinformatics, Computational Biology, Data Mining and Machine Learning Keywords Protein sequences, Biological function, Animal venom, Automatic annotation, Functional prediction INTRODUCTION Falling costs of tandem mass spectrometry for shotgun proteomics have made generating vast amounts of protein sequence data increasingly affordable, yet the gap between obtaining these sequences and then assigning biological function to them continues to widen (Bromberg et al., 2009). Often, most sequences are deposited into protein databases with little, if any, accompanying experimental data from which biological functions can be inferred. Customarily, biological function is deduced indirectly by comparing amino acid sequence similarity to other proteins in large databases to calculate a ranking of proteins with respect to the query sequence. Using simple pair-wise comparisons as a sequence searching procedure, the BLAST suite of programs (for example, BLASTp) was first of its kind and has gained almost unprecedented acceptance among scientists (Neumann, Kumar & Shalchian-Tabrizi, 2014). Variations of the BLAST algorithm (for instance, PSI- BLAST (Altschul et al., 1997)) and development of probabilistic models (such as hidden How to cite this article Gacesa et al. (2016), Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions. PeerJ Comput. Sci. 2:e90; DOI 10.7717/peerj-cs.90 https://peerj.com mailto:paul.long@kcl.ac.uk https://peerj.com/academic-boards/editors/ https://peerj.com/academic-boards/editors/ http://dx.doi.org/10.7717/peerj-cs.90 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://github.com/rgacesa/ToxClassifier http://bioserv7.bioinfo.pbf.hr/ToxClassifier/ http://dx.doi.org/10.7717/peerj-cs.90 Markov models, HMMs (Krogh et al., 1994)) use multiple sequence alignments to detect conserved sequences (also referred to as motifs). Use of these models enables detection of remote homology between proteins seemingly unrelated when analysed by pairwise alignment alone (Krogh et al., 1994). Conversely, development of accurate algorithms and fast software tools that can automatically identify critical amino acid residues responsible for differences in protein function amongst sequences having exceptionally high sequence similarity remains a challenging problem for bioinformatics. In a post-genomic era, the toxins of venomous animals are an emerging paradigm. Animal venoms comprise predominantly toxic peptides and proteins. Duplication of genes that formerly encoded peptides and proteins having non-toxic physiological functions is one of the foremost evolutionary drivers that gives rise to the enormous functional diversity seen in animal venom toxins (Fry, 2005; Chang & Duda, 2012). However, evidence is ambiguous as to whether these genes were expressed in multiple body tissues, with the duplicate copy then recruited into venom glands with subsequent neo-functionalisation to develop toxicity, or if there was duplication with ensuing sub-functionalisation of genes encoding pre-existing but non-toxic venom gland proteins (Hargreaves et al., 2014). Examples of both mechanisms have been demonstrated in different venomous animals (Reyes-Velasco et al., 2015; Vonk et al., 2013; Junqueira de Azevedo et al., 2015). Nonetheless, many toxin proteins that constitute venoms share a remarkable similarity to other proteins with non-toxic physiological functions, and deciphering sequence data to disentangle these functions is not a trivial task (Kaas & Craik, 2015). Previous proteomic data from our laboratory and subsequent results of others realised an astonishingly high sequence similarity between cnidarian (jellyfish, coral, sea anemones etc.) toxins and those of other higher venomous animals (Weston et al., 2012; Weston et al., 2013; Li et al., 2012; Li et al., 2014). This suggested to us that a small number of sequences, when occurring in combination, might explain this similarity and prompted the search for toxin-specific motifs (Starcevic & Long, 2013). An unsupervised procedure was developed that resulted in the identification of motifs we called ‘tox-bits’, and which could describe most toxins as combinations of between 2–3 ‘tox-bits’ (Starcevic et al., 2015). The ‘tox-bits’ are defined as HMM-profiles and were found to be superior at differentiating toxin from non-toxin sequences, when compared against standard BLAST or HMM based methods (Starcevic et al., 2015). However, implementation of ‘tox-bits’ HMM profiles is not straightforward for scientists with little or no bioinformatics experience. Hence, in this paper we introduce and make freely available an easy-to-use machine learning tool called the ‘ToxClassifier’ that employs ‘tox-bits’ HMM profiles and other standard classifier tools running in parallel to distinguish toxins from their non-toxic homologues. METHODS Datasets Data for training and testing of machine learning classifiers used in ToxClassifier ensemble was obtained from UniProtKB database (Bateman et al., 2015), according to the following methodology: Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 2/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90 (1) ‘Positive’ dataset representing well annotated putative animal toxins was downloaded from UniProtKB/SwissProt-ToxProt (Jungo et al., 2012). Database was searched for animal toxins and venoms, using search query: taxonomy: ‘‘Metazoa [33208]’’ (keyword:toxin OR annotation:(type: ‘‘tissue specificity’’ venom)). All duplicate entries with identical sequence or sequence identifier were removed, resulting in 8,093 sequences. (2) ‘Easy’ negative dataset representative of physiological proteins was obtained by random sampling of 50,000 sequences in UniProtKB/SwissProt database (Bateman et al., 2015). All entries with duplicate sequence identifier or protein sequence were removed, as were all entries also occuring in Positive dataset; final dataset included 47,144 protein sequences. (3) ‘Moderate difficulty’ negative dataset was designed to match highly curated toxin- like proteins with physiological function; it was created by BLASTp searching UniProtKB/SwissProt database with Positive dataset, with e-value cutoff of 1.0e–10. Resulting BLASTp hits were collected and all duplicates (with identical sequence or sequence identifier), and all sequences also present in Positive dataset or Easy dataset were removed, resulting in 8,034 proteins. (4) ‘Hard’ negative dataset was constructed from TrEMBL database (Bairoch & Apweiler, 2000) instead of Swiss-Prot. As with Moderate dataset, it was created from results of BLASTp using Positive dataset as query and TrEMBL as target database. Duplicates and sequences also occurring in Positive, Easy or Moderate datasets were removed for total of 7,403 sequences. All datasets are available for download at https://github.com/rgacesa/ToxClassifier/tree/ master/datasets. Machine learning classifiers, training and testing Models describing protein sequences were constructed as follows: (1) Single Amino acid frequency model (OF): model uses length of sequence and frequency of each amino acid as input features. (2) Amino acid dimer frequency model (BIF): model uses length of sequence, frequency of each amino acid and of each amino acid 2-mer. (3) Naive tox-bits model (NTB): input features for this model are the number of ‘tox-bits’ for each ‘tox-bits’ HMM listed in the ‘tox-bits’ database (Starcevic et al., 2015). (4) Scored ‘tox-bits’ model (STB): STB is a modification of the NTB model, with HMM bit-scores replacing the number of ‘tox-bits’ in each ‘tox-bit’ HMM model. (5) Tri-Blast Simple (TBS) model: TBS uses BLASTp searches against positive (UniProtKB/SwissProt-ToxProt) and two negative control databases (close non-toxins from UniProtKB/SwissProt and non-toxins from TrEMBL); features include bit-score, query length, subject length, query/subject length ratio, query coverage, percentage of iden- tity, percentage of positive matches; features also include amino-acid frequencies. Scores are computed from the ‘best-hit’ in each database, with a BLAST e-value of 1.0e–10. (6) Tri-Blast Enhanced A (TBEa) model: TBEa model is an expanded variant of TBS, with amino dimer frequencies included in the model. Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 3/20 https://peerj.com https://github.com/rgacesa/ToxClassifier/tree/master/datasets https://github.com/rgacesa/ToxClassifier/tree/master/datasets http://dx.doi.org/10.7717/peerj-cs.90 (7) Tri-Blast Enhanced B (TBEb): model is a variation of TBEa, trained on 80% of the input dataset and with a BLAST e-value cut-off value of 1.0e+3 for the detection of similar toxin or non-toxic sequences. Feature extraction and vectorisation was implemented using the Python 2.7 (https://www.python.org/download/releases/2.7/) scripting language, NCBI BLAST+ version 2.2.31 (Camacho et al., 2009), HMMER 3.1b1 (Eddy, 2011), the ‘tox-bit’ HMM profile database (Starcevic et al., 2015) and a set of custom BLAST databases based on UniProtKB/SwissProt-ToxProt, UniProtKB/SwissProt and TrEMBL databases. Vectorization scripts, vectorised data sets and trained models are available for download at https://github.com/rgacesa/ToxClassifier. Support Vector Machine (SVM), Gradient Boosted Machine (GBM) and Generalised Linear Model (GLM) classifiers were trained for each of the models; Classifiers were implemented using the R-programming language, Caret package (http://topepo.github.io/ caret/index.html). The training set for each dataset was selected by random sampling of 75% of the sequences, and combined training sets were used to train the classifiers. Input features were 0-centered and scaled by standard deviation and training was performed with 10 internal bootstraps. Learning curves were constructed for each classifier to evaluate training efficiency and potential overtraining by plotting performance versus number of sequences in training set. Classifiers were tested using those sequences from Positive, Easy, Moderate and Hard datasets not used in training. Performance was evaluated on each of the datasets and on the summary dataset. The following performance measurements were calculated: (1) Number of true positives (TP): number of toxic sequences in dataset correctly predicted as toxins; sequence was considered a toxin if listed in UniProtKB/ToxProt database. (2) Number of true negatives (TN): number of non-toxic sequences in dataset which are correctly predicted as non-toxic (not in UniProtKB/ToxProt database). (3) Number of false positive (FP): number of non-toxic sequences in dataset incorrectly classified as toxins. (4) Number of false negatives (FN): number of toxic sequences in dataset incorrectly classified as non-toxic. (5) Accuracy (ACC): accuracy is calculated as proportion of sequences correctly classified as toxins or non-toxins; ACC = (TP + TN) / (TP + TN + FP + FN). (6) Specificity (SPEC): also called true negative rate, specificity is proportion of correctly predicted non-toxins (true negatives); SPEC = TN / (TN + FP). (7) Sensitivity (SENS): also called true positive rate or recall, sensitivity is proportion of correctly predicted toxins (true positives); SENS = TP/(TP + FN). (8) Balanced accuracy (B.ACC): balanced accuracy is a mean value of specificity and sensitivity; BACC = (SPEC + SENS)/2. (9) Negative predictive value (NPV): proportion of negatives that are true negatives. NPV = TN/(TN + FN). (10) Positive predictive value (PPV): also called Precision, PPV measures proportion of positives which are true positives; PPV = TP/(TP + FP). Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 4/20 https://peerj.com https://www.python.org/download/releases/2.7/ https://github.com/rgacesa/ToxClassifier http://topepo.github.io/caret/index.html http://topepo.github.io/caret/index.html http://dx.doi.org/10.7717/peerj-cs.90 (11) F-Score (F1): F-score is harmonic mean of precision and sensitivity and represents weighted average of precision and recall. It is calculated as F1=2∗TP/(2∗TP+ FP+FN). (12) Matthews’ correlation coefficient (MCC): the MCC value (also known as phi- coefficient) is a measure of correlation between observed and predicted. It is considered balanced measure even for classes with different amounts of positive and negative values; MCC= TP∗TN−FP∗FN√ (TP+FP)∗(TP+FN)∗(TN+FP)∗(TN+FN) (Matthews, 1975; Powers, 2011). Conventional annotation models Annotation models simulating manual annotation were constructed based on BLAST and HMMER, as follows: (1) Naive-BLAST: annotation method is based on the assumption that a sequence is a toxin if a BLAST search against the UniProtKB/SwissProt-ToxProt dataset returns a positive hit at a certain e-value (usually between 1.0e–20 and 1.0e–5). This approach and its variations are commonly encountered in the literature (Sher & Zlotkin, 2009; Schwartz et al., 2015; Liu et al., 2015; Whittington et al., 2010). (2) OneBLAST: this approach classifies a sequence by a single BLAST search against databases including UniProtKB/SwissProt-ToxProt and other ‘toxin-like’ but non-toxic sequences extracted from UniProtKB/SwissProt and TrEMBL databases (combination of Positive, Moderate and Hard datasets). A sequence is classified as a putative toxin if the highest scoring BLAST hit at a selected e-value is from UniProtKB/SwissProt-ToxProt; it is classified as non-toxic if top BLAST hit is not from UniProtKB/SwissProt-ToxProt, or if all BLAST hits are above the selected e-value. (3) TriBLAST: annotation is performed by a variation of OneBLAST, using separate BLAST searches against UniProtKB/SwissProt-ToxProt and two non-toxin databases (for the Moderate and Hard datasets); a sequence is classified as a toxin if the BLAST search against UniProtKB/SwissProt-ToxProt returns a hit below a certain e-value and with a higher score than the searches against the other databases. Variations of this method are commonly used in toxin annotation (Gacesa et al., 2015; Rachamim et al., 2014). (4) hmmerToxBits: this method is a variation of the naive-BLAST, using HMMER package Hmmsearch instead of BLAST, and the database of ‘tox-bits’ HMM models as the target database. A sequence is classified as a toxin if one or more HMMs can be detected within a certain e-value cut-off. (5) hmmerVenom: modification of hmmerToxBits, the method uses HMM profiles extracted by ‘venom’ and ‘toxin’ text search of the Pfam database instead of ‘tox-bit’ HMMs. (6) twinHmmerPfam: a HMMER based variant of TriBLAST approach, this method performs hmmsearches against two HMM databases (toxin/positive HMMs and negative control) and compares bitscores. Sequence is annotated as a toxin if bitscore for toxin HMM database is higher. TwinHmmerPfam toxin HMMs were extracted from Pfam by keyword search for ‘toxin’ and ‘venom,’ while negative control database is comprised from remainder of Pfam. Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 5/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90 (7) twinHmmerToxBits: a variation of twinHmmerPfam method, method compares hmmsearch against toxin HMMs and negative control. For this model, positive database is composed from ‘tox-bit’ HMMs derived from Tox-Prot toxins (Starcevic et al., 2015), while negative control is Pfam database from which HMMs containing ‘toxin’ or ‘venom’ keywords were removed. BLAST databases for Naive-BLAST, OneBLAST and TriBLAST were constructed from 75% randomly sampled sequences in Positive, Easy, Moderate and Hard and performance was measured by annotating remainder of data. HMMER based methods were tested with 25% of random sequences from input datasets, using all appropriate HMM models. Database construction and testing was repeated 10 times and results were averaged. ToxClassifier Meta-classifier calibration and testing ToxClassifier meta-classifier was constructed from nine annotation model and classifier combinations (BIF_SVM, BIF_GBM, STB_SVM, STB_GBM, TBS_SVM, TBS_SVM, TBEa_SVM, TBEb_SVM and TBEb_GBM). Each of nine classifiers reports 1 if the input sequence is predicted as toxin or 0 if predicted as non-toxic, for final prediction score of 0 to 9. Datasets used for calibration of meta-classifier were chosen from the set of venomous and non-venomous animals (human Homo sapiens, the house mouse Mus musculus, the Burmese python Python bivittatus, king cobra Ophiophagus hannah, the duck-billed platypus Ornithorhynchus anatinus, the snakelocks sea anemone Anemonia viridis, the starlet sea anemone Nematostella vectensis; and all proteins deposited in the UniProtKB/SwissProt and TrEMBL databases attributed to snakes, spiders, wasps and Conus snails). All sequences were downloaded from UniProtKB database with exception of Python bivittatus which was not available in UniProtKB and was downloaded from NCBI protein database (Wheeler et al., 2003); all data is available at https://github.com/rgacesa/ToxClassifier/tree/master/datasets. Datasets were split into training sets consisting of 75% of data and test sets including remaining 25% of sequences. ToxClassifier meta-classifier was calibrated by evaluating prediction score versus per- formance for each animal training set and for summary dataset constructed by combining animal training datasets with exclusion of Conus snail data, which was dropped due to sus- pected low quality of annotation. Calibrated ToxClassifier, with prediction score 5 or more as cut-off for positive classification, was tested on animal data test sets, and performance measures were compared to OneBLAST, NaiveBLAST models and ClanTox server. Data used for training and testing and all calculated performance metrics are available at https: //github.com/rgacesa/ToxClassifier/tree/master/datasets/toxclassifier_calibration_test. ToxClassifier comparison to other published tools Performance of ClanTox server (Kaplan, Morpurgo & Linial, 2007) was tested on Positive, Easy, Moderate and Hard datasets and on animal data test sets. ToxinPred server (Gupta et al., 2013) was tested on 868 Positive dataset sequences of length up to 30 amino acids (ToxinPred sequence length limit) and on negative dataset composed 30 amino acid or shorter protein sequences randomly selected from UniProt database (5,673 non-duplicate, non-ToxProt sequences). ToxinPred was not tested on animal data due to lack of short Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 6/20 https://peerj.com https://github.com/rgacesa/ToxClassifier/tree/master/datasets https://github.com/rgacesa/ToxClassifier/tree/master/datasets/toxclassifier_calibration_test https://github.com/rgacesa/ToxClassifier/tree/master/datasets/toxclassifier_calibration_test http://dx.doi.org/10.7717/peerj-cs.90 sequences available in these datasets. SpiderP server (Wong et al., 2013) was not tested as service was not available at the time and PredCSF server (Fan et al., 2016) was not tested as it was deemed too specialized and only accepts single sequence as input. User interface The ToxClassifier web service front-end is implemented using HTML 5.1 (https://www. w3.org/html/), JavaScript (https://www.javascript.com/), jQuery (https://jquery.com/), CSS (https://www.w3.org/Style/CSS/), Java 7.0 (https://www.oracle.com/java/index.html) and the Java Server Pages (JSP 2.1) framework (http://www.oracle.com/technetwork/java/ javaee/jsp/index.html). Visualisation is performed using R (https://www.r-project.org/), ggplot2 (http://ggplot2.org/) and R markdown (http://rmarkdown.rstudio.com/) packages. ToxClassifier runs on an Apache Tomcat 8.0 web server (http://tomcat.apache.org/ download-80.cgi), under an Ubuntu Linux 12.04 operating system (http://www. ubuntu.com/). The service is hosted by the Section of Bioinformatics, Faculty of Food Technology and Biotechnology, University of Zagreb, Croatia (http://www.pbf.unizg.hr/ en/departments/department_of_biochemical_engineering/section_for_bioinformatics). RESULTS The accuracy of three individual machine-learning classifiers to predict toxins from proteins having other physiological functions was assessed by training each classifier using seven different annotation models. The learning classifiers were a Support Vector Machine (SVM) and Gradient Boosted Machine (GBM) chosen as high-performing predictors, and a Generalised Linear Model (GLM) regarded as a simple classifier, but with which a baseline could be established that would allow comparison of the performance of the SVM and GBM machines. A detailed description of the annotation models is given in ‘Methods’ section, but briefly the annotation models used the following sequence information from the training set as classifier inputs: either the frequency of amino acids (TBSim) or combinations of two amino-acids (BIF), the presence of absence or ‘tox-bits’ (SToxA), HMM scores for ‘tox-bits’ (SToxB), a selection of BLAST output co-variants (TBEa) or a variation on TBSim and TBEa (TBEb). The training set was constructed by merging 75% of arbitrarily selected sequences from each of the following datasets: 1/ A ‘Positive’ dataset that contained all 8,093 protein sequences deposited in the UniProtKB/SwissProt-ToxProt database of animal venom toxins (Jungo et al., 2012). 2/ An ‘Easy’ dataset composed of 47,144 random, non-duplicate sequences from UniProtKB/SwissProt database (Bateman et al., 2015). 3/ A ‘Moderate’ dataset comprised of 8,034 unique sequences from the manually curated UniProtKB/SwissProt database considered to be non-toxic but with homology to toxin proteins in UniProtKB/SwissProt-ToxProt. 4/ A ‘Hard’ dataset that included 7,403 non- duplicate sequences extracted from the computer annotated TrEMBL (Bairoch & Apweiler, 2000) database, also considered to be non-toxic but with homology to animal venom toxins in UniProtKB/SwissProt-ToxProt. All training was performed using 10 internal bootstrap cross-validations on the training set, and learning curves showing the accuracy of predictions versus the number of sequences Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 7/20 https://peerj.com https://www.w3.org/html/ https://www.w3.org/html/ https://www.javascript.com/ https://jquery.com/ https://www.w3.org/Style/CSS/ https://www.oracle.com/java/index.html http://www.oracle.com/technetwork/java/javaee/jsp/index.html http://www.oracle.com/technetwork/java/javaee/jsp/index.html https://www.r-project.org/ http://ggplot2.org/ http://rmarkdown.rstudio.com/ http://tomcat.apache.org/download-80.cgi http://tomcat.apache.org/download-80.cgi http://www.ubuntu.com/ http://www.ubuntu.com/ http://www.pbf.unizg.hr/en/departments/department_of_biochemical_engineering/section_for_bioinformatics http://www.pbf.unizg.hr/en/departments/department_of_biochemical_engineering/section_for_bioinformatics http://dx.doi.org/10.7717/peerj-cs.90 in the training sets were constructed, thereby allowing a comparative evaluation of training efficiency (Fig. S1). The trained classifiers were then tested for prediction accuracy using the remaining 25% of sequences not included in the training set. Performance values were calculated to give an overall comparative classification of protein sequences as toxins or non-toxins (Table 1). By comparing learning curves (Fig. S1) and accuracy of predictions (Table 1), 9 of the annotation model and classifier combinations were chosen to construct the ‘ToxClassifier’ ensemble. The trained classifiers were: BIF_SVM, BIF_GBM, STB_SVM, STB_GBM, TBS_SVM, TBS_SVM, TBEa_SVM, TBEb_SVM and TBEb_GBM. These classifiers all gave excellent accuracy scores to predict toxins from the Positive dataset (range 0.82–0.96) and non-toxin proteins from the Easy, Moderate and Hard datasets (range 0.92–1.00). No GLM classifiers were included in the ensemble because prediction accuracies were considerably lower when compared with SVM and GBM machines. Classifiers using NTB and OF annotation models were also abandoned in favour of better performing STB and BIF models. Furthermore, the TBEa_GBM model consistently underperformed compared to the TBE_SVM model and was excluded, giving an odd number of classifiers in the ensemble, thereby avoiding a ‘tied vote’ scenario when the outputs were interpreted collectively. The prediction accuracy of the trained machine learning classifiers was next compared to more conventional annotation methods based on sequence alignment, to determine if machine learning predictions were superior or inferior to well established and accepted bioinformatics tools. A detailed description of the annotation models based on these bioinformatics tools is given in ‘Methods.’ Briefly, simple predictions were made by taking the best-hit from BLAST comparisons between a query sequence and the UniProtKB/SwissProt-ToxProt database (naiveBLAST method), or the best-hit following a HMMER hmmsearch comparison between a query sequence and either existing HMM models for toxin protein families in the Pfam database (hmmerVenom model), or our own ‘tox-bits’ HMM models (hmmerToxbits classifier). More sophisticated annotation models also used BLAST or HMMER searches, but the best-hit was extracted following simultaneous comparisons between the query sequence and multiple datasets. These sophisticated annotation models are also described in ‘Methods’, but briefly these models were constructed from sequence information extracted from either UniProtKB/SwissProt- ToxProt sequences supplemented with additional toxin-like sequences from the UniProtKB/SwissProt and TrEMBL databases, or UniProtKB/SwissProt-ToxProt sequences supplemented with non-toxin sequences from the ‘Moderate’ and ‘Hard’ datasets used to train the machine classifiers. Training and test sets were analogous in design and execution to the machine classifier learning, with 75% of sequence information used to construct the BLAST and HMMER databases and the remaining 25% of data used to evaluate performance. Prediction accuracy measures for each query sequence using each of the bioinformatics models were repeated 10 times to give a final balanced accuracy value. Accuracy measure calculations are described in ‘Methods’. A range of sequence-alignment scoring was also tested to select the lowest BLAST and HMMER cut-off scores that gave the most precise toxin annotation. This value was 1.0e-20 for both BLAST and HMMER searches (Fig. S2). Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 8/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90/supp-1 http://dx.doi.org/10.7717/peerj-cs.90/supp-1 http://dx.doi.org/10.7717/peerj-cs.90/supp-2 http://dx.doi.org/10.7717/peerj-cs.90 Table 1 Prediction accuracy on positive and negative datasets, as well as range of measurements calculated for all test data, and described in detail in ‘Methods.’ An- notation models used as classifier inputs either: the frequency of amino acids (TBSim) or combinations of two amino-acids (BIF); the presence of absence or ‘Tox-Bits’ (SToxA); HMM scores for ‘ToxBits’ (SToxB); a selection of BLAST output co-variants (TBEa); a variation on TBSim and TBEa (TBEb). Classifier Learning Machines used were: Gradient Boosted (GBM), Support Vector (SVM) and Generalised Linear Model (GLM). The datasets were a ‘Positive’ control containing only validated animal tox- ins, an ‘Easy’ dataset composed of non-toxin sequences, a ‘Moderate’ dataset comprising curated non-toxin sequences but with homology to ‘Positive’ sequences, and a ‘Hard’ dataset that included all sequences from the ‘Moderate’ dataset, together with un-curated sequences also with homology to ‘Positive’ sequences. Classification scores for: Test set summary Annotation model Classifier Accuracy (Positive toxin dataset) Accuracy (Easy non-toxin dataset) Accuracy (Moderate non-toxin dataset) Accuracy (Hard non-toxin dataset) PPV NPV Sens. Spec. F1-value MCC GBM 0.80 0.99 0.98 0.92 0.80 0.98 0.84 0.97 0.82 0.80 SVM 0.80 1.00 0.98 0.94 0.80 0.99 0.91 0.97 0.85 0.84TBSim GLM 0.55 0.99 0.96 0.84 0.55 0.97 0.69 0.94 0.61 0.57 GBM 0.83 1.00 0.98 0.94 0.83 0.99 0.92 0.98 0.87 0.86 SVM 0.89 1.00 0.98 0.96 0.89 0.99 0.94 0.99 0.91 0.90BIF GLM 0.71 0.99 0.99 0.91 0.71 0.98 0.82 0.96 0.76 0.74 GVM 0.64 1.00 0.98 0.94 0.64 0.99 0.90 0.96 0.75 0.73 SToxA SVM 0.84 1.00 0.96 0.91 0.84 0.98 0.87 0.98 0.86 0.84 GBM 0.75 1.00 1.00 0.93 0.75 0.99 0.92 0.97 0.83 0.84 SVM 0.85 1.00 0.99 0.92 0.85 0.99 0.91 0.98 0.88 0.81SToxB GLM 0.03 1.00 0.99 0.99 0.03 1.00 0.61 0.89 0.06 0.12 GBM 0.88 1.00 1.00 0.99 0.88 1.00 0.99 0.98 0.93 0.93 SVM 0.93 1.00 1.00 0.97 0.93 1.00 0.97 0.99 0.95 0.94TBEa GLM 0.96 1.00 1.00 0.94 0.96 0.99 0.95 0.99 0.95 0.95 GBM 0.82 1.00 1.00 1.00 0.82 1.00 1.00 0.98 0.90 0.90 SVM 0.96 1.00 1.00 0.97 0.96 1.00 0.97 0.99 0.97 0.96TBEb GLM 0.93 1.00 1.00 0.99 0.93 1.00 0.99 0.99 0.96 0.95 G acesa etal. (2016),P eerJ C om put.S ci.,D O I10.7717/peerj-cs.90 9/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90 Machine learning classifiers were also evaluated against currently available published tools for toxin prediction and annotation; Animal toxin prediction server ClanTox (Kaplan, Morpurgo & Linial, 2007) was benchmarked using Positive, East, Moderate and Hard datasets and summary of these datasets. As ToxinPred (Gupta et al., 2013) tools predict only small peptide toxins, it was tested using a subset of Positive dataset with sequences no longer than 30 amino acids (868 sequences) and separate negative dataset composed of 5,673 random short proteins from UniProtKB database. SpiderP (Wong et al., 2013) was not benchmarked as the server no longer seems publically available. Finally, PredCSF (Fan et al., 2016) is a conotoxin-specific tool and was deemed not comparable to general annotation tools; it also only allows single sequence input, making it unsuitable for large scale testing. Final performance measures compared between the different tools are listed in Table 2. Testing of sequence-alignment based annotation models (Table 2) demonstrated that the simplistic methods (naiveBLAST, hmmerToxBits and hmmerVenom) gave high prediction accuracies for sequences in the Easy dataset (ranging from 0.95 for hmmerVenom to 0.99 for naiveBLAST), but underperformed in annotation of the physiological toxin-like sequences in the Moderate and Hard datasets (accuracies ranging from 0.74 to 0.83 for Moderate and 0.07 to 0.23 for Hard dataset (the poor performance here, also evinced by the low F1 and MCC scores)). More sophisticated BLAST-based methods (oneBLAST and triBLAST) gave very high prediction accuracy scores (0.93–0.999) for sequences in the Easy and Moderate datasets, but somewhat lower performance on sequences in the Positive and Hard datasets (0.86–0.90). Pfam-based twinHMMER gave the highest accuracy prediction for non-toxin sequences, but underperformed compared to the other annotation models against sequences in the positive toxin dataset (accuracy 0.56). The ‘tox-bits’ based variant accurately predicted sequences in the Easy and Moderate datasets (accuracy 0.85–0.999), but suffered from a high false positive rate when sequences in the Hard dataset were analysed (accuracy 0.44). When compared to machine learning-based methods, even the most accurate of the sequence alignment-based models (oneBLAST and triBLAST) were surpassed by the majority of the machine learning based classifiers, especially by TBEa and TBEb models (SVM and GBM variants), which gave the highest accuracy of prediction for sequences in all test datasets. All prediction methods showed higher performance for negative prediction (predicting non-toxin as non-toxin) compared to positive prediction (correctly predicting toxin as toxic), with Specificity (Spec) and Negative Prediction Value (NPV) significantly higher than Sensitivity (Sens) and Positive Prediction Value (PPV). Each of the 9 machine learning classifiers used in the ‘ToxClassifier’ ensemble gives a simple bit (1 or 0) value as output to predict whether the likely biological activity of the input sequence is as a toxin (1), or has a non-toxic (0) physiological role and scores are summed into final prediction score ranging from 0 to 9. Evaluation of this final prediction score was performed on test sets obtained from randomly sampling 75% of sequences in the published annotated genomes from a selection of venomous animals (king cobra Ophiophagus hannah, the duck-billed platypus Ornithorhynchus anatinus, the snakelocks sea anemone Anemonia viridis, the starlet sea anemone Nematostella vectensis; and all proteins deposited in the UniProtKB/SwissProt and TrEMBL databases attributed to snakes, Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 10/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90 Table 2 Performance for selected annotation models and published toxin prediction tools. Prediction accuracy is listed for positive and negative datasets and mea- surements are also shown for summary of all test data. ToxinPred server, marked with star and displayed in italic, was tested with short protein sequences only. Annota- tion models were constructed from sequence information extracted from either: BLAST (naiveBLAST), ‘tox-bits’ HMM (hmmerToxBits) or Pfam HMM (hmmerVenom) comparisons with the UniProtKB/SwissProt-ToxProt database; or BLAST (oneBLAST), ‘tox-bits’ HMM (twinHmmerPfam) or Pfam HMM (twinHmmerPfam) com- parisons with the UniProtKB/SwissProt-ToxProt sequences supplemented with additional toxin-like sequences from the UniProtKB/SwissProt and TrEMBL databases; or BLAST (triBLAST) comparisons with the UniProtKB/SwissProt-ToxProt sequences supplemented with non-toxin sequences from the ‘Moderate’ and ‘Hard’ datasets used to train the machine classifiers. The datasets were a ‘Positive’ control containing only validated animal toxins, an ‘Easy’ dataset composed of non-toxin sequences, a ‘Moderate’ dataset comprising curated non-toxin sequences but with homology to ‘Positive’ sequences, and a ‘Hard’ dataset that included all sequences from the ‘Moder- ate’ dataset, together with un-curated sequences also with homology to ‘Positive’ sequences. Classification scores for: Test set summary Annotation model Tool Accuracy (Positive toxin dataset) Accuracy (Easy non-toxin dataset) Accuracy (Moderate non-toxin dataset) Accuracy (Hard non-toxin dataset) PPV NPV Sens. Spec. F1-value MCC naiveBLAST BLAST 0.90 0.99 0.83 0.07 0.90 0.86 0.46 0.99 0.60 0.58 oneBLAST BLAST 0.86 1.00 0.98 0.90 0.86 0.99 0.89 0.98 0.87 0.86 triBLAST BLAST 0.87 0.99 0.93 0.87 0.87 0.97 0.78 0.98 0.82 0.80 hmmerToxBits HMMER 0.91 0.99 0.80 0.19 0.91 0.87 0.48 0.99 0.63 0.60 hmmerVenom HMMER 0.65 0.95 0.74 0.23 0.65 0.84 0.34 0.95 0.45 0.38 twinHmmerPfam HMMER 0.56 0.99 0.98 0.91 0.56 0.98 0.78 0.95 0.65 0.62 twinHmmerToxBits HMMER 0.85 1.00 0.93 0.44 0.85 0.92 0.59 0.98 0.70 0.67 ClanTox server ML 0.66 0.99 0.93 0.73 0.66 0.95 0.65 0.96 0.66 0.61 ToxinPred server* ML 0.55 0.98 N/A N/A 0.55 0.98 0.82 0.93 0.66 0.63 G acesa etal. (2016),P eerJ C om put.S ci.,D O I10.7717/peerj-cs.90 11/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90 spiders, wasps and Conus snails) and other animals considered to be non-venomous (human Homo sapiens, the house mouse Mus musculus and the Burmese python Python bivittatus). Calibration was performed by assessing the performance measures of the Toxclassifier ensemble relative to prediction score; calibration curves for summary of all animal genome data are presented in Fig. 1. When the average correct annotation of all input sequences for all genomes was calculated, a combined score from five out of the nine classifiers giving correct classification provided a good balance between the detection of toxins and the filtering of non-toxins. Hence, a calibration for the ToxClassifier ensemble was possible where an input sequence giving a combined score of >6 would be considered a likely toxin, a combined score of <3 would be regarded as non-toxic, while an input sequence presenting with a score 4 or 5 would suggest a potential toxin, but would require manual evaluation using additional tools, for example, InterProScan (Zdobnov & Apweiler, 2001). Performance of calibrated ‘ToxClassifier’ meta-classifier was evaluated on a test set comprising 25% of the animal genome data not used for calibration; these results were compared to naiveBLAST and OneBLAST conventional methods and to ClanTox server for animal toxin prediction (Kaplan, Morpurgo & Linial, 2007). Performance measurements are reported in Table 3 and comparison of F1-scores and MCC values across all datasets is presented in Fig. 2; Figs. S3/A and S3/B. Finally, the ‘ToxClassifier’ was assessed in a blinded experiment that used as input a set of protein sequences derived from the venom gland transcriptome of the Amazonian rain forest pit viper Bothrops atrox (Data S1). The sequences had been annotated using standard methods and manually inspected, with the biological activities of some also being authenticated experimentally. The results of the ‘ToxClassifier’ predictions matched with the expert annotation (Table 4). DISCUSSION The continued decline in proteomics sequencing costs over recent years has led to an explosion in venomics data characterising the toxic peptide and protein components in many venomous animals (Kaas & Craik, 2015). However, there is currently no widely accepted and standard method for functional annotation of toxins from these data sources, leading to inconsistent estimates for the number of toxins in the venom of the same animal. For example, the venom of the duck-billed platypus Ornithorhynchus anatinus has only 6 toxins listed following manual annotation in the latest release of the UniProtKB/SwissProt- ToxProt database (11th May 2016), yet 107 putative toxins were identified by a simple pair-wise BLASTp search using venom gland transcriptome sequences as input to search the UniProtKB/SwissProt ToxProt database (Jungo et al., 2012). In addition to separate homology searching methods to interpret the same data, many venomics projects now also include different manual filtering steps as part of the annotation process (Rachamim et al., 2014; Gacesa et al., 2015), exacerbating the problem of results verification. In this study, a selection of machine learning-based classifiers implementing a range of BLAST and HMMER-based annotation models were trained on datasets of known toxins, protein sequences assumed to be non-toxic but with homology to known toxins, and predicted proteins encoded in the genome, transcriptome or proteome of a range of Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 12/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90/supp-4 http://dx.doi.org/10.7717/peerj-cs.90/supp-4 http://dx.doi.org/10.7717/peerj-cs.90/supp-3 http://dx.doi.org/10.7717/peerj-cs.90 Figure 1 Calibration curves used to select final prediction scores for ToxClassifier ensemble. Each of performance measures, described in detail in Methods, is shown for ToxClassifier prediction with toxin prediction cut-off values 1–9. Dotted green line shows trends of prediction measurement with increase in ToxClassifier cut-off value and final cut-off value implemented in calibrated ToxClassifier is highlighted with blue line. venomous and non-venomous animals. A comparison between the results presented in Tables 1–3 demonstrated that the majority of the machine learning methods consistently out-performed standard bioinformatics approaches of functional annotation. Interestingly, all tested methods demonstrated higher performance for negative prediction (classification of non-toxic sequences) compared to positive classification (prediction of toxic sequences as toxins). These results demonstrate that differentiating between physiological toxin-like proteins and actual toxins is more difficult then prediction of random proteins as non-toxic, which is to be expected considering the similarity and common origin of many toxins and toxin-like sequences (Fry, 2005; Chang & Duda, 2012; Hargreaves et al., 2014). As such, it is important to consider balanced performance measurements when assessing toxin classifiers, with scores such as F1-score and MCC value (Matthews, 1975; Powers, 2011) providing Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 13/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90 Table 3 ToxClassifier calibrated meta-classifier, Test set results compared to oneBLAST, naiveBLAST and ClanTox annotation methods. Table lists comparison of classification performance for calibrated ToxClassifier to BLAST based annotation models and ClanTox toxin prediction server. All tests were conducted test dataset not used in calibration of ToxClassifier. Classification performance of ToxClassifier meta-classifier, compared to BLAST based methods and ClanTox server Annotation model Tool Accuracy Positive prediction value Negative prediction value Sensitivity Specificity F1-value MCC value naiveBLAST BLAST 0.976 0.294 0.998 0.857 0.977 0.438 0.494 oneBLAST BLAST 0.994 0.660 1.000 0.951 0.995 0.779 0.790 ClanTox server ML-based 0.947 0.209 1.000 0.954 0.976 0.342 0.440 ToxClassifier ML-based 0.997 0.825 1.000 0.967 0.998 0.890 0.892 Figure 2 Comparison between selected toxin prediction tools for all animal test datasets. Calibrated ToxClassifier, with positive prediction cutoff 5, is shown by green bar, ClanTox prediction server is displayed in orange, oneBLAST prediction performance in red and naiveBLAST in blue. F1- value and Matthews’ correlation coefficient are displayed for each of test sets and for summary of all data with exclusion of conus snail proteins. Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 14/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90 Table 4 Results of expert manual annotation to ToxClassifier annotation for set of novel proteins from venom gland transcriptome of Bothrops atrox snake. Sequence annotation Test sample ‘ToxClassifier’ Standard annotation Test 01 Potential toxin BPP precursor—Toxin Test 02 Potential toxin Crisp—Toxin Test 03 Toxin CTL—Toxin Test 04 Toxin CTL—Toxin Test 05 Potential toxin Hyaluronidase—Toxin Test 06 Potential toxin Kunitz –like sequence—Probably a toxin Test 07 Toxin LAAO—Toxin Test 08 Not a toxin Nucleotidase—Probably not a toxin Test 09 Not a toxin Phosphodiesterase— Probably not a toxin Test 10 Toxin PLA2—Toxin Test 11 Toxin SVMP class PI—Toxin Test 12 Toxin SVMP class PII—Toxin Test 13 Toxin SVMP class PIII—Toxin Test 14 Toxin SVSP—Toxin Test 15 Potential toxin VEGF—Toxin more appropriate measurements of performance than simple accuracy. Another issue of toxin classification lies in unbalanced datasets, because most venomous animal genomes encode less than 100 toxins and 20,000–30,000 physiological non-toxic proteins; as a result, even a high performing method can generate a high number of false positive predictions. For example, 99.5% correct prediction of non-toxins results in ∼100 false positive toxins for an average animal proteome, which is in fact more than the actual number of true toxins. In order to minimize both of these problems, the ToxClassifier training scheme was conservative, using only well-annotated toxins from UniProtKB/ToxProt database as positives, and while this might lead to a somewhat lower positive prediction rate (due to missing likely toxins which are not annotated as such), it does serve to minimise the false positive rate. Notably, predictions were less accurate on some genome datasets, especially Conus snail proteins, with low performance metrics observed for all tested annotation methods. This discrepancy was likely caused by the assumption that sequences deposited in the UniProtKB/SwissProt-ToxProt sequence are bona fide toxins, while sequences in the UniProtKB/SwissProt and TrEMBL databases without ‘toxin’ or ‘venom’ keywords are not toxins. Given that toxin activity is attributed to most sequences without biological validation, it is likely that the training datasets almost certainly excluded a number of toxin sequences and included some yet unknown toxins as non-toxic. Another limitation of the ‘ToxClassifier’ lies in the inherent bias of the training sets; an underrepresentation in sequences from certain animal lineages, particularly the basal Metazoa, e.g., Cnidaria, could lead to incorrect assignment and suspicious quality of existing annotation of conotoxins is a reason to treat prediction on this protein class with caution. To elevate these problems, ‘ToxClassifier’ has been designed to also report sequences suspected to Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 15/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90 have closest homology to underrepresented taxa as ‘suspicious toxin’ and recommends manual annotation with other tools, such as InterProScan (Zdobnov & Apweiler, 2001). Use of machine learning for toxin prediction has been attempted before and a range of such tools exists; however, most of the available tools are heavily specialised for toxins of specific animal origins. For example, SpiderP (Wong et al., 2013) (http: //www.arachnoserver.org/spiderP.html) is a predictor for spider toxins while ToxinPred (Gupta et al., 2013) (http://crdd.osdd.net/raghava/toxinpred/) predicts only small peptide toxins; while ClanTox (Kaplan, Morpurgo & Linial, 2007) (http://www.clantox.cs.huji.ac. il/tech.php) was trained only on an ion-channel toxin dataset and PredCSF (Fan et al., 2016) (http://www.csbio.sjtu.edu.cn/bioinf/PredCSF/) is conotoxin specific. In addition, the reported training set sizes are low (for example ClanTox was trained on ∼600 ion channel toxins; the ToxinPred toxin positive training set is 1,805 sequences, while as of 11th May 2016, the UniProtKB/SwissProt-ToxProt database contained∼6,500 sequences). None of the currently available machine learning methods also gives a comparison with other currently used accepted bioinformatics annotation methods. When compared to ToxClassifier and conventional annotation tools (Tables 2 and 3), ClanTox and ToxinPred tools were found to perform similar to BLAST based methods, while ToxClassifier demonstrated higher performance across all metrics, which is likely a result of comparatively larger training sets and combination of different internal classifiers. In addition to high performance, the user interface of the ‘ToxClassifier’ web service reports the best-scoring hit annotation either to UniProtKB/SwissProt-ToxProt (allowing placement of the toxin into the most appropriate toxin protein family), or to the best hit in UniProtKB/SwissProt (giving the closest homology to a non-toxin protein). In summary, this study has established baseline prediction accuracies for a selection of toxin annotation methods and integrates these methods into an easy-to-use, high-precision, machine learning-based classification system named ‘ToxClassifier.’ This tool offers a reliable and reproducible framework for toxin annotation to enable standardised toxin prediction in venomics projects and to allow for semi-automatic annotation or re-annotation of existing datasets. ACKNOWLEDGEMENTS The authors are grateful to Prof. Dr. Ana M. Moura-da-Silva who provided the unpublished anonymised snake venom transcriptome sequences. We also thank Prof. J Malcolm Shick and Prof. Dr. John Cullum for critically reviewing the manuscript. ADDITIONAL INFORMATION AND DECLARATIONS Funding This work was supported by the United Kingdom Medical Research Council (MRC grant G82144A). PFL is also supported as a Visiting International Research Professor by the Universidade de São Paulo (USP grant 13.1.1502.9.8). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 16/20 https://peerj.com http://www.arachnoserver.org/spiderP.html http://www.arachnoserver.org/spiderP.html http://crdd.osdd.net/raghava/toxinpred/ http://www.clantox.cs.huji.ac.il/tech.php http://www.clantox.cs.huji.ac.il/tech.php http://www.csbio.sjtu.edu.cn/bioinf/PredCSF/ http://dx.doi.org/10.7717/peerj-cs.90 Grant Disclosures The following grant information was disclosed by the authors: United Kingdom Medical Research Council: G82144A. Universidade de São Paulo: 13.1.1502.9.8. Competing Interests The authors declare there are no competing interests. Author Contributions • Ranko Gacesa performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, performed the computation work. • David J. Barlow analyzed the data, reviewed drafts of the paper. • Paul F. Long conceived and designed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables. Data Availability The following information was supplied regarding data availability: The raw data has been supplied as a Supplementary File. Supplemental Information Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj-cs.90#supplemental-information. REFERENCES Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389–3402 DOI 10.1093/nar/25.17.3389. Bairoch A, Apweiler R. 2000. The SWISS-PROT protein sequence database and its sup- plement TrEMBLin 2000. Nucleic Acids Research 28:45–48 DOI 10.1093/nar/28.1.45. Bateman A, Martin MJ, O’donovan C, Magrane M, Apweiler R, Alpi E, Antunes R, Arganiska J, Bely B, Bingley M, Bonilla C, Britto R, Bursteinas B, Chavali G, Cibrian-Uhalte E, Da Silva A, De Giorgi M, Dogan T, Fazzini F, Gane P, Castro LG, Garmiri P, Hatton-Ellis E, Hieta R, Huntley R, Legge D, Liu W, Luo J, Macdougall A, Mutowo P, Nightingale A, Orchard S, Pichler K, Poggioli D, Pundir S, Pureza L, Qi G, Rosanoff S, Saidi R, Sawford T, Shypitsyna A, Turner E, Volynkin V, Wardell T, Watkins X, Zellner H, Cowley A, Figueira L, Li W, McWilliam H, Lopez R, Xenarios I, Bougueleret L, Bridge A, Poux S, Redaschi N, Aimo L, Argoud-Puy G, Auchincloss A, Axelsen K, Bansal P, Baratin D, Blatter MC, Boeckmann B, Bolleman J, Boutet E, Breuza L, Casal-Casas C, De Castro E, Coudert E, Cuche B, Doche M, Dornevil D, Duvaud S, Estreicher A, Famiglietti L, Feuermann M, Gasteiger E, Gehant S, Gerritsen V, Gos A, Gruaz-Gumowski N, Hinz U, Hulo C, Jungo F, Keller G, Lara V, Lemercier P, Lieberherr D, Lombardot T, Martin X, Masson P, Morgat A, Neto T, Nouspikel N, Paesano S, Pedruzzi I, Pilbout S, Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 17/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.90/supplemental-information http://dx.doi.org/10.7717/peerj-cs.90#supplemental-information http://dx.doi.org/10.7717/peerj-cs.90#supplemental-information http://dx.doi.org/10.1093/nar/25.17.3389 http://dx.doi.org/10.1093/nar/28.1.45 http://dx.doi.org/10.7717/peerj-cs.90 Pozzato M, Pruess M, Rivoire C, Roechert B, Schneider M, Sigrist C, Sonesson K, Staehli S, Stutz A, Sundaram S, Tognolli M, Verbregue L, Veuthey AL, Wu CH, Arighi CN, Arminski L, Chen C, Chen Y, Garavelli JS, Huang H, Laiho K, McGarvey P, Natale DA, Suzek BE, Vinayaka CR, Wang Q, Wang Y, Yeh LS, Yerramalla MS, Zhang J. 2015. UniProt: a hub for protein information. Nucleic Acids Research 43:D204–D212 DOI 10.1093/nar/gku989. Bromberg Y, Yachdav G, Ofran Y, Schneider R, Rost B. 2009. New in protein structure and function annotation: hotspots, single nucleotide polymorphisms and the ‘‘Deep Web’’. Current Opinion in Drug Discovery & Development 12:408–419. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:1–9 DOI 10.1186/1471-2105-10-421. Chang D, Duda TF. 2012. Extensive and continuous duplication facilitates rapid evolution and diversification of gene families. Molecular Biology and Evolution 29:2019–2029 DOI 10.1093/molbev/mss068. Eddy SR. 2011. Accelerated profile HMM searches. PLoS Computational Biology 7:e1002195 DOI 10.1371/journal.pcbi.1002195. Fan YX, Song J, Kong X, Shen HB. 2016. PredCSF: an integrated feature-based approach for predicting conotoxin superfamily. Protein & Peptide Letters 18:261–267 DOI 10.2174/092986611794578341. Fry B. 2005. From genome to ‘‘venome’’: molecular origin and evolution of the snake venom proteome inferred from phylogenetic analysis of toxin sequences and related body proteins. Genome Research 15:403–420 DOI 10.1101/gr.3228405. Gacesa R, Chung R, Dunn SR, Weston AJ, Jaimes-Becerra A, Marques AC, Morandini AC, Hranueli D, Starcevic A, Ward M, Long PF. 2015. Gene duplications are extensive and contribute significantly to the toxic proteome of nematocysts isolated from Acropora digitifera (Cnidaria: Anthozoa: Scleractinia). BMC Genomics 16:774 DOI 10.1186/s12864-015-1976-4. Gupta S, Kapoor P, Chaudhary K, Gautam A, Kumar R, Raghava GPS. 2013. In silico approach for predicting toxicity of peptides and proteins. PLoS ONE 8:e73957 DOI 10.1371/journal.pone.0073957. Hargreaves AD, Swain MT, Hegarty MJ, Logan DW, Mulley JF. 2014. Restriction and recruitment—gene duplication and the origin and evolution of snake venom toxins. Genome Biology and Evolution 6:2088–2095 DOI 10.1093/gbe/evu166. Jungo F, Bougueleret L, Xenarios I, Poux S. 2012. The UniProtKB/Swiss-Prot Tox- Prot program: a central hub of integrated venom protein data. Toxicon 60:551–557 DOI 10.1016/j.toxicon.2012.03.010. Junqueira-de-Azevedo ILM, Bastos CMV, Ho PL, Luna MS, Yamanouye N, Casewell NR. 2015. Venom-related transcripts from Bothrops jararaca tissues provide novel molecular insights into the production and evolution of snake venom. Molecular Biology and Evolution 32:754–766 DOI 10.1093/molbev/msu337. Kaas Q, Craik DJ. 2015. Bioinformatics-aided venomics. Toxins 7:2159–2187 DOI 10.3390/toxins7062159. Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 18/20 https://peerj.com http://dx.doi.org/10.1093/nar/gku989 http://dx.doi.org/10.1186/1471-2105-10-421 http://dx.doi.org/10.1093/molbev/mss068 http://dx.doi.org/10.1371/journal.pcbi.1002195 http://dx.doi.org/10.2174/092986611794578341 http://dx.doi.org/10.1101/gr.3228405 http://dx.doi.org/10.1186/s12864-015-1976-4 http://dx.doi.org/10.1371/journal.pone.0073957 http://dx.doi.org/10.1093/gbe/evu166 http://dx.doi.org/10.1016/j.toxicon.2012.03.010 http://dx.doi.org/10.1093/molbev/msu337 http://dx.doi.org/10.3390/toxins7062159 http://dx.doi.org/10.7717/peerj-cs.90 Kaplan N, Morpurgo N, Linial M. 2007. Novel families of toxin-like peptides in insects and mammals: a computational approach. Journal of Molecular Biology 369:553–566 DOI 10.1016/j.jmb.2007.02.106. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. 1994. Hidden markov models in computational biology, applications to protein modeling. Journal of Molecular Biology 235:1501–1531 DOI 10.1006/jmbi.1994.1104. Li R, Yu H, Xing R, Liu S, Qing Y, Li K, Li B, Meng X, Cui J, Li P. 2012. Application of nanoLC-MS/MSto the shotgun proteomic analysis of the nematocyst proteins from jellyfish Stomolophus meleagris. Journal of Chromatography B: Analytical Technologies in the Biomedical and Life Sciences 899:86–95 DOI 10.1016/j.jchromb.2012.05.006. Li R, Yu H, Xue W, Yue Y, Liu S, Xing R, Li P. 2014. Jellyfish venomics and venom gland transcriptomics analysis of Stomolophus meleagris to reveal the toxins associated with sting. Journal of Proteomics 106:17–29 DOI 10.1016/j.jprot.2014.04.011. Liu G, Zhou Y, Liu D, Wang Q, Ruan Z, He Q, Zhang L. 2015. Global transcriptome analysis of the tentacle of the Jellyfish Cyanea capillata using deep sequencing and expressed sequence tags: insight into the toxin-and degenerative disease-related transcripts. PLoS ONE 10:1–22 DOI 10.1371/journal.pone.0142680. Matthews BW. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. BBA—Protein Structure 405:442–451 DOI 10.1016/0005-2795(75)90109-9. Neumann RS, Kumar S, Shalchian-Tabrizi K. 2014. BLAST output visualization in the new sequencing era. Briefings in Bioinformatics 15:484–503 DOI 10.1093/bib/bbt009. Powers D. 2011. Evaluation: from precision, recall and f-measure to roc., informedness, markedness & correlation. Journal of Machine Learning Technologies 2:37–63. Rachamim T, Morgenstern D, Aharonovich D, Brekhman V, Lotan T, Sher D. 2014. The dynamically evolving nematocyst content of an anthozoan, a scyphozoan, and a hydrozoan. Molecular Biology and Evolution 32:740–753 DOI 10.1093/molbev/msu335. Reyes-Velasco J, Card DC, Andrew AL, Shaney KJ, Adams RH, Schield DR, Casewell NR, Mackessy SP, Castoe TA. 2015. Expression of venom gene homologs in diverse python tissues suggests a new model for the evolution of snake venom. Molecular Biology and Evolution 32:173–183 DOI 10.1093/molbev/msu294. Schwartz EF, Diego-Garcia E, Rodríguez de la Vega RC, Possani LD, Whittington CM, Papenfuss AT, Locke DP, Mardis ER, Wilson RK, Abubucker S, Mitreva M, Wong ESW, Hsu AL, Kuchel PW, Belov K, Warren WC, Sher D, Zlotkin E, Knebel A, Bsor T, Nesher N, Tal T, Morgenstern D, Cohen E, Fishman Y, Zlotkin E, Schwartz EF, Diego-Garcia E, Rodríguez de la Vega RC, Possani LD, Liu G, Zhou Y, Liu D, Wang Q, Ruan Z, He Q, Zhang L. 2015. Transcriptome analysis of the venom gland of the Mexican scorpion Hadrurus gertschi (Arachnida: Scorpiones). Toxicon 11:865–879 DOI 10.1371/journal.pone.0142680. Sher D, Zlotkin E. 2009. A hydra with many heads: protein and polypeptide toxins from hydra and their biological roles. Toxicon 54:1148–1161 DOI 10.1016/j.toxicon.2009.02.036. Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 19/20 https://peerj.com http://dx.doi.org/10.1016/j.jmb.2007.02.106 http://dx.doi.org/10.1006/jmbi.1994.1104 http://dx.doi.org/10.1016/j.jchromb.2012.05.006 http://dx.doi.org/10.1016/j.jprot.2014.04.011 http://dx.doi.org/10.1371/journal.pone.0142680 http://dx.doi.org/10.1016/0005-2795(75)90109-9 http://dx.doi.org/10.1093/bib/bbt009 http://dx.doi.org/10.1093/molbev/msu335 http://dx.doi.org/10.1093/molbev/msu294 http://dx.doi.org/10.1371/journal.pone.0142680 http://dx.doi.org/10.1016/j.toxicon.2009.02.036 http://dx.doi.org/10.7717/peerj-cs.90 Starcevic A, Long PF. 2013. Diversification of animal venom peptides-were jellyfish amongst the first combinatorial chemists? ChemBioChem 14:1407–1409 DOI 10.1002/cbic.201300305. Starcevic A, Moura-Da-Silva AM, Cullum J, Hranueli D, Long PF. 2015. Combinations of long peptide sequence blocks can be used to describe toxin diversification in venomous animals. Toxicon 95:84–92 DOI 10.1016/j.toxicon.2015.01.005. Vonk FJ, Casewell NR, Henkel CV, Heimberg AM, Jansen HJ, McCleary RJR, Kerkkamp HME. 2013. The king cobra genome reveals dynamic gene evolution and adaptation in the snake venom system. Proceedings of the National Academy of Sciences of the United States of America 110:20651–20656 DOI 10.1073/pnas.1314702110. Weston AJ, Chung R, Dunlap WC, Morandini AC, Marques AC, Moura-da-Silva AM, Ward M, Padilla G, Da Silva LF, Andreakis N, Long PF. 2013. Proteomic characterisation of toxins isolated from nematocysts of the South Atlantic jellyfish Olindias sambaquiensis. Toxicon 71:11–17 DOI 10.1016/j.toxicon.2013.05.002. Weston AJ, Dunlap WC, Shick JM, Klueter A, Iglic K, Vukelic A, Starcevic A, Ward M, Wells ML, Trick CG, Long PF. 2012. A profile of an endosymbiont-enriched fraction of the coral Stylophora pistillata reveals proteins relevant to microbial- host Interactions. Molecular & Cellular Proteomics 11:M111.015487–M111.015487 DOI 10.1074/mcp.M111.015487. Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schrimi LM, Sequerira E, Tatusova TA, Wagner L. 2003. Database resources of the National Center for Biotechnology. Nucleic Acids Research 31:28–33 DOI 10.1093/nar/gkg033. Whittington CM, Papenfuss AT, Locke DP, Mardis ER, Wilson RK, Abubucker S, Mitreva M, Wong ESW, Hsu AL, Kuchel PW, Belov K, Warren WC. 2010. Novel venom gene discovery in the platypus. Genome Biology 11:R95 DOI 10.1186/gb-2010-11-9-r95. Wong ESW, Hardy MC, Wood D, Bailey T, King GF. 2013. SVM-based prediction of propeptide cleavage sites in spider toxins identifies toxin innovation in an Australian tarantula. PLoS ONE 8:e66279 DOI 10.1371/journal.pone.0066279. Zdobnov EM, Apweiler R. 2001. InterProScan-an integration platform for the signature- recognition methods in InterPro. Bioinformatics (Oxford, England) 17:847–848 DOI 10.1093/bioinformatics/17.9.847. Gacesa et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.90 20/20 https://peerj.com http://dx.doi.org/10.1002/cbic.201300305 http://dx.doi.org/10.1016/j.toxicon.2015.01.005 http://dx.doi.org/10.1073/pnas.1314702110 http://dx.doi.org/10.1016/j.toxicon.2013.05.002 http://dx.doi.org/10.1074/mcp.M111.015487 http://dx.doi.org/10.1093/nar/gkg033 http://dx.doi.org/10.1186/gb-2010-11-9-r95 http://dx.doi.org/10.1371/journal.pone.0066279 http://dx.doi.org/10.1093/bioinformatics/17.9.847 http://dx.doi.org/10.7717/peerj-cs.90