key: cord-0071769-j595u92d
authors: Herrera-Bravo, Jesús; Farías, Jorge G.; Contreras, Fernanda Parraguez; Herrera-Belén, Lisandra; Norambuena, Juan-Alejandro; Beltrán, Jorge F.
title: VirVACPRED: A Web Server for Prediction of Protective Viral Antigens
date: 2021-12-17
journal: Int J Pept Res Ther
DOI: 10.1007/s10989-021-10345-2
sha: 7f30cfcbf94d4bf155d14c087bcde82efa6efba6
doc_id: 71769
cord_uid: j595u92d
Viral antigens are key in the development of vaccines that prevent or eradicate infections caused by these pathogens. Bioinformatics tools are modern alternatives that facilitate the discovery of viral antigens, reducing the costs of experimental assays. We developed a bioinformatics tool called VirVACPRED, which is highly efficient in predicting viral antigens. In this study, we obtained a model based on the gradient boosting classifier, which showed high performance during training, leave-one-out cross-validation (accuracy = 0.7402, sensitivity = 0.7319, precision = 0.7503, F1 = 0.7251, kappa = 0.4774, Matthews correlation coefficient = 0.4981), and testing (accuracy = 0.8889, sensitivity = 1.0, precision = 0.8276, F1 = 0.9057, kappa = 0.7734, Matthews correlation coefficient = 0.7941). VirVACPRED is a robust tool that can be of great help in the search for and proposal of new viral antigens, which can be considered in the development of future vaccines against infections caused by viruses.
Antiviral drugs exist for many viruses, but some of them do not eliminate infections and simply alter the clinical course of the disease (Huang et al. 2004). Antiviral vaccines have been the most successful alternatives in the prevention of epidemics; for this reason, it is necessary to exploit new technologies that identify critical antigens capable of inducing a potent immune response (Graham 2013).
Vaccination has made it possible to combat various infectious diseases caused by viruses, such as influenza, smallpox, varicella, polio, hepatitis, rotavirus, and papillomavirus, among others (Graham 2013; Soria-Guerra et al. 2015). A vaccine is an agent that induces specific protective immunity, triggering an enhanced adaptive immune response to reinfection by pathogens through immune memory (Pollard and Bijker 2021). Conventional vaccines are composed of attenuated or killed pathogens, and they can take up to 15 years to develop. While it is true that these vaccines have saved many lives, they can also have adverse effects that could compromise the life of the patient (Bogdanos et al. 2001; Jarząb et al. 2013; Olson et al. 2001). The main components of vaccines are molecules called antigens, which are foreign to the immune system and can therefore induce an immune response (Lahariya 2016). Protective antigens are capable of inducing protection against a disease caused by an infectious agent after they are evaluated by means of an immunization scheme in an animal model. This approach to vaccine development includes several steps, such as culturing the pathogen, purifying its components (candidate antigens), and evaluating immunogenicity in an animal model ("An overview of biotechnology in vaccine development" 2020). Recombinant DNA and sequencing technology have led to a new concept within the field of vaccine development, where antigens capable of stimulating a specific immune response are identified (Brusic and Petrovsky 2005; Soria-Guerra et al. 2015; De 2014, 2010). In recent years, RNA vaccines have attracted increasing attention due to their ability to induce a safe and long-lasting immune response in in vivo models (Pardi et al. 2018; Zhang et al. 2019).
RNA vaccines differ from traditional ones in that they do not contain live attenuated agents or fragments of them, eliminating the risk of causing the disease that they are intended to prevent. For the development of RNA vaccines, it is necessary to find the DNA sequences that encode essential antigens of the infectious agent and then transcribe them to obtain the corresponding RNA, which will be used as a vaccine (Brisse et al. 2020; Tombácz et al. 2021; Verbeke et al. 2019). However, as in the traditional approach to vaccine development, the identification of candidate antigenic molecules is necessary. The field of bioinformatics has accelerated the discovery of new vaccine candidates through the large-scale prediction of different molecules that constitute potential protective antigens. Currently, there are many bioinformatic tools that predict antigenicity from a protein sequence, which is usually divided into small peptides called epitopes that have the ability to induce an immune response mediated by T lymphocytes (Soria-Guerra et al. 2015). However, tools that predict whether a whole protein is antigenic or not are very scarce, with Vaxijen v2.0 being a widely cited tool and the only one of its kind to date (Doytchinova and Flower 2007). The Vaxijen v2.0 approach is extremely interesting and useful, since it allows predicting antigenic proteins from various sources such as bacteria, viruses, and tumor cells. However, the viral antigen prediction model has not been updated for years. Consequently, taking into account the concept of Vaxijen v2.0, the main objective of the present work was to develop an updated immunoinformatic tool for the robust and reliable prediction of viral antigens.
The dataset used in this work was extracted from the publication of Vaxijen v2.0.
This dataset is composed of 100 sequences of viral antigens referenced in the literature and 100 sequences identified as non-antigens (Doytchinova and Flower 2007), for a total of 200 sequences. This dataset was divided into training and testing datasets in a ratio of 80% to 20%, respectively (Fig. 1). To calculate the characteristics of the antigens, we used our script called AIDApy (Herrera-Bravo et al. 2021). AIDApy allows the calculation of 544 physicochemical and biochemical properties derived from the AAindex database. The AAindex database contains numerical indices that describe different physicochemical and biological characteristics of amino acids and amino acid pairs (Kawashima et al. 2008). In this study, all the indices contained in this database were calculated for the antigens and non-antigens by selecting equation number (4), given here as Eq. (1).
Machine learning models that include many variables often show low performance, and reducing the dimensionality of the variables is a procedure that helps solve this problem (Mladenić 2006). For this reason, after calculating all AAindex characteristics (a total of 544), the ten best predictors were filtered and selected. For this purpose, the information gain function (Quinlan 1986) contained in the Orange3 3.28.0 library, written in Python 3, was used (Demšar et al. 2013).
The training, leave-one-out cross-validation (LOOCV), and testing were carried out with the open source PyCaret 2.3.1 (https://pypi.org/project/pycaret/) and scikit-learn 0.24.2 (https://pypi.org/project/scikit-learn/) libraries. PyCaret allows the evaluation of several machine learning algorithms in an efficient and fast way, abstracting the functionalities of the popular scikit-learn library on which it is based.
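The feature-filtering step described above can be illustrated with a small, dependency-free sketch of information gain ranking. It is not the Orange3 implementation used in the study: the median split used to discretize each numeric feature is an assumption of this sketch (Orange applies its own discretization), and the feature names `index_a` and `index_b` and their values are hypothetical toy data.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels):
    """Information gain of one numeric feature after a median split.

    The median split is an assumption of this sketch, not necessarily
    the discretization applied by Orange3.
    """
    cut = median(values)
    low = [y for x, y in zip(values, labels) if x <= cut]
    high = [y for x, y in zip(values, labels) if x > cut]
    gain = entropy(labels)
    for part in (low, high):
        if part:  # an empty side contributes nothing
            gain -= (len(part) / len(labels)) * entropy(part)
    return gain


def top_k_features(table, labels, k=10):
    """Rank features (name -> list of values) by information gain."""
    scores = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy data: 1 = antigen, 0 = non-antigen (hypothetical values).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
table = {
    "index_a": [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1],  # separates the classes
    "index_b": [0.5, 0.1, 0.5, 0.1, 0.5, 0.1, 0.5, 0.1],  # carries no class signal
}
ranking = top_k_features(table, labels, k=2)
print(ranking)
```

In this toy case the feature that separates the two classes receives the maximum gain of one bit and the uninformative one receives zero; with real data, `k=10` would mirror the ten predictors retained in the study.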
A total of 16 machine learning algorithms were evaluated, as shown below: random forest classifier (RF), extra trees classifier (ETC), quadratic discriminant analysis (QDA), light gradient boosting machine (LGBC), gradient boosting classifier (GBC), naive Bayes classifier (NBC), linear discriminant [...]
(1) a.a_n: total number of any amino acid of the 20 natural ones.
For this study, the default parameters of all classifiers in the PyCaret and scikit-learn libraries were used. The selection of the best classifier across the training, LOOCV, and testing phases was based on the following metrics: accuracy, sensitivity, precision, F1, kappa, and Matthews correlation coefficient.
Fig. 1 The architecture used for the generation of the predictive models of protective viral antigens
Accuracy, sensitivity, precision, and F1 range from zero to one, whereas kappa and the Matthews correlation coefficient can also take negative values (down to -1). For all of these measures, models with values close to one are considered more reliable.
The information gain analysis identified the best AAindex predictors, as shown below: AURR980113 (score: 0.207), FINA770101 (score: 0.191), QIAN880116 (score: 0.190), QIAN880102 (score: 0.183), KOEP990101 (score: 0.179), QIAN880133 (score: 0.174), [...] (Table 1). On the other hand, excellent performance was observed on the independent dataset (testing), where these measures increased considerably, which is indicative of robust prediction models (Table 2). It is important to highlight that the GBC algorithm presented the best performance measures during the testing phase, which supports its selection for the construction of a tool for the prediction of viral antigens (Table 2).
Taking into account the aforementioned aspects, we developed a web application called VirVACPRED, which includes the predictive model based on the gradient boosting classifier. This application was developed with the Python 3.9 programming language and the Flask framework, both open source. VirVACPRED has a friendly and robust interface for the reliable and fast prediction of viral antigens, which is available at https://virvacpred.herokuapp.com/.
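The six measures used to select the classifier (accuracy, sensitivity, precision, F1, kappa, and Matthews correlation coefficient) can all be derived from a binary confusion matrix. The sketch below uses their standard definitions. The counts passed in at the end (TP = 24, FP = 5, TN = 16, FN = 0) are not given in the paper; they are inferred here as one confusion matrix that is arithmetically consistent with the reported testing values, and they should be read as an illustration only.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts.

    No guard against zero denominators is included; this is a sketch.
    """
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # also called recall
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Inferred counts (an assumption of this sketch, not reported by the authors).
metrics = classification_metrics(tp=24, fp=5, tn=16, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these counts reproduce the testing values reported in the abstract (accuracy 0.8889, sensitivity 1.0, precision 0.8276, F1 0.9057, kappa 0.7734, Matthews correlation coefficient 0.7941).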
VirVACPRED returns probability scores in the range of 0 to 1, where probability scores ≥ 0.5 indicate that the input sequence is a viral antigen.
During the past decade, viruses have emerged or re-emerged and suddenly become major threats to humanity and the global economy, raising concern about their epidemic transmission (Afrough et al. 2019; Trovato et al. 2020). Zoonoses such as Lassa fever, dengue fever, Middle East respiratory syndrome (MERS), swine flu, Ebola and Marburg hemorrhagic fevers, yellow fever, severe acute respiratory syndrome (SARS), West Nile fever, Zika, Chikungunya vector-borne diseases, and recently the coronavirus disease 2019 (COVID-19), are examples of the damage that viruses can cause in the world population (Trovato et al. 2020). In this sense, the development of innovative technological platforms that allow the discovery of new drugs to prevent and combat viral infections is essential. Bioinformatics has emerged as a powerful tool for solving different problems within the biological sciences, including the field of immunology (Soria-Guerra et al. 2015). Currently, there are various bioinformatics tools focused on the prediction of small linear peptides presented in the context of MHC. However, tools aimed at predicting the antigenicity of a complete protein are scarce. The prediction of the antigenicity of a protein is extremely important, considering that 90% of the epitopes processed by B cells are conformational and only 10% are linear (Benjamin 1995; Huang and Honda 2006). Taking into account the aforementioned aspects, there is a clear need for tools that predict the global antigenicity of a protein. Vaxijen v2.0 is a widely cited tool that allows evaluating global antigenicity from an input amino acid sequence (Doytchinova and Flower 2007).
However, since the development of Vaxijen v2.0, there have been important advances in the field of machine learning, which could be used to improve the prediction of viral antigens using the same approach as Vaxijen v2.0. The results of the information gain analysis showed that the ten best predictive AAindexes are related to characteristics of secondary protein structures such as helix, beta-sheet, alpha-helix, coil, beta-turn, and helix-coil. In this sense, we suggest that future tools focused on predicting antigenicity take these structural properties into account. In fact, it has been reported that the secondary structure of viral antigens is key to the development of an immune response mediated by T lymphocytes (Gairin and Oldstone 1993). In this work, the GBC presented the best performance measures in the classification of viral antigens during the training and testing phases. This classifier has been successfully used in the development of predictive models in the area of bioinformatics, such as the prediction of submitochondrial localization (Yu et al. 2020), DNA-binding residues (Deng et al. 2018), gene-expression data analysis (Blagus and Lusa 2015), prediction of the interaction between target and ligand (Xuan et al. 2019), diagnostic classification of cancers, and prediction of RNA-protein interactions (Jain et al. 2018), among others. However, other classifiers such as ETC, QDA, LGBC, NBC, LDC, ABC, KNN, DTC, and RF also presented good performance measures in both phases. It is important to highlight that GBC even outperformed the RF classifier, which is very popular and widely used in the field of bioinformatics (Beltrán Lissabet et al. 2019a, b; Boulesteix et al. 2012; Herrera-Bravo et al. 2021). For this reason, as mentioned above, the GBC was selected to develop the VirVACPRED tool.
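A comparison of this kind, with default-parameter classifiers scored under LOOCV, can be sketched directly with scikit-learn. The example below is a minimal illustration and not the authors' PyCaret pipeline: the data are synthetic (generated with `make_classification`, standing in for the ten selected AAindex descriptors), the fixed `random_state` values are added only for repeatability, and only two of the evaluated classifiers are included.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for a table of 10 descriptors per sequence (assumption).
X, y = make_classification(
    n_samples=40, n_features=10, n_informative=5, random_state=0
)

# Library defaults, apart from random_state (fixed here for repeatability).
models = {
    "GBC": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # LOOCV: each sample is predicted by a model trained on all the others.
    predicted = cross_val_predict(model, X, y, cv=LeaveOneOut())
    results[name] = {
        "accuracy": accuracy_score(y, predicted),
        "mcc": matthews_corrcoef(y, predicted),
    }

for name, scores in results.items():
    print(name, scores)
```

Because the data are synthetic, the scores printed here say nothing about viral antigens; the sketch only shows how the same LOOCV protocol can be applied to several classifiers so that their measures are directly comparable.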
In this work, we compared VirVACPRED with the performance measures reported for Vaxijen v2.0. In this comparison, both tools showed similar performance during training. However, VirVACPRED performed better on the independent dataset (Table 3 and Fig. 2), demonstrating its high efficiency in the prediction of viral antigens. As mentioned above, the datasets used to train and test VirVACPRED consisted of antigenic and non-antigenic protein sequences in monomeric states (primary sequence), obtained from different virus species (Doytchinova and Flower 2007). Consequently, we recommend that users make predictions using viral primary sequences as input. VirVACPRED has a friendly interface and, unlike Vaxijen v2.0, can process multiple protein sequences in FASTA format. We believe that VirVACPRED can be very useful in the discovery of new protective viral antigens, which could be considered in the formulation of future vaccines to prevent future epidemics. The tool is freely available at https://virvacpred.herokuapp.com/. This tool has a simple user interface for amino acid sequence processing (Fig. 3).
The discovery of viral antigens plays a key role in the development of vaccines that allow the prevention of viral infections. Vaxijen v2.0 and VirVACPRED are the only tools of their kind that allow predicting the global antigenicity of a protein. VirVACPRED is an updated tool that predicts viral antigens with high efficiency, according to the performance measures obtained in the training and testing phases. The present server is limited to processing no more than 1000 protein sequences per prediction. We believe that VirVACPRED can be of great help in the discovery of new viral antigens, which will allow the development of future vaccines that prevent the risk of infections caused by viruses.
Fig. 3 User interface of the VirVACPRED tool for prediction of protective viral antigens. A Input and B result interfaces
Emerging viruses and current strategies for vaccine intervention
AntiVPP 1.0: a portable tool for prediction of antiviral peptides
TTAgP 1.0: a computational tool for the specific prediction of tumor T cell antigens
B-cell epitopes: fact and fiction. Advances in experimental medicine and biology
Boosting for high-dimensional two-class prediction
Molecular mimicry and autoimmune liver disease: virtuous intentions, malign consequences
Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics
Emerging concepts and technologies in vaccine development
Immunoinformatics and its relevance to understanding human immune disease
Orange: data mining toolbox in Python
PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine
VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines
Virus and cytotoxic T lymphocytes: crucial role of viral peptide secondary structure in major histocompatibility complex class I interactions
Advances in antiviral vaccine development
TAP 1.0: a robust immunoinformatic tool for the prediction of tumor T-cell antigens based on AAindex properties
CED: a conformational epitope database
A review of licensed viral vaccines, some of their safety concerns, and the advances in the development of investigational viral vaccines
A data driven model for predicting RNA-protein interactions based on gradient boosting machine
Subunit vaccines-antigens, carriers, conjugation methods and the role of adjuvants
AAindex: amino acid index database, progress report
Vaccine epidemiology: a review
An overview of biotechnology in vaccine development. New generation vaccines
Diagnostic classification of cancers using extreme gradient boosting algorithm and multi-omics data
Feature selection for dimensionality reduction. Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics)
A virus-induced molecular mimicry model of multiple sclerosis
mRNA vaccines-a new era in vaccinology
A guide to vaccinology: from basic principles to new developments
Induction of decision trees
An overview of bioinformatics tools for epitope prediction: implications on vaccine development
Immunoinformatics: an integrated scenario
Immunoinformatics: a brief review
Vaccination with messenger RNA: a promising alternative to DNA vaccination
Viral emerging diseases: challenges in developing vaccination strategies
Three decades of messenger RNA vaccine development
Gradient boosting decision tree-based method for predicting interactions between target genes and drugs. Front Genet
SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting
Advances in mRNA vaccines for infectious diseases