key: cord-0774328-uh5rkjwx authors: González-Díaz, Humberto; Pérez-Montoto, Lázaro G.; Ubeira, Florencio M. title: Model for Vaccine Design by Prediction of B-Epitopes of IEDB Given Perturbations in Peptide Sequence, In Vivo Process, Experimental Techniques, and Source or Host Organisms date: 2014-01-12 journal: J Immunol Res DOI: 10.1155/2014/768515 sha: a0f4a467be80972f434dabf027d2b5260b608740 doc_id: 774328 cord_uid: uh5rkjwx

Perturbation methods add variation terms to a known experimental solution of one problem in order to approach a solution for a related problem that has no known exact solution. One problem of this type in immunology is predicting whether a peptide will act as an epitope after a perturbation or variation in the structure of a known peptide and/or in other boundary conditions (host organism, biological process, and experimental assay). However, to the best of our knowledge, there are no reports of general-purpose perturbation models to solve this problem. In a recent work, we introduced a new quantitative structure-property relationship theory for the study of perturbations in complex biomolecular systems. In this work, we developed the first model able to classify more than 200,000 cases of perturbations with accuracy, sensitivity, and specificity >90% in both training and validation series. The perturbations include structural changes in >50,000 peptides determined in experimental assays with boundary conditions involving >500 source organisms, >50 host organisms, >10 biological processes, and >30 experimental techniques. The model may be useful for the prediction of new epitopes or the optimization of known peptides towards computational vaccine design.

The National Institute of Allergy and Infectious Diseases (NIAID) supported the launch, in 2004, of the Immune Epitope Database (IEDB), http://www.iedb.org/ [1] [2] [3] [4]. The IEDB system has extracted information from approximately 99% of all papers published to date that describe immune epitopes.
In doing so, the IEDB system analyzed over 22 million PubMed abstracts and subsequently curated ≈13 K references, including ≈7 K manuscripts about infectious diseases, ≈1 K about allergy topics, ≈4 K about autoimmunity, and 1 K about transplant/alloantigen topics [5]. IEDB lists a huge amount of information about the molecular structure as well as the experimental conditions ($c_j$) in which different $i$th molecules ($m_i$) were determined to be immune epitopes or not. This explosion of information makes necessary both query/display functions for retrieving known data from IEDB and predictive tools for new epitopes. Salimi et al. [5] reviewed advances in epitope analysis and predictive tools available in the IEDB. In fact, the IEDB analysis resource (IEDB-AR: http://tools.iedb.org/) is a collection of tools for the prediction of molecular targets of T- and B-cell immune responses (i.e., epitopes) [6, 7].

On the other hand, Quantitative Structure-Activity/Property Relationships (QSAR/QSPR) techniques are useful tools to predict new drugs, RNA, drug-protein complexes, and protein-protein complexes. In general, QSAR/QSPR-like methods transform molecular structures into numeric molecular descriptors in a first stage and later fit a model to predict the biological process. For example, DRAGON [8] [9] [10], CODESSA [11, 12], MOE [13], TOPS-MODE [14] [15] [16] [17], TOMO-COMD [18, 19], and MARCH-INSIDE [20] are among the most widely used software packages to calculate molecular descriptors based on quantum mechanics (QM) and/or graph theory [21] [22] [23] [24] [25] [26] [27]. The software packages STATISTICA [28] and WEKA [29] are often used to perform multivariate statistics and/or machine learning (ML) analysis in order to preprocess data and later fit the final QSAR/QSPR model using techniques like principal component analysis (PCA), linear discriminant analysis (LDA), support vector machines (SVM), or artificial neural networks (ANN) [28].
QSAR/QSPR models are also important in immunoinformatics to predict the propensity of different molecular structures to play different roles in immunological processes. Examples include skin vaccine adjuvants and sensitizers [30] [31] [32] [33] [34] [35] [36] [37] [38], drugs and their activity/toxicity protein targets in the immune system [39], and epitopes [40] [41] [42] [43] [44] [45] [46] [47] [48] [49]. Moreover, Reche and Reinherz [50] implemented PEPVAC (promiscuous epitope-based vaccine), a web server for the formulation of multiepitope vaccines that predicts peptides binding to five distinct HLA class I supertypes (A2, A3, B7, A24, and B15). PEPVAC can also identify conserved MHC ligands, as well as those with a C-terminus resulting from proteasomal cleavage. The Dana-Farber Cancer Institute hosted the PEPVAC server at http://immunax.dfci.harvard.edu/PEPVAC/. As a last example, Lafuente and Reche [51] reviewed the available methods for predicting MHC-peptide binding and discussed their most relevant advantages and drawbacks.

In many complex QSPR-like problems in immunoinformatics, as in other areas, we know the exact experimental result (known solution) of the problem, but we are interested in the possible result obtained after a change (perturbation) in one or multiple values of the initial conditions of the experiment (new solution). For instance, for large collections of $i$th molecules ($m_i$), such as organic compounds, drugs, xenobiotics, and/or peptide sequences, we often know the efficiency of the compound ($\varepsilon_{ij}$) as adjuvant, its action as epitope, its immunotoxicity, and/or its interaction (affinity, inhibition, etc.) with immunological targets.
In addition, we often know for each molecule the exact conditions ($c_j$) of assay for the initial experiment, including the structure of the molecule (drug, adjuvant, or sequence of the peptide), source organism (so), host organism (ho), immunological process (ip), experimental technique (tq), concentration, temperature, time, solvents, and coadjuvants. This is the case for big data retrieved from very large databases like IEDB [1] [2] [3] [4] and CHEMBL [52]. However, we do not know the possible result of the experiment if we change at least one of these conditions (perturbation). By perturbations we mean small changes in both structure and conditions, for input or output variables. These include changes in ho, so, ip, and tq; replacement of the compound by an analogue with similar structure; changes in the sequence of the epitope (artificial, by organic synthesis, or natural mutations); and changes in the polarity of the solvent or in the coadjuvants. In these cases, we could use a perturbation theory model to solve the QSAR/QSPR problem.

Perturbation theory includes methods that add "small" terms to a known solution of a problem in order to approach a solution to a related problem without a known solution. Perturbation models have been widely used in all branches of science, from QM to astronomy and life sciences, including chaos or the "butterfly effect," Bohr's atomic theory, Heisenberg's mechanics, Zeeman's and Stark's effects, and other models with applications in areas such as protein spectroscopy [53] [54] [55] [56] [57]. In a very recent work, Gonzalez-Diaz et al. [58] formulated a general-purpose perturbation theory or model for multiple-boundary QSPR/QSAR problems. However, there is no report in the immunoinformatics literature of a general QSPR perturbation model for IEDB B-epitopes.
Here we report the first example of a QSPR-perturbation model for B-epitopes reported in IEDB, able to predict the probability of occurrence of an epitope after a perturbation in the sequence, the experimental technique, the exposure process, and/or the source or host organisms.

We calculated the molecular descriptors of the structure of peptides using the software MARCH-INSIDE (MI), based on the algorithm of the same name [59]. The MI approach uses a Markov Chain method to calculate the $k$th mean values of different physicochemical molecular properties ${}^{k}\xi(m_i)$ for the $i$th molecules ($m_i$). These ${}^{k}\xi(m_i)$ values are calculated as an average of the ${}^{k}\xi_j$ values for all atoms placed at topological distance $d \le k$, which are in turn the means of the atomic properties ($\xi_j$) for all atoms in the molecule and their neighbors placed at $d = k$. For instance, it is possible to derive average estimations of molecular refractivities ${}^{k}\mathrm{MR}(m_i)$, partition coefficients ${}^{k}P(m_i)$, and hardness ${}^{k}\eta(m_i)$ for atoms placed at different topological distances $d \le k$. In this first work, we calculated only one type of ${}^{k}\xi(m_i)$ values. For all peptides, we calculated the average value $\chi_5(m_i)$ of all the atomic electronegativities $\chi_j$ for all atoms connected to the $j$th atom ($i \to j$) and their neighbors placed at a distance $d \le 5$ [59]:

$$\chi_5(m_i) = \frac{1}{6}\sum_{k=0}^{5} {}^{k}\chi(m_i) = \frac{1}{6}\sum_{k=0}^{5}\sum_{j} {}^{k}p(\chi_j)\,\chi_j \qquad (1)$$

We calculate the probabilities ${}^{k}p(\xi_j)$ for any atomic property, including ${}^{k}p(\chi_j)$, using a Markov Chain model for the gradual effects of the neighboring atoms at different distances in the molecular backbone. This method has been explained in detail in many previous works, so we omit the details here [59].

Very recently, Gonzalez-Diaz et al. [58] formulated a general-purpose perturbation theory or model for multiple-boundary QSPR/QSAR problems. Here we adapt this new theory or modeling method to approach the peptide prediction problem from the point of view of perturbation theory.
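The Markov-chain averaging of atomic electronegativities used for the peptide descriptors can be illustrated with a short sketch. This is not the MARCH-INSIDE implementation. It assumes a hydrogen-suppressed molecular graph, Pauling electronegativities, an initial distribution proportional to atomic electronegativity, and a transition probability from atom i to a bonded atom j (or to itself) proportional to the electronegativity of j. The function names `transition_matrix` and `mean_electronegativity` are hypothetical.

```python
# Pauling electronegativities (standard tabulated values).
PAULING = {"H": 2.20, "C": 2.55, "N": 3.04, "O": 3.44, "S": 2.58}


def transition_matrix(atoms, bonds):
    """Return atomic electronegativities and a row-stochastic matrix.

    Assumption: atom i can move to itself or to any bonded atom j, with a
    probability proportional to the electronegativity of j.
    """
    n = len(atoms)
    chi = [PAULING[a] for a in atoms]
    adj = [[i == j for j in range(n)] for i in range(n)]  # self-loops
    for i, j in bonds:
        adj[i][j] = adj[j][i] = True
    matrix = []
    for i in range(n):
        weights = [chi[j] if adj[i][j] else 0.0 for j in range(n)]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return chi, matrix


def mean_electronegativity(atoms, bonds, max_k=5):
    """Average the k-step mean electronegativities for k = 0..max_k."""
    chi, pi = transition_matrix(atoms, bonds)
    n = len(atoms)
    total_chi = sum(chi)
    p = [c / total_chi for c in chi]  # assumed k = 0 distribution
    per_k = []
    for _ in range(max_k + 1):
        # k-step mean: probability-weighted sum of atomic electronegativities
        per_k.append(sum(pj * cj for pj, cj in zip(p, chi)))
        # advance the distribution by one step along the bonds
        p = [sum(p[i] * pi[i][j] for i in range(n)) for j in range(n)]
    return sum(per_k) / len(per_k), per_k


# Heavy-atom skeleton of glycine: N-C-C(=O)-O
glycine_atoms = ["N", "C", "C", "O", "O"]
glycine_bonds = [(0, 1), (1, 2), (2, 3), (2, 4)]
chi5, per_step = mean_electronegativity(glycine_atoms, glycine_bonds)
print(chi5)
print(per_step)
```

Because every step is a probability-weighted mean, the result always lies between the smallest and largest atomic electronegativity in the molecule, which gives a quick sanity check for any real descriptor table.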
Let $m_i$ be the $i$th peptide molecule of a set, with a value of efficiency as epitope $\varepsilon_{ij}$ experimentally determined under a set of boundary conditions $c_j \equiv (c_0, c_1, c_2, c_3, \ldots, c_n)$. We put the main emphasis here on peptides reported in the IEDB database. In this sense, the boundary conditions used here are the same as those reported in this database: $c_0$ is the specific peptide, $c_1$ = so, $c_2$ = ho, $c_3$ = ip, and $c_4$ = tq. In general, so is the organism that expresses the peptide (but it can also include artificial peptides, cellular lines, etc.), and ho is the host organism exposed to the peptide by means of the ip, with the response detected with tq. Because our analysis is based on the data reported by IEDB, we are unable to work with continuous values of epitope activity $\varepsilon_{ij}$. Consequently, we have to predict the discrete function of B-epitope efficiency, with $\lambda(\varepsilon_{ij}) = 1$ for epitopes reported in the conditions $c_j$ and $\lambda(\varepsilon_{ij}) = 0$ otherwise. Our main aim is to predict the shift or change in a function of the output efficiency, $\Delta\lambda(\varepsilon_{ij}) = \lambda(\varepsilon_{ij})_{ref} - \lambda(\varepsilon_{ij})_{new}$, that takes place after a change, variation, or perturbation ($\Delta V$) in the structure and/or boundary conditions of a peptide of reference. However, we already know the efficiency of the process of reference, $\lambda(\varepsilon_{ij})_{ref}$, in addition to the molecular structure and the set of conditions for the initial (reference) and final (new) processes. Consequently, to predict $\Delta\lambda(\varepsilon_{ij})$ we have to predict only $\lambda(\varepsilon_{ij})_{new}$, the efficiency function of the new state obtained by a change in the structure of the peptide and/or the boundary conditions. Let $\Delta V$ be a perturbation in a function $V$, which we define as the state information function for the reference and new states. According to our recent model [58], $V$ can be written as a function of the conditions and structure of the peptide.
In fact, the variational state functions have to be written in pairs in order to describe the initial (reference) and final (new) states of a perturbation. The state function $V_{new}$ is for the $i$th peptide measured under a set of boundary conditions $c_j$ in the output, final, or new state. The conjugated state function $V_{ref}$ is for the $i$th peptide measured under a set of boundary conditions $c_j$ for the input, initial, or reference state. The difference $\Delta V$ between the new (output) state and the reference (input) state is the additive perturbation [58] (equation (3)). Equation (3) opens the door to testing different hypotheses. A simple hypothesis is $H_0$: the existence of one small and constant value of the perturbation function $\Delta V$ for all the pairs of peptides, and a linear relationship between the perturbations of input/output boundary conditions with four coefficients. We can use elementary algebraic operations to obtain from these equations an expression for the efficiency as epitope of the new peptide, $\lambda(\varepsilon_{ij})_{new}$. In this case, with a suitable approximation, we can obtain different expressions; the last may be very useful to solve the QSPR problem for the large datasets formed by IEDB B-epitopes. The * indicates that a quantity like $\chi_5^{*}(c_j)$ is the average value of the mean electronegativity $\chi_5(m_i)$ for all the peptides in IEDB that are epitopes for the same boundary condition.

We propose herein, for the first time, a QSPR-perturbation model able to predict variations in the propensity of a peptide to act as a B-epitope, taking into consideration the propensity of a peptide of reference and the changes in peptide sequence, immunological process, host organism, source organism, and the experimental technique used. The first input term, $\lambda(\varepsilon_{ij})_{ref}$, is the scoring function of the efficiency of the initial process (known solution).
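To make the structure of such a model concrete, the following is a minimal sketch of one linear form consistent with the description above. It is not the authors' fitted equation. Here $\lambda(\varepsilon_{ij})$ denotes the 0/1 epitope function, $\chi_5(m_i)$ the mean electronegativity of a peptide, and $\chi_5^{*}(c_j)$ the average of $\chi_5$ over IEDB epitopes that share boundary condition $c_j$. The score $S_{new}$, the coefficient names $a$, $b$, $g_j$, and $e_0$, the sign convention (new minus reference), and the exact definition of the double-difference terms are all assumptions introduced for illustration.

```latex
% Hypothetical linear perturbation form; a, b, g_j, e_0 are placeholder coefficients.
\begin{align}
S_{new} &= a\,\lambda(\varepsilon_{ij})_{ref} + b\,\Delta\chi_5
  + \sum_{j=1}^{4} g_j\,\Delta\Delta\chi_5(c_j) + e_0, \\
\Delta\chi_5 &= \chi_5(m_i)_{new} - \chi_5(m_i)_{ref}, \\
\Delta\Delta\chi_5(c_j) &= \left[\chi_5(m_i)_{new} - \chi_5^{*}(c_j)_{new}\right]
  - \left[\chi_5(m_i)_{ref} - \chi_5^{*}(c_j)_{ref}\right].
\end{align}
```

In this sketch $j = 1, \ldots, 4$ runs over so, ho, ip, and tq, and a high score $S_{new}$ would be read as $\lambda(\varepsilon_{ij})_{new} = 1$. When a boundary condition is unchanged between the reference and new states, the two averages cancel and $\Delta\Delta\chi_5(c_j)$ reduces to $\Delta\chi_5$, so the double-difference terms only carry extra information for the conditions that are actually perturbed.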
The function $\lambda(\varepsilon_{ij})_{ref} = 1$ if the $i$th peptide could be experimentally demonstrated to be a B-epitope in the assay of reference carried out under the conditions $c_j$, and $\lambda(\varepsilon_{ij})_{ref} = 0$ otherwise. The variational-perturbation terms $\Delta\Delta\chi$ are at the same time terms typical of perturbation theory and moving average (MA) functions used in Box-Jenkins models in time series [60]. These new types of terms account for the deviation of the electronegativity of all amino acids in the sequence of the new peptide both with respect to the peptide of reference and with respect to all boundary conditions. In Table 1, we give the overall classification results obtained with this model. Speck-Planche et al. [61] [62] [63] introduced different multitarget/multiplexing QSAR models that incorporate this type of information based on MAs. The results obtained with the present model (accuracy, sensitivity, and specificity >90% in both training and validation series) compare favorably with those of similar models reported in the literature for other problems, including moving average models [64, 65] and perturbation models [58]. Notably, this is also the first model combining both perturbation theory and MAs in a QSPR context.

The other input terms are built from the mean electronegativities $\chi_5(m_i)$ of the new and reference peptides and from the average values $\chi_5^{*}(c_j)$ for all new and reference peptides in IEDB that are epitopes under the corresponding boundary condition. The values of these terms have been tabulated for >500 source organisms, >50 host organisms, >10 biological processes, and >30 experimental techniques. To predict the perturbations of the action as epitope of peptides, we must substitute the values of $\chi_5(m_i)_{new}$ and $\chi_5(m_i)_{ref}$ of the new and reference peptides and the tabulated values of $\chi_5^{*}(c_j)_{new}$ and $\chi_5^{*}(c_j)_{ref}$ for all combinations of boundary conditions. In doing so, we can find the optimal sequence and boundary conditions towards the use of the peptide in the development of a vaccine. In Table 2 we give some of these values of $\chi_5^{*}(c_j)_{new}$ and $\chi_5^{*}(c_j)_{ref}$. In Table 3 we depict the sequences and input-output boundary conditions for the top perturbations present in IEDB.
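The substitution of peptide values and tabulated condition averages into the model can be sketched in a few lines. Everything numeric below is illustrative: the averages in `CHI_STAR`, the coefficients in `COEF`, the condition labels, and the rule that a positive score means "epitope" are assumptions, not values from Table 2 or from the fitted model. The names `Assay`, `deviations`, `perturbation_terms`, and `score` are hypothetical.

```python
from dataclasses import dataclass

CONDITION_KEYS = ("so", "ho", "ip", "tq")


@dataclass(frozen=True)
class Assay:
    """One peptide measured under one set of boundary conditions."""
    chi5: float  # mean electronegativity of the peptide
    so: str      # source organism
    ho: str      # host organism
    ip: str      # immunological process
    tq: str      # experimental technique


def deviations(assay, chi_star):
    """Deviation of the peptide from the tabulated average of each condition."""
    return {key: assay.chi5 - chi_star[(key, getattr(assay, key))]
            for key in CONDITION_KEYS}


def perturbation_terms(ref, new, chi_star):
    """Double differences: deviation in the new state minus the reference state."""
    dev_ref = deviations(ref, chi_star)
    dev_new = deviations(new, chi_star)
    return {key: dev_new[key] - dev_ref[key] for key in CONDITION_KEYS}


def score(lambda_ref, terms, coef):
    """Linear score; a positive value is read here as lambda_new = 1."""
    total = coef["e0"] + coef["a"] * lambda_ref
    for key in CONDITION_KEYS:
        total += coef[key] * terms[key]
    return total


# Illustrative numbers only (hypothetical averages and coefficients).
CHI_STAR = {
    ("so", "virus A"): 2.70, ("so", "virus B"): 2.74,
    ("ho", "mouse"): 2.71, ("ho", "rabbit"): 2.73,
    ("ip", "infection"): 2.72, ("ip", "immunization"): 2.72,
    ("tq", "ELISA"): 2.72, ("tq", "western blot"): 2.75,
}
COEF = {"e0": -0.5, "a": 1.0, "so": 2.0, "ho": 2.0, "ip": 2.0, "tq": 2.0}

reference = Assay(2.73, "virus A", "mouse", "infection", "ELISA")
candidate = Assay(2.76, "virus A", "rabbit", "infection", "ELISA")

terms = perturbation_terms(reference, candidate, CHI_STAR)
print(terms)
print(score(1, terms, COEF))
```

Scanning candidate sequences and condition combinations with a function like `score` is one way the search for an optimal sequence and set of boundary conditions could be organized. An unperturbed pair gives all-zero terms, so its score depends only on the known reference outcome.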
All the perturbations listed in Table 3 have an observed value of $\lambda(\varepsilon_{ij})_{new} = 1$ and a predicted value also equal to 1 with a high probability. The Supplementary Material, available online at http://dx.doi.org/10.1155/2014/768515, contains a full list of >200,000 cases of perturbations.

[1] The design and implementation of the immune epitope database and analysis resource
[2] The immune epitope database and analysis resource: from vision to blueprint
[3] Automating document classification for the Immune Epitope Database
[4] The immune epitope database and analysis resource: from vision to blueprint
[5] The immune epitope database: a historical retrospective of the first decade
[6] Applications for T-cell epitope queries and tools in the Immune Epitope Database and Analysis Resource
[7] Immune epitope database analysis resource
[8] Applications of 2D descriptors in drug design: a DRAGON tale
[9] Dragon method for finding novel tyrosinase inhibitors: Biosilico identification and experimental in vitro assays
[10] Virtual computational chemistry laboratory-design and description
[11] Six-membered cyclic ureas as HIV-1 protease inhibitors: a QSAR study based on CODESSA PRO approach
[12] CODESSA-based theoretical QSPR model for hydantoin HPLC-RT lipophilicities
[13] Medicinal chemistry and the molecular operating environment (MOE): application of QSAR and molecular docking to drug discovery
[14] Using the TOPS-MODE approach to fit multi-target QSAR models for tyrosine kinases inhibitors
[15] In silico studies toward the discovery of new anti-HIV nucleoside compounds through the use of TOPS-MODE and 2D/3D connectivity indices. 2. Purine derivatives
[16] Creating molecular diversity from antioxidants in Brazilian propolis. Combination of TOPS-MODE QSAR and virtual structure generation
[17] What are the limits of applicability for graph theoretic descriptors in QSPR/QSAR? Modeling dipole moments of aromatic compounds with TOPS-MODE descriptors
[18] Tomocomd-Cardd, a novel approach for computer-aided "rational" drug design: I. Theoretical and experimental assessment of a promising method for computational screening and in silico design of new anthelmintic compounds
[19] Protein quadratic indices of the "macromolecular pseudograph's α-carbon atom adjacency matrix". 1. Prediction of arc repressor alanine-mutant's stability
[20] Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach
[21] Some new trends in chemical graph theory
[22] Recent advances on the role of topological indices in drug discovery research
[23] Handbook of Molecular Descriptors
[24] Quantum-connectivity descriptors in modeling solubility of environmentally important organic compounds
[25] Molecular quantum similarity and the fundamentals of QSAR
[26] On the electronic structure of cocaine and its metabolites
[27] Chemical graph theory and n-center electron delocalization indices: a study on polycyclic aromatic hydrocarbons
[28] STATISTICS: Methods and Applications: A Comprehensive Reference for Science, Industry and Data Mining
[29] Data mining in bioinformatics using Weka
[30] Use of category approaches, read-across and (Q)SAR: general considerations
[31] Updating the skin sensitization in vitro data assessment paradigm in 2009-a chemistry and QSAR perspective
[32] Nonanimal alternatives for skin sensitization: letter to the editor
[33] From knowledge generation to knowledge archive. A general strategy using TOPS-MODE with DEREK to formulate new alerts for skin sensitization
[34] A chemical dataset for evaluation of alternative approaches to skin-sensitization testing
[35] Further evaluation of quantitative structure-activity relationship models for the prediction of the skin sensitization potency of selected fragrance allergens
[36] Computer-aided knowledge generation for understanding skin sensitization mechanisms: the TOPS-MODE approach
[37] Quantitative structure-activity relationships for predicting skin and eye irritation
[38] A QSAR model for the eye irritation of cationic surfactants
[39] ANN multiplexing model of drugs effect on macrophages; theoretical and flow cytometry study on the cytotoxicity of the anti-microbial drug G1 in spleen
[40] Computational vaccinology: quantitative approaches
[41] Quantitative structure-activity relationships and the prediction of MHC supermotifs
[42] Prediction of genome-wide conserved epitope profiles of HIV-1: classifier choice and peptide representation
[43] Prediction of cross-recognition of peptide-HLA A2 by Melan-A-specific cytotoxic T lymphocytes using three-dimensional quantitative structure-activity relationships
[44] A novel strategy of epitope design in Neisseria gonorrhoeae
[45] Stepwise identification of HLA-A*0201-restricted CD8+ T-cell epitope peptides from herpes simplex virus type 1 genome boosted by a steprank scheme
[46] An integrated approach to epitope analysis II: a system for proteomic-scale prediction of immunological characteristics
[47] Quantitative modeling of peptide binding to TAP using support vector machine
[48] Recognition of the ligand-type specificity of classical and non-classical MHC I proteins
[49] Recognition and classification of histones using support vector machine
[50] PEPVAC: a web server for multi-epitope vaccine development based on the prediction of supertypic MHC ligands
[51] Prediction of MHC-peptide binding: a systematic and comprehensive overview
[52] A large-scale bioactivity database for drug discovery
[53] Great Physicists: The Life and Times of Leading Physicists from Galileo to Hawking
[54] Epicyclic orbits in a viscous fluid about a precessing rod: theory and experiments at the micro- and macro-scales
[55] Intrinsic protein electric fields: basic non-covalent interactions and relationship to protein-induced Stark effects
[56] Perturbation theory without wave functions for the Zeeman effect in hydrogen
[57] Thermodynamic perturbation theory for associating fluids with small bond angles: effects of steric hindrance, ring formation, and double bonding
[58] New theory for multiple input-output perturbations in complex molecular systems. 1. Linear QSPR electronegativity models in physical, organic, and medicinal chemistry
[59] MIANN models in medicinal, physical and organic chemistry
[60] Time Series Analysis: Forecasting and Control
[61] Unified multi-target approach for the rational in silico design of anti-bladder cancer agents
[62] New insights toward the discovery of antibacterial agents: multitasking QSBER model for the simultaneous prediction of antituberculosis activity and toxicological profiles of drugs
[63] Multi-target inhibitors for proteins associated with Alzheimer: in silico discovery using fragment-based descriptors
[64] In silico discovery and virtual screening of multitarget inhibitors for proteins in Mycobacterium tuberculosis
[65] Chemoinformatics in anti-cancer chemotherapy: multi-target QSAR model for the in silico discovery of anti-breast cancer agents

The present study was partially supported by Grants AGL2010-22290-C02 and AGL2011-30563-C03 from Ministerio de Ciencia e Innovación, Spain, and Grant CN 2012/155 from Xunta de Galicia, Spain. The authors declare that there is no conflict of interests regarding the publication of this paper.