key: cord-0963434-4xc6pb0a
authors: Chaudhuri, Rupanjali; Ramachandran, Srinivasan
title: Prediction of Virulence Factors Using Bioinformatics Approaches
date: 2014-05-06
journal: Immunoinformatics
DOI: 10.1007/978-1-4939-1115-8_22
sha: f79321a7b375892794ff14cf43c7ef9d01101b2c
doc_id: 963434
cord_uid: 4xc6pb0a

Virulence factors produced by a pathogen are essential for causing disease in the host. They enable the pathogen to establish itself within the host, enhancing its potential to cause disease, and in some instances underlie evasion of host defense mechanisms. Identifying these molecules, especially those of immunological interest, and using them in vaccine development are among the initial steps of reverse vaccinology. Surface-localized virulence factors such as adhesins serve as excellent immunogenic candidates in this regard. In this chapter we describe bioinformatics approaches for adhesin prediction, including specific adhesin prediction algorithms.

Despite advances in technologies to combat infections, infectious diseases continue to challenge humans. This may be attributed to the rise of drug-resistant strains of pathogens such as Mycobacterium tuberculosis and to newly emerging infectious pathogens such as SARS coronavirus and influenza virus. A key step in the establishment of infectious disease is microbial virulence, which has been described as an emergent property of host-microbe interaction [1]. At the molecular level, entities such as proteins, carbohydrates, or lipids enable pathogens to establish themselves in a susceptible host. These molecules form an inherent part of the pathogen's cellular system and are collectively termed "virulence factors" [2, 3]. Virulence factors in various pathogens play diverse roles in the establishment of disease.
These include colonization of the host, evasion of host defense mechanisms, immunosuppression, acquisition of nutrients from the host cell, mediation of entry into and exit from host cells in intracellular pathogens, and sensing changes in the environment [4, 5]. These factors enable colonization of the host niche and eventually cause damage to host tissues [2, 4, 6]. It was therefore realized that targeting these microbial molecules, by establishing their immunogenicity and using them in vaccine formulations, could serve as an efficient anti-infective strategy. Vaccinologists are therefore preparing vaccine formulations with these molecules to prime the immune system so that it neutralizes their activity in the event of host-pathogen contact [5, 7].

A diverse array of molecules is involved in host-pathogen interaction, and the prominent players vary between pathogens. These include adhesins, toxins, enzymes, and capsules (polysaccharides or polypeptides). Adhesins have attracted interest from an immunological perspective because they are located on the cell surface and are likely to be accessible to the molecules of the immune system [8]. In the subsequent sections we provide an overview of these molecules and describe their prediction using bioinformatics.

Adhesins enable adherence of the pathogen to host cells, which constitutes the initial major step in the process of infection. This role qualifies adhesins as vaccine candidates, since targeting them could arrest infection at the initial stage [8]. Even though adhesins exhibit sequence polymorphisms, their conserved regions, especially those containing the receptor binding domain, may serve as potential vaccines [9]. Recently, a potent combination of adhesins of Plasmodium falciparum has been identified that could transcend strain variations [10]. Other examples include the FimH adhesin of uropathogenic Escherichia coli. Vaccination with this protein proved effective against urinary tract infection caused by E.
coli in both mice and nonhuman primates [11]. Filamentous hemagglutinin (FHA) and pertactin, adhesins of the gram-negative bacterium Bordetella pertussis, elicit a long-lasting cell-mediated respiratory immune response [12]. These adhesins are components of the licensed acellular pertussis vaccine [13]. Another adhesin, Neisseria meningitidis adhesin A (NadA), is part of a multicomponent meningococcal serogroup B vaccine named Bexsero, which is capable of eliciting a robust immune response. This vaccine has cleared all clinical trials and is awaiting license for use [14].

The advent of genomics technologies has revolutionized biological research. The complete genome sequence of a pathogen provides abundant opportunities to identify putative virulence factors through sequence analysis. These investigations are being aided by the development of new computational algorithms in this area. In the sections below, we outline the methods used in several investigations:

1. Sequence Similarity Search: Sequence similarity search is very popular and is among the first methods applied in sequence analysis. The goal is to obtain orthologous sequences corresponding to a given query. This approach has been used to identify orthologues of known adhesins characterized in other pathogens (see Note 1). The best-known algorithm is the Basic Local Alignment Search Tool (BLAST) [15]. Examples include the application of BLAST in screening for potential adhesins in Mycoplasma agalactiae, Escherichia coli, Mycoplasma conjunctivae, Mycoplasma pneumoniae, and rickettsial species [16-21]. In addition, BLAST can be used to identify orthologues of pathogen enzymes involved in virulence: hyaluronidase, neuraminidase, phospholipases, proteases, collagenase, kinase, coagulase, leukocidins, and hemolysins.

2. Sequence Motif Search: A sequence motif is a particular arrangement or pattern of amino acids within a protein sequence, or of nucleotides within a DNA sequence, that is characteristic of a specific biochemical function [22]. In particular, the majority of protein sequence motifs provide unique detectable sequence features for a set of protein sequences and thus act as signatures of protein families. Such motifs indicate similar functional roles. For example, in fungi, many glycosylphosphatidylinositol (GPI)-modified proteins linked to the plasma membrane via a preformed GPI anchor play a role in adhesion and virulence [23, 24]. These proteins have a C-terminal GPI motif described as follows: "… (10)>" in Prosite format, where ">" indicates the C-terminal end of the protein [26]. An algorithm that identifies sequences carrying this C-terminal, fungus-specific consensus sequence for GPI modification (GPI motif) helps screen a set of potential fungal adhesins [25]. Table 1 lists the motifs identified in several adhesins.

3. Signal Peptide: A signal peptide (SP) is a short stretch of sequence at the N-terminus of a protein that directs it to the secretory pathway [31]. Adhesins, being membrane-attached proteins, usually possess an N-terminal signal peptide for translocation across the membrane of the endoplasmic reticulum [32, 33]. Therefore, algorithms that use this information to screen for proteins having an N-terminal signal peptide may help identify potential adhesins (see Note 2). However, there are adhesins called "anchorless adhesins" that have neither a signal peptide nor a transmembrane domain. These "anchorless adhesins" cannot be identified through these approaches.

Table 1
Motif | Description | Reference
… | … | [26]
FxxN, GGA(I,L,V) | Tetrapeptide motifs FxxN and GGA(I, L, V) present in the polymorphic membrane protein (Pmp) family of Chlamydia pneumoniae. They are required as duplicate copies for adhesion to host cells | [27]
RGD, SGxG | Arginine-glycine-aspartic acid (RGD) and glycosaminoglycan binding site (SGxG) motifs present in the autotransporter family proteins of Bordetella pertussis: pertactin (Prn), Bordetella resistance to killing (BrkA), and Bordetella autotransporter protein-C (BapC). The arrangement of the motifs confers on BapC the adhesive property of binding to sites on macrophages and epithelial cells | [28]
PARF motif | An (A/T/E)XYLXXLN amino acid sequence motif referred to as PARF (peptide associated with rheumatic fever). It is located in the N-terminal hypervariable region of the collagen binding M protein type 3 of Streptococcus pyogenes and Streptococcus dysgalactiae ssp. equisimilis (SDSE) | [29]
HExxH-containing metalloprotease adhesins | A zinc binding sequence motif, His-Glu-Xaa-Xaa-His. It is present in certain adhesins such as the Treponema pallidum extracellular matrix binding adhesin Tp0751 | [30]

… YadA adhesin protein domain, fibrinogen-binding domain, and gingipain adhesin domains forming part of the cleaved adhesin domain in bacterial species [35-40]. Sequence analysis to check for the presence of such adhesin-related domains in the query protein sequence may help predict potential adhesins.

Although the computational methods described in the preceding section permit identification of potential adhesins, they are limited in scope. Unlike many protein families, adhesins lack a well-defined common sequence pattern or signature, which makes them difficult to identify with general signature sequence searches or unique motif searches. This is mainly because adhesins include diverse proteins. Even adhesins belonging to the same species include diverse molecular types and lack a common specific pattern in sequence.
For example, adhesins such as the M proteins in Streptococcus pyogenes, the Gal/GalNAc lectin in Entamoeba histolytica, fimbrial adhesins in Escherichia coli, the blood group antigen binding adhesin (BabA) in Helicobacter pylori, and the YadA collagen binding adhesin in Yersinia enterocolitica [41-45] lack significant similarity to each other. In certain cases, such as fungal species in which many adhesins possess the fungus-specific GPI motif, a sequence motif search algorithm can be used to screen for potential fungal adhesins. However, identification methods based solely on motif searches, such as GPI-anchor searches, could return several false positives because not all GPI-anchored proteins are adhesins. Similar concerns apply to other identification methods such as signal peptide search. The basic principles and limitations of the various bioinformatics approaches used to characterize adhesins are summarized in Fig. 1. These limitations formed the foundation for developing the non-homology group of algorithms, which use a large number of compositional properties.

SPAAN is an adhesin prediction tool developed using an artificial neural network trained on compositional properties of known adhesins and non-adhesins. The algorithm is trained to predict adhesins and adhesin-like proteins solely from sequence data. It is a non-homology method. SPAAN was trained using 105 compositional properties: 20 amino acid frequencies, 20 selected dipeptide frequencies, 20 multiplet frequencies, 20 charge compositions, and 25 hydrophobic compositions. It showed an optimal sensitivity of 89 % and specificity of 100 % on a defined test set and could identify 97.4 % of known adhesins at high P_ad value from a wide range of bacteria. Though SPAAN was trained on datasets dominated by bacterial adhesins, it can be used as a general-purpose tool to identify adhesins from a wide spectrum of species belonging to diverse phyla.
Many novel adhesins in diverse species have been characterized using SPAAN [46]. It is one of the most widely used adhesin prediction tools available. The standalone software package of SPAAN can be downloaded from http://sourceforge.net/projects/adhesin/files/.

System requirement: Red Hat Linux version 7.3 or above. Other requirements: C compiler.

Instructions for usage:
1. SPAAN is provided as a tar-gzipped file. After download, it should be unzipped and untarred with the command "tar xvzf SPAAN.tar.gz".
2. The query sequences should be in FASTA format. Multiple sequences can be present in the input file.
3. The input file should be named "query.dat".
4. The command to run SPAAN is "./askquery".
5. The output data is stored in "query.out".
6. If the existing binary files are not compatible with the system, the C source codes provided need to be recompiled using the following example command: "gcc -lm standard.c -o standard.o". The C source files to be compiled are standard.c, filter.c, annotate.c, and finalp1.c in the main SPAAN directory, and recognize.c, AAcompo.c, hdr.c, multiplets.c, querydipep.c, and charge.c in their respective directories (AAcompo, hdr, multiplets, dipep, and charge); recognize.c needs to be compiled individually in each of the five directories.

Figure 2 shows an example of the SPAAN output result file "query.out".

MAAP was developed using a Support Vector Machine (SVM) trained on compositional properties to classify malarial adhesins and adhesin-like proteins [47]. The SVMlight package [48] was used for this purpose. A total of 420 compositional properties, comprising 20 amino acid frequencies and 400 dipeptide frequencies, were used to characterize the sequences of known adhesins and non-adhesins of Plasmodium species.
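As an illustration of the 420 compositional properties mentioned above, the following minimal Python sketch computes 20 amino acid frequencies followed by 400 dipeptide frequencies for one protein sequence. The function name, the feature ordering, the normalization, and the handling of non-standard residues are assumptions made for this example; the text does not specify how MAAP implements these details.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered pairs of them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def composition_features(sequence):
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies.

    Assumption: residues outside the 20 standard letters (and dipeptides
    containing them) are skipped, and frequencies are normalized by the
    number of counted residues or dipeptides.
    """
    seq = sequence.upper()
    aa_counts = dict.fromkeys(AMINO_ACIDS, 0)
    dp_counts = dict.fromkeys(DIPEPTIDES, 0)

    for residue in seq:
        if residue in aa_counts:
            aa_counts[residue] += 1

    # Overlapping windows of length 2 along the sequence.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in dp_counts:
            dp_counts[pair] += 1

    n_aa = sum(aa_counts.values()) or 1  # avoid division by zero
    n_dp = sum(dp_counts.values()) or 1

    aa_freqs = [aa_counts[a] / n_aa for a in AMINO_ACIDS]
    dp_freqs = [dp_counts[d] / n_dp for d in DIPEPTIDES]
    return aa_freqs + dp_freqs


if __name__ == "__main__":
    features = composition_features("MKKA")  # toy sequence, not a real protein
    print(len(features))
```

A vector of this kind, computed for each known adhesin and non-adhesin, is the sort of input an SVM package could be trained on.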
MAAP runs on complete proteomes of Plasmodium species revealed that in Plasmodium falciparum, at P_maap scores above 0.0, a sensitivity of 100 % was observed with two false positives. In P. vivax and P. yoelii, a threshold P_maap score of 0.7 was found optimal, with very few false positives (up to 5). The MAAP Web server provides users with an interface where they can paste or upload their query sequences and predict whether a protein sequence is an adhesin (see Note 5). Users can set their own threshold cutoff value. The result can be exported as a tab-delimited text file. The standalone version can be downloaded from the "Download" tab of the MAAP Web server or from http://sourceforge.net/projects/adhesin/files/. Figure 3 shows the output result obtained using the MAAP Web server.

In pathogenic fungi, adhesins play major roles as virulence factors mediating the interaction of the pathogen with a variety of host cell types. In addition, adhesins in fungi aid in biofilm formation, contributing to increased drug resistance and persistence of infections [49]. It has been established that differences in adhesion are responsible for the greater virulence of one fungal strain compared with another [50]. The fungal pathogens represent a diverse group of species. The FungalRV adhesin predictor was developed using a Support Vector Machine (SVM) trained on compositional properties to classify human pathogenic fungal adhesins and adhesin-like proteins [51]. This tool was developed using the SVMlight package, trained on 3,945 compositional properties: 20 amino acid frequencies, 247 selected dipeptide frequencies, 3,653 selected tripeptide frequencies, 20 amino acid multiplet frequencies, the frequency of hydrophobic amino acids, and four moments of the hydrophobic amino acid distribution of order 2-5. This is a non-homology based prediction tool.
We obtained an overall Matthews correlation coefficient (MCC) of 0.8702 across all eight pathogens, namely Candida albicans, Candida glabrata, Aspergillus fumigatus, Coccidioides immitis, Coccidioides posadasii, Histoplasma capsulatum, Blastomyces dermatitidis, and Paracoccidioides brasiliensis, showing high sensitivity and specificity at a threshold of 0.511. In the case of P. brasiliensis, the algorithm achieved a sensitivity of 66.67 %. This tool was made into the FungalRV Web server, available at http://fungalrv.igib.res.in. The "Adhesin Predictor" tab of the FungalRV Web server provides users with an interface where they can paste or upload their query sequences and predict whether a protein sequence is a fungal adhesin (see Note 6). Users can set their own threshold cutoff value, which allows them to optimize the threshold for other fungi on which the "FungalRV adhesin predictor" was not trained. The result can be exported as a tab-delimited text file.

Fig. 3 Screenshot of output result obtained using MAAP Web server. The protein sequences scoring above threshold are highlighted in green color, whereas those scoring below the threshold are highlighted in red color. The result can be saved in a tab-delimited plain text file format by clicking on the purple colored link (encircled)

A facility to search for the fungus-specific GPI pattern in the predicted adhesins and adhesin-like proteins using the fuzzpro program of EMBOSS has been provided. Users can also conduct a BLAST search against human reference proteins (see Note 7). The standalone version can be downloaded from the "Download" tab of the FungalRV Web server or from http://sourceforge.net/projects/adhesin/files/. Figure 4 shows the adhesin prediction results obtained using the FungalRV Web server.
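To make the reported performance measures concrete, the sketch below computes sensitivity, specificity, and MCC for a set of classifier scores at a chosen threshold. The function name and the toy scores are invented for illustration and are not FungalRV outputs; only the threshold value 0.511 is taken from the text, and treating "score above threshold" as a positive prediction is an assumption.

```python
import math


def classification_metrics(scores, labels, threshold):
    """Compute sensitivity, specificity and MCC at a score threshold.

    scores: classifier scores, one per protein.
    labels: True for known adhesins, False for known non-adhesins.
    A protein is predicted as an adhesin when its score is above the threshold.
    """
    tp = fp = tn = fn = 0
    for score, is_adhesin in zip(scores, labels):
        predicted = score > threshold
        if predicted and is_adhesin:
            tp += 1
        elif predicted and not is_adhesin:
            fp += 1
        elif not predicted and is_adhesin:
            fn += 1
        else:
            tn += 1

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


if __name__ == "__main__":
    # Toy data: three known adhesins followed by three known non-adhesins.
    toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
    toy_labels = [True, True, True, False, False, False]
    print(classification_metrics(toy_scores, toy_labels, threshold=0.511))
```

Re-running such a calculation over a range of thresholds on a labeled set is one way a user could choose a cutoff for a fungus on which the predictor was not trained.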
In addition to FungalRV, another Support Vector Machine (SVM) based algorithm, FaaPred, is available for the prediction of fungal adhesins and adhesin-like proteins [52]. The SVM models for FaaPred were trained with compositional features (amino acid, dipeptide, and multiplet fractions, and charge and hydrophobic compositions) as well as PSI-BLAST-derived position-specific scoring matrices (PSSMs). The best classifiers were selected on the basis of high MCC and accuracy. The amino acid composition model (ACHM), PSSM-a, and PSSM-b emerged as the best classifiers, with ACHM providing the highest MCC value of 0.610. Thus the prediction of FaaPred uses classifiers based on compositional properties as well as PSSM. FaaPred provides an overall accuracy of 86 %. The prediction method is freely available as a Web server at http://bioinfo.icgeb.res.in/faap.

Fig. 4 Screenshot of output result obtained using FungalRV Web server. The protein sequences scoring above threshold are highlighted in green color, whereas those scoring below the threshold are highlighted in red color. The result can be saved in a tab-delimited plain text file format by clicking on the purple colored link (encircled). Additional data on BLAST with Href proteins and GPI patterns are also displayed

1. The BLAST algorithm is widely used to fetch orthologues. The Reciprocal Best Hits (RBH) method has shown good efficiency in identifying orthologues. RBH is based on the principle that two genes from different genomes are orthologous if each finds the other as the best hit in a BLAST search against the other genome. Here BLASTP is usually carried out at a maximum E-value threshold of 1 × 10^-6, with the Smith-Waterman algorithm and soft filtering.
2. Various bioinformatics algorithms are available to help identify signal peptides. The SignalP algorithm, available at http://www.cbs.dtu.dk/services/SignalP/, is widely used. Query sequences in FASTA format can be submitted to predict the presence of signal peptides.
3. Transmembrane prediction algorithms, for example TMHMM available at http://www.cbs.dtu.dk/services/TMHMM/, are generally used to predict the presence of transmembrane regions.
4. Conserved domains can be predicted using domain prediction algorithms, for example CDD search available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The presence of known adhesin-related domains in the query sequences can be predicted.
5. The query proteins in FASTA format can be uploaded to the MAAP Web server. The server can be used to analyze a whole genome in one run.
6. Query protein sequences in FASTA format can be uploaded to the FungalRV Web server. This Web server can be used to analyze a whole genome.
7. An adhesin vaccine should ideally have no similarity to human reference proteins, to avoid cross-reactivity. A facility to conduct a BLAST search against human reference proteins has therefore been provided in the FungalRV Web server. The cutoff E-value used here is 0.01, which borders on the limits of threshold similarity.
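The Reciprocal Best Hits principle described in Note 1 can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in any of the cited studies: it assumes BLASTP has already been run in both directions and that the hits have been parsed into (query, subject, E-value, bit score) tuples. The function names, the use of bit score to pick the best hit, and the toy identifiers are all assumptions made for this example.

```python
def best_hits(hits, max_evalue=1e-6):
    """Map each query to its best subject among hits passing the E-value cutoff.

    hits: iterable of (query_id, subject_id, evalue, bitscore) tuples.
    Assumption: the best hit is the one with the highest bit score.
    """
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # discard hits weaker than the threshold
        current = best.get(query)
        if current is None or bitscore > current[1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-6):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b, max_evalue)
    best_ba = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Toy hits between proteome A (A1..A3) and proteome B (B1, B2).
    a_vs_b = [
        ("A1", "B1", 1e-50, 200.0),
        ("A1", "B2", 1e-10, 60.0),
        ("A2", "B2", 1e-30, 120.0),
        ("A3", "B1", 1e-3, 30.0),  # fails the E-value cutoff
    ]
    b_vs_a = [
        ("B1", "A1", 1e-48, 198.0),
        ("B2", "A1", 1e-9, 58.0),
        ("B2", "A2", 1e-29, 119.0),
    ]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy data A1/B1 and A2/B2 are reported as putative orthologue pairs, while A3 is dropped because its only hit does not pass the E-value threshold.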
1. Microbial virulence as an emergent property: consequences and opportunities
2. Virulence and pathogenesis
3. The concept of virulence: interpretations and implications
4. What is a virulence factor
5. Virulence factors and their mechanisms of action: the view from a damage-response framework
6. Virulence mechanisms of bacterial pathogens
7. Host-pathogen interactions: redefining the basic concepts of virulence and pathogenicity
8. Adhesins as targets for vaccine development
9. Antiadhesion therapy of bacterial diseases: prospects and problems
10. Identification of a potent combination of key Plasmodium falciparum merozoite antigens that elicit strain-transcending parasite-neutralizing antibodies
11. Vaccination with FimH adhesin protects cynomolgus monkeys from colonization and infection by uropathogenic Escherichia coli
12. Cell-mediated immune responses in four-year-old children after primary immunization with acellular pertussis vaccines
13. Nature, evolution, and appraisal of adverse events and antibody response associated with the fifth consecutive dose of a five-component acellular pertussis-based combination vaccine
14. Bexsero: a multicomponent vaccine for prevention of meningococcal disease
15. Basic local alignment search tool
16. CS22, a novel human enterotoxigenic Escherichia coli adhesin, is related to CS15
17. Characterization of P40, a cytadhesin of Mycoplasma agalactiae
18. Characterization of LppS, an adhesin of Mycoplasma conjunctivae
19. Isolation and characterization of P1 adhesin, a leg protein of the gliding bacterium Mycoplasma pneumoniae
20. Identification of two putative rickettsial adhesins by proteomic analysis
21. Cloning and molecular characterization of an immunogenic LigA protein of Leptospira interrogans
22. What are DNA sequence motifs?
23. Genome-wide identification of fungal GPI proteins
24. Comprehensive analysis of glycosylphosphatidylinositol-anchored proteins in Candida albicans
25. Systematic identification in silico of covalently bound cell wall proteins and analysis of protein-polysaccharide linkages of the human pathogen Candida glabrata
26. BETAWRAP: successful prediction of parallel beta-helices from primary sequence reveals an association with many microbial pathogens
27. Members of the Pmp protein family of Chlamydia pneumoniae mediate adhesion to human cells via short repetitive peptide motifs
28. BapC autotransporter protein of Bordetella pertussis is an adhesion factor
29. Region specific and worldwide distribution of collagen-binding M proteins with PARF motifs among human pathogenic streptococcal isolates
30. Bifunctional role of the Treponema pallidum extracellular matrix binding adhesin Tp0751
31. Transfer of proteins across membranes. I. Presence of proteolytically processed and unprocessed nascent immunoglobulin light chains on membrane-bound ribosomes of murine myeloma
32. Section 17.4, translocation of secretory proteins across the ER membrane
33. Protein secretion and the pathogenesis of bacterial infections
34. The three-dimensional structure of an enzyme molecule
35. The PA14 domain, a conserved all-beta domain in bacterial toxins, enzymes, adhesins and signaling molecules
36. Molecular phylogenetics of ascomycotal adhesins-a novel family of putative cell-surface adhesive proteins in fission yeasts
37. Als3 is a Candida albicans invasin that binds to cadherins and induces endocytosis by host cells
38. The Yersinia adhesin YadA collagen-binding domain structure is a novel left-handed parallel beta-roll
39. Structure determination and analysis of a haemolytic gingipain adhesin domain from Porphyromonas gingivalis
40. Identification of a fibronectin-binding domain within the Campylobacter jejuni CadF protein
41. M protein-associated adherence of Streptococcus pyogenes to epithelial surfaces: prerequisite for virulence
42. Molecular characterization of the Escherichia coli FimH adhesin
43. Structure and function of the Entamoeba histolytica Gal/GalNAc lectin
44. Helicobacter pylori virulence factors in gastric carcinogenesis
45. The Yersinia adhesin YadA collagen-binding domain structure is a novel left-handed parallel beta-roll
46. SPAAN: a software program for prediction of adhesins and adhesin-like proteins using neural networks
47. MAAP: malarial adhesins and adhesin-like proteins predictor
48. Making large-scale SVM learning practical
49. Biofilm formation by Candida species on the surface of catheter materials in vitro
50. Flocculation, adhesion and biofilm formation in yeasts
51. FungalRV: adhesin prediction and immunoinformatics portal for human fungal pathogens
52. FaaPred: a SVM-based prediction method for fungal adhesins and adhesin-like proteins

RC thanks The Indian Council of Medical Research for fellowship. This work was funded through grants "GENESIS" BSC0121 to SR from CSIR.