key: cord-0007935-p19e9uom
authors: Myler, Peter J.
title: Searching the Tritryp Genomes for Drug Targets
date: 2008
journal: Drug Targets in Kinetoplastid Parasites
DOI: 10.1007/978-0-387-77570-8_11
sha: 32272283047b962dd7b5e1f0dcd838e2511e9145
doc_id: 7935
cord_uid: p19e9uom

The recent publication of the complete genome sequences of Leishmania major, Trypanosoma brucei and Trypanosoma cruzi revealed that each genome contains 8300-12,000 protein-coding genes, of which ~6500 are common to all three genomes, and ushers in a new, post-genomic, era for trypanosomatid drug discovery. This vast amount of new information makes possible more comprehensive and accurate target identification using several new computational approaches, including identification of metabolic “choke-points”, searching the parasite proteomes for orthologues of known drug targets, and identification of parasite proteins likely to interact with known drugs and drug-like small molecules. In this chapter, we describe several databases (such as GENEDB, BRENDA, KEGG, METACYC, the THERAPEUTIC TARGET DATABASE, and CHEMBANK) and algorithms (including PATHOLOGIC, PATHWAY HUNTER TOOL, and AUTODOCK) which have been developed to facilitate the bioinformatic analyses underlying these approaches. While target identification is only the first step in the drug development pipeline, these new approaches give rise to renewed optimism for the discovery of new drugs to combat the devastating diseases caused by these parasites.

Traditionally, drug discovery in the trypanosomatids (and other organisms) has proceeded from two different starting points: screening large numbers of existing compounds for activity against whole parasites, or more focused screening of compounds for activity against defined molecular targets. Most existing anti-trypanosomatid drugs were developed using the former approach, although the latter has gained much attention in the last twenty years under the rubric of “rational drug design”.
Until recently, one of the major bottlenecks in anti-trypanosomatid drug development has been our ability to identify good targets, since only a very small percentage of the total number of trypanosomatid genes was known. That has now changed forever, with the recent (July, 2005) publication of the “Tritryp” (Trypanosoma brucei, Trypanosoma cruzi and Leishmania major) genome sequences (1-4). This vast amount of information now makes possible several new approaches for target identification and ushers in a post-genomic era for trypanosomatid drug discovery.

of small sequence insertions relative to the trypanosomes, but the lower gene density in Leishmania is mostly explained by its larger inter-CDS regions. Each species contains a number of gene families of varying size. Predicted functions have been ascribed to ~40% of the protein-coding genes, but this has been confirmed experimentally for only ~5% of the proteins. Most of the remaining genes encode conserved hypothetical proteins, of which slightly more than half are found only in trypanosomatids. Interestingly, ~2-3% of the Tritryp proteins are related to those found in prokaryotes but not other eukaryotes. At least some of these appear to have arisen from horizontal gene transfer, and may represent excellent candidates for drug targets.

The Tritryp genomes display a remarkable degree of synteny, with ~75% of the genes in L. major having orthologues in both other species and >90% of these occurring in the same genomic context (see Table 1). The proteins within this Tritryp "core" proteome exhibit an average 57% identity between T. brucei and T. cruzi, and 44% identity between L. major and the two other trypanosomes, reflecting the expected phylogenetic relationships (5,6). Interestingly, substantially fewer orthologues are shared only between L. major and T. brucei than between L. major and T. cruzi, perhaps reflecting the common intracellular environment of their mammalian stages.
However, all three genomes contain a significant number of species-specific genes, which account for ~21% and 38% of the protein-coding genes in T. brucei and T. cruzi, respectively, but only ~13% of the L. major genes. These species-specific genes (and pseudogenes) mostly encode large families of surface proteins, exemplified by the variant surface glycoproteins (VSGs) and Procyclic Acidic Repetitive Proteins (EP/PARP/procyclin) of T. brucei; the trans-sialidases, dispersed gene family protein 1 (DGF-1), mucins, and mucin-associated surface proteins (MASPs) of T. cruzi; and the amastins and promastigote surface antigens (PSA-2) of L. major. In addition to these species-specific genes, all three species demonstrate differential paralogous gene expansion or contraction, with the ESAG4 adenylate/guanylate cyclases and leucine-rich repeat proteins being over-represented in T. brucei; GP63 surface proteases and recombination hot spot (RHS) proteins in T. cruzi; and mitochondrial carrier protein, ATP-Binding Cassette (ABC) transporter, and Heat Shock Protein (HSP) 90 gene families in L. major. Many of these species-specific genes or paralogous expansions occur in telomeric and sub-telomeric gene clusters, possibly reflecting similar strategies used for immune evasion.

Transcription and RNA processing in the trypanosomatids are quite different from those in other eukaryotes (7), with unique or unusual processes such as large polycistronic gene clusters (8-10), RNA polymerase I-mediated transcription of some protein-coding genes (11,12), and trans-splicing (13). While annotation of the Tritryp genomes uncovered most of the expected RNA polymerase subunits, there was a dearth of transcription factors normally involved in regulation of transcription initiation by other eukaryotes (3). However, recent experiments have identified several highly divergent transcription factors in T. brucei (14-17), suggesting that Tritryp transcription initiation may represent an ancestral, less sequence-specific, mechanism mostly replaced in other eukaryotes by the archetypal TATA-containing promoters. Conversely, the paucity of Tritryp genes encoding transcriptional regulators is offset by an abundance of proteins with RNA binding motifs (18), consistent with their reliance on post-transcriptional models of gene regulation (19).

DNA replication in trypanosomatids also appears to differ significantly from that in higher eukaryotes, with only one of the six subunits typically found in the eukaryotic replication origin complex being identified (2). There are also substantial differences in the mitochondrial replication machinery, since the complexity of the kinetoplast DNA (the trypanosomatid equivalent of a mitochondrial genome) structure dictates an unusual replication mechanism (9,20).

Bioinformatic analyses of the Tritryp genomes suggest that they lack several classes of signaling molecules found in other eukaryotes, including serpentine receptors, heterotrimeric G proteins, most classes of catalytic receptors, SH2 and SH3 interaction domains, and regulatory transcription factors, but that they do possess a large and complex set of protein kinases and protein phosphatases (2,21). However, the distribution of protein kinase classes differs from that in other organisms, with no tyrosine kinases (other than dual specificity kinases), receptor kinases, or TKL and RGC group kinases. Since the trypanosomatids have complicated life cycles in different hosts, it is likely that these kinases play important roles in regulating their response to changes in these different environments.

The experience gained by the pharmaceutical industry during the last few decades of drug development has led to the postulation of a number of selection criteria for successful drug target identification (22).
In the context of the trypanosomatids, these criteria include selectivity (i.e., the parasite target is absent from, or substantially different in, the host); "druggability" (the target structure has a small molecule-binding pocket); suitable biochemical properties (the target has a low turnover rate and/or catalyzes a rate-limiting step within a pathway); validation (the target is essential for growth and/or survival in the mammalian stage of the parasite lifecycle); "assayability" (specific, inexpensive and high-throughput screens are available using in vitro expressed target); and low potential for development of drug resistance (absence of different isoforms or alleles and/or biochemical "bypass" reactions). With these criteria in mind, several bioinformatic approaches have been proposed, which take advantage of the availability of the complete genome sequences described above to accelerate progress in developing effective clinical interventions for the important diseases caused by these parasites.

Analysis of the Tritryp genomes has provided a comprehensive view of the parasites' metabolic potential by identifying numerous common and species-specific metabolic and transport processes. Manual examination of metabolic maps identified a number of pathways that appear to be especially amenable to potential chemotherapeutic intervention, including glycolysis, the electron transport chain, the urea cycle, the glyoxalase pathway and associated trypanothione metabolism, glycosylphosphatidylinositol (GPI) anchor biosynthesis, fatty acid biosynthesis, as well as the ergosterol and isoprenoid biosynthetic pathways (1). Since the particulars of target identification and drug development for each of these pathways (and others) have been described in detail in several of the accompanying chapters and elsewhere (23-26), they will not be further explored here.
Instead, several different computational attempts to catalogue metabolic pathways and identify "choke-points" will be described.

BRENDA (BRaunschweig ENzyme DAtabase) is a comprehensive collection of enzyme and metabolic information (http://www.brenda.uni-koeln.de), including Enzyme Commission (EC) classification and nomenclature, reaction and specificity, function and structure, isolation and stability, as well as links to primary literature references. The database is now based on a controlled vocabulary and ontology for some information fields, and search tools include EC and taxonomy-tree browsers, a chemical substructure search engine for ligand structure, and a thesaurus for ligand names. BRENDA contains more than 100,000 enzymes representing 4060 different EC numbers from about 10,000 different organisms. There are currently (as of September, 2006) 842 entries for T. brucei, 751 for T. cruzi and 607 for L. major.

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a suite of databases and associated software, designed to integrate current knowledge of genes and proteins (GENES database), chemical compounds and reactions (LIGAND), metabolic, regulatory and interaction networks (PATHWAY), and ontologies (BRITE). Biological systems are represented in KEGG by nested graphs, which are used for pathway reconstruction and functional inference, and line graphs, which form the basis for integrating genome and chemical information with the networks. The BRITE database provides the pathway reconstruction through a series of functional hierarchies and represents the logical foundation for the KEGG project. KEGG maintains a gene catalogue of sequenced genomes and maps them onto 301 manually drawn and curated reference pathways (27-31). Currently, there are 83, 90, and 89 entries in the PATHWAY database for T. brucei, T. cruzi and L. major, respectively, mostly describing metabolic pathways.
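The EC numbers that BRENDA and KEGG attach to enzymes offer a convenient common key for comparing the enzyme complement of a parasite with that of its host. The following minimal Python sketch illustrates the idea with simple set operations. The two annotation sets are small hand-typed examples (an assumption for illustration, not exports from either database), and the function names are hypothetical.

```python
from collections import defaultdict

# Names of the six classical top-level EC classes (first field of an EC number).
EC_CLASSES = {
    "1": "oxidoreductases",
    "2": "transferases",
    "3": "hydrolases",
    "4": "lyases",
    "5": "isomerases",
    "6": "ligases",
}

# Illustrative, hand-typed annotation sets (NOT real database exports).
# 1.8.1.12 is trypanothione-disulfide reductase; the others are
# hexokinase, ornithine decarboxylase and triose-phosphate isomerase.
PARASITE_ECS = {"1.8.1.12", "2.7.1.1", "4.1.1.17", "5.3.1.1"}
HOST_ECS = {"2.7.1.1", "4.1.1.17", "5.3.1.1"}


def parasite_specific_ecs(parasite, host):
    """Return EC numbers annotated in the parasite but not in the host."""
    return sorted(set(parasite) - set(host))


def group_by_class(ec_numbers):
    """Group EC numbers by the name of their top-level EC class."""
    grouped = defaultdict(list)
    for ec in sorted(ec_numbers):
        top_level = ec.split(".")[0]
        grouped[EC_CLASSES.get(top_level, "unknown")].append(ec)
    return dict(grouped)


unique = parasite_specific_ecs(PARASITE_ECS, HOST_ECS)
print(group_by_class(unique))
```

An EC number that is absent from the host annotation is only a starting point: a missing annotation may reflect sequence divergence rather than true absence of the enzyme.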
The BIOCYC collection of Pathway/Genome Databases (PGDBs) provides electronic reference sources on the pathways and genomes of more than 200 different organisms (http://biocyc.org). The databases within the BIOCYC collection are organized into tiers according to the amount of manual review and updating they have received. Tier 1 PGDBs are created through intensive manual efforts, and receive continuous updating. EcoCyc, which describes Escherichia coli K-12, is the only organism-specific Tier 1 database. Tier 2 PGDBs are computationally generated using PATHOLOGIC software (32,33), and have undergone moderate amounts of review and updating. There are currently 12 databases in Tier 2, including HUMANCYC and PLASMOCYC (which describes the malaria parasite, Plasmodium falciparum). Tier 3 databases are computationally generated by the PATHOLOGIC program, and have undergone no review and updating. There are 191 PGDBs in Tier 3, representing mostly bacterial genomes.

The individual BIOCYC web-sites can be used to visualize single or multiple metabolic pathways, including a complete metabolic map of the organism. An OMICS VIEWER can be used to analyze gene expression, proteomics, or metabolomics data to produce animated views of time-course gene-expression experiments. There are currently no BIOCYC PGDBs for any of the trypanosomatid genomes, although it should be relatively straightforward to generate Tier 3 databases using the PATHOLOGIC software (32). Other programs are also available for genome-scale reconstruction of metabolic networks (35-38). However, since this process is largely dependent on sequence-based homology searches to identify the enzymes and the Tritryp genomes are quite divergent from other eukaryotes, considerable manual curation will probably be necessary to obtain truly accurate representations of the metabolic networks in these organisms.
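To make the reconstruction problem concrete, the following Python sketch scores how completely a set of annotated EC numbers covers each reference pathway and reports the missing steps (often called "pathway holes"). It is a deliberately naive coverage threshold, not the algorithm used by PATHOLOGIC or any other program cited here. The reference pathways are toy fragments, and the function name and the 0.5 cut-off are assumptions made for illustration.

```python
# Toy reference pathways: name -> set of EC numbers for their reactions.
# These are hand-typed fragments for illustration only.
REFERENCE_PATHWAYS = {
    "glycolysis (toy)": {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"},
    "toy pathway B": {"1.1.1.1", "1.2.1.3"},
}


def pathway_coverage(annotated_ecs, reference, min_fraction=0.5):
    """Score each reference pathway against the annotated EC numbers.

    Returns a dict: pathway name -> {"fraction", "present", "holes"}.
    A pathway is called "present" when at least min_fraction of its
    reactions have an annotated enzyme (an arbitrary, assumed cut-off).
    """
    annotated = set(annotated_ecs)
    report = {}
    for name, required in reference.items():
        if not required:
            continue  # skip empty pathway definitions
        found = required & annotated
        fraction = len(found) / len(required)
        report[name] = {
            "fraction": fraction,
            "present": fraction >= min_fraction,
            "holes": sorted(required - annotated),
        }
    return report


# Hypothetical annotation in which one glycolytic enzyme was not detected.
example = pathway_coverage({"2.7.1.1", "5.3.1.9", "2.7.1.11"}, REFERENCE_PATHWAYS)
for pathway, result in example.items():
    print(pathway, result)
```

In a divergent genome, a reported hole is a prompt for manual curation, since the enzyme may be present but too dissimilar in sequence to be detected by homology.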
While most of the individual PGDBs within BIOCYC represent species-specific databases, METACYC (http://metacyc.org) is a collection of metabolic pathways and enzymes from more than 240 organisms (mostly bacteria and plants). The goal of METACYC is to represent every experimentally elucidated metabolic pathway, reaction, and chemical compound, as well as the genes encoding the enzymes that catalyze the reactions involved (39). As well as being used as a reference source to look up individual facts, METACYC facilitates computational studies of metabolism, such as design of novel biochemical pathways for biotechnology, studies of evolution of metabolic pathways, and simulation of metabolic pathways. Additionally, desktop software is available for comparing the overall metabolic maps, specific pathways and genomic maps of two organisms.

Careful manual examination of a metabolic pathway can identify metabolic "choke-points", i.e., the enzyme(s) which is (are) uniquely necessary to produce a critical metabolite. Obviously, choke-points in pathways that result in metabolites critical for parasite survival would make excellent potential targets for development of novel anti-trypanosomatid drugs. The PATHWAY HUNTER TOOL (http://www.pht.uni-koeln.de) uses an extended form of graph theory (in which enzymes are represented by edges between nodes representing metabolites) to identify choke-points and rank them according to their "load" (40,41). Load is defined as the ratio of the number of shortest paths passing through the enzyme to the number of nearest neighbors attached to it, compared with the average of the same ratio over the entire network. Comparison of pathogen (trypanosomatid) and host (human) metabolic networks could be used to identify highly ranked choke-points that are unique to the parasite or are ranked much lower in the host.
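The choke-point definition above (an enzyme that is the only route to a metabolite) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PATHWAY HUNTER TOOL algorithm: it does not compute load, the two networks are invented toy examples, and the enzyme labels (E1 to E4) and function names are hypothetical.

```python
from collections import defaultdict

# Toy networks: enzyme -> (set of substrates, set of products).
# Invented for illustration; they do not describe real metabolism.
PARASITE = {
    "E1": ({"A"}, {"B"}),
    "E2": ({"B"}, {"C"}),
    "E3": ({"A"}, {"C"}),  # alternative route to C, so C has two producers
    "E4": ({"C"}, {"D"}),
}
HOST = {
    "E1": ({"A"}, {"B"}),
    "E2": ({"B"}, {"C"}),
}


def find_choke_points(reactions):
    """Return enzyme -> metabolites for which it is the sole producer."""
    producers = defaultdict(set)
    for enzyme, (_substrates, products) in reactions.items():
        for metabolite in products:
            producers[metabolite].add(enzyme)

    choke_points = defaultdict(list)
    for metabolite, enzymes in producers.items():
        if len(enzymes) == 1:  # exactly one enzyme makes this metabolite
            choke_points[next(iter(enzymes))].append(metabolite)
    return {enzyme: sorted(mets) for enzyme, mets in choke_points.items()}


def parasite_only_choke_points(parasite, host):
    """Keep parasite choke-point enzymes that are absent from the host."""
    return {
        enzyme: metabolites
        for enzyme, metabolites in find_choke_points(parasite).items()
        if enzyme not in host
    }


print(find_choke_points(PARASITE))
print(parasite_only_choke_points(PARASITE, HOST))
```

Whether a uniquely produced metabolite is actually critical for parasite survival is a biological judgment that this structural test cannot make, so the output is a candidate list rather than a set of validated targets.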
Another computational approach for identification of metabolic enzymes as drug targets involves the concept of minimal cut sets, which are defined as the minimal sets of reactions in a network whose inactivation will definitively lead to a failure in a particular network function (42). Screening parasite metabolic networks for all possible minimal cut sets and identification of those which are small (i.e., contain few enzymes) and not present in the host could serve to identify potential drug targets.

The approaches outlined above are designed to identify targets that meet only some of the criteria outlined at the beginning of this section; namely, they have suitable biochemical properties, are likely to be essential for the parasite, and are sufficiently different from any host homologue. However, alternative approaches seek to make use of the finding that successful drugs have specific structural and physicochemical properties that allow them to be efficacious, bioavailable, and safe. These properties are exemplified by Lipinski's so-called "rule of five" (43). This has led to the concept of "druggable" proteins, based on their ability to bind potentially effective drug-like small molecules (44-46). Thus, it makes sense to search the Tritryp genomes for proteins that are likely to meet these criteria. Two different approaches have been proposed for developing computational solutions to this problem: searching the genome for proteins with similar properties to known drug targets in other organisms (primarily humans) and direct interrogation of the parasite proteins for their likelihood to bind drug-like chemicals.

The Therapeutic Target Database (TTD) (http://xin.cz3.nus.edu.sg/group/cjttd/ttd.asp) represents a comprehensive and publicly available attempt to catalogue information about all the currently known protein and nucleic acid targets described in the literature (46,47). The database also contains information about the drugs and ligands directed at these targets, as well as corresponding disease conditions. This database currently contains 1535 targets and 2107 drugs/ligands, including 19 entries listing potential anti-trypanosomatid use. The simplest approach for searching the Tritryp genomes for potential targets similar to these existing targets would be to carry out BLASTP or PSI-BLAST searches of the Tritryp protein databases to identify parasite proteins with significant sequence similarity to those in the TTD. The resulting list of parasite proteins would need to be subsequently winnowed down by removing those that are too similar to the human orthologues and/or are similar to proteins involved in more than two pathways in humans, since drugs against these are likely to have deleterious effects on the human host. However, given what we know about the imprecise nature of the relationship between protein sequence and structure, it is likely that this method will have a significant false negative rate (i.e., it will miss many potentially useful targets because they won't have sufficient sequence similarity).

Statistical learning methods, such as support vector machines (SVM) and neural networks, have recently enjoyed considerable success for prediction of protein structure and may be useful for identifying targets missed by simple BLAST searching. An SVM method has been used to screen the human and HIV genomes for druggable proteins, with a promising degree of accuracy (46,48,49). Similar methods could be used to screen the Tritryp genomes.

Algorithms such as AUTODOCK (50) have been used for some time to predict small molecules that will potentially fill protein ligand-binding pockets, as a first step in rational drug design.
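Returning to the sequence-similarity search against the TTD described above, the winnowing step can be sketched in Python as a filter over BLAST tabular output (the standard 12-column format produced by `-outfmt 6`). This is a hedged illustration only: the thresholds (E-value 1e-10, 40% identity), the protein identifiers, the pathway counts and the function names are all assumptions, and no real BLAST results are shown.

```python
def parse_blast_tabular(text):
    """Parse BLAST tabular output (-outfmt 6, default 12 columns)."""
    hits = []
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append({
            "query": fields[0],            # parasite protein
            "subject": fields[1],          # TTD target or human protein
            "pident": float(fields[2]),    # percent identity
            "evalue": float(fields[10]),   # expectation value
        })
    return hits


def winnow_targets(ttd_hits, human_hits, human_pathway_counts,
                   max_evalue=1e-10, max_human_identity=40.0, max_pathways=2):
    """Keep parasite proteins that resemble a known target but not the host.

    A candidate is dropped if a significant human hit is too similar
    (pident >= max_human_identity) or if that human protein takes part
    in more than max_pathways pathways. All cut-offs are assumed values.
    """
    candidates = {h["query"] for h in ttd_hits if h["evalue"] <= max_evalue}
    rejected = set()
    for hit in human_hits:
        if hit["query"] not in candidates or hit["evalue"] > max_evalue:
            continue
        too_similar = hit["pident"] >= max_human_identity
        too_many_pathways = human_pathway_counts.get(hit["subject"], 0) > max_pathways
        if too_similar or too_many_pathways:
            rejected.add(hit["query"])
    return sorted(candidates - rejected)


def _row(query, subject, pident, evalue):
    """Build one 12-column tabular line with placeholder alignment fields."""
    return "\t".join([query, subject, str(pident), "100", "0", "0",
                      "1", "100", "1", "100", str(evalue), "50"])


# Hypothetical identifiers and values, for illustration only.
TTD_TEXT = "\n".join([_row("TbA", "TTD1", 55.0, 1e-50), _row("TbB", "TTD2", 48.0, 1e-30),
                      _row("TbC", "TTD3", 30.0, 1e-3), _row("TbD", "TTD4", 60.0, 1e-40)])
HUMAN_TEXT = "\n".join([_row("TbA", "Hs1", 35.0, 1e-20), _row("TbB", "Hs2", 72.0, 1e-60),
                        _row("TbD", "Hs3", 30.0, 1e-15)])
PATHWAY_COUNTS = {"Hs1": 1, "Hs2": 1, "Hs3": 5}

kept = winnow_targets(parse_blast_tabular(TTD_TEXT), parse_blast_tabular(HUMAN_TEXT),
                      PATHWAY_COUNTS)
print(kept)
```

In this toy run, TbC fails the similarity cut-off against the TTD, TbB is dropped as too close to a human protein, and TbD is dropped because its human counterpart participates in many pathways. As noted above, any such sequence-based filter will also miss targets whose similarity to known targets is structural rather than sequence-based.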
This process has been reversed to some extent by using docking software with integrated molecular dynamics simulation to predict which drugs are likely to bind (and inhibit) proteases from human coronavirus (51), cytomegalovirus (52), and human immunodeficiency virus (HIV) (53). A recent publication describes the use of this method to screen 2500 compounds in the CHEMBANK database (http://chembank.broad.harvard.edu) against 13 proteins from Plasmodium falciparum whose structure had been determined by X-ray crystallography (54). This approach found that the Ki values predicted for three existing anti-malarial drugs compared well with their known values and that their predicted inhibitory activity ranked in the top 5th percentile of all tested drugs. Another 20 drugs were predicted to have multi-target activity, i.e., they showed high affinity with two or more proteins. Multi-target drugs are attractive because they are less likely to encounter problems with development of drug resistance. It should be possible to screen the Tritryp proteome for multi-target drugs using a similar approach. Obviously, one major constraint is the availability (or lack thereof) of trypanosomatid proteins with known structure. Currently, the Protein Data Bank (PDB) contains 79 nonredundant structures from the genera Leishmania or Trypanosoma. However, this number has been increasing rapidly over the last few years due to the efforts of the Structural Genomics of Pathogenic Protozoa (SGPP) consortium (http://www.sgpp.org) and is likely to increase further in the near future.

The recent completion of the Tritryp genome sequencing project provides an unprecedented opportunity for development of novel anti-trypanosomatid chemotherapeutic agents. The identification of more than 8000 new protein-coding genes, many of which are shared between the Leishmania and Trypanosoma genera, vastly expands the potential drug targets available for investigation.
In fact, the situation has gone from a relative dearth of useful targets to an embarrassment of riches, with far more potential targets available than can possibly be studied in detail. In this chapter, we have described several different computational approaches that should be useful in reducing this smorgasbord of genes to a manageable number of high-value targets, which will form the basis of detailed biological and pharmacological investigation. Of course, target identification is only the first stage in the lengthy and expensive process of drug development, with steps such as target validation, lead identification and optimization, as well as preclinical pharmacological screening, necessary before a potential drug can enter clinical trials. Nevertheless, these bioinformatic methods hold great promise in being able to identify targets (and potential lead compounds in some cases) which have a higher probability of successful drug development than traditional methods. While only time will reveal the validity of this promise, we hope that this advent of the post-genomics era for trypanosomatid biology heralds a renaissance in the discovery of much needed new drugs for the devastating diseases caused by these parasites.

1. The genome of the African trypanosome, Trypanosoma brucei
2. The genome sequence of Trypanosoma cruzi, etiological agent of Chagas' disease
3. The genome of the kinetoplastid parasite, Leishmania major
4. Comparative genomics of trypanosomatid parasitic protozoa
5. The molecular phylogeny of trypanosomes: Evidence for an early divergence of the Salivaria
6. The molecular evolution of Trypanosomatidae
7. Transcription in kinetoplastid protozoa: Why be normal?
8. Leishmania major Friedlin chromosome 1 has an unusual distribution of protein-coding genes
9. Transcription of Leishmania major Friedlin chromosome 1 initiates in both directions within a single region
10. Transcription initiation and termination on Leishmania major chromosome 3
11. Control of gene expression in trypanosomes
12. Increased expression of LD1 genes transcribed by RNA polymerase I in Leishmania donovani as a result of duplication into the rRNA gene locus
13. mRNA processing in the Trypanosomatidae
14. Trypanosomal TBP functions with the multisubunit transcription factor tSNAP to direct spliced-leader RNA gene expression
15. Characterization of a multisubunit transcription factor complex essential for spliced-leader RNA gene transcription in Trypanosoma brucei
16. A divergent transcription factor TFIIB in trypanosomes is required for RNA polymerase II-dependent SL RNA transcription and cell viability
17. A TFIIB-like protein is indispensable for spliced leader RNA gene transcription in Trypanosoma brucei
18. Emergence of diverse biochemical activities in evolutionarily conserved structural scaffolds of proteins
19. Life without transcriptional control? From fly to man and back again
20. Multiple mitochondrial DNA polymerases in Trypanosoma brucei
21. Comparative analysis of the kinomes of three pathogenic trypanosomatids: Leishmania major, Trypanosoma brucei and Trypanosoma cruzi
22. Opportunities and challenges in antiparasitic drug discovery
23. Chemotherapy of human African trypanosomiasis: Current and future prospects
24. Fatty acid synthesis by elongases in trypanosomes
25. Hannaert V et al. Experimental and in silico analyses of glycolytic flux control in bloodstream form Trypanosoma brucei
26. Glycolysis and proteases as targets for the design of new anti-trypanosome drugs
27. LIGAND: Chemical database for enzyme reactions
28. A database for post-genome analysis
29. From genomics to chemical genomics: New developments in KEGG
30. Kyoto encyclopedia of genes and genomes
31. The KEGG resource for deciphering the genome
32. The pathway tools software
33. Computational analysis of Plasmodium falciparum metabolism: Organizing genomic information to facilitate drug discovery
34. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes
35. Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms
36. Metabolic modeling of microbial strains in silico
37. Reconstruction of metabolic networks using incomplete information
38. Integrated system for high-throughput genome sequence analysis and metabolic reconstruction
39. A multiorganism database of metabolic pathways and enzymes
40. Observing local and global properties of metabolic pathways: 'Load points' and 'choke points' in the metabolic networks
41. Metabolic pathway analysis web service (Pathway Hunter Tool at CUBIC)
42. Minimal cut sets in biochemical reaction networks
43. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings
44. The druggable genome
45. The multiple orthogonal tools approach to define molecular causation in the validation of druggable targets
46. Therapeutic targets: Progress of their exploration and investigation of their characteristics
47. TTD: Therapeutic Target Database
48. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity
49. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence
50. Automated docking of flexible ligands: Applications of AutoDock
51. Identifying inhibitors of the SARS coronavirus proteinase
52. Virtual screening of HIV-1 protease inhibitors against human cytomegalovirus protease using docking and molecular dynamics
53. PIRSpred: A web server for reliable HIV-1 protein-inhibitor resistance/susceptibility prediction
54. Identification of potential multitarget antimalarial drugs