key: cord-0005306-uukrqp7j authors: Haibin, Wei; Ji, Qi; Bailin, Hao title: Prokaryote phylogeny based on ribosomal proteins and aminoacyl tRNA synthetases by using the compositional distance approach date: 2004 journal: Sci China C Life Sci DOI: 10.1360/03yc0137 sha: ae5f56a3a326c5df68253bf734a473991b5ffe5c doc_id: 5306 cord_uid: uukrqp7j
The newly developed K-string composition distance method infers phylogenetic relations of prokaryotes by counting oligopeptide frequencies. To show that it works equally well without whole-proteome data, we applied it to all ribosomal proteins and to the set of aminoacyl tRNA synthetases of each species. The latter group is known to yield inconsistent trees when its members are used individually. Our trees are obtained without any sequence alignment. Altogether 16 Archaea, 105 Bacteria and 2 Eukarya are represented on the trees. Most of the lower branchings agree well with the latest (2003) Outline of the second edition of Bergey's Manual of Systematic Bacteriology, and the trees also suggest some relationships among higher taxa.
The systematics of prokaryotes has been a challenge in microbiology because there are too few morphological characteristics that can be used for classification [1]. A major breakthrough took place in the 1970s when Carl Woese [2] and coworkers aligned small subunit ribosomal RNA (SSU rRNA) sequences to infer phylogenetic relations. The recognition of Archaea as a third domain of life in addition to Bacteria and Eukarya, and the support for the endosymbiotic origin of chloroplasts and mitochondria, were the main achievements along this line. Databases of rRNAs have been established [3, 4] to facilitate SSU rRNA based molecular phylogeny. Even the second edition of Bergey's Manual of Systematic Bacteriology followed "a phylogenetic framework based on analysis of the nucleotide sequence of the SSU rRNA, rather than a phenotypic structure" (see George Garrity's Preface to ref. [5]).
However, the reliability of using SSU rRNAs alone to infer phylogenetic relationships has been questioned in recent years. These sequences of about 1500 nucleotides may not contain enough phylogenetic information to resolve all branchings on the tree of life. There is evidence that even these conserved RNAs might be horizontally transferred [6, 7]. Moreover, the influx of complete prokaryote genomes since 1995 has brought more problems than clarifications to molecular phylogeny. For example, it is now a consensus that different genes may tell different stories and that a gene tree cannot be equated with the species tree. In particular, the implications of lateral gene transfer and lineage-dependent gene loss for molecular phylogenetics have become a subject of hot debate [8, 9].
In order to make use of the ever increasing genomic data, many "whole-genome" methods have been suggested (for a review see, e.g., refs. [10, 11]). Between the two extremes of using single genes or whole genomes, it has also been proposed to base the trees on combinations of protein sequences [12]. Nevertheless, all these methods need sequence alignments and scoring schemes, explicitly or implicitly, at one stage or another, and thus depend on many parameters and fine adjustments. In order to avoid both sequence alignment and the selection of particular genes, we have developed a K-string composition distance approach to infer phylogenetic relationships from complete genomes. The new approach has been successfully applied to the study of prokaryotes [13], chloroplasts [14] and coronaviruses [15].
On the other hand, the need for whole-genome data might be considered a demerit of the method. Therefore, in this work we chose two protein sets that behave quite differently in yielding phylogenetic information. The ribosomal proteins are interwoven with rRNAs to form complexes that function as a whole, and thus they may not be easily transferred horizontally to other species.
It is therefore no surprise that sequence-based methods using concatenated ribosomal proteins have led to reasonable phylogenetic results [10, 16]. In contrast, the aminoacyl tRNA synthetases act as individual molecules, and there is no severe obstacle to prevent them from being transferred between organisms. Indeed, they are notorious molecules in phylogenetics. The 20 different aminoacyl tRNA synthetases, if used individually, yield 20 different trees; some may not even show the trifurcation of the three main domains of life, Archaea, Bacteria and Eukarya [17, 18, 19]. However, as our results show, the collection of all aminoacyl tRNA synthetase sequences in a species leads to a phylogenetic tree comparable to the tree based on ribosomal proteins or on the whole proteome [13].
The goal of this paper is threefold. First, to show that the composition distance method does not necessarily require whole-proteome data; protein sequences from a suitable family may well do the job. Second, to provide a new approach in molecular phylogeny that is independent of, but largely supportive of, the "standard" methodology based on SSU rRNA sequences. Third, to verify the new method by a stringent comparison with the bacteriologists' classification instead of relying merely on stability and self-consistency tests of the bootstrap or jackknife type.
There are two sets of prokaryote genomes. Those in GenBank [20] are the original data deposited by the authors. Those at the National Center for Biotechnology Information (NCBI) are curated or re-annotated by the NCBI staff [21] and are distinguished by accession numbers prefixed with NC_. We used all but one of the prokaryote genomes from ref. [21] that were available by 10 June 2003. The one skipped was Pasteurella multocida, because no ribosomal protein or tRNA synthetase information could be found in its annotation. The organism names, their abbreviations, their NCBI accession numbers and their standing in Bergey's Manual are given in the Appendix.
The distance matrices were calculated by using the K-string composition method, which has already been described elsewhere [13]. Therefore, only a brief summary of the method follows. First, collect all amino acid sequences from a protein family or from a whole genome. Second, calculate the frequency of appearance of overlapping oligopeptides of length K. A random background is subtracted from these frequencies by using a Markov model of order K−2, in order to diminish the influence of random neutral mutations at the molecular level and to highlight the shaping role of selective evolution. Third, by putting these "normalized" frequencies in a fixed order, a composition vector of dimension 20^K is obtained for each species. Fourth, the correlation C(A, B) between two species A and B is determined by taking the projection of one normalized vector on the other, i.e., the cosine of the angle between them. Thus if the two vectors are the same they have the highest correlation, C = 1; if they have no components in common then C = 0, i.e., the two vectors are orthogonal to each other. Lastly, the normalized distance between the two species is defined as D = (1 − C)/2. Once a distance matrix was obtained, the tree was constructed in the standard way [22] by using the neighbor-joining algorithm in the Phylip package [23]. The tree topology stabilized with increasing K and with respect to re-sampling of protein sequences. For more on statistical tests and justification of this approach, see refs. [13, 14].
The tree based on ribosomal proteins is given in fig. 1 and that based on aminoacyl tRNA synthetases in fig. 2. The calculation included all 123 organisms. Since different strains of the same species, as well as different species within the same genus, always grouped together, in the final drawing we kept only one representative species from each genus. Therefore, these trees are effectively genus trees.
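The steps summarized in the method description above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the trees in this paper. The text here only states that a Markov model of order K−2 supplies the background; the specific estimate p0 = p(a1..aK−1) p(a2..aK) / p(a2..aK−1) and the relative difference (p − p0)/p0 used below follow our reading of ref. [13] and should be treated as assumptions. All function names and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def kstring_freqs(seqs, k):
    """Relative frequencies of overlapping length-k strings, pooled over all sequences."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def composition_vector(seqs, k):
    """Sparse composition vector with an order-(k-2) Markov background removed.

    Assumed form: component = (p - p0) / p0, where p0 is predicted from the
    (k-1)- and (k-2)-string frequencies. Components with p0 == 0 are simply
    absent from the dict, i.e. treated as 0.
    """
    if k < 3:
        raise ValueError("this sketch needs k >= 3")
    p_k = kstring_freqs(seqs, k)
    p_k1 = kstring_freqs(seqs, k - 1)
    p_k2 = kstring_freqs(seqs, k - 2)
    # Index (k-1)-strings by their first k-2 letters so overlapping pairs are easy to find.
    by_prefix = defaultdict(list)
    for v in p_k1:
        by_prefix[v[:-1]].append(v)
    vec = {}
    for u, pu in p_k1.items():
        mid = u[1:]  # shared (k-2)-string
        for v in by_prefix.get(mid, ()):
            word = u + v[-1]
            p0 = pu * p_k1[v] / p_k2[mid]
            vec[word] = (p_k.get(word, 0.0) - p0) / p0
    return vec


def correlation(a, b):
    """Cosine of the angle between two sparse composition vectors."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    na = sqrt(sum(x * x for x in a.values()))
    nb = sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def distance(a, b):
    """Normalized distance D = (1 - C) / 2."""
    return (1.0 - correlation(a, b)) / 2.0


# Toy input: made-up peptide fragments standing in for a protein family per species.
proteins = {
    "sp_A": ["MKVLAAGIVKMKVL", "MAGIVKKVLA"],
    "sp_B": ["MKVLSAGIVRMKVL", "MSGIVRKVLA"],
    "sp_C": ["MTTQPWWEDNHTTQ", "MWEDNHQPTT"],
}
vectors = {name: composition_vector(seqs, 3) for name, seqs in proteins.items()}
names = sorted(vectors)
matrix = [[distance(vectors[a], vectors[b]) for b in names] for a in names]
```

The resulting square matrix is the kind of input a neighbor-joining program expects; writing it in the input format of the Phylip package is left out of the sketch. A dictionary is used instead of a dense array because only a small fraction of the 20^K possible K-strings occurs in a limited protein set.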
With 121 organisms from 67 genera, 55 families, 46 orders, 25 classes and 14 of the 25 prokaryote phyla represented on the trees, we are in a position to carry out a detailed and more stringent comparison with the bacteriologists' taxonomy. In fact, we compare our results with three different but related schemes: the SSU rRNA tree in ref. [1], which was a composite tree containing 253 species; the RDP-II Backbone Tree for Release 8.0 [4], which contained 183 representatives of 217 taxonomic families collected in the second edition of Bergey's Manual; and Bergey's Manual [5, 24] itself, which is based largely on the SSU rRNA model but also takes the traditional taxonomy into account.
In general, the tree based on ribosomal proteins agrees better with the SSU rRNA trees than the tree based on the collection of aminoacyl tRNA synthetases. The latter in turn behaves much better than trees based on any single tRNA synthetase [17, 18, 19]. In particular, the division of all organisms into the three main domains of life is a consistent and prominent feature of the two trees. The branchings from genera up to families and orders basically agree with those of the SSU rRNA trees. Therefore, we concentrate on discrepancies at various taxonomic levels, especially on those which might call for taxonomic revisions.
Paraphyletic placement of species is invisible on genus trees such as the RDP-II Backbone Tree [4] or our trees shown in figs. 1 and 2. However, there are two such cases on our more detailed organism trees. First, Urepa gets mixed into the Mycoplasma genus, as was the case on the SSU rRNA tree in ref. [1]. This might hint at a genus assignment problem for Urepa. Second, Shifl appears within the Escherichia genus. For the latter case it would be interesting to await the SSU rRNA result. At a higher taxonomic level, it was observed in ref. [1] that the beta group of Proteobacteria appeared within the gamma group.
This is so on all our trees in this paper and in ref. [13]. We can add a further observation: the separated deeper gamma subgroup comprises two genera with small genome size (Buchnera and Wigglesworthia). The latter may even move quite far from the main Proteobacteria groups on the tRNA synthetase tree (fig. 2). The fact that species with significantly smaller genomes form a separate deeper subgroup on all these trees might be a manifestation of real evolutionary history, as small genomes should naturally evolve earlier. In any case, the effect of genome size poses a problem which could not be seen clearly on trees based on any single gene.
All three Spirochaetes (Burbu, Trepa and Lepin) appear together in fig. 1, as they are grouped in Bergey's Manual. However, Lepin stands apart in fig. 2 and on the proteome trees in ref. [13]. We could not tell whether this was a consequence of a significant difference between Lepin and the other two.
The archaeon Methanopyrus kandleri (Metka) was once predicted by SSU rRNA analysis to be an outlier to the methanogenic Archaea [25]. However, on all our trees it stands firmly within the methanogens, in agreement with the gene content and gene pair analysis reported in ref. [26]. The three genera from Crenarchaeota (Pyrae, Aerpe and Sulso) always stay together, but Halsp and Theac may change their location with respect to the majority of Euryarchaeota, as was observed on some trees in refs. [10, 11].
There was only one cross-phylum difference. The new genus Oceanobacillus, represented by Oceih, is situated in the Firmicutes phylum very close to its Bacillus siblings (B13.3.1.1 in terms of Bergey's code) in figs. 1 and 2 and on the K = 5 and K = 6 proteome trees [13]. This is in accordance with the NCBI [21] taxonomy, but in the 2002 Outline of Bergey's Manual [24] it was put in Gammaproteobacteria (B12.3.8.1.6) with a footnote that "The position of Oceanospirillales within the ARB tree is ambiguous".
However, while waiting for the Referees' comments on this manuscript we were glad to see that Oceih has been moved to B13.3.1.1.12 in the newly released 2003 edition of the Outline [27]. Accordingly, in table 2 below we have moved Oceih to its correct position.
Before concluding the discussion we touch briefly on the problem of higher taxa. The demarcation and placement of higher taxa has been a subject of debate in taxonomy beyond that of prokaryotes. In a taxonomic outline such as Bergey's Manual many phyla could only be juxtaposed under the archaeal or bacterial domain. Comparing all our trees in ref. [13] and in this paper with the SSU rRNA tree in ref. [1], with the RDP-II Backbone Tree [4], and with trees obtained by other whole-genome methods [10, 11], we are able to recognize some common features on all trees that can hardly be incidental artefacts:
1. The two phyla Aquificae (B1) and Thermotogae (B2) always come together before joining a main trunk of the tree.
2. The phyla Chlorobi (B11) and Bacteroidetes (B20) do the same, as was observed in refs. [1, 19].
3. The points where the phyla Chlamydiae (B16) and Spirochaetes (B17) join the tree are always close to each other (with the exception of Lepin jumping out of B17 on some trees).
4. The closeness of Deinococcus-Thermus (B4) and Actinobacteria (B14) is apparent on many trees.
5. The separation of the Mycoplasma from the main body of Firmicutes (B13) is a prominent feature on many whole-genome trees, including ours.
However, one should also bear in mind that for the time being 6 phyla out of 14 are represented by only one species. The relationships of higher taxa will be further verified when genomic data from a wider assortment of taxa become available.
The composition distance method provides a new systematic way of inferring phylogenetic relationships without sequence alignment and parameter adjustment.
Together with the traditional SSU rRNA analysis it may help to put prokaryote taxonomy on a unified molecular basis.
We used 16 Archaea, 105 Bacteria and 2 Eukarya in this work. All organism names, their abbreviations and accession numbers are given in tables 1 to 3 below. The last column in tables 1 and 2, the "Bergey's Code", is a shorthand for the classification in the second edition of Bergey's Manual of Systematic Bacteriology [24]. For example, EcoliK is listed in Genus I (Escherichia), Family I (Enterobacteriaceae), Order XIII (Enterobacteriales), Class III (Gammaproteobacteria) of Phylum BXII (Proteobacteria). We changed all Roman numerals to Arabic ones and wrote the lineage as B12.3.13.1.1, dropping the taxonomic units and the Latin names.
The winds of (evolutionary) change: Breathing new life into microbiology
Phylogenetic structure of the prokaryotic domain: The primary kingdoms
The European Ribosomal RNA database
The Ribosomal Database Project (RDP-II): Previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy
Bergey's Manual Trust, Bergey's Manual of Systematic Bacteriology
Phylogenetic inferences from molecular sequences: Review and critique
Engineering of bacterial ribosomes: Replacement of all seven Escherichia coli rRNA operons by a single plasmid-encoded operon
Phylogenetic classification and the universal tree
Detection of lateral gene transfer among microbial genomes
Genome trees constructed using five different approaches suggest new major bacterial clades
Genome trees and the tree of life
Universal trees based on large combined protein sequence data sets
Whole genome prokaryote phylogeny without sequence alignment: A K-string composition approach
Origin and phylogeny of chloroplasts: A simple correlation analysis of complete genomes
Molecular phylogeny of coronaviruses including human SARS-CoV
Archaeal phylogeny based on ribosomal protein
Evolutionary anomalies among the aminoacyl-tRNA synthetases, Current Opinion in Genetics & Development
Evolution of aminoacyl-tRNA synthetases - Analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events
Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process, Microbiology and
Database resources of the National Center for Biotechnology
Molecular Evolution and Phylogenetics
PHYLIP (phylogeny inference package) version 3.5c
Taxonomic Outline of the Procaryotes, Bergey's Manual of Systematic Bacteriology
Methanopyrus kandleri: An archaeal methanogen unrelated to all other known methanogens
The complete genome of hyperthermophile Methanopyrus kandleri AV19 and monophyly of archaeal methanogens
Taxonomic Outline of the Procaryotes, Bergey's Manual of Systematic Bacteriology
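The "Bergey's Code" shorthand described in the Appendix (Roman numerals changed to Arabic, ranks and Latin names dropped) can be illustrated with a short Python sketch. The function names and the argument order are hypothetical and chosen only for this example; the domain letter is passed through unchanged.

```python
# Values of the Roman numeral symbols needed for taxonomic rank numbers.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XIII' to an integer."""
    total = 0
    prev = 0
    # Read right to left: a smaller symbol before a larger one is subtracted.
    for symbol in reversed(numeral.upper()):
        value = ROMAN_VALUES[symbol]
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total


def bergey_code(domain, phylum, cls, order, family, genus):
    """Build the shorthand lineage, e.g. B12.3.13.1.1, from Roman-numbered ranks."""
    ranks = (phylum, cls, order, family, genus)
    return domain + ".".join(str(roman_to_int(r)) for r in ranks)


# EcoliK: Phylum BXII, Class III, Order XIII, Family I, Genus I.
code = bergey_code("B", "XII", "III", "XIII", "I", "I")
print(code)  # B12.3.13.1.1
```

The example reproduces the code given for EcoliK in the Appendix text.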