key: cord-0697903-x8zdlml2
authors: Fertil, Bernard; Massin, Matthieu; Lespinats, Sylvain; Devic, Caroline; Dumee, Philippe; Giron, Alain
title: GENSTYLE: exploration and analysis of DNA sequences with genomic signature
date: 2005-07-01
journal: Nucleic Acids Res
DOI: 10.1093/nar/gki489
sha: 96650e6fa280653c83e6f9b1ba1e5217f45e4799
doc_id: 697903
cord_uid: x8zdlml2

GENSTYLE (http://genstyle.imed.jussieu.fr/) is a workspace designed for the characterization and classification of nucleotide sequences. Based on the genomic signature paradigm, GENSTYLE focuses on oligonucleotide frequencies in DNA sequences. Users can select sequences of interest in the GENSTYLE companion database, where the whole set of GenBank sequences is grouped per species, or upload their own sequences to work with. Tools for the exploration and analysis of signatures allow (i) identification of the origin of DNA segments (detection of rare species or species for which technical problems prevent fast characterization, such as micro-organisms with slow growth), (ii) analysis of the homogeneity of a genome and isolation of areas with novel functionality (horizontal transfers, for example) and (iii) molecular phylogeny and taxonomy.

A great number of DNA sequences are now available from web-based databases. DNA samples of >140 000 named organisms can be found, for example, in GenBank. The characteristics of these sequences have been extensively studied, and the extracted information is often interpreted in terms of evolution or systematic molecular biology. Many works are devoted to the so-called metagenomic analysis of DNA sequences. One approach deals with the frequencies of short oligonucleotides. Karlin and Burge initially focused on dinucleotide relative abundance (1). It quickly became obvious that the set of oligonucleotide frequencies was species specific (2–6). The set of oligonucleotide frequencies was subsequently considered to be a genomic signature.
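As an illustration of what "the set of oligonucleotide frequencies" means in practice, the following minimal sketch counts overlapping words of length k in a sequence and normalizes the counts to relative frequencies. It is not the GENSTYLE implementation: the function name `signature`, the choice to skip words containing symbols other than A, C, G and T, and the single-strand counting are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def signature(seq: str, k: int = 4) -> dict[str, float]:
    """Relative frequency of every k-letter word (overlapping) in seq.

    Hypothetical helper: words containing anything other than A, C, G, T
    (for example N) are skipped, which is an assumption of this sketch.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    total = sum(counts.values())
    # Enumerate all 4**k words so that absent words get an explicit 0.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: counts[w] / total for w in words}


# Toy sequence, for illustration only.
sig = signature("ACGTACGTAC")
print(len(sig))      # number of distinct 4-letter words
print(sig["ACGT"])   # relative frequency of the word ACGT
```

With k = 4 the result always has 4^4 = 256 entries, whatever the length of the input, which is what allows fragments of very different sizes to be compared.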
Studies based on genomic signature are becoming more and more popular (7, 8). It has been observed that the genomic signature results from a species-specific 'writing STYLE' (4, 9, 10). Indeed, on the one hand, the genomic signatures of species differ from one another, and, on the other hand, the majority of genome segments within a species have comparable signatures. As a consequence, each species can be assigned a DNA style that can be derived from most of its available DNA fragments. The methodology that we have developed thus makes it possible to study and compare a great number of sequences and species, inasmuch as the calculation of a signature on a laptop computer requires <1 s per million nucleotides. The genomic signature is visualized as a parametric image using the 'chaos game representation' algorithm (3, 5, 8, 11–14). Our experience with genomic signatures shows that the comparison of four-letter word signatures offers a good trade-off between accuracy of classification, usual size of DNA fragments and computer load (9, 15). In our hands, comparison of signatures is achieved by means of the Euclidean metric in a space with 256 dimensions (there are 256 different 4-letter words). Of course, other methods for comparison of signatures are available. They often provide slightly different results [see Refs (4, 8, 16, 17) for some other measures of dissimilarity]. It must be pointed out that comparisons of DNA style do not require homologous sequences and almost any DNA segment is eligible (4, 9). In fact, the species-specific DNA style concept motivates and justifies most of the works dealing with the genomic signature, including, for example, assignment of genomic fragments (4, 18), taxonomic/phylogenetic analyses (15, 17, 19) and detection of horizontal transfers (HTs) (20, 21). Detection of HTs is a major application of the DNA style concept. Some of the abnormal patterns in a genome may be considered to result from HTs.
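The four-letter-word signature, its chaos game layout and the Euclidean comparison mentioned earlier in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the GENSTYLE code: the corner assignment in `CORNERS` is one common convention (layouts differ between papers), and the helper names `word_frequencies`, `cgr_cell`, `cgr_image` and `euclidean` are hypothetical.

```python
import math
from itertools import product

# Assumed (x, y) corner for each nucleotide; other layouts are in use.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def word_frequencies(seq, k=4):
    """Relative frequencies of all 4**k words, in a fixed lexicographic order."""
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in counts:  # words with N or other symbols are ignored
            counts[w] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in words]


def cgr_cell(word):
    """Column and row of a word in the 2**k x 2**k chaos game grid.

    Each letter halves the distance to its corner, so the last letter
    selects the quadrant and earlier letters refine the position.
    """
    col = row = 0
    for i, letter in enumerate(word):
        x, y = CORNERS[letter]
        col |= x << i
        row |= y << i
    return col, row


def cgr_image(seq, k=4):
    """Signature as a square image: one cell per word, value = frequency."""
    size = 2 ** k
    image = [[0.0] * size for _ in range(size)]
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    for w, f in zip(words, word_frequencies(seq, k)):
        col, row = cgr_cell(w)
        image[row][col] = f
    return image


def euclidean(sig_a, sig_b):
    """Euclidean distance between two signatures of equal length."""
    return math.dist(sig_a, sig_b)


# Toy sequences, for illustration only.
sig_a = word_frequencies("ACGT" * 50)
sig_b = word_frequencies("AATT" * 50)
print(euclidean(sig_a, sig_b))
print(cgr_cell("AAAA"), cgr_cell("GGGG"))
```

For k = 4 the image is 16 x 16, so each of the 256 words owns exactly one cell and the image and the 256-dimensional vector carry the same information.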
Numerous methods relying on a gene's nucleotide or oligonucleotide composition for the detection of HTs are available (22–32). Among them, hidden Markov models (HMMs) and wavelet transforms are two efficient approaches in use for detecting and characterizing original motifs and patterns. Their performances have been subjected to extensive comparisons (20, 21, 31, 32). Many other applications are emerging, such as the characterization of unknown sequences, the quality control of sequencing and pre-processing for homologous sequence screening.

A web service (http://www.megx.net/tetra) has recently been made available for the comparison of tetranucleotide usage patterns in DNA sequences (33). It comes with pre-computed tetranucleotide usage patterns for 166 prokaryote chromosomes as a source for limited data mining.

GENSTYLE is grounded in the genomic signature paradigm. It offers three sets of tools for the characterization and classification of nucleotide sequences. Parts of GENSTYLE were made accessible to the bioinformatics community through our site (http://genstyle.imed.jussieu.fr/) starting in 1999, after the publication of the seminal paper describing the concept and its usefulness (3). The current version results from a substantial redesign that developed into the GENSTYLE workspace. Three dedicated toolboxes have been implemented for collecting, selecting and processing sequences. The sequence analysis toolbox is made for

(i) Identification of the origin of short DNA fragments. Any DNA sequence is eligible for searching for its origin. This feature is useful, for example, for the recognition of rare and/or slow-growing organisms (sequences usually hard to characterize).
(ii) Detection of 'atypical' areas in a genome, in particular the detection of HTs (and potential donors). The species whose signatures are closest to that of an atypical DNA segment give clues about the donor in the case of putative HTs (under implementation).

(iii) Building of taxonomic and phylogenetic trees. Distance between signatures remains to be established as a reference for phylogenetic studies, but several recent and interesting results have shown its potentially great value (15, 17). In particular, our current work with coronaviruses is very promising in this respect.

There is a large genomic signature database behind GENSTYLE that greatly enhances its power and scope. The full set of GenBank sequences, stored by species, is available for signature studies. The GENSTYLE companion genomic signature database handles 170 000 species and unspecified organisms (>2 000 000 DNA sequences). It is updated on a regular basis, using the bimonthly releases issued by GenBank.

GENSTYLE tools are available from within a user workspace. This makes it possible to work online on the whole set (or part of it) of GenBank nucleotide sequences belonging to one or several species. Users can also upload their own sequences to work with. Tools for the exploration and analysis of signatures are straightforward. They do not require much prior knowledge. Results are displayed in specific windows with images, tables and charts. Most of the outputs can be downloaded for further processing. The user's workspace can be saved for later use. There are three toolboxes in the GENSTYLE workspace:

(i) Sequence collector toolbox. The sequence collector allows the workspace to be loaded with sequences of interest. Sequences can be selected through the GENSTYLE companion database browser. The user's sequences (FASTA format, possibly grouped into a single text file, zipped or not) can be uploaded through the uploader.

(ii) Sequence filters toolbox.
Although many sequences can be uploaded to a given workspace, it may be useful to work on selected subsets. Several tools are available for this task, including selection by DNA type and by sequence size.

Online versions of additional tools already in use in our lab are currently under development. They include navigation along genomes by means of local signatures (for HT detection, for example), visualization of similarities between local signatures along several genomes and taxonomic trees.

A tutorial is available online. It demonstrates how the origin of a small DNA sequence can be looked for in the GENSTYLE companion database. Briefly, the sequence of interest has to be pasted into the appropriate field of the demonstrator tool (Figure 1A). The sequence signatures for oligonucleotides (words) 1-9 nt long are subsequently calculated, oligonucleotide counts are obtained (Figure 1A) and signatures are displayed (Figure 1B). Specific word counts and frequencies are available in popup windows (Figure 1C). Species with the closest signatures are then determined (Figure 1D). Distances to the sequence of interest are expressed in an arbitrary unit (AU). It can be seen that the sequence of interest belongs to the SARS virus (d = 11) and that the closest species are the PEDV and PTGV coronaviruses. Although this procedure seems to mimic BLAST/FASTA functions, it is quite different in nature. Similarities between sequences can be observed even when they are not homologous. As a consequence, the origin of a sequence can be obtained once enough DNA material to characterize the genomic signature of the species of origin is available (typically 2000 nt). Homologous DNA counterparts are not required in the database.
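The search for the species with the closest signatures can be sketched as a nearest-neighbour ranking over per-species reference signatures. The code below is a toy illustration under loud assumptions: the two "species" are synthetic random sequences that differ only in base composition, the names `toy_species_AT_rich` and `toy_species_GC_rich` and the helpers `toy_genome` and `closest_species` are hypothetical, and the distances are raw Euclidean values, not the arbitrary units reported by GENSTYLE.

```python
import math
import random
from itertools import product

WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def signature(seq):
    """Tetranucleotide relative frequencies in a fixed word order."""
    counts = dict.fromkeys(WORDS, 0)
    for i in range(len(seq) - 3):
        w = seq[i:i + 4]
        if w in counts:
            counts[w] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in WORDS]


def closest_species(query, references):
    """Rank reference species by Euclidean distance to the query signature."""
    q = signature(query)
    ranked = [(math.dist(q, sig), name) for name, sig in references.items()]
    return sorted(ranked)


def toy_genome(weights, length, seed):
    """Synthetic sequence with the given A, C, G, T weights (illustration only)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", weights=weights, k=length))


# Hypothetical reference database: one signature per species.
references = {
    "toy_species_AT_rich": signature(toy_genome((4, 1, 1, 4), 50000, seed=1)),
    "toy_species_GC_rich": signature(toy_genome((1, 4, 4, 1), 50000, seed=2)),
}

# A 2000 nt fragment drawn from the AT-rich model but not from the
# reference sequence itself, so nothing homologous is in the database.
fragment = toy_genome((4, 1, 1, 4), 2000, seed=3)
ranking = closest_species(fragment, references)
for distance, name in ranking:
    print(f"{name}: d = {distance:.4f}")
```

Real genomes differ in far subtler ways than these synthetic models, so the sketch only shows the mechanics of the lookup: the fragment shares no stretch of sequence with its reference by construction, yet it is assigned through its word usage alone.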
1. Dinucleotide relative abundance extremes: a genomic signature
2. Comparative DNA analysis across diverse genomes
3. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences
4. Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier
5. Analysis of genomic sequences by Chaos Game Representation
6. A novel bioinformatic strategy for unveiling hidden genome signatures of eukaryotes: self-organizing map of oligonucleotide frequency
7. Pervasive properties of the genomic signature
8. The spectrum of genomic signatures: from dinucleotides to chaos game representation
9. Genomic signature is preserved in short DNA fragments
10. Classification of species based on DNA style
11. Chaos game representation of gene structure
12. Chaos game visualization of sequences
13. Mathematical characterization of Chaos Game Representation. New algorithms for nucleotide sequence analysis
14. Entropic profiles of DNA sequences through chaos-game-derived images
15. A genomic schism in birds revealed by phylogenetic analysis of DNA strings
16. Distance, correlation and mutual information among portraits of organisms based on complete genomes
17. Relationship of SARS-CoV to other pathogenic RNA viruses explored by tetranucleotide usage profiling
18. Application of tetranucleotide frequencies for the assignment of genomic fragments
19. Evolutionary implications of microbial genome tetranucleotide frequency biases
20. Detection and characterization of horizontal transfers in prokaryotes using genomic signature
21. A new computational method for the detection of horizontal gene transfer events
22. How to interpret an anonymous bacterial genome: machine learning approach to gene identification
23. Biased biological functions of horizontally transferred genes in prokaryotic genomes
24. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences
25. Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis
26. Probabilistic and statistical properties of words: an overview
27. What can we learn with wavelets about DNA sequences?
28. Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models
29. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA
30. Horizontal gene transfer in bacterial and archaeal complete genomes
31. On surrogate methods for detecting lateral gene transfer
32. Reconciling the many faces of lateral gene transfer
33. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences
34. Multivariate Statistics

Conflict of interest statement. None declared.