key: cord-0013321-i8vakna6 authors: Cheng, Jinkui; Cao, Fuliang; Liu, Zhihua title: AGP: A Multimethods Web Server for Alignment-Free Genome Phylogeny date: 2013-02-06 journal: Mol Biol Evol DOI: 10.1093/molbev/mst021 sha: f85eaba391dc057bd5538f16fefec5fdbd5fa4fd doc_id: 13321 cord_uid: i8vakna6 Phylogenetic analysis based on alignment method meets huge challenges when dealing with whole-genome sequences, for example, recombination, shuffling, and rearrangement of sequences. Thus, various alignment-free methods for phylogeny construction have been proposed. However, most of these methods have not been implemented as tools or web servers. Researchers cannot use these methods easily with their data sets. To facilitate the usage of various alignment-free methods, we implemented most of the popular alignment-free methods and constructed a user-friendly web server for alignment-free genome phylogeny (AGP). AGP integrated the phylogenetic tree construction, visualization, and comparison functions together. Both AGP and all source code of the methods are available at http://www.herbbol.org:8000/agp (last accessed February 26, 2013). AGP will facilitate research in the field of whole-genome phylogeny and comparison. Phylogenetic analysis reveals the evolutionary derivation of species. A phylogenetic tree is traditionally inferred from multiple sequence alignment of conservative proteins or genes. Alignment methods and tools have been widely used for the construction of phylogenetic trees. However, with the development of various genome sequencing projects, alignment methods meet huge challenges when dealing with wholegenome sequences. Alignment methods cannot evaluate the recombination, shuffling, and rearrangement events of the whole genomes, and whole-genome multiple alignments are computationally intensive. These obstacles for alignmentbased phylogenetic reconstruction motivated several alignment-free methods in recent years. Two main categories of alignment-free methods have been proposed including methods based on word frequency and methods that do not require resolving the sequence with fixed length word. The first category includes feature frequency profile (FFP), composition vector (CV), return time distribution (RTD), and frequency chaos game representation (FCGR). FFP assembles the frequency information for all the possible words of fixed length k (k-mers) into an FFP vector, and the selection of word length is critical in the method Sims et al. 2009; Sims and Kim 2011) . CV method subtracts the random background of these frequencies using a Markov model to diminish the influence of random neutral mutations to highlight the shaping role of selective evolution and then puts these normalized frequencies in a fixed order into a CV (Qi et al. 2004; Gao and Qi 2007; Xu and Hao 2009; Yu, Liang, et al. 2010) . Chaos game representation (CGR) was proposed as a scale-independent representation for genomic sequences (Jeffrey 1990) . Each CGR is a unique fingerprint of the underlying sequence. However, the CGRs are not directly comparable. If the CGRs are divided by grid lines, each grid square denotes the occurrence of one pattern of k-mers in the sequence (Deschavanne et al. 1999; Almeida et al. 2001 ). These frequencies can be represented as FCGRs. FCGRs are numerical matrices and can be used to infer phylogenetic trees (Wang et al. 2005; Hatje and Kollmar 2012) . RTD was defined as the time required for the reappearance of particular k-mers. Two statistical parameters ( and ) of each RTD were used to derive a feature vector (Kolekar et al. 2011 (Kolekar et al. , 2012 . Methods that are not based on k-mers are rather heterogeneous, for example, average common subsequence (ACS), graph-based methods, the Kr estimator, and methods based on information correlation or compress. ACS calculates the pairwise genome sequence distances using average common substring at every site of each sequence (Cohen and Chor 2012) . The Kr estimator designed by Haubold et al. is closely related to the ACS, which calculates the number of substitutions per site between two unaligned DNA sequences using the shortest absent substring Haubold et al. 2009 ). Graph-based methods have been used for graphical representations of DNA sequences (Gates 1985; Nandy et al. 2006 ). The features from graphical representations of DNA sequences have been developed to capture the essence of the base composition and distribution of the sequences in a quantitative manner. Deng et al. (2011) and Huang et al. (2011) used natural vector and a 10-dimensional statistical ß The Author 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com vector to characterize the two-dimensional (2D) graphical DNA curve, they were called two-dimensional natural vector (2DNV) and two-dimensional statistical vector (2DSV), respectively. Yu, Chu, et al. (2010) converted the 2D genome space to an N-dimensional moment vector (N equals to the length of genome, and this method was called 2DMV) and used the first n components of the vectors to construct phylogenetic tree for 34 lentiviruses based on whole-genome sequences. Methods based on information correlation emphasized the base correlation property of DNA sequence. Information correlation and partial information correlation (IC-PIC) and base-base correlation (BBC) were proposed to analyze the phylogenetic relationships among species (Liu et al. 2005; Gao and Luo 2011; Liu, Zeng, Yang, Ren, et al. 2012; Zeng et al. 2012) . Sequence distance measures based on information compress were proposed using Lempel-Ziv and Kolmogorov complexity (Li et al. 2001; Otu and Sayood 2003) . Lempel-Ziv complexity uses the relative information between the sequences and is computationally intensive. Kolmogorov complexity can be regarded as the ultimate lower bound of all measures of information, it is a theoretical limit and cannot be computed in the general case (Li et al. 2001) . Although many alignment-free methods have been proposed, only the CV and Kr methods have been implemented as web servers, source code for FFP, and Lempel-Ziv complexity can be obtained from authors who proposed the methods. Thus, we implemented 12 popular alignment-free methods and constructed a user-friendly web platform for alignment-free genome phylogeny (AGP) research. AGP also implemented methods for phylogenetic tree visualization and comparison. AGP will facilitate the phylogenetic researches in the field of whole-genome phylogeny and comparison. AGP implemented 12 alignment-free methods for the construction of phylogeny trees using whole genomes and four methods for phylogenetic tree comparison. AGP constructed the first user-friendly multimethods web server for the phylogeny analysis using alignment-free methods and whole genomes. AGP integrated functions including phylogenetic tree construction, visualization, and comparison, which is a comprehensive multimethods platform for the phylogenetic analysis of whole genomes. There are total six pages in AGP. The home page is shown in figure 1. "METHODS" and "TREECOMPARE" pages perform the web server functions, including phylogenetic tree construction, visualization, and comparison. "METHODS" page gives a simple introduction of all alignment-free methods and lists every method with a hyperlink, which will lead you to its input page. Detailed information about the method and input data is described on the input page. For all methods, the input genome file must be in multi-FASTA format. For methods based on k-mers (e.g., FFP, CV, FCGR, and RTD), you need to supply the k value, which indicates the fixed length of word. The k value has influence on the results of sequence comparisons, which is determined by the length of genome sequences. For BBC, IC-PIC, and 2DMV, you also need to set the k parameter, which indicates the max distance between bases and the number of components of the moment vector used in the analysis. The reference range of each k value is also supplied to users on the input page of the method. For FCGR, RTD, 2DNV, 2DSV, 2DMV, IC-PIC, and BBC, you could select one of the 10 distance methods to calculate the distance matrix among genome sequences. To compare with the traditional alignment-based method, AGP also provided functions for the phylogeny construction based on whole-genome alignment, which was implemented using MUMmer (Kurtz et al. 2004) . When all input data have been successfully submitted, the web server will return you back the computing results. The results of phylogeny analysis contain distance matrices, phylogenetic trees, and tree maps. Two kinds of distance matrices were provided in Phylip and Nexus formats. Phylogenetic trees were formatted into standard Newick and Nexus files, which can be used for editing and viewing in other tools directly (e.g., Mega, Phylip, and TreeView). Circular and rectangular tree maps were rendered into five types of figures including TIFF, GIF, JPG, PS, and PNG (Felsenstein 1989; Page 1996; Tamura et al. 2011) . All results can be viewed and downloaded online. Result page for the phylogenetic analysis of 155 complete chloroplast genomes using FFP is shown in figure 2. When you obtain phylogenetic trees using different methods with various parameters, you can compare the differences between these trees. You need to put these trees in a plain text file. Each tree starts with the symbol ">" and the name of the tree at a new line; the following lines describe the structure of the tree. Then you can submit the file to "TREECOMPARE" page. The web server will return back the comparison results in an all-by-all distance matrix. Optionally, you can choose the distance method used for the comparison. These four distance methods were implemented for tree comparison in AGP, including RobinsonFoulds, Symmetric, FalsePositivesAndNegatives, and Euclidean. AGP implemented 12 alignment-free methods and 10 distance methods for the construction of phylogenetic trees. The phylogenetic trees constructed were outputted as standard Newick and Nexus files and visualized as circular and traditional rectangular tree maps. To compare with the traditional alignment-based method, AGP also implemented functions for the phylogeny construction based on wholegenome alignment. Furthermore, AGP implemented four methods for the comparison of phylogenetic trees constructed in the analysis. All results can be viewed and downloaded online. AGP is the first multimethods platform for alignment-free phylogeny analysis, and it will help researchers AGP implemented 12 alignment-free methods including FFP, CV, FCGR, RTD, ACS, Kr, 2DNV, 2DSV, 2DMV, IC-PIC, BBC, and Lempel-Ziv. FCGR, RTD, 2DNV, 2DSV, 2DMV, IC-PIC, and BBC converted genome sequences into numerical multidimensional vectors and then used 10 kinds of methods to compute the distance matrix among the vectors, including Euclidean, Braycurtis, Canberra, Chebyshev, Cityblock, Correlation, Cosine, Minkowski, Seuclidean, and Sqeuclidean. FFP and CV represented genome sequence as a 4 k -dimension frequency vector of k-mers and calculated the distance matrix using the distance formula published in the articles (Qi et al. 2004; Sims et al. 2009 ). ACS, Kr, and Lempel-Ziv did not convert genome sequences into vectors, they computed pairwise genome distances directly. When we got the distance matrix among genomes analyzed, we used neighbor-joining method to construct the phylogenetic trees (Saitou and Nei 1987) . AGP implemented four methods for the comparison of phylogeny trees obtained, including the following: RobinsonFoulds: This method returns the Robinsons-Foulds distance between two trees, the sum of the square of differences in branch lengths for equivalent splits between two trees (Robinson and Foulds 1981) . Symmetric: The symmetric distance between two trees is the sum of the number of splits found in one of the trees but not the other. A Multimethods Web Server for AGP . doi:10.1093/molbev/mst021 MBE FalsePositivesAndNegatives: This method returns a tuple pair, with the first element is the number of splits in the first tree but not found in the second tree compared, whereas the second element is the number of splits in the second tree, which are not in the first tree. Euclidean: This method returns the "branch length distance" of Felsenstein (2004) , the sum of absolute differences in branch lengths for equivalent splits between two trees. We programmed methods CV, FCGR, RTD, ACS, 2DNV, 2DSV, 2DMV, IC-PIC, and BBC according to the algorithms published with the methods. All methods were implemented using the Python language. Phylogenetic tree visualization and comparison functions were implemented based on a Python environment for tree exploration (ETE) and DendroPy Python Packages (Huerta-Cepas et al. 2010; Sukumaran and Holder 2010) . The web server was implemented based on the web2py framework (http://www. web2py.com/, last accessed February 26, 2013). Analysis of genomic sequences by chaos game representation Detecting phylogenetic signals in eukaryotic whole genome sequences A novel method of characterizing genetic sequences: genome space with biological distance and applications Genomic signature: characterization and classification of species assessed by chaos game representation of sequences Efficient estimation of pairwise distances between genomes PHYLIP-phylogeny inference package (version 3.2) Inferring phylogenies Whole genome molecular phylogeny of large dsDNA viruses using composition vector method Genome-based phylogeny of dsDNA viruses by a novel alignment-free method Simpler DNA sequence representations A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method Estimating mutation distances from unaligned genomes Alignment-free comparison of genome sequences by a new numerical characterization ETE: a python environment for Tree exploration Chaos game representation of gene structure Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignmentfree method with optimal feature resolution Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping Genotyping of Mumps viruses based on SH gene: development of a server using alignment-free and alignment-based methods Versatile and open software for comparing large genomes An information-based sequence distance and its application to whole mitochondrial genome phylogeny Classifying genomic sequences by sequence feature analysis A novel feature-based method for whole genome phylogenetic analysis without alignment: application to HEV genotyping and subtyping Coronavirus phylogeny based on base-base correlation Identification of medicinal vines by ITS2 using complementary discrimination methods Applying DNA barcodes for identification of plant species in the family Araliaceae Mathematical descriptors of DNA sequences: development and applications A new sequence distance measure for phylogenetic tree construction TreeView: an application to display phylogenetic trees on personal computers CVTree: a phylogenetic tree reconstruction tool based on whole genomes Comparison of phylogenetic trees The neighbor-joining method: a new method for reconstructing phylogenetic trees Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions Whole-genome phylogeny of Escherichia coli/ Shigella group by feature frequency profiles (FFPs) DendroPy: a Python library for phylogenetic computing MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods The spectrum of genomic signatures: from dinucleotides to chaos game representation CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes A novel construction of genome space with biological geometry Whole-proteome phylogeny of large dsDNA viruses and parvoviruses through a composition vector method related to dynamical language model Phylogenetic study of Oryzoideae species and related taxa of the Poaceae based on atpB-rbcL and ndhF DNA sequences