key: cord-0769620-2lg8i9r1 authors: Dai, Qi; Guo, Xiaodong; Li, Lihua title: Sequence comparison via polar coordinates representation and curve tree date: 2012-01-07 journal: J Theor Biol DOI: 10.1016/j.jtbi.2011.09.030 sha: 80fb940d534395c1cb9f3814740b6ee5f3bb7200 doc_id: 769620 cord_uid: 2lg8i9r1 Sequence comparison has become one of the essential bioinformatics tools in bioinformatics research, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Existing graphical representation methods have achieved promising results in sequence comparison, but there are some design challenges with the graphical representations and feature-based measures. We reported here a new method for sequence comparison. It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. The curve tree was then constructed to numerically characterize the closed curve of biological sequences, and further compared biological sequences by evaluating the distance of the curve tree of the query sequence matching against a corresponding curve tree of the template sequence. The proposed method was tested by phylogenetic analysis, and its performance was further compared with alignment-based methods. The results demonstrate that using polar coordinates representation and curve tree to compare sequences is more efficient. With the development of high-throughput sequencing technology, the rate of addition of new sequences to the databases increases continuously. However, such a collection of sequences does not by itself increase the scientist's understanding of the biology of organisms. Comparing a new sequence with the sequences of known functions is an effective way of assigning function to the new genes/proteins and understanding the biology of that organism from which the new sequence comes. Many methods have been proposed for sequence comparison. They can be categorized into two classes. One is alignment-based methods, in which dynamic programming that finds an optimal alignment by assigning scores to different possible alignments and picking the alignment with the highest score. Several alignmentbased algorithms have been proposed such as global alignment, local alignment, with or without overlap (Gotoh, 1982; Needleman and Wunsch, 1970; Smith and Waterman, 1981) . Waterman (1995) and Durbin et al. (1998) provided comprehensive reviews about this method. However, the search for optimal solutions using sequence alignment has problems in computationally load with large biological databases and choice of the scoring schemes (Pham and Zuegg, 2004; Vinga and Almeida, 2003) . Therefore, the second class, alignment-free methods, was proposed to overcome the limitations of alignment-based methods. Graphical representation is one of widely used alignment-free methods. It provides a simple way of viewing, sorting and comparing various gene sequences with their intuitive pictures and pattern. Various graphical representations have been proposed during the past 10 years (Hamori and Ruskin, 1983; Gates, 1986; Nandy, 1994; Leong and Morgenthaler, 1995; Randic et al., 2003a Randic et al., , 2003b Randic et al., , 2006 Randic et al., , 2001 Liao and Wang, 2004; Chi and Ding, 2005; Yao et al., 2005; Zhang and Liao, 2007; Zhang and Chen, 2006; Huang et al., 2009 Huang et al., , 2011 Wu et al., 2010; Maaty et al., 2010a Maaty et al., , 2010b Bai et al., 2011; Xie and Mo, 2011; Yao and Wang, 2004; Liu et al., 2006; Randic, 2000; . Randic et al. (2011) gave a comprehensive review on this method. All the graphical representations generally differ in two aspects: graphical representations and feature-based similarity measures. Graphical representation of DNA sequences was first proposed by Hamori and Ruskin (1983) in which DNAs have been shown as 3D curves. Gates (1986 ), Nandy, (1994 , and Leong and Morgenthaler (1995) developed 2D graphical representations of DNA sequences. These methods are straightforward but are accompanied with some loss of information due to overlapping and crossing of the curve representing DNA with itself. Randic et al. (2003a) developed a novel 2D representation method to overcome the degeneracy of the graphical representation. Recently, several other 2D and 3D representations have been proposed (Liao and Wang, 2004; Chi and Ding, 2005; Yao et al., 2005; Zhang and Liao, 2007; Zhang and Chen, 2006; Huang et al., 2009 Huang et al., , 2011 Wu et al., 2010; Maaty et al., 2010a Maaty et al., , 2010b Bai et al., 2011; Xie and Mo, 2011; Randic et al., 2003b Randic et al., , 2006 Randic et al., , 2001 Yao and Wang, 2004; Liu et al., 2006; Randic, 2000; . According to the handling bases of biological sequences, all the methods can be classified as: single nucleotide-based (Liao and Wang, 2004; Chi and Ding, 2005; Yao et al., 2005; Zhang and Liao, 2007; Zhang and Chen, 2006; Huang et al., 2009 Huang et al., , 2011 Wu et al., 2010; Maaty et al., 2010a Maaty et al., , 2010b Bai et al., 2011; Xie and Mo, 2011; Randic et al., 2003a Randic et al., , 2006 Yao and Wang, 2004) and dual nucleotide-based representations (Liu et al., 2006; Randic, 2000; Randic et al., 2001; . They often assign the n bases to corresponding points (Liao and Wang, 2004; Chi and Ding, 2005; Yao et al., 2005; Zhang and Liao, 2007; Zhang and Chen, 2006; Huang et al., 2009 Huang et al., , 2011 Wu et al., 2010; Maaty et al., 2010a Maaty et al., , 2010b Bai et al., 2011; Xie and Mo, 2011) , to the four lines (Randic et al., 2003b (Randic et al., , 2006 , or to the cell/system (Yao and Wang, 2004; Liu et al., 2006) to design the graphical representation. Some features of graphical representations have been proposed to capture the essence of the base composition/distribution of the sequences and further facilitate biological sequence comparison. These widely used features are always associated with the central coordinate and distance matrices. The central coordinate can effectively characterize the whole changes of the geometrical curves and has been widely used for biological sequence comparison (Liao and Ding, 2006; Wen and Zhang, 2009; Abo ElMaaty et al., 2010) . Another useful tool for characterization of biological sequences is distance matrix that is proposed by Randic and Vracko (2000) and further developed by Randic et al. (2001) , Song and Tang (2005) , and Liao and Wang (2004) . They first transformed the graphical representations of biological sequences into distance matrices such as E matrix, D/D matrix, L/L matrix and their ''high order'' matrices. Then they extracted the invariants of matrices such as leading eigenvalue, average row element, etc. to numerically characterize the biological sequences and designed the feature-based similarity measure for sequence comparison. Although the above graphical representation methods have achieved promising results, there are some problems in developing graphical representations and designing the feature-based similarity measures. First, many graphical representations were designed by assigning the single bases or dual nucleotides to corresponding direction/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single nucleotide or dual nucleotides in biological sequences. Second, the choice of the direction/points/cells for the single base or dual nucleotides is arbitrary. Finally, the feature-based similarity measures are always associated with the invariants of the distance matrices that are gotten by complex repetitive computation. When the sequences are long, these kinds of feature-based similarity measures become less useful. Moreover, we believe that better representation and similarity measure will allow us to design more effective sequence comparison method. This paper introduced a novel method to represent and compare biological sequences. Based on the whole distribution of the dual bases, we proposed a polar coordinates representation that maps a biological sequence into a closed curve. The closed curve was then transformed into a curve tree instead of the distance matrix, and a tree matching distance was proposed to estimate the similarity of two biological sequences. To assess the effectiveness of the proposed method, we took two experiments and compared its performance with the alignment-based method. Given a DNA Sequence, almost all the graphical representations map it into a curve in Cartesian coordinates. The polar coordinates have not been used for graphical representation of DNA sequence until now. In addition, dual nucleotides have been introduced to design graphical representations (Liu et al., 2006; Randic, 2000; Randic et al., 2001; , in which each dual nucleotide is assigned to a corresponding point in Cartesian coordinates, but the distribution of the dual nucleotides is not considered in graphical representation. Here, we propose a novel graphical representation of DNA sequence in polar coordinates based on the distribution of the dual nucleotides. Given a DNA sequences, there are 16 kinds of the dual nucleotides. The distribution of the dual nucleotides consists of their frequencies in a given sequence. For a sequence s, the frequency of a dual nucleotide w XY , denoted by f(w XY ), is the number of occurrence of w XY in the sequence s, where X A fA,C,G,Tg, Y A fA,C,G,Tg. The standard approach for calculating the frequencies of the dual nucleotide in a sequence of length m is to use a sliding window of length 2, shifting the frame one base at a time from position 1 to m À2 þ1, in which dual nucleotides are allowed to overlap in the sequence. In this way, the distribution of the dual nucleotides is represented by a 16-dimensional vector F s where . .,f 16 Þ. When the vector F s 2 of DNA sequence is given, we are interested in its graphical representation in polar Coordinates. We first calculate the radius and angles of the distribution of the dual nucleotides as follows: where o is a weighted value. Then we plot 16 feature points based the above radius and angles in the polar coordinates. Spline function is introduced to fit a smooth curve to a set of the radius and angles of the distribution of the dual nucleotides. Consider a cubic spline with abscissas x i and ordinates y i , i¼0,2,y,NÀ 1. If the second derivatives at each point are known, the spline function has the form and y 00 is the second derivatives. Here, we choose x i based on the distribution and the biological sequence Using the spline function, we obtain the function values y(i), i¼1,2,3,y,n À 1. Plot x(i) and y(i), we will get the closed curve of a DNA sequence in the polar coordinates. For example, Fig. 1 is the polar representation of the coding sequence of the first exon of b-globin gene of Human. As for the parameter o in the definition of radius and angles, we have performed extensive experiments. The polar coordinates representation with different o show a clear trends: o ¼ 1 is suitable for the short sequence. As the sequence length increases, its closed curve with o ¼ 1 is becoming more and more like a circle, which is not suitable for comparing various sequences with their intuitive pictures and pattern. For example, the polar coordinates representation of HCoV-229E coronavirus genomes is shown in Fig. 2(a) . At the same time, if the value of o is too large, the curvature difference on the small-scale will be covered that is not good for sequence representation either. Therefore, we should increase o to a suitable value if the sequence length is large. As for the sequence, its polar representation becomes more inerratic with o ¼ 2 represented in Fig. 2(b) . In order to facilitate characterize and compare different polar representations, we map the closed curve into a curve tree that is constructed as following two steps: (1) dividing a closed curve into two open curves and determining the direction of them; (2) constructing the curve tree. Given a closed curve, we should divide it into two open curves and determine the direction of the curves. Take a closed curve represented in Fig. 3 as an example, we first find the two farthest points on the curve: A and B, with which the closed curve is For example, if the curvilinear path of curve M 1 A is longer than other three curves, we define the curve AM 1 B as AB that is the first curve with initial point A, and the curve BM 2 A is the second curve denoted by CD whose initial point is C. When the lengths of the four curves are equal, we will define the initial point and the direction as the curvilinear path of curve M1A is the longest. As for an open curve presented in Fig. 4 , we will find its initial point A and M 0 , the midpoint of line AB. We draw the perpendicular bisector of line AB denoted by M 0 M that intersects the line AB at M. Then, we define h¼( À1) a MM 0 /AM 0 as the directed relative height of the line AB, where a A f0,1g. If the vector AM ! , MM 0 ! and the vector z that is upright perpendicular to the plane AMM 0 , are satisfied with the Right-Hand Rule, a is equal to 1; otherwise, a is equal to 0. The curve tree nodes store the directed relative heights of the line. Taking the curve of Fig. 5(a) for an example, the directed relative height of line AB, denoted as h 0,0 , is stored in the root node of the curve tree. The curve AB is divided into two open curves by the point M 0,0 . For these two open curves, we repeat the procedure mentioned above, their results are shown in Fig. 5(b) . We obtain the left child node and right child node of the root node h 0,0 , which are denoted as h 1,0 and h 1,1 , respectively. We then take the notes h 1,0 and h 1,1 as root nodes of the sub-trees, and repeat the operations until the obtained curve is almost a straight line. At last, we get a curve tree of the line AB, which is presented in Fig. 6. Given a DNA sequence, we can map it into a closed curve and construct a curve tree to characterize the closed curve. Here, we are not only interested in using the curve tree to characterize the closed curve, but also interested in facilitating comparison of the polar representation of DNA sequences. As we all know, the more similar two sequences, the closer the two sequences relate. It is also true for the curve trees. That is to say, if the relations of two curve trees are closer, they are more similar. On the basis of this assumption, we define a curve match distance of the two curves C 1 and C 2 as follows: is the values of the j-th node on the i-th layer in the complete binary tree corresponding to the curve tree T 1 (T 2 ) of the curve C 1 (C 2 ), z i is a weight, n is the layers that is determined by the actual precision required and curvature curve, cðxÞ is monotonically increasing function that is defined as following: The function cðxÞ enlarges the distance between the larger local differences to improve comparison accuracy, and reduces the distance between the smaller local differences to increase the anti-jamming ability of the curve distance. fz i g nÀ1 i ¼ 0 is a weight series that influences the distance by the element differences of the curve tree. The value of fz i g nÀ1 i ¼ 0 should be set as the appropriate value on the basis of actual needs. If the differences of large-scale curvatures are as the same as that of small-scale curvatures of the curve, it is better to choose fz i g nÀ1 i ¼ 0 as a constant series. If we pay more attention on the curvature difference on the large-scale when comparing the different curves, it is better to choose fz i g nÀ1 i ¼ 0 as a descending series; otherwise, it is better to choose fz i g nÀ1 i ¼ 0 as an increasing series. Biological sequence comparison is the essential motivation of polar representation of DNA sequences. Here, we propose intuitive and quantitative methods to compare biological sequences with help of the proposed polar representation. The alphabet representation of biological sequences is easily handled with computer but difficult for us to observe their differences. Graphical representation provides us with a simple way to view various biological sequences and facilitate sequence comparison with the intuitive pictures and pattern. In Fig. 7 , we present the polar representations of the first coding sequences of b-globin gene of Human, Gorilla, Gallus, and Rabbit. Comparing the closed curves, it is easily to find that the most similar pair is Human-Gorilla because they are Primates. The more similar pairs are Human-Rabbit and Gorilla-Rabbit, which is consist with the (Ferungulates, (Primates, Rodents)) grouping (Liao and Wang, 2004; Chi and Ding, 2005; Yao et al., 2005; Zhang and Liao, 2007; Zhang and Chen, 2006; Huang et al., 2009; Liao and Ding, 2006; Wen and Zhang, 2009 ). The closed curve of Gallus is dissimilar to the other because it is the only non-mammal animals among them. Therefore, polar representation provides us with a simple way to compare different biological sequences. For comparison, we list the recently published results of the examination of the degree of similarity between Human and other several species in Fig. 8 (Zhang, 2009; Yu et al., 2009; Wang et al., 2010; Xie and Mo, 2011) . As one can see there is an overall agreement among similarities obtained by different approaches despite some variation among them. But it is also noted that the degree of dissimilarities of Human-Goat and Human-Bovine are larger than that of Human-Opposum and Human-Gallus, which is an undesirable result because Gallus is the only non-mammal among them, and Opossum is the most remote species from the remaining mammals. Since the proposed curve matching distance DistðC 1 ,C 2 Þ is a distance measure, we can further evaluate the proposed method with phylogenetic analysis. Here, we choose two date sets that have been studied by many researchers (Liao and Wang, 2004; Chi and Ding, 2005; Yao et al., 2005; Zhang and Liao, 2007; Zhang and Chen, 2006; Huang et al., 2009; Liao and Ding, 2006; Wen and Zhang, 2009; Gu et al., 2004; Zhang, 2009 ). The first data set consists of the first exon of b-globin gene of 11 species presented in Table 1 . It is a small data set with average sequence length 92. The second data set are 24 coronavirus genomes with average length about 30,000. They are downloaded from GenBank, of which 12 are SARS-CoVs and 12 are from other groups of coronaviruses. The name, accession number, abbreviation, and genome length for the 24 genomes are listed in Table 2 . Given a set of biological sequences, their phylogenetic relationship can be obtained through the following main operations: firstly, we construct the polar representation of biological sequences and calculate their curve matching distance based on the curve tree; secondly, by arranging all the curve matching distance into a matrix, we obtain a pair-wise distance matrix; finally, we put the pair-wise distance matrix into the neighborjoining program in the PHYLIP package (Felsenstein, 1989 ). Fig. 9(a) is phylogenetic tree of the first exon of b-globin gene of 11 species using the proposed method with o ¼ 1 and Fig. 10(a) is phylogenetic tree of the 24 coronavirus genomes obtained using the proposed method with o ¼ 2 and fx i g nÀ1 Generally, an independent method can be developed to evaluate the accuracy of a phylogenetic tree, or the validity of a phylogenetic tree can be tested by comparing it with authoritative ones. Here, we adopt the form one to test the validity of our This work Xie and Mo, 2011 Wang et al.,2010 Yu et al.2009 Zhang,2009 . Comparison degree of similarity/dissimilarity between Human and other several species in Table 1 with the proposed method and methods in Zhang (2009 ), Wang et al. (2010 , and Xie and Mo (2011). phylogenetic tree. Both two data sets are aligned with the multiple alignmen CLUSTAL X and use the neighbor-joining to construct the phylogenetic tree presented in Figs. 9(b) and 10(b). From Fig. 9 , we find that the eleven species are separated clearly in our results: (1) three Primates (Human, Gorilla and Chimpanzee) are clustered closely; (2) two Rodents (Mouse and Rat) are grouped closely; (3) Rabbit is clustered closely with Human, Gorilla and Chimpanzee. (4) Opossum and Gallus are less closely with other species, which is consistent with the fact that Gallus is the only non-mammal among them, and Opossum is the most remote species from the remaining mammals. Our results are consistent with the results of the multiple alignment CLUSTAL X ( Fig. 9(b) ). Fig. 10(a) shows that our results are quite consistent with the authoritative results (Gu et al., 2004; Zhang, 2009 ) and that of the multiple alignment in the following aspects. First of all, all SARS-CoVs are grouped in a separate branch, which appear different from the other three groups of coronaviruses. Secondly, BCOV, BCOVL, BCOVM, BCOVQ, MHV, MHV2, MHVM, and MHVP are grouped into a branch, which is consonant with that they belong to group II. Thirdly, HCoV-229E, TGEV, and PEDV are closely related to each other, which is consistent with the fact that they belong to group I. Finally, IBV forms a distinct branch within the genus Coronavirus, because it belongs to group III. Rota et al. (2003) found out that the overall level of similarity between SARS-CoVs and the other coronaviruses is low. Our tree also reconfirms that SARS-CoVs are not closely related to any previously isolated coronaviruses and form a new group, which indicates that the SARS-CoVs have undergone an independent evolution path after the divergence from the other coronaviruses. Sequence comparison is one of the major goals of sequence analysis, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Despite the prevalence of the alignment-based methods, it is also noteworthy that it is computationally intensive and consequently unpractical for querying large data sets. Therefore, considerable efforts have been made to seek for alternative methods for sequence comparison. Graphical representation is one of widely used alignment-free methods to view, sort, and compare biological sequences. This work presented a novel method to represent and compare biological sequence. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. Then we transformed the closed curve into a curve tree instead of the distance matrix, and proposed a tree matching distance to estimate the similarity of two biological sequences. To compare the effectiveness of the proposed method, we performed extensive tests including similarity of biological sequences and phylogenetic analysis, and compared its performance with alignment-based method. The results demonstrate that the proposed method is efficient, which highlight the necessity for graphical representation method to consider the whole distribution of the dual bases. Thus, this understanding can then be used to guide development of more powerful graphical representation for biological sequence comparison. 3D graphical representation of protein sequences and their statistical characterization Similarity analysis of DNA sequences based on the EMD method Novel 4D numerical representation of DNA sequences Biological Sequence Analysis PHYLIP-Phylogeny inference package (version 3.2) A simple way to look at DNA An improved algorithm for matching biological sequences Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales H-curves, a novel method of representation of nucleotide series especially suited for long DNA sequences Similarity studies of DNA sequences based on a new 2D graphical representation Alignment free comparison of genome sequences by a new numerical characterization Random walk and gap plots of DNA sequences A 3D graphical representation of DNA sequences and its application PNN-curve: a new 2D graphical representation of DNA sequences and its application Analysis of similarity/dissimilarity of DNA sequences based on 3-D graphical representation 3D graphical representation of protein sequences and their statistical characterization Representation of protein sequences on latitude-like circles and longitude-like semi-circles A new graphical representation and analysis of DNA sequence structure: methodology and application to globin genes A general method applicable to the search for similarities in the amino acid sequence of two proteins A probabilistic measure for alignment-free sequence comparison Novel 2D graphical representation of DNA sequence based on dual nucleotides New 3D graphical representation of DNA sequence based on dual nucleotides Condensed representation of DNA primary sequences On the similarity of DNA primary sequences On the characterization of DNA primary sequence by triplet of nucleic acid bases Novel 2-D graphical representation of DNA sequences and their numerical characterization Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation A novel unexpected use of a graphical representation of DNA: graphical alignment of DNA sequences Graphical representation of proteins Characterization of a novel coronavirus associated with severe acute respiratory syndrome Identification of common molecular subsequences A new 2-D graphical representation of DNA sequences and their numerical characterization Alignment-free sequence comparison-a review Bilateral similarity function: a novel and universal method for similarity analysis of biological sequences Introduction to Computational Biology: Maps, Sequences, and Genomes: Interdisciplinary Statistics 2D-MH: a web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids A 2D graphical representation of protein sequence and its numerical characterization Three 3D graphical representations of DNA primary sequences based on the classifications of DNA bases and their applications A class of new 2-D graphical representation of DNA sequences and their application Analysis of similarity/dissimilarity of DNA sequences based on a 3-D graphical representation TN curve: a novel 3D graphical representation of DNA sequence based on trinucleotides and its applications DV-Curve: a novel intuitive tool for visualizing and analyzing DNA sequences Invariants of DNA sequences based on 2DD-curves On the similarity of DNA sequences based on 3-D graphical representation We thank the referees for many valuable comments that have improved this manuscript. This work is supported by the National