key: cord-0761119-g6obo798
authors: Zheng, Xiaoqi; Li, Chun; Wang, Jun
title: A complexity-based measure and its application to phylogenetic analysis
date: 2008-12-09
journal: J Math Chem
DOI: 10.1007/s10910-008-9511-3
sha: dc69a3dfee3f38b64a59b33b25eb6872dee7d3e6
doc_id: 761119
cord_uid: g6obo798

In this article, we propose two well-defined distance metrics of biological sequences based on a universal complexity profile. To illustrate our metrics, phylogenetic trees of 18 Eutherian mammals from comparison of their mtDNA sequences and 24 coronaviruses using the whole genomes are constructed. The resulting monophyletic clusters agree well with the established taxonomic groups.

The fast increase of many complete genomes of prokaryotes and eukaryotes raises a fundamental and challenging question to modern phylogenetics: how to reconstruct the phylogenetic history of different organisms using whole genomes? Traditional attempts require a multiple alignment of sequences and assume some sort of an evolutionary model. However, not to say the inherent computational complexity, it is meaningless to align two genomes because different genomes have different genes and gene order, and some evolutionary operations, such as rearrangements and lateral gene transfer, affect the final alignment seriously. Thus there is an urgent need to develop new sequence comparisons to deal with the ever increasing genome data.

Among the early attempts, Snel et al. [1] proposed gene content as a measure of similarity. The gene content between two sequences is defined as the number of genes they share divided by their total number of genes. This method is successful to compare long genomes for its light computational load, but fails to distinguish closely related species, e.g., mitochondrial (mt) genomes of placental mammals (all of them share the same genes and gene order). Observing that relative abundances of all dinucleotides are remarkably constant across the genome, Karlin et al. [2] [3] [4] proposed the "genome signature" to describe a genome. The genome signature consists of the array of dinucleotide relative abundances ρ xy = f xy / f x f y extended over all dinucleotides, where f x is the frequency of nucleotide x and f xy is the frequency of dinucleotide x y. The final distance between two genomes is defined as the distance between their corresponding "signatures". Blaisdell [5] proposed a Markov chain model of biological sequences, and the difference between two sequences was quantified by the Euclidean distance between their transition matrices.

From the last decade of the 20th century, many data compression techniques, which were proved to be efficient in information storage and transmission, began to find their use in phylogenetic inferences [6] [7] [8] . The distance metric presented by Li et al. [6] is

where K (S|T ) is the conditional Kolmogorov complexity of S given T , and K (S) is the abbreviation of K (S| ), with an empty string. However, Kolmogorov complexity is not a recursive function, that is, it is not incorporated in a computational scheme, and thus generally can only be approximated [9] [10] [11] . The complexity measure proposed by Lempel and Ziv [12] [13] [14] was an explicitly computable implementation of K-complexity for finite sequences, and many text compression algorithms were based on their measure (gzip, zip, and Stacker, for instance).

In the following text, we will first introduce the basic concepts and some properties regarding "LZ complexity", after which some previously proposed distance metrics are discussed. In the main text, two mathematically rigorous distance metrics based on "LZ-complexity" are proposed, and their applications are shown by constructing phylogenetic trees of 18 Eutherian mammals and 24 Coronaviruses including SARS-CoVs.

For symbol sequences S, T and R defined over a finite alphabet A, let l(S) be the length of S, S(i) be the ith element of S and S(i, j) be the subsequence of S that starts at position i and ends at position j. The sequence R is called an extension of S if R can be written as a concatenation of S and a given sequence T , i.e., R = ST .

An extension R = ST from S is said to be reproducible, denoted by S→R, if there exists an integer p ≤ l(S) such that T (k) = R( p + k − 1), for k = 1, . . . , l(T ). For example, WACC→ WACCAC with p = 2, and AACGT→ AACGTCGTC with p = 3. Moreover, if an extra different symbol at the end of the extension process is allowed, i.e., S → R(1, l(R) − 1), we can obtain the definition of producible extension, denoted by S⇒R. For example AACGT⇒AACGTCGTCW with p = 3. Thus we can say if S→R then S⇒R, but the reverse is not always true. An extension is called exhaustive if it is producible but not reproducible. For instance, the extension AACGT⇒AACGTCGTCW is exhaustive, but AACGT⇒AACGTCGTC is not.

According to the above definitions, any sequence S can be generated from the null sequence using iterated processes of "producible" extension. For example, the generating processes of S =AACGT can be written as: ⇒A⇒AA⇒AAC⇒AACG⇒ AACGT, or ⇒A⇒AAC⇒AACG⇒AACGT. The LZ complexity of a sequence S, denoted by c(S), is defined as the minimum number of steps required to generate S from a null sequence using producible processes, e.g., c(AACGT) = 4. It is easy to declare that, in this case, each extension is exhaustive with a possible exception of the last one, and the LZ complexity of any sequence is unique.

Note that c(ST ) − c(S) measures the amount of information in T when treating information in S as free. So if S and T are closely related to each other, c(ST ) − c(S) will be very small, i.e., it measures the degree of dissimilarity. From this consideration, Otu and Sayood [7] defined the distances between two sequences as follows:

(1), (2) and their normalized versions were used to infer the phylogenetic relationships among 20 Eutherian mammals. The resulting monophyletic clusters agree well with the established taxonomic groups. Li et al. [15] defined a conditional producible operation. The minimum number of conditional producible operations were considered as the conditional complexity given T , denoted by c(S|T ). The distance between two sequences S and T was characterized by

However, one can find that the above two distances are actually not metrics. Mathematically, a function d(X, Y ) defined on a set S is called a distance metric if for any X, Y, Z ∈ S, the following three conditions are satisfied:

The above three conditions are essential characterizations of a distance measure. They are all necessary for the accurate clustering of a data set. But the above two distances do not satisfy the identity condition as d(S, S) = 0 if the last generating step is exhaustive, and d(S, T ) may be equal to 0 when S = T . Therefore, it is significant to revise the above distances or propose some more rigorous methods.

In the present work, we define two new distance metrics of symbol sequences. Instead of considering the conditional compression ratio as done by Otu and Sayood, we make use of a maximum operation. The distances between two sequences S and T are defined as:

where R can be any sequence over the alphabet A. The first distance is evaluated by the maximum complexity divergency between S and T when adding the same suffix. If change the suffix into prefix, we get d 2 (S, T ). We have two theorems below. Theorem 1 shows the validity of d 1 and d 2 , and Theorem 2 gives a concrete computational approach of d 1 .

The functions d 1 (S, T ) and d 2 (S, T ) are distance metrics, i.e., they satisfy the three axioms for a metric.

Proofs of above two theorems will be given in So d 2 is actual a refinement of d. However, the difference between d and d 2 is so small, and perhaps no difference can be detected between their resulting phylogenies. In the following, we will use an alternative distance, d 1 , to infer phylogenies of two data sets.

The mammalian phylogenetic relationship at the molecular level remains to be a controversial topic in nowadays molecular genetics. Different molecular data and analyses result in trees of different topological structures, and the most debatable is the relationship among three main groups of placental mammals, namely Primates, Ferungulates, and Rodents.

In the present work, we choose the whole mitochondrial genomes of 17 placental mammals and the platypus to construct the phylogenetic tree. Platypus, the only nonplacental mammal, is selected as the outgroup. All the 18 data files are obtained from GenBank. In the first step, pairwise distance matrix is calculated using the distance d 1 , then phylogenetic tree is constructed from the matrix using the UPGMA program in the PHYLIP package [16] . The final tree drawn by TREEVIEW [17] is shown in Fig. 1 . According to our tree, species within each main group are clustered accordingly, and platypus stays outside of all 17 placental mammals. Notably, our result supports the topology of [Primates (Rodents, Ferungulates)], which is slightly different from the results of Li et al. [6] and Otu and Sayood [7] . Fig. 1 Evolutionary tree of 18 mammalian species using the distance metric d 1 . The resulting tree supports the topology of (Primates (Rodents, Ferungulates)), that is, Rodents and Ferungulates are the closest pair As another application, we infer the evolutionary relationships of 24 coronaviruses including SARS-CoVs (Severe Acute Respiratory Syndrome). Coronaviruses are members of a family of enveloped viruses that replicate in the cytoplasm of animal host cell. According to the type of the host, coronaviruses isolated previously are classified into three groups, groups I and II contain mammalian viruses, whereas group III contains only avian viruses. Marra et al. [18] and Rota et al. [19] first chose data from above three groups and some SARS-CoVs to construct phylogenetic tree. Their results indicate that SARS-CoVs are not closely related to any of the previously characterized coronaviruses and form a new fourth group. Using similar data, Zheng et al. [20] applied a geometric approach. They transformed each coronavirus genome into "Z-curve", an equivalent graphical representation of DNA sequence, then used geometric center and three eigenvectors of "Z-curve" as descriptors of this genome.

Motivated by Rota et al., Marra et al. and Zheng et al., we use our distances to infer the phylogenetic relations of the above coronaviruses (data are shown in Table 1 ). However, different from the first data set (mitochondrial genomes), lengths of coronavirus genomes vary significantly (from 27 kb to above 31 kb). In order to eliminate the effects of different lengths, we normalize our distances by dividing the sum and maximum of c(S) and c(T ). The consensus tree drawn by TREEVIEW is shown as Fig. 2 . We find that our topology coincides well with the conventional taxonomic groups, i.e., coronaviruses within each typical group (I human coronaviruses, II bovine coronaviruses and Murine hepatitis viruses, III avian viruses) cluster accordingly, and Fig. 2 The consensus tree constructed by two normalized versions of the distance d 1 (by dividing the maximum and sum of c(S) and c(T ), respectively). According to this tree, all 12 SARS-CoVs are clustered and form new group, which is distantly related to the group II coronaviruses all 12 SARS-CoV strains are grouped together and form a new fourth group, which is distantly related to the group II coronaviruses. This result is in accordance with maximum likelihood tree built from a fragment of the spike protein [21] , and also agrees with the alignment tree from comparing replicase ORF1b amino acid sequences of some viruses [22] .

With the completion of many genome projects of Prokaryotes and Eukaryotes, whole genome phylogenies are available and expected to be more accurate compared to traditional experiments on only a single gene or a fragment of genome. However, multiple sequence alignment of genomic sequences is still a bottleneck, first due to the computational time, and second due to the inherent model assumptions. Therefore, there is a great need to develop new sequence comparisons free of the above problems. In recent years, a quantity of alignment-free methods, e.g. complexity-based approaches [6, 7] , k-words composition [23] and graphical representations [24] [25] [26] [27] [28] [29] [30] have been proposed. However, compared to the alignment method, alignment-free comparison methods are still in their premature stage.

In this article, we propose two well-defined distance metrics on the basis of a universal sequence complexity. To illustrate them, phylogenetic trees of 18 mammals and 24 coronaviruses are constructed, and results show that our distances can successfully cluster species at different levels. In the first data set, our tree supports the topology of [Primates (Rodents, Ferungulates)], i.e., Rodents and Ferungulates are the closest pair. Phylogenetic tree built from the second data set shows that all 12 SARS-CoV strains are grouped together and form a new fourth group, which is distantly related to the group II coronaviruses. In contrast to the traditional alignment method, our method does not suffer from some evolutionary operations, e.g., gaps and large rearrangements in genomic sequences. Moreover, it is fully automated, i.e., does not need any free parameter and human intervention. So it could serve as an alternative way of genome comparison, especially in the case that there are no agreed-upon evolutionary models.

Proof of Theorem 1 The positivity and symmetry conditions obviously hold according to the definition of d 1 . We will check the triangle condition by introducing a sequence Q.

which is the triangle inequality, as required. We now need to show that d 1 (3) If neither of the last generating step of S and T is exhaustive, we can add the same suffix Q to these two sequences till at least one of the last steps of S Q and T Q is exhaustive, which will come back to the above two cases.

In order to prove Theorem 2, we first give a lemma: 

Proc. Natl. Acad. Sci. USA

Proc. Natl. Acad. Sci. USA

Proc. Natl. Acad. Sci. USA

Proc. Natl. Acad. Sci. USA

Similarity analysis of DNA sequences based on the generalized LZ complexity of (0,1)-sequences, Preprint

Elements of Information Theory

Current Topics in Computational Molecular Biology

PHYLIP (Phylogeny Inference Package) version 3.5c. Department of Genetics

Visualization of DNA sequences based on 3DD-Curves, Preprint

Acknowledgements This work was supported in part by Leading Academic Discipline Project of Shanghai Normal University (No. DZL803) and Shanghai Leading Academic Discipline Project (No. S30405).