key: cord-0891562-y7697hx8 authors: Dai, Qi; Yan, Zhaofang; Shi, Zhuoxing; Liu, Xiaoqing; Yao, Yuhua; He, Pingan title: Study of LZ-word distribution and its application for sequence comparison date: 2013-11-07 journal: J Theor Biol DOI: 10.1016/j.jtbi.2013.07.008 sha: 3c30d7fa7a46ae3606943b7ca1cf3a833c04af83 doc_id: 891562 cord_uid: y7697hx8 Lempel–Ziv complexity has been widely used for sequence comparison and achieved promising results, but until now components' distribution in exhaustive history has not been studied. This paper investigated the whole distribution of LZ-words and presented a novel statistical method for sequence comparison. With the components' length in mind, we revised Lempel–Ziv complexity and obtained various sets of LZ-words. Instead of calculating the LZ-words' contents, we defined a series of set operations on LZ-word set to compare biological sequences. In order to assess the effectiveness of the proposed method, we performed two sets of experiments and compared it with alignment-based methods. With high-throughput production of gene and protein sequences, the rate of addition of new sequences to the databases increased exponentially. Such a collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing new sequences to those with known functions is a key way of understanding the biology of an organism. Many methods have been proposed for sequence comparison. They can be categorized into two classes. One is alignment-based methods, in which dynamic programming is used to find an optimal alignment by assigning scores to different possible alignments and picking the alignment with the highest score. Several alignment-based algorithms have been proposed such as global alignment, local alignment, with or without overlap (Gotoh, 1982; Needleman and Wunsch, 1970; Smith and Waterman, 1981; Randic, 2013a Randic, , 2013b . Waterman (Waterman, 1995) and Durbin et al. (Durbin et al., 1998) provided comprehensive reviews about this method. However, the search for optimal solutions using sequence alignment has problems in: (i) computationally load with large biological databases and (ii) choice of the scoring schemes (Pham and Zuegg, 2004; Vinga and Almeida, 2003) . Therefore, the emergence of research into the second class, alignment-free methods, is apparent and necessary to overcome critical limitations of alignment-based methods. Up to now, many alignment-free methods have been proposed, but they are still in the early development compared with alignmentbased methods. One of the most widely used alignment-free approaches is statistical model, in which each sequence is first mapped into an m-dimensional vector according to its k-word frequencies, and sequence similarity can then be measured by distance measures, such as Euclidean distance (Blaisdell, 1986 ), Pearson's correlation coefficient (Fichant and Gautier, 1987) , Kullback-Leibler discrepancy (Wu et al., 2001) , Cosine distance (Stuart et al., 2002) among their corresponding vectors. Recently, Ewens and Grant (2005) studied probabilistic properties of words in sequences, deducted the exact distributions and evaluated its asymptotic approximations. When the k-words occurring in biological sequence are estimative probabilities rather than frequencies, they are more easily described by more complex models such as Markov model (Pham and Zuegg, 2004; Hao and Qi, 2004; Wu et al., 2006; Apostolico and Denas, 2008) , mixed model (Kantorovitz et al., 2007) and Bernoulli model (Lu et al., 2008) . Graphical representation is another widely used alignment-free method. It provides a simple way to view, sort and compare various gene sequences with their intuitive pictures and pattern. Randic et al. gave a comprehensive review on these methods (Randic et al., 2011 (Randic et al., ,2013 . In order to facilitate comparison of different biological sequences, they transformed graphical representations into some mathematical objects such as E matrix (Yao and Wang, 2004; Liao and Wang, 2004; Song and Tang, 2005) , D/D matrix (Li and Wang, 2003; Yao and Wang, 2004; Liao and Wang, 2004; Song and Tang, 2005) , L/L matrix (Randic et al., 2003; Li and Wang, 2003; Yao and Wang, 2004; Liao and Wang, 2004; Song and Tang, 2005) and their "high order" matrices (Yao and Wang, 2004; Song and Tang, 2005) . Once a matrix was given, they calculated matrix invariants as descriptors of the sequence, such as the average matrix element, the average row sum, the leading eigen value and the Wiener number. But, for long sequences, these methods become less useful because they require complex repetitive computation to get matrix invariants. Recently, Out and Sayood introduced Lempel-Ziv (LZ) complexity to compute the distance between two DNA sequences (Out and Sayood, 2003) . Because it is based on exact direct repeats, the LZ complexity works well with the small DNA alphabet. Unlike DNA sequences, protein sequences and RNA secondary structures consist of more complex alphabets and structure information, which poses more of a challenge for LZ complexity. So Bacha and Baurain, Liu and Wang, Chen and Zhang presented several strategies in which protein sequences or RNA secondary structures were encoded to a new alphabet prior to computation of the LZ complexity (Bacha and Baurain, 2005; Liu and Wang, 2006; Zhang and Chen, 2010; Zhang and Wang, 2010; Chen and Zhang, 2012) . Zhang et al. found that the LZ complexity is strongly correlated with sequence length and proposed a normalized LZ complexity for sequence comparison (Zhang et al., 2009) . Taking into account a specific kind of the inexact copy in the text, Li et al. generalized the LZ complexity and proposed a new sequence distance measure for sequence comparison (Li et al., 2010) . Liu et al. introduced relative LZ complexity to depict the complexity relationship between two sequences (Liu et al., 2012) . All above LZ-based methods have achieved promising results in biological sequence comparison, but they generally placed a heavy emphasis on the number of components in the exhaustive history, so little attention has been paid to the components themselves. In this paper, we used the proposed revised LZ complexity to obtain a series of LZ-words from the exhaustive history of biological sequences. Based on the LZ-word distributions, we constructed a sorted union LZ-word set from which an indicator sequence was obtained. We then calculated numerical characteristics of the indicator sequence to compare biological sequences. The performance of the proposed method was evaluated by the phylogenetic analysis and comparison with alignment-based method. Given a finite alphabet Ω, let U, V and W be sequences over it. L(U) is the length of the sequence U, UðiÞ is the i-th element of U, and Uði; jÞ is the subsequence of U starting at position i and ending at position j. Here, Uði; jÞ ¼ ∅, for i 4 j. Concatenating V and W can construct a new sequence U¼VW, in this equation, V is named "a prefix" of U, and U is called "an extension" of V if there exists an integer i such that V ¼ Uð1; iÞ. An extension U¼VW of V is reproducible from V denoted by V-U, if there exists an integer P≤LðVÞ such that WðkÞ ¼ Uðp þ kÀ1Þ, for k ¼1, 2,…, L(W). A nonnull sequence U is producible from its prefix Uð1; jÞ, denoted by Uð1; jÞ⇒U, if Uð1; jÞ-Uð1; LðUÞÀ1Þ. For example: 01⇒0100 with p¼ 1. Note that, the producibility allows for an extra different symbol at the end. Usually, a DNA primary sequence can be taken as a string of letters A, G, C, and T, which denote the four nucleic acid bases: adenine, guanine, cytosine, and thymine, respectively. Let S ¼ s 1 s 2 :::s n to be a DNA sequence. To indicate a substring of S that starts at position i and ends at position j, we write Sði; jÞ, where is, Sði; jÞ ¼ s i s iþ1 :::s j for i≤j. Any sequence S can be built using a production process where at its ith step Sð1; h iÀ1 Þ⇒Sð1; h i Þ, which is described as following: (1) At the beginning, we had a null-sequence, denoted by ∅. We then added a prefixs 1 to ∅ and obtained a new sequence S. If LðSÞ 4 1, we added a symbol "n"after Sð1; 1Þ. If R could not be reproduced from the set, then joined Q and R to get a new prefix QR, and added a symbol "n"following QR. If R could be reproduced from the set, then checked again if R ¼ Sðh m ; h m þ 2Þ can reproduced from the sequence Q ¼ Sð1; h m Þ. If so, checked again if R ¼ Sðh m ; h m þ 3Þ can reproduced from the sequence Q ¼ Sð1; h m Þ, ⋯ and so on. There two possible cases: in the case R ¼ Sðh m ; LðSÞÞ, we ended the procedure and got new prefix Q R ¼ S; in another case R ¼ Sðh m ; h mþ1 Þ cannot be reproduced from the sequenceQ ¼ Sð1; h m Þ, we got a new prefix QR and added a symbol"n" behind it. (3) Repeated the step (2) until produce S. Instead of focusing on the total number of components in the exhaustive history, we analyzed the components themselves. For convenience, we denoted a component in the exhaustive history as a LZ-word, and all the components in the exhaustive history as a LZ-word set. For example, the LZ-words of S¼ ATGGTCGGTTTC can be gotten through the following steps, where n is used to separate the decomposition component: Generate a novel symbol A: Ø+A-A. Generate a novel symbol T: A+T-AT. Generate a novel symbol G: AnT+G-AnTnG. Copy the longest fragment+generate a additional symbol GT: Generate a novel symbol C: AnTnGnGT+C-AnTnGnGTnC. Copy the longest fragment+generate a additional symbol GGTT: Copy the longest fragment TC: AnTnGnGTnCnGGnTT-AnTnGn GTnCn GGnTTnTC. A, T, G, GT, C, GGTT and TC are the LZ-words of the sequence S. And {A, T, G, GT, C, GGTT, TC} is the LZ-word set of the sequence S. LZ complexity of a sequence is measured by the minimal number of steps required for its synthesis in a certain process. For each step only two operations are allowed in the process: either generating an additional symbol which ensures the uniqueness of each component or copying the longest fragment from the part of a synthesized sequence. When a new decomposition component Sð1; h i Þ is generated, it should be checked whether it is copied from the longest fragment of the Sð1; h iÀ1 Þ. Consequently, the length of LZ-word inevitably becomes large as production process going on. With this problem in mind, we proposed a revised LZ complexity that is described as following: (4) At the beginning, we had a null-sequence, denoted by∅. We then added a prefix s 1 to ∅ and obtained a new sequence S. If LðSÞ 4 1, we added a symbol"n"after Sð1; 1Þ. If R could not be reproduced from the set, then joined Q and R to get a new prefix QR, and added a symbol"n"following QR. If R could be reproduced from the set, then checked can reproduced from the setfSð1; h 1 Þ; Sðh 1 þ 1; h 2 Þ; ⋯; Sðh mÀ1 þ 1; h m Þg, ⋯ and so on. There two possible cases: in the case R ¼ Sðh m ; LðSÞÞ, we ended the procedure and got new prefix QR ¼ S; in another case R ¼ Sðh m ; h mþ1 Þ cannot be reproduced from the set fSð1; h 1 Þ; Sðh 1 þ 1; h 2 Þ; ⋯; Sðh mÀ1 þ 1; h m Þg, we got a new prefix QR and added a symbol "n" behind it. (6) Repeated the step (2) until produce S. Take the above sequence S ¼ATGGTCGGTTTC as an example, we obtained its revised LZ-words through the following steps, where n is used to separate the decomposition component: Generate a novel symbol A: Ø+A-A. Generate a novel symbol T: A+T-AT. Generate a novel symbol G: AnT+G-AnTnG. Copy the fragment G+ generate a additional symbol T: AnTnG Generate a novel symbol C: AnTnGnGT+C-AnTnGnGTnC. Copy the fragment G+generate a additional symbol G: AnTnGn GTnC-AnTnGnGTnCnGG. Copy the fragment T+generate a additional symbol T: AnTnGn GTnC-AnTnGnGTnCnGGnTT. Copy the fragment T+generate a additional symbol C: AnTnGn GTnCnGGnTT-AnTnGnGTnCn GGn TTnTC. A, T, G, GT, C, GG, TT and TC are revised LZ-words of the sequence S. And {A, T, G, GT, C, GG, TT, TC} is revised LZ-word set of the sequence S. It is interesting to note that the maximum length of the revised LZ-word set is 2, significantly smaller than that of the LZ-word set. Given a DNA sequence, we can get a revised LZ-word set. Here, we are interested not only in using the revised LZ-word set to numerically characterize the biological sequences, but also in facilitating comparison of biological sequences. There is a large body of literatures on word statistics, where a sequence is interpreted as a succession of symbols (Reinert et al., 2000) . A k-word is a series of k consecutive letters in a sequence. The word statistical analysis consists of counting occurrences of words and calculating their numerical characteristics. The standard approach for counting k-words in a sequence of length m is to use a sliding window of length k, shifting the frame one base at a time from position1 to m-k+1. Instead of counting the LZ-words' content, we analyzed the distribution diversity of revised LZwords and designed an operation measure to compare biological sequences. Given two DNA sequences X and Y, we obtained their revised LZ-word sets RLZSet X and RLZSet Y . We then blended RLZSet X and RLZSet Y to compose anther set RLZSet XÀY According to the length of revised LZ-words, the RLZSet XÀY set is divided into several mutually exclusive sets RLZSet t where RLZSet t XÀY ¼ fx∈RLZset XÀy jlengthðxÞ ¼ tg: We then lined the elements of the RLZSet t XÀY set in the lexicographic order and got an ordered ↑RLZSet XÀY set For example, if X¼ ATGCGTCGGTCCACCCACGTA and Y¼ ATCGGT CTGTTACAGACTACG are two given DNA sequences, we can get there ↑RLZSet X , ↑RLZSet X and ↑RLZSet XÀY sets: Now we focus on the blend degree of two biological sequences. Given any pair of neighboring elements in ↑RLZSet XÀY set, there are two possible cases: if one is from RLZSet X (RLZSet X ) and the other is from RLZSet Y (RLZSet X ), we suppose there is transition operation ( ) between them. Otherwise, they may both come from the same set RLZSet X (RLZSet Y ), we suppose there is extension operation (-) between them. Take above two sequences X and Y for an example, we first listed all the elements of the ↑RLZSet t X sets in a line with "" denoting them, and list all the elements of the↑RLZSet t Y sets in a line with "∘" denoting them. We then presented all the operations between the ↑RLZSet t X sets and the ↑RLZSet t Y sets based on the↑RLZSet XÀY set, which is shown in Fig. 1 . It is interesting to note that the transition operations in the operation figure indicate the similarity between the ↑RLZSet t X sets and the ↑RLZSet t Y set, and the extension operations imply their diversity. That is to say, the more the extension operations are, the more similar the ↑RLZSet t X sets and the ↑RLZSet t Y set are. According to that, we define length of the operations LðOÞ as follows: Given an operation LðOÞ with lengthξ, we counted its total appearances (N LðoÞ ¼ ξ ) in operation figure. Since LðOÞ varies with different valueζ, it can be regarded as a discrete random variable. Given a random variableLðOÞ, and a positive integer n, PðLðOÞ ¼ nÞ is the probability that LðOÞtakes the value n The collection of pairsðPðLðOÞ ¼ nÞ; nÞ, for all positive integer n, is the probability distribution of the listed as follows: Take all the operations in Fig. 1 for an example, the probability distribution of the operations is Based on the operation distribution function, we calculated its expectation and propose an operation measure (OMeasure) between two sequence X and Y, OMeasure, the average length of the operation, is depended on both the extension operations and transition operations. It is important to note that OMeasure only satisfies the identity and symmetry, it does not satisfy inequality conditions. So it is only a dissimilarity measure for sequence comparison. We are interested in OMeasure for two reasons. First of all, it provides an opportunity to study the components' distribution which is, in some ways, more singular than the total number of components in the exhaustive history. The second reason involves the lengths of the operations because differencing lengths of the operations strengthens the effects of the different operations. One of the characteristics of the revised Lempel-Ziv complexity is to check whetherR ¼ Sðh m ; h mþ1 À1Þcan be reproduced from the set fSð1; h 1 Þ; Sðh 1 þ 1; h 2 Þ; ⋯; Sðh mÀ1 þ 1; h m Þg instead of from the set Sð1; h m Þ. To find their difference, we compared their LZ-word's distributions. We first compared their component difference in the exhaustive histories. For example, HCoV-229E is a given sequence of Human coronavirus, its length is 27,317 with accession number NC_002645. With Lempel-Ziv complexity and revised Lempel-Ziv complexity, we got two exhaustive histories. We then deleted all the symbols "n" in the exhaustive histories and obtained two new deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ. It is difficult for us to observe sequence difference directly, but we can calculate k-word counts of the deduced sequences to assess their difference. Fig. 2 is the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ with k from 1 to 4. Interestingly, the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ are similar in Fig. 2 . That is to say, the sequence information held through Lempel-Ziv (LZ) complexity and revised Lempel-Ziv complexity operation is similar. In statistics, one-way analysis of variance (abbreviated one-way ANOVA) is a technique used to compare means of two or more samples (using the F distribution). Here, we used a one-way ANOVA to test whether the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ differ from each other. The F value obtained by one-way ANOVA test tells us whether the data is significantly different from the Gaussian distribution or not. We rejected the hypothesis if the test is significant at the 0.05 level. Since the F-value is 0.91 4F 0.05 , there is no significant difference between the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ. That is to say, the components in the exhaustive history between Lempel-Ziv complexity and revised Lempel-Ziv complexity are similar. We found that the total number of components in the exhaustive history with the revised Lempel-Ziv (LZ) complexity algorithm is 4491, which is 1026 larger than Lempel-Ziv (LZ) complexity algorithm. In addition to components' distribution in the exhaustive history, we also compared the distribution of lengths of the components in the exhaustive history. Take the HCoV-229E as an example, Fig. 3 is the comparison of length distribution of the components in the exhaustive history obtained by Lempel-Ziv complexity and revised Lempel-Ziv complexity. It is interesting to note that there is a great difference between the lengths of the components in the exhaustive history. The maximum length of the components obtained by Lempel-Ziv complexity is 16, while that of the components obtained by revised Lempel-Ziv complexity is only 10. The most appearance length of the components obtained by revised Lempel-Ziv complexity is 6, which is 2 smaller than that of the components obtained by Lempel-Ziv complexity. Comparison between Lempel-Ziv (LZ) complexity and revised Lempel-Ziv complexity illustrates that they can both extract the similar information of the primary sequences, but the component lengths in the exhaustive history obtained by the revised Lempel-Ziv complexity are obviously smaller than Lempel-Ziv complexity. So the revised Lempel-Ziv complexity is a better way to make the components easier to handle. Given two DNA sequences X and Y, we obtained their revised LZ-word sets RLZSet X and RLZSet Y with the revised Lempel-Ziv complexity. We then blended RLZSet X and RLZSet Y to compose anther set RLZSet XÀY . In order to highlight the influence of different LZ-words' size, we divided the RLZSet XÀY set into several mutually exclusive sets RLZSet t XÀY according to the length of revised LZ-word t. It is worthy to note that mutually exclusive sets RLZSet t XÀY rely heavily on set splitting methods. In order to evaluate the influence of the set splitting methods, we adopted the operation measure to classify HEV Genotypes with step-wise refinement of set splitting methods. HEV (Hepatitis E virus) is a non-enveloped, positive-sense, single-stranded RNA virus and belongs to Hepevirus genus under the separate family of Hepeviridea (Lu et al., 2006) . The genome of HEV is approximately 7.2 kb in length and contains a short 5′ untranslated region (5′ UTR), three overlapped open reading frames (ORF1, ORF2, and ORF3′) and a short 3′ UTR. We retrieved a total of 48 full-length HEV genome sequences from NCBI (http:// www.ncbi.nlm.nih.gov/). Abbreviation for the strains, accession number, nucleotide length, country, and genotype of all HEV genomes (Lu et al., 2006) are described in Table 1 . And the 48 HEV genomes were distinctly clustered into four genotypes by the traditional classification (Liu et al., 2008) . This experiment aims at assessing how well the operation measure with step-wise refinement of set splitting methods performs on classification. Here, set splitting methods with the stepwise refinement (SSM) are: In relation to the clustering literature (Handl et al., 2005) , Neighbor-joining (Felsenstein, 1989) , a classic tree construction algorithm, can be considered as hierarchical methods. These results are represented in Fig. 4 . To evaluate the performance of the operation measure for HEV genotypes classification, we counted the number of misplaced HEV genotype against a gold standard. For the classification of HEV genotypes, we took the traditional classification as the gold standard (Lu et al., 2006) . The numbers of misplaced HEV genotype for the operation measure with SSM 1 , SSM 2 , SSM 3 , SSM 4 , and SSM 5 are 1, 1, 2, 1 and 0, respectively. These results indicate that the higher the refinement scheme is, the higher the operation measure efficiency is. Since the outbreak of atypical pneumonia referred to as severe acute respiratory syndrome (SARS), more attentions have been paid to the relationships between the SARS-CoVs and the other coronaviruses, which would be helpful to discover drugs and develop vaccines against the virus. Generally, coronaviruses can be divided into three groups according to serotypes. Group I and group II contain mammalian viruses, while group II coronaviruses contain a hemagglutinin esterase gene homologous to that of Influenza C virus. Group III contains only avian. Based on the operation measure, we next considered to infer the phylogenetic relationships of coronaviruses with the complete coronavirus genomes. The 24 complete coronavirus genomes used in this article were downloaded from GenBank, of which 12 are SARS-CoVs and 12 are from other groups of coronaviruses. The name, accession number, abbreviation, and genome length for the 24 genomes are listed in Table 2 . Given a set of biological sequences, their phylogenetic relationship can be obtained through the following main operations: firstly, we construct the LZ-word set with revised Lempel-Ziv complexity and calculate the similarity/dissimilarity using operation measure; secondly, by arranging all the similarity/dissimilarity into a matrix, we obtain a pair-wise matrix; finally, we put the pair-wise distance matrix into the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) program in the PHYLIP package (Felsenstein, 1989) . Fig. 5(a) is phylogenetic tree of the 24 coronavirus genomes obtained using the proposed operation measure with SSM 5 . Generally, an independent method can be developed to evaluate the accuracy of a phylogenetic tree, or the validity of a phylogenetic tree can be tested by comparing it with authoritative ones. Here, we adopted the form one to test the validity of our phylogenetic tree. Both two data sets were aligned with the multiple alignment CLUSTAL X and constructed the phylogenetic tree presented in Fig. 5(b) . Fig. 5 (a) shows that our results are quite consistent with the authoritative results (Gu et al., 2004; Zheng et al., 2005) and that of the multiple alignment Fig. 5 (b) in the following aspects. First of all, all SARS-CoVs are grouped in a separate branch, which appear different from the other three groups of coronaviruses. Secondly, BCOV, BCOVL, BCOVM, BCOVQ, MHV, MHV2, MHVM, and MHVP are grouped into a branch, which is consonant with the fact that they belong to group II. Thirdly, HCoV-229E, TGEV, and PEDV are closely related to each other, which is consistent with the fact that they belong to group I. Finally, IBV forms a distinct branch within the genus Coronavirus, because it belongs to group III. Rota et al. (Rota et al., 2003) found out that the overall level of similarity between SARS-CoVs and the other coronaviruses is low. Our tree also reconfirms that SARS-CoVs are not closely related to any previously isolated coronaviruses and form a new group, which indicates that the SARS-CoVs have undergone an independent evolution path after the divergence from the other coronaviruses. Sequence comparison is one of the major goals of sequence analysis, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Despite the prevalence of the alignment-based methods, it is also noteworthy that it is computationally intensive and consequently unpractical for querying large data sets. Therefore, considerable efforts have been made to seek for alternative methods for sequence comparison. This work presented a novel method to compare biological sequence with the revised Lempel-Ziv complexity. Instead of focusing on the total number of components in the exhaustive history, we analyzed the distribution of components themselves. Then we defined transition and extension operations among the revised LZ-word sets and represented them in the operation figure. With the length of operations in mind, we designed an operation measure to estimate the similarity/dissimilarity of two biological sequences. To assess the effectiveness of the proposed method, two sets of evaluation experiments were taken, and its performance was further compared with alignment-based methods. The results demonstrate that the proposed method is efficient, which highlight the necessity for LZ-based method to consider the whole distribution of the components in the exhaustive history. Thus, this understanding can then be used to guide development of more powerful alignment-free for biological sequence comparison. Fast algorithms for computing sequence distances by exhaustive substring composition Application of Lempel-Ziv complexity to alignment-free sequence comparison of protein families A measure of the similarity of sets of sequences not requiring sequence alignment Comparative analysis of RNA molecules Biological Sequence Analysis Statistical Methods in Bioinformatics: An Introduction PHYLIP-Phylogeny inference package (version 3.2) Statistical method for predicting protein coding regions in nucleic acid sequences An improved algorithm for matching biological sequences Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales Computational cluster validation in postgenomic data analysis Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance A statistical method for alignment-free comparison of regulatory sequences Numerical characterization and similarity analysis of DNA sequences based on 2-D graphical representation of the characteristic sequences A generalization of Lempel-Ziv complexity and its application to the comparison of protein sequences 3-D graphical representation of DNA sequences and their numerical characterization A relative Lempel-Ziv complexity: Application to comparing biological sequences A method for rapid similarity analysis of RNA secondary structures A novel feature-based method for whole genome phylogenetic analysis without alignment: application to HEV genotyping and subtyping An improved string composition method for sequence comparison Phylogenetic analysis of global hepatitis E virus sequences: genetic diversity, subtypes and zoonosis A general method applicable to the search for similarities in the amino acid sequence of two proteins A new sequence distance measure for phylogenetic tree construction A probabilistic measure for alignment-free sequence comparison Very efficient search for protein alignment-VESPA Very efficient search for nucleotide alignments Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation Graphical representation of proteins Milestones in graphical bioinformatics Probabilistic and statistical properties of words: an overview Characterization of a novel coronavirus associated with severe acute respiratory syndrome A new 2-D graphical representation of DNA sequences and their numerical characterization Integrated gene and species phylogenies from unaligned whole genome protein sequences Alignment-free sequence comparison-a review Introduction to Computational Biology: Maps, Sequences, and Genomes: Interdisciplinary Statistics Statistical measures of DNA dissimilarity under Markov chain models of base composition Phylogenetic analysis using complete signature information of whole genomes and clustered Neighbour-Joining method A class of new 2-D graphical representation of DNA sequences and their application A complexity-based method to compare RNA secondary structures and its application Normalized Lempel-Ziv complexity and its application in bio-sequence analysis Comparisons of RNA secondary structures based on LZ complexity Coronavirus phylogeny based on a geometric approach We thank the referees for many valuable comments that have improved this manuscript. This work is supported by the National Natural Science Foundation of China (61170316, 61001214, 61003191).