key: cord-0764377-t89235t3 authors: Zheng, Wen-Xin; Chen, Ling-Ling; Ou, Hong-Yu; Gao, Feng; Zhang, Chun-Ting title: Coronavirus phylogeny based on a geometric approach date: 2005-05-10 journal: Mol Phylogenet Evol DOI: 10.1016/j.ympev.2005.03.030 sha: b31266b7d2b894122d2ff2689445789c579f2bc7 doc_id: 764377 cord_uid: t89235t3

A novel coronavirus has been identified as the cause of the outbreak of severe acute respiratory syndrome (SARS). Previous phylogenetic analyses based on sequence alignments show that SARS-CoVs form a new group distantly related to the other three groups of previously characterized coronaviruses. In this paper, a geometric approach based on the Z-curve representation of the whole genome sequence is proposed to analyze the phylogenetic relationships of coronaviruses. The evolutionary distances are obtained by measuring the differences among the three-dimensional Z-curves. The Z-curve is approximately described by its geometric center and the associated three eigenvectors, which indicate the center position and the trend of the Z-curve, respectively. Although some information is lost due to this approximate description, the phylogenetic tree constructed from these parameters is consistent with those of previous analyses. The present method has the merits of simplicity and intuitiveness, but it is still at an early stage of development. Because the phylogenetic relationships are inferred from the whole genome instead of from individual genes, the present method represents a new direction of phylogeny study in the post-genome era.

The outbreak of atypical pneumonia referred to as severe acute respiratory syndrome (SARS) was first identified in Guangdong Province, China, and later spread to several countries (Ksiazek et al., 2003; Lee et al., 2003; Peiris et al., 2003; Poutanen et al., 2003; Tsang et al., 2003). A novel coronavirus was isolated and found to be the cause of SARS.
Although SARS has been brought under control, scattered cases of SARS-CoV infection have been reported. No effective drugs are currently available to cure this disease. Gaining insight into the phylogenetic relationships among coronaviruses would help in discovering drugs and developing vaccines against the virus.

The SARS coronavirus is a new member of the order Nidovirales, family Coronaviridae, and genus Coronavirus. Members of this genus are a diverse group of large, enveloped, positive-stranded RNA viruses that cause respiratory and enteric diseases in humans and other animals. Excluding SARS-CoVs, coronaviruses can be divided into three groups according to serotype. Groups I and II contain mammalian viruses; group II coronaviruses additionally contain a hemagglutinin esterase gene homologous to that of Influenza C virus (Lai and Holmes, 2001). Group III contains only avian viruses. Previous work showed that SARS-CoVs are not closely related to any of the previously characterized coronaviruses and form a distinct group (group IV) within the genus Coronavirus (Marra et al., 2003; Rota et al., 2003).

An intuitive method is proposed in this article to infer the phylogenetic relationships of coronaviruses. Historically, Cork et al. proposed a three-dimensional representation of genomic sequences, called the W-curve (Wu et al., 1993). Since then, the W-curve has been used to analyze genomic sequences and to study the phylogeny of bacteria (Cork, 2003; Cork and Toguem, 2002). Instead of sequence alignment, we adopt a geometric method based on the Z-curve of the whole genome. The Z-curve is a three-dimensional space curve that constitutes a unique representation of a given DNA sequence, in the sense that each can be reconstructed given the other (Zhang and Zhang, 1991, 1994). Based on the Z-curve method, a coronavirus-specific gene-finding system, ZCURVE_CoV, has been developed, and the software is especially suitable for gene recognition in SARS-CoV genomes.
The system was further improved by taking into account the prediction of cleavage sites of viral proteinases in polyproteins (Gao et al., 2003). Here we use the differences between the three-dimensional space curves as the foundation for deriving the phylogeny of coronaviruses. The key problems are which parameters should be used to describe a curve and how to determine evolutionary distances among organisms from a group of curves. In this paper, we use a series of parameters, such as the geometric center and the covariance matrix, to reflect the center position and the distribution pattern of a curve, respectively. The result shows that SARS-CoVs form an independent group, which is consistent with previous analyses.

The 24 complete coronavirus genomes used in this paper were downloaded from GenBank; 12 are SARS-CoVs and 12 are from other groups of coronaviruses. The name, accession number, abbreviation, and genome length of the 24 genomes are listed in Table 1. According to the existing taxonomic groups, sequences 1-3 belong to group I and sequences 4-11 are members of group II, while sequence 12 is the only representative of group III (see Table 1 for details).

The Z-curve is a three-dimensional curve that constitutes a unique representation of a given DNA sequence, in the sense that each can be uniquely reconstructed given the other (Zhang and Zhang, 1991, 1994). The resulting curve has a zigzag shape, hence the name Z-curve. The Z-curve is briefly presented as follows. Consider a DNA sequence of N bases read from the 5′ end to the 3′ end. Beginning from the first base, inspect the sequence one base at a time. In the nth step, where n = 1, 2, ..., N, count the cumulative numbers of the bases A, C, G, and T occurring in the subsequence from the first base to the nth base, and denote them by $A_n$, $C_n$, $G_n$, and $T_n$, respectively. The Z-curve consists of a series of nodes $P_n$, where n = 1, 2, ..., N, whose coordinates are uniquely determined by the Z-transform of DNA sequences (Zhang and Zhang, 1991, 1994):

$$
\begin{aligned}
x_n &= (A_n + G_n) - (C_n + T_n) \equiv R_n - Y_n,\\
y_n &= (A_n + C_n) - (G_n + T_n) \equiv M_n - K_n,\\
z_n &= (A_n + T_n) - (C_n + G_n) \equiv W_n - S_n,
\end{aligned}
\qquad n = 0, 1, \ldots, N; \quad x_n, y_n, z_n \in [-N, N],
$$

where $A_0 = C_0 = G_0 = T_0 = 0$ and $x_0 = y_0 = z_0 = 0$. Here R, Y, M, K, W, and S represent the bases of puRine, pYrimidine, aMino, Keto, Weak hydrogen bonds, and Strong hydrogen bonds, respectively, according to the Recommendation 1984 of the NC-IUB (Cornish-Bowden, 1985). The line that connects the nodes $P_0$ ($P_0 = 0$), $P_1$, $P_2$, ..., $P_N$ one by one in sequence is called the Z-curve of the DNA sequence inspected.

The Z-curve defined above is a three-dimensional space curve with three independent components, $x_n$, $y_n$, and $z_n$, which display the distributions of bases of the R/Y, M/K, and W/S types, respectively, along the sequence. By viewing the Z-curve, some global and local features of the sequence can be detected in a perceivable way. For almost all genome or chromosome sequences, the curves of $z_n \sim n$ are roughly straight lines (Zhang et al., 2001). For convenience, the curve of $z_n \sim n$ is fitted by a straight line using the least-squares technique,

$$ z = kn, $$

where (z, n) is the coordinate of a point on the fitted straight line and k is its slope. Instead of the curve of $z_n \sim n$, we will use the $z'_n \sim n$ curve hereafter, where

$$ z'_n = z_n - kn. \tag{3} $$

In this paper, we propose a new way to infer evolutionary distances between organisms from whole genome sequences. As the Z-curve is a unique representation of a genome, it can be used to reflect a genome's characteristics (Fig. 1). For convenience, we use the coordinates (x, y, z′) rather than (x, y, z). The differences among the Z-curves of these genomes form the basis for constructing the phylogenetic tree. To study the phylogenetic relationships, the process can be separated into three stages.
First, the Z-curve of each genome is described by a set of parameters; second, the distance matrix is generated from the parameters obtained in the first stage; and finally, the phylogenetic tree is constructed from the distance matrix.

Fig. 1. The three-dimensional Z-curves (x, y, z′) for three complete coronavirus genomes. (A-C) The Z-curves of BJ01, TOR2, and BCoV, respectively. It can be clearly seen that the Z-curves of BJ01 and TOR2 are very similar, while the Z-curve of BCoV is significantly different from the former two. This forms the basis of the present method. (D) A sketch of the three eigenvectors for one genome (TOR2), which illustrates the relationship between the three eigenvectors and the Z-curve.

Table 2. The geometric center and three eigenvectors of the Z-curve for each of the 24 coronavirus genomes.

(i) The parameters of the Z-curve for each genome. Based on the Z-curve, any genome can be represented by a three-dimensional space curve composed of N nodes, one for every base position, with coordinates $(x_n, y_n, z'_n)$, where n = 1, 2, ..., N (Figs. 1A-C). To describe its characteristics, we calculate the following parameters. The first is the geometric center of all N nodes, that is, the mean of each coordinate over the curve:

$$ (\bar{x}, \bar{y}, \bar{z}') = \frac{1}{N} \sum_{n=1}^{N} (x_n, y_n, z'_n). $$

Consequently, we obtain $(\bar{x}, \bar{y}, \bar{z}')$ for each genome (see Table 2 for details). Then the covariance matrix, which describes the global distribution pattern of the three-dimensional space curve, is calculated; its elements are the covariances between the coordinates p and q of the nodes, where p, q = x, y, z′. The matrix is therefore real, symmetric, and 3 × 3. Using a 3 × 3 matrix to represent a three-dimensional Z-curve is a very rough approximation and entails considerable loss of information. However, the advantage is that this approximation makes it possible to compare genomes of different lengths: a 3 × 3 covariance matrix is uniquely derived from Eq. (6) for each given genome regardless of its length.
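The construction described so far, from the cumulative base counts to the covariance matrix, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, NumPy is assumed to be available, the least-squares line is assumed to pass through the origin (consistent with $z'_n = z_n - kn$), and the 1/N normalization of the covariance is an assumption, since the normalization is not recoverable from the text.

```python
import numpy as np

def z_curve(seq):
    """Return an (N, 3) array of Z-curve nodes (x_n, y_n, z_n) for n = 1..N.

    Characters other than A, C, G, T add nothing to any cumulative count.
    """
    s = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    # Cumulative counts A_n, C_n, G_n, T_n along the sequence.
    a, c, g, t = (np.cumsum(s == ord(base)) for base in "ACGT")
    x = (a + g) - (c + t)  # purine minus pyrimidine (R_n - Y_n)
    y = (a + c) - (g + t)  # amino minus keto (M_n - K_n)
    z = (a + t) - (c + g)  # weak minus strong hydrogen bonds (W_n - S_n)
    return np.column_stack([x, y, z]).astype(float)

def detrend_z(nodes):
    """Replace z_n by z'_n = z_n - k*n and return (new_nodes, k).

    Assumption: k is the least-squares slope of a line through the origin,
    k = sum(n * z_n) / sum(n^2).
    """
    n = np.arange(1, len(nodes) + 1, dtype=float)
    k = np.dot(n, nodes[:, 2]) / np.dot(n, n)
    out = nodes.copy()
    out[:, 2] -= k * n
    return out, k

def center_and_covariance(nodes):
    """Return the geometric center and the 3x3 covariance matrix of the nodes."""
    center = nodes.mean(axis=0)
    dev = nodes - center
    cov = dev.T @ dev / len(nodes)  # assumption: 1/N normalization
    return center, cov
```

For a genome string `genome`, a call such as `center_and_covariance(detrend_z(z_curve(genome))[0])` would yield the two descriptors used in the rest of the method. The choice of normalization rescales the eigenvalues of the covariance matrix but leaves its eigenvectors unchanged.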
From a geometrical point of view, the distribution pattern can be approximated by a three-dimensional ellipsoid. The direction of each principal axis of the ellipsoid is given by an eigenvector of the covariance matrix, and the length of the axis is proportional to the square root of the associated eigenvalue. Corresponding to each eigenvalue $\lambda_k$ there is an eigenvector $C_k$. Corresponding to $\lambda_1 < \lambda_2 < \lambda_3$, the three eigenvectors are denoted by $C_1$, $C_2$, and $C_3$, respectively. The eigenvalues and the associated normalized eigenvectors are easily obtained using the Jacobi algorithm. The geometric center and three eigenvectors of each of the 24 genomes are obtained in the same way (see Table 2 for the parameters).

(ii) The distance matrix derived from the above parameters. In this paper, the Euclidean distance is used to reflect the difference between two points:

$$ d_{ij} = \sqrt{(\bar{x}_i - \bar{x}_j)^2 + (\bar{y}_i - \bar{y}_j)^2 + (\bar{z}'_i - \bar{z}'_j)^2}, \qquad i, j = 1, 2, \ldots, M, $$

where $d_{ij}$ denotes the distance between the geometric centers of the ith and the jth genomes, and M is the total number of genomes (M = 24 here). We thus obtain a real symmetric M × M matrix whose elements are $d_{ij}$.

To reflect the differences between the trends of every two three-dimensional curves, the angles between the corresponding eigenvectors of every two genomes are used. The vectors are denoted by $C^i_k$, the kth eigenvector of the ith genome; each genome has three such eigenvectors. According to their projections on the three axes, the vectors can be divided into three groups, which are drawn with arrows of different styles (see Fig. 2A). The groups are clearly separated by their distribution in space. The dark group (X group) has the greatest projections on the x-axis, while the vectors drawn with dotted arrows (Y group) and grey arrows (Z′ group) have the greatest projections on the y-axis and the z′-axis, respectively.
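The eigenvector step, the grouping by dominant axis, and the two elementary comparisons between genomes (distance between centers, angle between matched eigenvectors) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, `numpy.linalg.eigh` is used in place of the Jacobi algorithm named in the text, and, because an eigenvector is only defined up to sign, each vector is flipped so that its dominant component is positive (an assumption made here so that angles between genomes are comparable).

```python
import numpy as np

AXES = ("x", "y", "z'")

def oriented_eigenvectors(cov):
    """Return (eigenvalues, eigenvectors) of a symmetric 3x3 matrix.

    Eigenvalues are in ascending order; row k of the second array is the
    unit eigenvector for eigenvalue k.
    """
    vals, vecs = np.linalg.eigh(cov)  # eigh returns eigenvectors as columns
    vecs = vecs.T.copy()              # switch to one eigenvector per row
    for v in vecs:
        # Assumption: resolve the sign ambiguity by making the
        # largest-magnitude component positive.
        if v[np.argmax(np.abs(v))] < 0:
            v *= -1
    return vals, vecs

def group_by_axis(vecs):
    """Map each axis name to the eigenvector with the greatest projection on it."""
    dominant = np.argmax(np.abs(vecs), axis=1)
    if sorted(dominant.tolist()) != [0, 1, 2]:
        raise ValueError("eigenvectors do not map one-to-one onto the axes")
    return {AXES[a]: v for a, v in zip(dominant, vecs)}

def center_distance(c_i, c_j):
    """Euclidean distance between two geometric centers."""
    return float(np.linalg.norm(np.asarray(c_i, float) - np.asarray(c_j, float)))

def angle(u, v):
    """Angle in radians between two vectors, from the cosine formula."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding
```

Applying `group_by_axis` to every genome and then `angle` to the vectors sharing an axis label gives the per-group angles between two genomes. How these angles are combined with the center distance into a single overall distance is not sketched here.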
Each genome thus contributes exactly one eigenvector to each of the three groups. The three groups of eigenvectors are denoted by $C^i_x$, $C^i_y$, and $C^i_{z'}$, respectively (see Table 2). The cosine of the angle between any two vectors in a given group can be computed as follows:

$$ \cos\theta^k_{ij} = \frac{C^i_k \cdot C^j_k}{|C^i_k|\,|C^j_k|}, \qquad k = x, y, z'. $$

Repeating this procedure for all three groups, we obtain three real symmetric M × M matrices. These matrices are then converted into angles, whose elements are

$$ \theta^k_{ij} = \arccos\left(\cos\theta^k_{ij}\right). $$

The sum of $\theta^k_{ij}$ over k for given i and j can be used to reflect the trend information of the eigenvectors involved:

$$ \theta_{ij} = \sum_{k} \theta^k_{ij}. $$

Consequently, two sets of parameters are obtained. The first reflects the difference in center positions, represented by the Euclidean distance between the geometric centers. The second indicates the difference in the trends of the Z-curves, represented by the related eigenvectors. The overall distance $D_{ij}$ between species i and j is defined from these two quantities.

(iii) Clustering. Accordingly, a real symmetric M × M matrix $D_{ij}$ is obtained and used to reflect the evolutionary distance between species i and j. The clustering tree is constructed using the UPGMA method in the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html). The final phylogenetic tree is drawn using the DRAWGRAM program in the PHYLIP package. The branch lengths are not scaled according to the distances; only the topology of the tree is considered.

As mentioned above, one of the advantages of the Z-curve is its intuitiveness. The features of a genome can be viewed intuitively regardless of how long the genome is. Therefore, global and local compositional features of a genome can be grasped quickly in a perceivable form. To give an intuitive picture of the differences among the three-dimensional curves, we take the genomes of TOR2, BJ01, and BCoV as examples. TOR2 and BJ01 are SARS-CoVs, whereas BCoV belongs to another group of coronaviruses.
From the coordinates and the trends in Figs. 1A-C, we can see that the Z-curves of TOR2 and BJ01 are almost the same, while that of BCoV is significantly different from both of them, indicating that the former two are phylogenetically close, whereas their relationship with the latter is more distant. Similarity of Z-curves implies a close evolutionary relationship between the organisms involved, and vice versa. This constitutes the basis of the current algorithm.

The Z-curve is approximately described by the geometric center and the eigenvectors, which indicate its center position and its trends, respectively (Fig. 1D). In Fig. 1D the three arrows represent the three eigenvectors, and the point from which they start is the geometric center. The three eigenvectors of a given genome can be divided into three groups according to their relationships with the axes (see Fig. 2). The trends of the Z-curves carry part of the information used to construct the phylogenetic tree, and some interesting results are revealed by this figure. It can be seen from Fig. 2B that the vectors in the Y group, which have the greatest projections on the positive y-axis, are almost perfectly coplanar and lie almost in the x-y plane. As can be seen from the plot, the 24 vectors nearly coincide and appear as a single vector. The same phenomenon can be seen in the data of Table 2: the absolute values of $C^i_{y,z'}$ ($i = 1, 2, \ldots, M$) are all smaller than 0.0059. That is to say, these vectors all have very small projections on the z′-axis and are confined to the x-y plane. The vectors in the X group and the Z′ group (drawn with black and grey arrows, respectively, in Fig. 2B) are also coplanar in the x-z′ plane, though their coplanarity is not as good as that of the Y group.

As mentioned above, there are three groups of coronaviruses.
Group I includes HCoV-229E, TGEV, and PEDV, and group II contains BCoV, BCoVL, BCoVM, BCoVQ, MHV, MHV2, MHVM, MHVP, etc. All the viruses in these two groups are mammalian viruses. Group III contains only avian viruses, of which only the genome of IBV has been completely sequenced. Many researchers have analyzed the phylogenetic relationships among coronavirus genomes based on the 3C-like proteinase, the polymerase, and the structural proteins S, E, M, and N (Marra et al., 2003; Rota et al., 2003). Their results indicated that SARS-CoVs are not closely related to any of the previously characterized coronaviruses and form a distinct group (group IV) within the genus Coronavirus (Marra et al., 2003; Rota et al., 2003).

As shown in Fig. 3, four groups of coronaviruses can be seen in the phylogram. The SARS-CoVs cluster together and form a separate branch, which can be distinguished easily from the other three groups of coronaviruses. IBV, belonging to group III, is situated on an independent branch, whereas TGEV, PEDV, and HCoV-229E, which belong to group I, cluster together. On another branch, the group II coronaviruses, including BCoV, BCoVL, BCoVM, BCoVQ, MHV, MHV2, MHVM, and MHVP, cluster together. First, group I and group II, which are all mammalian viruses, cluster together to form a bigger group. Second, this group joins group III, which contains only avian viruses, to form a still bigger group. Finally, the SARS-CoVs join them, resulting in the phylogenetic tree shown in Fig. 3. The resulting monophyletic clusters agree perfectly with the established taxonomic groups.

To validate the current method, a set of random sequences was used as a control. We generated 100 random sequences meeting the requirements of the method. Each time, a phylogenetic analysis was done using 25 sequences, comprising one random sequence and the 24 genomes. Consequently, 100 phylogenetic trees were obtained.
Ninety-eight of the 100 trees showed that the random sequence formed a distinct group without disturbing the other four groups. Only two of the random sequences disturbed the four groups, suggesting that the current method is robust to the addition of a random sequence.

Almost all of the previous analyses revealed that SARS-CoVs form a distinct group different from the other three groups of coronaviruses. However, the question of how SARS-CoVs emerged so suddenly remains open. Rota et al. and Marra et al. performed phylogenetic analyses based on sequence alignments of different genes. The results indicated that SARS-CoVs belong to a new group, but the group from which SARS-CoVs were derived could not be determined. The detection of SARS-CoV-like viruses in Himalayan palm civets and other small animals in a live retail market indicates a route of interspecies transmission, although the natural reservoir is unknown. Virus infection was also detected in humans working at the same market. All the animal isolates retain a special 29-nucleotide fragment, which is not found in most human isolates. Stavrinides and Guttman performed a phylogenetic analysis of the SARS virus replicase, surface spike, matrix, and nucleocapsid proteins. The results support a mammalian-like origin for the replicase protein, an avian-like origin for the matrix and nucleocapsid proteins, and a mammalian-avian mosaic origin for the host-determining spike protein. They proposed that a recombination event between mammalian-like and avian-like parent viruses within the S gene might have taken place (Stavrinides and Guttman, 2004). However, phylogenetic inference based on genome content tends to place a recombinant outside the related genomes, as seen in Fig. 3. Therefore, we emphasize that the method presented here is very unlikely to be able to trace evolutionary history such as a recombination event.
The present method reflects the global characters of genomes because the whole genome is taken into consideration. The phylogenetic tree (Fig. 3) reveals that the SARS-CoVs have followed an independent evolutionary path since their divergence from the other coronaviruses. As can be seen from Fig. 3, the distance between the SARS-CoVs and all the others is the greatest. We suppose that the precursor of SARS-CoV may have existed in some hosts and developed separately for many years. Grigoriev found that the mutational patterns in the SARS-CoV genome were strikingly different from those of the other coronaviruses in terms of mutation rates (Grigoriev, 2004). Phylogenetic analysis based on codon usage patterns suggested that SARS-CoV diverged far from all three known groups of coronaviruses (Gu et al., 2004). The overall level of similarity between SARS-CoVs and the other coronaviruses is low. We suppose that this is due to different evolutionary paths. The isolation of SARS-CoV-like virus in Himalayan palm civets indicates a route of interspecies transmission. We hypothesize that some events, such as nucleotide deletion or mutation in some important genes of the precursor, may have resulted in the change of host range.

Because viruses lack morphological features and exchange genes frequently, it is highly valuable to develop methods of molecular phylogeny for them. Phylogenetic analysis based on sequence alignments is now well developed. Sequence alignments are always based on particular genes or conserved fragments (Saitou, 1996). Such analysis can be done at both the amino acid level and the nucleotide level. To overcome the biases caused by individual genes or genome segments, it is valuable to develop methods of molecular phylogenetic analysis based on whole genome sequences.
Unlike the sequence alignment method, the current method is a geometric approach based on measuring the differences between the Z-curves of whole genomes, including coding and non-coding sequences. There is no need to search for similar sequences. Probably the most remarkable advantages of the present method are its simplicity and intuitiveness. The result shows that four groups exist in the genus Coronavirus. Note that group I (HCoV-229E, TGEV, and PEDV) and group II (BCoVM, BCoVL, BCoVQ, BCoV, MHVM, MHV2, MHVP, and MHV) first cluster together to form a bigger group. Second, this group joins group III (IBV) to form a still bigger group. Finally, the SARS-CoVs join them, resulting in the phylogenetic tree shown here. Also note that the resulting monophyletic clusters agree perfectly with the established taxonomic groups.

The increasing availability of complete genomes has cast doubt on the phylogenetic tree rather than adding detail to it (Qi et al., 2004). Phylogenetic analysis based on sequence alignments is usually done on the most conserved part of a gene. These fragments are usually coding sequences, especially the sequences coding for catalytic sites or the cores of proteins, because they tend to be more evolutionarily conserved. As one virologist has remarked, one cannot simply assume that a virus can be represented by its polymerase (http://www.ncbi.nlm.nih.gov/ICTV/). A virus must be viewed as a whole. Non-coding sequences also play an important role in the virus, as do the less conserved genes. In addition, analyses based on different genes may lead to different results. By using complete genomes, one avoids having to choose which genes to align. Methods based on the whole genome are therefore likely to be more objective.
Recently, a K-string composition approach was proposed to analyze prokaryote phylogeny based on the whole proteome, and satisfactory results were obtained (Qi et al., 2004); however, such analysis must rely on annotation information. In contrast, the complete genome sequence is the only input of the current method; neither annotation information nor adjustable parameters are needed. It is noteworthy that the current method is performed automatically without any human intervention.

The Z-curve, which serves as the foundation of the present method, is a powerful tool for studying complete genome sequences. The Z-curve contains all the information that the corresponding DNA sequence carries. Many biologically meaningful characteristics of a genome can be observed from the corresponding Z-curve, such as the replication origins and genomic islands of some bacterial and archaeal genomes. We can inspect a genome in an intuitive way regardless of gene content and gene order, even when the sequences are of different lengths. If the Z-curves of two species show a similar pattern even though the genomes have different lengths, one may infer that they are evolutionarily close organisms, and vice versa. In this paper, we use the geometric center and the eigenvectors to describe the pattern approximately. Although this is only a rough approximation and a first attempt to apply the Z-curve method to phylogenetic analysis, the results obtained agree well with previous analyses.

This method is intended for analyzing the phylogeny of genomes that are closely related. The phylogenetic analysis is based on the differences among the three-dimensional Z-curves. In this paper, the 24 genomes under study all belong to the same genus, Coronavirus. Additionally, the differences in length among the genomes are not very large.
If the genomes under study are much more distantly related and differ considerably in length, the present method may not work. Consequently, caution must be taken when using the present method to study the phylogeny of organisms separated by large evolutionary distances. In addition, unlike estimation based on the comparison of orthologous genes, the Z-curve approach is sensitive to genome rearrangements: a single large-scale inversion can change the form of the Z-curve drastically. Therefore, the method presented here is considerably limited in cases of genome rearrangement. Furthermore, as mentioned above, the three-dimensional Z-curve is approximately depicted by a few parameters, namely the geometric center and the associated three eigenvectors. Consequently, a considerable part of the information contained in the Z-curve is lost. It is reasonable to suppose that the more information is extracted from the Z-curve, the more accurate the result will be. The current method can therefore be improved if new and more effective algorithms are proposed for extracting the information contained in Z-curves. In summary, although the present method has some advantages, it is still at an early stage of development. The method may not be applicable to some general cases, and its applications are therefore considerably limited at present.

A geometric approach to inferring phylogenetic relationships based on the Z-curves of complete genomes is proposed in this article. Phylogenetic analysis of the 24 coronaviruses shows that SARS-CoVs belong to a new cluster, named group IV, and this result is consistent with those of previous analyses. The method has much room for improvement because information can be extracted from the whole genome instead of from individual genes. Despite some limitations, the current whole-genome-based geometric approach represents a new direction for inferring phylogenetic relationships of organisms in the post-genome era.
However, the method is still at an early stage of development and its applications are considerably limited at present.

We are indebted to both referees, whose comments were critical for improving the quality of the paper. We thank Ren Zhang for invaluable assistance. We are thankful to Prof. Jingchu Luo (Peking University) and Prof. Xi-Tai Huang (Nankai University) for their invaluable help. Discussions with Feng-Biao Guo and Bin-Guang Ma are acknowledged. The present study was supported in part by the National Natural Science Foundation of China (Grant 90408028).

ZCURVE_CoV: a new system to recognize protein coding genes in coronavirus genomes, and its applications in analyzing SARS-CoV genomes
Achieving consensus of long genomic sequences with the W-curve
Achieving congruency of phylogenetic trees generated by W-curves of genomic sequences
Using fuzzy logic to confirm the integrity of a pattern recognition algorithm for long genomic sequences: the W-curve of genomic sequences
Nomenclature for incompletely specified bases in nucleic acid sequences: recommendation 1984
Identification of a novel coronavirus in patients with severe acute respiratory syndrome
Prediction of proteinase cleavage sites in polyproteins of coronaviruses and its applications in analyzing SARS-CoV genomes
Mutational patterns correlate with genome organization in SARS and other coronaviruses
Analysis of synonymous codon usage in SARS coronavirus and other viruses in the Nidovirales
Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China
A novel coronavirus associated with severe acute respiratory syndrome
Coronaviridae: the viruses and their replication
A major outbreak of severe acute respiratory syndrome in Hong Kong
Coronavirus as a possible cause of severe acute respiratory syndrome
the National Microbiology Laboratory, Canada, and the Canadian Severe Acute Respiratory Syndrome Study Team
Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach
Reconstruction of gene trees from sequence data
Mosaic evolution of the severe acute respiratory syndrome coronavirus
A cluster of cases of severe acute respiratory syndrome in Hong Kong
Computer visualization of long genomic sequences
A novel method to calculate the G + C content of genomic DNA sequences
Analysis of distribution of bases in the coding sequences by a diagrammatic technique
Z-curves, an intuitive tool for visualizing and analyzing DNA sequences
The Z-curve database: a graphic representation of genome sequences