key: cord-0912931-iowyb7xs
authors: Gao, Lei; Qi, Ji; Wei, Haibin; Sun, Yigang; Hao, Bailin
title: Molecular phylogeny of coronaviruses including human SARS-CoV
date: 2003
journal: Chin Sci Bull
DOI: 10.1007/bf03183929
sha: c634588d1223a0fce5373a8e5321aac01e5cd8d5
doc_id: 912931
cord_uid: iowyb7xs

A phylogenetic tree of coronaviruses (CoVs) including the human SARS-associated virus is reconstructed from complete genomes by using our newly developed K-string composition approach. The relation of the human SARS-CoV to other coronaviruses, i.e. the rooting of the tree, is suggested by choosing an appropriate outgroup. SARS-CoV forms a separate group that is closer to, but still distant from, G2 (CoVs in mammalian hosts). The relation between different isolates of the human SARS virus is inferred by first constructing an ultrametric distance matrix from counts of sequence variations in the genomes. The resulting tree is consistent with the clinical relations between the SARS-CoV isolates. In addition to covering a larger variety of coronavirus genomes, these results provide phylogenetic knowledge based on an independent, novel methodology compared with recent phylogenetic studies on SARS-CoV.

The outbreak of SARS makes it an urgent task to reveal the origin of human SARS-CoV, i.e. its relation to other known species of coronavirus, and to trace the genetic variation in the spreading process of SARS. A partial answer to the problem may be obtained from phylogenetic analysis of available genomes. We call a phylogenetic tree of different species of coronavirus including the human SARS-CoV a "CoV Tree" and that of different isolates of SARS-CoV a "SARS Tree". CoV Trees have been constructed by maximal parsimony based on alignment of 405 nt of the CoV polymerase gene ORF 1b [1], and in comparison with predicted amino acid sequences for 6 different proteins [2].
Apart from the fact that SARS-CoV forms a group separate from the other three known groups, the precise location of the SARS group remains ambiguous. SARS Trees have been built for 5 isolates by aligning complete genomes [3] and for 14 isolates by maximal parsimony based on 16 sequence variations that occurred more than twice [4]. The interrelation of the various isolates remains largely uncertain. Moreover, since all SARS genomes sequenced so far are very close to each other, how to construct the SARS Tree requires special consideration. All this calls for a study of more species using an independent methodology. In particular, an appropriate choice of outgroup may provide a further indication of where to locate the root of the trees.

1 Material and methods

We use 14 complete coronavirus genomes and 17 complete SARS-CoV genomes from GenBank. Four genomes from Flaviviridae and Togaviridae are used as outgroup. Their abbreviations, accession numbers and descriptions are given in Table 1.

The CoV Tree is constructed by using our newly developed K-string composition method [5]. This method circumvents alignment of genomic sequences and does not require scoring matrices. It has been successfully applied to prokaryote genomes [5] and chloroplasts [6]. Since this approach yields an unrooted tree, the interrelationship among monophyletic groups is examined by adding an outgroup from two distant families of single-strand RNA viruses, Flaviviridae and Togaviridae. Statistical tests of trees built in this way have been discussed in [5] and will not be repeated here.

As regards the SARS Tree, the small size of the SARS-CoV genome tempts one to align complete genomes for tree construction. However, the high similarity of the sequences makes much of the alignment work redundant. In fact, there were only 42 single-letter variations in the first 12 SARS complete genomes (excluding ZJ01, BJ02-4 and GZ01). If one counts the variations between all pairs of genomes, the number varies from 1 to 21.
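The pairwise counting of single-letter variations described above can be illustrated with a minimal sketch. It assumes the genomes are already aligned to equal length; the function names, the toy sequences and the rule used for "occurred twice or more" (at least two genomes carrying a non-majority letter at a site) are illustrative assumptions, not the procedure or data of this paper or of [4].

```python
from collections import Counter
from itertools import combinations

def variable_sites(aligned, min_occurrence=1):
    """Return alignment columns where the sequences differ.

    A column is kept when at least `min_occurrence` sequences carry a
    letter other than the most common one (an assumed definition of a
    variation that "occurred twice or more" when min_occurrence=2).
    """
    length = len(next(iter(aligned.values())))
    sites = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned.values())
        if len(counts) < 2:
            continue  # column is identical in all sequences
        minority = sum(counts.values()) - max(counts.values())
        if minority >= min_occurrence:
            sites.append(pos)
    return sites

def pairwise_variation_counts(aligned, sites):
    """Count differing letters for every pair, restricted to `sites`."""
    return {
        (a, b): sum(aligned[a][p] != aligned[b][p] for p in sites)
        for a, b in combinations(sorted(aligned), 2)
    }

# Hypothetical toy alignment (not real SARS-CoV data).
toy = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGTAT",
    "D": "TCGAACGTAC",
}

all_sites = variable_sites(toy)                          # every variable column
recurrent_sites = variable_sites(toy, min_occurrence=2)  # drop single occurrences
all_counts = pairwise_variation_counts(toy, all_sites)
recurrent_counts = pairwise_variation_counts(toy, recurrent_sites)
```

Comparing `all_counts` with `recurrent_counts` shows how discarding single-occurrence variations shrinks the pairwise numbers, which is the trade-off between noise and lost signal discussed in the text.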
Taking the sequencing error rate to be 1 in 10000, there might be 2-3 errors in each genome and hence 4-6 spurious variations in each pairwise comparison. Keeping only the 16 sequence variations that occurred twice or more, as was done in [4], is a safe but overcautious approach, because there must be single-occurrence variations that are real. If one further excludes the synonymous nucleotide variations, these numbers drop from 42/16 to 27/13. Using maximal parsimony means keeping only 16 or 13 variations. Furthermore, the choice of outgroup becomes extremely difficult when all the genomes whose interrelationship we wish to resolve are very close to each other while the candidate outgroup is too distant, because an improper outgroup may change the internal branchings in a significant way.

In order to make use of all sequence variations at the cost of allowing some sequence errors, and to avoid the outgroup problem, we propose a new way of tree construction as follows. If one defines the distance between any two species as the branch length to their common ancestor on an additive phylogenetic tree, the distance matrix is ultrametric [9]. Conversely, we may take ultrametricity as a criterion to guide tree construction. A distance matrix derived in some way may not be ultrametric per se. However, starting from this matrix one may construct two ultrametric matrices which serve as lower and upper bounds to the original one. In between these two there exist infinitely many ultrametric matrices which may be obtained from the original one by performing various transformations. From these matrices we choose the one that is closest to the original in some well-defined sense as the optimal distance matrix. Starting from this matrix, both Unweighted Pair-Group with Arithmetic Mean (UPGMA) and Neighbor-Joining (NJ) (see ref. [7] for these standard methods) would lead to identical trees. The "ultrametrization" has the additional advantage of yielding a rooted tree without choosing an outgroup.
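To make the notion of ultrametricity concrete, the following sketch checks the three-point condition (for every triple, the two largest distances are equal) and computes the subdominant ultrametric, i.e. the largest ultrametric matrix lying below the original one, by a minimax-path relaxation. This corresponds only to the lower bound mentioned above. The upper bound and the criterion for choosing the optimal matrix are not specified in this paper, so they are not attempted here; the function names and the toy matrix are illustrative assumptions.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Three-point condition: in every triple the two largest distances agree."""
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        a, b, c = sorted((d[i][j], d[i][k], d[j][k]))
        if c - b > tol:
            return False
    return True

def subdominant_ultrametric(d):
    """Largest ultrametric u with u[i][j] <= d[i][j] for all i, j.

    u[i][j] is the smallest possible value of the largest step on any
    path from i to j, computed with a Floyd-Warshall style relaxation.
    """
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(u[i][k], u[k][j])
                if via_k < u[i][j]:
                    u[i][j] = via_k
    return u

# Hypothetical symmetric distance matrix for four items (not data from the paper).
d = [
    [0, 1, 2, 2],
    [1, 0, 1, 3],
    [2, 1, 0, 2],
    [2, 3, 2, 0],
]

lower = subdominant_ultrametric(d)
print(is_ultrametric(d), is_ultrametric(lower))
```

Because an ultrametric matrix assigns every pair the height of its lowest common ancestor, a tree built from `lower` by any standard clustering method is automatically rooted, which is the property exploited in the text.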
Actually, choosing an outgroup for the SARS Tree is not feasible because all G1 through G3 genomes are too far from SARS-CoV, as is evident from inspecting the distance matrices. The method of clustering and tree construction via ultrametrization of a distance matrix was sketched in [8]. We implemented the algorithm and applied it to obtain the SARS Trees. The method will be described in detail elsewhere; only the results are presented in this paper.

2 Results and discussion

(i) The CoV Tree. On all 7 CoV trees given in [1] and [2], SARS-CoVs make a separate group besides the 3 known groups. The SARS group is surely distant from G1, but its relation to G2 or G3 varies from tree to tree. In Fig. 1 we present a phylogenetic tree for 20 coronaviruses including 6 SARS-CoVs, plus 4 viruses from Flaviviridae and Togaviridae as outgroup. This tree is constructed using composition vectors [5] from the amino acid sequences at string length K = 5.

Fig. 1. A phylogenetic tree for 20 coronaviruses including 6 SARS-CoVs based on the composition vector method at string length K = 5. Four viruses from Flaviviridae and Togaviridae are added as outgroup. Note that this is an unrooted tree and the branches are not to scale.

As mentioned above, when monophyletic groups on a tree are too distant from each other, the intra-group branchings should not be taken too seriously. One must refer to trees built specially to resolve intra-group relations (see Fig. 2 in Subsection (ii)). The question of the origin of SARS cannot be answered by phylogenetic study alone, as no genomes of close neighbors are present in GenBank for the time being. The only plausible conclusion that may be drawn from all CoV Trees constructed so far is that SARS-CoV makes a separate group within the Coronavirus genus. The outgroup added to our tree indicates that the SARS group is closer to G2, i.e. to some coronaviruses in mammalian hosts. We mention in passing that the ultrametrization procedure applied to the CoV Tree without using any outgroup also puts the root exactly where the outgroup in Fig. 1 is located.

(ii) The SARS Tree. We first present four distance matrices obtained by counting sequence variations in all available SARS-CoV genomes. The upper-right triangle of Table 2 gives the pairwise distances obtained by counting all positions with different characters in two aligned sequences (Hamming distance on a 4-letter alphabet). There are 137 variations in total. Some nucleotide variations do not change the encoded amino acid if we adopt the Open Reading Frame definitions of the corresponding genome annotation. By excluding these synonymous variations we keep 97 variations, shown in the lower-left triangle of Table 2. The numbers shown in Table 2 may contain some sequencing errors as well. To be safe one may keep only those sequence variations that occurred twice or more. In this way the numbers 137 and 97 reduce to 18 and 12, without and with synonymous substitutions excluded. These two distance matrices are given in the upper-right and lower-left triangles of Table 3, respectively.

Table 2 Distance matrices based on 137 sequence variations when synonymous substitutions are kept (upper triangle) and on 97 variations when synonymous ones are excluded (lower triangle)

           1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
TW01       0   6  10   4   3   4  24   2   3   3   4  10  23  23  16  13  50
Urbani     2   0  14   8   7   8  28   6   7   7   8  12  27  27  20  17  54
HKUN       7   9   0  12  11  12  32  10  11  11  12  18  31  31  24  21  58
SIN2677    2   4   9   0   3   4  26   2   5   5   6  12  25  25  18  15  52
SIN2500    2   4   9   2   0   3  25   1   4   4   5  11  24  24  17  14  51
SIN2774    1   3   8   1   1   0  26   2   5   5   6  12  25  25  18  15  52
ZJ01      18  20  25  20  20  19   0  24  25  25  26  32  45  45  38

[Columns 1-7 follow the row order. The labels of columns 8-17, the last two entries of the ZJ01 row and the remaining rows are missing from this copy.]

Table 3 [table body missing from this copy; see the text above for the description of its two triangles]

Four SARS Trees built by using the ultrametrization procedure outlined in the Material and methods section are shown in Fig. 2. Fig. 2(a) is based on the 12 sequence variations that occurred at least twice, with synonymous substitutions excluded, i.e. on the distance matrix given in the lower-left triangle of Table 3. Fig. 2(b) is based on the 18 sequence variations that occurred at least twice, with synonymous substitutions kept; the distance matrix is given in the upper-right triangle of Table 3. Fig. 2(c) is based on all 97 sequence variations, including single ones but excluding synonymous substitutions, corresponding to the distance matrix given in the lower-left triangle of Table 2. Fig. 2(d) is based on all 137 sequence variations, with both single and synonymous ones kept; the distance matrix is given in the upper-right triangle of Table 2.

If the trees built from the 4 distance matrices differed significantly from each other, one would not have much to say and more study would be required. However, these four trees are topologically consistent in spite of the comparatively large change in the number of variations due to the updating of the BJ02-04 and GZ01 genomes from partial to complete. Fig. 2(a) and (b) are based on the most conserved data and turn out to be consistent except for the relocation of CUHKW. They both support the observation [4] that the SARS-CoV spreading process has split into two paths. So does the location of the root. In addition, our method also reveals some finer branches which could not be resolved by using maximal parsimony. We note that these finer branches are consistent with the clinical relations described in [4]. The data used to build the trees in Fig. 2(c) and (d) may contain fictitious variations due to sequencing errors, but they also make use of real variations that were omitted in Fig. 2(a) and (b). The branchings on these trees are not as reliable as those in Fig. 2(a) and (b). We keep these trees in order to show the improvement reached by excluding single-occurrence variations. The genome sequence of ZJ01 differs from the others in that it brings about many more single-sequence variations. However, this does not show up in Fig. 2(a) and (b), where one keeps only variations that occurred twice or more.

We summarize the main findings of this paper. SARS-CoV makes a group separate from the three known groups; its apparent closeness to G2 may be questionable. The origin of SARS-CoV cannot be revealed by phylogenetic study alone at the present time, as there are too few CoV species represented in GenBank. We must await more CoV genomes, probably from other mammals, to be sequenced. A "clinical tree" of SARS spreading, like the clinical relation described in [4], does not necessarily imply a phylogenetic tree at the molecular level. However, the fact that the SARS Trees (Fig. 2) are consistent with each other and with the clinical relations described in [4] is a manifestation of the high mutation rate of SARS-CoV.

References
Characterization of a novel coronavirus associated with Severe Acute Respiratory Syndrome
A complete sequence and comparative analysis of a SARS-associated virus (Isolate BJ01)
Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection
Whole genome prokaryote phylogeny without sequence alignment: a K-string composition approach
Origin and phylogeny of chloroplasts: a simple correlation analysis of complete genomes
Molecular Evolution and Phylogenetics
Ultrametricity for physicists

Acknowledgements The authors thank Li Wei, Beijing Genomics Institute, for providing the error estimate of the BJ01 genome; Luo Jingchu, Peking University, for discussion and checking with data on CBI's Anti-SARS website (http://antisars.cbi.pku.edu.cn:5555/index.jsp); Edison T. Liu, Genome Institute of Singapore, for clarifying single-sequence variations; Chu Ka Hou, Chinese University of Hong Kong, for suggesting species for the outgroup on the CoV Tree.
This work was partly supported by the Special Funds for Major State Basic Research Projects, the National Natural Science Foundation of China (Grant No. 30170232), the Innovation Project of the Chinese Academy of Sciences, and by a grant from Shanghai Municipality via Fudan University.