key: cord-300061-l2pfl776
authors: Zhang, Ren
title: A rebuttal to the comments on the genome order index and the Z-curve
date: 2011-02-16
journal: Biol Direct
DOI: 10.1186/1745-6150-6-10
sha:
doc_id: 300061
cord_uid: l2pfl776

BACKGROUND: Elhaik, Graur and Josic recently commented on the genome order index (S) and the Z-curve (Elhaik et al. Biol Direct 2010, 5: 10). S is a quantity defined as $S = a^2 + c^2 + g^2 + t^2$, where a, c, g and t denote the corresponding base frequencies. The Z-curve is a three-dimensional curve that represents a DNA sequence in such a manner that each can be uniquely reconstructed given the other. Elhaik et al. made 4 major claims. 1) In the previous mapping system with the regular tetrahedron, calculation of the radius of the inscribed sphere is "a mathematical error". 2) S follows an exponential distribution and is narrowly distributed with a range of (0.25 - 0.33). 3) Based on Chargaff's second parity rule (PR2), "S is equivalent to H [Shannon entropy]" and they are derivable from each other. 4) The Z-curve "suffers from over dimensionality", because based on the analysis of 235 bacterial genomes, x and y components contributed less than 1% of the variance and therefore "would be of little use".

RESULTS: 1) Elhaik et al. mistakenly neglected the parameter [Formula: see text] when calculating the radius of the inscribed sphere. 2) The exponential distribution of S is a restatement of our previous conclusion, and the range of (0.25 - 0.33) only paraphrases the previously suggested S range (0.25 - 1/3). 3) Elhaik et al. incorrectly disregard deviations from PR2 by treating the deviations as 0 altogether, reduce S and H, both having 4 variables, a, c, g and t, into functions of one single variable, a only, and apply this treatment to all DNA sequences as the basis of their "demonstration", which is therefore invalid. 4) Elhaik et al. 
confuse numerical smallness with biological insignificance, and disregard the distributions of purine/pyrimidine and amino/keto bases (x and y components), the variations of which, although they can be smaller than that of GC content, contain rich information that is important and useful, such as in locating replication origins of bacterial and archaeal genomes, and in studies of gene recognition in various species.

CONCLUSION: Elhaik et al. confuse S (a single number) with the Z-curve (a series of 3D coordinates), which are distinct. To use S as a case study of the Z-curve is, by itself, invalid. S and H are neither equivalent nor derivable from each other. The criticisms of Elhaik, Graur and Josic are wrong.

REVIEWERS: This article was reviewed by Erik van Nimwegen.

The debate originated from a paper published in 1991, in which we defined a quantity $S = a^2 + c^2 + g^2 + t^2$, where a, c, g and t denote the corresponding base frequencies in a DNA sequence, and we studied S values for protein coding genes [1]. In 2004, we calculated S values for genome sequences, and found that S < 1/3 is valid for most genomes [2]. In 2008, Elhaik et al. criticized this work with 2 claims [3]. 1. S > |x| ≈ |y| ≈ 0. That is, x and y components are small numbers that are close to 0. However, this does not necessarily mean that x and y components, i.e., variations of purine/pyrimidine and amino/keto bases, respectively, along the genome, are not important. For instance, based on x and y components, replication origins have been located in more than 1000 bacterial genomes [12, 13], and also in archaeal genomes [14]. For example, for the archaea Sulfolobus solfataricus and Aeropyrum pernix, analysis based on x and y components predicted multiple replication origins [14, 15], which are consistent with later experimental results [16, 17]. Second, the Z-curve can be used to analyze any DNA sequence, such as protein coding genes [18], promoter sequences [19] and translation start sites (TSS) [20].
Protein coding genes or DNA sequence segments in various species do not necessarily have the same nucleotide variation patterns as those in the 235 bacterial genomes, the basis of their conclusion. For instance, based on Z-curve behaviors, bacterial TSS can be reliably predicted, and for sequences around bacterial TSS, x and y components in fact have more variation than the z component, in contrast to the variation pattern of bacterial genomes [20]. Z-curve based algorithms have been successfully used in recognizing protein coding genes in genomes of budding yeast [18], bacteria and archaea [21], viruses and phages [22], especially coronaviruses [23], and in recognizing short coding sequences of human genes [24]. In all these algorithms, x and y components are absolutely needed to achieve high gene recognition accuracy.

In this section, the major mistake of Elhaik et al. (among some others, such as incorrectly extrapolating a result based on a subset of bacterial genomes to all DNA sequences) is the confusion of numerical smallness with biological insignificance. Variations of purine/pyrimidine and amino/keto bases (x and y components) should not be disregarded and treated as of "little use" only because they can be small in magnitude; on the contrary, they are important and useful. As mentioned above, based on x and y components, a large number of replication origins have been located in both bacterial [12, 13] and archaeal genomes [14]. The x and y components play an absolutely indispensable role in Z-curve based gene finding algorithms, which have been successfully applied in recognizing protein coding genes in, to name a few, the genomes of L. interrogans Lai [25], B. amyloliquefaciens FZB42 [26], B. thuringiensis BMB171 [27], A. mediterranei U32 [28], M. tuberculosis H37Ra [29], Drosophila [30], new human coronaviruses HCoV-NL63 [31] and HKU1 [32], four coronaviruses from bats [33], new phages Rtp in E. coli [34] and in a pandemic V. 
parahaemolyticus O3:K6 strain [35].

In many cases the statements by themselves [3, 5] make little sense. Below are some examples.

1. "The genome order index was selected as a case study to the usefulness of the Z-curve method." S is a statistical quantity (one single number), while the Z-curve is a 3-dimensional curve that is in one-to-one correspondence with a DNA sequence (a series of 3-D coordinates). S is not the Z-curve, and S cannot be used as a case study of the Z-curve.

2. "We must conclude that both the Z-curve and S are over complicated measures to GC content and Shannon H index, respectively." The Z-curve is not a measure of GC content. S is not a measure of the Shannon H index. If the Z-curve were a measure of GC content, it would be striking that gene recognition can be achieved with a high accuracy [18, 21, 22, 24] based solely on GC content.

3. "the dimension stands for GC content alone suffices to represent any given genome." GC content alone does not suffice to represent any given genome, simply because the genome is composed of 4 kinds of nucleotides, and distributions of purine/pyrimidine and amino/keto bases should not be disregarded only because their variations can be less than that of the GC content.

4. Elhaik, Graur and Josic finally concluded that "the genome order index is a misconceived mathematical tool that should not be used in a meritorious sequence analyses." This conclusion is not self-consistent. The Shannon entropy is a well-established method that has been widely used in many areas. Elhaik et al. on the one hand claim that S is strictly equivalent to the Shannon entropy, and on the other hand claim that S is a misconceived mathematical tool; the next logical conclusion would then be that the Shannon entropy is a misconceived mathematical tool, which is obviously against scientific common sense.
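The distinction drawn in item 1 can be made concrete with a short Python sketch. The function names below are illustrative and do not come from the cited articles. The Z-curve coordinates use the standard cumulative definition (x: purine versus pyrimidine, y: amino versus keto, z: weak versus strong hydrogen bonding), which the text refers to but does not spell out, so treat that part as an assumption. Non-ACGT symbols are simply skipped, which is a simplification.

```python
from collections import Counter

# Per-base steps of the Z-curve: (x, y, z) increments.
# x: purine (A, G) vs pyrimidine (C, T); y: amino (A, C) vs keto (G, T);
# z: weak H-bond (A, T) vs strong H-bond (G, C).
STEPS = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}


def genome_order_index(seq):
    """Return S = a^2 + c^2 + g^2 + t^2, a single number per sequence."""
    counts = Counter(b for b in seq.upper() if b in STEPS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return sum((counts[b] / total) ** 2 for b in "ACGT")


def z_curve(seq):
    """Return the Z-curve as a list of cumulative (x, y, z) points, one per base."""
    x = y = z = 0
    points = []
    for base in seq.upper():
        if base not in STEPS:
            continue  # skip ambiguous symbols such as N
        dx, dy, dz = STEPS[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def rebuild_sequence(points):
    """Recover the DNA sequence from its Z-curve (the one-to-one correspondence)."""
    inverse = {step: base for base, step in STEPS.items()}
    prev = (0, 0, 0)
    bases = []
    for point in points:
        step = tuple(p - q for p, q in zip(point, prev))
        bases.append(inverse[step])
        prev = point
    return "".join(bases)


if __name__ == "__main__":
    seq = "ATGCGC"
    print(genome_order_index(seq))         # one number
    print(z_curve(seq))                    # one 3-D point per base
    print(rebuild_sequence(z_curve(seq)))  # the original sequence
```

The sequence can be rebuilt from the curve but not from S: any two sequences with the same base composition share one S value while having different Z-curves.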
In summary, Elhaik, Graur and Josic (i) confuse the reduced coordinate system with the original one, and consequently, mistakenly neglected the parameter 4 3 / when calculating the radius of the inscribed sphere. (ii) The exponential distribution of S is a restatement of our previous conclusion, and the range of (0.25 - 0.33) only paraphrases the previously suggested S range (0.25 - 1/3). (iii) Elhaik et al. incorrectly disregard deviations from PR2 by treating the deviations as 0 altogether, reduce S and H, both having 4 variables, a, c, g and t, into functions of one single variable, a only, and apply this treatment to all DNA sequences as the basis of their "demonstration", which is therefore invalid. Importantly, they confuse numerical smallness with biological insignificance, and disregard the distributions of purine/pyrimidine and amino/keto bases, the variations of which, although sometimes less than that of GC content, contain rich information that is important and useful. Therefore, the criticisms of Elhaik, Graur and Josic are wrong.

The same 235 bacterial genomes (based on genome names) that were used by Elhaik et al. in [5] were analyzed. The data in Table S1 in ref. [5] contain numerous mistakes. Table S1 contains 4 columns: genome name, size, GC content and ID. Eighteen IDs correspond to plasmids, not genomes. These IDs are: NC_007410, NC_006873, NC_004943, NC_003080, NC_007414, NC_007515, NC_007801, NC_007483, NC_007274, NC_007336, NC_007901, NC_007641, NC_006855, NC_007608, NC_005951, NC_006663, NC_005229 and NC_004554. The calculation of genome length and GC content is incorrect for many genomes. For instance, the calculated GC content for B. fragilis YCH46 (NC_006347) was 33.50% [5], while the correct number is 43.27%. The calculated GC content for C. acetobutylicum ATCC 824 (NC_003030) was 37.00% [5], while the correct number is 30.93%.

This manuscript seems to be the latest shot in an ongoing dispute between this author and Elhaik et al. 
regarding the usefulness of certain statistics for analyzing base composition of DNA. After looking at this manuscript and the paper that it is a rebuttal to, I must say that I am amazed that so much debate can arise over issues that are essentially very basic (i.e. how to summarize base composition in one or a few statistics) and I am wondering how useful these kinds of exchanges are for general readers.

Much of the discussion centers around the DNA-sequence statistic S, which is defined as the sum of the squared frequencies of the letters: S = (f_a)^2 + (f_c)^2 + (f_g)^2 + (f_t)^2, where f_a, f_c, f_g, and f_t are the base frequencies. Clearly, since f_a + f_c + f_g + f_t = 1, we necessarily have that S lies in the range [0.25, 1]. Both this author and Elhaik et al. seem to agree that, for a large collection of bacterial genomes, we find S < 1/3, but there is disagreement about how 'surprising' this is and what kind of constraint this is indicative of. First of all, it is clear that for uniformly random sequences the frequencies f_x will be close to 0.25 and thus S will be close to 0.25 as well. Only for extremely biased base compositions would one get values of S close to 1 and so, in my opinion, it is not 'surprising' at all that one does not find genomes with large S values. One might reasonably argue, in my opinion, that the surprising observation is that one gets S values as HIGH as 0.33.

A second point of contention is whether the S statistic and the entropy H = -sum_x f_x log(f_x) are 'equivalent'. The dispute here seems to mostly be of a semantic nature, i.e. regarding the meaning of the word 'equivalent'. I can only see two relevant points:

1) For large DNA sequences (like whole genomes) it is observed that there is an approximate symmetry between the two DNA strands, i.e. the base composition in one strand is not significantly different from the base composition in the other strand.
Since, by Watson-Crick base-pairing rules, we only have C-G/G-C and A-T/T-A pairs, this implies that APPROXIMATELY

f_a = f_t and f_c = f_g (*)

Now, if we assume that the equalities (*) hold exactly, then we have three constraints

f_a + f_c + f_g + f_t = 1
f_a = f_t
f_c = f_g

and so we effectively have only 1 degree of freedom left (which is essentially GC-content). Since both S and H are invertible functions of the remaining degree of freedom, it immediately follows that S can be calculated from H and H from S. Whether you want to call this equivalent or not is a matter of semantics. The point is that when all three constraints are acting, there is only one degree of freedom left. Instead of calculating S or H, I think it would be much more straightforward to just talk about GC-content directly. Indeed, it is remarkable that GC-content ranges from as low as 0.22 to as high as 0.77 and the relevant biological question, in my opinion, is not whether to use S or H or whatever other derived statistic, but rather trying to explain why GC-content varies so much in bacterial genomes. Indeed there have been quite some interesting developments in this area recently. See for example the discussion in: Rocha EP, Feil EJ. Mutational patterns cannot explain genome composition: are there any neutral sites in the genomes of bacteria? PLoS Genet. 2010 Sep 9;6(9).

The discussion about Renyi entropies is useless in my opinion. Yes, S and H are both members of a family of functions (Renyi entropies) but I fail to see how this is relevant for any biological question.

Of course, in reality one only has that f_a is approximately equal to f_t (and similarly for f_c and f_g). Thus, H and S may vary independently. However, because the equalities almost always very nearly hold, and because H and S are smooth functions of the base frequencies, there is still a very tight quantitative relation between H and S in real data. Thus, I agree with Elhaik et al. 
that the variation of S and H across different genomes is dominated by the variation in GC-content.

2) The remaining question is whether there is any biological meaning in the deviations from f_c = f_g and f_a = f_t. The current author makes the valid point, in my opinion, that numerically small deviations may still be meaningful biologically. The author asserts in several places that, indeed, these deviations are highly meaningful but frustratingly fails to give citations to back this claim up. My own recollection is that in bacteria the G/C-skew has been proposed to be a result of different mutational spectra acting on the leading and lagging strands (and would thus not necessarily have functional implications).

The author does later cite a number of papers that use the Z-curve statistic to find genes and replication origins and states that the components orthogonal to GC-content are crucial for these methods. I immediately believe this to be correct. For example, as we and others have found, the presence of ribosomal binding sites plus the avoidance of RNA secondary structure around the translation start site leads to clear base-compositional biases around the starts of genes (Eyre-Walker and Bulmer Nucl. Acids Res 1993, Molina & van Nimwegen Genome Res 2008). However, this seems to now confound the question of local compositional biases and their functional implications versus global patterns of base composition, because as far as I can tell Elhaik et al. were talking about global compositional patterns.

Finally, the remark that S can be calculated faster than H 'which is especially important for handling large genomes' does not make a lot of sense to me. If one really worries about computational costs in calculating H one could calculate f*log(f) for all values of f beforehand and store them in a table.

Author's response

Elhaik, Graur and Josic made 4 major claims, which are rebutted. The review report, although long, evades 2 major points being debated.
The first 2 claims made by Elhaik et al. are: 1) The conclusion that the mapping points of most genomes are within the inscribed sphere, i.e., S < 1/3, is a consequence of mathematical error. 2) S follows an exponential distribution. I point out that their first claim is incorrect due to the neglect of a coordinate transform parameter and that their second claim is only a restatement of our previous conclusion. Neither point is addressed in the review report, and I therefore presume the reviewer has no objection to my rebuttal on them. The reviewer does, however, disagree with my rebuttal and agree with Elhaik et al. on some issues, to which I will respond point by point.

I am amazed that so much debate can arise over issues that are essentially very basic

It is not 'surprising' at all that one does not find genomes with large S values

I am wondering how useful these kinds of exchanges are

The discussion about Renyi entropies is useless

Author's response

I agree that some issues are basic. For instance, their first claim is due to mistakenly neglecting a parameter in a coordinate transformation, which belongs to elementary mathematics. However, first, the issue here is not whether a topic is basic or not, surprising or not, useful or not; it is about right or wrong. Regarding questions such as whether the calculation of the inscribed sphere radius is 'a mathematical error', and whether the Z-curve suffers from 'over dimensionality', there is only one answer: yes or no. The scientific literature and its readers deserve the truth. Second, in contrast, whether a topic is surprising or useful is largely a personal opinion. Therefore I will not further discuss whether certain issues are basic/surprising/useful.

A second point of contention is whether the S statistic and the entropy H = -sum_x f_x log(f_x) are 'equivalent'. The dispute here seems to mostly be of a semantic nature, i.e. regarding the meaning of the word 'equivalent'.
I can only see two relevant points: 1) For large DNA sequences (like whole genomes) it is observed that there is an approximate symmetry between the two DNA strands, i.e. the base composition in one strand is not significantly different from the base composition in the other strand. Since, by Watson-Crick base-pairing rules, we only have C-G/G-C and A-T/T-A pairs, this implies that APPROXIMATELY

f_a = f_t and f_c = f_g (*)

Now, if we assume that the equalities (*) hold exactly, then we have three constraints

f_a + f_c + f_g + f_t = 1
f_a = f_t
f_c = f_g

and so we effectively have only 1 degree of freedom left (which is essentially GC-content). Since both S and H are invertible functions of the remaining degree of freedom, it immediately follows that S can be calculated from H and H from S. Whether you want to call this equivalent or not is a matter of semantics. The point is that when all three constraints are acting, there is only one degree of freedom left. Instead of calculating S or H, I think it would be much more straightforward to just talk about GC-content directly. Indeed, it is remarkable that GC-content ranges from as low as 0.22 to as high as 0.77 and the relevant biological question, in my opinion, is not whether to use S or H or whatever other derived statistic, but rather trying to explain why GC-content varies so much in bacterial genomes. Indeed there have been quite some interesting developments in this area recently. See for example the discussion in: Rocha EP, Feil EJ. Mutational patterns cannot explain genome composition: are there any neutral sites in the genomes of bacteria? PLoS Genet. 2010 Sep 9;6(9).

The discussion about Renyi entropies is useless in my opinion. Yes, S and H are both members of a family of functions (Renyi entropies) but I fail to see how this is relevant for any biological question.
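The one-variable reduction described in the comment above can be written out explicitly. This is only a sketch under the stated assumption that the equalities (*) hold exactly; the symbol p for GC-content is introduced here for illustration and is not used elsewhere in the text.

```latex
\begin{aligned}
f_c = f_g &= \tfrac{p}{2}, \qquad f_a = f_t = \tfrac{1-p}{2}, \qquad 0 \le p \le 1,\\[4pt]
S(p) &= 2\left(\tfrac{p}{2}\right)^{2} + 2\left(\tfrac{1-p}{2}\right)^{2}
      = \tfrac{1}{2}\left[p^{2} + (1-p)^{2}\right],\\[4pt]
H(p) &= -\,p\,\log\tfrac{p}{2} \;-\; (1-p)\,\log\tfrac{1-p}{2}.
\end{aligned}
```

Both expressions are symmetric about p = 1/2, where S takes its minimum value 1/4 and H its maximum value log 4. As |p - 1/2| grows, S increases and H decreases monotonically, so under exact equalities (*) each of S and H fixes the other. Once deviations from (*) are retained, three of the four frequencies vary independently, and a given value of S no longer corresponds to a single value of H.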
Throughout the criticisms and the rebuttal, in the debate on S and H, the only Chargaff parity rule being referred to is parity rule 2 (PR2). Note that PR2 is a phenomenon in one single DNA strand (a ≈ t and c ≈ g), not in double-stranded DNA. Indeed, in a duplex DNA, a = t and c = g, due to Watson-Crick base pairing, but that is Chargaff parity rule 1. The reviewer's discussion is based on the phenomenon in 2 DNA strands. The reviewer writes: "symmetry between the two DNA strands, i.e. the base composition in one strand is not significantly different from the base composition in the other strand... Since by Watson-Crick base-pairing rules, we only have C-G/G-C and A-T/T-A pairs ...". The debate is about PR2, a phenomenon of base compositions in a single DNA strand, while the reviewer's discussion is about DNA double strands. Because of this misunderstanding, the reviewer's discussion about S and H becomes almost irrelevant.

Thus, I agree with Elhaik et al. that the variation of S and H across different genomes is dominated by the variation in GC-content.

Here the reviewer agrees with Elhaik et al. on a point that Elhaik et al. did not intend to make. Elhaik et al. studied the variations of the Z-curve's 3 components (x, y, z) using 235 bacterial genomes, and found that the z component (which is related to GC content) contributed most of the variance, compared with the x and y components (please refer to figure 4 in ref. [5]). Note that the variations studied are those of the Z-curve, and are not related to S and H. Nevertheless, it is true that the distributions of S and H are quite closely related to the GC content. But that is a conclusion I made myself in the original article. Please refer to figure 3 in ref. [2] and the text therein.

The author asserts in several places that, indeed, these deviations are highly meaningful but frustratingly fails to give citations to back this claim up.
My own recollection is that in bacteria the G/C-skew has been proposed to be a result of different mutational spectra acting on the leading and lagging strands (and would thus not necessarily have functional implications).

Deviations from PR2 result from both mutation and selection pressures, reflecting biases in, e.g., DNA replication, transcription and repair. I added a review article.

However, this seems to now confound the question of local compositional biases and their functional implications versus global patterns of base composition, because as far as I can tell Elhaik et al. were talking about global compositional patterns.

No. Elhaik et al. concluded that the Z-curve suffers from "over-dimensionality", without restricting their conclusion to global compositional patterns only. The Z-curve can be used to study any DNA sequence, such as whole genomes, protein coding genes, promoter sequences and translation start sites. Therefore, one logically flawed part of their analysis is that they analyzed a subset of bacterial genomes but tried to draw a general conclusion for all DNA sequences. In my rebuttal, therefore, I have to show separately that their conclusion is wrong for both whole genomes and short DNA segments.

Finally, the remark that S can be calculated faster than H 'which is especially important for handling large genomes' does not make a lot of sense to me. If one really worries about computational costs in calculating H one could calculate f*log(f) for all values of f beforehand and store them in a table.

The reviewer finally suggests that one could calculate f*log(f) for all values beforehand and store them in a table. However, this suggestion is not practical. Both S and H are real numbers. The number of real numbers within an interval, e.g., [0,1], is infinite. Therefore, a table that contains 'all values' cannot be stored except with infinitely large computer storage, which does not exist.
1. Analysis of distribution of bases in the coding sequences by a diagrammatic technique
2. A nucleotide composition constraint of genome sequences
3. 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences
4. A rebuttal to the comments on the genome order index
5. 'Genome order index' should not be used for defining compositional constraints in nucleotide sequences - a case study of the Z-curve
6. Segmentation algorithm for DNA sequences
7. Separation of B. subtilis DNA into complementary strands. 3. Direct analysis
8. A test of Chargaff's second rule
9. Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms
10. On measures of information and entropy
11. Relations between Shannon entropy and genome order index in segmenting DNA sequences
12. Ori-Finder: a web-based system for finding oriCs in unannotated bacterial genomes
13. DoriC: a database of oriC regions in bacterial genomes
14. Identification of replication origins in archaeal genomes based on the Z-curve method
15. Multiple replication origins of the archaeon Halobacterium species NRC-1
16. Identification of two origins of replication in the single chromosome of the archaeon Sulfolobus solfataricus
17. Biochemical analysis of a DNA replication origin in the archaeon Aeropyrum pernix
18. Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve
19. Human Pol II promoter recognition based on primary sequences and free energy of dinucleotides
20. GS-Finder: a program to find bacterial gene start sites with a self-training method
21. ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes
22. ZCURVE_V: a new self-training system for recognizing protein-coding genes in viral and phage genomes
23. ZCURVE_CoV: a new system to recognize protein coding genes in coronavirus genomes, and its applications in analyzing SARS-CoV genomes
24. Comparison of various algorithms for recognizing short coding sequences of human genes
25. Unique physiological and pathogenic features of Leptospira interrogans revealed by whole-genome sequencing
26. Comparative analysis of the complete genome sequence of the plant growth-promoting bacterium Bacillus amyloliquefaciens FZB42
27. Complete Genome Sequence of Bacillus thuringiensis Mutant Strain BMB171
28. Complete genome sequence of the rifamycin SV-producing Amycolatopsis mediterranei U32 revealed its genetic characteristics in phylogeny and metabolism
29. Genetic Basis of Virulence Attenuation Revealed by Comparative Genomic Analysis of Mycobacterium tuberculosis Strain H37Ra versus H37Rv
30. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes
31. Identification of a new human coronavirus
32. The novel human coronaviruses NL63 and HKU1
33. Prevalence and genetic diversity of coronaviruses in bats from China
34. The genome of the novel phage Rtp, with a rosette-like tail tip, is homologous to the genome of phage T1
35. Characterization of a New Plasmid-Like Prophage in a Pandemic Vibrio parahaemolyticus O3:K6 Strain

Cite this article as: Zhang: A rebuttal to the comments on the genome order index and the Z-curve

Authors' contributions: RZ analyzed the data and wrote the manuscript. The author read and approved the final manuscript.

The author declares that they have no competing interests.