key: cord-0001288-b8jlvkg7
authors: Ai, Yuncan; Ai, Hannan; Meng, Fanmei; Zhao, Lei
title: GenomeFingerprinter: The Genome Fingerprint and the Universal Genome Fingerprint Analysis for Systematic Comparative Genomics
date: 2013-10-29
journal: PLoS One
DOI: 10.1371/journal.pone.0077912
sha: c57fb53010f62a669421b0f4894a26e22e0e496c
doc_id: 1288
cord_uid: b8jlvkg7

BACKGROUND: No attention has been paid to comparing a set of genome sequences that cross genetic components and biological categories with far divergence over a large size range. We define this as systematic comparative genomics and aim to develop its methodology. RESULTS: First, we create a method, GenomeFingerprinter, to unambiguously produce a set of three-dimensional coordinates from a sequence, followed by one three-dimensional plot and six two-dimensional trajectory projections, to illustrate the genome fingerprint of a given genome sequence. Second, we develop a set of concepts and tools, and thereby establish a method called the universal genome fingerprint analysis (UGFA). In particular, we define the total genetic component configuration (TGCC) (including chromosome, plasmid, and phage) for describing a strain as a systematic unit, the universal genome fingerprint map (UGFM) of TGCC for differentiating strains as a universal system, and the systematic comparative genomics (SCG) for comparing a set of genomes crossing genetic components and biological categories. Third, we construct a method of quantitative analysis to compare two genomes by using the outcome dataset of genome fingerprint analysis. Specifically, we define the geometric center and its geometric mean for a given genome fingerprint map, followed by the Euclidean distance, the differentiate rate, and the weighted differentiate rate to quantitatively describe the difference between the two genomes under comparison. Moreover, we demonstrate the applications through case studies on various genome sequences, giving insights into critical issues in microbial genomics and taxonomy.
CONCLUSIONS: We have created a method, GenomeFingerprinter, for rapidly computing, geometrically visualizing, and intuitively comparing a set of genomes at the genome fingerprint level; established a method called the universal genome fingerprint analysis; and developed a method for quantitative analysis of the outcome dataset. Together, these set up the methodology of systematic comparative genomics based on genome fingerprint analysis.

Conventional methods based on pairwise base-to-base comparison have not achieved large-scale comparison of whole-genome sequences; moreover, no attention has been paid to handling a number of genomes that cross genetic components (chromosomes, plasmids, and phages) and biological categories (bacteria, archaeal bacteria, and viruses) with far divergence over a large size range. We define such comparisons as systematic comparative genomics. We believe that carrying out whole-genome comparative genomics at large scale, based on the geometrical analysis of sequences crossing diverse genetic components and biological categories, should be a priority task in the post-genomic era. However, even simply visualizing a DNA sequence has been challenging for decades, and little progress has been made to date [1].

Pioneering work on geometrically visualizing DNA sequences with computers was done in one dimension [2, 3], two dimensions (Z-curve) [4], and three dimensions (H-curve) [5, 6]. However, those methods were valid only for 'static' modeling and visualization. 'Dynamic' modeling and visualization were explored in a virtual reality environment [7, 8]. AND-viewer, for example, provided a three-dimensional sense of the big picture of a DNA sequence in a virtual reality environment by using a hand sensor instead of a mouse and keyboard [7, 8]. This pioneering work made considerable progress in dynamically mimicking 3D visions and intuitively sensing genome sequences [7, 8]. Still, the outcome dataset could not be used to further explore the real contexts of biology.

The post-genomic era promoted a huge demand for data mining and robust reasoning with massive genome sequences [1] . So far, there were numerous methods for comparative genomics at small scale. These methods were divided into two types: algebraic approach [9, 10, 11, 12] and geometrical approach [13] .

The algebraic approach calculates similarity or identity by pairwise base-to-base comparison, and its output dataset is used only for visualization through graphical techniques [1]. The most common tools were BLAST [9] and CLUSTALW [10]. Recently, a BLAST-based tool, BRIG, was constructed for genome-wide comparison to create images of multiple circular genomes among a number of very closely related bacterial strains [11]. The output image showed the BLAST similarity between one central reference sequence and the other query sequences as a set of concentric rings, in which BLAST matches were colored on a sliding scale indicating a defined percentage of BLAST identity. This tool had great advantages over other common tools, such as ACT [12], in the number of genomes compared simultaneously and in the presentation of its output images. These features made it a versatile tool for visualizing a range of genome data, but it was still only for visualization. Similarly, the Mauve program [14, 15], which combines algebraic calculation and graphic display, was widely used for comparing and visualizing a set of genomes. However, even among close relatives, the number of genomes that Mauve could handle depended strongly on computational constraints: it took too much CPU time or caused memory overflow, which limited Mauve to a few very close relatives at one time.

The geometrical approach transforms a genome sequence into a set of coordinates that can be plotted to give a geometrical vision. Most importantly, calculation and visualization are processed separately, so that the input and output can subsequently be reused for geometrical analysis. One promising example was the Z-curve method (Zplotter program), which generated a set of three-dimensional coordinates from a linear genome sequence [16]. Such coordinates were plotted to create three-dimensional geometrical visions (as open rough Z-curves) for the given DNA sequences [16]. Hundreds of such visions for microbial genomes were collected as a database [17]. The Z-curve method (Zplotter program) was used not only for visualization but also for geometrical analysis to explore the real contexts of biology [18, 19, 20, 21]. For example, two replication ori points in archaeal bacterial genomes were predicted by the Z-curve analysis [22, 23] and confirmed by wet-lab experiments in other labs [24, 25], which showed the promise of the approach. However, the Zplotter algorithm had an inherent flaw: it could misrepresent a genome sequence because of its ambiguous cutting-point error (see Discussion section), so it was not suitable for creating a stable, unique genome fingerprint as we propose; moreover, no statistical analysis could be further applied to the outcome dataset.

In this paper, we present a method called GenomeFingerprinter to unambiguously produce a unique set of three-dimensional coordinates from a sequence, followed by one three-dimensional plot and six two-dimensional trajectory projections, to illustrate the whole-genome fingerprint of a given genome sequence. We further develop a set of concepts and tools, and thereby establish a method called the universal genome fingerprint analysis (UGFA). Finally, we construct a method to quantitatively analyze the outcome dataset of genome fingerprint analysis. Moreover, we demonstrate the applications of such methods through various case studies, giving new insights into the critical issues in microbial genomics and taxonomy. These have set up the methodology of what we called the systematic comparative genomics based on the genome fingerprint and the universal genome fingerprint analysis. We anticipate that these comprehensive methods can be widely applied at large scale in the post-genomic era.

To geometrically visualize a sequence, the key step is to create a set of three-dimensional coordinates $(x_n, y_n, z_n)$ for each base. To do this, the Z-curve method (Zplotter program) [16] defined a set of coordinates $(x_n, y_n, z_n)$ for each base in a linear sequence ($n = 1, 2, \ldots, N$; $N$ is the sequence length) by equation (0), which defined a unique Z-curve for a given linear sequence and vice versa. Note that $A_n$, $T_n$, $G_n$, $C_n$ were the cumulative counts of each of the four base types (A, T, G, C), respectively, counted from the first base through the $n$-th base of a linear sequence ($n = 1, 2, \ldots, N$). However, the main problem was the ambiguity of the "first base" caused by the cutting-point error in a deposited sequence (see explanations in Discussion section).

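$$
\begin{cases}
x_n = (A_n + G_n) - (C_n + T_n) \\
y_n = (A_n + C_n) - (G_n + T_n) \\
z_n = (A_n + T_n) - (G_n + C_n)
\end{cases}
\qquad (n = 1, 2, \ldots, N) \tag{0}
$$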
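As a point of reference, the cumulative-count construction can be sketched in a few lines of Python. This is a minimal illustration of the standard Z-curve form, not the Zplotter program itself; the function name `z_curve` is hypothetical, and the sketch assumes the input contains only the letters A, C, G, and T.

```python
def z_curve(seq):
    """Z-curve coordinates of a linear DNA sequence (illustrative sketch).

    A_n, C_n, G_n, T_n are cumulative counts from the first base through
    the n-th base, so the result depends on which base is stored first.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    coords = []
    for base in seq.upper():
        counts[base] += 1  # assumes only A/C/G/T appear in the input
        a, c, g, t = (counts[b] for b in "ACGT")
        coords.append((
            (a + g) - (c + t),  # x_n: purines minus pyrimidines
            (a + c) - (g + t),  # y_n: amino minus keto bases
            (a + t) - (g + c),  # z_n: weak minus strong hydrogen bonding
        ))
    return coords


coords = z_curve("ACACTGACGC")
```

Because every coordinate is a running total from the stored first base, shifting the start of a circular sequence changes the whole curve, which is the ambiguity discussed above.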

Here we take the same defining form as equation (0), but with different contents of $A_n$, $T_n$, $G_n$, $C_n$. Namely, we propose a model called GenomeFingerprinter for the geometrical visualization of a circular sequence. As an artificial example, a circular sequence of 40 bp, (5′-3′) ACACTGACGCACACTGACGCACACTGACGCACACTGACGC (Figure 1), will be used to illustrate the conceptual framework. It will be described in reasonable detail in order to build a bridge for readers who may not have a multidisciplinary background [26].

First, we randomly select a base (the $n$-th) as the first targeted base (TB) while keeping the $m$-th focusing base (FB) moving. We define the relative distance (RD) (1) between the selected TB ($n$-th) and the moving FB ($m$-th) ($m = 1, 2, \ldots, N$).

$$
RD_{n,m} =
\begin{cases}
m - n, & m > n \\
N + m - n, & m \le n
\end{cases}
\qquad (m = 1, 2, \ldots, N;\ n = 1, 2, \ldots, N) \tag{1}
$$

Note that the RD concept is critical. The RD formula (1) can treat an arbitrary linear sequence as a circular one. For example, once we select the TB (e.g., at position 1, base A) and the moving FB (e.g., at position 20, base C), the RD value is 19 (Figure 1). Thus, a collection of RD values ($m = 1, 2, \ldots, N$) can be generated for each selected TB ($N$ in total) as the TB slides along the given sequence. In particular, the RD value is $N$, not zero, when the $m$-th FB is located at the same position as the TB, which means that the FB has gone through one full circle (i.e., starting from and finishing at the $n$-th base).
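The circular behaviour of the RD can be checked with a short Python sketch. The function name `relative_distance` and the 1-based position convention are assumptions made for illustration; the modular form is simply one way of expressing the two properties stated above (RD of 19 from position 1 to position 20, and RD of N when the FB coincides with the TB).

```python
def relative_distance(n, m, N):
    """RD from the target base n to the focusing base m on a circle of N bases.

    Positions are 1-based. The distance is measured in the forward direction
    and equals N (one full circle), not 0, when m coincides with n.
    """
    if not (1 <= n <= N and 1 <= m <= N):
        raise ValueError("positions must lie in 1..N")
    d = (m - n) % N
    return d if d != 0 else N


# Example from the text: TB at position 1, FB at position 20, N = 40.
rd_example = relative_distance(1, 20, 40)
```

For any fixed TB, the N values of RD are exactly the integers 1 to N in some cyclic order, which is why the choice of the stored first base does not matter.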

Second, we define the weighted relative distance (WRD) (2), that is, the RD divided by the sequence length $N$. For the above example (base C at position 20), the WRD is 19/40. This weighting simply reduces the memory burden and overcomes computational constraints for large sequences.

Third, for the same selected TB (n th ), we define the sum of the weighted relative distance (SWRD) (3) from the collection of WRD (m = 1, 2, …, N) for each of four base-type (A, G, T, C), respectively.
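Written out explicitly, the two definitions might take the following form. This is a sketch consistent with the 19/40 example rather than a quotation of equations (2) and (3); the symbols $WRD_{n,m}$, $SWRD^{B}_{n}$, and $b_m$ (the base type at position $m$) are notation assumed here for illustration.

```latex
WRD_{n,m} = \frac{RD_{n,m}}{N},
\qquad
SWRD^{B}_{n} = \sum_{\substack{m = 1 \\ b_m = B}}^{N} WRD_{n,m},
\quad B \in \{A, T, G, C\}
```

Under this reading, the four sums for a given TB always add up to $(N + 1)/2$, because the RD values for a fixed TB run over $1, 2, \ldots, N$.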

Fourth, we define a set of coordinates (x n , y n , z n ) (4) for the selected TB (n th ). Note that we count the sum of the weighted relative distance (SWRD) (unlike the Zplotter program counting the sum of numbers) for each of four base-type (A, T, G, C), respectively. So far, only one cycle has been done for only one selected TB (n th ); namely, only one base has had its coordinates (x n , y n , z n ).

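$$
\begin{cases}
x_n = (A_n + G_n) - (C_n + T_n) \\
y_n = (A_n + C_n) - (G_n + T_n) \\
z_n = (A_n + T_n) - (G_n + C_n)
\end{cases}
\qquad (n = 1, 2, \ldots, N) \tag{4}
$$

where $A_n$, $T_n$, $G_n$, $C_n$ now denote the SWRD values of the four base types for the selected TB ($n$-th).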

Finally, we repeat the above steps to create a set of coordinates for every base in the sequence. Briefly, by selecting the next TB (e.g., $n = 2$) and reiterating the process for each base, step by step, we complete $N$ cycles ($n = 1, 2, \ldots, N$); each cycle has one selected TB and creates one set of coordinates $(x_n, y_n, z_n)$ for that TB. After all $N$ cycles, every base of the sequence has its own coordinates, so that a series of coordinates $(x_n, y_n, z_n)$ is created for the genome sequence. We have developed an in-house program, GenomeFingerprinter.exe, to perform all of these steps. Note that our method is also valid for RNA by simply replacing the T base with U.
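The five steps can be put together in a compact Python sketch. It is a naive O(N²) illustration of the procedure as described, not the GenomeFingerprinter.exe implementation; the function name `genome_fingerprint` is hypothetical, and mapping U to T is an assumption about how RNA input would be handled.

```python
def genome_fingerprint(seq):
    """Return one (x, y, z) tuple per base of a circular sequence (sketch).

    For each target base n, the relative distance to every focusing base m
    is (m - n) mod N, with 0 replaced by N. Distances are summed per base
    type, divided by N, and combined in the same form as equation (0).
    """
    seq = seq.upper().replace("U", "T")  # assumption: RNA is handled as U -> T
    N = len(seq)
    coords = []
    for n in range(N):
        sums = {"A": 0, "C": 0, "G": 0, "T": 0}  # integer RD sums per base type
        for m in range(N):
            d = (m - n) % N
            sums[seq[m]] += d if d else N
        a, c, g, t = (sums[b] / N for b in "ACGT")  # SWRD of each base type
        coords.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)))
    return coords


ARTIFICIAL = "ACACTGACGC" * 4  # the 40-bp artificial sequence of Figure 1
coords = genome_fingerprint(ARTIFICIAL)
```

The quadratic loop is adequate for the 40-bp example; a sequence of several megabases would need an incremental update from one TB to the next, which is not shown here.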

As an example, by using our program GenomeFingerprinter.exe, we can calculate a series of coordinates $(x_n, y_n, z_n)$ for the artificial 40-bp genome sequence (Figure 1); there are 40 bases in total, and each base has its own coordinates $(x_n, y_n, z_n)$ (data not shown).

The set of coordinates (x n , y n , z n ) of a given sequence can be plotted as a three-dimensional plot (3D-P) to give a geometrical vision. As an example, the artificial sequence ( Figure 1 ) has only 40 points giving a naive vision (not shown). Instead, we show the real visions of strains from bacteria and archaeal bacteria (Table 1 ) ( Figure 2 ). Clearly, each vision ( Figure 2 ) has its individual genome fingerprint (GF). We define such a GF vision as the genome fingerprint map (GFM), which is an intuitive identity or a unique digital marker for a given genome sequence. For convenience, we further define such a GFM vision of three-dimensional plot as the primary genome fingerprint map (P-GFM). Therefore, from now on, we can directly operate and compare the GFM vision for comparing sequences. In other words, we compare genome sequences through the genome fingerprints (via geometrical analysis) instead of the sequence base-pairs (via algebraic analysis).

For instance, we can intuitively distinguish a number of genome sequences based on their genome fingerprint maps ( Figure 2 ). Within the same species Sulfolobus islandicus, strains M.14. 

To demonstrate the genome fingerprint in more detail, we further create six two-dimensional trajectory projections (2D-TPs) for a given P-GFM through six combinations of the coordinates ($x_n, n$; $y_n, n$; $z_n, n$; $x_n, y_n$; $x_n, z_n$; and $y_n, z_n$). For convenience, these six 2D-TPs are defined as the secondary genome fingerprint maps (S-GFMs). For example, the six S-GFMs comparing the chromosomes of Halobacterium sp. NRC-1 (NC_002607) and Halobacterium salinarum R1 (NC_010364) clearly demonstrate the subtle variations both globally and locally (Figure 3). Note that the S-GFMs of $x_n, z_n$; $y_n, z_n$; and $x_n, y_n$ usually carry much more sensitive information than those of $x_n, n$; $y_n, n$; and $z_n, n$, respectively. Accordingly, the S-GFMs can amplify subtle variations that are usually insensitive or invisible in the P-GFMs. In particular, the S-GFMs of $x_n, y_n$; $x_n, z_n$; and $y_n, z_n$ are much more sensitive in differentiating local subtle variations and identifying unique genome features, whereas the S-GFMs of $x_n, n$; $y_n, n$; and $z_n, n$ are relatively less informative but still useful when focusing on global patterns (Figure 3).

Figure 1. A mathematical model for creating a set of coordinates $(x_n, y_n, z_n)$ from a circular genome sequence. We randomly select a base (the $n$-th) as the first target base (TB) while keeping the $m$-th focusing base (FB) moving. For the given TB ($n$-th), we define the relative distance (RD) between the selected TB ($n$-th) and the moving FB.
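The six projections are simple re-pairings of the coordinate columns, as the following Python sketch illustrates. The function name `trajectory_projections`, the dictionary keys, and the convention of pairing the base index first (as the horizontal axis) are assumptions made for illustration.

```python
def trajectory_projections(coords):
    """Split a list of (x, y, z) tuples into six 2D trajectory projections."""
    idx = list(range(1, len(coords) + 1))  # base index n, 1-based
    xs, ys, zs = (list(axis) for axis in zip(*coords))
    return {
        "x-n": list(zip(idx, xs)),  # coordinate against base index
        "y-n": list(zip(idx, ys)),
        "z-n": list(zip(idx, zs)),
        "x-y": list(zip(xs, ys)),   # coordinate against coordinate
        "x-z": list(zip(xs, zs)),
        "y-z": list(zip(ys, zs)),
    }


projections = trajectory_projections([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
```

Each value is a list of points that can be passed to any plotting library as a two-dimensional trajectory.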

As shown in Figure 3, for convenience, we further define the universal genome fingerprint map (UGFM) to unify the P-GFM and the S-GFMs for comparison in-one-sitting. Namely, we can compare a number of sequences by displaying their multiple GFMs (whether P-GFMs or S-GFMs) at one time (in-one-sitting) as one UGFM vision; from that, each individual GFM can be classified into a discrete group solely on the basis of its location. For example, the P-GFMs (Figure 2, D) of the twelve fragmental genomes from eight strains of E. coli (Table 1) are enlarged and displayed on one UGFM vision, and classified into six discrete groups (Figure 4). Clearly, there are six groups on the UGFM vision (Figure 4, A, B, C, D, E, F). Particularly, different fragmental genome sequences either from the same strain (e.g., 91. Moreover, note that a given P-GFM vision looks quite different in its own format and in the UGFM vision (Figure 4), simply because of what we call the effects of scale-down and view-angle rotation in the UGFM vision. This feature makes the UGFM vision a powerful tool for global comparison at large scale. Namely, as many sequences as possible can be handled at one time (in-one-sitting), as long as the computer memory and the graphics software allow.

We further establish a method called the universal genome fingerprint analysis (UGFA) ( Figure 5 ). Briefly, the UGFA method consists of a set of concepts and tools under three subcategories corresponding to three objects: a genome, a strain, and a set of strains, respectively. In other words, the objects of comparison can be one genome sequence, a number of genome sequences crossing genetic components (chromosomes, plasmids, and phages, if applicable) in a strain, or a set of genome sequences of genetic components in strains crossing biological categories (bacteria, archaeal bacteria, viruses). We anticipate that it should be effective for what we called the systematic comparative genomics at large scale, by expanding the scope of genetic component and biological category as well as the power of computation.

5.1. UGFM. First, the UGFM tool, namely the universal genome fingerprint map (UGFM), is the foundation of the UGFA method. As shown earlier (Figures 3, 4), the UGFM (combining the P-GFM and the S-GFMs) has proved powerful in comparing a number of genomes across both archaeal and bacterial genomes.

5.2. UGFM-TGCC. Second, we define the total genetic component configuration (TGCC) for a set of genomes crossing genetic components (chromosomes, plasmids, and phages, if applicable) in a strain, describing the strain as a systematic unit. We further define the universal genome fingerprint map (UGFM) of the total genetic component configuration (TGCC) (UGFM-TGCC) for differentiating a set of genetic components in a strain as a universal system. Taken together, the UGFM-TGCC tool can be used to compare a set of genomes crossing genetic components within a strain, which will be exemplified in the next section (Figure 6).

5.3. UGFM-TGCC-SCG. Third, we define the UGFM-TGCC-SCG tool, namely UGFM-TGCC-based systematic comparative genomics (SCG), in order to compare a set of genomes crossing both genetic components (chromosomes, plasmids, and phages, if applicable) and biological categories (bacteria, archaeal bacteria, viruses) in a universal system.

At moderate scale, one example (Figure 6) demonstrates that nineteen genomes (including six chromosomes and thirteen plasmids) over a large size range (6 Kbp to 4 Mbp) can be mapped and compared by using the UGFM-TGCC-SCG tool. These nineteen genomes from four strains (each containing at least one chromosome and one plasmid) crossing four genera of halophilic Archaea (Table 1) Most importantly, the tiny spots (e.g., corresponding to 6 Kbp) and the giant visions (e.g., corresponding to 4 Mbp) coexist in the same figure, either closely or distantly.

At large scale, the UGFM-TGCC-SCG vision can demonstrate the landscape of a large set of genomes crossing both diverse genetic components (chromosomes, plasmids, and phages) and diverse biological categories (bacteria, archaeal bacteria, viruses). For instance, we assemble a large set (over one hundred) of genomes of interest by combining 6 archaeal bacterial genomes and 13 archaeal bacterial plasmids (shown in Figure 6), 12 fragmental chromosomes of E. coli (shown in Figure 4), and 47 phage genomes and 24 virus genomes (as listed in Table 2), to be compared at large scale by using the UGFM-TGCC-SCG tool.

Recall that the effects of scale-down and view-angle rotation demonstrated earlier (Figure 4) ensure that as many sequences as possible can be handled at one time, as long as the computer memory and the graphics software allow. Under our conditions (2-Gb physical memory and 32-bit graphics software), we can only handle up to 1.5 Gb of data in-one-sitting. We therefore generate two sets separately. One set contains eighty-three genomes: 24 viruses (I), 12 fragmental chromosomes of E. coli (II), and 47 phages (III), which are shown as three distinct groups (Figure 7, A). The other set consists of two archaeal bacterial chromosomes (I), two bacterial fragmental chromosomes/two phages/two viruses (II), and three plasmids (III), which are shown as three distinct groups (Figure 7, B). These groupings are generally consistent with the real biological distinctions at different taxonomical levels. Here the effects of scale-down and view-angle rotation are even stronger than in the earlier sections. Moreover, in the big group of phages and viruses (II), most genomes appear to be very close relatives and accordingly almost coincide within the phage or virus subgroup, respectively, resulting in fewer visible maps than expected.

Figure 3 (caption fragment). (A) $x_n, y_n, z_n$; (B) $x_n, y_n$; (C) $x_n, z_n$; (D) $y_n, z_n$; (E) $x_n, n$; (F) $y_n, n$; (G) $z_n, n$; (H) $x_n, n$ and $y_n, n$ together. Note that two replication ori points (oriC1 and oriC2) are marked by arrows; other arrows indicate the genome-wide evolution events. doi:10.1371/journal.pone.0077912.g003

Taken together, such landscapes (Figures 6, 7) can be revealed only by using the UGFA method, under the notions of "universal genome fingerprint map (UGFM)" of "total genetic component configuration (TGCC)" based "systematic comparative genomics (SCG)". These data demonstrate that the concepts and tools (UGFM, UGFM-TGCC, and UGFM-TGCC-SCG) (Figure 5) are effective in handling such diverse real-world genomes in-one-sitting. Most importantly, the representatives are plotted as meaningful UGFM-TGCC-SCG visions (Figures 6, 7), explicitly demonstrating the scope and power of the methods developed in the present study. We re-emphasize that the combined concept and tool of "UGFM-TGCC-SCG" is distinct from the traditional methods of comparative genomics. This is because the genomes of interest, crossing diverse genetic components (chromosomes, plasmids, and phages, if applicable) and diverse biological categories (bacteria, archaeal bacteria, viruses), have much less homology or even none at all (Figures 6, 7), which is very challenging for any conventional method based on traditional homology analysis. In fact, the documented research on comparative genomics so far has been implicitly based on the assumption that there is at least one reference for the very close relatives in question; otherwise, the comparison would not be attempted. In our case, however, we focus on exactly the opposite situation: much less homology or even none at all. We have demonstrated the successful use of the UGFM-TGCC-SCG tool (Figures 6, 7) in comparing such diverse genetic components and diverse biological categories, regardless of the format of the objects and the extent of divergence.
Clearly, this is one of the core concepts and the highest-priority aim of the present study.

The difference between two genomes of interest, whose genome fingerprints are distinguished by one of the visions of UGFM, UGFM-TGCC, and UGFM-TGCC-SCG, can be further quantitatively discussed as follows.

6.1. The geometric center and geometric mean of the genome fingerprint map. First, we define the geometric center $(\bar{x}, \bar{y}, \bar{z})$ as a unique digital indicator for its genome fingerprint map. Accordingly, the geometric center $(\bar{x}, \bar{y}, \bar{z})$ and the standard deviation of all coordinates $(\sigma_x, \sigma_y, \sigma_z)$ can be calculated (5) by GenomeFingerprinter.exe from a given genome sequence ($i = 1, 2, \ldots, n$) (the length of an entire genome sequence is usually greater than hundreds of base pairs).

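$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad
\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i
\qquad (i = 1, 2, \ldots, n) \tag{5}
$$

Second, we define the geometric mean (Gm) (6) of the geometric center of a given genome fingerprint map.

$$
Gm = \sqrt[3]{(\bar{x})(\bar{y})(\bar{z})} \tag{6}
$$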

Note that the definition of Gm has a two-fold meaning: algebraically, it is the geometric mean of the three means $(\bar{x}, \bar{y}, \bar{z})$; geometrically, it is the side length of a cube whose volume is roughly equivalent to that of the cuboid spanned by the geometric center, starting from and rotating around the origin in the three-dimensional space. Accordingly, the values $(Gm, \bar{x}, \bar{y}, \bar{z})$ are not absolute values but carry their signs (minus or plus), corresponding to the position of the geometric center of the genome fingerprint map in the same three-dimensional space, namely within the scope of geometrical analysis.
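A minimal Python sketch of the two quantities follows. The function names are hypothetical, and the signed cube root is an assumption about how the sign of Gm is carried when the product of the three means is negative.

```python
import math


def geometric_center(coords):
    """Mean of each coordinate over all bases: (x-bar, y-bar, z-bar)."""
    n = len(coords)
    return tuple(sum(axis) / n for axis in zip(*coords))


def geometric_mean(center):
    """Signed cube root of the product of the three means (Gm).

    Assumption: Gm keeps the sign of the product, so that it is defined
    for negative products as well.
    """
    product = center[0] * center[1] * center[2]
    return math.copysign(abs(product) ** (1.0 / 3.0), product)


center = geometric_center([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)])
gm = geometric_mean(center)
```

Taking the cube root of the absolute value and restoring the sign avoids the complex result that Python returns when a negative float is raised to a fractional power.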

6.2. The Euclidean distance and differentiate rate between two genomes. To directly compare two genomes of interest, we define (7) the Euclidean distance (Ed), the differentiate rate (Dr%), and the weighted differentiate rate (WDr%) between two genomes in pairs, which are calculated based on the geometric means of the geometric centers of genome fingerprint maps. Again, the values $(Gm_a, Gm_b, \bar{x}, \bar{y}, \bar{z})$ are not absolute values but carry their signs (minus or plus), corresponding to their geometric centers of genome fingerprint maps in the same three-dimensional space.
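One explicit form of the three quantities that reproduces the values reported for the worked example in Section 6.3 is sketched below. It is a reconstruction under stated assumptions, not a quotation of equation (7): the subscripts $a$ and $b$ label the two genomes, and the absolute values in Dr% are assumed so that the rate is non-negative.

```latex
Ed = \sqrt{(\bar{x}_a - \bar{x}_b)^2 + (\bar{y}_a - \bar{y}_b)^2 + (\bar{z}_a - \bar{z}_b)^2},
\qquad
Dr\% = \frac{\lvert Gm_a - Gm_b \rvert}{\lvert Gm_a + Gm_b \rvert} \times 100,
\qquad
WDr\% = Dr\% \times Ed
```

Because $Gm_a$ and $Gm_b$ carry signs, the denominator can be much smaller than the numerator, which would explain differentiate rates well above 100%.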

6.3. Examples of the quantitative comparison between two genomes. As examples, thirty chromosomes (Table 1) give twenty-nine pairs of comparison (Table 3) as representatives for illustrating the principles. The rules can be summarized from these examples (Table 3). In general, the differentiate rates (Dr%) vary from family to family; the values of Dr% are lowest at strain/species level (<50%), higher at genus level (<500%), and higher still beyond family level (<1500%). Of course, there are numerous outliers under certain situations (Table 3) with challenging values in terms of the differentiate rate (Dr%), the weighted differentiate rate (WDr%), or the Euclidean distance (Ed). (Table 3). Evidently, these two strains have distinct values of the geometric center and geometric mean of the genome fingerprint maps, but the differentiate rate is less than 10%. Indeed, they had been characterized as two distinct but close strains within the same species, Sulfolobus islandicus. In addition, there are four close strains in this species, with differentiate rates ranging between 6.42% and 25.28% (Table 3). Another example compares two very distant strains (beyond family level), Sulfolobus islandicus Y.G.57.14 vs. Methanosphaera stadtmanae DSM 3091 (Figure 2, C) (Table 3), which is much greater than the values at genus level. These data together confirm that the two strains have diverged beyond the family level.

Furthermore, there are three remarkable exceptions (Table 3). First, within a single strain that has two chromosomes, the differentiate rate between the two chromosomes is close to the values between two species or genera, implying that the two chromosomes are divergent and each independently impacts the same strain. For instance, the differentiate rates of Halorubrum lacusprofundi 49239 vs. 49239-II (42.39%) and Haloarcula marismortui 43049 vs. 43049-II (30.68%) are close to certain values of the differentiate rates (e.g., 42.76%, 36.36%, 54.44%) at genus level within the same family, Halobacteriaceae. Second, within the same species, Escherichia coli, three strains (BL21(DE3), CB9615, CFT073) are extraordinary because the differentiate rate between UT189 and BL21(DE3) is 321.11%, which is far outside the range (3.36% to 36.42%) defined by the ordinary members of the same species; it is even much greater than the value of 25.10% between two external genera (Methanococcus voltae A3 and Methanosphaera stadtmanae 3091) in another family. Third, within the same family, Halobacteriaceae, the differentiate rates among different genera vary between 17.10% and 291.91%. Taken together, these data probably indicate that such strains (particularly those containing more than one chromosome) have been continuously growing and absorbing new components, so that they are potentially developing into new species. Moreover, from family to family, the genus levels are not within the same range of divergence in terms of the differentiate rates, implying that no universal boundary can be set for simply distinguishing all taxa.
Most importantly, although the differentiate rate (Dr%) is concise and efficient for most cases (Table 3), we also note that the weighted differentiate rate (WDr%) is more accurate in dealing with outliers, giving more reasonable inference through cross-validation after factoring the differentiate rate (Dr%) with the Euclidean distance (Ed). For example, two genera (Halogeometricum borinquense 11551 vs. Halomicrobium mukohataei 12286) seem very similar because of the tiny differentiate rate (1.58%), which results by chance from the very similar values of Gm (10615.71 vs. 10284.83); but they are actually quite different in terms of their geometric centers, (−2079.00, 1844.00, −312055.11) vs. (1900.50, −1177.50, −486140.66), resulting in larger values of the weighted differentiate rate (WDr% = 275711.29) and the Euclidean distance (Ed = 174157.24), which are close to the extents that distinguish the divergences between other genera in the same family. Thus, we suggest that either Dr% or WDr% can generally be used for inference (the former is concise while the latter is accurate); but when outliers arise, both have to be cross-referenced explicitly.
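The worked example can be checked numerically with the Python sketch below. It assumes the two geometric centers are (−2079.00, 1844.00, −312055.11) and (1900.50, −1177.50, −486140.66), and it assumes the forms Dr% = |Gm_a − Gm_b| / |Gm_a + Gm_b| × 100 and WDr% = Dr% × Ed, chosen because they reproduce the reported figures; the function and variable names are hypothetical.

```python
import math


def signed_cbrt(value):
    """Cube root that keeps the sign of its argument."""
    return math.copysign(abs(value) ** (1.0 / 3.0), value)


def compare_centers(center_a, center_b):
    """Gm, Ed, Dr% and WDr% for two geometric centers (assumed formulas)."""
    gm_a = signed_cbrt(math.prod(center_a))
    gm_b = signed_cbrt(math.prod(center_b))
    ed = math.dist(center_a, center_b)                 # Euclidean distance
    dr = abs(gm_a - gm_b) / abs(gm_a + gm_b) * 100.0   # undefined if gm_a == -gm_b
    return {"Gm_a": gm_a, "Gm_b": gm_b, "Ed": ed, "Dr%": dr, "WDr%": dr * ed}


# Geometric centers of the two genera, read with explicit minus signs.
CENTER_A = (-2079.00, 1844.00, -312055.11)
CENTER_B = (1900.50, -1177.50, -486140.66)
result = compare_centers(CENTER_A, CENTER_B)
```

Under these assumptions the sketch gives Gm values of about 10615.71 and 10284.83, an Ed of about 174157.24, and a Dr% of about 1.58, while WDr% agrees with the reported 275711.29 to within rounding. The example shows how two centers that are far apart can still have nearly equal Gm values, which is why Dr% alone can be misleading.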

We believe that performing what we call systematic comparative genomics, based on the geometrical analysis of genome sequences instead of pairwise base-to-base comparison, is a priority task in the post-genomic era. To our knowledge, however, no attention has been paid before the present study to comparing a number of genomes crossing genetic components (chromosomes, plasmids, and phages) and biological categories (bacteria, archaeal bacteria, and viruses) with far divergence over a large size range. In particular, no method for creating an unambiguous genome fingerprint (GF) has been documented; neither the universal genome fingerprint analysis (UGFA), nor the total genetic component configuration (TGCC), nor the systematic comparative genomics (SCG) has been proposed; and no method for quantitatively differentiating genome sequences has been developed on the basis of the outcome dataset of genome fingerprint analysis.

Remarkably, genome sequences crossing diverse genetic components (chromosomes, plasmids, and phages) or diverse biological categories (bacteria, archaeal bacteria, and viruses) have much less homology or even none, which is very challenging for any conventional method that is principally based on pairwise base-to-base homology analysis. In other words, no conventional method can compare such diverse genetic components and biological categories in-one-sitting, as we did in the present study. Therefore, it is not possible to compare other conventional methods with our methods as a whole system: the method of genome fingerprinting (GenomeFingerprinter), the method of universal genome fingerprint analysis (UGFA) (including the UGFM, UGFM-TGCC, and UGFM-TGCC-SCG tools), and the method of quantitative analysis (Gm, Ed, Dr%, WDr%) for the outcome dataset of the genome fingerprint analysis. In the present study, however, we compare partial features of our methods with those of other methods that are partly related to ours, and briefly discuss the future perspectives of quantitative analysis using the outcome dataset of the universal genome fingerprint analysis.

1. GenomeFingerprinter vs. Zplotter

1.1. Validity. The Zplotter program [16] is not used for the creation of what we call the "genome fingerprint (GF)". In fact, although some coordinates from the Zplotter program were used to produce hundreds of graphs (as open rough Z-curves) of microbial genomes that were documented as a database [17], there were no stable unique features in terms of the so-called genome fingerprints. For example, when we re-plotted the Halobacterium sp. NRC-1 genome sequence (NC_002607) using the Zplotter coordinates of either z_n' or z_n to present an open rough Z-curve (data not shown), the two plots were quite different from one another due to the wavelet transform in the algorithm of the Zplotter program [16]. In contrast, our method presented a unique circular plot with accurate and delicate genome fingerprints for the same sequence (data not shown). Note that using the z_n coordinates gave a plot similar to ours, except that it was an open rough Z-curve with fewer features, whereas using the z_n' coordinates created a plot completely different from ours (data not shown). We conclude that our GenomeFingerprinter method provides more accurate and delicate coordinates than the Zplotter program does, and is therefore valid for the subsequent applications that have been established by the Z-curve analysis. Of course, one should take care in choosing between z_n from our method and z_n' from the Zplotter program when addressing specific questions.

1.2. Reliability. We found a major problem when using the Zplotter program to handle circular genome sequences with cutting-point errors. For example, the same circular sequence of Halobacterium sp. NRC-1 (NC_002607) with two different cutting points (e.g., NC_002607_RC was re-cut at 700 kbp) was incorrectly presented as two different plots by using the Zplotter coordinates, whereas both scenarios were shown as exactly the same plot by using our method (data not shown). The reason for this difference is that the Zplotter program was designed for a linear sequence [16] and its algorithm depends on counting the absolute numbers of bases starting from the "first" base in a given linear sequence. Meanwhile, when a sequence is deposited in a linear form (regardless of its original linear or circular form), the documented first base is usually not guaranteed to be the real first one. Taken together, the same circular sequence with a cutting-point error that changes its real "first" base can result in a quite different plot by using the Zplotter program. In contrast, our method was initially created for a circular sequence (Figure 1) but has also been proved valid for a linear one, as exemplified earlier. This is not only because the linear form is a specific form of the circular one, but also because formula (1) described earlier ensures that our method measures the relative distance in a circular form, rather than the absolute numbers of bases counted from the "first" base in a linear sequence. In other words, our method has been proved valid for both circular and linear forms regardless of where the cutting point is (i.e., where the "first" base is), overriding any possible cutting-point errors.
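The cutting-point effect described above can be reproduced with a minimal sketch. It assumes the standard cumulative Z-curve definition (x for purine vs. pyrimidine, y for amino vs. keto, z for weak vs. strong hydrogen bonding); it is not the Zplotter source code, and it does not implement formula (1) of our method. The function names `z_curve` and `recut` and the toy sequence are hypothetical, chosen only for illustration.

```python
from itertools import accumulate

# Per-base step of the standard Z-curve (assumed definition, not Zplotter's code):
# x: purine (A, G) vs. pyrimidine (C, T)
# y: amino (A, C) vs. keto (G, T)
# z: weak (A, T) vs. strong (G, C) hydrogen bonding
STEP = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}


def z_curve(seq):
    """Cumulative coordinates counted from the first base of the written sequence."""
    steps = (STEP[base] for base in seq.upper())
    return list(
        accumulate(steps, lambda p, s: (p[0] + s[0], p[1] + s[1], p[2] + s[2]))
    )


def recut(seq, k):
    """The same circular molecule, written down starting k bases later."""
    return seq[k:] + seq[:k]


genome = "AAGGCTTACG"  # toy circular sequence (hypothetical data)
original = z_curve(genome)
shifted = z_curve(recut(genome, 4))

print(original)
print(shifted)
# The two coordinate lists differ although the molecule is the same;
# only the end point, which depends on base composition alone, is shared.
print(original != shifted, original[-1] == shifted[-1])
```

Because every coordinate is a count from the written "first" base, moving the cutting point rearranges the whole curve, which is the behaviour a start-independent (relative-distance) formulation is meant to avoid.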

1.3. Adaptability. We further emphasize the scientific reasons why it is critical to deal with circular genomes, an issue that has been overlooked in the literature.

Theoretically, most microbial genomes are circular strands, whose relatively simple structure protects them from natural degradation. In other words, the circular form is much more stable than the linear form in living cells. In most cases, circular genomes and their linear forms change into one another when and only when they are at certain functional stages of living cells, such as rolling-circle replication and plasmid-mediated conjugation. Most importantly, the circular and linear forms both function genetically and physiologically in a coordinated way for a given genome in a given living microbe. That is, the forms are interchangeable in response to real living conditions. Therefore, we can capture the circular status of genomes during their life cycles.

Technically, research groups worldwide have not yet agreed on a unified practice that guarantees that all genome sequences are deposited in their correct forms. In fact, most sequences deposited in public databases (such as GenBank) so far are neither in their natural order starting from the real "first" base, nor in the assumed direction from 5' to 3'. We thus have to tackle such cutting-point errors, as illustrated by the examples earlier. Fortunately, the RD formula (1) in our method can virtually treat an arbitrary linear sequence as a circular one (Figure 1), avoiding the impact of any cutting-point errors that exist in publicly deposited sequences.

Informatively, the closed (circular form) genome fingerprints carry much more sensitive information for genome-wide comparative genomics at the genome fingerprint level (Figure 3). Our method can precisely calculate a set of three-dimensional coordinates for a given circular or linear sequence, with or without a correct cutting point, which accordingly presents a stable, unique genome fingerprint map and further guarantees the validity of the universal genome fingerprint analysis.

To conclude, the GenomeFingerprinter method has great advantages over the Zplotter program in creating unambiguous sets of coordinates, which are valid for the subsequent applications that have been established by the Z-curve analysis [18, 19, 20, 21, 22, 23].

2.1. Efficiency. The Mauve program (a typical algebraic-type approach), combining both computing and plotting, is commonly used for pairwise base-to-base comparison and visualization [14, 15]. However, it has difficulty when dealing with a number of larger genome sequences due to its inner computational constraints, being either too slow or running into memory overflow. In contrast, our method can rapidly calculate and visualize, separately, tens of large genomes, and is much faster than the Mauve program in terms of time complexity [O(n) vs. O(n^2)] (data not shown). Furthermore, with our method under our hardware conditions (2 GB of physical memory and 32-bit graphics software), more than one hundred genome sequences can be plotted in one sitting (Figure 7). Only plotting numerous larger graphics in one sitting would cause memory overflow. Most importantly, our method performs calculation and visualization separately, which not only ensures higher efficiency for a large set of genomes, but also provides the output dataset for the universal genome fingerprint analysis (Figures 2, 3, 4, 5, 6, 7) and quantitative analysis (Table 3).

2.2. Prediction. The Mauve program [14, 15] can only visualize what a sequence is, but cannot predict what it should be without a reference sequence or specific prior knowledge. In contrast, our method provides the universal genome fingerprint map (UGFM) (either the P-GFM or the S-GFMs), which can intuitively identify unique genome features such as genome-wide evolution events and replication origins (ori points) (Figure 3) that have been characterized in the literature [22, 23, 24, 25].

2.3. Compatibility. The universal genome fingerprint analysis (UGFA) predicted the subtle variations (Figure 3, C, D, G) indicating genome-wide evolution events (Figure 3, C). We then used the Mauve program to compare two chromosomes pairwise and confirmed such events (data not shown), demonstrating that the UGFA method could rapidly predict the evolution events while the Mauve program could precisely confirm such predictions. Thus, we recommend the UGFA method and the Mauve program as compatible partners, taking advantage of ours for rapid, intuitive prediction in general (Figures 3, 6) and of Mauve for slower, precise confirmation in detail, particularly focusing on the gain, loss, and rearrangement of the targeted fragments (data not shown).

Likewise, among nineteen genomes (Figure 6), including six chromosomes and thirteen plasmids with a large size range (6 Kbp to 4 Mbp) belonging to four strains crossing four genera of halophilic Archaea (Table 1), the rare homology was mapped only by the progressiveMauve mode [14] (data not shown), whereas the Mauve mode [15] failed in such a comparison because it stopped due to the lack of essential homology, as we predicted beforehand. Yet the Mauve mode [15] worked well with the subset of either the thirteen plasmids or the six chromosomes, confirming their partial homology (data not shown). In other words, the UGFM-TGCC-SCG tool can not only handle the exceptional situations for a large set of genomes, but also facilitate the effective integration of the Mauve program into performing the so-called systematic comparative genomics among a large set of genome sequences crossing diverse genetic components (chromosomes, plasmids, and phages) and diverse biological categories (bacteria, archaeal bacteria, and viruses) with far divergence (little or no homology) over a large size range (e.g., 6 Kbp to 4 Mbp) (Figures 6, 7). Meanwhile, the progressiveMauve mode [14] is compatible with the UGFA method (including the UGFM, UGFM-TGCC, and UGFM-TGCC-SCG tools), whereas the Mauve mode [15] is not, although it can still be used to deal with subsets of the genomes in question.

Taken together, we conclude that the UGFA method (including the UGFM, UGFM-TGCC, and UGFM-TGCC-SCG tools) has advantages over the Mauve program [14, 15] in dealing with a set of genomes of little or no homology. In particular, we recommend that any components with farther divergence be rapidly prescreened by using the UGFM-TGCC-SCG tool, which can guide the selection of subsets in question for the subsequent comparisons by using the appropriate mode of the Mauve program [14, 15].

The main purpose of the present study is to develop a novel method, GenomeFingerprinter, which takes a geometric approach to intuitively visualize a genome sequence in order to distinguish numerous genome sequences through their images. Namely, it is designed to extract the meaningful information while reducing the massive noise from the original millions of base pairs of genome sequences. Accordingly, there is no intention to go backward to perform extensive statistical analysis on such massive discrete data in a traditional way. Rather, we have developed a method of quantitative analysis that uses the outcome dataset of genome fingerprint analysis. In particular, we have defined the geometric center (x̄, ȳ, z̄) and its geometric mean (Gm) for a given genome fingerprint map to determine the Euclidean distance (Ed), the differentiate rate (Dr%), and the weighted differentiate rate (WDr%) in order to quantitatively describe the difference between the two genomes under comparison. In fact, the applications with certain examples (Table 3) have demonstrated that the differentiate rates generally vary from family to family, from lowest at the strain/species level (~50%) to higher at the genus level (~500%) to even higher beyond the family level (~1500%), which seems promising as a basic rule for setting up general boundaries at certain levels of taxonomical units.
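As a rough illustration of the first two quantities, the sketch below assumes that the geometric center is the arithmetic mean of the x, y, and z coordinates of a fingerprint map, and that Ed is the ordinary Euclidean distance between two such centers. These are assumptions made for illustration; the exact definitions of Gm, Dr%, and WDr% are those given with the method and are not reproduced here. The function names and the coordinate lists are hypothetical.

```python
import math


def geometric_center(points):
    """Arithmetic mean of the x, y and z coordinates (assumed definition)."""
    n = len(points)
    if n == 0:
        raise ValueError("need at least one coordinate triple")
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def euclidean_distance(center_a, center_b):
    """Ordinary Euclidean distance between two geometric centers."""
    return math.dist(center_a, center_b)


# Hypothetical coordinate sets standing in for two genome fingerprint maps.
map_a = [(0.0, 0.0, 0.0), (2.0, 4.0, 6.0), (4.0, 2.0, 0.0)]
map_b = [(1.0, 1.0, 1.0), (3.0, 5.0, 7.0), (5.0, 3.0, 1.0)]

center_a = geometric_center(map_a)
center_b = geometric_center(map_b)
print(center_a, center_b, euclidean_distance(center_a, center_b))
```

Keeping the center as a three-component vector, rather than collapsing it to one number, preserves direction as well as magnitude when two maps are compared.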

However, we note the limitations of the method at its current stage. As stated earlier, the data (Table 3) demonstrated that, from family to family, the genus levels were not within the same range of divergence in terms of the differentiate rates, implying that no universal boundary can be set up for simply distinguishing all taxa. We thus recommend that inferences based on the (weighted) differentiate rate and the Euclidean distance be made under clear biological contexts, for two major reasons. First, such inferences should not be made solely on the basis of the differentiate rates when outliers are encountered (Table 3). For instance, for Halogeometricum borinquense 11551 (NC_014729) vs. Halomicrobium mukohataei 12286 (NC_013202), the two genera seemed very similar (Dr% = 1.58%) by chance, as a result of very similar values of Gm, but they were actually quite different in terms of their geometric centers, which was also verified by the large values of the Euclidean distance and the weighted differentiate rate (Table 3). Second, it is still unclear how to determine a precise boundary corresponding to the taxonomical hierarchy, because we found that the differentiate rates of outliers varied dramatically (Table 3), implying that no such boundary can be determined with current knowledge. We thus caution that there is a huge gap to be filled before upper and lower boundaries can eventually be set up in the real world for different levels of taxa (strains, species, genera, families, and beyond).

Meanwhile, we have only established the method of quantitative analysis to compare two genomes in pairs (Table 3). To perform intensive statistical analysis on a number of genomes as one sample or two samples, we suggest that a more sophisticated method be developed first, which is beyond the scope of the present study. For example, because very few genome sequences are available within certain taxonomic units, resulting in very small sample sizes, the traditional methods of statistical inference and hypothesis testing (such as the normal z-test and Student's t-test) would not be appropriate. As such, we suggest that resampling-based methods, such as permutation (randomization) tests or the bootstrap, be developed for such statistical analyses in order to make better use of the outcome dataset of the genome fingerprint analysis. To this end, the geometric center, the Euclidean distance, and the (weighted) differentiate rate are potential statistical estimators that are worth exploring further with more real-world data at large scale in the future.
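To make the suggestion concrete, the following is a minimal sketch of an exact two-sample permutation test on the difference in group means, which is feasible precisely because the samples are very small. It is an illustration only, not a method developed or validated in the present study; the function name `permutation_p_value` and the numeric values (standing in for differentiate rates of two groups of genome pairs) are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the absolute difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    hits = 0
    total = 0
    # Enumerate every way of relabelling n_a of the pooled values as "group a".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that ties are not lost to floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical differentiate rates (%) for two small groups of genome pairs.
within_genus = [12.0, 25.0, 41.0]
between_genera = [310.0, 420.0, 505.0]

print(permutation_p_value(within_genus, between_genera))
```

With three values per group there are only 20 possible relabellings, so the smallest attainable two-sided p-value is 2/20 = 0.1; this shows directly how strongly the available sample size limits any such inference.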

We have developed the methodology of what we call systematic comparative genomics, based on the genome fingerprint and the universal genome fingerprint analysis. First, we have created a method, GenomeFingerprinter, to unambiguously produce the three-dimensional coordinates from a sequence, followed by one three-dimensional plot and six two-dimensional trajectory projections, to illustrate the genome fingerprint of a given genome sequence. Second, we have developed a set of concepts and tools (3D-P, 2D-TP, GF, GFM, P-GFM, S-GFM, UGFM, TGCC, UGFM-TGCC, SCG, UGFM-TGCC-SCG), and thereby established a method called the universal genome fingerprint analysis (UGFA). In particular, we have demonstrated that the UGFM, UGFM-TGCC, and UGFM-TGCC-SCG tools have great advantages over other conventional methods. Third, we have constructed a method of quantitative analysis to compare two genomes by using the outcome dataset of genome fingerprint analysis. Specifically, we have defined the geometric center (x̄, ȳ, z̄) and its geometric mean (Gm) for a given genome fingerprint map, followed by the Euclidean distance (Ed), the differentiate rate (Dr%), and the weighted differentiate rate (WDr%) to quantitatively describe the difference between the two genomes under comparison. Moreover, we have demonstrated the applications through case studies on various genome sequences crossing diverse genetic components (chromosomes, plasmids, and phages) and diverse biological categories (bacteria, archaeal bacteria, and viruses) with far divergence (little or no homology) over a large size range (4 kilo- to 5 mega-base pairs per sequence), giving insights into critical issues in microbial genomics and taxonomy. We therefore anticipate that these comprehensive methods can be widely applied to the so-called systematic comparative genomics at large scale in the post-genomic era.

Genome sequences used in this study were downloaded from NCBI or derived from this study, and are listed in Table 1 and Table 2.

We have implemented our method as an in-house script (GenomeFingerprinter.exe), which is available upon request from the corresponding author. The Zplotter (v1.0) and Mauve (v2.3.1) programs used in this study can be downloaded from the following links: Zplotter.exe at http://tubic.tju.edu.cn/zcurve/ and Mauve at http://gel.ahabs.wisc.edu/mauve/. Any graphics tool can be used to plot graphics from the coordinates.

Visualizing genomes: techniques and challenges

Machine-readable DNA sequences

A simple way to look at DNA

H-Curves, a novel method of representation of nucleotide series especially suited for long DNA sequences

Novel DNA sequence representations

ADN-viewer: A 3D approach for bioinformatic analyses of large DNA sequences

A 3D pattern matching algorithm for DNA sequences

Gapped BLAST and Psi-BLAST: A new generation of protein database search programs

CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions specific gap penalties and weight matrix choice

BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons

ACT: The artemis comparison tool

A simple vectorial representation of DNA sequences for the detection of replication origins in bacteria

progressiveMauve: Multiple genome alignment with gene gain, loss, and rearrangement

Mauve: multiple alignment of conserved genomic sequence with rearrangements

Z Curves, An intuitive tool for visualizing and analyzing the DNA sequences

The Z curve database: a graphic representation of genome sequences

ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes

Coronavirus phylogeny based on a geometric approach

Segmentation algorithm for DNA sequences

GC-Profile: a web-based tool for visualizing and analyzing the variation of GC content in genomic sequences

Single replication origin of the archaeon Methanosarcina mazei revealed by the Z curve method

Multiple replication origins of the archaeon Halobacterium species NRC-1

Making sense of an alphabet soup: the use of a new bioinformatics tool for identification of novel gene islands

Identification of two origins of replication in the single chromosome of the Archaeon Sulfolobus solfataricus

Teaching bioinformatics: A student-centred and problem based approach