key: cord-0701644-yr8r080l
authors: Pavlović-Lažetić, Gordana M.; Mitić, Nenad S.; Beljanski, Miloš V.
title: n-Gram characterization of genomic islands in bacterial genomes
date: 2009-03-31
journal: Computer Methods and Programs in Biomedicine
DOI: 10.1016/j.cmpb.2008.10.014
sha: 650f81f4234b8adc6190c3a20625565f7ffa42a8
doc_id: 701644
cord_uid: yr8r080l

Abstract The paper presents a novel, n-gram-based method for analysis of bacterial genome segments known as genomic islands (GIs). Identification of GIs in bacterial genomes is an important task since many of them represent inserts that may contribute to bacterial evolution and pathogenesis. In order to characterize and distinguish GIs from rest of the genome, binary classification of islands based on n-gram frequency distribution have been performed. It consists of testing the agreement of islands n-gram frequency distributions with the complete genome and backbone sequence. In addition, a statistic based on the maximal order Markov model is used to identify significantly overrepresented and underrepresented n-grams in islands. The results may be used as a basis for Zipf-like analysis suggesting that some of the n-grams are overrepresented in a subset of islands and underrepresented in the backbone, or vice versa, thus complementing the binary classification. The method is applied to strain-specific regions in the Escherichia coli O157:H7 EDL933 genome (O-islands), resulting in two groups of O-islands with different n-gram characteristics. It refines a characterization based on other compositional features such as G+C content and codon usage, and may help in identification of GIs, and also in research and development of adequate drugs targeting virulence genes in them.

Many bacterial genomes have been shown to contain specific genomic regions, known as islands. Islands that were acquired by horizontal gene transfer (HGT) events among bacteria are designated as genomic islands (GIs) and may contribute to their adaptability. Genes encoded in GIs offer various functions, e.g., additional metabolic activities, the capability of symbiosis with other organisms, antibiotic resistance and secretion, etc. [1] . The group of GIs that contain a variety of virulence factors, providing for specific host recognition,

Abbreviations: HGT, horizontal gene transfer; GI, genomic island; OI, O-island; PAI, pathogenesis island; CU, codon usage.

penetration and colonization of the host organism, and the ability to overcome host defense systems, are known collectively as pathogenicity islands (PAIs). GIs are identified and characterized by different compositional features, such as biased G + C content, codon usage (CU), dinucleotide signature contrasts, amino acid contrasts [2] , and different functional features such as the presence of virulence genes and mobility (e.g., integrases, transposases) genes, or structural features such as the presence of proximal tRNA and/or rRNA gene(s) and repeats at their boundaries, presence of insertion sequence elements, origin of plasmid replication, etc. [3, 4] . In this paper we apply a linguistic method -exhaustive ngram analysis -to already annotated islands in an attempt to characterize GIs more precisely than proved possible using earlier techniques, and to understand their structure better. We illustrate the method on the Escherichia coli O157:H7 EDL933 genome, a member of genus Escherichia of Enterobacteriaceae phylum and a well known and important experimental, medical and biotechnological organism [3, 5] .

The paper is organized in the following way. Section 2 surveys different methods and algorithms for identification and prediction of GIs, including the n-gram technique and its applicability to characterization of different types of texts. It also outlines the authors' prior work in the field. Section 3 describes the three steps of the GI characterization procedure. We first perform n-gram statistical analysis of islands, for different n, in order to classify them according to (dis)agreement with the complete genome. Next, we apply other compositional features (G + C content, CU) to islands and calculate statistical measures-recall, precision, sensitivity and specificity for the results of n-gram classification so as to examine how the n-gram feature contributes to characterizing GIs. Then we identify significantly overrepresented and underrepresented n-grams based on the maximal order Markov model, which may be used as a basis for Zipf-like analysis and for classification of islands based on such n-grams [6] . Section 4 presents computational results obtained for the E. coli EDL 933 genome. In Section 5 we offer our conclusions and outline some future plans.

An n-gram, as introduced by Shannon in 1948 [7] , is a subsequence of length n of a sequence over the given alphabet. The sequence may be a message in a natural or artificial language, a discrete approximation of a continuous signal, e.g., speech, or any sequence of symbols generated by a stochastic process. Any such "text" can be approximated by the set of n-gram statistical data (e.g. frequency distribution and the respective mean and standard deviation), and two such texts may be compared based on the distance of such approximations. n-Grams of length 2, 3 and 4 are usually called bigrams, trigrams and tetragrams, respectively, and for higher values of n-simply n-grams. Formally, as defined in Vinga and Almeida [8] , a sequence X of length k is a linear succession of k symbols from a finite alphabet, A, of cardinality |A| = r. A segment of n consecutive symbols from the sequence X (n ≤ k) is an n-gram (n-tuple, n-word, n-plet, n-mer) of the sequence X. There are L = r n different n-grams over the alphabet A, {w 1 , w 2 , . . . w L }. There are k − n + 1 overlapping n-grams in the sequence X. Some authors use n-grams in a broader sense not assuming contingency of symbols but a distance of a given length between them (spacer).

If c i denotes the number of occurrences of the n-gram w i (i = 1, 2, . . ., L) in the sequence X, and f i denotes relative frequency of the n-gram w i in the sequence X (f i = c i /(k − n + 1)), then a vector of n-gram counts, c X n = (c 1 , c 2 , ..., c L ), as well as a vector of n-gram frequencies, f X n = (f 1 , f 2 , . . . , f L ) may be associated with the sequence X. For example, for DNA sequences, A = {A, C, G, T}, r = 4; for n = 2, number of all possible bigrams is L = r n = 16 and the set of all possible bigrams  will be {AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC,  TG, TT}. For the sequence X = ATATAC, where k = 6, there are k − n + 1 = 5 bigrams, determined by sliding a two letter window: AT, TA, AT, TA, AC, so the vector of n-gram counts will be (0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0), and the vector of n-gram frequencies will be (0, 0.2, 0, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0.4, 0, 0, 0).

The dissimilarity of two sequences, X and Y, may be defined by a distance function computed in the vector spaces of either n-gram counts or n-gram frequencies.

In general, n-gram analysis proves effective, regardless of the type and origin of the text analyzed. n-Grams were first used in the domain of natural language text analysis, for different tasks such as text compression [9, 10] , spelling error detection and correction [11, 12] , language identification [13] , automatic text categorization [14] , authorship attribution [15] , etc. For example, Damashek [16] reports the automatic classification of a whole library of multilingual documents based on topical similarity which is determined by using n-gram s and Euclidian distance in the vector space of n-gram frequencies.

n-Gram-based methods have also been applied to protein, proteome and genome sequences, for different purposes-to measure sequence similarity and reconstruct phylogenetic trees without sequence alignment, for protein classification, genome characterization, etc. As early as in 1967, Krzywicki and Slonimski compared the expected with the observed frequencies of amino acid bigrams with distance and showed that for certain distances, statistically highly significant deviations are present in proteins [17] . Radomski and Slonimski applied bigram analysis to a set of ribosomal protein sequences some thirty years later [18] to develop the notion of the genomic "style" of proteins, and the concept of n-grams with a spacer was used by Rosato et al. [19] to analyze the thermal dependencies of different proteomes, with some observed anomalies in n-gram distribution at certain distances (spacer lengths).

Deviation of observed from expected n-gram frequencies further aided the investigation of overrepresented and underrepresented n-grams, both in nucleotide and in protein composition. Phillips et al. [20, 21] , Colosimo et al. [22] , Schbath et al. [23] , Gelfand and Koonin [24] , Karlin et al. [25] , Karlin and Burge [26] , Rocha et al. [27] , Pevzner et al. [28] , Karlin et al. [29] , Burge et al. [30] and Schbath [31] , investigated identification of over-and underrepresented oligonucleotides and different methods for calculating the expected number of oligonucleotides and the comparison of the expected against the observed number of oligonucleotides. A comparison of different statistical measures of bias of oligonucleotide sequences in the DNA sequences of bacterial genomes is given by Elhai [32] . Noncontiguous sequences were considered since they exhibited significant bias, and the corresponding methods proved more efficient than Markov analysis at the highest order. Reinert et al. [33] reviewed statistical and probabilistic properties of words in sequences, with emphasis on the deductions of exact distributions and the evaluation of their asymptotic approximations.

n-Grams were also used for alignment-free sequence comparison, in order to reconstruct phylogeny trees or to classify proteins or nucleotide sequences. The earliest work systematizing the use of n-gram counts for sequence comparison [34] [35] [36] used the difference between two DNA sequences by the squared Euclidean distance between their transition matrices. Statistical distribution of the number of n-gram matches between two random sequences was analyzed by Karlin [37] and Lippert et al. [38] . Three asymptotic distributions were identified including compound Poisson, normal and uniform distribution. Radomski and Slonimski [39] recently proposed a method of representing and analyzing complete genome sequences, based upon the frequencies of amino acid bigrams at distance. They also gave a thorough review of the work done on alignment-free sequence comparison.

n-Gram composition has been applied to phylogenetic analysis of genes and species by Stuart et al. [40] , Edgar [41] , Qi et al. [42] , and for classification of both protein and nucleotide sequences. Protein classification based on n-grams was investigated by Solovyev and Makarova [43] and Cheng et al. [44] . Daeyaert et al. [45] studied the characteristics of the amino acid composition of n-grams, and investigated two statistics (termed commonality and specificity) derived from n-gram counts, applicable to protein classification. Ganapathiraju et al. [46, 47] presented a biological language toolkit for statistical n-gram amino acid analysis and comparison of protein sequences. King and Guda [48] presented an n-gram-based Bayesian classifier that predicts the localization of a protein sequence. The classification accuracy of n-gram composition metrics was reviewed by Vinga and Almeida [8, 49] , together with a new definition of distance between protein sequences. Volkovich et al. [50] applied n-grams to the classification of DNA sequences considered as text over the four-letter alphabet {A, C, G, T}. Kirzhner et al. [51] used a predefined set of n-grams over the same alphabet, the so-called "compositional spectrum", for characterizing DNA sequences. Tomović et al. [52] introduced an n-gram-based method for classification and clustering of genome sequences.

Different n-gram-based methods for identification of compositionally distinct regions, prophages, functional sites were proposed by Rajan et al. [53] , Srividhya et al. [54] and Tobi and Bahar [55] .

The authors' own experience with n-gram technique is promising. In Pavlović-Lažetić et al. [56] we applied n-gram distribution analysis to classification of SARS Coronavirus isolates and argued that different classes were characterized by different overrepresented n-grams (for n up to 8). In Vitas et al. [57] we reported computational results obtained for n-gram analysis of word sequences in natural language texts from different sources. In Mitić et al. [58] we applied n-gram analysis to predicting GIs in bacterial genomes using an algorithm based on n-gram bias between a window (sliding over the genome) and the complete genome.

Although the most reliable methods for predicting GIs and their origin are based on gene function and relations within a gene family, compositional methods can support them greatly. Since both approaches may generate false positives and false negatives due to different factors [2] , combining the results of multiple techniques may be beneficial in determining whether a gene or a group of genes has been acquired by HGT. The functional, compositional, or combined approaches, resulted in a number of algorithms for GI/PAIs identification and prediction, and in several databases of predicted islands in bacterial and acrchaeal genomes, such as-IslandPath [59] , Islander [60] , Score-based Identification of Genomic Islands (SIGI) [61] and Pathogenicity Island DataBase (PAIDB) [62, 63] . IslandPath [59] incorporates both DNA sequence signal features (G + C frequency, dinucleotide bias) and annotation features (the presence of tRNA or rRNA, transposase or integrase genes) to aid the identification of GIs. Islander [60] coordinates several pre-existing computer programs and consists of 11 steps including tRNA (tmRNA) and integrase gene identification and then their passage through several filters sequentially. In SIGI procedure [61] , a scoring scheme on codon frequencies is applied and clusters of genes significantly deviating from species-specific value were predicted as GIs. In PAI DB [62, 63] , PAIs (which are a subgroup of GIs) are detected by combining sequence similarities (ORFs, RNA genes and repeat regions) and abnormalities in genomic composition (G + C frequency and CU).

Newly inserted sequences in a bacterial genome may deviate from the backbone sequence in G + C content as well as in nucleotide sequences (e.g., as a consequence of CU). It should thus be expected that n-gram analysis may contribute to genomic sequence characterization and classification since it encompasses both criteria. We are currently investigating application of n-gram analysis to bacterial genomes in order to characterize their genome structure and to relate it to genome organization. The results we presented in Mitić et al. [58] suggest that n-gram analysis does contribute to GI prediction.

The complete genome of the E. coli O157:H7 EDL933 (Gen-Bank database accession number NC 002655), is considered since it has segments annotated as "islands". Its genome consists of a single circular DNA of 5,528,445 bp in length, 5453 genes of total length 4,885,090 bp and inter-gene sequences of length 643,355 bp in total. There is a 4001 bp long sequence of "N"s (unspecified nucleotides) in the genome (positions 1,725,749-1,729,749) and 6,641 incompletely specified nucleotides. There are 177 islands, designated as O-islands (OIs) with the total sum of length 1,278,307 bp. The backbone has a length of 4,250,138 bp. There are 4832 incompletely specified nucleotides in all the OIs. Approximately 26% of the E. coli EDL933 genes (1387/5416) lie completely within OIs, consisting mainly of phage or phage related sequences, and genes that may contribute to pathogenesis (i.e., PAI-like sequences), while others may confer strain-specific abilities to survive in different niches, or represent neutral variation between strains [5, 64] . Nine of the OIs (larger than 0.15 kilo base (kb)) encode for at least one large PAI (OI #148), and putative virulence factors having no obvious role in virulence [5, 64] . The two longest OIs, #43 and #48, are almost identical in length and nucleotide content. The OI #28, although 25,164 bp long, codes for four proteins, one of them being 5,189 amino acids in length (i.e., 15,567 bp; it is the longest protein, putative RTX family exoprotein). The 4001 bp long "N· · ·N" sequence is contained in the OI #52 [5, 64] .

Statistical analysis was performed for testing the agreement of n-gram frequency distributions between OIs and the complete genome. It was also applied to comparison of random sequences from the backbone of the genome, with the complete genome sequence. The complete E. coli EDL933 genome contains all the different n-grams for n = 3-7, while 0.15% of all the 8-grams are missing. Bigrams were not considered since there are only 16 different forms and not enough data for testing the corresponding hypothesis.

As far as OIs are concerned, we analyzed only those of length ≥10 kb, since 10 kb is considered as a lower limit of significant GI length [65, 66] . There are 26 such OIs. All of them contain all the 64 different trigrams and all the 256 different tetragrams.

Statistical analysis of n-gram frequency distribution was performed using the SPSS package [67] and programs written in C for processing MySQL data. MS Excel charts were used for representing the corresponding histograms.

For testing agreement of n-gram frequency distribution we used the Mann-Whitney U-test for two independent samples [67] . It is a rank-based test of whether two sampled populations are equivalent in location. If the populations are identical in location, the ranks of combined observations should be randomly mixed between the two samples. The number of times a score from group 1 precedes a score from group 2 and the number of times a score from group 2 precedes a score from group 1 are calculated. The Mann-Whitney U statistic is the smaller of these two numbers.

In our investigations, the Mann-Whitney U-test was used to test the hypothesis that an OI sequence had the same n-gram frequency distribution as the complete genome sequence, for different n since the hypothesis, if rejected, might support n-gram characterization of OI sequences. It was also used to test the hypothesis that a random sequence from the backbone of the genome had the same n-gram frequency distribution as the complete genome sequence, since the hypothesis, if not rejected, might further support n-gram characterization of OI sequences. A sequence, either an OI or from the backbone, is n-gram-characterized, if the hypothesis that it has the same n-gram frequency distribution as the complete genome is rejected.

We applied some commonly used measures for estimating the quality of our algorithm with respect to simpler criteria such as high or low G + C content (G + C content deviating significantly from the average genome G + C frequency) [2] . As a measure of significant deviation we used two standard deviations from the average genome G + C frequency over the subsequences of a given length.

First, we applied recall and precision as the most frequent and basic measures for information retrieval effectiveness [68] . Recall is the proportion of relevant items that are retrieved, and precision is the proportion of retrieved items that are relevant. When applied to n-gram characterization of OIs, relevant items were considered to be GenBank annotated OIs longer than 10 kb that significantly deviated in G + C content, and retrieved items were the n-gram characterized OIs.

Next we applied sensitivity and specificity, the two related statistical measures of how accurately a binary classification test identifies cases that meet the condition under study (sensitivity) and cases that do not meet the condition under study (specificity). In our case, the binary classification test was the OI n-gram characterization, and a condition was OI G + C content deviation. Sensitivity is basically the same as recall and it is the proportion of true positives that are correctly identified by the test (true positives are actually relevant items, and correctly identified are retrieved items). Specificity is the proportion of true negatives that are correctly identified by the test (true negatives are all the non-relevant items).

Sensitivity is calculated by the following formula: In the case of OI characterization, true positives are the OIs deviating in G + C content that are identified by the n-gram characterization test, false positives are the OIs non-deviating in G + C content that are identified by the n-gram characterization test, true negatives are the OIs non-deviating in G + C content that are not identified by the n-gram characterization test and false negatives are the OIs deviating in G + C content that are not identified by the n-gram characterization test.

Codon usage analysis CU analysis of OIs was performed and used as another criterion for comparison with the complete genome sequence. CU bias was determined by a program written in C using MySQL C-API, according to the method described in Karlin [2] . CU bias of the gene family in a sequence S with respect to the gene family in a sequence T is calculated by the formula:

where p a (S) is the average frequency of an amino acid a in the gene family of S, s(x, y, z) is the average codon frequency for a codon (x, y, z) in the sequence S, normalized so that (x,y,z)=a s(x, y, z) = 1 (similarly for t(x, y, z))

A CU bias was considered significant if it was more than one standard deviation higher than the mean value of CU bias for all the OIs considered.

With respect to n-grams, sequences may be characterized not only by their overall frequency distribution, but by overrepresented or underrepresented n-grams. We applied the statistic introduced in [23] to detect over-and-under represented n-grams in OIs sequences, the complete genome and its backbone, based on the maximal order Markov model of the sequences. The method estimates the bias of the observed to the expected frequencies of n-grams and tends to neutralize excess in n-grams which is due to excess in A + T, G + C, or oligonucleotide components solely. The so-called z M statistic is the normalized difference between the observed count and the expected number of a given word (n-gram in our case), calculated by using the maximal order Markov model [23, 31, 32] : 

The estimated standard deviation of the difference between N and N T E M , , is defined in [29] as

The z M statistic has an approximately normal distribution and is normalized to the standard deviation, so that a z-value higher than +1 indicates that the word (n-gram) is overrepresented, while a z-value lower than −1 indicates that the n-gram is underrepresented.

A modification of Zipf-like analysis was applied in order to investigate differences between n-gram appearances in different sequences [46] . The original Zipf power law was designed for word (i.e., motif) occurrences in any kind of text, stating that the most frequent word is expected to be twice as frequent as the following most frequent word, etc. The modification applied concerns the relationship between the most overrepresented n-grams (for specific n, according to the z M statistic) in a specific sequence and their usage in all the other sequences considered. First, n-grams (for specific n) were sorted in descending order of z-bias in a chosen OI (or the complete genome), and then MS Excel charts were used to represent the z-values of the sorted n-grams in the chosen sequence and in all the other sequences (OIs, complete genome, backbone). The procedure was repeated for each OI. It may suggest that some of the n-grams are overrepresented in a subset of OIs and underrepresented in the backbone, or vice versa, thus characterizing the subset of OIs. We identified such n-grams for the corresponding subsets of OIs.

The results of Zipf-like analysis might suggest that some of the n-grams were exclusively overrepresented (or underrepresented) in a single sequence, which would give us a criterion for defining the sequence's n-gram nucleotide signature. We calculated such signatures for each OI as a set of n-grams which were overrepresented in that OI only, and underrepresented in the backbone of the genome, and also for n-grams which were underrepresented in that OI only, and overrepresented in the backbone.

Results and discussion Since 7-and 8-grams are quite rare in OIs, they cannot be used for testing agreement of frequency distribution between OIs and the complete genome. The hypothesis on the equality of 6-gram frequency distribution for two independent samples, using the Mann-Whitney test, is rejected with a significance level of 0.01 for the complete genome (as the first sample) and each of the OIs (as the second sample) except for the longest ones-#43 and #48. Since 6-gram distribution agreement seemingly relied on the length rather than the structure of the OI, 6-grams cannot be considered as reliable data for this kind of analysis.

The hypothesis on equality of 5-gram frequency distribution for two independent samples using the Mann-Whitney U-test is rejected with a significance level 0.05, for the complete E. coli EDL933 genome as the first sample and OIs #148, #28, #108, #115, #51, #138, #84, #8 and #30 as the second sample. This group of OIs may be designated as the 5-gram Disagreeing Class (5DC). For all the other OIs, the hypothesis cannot be rejected. They form an Agreeing Class (AC). Some of the 5DC OIs have low G + C content (OIs #148, #108, #115, #84 and #8 -40.91%, 40.41%, 36.90%, 36.24% and 41.30%, respectively), OI #28 has a high G + C content (59.15%) -see Table 1a , so that their disagreement with the complete genome in n-gram frequency distribution may be considered to be a direct consequence of their G + C content. But three of the 5DC OIs-#51, #138, #30 have about average G + C content and their disagreement with the complete genome in 5-gram frequency distribution cannot be attributed to their G + C content.

Furthermore, we randomly selected a 37 kb long subsequence from each of the backbone sequences (lying between two consecutive OIs) longer than 50 kb. We then applied the Mann-Whitney U-test to 5-gram frequency distributions in the complete genome (as the first sample) and each of the selected backbone subsequences (as the second sample)-see Table 1b . The length of 37 kb was selected as the average length of OIs longer than 10 kb. There are 25 backbone sequences longer than 50 kb, and thus 25 subsequences of length 37 kb were selected. Only one of them (positions 3,331,041-3,368,040) disagrees with the complete genome in 5-gram frequency distribution. Thus 5-gram frequency distribution may be considered as an OI characterizing feature. Fig. 1 represents 5-gram histograms for the complete E. coli EDL933 genome The Mann-Whitney two-sample U-test of equality of 5-gram frequency distributions is applied to OIs of the E. coli EDL933 of length >10 kb as the first sample, and the complete genome as the second sample (a). Included are OI number (OI#), its position in the genome (Position), z-value and p-value (Asymptotic Significance) of the test, G + C content and CU bias of the sequence, as well as possible origin of a sequence it is related to (Related to) and the element(s) it codes for (Coding for)-according to [5, 64] . The same test is applied to sequences from the backbone, as the first sample, and the complete genome E. coli EDL933, as the second sample (b). Included are the sequence number (No.), its position in the genome (Start, End), z-value and Asymptotic Significance (p-value) of the test, and the G + C content of the sequence. Asymptotic Significance in bold denotes low p-value (<0.05); G + C content data in bold denote high percentage (>58%), in italic-low percentage (<42%); CU bias data in bold denotes high CU bias (>0.30).

sequence, the backbone, OIs from the AC class (a) and OIs from the 5DC class (b). The similarity in histogram shapes among sequences in Fig. 1 (a) may be noticed, as well as their dissimilarity with sequences in Fig. 1(b) . Tetragrams and trigrams do not contribute to characterization of genomic islands based on frequency distribution, since the sets of OIs disagreeing with the complete genome (OIs #28, #84, #115 and #148) are included in the 5DC class, and all of them have high or low G + C content.

We consider CU bias to be significant if it is greater than 0.3, which is one standard deviation (about 0.1) higher than the mean value of CU bias (about 0.2) for all the OIs > 10 kb in length. Significant G + C content deviation is calculated to be outside the interval (42%, 58%). When comparing the 5-gram frequency distribution feature of E. coli EDL933 OIs with their G + C content and CU bias, a correlation was achieved: most OIs deviating significantly in 5-gram frequency distribution interval 0.00-0.025%). x-Values range from 0 to the percent occurrence of the most frequent from the complete genome also deviate significantly in G + C content and CU bias (OIs #8, #28, #84, #108, #115 and #148). Exceptions are the three OIs that deviate significantly (p < 0.05) in 5-gram frequency distribution from the complete genome, while having low CU bias and roughly average G + C content (OIs #30, #51 and #138). Table 1a also represents all the OIs with a length >10 kb, with z and p-values (Asymptotic Significance) for the Mann-Whitney agreement test of 5-gram frequency distribution with the complete genome sequence.

With respect to OIs longer than 10 kb (26 in total), the results presented show that our algorithm for characterizing GIs has identified all the 6 OIs with unusual G + C content or significantly high CU bias (recall, i.e. sensitivity is 100%). On the other side, the algorithm identified 3 more OIs (not recognized by G + C content or CU bias), which results in 67% precision (6/9) and 85% specificity (17/20) . We interpret this imprecision as an additional power of 33% (15%, respectively) with respect to other methods.

One obvious drawback of the method presented is that only OIs longer than 10 kb are analyzed. It is because 10 kb is considered as the lower limit for the length of a GI that exhibits characteristics of HGT [30, 31] . Although there are only 26 (out of 177) such OIs in E. coli EDL933, the total number of GIs identified in other genomes, e.g., in E. coli CFT073 [4, 64, 69] is, by an order of magnitude, smaller than that in E. coli EDL933 [51, 52, 64] and all of them are longer than 10 kb.

A general characteristic of n-gram methods is that they do not require linguistic (genomic) knowledge; the meaning of the sequence is not analyzed, it is merely the structure which is investigated. It makes the methods simpler but nonetheless useful in differentiating between different kinds of sequences, e.g., in characterizing and thus predicting GIs in bacterial genomes [58] .

Overrepresented and underrepresented n-grams Table 2a represents the 10 topmost overrepresented and underrepresented tetragrams (according to z M statistic) in the complete E. coli EDL933 genome, its backbone and 9 OIs from the 5DC class. Among these tetragrams there are 9 different palindromes, all of which are underrepresented, with 29 occurrences in total. There are also 14 pairs of sequences and their reverse complements, each pair simultaneously being either overrepresented or underrepresented. Table 2b represents rank, in the complete genome and the nine 5DC OIs of the 10 topmost overrepresented and underrepresented tetragrams from the backbone. Table AF2 in the additional file contains the same data for all the OIs of length >10 kb, along with the z-values of the z M statistic. It can be seen that the intersection of all the 10 topmost overrepresented tetragram columns is empty, as is the intersection of all the 10 top-most underrepresented tetragram columns, i.e., no tetragram is among the 10 topmost overrepresented tetragrams in all the OIs. However, there are three tetragrams -CACC, CATC, CGCC, that are overrepresented, and one tetragram -GGCC, that is underrepresented (but not in the 10 topmost), in all the OIs, the backbone and the complete genome. Table AF3 in the additional file represents all the tetragrams in all the OIs of length>10 kb, the backbone and the complete genome, along with z-values of the z M statistic.

Some tetragrams which are among the most over (or under) represented in some OIs, while being under (or over) represented in the backbone, are presented in Table 3 along with z-value in the OI and in the backbone. The first part of Table 3 represents tetragrams which are among the most overrepresented in an OI (high z-value), while being among the most underrepresented in the backbone (low z-value); the second part of Table 3 represents tetragrams which are among the most underrepresented in an OI (low z-value) while being among the most overrepresented in the backbone (high z-value); the third part of Table 3 represents some of the tetragrams which are uniquely overrepresented in an OI and underrepresented in the backbone, while the fourth part represents some of the tetragrams which are uniquely underrepresented in an OI while overrepresented in the backbone. It may be seen that tetragrams that are included in E. coli restriction enzyme recognition site sequences [70] are generally more represented in island sequences, than in the backbone (55% vs. 30%). This is in agreement with the fact that most of the OIs are phage or prophage related (61%), see Table 1 . This fact would have been even more apparent if we had used the absolute number of occurrences of these tetragrams instead of normalized (by Markov model).

Application of the modified Zipf-like analysis to the most overrepresented and underrepresented tetragrams of different sequences is illustrated in Fig. 2 (1)- (6) .

Subfigures (1) and (4) present the data for the complete E. coli EDL933 genome, while others correspond to two OIs from the 5DC class: OI #28, deviating significantly in G + C content from the backbone of the genome -subfigures (2), (5) , and OI #51 with G + C content around 50% -subfigures (3), (6) . The corresponding figures for all the OIs from the 5DC class are given in Fig. 1AF1 in the additional file. It may be noticed that the most overrepresented and underrepresented tetragrams in the complete genome and its backbone are quite similar (as one would expect). Some of the most represented tetra First part of this table represents tetragrams which are among the most overrepresented in an OI (high z-value) while being among the most underrepresented in the backbone (low z-value); second part of this table represents tetragrams which are among the most underrepresented in an OI (low z-value) while being among the most overrepresented in the backbone (high z-value); third part of this table represents some of the tetragrams which are uniquely overrepresented in an OI and underreperesented in the backbone, while the fourth part represents some of the tetragrams which are uniquely underrepresented in an OI and overrepresented in the backbone. As expected, the first and the third part of the table have tetragrams in common, as well as the second and the fourth part of the table. Restriction enzyme recognition sequences for some E. coli strains, as well as for other bacteria, are given when known. Restriction enzyme data are taken from Genscript.com [70] . A tetragram with its reverse complement is given in pair (xxxx/yyyy) when the tetragram (xxxx) corresponds to the "OI#" and "z-value in OI" data, while its complement (yyyy) corresponds to the "Restriction enzymes" data. grams in the complete genome are underrepresented in some of the OIs, e.g. CCAG (Fig. 2(1) ), but the most underrepresented tetragrams in the complete genome tend to be underrepresented in most of the OIs, too (Fig. 3) . As far as 5DC OIs are concerned, it can be noticed that some of the most overrepresented tetragrams in each of them are underrepresented in the backbone, and have z-values above the z-values of the corresponding tetragrams in all other sequences (e.g., CCAT, TGGA in the OI #28, Fig. 2 (2), ATGC in the OI #51, Fig. 2(3) ). The same holds true for the most underrepresented tetragrams (e.g., CTTC in the OI #51, Fig. 2(6) ). Such tetragrams may be used as a basis for defining the OIs tetragram nucleotide signature. Tables 4a and 4b represent all the tetragrams that are overrepresented/underrepresented, respectively, in just one of the 5DC OIs and, at the same time, underrepresented/overrepresented, respectively, in the backbone, along with the z-value and rank of the tetragram in both sequences. Fig. 4 represents z-values for the most underrepresented tetragram in each of the 5DC OIs, which is overrepresented in the backbone, along the E. coli EDL933 genome.

On the other side, some tetragrams are overrepresented/underrepresented in a subset of OIs and underrepresented/overrepresented, respectively in the backbone, and may characterize, in a way similar to n-gram frequency distribution, the subset of OIs. For example, ATCC is underrepresented in five OIs, four of which belonging to the 5DC class (#28, #30, #84 and #148) and overrepresented in the backbone; GGCT is overrepresented in six OIs, four of which belong to the 5DC class (#30, #51, #84 and #108) and underrepresented in the backbone. Table AF4a in the additional file represents all the tetragrams that are overrepresented in an OI and underrepresented in the backbone, along with the expected number of occurrences (according to the maximal All the tetragrams that are uniquely overrepresented in one of the 5DC class OIs and underrepresented in the backbone, along with the z-value and rank of the tetragram in both sequences, are listed. Table AF4b in the additional file represents the same data for tetragrams that are underrepresented in an OI and overrepresented in the backbone.

Although bacterial genomes are known to be substantially homogeneous regarding features such as G + C content or CU, island sequences tend to differ from the overall genome and in particular, from backbone sequences. The present investigations add n-grams as an additional feature refining characterization of OIs, classifying, and distinguishing them by classification from the backbone sequences. Structurally different classes of OIs may have different functional interpretations. Unusual n-gram content may identify additional OIs with unusual composition and thus potential candidates for PAIs. This feature may also complement criteria for predicting various island sequences in genomes where they have not yet been annotated. Our future plans include computational experiments applying the proposed method to the island sequences of other bacterial species, in order to verify their characterization findings. We also plan to experiment with other types of motifs, such as repeats, as sequence characterization and classification features.

Ecological fitness, genomic islands and bacterial pathogenicity

Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes

The complete genome sequence of Escherichia coli K-12

Defining genomic islands and uropathogen-specific genes in uropathogenic Escherichia coli

Zipf's law, and language

Mathematical theory of communication

Alignment-free sequence comparison-a review

A dictionary for minimum redundancy encoding

Effective text compression with simultaneous digram and trigram encoding

The use of trigram analysis for spelling error detection

Automatic spelling correction using trigram similarity measure

Trigram-based method of language identification

Trenkle, n-Gram-based text categorization

n-Gram-based author profiles for authorship attribution

Gauging similarity with n-grams: language-independent categorization of text

Formal analysis of protein sequences. I. Specific long range constraints in pair associations of amino acids

Genomic style of proteins: concepts, methods and analysis of ribosomal proteins from 16 microbial species

Evidence for cysteine clustering in thermophylic proteomes

The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over-and underrepresented sequences by Markov chain analysis

Mono-through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis

An improved method for detection of words with unusual occurrence frequency in nucleotide sequences

Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences

Avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes

Compositional biases of bacterial genomes and evolutionary implications

Dinucleotide relative abundance extremes: a genomic signature

Oligonucleotide bias in Bacillus subtilis: general trends and taxonomic comparisons

Linguistics of nucleotide sequences. I. The significance of deviation from mean statistical characteristics and prediction of the frequencies of occurrence of words

Statistical analysis of counts and distributions of restriction sites in DNA sequences

Over-and under-representation of short oligonucleotides in DNA sequences

An efficient statistic to detect over-and underrepresented words in DNA sequences

Determination of bias in the relative abundance of oligonucleotides in DNA sequences

Probabilistic and statistical properties of words: an overview

A measure of the similarity of sets of sequences not requiring sequence alignment

Effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarity of natural sequences

Average values of a dissimilarity measure not requiring sequence alignment are twice the averages of conventional mismatch counts requiring sequence alignment for computer generated system model

Statistical significance of sequence patterns in proteins

Distributional regimes for the number of k-word matches between two random sequences

Primary sequences of proteins from complete genomes display a singular periodicity: alignment-free N-gram analysis

Integrated gene and species phylogenies from unaligned whole genome protein sequences

Local homology recognition and distance measures in linear time using compressed amino acid alphabets

Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach

A novel method of protein sequence classification based on oligopeptide frequency analysis and its application to search for functional sites and to domain localization

Protein classification based on text document classification techniques

Classification and identification of proteins by means of common and specific amino acid n-tuples in unaligned sequences

Comparative n-gram analysis of whole-genome sequences, HLT'02

Rare and frequent amino acid n-grams in whole-genome protein sequences

ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes

Comparative evaluation of word composition distances for the recognition of SCOP relationships

The method of N-grams in large-scale clustering of DNA texts

A large-scale comparison of genomic sequences: one promising approach

N-Gram-based classification and unsupervised hierarchical clustering of genome sequences

Identification of compositionally distinct regions in genomes using the centroid method

Identification of prophages in bacterial genomes by dinucleotide relative abundance difference

Recruitment of rare 3-grams at functional sites: Is this a mechanism for increasing enzyme specificity?

Mutational analysis of SARS CoV genome

Word Length Studies and Related Issues

Could N-gram analysis contribute to genomic island determination? J

IslandPath: aiding detection of genomic islands in prokaryotes

Islander: a database of integrative islands in prokaryotic genomes, the associated integrases and their DNA site specificities

SIGI: score-based identification of genomic islands

A computational approach for identifying pathogenicity islands in prokaryotic genomes

Towards pathogenomics: a web based resource for pathogenicity islands

Pathogenecity islands in bacterial pathogenesis

A systematic method to identify genomic islands and its applications in analyzing the genomes of Corynebacterium glutamicum and Vibrio vulnificus CMCP6 chromosome I

SPSS Programming and Data Management: A Guide for SPSS and SAS Users

An Introduction to Information Retrieval

Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli

The work presented has been financially supported by the Ministry of Science and Technological Development of the Republic of Serbia, Project No. 144030.

 CCAG  3  6  16  9  62  41  127  68  78  142  61  CTTC  4  5  9  23  172  129  238  4  54  139  33  GAAG  5  3  51  56  69  11  21  133  29  7  44  CTGG  6  4  74  200  85  1  1  77  18  15  40  GATG  7  7  8  12  65  10  23  150  60  9  2  CATC  8  8  2  8  2  76  18  6  2  22  11  TTGC  9  11  59  102  130  17  2  65  55  58  72  GCAA  10  13  41  5  54  7  6  126  124  8  45   GGAC  247  247  241  65  139  197  180  180  207  236  185  TATA  248  248  251  235  131  226  225  256  206  195  254  GATC  249  253  249  255  204  235  254  207  253  237  253  CTAG  250  249  111  222  179  240  168  255  248  225  246  CGCG  251  252  245  168  233  252  217  175  211  255  239  CAAG  252  250  193  108  207  225  230  82  236  252  223  CTTG  253  251  218  216  120  237  106  112  200  159  161  TTGG  254  255  252  239  112  254  255  213  254  200  244  CCAA  255  254  253  237  163  255  251  123  228  249  202  GGCC  256  256  248  226  225  256  253  216  255 All the tetragrams that are uniquely underrepresented in one of the 5DC class OIs and overrepresented in the backbone, along with the z-value and rank of the tetragram in both sequences, are listed.