key: cord-0043251-6mw6avpg
authors: Bonizzoni, Paola; De Felice, Clelia; Zaccagnino, Rocco; Zizza, Rosalba
title: Lyndon Words versus Inverse Lyndon Words: Queries on Suffixes and Bordered Words
date: 2020-01-07
journal: Language and Automata Theory and Applications
DOI: 10.1007/978-3-030-40608-0_27
sha: ef53ee9a3df47224a44924817dc5061d6c27a2cd
doc_id: 43251
cord_uid: 6mw6avpg

The Lyndon factorization of a word has been extensively studied in different contexts and several variants of it have been proposed. In particular, the canonical inverse Lyndon factorization ICFL(w), introduced in [5], maintains the main properties of the Lyndon factorization since it can be computed in linear time and it is uniquely determined. In this paper we investigate new properties of this factorization with the purpose of exploring its use in string queries. As a main result, we prove an upper bound on the length of the longest common extension (or longest common prefix) for two factors of a word w. This bound is at most the maximum total length of two consecutive factors of ICFL(w). A tool used in the proof is a property that we state for factors with nonempty borders in ICFL(w): a nonempty border of a factor m_i cannot be a prefix of the next factor m_{i+1}. Another interesting result relates sorting of global suffixes, i.e., suffixes of a word w, and sorting of local suffixes, i.e., suffixes of the factors in ICFL(w). Finally, given a word w and a factor x of w, we prove that their Lyndon factorizations share factors, except for the first and last term of the Lyndon factorization of x. This property suggests that, given two words sharing a common overlap, their Lyndon factorizations could be used to capture the common overlap of these two words.

The Lyndon factorization CFL(w) of a word w is the unique factorization of w into a sequence of Lyndon words in nonincreasing lexicographic ordering. This factorization is one of the most well-known and extensively studied in different contexts, from formal languages to algorithmic stringology and string compression. In particular, the notion of a Lyndon word has been shown to be useful in theoretical applications, such as the well-known proof of the Runs Theorem [2], as well as in string compression analysis. A connection between the Lyndon factorization and the Lempel-Ziv (LZ) factorization has been given in [18], where it is shown that in general the size of the LZ factorization is larger than the size of the Lyndon factorization, and in any case the size of the Lyndon factorization cannot be larger than a factor of 2 with respect to the size of LZ. This result has been further extended in [28] to overlapping LZ factorizations. The Lyndon factorization has recently proved to be a useful tool also in investigating queries related to suffixes of a word and sorting such suffixes [25], with a strong potential [26] for string comparison that has not been completely explored and understood. Relations between Lyndon words and the Burrows-Wheeler Transform (BWT) have been discovered first in [11, 23] and, more recently, in [19]. The main interest in such a factorization is also due to the fact that it can be efficiently computed. Linear-time algorithms for computing this factorization can be found in [15, 16], whereas an O(lg n)-time parallel algorithm has been proposed in [1, 13].

Most recently, variants of the Lyndon factorization have been introduced and investigated with different motivations. In [5], the notion of an inverse Lyndon word (a word which is strictly greater than each of its proper suffixes) has been introduced to define new factorizations, called inverse Lyndon factorizations. An inverse Lyndon factorization has the property that a word is factorized in a sequence of inverse Lyndon words, in an increasing and prefix-order-free lexicographic ordering, where prefix-order-free means that a factor cannot be a prefix of the next one. A word w which is not an inverse Lyndon word may have several inverse Lyndon factorizations but it admits a canonical inverse Lyndon factorization. This special inverse Lyndon factorization has been introduced in [5] and denoted ICFL(w) because it is the counterpart of the Lyndon factorization CFL(w) of w, when we use (I)nverse Lyndon words as factors. Indeed, in [5] it has been proved that ICFL(w) can be computed in linear time and it is uniquely determined for a word w.

In this paper we further investigate ICFL(w). The main results stated here are the following: (1) we find an upper bound on the length of the longest common prefix of two distinct factors in ICFL(w), namely the maximal length of two consecutive factors in ICFL(w) (Proposition 6), (2) we are able to relate sorting of global suffixes, i.e., suffixes of the word w, and local suffixes, i.e., suffixes of the factors in ICFL(w) (Lemma 3).

Differently from Lyndon words, inverse Lyndon words may be bordered. As an intermediate result, we show that if a factor m_i in ICFL(w) has a nonempty border, then such a border cannot be inherited by the next factor, since it cannot be a prefix of the next factor m_{i+1} (Proposition 5). This result is proved by a further investigation of the connection between the Lyndon factorization and the canonical inverse Lyndon factorization of a word, given in [5] through the grouping property. Indeed, given a word w which is not an inverse Lyndon word, the factors in ICFL(w) are obtained by grouping together consecutive factors of the anti-Lyndon factorization of w that form a chain for the prefix order (Proposition 7.7 in [5]).

Another natural question is the following.

Given two words having a common overlap, can we use their Lyndon factorizations to capture the similarity of these words?

A partial positive answer to this question is provided here: given a word w and a factor x of w, we prove that their Lyndon factorizations share factors, except for the first and last term of the Lyndon factorization of x.

For the detailed proofs of the results in this paper we refer the reader to [6] .

Throughout this paper we follow [4, 10, 20, 22, 27] for the notations. We fix the finite non-empty (totally ordered) alphabet Σ. We denote by Σ * the free monoid generated by Σ and we set Σ + = Σ * \ 1, where 1 is the empty word. For a word w ∈ Σ * , we denote by |w| its length.

Given a language L ⊆ Σ*, we denote by Pref(L) the set of all prefixes of its elements.

Two words x, y are incomparable for the prefix order if neither x is a prefix of y nor y is a prefix of x. Otherwise, x, y are comparable for the prefix order. We write x ≤_p y if x is a prefix of y and x ≥_p y if y is a prefix of x. We recall that, given a nonempty word w, a border of w is a word which is both a proper prefix and a suffix of w [12]. The longest proper prefix of w which is a suffix of w is also called the border of w [12, 22]. A word w ∈ Σ+ is bordered if it has a nonempty border. Otherwise, w is unbordered. A nonempty word w is primitive if w = x^k implies k = 1. An unbordered word is primitive. A sesquipower of a word x is a word w = x^n p where p is a proper prefix of x and n ≥ 1.
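The notions of border and primitivity recalled above can be checked directly on small words. The following is a minimal Python sketch, not taken from the paper: the function names are illustrative, and the naive scan is quadratic rather than an optimized border-array computation.

```python
def borders(w: str) -> list[str]:
    """All borders of w (proper prefixes that are also suffixes), shortest first.

    The empty word is a border of every nonempty word.
    """
    return [w[:k] for k in range(len(w)) if w.endswith(w[:k])]


def is_bordered(w: str) -> bool:
    """True if w has a nonempty border."""
    return any(z for z in borders(w))


def is_primitive(w: str) -> bool:
    """True if w = x^k implies k = 1 (classical test: w is not an inner factor of ww)."""
    return bool(w) and w not in (w + w)[1:-1]


print(borders("baaab"))        # ['', 'b']
print(is_bordered("daba"))     # False
print(is_primitive("dabdab"))  # False, since dabdab = (dab)^2
```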

The lexicographic (or alphabetic) order ≺ on (Σ*, <) is defined by setting x ≺ y if x is a proper prefix of y, or x = ras, y = rbt, a < b, for a, b ∈ Σ and r, s, t ∈ Σ*. For two nonempty words x, y, we write x ≪ y if x ≺ y and x is not a proper prefix of y [3]. We also write y ≻ x if x ≺ y. Two words x, y are called conjugate if there exist words u, v such that x = uv, y = vu. The conjugacy relation is an equivalence relation. A conjugacy class is a class of this equivalence relation. A Lyndon word w ∈ Σ+ is a word which is primitive and the smallest one in its conjugacy class for the lexicographic order. A class of conjugacy is also called a necklace and often identified with the minimal word for the lexicographic order in it. Thus, a nonempty word is a necklace if and only if it is a power of a Lyndon word. A prenecklace is a prefix of a necklace, hence any nonempty prenecklace w has the form w = (uv)^k u, where uv is a Lyndon word, u ∈ Σ*, v ∈ Σ+, k ≥ 1, that is, w is a sesquipower of a Lyndon word uv. A characterization of the structure of the prefixes of the Lyndon words is given in [15]. It states that a word is a nonempty prefix of a Lyndon word if and only if it is a sesquipower of a Lyndon word distinct from the maximal letter.

It is known that each Lyndon word w is unbordered. Moreover, a word w ∈ Σ+ is a Lyndon word if and only if w ≺ s, for each nonempty proper suffix s of w. Different characterizations and variations of Lyndon words are given in [3, 14, 21]. In the following L = L_(Σ*,<) will be the set of Lyndon words, totally ordered by the relation ≺ on (Σ*, <). We know that any word w ∈ Σ+ can be written in a unique way as a nonincreasing product of Lyndon words, i.e., in the form w = ℓ_1 ℓ_2 ⋯ ℓ_h, with ℓ_j ∈ L and ℓ_1 ⪰ ℓ_2 ⪰ … ⪰ ℓ_h [9]. The sequence CFL(w) = (ℓ_1, …, ℓ_h) is called the Lyndon decomposition (or Lyndon factorization) of w. Uniqueness of the above factorization is proved in [15] and allows us to state a recursive definition of CFL(w), for a nonempty word w. Precisely, if w is a Lyndon word, then CFL(w) = (w). Otherwise, CFL(w) = (ℓ_1, ℓ'_1, …, ℓ'_h), where (ℓ'_1, …, ℓ'_h) = CFL(w'), w = ℓ_1 w' and ℓ_1 is the longest prefix of w which is a Lyndon word. Sometimes we need to emphasize consecutive equal factors in CFL. We write CFL(w) = (ℓ_1^{n_1}, …, ℓ_r^{n_r}) to denote a tuple of n_1 + … + n_r Lyndon words, where r > 0, n_1, …, n_r ≥ 1. Precisely, ℓ_1 ≻ … ≻ ℓ_r are Lyndon words, also named Lyndon factors of w. There is a linear time algorithm to compute the pair (ℓ_1, n_1) and thus, by iteration, the Lyndon factorization of w [16, 22]. Linear time algorithms may also be found in [15] and in the more recent paper [17].
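As an illustration of the linear-time computation mentioned above, the sketch below follows the classical scheme usually attributed to Duval: at each step it finds the pair (ℓ_1, n_1) for the remaining suffix and emits n_1 copies of ℓ_1. The function name and the `inverse` flag (which switches to the reversed alphabet order) are assumptions made for this example and do not come from the paper.

```python
def lyndon_factorization(w: str, inverse: bool = False) -> list[str]:
    """Lyndon factorization of w for the usual letter order.

    With inverse=True the alphabet order is reversed, so the result is the
    factorization into Lyndon words of the reversed order.
    """
    def less(a: str, b: str) -> bool:
        return a > b if inverse else a < b

    factors = []
    i, n = 0, len(w)
    while i < n:
        j, k = i + 1, i
        # Extend the current prenecklace w[i:j] as long as possible.
        while j < n and (w[k] == w[j] or less(w[k], w[j])):
            k = i if less(w[k], w[j]) else k + 1
            j += 1
        # w[i:j] is a sesquipower of a Lyndon word of length j - k:
        # emit every full copy of that Lyndon word.
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors


# "banana" is an illustrative word, not an example from the paper.
print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```

With the usual order the output is nonincreasing for ≺; with `inverse=True` it is nonincreasing for the order induced by the reversed alphabet.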

The inverse lexicographic or inverse alphabetic order on (Σ*, <), denoted ≺_in, is the lexicographic order on (Σ*, <_in). Here <_in means that the order of the alphabet is reversed, that is b <_in a ⇔ a < b, for all a, b ∈ Σ. We denote by L_in = L_(Σ*,<_in) the set of the Lyndon words on Σ* with respect to the inverse lexicographic order. A word w ∈ L_in will be named an anti-Lyndon word. Correspondingly, an anti-prenecklace will be a prefix of an anti-necklace, which in turn will be a necklace with respect to the inverse lexicographic order. We have that a word w ∈ Σ+ is in L_in if and only if w is primitive and w ≻ vu, for each u, v ∈ Σ+ such that w = uv. Alternatively, a word w ∈ Σ+ is in L_in if and only if w is unbordered and w ≻ v, for each proper nonempty suffix v. We denote by CFL_in(w) the Lyndon factorization of w with respect to the inverse order <_in. The following definition plays a fundamental role in our results.

Definition 1 [5]. A word w ∈ Σ+ is an inverse Lyndon word if s ≺ w, for each nonempty proper suffix s of w.

It is easy to see that a, b, aaaaa, bbba, baaab, bbaba and bbababbaa are inverse Lyndon words on {a, b}, with a < b. On the contrary, aaba is not an inverse Lyndon word since aaba ≺ ba. Moreover, baaab is not an anti-Lyndon word since it is bordered. In [5] it has been proved that a nonempty word is an anti-Lyndon word if and only if it is an unbordered inverse Lyndon word. Finally, the set of the inverse Lyndon words coincides with the set of the anti-prenecklaces, hence any nonempty prefix of an inverse Lyndon word is still an inverse Lyndon word [5] .
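The examples above can be verified mechanically. The sketch below is a naive quadratic check, assuming that the letter order coincides with Python's character order (true for a < b < c < d); the function names are illustrative only.

```python
def is_inverse_lyndon(w: str) -> bool:
    """True if w is nonempty and strictly greater than each nonempty proper suffix."""
    return bool(w) and all(w[i:] < w for i in range(1, len(w)))


def is_anti_lyndon(w: str) -> bool:
    """Anti-Lyndon words are exactly the unbordered inverse Lyndon words [5]."""
    unbordered = not any(w.endswith(w[:k]) for k in range(1, len(w)))
    return is_inverse_lyndon(w) and unbordered


for word in ["a", "b", "aaaaa", "bbba", "baaab", "bbaba", "bbababbaa"]:
    assert is_inverse_lyndon(word)

print(is_inverse_lyndon("aaba"))  # False, because aaba ≺ ba
print(is_anti_lyndon("baaab"))    # False, because b is a nonempty border
```

Python compares strings lexicographically and treats a proper prefix as smaller, which matches the order ≺ used in the text.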

For the material in this section see [5]. An inverse Lyndon factorization of a word w ∈ Σ+ is a sequence (m_1, …, m_k) of inverse Lyndon words such that m_1 ⋯ m_k = w and m_i ≪ m_{i+1}, 1 ≤ i ≤ k − 1. A word may have different inverse Lyndon factorizations (see Example 2) but it has a unique canonical inverse Lyndon factorization, denoted ICFL(w). If w is an inverse Lyndon word, then ICFL(w) = (w). Otherwise, ICFL(w) is recursively defined. The first factor of ICFL(w) is obtained by a special factorization of the shortest nonempty prefix z of w such that z is not an inverse Lyndon word.

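Definition 2 [5]. Let w ∈ Σ+, let p be an inverse Lyndon word which is a nonempty proper prefix of w = pv. The bounded right extension p̄ of p (relatively to w), if it exists, is a nonempty prefix of v such that:

(1) p̄ is an inverse Lyndon word, (2) pz is an inverse Lyndon word, for each proper nonempty prefix z of p̄, (3) p p̄ is not an inverse Lyndon word, (4) p ≪ p̄.

Moreover, Pref_bre(w) denotes the set of the pairs (p, p̄), where p is an inverse Lyndon word which is a nonempty proper prefix of w and p̄ is its bounded right extension.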

It has been proved that Pref_bre(w) is empty if and only if w is an inverse Lyndon word (Proposition 4.2 in [5]). If w is not an inverse Lyndon word, then Pref_bre(w) contains only one pair and the description of this pair is given below (Propositions 4.1 and 4.3 in [5]).

Proposition 1. Let w ∈ Σ+ be a word which is not an inverse Lyndon word. Let z be the shortest nonempty prefix of w which is not an inverse Lyndon word. Then,

(1) z = p p̄, with (p, p̄) ∈ Pref_bre(w).

(2) p = ras and p̄ = rb, where r, s ∈ Σ*, a, b ∈ Σ and r is the shortest prefix of p p̄ such that p p̄ = rasrb, with a < b.

Let us consider w = babaaabb and the prefixes p_1 = bab and p_2 = babaaa of w. First, w is not an inverse Lyndon word. Thus, Pref_bre(w) contains only one pair. Moreover each proper nonempty prefix of w is an inverse Lyndon word. By item (1) in Proposition 1, we have w = p p̄. By item (2) in Proposition 1, the bounded right extension of p_1 = bab does not exist (we should have p̄_1 = aaabb, in contradiction with p_1 ≪ p̄_1). Since w starts with b, the shortest common prefix r of p and p̄ has a positive length. Indeed, p = p_2 = babaaa and p̄ = p̄_2 = bb.
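The description of the unique pair in Pref_bre(w) lends itself to a direct, if inefficient, computation. The sketch below assumes the reading used above: z is the shortest prefix of w that is not an inverse Lyndon word, and r is the shortest prefix of z with z = rasrb and a < b. The function names are hypothetical and the search is brute force, not the linear-time procedure of [5].

```python
def is_inverse_lyndon(w: str) -> bool:
    """Naive test: w is nonempty and greater than each nonempty proper suffix."""
    return bool(w) and all(w[i:] < w for i in range(1, len(w)))


def pref_bre(w: str):
    """Return the pair (p, p_bar), or None if w is an inverse Lyndon word."""
    if is_inverse_lyndon(w):
        return None
    # Shortest nonempty prefix of w that is not an inverse Lyndon word.
    z = next(w[:n] for n in range(1, len(w) + 1) if not is_inverse_lyndon(w[:n]))
    b = z[-1]
    # k = |r|; z = r a s r b requires 2k + 2 <= |z|.
    for k in range(len(z) // 2):
        r, a = z[:k], z[k]
        if z[-(k + 1):-1] == r and a < b:
            return z[:-(k + 1)], z[-(k + 1):]
    return None


print(pref_bre("babaaabb"))  # ('babaaa', 'bb')
```

For w = babaaabb the search rejects r = 1 (the empty word), since w starts and ends with b, and stops at r = b, which agrees with the example.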

The canonical inverse Lyndon factorization has also been recursively defined.

Definition 3 [5]. Let w ∈ Σ+. (Basis Step) If w is an inverse Lyndon word, then ICFL(w) = (w). (Recursive Step) If w is not an inverse Lyndon word, let (p, p̄) ∈ Pref_bre(w) and let v ∈ Σ* such that w = pv. Let ICFL(v) = (m'_1, …, m'_k) and let r, s ∈ Σ*, a, b ∈ Σ such that p = ras, p̄ = rb with a < b. Then ICFL(w) = (p, ICFL(v)) if p̄ = rb ≤_p m'_1, and ICFL(w) = (p m'_1, m'_2, …, m'_k) if m'_1 ≤_p r.

Example 2 [5]. Let Σ = {a, b, c, d} with a < b < c < d and let w = dabadabdabdadac. We have CFL_in(w) = (daba, dab, dab, dadac), ICFL(w) = (daba, dabdab, dadac). Another inverse Lyndon factorization of w is (dabadab, dabda, dac). Consider z = dabdadacddbdc. It is easy to see that (dab, dadacd, db, dc), (dabda, dac, ddbdc), (dab, dadac, ddbdc) are all inverse Lyndon factorizations of z. The first factorization has four factors whereas the others have three factors. Moreover ICFL(z) = CFL_in(z) = (dab, dadac, ddbdc).
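The recursive definition can be turned into a short, non-optimized program. The sketch below rests on an assumption about the recursive step, namely that ICFL(w) = (p, ICFL(v)) when p̄ is a prefix of the first factor m'_1 of ICFL(v), and that p is merged with m'_1 otherwise (the case m'_1 ≤_p r in [5]). All names are illustrative and the running time is far from the linear bound proved in [5].

```python
def is_inverse_lyndon(w: str) -> bool:
    return bool(w) and all(w[i:] < w for i in range(1, len(w)))


def pref_bre(w: str):
    """Pair (p, p_bar) for a nonempty word w that is not an inverse Lyndon word."""
    z = next(w[:n] for n in range(1, len(w) + 1) if not is_inverse_lyndon(w[:n]))
    for k in range(len(z) // 2):  # k = |r| in z = r a s r b
        if z[-(k + 1):-1] == z[:k] and z[k] < z[-1]:
            return z[:-(k + 1)], z[-(k + 1):]
    raise ValueError("no decomposition z = rasrb found")


def icfl(w: str) -> list[str]:
    """Canonical inverse Lyndon factorization of a nonempty word (naive sketch)."""
    if is_inverse_lyndon(w):
        return [w]                      # basis step
    p, p_bar = pref_bre(w)
    rest = icfl(w[len(p):])             # ICFL(v), with w = pv
    if rest[0].startswith(p_bar):       # assumed first case: p_bar is a prefix of m'_1
        return [p] + rest
    return [p + rest[0]] + rest[1:]     # assumed second case: m'_1 is a prefix of r


print(icfl("dabadabdabdadac"))  # ['daba', 'dabdab', 'dadac']
print(icfl("dabdadacddbdc"))    # ['dab', 'dadac', 'ddbdc']
```

Both outputs agree with the factorizations ICFL(w) and ICFL(z) given in the example.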

Let w ∈ Σ+ be a word which is not an inverse Lyndon word, let ICFL(w) = (m_1, …, m_k). The aim of this section is to state that any nonempty border of m_i is not a prefix of m_{i+1}, 1 ≤ i ≤ k − 1 (Proposition 5). The proof of this result is strongly based on a property of ICFL(w), proved in [5] and defined through the notion of groupings of CFL_in(w).

Let CFL_in(w) = (ℓ_1, …, ℓ_h), where ℓ_1 ⪰_in ℓ_2 ⪰_in … ⪰_in ℓ_h. Consider the partial order ≥_p, where x ≥_p y if y is a prefix of x. Recall that a chain is a set of pairwise comparable elements. We say that a chain is maximal if it is not strictly contained in any other chain. A non-increasing (maximal) chain in CFL_in(w) is the sequence corresponding to a (maximal) chain in the multiset {ℓ_1, …, ℓ_h} with respect to ≥_p. We denote by PMC a non-increasing maximal chain in CFL_in(w). Looking at the definition of the (inverse) lexicographic order, it is easy to see that a PMC is a sequence of consecutive factors in CFL_in(w). Moreover CFL_in(w) is the concatenation of its PMC. Formally, if C is a PMC in CFL_in(w), then there are indexes r, s with 1 ≤ r ≤ s ≤ h such that C = (ℓ_r, …, ℓ_s), with ℓ_r ≥_p ℓ_{r+1} ≥_p ⋯ ≥_p ℓ_s.

Example 3 [5] . Let Σ = {a, b, c, d} with a < b < c < d, w = dabadabdabdadac. In Example 2, we observed that CFL in (w) = (daba, dab, dab, dadac). This sequence has two PMC, namely (daba, dab, dab), (dadac). Let z = dabdadacddbdc. Then CFL in (z) = (dab, dadac, ddbdc) has three PMC: (dab), (dadac), (ddbdc).
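Since a PMC is a run of consecutive factors, each a prefix of the previous one, the PMC of a given anti-Lyndon factorization can be recovered with a single left-to-right pass. The sketch below takes the factors of CFL_in(w) as input; the function name is an assumption made for illustration.

```python
def maximal_prefix_chains(factors: list[str]) -> list[list[str]]:
    """Split a sequence of factors into maximal runs l_r >=_p ... >=_p l_s."""
    chains: list[list[str]] = []
    for f in factors:
        # f continues the current chain if it is a prefix of the previous factor.
        if chains and chains[-1][-1].startswith(f):
            chains[-1].append(f)
        else:
            chains.append([f])
    return chains


print(maximal_prefix_chains(["daba", "dab", "dab", "dadac"]))
# [['daba', 'dab', 'dab'], ['dadac']]
print(maximal_prefix_chains(["dab", "dadac", "ddbdc"]))
# [['dab'], ['dadac'], ['ddbdc']]
```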

A grouping of CFL_in(w) is an inverse Lyndon factorization (m_1, …, m_k) of w such that any factor is a product of consecutive factors in a PMC of CFL_in(w). ICFL(w) is always a grouping of CFL_in(w) but, as shown below, it is not always its unique grouping.

Example 4 [5]. Let Σ = {a, b, c, d}, a < b < c < d, and w = dabadabdabdadac. We have CFL_in(w) = (daba, dab, dab, dadac), ICFL(w) = (daba, dabdab, dadac) (see Example 2). ICFL(w) is a grouping of CFL_in(w) but (dabadab, dabda, dac) is not a grouping. Next, let y = dabadabdabdabdadac. We have CFL_in(y) = (daba, dab, dab, dab, dadac) and ICFL(y) = (daba, (dab)^3, dadac). The inverse Lyndon factorization (dabadab, (dab)^2, dadac) is another grouping of CFL_in(y).

The proof of Proposition 5 is organized as follows. We first state that any nonempty border x of a non-increasing chain in CFL_in(w) cannot cut any ℓ_i and admits a shortest border.

Proposition 2. Let w ∈ Σ+, let CFL_in(w) = (ℓ_1, …, ℓ_h) and let (ℓ_r, …, ℓ_s), 1 ≤ r < s ≤ h, be a non-increasing chain in CFL_in(w). For any nonempty border x of y = ℓ_r ⋯ ℓ_s there is t, r ≤ t < s, such that x = ℓ_{t+1} ⋯ ℓ_s. Consequently, ℓ_s is a nonempty border of any other nonempty border of ℓ_r ⋯ ℓ_s.

The next step is to prove that p in the pair (p, p̄) ∈ Pref_bre(w) has a grouping-like property. Indeed we show that p is always a product of consecutive factors in a PMC of CFL_in(w). Thus, thanks to Proposition 2, p has a shortest border. This shortest border determines the relation p ≪ p̄.

Then the following properties hold.

(1) p = ℓ_1^{n_1} ⋯ ℓ_g^{n_g}, for some g, 1 ≤ g ≤ q.

Now, we can state that, for each nonempty border z of p = ras, we have that z and p̄ = rb are incomparable for the prefix order. We use the same notations as in Propositions 2, 3. The word p̄ cannot be a prefix of z because p̄ is not a prefix of p. Thus, if z and p̄ were comparable, z should be a prefix of p̄. By Proposition 2, the shortest border ℓ_g = u_g a_g v_g of p should be a prefix of z, thus of p̄ = u_g b, a_g < b, a contradiction.

Proposition 4. Let w ∈ Σ+ be a word which is not an inverse Lyndon word and let (p, p̄) ∈ Pref_bre(w). For each nonempty border z of p, one has that z and p̄ are incomparable for the prefix order.

Finally, we can explicitly prove, by induction on |w|, that if z is a nonempty border of m_1, then z is not a prefix of m_2. We use the recursive definition of ICFL(w), with the same notations as in Definition 3. We distinguish the two cases m_1 = p and m_1 = p m'_1. In the first case, p̄ is a prefix of m'_1 = m_2. Thus, if z were a prefix of m_2, we would be in contradiction with Proposition 4. In the second case, we have m_2 = m'_2 and again two cases: |z| ≥ |m'_1| or |z| < |m'_1|. If z were a prefix of m_2 with |z| ≥ |m'_1|, m'_1 would be a prefix of m'_2, in contradiction with m'_1 ≪ m'_2. If z were a prefix of m_2 with |z| < |m'_1|, z would be a border of m'_1, in contradiction with the induction hypothesis. Then, again by induction on |w|, we extend this argument to prove the general result stated below.

Proposition 5. Let w ∈ Σ+ be a word which is not an inverse Lyndon word and let ICFL(w) = (m_1, …, m_k). If z is a nonempty border of m_i, then z is not a prefix of m_{i+1}, 1 ≤ i ≤ k − 1.
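The border property can be tested on any given factorization. The sketch below is an illustrative brute-force check (the function name is hypothetical); it takes a list of factors and reports every triple (m_i, z, m_{i+1}) in which a nonempty border z of m_i is a prefix of m_{i+1}.

```python
def inherited_borders(factors: list[str]) -> list[tuple[str, str, str]]:
    """Triples (m, z, nxt): z is a nonempty border of m and a prefix of the next factor."""
    found = []
    for m, nxt in zip(factors, factors[1:]):
        for k in range(1, len(m)):
            z = m[:k]
            if m.endswith(z) and nxt.startswith(z):
                found.append((m, z, nxt))
    return found


# ICFL(w) for w = dabadabdabdadac: no border is inherited.
print(inherited_borders(["daba", "dabdab", "dadac"]))  # []

# A non-canonical inverse Lyndon factorization of the same word.
print(inherited_borders(["dabadab", "dabda", "dac"]))
# [('dabadab', 'dab', 'dabda'), ('dabda', 'da', 'dac')]
```

The second call shows that the property is specific to the canonical factorization: the factorization (dabadab, dabda, dac) of the same word does not satisfy it.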

Given a word w and two factors x, y of w, we denote by lcp(x, y) the longest common prefix of x, y and we set LCP(x, y) = |lcp(x, y)|. Proposition 5 in the previous section is extremely useful to obtain a bound on the length of the longest common prefix of two factors of a word w, when w is not an inverse Lyndon word (Proposition 6). Precisely, we state that LCP(x, y) is at most the maximum length of two consecutive factors in ICFL(w). As a direct corollary, we obtain the same bound for LCP(x, y), when x, y are suffixes of w [6] .

We also follow the notations used in [5, 24, 25]. Let w, x, u, v ∈ Σ*, and let x be a nonempty factor of w = uxv. Let first(x) and last(x) denote the position of the first and the last symbol of x in w, respectively. If w = a_1 ⋯ a_n, a_i ∈ Σ, 1 ≤ i ≤ j ≤ n, then we also set w[i, j] = a_i ⋯ a_j. A local suffix of w is a suffix of a factor of w, specifically suf_x(i) = w[i, last(x)] denotes the local suffix of w at the position i with respect to x, i ≥ first(x). The corresponding global suffix suf_x(i)v of w at the position i is denoted by suf_w(i) = w[i, last(w)] (or simply suf(i) when it is understood). We say that suf_x(i)v is associated with suf_x(i).

When we consider ICFL(w) = (m_1, …, m_k), given a factor m_j of ICFL(w) we have that a local suffix x of m_j is a suffix of m_j and the associated global suffix x_w of w is x · m_{j+1} ⋯ m_k. The following lemmas are crucial for proving our upper bound. Lemma 1 shows that, given two local suffixes x and y of the same factor m_{i−1}, the longest common prefix of the associated global suffixes is the longest common prefix between xr and yr. Here r is the longest common prefix between m_{i−1} and m_i. Lemma 2 handles the case of local suffixes x and y of different factors. In this case the result leads to a bound on LCP(x_w, y_w).

Lemma 1. Let w ∈ Σ+ be a word which is not an inverse Lyndon word. Let ICFL(w) = (m_1, …, m_k). Let r, s, t ∈ Σ*, a, b ∈ Σ be such that m_{i−1} = ras, m_i = rbt, a < b, 1 < i ≤ k. If x, y are nonempty suffixes of m_{i−1}, then lcp(x_w, y_w) = lcp(xr, yr).

Lemma 2. Let w ∈ Σ+ be a word which is not an inverse Lyndon word and let ICFL(w) = (m_1, …, m_k). Let i, j be integers such that 1 < i < j ≤ k. If x is a nonempty suffix of m_{i−1} and y is a nonempty suffix of m_{j−1}, then lcp(x_w, y_w) is a prefix of y m_j.

Let w ∈ Σ+ be a word which is not an inverse Lyndon word and let ICFL(w) = (m_1, …, m_k). We set M = max{|m_i m_{i+1}| : 1 ≤ i < k}. As a main consequence of the previous lemmas, we state that M is an upper bound on LCP(u, v), where u, v are factors of w. Observe that Lemmas 1 and 2 could lead to a more specialized version of the compatibility property, proved in [5, 24, 25], which relates sorting local suffixes of a concatenation of factors to sorting the corresponding global suffixes (see Theorem 1). Indeed the above mentioned lemmas could be applied to sort suffixes of a word by sorting factors of w of bounded size.
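To make the bound concrete, the sketch below computes M for the word of Example 2 and compares it with the longest common prefix of every pair of suffixes starting at distinct positions, which is the corollary case mentioned earlier. The helper names are illustrative and the factorization is supplied by hand rather than computed.

```python
from itertools import combinations


def lcp_len(x: str, y: str) -> int:
    """Length of the longest common prefix of x and y."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n


def consecutive_bound(factors: list[str]) -> int:
    """M = max |m_i m_{i+1}| over consecutive factors (needs at least two factors)."""
    return max(len(a) + len(b) for a, b in zip(factors, factors[1:]))


w = "dabadabdabdadac"
icfl_w = ["daba", "dabdab", "dadac"]   # ICFL(w) as given in Example 2
assert "".join(icfl_w) == w

M = consecutive_bound(icfl_w)
longest = max(lcp_len(w[i:], w[j:]) for i, j in combinations(range(len(w)), 2))
print(M)             # 11
print(longest <= M)  # True
```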

We recall that the sorting of the nonempty local suffixes of w with respect to a nonempty factor x is compatible with the sorting of the corresponding nonempty global suffixes of w if for all i, j with first(x) ≤ i < j ≤ last(x), one has suf_x(i) ≺ suf_x(j) if and only if suf(i) ≺ suf(j).

Theorem 1 [24, 25]. Let w ∈ Σ+ and let CFL(w) = (ℓ_1, …, ℓ_h) be its Lyndon factorization. Then, for any r, s, 1 ≤ r ≤ s ≤ h, the sorting of the nonempty local suffixes of w with respect to x = ℓ_r ⋯ ℓ_s is compatible with the sorting of the corresponding nonempty global suffixes of w.

Lemma 3 states a property similar to the compatibility property when we deal with ICFL(w). Briefly, consider ICFL(w) = (m_1, m_2, …, m_k) and take two indexes j_1, j_2 both contained in x = m_r m_{r+1} ⋯ m_s, 1 ≤ r < s ≤ k. Consider the local suffixes starting from j_1, j_2 and let us compare them with respect to ≺. If suf_x(j_1) ≺ suf_x(j_2), then two cases are possible: suf_x(j_1) ≪ suf_x(j_2) or suf_x(j_1) ∈ Pref(suf_x(j_2)). In the first case obviously suf(j_1) ≪ suf(j_2). The lemma below covers both cases.

Example 5. Let w = a^12 bbab ∈ {a, b}+ with a < b. We have ICFL(w) = (m_1, m_2) = (a^12, bbab). Let x = m_1 = a^12. Consider suf_x(4) = a^9 and suf_x(12) = a. We have suf_x(12) = a ≺ a^9 = suf_x(4). We are in the first case of Lemma 3 and then suf(4) = a^9 bbab ≺ abbab = suf(12).

Let w = dabadabdabdadac be the word of Example 2. We have ICFL(w) = (m_1, m_2, m_3) = (daba, dabdab, dadac). Let x = m_2. Consider suf_{m_2}(8) = dab and suf_{m_2}(5) = dabdab. We have suf_{m_2}(8) = dab ≺ suf_{m_2}(5) = dabdab = (dab)^2. We are in the first case of Lemma 3 and then suf(5) = dabdabdadac ≺ suf(8) = dabdadac. Consider now suf_{m_2}(9) = ab ≺ suf_{m_2}(8) = dab. Since suf_{m_2}(9) is not a proper prefix of suf_{m_2}(8), we are in the second case of Lemma 3 and we have suf(9) = abdadac ≺ suf(8) = dabdadac.
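The comparisons in the two examples above can be replayed with a few lines of code. The sketch below uses 1-based positions as in the text; the helper names `local_suffix` and `global_suffix` are assumptions made for this illustration, and the factor boundaries are entered by hand.

```python
def local_suffix(w: str, last: int, i: int) -> str:
    """suf_x(i) = w[i, last(x)], with 1-based inclusive positions."""
    return w[i - 1:last]


def global_suffix(w: str, i: int) -> str:
    """suf(i) = w[i, |w|], with a 1-based position."""
    return w[i - 1:]


# First example: w = a^12 bbab, x = m_1 occupies positions 1..12.
w1 = "a" * 12 + "bbab"
assert local_suffix(w1, 12, 12) < local_suffix(w1, 12, 4)  # a precedes a^9
assert global_suffix(w1, 4) < global_suffix(w1, 12)        # order is reversed

# Second example: w = dabadabdabdadac, x = m_2 occupies positions 5..10.
w2 = "dabadabdabdadac"
assert local_suffix(w2, 10, 8) < local_suffix(w2, 10, 5)   # dab precedes dabdab
assert global_suffix(w2, 5) < global_suffix(w2, 8)         # order is reversed
assert local_suffix(w2, 10, 9) < local_suffix(w2, 10, 8)   # ab precedes dab
assert global_suffix(w2, 9) < global_suffix(w2, 8)         # order is preserved
print("all comparisons agree with the examples")
```

When the smaller local suffix is a proper prefix of the larger one, the order of the associated global suffixes is reversed in these examples; otherwise it is preserved.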

Let w ∈ Σ+ be a word and let CFL(w) = (ℓ_1, …, ℓ_k) be its Lyndon factorization, k ≥ 1. Let x be a proper factor (resp. prefix, suffix) of w. We say that x is a simple factor of w if, for each occurrence of x as a factor of w, there is j, with 1 ≤ j ≤ k, such that x is a factor of ℓ_j. Informally speaking, every occurrence of x needs to be within some ℓ_j. We say that x is a simple prefix (resp. suffix) of w if x is a proper prefix (resp. suffix) of ℓ_1 (resp. ℓ_k). In this section we compare the Lyndon factorization of w and that of its non-simple factors. Lemma 4 handles a trivial case: if x = ℓ_i ℓ_{i+1} ⋯ ℓ_j is a concatenation of consecutive factors of CFL(w), then CFL(x) is the sequence (ℓ_i, ℓ_{i+1}, …, ℓ_j).

Lemma 4. Let w ∈ Σ+ be a word and let CFL(w) = (ℓ_1, …, ℓ_k) be its Lyndon factorization. For any i, j, with 1 ≤ i ≤ j ≤ k, one has CFL(ℓ_i ⋯ ℓ_j) = (ℓ_i, …, ℓ_j).

If x is a non-simple factor of w and x does not satisfy the hypotheses of Lemma 4, then there are i, j with 1 ≤ i < j ≤ k, a suffix ℓ'_i of ℓ_i and a prefix ℓ''_j of ℓ_j, with ℓ'_i ℓ''_j ≠ 1, such that x = ℓ'_i ℓ_{i+1} ⋯ ℓ_{j−1} ℓ''_j, where it is understood that if j = i + 1, then ℓ_{i+1} ⋯ ℓ_{j−1} = 1 and ℓ'_i ≠ 1, ℓ''_j ≠ 1, ℓ'_i ℓ''_j ≠ ℓ_i ℓ_j. We say that the sequence ℓ'_i, ℓ_{i+1}, …, ℓ_{j−1}, ℓ''_j is associated with x. The following result gives relations between CFL(x) and CFL(w).

Lemma 5. Let w ∈ Σ+ be a word and let CFL(w) = (ℓ_1, …, ℓ_k) be its Lyndon factorization. Let x be a non-simple factor of w such that x does not satisfy the hypotheses of Lemma 4 and let ℓ'_i, ℓ_{i+1}, …, ℓ_{j−1}, ℓ''_j be the sequence associated with x. Let CFL(ℓ'_i) = (g'_1, …, g'_{k'}) and CFL(ℓ''_j) = (g''_1, …, g''_{k''}). We have

CFL(x) = (g'_1, …, g'_{k'}, ℓ_{i+1}, …, ℓ_{j−1}, g''_1, …, g''_{k''}),

where it is understood that if ℓ'_i = 1 (resp. ℓ''_j = 1), then the first k' terms (resp. last k'' terms) in CFL(x) vanish.

Let x, y, z, w, w' ∈ Σ+. Lemma 5 gives relations between the Lyndon factorizations of two overlapping words w, w', i.e., such that w = xy, w' = yz, and the Lyndon factorization of the overlap y, when y is non-simple (as a suffix of w and as a prefix of w'). Indeed observe that both w and w' are substrings of the same word xyz. As a consequence of Lemma 5, the words w, w' may share common Lyndon factors between them and with xyz. Moreover, some of these factors may be in y. More precisely, let CFL(w) = (ℓ_1, …, ℓ_k) and CFL(w') = (f_1, f_2, …, f_h). If y is a non-simple suffix of w and a non-simple prefix of w', then there are indexes i, j, with 1 ≤ i < k, 1 < j ≤ h, such that y = ℓ'_i ℓ_{i+1} ⋯ ℓ_k = f_1 ⋯ f_{j−1} f''_j, where ℓ'_i is a suffix of ℓ_i and f''_j is a prefix of f_j. Let CFL(ℓ'_i) = (g'_1, …, g'_{k'}) and CFL(f''_j) = (g''_1, …, g''_{k''}). By Lemma 5 we have CFL(y) = (g'_1, …, g'_{k'}, ℓ_{i+1}, …, ℓ_k) = (f_1, …, f_{j−1}, g''_1, …, g''_{k''}). Since the Lyndon factorization can be computed in linear time, the above result could lead to efficient measures of similarities between words. These measures could be used to capture words that may be overlapping.
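The relation between CFL(y) and the factorizations of two overlapping words can be observed on a toy instance. In the sketch below the words w = banana and w' = nanab, with overlap y = nana, are hypothetical examples chosen for illustration and do not come from the paper; `cfl` is a standard Duval-style routine for the usual letter order.

```python
def cfl(w: str) -> list[str]:
    """Lyndon factorization of w for the usual letter order (Duval-style scan)."""
    factors, i, n = [], 0, len(w)
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            k = i if w[k] < w[j] else k + 1
            j += 1
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors


w, w2, y = "banana", "nanab", "nana"   # w = xy and w2 = yz with x = ba, z = b
assert w.endswith(y) and w2.startswith(y)

print(cfl(w))   # ['b', 'an', 'an', 'a']
print(cfl(w2))  # ['n', 'an', 'ab']
print(cfl(y))   # ['n', 'an', 'a']

# y = l'_2 l_3 l_4 with l'_2 = n a suffix of l_2 = an, so CFL(y) is
# CFL(l'_2) followed by the last two factors of CFL(w).
assert cfl(y) == cfl("n") + cfl(w)[2:]
# y = f_1 f_2 f''_3 with f''_3 = a a prefix of f_3 = ab, so CFL(y) is
# the first two factors of CFL(w2) followed by CFL(f''_3).
assert cfl(y) == cfl(w2)[:2] + cfl("a")
```

In this instance the factor an is common to CFL(w), CFL(w') and CFL(y), while only the first and last terms of CFL(y) differ from factors of the two words.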

In this paper we investigate new properties of the Lyndon factorization and of the canonical inverse Lyndon factorization, aimed at answering string queries by using these factorizations. Our main result, Proposition 6, gives an upper bound on the length of the longest common prefix of two factors of a word, and this upper bound is expressed in terms of the factors in ICFL. This result could also be applied to investigate parallel approaches to sorting suffixes of a word w with a nontrivial inverse Lyndon factorization. Indeed, the above mentioned bound could relate sorting suffixes of w to sorting factors of w of bounded length. In addition, we state a property showing that substrings of the same word could share common factors of the Lyndon factorization (Lemma 5). This property could be extended to two words that share a common overlap to capture the suffix-prefix relationship between them. It is an open problem whether Lemma 5 extends to ICFL(w). This extension, if it exists, may be of interest in the well-known problem of efficient computation of the suffix-prefix relationship. This is an interesting problem in the analysis of sequencing data [7, 8] and in the construction of overlap graphs for a collection of strings. We believe that the above results could shed new light on further applications of the Lyndon and the inverse Lyndon factorization, and this is the goal of our future research work.

Fast parallel Lyndon factorization with applications

The "Runs" theorem

A new characterization of maximal repetitions by Lyndon trees

Codes and Automata. Encyclopedia of Mathematics and its Applications

Inverse Lyndon words and inverse Lyndon factorizations of words

Lyndon words versus inverse Lyndon words: queries on suffixes and bordered words

An external-memory algorithm for string graph construction

FSG: fast string graph construction for de novo assembly

Free differential calculus, IV. The quotient groups of the lower central series

Combinatorics of words

A note on the Burrows-Wheeler transformation

Algorithms on Strings

Parallel RAM algorithms for factorizing words

On generalized Lyndon words

Factorizing words over an ordered alphabet

Necklaces of beads in k colors and k-ary de Bruijn sequences

Alternative algorithms for Lyndon factorization

On the size of Lempel-Ziv and Lyndon factorizations

On bijective variants of the Burrows-Wheeler transform

Combinatorics on Words

Applied Combinatorics on Words

An extension of the Burrows-Wheeler transform

Sorting suffixes of a text via its Lyndon factorization

Suffix array and Lyndon factorization of a text

Lyndon words and short superstrings

Free Lie algebras

On the size of overlapping Lempel-Ziv and Lyndon factorizations

The authors thank the anonymous referees for their helpful suggestions.