key: cord-0043251-6mw6avpg authors: Bonizzoni, Paola; De Felice, Clelia; Zaccagnino, Rocco; Zizza, Rosalba title: Lyndon Words versus Inverse Lyndon Words: Queries on Suffixes and Bordered Words date: 2020-01-07 journal: Language and Automata Theory and Applications DOI: 10.1007/978-3-030-40608-0_27 sha: ef53ee9a3df47224a44924817dc5061d6c27a2cd doc_id: 43251 cord_uid: 6mw6avpg The Lyndon factorization of a word has been extensively studied in different contexts and several variants of it have been proposed. In particular, the canonical inverse Lyndon factorization [Formula: see text], introduced in [5], maintains the main properties of the Lyndon factorization since it can be computed in linear time and it is uniquely determined. In this paper we investigate new properties of this factorization with the purpose of exploring its use in string queries. As a main result, we prove an upper bound on the length of the longest common extension (or longest common prefix) for two factors of a word w. This bound is at most the maximum length of two consecutive factors of [Formula: see text]. A tool used in the proof is a property that we state for factors with nonempty borders in [Formula: see text]: a nonempty border of a factor [Formula: see text] cannot be a prefix of the next factor [Formula: see text]. Another interesting result relates sorting of global suffixes, i.e., suffixes of a word w, and sorting of local suffixes, i.e., suffixes of the factors in [Formula: see text]. Finally, given a word w and a factor x of w, we prove that their Lyndon factorizations share factors, except for the first and last term of the Lyndon factorization of x. This property suggests that, given two words sharing a common overlap, their Lyndon factorizations could be used to capture the common overlap of these two words. The Lyndon factorization CFL(w) of a word w is the unique factorization of w into a sequence of Lyndon words in nonincreasing lexicographic ordering. This factorization is one of the most well-known and extensively studied in different contexts, from formal languages to algorithmic stringology and string compression. In particular the notion of a Lyndon word has been shown to be useful in theoretical applications, such as the well known proof of the Runs Theorem [2] as well as in string compression analysis. A connection between the Lyndon factorization and the Lempel-Ziv (LZ) factorization has been given in [18] , where it is shown that in general the size of the LZ factorization is larger than the size of the Lyndon factorization, and in any case the size of the Lyndon factorization cannot be larger than a factor of 2 with respect to the size of LZ. This result has been further extended in [28] to overlapping LZ factorizations. The Lyndon factorization has recently revealed to be a useful tool also in investigating queries related to suffixes of a word and sorting such suffixes [25] with strong potentialities [26] for string comparison that have not been completely explored and understood. Relations between Lyndon words and the Burrows-Wheeler Transform (BWT) have been discovered first in [11, 23] and, more recently, in [19] . The main interest in such a factorization is also due to the fact that it can be efficiently computed. Linear-time algorithms for computing this factorization can be found in [15, 16] whereas an O(lg n)-time parallel algorithm has been proposed in [1, 13] . Most recently, variants of the Lyndon factorization have been introduced and investigated with different motivations. In [5] , the notion of an inverse Lyndon word (a word which is strictly greater than each of its proper suffixes) has been introduced to define new factorizations, called inverse Lyndon factorizations. An inverse Lyndon factorization has the property that a word is factorized in a sequence of inverse Lyndon words, in an increasing and prefix-order-free lexicographic ordering, where prefix-order-free means that a factor cannot be a prefix of the next one. A word w which is not an inverse Lyndon word may have several inverse Lyndon factorizations but it admits a canonical inverse Lyndon factorization. This special inverse Lyndon factorization has been introduced in [5] and denoted ICFL(w) because it is the counterpart of the Lyndon factorization CFL(w) of w, when we use (I)inverse words as factors. Indeed, in [5] it has been proved that ICFL(w) can be computed in linear time and it is uniquely determined for a word w. In this paper we further investigate ICFL(w). The main results stated here are the following: (1) we find un upper bound on the length of the longest common prefix of two distinct factors in ICFL(w), namely the maximal length of two consecutive factors in ICFL(w) (Proposition 6), (2) we are able to relate sorting of global suffixes, i.e., suffixes of the word w, and local suffixes, i.e., suffixes of the factors in ICFL(w) (Lemma 3). Differently from Lyndon words, inverse Lyndon words may be bordered. As an intermediate result, we show that if a factor m i in ICFL(w) has a nonempty border, then such a border cannot be inherited by the next factor, since it cannot be the prefix of the next factor m i+1 (Proposition 5). This result is proved by a further investigation on the connection between the Lyndon factorization and the canonical inverse Lyndon factorization of a word, given in [5] through the grouping property. Indeed, given a word w which is not an inverse Lyndon word, the factors in ICFL(w) are obtained by grouping together consecutive factors of the anti-Lyndon factorization of w that form a chain for the prefix order (Proposition 7.7 in [5] ). Another natural question is the following. Given two words having a common overlap, can we use their Lyndon factorizations to capture the similarity of these words? A partial positive answer to this question is provided here: given a word w and a factor x of w, we prove that their Lyndon factorizations share factors, except for the first and last term of the Lyndon factorization of x. For the detailed proofs of the results in this paper we refer the reader to [6] . Throughout this paper we follow [4, 10, 20, 22, 27] for the notations. We fix the finite non-empty (totally ordered) alphabet Σ. We denote by Σ * the free monoid generated by Σ and we set Σ + = Σ * \ 1, where 1 is the empty word. For a word w ∈ Σ * , we denote by |w| its length. Given a language L ⊆ A * , we denote by Pref(L) the set of all prefixes of its elements. Two words x, y are incomparable for the prefix order if neither x is a prefix of y nor y is a prefix of x. Otherwise, x, y are comparable for the prefix order. We write x ≤ p y if x is a prefix of y and x ≥ p y if y is a prefix of x. We recall that, given a nonempty word w, a border of w is a word which is both a proper prefix and a suffix of w [12] . The longest proper prefix of w which is a suffix of w is also called the border of w [12, 22] . A word w ∈ Σ + is bordered if it has a nonempty border. Otherwise, w is unbordered. A nonempty word w is primitive if w = x k implies k = 1. An unbordered word is primitive. A sesquipower of a word x is a word w = x n p where p is a proper prefix of x and n ≥ 1. The lexicographic (or alphabetic order) ≺ on (Σ * , <) is defined by setting x ≺ y if x is a proper prefix of y, or x = ras, y = rbt, a < b, for a, b ∈ Σ and r, s, t ∈ Σ * . For two nonempty words x, y, we write x y if x ≺ y and x is not a proper prefix of y [3] . We also write y x if x ≺ y. Two words x, y are called conjugate if there exist words u, v such that x = uv, y = vu. The conjugacy relation is an equivalence relation. A conjugacy class is a class of this equivalence relation. A Lyndon word w ∈ Σ + is a word which is primitive and the smallest one in its conjugacy class for the lexicographic order. A class of conjugacy is also called a necklace and often identified with the minimal word for the lexicographic order in it. Thus, a nonempty word is a necklace if and only if it is a power of a Lyndon word. A prenecklace is a prefix of a necklace, hence any nonempty prenecklace w has the form w = (uv) k u, where uv is a Lyndon word, u ∈ Σ * , v ∈ Σ + , k ≥ 1, that is, w is a sesquipower of a Lyndon word uv. A characterization of the structure of the prefixes of the Lyndon words is given in [15] . It states that a word is a nonempty prefix of a Lyndon word if and only if it is a sesquipower of a Lyndon word distinct of the maximal letter. It is known that each Lyndon word w is unbordered. Moreover, a word w ∈ Σ + is a Lyndon word if and only if w ≺ s, for each nonempty proper suffix s of w. Different characterizations and variations of Lyndon words are given [3, 14, 21] . In the following L = L (Σ * ,<) will be the set of Lyndon words, totally ordered by the relation ≺ on (Σ * , <). We know that any word w ∈ Σ + can be written in a unique way as a nonincreasing product w = 1 2 · · · h of Lyndon words, i.e., in the form w = 1 2 · · · h , with j ∈ L and 1 2 . . . h [9] . The sequence CFL(w) = ( 1 , . . . , h ) is called the Lyndon decomposition (or Lyndon factorization) of w. Uniqueness of the above factorization is proved in [15] and allows us to state a recursive definition of CFL(w), for a nonempty word w. Precisely, if w is not a Lyndon word, then CFL(w) = ( 1 , 1 , . . . , h ), where ( 1 , . . . , h ) = CFL(w ), w = 1 w and 1 is the longest prefix of w which is a Lyndon word. Sometimes we need to emphasize consecutive equal factors in CFL. We write CFL(w) = ( n1 1 , . . . , nr r ) to denote a tuple of n 1 +. . .+n r Lyndon words, where r > 0, n 1 , . . . , n r ≥ 1. Precisely 1 . . . r are Lyndon words, also named Lyndon factors of w. There is a linear time algorithm to compute the pair ( 1 , n 1 ) and thus, by iteration, the Lyndon factorization of w [16, 22] . Linear time algorithms may also be found in [15] and in the more recent paper [17] . The inverse lexicographic or inverse alphabetic order on (Σ * , <), denoted ≺ in , is the lexicographic order on (Σ * , < in ). Here < in means that the order of the alphabet is reversed, that is b < in a ⇔ a < b, for all a, b ∈ Σ. We denote by L in = L (Σ * ,