key: cord-0608908-ppnlxf9g
authors: Daykin, Jacqueline W.; Koppl, Dominik; Kubel, David; Stober, Florian
title: On Arithmetically Progressed Suffix Arrays and related Burrows-Wheeler Transforms
date: 2021-07-06
journal: nan
DOI: nan
sha: d1762a411d635c12fe2a0a59b157ad8a1b02d681
doc_id: 608908
cord_uid: ppnlxf9g

We characterize those strings whose suffix arrays are based on arithmetic progressions, in particular, arithmetically progressed permutations where all pairs of successive entries of the permutation have the same difference modulo the respective string length. We show that an arithmetically progressed permutation $P$ coincides with the suffix array of a unary, binary, or ternary string. We further analyze the conditions of a given $P$ under which we can find a uniquely defined string over either a binary or ternary alphabet having $P$ as its suffix array. For the binary case, we show its connection to lower Christoffel words, balanced words, and Fibonacci words. In addition to solving the arithmetically progressed suffix array problem, we give the shape of the Burrows-Wheeler transform of those strings solving this problem. These results give rise to numerous future research directions.

The integral relationship between the suffix array [36] (SA) and Burrows-Wheeler transform [10] (BWT) is explored by Adjeroh et al. [1] , who also illustrate the versatility of the BWT beyond its original motivation in lossless block compression [10] . BWT applications include compressed index structures using backward search pattern matching, multimedia information retrieval, bioinformatics sequence processing, and it is at the heart of the bzip2 suite of text compressors. By its association with the BWT, this also indicates the importance of the SA data structure and hence our interest in exploring its combinatorial properties.

These combinatorial properties can be useful when checking the performance or integrity of string algorithms or data structures on string sequences in testbeds when properties of the employed string sequences are well understood.

In particular, due to current trends involving massive data sets, indexing data structures need to work in external memory (e.g., [8] ), or on distributed systems (e.g. [24] ). For devising a solution adaptable to these scenarios, it is crucial to test whether the computed index (consisting of the suffix array or the BWT, for instance) is correctly stored on the hard disk or on the computing nodes, respectively. This test is more cumbersome than in the case of a single machine working only with its own RAM. One way to test is to compute the index for an instance, whose index shape can be easily verified. For example, one could check the validity of the computed BWT on a Fibonacci word since the shape of its BWT is known [39, 13, 45] .

Other studies based on Fibonacci words are the suffix tree [43] or the Lempel-Ziv 77 (LZ77) factorization [7] . In [34] , the suffix array and its inverse of each even Fibonacci word is studied as an arithmetic progression. In this study, the authors, like many at that time, did not append the artificial $ delimiter (also known as a sentinel) to the input string, thus allowing suffixes to be prefixes of other suffixes. This small fact makes the definition of BWT T [i] = T[SA T [i]−1] for a string T with suffix array SA T incompatible with the traditional BWT defined on the BWT matrix, namely the lexicographic sorting of all cyclic rotations of the string T.

For instance, SA bab = [2, 3, 1] with BWT bab = bab, while SA bab$ = [4, 2, 3, 1] and BWT bab$ = bba$ with $ < a < b. However, the traditional BWT constructed by reading the last characters of the lexicographically sorted cyclic rotations [abb , bab , bba ] of bab yields bba, which is equal to BWT bab$ = bba$ after removing the $ character.

Note that not all strings are in the BWT image. An O(n log n)-time algorithm is given by Giuliani et al. [28] for identifying all the positions in a string S where a $ can be inserted into so that S becomes the BWT image of a string ending with $.

Despite this incompatibility in the suffix array based definition of the BWT, we can still observe a regularity for even Fibonacci words [34, Sect. 5] . Similarly, both methods for constructing the BWT are compatible when the string T consists of a Lyndon word. The authors of [34, Remark 1] also observed similar characteristics for other, more peculiar string sequences. For the general case, the $ delimiter makes both methods equivalent, however the suffix array approach is typically preferred as it requires O(n) time [41] compared to O(n 2 ) with the BWT matrix method [10] . By utilizing combinatorial properties of the BWT, an in-place algorithm is given by Crochemore et al. [15] , which avoids the need for explicit storage for the suffix sort and output arrays, and runs in O(n 2 ) time using O(1) extra memory (apart from storing the input text). Köppl et al. [33, Sect. 5.1] adapted this algorithm to compute the traditional BWT within the same space and time bounds.

Up to now, it has remained unknown whether we can formulate a class of string sequences for which we can give the shape of the suffix array as an arithmetic progression (independent of the $ delimiter). With this article, we catch up on this question, and establish a correspondence between strings and suffix arrays generated by arithmetic progressions. Calling a permutation of integers [1. .n] arithmetically progressed if all pairs of successive entries of the permutation have the same difference modulo n, we show that an arithmetically progressed permutation coincides with the suffix array of a unary, binary or ternary string. We analyze the conditions of a given arithmetically progressed permutation P under which we can find a uniquely defined string T over either a unary, a binary, or ternary alphabet having P as its suffix array.

The simplest case is for unary alphabets: Given the unary alphabet Σ := {a} and a string T of length n over Σ, SA T = [n, n − 1, . . . , 1] is an arithmetically progressed permutation with ratio −1 ≡ n − 1 mod n.

For the case of a binary alphabet {a, b}, several strings of length n exist that solve the problem. Trivially, the solutions for the unary alphabet also solve the problem for the binary alphabet. However, studying those strings of length n whose suffix array is [n, n − 1, . . . , 1], there are now multiple solutions: each T = b r a s with r, s ∈ [0..n] such that r + s = n has this suffix array. Similarly, T = a n−1 b has the suffix array SA T = [1, 2, . . . , n], which is an arithmetically progressed permutation with ratio 1.

A non-lexicographic ordering on strings, the V -order , considered for a modified FM-index [2] , provides a curious example for the case with ratio −1: if S is a proper subsequence of a string T of length n, then S precedes T in V -order, written S ≺ V T. This implies that T[n] ≺ V T[n − 1..n] ≺ V · · · ≺ V T[1..n], so that SA T [i] = n − i + 1 for every i ∈ [1..n], thus enabling trivial suffix sorting.

Clearly, any string ordering method that prioritizes minimal length within its definition will have a suffix array progression ratio of −1. Binary Gray codes 1 have this property and are ordered so that adjacent strings differ by exactly one bit [29] . These codes, which exhibit non-lexicographic order, have numerous applications, notably in error detection schemes such as flagging unexpected changes in data for digital communication systems, and logic circuit minimization. While it seems that Gray code order has not been applied directly to order the rows in a BWT matrix, just four years after the BWT appeared in 1994 [10] , Chapin and Tate applied the concept when they investigated the effect of both alphabet and string reordering on BWT-based compressibility [11] . Their string sorting technique for the BWT matrix ordered the strings in a manner analogous to reflected Gray codes, but for more general alphabets. This modification inverted the sorting order for alternating character positions with which they demonstrated improved compression.

To date the only known BWT designed specifically for binary strings is the binary block order B-BWT [19] . This binary string sorting method prioritizes length, thus also exhibiting a suffix array with progression ratio −1. Ordering strings of the same length, as applicable to forming the BWT matrix, is by a play on decreasing and increasing lexicographic ordering of the run length exponents of blocks of bits. Experimentation showed roughly equal compressibility to the original BWT in binary lexicographic order but with instances where the B-BWT gave better compression.

To see that the above two binary orders are distinct: 1101 comes before 1110 in Gray code order, whereas, 1110 comes before 1101 in binary block order.

In what follows, we present a comprehensive analysis of strings whose suffix arrays are arithmetically progressed permutations (under the standard lexicographic order). In practice, such knowledge can reduce the O(n) space for the suffix array to O(1).

The structure of the paper is as follows. 2 In Section 2 we give the basic definitions and background, and also deal with the elementary case of a unary alphabet. We present the main results in Section 3, where we cover ternary and binary alphabets, and consider inverse permutations. In Section 4, we illustrate the binary case for Christoffel words and in particular establish that every (lower) Christoffel word has an arithmetically progressed suffix array. We go on to link the binary characterization to balanced words and Fibonacci words. A characterization of strings with larger alphabets follows in Section 5. We overview the theme concepts for meta strings in Section 6. We conclude in Section 7 and propose a list of open problems and research directions, showing there is plenty of scope for further investigation. But for all that, we proceed to the foundational concepts presented in the following section, starting with the case of a unary alphabet.

Let Σ be an alphabet with size σ := |Σ|. An element of Σ is called a character 3 . Let Σ + denote the set of all nonempty finite strings over Σ. The empty string of length zero is denoted by ε; we write Σ * = Σ + ∪ {ε}. Given an integer n ≥ 1, a string 4 of length n over Σ takes the form T = t 1 · · · t n with each t i ∈ Σ. We write T = T[1..n] with T[i] = t i . The length n of a string T is denoted by |T|. If T = uwv for some strings u, w, v ∈ Σ * , then u is a prefix , w is a substring, and v is a suffix of T; we say u (resp. w and v) is proper if u = T (resp. w = T and v = T). We say that a string T of length n has period p ∈ [1..n − 1] if T[i] = T[i + p] for every i ∈ [1..n − p] (note that we allow periods larger than n/2). If T = uv, then vu is said to be a cyclic rotation of T. A string T is said to be a repetition if and only if it has a factorization T = T k for some integer k ≥ 1; otherwise, T is said to be primitive. A string that is both a proper prefix and a proper suffix of a string T = ε is called a border of T; a string is border-free if the only border it has is the empty string ε.

If Σ is a totally ordered alphabet with order <, then this order < induces the lexicographic ordering ≺ on Σ * such that u ≺ v for two strings u, v ∈ Σ * if and only if either u is a proper prefix of v, or u = ras, v = rbt for two characters a, b ∈ Σ such that a < b and for some strings r, s, t ∈ Σ * . In the following, we select a totally ordered alphabet Σ having three characters a, b, c

A string T is a Lyndon word if it is strictly least in the lexicographic order among all its cyclic rotations [35] . For instance, abcac and aaacbaabaaacc are Lyndon words, while the string aaacbaabaaac with border aaac is not.

A reciprocal relationship exists between the suffix array of a text and its Lyndon factorization, that is, the unique factorization of the text by greedily choosing the maximal length Lyndon prefix while processing the text from start to end: the Lyndon factorization of a text can be obtained from its suffix array [31] ; conversely, the suffix array of a text can be constructed iteratively from its Lyndon factorization [38] .

Lyndon words have numerous applications in combinatorics and algebra, and prove challenging entities due to their non-commutativity. Additionally, Lyndon words can arise naturally in Big Data -an instance in a biological sequence over the DNA alphabet, with Σ = {A < C < G < T}, is the following substring occurring in a SARS-CoV-2 genome 5 :

For the rest of the article, we take a string T of length n ≥ 2. The suffix array SA := SA T [1..n] of T is a permutation of the integers [1.

.n] such that

The focus of this paper is on arithmetic progressions. An arithmetic progression is a sequence of numbers such that the differences between all two consecutive terms are of the same value: Given an arithmetic progression {p i } i≥1 , there is an integer k ≥ 1 such that p i+1 = p i + k for all i ≥ 1. We call k the ratio of this arithmetic progression. Similarly to sequences, we can define permutations that are based on arithmetic progressions: An arithmetically progressed permutation with ratio k ∈ [1..n − 1] is an array P := [p 1 , . . . , p n ] with p i+1 = p i + k mod n for all i ∈ [1..n], where we stipulate that p n+1 := p 1 . 6 Here x mod n := x if x ≤ n, x − n mod n for an integer x ≥ 1, and x + n mod n for x < 1. In particular, n mod n = 0 mod n = n. In what follows, we want to study (a) strings whose suffix arrays are arithmetically progressed permutations, and (b) the shape of these suffix arrays. For a warm-up, we start with the unary alphabet: Conversely, given the arithmetically progressed permutation P = [n, n − 1, . . . , 1], we want to know the number of strings from a general totally ordered alphabet Σ = [1..σ] with the natural order 1 < 2 < · · · < σ, having P as

Min. σ Properties of Strings Reference 1 2 unique, Lyndon word Theorem 3.9

k + 1 2 unique, period (n − k) Theorem 3.9 n = (n − 1) 2 unique, period (n − k) Theorem 3.9 = (n − 1) 1 trivially periodic Theorem 2.2 ∈ {1, k + 1, n} 3 unique Theorem 3.2 Figure 1 : Characterization of strings whose suffix array is an arithmetic progression P = [p 1 , . . . p n ] of ratio k. The choice of p 1 determines the minimum size of the alphabet and whether a string is unique, periodic or a Lyndon word. The column Min. σ denotes the smallest possible size σ for which there exists such a string whose characters are drawn from an alphabet Σ with |Σ| = σ.

their suffix array. For that, we fix a string T of length n with SA T = P . Let s j ≥ 0 be the number of occurrences of the character j ∈ Σ appearing in T. Then σ j=1 s j = n. By construction, each character j has to appear after all characters k with k > j. Therefore, T = σ sσ (σ − 1) sσ−1 · · · 1 s1 such that the position of the characters are uniquely determined. 7 In other words, we can reduce this problem to the classic stars and bars problem [23, Chp. II, Sect. 5] with n stars and σ − 1 bars, yielding n+σ−1 n possible strings. Hence we obtain: Theorem 2.2. There are n+σ−1 n strings of length n over an alphabet with size σ having the suffix array [n, n − 1, . . . , 1].

As described above, strings of Theorem 2.2 have the form σ sσ (σ−1) sσ−1 · · · 1 s1 . The BWT based on the suffix array is 1 s1−1 2 s2 · · · σ sσ 1. For s 1 ≥ 2, it does not coincide with the BWT based on the rotations since the lexicographically smallest rotation is 1 s1 σ sσ · · · 2 s2 , and hence the first entry of this BWT is 2. For s 1 = 1, the last character '1' acts as the dollar sign being unique and least among all characters, making both BWT definitions equivalent.

For the rest of the analysis, we omit the arithmetically progressed permutation [n, n − 1, . . . , 1] of ratio k = n − 1 as this case is complete. All other permutations (including those of ratio k = n − 1) are covered in our following theorems whose results we summarized in Fig. 1 .

given an arithmetically progressed permutation P , we show that either there is precisely one string T with SA T = P whose characters are drawn from a ternary alphabet, or, if there are multiple candidate strings, then there is precisely one whose characters are drawn from a binary alphabet. For this aim, we start with the restriction on k and n to be coprime: 

Given an arithmetically progressed permutation P := [p 1 , . . . , p n ] with ratio k, we define the ternary string T[1..n] by splitting P right after the values n − k and (p 1 − k − 1) mod n into the three subarrays A, B, and C (one of which is possibly empty) such that P = ABC. Subsequently, we set

(1) Figure 2 gives an example of induced ternary/binary strings. Proof. Suppose we have constructed SA T . Since a < b < c, according to the above assignment of T, the suffixes starting with a lexicographically precede the suffixes starting with b, which lexicographically precede the suffixes starting with c. Hence, SA[1.

.|A|], SA[|A| + 1..|A| + |B|] and SA[|A| + |B| + 1..n] store the same text positions as A, B, and C, respectively. Consequently, it remains to show that the entries of each subarray (A, B or C) are also sorted appropriately. Let p i and p i+1 be two neighboring entries within the same (5) and (8), the alphabet is binary and the BWTs are the same. The strings of (3) and (8) are periodic with period n − k, since the last text position of each subarray is at most as large as n − k = 3 (cf. the proof of 

we can recursively show that these entries remain in the same order next to each other in the suffix array until either reaching the last array entry or a subarray split, that is, (1)

1. When p i + 1 becomes p 1 − k mod n = p n (the last entry in SA), p i+1 is in the subsequent subarray of the subarray of To sum up, the text positions stored in each of A, B and C are in the same order as in SA T since the j − 1 subsequent text positions of each consecutive pair of entries p i and p i+1 are consecutive in P for the smallest integer j ∈ [1.

.n] such that p i+1 + jk ∈ {p 1 − 1, n}. Reading the last column of a BWT matrix (whose characters are italic) from top downwards yields the BWT defined on the BWT matrix. While the BWT defined on the BWT matrix and the one defined by the suffix array coincides for the strings of Eq. (1) due to Theorem 3.3, this is not the case in general for the binary strings studied in Section 3.3, where we observe that BWT bbabbabb = bbbbbaab defined by the suffix array differs from bbbbbaba (the last column on the right).

Knowing the suffix array of the ternary string T of Eq. (1), we can give a characterization of its BWT. We start with the observation that both BWT definitions (rotation based and suffix array based) coincide for the strings of Eq. (1) (but do not in general as highlighted in the introduction, cf. Fig. 4) , and then continue with insights in how the BWT looks like. Proof. According to Theorem 3.2, SA T = P , and therefore the BWT of T defined on the suffix array is given by

The BWT matrix is constituted of the lexicographically ordered cyclic rotations of T. The BWT BWT matrix based on the BWT matrix is obtained by reading the Figure 5 : Setting of Eq. (1) with the distinction whether the entry p 1 − k − 1 appears before (left) or after (right) n − k in P , yielding a different shape of the

last column of the BWT matrix from top downwards (see Fig. 4 ). Formally, let Q[i] be the starting position of the lexicographically i-th smallest rotation

We do that by comparing both rotations R i and R i+1 characterwise:

First we show that j = p n − p i + 1 mod n by a contradiction: Assuming that j = p n − p i + 1 mod n, we conclude that j = 1 by the definition of i ∈ [1..n − 1]. Since k is the ratio of P , we have

, contradicting the choice of j as the first position where R i and R i+1 differ.

This concludes that j ≤ n (hence, R i = R i+1 ) and j = p n − p i + 1 mod n. 

Proof. Since P is an arithmetically progressed permutation with ratio k then so is the sequence P := [p 1 , . . . , p n ] with p i = p i − 1 mod n. In particular, P is a cyclic shift of P with p n = p 1 − 1 − k mod n because p 1 = p 1 − 1. However, p 1 − 1 − k is a split position of one of the subarrays A, B, or C, meaning that P starts with one of these subarrays and ends with another of them (cf. Fig. 5 ).

Consequently, there is a t such that p t = p n , and we have the property that

We will determine the parameter t = n − k −1 mod n after Eq. (3) in Section 3.4, where k −1 is defined such that k · k −1 mod n = 1 mod n. With Lemma 3.4, we obtain the following corollary which shows that the number of runs in BWT T for T defined in Eq. (1) are minimal: with ratio k such that p 1 ∈ {1, k + 1, n}, the string T given in Eq. (1) is unique.

Proof. The only possible way to define another string T would be to change the borders of the subarrays A, B, and C. Since p 1 ∈ {1, n}, n − k and n, as well as p 1 − k − 1 and p 1 − 1, are stored as a consecutive pair of text positions in P .

• If P is not split between its consecutive text positions n − k and n, then

• If P is not split between its consecutive text positions (

Since Following this analysis of the ternary case we proceed to consider binary strings. A preliminary observation is given in Fig. 2 , which shows, for the cases p 1 is 1 and n in Theorem 3.6, namely Rotations (5) and (8), that a rotation of n− k in the permutation gives a rotation of one in the corresponding binary strings. We formalize this observation in the following lemma, drawing a connection between binary strings whose suffix arrays are arithmetically progressed and start with 1 or n. 

Proof. Since p 1 = 1, T[1] = a and T[n] = b. In the following, we show that P = SA T for P := [p 1 , . . . , p n ] := [p 1 − 1 mod n, . . . , p n − 1 mod n] with p 1 = p 1 − 1 = n (since T [n] = a). For that, we show that each pair of suffixes in SA T is kept in the same relative order in P (excluding SA T [1] = 1):

Hence the relative order of these suffixes given by [p 2 , . . . , p n ] and [p 2 , . . . , p n ] is the same. In total, we have

.n], hence P is an arithmetically progressed permutation with ratio k. Given the first m entries in P represent all suffixes of T starting with a, P is the m-th rotation of P since p 1 = n is the (m + 1)-th entry of P , i.e., the smallest suffix starting with b in T.

Finally, since the strings T and T are rotations of each other, their BWTs are the same.

Like the parameter t of Lemma 3.4, we will determine the parameter m after Eq. (3) in Section 3.4.

We start with the construction of a binary string from an arithmetically progressed permutation: Proof. If p 1 = 1, then P is split after the occurrences of the values n − k and −k = n − k mod n, which gives only two non-empty subarrays. If p 1 = n, P is split after the occurrence of n − k − 1, which implies that C is empty since p n = n − k. Hence, T can be constructed with a binary alphabet in those cases, cf. Fig. 2 .

For the case p 1 = k + 1, P is split after the occurrences of the values n − k and k + 1 − k − 1 mod n = n mod n, so B contains only the text position n. By construction, the requirement is that the suffix T[n] is smaller than all other suffixes starting with c. So instead of assigning the unique character T[n] ← b like in Theorem 3.2, we can assign T[n] ← c, which still makes T[n] the smallest suffix starting with c. We conclude this case by converting the binary alphabet {a, c} to {a, b}. Cf. Fig. 2 , where T in Rotation (6) has become bbabbabb with period n − k = 3.

The main result of this section is the following theorem. There, we characterize all binary strings whose suffix arrays are arithmetically progressed per- b a b b a b b [ 1, 6, 3, 8 , 5, 2, 7, 4 ] 3 3 Figure 6 : All binary strings of length 8 whose suffix arrays are arithmetically progressed permutations with ratio k = 5. Theorem 3.9 characterizes these strings (and also gives the definition of p s ). Cases (1) and (3) also appear in Fig. 2 at Rotation (8) and (5), respectively, while Case (2) can be obtained from Rotation (6) by exchanging the last character with c. Cases (1) and (2) both have period n − k = 3, and Case (3) is a Lyndon word. mutations. More precisely, we identify which of them are unique 9 , periodic, or a Lyndon word.

Theorem 3.9. Let n and k ∈ [1..n − 1] be two coprime integers. If k = n − 1, there are exactly three binary strings of length n whose suffix arrays are arithmetically progressed permutations with ratio k. Each such solution T s ∈ {a, b} + is characterized by 1. p 1 = n and p s = n − k − 1, 2. p 1 = k + 1 and p s = n − k, and 3. p 1 = 1 and p s = n − k. 9 The exact number of these binary strings is not covered by Theorem 3.6.

The string T s has period n − k in Cases 1 and 2, while T s of Case 3 is a Lyndon word, which is not periodic by definition.

For k = n − 1, Cases 2 and 3 each yields exactly one binary string, but Case 1 yields n binary strings according to Theorem 2.2.

Proof. Let S be a binary string of length n, and suppose that SA S = P := [p 1 , . . . , p n ] is an arithmetically progressed permutation with ratio k. Finally, we need to show that no other value for p 1 admits a binary string S having an arithmetically progressed permutation P := [p 1 , . . . , p n ] with ratio k as its suffix array. So suppose that p 1 / ∈ {1, k + 1, n}, then this would imply the following: For a given arithmetically progressed permutation with ratio k, and first entry p 1 ∈ {1, k + 1, n}, the string T s of Theorem 3.9 coincides with T of Theorem 3.8.

Since the inverse P −1 of a permutation P with P −1 [P [i]] = i is also a permutation, one may wonder whether the inverse P −1 of an arithmetically progressed permutation is also arithmetically progressed. We affirm this question in the following. For that, we use the notion of the multiplicative inverse k −1 of an integer k (to the congruence class [1..n] = Z/nZ), which is given by k −1 · k mod n = 1 mod n. The multiplicative inverse k −1 is uniquely defined if k and n are coprime.

Theorem 3.10. The inverse P −1 of an arithmetically progressed permutation P with ratio k is an arithmetically progressed permutation with ratio k −1 and P −1 [1] = (1 − P [n]) · k −1 mod n.

Proof. Let x := P [i] for an index i ∈ [1..n]. Then P [i + k −1 mod n] = P [i] + k · k −1 = x + k · k −1 mod n = x + 1 mod n. For the inverse permutation P −1 this means that P −1 [x] = i and P −1 [x + 1 mod n] = i + k −1 mod n. Thus the difference

Since P [i] = j ⇐⇒ P [n] + ik mod n = j holds for all indices i ∈ [1.

.n], we have (using i ← P −1 [1] and j ← 1 in the above equivalence)

Consequently, using the split index s of p s for SA and

where SA and ISA denote the suffix array and the inverse suffix array of T s , respectively. Another result is that ISA[p s ] = s is the number of a's in T s , for which we split the study into the cases of Theorem 3.9: 

We conclude our main results of Section 3 by drawing connections between strings having arithmetically progressed suffix arrays and Christoffel words (Section 4.1), balanced words (Section 4.2), and Fibonacci words (Section 4.3).

Christoffel words are binary strings whose origins are considered to date from Bernoulli's 1771 work [5] . Christoffel words can be described geometrically in terms of a line segment and associated polygon traversal [9] : let (p, q) ∈ N 2 where (p, q) are coprime and let S be the line segment with endpoints (0, 0) and (p, q). The induced path of a binary string T is a list of points v 0 , . . . , v n ∈ N 2 such that v 0 = (0, 0), v n = (p, q), and for each i ∈ [1.

The string T ∈ {a, b} * is a lower Christoffel word if the path induced by T from the origin to (p, q) is below the line segment S and the path and S determines a simple polygon which contains no other point in N 2 . An upper Christoffel word is defined analogously by taking the path above S. Hence, a Christoffel word is defined by a direction (above or below) and the slope p/q, which determines p and q since p and q are coprime. For instance, Case (3) of Fig. 6 determines a Christoffel word with slope 5 3 . It follows from the coprimality of p and q that Christoffel words are necessarily primitive. In what follows, we focus on lower Christoffel words, and drop the lower adjective when speaking of (lower) Christoffel words.

To show that every lower Christoffel word has an arithmetically progressed suffix array, we use an alternative characterization of Christoffel words based on Cayley graphs [6, Def. 1.4] . Let again p, q ∈ N be coprime. Fig. 8 is the Cayley graph of Z/(p + q)Z generated by q. Cayley graphs are always simple cycles since q and p + q are coprime. In what follows, we use mod 0 with n mod 0 n = 0 mod 0 n = 0 to match the definition in [6] , as opposed to mod with 0 mod n = n mod n = n elsewhere.

An edge s → t in the Cayley graph has the label a if s < t, otherwise it has the label b. Reading the edge labels, starting from node 0, following the edges of the Cayley graph and stopping when reaching node 0 again, yields the lower Christoffel word T c parametrized by p and q. The i-th node in the Cayley graph (0 being the first node) is (i − 1)q mod 0 (p + q). Hence the i-th character of T c is

Given the suffix array SA Tc of a lower Christoffel word T c , the split index s (defined in Theorem 3.9) is given by p, the total number of units along the x-axis in the polygonal path. All lower Christoffel words are Lyndon words [9] and so necessarily border-free. Hence SA Tc [p 1 ] = 1, and SA Tc [s+1] = n since the string T c must end b. The following theorem gives now the connection between lower Christoffel words and the strings of Section 3:

Theorem 4.1. Let p, q ∈ N be coprime. Then the lower Christoffel word T c characterized by p and q has an arithmetically progressed suffix array. The suffix array is given by the arithmetic progression P with p 1 = 1 and k = q −1 , where q −1 is the multiplicative inverse of q in Z/nZ. The string T c is identical to Case 3 of Theorem 3.9 characterizing the binary case.

Proof. We proof the theorem by showing that the Christoffel word T c is equal to the string described in Theorem 3.9 as Case 3 whose suffix array is the arithmetically progressed array P .

Let n = p + q. By Eq. (4) the i-th character of T c is an a if and only if (i − 1)q mod 0 n < iq mod 0 n. We can rewrite that to (i − 1)q mod 0 n < (i − 1)q mod 0 n + q mod 0 n. This condition is fulfilled if and only if ((i − 1)q mod 0 n) + q < n. We can rewrite that to (i − 1)q mod 0 n < n − q. Replacing mod 0 with mod, we obtain

otherwise.

Let k = q −1 . Let T s denote the string of Case 3 in Theorem 3.9, which is characterized by p 1 = 1 and p s = n − k = n − q −1 . Using the results from Section 3.4, Eq. (3), T s can be written as

Substituting k = q −1 , k −1 = q and s = ISA[p s ] = ISA[n − q −1 ] = n − q, we can rewrite the condition for a in Eq. (5) as 1 + (i − 1)q mod n ≤ n − q. Replacing ≤ with < we obtain the same definition for T s as for T c , concluding the proof.

otherwise.

T SA p s s 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 ( 1) a a b a b a a b a b a b [ 1 , 6, 11, 4, 9, 2, 7, 12, 5, 10, 3, 8 ] 7 7 (2) b a b a b a a b a b a a [ 1, 8, 3 , 10, 5, 12, 7, 2, 9, 4, 11, 6 ] 5 5 Figure 9 : Two Christoffel words with S = ((0, 0), (7, 5) ), slope 5 7 , and length 12. The suffix array of the the lower Christoffel word (1) has an arithmetically progressed permutation with ratio k = 5, p s = 7 = n − k, and hence is an instance of Theorem 3.9, Case 3. Then, assuming b < a, the suffix array of the upper Christoffel word (2) has an arithmetic progression ratio of n − k = 7 with split index n − s and position of largest suffix starting b of n − p s .

In the geometric setting, traversing the path of the polygon from (0, 0) to (p, q) is equivalent to scanning the characters in the defining Christoffel word T c , hence we can associate each character with its polygon coordinates. The BWT can be obtained from iterating across the suffix array and for each index selecting the preceding character in the text -the BWT for the first string in Figure  9 has maximal runs in the form b 5 a 7 ; more generally, the BWT of a Christoffel word over Σ = {a, b} with slope q p takes the form b q a p [6] . Further, since the BWT is injective on Lyndon words (see [6] ) it follows that this property holds for Christoffel words.

The progression ratio q −1 of Theorem 4.1 may be useful for accessing geometries of interest in the polygon or aiding discovery of geometric repetitive structures. For instance, to access the 'steps' in the polygon in decreasing width, start at the origin and apply increments of k: for the first string in Figure 9 the widest steps are at coordinates (0, 0) and (3, 2) -see Figure 10 . Note that the geometric width of this polygon at a certain height reflects a run of the character a in its representing Christoffel word T c . Hence, our question of steps in decreasing width can be translated to finding text positions T c [i] with T c [i..] having a long common prefix with a · · · .

We now extend a known result for lower Christoffel words [6, Sect. 6.1], which distinguishes consecutive rows in an associated BWT matrix: This gives an arrangement in the first two rows of A = ab ba starting at position n − k in row 1 and then at position 2(n − 2) mod n in row 2. Similarly for each subsequent row i in M, i < n, there is an additional factor of (n − k) mod n for the position of occurrence of A.

We proceed to show that there are no other differences between adjacent rows other than those occurring in A. So suppose instead that M[i, j] = c and M[i + 1, j] = d for two distinct characters c, d ∈ Σ, 1 ≤ i < n and i(n − k) + 1 < j < i(n − k) mod n. Then the character d in row i + 1 is the same character in T c as the one at position j − (n − k) mod n in row i which must be the same character as the one at position j − (n − k) + (n − k) mod n in row i, namely a c, contradicting the claim.

Hence the only differences between adjacent rows occur at A which coincide with p s and the end of T c . In particular, A determines the lexicographic ordering of the rows of the matrix M.

The arithmetic progression of arithmetically progressed suffix arrays of Christoffel words allows us to determine the following factorizations of such words in constant time (without looking at the explicit characters of the word):

• Right factorization [6] ; originally called the standard factorization [12, 22] .

If w = uv is a Lyndon word with v its lexicographically least proper suffix, then u and v are also Lyndon words and u < v. Equivalently, the right factor v of the standard factorization can be defined to be the longest proper suffix of w that is a Lyndon word [3] .

• Left factorization [44, 46] . If w = uv is a Lyndon word with u a proper Lyndon prefix of maximal length, then v is also a Lyndon word and u < v.

• balanced 2 factorization [40] . A Lyndon word w is balanced 2 if w is a character or there is a factorization w = (u, v) that is simultaneously the left and the right factorization of w, and u and v are also balanced 2 .

For any of the three above factorizations w = (u, v), we will say that the factorization index is the index of the last character of u in w. The factorization index determines the split position and hence the factorization. Suppose T c has an arithmetically progressed suffix array SA Tc with progression ratio k ≥ 1 and split index s. Then T c is a balanced 2 Lyndon word with a factorization T c = T u T v such that |T u | = SA Tc [(s + 2) mod n]. For n = 1, 2 we have that T c ∈ {a, b, ab}, k = 1, and that the split index s is 1.

Proof. We apply the result that a word is a lower Christoffel word if and only if it is a balanced 2 Lyndon word [40] Finally, T v = T c [SA Tc [2] ..n] since otherwise we yield a contradiction to the right factorization that T v is the lexicographically least proper suffix. So j = SA Tc [(s + 2) mod n].

For example, the Christoffel word in Fig. 9 Case (1) Conditions for the factorization of a Lyndon word into exactly two nonoverlapping Lyndon factors are given in [16] , where overlapping factors have non-empty suffixes and prefixes sharing the same characters. We can geometrically consider the balanced 2 factorization of Christoffel words as follows:

Lemma 4.4. Let p, q ∈ N be coprime and T c be the lower Christoffel word characterized by p and q with factorization index i. Further, let the line segment S = ((0, 0), (p, q)) be associated with T c . Then for all points on the path determined by T c (apart from the end points), T c [i] has the shortest Euclidean distance to S.

Proof. Since T c is a Christoffel word it is also a balanced 2 Lyndon word. Let the balanced 2 factorization be T c = T u T v , where the factors have defining line segments S u , S v respectively, i.e., S u is the line from (0, 0) to the point s being the geometric representation of T c [1..|T u |], and S v is the line from s to (p, q), cf. Fig. 10 with S u = aabab and s = (3, 2) . Now assume that there exists a point r that geometrically represents a prefix T c [1..i] with the property that r does not have the shortest Euclidean distance to S. Then at least one of the paths for T u and T v must cross their associated line S u or S v , contradicting the geometric definition of a Christoffel word.

A binary string T is called a balanced word if for each character c ∈ {a, b}, the number of occurrences of c in U and in V differ by at most one, for all pairs of substrings U and V with |U| = |V| of the infinite concatenation T · T · · · of T. Lemma 4.5. Let T be a string over the binary alphabet {a, b} with an arithmetically progressed suffix array P = [p 1 , . . . , p n ] with ratio k. Given that p 1 = k − 1, T is balanced.

Proof. According to Theorem 3.3 for T the BWT defined on the suffix array and the BWT defined on the BWT matrix are identical. From Corollary 3.5 we know that the BWT of T has the shape b x a y . By [42, Thm. 2] for binary words the following conditions are equivalent:

1. There are two coprime numbers p and q such that BWT T = b p a q ; 2. T is balanced.

We conclude the proof by showing that x and y are co-prime. Since p 1 = k − 1 we have p 1 ∈ {1, n}. Using the results from Section 3.4 the number of a's in T is n − k −1 . Thus the number of b's is k −1 . As k −1 is co-prime to n, n − k −1 must be co-prime to k −1 .

The Fibonacci sequence is a sequence of binary strings {F m } m≥1 with F 1 := b, Lemma 4.6. The Fibonacci word F m for even m is given by

Proof. We use the following facts:

• The greatest common divisor of f i and f j is the Fibonacci number whose index is the greatest common divisor of i and j [47, Fibonacci numbers] . Hence, f m−1 and f m are coprime for every m ≥ 2.

• f 2 m−2 mod f m = 1 holds for every even m ≥ 3 [30] . Hence, If SA is arithmetically progressed with ratio k, then its split position must be p s = n − k − 1 according to Theorem 3.9. We show that k = f m−2 by provinḡ

in a way similar to [34, Lemma 8] . For that, letS of a binary string S ∈ {a, b} * denote S after exchanging a's and b's (i.e.,S = a ⇔ S = b). Further, let be the relation on strings such that S T if and only if S ≺ T and S is not a prefix of T. We need this relation since S T ⇐⇒S T while S ≺ T andS ≺T holds if S is a prefix of T. . By using one of the two above items we can show that these arithmetic progression steps yield a list of suffixes sorted in lexicographically ascending order.

In this section we extend our results of Section 3 to alphabets of arbitrary size. Let P = [p 1 , . . . , p n ] be an arithmetically progressed permutation with ratio k. Let Σ = {1, 2, . . . , σ} be an alphabet of size σ where the order is given by 1 < 2 < · · · < σ. To construct a string T over the alphabet Σ having P as its suffix array we proceed similarly to the construction presented in the previous sections: First we split P into subarrays S 1 , . . . , S σ , then for each subarray S i we assign the character i to each position p ∈ S i . When splitting P into subarrays there are some fixed positions where we are required to split P , while the remaining splitting positions can be chosen freely. Let σ min be the size of the minimal alphabet over which there is a string having P as its suffix array. Then there are σ min − 1 positions where we are required to split P . Those required splitting positions are after the following entries, modulo n:

if p 1 = n and k = n − 1. At the beginning of this paper we have looked at the strings over an alphabet of size σ having the suffix array [n, n − 1, . . . , 1]. The number of those strings is given by Theorem 2.2 and is bounded by a polynomial in n and σ. We extend upon that result: For an arbitrary permutation, which is not necessarily arithmetically progressed we give a bound on the number of strings with that permutation as suffix array. For a fixed arithmetically progressed permutation P we give an exact formula for the number of strings having P as their suffix array. We conclude that in total the number of strings having an arithmetically progressed suffix array is bounded by a polynomial in n and σ.

Lemma 5.2. Let Σ be an alphabet of size σ = |Σ|. Given a permutation P of length n there are at most n+σ−1 n strings of length n over the alphabet Σ having P as their suffix array.

Proof. Let the alphabet Σ be given by Σ = {1, . . . , σ} with the order 1 < 2 < · · · < σ. Let T be a string over Σ of length n whose suffix array is P . Permuting T by its suffix array P we obtain the string T with T [i] = T[P [i]]. As P is the suffix array of T we have T [i..n] ≺ T [j..n] and thus T [i] ≤ T [j] for all i < j. Therefore T is given by T = 1 t1 · · · σ tσ , where t i describes the number of occurrences of i in T . The problem of finding the number of strings with this particular shape can be reduced to the classic stars and bars problem [23, Chp. II, Sect. 5] with n stars and σ − 1 bars, yielding n+σ−1 n possible strings. As the permutation P is a bijection on Σ n there is only one string T that, permuted by P , gives the string T . Thus there are at most n+σ−1 n strings having P as suffix array. There may be less as it is possible that a string V, which is a pre-image of some T = 1 t1 · · · σ tσ under the permutation P , has a suffix array different to P . Lemma 5.3. Let Σ be an alphabet of size σ = |Σ|. Given a permutation P of length n, let σ min be the size of the smallest alphabet over which there exists a string having P as its suffix array. Then there are at most n+σ−1 σ−σmin strings of length n over the alphabet Σ having P as their suffix array.

Proof. As described at the beginning of this section, all strings over the alphabet Σ having P as suffix array can be constructed by splitting P into σ−1 subarrays. There are σ min − 1 positions where we are required to split P and σ − σ min splitting positions that can be chosen freely. We model the selection of the positions that can be chosen freely using the stars and bars problem [23, Chp. II, Sect. 5]. The stars represent the entries of the permutation, the bars the splitting positions. Assume we start with n + σ − 1 stars. Then σ − σ min bars (splitting positions) can be freely chosen. There are n+σ−1 σ−σmin ways to do this. Then we replace the stars at the σ min −1 required splitting positions with bars. This gives us n stars and σ −1 bars, which we map to a partition of P into σ subarrays.

Lemma 5.4. Let Σ be an alphabet of size σ = |Σ|. Then the number of strings of length n with an arithmetically progressed suffix array is at most p(n, σ) = n(n − 1) n+σ−1 σ−σmin . Proof. Each arithmetically progressed permutation P of length n can be described by two parameters: Its first entry P [1] a its ratio k. For P [1] there are n different options, for k there are at most n − 1 different options. Thus the number of arithmetically progressed permutations of length n is at most n(n − 1). By Lemma 5.3 the number of strings of length n having a specific permutation as suffix array is bounded by n+σ−1 σ−σmin . Putting those two facts together we obtain that the number of strings of length n with an arithmetically progressed suffix array is bounded by p(n) = n(n − 1) n+σ−1 σ−σmin .

Given a fixed alphabet Σ, by the above lemma, the number of strings of length n with arithmetically progressed suffix array is bounded by a polynomial in n and σ.

In this final section we overview some connections of suffix arrays with generalized forms of words and meta strings. A generalization of strings was proposed in the 1974 groundbreaking paper of Fischer and Paterson [25] , where string matching was considered in a more general setting than with the usual solid letters, whereby either string could have don't care 10 symbols, and was achieved in time nearly as fast as linear. Uncertain sequences, including indeterminate strings, have application in inexact matching tasks, for instance, allowing for errors such as with Web interface data entry and Internet search. They are also useful for expressing DNA polymorphisms, that is biological sequence positions that can have multiple possibilities and encoded with IUPAC 11 meta characters, for example N denotes any of the DNA nucleotides. A codon is a form of meta character whereby a sequence of three nucleotides encodes a specific amino acid and are used for protein expression; so a genetic code can be composed of concatenated codon units. The truncated generalized suffix automaton was introduced for indexing length-bounded k-factors of degenerate DNA and RNA sequences [26] . An elastic-degenerate string is a sequence of sets of strings used for succinctly representing a multiple alignment of a collection of closely related sequences (e.g. a pan-genome, that is all genes and genetic variation within a species), and also supports approximate pattern matching [4] . Sequence alignment is useful for inferring evolutionary relationships between biological sequences.

Daykin and Watson proposed a simple degenerate BWT, the D-BWT [21] , constructed by applying lex-extension order (cf. Example 6.1) to relabel the sets and order the degenerate strings in the D-BWT matrix. Subsequently in [18] , the D-BWT was applied to pattern matching using the backward search technique. This formalized and extended work implemented in [32] presenting a bioinformatics software tool, BWBBLE, for pattern matching on a pan-genome that they called a reference multi-genome.

In what follows, we first formally define indeterminate strings, and subsequently illustrate various approaches for defining an indeterminate suffix array: An indeterminate string 12 T I = T I [1..n] on an alphabet Σ is a sequence of nonempty subsets of Σ. Specifically, an indeterminate string T I has the form T I = t 1 · · · t n where each t i is a set of characters over Σ; a singleton is known as a solid letter . For example, a, b, e}{d}{c, a, b, e}{d}{d}{c, a, b , e, f, g} 10 Don't care symbols match with all characters. 11 International Union of Pure and Applied Chemistry 12 Also known as degenerate string in the literature, particularly in the field of microbiology.

is a ternary border-free indeterminate string of length 6, with a singleton {d}. Example 6.1. Let T I = {b}{b}{c,a}{b,d,a} be an indeterminate string, then SA T I can be alternatively defined by the following approaches:

• We can apply the lex-extension order by treating each character set as a string whose characters are sorted, and use the lexicographic order of these strings to sort the sets, resulting in ranked meta characters: • Finally, we can incorporate the suffix arrays of the individual sets treated as strings, so the above suffix array [1, 2, 4, 3] , with both indeterminate and solid letter positions, becomes

[(1, 1), (2, 1), ((4, 3), (4, 1), (4, 2)), ((3, 2), (3, 1))].

Clearly, there is a natural generalization of many types of well-known patterned strings over solid letters to the indeterminate form, such as indeterminate Fibonacci words, where the solid form can give a meta representation of the generalized form.

Given an arithmetically progressed permutation P with ratio k, we studied the minimum alphabet size and the shape of those strings having P as their suffix array. Only in the case P = [n, n − 1, . . . , 1], a unary alphabet suffices. For general P = [p 1 , . . . , p n ] = [n, n − 1, . . . , 1], there is exactly one such string on the binary alphabet if and only if p 1 ∈ {1, k + 1, n}. In all other cases, there is exactly one such string on the ternary alphabet. We conclude by proposing some research directions.

• A natural question arising from this research is to characterize strings having arithmetic progression properties for the run length exponents of their BWTs, particularly for the bijective [27] or extended BWT [37] , which are always invertible.

For example, given the arithmetically progressed permutation 3214, then the run-length compressed string a 3 c 2 $b 4 (a) matches the permutation 3214 and (b) is a BWT image because its inverse is b 2 cb 2 ca 3 $, which can be computed by the Last-First mapping. However, for the same permutation, a 3 b 2 $b 4 does not work since it is not a BWT image. Further examples of arithmetically progressed BWT exponents are: a 3 c 4 $b 2 , a 4 c 3 e 2 $d 6 b 5 , and a 4 c 3 e 2 $f 7 d 6 b 5 .

• Arithmetic properties can likewise be considered for the following stringology integer arrays:

-Firstly the longest common prefix (LCP) array LCP, whose entry LCP[i] is the length of the longest common prefix of the lexicographically i-th smallest suffix with its lexicographic predecessor for i ∈ [2..n].

-Given a string T ∈ Σ + of length n, the prefix -Integer prefix lists are more concise than prefix tables and give the lengths of overlapping LCPs of T and suffixes of T (cf. [14] ).

-The i-th entry of the Lyndon array λ = λ T [1.

.n] of a given string T = T[1.

.n] is the length of the longest Lyndon word that is a prefix of T [i..] -reverse engineering in [17] includes a linear-time test for whether an integer array is a Lyndon array. Likewise, the Lyndon factorization array F = F T [1.

.n] of T stores in its i-th entry the size of the Lyndon factorization (i.e., the number of factors) of the suffix T[i..n]. The problems are to characterize those arithmetic progressions that define a valid Lyndon array, respectively Lyndon factorization array. For example, consider the string T = azyx, then its Lyndon array is λ T = [4, 1, 1, 1], while the Lyndon factorization array is F T = [1, 3, 2, 1]. Trivially, for T = abc . . . z the Lyndon array is an arithmetic progression and likewise for the Lyndon factorization array of T = z t y t x t . . . a t for z > y > x > . . . > a.

• A challenging research direction is to consider arithmetic progressions for multi-dimensional suffix arrays and Fibonacci word sequences.

The Burrows-Wheeler Transform: Data compression, Suffix arrays, and Pattern matching

Applications of V-order: Suffix arrays, the Burrows-Wheeler transform & the FM-index

The standard factorization of Lyndon words: an average point of view

Approximate pattern matching on elastic-degenerate text

Sur une nouvelle espèce de calcul. Recueil pour les astronomes

Combinatorics on Words: Christoffel Words and Repetition in Words

Crochemore factorization of Sturmian and other infinite words

Inducing suffix and LCP arrays in external memory

Quelques mots sur la droite projective réelle

A block sorting lossless data compression algorithm

Higher compression from the Burrows-Wheeler transform by modified sorting

Free differential calculus, ivthe quotient groups of the lower central series

Simple algorithm for sorting the Fibonacci string rotations

Representing prefix and border tables: results on enumeration

Computing the Burrows-Wheeler transform in place and in small space

Combinatorics of unique maximal factorization families (UMFFs)

Reconstructing a string from its Lyndon arrays

Efficient pattern matching in degenerate strings with the Burrows-Wheeler transform

On arithmetically progressed suffix arrays

Indeterminate string factorizations and degenerate text transformations

Factorizing words over an ordered alphabet

An introduction to probability theory and its applications

Lightweight distributed suffix array construction

String matching and other products

Indexing factors in DNA/RNA sequences

A bijective string sorting transform

When a dollar makes a BWT

Pulse code communication

Composites and Primes Among Powers of Fibonacci Numbers increased or decreased by one

Lyndon words, permutations and trees

Short read alignment with populations of genomes

In-place bijective Burrows-Wheeler transformations

Arithmetics on suffix arrays of Fibonacci words

Combinatorics on Words. Cambridge Mathematical Library

Suffix arrays: A new method for on-line string searches

An extension of the Burrows-Wheeler transform

Suffix array and Lyndon factorization of a text

Burrows-Wheeler transform and Sturmian words

Visualisation de graphes et combinatoire des mots. Thèse d'habilitationà diriger les recherches

Linear suffix array construction by almost pure induced-sorting

Balanced words having simple burrows-wheeler transform

The structure of subword graphs and suffix trees of Fibonacci words

On bases for free Lie algebra

Words with simple Burrows-Wheeler transforms

Algèbres de Lie libres et monoïdes libres

Prime Numbers: The Most Mysterious Figures in Math

We thank Gabriele Fici for the initial help in guiding this research started at StringMasters.Funding: This research was part-funded by JSPS KAKENHI with grant number JP18F18120 and JP21K17701, and by the European Regional Development Fund through the Welsh Government [Grant Number 80761-AU-137 (West)]: