authors: Lohrey, Markus
title: Balancing Straight-Line Programs for Strings and Trees
date: 2020-06-24
journal: Beyond the Horizon of Computability
DOI: 10.1007/978-3-030-51466-2_26

The talk will explain a recent balancing result according to which a context-free grammar in Chomsky normal form of size m that produces a single string w of length n (such a grammar is also called a straight-line program) can be transformed in linear time into a context-free grammar in Chomsky normal form for w of size O(m), whose unique derivation tree has depth O(log n). This solves an open problem in the area of grammar-based compression, improves many results in this area, and greatly simplifies many existing constructions. Similar balancing results can be formulated for various grammar-based tree compression formalisms such as top DAGs and forest straight-line programs. The talk is based on joint work with Moses Ganardi and Artur Jeż. An extended abstract appeared in [11]; a long version of the paper can be found in [12].

In grammar-based compression a combinatorial object (e.g. a string or a tree) is compactly represented using a grammar of an appropriate type. In many grammar-based compression formalisms such a grammar can be exponentially smaller than the object itself. A well-studied instance of this general idea is grammar-based string compression using context-free grammars that each produce exactly one string; such grammars are also known as straight-line programs, SLPs for short. Formally, we define an SLP as a context-free grammar G in Chomsky normal form such that (i) G is acyclic and (ii) for every nonterminal A there is exactly one production with A on the left-hand side. The size |G| of the SLP G is the number of nonterminals of G, and the depth of G (depth(G) for short) is the height of the unique derivation tree of G. Since we assume that G is in Chomsky normal form, we have depth(G) ≥ log n, where n is the length of the string produced by G.

The goal of grammar-based string compression is to compute from a given string an SLP of small size. Grammar-based string compression is tightly related to dictionary-based compression: the famous LZ78 algorithm can be viewed as a particular grammar-based compressor. Moreover, the number of phrases in the LZ77-factorization is a lower bound on the size of the smallest SLP for a string [24], and an LZ77-factorization of length m can be converted into an SLP of size O(m · log n), where n is the length of the string [8, 18, 19, 24]. For various other aspects of grammar-based string compression see [8, 20].

An advantage of grammar-based compression is that SLPs are well-suited for further algorithmic processing. There is an extensive body of literature on algorithms for SLP-compressed strings; see e.g. [20] for a survey. For many of these algorithms, the depth of the input SLP is an important parameter. Let us give a simple example: in the random access problem for an SLP-compressed string s, an SLP G for s is given. Let n be the length of s. The goal is to produce from G a data structure that allows us to compute, for a given position i (1 ≤ i ≤ n), the i-th symbol of s.
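Before turning to the solution of the random access problem, the following minimal Python sketch may help to make the SLP definition above concrete (the dictionary representation and all identifiers are illustrative choices, not taken from the paper). It also shows the classic example of exponential compression: m nonterminals suffice for a string of length 2^(m-1).

```python
# Minimal illustrative SLP representation (not from the paper).
# An SLP in Chomsky normal form maps every nonterminal either to a
# pair of nonterminals (A -> BC) or to a single terminal (A -> a).

def expand(slp, start):
    """Return the unique string derived from `start` (may be exponentially long)."""
    rhs = slp[start]
    if isinstance(rhs, str):            # terminal production A -> a
        return rhs
    left, right = rhs                   # binary production A -> BC
    return expand(slp, left) + expand(slp, right)

def depth(slp, start):
    """Height of the unique derivation tree rooted at `start`."""
    rhs = slp[start]
    if isinstance(rhs, str):
        return 1
    return 1 + max(depth(slp, rhs[0]), depth(slp, rhs[1]))

# m = 5 nonterminals produce the string a^16 (in general: length 2^(m-1)):
slp = {"A0": "a", "A1": ("A0", "A0"), "A2": ("A1", "A1"),
       "A3": ("A2", "A2"), "A4": ("A3", "A3")}
assert expand(slp, "A4") == "a" * 16
assert depth(slp, "A4") == 5            # >= log(16) = 4, as noted above
```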
As observed in [6], the random access problem can be solved in time O(depth(G)) (assuming that arithmetic operations on numbers from the interval [0, n] need constant time): in the preprocessing phase one computes for every nonterminal A of G the length n_A of the string produced from A; this takes time O(|G|) and produces a data structure of size O(|G|) (assuming that a number from [0, n] fits into a memory location). In the query phase, one computes for a given position i ∈ [1, n] the path from the root to the i-th leaf in the derivation tree of G. Only the current nonterminal A and a relative position in the string produced from A have to be stored. Using the pre-computed numbers n_A, the whole computation needs time O(depth(G)); a concrete sketch of this procedure is given at the end of this passage. Recall that depth(G) ≥ log n.

In [6] it is shown that one can compute from G a data structure of size O(|G|) which allows one to access every position in time O(log n), irrespective of the depth of G. The algorithm in [6] is quite complicated and uses several sophisticated data structures. An alternative approach for obtaining access time O(log n) is to balance the input SLP G in the preprocessing phase, i.e., to reduce its depth. This is the approach that we will follow.

It is straightforward to show that any string s of length n can be produced by an SLP of size O(n) and depth O(log n). A more difficult problem is to balance a given SLP: assume that the SLP G produces a string of length n. Several authors have shown that one can restructure G in time O(|G| · log n) into an equivalent SLP H of size O(|G| · log n) and depth O(log n) [8, 19, 24]. Applied to the random access problem, these balancing procedures would yield access time O(log n) at the cost of building a data structure of size O(|G| · log n) during the preprocessing. Our main result shows that SLP balancing is in fact possible with a constant blow-up in SLP-size:

Theorem 1. From an SLP G of size m for a string w of length n one can construct in linear time an SLP for w of size O(m) and depth O(log n).

As a corollary we obtain a very simple and clean algorithm for the random access problem with access time O(log n) that uses a data structure of size O(m) (in words of bit length log n). We can also obtain an algorithm for the random access problem with access time O(log n / log log n) using a data structure of O(m · log^ε n) words for any ε > 0; previously this bound was only shown for SLPs of height O(log n) [1].

The paper [11] contains a list of further applications of Theorem 1, which include the following problems on SLP-compressed strings: rank and select queries [1], subsequence matching [2], computing Karp-Rabin fingerprints [4], computing runs, squares, and palindromes [17], real-time traversal [14, 23], and range-minimum queries [15]. In all these applications we either improve existing results or significantly simplify existing proofs by replacing depth(G) with O(log n) in time/space bounds.

The proof of Theorem 1 consists of two main steps. Take an SLP G for the string s of length n and let m be the size of G. We consider the derivation tree t of G; it has size O(n). The SLP G can be viewed as a DAG for t of size m. We decompose this DAG into node-disjoint paths such that each path from the root to a leaf intersects O(log n) paths of the decomposition. Each path of the decomposition is then viewed as a string of integer-weighted symbols, where the weights are the lengths of the strings derived from the nodes that branch off from the path. For this weighted string we construct an SLP of linear size that produces all suffixes of the path in a certain weight-balanced way. Plugging these SLPs together yields the final balanced SLP.
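To make the random access procedure from [6] described above concrete, here is a Python sketch in the same illustrative representation as before: the preprocessing computes the lengths n_A, and a query descends a single root-to-leaf path of the derivation tree, which takes time O(depth(G)).

```python
# Illustrative sketch of the O(depth(G)) random access procedure from [6]
# (same SLP representation as in the previous sketch; names are not from the paper).

def lengths(slp):
    """Preprocessing: compute n_A = length of the string derived from A."""
    n = {}
    def length(a):
        if a not in n:
            rhs = slp[a]
            n[a] = 1 if isinstance(rhs, str) else length(rhs[0]) + length(rhs[1])
        return n[a]
    for a in slp:
        length(a)
    return n

def access(slp, n, start, i):
    """Query: return the i-th symbol (1-based) of the string derived from `start`.
    Only the current nonterminal and a relative position are stored."""
    a = start
    while True:
        rhs = slp[a]
        if isinstance(rhs, str):        # reached a leaf of the derivation tree
            return rhs
        left, right = rhs
        if i <= n[left]:                # the i-th symbol lies below the left child
            a = left
        else:                           # otherwise descend right, shifting i
            i -= n[left]
            a = right

slp = {"A0": "a", "B0": "b", "A1": ("A0", "B0"), "A2": ("A1", "A1")}
n = lengths(slp)                        # A2 derives "abab"
assert access(slp, n, "A2", 3) == "a"
```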
Some of the ideas in our algorithm can be traced back to the area of parallel algorithms: the path decomposition for DAGs is related to the centroid path decomposition of trees [9], where it is the key technique in several parallel algorithms on trees. Moreover, the SLP of linear size that produces all suffixes of a weighted string can be seen as a weight-balanced version of the optimal prefix sum algorithm.

Our balancing procedure involves simple arithmetic on string positions, i.e., on numbers of order n. Therefore we need machine words of bit length Ω(log n) in order to achieve a linear running time in Theorem 1; otherwise the running time increases by a multiplicative factor of order log n. Note that such an assumption is realistic and standard in the field: machine words of bit length Ω(log n) are needed, say, for indexing positions in the represented string. On the other hand, our procedure works in the pointer model regime.

Grammar-based compression has been generalized from strings to ordered node-labelled trees. In fact, the representation of a tree t by its smallest directed acyclic graph (DAG) is a form of grammar-based tree compression. This DAG is obtained by merging all nodes at which the same subtree of t is rooted (a small sketch of this merging step is given below); it can be seen as a regular tree grammar that produces only t. A drawback of DAG-compression is that the size of the DAG is lower-bounded by the height of the tree t. Hence, for deep narrow trees (like, for instance, caterpillar trees), the DAG-representation cannot achieve good compression. This can be overcome by representing a tree t by a linear context-free tree grammar that produces only t. Such grammars are also known as tree straight-line programs in the case of ranked trees (where the number of children of a node is uniquely determined by the node label) [7, 21, 22] and forest straight-line programs in the case of unranked trees [13]. The latter are tightly related to top DAGs [3, 5, 10, 13, 16], which are another tree compression formalism, also akin to grammars. Our balancing technique works similarly for these compression formalisms:

Theorem 2. From a forest straight-line program (resp. top DAG) G of size m for a tree t of size n one can construct in linear time a forest straight-line program (resp. top DAG) for t of size O(m) and depth O(log n).

For top DAGs, this solves an open problem from [5], where it was shown that from an unranked tree t of size n, whose minimal DAG has size m (measured in the number of edges of the DAG), one can construct in linear time a top DAG for t of size O(m · log n) and depth O(log n). It remained open whether the additional factor log n in the size bound can be avoided. A negative answer for the specific top DAG constructed in [5] was given in [3]. On the other hand, Theorem 2 yields another top DAG of size O(m) and depth O(log n): to see this, note that one can easily convert the minimal DAG of t into a top DAG of roughly the same size, which can then be balanced. This also gives an alternative proof of a result from [10], according to which one can construct in linear time a top DAG of size O(n / log_σ n) and depth O(log n) for a given tree of size n containing σ many different node labels.

Let us finally mention that Theorems 1 and 2 are instances of a general balancing result that applies to a large class of circuits over algebraic structures; see [12] for details.
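As an illustration of the minimal DAG mentioned above, the following Python sketch merges all nodes at which identical subtrees are rooted, using simple hash-consing (this is an illustrative implementation of that merging step, not a construction from the paper).

```python
# Illustrative hash-consing sketch: compute the minimal DAG of an
# ordered node-labelled tree by merging identical subtrees.

def minimal_dag(tree, dag, ids):
    """tree = (label, [children]); returns the DAG node id of `tree`.
    `dag` maps node ids to (label, tuple of child ids); `ids` memoizes subtrees."""
    label, children = tree
    key = (label, tuple(minimal_dag(c, dag, ids) for c in children))
    if key not in ids:                  # first occurrence of this subtree
        ids[key] = len(dag)
        dag[ids[key]] = key
    return ids[key]

# Example tree f(a, f(a, a)): all three occurrences of the subtree a are merged.
t = ("f", [("a", []), ("f", [("a", []), ("a", [])])])
dag, ids = {}, {}
root = minimal_dag(t, dag, ids)
assert len(dag) == 3                    # distinct subtrees: a, f(a,a), f(a,f(a,a))
```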
References

[1] Access, rank, and select in grammar-compressed strings
[2] Compressed subsequence matching and packed tree coloring
[3] Tight bounds for top tree compression
[4] Fingerprints in compressed strings
[5] Tree compression with top trees
[6] Random access to grammar-compressed strings and trees
[7] Efficient memory representation of XML document trees
[8] The smallest grammar problem
[9] The accelerated centroid decomposition technique for optimal parallel tree evaluation in logarithmic time
[10] Slowing down top trees for better worst-case compression
[11] Balancing straight-line programs
[12] Balancing straight-line programs
[13] Grammar-based compression of unranked trees
[14] Real-time traversal in grammar-based compressed files
[15] Compressed range minimum queries
[16] Tree compression with top trees revisited
[17] Detecting regularities on grammar-compressed strings
[18] Approximation of grammar-based compression via recompression
[19] A really simple approximation of smallest grammar
[20] Algorithmics on SLP-compressed strings: a survey
[21] Grammar-based tree compression
[22] XML tree structure compression using RePair
[23] Constant-time tree traversal and subtree equality check for grammar-compressed trees
[24] Application of Lempel-Ziv factorization to the approximation of grammar-based compression