key: cord-0613173-8hy1vp8r
authors: Cenzato, Davide; Lipt'ak, Zsuzsanna
title: A theoretical and experimental analysis of BWT variants for string collections
date: 2022-02-26
journal: nan
DOI: nan
sha: 22fb3bfae8c53e3cfc5cfd960460a431b5e15941
doc_id: 613173
cord_uid: 8hy1vp8r

The extended Burrows-Wheeler-Transform (eBWT), introduced by Mantaci et al. [Theor. Comput. Sci., 2007], is a generalization of the Burrows-Wheeler-Transform (BWT) to multisets of strings. While the original BWT is based on the lexicographic order, the eBWT uses the omega-order, which differs from the lexicographic order in important ways. A number of tools are available that compute the BWT of string collections; however, the data structures they generate in most cases differ from the one originally defined, as well as from each other. In this paper, we review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on several real-life datasets with different characteristics. We find that the differences can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences. The widely-used parameter $r$, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to $4.2$.

The Burrows-Wheeler-Transform [9] (BWT) is a fundamental string transformation which is at the heart of many modern compressed data structures for text processing, in particular in bioinformatics [31, 34, 32] . With the increasing availability of low-cost high-throughput sequencing technologies, the focus has moved from single strings to large string collections, such as the 1000 Genomes project [ . This has led to a widespread use of compressed data structures for string collections.

Concurrently, ever increasing text sizes have been driving a trend towards ever smaller data structures. The size of BWT-based data structures is typically measured in the number of runs (maximal substrings consisting of the same letter) of the BWT, commonly denoted r. This parameter r has become fundamental as a measure of storage space required by such all methods are equivalent.

In 2007, Mantaci et al. [40] introduced the extended Burrows-Wheeler-Transform (eBWT), which generalizes the BWT to a multiset of strings. The eBWT, like the BWT, is reversible; moreover, it is independent of the order in which the strings in the collection are presented. This is not true of any of the other methods mentioned above. Note that the eBWT differs from the BWT in several ways, most importantly in the order relation for sorting conjugates: while the BWT uses lexicographic order, the eBWT uses the so-called omega-order. (For precise definitions, see Section 2.)

The only tool up to date that computes the eBWT according to the original definition is pfpebwt [6]; all other tools append an end-of-string character to the input strings, explicitly or implicitly, and as a consequence, the resulting data structures differ from the one defined in [40] . Moreover, the output in most cases depends on the input order of the sequences (except for [13] and, using a specific option, [33]). As a further complication, the exact nature of this dependence differs from one data structure to another, as we will show in this paper.

The result is that the BWT variants computed by different tools on the same dataset, or by the same tool on the same dataset but given in a different order, may vary considerably. As we will show, this variability extends to the parameter r, the number of runs of the BWT. This is all the more important given the fact that r (and the related parameter n/r, the average length of a run) is increasingly being used as a parameter characterizing the dataset itself, namely as a measure of its repetitiveness (see e.g. [11, 2, 7] ). either, since its length can vary-as opposed to all other BWT variants we review, the xBWT is not a permutation of the input characters but can be shorter, due to the fact that it first maps the input to a tree and then applies the xBWT to it, a BWT-like index for labeled trees, rather than for strings. Likewise, the tool [45] for reference-free xBWT is not included in this review: even though it does not require a reference sequence, it, too, computes the xBWT, which is a data structure that does not fall within the category we focus on.

Finally, we did not include SPRING [10], a reference-free compressor for FASTQ and FASTA files: even though it uses the BWT during computation, it does not output it.

We give the necessary definitions in Section 2; note that we assume familiarity of the reader with the Burrows-Wheeler-Transform. In Section 3, we present the BWT variants and analyse their differences. In Section 4 we discuss the effects on the repetitiveness measure r, while our experimental results are presented in Section 5. We give some conclusions from our study in Section 6. The full tables with detailed results on all eight datasets can be found in the Appendix.

Let Σ be a finite ordered alphabet of size σ. We use the notation T = T [1. Every string T can be written uniquely as T = U m , where U is primitive; we refer to U as root(T ) and to m as exp(T ), i.e., T = root(T ) exp(T ) . A run in string T is a maximal substring consisting of the same character; we denote by runs(T ) the number of runs of T . Often, an end-of-string character (usually denoted $) is appended to the end of T ; this character is not element of Σ and is assumed to be smaller than all characters from Σ. Note that appending a $ makes any string primitive.

For two strings S, T , the (unit-cost) edit distance dist edit (S, T ) is defined as the minimum number of operations necessary to transform S into T , where an operation can be deletion or insertion of a character, or substitution of a character by another. The Hamming distance

The lexicographic order on Σ * is defined by 

Given a string T = T [1.

.n] over Σ, the Burrows-Wheeler-Transform [9], BWT(T ), is a permutation of the characters of T , given by concatenating the last characters of the lexicographically sorted conjugates of T . The number of runs of the BWT of string T is denoted r(T ), i.e. r(T ) = runs(BWT(T )). To make the BWT uniquely reversible, one can add an index to it marking the lexicographic rank of the conjugate in input. For example, BWT(banana) = nnbaaa, hence r(banana) = 3, and the index 4 specifies that the input was the 4th conjugate in lexicographic order. Alternatively, one adds a $ to the end of T , which makes the input unique: BWT(banana$) = annb$aa. Note that BWT with and without end-of-string symbol can be quite different.

Next we define the omega-order [40] on Σ * : S ≺ ω T if root(S) = root(T ) and exp(S) < exp(T ), or if S ω < lex T ω (implying root(S) = root(T )), where T ω denotes the infinite string obtained by concatenating T infinitely many times. The omega-order relation coincides with the lexicographic order if neither of the two strings is a proper prefix of the other. The two orders can differ otherwise, e.g. GT < lex GTC but GTC ≺ ω GT.

Given a multiset of strings M = {T 1 , . . . , T k }, the extended Burrows-Wheeler-Transform [40], eBWT(M), is a permutation of the characters of the strings in M, given by concatenating the last characters of the conjugates of each T i , for i = 1, . . . , k, listed in omega-order. For example, the omega-sorted conjugates of M = {GTC, GT} are: CGT, GTC, GT, TCG, TG, hence, eBWT(M) = TCTGG. Again, adding the indices of the input conjugates makes the eBWT uniquely reversible, in this case 2, 3.

We identified five distinct transforms, which we list below, that were computed by the programs listed in Table 1 . Let M = {T 1 , . . . , T k } be a multiset of strings, with total length

Since several of the data structures depend on the order in which the strings are listed, we implicitly regard M as a list [T 1 , . . . , T k ], and write (M, π) for a specific permutation π in which the strings are presented.

("concatenated BWT") 5. colexBWT(M) = mdolBWT(M, γ), where γ is the permutation corresponding to the colexicographic ('reverse lexicographic') order of the strings in M Because all BWT variants except the eBWT use additional end-of-string symbols as string separators, we refer to these four by the collective term separator-based BWT variants. In Table 2 we show the five data structures on our running example of 5 DNA-strings, and give first properties of these data structures. For ease of exposition and comparison, we replaced all separator-symbols by the same dollar-sign $, even where, conceptually or concretely, different dollar-signs are assumed to terminate the individual strings, as is the case for mdolBWT. Moreover, the concBWT contains one additional character, the final Table 2 Overview of properties of the five BWT variants considered in this paper. The colors in the example BWTs correspond to interesting intervals in separator-based variants, see Section 3.2. C V I T 2 0 2 2 1:6 An analysis of BWT variants of string collections end-of-string symbol, here denoted by #, which is smaller than all other characters; thus, the additional rotation starting with # is the smallest and results in an additional dollar in the first position of the transform. For ease of comparison, we remove this first symbol from concBWT and replace the # by $.

It is important to point out that the programs listed in Table 1 do not necessarily use the definitions given here; however, in each case, the resulting transform is the one claimed, up to renaming or removing separator characters, see Section 3.1 and 3.2.

The first obvious difference between the eBWT and the separator-based variants is their length: eBWT(M) has length N M , while all other variants have length N M + k, since they contain an additional character (the separator) for each input string.

In the four separator-based transforms, the k-length prefix consists of a permutation of the last characters of the input strings. This is because the rotations starting with the dollars are the first k lexicographically; in the eBWT, these k characters occur interspersed with the rest of the transform; namely in the positions corresponding to the omega-ranks of the input strings T i (see Table 2 ).

The next point is that adding a $ to the end of the strings introduces a distinction, not present in the eBWT, between suffixes and other substrings: since the separators are smaller than all other characters, occurrences of a substring as suffix will be listed en bloc before all other occurrences of the same substring. On the other hand, in the eBWT, these occurrences will be listed interspersed with the other occurrences of the same substring. Finally, it should be noted that adding end-of-string symbols to the input strings changes the definition of the order applied. As observed above, the omega-order coincides with the lexicographic order on all pairs of strings S, T where neither is a proper prefix of the other; but with end-of-strings characters, no input string can be a proper prefix of another. Thus, on rotations of the T i $'s, the omega-order equals the lexicographic order. As an example, consider the multiset M = {GTC$, GT$} from Section 2: we have the following omega-order among the rotations: $GT, $GTC, C$GT, GT$, GTC$, T$G, TC$G, which coincides with the lexicographic order. Similarly, adding different dollars $ 1 , $ 2 , . . . , $ k and applying the omega-order results again in the lexicographic order between the rotations, with different dollar symbols considered as distinct characters. Indeed, if we append a different dollar-sign to each input string, then the omega-order, the lexicographic order, and the order of the suffixes of the concatenated string (i.e. our mdolBWT) are all equivalent.

Regarding the differences among the four separator-based BWT variants, we will show that all differences occur in certain well-defined intervals of the BWT, and that the differences themselves depend only on a specific permutation of {1, . . . , k}, given by the combination of the input order, the lexicographic order of the input strings, and the BWT variant applied. In Tables 3 and 4 we give the full BWT matrices for all five BWT variants, along with the optimal one minimizing the number of runs, see Section 4.

Let us call a string U a shared suffix w.r.t. multiset M if it is the suffix of at least two strings in M. Let b be the lexicographic rank of the smallest rotation beginning with U $ and e index mdol rotation U is a suffix of both T i and T j , and the preceding characters in T i and T j are different, i.e., the two occurrences of U as suffix of T i and T j constitute a left-maximal repeat. 1 Note that [1, k] is always an interesting interval, unless all strings end with the same character.

Proof. Follows immediately from the fact that no two distinct substrings ending in $ can be one prefix of the other.

We can now narrow down the differences between any two separator-based BWTs of the same multiset. The next proposition states that these can only occur in interesting intervals (part 1). This implies that the dollar-symbols appear in the same positions in all separator-based variants except for one very specific case (part 2). Moreover, we get an upper bound on the Hamming distance between two separator-based BWTs (part 3).

Proposition 3. Let L 1 and L 2 be two separator-based BWTs of the same multiset M. Table 4 From left to right we show the eBWT the colexBWT and the optimal BWT of the string collection M = {ATATG, TGA, ACG, ATCA, GGA}, see Section 4.

Let I 1 and I 2 be the positions of the dollars in L 1 resp. L 2 . If I 1 = I 2 then there exist

Since all separator-based BWT variants use the lexicographical order of the rotations, this means that there exists a substring U which is preceded by x in one string T j and by y in another T j , the first occurrence has rank i in one BWT and the other has rank i in the other BWT variant. This implies that the two occurrences are followed by two dollars, and either the two dollars are different, or they are the same dollar, and the subsequent substrings are different. Therefore, U defines an interesting interval. Parts 2. and 3. follow from 1.

Proposition 3 implies that the variation of the different transforms can be explained based solely on what rule is used to break ties for shared suffixes. We see next how the different BWT variants determine this tie-breaking rule.

Let us now restrict ourselves to M being a set, i.e., no string occurs more than once. (This is just for convenience since now the input order uniquely defines a permutation w.r.t. lexicographic order; the results of this section apply equally to multisets M.) As we showed in the previous subsection, the only differences between the different separator-based BWT variants are given by the order in which shared suffixes are listed. It is also clear that the same order applies in each interesting interval, including the k-length prefix of the transform. Therefore, it suffices to study the permutation π of the k dollars in this prefix. Since the strings are all distinct, they each have a unique lexicographic rank within the set M, thus the permutation of the k dollars can be given w.r.t. the lexicographic order. Let us refer to these permutations by the dolEBWT, mdolBWT, concBWT, and colexBWT by π de , π md , π conc , and π colex , respectively. Now, also the input order can be seen as a permutation ρ of the lexicographic order 2 ; if the strings are input in lexicographic order, then ρ = id. For our toy example M = [ATATG, TGA, ACG, ATCA, GGA], we have ρ = 25134.

It is easy to see that the permutation π md is equal to ρ, since the dollar-symbols are ordered according to ρ. For the dolEBWT, the rank of $T i equals the lexicographic rank of T i among all input strings, i.e., π de = ρ −1 . Further, π colex = γ by definition, where γ denotes the colexicographic order of the input strings. The situation is more complex in the case of concBWT. Since the # is the smallest character, the last string of the input will be the first, while for the others, the lexicographic rank of the following string decides the order. In our running example, π conc = 45132. We next formalize this.

Let Φ ρ be the linking permutation (1), the permutation that maps each element to the element in the next position and the last element to the first. Let us also define, for j ∈ {1, . . . , k} and i = j,

Lemma 4. Let ρ be the permutation of the input order w.r.t. the lexicographic order, i.e. the ith input string has lexicographic rank ρ(i). Then π conc = π conc (ρ) is given by:

Proof. Follows straightforwardly from the tie-breaking rule of concBWT.

Example 5. The mapping ρ → π conc for k = 3 is as follows: 123 → 312, 132 → 231, 312 → 231, 213 → 321, 231 → 132, and 321 → 123. Note that no ρ maps to 213.

As can be seen already for k = 3, not all permutations π are reached by this mapping. We will call a permutation π feasible if there exists an input order ρ such that π conc (ρ) = π. For k = 4, there are 18 feasible permutations (out of 24), for k = 5, 82 (out of 120). In Table 5 , we give the percentage of feasible permutations π, for k up to 11. The lexicographic order is always feasible, namely with ρ = k, k − 1, . . . , 2, 1; however, the colex order is not always feasible, as the following example shows.

Example 6. Let M = {GAA, ACA, TGA}, thus γ = 213, but as we have seen, no permutation will yield this order for concBWT. In addition, the colexBWT(M) = AAAACGG$AT$$ has 7 runs, while all feasible ones have at least 8: AAAGACG$AT$$, AAACGAG$AT$$, AAAAGCG$AT$$, AAAGCAG$AT$$, AAACAGG$AT$$. An important consequence is that the permutations induced by mdolBWT and concBWT are always differnt: π md = π conc holds always, since π(1) = ρ(k). This means that, in whatever order the strings are given w.r.t. lexicographic order, on most string sets the resulting transforms mdolBWT and concBWT will differ.

What is the effect of the different permutations π of the strings in M, induced by these BWT variants, on the number of runs of the BWT? As the following example shows, the number of runs can differ significantly between different variants. The results of Section 3 give us a method to measure the degree to which the BWT variants can differ. Proof. Place the n a a-characters in a row, creating n a + 1 gaps, namely one between each adjacent a, and one each at the beginning and at the end. Now place all b-characters, each in a different gap; since n a is maximum, there are enough gaps. Then place all c's, first filling gaps that are still empty, if any, then into gaps without c, etc. We never have to place two identical characters in the same gap. If the total number of non-a-characters is at least than n a − 1, then we can fill every gap, thus separating all a's, and creating a run for every character of I. If we have fewer than n a − 1 characters, then we are still creating two runs with each non-a-character, but we cannot separate all a's.

We will use this lemma to measure the variability of a dataset: 

.

Which of the BWT variants produces the fewest runs? As we have shown, this depends on the input order with most BWT variants, and the only possible variation is within interesting intervals. The colexBWT has been shown experimentally to yield a low number of runs of the BWT [33, 12]. Even though it does not always minimize r (one can easily create small examples where other permutations yield a lower number of runs), we can bound its distance from the optimum. Proof. Let I = [b I , e I ] be an interesting interval containing d distinct characters, and let U be the shared suffix defining I. Since the strings are listed according to the colex order, all strings in which U is preceded by the same character will appear in one block, and therefore, L has exactly d runs in the interval I. Let L b I −1 = x and L e I +1 = y. If x occurs in I and it is not the first run of I (i.e., L b I = x), then listing first the strings where U is preceded by x would reduce the number of runs by 1; similarly, listing those where y precedes U as last of the group would reduce the number of runs by 1. By Prop. 3, this is the only possibility for varying the number of runs.

Bentley, Gibney, and Thankachan recently gave a linear-time algorithm for computing the order of the dollars which minimizes the number of runs [4], i.e. the optimal order for mdolBWT. The idea is, in effect, to start from the colex-order and then adjust, where possible, the order of the runs within interesting intervals in order to minimize character changes at the borders, i.e. such that the first and the last run of each interesting interval is identical to the run preceding and following that interesting interval. This is equivalent to sorting groups of sequences sharing the same left-maximal suffix. This sorting can be done on each interesting interval independently without affecting the other interesting intervals.

In Table 4 , we show the result on our toy example, where it reduces the number of runs by 2 w.r.t. colex order. We implemented an algorithm that computes the number of optimal runs according to the method of [4] and applied it to our datasets. In the next section, we compare the number of runs of each of the five BWT variants to the optimum.

We . The main features of the datasets, including the number of sequences, sequence length, and the mean runlength of the optimal BWT are reported in Table 6 . We include the details of the experiment setup in the Appendix.

On each of the datasets, we computed the pairwise Hamming distance between separatorbased BWTs. To compare them to the eBWT, we computed the pairwise edit distance on a small subset of the sequences (for obvious computational reasons), computing also the Hamming distance on the small set, for comparison. We generated some statistics on each of the data sets: the number of interesting intervals, the fraction of positions within interesting intervals (total length of interesting intervals divided by total length of the dataset), and the dataset's variability (Def. 9). To study the variation of the r-parameter, we implemented an algorithm for the optimal input order according to Bentley et al. [4] and computed r OP T for each data set, comparing it to the number of runs of all five BWT variants. In Table 8 and 9, we include a compact version of these results for the two datasets with the highest and the lowest variation between the BWT variants, the SARS-CoV-2 short sequences and the SARS-CoV-2 genomes, respectively. The full experimental results for all eight datasets are contained in the Appendix.

In Table 7 we give a brief summary of these results, reporting, for each dataset, the fraction of positions in interesting intervals, the dataset's variability, the average pairwise Hamming distance between separator-based BWT variants, and the maximum and minimum value, among the five BWT variants, of the average runlength n/r of the BWT.

The experiments showed a high variation in the number of runs in particular on datasets of short sequences. The highest difference was between colexBWT and concBWT, by a multiplicative factor of over 4.2, on the SARS-CoV-2 short dataset. In Figure 1 we plot the average runlength n/r for the four short sequence datasets, and the percentage increase of the number of runs w.r.t. r OP T . The variation is less pronounced on the one dataset which is less repetitive, namely Simons Diversity reads. On most long sequence datasets, on the other hand, the differences were quite small (see Appendix). Recall also that the mdolBWT and concBWT vary depending on the input permutation. To better understand how far the colexBWT is from the optimum w.r.t. the number of runs, we plot in Figure 2 the number of runs of colexBWT w.r.t. to r OP T , on all eight datasets. The strongest increase is on short sequences, where the variation among all BWT variants is high, as well. On the long sequence datasets, with the exception of SARS-CoV-2 long sequences, the colexBWT is very close to the optimum; however, note that on those datasets, all BWTs are close to the optimum.

The average number of runs and the average pairwise Hamming distance strongly depend on the length of the sequences in the input collection. If the collection has a lot of short sequences which are very similar, then the differences between the BWTs both w.r.t. the number of runs, and as measured by the Hamming distance, can be large. This is because there are a lot of maximal shared suffixes and so, many positions are in interesting intervals. To better understand this relationship, we plotted, in Figure 3 , the average Hamming distance against the two parameters variability and fraction of positions in interesting intervals. We see that the two datasets with highest average Hamming distance, SARS-CoV-2 short dataset and the Simons Diversity reads, have at least one of the two values very close to 1, while for those datasets where both values are very low, the BWT variants do not differ very much. Table 6 Table summarizing the main parameters of the eight datasets we used for running the experiments. From left to right we report the datasets names, the number of sequences, the total length, the average, maximum and minimum sequence length and the optimum mean runlength (n/r). From left to right we report dataset names followed by the ratio of positions in interesting intervals, the variability of the dataset (see Def. 9), the average normalized Hamming distance between any two separator-based BWT variants. In the last two columns we report the maximum and minimum runlength (n/r) taken over all five BWT variants. 

We presented the first study of different variants of the Burrows-Wheeler-Transform for string collections. We found that the data structures computed by different tools differ not insignificantly, as measured by the pairwise Hamming distance: up to 12% between different BWT variants on the same dataset in our experiments. We showed that most BWT variants in use are input order dependent, so the same tool can produce different variants if the input set is permuted. These differences extend also to the number of runs r, a parameter that is central in the analysis of BWT-based data structures, and which is increasingly being used as a measure of the repetitiveness of the dataset itself.

With string collections replacing individual sequences as the prime object of research and analysis, and thus becoming the standard input for text indexing algorithms, we believe that it is all the more important for users and researchers to be aware that not all methods are equivalent, and to understand the precise nature of the BWT variant produced by a particular tool. We suggest further to standardize the definition of the parameter r for string collections, using either the colexicographic order or the optimal order of Bentley et al. [4] .

All datasets are stored in FASTA format. We used three tools for computing the five BWT variants; pfpebwt, ropebwt2 and Big-BWT. In order to make the BWTs comparable we did some adaptations to both tools and inputs. We modified ropebwt2 to make it work with the same character order as the other tools, i.e. $ < A < C < G < N < T. Then we used ropebwt2 for computing both the mdolBWT and the colexBWT using the -R and -R -s flags respectively. We used pfpebwt for constructing both the eBWT and the dolEBWT variants. In order to compute the dolEBWT, we modified the input files, appending an end-of-string character at the end of each sequence. Finally, for computing the concBWT, we removed the headers from the FASTA files, arranging the sequences in newline separated files, and ran Big-BWT without additional flags on these newline separated files.

We tested all 12 tools extensively, and determined which data structure they compute, using both our tests and the algorithm descriptions in the respective papers. In this section, we include further information about some of these tools.

pfpebwt is a tool computing the eBWT of string collections (https://github.com/ davidecenzato/PFP-eBWT.git). It takes in input a fasta file and gives in output the eBWT in either plain ASCII text or RLE (run-length-encoded) format. We used (a) no flags for long sequences, and (b) the flags -w 10 -p 10 -n 3 --reads for short sequences. We included it in two different rows of Table 1 because by default pfpebwt computes the eBWT, but it can compute the dolEBWT if the sequences have explicit end-of-string characters (not in multi-thread mode). G2BWT is a tool computing the dolEBWT of short sequence collections (https://bitbucket. org/DiegoDiazDominguez/lms_grammar/src/bwt_imp2). It takes in input newline separated files. Even though it is not stated explicitly, this tool computes the dolEBWT because, when it constructs the grammar, it uses dollars for separating adjacent strings. Thus, also the string rotations will contain dollars. We tested it using the default settings. msbwt is a tool implementing the Holt and McMillan [27] merge-based BWT construction algorithm (https://github.com/holtjma/msbwt.git). It takes in input a list of one or several fastq files. Even if this tool uses the BCR approach [3] for computing the BWTs to merge, it actually computes the dolEBWT. This is because it features a preprocessing where it sorts the input strings lexicographically. Thus, the resulting mdolBWT corresponds to the dolEBWT. BCR is part of the BEETL tool library (https://github.com/giovannarosone/BCR_ LCP_GSA). It takes in input a fasta file, a fastq file, or a gz compressed fastq file. It computes the mdolBWT following the approach of Bauer et al., described in [3] . We set the 'dataTypeLengthSequences' variable in Parameters.h to 1. ropebwt2 is a tool computing the FM-index and the mdolBWT of string collections (https://github.com/lh3/ropebwt2.git), using an approach similar to BCR. It takes in input a fasta file, a fastq file, or a gz compressed fastq file. We listed it in two different rows of Table 1 because it computes the mdolBWT or the colexBWT, depending on the flags. We used the -R and the -R -s flags, respectively, to obtain the two transforms. In addition, we modified main.c in order to change the order of the characters to $ < A < C < G < N < T. merge-BWT computes the mdolBWT of a string collection by merging the BWTs of subcollections of the input (https://github.com/jltsiren/bwt-merge.git). It takes in input a list of one or several mdolBWTs. The order of the dollars will depend on the order in which the input BWTs are listed. We tested it using -i plain_sorted and -o plain_sorted flags. We computed the BWTs of the subcollections using ropebwt2. nvSetBWT is a tool included in nvbio suite (https://github.com/NVlabs/nvbio.git). It takes in input either a fastq or a newline separated file. We tested it using the -R flag for skipping the reverse strand. However, even if the algorithmic descriptions in [47, 35] seem to describe the mdolBWT, the output of the current version (version 1.1) does not correspond to a possible BWT because the Parikh vector is different from that of the input. eGSA computes the generalized enhanced suffix array and the mdolBWT of a string collection (https://github.com/felipelouza/egsa.git). It takes in input a text file, a fasta file, or a fastq file. It uses the gSACA-K algorithm for computing the suffix array of subcollections of the input and then merges all suffix arrays. Thus it computes the mdolBWT. We tested it with the -b flag. eGAP computes the mdolBWT, and optionally the LCP-array (longest common prefix array) and DA (document array) of a string collection (https://github.com/felipelouza/ egap.git). It takes in input a newline separated file, a fasta file, or a fastq file. We tested it with default settings. bwt-lcp-parallel computes the mdolBWT and the LCP-array of a collection of short sequences (https://github.com/AlgoLab/bwt-lcp-parallel.git). It takes in input fasta files and does not support the N character. We tested it using standard settings. gsufsort computes the SA, LCP and mdolBWT of a string collection (https://github. com/felipelouza/gsufsort.git), using the gSACA-K algorithm of [36] . It takes in input a newline separated file, a fasta file, or a fastq file. We tested it using --fasta and --bwt flags.

BigBWT computes the concBWT, and optionally the suffix array, of a highly repetitive text or string collection (https://github.com/alshai/Big-BWT.git). It takes in input a newline separated file or a fasta file. This tool with the -f flag is used internally in the r-index (https://github.com/alshai/r-index), producing the BWT of the strings concatenated without dollars, and an additional data structure is used to mark the string boundaries. On the other hand, the tool without the -f flag will compute the BWT of the fasta files without skipping the fasta headers. We used standard parameters and as input newline separated files, the output then is the concBWT.

C V I T 2 0 2 2 Table 10 Results for the SARS-CoV-2 short dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength (n/r) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection. Table 15 Results for the 16S rRNA long dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength (n/r) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection. C V I T 2 0 2 2 1:28 Table 16 Results for the Candida auris reads dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength (n/r) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.

Sensitivity of string compressors and repetitiveness measures

 Table 17 Results for the SARS-CoV-2 genomes dataset. First row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants. First row right: summary of the dataset properties. Second row: number of runs and average runlength (n/r) of all BWT variants. Third row left: absolute and normalized pairwise Hamming distance between separator-based BWT variants on a subset of the input collection. Third row right: summary of the dataset properties of a subset of the input collection. Fourth row left: absolute and normalized pairwise edit distance between all BWT variants on a subset of the input collection. Fourth row right: number of runs and average runlength (n/r) of all BWT variants on a subset of the input collection.