key: cord-0586933-1zdjaajp authors: Giuliani, Sara; Romana, Giuseppe; Rossi, Massimiliano title: Computing Maximal Unique Matches with the r-index date: 2022-05-03 journal: nan DOI: nan sha: e19b56cdce8eba05f82ca60e2152f9023c2bda58 doc_id: 586933 cord_uid: 1zdjaajp In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the $r$-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the $r$-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the $r$-index, while preserving the space and time bounds. We add $O(r)$ samples of the longest common prefix (LCP) array, where $r$ is the number of equal-letter runs of the BWT, which permit the computation of the second longest match of each pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, which we call mum-phinder, and tested it on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. When the dataset is not highly repetitive, our method is up to 8 times smaller in memory while up to 19 times slower; on highly repetitive data, our method is up to 6.5 times slower while using up to 25 times less memory.
With the advent of third-generation sequencing, the quality of assembled genomes has drastically increased. In the last year the Telomere-to-Telomere project released the first complete haploid human genome [19], and the Human Pangenome Reference Consortium (HPRC) plans to release hundreds of high-quality assembled genomes to be used as a pangenome reference. One important step towards enabling the use of these high-quality assembled genomes is to build a multiple-sequence alignment of the genomes. Tools like MUMmer [13, 18] and Mauve [5] proposed a solution to the original problem of multiple-sequence alignment by using Maximal Unique Matches (MUMs) between two input sequences as prospective anchors for an alignment. MUMs are long stretches of the genomes that are equal in both genomes and occur only once in each of them. To reduce the computational cost of computing the MUMs, progressive approaches have also been developed, such as progressiveMauve [6] and Progressive Cactus [1], which, among others, enable the construction of pangenome graphs, used in recent aligners like Giraffe [21]. MUMs have also proven useful for strain-level read quantification [23], and as a computationally efficient genomic distance measure [7]. Recent advances in pangenomics [20, 3] demonstrated that it is possible to index hundreds of human genomes and to query such an index to find supersets of MUMs, namely maximal exact matches (MEMs), which are substrings of the pattern that occur in the reference and that can be extended neither on the left nor on the right. The tool MONI [20] requires two passes over the query sequence to report the MEMs. Later, PHONI [3] showed how to modify the query algorithm to compute the MEMs in a streaming fashion, with a single pass over the query string. Both MONI and PHONI are built on top of an r-index [11] and a straight-line program (SLP) [9].
Their main objective is to compute the so-called matching statistics (see Definition 3) of the pattern with respect to the text, which can be used to compute the MEMs with a linear scan. While MONI uses the SLP for random access to the text and needs to store additional information to compute the matching statistics and the MEMs, PHONI uses the SLP to compute efficient longest common extension (LCE) queries, which allow computing the matching statistics and the MEMs with only one scan of the query. We present mum-phinder, a tool that is able to compute the MUMs of a query pattern against an index on a commodity computer. The main idea of our approach is to extend the definition of matching statistics to include, for each suffix of the pattern, the length of the second longest match of that suffix in the text, which allows us to decide whether a MEM is also unique. We extended PHONI to keep track, at each step of the query, of the second longest match of the pattern in the index and of its length. To do this, we add O(r) samples of the longest common prefix (LCP) array to PHONI. We evaluated our algorithm on real-world datasets, testing mum-phinder against MUMmer [18]. We measured the time and memory required by both tools on sets of increasing size of haplotypes of human chromosome 19 and of SARS-CoV2 genomes, queried with one haplotype of chromosome 19 and one SARS-CoV2 genome not present in the dataset, respectively. We report that mum-phinder consistently requires less memory than MUMmer in all experiments, being up to 25 times smaller. Although MUMmer is generally faster than mum-phinder (18 times faster for 1 haplotype of chromosome 19, and 6.5 times faster for 12,500 SARS-CoV2 genomes), it cannot process larger datasets due to memory limitations. Additionally, we observe that when increasing the number of sequences in the dataset, the construction time of mum-phinder increases, while the query time decreases.
This phenomenon is due to the increase in the number of matches in the search process, which avoids the use of more computationally demanding operations. Note that, due to the use of the r-index, the efficiency of our method increases when the dataset is highly repetitive, as in the case of pangenomes.

Let T = T[0..n) be a string of length n over an alphabet Σ, i.e., T[j] ∈ Σ for all j ∈ [0..n). The length of a string is denoted by |T|. We refer to the empty string, the only string of length 0, with ε. We denote a factor (or substring) of T as T[i..j) = T[i] · · · T[j − 1]. We assume throughout the paper that the text T is terminated by a termination character $ that does not occur in the original text and is lexicographically smaller than all the other characters in the alphabet. The suffix array (SA) of a string T[0..n) is an array of length n listing the starting positions of the suffixes of T in lexicographic order, i.e., T[SA[i]..n) is lexicographically smaller than T[SA[i+1]..n) for all 0 ≤ i < n − 1. Let lcp(u, v) be the length of the longest common prefix between two strings u and v. The LCP array of T is defined as LCP[0] = 0 and LCP[i] = lcp(T[SA[i−1]..n), T[SA[i]..n)), for any 0 < i < n. The Burrows-Wheeler Transform (BWT) of T is a reversible transformation of the characters of T [4]: it is the concatenation of the characters preceding the suffixes of T listed in lexicographic order, i.e., BWT[i] = T[(SA[i] − 1) mod n] for all 0 ≤ i < n. The LF-mapping is the function that maps each position of the BWT to the position, in the BWT, of the preceding text character, i.e., LF(i) = j such that SA[j] = (SA[i] − 1) mod n. The run-length encoding of a string T is the representation of the maximal equal-letter runs of T as pairs (c, ℓ), where c is the letter of the run and ℓ > 0 is the length of the run. For example, the run-length encoding of T = AAACAAGGGG is (A, 3)(C, 1)(A, 2)(G, 4). We refer to the number of runs of the BWT as r. The BWT tends to create long equal-letter runs on highly repetitive texts such as genomic datasets. The run-length encoding applied to the BWT (RLBWT for short) is the basis of many lossless data compressors and text indexes, such as the FM-index [8], which underlies widely used bioinformatics tools such as Bowtie [14] and BWA [15].
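As an illustration, the BWT and its run-length encoding can be computed naively in a few lines. This is a quadratic-time sketch for exposition only; the function names are ours, and the paper's tools build the RLBWT with dedicated construction algorithms.

```python
def bwt(text):
    # Naive BWT via sorted rotations; assumes `text` already ends with
    # the unique terminator '$', the lexicographically smallest symbol.
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)

def run_length_encode(s):
    # Maximal equal-letter runs of s as (character, length) pairs.
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs

# The example from the text: the run-length encoding of AAACAAGGGG
# is (A,3)(C,1)(A,2)(G,4).
runs = run_length_encode("AAACAAGGGG")
# r is then the number of runs of the BWT: len(run_length_encode(bwt(T))).
```

On repetitive texts the BWT groups occurrences of equal letters together, which is what keeps r small compared to n.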
Although the BWT can be stored and queried in compressed space [17], the number of samples of the SA required by the index grows with the length of the uncompressed text. To overcome this issue, Gagie et al. [11] proposed the r-index, whose number of SA samples grows with the number of runs r of the BWT. The r-index is a text index composed of the run-length encoded BWT and the SA sampled at run boundaries, i.e., at the first and last character of each run of the BWT; it is able to retrieve the missing values of the SA by using a predecessor data structure on the samples of the SA. A context-free grammar G = (V, Σ, R, S) consists of a set of variables V, a set of terminal symbols Σ, a set of rules R of the form A → α, where A ∈ V and α ∈ (V ∪ Σ)*, and the start variable S ∈ V. The language of the grammar L(G) ⊆ Σ* is the set of all words over the alphabet of terminal symbols generated by applying rules of R starting from S. When L(G) contains only one string T, that is, when G generates only T, the grammar G is called a straight-line program (SLP). Given a text T[0..n) and a pattern P[0..m), we refer to any factor of P that also occurs in T as a match. A match w in P can be defined as a pair (i, ℓ) such that w = P[i..i + ℓ). We say that w is maximal if the match can be extended neither on the left nor on the right, i.e., either i = 0 or P[i − 1..i + ℓ) does not occur in T, and either i = m − ℓ or P[i..i + ℓ + 1) does not occur in T. Example 2. Let T = ACACTCTTACACCATATCATCAA$ be the text and P = AACCTAA the pattern. The factor AA is maximal in P and occurs only once in T, but it is repeated in P at positions 0 and 5. The factor CT of P starting at position 3 is a maximal match that occurs only once in P, but it is not unique in T. The factor CC of P starting at position 2 is unique in both T and P, but it can be extended on the left with an A in both. On the other hand, the factor P[1..4) = ACC is maximal and unique in both T and P.
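Example 2 can be verified mechanically with a brute-force sketch (cubic time, for illustration only; `mums` and `count` are our own hypothetical helpers): a factor of P is a MUM when it occurs exactly once in P and exactly once in T, and neither of its one-character extensions occurs in T.

```python
def mums(T, P):
    # Brute-force MUMs: substrings of P occurring exactly once in P and
    # exactly once in T, such that extending them by one character on
    # either side yields a string that does not occur in T.
    def count(s, w):
        return sum(1 for k in range(len(s) - len(w) + 1) if s[k:k + len(w)] == w)
    found = set()
    m = len(P)
    for i in range(m):
        for j in range(i + 1, m + 1):
            w = P[i:j]
            if count(P, w) != 1 or count(T, w) != 1:
                continue
            left_maximal = i == 0 or count(T, P[i - 1:j]) == 0
            right_maximal = j == m or count(T, P[i:j + 1]) == 0
            if left_maximal and right_maximal:
                found.add(w)
    return sorted(found)

result = mums("ACACTCTTACACCATATCATCAA$", "AACCTAA")  # Example 2
```

This reports ACC as the only MUM of Example 2: CC is unique in both strings but is filtered out because its left extension ACC still occurs in T.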
From now on, we refer to the set of all maximal unique matches between T and P as MUMs. In [3] the authors showed how to compute maximal matches (not necessarily unique in T or in P) in O(r + g) space, where r is the number of runs of the BWT of T and g is the size of the SLP representing the text T. This is achieved by computing the matching statistics, for which we report the definition given in [3]: the matching statistics of P with respect to T is the array MS[0..m) of pairs (pos, len) such that P[i..i + MS[i].len) is the longest prefix of P[i..m) that occurs in T, and it occurs in T at position MS[i].pos. A known property of the matching statistics is that, for all i > 0, MS[i].len ≥ MS[i−1].len − 1. Our objective is to show how to further compute MUMs within the same space bound. For our purpose, we extend the definition of the MS array with an additional information field in each entry: eMS[i] = (pos, len, slen), where pos and len are defined as for MS, and slen is the length of the second longest match, i.e., the length of the longest prefix of P[i..m) that occurs in T at a position other than eMS[i].pos. We now show how to compute MUMs by using the eMS array. Lemma 5 shows how to verify whether a match occurs only once in T. Lemma 5. Given a text T, a pattern P, and the eMS array computed for P with respect to T, the match w_i = P[i..i + eMS[i].len) occurs only once in T if and only if eMS[i].slen < eMS[i].len. We check the maximality of a match in the pattern using an approach analogous to [20], which we summarize in the following lemma. We first show that, given i ∈ L, if a match w_i is not unique in P, then the second occurrence of w_i in P is contained in another maximal match that is unique in T. Given a text T, a pattern P, and the eMS array computed for P with respect to T, let L be the subset of positions in P such that w_i = P[i..i + eMS[i].len) is maximal and occurs only once in T, for all i ∈ L. Then w_i is not unique in P if and only if there exist i′ ∈ L \ {i} and two possibly empty strings u, v such that w_{i′} = u w_i v is a factor of P. Proof. Let us assume by contradiction that no such i′ exists, and let j ≠ i be the starting position of a second occurrence of w_i in P. Then P[j..j + |w_i|) occurs only once in T and is not maximal; therefore there exists k ∈ L such that k ≤ j and |w_k| > |w_i|, which contradicts the hypothesis. The other direction of the proof is straightforward, since, by definition of w_{i′}, either w_i occurs twice in P or it is not maximal. The following lemma shows how to determine, for any i ∈ L, whether a match w_i is unique in P by using the eMS array.
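For small inputs, the extended matching statistics can be computed by brute force (a naive sketch under our own naming, not the paper's algorithm): each entry is a triple (pos, len, slen), where slen is the length of the longest prefix of the suffix that occurs at least twice in T, so the longest match is unique in T exactly when slen < len.

```python
def extended_matching_statistics(T, P):
    # For each i: len = length of the longest prefix of P[i:] occurring
    # in T, pos = one of its occurrences, slen = length of the longest
    # prefix of P[i:] occurring at least twice in T (the second longest
    # match).  Quadratic/cubic time, for illustration only.
    res = []
    for i in range(len(P)):
        pos, ln, sln = 0, 0, 0
        for j in range(i + 1, len(P) + 1):
            w = P[i:j]
            occ = [k for k in range(len(T) - len(w) + 1) if T[k:k + len(w)] == w]
            if not occ:
                break
            pos, ln = occ[0], len(w)
            if len(occ) >= 2:
                sln = len(w)
        res.append((pos, ln, sln))
    return res

T = "ACACTCTTACACCATATCATCAA$"  # text from Example 2
P = "AACCTAA"                   # pattern from Example 2
ems = extended_matching_statistics(T, P)
```

For Example 2, the entry at position 1 is (10, 3, 2): the longest match ACC of P[1..) occurs at text position 10 with length 3, the second longest match AC has length 2, and since 2 < 3 the match is unique in T.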
We can summarize the previous lemmas in the following theorem. Now we show how to compute eMS by extending the algorithm presented in Boucher et al. [3] while preserving the same space bound. In this section we present the algorithm that we use to compute MUMs, which builds on the approach of Boucher et al. [3] for the computation of the MS array. The authors showed how to use the r-index and the SLP of [10, 9] to compute the MS array of a pattern P[0..m) in O(m · (t_LF + t_LCE + t_pred)) time, where t_LF, t_LCE, and t_pred represent the time to perform one LF, one LCE, and one predecessor query, respectively. Our algorithm extends Boucher et al.'s method by storing additional O(r) samples of the LCP array. Given a text T[0..n) and a pattern P[0..m), in the following we first show how to compute the eMS array of P with respect to T using the r-index, the SLP, and the additional LCP array samples. Then we show how to apply Theorem 9 to compute the MUMs from the eMS array. The key point of the algorithm is to extend the last computed match backwards when possible, and otherwise to search, by using the BWT, for the new longest match that can be extended on the left. The algorithm to compute the pos and len entries of the eMS array is analogous to the procedure detailed in [3]. We use the same data structures as those defined in [3], that is, the run-length encoded BWT and the samples of the SA at positions q such that BWT[q] is either the first or the last symbol of an equal-letter run of the BWT. Note that q_p and q_s are, respectively, the last and the first index of their corresponding equal-letter runs. An analogous reasoning can be formulated to compute the second longest match. With respect to the implementation in [3], we add O(r) sampled values from the LCP array.
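The BWT-based search underlying this step is standard backward search, sketched naively below with plain arrays instead of the run-length compressed structures of [3] (the function name is ours): the SA interval of the current pattern suffix is maintained and refined one character at a time via LF-style steps.

```python
def backward_search(text, pattern):
    # Textbook FM-index backward search over a naively built BWT:
    # processes the pattern right to left, maintaining the SA interval
    # [lo, hi) of the suffixes of `text` prefixed by the current
    # pattern suffix.  Returns the occurrence positions in `text`.
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    bwt = [text[(i - 1) % n] for i in sa]
    # C[c] = number of text symbols strictly smaller than c
    C, tot = {}, 0
    for c in sorted(set(text)):
        C[c] = tot
        tot += text.count(c)

    def rank(c, i):
        # number of occurrences of c in bwt[0:i]
        return sum(1 for x in bwt[:i] if x == c)

    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

On the text of Example 2, backward_search(T, "AC") returns [0, 2, 8, 10], while backward_search(T, "ACC") narrows the interval down to the single occurrence [10].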
Precisely, we store the LCP values between the first two and between the last two suffixes of each equal-letter run (if a run corresponds to a single suffix, we simply store 0). As shown later, this allows us to overcome the problem of computing LCE queries when a position p in T is not stored in the sampled SA, i.e., when ISA[p] is neither the first nor the last index of its equal-letter run. For simplicity of exposition, we ignore the cases in which a select query for a symbol c in the BWT fails. However, whenever this happens, either c does not occur in T, or we are attempting to find an occurrence outside the allowed range, that is, between 0 and the number of occurrences of c minus 1. In the first case we can simply reset the algorithm starting from the next character of P to process, while the second case occurs when we are attempting to compute an LCE query, whose result can safely be set to 0. Analogously, we compute lcp_s (lines 7-10) and, by Lemmas 11 and 12, we assign to eMS[i].slen the maximum of lcp_p and lcp_s. In order to compute eMS[i].slen we need to compute the value of lcp_s with respect to q_s. To do so, we look for the smallest index q′_s greater than q_s such that BWT[q′_s] = P[i], and then apply a procedure similar to Algorithm 2 (lines 14-18). In this case, if BWT[q_s + 1] = P[i], then we can retrieve lcp_s from LCP[q_s + 1], since q_s is at a run boundary. Symmetrically, we handle the case lcp_p > lcp_s (lines 20-26). Finally, we compute eMS[i].slen by taking the maximum of lcp_p and lcp_s. Theorem 13. Given a text T[0..n), we can build a data structure in O(r + g) space that allows computing the set MUMs between any pattern P[0..m) and T in O(m · (t_LF + t_LCE + t_pred)) time. Proof.
Algorithms 1, 2 and 3 show how to compute the eMS array in m steps by using the data structures of [3] of size O(r + g), to which we add O(r) words from the LCP array, thus preserving the space bound. Since at each step the dominant cost depends on the LF, LCE, and rank/select queries, eMS is computed in O(m(t_LF + t_LCE + t_pred)) time. By Lemmas 5 and 6, we can build the set L in O(m) steps from the eMS array. Recall that L contains those indices i ∈ [0..m) such that P[i..i + eMS[i].len) is a maximal match that occurs only once in T. Now we have to find those indices in L whose matches are also unique in P; a simple algorithm is to sort the suffixes of P and detect repeated candidates from the resulting order. Note that both g and t_LCE depend on the grammar scheme chosen. In fact, if there exists a data structure of size λ that supports LCE queries on a text T, then we can still compute MUMs in O(r + λ) space and O(m · (t_LF + t_LCE + t_pred)) time, with t_LCE depending on the data structure used. Here we present an approach to compute the MUMs from the eMS array different from the one in Theorem 13, which is of more practical use and does not require sorting the suffixes of P. We summarize this approach in Algorithm 4. Let L be the set of candidate positions computed from the eMS array, i.e., for each i ∈ L the match w_i = P[i..i + eMS[i].len) is maximal and unique in T (Theorem 9). We sort the elements of L with respect to their position in T and, starting from L[0], we compare every entry with the following one: if neither factor is contained in the other, we store in the set MUMs the one with the smallest starting position and keep track of the other one; otherwise, we discard the one that is repeated and continue with the following iteration. We also handle the special case in which two candidates i ≠ j ∈ L correspond to the same factor, which therefore occurs twice in P and is discarded.

We compare our method (mum-phinder) with MUMmer [18] (mummer). We tested two versions of mummer, v3.27 [13] (mummer3) and v4.0 [18] (mummer4). We executed mummer with the -mum flag to compute matches that are unique in both the text and the pattern, -l 1 to report all MUMs of length at least 1, and -n to match only A, C, G, and T characters.
We set up mum-phinder to produce the same output as mummer. We did not test against Mauve [6] because the tool does not directly report MUMs. We also did not consider algorithms that do not produce an index for the text that can be queried with different patterns without rebuilding the index, e.g., the algorithm described in Mäkinen et al. [16, Section 11.1.2]. Experiments that exceeded 16 GB of memory were omitted from further consideration. We evaluated our method using real-world datasets. We built our index for up to 512 haplotypes of human chromosome 19 from the 1000 Genomes Project [22] and up to 300,000 SARS-CoV2 genomes from EBI's COVID-19 data portal [12]. We provide a complete list of accession numbers in the repository. We divided the sequences into 11 collections of 1, 2, 3, 4, 8, 16, 32, 64, 128, 256, and 512 haplotypes of chromosome 19 (chr19) and 9 collections of 1,562, 3,125, 6,250, 12,500, 25,000, 50,000, 100,000, 200,000, and 300,000 genomes of SARS-CoV2 (sars-cov2). In both datasets, each collection is a superset of the previous one. In Table 1 (Table 1a for chr19 and Table 1b for sars-cov2) we report, for each collection, the number of sequences (No. seqs), the length n in megabytes (MB), and the ratio n/r, where r is the number of runs of the BWT. Furthermore, for querying the datasets, we used the first haplotype of chromosome 19 of sample NA21144 from the 1000 Genomes Project, and the genome with accession number MZ477765 from EBI's COVID-19 data portal [12]. In Figure 2 we show the construction and query time and space for mum-phinder and mummer. Since mummer is not able to decouple the construction of the suffix tree from the query, for our method we report the sum of the running times for construction and query, and the maximum resident set size of the two steps. We observe that on chr19 mummer3 is up to 9 times faster than mum-phinder while using up to 8 times more memory, and mummer4 is up to 19 times faster than mum-phinder while using up to 7 times more memory. However, neither mummer3 nor mummer4 can process more than 8 haplotypes of chr19 due to memory limitations. mum-phinder was able to build the index and answer the query in 48 minutes for 512 haplotypes of chr19 while using less than 11.5 GB of RAM. On sars-cov2, mummer3 is up to 6.5 times faster than mum-phinder while using up to 24 times more memory, whereas mummer4 is up to 1.2 times slower than mum-phinder while using up to 25 times more memory. mummer3 was not able to process more than 25,000 genomes, while mummer4 was not able to query more than 12,500 genomes of sars-cov2, due to memory limitations. In Figure 2 we also show the construction time and space for mum-phinder. We observe that the construction time grows with the number of sequences in the dataset; however, the query time decreases as the number of sequences in the index increases, with a 9x speedup when moving from 1 to 512 haplotypes of chr19. A similar phenomenon is observed in [3], where it is attributed to the increased number of match cases (Algorithm 2) as the number of sequences in the index grows. From our profiling (data not shown), the most time-demanding part of the query is the LCE queries, which are not performed in case of matches. This observation also motivates the additional control logic of Algorithm 3, which limits the number of LCE queries to the essential ones.
References
[1] Progressive Cactus is a multiple-genome aligner for the thousand-genome era
[2] Refining the r-index
[3] PHONI: streamed matching statistics with multi-genome references
[4] A block-sorting lossless data compression algorithm
[5] Mauve: multiple alignment of conserved genomic sequence with rearrangements
[6] progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement
[7] A genomic distance based on MUM indicates discontinuity between most bacterial species and genera
[8] Opportunistic data structures with applications
[9] Practical Random Access to SLP-Compressed Texts
[10] Rpair: Rescaling RePair with Rsync
[11] Fully functional suffix trees and optimal text searching in BWT-runs bounded space
[12] The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 research through rapid open access data sharing
[13] Versatile and open software for comparing large genomes
[14] Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
[15] Fast and accurate long-read alignment with Burrows-Wheeler transform
[16] Genome-scale algorithm design
[17] Succinct suffix arrays based on run-length encoding
[18] MUMmer4: A fast and versatile genome alignment system
[19] The complete sequence of a human genome
[20] MONI: A Pangenomic Index for Finding Maximal Exact Matches
[21] Pangenomics enables genotyping of known structural variants in 5202 diverse genomes
[22] The 1000 Genomes Project Consortium. A global reference for human genetic variation
[23] Strain Level Microbial Detection and Quantification with Applications to Single Cell Metagenomics