key: cord-0034824-tjlcnee2 authors: Barsky, Marina; Stege, Ulrike; Thomo, Alex; Upton, Chris title: A New Algorithm for Fast All-Against-All Substring Matching date: 2006 journal: String Processing and Information Retrieval DOI: 10.1007/11880561_31 sha: ee69d5ddc67d91538d24090c1741a8861bddbc58 doc_id: 34824 cord_uid: tjlcnee2

We present a new and efficient algorithm to solve the "threshold all vs. all" problem, which involves searching two strings (of lengths N and M, respectively) for all maximal approximate matches of length at least S and with at most K differences. The algorithm is based on a novel graph model, and it solves the problem in time O(NMK^2).

An important problem in the field of string matching is the extraction of exact and approximate common patterns from a set of strings. In special application areas such as biological sequence analysis, finding only exact patterns can miss a great deal of useful information. The problem can be defined as "all-against-all approximate substring matching" [1, 2, 4], and is notorious for its computational difficulty [5]. In practice, various constraints are imposed on the sought solutions, such as the maximum allowed number of approximations or "errors" and the minimum length of substrings. Despite past attempts, this problem is far from being efficiently solved. Our contribution is a fast algorithm for solving "all-against-all approximate substring matching" for two strings. A naive approach to this problem is to exhaustively test each pair of substrings from s and t, respectively. This approach has O(N^2 M^2) time complexity. The best known solution was proposed by Baeza-Yates and Gonnet in [1, 2], and is widely used [8]. Their solution significantly improved the average time complexity of the naive approach by avoiding the examination of repeating substrings.
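For concreteness, the naive baseline can be sketched as follows. This is a minimal illustration, not code from the paper; the function names and the plain dynamic-programming edit distance are our own choices:

```python
def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    n = len(b)
    prev = list(range(n + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                       # deletion
                         cur[j - 1] + 1,                    # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def naive_all_vs_all(s, t, K, S):
    """Exhaustively test every substring pair: O(N^2 M^2) pairs,
    each checked with a quadratic edit-distance computation."""
    matches = []
    for i in range(len(s)):
        for j in range(i + S - 1, len(s)):          # s[i..j], length >= S
            for k in range(len(t)):
                for l in range(k + S - 1, len(t)):  # t[k..l], length >= S
                    if edit_distance(s[i:j + 1], t[k:l + 1]) <= K:
                        matches.append((i, j, k, l))
    return matches
```

Even before the edit-distance cost is counted, the four nested loops already enumerate O(N^2 M^2) substring pairs, which is what makes this baseline impractical for biological sequences.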
In their method, the two input strings are organized into a suffix tree structure, and the order of substring comparisons is guided by a depth-first traversal of the suffix tree nodes. The time complexity based on their practical results lies between NM (best case) and N^2 M^2 (worst case), but closer to N^2 M^2 [4]. Setting threshold criteria bounding the error number, i.e., allowing at most K differences in an approximate substring match, significantly improves the performance of the Baeza-Yates and Gonnet algorithm. This is because the value of K can be directly incorporated into their algorithm to cut down the depth of the suffix tree traversal. As we verify through experiments, for small values of K the Baeza-Yates and Gonnet algorithm performs very well. However, as K increases, the number of suffix tree nodes that are examined grows almost exponentially in K, which is in accordance with Ukkonen [7].

We cast the original problem into the problem of finding "maximal paths" in a special "matching" graph. Via a careful study of this graph, we are able to derive interesting and useful properties that help us in devising a highly optimized depth-first search procedure for finding "maximal paths," which correspond to the solutions of the original string problem. Our proposed algorithm runs in O(NMK^2) time, which is a significant improvement over the Baeza-Yates and Gonnet algorithm. Moreover, we experimentally show that our algorithm scales linearly (as opposed to quadratically) in K, and it outperforms the Baeza-Yates and Gonnet algorithm by an order of magnitude for bigger values of K. Finally, our algorithm has an additional nice feature: its running time decreases as the alphabet size grows. This is contrary to the behavior of the Baeza-Yates and Gonnet algorithm, whose running time worsens with the increase of the alphabet size.

Let Σ be a finite alphabet. A sequence of letters a_1 a_2 ... a_N, where a_i ∈ Σ, is called a string over Σ.
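The error-bound cutoff mentioned above can be illustrated with a K-banded edit-distance test in the style of Ukkonen's error-bounded dynamic programming [6]. This is a sketch under our own naming, not code from either paper:

```python
def within_k(a, b, K):
    """Decide whether edit_distance(a, b) <= K by filling only the DP
    cells within K of the main diagonal; every cheaper path stays in
    that band, so each row costs O(K) instead of O(len(b))."""
    if abs(len(a) - len(b)) > K:
        return False                      # length gap alone exceeds K
    INF = K + 1                           # any value > K means "fail"
    n = len(b)
    prev = [j if j <= K else INF for j in range(n + 1)]
    for i in range(1, len(a) + 1):
        lo, hi = max(1, i - K), min(n, i + K)
        cur = [INF] * (n + 1)
        if i <= K:
            cur[0] = i                    # column 0 is inside the band
        for j in range(lo, hi + 1):
            cur[j] = min(prev[j] + 1,
                         cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[n] <= K
```

Restricting the table to the band can only overestimate the distance of alignments that leave it, so the test never wrongly accepts; this is the same principle that lets K prune the suffix-tree traversal depth in the BYG algorithm.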
We denote strings with s and t. Given string s, we denote its i-th letter with s[i], and we denote the substring of s starting at position i and ending at position j with s[i, j]. Substring s[i, j] has length j − i + 1. Let the edit distance of two strings s and t be the minimum number of edit operations needed to transform s into t, as defined in [4]. We say the pair (s, t) is a K-bounded approximate match if the edit distance between s and t is at most K.

We solve Problem 1, the all bounded approximate matches problem, by casting it to an equivalent problem on graphs induced by a "matching matrix". The matching matrix M_{s,t} of s and t is defined by M[i, j] = 1 if t[i] = s[j], and M[i, j] = 0 otherwise. Based on matching matrix M, we define a weighted directed graph G_M with vertices v_ij corresponding to the 1-elements of the matrix, and with (directed) edges defined in a "top-down" and "left-right" fashion as follows: there is an edge e(v_ij, v_kl) iff i < k and j < l (cf. Fig. 1). We define the cost of an edge as c(e(v_ij, v_kl)) = max(k − i, l − j) − 1. A path in graph G_M is a sequence of vertices connected by edges. For a path in G_M, we define two characteristic properties. The match length of a path π between v_ij and v_kl is defined as ML(π) = min(k − i + 1, l − j + 1). The error number, EN(π), is defined as the sum of the costs of all edges in π. Note that G_M is not a dynamic programming (induced) graph (edit graph [4]); DP graphs have been very well studied in the literature. However, to the best of our knowledge there is no work that formally studies the properties of the graph G_M.

Graph G_M possesses a very desirable property (Theorem 1): the pair (t[i, k], s[j, l]) is a K-bounded approximate match if and only if there is a path π from v_ij to v_kl in G_M with EN(π) ≤ K. This motivates the all paths below threshold problem. Input: The graph G_M for two strings s and t, and positive integers K and S. Output: All maximal paths with EN ≤ K and with match length at least S. Based on Theorem 1 we conclude that the all bounded approximate matches problem can be reduced to the all paths below threshold problem. We show next how to construct an instance of all paths below threshold from an instance of all bounded approximate matches.
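The matching matrix and the two path measures can be sketched directly from these definitions. The function names are ours, and the edge-cost formula c = max(k − i, l − j) − 1 is the reconstruction used above (it agrees with the upper-diagonal cost l − j − 1 used later in the analysis):

```python
def matching_matrix(s, t):
    """M[i][j] = 1 iff t[i] == s[j]; the 1-cells are the vertices of G_M."""
    return [[1 if t[i] == s[j] else 0 for j in range(len(s))]
            for i in range(len(t))]

def edge_cost(vij, vkl):
    """Cost of edge v_ij -> v_kl: max(k - i, l - j) - 1.
    An adjacent diagonal step (exact-match extension) costs 0."""
    (i, j), (k, l) = vij, vkl
    assert k > i and l > j, "edges go strictly down and to the right"
    return max(k - i, l - j) - 1

def error_number(path):
    """EN(path): sum of the edge costs along the path."""
    return sum(edge_cost(u, v) for u, v in zip(path, path[1:]))

def match_length(path):
    """ML(path): min of the two substring lengths the path spans."""
    (i, j), (k, l) = path[0], path[-1]
    return min(k - i + 1, l - j + 1)
```

For example, the path v_01, v_13, v_24, v_46 discussed later with Fig. 1 accumulates edge costs 1, 0, 1, giving EN = 2.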
We outline the logic flow of algorithm APBT, omitting all formal proofs due to space constraints. In the full version of the paper we give a simple way of building and storing the matching matrix in linear time and space. As for graph G_M, we never explicitly construct and store it (thus remaining linear with respect to space). Rather, as we show, we traverse it by constructing the needed paths "on the fly."

Path Expansion. We scan the matching matrix in row-major order. When a vertex of G_M is encountered, we initialize a path π with EN(π) = 0 and ML(π) = 1. The algorithm then builds all possible expansions of this initial path by adding one vertex at a time and by keeping track of the best paths found so far. As paths are constructed, the algorithm examines each partially completed path π: if no more vertices can be added without exceeding threshold K, then we stop the expansion and check whether ML(π) ≥ S. If true, then we report path π as a solution.

A Single-Step Path Expansion. Since the error number of a path cannot exceed K, an edge to be appended to a path clearly has to have a cost of at most K. As a consequence, all edges in G_M of cost higher than K are excluded from further consideration. Consider a path π with error number EN(π), which ends at vertex v_ij. From the above discussion, it is clear that for a single-step expansion of π we need to search (in M) for a possible "next vertex" only inside the square ABCD, where A = (i + 1, j + 1) and C = (i + 1 + κ, j + 1 + κ), for κ = K − EN(π). We call square ABCD the target square for path π at vertex v_ij. The area of the target square decreases as the error number accumulated by π increases. At first sight, for any vertex v_ij in G_M there are at most (κ + 1) × (κ + 1) outgoing edges to be considered. We show how to reduce the number of edges under consideration. For this, we introduce the following definitions regarding diagonals in the matching matrix M. Let (i, j) be an arbitrary cell in M.
(1) The (i, j)-main diagonal is the set of cells (i + d, j + d) for d ≥ 0. (2) Let q be a value between 0 and N − j. The (i, j)-q-upper diagonal is the (i, j + q)-main diagonal. (3) Let r be a value between 0 and M − i. The (i, j)-r-lower diagonal is the (i + r, j)-main diagonal.

Let π be a path in G_M ending at vertex v_ij and with EN(π) ≤ K. Now, assume that v_kl and v_mn are two vertices in the target square for π at v_ij which lie on one of the upper diagonals with respect to v_ij. In terms of edge costs this means that c(v_ij, v_kl) = l − j − 1 and c(v_ij, v_mn) = n − j − 1. Now, assume that i < k < m and j < l < n. This means that if we build an edge from v_ij directly to v_mn, we "ignore" vertex v_kl and unnecessarily increase EN(π). Rather, it is better to expand path π to v_kl and later on, in the next round, continue to v_mn. Practically, this means the following: if we find a vertex v_kl on an upper diagonal of the target square, then we can exclude from the search for a single-step expansion the whole triangular area of the target square bounded by (1) row k (exclusive) and (2) the upper diagonal passing through v_kl (inclusive). Symmetrically, if vertex v_kl lies on some lower diagonal, then we can exclude from the search for expansion the whole triangular area of the target square bounded by (1) column l (exclusive) and (2) the lower diagonal passing through v_kl (inclusive). If vertex v_kl lies on the main diagonal of the target square, then both triangular areas are excluded at once. In the full paper, we strengthen the above result by showing that we can also safely exclude row k (or column l) from the search for path expansion. Observe that scanning a target square in this order guarantees that the exclusion of triangular areas takes place as early as possible. Single-path extension from an arbitrary vertex v_ij in G_M is performed at most once for each of the 2K + 1 diagonals surrounding v_ij, and therefore the number of possible extensions for v_ij is bounded by 2K + 1.
Thus, each cell of M is accessed at most once from each of the 2K + 1 diagonals. This also implies that an arbitrary vertex v_ij serves as a path extension for at most 2K + 1 vertices.

We show next how the information from previously explored paths can be reused. Let π_1 be a previously explored path, which connects vertex v_ij with v_mn. Let π_2 be another path that we are currently exploring, which originates in v_kl and has been built up to vertex v_mn. Clearly, if EN(π_2) ≥ EN(π_1) and ML(π_2) ≤ ML(π_1), we can omit the further expansion of π_2. An illustration is given in Fig. 1, where path v_01, v_13, v_24, v_46 serves as π_1, which is explored earlier in row-major order, and path v_04, v_46 exemplifies π_2. Clearly, path π_2 will only offer a sub-solution of the solution corresponding to π_1, since the substring t[4, 6] is a substring of t[1, 6]. Thus, if we remember the smallest error number among all paths that reached a particular vertex, then at each vertex we will do at most K + 1 expansions. The row-major processing order ensures that if ML is defined by the length of the vertical substring, then ML(π_2) ≤ ML(π_1). This is because both paths end at the same vertex, and π_2 starts at the same or a later (greater) row than π_1. Notably, if we repeat the computation in column-major order, all paths whose ML was defined by the horizontal substring will now have it defined by the vertical substring, and thus the ML of the later path will again be at most the ML of the previously built path. For more detailed explanations see the full version. The union of the solution sets of the two runs of the algorithm yields the final solution set. From the above, we can conclude that each vertex in G_M is expanded at most 2(K + 1) times.

Theorem 2. The APBT algorithm solves the all paths below threshold problem in O(NMK^2) time.

Proof. Since during path extension any cell is accessed only once from at most 2K + 1 vertices (cf. Corollary 3), and each of these cells, if it is a vertex, is expanded at most 2(K + 1) times (cf. discussion), the upper bound for traversing a particular cell of the matrix is at most 2(K + 1)(2K + 1). Since there are at most MN cells in M, the total time complexity is O(NMK^2).

The pseudocode of our algorithm is given below.

APBT(M, K, S):
    scan M in row-major order:
        if M[i, j] = 1 then
            create a single-vertex v_ij path π
            EN(π) = 0
            Expand_path(π)
    scan M in column-major order:
        if M[i, j] = 1 then
            create a single-vertex v_ij path π
            EN(π) = 0
            Expand_path(π)

Expand_path(π):
    if ML(π) ≥ S then
        add π to the set of solutions
    if a path with error number ≤ EN(π) has already been extended through v_kl then
        abort π and return
    do a single-step expansion (if possible) of π, creating new expanded path π_exp
    Expand_path(π_exp)

We present an experimental evaluation of our All Paths Below Threshold algorithm as it compares with the algorithm of Baeza-Yates and Gonnet [2]. We implemented Gusfield's variant of the Baeza-Yates and Gonnet algorithm [4]. We optimized it using Ukkonen's error-bounded dynamic programming method [6]. We abbreviate this optimized variant by BYG+U. The running time was tested on a 1.2 GHz PC with 312 MB of RAM. Fig. 2 shows the running time of BYG+U and APBT on a pair of RNA sequences belonging to viruses from the same family, where the minimum length of matches is set to S = 50. Notably, the APBT algorithm outperforms the BYG+U algorithm for values of K ≥ 6. We also show the size of the output, which makes clear that in order to obtain any output at all, even for similar RNA sequences, one has to set K ≥ 6. Interestingly, for K ≤ 5 the BYG+U algorithm outperforms the APBT algorithm. This is because the BYG+U algorithm benefits from stopping early, before descending deeply into the suffix trees, once the accumulated error exceeds K. However, as K grows, the BYG+U algorithm goes deeper into the suffix trees, and we observe an increase in running time that is almost exponential in K.
In contrast, APBT scales on average linearly with K. Also, we emphasize that for alphabets of bigger size the APBT algorithm performs better than the BYG+U algorithm. The performance of APBT is orders of magnitude better than that of BYG+U for protein sequences, with an alphabet size of 20. This can be explained by the greater "bushiness" of the suffix trees (used by BYG+U) close to the root, and by the fact that with the increase of the alphabet size our matching matrix becomes much sparser. The APBT algorithm behaves so much better than the BYG+U algorithm that we had to plot their behavior on different scales (cf. Fig. 3).

References
[1] Baeza-Yates, R., Gonnet, G.: All-against-all sequence matching.
[2] Baeza-Yates, R., Gonnet, G.: A fast algorithm on average for all-against-all sequence matching.
[3] Barsky, M., Stege, U., Thomo, A., Upton, C.: A New Algorithm for Fast All-Against-All Substring Matching.
[4] Gusfield, D.: Algorithms on Strings, Trees and Sequences.
[5] Combinatorial approaches to finding subtle signals in DNA sequences.
[6] Ukkonen, E.: Algorithms for approximate string matching.
[7] Ukkonen, E.: Approximate string matching over suffix trees.
[8] Pattern Discovery from Biosequences.