key: cord-0043831-m3ixq4a4 authors: Průša, Daniel; Wehar, Michael title: Complexity of Searching for 2 by 2 Submatrices in Boolean Matrices date: 2020-05-26 journal: Developments in Language Theory DOI: 10.1007/978-3-030-48516-0_20 sha: 981ee75c10bcda77a096a20aac30ef27ec5fe567 doc_id: 43831 cord_uid: m3ixq4a4 We study the problem of finding a given [Formula: see text] matrix as a submatrix of a given Boolean matrix. Three variants are considered: search for a matching submatrix of any area, of minimum area, or of maximum area. The problem relates to 2D pattern matching, and to fields such as data mining, where the search for submatrices plays an important role. Besides these connections, the problem itself is very natural and its investigation helps to demonstrate differences between search tasks in one-dimensional and multidimensional topologies. Our results reveal that the problem variants are of different complexities. First, we show that given an [Formula: see text] Boolean matrix, the any variant can be solved in [Formula: see text] time for any given [Formula: see text] matrix, but requires various strategies for different [Formula: see text] matrices. This contrasts with the complexity of the task over matrices with entries from the set [Formula: see text], where the problem is Triangle Finding-hard and hence no algorithm with similar running time is known for it. Then, we show that the minimization variant in the case of Boolean matrices can also be solved in [Formula: see text] time. Finally, in contrast, we prove Triangle Finding-hardness for the maximization variant and show that there is a rectangular matrix multiplication-based algorithm solving it in [Formula: see text] time. We study the complexity of Four Corner Problems. A Four Corner Problem is concerned with finding a given 2 × 2 matrix as a submatrix of a given Boolean matrix. By submatrix, we mean a matrix that is formed by restricting the original matrix to a subset of its rows and columns. In general, the problem of finding a specific submatrix in a larger matrix is of importance in several computer science disciplines. For example Boolean matrices, and their associated submatrices of 1's, play a central role in data mining problems such as frequent itemset mining [13] . Moreover, finding a submatrix of 1's in the adjacency matrix of a graph G corresponds to finding a biclique of G [13] . As the maximum edge biclique problem is NP-complete [10] , the complexity of searching for a k × k submatrix is expected to grow as k grows. In this paper, we deal with the simplest case when k = 2. An example of its use is as follows. Given m respondents answering n yes/no questions in a questionnaire, are there two respondents who answered yes on two of the same questions? The above tasks ANY a b c d , MIN a b c d and MAX a b c d can also be viewed as two-dimensional pattern matching: we search for any/min/max rectangular block of a matrix that matches a given template. In only one dimension, similar pattern matching problems can be described using regular languages [2] . In this case, all the any/min/max tasks are solvable by a finite-state automaton-based algorithm in time linear in the input length [8] . In two dimensions, these problems are easily definable via the notion of local picture languages [5] . This is a formalism defining sets of two-dimensional arrays (so called pictures) for which the membership problem can be determined by looking at a window of size 2×2. These picture languages are a straightforward generalization of the well known local (string) languages [12] , which form a proper subset of the family of regular languages. We introduced in [8] a general algorithm solving two-dimensional pattern matching against local picture languages in time O(mn min{m, n}) for m × n input matrices. Further, for a specific local picture language, we investigated the pattern matching problem which is precisely ANY [ 1 1 1 1 ] and showed it to be solvable in linear time in the input matrix area. Here our goal is to propose more efficient algorithms for a specialized subclass of local picture language pattern matching problems over Boolean matrices called Four Corner Problems. In particular, we show that the problem ANY a b c d is solvable in O(mn) time for any a, b, c, d ∈ {0, 1} (Theorem 1). This result is surprising because it was proven in [8] that searching for a submatrix matching ( 1 0 1 1 ) in an n × n matrix over {0, 1, 2} is Triangle Finding-hard. In other words, the proof introduced a fine-grained reduction [15] from Triangle Finding to the search problem for ( 1 0 1 1 ) over {0, 1, 2} suggesting that Four Corner Problems are harder over larger alphabets. The Triangle Finding problem is to decide whether a given undirected graph G = (V, E) is triangle-free or not. It is a classic algorithmic problem which can be reduced to Boolean Matrix Multiplication (see [6] ) and solved in time O(n ω ), where n = |V | and ω < 2.373 denotes the matrix multiplication exponent [14] . However, it is currently unknown whether Triangle Finding can be solved in time O(n 2 ). Note that conditional lower bounds based on Triangle Finding are known for several problems (see, e.g., [1, 7, 9, 11] . This algorithm is based on computing a minimum witness for Boolean matrix multiplication [4] . However, it is likely impractical because it uses a fast rectangular matrix multiplication algorithm that involves a large constant factor. The paper is structured as follows. Section 2 establishes some required notions. Then, Sects. 3, 4 and 5 gradually present results for the problems ANY a b c d , MIN a b c d and MAX a b c d . Let M be an m × n Boolean matrix. We write M to denote the matrix obtained from M by negating its entries (i.e., we have M i,j = 1 − M i,j for every entry). We consider that rows and columns of M are indexed from 1 to m and n, respectively. A k × (rectangular) block of M at a position (r, c) is denoted as B = M[r, c; k, ], where 1 ≤ k ≤ m, 1 ≤ ≤ n, 1 ≤ r ≤ m − k + 1, 1 ≤ c ≤ n − + 1. Its entries coincide with the entries of the submatrix obtained from M by deleting rows 1, . . . , r − 1 and r + k, . . . , m, and columns 1, . . . , c − 1 and c+ , . . . , n. We use B i,j to refer to the entry in the i-th row and j-th column of B. We have B i,j = M r+i−1,c+j−1 . We define the area of B as a(B) = k , and the 2 × 2 corners submatrix of B as The set of all blocks of M is denoted by B M . For a, b, c, d ∈ {0, 1}, we define the following search problems (also known as Four Corner Problems) for an input Boolean matrix M. This section presents algorithms for ANY a b c d that run in nearly linear time in the input matrix area, for every a, b, c, and d. In some cases an efficient algorithm is achieved by using properties of the minimum matching submatrix, so these algorithms also solve the corresponding MIN a b c d problem (see Lemmas 2 and 3). Out of all ANY a b c d problems, ANY [ To see this, suppose without loss of generality that Since B 1 and B 2 are proper subsets of B, we have found a smaller block containing ( 1 0 1 1 ) as a submatrix. Now, we present the algorithm. It creates a map σ where a key is a pair (i, j) such that M i,j = 1. The value associated with (i, j) is a pair (i , j ) such that i is the largest row index less than i such that M i ,j = 1 (i.e., i is the row index of the nearest entry 1 located upwards from the position (i, j)) and j is the smallest column index greater than j such that M i,j = 1 (i.e., the column index of the nearest entry 1 rightwards). Note that the value of i or j might be undefined if there is no such row index or column index, respectively. It is possible to build σ in O(mn) time by making two passes over M. The first pass is to compute the i 's. The matrix M is scanned column by column. Each column index j is scanned from top to bottom. Whenever entry 1 is detected at a position (i , j), then i is the first component of σ(i, j) for the next detected entry 1 from position (i, j). Analogously, the second pass, scanning M row by row, is to compute the j 's. Now, for each key (i, j) in the map σ, the algorithm takes its value (i , j ) and checks if rows i, i and columns j, j form a desired submatrix matching ( 1 0 1 1 ). By doing this, every existing block with 0 entries on the left and bottom edges is checked. Among these blocks, a minimum-area block B such that κ(B) = ( 1 0 1 1 ) is returned as the result. Assuming constant time map operations, the algorithm runs in O(mn) time (note that the map σ can be implemented by using an m × n array). Proof. Case I (square matrices): Let an n×n Boolean matrix M be given. We present a divide and conquer strategy. If t(n) denotes the runtime of searching for a desired minimum submatrix in an n × n matrix, then we show that To accomplish this, we split time that finds a desired minimum submatrix by the assumption that there is such a submatrix crossing the specified borders. Therefore, in O(n 2 ) time, we reduce finding a desired submatrix in an n by n matrix M to finding a desired submatrix in one of four n 2 by n 2 matrices. If we solve recurrence (1), then we get t(n) = O n 2 log(n) [3] . Without It remains to explain how we obtain the maps. Let us first give a construction for σ top . We create a set X of pairwise disjoint sets of column indexes. Initially, X contains one set containing all column indexes. We repeat the following process for each row of M top , starting at the bottommost one and proceeding upwards: Create two disjoint sets A 0 and A 1 where A 0 contains all column indexes that are 0's and A 1 contains all column indexes that are 1's in the current row. For each set S in X, split S into two disjoint subsets S 0 = S ∩ A 0 and S 1 = S ∩ A 1 . For every {i, j} such that i ∈ S 0 and j ∈ S 1 , set σ top ({i, j}) to the current row index. Then, update X by replacing S with S 0 and S 1 . Throw out any sets from X that have less than two elements. Finish when X is empty or every row of M top has been processed. We similarly build σ bottom , but we start at the top row of M bottom going one row down at a time. It only takes O(n 2 ) time to construct σ top and σ bottom because we do O(n) work per row plus an additional constant amount of work for each pair of columns. Case II (rectangular matrices): Let an m × n Boolean matrix M be given. Assume without loss of generality that m > n. We We consider the case when m ≥ n. We proceed in a similar manner as in the proof of Lemma 1. We create a set S of pairs of column indexes. Initially, the set is empty. The matrix is traversed row by row. For each row, we do the following. We create a set R. Initially, R is empty, but we will add column indexes to R. We scan entries from left to right in the row. When we encounter a 1 entry at column index i, we add i to R. When we encounter a 0 entry at column index j, we go through each column index i from R. If (i, j) is in S, then we found a desired submatrix. Otherwise, we add (i, j) to S. Since In the previous section, we presented fast algorithms for minimization problems MIN [ For convenience, for each Boolean matrix M considered now until the end of this section, assume that the number of rows and the number of columns of M are powers of 2. Since any matrix of a general size m × n can be extended to a 2 log 2 m × 2 log 2 n matrix (with the added entries set to "undefined" value), the assumption will not have any impact on the generality and asymptotic time complexity of the presented algorithms. For p ∈ N + , let S(p) = {2 i | i = 1, 2, . . . , log 2 p } be the set of powers of two greater than 1 and not greater than p. Let M be an m × n Boolean matrix. For k ∈ S(m) and ∈ S(n), let R M (k, ) denote the set of all k × blocks of M whose top left corner is located in M at a position (1 + a · k B be a p by q block of M. There are powers of 2, denoted by k and , such that k < 4p, < 4q and B is included in a block from R M (k, ). Proof. Assume B = M[r, c; p, q] . Let k = min{m, 2 · 2 log 2 p }, = min{n, 2 · 2 log 2 q }, Then, B is included in the block M[1 + a · k 2 , 1 + b · 2 ; k, ] ∈ R M (k, ). This is proved as follows. The definition of a ensures that 1 + a · k 2 ≤ r. It is also needed to verify that Observe that this inequality is trivially fulfilled when a · k 2 + k = m. Hence, assume that a · k 2 + k < m. Since m is divisible by k 2 , inequality (5) implies that It must thus hold that 1 + (a + 1) · k 2 > r (otherwise the right-hand side of (2) is greater than a). Now, it suffices to combine (7) and p ≤ 2 log 2 p = k 2 to obtain inequality (4). Quite analogously, the definition of b ensures that 1 + b · 2 ≤ c and it can be proved that c + q − 1 ≤ b · 2 + . Proof. The algorithm works as follows. For each block B ∈ R M , it uses the algorithm of Lemma 1 to search inside B for a submatrix matching ( 1 1 1 1 ). Among all the detected submatrices, it outputs a minimal one. By Lemma 7, B min is a part of a block B ∈ R M whose area is less than 16 times the area of B min , hence the algorithm of Lemma 1 running on B finds a block of M fulfilling the lemma requirement. Let us identify a suitable subset of blocks in R M such that each minimumarea block containing ( 1 1 1 1 ) as a submatrix is a part of a block from the subset. Define The subset of blocks R M satisfies the following properties. Claim I: Every k × block B in R M is of area less than 16 · S. Indeed, we can derive Claim II: Every p × q block B of M such that a(B) ≤ S is a subset of a block in R M . This is proved as follows. By Lemma 7, there is a k × block B 1 such that 4p > k ∈ S(m), 4q > ∈ S(n), and B is included in B 1 . It holds that k < 16 · pq ≤ 16 · S, and hence < 2 log 2 16·S k . Since l is a power of 2, we can write ≤ 2 log 2 16·S Since the other stages of the algorithm run in O(mn) time, the stated time complexity has been proven. We first prove that the problem MAX a b c d is Triangle Finding-hard for any a, b, c, d ∈ {0, 1} (Theorem 5). Then, we show how MAX a b c d can be solved using rectangular matrix multiplication (Theorem 6). Define the matrix where O is the n × n zero matrix, I is the n × n identity matrix, and A 1 = A 2 = A 3 = A. An example of a graph G and the induced matrices A and M 1 is given in Fig. 1 It is not difficult to verify that there is a one-to-one correspondence between blocks of the third type and triples i < j < k such that To reduce Triangle Finding to MAX Recall that C denotes the matrix created from a Boolean matrix C by negating its entries. Figure 2 shows the matrices M 3 and M 4 constructed for the graph G from Fig. 1 . One can again verify that G has a triangle if and only if each of the constructed matrices contains a block B such that a(B) ≥ 3n 2 and κ(B) matches the desired 2 × 2 matrix. Now, let us focus on approaches for solving the maximization problems. Given an m × n Boolean matrix M and p, q ∈ {0, 1}, let σ M p,q denote the map whose keys are pairs (i, j), where i, j are row indexes of M such that i < j. For a key (i, j), the map value is defined as the smallest column index c such that M i,c = p and M j,c = q. Let M denote the matrix M flipped left to right. It is easy to see that every problem MAX a b c d , where a, b, c, d ∈ {0, 1}, can be solved based on the maps σ M a,c and σ M b,d . Conversely, this also shows that these maps are Triangle Finding-hard to build. The maps can be computed based on a minimum witness for Boolean matrix multiplication [4] , and hence the time complexity of solving MAX a b c d in this way coincides with the time complexity in [4] for the minimum witness problem. We investigated the complexity of Four Corner Problems over Boolean matrices. A Four Corner Problem is concerned with searching for a given 2 × 2 submatrix in a given Boolean matrix. We demonstrated that minimum-area Four Corner Problems over Boolean matrices are solvable in nearly linear time in the input matrix area (Theorem 4) and maximum-area Four Corner Problems over Boolean matrices are Triangle Finding-hard (Theorem 5). The algorithms that we presented for the former problems might lead to efficient implementations, while the results achieved for the latter problems give rise to an interesting unresolved theoretical question: Are the maximum-area Four Corner Problems harder than the Triangle Finding problem? Going further, we suggest that a possible future direction is to investigate the complexity of Four Corner Problems over matrices with entries from larger alphabets. If the current clique algorithms are optimal, so is Valiant's parser Algorithms for finding patterns in strings A general method for solving divide-andconquer recurrences On minimum witnesses for boolean matrix multiplication Two-dimensional languages Finding a minimum circuit in a graph Fast context-free grammar parsing requires fast boolean matrix multiplication Two-dimensional pattern matching against basic picture languages Intersection non-emptiness and hardness within polynomial time The maximum edge biclique problem is NP-complete Lengths of words accepted by nondeterministic finite automata Jewels of Formal Language Theory On the size and recovery of submatrices of ones in a random binary matrix Multiplying matrices faster than Coppersmith-Winograd Hardness of easy problems: basing hardness on popular conjectures such as the strong exponential time hypothesis We greatly appreciate all of the help and suggestions that we received. We would especially like to thank Joseph Swernofsky for helping us obtain some preliminary results, for contributing to a preliminary version of this work, and for providing valuable feedback. We also thank the Czech Science Foundation for supporting the first author (grant no. 19-09967S).