title: Efficient Database Search via Tensor Distribution Bucketing
authors: Mongia, Mihir; Soudry, Benjamin; Davoodi, Arash Gholami; Mohimani, Hosein
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_26

In mass spectrometry-based proteomics, one needs to search billions of mass spectra against the human proteome with billions of amino acids, where many of the amino acids go through post-translational modifications. In order to account for novel modifications, we need to search all the spectra against all the peptides using a joint probabilistic model that can be learned from training data. Assuming M spectra and N possible peptides, the current state-of-the-art search methods have a runtime of O(MN). Here, we propose a novel bucketing method that sends pairs with high likelihood under the joint probabilistic model to the same bucket with higher probability than pairs with low likelihood. We demonstrate that the runtime of this method grows sub-linearly with the data size, and our results show that our method is orders of magnitude faster than methods from the locality sensitive hashing literature.

ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this chapter (10.1007/978-3-030-47436-2_26) contains supplementary material, which is available to authorized users.

One of the fundamental challenges in mass-spectrometry based proteomics is to identify the proteins present in a cell culture by searching their mass spectrometry fingerprints against all the peptides in a proteomic database/reference. As the number of mass spectra and the size of the reference proteome increase, this search becomes very slow, especially in cases where post-translational modifications are allowed. Given a peptide sequence, existing methods construct a binary-valued spectrum from the peptide, with ones at the positions where peaks are present and zeros otherwise. A probabilistic model is then trained to learn the joint probability distribution P(spec, pep) between the predicted spectra and the discretized mass spectra (Fig. 1) [10]. Given a spectrum spec and a set of peptides Pep = {pep_1, pep_2, ..., pep_N}, the goal is to find the peptide(s) pep ∈ Pep that maximize P(spec | pep). This task motivates the following maximum likelihood problem, which generalizes the problem of matching peptides to spectra.

Let A = {a_1, a_2, ..., a_m} and B = {b_1, b_2, ..., b_n} be discrete alphabets, where m, n ∈ N. Let P be a joint distribution on the alphabets A and B such that ∑_{i=1}^{m} ∑_{j=1}^{n} P(a_i, b_j) = 1. Let S ∈ N and P(X, Y) = ∏_{s=1}^{S} P(x_s, y_s), where X = (x_1, x_2, ..., x_S), Y = (y_1, y_2, ..., y_S) and x_s ∈ A, y_s ∈ B for 1 ≤ s ≤ S. Given a data point Y ∈ B^S and a set of classes X_1, ..., X_N ⊆ A^S, our goal is to accurately and efficiently predict the class X_t, 1 ≤ t ≤ N, that generated Y. We refer to this problem as DPPS. In reference to mass-spectrometry based proteomics, the classes {X_1, ..., X_N} model the set of peptides Pep = {pep_1, pep_2, ..., pep_N} and Y models a spectrum spec.

We address the DPPS problem by solving the following optimization problem:

  argmax_{1 ≤ t ≤ N} P(Y | X_t)    (1)

A naive way to solve this optimization problem is to compute P(Y | X_t) for each 1 ≤ t ≤ N and find the maximum among them.
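This brute-force search can be sketched in a few lines of Python. The dense array representation of the conditional model P(y | x) and the integer encoding of symbols are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

# Minimal sketch of the naive DPPS solver: score every class X_t against the
# query Y and return the argmax of P(Y | X_t).  `cond` is an m x n array with
# cond[x, y] = P(y | x); symbols are encoded as integer indices (assumption).

def log_likelihood(X, Y, cond):
    """log P(Y | X) = sum_s log P(y_s | x_s)."""
    return float(np.sum(np.log(cond[X, Y])))

def brute_force_dpps(classes, Y, cond):
    """Return the index t maximizing P(Y | X_t); O(N * S) work."""
    scores = [log_likelihood(X, Y, cond) for X in classes]
    return int(np.argmax(scores))

# Toy usage with m = n = 2, S = 5, N = 3 (all values illustrative).
cond = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
classes = [np.array([0, 0, 1, 1, 0]),
           np.array([1, 1, 1, 0, 0]),
           np.array([0, 1, 0, 1, 0])]
Y = np.array([0, 0, 1, 1, 0])
print(brute_force_dpps(classes, Y, cond))
```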
The complexity of this approach grows linearly with N, and it is therefore prohibitively slow for practical applications as N grows large.

As stated above, the issue with the naive way of solving DPPS is that brute-force computation of P(Y | X) for every X ∈ {X_1, ..., X_N} is slow. Another domain where brute-force computation arises is nearest neighbor search. In nearest neighbor search, there is a data point Y ∈ R^S and a set of points X_1, X_2, ..., X_N ∈ R^S. The goal is to quickly solve the following minimization problem:

  argmin_{X ∈ {X_1, ..., X_N}} ‖Y − X‖_2    (2)

where ‖Y − X‖_2 stands for the Euclidean norm. This problem is equivalent to (1) in the special case where the probability distribution P is continuous and P(Y | X) is a decreasing function of the Euclidean distance between X and Y. For high-dimensional data, the complexity of exact nearest neighbor search grows linearly with the number of data points [9]. Therefore, researchers consider the approximate nearest neighbor (ANN) search problem. In the ANN search problem, the objective is to find X ∈ {X_1, X_2, ..., X_N} such that

  ‖Y − X‖_2 ≤ c · min_{1 ≤ t ≤ N} ‖Y − X_t‖_2    (3)

where c > 1 is referred to as the "approximation factor". A common algorithm for solving this problem is locality sensitive hashing [2, 6, 7, 9] (Algorithm 1). This algorithm takes as input hashes h that satisfy the following constraints for some R > 0 and 0 < P_2 ≤ P_1 ≤ 1:

  if ‖X − Y‖_2 ≤ R then Pr[h(X) = h(Y)] ≥ P_1    (4)
  if ‖X − Y‖_2 ≥ cR then Pr[h(X) = h(Y)] ≤ P_2    (5)

Hashes h satisfying (4) and (5) are called (R, cR, P_1, P_2)-sensitive hashes. As stated in Gionis et al. [7], "The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point."

Algorithm 1 (Locality Sensitive Hashing).
Outputs: A class X ∈ X that is the most similar to Y.
Preprocessing: For each band j, compute h_j(X) = (h_{1,j}(X), h_{2,j}(X), ..., h_{r,j}(X)) and store {h_j(X) | X ∈ X} in a hash table H_j.
Query: Search h_j(Y) in H_j, i.e. find all X ∈ X that satisfy h_j(Y) = h_j(X). Among all the positives (all the X ∈ X that satisfy the condition above), report the X that has the smallest Euclidean distance from Y.

The locality sensitive hashing algorithm takes a query point Y and aims to find the point most similar to it in a database. It does so by applying r hash functions to the query point in each band j, 1 ≤ j ≤ b. In each band, the algorithm then considers all points X in the database that have been hashed to the same values as Y under all r hash functions. Currently, LSH is limited to a small number of distance metrics, including Manhattan distance (L_1) and Euclidean distance (L_2). In order to use LSH for other similarity measures, one needs to transform them into L_1 or L_2 metrics for which standard hashes are known. While it is possible to transform the DPPS problem into an approximate nearest neighbor search problem with standard metrics, in this paper we show that such transformations result in algorithms with suboptimal complexity.

In this paper, we design buckets for the DPPS problem that significantly outperform standard LSH algorithms. We address the DPPS problem by defining pairs of relations (one for each alphabet) that are sensitive to the specific joint distribution that pairs of data points belong to. Another distinctive feature of these hash relations, which we refer to as buckets, is that elements in the domain can be mapped to more than one element in the range. We refer to this framework as distribution sensitive bucketing.
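As a point of comparison before distribution sensitive bucketing is formalized, the sketch below illustrates the banding procedure of Algorithm 1 above for binary data, using simple bit-sampling hashes for the Hamming distance. The hash family, the data representation, and all parameter values are illustrative assumptions rather than the models used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bit_sampling_hashes(S, r, b):
    """Each band j uses r bit-sampling hashes: h_{i,j}(X) = X[pos[j, i]]."""
    return rng.integers(0, S, size=(b, r))

def band_signature(X, positions):
    """h_j(X) = (h_{1,j}(X), ..., h_{r,j}(X)) as a hashable tuple."""
    return tuple(X[positions])

def lsh_query(database, Y, hash_positions):
    """Algorithm 1 sketch: build b hash tables over the database, then report
    the colliding candidate with the smallest Hamming distance to the query Y."""
    b = hash_positions.shape[0]
    tables = [dict() for _ in range(b)]
    for idx, X in enumerate(database):                       # Preprocessing
        for j in range(b):
            key = band_signature(X, hash_positions[j])
            tables[j].setdefault(key, []).append(idx)
    candidates = set()
    for j in range(b):                                       # Query
        candidates.update(tables[j].get(band_signature(Y, hash_positions[j]), []))
    if not candidates:
        return None
    return min(candidates, key=lambda idx: int(np.sum(database[idx] != Y)))

# Toy usage: 4 binary strings of length S = 8, r = 2 rows, b = 3 bands.
database = rng.integers(0, 2, size=(4, 8))
Y = database[2].copy()
print(lsh_query(database, Y, make_bit_sampling_hashes(S=8, r=2, b=3)))
```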
In distribution sensitive bucketing, the buckets U^x : A^S → 2^{{1,...,Z}} and U^y : B^S → 2^{{1,...,Z}} satisfy constraints (6), (7), (8), (9), and (10) for some 0 ≤ β ≤ α ≤ 1. Here, Z ∈ N and, for a set R, we define 2^R to be the set of all subsets of R. We refer to buckets that satisfy (6)-(10) as (P, Q, α, β, δ_x, δ_y)-sensitive buckets. For (P, Q, α, β, δ_x, δ_y)-sensitive buckets, (i) the probability of a jointly generated pair being mapped to the same value is α, (ii) the probability of a random pair X, Y being mapped to the same value is β, and (iii) the complexity of assigning points to the buckets is proportional to δ_x and δ_y. Intuitively, we would therefore like to maximize α while minimizing β, δ_x, and δ_y.

The rest of the paper proceeds as follows. In Sect. 3, we assume an oracle has given us a family of (P, Q, α, β, δ_x, δ_y)-sensitive buckets, and we design an algorithm to solve (1) based on this family. In Sect. 4, we provide a way to construct these buckets. In Sect. 5 we derive the overall complexity of the algorithm presented in Sect. 3, and in Sect. 6 we propose an algorithm for constructing optimal buckets. Finally, in Sect. 7, we detail our experiments on simulated and real mass spectra.

In this section we introduce an algorithm for solving (1) when an oracle has given us a family of distribution sensitive buckets; we refer to this algorithm as Distribution Sensitive Bucketing. In contrast to the locality sensitive hashing algorithm, which attempts to find pairs of data points that are very similar to each other, the goal of distribution sensitive bucketing is to find pairs of data points that are jointly generated from a known joint probability distribution. Algorithm 2 describes a procedure to solve (1) using a family of distribution sensitive buckets. Here we use r rows and b bands, and in each band we check whether the query Y has a collision with a data point X in each of the r rows. Among all the positives (all the X ∈ X that satisfy this condition), we report the X that maximizes P(Y | X).

In the previous section, we assumed an oracle has given us a set of distribution sensitive buckets, and we designed an algorithm to solve (1) using this family of buckets. In this section, we present an approach for constructing distribution sensitive buckets. Let A and B be arbitrary binary tensors of dimensions m × ··· × m × Z (with k factors of m) and n × ··· × n × Z (with k factors of n), respectively. To construct distribution sensitive buckets, for each pair of buckets we choose positions 1 ≤ S_1 ≤ S_2 ≤ ... ≤ S_k ≤ S randomly, and we define buckets U^x_{S_1,S_2,...,S_k} and U^y_{S_1,S_2,...,S_k} as follows:

  U^x_{S_1,S_2,...,S_k}(X) = { z ∈ {1, ..., Z} | A_{X_{S_1}, X_{S_2}, ..., X_{S_k}, z} = 1 }    (11)
  U^y_{S_1,S_2,...,S_k}(Y) = { z ∈ {1, ..., Z} | B_{Y_{S_1}, Y_{S_2}, ..., Y_{S_k}, z} = 1 }    (12)

As there is a straightforward way to convert a tensor to a matrix, in the rest of this paper we treat A and B as m^k by Z and n^k by Z binary matrices. Furthermore, for any matrix M and u, v ∈ N, we use the notation M[u, v] to refer to the entry in the u-th row and v-th column of M. In the rest of this section, we derive α, β, δ_x, and δ_y for the proposed family of buckets. Theorem 1 (Eqs. (13)-(16)) expresses these quantities in terms of A, B, and P; in its statement, I_z denotes a vector of ones of dimension z. Proof. For a proof of Theorem 1, see Supplementary Section 1.

In Sect. 3, we provided Algorithm 2 to solve (1) given a family of Distribution Sensitive Buckets, and in the previous section we constructed such a family based on m^k × Z and n^k × Z matrices A and B. In this section, we analyze the expected complexity of the Query portion of Algorithm 2 given the generative process defined in DPPS.
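Before turning to that analysis, the sketch below illustrates the bucket construction of Eqs. (11)-(12) for a single choice of positions S_1, ..., S_k. Flattening the tensors into m^k × Z and n^k × Z NumPy arrays and the toy parameter values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def row_index(symbols, alphabet_size):
    """Map the k selected symbols to a row of the flattened (m^k x Z) matrix."""
    idx = 0
    for s in symbols:
        idx = idx * alphabet_size + int(s)
    return idx

def bucket(point, positions, matrix, alphabet_size):
    """U_{S_1,...,S_k}(point): the columns z with a 1 in the selected row
    (Eqs. (11) and (12))."""
    row = row_index(point[positions], alphabet_size)
    return {int(z) for z in np.flatnonzero(matrix[row])}

# Toy usage: m = n = 2, k = 2, Z = 4, S = 6 (all values illustrative).
k, Z, m, n, S = 2, 4, 2, 2, 6
A = rng.integers(0, 2, size=(m**k, Z))            # binary matrix defining U^x
B = rng.integers(0, 2, size=(n**k, Z))            # binary matrix defining U^y
positions = np.sort(rng.integers(0, S, size=k))   # random S_1 <= ... <= S_k
X = rng.integers(0, m, size=S)
Y = rng.integers(0, n, size=S)
print(bucket(X, positions, A, m), bucket(Y, positions, B, n))
```

One way to detect a collision between X and Y is then to check whether the two reported bucket sets intersect, which is effectively what the hash-table lookup in the Query step of Algorithm 2 does.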
We first analyze the complexity for a specific instance of Y and X = {X_1, X_2, ..., X_N}, and then derive the expected complexity. In Algorithm 2, the computational work for each query can be broken into (i) searching the members of U_j(Y) in a hash table containing ∪_{X ∈ X} U_j(X) in order to find positives, and (ii) computing P(Y | X) for all the positives. Since a hash table lookup has O(1) complexity, the computational work of (i) is proportional to |U_j(Y)|. Computing U_j(Y) requires forming the cartesian product U_{1,j}(Y) × U_{2,j}(Y) × ... × U_{r,j}(Y). Thus, the total size of U_j^y over the b bands can be upper bounded by

  ∑_{j=1}^{b} ∏_{i=1}^{r} |U_{i,j}(Y)|    (17)

The computational work of (ii) is equal to the number of positives, i.e. the number of collisions between U_j(X) and U_j(Y) over all X ∈ X and all bands 1 ≤ j ≤ b (18). The expectation of the sum of (17) and (18) under the generative process in DPPS gives the expected query complexity (19). Note that here we assumed that almost all of the pairs are independently generated; this assumption reflects the fact that only one X ∈ X is responsible for generating Y.

Now the question is how to select b. The probability that a jointly generated pair X, Y is called a positive (i.e. U^x_j(X) and U^y_j(Y) collide for at least one j), which we refer to as the True Positive Rate, can be calculated as

  True Positive Rate = 1 − (1 − α^r)^b    (20)

We usually want to maintain a true positive rate of nearly 1, e.g. True Positive Rate ≥ 1 − ε, where ε is a small number. This can be realized by setting

  b = −ln ε / α^r    (21)

Therefore, the overall expected complexity given the generative process in DPPS (assuming X is already preprocessed) can be upper bounded by the following expression:

  (−ln ε / α^r) N β^r + (−ln ε / α^r) δ_y^r    (22)

In Sects. 4 and 5, we described how to construct a family of distribution sensitive buckets using matrices A and B, and we derived the complexity of Algorithm 2 based on these matrices. In this section we present an approach for finding the optimal matrices based on integer linear programming. Algorithm 3 presents an integer linear programming approach for finding matrices A and B that optimize the complexity (22) using (13), (14), (15), and (16). Our approach is based on the assumption that matrix A is the identity; the reason behind this assumption is that any matrix B is non-intersecting with the identity matrix. The inputs of Algorithm 3 include the m × n matrix P and the parameters k and r; from P it computes the marginals P_x ← P I_n and P_y ← I_m^T P, and the product distribution Q ← P_x P_y.

We often need to design buckets for larger values of k in order to obtain more efficient algorithms for solving (1). However, the size of P^k grows exponentially with k, and thus Algorithm 3 does not run efficiently for k > 10. We therefore use Algorithm 4 to filter P^k down to a smaller matrix P_k, which keeps only the rows and columns of P^k with sum above ε, and we pass this matrix P_k as an input to Algorithm 3.

Algorithm 4.
Inputs: P, k ∈ N, ε ≤ 1.
Outputs: P_k.
P_0 ← 1
for t = 1 to k do
  P_t ← P_{t−1} ⊗ P
  Remove any row/column in P_t with sum below ε
Report P_k.

In this section we verify the advantage of our Distribution Sensitive Bucketing approach with several experiments. In the first experiment, we compare the performance of Distribution Sensitive Bucketing to several commonly used methods from the Locality Sensitive Hashing literature on a range of theoretical distributions P. Although these methods are not directly applicable to our problem, we can transform the DPPS problem into problems where these methods do apply.
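As an aside before the experiments, the filtering step of Algorithm 4 above can be made concrete with a short sketch; the dense NumPy representation of P and the example threshold value are assumptions made for illustration.

```python
import numpy as np

def filtered_tensor_power(P, k, eps):
    """Algorithm 4 sketch: iteratively take the Kronecker power of P, pruning
    any row or column whose sum falls below eps after each step."""
    Pt = np.array([[1.0]])                 # P_0 is the 1 x 1 matrix [1]
    for _ in range(k):
        Pt = np.kron(Pt, P)                # P_t = P_{t-1} (x) P
        rows = Pt.sum(axis=1) >= eps       # keep rows with sum >= eps
        cols = Pt.sum(axis=0) >= eps       # keep columns with sum >= eps
        Pt = Pt[rows][:, cols]
    return Pt

# Toy usage: a 2 x 2 joint distribution, k = 3, eps = 0.05 (illustrative).
P = np.array([[0.66, 0.04],
              [0.04, 0.26]])
print(filtered_tensor_power(P, k=3, eps=0.05).shape)
```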
In the second experiment, we apply the Distribution Sensitive Bucketing algorithm, along with the same methods from the Locality Sensitive Hashing literature, to the problem of peptide identification from mass spectrometry signals. In this problem, given millions of peptides X = {X_1, X_2, ..., X_N} and a discretized mass spectrum Y, our goal is to find a peptide X ∈ X that maximizes P(Y | X) = ∏_{s=1}^{S} P(y_s | x_s), where P can be learned from a training data set of known peptide-spectrum pairs. We use the probabilistic model introduced in Kim et al. [10], which is trained to score a mass spectrum against a peptide sequence, accounting for neutral losses and the intensities of peaks (see Supplementary Figure 1).

In the first experiment we compared the complexity of three algorithms (LSH-Hamming, Inner Product Hash [13], and Distribution Sensitive Bucketing) on a range of probability distributions. LSH-Hamming and Inner Product Hash cannot be directly applied to the DPPS problem, as they only work when {X_1, ..., X_N} and Y belong to the same alphabet. Nevertheless, through the following transformations, the DPPS problem can be reduced to a nearest neighbor search problem under Hamming distance and to a maximum inner product search problem. To transform the DPPS problem to nearest neighbor search with Hamming distance, map each element of the alphabets A = {a_1, ..., a_m} and B = {b_1, ..., b_n} to either 0 or 1; as a result, each X ∈ {X_1, ..., X_N} lies in {0, 1}^S and Y ∈ {0, 1}^S. To transform DPPS to the maximum inner product search problem, first change the objective function to log P(Y | X) = ∑_{s=1}^{S} log P(y_s | x_s). Observe that log P(y_s | x_s) can be expressed as the dot product of a one-hot vector of size n (the size of the alphabet B) with the vector (log P(y | x_s))_{y ∈ B} ∈ R^n. We can then concatenate these vectors (one for each 1 ≤ s ≤ S) into signals v_Y and w_X of length Sn, so that the dot product of the two vectors equals log P(Y | X). One can then apply maximum inner product search (MIPS) to the set X = {w_X | X ∈ {X_1, ..., X_N}} and the query vector v_Y in order to solve the DPPS problem.

We benchmark the algorithms using a family of probability distributions P(t) parameterized by t. In Fig. 2, we plot the asymptotic complexity of each of the three algorithms. The asymptotic complexity can be expressed as O(N^λ) for some λ, and the value of λ is plotted versus t in Fig. 2. As one can see, Distribution Sensitive Bucketing has a lower asymptotic complexity than LSH-Hamming for 0 ≤ t ≤ 0.5, and for t ≥ 0.5 Distribution Sensitive Bucketing and LSH-Hamming have the same asymptotic complexity. Inner Product Hash is always worse than Distribution Sensitive Bucketing by a large margin.

In the next experiment we evaluated the performance of Distribution Sensitive Bucketing on simulated spectra and peptides. We simulated the mass spectra and peptides using the probabilistic model from Kim et al. [10]. For each of the algorithms, we chose the parameters so that they theoretically achieve a 95% true positive rate; in Figure 2 of the Supplementary, we verify experimentally that we indeed achieve a 95% True Positive Rate. Figure 3 shows the number of positive calls for our method in comparison to the brute force method, LSH-Hamming, and Inner Product Hash. Here, we varied the number of peptides from 100 to 100,000 and computed the average number of positives per spectrum, averaged over 5,000 spectra.
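The parameter choice for a target true positive rate follows Sect. 5: given the per-row collision probability α (which would come from Theorem 1) and the number of rows r, the number of bands b is set as in Eq. (21). A minimal sketch with illustrative numbers:

```python
import math

def bands_for_target_tpr(alpha, r, eps):
    """Eq. (21): choose b so that 1 - (1 - alpha**r)**b >= 1 - eps."""
    return math.ceil(-math.log(eps) / alpha**r)

def true_positive_rate(alpha, r, b):
    """Eq. (20): probability that a jointly generated pair collides in at
    least one of the b bands."""
    return 1.0 - (1.0 - alpha**r) ** b

# Illustrative numbers: alpha = 0.4, r = 3, target TPR of 95% (eps = 0.05).
b = bands_for_target_tpr(alpha=0.4, r=3, eps=0.05)
print(b, true_positive_rate(alpha=0.4, r=3, b=b))
```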
Figure 4 shows the runtime of brute force search, LSH-Hamming, and Inner Product Hash versus Distribution Sensitive Bucketing. Distribution Sensitive Bucketing is 20X faster than brute force, while LSH-Hamming is only 2X faster than brute force. Inner Product Hash is always as slow as brute force search.

We applied Distribution Sensitive Bucketing, LSH-Hamming, Inner Product Hash, and brute force search to the problem of mass spectrometry database search in proteomics. Here we search a dataset of 93,587 spectra against the human proteome sequence. We tuned the parameters of Distribution Sensitive Bucketing, LSH-Hamming, and Inner Product Hash on a smaller test data set to achieve a 95% True Positive Rate and then applied the algorithms to the larger data set. For Distribution Sensitive Bucketing, this resulted in a 91% true positive rate and a 50X decrease in the number of positives in comparison to brute force search. Distribution Sensitive Bucketing also led to a 30X reduction in time compared to brute force search. LSH-Hamming resulted in a 2X reduction in positives and a 2X reduction in computation time while achieving a True Positive Rate of 93%. Inner Product Hash did not improve on brute force search.

In this paper we introduce a problem from computational biology that requires computing the joint likelihood of all pairs of data points coming from two separate large data sets. In order to speed up this brute-force procedure, we develop a novel bucketing method. We show theoretically and experimentally that our method is superior to methods from the locality sensitive hashing literature.

References
1. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
2. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
3. An optimal algorithm for approximate nearest neighbor searching fixed dimensions
4. Similarity estimation techniques from rounding algorithms
5. Set similarity search beyond MinHash
6. Locality-sensitive hashing scheme based on p-stable distributions
7. Similarity search in high dimensions via hashing
8. Approximate algorithms for high-dimensional geometric problems
9. Approximate nearest neighbors: towards removing the curse of dimensionality
10. MS-GF+ makes progress towards a universal database search tool for proteomics
11. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics
12. Mining of Massive Datasets
13. On symmetric and asymmetric LSHs for inner product search
14. In defense of MinHash over SimHash

Acknowledgements. This work was supported by a research fellowship from the Alfred P. Sloan Foundation and a National Institutes of Health New Innovator Award (DP2GM137413).