Locality Sensitive Hashing for Structured Data: A Survey
Wu, Wei; Li, Bin
2022-04-24

Data similarity (or distance) computation is a fundamental research topic which fosters a variety of similarity-based machine learning and data mining applications. In big data analytics, it is impractical to compute the exact similarity of data instances due to high computational cost. To this end, the Locality Sensitive Hashing (LSH) technique has been proposed to provide accurate estimators for various similarity measures between sets or vectors in an efficient manner without a learning process. Structured data (e.g., sequences, trees and graphs), which are composed of elements and relations between the elements, are commonly seen in the real world, but the traditional LSH algorithms cannot preserve the structure information represented as relations between elements. To conquer this issue, researchers have devoted themselves to the family of hierarchical LSH algorithms. In this paper, we explore the present progress of the research into hierarchical LSH from the following perspectives: 1) Data structures, where we review various hierarchical LSH algorithms for three typical data structures and uncover their inherent connections; 2) Applications, where we review the hierarchical LSH algorithms in multiple application scenarios; 3) Challenges, where we discuss some potential challenges as future directions.

Currently, data are growing prodigiously. For example, the Google search engine was queried over 90,000 times per second in 2021 [Chang, 2022]; there are now over 3.6 billion social media users in the world and the number is expected to reach 4.41 billion in 2025. Big data have been driving machine learning and data mining research in both academia and industry. Data similarity (or distance) computation is a fundamental research topic which benefits numerous similarity-based machine learning and data mining applications, e.g., classification, clustering, retrieval, etc. However, it has been daunting for big data analytics to exactly calculate similarity due to the "3V" characteristics (i.e., volume, velocity and variety). A typical example is that it is intractable to conduct document analysis on the classic data mining book, Mining of Massive Datasets [Rajaraman et al., 2012], which contains over $10^8$ features in the case of 5-shingling, without data preprocessing, e.g., dimension reduction or feature selection. Many efficient algorithms working fluently in low-dimensional spaces may run unacceptably slowly in high-dimensional spaces due to high computational cost. Therefore, it is necessary to develop efficient yet accurate similarity estimation algorithms in the era of big data.
One powerful solution to the above-mentioned problem is Locality Sensitive Hashing (LSH) [Indyk and Motwani, 1998; Broder et al., 1998; Gionis et al., 1999; Charikar, 2002; Datar et al., 2004; Andoni and Indyk, 2006; Weinberger et al., 2009; Wu et al., 2020], which provides unbiased estimators for some classical similarity (or distance) measures, e.g., MinHash for the Jaccard similarity [Broder et al., 1998], Weighted MinHash for the generalized Jaccard similarity [Haveliwala et al., 2000], SimHash for the Cosine similarity [Charikar, 2002], LSH with p-stable distributions for the $\ell_p$ distance [Datar et al., 2004], L1LSH for the $\ell_1$ distance [Gionis et al., 1999], L2LSH for the $\ell_2$ distance [Andoni and Indyk, 2006] and Feature Hashing for the inner product [Weinberger et al., 2009]. Intuitively speaking, an LSH algorithm maps similar data instances to the same hashcode with a higher probability than dissimilar ones by employing a set of randomized hash functions. The absence of a learning process is another major reason why the LSH technique is highly efficient. Thus far, researchers have widely applied the LSH technique to various real-world applications, e.g., document analysis [Manku et al., 2007], social network analysis [Chierichetti et al., 2009], bioinformatics [Luo et al., 2016], information management [Cherkasova et al., 2009], malware analysis [Raff and Nicholas, 2017], image detection [Chum et al., 2008], privacy protection [Zhang et al., 2016], databases [Christiani et al., 2018], anomaly detection [Koga et al., 2007], topic modeling [Kim et al., 2015], etc.

The above-mentioned traditional LSH algorithms are designed for sets or vectors. However, in real-world scenarios, there exist a variety of structured data, which are composed of elements and relations between the elements, e.g., genes, trajectories, parse trees, XML documents, molecules, social networks, protein-protein interaction networks, etc. From the perspective of data structures, they can be represented as sequences, trees or graphs. Obviously, these structured data are totally different from sets or vectors, and thus cannot be fed into a traditional LSH algorithm, because such a direct application may lose valuable relation information, which may lead to severe performance loss. To capture the structure information derived from relations between elements, a family of hierarchical LSH algorithms have been proposed to recursively or iteratively sketch the elements along the relation links.

In this survey, we explore the hierarchical LSH algorithms from the following perspectives:
1. Data Structures: We introduce multiple hierarchical LSH algorithms for trees, sequences and graphs (implementations of some of the algorithms introduced in this survey are available at https://github.com/drhash-cn), and uncover their intrinsic connections: the tree structure naturally represents hierarchical structures, and thus we can employ tree structures to decompose sequences or graphs to extract their intrinsic structure information.
2. Applications: We demonstrate the applications of the hierarchical LSH algorithms in some real-world application scenarios (besides the straightforward application to similarity computation in machine learning problems).
3. Challenges: We propose some potential challenges w.r.t. the hierarchical LSH algorithms, including the diversity of hierarchical LSH, heterogeneous structured data, streaming structured data and dynamic structured data.
A powerful solution to similarity (or distance) estimation is Locality Sensitive Hashing (LSH), which exploits a family of randomized hash functions to map similar data instances to the same data points in the low-dimensional space with a higher probability than dissimilar ones. Based on such a concise representation obtained without any learning process, we can acquire data similarity in an effective and efficient way.

Definition 1 (Locality Sensitive Hashing [Indyk and Motwani, 1998]). Given a family of hash functions $\mathcal{H}$, the probability that two instances $q$ and $p$ have the same hashcode under $h \in \mathcal{H}$ is exactly equal to their similarity: $\Pr[h(q) = h(p)] = \mathrm{sim}(q, p)$. (There is a more common definition of LSH in the language of $(R, cR, p_1, p_2)$-sensitivity; we adopt the alternative definition of LSH as an unbiased estimator here for easier understanding.)

In the following, we provide a brief introduction to some LSH instances which are building blocks for the hierarchical LSH algorithms reviewed in this survey.

Definition 2 (MinHash [Broder et al., 1998]). Given a universal set $U = \{u_1, \cdots, u_N\}$, a subset $S \subseteq U$ and a set of $K$ random permutations drawn from the uniform distribution, $\{\sigma_k \mid \sigma_k: U \to U\}_{k=1}^{K}$, the elements in $S$ which are placed in the first position of each permutation constitute the $K$-dimensional MinHash-codes of $S$: $h = \{\arg\min(\sigma_k(S))\}_{k=1}^{K}$.

MinHash gives an unbiased estimator for the Jaccard similarity of two sets, $\mathrm{Jaccard}(S, T) = \frac{|S \cap T|}{|S \cup T|}$, and thus we have $\Pr[\arg\min(\sigma(S)) = \arg\min(\sigma(T))] = \mathrm{Jaccard}(S, T)$. MinHash places each element in the first position with equal probability, which leads to serious information loss in the case of weighted sets because the weights denoting the importance of each element are simply replaced with 1 or 0. To this end, Weighted MinHash [Haveliwala et al., 2000] was proposed to estimate the generalized Jaccard similarity of weighted sets, $\mathrm{Jaccard}_g(S, T) = \frac{\sum_i \min(s_i, t_i)}{\sum_i \max(s_i, t_i)}$, where $s_i$ and $t_i$ are the weights, by placing each element in the first position with probability proportional to its weight.

Definition 3 (SimHash [Charikar, 2002]). Given a vector $u$ and a set of $K$ random hyperplanes $\{r_k\}_{k=1}^{K}$ drawn from the Gaussian distribution, the hash functions $\{h_{r_k} \mid h_{r_k}(u) = \mathrm{sgn}(r_k \cdot u)\}_{k=1}^{K}$, where $\mathrm{sgn}(\cdot) = 1$ if $r_k \cdot u \ge 0$ and $\mathrm{sgn}(\cdot) = 0$ otherwise, return the $K$-dimensional SimHash-codes of $u$: $h = \{h_{r_k}(u)\}_{k=1}^{K}$.

SimHash unbiasedly estimates the Cosine similarity of two vectors $u$ and $v$, and thus we have $\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi}$, where $\theta(u, v)$ denotes the angle between $u$ and $v$.

Definition 4 (Feature Hashing [Weinberger et al., 2009]). Given a vector $u = [u_1, \cdots, u_N]$ and two randomized hash functions $h: \mathbb{N} \to \{1, 2, \cdots, K\}$ and $\xi: \mathbb{N} \to \{-1, +1\}$, the $K$-dimensional hashed vector $\bar{u}$ with $\bar{u}_k = \sum_{i: h(i) = k} \xi(i) u_i$ is returned as the hashcode of $u$.

This algorithm gives an unbiased estimator for the inner product of two vectors $\langle u, v \rangle$, and thus we have $\mathbb{E}[\langle \bar{u}, \bar{v} \rangle] = \langle u, v \rangle$.

Definition 5 (LSH with p-stable distributions [Datar et al., 2004]). Given a vector $u$, a set of $K$ random vectors $\{a_k\}_{k=1}^{K}$ whose entries are drawn i.i.d. from a $p$-stable distribution, and a real number $b$ drawn uniformly from $[0, r]$, the hash functions $\{h_{a_k, b} \mid h_{a_k, b}(u) = \lfloor \frac{a_k \cdot u + b}{r} \rfloor\}_{k=1}^{K}$, where $r$ is the length of the interval when the real axis is segmented into equal-width intervals, return the $K$-dimensional p-stable hashcodes of $u$: $h = \{h_{a_k, b}(u)\}_{k=1}^{K}$.

This algorithm solves the problem under the $\ell_p$ norm based on $p$-stable distributions, where $p \in (0, 2]$. We have $\Pr[h_{a,b}(u) = h_{a,b}(v)] = \int_0^r \frac{1}{c} f_p\left(\frac{t}{c}\right)\left(1 - \frac{t}{r}\right) \mathrm{d}t$, where $c = \lVert u - v \rVert_p$ and $f_p$ denotes the probability density function of the absolute value of the $p$-stable distribution. In addition, L1LSH [Gionis et al., 1999] and L2LSH [Andoni and Indyk, 2006] are designed for the $\ell_1$ distance and the $\ell_2$ distance, respectively.
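To make the MinHash estimator above concrete, the following minimal Python sketch (our own illustration, not code from the surveyed papers) estimates the Jaccard similarity of two small sets; the function names `minhash_signature` and `estimate_jaccard` are ours, and K salted applications of Python's built-in hash stand in for the K random permutations.

```python
import random

def minhash_signature(s, K, seed=0):
    """Return a K-dimensional MinHash signature of set `s`.
    Each salted hash function simulates one random permutation; the element
    with the smallest hash value plays the role of arg min(sigma_k(S))."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(K)]
    return [min(s, key=lambda x: hash((salt, x))) for salt in salts]

def estimate_jaccard(sig_s, sig_t):
    """Fraction of positions where the two signatures collide."""
    return sum(a == b for a, b in zip(sig_s, sig_t)) / len(sig_s)

if __name__ == "__main__":
    S = {"a", "b", "c", "d"}
    T = {"b", "c", "d", "e"}
    K = 512
    sig_s = minhash_signature(S, K)
    sig_t = minhash_signature(T, K)
    print("estimated:", estimate_jaccard(sig_s, sig_t))  # close to 0.6
    print("exact:", len(S & T) / len(S | T))             # 0.6
```

The collision rate of the two signatures concentrates around the exact Jaccard similarity as K grows, which is exactly the unbiased-estimator property stated in Definition 1.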
The above basic LSH algorithms are designed for sets or vectors, so they consider the elements independently and cannot extract the relations between the elements if such relations exist; nevertheless, they have paved the way for the hierarchical LSH algorithms introduced in the following.

In the real world, structured data, e.g., texts, genes, trajectories, time series, molecules and social networks, are very common. They contain both information from the elements themselves and the structure information constituted by the relations between the elements, and thus cannot be simply represented as sets or vectors, because the structure information would have to be thrown away in this case. In this section, we roughly categorize the structured data into trees, sequences and graphs. Please note that a sequence is a special case of a tree, and each node of a graph can be naturally represented as a rooted tree. Therefore, these three commonly seen types of structured data can all be hierarchically hashed based on the tree structure with the LSH technique. Table 1 summarizes the hierarchical LSH algorithms.

A tree composed of nodes and directed edges describes the hierarchical structure by the directed edges from the parent node to the child nodes. Furthermore, if the nodes are assigned content information, the tree will have stronger semantic capability. The tree structure is commonly used to represent the hierarchical structure of a text. Two MinHash-based algorithms [Gollapudi and Panigrahy, 2008; Chi et al., 2014] recursively sketch a tree representing a top-down hierarchy of chapters, paragraphs and sentences, and generate the K-dimensional hashcodes h at each level, as shown in Figure 1(a). Obviously, such a representation in Eq. (1), where t denotes the t-th level, is richer than the bag-of-words model. The text information in the leaf nodes is propagated along the edges from bottom to top. [Gollapudi and Panigrahy, 2008] directly concatenates the hashcodes as the flat set in Eq. (2), while [Chi et al., 2014] reorganizes them into the nested set with K set-elements in Eq. (3), each of which represents the features extracted from all the nodes at this level. Compared with the flat set in Eq. (2), the nested set in Eq. (3) is able to preserve more context information, which benefits the performance. Tree-based hierarchical LSH algorithms effectively and efficiently capture the relations between elements, which further lays the foundation of sequence and graph data hashing.

A sequence is a set of elements that appear in a specific order, e.g., genes, trajectories and time series. The elements in a sequence are hierarchically hashed to preserve the order information; the data at the higher level are then encoded as smaller-sized data at the lower level, as shown in Figure 1(b). Common sequence data, e.g., texts and genes, generally adopt the technique of k-shingling in NLP and k-mers in bioinformatics to extract features. Longer k-mers preserve more dependency within larger contexts and lead to higher accuracy. However, they substantially increase time and space cost because huge training sets are required. Gallager LSH [Luo et al., 2016] is constructed hierarchically by employing a small number of L2LSH functions to capture the structure information and represent long k-mers as short hashcodes level by level.
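As a concrete illustration of the bottom-up tree sketching described above, the following Python sketch (our own simplified example, not the exact algorithms of [Gollapudi and Panigrahy, 2008] or [Chi et al., 2014]) recursively MinHashes a toy document tree and records the hashcodes produced at each level, from which either a flat set or a nested set of codes can be formed. Tagging each child-code value with its position before taking the union is our design choice to keep different hash positions distinguishable.

```python
import random

def minhash_signature(s, K, seed=0):
    """K salted hash functions simulate K random permutations (MinHash)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(K)]
    return [min(s, key=lambda x: hash((salt, x))) for salt in salts]

def sketch_tree(node, K, level, per_level):
    """Recursively sketch a tree bottom-up: a leaf is a set of tokens, an
    internal node is a list of children. The code of an internal node is the
    MinHash of the position-tagged union of its children's codes."""
    if isinstance(node, set):
        code = minhash_signature(node, K)
    else:
        child_codes = [sketch_tree(child, K, level + 1, per_level) for child in node]
        merged = {(k, v) for code in child_codes for k, v in enumerate(code)}
        code = minhash_signature(merged, K)
    per_level.setdefault(level, []).append(tuple(code))
    return code

# toy document: two paragraphs, each a list of sentences (bags of words)
doc = [
    [{"the", "cat", "sat"}, {"on", "the", "mat"}],
    [{"dogs", "bark"}, {"cats", "meow"}, {"birds", "sing"}],
]
per_level = {}
root_code = sketch_tree(doc, K=64, level=0, per_level=per_level)
flat_set = {code for codes in per_level.values() for code in codes}  # all levels pooled
nested_set = {lvl: codes for lvl, codes in per_level.items()}        # one group per level
```

The same bottom-up procedure applies to sequences once adjacent elements are grouped into a (balanced) tree, which is the spirit of the level-by-level sequence hashing just discussed.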
Time-series data and trajectories have multi-facet information, e.g., frequency, amplitude, trend and coordinates. SLSH [Kim et al., 2016] explores time-series data from more diverse and refined perspectives by building a two-level LSH framework, where multiple distance metrics are supported at different levels. Specifically, the outer-level LSH (i.e., L1LSH) coarsely stratifies the data via amplitude under the $\ell_1$ distance; then, the inner-level LSH (i.e., SimHash) finely sketches each stratified group via angle and shape under the Cosine distance. Consequently, SLSH preserves much more information through different similarity measures in a hierarchical way than the traditional LSH algorithms, because the latter provide only one perspective on the data with their associated similarity measure.

The Fréchet distance is popular in trajectory comparison because it incorporates the inherent sequential characteristics. The dynamic programming algorithms for the Fréchet distance have $O(MN)$ time complexity, where $M$ and $N$ are the lengths of the two compared trajectories. To lower the time complexity and make it feasible to compare large-scale trajectories, MRTS [Astefanoaei et al., 2018] hierarchically sketches trajectories and directly estimates their distance via LSH. Furthermore, trajectories with semantic information, e.g., user activities, can support more accurate and meaningful queries, but it is challenging to answer such queries efficiently. [Liu et al., 2017] organizes the trajectory elements hierarchically in a Quadtree, and captures spatial and semantic information via LSH with p-stable distributions.

For graph data, the most relevant technique is the Weisfeiler-Lehman (WL) graph kernel, which is rooted in the WL test of graph isomorphism [Weisfeiler and Lehman, 1968] and extracts subtrees as features from the graph. The core step is to iteratively update the node information by aggregating information from the node itself and all its neighbors. We find an interesting connection between the WL graph kernel and some early hierarchical LSH algorithms for graph data, and propose a general graph hashing framework shown in Figure 1(c), which also resembles Message Passing Neural Networks (MPNN) [Gilmer et al., 2017] if the nonlinear activation functions in MPNN are replaced with LSH functions. Formally, the framework is composed of an aggregation operation AGG, a message function M and a node update function U: $h_v^{(t)} = U\big(\mathrm{AGG}\big(h_v^{(t-1)}, \{M(h_u^{(t-1)}) \mid u \in \mathcal{N}(v)\}\big)\big)$, where $h_v$ denotes node v's representation, t denotes the t-th iteration and $\mathcal{N}(v)$ denotes v's neighbors; AGG and M aggregate and transform information from the neighbors, respectively, and U generates the latest node representation by updating information from the node itself and all its neighbors.

To the best of our knowledge, NSH [Li et al., 2012] is the first hierarchical LSH algorithm for graph data. It adopts the identity function for M, instantiates U via Feature Hashing, and adopts the concatenation operation on the node and all its neighbors as AGG, which significantly improves on the WL graph kernel in terms of time and space. However, it suffers from clear performance loss. Therefore, KATH [Wu et al., 2018b] was proposed to achieve performance competitive with the WL graph kernel with dramatically reduced computation and storage. In KATH, M employs truncation/filling or MinHash to store a varying number of neighbors in fixed-sized vectors, AGG is implemented by the concatenation operation on the node and all its corresponding neighbors, and U is instantiated as MinHash. In order to improve efficiency, KATH quickly obtains the neighbors of nodes by introducing a traversal table with a fixed number of columns.
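The following Python sketch is a minimal, hypothetical instantiation of this framework (our own illustration, not the published NSH, KATH or NetHash procedures): M and U are MinHash, AGG is the union of a node's own code and its neighbors' messages, and a short loop shows how repeated rounds propagate information along the edges.

```python
import random

def minhash(s, K, seed=0):
    """K salted hash functions simulating K random permutations (MinHash)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(K)]
    return tuple(min(s, key=lambda x: hash((salt, x))) for salt in salts)

def update_node(v, h, adj, K):
    """One round of the general framework for node v:
    M   : MinHash each neighbor's current representation,
    AGG : union of v's own code and the neighbor messages,
    U   : MinHash the aggregated set into v's new representation."""
    messages = [minhash(set(h[u]), K) for u in adj[v]]          # M
    aggregated = set(h[v]) | {x for m in messages for x in m}   # AGG
    return minhash(aggregated, K)                               # U

# toy attributed graph: adjacency lists and node attribute sets
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
attrs = {0: {"C", "aromatic"}, 1: {"N"}, 2: {"O"}}
K = 16
h = {v: minhash(attrs[v], K) for v in adj}          # h^(0): node's own attributes
for _ in range(2):                                  # two rounds of propagation
    h = {v: update_node(v, h, adj, K) for v in adj} # all sketches shared per round
```

As in the MinHash example earlier, the collision rate between two nodes' final sketches serves as their similarity estimate; sharing all sketches between rounds keeps the cost linear in the number of iterations.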
NetHash [Wu et al., 2018a] extends each node into a rooted tree along the edges in AGG, and then recursively sketches the rooted tree from bottom to top. Within the framework, M is instantiated as MinHash on the neighbors, AGG aggregates the content information of the node and all its corresponding neighbors via the union operation, and finally U is implemented by MinHash on the aggregated results from the node and all its corresponding neighbors. Although NetHash significantly improves the efficiency of exploring low-order neighbors at the expense of accuracy, its time complexity grows exponentially with the depth of the rooted tree because each rooted tree is independently sketched, which makes it hard to utilize high-order proximity. Similarly, #GNN [Wu et al., 2021] adopts the same operations as NetHash to implement M, AGG and U. However, it iteratively sketches each node and all its neighbors. In each iteration, all node information is shared in the whole graph for the next iteration, which makes the time complexity linear w.r.t. the number of iterations and in turn solves the problem of scalability. The above-mentioned algorithms can capture content and structure information in the graph simultaneously. In contrast, NodeSketch [Yang et al., 2019] preserves both the 1st- and the 2nd-order node structural proximities by adding self-loop-augmented adjacency vectors of the graph, and recursively sketches each node and all its neighbors via the Weighted MinHash algorithm. Specifically, it implements M and U via Weighted MinHash, and in AGG conducts the union operation on the node, all its corresponding neighbors and the empirical distribution of information from the corresponding neighbors.

The hierarchical LSH algorithms for structured data introduced in Section 3 can be straightforwardly applied to all similarity/distance-based machine learning models and data mining algorithms, e.g., classification, regression, clustering, retrieval, etc., which will not be elaborated here. In this section, we introduce the applications of the hierarchical LSH algorithms in some real-world application scenarios.

The LSH technique was initially applied to large-scale document analysis, i.e., web deduplication over 30,000,000 documents from the AltaVista search engine [Broder, 1997], where each document is represented as a set (the bag-of-words representation). However, the hierarchical structure information in the document, which comprises chapters, paragraphs and sentences, cannot be captured in this way, and thus it is hard to understand the context information. To this end, the document is organized into a tree structure to represent the above hierarchy {chapters, {paragraphs, {sentences}}}. As shown in Figure 2(a), the context information, for example, that Paragraphs 1 and 2 contain two and three sentences, respectively, is embedded into the tree structure. [Gollapudi and Panigrahy, 2008; Chi et al., 2014] recursively sketch the tree-structured document into low-dimensional hashcodes for classification.

Social networks are very common in our daily life. For example, in an academic community, a citation network as shown in Figure 2(b) is composed of papers as nodes, where words in the abstract denote the attributes of a node, and citations between papers as edges; one can then conduct academic paper classification via node classification. In the World Wide Web, a social network consists of users with rich profiles as nodes and friendships between users as edges, and one can recommend friends to the users by link prediction.
These applications can be implemented through network embedding, which represents each node in a network as a low-dimensional vector with the similarity between the nodes preserved. To this end, NodeSketch [Yang et al., 2019] and #GNN [Wu et al., 2021] iteratively sketch each node and all its neighbors, from the node itself to the high-order neighbors, while NetHash [Wu et al., 2018a] extends each node into a rooted tree and recursively sketches the rooted tree from bottom to top.

Bioinformatics is an interdisciplinary research field aiming to understand biological data through mathematical, statistical and computer science tools. Gene analysis, most of which depends on similarity computation, is an important problem in bioinformatics. A gene is an enumerated collection of nucleotides (i.e., A, T, C and G) in which repetitions are allowed and order matters. In order to preserve the order information, the technique of k-mers is adopted, where a k-mer is a substring of length k (similar to shingling in NLP). Consequently, the gene is transformed into the set of all its subsequences of length k, where each k-mer preserves local order information. Gallager LSH [Luo et al., 2016] hierarchically sketches the set of k-mers into smaller-sized ones from bottom to top.

A molecule is represented as a graph $G = (V, E, \ell)$, where the atoms form the node set $V$, the bonds between atoms form the edge set $E$, and $\ell: V \to L$ is a function that assigns a label (i.e., the name of the atom) from a label set $L$ to each node. Also, a molecule is associated with a class label $y$, which denotes the result of the molecule in a bioassay test or clinical diagnosis, e.g., inflammasome, cancer and COVID-19 antigen tests. Molecular classification aims to predict the class labels of unseen molecules by learning a classifier from labeled molecules. NSH [Li et al., 2012] and KATH [Wu et al., 2018b] hierarchically sketch the subtrees composed of the nodes and all their corresponding neighbors in each molecule, which in turn represents each molecule as a hashcode $h_G$ with the similarity preserved, serving as the kernel for graph classification.

Urban computing aims to solve problems in urban development, e.g., traffic flows, energy consumption, meteorology and moving trajectories, by resorting to computer science, which helps to understand the essence of urban phenomena and even predict urban futures. One of the core research problems is to learn knowledge from spatio-temporal data, e.g., trajectories. Generally, a trajectory is represented as a sequence of ordered points $T = (p_1, p_2, \cdots, p_N)$. MRTS [Astefanoaei et al., 2018] estimates the distance between trajectories by hierarchically sketching the trajectories. Furthermore, the points in the trajectories can be assigned semantic information, e.g., user activities. [Liu et al., 2017] retrieves a huge volume of trajectories efficiently by hierarchically sketching the activity trajectories.
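As a concrete illustration of the k-mers representation used in the gene-analysis paragraph above, the following sketch (ours; `kmers` is a hypothetical helper, and plain MinHash stands in for the hierarchical Gallager LSH construction) turns two short nucleotide strings into k-mer sets and estimates their similarity.

```python
import random

def kmers(seq, k):
    """All length-k substrings of a sequence; each k-mer keeps local order."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash(s, K, seed=0):
    """K salted hash functions simulating K random permutations (MinHash)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(K)]
    return [min(s, key=lambda x: hash((salt, x))) for salt in salts]

g1, g2 = "ATCGATCGGA", "ATCGATCGTA"
s1, s2 = kmers(g1, k=4), kmers(g2, k=4)
sig1, sig2 = minhash(s1, K=256), minhash(s2, K=256)
estimated = sum(a == b for a, b in zip(sig1, sig2)) / 256   # close to the exact value
exact = len(s1 & s2) / len(s1 | s2)                         # 0.5 for these toy strings
```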
Healthcare refers to the maintenance or improvement of health, here with the help of sensors. In particular, physiological time series observed on monitors indicate critical body status, especially in intensive care units (ICUs). Physiological time series contain multiple human physiological indices from diverse perspectives; for example, an acute hypotensive episode can be described by amplitude in terms of mean blood pressure, and by shape in terms of trend and cycle frequency. Figure 2(c) shows that series (1, 2) and (3, 4) are grouped as similar under the $\ell_1$ distance, while (1, 3) and (2, 4) are grouped as similar under the Cosine distance, which requires normalization. SLSH [Kim et al., 2016] can preserve such different similarity measures by hierarchically sketching the data with a two-level LSH algorithm.

Although the hierarchical LSH family has developed rapidly, with many practical implementations of algorithms that have achieved remarkable successes in a wide range of applications, there are still a number of remaining challenges for future exploration.

Table 1 shows that the existing hierarchical LSH algorithms are mainly built on (Weighted) MinHash, which implies that most of them can only deal with tree-organized sets and then measure the (generalized) Jaccard similarity. However, [Kim et al., 2016] points out that data can be described from diverse perspectives, and thus it is indispensable to design diverse hierarchical LSH algorithms for approximating more similarity/distance measures, e.g., the edit distance.

Variety is one of the "3V" characteristics of big data, which means heterogeneity of data types. For example, a citation network covers heterogeneous information about authors, papers, conferences, terms, venues, etc. The heterogeneous features with distinct statistical properties lie in totally different feature spaces, and such inconsistency makes similarity between heterogeneous data hard to compute. Nevertheless, to the best of our knowledge, the existing LSH algorithms are designed for homogeneous data, and the topic of the LSH technique for heterogeneous structured data is almost untouched. In the above-mentioned citation network, the key point of network embedding is how to map the heterogeneous features of the nodes into a common feature space by LSH. One promising approach is Asymmetric Locality Sensitive Hashing (ALSH) [Shrivastava and Li, 2014].

Definition 6 (Asymmetric Locality Sensitive Hashing [Shrivastava and Li, 2014]). A family $\mathcal{H}$ with two function transformations $Q: \mathbb{R}^D \to \mathbb{R}^{D'}$ and $P: \mathbb{R}^D \to \mathbb{R}^{D'}$ is called $(R, cR, p_1, p_2)$-sensitive if, for a given c-near neighbor instance with a query $q$ and any instance $p$, a hash function $h$ selected uniformly from $\mathcal{H}$ satisfies the following:
• if $\mathrm{dist}(q, p) \le R$, then $\Pr(h(Q(q)) = h(P(p))) \ge p_1$;
• if $\mathrm{dist}(q, p) \ge cR$, then $\Pr(h(Q(q)) = h(P(p))) \le p_2$,
where $c > 1$ and $p_1 > p_2$.

Inspired by the idea of independent asymmetric transformations, one can represent heterogeneous features of the structured data in a common feature space. Therefore, the design of asymmetric transformations is an interesting challenge and the key to combating the problem.
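To give a flavour of what asymmetric transformations look like, the sketch below implements the L2-ALSH construction that [Shrivastava and Li, 2014] proposed for maximum inner product search, shown here only as an illustration of Definition 6; the number of augmentations m and the scaling constant U = 0.83 are tunable parameters assumed from that setting, and how to design such transformations for heterogeneous structured data remains the open question raised above.

```python
import numpy as np

def p_transform(x, m):
    """Preprocessing transformation P for indexed vectors:
    append ||x||^2, ||x||^4, ..., ||x||^(2^m)."""
    norms = [np.linalg.norm(x) ** (2 ** (i + 1)) for i in range(m)]
    return np.concatenate([x, norms])

def q_transform(q, m):
    """Query transformation Q: normalize and append m constants 1/2."""
    return np.concatenate([q / np.linalg.norm(q), np.full(m, 0.5)])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
X = 0.83 * X / np.linalg.norm(X, axis=1).max()   # rescale so every norm stays below U < 1
q = rng.normal(size=8)

m = 3
P = np.stack([p_transform(x, m) for x in X])
Qq = q_transform(q, m)

# With the asymmetric transformations, a small L2 distance between Q(q) and P(x)
# corresponds to a large inner product q.x, so an L2-based LSH (e.g., the p-stable
# scheme of Definition 5) can then be applied symmetrically to P(x) and Q(q).
dists = np.linalg.norm(P - Qq, axis=1)
best = int(np.argmin(dists))   # candidate maximum-inner-product item
```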
Streaming data are similar to time-series data, because both are continuously generated in a certain order. However, streaming data, e.g., network packets, have to be processed incrementally, with no access to the entire data stream and no storage of the historical data, which implies that they can be scanned only once and cannot be stored. Mining streaming structured data mainly faces two challenges. The first one is the continuously increasing dimensionality in the case of the substructure extraction adopted by the algorithms in this survey. As new data instances emerge, the dimensionality of the common feature space grows rapidly, but the historical data cannot be fully projected into the endlessly expanding feature space. The second one is concept drift. The underlying data distribution keeps changing in an unforeseen way as new instances arrive, which further deteriorates the performance of the models. It would be interesting to design LSH algorithms which tolerate substructures emerging in the future and substructures disappearing from the past.

Dynamic structured data refer to data instances that evolve over time. For example, in an evolving social network, the topology changes as nodes and edges are created or removed, because users may join or exit the service and friendships may be established or cancelled; the attributes of nodes and edges may also appear or disappear as user profiles change. Generally, only a part of the dynamic structured data differs between adjacent time stamps. Therefore, it is necessary to dynamically update the evolving part without repeating the computation on the unchanged part. However, the two parts are mutually related, and thus information propagates between them. It would be a great challenge to develop LSH algorithms on dynamic structured data.

In this brief survey, we review the emerging hierarchical LSH algorithms for structured data including trees, sequences and graphs. More importantly, we point out the intrinsic connections among the algorithms: these structured data can be decomposed into (sub)tree structures for hierarchical hashing, which will motivate the LSH community to design hierarchical LSH algorithms in various problem scenarios. We also introduce some typical applications from diverse application fields and discuss the potential challenges. These hierarchical LSH algorithms are highly efficient, which will make the applications more practical in the era of big data.

References
Multi-resolution Sketches and Locality Sensitive Hashing for Fast Trajectory Processing
90 Google Search Statistics for 2022: Usage & User Behavior Data
Applying Syntactic Similarity Algorithms for Enterprise Information Management
Andrew Zisserman, et al. Near Duplicate Image Detection: Min-Hash and Tf-idf Weighting
Neural Message Passing for Quantum Chemistry
Similarity Search in High Dimensions via Hashing
The Power of Two Min-Hashes for Similarity Search among Hierarchical Data Objects
Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
Efficient Manifold Preserving Audio Source Separation Using Locality Sensitive Hashing
Stratified Locality-Sensitive Hashing for Accelerated Physiological Time Series Retrieval
Fast Agglomerative Hierarchical Clustering Algorithm using Locality-Sensitive Hashing
Semantic-aware Query Processing for Activity Trajectories
Low-Density Locality-Sensitive Hashing Boosts Metagenomic Binning
Detecting Near-duplicates for Web Crawling
Boris Weisfeiler and Andrei A. Lehman. A Reduction of a Graph to a Canonical Form and an Algebra Arising During this Reduction
K-Ary Tree Hashing for Fast Graph Classification
Scalable Local-recoding Anonymization Using Locality Sensitive Hashing for Big Data Privacy Preservation