key: cord-0853171-yvhvedoe
authors: Reddy, A. Srinivas; Reddy, P. Krishna; Mondal, Anirban; Priyakumar, U. Deva
title: Mining subgraph coverage patterns from graph transactions
date: 2021-12-02
journal: Int J Data Sci Anal
DOI: 10.1007/s41060-021-00292-y
sha: d0a4df3fe7548c26dac3a9b377148d7af85148c7
doc_id: 853171
cord_uid: yvhvedoe

Pattern mining from graph transactional data (GTD) is an active area of research with applications in the domains of bioinformatics, chemical informatics and social networks. Existing works address the problem of mining frequent subgraphs from GTD. However, the knowledge concerning the coverage aspect of a set of subgraphs is also valuable for improving the performance of several applications. In this regard, we introduce the notion of subgraph coverage patterns (SCPs). Given a GTD, a subgraph coverage pattern is a set of subgraphs subject to relative frequency, coverage and overlap constraints provided by the user. We propose the Subgraph ID-based Flat Transactional (SIFT) framework for the efficient extraction of SCPs from a given GTD. Our performance evaluation using three real datasets demonstrates that our proposed SIFT framework is indeed capable of efficiently extracting SCPs from GTD. Furthermore, we demonstrate the effectiveness of SIFT through a case study in computer-aided drug design.

A complex graph can be built from pieces of knowledge based on the relationships among various entities. Such a graph contains new kinds of interesting and useful knowledge structures. Hence, it can be extremely valuable for opening up new avenues for enhancing applications in several domains. In this regard, graph mining [10, 33] has become an active area of research for mining knowledge from graph representations in bio-informatics, chemical informatics, social networks, computer vision, video indexing, text retrieval and web analysis. Mining the knowledge of frequent subgraphs from graph transactional data (GTD) is an important and active area of graph mining [14, 17, 22, 23, 40, 42] . It has been demonstrated in [29, 34] that frequent subgraph mining can tions by analyzing the interaction patterns of different drug molecules with the target protein.

Consider a scenario, where we have identified a number of low-binding drug molecules. The objective is to improve the structure of the molecules to increase the binding affinity so that the drug would work at lower dosages. Existing FSM techniques are not capable of extracting coverage-related knowledge of patterns. However, if we develop approaches for discovering coverage-related knowledge patterns, such methods can be effectively used to help in optimizing the binding affinity of the drug molecules w.r.t. the selected protein, thereby making the drug design process more efficient.

A subgraph coverage pattern (SCP) is a set (or pattern), whose elements are the subgraphs of GTD. This work addresses the problem of extracting all SCPs, which cover a given percentage of the graph transactions of GTD. Notably, in addition to the coverage aspect, we have to consider the aspect of overlap, which arises as a consequence of considering the coverage of a set of subgraphs. The sets of graph transactions covered by the corresponding subgraphs of an SCP may contain the common transactions, which we refer to as overlap. In addition to coverage, we consider an SCP is interesting if there is a minimal overlap among the graph transactions covered by subgraphs of an SCP.

Given a GTD, the issue is to extract all of the possible SCPs subject to the constraints associated with coverage and overlap. A brute-force approach would be to first extract all of the possible subgraphs of GTD by employing a frequent subgraph extraction algorithm [9, 22, 25, 42] and then determine the coverage and overlap for each possible pattern consisting of subgraphs. Intuitively, such an approach would be prohibitively expensive because the number of possible subgraphs in a given GTD typically explodes. Consequently, the number of candidate patterns to be considered for the extraction of SCPs also explodes.

For efficiently extracting SCPs from a given GTD, we introduce the model of SCPs and present the framework to extract SCPs. In the proposed model, we define the notion of coverage and overlap and present the problem to extract SCPs. The problem is to extract all SCPs from a given GTD by satisfying the threshold values of given coverage and overlap. We present a framework to extract SCPs, which we designate as subgraph-identifier-based flat transactional (SIFT) framework. As a part of SIFT framework, we propose an approach to extract all subgraphs of the graph transactions in GTD and assign a unique subgraph identifier (SID) to each subgraph. Next, each graph transaction in GTD is transformed into the corresponding flat transaction, which consists of the corresponding SIDs. We also propose an approach to extract SCPs from the flat transactional dataset. The problem is similar to that of coverage pattern extraction [36] from flat transactional databases subject to coverage and overlap constraints. Incidentally, overlap follows the sorted closure property [28] , which we shall use in this work for facilitating effective pruning. Hence, we extend the existing pattern extraction approach [36] , which employs pruning strategy based on overlap, for the efficient extraction of SCPs. Observe that by forming a set of flat transactions, complex computationally intensive graph operations associated by coverage and overlap are replaced with the corresponding simpler and relatively fast set-based operations.

The key contributions of this work are three-fold:

1. We introduce the notion of subgraph coverage patterns (SCPs) for GTD. 2. We propose the SIFT framework for efficiently extracting SCPs from a given GTD. 3. We conduct an extensive performance study using three real datasets to demonstrate that it is indeed feasible to efficiently extract SCPs using our proposed SIFT framework. We also demonstrate the effectiveness of SIFT through a case study in computer-aided drug design.

To the best of our knowledge, this is the first work to consider the extraction of subgraph coverage patterns from graph transactional data. The remainder of this paper is organized as follows. Section 2 reviews related works and background information. Section 3 discusses the proposed framework of the problem. Section 4 presents our proposed SIFT framework. Section 5 reports our performance evaluation. Section 6 presents our case study. Finally, we conclude in Sect. 7.

This section discusses related works and background.

Research efforts, such as the gindex approach [44] , have designed indexes for extracting subgraphs by considering line, cycle and star as basic graph query structures. Moreover, the GString semantic-based approach [24] indexes chemical compounds databases.

Graph mining techniques have also been applied in GTDs. The work in [22] used an apriori-based algorithm for discovering frequent subgraphs in GTDs. Moreover, the work in [25] discussed the frequent subgraph mining algorithm, which incorporates canonical labelling in conjunction with sparse graph representation to reduce both time and space complexity. The work in [42] proposed the gSpan algorithm for discovering frequent subgraphs without candidate generation. In particular, gSpan uses a lexicographic ordering for mapping each graph to a unique minimum depth-first-search code as its canonical label. A good survey on frequent sub-graph mining techniques for GTDs can be found in [23] . An algorithm to extract the top-k frequent subgraphs has been proposed in [17] .

The work in [26] proposed REAFUM, an approximate subgraph mining framework that constructs a list of representative graphs. It extracts frequent representative subgraphs, allows approximate matches and extracts consensus patterns. Another work in [48] proposed HOS-PLOC, a local clustering framework that extracts a small high-order conductance cluster, which largely preserves the user-specified network structures.

The work in [9] proposed an algorithm to mine molecular fragments based on association rule mining. These molecular fragments help in discriminating drug classes. Researchers have made efforts to model protein-ligand complex (PLC) as graph structures and extracted frequently occurring subgraph patterns. The works in [34, 35] proposed GReMLIN (graph mining strategy to infer protein-ligand interaction patterns), which is a methodology to search for conserved protein-ligand interactions in a group of related proteinsligand complexes. They use frequent subgraph mining to recognize structural patterns relevant to protein-ligand interaction. The work in [29] modeled PLC as a bipartite graph and used graph topological properties like degree, closeness, communicability, eccentricity, node betweenness and edge betweenness to summarize and extract frequent patterns. The work in [40] proposed FERRARI, which is a visual exploratory subgraph search framework. In particular, it employs two index structures VACCINE and ADVISE for indexing frequent and infrequent subgraphs to improve efficiency and scalability.

The notion of coverage has been well explored in set theory in the form of the set cover problem [12] and the hitting set problem [15] . In graph theory, the notion of coverage has been explored in the form of the minimum vertex cover problem [11, 39] , hitting set problem in hypergraph, traversals of a hypergraph [20] , clique cover problem [19] and influence maximization problem [27, 47] . The notion of coverage has been used in hypergraphs in the form of the minimum hitting set problem [7] , which involves the extraction of the set of vertices that have a non-empty intersection with every hyperedge. The work in [31] states that a set of k-mers is a universal hitting set if every possible L-long sequence contains a k-mer from a given DNA/RNA sequence dataset.

Clique cover is another important problem with applications in compiler optimization, computational geometry and applied statistics [19] . Information coverage maximization in social networks [27, 47] also uses coverage. An approach was proposed in [46] to select a set of influential users. More recently, the work in [41] proposed TOPKLS, a local search algorithm for finding diversified top-k cliques from a given graph. The notion of coverage has been well explored to solve the maximum coverage problem in facility location [6] . However, these works explore the notion of coverage for a single graph as opposed to a GTD, which is our focus. Moreover, the works in [18, 36] find coverage patterns in transactional data using pattern-growth and level-wise pruning approaches, respectively.

Notably, all of the existing works have addressed the issue of GTDs for graph search and mining of frequent subgraphs related knowledge with applications in chemical and biological areas. The issue of extracting the knowledge related to coverage from GTD has not been addressed. In contrast with the existing works [29, 40, 42] , in this paper, we have made an effort to propose an approach to extract coverage-related knowledge from GTDs.

We shall now discuss the model of subgraph discovery from graph transactions and coverage patterns.

In the literature, several efforts [4, 8] are being made to discover the knowledge of subgraphs from the given set of graph transactions. In particular, in the area of bioinformatics, research efforts [22, 42] have been made by modeling chemical compounds as graph transactions. By considering a given chemical compound as a single unit, the corresponding graph transaction represents chemical elements as vertices and chemical bonds among them as edges. We first present the notion of graph transactions and briefly explain the subgraph discovery approach, which was presented in [42] .

A graph transaction G = (V ,E,L,l) is a labeled, connected, simple and undirected graph, where V is a set of vertices, E ⊆ V 2 is a set of edges, L is a set of labels and l : V ∪ E − → L, where l is a function for assigning labels to vertices and edges. A graph transactional dataset (GTD) D comprises n such graph transactions, where the value of n is typically large. Notably, a vertex/edge may belong to multiple graph transactions in D. A portion S of a graph transaction G is called a subgraph of G. employ a subgraph discovery algorithm (e.g., gSpan [42] ) for extracting candidate subgraphs from D. The gSpan algorithm employs a depth-first search (DFS) strategy to extract all subgraphs without candidate generation. It uses a patterngrowth approach to build a hierarchical search tree called the DFS Code Tree. In the DFS Code Tree, every node represents a subgraph/graph, and any subgraph/graph in GTD can find its node in the DFS Code Tree. Each node in the tree is assigned with a lexicographical canonical label called the DFS code and one subgraph can have multiple DFS codes. The first DFS code in pre-order traversal over the DFS Code Tree is called the minimum DFS code and is assigned as the canonical label to the subgraph. Moreover, gSpan also prunes all the nodes that contain non-minimum DFS code, thereby reducing the size of the DFS Code Tree. A depth-first search over the DFS Code Tree extracts all minimum DFS codes of all candidate subgraphs in D. The performance of gSpan improves drastically due to the merging of isomorphism test and subgraph growth into one procedure.

We shall now explain the concept of coverage patterns [18, 36] . Coverage patterns are characterized by the notions of relative frequency, coverage support and overlap ratio. Given a transactional database D, each transaction is a subset of a set I of m items {i 1 ,i 2 ,i 3 ,. . ., i m }. T i k denotes the set of transactions in which item i k is present. The fraction of transactions containing a given item i k is designated as Relative Frequency of i k (RF(i k )). Hence, RF(i k )= |T i k | |D| . A given item is considered as frequent if its relative frequency is greater than that of a threshold value, which we designate as minRF. A pattern P is a subset of items in I, i.e., P⊆I where P={i p , i q , . . ., i r }, where 1 ≤ p, q, r ≤ m. The Coverage Set (CSet(P)) of a pattern P is the set of all the transactions that contain at least one item from the pattern P, i.e., CSet(P)=T i p ∪ T i q ∪ . . . ∪ T i r . The Coverage Support of a pattern P (CS(P)) is the ratio of the size of CSet(P) to the size of D, i.e., CS(P)= |C Set(P)| |D| . In order to add a new item to the pattern P such that the coverage support increases significantly, the notion of overlap ratio is introduced. (This is possible in the case when the number of transactions, which are common to the new item and the pattern P, is low.) Given a pattern P, the notion of overlap ratio of P satisfies the sorted closure property [28] , when the items in P are sorted in decreasing order of their relative frequencies, i.e.,

The Overlap Ratio of a pattern P (OR(P)) is the ratio of the number of transactions that are common between CSet(P −i r ) and T i r to the number of transactions in T i r , i.e.,

. A high value of coverage support indicates more number of transactions and a low value of overlap ratio means less repetitions among the transactions. A pattern is interesting if its coverage support is greater than or equal to the user-specified minimum Coverage Support threshold value (minCS) and its overlap ratio is less than or equal to the user-specified maximum Overlap Ratio threshold value (maxOR). Given the values of minRF, minCS and maxOR, a pattern P={i p , i q , . . ., i r } is considered as a coverage pattern if RF(i k ) ≥ minRF ∀ i k ∈ P, CS(P) ≥ minCS and OR(P) ≤ maxOR. By exploiting the sorted closure property of overlap ratio, a level-wise apriori-based approach has been proposed in [36] and a pattern-growth-based approach has been proposed in [18] for extracting all coverage patterns from D, given minRF, minCS and maxOR values. Additionally, a MapReduce-based algorithm to extract coverage patterns has been proposed in [32] .

To extract the knowledge of coverage patterns, the following heuristics can be followed for setting minRF, minCS and maxOR values. Normally, the coverage patterns with maximum coverage (100%) and minimum overlap ratio (0%) are interesting. Moreover, the coverage patterns having items with less relative frequency are not interesting. So, minRF threshold value can be set based on the characteristics of the application. In the beginning, as a heuristic, coverage patterns can be extracted by setting maxOR at 0 and minCS at 1 and minRF can be set to 50% of minCS value. Then, based on the requirement of the number of coverage patterns, maxOR can be increased gradually, while minCS and minRF can be decreased gradually.

Consider a graph transactional dataset (GTD) D, a minimum relative frequency threshold min R F g , a minimum coverage support threshold minC S g and a maximum overlap threshold max O g . Our proposed SIFT framework returns the set of all subgraph coverage patterns (SCPs) subject to the min R F g , minC S g and max O g constraints. Note that we employ the notations min R F g , minC S g and max O g , (i.e., we add a subscript g to minRF, minCS and maxO), to represent minimum relative frequency, minimum coverage support and maximum overlap thresholds concerning a graph transactional dataset. Now we shall explain the relevant terminology to present the framework of the problem. Table 1 depicts the summary of notations used in this paper.

Recall the notions of graph transaction, graph transactional dataset and subgraph from the discussions in Sect. 2.2. Given a GTD D and the set of all possible subgraphs over D, a subgraph pattern (S P) is a set of subgraphs belonging to . Consider a graph transaction G i ∈ D. A subgraph S j is said to cover G i if S j exists in G i . We define cover (S j , G i ) as follows:

Computation of cover (S j , G i ) involves solving the subgraph isomorphism problem [25] , which is NP-complete [16] . The gSpan algorithm [42] uses a canonical labeling system called DFS lexicographical order, which assigns minimum DFS code to each graph as the canonical label. We compute cover (S j , G i ) based on DFS codes. The Cover Set of S j (C Set g (S j , D)) is defined as the set of all graph transactions covered by S j . Formally,

The Cover Set of S P (C Set g (S P, D)) is a set of all graph transactions, which are covered by at least one subgraph of S P. It is equal to the union of all graph transactions covered by all the subgraphs in S P. Hence, C Set g (S P, D) = ∀S j ∈S P C Set g (S j , D).

Given D and S j , we denote the percentage of graph transactions in D covered by S j as relative frequency R F g of S j . We compute R F g (S j , D) as follows:

Here, 0 ≤ R F g (S j ,D) ≤ 1. We can extract subgraphs of interest from D based on user-specified minimum relative frequency (min R F g ) threshold.

Example 2 Consider a sample graph transactional dataset D comprising of 10 graph transactions G 1 to G 10 , shown in Fig. 2a . Three subgraphs S 1 , S 2 and S 3 are shown in Fig. 2b .

Here, S 1 is a subgraph of G 1 , G 6 and G 10 ; S 2 is a subgraph of G 5 , G 7 and G 8 ; and S 3 is a subgraph of G 4 and G 7 .

Similarly, RF values of S 2 and S 3 are 0.3 and 0.2, respectively.

Given D and a subgraph pattern S P, the coverage support of S P (C S g (S P, D)) is the percentage of graph transactions in D covered by at least one subgraph in S P. We compute C S g (S P, D) as follows:

Here, 0 ≤ C S g (S P, D) ≤ 1. Notably, C S g (S P, D) = 1 when all of the graph transactions in D are covered by S P. Conversely, C S g (S P, D) = 0 when none of the graph transactions are covered by S P. A pattern S P is interesting w.r.t coverage perspective if C S g (S P, D) ≥ minC S g , where minC S g is a user-defined minimum coverage support threshold for graph transactions.

A pattern S P, which satisfies a given minC S g constraint, may not be interesting if there is significant overlap among the sets of transactions covered by subgraphs of S P. In several applications, an S P with maximum coverage support and minimum overlap could be interesting. We now explain the notion of overlap g for capturing the overlap associated with graph transactions.

In the literature, the concept of overlap between sets is most often described using Euler diagrams [5] . Consider two sets A and B in a universe of objects. The overlap of A and number of times an object is appearing in either A or B. As a result, it is not possible to attach a physical meaning unless we know the nature of repetition of objects in A and B. In this paper, we present the notion of overlap by considering the average number of times an object can appear in the given multi-set (contains duplicate elements). Let M(S P, D) be the multi-set, which contains all transactions (with duplicate entries) covered by each subgraph in S P. We define the value of overlap g as the average number of times a transaction is repeated in M(S P, D). We define overlap g as follows:

Observe that the value of overlap g can exceed 100% as the size of |M(S P, D)| could be unbounded. In this paper, we restrict the size of |M(S P, D)| by considering that a transaction can only appear twice in |M(S P, D)|. Hence, the notion of overlap g denotes the average number of times a subgraph appears at most twice in D. 

We consider an S P as interesting if the cover set of all subgraphs of S P satisfies the min R F g threshold, overlap of S P satisfies the max O g threshold and coverage support of S P satisfies the minC S g threshold. We designate such S Ps as subgraph coverage patterns (SCPs). The definition of SCP is given below.

Consider D and a pattern S P. We call S P as a subgraph coverage 

Given a graph transactional dataset D, and the values of user-defined constraint parameters min R F g , minC S g and max O g , the problem is to extract all subgraph coverage patterns satisfying these user-defined constraints.

It can be noted that the objective is to extract SCPs with high coverage value for a given application scenario. Normally, the SCPs having subgraphs with low relative frequency value are not interesting. So, there will be significant number of SCPs, which cover small portion of GTD. As minCS threshold increases, the number of SCPs will reduce. Similarly, as minRF increases, the number of SCPs will reduce. Regarding overlap, we consider that the SCPs with minimum overlap will be interesting. Therefore, in a dense data set scenario, a smaller number of SCPs will be returned for lower value of overlap. As overlap threshold increases, the number of SCPs explodes.

This section discusses our proposed SIFT framework.

Given a GTD D and the threshold values of min R F g , minC S g and max O g as input, the goal is to extract all the SCPs from D.

A brute-force approach would be to extract all of the possible subgraphs of D based on min R F g , and then determine C S g and overlap g for each combination of subgraphs by computing the corresponding C Set g values of the given pattern. Each combination of subgraphs of D could be a candidate SCP. The number of candidate SCPs formed by the subgraphs of even a small number of graph transactions would essentially explode, thereby making the extraction of SCPs extremely challenging and difficult to scale.

The basic idea is as follows. We convert the given graph transactions into the corresponding flat transactions. For this purpose, we extract all subgraphs from GTD and assign unique Subgraph IDentifiers (SID) to each subgraph. Next, we convert each graph transaction into flat transaction by including the corresponding SID. Next, we propose an efficient methodology to extract SCPs from flat transactional dataset. We shall henceforth refer to this framework as subgraph ID-based flat transactional (SIFT) framework.

We can intuitively understand that SIFT provides opportunities for efficient determination of candidate sets. Further, it provides efficient way to compute coverage and overlap for each candidate set. This is because by considering each graph transaction as a set of SIDs, the coverage and overlap of a given subgraph pattern can be calculated through a set-based operation. Thus, we are essentially replacing complex and computationally expensive graph-based operations by set-based operations, which are typically faster by several orders of magnitude. Hence, the problem of extracting SCPs becomes the problem of extracting combinations of SIDs from the set of flat transactions. Thus, we propose a pattern mining-based extraction method by exploiting an overlaprelated pruning heuristic, which we shall discuss now.

Incidentally, C S g and overlap g threshold constraints do not satisfy the downward closure property [21] . However, we can exploit the overlap ratio measure proposed in [36] for extracting coverage patterns from a flat transactional dataset. [28] . Consider a candidate pattern S P ={S p ,S q ,. . . ,S r }, where the subgraphs in S P are sorted in descending order of their relative frequencies. When overlap ratio of S P fails to satisfy the maximum overlap ratio threshold, any superset of S P cannot possibly satisfy the maximum overlap threshold. Hence, we can avoid generating supersets of S P. We use this heuristic in our proposed approach for effective pruning of candidate patterns. The steps to extract SCPs from flat transaction are as follows. First, we sort all of the candidate subgraphs in descending order of their relative frequencies.

Then, starting from individual candidate subgraph as S P, we continue to generate candidate S P of progressively larger sizes, while using the pruning heuristic based on sorted closure property of overlap ratio to efficiently prune candidate S P.

Given D, min R F g , minC S g and max O g , our proposed SIFT framework extracts SCPs from D. Figure 3 We shall now explain these steps.

Based on the min R F g threshold, a subgraph discovery algorithm gSpan [42] is used to extract all the subgraphs from D subject to min R F g constraint (refer Sect. 2.2 for gSpan algorithm). We construct the set SG of subgraphs using gSpan algorithm, where each subgraph S j is of the form <Clabel, CSet>, where Clabel represents canonical label of S j and CSet consists of all GIDs of graph transactions that contains subgraph S j . 

The input to this step is a set of subgraphs SG of the form <Clabel, CSet>. In this step, we form the flat transaction for each graph transaction in D. The flat transaction contains the SIDs corresponding to GID. The details are given in Algorithm 2. In Algorithm 2, we maintain two hashmaps SubList: <SID, Clabel> and D f :< f i , S(SID)>, where f i represents i th flat transaction identifier and S(SID) represents set of S I Ds corresponding to the i th graph transaction. For every subgraph S j in SG, we check if the canonical label of S j exists in SubList.Clabel. If it does not exist, we assign a new SID to S j and insert SID, Clabel into SubList. Otherwise, we assign the SID to S j corresponding to Clabel of S j in SubList.Clabel (see . In both the cases, for each subgraph S j and for each GID in CSet of S j , we insert SID of S j into set S(S I D) of flat transaction identifier f i corresponding to GID (see Lines 9-10). The set < f i , S(S I D) > forms the SID-based flat transactional dataset D f (see Line 14) ).

The mapping of subgraphs in graph transaction G i to SIDs in flat transaction f i is a bijective function F represented as follows: Insert < count, s.Clabel > into SubList 5:

x ← count 6: count + + 7: else 8:

x ← SubList.S I D 9: end if 10:

for each graph transaction i in s.C Set do 11:

Insert x into the set D f .S(SID) of corresponding flat transaction identifier f i in D f 12: end for 13: end for 14 : return D f

After converting GTD into SID-based flat transactional dataset, our objective is to extract SCPs subject to the constraints of min R F g , minC S g and max O g . In this section, we explain the process to extract SCPs subject to the minC S g and max O g constraints.

Under a brute-force approach, we would need to compute the values of C S g (S P, D) and overlap g for a prohibitively large number of candidate patterns formed by all SIDs. This is because the minC S g and max O g constraints do not satisfy the downward closure property [21] . However, we can exploit the overlap ratio measure proposed in [18] for extracting coverage patterns from a flat transactional dataset. The overlap ratio measure satisfies the sorted closure property [28] . As explained in Sect. 2.2, the coverage pattern mining algorithm extracts coverage patterns subject to the constraints of minimum relative frequency (minRF), minimum coverage support (minCS) and maximum overlap ratio (maxOR). Now, we explain the equivalence between the minRF, minCS and maxOR constraints for flat transactions (presented in Sect. 2.2 as defined in [18] ) and min R F g , minC S g and max O g constraints associated with SCPs (defined in Sect. 3).

Recall that for flat transactions, the notion of relative frequency R F(i k ) of an item i k is the percentage of transactions, which contain i k . In case of GTD, R F g (S j , D) denotes the percentage of graph transactions, which contain a subgraph S j . Furthermore, for flat transactions, the notion of coverage support C S(X ) of a pattern X is the percentage of the union of transactions covered by each item of X . In case of GTD, C S g (S P, D) of a subgraph pattern S P denotes the percentage of the union of graph transactions covered by each subgraph of S P.

Regarding the overlap aspect, we have defined overlap g concept and max O g constraint for GTD. First, we explain the overlap ratio (OR) constraint, which has been defined to extract coverage patterns for flat transactions [36] . Next, we explain how to employ the OR constraint to extract SCPs subject to the max O g constraint.

Given a pattern X , and if the elements in X are sorted in the descending order of their relative frequency values, Overlap Ratio (O R(X )) of a pattern X , which satisfy the sorted closure property. We shall explain the sorted closure property after defining the overlap ratio of the pattern. The notion of C Set of a pattern has been explained in Sect. 2.2. 

is less than or equal to maxOR, i.e., O R(X ) ≤ max O R, all the non-empty subsets of X containing O s will also have O R less than or equal to maxOR.

Suppose, we extract the set S of coverage patterns from a given GTD with O R(X ) ≤ α. We can compute overlap g (X ) for all X ∈ S and extract coverage patterns with overlap g (X ) ≤ α. For a given pattern X , the relationship between OR and overlap g is given in Theorem 1.

Proof From the definition of overlap g in Sect. 3,

We consider a worst case scenario and assume that

Substituting the values of M(X ) and C Set(X ) in Equation 5 ,

By equating the above equation to α and solving for p,

Thus, we conclude that for a pattern X , when O R(X ) ≤ α, for each pattern X ∈ C l do 6: Notably, we have employed two notions (overlap g and overlap ratio) to capture the notion of overlap. The notion of overlap g is intuitive from the user perspective, whereas overlap ratio (and maxOR) was employed as a pruning measure for efficient extraction of SCPs. For extracting SCPs, we can employ an existing coverage pattern algorithm such as a level-wise pruning based approach [32, 36] or a patterngrowth approach [18] , with the value of minCS equal to minC S g and the value of maxOR equal to max O g .

For extracting SCPs from flat transactional dataset, we employ coverage pattern mining algorithm proposed in [36] . Algorithm 3 depicts the coverage pattern mining algorithm. The inputs are flat transactional dataset D f and user parameters minCS and maxOR values. Coverage pattern mining algorithm exploits apriori like level-wise search approach to find the l-size candidate patterns from (l-1)-size non-overlap patterns (see . A non-overlap pattern is a pattern that satisfies maxOR constraint. It uses sorted closure property to prune the search space and extracts all non-overlap patterns, which become the candidates for the next iteration (see . The considered non-overlap patterns that satisfy the minCS constraint are considered as the SCPs (see . This process is repeated until no new non-overlap patterns are generated.

After extracting the set of SCPs, top-k SCPs can be listed by considering a ranking criteria based on C S g or a combination of C S g and overlap g values of SCPs based on the specific requirements of the application domain.

The time complexity of the proposed SIFT framework is equals:

where O(kmn + rm) is the complexity of subgraph extraction, O(mn) is the time complexity of flat transactions modeling, and m l=1 l(|C l−1 | · |C l−1 |) is the time complexity of SCPs computation. The explanation of each term in Equation 7 is as follows:

First, in SIFT framework, we employ gSpan algorithm to extract subgraphs from GTD. As mentioned in [2] , the complexity of gSpan algorithm to extract all subgraphs from GTD is O(kmn + rm), where k is the maximum number of subgraph isomorphism tests, m is the number of frequent subgraphs, n is the number of graph transactions, and r is the maximum number of duplicate codes of the frequent subgraphs that grow from other minimum DFS codes. It can be noted that extraction of all subgraphs from a given graph transaction is an NP-complete problem [2] . To improve performance, the gSpan algorithm employs the notion of minimum DFS code and converting the subgraph extraction problem into a pattern mining problem through string comparison. In practical scenarios, the value of m ≤ n, the value of r is much less than n and the value of k is small for sparse and diverse labels. Hence, the time complexity for subgraph extraction depends on the value of m × n.

Second, the process to compute the flat transactional dataset from the set of <canonical label of a subgraph, set of the corresponding GIDs> produced in the preceding step consists of two steps. First, we assign an SID to each unique canonical label (Clabel) and compute an hashmap < S I D, Clabel >. The search time for the existence of Clabel in the hashmap take O (1) . Second, after mapping the Clabel with unique SID, for each corresponding GID, we will insert SID into the corresponding flat transaction identifier. The search time to insert is O (1) . Consider that on average, each SID belongs to q number of transactions. The time complexity to compute the flat transactional dataset is bounded by O(m × q) . Notably, q m. Therefore, the time complexity to model flat transactions from graph transactions is proportional to m.

Third, in SIFT framework, we employ an iterative levelwise apriori based algorithm. The time complexity of an iterative pruning algorithm is m l=1 l(|C l−1 | · |C l−1 |), where |C l−1 | is the number of candidate patterns of size l (Refer to Chapter 6 of [38] ). In the proposed SIFT framework, the number of candidate patterns generated depends on the value of overlap ratio threshold maxOR. Normally, at lower values of maxOR, less number of candidate patterns are produced at each level.

Overall, the time complexity of proposed SIFT framework depends on the graph transactional dataset size n, number of subgraphs extracted m and number of candidate patterns generated. Note that the value of m depends on minRF threshold and the number of candidate patterns depends on maxOR threshold. By choosing proper values of minRF and maxOR threshold values, it is indeed capable of extracting subgraph coverage patterns from graph transactional dataset.

We conducted our experiments in the ADA cluster [1] (at IIIT Hyderabad), which consists of 42 Boston SYS-7048GR-TR nodes equipped with dual Intel Xeon E5-2640 v4 processors, providing 40 virtual cores per node. The aggregate theoretical peak performance of ADA is 47.62 TFLOPS. We have conducted experiments on 20 virtual machines. Each virtual machine is allocated with 2 GB memory. We also reported the experiments on the scalability aspect of our proposed approach by varying the number of virtual machines from 5 to 40. We implemented our proposed schemes in Python 3.0. The link to the code for the implementation is provided in the footnote. 1 We used three real datasets, namely Yeast 167 (Yeast anticancer), P388 (Leukemia), from Pubchem [3, 43] and Zinc dataset consisting of drug-like molecules [37] . The Yeast 167 and P388 datasets consist of chemical compounds, which are modeled as graph transactions. In these datasets, each Table 2 summarizes the three datasets.

To the best of our knowledge, there exists no other approach for extracting SCPs from a GTD. As the number of subgraphs increases, the complexity of a naïve brute-force approach for extracting SCPs increases exponentially as it requires the determination of the coverage and overlap values based on prohibitively expensive graph-based computations. Hence, in the absence of any meaningful reference approach for comparison, we define the objective of our performance evaluation toward demonstrating the feasibility of the proposed SIFT framework in extracting SCPs from a given dataset.

We have conducted the experiments by implementing three components of the SIFT framework as follows. First, we employ the gSpan algorithm [42] for extracting all candidate subgraphs from a given GTD and assign SIDs to the extracted subgraphs. Second, we employ the proposed SIFT framework to form the transformed flat transactional dataset over the extracted SIDs. Third, to extract SCPs from the transformed flat transactions, we use the MapReduce-based coverage pattern mining algorithm [32] , which was proposed to extract coverage patterns from flat transactions. Table 3 summarizes the parameters of our performance study.

The performance metrics for extracting SCPs are (i) processing time (T S ) to extract subgraphs, assign SIDs and form SID-based flat transactions, (ii) number of candidate subgraphs (N S ), (iii) average number of SIDs (AV G) in the SID-based flat transactions, (iv) processing time (T SC P ) to extract SCPs from flat transactions, (v) number of patterns (N P ) to be examined for extracting SCPs and (vi) number of SCPs (N SC P ). Here, T S represents the processing time con-sumed to extract subgraphs by accessing the graph dataset from the disk. T SC P is the processing time consumed for extracting SCPs from the SID-based transactional dataset, which resides on disk.

The results in Fig. 4 depict the effect of varying minRF. The results in Fig. 4a indicate the performance of T S , while the results in Fig. 4b show the performance of N S as we vary minRF for the P388 and Yeast datasets. The results in Fig. 4a show that when the value of minRF is low, the value of T S is high. As minRF is increased, the value of T S reduces exponentially due to the pruning effect of minRF. The value of T S depends upon the number of subgraphs extracted from the dataset. The results in Fig. 4b show that the number of SIDs decreases with increase in the value of minRF. This occurs due to decrease in the number of subgraphs that satisfy the minRF constraint. The results in Fig. 4c depict the effect of varying minRF on AVG. The results show that when the value of minRF is low, the value of AVG is high, and as minRF increases, the value of AVG is decreased. This is because at lower value of minRF, there will be a large number of subgraphs, which satisfy the minRF constraint. As the value of minRF increases, the number of subgraphs decreases because less number of subgraphs satisfy the minRF constraint. The value of AVG depends on the number of subgraphs that are extracted. Therefore, at lower values of minRF, a transactional dataset with large AVG is extracted.

The results in Figs. 4d, e and f show that the values of T SC P , N P and N SC P decrease with the increase in the value of minRF, respectively. The reason is that at lower value of minRF, there will be large number of subgraphs satisfying the minRF constraint. As minRF increases, the value of N P decreases because less number of patterns satisfy the minRF constraint. Consequently, the values of T SC P and N SC P also decrease. It can be observed that we have reported results starting from minRF=0.1 as we could not experiment with minRF less than 0.1 due to explosion in the number of patterns.

The results in Fig. 5 depict the effect of varying the value of maxOR. The results in Figs. 5a, b and c show that T SC P , N P and N SC P increase with the increase in maxOR, respectively. The reason is that at lower value of maxOR, there will be less number of patterns, which satisfy the maxOR constraint, thereby resulting in less value of T SC P and N SC P . As maxOR increases, the value of N P increases because more patterns satisfy the maxOR constraint, which increases the value of T SC P and N SC P . It can be observed that even at maxOR=0, there are 48 SCPs in Yeast and 13 SCPs in P388. At higher values of maxOR, more number of SCPs can be extracted.

The results in Fig. 6 depict the effect of varying the value of minCS. The results in Fig. 6a and b indicate that the values of T SC P and N P remain comparable for all the values of minCS for both datasets. The reason is as follows. When the values of minRF and maxOR are fixed, the same number of candidate patterns is examined to extract the SCPs. Therefore, as expected, irrespective of variations in the value of minCS, both the values of T SC P and N P remain comparable.

The results in Fig. 6c indicate that the value of N SC P decreases with increase in the value of minCS. Notably, in the proposed approach, after satisfying the maxOR constraint, we prune a candidate pattern if it does not satisfy the minCS constraint. At higher values of minCS, a candidate pattern will be pruned even though it satisfies the maxOR constraint. As a result, the value of N SC P reduces as we increase the value of minCS.

The results in Figs. 7a-c depict the effect of varying the values of minRF, minCS and maxOR, respectively, for Zinc dataset. The results depict trends similar to P388 and Yeast datasets. The value of N SC P is small due to the small size of Zinc dataset.

The results in Fig. 8a depict the effect of varying the values of minCS and maxOR. The result shows that when the values of minCS and maxOR are low, the value of N SC P is small due to candidate pruning based on maxOR constraint.

When the values of minCS are high and maxOR is low, the value of N SC P decreases further because very few patterns satisfy high value of minCS and low value of maxOR. When the value of minCS is low and maxOR is high, the value of N SC P is high because more number of patterns satisfy low value of minCS and high value of maxOR. However, when the values of minCS and maxOR are high, the value of N SC P will decrease because there are few patterns that may satisfy the high value of minCS.

The results in Fig. 8b depict the effect of varying the values of minCS and minRF. The result shows that at low values of minCS and minRF, the value of N SC P is high. This is because, large number of candidate patterns will be generated at lower values of minRF and most of them satisfy the low value of minCS constraint. When minCS is high and minRF is low, the value of N SC P low because, less number of candidate patterns satisfy the high value of minCS threshold. At low values of minCS and high values of minRF, the value of N SC P is low, due to small number of candidate pattern generation. Further, when the values of minRF and minCS are high, the value of N SC P decreases further.

The results in Fig. 8c depict the effect of varying the values of maxOR and minRF. The results show that N SC P does not vary much at lower values of minRF and maxOR. When we increase the value of minRF, the value of N SC P decreases due to decrease in number of candidate pattern. When the values of minRF and maxOR are high, the value of N SC P is less, due to less number of candidate patterns. However, when the values of minRF and maxOR are high, the number of SCPs explodes due to large number of candidate pattern generation. Figure 9 depicts the effect of varying the number N M of machines. Observe that the value of T SC P decreases with increase in the value of N M . This is due to increase in the parallel extraction of SCPs. However, the change in the value of T SC P decreases with increase in N M and exhibits a saturation effect when more than 30 machines are used. This is due to communication overhead. The results show that the value of T SC P can be reduced by employing additional resources.

Given a dataset, the processing time to extract SCPs equals the sum of the processing time to form SID-based graph transactions (as depicted in Figs. 4a) and the processing time to extract SCPs (as depicted in Fig. 9 ). Overall, the results demonstrate that it is feasible to extract the knowledge of SCPs by processing a reasonable size dataset of Yeast with 79601 graph transactions. 

In this approach, conversion of graph transactions into SIDbased flat transactions is one time computation process. Once graph transactions are transformed to flat transactions, it is always possible to choose relative frequency thresholds greater than min R F g and compute SCPs for various values of minC S g and max O R g . Now let us discuss how to set the values of the parameters such as min R F g , minC S g and max O R g . The min R F g threshold value can be set to half the maximum min R F g value. Then, based on number of coverage patterns, min R F g can be decreased. The goal is to extract SCPs with maximum coverage, while minimizing the overlap to zero. Hence, as a heuristic, we could start with the coverage support value equal to 1 and then progressively keep decreasing the value of coverage support until a desired number of SCPs is obtained. Regarding max O R g , we can start with max O R g equal to 0 and then progressively keep increasing the value of max O R g until a desired number of SCPs can be obtained.

Based on the application, the domain expert can first extract SCPs by setting maxOR=0 and minCS=1. If the domain expert needs more number of SCPs, he can increase maxOR or decrease minCS progressively. Normally, the process of pattern mining is an iterative approach. As we have proposed a pattern mining model, the usual methodologies employed to set threshold values can be employed in this case also.

We demonstrate the feasibility of applying knowledge of SCPs in computer-aided drug design toward developing a drug for coronavirus. Corona viruses that include the SARS coronavirus 2 responsible for the COVID-19 pandemic are pathogens that cause various diseases that are [45] . We consider Zinc database comprising 250000 drug-like molecules. We selected Mpro protein (PDB ID: 1WOF) using the Autodock 4.2 software program [30] . Molecular docking procedure takes each of the 250000 molecules, identifies the best binding mode with the protein, and gives the binding affinity and the protein-ligand bound complex (PLC) structure using a scoring function. Better the intermolecular interactions between the protein and the ligand, better is the binding affinity and better is the molecule for it to be a drug. The top-1000 molecules among the 250000 molecules in the initial dataset that yielded high binding affinity with the protein molecule were chosen for mining SCPs. For our case study, protein-ligand complexes (PLCs) are converted to graphs transactions using GReMLIN [34] . A ligand can interact with the same protein at different sites, producing multiple graphs for the same protein and ligand, but different vertex and edge label sets. Here, vertices are amino acids of proteins and atoms of ligands and the edges are interactions between amino acids and ligands. Example 4 presents the modeling of PLC as a graph transaction. The top-1000 ligands interact with 1WOF protein and form 1000 protein-ligand complexes. GReMLIN generated 4672 graph transactions from these 1000 PLCs. We extracted SCPs by providing minRF=0.025, minCS=0.7 and maxOR=0.5. The time consumed to extract subgraph coverage patterns is about 10 seconds (3.52 seconds to model flat transaction from graph transactions and 6.03 seconds to extract subgraph coverage patterns from flat transactions). The top-8 candidate subgraphs along with their corresponding RF values are depicted in Fig. 10b , and the top-8 SCPs sorted by coverage support and their corresponding overlap ratio are provided in Table 4 .

Consider an SCP {S 1 , S 9 , S 11 } that covers 83% of Zinc dataset with 0 overlap. Figure 11a depicts the overall structure of the Mpro protein and highlights the region, where a drug molecule could bind. We have analyzed the interactions among all residues that have interactions with at least one of the 1000 ligands in the dataset. The analysis regarding the utility of SCPs in understanding protein-ligand interactions and its possible inputs to drug design efforts is as follows. Figure 11b depicts an example of a ligand in which interactions corresponding to S 1 and S 9 subgraphs are possible (orange and pink arrows). As shown in Fig. 11b , it is evident that the method captured the hydrophobic interaction S 1 and aromatic interaction S 9 , which contribute toward the favorable binding affinity between the protein and the ligand. On the other hand, Fig. 11c depicts the protein-ligand hydrogen bonding interactions involving another ligand that represent the S 11 subgraph. Similar to the aromatic and hydrophobic interactions above, the approach captures the hydrogen bonded interaction efficiently. These three subgraphs have an overall coverage of 83% with overlap ratio as zero indicating their prevalence and hence importance for the molecules to bind to the protein. Such a new knowledge gives possible directions for improving the molecule by modifying the structure of these ligands so that multiple modes of interactions are possible and hence, improve the binding affinities. Therefore, the proposed SIFT framework not only helps in understanding the protein-ligand interactions, but also helps in designing better drugs by extracting the knowledge of subgraph coverage patterns.

Subgraph pattern mining is an active research area with applications in the domains of chemical, biological and social networks. Given graph transactional data, existing works have focused on the problem of extracting frequent subgraphs, but they have not considered the problem of extracting the knowledge of coverage-related subgraph patterns. Hence, we have introduced the concept of subgraph coverage patterns. In particular, we have proposed the SIFT framework for extracting subgraph coverage patterns from graph transactional data based on minRF, minCS and maxO constraints. Our performance evaluation with three real datasets demonstrates the effectiveness of the proposed scheme in terms of processing time and pruning efficiency. We have also demonstrated the feasibility of applying the knowledge of SCPs through a case study in the bioinformatics domain. To the best of our knowledge, this is the first work to consider the extraction of subgraph coverage patterns from graph transactional data. Given the prevalence of graph data modeling, the proposed model of SCPs has a potentially huge scope for opening up new avenues for the extraction of interesting knowledge from graph datasets in several important and diverse domains. 

On behalf of all authors, the corresponding author states that there is no conflict of interest.

The source code is available at https://github.com/ srinivas2234/SCPs.

Consent for publication Not applicable.

ADA

UIUC technical report

Grasping frequent subgraph mining for bioinformatics applications

Radial sets: interactive visual analysis of large overlapping sets

Time-constrained maximal covering routing problem

The minimal hitting set generation problem: Algorithms and computation

An updated dashboard of complete search FSM implementations in centralized graph transaction databases

Mining molecular fragments: finding relevant substructures of molecules

Managing and mining graph data

A rough set method for the minimum vertex cover problem of graphs

A greedy heuristic for the set-covering problem

Set cover algorithms for very large datasets

Finding frequent substructures in chemical compounds

Implicit hitting set algorithms for maximum satisfiability modulo theories

The graph isomorphism problem

TKG: Efficient mining of top-k frequent subgraphs

Mining coverage patterns from transactional databases

Data reduction and exact algorithms for clique cover

Symbolic learning for improving the performance of transversalcomputation algorithms

Frequent pattern mining: current status and future directions

An apriori-based algorithm for mining frequent substructures from graph data

A survey of frequent subgraph mining algorithms

GString: A novel approach for efficient search in graph databases

Frequent subgraph discovery

REAFUM: Representative approximate frequent subgraph mining

Influence maximization on social graphs: A survey

Mining association rules with multiple minimum supports

CALI: A novel visual model for frequent pattern mining in protein-ligand graphs

AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

Coverage pattern mining based on MapReduce

Graph mining: A survey of graph mining techniques

visGReMLIN: Graph mining-based detection and visualization of conserved motifs at 3D protein-ligand interface at the atomic level

GReMLIN: A graph mining strategy to infer protein-ligand interaction patterns

Discovering coverage patterns for banner advertisement placement

ZINC 15 -Ligand discovery for everyone

Introduction to Data Mining

Improving local search in a minimum vertex cover solver for classes of networks

FERRARI: an efficient framework for visual exploratory subgraph search in graph databases

Local search for diversified top-k clique search problem

gSpan: Graph-based substructure pattern mining

Mining significant graph patterns by leap search

Graph indexing: a frequent structurebased approach

Design of wide-spectrum inhibitors targeting coronavirus main proteases

Influence maximization in social networks based on TOPSIS

Information coverage maximization in social networks

A local algorithm for structure-preserving graph cut

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations