How and Why is An Answer (Still) Correct? Maintaining Provenance in Dynamic Knowledge Graphs
Gaur, Garima; Bhattacharya, Arnab; Bedathur, Srikanta
2020-07-29

Knowledge graphs (KGs) have increasingly become the backbone of many critical knowledge-centric applications. Most large-scale KGs used in practice are automatically constructed based on an ensemble of extraction techniques applied over diverse data sources. Therefore, it is important to establish the provenance of results for a query to determine how these were computed. Provenance is shown to be useful for assigning confidence scores to the results, for debugging the KG generation itself, and for providing answer explanations. In many such applications, certain queries are registered as standing queries since their answers are needed often. However, KGs keep continuously changing due to reasons such as changes in the source data, improvements to the extraction techniques, refinement/enrichment of information, and so on. This brings us to the issue of efficiently maintaining the provenance polynomials of complex graph pattern queries for dynamic and large KGs, instead of having to recompute them from scratch each time the KG is updated. Addressing these issues, we present HUKA, which uses provenance polynomials for tracking the derivation of query results over knowledge graphs by encoding the edges involved in generating the answer. More importantly, HUKA also maintains these provenance polynomials in the face of updates (insertions as well as deletions of facts) to the underlying KG. Experimental results over large real-world KGs such as YAGO and DBpedia with various benchmark SPARQL query workloads reveal that HUKA can be almost 50 times faster than existing systems for provenance computation on dynamic KGs.

Most real-world KGs are automatically constructed using one or more information extraction pipelines over a variety of underlying data sources. As a result, the KG could contain facts obtained through different mechanisms and, thus, a query result could be generated through a combination of facts with vastly varying provenance. For instance, it is possible that a fact derived from research studies is joined with a fact from web-based user-generated sources [40]. Tracking fine-grained provenance of individual facts in a knowledge graph alone (as done in [29]) is not sufficient, since we also need to track the provenance of individual query results, i.e., which specific facts were involved in generating an individual result. In this paper, we use provenance polynomials [19] to track the provenance of answers of a graph pattern query over knowledge graphs stored in property-graph engines such as Neo4j. While provenance polynomials have been successfully used by relational databases [3, 11, 19, 35, 36] and RDF triple stores [44] in the past, they are not available over a property graph store such as Neo4j. In addition to providing efficient derivation of provenance information for query answers, we focus on keeping the query answers as well as their provenance derivations easily updatable as the underlying KG is updated. The "knowledge" contained in most real-world KGs continually evolves over time. Such evolution could be due to the addition of new facts obtained from novel data sources, via knowledge-base completion (KBC) techniques [39], continuous fact refinement approaches [31], etc.
Similarly, facts could be culled from the KG when evidence that they are incorrect/invalid is obtained. For example, consider a (fragment of a) medical knowledge graph extracted from recent medical news regarding the novel SARS-CoV-2 virus: one may observe new symptoms such as "SARS-CoV-2 causes loss of smell" and "SARS-CoV-2 causes skin rashes", which need to be included in the medical KG. Addition of such new symptoms can result in (re-)classifying many patients, whose symptoms were considered benign earlier, as potential COVID-19 patients now. Note that it is also possible to de-classify some symptoms, resulting in the deletion of the corresponding fact(s) from the KG. As a result, queries such as "return all COVID-19 positive cases and their contacts in a hospital database", which are very important and are registered as standing queries, may generate significantly different answers, with potentially significant effect on the real-world decisions based on those answers. Therefore, an important technical challenge that needs to be addressed is to determine whether the answers to a query are still valid or up-to-date in the face of changes to the KG on which it is evaluated, and if so, to demonstrate how by presenting the evidence. A typical evidence one seeks is the provenance of how a query result was generated, identifying the database entries that were responsible for computing each result instance. These are the why and how provenance of queries. Provenance is also important in demonstrating why the result of a query is no longer valid if it is not so. A simplistic way to check if a query result is still valid is to re-execute all the registered queries after each update to the knowledge graph, and generate an alert if a result of any query differs from the previous one. Clearly, this is impractical due to the potentially large number of registered queries, the large scale of knowledge graphs, the need to materialize a potentially large volume of query results for comparison, and the high frequency of updates to the KG. One may also wonder whether incremental graph pattern matching algorithms such as TurboFlux [27], IncIsoMat [16], SJ-Tree [12], Graphflow [26], etc. can be used to maintain the basic graph pattern queries. However, none of these algorithms support computing and maintaining the provenance (either why or how) of queries. Further, the presence of projections in the queries we consider (see Sec. 3) makes simple pattern finding harder [9]. Thus, incremental graph pattern matching algorithms are not applicable in the settings we consider in this paper. Our contributions in this paper are:
• We present an algorithmic framework called HUKA (maintaining How provenance under Updates to Knowledge grAphs) that achieves the above two objectives by incrementally maintaining the how-provenance of query results in the face of updates to the underlying knowledge graph (Sec. 4). Our solution is based on an adaptation of provenance polynomials [19] for KGs queried using SPARQL. It addresses both the why- and how-provenance of conjunctive queries under dynamic KGs where both insertion and deletion of facts can take place (updates to a fact are modeled using a deletion and a subsequent insertion).
• HUKA compactly maintains, using provenance polynomials, the results of all standing queries registered for maintenance, and the facts that are relevant for the query (Sec. 5).
It also maintains information about the subqueries obtained by removing one edge at a time from the basic graph pattern of the query. In order to exploit the shared plans across all queries/subqueries in the workload, it merges individual subquery execution plans generated using AND-OR graphs [30] to form a single global execution tree. When the KG is updated, the updated query results are computed using a filter-and-refine paradigm that helps to quickly recompute the subquery results as well (Sec. 6).
• We use two large real-world knowledge graphs for empirical evaluation, namely, YAGO2 [24] and DBpedia [8]. Our results show that we can update the answer sets along with their provenance polynomials in about 1 second, and can be faster than the baselines by almost 50 times (Sec. 7).
The need for query result provenance, earlier called lineage, to better understand the role of underlying data in generating a materialized view was first felt and fulfilled [13] in data warehouse environments. Query provenance is utilized for various tasks like schema debugging [2], reverse engineering queries [15], trust assessment [37], query computation in probabilistic databases [34], etc. The first attempt to formalize the concept of provenance was by [10], who proposed a model to capture the why and where provenance. Later, more models capturing different aspects of query provenance, including when [33], how [19], and why not [7], were proposed. These variations provide different dimensions to understand the query results. The seminal work by Green et al. [19] introduced the how provenance and modeled it using provenance semirings. They used annotated relations to encode the provenance and presented a framework to propagate these annotations along with the query computation. This model provides a symbolic representation of the derivation process of query answers: each answer is annotated with a polynomial, with each monomial representing a single derivation. The how provenance thus captures, along with the derivation process, the data items involved in it. Since how-provenance subsumes why-provenance, we use the how-provenance model to capture the provenance.
Query Provenance in Systems. In practice, these models have been adopted quite well to compute the provenance along with query evaluation. Trio [1] is one of the early systems to support lineage computation. The efforts to support query processing along with provenance computation resulted in several systems. These include Orchestra [25], which supports provenance computation in a collaborative environment; TripleProv [44], which supports how provenance for linked data; GProM [3], a successor of Perm [18], which supports why, where and how provenance; and ProvSQL [36], which supports the same. These systems, however, enable computing, storing and querying provenance only for a static dataset. Recently, [21] proposed a theoretical model to maintain data provenance in an RDF versioning system. Avgoustaki et al. [5] address the problem of computing provenance of data generated by SPARQL INSERT operations. We, on the other hand, compute and maintain the provenance of pre-computed query results of SPARQL conjunctive queries. The work that can be considered as our counterpart is by [17] on maintenance of provenance under fact deletions. In this paper, we focus on maintenance of provenance of query results when facts, i.e., edges, are added. We also present deletion handling for completeness.
Knowledge Graph.
A knowledge graph (KG) G(V, E, R, I) is a directed graph with a set of vertices V, a set of edges E, a set of relationships R, and a function I : E → N associating each edge with a unique identifier, given as I(e) = e_i with e_i ∈ N. Each edge e(u, l(e), v) is labeled with l(e) (∈ R), which represents the relation between vertices u and v. The complete description of an edge is (e_i, u, l(e), v); however, we will often use either (u, l(e), v) or e_i to refer to a particular edge with identifier i. Note that this representation closely corresponds to the RDF model used commonly to model a KG. Each entity or property value represented as a vertex in the KG has an associated URI or a literal, and we assume that they are assigned a unique identifier (i.e., through I(·)). A labeled edge corresponds to a fact, also represented as a triple ⟨u, l(e), v⟩, where u, l(e), and v are respectively the subject, predicate and object of the fact.
Graph Query. Queries on the KG are posed as graph patterns, and the task is to find subgraphs that match the pattern. The graph query language SPARQL is used to formally express a query as a collection of triple patterns. Similar to a triple, a triple pattern also consists of a subject, a predicate and an object. The subject and object can be variables or can be bound to a particular vertex of the KG. A predicate can also be a variable or can be one of the labels of the edges of the KG. A graph query is formulated as a conjunction of triple patterns, and the aim is to find all possible bindings of the variables of the triple patterns. Thus, each edge in the query graph corresponds to a triple pattern that needs to be matched in the KG. In this paper, we restrict ourselves to core basic graph pattern (BGP) queries in SPARQL. For the fragment of SPARQL graph pattern queries we consider, it is also known that they can be expressed using the SPJ (Select-Project-Join) fragment of relational algebra [14, 44]. We leave it for future work to adopt the ideas of [43, 44] to extend to full SPARQL. We also do not consider path expression queries, since their provenance derivation is tantamount to enumerating all paths between vertices in a graph, which is known to be intractable [4].
Graph Query Answer. The answer A(Q) of a query Q is the collection of subgraphs of the KG that match the query pattern Q. To formally define a matching subgraph, consider a query Q = {t_1, ..., t_n} as a collection of n triple patterns. We define a surjective mapping v_S : Q → E_S which maps each triple pattern of the query Q to an edge of a subgraph S(V_S, E_S). A subgraph S is said to match a query pattern Q if ∀t_i ∈ Q, ∃e_j ∈ E_S such that v_S(t_i) = e_j. This mapping enforces the query pattern constraints; in other words, the mapping ensures that if two triple patterns t_i and t_j of a query have common node(s), then the corresponding edges v_S(t_i) and v_S(t_j) preserve the same restriction.
Our goal is to maintain the results and their provenance for a set of standing queries that have been registered for maintenance in the system, as the underlying KG undergoes updates, i.e., insertion/deletion of edges/nodes. In this section, we delineate the provenance model used and the key idea of maintaining potential matches. In Sec. 5, we discuss the offline phase involving registration of a standing query. When an update occurs, HUKA operates in the online phase, as discussed in Sec. 6.
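To make these definitions concrete, the following is a minimal, self-contained sketch in Python (the toy data and function names are ours, not HUKA's implementation): a KG is a set of identified edges, and a brute-force matcher maps each triple pattern of a BGP query to an edge while respecting shared variables.

```python
from itertools import product

# A fact is an identified edge (e_id, subject, predicate, object); a triple
# pattern uses strings starting with '?' as variables.
KG = [
    ("e1", "Sarawagi", "hadAdvisor", "Stonebraker"),
    ("e2", "Ramakrishnan", "hadAdvisor", "Stonebraker"),
    ("e3", "Stonebraker", "worksIn", "MIT"),
]
QUERY = [("?stud", "hadAdvisor", "?prof"), ("?prof", "worksIn", "?org")]

def is_var(term):
    return term.startswith("?")

def bind(pattern, edge, binding):
    """Try to extend a variable binding so that `pattern` maps to `edge`;
    return the extended binding, or None if a constant does not match or a
    variable would be bound to two different nodes."""
    new = dict(binding)
    for q, g in zip(pattern, edge[1:]):
        if is_var(q):
            if new.setdefault(q, g) != g:
                return None
        elif q != g:
            return None
    return new

def matches(query, kg):
    """Brute-force enumeration of subgraphs matching a BGP query.
    Yields (variable binding, list of edge identifiers used)."""
    for combo in product(kg, repeat=len(query)):
        binding = {}
        for pattern, edge in zip(query, combo):
            binding = bind(pattern, edge, binding)
            if binding is None:
                break
        if binding is not None:
            yield binding, [edge[0] for edge in combo]

for answer, edges in matches(QUERY, KG):
    print(answer, edges)   # each edge list is one derivation of the answer
```

The list of edge identifiers collected for each binding is exactly the information that a how-provenance monomial records, as discussed next.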
Throughout the paper, we will use a running example of a toy knowledge graph depicting relationships between academic researchers, and a query pattern, illustrated in Fig. 1. Facts such as Sarawagi had Stonebraker as her advisor are represented using a directed edge labeled hadAdvisor connecting the corresponding vertices, and are stored as triples such as ⟨Sarawagi, hadAdvisor, Stonebraker⟩ in RDF. There are relationships such as coAuthor represented using a bi-directional edge, which correspond to two triples, e.g., ⟨Sarawagi, coAuthor, Ramakrishnan⟩ and ⟨Ramakrishnan, coAuthor, Sarawagi⟩. We define an example graph pattern query on the above KG which seeks to return professors as well as all collaborators of their students. Note that it only accepts professors and collaborators currently working in an organization. It can be written in SPARQL as a conjunction of five triple patterns; the corresponding query graph is given in Fig. 1b.
The two important classes of query provenance in graphs, namely the why provenance and the how provenance [11], are next illustrated using our running example KG and the graph pattern query from Fig. 1. Consider Table 1, which lists the results, and their provenance, of running the query on the example KG. The why-provenance is simply a set of sets, each comprising the edges that contributed to a particular derivation of a specific result item. The how-provenance, on the other hand, symbolically encodes the interaction of the edges involved in the derivations of an answer. As shown in Table 1, the how-provenance is expressed as a polynomial, with each monomial corresponding to one set of edge bindings, i.e., one derivation. For instance, the answer ⟨Stonebraker, Ramakrishnan⟩ has two derivations; in Fig. 1a, the subgraphs corresponding to these two different edge bindings are shaded in red and blue, respectively. Since how-provenance subsumes and captures more information about generated results than why-provenance, HUKA uses the how-provenance model, represented through provenance polynomials.
Fundamental to our approach is the concept of potential matches of a query Q. Intuitively, the potential matches of a query correspond to subgraphs of the KG that partially match the given query graph pattern. We maintain the potential matches of a query Q that require only a single edge insertion to transform them into full matches, i.e., actual results of Q. We now formally define potential matches and show how they can be maintained.
DEFINITION 1 (POTENTIAL MATCH). Any subgraph S of the knowledge graph G which can become an actual match of a query Q after a single edge insertion is called a potential match. □
Consider a query Q consisting of n triple patterns. Suppose S is a subgraph of G such that n − 1 triple patterns of the query Q match and only 1 triple pattern does not match. After a new edge e that matches the unmatched triple pattern of Q is added to S, the new subgraph S* = S ∪ {e} becomes an actual match for Q. In other words, S* ∈ A(Q), where A(Q) is the answer-set of the query Q. S is, therefore, a potential match. Next, we categorize the potential matches based on the number of triple patterns matched by the newly inserted edge: either the newly added edge matches a single triple pattern of Q, or it matches many triple patterns in the query.
DEFINITION 2 (1:1 POTENTIAL MATCH). If an edge e newly added to a potential match S matches only one triple pattern of the query Q, such that S* = S ∪ {e} becomes an actual match of Q, then the subgraph S is called a 1:1 potential match, denoted by PM_1:1. □
DEFINITION 3 (1:M POTENTIAL MATCH).
If an edge e newly added to a potential match S matches multiple triple patterns of the query Q, such that S* = S ∪ {e} becomes an actual match of Q, then the subgraph S is called a 1:m potential match, denoted by PM_1:m. □
We illustrate these important concepts using Fig. 1b. The subgraphs G_1 and G_2, shown in Fig. 2a and Fig. 2b respectively, are subgraphs of the example toy KG and are potential matches of the parent query. Although both are potential matches, G_1 is anticipating the insertion of only one fact, matching the triple pattern qp^1_1: ⟨?collab, worksIn, ?org1⟩, in order to match the query. On the other hand, G_2 needs matches for two triple patterns of the query, qp^1_2: ⟨?prof, worksIn, ?org2⟩ and qp^2_2: ⟨?collab, worksIn, ?org1⟩. Interestingly, a single fact insertion can complete both kinds of potential matches. For instance, if a fact ⟨Sarawagi, worksIn, IITB⟩ is added to the KG, it will match the query pattern qp^1_1, resulting in G_1 fully satisfying the query. At the same time, the same fact also matches both qp^1_2 and qp^2_2, making G_2 satisfy the query as well.
We categorize registered queries into two types, regular and MultiMap, based on the kind of potential matches they may have. Both are conjunctive queries; they differ only in the distribution of predicates among triple patterns. We formally define them next. A query that has only 1:1 potential matches (and no 1:m potential match) is called a regular query, denoted by Q_reg. For example, the query in Fig. 1b is not a regular query, as its two triple patterns ⟨?prof, worksIn, ?org2⟩ and ⟨?collab, worksIn, ?org1⟩ have the same predicate worksIn. A query that has 1:m potential matches (possibly in addition to 1:1 potential matches) is called a MultiMap query, denoted by Q_mm. The example query, shown in Fig. 1b, is a MultiMap query with G_2, in Fig. 2b, as a PM_1:m. A necessary condition for a conjunctive query to be a MultiMap query is to have at least two triple patterns with a common predicate. However, this condition is not sufficient. We discuss the required characteristics in Sec. 5.4 while discussing the registration process of MultiMap queries.
The separation of potential matches and queries into different classes is important since they need to be handled differently. Our query registration strategy is based on the following lemmas.
LEMMA 1. A PM_1:1 can satisfy one and only one subquery of size n − 1 of a given parent query Q of size n.
PROOF. Consider a parent query Q = {t_1, ..., t_n} of size n, i.e., having n triple patterns. A subquery Q_i of size n − 1 is generated by removing the triple pattern t_i of Q, i.e., Q_i = Q \ {t_i}. Assume that a subgraph S is a PM_1:1 such that completing the triple pattern t_i will complete its match to the parent query Q. By definition, S satisfies the subquery Q_i. Assume that S also satisfies another subquery Q_j (j ≠ i) of Q. This implies that S has all the matching edges of Q_j. Since Q_j misses only the triple pattern t_j, it must contain t_i; then S would already have an edge matching t_i, which is a contradiction. Therefore, S cannot match any other subquery Q_j. Hence, S matches one and only one subquery of Q. □
LEMMA 2. A PM_1:m cannot satisfy any subquery of size n − 1 of a given parent query Q of size n.
PROOF. Consider a parent query Q = {t_1, ..., t_n} of size n, i.e., having n triple patterns. A subquery Q_i of size n − 1 is generated by removing the triple pattern t_i of Q, i.e., Q_i = Q \ {t_i}. Assume that a subgraph S is a PM_1:m such that completing the set of triple patterns
{t_{i_1}, ..., t_{i_k}} will complete its match to the parent query Q. Assume that S satisfies a subquery Q_j of Q. Since the subquery Q_j misses only one triple pattern, t_j, this implies that S matches the remaining n − 1 triple patterns. By the definition of a PM_1:m, this is a contradiction. Hence, S cannot match any subquery of Q. □
LEMMA 3. A PM_1:m, after the insertion of an edge that completes it, satisfies the parent query Q of size n if and only if it satisfies all subqueries of size n − 1 of Q.
PROOF. (Only if) Suppose a PM_1:m S satisfies the query Q consisting of n triple patterns. This implies that S has a match for every triple pattern t_i ∈ Q, i = 1, ..., n. Consider any subquery Q_i of Q. Since it consists of the remaining n − 1 triple patterns except t_i, S necessarily satisfies Q_i. (If) Since S satisfies all subqueries, it satisfies, in particular, two subqueries Q_i and Q_j (j ≠ i) of Q. The triple patterns t_l, l = 1, ..., n, l ≠ i, match S due to Q_i. Since the triple pattern t_i matches due to Q_j, together all the n triple patterns match. Hence, S matches the entire query Q. □
Next, we show how these observations can be used to preprocess queries during their registration for maintenance.
In this section, we discuss in detail the process of registering a query. These queries are called standing queries or continuous queries [6] since their results are always relevant and need to be updated when the KG undergoes changes. When a standing query is registered for maintenance, apart from evaluating its results, HUKA performs the following additional steps: (a) generation of all subqueries obtained by removing one triple pattern at a time from the query, and identification of all the potential matches; (b) annotation of the connection points in the knowledge graph; and (c) precomputation of a query processing plan for the subqueries so that they can be recomputed efficiently. In the rest of the section, we describe these steps in detail, first for regular queries and then for MultiMap queries, since the latter need special handling.
Since we are interested only in those subqueries that are just one triple pattern less than the registered query being processed, they are all of size n − 1. From the running example query in Fig. 1b, we can generate a number of subqueries by systematically removing each triple pattern specified in the query. This results in 5 subqueries, as listed in Fig. 3. Notice that in some cases the removal of a triple pattern can lead to the generation of two disconnected subqueries. If a given query is of size n, i.e., has n triple patterns, removal of a triple pattern can generate at most two subgraphs in the resulting subquery. Denoting these two subgraphs as SQ_1 and SQ_2, without any loss of generality, we assume that |SQ_1| = k and |SQ_2| = n − k − 1, where 0 ≤ k ≤ n − 1. The subqueries can be categorized into 4 types depending on the size of their subgraphs and the resulting connection points. These types are illustrated in Fig. 4. In each of these illustrations, the gray-shaded node is part of the subgraph it is shown connected to; they have been shown separately only to highlight the subquery generation process.
• Type I: These subqueries have a singleton subgraph (k = 0), and are generated when the triple pattern corresponding to the edge of a leaf node of the query graph is removed; for instance, the subquery shown in Fig. 3a, generated by removing the worksIn edge. This leads to |A(SQ_1)| connection points in the knowledge graph. The general subquery structure for this type is given in Fig. 4a.
• Type II: In this case, one of the subgraphs contains only one triple pattern, i.e., k = 1, as shown in Fig. 4b.
In this case, we may end up identifying all edges with the predicate of the triple pattern in SQ_2 as connection points, which could be a potentially large number of instances in the knowledge graph. We avoid this and instead identify only the connection points derived from A(SQ_1), leading to only |A(SQ_1)| connection points. Fig. 3c shows an example.
• Type III: As shown in Fig. 4c, both SQ_1 and SQ_2 contain at least two triple patterns, i.e., 2 ≤ k < n − 2. For instance, removal of coAuthor generates two subqueries, shown in Fig. 3b, each of size 2. This results in at most |A(SQ_1)| + |A(SQ_2)| connection points.
• Type IV: In this last type, the removal of a triple pattern from the query does not disconnect the query graph, i.e., k = n − 1. As shown in Fig. 4d, each match of SQ_1 can result in two potential matches, leading to a total of 2 · |A(SQ_1)| connection points in the graph, since they may be annotated from both sides.
It is important to note that in each case, both the SQ_1 and SQ_2 subgraphs are effectively treated as distinct query patterns and are maintained by HUKA. Potential matches are identified as subgraphs of the KG that match the subqueries of a parent query. After identifying all the potential matches of a query, HUKA annotates the entities corresponding to their connection points. Each connection point results in its entity being annotated with a tuple (expRel, dir, sqId, result, provPoly), where expRel is the relation predicate the connection point expects, dir is the direction of the edge (either outgoing or incoming), sqId is the identifier of the subquery that the connection point belongs to, result is the value of the query variable that needs to be projected, and provPoly is the provenance polynomial corresponding to the potential match. For instance, consider the subqueries generated by removing the coAuthor edge in Fig. 3b. For ease of exposition, only two connection points are considered, one for each subquery. The node Ooi in the KG in Fig. 1 matches one of the connection points and is annotated with the tuple (coAuthor, out, q_i, Ooi, e_15 · e_16), as it is expecting an outgoing edge with label coAuthor. Similarly, node Gehrke has the annotation (hadAdvisor, in, q_j, Ramakrishnan, e_1 · e_3). Note that for Gehrke the result attribute has the value Ramakrishnan, as it would match the projected variable ?prof of the parent query.
As new edges are added to the KG, it is crucial that the results of all the subqueries of registered queries are continually and efficiently maintained. Instead of generating an execution plan each time subquery results have to be updated, HUKA generates an execution plan for each subquery at the time of query registration. These plans are then merged together to form a global plan that tries to exploit shared execution segments.
Local Plan Generation. In order to generate the execution plan for each subquery, HUKA adapts AND-OR trees [30]. These trees comprise two types of nodes, AND and OR nodes, which appear at alternate levels of the tree, i.e., each AND node has OR nodes as its parent and as its children. An AND node represents one way of deriving the intermediate expression corresponding to its parent OR node; OR nodes correspond to intermediate expressions in the evaluation of the final subquery expression. These trees effectively capture all possible derivations of the subquery expressions. After the AND-OR tree is constructed, the best query plan for the subquery has to be chosen. A greedy heuristic is employed: each intermediate expression is assigned a score based on its estimated cardinality, and the execution tree is traversed in a bottom-up manner. At each level i, the node corresponding to the intermediate expression with the lowest cardinality is greedily chosen; no other node at level i is considered. At the next level i + 1, its parent nodes are explored, and the traversal continues until the root of the tree is reached.
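The following is a simplified sketch of this greedy, cardinality-driven selection (Python; illustrative only, not HUKA's implementation). It flattens the AND-OR tree into levels of OR-node expressions, each identified by the set of predicates it joins, and assumes the cardinality estimates are already available; the numbers below are made up.

```python
from itertools import combinations

# Hypothetical cardinality estimates for intermediate expressions, keyed by
# the set of predicates joined so far; only valid (connected) joins appear.
CARDINALITY = {
    frozenset({"hasDegree"}): 5000,
    frozenset({"worksIn"}): 8000,
    frozenset({"coAuthor"}): 12000,
    frozenset({"hasDegree", "worksIn"}): 900,
    frozenset({"worksIn", "coAuthor"}): 2500,
    frozenset({"hasDegree", "worksIn", "coAuthor"}): 400,
}

def or_node_levels(predicates):
    """Level i holds every OR-node expression joining i + 1 predicates;
    the last level is the full subquery expression (the root)."""
    preds = list(predicates)
    return [[frozenset(c) for c in combinations(preds, size)]
            for size in range(2, len(preds) + 1)]

def greedy_plan(predicates, card):
    """Pick, level by level from the bottom up, the cheapest expression
    that extends the expression chosen at the previous level."""
    chosen, order = None, []
    for level in or_node_levels(predicates):
        # parents of `chosen` are its supersets with one more predicate;
        # expressions without an estimate (invalid joins) are skipped
        candidates = [e for e in level
                      if (chosen is None or chosen < e) and e in card]
        chosen = min(candidates, key=lambda e: card[e])
        order.append(chosen)
    return order

print(greedy_plan({"hasDegree", "worksIn", "coAuthor"}, CARDINALITY))
# joins hasDegree and worksIn first, then adds coAuthor
```

In HUKA, the estimates themselves come from the characteristic-pair statistics described under Cardinality Estimation below; the selected sequence of intermediate expressions is the local plan that is then merged into the global plan.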
Consider the parent query shown in Fig. 1b. Removal of the triple pattern ⟨?stud, hadAdvisor, ?prof⟩ generates the subquery with triple patterns containing hasDegree, worksIn and coAuthor as predicates. A snippet of the partial AND-OR tree for this subquery is shown in Fig. 5. In the interests of space, the relations hasDegree, worksIn and coAuthor are denoted by P_1, P_2 and P_3, respectively. As shown, the AND nodes on the second level correspond to the intermediate expressions (P_1 ⋈_subject P_2) and (P_2 ⋈_subject P_3). Similarly, the root node corresponds to the final expression of the subquery. The children of the root node represent the two possible derivations, viz., ((P_2 ⋈_subject P_3) ⋈_subject P_1) and ((P_1 ⋈_subject P_2) ⋈_subject P_3).
Global Plan Computation. Once the best plan is chosen for a subquery, it is merged into the global query plan. The global plan is built to reuse the computation of shared intermediate expressions across all subqueries. In a local plan, the subtree rooted at a given node gives its derivation. We traverse the best local plan in a top-down, breadth-first manner. At each node k, we look for an equivalent expression in the global plan. If an equivalent expression already exists, it implies that a derivation of the expression corresponding to node k already exists, so we do not add another derivation of the same expression. If no equivalent expression is found, we add the node k along with the corresponding expression to the global tree and move to the next node. The global plan can have multiple roots; each root represents the final expression of one of the subqueries.
Cardinality Estimation. In order to obtain high-quality execution plans for queries, accurate cardinality estimation is crucial. The cardinality estimation used in HUKA is specifically designed for RDF knowledge graphs, and captures the neighborhood of entities in the graph. Specifically, HUKA models queries as star-chain shapes, i.e., a sequence of star-shaped queries [28]. Based on this, the statistics are maintained as counts of (s, o) pairs that have the same characteristic pair P_c [20]. The cardinality estimate of a query, represented as a sequence of k star-shaped fragments C_1, ..., C_k, is the sum of the estimates of the pairs of fragments.
The steps discussed above, involving subquery generation, connection point annotation and global execution plan generation, suffice for maintaining regular queries. MultiMap queries, however, require special handling. The presence of more than one triple pattern with the same predicate is a necessary but not a sufficient condition for identifying a conjunctive query as a MultiMap query. We next use an example of a non-MultiMap query Q, shown in Fig. 6, to highlight the characteristics of MultiMap queries.
Characteristics of MultiMap Queries.
The query Q has two pairs of triple patterns with the same predicate: (⟨?x1, P_4, ?x2⟩, ⟨?x5, P_4, ?x6⟩) and (⟨?x2, P_5, C1⟩, ⟨?x6, P_5, C2⟩). There are two important observations:
• If C1 ≠ C2 and the relation P_5 is one-to-one, then the triple patterns ⟨?x2, P_5, C1⟩ and ⟨?x6, P_5, C2⟩ cannot be mapped to the same edge of a KG. In other words, ?x2 and ?x6 cannot be bound to the same node. Propagating this constraint backwards, the triple patterns ⟨?x1, P_4, ?x2⟩ and ⟨?x5, P_4, ?x6⟩ cannot be mapped to the same edge either, since ?x2 and ?x6 cannot be mapped to the same node of the KG. So, triple patterns participating in two different paths comprising one-to-one relations and leading to different bound variables cannot be mapped to the same edges of the KG; such a query cannot be MultiMap. For instance, the paths from ?x1 to C1 and from ?x5 to C2 have the same relation sequence P_4, P_5, but none of the involved triple patterns can be mapped to the same KG edge. Thus, Q is not a MultiMap query.
• Suppose the relations P_1, P_2 and P_3 are the same, say isLocatedIn. Then nodes ?x1 and ?x5 cannot be mapped to the same node of the KG, as a location cannot be located in itself. Thus, the triple patterns ⟨?x1, P_4, ?x2⟩ and ⟨?x5, P_4, ?x6⟩ cannot be mapped to the same edge. Therefore, a query cannot have a PM_1:m involving variables ?x and ?z if a path P_i, ..., P_j from ?x to ?z is comprised of the same asymmetric predicate, i.e., P_i = ... = P_j = P where aPb ⟹ ¬bPa.
Identification of MultiMap Queries. Based on the above two observations, we identify MultiMap queries and the corresponding predicate P shared by the triple patterns t_i, ..., t_j that can match the same edge. A MultiMap query can have both PM_1:1 and PM_1:m potential matches. A PM_1:1 would satisfy one of the subqueries (Lemma 1), but a PM_1:m would not satisfy any subquery before an appropriate edge e is inserted (Lemma 2). However, when an appropriate edge is inserted, the PM_1:m would satisfy all the subqueries and the parent MultiMap query simultaneously (Lemma 3). Thus, to identify new matches to the parent MultiMap query, we examine the new results of a subquery. Instead of examining all subqueries, only those generated by removing a triple pattern that shares the same predicate are enough to identify a PM_1:m.
An edge deletion requires updates not only to the parent query results, but also to the existing potential matches and the connection point annotations in the KG. Based on the provenance polynomials of the query results, an index, edgeToResult, is built on the edges e_i that appear in the result set of at least one query. The index entry corresponding to an edge e consists of the set {⟨Q_i, L_i⟩}, where L_i is the list of result items of query Q_i such that e appears in their associated provenance polynomials. Similarly, the edgeToCP index is built by considering the provenance polynomials of the potential matches: each participating edge is mapped to the set of connection points that are annotated with those provenance polynomials.
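As a concrete illustration, here is a minimal sketch of the two inverted indexes (Python; the field names follow the paper's description, but the structures, data and the first monomial below are illustrative assumptions, not HUKA's actual storage layout).

```python
from collections import defaultdict

# A provenance polynomial is a list of monomials; each monomial is the set
# of edge identifiers used in one derivation of the answer.  The second
# monomial is spelled out in the running example; the first is assumed.
results = {
    "Q1": {("Stonebraker", "Ramakrishnan"):
               [{"e2", "e3", "e6", "e8", "e14"},
                {"e2", "e3", "e6", "e8", "e17"}]},
}
# Connection-point annotations: (node, expRel, dir, sqId, result, provPoly)
annotations = [
    ("Ooi",    "coAuthor",   "out", "q_i", "Ooi",          [{"e15", "e16"}]),
    ("Gehrke", "hadAdvisor", "in",  "q_j", "Ramakrishnan", [{"e1", "e3"}]),
]

def build_indexes(results, annotations):
    """edgeToResult: edge -> {query -> [answers whose polynomial uses it]}
       edgeToCP:     edge -> {connection points annotated with it}"""
    edge_to_result = defaultdict(lambda: defaultdict(list))
    edge_to_cp = defaultdict(set)
    for qid, answers in results.items():
        for answer, polynomial in answers.items():
            for monomial in polynomial:
                for edge in monomial:
                    if answer not in edge_to_result[edge][qid]:
                        edge_to_result[edge][qid].append(answer)
    for node, _rel, _dir, _sq, _res, polynomial in annotations:
        for monomial in polynomial:
            for edge in monomial:
                edge_to_cp[edge].add(node)
    return edge_to_result, edge_to_cp

edge_to_result, edge_to_cp = build_indexes(results, annotations)
print(edge_to_result["e14"])   # which Q1 answers use edge e14
print(edge_to_cp["e15"])       # connection points whose polynomials use e15
```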
When a new fact is added to the KG, we first need to find the affected queries and then update the indexes and annotations for handling subsequent updates. Suppose that an edge e(u, P, v) is added to the KG. We start by fetching the relevant annotations of u that have P as expRel and "out" as dir. Suppose the relevant annotation has sqId = q_i. Recall that each annotation corresponds to a PM_1:1 of one of the subqueries. Based on the type of the subquery q_i, we proceed as follows:
• Type I: If q_i is a Type I subquery, then we do not need to perform any further check, as the insertion of e with the expected relation P has completed the potential match. For example, in Fig. 1a, if the edge ⟨Sarawagi, worksIn, IITB⟩ is inserted, then the fact that Sarawagi is expecting an outgoing edge labeled worksIn is enough to confirm that the corresponding potential matches have become actual matches.
• Type II: Recall that this class has one subquery SQ_1 of size n − 2 and a single edge with predicate E. We need a simple lookup to check if node v has a correct incident edge with label E; the presence of such an edge leads to a parent query answer.
• Type III: We examine the annotations of v as well, checking whether v has an annotation with P as expRel, "in" as dir, and the complementary subquery of q_i as sqId. If the required sqId is present, then e changes the potential match to an actual match. For instance, the connection points Gehrke and Ooi in Fig. 1a correspond to a pair of complementary subqueries. Therefore, if the edge ⟨Ooi, coAuthor, Gehrke⟩ is added to the KG, it results in a new answer (Ramakrishnan, Ooi).
• Type IV: We check the annotations of v, and if we find the same subquery id as that in u, with expRel as P and "in" as dir, we can conclude that e has completed the potential match.
The above checks are for PM_1:1 only. If the predicate P of the newly added edge is involved in any of the MultiMap queries, then it needs special handling. Before reporting new matches to a MultiMap query, the subquery results are updated: the global query plan is traversed in a bottom-up manner and the incremental results of all the intermediate expressions involving the predicate P are computed. Consequently, we obtain the incremental results of all the concerned subqueries. Recall that each MultiMap query is associated with one of its subqueries S, chosen such that if the newly added edge e(u, P, v) matches the missing triple pattern of the subquery, then all the new answers of the subquery also match the parent query and are, thus, reported as parent query answers. The new results also correspond to new connection points, which are then annotated. Both the indexes edgeToResult and edgeToCP are also updated.
The deletion algorithm is based on the standard filter-and-refine paradigm, where candidates are first generated and later confirmed. Assume that an edge e_d is deleted. First, we identify the provenance polynomials in which this edge is involved and the corresponding results. This is done by fetching the entry corresponding to e_d in the edgeToResult index. The next task is to confirm whether these results actually get affected by the edge deletion. For this, each polynomial is evaluated with e_d set to 0 and the other edge symbols set to 1. An answer vanishes if and only if all the monomials corresponding to it evaluate to 0; in other words, the answer no longer remains valid. Since the connection points are annotated with provenance polynomials, they get updated as well. Finally, the entry of e_d is dropped from both the indexes edgeToResult and edgeToCP. Suppose the edge e_14: ⟨Ooi, hadAdvisor, Stonebraker⟩ is deleted from the example KG in Fig. 1. We evaluate the provenance polynomial of the query answer given in Table 1 by assigning 0 to e_14 and 1 to the rest. The resulting value is 1, which indicates that a valid derivation {e_2, e_3, e_6, e_8, e_17} of the answer exists and, hence, ⟨Stonebraker, Ramakrishnan⟩ is still an answer. However, if the edge e_2 is deleted, then the polynomial collapses to 0, denoting that no valid derivation exists, i.e., the pair is no longer a valid answer.
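A minimal sketch of this filter-and-refine check under deletion (Python; the polynomial shown is the one from the running example, with the monomial containing e_14 partly assumed for illustration):

```python
# A provenance polynomial as a list of monomials (sets of edge identifiers).
polynomial = [
    {"e2", "e3", "e6", "e8", "e14"},   # monomial assumed for illustration
    {"e2", "e3", "e6", "e8", "e17"},   # derivation spelled out in the text
]

def still_valid(polynomial, deleted_edge):
    """Evaluate the polynomial with the deleted edge set to 0 and every
    other edge symbol set to 1: the answer survives iff at least one
    monomial (derivation) does not use the deleted edge."""
    return any(deleted_edge not in monomial for monomial in polynomial)

print(still_valid(polynomial, "e14"))   # True  -> answer still holds
print(still_valid(polynomial, "e2"))    # False -> answer vanishes
```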
Insertion. We assume a single edge insertion and argue the correctness for the two types of queries separately. Since maintenance is done after every insertion operation, the correctness extends to multiple insertions. Lemma 1 shows that a PM_1:1 of a regular query will be of size exactly n − 1 and will satisfy a subquery. Hence, once a PM_1:1 is identified correctly and an edge insertion completes it, it must match the complete parent query. The corresponding subgraph match and its provenance polynomial are updated to reflect the current state. A MultiMap query may have a PM_1:m. If it has a PM_1:m, then, using Lemma 3, we see that if an edge insertion satisfies the subqueries, it must also satisfy the parent query. Thus, once a PM_1:m is identified correctly in a MultiMap query, correctness is ensured, since the subquery results are also updated.
Deletion. If, after the deletion of an edge e, a provenance polynomial evaluates to 0, it implies that e was present in each corresponding derivation (monomial). Conversely, as long as a derivation exists that does not involve e, its corresponding monomial will continue to evaluate to 1, and the whole polynomial will be non-zero. If e is not involved in any derivation of any answer, then it will not have any entry in the edgeToResult index and there will be no impact on the query results.
7 EXPERIMENTAL EVALUATION
7.1 Setup
7.1.1 Datasets. We consider two widely used benchmark datasets.
• YAGO2 [23, 24]: It is an automatically built ontology gathering facts from sources like Wikipedia, GeoNames, etc.
• DBpedia 3.9 [8]: It is the structured counterpart of Wikipedia.
Table 2 summarizes the statistics of these two datasets.
7.1.2 Query Sets. For YAGO2, we used a set of queries on which RDF-3X was originally validated [32]. We chose 4 queries (ids A1, A2, B1, B2) that are fairly large and complex. For DBpedia, we built the query workload from the real-world queries available in the 2014 USEWOD (https://eprints.soton.ac.uk/379401/) query logs. This query log is a collection of over 253K actual HTTP requests to the DBpedia SPARQL endpoint. Out of these, 12K are SPARQL SELECT queries. After filtering out nested, ordering and aggregate queries, we obtained a set of 5,490 queries. We further removed queries of size 1, those which could not be parsed by a SPARQL parser, and duplicates, to obtain the final workload of 215 queries.
For each KG, we generated insertion workloads in the following manner. Starting from the original graph, we randomly selected a pair of vertices not yet connected, and connected them with a randomly selected predicate. Note that we did not consider for insertion any predicate that is not part of at least one query, since such insertions do not affect the results of any query and can be handled trivially by a simple lookup. For deletion workloads, we randomly deleted edges of the predicates that are involved in at least one of the queries. Using this procedure, we generated workloads of size 10000 updates each.
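A sketch of this workload generation (Python; the function and parameter names are hypothetical, and the generator only builds the operation list without applying the updates):

```python
import random

def make_update_workload(kg_edges, query_predicates, size, del_ratio):
    """kg_edges: set of (s, p, o) triples; query_predicates: predicates that
    appear in at least one registered query; del_ratio: fraction of the
    workload that should be deletions."""
    nodes = list({n for s, _, o in kg_edges for n in (s, o)})
    deletable = [e for e in kg_edges if e[1] in query_predicates]
    preds = list(query_predicates)
    workload = []
    for _ in range(size):
        if random.random() < del_ratio and deletable:
            # delete a random edge whose predicate is query-relevant
            edge = deletable.pop(random.randrange(len(deletable)))
            workload.append(("delete", edge))
        else:
            # insert: pick an unconnected pair and a query-relevant predicate
            while True:
                s, o = random.sample(nodes, 2)
                if not any(e[0] == s and e[2] == o for e in kg_edges):
                    break
            workload.append(("insert", (s, random.choice(preds), o)))
    return workload
```

The del_ratio knob corresponds to the controlled deletion-to-insertion ratios used in the workload experiments below.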
We conducted all the experiments on a machine with a 32-core 2.1 GHz CPU, 512 GB RAM, and a 1 TB hard drive. Our implementation was single-threaded and in Java. We chose Neo4j 3.3.9 to store the knowledge graphs due to its capability to model property graphs. The HUKA codebase is publicly available.
7.1.5 Baseline Systems. Wylot et al. [44] highlighted the shortcomings of named graphs to support provenance polynomials and devised TripleProv to support provenance computation over a customized RDF engine, dipLODocus [42]. We contacted the authors for their code but, unfortunately, they responded that TripleProv is no longer available. We therefore did a best-effort implementation of TripleProv on top of the property graphs supported by a leading graph database engine, Neo4j, and used it as a baseline.
On average, it took 6.9 minutes and 10 minutes to register DBpedia and YAGO2 queries, respectively; the majority of the registration time is spent on the preprocessing steps described in Sec. 5. The update handling times are reported in Table 4. For HUKA, we recorded the average total time taken to update the parent queries and maintain the supporting structures. As none of the baseline systems support incremental update of query provenance, we computed the average total time taken to execute all the relevant queries for a particular update request; the relevant queries are those in which the updated edge predicate is present. On YAGO2, HUKA takes less than a second to update the query results and their provenance polynomials, whereas for DBpedia, the time taken is slightly more. Importantly, our approach is around 4 to 48 times faster than the nearest baselines. Since the query size directly translates to the number of joins, GProM and ProvSQL require less time for DBpedia. However, for HUKA, the number of affected queries for DBpedia is generally higher, as the total number of queries (4 in YAGO2 versus 215 in DBpedia) is higher. This overshadows the impact of the smaller number of joins per query and, consequently, HUKA requires more time for DBpedia than for YAGO2. Unlike relational engines, Neo4j uses a graph data organization and stores logically related data nearby. This results in efficient graph traversal and, therefore, Neo4j requires considerably less time for YAGO2, since it is mostly impacted by the number of queries.
7.3.2 Update Workloads. We next study in detail the effect of different types of updates: deletion versus insertion. We generated multiple workloads (as described in Sec. 7.1) of size 10000 updates each, with a controlled ratio of operations. We used 5 configurations of deletion-to-insertion ratios: Insertion-Heavy (1:9), Insertion-Moderate (3:7), Balanced (5:5), Deletion-Moderate (7:3), and Deletion-Heavy (9:1). The average times are reported in Table 5. For both datasets, the average total time gradually increases as we move from the deletion-heavy to the insertion-heavy workload. The trend is expected, as insertion is computationally heavier than deletion.
7.3.3 Closer Look. Next, to get a better insight into our system, we take a closer look at its different steps. We chose the larger query set, DBpedia, and the query set partitions outlined in Table 3. We treat each query set QS_i individually, i.e., we built separate query execution plans, annotated separate copies of the KG, and constructed separate indexes for each query set. We divide the total running time into 2 components: (a) response time, defined as the time taken to update the parent query result(s), and (b) maintenance time, defined as the time taken to maintain the data structures so as to ensure correctness of subsequent updates.
We can update the results of regular parent queries just by examining the relevant connection points, whereas to update a MultiMap parent query we need to update the subquery results as well. For regular queries, maintenance time is the sum of the time consumed to update subqueries, annotate the graph and update the inverted indexes. For MultiMap queries, however, maintenance time involves annotating the graph and updating the indexes. The time splits, in terms of percentage of the total time for the balanced workload, are shown in Fig. 7. (There is no MultiMap query of size 6.) Across query sets, regular queries exhibit lower response time compared to MultiMap queries. This happens as MultiMap queries require more background work before the query results can be updated. Overall, regular queries require less absolute time to complete. Importantly, the time to update subquery results was agnostic to the number of subqueries of a query set QS_i. To understand this, we closely inspected the global plans of each query set QS_i. Intuitively, the higher the number of intermediate expressions to compute, the higher the subquery result update time. To quantify this, we defined the coverage of a global plan as the average number of intermediate expressions that need to be computed when an edge is inserted; in other words, it is the ratio of the total number of non-leaf nodes in the global tree to the total number of unique predicates involved in the corresponding query set. The coverage values for the QS_i are given in Table 3. Thus, the higher the coverage of a query plan, the higher the time to update the subquery results.
In this paper, we have proposed a system to efficiently maintain the provenance of results of standing queries over knowledge graphs that are dynamic in nature. We used decomposition of queries into subqueries and inverted indexes to quickly update the answers and their provenance polynomials. Experiments show its effectiveness on large real-life knowledge graphs. In the future, we would like to work with more generic SPARQL queries that include aggregates, nested queries, ordering, etc.
REFERENCES
Trio: A system for data, uncertainty, and lineage
Spider: a schema mapping debugger
GProM - a swiss army knife for your provenance needs
Counting beyond a Yottabyte, or How SPARQL 1.1 Property Paths Will Prevent Adoption of the Standard
Provenance Management for Evolving RDF Datasets
Continuous Query
Query-Based Why-Not Provenance with NedExplain
DBpedia - A crystallization point for the Web of Data
Navigating the Maze of Wikidata Query Logs
Why and where: A characterization of data provenance
Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases
A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs
Practical lineage tracing in data warehouses
A relational algebra for SPARQL
Reverse-Engineering Conjunctive Queries from Provenance Examples
Incremental Graph Pattern Matching
Tracking the Impact of Fact Deletions on Knowledge Graph Queries Using Provenance Polynomials
The Perm Provenance Management System in Action
Provenance semirings (Todd J. Green, Grigoris Karvounarakis, and Val Tannen)
Exploiting the query structure for efficient join ordering in sparql queries
Dynamic Provenance for SPARQL Updates
US government linked open data: semantic data gov
YAGO2: exploring and querying world knowledge in time, space, context, and many languages
YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia
The ORCHESTRA collaborative data sharing system
Graphflow: An Active Graph Database
TurboFlux: A Fast Continuous Subgraph Matching System for Streaming Graph Data
Extended characteristic sets: graph indexing for sparql query optimization
The W3C PROV Family of Specifications for Modelling Provenance Metadata
Materialized View Selection and Maintenance Using Multi-query Optimization
Never-Ending Learning
The RDF-3X engine for scalable management of RDF data
Data Integration and Data Exchange: It's Really About Time
Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases
Provenance and probabilities in relational databases
ProvSQL: provenance and probability management in postgreSQL
Reconciling While Tolerating Disagreement in Collaborative Data Sharing
Aspect-Based Academic Search Using Domain-Specific KB
Knowledge graph embedding: A survey of approaches and applications
Toward Enhanced Pharmacovigilance Using Patient-Generated Data on the Internet
DrugBank: a knowledgebase for drugs, drug actions and drug targets
dipLODocus[RDF] - Short and Long-Tail RDF Analytics for Massive Webs of Data
Storing, Tracking, and Querying Provenance in Linked Data
TripleProv: Efficient Processing of Lineage Queries in a Native RDF Store
Efficiently Answering Technical Questions - A Knowledge Graph Approach