key: cord-0141210-dcluthxl authors: Bonifati, Angela; Dumbrava, Stefania; Fletcher, George; Hidders, Jan; Hofer, Matthias; Martens, Wim; Murlak, Filip; Shinavier, Joshua; Staworko, Slawek; Tomaszuk, Dominik title: Threshold Queries in Theory and in the Wild date: 2021-06-29 journal: nan DOI: nan sha: b645ede48c4929913e55313a4bd5084096cdd1de doc_id: 141210 cord_uid: dcluthxl Threshold queries are an important class of queries that only require computing or counting answers up to a specified threshold value. To the best of our knowledge, threshold queries have been largely disregarded in the research literature, which is surprising considering how common they are in practice. In this paper, we present a deep theoretical analysis of threshold query evaluation and show that thresholds can be used to significantly improve the asymptotic bounds of state-of-the-art query evaluation algorithms. We also empirically show that threshold queries are significant in practice. In surprising contrast to conventional wisdom, we found important scenarios in real-world data sets in which users are interested in computing the results of queries up to a certain threshold, independent of a ranking function that orders the query results by importance. Queries encountered in a wide range of data management applications, such as data interaction, data visualization, data exploration, data curation and data monitoring [8, 40, 44] , often require computing or counting answers only up to a given threshold value. We call such queries threshold queries. Threshold queries for data exploration. Querying voluminous rich data sources, such as Wikidata [66] , may return more results than are needed during exploratory analytics. Thus, users may specify a threshold on the number of answers they want to see. 
Consider the following query, which lists up to a threshold (expressed with the LIMIT clause in SPARQL) businesses, total assets, and cities. Note that here the user is interested in unranked output (i.e., there is no ORDER BY clause), which is typical for data exploration [8]. As we will see later in a deep dive into real query logs (Section 5), such queries on Wikidata are very common in practice. Threshold queries for data curation. Threshold queries are also useful in detecting violations of database constraints and identifying data items requiring curation actions. As an example, consider the following threshold query, using grouping and aggregation in the Nobel Prize database, which requires that every Nobel prize have at most three laureates [4]. Differences with other query answering paradigms. Although threshold queries might look similar in spirit to top-k queries [34, 42, 58, 59], they are inherently different because they do not assume that the results are ranked. They are also different from counting queries [31, 62], since these aim at computing an exact value, rather than exact values only up to a given threshold. Therefore, prior problems are either more specific or have different objectives. We examine these differences in depth in Section 6. Our contributions. We are motivated by the following question: Can we exploit the fact that we only need to count or compute the answers of a query up to a threshold? In this paper, we answer this question positively. (This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14.)
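In SQL the same pattern is a plain LIMIT query. The following is a minimal sketch over a hypothetical business/assets schema (all table and column names here are assumptions for illustration, not Wikidata's):

```python
import sqlite3

# Hypothetical schema for illustration; the Wikidata query in the text
# binds businesses, their total assets, and cities analogously.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business(id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE assets(business_id INTEGER, total REAL);
""")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)",
                 [(i, f"b{i}", "Lyon") for i in range(100)])
conn.executemany("INSERT INTO assets VALUES (?, ?)",
                 [(i, 1000.0 * i) for i in range(100)])

THRESHOLD = 10  # the user only wants to see up to 10 unranked answers
rows = conn.execute("""
    SELECT b.name, a.total, b.city
    FROM business b JOIN assets a ON a.business_id = b.id
    LIMIT ?
""", (THRESHOLD,)).fetchall()
print(len(rows))  # at most THRESHOLD rows are returned
```

Note that there is no ORDER BY: the engine may stop as soon as the threshold is reached, which is exactly the opportunity the paper exploits.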
The starting point for our work is the observation that evaluating some queries with a threshold t requires storing no more than t + 1 intermediate results [17, 50]. We show that this idea can be fully integrated with state-of-the-art complex join algorithms, leading to significant savings in the size of the intermediate results as typically computed in query processing, and for some queries to improvements from Õ(n^k) to Õ(t · n), where k is the free-connex treewidth of the query, t is the value of the threshold, and n is the size of the database. Here, the free-connex treewidth of a query measures its treewidth in connection with its output variables; it can be large even if the query is acyclic. In detail, the key contributions of our paper are as follows: (1) New results explaining the interplay between different structural properties of conjunctive queries (i.e., select-project-join queries) used in sophisticated evaluation algorithms (Lemma 4.5), and the consequences for threshold query processing (Theorem 4.7). (2) New evaluation algorithms for threshold queries (for computing answers, counting answers, and sampling answers) with improved asymptotic guarantees (Theorem 4.10). (3) A comprehensive empirical study of threshold queries found in the wild, which highlights their characteristics and shows that they are quite common in important practical scenarios. (4) An experimental evaluation of a proof-of-concept SQL implementation of our algorithm against the current query optimizer in PostgreSQL, showing speedups of several orders of magnitude. Our work provides the first in-depth theoretical treatment of threshold queries. In addition, it shows, by means of an in-depth empirical study, that these queries are important in practical settings. The paper is organized as follows. Section 2 contains basic definitions. In Section 3, we identify the problems of interest, explain the complexity-theoretic limits, and illustrate some of our main ideas in an example. (arXiv:2106.15703v2 [cs.DB] 17 Nov 2021)
Section 4 presents the algorithms for the problems of interest. In Section 5, we present experiments and findings on threshold queries that can be found in practice. In Section 6, we discuss related work. Finally, in Section 7 we summarize our findings and outline further research directions. Detailed proofs and additional material can be found in the appendix. Our results apply in both a relational database setting and a graph database setting. We first focus on the relational database setting, for which we loosely follow the preliminaries as presented in [1]. We assume that we have countably infinite and disjoint sets Rel of relation names and Const of values. Furthermore, when (a_1, …, a_k) is a Cartesian tuple, we may abbreviate it as ā. A k-ary database tuple (henceforth abbreviated as tuple) is an element of Const^k for some k ∈ ℕ. A relation is a finite set of tuples of the same arity. We denote the set of all such relations by R. A database is a partial function D : Rel → R such that Dom(D) is finite. If D is a database and R ∈ Rel, we write R(a_1, …, a_k) ∈ D to denote that (a_1, …, a_k) ∈ D(R). By Adom(D) we denote the set of constants appearing in tuples of D, also known as the active domain of D. For defining conjunctive queries, we assume a countably infinite set Var of variables, disjoint from Rel and Const. An atom is an expression of the form R(z_1, …, z_k), where z_i ∈ Var ∪ Const for each i ∈ {1, …, k}. A conjunctive query (CQ) is an expression of the form Q(x̄) = ∃ȳ A_1(z̄_1) ∧ ⋯ ∧ A_m(z̄_m), where ȳ = (y_1, …, y_ℓ) consists of existentially quantified variables and each A_i(z̄_i), with i ∈ {1, …, m}, is an atom. Each variable that appears in x̄ should also appear in z̄_1, …, z̄_m. On the other hand, z̄_1, …, z̄_m can contain variables not present in x̄. For a tuple x̄, we write Q(x̄) to emphasize that Q is a CQ such that all variables in z̄_1, …, z̄_m appear in either x̄ or ȳ. Unless we say otherwise, we assume that the variables in z̄_1, …, z̄_m are precisely the variables in x̄ and ȳ.
The arity of Q(x̄) is defined as the arity of the tuple x̄. We denote by Var(Q) the set of all variables appearing in Q and by FVar(Q) the set of so-called free variables of Q, which are the variables in Var(Q) that are not existentially quantified. A full CQ is a CQ without existentially quantified variables. Query Answers and Relational Algebra. We consider queries under set semantics, i.e., each answer occurs at most once in the result. A binding of V ⊆ Var is a function η : V → Const. We say that bindings η and η′ are compatible if η(x) = η′(x) for all x ∈ Dom(η) ∩ Dom(η′). For compatible bindings η_1 and η_2, the join of η_1 and η_2 is the binding η_1 ⋈ η_2 such that Dom(η_1 ⋈ η_2) = Dom(η_1) ∪ Dom(η_2) and (η_1 ⋈ η_2)(x) = η_i(x) for all x ∈ Dom(η_i) and i ∈ {1, 2}. If Ω_1 and Ω_2 are sets of bindings, then the join of Ω_1 and Ω_2 is Ω_1 ⋈ Ω_2 = {η_1 ⋈ η_2 | η_1 ∈ Ω_1 and η_2 ∈ Ω_2 are compatible}. For V ⊆ Dom(η), the projection of η on V, written π_V(η), is the binding η′ with Dom(η′) = V ∩ Dom(η) and η′(x) = η(x) for every x ∈ Dom(η′). For a set Ω of bindings, the projection of Ω on V is π_V(Ω) = {π_V(η) | η ∈ Ω}. A match for Q in D is a binding η of Var(Q) such that A_i(η(z̄_i)) ∈ D for every i ∈ {1, …, m}.¹ The set of answers to Q on D is Q(D) = {π_FVar(Q)(η) | η is a match for Q in D}. We define answers as functions instead of database tuples, as this simplifies the presentation, and reasoning about their underlying domains is useful in further sections, when dealing with query decompositions. Threshold Queries. A threshold query (TQ) is an expression of the form Q_0(x̄) ∧ ∃^{l,u} z̄ Q_1(x̄, z̄), where, from left to right, Q_0(x̄) is a CQ, l ∈ ℕ, u ∈ ℕ ∪ {∞}, and Q_1(x̄, z̄) is a CQ in which we do not require that every variable in x̄ appears in one of its atoms. Notice that a TQ only has a single counting quantifier ∃^{l,u}, although further ordinary existential quantifiers may occur inside Q_0 and Q_1. We use ∃^{≥l} and ∃^{≤u} as shorthands for ∃^{l,∞} and ∃^{0,u}, respectively, and the corresponding threshold queries will be called at-least and at-most queries.
Similar to CQs, we usually denote the entire query as Q(x̄) or even as Q when x̄ is clear from the context. When representing Q(x̄) for decision problems, we assume that the numbers l and u are given in binary. As an example, recall the Nobel Prize threshold query in Section 1, and suppose that the schema is NobelPrize(id, year, category) and Laureate(nid, name, country) with the foreign key constraint Laureate[nid] ⊆ NobelPrize[id]. This threshold query can be formalized as follows: Q(i) = ∃y, c NobelPrize(i, y, c) ∧ ∃^{0,3} n, s Laureate(i, n, s). The set of answers of Q on D, written Q(D), is the set of answers of Q_0(x̄) on D that have between l and u compatible answers of Q_1(x̄, z̄). Formally, η ∈ Q(D) iff η ∈ Q_0(D) and l ≤ |Q_1(D, η)| ≤ u. If Q is a threshold query of the form above, we call x̄ the free variables (or answer variables) of the query and we write FVar(Q) for the set of these variables. We call z̄ the tally variables of the query and we write TVar(Q) for the set of these variables. For ease of presentation we shall assume that the sets of existentially quantified variables in Q_0 and Q_1 are disjoint; we shall write QVar(Q) for the union of these sets. Thus, Var(Q) is the union of three disjoint sets: FVar(Q), TVar(Q), and QVar(Q). A threshold query can be intuitively expressed as a SQL query defining Q_0(x̄) such that in the WHERE clause we additionally check whether the number of tuples returned by a correlated SELECT-FROM-WHERE subquery defining Q_1(x̄, z̄) is at least l and at most u. Graph Databases. For the purposes of this paper, graph databases can be abstracted as relational databases with unary and binary relations. That is, a graph database can be seen as a database D with a unary relation Node and binary relations E_a, E_b, …, where Node(u) ∈ D if u is a node of the graph database and E_a(u, v) ∈ D if (u, v) is an edge with label a in the graph database. An important feature that distinguishes graph database queries from relational database queries is regular path queries (RPQs) [13].
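This SQL encoding can be sketched for the Nobel Prize example, using the schema from the text; the query below flags prizes violating the at-most-3 requirement via a correlated COUNT subquery (the sample data is, of course, illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE NobelPrize(id INTEGER PRIMARY KEY, year INTEGER, category TEXT);
    CREATE TABLE Laureate(nid INTEGER REFERENCES NobelPrize(id),
                          name TEXT, country TEXT);
""")
conn.executemany("INSERT INTO NobelPrize VALUES (?, ?, ?)",
                 [(1, 1903, "physics"), (2, 1911, "chemistry")])
conn.executemany("INSERT INTO Laureate VALUES (?, ?, ?)",
                 [(1, "l1", "FR"), (1, "l2", "FR"), (1, "l3", "PL"),
                  (1, "l4", "XX"),   # illustrative violation: a 4th laureate
                  (2, "l5", "PL")])

# Prizes violating the at-most-3 constraint: the correlated subquery
# counts the Laureate tuples compatible with each NobelPrize answer.
violations = conn.execute("""
    SELECT p.id FROM NobelPrize p
    WHERE (SELECT COUNT(*) FROM Laureate l WHERE l.nid = p.id) > 3
""").fetchall()
print(violations)  # -> [(1,)]
```

A database engine evaluating this query literally counts all compatible tuples per group; the point of the paper is that counting could stop at 4.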
In a nutshell, a regular path query is an expression of the form r(x, y), where r is a regular expression over the set of edge labels in the graph database. When evaluated over a graph database D, the query returns all node pairs (u, v) such that there exists a path from u to v that is labeled with some word in the language of r. All results in the paper extend naturally to RPQs and therefore to so-called conjunctive regular path queries (CRPQs) [13]. This means that we can generalize CQs to CRPQs, which are defined exactly like CQs, except that we additionally allow RPQs at the level of atoms. Generalizing threshold queries is analogous. If we want to evaluate a threshold query with RPQs, we can pre-evaluate all RPQs in the query and treat their results as materialized views. We can then treat the query as an ordinary threshold query in which each RPQ becomes an atom, which is evaluated over the corresponding materialized view. Query evaluation is arguably the most fundamental problem in databases and comes in many variants, such as: (E1) Boolean evaluation, i.e., testing the existence of an answer; (E2) returning a given number of answers; (E3) counting the total number of answers; (E4) sampling answers with uniform probability; and (E5) enumerating the answers with small delay. An important reason why all these variants are considered is that the set of answers to a query can be very large and one is not always interested in the set of all answers. The computational cost of these variants tends to increase as we go down the list, but already for CQs even the simplest problem (E1) is intractable [1, Chapter 15]. Triggered by Yannakakis' seminal result on efficient evaluation of acyclic CQs [70], the literature has developed a deep understanding that teaches us that, essentially, low tree-width is not only helpful but even necessary for polynomial-time Boolean evaluation of CQs [38]. Intuitively, the tree-width of a CQ measures the likeness of its graph structure to a tree.
In essence, this graph structure is obtained by taking the query's variables as nodes and connecting variables with an edge iff they appear in a common atom. Queries with low tree-width are tree-like and queries with high tree-width are highly cyclic. Example 3.1. Consider the following variant of the first query from the introduction: Q(x, y, z) ← Assets(x, y), Subsidiary(y, w), Shareholder(w, z). For the purpose of Boolean evaluation we can rewrite Q as Q′() ← Assets(x, y), V_1(y); V_1(y) ← Subsidiary(y, w), V_2(w); V_2(w) ← Shareholder(w, z); using views V_1 and V_2. It is clear that one can materialize the views and answer Q′ in time O(n · log n) over databases of size n. The tree-width of Q manifests itself as the number of variables used in the definitions of the views. For CQs of tree-width k, views in the optimal rewriting will use up to k variables and the data complexity will be O(n^{k−1} · log n). For threshold queries, however, low tree-width is not sufficient. The reason is that it is already hard to decide whether the number of results of an acyclic CQ is above a threshold (represented in binary). Proposition 3.2. Given an acyclic conjunctive query Q, a threshold t in binary representation, and a database D, testing if Q returns at least t tuples on D is coNP-hard. So, we cannot have a polynomial-time algorithm even for evaluating Boolean acyclic threshold queries of the form ∃^{l,u} x̄ Q(x̄), unless P = NP. This is why one focus of the paper is on pseudopolynomial-time algorithms for threshold queries of low tree-width. We call an algorithm pseudopolynomial if it is a polynomial-time algorithm assuming that the numerical values l and u in "∃^{l,u}" are represented in unary (instead of binary). For instance, a pseudopolynomial algorithm can evaluate threshold queries of the form ∃^{l,u} x̄ Q(x̄) by keeping t + 1 intermediate results in memory, where t is the relevant threshold value, which is not possible in a polynomial-time algorithm. Let us revisit Example 3.1 and rewrite the query in a suitable way to produce its output.
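The pseudopolynomial idea for at-least queries ∃^{≥t} can be sketched minimally as follows (a sketch, not the paper's algorithm): generate answers and stop as soon as t distinct ones have been seen, so at most t intermediate results are ever stored. The answer stream below is a stand-in for any CQ evaluator.

```python
def at_least(t, answers):
    """Decide whether the answer stream contains at least t distinct
    answers, storing at most t of them at any time."""
    if t == 0:
        return True
    seen = set()
    for a in answers:
        seen.add(a)
        if len(seen) >= t:   # early exit after t distinct answers
            return True
    return False

# A generator standing in for the answers of some CQ over a large database.
answers = ((x, x % 7) for x in range(1_000_000))
print(at_least(10, answers))                 # True, after inspecting 10 answers
print(at_least(3, iter([(1, 2), (1, 2)])))   # False: only one distinct answer
```

The memory footprint is O(t), which is polynomial when t is written in unary but exponential in the binary representation of t.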
The next rewriting is inspired by research on constant-delay enumeration and answer counting for CQs: V(y, z) ← Subsidiary(y, w), Shareholder(w, z). The standard approach for constant-delay enumeration algorithms [5] first has a preprocessing phase, in which it materializes V and groups Assets and V by y. In the enumeration phase it iterates over possible values of y and, for each value of y, over the contents of the corresponding groups of Assets and V. The complexity of the preprocessing phase is then affected by the cost of materializing V, which can be quadratic in the worst case. Overall, the complexity is O(n² · log n) over databases of size n. Evaluating a threshold query is closely related to constant-delay enumeration and to counting answers to CQs. Indeed, a threshold query of the form ∃^{l,u} x̄ Q(x̄) can be evaluated by enumerating answers to Q up to threshold u + 1 or by counting all answers to Q. For both these tasks, however, tractability relies on more restrictive parameters of the query. For enumeration, tree-width needs to be replaced with its free-connex variant [5], which treats answer variables in a special way. For counting, the additional parameter is the star-size [31]. Intuitively, it measures how many answer variables are maximally connected to a non-answer variable. The query Q has star-size 2 because the existentially quantified variable w is connected to two answer variables, y and z. Thus, in the general approaches to constant-delay enumeration and counting, complexity is very sensitive to the interaction between answer and non-answer variables. Our key insight is that in the presence of a threshold this is no longer the case and low tree-width is sufficient. Example 3.4. Consider again query Q from Example 3.1 and suppose that we should return up to t answers. We can rely on the rewriting in Example 3.1, but we need to store additional information when materializing the views. For each w in V_2 we store up to t witnessing values of z such that Shareholder(w, z) holds.
Similarly, for each y in V_1 we store up to t values of z that were stored as witnesses for some w in V_2 with Subsidiary(y, w). Now, we can obtain up to t answers to Q by taking the join of Assets(x, y) with V_1(y) and iterating through the witnessing values of z for each y. Both extended materialization steps, as well as the final computation of answers, can be realized in time O(t · n · log(t · n)). If we are to count answers up to threshold t, we can just count the ones returned by the algorithm above. This idea allows evaluating low tree-width TQs of the form ∃^{l,u} x̄ Q(x̄) in pseudopolynomial time. It is also crucial in our treatment of general TQs of the form Q_0(x̄) ∧ ∃^{l,u} z̄ Q_1(x̄, z̄) but, as the following proposition shows, we cannot expect pseudopolynomial evaluation algorithms even for acyclic threshold queries. Our proof uses a reduction from MINSAT [51]. The reason is that acyclic queries can have arbitrarily high free-connex tree-width, which is the actual source of hardness. For TQs of bounded free-connex tree-width, our approach will yield pseudopolynomial evaluation algorithms for all variants (E1)–(E5). That is, our results are tight in terms of combined complexity. In the remainder of the paper, we will analyze algorithms using Õ-notation. We will use this notation to reflect the data complexity of the algorithms and to hide logarithmic factors. Essentially, using Õ allows us to freely use sorting and indexes such as B-trees. For instance, if we say that something can be done in time Õ(n²), we mean that its data complexity is in time O(n² log n). In this section we define the notions of widths and decompositions informally discussed in Section 3, explore in depth the approach to threshold queries via exact counting, develop the idea illustrated in Example 3.4 to cover arbitrary CQs in a slightly more general setting involving grouping, and employ the obtained algorithm to construct a single data structure that supports constant-delay enumeration, counting, and sampling of answers to TQs.
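The witness-bounded materialization of Example 3.4 can be sketched on a generic three-atom chain Q(x, w) ← R(x, y), S(y, z), T(z, w); the relation and variable names here are illustrative, not the paper's. Each view keeps at most t witnesses per key, so no intermediate result exceeds t entries per group:

```python
from collections import defaultdict

def chain_answers_up_to(R, S, T, t):
    """Up to t answers of Q(x, w) <- R(x,y), S(y,z), T(z,w),
    keeping at most t witnesses per group in each view."""
    # View over T: for each z, up to t witnessing values w.
    wit_T = defaultdict(list)
    for z, w in T:
        if len(wit_T[z]) < t:
            wit_T[z].append(w)
    # View over S joined with the T-view: for each y, up to t pairs (z, w).
    wit_S = defaultdict(list)
    for y, z in S:
        for w in wit_T.get(z, []):
            if len(wit_S[y]) < t:
                wit_S[y].append((z, w))
    # Final join with R, stopping after t answers overall.
    out = []
    for x, y in R:
        for _, w in wit_S.get(y, []):
            out.append((x, w))
            if len(out) == t:
                return out
    return out

R = [(1, "a"), (2, "a")]
S = [("a", "p"), ("a", "q")]
T = [("p", 10), ("p", 11), ("q", 12)]
print(chain_answers_up_to(R, S, T, t=3))
```

With t = 3 and the toy relations above, the call returns the first three answers and never materializes the full (potentially quadratic) join; under set semantics duplicates would additionally need to be pruned, which the sketch omits.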
The rewritings discussed in Section 3 are guided by tree decompositions of queries. A tree decomposition of a conjunctive query Q is a finite tree T with a set B_v ⊆ Var(Q), called a bag, assigned to each node v of T, satisfying the following conditions: (1) for each atom A of Q there exists a node v of T such that Var(A) ⊆ B_v (we say that v covers A); (2) for each variable x ∈ Var(Q), the set of nodes v of T such that x ∈ B_v forms a connected subgraph of T. By the width of T we shall understand max_v |B_v|.² The tree-width of query Q, written as tw(Q), is the minimal width of a tree decomposition of Q. For example, T_1 in Figure 1 is a tree decomposition of width 2 for the query Q in Example 3.1. Since Q contains atoms involving two variables, it does not admit a tree decomposition of width 1. Hence, tw(Q) = 2. A tree decomposition T of a query Q is S-connex for a set S ⊆ Var(Q) if there exists a connected subset N of nodes of T, containing the root of T, such that the union of the bags associated to nodes in N is precisely S. Note that there is exactly one such N that is maximal: it is the one that includes all nodes v with B_v ⊆ S. We shall refer to it as the maximal S-connex set in T. The S-connex tree-width of Q is the minimal width of an S-connex tree decomposition of Q. If S = FVar(Q), we speak of free-connex decompositions and free-connex tree-width; we write fc-tw(Q) for the free-connex tree-width of Q. For instance, T_2 in Figure 1 is a free-connex tree decomposition of width 3 for the query Q of Example 3.1. It is not hard to see that Q has no free-connex decomposition of width 2. That is, tw(Q) = 2 but fc-tw(Q) = 3. This difference can be arbitrarily large, e.g., for the CQs we used in the proof of Proposition 3.2. A tree decomposition T of Q is S-rooted, for S ⊆ Var(Q), if the root bag of T is exactly S. The S-rooted tree-width of Q is the minimal width of an S-rooted tree decomposition. By analogy to free-connex, if S = FVar(Q), we speak of free-rooted decompositions and free-rooted tree-width.
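Both conditions of the definition can be checked mechanically. Below is a minimal sketch (bags as sets, the tree as a child-to-parent map); the width-2 chain decomposition of a path-shaped query is an assumed example, not a figure from the paper:

```python
def is_tree_decomposition(bags, parent, atoms):
    """Check conditions (1) and (2): every atom is covered by some bag,
    and for every variable the nodes containing it form a connected subtree."""
    # (1) coverage: some bag contains all variables of each atom
    if not all(any(set(atom) <= bags[v] for v in bags) for atom in atoms):
        return False
    # (2) connectedness: in a rooted tree, a node set is connected iff
    # exactly one of its nodes has its parent outside the set.
    variables = set().union(*(set(atom) for atom in atoms))
    for x in variables:
        nodes = {v for v in bags if x in bags[v]}
        tops = set()
        for v in nodes:
            while parent.get(v) in nodes:   # climb while staying inside
                v = parent[v]
            tops.add(v)
        if len(tops) > 1:
            return False
    return True

# Width-2 chain decomposition of a path-shaped query (assumed example)
atoms = [("x", "y"), ("y", "z"), ("z", "w")]
bags = {0: {"x", "y"}, 1: {"y", "z"}, 2: {"z", "w"}}
parent = {1: 0, 2: 1}                               # node 0 is the root
print(is_tree_decomposition(bags, parent, atoms))   # True
print(max(len(b) for b in bags.values()))           # width 2
```

The width reported at the end follows the paper's convention (maximum bag size).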
It is convenient to work with tree decompositions of a special shape. A node v of T is: a project node if it has exactly one child w and B_v ⊊ B_w; a join node if it has exactly two children, v_1 and v_2, and B_v = B_{v_1} ∪ B_{v_2}. We say that v is safe if each variable in B_v occurs in an atom of Q that is covered either by v or by a descendant of v. We say that T is nice if each node of T is either a leaf or a project node or a join node, and all nodes of T are safe. We now introduce some notions that will be useful later. With each node v in a tree decomposition T of Q we associate the subquery q_v obtained by taking all atoms of Q over the variables appearing in the subtree of T rooted at v, and quantifying existentially all used variables except those in B_v ∪ FVar(Q). Let q̌_v be the full CQ obtained by taking all atoms of q_v that do not occur in q_{v_1} ∧ q_{v_2} ∧ ⋯ ∧ q_{v_ℓ}, where v_1, v_2, …, v_ℓ are the children of v; we then have Var(q̌_v) = FVar(q̌_v) ⊆ B_v. For instance, for the query Q discussed in Section 3 and its tree decomposition T_1 shown in Figure 1, with nodes numbered 0, 1, 2 starting from the root, we have q_0 = Q = ∃w Assets(x, y) ∧ Subsidiary(y, w) ∧ Shareholder(w, z), q̌_0 = Assets(x, y), q_1 = Subsidiary(y, w) ∧ Shareholder(w, z), q̌_1 = Subsidiary(y, w), and q_2 = q̌_2 = Shareholder(w, z). In general, if v is a leaf then q_v = q̌_v, and if v is the root then q_v = Q. One can evaluate Q on a database D by computing Ω_v = q_v(D) bottom-up, as follows. Assuming that T is nice: if v is a leaf, then Ω_v = q̌_v(D); if v is a project node with child w, then Ω_v = π_{FVar(q_v)}(Ω_w); and if v is a join node with children v_1 and v_2, then Ω_v = Ω_{v_1} ⋈ Ω_{v_2} ⋈ q̌_v(D). If T is free-rooted, then FVar(q_v) ⊆ B_v for each v, and the above computation can be performed in time Õ(|D|^k), where k is the width of T. The following simple fact will also be useful. Lemma 4.2. Let T be a nice tree decomposition of a CQ Q. (1) For each node v in T it holds that B_v ⊆ FVar(q_v). (2) If T is S-connex and N is the maximal S-connex set in T, then each node in N either has all its children in N or it is an N-leaf, that is, it has no children in N. Tree decompositions are not easy to find.
Indeed, determining whether an arbitrary graph admits a tree decomposition of width at most k is NP-hard [3]. However, the problem has been studied in great depth and there is an ongoing effort to make these approaches practical at large scale. For instance, computing tree decompositions of large graphs was twice the topic of the PACE Challenge [28, 29]. However, in the present context, we only want to compute tree decompositions of queries, which are very small in practice. There are libraries available [30, 35] that can find optimal tree decompositions of queries very efficiently. Indeed, DetkDecomp [30] was used to compute the tree-width of more than 800 million real-world queries [14–16] and worked very efficiently. Importantly for us, the analysis in [14–16] showed that real-life queries have very low tree-width. Each algorithm for computing tree decompositions can also be used to compute S-rooted and S-connex tree decompositions, with quadratic overhead [5]. In our complexity estimations we rely on Bodlaender's algorithm [11], which allows computing optimal tree decompositions in linear time (assuming bounded tree-width). Processing a threshold query Q involves counting, for each answer η in Q_0(D), how many answers in Q_1(D) are compatible with η. Formally, this means that we need to determine the size of Q_1(D, η) for each η ∈ Q_0(D), where Q_1(D, η) is the set of answers in Q_1(D) that are compatible with η. So we need to solve the following computational problem for Q_1. Counting answers to a CQ Q grouped by X ⊆ FVar(Q) over a database D consists in computing all pairs (η, c) such that η : X → Adom(D) and c = |Q(D, η)|. Counting answers can leverage low star-size [31]. While the original notion is designed to fit hypertree decompositions, we shall work with a slightly faster-growing variant that fits tree decompositions better and is much easier to define; the two variants coincide for queries using at most binary atoms.
The star-size of a conjunctive query Q, written ss(Q), is the least positive integer d such that, by grouping atoms and pushing quantifiers down, we can rewrite Q as Q_1 ∧ Q_2 ∧ ⋯ ∧ Q_ℓ with |FVar(Q_i)| ≤ d or QVar(Q_i) = ∅ for all i. The following is a routine generalization of the result on counting answers [31]. When we consider (constant-delay) enumeration algorithms, we see that the state-of-the-art approaches, e.g., [5, 41], use a different parameter of the query. Instead of bounded star-size, these rely on bounded free-connex tree-width. At first sight, this difference is not surprising, because in the absence of a threshold, counting and enumeration cannot be reduced to each other. But a closer look reveals that both parameters play a similar role: limiting them allows one to reduce the problem to the much simpler case of full CQs by rewriting the input query as a join of views of bounded arity. This is readily visible for the star-size method, where the views correspond to the queries Q_i in the definition of the star-size of Q, but it is also true for the free-connex method: there, the views correspond to subtrees of the decomposition rooted at the shallowest nodes holding an existentially quantified variable. Hence, the methods can be used interchangeably and we can replace star-size with free-connex tree-width in Proposition 4.3. Below, the X-rooted free-connex tree-width of a query is the minimal width of a tree decomposition of the query that is both X-rooted and free-connex. Proposition 4.4. Counting answers grouped by X for conjunctive queries of X-rooted free-connex tree-width k over databases of size n can be done in time Õ(n^k). In particular, all answers to Q can be counted in Õ(n^{tw(Q)·ss(Q)}) by Proposition 4.3 or in Õ(n^{fc-tw(Q)}) by Proposition 4.4. The following lemma shows that the latter bound is tighter, so we shall rely on Proposition 4.4. The same holds for the X-rooted variant and the X-connex variant, for every X ⊆ FVar(Q).
Moreover, there exist CQs with arbitrarily large fc-tw( ) that satisfy fc-tw( ) ≤ √︁ tw( ) · ss( ) + 1. Let us now come back to the threshold query . By a tree decomposition of we shall mean a tree decomposition of the associated CQ ∧ . Based on this, we define all variants of treewidth for threshold queries just like for CQs. Importantly, freeconnex refers to FVar( )-connex, not FVar( ∧ )-connex. Applying Proposition 4.4 requires an FVar( )-connex decomposition; that is, FTVar( )-connex in terms of , where FTVar( ) = FVar( ) ∪TVar( ). Moreover, to avoid manipulating full answers, we need to factorize the evaluation of and counting answers to grouped by FVar( ) in a compatible way. This can be done if our decomposition is also FVar( )-connex. Putting it together, we arrive at free-connex FTVar( )-connex decompositions of . Example 4.6. Consider an at-least query . Figure 1 shows a free-connex FTVar( )-connex tree decomposition 3 of of minimal width, which is 3. The subqueries ( , ) and ( , ) correspond to bags { } and { }, respectively, and the subtrees rooted at these bags are { }-rooted (resp. { }-rooted) free-connex tree decompositions for these subqueries. Consider the input database in Figure 2 . Let us count the answers to ( , ) grouped by and the answers to ( , ) grouped by (using Proposition 4.4) and store the resulting pairs in sets { } and { } , respectively. We , where a pair ( , ) in { } means that for = there are witnessing values of , and similarly for { } . This is the initial information that we shall now propagate up the tree. In the set { , } we put triples ( , , ) such that ( , ) and for = there are witnessing values of ; analogously for { , } . We . These sets can be computed based on { } and { } , respectively. The set { } stores pairs ( , ) such that is the maximal number of witnessing pairs of values for and that any combination of = and = with ( , ) and ( , ) can provide. In our case, { } = ( 1 , 6) . 
Because the two sides can be chosen independently from each other, the only information that needs to be passed up from the lower sets is, for each value, the maximal number of witnesses it can provide, as defined above. We return YES iff the root set contains a pair with a count of at least 3; in our case, it does. Theorem 4.7. Boolean evaluation of an at-most or at-least query Q of free-connex FTVar(Q)-connex tree-width k over databases of size n can be done in time Õ(n^k). The combined complexity of the algorithm is polynomial, assuming k is constant. In terms of combined complexity, Theorem 4.7 is optimal, as both conditions imposed on tree decompositions are needed. Indeed, the hard TQs used in Proposition 3.5 have FTVar(Q)-connex tree-width 2, and the hard Boolean TQs stemming from Proposition 3.2 have free-connex tree-width 2. Algorithm 1: Answers to Q grouped by X up to threshold t. 1: T ← a nice X-rooted tree decomposition of Q 2: loop through the nodes v of T bottom-up 3: if v is a leaf then Ω_v ← q̌_v(D) 4: if v is a project node with child w then 5: Ω_v ← prune_t(π_{FVar(q_v)}(Ω_w)) 6: if v is a join node with children v_1, v_2 then 7: Ω_v ← prune_t(Ω_{v_1} ⋈ Ω_{v_2} ⋈ q̌_v(D)) In terms of data complexity, however, we can improve on Theorem 4.7. Here we exploit the presence of thresholds to handle queries with unbounded FTVar(Q)-connex tree-width. We will cover not only at-least and at-most queries but arbitrary TQs, and we will solve not only Boolean evaluation but also constant-delay enumeration, counting answers, and sampling, all based on a single data structure. The small price we have to pay is pseudopolynomial combined complexity. Our starting point is again the problem of counting grouped answers, but this time up to a threshold. Counting answers to Q grouped by X ⊆ FVar(Q) up to a threshold t over a database D consists in computing the set of pairs (η, c) such that η : X → Adom(D) and c = min(t, |Q(D, η)|). In the presence of a threshold, it is not impractical to solve counting by enumerating answers. This is what we shall do.
Computing answers to a query Q grouped by X ⊆ FVar(Q) up to a threshold t over a database D consists in computing a subset Ω of Q(D) that is complete for X and t; i.e., for each η : X → Adom(D), either Q(D, η) ⊆ Ω and |Q(D, η)| ≤ t, or |Q(D, η) ∩ Ω| = t. Consider a CQ Q, a set X ⊆ FVar(Q) of grouping variables, and a database D. We will use the term group to refer to a set of answers to Q that agree on the grouping variables X; that is, a subset of Q(D) of the form Q(D, η) for some η : X → Adom(D). If the X-rooted tree-width of Q is k, then |X| ≤ k and the number of groups is O(|D|^k). Consequently, if we can compute answers to Q grouped by X up to threshold t in time Õ(t · |D|^k), then we can also count them within the same time, because we will get at most t answers per group. Additionally, for each η : X → Adom(D) with Q(D, η) = ∅, we need to include (η, 0) in the result; this does not affect the complexity bound. Hence, it suffices to show how to compute answers grouped by X up to threshold t. Having seen an illustrating example in Section 3, we are ready for the full solution. Theorem 4.8. Computing answers grouped by X up to a threshold t for conjunctive queries of X-rooted tree-width k over databases of size n can be done in time Õ(t · n^k). Proof sketch. Consider Algorithm 1. We begin by computing a nice X-rooted tree decomposition T of minimal width k, as described in Section 4.1. That is, each bag of T has size at most k and the root bag is X. Consider the queries q_v associated to the nodes v of T. By Lemma 4.2, B_v ⊆ FVar(q_v). By analogy to the evaluation algorithm described in Section 4.1, for each v we solve the problem of computing answers to q_v grouped by B_v up to threshold t over D; that is, we compute a subset Ω_v ⊆ q_v(D) that is complete for B_v and t. The final answer is then the set obtained in the root. If v is a leaf, then B_v = FVar(q_v) = Var(q_v) and q_v(D) itself is the only subset of q_v(D) complete for B_v and t. Because |Var(q_v)| ≤ k, one can compute q_v(D) in time Õ(|D|^k). Consider a project node v and its unique child w. Then B_v ⊊ B_w.
To compute A_n, the algorithm projects A_{n′} on FVar(q_n) and then, using the operation prune_t, it groups the resulting set of bindings by B_n and keeps only t bindings from each group. It is clear that this can be done in time Õ(t · |D|^k). Finally, consider a join node n with children n1, n2. As explained in Section 4.1, the set q_n(D) is then the join of q_{n1}(D), q_{n2}(D), and the relations associated with the bag of n. We compute A_n in exactly the same way, except that for each binding of the variables in B_n we only keep the first t bindings extending it and discard the remaining ones. A naive implementation takes time Õ(t² · |D|^k), but it is easy to modify the standard merge-join algorithm to achieve this in time Õ(t · |D|^k). It is routine to check that each A_n is complete for B_n and t. □ Recall that t is very small in practice and most queries can be expected to be acyclic [15]. If we are just interested in returning answers up to a threshold (no grouping), then the algorithm underlying the present theorem improves the state-of-the-art algorithm from Õ(|D|^f) to Õ(t · |D|) for acyclic queries, where f is the free-connex tree-width of the query, which can be large even for acyclic queries. We now turn to processing general threshold queries. We give a unified approach to constant-delay enumeration, counting, and sampling, based on a single data structure that can be computed using the methods presented in Section 4.3. Example 4.9. Consider again the database from Figure 2 and the query from Example 4.6. This time we shall work with the free-connex tree decomposition T4 of the query shown in Figure 1; it has smaller width than T3. Like before, the subqueries ( , ) and ( , ) correspond to bags { } and { }, respectively, and the subtrees rooted at these bags are { }-rooted (resp. { }-rooted) tree decompositions for these subqueries, but they are not free-connex. We count the answers to grouped by and the answers to grouped by , up to threshold 3, as described in Section 4.3, and store the results in sets { } and { }, respectively.
Because the counts obtained in Example 4.6 were all below 3, the sets { } and { } are just like before. The sets { , } and { , } are also computed as before. For Boolean evaluation we would now put into ′ { , } all pairs ( , ) such that is the maximal number of witnessing values that any = with ( , ) can provide. To support constant-delay enumeration we need to pass more information up the tree: we include all pairs ( , ) such that some = with ( , ) can provide witnesses (up to threshold 3); that is, we forget the values , but we keep all values (up to 3), not only the greatest of them. [Algorithm 2 (records for a threshold query): process the nodes of the decomposition bottom-up; at leaves apply Theorem 4.8, at project nodes group by the bag, aggregating witness counts and summing multiplicities, at join nodes combine the children's records, and at the root keep the records whose witness count lies within the prescribed bounds.] In the enumeration phase, we iterate over all suitable combinations; for each such combination we iterate over the corresponding records ( , , ) ∈ { , } and ( , , ) ∈ { , } and return the combined answer. To sample an answer, we first sample ∈ { } with weights given by the multiplicities. In our case we choose 1 with probability 1. Then we sample ( , ), ( , ) ∈ { , } × { , } such that · ≥ 3, with weights given by the product of the multiplicities; in our case there is only one choice, and, overall, each answer is returned with probability 1/4. Theorem 4.10. For TQs of free-connex tree-width k, over databases of size |D|, one can count answers in time Õ(t · |D|^k), enumerate answers with constant delay after Õ(t · |D|^k) preprocessing, and sample answers uniformly at random in constant time after Õ(t · |D|^k) preprocessing. Assuming k is constant, the combined complexity of each of these algorithms is pseudopolynomial. Compared to [5, 31], this theorem provides the same complexity guarantees for counting and enumerating answers to conjunctive queries, but we are able to generalize these to threshold queries (and add sampling).
This generalization comes at the small cost of a dependence on the value of the threshold. The results in [5, 31] can be strengthened to more refined measures such as hypertree-width, and we believe that the same is true here. Proof sketch. Consider a database D and a threshold query. We write t′ for the threshold up to which we will be counting witnesses: if t < ∞, we let t′ = t + 1; if t = ∞, we let t′ = 1. Let T be a nice free-connex tree decomposition of the query, of width k, and let F be the maximal free-connex set in T. Being a tree decomposition of q(x̄, ȳ) = q′(x̄) ∧ q″(x̄, ȳ), T is, up to dropping unused variables, a tree decomposition of both q′(x̄) and q″(x̄, ȳ). Let q′_n and q″_n be the corresponding subqueries associated to node n. With each node n ∈ F we associate a set R_n of records that will provide the information necessary to enumerate, count, and sample answers to q. A record for n ∈ F is a triple (η, m, w) consisting of a binding η : B_n → Adom(D), a multiplicity m > 0, and a witness count w ∈ {0, 1, . . . , t′} such that there are exactly m extensions η′ of η to FVar(q′_n) ∪ (FVar(q″_n) ∩ FVar(q)) such that π_{FVar(q′_n)}(η′) ∈ q′_n(D) and w = min(t′, |q″_n(D, η′)|). We can interpret a record (η, m, w) as the binding extending η to two fresh variables, multiplicity and witnessCount, with values m and w. By Lemma 4.2, each node in F either is an F-leaf or has all its children in F. Consequently, we can compute R_n for n ∈ F bottom-up, as in Algorithm 2. In F-leaves, we use Theorem 4.8 to get a solution to the problem of counting answers to q″_n grouped by FVar(q′_n) ∩ FVar(q″_n) up to threshold t′ over D; the set q′_n(D) of all answers to q′_n over D is computed as explained in Section 4.1. Higher up the tree, we compute R_n based on the values obtained for the children of n, by means of standard relational operators with multiset semantics, including grouping (γ) and antijoin (⊲), as well as the t′-join R1 ⊲⊳_{t′} R2, defined as the multiset of records (η1 ⊲⊳ η2, m1 · m2, min(t′, w1 · w2)) such that (η1, m1, w1) ∈ R1, (η2, m2, w2) ∈ R2, and η1 is compatible with η2. Each R_n is computed in time Õ(t′ · |D|^k).
With all of these sets at hand, we can enumerate, count, and sample answers as in Example 4.9. □ In this section we give evidence that threshold queries are indeed common and useful in practice. Furthermore, we present an experimental evaluation of our algorithm. We first present some analytical results on large-scale real-world query logs from Wikidata's SPARQL query service [68]. Our study considers a corpus of more than 560M queries. (Previous work has considered a subset in the order of 200M queries [14].) These logs contain a massive amount of real-life queries, which are classified into (1) robotic (high-volume, single-source bursts) and (2) organic (human-in-the-loop) [57]. Furthermore, the logs distinguish between successful ("OK") and timeout ("TO") requests. Occurrences of Threshold Queries. We first report on the usage of the keywords LIMIT and ORDER BY in the Wikidata logs, which contain ∼563M well-formed queries, among which ∼74M are unique (Table 1). [Figure 3: Threshold value occurrence ratio in organic and robotic CRPQ logs.] Since our algorithms apply to CRPQs, we focus on those. As can be seen in the table, these still constitute 45.2% of the queries and 56.4% of the unique ones. In the remainder of this part, whenever we write a percentage as X% (Y%), X refers to all queries and Y to the unique queries, i.e., the set of queries after removing duplicates. If we simply investigate how many CRPQs use LIMIT (columns All and Unique), the numbers are not so spectacular. Indeed, around 6% (4.5%) of the CRPQs use the LIMIT operator. What is remarkable, though, is that almost all of the queries that use LIMIT do not use an ORDER BY operation (bottom line of Table 1). By looking at the data more closely, we discovered that many of these queries are rather trivial in the sense that they only use one atom (or, equivalently, just one RDF triple pattern), which means that they do not even perform a join.
For this reason, we decided to zoom in on the queries with at least two atoms; see the columns All (≥ 2) and Unique (≥ 2). It turns out that the numbers change significantly: around 14.9% (45.3%) of the CRPQs with at least two triples use LIMIT, which is a significant amount. Again, we see that almost all the queries that use LIMIT do not use ORDER BY. This is remarkable, because a commonly held belief is that LIMIT is most often used in combination with ORDER BY, i.e., as a means to express top-k queries. But in this major real-world query log, this is not the case. Indeed, almost all non-trivial queries using LIMIT are threshold queries, seeking to return just an arbitrary unranked sample of results. In fact, we have run the same analysis on a broader class of queries (CRPQs extended with unary filter conditions) and the percentages were very similar. LIMIT Values. We now investigate the values of the thresholds of queries in the logs. To this end, we considered the subset of CRPQs in the raw logs that use the keyword LIMIT (∼15.2M queries, including duplicates). To gain deeper insight, we break down the logs into robotic (∼15.2M) and organic (∼44k) queries. Fig. 3 shows the relative shares of threshold values with varying numbers of digits. The figure shows that threshold values between 1 and 9 are the most common. Still, in both organic and robotic queries, we see that large threshold values (≥ 10K) are not uncommon. Since robotic logs can have large bursts of similar queries, their distribution is not as smooth as that of the organic logs. For instance, only 0.08% of queries in the robotic logs have three-digit limit values (depicted in blue), whereas there are many more (9.2%) with four-digit values (depicted in green). The largest value we found in robotic logs had 9 digits. In organic logs, however, we found three limit values containing 11, 14, and 17 digits, respectively. Conclusion of Quantitative Study.
Threshold queries are indeed quite common, e.g., in querying knowledge bases such as Wikidata. Since the actual values of the thresholds are typically small, our empirical study confirms the practical utility of our pseudopolynomial algorithm (Theorem 4.10), which, in order to evaluate queries with a threshold t, only needs to maintain up to t + 1 intermediate results per candidate answer tuple. This is in contrast with traditional query plans, where the number of intermediate results per candidate answer tuple is determined by the input data and is therefore potentially orders of magnitude larger. To further demonstrate the usefulness of threshold queries in practice across diverse contemporary domains, we also performed a qualitative study on two real-world graph datasets. Covid-19 Dataset. The Covid-19 Knowledge Graph [25] is a continuously evolving dataset, with more than 10M nodes and 25M edges, obtained by integrating various data sources, including gene expression repositories (e.g., the Genotype Tissue Expression (GTEx) and the COVID-19 Disease Map genes), as well as article collections from different scientific domains (ArXiv, BioRxiv, MedRxiv, PubMed, and PubMed Central). The inferred schema of this graph exhibits more than 60 distinct node labels and more than 70 distinct edge labels [53]. Such typing information is, however, not sufficient to express the domain-specific constraints that can be found in these real-life graph datasets. Non-trivial constraints expressible with threshold queries can be naturally crafted in order to complement the schema, as we showcase in our study. Wikidata Dataset. Wikidata is a collaborative knowledge base launched in 2012 and hosted by the Wikimedia Foundation [66]. By the efforts of thousands of volunteers, the project has produced a large, open knowledge base with numerous applications.
Wikidata can be seen as a graph database with a SPARQL endpoint that lets users and APIs formulate queries on its massive knowledge graph. The query logs collected over the years on this endpoint [52] constitute a useful resource for the qualitative analysis of threshold queries. Working Method. We manually inspected the Covid-19 Knowledge Graph schema in search of constraints that can be validated with threshold queries. We also found a number of threshold queries by sieving through a large sample of Wikidata query logs. We analyzed the structural properties of the collected threshold queries. Below we discuss our findings with the help of five selected threshold queries from the two datasets. Selected Queries. The first example comes from the Wikidata logs and appears to be a data exploration query. TQ1 Return all given names with more than 1000 occurrences. SELECT ?name WHERE { ?person <given_name> ?name } GROUP BY ?name HAVING (COUNT(*) > 1000) A structurally similar query can help detect integrity violations in the Covid-19 Knowledge Graph. Namely, the latest release (June 2021) of the Gene Ontology (GO) contains 43917 valid terms [23, 24]. Therefore, in the Covid-19 Knowledge Graph a protein can be flagged as suspicious if it exhibits more than this number of associated GO terms. TQ2 Find each protein that has more than 43917 associated gene ontology terms. Notice the large threshold and the complex regular expression. As another example, reporting coverage in the Covid-19 Knowledge Graph demands that for each age group, in each country, there should be at least three reports for the current number of Covid cases (one for females, one for males, and one for the total). Deviations can be identified with the following threshold query. TQ3 Find each country that does not have three reports for some age group. This query can be expressed in Cypher-like syntax as follows.
MATCH (c:Country)-[e:CURRENT_FEMALE|CURRENT_MALE|CURRENT_TOTAL]->(a:AgeGroup) WITH c, a, COUNT(type(e)) AS ecount WHERE ecount < 3 RETURN c, a; We point out that this query, although acyclic, is not free-connex. We also discovered that graph patterns in LIMIT queries can be quite large, as in the following query from the Wikidata logs. TQ4 Return 1000 tuples containing names (x1) and UniProt IDs (x2) for genes with domains (x3) referenced in "Science", "Nature," or "Cell" articles, as well as their indirect domain label (x4) and class (x5), the article name (x6), and its PubMed ID (x7). Conclusion of Qualitative Study. Our qualitative study discovered several interesting and realistic uses of threshold queries, which we illustrated here. Connecting these to our algorithms, it is interesting to note that all these queries are acyclic, but only TQ1-TQ2 are free-connex (Fig. 4). For instance, TQ4 has star size 3 (due to x3, which has 3 neighboring nodes that propagate to the output) and free-connex tree-width 4. Evaluating such queries with existing querying approaches may incur significant cost, while our algorithms handle them very efficiently. For instance, TQ4 can be evaluated in linear time by Algorithm 1 (up to logarithmic factors), while the approach via enumeration [5] requires at least cubic time. As a proof of concept, we implemented the algorithm of Theorem 4.8 in SQL and compared it with the optimizer of a popular DBMS engine, namely the PostgreSQL 13.4 optimizer. Using SQL window functions, our implementation can mimic some important aspects of our query evaluation algorithm (such as internal information passing up to a threshold), although we note that SQL does not allow us to capture our algorithm precisely. As shown in the remainder of this section, the results of this comparison already demonstrate the superiority of our algorithm for threshold queries over naive evaluation.
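To illustrate the flavor of such formulations (the paper's exact SQL is not reproduced here; the table and column names below are made up), the following sketch contrasts a baseline formulation with a window-function formulation of 2-neigh with threshold 3 on SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R(src INTEGER, dst INTEGER)")
# A small star-shaped graph: node 0 has four distinct 2-hop neighbours.
edges = [(0, i) for i in range(1, 5)] + [(i, 10 + i) for i in range(1, 5)]
con.executemany("INSERT INTO R VALUES (?, ?)", edges)

# Baseline: join, group, then count everything.
baseline = """
SELECT r1.src FROM R r1 JOIN R r2 ON r1.dst = r2.src
GROUP BY r1.src HAVING COUNT(DISTINCT r2.dst) >= 3
"""

# Windowed: ROW_NUMBER caps the rows kept per group at the threshold
# before the final step, mimicking information passing up to a threshold.
windowed = """
WITH S AS (
  SELECT r1.src AS x0, r2.dst AS x2,
         ROW_NUMBER() OVER (PARTITION BY r1.src, r2.dst) AS w
  FROM R r1 JOIN R r2 ON r1.dst = r2.src
),
T AS (
  SELECT x0, ROW_NUMBER() OVER (PARTITION BY x0) AS n
  FROM S WHERE w = 1
)
SELECT x0 FROM T WHERE n >= 3 GROUP BY x0
"""

print(con.execute(baseline).fetchall())   # → [(0,)]
print(con.execute(windowed).fetchall())   # → [(0,)]
```

Both formulations return the same answer; the difference is that the windowed one never carries more than the threshold number of rows per group into the final aggregation.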
All the experiments were executed on an Intel Core i7-4770K CPU @ 3.50GHz, 16GB of RAM, and an SSD. We used PostgreSQL 13.4 on Linux Mint 20.2 and built our own micro-benchmark [12] consisting of the following three types of query templates: (q1) k-path selects up to 10 pairs of nodes linked by a k-hop path; (q2) k-neigh selects all nodes with ≥ 10 k-hop neighbors; (q3) k-conn selects all pairs of nodes linked by ≥ 10 k-hop paths. Assuming that the database consists of a single binary relation R, these queries for k = 2 can be naturally formulated in SQL. In the following, we will refer to these formulations as the baseline formulations of the queries. We compared these queries with alternative formulations in SQL that mimic crucial aspects of our evaluation algorithm and that can be understood as follows. A simple decomposition for k-path is a tree with a single branch and one node for each joined copy of table R. For this decomposition and k = 2, the algorithm in Theorem 4.8 amounts to evaluating a corresponding windowed query. For k-neigh we can use the same decomposition; the corresponding query is the same except for the last line, which becomes SELECT X0 FROM S1 GROUP BY X0 HAVING count(*) >= 10; For k-conn we can also use the same decomposition, but the corresponding query is a bit different, as we need to collect the whole paths, not just their endpoints. We refer to these alternative implementations as the windowed versions of the queries. Natural query plans for the windowed versions have worst-case complexity Õ(k · |D|) for k-path and k-neigh, and Õ(k · |D|²) for k-conn. Datasets. We considered two kinds of data sets: (1) The real-world IMDb data set used in the Join Order Benchmark [54], which contains information about movies and related facts about actors, directors, production companies, etc. In our experiments, we focused on the movie_link relation, which has a graph-like structure, so that finding paths is meaningful.
(2) A number of synthetic data sets, which are Barabási-Albert graphs [7] with varying parameters n (the total number of nodes to add) and m0 (the number of edges to attach from newly added nodes to existing nodes). We repeated each experiment several times and report the median. First Experiment. We compared the running time of the baseline and windowed versions of (q1-q3) for varying values of k on both the IMDb and synthetic data sets, both on fully indexed and non-indexed databases. Table 2 shows the results for the non-indexed case. We see that our approach (windowed) outperforms the baseline with speedups of up to three orders of magnitude, while the baseline times out (T/O) for higher values of k. The running times of the windowed versions reflect the good theoretical bounds, while the baseline clearly shows exponential dependence on k. For the indexed case, leveraging the DBMS's default indexes, the baseline ran only marginally faster (∼5%), remaining three orders of magnitude slower than our algorithm. Second Experiment. In this second experiment, we wanted to assess the impact of the structure and size of the data on the runtimes of our algorithm. To this purpose, we employed the Barabási-Albert synthetic graphs with varying outdegree (m0) and number of nodes (n) and considered query q2 with k = 3. Table 3 shows the different speedups that the windowed approach offers when compared to the baseline. We varied m0 from 5 to 25, using increments of 5, and varied n from 32 to 1M on a logarithmic scale with increment factor √10 ≈ 3.16. In the table, we see that the speedup factor of the windowed approach increases by up to three orders of magnitude as the size of the data and the out-degree m0 increase. T/O means that the baseline approach timed out (> 30 minutes). For all entries in Table 3, the windowed algorithm terminated in under 15 minutes.
These results show the robustness of our algorithm to variations of dataset size and outdegree, as well as its superiority with respect to the baseline. Top-k and Any-k. The potential of predefined thresholds to speed up query processing was first noticed by Carey and Kossmann [17], who explored ways of propagating thresholds down query plans, dubbed LIMIT pushing. This early study only considered applying the thresholds directly to subplans, which made joins a formidable obstacle. In contrast, we first group the answers to subqueries by variables determined based on the structure of the whole query, and then apply thresholds within groups; this way we can push thresholds down through multi-way joins, guided by a tree decomposition of the query. Most follow-up work concerns the ranked scenario, where the goal is to compute top-k answers according to a specified preference order. The celebrated Threshold Algorithm [32] solves the top-k selection problem: it operates on a single vertically-partitioned table, with each partition being managed by a different external service that only knows the scores of base tuples in its own partition, and produces the k tuples with the highest score while minimizing the number of reads. There are also multiple approaches to the more general top-k join problem. J* [59] is based on the A* search algorithm: it maintains a priority queue of partial and complete join combinations, ordered by the upper bounds of their scores. Rank-Join [42] maintains scores for complete join combinations only, and stops when new combinations cannot improve the current top-k. LARA-J* [58] offers improved handling of multiway joins. FRPA [34] keeps the number of reads within a constant factor of the optimal. Overall, the focus and the main challenge in top-k processing is ordering the answers according to their ranking scores [43].
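The core sorted-access loop of the Threshold Algorithm can be sketched as follows (a minimal illustration assuming a sum scoring function and in-memory lists rather than the original's external services; names are ours):

```python
import heapq

def threshold_algorithm(lists, k):
    """Top-k selection in the style of Fagin's Threshold Algorithm with
    a sum scoring function. `lists` holds, per attribute, (tuple_id,
    score) pairs sorted by descending score; random access is simulated
    with dictionaries. Returns (total_score, id) pairs, best first."""
    random_access = [dict(l) for l in lists]
    seen, top = set(), []              # top: min-heap of (total_score, id)
    for depth in range(max(len(l) for l in lists)):
        bottom = []                    # last score seen in each list
        for l in lists:
            if depth >= len(l):
                continue
            tid, score = l[depth]
            bottom.append(score)
            if tid not in seen:        # random access for the other scores
                seen.add(tid)
                total = sum(r.get(tid, 0) for r in random_access)
                heapq.heappush(top, (total, tid))
                if len(top) > k:
                    heapq.heappop(top)
        threshold = sum(bottom)        # upper bound for any unseen tuple
        if len(top) == k and top[0][0] >= threshold:
            break                      # no unseen tuple can enter the top-k
    return sorted(top, reverse=True)

lists = [[("a", 9), ("b", 8), ("c", 1)],
         [("b", 9), ("a", 7), ("c", 2)]]
print(threshold_algorithm(lists, 2))   # → [(17, 'b'), (16, 'a')]
```

The stopping condition is the point of the algorithm: the sum of the last scores seen under sorted access bounds every unseen tuple, so reads stop as soon as k seen tuples beat that bound.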
In the unranked case, when this challenge is absent, the rich body of work on top-k processing does not go beyond the initial observations made by Carey and Kossmann [17]. NeedleTail [49, 50] specifically focuses on providing any-k answers to queries without ORDER BY clauses, but it only handles key-foreign key joins, which dominate in OLAP scenarios. In contrast, we support arbitrary CQs (i.e., select-project-join queries), allowing the complexity to grow with the tree-width (Theorem 4.8). Moreover, any-k evaluation of CQs is just a building block of the processing of much more general threshold queries. A large body of research on query processing has led to powerful optimization techniques, such as aggregate pushing [19, 20, 39, 69] and sideways information passing [10, 21, 45, 56, 61]. These techniques aim to speed up the execution of a given join plan and rely on a cost model to heuristically approximate instance-optimal plans. Our focus is on reducing the search space of the heuristic methods by identifying plans with good worst-case guarantees. Such plans can be further improved towards instance-optimality using classical techniques. For LIMIT queries, the combination of sideways information passing and LIMIT pushing might be beneficial. Indeed, if we can ensure that each tuple produced by the subplan extends to a full answer, then we can stop the execution of the subplan when the desired number of tuples is output. For general threshold queries, the potential for such optimization is less clear. There, instead of a global limit on the number of answers, we have a per-group limit. Consequently, the execution of the subplan can be stopped only when each group has sufficiently many tuples. The level of savings depends on the order in which the subplan produces its results. Such optimization goes beyond the scope of our paper, but is a promising direction for future work. Quantified Graph Patterns. Fan et al.
[33] introduced quantified graph patterns (QGPs), which allow expressing nested counting properties such as having at least 5 friends, each with at least 2 pets. In contrast to threshold queries, QGPs can neither count k-hop neighbours for k ≥ 2 nor count tuples of variables. Moreover, QGPs adopt isomorphism matching, while threshold queries follow the standard semantics of database queries. Aggregate Queries. In the context of factorized databases, Bakibayev et al. [6] observed that pushing aggregation down through joins can speed up query evaluation. These results can be reinterpreted in the context of tree decompositions [60], but they optimize different aggregates in isolation and do not investigate the interplay between counting and existential quantification. AJAR (aggregations and joins over annotated relations) [46] and FAQs (functional aggregate queries) [48] are two very general sister formalisms capturing, among others, CQs enriched with multiple aggregate functions. Because different aggregate functions do not commute in general, the evaluation algorithms for both these formalisms require decompositions consistent with the order of the aggregating operations. For example, when applied to counting answers to CQs, this amounts to free-connex decompositions, as in our Proposition 4.4. In contrast, we remove the free-connex assumption by showing that counting up to a threshold and existential quantification can be reordered at the cost of keeping additional information of limited size. Counting Answers. For projection-free CQs, the complexity of counting answers is tightly tied to tree-width [26, 36], just like in the case of Boolean evaluation [37, 38], but the presence of projection makes counting answers intractable even for acyclic CQs [62]. Efficient algorithms for counting answers require not only low tree-width but also low star-size [31].
However, when the problem is relaxed to randomized approximate counting, low tree-width is enough [2], just like for Boolean evaluation. Our results imply that for a different relaxation (counting exactly, but only up to a given threshold), CQs of low tree-width can also be processed efficiently. However, we go far beyond CQs and show how to count answers to threshold queries (which themselves generalize counting answers to CQs up to a threshold). Enumerating Answers. Also in the context of constant-delay enumeration, low tree-width is not enough to guarantee efficient algorithms: the query needs to have low free-connex tree-width [5]. Importantly, even acyclic queries can have very large free-connex tree-width. Tree-width can be replaced with fractional hypertree-width [27, 47, 60] or submodular width [9], but always in the restrictive free-connex variant. Tziavelis et al. [63, 64] partially lift these results to the setting of ranked enumeration, where query answers must be enumerated according to a predefined order; the lifted results allow enumeration with logarithmic delay and handle projection-free CQs of low submodular width as well as free-connex acyclic CQs (but not general CQs). In this work, we show that if the number of needed answers is known beforehand, general CQs of low tree-width can be processed efficiently even if they have large free-connex tree-width. Moreover, this result is only the starting point for processing general threshold queries, for which we also provide constant-delay enumeration algorithms. Sampling Answers. Sampling query answers was identified as an important data management task by Chaudhuri et al. [18], who proposed a simple algorithm for sampling the join R ⋈ S by sampling a tuple r ∈ R with weight |S ⋉ {r}| and then uniformly sampling a tuple s ∈ S ⋉ {r}.
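The two-stage sampling scheme can be sketched as follows (a minimal sketch that joins on the first attribute; the names are illustrative):

```python
import random
from collections import defaultdict

def sample_join(R, S, rng=random):
    """Uniformly sample one tuple from the join of R and S on their
    first attribute: pick r in R with weight equal to the number of its
    join partners in S, then pick one of those partners uniformly."""
    by_key = defaultdict(list)
    for s in S:
        by_key[s[0]].append(s)
    # dangling tuples of R get weight 0 and are never selected
    weights = [len(by_key[r[0]]) for r in R]
    r = rng.choices(R, weights=weights, k=1)[0]
    s = rng.choice(by_key[r[0]])
    return r + s[1:]

R = [(1, "x"), (2, "y")]
S = [(1, "p"), (1, "q"), (3, "z")]
print(sample_join(R, S))  # one of (1, 'x', 'p') or (1, 'x', 'q')
```

Weighting the first stage by the number of join partners is what makes the final sample uniform over the join result rather than over R.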
Using the alias method for weighted sampling [65, 67], this algorithm can be implemented in such a way that, after a linear preprocessing phase, independent samples can be obtained in constant time. This approach was generalized to acyclic projection-free CQs [71]. We extend the latter result in three ways: we handle non-acyclic CQs, allowing the complexity to grow with the tree-width; we allow projection, at the cost of replacing tree-width with its faster-growing free-connex variant; and we handle threshold queries, rather than just CQs. A different approach to non-acyclic projection-free CQs [22] provides a uniform sampling algorithm with guarantees on the expected running time; this is incomparable to constant-time sampling after polynomial-time preprocessing, offered by our approach. Finally, Arenas et al. [2] show that efficient almost uniform sampling is possible for CQs of low tree-width. Here, almost uniform means that the algorithm approximates the uniform distribution up to a multiplicative error; this is a weaker notion than uniform sampling. Let us also reiterate that all these papers only consider CQs, not threshold queries. In this paper, we have embarked on a deep theoretical study of the newly identified class of threshold queries. Our extensive empirical study shows that threshold queries are highly relevant in practice, as witnessed by their utility in real-world knowledge graphs and their presence in massive query logs. Our theoretical investigation shows that threshold queries occupy a distinctive spot in the landscape of database querying problems. Indeed, our complexity analysis proves that threshold queries allow for more efficient evaluation than solutions for the closely related problems of counting query answers, constant-delay query answer enumeration, and top-k querying. As one of the first future steps, we intend to gauge thoroughly the performance of the proposed algorithms.
Designing adequate protocols for experimental evaluation requires a deep understanding of the relationships between the evaluation of threshold queries and the related querying problems, which we have already accomplished in the present paper. We intend to carry out a comprehensive implementation of threshold queries in an existing database system, similarly to how top-k query algorithms have been implemented and evaluated [55]. More precisely, we will implement dedicated threshold-aware variants of relational operators and then introduce them in the query planning stage. Our work has led us to identify remarkable similarities in the query answering methods tackling counting and enumerating answers: they rely on various techniques for compiling out existentially quantified variables. We plan to pursue this discovery further and give full treatment to the emerging question: is there a unifying framework for assessing threshold queries and the related problems? Further theoretical results about threshold queries can be envisioned, such as establishing a dichotomy of evaluation complexity and identifying a condition under which evaluation is strongly polynomial, rather than pseudopolynomial. This work stems from a larger effort on developing schemas for property graph databases, including cardinality constraints, led by the LDBC Property Graph Schema Working Group. We thank Andrea Calì, Leonid Libkin, Victor Lee, Juan Sequeda, Bryon Jacob, and Michael Schmidt for their useful comments and early feedback. Proof. We reduce from the well-known VALIDITY problem. Consider a Boolean formula ψ in disjunctive normal form with clauses C1, C2, . . . , Cm over variables x1, x2, . . . , xn. VALIDITY asks if ψ is valid, i.e., if ψ evaluates to true under every possible truth assignment V : {x1, . . . , xn} → {true, false}. We will define a database D, a conjunctive query q, and a threshold t such that q returns at least t answers on D iff ψ is valid.
The database D uses the binary relations R1, . . . , Rn, has Adom(D) = {T, F} ∪ {c1, c2, . . . , cm}, and is illustrated in Figure 5. More formally, we put (T, cj) in relation Ri iff assigning xi = true is compatible with Cj, i.e., does not falsify clause Cj. Similarly, we put (F, cj) in Ri iff assigning xi = false is compatible with Cj, i.e., does not falsify Cj. Consider the CQ q(z1, . . . , zn) = ∃y R1(z1, y) ∧ · · · ∧ Rn(zn, y). Notice that ψ is valid iff q returns at least 2^n tuples over D. Intuitively, each truth assignment V of x1, . . . , xn corresponds to a binding of z1, . . . , zn to the constants T and F, where we map zi to T iff V(xi) = true. The condition ⋀_{i=1}^{n} Ri(zi, y) then tests whether there exists a clause that is satisfied by V: we can map y to cj iff the clause Cj is satisfied by V. □ Proposition 3.5. Boolean evaluation of acyclic at-least and at-most threshold queries is NP-hard, even if thresholds are given in unary. Proof. For at-most queries, we reduce from MAXSAT. The database is constructed in the same way as in Proposition 3.2, except that we add a unary relation Node(·) that contains all elements of Adom(D). Boolean evaluation of the at-most query checks if there exists an assignment that satisfies at most k clauses of the DNF formula ψ. For the dual CNF formula, this means that there is an assignment that falsifies at most k of its clauses, i.e., that satisfies at least m − k of its clauses. For at-least queries, we reduce from MINSAT, which is also NP-complete [51]. Boolean evaluation of the at-least query checks if there is an assignment that satisfies at least k clauses of the DNF formula ψ. For the dual CNF formula, this means that there is an assignment that falsifies at least k of its clauses, i.e., that satisfies at most m − k of its clauses. □ Lemma 4.1. Each tree decomposition of a conjunctive query can be transformed in polynomial time into a nice tree decomposition of the same width and linear size. Moreover, if the original decomposition is Z-rooted or F-connex, so is the transformed one. Proof. We perform the transformation in several steps.
The first step is to ensure that T is safe. This is done by pulling variables up the tree. Consider an unsafe node v whose descendants are all safe. Let Z be the set of variables in B_v that violate the safety condition. Because all descendants of v are safe, it follows immediately that they do not contain variables from Z. Consequently, after removing Z from B_v, we still have a tree decomposition of Q: the connectedness condition holds and each bag covers the same atoms it covered before the modification. This also means that no safe nodes became unsafe. The node v itself, however, is now safe. Proceeding this way we make all nodes of T safe. The second step is to ensure that each internal node either is a project node or contains in its bag the union of the bags of its children. This is done simply by inserting, between each node and each of its children, an intermediate project node whose bag is the intersection of the bags of its parent and its unique child. This does not affect safety. The third step is to ensure that each internal node either is a project node or its bag is equal to the union of the bags of its children. Consider an internal node v such that B_v is a strict superset of the union U of the bags of the children of v. Because v is safe, for each variable z ∈ B_v \ U there exists an atom α_z of Q that uses z and is covered either by v or by one of the descendants of v. It cannot be covered by a strict descendant of v, because these do not contain z. Hence, α_z is covered by v. Let us add a new child v′ of v and let B_{v′} = ⋃_{z ∈ B_v \ U} Var(α_z). Note that B_v \ U ⊆ B_{v′} ⊆ B_v, so now B_v is the union of the bags of its children. The new node is a leaf and it is clearly safe. It should also be clear that the addition of v′ does not break T in any way: it is still a tree decomposition of Q and still satisfies all previously ensured properties. Proceeding this way we fix all nodes of T that need fixing. The final step is to ensure that all internal nodes are either project nodes or join nodes.
We already ensured that for each internal node v that is not a project node, B_v is the union of the bags of the children of v. The only remaining issue is that v may have only one child, or more than two children. If v has only one child u, then B_v = B_u, so we can remove u and promote all its children to become children of v, without affecting the guarantees we already achieved. If v has more than two children, we can fix it by introducing between v and its children a complete binary tree of join nodes, of logarithmic depth and linear size. It is routine to verify that all four steps preserve the property of being X-rooted and the property of being X-connex. □

Lemma 4.2. Let T be a nice tree decomposition of a CQ Q. (1) For each node v it holds that B_v ⊆ FVar(Q_v). (2) If T is X-connex and S is the maximal X-connex set in T, then each node in S either has all its children in S or is an S-leaf, that is, has no children in S.

Proof. The first item follows immediately from the safety of node v. For the second item, take a node v ∈ S that is not an S-leaf. Node v cannot be a leaf in T, so it is either a join node or a project node. If v is a join node, then the bags associated to its children are subsets of B_v, so the children of v belong to S by maximality. If v is a project node, then it has exactly one child, and because v is not an S-leaf, this child must belong to S. □

Proposition 4.3. Counting answers grouped by X for conjunctive queries of X-rooted tree-width w and star-size d over databases of size n can be done in time Õ(n^{w·d}).

Proof. Let Q be a conjunctive query of tree-width w and star-size d, and let X ⊆ FVar(Q) be a set of grouping variables. Let T be an X-rooted tree decomposition of Q, of width w. Because ss(Q) = d, we can equivalently write Q as the conjunction of CQs Q_1, Q_2, …, Q_ℓ, where |FVar(Q_i)| ≤ d or QVar(Q_i) = ∅ for each i. Without loss of generality we can assume that the sets QVar(Q_i) are pairwise disjoint and that if QVar(Q_i) = ∅ then Q_i consists of a single atom.
Let us rewrite Q as Q′ = V_1(z̄_1) ∧ V_2(z̄_2) ∧ ⋯ ∧ V_ℓ(z̄_ℓ) using views V_i(z̄_i) ← Q_i(z̄_i) where, for each i, the tuple z̄_i lists each variable from FVar(Q_i) exactly once. We can transform T into an X-rooted tree decomposition T′ of Q′ by replacing each occurrence of z ∈ QVar(Q_i) with the whole set FVar(Q_i), for each i. Because the sets QVar(Q_i) are pairwise disjoint, T′ has width w·d. Using Lemma 4.1 we can assume that T′ is nice. For each node v in T′, let B′_v be the associated bag and let Q′_v be the subquery of Q′ obtained by dropping all atoms that use a variable not occurring in the subtree of T′ rooted at node v. We have B′_v ⊆ FVar(Q′_v) = Var(Q′_v) for all v. We process T′ bottom-up and, for every node v, we compute the set C_v of all records of the form (η, c) where η : B′_v → Adom(D) and c = |Q′_v(D, η)|. That is, C_v is the solution to the problem of counting answers to Q′_v grouped by B′_v over D. We also interpret a record (η, c) as the binding extending η to a fresh variable witnessCount with value c. Note that if v is the root of T′, then Q′_v = Q′, and the whole algorithm can return C_v. It remains to show how to compute C_v, assuming that C_u is already known for all children u of v. If v is a leaf of T′, then C_v = {(η, 1) : η ∈ Q′_v(D)} ∪ {(η, 0) : η : B′_v → Adom(D), η ∉ Q′_v(D)}. It suffices to show that Q′_v(D) can be computed within the required time bounds. The query Q′_v is the conjunction of all atoms of the form V_i(z̄_i) such that each variable in z̄_i belongs to B′_v. That is, evaluating Q′_v involves materializing the corresponding views and computing the join. If Q_i is a single atom, V_i is already materialized. In the remaining case, Q_i has arity at most d and, because it is a subquery of Q, it has tree-width at most w. Every tree decomposition of Q_i can be turned into a free-rooted tree decomposition by setting aside an arbitrary free variable z, adding all the remaining free variables to each bag, and taking any bag originally containing z as the root; if the root holds some quantified variables as well, we add a bag with just the free variables as its parent.
Hence, Q_i has free-rooted tree-width at most w + d − 1 ≤ w·d; the latter inequality holds for all positive integers w and d. Consequently, the corresponding view V_i can be materialized in time Õ(|D|^{w·d}). Because |B′_v| ≤ w·d, we can compute the join in time Õ(|D|^{w·d}). If v is a project node with child u, then we let C_v = γ_{B′_v; sum(witnessCount)}(C_u); that is, we group records (η, c) in C_u by η|_{B′_v} and for each group return η|_{B′_v} and the sum of all c's in this group. This is easy to do in time Õ(|D|^{w·d}). If v is a join node with children u_1 and u_2, then C_v can be obtained as the set of records (η_1 ⊲⊳ η_2, c), for (η_1, c_1) ∈ C_{u_1} and (η_2, c_2) ∈ C_{u_2} such that η_1 is compatible with η_2, where c = c_1 · c_2 if η_1 ⊲⊳ η_2 satisfies all atoms of Q′_v that are not present in Q′_{u_1} ∧ Q′_{u_2}, and c = 0 otherwise. To compute C_v we materialize the relevant views, as explained above, and then apply a variant of the merge-join algorithm. All this can be done in Õ(|D|^{w·d}). □

Proposition 4.4. Counting answers grouped by X for conjunctive queries of X-rooted free-connex tree-width w over databases of size n can be done in time Õ(n^w).

Proof. Let Q be a conjunctive query of X-rooted free-connex tree-width w. Let T be a nice X-rooted free-connex decomposition of Q of width w and let S be the maximal free-connex set in T. Let v_1, v_2, …, v_k be the S-leaves of T. By Lemma 4.2, each node in S either is an S-leaf or has all its children in S. Consequently, each node of T belongs either to S or to the subtree of T rooted at v_i, for some i. With each v_i we associate the subquery Q_i in the usual way; that is, we take the subquery of Q obtained by dropping atoms using variables that are not present in the subtree of T rooted at v_i. By the definition of S it follows that FVar(Q_i) ⊆ B_{v_i} ⊆ FVar(Q). We let Q̌ be the full CQ obtained by dropping from Q all atoms occurring in Q_1 ∧ Q_2 ∧ ⋯ ∧ Q_k. We can write Q̌ as the conjunction of single atoms Q_{k+1}, Q_{k+2}, …, Q_ℓ. Then, Q is equivalent to Q_1 ∧ Q_2 ∧ ⋯ ∧ Q_ℓ and for each i either QVar(Q_i) = ∅ and Q_i is a single atom, or |FVar(Q_i)| ≤ |B_{v_i}| ≤ w.
Thus, we are in a situation entirely analogous to the one in Proposition 4.3, except that the role of the star-size d is now played by the X-rooted free-connex tree-width w. Continuing as in Proposition 4.3 we would obtain an algorithm evaluating Q on a database D with running time Õ(|D|^{w²}). But it is easy to improve this complexity. Let Q′ = V_1(z̄_1) ∧ V_2(z̄_2) ∧ ⋯ ∧ V_ℓ(z̄_ℓ) be the full CQ using views V_i(z̄_i) ← Q_i(z̄_i), as in Proposition 4.3. Then, T can be turned into an X-rooted tree decomposition T′ of Q′ simply by restricting it to S: for i ≤ k the atom V_i(z̄_i) is covered by node v_i, and for i > k any node covering the underlying atom of Q_i will do. The width of T′ is at most w, rather than w². Moreover, as FVar(Q_i) ⊆ B_{v_i} for i ≤ k, the subtree of T rooted at v_i can be turned into a free-rooted tree decomposition of Q_i simply by dropping variables not used in Q_i and adding B_{v_i} ∩ FVar(Q_i) on top as the new root. Hence, the free-rooted tree-width of Q_i is at most w, rather than w + d − 1. This means that the algorithm described in Proposition 4.3 will run in time Õ(|D|^w). □

Lemma 4.5. For each conjunctive query Q, ss(Q) ≤ fc-tw(Q) ≤ tw(Q) · max(1, ss(Q)). The same holds for the X-rooted variant and the X-connex variant, for every X ⊆ FVar(Q). Moreover, there exist CQs Q with arbitrarily large fc-tw(Q) that satisfy fc-tw(Q) ≤ √(tw(Q) · ss(Q)) + 1.

Proof. For the first inequality, consider a free-connex decomposition T of Q of width w. Let T_1, T_2, …, T_k be the maximal subtrees of T that do not contain a node holding free variables only. Then, for each i, all free variables of Q occurring in T_i must be present in the root of T_i. Consequently, T_i contains at most w free variables. It follows that the star-size of Q is at most w. The reasoning is of course not affected when additional conditions on T are imposed, like being X-rooted or X-connex for some X ⊆ FVar(Q). Consider now a tree decomposition T of Q of width w. As we do not assume that T is free-connex, free variables can be located in disjoint subtrees of T.
Let us write Q as Q_1 ∧ Q_2 ∧ ⋯ ∧ Q_ℓ such that, for each i, Q_i is connected and either |FVar(Q_i)| ≤ ss(Q) or QVar(Q_i) = ∅. Let T_i be the tree decomposition of Q_i induced by T; that is, the decomposition obtained from T by removing all variables that do not occur in Q_i. Clearly, T_i has width at most w, just like T. We now turn it into a free-rooted decomposition of Q_i. First, we add to each bag of T_i the whole set FVar(Q_i) except one arbitrarily chosen variable. After this operation, some bag contains the whole set FVar(Q_i). We reorganize the tree so that this bag becomes the root, and add FVar(Q_i) on top as the new root. The resulting tree decomposition T′_i has width at most w + ss(Q) − 1. Let T′ be obtained from T by including into each bag containing any variable from QVar(Q_i) all variables from FVar(Q_i), and then dropping all existentially quantified variables. Because the sets QVar(Q_i) are pairwise disjoint, the sizes of bags in T′ are bounded by w · ss(Q). We claim that for each variable z ∈ FVar(Q) the set of nodes holding z forms a connected subset of T′. Let Q_z be the conjunction of all queries Q_i that use z. Because each Q_i is connected, so is Q_z. It follows that the set of nodes in T that hold a variable from Q_z is connected, too. This is precisely the set of nodes of T′ that hold z. We now combine T′ with T′_1, T′_2, …, T′_ℓ. For each i there exists a node v_i in T′ that holds all variables in FVar(Q_i): any node originally holding a variable from QVar(Q_i) will do. The combined decomposition T′′ is obtained by letting the root of T′_i become a child of v_i. It is routine to check that T′′ is a free-connex decomposition of Q of width w · ss(Q). It is also straightforward to check that for each X ⊆ FVar(Q), if T is X-rooted or X-connex, so is T′′. Finally, for the queries witnessing the last claim we have ss(Q) = n and fc-tw(Q) = tw(Q) = n + 1, so indeed fc-tw(Q) ≤ √(tw(Q) · ss(Q)) + 1. □

Theorem 4.7. Boolean evaluation of an at-most or at-least query of free-connex FTVar(Q)-connex tree-width w over databases of size n can be done in time Õ(n^w).
The combined complexity of the algorithm is polynomial, assuming w is constant.

Proof. The following proof is a simpler variant of the proof of Theorem 4.10. The key difference in the data structure is that records contain only the witness count (no multiplicity) and, for each binding, we only store the maximal witness count for at-least queries and the minimal witness count for at-most queries. The answer then is YES iff the root contains a record with a witness count satisfying the bounds of the query. Below we spell out the full details for the benefit of the reader wishing to see the proof of the present result without needing to go through the more complicated construction necessary for Theorem 4.10. Consider a database D and a threshold query Q(x̄) = P(x̄) ∧ ∃_{[ℓ,u]} ȳ. R(x̄, ȳ). We assume that Q is either an at-least query or an at-most query; that is, either u = ∞ or ℓ = 0. Let T be a nice free-connex FTVar(Q)-connex tree decomposition of Q, of width w, and let S be the maximal free-connex set in T. Being a tree decomposition of Q̂(x̄, ȳ) = P(x̄) ∧ R(x̄, ȳ), T is also a tree decomposition of both P and R (up to dropping unused variables). Let P_v and R_v be the corresponding subqueries associated to node v. It follows that Q̂_v = P_v ∧ R_v. By Lemma 4.2, we have B_v ⊆ FVar(Q̂_v) = FVar(P_v) ∪ FVar(R_v). If v ∈ S then additionally B_v ⊆ FVar(Q) and FVar(P_v) ⊆ FVar(P) = FVar(Q), and consequently B_v ⊆ FVar(P_v) ∪ (FVar(R_v) ∩ FVar(Q)). With each node v ∈ S we associate the set N_v of records that will provide information necessary for Boolean evaluation of Q. A record for v ∈ S is a pair (η, c) consisting of a binding η : B_v → Adom(D) and a witness count c whose definition depends on whether Q is an at-least or an at-most query. If Q is an at-least query, then c is the maximum of |R_v(D, η′)| with η′ ranging over extensions of η to FVar(P_v) ∪ (FVar(R_v) ∩ FVar(Q)) such that η′|_{FVar(P_v)} ∈ P_v(D). If Q is an at-most query, then we replace maximum with minimum. In either case, if for some η there is no such η′, then there is no record for η either.
We can interpret a record (η, c) as a binding extending η to a fresh variable witnessCount with value c. By Lemma 4.2, each node in S either is an S-leaf or has all its children in S. Consequently, we can compute N_v for v ∈ S bottom-up. Let v be an S-leaf. Then, by the above, we have B_v = FVar(P_v) ∪ (FVar(R_v) ∩ FVar(Q)). Let M_v be the solution to the problem of counting answers to R_v grouped by FVar(R_v) ∩ FVar(Q) over D. By restricting all bags to Var(R_v) and adding FVar(R_v) ∩ FVar(Q) on top as the new root, we turn the subtree of T rooted at v into a (FVar(R_v) ∩ FVar(Q))-rooted decomposition T′ of R_v of width at most w. We claim that T′ is also FVar(R_v)-connex. Indeed, recall that T is FTVar(Q)-connex and let S̃ be the maximal FTVar(Q)-connex set in T. Let S̃′ comprise the restriction of S̃ to the subtree rooted at v, as well as the newly added root of T′. Clearly, S̃′ is connected. Because FVar(R_v) = FTVar(Q) ∩ Var(R_v), it follows that the union of the bags stored in nodes from S̃′ is exactly FVar(R_v). This proves the claim. Hence, we can use Proposition 4.4 to compute M_v in time Õ(|D|^w). The subtree of T rooted at v similarly induces an FVar(P_v)-rooted decomposition of P_v of width at most w, so we can compute the set P_v(D) of all answers to P_v over D in time Õ(|D|^w), as explained in Section 4.1. Then, we get N_v as the join N_v = P_v(D) ⊲⊳ M_v, defined as the set of records (η_1 ⊲⊳ η_2, c) such that η_1 ∈ P_v(D), (η_2, c) ∈ M_v, and η_1 and η_2 are compatible. This step can also be performed in Õ(|D|^w). Suppose now that v ∈ S is not an S-leaf. Then, all children of v belong to S. Let us assume that their sets of records are already computed. If v is a project node with child u, then N_v is obtained from N_u by grouping records (η, c) in N_u by η|_{B_v} and returning, for each group, the binding η|_{B_v} and either the maximal witness count in the group if Q is an at-least query, or the minimal witness count in the group if Q is an at-most query. Because |N_u| ≤ |D|^w, this step can also be done in time Õ(|D|^w). Suppose that v is a join node with children u_1 and u_2.
First, we compute the set N_{u_1,u_2} of records (η_1 ⊲⊳ η_2, c_1 · c_2) such that (η_1, c_1) ∈ N_{u_1}, (η_2, c_2) ∈ N_{u_2}, and η_1 is compatible with η_2. Next, we compute the semi-join N′ = N_{u_1,u_2} ⋉ P̌_v(D) of N_{u_1,u_2} with P̌_v(D), where P̌_v denotes the conjunction of the atoms of P_v not present in P_{u_1} ∧ P_{u_2}; that is, the set of records (η, c) ∈ N_{u_1,u_2} such that η|_{FVar(P̌_v)} ∈ P̌_v(D). Finally, with Ř_v denoting the conjunction of the atoms of R_v not present in R_{u_1} ∧ R_{u_2}, we compute N_v = (N′ ⋉ Ř_v(D)) ∪ (N′ ⊲ Ř_v(D)) as the union of the semi-join N′ ⋉ Ř_v(D) and the anti-join N′ ⊲ Ř_v(D) of N′ with Ř_v(D), defined as the set of records (η, 0) for each (η, c) ∈ N′ such that η|_{FVar(Ř_v)} ∉ Ř_v(D). Each of these operations can be performed in time Õ(|D|^w). Having computed N_r for the root r of T, we return YES iff N_r contains a record (η, c) with ℓ ≤ c ≤ u. □

Theorem 4.8. Computing answers grouped by X up to a threshold k for conjunctive queries of X-rooted tree-width w over databases of size n can be done in time Õ(k · n^w).

Proof. It remains to prove that for each node v, the set C_v computed by Algorithm 1 is complete for v and k. We proceed by bottom-up induction on the nodes of the decomposition. The base case is immediate: if v is a leaf, C_v = Q_v(D), so it is complete. Suppose that v is a project node with child u. Consider η : B_v → Adom(D). We need to check that C_v contains sufficiently many bindings compatible with η. If for every binding η′ extending η to B_u we have Q_u(D, η′) ⊆ C_u, then Q_v(D, η) ⊆ C_v. Suppose that for some η′ extending η we have Q_u(D, η′) ⊈ C_u. Because C_u is complete for u and k by the induction hypothesis, it must hold that |Q_u(D, η′) ∩ C_u| = k. Because all bindings in Q_u(D, η′) ∩ C_u extend η and their projections on FVar(Q_v) are pairwise different (only variables from B_u \ B_v are projected out, and their values are fixed by η′ anyway), it follows that |Q_v(D, η) ∩ C_v| = k. Hence, C_v is complete for v and k. Suppose now that v is a join node. By the induction hypothesis, for each child u of v, C_u is complete for u and k. The interesting case is when Q_u(D, η) ⊈ C_u for some child u and some η : B_u → Adom(D). Will we not miss the discarded bindings now? As C_u is complete for u and k, we have |Q_u(D, η) ∩ C_u| = k. Consider a binding η′ extending η to B_v. There are only two possibilities.
If each of the remaining two joined sets contains at least one binding extending η′, the result of the join will contain at least k such bindings. If some of the remaining joined sets contains no binding extending η′, the result of the join will contain no such bindings either, and neither would it if we replaced each C_u with Q_u(D). Thus, C_v is complete for v and k. □

Theorem 4.10. For TQs of free-connex tree-width w, over databases of size n, one can: count answers in time Õ(b · n^w); enumerate answers with constant delay after Õ(b · n^w) preprocessing; and sample answers uniformly at random in constant time after Õ(b · n^w) preprocessing, where b is the value of the threshold. Assuming w is constant, the combined complexity of each of these algorithms is pseudopolynomial.

Proof. Consider a database D and a threshold query Q(x̄) = P(x̄) ∧ ∃_{[ℓ,u]} ȳ. R(x̄, ȳ). We write b for the threshold up to which we will be counting witnesses: if u < ∞, we let b = u + 1; if u = ∞, we let b = ℓ. Let T be a nice free-connex tree decomposition of Q, of width w, and let S be the maximal free-connex set in T. Being a tree decomposition of Q̂(x̄, ȳ) = P(x̄) ∧ R(x̄, ȳ), up to dropping unused variables, T is a tree decomposition of both P(x̄) and R(x̄, ȳ). Let P_v and R_v be the corresponding subqueries associated to node v. It follows that Q̂_v = P_v ∧ R_v. By Lemma 4.2, we have B_v ⊆ FVar(Q̂_v) = FVar(P_v) ∪ FVar(R_v). If v ∈ S then additionally B_v ⊆ FVar(Q) and FVar(P_v) ⊆ FVar(P) = FVar(Q), and consequently B_v ⊆ FVar(P_v) ∪ (FVar(R_v) ∩ FVar(Q)). With each node v ∈ S we associate the set N_v of records that will provide information necessary to enumerate, count, and sample answers to Q. A record for v ∈ S is a triple (η, µ, c) consisting of a binding η : B_v → Adom(D), a multiplicity µ > 0, and a witness count c ∈ {0, 1, …, b} such that there are exactly µ extensions η′ of η to FVar(P_v) ∪ (FVar(R_v) ∩ FVar(Q)) such that η′|_{FVar(P_v)} ∈ P_v(D) and c = min(b, |R_v(D, η′)|). We can interpret a record (η, µ, c) as a binding extending η to two fresh variables, multiplicity and witnessCount, with values µ and c. By Lemma 4.2, each node in S either is an S-leaf or has all its children in S.
Consequently, we can compute N_v for v ∈ S bottom-up. Let v be an S-leaf. As in the proof of Theorem 4.7, we compute the set P_v(D) of answers to P_v and the solution M_v to the problem of counting answers to R_v grouped by FVar(R_v) ∩ FVar(Q) over D, with counts capped at b, and let N_v = P_v(D) ⊲⊳ M_v; that is, the set of records (η_1 ⊲⊳ η_2, 1, c) such that η_1 ∈ P_v(D), (η_2, c) ∈ M_v, and η_1 is compatible with η_2. Note that all multiplicities are 1 because B_v = FVar(P_v) ∪ (FVar(R_v) ∩ FVar(Q)), so only the trivial extension η′ = η is available. This step can be performed in Õ(b · |D|^w), as well. Suppose now that v ∈ S is not an S-leaf. Then, all children of v belong to S. Let us assume that their sets of records are already computed. Suppose that v is a project node with child u. Then N_v is obtained from N_u by grouping records in N_u by the projection to B_v of the binding and by the witness count, and returning for each group the restricted binding, the sum of the multiplicities over the whole group, and the witness count. Because |N_u| ≤ (b + 1) · |D|^w, this step can also be done in time Õ(b · |D|^w). Suppose that v is a join node with children u_1 and u_2. First, we compute the b-join N_{u_1,u_2} = N_{u_1} ⊲⊳_b N_{u_2} of N_{u_1} and N_{u_2}, defined as the multiset of records (η_1 ⊲⊳ η_2, µ_1 · µ_2, min(b, c_1 · c_2)) such that (η_1, µ_1, c_1) ∈ N_{u_1}, (η_2, µ_2, c_2) ∈ N_{u_2}, and η_1 is compatible with η_2. Next, we compute the semi-join N′ = N_{u_1,u_2} ⋉ P̌_v(D) of N_{u_1,u_2} with P̌_v(D) (where P̌_v consists of the atoms of P_v not present in P_{u_1} ∧ P_{u_2}), defined as the multiset containing all occurrences of records (η, µ, c) ∈ N_{u_1,u_2} such that η|_{FVar(P̌_v)} ∈ P̌_v(D). Then, we compute the multiset N″ = (N′ ⋉ Ř_v(D)) ⊎ (N′ ⊲ Ř_v(D)) obtained as the multiset union of the semi-join N′ ⋉ Ř_v(D) and the anti-join N′ ⊲ Ř_v(D) of N′ with Ř_v(D) (where Ř_v consists of the atoms of R_v not present in R_{u_1} ∧ R_{u_2}), the anti-join being the multiset containing one occurrence of (η, µ, 0) for each occurrence of a record (η, µ, c) ∈ N′ such that η|_{FVar(Ř_v)} ∉ Ř_v(D). Finally, to compute N_v, we group the records (η, µ, c) in N″ by the binding η and by the witness count c, and for each group we return the binding η, the sum of the multiplicities over the whole group, and the witness count c. Each of these operations can be performed in time Õ(b · |D|^w). In the root we apply an additional filter to the computed records: we only keep records (η, µ, c) such that ℓ ≤ c ≤ u.
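The capping of witness counts at the threshold, which keeps the records small, can be illustrated on a toy at-least query. The following is a minimal sketch, not the paper's general decomposition-based algorithm: it counts the elements of a unary relation P that have at least ℓ partners in a binary relation R, never letting a per-group counter exceed the cap b = ℓ (the names `at_least_count`, `P`, and `R` are illustrative).

```python
from collections import defaultdict

def at_least_count(P, R, ell):
    """Count x in P with at least `ell` R-partners.

    Witness counts are capped at b = ell, so each counter stays in
    {0, ..., b} no matter how many partners x actually has.
    """
    b = ell
    witness = defaultdict(int)
    for x, _y in R:                       # grouped, capped count for R
        witness[x] = min(b, witness[x] + 1)
    # records (binding, multiplicity, witness count); multiplicities are 1
    records = [(x, 1, witness.get(x, 0)) for x in P]
    return sum(m for (_x, m, c) in records if c >= ell)

P = {'a', 'b', 'c'}
R = [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('d', 1), ('d', 2)]
print(at_least_count(P, R, 2))  # only 'a' has >= 2 witnesses, so 1
```

Note that 'd' has two R-partners but is filtered out by the join with P, mirroring the semi-join with the P-part in the construction above.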
Apart from the records themselves, we will need an index structure allowing quick access to the relevant records in the children of each node. In a project node v with child u, for each stored record (η, µ, c) we need to be able to iterate over all records (η′, µ′, c′) from N_u such that η′|_{B_v} = η and c′ = c. In a join node v, similarly, for each stored record (η, µ, c) we need to be able to iterate over all pairs of records (η_1, µ_1, c_1) and (η_2, µ_2, c_2) from the respective children of v that contribute to the multiplicity µ. An appropriate index structure can be computed alongside the sets N_v. One way to do this is as follows. For the child u of a project node v we store the records (η′, µ′, c′) of N_u grouped by η′|_{B_v} and by c′; each record (η, µ, c) in N_v is then equipped with a link to the group corresponding to η and c. For children u_1 and u_2 of a join node v we store the records of N_{u_1} and N_{u_2} grouped by the bindings; each record (η, µ, c) in N_v is then equipped with a link to the group of N_{u_i} that corresponds to η|_{B_{u_i}}, for i = 1, 2. The pairs of records relevant to (η, µ, c) can then be read off the linked groups of N_{u_1} and N_{u_2}: not all combinations of records from these groups are relevant, but skipping irrelevant ones involves only constant delay because both groups contain at most b + 1 records. Counting answers to Q based on the computed data structure is straightforward: the number of answers is the sum of all multiplicities of records stored in the root. Enumeration can be done by iterating over stored records in a hierarchical fashion. That is, in each node the algorithm iterates over stored records and for each such record it iterates over the relevant combinations of records in the children. When there is no next record to move to, the algorithm advances the iteration in the parent and updates the state of the iteration process in the affected descendants of the parent. The cost of each step is proportional to the size of the decomposition; that is, linear in the size of the query.
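The uniform sampling described next relies on the alias method [65, 67] to turn the linear-time-precomputed multiplicities into constant-time weighted random choices. A minimal self-contained sketch of the standard (Vose-style) construction follows; it is independent of the record structure above, and the function names are illustrative.

```python
import random

def build_alias(weights):
    """Preprocess a discrete distribution into alias tables
    (Walker/Vose): O(n) preprocessing, O(1) per sample afterwards."""
    n = len(weights)
    total = float(sum(weights))
    prob = [w * n / total for w in weights]   # scaled so the mean is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                          # overflow of slot l fills slot s
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def sample(prob, alias):
    """Draw one index in O(1): pick a slot, then flip a biased coin."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias([1, 2, 7])  # weights need not be normalized
draw = sample(prob, alias)            # 0, 1, or 2 with prob. 0.1, 0.2, 0.7
```

Each draw uses one uniform slot choice and one biased coin flip, which is what makes the per-answer cost of sampling constant once the tables are built.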
Sampling the space of answers is similar to enumeration, but instead of iterating through the records, we choose them at random. As in the case of enumeration, we always ensure that the records chosen in the children are compatible with the one chosen in their parent. Uniformity is guaranteed by using the probability distribution induced by the counts for records and combinations of records. In the root, we choose record (η, µ, c) with probability µ/N, where N is the number of all answers to Q. Suppose we have chosen a record (η, µ, c) in a project node v. Let (η_1, µ_1, c_1), …, (η_m, µ_m, c_m) be the relevant records in the child of v. Then, we should choose record (η_i, µ_i, c_i) with probability µ_i/µ. Similarly, suppose we have chosen (η, µ, c) in a join node v. Then, each relevant pair of records ((η_1, µ_1, c_1), (η_2, µ_2, c_2)) from the children of v should be chosen with probability (µ_1 · µ_2)/µ. Using the alias method [65, 67] we can precompute, in time linear in the number of stored records, an auxiliary data structure that allows performing the above random choices in constant time. □

LIMIT Values. In addition to the analysis of LIMIT values in the body of the paper, we also investigated the distribution of values below 100. The results can be found in Figure 6. Indeed, when focusing on these threshold values < 100, we notice that in both the CRPQ and CRPQ s logs most occurrences correspond to values < 10, followed by those < 20 and by those in the range 91-100, with few occurrences in the interval 41-50 and some outliers.

SPARQL Terms. In order to measure the size of the threshold queries in the logs (CRPQs and CRPQ s, as shown in Table 1), we counted the total number of triple patterns contained in the SELECT, ASK, and CONSTRUCT queries. Triple patterns are similar to RDF triples, except that each of the subject, predicate, and object can be a variable.
The results show that most threshold queries use two triple patterns (54.59%), and that queries with one or three triple patterns are also quite frequent (15.41% and 29.80%, respectively). Overall, we observe that 99.8% of the Wikidata queries in the CRPQ corpus use at most three triple patterns. This trend is similar to the query sizes of the entire corpus, as studied in previous work [14]. We observed that organic queries are more diverse in size than robotic queries. We also analyzed the individual components of triple patterns, such as blank nodes, SPARQL variables, IRIs, and literals. Table 4 shows the results for the entire corpus, including OK and TO queries. Blank nodes are existential variables; that is, they state the existence of an entity without identifying it, while SPARQL variables can be seen as universal variables. To emphasize the relationship between variables and blank nodes, we compute the ratio of blank nodes to variables. Similarly, we record the ratio of IRIs to literals (both of which can be regarded as constants). The results, including these ratios, show that the most common terms are SPARQL variables and IRIs. This is consistent with the fact that IRI references are the basis for Linked Data and that variables are at the core of SPARQL. We note that blank nodes are the least frequent and, when used, occur only once per query in 99.96% of cases. Conversely, IRIs, which provide a mechanism for globally identifying resources, are much more common. We also observed that timeout queries usually contained more variables and blank nodes; this relationship mostly does not hold for IRIs and literals. Another potential purpose for threshold queries that we identified is data monitoring, exemplified by the following Boolean query from the Wikidata logs. TQ5: Are there at least 67 language versions of Wikipedia pages for the movie The Matrix?
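TQ5 could be phrased in SPARQL over the Wikidata endpoint along the following lines. This is a sketch, not the query found in the log; it assumes that wd:Q83495 is the Wikidata item for The Matrix and that Wikipedia sitelinks are reachable via schema:about, as in the Wikidata RDF model.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX schema: <http://schema.org/>

ASK {
  SELECT (COUNT(DISTINCT ?article) AS ?n) WHERE {
    ?article schema:about wd:Q83495 .
  }
  HAVING (?n >= 67)
}
```

The ASK query returns true iff the inner aggregate subquery produces a row, i.e., iff the count reaches the threshold; no ordering of results is involved.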
Wim Martens, and Andreas Pieris. 2021. Principles of Databases.
When is Approximate Counting for Conjunctive Queries Tractable?
Complexity of finding embeddings in a k-tree.
Enhancing the Nobel Prize schema.
On Acyclic Conjunctive Queries and Constant Delay Enumeration.
Aggregation and Ordering in Factorised Databases.
Emergence of scaling in random networks.
A Structured Review of Data Management Technology for Interactive Visualization and Analysis.
Constant Delay Enumeration with FPT-Preprocessing for Conjunctive Queries of Bounded Submodular Width.
Using Semi-Joins to Solve Relational Queries.
A Linear-Time Algorithm for Finding Tree-Decompositions of Small Treewidth.
Sławek Staworko, and Dominik Tomaszuk. 2021. Threshold queries in Python.
Hannes Voigt, and Nikolay Yakovets.
Navigating the maze of Wikidata query logs.
An analytical study of large SPARQL query logs.
SHARQL: Shape Analysis of Recursive SPARQL Queries.
"Enough Already!" in SQL.
On Random Sampling over Joins.
Including Group-By in Query Optimization.
Optimizing Queries with Aggregate Views.
On Applying Hash Filters to Improving the Execution of Multi-Join Queries.
Random Sampling and Size Estimation Over Cyclic Joins.
The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong.
COVID-19 Knowledge Graph.
The complexity of counting homomorphisms seen from the other side.
Compressed Representations of Conjunctive Query Results.
The First Parameterized Algorithms and Computational Experiments Challenge.
The PACE 2017 Parameterized Algorithms and Computational Experiments Challenge: The Second Iteration.
DetkDecomp. 2021. detkdecomp.
Structural Tractability of Counting of Solutions to Conjunctive Queries.
Optimal aggregation algorithms for middleware.
Adding Counting Quantifiers to Graph Patterns.
Robust and efficient algorithms for rank join evaluation.
HyperBench: A Benchmark and Tool for Hypergraphs and Empirical Findings.
The Parameterized Complexity of Counting Problems.
The Complexity of Homomorphism and Constraint Satisfaction Problems Seen from the Other Side.
When is the evaluation of conjunctive queries tractable?
Generalized Projections: A Powerful Query-Optimization Technique.
Overview of Data Exploration Techniques.
The Dynamic Yannakakis Algorithm: Compact and Efficient Query Processing Under Updates.
Supporting top-k join queries in relational databases.
A survey of top-k query processing techniques in relational database systems.
Data Cleaning.
Sideways Information Passing for Push-Style Query Processing.
Ajar: Aggregations and joins over annotated relations.
Covers of Query Results.
FAQ: Questions Asked Frequently.
Optimally Leveraging Density and Locality for Exploratory Browsing and Sampling.
The Minimum Satisfiability Problem.
Practical Linked Data Access via SPARQL: The Case of Wikidata.
Schema Inference for Property Graphs.
Query optimization through the looking glass, and what we found running the join order benchmark.
RankSQL: Query Algebra and Optimization for Relational Top-k Queries.
R* Optimizer Validation and Performance Evaluation for Distributed Queries.
Getting the Most Out of Wikidata: Semantic Technology Usage in Wikipedia's Knowledge Graph.
Efficient top-k aggregation of ranked inputs.
Supporting incremental join queries on ranked inputs.
Size Bounds for Factorised Representations of Query Results.
Pushing Data-Induced Predicates Through Joins in Big-Data Clusters.
Tractable counting of the answers to conjunctive queries.
Mirek Riedewald, and Xiaofeng Yang. 2020. Optimal Algorithms for Ranked Enumeration of Answers to Full Conjunctive Queries.
A Linear Algorithm For Generating Random Numbers With a Given Distribution.
Wikidata: A New Platform for Collaborative Data Collection.
An Efficient Method for Generating Discrete Random Variables with General Distributions.
Eager Aggregation and Lazy Aggregation.
Algorithms for Acyclic Database Schemes.
Random Sampling over Joins Revisited.